JUCBNMT at WMT2018 News Translation Task: Character Based Neural Machine Translation of Finnish to English

This paper describes the system we submitted to the WMT 2018 News Translation Shared Task for translating news text from Finnish to English. The system uses a Character Based Neural Machine Translation model. We document the preprocessing steps, the architecture of the submitted system and the results it produced. Our system achieved a BLEU score of 12.9.


Introduction
Machine Translation (MT) is the automated translation of one natural language into another using computer software. Translation is a difficult task, not only for computers but for humans as well, because it requires a thorough understanding of the syntax and semantics of both languages. For any MT system to return good translations, it needs a parallel corpus of good quality and sufficient size (Mahata et al., 2016, 2017). In the modern context, MT systems can be categorized into Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT has done much to make MT popular. It builds statistical models whose parameters are derived from the analysis of bilingual text corpora created by professional translators (Weaver, 1955). The state of the art for SMT is the Moses Toolkit (http://www.statmt.org/moses/), created by Koehn et al. (2007), which incorporates subcomponents such as Language Model generation, Word Alignment and Phrase Table generation. Much work has been done in SMT (Lopez, 2008; Koehn, 2009), and it has shown good results for many language pairs.
NMT, on the other hand, though relatively new, has shown considerable improvements in translation results when compared to SMT (Mahata et al., 2018). These include better fluency of the output and better handling of the Out-of-Vocabulary problem. Unlike SMT, it does not depend on alignment and phrasal unit translations (Kalchbrenner and Blunsom, 2013). Instead, it uses an Encoder-Decoder approach built on recurrent neural cells. As a result, when given a sufficient amount of training data, it gives much more accurate results than SMT (Doherty et al., 2010; Vaswani et al., 2013; Liu et al., 2014).
Further, NMT can be of two types, namely Word Level NMT and Character Level NMT. Word Level NMT, though very successful, suffers from a few disadvantages. It is unable to model rare words (Lee et al., 2016). Also, since it does not learn the morphological structure of a language, it struggles with morphologically rich languages (Ling et al., 2015). This issue can be addressed by training the models on a huge parallel corpus, but that in turn produces very complex and resource-intensive models that are often not feasible.
To combat this, we use Character Level NMT, so that the model can learn the morphological aspects of a language and construct a word character by character, and hence tackle the rare word problem to some extent.
In the current work, we participated in the WMT 2018 News Translation Shared Task, which focused on translating news text for European language pairs. The Character Based NMT system discussed in this paper was designed for Finnish to English translation. The organizers provided the required parallel corpora, consisting of 3,255,303 sentence pairs, for training the translation model. The statistics of the parallel corpus are given in Table 1.
Table 1: Statistics of the Finnish-English parallel corpus provided by the organizers. "#" denotes "number of". "Fi" and "En" denote Finnish and English, respectively. "char" means character and "vocab" means vocabulary of unique tokens.
The remainder of the paper is organized as follows. Section 2 describes the methodology used to create the character based NMT model, including the preprocessing steps, a brief summary of the encoder-decoder approach and the architecture of our system. This is followed by the results and the conclusion in Sections 3 and 4, respectively.

Methodology
For designing the model we followed some standard preprocessing steps, which are discussed below.

Preprocessing
The following steps were applied to preprocess and clean the data before using it to train our character based neural machine translation model. We used the NLTK toolkit to perform these steps.
• Tokenization: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. In our case, these tokens were words, punctuation marks and numbers. NLTK supports
• Truecasing: This refers to the process of restoring case information to badly-cased or non-cased text (Lita et al., 2003). Truecasing helps in reducing data sparsity.
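The two preprocessing steps above can be illustrated with a minimal sketch. The paper used NLTK; the version below is a stand-in that relies only on the Python standard library, and the names `tokenize`, `train_truecaser` and `truecase` are hypothetical. The truecaser shown here is a simple frequency-based heuristic, which is an assumption and not necessarily the method applied to the submitted system.

```python
import re
from collections import Counter

# A token is either a run of word characters (letters such as ä and ö included,
# digits too) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word, number and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def train_truecaser(tokenized_sentences):
    """Learn the most frequent casing of every word form.

    Sentence-initial tokens are skipped because their capitalization
    says nothing about the word itself.
    """
    counts = {}
    for tokens in tokenized_sentences:
        for token in tokens[1:]:
            counts.setdefault(token.lower(), Counter())[token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(tokens, model):
    """Replace each token by its learned casing; unseen tokens stay unchanged."""
    return [model.get(token.lower(), token) for token in tokens]


training = [tokenize("We visited Helsinki today."),
            tokenize("The weather in Helsinki was cold.")]
model = train_truecaser(training)
restored = truecase(tokenize("HELSINKI was cold today."), model)
```

Mapping every word to a single preferred casing is what reduces sparsity: "HELSINKI", "helsinki" and "Helsinki" collapse into one surface form before the character sequences are built.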

Neural Machine Translation
Neural machine translation (NMT) is an approach to machine translation that uses neural networks to predict the likelihood of a sequence of words.
The main functionality of NMT is based on the sequence to sequence (seq2seq) architecture, which is described in Section 2.2.1.

Sequence to Sequence Model
Sequence to Sequence learning is a neural network technique for mapping one sequence to another. It takes as input a sequence of tokens (characters in our case) $X = \{x_1, x_2, \ldots, x_n\}$ and tries to generate the target sequence $Y = \{y_1, y_2, \ldots, y_m\}$ as output, where $x_i$ and $y_i$ are the input and target symbols, respectively. The Sequence to Sequence architecture consists of two parts, an Encoder and a Decoder.
The encoder takes a variable-length sequence as input and encodes it into a fixed-length vector, which is meant to summarize its meaning while taking its context into account. A Long Short Term Memory (LSTM) cell was used to achieve this. The uni-directional encoder reads the characters of the Finnish text as a sequence from one end to the other (left to right in our case),
$$h_t = f_{enc}(E_x(x_t), h_{t-1})$$
Here, $E_x$ is the input embedding lookup table (dictionary) and $f_{enc}$ is the transfer function of the LSTM recurrent unit. The cell state $h$ and the context vector $C$ are constructed and passed on to the decoder.
The decoder takes as input the context vector $C$ and the cell state $h$ from the encoder, and computes the hidden state at time $t$ as,
$$s_t = f_{dec}(E_y(y_{t-1}), s_{t-1}, C)$$
where $E_y$ and $f_{dec}$ are the target-side counterparts of $E_x$ and $f_{enc}$. Subsequently, a parametric function $out_k$ returns the conditional probability of the next target symbol being $k$,
$$p(y_t = k \mid y_{<t}, X) = \frac{1}{Z} \exp\left(out_k(E_y(y_{t-1}), s_t, C)\right)$$
$Z$ is the normalizing constant,
$$Z = \sum_{j} \exp\left(out_j(E_y(y_{t-1}), s_t, C)\right)$$
The entire model can be trained end-to-end by minimizing the negative log-likelihood, which is defined as
$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p(y_t = y_t^n \mid y_{<t}^n, X^n)$$
where $N$ is the number of sentence pairs, $T_n$ is the length of the $n$-th target sequence, and $X^n$ and $y_t^n$ are the input sentence and the $t$-th target symbol in the $n$-th pair, respectively.
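As a small numerical illustration of the normalizing constant and the likelihood-based objective described above, the sketch below turns raw output scores into probabilities and accumulates the negative log-likelihood over target symbols. It is only a sketch under stated assumptions: the function names (`softmax`, `sequence_nll`, `corpus_loss`) are hypothetical, and the scores are made-up numbers standing in for the decoder outputs.

```python
import math


def softmax(scores):
    """Turn raw scores out_k into probabilities exp(out_k) / Z."""
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)  # the normalizing constant Z
    return [e / z for e in exps]


def sequence_nll(step_scores, target_ids):
    """Sum of -log p(target symbol) over the time steps of one target sequence."""
    total = 0.0
    for scores, target in zip(step_scores, target_ids):
        total -= math.log(softmax(scores)[target])
    return total


def corpus_loss(pairs):
    """Average the per-sequence negative log-likelihood over N sentence pairs."""
    return sum(sequence_nll(scores, targets) for scores, targets in pairs) / len(pairs)


# Two time steps over a four-symbol vocabulary, with uniform (all-zero) scores:
# every symbol gets probability 1/4, so each step contributes log(4).
uniform_steps = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = corpus_loss([(uniform_steps, [1, 3])])
```

With a softmax output, this quantity is the same as the categorical cross-entropy between the one-hot target and the predicted distribution, summed over time steps.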
The input to the decoder was a one-hot tensor (embeddings at character level) of the English sentences, while the target data was the same tensor offset by one time step ahead.
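The relation between the decoder input and the decoder target can be sketched in a few lines of plain Python. This is an illustration, not the submitted code: the helper names are hypothetical, and the start and end markers (`"\t"` and `"\n"`) are an assumption, since the paper does not state which boundary symbols were used.

```python
START, END = "\t", "\n"  # assumed sequence boundary markers, not taken from the paper


def build_vocab(texts, extra_symbols=()):
    """Map every character seen in the texts (plus extras) to an integer index."""
    chars = sorted(set("".join(texts)) | set(extra_symbols))
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(text, vocab):
    """Encode a string as a list of one-hot rows, one row per character."""
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


def decoder_pair(target_sentence, vocab):
    """Return (decoder_input, decoder_target), the target being one step ahead."""
    wrapped = START + target_sentence + END
    decoder_input = one_hot(wrapped[:-1], vocab)   # START, c, a, t
    decoder_target = one_hot(wrapped[1:], vocab)   # c, a, t, END
    return decoder_input, decoder_target


vocab = build_vocab(["cat"], extra_symbols=(START, END))
dec_in, dec_tgt = decoder_pair("cat", vocab)
```

At every time step the decoder sees the previous gold character and is trained to predict the next one, which is why the two tensors are identical apart from the one-step shift.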

Training
For training the model, we preprocessed the Finnish and English texts to normalize the data. Thereafter, Finnish and English characters were encoded as one-hot vectors. The Finnish characters were given as input to the encoder and the English characters were given as input to the decoder. A single LSTM layer was used to encode the Finnish characters. The output of the encoder was discarded and only the cell states were kept for passing on to the decoder. The cell states of the encoder and the English characters were given as input to the decoder. Lastly, a Dense layer was used to map the output of the decoder to the English characters, which were offset by one time step. The batch size was set to 128, the number of epochs to 100 and the learning rate to 0.001; the activation function was softmax, the optimizer was rmsprop and the loss function was categorical cross-entropy. The architecture of the constructed model is shown in Figure 1.
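To give a sense of the scale implied by these settings, the sketch below collects the reported hyperparameters in one place and derives the number of parameter updates. The `TrainingConfig` class and `updates_per_epoch` helper are hypothetical names, and the calculation assumes that all 3,255,303 sentence pairs were used for training with no held-out split, which the paper does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the paper."""
    batch_size: int = 128
    epochs: int = 100
    learning_rate: float = 0.001
    optimizer: str = "rmsprop"
    loss: str = "categorical_crossentropy"
    output_activation: str = "softmax"


def updates_per_epoch(num_pairs, batch_size):
    """Number of mini-batches needed to see every sentence pair once."""
    return math.ceil(num_pairs / batch_size)


config = TrainingConfig()
# Assumption: the whole provided corpus is used for training.
steps = updates_per_epoch(3_255_303, config.batch_size)
total_updates = steps * config.epochs
```

Under that assumption, one epoch takes 25,433 mini-batches (the last one holding only 7 pairs), for roughly 2.5 million updates over 100 epochs.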

Results
Our system was a constrained system, which means that we used only the data given by the organizers to train it. The output was converted to SGML format using code provided by the organizers. The results were submitted to http://matrix.statmt.org/ for evaluation. The organizers calculated the BLEU score, BLEU-cased score, TER score, BEER 2.0 score and Character TER score for our submission. In the human ranking, the system received a standardized Average Z score of -0.404 and a non-standardized Average % score of 58.9 (Bojar et al., 2018). The automated and human evaluation scores are given in Table 2.

Conclusion
This paper presents the translation system submitted to the WMT 2018 News Translation Shared Task. We used character based encoding for our proposed NMT system, with a single LSTM layer as the encoder and a single LSTM layer as the decoder. In future work, we plan to use more LSTM layers in our model. We also plan to create another NMT model that takes words rather than characters as input, and to use various embedding schemes to improve the translation quality.