Tilde’s Machine Translation Systems for WMT 2017

The paper describes Tilde’s English-Latvian and Latvian-English machine translation systems for the WMT 2017 shared task in news translation. Both constrained and unconstrained systems are described. Our constrained systems were ranked as the best performing systems according to the automatic evaluation results. The paper details how we pre-processed the training data, the NMT system architecture that we used for training the NMT models, and the SMT systems and their usage in NMT-SMT hybrid system configurations.


Introduction
The year 2016 marked the first time that neural machine translation (NMT) systems achieved significantly better results than statistical machine translation (SMT) systems for most of the translation directions in the news translation shared task of the WMT conference. This was achieved thanks to a number of architectural and data pre-processing novelties that the winning systems incorporated, for instance, the use of an attention mechanism in the decoder of the NMT system (Bahdanau et al., 2014), back-translation of additional in-domain monolingual data for domain adaptation of the NMT system after training of a broad domain model or during re-training of the whole NMT model, the use of sub-word units to address the problem of out-of-vocabulary word translation, and others.
Since then, a number of further advances have been made in machine translation and related fields. A lot of effort has been invested in the search for the best hyper-parameter configurations and neural network architectures for NMT system training (Britz et al., 2017). In particular, the use of long short-term memory (LSTM) cells and deep architectures has been shown to increase translation quality. Parallel to that, a number of novelties in neural network architectures have been introduced for other sequence processing tasks, some of which, like the multiplicative LSTM (MLSTM) units (Krause et al., 2016), promise advantages even over deep recurrent network architectures. For data pre-processing, we have shown that the language-agnostic word splitting method using byte pair encoding (BPE) splits words inconsistently for morphologically rich languages and that the method can be improved by linguistically motivated word splitting (Pinnis et al., 2017b).
For the WMT 2017 shared task in news translation, we build upon the NMT toolkit Nematus, which achieved the best results in the WMT 2016 shared task. We also incorporate in our systems the latest advancements in the field, for instance, MLSTM recurrent layers, morphology-driven word splitting, better handling of unknown and rare words with robust NMT models, and hybrid methods. The improvements over the baseline NMT model have allowed us to develop the best scoring systems for the English-Latvian and Latvian-English translation directions.
The paper is further structured as follows: Section 2 provides an overview of our WMT 2017 systems, Section 3 describes the data and the different data processing workflows used for preparing the data for training, Section 4 describes SMT systems that were used in NMT-SMT hybrid system configurations, Section 5 describes the NMT architecture used for training of our NMT systems, Section 6 describes the hybrid NMT-SMT system architecture, Section 7 describes our evaluation results, and Section 8 concludes the paper.

System Overview
For the WMT 2017 shared task, we developed both constrained and unconstrained MT systems. In total, we submitted five systems:
• Constrained English-Latvian and Latvian-English NMT-SMT hybrid systems.
• Unconstrained English-Latvian and Latvian-English NMT-SMT hybrid systems trained on significantly larger corpora.
• An unconstrained English-Latvian SMT system that achieves higher automatic evaluation results than the NMT-SMT hybrid systems.

Data
For training of MT systems, we used the WMT 2017 training data; for the unconstrained systems, we also used resources from the Tilde Data Library. All data were filtered using our data filtering methods (see Section 3.1), pre-processed with standard and custom pre-processing tools (see Section 3.2), and supplemented with synthetic data (see Section 3.3). For tuning and for decision-making during the development, we used the newsdev2017 data set that was provided by the WMT 2017 organisers.

Data Filtering
Our previous research in NMT system development has shown that NMT systems are more sensitive to the noise present in the training data than SMT systems (Pinnis et al., 2017a). Therefore, we performed parallel data filtering to reduce potential non-parallelities and the negative effect of noise on the NMT systems. The filtering consisted of the following steps:
1. Long sentence filtering (longer than 1500 symbols or 80 tokens).
2. Sentence length difference filter (sentence pairs with a length ratio smaller than 0.3 were filtered out).
5. Bad encoding filter that filtered out sentences containing foreign and corrupt symbols.
6. Digit mismatch filter, which proved to be an effective method for dealing with sentence segmentation issues in a number of corpora (including the Digital Corpus of the European Parliament).
Training data statistics for both constrained and unconstrained systems are shown in Table 1.
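As an illustration, three of the filters listed above can be sketched as follows. This is a minimal sketch and not the filtering code used for the submitted systems. The limits (1500 symbols, 80 tokens, ratio 0.3) are taken from the list, whereas the function names, whitespace tokenisation, the character-based definition of the length ratio, and the comparison of digit sequences as sorted lists are assumptions made for the example.

```python
import re

MAX_SYMBOLS = 1500       # from the text: longer than 1500 symbols
MAX_TOKENS = 80          # from the text: or 80 tokens
MIN_LENGTH_RATIO = 0.3   # from the text: ratio smaller than 0.3 is filtered out


def too_long(sentence):
    """Long sentence filter (assumption: tokens are whitespace-separated)."""
    return len(sentence) > MAX_SYMBOLS or len(sentence.split()) > MAX_TOKENS


def length_ratio(source, target):
    """Assumption: ratio of the shorter to the longer sentence, in characters."""
    shorter, longer = sorted((len(source), len(target)))
    return shorter / longer if longer else 0.0


def digit_sequences(sentence):
    """All digit sequences of a sentence as a sorted list."""
    return sorted(re.findall(r"\d+", sentence))


def keep_pair(source, target):
    """Return True if the sentence pair passes all three sketched filters."""
    if too_long(source) or too_long(target):
        return False
    if length_ratio(source, target) < MIN_LENGTH_RATIO:
        return False
    # Digit mismatch filter (assumption: both sides must contain the same digits).
    return digit_sequences(source) == digit_sequences(target)
```

A corpus would then be filtered with, e.g., `[(s, t) for s, t in pairs if keep_pair(s, t)]`. The bad encoding filter is left out because the text does not specify which symbols count as foreign or corrupt.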

Data Pre-processing
After filtering, all training data were pre-processed using the following steps:
1. Normalisation of punctuation. Only one standard of quotation marks and apostrophes was used; hyphenated tokens were split and the hyphens were replaced with a special symbol.
2. Identification of non-translatable entities. Email addresses, URLs, file addresses and XML tags were identified and replaced with place-holders.

4. Truecasing. The Moses truecase.perl script was used to truecase the first word of each sentence.
5. Morphology-driven word splitting (Pinnis et al., 2017b). Tokens were split using a morphological analyser and further processed with byte pair encoding (BPE) (Sennrich et al., 2015) to ensure an open vocabulary. For both languages, we used morphological analysers that were developed by Deksne (2013) using finite state transducer technology.
6. Factorisation. Following the work of Sennrich and Haddow (2016), who showed that linguistic input features increase NMT system translation quality, we developed our NMT systems using factored models. Therefore, the source data were further factored using a language-specific tagger or parser. For Latvian, we used an averaged perceptron-based morpho-syntactic tagger (Nikiforovs, 2014) that was trained on the data from Pinnis and Goba (2011). For English, we used the lexicalized probabilistic parser (Klein et al., 2002) from the Stanford CoreNLP toolkit (Manning et al., 2014).

Table 1: Training data statistics (sentence counts) for SMT and NMT systems before and after filtering
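The identification of non-translatable entities (step 2 above) can be illustrated with a small sketch. It is not the tool used for the submitted systems: the place-holder tokens (`$EMAIL$`, `$URL$`, `$TAG$`), the regular expressions, and the function names are hypothetical, and file addresses are omitted for brevity.

```python
import re

# Hypothetical place-holder tokens and deliberately simple patterns.
PATTERNS = [
    ("$EMAIL$", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("$URL$", re.compile(r"(?:https?://|www\.)\S+")),
    ("$TAG$", re.compile(r"</?[A-Za-z][^<>]*>")),
]


def mask_entities(sentence):
    """Replace entities with place-holders and remember the original strings."""
    saved = []
    for token, pattern in PATTERNS:
        def replace(match, token=token):
            saved.append((token, match.group(0)))
            return token
        sentence = pattern.sub(replace, sentence)
    return sentence, saved


def restore_entities(sentence, saved):
    """Put the originals back, assuming place-holders of one type keep their order."""
    for token, original in saved:
        sentence = sentence.replace(token, original, 1)
    return sentence
```

Restoring by order of appearance is a simplification; a translation system may reorder place-holders of the same type, in which case an alignment-based restoration would be needed.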

Synthetic Data
Similarly to the method by Pinnis et al. (2017b) that allows training NMT models that are more robust to unknown and rarely occurring words, we supplemented the parallel training data with synthetic parallel training sentences. To create the synthetic corpus, we performed word alignment on the parallel corpus using fast-align (Dyer et al., 2013). Then, we randomly replaced one to three unambiguously (one-to-one) aligned content words with unknown word <UNK> placeholders. Finally, we copied factor information from the original factored source sentence to the synthetic sentence.
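The replacement procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: alignments are given in the `i-j` (source-target index) format that fast-align produces, `is_content_word` is a hypothetical predicate supplied by the caller, and the copying of factor information is left out.

```python
import random
from collections import Counter

UNK = "<UNK>"


def one_to_one_links(alignment):
    """Keep only links whose source and target index each occur exactly once."""
    links = [tuple(int(i) for i in pair.split("-")) for pair in alignment.split()]
    source_counts = Counter(s for s, _ in links)
    target_counts = Counter(t for _, t in links)
    return [(s, t) for s, t in links
            if source_counts[s] == 1 and target_counts[t] == 1]


def make_synthetic_pair(source, target, alignment, is_content_word, rng):
    """Replace one to three unambiguously aligned content words with <UNK>.

    Returns None if the sentence pair has no suitable word pair.
    """
    candidates = [(s, t) for s, t in one_to_one_links(alignment)
                  if is_content_word(source[s])]
    if not candidates:
        return None
    count = min(len(candidates), rng.randint(1, 3))
    new_source, new_target = list(source), list(target)
    for s, t in rng.sample(candidates, count):
        new_source[s] = UNK
        new_target[t] = UNK
    return new_source, new_target
```

Because the same positions are replaced on both sides, the model sees the place-holder in aligned contexts and can learn to pass it through to the target side.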
Using the filtered and the synthetic training data, we trained initial target-to-source NMT models (see Section 5 for details on the NMT architecture). Then, we shuffled the available in-domain monolingual data (news articles or news commentary in the target language) and for each system back-translated a part of the monolingual data from the target language into the source language in order to create additional synthetic source-to-target parallel corpora. The data were selected such that the amount would approximately correspond to the original training data. Experiments with different back-translated data proportions showed that the best results could be achieved with a proportion of 1-to-1.
The back-translated parallel corpora were also supplemented with sentence pairs where content words with unambiguous alignments were randomly replaced with unknown word placeholders. Finally, the additional synthetic data were added to the existing training data. The statistics of the synthetic corpora and the final training data for NMT system training are given in Table 2. It can be seen that the synthetic data creation process increased the size of the training data fourfold.

SMT Systems
SMT systems were trained using Moses (Koehn et al., 2007) in the Tilde MT platform (Vasiļjevs et al., 2012). All systems were trained using the filtered training data (see Table 1). Word alignment was performed using fast-align (Dyer et al., 2013). All SMT systems feature 7-gram translation models and the wbe-msd-bidirectional-fe-allff reordering models. The systems have two language models that were trained using KenLM (Heafield, 2011): an in-domain language model trained on the news article and news commentary corpora and an out-of-domain language model trained on the remaining monolingual data. The systems were tuned using MERT on the newsdev2017 data set.

NMT System Architecture
The NMT system architecture is based on the implementation available with the Nematus toolkit. We also use linguistic input features, i.e., each factor of a word part has its own embedding vector, and the individual embedding vectors are concatenated to obtain one embedding vector for the whole word part. In more detail, the encoder's embedding layer has a total of 500 dimensions, which are split among the different input factors as specified in Table 5. It accommodates a vocabulary of 25 thousand sub-word units. The embedding layer is followed by a bidirectional MLSTM layer with 1024 dimensions for gates and cell states.
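The concatenation of factor embeddings can be illustrated with a short sketch. Only the total of 500 dimensions comes from the text; the factor names and the way the dimensions are divided among them are hypothetical, since the actual split is given in a table that is not reproduced here.

```python
import random

# Hypothetical factors and split; only the total (500) is taken from the text.
FACTOR_DIMS = {"subword": 350, "lemma": 100, "pos_tag": 50}
assert sum(FACTOR_DIMS.values()) == 500


def init_table(symbols, dim, rng):
    """Create a randomly initialised embedding table for one factor."""
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


def embed_word_part(factors, tables):
    """Concatenate the per-factor embeddings into one vector for the word part."""
    vector = []
    for name in FACTOR_DIMS:
        vector.extend(tables[name][factors[name]])
    return vector


rng = random.Random(0)
tables = {
    "subword": init_table(["start", "Rio"], FACTOR_DIMS["subword"], rng),
    "lemma": init_table(["start", "Rio"], FACTOR_DIMS["lemma"], rng),
    "pos_tag": init_table(["NN", "NNP"], FACTOR_DIMS["pos_tag"], rng),
}
vec = embed_word_part({"subword": "start", "lemma": "start", "pos_tag": "NN"}, tables)
```

Giving most of the dimensions to the sub-word factor is a design choice of this example: the vocabulary of that factor is by far the largest, so it plausibly needs the most capacity.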
The decoder has a similar architecture to the implementation in the Nematus toolkit (Sennrich et al., 2017), which improves on the original attention-based NMT model (Bahdanau et al., 2014) by conditioning the attention weights on the previously decoded word in addition to the hidden state at the previous time-step. This is achieved by first computing an intermediate state $\hat{s}_t = \mathrm{GRU}(s_{t-1}, e_{y_{t-1}})$, then using it to compute the attention context $c_t = \mathrm{ATT}(\hat{s}_t, h)$, where $s_{t-1}$ and $e_{y_{t-1}}$ are the decoder's hidden state and the embedding of the decoded word at the previous time-step respectively, and $h$ is the annotation matrix produced by the encoder. The hidden state is then calculated as $s_t = \mathrm{GRU}(\hat{s}_t, c_t)$.
We modify this scheme by using an MLSTM cell to calculate the intermediate state $(\hat{s}_t, z_t) = \mathrm{MLSTM}(s_{t-1}, z_{t-1}, e_{y_{t-1}})$, where $\hat{s}_t$ and $z_t$ are the MLSTM cell's output and hidden states respectively.
Similarly to the encoder, all of the gates and intermediate states of the decoder have a dimensionality of 1024. The decoder's embeddings have a dimensionality of 500.
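To make the role of the MLSTM cell more concrete, the following sketch implements a single step of a multiplicative LSTM in plain Python. It follows the general form of the unit described by Krause et al. (2016) with bias terms omitted, and it should be read as an illustration rather than as the implementation used in our systems. The parameter names (`W_mx`, `W_mh`, etc.) are assumptions, as is the reading that the returned output and cell state play the roles of the output and hidden states of the decoder's MLSTM cell.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlstm_step(x, h_prev, c_prev, p):
    """One multiplicative LSTM step; returns (output, cell state)."""
    # Multiplicative intermediate state: element-wise product of two projections.
    m = [a * b for a, b in zip(matvec(p["W_mx"], x), matvec(p["W_mh"], h_prev))]

    def pre_activation(gate):
        from_input = matvec(p["W_" + gate + "x"], x)
        from_m = matvec(p["W_" + gate + "m"], m)
        return [a + b for a, b in zip(from_input, from_m)]

    i = [sigmoid(v) for v in pre_activation("i")]    # input gate
    f = [sigmoid(v) for v in pre_activation("f")]    # forget gate
    o = [sigmoid(v) for v in pre_activation("o")]    # output gate
    u = [math.tanh(v) for v in pre_activation("h")]  # candidate update
    c = [fv * cv + iv * uv for fv, cv, iv, uv in zip(f, c_prev, i, u)]
    h = [math.tanh(cv) * ov for cv, ov in zip(c, o)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy example: input size 3, state size 2, all weights zero.
params = {name: zeros(2, 3) for name in ("W_mx", "W_ix", "W_fx", "W_ox", "W_hx")}
params.update({name: zeros(2, 2) for name in ("W_mh", "W_im", "W_fm", "W_om", "W_hm")})
s_hat, z = mlstm_step([1.0, 2.0, 3.0], [0.5, -0.5], [1.0, -2.0], params)
```

The difference from a standard LSTM is that the gates are driven by `m`, which depends multiplicatively on the current input, instead of by the previous hidden state directly.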
For training, we also used dropout with the rate of 0.2 for hidden layers, and 0.1 for input and output embedding layers. For optimisation, we used Adadelta (Zeiler, 2012) with a learning rate of 0.0001, and we also used gradient clipping with a threshold of 1.
After training, 5 to 7 models that achieved the highest mixed metric evaluation results on the tuning data (i.e., the newsdev2017 data set) were selected for ensemble decoding with a beam size of 12.

Hybrid System Architecture
After developing the initial NMT models, a preliminary manual analysis of the translations of the English-Latvian constrained system showed that only 34-44% of named entities within the tuning set were translated correctly. At the same time, the SMT system was able to handle approximately 70% of named entities correctly. Since our previous research in hybrid machine translation system development has shown that SMT systems can handle rare and unknown word translation in hybrid NMT-SMT scenarios better than NMT models alone (Pinnis, 2016), we decided to chain the NMT and SMT systems into a hybrid NMT-SMT system set-up. In the hybrid set-up, a sentence is first translated with the NMT system, after which rare and unknown words that are left untranslated by the NMT system are translated with the SMT system.
[Table 4 (excerpt): example of a sentence processed through the hybrid translation steps]
NMT translation: watch the βIDβ -βIDβ start at the Rio Games today .
Post-processed translation: Watch the Ikauniece-Admidina start at the Rio Games today.
NMT only translation (for comparison): Today, look at the start of the Isolence-Admidias in the Rio Games.

Table 5: Rare word detection thresholds

The hybrid translation method performs translation in six steps as follows (see Table 4 for an example of a sentence processed through all of the hybrid translation steps):
1. First, rare and unknown words are identified in the source sentence and replaced by unknown word place-holders. Words are considered rare if they contain at least one sub-word unit (or sub-word unit bigram) whose occurrence count in the training data is below a certain threshold. The thresholds for our submitted systems were empirically identified by analysing the hybrid method's performance on the tuning data. The thresholds are given in Table 5.
2. Then, the pre-processed sentence is translated with the NMT system. Our NMT models have been trained to leave the unknown word place-holders untranslated, i.e., to pass them through to the target side (Pinnis et al., 2017b). The capability of the NMT system to pass the place-holders through unchanged is vital for the further steps to work.
3. After translation, the NMT model's produced attention matrix is used to perform word alignment. Here, we also identify which source words correspond to each place-holder on the target side.

4. Then, a Moses XML document is prepared for the sentence such that the Moses SMT system will have to translate only the words that were replaced by the place-holders but leave the remaining part as it was translated by the NMT system.
5. Then, for the Latvian-English unconstrained system, we use a person name and surname dictionary to look up translations of untranslated person names. The translations from the dictionary are merged into the Moses XML document so that the SMT system is constrained to the translations found in the dictionary.
6. Finally, the Moses XML document is translated with the SMT system.
In the hybrid set-up, the same pre-processing and post-processing steps are used as for the individual NMT and SMT systems.
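Steps 1 and 3 of the hybrid method can be sketched as follows. This is a minimal illustration, not the code of the submitted systems: the threshold values, the `<ID>` place-holder token, and the function names are hypothetical, and taking the source position with the highest attention weight is an assumption about how the attention matrix is turned into an alignment.

```python
def is_rare(subword_units, unigram_counts, bigram_counts,
            unigram_threshold, bigram_threshold):
    """A word is rare if any sub-word unit or unit bigram is below its threshold."""
    if any(unigram_counts.get(u, 0) < unigram_threshold for u in subword_units):
        return True
    bigrams = zip(subword_units, subword_units[1:])
    return any(bigram_counts.get(b, 0) < bigram_threshold for b in bigrams)


def align_placeholders(attention, target_tokens, placeholder="<ID>"):
    """Map each place-holder position on the target side to a source position.

    attention[t][s] is the attention weight of target position t on source
    position s; the source position with the highest weight is selected.
    """
    alignment = {}
    for t, (token, weights) in enumerate(zip(target_tokens, attention)):
        if token == placeholder:
            alignment[t] = max(range(len(weights)), key=weights.__getitem__)
    return alignment


# Toy example with hypothetical counts and thresholds.
unigram_counts = {"Ikaun": 1, "iece": 500, "start": 1000}
rare = is_rare(["Ikaun", "iece"], unigram_counts, {}, 5, 0)
common = is_rare(["start"], unigram_counts, {}, 5, 0)
links = align_placeholders([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], ["watch", "<ID>"])
```

The resulting mapping tells the later steps which source words have to be handed to the SMT system for each place-holder that the NMT system passed through.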

Results
We evaluated all MT systems using multiple automatic evaluation metrics including BLEU (Papineni et al., 2002), BEER 2.0 (Stanojevic and Sima'an, 2014), CharacTER (Wang et al., 2016), and TER (Snover et al., 2006). The automatic evaluation results (see Table 6) on the newstest2017 data set show that for English-Latvian the constrained NMT system and for Latvian-English both the constrained and unconstrained NMT systems achieve significantly better results than the SMT systems. The difference between the quality of the unconstrained English-Latvian SMT and NMT systems is not statistically significant.

Table 6: Automatic evaluation results of Tilde's systems (CS stands for case sensitive evaluation; † marks results that are significant compared to the SMT system with p = 0.01; the BLEU scores are given with a 95% confidence interval that was calculated using bootstrap resampling (Koehn, 2004))
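The idea behind a bootstrap confidence interval can be sketched as follows. This is a generic illustration in the spirit of Koehn (2004), not the evaluation script used for the reported scores. The function names, the number of samples, and the stand-in metric (a corpus-level precision computed from per-sentence match and total counts) are assumptions; a BLEU interval would be computed in the same way from per-sentence n-gram statistics.

```python
import random


def bootstrap_interval(sentence_stats, corpus_metric,
                       n_samples=1000, confidence=0.95, seed=0):
    """Estimate a confidence interval by resampling sentences with replacement."""
    rng = random.Random(seed)
    n = len(sentence_stats)
    scores = []
    for _ in range(n_samples):
        sample = [sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_metric(sample))
    scores.sort()
    # Drop the same number of scores from both tails of the distribution.
    cut = int(round(n_samples * (1.0 - confidence) / 2.0))
    return scores[cut], scores[n_samples - cut - 1]


def corpus_precision(stats):
    """Stand-in corpus-level metric over (matches, total) pairs per sentence."""
    return sum(m for m, _ in stats) / sum(t for _, t in stats)


stats = [(3, 4), (5, 10), (8, 8), (2, 6), (7, 9)]
lower, upper = bootstrap_interval(stats, corpus_precision)
```

Resampling whole sentences keeps the metric a corpus-level quantity in every sample, which matters for metrics such as BLEU that are not simple averages of sentence scores.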
Since the automatic metrics have been shown to be insufficient for evaluating MT systems of the two different paradigms (Pinnis et al., 2017a), we also performed (blind) human comparative evaluation of the SMT and NMT system translations. Five professional translators were given the source sentence and translations of two MT systems and asked to select which system (NMT, SMT, or neither) produces a better translation. The evaluation was performed on the tuning data set. In total, 200-250 sentences were evaluated in each evaluation task. The results in Figure 1 show that the NMT system translations are preferred over the translations of the SMT systems. According to the methodology by Skadiņš et al. (2010), the results are weakly sufficient for all scenarios (except the Latvian-English unconstrained scenario for which the results are strongly sufficient) to conclude that the NMT systems produce better translations than the SMT systems.
The results also show that there is an insignificant quality increase for the hybrid systems over the NMT systems. The increase is minimal as only sentences that contain words with rare word parts are translated differently. However, the hybrid scenario (and the components used in the hybrid scenario) allows us to integrate the NMT systems in our existing SMT infrastructure for formatting-rich document translation, which is a vital requirement for us to provide NMT services for customers.
Compared to other submitted systems, it is evident (see Table 7) that our constrained NMT-SMT hybrid systems significantly outperform other submitted systems.

Conclusions
In the paper, we have described English-Latvian and Latvian-English MT systems that were developed by Tilde for the WMT 2017 shared task in news translation. In total, we submitted five systems: four NMT-SMT hybrid systems (two constrained and two unconstrained systems) and one unconstrained English-Latvian SMT system that achieves similar translation quality as the NMT system according to automatic evaluation.
We have documented the methodology used to prepare the data for training of the systems, the SMT and NMT system training set-ups, the workflow for chaining the NMT and SMT systems into a hybrid NMT-SMT system, as well as our evaluation efforts.
The automatic and manual evaluation results show that three out of four NMT systems significantly outperform the SMT systems. Although the hybrid systems did not produce a significant improvement, the minimal improvement is consistent across all language pairs. The results also showed that in terms of automatic evaluation our submitted NMT-SMT hybrid systems significantly outperform the systems submitted by other participants of the shared task.

Table 7: Automatic evaluation results of the top three English-Latvian and Latvian-English constrained systems submitted for the WMT 2017 shared task on news translation (CS stands for case sensitive evaluation; † marks results that are significant compared to other systems with p = 0.01; the BLEU scores are given with a 95% confidence interval that was calculated using bootstrap resampling (Koehn, 2004))