The AFRL-MITLL WMT17 Systems: Old, New, Borrowed, BLEU

This paper describes the AFRL-MITLL machine translation systems and the improvements that were developed during the WMT17 evaluation campaign. This year, we explore the continuing proliferation of Neural Machine Translation toolkits, revisit our previous data-selection efforts for use in training systems with these new toolkits, and expand our participation to the Russian–English, Turkish–English and Chinese–English translation pairs.


Introduction
As part of the 2017 Conference on Machine Translation (WMT, 2017) news-translation shared task, the MITLL and AFRL human language technology teams participated in the Russian-English, English-Russian, Turkish-English and Chinese-English tasks.
Our machine translation systems this year are a departure from our previous Moses-based (Koehn et al., 2007) systems from WMT16. We employ systems built with the Nematus (Sennrich et al., 2017) toolkit, as in our IWSLT2016 (Kazi et al., 2016) systems, the Nematus-compatible Marian training toolkit and AmuNMT decoder, and the OpenNMT (Klein et al., 2017) toolkit.
For the Russian-English and Turkish-English language pairs, we submitted an entry comprising the best systems combined using the Jane system combination method (Freitag et al., 2014) and the best-scoring single system for that language pair. For the Chinese-English and English-Russian language pairs, we only submitted our single-best system.

Data Preparation
The Russian/English files were cleaned to remove blank lines, replace carriage returns with line feed characters, remove wrong-language text, and correct mixed alphabet spellings, following techniques outlined in  and (Schwartz et al., 2014).
The number of non-parallel blank lines in the Russian/English news commentary files indicated some sentence alignment errors, so these files were re-processed using the NLTK Punkt (Kiss and Strunk, 2006) sentence segmenter and the Champollion sentence aligner (Ma, 2006) before cleaning. Altogether, 9,537 of the original 236,314 news commentary lines were removed during the clean-up process.
The Chinese files were word-segmented with Jieba and the Stanford Chinese segmenter (Chang et al., 2008). The Chinese-English parallel data was cleaned to replace carriage returns and to remove wrong-language text. Lines containing http URLs were also removed, because of differences between the Chinese and English tokenization. Altogether, the clean-up process removed 310,121 of the 21,248,495 lines in the combined file.
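The clean-up steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scripts used for the submission: the function name `clean_parallel` and the URL test are hypothetical, carriage returns are replaced with a space so that each pair stays on one line, and wrong-language filtering is omitted.

```python
import re

# Assumption: a line "contains an http URL" if it matches http:// or https://.
URL_RE = re.compile(r"https?://", re.IGNORECASE)

def clean_parallel(src_lines, tgt_lines):
    """Return (kept_pairs, removed_count) for two aligned lists of lines."""
    kept, removed = [], 0
    for src, tgt in zip(src_lines, tgt_lines):
        # Replace stray carriage returns and trim surrounding whitespace.
        src = src.replace("\r", " ").strip()
        tgt = tgt.replace("\r", " ").strip()
        # Drop the pair if either side is blank or contains an http URL.
        if not src or not tgt or URL_RE.search(src) or URL_RE.search(tgt):
            removed += 1
            continue
        kept.append((src, tgt))
    return kept, removed
```

Removing a pair whenever either side fails a check keeps the two sides of the corpus aligned line by line.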

Subselection
We use our corpus subselection algorithm, defined in . We use a vocabulary of up to 4-grams for subselection, after using byte-pair encoding (see Section 3) to produce sub-word units. We believe that selecting from subwords is especially beneficial in morphologically complex languages like Turkish and Russian.
For Russian, we conducted monolingual selection from the provided Common Crawl data to match the test sets from 2012-2016 (15K lines total). This corpus was broken into 571 chunks of one million lines each, and five thousand lines were selected from each (2.9M lines total). 3-gram and 4-gram subword subselection vocabularies were used.
For Turkish, we conducted monolingual selection from Common Crawl to match the SE Times and dev/test 2016 corpora (212K lines total). This corpus was broken into 502 chunks of one million lines each, and 25 thousand lines were selected from each (12.5M lines total). A 4-gram subword subselection vocabulary was used.
After the subselection process completed for each language, we sampled the first 3,000 (i.e., top-scoring) English sentences from each selected chunk. For Russian and Turkish, we used the entire subselected chunk. Final line counts for these selected data sets are listed in Table 1.
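To make the chunked selection concrete, the sketch below scores each candidate line by the fraction of its 1- to 4-grams that occur in the target set and keeps the top k lines of a chunk. This is only a stand-in: the actual subselection algorithm is defined elsewhere and may differ substantially, and the function names and scoring rule here are assumptions made for illustration.

```python
def ngrams(tokens, max_n=4):
    """All n-grams of order 1..max_n, as tuples of (sub-word) tokens."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocab(target_lines, max_n=4):
    """Collect the n-gram vocabulary of the set we want to match."""
    vocab = set()
    for line in target_lines:
        vocab.update(ngrams(line.split(), max_n))
    return vocab

def select_from_chunk(chunk, vocab, k, max_n=4):
    """Keep the k lines of a chunk whose n-grams best overlap the vocabulary.

    Illustrative scoring only (fraction of n-grams found in the vocabulary);
    not the authors' algorithm.
    """
    def score(line):
        grams = ngrams(line.split(), max_n)
        return sum(g in vocab for g in grams) / len(grams) if grams else 0.0
    return sorted(chunk, key=score, reverse=True)[:k]
```

With 571 chunks and k = 5,000, such a procedure returns 571 × 5,000 = 2,855,000 lines, which matches the roughly 2.9M Russian lines reported above.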

MT System Descriptions
This year we participated in the Russian-English, English-Russian, Turkish-English and Chinese-English translation pairs using a variety of toolkits and techniques. Of particular note, we employed byte-pair encoding (BPE) (Sennrich et al., 2016b) of the source and target training data to address the out-of-vocabulary (OOV) problem.
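For readers unfamiliar with BPE, the following is a minimal sketch of how merge operations are learned, in the spirit of the algorithm of Sennrich et al. It is a toy illustration, not the tooling used for our systems; the function names and the `</w>` end-of-word marker are choices made for this example.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Return the list of merges learned from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

Because rare words decompose into frequent sub-word units, an unseen word at test time can usually still be represented, which is how BPE addresses the OOV problem.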

Russian-English
The Russian-English language pair has been our largest focus since our participation in WMT14.
We spent significant effort building a variety of systems, as described below.

AFRL Nematus/Marian Systems
Our Nematus/Marian systems follow the general approach of the WMT16 Edinburgh NMT systems (Sennrich et al., 2016a) with the following differences: we use the data selection algorithm described in Section 2.3, yielding approximately 5 million additional lines of backtranslated data. To produce this backtranslated data, we performed the following steps:
1) We first used Edinburgh's backtranslated data from WMT16 to produce a Nematus-based Russian-English system.
2) Once trained, we used the Amun decoder to translate the 2.8 million lines of subselected monolingual Common Crawl Russian data into English.
3) The resulting data was then used to train an English-Russian Marian system, which was decoded with Amun to translate the 8.9 million lines of subselected English data into Russian.
4) Following this decoding, a final Russian-English system was trained using Marian with this backtranslated data.
Three separate Marian training runs were performed with this final data set. Additionally, a Nematus system was trained for rescoring purposes, with the English target data reversed in word order. The combination of these final inputs was optimized with Drem to determine feature weights.
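The reversed-target rescoring idea can be sketched as follows. This is a simplified illustration: `r2l_score` is a hypothetical callable standing in for the reversed-target Nematus model, and the fixed weights stand in for the feature weights that were actually tuned with Drem.

```python
def reverse_target(sentence):
    """Reverse word order, as done to build right-to-left training targets."""
    return " ".join(reversed(sentence.split()))

def rescore_nbest(nbest, r2l_score, weights):
    """Return the hypothesis with the best weighted sum of model scores.

    nbest:     list of (hypothesis, left_to_right_log_prob) pairs.
    r2l_score: hypothetical callable giving the log-probability of a
               reversed hypothesis under the right-to-left model.
    weights:   (w_l2r, w_r2l) interpolation weights (assumed fixed here).
    """
    w_l2r, w_r2l = weights

    def combined(item):
        hyp, l2r = item
        return w_l2r * l2r + w_r2l * r2l_score(reverse_target(hyp))

    return max(nbest, key=combined)[0]
```

A right-to-left model tends to make different errors from a left-to-right one, particularly near the end of a sentence, so combining their scores can promote hypotheses that both directions agree on.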

AFRL OpenNMT Systems
We trained four OpenNMT systems. Two systems employed the backtranslated data used in last year's University of Edinburgh NMT systems (Sennrich et al., 2017). The other two systems employed the subselected data as described in Section 2.3. All systems used 1000 hidden units and 600-unit word embeddings.
The two WMT16-based systems were each fine-tuned with newstest2012-2015 data. One system was also incrementally trained with the same newstest data. The subselected systems had cased BLEU scores of 30.04 and 30.67 on newstest2017 while the WMT16-based systems had BLEU scores of 32.16 and 32.78. They were all single systems.
Since OpenNMT currently does not support ensemble decoding, we decided to try doing system combination on the last four epochs of training. Taking the best system from the subselected data then gave a BLEU score of 33.95 while the best WMT16-based systems increased to 33.23. Combining the four ensembles of each of those systems resulted in a score of 34.45 BLEU. This last ensemble system combination was done after the submission deadline.

MITLL Phrase-Based System
While similar to last year's phrase-based system, this year's system differs in a few key ways: 1) we use Moses truecased training data to make our tokenization scheme uniform; 2) we rescore using systems built from data made available by Edinburgh's WMT16 system (Sennrich et al., 2016a); 3) we updated our language models with the new monolingual data sources; and 4) we add 4 million lines from the UN v1.0 corpus (Ziemski et al., 2016) to the parallel training data.
For the last item, we used Moore-Lewis (Moore and Lewis, 2010) filtering on the English side of the training data. The in-domain language model was trained on news.2015.shuffled.en using a single-layer LSTM language model developed in-house. The out-of-domain language model (trained on UNv1.0) used the same vocabulary. We compared word-level and character-level language model results, and noted that character-level language modeling did a good job of data cleanup (giving bad scores to personnel records and poorly formatted data). We swept data selection sizes of two, four, and eight million lines, and found the middle size consistently the best. Our phrase-based system results are summarized in Table 2.
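Moore-Lewis filtering ranks each candidate sentence by the difference between its per-word cross-entropy under the in-domain and out-of-domain language models, keeping the sentences with the lowest difference. The sketch below illustrates the criterion only: it substitutes add-one smoothed unigram models over a shared vocabulary for the in-house LSTM language models, and all function names are assumptions.

```python
import math
from collections import Counter

def train_unigram(lines, vocab):
    """Add-one smoothed unigram LM over a shared vocabulary plus <unk>."""
    counts = Counter(w if w in vocab else "<unk>"
                     for line in lines for w in line.split())
    total = sum(counts.values()) + len(vocab) + 1  # +1 for the <unk> type

    def logprob(word):
        key = word if word in vocab else "<unk>"
        return math.log((counts[key] + 1) / total)

    return logprob

def cross_entropy(logprob, line):
    """Per-word cross-entropy (in nats) of a line under a language model."""
    words = line.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def moore_lewis_select(candidates, in_lm, out_lm, k):
    """Keep the k candidates with the lowest H_in(s) - H_out(s)."""
    def score(s):
        return cross_entropy(in_lm, s) - cross_entropy(out_lm, s)
    return sorted(candidates, key=score)[:k]
```

Sharing one vocabulary between the two models, as described above, keeps the two cross-entropies comparable, since neither model is penalized for words the other has never seen.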

MITLL OpenNMT Systems
We trained an OpenNMT system with the same in-domain data as our phrase-based system, using the default 9 epochs at learning rate 1.0 and decaying the learning rate by a factor of 0.7 each epoch thereafter. This yielded a system with 29.07 BLEU on newstest2016. Creating an n-best list from the epoch 13 model and rescoring that n-best list with the models from epochs 11 and 12, combined with equal weight, yielded 29.55 BLEU.
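The learning rate schedule described above can be written as a small function. This sketch reads "reducing by 0.7" as multiplying the rate by 0.7 per epoch and assumes the first decayed epoch is epoch 10; both are interpretations rather than confirmed toolkit behavior, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1.0, constant_epochs=9, decay=0.7):
    """Learning rate for a 1-indexed epoch.

    Assumption: the rate is held at base_lr for the first constant_epochs
    epochs and multiplied by `decay` for every epoch after that.
    """
    if epoch <= constant_epochs:
        return base_lr
    return base_lr * decay ** (epoch - constant_epochs)
```

Under these assumptions, the epoch 13 model used for n-best generation would have been trained at a rate of 0.7^4, roughly 0.24.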

AFRL Phrase-Based Systems
In order to provide diversity for system combination, we trained a Moses system with the provided parallel data and the subselected, backtranslated data as outlined in Section 3.1.1. We trained a 5-gram, BPE'd language model from the data used to train the BigLM used in our WMT15 systems.

English-Russian
Due to the surprising effectiveness of the Marian English-Russian translation system used to produce backtranslated data, we decided to enter this system in the English-Russian translation task. This system was used in Step 3 of the Russian-English training process detailed in Section 3.1.1. Results of decoding newstest2017 are listed as entry 3 in Table 7.

Turkish-English
We apply the techniques employed in building our Russian-English systems to build Turkish-English translation systems.

AFRL Nematus/Marian Systems
For the Turkish-English task, the only provided parallel data was the SETimes corpus (Tyers and Alperen, 2010) of approximately 220,000 parallel lines. This presented a challenge for our goal of training a neural system similar to our Russian-English system (Section 3.1.4). We adopted a multiple-step approach as before, but started with a Turkish-English Moses (Koehn et al., 2007) system built on the SETimes corpus with BPE applied. An order-5 KenLM (Heafield, 2011) language model was built on a BPE'd version of the BigLM employed in our WMT15 system. Hierarchical lexicalized reordering and an order-5 Operation Sequence Model (Durrani et al., 2011) were also employed in this system. Drem was used to optimize system feature weights using the Expected Corpus BLEU (ECB) metric.
In the interest of speed, Moses2 was used to decode the subselected Turkish corpus. An English-Turkish Marian system was then trained (with default parameters) on the provided parallel data and the backtranslated data from the previous step. This system was then used to decode the English subselected corpus. Finally, our non-combination submission system was trained using both the provided parallel data and the data generated in the previous backtranslation step. This final Marian system was trained with a source vocabulary of 70k, a target vocabulary of 50k, a 2048-unit RNN hidden layer and a 512-unit word embedding layer. A Nematus system was trained with reversed target sentences to provide right-to-left (r2l) rescoring. Two Marian left-to-right (l2r) and one Nematus r2l training instances were run. Each of the 3 final models is an average of the 8 best-scoring model checkpoints from its training run. The resulting l2r averaged models were used to ensemble decode the test set, with the averaged r2l model rescoring the resulting n-best lists. Finally, the one-best output was submitted as System 5 in Table 7.
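Checkpoint averaging, as used for each of the three final models, amounts to taking the element-wise mean of the parameters of the best-scoring checkpoints from one training run. The sketch below uses plain Python lists in place of real parameter tensors; the function names and the data layout are assumptions for illustration.

```python
def best_checkpoints(scored, k=8):
    """Keep the parameters of the k highest-scoring checkpoints.

    scored: list of (validation_score, params) pairs, where params maps a
            parameter name to a flat list of floats (assumed layout).
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [params for _, params in ranked[:k]]

def average_checkpoints(checkpoints):
    """Element-wise mean of several parameter dicts with identical keys."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }
```

Averaging only makes sense for checkpoints from the same run, since independently trained models do not share a parameter space; that is why each run is averaged separately before ensembling.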

MITLL OpenNMT Systems
In the final week of the evaluation, to produce a diverse system, we attempted iterative backtranslation. We began with a Moses system trained on the SETimes corpus. We then took 800K sentences from news.2016.shuffled for each language. In training a Turkish-to-English MT system, we backtranslated the English news data into Turkish using the current best English-Turkish MT system. We then repeated the process in the other direction. In the interest of time, we used a small network with 256-dimensional word embeddings, a 512-unit RNN, and learning rate decay starting at epoch 6. Each pass took one day. Perplexities converged after 3 iterations (see Table 3).

Table 3: MITLL OpenNMT Turkish-English system perplexities on newsdev2016.
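The iterative procedure can be outlined as the loop below. This is a schematic under stated assumptions: `train` and `translate` are hypothetical callables standing in for the MT toolkit, and the initial models are trained with the same callable rather than with Moses.

```python
def iterative_backtranslation(parallel, mono_src, mono_tgt,
                              train, translate, rounds=3):
    """Alternate backtranslation between the two translation directions.

    parallel:  list of (src, tgt) sentence pairs.
    mono_src, mono_tgt: monolingual sentences in each language.
    train(pairs) -> model and translate(model, lines) -> lines are
    hypothetical stand-ins for the real training and decoding tools.
    """
    reverse_parallel = [(t, s) for s, t in parallel]
    fwd = train(parallel)           # source -> target
    bwd = train(reverse_parallel)   # target -> source
    for _ in range(rounds):
        # Backtranslate target-side monolingual data into synthetic sources,
        # then retrain the forward model on real plus synthetic pairs.
        synth_src = translate(bwd, mono_tgt)
        fwd = train(parallel + list(zip(synth_src, mono_tgt)))
        # Repeat in the other direction with the refreshed forward model.
        synth_tgt = translate(fwd, mono_src)
        bwd = train(reverse_parallel + list(zip(synth_tgt, mono_src)))
    return fwd, bwd
```

In each pass, the human-written monolingual text is always placed on the output side of the model being trained, so only the input side is synthetic.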

AFRL Moses Phrase-Based Systems
For contrast, a phrase-based system was built in the same manner as described in Step 1 of Section 3.3.1, but using the provided and backtranslated data used in the final step. This system contributed to the system combination listed as entry 4 of Table 7.

Chinese-English

MITLL Nematus and OpenNMT Systems
As in our other systems, we used Moore-Lewis filtering (on characters only here due to time constraints) to sort the data. In this case, we filtered all of the provided parallel training corpora (25M lines), since we had no prior knowledge of which corpora were useful. For our Nematus system, we took the top 20 million lines, using the subselection method as a form of data "cleanup". Since this system took a month to train, for our OpenNMT system we instead extracted the top 5M sentences, and this system trained in one week. The Nematus system trained to a BLEU score of 16.39 on newstest2016, ensembled to 18.59, and the single-best OpenNMT system trained to 18.30. (OpenNMT did not have ensemble decoding implemented at the time of the evaluation.) We also rescored the Nematus ensembled n-best list with our OpenNMT system. We used an n-best list size of 12, and achieved a score of 20.06 (+.06) on newstest2017.

AFRL OpenNMT Systems
Similarly to the Chinese-English systems in the previous section, we down-sampled the available parallel data using the algorithms described in  resulting in a 5 million line parallel training set. OpenNMT systems were trained in the same manner described in Section 3.1.2. The outputs of the 8 best-scoring epochs were ensembled using system combination again in the same manner as the Russian-English systems. This resulting system is listed as entry 6 in Table 7.

AFRL Marian Systems
Again for contrast, we experimented using 5 million lines of down-selected data from the parallel UN corpus as in Section 3.4.2. We character-segmented all Chinese characters on the source side of the data, then applied a BPE model to any remaining non-Chinese words. This BPE model is the same as the one learned from and applied to the target side of the parallel training data. Interestingly, this approach limited the source vocabulary to only 22,000 terms. The target vocabulary is a more typical 40K due to the application of BPE.
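The source-side segmentation described above can be sketched as follows. This is an illustration, not our preprocessing script: it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters, and `apply_bpe` is a hypothetical callable standing in for the BPE model learned on the target side.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK = re.compile(r"([\u4e00-\u9fff])")

def char_segment(line, apply_bpe=lambda word: [word]):
    """Split Chinese characters into single tokens; BPE everything else.

    apply_bpe: hypothetical callable mapping a non-Chinese word to a list
               of sub-word units (identity by default).
    """
    tokens = []
    for word in line.split():
        # Splitting on a capturing group keeps the matched characters.
        for piece in CJK.split(word):
            if not piece:
                continue
            if CJK.fullmatch(piece):
                tokens.append(piece)
            else:
                tokens.extend(apply_bpe(piece))
    return tokens
```

Because every Chinese character becomes its own token, the source vocabulary is bounded by the character inventory plus the BPE units for non-Chinese material, which is consistent with the small source vocabulary reported above.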
Marian was used to train models with 1024, 2048, and 3072 hidden units in the RNN layer. We saw a performance gain when increasing the number of units from 1024 to 2048, but not from 2048 to 3072 (at least for this experiment). These scores are shown in the

System Combination
Jane System Combination (Freitag et al., 2014) was used to combine a variety of systems for our Russian-English and Turkish-English combination submissions. We show the individual system combination inputs and final scores for Russian-English in Table 5 and Turkish-English in Table 6. It is important to note that our single-best Russian-English submission did not contribute to the system-combination entry as this system was a late addition at the end of the evaluation period.
For each system combination, five experiment replicates were run to account for variance in the combination process. The resulting best replicate was submitted. Results are shown in Table 7.

Conclusion
We present a series of improvements to our Russian-English systems and apply these lessons learned to creating Turkish-English and Chinese-English systems.
While researchers in recent years have been searching for principled methods to combine the strengths of statistical and neural MT, we find that carefully devised system combination and ensembling provides aggregate improvement. Thus, "borrowing" the Jane system combination technique allows one to combine old and new for better BLEU.

      OpenNMT with subsel, backtrans data, single decode                 30.04
6     OpenNMT with backtrans data, finetune, inc. train                  32.16
7     OpenNMT with backtrans data, finetune only                         32.78
8     OpenNMT with UN data + Nematus rescore                             33.24
9     OpenNMT with backtrans data, ensemble-syscomb 4-best               33.23
10    OpenNMT with subsel, backtrans data, ensemble syscomb 4-best       31.91
comb  Submitted System Combination                                       34.71