The AFRL WMT19 Systems: Old Favorites and New Tricks

This paper describes the Air Force Research Laboratory (AFRL) machine translation systems and the improvements that were developed during the WMT19 evaluation campaign. This year, we refine our approach to training popular neural machine translation toolkits, experiment with a new domain adaptation technique, and again measure improvements in performance on the Russian–English language pair.


Introduction
As part of the 2019 Conference on Machine Translation (Bojar et al., 2019) news-translation shared task, the AFRL Human Language Technology team participated in the Russian-English portion of the competition. We built on our strategies from last year (Gwinnup et al., 2018), adding language-ID-based data filtering and optimizing subword segmentation strategies. For Russian-English we again submitted an entry comprising our best systems trained with Marian (Junczys-Dowmunt et al., 2018), Sockeye (Hieber et al., 2017) with Elastic Weight Consolidation (EWC) (Thompson et al., 2019), OpenNMT (Klein et al., 2018), and Moses (Koehn et al., 2007), combined using the Jane system combination method (Freitag et al., 2014).

Data Preparation
We used and preprocessed data as outlined in Gwinnup et al. (2018). For all systems trained, we applied either byte-pair encoding (BPE) (Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018) subword strategies to address the vocabulary-size problem.
For this year, we also employed a language-ID filtering step for the BPE-based systems. Using the pre-built language ID model developed by the authors of fastText (Joulin et al., 2016a,b), we developed a utility that examined each source–target sentence pair and discarded the pair if either side fell below 0.8 probability of being the desired language. We applied this filtering to all provided parallel corpora, removing 33.7% of lines. The process was particularly effective on the ParaCrawl corpus, where 57.1% of lines were removed. Pre- and post-filtering line counts for various corpora are shown in Table 1. A comparison with the organizer-provided parallel training data used in our WMT18 system (largely the same as the provided parallel data for WMT19 in the Russian-English language pair) on baseline Marian transformer systems under identical training conditions shows that aggressive language-ID-based filtering yields an approximate +1 BLEU point improvement as measured by SacreBLEU (Post, 2018). These results are shown in Table 2.
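The filtering rule described above can be sketched as follows. This is an illustrative sketch, not our actual utility: the `predict` callable is a hypothetical stand-in for a wrapper around fastText's pre-built language ID model, assumed to return a language code and a probability for a sentence.

```python
def keep_pair(src, tgt, predict, src_lang="ru", tgt_lang="en", threshold=0.8):
    """Keep a sentence pair only if both sides are identified as the
    desired language with probability >= threshold.

    `predict` maps a sentence to a (language_code, probability) pair,
    e.g. a thin wrapper around fastText's pre-built language ID model.
    """
    for sent, lang in ((src, src_lang), (tgt, tgt_lang)):
        pred_lang, prob = predict(sent)
        if pred_lang != lang or prob < threshold:
            return False
    return True


def filter_corpus(pairs, predict, **kwargs):
    """Return only the (src, tgt) pairs that pass the language-ID check."""
    return [pair for pair in pairs if keep_pair(*pair, predict, **kwargs)]
```

Applied to all parallel corpora with a 0.8 threshold on both sides, a rule of this shape removed 33.7% of lines overall in our experiments.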

Exploration of Byte-Pair Encoding Merge Sizes
A key choice when addressing the closed-vocabulary problem is the granularity of the subword units produced by either SentencePiece or BPE. To that end, we examined varying the number of BPE merge operations in order to determine an optimal setting to maximize performance for the Russian-English language pair. For the OpenNMT-based systems, a vocabulary size of 32k entries was employed during training of a SentencePiece segmentation model. This vocabulary size was determined empirically from the training data.
Alternatively, for the BPE-based systems, we systematically varied the number of BPE merge operations (and thus the vocabulary size) in 10k increments from 30k to 80k. Results in Table 3 show that 40k BPE merge operations perform best across all test sets decoded for this language pair. All subsequent Marian experiments in this work utilize this 40k BPE training corpus.
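For intuition about what the merge count controls, the BPE merge-learning loop (Sennrich et al., 2016) can be sketched from scratch. This toy `learn_bpe` is illustrative only; in practice we used the standard subword toolkits, which add escaping, end-of-word markers, and efficiency tricks omitted here.

```python
import collections
import re


def learn_bpe(words, num_merges):
    """Learn `num_merges` BPE merge operations from a word-frequency dict.

    Returns the ordered list of merges. Fewer merges yield finer subword
    units; more merges approach whole-word tokens -- the granularity
    trade-off swept over in our 30k-80k experiments.
    """
    # represent each word as space-separated symbols
    vocab = {" ".join(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # merge the best pair everywhere it occurs as whole symbols
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges
```

Each learned merge shrinks the average segment count per word, so the merge budget directly trades sequence length against vocabulary size.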

MT Systems
This year, we focused system-building efforts on the Marian, Sockeye, OpenNMT, and Moses toolkits, exploring a variety of parameters, data, and conditions. While most of our experimentation builds on previous years' efforts, we did examine domain adaptation via continued training, including Elastic Weight Consolidation (EWC) (Thompson et al., 2019).

Marian
As with last year's efforts, we train multiple Marian (Junczys-Dowmunt et al., 2018) models with both the University of Edinburgh's "bi-deep" and Google's transformer (Vaswani et al., 2017) architectures. Network hyperparameters are the same as detailed in Gwinnup et al. (2018). We again use newstest2014 as the validation set during training.
Utilizing the best-performing BPE parameters from Section 2.2, we first trained a baseline system with each of the two network architectures, noting the transformer system's better performance of +0.82 BLEU on average across decoded test sets. An additional six distinct transformer models were then independently trained for use in ensemble decoding. We then ensemble-decoded test sets with all eight models.
Marian typically assigns each model used in ensemble decoding a feature weight of 1.0; thus each model contributes equally to the decoding process. Borrowing from our Moses training approach, we instead ran a multi-iteration decode and optimized feature weights using the "Expected Corpus BLEU" (ECB) metric with the Drem optimizer (Erdmann and Gwinnup, 2015). We experimented with newstest2014 and newstest2017 as tuning sets: newstest2017 did not help performance, but newstest2014 improved performance by up to +0.9 BLEU over the non-tuned ensemble.
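The effect of tuned feature weights can be illustrated with a toy log-linear ensemble. This sketch is not Drem or the ECB metric; it is only a hypothetical illustration (our own names throughout) of how non-uniform per-model weights change which token an ensemble prefers.

```python
import math


def ensemble_logprob(token, model_dists, weights):
    """Log-linear combination of per-token model probabilities.

    With all weights at 1.0 each model contributes equally (the default);
    tuned weights let stronger models dominate the ensemble score.
    """
    return sum(w * math.log(dist[token]) for dist, w in zip(model_dists, weights))


def best_token(vocab, model_dists, weights):
    """Pick the highest-scoring token under the weighted ensemble."""
    return max(vocab, key=lambda t: ensemble_logprob(t, model_dists, weights))
```

For example, two models that disagree on a token can flip the ensemble's choice once the more reliable model's weight is raised above 1.0, which is exactly the degree of freedom the tuning step exploits.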
Scores for all the above-mentioned systems are shown in Table 4. The best-performing ensemble (ensemble tune14) was used in system combination.

Sockeye
For our Sockeye (Hieber et al., 2017) systems, we experimented with continued training (Luong and Manning, 2015; Sennrich et al., 2015), a means to specialize a model on a new domain after a period of training on a general domain. One downside of continued training is that the model can adapt too well to the new domain at the expense of performance on the original domain (Freitag and Al-Onaizan, 2016). One method to mitigate this performance drop is to discourage important parameters of the network from changing, via Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017). Thompson et al. (2019) conveniently provide an implementation of this technique in Sockeye.
That work illustrated a use case where the original domain is news articles while the new domain is patent-application text, a marked difference in style and content. Here, we created a news subdomain corpus from the newstest2014 through newstest2017 test sets. The intuition is that more current events are discussed in these test sets than in the remainder of the provided training corpora, allowing better adaptation to new events in the newest test sets (newstest2018 and newstest2019). We first trained a baseline transformer system using the best-performing BPE parameters from Section 2.2, 512-dimension word embeddings, a 6-layer encoder and decoder, 8 attention heads, and label smoothing and transformer attention dropout of 0.1. We then continue-trained a model on the adaptation set described above. We also followed the Sockeye EWC training procedure, producing a model more resilient to the overfitting induced by continued training. Results for these systems are shown in Table 5.
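A minimal sketch of the EWC objective (Kirkpatrick et al., 2017) may clarify the mechanism: the adaptation loss is augmented with a quadratic penalty that anchors each parameter near its pre-adaptation value, weighted by that parameter's (diagonal) Fisher information, so parameters important to the original news domain resist change. This is an illustrative pure-Python toy, not Sockeye's implementation, and all names here are ours.

```python
def ewc_loss(task_loss, params, old_params, fisher, lam):
    """Continued-training loss with an Elastic Weight Consolidation penalty.

    task_loss  -- loss on the in-domain adaptation data
    params     -- current parameter values
    old_params -- parameter values after original-domain training
    fisher     -- diagonal Fisher information (per-parameter importance)
    lam        -- strength of the consolidation penalty
    """
    penalty = 0.5 * lam * sum(
        f * (p - p0) ** 2 for f, p, p0 in zip(fisher, params, old_params)
    )
    return task_loss + penalty
```

A parameter with zero Fisher weight is free to move toward the new domain, while a high-Fisher parameter is effectively pinned, which is why EWC curbs the overfitting seen with plain continued training.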
We see that the baseline Sockeye transformer model performs similarly to the baseline single-model Marian transformer system shown in Table 4. The continued-training (con't train) system predictably overfit newstest2014, since that test set is part of the adaptation set. Likewise, performance on the out-of-domain newstest2018 also dropped as a result of overfitting. The best-performing EWC system (EWC applied with a weight decay of 0.001 and a learning rate of 0.00001) actually improved performance on newstest2018, with less-pronounced overfitting on newstest2014. For the system combination outlined later in Section 4, we decoded test sets with an ensemble of the four highest-scoring model checkpoints from the best EWC training run.

OpenNMT-T
Our first OpenNMT system was trained using the Transformer architecture with the default "TransformerBig" settings as described in Vaswani et al. (2017): 6 layers of 1024 units, 16 attention heads, and dropout rates of 0.3 for layers and 0.1 for attention heads and ReLU activations. Training data for this system utilized the training corpus from our WMT17 Russian-English system (Gwinnup et al., 2017), consisting of provided parallel and backtranslated data. This data was then processed with a joint 32k-word-vocabulary SentencePiece model.

OpenNMT-G
For our second OpenNMT system, we first trained language-specific 32k-word vocabularies using SentencePiece. WMT news test data from all years except 2014 and 2017 were used to train SentencePiece. These data, with the addition of the language-ID-filtered ParaCrawl corpus outlined in Section 2.1, were used for training the system. WMT news test data from 2014 was used for validation. OpenNMT-tf was used to create the system, using the stock "Transformer" model.

Moses
As in previous years, we trained a phrase-based Moses (Koehn et al., 2007) system with the same data as the Marian system outlined in Section 3.1 in order to provide diversity for system combination. This system employed a hierarchical reordering model (Galley and Manning, 2008) and a 5-gram operation sequence model (Durrani et al., 2011). The 5-gram English language model was trained with KenLM on all permissible monolingual English news-crawl data. The same BPE model was applied to both the parallel training data and the language-modeling corpus. System weights were tuned with the Drem optimizer (Erdmann and Gwinnup, 2015) using the "Expected Corpus BLEU" (ECB) metric.

System Combination
Jane system combination (Freitag et al., 2014) was employed to combine outputs from the best systems from each approach outlined above. Individual component system and final combination scores are shown in Table 6 for cased, detokenized BLEU and BEER 2.0 (Stanojević and Sima'an, 2014).

Submission Systems
We submitted the final 5-system combination outlined in Section 4 and the four-checkpoint EWC ensemble detailed in Section 3.2 to the Russian-English portion of the WMT19 news task evaluation. Selected newstest2019 automatic scores from the WMT Evaluation Matrix are shown in Table 7.

Conclusion
We presented a series of improvements to our Russian-English systems, including improved preprocessing and domain adaptation. Clever remixing of older techniques from the phrase-based MT era enabled improvements in ensembled neural decoding. Lastly, we performed system combination to leverage benefits from these new techniques and favorite approaches from previous years.