JHU 2019 Robustness Task System Description

We describe the JHU submissions to the French–English, Japanese–English, and English–Japanese Robustness Task at WMT 2019. Our goal was to evaluate the performance of baseline systems on both the official noisy test set and on news data, in order to ensure that performance gains on the former did not come at the expense of general-domain performance. To this end, we built straightforward 6-layer Transformer models and experimented with a handful of variables, including subword processing (FR→EN) and hyperparameter settings (JA↔EN). As expected, our systems performed reasonably.


Introduction
The JHU team submitted three systems to the WMT19 Robustness Task: French-English, Japanese-English, and English-Japanese. Our goal was to evaluate the performance of reasonable state-of-the-art systems against both the robustness test set and more standard "general domain" test sets. We believe this is an important component of evaluating for actual robustness: in this way, we ensure that performance gains on robustness data are not purchased at the expense of general-domain performance. Our systems used no monolingual data and relatively straightforward state-of-the-art techniques, and achieved roughly average performance.
French-English Systems

Training Data
We constrained our data use to the officially supplied data, comprising the WMT15 English-French parallel data (Bojar et al., 2015). For French, we experimented with three data settings:
• all of Europarl and News Commentary;
• the best million lines each of CommonCrawl, Gigaword, and the UN corpus; and
• the MTNT training data.
Data sizes are indicated in Table 1. To filter the data, we applied dual cross-entropy filtering (Junczys-Dowmunt, 2018). We trained two smaller 4-layer Transformer models, one each for EN-FR and FR-EN, and used them to score the data according to the formula

score = exp(−(|s1 − s2| + (s1 + s2)/2)),

where s1 is the score (a negative logprob) from the forward FR-EN model and s2 the score from the reverse EN-FR model. We then deduplicated this data, sorted it by score, and took a random sample of one million lines from the set of all sentence pairs with a score greater than 0.1. For all but FR-EN Gigaword, what remained was well less than a million lines. We filtered both because prior work has indicated its utility and to make our training data sizes more manageable; we therefore did not compare against a model trained on all of the filtered data.
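The filtering score of Junczys-Dowmunt (2018) rewards sentence pairs that both models deem likely and on which the two models agree. A minimal sketch of the scoring and sampling procedure, assuming s1 and s2 are length-normalized negative log-probabilities; the helper names are hypothetical:

```python
import math
import random

def dual_xent_score(s1, s2):
    """Dual cross-entropy score (after Junczys-Dowmunt, 2018).

    s1, s2: length-normalized negative log-probabilities of a sentence pair
    under the forward (FR-EN) and reverse (EN-FR) models.  Disagreement
    between the models is penalized; exponentiating maps the score into
    (0, 1], with higher meaning better.
    """
    return math.exp(-(abs(s1 - s2) + (s1 + s2) / 2.0))

def filter_and_sample(scored_pairs, threshold=0.1, sample_size=1_000_000, seed=0):
    """Deduplicate, keep pairs scoring above threshold, sample up to sample_size."""
    unique = {}
    for src, tgt, s1, s2 in scored_pairs:
        unique[(src, tgt)] = dual_xent_score(s1, s2)
    kept = [pair for pair, score in unique.items() if score > threshold]
    random.seed(seed)
    if len(kept) <= sample_size:
        return kept
    return random.sample(kept, sample_size)
```

A perfectly confident, perfectly agreeing pair (s1 = s2 = 0) scores 1.0; unlikely or disagreeing pairs approach 0 and fall below the 0.1 threshold.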
We experimented with two preprocessing regimes. In the first, we applied standard preprocessing techniques from the Moses pipeline (Koehn et al., 2007), followed by subword splitting with BPE (Sennrich et al., 2016) using 32k merge operations. In the second, we used no data preparation at all, instead applying sentencepiece (Kudo and Richardson, 2018) with subword regularization (Kudo, 2018) directly to the raw text. In this latter setting, we varied the size of the learned subword models, experimenting with 8k, 16k, 24k, and 32k.
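BPE learns its merge operations greedily from corpus statistics. The following toy sketch, a simplified version of the Sennrich et al. (2016) procedure rather than the actual subword-nmt or sentencepiece implementation, illustrates what "32k merge operations" means at miniature scale:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy scale).

    Each word is split into characters plus an end-of-word marker; we then
    repeatedly merge the most frequent adjacent symbol pair.
    """
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```

Subword regularization differs in that, at training time, sentencepiece samples among alternative segmentations rather than always applying the single deterministic one.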

Models
We used Sockeye (Hieber et al., 2017), a sequence-to-sequence transduction framework written in Python and based on MXNet. Our models were variations of the Transformer architecture (Vaswani et al., 2017), mostly using the default settings supplied with Sockeye: an embedding and model size of 512, a feed-forward layer size of 2048, 8 attention heads, and three-way tied embeddings. We used batch sizes of 4,096 words, checkpointed every 5,000 updates, and stopped training with the best-perplexity checkpoint when validation perplexity had failed to improve for 10 consecutive checkpoints. The initial learning rate was set to 0.0002, the Sockeye default.
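For reference, a training invocation with these settings might look roughly as follows. This is a sketch with hypothetical file names; the flag names are drawn from Sockeye 1.x releases and may differ in other versions:

```shell
python -m sockeye.train \
    --source train.fr --target train.en \
    --validation-source dev.fr --validation-target dev.en \
    --encoder transformer --decoder transformer \
    --num-layers 6:6 \
    --transformer-model-size 512 --num-embed 512 \
    --transformer-attention-heads 8 \
    --transformer-feed-forward-num-hidden 2048 \
    --weight-tying --weight-tying-type src_trg_softmax \
    --batch-type word --batch-size 4096 \
    --checkpoint-frequency 5000 \
    --max-num-checkpoint-not-improved 10 \
    --initial-learning-rate 0.0002 \
    --output model_dir
```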

Scoring
At test time, we decoded with beam search using a beam of size 12.
We scored with sacreBLEU (Post, 2018), with international tokenization. In the spirit of the robustness task, we measure BLEU not just on the Reddit dataset, but also on the WMT15 newstest dataset, in order to examine how experimental variables behave in both in- and out-of-domain settings. We believe that testing on both in- and out-of-domain data is essential to measuring robustness.

Results & Discussion
Observation 1: Improvements are to be had both from more data and from better (in-domain) data. Adding the large filtered dataset to the 6-layer model improved BLEU more (27.9 → 33.7, +5.8) than adding the MTNT training data (27.9 → 32.9, +5.0), but the gains from both together were even greater (+12).
Observation 2: To ensure that our models did not increase accuracy on the MTNT data at the expense of in-domain data, we report scores on both the WMT and MTNT test sets. In only one situation was there a problem: for the 6-layer Transformer, adding the MTNT data alone (without the large amount of filtered bitext) helped on MTNT18 (+5) but caused a small drop on WMT15 (−0.1).
Observation 3: In all situations, the sentencepiece model (with no other preprocessing) was at least as good as the BPE model (with the Moses preprocessing pipeline). In one situation (adding the filtered data alone), it yielded a gain of 0.8 BLEU over its BPE counterpart.
We further conducted a small experiment varying the sentencepiece model size (Table 3). Larger sentencepiece models were consistently better in this relatively large-data setting.
Our score on the official MTNT2019 blind test set was 40.2.
Japanese-English Systems

Training Data
We trained systems using only the bitext data allowed in the shared task constrained setting:
• the in-domain Reddit dataset, MTNT version 1.1 (Michel and Neubig, 2018), available from http://www.cs.cmu.edu/˜pmichel1/mtnt/ (the data is also available in pre-packaged form from the MTNT website via https://github.com/pmichel31415/mtnt/releases/download/v1.1/clean-data-en-ja.tar.gz, but should not be confused with the MTNT data itself, which is in the Reddit domain).
For preprocessing on the English side, we apply the standard Moses pipeline in the same fashion as for the French-English system. For preprocessing on the Japanese side, we first performed word segmentation with Kytea (Neubig et al., 2011), then ran the English Moses preprocessing pipeline to handle potential code-switched English/Japanese in the data. Finally, we induced BPE subword units with 10k, 30k, and 50k merge operations, independently for each side of the bitexts (JA→EN Train-ALL and EN→JA Train-ALL). Unlike the French-English systems, the Japanese-English systems do not use shared BPE vocabularies and embeddings.
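The preprocessing above can be sketched as the following pipeline. File names are hypothetical, and the Moses script paths and KyTea flags may vary by installation and version:

```shell
MOSES=~/mosesdecoder/scripts

# English side: standard Moses pipeline
cat train.en \
  | $MOSES/tokenizer/normalize-punctuation.perl -l en \
  | $MOSES/tokenizer/tokenizer.perl -l en \
  > train.tok.en

# Japanese side: KyTea segmentation (words only, no POS tags),
# then the English Moses tokenizer to handle code-switched English
kytea -notags < train.ja \
  | $MOSES/tokenizer/tokenizer.perl -l en \
  > train.tok.ja
```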

Models
We use Sockeye Transformer models for both JA→EN and EN→JA directions, similar to our French-English systems; the hyperparameter settings differ, however. We performed random search in the hyperparameter space shown in Table 5. The training process follows a continued-training procedure (cf. Khayrallah et al., 2018): in Stage 1, we train systems from scratch on Train-ALL, a mixed corpus containing both in-domain and out-of-domain bitexts, and perform early stopping on Valid-ALL. For all models, we used batch sizes of 4,096 words, checkpointed every 2,000 updates, and stopped training with the best-perplexity checkpoint when validation perplexity on Valid-ALL had failed to improve for 16 consecutive checkpoints.
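Random search simply draws each configuration independently from the search space. A minimal sketch; the dimensions and values below are illustrative stand-ins, since the actual space appears in Table 5:

```python
import random

# Hypothetical search space; the real ranges are those listed in Table 5.
SPACE = {
    "bpe_merges": [10_000, 30_000, 50_000],
    "num_layers": [2, 4],
    "embed_size": [256, 512],
    "attention_heads": [8, 16],
}

def sample_configs(n, seed=0):
    """Draw n random hyperparameter configurations for Stage 1 training."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]
```

Each sampled configuration is then trained to convergence on Train-ALL and ranked by validation perplexity or BLEU.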
In Stage 2, we fine-tuned the above systems by training on Train-MTNT, performing early stopping on Valid-MTNT. Effectively, we initialize a new model with the Stage 1 model weights, reset the optimizer's learning rate schedule, and train on only in-domain data. To prevent overfitting to the small Train-MTNT bitext, we checkpoint more frequently, saving a checkpoint after every 50 updates, and stop training either when perplexity on Valid-MTNT has failed to improve for 16 consecutive checkpoints or when we reach 30 checkpoints (i.e., 30 × 50 = 1,500 updates of 4,096-word batches).
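The Stage 2 stopping criterion combines a patience rule with a hard checkpoint budget. A minimal sketch of that decision logic, with perplexity values standing in for the checkpoint metrics a toolkit like Sockeye would report:

```python
def continued_training_stop(ppl_stream, patience=16, max_checkpoints=30):
    """Decide when Stage 2 fine-tuning stops.

    ppl_stream yields the Valid-MTNT perplexity once per checkpoint (i.e.,
    every 50 updates).  Training halts when perplexity has not improved for
    `patience` consecutive checkpoints, or after `max_checkpoints`
    checkpoints, whichever comes first.  Returns (best_ppl, n_checkpoints).
    """
    best, since_improved, n = float("inf"), 0, 0
    for ppl in ppl_stream:
        n += 1
        if ppl < best:
            best, since_improved = ppl, 0
        else:
            since_improved += 1
        if since_improved >= patience or n >= max_checkpoints:
            break
    return best, n
```

The hard cap matters: even a run whose perplexity keeps improving is cut off at 1,500 updates to avoid fitting excessively to the small in-domain bitext.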

Scoring
At test time, we decoded with beam search using a beam of size 5. We scored with sacreBLEU (Post, 2018), with international tokenization (signature: BLEU+case.mixed+refs.1+smooth.exp+tok.intl+v1.2.14). Per organizer suggestion, we applied Kytea to the Japanese output prior to scoring. We measure BLEU on both Valid-ALL and Test18-MTNT in order to compare results on mixed and in-domain corpora.
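Concretely, the Japanese scoring pipeline can be sketched as below, with hypothetical file names; the KyTea and sacreBLEU options shown are the usual ones for segmentation-only output and international tokenization, and may vary by version:

```shell
# Re-segment system output and reference with KyTea (words only, no tags)
kytea -notags < hyp.ja > hyp.tok.ja
kytea -notags < ref.ja > ref.tok.ja

# Score with sacreBLEU using international tokenization
cat hyp.tok.ja | sacrebleu --tokenize intl ref.tok.ja
```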

Results & Discussion
The BLEU results for Stage 1 models are shown in Table 5. We performed random search in the hyperparameter space, training approximately 40 models for each language pair. The table is sorted by Test18-MTNT BLEU score and shows the top 5 models (id=a,b,c,d,e for JA→EN; id=z,y,x,w,v for EN→JA) as well as another 5 randomly selected models (id=f,g,h,i,j; id=u,t,s,r,q).
Observation 1: Despite the relatively narrow range of hyperparameter settings, there is a comparatively large range of BLEU scores in the table. For example, in JA→EN, the best Test18-MTNT BLEU is 11.1, 2.7 points better than the worst BLEU (8.4) in the table; there are other, poorer-performing systems not sampled for the table. This suggests that hyperparameter search is important in practice, even for relatively standard hyperparameters.
Additionally, we note that it is difficult to make post-hoc recommendations on the "best" hyperparameter settings, as there are no clear trends in the data. For example, from the top 5 JA→EN models, it appears that 30k BPE merge operations is good, but there is a competitive outlier with 10k BPE (id=c). In the full results (not all shown in the table), most 10k BPE models achieve Test18-MTNT BLEU in the 8–10 range, so it is difficult to explain the strong score of id=c. It also appears that layer=4 is consistently better than layer=2 in the JA→EN results, but the results are more mixed in the EN→JA direction.
Observation 2: There is some correlation between the BLEU scores on Valid-ALL and Test18-MTNT; the system rankings are relatively similar. But we note a few outliers: e.g., the top 5 models in EN→JA perform similarly on Test18-MTNT, but there are noticeable degradations for id=x and id=v on Valid-ALL. Similarly, id=b and id=c perform close to each other on Test18-MTNT but not on Valid-ALL. With the goal of robustness in mind, we think these kinds of BLEU gaps due to domain differences deserve more investigation.
Continued Training: Next, we performed continued training on the top 5 models. The results on Test18-MTNT are shown in Table 6. We observe consistent BLEU gains from these Stage 2 models, of roughly 2 to 3 points across all systems. This reaffirms the surprising effectiveness of a simple procedure such as continued training; we should note, however, that preliminary efforts on English-French did not yield similar gains.
Note that we do not measure Valid-ALL in this case since we now expect the models to be optimized specifically for MTNT; it is likely Valid-ALL scores will degrade due to catastrophic forgetting (Thompson et al., 2019).
Final Submission: For the final official submission, we performed a 4-way ensemble of the Stage 2 continued-training models, id=a,b,d,e for JA→EN and id=z,y,w,v for EN→JA. Note that the ensemble method in Sockeye currently assumes the same vocabulary, so the BPE model must be the same for all members of the ensemble. This is a reasonable assumption, but in the spirit of subword regularization (Kudo, 2018), we think it may be interesting to explore whether ensembles of systems with diverse BPE vocabularies lead to more robust outputs.
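With a shared vocabulary, ensembling amounts to averaging the models' per-token log-probabilities at each decoding step. A toy sketch of that combination rule (not Sockeye's actual implementation), which also shows why mismatched vocabularies break it:

```python
import math

def ensemble_next_token(model_logprobs):
    """Pick the next token by averaging log-probabilities across models.

    model_logprobs: one dict per model, mapping token -> logprob over the
    shared vocabulary.  Averaging logprobs corresponds to the geometric
    mean of the models' probabilities.
    """
    vocab = model_logprobs[0].keys()
    assert all(m.keys() == vocab for m in model_logprobs), \
        "all models must share the same (BPE) vocabulary"
    avg = {tok: sum(m[tok] for m in model_logprobs) / len(model_logprobs)
           for tok in vocab}
    return max(avg, key=avg.get)
```

If the models used different BPE vocabularies, their distributions would be over different token sets and the per-token average would be undefined, which is exactly the constraint noted above.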

Conclusion
We constructed reasonably-scoring systems on three language pairs without excessive effort. Our scores fell into roughly the middle tier among those reported on matrix.statmt.org. Much higher scores could surely be had by adding known techniques to our pipeline, such as backtranslating monolingual data (Sennrich et al., 2016). We also believe that our approach of evaluating on multiple test sets is essential to the robustness task. Without this, the task reduces to domain adaptation, and one has no assurance that high scores on the out-of-domain data do not come at the expense of general-domain performance.