Training and Inference Methods for High-Coverage Neural Machine Translation

In this paper, we introduce a system built for the Duolingo Simultaneous Translation And Paraphrase for Language Education (STAPLE) shared task at the 4th Workshop on Neural Generation and Translation (WNGT 2020). We participated in the English-to-Japanese track with a Transformer model pretrained on the JParaCrawl corpus and fine-tuned in two steps on the JESC corpus and then the (smaller) Duolingo training corpus. First, we find it is essential to deliberately expose the model to higher-quality translations more often during training for optimal translation performance. For inference, encouraging a small amount of diversity with Diverse Beam Search to improve translation coverage yielded a marginal improvement over regular Beam Search. Finally, using an auxiliary filtering model to filter out unlikely candidates from Beam Search improves performance further. We achieve a weighted F1 score of 27.56% on our own test set, outperforming the STAPLE AWS translations baseline score of 4.31%.


Introduction
Currently, state-of-the-art machine translation systems generally produce a single output translation. However, human evaluators of translation tasks will often accept multiple translations as correct. We introduce a neural machine translation (NMT) system that generates high-coverage translation sets for a single given prompt in the source language.
Our system was prepared for the English-to-Japanese track of the Duolingo Simultaneous Translation And Paraphrase for Language Education (STAPLE) shared task (Mayhew et al., 2020) at the 4th Workshop on Neural Generation and Translation (WNGT 2020). The shared task datasets consist of English prompts and a weighted set of target language translations for each prompt. The task requires systems to produce translation sets for given English prompts that are evaluated on weighted F1 score, defined in Appendix A. We have made our code publicly available. We experimented with models trained and fine-tuned on the provided Duolingo English-Japanese prompt-translation data (Mayhew et al., 2020), the JParaCrawl web-crawled corpus (Morishita et al., 2019), as well as the Japanese-English Subtitle Corpus (JESC) (Pryzant et al., 2018). The sizes of each dataset are summarized in Table 1.
Our system uses a Transformer-based (Vaswani et al., 2017) NMT model and we began with weights pretrained on the large JParaCrawl corpus (Morishita et al., 2019). Section 4 describes in detail how the model was pretrained. Our system's NMT model was then obtained by fine-tuning first on the Japanese-English Subtitle Corpus (JESC) (Pryzant et al., 2018) before further fine-tuning on the Duolingo training set (Mayhew et al., 2020). We outline these datasets in more detail in Section 2.
Given the small size of the Duolingo data, this multi-step fine-tuning helped the model generalize and outperformed single-step fine-tuning and no fine-tuning. High-coverage translation bitext data is not easy to mine or create, so we expect that in other settings, the size of such available training data will also be small. Therefore, it is very likely that adopting a multi-step fine-tuning method may be advantageous more generally. The fine-tuning procedure is described in Section 6.
Outputting the entire beam of candidates from 150-width Beam Search, scored on per-token log likelihood, this two-step fine-tuned system produced the translations that we submitted to the shared task leaderboard. It achieved a 25.69% weighted F1 score on the shared task blind development set and 26.0% on the blind test set. After the leaderboard closed, we conducted further experiments and discovered several notable optimizations.
The most effective optimization was using the ground truth weights that indicate variations in translation quality during training. We find that it is essential to deliberately expose the model to higher-quality translations more often during training. Otherwise, overexposure to low-quality translations harms the model's translation performance.
Secondly, Diverse Beam Search with a very small penalty outperformed Beam Search. However, too much diversity begins to introduce minor semantic shifts that deviate from correct translations.
We also explored introducing an auxiliary filtering model for post-processing candidates. Our proposed filtering model is able to refine the candidates generated by the NMT model, which improved the system's performance with respect to the weighted F1 score.
We share our results in Section 7. Our best result was a weighted F1 score of 27.56% on our own test set of 200 prompts randomly selected from the training data.

Duolingo High-coverage Translations
Duolingo provided training, development and test sets (Mayhew et al., 2020). However, the development and test datasets were 'blind' and did not contain ground truth translations, so we did not use these for training or development.
The training set consists of 2,500 English prompts, each of which is paired with a variable number of Japanese translations (Table 1). Duolingo provides weights for each translation, which can be interpreted as a quality score. For our experiments, we randomly split the 2,500 prompts into 2,100-, 200- and 200-prompt training, development and test sets respectively. For the shared task submission, we retrained a model over all 2,500 prompts with our best hyperparameters.

JParaCrawl
As our base model, we use a model pre-trained on the JParaCrawl corpus (Morishita et al., 2019). This corpus contains over 8.7 million sentence pairs that were crawled from the web and then automatically aligned, similar to the European corpora in the ParaCrawl project. Though noisy due to an imperfect alignment method, this is currently the largest publicly-available English-Japanese bitext corpus.

Japanese-English Subtitle Corpus
The Japanese-English Subtitle Corpus (JESC) (Pryzant et al., 2018) is a large parallel training corpus containing 2.8 million pairs of TV and movie subtitles. With an average sentence length of 8, the corpus mostly consists of short sentences, similar to the data present in the Duolingo training corpus. Even though JESC contains some noise, it captures information that is useful for downstream NMT tasks.

Related work
Machine Translation Machine translation (MT) involves finding a target sentence y = y_1, ..., y_m with the maximum probability conditioned on a source sentence x = x_1, ..., x_n, i.e., argmax_y P(y | x).
There are various neural approaches to tackle machine translation. These include utilizing recurrent neural networks (Cho et al., 2014b), convolutional neural networks (Kalchbrenner et al., 2016), attention-based models (Luong et al., 2014; Bahdanau et al., 2015) and transformer networks (Vaswani et al., 2017). Sequence-to-sequence models deal with the task of mapping an input sequence to an output sequence. These were first introduced by Sutskever et al. (2014) and typically use an RNN-based encoder-decoder architecture, where the encoder outputs a fixed-length representation of the input which is fed into the decoder to produce a target translation. RNN- and LSTM-based approaches struggle to handle long sequences and long-range dependencies, since the encoder network is tasked with encoding all relevant information in a fixed-length hidden state vector. Bahdanau et al. (2015) overcome this by utilizing attention, an alignment model that can attend to important parts of the input during translation. Luong et al. (2014) used the attention mechanism to great effect, observing gains of 5.0 BLEU over non-attention-based techniques for NMT.
The Transformer Architecture For our experiments, we used the Transformer architecture proposed by Vaswani et al. (2017). It is a self-attention based model that produces superior results for machine translation tasks compared to CNN and LSTM based models. By stacking multiple layers of multi-head self-attention blocks, they demonstrate that the attention mechanism by itself is very powerful for sequence encoding and decoding. Recently, Transformer-based models that are pre-trained on large-scale datasets have produced superior performance on various Natural Language Processing (NLP) tasks (Rajpurkar et al., 2016; Talmor and Berant, 2019; Mayhew et al., 2019). In Section 4 we further describe the Transformer architecture and our pretraining procedure.
Domain Adaptation Domain adaptation involves making use of out-of-domain data in situations where high quality in-domain data are scarce. This fine tuning approach has been shown to be effective for NMT (Luong and Manning, 2015;Sennrich et al., 2015;Freitag and Al-Onaizan, 2016). Morishita et al. (2019) show that pre-training with JParaCrawl vastly improves in-domain performance for English-Japanese translations. We make use of these ideas in our multi-step fine-tuning experiments.
Inference with Beam Search Beam Search is an approximate search algorithm for finding high-likelihood sequences from sequential decoders. At every time step, the top k partial hypotheses are kept and the rest are discarded. A common issue with beam search is that it generates similar outputs that differ only by a few words or minor morphological variations (Li and Jurafsky, 2016). Vijayakumar et al. (2016) propose Diverse Beam Search, a method that reduces redundancy during decoding in NMT models to generate a wider range of candidate outputs. This is achieved by splitting the beam width into evenly-sized groups and adding a penalty term for the presence of similar candidates across groups. The authors find most success with the Hamming Diversity penalty term, which penalizes the selection of a token used in previous groups proportionally to the number of times it was selected before. We detail our experiments using both search strategies in Section 6.
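To make the penalty concrete, here is a minimal sketch of the Hamming diversity term operating on toy per-token scores (the function name and dict-based interface are ours for exposition; a real decoder works over batched tensors of subword ids):

```python
from collections import Counter

def hamming_penalized_scores(logprobs, prev_group_tokens, strength):
    """Penalize each candidate token's log probability in proportion to
    how many times earlier beam groups already selected that token at
    this time step (the Hamming Diversity penalty)."""
    counts = Counter(prev_group_tokens)
    return {tok: lp - strength * counts[tok] for tok, lp in logprobs.items()}
```

With strength 0 this reduces to ordinary group-wise beam search; larger values push later groups toward tokens unused by earlier groups.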
Post-processing in NLP For tasks that require sets of outputs rather than single outputs, post-processing or reranking methods are often used as a downstream step after a model generates an initial set. They have proven to be useful techniques for various NLP tasks, such as Question Answering (Kratzwald et al., 2019), Named Entity Recognition (Yang et al., 2017) and Neural Summarization (Cao et al., 2018). The basic methodology is to first generate an initial candidate set and then rerank or prune these candidates to produce the final set. This setup reduces reliance on the generator by introducing an auxiliary discriminator to refine its outputs. Section 6 describes our experiments with pruning or filtering Beam Search candidates during decoding.

Pretrained Base Model
As our base model, we used a model pretrained by Morishita et al. (2019) on the JParaCrawl data using the fairseq framework (Ott et al., 2019). Morishita et al. (2019) preprocessed the JParaCrawl English and Japanese text using sentencepiece (Kudo and Richardson, 2018) to obtain 32,000-token vocabularies on both the English and Japanese sides. Architecture The pretrained model follows the Transformer 'base' architecture (Vaswani et al., 2017), with a dropout probability of 0.3 (Srivastava et al., 2014).

Data Preprocessing
Transformer is a multi-layer self-attention model. Both its encoder and its decoder contain multiple similar sub-modules which include a multi-head attention layer (MultiHead) and a position-wise feed-forward network (FFN):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

Here, Q, K, V are the matrix representations of the query, key, and value respectively. W and b denote the weights and biases of the linear layers. d_k denotes the dimension of the key matrix.
Learning rate scheduling The learning rate schedule adopted for the pretrained model was the so-called 'Noam' schedule (Vaswani et al., 2017). This schedule linearly increases the learning rate for 4,000 'warm-up' steps from a starting learning rate of 10^-7 to the target learning rate of 10^-3, then decreases it from that point proportionally to the inverse square root of the step number.
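The schedule can be written as a short function (a sketch using the constants quoted above; the name `noam_lr` is ours, not a fairseq API):

```python
def noam_lr(step, warmup=4000, lr_start=1e-7, lr_target=1e-3):
    """'Noam' schedule: linear warm-up from lr_start to lr_target over
    `warmup` steps, then inverse-square-root decay matched to lr_target
    at the warm-up boundary."""
    if step < warmup:
        # linear interpolation during warm-up
        return lr_start + (lr_target - lr_start) * step / warmup
    # decay proportionally to 1/sqrt(step)
    return lr_target * (warmup / step) ** 0.5
```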

Filtering Model
Apart from the NMT model, we additionally introduce a neural filtering model to post-process the NMT model's candidates. Instead of designing a model that assigns a real-valued score to each candidate, we simplify the task by formulating it as a binary classification problem. Namely, the filtering model is trained to classify a given candidate sentence as a valid sample (in the gold-standard list) or an invalid sample. The intuition is that the gold-standard candidate list contains a small number of high-quality sentences (with larger weights) and a large number of lower-quality sentences. Thus it is more important to distinguish hits from misses than to distinguish high-quality hits from low-quality hits.
To construct the dataset for the filtering model, we augmented the Duolingo dataset with the results of the NMT model. Specifically, we labeled result sentences that appear in the gold-standard list as True and labeled the others as False.
As for the model architecture, we encode the source sentence and the candidate sentence separately with a one-layer bidirectional LSTM. The encoding is the concatenation of the hidden vectors in both directions after complete traversal of the sequence, along with a (learned) positional embedding vector. This embedding encodes the position of the candidate sentence in the candidate list generated by the NMT model, which is sorted in descending score order. Lastly, we use a multilayer perceptron (MLP) to classify the concatenated vector.
The model can be summarized as

p_i = MLP([enc(s); enc(c_i); v_i])

Here, s denotes the source sentence, c_i denotes the i-th candidate, enc(·) denotes the BiLSTM encoding, v_i denotes the positional encoding, and p_i denotes the predicted likelihood. The filtering model is optimized with binary cross-entropy loss.

Multi-step Fine-tuning
We experiment with several different fine-tuning scenarios, each time evaluating the models using the weighted F1 metric on our 200-prompt Duolingo test set. First, as a baseline, we directly evaluate the JParaCrawl pretrained model without fine-tuning. Then we evaluate the performance of models fine-tuned on either JESC or on all English-Japanese pairs in our 2,100-prompt Duolingo training set. Finally, we experiment with first fine-tuning on the JESC data and then on the Duolingo training set.
Before training, we preprocessed the JESC and Duolingo data using the same 32,000-token English and Japanese sentencepiece models as Morishita et al. (2019) used on the JParaCrawl data.
Training procedure We adopted the same optimizer settings as Morishita et al. (2019) used for the pretrained model, described in Section 4. Using mini-batches of up to 5,000 tokens, we made an update step every 16 mini-batches with mixed precision computation for increased training speed (Micikevicius et al., 2018). While the pretrained model was trained for 24,000 steps, each time we fine-tuned the model, we did so for 2,000 steps, continuing the inverse square root learning rate schedule from the pretraining. We saved the model parameters every 100 steps, and for each fine-tuning experiment, we averaged the last eight parameter checkpoints to obtain our final model weights. For the model with two-step fine-tuning, we use the averaged checkpoint from the JESC fine-tuning experiment as the starting point for further fine-tuning on the Duolingo dataset.

Decoding Strategies
For producing multiple translations for each prompt, we output the entire beam width of candidates from the Beam Search or Diverse Beam Search (Vijayakumar et al., 2016) algorithms. Our motivation for experimenting with using Diverse Beam Search is to improve the coverage of our translation sets. In all our experiments, we capped the generated sequence length at 200 tokens.
Beam Search scoring Scoring Beam Search candidates by sequence log likelihood (or likelihood) results in a well-known length bias towards shorter sequences, with the bias worsening for wider beams (Murray and Chiang, 2018). To address this, we scored beam candidates based on the mean log likelihood per token (Cho et al., 2014a). Further work could involve more sophisticated adjustments for length bias and a coverage penalty over the source prompt (Wu et al., 2016).
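A sketch of this length-normalized scoring (assuming the decoder exposes per-token log likelihoods; the function name is ours):

```python
import math

def per_token_score(token_logprobs):
    """Rank a beam candidate by its mean log likelihood per token rather
    than the summed log likelihood, counteracting the bias toward
    shorter sequences."""
    return sum(token_logprobs) / len(token_logprobs)

# Under summed log likelihood the longer hypothesis always scores lower;
# under the per-token mean, equally confident hypotheses tie:
short = [math.log(0.5)] * 2
long_ = [math.log(0.5)] * 4
assert per_token_score(short) == per_token_score(long_)
```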

Training Data Augmentation
Aligning data distributions The ground truth weights of the Duolingo reference translations invariably follow skewed distributions, with long tails of low weight translations (Figure 1). Consequently, one drawback of training with all English-Japanese pairs in the Duolingo data is that each pair is essentially provided to the model with equal weight. In other words, the distribution over reference translations at training time is uniform, whereas the distribution when evaluating weighted F1 score is skewed.
To address this, we sampled the training data such that the model was trained on prompts with equal probability but for each prompt, reference translations were sampled according to the distribution given by the ground truth weights. In effect, this aligns the distribution over reference translations during training time and evaluation time.
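The per-prompt sampling step amounts to drawing references in proportion to their ground-truth weights; a minimal sketch (function name ours):

```python
import random

def sample_reference(translations, weights, rng=None):
    """Draw one reference translation for a prompt with probability
    proportional to its ground-truth weight, aligning the training-time
    distribution over references with the weighted evaluation
    distribution."""
    rng = rng or random.Random()
    return rng.choices(translations, weights=weights, k=1)[0]
```

Prompts themselves are still drawn uniformly; only the choice of reference within each prompt is weighted.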
Loss smoothing to improve coverage Aside from helping NMT models generalize, Müller et al. (2019) show that use of loss smoothing also better calibrates NMT models, preventing them from becoming over-confident. To encourage our NMT model to produce high-coverage translations, we hypothesize that increasing loss smoothing to decrease the model's confidence will improve its performance in producing a wider variety of correct translation candidates.

Filtering Model
Since our filtering model is trained on the results of the NMT model, we trained two filtering models with two different decoding strategies for the NMT model, namely Regular and Diverse Beam Search, with beam widths set such that approximately 100 unique candidates are output for each prompt. The NMT model is trained with the best hyperparameters we found with the weighted sampling technique. We use the same train/dev/test splits as for the NMT model and select the checkpoint with the best classification accuracy on the development set. We used the Adam optimizer with an initial learning rate of 0.0001 and halved the learning rate when the validation accuracy plateaued for 2 epochs. The word embedding dimension, positional embedding dimension, and the hidden dimensions of the LSTM and MLP are all set to 128. The dropout rate was 0.2. The post-processing procedure involved pruning all candidates with predicted likelihood less than 0.5.
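The pruning step itself is a simple threshold over the filtering model's predictions (a sketch; the function name is ours):

```python
def prune_candidates(candidates, predicted_probs, threshold=0.5):
    """Drop every beam candidate whose predicted probability of being a
    valid translation falls below the threshold."""
    return [c for c, p in zip(candidates, predicted_probs) if p >= threshold]
```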

Results
We conducted our experiments sequentially and generally used the best results so far as a baseline for subsequent experiments.

Multi-step Fine-tuning Results
Our best performing model was the one trained using multi-step fine-tuning, as shown in Table 2. The performance of this model was superior to the other fine-tuning settings on every metric, suggesting this result was not simply a matter of imbalance between precision and recall. This result provides strong evidence that the first fine-tuning step on the JESC data helped the model generalize to the Duolingo test set. In contrast, the model only fine-tuned on the Duolingo training set may not have generalized as well due to the training set's small size.
In order to balance precision and (weighted) recall appropriately to maximize the weighted F1 metric, we experimented with tuning the number of Beam Search candidates to output and found that 100 was optimal (Table 3). Note that the number of unique candidates returned can be fewer than the beam width as Beam Search searches over sequences of subword tokens and sometimes detokenization results in duplicates.
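Because detokenization can map distinct subword sequences to the same surface string, the candidate set is deduplicated in score order before truncation; a minimal sketch (names ours):

```python
def unique_candidates(detokenized, limit=100):
    """Keep the first occurrence of each detokenized string, preserving
    the descending-score order of the beam, up to `limit` outputs."""
    seen, out = set(), []
    for cand in detokenized:
        if cand not in seen:
            seen.add(cand)
            out.append(cand)
        if len(out) == limit:
            break
    return out
```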

Diverse Beam Search Results
Our experiments with Diverse Beam Search show that using 3 beam groups with a very low Hamming diversity penalty can result in marginal performance improvement (Table 4). The algorithm evenly divides the total beam width between the groups, and although it penalizes duplicate sequences, high-scoring candidates are still often duplicated across groups. As such, we varied the total beam widths so that the mean number of unique candidates per prompt was approximately 100. We conclude that encouraging a small amount of diversity can allow the model to capture a wider range of variations without sacrificing too much precision.
We found that performance deteriorates when increasing the diversity penalty or the number of groups further. These results suggest that standard beam search by itself is relatively good at producing high-coverage translations, and that acceptable variations of translations are homogeneous rather than diverse. To illustrate, Table 5 contains some examples of error candidates produced by Diverse Beam Search. Even though they would back-translate to the English prompt correctly, they nevertheless introduce a minor semantic variation that makes them unacceptable translations.

Training Data Augmentation Results
Sampling training data according to the ground truth weights meaningfully improves performance, as shown in Table 6. Our previous best weighted F1 score using Diverse Beam Search was 26.29%, and this improved to 27.21%. Moreover, evaluating the model on the standard machine translation metric of BLEU-4 score between the single best candidates and the single best ground truth translations, we observe a remarkable increase in BLEU score if weighted sampling is used during training. From this result, we conclude that unweighted sampling of training data overexposes the model to poorer translations, which significantly reduces the model's effectiveness as a general-purpose NMT model.
As for loss smoothing, contrary to our hypothesis, increasing the loss smoothing rate was detrimental; in fact, decreasing the rate from 0.1 to 0.05 even improved the weighted F1 score slightly from 27.21% to 27.43%. This suggests that the effect of loss smoothing on the high-coverage translation task does not necessarily differ from its effect on the usual machine translation task.

Table 7 shows the results of the filtering algorithm. The filtering model improves the weighted F1 score with both Diverse and Regular Beam Search, especially with Regular Beam Search. This improvement results from a larger gain in precision from filtering than the loss in recall. One thing to note is that our filtering model suffers from over-fitting. For example, with Regular Beam Search, our filtering model improves the weighted F1 score by 0.43% on the test set (Table 7). However, using the same technique on the training set results in an improvement of 6.25%, from 56.78% to 63.03%. This may result from the limited size of the Duolingo dataset, and the fact that over-fitting introduced by the NMT model is amplified since the filtering model is trained on the results of the NMT model.

Conclusions and Future Work
Our machine translation system produces high-coverage sets of target language translations from single source language prompts.
We used multi-step fine-tuning to train a robust NMT model. This involved first training or fine-tuning a model on a large bitext dataset, then fine-tuning on the bitext dataset with high-coverage sets of target language translations, which is likely to be small. In our experiments, we find that fine-tuning a pretrained model first on a corpus similar to our intended domain and then fine-tuning further on our smaller in-domain dataset produced the best results.
During training, we find that if the ground truth translations come with weights that indicate variations in their quality or likelihood, it is essential to expose the model to higher-quality translations more often during training. One way to do this is to sample the training data with probabilities commensurate with the ground truth weights. Doing so prevents overexposure to low-quality translations that would ultimately harm the model's translation performance.
For decoding, we find that Beam Search scored on per-token log likelihood finds very good translation candidates on its own. Nevertheless, using Diverse Beam Search with a very small penalty instead improves coverage.
We observed a further performance boost from post-processing the translation candidates. This was achieved by training an auxiliary filtering model on the results of the NMT model to prune unlikely candidates as a final step.
One idea for future work is to directly optimize the weighted F1 score during training using reinforcement learning. As the weighted F1 score is not a differentiable function, it cannot be optimized directly using maximum likelihood estimation. Instead, one may use policy gradients under a reinforcement learning paradigm to do so.

A Appendices
To evaluate the results, the weighted macro F1 (Equation 8) with respect to the accepted translations is the metric of interest. This is the average weighted F1 score (Equation 12) over all prompts s in the corpus, where weighted F1 is calculated with (unweighted) precision and weighted recall.
Weighted Macro F1 = (1 / |S|) Σ_{s ∈ S} Weighted F1(s)    (8)

Calculating the weighted recall requires the use of weights included in the dataset. These weights are associated with each human-curated acceptable translation and represent the likelihood that an English learner would respond with that translation.
For each prompt s, let G_s denote the set of accepted (gold) translations with weights w_t, and let R_s denote the set of translations returned by the system. The weighted true positives (WTP) and weighted false negatives (WFN) are:

WTP_s = Σ_{t ∈ R_s ∩ G_s} w_t    (9)

WFN_s = Σ_{t ∈ G_s \ R_s} w_t    (10)

With these, the weighted recall for each s can be calculated as follows:

Weighted Recall(s) = WTP_s / (WTP_s + WFN_s)    (11)

Precision is calculated in the usual way, so the weighted F1 score, Weighted F1(s), for a particular input s is given by

Weighted F1(s) = 2 · Precision(s) · Weighted Recall(s) / (Precision(s) + Weighted Recall(s))    (12)
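For reference, the metric above can be computed as follows (a sketch assuming gold translations are given as a string-to-weight mapping; names ours):

```python
def weighted_f1(returned, gold_weights):
    """Weighted F1 for a single prompt: unweighted precision over the
    returned candidates, weighted recall over the gold translations."""
    hits = [c for c in returned if c in gold_weights]
    precision = len(hits) / len(returned) if returned else 0.0
    # weighted recall: weight mass of hits over total gold weight mass
    recall = sum(gold_weights[c] for c in hits) / sum(gold_weights.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_macro_f1(per_prompt_scores):
    """Average the per-prompt weighted F1 scores over the corpus."""
    return sum(per_prompt_scores) / len(per_prompt_scores)
```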