Generating Diverse Translations via Weighted Fine-tuning and Hypotheses Filtering for the Duolingo STAPLE Task

This paper describes the University of Maryland’s submission to the Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Unlike the standard machine translation task, STAPLE requires generating a set of outputs for a given input sequence, aiming to cover the space of translations produced by language learners. We adapt neural machine translation models to this requirement by (a) generating n-best translation hypotheses from a model fine-tuned on learner translations, oversampled to reflect the distribution of learner responses, and (b) filtering hypotheses using a feature-rich binary classifier that directly optimizes a close approximation of the official evaluation metric. Combination of systems that use these two strategies achieves F1 scores of 53.9% and 52.5% on Vietnamese and Portuguese, respectively ranking 2nd and 4th on the leaderboard.


Introduction
While machine translation (MT) typically produces a single output for each input, scoring and generation for second language learning applications might benefit from systems whose outputs better capture the diversity of translations produced by language learners. The Duolingo Simultaneous Translation And Paraphrase for Language Education (STAPLE) shared task (Mayhew et al., 2020) provides a framework for developing and testing such systems, grounded in real translations produced by English learners into five native languages (Portuguese, Vietnamese, Hungarian, Japanese, Korean). In this task, given an English sentence prompt, systems are asked to produce a set of translations for that prompt, and are scored based on how well their outputs cover human-curated acceptable translations, weighted by the likelihood that an English learner would respond with each translation (Table 1).

Prompt is my explanation clear?
Output minha explicação está clara? 0.267 minha explicação é clara? 0.161 a minha explicação está clara? 0.111 a minha explicação é clara? 0.088 minha explanação está clara? 0.057 está clara minha explicação? 0.044 minha explanação é clara? 0.039 While the multiple translations can be viewed as paraphrases, we propose to address the STAPLE task primarily as a MT task to better understand the strengths and weaknesses of neural MT architectures for generating multiple learner-relevant translations. Given a Transformer model for the language pair of interest, we use beam search to generate n-best translation candidates. However, since n-best lists are known to lack diversity, we propose to generate hypotheses that better match the requirements of the STAPLE task via: 1. Frequency-Aware n-Best Lists: We encourage hypotheses to reflect the diversity and frequency of learner responses by fine-tuning models on STAPLE data, oversampling translation options to reflect learner preferences.

Hypothesis Filtering:
We filter the resulting n-best lists using a binary classifier which identifies good translations that are likely to be produced by a learner.
Controlled experiments and analysis show the benefits of both strategies. Our final submission which includes both techniques achieves F1 scores of 53.9% and 52.5% for en-vi and en-pt respec-tively, reaching a rank of 2 nd and 4 th on the leaderboard, only 2 points below the top scoring system. For completeness, we also submitted systems for the remaining language pairs using Frequency-Aware n-best lists: our system ranked 2 nd for Japanese and 3 rd for Korean and Hungarian.

Background
Unlike in the STAPLE task, recent attempts at generating multiple translations for a single source have targeted output variability along specific stylistic dimensions (Sennrich et al., 2016b;Rabinovich et al., 2016;Agrawal and Carpuat, 2019) or produce diverse outputs without a specific use case (Kikuchi et al., 2016;Shu et al., 2019). The techniques used can be divided in three categories: (a) constrain the decoding process to generate diverse candidates (Li and Jurafsky, 2016;Li et al., 2015;Cho, 2016); (b) optimize via a diversity promoting loss function (Li et al., 2015); (c) expose the model to different translation candidates with side-constraints (Rabinovich et al., 2016;Sennrich et al., 2016a;Agrawal and Carpuat, 2019;Shu et al., 2019) or without (Shen et al., 2019). Since it is unclear what dimensions of variations are captured in the STAPLE translation, we focus instead on improving n-best lists generated by a standard neural MT model.
Source texts with multiple references have mostly been used to evaluate rather than train MT systems (Papineni et al., 2002;Banerjee and Lavie, 2005;Qin and Specia, 2015). Evaluation sets with 4 or 5 references have been converted to singlereference training samples (Zheng et al., 2018) to improve MT training, but reference translations vary in arbitrary ways and often exhibit poor diversity, mostly limited to translationese effects. The STAPLE data presents an opportunity to explore multiple translations generated in a more comprehensive fashion.

Frequency-Aware Hypotheses Generation
While neural MT systems can generate multiple translation candidates per source using beam search, the n-best translations often lack diversity. One issue is that systems are trained on singletranslation training samples. We propose to tailor MT to the STAPLE task by fine-tuning models on LRF-weighted multi-reference samples to obtain more diverse translations and a ranking that better reflect learner preferences.
Given the STAPLE data for a language pair, where the i-th training example, (e i , F i , W i ) includes a source sentence in English, a reference set .., w K i }, we create MT training samples by copying the translation pair (e i , f j i ), w j i × O times. 1 Given model parameters θ, this yields a weighted crossentropy loss:

Hypothesis Filtering as Binary Classification
Even when informed by STAPLE data and LRF scores, n-best lists might include translations that are not in the reference set, due to translation errors or selecting paraphrases that do not match language learners' preferences. We design a binary classifier that further filters the n-best lists by predicting for each hypothesis whether or not it should be included in the final set. This lets us define features based on the complete prompt and hypothesis sequence pairs, while the MT model generates the hypothesis incrementally.
..,f N i )} M 1 represent the n-best list generated via beam search for all the source prompts in the training dataset: e i corresponds to the i-th source prompt andf j i corresponds to the j-th candidate hypothesis extracted via beam search. x i j represents the feature vector extracted from the source (e i ) and j-th candidate hypothesis (f j i ) and y j i is a binary label indicative of whether the candidate hypothesis,f j i , is found in the gold standard data. The classification model f : X → R maps the feature vector to a real value, where, f is a two-layer Neural Network (NN) to enable learning feature combinations.
Features We aim to capture the quality of a source-hypothesis pair using multiple sentencelevel features: • Length features |f |, |e|, |f | |e| , |e| |f | might indicate mismatches between source and target content.
• Word alignment features have proved useful to identify semantic divergences in bitext (Munteanu and Marcu, 2005;Vyas et al., 2018). We use the Forward and Reverse Alignment score, the count of unaligned words for source and target, and the top three largest fertilities for source and target.
• Scores from various MT models as often done when reranking n-best lists (Cherry and Foster, 2012;Neubig et al., 2015;Hassan et al., 2018) including a left-to-right model, a right-to-left model, and a target-to-source model, which provide different views of the example and might better estimate the adequacy of the translation than the original MT model score.
• Target 5-gram language model score to estimate the fluency of the hypothesis.

Loss
We optimize a Soft Macro-F1 objective (Hsieh et al., 2018) function to approximate the official evaluation metric. 2 The true positive (tp), false positive(fp), and true negative (tn) rate for each source prompt e i are estimated as: Then, the precision, recall, F1 for a source e i , and the loss are defined as:

STAPLE Data
The shared task provides English source prompts, associated with high-coverage sets Figure 1: Average of the top-1, top-5, mean and median LRF values across source prompts: the LRF distribution is more uniform for languages with many more references per prompt (e.g. en-ja).
of plausible translations in five other languages. These translations are weighted and ranked according to LRF scores indicating which translations are more likely. About 3000 prompts per language are available (see Table 2 for details) and the number of reference translations available per prompt vary across languages (mean: 174.2, variance: 116). Figure 1 illustrates the differences in LRF distributions across languages: for languages with many references per prompt (e.g. en-ja, en-ko), the gap between the top-1 and the mean LRF value is small, indicating an almost uniform distribution. Average top-1 LRF scores also vary across languages (e.g en-vi: 0.25, en-ja: 0.05) depending upon the number of references available per prompt.
For system development, we divide the STAPLE dataset into train, development and test datasets using 72%, 8%, and 20% of source prompts respectively. We refer to these subsets as STAPLE train, internal dev and internal test. Note that the last two differ from the official blind development and test sets available to participants on codalab.
Other Bitexts We use bitext from OpenSubtitles (Tiedemann, 2012) and Tatoeba (Tiedemann, 2012) as described in Table 3. The Tatoeba corpus provides multiple reference translations for some sources (with 2 translation per source on average), but unlike in the STAPLE data, these translations are not weighted by frequency of usage.
Preprocessing All datasets are pre-processed using Moses tools for normalization, tokenization and lowercasing. We further segment tokens into subwords using a joint source-target Byte Pair Encoding (Sennrich et al., 2016c)    operations. For Japanese, we use kytea 3 toolkit for word tokenization.

MT configurations
Model Architecture We use the Transformer model implemented in the Sockeye toolkit 4 as a baseline MT system. Both encoder and decoder are 6-layer Transformer models with model size of 1, 024, feed-forward network size of 4, 096, and 16 attention heads. We adopt label smoothing and weight tying. We tie the output weight matrix with the target embeddings. We use Adam optimizer with initial learning rate of 0.0002.

Experimental Conditions
We train several models with the above configuration: • OpenSubs a baseline model trained and validated on the OpenSubtitles bitext.
• Unweighted builds on the baseline by finetuning on multi-reference samples including the Tatoeba bitext and STAPLE train.
We create one training sample per sourcereference pair, and the resulting samples are not weighted. We use the internal dev set (1best reference only) as a validation set.
We generate n-best list of translations for various models by running beam search with a beam size corresponding to the desired n.

Filtering configurations
Classifier The 2-layer feed-forward NN has 5 hidden units and 2 output units. It is trained with the Adam optimizer with an initial learning rate of 0.001 and runs for 2000 epochs on the internal dev set. The best model is selected based on internal test set performance. We consider two losses: the soft macro F1 loss which approximates the official evaluation metric ( § 3.2) and the standard crossentropy loss as a baseline.

Reranking Baseline
We compare our NN based classifer with a standard MT n-best list reranker trained on the internal dev set. We use the n-best batch MIRA ranker (Cherry and Foster, 2012) included in Moses. A threshold to filter candidates in the reranked list is selected by maximizing the Weighted Macro F1 on the internal dev dataset.

Features
We use eflomal 5 trained on the Opensubtitles dataset to obtain word alignment between source and translation hypotheses. The language model is trained with the kenlm (Heafield, 2011) toolkit with default hyper-parameters 6 on the target side of the Opensubtitles and the STAPLE dataset. The Right-to-left and Target-to-source MT models were trained on OpenSubtitles (same configuration as in § 4.2).

Evaluation
We evaluate the lowercased detokenized output of the systems on our internal test dataset using: Weighted Macro F1 This is the official scoring metric which quantifies how the set of system outputs covers the human-curated acceptable translations, weighted by the LRF of each translation. It is defined as the harmonic mean of unweighted precision (P) and weighted recall (WR) calculated for each prompt e i , and averaged over all the prompts in the corpus. Specifically, using the same notation as introduced in § 3.1, for each translation T i generated by the MT model, we have: The weighted Macro F1 (WMF1) is then given by: We also report the translation quality of the 1-best neural MT output compared against the highest LRF reference translation using the standard BLEU metric (Papineni et al., 2002). Table 4 summarizes the evaluation of n-best lists obtained with our neural MT systems.

Impact of Frequency-Aware Fine-Tuning
Baselines We confirm that the neural MT configuration is sound by comparing our neural MT baseline to the provided AWS system. Our baseline ("OpenSubs") improves the BLEU@1 score by 2 points for en-pt, and remains 6 points lower for envi, as can be expected given the smaller size of the OpenSubtitles training set. However, the "Open-Subs" n-best lists improve over the AWS baseline according to the official task metric (WMF1), establishing that this system is a good starting point for fine-tuning.

Fine-Tuning
The Frequency-Aware n-best hypotheses consistently yield the best Weighted Recall and Weighted Macro-F1 scores for all languages. The improvement in recall and therefore F1 score is largest for en-ja and en-ko which have larger translation reference sets (Table 4). Frequency-Aware oversampling also improves precision over the Unweighted model for all but one language (en-pt). The impact on the auxiliary BLEU@1 metric is less consistent: the Frequency-Aware system achieves the best BLEU@1 in 3 out of 5 languages, but outperforms the OpenSubs baseline in 4 out of 5. BLEU@1 drops when finetuning on all the samples without weighting (Unweighted) which we attribute to the increased uncertainty during training as the model is exposed to many different translations for the same source English text. Overall, these results show the benefits of finetuning on task-relevant data and shows that incorporating LRF weights via oversampling improves the ranking of n-best hypotheses. This is further illustrated in Table 5, which shows the top 10 Vietnamese translations for two randomly sampled source prompts: the Frequency-Aware n-best list yields Weighted Recall of 81% at a Precision of 60% and 76% at a Precision of 100% for the two source prompts respectively, illustrating that the model generates high-quality candidates that cover reference translations well, but not perfectly.

N-Best List Quality
How well do n-best translations cover the space of reference learner translations? Figure 2 shows the impact of increasing the decoding beam (and resulting n-best list size) from 10 to 500 for the Frequency-Aware model. For en-pt, while weighted recall increases up to 66%, the drop in precision hurts the weighted F1 score. The oracle F1 score, which represents the Weighted Macro F1 at a Precision of 100%, also increases gradually, reaching a score of 76%. This suggests that the raw n-best lists contain many useful translation candidates but need to be filtered down to better match translations preferred by language learners.

Impact of Hypothesis Filtering
Due to time constraints, we explore the impact of hypothesis filtering only for en-pt and en-vi.
Filtering consistently improves Precision and Weighted Macro F1 (  as the loss leads to a better balance between Precision and Weighted Recall than cross-entropy. The classifier outperforms the MIRA reranker. Since the reranker is trained to maximize BLEU@1, it tends to prefer candidates that are lexically similar to the top reference translation and misses some of the more diverse learner translations. This confirms the benefits of framing the selection of candidate hypothesis as binary classification. Ablation Experiments show that the MT scores are the most useful of the features used, as they capture not only the generation probability of a candidate hypothesis but estimate adequacy via the Target-to-source neural MT model (Table 8).
Length features help precision but not recall, while the alignment and language model scores have little impact overall. This suggests that the classifier could benefit from improved feature design and selection in future work.

Analysis of Translation Diversity
How diverse are the translations returned by various system configurations? Following , we quantify diversity using the entropy of k-gram distributions within a translation set: where V is the set of all k-grams that appear in the translation set, and F (w) denotes the frequency of w in the translations. The higher the Ent-k score, the greater the diversity. Fine-tuned models improve the diversity of 10best lists compared to the "OpenSubs" baseline for both en-vi and en-pt (Table 9). Overall filtering bridges 40% and 25% of the gap between baseline and reference learner translations for en-pt and envi respectively.

System Combinations
A manual examination of translation sets returned by different models suggest that they make complementary errors. We therefore consider combining system outputs by taking the union of the set of translations they return. We evaluate the following combinations (Table 7): C1 Frequency-aware (10-best) + Unweighted (10best)  C2 Frequency-aware (10-best) + Frequencyaware (filtered 50-best) C3 Unweighted (10-best) + Unweighted (filtered 50-best) C4 Union of all of the above.
For en-pt and en-vi, it helps to combine higher precision unfiltered 10-best lists, and higher recall filtered 50-best lists. For en-pt, the union of all outputs (C4) performs best overall. Recall increases when combining the Frequency-Aware and the Unweighted model (C1) compared to individual lists (Unweighted: +1.6, Frequency-Aware: +2) without compromising Precision. Similar trends are observed when adding the filtered 50-best list to unfiltered 10-best lists (C2: +2.2, C3: +4.8). For en-vi, a different combination (C2) yields the best result, perhaps due to the smaller set of reference translations per source prompt (en-vi: 56, en-pt: 131) and high Precision of the "Unweighted" model for en-pt.

Submitted Systems
We tested our systems on the official blind development set to select the best performing models for final evaluation on the test set. For Portuguese and Vietnamese, our official submissions include frequency-aware hypothesis generation and hypothesis filtering: en-vi C2: Frequency-aware (10-best) + Frequencyaware (filtered 50-best)     en-pt C4: Frequency-aware (10-best) + Frequencyaware (filtered 50-best) + Unweighted (10best) + Unweighted (filtered 50-best) We did not build hypothesis filtering models for the other languages, and submitted systems based only on unfiltered models: en-ja Frequency-aware (50-best) + Unweighted (50best) en-hu Frequency-aware (10-best) + Unweighted (10best) en-ko Frequency-aware (50-best) + Unweighted (50best) Table 10 and 11 compares our submissions to baselines, as well as top and median submissions across participants, for all the languages. On our focus languages (en-pt and en-vi), where systems benefitted from both frequency-aware generation and filtering models, our submissions obtain a Weighted Macro F1 score of 0.539 for en-vi and 0.525 for en-pt on the official test set, achieving a rank of 2 nd and 4 th on the leader-board, within 2% of the top performing submission. On the other language pairs, where our submissions did not use any filtering, Weighted Macro F1 outperform the baselines and median submission consistently. Interestingly on the en-ja task, our system ranks second amongst all the submissions despite not using any filtering.

Conclusion
We proposed two strategies to obtain multiple outputs that mimic translations by produced by language learners from a standard neural MT model. Our experiments showed that (1) finetuning MT models using all reference translations and their weight yields more diverse n-best hypotheses that better reflect learner preferences, and (2) filtering these n-best lists using a feature-rich classifier trained to maximize an approximation of the STAPLE evaluation metric yields further improvements. Combinations of systems that use these two strategies approach the top scoring submission in the official evaluation. While these results suggest that some degree of output diversity can be achieved with little change to core neural MT models, oracle scores obtained with unfiltered n-best lists indicate that better modeling the space of learner translations might benefit both candidate generation and the filtering model in future work.