Growing Together: Modeling Human Language Learning With n-Best Multi-Checkpoint Machine Translation

We describe our submission to the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). We view MT models at various training stages (i.e., checkpoints) as human learners at different levels. Hence, we employ an ensemble of multiple checkpoints from the same model to generate translation sequences with various levels of fluency. From each checkpoint of our best model, we sample the n-best sequences (n = 10) with a beam width of 100. We achieve a 37.57 macro F1 with a six-checkpoint model ensemble on the official shared task test data, outperforming a baseline Amazon translation system at 21.30 macro F1 and ultimately demonstrating the utility of our intuitive method.


Introduction
Machine Translation (MT) systems are usually trained to output a single translation. However, many possible translations of a given input text can be acceptable. This situation is common in online language-learning applications such as Duolingo (https://www.duolingo.com/), Babbel (https://www.babbel.com/), and Busuu (https://www.busuu.com/). In applications of this type, learning happens via translation-based activities, while evaluation is performed by comparing learners' responses to a large set of acceptable human translations. Figure 1 shows an example of a typical situation extracted from the Duolingo application.
The main setup of the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE 2020) (Mayhew et al., 2020) is such that one starts with a set of English sentences (prompts) and then generates high-coverage sets of plausible translations in the five target languages: Portuguese, Hungarian, Japanese, Korean, and Vietnamese. For instance, if we want to translate the English (En) sentence "is my explanation clear?" to Portuguese (Pt), all the translated Portuguese sentences illustrated in Table 1 would be acceptable.

Limited training data. One challenge we faced in training a sufficiently effective model is the limited size of the source training data released by the organizers (4,000 source English sentences coupled with 226,466 Portuguese target sentences). We circumvent this limitation by training a model on a large dataset acquired from the OPUS corpus (as described in Section 3), which gives us a powerful MT system to build on (see Section 4.2). We then exploit the STAPLE-provided training data in multiple ways (see Sections 4.3 and 4.4) to extend this primary model and adapt it to the shared task domain.

Paraphrase via MT. In essence, the shared task is a mixture of MT and paraphrasing. This poses a second challenge: there is no paraphrase dataset to train the system on. For this reason, we resort to using outputs from the MT system in place of paraphrases. This requires generating multiple sentences for each source sentence. To meet this need, we generate multiple translation hypotheses (n-best) using a wide beam search (Section 5.1), perform 'round-trip' translations exploiting these multiple outputs (Section 5.2), and employ ensembles of checkpoints (Section 5.3).

Diverse outputs. A third challenge is that the target Portuguese sentences provided for training by the organizers are produced by learners of English at various levels of fluency. This makes some of these Portuguese translations inarticulate (i.e., not quite fluent). MT systems are not usually trained to produce inarticulate translations, and hence we needed a solution that matches the different levels of the language learners who produced the translations. Intuitively, we view MT systems at various training stages (i.e., checkpoints) as learners with various levels of fluency. As such, we employ an ensemble of checkpoints to generate translations matching the different levels of learner fluency (Section 5.3). Ultimately, our contributions lie in alleviating the three challenges listed above.
The remainder of the paper is organized as follows: Section 2 is a brief overview of related work. In Section 3, we describe the data we use for both training and fine-tuning our models. Section 4 presents the proposed MT system. Section 5 describes our different methods. We discuss our results in Section 6 and conclude in Section 7.

Related Work
We focus our overview of related work on the task of paraphrase generation and its intersection with machine translation. Paraphrasing is the task of expressing the same textual unit (e.g., a sentence) in alternative forms, using different words while keeping the original meaning intact. Over the last few years, MT has been the dominant approach to paraphrase generation. For instance, Barzilay and McKeown (2001) and Pang et al. (2003) use multiple translations of the same text to train a paraphrase system. Similarly, Bannard and Callison-Burch (2005) use an MT phrase table to map English sentences to various non-English sentences. More recently, advances in neural machine translation (NMT) have spurred interest in paraphrase generation (Sutskever et al., 2014; Luong and Manning, 2015; Aharoni et al., 2019). For example, Prakash et al. (2016) employ a stacked residual LSTM network to learn a sequence-to-sequence model on paraphrase data. A paraphrase model with adversarial training is presented by Li et al. (2017). Wieting and Gimpel (2017) and Iyyer et al. (2018) propose translation-based paraphrasing systems based on NMT that translate one side of a parallel corpus. Paraphrase generation with pivot NMT is used by Mallinson et al. (2017) and Yu et al. (2018).
Data

To train our models, we use English-Portuguese parallel data from the OPUS collection (Tiedemann, 2012), from which we extract more than 77.7M parallel sentence pairs. The extracted dataset comprises more than 1.5B English tokens and 1.4B Portuguese tokens. More details about the training dataset are given in Table 2.

Pre-Processing
Pre-processing is an important step in building any MT model, as it can significantly affect the end results. We remove punctuation and tokenize all data with the Moses tokenizer (Koehn et al., 2007). We also use joint Byte-Pair Encoding (BPE) with 60K split operations for subword segmentation (Sennrich et al., 2015).
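To make the BPE step concrete, the following is a minimal toy re-implementation of BPE merge learning in Python; it is for illustration only (the actual experiments use joint BPE with 60K merge operations over the full corpus, not this sketch):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word-frequency dict.

    `words` maps space-separated symbol sequences (initially characters,
    with an end-of-word marker) to their corpus frequencies.
    """
    merges = []
    vocab = dict(words)
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single symbol everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {w.replace(' '.join(best), ''.join(best)): f
                 for w, f in vocab.items()}
    return merges
```

Applying the learned merges to new text (the apply step) simply replays the merge list in order on each character sequence.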

Models
In this section, we first describe the architecture of our models. We then explain the different ways we train the models on various subsets of the data.

Architecture
Our models are mainly based on a convolutional neural network (CNN) architecture (Kim, 2014; Gehring et al., 2017) operating over BPE subword units (Sennrich et al., 2015). The architecture is as follows: 20 layers in the encoder and 20 layers in the decoder, multiplicative attention (Luong et al., 2015) in every decoder layer, a kernel width of 3 for both the encoder and the decoder, a hidden size of 512, and embedding sizes of 512 and 256 for the encoder and decoder layers, respectively. We use the Fairseq implementation (Ott et al., 2019).
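Assuming the Fairseq command-line interface, a training invocation approximating the architecture and hyperparameters described here and in the next subsection might look as follows. The data path, save directory, and any flag not stated in the paper (e.g., `--clip-norm`) are our assumptions, not the authors' exact command:

```shell
# Hypothetical fairseq-train call; paths and unlisted flags are assumptions.
fairseq-train data-bin/opus.en-pt \
    --arch fconv \
    --encoder-layers '[(512, 3)] * 20' \
    --decoder-layers '[(512, 3)] * 20' \
    --encoder-embed-dim 512 \
    --decoder-embed-dim 256 \
    --lr 0.25 --clip-norm 0.1 --dropout 0.2 \
    --max-tokens 4000 \
    --save-dir checkpoints/en-pt-basic
```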

Basic En↔Pt Models
We trained two MT models, English-to-Portuguese (En→Pt) and Portuguese-to-English (Pt→En), on 4 V100 GPUs, following the setup described in Ott et al. (2018). For both models, the learning rate was set to 0.25, dropout to 0.2, and the maximum number of tokens per mini-batch to 4,000. We train our models on the 77.7M parallel sentences of the OPUS dataset described in Section 3. Validation is performed on the development data from STAPLE 2020 (Mayhew et al., 2020).

En→Pt Extended Model
We use the training data of the STAPLE 2020 shared task (http://sharedtask.duolingo.com/#data) to create a new En-Pt parallel dataset. More specifically, on the target side we use all the Portuguese gold translations, while duplicating the corresponding English source sentence on the source side. This results in a new training set of 251,442 En-Pt parallel sentence pairs. We refer to this training dataset as STAPLE-TRAIN, or simply S-TRAIN. We then merge OPUS and S-TRAIN to train an En→Pt model from scratch. We refer to this new model as the extended model.
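The expansion of the 1-to-n STAPLE mapping into 1-to-1 parallel pairs can be sketched as follows (a minimal illustration; the function name and input format are ours, not the shared task's):

```python
def build_parallel(prompts_to_translations):
    """Expand a 1-to-n mapping of English prompts to Portuguese gold
    translations into 1-to-1 parallel pairs by duplicating the English
    source once per target translation."""
    pairs = []
    for src, targets in prompts_to_translations.items():
        for tgt in targets:
            pairs.append((src, tgt))
    return pairs
```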

En→Pt Fine-Tuned Model
Fine-tuning with domain-specific data can be an effective strategy when it is desirable to develop systems for a domain of interest (Ott et al., 2018, 2019). Motivated by this, we experiment with using the STAPLE-based S-TRAIN parallel dataset from the previous subsection to fine-tune our En→Pt basic model for 5 epochs. We will refer to the model resulting from this fine-tuning process simply as the fine-tuned model.

Model Deployment Methods
In order to enhance the 1-to-n En-Pt translation, we propose three methods based on the previously discussed MT models (see Section 4): n-best prediction, paraphrasing, and multi-checkpoint translation.

n-Best Prediction
We first use our three MT models (basic, extended, and fine-tuned) with a beam size of 100 to generate translation hypotheses. We then score each hypothesis by its average log-likelihood. Finally, we select the n highest-scoring hypotheses as our output.
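The selection step above can be sketched as follows (an illustration of the scoring rule only; the input format is our assumption, with each hypothesis carrying its per-token log-probabilities from beam search):

```python
def n_best(hypotheses, n):
    """Select the n hypotheses with the highest average log-likelihood.

    `hypotheses` is a list of (tokens, token_logprobs) pairs; averaging
    over tokens normalizes for length so short and long outputs compete
    fairly."""
    scored = [(sum(lps) / len(lps), toks) for toks, lps in hypotheses]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [toks for _, toks in scored[:n]]
```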

Paraphrasing
Paraphrasing is an effective data augmentation method commonly used in MT tasks (Poliak et al., 2018; Iyyer et al., 2018). In order to extend the list of accepted Portuguese translations, we use both our En→Pt and Pt→En models, as follows:

1. Translate the English sentences using the En→Pt model, generating the n-best (n = 10) Portuguese sentences for each English source sentence.
2. Then, use the Pt→En model to get the n′-best English translations (we experiment with n′ = 1, 3, and 5) for each of the 10 Portuguese sentences. At this point, we have 10 × n′ new English sentences (oftentimes with duplicate generations, which we remove). These new sentences represent paraphrases of the original English sentence.
3. After de-duplication, the new English sentences are fed to the En→Pt model to get the 1-best Portuguese translation for each.
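The three steps above can be sketched as a pipeline with pluggable translators (a minimal illustration; `en2pt_nbest` and `pt2en_nbest` are placeholders standing in for the trained MT models, not real APIs):

```python
def round_trip_paraphrase(src, en2pt_nbest, pt2en_nbest, n=10, n_prime=3):
    """Round-trip translation for paraphrase-based augmentation.

    `en2pt_nbest(sentence, k)` and `pt2en_nbest(sentence, k)` each return
    a k-best list of translations."""
    pt_hyps = en2pt_nbest(src, n)                      # step 1: En -> Pt
    en_paras = []
    for pt in pt_hyps:                                 # step 2: Pt -> En
        for en in pt2en_nbest(pt, n_prime):
            if en not in en_paras:                     # de-duplicate
                en_paras.append(en)
    # step 3: translate each English paraphrase back, keeping the 1-best
    return [en2pt_nbest(en, 1)[0] for en in en_paras]
```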

Multi-Checkpoint Translation
Our third method is based on saving the model at given epochs (checkpoints) during training. We use the last m checkpoints to generate n-best translation hypotheses (in the same way as in our n-best prediction method). We then de-duplicate the combined outputs of all m checkpoints and use them in evaluation. We now describe our evaluation.
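The checkpoint-merging step can be sketched as follows (an illustration only; each checkpoint is modeled as a callable returning an n-best list, which is our simplification of loading and decoding with a saved model):

```python
def multi_checkpoint_translate(src, checkpoints, n):
    """Generate n-best hypotheses from each saved checkpoint and merge
    them, dropping duplicates while preserving first-seen order.

    Earlier checkpoints contribute less fluent outputs, later ones more
    fluent outputs, mimicking learners at different levels."""
    merged = []
    for ckpt in checkpoints:
        for hyp in ckpt(src, n):
            if hyp not in merged:
                merged.append(hyp)
    return merged
```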

Evaluation
In order to evaluate our methods, we carry out a number of experiments. First, we consider the performance of each proposed method on the official training and development datasets of STAPLE (Mayhew et al., 2020). Our models were ultimately evaluated on the shared task test data. We now describe the STAPLE evaluation metrics and baselines as provided by the organizers, before reporting our results on the training, development, and test sets.

Evaluation Metrics & Baselines
Weights of translations. We note that each Portuguese translated sentence has a weight, as provided in the gold dataset. The weights of translations correspond to user (learner) response rates and are used primarily for scoring. The STAPLE 2020 shared task data takes the format illustrated in Table 3.
Metrics. Performance of MT systems in the shared task is quantified based on how well a model can return all human-curated acceptable translations, weighted by the likelihood that an English learner would respond with each translation (Mayhew et al., 2020). As such, the main scoring metric is the weighted macro F1 with respect to the accepted translations. To compute the weighted macro F1 (see formula 6), the weighted F1 for each English sentence s is calculated and then averaged over all the sentences in the corpus. The weighted F1 (see formula 5) is computed using the unweighted precision (see formula 1) and the weighted recall (see formulas 2, 3, and 4).
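The referenced formulas are not reproduced in this version of the text; the following is a reconstruction consistent with the description above. The notation is ours: for a prompt s, R_s is the set of returned translations, T_s the gold set, and w_t the weight of gold translation t.

```latex
\begin{align}
P_s &= \frac{|R_s \cap T_s|}{|R_s|} \tag{1}\\
\mathit{TP}^{w}_s &= \sum_{t \in R_s \cap T_s} w_t \tag{2}\\
W_s &= \sum_{t \in T_s} w_t \tag{3}\\
\mathit{WR}_s &= \frac{\mathit{TP}^{w}_s}{W_s} \tag{4}\\
\mathit{WF}_{1,s} &= \frac{2 \cdot P_s \cdot \mathit{WR}_s}{P_s + \mathit{WR}_s} \tag{5}\\
\text{macro } \mathit{WF}_1 &= \frac{1}{|S|} \sum_{s \in S} \mathit{WF}_{1,s} \tag{6}
\end{align}
```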
Baselines. We adopt the two baselines offered by the task organizers. These are based on the Amazon and Fairseq translation systems and score 21.30% and 13.57% macro F1, respectively. More information about these baselines can be found at the shared task site listed earlier.

Evaluation on TRAIN and DEV
In this section, we report the results of our three proposed methods, (a) n-best prediction, (b) paraphrasing, and (c) multi-checkpoint translation, using the MT models presented in Section 4.
Evaluation on TRAIN. For (a) the n-best prediction method, we explore four different values of n ∈ {5, 10, 15, 20}. For (b) the paraphrasing method, we experiment with n′ ∈ {1, 3, 5}. Finally, (c) the multi-checkpoint method was tested with four different numbers of checkpoints, m ∈ {2, 4, 6, 8}.
For paraphrasing and multi-checkpoint translation, we fix the number of n-best translations to n = 10, varying the values of n′ and m only when evaluating our extended model. This leads us to identify the best values of n′ = 3 and m = 6, which we then use when evaluating our basic and fine-tuned models.
Evaluation on DEV. For evaluation on the STAPLE development data, we adopt the same procedure followed for evaluation on the training split. Table 4 summarizes our experiments with different configurations (i.e., values of n, n′, and m) on the training and development data, respectively.
Discussion. The results presented in Table 4 demonstrate that all the models, across the different methods and configurations, outperform the official shared task baseline, with macro F1 scores between 27.41% and 40.78%. As expected, fine-tuning the En→Pt basic model with the S-TRAIN dataset improves the results, by a mean of +1.46% on the training data. We also observe that training on the concatenated OPUS and S-TRAIN datasets from scratch leads to better results than fine-tuning alone.
Based on these results, the best configuration is the multi-checkpoint method used with the extended MT model. This configuration obtains the best macro F1 scores of 40.78% and 39.21% on the STAPLE training and development splits, respectively.

Figure 1 :
Figure 1: Translations proposed by English language learners at various levels of fluency, from diverse backgrounds. Our multi-checkpoint ensemble models mimic learner fluency.

Figure 2 :
Figure 2: An illustration of our proposed models and methods: (a) the n-best prediction method with n = 10, used with the En→Pt basic model; (b) the paraphrasing method with n = 10 and n′ = 3, used with the En→Pt fine-tuned and En↔Pt basic models; (c) the multi-checkpoint method with n = 10 and m = 4, used with the En→Pt extended model.

Table 1 :
English sentences with their Portuguese translation samples from the shared task training split.

Table 2 :
English-Portuguese datasets from Tiedemann (2012) used in our training.

English sentence: is my explanation clear?

Table 3 :
English sentences with their Portuguese translations and weights, sampled from the shared task training data.