Exploring Model Consensus to Generate Translation Paraphrases

This paper describes our submission to the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). This task focuses on improving the ability of neural MT systems to generate diverse translations. Our submission explores various methods, including N-best translation, Monte Carlo dropout, Diverse Beam Search, Mixture of Experts, Ensembling, and Lexical Substitution. Our main submission is based on the integration of multiple translations from multiple methods using Consensus Voting. Experiments show that the proposed approach achieves a considerable degree of diversity without introducing noisy translations. Our final submission achieves a 0.5510 weighted F1 score on the blind test set for the English-Portuguese track.


Introduction
Machine Translation (MT) systems are typically used to produce a single output for a given source sentence, whereas in human translation the same source sentence can often be translated in various ways while still preserving its meaning.
In the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE) (Mayhew et al., 2020), participating MT systems are evaluated against multiple reference translations to measure their ability to generate diverse yet high-quality translations. To this end, a new dataset with multiple human translations for each source sentence is provided. These human translations were produced by language learners as part of a translation exercise on the Duolingo platform, where they were asked to translate sentences from the language they were learning (e.g. English) into their native language. Each translation in the dataset is assigned a weight based on learner response frequency. Table 1 gives an example of the weighted translations in the dataset for English-Portuguese. The STAPLE dataset includes five language pairs: English to Portuguese, Hungarian, Japanese, Korean, and Vietnamese. In the shared task, we participated only in the English-Portuguese (En-Pt) track.

In this paper, we experiment with various methods to improve the diversity of translations while preserving their quality. We show that simply by generating N-best translations with a larger beam size, we can already achieve a considerable degree of diversity. Our final submission is based on the integration of multiple translations from various methods, namely N-best translation, Monte Carlo dropout, Mixture of Experts, Ensembling, and Lexical Substitution, through a consensus voting mechanism. It achieves a 0.5510 weighted F1 score on the official blind test set.

This paper is structured as follows: Section 2 describes the methods used in our experiments. Section 3 introduces the experimental settings, including data preparation, model hyperparameters, and the evaluation procedure. Section 4 describes the results and analysis. Section 5 presents our three official submissions to the STAPLE blind test set.
Finally, Section 6 summarises our submission to the shared task and our contributions.

Methods
In what follows we describe the methods used in our experiments, including N-best translation, Monte Carlo dropout, Diverse Beam Search, Mixture of Experts, Ensembling and Lexical Substitution. We combine all of these methods except the Diverse Beam Search in our official submissions through a consensus voting mechanism. Details about the submissions can be found in Section 5.

N-best
The simplest method to generate multiple translations for a given sentence is to take the N-best translations with a large beam size during decoding. A larger beam size can yield more translation options with similar meanings. We experimented with multiple values of N, and used the same value for the N-best list size and the beam size.

Monte Carlo Dropout
Gal and Ghahramani (2016) proposed the Monte Carlo (MC) dropout method to estimate predictive NMT model uncertainty. The method consists of running several forward passes through the model at inference time, each applying dropout before every weight layer, and collecting the posterior probabilities generated by the model with dropout-perturbed parameters. The mean and variance of the resulting distribution can then be used to represent model uncertainty. Instead of using this method for scoring translations, we use it to generate alternative MT hypotheses for a given source sentence. Specifically, we run inference with dropout M times and collect the resulting translations. In our experiments, the dropout rate is set to 0.1 and M = 10.
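As a sketch of how MC dropout can be used for hypothesis generation (our actual decoder is a fairseq Transformer; here a toy stochastic stand-in plays its role, and all names are ours), we run M stochastic passes with dropout active and keep the distinct outputs:

```python
import random

def mc_dropout_hypotheses(translate_with_dropout, source, m=10, seed=0):
    """Collect the distinct translations produced by M stochastic forward
    passes with dropout left active at inference time."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(m):
        hyp = translate_with_dropout(source, rng)
        if hyp not in hypotheses:  # keep first occurrence, preserve order
            hypotheses.append(hyp)
    return hypotheses

def toy_translate(source, rng):
    """Toy stand-in for a dropout-perturbed NMT decoder: each pass may pick
    a different wording because the perturbed weights rescore hypotheses."""
    return rng.choice(["eu gosto de maçãs", "gosto de maçãs", "eu adoro maçãs"])

hyps = mc_dropout_hypotheses(toy_translate, "i like apples", m=10)
```

With a real model, `translate_with_dropout` would be a single beam-search decode with the dropout layers kept in training mode.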

Diverse Beam Search
Vijayakumar et al. (2016) proposed the Diverse Beam Search algorithm to improve the diversity of beam hypotheses. The algorithm proceeds by dividing the beam budget into groups and enforcing diversity between groups of beams. In our experiments we use the implementation of this algorithm in fairseq (Ott et al., 2019) with default parameters.

Mixture of Experts
Shen et al. (2019) introduced the Mixture of Experts (MoE) framework to capture the inherent uncertainty of the MT task, where the same input sentence can have multiple correct translations. A mixture model introduces a multinomial latent variable to control generation and produce a diverse set of MT hypotheses. In our experiments we use a hard mixture model with a uniform prior and 5 mixture components.

Ensembling
Training an ensemble of MT models initialized with different random seeds is a common strategy to boost output quality (Garmash and Monz, 2016). Unlike typical ensembling, which combines prediction distributions from different models by averaging, we use each system in the ensemble to generate a separate set of translation hypotheses, and take the set of distinct translations as the final output.

Lexical Substitution
In the STAPLE dataset, we observed that many of the paraphrases among the translations are simple variants with word substitutions in the target language. Therefore, we built a dictionary containing all lexical substitutions found in the STAPLE training data. The substitutions are sorted according to two criteria: 1) number of occurrences; 2) substitution probability. The substitution probability of replacing a word w with a word w' is calculated as the number of times w is substituted by w' divided by the total number of substitutions of w. The top-5 lexical substitutions from the frequency-sorted and probability-sorted dictionaries are listed in Table 2. We filtered the substitution dictionary with a stopword list and a threshold (on either the frequency count or the substitution probability) to avoid generating ungrammatical translations.
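The exact extraction procedure is not specified above; a plausible sketch (all names are ours) is to treat two reference translations of the same source that differ in exactly one token position as evidence of a substitution, counting occurrences and computing the substitution probability as count(w → w') / count(w → any):

```python
from collections import Counter

def build_substitution_dict(translation_sets):
    """From each set of reference translations for one source sentence,
    collect word substitutions between pairs of translations that differ
    in exactly one token position."""
    pair_counts = Counter()  # (w, w') -> number of observed substitutions
    word_counts = Counter()  # w -> total substitutions originating from w
    for refs in translation_sets:
        tokenized = [r.split() for r in refs]
        for i, a in enumerate(tokenized):
            for j, b in enumerate(tokenized):
                if i == j or len(a) != len(b):
                    continue
                diff = [(x, y) for x, y in zip(a, b) if x != y]
                if len(diff) == 1:  # exactly one word differs
                    w, w2 = diff[0]
                    pair_counts[(w, w2)] += 1
                    word_counts[w] += 1
    probs = {pair: c / word_counts[pair[0]] for pair, c in pair_counts.items()}
    return pair_counts, probs

# Toy group of references for one source sentence.
counts, probs = build_substitution_dict(
    [["o menino come pão", "o garoto come pão", "o menino come bolo"]])
```

Sorting `counts` or `probs` by value then gives the frequency-sorted and probability-sorted dictionaries, respectively.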

Consensus Voting
To integrate translations from different models, we employ a consensus voting mechanism that counts the number of systems predicting each translation. A threshold T_con is set, meaning that a translation must be predicted by at least T_con + 1 systems, otherwise it is removed. Since lexical substitution might generate rare but correct translations, we assign the lexical-substituted translations a weight W_sub, so that each counts as if generated by W_sub systems. The consensus method maintains high precision by removing translations that are likely to be incorrect.
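A minimal sketch of the voting rule (function and variable names are ours): each system contributes one vote per distinct translation, lexical-substituted translations contribute W_sub votes, and a translation survives only if it collects at least T_con + 1 votes:

```python
from collections import Counter

def consensus_vote(system_outputs, t_con=1, lexical_outputs=None, w_sub=1):
    """Keep translations predicted by at least t_con + 1 systems, with
    lexical-substitution outputs counting as w_sub votes each."""
    votes = Counter()
    for outputs in system_outputs:
        for hyp in set(outputs):  # one vote per system, even if repeated
            votes[hyp] += 1
    for hyp in set(lexical_outputs or []):
        votes[hyp] += w_sub
    return {hyp for hyp, v in votes.items() if v >= t_con + 1}

systems = [["a casa", "uma casa"], ["a casa"], ["a casa", "o lar"]]
print(consensus_vote(systems, t_con=1))                                  # {'a casa'}
print(consensus_vote(systems, t_con=1, lexical_outputs=["o lar"], w_sub=2))
```

In the second call, "o lar" survives because its single system vote plus W_sub = 2 lexical votes exceed the threshold.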

Data
To build the NMT model, we used parallel corpora for En-Pt from OPUS (Tiedemann, 2012) as out-of-domain data, including ParaCrawl, EUbookshop, Europarl, Wikipedia, QED, and Tatoeba. The combination of these corpora contains 22.42 million parallel sentence pairs. The STAPLE dataset, which contains 4,000 source sentences with 526,466 translations, is used as in-domain data for fine-tuning. Since in the STAPLE dataset a source sentence has an average of 131 reference translations, we constructed parallel data by duplicating the source sentence to match the number of translations, as shown in Figure 1. We also experimented with different data filtering strategies on the STAPLE dataset, keeping only the top-K translations with the highest weights (we refer to this as tune-K). Statistics on the corpus size after filtering are shown in Table 3.
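The data construction step can be sketched as follows, assuming the dataset is held as a mapping from source sentences to weighted translations (the helper names are ours):

```python
def keep_top_k(dataset, k):
    """tune-K filtering: keep only the K highest-weighted translations
    for each source sentence."""
    return {src: sorted(ts, key=lambda t: t[1], reverse=True)[:k]
            for src, ts in dataset.items()}

def expand_staple_pairs(dataset):
    """Turn {source: [(translation, weight), ...]} into flat parallel pairs
    by duplicating the source once per reference translation (cf. Figure 1)."""
    pairs = []
    for src, translations in dataset.items():
        for tgt, _weight in translations:
            pairs.append((src, tgt))
    return pairs

dataset = {"i like apples": [("eu gosto de maçãs", 0.7),
                             ("gosto de maçãs", 0.2),
                             ("adoro maçãs", 0.1)]}
pairs = expand_staple_pairs(keep_top_k(dataset, 2))
```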

Table 3: STAPLE corpus size after tune-K filtering.

Filtering   Translations   Translations per source
tune-5      20,000         5.00
tune-10     40,000         10.00
tune-20     78,439         19.61
tune-all    526,466        131.62

All sentences are tokenized with Moses (Koehn et al., 2007), and then processed via Byte-Pair Encoding (BPE) (Sennrich et al., 2016). A shared vocabulary of 40,000 subwords is constructed for both English and Portuguese. The training data was then cleaned by removing sentence pairs with more than 250 subwords or with a length ratio over 1.5, using the clean-corpus-n.perl script from Moses.
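The cleaning step mirrors what clean-corpus-n.perl does; a simplified sketch (operating on whitespace tokens rather than BPE subwords, with limits matching the settings above) is:

```python
def clean_corpus(pairs, max_len=250, max_ratio=1.5):
    """Drop sentence pairs where either side exceeds max_len tokens,
    either side is empty, or the length ratio between the two sides
    exceeds max_ratio."""
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if ls == 0 or lt == 0:
            continue
        if ls > max_len or lt > max_len:
            continue
        if max(ls, lt) / min(ls, lt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

pairs = [("a b", "a b c"),        # ratio 1.5 -> kept
         ("a", "a b c d"),        # ratio 4.0 -> dropped
         ("x " * 300, "y")]       # too long  -> dropped
print(clean_corpus(pairs))
```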

Model and hyperparameters
We used the Transformer model (Vaswani et al., 2017) as our baseline. The model is trained with the fairseq toolkit (Ott et al., 2019) using the transformer_wmt_en_de architecture and the default hyperparameter settings. The model was trained on 8 GPUs with a batch size of 4096 tokens per GPU. We used mixed-precision training to accelerate training. The model was pre-trained on the OPUS data for 30 epochs and then fine-tuned on the STAPLE data. We set the number of experts to 5 for training the MoE system. For ensembling, we pre-trained with 3 random seeds and fine-tuned each with 4 random seeds, resulting in 12 different MT systems.

Generation of Translations
When generating an integrated set of translations from multiple systems, we follow the procedure described below:
1. Generate translations from N systems, resulting in N translation sets s_1, s_2, s_3, ..., s_N.
2. Apply consensus voting to the N system translation sets with threshold T_con, resulting in one translation set s_consensus.
3. Apply lexical substitution to s_consensus, resulting in a separate translation set s_lexical.
4. Apply consensus voting to the N system translation sets together with the lexical substitution set, s_1, s_2, s_3, ..., s_N, s_lexical, with threshold T_con and weight W_sub, resulting in the final translation set s_lexical&consensus.
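The four steps can be sketched end-to-end as follows (all names are ours; `substitute` is an assumed callable that applies the substitution dictionary to one translation):

```python
from collections import Counter

def integrate(system_sets, substitute, t_con, w_sub):
    """Steps 1-4: vote over the N system translation sets, expand the
    consensus set via lexical substitution, then vote again with the
    substituted set carrying w_sub votes."""
    def vote(sets, weights, threshold):
        counts = Counter()
        for s, w in zip(sets, weights):
            for hyp in s:
                counts[hyp] += w
        return {h for h, c in counts.items() if c >= threshold + 1}

    s_consensus = vote(system_sets, [1] * len(system_sets), t_con)   # step 2
    s_lexical = {sub for h in s_consensus for sub in substitute(h)}  # step 3
    return vote(system_sets + [s_lexical],                           # step 4
                [1] * len(system_sets) + [w_sub], t_con)

# Toy run: "a2" is a hypothetical substitution variant of "a".
def toy_substitute(hyp):
    return {"a2"} if hyp == "a" else set()

print(integrate([{"a", "b"}, {"a"}, {"a", "c"}], toy_substitute,
                t_con=1, w_sub=2))  # {'a', 'a2'}
```

Here "a2" survives the final vote only because it carries W_sub = 2 votes, illustrating how the weight lets rare but plausible substitutions through.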

Evaluation
The shared task provides a blind dev set (blind-dev) and a blind test set (blind-test) for evaluation. Since the number of submissions is limited, we also take a small random split of 500 source sentences from the STAPLE training set as our own dev (heldout-dev) and test (heldout-test) sets. The translations are evaluated at sentence level as a classification problem: true positives (TP) occur when the system produces one of the translations in the given set of references, false positives (FP) when a translation outside this set is produced, and false negatives (FN) when translations in this set are missed by the system. The official evaluation metric is a weighted macro F1 score averaged over all source sentences. The weighted F1 score is calculated with weighted recall and unweighted precision:

F1_weighted = 2 * P * R_w / (P + R_w)

where P is the unweighted precision and R_w is the recall weighted by the learner-response weights of the reference translations.
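A sketch of the metric as described (the official scoring script may differ in details): per-sentence F1 combines unweighted precision with weight-based recall, then the scores are macro-averaged over source sentences:

```python
def weighted_f1(hypotheses, references):
    """Per-sentence score: unweighted precision, recall weighted by the
    learner-response weights; references maps translation -> weight."""
    if not hypotheses:
        return 0.0
    tp = [h for h in hypotheses if h in references]
    precision = len(tp) / len(hypotheses)
    recall = sum(references[h] for h in tp) / sum(references.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_weighted_f1(all_hyps, all_refs):
    """Macro-average the per-sentence scores over all source sentences."""
    return sum(weighted_f1(h, r) for h, r in zip(all_hyps, all_refs)) / len(all_refs)

refs = {"eu gosto de maçãs": 0.6, "gosto de maçãs": 0.4}
print(weighted_f1(["eu gosto de maçãs", "eu amo maçãs"], refs))
```

In this example the precision is 1/2 and the weighted recall is 0.6, so the F1 is 2 * 0.5 * 0.6 / 1.1 ≈ 0.545.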

Results
N-best
We present the F1 score with respect to N-best size (from 1 to 20) in Figure 2. The models fine-tuned with different filtered data are evaluated on our heldout-test set. As shown in Figure 2, the pre-trained model (tune-0) performs worse than the fine-tuned models. The tune-1 model performs well when the N-best size is small, but degrades as N-best increases. Models fine-tuned with 5, 10, and 20 reference translations show similar performance, with F1 scores around 0.49. However, the optimal N-best size is closely related to the number of translations used for fine-tuning: N-best = 3, 10, 12, and 18 for the models tuned with 1, 5, 10, and 20 references, respectively. The model fine-tuned with all translations in the STAPLE dataset shows a growing F1 score as the N-best size increases, but its overall F1 score is still much lower than that of the three fine-tuned models; we found its upper bound to be around 0.415 F1.

MC Dropout
Table 4 shows a comparison on the heldout-test set between N-best and N-best with MC dropout. N-best = 12 achieves a higher recall than N-best = 5, which leads to an increase of 0.038 in F1 score. When decoding with dropout, N-best = 5 matches the performance of N-best = 12. Although MC dropout improves performance for small N-best sizes, we found that the weighted F1 score does not improve further as the N-best size grows.

Ensemble & Consensus
In Table 6, we present our ensembling submission and consensus submission (with threshold T_con set to 1) on the blind-dev set. Both ensembling and consensus voting improve over N-best by increasing recall at the cost of precision. However, since consensus voting removes translations with few votes from other systems, its precision is higher than that of ensembling while its recall is similar. This leads to a higher F1 score for the consensus submission.
Ensembling can be seen as a special case of consensus voting with the threshold T_con set to zero. Ensembling maximizes recall by taking translations from all systems, but sacrifices precision. Increasing the threshold T_con compensates for the precision loss while maintaining most of the gain in recall.

Lexical Substitution
Table 7 shows the submissions on the blind-dev set after applying lexical substitution to a consensus output combining ensembled N-best, MC dropout, and MoE systems. We first generated a set of translations with all lexical substitutions, using the translations from an N-best system. The translations with lexical substitution alone achieve an F1 score of 0.127, which shows the potential benefit of this method. However, as shown in Table 7, simply adding the substituted translations harms performance, for both frequency-based and probability-based sorting. This is because the translations after substitution are likely to be ungrammatical, since the substituted word may not fit the context. To alleviate this, we added the substituted translations to the consensus pool for higher precision. This improves over the consensus system without lexical substitution by only +0.002 F1.
In the experiments combining these methods, we found that N-best translation contributes the most. While an N-best system alone achieves a weighted F1 score of nearly 0.5, the other methods, such as MC dropout, Ensembling, and Consensus Voting, add less than 0.05 weighted F1. In our experiments, the Diverse Beam Search and Mixture of Experts systems did not contribute much.

Official submissions
Our official submissions combine translations from 12 tune-10 N-best systems (12 random seeds, fine-tuned with top-10 references, N = 12), 12 tune-20 N-best systems (12 random seeds, fine-tuned with top-20 references, N = 20), 2 MC dropout systems (n = 3, M = 50 and n = 5, M = 10), 3 experts from the MoE system, and lexical substitution (with a probability threshold of 0.7). The consensus voting threshold T_con is set to 10, and the weight W_sub for lexical substitution is 9. Results for our three official submissions on the blind test set are shown in Table 8. The best submission, which achieves an F1 score of 0.5510, applies both consensus voting and lexical substitution. As shown in the second submission, removing lexical substitution reduces the F1 score by 0.006, although precision improves marginally. In the third submission, we set the consensus voting threshold T_con to 1 to probe the upper bound on recall: recall increases from 0.516 to 0.580 while precision drops significantly, from 0.741 to 0.579.
Our best submission achieves second position in the English-Portuguese track, only 0.0006 weighted F1 behind the winning submission. The official results on the STAPLE test set are shown in Table 9.

Conclusions
This paper describes our submissions to the STAPLE shared task for English-Portuguese translation.
We showed that simply generating N-best translations already achieves a considerable degree of diversity and quality. We experimented with various methods to improve the diversity of the MT output, including N-best translation, MC dropout, Diverse Beam Search, Mixture of Experts, Ensembling, Consensus Voting, and Lexical Substitution, and discussed the benefits and drawbacks of each in generating diverse, high-quality translations. Our systems combining these methods further improve over plain N-best translation and achieve a 0.5510 weighted F1 score on the STAPLE blind test set, only 0.0006 behind the winning submission.

A Appendices
A.1 Checkpointing vs. tune-K
Table 10 presents the best fine-tuning checkpoint for models fine-tuned with different numbers of references. Models trained with more references tend to converge faster; when K is larger than 40, only one epoch is used for fine-tuning.

A.2 Submission on blind-dev set
To provide a comprehensive understanding of the different methods, we selectively list our submissions to the blind-dev set in Table 11.