Simultaneous paraphrasing and translation by fine-tuning Transformer models

This paper describes the third-place submission to the shared task on simultaneous translation and paraphrasing for language education at the 4th Workshop on Neural Generation and Translation (WNGT) at ACL 2020. The final system leverages pre-trained translation models and uses a Transformer architecture combined with an oversampling strategy to achieve competitive performance. The system significantly outperforms the baseline on Hungarian (27% absolute improvement in weighted macro F1 score) and Portuguese (33% absolute improvement).


Introduction
This paper describes the third-place submission to the shared task Mayhew et al. (2020) on simultaneous translation and paraphrasing for language education at the 4th Workshop on Neural Generation and Translation (WNGT) at ACL 2020. The shared task involves generating multiple translations in a target language for a given English source text. The five target languages in the task are Hungarian (hu), Portuguese (pt), Japanese (ja), Korean (ko) and Vietnamese (vi). We competed in the Hungarian and Portuguese tracks. A goal of the shared task, hosted by Duolingo, is to enable the development of automated grading and curation systems for language learners' responses. A high-coverage, high-precision multi-output translation and paraphrasing system would vastly help such automated efforts. For the task, participants were provided with hand-crafted and field-tested sets of several possible translations for each English sentence. Each of these translations was also ranked and weighted according to actual learner response frequency, and these weights were provided as additional features. Along with these, translations from AWS were provided as a baseline and as additional data. The challenges associated with the shared task are two-fold: i) translating from English to the target languages and ii) producing multiple valid translations (paraphrases) while balancing precision with coverage. We conduct several experiments to address these two challenges and develop a simple system that leverages pre-trained Transformer Vaswani et al. (2017) models and a wide beam search strategy. Furthermore, we leverage the provided translation scores and experiment with multiple training distribution strategies, arriving at a simple oversampling strategy that improves over the vanilla method of using each translation exactly once.

Related work
Paraphrasing and machine translation are well-studied research areas in general, but there is little research specifically on multi-output translation systems, especially for low-resource languages. Tan et al. (2019) train a Transformer-based Neural Machine Translation model for Hungarian-English and Portuguese-English translation. However, their goal was to assess the benefits of multilingual modeling by clustering languages, which differs from that of a multi-output translation system. For English-Portuguese, Aires et al. (2016) build a phrase-based machine translation system to translate biomedical texts. For multilingual paraphrasing, Ganitkevitch and Callison-Burch (2014) release a database of paraphrases for several languages, including Hungarian and Portuguese, at the lexical, phrasal and syntactic levels. Guo et al. (2019) build a zero-shot multilingual paraphrase generation model and report mixed results. However, their end goal was to generate paraphrases in the same language (English), whereas our shared task requires generating paraphrases in a different language.

Task
We describe dataset statistics and evaluation metrics in this section.

Data
There are two phases of the competition: Dev and Test. Table 1 shows data statistics for both phases. There were 4,000 train prompts provided, in English, for both the Hungarian and Portuguese languages. However, each of these prompts was accompanied by multiple translations, leading to 251,442 English-Hungarian (en-hu) pairs and 526,466 English-Portuguese (en-pt) pairs. There were 500 prompts in both the dev and test phases. After tokenization, for en-hu, most source sentences were shorter than 11 tokens and most target sentences were shorter than 14 tokens. For en-pt, most source sentences were shorter than 25 tokens and most target sentences were shorter than 15 tokens.

Evaluation Metrics
The main scoring metric for the competition is the weighted macro F1 score. This measures how well the system returns all human-curated translations, weighted by the likelihood that an English learner would respond with each translation. For each prompt p, the weighted F1 is calculated as the harmonic mean of precision and weighted recall (note that the precision is unweighted). To calculate weighted recall for a prompt, we first calculate Weighted True Positives (WTP) and Weighted False Negatives (WFN), where R_p is the set of returned translations, G_p the set of gold translations, and w(t) the learner-response weight of translation t:

WTP_p = Σ_{t ∈ R_p ∩ G_p} w(t),    WFN_p = Σ_{t ∈ G_p \ R_p} w(t)

Then, weighted recall (WR) is calculated as:

WR_p = WTP_p / (WTP_p + WFN_p)

The weighted macro F1 (WF) over all prompts P is then calculated by averaging the per-prompt F1 scores over the corpus:

WF = (1 / |P|) Σ_{p ∈ P} 2 · Precision_p · WR_p / (Precision_p + WR_p)
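As a concrete illustration, the metric can be sketched in Python. We assume each prompt pairs the system's predicted set of translations with a gold dictionary mapping each accepted translation to its learner-response weight; the data layout and function name are ours, not the task's:

```python
def weighted_macro_f1(prompts):
    """prompts: list of (predicted_set, gold_weights) pairs, where
    predicted_set is a set of translation strings and gold_weights
    maps each gold translation to its learner-response weight."""
    scores = []
    for preds, gold in prompts:
        true_pos = preds & set(gold)
        precision = len(true_pos) / len(preds) if preds else 0.0
        wtp = sum(gold[t] for t in true_pos)                       # weighted true positives
        wfn = sum(w for t, w in gold.items() if t not in preds)    # weighted false negatives
        wr = wtp / (wtp + wfn) if (wtp + wfn) else 0.0             # weighted recall
        denom = precision + wr
        scores.append(2 * precision * wr / denom if denom else 0.0)
    return sum(scores) / len(scores)                               # macro average over prompts
```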

System Design
We now describe the final submitted system design in detail. We experimented with several other variants, which we describe in later sections.

Data sampling
For the final system, we chose to use weighted sampling of the data, where the weights correspond to the provided learner response frequency. Specifically, we multiply the frequency of the translation (a number between 0 and 1) by a heuristic value of 50 and duplicate the source-translation pair that many times. In effect, this creates repeated samples of pairs whose frequency is greater than 0.02 while eliminating pairs whose frequency is less than 0.02. With this sampling, we end up with 40,500 en-hu pairs and 42,000 en-pt pairs. We separate 15% of the provided prompts as a validation set. The performance on this validation set is used to pick the best model.
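This oversampling step can be sketched as follows; the paper does not specify how fractional counts are rounded, so truncation to an integer is an assumption here:

```python
def oversample(pairs, factor=50):
    """pairs: list of (source, translation, learner_response_frequency).
    Duplicate each pair int(frequency * factor) times; with factor=50
    this drops pairs whose frequency is below 0.02 and repeats the rest."""
    out = []
    for src, tgt, freq in pairs:
        out.extend([(src, tgt)] * int(freq * factor))  # rounding choice is an assumption
    return out
```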

Preprocessing
For text pre-processing, we use SentencePiece tokenization Kudo and Richardson (2018) for the en-hu data and byte-pair encoding Sennrich et al. (2016) for the en-pt data. We use the pre-trained tokenization models provided with OPUS-MT.

Model Architecture
The final submitted model architecture, shown in Figure 1, uses the standard Transformer sequence-to-sequence model, with 6 encoder layers, 6 decoder layers, and an 8-headed attention mechanism in both encoder and decoder.

[Figure 1: Architecture of the final system. An English prompt passes through the Transformer; beam search produces the N-best hypotheses, which post-processing turns into the multi-output translations.]

We initialize the model with the pre-trained representations obtained from OPUS-MT. This model is then fine-tuned on the task data. We tie the encoder, decoder and output embedding weights and use a shared vocabulary of size 60,522. For position-wise feed-forward layers, the Swish activation function Ramachandran et al. (2018) is used. The whole model is fine-tuned, with early stopping, on the dataset constructed as detailed in Section 4.1.
For fine-tuning, we use the standard cross-entropy loss objective on the target sequence along with a label smoothing loss Szegedy et al. (2016).
For decoding, we use beam search with a beam size of 10 and select the top 10 hypotheses for the en-hu track. For the en-pt track, we use a beam size of 28 and select the top 28 hypotheses. We implement the model in Marian NMT Junczys-Dowmunt et al. (2018).

Postprocessing
The beam search outputs scores for each individual token. These scores represent the log likelihood of each token in the output sentence. As a post-processing step, we remove all translation predictions whose maximum token-level score is less than -3.5. This threshold was determined by studying the impact of maximum-score thresholding on validation set performance.
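A minimal sketch of this filter, assuming each hypothesis carries its per-token log likelihoods as a list (the function name and data layout are illustrative):

```python
def filter_hypotheses(hypotheses, threshold=-3.5):
    """hypotheses: list of (translation, token_log_likelihoods).
    Keep a hypothesis only if its best (maximum) token-level
    log likelihood reaches the threshold."""
    return [text for text, scores in hypotheses if max(scores) >= threshold]
```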

Hyperparameters
We use the following hyperparameters: batch size 500, dropout 0.1, and label smoothing 0.1. We use the Adam optimizer with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-9, and decay the learning rate by an inverse-square-root schedule over 16,000 steps. The gradient clip norm is set to 5, and the patience for early stopping is set to 5.
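The exact shape of the schedule is not spelled out; one common reading of an inverse-square-root decay "for 16,000 steps" is linear warmup over that many steps followed by inverse-square-root decay, which can be sketched as:

```python
import math

def inv_sqrt_lr(step, base_lr=3e-4, warmup=16000):
    """Linear warmup to base_lr over `warmup` steps, then decay
    proportionally to the inverse square root of the step count.
    This warmup-then-decay reading is an assumption, not a detail
    stated in the paper."""
    step = max(step, 1)
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)
```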

Ablations
We have performed several ablation studies on the en-hu task. The results of all these studies are listed in Table 3. We list the experiment methodologies below.
No fine-tuning: Here, we applied the pre-trained translation model directly to the task without any fine-tuning. The decoding was done using beam search with a beam size of 12, selecting the top 12 hypotheses (determined based on validation performance).
No oversampling: Here, we use all provided translation pairs without any filtering based on the learner response frequency. We fine-tune the pre-trained model on this dataset and decode using beam search with a beam size of 15, selecting the top 15 hypotheses.

No post-processing: This is the same as the final submitted model without the post-processing (maximum score thresholding).

Other Modeling Variants
We experimented with different modeling alternatives for the shared task. We describe them in this section. The results of these variations are listed in Table 4.

Multi-output sequence formulation
Here, we re-formulate the task as a multi-output prediction task by taking the top 5 translation pairs (based on the learner response frequency) and concatenating them into a single target sequence. The pre-trained model is then fine-tuned on this dataset.

Nucleus sampling: Here, we use the above multi-output sequence model and add nucleus sampling Holtzman et al. (2019) while decoding, with p set to 0.95.
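The target construction for this formulation can be sketched as follows; the separator string is our own placeholder, since the paper does not say how the concatenated translations were delimited:

```python
def build_multi_output_target(translations, sep=" ||| ", k=5):
    """translations: list of (text, learner_response_frequency).
    Keep the top-k translations by frequency and join them into a
    single target sequence. The separator is a hypothetical choice."""
    top = sorted(translations, key=lambda tw: tw[1], reverse=True)[:k]
    return sep.join(text for text, _ in top)
```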

Back Translation
Here, we start with a pre-trained hu-en translation model. We then construct a hu-en dataset from the provided en-hu translation pairs and fine-tune the pre-trained model on it. We apply this fine-tuned hu-en model to the provided reference AWS translations of the target hu sentences. With a beam size of 15 and top-5 hypothesis selection, we generate 5 English paraphrases for each given English prompt. The en-hu fine-tuned model from the "Multi-output sequence formulation" variant then predicts separately for each of the generated English paraphrases, and all the outputs are combined into the final prediction.
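This pipeline can be sketched at a high level; both model arguments are hypothetical callables returning lists of strings, and pooling with first-occurrence deduplication is our assumption about how the outputs were combined:

```python
def back_translation_predict(aws_hu, hu_en_model, en_hu_multi_model, k=5):
    """Paraphrase the prompt via the AWS Hungarian reference using the
    fine-tuned hu-en model, translate each English paraphrase with the
    multi-output en-hu model, then pool and deduplicate the outputs."""
    paraphrases = hu_en_model(aws_hu)[:k]      # top-k English paraphrases
    combined = []
    for para in paraphrases:
        for hu in en_hu_multi_model(para):
            if hu not in combined:             # keep first occurrence only
                combined.append(hu)
    return combined
```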

Model-based Prediction Filtering
Here, we start with the final submission model and build a binary XGBoost classifier on top of it to filter predictions (accept vs. reject). The features of the XGBoost model are the token-level scores, as described in Section 4.4, obtained from the final submission model. As different sequences have different lengths, we build a fixed-size feature vector by truncating or padding all sequences to a length of 11, the 99th-percentile source length listed in Table 1. The binary labels for training are obtained by comparing output translations with the provided gold translations. We do a randomized search over the "max depth", "colsample bytree", "colsample bylevel" and "n estimators" hyperparameters of the XGBoost model to find the best set of values. We then perform 5-fold cross-validation to identify the best model. The F1 score of this model on the "accept" class is 0.81 and on the "reject" class is 0.48. The overall accuracy is about 72%.

Results

Table 3 shows the results for the en-hu ablations listed in Section 5, and Table 4 shows the results for the modeling variants listed in Section 6. There are several interesting observations to be made. First, there is a clear improvement of about 15.4 points in weighted macro F1 from fine-tuning the pre-trained model on the provided dataset. The simple post-processing strategy of score thresholding yielded a gain of about 1.79 absolute points. Similarly, there is also a big improvement of about 10.7 absolute points from the oversampling strategy we used (as opposed to no oversampling). However, this gap was closed by a big margin (about 7 absolute points) through the multi-output sequence formulation, and slightly more by adding nucleus sampling on top of it. A separate approach that uses back translation yielded similar gains over the "No oversampling" approach.
The model-based prediction filtering yielded an improvement of about 4 absolute points. Interestingly, all of these variants still ended up inferior (by varying degrees) to the simple oversampling + fine-tuning + post-processing strategy used for the final submission.

Summary
We described the system for our submission to the shared task on simultaneous translation and paraphrasing for language education at the 4th Workshop on Neural Generation and Translation (WNGT) at ACL 2020. The final submitted system leverages pre-trained translation models, with a Transformer architecture, and an oversampling strategy to achieve competitive performance. In future work, it would be interesting to see whether initializing the model with the latest state-of-the-art sequence-to-sequence pre-trained models, such as BART Lewis et al. (2019) and T5 Raffel et al. (2019), and fine-tuning could further boost performance. It would also be a promising direction to explore the benefit of using cross-lingual models such as XLM-RoBERTa Conneau et al. (2019). One way to use them would be to initialize the encoder part of the architecture with pre-trained representations. Given the shared representations, it might be interesting to see whether concatenating several language pairs' training datasets and training a joint model produces additional benefits.