GTCOM Neural Machine Translation Systems for WMT19

This paper describes the Global Tone Communication Co., Ltd. (GTCOM) submission to the WMT19 shared news translation task. We participate in six directions: English to (Gujarati, Lithuanian and Finnish) and (Gujarati, Lithuanian and Finnish) to English. Among all participants, we achieve the best BLEU scores in the English-to-Gujarati and Lithuanian-to-English directions (28.2 and 36.3 respectively). The submitted systems rely mainly on back-translation, knowledge distillation and reranking to build competitive models for this task. We also apply language models, in combination with rule-based filtering, to clean monolingual data, back-translated data and parallel data. In addition, we conduct several experiments to validate different knowledge distillation techniques and right-to-left (R2L) reranking.


Introduction
We participated in the WMT19 shared news translation task, focusing on six directions: English to and from Gujarati, Lithuanian and Finnish. Our neural machine translation systems are based on the Transformer architecture (Vaswani et al., 2017a) and built with the Marian toolkit (Junczys-Dowmunt et al., 2018). Since BLEU (Papineni et al., 2002) is the main ranking index for all submitted systems, we use BLEU as the evaluation metric for our translation systems. In addition to data filtering, which is essentially the same as in our WMT 2018 submission last year, we verify different knowledge distillation and reranking techniques to improve the performance of all our systems.
For data preprocessing, the basic methods include punctuation normalization, tokenization, truecasing and byte pair encoding (BPE) (Sennrich et al., 2015b). In addition, hand-crafted rules and language models are used to clean the parallel data, monolingual data and synthetic data. Regarding model training, back-translation (Sennrich et al., 2015a), knowledge distillation and R2L reranking (Sennrich et al., 2016) are applied to verify whether these techniques improve the performance of our systems.
In order to explore the application of knowledge distillation in neural machine translation, we conduct a number of experiments on sequence-level knowledge distillation and sequence-level interpolation (Kim and Rush, 2016). Furthermore, since R2L reranking did not yield better performance in last year's experiments, we now increase the beam size step by step and explore the effect of every combination of R2L models at each step. This paper is organized as follows. We first describe the task and the provided data, then introduce our data filtering methods, focusing on the application of language models. After that, we describe the techniques applied on top of the Transformer architecture and present the experiments for all directions in detail, including data preprocessing, model architecture, back-translation and knowledge distillation. Finally, we analyze the experimental results and draw conclusions.

Task Description
The task focuses on bilingual text translation in the news domain, and the provided data, including parallel data and monolingual data, is shown in Table 1. For the direction between English and Lithuanian, the parallel data mainly comes from Europarl v9, ParaCrawl v3, Wiki Titles v1 and the Rapid corpus of EU press releases (Rozis and Skadiņš, 2017). For the direction between English

Data Filtering
The rule-based filtering methods are largely the same as those we used for English to Chinese (Bei et al., 2018) last year, but language models are additionally used to clean all data, including monolingual data, parallel data and synthetic data. We use Marian to train a Transformer language model for each language (i.e. English, Gujarati, Lithuanian and Finnish). The language models are applied under two conditions:
• For monolingual data and synthetic data (i.e. data back-translated from the target side and data distilled from the source side), every sentence is scored by the language model, and the score is length-normalized as follows:

Score = Score_lm / L_sentence

Here, Score_lm is the language model score of the sentence, and L_sentence is the length of the sentence in tokens.
• For parallel data, we consider the scores of both sides and combine them linearly:

Score_combine = λ · Score_src + (1 − λ) · Score_tgt

Here, λ is 0.5. According to the sorted scores, we remove the sentences or sentence pairs that are obviously of low quality. Table 2 shows the amount of data after cleaning.

Table 2: Amount of data after cleaning.
direction                    number of cleaned data
en-lt parallel data          4.08M
en-gu parallel data          77K
en-fi parallel data          9M
en monolingual data          17.6M
lt monolingual data          2.92M
gu monolingual data          4.28M
fi monolingual data          15M
en-gu unconstrained data     4.55M
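The scoring and filtering procedure above can be sketched as follows. This is a minimal illustration, not the submission's code: `lm_src`/`lm_tgt` stand in for the Marian transformer language models, and the function names and the keep ratio are assumptions made for the example.

```python
def length_normalized(score_lm, sentence):
    """Score = Score_lm / L_sentence, with length counted in tokens."""
    return score_lm / max(len(sentence.split()), 1)

def combined_score(score_src, score_tgt, lam=0.5):
    """Score_combine = lam * Score_src + (1 - lam) * Score_tgt."""
    return lam * score_src + (1 - lam) * score_tgt

def filter_parallel(pairs, lm_src, lm_tgt, keep_ratio=0.9, lam=0.5):
    """Sort sentence pairs by combined LM score and drop the worst tail.

    lm_src / lm_tgt are callables returning a language model score for
    one sentence (hypothetical stand-ins for the trained Marian LMs).
    """
    scored = []
    for src, tgt in pairs:
        s = combined_score(length_normalized(lm_src(src), src),
                           length_normalized(lm_tgt(tgt), tgt), lam)
        scored.append((s, src, tgt))
    scored.sort(reverse=True)
    keep = int(len(scored) * keep_ratio)
    return [(src, tgt) for _, src, tgt in scored[:keep]]
```

The same length-normalized score with a single language model covers the monolingual and synthetic-data condition.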

Back-translation
Back-translation (Sennrich et al., 2015a) has been proved to be an effective way to improve translation quality, especially in low-resource conditions. As we did last year, we first train models from target to source, and then use these models to translate the provided target-side monolingual data into the source language. Besides, the target side of the parallel data is also translated into the source language. Note that parallel data and synthetic data are mixed with a ratio of 1:1. Joint training is another method that has been shown to improve back-translation; from this perspective, back-translation is the first step of joint training. After obtaining the best model from back-translation, we again translate the monolingual data and the target side of the parallel data, and mix parallel data and synthetic data with a ratio of 1:1. The new training set is then used to train a new model, and this is repeated until there is no further improvement. Due to time limitations, we only repeated this procedure twice.
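The 1:1 mixing of parallel and synthetic data can be sketched as follows; corpus reading and the back-translation decoding itself are abstracted away, and the subsampling choice is an assumption (the paper only fixes the ratio, not how the larger side is reduced):

```python
import random

def mix_one_to_one(parallel, synthetic, seed=1):
    """Return a training set with parallel and synthetic data at a 1:1 ratio.

    The larger side is subsampled to the size of the smaller one;
    oversampling the smaller side would be an equally valid reading.
    """
    rng = random.Random(seed)
    n = min(len(parallel), len(synthetic))
    mixed = rng.sample(parallel, n) + rng.sample(synthetic, n)
    rng.shuffle(mixed)
    return mixed
```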

Sequence-level Knowledge Distillation
Sequence-level knowledge distillation trains a student network to perform better by learning from a teacher network, matching the student's predictions to the teacher's predictions. We consider two kinds of teachers to improve NMT performance: • Ensemble Teacher Following Freitag et al. (2017), we translate the source side of the parallel data with ensemble models to obtain synthetic target-side sentences. This synthetic data is added to training.
• R2L Teacher Inspired by Hassan et al. (2018), we translate the source side of the parallel data into the target language with an R2L model to improve the L2R model.
To avoid poor translations, we remove synthetic sentences whose BLEU score against the reference is lower than 30.
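The BLEU-based filtering of the distilled data can be sketched as below. The tiny smoothed sentence-BLEU here is an illustrative stand-in, not the scorer used for the submission; any sentence-level BLEU implementation serves the same purpose.

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level BLEU (0-100) over whitespace tokens."""
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum((h & r).values())
        total = max(sum(h.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero BLEU
        log_prec += math.log((match + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return 100.0 * brevity * math.exp(log_prec / max_n)

def filter_distilled(triples, threshold=30.0):
    """Keep (source, teacher_translation, reference) triples whose
    translation scores at least `threshold` BLEU against the reference."""
    return [(s, t, r) for s, t, r in triples
            if sentence_bleu(t, r) >= threshold]
```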

Sequence-level Interpolation
After sequence-level knowledge distillation, the trained models are fine-tuned with n-best knowledge distillation data. This data comes from the n-best translations produced during sequence-level knowledge distillation with the different kinds of teachers. For each source sentence, we extract the translation with the highest BLEU score from its n-best list to build the n-best knowledge distillation data.
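Building the interpolation fine-tuning set can be sketched as follows. The `similarity` function is a placeholder: the paper selects by BLEU, while this sketch uses unigram F1 purely to keep the example self-contained.

```python
def similarity(hyp, ref):
    """Illustrative stand-in metric: unigram F1 (the paper uses BLEU)."""
    h, r = hyp.split(), ref.split()
    common = len(set(h) & set(r))
    if common == 0:
        return 0.0
    p, q = common / len(h), common / len(r)
    return 2 * p * q / (p + q)

def interpolation_data(nbest, references):
    """nbest: {source: [hyp_1, ..., hyp_n]}; references: {source: reference}.
    For each source sentence, keep the teacher hypothesis closest to the
    reference; the result is the fine-tuning set."""
    data = []
    for src, hyps in nbest.items():
        best = max(hyps, key=lambda h: similarity(h, references[src]))
        data.append((src, best))
    return data
```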

R2L Reranking
Last year, applying R2L reranking did not improve our English-to-Chinese results. We concluded that the reason was that we neither increased the beam size step by step nor used all combinations of R2L models. Therefore, to enlarge the search space and obtain better translations, we applied this procedure this year.
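The search over beam sizes and R2L model combinations can be sketched as below, under stated assumptions: `decode_nbest`, `r2l_score` and `dev_bleu` are hypothetical placeholders for the real decoder, the R2L rescoring, and development-set evaluation.

```python
from itertools import combinations

def rerank_search(r2l_models, decode_nbest, r2l_score, dev_bleu,
                  step=10, max_beam=200):
    """Grow the beam step by step and try every non-empty combination of
    R2L models as the reranker; return the best (BLEU, beam, combo)."""
    best = (float("-inf"), None, None)
    for beam in range(step, max_beam + 1, step):
        nbest = decode_nbest(beam)  # one n-best list per source sentence
        for k in range(1, len(r2l_models) + 1):
            for combo in combinations(r2l_models, k):
                # pick, per source, the hypothesis this R2L combo prefers
                reranked = [max(hyps, key=lambda h: r2l_score(combo, h))
                            for hyps in nbest]
                score = dev_bleu(reranked)
                if score > best[0]:
                    best = (score, beam, combo)
    return best
```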

Experiment
This section describes all the experiments we conducted and illustrates how we obtained the evaluation results step by step.

Model Architecture
We train our models with Marian using the Transformer-big configuration (Vaswani et al., 2017b). The model configuration and the training parameters are shown in Table 3 and Table 4 respectively.

Data preprocessing
Both parallel data and monolingual data are fully filtered. After that, we normalize the punctuation of all sentences with normalize-punctuation.perl from the Moses toolkit (Koehn et al., 2007). We apply the Moses tokenizer and truecaser to English, Lithuanian and Finnish sentences, and use polyglot to tokenize Gujarati sentences. Finally, BPE is applied to the tokenized English, Lithuanian, Finnish and Gujarati sentences respectively. The number of BPE merge operations is set to 30000, and the vocabulary size is 30500.
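Applying learned BPE merges at preprocessing time can be sketched as follows; the submission uses a standard BPE setup with 30000 merge operations, and the toy merge table in the test is an illustrative assumption.

```python
def apply_bpe(word, merges):
    """Greedily apply ranked merge operations to one word.

    `merges` is the learned merge table as an ordered list of symbol
    pairs; lower index means the merge was learned earlier and is
    applied first, as in standard BPE segmentation.
    """
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies to any adjacent pair
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```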

Training Step
Here we introduce the training steps in detail.
• Baseline model We train our baseline model in the Transformer-big configuration with only the parallel data cleaned by rules and the language model. In addition, R2L models are trained on the same data with 4 different random seeds.
• Back-translation Given the baseline model, we decode target-side monolingual data into the source language with ensemble models trained in the target-to-source direction. For example, to train an English-to-Gujarati model with synthetic data, we use the Gujarati-to-English baseline model to translate Gujarati sentences into English. The translated English sentences are then filtered by the language model. The synthetic data and parallel data, mixed with a ratio of 1:1, are used to train the back-translation model.
• Joint Training Given the back-translation model, we repeat the back-translation step until there is no further improvement. We repeated this step twice.
• Sequence-level Knowledge Distillation Unlike back-translation, we use different source-to-target teachers to translate the source side of the parallel data into the target language. For example, we use an English-to-Gujarati model to translate English sentences into Gujarati. Each translation with a BLEU score lower than 30 against the golden reference is removed. Considering the low-resource condition, we mix parallel data, synthetic data and knowledge distillation data with a ratio of 1:1:1 to train the new model.
• Sequence-level Interpolation After sequence-level knowledge distillation, the best models are fine-tuned with the n-best knowledge distillation data.
• Ensemble Decoding To get the best performance over all models efficiently, we use the GMSE algorithm (Deng et al., 2018) to select models for ensembling.
• R2L Reranking To enlarge the search space, we increase the beam size step by step and rescore the n-best list with all combinations of R2L models at each step. The step size is 10 and the maximum beam size is 200.
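The exact GMSE procedure is given in Deng et al. (2018); the sketch below is only a generic greedy model selection in that spirit, under the assumption of a `dev_bleu` placeholder standing in for real ensemble decoding plus development-set scoring: start from the best single model and add the checkpoint that most improves dev BLEU until no addition helps.

```python
def greedy_select(models, dev_bleu):
    """models: candidate checkpoints; dev_bleu(subset) -> BLEU on dev set.
    Returns the greedily selected ensemble and its dev score."""
    chosen = [max(models, key=lambda m: dev_bleu([m]))]
    best = dev_bleu(chosen)
    remaining = [m for m in models if m not in chosen]
    improved = True
    while improved and remaining:
        improved = False
        # evaluate every one-model extension of the current ensemble
        score, cand = max((dev_bleu(chosen + [m]), m) for m in remaining)
        if score > best:
            best, improved = score, True
            chosen.append(cand)
            remaining.remove(cand)
    return chosen, best
```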

Summary
This paper describes GTCOM's neural machine translation systems for the WMT19 shared news translation task. For all translation directions, we improve our systems mainly from the data perspective, acquiring both larger quantities and higher quality of data. In addition, decoding strategies such as the GMSE algorithm and R2L reranking give us more robust and higher quality translations. Finally, our English-to-Gujarati (unconstrained) and Lithuanian-to-English systems achieve the best case-sensitive BLEU scores among all submitted systems.