CASIA’s System for IWSLT 2020 Open Domain Translation

This paper describes the CASIA’s system for the IWSLT 2020 open domain translation task. This year we participate in both Chinese→Japanese and Japanese→Chinese translation tasks. Our system is neural machine translation system based on Transformer model. We augment the training data with knowledge distillation and back translation to improve the translation performance. Domain data classification and weighted domain model ensemble are introduced to generate the final translation result. We compare and analyze the performance on development data with different model settings and different data processing techniques.


Introduction
Neural machine translation(NMT) has been introduced and made great success during the past few years (Sutskever et al., 2014;Bahdanau et al., 2015;Luong et al., 2015;Wu et al., 2016;Gehring et al., 2017;Zhou et al., 2017;Vaswani et al., 2017). Among those different neural network architectures, the Transformer, which is based on self-attention mechanism, has further improved the translation quality due to the ability of feature extraction and word sense disambiguation (Tang et al., 2018a,b). In this paper, we describe our Transformer based neural machine translation system submitted to the IWSLT 2020 Chinese→Japanese and Japanese→Chinese open domain translation task (Ansari et al., 2020).
Our system is built upon Transformer neural machine translation architecture. We also adopt Relative Position (Shaw et al., 2018) and Dynamic Convolutions (Wu et al., 2019) to investigate the performances of advanced model variations. For the implementation, we extend the latest release of Fairseq 1 (Ott et al., 2019). 1 https://github.com/pytorch/fairseq For data pre-processing, we use byte-pair encoding(BPE) segmentation (Sennrich et al., 2016b) for the source side and character level segmentation for the target side to improve the model performance on rare words. We also investigate the influence of different segmentation methods including word, BPE and character segmentation for both sides.
To further improve the translation quality, we utilize data augmentation techniques of backtranslation with a sub-selected monolingual corpus to build additional pseudo parallel training data. Sentence level knowledge distillation is used to strengthen the performance of student model from multi-policy teacher models including left→right, right→left, source→target and target→source.
We also investigate the domain information of the large training data by using a Bert based domain classifier, which is a masked language model and has been shown effective in large scale text classification tasks (Devlin et al., 2019). With the in-domain data, we transfer the model of general domain to each specific domain, and use weighted domain model ensemble as decoding strategy. Figure 1 depicts the whole process of our submission system, in which we pre-process the provided data and train our advanced Transformer models on the bilingual data together with synthetic corpora from back-translation and knowledge distillation. With domain classification and fine tuning techniques, we obtain multiple models for ensemble strategy and post-processing. In this section, we will introduce each process step in detail.

NMT Baseline
In this work, we build our model based on the powerful Transformer (Vaswani et al., 2017). The Transformer is a sequence-to-sequence neural  Figure 2: The Transformer model. model that consists of two components: the encoder and the decoder, as shown in Figure 2. The encoder network transforms an input sequence of symbols into a sequence of continues representations. The decoder, on the other hand, produces the target word sequence by predicting the words using a combination of the previously predicted word and relevant parts of the input sequence representations. Particularly, relying entirely on the multi-head attention mechanism, the Transformer with beam search algorithm achieves the state-ofthe-art results for machine translation.

Multi-Head Attention
We use the multi-head attention with h heads, which allow the model to jointly attend to information from different representation subspaces at different positions. Formally, multi-head attention first obtains h different representations of (Q i , K i , V i ). Specifically, for each attention head i, we project the hidden state matrix into distinct query, key and value represen- Then we perform scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feedforward layer.
Scaled Dot-Product Attention An attention function can be described as a mapping from a query and a set of key-value pairs to an output. Specifically, we can multiply query Q i by key K i to obtain an attention weight matrix, which is then multiplied by value V i for each token to obtain the self-attention token representation. As shown in Figure 3, we compute the matrix of outputs as: where d k is the dimension of the key. For the sake of brevity, we refer the reader to Vaswani et al. (2017) for more details.

Back-Translation
Back-translation is an effective and commonly used data augmentation technique to incorporate monolingual data into a translation system (Sennrich et al., 2016a;Zhang and Zong, 2016). Especially for low-resource language tasks, it is indispensable to augment the training data by mixing the pseudo corpus with the parallel part. Back-translation first trains an intermediate target-to-source system that is used to translate monolingual target data into additional synthetic parallel data. This data is used in conjunction with human translated bitext data to train the desired source-to-target system. How to select the appropriate sentences from the abundant monolingual data is a crucial issue due to the limitation of equipment and huge overhead time. We trained a n-gram based language model on the target side of bilingual data to score the monolingual sentences for each translation direction.
Recent work (Edunov et al., 2018) has shown that different methods of generating pseudo corpus made discrepant influence on translation performance. Edunov et al. (2018) indicated that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam search or greedy search. We adopt the back-translation script from fairseq 2 and generate back-translated data with sampling for both translation directions.

Knowledge Distillation
The goal of knowledge distillation is to deliver a student model that matches the accuracy of a teacher model (Kim and Rush, 2016). Prior work  demonstrates that student model can surpass the accuracy of the teacher model. In our experiments, we adopt sequence-level knowledge distillation method and investigate four different teacher models to boost the translation quality of student model. S2T+L2R Teacher Model: We translate the source sentences of the parallel data into target language using our source-to-target (briefly, S2T) system described in Section 2.1 with left-to-right (briefly, L2R) manner.
S2T+R2L Teacher Model: We translate the source sentences of the parallel data into target language using our S2T system with right-to-left (briefly, R2L) manner.

T2S+L2R Teacher Model:
We translate the target sentences of the parallel data into source language using our target-to-source (briefly, T2S) system with L2R manner.
T2S+R2L Teacher Model: We translate the target sentences of the parallel data into source language using our T2S system with R2L manner.
In the final stage, we use the combination of the translated pseudo corpus to improve the student model. It is worth noting that we also mix the original bilingual sentences into these pseudo training corpus.

Model Ensemble and Reranking
Model ensemble is a method to integrate the probability distributions of multiple models before predicting next target word (Liu et al., 2018). We average the last 20 checkpoints for single model to avoid overfitting. One checkpoint is saved per 1000 steps. For model ensemble, we train six separate models. To achieve this, we fine-tune our student model described in Section 2.3 and back translation model described in Section 2.2 using corpus from three different domains (Spoken domain, Wiki domain and News domain). We use weighted ensemble to generate the translation result, in which the weights for each domain model is calculated from a Bert based domain classifier. The domain specific data for training the domain classifier and fine tuning the student translation model will be described in detail in Section 3.4.
For reranking, we rescore 50-best lists output from the ensemble model using a rescoring model, which includes the models we trained with different model sizes, different corpus portions and different token granularities.

Data Preparation
This section introduces the methods we employ to prepare the provided parallel data (18.9M web crawled corpus and 1.9M existing parallel sources) and monolingual sentences (unaligned web crawled data). We also describe how to prepare domain specific data to facilitate translation.
The provided parallel corpus existing parallel for the two translation directions consists of around 1.9M sentence pairs with around 33.5M characters (Chinese side) in total. Furthermore, a large, noisy set of Japanese-Chinese segment pairs built from web data web crawled is also provided, which con-sists of around 18.9M sentence pairs with around 493.9M characters (Chinese side) in total. We use the provided development dataset as the validation set during training, which consists of 5,304 sentence pairs. The average length and length ratio of the provided parallel corpus and development dataset is shown as in Table 1.

Pre-processing and Post-processing
In the open domain translation task both on Chinese→Japanese and Japanese→Chinese translation directions, we first implement pre-processing on training corpus and then filter it.
Before pre-processing, We remove illegal sentences in the provided Japanese-Chinese parallel corpus which include duplicate sentences and sentences in different languages other than source or target (filtered by our language detector tools).
Pre-processing steps include escape character transformation, text normalization, language filtering and word segmentation. There are lots of escape characters in the existing parallel and web crawled which do not occur in development set. As a result, we transform all these escape characters into corresponding marks with a well designed rule-based method to make it consistent between the training and evaluation.
Text normalization step mainly focuses on normalization of numbers and punctuation. Based on analysis on development set, we found that in Chinese, most of the punctuation are double byte characters (DBC), while most of the numbers are single byte characters (SBC). However, most of the numbers and punctuation in Japanese are double byte characters (DBC). Hence we normalize the numbers and punctuation format to make it the same as development set.
In word segmentation step, we apply Jieba 3 as our Chinese word segmentation tool for segmenting Chinese parallel data and monolingual data. For Japanese text, word segmentation is used Mecab (Toshinori Sato and Okumura, 2017). After preprocessing, we filter the training corpus as mentioned in section 3.2.
Finally, we apply Byte Pair Encoding (BPE) (Sennrich et al., 2016b)   in source language since it has the best performance on preliminary machine translation experiments.
For target side, we determine to use character granular because character level decoder could perform better in our preliminary experiments. Post-processing steps are similar to preprocessing without filtering. We apply escape character transformation, text normalization and unknown words (UNK) processing steps on machine translation results. The same methods are used to implement escape character transformation and text normalization as pre-processing. For UNK processing, we find some of the numbers can not be well translated by model and we replace these UNKs with the numbers in source sentence. Otherwise, we remove the UNK symbols.

Parallel Data Filtering
The following methods are applied to further filter the parallel sentence pairs.
We remove sentences longer than 50 and select the parallel sentences where the length ratio (Ja/Zh) is between 0.53 and 2.90. We then calculate word alignment of each sentence pair by using fast align 4 (Dyer et al., 2013). The percentage of aligned words and alignment perplexities are used as the metric where the thresholds are set as 0.4 and −30 respectively. Through the above filtering procedure, the number of the remaining data is reduced from 20.9M to 15.7M, as shown in Table 2.

Monolingual Data Filtering
It is proven that back-translation is a simple but effective approach to enhance the translation quality   as described in Section 2.2. To achieve that, we extract the high-quality monolingual sentences from the provided unaligned web crawled data. After removing illegal sentences from web crawled corpus, we limit the maximum sentence length as 50 and remove dirty data by a language model. Specially, we use KenLM 5 toolkit to train two language models with Japanese and Chinese monolingual data extracted from the provided parallel corpus existing parallel. We then rank the sentences based on the perplexities calculated by the trained language models and filter by perplexity threshold of 4 for Chinese and 3 for Japanese. Note that the perplexities are normalized by sentence lengths. obtain 6.1M and 16.4M monolingual sentences for Japanese and Chinese separately. The filtering results are presented in Table 3. The obtained monolingual sentences are fed to the trained model to generate pseudo parallel sentence pairs, which are employed to boost the performance of the model.

Domain Data Processing
Although the amount of provided training data is large enough, it is a noise set of web data built from multiple domain sources. Koehn and Knowles (2017)   the training data. Only the same or similar corpora are typically able to improve translation performance. Therefore, we apply domain adaptation methods in this task. Adaptation methods for neural machine translation have attracted much attention in the research community (Britz et al., 2017;Wang et al., 2017;Chu and Wang, 2018;Zhang and Xiong, 2018;Wang et al., 2020). They can be roughly classified into two categories, namely data selection and model adaptation. The former focuses on selecting the similar training data from out-of-domain parallel corpora, while the latter focuses on the internal model to improve model performance. Following these two categories, our domain data processing takes the following steps, as shown in Figure 4. Domain Label In this task, there are two kinds of domain labels provided: domains in existing parallel and domains in web crawled parallel. Since the later is mainly source document index for each sentence pair, the former is more meaningful for domain classification. We categorize the domain label of existing parallel data into three commonly used classes, namely Wiki, Spoken, and News. The domain Wiki includes wiki facebook, wiki zh ja tallip2015 and wiktionary. The label Spoken includes ted and opensubtitles. The label News includes global-voices, newscommentary and tatoeba.
Domain classification Data selection can be conduct in supervised or unsupervised manners (Dou et al., 2019). Since there is a provided data source descriptive file in the existing parallel data which can be regarded as domain labels, we choose the supervised way here. We use two BERT models pretrained on Chinese 6 and Japanese 7 data, respectively. Then the BERT models are fine tuned as a text classification task, based on the source and target side of existing parallel with three domain label we defined. Since the domain data is uneven, we also adopt oversampling and use extra data to enlarge News domain For the remaining data in web crawled parallel, we use the classification model to classify the total data into the three different domains. The statistics of domain data we used is shown in Table 4.
Decoding Stage Considering the test set is also composed of a mixed-genre data, we first classify the domain of each sentence in the test set and obtain the probabilities corresponding to each domain. Then we apply a weighted ensemble method to integrate NMT models. Specifically, when computing the output probability of the next word, we multiply the output probability in each domain specific translation model with the corresponding domain probability of each sentence.

Other Data Resource
The task description says that the test data is a mixture of genres but the provided development set is mainly from spoken domain. Furthermore, we find that the domain distribution of the training data is severely unbalanced (as shown in Table  4). Especially, the data of News domain is quite limited. Due to above two reasons, we decided to crawl some data from other domains.
It is easy to find that hujiangjp 8 which is a website helping people to study foreign languages contains some parallel Chinese-Japanese sentences. Accordingly, we crawled all the available data in this website before test data release. The total amount of extra data consists of 12, 665 parallel sentences. We randomly select 4, 877 sentence pairs to build an extra development set. When training each domain model, all the extra data are used as part of News domain. We 6 https://github.com/ymcui/ Chinese-BERT-wwm 7 https://github.com/cl-tohoku/ bert-japanese 8 https://https://jp.hjenglish.com/new/ find that 383 Chinese→Japanese pairs and 421 Japanese→Chinese pairs in the crawled data are overlapped with the final test set. We just used the originally trained model to decode the test set and decided not to retrain our model since it will take much time and the organizers remind that models cannot be changed after the test set is released. Anyway, we also suggest to test the translation quality on the remaining test set excluding the overlapped sentences.

Experiment Setup
Our implementation of Transformer model is based on the latest release of Fairseq. We use Transformer-Big as basic setting, which contains layers of N = 6 for both encoder and decoder. Each layer consists of a multi-head attention sublayer with heads h = 16 and a feed-forward sublayer with inner dimension d f f = 4096. The word embedding dimensions for source and target and the hidden state dimensions d model are set to 1024.
In the training phase, the dropout rate P drop is set to 0.1. In the fine tuning phase, the dropout rate is changed to 0.3 to prevent over-fitting. We use cross entropy as loss function and apply label smoothing of value ǫ ls = 0.1. For the optimizer, we use Adam (Kingma and Ba, 2015) with β 1 = 0.9, β 2 = 0.98 and ǫ = 10 −8 . The initial learning rate is set to 10 −4 for training and 10 −5 for fine tuning.
The models with complete training data are trained on 4 GPUs for 100,000 steps. For the dataset with knowledge distillation or backtranslation, the models are trained for 150,000 steps. We validate the model every 1,000 minibatches on the development data and perform early stop when the best loss is stable on validation set for 10,000 steps. At the end of training phase, we average last 20 checkpoints for each single model of general domain. In fine tuning phase, we use the averaged model of general domain as starting point for initializing the domain model, and continue training on 1 GPU with domain data for 50,000 steps without early stop. The batch sizes in training and fine tuning are set to 32768 and 8192 respectively.  Japanese→Chinese (JA→ZH) translation directions. We report the character BLEU score calculated with multi-bleu-detok.perl script. As shown in the result, filtering with complete parallel data plays an important role in our system. Techniques of back translation and knowledge distillation consistently improve the BLEU score. When applying domain classification, we classify each sentence using the Bert-based domain classifier and decode each sentence with corresponding domain model. As for combination methods, we build six separate models with three domain (Wiki, Spoken and News) fine tuned on two large synthetic data (back translation and knowledge distillation). In ensemble baseline, all of these models share the same weight in predicting word distributions. Weighted ensemble indicates we apply different weights for the ensemble models, in which the weights are obtained by the domain classifier. With weighted domain ensemble, our system achieves the best performance on development data in terms of BLEU, and surpass the single baseline systems by 7.34 BLEU for Chinese→Japanese and 8.36 BLEU for Japanese→Chinese.

Result
We also find a performance drop with reranking. The reason may be that we train the reranking models on the complete parallel data, which is from general domain and may assign lower score for domain specific translations. As a result, our submission is based on the weighted ensemble system, which performs best in our experiments.

Analysis
We compare the performance of different model variations and token granularities on   Chinese→Japanese development data. The data we used to train the models is existing parallel data, which consists of 1.9M parallel sentences.
For the model variations, we compare Relative Position (Shaw et al., 2018), Dynamic Convolutions (Wu et al., 2019) and Transformer Base and Big settings (Vaswani et al., 2017). As shown in Table 6, The best result is produced by Transformer Big setting, which is used as default when training on large datasets.
For the token granularities, we report the result with four tokenization methods: Word→Word, Character→Character, BPE→BPE and BPE→Character. As shown in table 7, adopting BPE in source side and Character in target side performs better than other token granularities, which is used in our submission systems.
We notice that there exits a large divergence between the two translation directions when using complete parallel data and process with parallel data filtering. We have verified the result and the parallel data in depth. We find that the quality of Japanese data is lower. For example, there are sentences consist of punctuations only, which may harm the target side language model learned by the decoder. After parallel data filtering, the invalid sentences are removed and thus the translation quality of ZH-JA is improved.
We also find that the provided development data is mainly from spoken domain, and thus we use our collected data as extra development set from other domain to investigate the general performance of single model. The result is shown in table 8. We  find there exists a small gap between provided development data and our collected data, which indicates that the domain information may further improve the translation quality, and thus leads us to utilize domain transfer and ensemble techniques. Note that the extra development set is only used in single models. When it comes to system combination, these data are added into News domain since the size of News domain data in parallel dataset is extremely smaller than other domains (Section 3.4).

Conclusion
We present the CASIA's neural machine translation system submitted to IWSLT 2020 Chinese→Japanese and Japanese→Chinese open domain translation task. Our system is built with Transformer architecture and incorporating the following techniques: • Deliberate data pre-processing and filtering • Back-translation of selected monolingual corpus • Knowledge distillation from multi polity teacher models • Domain classification and weighted domain model ensemble As a result, our final system achieves substantial improvements over baseline system.