University of Tsukuba’s Machine Translation System for IWSLT20 Open Domain Translation Task

In this paper, we introduce University of Tsukuba’s submission to the IWSLT20 Open Domain Translation Task. We participate in both Chinese→Japanese and Japanese→Chinese directions. For both directions, our machine translation systems are based on the Transformer architecture. Several techniques are integrated to boost the performance of our models: data filtering, large-scale noised training, model ensembling, reranking, and postprocessing. As a result, our systems achieve a BLEU score of 33.0 for Chinese→Japanese translation and 32.3 for Japanese→Chinese translation.


Introduction
In this paper, we introduce University of Tsukuba's submission to the IWSLT20 Open Domain Translation Task. The goal of this shared task is to promote research on translation between Asian languages, the exploitation of noisy web corpora for machine translation, and smart processing of data and provenance. For an overview of the IWSLT20 Open Domain Translation Task, readers may refer to Ansari et al. (2020). We participated in both Chinese→Japanese and Japanese→Chinese directions.
It is widely acknowledged that a neural machine translation (NMT) system requires a large amount of training data. Meanwhile, training NMT models can take a long time and consume substantial computing resources. Given our limited computing power, our goal is to boost the performance of NMT systems with fewer and smaller components that require less time and fewer computing resources. Our models are based on the base Transformer described in Vaswani et al. (2017), without special parameter tuning. For data preprocessing, we first apply standard methods that have been widely used in recent research, including punctuation normalization, tokenization, and byte pair encoding (Sennrich et al., 2016a). In addition, we apply manual rules to clean the provided parallel data, the monolingual data, and the synthetic data that we generate ourselves for data augmentation. To make better use of all provided data, we back-translate both the source-side and the target-side monolingual data. Inspired by noised training methods (Edunov et al., 2018; Wu et al., 2019; He et al., 2020), we add noise to the source sentences of the synthetic parallel corpus to make the translation models more robust and to improve their generalization ability. In the inference phase, we apply a model ensemble strategy and keep the n-best hypotheses for a subsequent multi-feature reranking process. Finally, postprocessing is applied to correct inconsistent punctuation forms.
This paper is organized as follows: in Section 2, we describe our data preprocessing and data filtering. Details of each component of our systems are described in Section 3. The results of the experiments for each component and language pair are summarized in Section 4.

Data
For all of our submissions, we only use datasets provided by the organizers.
As parallel training data, we use the concatenation of the web crawled parallel filtered and existing parallel datasets. For data augmentation, since the unaligned data are extremely large, we instead use each side of the pre-filtered sentence pairs (web crawled parallel unfiltered) separately as monolingual data. The development dataset provided by the organizers is used for development and model evaluation.

Data Preprocessing
The provided parallel training data contains different forms of characters, for example, full-width form and half-width form. To get a normalized form, we remove all the spaces between characters and perform NFKC-based text normalization.
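The normalization step can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch rather than the exact script used for the submission; the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(line: str) -> str:
    """Remove all whitespace between characters, then apply NFKC normalization."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the ideographic space U+3000.
    no_spaces = "".join(line.split())
    # NFKC folds full-width Latin letters, digits, and punctuation to their
    # half-width forms and half-width katakana to full-width katakana.
    return unicodedata.normalize("NFKC", no_spaces)
```

Note that NFKC also maps full-width punctuation to half-width forms, so text produced by a model trained on such data will tend to contain half-width punctuation.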
Chinese sentences are segmented with the default mode of Jieba, and Japanese sentences are segmented with MeCab using the mecab-ipadic-NEologd dictionary. To limit the vocabulary size of the NMT models, we use byte pair encoding (BPE) (Sennrich et al., 2016b) with 32K merge operations, learned separately for each side.

Data Filtering
The provided datasets are built from web data; they are very noisy and can potentially degrade the performance of a system. To clean them, we filter the parallel training corpus with the following rules:
• Filter out duplicate sentence pairs.
• Filter out sentence pairs which have identical source-side and target-side sentences.
• Filter out sentence pairs with more than 10 punctuation marks or an imbalanced punctuation ratio.
• Filter out sentence pairs in which half or more of the tokens are numbers or letters.
• Filter out sentence pairs which contain HTML tags or emoji.
• Filter out sentence pairs in the wrong language as identified by langid.
• Filter out sentence pairs whose length ratio exceeds 1.5.
• Filter out sentence pairs with fewer than 3 words or more than 100 words.

Table 1: Statistics of the provided data. Note that we treat the two sides of the provided unfiltered dataset separately as monolingual data; therefore, before data filtering, the number of monolingual sentences is identical for both sides.
The same data filtering strategies, except those designed for sentence pairs, are also applied to the monolingual data. The numbers of sentences and BPE subwords in the preprocessed datasets are listed, in millions, in Table 1.
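Several of the filtering rules above can be sketched as a simple predicate over tokenized sentence pairs. This is an illustrative sketch, not the authors' script: the function names are hypothetical, the punctuation limit is applied to each side separately (an assumption, since the text does not say whether it is per side or per pair), and the language-identification, emoji, and punctuation-ratio rules are omitted because their exact criteria are not given.

```python
import re
import unicodedata

HTML_TAG = re.compile(r"<[^>]+>")
ALNUM_TOKEN = re.compile(r"^[A-Za-z0-9]+$")


def count_punct(tokens):
    """Count characters whose Unicode category is a punctuation class (P*)."""
    return sum(
        1 for tok in tokens for ch in tok
        if unicodedata.category(ch).startswith("P")
    )


def keep_pair(src, tgt, max_punct=10, max_ratio=1.5, min_len=3, max_len=100):
    """Return True if a tokenized sentence pair passes the sketched rules."""
    if src == tgt:  # identical source and target
        return False
    for toks in (src, tgt):
        if not (min_len <= len(toks) <= max_len):  # too short or too long
            return False
        if count_punct(toks) > max_punct:  # too much punctuation
            return False
        n_alnum = sum(1 for t in toks if ALNUM_TOKEN.match(t))
        if 2 * n_alnum >= len(toks):  # half or more numbers/letters
            return False
        if HTML_TAG.search(" ".join(toks)):  # HTML residue
            return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def filter_corpus(pairs):
    """Drop duplicate pairs, then apply keep_pair to the remainder."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (tuple(src), tuple(tgt))
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(src, tgt):
            kept.append((src, tgt))
    return kept
```

For monolingual data, only the per-sentence checks inside the loop would apply.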

Baseline System
We adopt the base Transformer as our machine translation system, following the settings described in Vaswani et al. (2017): 6 encoder layers, 6 decoder layers, 8 attention heads, an embedding dimension of 512, and a feed-forward network dimension of 2048. The dropout probability is 0.2. For all experiments, we adopt the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, and ε = 1e-8. The learning rate follows an inverse square root schedule with a maximum learning rate of 0.0005 and 4000 warmup steps. We train all our models using fairseq (Ott et al., 2019) on two NVIDIA 2080Ti GPUs with a batch size of around 4096 tokens. During training, we employ label smoothing with a value of 0.1. We average the last 5 model checkpoints and use the averaged model for decoding.
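The learning rate schedule described above can be written as a short function. This sketch assumes a linear warmup starting from an initial rate of 0, which the text does not state; the function name `inverse_sqrt_lr` and the `init_lr` parameter are hypothetical.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000, init_lr=0.0):
    """Learning rate at a given update step under an inverse square root schedule."""
    if step < warmup_steps:
        # Linear warmup from init_lr to peak_lr (init_lr = 0 is an assumption).
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5
```

With these settings the rate peaks at 0.0005 at step 4000 and halves each time the step count quadruples, for example at step 16000.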

Large-scale Noised Training
It is widely known that the performance of an NMT system relies heavily on the amount of parallel training data. Back-translation (Sennrich et al., 2016a) and self-training (Zhang and Zong, 2016) are effective and commonly used data augmentation techniques that leverage monolingual data to augment the original parallel dataset.
In our case, we leverage both the source-side and target-side monolingual data to help training. Specifically, we first train a baseline NMT model on the provided parallel corpus. Then, target-side monolingual sentences are translated by a target-to-source NMT model, and source-side monolingual sentences are translated by a source-to-target NMT model. Inspired by noised training methods (Edunov et al., 2018; Wu et al., 2019; He et al., 2020), we add noise to the source sentences of the synthetic parallel corpus to make the translation models more robust and to improve their generalization ability. Specifically:
• We randomly replace a word with a special unknown token with probability 0.1.
• We randomly delete words with probability 0.1.
• We randomly swap words, with the constraint that swapped words are no more than 3 positions apart.
Then we add the synthetic parallel data to the original parallel data and train a new NMT model.
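The three noise operations can be sketched as follows. This is one plausible reading, not the authors' implementation: the swap step is realized as a local shuffle in which each word is sorted by its index plus a random offset, a common way to keep every word within a fixed distance of its original position. The function name `add_noise`, its parameters, and the `<unk>` symbol are hypothetical.

```python
import random


def add_noise(tokens, rng, p_unk=0.1, p_drop=0.1, max_dist=3, unk="<unk>"):
    """Apply word blanking, word deletion, and a local shuffle to a token list."""
    # 1) Replace each word with the unknown token with probability p_unk.
    noised = [unk if rng.random() < p_unk else tok for tok in tokens]
    # 2) Delete each word with probability p_drop.
    kept = [tok for tok in noised if rng.random() >= p_drop]
    if not kept and noised:
        kept = [noised[0]]  # assumption: never emit an empty source sentence
    # 3) Local shuffle: sort by index + U(0, max_dist + 1), so that no word
    #    moves more than max_dist positions from where it started.
    keys = [i + rng.uniform(0, max_dist + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]
```

A call such as `add_noise(sentence.split(), random.Random(0))` would be applied only to the source side of each synthetic pair, leaving the target side untouched.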

Model Ensemble
Model ensemble is a common method to boost translation performance. However, due to the huge amount of training data and our limited computing power, we do not ensemble multiple strong models with different random seeds. Instead, we only combine three models trained on filtered data with different random seeds and one model trained through large-scale noised training. All individual models used for model ensemble are the average of the last 5 model checkpoints.
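Checkpoint averaging, used for every individual model above, amounts to an element-wise mean over parameter values. The sketch below illustrates the idea on plain Python dictionaries that map parameter names to flat lists of floats; real checkpoints hold tensors, and the function name `average_checkpoints` is hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across checkpoints with identical keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged
```

In the setting described here, the input would be the last 5 checkpoints of a single training run.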

Reranking
Reranking is a method of improving translation quality by rescoring a list of n-best hypotheses. For our submissions, we generate n-best hypotheses with a source-to-target NMT model and then train a reranker using k-best MIRA (Cherry and Foster, 2012). The features we use for reranking are:
• Left-to-right NMT Feature: We keep the perplexity assigned by the original translation model as the L2R reranking feature.
• Right-to-left NMT Feature: To address the exposure bias problem, we train a right-to-left (R2L) NMT model using the same training data but with inverted target word order. We invert the hypothesis sequence and use the perplexity score given by the right-to-left NMT model as the R2L feature.
• Target-to-Source NMT Feature: To reduce inadequate translation, we use the perplexity score given by the target-to-source NMT model as the T2S feature. In addition, we use the score generated by a target-to-source right-to-left model as a reranking feature.
• Length Feature: We also design a length feature that quantifies the difference between the length ratio of each sentence pair and the optimal ratio. The optimal ratio is determined from the parallel training corpus.
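Given such features, reranking reduces to picking the hypothesis with the highest weighted feature sum. The sketch below assumes every feature is expressed so that higher is better (for instance, a negated perplexity) and uses hand-set weights purely for illustration; in the system described here the weights are learned with k-best MIRA. The function names, feature names, and dictionary layout are hypothetical.

```python
def length_feature(src_len, hyp_len, optimal_ratio):
    """Negative distance between the hypothesis/source length ratio and the optimum."""
    return -abs(hyp_len / src_len - optimal_ratio)


def rerank(hypotheses, weights):
    """Return the hypothesis with the highest weighted sum of feature scores."""
    def score(hyp):
        return sum(w * hyp["features"][name] for name, w in weights.items())
    return max(hypotheses, key=score)


# Usage with made-up feature values for two candidate translations.
nbest = [
    {"text": "hyp A", "features": {"l2r": -1.0, "r2l": -2.0, "length": -0.5}},
    {"text": "hyp B", "features": {"l2r": -1.2, "r2l": -1.0, "length": -0.1}},
]
best = rerank(nbest, {"l2r": 1.0, "r2l": 1.0, "length": 1.0})
```

With equal weights the second hypothesis wins despite its lower L2R score, which shows how the additional features can overturn the original model's ranking.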

Postprocessing
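Since we perform NFKC-based text normalization on the training corpus, which converts full-width punctuation to half-width forms, the generated hypotheses also contain half-width punctuation. We therefore apply a postprocessing step to the generated hypotheses. Specifically, we convert half-width punctuation marks to their full-width counterparts.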
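This conversion can be sketched with a character translation table. The set of punctuation marks below is an assumption, since the text does not list which characters are converted; the period is deliberately left out of this sketch because it also appears in decimal numbers. The names `HALF_TO_FULL` and `to_fullwidth_punct` are hypothetical.

```python
# Assumed mapping from half-width to full-width punctuation marks.
HALF_TO_FULL = str.maketrans({
    ",": "，",
    "?": "？",
    "!": "！",
    ":": "：",
    ";": "；",
    "(": "（",
    ")": "）",
})


def to_fullwidth_punct(text: str) -> str:
    """Replace half-width punctuation in a hypothesis with full-width forms."""
    return text.translate(HALF_TO_FULL)
```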

Chinese→Japanese
For Zh→Ja, data filtering plays an important role and improves our baseline performance on the development data by 5.08 BLEU points. Adding synthetic data with large-scale noised training slightly hurts performance. Applying model ensembling and reranking then yields further gains of 0.61 and 0.42 BLEU points, respectively. Finally, applying postprocessing to the generated hypotheses gives another 0.02 BLEU points. The final BLEU score of our submission is 33.0.

Conclusion
This paper describes University of Tsukuba's submission to the IWSLT20 Open Domain Translation Task. We trained standard Transformer models and adopted various techniques for better performance, including data filtering, large-scale noised training, model ensembling, reranking, and postprocessing. We demonstrated the effectiveness of our approach and achieved BLEU scores of 33.0 for Chinese→Japanese translation and 32.3 for Japanese→Chinese translation.