OPPO’s Machine Translation System for the IWSLT 2020 Open Domain Translation Task

In this paper, we present our machine translation system for the Chinese-Japanese bidirectional translation task (a.k.a. the open domain translation task) of IWSLT 2020. Our model is based on the Transformer (Vaswani et al., 2017) and is supported by a number of popular data preprocessing and augmentation methods that have been widely shown to be effective. Experiments show that these methods improve the baseline model steadily and significantly.


Introduction
Machine translation, proposed even before the first computer was invented (Hutchins, 2007), has always been a prominent research topic in computer science. In recent years, with the renaissance of neural networks and the emergence of the attention mechanism (Sutskever et al., 2014; Bahdanau et al., 2014), this old area has stepped into a new era. Furthermore, the Transformer architecture attracted much attention immediately after it was published and now dominates the whole field.
Although the Transformer has achieved many SOTA results, it has a tremendous number of parameters and is therefore hard to fit on small datasets, so it places a high demand on large, good-quality data sources. While high-quality parallel corpora are scarce, document-level comparable data is relatively easy to crawl, so an effective and accurate way of mining aligned sentence pairs from such large but noisy data, to enrich the small parallel corpus, could benefit a machine translation system a lot. Besides, most large open-access machine translation datasets are currently based on English, and many of them are translated to or from another European language. As many popular European languages are fusional languages, most corpora pair two fusional languages. We are therefore interested in whether existing model architectures and training techniques can be applied to translation between Asian languages and other types of languages, for example between an analytic language such as Chinese and an agglutinative language such as Japanese.
In this paper, we describe our system for the IWSLT 2020 open domain text translation task, which aims to translate between Chinese and Japanese. Besides describing how we trained the model used to generate the final result, this paper also describes how we mined extra parallel sentences from a large but noisy dataset released by the organizer, as well as several experiments inspired by the writing systems of Chinese and Japanese.

Data Preprocessing
Four sets of parallel data are provided in the campaign:
• The original file, which was crawled from various websites and is very large. According to the campaign page, the corpus contains 15.7 million documents, composed of 941.3 million Japanese sentences and 928.7 million Chinese sentences. The sentence counts show immediately that the original corpus is not parallel, so it cannot be used directly for training the model. Mining a parallel corpus from this huge file is another piece of work we did during the campaign, which is covered in another section of this report.
• A prefiltered file, which consists of 161.5 million "parallel" sentences. We tried to filter this dataset to extract parallel lines; this work is also presented later.
• A filtered file, which has 19 million sentences and was officially aligned from the data described in the previous item.
• An existing parallel file, which contains 1.96 million sentence pairs, obtained by the provider from existing Japanese-Chinese parallel datasets.
However, our investigation showed that even the sentences in the existing parallel file are not fully aligned. For example, the sentence "1994 年 2 月、ジャスコはつるまいの全株式を取得。" (meaning "Jasco bought all shares in February, 1994") in the corpus is paired with "次年 6 月乌迪内斯买断了他的全部所有权。", which means "Udinese bought out all his ownership in June in the next year", so this pair is clearly noise. Since deep neural networks demand high-quality input data, we combined the filtered file and the existing parallel file into a dataset of 20 million pairs (referred to as the combined dataset hereafter) and preprocessed it further in two main steps:

Rule Based Preprocessing and Filtering
We first feed the combined dataset to a data preprocessing pipeline that includes the following steps:
• Converting Latin letters to lower case. This step helps to decrease the size of the vocabularies, but since the evaluation is case-sensitive, we apply a further postprocessing step: after generating the results from the model, we extract all Latin words from the sources and the hypotheses, and convert the words on the hypothesis side according to the case forms of their counterparts on the source side.
• For Chinese, converting traditional Chinese characters to their simplified forms; for Japanese, converting simplified Chinese characters to kanji.
• Converting full-width characters to half-width.
• Normalizing punctuation and other special characters, e.g. different forms of the hyphen.
• Removing extra spaces around the decimal point of floating-point numbers.
• Removing unnecessary spaces.
Because neither Chinese nor Japanese uses spaces to mark word boundaries, we apply word segmentation on each side (a branched experiment will be presented later in this report). For Chinese we use PKUSEG (Luo et al., 2019) and for Japanese we use mecab. After inspecting the preprocessed data, we filter out sentence pairs in the following order:
1. Sentences that contain too many meaningless symbols (including emojis, kaomojis and emoticons such as "(＠＾ ＾)"). Although these symbols can carry semantic information, we do not consider them important to a machine translation system.
2. Sentence pairs that have an abnormal length ratio, where "length" is the number of words in a sentence. As Chinese characters are also an important constituent of the Japanese writing system, we do not expect the Japanese sentence to be much longer than the Chinese one; on the other hand, since Japanese is an agglutinative language, it always needs several additional (sub)words to express its syntactic structure, so the Japanese sentence cannot be too short either. We set the upper bound of the word count ratio between Japanese and Chinese to 2.4 and the corresponding lower bound to 0.8.
3. Sentence pairs that occur more than once. We deduplicate them and keep a single pair.
4. Sentence pairs in which the target is simply a copy of the source sentence.
5. Sentence pairs in which the target sentence shares the same first or last 10 characters with the source sentence.
6. Sentence pairs in which Chinese words make up less than 40% of the total word count on the Chinese side. Here a "Chinese word" is defined as a word composed of Chinese characters only.
7. Sentence pairs in which Japanese words make up less than 40% of the total word count on the Japanese side. Here a "Japanese word" is defined as a word composed of kanji or kana only. As Chinese and Japanese each have their own special "alphabets", this step together with the previous one can be seen as a form of language detection.
8. Sentence pairs in which the count of numbers on the Chinese side and the count of numbers on the Japanese side differ by 3 or more.
9. Sentence pairs that cannot be aligned on numbers and Latin letters.
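A few of the rules above (the length ratio bounds, the copy check, the script-share checks and deduplication) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the Unicode ranges chosen for "Chinese characters" and "kana", and the function names, are assumptions of this sketch, and the input is assumed to be already segmented into words.

```python
import re

# Rough script tests; these Unicode ranges are an assumption of this sketch.
HAN = r"\u4e00-\u9fff"    # CJK unified ideographs (hanzi / kanji)
KANA = r"\u3040-\u30ff"   # hiragana and katakana blocks
ZH_WORD = re.compile(f"^[{HAN}]+$")
JA_WORD = re.compile(f"^[{HAN}{KANA}]+$")


def script_ratio(words, pattern):
    """Fraction of words made up only of characters matched by `pattern`."""
    if not words:
        return 0.0
    return sum(1 for w in words if pattern.match(w)) / len(words)


def keep_pair(zh_words, ja_words, min_ratio=0.8, max_ratio=2.4, min_script=0.4):
    """Apply the length-ratio, copy and script-share rules to one segmented pair."""
    if not zh_words or not ja_words:
        return False
    # Rule 2: Japanese/Chinese word count ratio must lie in [0.8, 2.4].
    ratio = len(ja_words) / len(zh_words)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    # Rule 4: the target must not be a plain copy of the source.
    if zh_words == ja_words:
        return False
    # Rules 6 and 7: at least 40% of the words must be in the expected script.
    if script_ratio(zh_words, ZH_WORD) < min_script:
        return False
    if script_ratio(ja_words, JA_WORD) < min_script:
        return False
    return True


def filter_corpus(pairs):
    """Rule 3 (deduplication) followed by the per-pair rules."""
    seen = set()
    kept = []
    for zh_words, ja_words in pairs:
        key = (tuple(zh_words), tuple(ja_words))
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(zh_words, ja_words):
            kept.append((zh_words, ja_words))
    return kept
```

The remaining rules (symbol noise, shared prefixes and suffixes, number and Latin-letter alignment) would slot into `keep_pair` in the same way.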

Alignment Information Based Filtering
The rules listed in the previous subsection filter out sentence pairs with obvious noise, but some pairs still contain subtle noise that cannot be discovered directly. Therefore we use fast_align to align the source and target sentences, generate alignment scores at the sentence level and the word level (see footnote 3), and then further filter the combined dataset by the alignment results. The threshold was set to 16 for the sentence-level alignment score and to 2.5 for the word-level score. After multiple rounds of cleaning, 75% of the provided data was removed, leaving about 5.4M sentence pairs as the foundation of the experiments described in the next section.

Main Task Experiments
With the 5.4M corpus in hand, we further divided the words in the text into subwords (Sennrich et al., 2016b). The BPE codes are trained jointly on the Chinese and Japanese corpora with 32,000 merge operations, but the vocabulary is extracted for each language individually, giving a vocabulary of 31k for Chinese and 30k for Japanese. The vocabularies are shared between the two directions (ja-zh and zh-ja). We trained 8-head Transformer Big models with Facebook FAIR's fairseq (Ott et al., 2019) using the following configuration:
• learning rate: 0.001
• learning rate schedule: inverse square root
The first model, trained on the filtered genuine parallel corpus (i.e. the 5.4M corpus), serves not only as the baseline model for the subsequent experiments but also as a scorer. We rescored the alignment of the sentence pairs using this model and again filtered out about one quarter of the data. The model trained on the refined data improved the BLEU score by 1.7 for zh-ja and 0.6 for ja-zh.
Footnote 3: To get a word-level alignment score, we divide the sentence-level score by the average length of the source sentence and the target sentence.
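As an illustration of the inverse square root learning rate schedule named in the configuration, the sketch below uses the common form with a linear warmup followed by decay proportional to the inverse square root of the step. The peak of 0.001 comes from the configuration; the warmup length of 4,000 steps and the function name are assumptions of this sketch, since the warmup setting for the main task is not stated here.

```python
import math


def inverse_sqrt_lr(step, peak_lr=0.001, warmup_steps=4000, init_lr=0.0):
    """Learning rate at `step`: linear warmup, then inverse square root decay.

    `warmup_steps` and `init_lr` are hypothetical values for illustration.
    """
    if step < warmup_steps:
        # Linear increase from init_lr to peak_lr during warmup.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warmup the rate decays as 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_steps / step)
```

With these values the rate reaches 0.001 at step 4,000 and has halved by step 16,000.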
As many works have shown, back-translation (BT) (Sennrich et al., 2016a) is a common data augmentation method in machine translation research. In addition, Edunov et al. (2018) provide several other ways to back-translate. We applied both approaches; in our experiments, top-10 sampling is effective in the zh-ja direction, while for ja-zh traditional argmax-based beam search is still better.
First, 4M sentence pairs from the original corpus are selected by alignment score and translated by the models obtained in the previous step (one model per direction) to build a synthetic corpus; then, for each direction, a new model is trained on the augmented dataset (5.4M + 4M + 4M = 13.4M pairs). To get better translations, we used an ensemble model to augment the dataset. Note that in this augmentation step we not only introduced 4M back-translated pairs but also generated 4M synthetic target sentences by applying knowledge distillation (KD) (Freitag et al., 2017).
On this mixture of genuine, BT and KD data, we tried one more round of back-translation and knowledge distillation, but saw only a minor improvement.
Afterwards, we trained a language model for each language on the 5.4M parallel corpus using kenlm (Heafield, 2011). With the help of the language models, the 3M Chinese sentences and 4M Japanese sentences with the highest scores were selected from the unaligned monolingual corpus as new input for the BT models, which augmented the mixture dataset to 20.4M pairs (referred to as the final augmented dataset, which will be referenced later in the report), and we did another round of back-translation and knowledge distillation. After these three rounds of iterative BT (Hoang et al., 2018) and KD, several of the best single models were combined into an ensemble model. In the last step, following Yee et al. (2019), we use both a backward model (for the zh-ja task, the ja-zh model is its backward model, and vice versa) and a Transformer language model to rerank the n-best candidates output by the ensemble model and generate the final results.
Detailed results on the dev dataset for each intermediate step are shown in Table 1. We strictly followed the organizer's requirement to build a constrained system, meaning that we did not add any external data, nor did we make use of the test data in any form other than generating the final result.
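To make the difference from argmax decoding concrete, here is a minimal sketch of a single top-k sampling step with k = 10: instead of always taking the most probable token, the next token is drawn from the renormalized distribution over the ten most probable tokens. The function name and the plain-list representation of log-probabilities are assumptions of this sketch; it is not the decoder implementation used in the experiments.

```python
import math
import random


def top_k_sample(log_probs, k=10, rng=random):
    """Sample one token id from the k most probable entries of `log_probs`."""
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)[:k]
    # Renormalise within the top-k set (shift by the max for numerical stability).
    max_lp = max(log_probs[i] for i in top)
    weights = [math.exp(log_probs[i] - max_lp) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

Setting `k=1` reduces this to greedy argmax selection, the per-step choice that beam search generalizes.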
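The reranking step can be illustrated with a simplified log-linear combination of the three scores mentioned above: the forward score from the ensemble, the backward model score and the language model score. This is a hedged sketch only; the weights, the `Candidate` structure and the absence of length normalization are assumptions of this example and not the exact scoring function of Yee et al. (2019) or of our system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    forward: float   # log P(y | x) from the forward (ensemble) model
    backward: float  # log P(x | y) from the backward model
    lm: float        # log P(y) from the target-side language model


def rerank(candidates, w_backward=1.0, w_lm=1.0):
    """Sort n-best candidates by a weighted sum of the three log-scores.

    The weights are hypothetical; in practice they would be tuned on a dev set.
    """
    def score(c):
        return c.forward + w_backward * c.backward + w_lm * c.lm

    return sorted(candidates, key=score, reverse=True)
```

For example, a candidate that the forward model slightly prefers can still be demoted when the backward model finds the source hard to reconstruct from it.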

Branched Task Experiments
Besides the main task experiments described in the previous section, as stated in the introduction, we are also interested in how to mine or extract parallel data from such huge but noisy datasets, and in exploring techniques specific to translating from Chinese to Japanese (and vice versa). This section discusses our work on these two topics.

Filtering the Noisy Dataset
We first tried to extract parallel sentences from the prefiltered dataset of 161.5 million pairs. Since this dataset is "nearly aligned", we assume that if the target side of a given sentence pair does not match the source, the whole pair can be safely dropped, because the counterpart of the source does not exist elsewhere in the corpus. We first use CLD as the language detector to remove sentences that are neither Chinese nor Japanese; this step alone filters out nearly 110 million pairs. Next, we feed the data into the same preprocessing pipeline introduced in the Data Preprocessing section. The preprocessed corpus is then filtered in a way similar to that described in the Data Preprocessing section, with the following additional steps:
• We compared the URL counts of each side and removed the inconsistent line pairs.
• We kept a set of common special characters as a whitelist and removed all other special characters.
• We removed the sentence pairs in which the source side is too similar to the target side. Concretely, we computed the Levenshtein distance between the two sentences and divided it by the average length (in characters) of the text in the pair. If this ratio is above 0.9, we consider the source and the target too similar.
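Two of the additional checks above rely on simple measurements: the number of URLs on each side and the Levenshtein distance normalized by the average character length. The sketch below only computes these quantities; the URL pattern and function names are assumptions of this example, and the comparison against the 0.9 threshold is left to the caller as described in the text.

```python
import re

# A simple URL pattern; the exact pattern used in the pipeline is not specified.
URL = re.compile(r"https?://\S+")


def same_url_count(src, tgt):
    """True if both sides contain the same number of URLs."""
    return len(URL.findall(src)) == len(URL.findall(tgt))


def levenshtein(a, b):
    """Character-level edit distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(src, tgt):
    """Levenshtein distance divided by the average length of the two strings."""
    avg_len = (len(src) + len(tgt)) / 2
    if avg_len == 0:
        return 0.0
    return levenshtein(src, tgt) / avg_len
```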
After the filtering, 14.92 million sentence pairs are kept. Based on them we trained a Transformer base model with Marian (Junczys-Dowmunt et al., 2018), which serves as the baseline model for the current task. 36k BPE merge operations are applied to the remaining sentence pairs, independently for each language, leading to two vocabularies of 50k words each. We use the Adam optimizer with the learning rate set to 3×10^−4 and 16,000 warmup steps, clip-norm set to 0.5, attention dropout set to 0.05 and label smoothing set to 0.1. The decoder searches with a beam width of 6 and a length normalization of 0.8.
To filter the noisy parallel corpora, we followed the dual conditional cross-entropy filtering proposed by Junczys-Dowmunt (2018): for a parallel corpus D, in which the source language is denoted X and the target language Y, two translation models can be trained: model A is trained from X to Y and model B in the reverse direction. Given a sentence pair (x, y) ∈ D and a translation model M, the conditional cross-entropy of the sentence pair, normalized by the target sentence length, can be calculated as

$$H_M(y \mid x) = -\frac{1}{|y|} \sum_{t=1}^{|y|} \log P_M(y_t \mid y_{<t}, x)$$

As we have two models A and B, the two scores can be combined to calculate the maximal symmetric agreement (MSA) of the sentence pair:

$$\mathrm{MSA}(x, y) = \left| H_A(y \mid x) - H_B(x \mid y) \right| + \frac{1}{2} \left( H_A(y \mid x) + H_B(x \mid y) \right)$$

Since MSA(x, y) ∈ [0, +∞), we can rescale the score to (0, 1] by

$$\mathrm{adq}(x, y) = \exp\left(-\mathrm{MSA}(x, y)\right)$$

This method is denoted "adq", adapting the notation proposed in the original paper. We ran a series of experiments in the zh-ja direction, but the results are not as good as we expected. Detailed information is listed in Table 2. We also added dataset C mentioned in Table 2 to the original dataset used for training the baseline model of the main task, but still did not see much improvement. Using the configuration introduced in the main task section, the model's BLEU score is 34.7, only 0.1 points higher than the baseline score listed in Table 1.

| System | zh-ja BLEU | ja-zh BLEU |
| --- | --- | --- |
| Baseline | 34.6 | 32.6 |
| + Filtered by alignment information from baseline model | 36.3 (+1.7, +1.7) | 33.2 (+0.6, +0.6) |
| + 1st round BT using genuine parallel corpus (13.4M pairs) | 37.5 (+2.9, +1.2) | 34.6 (+2.0, +1.4) |
| + 2nd round BT using genuine parallel corpus (13.4M pairs) | 37.6 (+3.0, +0.1) | 34.6 (+2.0, +0.0) |
| + BT using monolingual corpus (20.4M pairs) | 38.8 (+4.2, +1.2) | 35.4 (+2.8, +0.8) |
| + 3rd round BT using both parallel and monolingual corpus (20.4M pairs) | 39.2 (+4.6, +0.4) | 36.0 (+3.4, +0.6) |
| + Ensemble | 40.1 (+5.5, +0.9) | 36.6 (+4.0, +0.6) |
| + Reranking | 40.8 (+6.2, +0.7) | 37.2 (+4.6, +0.6) |

Table 1: Results of the main task experiments, evaluated on the officially provided validation dataset. The improvement in each row is expressed in two forms: absolute improvement (current score − baseline score) and relative improvement (current score − previous step score). Note that, to get a stricter BLEU score, we used SacreBLEU (Post, 2018) to calculate the final BLEU score and did not split words composed of Latin letters and numbers into characters, which differs from the official evaluation process. If the same splitting is applied and the output is evaluated with the officially designated multi-bleu script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu-detok.perl), the score could be higher by 1.x points.
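The dual conditional cross-entropy score described above can be sketched in a few lines. The sketch follows the formulation of Junczys-Dowmunt (2018), in which the two length-normalized cross-entropies are combined as their absolute difference plus their average and the result is mapped to (0, 1] with a negative exponential. The function names and the representation of model outputs as lists of per-token log-probabilities are assumptions of this example.

```python
import math


def cross_entropy(token_log_probs):
    """Length-normalised negative log-likelihood of one sentence.

    `token_log_probs` holds log P(y_t | y_<t, x) for each target token.
    """
    return -sum(token_log_probs) / len(token_log_probs)


def msa(h_forward, h_backward):
    """Disagreement between the two models plus their average cross-entropy."""
    return abs(h_forward - h_backward) + 0.5 * (h_forward + h_backward)


def adq(forward_log_probs, backward_log_probs):
    """Score in (0, 1]; higher means the pair looks more like a translation."""
    h_a = cross_entropy(forward_log_probs)   # model A scores y given x
    h_b = cross_entropy(backward_log_probs)  # model B scores x given y
    return math.exp(-msa(h_a, h_b))
```

A pair is penalized both when the two models disagree (large absolute difference) and when both find the pair unlikely (large average), which is why a single threshold on `adq` can be used for filtering.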

Mining the Unaligned Dataset
Besides the officially released prefiltered dataset, we also turned our attention to the original, huge but dirty dataset and tried several methods to clean it. As noted earlier, Chinese and Japanese each have their own closed character set, so we first simply remove the lines that do not contain any Chinese characters (for the Chinese corpus) or any kana or kanji (for Japanese lines). This simple step directly removed about 400 million lines. We also applied the same preprocessing described before, such as language detection, deduplication and the cleaning pipeline. After this preprocessing, 460 million lines remained.
For the remaining data, as it is not aligned, we cannot follow the filtering process of the previous subsection. However, we assumed that if the Japanese counterpart of a Chinese sentence can be found, it can only exist in the same document. As the dataset gives document boundaries, we split the whole dataset into millions of documents and use hunalign (Varga et al., 2007) to mine aligned pairs within each document (the dictionaries are extracted from a cleaned version of the combined dataset). While still holding the intra-document alignment assumption, we kept reading documents and did not run hunalign until the accumulated lines reached 100k (without breaking a document), to allow for possible cross-document alignment. We kept all lines whose alignment score is higher than 0.8 and whose word count ratio between source and target falls into [0.5, 2]. Then we removed all lines containing URLs, replaced numbers and English words with more than 3 letters by tags, and deduplicated again, leaving only 5.5 million lines. We trained a Transformer base model with Marian on the dataset used for training the baseline model in the main task experiments, applying the same configuration given in the previous subsection, and ranked the results using bleualign (Sennrich and Volk, 2010; Sennrich and Volk, 2011), finally keeping 4 million lines. This dataset was added to the original dataset used in the main task, and a minor improvement (+0.6 BLEU) can be seen. However, due to the time limit, this part of the data was not used further in the main task experiments.

Table 2: zh-ja experiments using data filtered from the prefiltered "parallel" corpus. BLEU is calculated by SacreBLEU in the same way as described in the main task experiments section.
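The post-alignment filtering described above (score threshold, word count ratio window, URL removal and tag replacement) can be sketched as below. The tag strings `<en>` and `<num>`, the URL pattern and the function names are hypothetical choices for this illustration; the text does not specify them.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
LATIN_WORD = re.compile(r"[A-Za-z]{4,}")   # English words with more than 3 letters
URL = re.compile(r"https?://|www\.")


def keep_alignment(score, src_words, tgt_words, min_score=0.8, low=0.5, high=2.0):
    """Keep pairs scoring above 0.8 whose word count ratio lies in [0.5, 2]."""
    if score <= min_score or not src_words or not tgt_words:
        return False
    ratio = len(src_words) / len(tgt_words)
    return low <= ratio <= high


def has_url(line):
    """True if the line contains something that looks like a URL."""
    return URL.search(line) is not None


def mask(line, word_tag="<en>", num_tag="<num>"):
    """Replace long Latin words and numbers with placeholder tags."""
    # Words are replaced first so that the tags themselves are never re-matched.
    line = LATIN_WORD.sub(word_tag, line)
    return NUMBER.sub(num_tag, line)
```

Masking before deduplication lets lines that differ only in a number or a product name collapse into a single entry.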

Character Based Models and Some Variations
From the perspective of writing system research, the Chinese character system is a typical logographic system, meaning that a single character can carry meaningful semantic information, which differs from the phonological writing systems widely used in the world. Previous research (Li et al., 2019) argues that for Chinese, character-based models even perform better than subword-based models. Moreover, for the Japanese language, literacy "was introduced to Japan in the form of the Chinese writing system, by way of Baekje before the 5th century", and even today Chinese characters (hanzi; 汉字 in simplified Chinese, 漢字 in traditional Chinese) are still important components of the Japanese writing system (called kanji in Japanese, written 漢字), so intuitively the characters of the two languages could have a strong mapping relationship. Ngo (2019) also shows that for Japanese-Vietnamese machine translation, character-based models have advantages over traditional methods. As both Vietnamese and Japanese have been influenced by the Chinese language, it is reasonable to try character-based machine translation systems on the Chinese ⇔ Japanese language pair. Inspired by this intuition and the previous related work, we further split the subwords in the final augmented dataset (presented in the main task experiments) into characters in three different ways:
• Split CJK characters (hanzi in Chinese and kanji in Japanese) only, since we assume that the characters of these two sets are highly related.
• Split CJK characters and katakana (片仮名 in kanji). In the Japanese writing system, besides kanji, another component is kana (仮名 in kanji), which is a syllabic system (one character corresponds to one syllable). Kana further consists of a pair of syllabaries: hiragana (平仮名 in kanji) and katakana, the latter generally used to transliterate loanwords (including foreign names). Although a single katakana character does not carry semantic information and only imitates the pronunciation, the same situation exists in Chinese too: when transliterating foreign names, a Chinese character is used only to show the pronunciation and loses the meaning it could otherwise have. Therefore, katakana can also be roughly mapped to Chinese characters.
• Split CJK characters and all kana.
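The three splitting variants listed above can be sketched as a single function parameterized by which scripts are broken into single characters. The Unicode ranges and mode names are assumptions of this sketch, and it ignores BPE continuation markers, which a real pipeline would have to handle.

```python
import re

# Unicode blocks used to recognise each script (an assumption of this sketch).
HAN = "\u4e00-\u9fff"
HIRAGANA = "\u3040-\u309f"
KATAKANA = "\u30a0-\u30ff"

MODES = {
    "cjk": HAN,                              # hanzi / kanji only
    "cjk+katakana": HAN + KATAKANA,          # plus katakana
    "cjk+kana": HAN + HIRAGANA + KATAKANA,   # plus all kana
}


def split_chars(tokens, mode="cjk"):
    """Split characters of the selected scripts into single-character tokens."""
    pattern = re.compile(f"([{MODES[mode]}])")
    out = []
    for token in tokens:
        # re.split with a capturing group keeps the matched characters;
        # empty strings produced between adjacent matches are dropped.
        out.extend(piece for piece in pattern.split(token) if piece)
    return out
```

For the tokens `["東京タワー", "です"]`, the `cjk` mode splits only the two kanji, `cjk+katakana` also splits the three katakana characters, and `cjk+kana` splits every character.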
For each direction, we trained four different Transformer Big models: three using the splitting methods described above and one subword-based model as the baseline. In this series of experiments we used FAIR's fairseq, with clip-norm set to 0, max-tokens to 12,200, update-freq to 8, dropout to 0.1 and warmup-updates to 15,000. The length penalty differs among the models; we set the optimal value for each according to the results on the validation set. Surprisingly, however, no improvement can be observed, and in the zh-ja direction the character-based models generally perform worse (detailed results are listed in Table 3). Extra work is needed to find the reason; one possible explanation is that the large amount of back-translated synthetic data, which was generated by subword-based models, changed the latent data distribution.

Final Results
According to the official evaluation results provided by the organizer, our BLEU score is 32.9 for the ja-zh direction and 30.1 for zh-ja.
However, from our observation of the dev dataset, we found that most of the numbers and Latin words are written in full-width characters, so we added an extra postprocessing step to convert all the numbers and Latin words in our final zh-ja submission to full-width characters. For example, "2008" was converted to "２００８" (whether a letter or a digit is written in half-width or full-width form does not change its meaning). Our contrastive result, in which all the numbers and Latin words are written in half-width characters (this is the only difference from our primary submission), scored 34.8, an improvement of nearly 5 points. The contrastive result is generated by the same model we trained on the constrained dataset. All the results reported above are shown in Table 4.

Table 4: Leaderboard released officially just after the submission. Scores shown in the table are character-level BLEU calculated by multi-bleu (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu-detok.perl). Our results are shown in bold and the contrastive one is marked with an additional asterisk (*). At the time of submitting the camera-ready version of this report, the leaderboard had not marked which systems are unconstrained.
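The width conversion discussed above is a fixed offset in Unicode: the full-width forms of the printable ASCII characters occupy U+FF01 to U+FF5E, exactly 0xFEE0 above their half-width counterparts. The sketch below shows both directions; the function names are assumptions of this example, and `to_fullwidth` deliberately touches only ASCII letters and digits, mirroring the "numbers and Latin words" wording.

```python
OFFSET = 0xFEE0  # distance between ASCII and its full-width forms


def to_fullwidth(text):
    """Convert ASCII letters and digits to their full-width forms."""
    return "".join(
        chr(ord(c) + OFFSET) if (c.isascii() and c.isalnum()) else c
        for c in text
    )


def to_halfwidth(text):
    """Convert full-width ASCII variants (and the ideographic space) back."""
    out = []
    for c in text:
        code = ord(c)
        if 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - OFFSET))
        elif code == 0x3000:          # ideographic space
            out.append(" ")
        else:
            out.append(c)
    return "".join(out)
```

Because character-level BLEU compares code points, the two forms of the same digit count as a mismatch, which is consistent with the score difference reported between the primary and contrastive submissions.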

Conclusion and Future Work
In this report, we describe our work on the Chinese-Japanese and Japanese-Chinese open domain translation task. The system we submitted is a neural MT model based on the Transformer architecture. During the experiments, many techniques, such as back-translation, ensembling and reranking, were applied and proved effective for the MT system. Parallel data extraction, noisy data filtering and character-based models were also explored and discussed. They are not yet integrated into our systems, and much work remains to find proper ways to optimize these procedures and models, or to establish their limitations.