ISTIC’s Neural Machine Translation System for IWSLT’2020

This paper introduces the technical details of the machine translation system of the Institute of Scientific and Technical Information of China (ISTIC) for the 17th International Conference on Spoken Language Translation (IWSLT 2020). ISTIC participated in both translation tasks of the Open Domain Translation track: the Japanese-to-Chinese MT task and the Chinese-to-Japanese MT task. The paper mainly elaborates on the model framework, data preprocessing methods and decoding strategies adopted in our system. In addition, the system performance on the development set is given under different settings.


Introduction
This paper describes the neural machine translation (NMT) system of the Institute of Scientific and Technical Information of China (ISTIC) for the 17th International Conference on Spoken Language Translation (IWSLT 2020) (Ebrahim et al., 2020). ISTIC participated in the Japanese-to-Chinese and Chinese-to-Japanese MT tasks of the Open Domain Translation track.
In this evaluation, we adopted the Google Transformer NMT architecture (Vaswani et al., 2017) as part of our system. We used the data released by the organizer and applied general and specific preprocessing methods to the training and development data. Several corpus filtering methods were explored to improve the quality of the training data. A corpus filtering method based on Elasticsearch was used to select development data similar to the test data. We adopted a model averaging strategy in the decoding phase, and different results were combined in a post-processing stage to obtain the final translation. The performance of the system is compared under different settings in the two translation directions and further analyzed.
System Architecture
Figure 1 shows the flow chart of ISTIC's NMT system in this evaluation. Our model architecture, data processing and decoding strategy are given below.
Our baseline system in this evaluation is the Transformer (Vaswani et al., 2017), which is based on a full attention mechanism and includes an encoder and a decoder, as shown in Figure 2. The Transformer does not use a recurrent neural network (Cho et al., 2014) or a convolutional neural network (Gehring et al., 2017), but is based entirely on the attention mechanism. It allows parallel computation, speeds up model training, further alleviates the long-distance dependency problem and improves translation quality (Vaswani et al., 2017).
The encoder and decoder are each formed by stacking N layer blocks. Each encoder layer contains two sub-modules, namely a multi-head self-attention module and a feed-forward neural network module. The multi-head self-attention module divides the dimension of the hidden state into multiple parts, applies the self-attention function to each part separately, and concatenates the resulting output vectors. The multi-head mechanism enables the model to attend to feature information from different positions and different sub-spaces. The multi-head attention method includes two steps: 1) dot-product attention calculation; 2) multi-head attention calculation. Dot-product attention is calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q is the query vector, K is the key vector, V is the value vector, and $d_k$ is the dimension of the hidden layer state. On the basis of dot-product attention, multi-head attention is calculated as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
where $W^{O}$ is the matrix parameter. The attention value of each head is:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
Each layer of the decoder is composed of three sub-modules. In addition to the two modules similar to those of the encoder, a decoder-encoder attention module is added between them so that the decoder can attend to source-language information during decoding. To avoid the problem that too many layers make the model difficult to converge, both the encoder and the decoder use residual connections and layer normalization. To give the model the position information of the input sentence, additional position encoding vectors are added to the input layer of the encoder and decoder.
After the decoder obtains a hidden state, the Transformer feeds it into the softmax layer, which scores the candidate vocabulary to produce the final translation result.
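The two attention steps described in this section can be illustrated with a minimal, dependency-free sketch. It is not the tensor2tensor implementation used in the system; the helper names (`softmax`, `matmul`, `attention`, `multi_head`) and the list-of-rows matrix representation are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, heads, w_o):
    """Multi-head attention.

    heads is a list of (W_q, W_k, W_v) projection triples, one per head;
    w_o is the output projection applied to the concatenated heads.
    """
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)
```

Each row of the attention weights sums to 1, so every output row is a weighted average of the value vectors; a query is weighted most strongly toward the keys it has the largest dot product with.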

Corpus preprocessing
In this evaluation we use the data of the Open Domain Translation track released by the organizers of IWSLT 2020, as shown in Table 1. It contains existing Japanese-Chinese parallel data (Data 2) and web-crawled data (Data 1, Data 3, Data 4). The quality of Data 2 is much better than that of the web-crawled data. Therefore, a two-stage preprocessing method is designed, consisting of a general preprocessing stage and a specific preprocessing stage.
General preprocessing stage: Due to time limitations, we did not use Data 4; preprocessing was performed only on Data 1, Data 2 and Data 3. Among the operations, the filtering of adjacent similar sentences calculates the Dice similarity (Dice, 1945) between the current sentence and the previous sentence in the corpus, on either the source-language side or the target-language side, and removes the current sentence pair if the Dice similarity exceeds 0.9. Sentence length filtering removes sentence pairs whose source or target sentence length is 0 or exceeds 50, and sentence length ratio filtering removes sentence pairs whose ratio of source sentence length to target sentence length falls outside the range [0.2, 5]. Since a certain percentage of sentence pairs in the corpus use the same language for the source and target sentences, the language token ratio method (Lu et al., 2018) is used to eliminate sentence pairs in which the proportion of Japanese or Chinese words is smaller than a threshold, here set to 0.1. Both Japanese and Chinese word segmentation are performed with the lexical tool Urheen.
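The adjacent-similarity, length and length-ratio filters described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function names are hypothetical, sentences are assumed to be already segmented into token lists, and the Dice similarity is assumed to be computed over token sets, which the text does not specify. The language token ratio filter is not covered.

```python
def dice(a_tokens, b_tokens):
    """Dice similarity 2|A∩B| / (|A| + |B|) over token sets (assumed unit)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def keep_pair(src, tgt, prev_src=None, prev_tgt=None,
              max_len=50, ratio_range=(0.2, 5.0), dice_max=0.9):
    """Return True if the (src, tgt) token-list pair passes all three filters."""
    # Sentence length filtering: drop empty or overlong sentences.
    if not src or not tgt or len(src) > max_len or len(tgt) > max_len:
        return False
    # Sentence length ratio filtering.
    ratio = len(src) / len(tgt)
    if not ratio_range[0] <= ratio <= ratio_range[1]:
        return False
    # Adjacent similar sentence filtering on either language side.
    if prev_src is not None and dice(src, prev_src) > dice_max:
        return False
    if prev_tgt is not None and dice(tgt, prev_tgt) > dice_max:
        return False
    return True


def filter_corpus(pairs):
    """Filter a sequence of (src_tokens, tgt_tokens) pairs in corpus order."""
    kept, prev = [], (None, None)
    for src, tgt in pairs:
        if keep_pair(src, tgt, *prev):
            kept.append((src, tgt))
        # The comparison is with the previous corpus sentence, kept or not.
        prev = (src, tgt)
    return kept
```

The thresholds default to the values given in the text (0.9, 50, and [0.2, 5]) and can be overridden per call.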
Specific preprocessing stage: The quality of the training corpus has a great influence on the performance of a machine translation model. The web-crawled data is large in scale but contains a great amount of noise; thus, the following specific preprocessing operations are performed on Data 1 and Data 3. The GIZA++ tool is run on Data 2 to obtain an alignment dictionary, in which each word retains only its ten most probable translations. According to the alignment dictionary, an alignment ratio is calculated for each sentence pair in Data 1 and Data 3, and the threshold on the alignment ratio is set to 0.4. In the alignment ratio of a sentence pair (X, Y), x is a word in sentence X and y is a word in sentence Y; p(x, y) is the probability that word x translates into word y, and p(y, x) is the probability that word y translates into word x; length(X) is the length of sentence X, and length(Y) is the length of sentence Y. The filtering results after general preprocessing and specific preprocessing are shown in Table 2.

Model average
In order to reduce model parameter instability and improve model robustness, model averaging is applied to the parameters of the same model saved at different points in training. We average the parameters of the last N epochs once the model has converged. N is set to 5 for this evaluation.
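Parameter averaging of this kind can be sketched as below. The representation is an assumption for illustration: each saved checkpoint is modeled as a dictionary mapping a parameter name to a flat list of floats, and `average_checkpoints` is a hypothetical helper, not a tensor2tensor function.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise over a list of checkpoints.

    Each checkpoint is a dict mapping a parameter name to a list of floats;
    all checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(vals) / n for vals in columns]
    return averaged
```

With the checkpoints stored in training order in a list `saved`, the setting N = 5 corresponds to `average_checkpoints(saved[-5:])`.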

Candidate translations merge
Data 1 and Data 2 are included in the training set. However, Data 1 contains a great amount of noise, which results in some untranslated sentences in the Zh-Ja translation task, some of which take the source sentence directly as the target translation. This rarely happens with the translation model trained only on Data 2. Therefore, the former system is taken as the primary system and the latter as the secondary system. The final translations are obtained by combining the two systems' translations. For each source sentence in the test set of the Zh-Ja task, the primary and secondary system translations are checked against the following criteria: 1) the primary system translation is exactly the same as the source sentence; 2) the primary system translation is judged to be non-Japanese by the language detection tool. If either check is satisfied and the secondary system translation is judged to be Japanese, the secondary system translation replaces the primary system translation as the final translation.
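The merging rule above can be sketched as a small function. The language detection tool used in the system is not named in the text, so `looks_japanese` below is a stand-in heuristic (presence of hiragana or katakana) assumed purely for illustration; any detector with the same signature can be passed in instead.

```python
def looks_japanese(text):
    """Stand-in language check: True if the text contains hiragana or katakana."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def merge_translation(source, primary, secondary, is_japanese=looks_japanese):
    """Choose the final Zh-Ja translation for one source sentence."""
    copied = primary == source               # check 1: output copies the source
    not_japanese = not is_japanese(primary)  # check 2: output is not Japanese
    if (copied or not_japanese) and is_japanese(secondary):
        return secondary
    return primary
```

The primary translation is kept whenever it passes both checks, and also when the secondary translation is itself not judged to be Japanese.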

Parameters setting
The open source project tensor2tensor (https://github.com/tensorflow/tensor2tensor) is chosen for this evaluation system. The main parameters are set as follows. Each model uses 1-3 GPUs for training, and the batch size is 2048. We use six self-attention layers for both the encoder and the decoder, and the multi-head self-attention mechanism has 16 heads. The embedding size and hidden size are set to 1024, the dimension of the feed-forward layer is 4096, and ReLU (Krizhevsky et al., 2012) is used as the activation function. Dropout (Gal and Ghahramani, 2015) is adopted, with dropout probabilities set to 0.1. BPE (Sennrich et al., 2015) is used in all experiments, with the number of merge operations set to 30K. The initial learning rate is 0.2, and the number of warm-up steps is set to 8000.
To choose the word segmentation method, we use Data 2 as training data and score on the development set provided by the evaluation organizer, as shown in Figures 3 and 4, where the horizontal axis shows the different settings of the model parameter alpha and the vertical axis shows the character-based BLEU-4. Zh_ja_b and ja_zh_b denote Jieba word segmentation for Chinese (Zh) and Mecab word segmentation for Japanese (Ja). Zh_ja_u and ja_zh_u use the lexical tool Urheen for both languages (Zh, Ja).
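A character-based BLEU-4 of the kind used for scoring here can be sketched as follows. This is an illustrative corpus-level implementation with one reference per hypothesis and no smoothing; the function names are hypothetical, and the organizer's official scoring script may differ in tokenization and other details.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of a sentence, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu4(hypotheses, references):
    """Corpus-level BLEU-4 over characters, one reference per hypothesis."""
    matches = [0] * 4
    totals = [0] * 4
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, 5):
            h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
            # Counter intersection keeps the minimum count: clipped matches.
            matches[n - 1] += sum((h & r).values())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

Scoring on characters rather than words makes the metric independent of the segmenter, which matters when comparing segmentation tools as in Figures 3 and 4.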

Experimental results
Training data comparison: In order to choose the training data, we use the baseline system and score on the original development set, as shown in Tables 4-5. It can be seen that increasing the scale of the training corpus helps to improve translation quality. Although the BLEU (token) score decreased after Data 1 was added in the Chinese-to-Japanese task, we still decided to adopt Data1 + Data2 as training data for both translation tasks.

ES development set: In order to verify the effect of the ES development set, we use Data2 as training data, train the baseline system with different combinations of development sets, and score on the ES test set, as shown in Tables 6-7. The experimental results show that using the ES development set, alone or together with the original development set, to train the machine translation system is better than using the original development set alone.
The statistics of training data and development set that are used in this evaluation can be found in Table 8.
System comparison: In order to choose the machine translation system, we designed different combinations of training data, further tried the model average strategy with the original development set + ES development set, and scored on them, as shown in Tables 9-10. It can be seen that the larger training set Data2+Data1 improves the results, but further growth of the training set brings no additional gain. Since Data 3 was crawled from the public web, its quality is still limited even though we adopted a few strategies to filter it. Meanwhile, Data 3 covers open domains, yet we did not carry out a consistency analysis of the data domains. In addition, the model average strategy also brings some improvement in translation quality. Therefore, we adopted the model average strategy and trained on Data1+Data2 for the two translation tasks. Table 11 shows some translation results of the Chinese-to-Japanese translation task. It can be seen that for <sent2>, the translation quality of the Data2+Data1+avg model is better than that of the Data2+avg model. However, due to the noise in Data 1, some translations are exactly the same as the source text. This rarely happens with the Data2+avg model, so we use a post-processing strategy to merge the two outputs. The Data1+Data2+avg model is taken as the primary system and the Data2+avg model as the secondary system. Taking <sent1> as an example, the primary system translation is judged to be Chinese and is just a copy of the Chinese source. The secondary system translation is judged to be Japanese and successfully expresses the meaning of the source sentence.
Table 10. System comparison in Japanese-to-Chinese task

Conclusions
This paper introduces the main techniques and methods used by the Institute of Scientific and Technical Information of China in the two translation directions between Japanese and Chinese in the IWSLT 2020 Open Domain Translation track. We use the Transformer architecture, which is based on a full attention mechanism. Several filtering methods were explored in data preprocessing, and the model average strategy was adopted in decoding. A development set similar to the test data was selected with Elasticsearch (ES), and the translations of two systems were merged in post-processing. Experimental results show that these methods can effectively improve translation quality.