XMU Neural Machine Translation Systems for WMT 17

This paper describes the Neural Machine Translation systems of Xiamen University for the translation tasks of WMT 17. Our systems are based on the Encoder-Decoder framework with attention. We participated in three directions of shared news translation tasks: English → German and Chinese ↔ English. We experimented with deep architectures, different segmentation models, synthetic training data and target-bidirectional translation models. Experiments show that all methods can give substantial improvements.


Introduction
Neural Machine Translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) has achieved great success in recent years and obtained state-of-the-art results on various language pairs (Zhou et al., 2016; Sennrich et al., 2016a; Wu et al., 2016). This paper describes the NMT systems of Xiamen University (XMU) for WMT 17. We participated in three directions of the shared news translation task: English→German and Chinese↔English. We use two different NMT systems for the shared news translation task:
• MININMT: A deep NMT system (Zhou et al., 2016; Wu et al., 2016; Wang et al., 2017) with a simple architecture. The decoder is a stacked Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) with 8 layers. The encoder has two variants. For English-German translation, we use an interleaved bidirectional encoder with 2 columns, each consisting of 4 LSTMs. For Chinese-English translation, we use a stacked bidirectional encoder with 8 layers.
• DL4MT: Our reimplementation of dl4mt-tutorial with minor changes. We also use a modified version of the AmuNMT C++ decoder for decoding. This system is used in the English-Chinese translation task.
We use both Byte Pair Encoding (BPE) (Sennrich et al., 2016c) and mixed word/character segmentation (Wu et al., 2016) to achieve open-vocabulary translation. The back-translation method (Sennrich et al., 2016b) is applied to make use of monolingual data. We also use target-bidirectional translation models to alleviate the label bias problem (Lafferty et al., 2001). The remainder of this paper is organized as follows: Section 2 describes the architecture of MININMT. Section 3 describes all experimental features used in the WMT 17 shared translation tasks. Section 4 shows the results of our experiments. Section 5 shows the results of the shared translation task. Finally, we conclude in Section 6.

Model Description
Deep architectures have recently shown promising results on various language pairs (Zhou et al., 2016; Wu et al., 2016; Wang et al., 2017). We also experimented with a deep architecture, as depicted in Figure 1. Both the encoder and the decoder use LSTM as the main recurrent unit, with residual connections (He et al., 2016) to help training. The decoder produces the translation $y_t$ given the source annotation vectors $\{x_i\}$ and the target history $y_{<t}$.

Figure 1: Deep NMT architecture (cf. Zhou et al., 2016, and GNMT, Wu et al., 2016). Both the encoder and decoder adopt LSTM as their main recurrent unit. We also use residual connections (He et al., 2016) to help training, but omit them here for clarity. Black lines denote input connections and blue lines denote recurrent connections.

Interleaved Bidirectional Encoder
The interleaved bidirectional encoder was introduced by Zhou et al. (2016) and is also used by Wang et al. (2017). Like Zhou et al. (2016), our interleaved bidirectional encoder consists of two columns. Within each column, the LSTMs in adjacent layers run in opposite directions. Here $x^0_t \in \mathbb{R}^e$ is the word embedding of word $x_t$, and the state of each LSTM consists of its memory and hidden state. We set both the embedding dimension $e$ and the hidden dimension $h$ to 512 in all our experiments. The annotation vectors $x_i \in \mathbb{R}^{2h}$ are obtained by concatenating the final outputs $\overrightarrow{x}^{L_{enc}}$ and $\overleftarrow{x}^{L_{enc}}$ of the two encoder columns. In our experiments, we set $L_{enc} = 4$.
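As a sketch of the interleaving under assumed notation, the two columns could be written as follows. The layer index $i = 1, \dots, L_{enc}$, the state symbol $s^i_t$ (memory and hidden state of layer $i$ at position $t$), the names $\mathrm{LSTM}^f_i$ and $\mathrm{LSTM}^b_i$, and the choice that the two columns start in opposite directions are all assumptions made for illustration, with $\overrightarrow{x}^{\,0}_t = \overleftarrow{x}^{\,0}_t = x^0_t$.

```latex
\begin{align}
\overrightarrow{x}^{\,i}_t &= \mathrm{LSTM}^{f}_{i}\big(\overrightarrow{x}^{\,i-1}_t,\ \overrightarrow{s}^{\,i}_{t+(-1)^{i}}\big) \\
\overleftarrow{x}^{\,i}_t &= \mathrm{LSTM}^{b}_{i}\big(\overleftarrow{x}^{\,i-1}_t,\ \overleftarrow{s}^{\,i}_{t+(-1)^{i+1}}\big)
\end{align}
```

With this indexing, layer 1 of the first column reads its state from position $t-1$ (left to right) and layer 2 reads it from position $t+1$ (right to left), while the second column does the opposite, so the direction alternates from layer to layer in both columns.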

Stacked Bidirectional Encoder
To better exploit the source representation, we adopt a stacked bidirectional encoder. As shown in Figure 1, all layers in the encoder are bidirectional: each layer concatenates the outputs of a forward and a backward LSTM. To reduce the number of parameters, we reduce the dimension of the hidden units from $h$ to $h/2$, so that $x_i \in \mathbb{R}^h$. The annotation vectors are taken from the output $x^{L_{enc}}$ of the top LSTM layer. In our experiments, $L_{enc}$ is set to 8.
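One way to write a stacked bidirectional layer consistent with this description is sketched below. The state symbol $s^i_t$ and the names $\mathrm{LSTM}^f_i$ and $\mathrm{LSTM}^b_i$ are notational assumptions, with $i = 1, \dots, L_{enc}$.

```latex
\begin{align}
\overrightarrow{x}^{\,i}_t &= \mathrm{LSTM}^{f}_{i}\big(x^{i-1}_t,\ \overrightarrow{s}^{\,i}_{t-1}\big) \\
\overleftarrow{x}^{\,i}_t &= \mathrm{LSTM}^{b}_{i}\big(x^{i-1}_t,\ \overleftarrow{s}^{\,i}_{t+1}\big) \\
x^{i}_t &= \big[\overrightarrow{x}^{\,i}_t ;\ \overleftarrow{x}^{\,i}_t\big]
\end{align}
```

Since each direction has $h/2$ hidden units, the concatenated output $x^i_t$ lies in $\mathbb{R}^h$, which matches the dimension of the annotation vectors stated above.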

Decoder
The decoder network is similar to GNMT (Wu et al., 2016). At each time step $t$, let $y^0_{t-1} \in \mathbb{R}^e$ denote the word embedding of $y_{t-1}$ and $y^1_{t-1} \in \mathbb{R}^h$ denote the output of the bottom LSTM at the previous time step. The attention network calculates the context vector $a_t$ as the weighted sum of the source annotation vectors. Different from GNMT (Wu et al., 2016), we use the concatenation of $y^0_{t-1}$ and $y^1_{t-1}$ as the query vector for the attention network. This approach is also used by Wang et al. (2017). The context vector $a_t$ is then fed to all decoder LSTMs. The probability of the next word $y_t$ is modeled using a softmax layer on the output of the top LSTM. We set $L_{dec}$ to 8 in all our experiments.
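The attention step can be illustrated with a minimal pure-Python sketch. It builds the query by concatenating $y^0_{t-1}$ and $y^1_{t-1}$ and returns the context vector as a weighted sum of the annotation vectors. The function names and the `score` callable are hypothetical; the form of the scoring function is not specified here, so it is left as a parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(y0_prev, y1_prev, annotations, score):
    """Return (context vector a_t, attention weights).

    y0_prev: embedding of the previous target word (length e)
    y1_prev: previous output of the bottom decoder LSTM (length h)
    annotations: list of source annotation vectors x_i
    score: hypothetical callable (query, x_i) -> float; the scoring
           function is an assumption, not taken from the paper
    """
    query = list(y0_prev) + list(y1_prev)  # concatenated query vector
    weights = softmax([score(query, x) for x in annotations])
    dim = len(annotations[0])
    # Weighted sum of the annotation vectors, one dimension at a time.
    context = [
        sum(w * x[d] for w, x in zip(weights, annotations))
        for d in range(dim)
    ]
    return context, weights
```

With a constant scoring function the weights are uniform and the context vector reduces to the mean of the annotation vectors, which is a convenient sanity check.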

Experimental Features

Segmentation Approaches
To enable open-vocabulary translation, we use two approaches: BPE and mixed word/character segmentation.
In most of our experiments, we use BPE (Sennrich et al., 2016c) with 50K operations. In our preliminary experiments, we found that BPE works better than UNK replacement techniques.
For the English-Chinese translation task, we apply the mixed word/character model (Wu et al., 2016) to Chinese sentences. We keep the most frequent 50K words and split all other words into characters. Unlike Wu et al. (2016), we do not add any prefixes or suffixes to the segmented Chinese characters. In the post-processing step, we simply remove all the spaces.
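The mixed word/character scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is already word-segmented; the function names are hypothetical, and the vocabulary size defaults to the 50K stated above.

```python
from collections import Counter


def build_word_vocab(segmented_sentences, size=50000):
    """Keep the `size` most frequent words of a word-segmented corpus."""
    counts = Counter(w for sent in segmented_sentences for w in sent)
    return {w for w, _ in counts.most_common(size)}


def mixed_segment(words, vocab):
    """Keep in-vocabulary words; split all other words into characters.

    No prefixes or suffixes are attached to the characters.
    """
    tokens = []
    for word in words:
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(word)  # one token per character
    return tokens


def postprocess(tokens):
    """Undo the segmentation by removing all spaces between tokens."""
    return "".join(tokens)
```

Because no markers are attached to the characters, post-processing needs no bookkeeping: joining the output tokens without spaces restores plain Chinese text.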

Synthetic Training Data
We apply the back-translation method (Sennrich et al., 2016b) to make use of monolingual data. For English-German and Chinese-English translation, we sample monolingual data from the NewsCrawl2016 corpora. For English-Chinese translation, we sample monolingual data from the XinhuaNet2011 corpus.
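As a rough sketch of how synthetic data is built and combined with real data, the following Python pairs each monolingual target sentence with a machine-translated source and then mixes the result with sampled parallel pairs. The `reverse_translate` callable stands in for a trained target-to-source model, and all function names are hypothetical.

```python
import random


def back_translate(target_sentences, reverse_translate):
    """Pair each monolingual target sentence with a synthetic source.

    reverse_translate: hypothetical callable mapping a target-language
    sentence to a source-language sentence (a trained reverse model).
    """
    return [(reverse_translate(t), t) for t in target_sentences]


def mix_training_data(synthetic_pairs, parallel_pairs, num_real, seed=0):
    """Combine synthetic pairs with `num_real` sampled real pairs."""
    rng = random.Random(seed)
    mixed = list(synthetic_pairs) + rng.sample(list(parallel_pairs), num_real)
    rng.shuffle(mixed)  # shuffle so batches contain both kinds of pairs
    return mixed
```

Only the source side of a synthetic pair is machine-generated; the target side is genuine text, so the decoder still learns from fluent output.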

Target-bidirectional Translation
For Chinese-English translation, we also use a target-bidirectional model (Liu et al., 2016; Sennrich et al., 2016a) to rescore the hypotheses.
To train a target-bidirectional model, we reverse the target side of the bilingual pairs from left-to-right (L2R) to right-to-left (R2L). We first output 50 candidates from the ensemble of 4 L2R models. Then we rescore the candidates by interpolating the L2R score and the R2L score with uniform weights.
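The rescoring step can be sketched as follows. This is a minimal illustration, assuming both models return a log-probability for a token sequence and that the R2L model is queried with the reversed hypothesis; the function and parameter names are hypothetical.

```python
def rescore(candidates, l2r_score, r2l_score, l2r_weight=0.5):
    """Pick the best hypothesis under interpolated L2R and R2L scores.

    candidates: list of hypotheses, each a list of target tokens
    l2r_score, r2l_score: hypothetical callables returning a
        log-probability for a token list (assumed interface)
    l2r_weight: 0.5 gives the uniform interpolation described above
    """
    def combined(hyp):
        reversed_hyp = list(reversed(hyp))  # R2L model sees reversed target
        return (l2r_weight * l2r_score(hyp)
                + (1.0 - l2r_weight) * r2l_score(reversed_hyp))

    return max(candidates, key=combined)
```

In the setting described above, `candidates` would hold the 50 hypotheses produced by the L2R ensemble. A hypothesis that the L2R models rank first can lose to another candidate if the R2L model finds its ending implausible.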

Training
For all our models, we adopt Adam (Kingma and Ba, 2015) ($\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 1 \times 10^{-8}$) as the optimizer. The initial learning rate is $5 \times 10^{-4}$, and we gradually halve it during training. As is common when training RNNs, we clip the norm of the gradient to a predefined value of 5.0. The batch size is 128. We use dropout (Srivastava et al., 2014) with a keep probability of 0.8 to avoid overfitting.
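The optimizer settings above can be made concrete with a small pure-Python sketch of one update: clip the gradient norm to 5.0, then apply an Adam step with the stated hyperparameters. The function names and the `state` dictionary layout are hypothetical, and the halving schedule is an assumption, since the text does not say when the learning rate is halved.

```python
import math


def clip_by_norm(grads, max_norm=5.0):
    """Rescale the gradient so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def halved_learning_rate(initial_lr, num_halvings):
    """Learning rate after halving it num_halvings times (assumed schedule)."""
    return initial_lr * 0.5 ** num_halvings


def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count t and moments m, v."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected mean
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected variance
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

On the first step the bias-corrected moments cancel the gradient magnitude, so each parameter moves by roughly the learning rate regardless of how large the gradient is.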
Table 1 shows the results of English-German translation. The baseline system is trained on preprocessed parallel data. For synthetic data, we randomly sample 10M German sentences from NewsCrawl2016 and translate them back to English using a German-English model. However, we found that random sampling does not work well. As a result, for Chinese-English translation, we select monolingual data according to the development set. We first train one baseline model and continue to train 4 models on synthetic data with different shuffles. Next, we ensemble the 4 models to get the final results. We found that this approach does not lead to substantial improvements.

For Chinese-English translation, we use all training data (CWMT Corpus, UN Parallel Corpus and News Commentary) to train a baseline system. The Chinese sentences are segmented using the Stanford Segmenter. For English sentences, we use the Moses tokenizer. We filter bad sentences according to the alignment score obtained by the fast-align toolkit and remove duplicates from the training data. The preprocessed training data consists of 19M bilingual pairs. As noted earlier, the monolingual data is selected using newsdev2017. We first train 4 L2R models and one R2L model on the training data, then fine-tune our models on a mixture of 2.5M synthetic bilingual pairs and 2.5M bilingual pairs sampled from the CWMT corpus.

Table 3 shows the results of English-Chinese translation. We use our reimplementation of DL4MT to train English-Chinese models on the CWMT and UN parallel corpora. The preprocessing steps, including word segmentation, tokenization, and sentence filtering, are almost the same as for Chinese-English translation (Section 4.2), except that we limit the vocabulary size to 50K and split all target-side OOVs into characters. For synthetic parallel data, we use SRILM to train a 5-gram KN language model on XinhuaNet2011 and select 2.5M sentences from XinhuaNet2011 according to their perplexities. We gain +3.9 BLEU when tuning the single best model on a mixture of 2.5M synthetic bilingual pairs and 2.5M bilingual pairs randomly selected from the CWMT parallel data. We gain a further +1.5 BLEU when ensembling 4 models.
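Perplexity-based selection of monolingual sentences can be sketched as below. This is a toy illustration only: an add-one-smoothed unigram model stands in for the 5-gram KN model trained with SRILM, and keeping the lowest-perplexity sentences is an assumption, since the text only says that sentences are selected according to their perplexities. All function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a perplexity function from an add-one unigram model.

    A toy stand-in for a 5-gram Kneser-Ney language model.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def perplexity(sentence):
        log_prob = sum(
            math.log((counts[tok] + 1) / (total + vocab_size))
            for tok in sentence
        )
        return math.exp(-log_prob / max(len(sentence), 1))

    return perplexity


def select_by_perplexity(sentences, perplexity, k):
    """Keep the k sentences with the lowest perplexity (assumed criterion)."""
    return sorted(sentences, key=perplexity)[:k]
```

In practice the perplexity function would wrap scores produced by the real language model; only the ranking and truncation logic carries over from this sketch.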

Shared Task Results
Table 4 shows the ranking of our submitted systems at the WMT17 shared news translation task. Our submissions are ranked (tied) first for 2 out of the 3 translation directions in which we participated: EN↔ZH.

Conclusion
We describe XMU's neural machine translation systems for the WMT 17 shared news translation tasks. All our models perform quite well on all tasks in which we participated. Experiments also show the effectiveness of all the features we used.

Table 3: English-Chinese translation results on newstest2017.

Table 4: Automatic (BLEU) and human ranking of our submitted systems at WMT17 shared news translation task.