XMU Neural Machine Translation Systems for WAT 2017

This paper describes the Neural Machine Translation systems of Xiamen University for the shared translation tasks of WAT 2017. Our systems are based on the Encoder-Decoder framework with attention. We participated in three subtasks. We experimented with subword segmentation, synthetic training data, and model ensembling. Experiments show that all of these methods yield substantial improvements.


Introduction
Neural Machine Translation (NMT) (Bahdanau et al., 2015; Cho et al., 2014) has achieved great success in recent years and outperforms traditional statistical machine translation (SMT) on various language pairs (Sennrich et al., 2016a; Wu et al., 2016; Zhou et al., 2016). This paper describes the NMT systems of Xiamen University (XMU) for the WAT 2017 evaluation (Nakazawa et al., 2017). We participated in three translation subtasks: the JIJI Japanese↔English newswire subtask, the IITB Hindi↔English mixed-domain subtask, and the Cookpad Japanese↔English recipe subtask.
In all three subtasks, we use our reimplementation of dl4mt-tutorial with minor changes. We use both Byte Pair Encoding (BPE) (Sennrich et al., 2016c) and mixed word/character segmentation (Wu et al., 2016) to achieve open-vocabulary translation. We apply the back-translation method (Sennrich et al., 2016b) to make use of monolingual data. We use ensembles of multiple models to further improve translation quality.
The remainder of this paper is organized as follows: Section 2 describes our NMT system, including the training details. Section 3 describes the processing of the data. Section 4 describes all experimental features. Section 5 shows the results of our experiments. Finally, we conclude in Section 6.

Baseline System
Our NMT system is a reimplementation of the dl4mt-tutorial model. We introduce some minor changes and new features such as dropout (Srivastava et al., 2014).
For all three subtasks, we train our models with almost the same hyper-parameter settings. We use word embeddings of size 620 and hidden layers of size 1000. We use mini-batches of size 128 and adopt Adam (Kingma and Ba, 2015) ($\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$) as the optimizer. The initial learning rate is set to $5 \times 10^{-4}$. We gradually halve the learning rate during training. As is common when training RNN models, we clip the norm of the gradients to a predefined value of 1.0 (Pascanu et al., 2013). To avoid over-fitting, we use dropout with a keep probability of 0.8. For ensembling, we train multiple models with different random parameter initializations and different data shuffling.
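The optimizer settings above can be illustrated with a minimal pure-Python sketch of one Adam update preceded by gradient-norm clipping. This is not the authors' implementation: the names `clip_by_global_norm`, `Adam`, and `halve_lr` are hypothetical, parameters are held in a flat list for simplicity, and the point at which the learning rate is halved is left to the caller because the text does not specify the schedule.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class Adam:
    """Adam with the hyper-parameters quoted in the text as defaults."""

    def __init__(self, n_params, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params  # first-moment estimates
        self.v = [0.0] * n_params  # second-moment estimates
        self.t = 0                 # number of updates so far

    def halve_lr(self):
        # Assumption: the caller decides when to halve (schedule not given).
        self.lr *= 0.5

    def step(self, params, grads):
        """Return updated parameters after one bias-corrected Adam step."""
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            updated.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


# Usage: clip first, then apply the optimizer step.
opt = Adam(n_params=2)
grads = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
params = opt.step([1.0, 1.0], grads)
```

Clipping rescales the whole gradient vector rather than each component, so the update direction is preserved.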
In decoding, we employ beam search with a beam size of 10. We use a modified version of the AmuNMT C++ decoder for parallel decoding. When ensembling, we assign uniform weights to the different models.
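As a rough illustration of beam search over a uniformly weighted ensemble, the sketch below averages the next-token distributions of several models and expands the best hypotheses at each step. It makes several assumptions not stated in the text: the models are hypothetical callables that map a prefix to a `{token: probability}` dictionary over a shared vocabulary, the ensemble uses an arithmetic mean of probabilities, and no length normalization is applied. It is not the AmuNMT decoder.

```python
import math


def ensemble_probs(dists):
    """Average next-token distributions with uniform model weights."""
    vocab = dists[0].keys()
    return {tok: sum(d[tok] for d in dists) / len(dists) for tok in vocab}


def beam_search(models, beam_size=10, max_len=20, eos="</s>"):
    """Return the highest-scoring token sequence under the ensemble."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = ensemble_probs([m(prefix) for m in models])
            for tok, p in probs.items():
                if p > 0.0:
                    candidates.append((prefix + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses ending in eos leave the beam and are kept aside.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy models: they disagree on the first token and then both emit eos.
def model_a(prefix):
    if not prefix:
        return {"a": 0.6, "b": 0.4, "</s>": 0.0}
    return {"a": 0.0, "b": 0.0, "</s>": 1.0}


def model_b(prefix):
    if not prefix:
        return {"a": 0.2, "b": 0.8, "</s>": 0.0}
    return {"a": 0.0, "b": 0.0, "</s>": 1.0}


best = beam_search([model_a, model_b], beam_size=10)
```

In this toy case the ensemble prefers "b" (average probability 0.6) even though `model_a` alone would have chosen "a".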

Data Processing
We use all training data provided by the JIJI, IITB, and Cookpad corpora. For the JIJI and Cookpad corpora, the Moses tokenizer and truecaser are applied on the English side. On the Japanese side, full-width ASCII variants are first converted into their half-width forms, and the MeCab segmenter is then used to segment the sentences. For the IITB corpus, we directly use the tokenized data and truecase the English sentences with the Moses truecaser.
For all three corpora, we remove duplicates and filter out bad sentence pairs according to the word alignment scores obtained with the fast-align toolkit. For the IITB corpus, we also filter out sentence pairs that are not English-Hindi, using both the Unicode range of Devanagari characters and the language identification toolkit langid.
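The full-width to half-width conversion can be sketched as a simple character mapping. The sketch below assumes that "full-width ASCII variants" refers to the Unicode range U+FF01 to U+FF5E, each of which sits exactly 0xFEE0 code points above its ASCII counterpart; the function name `fullwidth_to_halfwidth` is hypothetical.

```python
# Translation table: full-width ASCII variants (U+FF01..U+FF5E) -> ASCII.
_FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def fullwidth_to_halfwidth(text):
    """Convert full-width ASCII variants to half-width; leave the rest as-is."""
    return text.translate(_FULLWIDTH_TO_ASCII)


# Usage: Latin letters and digits are normalized, kana and kanji are untouched.
normalized = fullwidth_to_halfwidth("ＷＡＴ２０１７の翻訳")
```

An explicit table is used here rather than a general Unicode normalization form, so that only the ASCII variants are touched.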
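A minimal sketch of the duplicate removal and the Devanagari-range check is given below. It covers only those two steps: the alignment-score filter and the langid check are omitted. The thresholds `min_hi_ratio` and `max_en_ratio` and the function names are hypothetical, since the text does not state how the Unicode range test was applied.

```python
def devanagari_ratio(text):
    """Fraction of non-space characters in the Devanagari block (U+0900..U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(0x0900 <= ord(ch) <= 0x097F for ch in chars) / len(chars)


def filter_pairs(pairs, min_hi_ratio=0.5, max_en_ratio=0.1):
    """Drop exact duplicates and pairs whose sides look like the wrong script."""
    seen = set()
    kept = []
    for en, hi in pairs:
        key = (en.strip(), hi.strip())
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        if devanagari_ratio(key[1]) < min_hi_ratio:
            continue  # Hindi side is mostly non-Devanagari
        if devanagari_ratio(key[0]) > max_en_ratio:
            continue  # English side contains too much Devanagari
        kept.append(key)
    return kept


# Usage with a toy corpus of (English, Hindi) pairs.
pairs = [
    ("hello", "नमस्ते"),
    ("hello", "नमस्ते"),   # duplicate
    ("hello", "bonjour"),  # target side is not Hindi
    ("नमस्ते", "नमस्ते"),    # source side is not English
]
clean = filter_pairs(pairs)
```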

Subword Segmentation
To enable open-vocabulary translation, we apply subword-based translation approaches. In our preliminary experiments, we found that BPE and mixed word/character segmentation work better than UNK replacement techniques.
In the JIJI and IITB tasks, we apply BPE with 20K operations to English sentences and Hindi sentences separately. We use the mixed word/character model on the Japanese side of the JIJI task. We keep the 20K most frequent Japanese words and split all other words into characters. Unlike Wu et al. (2016), we do not add any extra prefixes or suffixes to the segmented Japanese characters. In the post-processing step, we simply remove all spaces in Japanese sentences. Similarly, in the Cookpad task, we use BPE segmentation on the English side, but with 10K operations, since the vocabulary size is much smaller. Correspondingly, a mixed word/character model with a shortlist of 10K words is applied to the Japanese sentences.
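The mixed word/character segmentation for Japanese can be sketched as follows. The sketch assumes the input has already been segmented into words (for example by MeCab) and uses hypothetical function names; the shortlist size is a parameter (20K for JIJI and 10K for Cookpad in the text, 2 in the toy example).

```python
from collections import Counter


def build_shortlist(tokenized_sentences, size):
    """Collect the `size` most frequent words of the training corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {word for word, _ in counts.most_common(size)}


def segment(tokens, shortlist):
    """Keep shortlisted words whole; split every other word into characters."""
    out = []
    for tok in tokens:
        if tok in shortlist:
            out.append(tok)
        else:
            out.extend(tok)  # no prefix or suffix markers are added
    return out


def postprocess(tokens):
    """Undo the segmentation by removing all spaces between tokens."""
    return "".join(tokens)


# Usage with a toy corpus and a shortlist of two words.
corpus = [["猫", "が", "好き"], ["猫", "が", "走る"]]
shortlist = build_shortlist(corpus, size=2)
pieces = segment(["猫", "が", "走る"], shortlist)
restored = postprocess(pieces)
```

Because written Japanese does not use spaces between words, dropping all spaces is enough to restore the surface form, which is why no boundary markers are needed.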

Synthetic Training Data
To utilize the monolingual data in the IITB corpus, we employ the back-translation method. We use SRILM to train a 5-gram Kneser-Ney (KN) language model on the monolingual data and select monolingual sentences according to their perplexity. In this way, 2.5M English sentences are selected from IITB's monolingual data. We use a single EN-HI NMT baseline model to translate the selected English monolingual sentences back to Hindi. The synthetic sentence pairs are used to train HI-EN NMT models.
Similarly, we also select 2.5M Hindi monolingual sentences and use one single HI-EN NMT baseline model to translate them back to English. The synthetic sentence pairs are used to train EN-HI NMT models.
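The perplexity-based selection step can be sketched as below. To stay self-contained, the sketch swaps the 5-gram KN model from SRILM for a toy add-one-smoothed unigram model, and it assumes that sentences with the lowest perplexity are the ones kept; the text does not state the selection direction. `UnigramLM` and `select_by_perplexity` are hypothetical names.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram LM (stand-in for a 5-gram KN model)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def perplexity(self, sentence):
        tokens = sentence.split()
        if not tokens:
            return float("inf")
        logp = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return math.exp(-logp / len(tokens))


def select_by_perplexity(sentences, lm, n):
    """Return the n sentences with the lowest perplexity under lm."""
    return sorted(sentences, key=lm.perplexity)[:n]


# Usage: rank candidate monolingual sentences and keep the best two.
lm = UnigramLM(["the cat sat", "the dog sat"])
candidates = ["the cat sat", "zzz qqq xxx", "the dog ran"]
selected = select_by_perplexity(candidates, lm, n=2)
```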
In preliminary experiments, we found that training or tuning on the synthetic data alone could not significantly improve the performance of NMT models. Therefore, we mix the synthetic data with a comparable amount of bilingual pairs oversampled from IITB's parallel data and train NMT models on the mixed data. A similar method is also used by Sennrich et al. (2017).
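The mixing step can be sketched as follows. The sketch assumes that "a comparable amount" means repeating the parallel pairs until they roughly match the number of synthetic pairs; the exact ratio is not given in the text, and `mix_training_data` is a hypothetical name.

```python
import random


def mix_training_data(parallel, synthetic, seed=0):
    """Oversample parallel pairs to the size of the synthetic set, then mix."""
    parallel = list(parallel)
    synthetic = list(synthetic)
    if not parallel:
        raise ValueError("parallel data must not be empty")
    rng = random.Random(seed)
    # Never drop genuine pairs: the target is at least the parallel set itself.
    target = max(len(parallel), len(synthetic))
    repeats, remainder = divmod(target, len(parallel))
    oversampled = parallel * repeats + rng.sample(parallel, remainder)
    mixed = oversampled + synthetic
    rng.shuffle(mixed)
    return mixed


# Usage: 2 genuine pairs are oversampled to match 5 synthetic pairs.
parallel = [("src1", "tgt1"), ("src2", "tgt2")]
synthetic = [("bt%d" % i, "mono%d" % i) for i in range(5)]
mixed = mix_training_data(parallel, synthetic)
```

Shuffling after concatenation ensures that each mini-batch contains both genuine and synthetic pairs.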

Results
In this section, we report the automatic evaluation results (word-level BLEU scores) and human evaluation results on the test sets. We compare our NMT systems with the best SMT systems provided by the organizer.

Table 1 shows the results of the JIJI subtask. We apply subword segmentation to the parallel data and train 4 English-Japanese NMT models and 4 Japanese-English models. We found that, in both EN-JP and JP-EN, a single NMT model can outperform traditional SMT systems such as a hierarchical phrase-based model. Ensembles of 4 NMT models further improve the results by more than +2.0 BLEU points.

In the IITB subtask, we first train an English-Hindi and a Hindi-English baseline NMT model on the parallel data with subword segmentation. We then select monolingual sentences and synthesize larger training data using the backward baseline NMT models. As shown in Table 2, in both EN-HI and HI-EN, training on synthetic data improves the BLEU score by more than +6.0 points. When ensembling 4 models, we gain more than +1.6 additional BLEU points.

Results on Cookpad Subtask
In the Cookpad subtask, we want a single NMT model to be robust enough to translate different types of text. We therefore train NMT models directly on all training data without any extra data separation or labelling, and we use the same models for all four test sets. The results are shown in Table 3. Our single NMT baselines beat the phrase-based SMT systems on almost all test sets, the exception being JA-EN ingredient. When ensembling 4 models, we gain a further +1.3 to +3.1 BLEU points on all test sets and outperform the SMT systems by +2.2 to +5.8 BLEU points. In the human evaluation, we found that NMT models achieve good results on the title and step sets, but not on the ingredient sets. This is reasonable because NMT models tend to be stronger in fluency than in adequacy. For titles and steps, human readers usually focus on fluency, whereas for ingredients they care more about adequacy.

Conclusion
We describe XMU's neural machine translation systems for the WAT 2017 shared translation tasks. Our models perform well and prove effective enough to outperform traditional SMT systems in all tasks, even with limited training data. Experiments also show the effectiveness of all the features we used, including subword segmentation, synthetic training data, and multi-model ensembling.