English-Myanmar Supervised and Unsupervised NMT: NICT’s Machine Translation Systems at WAT-2019

This paper presents NICT's participation (team ID: NICT) in the 6th Workshop on Asian Translation (WAT-2019) shared translation task, specifically the Myanmar (Burmese)-English task in both translation directions. We built neural machine translation (NMT) systems for these tasks. Our NMT systems were trained with language model pretraining, and we also applied back-translation. Our NMT systems ranked third in English-to-Myanmar and second in Myanmar-to-English according to BLEU score.


Introduction
This paper describes the neural machine translation (NMT) systems 1 built for the National Institute of Information and Communications Technology (NICT)'s participation in the 6th Workshop on Asian Translation (WAT-2019) translation task (Nakazawa et al., 2019), specifically Myanmar (My)-English (En) in both translation directions.
The remainder of this paper is organized as follows. In Section 2, we present the data preprocessing. In Section 3, we introduce the details of our NMT systems. Empirical results obtained with our systems are analyzed in Section 4 and we conclude this paper in Section 5.

Data Preprocessing
As parallel data to train our systems, we used all the provided parallel data for all our targeted translation directions, including the "ALT" and "UCSY" training corpora and the "ALT" dev/test data. The statistics of our preprocessed parallel data are shown in Table 1. We used the Moses tokenizer and truecaser for English. The truecaser was trained on the English data after tokenization. For Myanmar, we used the original tokens. For cleaning, we only applied the Moses script clean-corpus-n.perl to remove lines in the parallel data containing more than 80 tokens, and replaced characters forbidden by Moses.
* Rui and Haipeng contributed equally to this paper. This work was conducted when Haipeng visited NICT as an intern.
1 This system is based on our WMT-2019 system (Marie et al., 2019).
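The length-based cleaning step of the preprocessing above can be illustrated with a minimal sketch. It is not a reimplementation of the Moses script, which applies further checks. It only mirrors the 80-token limit stated in the text. The function name `filter_by_length` and the lower bound `min_tokens` are assumptions introduced for illustration.

```python
from typing import Iterable, Iterator, Tuple

MAX_TOKENS = 80  # upper limit stated in the text


def filter_by_length(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = MAX_TOKENS,
    min_tokens: int = 1,  # assumption: also drop empty lines
) -> Iterator[Tuple[str, str]]:
    """Yield only sentence pairs whose two sides are within the token limits.

    Tokens are counted by whitespace splitting, so the input is expected
    to be tokenized already.
    """
    for src, tgt in pairs:
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if min_tokens <= n_src <= max_tokens and min_tokens <= n_tgt <= max_tokens:
            yield src, tgt


if __name__ == "__main__":
    corpus = [
        ("a short sentence", "its translation"),
        (" ".join(["tok"] * 81), "too long on the source side"),
    ]
    for src, tgt in filter_by_length(corpus):
        print(src, "|||", tgt)
```

A pair is dropped as soon as either side violates the limit, so the two sides of the corpus stay aligned line by line.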

MT Systems
To build competitive NMT systems, we chose to rely on the Transformer architecture (Vaswani et al., 2017), since it has been shown to outperform, in quality and efficiency, the two other mainstream architectures for NMT: the deep recurrent neural network (deep-RNN) and the convolutional neural network (CNN). We initialized the Transformer-based NMT model with a pretrained cross-lingual language model (Lample and Conneau, 2019), since this had been shown to be effective for low-resource language pairs. To limit the size of the vocabulary of the NMT model, we segmented tokens in the training data into sub-word units via byte pair encoding (BPE) (Sennrich et al., 2016b). We learned 60k BPE merge operations jointly on the English and Myanmar training data, and used a shared vocabulary of 60k BPE tokens for both languages.
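As an illustration of how BPE merge operations are learned, the following toy sketch follows the general procedure of Sennrich et al. (2016b): repeatedly count adjacent symbol pairs and merge the most frequent one. It is not the tool used for our systems. The function names and the tie-breaking rule are assumptions made for illustration. For a joint segmentation, the word frequencies would be counted over the English and Myanmar training data together.

```python
from collections import Counter
from typing import Dict, List, Tuple

Symbols = Tuple[str, ...]


def get_pair_counts(vocab: Dict[Symbols, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Dict[Symbols, int]) -> Dict[Symbols, int]:
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged: Dict[Symbols, int] = {}
    for symbols, freq in vocab.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Return the list of learned merge operations, most frequent first."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Assumption: ties are broken by the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


if __name__ == "__main__":
    freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(learn_bpe(freqs, 3))
```

In this sketch the number of merge operations plays the role of the 60k setting mentioned above. Real toolkits add details such as frequency thresholds that are omitted here.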

TLM
Before training NMT, we used all training corpora, including parallel data and monolingual data, to train a translation language model (TLM) with XLM on 8 GPUs, in order to pretrain the NMT model. The parameters for training the language model were set as listed in Table 3.
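To make the TLM objective more concrete, the sketch below builds one training example in the way described by Lample and Conneau (2019): a sentence pair is concatenated, positions restart for the second sentence, each token carries a language ID, and some tokens on both sides are masked for prediction. It is a simplified illustration and not XLM code. The symbols `[MASK]` and `</s>`, the function name, and the choice to always substitute the mask symbol are assumptions. Only the parallel-sentence case is shown.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # assumed mask symbol
SEP = "</s>"     # assumed sentence delimiter


def build_tlm_example(
    src: List[str],
    tgt: List[str],
    mask_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[str], List[Optional[str]], List[int], List[int]]:
    """Return (inputs, labels, positions, language_ids) for one sentence pair."""
    rng = random.Random(seed)
    src_seq = [SEP] + src + [SEP]
    tgt_seq = [SEP] + tgt + [SEP]
    tokens = src_seq + tgt_seq
    # Positions restart at 0 for the second sentence of the pair.
    positions = list(range(len(src_seq))) + list(range(len(tgt_seq)))
    # Language IDs: 0 for the first sentence, 1 for the second.
    langs = [0] * len(src_seq) + [1] * len(tgt_seq)

    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask symbol
            labels.append(tok)    # and must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss at unmasked positions
    return inputs, labels, positions, langs


if __name__ == "__main__":
    example = build_tlm_example(["a", "b", "c"], ["x", "y"], mask_prob=0.5)
    for name, value in zip(("inputs", "labels", "positions", "langs"), example):
        print(name, value)
```

Because both sentences share one input sequence, a masked token on one side can be predicted from context on the other side, which is what encourages cross-lingual alignment.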

NMT
We trained a Transformer-based NMT model initialized with the pretrained TLM using the XLM toolkit. Our NMT systems were consistently trained on 8 GPUs with the parameters listed in Table 4.
We performed NMT decoding with a single model, selected according to the best BLEU (Papineni et al., 2002) and perplexity scores.

Back-translation
We also tried the back-translation method (Sennrich et al., 2016a) to make use of monolingual corpora for the English-to-Myanmar translation task. Parallel data for training NMT can be augmented with synthetic parallel data, generated through back-translation, to significantly improve translation quality. To generate the back-translations, we used an NMT system trained on the parallel data provided by the organizers to translate target monolingual sentences into the source language, yielding pseudo-parallel corpora. The pseudo-parallel corpora were then simply mixed with the original parallel data to train a new source-to-target NMT system from scratch.
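The data flow of back-translation can be sketched as follows. This is a minimal illustration of the mixing step only. The function name `build_backtranslated_corpus` is hypothetical, and `translate_tgt_to_src` is a placeholder for a trained target-to-source NMT system, which is not implemented here.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def build_backtranslated_corpus(
    parallel: List[SentencePair],
    target_mono: List[str],
    translate_tgt_to_src: Callable[[str], str],
) -> List[SentencePair]:
    """Mix original parallel data with pseudo-parallel data.

    Each monolingual target sentence is translated into the source
    language. The synthetic source is paired with the genuine target,
    so the target side of the training data stays human-written.
    """
    pseudo = [(translate_tgt_to_src(tgt), tgt) for tgt in target_mono]
    return list(parallel) + pseudo


if __name__ == "__main__":
    # Placeholder "model": marks the sentence instead of translating it.
    def dummy_model(sentence: str) -> str:
        return "<bt> " + sentence

    data = build_backtranslated_corpus(
        parallel=[("source 1", "target 1")],
        target_mono=["target 2", "target 3"],
        translate_tgt_to_src=dummy_model,
    )
    for src, tgt in data:
        print(src, "|||", tgt)
```

The mixed corpus is then used to train a new source-to-target system. Since the synthetic side depends on the quality of the monolingual data and of the reverse model, noisy input propagates directly into the training set.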

UNMT
To the best of our knowledge, unsupervised NMT (UNMT) (Artetxe et al., 2018; Lample et al., 2018a; Yang et al., 2018; Lample et al., 2018b; Lample and Conneau, 2019) has achieved remarkable results on some similar language pairs. To obtain a better picture of the feasibility of UNMT, we also set up a UNMT system for one truly low-resource and distant language pair: En-My. We trained a Transformer-based UNMT model that relies solely on monolingual corpora, initialized with the pretrained cross-lingual language model, using the XLM toolkit. Note that this cross-lingual language model was trained solely on the monolingual corpora described in Section 2. We used these monolingual corpora to train the UNMT model for 50,000 iterations. The En-My UNMT system was trained on 8 GPUs with the parameters listed in Table 6.

Table 5: Results (BLEU-cased) of our MT systems on the test set. ALT denotes that the ALT training data was used in the system; UCSY denotes that the UCSY training data was used; MONO denotes that monolingual training data was used. +TLM denotes that language model pretraining was used; +back-translation denotes that back-translation was used.

Results
Our systems are evaluated on the ALT test set and the results 5 are shown in Table 5. Our observations from Table 5 are as follows:
1) The results of UNMT are very low, highlighting that UNMT is still very far from exploitable for low-resource distant language pairs.
2) Language model pretraining significantly improved the NMT systems for both translation directions. This demonstrates that language model pretraining is effective for NMT.
3) For the My-En translation direction, back-translation further improved translation performance, achieving an improvement of 8 BLEU points. However, for the En-My translation direction, back-translation did not improve and even harmed NMT performance, since the My monolingual data was noisy.
5 The results of BLEU are based on our own evaluation.

Conclusion
In this paper, we presented NICT's participation in the WAT-2019 shared translation task. Our primary NMT submissions to the task ranked third in English-to-Myanmar and second in Myanmar-to-English according to BLEU score. Our results also confirmed the positive impact of language model pretraining in NMT. Moreover, our results for UNMT highlighted that unsupervised machine translation is still very far from exploitable for low-resource distant language pairs.