NICT Self-Training Approach to Neural Machine Translation at NMT-2018

This paper describes the NICT neural machine translation system submitted to the NMT-2018 shared task. A characteristic of our approach is the introduction of self-training. Since our self-training does not change the model structure, it does not affect translation efficiency, such as the translation speed. The experimental results showed that the translation quality improved not only in sequence-to-sequence (seq-to-seq) models but also in transformer models.


Introduction
In this study, we introduce the NICT neural translation system submitted to the Second Workshop on Neural Machine Translation and Generation (NMT-2018). A characteristic of the system is that translation quality is improved by introducing self-training, using open-source neural translation systems and the provided training data.
The self-training method discussed herein is based on the methods proposed by Sennrich et al. (2016a) and Imamura et al. (2018), applied as a self-training strategy. It extends only the source side of the training data to increase its variety. The merit of the proposed self-training strategy is that it does not affect translation efficiency, such as the translation speed, because it does not change the model structure. (However, the training time increases due to the increase in the training data size.) The proposed approach can be applied to any translation method. However, we want to confirm on which models our approach is practically effective. This paper verifies the effect of our self-training method on the following two translation models:

• Sequence-to-sequence (seq-to-seq) models (Sutskever et al., 2014; Bahdanau et al., 2014) based on recurrent neural networks (RNNs). Herein, we use OpenNMT (Klein et al., 2017) as an implementation of the seq-to-seq model.

• The transformer model proposed by Vaswani et al. (2017). We used Marian NMT (Junczys-Dowmunt et al., 2018) as an implementation of the transformer model.

The remainder of this paper is organized as follows. Section 2 describes the proposed approach. Section 3 describes the details of our system. Section 4 explains the results of the experiments, and Section 5 concludes the paper.
Self-training Approach

Basic Flow
The self-training approach in this study is based on a method proposed by Imamura et al. (2018). Their method extends the method proposed by Sennrich et al. (2016a), in which a target monolingual corpus is translated back into source sentences to generate a synthetic parallel corpus. The forward translation model is then trained using the original and synthetic parallel corpora. The synthetic parallel corpus contains multiple source sentences per target sentence to enhance the encoder and the attention mechanism. The diversity of the synthetic source sentences is important in this study. Imamura et al. (2018) confirmed that the translation quality improved more when synthetic source sentences were generated by sampling than when they were generated by n-best translation.
Although Imamura et al. (2018) assumed the use of monolingual corpora, their method can be modified into a self-training form by regarding the target side of parallel corpora as monolingual corpora. In fact, they proposed such a self-training strategy and confirmed its effect on their own corpus.

Figure 1: Flow of Self-training

Figure 1 shows the flow of self-training. The procedure is summarized as follows.
1. First, train the back-translator, which translates the target language into the source language, using the original parallel corpus.

2. Extract the target side of the original parallel corpus and translate it into the source language (synthetic source sentences) using the above back-translator. During back-translation, the translator generates not just one but multiple source sentences per target sentence using a sampling method.

3. Construct the synthetic parallel corpus by pairing the synthetic source sentences with their original target sentences. If we define the number of synthetic source sentences per target sentence as N, the size of the synthetic parallel corpus becomes N-times larger than the original parallel corpus.
4. Train the forward translator, which translates the source to the target, using a mixture of the original and synthetic parallel corpora.
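The four steps above can be sketched as follows. Here `train` and `sample_translate` are hypothetical stand-ins for an NMT toolkit's training and sampling-based decoding routines, not part of any actual system:

```python
def self_training(parallel, n_samples, train, sample_translate):
    """parallel: list of (source, target) pairs; returns the forward model.

    train(pairs) fits a translation model on (input, output) pairs;
    sample_translate(model, sentence) draws one sampled translation.
    """
    # Step 1: train the back-translator (target -> source).
    back = train([(tgt, src) for src, tgt in parallel])
    # Steps 2-3: back-translate each target N times by sampling, and pair
    # the synthetic sources with their original target sentences.
    synthetic = [(sample_translate(back, tgt), tgt)
                 for _, tgt in parallel
                 for _ in range(n_samples)]
    # Step 4: train the forward translator on original + synthetic data.
    return train(parallel + synthetic)
```

With N synthetic sources per target, the mixed corpus passed to the final training step is (N + 1)-times the original size, matching the size growth noted in Step 3.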
In this study, we modify the method proposed by Imamura et al. (2018) to improve the efficiency of the training while maintaining the diversity of the source sentences. Imamura et al. (2018) generate synthetic source sentences by sampling. The sampling is based on the posterior probability of an output word as follows.

y_t ∼ P(y_t | y_{<t}, x)    (1)

where y_t, y_{<t}, and x denote the output word at time t, the history of the output words, and the input word sequence, respectively.

Diversity Control

To control the diversity of the generated sentences, the synthetic source generation in this paper introduces an inverse temperature parameter 1/τ into the softmax function:

P(y_t = w | y_{<t}, x) = exp(z_w / τ) / Σ_{w'} exp(z_{w'} / τ)    (2)

where z_w denotes the score of word w before the softmax. If we set the inverse temperature parameter 1/τ to a value greater than 1.0, high-probability words become preferable, and if we set it to infinity, the sampling becomes identical to the argmax operation. Conversely, if we set it to a value less than 1.0, more diverse words are selected, and the distribution becomes uniform if we set it to zero.
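A minimal sketch of this temperature-scaled sampling, assuming the decoder exposes a vector of pre-softmax scores (`logits` here is a hypothetical name):

```python
import math
import random

def sample_word(logits, inv_tau, rng=random):
    """Sample a word index from softmax(inv_tau * logits).

    inv_tau > 1.0 sharpens the distribution (argmax in the limit);
    inv_tau < 1.0 flattens it (uniform when inv_tau == 0.0).
    """
    if inv_tau == 0.0:                       # uniform distribution
        return rng.randrange(len(logits))
    scaled = [inv_tau * z for z in logits]
    m = max(scaled)                          # subtract max for stability
    probs = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(probs)
    for i, p in enumerate(probs):
        r -= p
        if r <= 0.0:
            return i
    return len(logits) - 1                   # guard against rounding
```

With a very large `inv_tau` this behaves like argmax decoding; a back-translator would call such a routine once per output position.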

Dynamic Generation
A problem with the method proposed by Imamura et al. (2018) is that the training time increases (N + 1)-fold along with the training data size. To alleviate this problem, we introduce dynamic generation, which uses a different synthetic parallel set for each epoch (Kudo, 2018). Specifically, a synthetic parallel sentence set, which contains one synthetic source sentence per target sentence, is used for each epoch of training. By changing the synthetic parallel sentence set every epoch, we expect an effect similar to using multiple source sentences in training.
For the implementation, we do not embed the dynamic generation in the training program but perform it offline. Multiple synthetic source sentences are generated in advance (N = 20 in this work), and N synthetic parallel sets are constructed. During training, a synthetic set is selected for each epoch using round-robin scheduling, and the model is trained using the selected synthetic set and the original corpus. We can use the same learning rates because the original and synthetic sets have the same size. With dynamic generation, the size of the training data is restricted to twice that of the original parallel corpus. The training procedure is summarized as follows.
1. First, train the back-translator using the original parallel corpus.
2. Translate the target side of the original corpus into N synthetic source sentences per target sentence using the above back-translator.
Note that the sampling method described in Section 2.2 is used for the generation.
3. Make N sets of synthetic parallel sentences by pairing the synthetic source sentences generated in Step 2 with the target side of the original parallel corpus.
4. Train the forward translator. In each epoch, select one synthetic parallel set, and train the model using the mixture of the synthetic and original parallel sets.
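The round-robin scheduling over pre-generated sets can be sketched as follows; `train_epoch` is a hypothetical one-epoch update routine:

```python
def train_with_dynamic_generation(original, synthetic_sets, num_epochs,
                                  train_epoch):
    """Each epoch mixes the original corpus with exactly one of the N
    pre-generated synthetic sets, chosen by round-robin, so the per-epoch
    training data stays at twice the original size."""
    n = len(synthetic_sets)
    for epoch in range(num_epochs):
        chosen = synthetic_sets[epoch % n]   # round-robin selection
        train_epoch(original + chosen)
```

Because each epoch sees the original corpus plus one equally sized synthetic set, the per-epoch learning-rate schedule of the baseline can be reused unchanged.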

Back-Translator
The back-translator used herein is OpenNMT, which employs an RNN-based seq-to-seq model. The training corpus for the back-translation was preprocessed using byte-pair encoding (BPE) (Sennrich et al., 2016b). For each language, 16K subword types were computed independently. The model was optimized using stochastic gradient descent (SGD) with a learning rate of 1.0.

For the back-translation, we modified OpenNMT to generate synthetic source sentences by sampling. This time, we generated three types of synthetic source sentences by changing the inverse temperature parameter 1/τ among 1.0, 1.2, and 1.5.

Forward Translator 1: Transformer Model
The first forward translator is Marian NMT, which is based on the transformer model. We used this system for the submission. The settings were almost identical to those of the base model of Vaswani et al. (2017). 2 The vocabulary sets were equal to those of the back-translator for the original parallel corpus. For the synthetic source sentences, we directly used the subword sequences output by the back-translator.

2 We referred to the settings described at https://github.com/marian-nmt/marian-examples/tree/master/transformer.

Marian NMT performs length normalization using the following equation.

ll_norm(y|x) = ll(y|x) / T^{W_P}    (3)

where ll_norm(y|x), W_P, and T denote the log-likelihood normalized by the output length, the word penalty, and the number of output words, respectively, and ll(y|x) denotes the unnormalized log-likelihood. If we set the word penalty greater than 0.0, long hypotheses are preferred. The setting of the word penalty will be further discussed in Section 4.1.
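As a concrete illustration, and assuming the normalization divides the log-likelihood by T to the power of the word penalty, the normalized score can be computed as follows (a plain-Python sketch, not Marian NMT's actual code):

```python
def normalized_score(log_likelihood, num_words, word_penalty):
    """Length-normalized score: divide the log-likelihood by T**W_P.

    Log-likelihoods are negative, so dividing by a larger T**W_P makes
    long hypotheses less negative, i.e., preferred when W_P > 0.
    """
    return log_likelihood / (num_words ** word_penalty)

# Two hypotheses with the same per-word likelihood but different lengths:
# with word_penalty = 1.0 both score -1.0; with word_penalty = 0.0 the
# shorter hypothesis wins outright.
short = normalized_score(-4.0, 4, 1.0)
long_ = normalized_score(-8.0, 8, 1.0)
```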

Forward Translator 2: Seq-to-Seq Model
The other forward translator used herein is OpenNMT, based on the seq-to-seq model. The settings were almost the same as those of the back-translator. SGD was used for optimization, but the learning rate was set to 0.5 because all target sentences appear twice in an epoch. At translation time, we translated each source sentence into a 10-best list, and the best hypothesis was selected using length-based reranking with the following equation (Oda et al., 2017).

ll_bias(y|x) = ll(y|x) + W_P · T    (4)

where ll_bias(y|x) denotes the log-likelihood biased by the output length. Although this formula differs from Equation 3, it has an equivalent effect in that long hypotheses are preferred if the word penalty W_P is set to a positive value.
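A sketch of this reranking over an n-best list, assuming the bias takes the additive form ll(y|x) + W_P · T; the hypotheses below are hypothetical (token list, log-likelihood) pairs:

```python
def rerank(nbest, word_penalty):
    """Select the hypothesis maximizing ll + W_P * T, where T is the
    number of output words. Since log-likelihoods are negative, a positive
    word penalty adds a per-word bonus that favors longer hypotheses."""
    return max(nbest, key=lambda hyp: hyp[1] + word_penalty * len(hyp[0]))

nbest = [(["short", "output"], -3.0),
         (["a", "longer", "output", "here"], -4.5)]
# word_penalty = 0.0 selects the short hypothesis (-3.0 > -4.5);
# word_penalty = 1.0 selects the longer one (-4.5 + 4 > -3.0 + 2).
```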

Experiments
In this section, we describe our results for the NMT-2018 shared task on English-German translation. Note that the shared task uses the WMT-2014 data set preprocessed by the Stanford NLP Group.

Word Penalty / Length Ratio
The BLEU score changes significantly with the translation length (Morishita et al., 2017). For instance, Figure 2 shows the BLEU scores of our submitted system (a) when the word penalty was changed from 0.0 to 2.0 and (b) at various length ratios (LRs), which indicate the ratio of the number of words in the system outputs to that in the reference translations (sys/ref).

As shown in Figure 2 (a), the BLEU scores change by over 0.5 when we change the word penalty. The penalties at the peaks differ among the development/test sets: the BLEU score peaked at W_P = 1.2, 0.2, and 0.5 on the newstest2013, newstest2014, and newstest2015 sets, respectively. Therefore, the BLEU scores significantly depend on the word penalty. However, as shown in Figure 2 (b), the peaks of the BLEU scores were at LR = 1.0 in all development/test sets. This setting supports no brevity penalty and high n-gram precision.

These results reveal that the length ratio should be kept constant for a fair comparison across systems, because different systems generate translations of different lengths. Therefore, we compare different models and settings by tuning the word penalty to maintain a stable length ratio on the development set (newstest2013). In this experiment, we show the results for the following two length ratios, determined on the "original parallel corpus only" setting of the transformer model. Note that the submitted system employs the first setting.

1. LR ≃ 0.988, which is the length ratio when W_P = 1.0.

2. LR ≃ 0.973, which is the length ratio when W_P = 0.5.
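The tuning procedure above, which picks the word penalty whose development-set length ratio is closest to a target, can be sketched as follows; `decode_length` is a hypothetical callback that decodes the development set with a given penalty and returns the total number of output words:

```python
def tune_word_penalty(decode_length, ref_length, target_lr, candidates):
    """Return the candidate word penalty whose length ratio (sys/ref)
    on the development set is closest to the target length ratio."""
    return min(candidates,
               key=lambda wp: abs(decode_length(wp) / ref_length - target_lr))
```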

Results
Tables 2 and 3 show the results of Marian NMT (the transformer model) and OpenNMT (the seq-to-seq model), respectively. These tables consist of three information groups. The first group shows the training results: the number of training epochs and the perplexity on the development set. The second and third groups show the BLEU scores when the length ratio on the development set becomes 0.988 and 0.973, respectively. The results of Marian NMT were better than those of OpenNMT in all cases. The following discussion mainly focuses on the results of Marian NMT (Table 2), but Table 3 shows similar tendencies.

First, comparing the "original parallel corpus only" and "one-best without dynamic generation" settings, the perplexity of the one-best case increased from 4.37 to 4.43. Along with the increase in perplexity, the BLEU scores on the test sets (newstest2014 and newstest2015) degraded to 26.19 and 28.49 when LR ≃ 0.988. This result indicates that self-training that simply uses the one-best translation result is not effective.
On the contrary, using our self-training method, the perplexities decreased and the BLEU scores improved significantly, regardless of the inverse temperature parameter, in most cases. 4 For example, when 1/τ = 1.0, the perplexity decreased to 4.20, and the BLEU scores improved to 27.59 and 30.19 on newstest2014 and newstest2015, respectively, when LR ≃ 0.988. When LR ≃ 0.973, the BLEU scores further improved, but the improvements come from the length ratio. The same tendency was observed in OpenNMT. We can conclude that the proposed self-training method is effective for both the transformer and seq-to-seq models.

4 The significance test was performed using the multeval tool (Clark et al., 2011) at a significance level of 5% (p < 0.05). https://github.com/jhclark/multeval
The effectiveness of the inverse temperature parameter is still unclear because the BLEU scores depended on the parameter settings.

Conclusions
The self-training method in this paper improves the accuracy without changing the model structure. The experimental results show that the proposed method is effective for both the transformer and seq-to-seq models. Although our self-training method more than doubles the training time, we believe that it is suitable for tasks that emphasize translation speed because it does not change the translation efficiency.
In this paper, only restricted settings were tested. Further experiments are required, such as other back-translation methodologies and other settings of the inverse temperature parameter.