Long Warm-up and Self-Training: Training Strategies of NICT-2 NMT System at WAT-2019

This paper describes the NICT-2 neural machine translation system at the 6th Workshop on Asian Translation. This system employs the standard Transformer model but has two distinguishing characteristics. The first is the long warm-up strategy, which warms up the learning rate for longer at the start of training than conventional approaches do. The second is self-training based on multiple back-translations generated by sampling. We participated in three tasks (ASPEC.en-ja, ASPEC.ja-en, and TDDC.ja-en) using this system.


Introduction
This paper describes the NICT-2 neural machine translation (NMT) system at the 6th Workshop on Asian Translation (WAT-2019) (Nakazawa et al., 2019). This system employs the Transformer base model of Vaswani et al. (2017) but improves translation quality by applying the following training strategies and hyperparameters.
• We investigated the relationship between the learning rate, warm-up, and model perplexity, and found that a long warm-up allows high learning rates, and consequently the translation quality improves. According to this finding, we applied the long warm-up.
• We applied the self-training strategy, which uses multiple back-translations generated by sampling to increase the robustness of the encoder and improve the translation quality.
The remainder of this paper is organized as follows. Section 2 summarizes the system, including its settings. We describe the characteristics of our system (the long warm-up and the self-training based on multiple back-translations by sampling) in Sections 3 and 4, respectively. The results are presented in Section 5. Finally, Section 6 concludes the paper.

System Summary
We participated in three tasks, namely English-to-Japanese and Japanese-to-English of ASPEC (abbreviated to ASPEC.en-ja and ASPEC.ja-en, respectively), and Japanese-to-English of TDDC (TDDC.ja-en). The corpus used in the ASPEC tasks is the Asian Scientific Paper Excerpt Corpus (Nakazawa et al., 2016), which is a collection of scientific papers. In the TDDC task, the Timely Disclosure Documents Corpus (TDDC) was used. The development and test sets of TDDC are each divided into an items set and a texts set, which are collections of titles and body texts, respectively. The sizes of the corpora are shown in Table 1.
All corpora were divided into sub-words using the byte-pair encoding rules (Sennrich et al., 2016b) acquired from the training sets of each corpus. The rules were acquired independently for the source and target languages, to give a vocabulary size of around 16K. We used fairseq as the base translator. The model used here is the Transformer base model (six layers). Table 2 shows the hyperparameters of the model, training, and test.
Training was performed on Volta 100 GPUs (two GPUs for the ASPEC dataset and one GPU for the TDDC dataset) using 16-bit floating point computation. The training was stopped when the loss of the development set was minimized (i.e., early stopping). We also used checkpoint averaging: using the best checkpoint and the next nine checkpoints (10 checkpoints in total).
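The checkpoint selection and averaging described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `Params` representation (a dictionary of flat float lists) and both function names are hypothetical, and real checkpoints would hold tensors rather than Python lists.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def select_checkpoints(dev_losses: List[float], window: int = 10) -> List[int]:
    """Return the index of the lowest dev-loss checkpoint and the ones that follow it."""
    best = min(range(len(dev_losses)), key=dev_losses.__getitem__)
    # Take the best checkpoint plus the next (window - 1), if that many exist.
    return list(range(best, min(best + window, len(dev_losses))))


def average_checkpoints(checkpoints: List[Params]) -> Params:
    """Element-wise mean of every parameter across the given checkpoints."""
    n = len(checkpoints)
    averaged: Params = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged
```

With `window=10`, `select_checkpoints` returns the best checkpoint and the next nine, matching the 10 checkpoints in total mentioned in the text.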
During testing, we used 32-bit floating point computation. For the final submission, four models, which were trained using different random seeds, were ensembled (Imamura and Sumita, 2017).

Long Warm-up
The warm-up is a technique that gradually increases the learning rate at the beginning of training. The most general strategy is to increase the learning rate linearly. Using the warm-up, the model parameters are updated toward convergence, even if they were randomly initialized. This allows us to obtain stable models.
We generally use a fixed time for the warm-up. For example, Vaswani et al. (2017) used a fixed number of warm-up updates. However, the warm-up time influences the final quality of models. Figure 1 shows how the development set perplexity (Dev. PPL) changes as the learning rate and warm-up time vary, using the ASPEC.ja-en dataset (as explained in Section 5). Lower perplexity indicates a better model. We can observe that both the learning rate and the warm-up time influence the perplexity. When we use a long warm-up, we can apply high learning rates and consequently obtain low-perplexity models. We observed a similar tendency in the TDDC.ja-en task.
Based on the above experiment, we used 0.0004 as the learning rate and set the warm-up time to 30K updates for ASPEC datasets and 14K updates for TDDC datasets. These values almost minimize the development perplexity.
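A linear warm-up with these values can be sketched as below. The linear increase to the peak learning rate follows the text; the inverse-square-root decay after the warm-up is an assumption (a common choice for Transformer training), since the schedule after warm-up is not stated here. The function name and its parameters are illustrative.

```python
def learning_rate(step: int, peak_lr: float = 4e-4, warmup_updates: int = 30_000) -> float:
    """Learning rate at a given update (1-based).

    Linear warm-up to peak_lr over warmup_updates, then inverse-square-root
    decay. The decay shape is an assumption, not taken from the paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_updates:
        # Warm-up phase: grow linearly from near zero to the peak.
        return peak_lr * step / warmup_updates
    # After warm-up: decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates / step) ** 0.5
```

For the TDDC setting, the same function would be called with `warmup_updates=14_000`. A longer warm-up means the peak learning rate is reached later, so early updates from randomly initialized parameters use smaller steps.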
Recently, a variant of Adam optimization (called RAdam), which automatically adapts the learning rate and does not require any warm-up, has been proposed (Liu et al., 2019). Confirming the relationship between the warm-up and RAdam is left for future work.

Back-translation can be applied to self-training if the pseudo-parallel corpora are created from the manually created corpora that will be used for training the forward translator (Figure 2).

Back-Translation with Sampling Generation
A problem with back-translation is that the pseudo-parallel sentences become less varied than those created manually, because of machine translation. This characteristic makes it difficult for the back-translation method to enhance the encoder, in contrast to the decoder.
To solve this problem, a method has been proposed that combines the following two techniques.
• To generate diverse pseudo-parallel sentences, words are generated during back-translation by sampling from the word probability distribution (Eq. 1) instead of taking the maximum-likelihood word.
$$\hat{y}_t \sim \frac{\Pr(y \mid y_{<t}, x)^{1/\tau}}{\sum_{y'} \Pr(y' \mid y_{<t}, x)^{1/\tau}} \qquad (1)$$
where ŷ_t, y_<t, and x denote the generated word at time t, the history until time t, and the input word sequence, respectively. τ denotes the temperature parameter, which controls the diversity; we use τ = 1.0 in this paper.
• Multiple pseudo-source sentences, for a target sentence, are used for training.
Both methods are intended to enhance the encoder by increasing the diversity of source sentences, while fixing the target sentences.
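The sampling step can be illustrated with a small sketch. It assumes the usual temperature form, in which each word probability is raised to the power 1/τ and the result is renormalized; the function name and the dictionary representation of the distribution are hypothetical, and a real decoder would operate on model logits.

```python
import random
from typing import Dict, Optional


def sample_word(probs: Dict[str, float], tau: float = 1.0,
                rng: Optional[random.Random] = None) -> str:
    """Draw one word from a next-word distribution with temperature tau.

    tau = 1.0 samples from the model distribution itself; smaller values
    approach greedy (maximum-likelihood) selection.
    """
    rng = rng or random.Random()
    words = list(probs)
    # Raise each probability to 1/tau (assumed temperature form).
    weights = [probs[w] ** (1.0 / tau) for w in words]
    # random.choices normalizes the weights, which plays the role of the denominator.
    return rng.choices(words, weights=weights, k=1)[0]
```

Calling `sample_word` repeatedly for the same history yields different words, which is what makes the K pseudo-source sentences for one target sentence differ from each other.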

Training Procedure
The training procedure is summarized as follows (Figure 2).
1) Train a back-translator from the target language to the source language, using the original parallel corpus.
2) Translate the target side of the original corpus to the source language, using the back-translator. During the back-translation, K pseudo-source sentences are generated for each target sentence, using sampling.
3) Construct K sets of pseudo-parallel sentences by pairing the target and pseudo-source sentences.
4) Build the training set by mixing the original and pseudo-parallel corpora, and train the forward translator from the source language to the target language.
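Steps 2 to 4 can be sketched as plain data manipulation. The `back_translate` callable is a hypothetical stand-in for the trained back-translator with sampling enabled, and the plain mixing in `build_training_set` is the simplest reading of step 4; training itself is outside the scope of this sketch.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_pseudo_parallel_sets(
    targets: List[str],
    back_translate: Callable[[str], str],
    k: int,
) -> List[List[Pair]]:
    """Steps 2 and 3: sample k back-translations per target and form k sets."""
    sets: List[List[Pair]] = [[] for _ in range(k)]
    for target in targets:
        for i in range(k):
            # Each call is assumed to return a freshly sampled pseudo-source sentence.
            sets[i].append((back_translate(target), target))
    return sets


def build_training_set(original: List[Pair], pseudo_sets: List[List[Pair]]) -> List[Pair]:
    """Step 4: mix the original corpus with all pseudo-parallel sets."""
    mixed = list(original)
    for pseudo in pseudo_sets:
        mixed.extend(pseudo)
    return mixed
```

Note that the target side is never changed: every pseudo-parallel pair reuses a manually created target sentence, and only the source side varies.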

Static and Dynamic Self-Training
There are two types of self-training based on back-translation, which differ in the pseudo-parallel sentences used and the structure of mini-batches: static self-training and dynamic self-training. Steps 3 and 4 of Section 4.2 are different for each type. Static self-training constructs a training set by combining K_static pseudo-parallel sentences with each of the original sentences. In this paper, we set K_static = 4. During training, the training set is fixed.
In static self-training, the number of pseudo-parallel sentences is K_static times larger than the number of original sentences. If we simply mixed these sentences, the ratio of pseudo-parallel sentences to original sentences would be too high. To avoid this problem, we oversample the original sentences by a factor of K_static, instead of changing the learning rate depending on the sentence. Therefore, the total size of the training set is 2K_static times larger than the size of the original corpus.
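The oversampling arithmetic can be checked with a short sketch. The function name is illustrative; it assumes the pseudo-parallel corpus is given as a list of sets, each the same size as the original corpus.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_static_training_set(original: List[Pair],
                              pseudo_sets: List[List[Pair]]) -> List[Pair]:
    """Fixed training set: originals oversampled to balance the pseudo-parallel data."""
    k_static = len(pseudo_sets)
    # Repeat the original corpus k_static times so both halves have equal weight.
    training_set = list(original) * k_static
    for pseudo in pseudo_sets:
        training_set.extend(pseudo)
    return training_set
```

With an original corpus of N sentence pairs and four pseudo-parallel sets, the result has 4N original pairs plus 4N pseudo-parallel pairs, that is, 8N pairs in total.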
In contrast, dynamic self-training constructs K_dynamic training sets. Each training set includes one pseudo-parallel set and one original set. For each epoch, a set is randomly selected from the K_dynamic training sets, and the model is trained using that set. In this paper, we set K_dynamic = 20.
In dynamic self-training, the size of a training set is twice the size of the original corpus. Therefore, the training time for an epoch is shorter than that for static self-training. However, a greater number of epochs are required until convergence, because diverse training sets are used.
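The per-epoch selection in dynamic self-training can be sketched as a generator. The names are hypothetical, and uniform random selection with replacement is an assumption; the text only states that a set is randomly selected for each epoch.

```python
import random
from typing import Iterator, List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def dynamic_training_sets(original: List[Pair],
                          pseudo_sets: List[List[Pair]],
                          num_epochs: int,
                          rng: Optional[random.Random] = None) -> Iterator[List[Pair]]:
    """Yield one training set per epoch: the originals plus one random pseudo set."""
    rng = rng or random.Random()
    for _ in range(num_epochs):
        # Uniform choice with replacement (an assumption about the selection rule).
        pseudo = rng.choice(pseudo_sets)
        yield list(original) + list(pseudo)
```

Each yielded set is twice the size of the original corpus, so an epoch is cheaper than in the static case, while the source side changes from epoch to epoch.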

Results of ASPEC and TDDC Tasks
The results of ASPEC and TDDC are shown in Tables 3 and 4, respectively. Both tables show the translation quality (BLEU) and the perplexity of the development set (Dev. PPL), depending on the model type and training method. The effect of the long warm-up has already been shown in Section 3.

Notes of Experimental Settings
The BLEU scores (Papineni et al., 2002) in the tables were computed based on the tokenizers MeCab (for Japanese (Kudo et al., 2004)) and Moses (for English (Koehn et al., 2007)). We trained four models with different random seeds. The single model rows of the tables show the average score of four models, and the ensemble rows show the score of the ensemble of four models.
The length penalty for testing was set to maximize the BLEU score of the development set. However, in the TDDC task, we used different penalties for the items and texts sets, and independently optimized according to the set.
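Tuning the length penalty amounts to a small search over candidate values. The sketch below assumes one common form of length normalization (dividing the summed log-probability by the hypothesis length raised to the penalty) and takes the development-set BLEU as a caller-supplied function; both function names and the candidate grid are hypothetical.

```python
from typing import Callable, Iterable


def normalized_score(sum_log_prob: float, length: int, len_penalty: float) -> float:
    """Length-normalized hypothesis score (assumed form: log-prob / length ** penalty)."""
    return sum_log_prob / (length ** len_penalty)


def tune_length_penalty(candidates: Iterable[float],
                        dev_bleu: Callable[[float], float]) -> float:
    """Return the candidate penalty that gives the highest development-set BLEU."""
    return max(candidates, key=dev_bleu)
```

For the TDDC task, `tune_length_penalty` would be called twice, once with a BLEU function for the items development set and once for the texts development set, giving two independent penalties.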
Finally, we submitted the ensemble models for which the BLEU scores of the development set (in the single model cases) were the highest.

Results
First, we focus on the ASPEC.en-ja task. In the single model cases, the BLEU scores improved by about +0.20 to +0.58 when static self-training was added to the base model. In the case of dynamic self-training, the improvements were between +0.46 and +0.60. The ensemble models show a similar tendency, and we can conclude that self-training is effective because the BLEU scores improved significantly in many cases. Comparing static and dynamic self-training, there were no significant differences, even though the scores of dynamic self-training were higher than those of static self-training.
In contrast, for the ASPEC.ja-en task, the BLEU scores of static self-training were worse than those of the base model, in both the single model and ensemble cases. However, for dynamic self-training, some BLEU scores significantly improved. Dynamic self-training tends to be more effective for the ASPEC tasks.
In terms of training time, static self-training required fewer epochs than dynamic self-training in both the ASPEC.en-ja and ASPEC.ja-en tasks. Conversely, the total number of updates was lower for dynamic self-training. Because the training data are larger in static self-training, its total training time increased. Dynamic self-training was therefore more efficient from the perspective of training time.
For the TDDC.ja-en task, the items and texts sets show different tendencies. For the items set, the BLEU scores improved by applying static self-training, but they became worse for the texts set.
Self-training based on back-translation is not always effective; that is, there are effective and ineffective datasets. Investigating the conditions that influence translation quality is our future work. Note that this phenomenon is only observed for self-training; the back-translation of additional monolingual corpora has different features.

Conclusions
This paper explained the NICT-2 NMT system at WAT-2019. This system employs the Transformer model and applies the following two training strategies.
• We employed the long warm-up strategy and trained the model using a high learning rate.
• We also employed the self-training strategy, which uses multiple back-translations generated by sampling.

Table notes: The bold values indicate the highest score among base, static self-training (ST), and dynamic ST. The (+) and (-) symbols denote significant improvement and degradation, respectively, from the base model (p ≤ 0.05). The § symbol indicates that there is a significant difference between static and dynamic ST cases.