Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning

The performance of Neural Machine Translation (NMT) models relies heavily on the availability of sufficient amounts of parallel data, and an efficient and effective way of leveraging the vast amounts of available monolingual data has yet to be found. We propose to modify the decoder in a neural sequence-to-sequence model to enable multi-task learning for two strongly related tasks: target-side language modeling and translation. The decoder predicts the next target word through two channels: a target-side language model on the lowest layer, and an attentional recurrent model which is conditioned on the source representation. This architecture allows joint training on large amounts of monolingual and moderate amounts of bilingual data to improve NMT performance. Initial results in the news domain for three language pairs show moderate but consistent improvements over a baseline trained on bilingual data only.


Introduction
In recent years, neural encoder-decoder models (Kalchbrenner and Blunsom, 2013, inter alia) have significantly advanced the state of the art in NMT and now consistently outperform Statistical Machine Translation (SMT) (Bojar et al., 2016). However, their success hinges on the availability of sufficient amounts of parallel data, and contrary to the long line of research in SMT, there has been only a limited amount of work on how to effectively and efficiently make use of monolingual data, which is typically amply available. We propose a modified neural sequence-to-sequence model with attention (Luong et al., 2015b) that uses multi-task learning on the decoder side to jointly learn two strongly related tasks: target-side language modeling and translation. Our approach requires neither pre-translation nor pre-training to learn from monolingual data and thus provides a principled way to integrate monolingual data resources into NMT training.

Gülçehre et al. (2015) investigate two ways of integrating a pre-trained neural Language Model (LM) into a pre-trained NMT system: shallow fusion, where the LM is used at test time to rescore beam search hypotheses and no additional fine-tuning is required, and deep fusion, where the hidden states of the NMT decoder and the LM are concatenated before predicting the next word. Both components are pre-trained separately and then fine-tuned together.

Related Work
More recently, Sennrich et al. (2016) have shown significant improvements by back-translating target-side monolingual data and using the resulting synthetic data as additional parallel training data. One downside of this approach is the significantly increased training time, since it requires training a model in the reverse direction and translating the monolingual data. In contrast, we propose to train NMT models from scratch on both bilingual and target-side monolingual data in a multi-task setting.
Our approach aims to exploit the signals from target-side monolingual data to learn a strong language model that supports the decoder in making translation decisions for the next word. Our approach further relates to Zhang and Zong (2016), who investigate multi-task learning for sequence-to-sequence models by strengthening the encoder using source-side monolingual data. A shared encoder architecture is used to predict both translations of parallel source sentences and permutations of monolingual source sentences. In this paper, we focus on target-side monolingual data and only update encoder parameters based on existing parallel data.
In a broader context, multi-task learning has been shown to be effective for sequence-to-sequence models (Luong et al., 2015a), where different parts of the network can be shared across multiple tasks.

Neural Machine Translation
We briefly recap the baseline NMT model (Luong et al., 2015b) and highlight architectural differences of our implementation where necessary.
Given a source sentence x = x_1, ..., x_n and a target sentence y = y_1, ..., y_m, NMT models p(y|x) as a target language sequence model, conditioning the probability of the target word y_t on the target history y_{1:t-1} and the source sentence x. Each x_i and y_t is an integer id given by the source and target vocabulary mappings, V_src and V_trg, built from the training data tokens. The target sequence is factorized as

    p(y|x; \theta) = \prod_{t=1}^{m} p(y_t | y_{1:t-1}, x; \theta).    (1)

The model, parameterized by \theta, consists of an encoder and a decoder part. For a training set P consisting of parallel sentence pairs (x, y), we minimize the cross-entropy loss w.r.t. \theta:

    L(\theta) = - \sum_{(x,y) \in P} \log p(y|x; \theta).    (2)

Encoder: Given source sentence x = x_1, ..., x_n, the encoder produces a sequence of hidden states h_1, ..., h_n through a Recurrent Neural Network (RNN), such that:

    h_i = f_enc(E_S x_i, h_{i-1}),    (3)

where h_0 = 0, x_i \in \{0,1\}^{|V_src|} is the one-hot encoding of x_i, E_S \in R^{e \times |V_src|} is a source embedding matrix with embedding size e, and f_enc is some non-linear function, such as the Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997).
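The encoder recurrence above can be sketched in a few lines of numpy. This is a minimal illustration only: it uses a plain tanh RNN cell in place of a GRU or LSTM, and the weight names (W, U, b) are chosen for illustration, not taken from the paper.

```python
import numpy as np

def encode(src_ids, E_S, W, U, b):
    """Run a simple tanh RNN encoder over source token ids.

    h_i = tanh(W @ (E_S @ x_i) + U @ h_{i-1} + b), with h_0 = 0.
    A GRU or LSTM cell would replace the tanh cell in practice.
    E_S has shape (e, |V_src|), matching the embedding matrix in the text.
    """
    hidden = U.shape[0]
    h = np.zeros(hidden)
    states = []
    for idx in src_ids:
        emb = E_S[:, idx]  # column lookup, equivalent to E_S @ one_hot(idx)
        h = np.tanh(W @ emb + U @ h + b)
        states.append(h)
    return states
```

Each returned state corresponds to one h_i and is later consumed by the attention mechanism.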
Attentional Decoder: The decoder also consists of an RNN that predicts one target word at a time through a state vector s_t:

    s_t = f_dec([E_T y_{t-1}; \tilde{s}_{t-1}], s_{t-1}),    (4)

where y_{t-1} \in \{0,1\}^{|V_trg|} is the one-hot encoding of the previous target word, E_T \in R^{e \times |V_trg|} the target word embedding matrix, f_dec an RNN, s_{t-1} the previous state vector, and \tilde{s}_{t-1} the source-dependent attentional vector. The initial decoder hidden state is a non-linear transformation of the last encoder hidden state:

    s_0 = \tanh(W_{init} h_n + b_{init}).    (5)

The attentional vector \tilde{s}_t combines the decoder state with a context vector c_t:

    \tilde{s}_t = \tanh(W_s [s_t; c_t]),    (6)

where c_t is a weighted sum of encoder hidden states, c_t = \sum_{i=1}^{n} \alpha_{ti} h_i, and brackets denote vector concatenation.
The attention vector \alpha_t is computed by an attention network (Luong et al., 2015b):

    \alpha_{ti} = \frac{\exp(score(s_t, h_i))}{\sum_{i'} \exp(score(s_t, h_{i'}))}.    (7)

The next target word y_t is predicted through a softmax layer over the attentional vector \tilde{s}_t:

    p(y_t | y_{1:t-1}, x) = softmax(W_o \tilde{s}_t + b_o),    (8)

where W_o maps \tilde{s}_t to the dimension of the target vocabulary. Figure 1a depicts this decoder architecture. Note that source information from c_t indirectly influences the states s of the decoder RNN, as it takes \tilde{s} as one of its inputs.
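One full prediction step, attention weights, context vector, attentional vector, and output softmax, can be sketched as follows. This sketch assumes a simple dot-product score (one instance of the Luong et al. (2015b) score functions); the weight names W_s, W_o, b_o are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    ez = np.exp(z)
    return ez / ez.sum()

def attend_and_predict(s_t, H, W_s, W_o, b_o):
    """One decoder prediction step.

    H stacks the encoder states h_1..h_n as rows (shape (n, hidden)).
    Uses a dot-product attention score; the attentional vector
    s_tilde then feeds the output softmax over the target vocabulary.
    """
    scores = H @ s_t                   # score(s_t, h_i) for each i
    alpha = softmax(scores)            # attention weights over source
    c_t = alpha @ H                    # context: weighted sum of h_i
    s_tilde = np.tanh(W_s @ np.concatenate([s_t, c_t]))
    p = softmax(W_o @ s_tilde + b_o)   # distribution over target words
    return alpha, p
```

Both returned vectors are proper distributions: the attention weights over source positions and the next-word probabilities over the target vocabulary.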

Separate Decoder LM layer
The decoder RNN (Figure 1a) is essentially a target-side language model that is additionally conditioned on source-side sequences. Such sequences are not available for monolingual corpora, and previous work has tried to overcome this problem by using either synthetically generated source sequences or a NULL token as the source sequence (Sennrich et al., 2016). As previously shown empirically, the model tends to "forget" source-side information if trained on much more monolingual than parallel data.

[Figure 1: (b) addition of a source-independent LM layer that feeds into the source-dependent decoder; (c) multi-task setting with next-word prediction from both layers; green softmax layers are shared.]
In our approach, we explicitly define a source-independent network that only learns from target-side sequences (a language model), and a source-dependent network on top that takes information from the source sequence into account (a translation model) through the attentional vectors. Formally, we modify the decoder RNN of Equation 4 to operate on the outputs of an LM layer, which is independent of any source-side information:

    r_t = f_lm(E_T y_{t-1}, r_{t-1}),
    s_t = f_dec([r_t; \tilde{s}_{t-1}], s_{t-1}).    (9)

Figure 1b illustrates this separation graphically.
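The modified two-level decoder step above can be sketched as follows, again with plain tanh cells standing in for f_lm and f_dec and with illustrative weight names (W_r, U_r, W_d, U_d); this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def lm_step(y_prev_emb, r_prev, params):
    """Source-independent LM layer: r_t depends only on target history."""
    W_r, U_r, b_r = params
    return np.tanh(W_r @ y_prev_emb + U_r @ r_prev + b_r)

def dec_step(r_t, s_tilde_prev, s_prev, params):
    """Source-dependent decoder RNN: consumes the LM layer output r_t
    (instead of the raw target embedding) together with the previous
    attentional vector, so source information never reaches the LM layer."""
    W_d, U_d, b_d = params
    x = np.concatenate([r_t, s_tilde_prev])
    return np.tanh(W_d @ x + U_d @ s_prev + b_d)
```

Because gradients from a monolingual example only flow through `lm_step` (and the shared embeddings and softmax), the source-dependent parameters of `dec_step` are untouched by monolingual updates.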

Multi-task Learning
The separation above allows us to train the target embeddings E_T and the f_lm parameters from monolingual data, concurrently with training the rest of the network on bilingual data. Let us denote the source-independent parameters by \sigma. We connect a second loss to f_lm to predict the next target word conditioned only on target history information (Figure 1c). Softmax parameters are shared, such that predictions of the LM layer are given by:

    p(y_t | y_{1:t-1}, \sigma) = softmax(W_o r_t + b_o).    (10)

Formally, for a heterogeneous data set Z = {P, M}, consisting of parallel sentence pairs (x, y) and monolingual sentences (y), we optimize the following joint loss:

    J(\theta) = L(\theta) + \gamma L_M(\sigma), \quad L_M(\sigma) = - \sum_{y \in M} \sum_{t=1}^{m} \log p(y_t | y_{1:t-1}, \sigma),    (11)

where the source-independent parameters \sigma \subset \theta are updated by gradients from both monolingual and parallel data examples, and the source-dependent parameters are updated only through gradients from parallel data examples. \gamma \geq 0 is a scalar that controls the importance of the monolingual loss. In practice, we construct mini-batches of training examples in which 50% of the data is parallel and 50% is monolingual, and we set \gamma = 1.
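The joint objective and the 50/50 mini-batch mixing can be sketched as follows. Here `joint_loss` treats the two cross-entropy terms as precomputed scalars, and `mixed_batches` is a hypothetical helper illustrating the data mixing, not the authors' implementation.

```python
def joint_loss(parallel_nll, mono_nll, gamma=1.0):
    """Joint objective: parallel cross-entropy plus gamma times the
    LM cross-entropy on monolingual data (scalar sketch of the loss)."""
    return parallel_nll + gamma * mono_nll

def mixed_batches(parallel, mono, batch_size):
    """Yield mini-batches that are half parallel and half monolingual
    examples, as used during multi-task training."""
    half = batch_size // 2
    n_batches = min(len(parallel) // half, len(mono) // half)
    for i in range(n_batches):
        yield (parallel[i * half:(i + 1) * half],
               mono[i * half:(i + 1) * half])
```

During training, the monolingual half of each batch contributes gradients only to the source-independent parameters, while the parallel half updates all parameters.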
Since parts of the decoder are shared among both tasks and we optimize both loss terms concurrently, we view this approach as an instance of multi-task learning rather than transfer learning, where optimization is typically carried out sequentially.

Experiments
We conduct experiments for three different language pairs in the news domain: FR→EN, EN→DE, and CS→EN.

Data
For EN→DE and CS→EN we use news-commentary-v11 as bilingual training data, NewsCrawl 2015 as monolingual data, and news development and test sets from

Experimental Setup
We tokenize all data and apply Byte Pair Encoding (BPE) (Sennrich et al., 2015) with 30k merge operations learned on the joined bilingual data. Models are evaluated in terms of BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and TER (Snover et al., 2006) on tokenized, cased test data. Decoding is performed using beam search with a beam size of 5. We implement all models using MXNet (Chen et al., 2015). RNN weight matrices are initialized with the orthogonal method (Saxe et al., 2013) and the remaining parameters with the Xavier method (Glorot and Bengio, 2010). We use early stopping with respect to perplexity on the development set. We train each model configuration three times with different seeds and report average metrics across the three runs.
Further, we train models with synthetic parallel data generated through back-translation (Sennrich et al., 2016). For this, we first train a baseline model in the reverse direction and then translate a random sample of 200k sentences from the monolingual target data. On the combined parallel and synthetic training data, we train a new model with the same training hyper-parameters as the baseline.
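The back-translation baseline described above can be sketched as follows, where `reverse_translate` is a stand-in for the pre-trained reverse-direction NMT system (any callable mapping a target sentence to a synthetic source sentence).

```python
def back_translate(mono_targets, reverse_translate):
    """Create synthetic parallel data: translate target-side monolingual
    sentences into the source language with a reverse model, then pair
    each synthetic source with its original (clean) target sentence."""
    return [(reverse_translate(y), y) for y in mono_targets]
```

Note that the target side of each synthetic pair is genuine text, only the source side is model output, which is why this data can still teach the decoder good target-language behavior despite noisy sources.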

Language Model Layer
The architecture with an additional source-independent LM layer (+LML) is trained with the same hyper-parameters and data as the baseline model. The LM RNN uses a hidden size of 1024. The multi-task system (+LML +MTL) is trained on both parallel and monolingual data. In practice, all +LML +MTL models converge before seeing the entire monolingual corpus and at about the same number of updates as the baseline. Table 1 shows results on the held-out test sets. We observe that a separate LM layer alone does not significantly impact performance across the metrics. Adding monolingual data in the described multi-task setting improves translation performance by a small but consistent margin across all metrics. Interestingly, the improvements from monolingual data are additive to the gains from ensembling 3 models with different random seeds. However, the use of synthetic parallel data still outperforms our approach in both single and ensemble systems.

Results
While separating out a language model allows us to carry out multi-task training on mixed data types, it constrains gradients from monolingual data examples to the subset of source-independent network parameters (\sigma). In contrast, synthetic data always affects all network parameters (\theta) and has a positive effect despite the source sequences being noisy. We speculate that training on synthetic source data may also act as a model regularizer.

Conclusion
We proposed a way to directly integrate target-side monolingual data into NMT through multi-task learning. Our approach avoids costly pre-training processes and jointly trains on bilingual and monolingual data from scratch. While initial results show only moderate improvements over the baseline and fall short of using synthetic parallel data, we believe there is value in pursuing this line of research further to simplify training procedures.