Neural Machine Translation for Bilingually Scarce Scenarios: a Deep Multi-Task Learning Approach

Neural machine translation requires large amount of parallel training text to learn a reasonable quality translation model. This is particularly inconvenient for language pairs for which enough parallel text is not available. In this paper, we use monolingual linguistic resources in the source side to address this challenging problem based on a multi-task learning approach. More specifically, we scaffold the machine translation task on auxiliary tasks including semantic parsing, syntactic parsing, and named-entity recognition. This effectively injects semantic and/or syntactic knowledge into the translation model, which would otherwise require a large amount of training bitext to learn from. We empirically analyze and show the effectiveness of our multitask learning approach on three translation tasks: English-to-French, English-to-Farsi, and English-to-Vietnamese.


Introduction
Neural Machine Translation (NMT) with attentional encoder-decoder architectures Bahdanau et al., 2015) has revolutionised machine translation, and achieved state-of-the-art for several language pairs. However, NMT is notorious for its need for large amounts of bilingual data (Koehn and Knowles, 2017) to achieve reasonable translation quality. Leveraging existing monolingual resources is a potential approach for compensating this requirement in bilingually scarce scenarios. Ideally, semantic and syntactic knowledge learned from existing linguistic resources provides NMT with proper inductive biases, leading to increased generalisation and better translation quality.
Multi-task learning (MTL) is an effective approach to inject knowledge into a task, which is learned from other related tasks. Various recent works have attempted to improve NMT with an MTL approach (Peng et al., 2017;Liu et al., 2017;Zhang and Zong, 2016); however, they either do not make use of curated linguistic resources (Domhan and Hieber, 2017;Zhang and Zong, 2016), or their MTL architectures are restrictive yielding mediocre improvements (Niehues and Cho, 2017). The current research leaves open how to best leverage curated linguistic resources in a suitable MTL framework to improve NMT.
In this paper, we make use of curated monolingual linguistic resources in the source side to improve NMT in bilingually scarce scenarios. More specifically, we scaffold the machine translation task on auxiliary tasks including semantic parsing, syntactic parsing, and named-entity recognition. This is achieved by casting the auxiliary tasks as sequence-to-sequence (SEQ2SEQ) transduction tasks, and tie the parameters of their encoders and/or decoders with those of the main translation task. Our MTL architectures makes use of deep stacked encoders and decoders, where the parameters of the top layers are shared across the tasks. We further make use of adversarial training to prevent contamination of common knowledge with task-specific information.
We present empirical results on translating from English into French, Vietnamese, and Farsi; three target languages with varying degree of divergence compared to English. Our extensive empirical results demonstrate the effectiveness of our MTL approach in substantially improving the translation quality for these three translation tasks in bilingually scarce scenarios.
Encoder The encoder is a bi-directional RNN whose hidden states represent tokens of the input sequence. These representations capture information not only of the corresponding token, but also other tokens in the sequence to leverage the context. The bi-directional RNN consists of two RNNs running in the left-to-right and right-to-left directions over the input sequence: where E E E S [x i ] is the embedding of the token x i from the embedding table E E E S of the input (source) space, and − → h i and ← − h i are the hidden states of the forward and backward RNNs which can be based on the LSTM (long-short term memory) (Hochreiter and Schmidhuber, 1997) or GRU (gated-recurrent unit) (Chung et al., 2014) units. Each source token is then represented by the concatenation of the corresponding bidirectional hidden states, Decoder. The backbone of the decoder is a unidirectional RNN which generates the token of the output one-by-one from left to right. The generation of each token y j is conditioned on all of the previously generated tokens y <j via the state of the RNN decoder s j , and the input sequence via a dynamic context vector c j (explained shortly): where E E E T [y j ] is the embedding of the token y j from the embedding table E E E T of the output (target) space, and the W matrices and b r vector are the parameters.
A crucial element of the decoder is the attention mechanism which dynamically attends to relevant parts of the input sequence necessary for generating the next token in the output sequence. Before generating the next token t j , the decoder computes the attention vector α j over the input token: which intuitively is similar to the notion of alignment in word/phrase-based statistical MT (Brown et al., 1993). The attention vector is then used to compute a fixed-length dynamic representation of the source sentence which is conditioned upon in the RNN decoder when computing the next state or generating the output word (as mentioned above).
Training and Decoding. The model parameters are trained end-to-end by maximising the (regularised) log-likelihood of the training data arg max where the above conditional probability is defined according to eqn (1). Usually drop-out is employed to prevent over-fitting on the training data. In the decoding time, the best output sequence for a given input sequence is produced by arg max y P θ θ θ (y|x) = j P θ θ θ (y j |y <j x).
Usually greedy decoding or beam search algorithms are employed to find an approximate solution, since solving the above optimisation problem exactly is computationally hard.

SEQ2SEQ Multi-Task Learning
We consider an extension of the basic SEQ2SEQ model where the encoder and decoder are equipped with deep stacked layers. Presumably, deeper layers capture more abstract information about a task, hence they can be used as a mechanism to share useful generalisable information among multiple tasks.
Deep Stacked Encoder. The deep encoder consists of multiple layers, where the hidden states in layer − 1 are the inputs to the hidden states at the next layer . That is, where the inputs to the first layer RNNs are in which c j is the dynamic source context, as defined in eqn 3. The state of the decoder is then the concatenation of the hidden states for all layers: s j = [s 1 j ; . . . ; s L j ] which is then used in eqn 2 as part of the "output generation module".
Shared Layer MTL. We share the deep layer RNNs in the encoders and/or decoders across the tasks, as a mechanism to share abstract knowledge and increase model generalisation.
Suppose we have a total of M + 1 tasks, consisting of the main task plus M auxiliary tasks.
be the parameters of multi-layer encoder and decoder for the task m. Let {Θ m enc , Θ m dec } M m=1 and {Θ 0 enc , Θ 0 dec } be the RNN parameters for the auxiliary tasks and the main task, respectively. We share the parameters of the deep-level encoders and decoders of the auxiliary tasks with those of the main task. That is, where L m enc and L m dec specify the deep-layer RNNs need to be shared parameters. Other parameters to share across the tasks include those of the attention module, the source/target embedding tables, and the output generation module. As an extreme case, we can share all the parameters of SEQ2SEQ architectures across the tasks.
Training Objective. Suppose we are given a collection of M +1 SEQ2SEQ transductions tasks, each of which is associated with a training set . The parameters are learned by maximising the MTL training objective: where Θ mtl denotes all the parameters of the MTL architecture, |D m | denotes the size of the training set for the task m, and γ m balances out its influence in the training objective.
Training Schedule. Variants of stochastic gradient descent (SGD) can be used to optimise the objective in order to learn the parameters. Making the best use of tasks with different objective geometries is challenging, e.g. due to the scale of their gradients. One strategy for making an SGD update is to select the tasks from which the next data items should be chosen. In our training schedule, we randomly select a training data item from the main task, and pair it with a data item selected from a randomly selected auxiliary task for making the next SGD update. This ensures the presence of training signal from the main task in all SGD updates, and avoids the training signal being washed out by the auxiliary tasks.

Adversarial Training
The learned shared knowledge can be contaminated by task-specific information. We address this issue by adding an adversarial objective. The basic idea is to augment the MTL training objective with additional terms, so that the identity of a task cannot be predicted from its data items by the representations resulted from the shared encoder/decoder RNN layers.
Task Discriminator. The goal of the task discriminator is to predict the identity of a task for a data item based on the representations of the share layers. More specifically, our task discriminator consists of two RNNs with LSTM units, each of which encodes the sequence of hidden states in the shared layers of the encoder and the decoder. 1 The last hidden states of these two RNNs are then concatenated, giving rise to a fixed dimensional vector summarising the representations in the shared layers. The summary vector is passed through a fully connected layer followed by a softmax to predict the probability distribution over the tasks: where disLSTMs denotes the discriminator LSTMs, shrRep Θ mtl (x, y) denotes the representations in the shared layer of deep encoders and decoders in the MTL architecture, and Θ d includes the disLSTMs parameters as well as Adversarial Objective. Inspired by (Chen et al., 2017), we add two additional terms to the MTL training objective in eqn 4. The first term is Maximising the above objective over Θ d ensures proper training of the discriminator to predict the identity of the task. The second term ensures that the parameters of the shared layers are trained so that they confuse the discriminator by maximising the entropy of its predicted distribution over the task identities. That is, we add the term L adv2 (Θ mtl ) to the training objective defined as: where H[.] is the entropy of a distribution. In summary, the adversarial training leads to the following optimisation We maximise the above objective by SGD, and update the parameters by alternating between optimising L mtl (Θ mtl ) + λL adv2 (Θ mtl ) and L adv1 (Θ d ).

Bilingual Corpora
We use three language-pairs, translating from English to French, Farsi, and Vietnamese. We have chosen these languages to analyse the effect of multi-task learning on languages with different underlying linguistic structures. The sentences are segmented using BPE (Sennrich et al., 2016) on the union of source and target vocabularies for English-French and English-Vietnamese. For English-Farsi, BPE is performed using separate vocabularies due to the disjoint alphabets. We use a special <UNK> token to replace unknown BPE units in the test and development sets. Table 1 show some statistics about the bilingual corpora. Further details about the corpora and their pre-processing is as follows: • The English-French corpus is a random subset of EuroParlv7 as distributed to WMT2014. Sentence pairs in which either the source  or the target has length more than 80 (before applying BPE) have been removed. The BPE is performed with a 30k total vocabulary size. The "news-test2012" and "news-test-2013" portions are used for validation and test sets, respectively.
• The English-Farsi corpus is assembled from all the parallel news text in LDC2016E93 Farsi Representative Language Pack from the Linguistic Data Consortium, combined with English-Farsi parallel subtitles from the TED corpus (Tiedemann, 2012). Since the TED subtitles are user-contributed, this text contained considerable variation in the encoding of its Perso-Arabic characters. To address this issue, we have normalized the corpus using the Hazm toolkit 2 . Sentence pairs in which one of the sentences has more than 80 (before applying BPE) are removed, and BPE is performed with a 30k vocabulary size. Random subsets of this corpus (3k and 4k sentences each) are held out as validation and test sets, respectively.
• The English-Vietnamese is from the translation task in IWSLT 2015, and we use the preprocessed version provided by . The sentence pairs in which at least one of their sentences had more than 300 units (after applying BPE) are removed. "tst2012" and "tst2013" parts are used for validation and test sets, respectively.

Auxiliary Tasks
We have chosen the following auxiliary tasks to provide the NMT model with syntactic and/or semantic knowledge, in order to enhance the quality of translation: Named-Entity Recognition (NER). With a small bilingual training corpus, it would be hard for the NMT model to learn how to translate rarely occurring named-entities. Through the NER task, the model hopefully learns the skill to recognize named entities. Speculatively, it would then enables leaning translation patterns by masking out named entities. The NER data comes from the CONLL shared task. 3 Syntactic Parsing. This task enables NMT to learn the phrase structure of the input sentence, which would then be useful in better re-orderings. This would be most useful for language pairs with high syntactic divergence. The parsing data comes from the Penn Tree Bank with the standard split for training, development, and test (Marcus et al., 1993). We linearise the constituency trees, in order to turn syntactic parsing as a SEQ2SEQ transduction (Vinyals et al., 2015).
Semantic Parsing. A good translation should preserve the meaning. Learning from the semantic parsing task enables the NMT model to pay attention to a meaning abstraction of the source sentence, in order to convey it to the target translation. We have made use of the Abstract Meaning Representation (AMR) corpus Release 2.0 (LDC2017T10), which pairs English sentences AMR meaning graphs. We linearise the AMR graphs, in order to convert semantic parsing as a SEQ2SEQ transduction problem (Konstas et al., 2017).

Models and Baselines
We have implemented the proposed multi-task learning architecture in C++ using DyNet (Neubig et al., 2017), on top of Mantis (Cohn et al., 2016) which is an implementation of the attentional SEQ2SEQ NMT model in (?). In our multitask architecture, we do partial sharing of parameters, where the parameters of the top 2 stacked layers are shared among the encoders of the tasks. Moreover, we share the parameters of the top layer stacked decoder among the tasks. Source and target embedding tables are shared among the tasks, while the attention component is task-specific. 4 We compare against the following baselines: • Baseline 1: The vanila SEQ2SEQ model without any multi-tasking.
• Baseline 2: The multi-tasking architecture proposed in (Niehues and Cho, 2017), which is a special case of our approach where the parameters of all 3 stacked layers are shared among the tasks. 5 They have not used deep stacked layers in encoder and decoder as we do, so we extend their work to make it comparable with ours.
The configuration of models is as follows. The encoders and decoders make use of GRU units with 400 hidden dimensions, and the attention component has 200 dimensions. For training, we used Adam algorithm (Kingma and Ba, 2014) with the initial learning rate of 0.003 for all of the tasks. Learning rates are halved when the performance on the corresponding dev set decreased. In order to speed-up the training, we use mini-batching with the size of 32. Dropout rates for both encoder and decoder are set to 0.5, and models are trained for 50 epochs where the best models is selected based on the perplexity on the dev set. λ for the adversarial training is set to 0.5. Once trained, the NMT model translates using the greedy search. We use BLEU (Papineni et al., 2002) to measure translation quality. 6 Table 2 reports the BLEU scores and perplexities for the baseline and our proposed method on the three aforementioned translation tasks. It can be seen that the performance of multi-task learning models are better than Baseline 1 (only MT task). This confirms that adding auxiliary tasks helps to increase the performance of the machine translation task. As expected, the effect of different tasks are not similar across the language pairs, possibly due to the following reasons: (i) these translation tasks datasets come from different domains so they have various degree of domain relatedness to the auxiliary tasks, and (ii) the BLEU scores of the Baseline 1 show that the three translation models are on different quality levels which may entail that they benefit from auxiliary knowledge on different levels. In order to improve a model with low quality translations due to language divergence, syntactic knowledge can be more helpful as they help better reorderings. In a higher-quality model, however, semantic knowledge can be more useful as 5 We have used their best performing architecture and changed the training schedule to ours. 6 With "multi-bleu.perl" script from Moses (Koehn et al., 2007).   a higher-level linguistic knowledge. This pattern can be seen in the reported results: syntactic parsing leads to more improvement on Farsi translation which has a low BLEU score and high language divergence to English, and semantic parsing yields more improvement on the Vietnamese translation task which already has a high BLEU score. The NER task has led to a steady improvement in all the translation tasks, as it leads to better handling of named entities. We have further added adversarial training to ensure the shared representation learned by the encoder is not contaminated by the task-specific information. The results are in the last row of Table  2. The experiments show that adversarial training leads to further gains in MTL translation quality, except when translating into Farsi. We speculate this is due to the low quality of NMT for Farsi, where updating shared parameters with respect to the entropy of discriminator's predicted distribution may negatively affect the model. Table 3 compares our multi-task learning approach to Baseline 2. As Table 3, our partial parameter sharing mechanism is more effective than fully sharing the parameters (Baseline 2), due to its flexibility in allowing access to private taskspecific knowledge. We also applied the adaptation technique (Niehues and Cho, 2017) as follows. Upon finishing MTL training, we continue to train only on the MT task for another 20 epochs, and choose the best model based on perplexity on dev set. Adaptation has led to consistent gains in the performance of our MTL architecture and Baseline 2.

Analysis
How many layers of encoder/decoder to share? Figure 2 show the results of changing the number of shared layers in encoder and decoder based on the En→Vi translation task. The results confirm that partial sharing of stacked layers is better than full sharing. Intuitively, partial sharing provides the model with an opportunity to learn task specific skills via the private layers, while leveraging the knowledge learned from other tasks via shared layers.

Statistics of gold n-grams in MTL translations.
Generating high order gold n-grams is hard. We analyse the effect of syntactic and semantic knowledge on generating gold n-grams in translations.
For each sentence, we first extract n-grams in the gold translation, and then compute the number of n-grams which are common with the generated translations. Finally, after aggregating the results over the entire test set, we compute the percentage of additional gold n-grams generated by each MTL model compared to the ones in single-task MT model. The results are depicted in Figure 1. Interestingly, the MTL models generate more correct n-grams relative to the vanilla NMT model, as n increases. Effect of the NER task. The NMT model has difficulty translating rarely occurring namedentities, particularly when the bilingual parallel data is scarce. We expect learning from the NER task leads the MTL model to recognize namedentities and learn underlying patterns for translating them. The top part in Table 4 shows an example of such situation. As seen, the MTL is able to recognize all of the named-entities in the sentence and translate the while the single-task model     For more analysis, we have applied a Farsi POS tagger (Feely et al., 2014) to gold translations. Then, we extracted n-grams with at least one noun in them, and report the statistics of correct such ngrams, similar to what reported in Figure 1. The resulting statistics is depicted in Figure 3. As seen, the MTL model trained on MT and NER tasks leads to generation of more correct unigram noun phrases relative to the vanilla NMT, as n increases.
Effect of the semantic parsing task. Semantic parsing encourages a precise understanding of the source text, which would then be useful for conveying the correct meaning to the translation. The middle part in Table 4 is an example translation, showing that semantic parsing has helped NMT by understanding that "the subject sees the object via subject's screens".
Effect of the syntactic parsing task. Recognizing the syntactic structure of the source sentence helps NMT to better translate phrases. The bottom part of Table 4 shows an example translation demonstrating such case. The source sentence is talking about "a method for controlling the traffic", which is correctly translated by the MTL model while vanilla NMT has mistakenly translated it to "controlled traffic".

Related Work
Multi-task learning has attracted attention to improve NMT in recent work. (Zhang and Zong, 2016) has made use of monolingual data in the source language in a multitask learning framework by sharing encoder in the attentional encoderdecoder model. Their auxiliary task is to reorder the source text to make it close to the target language word order. (Domhan and Hieber, 2017) proposed a two-layer stacked decoder, which the bottom layer is trained on language modelling on the target language text. The next word is jointly predicted by the bottom layer language model and the top layer attentional RNN decoder. They reported only moderate improvements over the baseline and fall short against using synthetic parallel data. (Dalvi et al., 2017) investigated the amount of learned morphology and how it can be injected using MTL. Our method is related to what they call joint data-learning, where they share all of the SEQ2SEQ components among the tasks. (Belinkov et al., 2017a;Shi et al., 2016;Belinkov et al., 2017b) investigate syntax/semantics phenomena learned as a byproduct of SEQ2SEQ NMT training. We, in turn, investigate the effect of injecting syntax/semantic on learning NMT using MTL.
The closet work to ours is (Niehues and Cho, 2017), which has made use of part-of-speech tagging and named-entity recognition tasks to improve NMT. They have used the attentional encoder-decoder with a shallow architecture, and share different parts eg the encoder, decoder, and attention. They report the best performance with fully sharing the encoder. In contrast, our architecture uses partial sharing on deep stacked encoder and decoder components, and the results show that it is critical for NMT improvement in MTL. Furthermore, we propose adversarial training to prevent contamination of shared knowledge with task specific details.
Taking another approach to MTL, (Søgaard and Goldberg, 2016) and (Hashimoto et al., 2017) have proposed architectures by stacking up tasks on top of each other according to their linguistic level, eg from lower level tasks (POS tagging) to higher level tasks (parsing). In this approach, each task uses predicted annotations and hidden states of the lower-level tasks for making a better prediction. This is contrast to the approach taken in this paper where models with shared parameters are trained jointly on multiple tasks.
More broadly, deep multitask learning has been used for various NLP problems, including graphbased parsing (Chen and Ye, 2011) and keyphrase boundary classification  . (Chen et al., 2017) has applied multi-task learning for Chinese word segmentation, and (Liu et al., 2017) applied it for text classification problem. Both of these works have used adversarial training to make sure the shared layer extract only common knowledge.
MTL has been used effectively to learn from multimodal data. (Luong et al., 2016) has proposed MTL architectures for neural SEQ2SEQ transduction for tasks including MT, image caption generation, and parsing. They fully share the encoders (many-to-one), the decoders (oneto-many), or some of the encoders and decoders (many-to-many). (Pasunuru and Bansal, 2017) have made use of an MTL approach to improve video captioning with auxiliary tasks including video prediction and logical language entailment based on a many-to-many architecture.

Conclusions and Future Work
We have presented an approach to improve NMT in bilingually scarce scenarios, by leveraging curated linguistic resources in the source, including semantic parsing, syntactic parsing, and named entity recognition. This is achieved via an effective MTL architecture, based on deep stacked en-coders and decoders, to share common knowledge among the MT and auxiliary tasks. Our experimental results show substantial improvements in the translation quality, when translating from English to French, Vietnamese, and Farsi in bilingually scarce scenarios. For future work, we would like to investigate architectures which allow automatic parameter tying among the tasks (Ruder et al., 2017).