The UMD Neural Machine Translation Systems at WMT17 Bandit Learning Task

We describe the University of Maryland machine translation systems submitted to the WMT17 German-English Bandit Learning Task. The task is to adapt a translation system to a new domain, using only bandit feedback: the system receives a German sentence to translate, produces an English sentence, and only gets a scalar score as feedback. Targeting these two challenges (adaptation and bandit learning), we built a standard neural machine translation system and extended it in two ways: (1) robust reinforcement learning techniques to learn effectively from the bandit feedback, and (2) domain adaptation using data selection from a large corpus of parallel data.


Introduction
We describe the University of Maryland systems for bandit machine translation. Among the shared tasks of the Second Conference on Machine Translation (WMT17), held at EMNLP 2017, we focused on bandit machine translation. Consistent with Kreutzer et al. (2017), this shared task was set up simultaneously as a bandit learning problem and a domain adaptation problem. This raises a natural question: can we combine these potentially complementary information sources?
To investigate this question, we started from a standard neural machine translation (NMT) setup (§2), implemented on top of OpenNMT (Klein et al., 2017), an open-source toolkit for neural MT, and then we:
1. applied domain adaptation techniques by data selection (Moore and Lewis, 2010) to the out-of-domain data, with the goals of filtering out harmful data and fine-tuning the training process to focus only on relevant sentences (§4);
2. trained robust reinforcement learning algorithms that can effectively learn from bandit feedback (§3); this allows our model to "test" proposed generalizations and adapt from the provided feedback signals.
Tackling the problem of learning with bandit feedback is important because neural machine translation systems, like other natural language processing technology, currently learn almost exclusively from labeled data for a specific domain. While this approach is useful, it cannot scale to a broad variety of languages and domains, as linguistic systems often cannot generalize well beyond their training data. Machine translation systems need to be able to improve their performance from naturalistic interaction with users in addition to labeled data.
Bandit feedback (Robbins, 1985) offers systems the opportunity to "test" proposed generalizations and receive feedback on their performance; particularly interesting are contextual bandit systems, which make predictions based on a given input context (Auer et al., 2002;Langford and Zhang, 2008;Beygelzimer et al., 2010;Dudik et al., 2011). For example, a neural translation system trained on parliament proceedings often performs quite poorly at translating anything else. However, a translation system that is deployed to facilitate conversations between users might receive either explicit feedback (e.g. thumbs up/down) on its translations, or even implicit feedback, for example, the conversation partner asking for clarifications. There has recently been a flurry of work specifically addressing the bandit structured prediction problem (Chang et al., 2015;Sokolov et al., 2016a,b), of which machine translation is a special case.
Because this task is, at its core, a domain adaptation problem (for which a bandit learning signal is available to "help"), we also explored the use of standard domain adaptation techniques. We make the strong assumption that a sizable amount of monolingual, source-language data is available before bandit feedback begins. We believe that in many realistic settings, one can at least get some amount of unlabeled data to begin with (we consider 40k sentences). Using this monolingual data, we apply data selection to a large corpus of parallel out-of-domain data (Europarl, NewsCommentary, CommonCrawl, Rapid) to seed an initial translation model.
Overall, the results support the following conclusions (§5), based on the limited setting of one new domain and one language pair:
1. data selection for domain adaptation alone improves translation quality by about 1.5 BLEU points.
2. on top of the domain adaptation, reinforcement learning (which requires exploration) leads to an initial degradation of about 3 BLEU points, which is recovered (on development data) after approximately 40k sentences of bandit feedback.
One limitation of our current setup is that we used bandit feedback on development data to train a "critic" function for our reinforcement learning implementation, which, in the worst case, means that our results over-estimate performance on the first 120k examples (more details in §5.3).

Neural MT architecture
We closely follow Luong et al. (2015) for the structure of our neural machine translation (NMT) systems. Our NMT model consists of an encoder and a decoder, each of which is a recurrent neural network (RNN). We use a bidirectional RNN as the encoder and a unidirectional RNN as the decoder. The model directly estimates the posterior distribution $P_\theta(y \mid x)$ of translating a source sentence $x = (x_1, \cdots, x_n)$ to a target sentence $y = (y_1, \cdots, y_m)$:

$$P_\theta(y \mid x) = \prod_{t=1}^{m} P_\theta(y_t \mid y_{<t}, x) \tag{1}$$

where $y_{<t}$ are all tokens in the target sentence prior to $y_t$. Each local distribution $P_\theta(y_t \mid y_{<t}, x)$ is modeled as a multinomial distribution over the target language vocabulary. We represent this as a linear transformation followed by a softmax function on the decoder's output vector $\tilde{h}^{dec}_t$:

$$P_\theta(y_t \mid y_{<t}, x) = \mathrm{softmax}\left(W_s \tilde{h}^{dec}_t / \tau\right) \tag{2}$$

$$\tilde{h}^{dec}_t = \tanh\left(W_c \left[h^{dec}_t ; c_t\right]\right) \tag{3}$$

$$c_t = \mathrm{attend}\left(h^{enc}_{1:n}, h^{dec}_t\right) \tag{4}$$

where $W_s$ and $W_c$ are weight matrices, $[\cdot\,;\cdot]$ is the concatenation of two vectors, $\mathrm{attend}(\cdot, \cdot)$ is an attention mechanism, $\tau$ is the temperature hyperparameter of the softmax function, and $h^{enc}$ and $h^{dec}$ are the hidden vectors generated by the encoder and the decoder, respectively. During training, the encoder first encodes $x$ to a continuous vector $\Phi(x)$, which is used as the initial hidden vector for the decoder. The decoder performs RNN updates to produce a sequence of hidden vectors:

$$h^{dec}_0 = \Phi(x), \qquad h^{dec}_t = f_\theta\left(h^{dec}_{t-1}, \left[\tilde{h}^{dec}_{t-1} ; e(y_t)\right]\right) \tag{5}$$

where $e(\cdot)$ is a word embedding lookup operation and $f_\theta$ is an LSTM cell. At prediction time, the ground-truth token $y_t$ in Eq. 5 is replaced by the model's own prediction $\hat{y}_t$:

$$\hat{y}_t = \arg\max_{y} P_\theta(y \mid \hat{y}_{<t}, x) \tag{6}$$

In a supervised learning framework, an NMT model is typically trained under the maximum log-likelihood objective:

$$\max_\theta \sum_{(x, y) \in D_{tr}} \log P_\theta(y \mid x) \tag{7}$$

where $D_{tr}$ is the training set.
However, this learning framework is not applicable to our problem since reference translations are not available.

Reinforcement Learning
The translation process of an NMT model can be viewed as a Markov decision process operating on a continuous state space. The states are the hidden vectors $h^{dec}_t$ generated by the decoder. The action space is the target language's vocabulary.

Markov decision process formulation
To generate a translation from a source sentence $x$, an NMT model commences at an initial state $h^{dec}_0$, which is a representation of $x$ computed by the encoder. At time step $t > 0$, the model decides the next action to take by defining a stochastic policy $P_\theta(y_t \mid y_{<t}, x)$, which is directly parametrized by the parameters $\theta$ of the model. This policy takes the previous state $h^{dec}_{t-1}$ as input and produces a probability distribution over all actions (words in the target vocabulary). The next action $\hat{y}_t$ is chosen either by taking the arg max or by sampling from this policy. The decoder computes the current state $h^{dec}_t$ by applying an RNN update on the previous state $h^{dec}_{t-1}$ and the action taken $\hat{y}_t$ (Eq. 5). The objective of bandit NMT is to find a policy that maximizes the expected quality of translations sampled from the model's policy:

$$\max_\theta \; \mathbb{E}_{x}\, \mathbb{E}_{\hat{y} \sim P_\theta(y \mid x)} \left[ R(\hat{y}, x) \right] \tag{8}$$

where $R$ is a reward function that returns a score in $[0, 1]$ reflecting the quality of the input translation. We optimize this objective function by policy gradient methods. The gradient of the objective in Eq. 8 with respect to $\theta$ is:

$$\nabla_\theta\, \mathbb{E}_{x}\, \mathbb{E}_{\hat{y} \sim P_\theta(y \mid x)} \left[ R(\hat{y}, x) \right] = \mathbb{E}_{x}\, \mathbb{E}_{\hat{y} \sim P_\theta(y \mid x)} \left[ R(\hat{y}, x) \sum_{t=1}^{|\hat{y}|} \nabla_\theta \log P_\theta(\hat{y}_t \mid \hat{y}_{<t}, x) \right] \tag{9}$$
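As an illustration of the score-function (policy gradient) identity behind this objective, the following minimal sketch uses a toy one-step softmax policy over three actions with fixed rewards. The function names and the toy numbers are assumptions made for this example and are not part of our system. It computes the exact gradient by enumerating actions and a single-sample estimate of the same quantity.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(probs, action):
    """Gradient of log p(action) w.r.t. the logits: onehot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def exact_gradient(logits, rewards):
    """Sum over actions of p(a) * R(a) * grad log p(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        s = score(probs, a)
        for i in range(len(grad)):
            grad[i] += p_a * rewards[a] * s[i]
    return grad


def sampled_gradient(logits, rewards, rng):
    """Single-sample estimate: R(a) * grad log p(a) with a drawn from the policy."""
    probs = softmax(logits)
    a = rng.choices(range(len(probs)), weights=probs)[0]
    return [rewards[a] * s for s in score(probs, a)]


# Toy values (hypothetical): three actions with rewards in [0, 1].
logits = [0.5, -0.2, 1.0]
rewards = [0.1, 0.9, 0.4]
print(exact_gradient(logits, rewards))
print(sampled_gradient(logits, rewards, random.Random(0)))
```

In the bandit setting only the sampled form is available, since the reward is observed for one translation per source sentence.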

Advantage Actor-Critic
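Algorithm 1 The A2C algorithm for NMT.
1: for k = 0, ..., K do
2: receive a source sentence $x$
3: sample a translation: $\hat{y} \sim P_\theta(y \mid x)$
4: receive reward $R(\hat{y}, x)$
5: update the NMT model using the gradient in Eq. 10
6: update the critic model using the gradient in Eq. 12
7: end for

We follow the approach of the advantage actor-critic (A2C) algorithm (Mnih et al., 2016), which combines the REINFORCE algorithm (Williams, 1992) with actor-critic. The algorithm approximates the gradient in Eq. 9 by a single-point sample and normalizes the rewards by V values to reduce variance:

$$\nabla_\theta\, \mathbb{E}_{\hat{y} \sim P_\theta(y \mid x)} \left[ R(\hat{y}, x) \right] \approx \sum_{t=1}^{|\hat{y}|} \bar{R}_t(\hat{y}, x)\, \nabla_\theta \log P_\theta(\hat{y}_t \mid \hat{y}_{<t}, x) \tag{10}$$

where $\bar{R}_t(\hat{y}, x) = R(\hat{y}, x) - V(\hat{y}_{<t}, x)$, and $V(\hat{y}_{<t}, x) = \mathbb{E}_{\hat{y}' \sim P_\theta(\cdot \mid \hat{y}_{<t}, x)} \left[ R(\hat{y}', x) \right]$ is a baseline that estimates the expected future reward given $x$ and $\hat{y}_{<t}$.
We train a critic model $V_\omega$ to estimate the V values. This model is an attention-based encoder-decoder model that encodes a source sentence $x$ and decodes a predicted translation $\hat{y}$. At time step $t$, it computes $V_\omega(\hat{y}_{<t}, x) = W_o \tilde{h}^{dec}_t$, where $\tilde{h}^{dec}_t$ is the hidden state of the RNN decoder and $W_o$ is a matrix that transforms a vector into a scalar (footnote 7). The critic model is trained to minimize the MSE between its estimates and the true values:

$$L_{crt}(\omega) = \mathbb{E}_{x}\, \mathbb{E}_{\hat{y} \sim P_\theta(y \mid x)} \left[ \sum_{t=1}^{|\hat{y}|} \left( R(\hat{y}, x) - V_\omega(\hat{y}_{<t}, x) \right)^2 \right] \tag{11}$$

Given a fixed $x$, the gradient with respect to $\omega$ of this objective is approximated with a single sample $\hat{y}$:

$$\nabla_\omega L_{crt}(\omega) \approx 2 \sum_{t=1}^{|\hat{y}|} \left( V_\omega(\hat{y}_{<t}, x) - R(\hat{y}, x) \right) \nabla_\omega V_\omega(\hat{y}_{<t}, x) \tag{12}$$

Algorithm 1 describes our algorithm. For each $x$, we draw a single sample $\hat{y}$ from the NMT model, which is used for both estimating the gradient of the NMT model (Eq. 10) and the gradient of the critic model (Eq. 12). We update the NMT model and the critic model simultaneously.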
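To make the per-step quantities concrete, here is a small sketch of the signals that one sampled translation contributes. It assumes the critic's per-step estimates are already available as a list of numbers, and it takes the gradient of the squared error with respect to the critic outputs rather than its parameters. The function name `a2c_signals` and the toy values are hypothetical.

```python
import math


def a2c_signals(reward, values):
    """Per-step A2C quantities for one sampled translation.

    reward: scalar sentence-level feedback R(y_hat, x).
    values: critic estimates V(y_hat_<t, x), one per decoding step.
    Returns (advantages, critic_loss, dloss_dvalues).
    """
    # Advantage used to weight grad log P(y_hat_t | y_hat_<t, x) at step t.
    advantages = [reward - v for v in values]
    # Squared-error critic loss summed over decoding steps.
    critic_loss = sum((reward - v) ** 2 for v in values)
    # Derivative of the loss w.r.t. each critic output (chain rule entry point).
    dloss_dvalues = [2.0 * (v - reward) for v in values]
    return advantages, critic_loss, dloss_dvalues


# Toy example (hypothetical numbers): a three-token translation scored 0.6.
advantages, critic_loss, dloss_dvalues = a2c_signals(0.6, [0.5, 0.7, 0.6])
print(advantages, critic_loss, dloss_dvalues)
```

Steps where the critic under-predicted the reward get a positive weight, so the sampled token becomes more likely; steps where it over-predicted get a negative weight.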

Domain Adaptation
We performed domain adaptation by choosing the best out-of-domain parallel data for training, using the cross-entropy-based data selection technique of Moore and Lewis (2010).

Cross-Entropy Difference
Footnote 7: We abuse the notation $\tilde{h}^{dec}$ to denote the decoder output. But since the translation model and the critic model do not share parameters, their decoder outputs are distinct.

The Moore and Lewis method uses the cross-entropy difference $H_I(s) - H_O(s)$ for scoring a given sentence $s$, based on an in-domain language model $LM_I$ and an out-of-domain language model $LM_O$ (Moore and Lewis, 2010). We trained $LM_O$ using the German-English Europarl, NewsCommentary, CommonCrawl and Rapid (i.e. out-of-domain) data sets and $LM_I$ using the e-commerce domain data provided by Amazon. After training both language models, we follow the Moore and Lewis method by applying the cross-entropy difference to score each sentence in the out-of-domain data. The cross-entropy is mathematically defined as:

$$H_{P_{LM}}(W) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{LM}(w_i \mid w_1, \cdots, w_{i-1})$$

where $P_{LM}$ is the probability assigned by a language model to the word sequence $W = (w_1, \cdots, w_n)$, and $w_1, \cdots, w_{i-1}$ represents the history of the word $w_i$.
Sentences with the lowest cross-entropy difference scores are the most relevant because they are more similar to the in-domain data and less similar to the average of the out-of-domain data. Using this criterion, the top $n$ out-of-domain sentences are used to create the training set $D_{tr}$. In this work we consider various values of $n$, selecting the one that provides the best performance on the validation set.
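The selection criterion can be sketched in a few lines. The example below is a toy illustration only: it substitutes add-one-smoothed unigram models for the n-gram language models used in practice, and the function names and example sentences are invented for this sketch.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for whitespace-tokenized text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy of a non-empty sentence, with add-one smoothing."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return -log_prob / len(tokens)


def select_top_n(in_domain, out_domain, n):
    """Rank out-of-domain sentences by H_I(s) - H_O(s); keep the n lowest."""
    lm_in = train_unigram(in_domain)
    lm_out = train_unigram(out_domain)
    # Shared vocabulary size, plus one slot for unseen tokens.
    vocab_size = len(set(lm_in[0]) | set(lm_out[0])) + 1
    candidates = [s for s in out_domain if s.split()]  # skip empty sentences

    def difference(s):
        return cross_entropy(s, lm_in, vocab_size) - cross_entropy(s, lm_out, vocab_size)

    return sorted(candidates, key=difference)[:n]


# Hypothetical toy corpora.
in_domain = ["cheap wireless headphones", "wireless mouse with usb receiver"]
out_domain = [
    "the parliament adopted the resolution",
    "wireless headphones are popular",
    "the committee discussed the budget",
]
print(select_top_n(in_domain, out_domain, 1))
```

The sentence that shares vocabulary with the in-domain sample receives the lowest difference score and is selected first.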

Experiments
This section describes the experiments we conducted to assess the challenges posed by bandit machine translation, and our exploration of efficient algorithms to improve machine translation systems using bandit feedback.
As explained in previous sections, this task requires performing domain adaptation for machine translation through bandit feedback. With this in mind, we experimented with two types of models: simple domain adaptation without using the feedback, and reinforcement learning models that leverage the feedback. In the following sections, we explain how we train the regular NMT model, how we select training data for domain adaptation, and how we use reinforcement learning to leverage the bandit feedback.
We trained our systems using only the out-of-domain parallel data permitted by the shared task. The entire out-of-domain dataset contains 4.5 million parallel German-English sentences from the Europarl, NewsCommentary, CommonCrawl and Rapid data for the News Translation (constrained) task. Our NMT model is based on OpenNMT's (Klein et al., 2017) PyTorch implementation of an attention-based encoder-decoder model. We extended their implementation with our own implementation of the A2C algorithm. Details of the model configuration and training hyperparameters are listed in Table 1.

Subword Unit for Neural Machine Translation
Neural machine translation (NMT) relies on first mapping each word into a vector space, and traditionally there is one word vector for each word in a fixed vocabulary. Because rare words occur only a few times in the training data, it is hard for the system to learn high-quality representations for them. To address this problem, with the goal of open-vocabulary NMT, Sennrich et al. (2015) proposed to learn subword units and perform translation at the subword level. We incorporated this approach in our system as a preprocessing step. We learn the so-called byte-pair encoding (BPE), which is a mapping from words to subword units, on the whole training set (WMT15), for both the source and target languages. The same mapping is used for all the training sets in our system. After translation, we apply an extra post-processing step to convert the target-language subword units back to words. With BPE, the vocabulary size is reduced dramatically and we no longer need to prune the vocabularies. We find this approach to be very helpful and use it for all our systems.
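For readers unfamiliar with BPE, the merge-learning loop can be sketched as follows. This is a simplified illustration of the general idea (repeatedly merging the most frequent adjacent symbol pair), not the preprocessing code used for our systems; the function names and the toy word frequencies are assumptions for this example.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Toy corpus statistics (hypothetical).
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

Applying the learned merges in order to a new word splits it into known subword units, which is what keeps the vocabulary small and open.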

Domain Adaptation
As explained in Section 4, we use the data selection method of Moore and Lewis (2010) for domain adaptation. We use the KenLM toolkit (Heafield, 2011) to build all the language models used for data selection. We train 4-gram language models. For computing the cross-entropy difference scores, we use the XenC (Rousseau, 2013) open-source data selection tool. We use the monolingual data selection mode of XenC on the in-domain and out-of-domain source sentences.
We have two parameters in this data selection process: the size of the in-domain dataset used for training the in-domain language model, and the size of the out-of-domain training data that we select. We experimented with different configurations, and the results on the development server are listed in Table 2. To obtain the in-domain data, we pre-fetch the source sentences from the development and training servers. For the training server, we do not have enough keys to test all combinations, so we picked several configurations and, for each sentence, randomly select a system to translate it. In addition, we compare decoding with and without beam search. The purpose of this is to provide another comparable baseline for the later reinforcement learning model, for which beam search cannot be used. Thus, the domain adaptation system that we submit to the training server is a uniformly random combination of 6 systems, and their individual average BLEU scores are listed in Table 3.
These results show that most configurations of data selection improve the overall BLEU score. The model without data selection achieves 18.70 BLEU on the development server, while the best data selection configuration achieves 20.16. On the training server, the scores are 18.65 without data selection and 20.13 with it. Table 3 also shows that beam search improves the BLEU score.

Reinforcement Learning Results
While translating with the domain adaptation models on the development server, we collect 320,000 triples of (source sentence, translation, feedback) from 8 submitted systems. We use these triples to pre-train the critic in the A2C algorithm. We use the same pre-trained critic for all A2C-trained systems. The critic for each model is then updated jointly with the NMT model, as in Algorithm 1.

We note that the A2C algorithm has some drawbacks when it comes to generating translations. Normally we generate translations by greedy decoding, which means that at each time step we pick the word with the highest probability from the distribution produced by the model. But with A2C, we need to sample from the distribution of words to ensure exploration. As a direct consequence, it is not clear how to apply beam search for A2C (and for policy gradient methods in general). To control the trade-off between exploration and exploitation, we use the temperature hyperparameter τ in the softmax function. In our experiments τ is set to 2/3, which produces a peakier distribution and makes the model explore less.
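The effect of the temperature can be checked numerically. The sketch below assumes the logits are divided by τ before the softmax, which is the convention under which τ < 1 sharpens the distribution; the function name and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau, computed in a numerically stable way."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-word vocabulary.
logits = [2.0, 1.0, 0.0]
plain = softmax_with_temperature(logits, 1.0)
peaky = softmax_with_temperature(logits, 2.0 / 3.0)
print(plain)
print(peaky)
```

With τ = 2/3 more probability mass moves to the highest-scoring word, so sampled translations stay closer to the greedy output while still allowing some exploration.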
Batching is desirable during bandit training for stability. However, the submission servers return only a single reward per request, so we had to devise a method for batching the feedback from the server: we cache the rewards until we reach the batch size, then perform a batch update. Due to some bugs in our implementation of this method, some sentences were not submitted in the correct order, and at some test points on the training server the scores are near or equal to zero.
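The caching scheme described above can be sketched as a small buffer that releases a batch once enough feedback has accumulated. This is an illustrative sketch only: the class name `RewardBatcher` and its interface are hypothetical, and the server interaction and the model update are left out.

```python
class RewardBatcher:
    """Caches (source, translation, reward) triples until a full batch is ready."""

    def __init__(self, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self._pending = []

    def add(self, source, translation, reward):
        """Store one feedback triple; return a full batch or None."""
        self._pending.append((source, translation, reward))
        if len(self._pending) < self.batch_size:
            return None
        # Hand back the cached triples in arrival order and start a new batch.
        batch, self._pending = self._pending, []
        return batch


# Usage with made-up feedback values.
batcher = RewardBatcher(batch_size=2)
first = batcher.add("ein Satz", "a sentence", 0.7)
second = batcher.add("noch ein Satz", "another sentence", 0.4)
print(first, second)
```

Keeping the triples in arrival order matters here, because each reward must be paired with the exact sample that produced it when the batch update is computed.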
In Figure 1 we present some results from the development server. We use a data selection model (200k in-domain data, 30% out-of-domain training data) as the baseline translation model, which we then improve further with the A2C algorithm. From this model, we generate translations with both sampling and greedy decoding to see how much the exploration required by the A2C algorithm hurts performance. Figure 1 shows the average BLEU score over every 2000 sentences from the development server. A2C loses at the beginning because of exploration, and catches up as it sees more examples. Sampling instead of greedy decoding lowers the BLEU score at first, but exploration eventually improves the model.

Conclusion
We present the University of Maryland neural machine translation systems for the WMT17 bandit MT shared task. We employ two approaches: out-of-domain data selection and reinforcement learning. Experiments show that the best performance is achieved with a model pre-trained with only one-third of the available out-of-domain data. When applying reinforcement learning to further improve this model with bandit feedback, the model performance degrades initially due to exploration but gradually improves over time. Future work is to determine if reinforcement learning is more effective on a larger bandit learning dataset.