Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet, current neural machine translation training focuses on expensive human-generated reference translations. We describe a reinforcement learning algorithm that improves neural machine translation systems from simulated human feedback. Our algorithm combines the advantage actor-critic algorithm (Mnih et al., 2016) with the attention-based neural encoder-decoder architecture (Luong et al., 2015). This algorithm (a) is well-designed for problems with a large action space and delayed rewards, (b) effectively optimizes traditional corpus-level machine translation metrics, and (c) is robust to skewed, high-variance, granular feedback modeled after actual human behaviors.


Introduction
Bandit structured prediction is the task of learning to solve complex joint prediction problems (like parsing or machine translation) under a very limited feedback model: a system must produce a single structured output (e.g., translation) and then the world reveals a score that measures how good or bad that output is, but provides neither a "correct" output nor feedback on any other possible output (Chang et al., 2015;Sokolov et al., 2015). Because of the extreme sparsity of this feedback, a common experimental setup is that one pre-trains a good-but-not-great "reference" system based on whatever labeled data is available, and then seeks to improve it over time using this bandit feedback.
A common motivation for this problem setting is cost. In the case of translation, bilingual "experts" can read a source sentence and a possible translation, and can much more quickly provide a rating of that translation than they can produce a full translation on their own. Furthermore, one can often collect even less expensive ratings from "non-experts" who may or may not be bilingual (Hu et al., 2014). Breaking this reliance on expensive data could unlock previously ignored languages and speed development of broad-coverage machine translation systems.
All work on bandit structured prediction we know makes an important simplifying assumption: the score provided by the world is exactly the score the system must optimize ( §2). In the case of parsing, the score is attachment score; in the case of machine translation, the score is (sentence-level) BLEU. While this simplifying assumption has been incredibly useful in building algorithms, it is highly unrealistic. Any time we want to optimize a system by collecting user feedback, we must take into account: 1. The metric we care about (e.g., expert ratings) may not correlate perfectly with the measure that the reference system was trained on (e.g., BLEU or log likelihood); 2. Human judgments might be more granular than traditional continuous metrics (e.g., thumbs up vs. thumbs down); 3. Human feedback have high variance (e.g., different raters might give different responses given the same system output); 4. Human feedback might be substantially skewed (e.g., a rater may think all system outputs are poor). Our first contribution is a strategy to simulate expert and non-expert ratings to evaluate the robustness of bandit structured prediction algorithms in general, in a more realistic environment ( §4). We construct a family of perturbations to capture three attributes: granularity, variance, and skew. We apply these perturbations on automatically generated scores to simulate noisy human ratings. To make our simulated ratings as realistic as possible, we study recent human evaluation data (Graham et al., 2017) and fit models to match the noise profiles in actual human ratings ( §4.2).
Our second contribution is a reinforcement learning solution to bandit structured prediction and a study of its robustness to these simulated human ratings ( § 3). 1 We combine an encoderdecoder architecture of machine translation (Luong et al., 2015) with the advantage actor-critic algorithm (Mnih et al., 2016), yielding an approach that is simple to implement but works on lowresource bandit machine translation. Even with substantially restricted granularity, with high variance feedback, or with skewed rewards, this combination improves pre-trained models ( §6). In particular, under realistic settings of our noise parameters, the algorithm's online reward and final heldout accuracies do not significantly degrade from a noise-free setting.

Bandit Machine Translation
The bandit structured prediction problem (Chang et al., 2015;Sokolov et al., 2015) is an extension of the contextual bandits problem (Kakade et al., 2008;Langford and Zhang, 2008) to structured prediction. Bandit structured prediction operates over time i = 1 . . . K as: 1. World reveals context x (i) 2. Algorithm predicts structured outputŷ (i) 3. World reveals reward R ŷ (i) , x (i) We consider the problem of learning to translate from human ratings in a bandit structured prediction framework. In each round, a translation model receives a source sentence x (i) , produces a translationŷ (i) , and receives a rating R ŷ (i) , x (i) from a human that reflects the quality of the translation. We seek an algorithm that achieves high reward over K rounds (high cumulative reward). The challenge is that even though the model knows how good the translation is, it knows neither where its mistakes are nor what the "correct" translation looks like. It must balance exploration (finding new good predictions) 1 Our code is at https://github.com/ khanhptnk/bandit-nmt (in PyTorch). Figure 1: A translation rating interface provided by Facebook. Users see a sentence followed by its machined-generated translation and can give ratings from one to five stars.
with exploitation (producing predictions it already knows are good). This is especially difficult in a task like machine translation, where, for a twenty token sentence with a vocabulary size of 50k, there are approximately 10 94 possible outputs, of which the algorithm gets to test exactly one.
Despite these challenges, learning from nonexpert ratings is desirable. In real-world scenarios, non-expert ratings are easy to collect but other stronger forms of feedback are prohibitively expensive. Platforms that offer translations can get quick feedback "for free" from their users to improve their systems ( Figure 1). Even in a setting in which annotators are paid, it is much less expensive to ask a bilingual speaker to provide a rating of a proposed translation than it is to pay a professional translator to produce one from scratch.

Effective Algorithm for Bandit MT
This section describes the neural machine translation architecture of our system ( § 3.1). We formulate bandit neural machine translation as a reinforcement learning problem ( § 3.2) and discuss why standard actor-critic algorithms struggle with this problem ( § 3.3). Finally, we describe a more effective training approach based on the advantage actor-critic algorithm ( §3.4).

Neural machine translation
Our neural machine translation (NMT) model is a neural encoder-decoder that directly computes the probability of translating a target sentence y = (y 1 , · · · , y m ) from source sentence x: where P θ (y t | y <t , x) is the probability of outputting the next word y t at time step t given a translation prefix y <t and a source sentence x.
We use an encoder-decoder NMT architecture with global attention (Luong et al., 2015), where both the encoder and decoder are recurrent neural networks (RNN) (see Appendix A for a more detailed description). These models are normally trained by supervised learning, but as reference translations are not available in our setting, we use reinforcement learning methods, which only require numerical feedback to function.

Bandit NMT as Reinforcement Learning
NMT generating process can be viewed as a Markov decision process on a continuous state space. The states are the hidden vectors h dec t generated by the decoder. The action space is the target language's vocabulary.
To generate a translation from a source sentence x, an NMT model starts at an initial state h dec 0 : a representation of x computed by the encoder. At time step t, the model decides the next action to take by defining a stochastic policy P θ (y t | y <t , x), which is directly parametrized by the parameters θ of the model. This policy takes the current state h dec t−1 as input and produces a probability distribution over all actions (target vocabulary words). The next actionŷ t is chosen by taking arg max or sampling from this distribution. The model computes the next state h dec t by updating the current state h dec t−1 by the action takenŷ t . The objective of bandit NMT is to find a policy that maximizes the expected reward of translations sampled from the model's policy: where D tr is the training set and R is the reward function (the rater). 2 We optimize this objective function with policy gradient methods. For a fixed x, the gradient of the objective in Eq 2 is: where Q(ŷ <t ,ŷ t ) is the expected future reward of y t given the current prefixŷ <t , then continuing sampling from P θ to complete the translation: withR(ŷ , x) ≡ R(ŷ , x)1 ŷ <t =ŷ <t ,ŷ t =ŷ t 2 Our raters are stochastic, but for simplicity we denote the reward as a function; it should be expected reward.
1{·} is the indicator function, which returns 1 if the logic inside the bracket is true and returns 0 otherwise.
The gradient in Eq 3 requires rating all possible translations, which is not feasible in bandit NMT. Naïve Monte Carlo reinforcement learning methods such as REINFORCE (Williams, 1992) estimates Q values by sample means but yields very high variance when the action space is large, leading to training instability.

Why are actor-critic algorithms not effective for bandit NMT?
Reinforcement learning methods that rely on function approximation are preferred when tackling bandit structured prediction with a large action space because they can capture similarities between structures and generalize to unseen regions of the structure space. The actor-critic algorithm (Konda and Tsitsiklis) uses function approximation to directly model the Q function, called the critic model. In our early attempts on bandit NMT, we adapted the actor-critic algorithm for NMT in Bahdanau et al. (2017), which employs the algorithm in a supervised learning setting. Specifically, while an encoder-decoder critic model Q ω as a substitute for the true Q function in Eq 3 enables taking the full expectation (because the critic model can be queried with any stateaction pair), we are unable to obtain reasonable results with this approach. Nevertheless, insights into why this approach fails on our problem explains the effectiveness of the approach discussed in the next section. There are two properties in Bahdanau et al. (2017) that our problem lacks but are key elements for a successful actor-critic. The first is access to reference translations: while the critic model is able to observe reference translations during training in their setting, bandit NMT assumes those are never available. The second is per-step rewards: while the reward function in their setting is known and can be exploited to compute immediate rewards after taking each action, in bandit NMT, the actorcritic algorithm struggles with credit assignment because it only receives reward when a translation is completed. Bahdanau et al. (2017) report that the algorithm degrades if rewards are delayed until the end, consistent with our observations.
With an enormous action space of bandit NMT, approximating gradients with the Q critic model induces biases and potentially drives the model to wrong optima. Values of rarely taken actions are often overestimated without an explicit constraint between Q values of actions (e.g., a sum-to-one constraint). Bahdanau et al. (2017) add an ad-hoc regularization term to the loss function to mitigate this issue and further stablizes the algorithm with a delay update scheme, but at the same time introduces extra tuning hyper-parameters.

Advantage Actor-Critic for Bandit NMT
We follow the approach of advantage actorcritic (Mnih et al., 2016, A2C) and combine it with the neural encoder-decoder architecture. The resulting algorithm-which we call NED-A2Capproximates the gradient in Eq 3 by a single sampleŷ ∼ P (· |x) and centers the reward R(ŷ) using the state-specific expected future reward V (ŷ <t ) to reduce variance: We train a separate attention-based encoderdecoder model V ω to estimate V values. This model encodes a source sentence x and decodes a sampled translationŷ. At time step t, it computes where h crt t is the current decoder's hidden vector and w is a learned weight vector. The critic model minimizes the MSE between its estimates and the true values: We use a gradient approximation to update ω for a fixed x andŷ ∼ P (· |x): NED-A2C is better suited for problems with a large action space and has other advantages over actor-critic. For large action spaces, approximating gradients using the V critic model induces lower biases than using the Q critic model. As implied by its definition, the V model is robust to biases incurred by rarely taken actions since rewards of those actions are weighted by very small probabilities in the expectation. In addition, the V model has a much smaller number of parameters and thus is more sample-efficient and more stable to train than the Q model. These attractive properties were not studied in A2C's original paper (Mnih et al., 2016).
Algorithm 1 The NED-A2C algorithm for bandit NMT.
update the NMT model using Eq 5. 6: update the critic model using Eq 7. 7: end for Algorithm 1 summarizes NED-A2C for bandit NMT. For each x, we draw a single sampleŷ from the NMT model, which is used for both estimating gradients of the NMT model and the critic model. We run this algorithm with mini-batches of x and aggregate gradients over all x in a minibatch for each update. Although our focus is on bandit NMT, this algorithm naturally works with any bandit structured prediction problem.

Modeling Imperfect Ratings
Our goal is to establish the feasibility of using real human feedback to optimize a machine translation system, in a setting where one can collect expert feedback as well as a setting in which one only collects non-expert feedback. In all cases, we consider the expert feedback to be the "gold standard" that we wish to optimize. To establish the feasibility of driving learning from human feedback without doing a full, costly user study, we begin with a simulation study. The key aspects (Figure 2) of human feedback we capture are: (a) mismatch between training objective and feedbackmaximizing objective, (b) human ratings typically are binned ( § 4.1), (c) individual human ratings have high variance ( §4.2), and (d) non-expert ratings can be skewed with respect to expert ratings ( §4.3).
In our simulated study, we begin by modeling gold standard human ratings using add-onesmoothed sentence-level BLEU (Chen and Cherry, 2014). 3 Our evaluation criteria, therefore, is average sentence-BLEU over the run of our algo- rithm. However, in any realistic scenario, human feedback will vary from its average, and so the reward that our algorithm receives will be a perturbed variant of sentence-BLEU. In particular, if the sentence-BLEU score is s ∈ [0, 1], the algorithm will only observe s ∼ pert(s), where pert is a perturbation distribution. Because our reference machine translation system is pre-trained using log-likelihood, there is already an (a) mismatch between training objective and feedback, so we focus on (b-d) below.

Humans Provide Granular Feedback
When collecting human feedback, it is often more effective to collect discrete binned scores. A classic example is the Likert scale for human agreement (Likert, 1932) or star ratings for product reviews. Insisting that human judges provide continuous values (or feedback at too fine a granularity) can demotivate raters without improving rating quality (Preston and Colman, 2000).
To model granular feedback, we use a simple rounding procedure. Given an integer parameter g for degree of granularity, we define: pert gran (s; g) = 1 g round(gs) This perturbation function divides the range of possible outputs into g + 1 bins. For example, for g = 5, we obtain bins [0, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 0.9) and [0.9, 1.0]. Since most sentence-BLEU scores are much closer to zero than to one, many of the larger bins are frequently vacant.

Experts Have High Variance
Human feedback has high variance around its expected value. A natural goal for a variance model of human annotators is to simulate-as closely as possible-how human raters actually perform. We use human evaluation data recently collected as part of the WMT shared task (Graham et al., 2017). The data consist of 7200 sentences multiply annotated by giving non-expert annotators on Amazon Mechanical Turk a reference sentence and a single system translation, and asking the raters to judge the adequacy of the translation. 4 From these data, we treat the average human rating as the ground truth and consider how individual human ratings vary around that mean. To visualize these results with kernel density estimates (standard normal kernels) of the standard deviation. Figure 3 shows the mean rating (x-axis) and the deviation of the human ratings (y-axis) at each mean. 5 As expected, the standard deviation is small at the extremes and large in the middle (this is a bounded interval), with a fairly large range in the middle: a translation whose average score is 50 can get human evaluation scores anywhere between 20 and 80 with high probability. We use a linear approximation to define our variance-based perturbation function as a Gaussian distribution, which is parameterized by a scale λ that grows or shrinks the variances (when λ = 1 this exactly matches the variance in the plot). pert var (s; λ) = Nor s, λσ(s) 2 (9) σ(s) = 0.64s − 0.02 if s < 50 −0.67s + 67.0 otherwise

Non-Experts are Skewed from Experts
The preceding two noise models assume that the reward closely models the value we want to optimize (has the same mean). This may not be the case with non-expert ratings. Non-expert 4 Typical machine translation evaluations evaluate pairs and ask annotators to choose which is better. 5 A current limitation of this model is that the simulated noise is i.i.d. conditioned on the rating (homoscedastic noise). While this is a stronger and more realistic model than assuming no noise, real noise is likely heteroscedastic: dependent on the input.  raters are skewed both for reinforcement learning (Thomaz et al., 2006;Thomaz and Breazeal, 2008;Loftin et al., 2014) and recommender systems (Herlocker et al., 2000;Adomavicius and Zhang, 2012), but are typically bimodal: some are harsh (typically provide very low scores, even for "okay" outputs) and some are motivational (providing high scores for "okay" outputs). We can model both harsh and motivations raters with a simple deterministic skew perturbation function, parametrized by a scalar ρ ∈ [0, ∞): For ρ > 1, the rater is harsh; for ρ < 1, the rater is motivational.

Experimental Setup
We choose two language pairs from different language families with different typological properties: German-to-English and (De-En) and Chinese-to-English (Zh-En). We use parallel transcriptions of TED talks for these pairs of languages from the machine translation track of the IWSLT 2014 and 2015 (Cettolo et al., 2014(Cettolo et al., , 2015(Cettolo et al., , 2012. For each language pair, we split its data into four sets for supervised training, bandit training, development and testing (Table 1). For English and German, we tokenize and clean sen-tences using Moses (Koehn et al., 2007). For Chinese, we use the Stanford Chinese word segmenter (Chang et al., 2008) to segment sentences and tokenize. We remove all sentences with length greater than 50, resulting in an average sentence length of 18. We use IWSLT 2015 data for supervised training and development, IWSLT 2014 data for bandit training and previous years' development and evaluation data for testing.

Evaluation Framework
For each task, we first use the supervised training set to pre-train a reference NMT model using supervised learning. On the same training set, we also pre-train the critic model with translations sampled from the pre-trained NMT model. Next, we enter a bandit learning mode where our models only observe the source sentences of the bandit training set. Unless specified differently, we train the NMT models with NED-A2C for one pass over the bandit training set. If a perturbation function is applied to Per-Sentence BLEU scores, it is only applied in this stage, not in the pre-training stage. We measure the improvement ∆S of an evaluation metric S due to bandit training: ∆S = S A2C − S ref , where S ref is the metric computed on the reference models and S A2C is the metric computed on models trained with NED-A2C. Our primary interest is Per-Sentence BLEU: average sentence-level BLEU of translations that are sampled and scored during the bandit learning pass. This metric represents average expert ratings, which we want to optimize for in real-world scenarios. We also measure Heldout BLEU: corpuslevel BLEU on an unseen test set, where translations are greedily decoded by the NMT models. This shows how much our method improves translation quality, since corpus-level BLEU correlates better with human judgments than sentence-level BLEU.
Because of randomness due to both the random sampling in the model for "exploration" as well as the randomness in the reward function, we repeat each experiment five times and report the mean results with 95% confidence intervals.

Model configuration
Both the NMT model and the critic model are encoder-decoder models with global attention (Luong et al., 2015). The encoder and the decoder are unidirectional single-layer LSTMs. They have the same word embedding size and LSTM hidden size of 500. The source and target vocabulary sizes are both 50K. We do not use dropout in our experiments. We train our models by the Adam optimizer (Kingma and Ba, 2015) with β 1 = 0.9, β 2 = 0.999 and a batch size of 64. For Adam's α hyperparameter, we use 10 −3 during pre-training and 10 −4 during bandit learning (for both the NMT model and the critic model). During pre-training, starting from the fifth pass, we decay α by a factor of 0.5 when perplexity on the development set increases. The NMT model reaches its highest corpus-level BLEU on the development set after ten passes through the supervised training data, while the critic model's training error stabilizes after five passes. The training speed is 18s/batch for supervised pre-training and 41s/batch for training with the NED-A2C algorithm.

Results and Analysis
In this section, we describe the results of our experiments, broken into the following questions: how NED-A2C improves reference models ( §6.1); the effect the three perturbation functions have on the algorithm ( § 6.2); and whether the algorithm improves a corpus-level metric that corresponds well with human judgments ( §6.3).

Effectiveness of NED-A2C under Un-perturbed Bandit Feedback
We evaluate our method in an ideal setting where un-perturbed Per-Sentence BLEU simulates ratings during both training and evaluation (Table 2).
Single round of feedback. In this setting, our models only observe each source sentence once and before producing its translation. On both De-En and Zh-En, NED-A2C improves Per-Sentence BLEU of reference models after only a single pass (+2.82 and +1.08 respectively).
Poor initialization. Policy gradient algorithms have difficulty improving from poor initializations, especially on problems with a large action space, because they use model-based exploration, which is ineffective when most actions have equal probabilities (Bahdanau et al., 2017;Ranzato et al., 2016). To see whether NED-A2C has this problem, we repeat the experiment with the same setup but with reference models pretrained for only a single pass. Surprisingly, NED-A2C is highly effective at improving these poorly Comparisons with supervised learning. To further demonstrate the effectiveness of NED-A2C, we compare it with training the reference models with supervised learning for a single pass on the bandit training set. Surprisingly, observing ground-truth translations barely improves the models in Per-Sentence BLEU when they are fully trained (less than +0.4 on both tasks). A possible explanation is that the models have already reached full capacity and do not benefit from more examples. 6 NED-A2C further enhances the models because it eliminates the mismatch between the supervised training objective and the evaluation objective. On weakly trained reference models, NED-A2C also significantly outperforms supervised learning (∆Per-Sentence BLEU of NED-A2C is over three times as large as those of supervised learning).
Multiple rounds of feedback. We examine if NED-A2C can improve the models even further with multiple rounds of feedback. 7 With supervised learning, the models can memorize the reference translations but, in this case, the models have to be able to exploit and explore effectively. We train the models with NED-A2C for five  Table 2: Translation scores and improvements based on a single round of un-perturbed bandit feedback. Per-Sentence BLEU and Heldout BLEU are not comparable: the former is sentence-BLEU, the latter is corpus-BLEU.
passes and observe a much more significant ∆Per-Sentence BLEU than training for a single pass in both pairs of language (+6.73 on De-En and +4.56 on Zh-En) (Figure 4).

Effect of Perturbed Bandit Feedback
We apply perturbation functions defined in § 4.1 to Per-Sentence BLEU scores and use the perturbed scores as rewards during bandit training ( Figure 5).
Granular Rewards. We discretize raw Per-Sentence BLEU scores using pert gran (s; g) ( §4.1). We vary g from one to ten (number of bins varies from two to eleven). Compared to continuous rewards, for both pairs of languages, ∆Per-Sentence BLEU is not affected with g at least five (at least six bins). As granularity decreases, ∆Per-Sentence BLEU monotonically degrades. However, even when g = 1 (scores are either 0 or 1), the models still improve by at least a point.
High-variance Rewards. We simulate noisy rewards using the model of human rating variance pert var (s; λ) ( § 4.2) with λ ∈ {0.1, 0.2, 0.5, 1, 2, 5}. Our models can withstand an amount of about 20% the variance in our human eval data without dropping in ∆Per-Sentence BLEU. When the amount of variance attains 100%, matching the amount of variance in the human data, ∆Per-Sentence BLEU go down by about 30% for both pairs of languages. As more variance is injected, the models degrade quickly but still improve from the pre-trained models. Variance is the most detrimental type of perturbation to NED-A2C among the three aspects of human ratings we model.

Skewed
Rewards. We model skewed raters using pert skew (s; ρ) ( § 4.3) with ρ ∈ {0.25, 0.5, 0.67, 1, 1.5, 2, 4}. NED-A2C is robust to skewed scores. ∆Per-Sentence BLEU is at least 90% of unskewed scores for most skew values. Only when the scores are extremely harsh (ρ = 4) does ∆Per-Sentence BLEU degrade significantly (most dramatically by 35% on Zh-En). At that degree of skew, a score of 0.3 is suppressed to be less than 0.08, giving little signal for the models to learn from. On the other spectrum, the models are less sensitive to motivating scores as Per-Sentence BLEU is unaffected on Zh-En and only decreases by 7% on De-En.

Held-out Translation Quality
Our method also improves pre-trained models in Heldout BLEU, a metric that correlates with translation quality better than Per-Sentence BLEU (Table 2). When scores are perturbed by our rating model, we observe similar patterns as with Per-Sentence BLEU: the models are robust to most perturbations except when scores are very coarse, or very harsh, or have very high variance (Figure 5, second row). Supervised learning improves Heldout BLEU better, possibly because maximizing log-likelihood of reference translations correlates more strongly with maximizing Heldout BLEU of predicted translations than maximizing Per-Sentence BLEU of predicted translations.

Related Work and Discussion
Ratings provided by humans can be used as effective learning signals for machines. Reinforcement learning has become the de facto standard for incorporating this feedback across diverse tasks such as robot voice control (Tenorio-Gonzalez et al.,  (Isbell et al., 2001). Recently, this learning framework has been combined with recurrent neural networks to solve machine translation (Bahdanau et al., 2017), dialogue generation (Li et al., 2016), neural architecture search (Zoph and Le, 2017), and device placement (Mirhoseini et al., 2017). Other approaches to more general structured prediction under bandit feedback (Chang et al., 2015;Sokolov et al., 2016a,b) show the broader efficacy of this framework. Ranzato et al. (2016) describe MIXER for training neural encoder-decoder models, which is a reinforcement learning approach closely related to ours but requires a policy-mixing strategy and only uses a linear critic model. Among work on bandit MT, ours is closest to Kreutzer et al. (2017), which also tackle this problem using neural encoder-decoder models, but we (a) take advantage of a state-of-the-art reinforcement learning method; (b) devise a strategy to simulate noisy rewards; and (c) demonstrate the robustness of our method on noisy simulated rewards.
Our results show that bandit feedback can be an effective feedback mechanism for neural machine translation systems. This is despite that errors in human annotations hurt machine learning models in many NLP tasks (Snow et al., 2008). An obvious question is whether we could extend our framework to model individual annotator preferences (Passonneau and Carpenter, 2014) or learn personalized models (Mirkin et al., 2015;Rabinovich et al., 2017), and handle heteroscedastic noise (Park, 1966;Kersting et al., 2007;Antos et al., 2010). Another direction is to apply active learning techniques to reduce the sample complexity required to improve the systems or to extend to richer action spaces for problems like simultaneous translation, which requires prediction (Grissom II et al., 2014) and reordering (He et al., 2015) among other strategies to both minimize delay and effectively translate a sentence (He et al., 2016).