Prediction Improves Simultaneous Neural Machine Translation

Simultaneous speech translation aims to maintain translation quality while minimizing the delay between reading input and incrementally producing the output. We propose a new general-purpose prediction action which predicts future words in the input to improve quality and minimize delay in simultaneous translation. We train this agent using reinforcement learning with a novel reward function. Our agent with prediction has better translation quality and less delay compared to an agent-based simultaneous translation system without prediction.


Introduction
One of the next significant challenges in machine translation research is to make translation ubiquitous through real-time translation. Simultaneous machine translation addresses this challenge by interleaving reading the input with writing the output translation. Current Simultaneous Neural Machine Translation (SNMT) systems (Satija and Pineau, 2016; Cho and Esipova, 2016; Gu et al., 2017) use an AGENT to control an incremental encoder-decoder (or sequence-to-sequence) NMT model. Each READ adds more information to the encoder RNN, and each WRITE produces more output using the decoder RNN. In this paper, we propose adding a new action to the AGENT: a PREDICT action that predicts what words might appear next in the input stream. Prediction was previously proposed in simultaneous statistical machine translation (Grissom II et al., 2014) but has not been studied in the context of Neural Machine Translation (NMT). In SNMT systems, predicting future words augments the encoder-decoder model with possible future contexts, so that it can produce output translations earlier (minimizing delay), produce better output translations (improving translation quality), or both. Our experiments show that prediction improves SNMT on both measures.

Simultaneous Translation Framework
An agent-based framework whose actions decide whether to translate or wait for more input is a natural way to extend neural MT to simultaneous neural MT, and has been explored in (Satija and Pineau, 2016; Gu et al., 2017). The framework contains two main components: the ENVIRONMENT, which receives the input words $X = \{x_1, \dots, x_N\}$ from the source language and incrementally generates translated words $W = \{w_1, \dots, w_M\}$ in the target language; and the AGENT, which decides an action $a_t$ for each time step $t$. The AGENT generates an action sequence $A = \{a_1, \dots, a_T\}$ to control the ENVIRONMENT.
Previous models include only two actions: READ and WRITE. We extend the model by adding a third action, PREDICT. A READ action sends a new source word to the ENVIRONMENT, which generates a candidate word in the target language. In a WRITE action, the AGENT takes the current candidate word and sends it to the output. For PREDICT, the AGENT predicts the next word in the input and treats it like a READ action. The following section explains how the ENVIRONMENT handles each action.

ENVIRONMENT
The ENVIRONMENT is an attention-based Encoder-Decoder MT system (Bahdanau et al., 2014) adapted to the simultaneous translation task. The Encoder receives the embedded representations of the input words (including predicted ones) and converts them into context vectors $H_n^{\rho} = \{h_1, \dots, h_{n+\rho}\}$ using a gated RNN (GRU), where $n$ is the number of input words read so far and $\rho$ is the number of predicted words since the last READ. Whenever the AGENT decides to READ, $\rho$ is set to 0 and $h_n = f_{ENC}(h_{n-1}, x_n)$, where $x_n$ is the next input word ($n \le N$). If the action is PREDICT, then $\rho > 0$: the AGENT predicts a new word $x'_{\rho}$, and the context vector $h_{n+\rho} = f_{ENC}(h_{n+\rho-1}, x'_{\rho})$ is added to $H_n^{\rho}$. At each time step $t$, the decoder uses the current context vectors $H_n^{\rho}$ to generate the next candidate output $y_t$. In this computation, $w_{m-1}$ is the previous output word, $a_{ATT}$ is an attention model (Bahdanau et al., 2014), $f$ and $g$ are nonlinear functions, and $c_t$ is the current context vector.
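For orientation, the decoder step can be written in the general shape of an attention decoder (Bahdanau et al., 2014). The argument lists below are an assumption assembled from the symbols named in this section ($a_{ATT}$, $f$, $g$, $c_t$, the candidate decoder state $s_t$, the committed decoder state $z_{m-1}$, and the previous output $w_{m-1}$); they are a sketch of one consistent reading, not a definitive statement of the model:

```latex
\begin{align}
c_t &= a_{ATT}\left(z_{m-1},\, w_{m-1},\, H_n^{\rho}\right) \\
s_t &= f\left(z_{m-1},\, w_{m-1},\, c_t\right) \\
y_t &= g\left(s_t,\, w_{m-1},\, c_t\right)
\end{align}
```

Under this reading, $c_t$ and $s_t$ are recomputed at every time step from the latest context vectors, while $z_{m-1}$ and $w_{m-1}$ change only when a word is actually written.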
If the action $a_t$ is either READ or PREDICT, the current candidate $y_t$ is ignored (the system waits for better predictions). In the case of WRITE, the candidate $y_t$ is produced as the next output word $w_m$ and the decoder state is updated ($w_m \leftarrow y_t$, $z_m \leftarrow s_t$). Note that as soon as the AGENT decides to READ, all the hidden vectors generated by PREDICT actions are discarded ($H_n^{\rho} = H_n^{0} = \{h_1, \dots, h_n\}$). Figure 1 shows an example of how a sentence can be translated using our modified translation framework.
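The bookkeeping behind the three actions can be sketched as follows. This is a minimal illustration, not the actual NMT system: the class `ToyEnvironment` and its callables `predict_next` and `candidate` are hypothetical stand-ins for the word predictor and the decoder, and plain word lists stand in for the context vectors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyEnvironment:
    """Tracks real and predicted source words; no neural model involved."""
    source: List[str]
    predict_next: Callable[[List[str]], str]            # hypothetical predictor
    candidate: Callable[[List[str], List[str]], str]    # hypothetical decoder step
    read_words: List[str] = field(default_factory=list)       # x_1 .. x_n
    predicted_words: List[str] = field(default_factory=list)  # x'_1 .. x'_rho
    output: List[str] = field(default_factory=list)           # w_1 .. w_m

    def context(self) -> List[str]:
        # Stand-in for H_n^rho: real words followed by predicted ones.
        return self.read_words + self.predicted_words

    def step(self, action: str) -> Optional[str]:
        if action == "READ":
            if len(self.read_words) >= len(self.source):
                raise IndexError("no source words left to read")
            # A READ discards all predicted contexts (rho is reset to 0)
            # and consumes the next real source word (n grows by one).
            self.predicted_words.clear()
            self.read_words.append(self.source[len(self.read_words)])
            return None
        if action == "PREDICT":
            # A PREDICT extends the context with a guessed word (rho grows by one).
            self.predicted_words.append(self.predict_next(self.context()))
            return None
        if action == "WRITE":
            # A WRITE commits the current candidate as the next output word.
            word = self.candidate(self.context(), self.output)
            self.output.append(word)
            return word
        raise ValueError(f"unknown action: {action}")


env = ToyEnvironment(
    source=["ich", "habe", "gegessen"],
    predict_next=lambda ctx: "<guess>",
    candidate=lambda ctx, out: f"y({len(ctx)})",
)
for a in ["READ", "PREDICT", "PREDICT", "WRITE", "READ"]:
    env.step(a)
```

In this run the WRITE sees one real and two predicted words, and the final READ drops both predictions before appending the second real word.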

AGENT
The AGENT is a separate component which examines the ENVIRONMENT at each time step and decides on the actions that lead to better translation quality and lower delay. The agent in the greedy decoding framework (Gu et al., 2017) was trained using reinforcement learning with the policy gradient algorithm (Williams, 1992). It observes the current state of the ENVIRONMENT at time step $t$ as $o_t = [c_t; s_t; w_m]$. An RNN with one hidden layer, followed by a softmax function, generates the probability distribution over the actions $a_t$ at each step. The policy $\pi_\theta$ is therefore computed from $u_t$, the hidden state of the AGENT's RNN.
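A recurrent softmax policy of this kind can be sketched in a few lines. The sketch assumes a plain tanh recurrence rather than a gated unit, and the weight matrices `W_u`, `W_o`, and `W_a`, their sizes, and the toy observation are all illustrative choices rather than values from the experiments.

```python
import math
import random
from typing import List

ACTIONS = ["READ", "WRITE", "PREDICT"]


def rnn_step(u_prev: List[float], o_t: List[float],
             W_u: List[List[float]], W_o: List[List[float]]) -> List[float]:
    """u_t = tanh(W_u u_{t-1} + W_o o_t): a plain (non-gated) recurrent update."""
    return [
        math.tanh(
            sum(w * u for w, u in zip(row_u, u_prev))
            + sum(w * o for w, o in zip(row_o, o_t))
        )
        for row_u, row_o in zip(W_u, W_o)
    ]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(u_t: List[float], W_a: List[List[float]]) -> List[float]:
    """Probability of each action given the agent's hidden state u_t."""
    logits = [sum(w * u for w, u in zip(row, u_t)) for row in W_a]
    return softmax(logits)


rng = random.Random(0)
hidden, obs = 4, 3  # toy sizes
W_u = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
W_o = [[rng.uniform(-0.5, 0.5) for _ in range(obs)] for _ in range(hidden)]
W_a = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(len(ACTIONS))]

u_t = rnn_step([0.0] * hidden, [0.1, -0.2, 0.3], W_u, W_o)
probs = policy(u_t, W_a)
action = rng.choices(ACTIONS, weights=probs, k=1)[0]  # sample a_t from the policy
```

Sampling from the distribution, rather than taking the most probable action, is what lets a policy gradient method explore alternative action sequences during training.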

Training the AGENT with Prediction
In order to speed up the training process, we restrict the AGENT's options by removing redundant operations. As illustrated in Figure 2, after a series of WRITE actions the AGENT cannot choose PREDICT, and after a sequence of PREDICT actions READ is not an option.
Reward Function: The total reward at any time step is calculated as the cumulative sum of the rewards for the actions at each preceding step. All the evaluation metrics have been modified so that they can be computed at every time step.
Quality: We use a modified smoothed version of the BLEU score (Chen and Cherry, 2014) multiplied by the Brevity Penalty (Lin and Och, 2004) to evaluate the impact of each action on translation quality. At each point in time, the reward for translation quality is based on $\Delta\mathrm{BLEU}(t)$, the difference between the BLEU score of the translated sentence at the current time step and at the previous time step: $\Delta\mathrm{BLEU}(t) = \mathrm{BLEU}(W_t, W^*) - \mathrm{BLEU}(W_{t-1}, W^*)$, where $W_t$ is the prefix of the translated sentence at time $t$ and $W^*$ is the reference translation.
Delay: The Delay reward motivates the AGENT to minimize delay. We use Average Proportion (AP) (Cho and Esipova, 2016) for this purpose, which is the average number of source words needed when translating each word. Given the source words $X$ and translated words $W$, AP is computed from $s(t)$, the number of source words the WRITE action uses at time step $t$ (for any other action, $s(t)$ is zero). The delay reward is smoothed using a Target Delay, a scalar constant denoted by $d^*$ (Gu et al., 2017).
Rewards for Quality and Delay alone do not motivate the AGENT to choose prediction: in preliminary experiments, after a number of steps, the number of prediction actions became zero. We address this problem by defining Prediction Quality (PQ), which rewards the AGENT for changes in BLEU score after each prediction action. With the initialization $r^p_0 = 0$, the prediction reward is defined case by case.
Final Reward: The final reward function is calculated as the combination of the quality, delay, and prediction rewards (Equation 1). The trade-off between better translation quality and minimal delay is achieved by modifying the weighting parameters of this combination ($\alpha$ and the two remaining reward weights). Reinforcement Learning is used to train the AGENT with a policy gradient algorithm (Gu et al., 2017; Williams, 1992), which searches for the policy parameters that maximize the expected cumulative reward. The gradient for a sentence is the cumulative sum of the gradients at each time step. We pre-train the ENVIRONMENT on full sentences using the log-loss $-\log p(y \mid x)$.
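Three of the ingredients above lend themselves to a short sketch: the action restriction, the per-step quality difference, and AP. The code below makes several assumptions that are worth stating: all three actions are allowed at the first step, predicted words do not count toward $s(t)$, AP takes the usual form $\frac{1}{|X||W|}\sum_t s(t)$, and the prefix scores are supplied by the caller rather than computed with smoothed BLEU. The function names are illustrative only.

```python
from typing import List, Optional, Sequence


def allowed_actions(prev_action: Optional[str]) -> List[str]:
    """Restrict the next action: no PREDICT after WRITE, no READ after PREDICT."""
    if prev_action == "WRITE":
        return ["READ", "WRITE"]
    if prev_action == "PREDICT":
        return ["WRITE", "PREDICT"]
    # Assumption: after a READ, or at the very first step, nothing is masked.
    return ["READ", "WRITE", "PREDICT"]


def incremental_rewards(prefix_scores: Sequence[float]) -> List[float]:
    """Per-step difference score(W_t) - score(W_{t-1}), with score(W_0) = 0."""
    rewards = []
    prev = 0.0
    for score in prefix_scores:
        rewards.append(score - prev)
        prev = score
    return rewards


def average_proportion(actions: Sequence[str], num_source: int) -> float:
    """AP = sum_t s(t) / (|X| * |W|), where s(t) is the number of real source
    words read so far when a WRITE happens and 0 for every other action."""
    read_so_far = 0
    total = 0
    writes = 0
    for action in actions:
        if action == "READ":
            read_so_far += 1
        elif action == "WRITE":
            total += read_so_far
            writes += 1
    if writes == 0 or num_source == 0:
        raise ValueError("AP needs at least one source word and one WRITE")
    return total / (num_source * writes)


interleaved = average_proportion(["READ", "WRITE", "READ", "WRITE"], num_source=2)
wait_for_all = average_proportion(["READ", "READ", "WRITE", "WRITE"], num_source=2)
deltas = incremental_rewards([0.0, 0.25, 0.25, 0.5])
```

Because the per-step differences telescope, their sum equals the score of the final prefix, so rewarding each step this way does not change the total quality signal for the sentence. In the two AP calls, reading everything before writing gives the maximum value of 1.0, while interleaving gives a lower value.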

Experiments
We train and evaluate our model on English-German (EN-DE) in both directions. We use WMT 2015 for training and Newstest 2013 for validation and testing. All sentences are tokenized, and the words are segmented using byte pair encoding (BPE) (Sennrich et al., 2016).
Model Configuration: For a fair comparison, we follow the settings that worked best for the greedy decoding model in (Gu et al., 2017) and set the target delay $d^*$ for the AGENT to 0.7. The ENVIRONMENT consists of two unidirectional layers with 1028 GRU units for the encoder and decoder. We train the network using the AdaDelta optimizer with a batch size of 32 and a fixed learning rate of 0.0001 without decay. For the AGENT, we use a softmax policy via a recurrent network with 512 GRU units followed by a softmax function, and train it using the Adam optimizer (Kingma and Ba, 2014). The batch size for the AGENT is 10, and the learning rate is 2e-6. The word predictor is an RNN language model with two layers of 1024 units, followed by a softmax layer. Its batch size is 64, with a learning rate of 2e-5. The predictor was trained on the WMT'16 dataset and tested on the Newstest'16 corpora for both languages. The perplexity of our language model is reported in Table 1. We set $\alpha = 1$ and the other two reward weights to 0.5 each. We tried different settings for these hyperparameters during training and picked the values that gave the best Quality and Delay on the training data.
Results and Analysis: Figure 4 shows that as the sentence length increases, prediction helps translation quality because of complex reordering and multi-clausal sentences; for shorter samples, where the sentence structure is simpler, the prediction action cannot improve translation quality. AP does not account for less delay as longer sentences are produced, and a better measure than AP might be needed to emphasize delay differences. We therefore also report the average segment length ($\mu$), which is computed as the average number of consecutive READs in each sentence. In both the EN→DE and DE→EN experiments, our model consistently decreases the segment length by around 1 word, which results in less latency.
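The average segment length can be computed directly from an action sequence. The sketch below assumes that a segment is a maximal run of READ actions uninterrupted by any other action, and that a sequence with no READ has length 0; the function name is illustrative.

```python
from itertools import groupby
from typing import Sequence


def average_segment_length(actions: Sequence[str]) -> float:
    """Mean length of the maximal runs of consecutive READ actions."""
    runs = [
        sum(1 for _ in group)
        for action, group in groupby(actions)
        if action == "READ"
    ]
    if not runs:
        return 0.0  # assumption: no READ at all counts as zero segment length
    return sum(runs) / len(runs)


example = ["READ", "READ", "WRITE", "READ", "PREDICT", "WRITE",
           "READ", "READ", "READ", "WRITE"]
mu = average_segment_length(example)  # READ runs of length 2, 1, and 3
```

A smaller value means the system commits output after reading fewer words at a time, which is why a drop of about one word corresponds to lower latency.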
In order to evaluate the effectiveness of our proposed reward function for the PREDICT action, we explored various values for its hyperparameters ($\alpha$ and the other two reward weights). Our empirical results show that the best trade-off between quality and delay is achieved when around 20 percent of the actions are PREDICT, for both the EN→DE and DE→EN translation tasks (Figure 3). When there is no reward for the PREDICT action (its weight is set to 0), the AGENT prefers other actions, and the number of PREDICT actions falls to zero immediately after the AGENT is trained. If the reward for prediction is weighted too highly, the ENVIRONMENT depends more on predicted words and the translation quality decreases.

Related work
Early work in SNMT was done in speech, where the incoming signals were segmented based on acoustic or statistical cues (Bangalore et al., 2012; Fügen et al., 2007). (Yarmohammadi et al., 2013; Siahbani et al., 2014) use a separate segmentation step and incrementally translate each segment using a standard phrase-based MT system. (Matsubara et al., 2000) applied pattern matching to predict target-side verbs in Japanese-to-English translation. (Grissom II et al., 2014) used reinforcement learning to predict the next word and the sentence-final verb in a statistical MT model. These models reduce the delay but are not trained end-to-end like our agent-based SNMT system. (Cho and Esipova, 2016) proposed a non-trainable heuristic agent which is not able to trade off quality against delay: it always prefers to read more words from the input, an approach that does not work well in practice. (Satija and Pineau, 2016) introduced a trainable agent which they trained using Deep Q-networks (Mnih et al., 2015). We modified the trainable SNMT agent in (Gu et al., 2017) and added a new non-trivial PREDICT action to it. We compare against their model and show better results in delay and quality.

Conclusion
We introduce a new prediction action in a trainable agent for simultaneous neural machine translation. With prediction, the agent can be informed about future time steps in the input stream. Compared to a very strong baseline, our results show that prediction can lower delay and improve translation quality, especially for longer sentences and when translating from an SOV (subject-object-verb) language (DE) to an SVO (subject-verb-object) language (EN).