Simultaneous Translation with Flexible Policy via Restricted Imitation Learning

Simultaneous translation is widely useful but remains one of the most difficult tasks in NLP. Previous work either uses fixed-latency policies or trains a complicated two-stage model using reinforcement learning. We propose a much simpler single model that adds a “delay” token to the target vocabulary, and design a restricted dynamic oracle to greatly simplify training. Experiments on Chinese↔English simultaneous translation show that our work leads to flexible policies that achieve better BLEU scores and lower latencies compared to both fixed and RL-learned policies.


Introduction
Simultaneous translation, which translates sentences before they are finished, is useful in many scenarios such as international conferences, summits, and negotiations. However, it is widely considered one of the most challenging tasks in NLP, and one of the holy grails of AI (Grissom II et al., 2014). A major challenge in simultaneous translation is the word order difference between the source and target languages, e.g., between SOV languages (German, Japanese, etc.) and SVO languages (English, Chinese, etc.).
Simultaneous translation was previously studied as a part of real-time speech recognition systems (Yarmohammadi et al., 2013; Bangalore et al., 2012; Fügen et al., 2007; Sridhar et al., 2013; Jaitly et al., 2016; Graves et al., 2013). Recently, there have been two encouraging efforts on this problem with promising but limited success. Gu et al. (2017) propose a complicated two-stage model that is also trained in two stages. The base model, responsible for producing target words, is a conventional full-sentence seq2seq model, and on top of that, the READ/WRITE (R/W) model decides, at every step, whether to wait for another source word (READ) or to emit a target word (WRITE) using the pretrained base model. This R/W model is trained by a reinforcement learning (RL) method without updating the base model.
Ma et al. (2018), on the other hand, propose a much simpler architecture, which needs only one model and can be trained with an end-to-end local training method. However, their model follows a fixed-latency policy, which inevitably needs to guess future content during translation. Table 1 gives an example that is difficult for the fixed-latency (wait-k) policy but easy for an adaptive policy: the wait-1 policy makes a mistake by guessing "thanks from", while the wait-5 policy has high latency. The adaptive policy can wait for more information to avoid guesses while maintaining low latency.
We aim to combine the merits of both efforts: we design a single model, trained end-to-end from scratch to perform simultaneous translation as in Ma et al. (2018), which can decide on the fly whether to wait or translate as in Gu et al. (2017). There are two key ideas to achieve this. The first is to add a "delay" token (similar to the READ action in Gu et al. (2017), the empty token in Press and Smith (2018), and the 'blank' unit in Connectionist Temporal Classification (CTC) (Graves et al., 2006)) to the target-side vocabulary; if the model emits this delay token, it reads one source word. The second is to train the model using (restricted) imitation learning by designing a (restricted) dynamic oracle as the expert policy. Table 2 summarizes different approaches for simultaneous translation using neural machine translation (NMT) models.

                   seq-to-seq                                 prefix-to-prefix
fixed policy       static Read-Write (Dalvi et al., 2018);    wait-k (Ma et al., 2018)
                   test-time wait-k (Ma et al., 2018)
adaptive policy    RL (Gu et al., 2017)                       imitation learning (this work)

Table 2: Different approaches for simultaneous translation.

Preliminaries
Let x = (x_1, ..., x_n) be a sequence of words. For an integer 1 ≤ i ≤ n + 1, we denote the sequence consisting of the first i − 1 words of x by x_{<i} = (x_1, ..., x_{i−1}). We say such a sequence x_{<i} is a prefix of the sequence x, and write s ⪯ x if the sequence s is a prefix of x.
Conventional Machine Translation. Given a sequence x from the source language, the conventional machine translation model predicts the probability distribution of the next target word y_j at the j-th step, conditioned on the full source sequence x and the previously generated target words y_{<j}, that is, p(y_j | x, y_{<j}). The probability of the whole sequence y generated by the model is
\[ p(y \mid x) = \prod_{j=1}^{|y|} p(y_j \mid x, y_{<j}). \]
To train such a model, we can maximize the probability of the ground-truth target sequence conditioned on the corresponding source sequence in a parallel dataset D, which is equivalent to minimizing the following loss:
\[ \ell(D) = -\sum_{(x, y) \in D} \log p(y \mid x). \tag{1} \]
In this work, we use the Transformer (Vaswani et al., 2017) as our NMT model, which consists of an encoder and a decoder. The encoder works in a self-attention fashion and maps a sequence of words to a sequence of continuous representations. The decoder performs attention over the predicted words and the output of the encoder to generate the next prediction. Both the encoder and the decoder take as input the sum of a word embedding and its corresponding positional embedding.
Prefix-to-Prefix Framework. Previous work (Gu et al., 2017; Dalvi et al., 2018) uses seq2seq models for simultaneous translation, which are trained with full sentence pairs but need to predict target words based on partial source sentences. Ma et al. (2018) proposed a prefix-to-prefix training framework to solve this mismatch. The key idea of this framework is to train the model to predict the next target word conditioned on the partial source sequence the model has seen, instead of the full source sequence.
As a simple example in this framework, Ma et al. (2018) presented a class of policies, called wait-k policies, that can be applied with local training in the prefix-to-prefix framework. For a positive integer k, the wait-k policy waits for the first k source words and then alternates between generating a target word and receiving a new source word, until there are no more source words, at which point the problem becomes the same as full-sequence translation. The probability of the j-th word is p_k(y_j | x_{<j+k}, y_{<j}), and the probability of the whole predicted sequence is
\[ p_k(y \mid x) = \prod_{j=1}^{|y|} p_k(y_j \mid x_{<j+k}, y_{<j}). \]
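The wait-k schedule can be made concrete with a short sketch. The Python below is illustrative only: the function name `wait_k_actions` and the "R"/"W" action labels are assumptions introduced here, not notation from Ma et al. (2018). It lists the READ/WRITE actions so that the j-th target word is written once min(j + k − 1, |x|) source words have been read.

```python
from typing import List

READ, WRITE = "R", "W"  # hypothetical action labels for this sketch


def wait_k_actions(k: int, src_len: int, tgt_len: int) -> List[str]:
    """Return the READ/WRITE schedule followed by a wait-k policy."""
    actions: List[str] = []
    read = written = 0
    while written < tgt_len:
        # Source words that must be visible before writing target word
        # number `written + 1`; capped at the source length.
        needed = min(k + written, src_len)
        while read < needed:
            actions.append(READ)
            read += 1
        actions.append(WRITE)
        written += 1
    return actions


if __name__ == "__main__":
    # wait-2 on a 4-word source and a 4-word target
    print("".join(wait_k_actions(2, 4, 4)))
```

Once the source is exhausted, the schedule contains only WRITE actions, which corresponds to the point where decoding reduces to full-sequence translation.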

Model
To obtain a flexible and adaptive policy, we need our model to be able to take both READ and WRITE actions. Conventional translation models already have the ability to write target words, so we introduce a "delay" token ε in the target vocabulary to enable our model to apply the READ action. Formally, for the target vocabulary V, we define an extended vocabulary
\[ V^{+} = V \cup \{\varepsilon\}. \tag{2} \]
Each word in this set can be an action, which is applied with a transition function δ on a sequence pair (s, t) for a given source sequence x where s ⪯ x. We assume ε cannot be applied to the sequence pair (s, t) if s = x. The transition function δ is then
\[ \delta\big((s, t), a\big) = \begin{cases} (s \bullet x_{|s|+1},\; t) & \text{if } a = \varepsilon, \\ (s,\; t \bullet a) & \text{otherwise,} \end{cases} \]
where s • x represents concatenating a sequence s and a word x.
Based on this transition function, our model can do simultaneous translation as follows. Given the currently available source sequence, our model keeps predicting target words until it predicts a delay token. Then it reads a new source word and continues predicting. Since we use a Transformer model, the whole available source sequence needs to be encoded again when a new source word is read, but the predicted target sequence is not changed.
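The decoding procedure described above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the token spellings `<eps>` and `</s>`, the `predict` callback signature, and the toy copy model are all hypothetical. The callback stands in for the Transformer, which would re-encode the source prefix each time it grows.

```python
from typing import Callable, List, Sequence

DELAY = "<eps>"  # hypothetical spelling of the delay token
EOS = "</s>"     # hypothetical end-of-sentence token

Predictor = Callable[[Sequence[str], Sequence[str], bool], str]


def simultaneous_decode(source: Sequence[str], predict: Predictor,
                        max_len: int = 100) -> List[str]:
    """Interleave READ and WRITE steps, driven by the predicted delay token."""
    seen: List[str] = []    # source prefix s read so far
    target: List[str] = []  # target prefix t; emitted words are never revised
    while len(target) < max_len:
        can_read = len(seen) < len(source)
        token = predict(seen, target, can_read)
        if token == DELAY:
            if not can_read:
                # The delay token is not applicable once s = x.
                raise ValueError("delay token predicted after the source ended")
            seen.append(source[len(seen)])  # READ one more source word
        elif token == EOS:
            break
        else:
            target.append(token)            # WRITE one target word
    return target


def toy_predict(seen: Sequence[str], target: Sequence[str], can_read: bool) -> str:
    """Toy stand-in for the model: 'translate' by upper-casing, lagging by one word."""
    if len(target) < len(seen):
        return seen[len(target)].upper()
    return DELAY if can_read else EOS


if __name__ == "__main__":
    print(simultaneous_decode(["we", "thank", "you"], toy_predict))
```

Passing `can_read` to the predictor is one way to honour the assumption that ε cannot be applied once the full source has been read; a real model could instead mask the delay token's score.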
Note that the predicted delay tokens do not provide any semantic information, but may introduce some noise in the attention layers during the translation process. We therefore propose to remove those delay tokens in the attention layers, except for the one at the current input position. However, this removal may reduce the explicit latency information, which will affect the predictions of the model since the model cannot observe previously output delay tokens. Therefore, to provide this information explicitly, we embed the number of previous delay tokens as a vector and add this to the sum of the word embedding and position embedding as the input of the decoder.
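The bookkeeping behind this idea can be illustrated in a few lines. The sketch below is an assumption about how the counts might be aligned (each kept token is paired with the number of delay tokens that precede it); the function name `strip_delays` and the `<eps>` spelling are hypothetical. In a full model, each count would index an embedding table whose vector is added to the word and position embeddings.

```python
from typing import List, Tuple

DELAY = "<eps>"  # hypothetical spelling of the delay token


def strip_delays(prefix: List[str]) -> Tuple[List[str], List[int]]:
    """Drop delay tokens from a decoder prefix, keeping a running delay count.

    Returns the kept tokens and, for each kept token, the number of delay
    tokens that preceded it. The last token is always kept, even when it is
    a delay token, because it is the current decoder input.
    """
    kept: List[str] = []
    counts: List[int] = []
    delays = 0
    last = len(prefix) - 1
    for i, tok in enumerate(prefix):
        if tok != DELAY or i == last:
            kept.append(tok)
            counts.append(delays)
        if tok == DELAY:
            delays += 1
    return kept, counts


if __name__ == "__main__":
    prefix = [DELAY, DELAY, "we", DELAY, "thank", DELAY]
    print(strip_delays(prefix))
```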

Training via Restricted Imitation Learning
We first introduce a restricted dynamic oracle (Cross and Huang, 2016) based on our extended vocabulary. Then we show how to use this dynamic oracle to train a simultaneous translation model via imitation learning. Note that we do not need to train this oracle.
Restricted Dynamic Oracle. Given a pair of full sequences (x, y) in the data, the input state of our restricted dynamic oracle is a pair of prefixes (s, t) where s ⪯ x, t ⪯ y and (s, t) ≠ (x, y). The whole action set is V^+, defined in the last section. The objective of our dynamic oracle is to obtain the full sequence pair (x, y) while maintaining a reasonably low latency.
For a prefix pair (s, t), the difference between the lengths of the two prefixes can be used to measure the latency of translation, so we would like to bound this difference as a latency constraint. This idea can be illustrated in the prefix grid (see Figure 1), where we can define a band region and always keep the translation process in this band. For simplicity, we first assume the two full sequences have the same length, i.e. |x| = |y|. Then we can bound the difference d = |s| − |t| by two constants: α < d < β. The conservative bound (β) guarantees a relatively small difference and low latency, while the aggressive bound (α) guarantees that not too many target words are predicted before enough source words have been seen. Formally, this dynamic oracle is defined as follows.
\[ \pi_{x,y,\alpha,\beta}(s, t) = \begin{cases} \{\varepsilon\} & \text{if } s \neq x \text{ and } |s| - |t| \leq \alpha, \\ \{y_{|t|+1}\} & \text{if } t \neq y \text{ and } |s| - |t| \geq \beta, \\ \{\varepsilon,\; y_{|t|+1}\} & \text{otherwise.} \end{cases} \]
By this definition, this oracle can always find an action sequence that obtains (x, y). When the input state violates one of the latency constraints, the dynamic oracle provides only one action, and applying it moves the length difference back toward the band. Note that this dynamic oracle is restricted in the sense that it is only defined on prefix pairs rather than on arbitrary sequence pairs. Since we only want to obtain the exact sequence from the data, the only action this oracle can choose apart from ε is the next ground-truth target word.
In many cases, the assumption |x| = |y| does not hold. To overcome this limitation, we can utilize the length ratio γ = |x|/|y| to modify the length difference: d = |s| − γ|t|, and use this new difference d in our dynamic oracle. Although we cannot obtain this ratio at test time, we may use the averaged length ratio obtained from training data (Huang et al., 2017).
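The band-based oracle described in this section can be sketched as a small function. This is a minimal sketch under stated assumptions rather than the exact oracle: the function name, the `<eps>` spelling, and in particular the handling of the sequence ends (only the applicable action is returned once the source or the target is exhausted) are choices made here for illustration. It assumes α < β and uses d = |s| − γ|t| with γ = 1 by default.

```python
from typing import Sequence, Set

DELAY = "<eps>"  # hypothetical spelling of the delay token


def oracle_actions(s_len: int, t_len: int, x_len: int, y: Sequence[str],
                   alpha: float, beta: float, gamma: float = 1.0) -> Set[str]:
    """Return the set of oracle actions for the prefix pair (x[:s_len], y[:t_len])."""
    can_read = s_len < x_len      # the delay token is applicable only if s != x
    can_write = t_len < len(y)    # a target word is available only if t != y
    if not can_read and not can_write:
        raise ValueError("the oracle is undefined on the final state (x, y)")
    d = s_len - gamma * t_len     # length difference used as the latency measure
    if can_read and (d <= alpha or not can_write):
        return {DELAY}            # too few source words seen: must read
    if can_write and (d >= beta or not can_read):
        return {y[t_len]}         # lagging too far behind: must write
    return {DELAY, y[t_len]}      # inside the band: both actions are allowed


if __name__ == "__main__":
    y = ["a", "b", "c", "d"]
    for s_len in range(4):
        print(s_len, sorted(oracle_actions(s_len, 0, 4, y, alpha=1, beta=3)))
```

Steps where the function returns two actions are the ones that make training a multi-label problem.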
Training with Restricted Dynamic Oracle. We apply imitation learning to train our translation model, using the proposed dynamic oracle as the expert policy. Recall that the prediction of our model depends on the whole generated prefix including ε (as the input contains the embedding of the number of ε tokens), which is also an action sequence. If an action sequence $\mathbf{a}$ is obtained from our oracle, then applying this sequence results in a prefix pair, say $s_{\mathbf{a}}$ and $t_{\mathbf{a}}$, of x and y. Let $p(a \mid s_{\mathbf{a}}, t_{\mathbf{a}})$ be the probability of choosing action a given the prefix pair obtained by applying the action sequence $\mathbf{a}$. Then the averaged probability of choosing the oracle actions conditioned on the action sequence $\mathbf{a}$ is
\[ f(\mathbf{a}, \pi_{x,y,\alpha,\beta}) = \frac{\sum_{a \in \pi_{x,y,\alpha,\beta}(s_{\mathbf{a}}, t_{\mathbf{a}})} p(a \mid s_{\mathbf{a}}, t_{\mathbf{a}})}{|\pi_{x,y,\alpha,\beta}(s_{\mathbf{a}}, t_{\mathbf{a}})|}. \]
To train a model to learn from the dynamic oracle, we can sample from our oracle to obtain a set, say S(x, y), of action sequences for a sentence pair (x, y). The loss function for each sampled sequence $\mathbf{a} \in S(x, y)$ is
\[ \ell(\mathbf{a} \mid x, y) = -\sum_{i=1}^{|\mathbf{a}|} \log f(\mathbf{a}_{<i}, \pi_{x,y,\alpha,\beta}). \]
For a parallel text D, the training loss is
\[ \ell_{S}(D) = \sum_{(x, y) \in D} \frac{1}{|S(x, y)|} \sum_{\mathbf{a} \in S(x, y)} \ell(\mathbf{a} \mid x, y). \]
Directly optimizing the above loss may require too many computational resources, since for each pair (x, y) the size of S(x, y) (i.e. the number of different action sequences) can be exponentially large. To reduce the computation cost, we propose to use two special action sequences as our sample set so that our model can learn to translate within the two latency constraints. Recall that the latency constraints of our dynamic oracle $\pi_{x,y,\alpha,\beta}$ are defined by two bounds: α and β. For each bound, there is a unique action sequence, corresponding to a path in the prefix grid, that visits the largest number of prefix pairs for which this bound is tight. Let $\mathbf{a}^{\alpha}_{(x,y)}$ ($\mathbf{a}^{\beta}_{(x,y)}$) be such an action sequence for (x, y) and α (β). We replace S(x, y) with $\{\mathbf{a}^{\alpha}_{(x,y)}, \mathbf{a}^{\beta}_{(x,y)}\}$, and the above loss for dataset D becomes
\[ \ell_{\alpha,\beta}(D) = \sum_{(x, y) \in D} \frac{\ell(\mathbf{a}^{\alpha}_{(x,y)} \mid x, y) + \ell(\mathbf{a}^{\beta}_{(x,y)} \mid x, y)}{2}. \]
This is the loss we use in our training process.
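One plausible way to obtain the two special action sequences is to follow the band greedily: always write when writing is allowed (hugging the α bound) or always read when reading is allowed (hugging the β bound). The sketch below illustrates this under stated assumptions: the greedy construction, the function name `boundary_path`, the `<eps>` spelling, and the end-of-sequence handling are all choices made here, and α < β is assumed.

```python
from typing import List, Sequence

DELAY = "<eps>"  # hypothetical spelling of the delay token


def boundary_path(x_len: int, y: Sequence[str], alpha: float, beta: float,
                  prefer_delay: bool, gamma: float = 1.0) -> List[str]:
    """Follow the band greedily from (empty, empty) to (x, y).

    prefer_delay=False writes whenever allowed (path along the alpha bound);
    prefer_delay=True reads whenever allowed (path along the beta bound).
    """
    s = t = 0
    actions: List[str] = []
    while s < x_len or t < len(y):
        d = s - gamma * t
        # Forced actions at the band edges or when one sequence is exhausted.
        must_read = s < x_len and (d <= alpha or t == len(y))
        must_write = t < len(y) and (d >= beta or s == x_len)
        if must_read or (prefer_delay and not must_write):
            actions.append(DELAY)   # READ: extend the source prefix
            s += 1
        else:
            actions.append(y[t])    # WRITE: emit the next ground-truth word
            t += 1
    return actions


if __name__ == "__main__":
    y = ["a", "b", "c", "d"]
    print(boundary_path(4, y, alpha=1, beta=3, prefer_delay=False))
    print(boundary_path(4, y, alpha=1, beta=3, prefer_delay=True))
```

Both paths contain exactly |x| delay tokens and all |y| target words in order; they differ only in where the reads are placed.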
Note that there are some steps where our oracle returns two actions, so for such steps we have a multi-label classification problem where the labels are the actions from our oracle. In such cases, a Sigmoid function for each action is more appropriate than the Softmax function, because the actions then do not compete with each other (Ma et al., 2017; Zheng et al., 2018; Ma et al., 2019). Therefore, we apply Sigmoid to each action instead of using the Softmax function to generate a distribution over all actions.
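To illustrate why independent Sigmoid scores suit steps with two oracle actions, the following sketch computes a per-step loss as the negative log of the mean Sigmoid probability over the oracle's actions. It is a minimal sketch assuming raw per-action logits are available in a dictionary; the function names and the `<eps>` spelling are hypothetical, and a softmax is included only for contrast.

```python
import math
from typing import Dict, Iterable


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over all actions, shown only for comparison."""
    m = max(logits.values())
    exps = {a: math.exp(z - m) for a, z in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def oracle_step_loss(logits: Dict[str, float], oracle: Iterable[str]) -> float:
    """Negative log of the mean sigmoid probability of the oracle's actions."""
    actions = list(oracle)
    mean_prob = sum(sigmoid(logits[a]) for a in actions) / len(actions)
    return -math.log(mean_prob)


if __name__ == "__main__":
    logits = {"<eps>": 4.0, "thank": 4.0, "you": -4.0}
    # Both oracle actions can receive a high probability at the same time.
    print(oracle_step_loss(logits, ["<eps>", "thank"]))
    # Under softmax the two correct actions must share the probability mass.
    print(softmax(logits))
```

With equal logits for the two oracle actions, softmax can give each of them at most 0.5, whereas the Sigmoid scores can both approach 1.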

Decoding
We observed that the model trained on the two special action sequences occasionally violates the latency constraints and visits states outside of the designated band in the prefix grid. To avoid such cases, we force the model to choose actions such that it always satisfies the latency constraints. That is, if the length difference reaches the upper (conservative) bound β, the model must choose the highest-scoring target word other than ε, even if ε has a higher score; if the length difference reaches the lower (aggressive) bound α, the model can only choose ε at that step. We also apply a temperature constant e^t to the score of ε, which can implicitly control the latency of our model without retraining it. This improves the flexibility of our trained model so that it can be used in different scenarios with different latency requirements.
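The action choice at decoding time can be sketched as follows. This is a minimal sketch under stated assumptions: scores are taken to be positive per-action values such as Sigmoid probabilities, the temperature is applied by multiplying the score of ε by e^t, and the band constraints are passed in as two hypothetical flags (`force_word`, `force_delay`) that the caller sets from the current length difference.

```python
import math
from typing import Dict

DELAY = "<eps>"  # hypothetical spelling of the delay token


def choose_action(scores: Dict[str, float], t: float = 0.0,
                  force_word: bool = False, force_delay: bool = False) -> str:
    """Pick the next action from per-action scores.

    force_word / force_delay encode the latency constraints; t rescales the
    delay token's score by exp(t) when neither constraint is active.
    """
    if force_word and force_delay:
        raise ValueError("only one latency constraint can be active at a time")
    words = {a: s for a, s in scores.items() if a != DELAY}
    best_word = max(words, key=words.get)
    if force_word:
        return best_word   # best target word, even if the delay token scores higher
    if force_delay:
        return DELAY       # only the delay token is allowed at this step
    delay_score = scores[DELAY] * math.exp(t)
    return DELAY if delay_score > words[best_word] else best_word


if __name__ == "__main__":
    scores = {DELAY: 0.6, "we": 0.5, "thanks": 0.1}
    print(choose_action(scores))           # t = 0: raw scores decide
    print(choose_action(scores, t=-1.0))   # negative t discourages waiting
```

Negative values of t make the policy more aggressive and positive values make it more conservative, without any retraining.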

Experiments
To investigate the empirical performance of our proposed method, we conduct experiments on the NIST corpus for Chinese-English. We use NIST 06 (616 sentence pairs) as our development set and NIST 08 (691 sentence pairs) as our testing set. We apply tokenization and byte-pair encoding (BPE) (Sennrich et al., 2015) on both source and target languages to reduce their vocabularies. For training data, we only include 1 million sentence pairs with length larger than 50. We use the Transformer (Vaswani et al., 2017) as our NMT model, and our implementation is adapted from PyTorch-based OpenNMT (Klein et al., 2017). The architecture of our Transformer model is the same as the base model in the original paper.
We use BLEU (Papineni et al., 2002) as the translation quality metric and Average Lagging (AL), introduced by Ma et al. (2018), as our latency metric, which measures the average number of words by which the translation lags behind the source. AL avoids some limitations of other existing metrics, such as the insensitivity to actual lagging of Consecutive Wait (CW) (Gu et al., 2017) and the sensitivity to input length of Average Proportion (AP) (Cho and Esipova, 2016).
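For reference, Ma et al. (2018) define Average Lagging as AL = (1/τ) Σ_{t=1}^{τ} [g(t) − (t − 1)/r], where g(t) is the number of source words read before writing the t-th target word, r = |y|/|x|, and τ is the first target step at which the whole source has been read. The sketch below computes it from a list of g(t) values; the function name and the fallback used when the source is never fully read are assumptions made here.

```python
from typing import Sequence


def average_lagging(g: Sequence[int], src_len: int) -> float:
    """Average Lagging for a read schedule g, where g[t-1] is g(t)."""
    tgt_len = len(g)
    if tgt_len == 0 or src_len == 0:
        raise ValueError("both sequences must be non-empty")
    r = tgt_len / src_len  # target-to-source length ratio
    # tau: first target step whose prediction already sees the full source.
    # Falling back to the target length is an assumption of this sketch.
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len), tgt_len)
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau


if __name__ == "__main__":
    n = 10
    wait3 = [min(3 + t - 1, n) for t in range(1, n + 1)]
    print(average_lagging(wait3, n))     # wait-3 on equal-length sentences
    print(average_lagging([n] * n, n))   # full-sentence translation
```

On equal-length sentence pairs a wait-k schedule has an AL of k, and full-sentence translation has an AL equal to the source length.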

Results
We tried three different pairs for α and β: (1, 5), (3, 5) and (3, 7), and summarize the results on testing sets in Figure 2. Figure 2 (a) shows the results on Chinese-to-English translation. In this direction, our model can always achieve higher BLEU scores with the same latency, compared with the wait-k models and RL models. We notice that the model prefers a conservative policy at decoding time when t = 0, so we apply negative values of t to encourage the model to choose actions other than ε. This can effectively reduce latency without sacrificing much translation quality, implying that our model can implicitly control latency at test time.
Figure 2 (b) shows our results on English-to-Chinese translation. Since the English source sentences are always longer than the Chinese sentences, we utilize the length ratio γ = 1.25 (derived from the dev set) during training, which is the same as using "catchup" with frequency c = 0.25 introduced by Ma et al. (2018). Unlike in the other direction, models for this direction work better if the difference between α and β is bigger. Another difference is that our model prefers an aggressive policy instead of a conservative policy when t = 0. Thus, we apply positive values of t to encourage it to choose ε, obtaining more conservative policies to improve translation quality.

Example
We provide an example from the development set of Chinese-to-English translation in Table 3 to compare the behaviours of different models. Our model is trained with α = 3, β = 7 and tested with t = 0. It shows that our model can wait for the information "Ōuméng" before translating "eu", while the wait-3 model is forced to guess this information and makes a mistake with the wrong guess "us" before seeing "Ōuméng".

Ablation Study
To analyze the effects of the proposed techniques on performance, we also provide an ablation study on those techniques for our model trained with α = 3 and β = 5 in Chinese-to-English translation. The results are given in Table 4, and show that all the techniques are important to the final performance and that using the Sigmoid function is critical for learning an adaptive policy.
Table 4: Ablation study on Chinese-to-English development set with α = 3 and β = 5.

Conclusions
We have presented a simple model that includes a delay token in the target vocabulary such that the model can apply both READ and WRITE actions during the translation process without an explicit policy model. We also designed a restricted dynamic oracle for the simultaneous translation problem and provided a local training method utilizing this dynamic oracle. The model trained with this method can learn a flexible policy for simultaneous translation and achieve better translation quality and lower latency compared to previous methods.

Figure 1: Illustration of our proposed dynamic oracle on a prefix grid. The blue right arrow represents choosing the next ground-truth target word, and the red downward arrow represents choosing the delay token. The left figure shows a simple dynamic oracle without delay constraints. The right figure shows the dynamic oracle with delay constraints.

Table 1: A Chinese-to-English translation example.

Table 3: A Chinese-to-English development set example. Our model is trained with α = 3 and β = 7.