Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation

Simultaneous translation is widely useful but remains challenging. Previous work falls into two main categories: (a) fixed-latency policies such as Ma et al. (2019) and (b) adaptive policies such as Gu et al. (2017). The former are simple and effective, but have to aggressively predict future content due to diverging source-target word order; the latter do not anticipate, but suffer from unstable and inefficient training. To combine the merits of both approaches, we propose a simple supervised-learning framework to learn an adaptive policy from oracle READ/WRITE sequences generated from parallel text. At each step, such an oracle sequence chooses to WRITE the next target word if the available source context provides enough information to do so, and otherwise to READ the next source word. Experiments on German↔English show that our method, without retraining the underlying NMT model, can learn flexible policies with better BLEU scores and similar latencies compared to previous work.


Introduction
Simultaneous translation outputs target words while the source sentence is still being received, and is widely useful in international conferences, negotiations and press releases. However, despite significant recent progress in machine translation (MT), simultaneous machine translation remains one of the most challenging tasks, because it is hard to balance translation quality and latency, especially for syntactically divergent language pairs such as English and Japanese.
Researchers have previously studied simultaneous translation as part of real-time speech-to-speech translation systems (Yarmohammadi et al., 2013; Bangalore et al., 2012; Fügen et al., 2007; Sridhar et al., 2013; Jaitly et al., 2016; Graves et al., 2013). Recent simultaneous translation research focuses on obtaining a strategy, called a policy, to decide whether to wait for another source word (READ action) or emit a target word (WRITE action). The obtained policies fall into two main categories: (1) fixed-latency policies (Ma et al., 2019; Dalvi et al., 2018) and (2) context-dependent adaptive policies (Grissom II et al., 2014; Cho and Esipova, 2016; Gu et al., 2017; Alinejad et al., 2018; Arivazhagan et al., 2019; Zheng et al., 2019a). As an example of fixed-latency policies, wait-k (Ma et al., 2019) starts by waiting for the first k source words and then outputs one target word after receiving each new source word until the source sentence ends. It is easy to see that this kind of policy inevitably needs to guess future content, and such guesses can often be incorrect. Thus, an adaptive policy (see Table 1 for an example), which decides on the fly whether to take a READ or a WRITE action, is more desirable for simultaneous translation. Moreover, the widely used beam search technique becomes non-trivial for fixed policies (Zheng et al., 2019b).
Previous work represents an adaptive policy in three different ways: (1) a rule-based decoding algorithm (Cho and Esipova, 2016), (2) the original MT model with an extended vocabulary (Zheng et al., 2019a), and (3) a separate policy model (Grissom II et al., 2014; Gu et al., 2017; Alinejad et al., 2018; Arivazhagan et al., 2019). The decoding algorithm (Cho and Esipova, 2016) applies heuristic measures and does not exploit information in the hidden representations, while the MT model with an extended vocabulary (Zheng et al., 2019a) needs guidance from a restricted dynamic oracle, whose size is exponentially large, so approximation is needed to learn an adaptive policy. A separate policy model can avoid these issues. However, previous policy-learning methods either depend on reinforcement learning (RL) (Grissom II et al., 2014; Gu et al., 2017; Alinejad et al., 2018), which makes the training process unstable and inefficient due to exploration, or apply advanced attention mechanisms (Arivazhagan et al., 2019), which require the training process to be autoregressive and hence inefficient. Furthermore, each such learned policy cannot change its behavior according to different latency requirements at test time, so multiple policy models must be trained for scenarios with different latency requirements.
To combine the merits of fixed and adaptive policies, and to resolve the drawbacks mentioned above, we propose a simple supervised-learning framework to learn an adaptive policy, and show how to apply it with controllable latency. This framework is based on sequences of READ/WRITE actions for parallel sentence pairs, so we present a simple method to generate such an action sequence for each sentence pair with a pre-trained neural machine translation (NMT) model. Our experiments on a German↔English dataset show that our method, without retraining the underlying NMT model, leads to better policies than previous methods, and achieves better BLEU scores than (the retrained) wait-k models in low-latency scenarios.

Generating Action Sequences
In this section, we show how to generate action sequences for parallel text. Our simultaneous translation policy can take two actions: READ (receive a new source word) and WRITE (output a new target word). A sequence of such actions for a sentence pair (x, y) defines one way to translate x into y, so such a sequence must contain exactly |y| WRITE actions. However, not every action sequence is good for simultaneous translation. For instance, a sequence without any READ action provides no source information at all, while a sequence that puts all |x| READ actions before all WRITE actions usually has large latency. Thus, the ideal sequences for simultaneous translation should have the following two properties:
• no anticipation during translation: when choosing a WRITE action, there is enough source information for the MT model to generate the correct target word;
• latency as low as possible: the WRITE action for each target word appears as early as possible.
Table 1 gives an example of such a sequence.
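These constraints can be made concrete with a small validity check (an illustrative sketch; the function name and action strings are our own, not from the paper's code):

```python
def is_valid_action_sequence(actions, x_len, y_len):
    """An action sequence for a pair (x, y) must contain exactly y_len
    WRITE actions, at most x_len READ actions, and every WRITE must be
    preceded by at least one READ (a policy should not emit a target
    word with no source context at all)."""
    reads, writes = 0, 0
    for a in actions:
        if a == "READ":
            reads += 1
        else:  # WRITE
            if reads == 0:
                return False  # writing with zero source information
            writes += 1
    return writes == y_len and reads <= x_len

# e.g. READ READ WRITE READ WRITE WRITE for |x| = 3, |y| = 3
print(is_valid_action_sequence(
    ["READ", "READ", "WRITE", "READ", "WRITE", "WRITE"], 3, 3))  # → True
```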

Algorithm 1 Generating Action Sequence
In the following, we present a simple method to generate such an action sequence for a sentence pair (x, y) using a pre-trained NMT model, assuming this model can make reasonable predictions given an incomplete source sentence. Our method is based on the following observation: if the rank of the next ground-truth target word is high enough in the model's prediction, then the available source-side information suffices for the model to make a correct prediction. Specifically, we sequentially feed the source words to the pre-trained model and use it to predict the next target word. If the rank of the gold target word is high enough, we append a WRITE action to the sequence and move on to the next target word; otherwise, we append a READ action and feed in a new source word. Let r be a positive integer (the rank threshold), M be a pre-trained NMT model, x≤i be the source prefix consisting of the first i words of x, and rank(y_j | x≤i) be the rank of target word y_j in the prediction of model M given x≤i. The generating process is summarized in Algorithm 1.
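A minimal Python sketch of this generation loop (the `rank` callback stands in for querying the pre-trained NMT model, and the toy rank function below is purely illustrative):

```python
def generate_action_sequence(x, y, rank, r):
    """Sketch of Algorithm 1. rank(j, i) returns the rank (1 = most
    likely) of the gold target word y[j] in the model's prediction
    given the source prefix x[:i] (and, implicitly, the gold target
    prefix y[:j]); r is the rank threshold."""
    actions = []
    i, j = 0, 0  # source words read so far, target words written so far
    while j < len(y):
        if i < len(x) and rank(j, i) > r:
            actions.append("READ")   # gold word ranks too low: read more source
            i += 1
        else:
            actions.append("WRITE")  # enough source context: emit gold word
            j += 1
    return actions

# toy rank function: y[j] becomes predictable once j + 1 source words are read
toy_rank = lambda j, i: 1 if i >= j + 1 else 100
print(generate_action_sequence(list("abc"), list("ABC"), toy_rank, r=3))
# → ['READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE']
```

Note that once the whole source has been read (i = |x|), the loop can only WRITE, so the sequence always contains exactly |y| WRITE actions.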
Although we can balance the two desired properties with an appropriate value of the rank threshold r, the latency of a generated action sequence may still be large due to word-order differences between the two sentences. To avoid this issue, we filter the generated sequences with the latency metric Average Lagging (AL) proposed by Ma et al. (2019), which quantifies latency in terms of the number of source words and avoids some flaws of other metrics such as Average Proportion (AP) (Cho and Esipova, 2016) and Consecutive Wait (CW) (Gu et al., 2017). Another issue we observed is that the pre-trained model may be too aggressive for some sentence pairs, writing all target words without seeing the whole source sentence; this may be because the model was trained on the same dataset. To overcome this, we only keep the action sequences that receive all the source words before the last WRITE action. After this filtering, each action sequence has AL less than a fixed constant ℓ and receives all source words.
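The AL filter can be sketched as follows (a dependency-free paraphrase of the metric in Ma et al. (2019); the representation of a sequence by its per-step source-prefix lengths g is our own choice):

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019). g[t] is the number of source
    words read before emitting target word t+1. AL averages, in source
    words, how far the policy lags behind an ideal fully-synchronous
    translator, up to the first target word emitted after the whole
    source has been read (guaranteed to exist for filtered sequences)."""
    r = tgt_len / src_len  # target-to-source length ratio
    # tau: first step (1-indexed) at which the whole source has been read
    tau = next(t + 1 for t in range(tgt_len) if g[t] == src_len)
    return sum(g[t] - t / r for t in range(tau)) / tau

# a wait-3 policy on a length-10 pair lags by exactly 3 source words
g = [min(3 + t, 10) for t in range(10)]
print(average_lagging(g, 10, 10))  # → 3.0
```

A generated sequence is kept only if its AL is below the chosen constant ℓ.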

Supervised-Learning Framework for Simultaneous Translation Policy
Given a sentence pair and an action sequence for this pair, we can apply a supervised learning method to learn a parameterized policy for simultaneous translation. For the policy to choose the correct action, its input should include information from both the source and target sides. Since we use Transformer (Vaswani et al., 2017) as our underlying NMT model in this work, we need to recompute the encoder hidden states for all previously seen source words at each step, as is also done for wait-k model training (Ma et al., 2019).² The policy input o_i at step i consists of three components from this model:
• h^e_i: the last-layer hidden state of the encoder for the last source word read so far at step i;
• h^d_i: the last-layer hidden state of the decoder for the last target word emitted so far at step i;³
• a_i: the cross-attention scores at step i for the current input target word on all attention layers in the decoder, averaged over all current source words.

That is, o_i = [h^e_i; h^d_i; a_i].

² In our experiments, the decoder and policy model combined need about 0.0445 seconds on average to generate one target word (which may involve multiple READs), while recomputing all encoder states for each new source word takes on average only about 0.0058 seconds, so this re-computation is not a serious efficiency issue.
³ The hidden state of this target word is recomputed at each step.
Let a_i be the i-th action in the given action sequence A. The decision of our policy at the i-th step then depends on all previous inputs o≤i and all previously taken actions a<i, and we want to maximize the probability of the next action given this information:

max_θ Σ_i log π_θ(a_i | o≤i, a<i)

where π_θ is the action distribution of our policy parameterized by θ.
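This objective is standard sequence cross-entropy over the two actions. A dependency-free sketch of the per-sequence loss (in the real model the step-wise distributions come from the GRU policy; here they are plain dictionaries for illustration):

```python
import math

def action_nll(action_probs, actions):
    """Negative log-likelihood of an oracle action sequence under the
    policy. action_probs[i] is the policy's distribution at step i,
    e.g. {"READ": 0.7, "WRITE": 0.3} (conditioned, in the real model,
    on o<=i and a<i). Minimizing this loss maximizes
    sum_i log pi(a_i | o<=i, a<i)."""
    return -sum(math.log(p[a]) for p, a in zip(action_probs, actions))

# a maximally uncertain policy pays log(2) per step
probs = [{"READ": 0.5, "WRITE": 0.5}] * 4
print(action_nll(probs, ["READ", "WRITE", "READ", "WRITE"]))  # → 4 * log(2)
```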

Decoding with Controllable Latency
To apply the learned policy for simultaneous translation, we can choose at each step the action with the higher probability under π_θ. However, different scenarios may have different latency requirements, so this greedy policy may not always be the best choice. Here we present a simple way to implicitly control the latency of the learned policy without retraining the policy model.
Let ρ be a probability threshold. At each step of translation, we choose the READ action only if its probability under the policy is greater than ρ; otherwise we choose the WRITE action. This threshold thus balances the tradeoff between latency and translation quality: with larger ρ, the policy prefers WRITE actions, giving lower latency; with smaller ρ, the policy prefers READ actions, producing more conservative translations with larger latency.

Experiments

Datasets
We use German↔English parallel text from WMT 15 for training, newstest-2013 for validation and newstest-2015 for testing (http://www.statmt.org/wmt15/translation-task.html). All datasets are tokenized and segmented into sub-word units with byte-pair encoding (BPE) (Sennrich et al., 2016), and we only use the sentence pairs with lengths less than 50 (on both sides) for training.

Model Configuration
We use Transformer-base (Vaswani et al., 2017) as our NMT model, and our implementation is based on the PyTorch version of OpenNMT (Klein et al., 2017). We add an <eos> token on the source side, which is not included in the original OpenNMT codebase; without this token, test-time wait-k generates either very short target sentences or many punctuation marks for small k's. Our recurrent policy model consists of one GRU layer with 512 units, one fully-connected layer of dimension 64 followed by a ReLU activation, and one fully-connected layer of dimension 2 followed by a softmax function to produce the action distribution. We use BLEU (Papineni et al., 2002) as the translation quality metric and Average Lagging (AL) (Ma et al., 2019) as the latency metric.

Effects of Generated Action Sequences
We first analyze the effects of the two parameters in the action-sequence generation process: the rank threshold r and the filtering latency ℓ. We fix ℓ = 3 and choose r ∈ {5, 50}; we then fix r = 50 and choose ℓ ∈ {3, 7, ∞} to generate action sequences for the DE→EN direction. Figure 1 shows the performance of the resulting models with different probability thresholds ρ. We find that smaller ℓ helps achieve better performance and that our model is not very sensitive to the value of the rank threshold. Therefore, in the following experiments, we report results with r = 50 and ℓ = 3.

Performance Comparison
We compare our method on the EN↔DE directions with several baselines: the greedy decoding algorithms Wait-If-Worse and Wait-If-Diff (WIW/WID) of Cho and Esipova (2016), the RL method of Gu et al. (2017), and the wait-k models and test-time wait-k methods of Ma et al. (2019). Both WIW and WID use only the pre-trained NMT model: the algorithm initially reads a fixed number of source words, and then chooses READ only if the probability of the most likely target word decreases after reading more source words (WIW), or if the most likely target word itself changes after reading more source words (WID). For the RL method, we use the same kind of input and architecture for the policy model as for ours. Test-time wait-k means decoding with the wait-k policy using the pre-trained NMT model. All methods share the same underlying pre-trained NMT model except for the wait-k method, which retrains the NMT model from scratch with the same architecture as the pre-trained model. Figure 2 shows the performance comparison. Our models in both directions achieve higher BLEU scores at similar latency than WIW, WID, the RL model and the test-time wait-k method, implying that our method learns a better policy model than the other methods when the underlying NMT model is not retrained. Compared with wait-k models, our models with beam search achieve higher BLEU scores when the latency AL is small, which we believe covers the most useful scenarios for simultaneous translation. Furthermore, this figure also shows that our model can achieve good performance under different latency conditions by controlling the threshold ρ, so we do not need to train multiple models for different latency requirements. We also provide a translation example in Table 2 to compare the different methods.
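The two baseline decision rules can be sketched as follows (a paraphrase of Cho and Esipova (2016), not their code; both compare the model's top prediction before and after reading more source words):

```python
def wait_if_worse(top_prob_before, top_prob_after):
    """WIW: keep reading if the probability of the most likely next
    target word drops after reading more source words."""
    return "READ" if top_prob_after < top_prob_before else "WRITE"

def wait_if_diff(top_word_before, top_word_after):
    """WID: keep reading if the identity of the most likely next
    target word changes after reading more source words."""
    return "READ" if top_word_after != top_word_before else "WRITE"

print(wait_if_worse(0.6, 0.4))      # → READ (prediction got worse)
print(wait_if_diff("the", "the"))   # → WRITE (prediction is stable)
```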

Learning Process Analysis
We analyze two aspects of the learning processes of the different methods: stability and training time. Figure 3 shows the learning curves of the RL method and our SL method, averaged over four runs with different random seeds on the DE→EN direction. The training process of our method is more stable and converges faster than that of the RL method. Although at some steps the RL training process achieves better BLEU scores than our SL method, the corresponding latencies are usually very large, which is not appropriate for simultaneous translation. We present the training times of the different methods in Table 3. Our method needs only about 12 hours to train the policy model on 1 GPU, while the wait-k method needs more than 600 hours on 8 GPUs to finish training, showing that our method is very efficient. Note that this table does not include the time needed to generate action sequences. This step is flexible, since we can parallelize it by dividing the training data into separate parts; in our experiments, we need about 2 hours to generate all action sequences in parallel.

Conclusions
We have proposed a simple supervised-learning framework to learn an adaptive policy for simultaneous translation based on generated action sequences, which leads to faster training and better policies than previous methods, without the need to retrain the underlying NMT model.