Simultaneous Translation Policies: From Fixed to Adaptive

Adaptive policies are better than fixed policies for simultaneous translation, since they can flexibly balance the trade-off between translation quality and latency based on the current context. However, previous methods for obtaining adaptive policies either rely on complicated training processes or underperform simple fixed policies. We design an algorithm that achieves an adaptive policy via a simple heuristic composition of a set of fixed policies. Experiments on Chinese→English and German→English translation show that our adaptive policies can outperform fixed ones by up to 4 BLEU points at the same latency and, more surprisingly, even surpass the BLEU score of full-sentence translation in greedy mode (coming very close to beam mode), but with much lower latency.


Introduction
Simultaneous translation (ST) aims to provide good translation quality while keeping the latency of the translation process as low as possible. This is very important for scenarios that require simultaneity, such as international summits and negotiations. For this reason, human interpreters usually start translating before the source sentence ends. However, this makes the translation process much more challenging than full-sentence translation, because, to balance translation quality and latency, interpreters need to decide when to continue translating and when to stop temporarily to wait for more source-side information. These decisions are difficult, especially for syntactically divergent language pairs, such as German and English.
The above decisions can be considered as two actions: READ (wait for a new source word) and WRITE (emit a translated target word) (Gu et al., 2017). Then we only need to decide which action to choose at each step, and the solution can be represented by a policy. Earlier works (Bangalore et al., 2012; Fügen et al., 2007; Sridhar et al., 2013; Yarmohammadi et al., 2013; Jaitly et al., 2016) study policies as part of speech-to-speech ST systems, where the policies usually try to separate the source sentence into several chunks that can be translated safely. Recent works focus on obtaining policies for text-to-text ST, which can be generally divided into two categories: fixed and adaptive. Fixed policies (Ma et al., 2019; Dalvi et al., 2018) usually follow simple rules to choose actions. For example, the wait-k policy of Ma et al. (2019) first chooses k READ actions, and then alternates between WRITE and READ. This kind of policy does not utilize context information and can be either too aggressive or too conservative in different cases.
By contrast, adaptive policies try to make decisions on the fly using the currently available information. This kind of policy is clearly more desirable for ST than fixed ones, and different methods have been explored to achieve an adaptive policy. The majority of such methods (Grissom II et al., 2014; Cho and Esipova, 2016; Gu et al., 2017; Alinejad et al., 2018; Zheng et al., 2019a) are based on full-sentence translation models, which may be simple to use but cannot outperform fixed policies applied with "genuinely simultaneous" models trained for ST (Ma et al., 2019). Other methods (Arivazhagan et al., 2019; Zheng et al., 2019b) try to learn a policy together with the underlying translation model, but they rely on complicated and time-consuming training processes.
In this paper, we propose to achieve an adaptive policy via a much simpler heuristic composition of a set of wait-k policies (e.g., k = 1 ∼ 10). See Fig. 1 for an example. To further improve the translation quality of our method, we apply an ensemble of models trained with different wait-k policies. Our experiments on Chinese→English and German→English translation show that our method can achieve up to 4 BLEU points of improvement over the wait-k method at the same latency. More interestingly, compared with full-sentence translation, our method achieves higher BLEU scores than greedy search with much lower latency, and comes close to the results of beam search.

Preliminaries
Full-sentence translation. A neural machine translation (NMT) model usually consists of two components: an encoder, which encodes the source sentence x = (x_1, ..., x_m) into a sequence of hidden states, and a decoder, which sequentially predicts target tokens conditioned on those hidden states and previous predictions. The probability of the predicted target sequence y = (y_1, ..., y_n) is

p(y | x) = ∏_{t=1}^{|y|} p(y_t | x, y_{<t}),

where y_{<t} = (y_1, ..., y_{t−1}) denotes the target sequence predicted before step t.
Simultaneous translation. Ma et al. (2019) propose a prefix-to-prefix framework to train models to make predictions conditioned on partial source sentences. In this framework, the probability of the predicted sequence y becomes

p_g(y | x) = ∏_{t=1}^{|y|} p(y_t | x_{≤g(t)}, y_{<t}),

where g(t) is a monotonic non-decreasing function of t, denoting the number of processed source tokens when predicting y_t. This function g(t) can be used to represent a policy for ST. Ma et al. (2019) introduce a class of fixed policies, called wait-k policies, defined by

g_k(t) = min{|x|, t + k − 1}.
Intuitively, this policy first waits for k source tokens, and then outputs predicted tokens concurrently with the rest of the source sentence.

Figure 2: In this example, we choose an action based on the top probability p_top, and apply a new policy (the dotted arrows) after the chosen action.
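The wait-k schedule g_k(t) defined above can be sketched as a small function. This is an illustrative implementation of the formula, not the authors' code; the function and variable names are ours.

```python
def wait_k_policy(t, k, src_len):
    """Number of source tokens read when predicting target token t
    (1-indexed), under the wait-k policy g_k(t) = min(|x|, t + k - 1)."""
    return min(src_len, t + k - 1)

# With k = 3 and a 6-token source, the model first reads 3 tokens,
# then reads one more source token per emitted target token until
# the source is exhausted.
schedule = [wait_k_policy(t, 3, 6) for t in range(1, 7)]
# schedule == [3, 4, 5, 6, 6, 6]
```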

Obtaining an Adaptive Policy
Assume we have a set of wait-k policies and the corresponding models M_k (k = k_min, ..., k_max).
We can obtain an adaptive policy whose lag at each step is between k_min and k_max, meaning that at each step, the target sequence falls behind the source sequence by at most k_max and at least k_min tokens. At each step, there is a wait-k policy synchronized with the adaptive policy, meaning that the two have the same lag at that step. Specifically, at any step t, if the lag of the adaptive policy is k′, then we apply the NMT model with the wait-k′ policy and force it to predict the existing target tokens until step t, at which point the model makes a new prediction as the output of step t. However, the above only shows how to simulate the adaptive policy to make a prediction at one step if we choose to write at that step; it does not tell us at which steps we should write. We utilize model confidence to make this decision. Specifically, we set a probability threshold ρ_k for each wait-k policy. At each step, if the NMT model follows a wait-k policy and predicts its most likely token with probability higher than the threshold ρ_k, then we consider the model confident in this prediction and choose the WRITE action; otherwise, we choose the READ action. Figure 2 gives an example of this process.
We define the process of applying a wait-k model M_k with the wait-k policy on a given sequence pair (x, y) by

(y_top, p_top) ← P_k(x, y),

which forces model M_k to predict y, and returns the top token y_top at the final step together with its probability p_top. The process of reading and returning a new source token is denoted READ(), and the expression x • x′ denotes appending an element x′ to the end of sequence x. We denote by <s> and </s> the start and end symbols of a sequence. Algorithm 1 gives the pseudocode of the above method.

Algorithm 1 ST decoding with an adaptive policy
Input: two integers k_min and k_max, a set of NMT models M_k, and thresholds ρ_k
  x ← READ(); y ← <s>
  while the last token of y is not </s> do
      k ← min(max(|x| − |y|, k_min), k_max)
      (y_top, p_top) ← P_k(x, y)
      while k < k_max and p_top < ρ_k and the last token of x is not </s> do   ▷ READ action
          x ← x • READ()
          k ← min(max(|x| − |y|, k_min), k_max)
          (y_top, p_top) ← P_k(x, y)
      y ← y • y_top   ▷ WRITE action
  return y
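The decoding loop above can be sketched in Python as follows. The `read_source()` and `predict(model, src, tgt)` interfaces are assumptions for illustration (a token stream, and a force-decoding call returning the top next token with its probability), not the authors' actual API.

```python
def adaptive_decode(read_source, predict, models, thresholds, k_min, k_max,
                    bos="<s>", eos="</s>"):
    """Sketch of the adaptive decoding loop: compose wait-k models into
    an adaptive policy, writing only when the current model is confident."""
    src, tgt = [read_source()], [bos]

    def lag():
        # Current lag, clamped to the range of available wait-k models.
        return max(k_min, min(len(src) - len(tgt), k_max))

    while tgt[-1] != eos:
        k = lag()
        top, p = predict(models[k], src, tgt)
        # READ while the model is unconfident, the lag is below k_max,
        # and there are still source tokens left to read.
        while k < k_max and p < thresholds[k] and src[-1] != eos:
            src.append(read_source())
            k = lag()
            top, p = predict(models[k], src, tgt)
        tgt.append(top)  # WRITE the (now confident) top token
    return tgt[1:]
```

With a toy `predict` that only becomes confident once it is at least one source token ahead, the loop alternates READ and WRITE much like a wait-1 policy, while a stricter threshold schedule would make it wait longer.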

Ensemble of Wait-k Models
Using the corresponding model M_k with each wait-k policy may not give the best performance. If we have a set of models trained independently with different wait-k policies, we can apply an ensemble of those models (Dietterich, 2000; Hansen and Salamon, 1990) to improve translation quality, a technique also used in full-sentence translation (Stahlberg and Byrne, 2017). However, applying an ensemble of all models raises two issues: (1) the runtime for each prediction could be longer, resulting in higher latency; and (2) the translation accuracy may be worse, since the best model for one policy may perform badly when doing inference with another policy. To avoid these issues, we propose to apply an ensemble of the top-3 models for each policy. That is, we first generate distributions with the top-3 models independently under the same policy, and then take the arithmetic average of the three distributions as the final token distribution at that step.
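The averaging step can be sketched as follows; this is a minimal illustration of the arithmetic mean over per-model next-token distributions, with toy numbers rather than real model outputs.

```python
def ensemble_step(distributions):
    """Arithmetic average of the next-token distributions produced by
    several models run independently under the same wait-k policy."""
    n = len(distributions)
    vocab = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(vocab)]

# Toy next-token distributions from three models over a 3-word vocabulary.
p = ensemble_step([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
])
# p is still a valid distribution (sums to 1 up to float error),
# and token 0 remains the top candidate.
```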

Experiments
Datasets and models. We conduct experiments on Chinese→English (ZH→EN) and German→English (DE→EN) translation.

Performance with different policies. We first evaluate the performance of each model with different policies, which helps us choose models for different policies. Specifically, we apply each model with ten different wait-k policies on the dev set to compare performance. Fig. 3 shows the results of five models. We find that the best model for one policy may not be the one trained with that policy. For example, on ZH→EN translation, the best model for the wait-1 policy is the one trained with the wait-3 policy. Further, no single model achieves the best performance for all policies.
Comparing different methods. We compare our method with others from the literature: the wait-k method (Ma et al., 2019) (train and test models with the same wait-k policy), the test-time wait-k method (Ma et al., 2019) (apply a full-sentence model with wait-k policies), wait-if-diff (Cho and Esipova, 2016) (start with s_0 source tokens, and read only if the top token at the t-th step differs from that at the (t − δ)-th step), and wait-if-worse (Cho and Esipova, 2016) (start with s_0 source tokens, and read only if the top probability at the t-th step is smaller than that at the (t − δ)-th step). For wait-if-diff we set s_0 ∈ {4, 6} and δ ∈ {2, 4}; for wait-if-worse we set s_0 ∈ {1, 2, 4, 6} and δ ∈ {1, 2}.

Figure 4: Performance of different methods on the test set. Our single method achieves better BLEU scores than the wait-k method at the same latency. Our ensemble top-3 method achieves the highest BLEU scores at the same latency, and outperforms full-sentence greedy search with AL < 9. #, $: full-sentence translation with greedy search and beam search (beam size = 10), respectively.

For our method, we test three different cases: (1) single, where for each policy we apply the corresponding model trained with the same policy; (2) ensemble top-3, where for each policy we apply the ensemble of the 3 models that achieve the highest BLEU scores with that policy on the dev set; and (3) ensemble all, where we apply the ensemble of all 10 models for each policy. For thresholds, we first choose ρ_1 and ρ_10, and compute the other thresholds as ρ_i = ρ_1 − d·(i−1) for integers 1 ≤ i ≤ 10, where d = (ρ_1 − ρ_10)/9. We test ρ_1 ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} with ρ_10 = 0, and ρ_1 = 1 with ρ_10 ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, for a total of 18 different settings. The reason behind these settings is that we assume our adaptive policy should be neither too aggressive nor too conservative (as mentioned at the beginning of Section 3). The policy is most aggressive for k = 1, so we set ρ_1 to be the largest threshold; the policy is most conservative for k = 10, so we set ρ_10 to be the smallest.
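The linear interpolation of thresholds can be sketched as a short function; the function name is ours, and the call below shows one of the settings described above.

```python
def threshold_schedule(rho_1, rho_10, n=10):
    """Linearly interpolated thresholds rho_i = rho_1 - d * (i - 1),
    with d = (rho_1 - rho_10) / (n - 1): the most aggressive policy
    (k = 1) gets the largest threshold and the most conservative
    (k = n) the smallest."""
    d = (rho_1 - rho_10) / (n - 1)
    return {i: rho_1 - d * (i - 1) for i in range(1, n + 1)}

rho = threshold_schedule(0.9, 0.0)  # one of the 18 tested settings
# rho[1] is 0.9, rho[10] is 0.0, and the values decrease evenly.
```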
The comparison is provided in Fig. 4 (the corresponding numeric scores are provided in Appendix A). Compared with the wait-k method, our single method achieves an improvement of up to 2 BLEU points, and our ensemble top-3 method an improvement of up to 4 BLEU points. Compared with full-sentence translation, our ensemble top-3 method surprisingly outperforms greedy search with much lower latency (AL < 9), and achieves BLEU scores close to those from beam search (see Table 2). We also give one ZH→EN translation example from the dev set in Table 1 to compare different methods, showing that our method achieves an adaptive policy with low latency and good translation quality.
Efficiency. To evaluate efficiency, we present in Table 3 the average time needed to predict one token for different methods, tested on one GeForce GTX TITAN-X GPU on the ZH→EN test set. We can see that our ensemble top-3 method needs about 0.2 seconds to make a prediction on average. However, if the source sentence is revealed at the same speed as general speech, which is about 0.6 seconds per token in Chinese (Zheng et al., 2019c), then our method is still faster than that (meaning it could be used in real time). Further, we believe the efficiency of our method could be improved with other techniques, such as running the three models in the ensemble in parallel, making this less of an issue.

"we express the most sincere sympathy and condolences to the families of the victims ."
Table 1: One example from the ZH→EN dev set. Although the wait-3 method has low latency, it makes anticipations on "offered" and "wishes", and adds the additional words "he said", which are not accurate translations. Our ensemble top-3 method provides a better translation with lower latency.

Conclusions
We have designed a simple heuristic algorithm to obtain an adaptive policy based on a set of wait-k policies, and applied ensembling in our method to improve translation quality while maintaining low latency. Experiments show that our method not only outperforms the original wait-k method by a relatively large margin, but also surpasses greedy full-sentence translation with much lower latency.

A Appendices
We provide the complete results of Figure 4 from Section 5 in the following tables, where AL is Average Lagging. Note that for ZH→EN we use 4-reference BLEU, while for DE→EN we use single-reference BLEU.