Opportunistic Decoding with Timely Correction for Simultaneous Translation

Simultaneous translation has many important application scenarios and has recently attracted much attention from both academia and industry. Most existing frameworks, however, have difficulty balancing translation quality and latency; i.e., the decoding policy is usually either too aggressive or too conservative. We propose an opportunistic decoding technique with timely correction ability, which always (over-)generates a certain amount of extra words at each step to keep the audience on track with the latest information. At the same time, it corrects, in a timely fashion, mistakes in the previously overgenerated words when it observes more source context, to ensure high translation quality. Experiments show our technique achieves substantial reductions in latency and up to a +3.1 increase in BLEU, with a revision rate under 8%, on Chinese-to-English and English-to-Chinese translation.


Introduction
Simultaneous translation, which starts translating before the speaker finishes, is extremely useful in many scenarios, such as international conferences and travel. In order to achieve low latency, it is often inevitable to generate target words with insufficient source information, which makes this task extremely challenging.
Recently, there have been many efforts towards balancing translation latency and quality, with mainly two types of approaches. On one hand, Ma et al. (2019a) propose very simple frameworks that decode following a fixed-latency policy such as wait-k. On the other hand, there are many attempts to learn an adaptive policy which enables the model to decide a READ or WRITE action on the fly, using various techniques such as reinforcement learning (Gu et al., 2017; Alinejad et al., 2018; Grissom II et al., 2014), supervised learning over pseudo-oracles (Zheng et al., 2019a), imitation learning (Zheng et al., 2019b), model ensemble (Zheng et al., 2020) or monotonic attention (Ma et al., 2019d; Arivazhagan et al., 2019).

* These authors contributed equally.

Figure 1: Besides y_t, opportunistic decoding continues to generate w additional words, represented as ŷ^w_t. The timely correction only revises this part in future steps. Different shapes denote different words. In this example, from step t to t+1, all previously opportunistically decoded words are revised, and an extra triangle word is generated in the opportunistic window. From step t+1 to t+2, two words from the previous opportunistic window are kept and only the triangle word is revised.
Though these existing efforts improve performance in both translation latency and quality with more powerful frameworks, it is still difficult in practice to choose an appropriate policy that strikes the optimal balance between latency and quality, especially when the policy is trained and applied in different domains. Furthermore, all existing approaches are incapable of correcting mistakes from previous steps. When earlier steps commit errors, those errors propagate to later steps and induce more mistakes in the future.
Inspired by our previous work on speculative beam search (Zheng et al., 2019c), we propose an opportunistic decoding technique with a timely correction mechanism to address the above problems. As shown in Fig. 1, our proposed method always decodes more words than the original policy at each step, to catch up with the speaker and reduce latency. At the same time, it also employs a timely correction mechanism that reviews the extra outputs from previous steps with more source context, and revises these outputs with the current preference when there is a disagreement. Our algorithm can be used in both speech-to-text and speech-to-speech simultaneous translation (Oda et al., 2014; Bangalore et al., 2012; Yarmohammadi et al., 2013). In the former case, the audience will not be overwhelmed by the modifications, since we only review and modify the last few output words, with a relatively low revision rate. In the latter case, the revisable extra words can be used in the look-ahead window of incremental TTS (Ma et al., 2019b). By contrast, the alternative re-translation strategy (Arivazhagan et al., 2020) causes non-local revisions, which makes it impossible to use in incremental TTS.
We also define, for the first time, two metrics for revision-enabled simultaneous translation: a more general latency metric, Revision-aware Average Lagging (RAL), as well as the revision rate. We demonstrate the effectiveness of our proposed technique using fixed (Ma et al., 2019a) and adaptive (Zheng et al., 2019a) policies in both Chinese-to-English and English-to-Chinese translation.

Preliminaries
Full-sentence NMT. Conventional full-sentence NMT processes the source sentence x = (x_1, ..., x_n) with an encoder, where x_i represents an input token. The decoder on the target side (greedily) selects the highest-scoring word y_t given the source representation h and the previously generated target tokens y_<t = (y_1, ..., y_{t−1}):

y_t = argmax_y p(y | x, y_<t)    (1)

and the final hypothesis y = (y_1, ..., y_t) with y_t = <eos> has the highest probability.

Simultaneous Translation. Without loss of generality, regardless of the actual design of the policy, simultaneous translation can be represented as:

y_t = argmax_y p(y | x_{≤g(t)}, y_<t)    (2)

where g(t) can represent any arbitrary fixed or adaptive policy. For simplicity, we assume the policy is given and do not distinguish between the two types of policies.
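As a concrete instance of a fixed policy, the wait-k policy of Ma et al. (2019a) can be written as g(t) = min(k + t − 1, |x|): read k source words first, then alternate reading and writing. A minimal sketch (function name is ours):

```python
def wait_k_policy(k, src_len):
    """g(t) for the wait-k fixed policy (Ma et al., 2019a): wait for k
    source words, then read one new source word per emitted target
    word, capped at the full source length."""
    return lambda t: min(k + t - 1, src_len)
```

For wait-3 on a 10-word source, the decoder sees 3 words when emitting y_1, 4 words for y_2, and the full sentence from y_8 onward.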

Opportunistic Decoding with Timely Correction and Beam Search
Opportunistic Decoding. For simplicity, we first apply this method to fixed policies. We define y_t as the word decoded at time step t by the original model, and denote the additional words decoded at time step t as ŷ^w_t = (ŷ^1_t, ..., ŷ^w_t), where w denotes the number of extra decoded words. In our setting, the output at step t is

y_t ∘ ŷ^w_t

where ∘ is the string concatenation operator. We treat the procedure of generating the extra decoded sequence as opportunistic decoding, which prefers to generate more tokens based on the current context. When we have enough information, this opportunistic decoding eliminates unnecessary latency and keeps the audience on track. When, with a certain chance, the opportunistic decoding is too aggressive and generates inappropriate tokens, we need to fix the inaccurate tokens immediately.
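One such step can be sketched as follows; decode_fn is a hypothetical stand-in for a single greedy call to the underlying NMT model, not the authors' code:

```python
def opportunistic_step(decode_fn, src_prefix, committed, w):
    """One decoding step under a fixed policy with opportunistic
    window w: greedily decode 1 irreversible word plus w extra
    revisable words, all conditioned on the current source prefix.

    decode_fn(src_prefix, target_prefix) -> next target word
    (hypothetical interface standing in for the real NMT decoder).
    """
    new_words = []
    for _ in range(1 + w):
        new_words.append(decode_fn(src_prefix, committed + new_words))
    # (irreversible word y_t, opportunistic window ŷ^w_t)
    return new_words[0], new_words[1:]
```

With a toy decode_fn that simply copies source words, the step emits one committed word and w revisable ones, mirroring the y_t ∘ ŷ^w_t output above.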
Timely Correction.In order to deliver the correct information to the audience promptly and fix previous mistakes as soon as possible, we also need to review and modify the previous outputs.
At step t+1, when the encoder obtains more information (from x_{≤g(t)} to x_{≤g(t+1)}), the decoder is capable of generating more appropriate candidates, and may revise and replace the previous outputs from opportunistic decoding. More precisely, ŷ^w_t and y_{t+1} ∘ ŷ^{w−1}_{t+1} are two different hypotheses over the same time chunk. When there is a disagreement, our model always uses the hypothesis from the later step to replace the previous commits. Note that our model does not change any word in y_t from the previous step; it only revises the words in ŷ^w_t.
Modification for Adaptive Policies. For adaptive policies, the only difference is that, instead of committing a single word, the model is capable of generating multiple irreversible words at a time. Thus our proposed method can be easily applied to adaptive policies.
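The correction rule can be sketched as below: the later hypothesis always wins, and we count how many already-shown words actually changed (position-wise, consistent with the revision-rate distance; function names are ours):

```python
def timely_correct(prev_window, new_hyp):
    """Replace the previous opportunistic window ŷ^w_t with the later
    step's hypothesis over the same positions, and count how many of
    the already-displayed words actually changed."""
    revised = sum(old != new for old, new in zip(prev_window, new_hyp))
    return new_hyp, revised
```

In the Figure 2 example, the window ["welcome", "to"] is replaced by ["agreement", "to", "President"]: only "welcome" changes, so a single revision is counted even though the window also grows.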
Correction with Beam Search. When the model commits more than one word at a time, we can use beam search to further improve translation quality and reduce the revision rate (Murray and Chiang, 2018; Ma et al., 2019c).
The decoder maintains a beam B^k_t of size b at step t, which is an ordered list of ⟨hypothesis, probability⟩ pairs, where k denotes the k-th step in beam search. At each step, there is an initial beam B^0_t = [⟨y_{t−1}, 1⟩]. We denote the one-step transition from the previous beam to the next as

B^{k+1}_t = next_b(B^k_t) = top_b({⟨y ∘ y', p · p(y' | x_{≤g(t)}, y)⟩ : ⟨y, p⟩ ∈ B^k_t})

where top_b(·) returns the top-scoring b pairs. Note that, for simplicity, we do not distinguish the revisable and non-revisable outputs in y. We also define the multi-step advance beam search function in a recursive fashion:

next^k_b(B) = next_b(next^{k−1}_b(B)),  with  next^0_b(B) = B.

When the opportunistic decoding window is w at decoding step t, we define the beam search over w + 1 words (including the original output) as

y_t ∘ ŷ^w_t = top_1(next^{n+w}_b(B^0_t))

where next^{n+w}_b(·) performs a beam search of n + w steps and generates the outputs, which include both the original and the opportunistically decoded words; n represents the length of y_t.
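The transition and its recursive multi-step form can be sketched as follows; step_probs is a hypothetical stand-in for the model distribution p(y' | x_{≤g(t)}, y):

```python
import heapq

def next_b(beam, step_probs, b):
    """One-step beam transition: extend every <hypothesis, probability>
    pair with every candidate next word, then keep the top-b
    extensions by probability (top_b in the text)."""
    candidates = [(p * q, hyp + [w])
                  for hyp, p in beam
                  for w, q in step_probs(hyp).items()]
    best = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return [(hyp, p) for p, hyp in best]

def next_b_n(beam, step_probs, b, n):
    """Multi-step advance beam search, next_b^n(B) =
    next_b(next_b^{n-1}(B)), unrolled iteratively."""
    for _ in range(n):
        beam = next_b(beam, step_probs, b)
    return beam
```

Running n + w steps from the initial beam and taking the top hypothesis yields both the committed and the opportunistic words in one search.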

Revision-aware AL and Revision Rate
We define, for the first time, two metrics for revision-enabled simultaneous translation.

Revision-aware AL
AL was introduced by Ma et al. (2019a) to measure the average delay of simultaneous translation. Besides the limitations mentioned in Cherry and Foster (2019), AL is also not sensitive to modifications of committed words. Furthermore, in the case of re-translation, AL is incapable of measuring a meaningful latency at all.
We hereby propose a new latency metric, Revision-aware AL (RAL), which can be applied to any kind of translation scenario: full-sentence translation, re-translation used as simultaneous translation, and fixed- or adaptive-policy simultaneous translation. Note that for latency and revision rate calculation, we measure target-side changes with respect to the growth of the source side. As shown in Fig. 3, there might be multiple changes to each output word during translation, and we only start to calculate the latency of a word once it agrees with the final result. Therefore, it is necessary to locate the last change of each word. For a given source-side time s, we denote the t-th output on the target side as f(x_s)_t. Then we find the Last Revision (LR) of the t-th target word as follows:

LR(t) = max{s : f(x_s)_t ≠ f(x_{s−1})_t}    (4)

From the audience's point of view, once a former word is changed, the audience also needs to make the effort to re-read the following words. We therefore also penalize the later words even when there are no changes, as shown with the blue arrows in Fig. 3. We then re-formulate LR(t) as follows (assuming LR(0) = 0):

LR̄(t) = max{LR̄(t − 1), LR(t)}    (5)

The above definition can be visualized as the thick black line in Fig. 3. Similar to the original AL, our proposed RAL is defined as:

RAL = (1 / τ(|x|)) Σ_{t=1}^{τ(|x|)} (LR̄(t) − (t − 1)/r)

where τ(|x|) denotes the cut-off step, and r = |y|/|x| is the target-to-source length ratio.
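A minimal sketch of RAL under our reading of the definitions; the commit-history representation and variable names are ours, and for simplicity it averages over all target words rather than applying a cut-off step:

```python
def revision_aware_al(commits, src_len):
    """Revision-aware Average Lagging (sketch).  commits[s-1] is the
    full target-side output f(x_s) after reading s source words; the
    last entry is the final translation.  Compute the last-revision
    step of each final word, make it monotone (Eq. 5), then average
    the lag against the ideal diagonal as in the original AL."""
    final = commits[-1]
    r = len(final) / src_len  # target-to-source length ratio
    lr = []
    for t, word in enumerate(final):
        settle = 1  # source step at which position t last changed
        for s, out in enumerate(commits, start=1):
            if t >= len(out) or out[t] != word:
                settle = s + 1
        lr.append(settle)
    # Eq. (5): a later word cannot settle before an earlier one
    for t in range(1, len(lr)):
        lr[t] = max(lr[t], lr[t - 1])
    return sum(lr[t] - t / r for t in range(len(lr))) / len(lr)
```

With no revisions the metric reduces to an AL-style lag; a revision pushes the settle step of the revised word, and of every word after it, to the correcting source step.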

Revision Rate
Since each modification on the target side costs the audience extra effort to read, we penalize all revisions made during translation. We define the revision rate as the total distance between consecutive outputs, normalized by the total output length:

revision rate = Σ_s dist(f(x_s), f(x_{s+1})) / Σ_s |f(x_{s+1})|

where dist can be an arbitrary distance measure between two sequences. For simplicity, we use a position-wise mismatch count,

dist(a, b) = Σ_{i=1}^{|a|} 1[a_i ≠ b_i]

where b is extended with a padding symbol pad in case b is shorter than a.
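A sketch of these definitions; the Hamming-style dist with padding follows the text, while the exact normalizer (total words across successive commits) is our assumption:

```python
def dist(a, b, pad="<pad>"):
    """Position-wise mismatch count between two outputs; b is padded
    with a pad symbol in case it is shorter than a."""
    b = b + [pad] * max(0, len(a) - len(b))
    return sum(a[i] != b[i] for i in range(len(a)))

def revision_rate(commits):
    """Total position-wise changes between consecutive commits,
    divided by the total number of words across the later commits
    (normalizer is our assumption; the paper leaves dist as a free
    choice)."""
    changes = sum(dist(commits[s], commits[s + 1])
                  for s in range(len(commits) - 1))
    total = sum(len(c) for c in commits[1:])
    return changes / total
```

Appending new words costs nothing under this dist; only overwriting an already-shown position counts as a revision.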

Experiments
Datasets and Implementation. We evaluate our work on Chinese-to-English and English-to-Chinese simultaneous translation tasks. We use the NIST corpus (2M sentence pairs) as the training data. We first apply BPE (Sennrich et al., 2015) on all texts to reduce the vocabulary sizes.
For evaluation, we use NIST 2006 and NIST 2008 as our dev and test sets with 4 English references.
We re-implement the wait-k model (Ma et al., 2019a) and the adaptive policy (Zheng et al., 2019a). We use a Transformer-based (Vaswani et al., 2017) wait-k model, and a pre-trained full-sentence model for learning the adaptive policy.
Performance on Wait-k Policies. We perform experiments using opportunistic decoding on wait-k policies with k ∈ {1, 3, 5, 7, 9}, opportunistic window w ∈ {1, 3, 5} and beam size b ∈ {1, 3, 5, 7, 10, 15}. We select the best beam size for each policy-window pair on the dev set. We compare our proposed method with a baseline called re-translation, which uses a full-sentence NMT model to re-decode the whole target sentence once a new source word is observed. The final output sentences of this method are identical to the full-sentence translation output of the same model, but the latency is reduced. Fig. 4 (left) shows the Chinese-to-English results of our proposed algorithm. Since our greedy opportunistic decoding does not change the final output, there is no difference in BLEU compared with normal decoding, but the latency is reduced. By applying beam search, however, we achieve a 3.1 BLEU improvement and a 2.4 latency reduction on the wait-7 policy. Fig. 4 (right) shows the English-to-Chinese results. Compared to the Chinese-to-English results, there is less latency reduction from beam search, because the output translations are slightly longer, which hurts latency. As shown in Fig. 5 (right), the revision rate is still kept under 8%.
Fig. 5 shows the revision rate with different window sizes on wait-k policies. In general, with opportunistic window w ≤ 5, the revision rate of our proposed approach is under 8%, which is much lower than that of re-translation.
Performance on Adaptive Policies. Fig. 6 shows the performance of the proposed algorithm on adaptive policies. We use thresholds ρ ∈ {0.55, 0.53, 0.5, 0.47, 0.45}. We vary the beam size b ∈ {1, 3, 5, 7, 10} and select the best one on the dev set. Compared with conventional beam search over consecutive writes, our decoding algorithm achieves much higher BLEU and lower latency. We further investigate the revision rate with different beam sizes on wait-k policies. Fig. 7 shows that the revision rate is higher for lower-k wait-k policies. This makes sense, because low-k policies are more aggressive and more prone to mistakes. Moreover, we find that the revision rate is not very sensitive to the beam size.

Conclusions
We have proposed an opportunistic decoding technique with timely correction that improves both latency and quality for simultaneous translation. We also defined, for the first time, two metrics for revision-enabled simultaneous translation.
Figure 2: The decoder generates the target word y_4 = "his" and two extra words "welcome to" at step t = 4, when the input x_9 = "zàntóng" ("agreement") is not yet available. When the model receives x_9 at step t = 5, the decoder immediately corrects the previous mistake, replacing "welcome" with "agreement", and emits two additional target words ("to President"). The decoder not only fixes the earlier mistake, but also has enough information to produce more correct generations; our framework benefits from opportunistic decoding with reduced latency here. Note that although the word "to" is generated at step t = 4, it only becomes irreversible at step t = 6.

Figure 3: The red arrows represent the changes between two different commits, and the last change for each output word is highlighted in yellow.

Figure 7: Revision rate against beam size, with window size 3 and different wait-k policies.