Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation

We address the problem of simultaneous translation by modifying the Neural MT decoder to operate with a dynamically built encoder and attention. We propose a tunable agent which decides the best segmentation strategy for a user-defined BLEU loss and Average Proportion (AP) constraint. Our agent outperforms the previously proposed Wait-if-diff and Wait-if-worse agents (Cho and Esipova, 2016) on BLEU with lower latency. Second, we propose data-driven changes to Neural MT training to better match the incremental decoding framework.


Introduction
Simultaneous translation is a desirable attribute in Spoken Language Translation, where the translator is required to keep up with the speaker. In a lecture or meeting translation scenario where utterances are long, or the end of sentence is not clearly marked, the system must operate on a buffered sequence. Generating translations for such incomplete sequences presents a considerable challenge for machine translation, more so in the case of syntactically divergent language pairs (such as German-English), where the context required to correctly translate a sentence appears much later in the sequence, and prematurely committing to a translation leads to a significant loss in quality.
Various strategies to select appropriate segmentation points in a streaming input have been proposed (Fügen et al., 2007; Bangalore et al., 2012; Yarmohammadi et al., 2013; Oda et al., 2014). A downside of this approach is that the MT system translates sequences independently of each other, ignoring the context. Even if the segmenter finds perfect points at which to segment the input stream, an MT system requires lexical history to make the correct decision. The end-to-end nature of the Neural MT architecture (Sutskever et al., 2014; Bahdanau et al., 2015) provides a natural mechanism to integrate stream decoding. Specifically, the recurrent property of the encoder and decoder components provides an easy way to maintain historic context in a fixed-size vector.
We modify the neural MT architecture to operate in an online fashion where i) the encoder and the attention are updated dynamically as new input words are added, through a READ operation, and ii) the decoder generates output from the available encoder states, through a WRITE operation. The decision of when to WRITE is learned through a tunable segmentation agent, based on user-defined thresholds. Our incremental decoder significantly outperforms the chunk-based decoder and restores the oracle performance with a deficit of  2 BLEU points across 4 language pairs with a moderate delay. We additionally explore whether modifying the Neural MT training to match the decoder can improve performance. While we observed significant restoration in the case of chunk decoding matched with chunk-based NMT training, the same was not found true with our proposed incremental training to match the incremental decoding framework.
The remainder of the paper is organized as follows: Section 2 describes modifications to the NMT decoder to enable stream decoding. Section 3 describes various agents to learn a READ/WRITE strategy. Section 4 presents evaluation and results. Section 5 describes modifications to the NMT training to mimic the corresponding decoding strategy, and Section 6 concludes the paper.

[Figure 1 panel labels: Iteration 2, Iteration 3, Iteration 4; n_w = 0, n_w = 2, n_w = 0, n_w = 4]
Figure 1: A decoding pass over a 4-word source sentence. n_w denotes the number of words the agent chose to commit. Green nodes = committed words, Blue nodes = newly generated words in the current iteration. Words marked in red are discarded, as the agent chooses to not commit them.

Incremental Decoding
Problem: In a stream decoding scenario, the entire source sequence is not readily available. The translator must either wait for the sequence to finish in order to compute the encoder state, or commit partial translations at several intermediate steps, potentially losing contextual information.
Chunk-based Decoder: A straightforward way to enable simultaneous translation is to chop the incoming input after every N tokens. A drawback of this approach is that the translation and segmentation processes operate independently of each other, and the previous contextual history is not considered when translating the current chunk. This information is important to generate grammatically correct and coherent translations.
Incremental Decoding: The RNN-based NMT framework provides a natural mechanism to preserve context and accommodate streaming. The decoder maintains the entire target history through the previous decoder state alone. But to enable incremental neural decoding, we have to address the following constraints: i) how to dynamically build the encoder and attention with the streaming input? ii) what is the best strategy to pre-commit translations at several intermediate points?
Inspired by Cho and Esipova (2016), we modify the NMT decoder to operate in a sequence of READ and WRITE operations. The former reads the next word from the buffered source sequence and translates it using the available context, and the latter is computed through an AGENT, which decides how many words should be committed from this generated translation. Note that, when a translation is generated in the READ operation, the already committed target words remain unchanged, i.e. the generation is continued from the last committed target word using the saved decoder state. See Algorithm 1 for details. The AGENT decides how many target words to WRITE after every READ operation, and has complete control over the context each target word gets to see before being committed, as well as the overall delay incurred. Figure 1 shows the incremental decoder in action, where the agent decides to not commit any target words in iterations 1 and 3. The example shows an instance where the incorrectly translated words are discarded when more context becomes available. Given this generic framework, we describe several AGENTS in Section 3, trained to optimize the BLEU loss and latency.

Algorithm 1: Algorithm for incremental decoder
    s: Source sequence
    s': Available source sequence
    tc: Committed target sequence
    t: Current decoded sequence for s'
    nw: Number of tokens to commit
    ...                                   ▷ WRITE operation
    end for
    function GETNEWTOKENS(tc, t, nw)
        start ← length(tc) + 1
        end ← start + nw
        return t[start : end]
    end function

... complexities for beam decoding. For example, if at some iteration the decoder generates 5 new words, but the agent decides to commit only 2 of these, the best hypothesis at the 2nd word may not be the same as the one at the 5th word. Hence, the agent has to re-rank the hypotheses at the last target word it decides to commit. Future hypotheses then continue from this selected hypothesis. See Figure 2 for a visual representation.
The overall utility of beam decoding is reduced in the case of incremental decoding, because it is necessary to commit and retain only one beam at several points to start producing output with minimal delay.
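The READ/WRITE loop described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: `translate` and `agent` are hypothetical callables standing in for the NMT decoder and the segmentation agent, `toy_translate` and `hold_back_one` are toy stand-ins invented for the example, and committing all remaining words once the source is exhausted is an assumption about the final step.

```python
from typing import Callable, List

Tokens = List[str]


def get_new_tokens(committed: Tokens, decoded: Tokens, n_w: int) -> Tokens:
    """Return the n_w tokens of `decoded` that follow the committed prefix.

    Python slices are 0-based with an exclusive end, so len(committed)
    plays the role of length(tc) + 1 in the 1-based pseudocode.
    """
    start = len(committed)
    return decoded[start:start + n_w]


def incremental_decode(
    source: Tokens,
    translate: Callable[[Tokens, Tokens], Tokens],
    agent: Callable[[Tokens, Tokens, Tokens], int],
) -> Tokens:
    """Alternate READ and WRITE operations over a streaming source."""
    committed: Tokens = []
    for i in range(1, len(source) + 1):
        available = source[:i]                     # READ one more source word
        # Hypothetical decoder call: continues from the committed prefix,
        # so already committed target words are never changed.
        decoded = translate(available, committed)
        if i == len(source):
            # Assumption: once the source is complete, commit everything.
            n_w = len(decoded) - len(committed)
        else:
            n_w = agent(available, committed, decoded)
        committed = committed + get_new_tokens(committed, decoded, n_w)  # WRITE
    return committed


def toy_translate(available: Tokens, committed: Tokens) -> Tokens:
    """Toy stand-in for the decoder: 'translates' by upper-casing."""
    return committed + [w.upper() for w in available[len(committed):]]


def hold_back_one(available: Tokens, committed: Tokens, decoded: Tokens) -> int:
    """Toy agent: commit every new word except the most recent one."""
    return max(0, len(decoded) - len(committed) - 1)


result = incremental_decode(["a", "b", "c", "d"], toy_translate, hold_back_one)
```

With the toy agent, no word is committed in the first iteration, one word is committed in each of the next two, and the remaining two are flushed at the end of the source.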

Segmentation Strategies
In this section, we discuss the different AGENTS that we evaluated in our modified incremental decoder. To measure latency in these agents, we use the Average Proportion (AP) metric as defined by Cho and Esipova (2016). AP is calculated as the total number of source words each target word required before being committed, normalized by the product of the source and target lengths. It varies between 0 and 1, with lower being better. See supplementary material for details.
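As a small worked example of the AP definition above, the following Python sketch computes the metric from a per-target-word count of source words read. The function name and the input representation are assumptions made for illustration.

```python
from typing import Sequence


def average_proportion(source_len: int, reads_per_target: Sequence[int]) -> float:
    """Average Proportion as described in the text.

    reads_per_target[t] is the number of source words that had been read
    when target word t was committed (hypothetical input format).
    """
    target_len = len(reads_per_target)
    if source_len <= 0 or target_len == 0:
        raise ValueError("source and target must be non-empty")
    return sum(reads_per_target) / (source_len * target_len)


# Wait-until-end: every target word waits for all 4 source words.
ap_wait_until_end = average_proportion(4, [4, 4, 4])

# An agent that commits earlier: 1, 2, 4 and 4 source words read.
ap_early = average_proportion(4, [1, 2, 4, 4])
```

Here the wait-until-end case gives 12 / 12 = 1, the worst value, while the second agent gives 11 / 16 = 0.6875.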
Wait-until-end: The WUE agent waits for the entire source sentence before decoding, and serves as an upper bound on the performance of our agents, albeit with the worst AP = 1.
Wait-if-worse/diff: We reimplemented the baseline agents described in Cho and Esipova (2016). The Wait-if-Worse (WIW) agent WRITES a target word if its probability does not decrease after a READ operation. The Wait-if-Diff (WID) agent instead WRITES a target word if the target word remains unchanged after a READ operation.
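The two baseline decision rules can be written as simple predicates. This sketch follows only the one-sentence descriptions above. How the candidate word and its probability are obtained before and after a READ is left out, and the function names and arguments are assumptions.

```python
def wait_if_worse(prob_before: float, prob_after: float) -> bool:
    """WIW: return True (WRITE) if the candidate target word's probability
    did not decrease after the READ operation, otherwise False (wait)."""
    return prob_after >= prob_before


def wait_if_diff(word_before: str, word_after: str) -> bool:
    """WID: return True (WRITE) if the candidate target word is unchanged
    after the READ operation, otherwise False (wait)."""
    return word_after == word_before
```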
Static Read and Write: The STATIC-RW agent is inspired by the chunk-based decoder and tries to resolve its shortcomings while maintaining its simplicity. The primary drawback of the chunk-based decoder is the loss of context across chunks. Our agent starts by performing S READ operations, followed by repeated rounds of RW WRITES and RW READS until the end of the source sequence. The number of WRITE and READ operations is the same to ensure that the gap between the source and target sequence does not increase with time.
The initial S READ operations essentially create a buffer of S tokens, allowing some future context to be used by the decoder. Note that the latency induced by this agent is incurred only at the beginning, and remains constant for the rest of the sentence. This method actually introduces a class of AGENTS based on their S and RW values. We tune S and RW to select the specific AGENT that satisfies the user-defined BLEU-loss and AP thresholds.
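The schedule produced by a STATIC-RW agent can be sketched as a list of actions. This is an illustration under assumptions: the function name is hypothetical, the WRITE-then-READ order within each round is taken from the prose, and the final WRITES after the source ends are omitted because they depend on the target length.

```python
from typing import List


def static_rw_schedule(source_len: int, s: int, rw: int) -> List[str]:
    """Return the READ/WRITE actions up to the end of the source.

    s  : number of initial READ operations (the buffer size)
    rw : number of WRITES, then READS, in each subsequent round
    """
    if s < 1 or rw < 1:
        raise ValueError("s and rw must be positive")
    read = min(s, source_len)
    actions = ["READ"] * read                  # initial buffer of S tokens
    while read < source_len:
        actions.extend(["WRITE"] * rw)         # commit RW target words
        step = min(rw, source_len - read)      # do not read past the source
        actions.extend(["READ"] * step)
        read += step
    return actions


schedule = static_rw_schedule(source_len=6, s=2, rw=2)
```

For a 6-word source with S = 2 and RW = 2, the schedule is two READS followed by two rounds of two WRITES and two READS, so the target lags the source by a constant two words after the initial buffer.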

Evaluation
Data: We trained systems for 4 language pairs: German-, Arabic-, Czech- and Spanish-English, using the data made available for IWSLT (Cettolo et al., 2014). See supplementary material for data stats. These language pairs present a diverse set of challenges for this problem, with Arabic and Czech being morphologically rich, German being syntactically divergent, and Spanish introducing local reorderings with respect to English.
NMT System: We trained 2-layer LSTM encoder-decoder models with attention using the seq2seq-attn implementation (Kim, 2016). Please see supplementary material for settings.
Figure 3 shows the results of various streaming agents. Our proposed STATIC-RW agent outperforms other methods while maintaining an AP < 0.75 with a loss of less than 0.5 BLEU points on Arabic, Czech and Spanish. This was found to be consistent for all test sets 2011-2014 (see under "small" models in Figure 4). In the case of German, the loss at AP < 0.75 was around 1.5 BLEU points. The syntactic divergence and rich morphology of German pose a bigger challenge and require larger context than other language pairs. For example, the conjugated verb in a German verb complex appears in the second position, while the main verb almost always occurs at the end of the sentence/phrase (Durrani et al., 2011). Our methods are also comparable to the more sophisticated techniques that use Reinforcement Learning to learn an agent, introduced by Gu et al. (2017) and Satija and Pineau (2016), but without the overhead of expensive training for the agent.

Scalability:
The preliminary results were obtained using models trained on the TED corpus only. We conducted further experiments by training models on larger datasets (see the supplementary material for data sizes) to see whether our findings scale. We fine-tuned (Luong and Manning, 2015; Sajjad et al., 2017b) our models with the in-domain data to avoid domain disparity. We then re-ran our agents with the best S, RW values (with an AP under 0.75) for each language pair. Figure 4 ("large" models) shows that the BLEU loss from the respective oracle increased when the models were trained with bigger data sizes. This could be attributed to the increased lexical ambiguity from the large amount of out-of-domain data, which can only be resolved with additional contextual information. However, our results were still better than the WIW agent, which also has an AP value above 0.8. Allowing similar AP, our STATIC-RW agents were able to restore the BLEU loss to be  1.5 for all language

Incremental Training
The limitation of the previously described decoding approaches (chunk-based and incremental) is the mismatch between training and decoding. Training is carried out on full sentences; at test time, however, the decoder generates hypotheses based on incomplete source information. This discrepancy between training and decoding can be potentially harmful. In Section 2, we presented two methods to address the partial input sentence decoding problem, the Chunk Decoder and the Incremental Decoder. We now train models to match the corresponding decoding scenario.

Chunk Training
In chunk-based training, we simply split each training sentence into chunks of N tokens. The corresponding target sentence for each chunk is generated by taking the span of target words that are word-aligned with the words in the source span.
Chunking the data into smaller segments increases the training time significantly. To overcome this problem, we train a model on the full sentences using all the data and then fine-tune it with the in-domain chunked data.
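The chunking step can be sketched as follows. This is a minimal illustration, assuming the word alignment is given as (source index, target index) pairs and that the target span for a chunk runs from the smallest to the largest aligned target index. The function name and the handling of chunks with no aligned words are assumptions as well.

```python
from typing import List, Tuple

Tokens = List[str]
Alignment = List[Tuple[int, int]]  # (source index, target index), 0-based


def chunk_training_pairs(
    source: Tokens, target: Tokens, alignment: Alignment, n: int
) -> List[Tuple[Tokens, Tokens]]:
    """Split a sentence pair into (source chunk, aligned target span) pairs."""
    if n < 1:
        raise ValueError("n must be positive")
    pairs = []
    for start in range(0, len(source), n):
        end = min(start + n, len(source))
        tgt_idx = [t for s, t in alignment if start <= s < end]
        if not tgt_idx:
            continue  # assumption: skip chunks with no aligned target words
        lo, hi = min(tgt_idx), max(tgt_idx)
        pairs.append((source[start:end], target[lo:hi + 1]))
    return pairs


src = "ich habe das Buch gelesen".split()
tgt = "I have read the book".split()
align = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)]
chunk_pairs = chunk_training_pairs(src, tgt, align, n=2)
```

In this German-English example with N = 2, the final source chunk "gelesen" maps to "read", which sits in the middle of the full target sentence, illustrating how chunks lose the ordering context of the complete sentence.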

Add-M Training
Next we formulate a training mechanism to match the incremental decoding described in Section 2. A way to achieve this is to force the attention onto a local span of encoder states and block it from giving weight to the non-local (rightward) encoder states. The hope is that in the case of long-range dependencies, the model learns to predict these dependencies without the entire source context. Such a training procedure is non-trivial, as it requires dynamic inputs to the attention mechanism while training. It also complicates backpropagation, because encoder states that have been seen by the attention mechanism a greater number of times receive more gradient inputs. We leave this idea as future work, and focus on a data-driven technique to mimic this kind of training, as described below.
We start with the first N words in a source sentence and generate the target words that are aligned to these words. We then generate the next training instances with N + M, N + 2M, N + 3M, ... source words until the end of the sentence has been reached. The resulting training roughly mimics the decoding scenario where the source-side context is gradually built. The downside of this method is that the data size increases quadratically, making the training infeasible. To overcome this, we fine-tune a model trained on full sentences with the in-domain corpus generated using this method.
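The instance generation described above can be sketched in Python. This is an illustration under assumptions: the alignment is given as (source index, target index) pairs, the target side of each instance consists of the aligned target words in target order, and the last prefix is clipped to the sentence length. The function name is hypothetical.

```python
from typing import List, Tuple

Tokens = List[str]
Alignment = List[Tuple[int, int]]  # (source index, target index), 0-based


def add_m_instances(
    source: Tokens, target: Tokens, alignment: Alignment, n: int, m: int
) -> List[Tuple[Tokens, Tokens]]:
    """Build training instances from source prefixes of length N, N+M, N+2M, ..."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    if not source:
        return []
    instances = []
    length = min(n, len(source))
    while True:
        # Target words aligned to any word in the current source prefix.
        tgt_idx = sorted({t for s, t in alignment if s < length})
        instances.append((source[:length], [target[t] for t in tgt_idx]))
        if length >= len(source):
            break
        length = min(length + m, len(source))  # clip the final prefix
    return instances


src = "ich habe das Buch gelesen".split()
tgt = "I have read the book".split()
align = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)]
instances = add_m_instances(src, tgt, align, n=2, m=2)
```

With N = 2 and M = 2, a 5-word sentence yields prefixes of 2, 4 and 5 source words. Because every instance repeats the beginning of the sentence, the total number of source tokens grows roughly with the square of the sentence length, which is the quadratic increase noted above.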

Results
The results in Figure 5 show that matching the chunk-decoding with corresponding chunk-based training significantly improves performance, with a gain of up to 12 BLEU points. However, we were not able to improve upon our incremental decoder, with the results deteriorating notably. One reason for this degradation is that the training/decoding scenarios are still not perfectly matched. The training pipeline in this case also sees the beginning of sentences much more often, which could lead to unnatural distributions being inferred within the model.

Conclusion
We addressed the problem of simultaneous translation by modifying the architecture of the Neural MT decoder. We presented a tunable agent which decides the best segmentation strategy based on user-defined BLEU loss and AP constraints. Our results showed improvements over the previously established WIW and WID methods. We additionally modified the Neural MT training to match the incremental decoding, which significantly improved the chunk-based decoding, but we did not observe any improvement using Add-M Training. The code for our incremental decoder and agents has been made available. In the future we would like to change the training model to dynamically build the encoder and the attention model in order to match our incremental decoder.