Neural Simultaneous Speech Translation Using Alignment-Based Chunking

In simultaneous machine translation, the objective is to determine when to produce a partial translation given a continuous stream of source words, trading off latency against quality. We propose a neural machine translation (NMT) model that dynamically decides whether to continue consuming input or to generate output words. The model consists of two main components: one that dynamically decides on ending a source chunk, and another that translates the consumed chunk. We train the components jointly and in a manner consistent with the inference conditions. To generate chunked training data, we propose a method that utilizes word alignment while also preserving enough context. We compare models with bidirectional and unidirectional encoders of different depths, both on real speech and text input. Our results on the IWSLT 2020 English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU absolute.


Introduction
Simultaneous machine translation is the task of generating partial translations before observing the entire source sentence. The task fits scenarios such as live captioning and speech-to-speech translation, where the user expects a translation before the speaker finishes the sentence. Simultaneous MT has to balance between latency and translation quality. If more input is consumed before translation, quality is likely to improve due to increased context, but latency also increases. On the other hand, consuming limited input decreases latency, but degrades quality.
There have been several approaches to simultaneous machine translation. In Dalvi et al. (2018), a fixed policy is introduced to delay translation by a fixed number of words. Alternatively, Satija and Pineau (2016), Gu et al. (2017), and Alinejad et al. (2018) use reinforcement learning to learn a dynamic policy that determines whether to read or output words. Cho and Esipova (2016) adapt the decoding algorithm without relying on additional components. However, these methods do not modify the training of the underlying NMT model; instead, it is trained on full sentences. Arivazhagan et al. (2019) introduce a holistic framework that relaxes the hard notion of read/write decisions at training time, allowing it to be trained jointly with the rest of the NMT model.
In this paper, we integrate a source chunk boundary detection component into a bidirectional recurrent NMT model. This component corresponds to segmentation or read/write decisions in the literature. It is, however, trained jointly with the rest of the NMT model. We propose an algorithm to chunk the training data based on automatically learned word alignment. The chunk boundaries are used as a training signal along with the parallel corpus. The main contributions of this work are as follows:
• We introduce a source chunk boundary detection component and train it jointly with the NMT model. Unlike in (Arivazhagan et al., 2019), our component is trained using hard decisions, which is consistent with inference.
• We propose a method based on word alignment to generate the source and target chunk boundaries, which are needed for training.
• We study the use of bidirectional vs unidirectional encoder layers for simultaneous machine translation. Previous work focuses mostly on the use of unidirectional encoders.
• We provide results using text and speech input. This is in contrast to previous work that only simulates simultaneous NMT on text input.

Related Work

Oda et al. (2014) formulate segmentation as an optimization problem solved using dynamic programming to optimize translation quality. The approach is applied to phrase-based machine translation. Our chunking approach is conceptually simpler, and we explore its use with neural machine translation. Cho and Esipova (2016) devise a greedy decoding algorithm for simultaneous neural machine translation. They use a model that is trained on full sentences. In contrast, we train our models on chunked sentences to be consistent with the decoding condition. Satija and Pineau (2016), Alinejad et al. (2018), and Gu et al. (2017) follow a reinforcement learning approach to make decisions as to when to read source words or to write target words. A simpler approach uses the position of the reference target word in the beam of an existing MT system to generate training examples of read/write decisions. We extract such decisions from statistical word alignment instead. In Dalvi et al. (2018), a wait-k policy is proposed to delay the first target word until k source words are read. The model alternates between generating s target words and reading s source words, until the source words are exhausted; afterwards, the rest of the target words are generated. In addition, Dalvi et al. (2018) convert the training data into chunks of predetermined fixed size. In contrast, we train models that learn to produce dynamic, context-dependent chunk lengths.
The idea of exploiting word alignments to decide on the necessary translation context can be found in several recent papers. Arthur et al. (2020) train an agent to imitate read/write decisions derived from word alignments. In our architecture, such a separate agent model is replaced by a simple additional output of the encoder. Other work uses word alignments to tune a pretrained language representation model to perform word sequence chunking. In contrast, our approach integrates alignment-based chunking into the translation model itself, avoiding the overhead of having a separate component and the need for a pretrained model. Moreover, in this work we improve on pure alignment-based chunks using language models (Section 6.3) to avoid leaving relevant future source words out of the chunk. Press and Smith (2018) insert ε-tokens into the target using word alignments to develop an NMT model without an attention mechanism.
Those tokens fulfill a similar purpose to wait decisions in simultaneous MT policies. Arivazhagan et al. (2019) propose an attention-based model that integrates an additional monotonic attention component. While the motivation is to use hard attention to select the encoder state at the end of the source chunk, they avoid using discrete attention to keep the model differentiable, and use soft probabilities instead. The hard mode is only used during decoding. We do not have to work around discrete decisions in this work, since the chunk boundaries are computed offline before training, resulting in a simpler model architecture.

Simultaneous Machine Translation
The problem of offline machine translation is to find the target sequence $e_1^I = e_1 \ldots e_I$ of length $I$ given the source sequence $f_1^J$ of length $J$. In contrast, simultaneous MT does not necessarily require the full source input to generate the target output. In this work, we formulate the problem by assuming a latent monotonic chunking underlying the source and target sequences.
Formally, let $s_1^K = s_1 \ldots s_k \ldots s_K$ denote the chunking sequence of $K$ chunks, such that $s_k = (i_k, j_k)$, where $i_k$ denotes the position of the last target word in the $k$-th chunk, and $j_k$ denotes the position of the last source word in the chunk. Since the source and target chunks are monotonic, the beginnings of the source and target chunks do not have to be defined explicitly. The chunk positions are subject to the following constraints:

  $i_0 = j_0 = 0, \quad i_K = I, \quad j_K = J, \quad i_{k-1} < i_k, \quad j_{k-1} < j_k$  (1)

We use $\tilde{e}_k = e_{i_{k-1}+1} \ldots e_{i_k}$ to denote the $k$-th target chunk, and $\tilde{f}_k = f_{j_{k-1}+1} \ldots f_{j_k}$ to denote its corresponding source chunk. The target sequence $e_1^I$ can be rewritten as $\tilde{e}_1^K$; similarly, the source sequence can be rewritten as $f_1^J = \tilde{f}_1^K$. We introduce the chunk sequence $s_1^K$ as a latent variable as follows:

  $p(e_1^I | f_1^J) = \sum_{K, s_1^K} p(e_1^I, s_1^K | f_1^J)$  (2)
  $= \sum_{K, s_1^K} p(\tilde{e}_1^K, s_1^K | \tilde{f}_1^K)$  (3)
  $= \sum_{K, s_1^K} \prod_{k=1}^{K} p(\tilde{e}_k, s_k | \tilde{e}_1^{k-1}, s_1^{k-1}, \tilde{f}_1^K)$  (4)
  $= \sum_{K, s_1^K} \prod_{k=1}^{K} p(i_k | \tilde{e}_1^{k}, s_1^{k-1}, j_k, \tilde{f}_1^K) \cdot p(\tilde{e}_k | \tilde{e}_1^{k-1}, s_1^{k-1}, j_k, \tilde{f}_1^K) \cdot p(j_k | \tilde{e}_1^{k-1}, s_1^{k-1}, \tilde{f}_1^K)$  (5)

where Equation 2 introduces the latent sequence $s_1^K$ with a marginalization sum over all possible chunk sequences and all possible numbers of chunks $K$. In Equation 3 we rewrite the source and target sequences using the chunk notation, and we apply the chain rule of probability in Equation 4. We use the chain rule again in Equation 5 to decompose the probability further into a target chunk boundary probability $p(i_k | \tilde{e}_1^k, s_1^{k-1}, j_k, \tilde{f}_1^K)$, a target chunk translation probability $p(\tilde{e}_k | \tilde{e}_1^{k-1}, s_1^{k-1}, j_k, \tilde{f}_1^K)$, and a source chunk boundary probability $p(j_k | \tilde{e}_1^{k-1}, s_1^{k-1}, \tilde{f}_1^K)$. This creates a generative story, where the source chunk boundary is determined first, followed by the translation of the chunk, and finally by the target chunk boundary. The translation probability can be further decomposed to reach the word level:

  $p(\tilde{e}_k | \tilde{e}_1^{k-1}, s_1^{k-1}, j_k, \tilde{f}_1^K) = \prod_{i=i_{k-1}+1}^{i_k} p(e_i | e_1^{i-1}, f_1^{j_k})$  (6)

In this work, we drop the marginalization sum over chunk sequences and use fixed chunks during training. The chunk sequences are generated as described in Section 6.
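The monotonicity constraints on the chunk sequence can be sketched as a small validity check (a hypothetical helper for illustration, not part of the proposed model):

```python
def is_valid_chunking(chunks, I, J):
    """Check that a chunk sequence s_k = (i_k, j_k) of target/source end
    positions is strictly monotonic on both sides and covers the target
    (length I) and source (length J) completely."""
    prev_i = prev_j = 0
    for i_k, j_k in chunks:
        # both end positions must strictly advance from chunk to chunk
        if i_k <= prev_i or j_k <= prev_j:
            return False
        prev_i, prev_j = i_k, j_k
    # the last chunk must end exactly at the sentence ends
    return prev_i == I and prev_j == J
```

For example, for a sentence pair of lengths I = J = 4, the chunking [(2, 2), (4, 4)] satisfies the constraints, while [(2, 2), (2, 4)] does not, since the target position does not advance.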

Source Chunk Boundary Detection
We simplify the chunk boundary probability, dropping the dependence on the target sequence and the previous target boundary decisions:

  $p(j_k | \tilde{e}_1^{k-1}, s_1^{k-1}, \tilde{f}_1^K) \approx p(j_k | j_1^{k-1}, f_1^{j_k})$  (7)

where the distribution is conditioned on the source sequence up to the last word of the $k$-th chunk, as well as on the previous source boundary decisions $j_1 \ldots j_{k-1}$. Instead of computing a distribution over the source positions, we introduce a binary random variable $b_{j,k}$ such that for each source position we estimate the probability of a chunk boundary:

  $p(j_k | j_1^{k-1}, f_1^{j_k}) = \prod_{j=j_{k-1}+1}^{j_k} p(b_{j,k} | b_1^{j-1}, f_1^j)$, with $b_{j,k} = 1$ if $j = j_k$ and $0$ otherwise  (8)

For this, we use a forward stacked RNN encoder. The $l$-th forward encoder layer is given by

  $\overrightarrow{h}_j^{(l)} = \mathrm{LSTM}\big(\overrightarrow{h}_{j-1}^{(l)}, \overrightarrow{h}_j^{(l-1)}\big), \quad l = 1, \ldots, L_{\mathrm{enc}}, \qquad \overrightarrow{h}_j^{(0)} = [\hat{f}_j; \hat{b}_{j-1,k}]$  (9)

where $\hat{f}_j$ is the word embedding of the word $f_j$, which is concatenated to the embedding $\hat{b}_{j-1,k}$ of the boundary decision at the previous source position, and $L_{\mathrm{enc}}$ is the number of encoder layers. On top of the last layer, a softmax estimates $p(b_{j,k})$:

  $p(b_{j,k} | b_1^{j-1}, f_1^j) = \mathrm{softmax}\big(g(\overrightarrow{h}_j^{(L_{\mathrm{enc}})})\big)$  (10)

where $g(\cdot)$ denotes a non-linear function.

Translation Model
We use an RNN attention model based on Bahdanau et al. (2015) for $p(e_i | e_1^{i-1}, f_1^{j_k})$. The model shares the forward encoder with the chunk boundary detection model. In addition, we extend the encoder with a stacked backward RNN encoder. The $l$-th backward layer is given by

  $\overleftarrow{h}_j^{(l)} = \mathrm{LSTM}\big(\overleftarrow{h}_{j+1}^{(l)}, \overleftarrow{h}_j^{(l-1)}\big)$ for $j \le j_k$, and $\overleftarrow{h}_j^{(l)} = 0$ for $j > j_k$  (11)

where the backward layer is computed within a chunk, starting at the last position of the chunk $j = j_k$, and $0$ indicates a vector of zeros for positions beyond the current chunk. The source representation is given by the concatenation of the last forward and backward layers:

  $h_j = [\overrightarrow{h}_j^{(L_{\mathrm{enc}})}; \overleftarrow{h}_j^{(L_{\mathrm{enc}})}]$  (12)

We also stack $L_{\mathrm{dec}}$ LSTM layers in the decoder:

  $u_{i,k}^{(l)} = \mathrm{LSTM}\big(u_{i-1,\tilde{k}}^{(l)}, u_{i,k}^{(l-1)}\big)$  (13)
  $u_{i,k}^{(0)} = [\hat{e}_i; d_{i,k}]$  (14)

where $\hat{e}_i$ is the target word embedding of the word $e_i$, and $\tilde{k} = k$ unless the previous decoder state belongs to the previous chunk, in which case $\tilde{k} = k - 1$. The vector $d_{i,k}$ is the context vector computed over source positions up to the last source position $j_k$ of the $k$-th chunk:

  $d_{i,k} = \sum_{j=1}^{j_k} \alpha_{i,j,k} \, h_j$  (15)
  $\alpha_{i,j,k} = \frac{\exp(r_{i,j,k})}{\sum_{j'=1}^{j_k} \exp(r_{i,j',k})}$  (16)
  $r_{i,j,k} = f\big(u_{i-1,k}^{(L_{\mathrm{dec}})}, h_j\big)$  (17)

where $\alpha_{i,j,k}$ is the attention weight normalized over the source positions $1 \le j \le j_k$, and $r_{i,j,k}$ is the energy computed via the function $f$, which uses $\tanh$ of the previous top-most decoder layer and the source representation at position $j$. Note the difference to the attention component used in offline MT, where the attention weights are computed considering the complete source sentence $f_1^J$. The output distribution is computed using a softmax function of energies from the top-most decoder layer $u_{i-1,k}^{(L_{\mathrm{dec}})}$, the target embedding of the previous word $\hat{e}_{i-1}$, and the context vector:

  $p(e_i | e_1^{i-1}, f_1^{j_k}) = \mathrm{softmax}\big(g(u_{i-1,k}^{(L_{\mathrm{dec}})}, \hat{e}_{i-1}, d_{i,k})\big)$  (18)

Target Chunk Boundary Factor
Traditionally, the translation model is trained to produce a sentence end token to know when to stop the decoding process. In our approach, this decision has to be made for each chunk (see next section). Hence, we have to train the model to predict the end positions of the chunks on the target side. For this, we use a target factor (García-Martínez et al., 2016; Wilken and Matusov, 2019), i.e. a second output of the decoder in each step:

  $p(b_i | e_1^i, f_1^{j_k}) = \mathrm{softmax}\big(g'(u_{i,k}^{(L_{\mathrm{dec}})}, \hat{e}_i, d_{i,k})\big)$  (19)

where $b_i$ is a binary random variable representing target chunk boundaries, analogous to $b_{j,k}$ on the source side, and $g'(\cdot)$ is a non-linear function. This probability corresponds to the first term in Equation 5, making the same model assumptions as for the translation probability. Note, however, that we make the boundary decision dependent on the embedding $\hat{e}_i$ of the target word produced in the current decoding step.

Search
Decoding in simultaneous MT can be seen as an asynchronous process that takes a stream of source words as input and produces a stream of target words as output. In our approach, we segment the incoming source stream into chunks and output a translation for each chunk individually, however always keeping the full source and target context.
Algorithm 1 explains the simultaneous decoding process. One source word $f_j$ (i.e. its embedding $\hat{f}_j$) is read at a time. We calculate the next step of the shared forward encoder (Equation 9), including source boundary detection (Equation 10). If the boundary probability $p(b_j)$ is below a certain threshold $t_b$, we continue reading the next source word $f_{j+1}$. If, however, a chunk boundary is detected, we first feed all word embeddings of the current chunk into the backward encoder (Equation 11), resulting in representations $\overleftarrow{h}_j$ for each of the words in the current chunk. After that, the decoder is run according to Equations 12-18. Note that it attends to the representations $\overrightarrow{h}$ and $\overleftarrow{h}$ of all source words read so far, not only those of the current chunk. Here, we perform beam search such that in each decoding step those combinations of target words and target chunk boundary decisions are kept that have the highest joint probability. A hypothesis is considered final as soon as it reaches a position $i$ where a chunk boundary $b_i = 1$ is predicted. Note that the length of a chunk translation is not restricted and hypotheses of different lengths compete. When all hypotheses in the beam are final, the first-best hypothesis is declared as the translation $\tilde{e}_k$ of the current chunk and all its words are flushed into the output stream at once.
During search, the internal states of the forward encoder and the decoder are saved between consecutive calls, while the backward encoder is initialized with a zero state for each chunk.
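The overall control flow of the decoding process can be sketched as follows; `boundary_prob` and `translate_chunk` are hypothetical stand-ins for the encoder's boundary output and the chunk-level beam search, which are not reimplemented here:

```python
def simultaneous_decode(source_stream, boundary_prob, translate_chunk, t_b=0.5):
    """Sketch of the simultaneous decoding loop (Algorithm 1).

    boundary_prob(src_context) -> float stands in for the source boundary
    detector; translate_chunk(src_context, tgt_context, chunk) -> list of
    target words stands in for the chunk-level beam search."""
    src_context, tgt_context, chunk, output = [], [], [], []
    for f_j in source_stream:            # read one source word at a time
        src_context.append(f_j)
        chunk.append(f_j)
        if boundary_prob(src_context) >= t_b:
            # chunk boundary detected: translate with full context so far
            e_k = translate_chunk(src_context, tgt_context, chunk)
            output.extend(e_k)           # flush the chunk translation at once
            tgt_context.extend(e_k)
            chunk = []
    if chunk:                            # translate any trailing partial chunk
        output.extend(translate_chunk(src_context, tgt_context, chunk))
    return output
```

With a toy detector that fires after every second word and a word-by-word dummy "translator", the incoming stream is segmented into two-word chunks and each chunk is emitted as a whole, mirroring the flushing behavior described above.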

Baseline Approach
We aimed at a meaningful segmentation of sentence pairs into bilingual chunks which can be translated in monotonic sequence, such that each chunk is, in terms of aligned words, translatable without consuming source words from succeeding chunks. We extract such a segmentation from unsupervised word alignments in source-to-target and target-to-source directions that we trained using the Eflomal toolkit (Östling and Tiedemann, 2016) and combined using the grow-diag-final-and heuristic (Koehn et al., 2003). Then, for each training sentence pair, we extract a sequence of "minimal-length" monotonic phrase pairs, i.e. a sequence of the smallest possible bilingual chunks which do not violate the alignment constraints and at the same time conform to the segmentation constraints in Equation 1. By this we allow word reordering between the two languages to happen only within the chunk boundaries. The method roughly follows the approach of Mariño et al. (2005), who extracted similar chunks as units for n-gram based statistical MT.
For fully monotonic word alignments, only chunks of length 1 on either the source or the target side are extracted (corresponding to 1-to-1, 1-to-M, and M-to-1 alignments). For non-monotonic alignments, larger chunks are obtained; in the extreme case, the whole sentence pair is one chunk. Any unaligned source or target words are attached to the chunk directly preceding them, and any unaligned words at the start of the source/target sentence are attached to the first chunk. We perform the word alignment and chunk boundary extraction on the word level, and then convert words to subword units for the subsequent use in NMT.
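The extraction of minimal monotonic chunks can be sketched as follows. This simplified sweep is our own reconstruction for illustration (1-based alignment links, with the attachment heuristics described above), not the exact implementation used in the experiments:

```python
def extract_chunks(alignment, J, I):
    """Extract minimal monotonic bilingual chunks from word alignments.

    alignment: set of 1-based (source_pos, target_pos) links.
    J, I: source and target sentence lengths.
    Returns chunk end positions [(j_1, i_1), ..., (j_K, i_K)]."""
    aligned_src = {sj for sj, ti in alignment}
    aligned_tgt = {ti for sj, ti in alignment}
    chunks, prev_j, prev_i, j = [], 0, 0, 0
    while j < J:
        j += 1
        # smallest target span implied by the source span (prev_j, j]
        i = max((ti for sj, ti in alignment if prev_j < sj <= j),
                default=prev_i)
        if i <= prev_i:
            continue  # current source word not aligned yet, grow the chunk
        # close the chunk: no alignment link may cross its borders
        changed = True
        while changed:
            changed = False
            for sj, ti in alignment:
                if prev_i < ti <= i and sj > j:
                    j, changed = sj, True
                if prev_j < sj <= j and ti > i:
                    i, changed = ti, True
        # attach directly following unaligned words to this chunk
        while j < J and (j + 1) not in aligned_src:
            j += 1
        while i < I and (i + 1) not in aligned_tgt:
            i += 1
        chunks.append((j, i))
        prev_j, prev_i = j, i
    return chunks if chunks else [(J, I)]
```

A fully monotonic 1-to-1 alignment yields single-word chunks, a crossing alignment (as in "seen it" / "es gesehen") collapses into one larger chunk, and unaligned words are absorbed by the preceding chunk.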

Delayed Source Chunk Boundaries
We observed that the accuracy of source boundary detection can be improved significantly by including the words immediately following the source chunk boundary into the context. Take, for example, the source word sequence "I have seen it". It can be translated into German as soon as the word "it" has been read: "Ich habe es gesehen". Therefore, the model is likely to predict a chunk boundary after "it". However, if the next source word read is "coming", it becomes clear that we should have waited, because the correct German translation is now "Ich habe es kommen gesehen". There is a reordering which invalidates the previous partial translation.
To be able to resolve such cases, we shift the source chunks by a constant delay $D$ such that $j_1, \ldots, j_k, \ldots, j_K$ becomes $j_1 + D, \ldots, j_k + D, \ldots, j_K + D$. Note that the target chunks remain unchanged, thus the extra source words also provide an expanded context for translation. In preliminary experiments we saw large improvements in translation quality when using a delay of 2 or more words, therefore we use it in all further experiments.
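The boundary shift itself is a one-liner; capping the shifted positions at the sentence length J for sentence-final chunks is our assumption in this sketch:

```python
def delay_boundaries(boundaries, D, J):
    """Shift each source chunk end position j_k by a constant delay D.
    Positions are capped at the sentence length J (an assumption made here
    for boundaries near the sentence end); target chunks stay unchanged."""
    return [min(j_k + D, J) for j_k in boundaries]
```

For example, with a delay of D = 2 in a source sentence of length 6, the boundaries [2, 4] become [4, 6].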

Improved Chunks for More Context
The baseline chunking method (Section 6.1) considers word reordering to determine the necessary context for translation. However, future context is often also necessary for a correct translation. Consider the translation "The beautiful woman" → "Die schöne Frau". Here, despite the monotonic alignment, we need the context of the third English word "woman" to translate the first two words, as we have to decide on the gender and number of the German article "Die" and adjective "schöne".
In part, this problem is already addressed by adding future source words into the context as described in Section 6.2. However, this method causes a general increase in latency by D source positions and yet covers only short-range dependencies. A better approach is to remove any chunk boundary for which the words following it are important for a correct translation of the words preceding it. To this end, we introduce a heuristic that uses two bigram target language models (LMs). The first language model yields the probability $p(e_{i_k} | e_{i_k - 1})$ for the last word $e_{i_k}$ of chunk $s_k$, whereas the second one computes the probability $p(e_{i_k} | e_{i_k + 1})$ of the last word in the chunk given the first word $e_{i_k + 1}$ of the next chunk $s_{k+1}$. The chunk boundary after $e_{i_k}$ is removed if the probability of the latter, reverse bigram LM is higher than the probability of the first one by a factor $l = \sqrt{i_k - i_{k-1}}$, i.e. dependent on the length of the current chunk. The motivation for this factor is that shorter chunks should be merged with the context to the right more often than chunks which are already long, provided that the right context word has frequently been observed in training to follow the last word of such a chunk candidate. The two bigram LMs are estimated on the target side of the bilingual data, with the second one trained on sentences printed in reverse order.

Figure 1: Examples of the baseline and the improved approach of extracting chunk boundaries. Note how in the improved approach noun phrases were merged into single bigger chunks. Also note the long last chunk that corresponds to the non-monotonic alignment of the English and German subordinate clauses.
Examples of the chunks extracted with the baseline and the improved approach for a given training sentence pair are shown in Figure 1.
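The merging heuristic can be sketched as follows; `p_fwd` and `p_rev` are hypothetical stand-ins for the forward and reverse bigram LMs, and 0-based list indexing is used for the 1-based positions $i_k$:

```python
import math

def merge_boundaries(target, boundaries, p_fwd, p_rev):
    """Drop the chunk boundary after e_{i_k} if the reverse bigram LM score
    p(e_{i_k} | e_{i_k + 1}) exceeds the forward bigram LM score
    p(e_{i_k} | e_{i_k - 1}) by the factor l = sqrt(i_k - i_{k-1}).

    target: target sentence as a list of words (0-based indexing).
    boundaries: 1-based chunk end positions i_k, in increasing order.
    p_fwd(word, prev_word), p_rev(word, next_word): bigram LM scores."""
    kept, prev = [], 0
    for i_k in boundaries:
        if i_k >= len(target):
            kept.append(i_k)          # never drop the sentence-final boundary
            prev = i_k
            continue
        last = target[i_k - 1]                           # e_{i_k}
        before = target[i_k - 2] if i_k >= 2 else "<s>"  # e_{i_k - 1}
        nxt = target[i_k]                                # e_{i_k + 1}
        factor = math.sqrt(i_k - prev)                   # current chunk length
        if p_rev(last, nxt) > factor * p_fwd(last, before):
            continue                  # merge with the following chunk
        kept.append(i_k)
        prev = i_k
    return kept
```

For instance, if the reverse LM strongly prefers "woman" as the right context of "beautiful", the boundary inside the noun phrase "the beautiful woman" is dropped and the phrase becomes a single chunk.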

Streaming Speech Recognition
To translate directly from the speech signal, we use a cascaded approach. The proposed simultaneous NMT system consumes words from a streaming automatic speech recognition (ASR) system. This system is based on a hybrid LSTM/HMM acoustic model (Bourlard and Wellekens, 1989; Hochreiter and Schmidhuber, 1997), trained on a total of approx. 2300 hours of transcribed English speech from the corpora allowed by the IWSLT 2020 evaluation, including MUST-C, TED-LIUM, and LibriSpeech. The acoustic model takes 80-dim. MFCC features as input and estimates state posterior probabilities for 5000 tied triphone states. It consists of 4 bidirectional layers with 512 LSTM units for each direction. Frame-level alignment and state tying were bootstrapped with a Gaussian mixture acoustic model. The LM of the streaming recognizer is a 4-gram count model trained with Kneser-Ney smoothing on English text data (approx. 2.8B running words) allowed by the IWSLT 2020 evaluation. The vocabulary consists of 152K words and the out-of-vocabulary rate is below 1%. Acoustic training and HMM decoding were performed with the RWTH ASR toolkit (Wiesler et al., 2014).
The streaming recognizer implements a version of chunked processing (Chen and Huo, 2016; Zeyer et al., 2016) which allows the same BLSTM-based acoustic model to be used in both offline and online applications. By default, the recognizer updates the current first-best hypothesis by Viterbi decoding starting from the most recent frame and returns the resulting word sequence to the client. This makes the first-best hypothesis "unstable", i.e. past words can change depending on newly received evidence, due to the global optimization of the Viterbi decoding. To make the output more stable, we made the decoder delay the recognition results until all active word sequences share a common prefix. This prefix is then guaranteed to remain unchanged independent of the rest of the utterance and thus can be sent out to the MT model.
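The prefix stabilization boils down to computing the longest common word prefix of all active hypotheses; only this prefix is forwarded to the MT model. A minimal sketch:

```python
def stable_prefix(hypotheses):
    """Return the longest common word prefix of all active ASR hypotheses.
    Only this prefix is guaranteed not to change as more audio arrives,
    so only it is sent on to the MT model."""
    if not hypotheses:
        return []
    prefix = []
    for words in zip(*hypotheses):       # step through positions in lockstep
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break                        # hypotheses diverge here
    return prefix
```

For example, if the beam contains "i have seen it" and "i have seen a", only "i have seen" is stable and can be flushed to the translation system.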

Experiments
We conduct experiments on the IWSLT simultaneous translation task for speech translation of TED talks from English to German.

Setup
For training the baseline NMT system, we utilize the parallel data allowed for the IWSLT 2020 evaluation. We divide it into 3 parts: in-domain, clean, and out-of-domain. We consider data from the TED and MUST-C corpora (Di Gangi et al., 2019) as in-domain and use it for subsequent fine-tuning experiments, as well as the "ground truth" for filtering the out-of-domain data based on sentence embedding similarity with the in-domain data; details are given in (Bahar et al., 2020). As "clean" we consider the News-Commentary, Europarl, and WikiTitles corpora and use their full versions in training. As out-of-domain data, we consider the OpenSubtitles, ParaCrawl, CommonCrawl, and Rapid corpora, which we reduce to 40% of their total size, or 23.2M parallel lines, with similarity-based filtering. Thus, in total, we use almost 26M lines of parallel data to train our systems, which amounts to ca. 327M running words on the English side. Furthermore, we added 7.9M sentence pairs or ca. 145M running words of similarity-filtered back-translated German monolingual data allowed by the IWSLT 2020 evaluation.
In training, the in-domain and clean parallel data had a weight of 5. All models were implemented and trained with the RETURNN toolkit (Zeyer et al., 2018). We used an embedding size of 620 and LSTM state sizes of 1000.
As heldout tuning set, we use a combination of IWSLT dev2010, tst2014, and MUST-C-dev corpora. To obtain bilingual chunks as described in Section 6, we word-align all of the filtered parallel/back-translated and tuning data in portions of up to 1M sentence pairs, each of them combined with all of the in-domain and clean parallel data. As heldout evaluation sets, we use IWSLT tst2015, as well as MUST-C HE and COMMON test data.
For the text input condition, we applied almost no preprocessing; tokenization was handled as part of the subword segmentation with the sentencepiece toolkit (Kudo and Richardson, 2018). The vocabularies for both the source and the target subword models had a size of 30K. For the speech input condition, additional preprocessing was applied to the English side of the parallel data with the goal of making it resemble speech transcripts. We lowercased the text, removed all punctuation marks, expanded common abbreviations, especially for measurement units, and converted numbers, dates, and other entities expressed with digits into their spoken form. For cases with multiple readings of a given number (e.g. one oh one, one hundred and one), we selected one randomly, so that the system could learn to convert alternative readings in English to the same number expressed with digits in German. Because of this preprocessing, our system for the speech condition learned to insert punctuation marks, restore word case, and convert spoken number and entity forms to digits as part of the translation process.
The proposed chunking method (Section 6) is applied to the training corpus as a data preparation step. We measured average chunk lengths of 2.9 source words and 2.7 target words. 40% of both the source and target chunks consist of a single word, about 20% are longer than 3 words.
We compute case-sensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores as well as the average lagging (AL) metric. Table 1 shows results for the proposed simultaneous MT system. For reference, we first provide the translation quality of an offline system that is trained on full sentences. It is a transformer "base" model (Vaswani et al., 2017) that we trained on the same data as the online systems. Row 1 shows BLEU and TER scores for the translation of the human reference transcription of the speech input (converted to lower case, punctuation removed), whereas row 2 uses the automatic transcription generated by our streaming ASR system (Section 7). The ASR system has a word error rate (WER) of 8.7 to 11.2% on the three test sets, causing a drop of 4-6% BLEU absolute.

Results
All following systems are cascaded streaming ASR + MT online systems that produce translations from audio input in real time. These systems have an overall AL of 4.1 to 4.5 seconds, depending on D. We compare two categories of models: unidirectional and bidirectional. For the unidirectional models, the backward encoder (Equation 11) was removed from the architecture. We show results for different values of the source boundary delay D (see Section 6.2). For the number of layers, we choose L_enc = 6 and L_dec = 2 for the unidirectional models, and L_enc = 4 (both directions) and L_dec = 1 for the bidirectional models, such that the number of parameters is comparable. Contradicting our initial assumption, bidirectional models do not outperform unidirectional models. This might be due to the fact that the majority of chunks are too short to benefit from a backward encoding. Also, the model is not sensitive to the delay D. This confirms our assumption that the additional context of future source words is primarily useful for making the source boundary decision, and for this a context of 2 following (sub-)words is sufficient. For translation, the model does not depend on this "extra" context but instead is able to make sufficiently good chunking decisions. Table 2 shows results for the case of streamed text input (cased and with punctuation marks). We compare our results to a 4-layer unidirectional system that was trained using the wait-k policy. For this, we chunk the training data into single words, except for a first chunk of size k = 9 on the source side, and set the delay to D = 0. All of our systems outperform this wait-k system by large margins. We conclude that the alignment-based chunking proposed in Section 6 is able to provide better source context than a fixed policy, and that the source boundary detection component described in Section 4.1 successfully learns to reproduce this chunking at inference time.
Also for the text condition, we do not observe large differences between uni- and bidirectional models or between different delays.
For all systems, we report AL scores averaged over all test sets. Figure 2 breaks down the scores to the individual test sets for the bidirectional models. For a source boundary delay D = 2 we observe an AL of 4.6 to 4.7 words. When increasing D, we increase the average lagging score by roughly the same amount, which is expected, since the additional source context for the boundary decision is not translated in the same step where it is added. As discussed before, translation quality does not consistently improve from increasing D.
We found tuning of the length normalization to be important, as the average decoding length for chunks is much shorter than in offline translation. For optimal results, we divided the model scores by $I^\alpha$, $I$ being the target length, and tuned the parameter $\alpha$. Figure 3 shows that $\alpha = 0.9$ works best in our experiments, independent of the source boundary delay $D$. This value is used in all experiments.
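As a minimal sketch, chunk hypotheses are compared by their length-normalized score, i.e. the log-probability divided by $I^\alpha$:

```python
def normalized_score(log_prob, target_length, alpha=0.9):
    """Length-normalized hypothesis score: the model's log-probability is
    divided by I**alpha, I being the target length of the hypothesis.
    alpha = 0.9 worked best in the experiments described above."""
    return log_prob / (target_length ** alpha)
```

With alpha = 0 this falls back to the unnormalized model score, while larger values of alpha increasingly favor longer chunk translations over shorter ones.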
Furthermore, we found translation quality to be very sensitive to source boundary probability thresholds $t_b$ other than 0.5. This means the "translating" part of the network strongly adapts to the chunking component.

Conclusion
We proposed a novel neural model architecture for simultaneous MT that incorporates a component for splitting the incoming source stream into translatable chunks. We presented how we generate training examples for such chunks from statistical word alignment and how those can be improved via language models. Experiments on the IWSLT 2020 English-to-German task showed that the proposed learned source chunking outperforms a fixed wait-k strategy by a large margin. We also investigated the value of backward source encoding in the context of simultaneous MT by comparing uni- and bidirectional versions of our architecture.