Direct Segmentation Models for Streaming Speech Translation

The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual but also acoustic information to decide where the ASR output should be split into chunks. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.


Introduction
ST is a field closely aligned with ASR and MT, as it is the natural evolution of combining the advances in both areas. The goal is to obtain the translation of an utterance spoken in a different language, without necessarily requiring the intermediate transcription. At the same time, it is desirable to produce high-quality translations without compromising the speed of the system. Although research into ST started in the nineties (Waibel et al., 1991), the field did not really take off until significant breakthroughs were achieved in ASR (Chan et al., 2016; Irie et al., 2019; Park et al., 2019) and MT (Bahdanau et al., 2015; Sennrich et al., 2016a,b; Vaswani et al., 2017), mainly due to the introduction of deep neural networks (NN). Thanks to this, the field has recently attracted significant attention from both the research and industry communities, as it is now mature enough to support tangible and well-performing applications (Jia et al., 2019).
Currently, there are two main approaches to ST: cascade and end-to-end models. The goal of the end-to-end approach is to train a single system that is able to carry out the entire translation process (Weiss et al., 2017; Berard et al., 2018; Gangi et al., 2019), which has only recently become possible thanks to advances in neural modeling. Due to the scarcity of ST data, techniques such as pre-training and data augmentation (Bahar et al., 2019; Pino et al., 2019) have been used to alleviate this lack of data. It is important to remark that the currently proposed end-to-end models work in an offline manner and must process the entire input sequence; therefore, they cannot be used in a streaming scenario.
In the cascade approach, an ASR system transcribes the input speech signal, and its output is fed to a downstream MT system that carries out the translation. The input provided to the MT step can be the single-best hypothesis, but also n-best lists (Ng et al., 2016) or even lattices (Matusov and Ney, 2011; Sperber et al., 2019). Additional techniques can also be used to improve the performance of the pipeline by better adapting the MT system to the expected input, such as training with transcribed text (Peitz et al., 2012) or chunking (Sperber et al., 2017). The cascade approach can take advantage of independent developments in ASR and MT, and it is significantly easier to train due to greater data availability. Thus, it is very relevant to study improvements to the ST cascade pipeline.
This work focuses on the effects of segmentation in streaming ST, as cascade systems still outperform end-to-end systems in standard setups (Niehues et al., 2019; Pino et al., 2019). Following a cascade approach, a streaming ST setup can be achieved with individual streaming ASR and MT components. Advances in neural streaming ASR (Zeyer et al., 2016; Jorge et al., 2019) allow the training of streaming models whose performance is very similar to that of offline ones. Recent advances in simultaneous MT show promise (Arivazhagan et al., 2019), but current models entail additional modelling and training complexity, and are not ready for the translation of long streams of input text. For the scenario considered here (translation of parliamentary speeches, with an average duration of 100 seconds), the ST system is required to have a minimum throughput, but simultaneous translation is not required, so the translation of chunks (a chunk being understood as a sequence of words) is still acceptable. In this case we prioritize quality over simultaneity, with a streaming ASR system followed by a standard offline MT system. This way, the resulting cascade ST system can provide transcribed words in real time, which are eventually split into chunks to be translated by the offline MT system.
Following this approach, it is necessary to incorporate a segmentation component in the middle in order to split the output of the ASR system into (hopefully semantically self-contained) chunks that can be successfully processed by the MT model, while maintaining a balance between latency and quality. In (Cho et al., 2012, 2017), the authors approach this problem by training a monolingual MT system that predicts punctuation marks, and the ASR output is then segmented into chunks based on this punctuation. Another approach is to segment the ASR output using a language model (LM) that estimates the probability of a new chunk starting (Stolcke and Shriberg, 1996; Wang et al., 2016). Binary classifiers with part-of-speech and reordering features have also been proposed (Oda et al., 2014; Siahbani et al., 2018). It is also possible to segment the ASR output using handcrafted heuristics, such as those based on a fixed number of words per chunk (Cettolo and Federico, 2006) or on acoustic information (Fügen et al., 2007). These heuristic approaches have the disadvantage of being very domain- and speaker-specific. Alternatively, segmentation can be integrated into the decoder, so that it is carried out on the target side rather than the source side (Kolss et al., 2008).
This work introduces a statistical framework for the problem of segmentation in ST which incorporates both textual and acoustic information. Within this framework, we propose a set of novel models, and a series of extensive experiments shows how these new models outperform previously proposed segmentation models. In addition, we study the effect of the preprocessing scheme applied to the input of the MT system, the performance degradation attributable to transcription and/or segmentation errors, and the latency introduced by the components of the ST system.
This paper is organized as follows. The next section describes the statistical framework of the segmenter in the streaming ST scenario. Section 3 details how our proposed models are instantiated in this framework. Then, Section 4 describes the Europarl-ST dataset used in the experiments and the three main components of our streaming ST system based on a cascade approach: the ASR and MT systems, and the segmentation models. Next, Section 5 reports a detailed evaluation in terms of BLEU score on the Europarl-ST dataset, comparative results with previous work, and latency figures. Finally, Section 6 draws the main conclusions of this work and outlines future research lines.

Statistical framework
We define the streaming ST segmentation as a problem in which a continuous sequence of words, provided as the output of an ASR system, is segmented into chunks. These chunks will then be translated by a downstream MT system. The goal of the segmentation is to maximize the resulting translation accuracy while keeping latency within the response-time requirements of our streaming scenario.
Formally, the segmentation problem is the task of dividing a sequence of input words $x_1^J$ into non-overlapping chunks. We represent this with a sequence of split/non-split decisions $y_1^J$, with $y_j = 1$ if the associated word $x_j$ ends a chunk, and $y_j = 0$ otherwise. Optionally, additional input features can be used. In this work, we use word-based acoustic features $a_1^J$ aligned with the sequence of words output by the ASR system.
Ideally, we would choose the segmentation $\hat{y}_1^J$ such that
$$\hat{y}_1^J = \operatorname*{arg\,max}_{y_1^J} \; P(y_1^J \mid x_1^J).$$
However, in a streaming setup, we need to bound the dependency on the input sequence to $w$ words into the future (hereafter, future window) to meet latency requirements,
$$\hat{y}_j = \operatorname*{arg\,max}_{y_j} \; P(y_j \mid x_1^{j+w}, \hat{y}_1^{j-1}).$$
Indeed, for computational reasons, the sequence is also bounded to $h$ words into the past (hereafter, history size),
$$\hat{y}_j = \operatorname*{arg\,max}_{y_j} \; P(y_j \mid x_{j-h}^{j+w}, \hat{y}_{j-h}^{j-1}).$$
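To make the bounded decision rule concrete, the following is a minimal sketch of a greedy streaming segmenter in Python. The `split_prob` callable is a hypothetical stand-in for any model of $P(y_j \mid x_{j-h}^{j+w}, \hat{y}_{j-h}^{j-1})$; the 0.5 threshold and the end-of-stream handling are illustrative assumptions, not part of the framework itself.

```python
from typing import Callable, List

def greedy_segment(words: List[str],
                   split_prob: Callable[[List[str], List[int]], float],
                   h: int = 10, w: int = 4) -> List[int]:
    """Greedy split/non-split decisions over a stream of ASR words.

    The decision for word j is only taken once the w words of the future
    window are available, which bounds the latency of the segmenter.
    """
    decisions: List[int] = []
    for j in range(len(words) - w):
        context = words[max(0, j - h): j + w + 1]   # x_{j-h}^{j+w}
        history = decisions[max(0, j - h): j]       # y_{j-h}^{j-1}
        decisions.append(1 if split_prob(context, history) >= 0.5 else 0)
    # End of stream (assumes len(words) >= w >= 1): no future context
    # remains, so the last word is marked as a split to flush the final chunk.
    decisions += [0] * (w - 1) + [1]
    return decisions
```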
Previous works in the literature can be stated as particular cases of the statistical framework defined above under certain assumptions.
LM based segmentation (Stolcke and Shriberg, 1996; Wang et al., 2016). In this approach, an n-gram LM is used to compute the probability $P(y_j \mid x_{j-h}^{j})$, where $y_j$ is zero or one depending on whether a non-split or a split decision is taken, respectively. Split and non-split probabilities are combined into a function $f$ to decide whether a new chunk is defined after $x_j$:
$$\hat{y}_j = \operatorname*{arg\,max}_{y_j} \; f\left(P(y_j)\right).$$
Monolingual MT segmentation (Cho et al., 2012, 2017). Following this setup, a monolingual MT model translates the original (unpunctuated) words $x_1^J$ into a new sequence $z_1^J$ that contains segmentation information (via punctuation marks). Each $z_j$ can be understood as a pair $(x_j, y_j)$, so the segmentation can be defined as an MT problem that basically reverts to
$$\hat{y}_1^J = \operatorname*{arg\,max}_{y_1^J} \; P(y_1^J \mid x_1^J),$$
since $x_1^J$ is given.
In contrast with these previous approaches, which treat segmentation as a by-product of a more general task, we propose a model that directly represents the probability of the split/non-split decision.

Direct Segmentation Model
Now that the theoretical framework has been introduced, we describe our proposed segmentation model. This approach has the advantage of allowing a dependency on future words and of considering not only textual but also acoustic features, which provides the model with additional evidence for making better split/non-split decisions.
First, the Text model computes text state vectors $s_j^{j+w}$ that consider each word in $x_{j-h}^{j+w}$, using an embedding function $e(\cdot)$ and one or more recurrent layers, represented by the function $f_1(\cdot)$. In order to incorporate information about previous decisions $y_{j-h}^{j-1}$, we create a new sequence $\tilde{x}_{j-h}^{j+w}$ by inserting an end-of-chunk token into the text input sequence every time a split decision has been taken. This sequence is bounded in length by $h$.
Then, the state vectors are defined as
$$s_i = f_1\left(e(\tilde{x}_i),\, s_{i-1}\right), \qquad j-h \le i \le j+w.$$
Next, the split probability is computed by concatenating the state vectors of the current word and those in the future window, and passing them through a series of feedforward layers $f_2(\cdot)$,
$$P\left(y_j \mid x_{j-h}^{j+w}, y_{j-h}^{j-1}\right) = f_2\left([s_j; s_{j+1}; \ldots; s_{j+w}]\right).$$
If we include acoustic information, acoustic state vectors are computed using a function $f_3(\cdot)$ and concatenated with the text state vectors in order to compute the split/non-split probability,
$$P\left(y_j \mid x_{j-h}^{j+w}, y_{j-h}^{j-1}, a_{j-h}^{j+w}\right) = f_2\left([s_j; \ldots; s_{j+w}; f_3(a)_j; \ldots; f_3(a)_{j+w}]\right).$$
In the case of audio information, we assess two variants, depending on whether the acoustic sequence is passed through an RNN (Audio w/ RNN) or not (Audio w/o RNN). The word-based acoustic feature vectors are obtained as follows. The Audio w/o RNN option (also referred to as copy) extracts three acoustic features associated with each word: the duration of the current word, the duration of the previous silence (if any), and the duration of the next silence (if any). These three features were selected for their effectiveness (indeed, they improve system performance) and because, being word-based, they can be directly integrated into the proposed model. At training time, these features are obtained by carrying out a forced alignment between the audio and the reference transcription, while at test time they are provided directly by the ASR system. The Audio w/ RNN option adds an independent RNN as $f_3$ to process the sequence $a_{j-h}^{j+w}$ of three-dimensional acoustic feature vectors just described, and the resulting acoustic state vectors are concatenated at word level with the text state vectors. Whenever acoustic features are used, the Text model is first pre-trained and frozen, and then the feedforward network is updated with the acoustic data.
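As a concrete illustration, below is a minimal PyTorch sketch of the Audio w/o RNN (copy) variant. The choice of a GRU, the layer sizes and the two-layer classifier are our own illustrative assumptions, not the exact configuration used in this work (the hyperparameter values explored are those of Appendix D).

```python
import torch
import torch.nn as nn

class DirectSegmenter(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 256,
                 window: int = 4, n_audio: int = 3):
        super().__init__()
        self.window = window
        self.embed = nn.Embedding(vocab_size, emb_dim)          # e()
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)   # f1()
        in_dim = (window + 1) * (hid_dim + n_audio)
        self.classifier = nn.Sequential(                        # f2()
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, 2),  # non-split / split
        )

    def forward(self, tokens: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # tokens: (B, h+w+1) ids of x~_{j-h}^{j+w}, including the end-of-chunk
        # tokens inserted after previous split decisions.
        # audio:  (B, h+w+1, 3) word duration and adjacent silence durations.
        states, _ = self.rnn(self.embed(tokens))                # (B, h+w+1, hid)
        # Copy variant: concatenate the raw acoustic features to each state,
        # then keep only the current word and its future window.
        tail = torch.cat([states, audio], dim=-1)[:, -(self.window + 1):]
        return self.classifier(tail.flatten(1)).log_softmax(-1)

# Usage with h=10, w=4 (so h+w+1 = 15 positions per decision):
model = DirectSegmenter(vocab_size=32_000)
tokens = torch.randint(0, 32_000, (8, 15))
audio = torch.rand(8, 15, 3)
log_p = model(tokens, audio)  # (8, 2): log P(y_j | context)
```

For the Audio w/ RNN variant, `audio` would first be passed through its own recurrent layer ($f_3$) before the concatenation; in either case, the text branch is pre-trained and frozen before the classifier is updated with the acoustic inputs.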

Experimental setup
To study the effects of our streaming ST segmenter in terms of BLEU score (Papineni et al., 2002), state-of-the-art ASR and MT systems were trained to perform ST from German (De), Spanish (Es) and French (Fr) into English (En), and vice versa. The ASR and MT systems were treated as black boxes in order to focus our efforts on evaluating the proposed streaming ST segmentation models on the recently released and publicly available Europarl-ST corpus. Basic statistics of the six language pairs of the Europarl-ST corpus involved in the evaluation are shown in Table 1.

ASR systems
In our cascade ST setting, the input speech signal is segmented into speech/non-speech regions using a Gaussian Mixture Model-Hidden Markov Model based voice activity detection (VAD) system (Silvestre-Cerdà et al., 2012), which will be referred to as the baseline segmentation system. Detected speech chunks are delivered to our general-purpose hybrid ASR systems for German (De), English (En), Spanish (Es) and French (Fr). On the one hand, acoustic models (AM) were generated using the TLK toolkit (del Agua et al., 2014) to train feed-forward deep neural network-hidden Markov models (FFN-HMM). These models were used to bootstrap bidirectional long short-term memory (BLSTM) NN models (Zeyer et al., 2017), trained using TensorFlow (Abadi et al., 2015), except for the French ASR system, which only features FFNs. These AMs were trained with 0.9K (De), 5.6K (En), 3.9K (Es), and 0.7K (Fr) hours of speech data from multiple sources and domains.
On the other hand, language models (LM) consist of a linear interpolation of several n-gram LMs trained with SRILM (Stolcke, 2002), combined with a recurrent NN (RNN) LM trained using the RNNLM toolkit (Mikolov, 2011) for German and French, or an LSTM LM trained with the CUED toolkit for Spanish and English; further details are given in Appendix B.

MT systems
Neural MT systems were trained for each of the translation directions under study using the fairseq toolkit (Ott et al., 2019). The initial models are general out-of-domain systems trained with millions (M) of sentences: 21.0M for De↔En, 21.1M for En↔Es and 38.2M for En↔Fr. These models followed the sentence-level Transformer (Vaswani et al., 2017) BASE configuration, and were fine-tuned using the Europarl-ST training data.
Two MT systems were trained for each translation direction, depending on the preprocessing scheme applied to the source sentences of the training set. The first scheme uses conventional MT preprocessing (tokenization, truecasing, etc.), while the second applies a special ST preprocessing to the source side of the training set, by lowercasing, transliterating and removing punctuation marks from all sentences (Matusov et al., 2018). This latter scheme guarantees that the MT input meets the same conditions at training and inference time. Since conventional MT preprocessing was applied to the target side, the hope is that the model also learns to recover casing and punctuation information on the target side. Both preprocessing schemes were evaluated by translating ASR hypotheses provided in chunks given by the baseline VAD segmenter. Results are shown in Table 2. As the segmentation is different from that of the reference, the evaluation is carried out by re-segmenting the translations so that they match the segmentation of the reference (Matusov et al., 2005).
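For illustration, here is a rough Python sketch of this source-side ST preprocessing, assuming a simple rule set; the exact tokenization and transliteration rules applied in practice (Matusov et al., 2018) may differ, e.g. language-specific mappings such as German umlauts to ae/oe/ue.

```python
import re
import unicodedata

def st_preprocess(sentence: str) -> str:
    """Lowercase, transliterate and strip punctuation, mimicking ASR output."""
    s = sentence.lower()
    # Illustrative transliteration: decompose accented characters and drop
    # the combining marks.
    s = "".join(c for c in unicodedata.normalize("NFKD", s)
                if not unicodedata.combining(c))
    # Remove punctuation marks, which are absent from ASR output.
    s = re.sub(r"[^\w\s]", " ", s)
    # Collapse the whitespace left behind by the removals.
    return " ".join(s.split())

print(st_preprocess("Mr President, I would like to ask: why?"))
# -> "mr president i would like to ask why"
```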
As shown in Table 2, BLEU score improvements of the ST scheme over the MT scheme range from 4.1 (En-De) to 7.5 (En-Es) points, due to the fact that the ST source preprocessing scheme fixes the mismatch between training and inference time. At the same time, the MT systems are able to recover punctuation information that was not available in the ASR output. Thus, the special ST preprocessing scheme was applied in the rest of the experiments.

Segmentation models
Depending on the segmentation model, text and, optionally, audio from Europarl-ST were used as training data. As a preprocessing step, an end-of-chunk token was inserted into the text training data after each chunk-delimiting punctuation mark (full stop, question/exclamation marks, etc.). In addition, the ST preprocessing scheme was applied to the annotated reference transcriptions in order to obtain training data that mimics ASR output. In the case of the Audio models, as mentioned before, audio and reference transcriptions were force-aligned using the AMs described in Section 4.1 in order to compute word and silence durations as acoustic features.
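A minimal sketch of this chunk-marking step is shown below; the `<eoc>` token name and the set of punctuation marks are illustrative assumptions.

```python
import re

EOC = "<eoc>"  # hypothetical end-of-chunk token

def mark_chunks(text: str) -> str:
    """Insert an end-of-chunk token after chunk-delimiting punctuation."""
    return re.sub(r"([.!?])(\s|$)", rf"\1 {EOC}\2", text)

print(mark_chunks("Thank you. Any questions?"))
# -> "Thank you. <eoc> Any questions? <eoc>"
```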
Due to the class imbalance present in the segmentation problem (95% of samples belong to the non-split class), training batches were prepared by weighted random sampling so that, on average, one third of the samples belong to the split class. Otherwise, the model degenerates into always predicting the non-split class.
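A minimal sketch of this rebalanced batching using PyTorch's WeightedRandomSampler follows; the synthetic data and exact weight values are illustrative, and only the one-third target matches the setup described above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic data mimicking the real imbalance: ~5% split (1), ~95% non-split (0).
labels = (torch.rand(10_000) < 0.05).long()
features = torch.randn(len(labels), 8)

# Per-class weights chosen so the total sampling mass is 1.0 for non-split and
# 0.5 for split, i.e. P(split) = 0.5 / 1.5 = 1/3 of each batch on average.
n_split = int(labels.sum())
n_nonsplit = len(labels) - n_split
class_weights = torch.tensor([1.0 / n_nonsplit, 0.5 / n_split])
sampler = WeightedRandomSampler(class_weights[labels],
                                num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(features, labels),
                    batch_size=256, sampler=sampler)
xb, yb = next(iter(loader))
print(float(yb.float().mean()))  # ~0.33: one third of the batch are splits
```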
The Text model consists of 256-unit word embeddings together with the recurrent and feedforward layers described in Section 3; the hyperparameter values explored are listed in Appendix D. Given the sequential nature of the split/non-split decision process as the streaming ASR output is processed, greedy and beam search decoding algorithms were implemented and compared, but negligible differences were observed between them.

Evaluation
In order to perform an evaluation that simulates real conditions, the ASR hypothesis of an entire speech (an intervention by an MEP, with an average duration of 100 seconds) is fed to the segmentation model, whose generated chunks are translated by the MT system. The chunks are translated independently of each other. The quality of the MT output, in terms of BLEU score, provides a clear indication of the performance of the streaming ST segmenter and allows us to compare different segmenters.
Figure 2 shows BLEU scores as a function of the length of the future window for the English-German (En-De) and Spanish-English (Es-En) dev sets. On the left-hand side, the three segmenters (Text, Audio w/ RNN and Audio w/o RNN) are compared by averaging their BLEU scores over history sizes 5, 10 and 15 for the sake of clarity. On the right-hand side, the effect of the history size is analysed for the Audio w/o RNN segmenter. In both cases, reference transcriptions were used as input to the segmenter.
As observed, the length of the future window is a very significant parameter in deciding whether to split or not, which validates our decision to use a model that considers not only past history but also a future window. In the case of En-De, adding a future window significantly improves the results, by up to 5.8 and 3.5 BLEU points on average in the Text and Audio models, respectively. Similarly for Es-En, though of a smaller magnitude, gains of up to 3.7 and 3.4 BLEU points on average are obtained in the Text and Audio models, respectively, at larger future windows.
When comparing the segmenters (Figure 2, left), the Text segmenter performs clearly below the Audio-based segmenters for the English-German pair, and similarly or below for the Spanish-English pair. The Audio-based segmenters offer nearly the same BLEU scores for English-German and Spanish-English; however, the Audio w/o RNN model, despite being simpler, reaches slightly better BLEU scores with a future window of length 4. This window length presents an appropriate trade-off between system latency and accuracy in our streaming scenario. Focusing on the Audio w/o RNN segmenter (Figure 2, right), longer history sizes such as 10 and 15 clearly provide better BLEU scores than the shorter history size (h = 5). A history size of 10 reaches the best BLEU scores for English-German, and similar performance is achieved between 10 and 15 for Spanish-English with a future window of length 4. Based on these results, a history size of 10 and a future window of length 4 were selected for the rest of the experiments.
Table 3 presents BLEU scores of conventional cascade ST systems, in which the ASR output is segmented using the three proposed models and passed down to the MT system, from English into German, Spanish and French, and vice versa. As an upper-bound reference, results with an oracle segmenter are provided, which we approximate by splitting the text into chunks using punctuation marks. The oracle segmenter shows the best BLEU scores that can be achieved with our current ASR and MT systems.
The BLEU scores show how, except for Spanish-English, the models with acoustic features are able to outperform those that are only text-based. The largest improvement occurs in the English-German case, with a 0.8 BLEU-point improvement of the Audio models over the Text model. When comparing the Audio models, there does not seem to be any improvement from using an RNN to process the acoustic features with respect to feeding them directly to the FFN. In the case of Es-En, our analysis shows that, as one set of hyperparameters was optimized and then shared among all language directions, the segments produced by the Es-En Audio models are around 60% longer than the segments in other pairs, which results in reduced performance of the sentence-based MT model.
Table 4 shows BLEU scores when the ASR output is replaced by the reference transcription, so that errors are only due to the segmenter and the MT systems. These results follow the trend of those in Table 3, with improvements of the Audio models over the Text models, and no significant differences between the two Audio variants. Unlike in the previous case, the Es-En Audio w/o RNN system does improve on the results of the Text model. Interestingly enough, the oracle segmentation allows us to observe the performance degradation specifically due to the segmentation model, that is, between 2.7 and 5.3 BLEU points. These oracle results show the best-case scenario that can be achieved with the current MT systems, using the reference transcriptions and the reference segmentation. As the addition of the RNN to process the acoustic features does not improve performance, the simpler Audio w/o RNN model is used in the remaining experiments.

Comparison with previous work
In this section, we compare our results with the previous work described in Section 2: the n-gram LM based segmenter included in the SRILM toolkit (Stolcke, 2002), and the monolingual MT segmentation (Cho et al., 2017), whose implementation is also publicly available. Table 5 shows BLEU scores of a cascade ST system on the English-German Europarl-ST test set, comparing the two segmenters mentioned above, the Audio w/o RNN model proposed in this work, and the oracle segmenter that provides the reference segmentation. Two training data settings are considered: Europarl-ST alone, and Europarl-ST extended with the IWSLT corpus (Cettolo et al., 2012), in order to study the performance of the segmenters when additional, out-of-domain text data is available. Results are provided translating both ASR hypotheses and reference transcriptions.
The results show the same trend across the two inputs to the MT system (ASR outputs and reference transcriptions), but the BLEU differences between segmenters are more noticeable when segmenting the references. The LM based segmenter provides the lowest BLEU scores and is not able to take advantage of the additional IWSLT training data. The monolingual MT model sits at a middle ground between the LM based segmenter and ours, and it is able to take advantage of the additional IWSLT training data. However, our segmenter outperforms all other segmenters in both training data settings. More precisely, when incorporating the IWSLT training data, our segmenter outperforms the best previous result, obtained with the monolingual MT model, by 0.4 BLEU points on ASR output and 0.9 BLEU points on reference transcriptions, mainly thanks to its ability to use acoustic information. Additionally, our proposed model shrinks the gap with respect to the oracle segmentation to 3.1 BLEU points when working with ASR output, and 2.2 BLEU points when reference transcriptions are provided.

Latency evaluation
We now measure the latency of our cascade ST system in a streaming scenario. Following (Li et al., 2020), we define cumulative chunk-level latencies at three points in the system, as the time elapsed between the last word of a chunk being spoken and: 1) the moment the consolidated hypothesis for that chunk is provided by the ASR system; 2) the moment the segmenter defines that chunk on the ASR consolidated hypothesis; and 3) the moment the MT system translates the chunk defined by the segmenter. These three latency figures, in terms of mean and standard deviation, are shown in Table 6. Note that this ST system works with consolidated ASR hypotheses, in the sense that these hypotheses will not change as the audio stream is further processed.
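As an illustration of how these figures can be computed, below is a small sketch assuming per-chunk timestamps are logged at each stage of the pipeline; the field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, List, Tuple

@dataclass
class ChunkTimes:
    spoken_end: float      # last word of the chunk is spoken
    asr_done: float        # consolidated ASR hypothesis available (point 1)
    segmenter_done: float  # chunk boundary emitted by the segmenter (point 2)
    mt_done: float         # translation of the chunk produced (point 3)

def latency_report(chunks: List[ChunkTimes]) -> Dict[str, Tuple[float, float]]:
    """Cumulative chunk-level latency (mean, stdev) at the three points."""
    report = {}
    for stage in ("asr_done", "segmenter_done", "mt_done"):
        lat = [getattr(c, stage) - c.spoken_end for c in chunks]
        report[stage] = (mean(lat), stdev(lat))
    return report
```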
The difference of 1.1 seconds between the ASR system and the segmenter is mostly due to the need to wait for the words in the future window to be consolidated, as the time taken by the segmenter to decide whether to split or not is negligible (about 0.01 s). Lastly, the MT system adds a delay of 0.5 seconds. The total latency is dominated by the ASR system, since the long-range dependencies of the RNN-based LM delay the consolidation of the hypothesis, which is needed by the segmenter and the MT system in order to output the definitive translation.
In practice, however, the ST system could work with non-consolidated hypotheses, since these very rarely differ from the consolidated ones. In this case, the latency of the ASR system is significantly reduced to 0.8 ± 0.2 seconds, while the latency experienced by the user for the whole ST system is 1.3 ± 0.4 seconds, as the segmenter no longer waits for the words in the future window to be consolidated.

Conclusions
This work introduces a statistical framework for the problem of ASR output segmentation in streaming ST, as well as three possible models to instantiate this framework. In contrast to previous works, these models consider not only text but also acoustic information. The experimental results reported provide two key insights. Firstly, we have confirmed that the preprocessing of the MT training data has a significant effect on ST, and that a special preprocessing closer to the inference conditions is able to obtain significant improvements. Secondly, we have shown the importance of including acoustic information in the segmentation process, as the inclusion of these features improves system performance. The proposed model improves on the results of previous works on the Europarl-ST test set when evaluated with two training data setups.
In terms of future work, there are many ways of improving the segmenter presented here. We plan to look into additional acoustic features, as well as possible ways to incorporate ASR information into the segmentation process. The segmenter model itself could also benefit from the incorporation of additional text data as well as pre-training procedures. We also envision two supplementary research lines: the integration of the segmentation into the translation process, so that the system learns how to segment and translate at the same time; and moving from an offline MT system to a streaming MT system to improve response time without performance degradation.

A Reproducibility
The source code of the Direct Segmentation Model, as well as the ASR hypotheses and acoustic features used in the experiments, are attached as supplementary material. Combined with the instructions provided for training the MT systems, this allows for faithful reproduction of our experiments.

B ASR Systems
The acoustic models were trained using the datasets listed in Table 7, and the architecture of the models is summarized in Table 8.
The language models were trained using the datasets listed in Table 9. The number of English words includes 294G words from Google Books counts. As for the models themselves, they are an interpolation between a 4-gram LM and an RNN LM. For German and French, the RNN is trained with the RNNLM toolkit and has a hidden layer of 400 units. For Spanish and English, the RNN is an LSTM trained with the CUED toolkit, with an embedding layer of 256 units and a hidden layer of 2048 units. The vocabulary was limited to the 200K most common words.

D Segmentation Systems
The different hyperparameter values that were tried for the segmentation models are shown in Table 11. In total, no more than 75 combinations were tested in order to conduct the experiments reported in this paper.