Findings of the Second Workshop on Automatic Simultaneous Translation

This paper presents the results of the shared task of the 2nd Workshop on Automatic Simultaneous Translation (AutoSimTrans). The task includes two tracks, one for text-to-text translation and one for speech-to-text, requiring participants to build systems to translate from either the source text or speech into the target text. Different from traditional machine translation, the AutoSimTrans shared task evaluates not only translation quality but also latency. We propose a metric “Monotonic Optimal Sequence” (MOS) considering both quality and latency to rank the submissions. We also discuss some important open issues in simultaneous translation.


Introduction
Simultaneous translation translates concurrently with the speech in the source language, aiming for high translation quality at low latency. The concurrent comprehension and production process makes simultaneous translation an extremely challenging task for both human experts and machines. As a combination of machine translation (MT), automatic speech recognition (ASR), and text-to-speech synthesis (TTS), simultaneous translation still faces many open problems in both research and application. To promote development in this cutting-edge field, we conduct a shared task at the 2nd Workshop on Automatic Simultaneous Translation.
This year, we focus on Chinese-English simultaneous translation and set up two tracks:
1. Text-to-text track, where the participants are asked to submit systems that translate streaming input text in real time. The input of this track is human-annotated transcripts in streaming format, in which every n-word sentence is broken into n lines of sequences whose lengths range from 1 to n, incremented by 1. We set up this track for two reasons. On the one hand, the difficulty of the task is reduced by removing the recognition of speech. On the other hand, participants can focus on text processing, such as segmentation and translation, without being influenced by ASR errors.
2. Speech-to-text track, where the submitted systems need to produce a real-time translation of the given audio.
We provide BSTC (Baidu Speech Translation Corpus) (Zhang et al., 2021) as the training data, which consists of about 68 hours of Mandarin speech, together with the corresponding transcripts, ASR results, and translations. In addition, participants can also use the bilingual corpus provided by CCMT (China Conference on Machine Translation)¹. We describe the data in detail in Section 2.
One objective of the shared task is to explore the performance of state-of-the-art simultaneous translation systems. Traditional evaluation metrics, such as BLEU, only measure translation quality, while recently proposed metrics, such as Consecutive Wait (CW) (Gu et al., 2017) and Average Lagging (AL) (Ma et al., 2019), focus on latency. As far as we know, there is no metric that evaluates both quality and latency.
We ask the participants to submit systems under different configurations to produce multiple translation results with varying latency. Then we plot each result in a quality-latency coordinate. Normally, a system is regarded as the best if all of its points are above others (Figure 1(a)). However, in most cases, their lines of points intersect with each other (Figure 1(b)).
To consider both quality and latency in ranking, we propose a ranking metric, Monotonic Optimal Sequence (MOS) (Section 3). The idea is to first find all the optimal points, that is, a group of points with the highest quality under different latency, and then calculate the proportion of a system's optimal points in all its submitted points. The higher the proportion, the better the performance. We received six submissions from four teams this year. We will report the results and analysis in Section 4. We discuss some important open issues in Section 5 and conclude the paper in Section 6.

Shared Task
We first introduce the data sets used in the shared task and the setup of the two tracks.

Training Set
Due to the scarcity of Zh→En speech translation corpora, we provide a Zh→En speech translation dataset BSTC and a large-scale text translation corpus CCMT for the participants.
• BSTC (Zhang et al., 2021)
The statistics of the two datasets are listed in Table 1. As far as we know, BSTC is by far the largest Zh→En speech translation corpus, but it is still insufficient to train either a well-performing ASR model or an end-to-end simultaneous translation model in the speech-to-text track. Therefore, we do not impose restrictions on the datasets used by the participants for the speech track.

Test Set
Note that the test set of BSTC shown in Table 1 is not released. Participants are required to submit Docker systems, which we test on the 1.5-hour test set. The test set is kept confidential as a progress test set. To let participants validate their systems before submission, we provide the dev set, which has the same format as the test set. It contains four-way parallel samples of 1) the streaming transcript, 2) the streaming ASR output, 3) the sentence-level translation of the transcript, and 4) the audio. The streaming transcripts are produced by turning each n-word sentence (a word means a Chinese character here) into n lines of word sequences with lengths 1, 2, ..., n. The streaming ASR output is produced by the real-time Baidu ASR system based on SMLTA².
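The streaming transcript format described above can be illustrated with a minimal sketch. The function name `to_streaming` is hypothetical and is not part of the released data tools; the sketch only assumes, as stated in the text, that each Chinese character counts as one word.

```python
from typing import List


def to_streaming(sentence: str) -> List[str]:
    """Turn an n-character sentence into its n prefixes of length 1, 2, ..., n."""
    return [sentence[:i] for i in range(1, len(sentence) + 1)]


# Each prefix would occupy one line of the streaming transcript.
lines = to_streaming("你好世界")
for line in lines:
    print(line)
```

A system in the text-to-text track would read these lines one at a time and decide after each line whether to emit more target text.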

Two Tracks
We set up two tracks in our shared task: the text-to-text track takes streaming transcripts as input, and the speech-to-text track takes audio files as input, as mentioned in Section 1.
Simultaneous translation aims to balance system delay and translation quality. The key problem is to find a policy that decides when to begin translating a source sentence before the speaker has finished his/her utterance. Eager policies, such as translating every word as soon as it is received, lead to poor translation quality, while lazy policies, such as waiting to translate until a complete sentence has been received, result in long system delay.
In order to comprehensively evaluate each system's performance, we suggest that the participants generate multiple results on varying latency. Six systems from four teams were submitted in the shared task, four to Track 1 and two to Track 2.

System Evaluation
Unlike text translation evaluation that only takes one indicator (i.e., translation quality), simultaneous translation evaluation needs to consider quality and latency at the same time. The evaluation based on two criteria brings difficulties to ranking the systems. However, the two indicators are not easy to merge into one.
To rank the submissions better, we propose a ranking algorithm called Iterative Monotonic Optimal Sequence (I-MOS). Specifically, we define an optimal point as the result of the best translation quality at each latency. Our algorithm iteratively finds sets of optimal points to construct an optimal curve called Monotonic Optimal Sequence (MOS), then each team's proportion of points on the MOS curve is calculated to measure the performance. The overall process is illustrated in Figure 2.
In the following sections, we first introduce the commonly used metrics of quality and latency (Section 3.1), then propose the Monotonic Optimal Sequence (Section 3.2) and elaborate our I-MOS algorithm (Section 3.3).

Evaluation metrics
In simultaneous translation, quality is often measured by BLEU (Papineni et al., 2002). Recent work proposed some metrics for latency evaluation, such as Average Proportion (AP) (Cho and Esipova, 2016), Consecutive Wait (CW) (Gu et al., 2017), Average Lagging (AL) (Ma et al., 2019) and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019). Here we briefly introduce the two latency metrics used in our evaluation:
• CW is the average source segment length in words. It measures the number of source words waited for between two consecutive translation actions.
• AL quantifies the degree to which the audience is out of sync with the speaker, measured as the average number of source words by which the audience lags behind the ideal policy. Under the ideal policy, the translation of each sentence is output at the same speed as the speaker's utterance and the entire translation is finished when the speaker completes his/her utterance.
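The two latency metrics can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used in the shared task. The list `g` is a hypothetical representation in which `g[t-1]` is the number of source words read before target word t is written. The AL function follows the formula of Ma et al. (2019); the CW function is one reading of the description above (the mean length of the source segments read between consecutive translation actions).

```python
from typing import List


def average_lagging(g: List[int], src_len: int, tgt_len: int) -> float:
    """AL = (1/tau) * sum_{t=1..tau} (g(t) - (t-1)/r), with r = tgt_len / src_len.

    tau is the first target step at which the whole source has been read.
    """
    r = tgt_len / src_len
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau


def consecutive_wait(g: List[int]) -> float:
    """Mean number of source words read between two consecutive translation actions."""
    segments, previous = [], 0
    for read in g:
        if read > previous:  # a new source segment was read before this write
            segments.append(read - previous)
        previous = read
    return sum(segments) / len(segments) if segments else 0.0


# Hypothetical schedules for a 6-word source and a 6-word target.
wait3 = [3, 4, 5, 6, 6, 6]  # read 3 words first, then alternate read/write
full = [6] * 6              # wait for the full sentence before translating

print(average_lagging(wait3, 6, 6), consecutive_wait(wait3))
print(average_lagging(full, 6, 6), consecutive_wait(full))
```

In this toy case the policy that waits for three words lags the ideal policy by three words on average, while the full-sentence policy lags by the whole sentence length.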
Note that the above-mentioned latency metrics are all proposed for text-to-text simultaneous translation and we use AL in the text track for latency evaluation. Some work extended AP and AL to speech translation (Ren et al., 2020;Ma et al., 2020), but we don't use them because they measure real-time latency, while some submissions calling remote services contain network delay. It is unreasonable to use real-time latency metrics for both the local-running systems and remote-running systems. Thus we ignore the latency of the ASR model and take the metrics of text-to-text simultaneous translation in the speech track. Specifically, we use BLEU-AL evaluation in the Text-to-text track and BLEU-CW evaluation in the Speech-to-text track.

Monotonic Optimal Sequence
To comprehensively rank systems based on the translation quality and latency, we propose to construct a monotonic optimal sequence composed of Optimal Points.
Definition 1. On the quality-latency figure, one result is considered optimal if there is no other point or line above it at an identical latency. In this case, the result is of the highest translation quality at that latency and we define it as an Optimal Point.
For example, among the nine results of Figure 1 (b), the leftmost two points of Team1 and rightmost two points of Team2 are Optimal Points. The third point from left on Team2's curve is not optimal because it lies below the line of Team1.
To get Optimal Points, we select the results with the best translation quality at different latencies. Since the submitted systems have discrete latencies, we use linear interpolation between adjacent points of each team to estimate their translation quality over continuous latency. Then we select some Optimal Points to form an optimal curve called the Monotonic Optimal Sequence.
We arrange all the Optimal Points in ascending order of latency and then select the points with monotonically increasing translation quality to form the MOS. The monotonicity requirement for translation quality is to exclude outlier points. For example, the rightmost point of Team1 in Figure 2 (b) is an outlier: there is no point or line above it at the same latency, but it does not follow the monotonicity principle, so it should not be added to the MOS.
We propose to use each team's proportion of points on the MOS to evaluate its performance. That is, we rank teams with:
$$S_{T_i} = \frac{N(p^{*}_{t_i})}{N(p_{t_i})} \qquad (1)$$
where $N(p^{*}_{t_i})$ and $N(p_{t_i})$ denote the number of points on the MOS and the number of submitted points of team $i$, respectively. Therefore, the maximum value of $S_{T_i}$ is 1, reached when all of a team's submitted points are on the MOS.

Iterative Monotonic Optimal Sequence Algorithm
A problem with this measurement is that, according to Eq. 1, all the teams that have no points on the MOS are tied because they all score zero. To tackle this problem, we propose the Iterative Monotonic Optimal Sequence (I-MOS) algorithm. The main idea is to iteratively calculate the MOS curves MOS-1, MOS-2, ..., MOS-K, in which MOS-k denotes the Monotonic Optimal Sequence of level k calculated at the k-th iteration.
All the systems that have at least one point on MOS-k are classified as level k. We remove these systems and calculate MOS-(k + 1) in the next iteration. Each team of the k-th level ranks higher than all teams of the (k + 1)-th level. Our algorithm is elaborated in Algorithm 1. The level of every team is initialized to zero (line 1), which denotes that the team's score has not been calculated. Then we begin the iteration. While there exists at least one team whose score has not been calculated (line 4), we update the scores of the teams that belong to superior levels (levels 1, 2, ..., k − 1) by adding the maximum value of S_{T_i} (1 point) to them (lines 5-7), which ensures that the systems of levels 1, 2, ..., k − 1 score higher than the systems of level k. Then we calculate MOS-k (line 8) and update the scores of the teams that belong to level k according to Eq. 1 (lines 9-11). After an iteration, we continue with the teams that belong to level k + 1 (line 12). Figure 2 illustrates a run of I-MOS.
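The MOS construction and the I-MOS iteration can be sketched together as follows. This is a minimal illustration, not the official scoring code. All function names and the example numbers are hypothetical, and the sketch makes three assumptions that the text does not fix: a team's line is not extrapolated beyond its own latency range, a point is compared only against other teams, and the monotonic selection keeps a point only if its quality strictly exceeds that of every optimal point at lower latency.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (latency, quality)


def quality_at(points: List[Point], latency: float) -> Optional[float]:
    """Linearly interpolated quality of one team at `latency` (None outside its range)."""
    pts = sorted(points)
    if not pts or latency < pts[0][0] or latency > pts[-1][0]:
        return None
    best = None
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= latency <= x1:
            q = max(y0, y1) if x1 == x0 else y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
            best = q if best is None else max(best, q)
    return pts[0][1] if best is None else best  # None here means a single-point team


def monotonic_optimal_sequence(subs: Dict[str, List[Point]]) -> List[Tuple[float, float, str]]:
    """Optimal Points (nothing from another team above them), filtered for monotonic quality."""
    optimal = []
    for team, points in subs.items():
        for lat, qual in points:
            rivals = [quality_at(p, lat) for name, p in subs.items() if name != team]
            if all(q is None or q <= qual for q in rivals):
                optimal.append((lat, qual, team))
    mos, best = [], float("-inf")
    for lat, qual, team in sorted(optimal):  # ascending latency
        if qual > best:  # assumption: strictly increasing quality
            mos.append((lat, qual, team))
            best = qual
    return mos


def i_mos(subs: Dict[str, List[Point]]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Iteratively assign levels and scores as described in the text."""
    level = {team: 0 for team in subs}  # 0 means "not yet scored"
    score = {team: 0.0 for team in subs}
    k = 1
    while any(lv == 0 for lv in level.values()):
        for team in subs:  # superior levels gain the maximum of Eq. 1
            if 0 < level[team] < k:
                score[team] += 1.0
        remaining = {t: p for t, p in subs.items() if level[t] == 0}
        mos = monotonic_optimal_sequence(remaining)
        on_mos = {t: sum(1 for _, _, owner in mos if owner == t) for t in remaining}
        if not any(on_mos.values()):  # guard against teams without any points
            break
        for team, count in on_mos.items():
            if count > 0:
                level[team] = k
                score[team] = count / len(subs[team])  # Eq. 1
        k += 1
    return level, score


# Hypothetical submissions: (latency, BLEU) pairs for three teams.
submissions = {
    "team1": [(1.0, 10.0), (2.0, 14.0), (4.0, 16.0), (8.0, 15.0)],
    "team2": [(1.5, 8.0), (3.0, 13.0), (5.0, 18.0), (7.0, 20.0)],
    "team3": [(2.0, 5.0), (6.0, 9.0)],
}
level, score = i_mos(submissions)
print(level, score)
```

In this toy example, the last point of team1 is an Optimal Point but is dropped by the monotonicity requirement, and team3 has no point on MOS-1, so it is placed on level 2 and scored against MOS-2.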

Systems Results
We received six systems submitted by four teams from four institutions:
• Institute of Computing Technology, Chinese Academy of Sciences (ICT)
• Xiamen University (XMU)
• Beijing Institute of Technology (BIT)
• Ping An Technology (Shenzhen) Co., Ltd. (PingAn)
We test each Docker system on our test set, which contains 6 Mandarin talks totaling 1.5 hours. All the systems are run on V100 GPU hardware. We plot the evaluation results in Figure 3 and rank them according to the I-MOS algorithm. The ranking results are shown in Table 2. We use BLEU³ to evaluate the translation quality and use Average Lagging (AL) (Ma et al., 2019) and Consecutive Wait (CW) (Gu et al., 2017) as latency metrics.

Text-to-text Track
In the first track, the results of the four teams reflect their preferences in balancing system latency and translation quality. We briefly describe the methods of the teams below in the order of their ranks:
1. ICT proposes a character-level wait-k policy, rather than using the standard word-level wait-k (Ma et al., 2019). They perform prefix-to-prefix MT training as in the original work. In addition, they follow the multi-path (Elbayad et al., 2020) and future-guided (Zhang et al., 2020b) methods to enhance predictability and avoid excessive anticipation in translation.
3. BIT uses a pipeline method with a segmentation model that bridges the streaming text input and the MT model. Once a punctuation mark is detected, the segmentation model sends the currently received sub-sentence for translation, as in prior work. To make the MT model adapt to translating short sub-sentences at inference time, each sample in the provided parallel training corpus is automatically divided into multiple translation pairs for training. A statistical word alignment tool is used to segment the source sentence into minimal chunks so that crossing alignment links between source and target words occur only within individual chunks. The parallel pairs of chunks are then used to train their MT model.
4. PingAn takes test-time wait-k (Ma et al., 2019) as the segmentation policy. Different from the standard wait-k policy, test-time wait-k applies the wait-k policy only at inference time, without prefix-to-prefix training of the MT model. They further adopt back-translation (Sennrich et al., 2016) to improve the translation quality.
³ BLEU is calculated using "https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl".
In summary, we can categorize the four systems according to their segmentation policies. Both ICT and PingAn adopt the wait-k policy: ICT adopts training-time wait-k while PingAn uses test-time wait-k. BIT chooses sub-sentence translation, that is, it translates only when a punctuation mark is detected. XMU performs meaningful-unit (MU) based segmentation, in which the training samples of meaningful units are generated by the MT model. Figure 3 (a) shows that the latency of the two methods using wait-k is relatively low, while the MU-based policy can achieve high translation quality. For the two wait-k systems, ICT performs better than PingAn, which is consistent with the experimental results in Ma et al. (2019) that training-time wait-k is superior to test-time wait-k.
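The read/write schedule of the wait-k policy (Ma et al., 2019) used by ICT and PingAn can be sketched as follows. The function names are hypothetical, and the sketch covers only the schedule g(t) = min(k + t − 1, |x|); it does not model the MT system itself or the difference between training-time and test-time wait-k, which lies in how the model is trained rather than in the schedule.

```python
from typing import List


def wait_k_schedule(src_len: int, tgt_len: int, k: int) -> List[int]:
    """g(t) = min(k + t - 1, src_len): source words read before writing target word t."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]


def wait_k_actions(src_len: int, tgt_len: int, k: int) -> List[str]:
    """Expand the schedule into an explicit sequence of READ and WRITE actions."""
    actions, read = [], 0
    for need in wait_k_schedule(src_len, tgt_len, k):
        actions += ["READ"] * (need - read)  # catch up on the source first
        read = need
        actions.append("WRITE")
    return actions


print(wait_k_actions(src_len=4, tgt_len=4, k=2))
```

A small k yields an eager policy, while any k at least as large as the source length degenerates to the lazy policy of waiting for the full sentence.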
It is interesting that the latency of XMU is larger than that of BIT. This might be because there are often long-distance reorderings in the training corpus. Reordering in translation that crosses punctuation marks would prevent the MU segmentation policy from extracting fine-grained MUs, so that the average length of MUs exceeds that of sub-sentences. This problem has been illustrated in Zhang et al. (2020a), who proposed a refined method called MU++ to alleviate it.
The result of BIT is unexpected: the translation quality decreases as system latency grows. This might be caused by the discrepancy between the segmentation module and the MT model. In their method, the segmentation module segments sentences into sub-sentences, while the MT model is trained on statistically split chunks.

Speech-to-text Track
As elaborated in Section 3.1, we use BLEU and Consecutive Wait (CW) (Gu et al., 2017) to evaluate systems in the speech track.
PingAn and XMU continue their work based on their systems submitted to the Text-to-text track. The two systems both keep the same policy used in the first track and only replace the text input with the recognition results of an ASR model. PingAn trains a QuartzNet model (Kriman et al., 2020) with the Memory-Self-Attention (Luo et al., 2021) and XMU uses Baidu's real-time speech recognition service. Figure 3 (b) shows that PingAn using wait-k outperforms XMU in latency. The reason behind the large delay of XMU's system might be the same as in the first track.

Discussion
Most recent studies on simultaneous translation focused on methods to balance translation quality and latency. Besides this, we will discuss some other important challenges for simultaneous translation.

Data Scarcity
The first problem is the shortage of high-quality simultaneous translation data. In recent years, some speech translation corpora have been released, such as MuST-C (Di Gangi et al., 2019), CoVoST (Wang et al., 2020a,b), Europarl-ST (Iranzo-Sánchez et al., 2020), Aug-LibriSpeech (Kocabiyikoglu et al., 2018), etc. These corpora focus on Indo-European languages and have greatly contributed to the increasing popularity of simultaneous translation research. However, little attention has been paid to research and data collection for Chinese-English (Zh→En) simultaneous translation. To the best of our knowledge, only MSLT (Federmann and Lewis, 2016) and CoVoST (Wang et al., 2020b) contain Zh→En speech translation data, and together they contain only about 30 hours of speech. In our shared task, we build a 68-hour Zh→En speech translation corpus, BSTC (Zhang et al., 2021), for training and evaluation. The dataset alleviates the Zh→En data scarcity, but it is still insufficient to train data-hungry end-to-end simultaneous translation models.

Evaluation Dilemma
The second problem lies in system evaluation, which has not been widely explored.
Traditional metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), etc., are designed for text translation. These metrics are based on accurate matching between system outputs and references. However, to reduce latency in simultaneous interpretation, human interpreters usually use strategies such as reasonable omissions and avoiding long-distance reordering in translation. Thus the traditional metrics are not suitable for evaluating simultaneous interpretation.
On the other hand, there is no metric that evaluates both translation quality and latency. In our shared task, we propose a novel ranking algorithm, I-MOS. We only consider the proportion of optimal points, ignoring whether the points lie at low latency or high latency. Therefore, our ranking does not differentiate latency regimes. However, it remains an open question whether it is reasonable to compare two systems with no overlap in latency, such as ICT and XMU in Figure 3 (a). The ranking might be more convincing if ICT had provided results at high latency and XMU had provided results at low latency.
We note that IWSLT has also hosted simultaneous translation shared tasks⁴. They proposed to rank systems by translation quality within different latency regimes: Low Latency: AL <= 3, Medium Latency: AL <= 6, and High Latency: AL <= 15. For each team, the submitted system that achieves the best translation quality is chosen for ranking in each latency regime. However, the artificially defined latency thresholds between regimes have a big impact on the ranking results. As illustrated in Figure 4, different latency thresholds lead to completely different rankings of the two teams.
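The sensitivity of regime-based ranking to the thresholds can be illustrated with a small sketch. The thresholds in `REGIMES` are those quoted above; the function `best_per_regime`, the team names, and all (AL, BLEU) values are hypothetical and are not taken from Figure 4 or from the IWSLT evaluation scripts.

```python
from typing import Dict, List, Tuple

REGIMES = {"low": 3.0, "medium": 6.0, "high": 15.0}  # upper bounds on AL


def best_per_regime(
    subs: Dict[str, List[Tuple[float, float]]],
    regimes: Dict[str, float] = REGIMES,
) -> Dict[str, List[str]]:
    """Rank teams in each regime by their best quality among points with AL <= bound."""
    ranking = {}
    for name, max_al in regimes.items():
        best = {
            team: max((q for al, q in pts if al <= max_al), default=None)
            for team, pts in subs.items()
        }
        eligible = [team for team, q in best.items() if q is not None]
        ranking[name] = sorted(eligible, key=lambda team: best[team], reverse=True)
    return ranking


# Hypothetical (AL, BLEU) results for two teams.
results = {
    "A": [(2.9, 20.0), (5.0, 26.0)],
    "B": [(3.1, 24.0), (5.5, 25.0)],
}
print(best_per_regime(results))                 # team B has no entry in the low regime
print(best_per_regime(results, {"low": 3.5}))   # a slightly looser bound puts B first
```

Moving the low-latency bound from 3 to 3.5 is enough to reverse the outcome in this toy case, which mirrors the concern raised above.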
Actually, the ideal ranking mechanism is to rank all systems within a similar latency interval. However, asking participants to submit results in almost every latency regime is unreasonable, because existing policies all have a preference in trading off latency and translation quality. For example, wait-k focuses on getting controllable low latency, while the inspiration behind MU is to translate until a segment with definite meaning is formed, leading to a high latency as well as high quality. Therefore, it is a dilemma to evaluate systems comprehensively while distinguishing different latency regions reasonably. This problem can be explored in future work.

Applications
Recently, more and more simultaneous translation systems have emerged in international conferences.
In practical applications, systems face robustness and controllability issues. Being robust means that the system should achieve high translation quality and be insensitive to speech noise, including sound capture noise, the speaker's accent, disfluency in speech, etc. Being controllable means that the system should be able to remember and understand named entities and should allow human intervention.
Our shared task provides such an opportunity for participants to pay attention to the robustness problem. For example, ICT and PingAn have adopted data augmentation to enhance the robustness of their systems.
In terms of controllability, it is not difficult to integrate an intervention mechanism in pipeline systems. For example, a pre-defined translation of a named entity can be introduced to the MT module. However, controllability is not easy to guarantee for end-to-end simultaneous translation systems (Ren et al., 2020; Ma et al., 2020). It remains a challenge to correct a translation without an intermediate ASR result. We also hope to see more work focusing on real-world simultaneous translation applications and discussing issues such as document-level ASR error correction in pipeline systems and how to enhance controllability in end-to-end speech-to-text systems.

Conclusion
This paper presents the results of the Zh→En simultaneous translation shared task hosted at the 2nd Workshop on Automatic Simultaneous Translation (AutoSimTrans). The shared task includes two tracks, the text-to-text track (Track1) and the speech-to-text track (Track2). Six systems were submitted to the shared task, four to Track1 and two to Track2. We propose an evaluation method, "Monotonic Optimal Sequence" (MOS), to evaluate both translation quality and latency. We report the results and further discuss some important open issues of simultaneous translation.
Regrettably, the number of submissions is lower than expected, especially for the speech-to-text track. In fact, more than 300 teams registered, but most of them did not submit results. A possible reason is that the interdisciplinary task is not easy for participants. We hope to see more participants in the future.