BSTC: A Large-Scale Chinese-English Speech Translation Dataset

This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. The dataset is constructed from a collection of licensed videos of talks and lectures, comprising about 68 hours of Mandarin data, their manual transcripts and English translations, as well as automated transcripts produced by an automatic speech recognition (ASR) model. We have further asked three experienced interpreters to simultaneously interpret the test talks in a mock conference setting. This corpus is expected to promote research on automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.


Introduction
In recent years, automatic speech translation (AST) has attracted increasing interest for its commercial potential (e.g., simultaneous interpretation and wireless speech translators). A large amount of research has focused on speech translation (Weiss et al., 2017; Niehues et al., 2018; Chung et al., 2018; Sperber et al., 2019; Kahn et al., 2020; Inaguma et al., 2020) and simultaneous translation (Sridhar et al., 2013; Oda et al., 2014; Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019; Zhang et al., 2020). The former intends to convert speech signals in the source language to the target language, while the latter aims to achieve real-time translation that delivers the speech to the audience in the target language while minimizing the delay between the speaker and the translation.
To train an AST model, existing corpora can be classified into two categories:

• Speech Translation corpora consist of pairs of audio segments and their corresponding translations.

• Simultaneous Translation corpora are constructed by transcribing lecturers' speeches and the streaming utterances of human interpreters.

Table 1: Existing speech translation corpora and ours. The duration statistics of all datasets are rounded up to an integer hour. For MuST-C, "8 Euro langs" is short for "8 European languages". Europarl-ST contains speech translation between 9 European languages.
The main difference between these two kinds of corpora lies in the way the translations are generated. The translations in Speech Translation corpora are produced from complete audios or their transcripts, while the translations in Simultaneous Translation corpora are transcribed from real-time human interpretation. Existing research on Speech Translation has mainly focused on translation between English and Indo-European languages 1, with little attention paid to translation between Chinese (Zh) and English. One of the reasons is the scarcity of public Zh↔En speech translation corpora. Among the public corpora, only MSLT (Federmann and Lewis, 2017) and CoVoST (Wang et al., 2020a,b) contain Zh↔En speech translation, as shown in Table 1. However, their total volume of Zh→En translation is only about 30 hours, which is too small to train data-hungry neural models. Some studies explore Zh→En Simultaneous Translation (Ma et al., 2019; Zhang et al., 2020). However, they use text translation datasets to simulate real-time translation scenarios because of the lack of a simultaneous translation corpus.

Figure 1: The process of constructing the training set and the development/test sets (dev/test). The difference between the two processes is that for the training set we first split the audio into sentences and then obtain the ASR output and transcript for each sentence, while for the dev/test sets we record the real-time ASR output and transcript; sentence splitting is only used to generate translations of the segmented sentences.
To promote research on Chinese-English speech translation, and to evaluate translation quality in real simultaneous interpretation environments, we construct BSTC, a large-scale Zh→En speech translation and simultaneous translation dataset including approximately 68 hours of Mandarin speech data with automatic recognition results, manual transcripts, and translations. Our contributions are:

• We propose the first large-scale (68 hours) Chinese-English Speech Translation corpus. The training set is a four-way parallel dataset of Mandarin audio, transcripts, ASR lattices, and translations.
• The proposed dev and test sets constitute the first high-quality Simultaneous Translation dataset, comprising over 3 hours of Mandarin speech together with its streaming transcripts, streaming ASR results, and high-quality translations.
• We have organized two simultaneous interpretation tasks 2 to promote research in this field and deployed a strong benchmark on this dataset.
• The proposed dataset can also be used as 1) a Chinese Spelling Error Correction (CSC) corpus containing pairs of ASR results and corresponding manual transcripts, or 2) a Zh→En Document Translation dataset with context-aware translations.

Dataset Description
BSTC is created to fill the gap in Zh→En speech translation, in terms of both size and quality. To achieve these objectives, we collect approximately 68 hours of Mandarin speeches from three TED-like content producers: BIT 3, tndao.com 4, and zaojiu.com 5. The speeches cover a wide range of domains, including IT, economy, culture, biology, arts, etc. We randomly extract several talks from the dataset and divide them into the development and test sets.

Training set
For the training set, we manually tag timestamps to split the audio into sentences, transcribe each sentence, and ask professional translators to produce the English translations. Each translation is generated based on an understanding of the entire talk and is faithful and coherent as a whole. To facilitate research on robust speech translation, we also provide the top-5 ASR results for each segmented speech, produced by SMLTA 6, a streaming multi-layer truncated attention ASR model. Figure 1 (a) shows the construction process of the training set, together with an example of a segmented sentence.

Dev/Test set
For the development (dev) set and test set, we consider the simultaneous translation scenario and provide streaming transcripts and streaming ASR results, as shown in Figure 1 (b). The streaming transcript of an n-word sentence (a word here means a Chinese character) is produced by expanding the sentence word by word into n lines of length 1, 2, ..., n. We use the real-time recognition results of each speech, rather than the recognition of each sentence-segmented audio clip. This simulates the simultaneous interpreting scenario, in which the input is streaming text rather than segmented sentences.
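The streaming expansion described above can be sketched in a few lines (an illustrative snippet; the function name is ours, not part of the BSTC release):

```python
def streaming_transcript(sentence):
    """Expand an n-character sentence into its n streaming prefixes
    of length 1, 2, ..., n, one per line of the streaming transcript."""
    return [sentence[:i] for i in range(1, len(sentence) + 1)]
```

For a three-character sentence this yields three lines, each extending the previous one by a single character, mirroring how the input arrives in a real-time setting.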

Statistics and Dataset Features
We summarize the statistics of our dataset in Table 2. The distributions of talk length and utterance length in the training set are illustrated in Figure 2 and Figure 3, respectively. The average number of utterances per talk is 176.3 in the training set, 59.8 in the dev set, and 162.5 in the test set, and the average utterance length is 27.14, 27.26, and 26.49, respectively. We also calculate the word error rate 7 (WER) of the ASR system on the three datasets. As shown in Table 2, the WER of the training set is 27.90%, significantly higher than that of the dev and test sets. This is due to the audio segmentation performed before recognition: some audio clips lose parts of their content in acoustic truncation, resulting in incomplete ASR results. We compute the length difference of each <transcription, asr> pair, i.e., ∆len = |len(transcription) − len(asr)|, and recalculate the WER of pairs whose length difference is within a certain range. The WER and coverage of these subsets are listed in Table 3. Note that when the ASR output and the transcript have equal length (∆len = 0), the WER is only 5.87%. When the length difference lies in a relatively moderate range (e.g., ∆len ≤ 15), the WER is also relatively low (15.23%).
In addition, our dataset differs from existing speech translation corpora in that speech irregularities (e.g., disfluencies) are kept in the transcriptions.

Human Interpretation
We further ask three interpreters (A, B, and C), with interpreting experience ranging from four to nine years, to interpret the six talks of the test set in a mock conference setting 8. To evaluate their translation quality, we also ask human translators to assess the transcribed interpretations in terms of adequacy, fluency, and correctness:

• Rank1: The translation contains no obvious errors.
• Rank2: The translation is comprehensible and adequate, but with minor errors such as incorrect function words and less fluent phrases.
• Rank3: The translation is incorrect and unacceptable.

Table 4 shows the translation quality in BLEU and acceptability, where acceptability is calculated as the sum of the percentages of Rank1 and Rank2. The interpreters' acceptability ranges from 62.8% to 83.0%, but acceptability and BLEU are not strictly positively correlated. This is because human interpreters routinely omit less important information to overcome the limitations of their working memory. Acceptability focuses more on accuracy and faithfulness than on adequacy, so it can tolerate information omission. Therefore, information omitted in human interpretation may lower BLEU without decreasing acceptability. BLEU, in contrast, as a statistical automatic evaluation metric, weights adequacy as heavily as accuracy. This leads to the discrepancy between BLEU and acceptability. Figure 4 lists a segment from one example in our dataset. Notably, we supply human interpretations only for the testing data. Here the "Streaming ASR" field contains the real-time recognition results, in which "Type: final" means that a pause or silence has been detected, so the current sentence is segmented and recognition of a new sentence begins, while "Type: partial" means recognition of the current sentence continues.

8 We play the video of the speech, just like in a real simultaneous interpretation scene.

Figure 4: A segment of one example in our test set, including audio, timelines, transcription, translation, streaming ASR results, and interpretations from the three human interpreters (only for testing data). The red characters in "Streaming ASR" indicate recognition errors.
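The acceptability score can be computed directly from the per-segment ranks (a minimal sketch; representing the three ranks as the integers 1-3 is our assumption):

```python
from collections import Counter

def acceptability(ranks):
    """Acceptability = fraction of segments rated Rank1 or Rank2,
    i.e., translations that are comprehensible with at most minor errors."""
    counts = Counter(ranks)
    return (counts[1] + counts[2]) / len(ranks)
```

For example, an interpreter whose segments were rated [1, 2, 3, 2] would score 0.75.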

Experiments
In this section, we introduce our benchmark systems based on the dataset. We conduct experiments on speech translation and simultaneous translation, respectively.
To preprocess the Chinese and English text, we use an open-source Chinese segmenter 9 and the Moses tokenizer 10. After tokenization, we convert all English letters into lower case. To train the MT model, we apply byte-pair encoding (Sennrich et al., 2016) to both languages, setting the vocabulary size to 20K for Chinese and 18K for English. We use the "multi-bleu.perl" 11 script to evaluate the BLEU score.

Benchmark System
Our benchmark is a cascade system that includes an ASR module, a sentence segmentation module, and a machine translation (MT) module.
• We use the SMLTA model for ASR, i.e., the streaming transcript/ASR of BSTC is taken as the output of the ASR module.
• The sentence segmentation module decides when to translate in real time. We train a classification model based on the Meaningful Unit (MU) method proposed in Zhang et al. (2020), which implements 5-class classification (MU, comma, period, question mark, and none). The training data of meaningful units are generated automatically from monolingual sentences based on context-aware translation consistency. The model is initialized from ERNIE-base (Sun et al., 2020) and fine-tuned on the transcripts of the BSTC training set.
• Once an MU or a sentence boundary (period or question mark) is detected by the sentence segmentation module, the MT module generates a translation for the detected segment. The MT model is first pre-trained on the large-scale WMT19 Chinese-English corpus, then fine-tuned on BSTC. The WMT19 corpus includes 9.1 million sentence pairs collected from different sources, i.e., newswire, the United Nations Parallel Corpus, websites, etc. We use the Transformer-big configuration in the following experiments.
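The control flow of the cascade can be sketched as follows; `segment_label` and `translate` are placeholders standing in for the ERNIE-based 5-class classifier and the Transformer MT model, not real APIs:

```python
# Labels on which the cascade triggers a translation.
BOUNDARY_LABELS = {"MU", "period", "question_mark"}

def cascade(tokens, segment_label, translate):
    """Stream tokens through the segmentation model; whenever an MU or
    sentence boundary is detected, translate the buffered segment."""
    outputs, buffer = [], []
    for token in tokens:
        buffer.append(token)
        if segment_label(buffer) in BOUNDARY_LABELS:
            outputs.append(translate("".join(buffer)))
            buffer = []
    if buffer:  # flush any trailing segment at the end of the stream
        outputs.append(translate("".join(buffer)))
    return outputs
```

The same loop serves both settings in the experiments below: for speech translation only sentence boundaries trigger translation, while for simultaneous translation MUs trigger it as well.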

Performance of Speech Translation
Speech translation aims at translating accurately without considering system delay. Therefore, we only perform translation when sentence boundaries (periods and question marks) are detected by the sentence segmentation module.
The MT model is first trained on WMT, then fine-tuned in two settings: on the 37,901 training pairs of <transcription, translation>, and on the corresponding <asr, translation> pairs. The purpose of fine-tuning on transcriptions is to adapt the model to the speech domain, while the purpose of fine-tuning on ASR output is to improve the robustness of the MT model against recognition errors. Our model pre-trained on WMT19 achieves a BLEU score of 25.1 on Newstest19.
We evaluate our systems on the dev/test sets using streaming transcription and streaming ASR output as inputs. For each talk, the streaming text is first segmented by the sentence segmentation module, and the translations of the segments are then concatenated into one long sequence to compute the BLEU score. The results are listed in Table 5. Note that the large BLEU gap between the dev and test sets arises because the dev set has only one reference while the test set has four references.

Contribution of fine-tuning on speech translation data: The systems pre-trained on WMT obtain an absolute improvement on both clean and noisy input after fine-tuning on <transcription, translation> pairs. The performance of the former model increases by 4.35 BLEU on average, and the latter obtains a 1.93 BLEU improvement on average. This indicates that the transcribed training data can still bring a large improvement after pre-training on a large-scale corpus, probably because it is closer to the test set in terms of domain (speech) and noise (disfluencies in spoken language).

Contribution of fine-tuning on noisy data: Training on a corpus containing ASR errors is effective in improving the robustness of the NMT model, as demonstrated by fine-tuning on the <ASR, translation> pairs. As shown in the last row of Table 5, the pre-trained model improves by 2.93 and 2.59 BLEU on average when tested on the streaming transcript and the streaming ASR output, respectively. This shows that, compared with fine-tuning on the clean transcriptions, the model fine-tuned on ASR output is less sensitive to recognition errors.

Performance of Simultaneous Translation
Different from speech translation, simultaneous translation must balance translation quality and latency. Therefore, we fix the ASR and MT modules and evaluate our system under different sentence segmentation results. In simultaneous translation, once an MU or a sentence boundary is detected, the MU or sentence is translated immediately. To maintain coherent and consistent paragraph translation, we perform context-aware translation: except for the first segment in a sentence, each subsequent segment is translated with force-decoding, taking the translation already generated for the preceding segments as a fixed prefix.
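The force-decoding scheme can be sketched as follows, assuming a prefix-constrained decoder `translate_with_prefix(src, prefix)` (a placeholder for the MT model's forced decoding mode, not a real API):

```python
def context_aware_translate(segments, translate_with_prefix):
    """Within a sentence, each new segment is translated by decoding the
    full source seen so far while forcing the target text already shown
    to the audience as a fixed prefix; only the new suffix is emitted."""
    shown, source = "", ""
    outputs = []
    for seg in segments:
        source += seg
        full = translate_with_prefix(source, shown)
        outputs.append(full[len(shown):])  # emit only the newly decoded part
        shown = full
    return outputs
```

Because earlier output is never revised, the audience sees a monotonically growing translation, at the cost of committing to possibly premature prefixes.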
The performance of the system on the dev set and test set is shown in Figure 5 and Figure 6, respectively 12. We use BLEU to evaluate translation quality, and average lagging (AL) (Ma et al., 2019) and consecutive wait (CW) (Gu et al., 2017) as latency metrics. δ is the hyperparameter defined in Zhang et al. (2020) as the threshold of the sentence segmentation module. The figures show that translation quality improves consistently as latency increases. For the simultaneous translation points, AL ranges from 7 to 12 and CW ranges from 6 to 11 on both the dev and test sets. In addition, we also plot the full-sentence translation results, denoted by "ASR-Sentence" and "Transcript-Sentences" in the two figures. Full-sentence translation implements a high-latency policy in which translation is triggered only when a complete sentence has been received. As shown in the figures, the delay of both "ASR-Sentence" and "Transcript-Sentences" is much higher than that of the simultaneous translation results.
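Average lagging can be computed from the read/write trace of a policy; the sketch below follows the definition in Ma et al. (2019), where g(t) is the number of source tokens read before emitting target token t and γ = |y|/|x|:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019).

    g[t] is the number of source tokens read before emitting target
    token t+1 (0-indexed).  AL averages g(t) - (t-1)/gamma over the
    target steps up to the first one where the full source was read."""
    gamma = tgt_len / src_len
    # tau: first target step at which the full source has been read
    tau = next((t + 1 for t, gt in enumerate(g) if gt >= src_len), len(g))
    return sum(g[t] - t / gamma for t in range(tau)) / tau
```

For example, on a 4-token pair, a full-sentence policy (g = [4, 4, 4, 4]) has AL = 4, while a wait-1 policy (g = [1, 2, 3, 4]) has AL = 1, matching the intuition that AL measures how many source tokens the system lags behind an ideal simultaneous translator.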

Conclusion and Future Work
In this paper, we release a challenging dataset for research on Chinese-English speech translation and simultaneous translation. Based on this dataset, we report a competitive benchmark built on a cascade system. In the future, we will expand this dataset and propose an effective method for developing an end-to-end speech translation model.

Table 6: Specific data corresponding to Figure 5 and Figure 6.