ICT’s System for AutoSimTrans 2021: Robust Char-Level Simultaneous Translation

Simultaneous translation (ST) outputs the translation while still reading the input sentence, and is an important component of simultaneous interpretation. In this paper, we describe our submitted ST system, which won first place in the streaming transcription input track of the Chinese-English translation task of AutoSimTrans 2021. Aiming at the robustness of ST, we first propose char-level simultaneous translation and apply the wait-k policy to it. In addition, we apply two data processing methods and combine two training methods for domain adaptation. Our method equips the ST model with stronger robustness and domain adaptability. Experiments on streaming transcription show that our method outperforms the baseline at all latency levels; at low latency in particular, it improves by about 6 BLEU. Ablation studies further verify the effectiveness of each module in the proposed method.


Introduction
Automatic simultaneous translation (ST) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019), a task in machine translation (MT), aims to output the target translation while reading the source sentence. Standard machine translation is full-sentence MT, which waits for the complete source input and only then starts translating. The large latency caused by full-sentence MT is unacceptable in many real-time scenarios. In contrast, ST is widely used in real simultaneous speech translation scenarios, such as simultaneous interpretation, synchronized subtitles, and live broadcasting.
Previous methods (Ma et al., 2019; Arivazhagan et al., 2019) for ST are all evaluated on existing full-sentence MT parallel corpora, ignoring the real speech translation scenario. In the real scene, the paradigm of simultaneous interpretation is Automatic Speech Recognition (ASR) → simultaneous translation (ST) → Text-to-Speech Synthesis (TTS), where all three parts are carried out simultaneously. As a downstream task of simultaneous ASR, the input of ST is often not exactly correct and lies in the spoken language domain. Thus, robustness and domain adaptability become two challenges for the ST system.
For robustness: since the input of the ST system is the ASR result (streaming transcription), which is incremental and may be unsegmented or incorrectly segmented, the subword-level segmentation (Ma et al., 2019) of the streaming transcription seriously affects the ST result. Existing methods (Li et al., 2020) often remove the last token after segmentation to prevent it from being incomplete, which leads to a considerable increase in latency. Table 1 shows an example of the tokenization of streaming transcription input with different methods. In steps 4-7 of standard wait-2, the input prefix differs from that of the previous step, while the previous output prefix is not allowed to be modified in ST, which leads to serious translation errors. Although removing the last token improves robustness, many consecutive steps then receive no new input, which greatly increases the latency.
For domain adaptability: existing spoken language domain corpora are scarce, and the general domain corpora for MT differ substantially from the spoken language domain of ST in word order, punctuation, and modal particles, so ST needs to complete domain adaptation efficiently.
In our system, we propose a char-level wait-k policy for simultaneous translation, which is more robust to streaming transcription input. Besides, we apply data augmentation and combine two training methods to adapt the model to the spoken language domain. Specifically, the source side of the char-level wait-k policy is a sequence segmented into characters, while the target side keeps subword-level segmentation with BPE operations (Sennrich et al., 2016). When decoding, the char-level wait-k policy first waits for k source characters, then alternately reads one character and outputs one target subword. Table 1 shows the tokenization results of the char-level wait-k policy, which not only guarantees the stability of the input prefix but also avoids unnecessary latency. To adapt to the spoken language domain, we first pre-train an ST model on the general domain corpus and then fine-tune it on the spoken language domain corpus. To improve the effect and efficiency of domain adaptation, we carry out data augmentation on both the general domain corpus and the spoken language domain corpus and combine two different training methods.

Table 1: An example of the tokenization results of standard wait-k, standard wait-k + remove last token, and char-level wait-k when dealing with streaming transcription input (take k = 2 as an example). Red mark: the source prefix changes during streaming input. Green mark: no new input in consecutive steps since the last token is removed.
In the streaming transcription track of the Chinese → English translation task of AutoSimTrans 2021, we evaluate the proposed method on a real speech corpus (Zhang et al., 2021). Our method exceeds the baseline model at all latency levels and is especially prominent at lower latency.
Our contributions can be summarized as follows:

• To the best of our knowledge, we are the first to propose char-level simultaneous translation, which is more robust when dealing with real streaming input.
• We apply data augmentation and combine two training methods, which effectively improve domain adaptation and mitigate the shortage of spoken language corpora.

Task Description
We participated in the streaming transcription input track of the Chinese-English translation task of AutoSimTrans 2021. An example of the task is shown in Table 2. Streaming transcription is manually transcribed without word segmentation; at each step, the source input grows by one character. The task uses AL and BLEU to evaluate the latency and translation quality of the submitted systems, respectively.

Background
Our system is based on a variant of the wait-k policy (Ma et al., 2019), so we first briefly introduce the wait-k policy and its training method. The wait-k policy first waits for k source tokens and then reads and writes alternately, i.e., the output always lags k tokens behind the input. As shown by 'standard wait-k policy' in Figure 1, if k = 2, the first target token is output after reading 2 source tokens, and afterwards one target token is output as soon as each source token is read.
Define g(t) as a monotonic non-decreasing function of t, which represents the number of source tokens read in when outputting the target token y_t. For the wait-k policy, g(t) is calculated as:

g(t) = min{k + t − 1, |x|}    (1)

where x is the input subword sequence.
The wait-k policy is trained with the "prefix-to-prefix" framework, in which, when generating the t-th target word, the source tokens visible to the encoder are limited to the first g(t) tokens.
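To make the schedule concrete, here is a minimal Python sketch of the wait-k function g(t) in Eq. (1); the function name and the toy example are ours, for illustration only.

def wait_k_g(t, k, src_len):
    # g(t) = min(k + t - 1, |x|) from Eq. (1); t is 1-indexed.
    return min(k + t - 1, src_len)

# With k = 2 and a 6-token source, the first target token sees 2 source
# tokens, the second sees 3, and so on until the whole source is read.
print([wait_k_g(t, k=2, src_len=6) for t in range(1, 8)])
# -> [2, 3, 4, 5, 6, 6, 6]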

Methods
To improve the robustness and domain adaptability of ST, we enhance our system in three respects: the read / write policy, data processing, and training methods.

Char-Level Wait-k Policy
To enhance robustness when dealing with streaming transcription, we propose char-level simultaneous translation and apply the wait-k policy to it.

Char-Level Simultaneous Translation
Character-level neural machine translation (Ling et al., 2015; Lee et al., 2017; Cherry et al., 2018; Gao et al., 2020) tokenizes the source and target sentences into characters, thereby gaining advantages over subword-level neural machine translation in some specific aspects, such as avoiding out-of-vocabulary problems (Passban et al., 2018) and errors caused by subword-level segmentation (Tang et al., 2020). In terms of translation quality, however, character-level MT still has difficulty matching subword-level MT. An important reason is that a single incorrectly generated character directly corrupts the entire target word (Sennrich, 2017).
To improve the robustness of the ST system when dealing with unsegmented incremental input, while avoiding the performance degradation caused by character-level MT, we propose char-level simultaneous translation, which is more suitable for streaming input. The framework of char-level ST is shown in the lower part of Figure 1.
Different from subword-level ST, given the parallel sentence pair <X, Y>, the source of the proposed char-level ST model is the character sequence c = (c_1, ..., c_n) after char-level tokenization, and the target is the subword sequence y = (y_1, ..., y_m) after word segmentation and BPE (Sennrich et al., 2016), where n and m are the source and target sequence lengths, respectively.

[Figure 1: Standard wait-k policy vs. our char-level wait-k policy (take k = 2 as an example).]
The word segmentation and BPE operations at the target end are the same as in subword-level MT (Vaswani et al., 2017), and our char-level tokenization is similar to, but not fully consistent with, character-level MT (Yang et al., 2016; Nikolov et al., 2018; Saunders et al., 2020). The char-level tokenization we propose divides each source language character into a separate token, while other characters (such as numbers and characters of other languages) are still grouped into tokens by complete words. An example of char-level tokenization is shown in Table 3: each Chinese character becomes one token, while the number (12) and the English word (UNIT) are each kept as a single token. Char-level tokenization is more suitable for streaming transcription, as it ensures that the newly input content at each step is a complete token and that the input prefix never changes. With complete tokens and a stable prefix, the robustness of char-level ST is greatly improved.
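A minimal sketch of char-level tokenization in Python, assuming the rule described above; the regular expression is our own approximation, not the system's actual implementation.

import re

# One token per Chinese character; runs of Latin letters or digits stay
# whole; any other non-space character becomes its own token.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|\d+|[^\s]")

def char_level_tokenize(sentence):
    return _TOKEN_RE.findall(sentence)

print(char_level_tokenize("欢迎来到UNIT系统的第12期高级课程。"))
# -> ['欢', '迎', '来', '到', 'UNIT', '系', '统', '的', '第', '12',
#     '期', '高', '级', '课', '程', '。']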
Why char-level simultaneous translation? Our use of char-level ST is motivated by three desiderata. 1) With incremental source input, char-level ST is more robust since it avoids the unstable prefixes caused by word segmentation, as shown in Table 1. 2) Char-level ST achieves finer-grained latency: if one character is enough to express the meaning of an entire word, the ST system does not have to wait for the complete word before translating. 3) Char-level ST only performs char-level tokenization on the source, while the target retains subword-level tokenization, so its translation performance is not affected much, as shown in Table 7.

Input Sentence: 欢迎来到UNIT系统的第12期高级课程。
Output Sentence: welcome to the 12th advanced course on UNIT system .
char-level tokenization: 欢 / 迎 / 来 / 到 / UNIT / 系 / 统 / 的 / 第 / 12 / 期 / 高 / 级 / 课 / 程 / 。
subword-level MT: welcome / to / the / 12@@ / th / advanced / course / on / UNIT / system / .

Table 3: An example of the tokenization method applied by the char-level wait-k policy. For the source, we use char-level tokenization, which separates each source language character into a separate segment and divides the others by words. For the target, we apply the same operation as conventional subword-level MT. The char-level source row and the subword-level target row are the source and target of our proposed ST model.

Read / Write Policy
For the read / write policy, we apply the wait-k policy to the proposed char-level ST. The difference between the char-level wait-k policy and the standard wait-k policy is that each token in the standard wait-k policy is a subword, while each token in the char-level wait-k policy is a character (tokens of other languages or numbers remain whole words), as shown in Figure 1.
We rewrite g(t) in Eq. (1) as g_k(t) for the char-level wait-k policy, which represents the number of source tokens (characters) read in when outputting the target token y_t:

g_k(t) = min{k + t − 1, |c|}    (2)

where c is the input character sequence.
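The read / write loop implied by g_k(t) can be sketched as follows; model.predict_next is a hypothetical interface standing in for one decoder step of the incremental Transformer.

def char_level_wait_k_decode(model, stream, k, max_len=200):
    # `stream` yields one source token per step (a Chinese character, or
    # a whole number / foreign word, cf. char-level tokenization).
    src, tgt = [], []
    for token in stream:
        src.append(token)                    # READ one token
        if len(src) < k:                     # still waiting for k tokens
            continue
        sub = model.predict_next(src, tgt)   # WRITE one target subword
        tgt.append(sub)
        if sub == "<eos>" or len(tgt) >= max_len:
            return tgt
    while len(tgt) < max_len:                # source exhausted: finish
        sub = model.predict_next(src, tgt)   # with the full source
        tgt.append(sub)
        if sub == "<eos>":
            break
    return tgt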
Another significant advantage of the standard wait-k policy is that it acquires some implicit prediction ability during training, and the char-level wait-k policy further strengthens this ability and improves the stability of prediction. The reason is that the char-level granularity is smaller, so char-level prediction is simpler and more accurate than subword-level prediction. As shown in Figure 1, it is much easier to predict "系统" given "系", since few characters can follow "系".

Domain Adaptation
To improve the quality of domain adaptation, we apply some modifications to all training corpora, including the general domain and the spoken language domain, to bring them closer to streaming transcription. Besides, we augment the spoken language corpus to make up for the lack of data.

Depunctuation
For the training corpora, including the general domain and the spoken language domain, the most serious difference from streaming transcription is that sentences in streaming transcription usually lack ending punctuation, as shown in Table 2. Since punctuation in the training corpora is complete, and the ending punctuation is usually followed by <eos>, a model trained on them tends to wait for the source ending punctuation and then generate the corresponding target ending punctuation and <eos> to stop translating. As a result, given the unpunctuated input in streaming transcription, it is difficult for the model to generate target punctuation and <eos> to stop the translation. Therefore, to strengthen the model's ability to translate punctuation from unpunctuated sentences, we delete the ending punctuation of the source sentence and leave the target sentence unchanged, as shown in Table 4. Note that our depunctuation operation is limited to the sentence-final punctuation of the source ('。', '！', '？').
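A sketch of the depunctuation step, assuming only the three sentence-final marks listed above are stripped:

ENDING_PUNCT = ("。", "！", "？")

def depunctuate_source(src):
    # Remove one sentence-final punctuation mark from the source side
    # only; the target sentence is left unchanged.
    if src.endswith(ENDING_PUNCT):
        return src[:-1]
    return src

print(depunctuate_source("欢迎来到UNIT系统的第12期高级课程。"))
# -> 欢迎来到UNIT系统的第12期高级课程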

Data Augmentation
Since the spoken language domain corpus is small, we perform data augmentation on the source sentences. For each source sentence, we generate five augmented variants, one per operation: add a comma, add a tone character, copy an adjacent character, replace a character with its homophone, or delete a character; the target sentence remains unchanged. This improves the robustness of the model while augmenting the data. An example of data augmentation is shown in Table 5.
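A sketch of the five operations is given below; the tone-character list and the homophone table are illustrative stand-ins, since the paper does not specify the ones actually used.

import random

TONE_CHARS = ["啊", "呢", "吧", "嗯"]           # hypothetical examples
HOMOPHONES = {"到": "道", "的": "得", "课": "克"}  # hypothetical mapping

def augment_source(src):
    chars = list(src)
    i = random.randrange(len(chars))
    return [
        "".join(chars[:i] + ["，"] + chars[i:]),                        # add a comma
        "".join(chars[:i] + [random.choice(TONE_CHARS)] + chars[i:]),   # add a tone character
        "".join(chars[:i] + [chars[i]] + chars[i:]),                    # copy an adjacent character
        "".join(chars[:i] + [HOMOPHONES.get(chars[i], chars[i])] + chars[i + 1:]),  # homophone
        "".join(chars[:i] + chars[i + 1:]),                             # delete a character
    ]  # five variants per source sentence; targets stay unchanged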

Training Methods
Our method is based on the Transformer (Vaswani et al., 2017), and training is divided into two stages. First, we pre-train an ST model on the general domain MT corpus, and then fine-tune it on the spoken language domain corpus. For pre-training, we apply multi-path training (Elbayad et al., 2020) and future-guided training (Zhang et al., 2020b) to enhance the prediction ability and to avoid the huge cost of training a different model for each k. For fine-tuning, we apply the original prefix-to-prefix framework (Ma et al., 2019).

Pre-training
To improve the predictive ability of the ST model, we apply the future-guided training proposed by Zhang et al. (2020b). Besides the incremental Transformer for simultaneous translation with the char-level wait-k policy, we introduce a full-sentence Transformer, used as the teacher of the incremental Transformer through knowledge distillation. The full-sentence Transformer is trained with the cross-entropy loss:

L(θ_full) = − Σ_{(c,y)∈D_g} Σ_{t=1}^{|y|} log p(y_t | y_{<t}, c; θ_full)    (3)

where θ_full is the parameter of the full-sentence Transformer and D_g is the general domain corpus. For the incremental Transformer, since it applies the char-level wait-k policy, the source tokens participating in translation are limited to the first g_k(t) when decoding the t-th target token. For each k, the decoding probability is calculated as:

p(y | c; θ_incr, k) = Π_{t=1}^{|y|} p(y_t | y_{<t}, c_{≤g_k(t)}; θ_incr)    (4)

where c and y are the input character sequence and the output subword sequence, respectively, c_{≤g_k(t)} represents the first g_k(t) tokens of c, and θ_incr is the parameter of the incremental Transformer.
Following Elbayad et al. (2020), to cover all possible k during training, we apply multi-path training: k is not fixed during training, but randomly and uniformly sampled from K = [1, ..., |c|], the set of all possible values of k. The incremental Transformer is also trained with the cross-entropy loss:

L(θ_incr) = − Σ_{(c,y)∈D_g} E_{k∼U(K)} Σ_{t=1}^{|y|} log p(y_t | y_{<t}, c_{≤g_k(t)}; θ_incr)    (5)

For the knowledge distillation between the full-sentence Transformer and the incremental Transformer, we apply an L2 regularization term between their encoder hidden states:

L_reg = || z_incr − z_full ||_2^2    (6)

where z_incr and z_full represent the encoder hidden states of the incremental Transformer and the full-sentence Transformer, respectively. Finally, the total loss L is calculated as:

L = L(θ_full) + L(θ_incr) + λ L_reg    (7)

where λ is a hyper-parameter; we set λ = 0.1 in our system.
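The total loss in Eq. (7) can be sketched in PyTorch as below; the random tensors stand in for real model outputs, and all shapes and names are ours, not the system's actual code.

import torch
import torch.nn.functional as F

lam = 0.1                                    # lambda in Eq. (7)
vocab, tgt_len, src_len, d_model = 1000, 7, 10, 512

logits_full = torch.randn(tgt_len, vocab)    # full-sentence Transformer
logits_incr = torch.randn(tgt_len, vocab)    # incremental Transformer
gold = torch.randint(0, vocab, (tgt_len,))   # reference subword ids

z_full = torch.randn(src_len, d_model)       # teacher encoder states
z_incr = torch.randn(src_len, d_model)       # student encoder states

ce_full = F.cross_entropy(logits_full, gold)   # Eq. (3)
ce_incr = F.cross_entropy(logits_incr, gold)   # Eq. (5), for one sampled k
l_reg = F.mse_loss(z_incr, z_full.detach())    # Eq. (6), mean-squared form
loss = ce_full + ce_incr + lam * l_reg         # Eq. (7)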

Fine-tuning
After pre-training the ST model, we use the spoken language domain corpus for fine-tuning. The spoken language domain corpus is small, and meanwhile most of the word order between target and source is the same, so we do not continue with the multi-path and future-guided training methods. We fix k, use the original prefix-to-prefix framework for training, and train a separate model for each k. Given k, the incremental Transformer is trained with the cross-entropy loss:

L_ft(θ_incr) = − Σ_{(c,y)∈D_s} Σ_{t=1}^{|y|} log p(y_t | y_{<t}, c_{≤g_k(t)}; θ_incr)    (8)

where D_s is the spoken language domain corpus. Finally, for each k, we fine-tune an ST model.

Dataset
The dataset for the Chinese → English task provided by the organizer contains three parts, shown in Table 6. CWMT19 is the general domain corpus, consisting of 9,023,708 sentence pairs. Transcription consists of 37,901 sentence pairs and Dev. Set consists of 956 sentence pairs; both are spoken language domain corpora collected from real speeches (Zhang et al., 2021). We use CWMT19 to pre-train the ST model, then use Transcription for fine-tuning, and finally evaluate the latency and translation quality of our system on Dev. Set. Note that we use the streaming transcription provided by the organizer for testing. The streaming transcription consists of 23,836 lines, composed by breaking each sentence in Dev. Set into lines whose length is incremented by one character until the end of the sentence.
We remove from CWMT19 the sentence pairs with an extreme length ratio between source and target, and finally obtain 8,646,245 pairs of clean corpus. We augment the Transcription data according to the method in Sec. 4.2.2, obtaining 227,406 sentence pairs. Meanwhile, for both CWMT19 and Transcription, we remove the ending punctuation according to the method in Sec. 4.2.1.
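The cleaning step can be sketched as a simple length-ratio filter; the threshold below is illustrative, as the paper does not report the exact value used.

def keep_pair(src_tokens, tgt_tokens, max_ratio=2.5):
    # Drop sentence pairs whose source/target length ratio is extreme.
    ls, lt = max(len(src_tokens), 1), max(len(tgt_tokens), 1)
    return max(ls / lt, lt / ls) <= max_ratio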
Given the processed corpus after cleaning and augmentation, we first perform char-level tokenization (Sec. 4.1) on the Chinese sentences, and tokenize and lowercase the English sentences with Moses. We apply BPE (Sennrich et al., 2016) with 16K merge operations on the English side.

System Setting
We set the standard wait-k policy as the baseline and compare our method with it. We conduct experiments on the following systems:

Offline: offline model, i.e., full-sentence MT based on the Transformer. We report the results of the subword-level / char-level offline models with greedy / beam search in Table 7.
Standard Wait-k: the standard subword-level wait-k policy proposed by Ma et al. (2019), used as our baseline. For a fair comparison, we train it with the same training methods as our method (Sec. 4.3).
Standard Wait-k + rm Last Token: the standard subword-level wait-k policy where, at inference time, the last token after word segmentation is removed to prevent it from being incomplete.
Char-Level Wait-k: our proposed method; refer to Sec. 4 for details.
The implementation of all systems is based on Transformer-Big and adapted from the Fairseq library (Ott et al., 2019). The parameters are the same as in the original Transformer (Vaswani et al., 2017). All systems are trained on 4 RTX-3090 GPUs.

Evaluation Metric
For evaluation metrics, we use BLEU (Papineni et al., 2002) and AL (Ma et al., 2019) to measure translation quality and latency, respectively.
The latency metric AL for the char-level wait-k policy is calculated with g_k(t) in Eq. (2):

AL = (1/τ) Σ_{t=1}^{τ} [ g_k(t) − (t − 1) / r ],  r = |y| / |c|,  τ = argmin_t { g_k(t) = |c| }    (9)

where c and y are the input character sequence and the output subword sequence, respectively. Note that since the streaming transcription provided by the organizer adds one source character at each step, we use character-level AL to evaluate the latency of all systems. The script for calculating BLEU is provided by the organizer at https://dataset-bj.cdn.bcebos.com/qianyan%2FAST_Challenge.zip, and AL is calculated as in https://github.com/autosimtrans/SimulTransBaseline/blob/master/latency.py.
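A small Python sketch of character-level AL following Eq. (9); the schedule g below is the char-level wait-k schedule of Eq. (2) with illustrative lengths.

def average_lagging(g, src_len, tgt_len):
    # AL = (1/tau) * sum_{t=1}^{tau} [ g(t) - (t - 1) / r ], r = |y| / |c|,
    # where tau is the first target step at which the full source is read.
    r = tgt_len / src_len
    tau = next(t for t in range(1, tgt_len + 1) if g(t) >= src_len)
    return sum(g(t) - (t - 1) / r for t in range(1, tau + 1)) / tau

g = lambda t, k=3, n=10: min(k + t - 1, n)   # char-level wait-3, |c| = 10
print(average_lagging(g, src_len=10, tgt_len=12))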

Main Result
We compare the performance of the proposed char-level wait-k policy and the subword-level wait-k policy, setting k = 1, 2, ..., 15 to draw the curve of translation quality against latency, as shown in Figure 2. Note that the same value of k for the char-level and subword-level wait-k policies does not mean that their latencies are similar: lagging k tokens in char-level wait-k means strictly waiting for k characters, while subword-level wait-k waits for k subwords, which contain more characters.
'Char-Level Wait-k' outperforms 'Standard Wait-k' and 'Standard Wait-k + rm Last Token' at all latency levels, and improves by about 6 BLEU at low latency (AL = 1.10). Besides, char-level wait-k performs more stably and robustly than standard wait-k on streaming transcription input, because char-level wait-k has a stable prefix while the prefix of standard wait-k may change between adjacent steps due to different word segmentation results. 'Standard Wait-k + rm Last Token' solves the issue that the last token may be incomplete, so its translation quality is higher than 'Standard Wait-k' under the same k, improving by about 0.56 BLEU (averaged over all k). However, 'Standard Wait-k + rm Last Token' increases latency: compared with 'Standard Wait-k', it waits for one more token on average under the same k. Therefore, viewed over the whole curve, its improvement is limited.
Char-level wait-k is particularly outstanding at low latency, and it achieves good translation quality even when AL is less than 0. It is worth mentioning that AL can be less than 0 because the generated translation is shorter, making the ratio |c| / |y| in Eq. (9) greater than 1.

Effect of Data Processing
To analyze the effect of data processing, namely 'Depunctuation' and 'Data Augmentation', we show the results without each of them in Figure 3. We notice that data augmentation improves the translation quality of the model by 1.61 BLEU (averaged over all k), and the model becomes more stable and robust. 'Depunctuation' is even more important: if we keep the ending punctuation in the training corpora, the translation quality drops by 2.27 BLEU and the latency increases by 2.83 (averaged over all k). This is because streaming transcription input has no ending punctuation, which makes it hard for the model to generate the target ending punctuation; without it, the model struggles to generate <eos> and tends to produce longer translations.

Ablation Study on Training Methods
To enhance the performance and robustness under low latency, we combine the future-guided and multi-path training methods in pre-training. To verify the effectiveness of the two training methods, we conduct an ablation study and show the results of removing each of them in Figure 4.
When either method is removed, the translation quality decreases, especially at low latency. Removing 'Future-guided' decreases translation quality by 1.49 BLEU (averaged over all k), and removing 'Multi-path' decreases it by 0.76 BLEU (averaged over all k). This shows that both training methods effectively improve translation quality under low latency, especially 'Future-guided'.

Related Work
Previous ST methods mainly fall into two lines: precise read / write policies and stronger predictive ability.
For read / write policies, early work used segmented translation, applying full-sentence translation to each segment (Bangalore et al., 2012; Cho and Esipova, 2016). Gu et al. (2017) trained an agent through reinforcement learning to decide between read and write. Dalvi et al. (2018) proposed STATIC-RW, which first performs S READs and then alternately performs RW WRITEs and RW READs. Ma et al. (2019) proposed the wait-k policy, which first reads k tokens and then writes and reads synchronously. The wait-k policy has achieved remarkable performance because it is easy to train and stable, and it is widely used in simultaneous translation. Zheng et al. (2019a) generated gold read / write sequences for input sentences by rules, and then trained an agent with the input sentences and gold read / write sequences. Zheng et al. (2019b) introduced a "delay" token {ε} into the target vocabulary to read one more token. Arivazhagan et al. (2019) proposed MILk, which uses a Bernoulli random variable to determine whether to output. Ma et al. (2020) proposed MMA, the implementation of MILk based on the Transformer. Zheng et al. (2020) proposed a decoding policy that uses multiple fixed models to accomplish adaptive decoding. Zhang et al. (2020a) proposed a novel adaptive segmentation policy for ST.
For predicting the future, Matsubara et al. (2000) applied pattern recognition to predict verbs in advance. Grissom II et al. (2014) used a Markov chain to predict the next word and the final verb. Oda et al. (2015) predicted unseen syntactic constituents to help generate complete parse trees and perform syntax-based simultaneous translation. Later work added a Predict operation to the agent of Gu et al. (2017), predicting the next word as an additional input. Elbayad et al. (2020) enhanced the wait-k policy by sampling different k during training. Zhang et al. (2020b) proposed future-guided training, which introduces a full-sentence Transformer as the teacher of the ST model and uses future information to guide training through knowledge distillation.
Although the previous methods perform well, they were all evaluated on traditional MT corpora instead of real streaming spoken language corpora. Therefore, they all ignore the robustness and domain adaptation of the ST model in the face of real streaming input. Our method bridges the gap between the MT corpus and streaming spoken language domain input, and is more robust and better adapted to the spoken language domain.

Conclusion and Future Work
This paper described our submitted system, which won first place in AutoSimTrans 2021. For streaming transcription input from real scenarios, our proposed char-level wait-k policy is more robust than the standard subword-level wait-k policy. Besides, we propose two data processing operations to improve adaptability to the spoken language domain. For training, we combine two existing training methods that have been proven effective. Experiments on the data provided by the organizer demonstrate the superiority of our method, especially at low latency.
In this competition, we implemented the char-level wait-k policy on the Chinese source. For language pairs with a large length ratio between the source (characters) and the target (BPE subwords), multiple characters can be read at each step to prevent the issues caused by an excessively long char-level source. We leave char-level simultaneous translation for other languages (such as German and English), with both fixed and adaptive policies, to future work.