Curriculum Pre-training for End-to-End Speech Translation

End-to-end speech translation poses a heavy burden on the encoder because it has to transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder only through simple speech recognition is not enough, and high-level linguistic knowledge should be considered. Inspired by this, we propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages. The difficulty of these courses is gradually increasing. Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.


Introduction
Speech-to-Text translation (ST) is essential to breaking the language barrier for communication. It aims to translate a segment of source language speech to the target language text. To perform this task, prior works either employ a cascaded method, where an automatic speech recognition (ASR) model and a machine translation (MT) model are chained together, or an end-to-end approach, where a single model converts the source language audio sequence to the target language text sequence directly (Berard et al., 2016).
Because it alleviates error propagation and lowers latency, the end-to-end ST model has been a hot topic in recent years. However, training such a model requires large amounts of paired source audio and target sentences, which are not available for most language pairs. To address this issue, previous works resort to pre-training (Berard et al., 2018; Bansal et al., 2019): they leverage the available ASR and MT data to pre-train an ASR model and an MT model respectively, and then initialize the ST model with the ASR encoder and the MT decoder. This strategy brings faster convergence and better results.
* Work done during an internship at Microsoft.
The end-to-end ST encoder has three essential roles: transcribing the speech, extracting the syntactic and semantic knowledge of the source sentence, and mapping it to a semantic space from which the decoder can generate the correct target sentence. These roles place a heavy burden on the encoder, which pre-training can alleviate. However, we argue that the current pre-training method restricts the power of pre-trained representations. An encoder pre-trained on the ASR task mainly focuses on transcription, that is, on learning the alignment between acoustic features and phonemes or words. It is not explicitly trained to capture linguistic knowledge or understand semantics, both of which are essential for translation.
To teach the model to understand the sentence and incorporate the required knowledge, extra courses should be taken before learning translation. Motivated by this, we propose a curriculum pre-training method for end-to-end ST. As shown in Figure 1, we first teach the model transcription through the ASR task. After that, we design two tasks, named the frame-based masked language model (FMLM) task and the frame-based bilingual lexicon translation (FBLT) task, to enable the encoder to understand the meaning of a sentence and map words in different languages. Finally, we fine-tune the model on ST data to obtain the translation ability.
For the FMLM task, we mask several segments of the input speech feature, each of which corresponds to a complete word. Then we let the encoder predict the masked word. This task aims to force the encoder to recognize the content of the utterance and understand the inner meaning of the sentence. In FBLT, for each speech segment that aligns with a complete word, whether or not it is masked, we ask the encoder to predict the corresponding target word. In this task, we give the model more explicit and strong cross-lingual training signals. Thus, the encoder has the ability to perform simple word translation, and the burden on the ST decoder is largely reduced. Besides, we adopt a hierarchical manner where different layers are guided to perform different tasks (first 8 layers for ASR and FMLM pre-training, and another 4 layers for FBLT pre-training). This is mainly because the three pre-training tasks have different requirements for language understanding and different output spaces. The hierarchical pre-training method can make the division of labor more clear and separate the incorporation of source semantic knowledge and cross-lingual alignments.
We conduct experiments on the LibriSpeech En-Fr and IWSLT18 En-De speech translation tasks, demonstrating the effectiveness of our pre-training method. The contributions of our paper are as follows: (1) We propose a novel curriculum pre-training method with three courses (transcription, understanding and mapping) that forces the encoder to generate the features the decoder needs. (2) We propose two new tasks to learn linguistic features, FMLM and FBLT, which explicitly teach the encoder source language understanding and target language meaning mapping. (3) Experiments show that both of the proposed advanced courses are helpful for speech translation, and that our curriculum pre-training leads to significant improvements.
Related Work

Speech Translation
Early work on speech translation used a cascade of an ASR model and an MT model (Ney, 1999; Matusov et al., 2005; Mathias and Byrne, 2006), which exposes the MT model to ASR errors. Recent successes of end-to-end models in the MT field (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017) and the ASR field (Chan et al., 2016; Chiu et al., 2018) inspired research on end-to-end speech-to-text translation systems, which avoid error propagation and high latency.
In this research line, Berard et al. (2016) give the first proof of the potential of an end-to-end ST model. After that, pre-training, multi-task learning, attention-passing and knowledge distillation have been applied to improve ST performance (Berard et al., 2018; Weiss et al., 2017; Bansal et al., 2018, 2019; Sperber et al., 2019; Liu et al., 2019; Jia et al., 2019). However, none of them attempt to guide the encoder to learn linguistic knowledge explicitly. Recently, Wang et al. (2019b) propose to stack an ASR encoder and an MT encoder as a new ST encoder, which incorporates acoustic and linguistic knowledge respectively. However, the gap between these two encoders is hard to bridge by simply concatenating them. Kano et al. (2017) propose structured-based curriculum learning for English-Japanese speech translation, where they use a new decoder to replace the ASR decoder and to learn the output from the MT decoder (fast track) or encoder (slow track). They formalize learning strategies that progress from easier to more difficult network structures. In contrast, we focus on curriculum learning in pre-training and increase the difficulty of the pre-training tasks.

Curriculum Learning
Curriculum learning is a learning paradigm that starts from simple patterns and gradually moves to more complex ones. The idea is inspired by the human learning process and was first applied in the context of machine learning by Bengio et al. (2009). Their study shows that this training approach results in better generalization and faster convergence. Its effectiveness has been verified in multiple tasks, including shape recognition (Bengio et al., 2009), object classification (Gong et al., 2016), question answering (Graves et al., 2017), etc. However, most studies focus on how to control the difficulty of the training samples and organize the order of the learning data in the context of single-task learning.
Our method differs from previous works in two ways: (1) We leverage the idea of curriculum learning for pre-training. (2) We do not train the model on the ST task directly with more and more difficult training examples or use more and more complicated structures. Instead, we design a series of tasks with increased difficulty to teach the encoder to incorporate diverse knowledge.

Overview
The overview of our training process is shown in Figure 2. It can be divided into three steps. First, we train the model towards the ASR objective $\mathcal{L}_{ASR}$ to learn transcription; we call this the elementary course. Next, we design two advanced courses (tasks) that teach the model to understand a sentence and to map words between two languages: the Frame-based Masked Language Model (FMLM) task and the Frame-based Bilingual Lexicon Translation (FBLT) task. In the FMLM task, we mask some speech segments and ask the encoder to predict the masked words. In the FBLT task, we ask the encoder to predict the target word for each speech segment that corresponds to a complete source word. In this stage, the encoder is updated by $\mathcal{L}_{ADV}$. We adopt a hierarchical training scheme in which the bottom $N$ encoder blocks perform the ASR and FMLM tasks, as both require outputs in the source word space, and all $N_e$ blocks are used in the FBLT task. After the two-phase pre-training, the encoder is finally combined with a new decoder or a pre-trained MT decoder to perform the ST task towards $\mathcal{L}_{ST}$.
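The three stages can be summarized as a small schedule. The sketch below is only an illustration and not the authors' code: the `Phase` structure and its field names are hypothetical, the epoch counts are those reported in the Implementation Details section, and the layer indices assume $N = 8$ and $N_e = 12$ as in the main experiments.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phase:
    """One stage of the curriculum (hypothetical structure, for illustration)."""
    name: str
    # (loss name, index of the encoder layer whose output feeds that loss)
    losses: Tuple[Tuple[str, int], ...]
    epochs: int


N, N_E = 8, 12  # source-side layers and total encoder layers

CURRICULUM = (
    Phase("elementary", (("ASR", N),), 50),
    Phase("advanced", (("FMLM", N), ("FBLT", N_E)), 20),
    # During fine-tuning the decoder reads the top encoder layer.
    Phase("fine-tune", (("ST", N_E),), 50),
)


def total_epochs(phases):
    """Sum the epochs spent over all stages."""
    return sum(phase.epochs for phase in phases)
```

Writing the schedule as data makes the ordering explicit: each stage starts from the parameters left by the previous one, and only the set of active losses changes.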

Problem Formulation
The speech translation corpus usually contains speech-transcription-translation triples, denoted as $S = \{(x, y^s, y^t)\}$. Specifically, $x = (x_1, \cdots, x_{T_x})$ is a sequence of acoustic features extracted from the speech signal, while $y^s = (y^s_1, \cdots, y^s_{T_s})$ and $y^t = (y^t_1, \cdots, y^t_{T_t})$ represent the corresponding transcription in the source language and the translation in the target language, respectively. To pre-train the encoder, an extra ASR dataset $A = \{(x, y^s)\}$ can be leveraged, so the data for encoder pre-training consists of both $S$ and $A$. After the encoder is pre-trained, we fine-tune the model using only $S$ to enable it to generate $y^t$ from $x$ directly. The model is updated using the cross-entropy loss $\mathcal{L}_{ST} = -\log P(y^t \mid x)$.
Model Architecture: In this work, we adopt the Transformer architecture of Karita et al. (2019). The encoder is a stack of two $3 \times 3$ 2D CNN layers with stride 2, followed by $N_e$ Transformer encoder blocks. The CNN layers downsample the input by a factor of 4. The decoder is a stack of $N_d$ Transformer decoder blocks.
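To make the factor-of-4 downsampling concrete, the following sketch computes how many time steps reach the Transformer blocks for a given number of input frames. It assumes unpadded convolutions, which the text does not specify; the function names are illustrative.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Output length of one convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def encoder_input_len(num_frames: int) -> int:
    """Time steps left after two 3x3, stride-2 convolutions (no padding assumed)."""
    return conv_out_len(conv_out_len(num_frames))


print(encoder_input_len(1000))  # 249
```

Under this assumption a 1000-frame utterance is reduced to 249 steps, slightly fewer than 1000 / 4 because of the unpadded edges.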

Elementary Course: Transcription
In the elementary course, we train an end-to-end ASR model with an architecture similar to that of the ST model. The ASR encoder consists of $N$ blocks, which are used to initialize the bottom $N$ blocks of the ST encoder. For the ASR task, we follow Karita et al. (2019) and employ a multi-task learning strategy in which both the E2E decoder and a CTC module predict the source sentence.
Offline experiments indicate that the CTC objective is crucial for attentional encoder-decoder based ASR models. The final objective combines the CTC loss $\mathcal{L}_{ctc}$ and the cross-entropy loss $\mathcal{L}_{CE}$:
$$\mathcal{L}_{ASR} = \alpha \mathcal{L}_{ctc} + (1 - \alpha) \mathcal{L}_{CE}$$
In this work, we set $\alpha$ to 0.3. The CTC loss operates on the encoder output and pushes the encoder to learn a frame-wise alignment between speech and words.
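A minimal sketch of this multi-task objective is given below. It assumes the usual linear interpolation used in hybrid CTC/attention training, with $\alpha$ weighting the CTC term; the function name is hypothetical and the two input losses are taken as already computed scalars.

```python
def asr_loss(ctc_loss: float, ce_loss: float, alpha: float = 0.3) -> float:
    """Interpolate the CTC and cross-entropy losses (assumed form of the objective)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * ce_loss
```

With $\alpha = 0.3$, most of the gradient signal comes from the attention decoder, while the CTC term still constrains the encoder output to follow a monotonic alignment.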

Advanced Courses: Understanding and Word Mapping
With the ability of transcription, we further propose two new tasks for the advanced courses.

Frame-based Masked Language Model
The design of the Frame-based Masked Language Model task is inspired by the Masked Language Model (MLM) objective of BERT (Devlin et al., 2019) and the semantic mask for the ASR task (Wang et al., 2019a). This task enables the encoder to understand the inner meaning of a segment of speech. As shown in Figure 2, we first perform forced alignment between the speech and the transcript to determine where in time particular words occur in the speech segment. For each word $y^s_i$, we obtain its start position $s_i$ and end position $e_i$ in the sequence $x$ from the forced alignment results. At each training iteration, we randomly sample some percentage of the words in $y^s$ and denote the selected word set as $\tilde{y}^s$. Next, for each selected token $y^s_j$ in $\tilde{y}^s$, we mask the corresponding speech piece $[x_{s_j} : x_{e_j}]$. The masked utterance is denoted as $\tilde{x}$ and used as input to the encoder:
$$h = \mathrm{Encoder}(\tilde{x})$$
After that, for a masked piece $[x_{s_j} : x_{e_j}]$, we average the corresponding output hidden states $[h_{s_j/4} : h_{e_j/4}]$ (the positions are divided by 4 because the CNN layers downsample the input by a factor of 4) and compute the probability distribution over source words as follows:
$$p(y^s_j \mid \tilde{x}) = \mathrm{softmax}\left(W^{\top} \cdot \mathrm{Average}\left([h_{s_j/4} : h_{e_j/4}]\right)\right)$$
In practice, the sentence is represented in BPE tokens and $W \in \mathbb{R}^{d_{model} \times |V_s|}$, where $|V_s|$ is the size of the source vocabulary. In this way, a speech piece can be aligned with one or more tokens. We compute the KL-divergence loss as:
$$\mathcal{L}_{FMLM} = \sum_{y^s_j \in \tilde{y}^s} \mathrm{KL}\left(q(y^s_j) \,\|\, p(y^s_j \mid \tilde{x})\right)$$
where $q(y^s_j) \in \mathbb{R}^{|V_s|}$ is a distribution over all BPE tokens in the source vocabulary $V_s$, defined as:
$$q(y^s_j)_{pos} = \begin{cases} 1/n_j & \text{if the token at index } pos \text{ is contained in } y^s_j \\ 0 & \text{otherwise} \end{cases}$$
where $pos$ represents the dimension index and $n_j$ is the total number of BPE tokens contained in word $y^s_j$.
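The per-word FMLM loss can be sketched in plain Python as follows. This is an illustrative reading of the description above rather than the authors' implementation: the function names are hypothetical, hidden states are lists of floats, the direction KL(q || p) is assumed, and a BPE token that occurs twice in a word is assumed to receive twice the mass so that $q$ still sums to one.

```python
import math
from collections import Counter


def average_states(hidden, start, end, factor=4):
    """Mean of the encoder states covering input frames start..end (inclusive)."""
    last = len(hidden) - 1
    lo = min(start // factor, last)
    hi = min(end // factor, last)
    span = hidden[lo:hi + 1]
    return [sum(col) / len(span) for col in zip(*span)]


def project(state, weight):
    """Compute W^T h for W stored as d_model rows of |V_s| columns."""
    return [sum(h * w for h, w in zip(state, column)) for column in zip(*weight)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_distribution(token_ids, vocab_size):
    """Length-normalized distribution over the BPE tokens of one word."""
    q = [0.0] * vocab_size
    for token, count in Counter(token_ids).items():
        q[token] = count / len(token_ids)
    return q


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p), skipping entries where q is zero."""
    return sum(qi * math.log(qi / max(pi, eps)) for qi, pi in zip(q, p) if qi > 0.0)


def fmlm_word_loss(hidden, start, end, weight, token_ids):
    """Loss for one masked word spanning input frames start..end."""
    p = softmax(project(average_states(hidden, start, end), weight))
    q = target_distribution(token_ids, len(p))
    return kl_divergence(q, p)
```

Using a distribution as the target, instead of a single label, is what lets one speech piece supervise several BPE tokens at once.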
In this work, we use a mask ratio of 15% following BERT, and each masked speech piece is filled with the mean value of the whole utterance following Park et al. (2019). Because FMLM focuses on understanding the source language, we compute its loss at the $N$-th layer of the encoder (the same layer as the ASR loss), in the hope that the bottom $N$ layers are concerned only with the source language.
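The masking step itself can be illustrated with a short sketch. It assumes that word spans are given as inclusive (start, end) frame indices from the forced alignment, and that at least one word is masked even in very short utterances; the latter is a choice made here for illustration, not something stated in the text.

```python
import random


def utterance_mean(frames):
    """Per-dimension mean over all frames of one utterance."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def mask_words(frames, word_spans, ratio=0.15, rng=None):
    """Return a masked copy of the frames and the indices of the masked words."""
    masked = [list(frame) for frame in frames]
    if not word_spans:
        return masked, []
    rng = rng or random.Random()
    n_mask = min(len(word_spans), max(1, round(ratio * len(word_spans))))
    chosen = sorted(rng.sample(range(len(word_spans)), n_mask))
    mean = utterance_mean(frames)  # computed on the unmasked utterance
    for j in chosen:
        start, end = word_spans[j]
        for t in range(start, end + 1):
            masked[t] = list(mean)
    return masked, chosen
```

Masking whole word spans, rather than random frames, prevents the encoder from recovering a word from its unmasked neighbouring frames.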

Frame-based Bilingual Lexicon Translation
Aside from predicting masked source words, we go further and leverage cross-lingual information. Specifically, for each segment of speech features $[x_{s_i} : x_{e_i}]$ that is aligned with a source word $y^s_i$, we assume we can obtain its target counterpart $\tilde{y}^t_i$. Similar to FMLM, we average the output hidden states from position $s_i/4$ to $e_i/4$, and then compute the probability distribution over the target vocabulary. The alignment between speech segments and target words is a many-to-many correspondence, so there are cases where $\tilde{y}^t_i$ contains nothing or contains multiple foreign words. In the former case, we set the loss to zero; in the latter case, we again compute a KL-divergence loss:
$$\mathcal{L}_{FBLT} = \sum_{i} \mathrm{KL}\left(q(\tilde{y}^t_i) \,\|\, p(\tilde{y}^t_i \mid \tilde{x})\right)$$
Here $q(\tilde{y}^t_i)$ is the length-normalized distribution over all tokens that appear in $\tilde{y}^t_i$. Note that the loss is computed on every speech segment, whether or not it is masked.
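The handling of empty and multi-token targets can be sketched as below. The sketch takes the predicted distribution of each segment as given, assumes the direction KL(q || p), and uses hypothetical function names.

```python
import math


def fblt_segment_loss(pred_probs, target_token_ids, eps=1e-12):
    """KL(q || p) for one speech segment; zero when no target word is aligned."""
    if not target_token_ids:
        return 0.0
    weight = 1.0 / len(target_token_ids)
    q = {}
    for token in target_token_ids:
        q[token] = q.get(token, 0.0) + weight  # length-normalized target
    return sum(qv * math.log(qv / max(pred_probs[tok], eps)) for tok, qv in q.items())


def fblt_loss(segment_probs, segment_targets):
    """Sum the loss over every segment, masked or not."""
    return sum(fblt_segment_loss(p, t) for p, t in zip(segment_probs, segment_targets))
```

Segments without an aligned target word contribute nothing, so words that have no counterpart in the translation do not push the encoder towards an arbitrary target token.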
The only remaining question is how to obtain $\tilde{y}^t_i$ for each speech segment. Since there are two types of data for pre-training, $(x, y^s, y^t) \in S$ and $(x, y^s) \in A$, we use two methods to get the alignment. For training examples $(x, y^s, y^t) \in S$, we use a reference-supervised method. In particular, we simply run the Moses scripts to establish word alignments. The procedure first runs GIZA++ to get source-to-target and target-to-source alignments, and then applies the heuristic grow-diag-final algorithm to get the final results. That is, $\forall y^s_i \in y^s$, we choose one word from its translation sentence as the corresponding word: $\exists \tilde{y}^t_i \in y^t$ s.t. $\tilde{y}^t_i \sim y^s_i$. For training examples $(x, y^s) \in A$, we apply a dictionary-supervised method. Through the above alignment process, we can calculate a bilingual lexical translation table $T$ from $\{(y^s, y^t) \mid (x, y^s, y^t) \in S\}$, which estimates the translation probability between a source word $w^s_i$ and a target word $w^t_j$, denoted as $T = \{(w^s_i, w^t_j, p(w^s_i, w^t_j))\}$. After that, we compute $\tilde{y}^t_i$ for each $y^s_i$ in $y^s$ according to $\tilde{y}^t_i = \arg\max_{w^t_j} p(y^s_i, w^t_j)$. We compute $\mathcal{L}_{FBLT}$ at the top layer of the encoder, so that the top $N_e - N$ layers are responsible for bilingual word mapping. The final training objective in the advanced course, $\mathcal{L}_{ADV}$, combines the FMLM and FBLT losses.
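The dictionary-supervised lookup can be illustrated with a small sketch. It assumes the word-alignment links have already been extracted as (source word, target word) pairs and estimates the translation probability by relative frequency, which is a simplification chosen here; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_pairs):
    """Estimate p(target | source) from (source_word, target_word) alignment links."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: c / total for t, c in target_counts.items()}
    return table


def lookup(table, source_words):
    """Most probable target word per source word, or None if the word is unseen."""
    result = []
    for word in source_words:
        candidates = table.get(word)
        result.append(max(candidates, key=candidates.get) if candidates else None)
    return result
```

A source word that never appears in the table yields `None`, which corresponds to the empty-target case in which the FBLT loss for that segment is set to zero.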
We conduct experiments on the LibriSpeech En-Fr and IWSLT18 En-De (Niehues et al., 2018) speech translation tasks.
LibriSpeech En-Fr: This corpus is a subset of the LibriSpeech ASR corpus (Panayotov et al., 2015) aligned with French e-books, and contains 236 hours of speech in total. Following previous works, we use the 100-hour clean training set and double the ST data by concatenating the aligned references with the provided Google Translate references, resulting in 90k training instances. We validate on the dev set and report results on the test set (2048 utterances).
IWSLT En-De: The corpus contains 271 hours of data, with English audio, English transcription, and German translation in each example. Following previous work, we remove utterances of low alignment quality, resulting in 137k utterances. We sample 2k segments from the ST-TED corpus as the dev set and use tst2013 as the test set (993 utterances).
Data Preprocessing: We run ESPnet (Watanabe et al., 2018) recipes to perform data preprocessing. For both tasks, our acoustic features are 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, extracted with a step size of 10ms and a window size of 25ms. The features are normalized by the mean and the standard deviation of each training set. Utterances of more than 3000 frames are discarded. We perform speed perturbation with factors 0.9 and 1.1. The alignments between speech and transcriptions are obtained with the Montreal Forced Aligner (McAuliffe et al., 2017). For reference pre-processing, we tokenize and lowercase all the text with the Moses scripts. For pre-training tasks, the vocabulary is generated using sentencepiece (Kudo and Richardson, 2018) with a fixed size of 5k tokens for all languages, and the punctuation is removed. For the ST task, we normalize the punctuation using Moses and use a character-level vocabulary due to its better performance (Berard et al., 2018). Since no human-annotated segmentation is provided for IWSLT tst2013, we use two methods to segment the audio:
1) Following ESPnet, we segment each audio with the LIUM SpkDiarization tool (Meignier and Merlin, 2010). For evaluation, the hypotheses and references are aligned using the MWER method with the RWTH toolkit (Bender et al., 2004).
2) We perform sentence-level forced alignment between audio and transcription using the aeneas tool and segment the audio according to the alignment results.
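The feature normalization and length filter described in the preprocessing paragraph can be sketched as follows. This is a toy stand-in for the ESPnet recipe, not a copy of it: the function names are hypothetical, features are plain lists of floats, and the frame count assumes no padding at the signal edges.

```python
import math

MAX_FRAMES = 3000  # utterances longer than this are discarded


def num_frames(duration_ms, window_ms=25, step_ms=10):
    """Frames produced for a signal of the given length, without edge padding."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // step_ms


def global_stats(utterances):
    """Per-dimension mean and standard deviation over all frames of a training set."""
    frames = [frame for utt in utterances for frame in utt]
    count = len(frames)
    mean = [sum(col) / count for col in zip(*frames)]
    std = []
    for col, mu in zip(zip(*frames), mean):
        variance = sum((v - mu) ** 2 for v in col) / count
        std.append(math.sqrt(variance) or 1.0)  # avoid dividing by zero
    return mean, std


def normalize(utterance, mean, std):
    """Apply mean and variance normalization to every frame of one utterance."""
    return [[(v - mu) / sd for v, mu, sd in zip(frame, mean, std)] for frame in utterance]


def filter_long(utterances, max_frames=MAX_FRAMES):
    """Drop utterances with more than max_frames frames."""
    return [utt for utt in utterances if len(utt) <= max_frames]
```

With a 10ms step, the 3000-frame limit corresponds to roughly 30 seconds of audio.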

Baselines
Experiments are conducted in two settings: a base setting and an expanded setting. In the base setting, only the corpus described in Section 4.1 is used for each task. In the expanded setting, additional ASR and/or MT data can be used. All results are reported as case-insensitive BLEU computed with the multi-bleu.perl script unless noted.

End-to-End ST Baselines
We mainly compare our method with the conventional encoder pre-training method, which uses only the ASR task to pre-train the encoder. We also compare with the results of other works in the literature by copying their numbers.
LibriSpeech: The multilingual baseline combines three ST datasets with 472h of training data to train a multilingual ST model. In our work, we use the LibriSpeech ASR corpus, which includes 960h of speech, as additional pre-training data. As the dev and test sets of the LibriSpeech ST task are extracted from the 960h corpus, we exclude all training utterances whose speakers appear in the dev or test sets.
IWSLT: Since previous works use different segmentation methods and BLEU-score scripts, it is unfair to copy their numbers. In our work, we choose the ESPnet results as the base setting baseline, and the multilingual model and the TCEN-LSTM model as expanded baselines. The multilingual baseline is the same model as described in the LibriSpeech baselines, and Wang et al. (2019b) use an additional 272h TEDLIUM2 (Rousseau et al., 2014) ASR corpus and 41M of parallel data from WMT18 and WIT3. All of them use the ESPnet code, the LIUM segmentation method and the multi-bleu.perl script. We follow Wang et al. (2019b) and use another 272h of ASR data for encoder pre-training and a subset of WMT18 for decoder pre-training. We use the same processing method for MT data, resulting in 4M parallel sentences in total. We also reimplement the CL-fast track of Kano et al. (2017) using our model architecture and data as another baseline.

Cascaded Baselines
For the LibriSpeech ST task, we use the results of Berard et al. (2018), another LSTM-based cascade, and Liu et al. (2019) as base cascaded baselines. The first two use LSTM models for ASR and MT, while the last trains Transformer ASR and MT models. We build an expanded cascaded system with the pre-trained Transformer ASR model and an LSTM MT model with the default setting in the ESPnet recipe. For the IWSLT ST task, we use a previously reported LSTM-based cascade as the base cascaded baseline, and we implement a Transformer-based baseline using our pre-trained ASR and MT models in the expanded setting.

Implementation Details
All our models are implemented based on ESPnet. We set the model dimension $d_{model}$ to 256, the number of heads $H$ to 4, and the feed-forward layer size $d_{ff}$ to 2048. For the LibriSpeech expanded setting, $d_{model} = 512$ and $H = 8$. For all the ST models, we set the number of encoder blocks $N_e = 12$ and the number of decoder blocks $N_d = 6$. Unless noted, we use $N = 8$ encoder blocks to perform the ASR and FMLM pre-training tasks. For the MT model used in the IWSLT expanded setting, we use the Transformer architecture of Vaswani et al. (2017) with $N_e = 6$, $N_d = 6$, $H = 4$, $d_{model} = 256$.
[Table 1: results on the LibriSpeech En-Fr test set. Rows recoverable from the original layout: (Liu et al., 2019) 14.30; +knowledge distillation (Liu et al., 2019) 17.02; TCEN-LSTM (Wang et al., 2019b).]
We train the model with 4 Tesla P40 GPUs and a batch size of 64 per GPU. The two pre-training phases take 50 and 20 epochs respectively, and the final ST task takes another 50 epochs (a total of 120 epochs). We use the Adam optimizer with 25000 warmup steps in each phase. The learning rate decays proportionally to the inverse square root of the step number after 25000 steps. We save checkpoints every epoch and average the last 5 checkpoints as the final model. To avoid overfitting, the SpecAugment strategy (Park et al., 2019) is used in ASR pre-training with frequency masking (F = 30, mF = 2) and time masking (T = 40, mT = 2). The decoding process uses a beam size of 10 and a length penalty of 0.2.
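The learning-rate schedule and checkpoint averaging can be sketched as below. The linear shape of the warmup and the `peak_lr` scale are assumptions, since the text only states the warmup length and the inverse-square-root decay; checkpoints are represented as plain dictionaries of floats to keep the example self-contained.

```python
def lr_at(step, peak_lr=1.0, warmup=25000):
    """Linear warmup (assumed) followed by inverse-square-root decay."""
    step = max(step, 1)
    return peak_lr * min(step / warmup, (warmup / step) ** 0.5)


def average_checkpoints(checkpoints, last_k=5):
    """Average each parameter over the last_k checkpoints."""
    recent = checkpoints[-last_k:]
    return {name: sum(ckpt[name] for ckpt in recent) / len(recent) for name in recent[0]}
```

For example, under these assumptions the rate at step 100000 is half of its peak, because the step count is four times the warmup length.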

Comparison with End-to-End Baselines
LibriSpeech En-Fr: The results on the LibriSpeech En-Fr test set are listed in Table 1. In the base setting, our method improves the "Transformer+ASR pre-train" baseline by 1.7 BLEU and beats all previous works, even though we do not pre-train the decoder. This indicates that, through a well-designed learning process, the encoder has a strong potential to incorporate a large amount of knowledge. Our method beats a knowledge distillation baseline, in which an MT model is used to teach the ST model. The reason, we believe, is that our method gives the model more training signals and makes it easier to learn. We also outperform a TCEN baseline that includes two encoders. Compared with it, our method is more flexible and incorporates all information into a single encoder, which avoids the representation gap between the two encoders.
As the ASR data size increases, the model performs better. In the expanded setting, we find that the FBLT task performs poorly compared with the base setting. This is because the target word prediction task is dictionary-supervised in the expanded setting rather than reference-supervised as in the base setting. However, our method still outperforms the simple pre-training method by a large margin. Besides, it is surprising to find that the end-to-end ST model approaches the performance of an MT model, which is the upper bound of the ST model since it takes the gold source sentence as input without any ASR errors. This further verifies the effectiveness of our method.
IWSLT En-De: The results on IWSLT tst2013 are listed in Table 2 and show a similar trend to the LibriSpeech dataset. We find that the segmentation method has a big influence on the final results. In the base setting, our method improves the ASR pre-training baseline by 0.9 to 2.2 BLEU, depending on the segmentation method. In the expanded setting, we find that combining our method with decoder pre-training further improves performance and beats the other expanded baselines.
Table 3 shows the comparison with cascaded ST systems. For the base setting of the two tasks, our end-to-end model achieves results comparable to or better than the cascaded methods. This shows that the end-to-end model has powerful learning capabilities and combines the functions of two models. In the LibriSpeech expanded setting, when more ASR data is available, we also obtain competitive performance. This indicates that our method can make good use of the ASR corpus and learn valuable linguistic knowledge beyond simple acoustic information. However, when additional MT data is used, there is still a gap between the end-to-end method and the cascaded method. How to utilize bilingual parallel sentences to improve the E2E ST model is worth further study.
[Table 3: comparison with cascaded ST systems. Rows recoverable from the original layout: (Berard et al., 2018) 14.6; LSTM ASR + MT 15.8; Transformer ASR + MT (Liu et al., 2019).]

Analysis and Discussion
Ablation Study: To better understand the contribution of each component, we perform an ablation study on the LibriSpeech expanded setting. The results are shown in Table 4. On the one hand, we show that both of our proposed pre-training tasks are beneficial: in "-FMLM task" and "-FBLT task", we perform single-task pre-training for the advanced course, and the performance drops when we remove either one of them. On the other hand, we show that the two-phase pre-training paradigm is necessary. The "-phase 2" experiment degenerates to the simple ASR pre-training baseline. In the "-phase 1" setting, we find that without ASR pre-training, the training accuracy on the FMLM and FBLT tasks drops substantially, which in turn hurts ST performance. This means the ASR task is necessary for both the advanced courses and ST. In the "Multi3" setting, we pre-train the model on the ASR, FMLM and FBLT tasks in one phase. In this setting, we observe that multi-task learning also decreases individual task performance (ASR, FMLM and FBLT) compared with curriculum learning. One reasonable explanation is that it is hard to train the FMLM and FBLT tasks, which take masked input, from randomly initialized parameters, which also leads to performance degradation on the ST task.
Hyper-parameter N: During pre-training, the layer at which the ASR and FMLM losses are computed is an important hyper-parameter. We conduct experiments on the LibriSpeech base setting to explore the influence of different choices. We keep $N_e = 12$ unchanged and always use the top layer to perform the FBLT task. Then we vary the hyper-parameter $N$. We find that with $N = 6$, the model has difficulty converging during ST training. This may be because the bottom 6 encoder layers are too far from the decoder, so the valuable source linguistic knowledge cannot be well utilized. Moreover, the model performs poorly if $N$ is 10 or 12, yielding 16.47 and 16.14 BLEU respectively, since too few blocks remain for the FBLT task. The model achieves the best performance with $N = 8$. Thus, we use this setting in our main experiments.
Unlabeled Speech Data: In this work, we also explore how to utilize unlabeled speech data in pre-training, but only get negative results. We conduct exploratory experiments on the LibriSpeech ST task. We treat the $(x, y^s)$ pairs from the 100h ST corpus as labeled pre-training data and the speech $x$ from the 960h LibriSpeech ASR corpus as unlabeled data. Following previous work, we design an unsupervised pre-training task for the elementary course, in which we randomly mask 15% of the fbank features and let the bottom 4 encoder layers predict the masked part. We compute the L1 loss between the prediction and the ground-truth filterbanks. However, we find that this method is not helpful for the final ST task: it results in a BLEU score of 16.85, lower than our base setting model (pre-trained without extra data). How to use unlabeled speech data remains an open question.
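The masked reconstruction objective used in this exploratory experiment can be sketched as follows. The sketch assumes that masking is applied to whole frames and that the L1 loss is averaged over masked frames only; neither detail is specified in the text, and the function names are hypothetical.

```python
import random


def sample_mask(num_frames, ratio=0.15, rng=None):
    """Boolean mask marking roughly `ratio` of the frames as masked."""
    rng = rng or random.Random()
    chosen = set(rng.sample(range(num_frames), round(ratio * num_frames)))
    return [i in chosen for i in range(num_frames)]


def masked_l1(prediction, target, mask):
    """Mean absolute error between prediction and target over masked frames."""
    total, count = 0.0, 0
    for pred_frame, true_frame, is_masked in zip(prediction, target, mask):
        if is_masked:
            total += sum(abs(p - t) for p, t in zip(pred_frame, true_frame))
            count += len(true_frame)
    return total / count if count else 0.0
```

Restricting the loss to masked positions means the model is not rewarded for simply copying the frames it can already see.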

Conclusion and Future Work
This paper investigates the end-to-end method for ST. We propose a curriculum pre-training method, consisting of an elementary course with an ASR loss and two advanced courses with a frame-based masked language model loss and a frame-based bilingual lexicon translation loss, in order to teach the model syntactic and semantic knowledge in the pre-training stage. Empirical studies demonstrate that our model significantly outperforms the baselines. In the future, we will explore how to leverage unlabeled speech data and large bilingual text data to further improve performance. Besides, we expect that the idea of curriculum pre-training can be applied to other NLP tasks.