End-to-End Offline Speech Translation System for IWSLT 2020 using Modality Agnostic Meta-Learning

In this paper, we describe the system submitted to the IWSLT 2020 Offline Speech Translation Task. We adopt the Transformer architecture coupled with a meta-learning approach to build our end-to-end Speech-to-Text Translation (ST) system. Our meta-learning approach tackles the data scarcity of the ST task by leveraging the data available from the Automatic Speech Recognition (ASR) and Machine Translation (MT) tasks. The meta-learning approach combined with synthetic data augmentation techniques improves the model performance significantly and achieves BLEU scores of 24.58, 27.51, and 27.61 on the IWSLT test 2015, MuST-C test, and Europarl-ST test sets, respectively.


Introduction
The goal of the IWSLT 2020 Offline Speech Translation challenge (Ansari et al., 2020) is to check the feasibility of end-to-end models for translating audio speech of one language into text of a different target language. The success of end-to-end neural models for ASR (Graves et al., 2013) and MT (Bahdanau et al., 2015) inspired the development of end-to-end neural models for the more challenging Speech-to-Text translation (ST) task (Bérard et al., 2016). Traditionally, ST systems are built by cascading ASR and MT systems (Ney, 1999). However, the cascaded system suffers from error propagation, latency, and memory requirement issues. Although these issues can be addressed using end-to-end ST models, it is hard to collect enough data for training such models.
In this work, we build an end-to-end ST system which not only addresses the issues of a cascaded system but also works with limited training data. The proposed system is fine-tuned towards the IWSLT 2020 Offline Speech-Translation Task. However, the proposed training strategies and data augmentation techniques can be adopted into existing and future ST models. We adopt the meta-learning approach proposed for the ST task (Indurthi et al., 2019) to train our system. The meta-learning based training approach not only allows us to leverage the huge amounts of training data available for the ASR and MT tasks but also helps to find a good initialization point for the target ST task.
* The two authors contributed equally to this paper.
We conduct several experiments involving ASR, MT, and ST corpora to test our model performance on the IWSLT 2020, MuST-C, and Europarl-ST English-German (En-De) ST tasks. Our experiments reveal that the proposed model trained using the meta-learning approach achieves significant performance gains over the model which only utilizes the ST data for training. Our model achieves improvements of 4.81, 5.37, and 8.46 BLEU on the IWSLT test 2015, MuST-C test, and Europarl-ST test sets, respectively, compared to models trained without the meta-learning approach. Our best system attains 24.58, 27.51, and 27.61 BLEU scores on the IWSLT test 2015, MuST-C test, and Europarl-ST test sets, respectively.

Model Architecture
We use the Transformer model as a base Sequence-to-Sequence (seq2seq) model to train the ASR, MT, and ST tasks. In this section, we briefly describe the Transformer architecture and how it is adapted to the ASR and ST tasks. In Section 2.2, we describe the meta-learning algorithm used to train our seq2seq model.

Base Architecture
A general seq2seq architecture (Sutskever et al., 2014) generates a target sequence y = {y_1, · · · , y_n} given a source sequence x = {x_1, · · · , x_m} by modeling the conditional probability p(y|x, θ). The MT task is one example of a seq2seq problem, where x represents the input sequence in the source language and y represents the translated output sequence in the target language.
The non-recurrent Transformer network (Vaswani et al., 2017) has been extensively used to solve general seq2seq problems, especially the MT task. The Transformer is based on an encoder-decoder architecture (Cho et al., 2014). The encoder and decoder blocks of the Transformer network are composed of stacks of N and M identical layers, respectively. Each encoder layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. Each decoder layer has these same two sub-layers; in addition, it contains a third sub-layer for computing the encoder-decoder attention vector based on the soft attention mechanism (Bahdanau et al., 2015).

MAML
The meta-learning approach has proven very useful for mitigating the data scarcity issue in low resource tasks. Due to the scarcity of ST data in our task, we use a variant of the meta-learning approach called Modality Agnostic Meta-Learning (MAML) (Finn et al., 2017a) to leverage high resource tasks when training on low resource tasks. Here, we briefly describe the MAML approach for the ST task. For more details about the meta-learning approach for the ST task, please refer to (Indurthi et al., 2019).
The MAML approach involves two phases: (1) Meta-Learning Phase, (2) Fine-tuning Phase. In the meta-learning phase, we use a set of related high resource tasks as source tasks to train the model. In this phase, the model captures the general learning aspects of the tasks involved. During the fine-tuning phase, we tune the model towards the specific target task after initializing the model from the parameters learned in the meta-learning phase.
Meta-Learning Phase: In this phase, we use the high resource tasks as source tasks {τ_1, · · · , τ_s} to find a good parameter initialization point θ_0 for the low resource target task τ_0. In each step of this phase, we first sample one source task τ uniformly at random from the set of source tasks {τ_1, · · · , τ_s}. We then sample two batches (D_τ and D′_τ) of training examples from this task τ. The batch D_τ is used to train the model to learn the task-specific distribution; this step is called the meta-train step. In each meta-train step, we create auxiliary parameters (θ_τ^a) initialized from the original model parameters (θ^m) and update them using D_τ while keeping the original parameters intact. The auxiliary parameters are updated using a gradient-descent step:

θ_τ^a = θ^m − α ∇_{θ^m} L(θ^m; D_τ),

where α is the meta-train learning rate and L(·; D_τ) is the loss on D_τ. After the meta-train step, the auxiliary parameters (θ_τ^a) are evaluated on D′_τ to compute the loss. This step is called the meta-test step, and the computed loss is used to update the original model parameters (θ^m).
Note that the meta-test update is applied to the model parameters (θ^m), whereas the loss is computed using the auxiliary parameters (θ_τ^a). In effect, the meta-learning phase aims to optimize the model parameters such that a new low resource target task can be quickly learned during the fine-tuning phase.
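The meta-train/meta-test loop above can be sketched on a toy problem. This is a minimal first-order sketch, not the submitted system: the "source tasks" are hypothetical scalar regression tasks standing in for ASR/MT/ST, and the gradients are written out by hand for a squared-error loss.

```python
import numpy as np

class LinearTask:
    """Toy 'source task': fit y = w*x for a task-specific slope w."""
    def __init__(self, w, rng):
        self.w, self.rng = w, rng
    def sample(self, n=16):
        x = self.rng.uniform(-1.0, 1.0, n)
        return x, self.w * x

def maml_step(theta, tasks, rng, inner_lr=0.1, outer_lr=0.05):
    task = tasks[rng.integers(len(tasks))]   # uniformly sample one source task tau
    x_tr, y_tr = task.sample()               # D_tau  (meta-train batch)
    x_te, y_te = task.sample()               # D'_tau (meta-test batch)
    # meta-train: auxiliary parameters theta_aux initialized from theta
    grad_tr = 2 * np.mean((theta * x_tr - y_tr) * x_tr)  # d(MSE)/d(theta)
    theta_aux = theta - inner_lr * grad_tr
    # meta-test: loss of theta_aux on D'_tau drives the update of theta
    # (first-order approximation: the gradient is taken at theta_aux)
    grad_te = 2 * np.mean((theta_aux * x_te - y_te) * x_te)
    return theta - outer_lr * grad_te

rng = np.random.default_rng(0)
tasks = [LinearTask(w, rng) for w in (1.5, 2.0, 2.5)]  # stand-ins for ASR/MT/ST
theta = 0.0
for _ in range(500):
    theta = maml_step(theta, tasks, rng)
# theta should settle near the mean task slope (about 2.0),
# i.e. an initialization from which every task is quickly reachable
```

The key point the sketch illustrates: the update applied to `theta` uses a gradient computed at `theta_aux`, so `theta` is pushed towards a region that adapts well after one inner step, rather than towards any single task's optimum.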
Fine-tuning Phase: During the fine-tuning phase, the model is initialized from the meta-learned parameters (θ^m) and trained on the specific target task. In this phase, the model training proceeds as standard neural network training, without involving the auxiliary parameters.
Exposing the model parameters to vast amounts of data from the high resource source tasks {τ_1, · · · , τ_s} during the meta-learning phase makes them a good initialization point for the target task τ_0.

Speech-to-Text Translation
We adopt the basic Transformer (Vaswani et al., 2017) architecture described in Section 2.1 to train the ASR and ST tasks. We represent the speech sequences in these tasks using 80-dimensional log-Mel features. The speech sequences are usually a few times longer than the text sequences. Thus, we add a compression layer at the beginning of the Transformer network to compress the speech sequences and extract their local structure. This compressed signal is given as input to the Transformer encoder. The compression layer comprises a stack of CNN layers. The text sequences in the ASR, MT, and ST tasks are all represented using a word piece vocabulary.
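The compression layer can be sketched as a stack of strided 1-D convolutions over the log-Mel frames. The kernel size (3), stride (2), and channel width (128) below are illustrative assumptions; the paper only states that the layer is a stack of three CNN layers over 80-dimensional features.

```python
import numpy as np

def conv1d_stride2(x, w, b):
    """Strided 1-D convolution with ReLU: roughly halves the time dimension.
    x: (T, d_in), w: (k, d_in, d_out), b: (d_out,)"""
    k, _, d_out = w.shape
    T = (x.shape[0] - k) // 2 + 1
    out = np.empty((T, d_out))
    for t in range(T):
        window = x[2 * t : 2 * t + k]  # (k, d_in) slice of the input
        out[t] = np.maximum(
            np.tensordot(window, w, axes=([0, 1], [0, 1])) + b, 0.0)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 80))  # ~10 s of 80-dim log-Mel frames
for _ in range(3):                   # stack of three CNN layers
    k, d_in = 3, x.shape[1]
    x = conv1d_stride2(x, 0.01 * rng.standard_normal((k, d_in, 128)),
                       np.zeros(128))
print(x.shape)  # (124, 128): the 1000-frame input is compressed ~8x
```

With stride 2 per layer, three layers shorten the sequence by roughly a factor of eight, bringing the speech length much closer to typical text lengths before the encoder's self-attention.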
The limited amount of training data in the ST task can result in over-fitting and leads to inferior performance. Hence, we use the meta-learning approach described in Section 2.2. The meta-learning approach for the ST task proposed by Indurthi et al. (2019) suggests high resource tasks such as Automatic Speech Recognition (ASR) and Machine Translation (MT) as source tasks during the meta-learning phase. Unlike Indurthi et al. (2019), we include the ST task as one of the source tasks during the meta-learning phase to leverage the ST training data as well. So, the set of source tasks in our meta-learning phase is {ASR, MT, ST}, and the target task τ_0 during the fine-tuning phase is ST. We dynamically disable the compression layer whenever we sample the MT task during the meta-learning phase. This allows us to train the model on tasks with different input-output modalities.
During the meta-learning phase, the parameters of the model (θ^m) are exposed to vast amounts of speech-to-transcript and text-to-text translation examples via the ASR and MT tasks, along with the original ST task's speech-to-text translation examples. This allows the parameters of all the sub-layers in the model, such as the compression, encoder, decoder, encoder-decoder attention, and output layers, to learn the individual language representations and the translation relations between them.

Training
The speech-to-text translation models are trained on a dataset D of parallel sequences to maximize the log likelihood:

L(θ) = Σ_{(x, y) ∈ D} log p(y|x; θ),

where θ denotes the parameters of the model. To facilitate training on multiple languages and tasks, we create a universal vocabulary by following (Gu et al., 2018). The universal vocabulary is created based on all the tasks involved in the meta-learning and fine-tuning phases.
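The per-sequence term of this objective can be made concrete with a small sketch: given decoder logits over the word-piece vocabulary at each target position, log p(y|x; θ) is the sum of the log-probabilities of the reference pieces. The shapes and the uniform-logits example are illustrative only.

```python
import numpy as np

def sequence_log_likelihood(logits, target_ids):
    """log p(y|x; theta) for one target sequence, summed over positions.
    logits: (n, V) decoder outputs; target_ids: (n,) word-piece ids."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # pick the log-probability of the reference piece at each position
    return log_probs[np.arange(len(target_ids)), target_ids].sum()

# uniform predictions over a hypothetical 5-piece vocabulary:
logits = np.zeros((3, 5))
ll = sequence_log_likelihood(logits, np.array([0, 1, 2]))  # = 3 * log(1/5)
```

Training maximizes the sum of this quantity over D, equivalently minimizing the per-token cross-entropy.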

Dataset composition
Datasets used to train our model come from three different tasks: ASR, MT, and ST.
ASR Task: We used four different datasets to train the English ASR task: IWSLT 19 (filtered), LibriSpeech (Panayotov et al., 2015), MuST-C, and TED-LIUM 3 (Hernandez et al., 2018), which together provide 894K English speech-to-text transcripts. Although IWSLT 19 (filtered), MuST-C, and TED-LIUM 3 are ST corpora, they also contain English transcripts, so we include them in the ASR task as well. Unlike the ST datasets, we do not augment the ASR datasets with synthetic data; adding more synthetic data for the ASR task may bias the model towards the ASR task rather than the target ST task.
MT Task

Data augmentation
For data augmentation on the text side, we use two English-to-German NMT models and their top-2 beam results to generate synthetic German sequences from the corresponding English sequences. For the speech sequences, we use the Sox library to generate speech signals with different values of the speed, echo, and tempo parameters, similar to (Potapczyk et al., 2019). The parameter values are uniformly sampled from these ranges: tempo ∈ (0.85, 1.3), speed ∈ (0.95, 1.05), echo delay ∈ (20, 200), and echo decay ∈ (0.05, 0.2). We increase the size of the IWSLT 19 (filtered) ST dataset to five times the original size by augmenting 4X data: four text sequences using the NMT models and four speech signals using the Sox parameter ranges. For Europarl-ST, we augment 2X examples to triple the size. The TED-LIUM 3 dataset does not originally contain speech-to-text translation examples; hence, we create 2X synthetic speech-to-text translations from its speech-to-text transcripts. Finally, for the MuST-C dataset, we use synthetic speech to increase the dataset size to 4X. Overall, the synthetic training data created using these augmentation techniques is roughly four times the size of the original data. The details of these synthetic datasets are given in Table 2. We also tried SpecAugment (Park et al., 2019) to increase the speech data during training, but it did not boost overall performance.
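The speech-side sampling can be sketched as building a SoX command line from the ranges above. The echo gain-in/gain-out values (0.8/0.9) are assumptions, as are the file names; the paper specifies only the tempo, speed, and echo delay/decay ranges.

```python
import random

def sample_sox_effects(rng=random):
    """Sample one augmentation setting from the parameter ranges above."""
    tempo = rng.uniform(0.85, 1.3)
    speed = rng.uniform(0.95, 1.05)
    delay = rng.uniform(20, 200)    # echo delay (ms)
    decay = rng.uniform(0.05, 0.2)  # echo decay
    # sox effect syntax: echo gain-in gain-out delay decay
    # (gain values 0.8/0.9 are assumed here, not taken from the paper)
    return ["tempo", f"{tempo:.3f}",
            "speed", f"{speed:.3f}",
            "echo", "0.8", "0.9", f"{delay:.0f}", f"{decay:.3f}"]

def sox_command(in_wav, out_wav, rng=random):
    """Full command, e.g. for subprocess.run():
    sox in.wav out.wav tempo 1.1 speed 0.97 echo 0.8 0.9 120 0.1"""
    return ["sox", in_wav, out_wav] + sample_sox_effects(rng)

cmd = sox_command("talk001.wav", "talk001_aug.wav", random.Random(0))
```

Each call draws a fresh setting, so running it N times per utterance yields N perturbed copies paired with the original translation.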

Data processing
In order to deal with different input and output modalities, we use a universal vocabulary (Gu et al., 2018).

Implementation Details
We trained all our models on four NVIDIA V100 GPUs. The MAML model is implemented based on the Tensor2Tensor framework (Vaswani et al., 2018). We train the models in the meta-learning phase for 1600k steps and then fine-tune for 400k steps. The compression layer is composed of three CNN layers. The number of encoder and decoder layers (N and M) in the base Transformer model is set to 10 and 8, respectively. In all the experiments, a dropout rate of 0.2 is used. We use a batch size of 1.5M frames for the speech sequences and a batch size of 4096 tokens for the text sequences.
In order to deal with small batches due to long speech signals, we use Multistep Adam optimizer (Saunders et al., 2018) in our experiments, with the gradients accumulated over 32 steps.
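The accumulation mechanism can be sketched as follows. For simplicity the sketch applies a plain SGD update in place of Adam; the point is the buffering of gradients over 32 micro-batches before a single parameter update, which emulates a 32x larger batch.

```python
import numpy as np

class MultistepOptimizer:
    """Accumulate gradients over `accum_steps` micro-batches, then apply
    one update (plain SGD stand-in for the Multistep Adam of the paper)."""
    def __init__(self, lr=0.1, accum_steps=32):
        self.lr, self.accum_steps = lr, accum_steps
        self.buffer, self.count = None, 0

    def step(self, params, grad):
        # add this micro-batch's gradient to the running buffer
        self.buffer = grad if self.buffer is None else self.buffer + grad
        self.count += 1
        if self.count < self.accum_steps:
            return params  # keep accumulating, no update yet
        # apply one update with the averaged gradient, then reset
        new_params = params - self.lr * self.buffer / self.accum_steps
        self.buffer, self.count = None, 0
        return new_params

# four micro-batches with gradient 1.0 produce a single update of lr * 1.0
opt = MultistepOptimizer(lr=0.1, accum_steps=4)
p = np.array([1.0])
for _ in range(4):
    p = opt.step(p, np.array([1.0]))
```

This keeps memory per step bounded by the micro-batch (a single long speech signal can fill a GPU on its own) while the effective batch statistics match a much larger batch.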

Results
In this section, we report the performance of our models on different ST datasets: the IWSLT tst 2010, tst 2015, MuST-C dev, MuST-C test, Europarl-ST dev, and Europarl-ST test sets. The number of examples in these test sets is reported in Table 3. We trained one model using only the ST datasets shown in Table 2, called woML (without Meta-Learn) from here on; this model is trained without the meta-learning approach. We trained another model, called wML (with Meta-Learn), in which we first pre-train the model using the meta-learning approach described in Section 2.2 on all the ASR, MT, and ST tasks, and then fine-tune from the meta-learned parameters on the ST task. As we can see from Table 4, the wML model achieves a better BLEU score than woML on all the ST datasets. The wML model outperforms woML with a BLEU score of 24.4 on the IWSLT 2015 test set, compared to the 19.77 BLEU achieved by woML. These results clearly show that the meta-learning phase helps to leverage the data from the ASR and MT datasets and to learn the individual language representations and the relations between them. We obtained further improvements in the ST BLEU score by averaging 10 checkpoints around the best model.
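Checkpoint averaging, used for the final gains above, can be sketched as an element-wise mean over the saved parameter tensors. The dictionary-of-arrays representation is an assumption for illustration; frameworks store checkpoints in their own formats.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of parameter dicts from N saved checkpoints.
    All checkpoints must share the same parameter names and shapes."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}

# two toy checkpoints around a 'best' model
ck1 = {"w": np.array([0.0, 2.0]), "b": np.array([1.0])}
ck2 = {"w": np.array([2.0, 4.0]), "b": np.array([3.0])}
avg = average_checkpoints([ck1, ck2])
```

Averaging the last few checkpoints smooths the noise of individual optimizer steps, which often yields a small but consistent BLEU gain at no training cost.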

Related Work
End-to-end Speech Translation: Previously, speech translation leveraged the success of MT and ASR systems to build cascaded speech translation systems (Post et al., 2013). The cascaded models mostly suffer from problems such as error propagation between models and high latency during decoding. In order to overcome these limitations, various attempts have been made to develop end-to-end ST models by aligning the source speech signal and target text translation without using intermediate transcripts (Duong et al., 2016). However, because training data is limited compared to ASR or MT corpora, various data augmentation strategies have been proposed to leverage the data from ASR or MT tasks to improve end-to-end ST performance (Jia et al., 2019; Pino et al., 2019). Recently, several learning approaches, such as multi-task learning using either ASR+ST or MT+ST data pairs, have been suggested and explored. However, in these approaches, the parameters of the model are updated independently based on individual task performance, which may lead to sub-optimal solutions. Indurthi et al. (2019) proposed a meta-learning approach to overcome these limitations.
Meta-Learning: Meta-learning algorithms are used to adapt quickly to new tasks with relatively few examples, as the main goal of these algorithms is learning to learn. Unlike past meta-learning approaches which focused on learning a meta-policy (Ha et al., 2016; Andrychowicz et al., 2016), Finn et al. (2017b) recently proposed a meta-learning algorithm which puts more weight on finding a good initialization point for new target tasks.

Conclusion
In this work, we improve the performance of an end-to-end speech translation system based on the data available from the IWSLT 2020 Offline Speech Translation Task. We train end-to-end models to solve the complex task of speech translation. We leverage the large out-of-domain training data from the ASR and MT tasks to improve the performance of the ST task. We adopt Modality Agnostic Meta-Learning (MAML) and data augmentation techniques to achieve 24.58, 27.51, and 27.61 BLEU scores on the IWSLT test 2015, MuST-C test, and Europarl-ST test sets, respectively.

References
Turchi, and Changhan Wang. 2020. Findings of the IWSLT 2020 Evaluation Campaign. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA.