Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models. To leverage the full potential of the source language information, we propose backward SeqKD, SeqKD from a target-to-source backward NMT model. To this end, we train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder. The paraphrases are generated from the translations in bitext via back-translation. We further propose bidirectional SeqKD in which SeqKD from both forward and backward NMT models is combined. Experimental evaluations on both autoregressive and non-autoregressive models show that SeqKD in each direction consistently improves the translation performance, and the effectiveness is complementary regardless of the model capacity.


Introduction
End-to-end speech translation (E2E-ST) (Bérard et al., 2016), which aims to convert source speech directly to text in another language, is an active research area. Because direct ST is a more difficult task than automatic speech recognition (ASR) and machine translation (MT), various techniques have been proposed to ease the training process by using the source transcription. Examples include pre-training (Bérard et al., 2018; Wang et al., 2020c; Bansal et al., 2019; Wang et al., 2020d), multi-task learning (Weiss et al., 2017; Bérard et al., 2018; Bahar et al., 2019), knowledge distillation (Liu et al., 2019), meta-learning (Indurthi et al., 2020), two-pass decoding (Anastasopoulos and Chiang, 2018; Sperber et al., 2019), and interactive decoding (Le et al., 2020). However, as the input modalities of the ST and MT tasks differ, an auxiliary MT task is not always helpful, especially when additional bitext is not available (Bahar et al., 2019). Moreover, because monotonic speech-to-transcription alignments encourage the ASR task to attend to surface-level local information, an auxiliary ASR task helps the E2E-ST model extract acoustic representations, not semantic ones, from speech.
Sequence-level knowledge distillation (SeqKD) (Kim and Rush, 2016) is another approach to transferring knowledge from one model to another. Recent studies have shown that SeqKD reduces the complexity of training data and thus eases the training of student models, e.g., non-autoregressive (NAR) models (Gu et al., 2018; Zhou et al., 2019a; Ren et al., 2020).
Paraphrasing, which represents text in a different form while preserving its meaning, can also be regarded as SeqKD when neural paraphrasing is performed via back-translation (Wieting et al., 2017; Federmann et al., 2019). It has been studied to improve reference diversity for MT system evaluations (Thompson and Post, 2020; Bawden et al., 2020a,b) and the performance of low-resource neural MT (NMT) models (Zhou et al., 2019b; Khayrallah et al., 2020).
In this work, owing to its simplicity and effectiveness, we focus on SeqKD from text-based NMT models to improve the performance of a bilingual E2E-ST model. To fully leverage source language information, we propose backward SeqKD, which targets, as an auxiliary task, paraphrased source transcriptions generated by a target-to-source backward NMT model. A single ST decoder is then trained to predict both source and target language text, as in a multilingual setting (Inaguma et al., 2019). This way, the decoder is biased to capture semantic representations from speech, unlike joint training with an auxiliary ASR task. We also propose bidirectional SeqKD, which combines SeqKD from the two NMT models in both language directions. The E2E-ST models can thereby fully exploit the knowledge embedded in both the forward and backward NMT models.
Experimental evaluations demonstrate that SeqKD from each direction consistently improves the translation performance of both autoregressive and non-autoregressive E2E-ST models. We also confirm that bidirectional SeqKD outperforms unidirectional SeqKD and that its effectiveness is maintained in large models.

Method
In this section, we propose bidirectional SeqKD from both forward and backward NMT models, which leverages machine-generated source paraphrases as an additional target, alongside the distilled translation, to enhance the training of a bilingual E2E-ST model. Let X denote input speech features in a source language, and let Y^s and Y^t denote the corresponding gold transcription and translation, respectively. Let D_st = {(X_i, Y_i^s, Y_i^t)}, D_asr = {(X_i, Y_i^s)}, and D_mt = {(Y_i^s, Y_i^t)} denote the corresponding ST, ASR, and MT datasets, respectively. We drop the subscript i when it is obvious.

Sequence-level knowledge distillation
We first train a text-based source-to-target forward NMT model M_fwd with D_mt. Then, we perform beam search decoding with M_fwd on D_st to create a new dataset D_st^fwd = {(X_i, Y_i^s, Ŷ_i^t)}, where Ŷ_i^t is a distilled translation. D_st^fwd is used to train the E2E-ST models, referred to as forward SeqKD (or fwd SeqKD).
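The forward-SeqKD dataset construction above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `build_forward_seqkd_dataset` and the `translate` callable are hypothetical names standing in for beam-search decoding with M_fwd.

```python
def build_forward_seqkd_dataset(d_st, translate):
    """Forward SeqKD: replace each gold translation Y^t in the ST
    triplets (speech, transcription, translation) with a distilled
    translation produced by the forward NMT model M_fwd.
    `translate` is a hypothetical stand-in for beam-search decoding."""
    return [
        {"speech": x, "src_text": y_s, "tgt_text": translate(y_s)}
        for (x, y_s, _gold_y_t) in d_st
    ]

# Toy usage: a dummy "NMT model" that just uppercases the source text.
d_st = [("feat_0", "hello world", "hallo welt")]
d_fwd_st = build_forward_seqkd_dataset(d_st, str.upper)
```

The student E2E-ST model is then trained on `d_fwd_st` exactly as it would be on the original triplets.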

Paraphrase generation
To exploit semantic information in the source language, we leverage machine-generated paraphrases of the source transcriptions. We train a text-based target-to-source backward NMT model M_bwd with D_mt and then generate a new dataset D_st^bwd = {(X_i, Ŷ_i^s, Y_i^t)}, where Ŷ_i^s is a paraphrased transcription generated from Y_i^t, for training the E2E-ST models. As neural paraphrasing can be regarded as SeqKD from M_bwd, we refer to it as backward SeqKD (or bwd SeqKD). In this work, we do not use large paraphrase datasets (Wieting and Gimpel, 2018; Hu et al., 2019) because their availability depends on the language and domain. Moreover, neural paraphrasing is applicable to any source language that lacks a sufficient amount of paired paraphrase data.
We also propose combining forward SeqKD with backward SeqKD, referred to as bidirectional SeqKD (or bidir SeqKD), and construct a new dataset D_st^bidir = {(X_i, Ŷ_i^s, Ŷ_i^t)}. When using two references per utterance (2ref training) (Gordon and Duh, 2019), we instead concatenate D_st^fwd and D_st^bwd; the most suitable combination is analyzed in Section 4.3. This way, we can distill the knowledge of both M_fwd and M_bwd into a single E2E-ST model.
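The two dataset variants just described can be sketched as follows. The function names and the tuple layout are our assumptions for illustration; both translate callables are hypothetical stand-ins for the forward and backward NMT models.

```python
def build_bidir_dataset(d_st, fwd_translate, bwd_translate):
    """Bidirectional SeqKD: pair the distilled paraphrase (backward
    model M_bwd applied to the gold translation) with the distilled
    translation (forward model M_fwd applied to the transcription)."""
    return [
        (x, bwd_translate(y_t), fwd_translate(y_s))
        for (x, y_s, y_t) in d_st
    ]

def build_2ref_dataset(d_fwd_st, d_bwd_st):
    """2ref training: concatenate D_st^fwd and D_st^bwd so that each
    utterance contributes two (source text, target text) references."""
    return d_fwd_st + d_bwd_st

# Toy usage with dummy "models" (uppercase = forward, lowercase = backward).
d_st = [("feat_0", "hello", "HALLO")]
d_bidir_st = build_bidir_dataset(d_st, str.upper, str.lower)
```

With 2ref training the model sees the same speech twice per epoch, once with each reference pair.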

Training
We train an E2E-ST model with a direct ST objective L_st(Y^t or Ŷ^t | X) and an auxiliary speech-to-source-text objective L_src(Y^s or Ŷ^s | X). We refer to joint training with L_src(Y^s | X) as joint ASR and to joint training with L_src(Ŷ^s | X) as backward SeqKD. Both losses are calculated with the same ST decoder. To bias the model to generate the desired target language, we add a language embedding to the token embedding at every token position in the decoder (Conneau and Lample, 2019). We then apply bidirectional SeqKD to both autoregressive (AR) and non-autoregressive (NAR) E2E-ST models.
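A minimal numeric sketch of the language-embedding trick: the same decoder input tokens are biased toward either output language by adding a per-language vector at every position. All names and dimensions here are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Illustrative sizes (the paper's base config uses d_model = 256).
rng = np.random.default_rng(0)
vocab_size, n_langs, d_model = 100, 2, 8
token_emb = rng.normal(size=(vocab_size, d_model))
lang_emb = rng.normal(size=(n_langs, d_model))  # 0: source, 1: target

def embed(token_ids, lang_id):
    """Token embedding plus a language embedding broadcast over all
    positions, biasing the shared decoder toward the desired output
    language (source text for L_src, target text for L_st)."""
    return token_emb[token_ids] + lang_emb[lang_id]

tokens = np.array([1, 2, 3])
src_branch = embed(tokens, lang_id=0)  # speech-to-source auxiliary task
tgt_branch = embed(tokens, lang_id=1)  # direct ST task
```

The two branches differ only by a constant per-position offset, which is enough for the shared decoder to switch output languages.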

Autoregressive E2E-ST model
We use the speech Transformer architecture in (Karita et al., 2019) with an additional language embedding. The total training objective is formulated with a hyperparameter λ_src (≥ 0) as

L_total = L_st + λ_src · L_src,

where both L_st and L_src are defined as cross-entropy losses. The entire encoder-decoder parameters are shared between the two tasks.
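The combined objective is simple enough to state directly. The λ_src value in the usage line below is purely illustrative, not the paper's setting.

```python
def total_loss(l_st, l_src, lambda_src):
    """L_total = L_st + lambda_src * L_src, where lambda_src >= 0
    weights the auxiliary speech-to-source objective against the
    direct ST objective. Setting lambda_src = 0 recovers plain ST
    training."""
    assert lambda_src >= 0.0
    return l_st + lambda_src * l_src

# e.g., with per-batch cross-entropy values L_st = 2.0 and L_src = 1.2:
l_total = total_loss(2.0, 1.2, lambda_src=0.5)
```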

Non-autoregressive E2E-ST model
We adopt a NAR E2E-ST model that combines a conditional masked language model (CMLM) NAR decoder with an auxiliary AR decoder on a shared encoder (Inaguma et al., 2021). The total training objective is formulated as

L_total = L_cmlm + λ_ar · L_ar + λ_lp · L_lp,

where L_cmlm, L_ar, and L_lp are the losses of the NAR E2E-ST, AR E2E-ST, and length prediction tasks, respectively, and each λ_* is a corresponding tunable loss weight. During inference, the mask-predict algorithm is used for T iterations with a length beam width of l (Ghazvininejad et al., 2019). The best candidate at the last iteration is selected from the NAR decoder based on scores from the AR decoder (Inaguma et al., 2021). Note that we apply L_src to the NAR decoder only.
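The mask-predict loop can be sketched as a toy decoder-agnostic routine. This is a simplified illustration of Ghazvininejad et al. (2019), with a hypothetical `fill_fn` standing in for the CMLM NAR decoder; the length-beam rescoring with the AR decoder is omitted for brevity.

```python
def mask_predict(fill_fn, length, T=4, mask="<mask>"):
    """Toy mask-predict decoding: start from a fully masked sequence
    of a predicted length, then for T iterations (i) fill every masked
    position and (ii) re-mask the least confident tokens under a
    linear decay schedule. `fill_fn(tokens)` must return a
    (token, probability) pair for every position."""
    tokens = [mask] * length
    probs = [0.0] * length
    for t in range(1, T + 1):
        preds = fill_fn(tokens)
        for i in range(length):
            if tokens[i] == mask:
                tokens[i], probs[i] = preds[i]
        if t == T:
            break  # final iteration: keep all predictions
        n_mask = length * (T - t) // T  # linear decay of masked count
        for i in sorted(range(length), key=lambda i: probs[i])[:n_mask]:
            tokens[i] = mask
    return tokens

# Dummy decoder: position i always predicts "w<i>" with confidence i.
hyp = mask_predict(lambda ts: [(f"w{i}", float(i)) for i in range(len(ts))],
                   length=5, T=4)
```

In the full model, l length candidates are decoded in parallel and the best final hypothesis is picked by AR decoder scores.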

Experimental setting
Data We used the MuST-C En-De (408 hours) and En-Fr (492 hours) datasets (Di Gangi et al., 2019). Both language pairs consist of triplets (X, Y^s, Y^t). We performed the same data preprocessing as (Inaguma et al., 2020) (see details in Appendix A.1). We report case-sensitive detokenized BLEU scores (Papineni et al., 2002) on the tst-COMMON set with the multi-bleu-detok.perl script in Moses (Koehn et al., 2007).

Model configuration
We used the Transformer (Vaswani et al., 2017) architecture with two CNN blocks followed by 12 encoder layers, and six decoder layers, for the ASR and E2E-ST tasks. For the MT models, we used six encoder layers. We built our models with the ESPnet-ST toolkit (Inaguma et al., 2020). See details in Appendix A.2.
Training We always initialized the encoder parameters of the E2E-ST model with those of the corresponding pre-trained ASR model (Bérard et al., 2018). We followed the same optimization strategies as in (Inaguma et al., 2020, 2021).

Inference For the AR models, we used a beam width of 4. For the NAR models, we set T = {4, 10} and l = 9 as in (Inaguma et al., 2021).

Main results
We first report the paraphrasing quality, which is shown in Table 1. As confirmed by the BLEU and translation edit rate (TER) scores (Snover et al., 2006), the paraphrased source text was not just a simple copy of the transcription (see examples in Appendix A.5).

Autoregressive models
The results are shown in Table 2. Pre-training the ST decoder with the forward MT decoder (A2) improved the baseline performance (A1). Joint ASR showed a marginal improvement on En-De but degraded performance on En-Fr (A3). We attribute this to the fact that the ASR task is more trivial than the ST task and biased the shared decoder toward capturing surface-level textual information. In contrast, backward SeqKD showed small but consistent improvements in both language directions (A4), and it was as effective as MT pre-training. As the encoder was already pre-trained with the ASR model, the paraphrases had an additional positive effect on the BLEU improvement. Forward SeqKD significantly improved the performance, as previously reported in (Inaguma et al., 2021). However, the gains from MT pre-training and joint ASR diminished. Forward SeqKD alone was more effective than backward SeqKD alone (A4 vs. B1). However, backward SeqKD was still beneficial on top of forward SeqKD (C1, i.e., bidirectional SeqKD), while joint ASR was less so (B3). We also augmented the target translations by concatenating D_st and D_st^fwd (2ref training), which further improved forward SeqKD (B4). Nevertheless, a combination of 2ref training and backward SeqKD (i.e., bidirectional SeqKD with D_st^fwd ∪ D_st^bwd) had a complementary effect and showed the best result (C2). It even outperformed larger multilingual models (Wang et al., 2020a) without using additional data from other language pairs.
Non-autoregressive models

The results are presented in Table 3. Following the standard practice for NAR models (Gu et al., 2018), we always used forward SeqKD. We did not use 2ref training for the NAR models because it increases the multimodality. Joint ASR improved the performance of all NAR models, except on En-Fr with the number of iterations T = 10. However, bidirectional SeqKD with D_st^bidir further improved the performance consistently regardless of T. Since NAR models assume conditional independence between tokens, they in theory prefer monotonic input-output alignments with lower alignment complexity. However, paraphrasing collapses the monotonicity of the ASR task and increases the alignment complexity, making the auxiliary speech-to-source text task non-trivial. Nevertheless, BLEU scores improved by adding backward SeqKD. This was probably because the complexity of the transcriptions in the training data was reduced at the cost of alignment complexity, which was more effective for the NAR models.

Analysis
We analyze the performance of bidirectional SeqKD through the lens of the complexity of the training data, following (Zhou et al., 2019a). We aligned words in every source and target sentence pair with an external word aligner (Dyer et al., 2013). Then, we calculated corpus-level conditional entropy C(D) and faithfulness F(D) for both the forward (→D) and backward (←D) language directions to evaluate the multimodality. In short, conditional entropy measures the uncertainty of translation, and faithfulness, defined as a Kullback-Leibler divergence, measures how close the distilled data distribution is to the real data distribution. See the mathematical definitions in Appendix A.6.
The results for entropy and faithfulness are shown in Tables 4 and 5, respectively. Consistent with (Zhou et al., 2019a), the entropy of the target translations was reduced by forward SeqKD, indicating that the target translations were converted into a more deterministic and simplified form. Interestingly, the entropy of the original translations was also reduced by backward SeqKD. In other words, backward SeqKD modified the transcriptions so that the target translations can be predicted more easily. This would help E2E-ST models learn the relationship between the source and target languages from speech, because E2E-ST models are not explicitly conditioned on text in the other language. Therefore, we presume that the encoder representations were enhanced by backward SeqKD. Using machine-generated sequences in both languages increased the entropy, probably due to error accumulation. However, E2E-ST models do not suffer from this because they are conditioned on the source speech. We also confirmed similar trends in the reverse language direction. Regarding faithfulness, distilled target sequences degraded faithfulness, as expected. However, an interesting finding was that the faithfulness of bidirectional SeqKD was better than that of forward SeqKD, meaning that the former reflected the true word alignment distribution more faithfully than the latter. Although lexical choice might be degraded by targeting distilled text in both languages (Ding et al., 2021), mixing the original and distilled text via 2ref training would recover it.

Ablation study
We conduct an ablation study to verify the analysis in the previous section. In Table 4, we observed that it was better to have the original reference as the target sequence in either the source or the target language. For example, to reduce the entropy of the German text in the training set, it was best to condition the distilled German translation on the original English transcription, and vice versa. Therefore, we hypothesize that the best way to reduce the entropy in both the source and target languages during 2ref training is to combine (Ŷ^s, Y^t) and (Y^s, Ŷ^t) for each sample. We compared several ways to leverage the source text: the gold transcription Y^s only, the distilled paraphrase Ŷ^s only, and combinations of both (in all cases, both the gold translation Y^t and the distilled translation Ŷ^t were used as target sequences). The results are shown in Table 6. We confirmed that the model trained with the original reference in either language for every target achieved the best BLEU score, which verifies our hypothesis.

Increasing model capacity
Finally, we investigate the effectiveness of bidirectional SeqKD with 2ref training when increasing the model capacity in Table 7. The purpose of this experiment is to verify our expectation that larger models can better model the complex target distributions in multi-referenced training. In addition to simply increasing the model dimensions, we also investigate Conformer (Gulati et al., 2020), a Transformer encoder augmented with a convolution module. We confirmed that bidirectional SeqKD always outperformed forward SeqKD in both language directions regardless of the model configuration. We also found that the Conformer encoder significantly boosted the translation performance of forward SeqKD, and the gains from bidirectional SeqKD carried over.

Conclusion
To fully leverage knowledge in both the source and target language directions for bilingual E2E-ST models, we have proposed bidirectional SeqKD, in which forward SeqKD from a source-to-target NMT model and backward SeqKD from a target-to-source NMT model are combined. Backward SeqKD is performed by targeting source paraphrases generated via back-translation from the original translations in the bitext. The E2E-ST model is then enhanced by training it to generate both source and target language text with a single decoder. We experimentally confirmed that SeqKD from each direction boosted the translation performance of both autoregressive and non-autoregressive E2E-ST models and that the effectiveness was additive.
Multi-referenced training with the original and distilled text gave further gains. We also showed that bidirectional SeqKD was effective regardless of model size.

A.1 Data preprocessing

Non-verbal speech labels such as "(Applause)" and "(Laughter)" were removed during evaluation (Di Gangi et al., 2019; Inaguma et al., 2021; Le et al., 2020). We built output vocabularies based on the byte pair encoding (BPE) algorithm (Sennrich et al., 2016) with the Sentencepiece toolkit (Kudo, 2018). Joint source and target vocabularies were constructed for the ST and MT tasks, while the vocabularies for the ASR task were constructed from transcriptions only. For autoregressive models, we used 5k vocabulary units for the ASR models and 8k for the E2E-ST and MT models. We used 16k vocabulary units for the non-autoregressive E2E-ST models (Inaguma et al., 2021). For input speech features, we extracted 80-channel log-mel filterbank coefficients computed with a 25-ms window size shifted every 10 ms, along with 3-dimensional pitch features, using Kaldi (Povey et al., 2011). This resulted in 83-dimensional features for every frame. The features were normalized by the mean and standard deviation of each training set. To avoid overfitting, the training data was augmented by a factor of 3 with speed perturbation (Ko et al., 2015) and SpecAugment (Park et al., 2019). We used (m_T, m_F, T, F) = (2, 2, 40, 30) for the SpecAugment hyperparameters.
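The SpecAugment setting above can be sketched as a toy masking routine. This is a simplified illustration of Park et al. (2019) with the hyperparameters used here; the function name and in-place-on-a-copy behavior are our assumptions, not the toolkit's implementation.

```python
import numpy as np

def spec_augment(feats, m_t=2, m_f=2, max_t=40, max_f=30, rng=None):
    """Toy SpecAugment: apply m_f frequency masks of width <= max_f
    and m_t time masks of width <= max_t, zeroing the masked regions
    on a copy of the (frames x channels) feature matrix."""
    rng = rng or np.random.default_rng(0)
    x = feats.copy()
    n_frames, n_bins = x.shape
    for _ in range(m_f):  # frequency masks
        w = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_bins - w + 1)))
        x[:, f0:f0 + w] = 0.0
    for _ in range(m_t):  # time masks
        w = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w + 1)))
        x[t0:t0 + w, :] = 0.0
    return x

# Toy usage on a dummy 100-frame, 83-channel utterance.
masked = spec_augment(np.ones((100, 83)))
```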

A.2 Model configuration
We used the Transformer (Vaswani et al., 2017) architecture implemented in the ESPnet-ST toolkit (Inaguma et al., 2020) for all tasks. The ASR and E2E-ST models consisted of 12 speech encoder blocks and six decoder blocks. The speech encoders had two CNN blocks with a kernel size of 3 and a channel size of 256 before the first Transformer encoder layer, resulting in 4-fold downsampling along the time and frequency axes. The text encoder in the MT models consisted of six Transformer blocks. The dimensions of the self-attention layer d_model and the feed-forward network d_ff were set to 256 and 2048, respectively, and the number of attention heads H was set to 4. For the large Transformer configuration, we increased d_model from 256 to 512 and H from 4 to 8. For the Conformer configuration, we set d_model = 256, d_ff = 2048, and H = 4. The kernel size of the depthwise separable convolution was set to 15. None of the other training or decoding hyperparameters were modified.

A.3 Initialization
In addition to initializing the encoder parameters of the E2E-ST model by those of the pre-trained ASR model, the auxiliary AR decoder parameters of the NAR models were initialized by those of the corresponding pre-trained AR MT model (Inaguma et al., 2021). The other decoder parameters of both the AR and NAR models were initialized as in BERT (Devlin et al., 2019;Ghazvininejad et al., 2019;Inaguma et al., 2021), where weight parameters were sampled from N (0, 0.02), biases were set to zero, and layer normalization parameters were set to β = 0, γ = 1. Note that we did not use additional data for pre-training.

A.4 Training
The Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^-9 was used for training with a Noam learning rate schedule (Vaswani et al., 2017). We used dropout and label smoothing (Szegedy et al., 2016), each with a probability of 0.1. The other training configurations for all tasks are summarized in Table 9. We removed utterances having more than 3,000 input speech frames or more than 400 characters due to GPU memory constraints. The five best checkpoints based on validation performance were used for model averaging. For the training of the ASR models used for E2E-ST encoder pre-training, we removed case and punctuation information from the transcriptions and then applied a joint CTC/attention objective (Watanabe et al., 2017). However, we retained this information in the transcriptions and paraphrases used for training the E2E-ST and MT models.
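The Noam schedule referenced above is compact enough to state in full: linear warmup followed by inverse-square-root decay. The warmup length and scale factor below are illustrative values, not the paper's exact settings.

```python
def noam_lr(step, d_model=256, warmup_steps=25000, factor=1.0):
    """Noam learning-rate schedule (Vaswani et al., 2017):
    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5),
    i.e., linear warmup for `warmup_steps`, then ~1/sqrt(step) decay."""
    assert step >= 1
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)
```

The peak learning rate occurs exactly at `warmup_steps`, after which the rate decays without further tuning.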

A.5 Case study
We present examples of generated paraphrases on the Must-C En-De training set in Table 8. We observed that most paraphrases kept the original meaning while some words were simplified to alternatives having a similar meaning. We also found that the first conjunction in an utterance was more likely to be omitted via paraphrasing.

A.6 Mathematical formulation of complexity and faithfulness
In this section, we mathematically formulate the corpus-level complexity and faithfulness given D ∈ {D_st, D_st^fwd, D_st^bwd, D_st^bidir}. Our formulation follows (Zhou et al., 2019a), but we also consider the reverse language direction.
Conditional entropy (complexity) The corpus-level complexity of D in the forward language direction, C(→D), is defined as the conditional entropy H(Y^t | Y^s) normalized over all samples. H(Y^t | Y^s) is defined as

H(Y^t | Y^s) = - Σ_{k=1}^{T^t} p(y_k^t | Align(y_k^t)) · log p(y_k^t | Align(y_k^t)),

where Align(·) maps a target token to the source token aligned to it under an external alignment model A, and T^s and T^t are the source and target sequence lengths, respectively. We make two assumptions: (1) conditional independence of the target tokens given the source text sequence, and (2) the distribution p(y^t | Y^s) follows the alignment model A. Then, C(→D) is calculated as

C(→D) = - (1 / |V^s|) Σ_{y^s ∈ V^s} Σ_{y^t ∈ V^t} p(y^t | y^s) · log p(y^t | y^s),

where V^s is the set of all words in the source language. Division by |V^s| is important to normalize for frequent source words.
The corpus-level complexity of D in the backward language direction, C(←D), is defined similarly as

C(←D) = - (1 / |V^t|) Σ_{y^t ∈ V^t} Σ_{y^s ∈ V^s} p(y^s | y^t) · log p(y^s | y^t),

where V^t is the set of all words in the target language.
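The corpus-level conditional entropy can be estimated directly from word-aligned pairs. A minimal sketch, where the function name and the count-based estimate of p(y^t | y^s) are our assumptions rather than the original implementation:

```python
from collections import Counter, defaultdict
import math

def conditional_entropy(aligned_pairs):
    """Corpus-level C(D) from word-aligned (y^s, y^t) pairs:
    p(y^t | y^s) is estimated from alignment counts, and the
    per-source-word entropies H(y^t | y^s) are averaged over the
    source vocabulary V^s. Swapping the pair order gives the
    backward direction C(<-D)."""
    by_src = defaultdict(Counter)
    for y_s, y_t in aligned_pairs:
        by_src[y_s][y_t] += 1
    entropy_sum = 0.0
    for counts in by_src.values():
        n = sum(counts.values())
        entropy_sum -= sum((c / n) * math.log(c / n)
                           for c in counts.values())
    return entropy_sum / len(by_src)

# "bank" aligns to two targets (entropy log 2), "the" to one (entropy 0).
pairs = [("bank", "Bank"), ("bank", "Ufer"), ("the", "die")]
c_fwd = conditional_entropy(pairs)
```

A fully deterministic corpus (every source word aligned to one target word) yields C(D) = 0, matching the intuition that distillation simplifies the data.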
Faithfulness Although the corpus-level conditional entropy can be used to evaluate the complexity of the training data, there are trivial solutions that generate new data with smaller complexity but inadequate target translations. Faithfulness is a good measure to assess how close the distilled data distribution is to the real (original) data distribution. The faithfulness of D in the forward language direction, F(→D), and the backward language direction, F(←D), is defined as the KL divergence between the alignment distributions of the real dataset and the distilled dataset:

F(→D) = (1 / |V^s|) Σ_{y^s ∈ V^s} Σ_{y^t ∈ V^t} p_r(y^t | y^s) · log [ p_r(y^t | y^s) / p_d(y^t | y^s) ],

F(←D) = (1 / |V^t|) Σ_{y^t ∈ V^t} Σ_{y^s ∈ V^s} p_r(y^s | y^t) · log [ p_r(y^s | y^t) / p_d(y^s | y^t) ],

where p_r and p_d are the alignment distributions of the real and distilled data, respectively. Therefore, when D = D_st, F(→D) = F(←D) = 0, and this case is omitted from Table 5.
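Faithfulness can be estimated the same way from alignment counts. This toy sketch (function names are ours) assumes that every (y^s, y^t) pair observed in the real data also occurs in the distilled data; otherwise the KL term would be infinite:

```python
from collections import Counter, defaultdict
import math

def alignment_dist(aligned_pairs):
    """p(y^t | y^s) estimated from word-alignment counts."""
    by_src = defaultdict(Counter)
    for y_s, y_t in aligned_pairs:
        by_src[y_s][y_t] += 1
    return {s: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for s, cnt in by_src.items()}

def faithfulness(real_pairs, distilled_pairs):
    """F(->D): KL divergence between the real alignment distribution
    p_r and the distilled alignment distribution p_d, averaged over
    the source vocabulary. Zero when the two datasets coincide."""
    p_r, p_d = alignment_dist(real_pairs), alignment_dist(distilled_pairs)
    kl_sum = 0.0
    for y_s, dist in p_r.items():
        kl_sum += sum(p * math.log(p / p_d[y_s][y_t])
                      for y_t, p in dist.items())
    return kl_sum / len(p_r)
```

Passing the same pair list twice returns exactly zero, which is why the D = D_st row is omitted from the faithfulness table.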