Adaptive Feature Selection for End-to-End Speech Translation

Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. A ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech EnFr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).


Introduction
End-to-end (E2E) speech translation (ST), a paradigm that directly maps audio to a foreign text, has been gaining popularity recently (Duong et al., 2016;Bérard et al., 2016;Bansal et al., 2018;. Based on the attentional encoder-decoder framework (Bahdanau et al., 2015), it optimizes model parameters under direct translation supervision. This end-toend paradigm avoids the problem of error propagation that is inherent in cascade models where an automatic speech recognition (ASR) model and 1 We release our source code at https://github. com/bzhangGo/zero. the amplitude and frequency spectrum of an audio segment (top), paired with its time-aligned words and phonemes (bottom). Information inside an audio stream is not uniformly distributed. We propose to dynamically capture speech features corresponding to informative signals (red rectangles) to improve ST.
a machine translation (MT) model are chained together. Nonetheless, previous work still reports that E2E ST delivers inferior performance compared to cascade methods . We study one reason for the difficulty of training E2E ST models, namely the uneven spread of information in the speech signal, as visualized in Figure 1, and the consequent difficulty of extracting informative features. Features corresponding to uninformative signals, such as pauses or noise, increase the input length and bring in unmanageable noise for ST. This increases the difficulty of learning Na et al., 2019) and reduces translation performance.
In this paper, we propose adaptive feature selection (AFS) for ST to explicitly eliminate uninformative features. Figure 2 shows the overall architecture. We employ a pretrained ASR encoder to induce contextual speech features, followed by an ST encoder bridging the gap between speech and translation modalities. AFS is inserted in-between them to select a subset of features for ST encoding (see red rectangles in Figure 1). To ensure that the selected features are well-aligned to transcriptions, we pretrain AFS on ASR. AFS estimates the informativeness of each feature through a parameterized gate, and encourages the dropping of between the ST encoder (blue) and a pretrained ASR encoder (gray) to filter speech features for translation. We pretrain AFS jointly with ASR and freeze it during ST training. features (pushing the gate to 0) that contribute little to ASR. An underlying assumption is that features irrelevant for ASR are also unimportant for ST.
We base AFS on L 0 DROP (Zhang et al., 2020), a sparsity-inducing method for encoder-decoder models, and extend it to sparsify speech features. The acoustic input of speech signals involves two dimensions: temporal and feature, where the latter one describes the spectrum extracted from time frames. Accordingly, we adapt L 0 DROP to sparsify encoder states along temporal and feature dimensions but using different gating networks. In contrast to (Zhang et al., 2020), who focus on efficiency and report a trade-off between sparsity and quality for MT and summarization, we find that sparsity also improves translation quality for ST.
We conduct extensive experiments with Transformer (Vaswani et al., 2017) on LibriSpeech En-Fr and MuST-C speech translation tasks, covering 8 different language pairs. Results show that AFS only retains about 16% of temporal speech features, revealing heavy redundancy in speech encodings and yielding a decoding speedup of ∼1.4×. AFS eases model convergence, and improves the translation quality by ∼1.3-1.6 BLEU, surpassing several strong baselines. Specifically, without data augmentation, AFS narrows the performance gap against the cascade approach, and outperforms it on LibriSpeech En-Fr by 0.29 BLEU, reaching 18.56. We compare against fixed-rate feature selection and a simple CNN, confirming that our adaptive feature selection offers better translation quality.
Our work demonstrates that E2E ST suffers from redundant speech features, with sparsification bringing significant performance improvements. The E2E ST task offers new opportunities for follow-up research in sparse models to deliver performance gains, apart from enhancing efficiency and/or interpretability.
2 Background: L 0 DROP L 0 DROP provides a selective mechanism for encoder-decoder models which encourages removing uninformative encoder outputs via a sparsityinducing objective (Zhang et al., 2020). Given a source sequence X = {x 1 , x 2 , . . . , x n }, L 0 DROP assigns each encoded source state x i ∈ R d with a scalar gate g i ∈ [0, 1] as follows: where α i , β, are hyperparameters of the hard concrete distribution (HardConcrete) (Louizos et al., 2018). Note that the hyperparameter α i is crucial to HardConcrete as it directly governs its shape. We associate α i with x i through a gating network: Thus, L 0 DROP can schedule HardConcrete via α i to put more probability mass at either 0 (i.e g i → 0) or 1 (i.e. g i → 1). w ∈ R d is a trainable parameter. Intuitively, L 0 DROP controls the openness of gate g i via α i so as to determine whether to remove (g i = 0) or retain (g i = 1) the state x i . L 0 DROP enforces sparsity by pushing the probability mass of HardConcrete towards 0, according to the following penalty term: By sampling g i with reparameterization (Kingma and Welling, 2013), L 0 DROP is fully differentiable and optimized with an upper bound on the objective: L MLE + λL 0 (X), where λ is a hyperparameter affecting the degree of sparsity -a larger λ enforces more gates near 0 -and L MLE denotes the maximum likelihood loss. An estimation of the expected value of g i is used during inference. Zhang et al. (2020) applied L 0 DROP to prune encoder outputs for MT and summarization tasks; we adapt it to E2E ST. Sparse stochastic gates and L 0 relaxations were also by Bastings et al. (2019) to construct interpretable classifiers, i.e. models that can reveal which tokens they rely on when making a prediction.

Adaptive Feature Selection
One difficulty with applying encoder-decoder models to E2E ST is deciding how to encode speech signals. In contrast to text where word boundaries can be easily identified, the spectrum features of speech are continuous, varying remarkably across different speakers for the same transcript. In addition, redundant information, like pauses in-between neighbouring words, can be of arbitrary duration at any position as shown in Figure 1, while contributing little to translation. This increases the burden and occupies the capacity of ST encoder, leading to inferior performance (Duong et al., 2016;Bérard et al., 2016). Rather than developing complex encoder architectures, we resort to feature selection to explicitly clear out those uninformative speech features. Figure 2 gives an overview of our model. We use a pretrained and frozen ASR encoder to extract contextual speech features, and collect the informative ones from them via AFS before transmission to the ST encoder. AFS drops pauses, noise and other uninformative features and retains features that are relevant for ASR. We speculate that these retained features are also the most relevant for ST, and that the sparser representation simplifies the learning problem for ST, for example the learning of attention strength between encoder states and target language (sub)words. Given a training tuple (audio, source transcription, translation), denoted as (X, Y, Z) respectively, 2 we outline the overall framework below, including three steps: E2E ST with AFS 1. Train ASR model with the following objective and model architecture until convergence: 2. Finetune ASR model with AFS for m steps: 3. Train ST model with pretrained and frozen ASR and AFS submodules until convergence: We handle both ASR and ST as sequence-tosequence problem with encoder-decoder models.
We use E * (·) and D * (·, ·) to denote the correspond-ing encoder and decoder respectively. F (·) denotes the AFS approach, and F E means freezing the ASR encoder and the AFS module during training. Note that our framework puts no constraint on the architecture of the encoder and decoder in any task, although we adopt the multi-head dot-product attention network (Vaswani et al., 2017) for our experiments.
ASR Pretraining The ASR model M ASR (Eq. 6) directly maps an audio input to its transcription.
To improve speech encoding, we apply logarithmic penalty on attention to enforce short-range dependency (Di Gangi et al., 2019) and use trainable positional embedding with a maximum length of 2048. Apart from L MLE , we augment the training objective with the connectionist temporal classification (Graves et al., 2006, CTC) loss L CTC as in Eq. 5. Note η = 1 − γ. The CTC loss is applied to the encoder outputs, guiding them to align with their corresponding transcription (sub)words and improving the encoder's robustness (Karita et al., 2019). Following previous work (Karita et al., 2019;Wang et al., 2020), we set γ to 0.3.

AFS Finetuning
This stage aims at using AFS to dynamically pick out the subset of ASR encoder outputs that are most relevant for ASR performance (see red rectangles in Figure 1). We follow Zhang et al. (2020) and place AFS in-between ASR encoder and decoder during finetuning (see F (·) in M AFS , Eq. 8). We exclude the CTC loss in the training objective (Eq. 7) to relax the alignment constraint and increase the flexibility of feature adaptation. We use L 0 DROP for AFS in two ways. AFS t The direct application of L 0 DROP on ASR encoder results in AFS t , sparsifying encodings along the temporal dimension {x i } n i=1 : where α t i is a positive scalar powered by a simple linear gating layer, and w t ∈ R d is a trainable parameter of dimension d. g t is the temporal gate. The sparsity penalty of AFS t follows Eq. 4: AFS t,f In contrast to text processing, speech processing often extracts spectrum from overlapping time frames to form the acoustic input, similar to the word embedding. As each encoded speech feature contains temporal information, it is reasonable to extend AFS t to AFS t,f , including sparsification along the feature dimension {x i,j } d j=1 : where α f ∈ R d estimates the weights of each feature, dominated by an input-independent gating model with trainable parameter w f ∈ R d . 3 g f is the feature gate. Note that α f is shared for all time steps. denotes element-wise multiplication.
AFS t,f reuses g t i -relevant submodules in Eq. 11, and extends the sparsity penalty L t 0 in Eq. 12 as follows: 0 ) for extra m steps. We compare these two variants in our experiments.

E2E ST Training
We treat the pretrained ASR and AFS model as a speech feature extractor, and freeze them during ST training. We gather the speech features emitted by the ASR encoder that correspond to g t i > 0, and pass them similarly as done with word embeddings to the ST encoder. We employ sinusoidal positional encoding to distinguish features at different positions. Except for the input to the ST encoder, our E2E ST follows the standard encoder-decoder translation model (M ST in Eq. 10) and is optimized with L MLE alone as in Eq. 9. Intuitively, AFS bridges the gap between ASR output and MT input by selecting transcriptaligned speech features.

Experiments
Datasets and Preprocessing We experiment with two benchmarks: the Augmented LibriSpeech dataset (LibriSpeech En-Fr)  and the multilingual MuST-C dataset (MuST-C) . LibriSpeech En-Fr is collected by aligning e-books in French with English utterances of LibriSpeech, further augmented with French translations offered by Google Translate. We use the 100 hours clean training set for training, including 47K utterances to train ASR models and double the size for ST models after concatenation with the Google translations. We report results on the test set (2048 utterances) using models selected on the dev set (1071 utterances). MuST-C is built from English TED talks, covering 8 translation directions: English to German (De), Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Romanian (Ro) and Russian (Ru). We train ASR and ST models on the given training set, containing ∼452 hours with ∼252K utterances on average for each translation pair. We adopt the given dev set for model selection and report results on the common test set, whose size ranges from 2502 (Es) to 2641 (De) utterances.
For all datasets, we extract 40-dimensional log-Mel filterbanks with a step size of 10ms and window size of 25ms as the acoustic features. We expand these features with their first and second-order derivatives, and stabilize them using mean subtraction and variance normalization. We stack the features corresponding to three consecutive frames without overlapping to the left, resulting in the final 360-dimensional acoustic input. For transcriptions and translations, we tokenize and truecase all the text using Moses scripts (Koehn et al., 2007). We train subword models (Sennrich et al., 2016) on each dataset with a joint vocabulary size of 16K to handle rare words, and share the model for ASR, MT and ST. We train all models without removing punctuation.

Model Settings and Baselines
We adopt the Transformer architecture (Vaswani et al., 2017) for all tasks, including M ASR (Eq. 6), M AFS (Eq. 8) and M ST (Eq. 10). The encoder and decoder consist of 6 identical layers, each including a self-attention sublayer, a cross-attention sublayer (decoder alone) and a feedforward sublayer. We employ the base setting for experiments: hidden size d = 512, attention head 8 and feedforward size 2048. We schedule learning rate via Adam (β 1 = 0.9, β 2 = 0.98) (Kingma and Ba, 2015), paired with a warmup step of 4K. We apply dropout to attention weights and residual connections with a rate of 0.1 and 0.2 respectively, and also add label smoothing of 0.1 to handle overfitting. We train all models with a maximum step size of 30K and a minibatch size of around 25K target subwords. We average the last 5 checkpoints for evaluation. We use beam search for decoding, and set the beam size and length penalty to 4 and 0.6, respectively. We set = −0.1, and β = 2 /3 for AFS following Louizos et al. (2018), and finetune AFS for an additional m = 5K steps. We evaluate translation quality with tokenized case-sensitive BLEU (Papineni et al., 2002), and report WER for ASR performance without punctuation.
We compare our models with four baselines: ST Besides, we offer another baseline, ST + CNN, for comparison on MuST-C En-De: we replace the fixed-rate subsampling with a one-layer 1D depthseparable convolution, where the output dimension is set to 512, the kernel size over temporal dimension is set to 5 and the stride is set to 6. In this way, the ASR encoder features will be compressed to around 1/6 features, a similar ratio to the fixed-rate subsampling.

Results on MuST-C En-De
We perform a thorough study on MuST-C En-De. With AFS, the first question is its feasibility. We start by analyzing the degree of sparsity in speech features (i.e. sparsity rate) yielded by AFS, focusing on the temporal sparsity rate #{g t i =0} /n and the feature sparsity rate #{g f j =0} /d. To obtain different rates, we vary the hyperparameter λ in Eq. 7 in a range of [0.1, 0.8] with a step size 0.1.
Results in Figure 3 show that large amounts of encoded speech features (> 59%) can be easily pruned out, revealing heavy inner-speech redundancy. Both AFS t and AFS t,f drop ∼60% temporal features with λ of 0.1, and this number increases to > 85% when λ ≥ 0.5 (Figure 3b), remarkably surpassing the sparsity rate reported by Zhang et al. (2020) on text summarization (71.5%). In contrast to rich temporal sparsification, we get a feature sparsity rate of 0, regardless of λ's value, although increasing λ decreases g f (Figure 3a). This suggests that selecting neurons from the feature dimension is harder. Rather than filtering neurons, the feature gate g f acts more like a weighting mechanism on them. In the rest of the paper, we use sparsity rate for the temporal sparsity rate.
We continue to explore the impact of varied sparsity rates on the ASR and ST performance. Figure  4 shows their correlation. We observe that AFS slightly degenerates ASR accuracy (Figure 4a), but still retains ∼95% accuracy on average; AFS t,f often performs better than AFS t with similar sparsity rate. The fact that only 15% speech features successfully support 95% ASR accuracy proves the informativeness of these selected features. These findings echo with (Zhang et al., 2020), where they observe a trade-off between sparsity and quality.
However, when AFS is applied to ST, we find consistent improvements to translation quality by > 0.8 BLEU, shown in Figure 4b. Translation quality on the development set peaks at 22.17 BLEU  achieved by AFS t,f with a sparsity rate of 85.5%. We set λ = 0.5 (corresponding to sparsity rate of ∼85%) for all other experiments, since AFS t and AFS t,f reach their optimal result at this point.
We summarize the test results in Table 1, where we set k = 6 or k = 7 for ST+Fixed Rate with a sparsity rate of around 85% inspired by our above analysis. Our vanilla ST model yields a BLEU score of 17.44; pretraining on ASR further enhances the performance to 20.67, significantly outperforming the results of Di Gangi et al. (2019) by 3.37 BLEU. This also suggests the importance of speech encoder pretraining Stoian et al., 2020;Wang et al., 2020). We treat ST with ASR-PT as our real baseline. We observe improved translation quality with fixed-rate subsampling, +0.47 BLEU at k = 6. Subsampling offers a chance to bypass noisy speech signals and reducing the number of source states makes learning translation alignment easier, but deciding the optimal sampling rate is tough. Results in Figure 5 reveal that fixed-rate subsampling deteriorates ST performance with suboptimal rates. Replacing fixed-rate subsampling with our one-layer CNN also fails to improve over the baseline, although CNN offers more flexibility in feature manipulation. By con- trast to fixed-rate subsampling, the proposed AFS is data-driven, shifting the decision burden to the data and model themselves. As a result, AFS t and AFS t,f surpass ASR-PT by 0.9 BLEU and 1.71 BLEU, respectively, substantially narrowing the performance gap compared to the cascade baseline (-0.14 BLEU). We also observe improved decoding speed: AFS runs ∼1.37× faster than ASR-PT. Compared to the fixed-rate subsampling, AFS is slightly slower which we ascribe to the overhead introduced by the gating module. Surprisingly, Table 1 shows that the vanilla ST runs slower than ASR-PT (0.87×) while the cascade model is slightly faster (1.06×). By digging into the beam search algorithm, we discover that ASR pretraining shortens the number of steps in beam-decoding: 94 ASR-PT vs. 112 vanilla ST (on average). The speedup brought by cascading is due to the smaller English vocabulary size compared to the German vocabulary when processing audio inputs.

Why (Adaptive) Feature Selection?
Apart from the benefits in translation quality, we go deeper to study other potential impacts of (adaptive) feature selection. We begin with inspecting training curves. Figure 6 shows that ASR pretraining improves model convergence; feature selection makes training more stable. Compared to other models, the curve of ST with AFS is much smoother, suggesting its better regularization effect.
We then investigate the effect of training data size, and show the results in Figure 7. Overall, we do not observe higher data efficiency by feature selection on low-resource settings. But instead, our results suggest that feature selection delivers larger performance improvement when more training data is available. With respect to data efficiency, ASR pretraining seems to be more important (Figure 7, left) (Bansal et al., 2019;Stoian et al., 2020). Com- MuST-C En-De. We split the original training data into non-overlapped five subsets, and train different models with accumulated subsets. Results are reported on the test set. Note that we perform ASR pretraining on the original dataset. λ = 0.5, k = 6. pared to AFS, the fixed-rate subsampling suffers more from small-scale training: it yields worse performance than ASR-PT when data size ≤ 100K, highlighting better generalization of AFS.
In addition to model performance, we also look into the ST model itself, and focus on the crossattention weights. Figure 8 visualize the attention value distribution, where ST models with feature selection noticeably shift the distribution towards larger weights. This suggests that each ST encoder output exerts greater influence on the translation. By removing redundant and noisy speech features, feature selection eases the learning of the ST encoder, and also enhances its connection strength with the ST decoder. This helps bridge the modality gap between speech and text translation. Although fixed-rate subsampling also delivers a distribution shift similar to AFS, its inferior ST performance compared to AFS corroborates the better quality of adaptively selected features.

AFS vs. Fixed Rate
We compare these two approaches by analyzing the number of retained features with respect to word duration and temporal position. Results in Figure 9a show that the underlying pattern behind these two methods is similar: words with longer duration correspond to more speech features. However, when it comes to temporal position, Figure 9b illustrates their difference: fixed-rate subsampling is context-independent, periodically picking up features; while AFS decides feature selection based on context information. The curve of AFS is more smooth, indicating that features kept by AFS are more uniformly distributed across different positions, ensuring the features' informativeness.
AFS t vs. AFS t,f Their only difference lies at the feature gate g f . We visualize this gate in Figure  10. Although this gate induces no sparsification, it offers AFS t,f the capability of adjusting the weight of each neuron. In other words, AFS t,f has more freedom in manipulating speech features.   2019a) and data augmentation (Wang et al., 2020). Comparability to previous work is limited due to possible differences in tokenization and letter case. To ease future cross-paper comparison, we provide SacreBLEU (Post, 2018) 4 for our models.

Related Work
Speech Translation Pioneering studies on ST used a cascade of separately trained ASR and MT systems (Ney, 1999). Despite its simplicity, this approach inevitably suffers from mistakes made by ASR models, and is error prone. Research in this direction often focuses on strategies capable of mitigating the mismatch between ASR output and MT input, such as representing ASR outputs with lattices (Saleem et al., 2004;Mathias and Byrne, 2006;Beck et al., 2019), injecting synthetic ASR errors for robust MT (Tsvetkov et al., 2014;Cheng et al., 2018) and differentiable cascade modeling (Kano et al., 2017;Anastasopoulos and Chiang, 2018;Sperber et al., 2019). In contrast to cascading, another option is to perform direct speech-to-text translation. Duong et al. (2016) and Bérard et al. (2016) employ the attentional encoder-decoder model (Bahdanau et al., 2015) for E2E ST without accessing any intermediate transcriptions. E2E ST opens the way to bridging the modality gap directly, but it is data-hungry, sample-inefficient and often underperforms cascade models especially in low-resource settings (Bansal et al., 2018). This led researchers to explore solutions ranging from efficient neural architecture design (Karita et al., 2019;Sung et al., 2019) to extra training signal incorporation, including multi-task learning (Weiss et al., 2017;Liu et al., 2019b), submodule pretraining (Bansal et al., 2019;Stoian et al., 2020;Wang et al., 2020), knowledge distillation (Liu et al., 2019a), meta-learning (Indurthi et al., 2019) and data augmentation Jia et al., 2019;Pino et al., 2019). Our work focuses on E2E ST, but we investigate feature selection which has rarely been studied before.
Speech Feature Selection Encoding speech signals is challenging as acoustic input is lengthy, noisy and redundant. To ease model learning, previous work often selected features via downsampling techniques, such as convolutional modeling  and fixed-rate subsampling (Lu et al., 2015). Recently,  and Na et al. (2019) proposed dynamic subsampling for ASR which learns to skip uninformative features during recurrent encoding. Unfortunately, their methods are deeply embedded into recurrent networks, hard to adapt to other architectures like Transformer (Vaswani et al., 2017). Recently, Salesky et al. (2020) have explored phoneme-level representations for E2E ST, but this requires nontrivial phoneme recognition and alignment.
Instead, we resort to sparsification techniques which have achieved great success in NLP tasks recently (Correia et al., 2019;Child et al., 2019;Zhang et al., 2020). In particular, we employ L 0 DROP (Zhang et al., 2020) for AFS to dynamically retain informative speech features, which is fully differentiable and independent of concrete encoder/decoder architectures. We extend L 0 DROP by handling both temporal and feature dimensions with different gating networks, and apply it to E2E ST.

Conclusion and Future Work
In this paper, we propose adaptive feature selection for E2E ST to handle redundant and noisy speech signals. We insert AFS in-between the ST encoder and a pretrained, frozen ASR encoder to filter out uninformative features contributing little to ASR. We base AFS on L 0 DROP (Zhang et al., 2020), and extend it to modeling both temporal and feature dimensions. Results show that AFS improves translation quality and accelerates decoding by ∼1.4× with an average temporal sparsity rate of ∼84%. AFS successfully narrows or even closes the performance gap compared to cascading models.
While most previous work on sparsity in NLP demonstrates its benefits from efficiency and/or interpretability perspectives (Zhang et al., 2020), we show that sparsification in our scenario -E2E ST -leads to substantial performance gains.
In the future, we will work on adapting AFS to simultaneous speech translation.