Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have been shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance and can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German and English-French test sets of MuST-C.


Introduction
The principle of compositionality loosely states that a complex whole is composed of its parts and the rules by which those parts are combined (Lake and Baroni, 2018). This principle is present in engineering, where task decomposition of a complex system is required to assess and optimize task allocations (Levis et al., 1994), and in natural language, where paragraph coherence and discourse analysis rely on decomposition into sentences (Johnson, 1992; Kuo, 1995) and sentence-level semantics relies on decomposition into lexical units (Liu et al., 2020b).
Similarly, many sequence-to-sequence tasks that convert one sequence into another (Sutskever et al., 2014) can be decomposed into simpler sequence sub-tasks in order to reduce the overall complexity. For example, speech translation systems, which seek to process speech in one language and output text in another language, can be naturally decomposed into the transcription of source language audio through automatic speech recognition (ASR) and translation into the target language through machine translation (MT). Such cascaded approaches have been widely used to build practical systems for a variety of sequence tasks like hybrid ASR (Hinton et al., 2012), phrase-based MT, and cascaded ASR-MT systems for speech translation (ST) (Pham et al., 2019).
End-to-end sequence models like encoder-decoder models (Bahdanau et al., 2015; Vaswani et al., 2017) are attractive in part due to their simple design and the reduced need for hand-crafted features. However, studies have shown mixed results compared to cascaded models, particularly for complex sequence tasks like speech translation (Inaguma et al., 2020) and spoken language understanding (Coucke et al., 2018). Although direct target sequence prediction avoids the error propagation from one system to the next that affects cascaded approaches (Tzoukermann and Miller, 2018), there are many attractive properties of cascaded systems, missing in end-to-end approaches, that are useful in complex sequence tasks.
In particular, we are interested in (1) the strong search capabilities of cascaded systems that compose the final task output from individual system predictions (Mohri et al., 2002; Kumar et al., 2006; Beck et al., 2019), (2) the ability to incorporate external models to re-score each individual system (Och and Ney, 2002; Huang and Chiang, 2007), (3) the ability to easily adapt individual components towards out-of-domain data (Koehn and Schroeder, 2007; Peddinti et al., 2015), and finally (4) the ability to monitor the performance of the individual systems on their decomposed sub-tasks (Tillmann and Ney, 2003; Meyer et al., 2016).
In this paper, we seek to incorporate these properties of cascaded systems into end-to-end sequence models. We first propose a generic framework to learn searchable hidden intermediates using an auto-regressive encoder-decoder model for any decomposable sequence task (§3). We then apply this approach to speech translation, where the intermediate stage is the output of ASR, by passing continuous hidden representations of discrete transcript sequences from the ASR sub-net decoder to the MT sub-net encoder. By doing so, we gain the ability to use beam search with optional external model re-scoring on the hidden intermediates, while maintaining end-to-end differentiability. Next, we suggest mitigation strategies for the error propagation issues inherited from decomposition.
We show the efficacy of searchable intermediate representations in our proposed model, called the Multi-Decoder, on speech translation, with 5.4 and 2.8 BLEU score improvements over the previous state of the art for the Fisher and CallHome test sets respectively (§6). We extend these improvements by an average of 0.5 BLEU through the aforementioned benefit of re-scoring the intermediate search with external models trained on the same dataset. We also show a method for monitoring sub-net performance using oracle intermediates that are void of search errors (§6.1). Finally, we show how these models can adapt to out-of-domain speech translation datasets, how our approach can be generalized to other sequence tasks like speech recognition, and how the benefits of decomposition persist even for larger corpora like MuST-C (§6.2).

Compositionality in Sequence Models
The probabilistic space of a sequence is combinatorial in nature, such that a sentence of L words from a fixed vocabulary V would have an output space S of size |V|^L. In order to deal with this combinatorial output space, an output sentence is decomposed into labeled target tokens, y = (y_1, y_2, ..., y_L), where y_l ∈ V.
An auto-regressive encoder-decoder model uses the above probabilistic decomposition in sequence-to-sequence tasks to learn next-word prediction, which outputs a distribution over the next target token y_l given the previous tokens y_{1:l-1} and the input sequence x = (x_1, x_2, ..., x_T), where T is the input sequence length. In the next sub-section we detail the training and inference of these models.
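For concreteness, this next-token formulation corresponds to the standard chain-rule factorization of the sequence likelihood (a worked form of the decomposition described above, using the same notation):

    P(y | x) = ∏_{l=1}^{L} P(y_l | x, y_{1:l-1})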

Auto-regressive Encoder-Decoder Models
Training: In an auto-regressive encoder-decoder model, the ENCODER maps the input sequence x to a sequence of continuous hidden representations h^E:

    h^E = ENCODER(x)    (1)

The DECODER then auto-regressively maps h^E and the preceding ground-truth output tokens ŷ_{1:l-1} to the decoder hidden representation h^D_l, giving h^D = (h^D_1, ..., h^D_L):

    h^D_l = DECODER(h^E, ŷ_{1:l-1})    (2)

The likelihood of each output token y_l is given by SOFTMAXOUT, which denotes an affine projection of h^D_l to V followed by a softmax function:

    P(y_l | x, ŷ_{1:l-1}) = SOFTMAXOUT(h^D_l)    (3)
During training, the DECODER performs token classification for next-word prediction by conditioning only on the ground-truth previous tokens ŷ. We refer to the resulting decoder states ĥ^D as oracle decoder representations, which will be discussed later.
Inference: During inference, we maximize the likelihood of the entire sequence over the output space S by composing the conditional probabilities of each of the L steps:

    ỹ = argmax_{y ∈ S} ∏_{l=1}^{L} P(y_l | x, y_{1:l-1})    (4)
This is an intractable search problem, and it can be approximated either by greedily choosing the argmax at each step or by using a search algorithm like beam search to approximate ỹ. Beam search (Reddy, 1988) generates candidates at each step and prunes the search space to a tractable beam size of B most likely sequences. As B → ∞, the beam search result would be equivalent to equation 4.
    GREEDYSEARCH := argmax_{y_l} P(y_l | x, y_{1:l-1})
    BEAMSEARCH := BEAM(P(y_l | x, y_{1:l-1}))

In approximate search for auto-regressive models, like beam search, the DECODER receives alternate candidates of previous tokens in order to find candidates with a higher likelihood as an overall sequence. This also allows for the use of external models like language models (LM) or connectionist temporal classification (CTC) models for re-scoring candidates.
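As an illustration, a minimal beam search over an auto-regressive decoder might look like the following sketch. The callable `decoder_step`, returning per-token log-probabilities for a given prefix, is a hypothetical stand-in for the DECODER plus SOFTMAXOUT; this is not the exact implementation used in our experiments.

```python
def beam_search(decoder_step, bos_id, eos_id, beam_size=10, max_len=100):
    """Minimal beam search sketch: decoder_step(prefix) -> {token_id: log_prob}."""
    # Each hypothesis is (token_sequence, cumulative_log_probability).
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:          # keep completed hypotheses aside
                finished.append((prefix, score))
                continue
            for token, logp in decoder_step(prefix).items():
                candidates.append((prefix + [token], score + logp))
        if not candidates:
            break
        # Prune the search space to the B most likely partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```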

Proposed Framework
In this section, we present a general framework to exploit natural decompositions in sequence tasks which seek to predict some output C from an input sequence A. If there is an intermediate sequence B for which A → B sequence transduction followed by B → C prediction achieves the original task, then the original A → C task is decomposable.
In other words, if we can learn P(B | A), then we can learn the overall task P(C | A) through max_B P(C | A, B) P(B | A), approximated using Viterbi search. We define a first encoder-decoder SUB_{A→B}NET to map an input sequence A to a sequence of decoder hidden states, h^{D_B}. Then we define a subsequent SUB_{B→C}NET to map h^{D_B} to the final probabilistic output space of C. We therefore call h^{D_B} hidden intermediates. The following equations show the two sub-networks of our framework, which can be trained end-to-end while also exploiting compositionality in sequence tasks:

    h^{D_B} = SUB_{A→B}NET(A)    (5)
    P(C | A, B) = SUB_{B→C}NET(h^{D_B})    (6)

Note that the final prediction, given by equation 6, does not need to be a sequence and can be a categorical class, as in spoken language understanding tasks. Next we show how the hidden intermediates become searchable during inference.
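A minimal sketch of this composition in a PyTorch-style outline is shown below; the module names `subnet_ab` and `subnet_bc` and their signatures are illustrative assumptions, not the exact implementation.

```python
import torch.nn as nn

class DecomposedSeqModel(nn.Module):
    """Sketch of SUB_{A->B}NET followed by SUB_{B->C}NET, trained end-to-end."""
    def __init__(self, subnet_ab: nn.Module, subnet_bc: nn.Module):
        super().__init__()
        self.subnet_ab = subnet_ab   # encoder-decoder for the A -> B sub-task
        self.subnet_bc = subnet_bc   # predictor (or encoder-decoder) for B -> C

    def forward(self, a, y_b, y_c):
        # Sub-task A -> B: decoder hidden states h^{D_B} (continuous, differentiable)
        # are produced with teacher forcing on the ground-truth intermediate y_b.
        h_db, loss_b = self.subnet_ab(a, y_b)
        # Sub-task B -> C consumes the hidden intermediates rather than discrete
        # tokens, so gradients flow through both sub-networks.
        loss_c = self.subnet_bc(h_db, y_c)
        return loss_b + loss_c
```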

Searchable Hidden Intermediates
As stated in §2.2, approximate search algorithms maximize the likelihood P(y | x) of the entire sequence by considering different candidates y_l at each step. Candidate-based search, particularly in auto-regressive encoder-decoder models, also affects the decoder hidden representations h^D, as these are directly dependent on the previous candidate (refer to equation 2). This implies that by searching for better approximations of the previous predicted tokens, y_{l-1} = (y_BEAM)_{l-1}, we also improve the decoder hidden representation for the next token, h^D_l = (h^D_BEAM)_l. As y_BEAM → ŷ, the decoder hidden representations tend to the oracle decoder representations, which contain only errors from next-word prediction: h^D_BEAM → ĥ^D. A perfect search is analogous to choosing the ground truth ŷ at each step, which would yield ĥ^D.
We apply this beam search to the hidden intermediates, thereby approximating ĥ^{D_B} with h^{D_B}_BEAM. This process is illustrated in Algorithm 1, which shows beam search for h^{D_B}_BEAM, which is subsequently passed to the SUB_{B→C}NET. In line 7, we show how an external model like an LM or a CTC model can be used to generate an alternate sequence likelihood, P_EXT(y^B_l), which can be combined with the SUB_{A→B}NET likelihood, P_B(y^B_l | x), with a tunable parameter λ.
Algorithm 1 (Beam Search for Hidden Intermediates): We perform beam search to approximate the most likely sequence for the sub-task A → B, y^B_BEAM, while collecting the corresponding DECODER_B hidden representations, h^{D_B}_BEAM. The output h^{D_B}_BEAM is passed to the final sub-network to predict the final output C, and y^B_BEAM is used for monitoring performance on predicting B. We can monitor the performance of the SUB_{A→B}NET by comparing the decoded intermediate sequence y^B_BEAM to the ground truth ŷ^B. We can also monitor the SUB_{B→C}NET performance by using the aforementioned oracle representations of the intermediates, ĥ^{D_B}, which can be obtained by feeding the ground truth ŷ^B to DECODER_B. By passing ĥ^{D_B} to the SUB_{B→C}NET, we can observe its performance in a vacuum, i.e., void of search errors in the hidden intermediates.
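A minimal Python sketch of Algorithm 1 follows. Here `decoder_b_step` (returning per-token log-probabilities together with the current decoder hidden state) and `external_score` are hypothetical callables standing in for DECODER_B and an external LM/CTC model; they are not the actual ESPnet interfaces.

```python
def search_hidden_intermediates(decoder_b_step, external_score=None, lam=0.3,
                                beam_size=10, max_len=100, bos_id=1, eos_id=2):
    """Beam search over the A->B sub-task that also collects DECODER_B hidden states."""
    # Each hypothesis: (token_prefix, list_of_hidden_states, cumulative_log_prob)
    beams = [([bos_id], [], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, hiddens, score in beams:
            if prefix[-1] == eos_id:                    # completed hypothesis
                candidates.append((prefix, hiddens, score))
                continue
            log_probs, h_db = decoder_b_step(prefix)    # h_db: hidden state for this step
            for token, logp in log_probs.items():
                step_score = logp
                if external_score is not None:          # optional LM/CTC re-scoring
                    step_score = (1 - lam) * logp + lam * external_score(prefix, token)
                candidates.append((prefix + [token], hiddens + [h_db], score + step_score))
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_size]
        if all(p[-1] == eos_id for p, _, _ in beams):
            break
    y_beam, h_beam, _ = beams[0]
    return y_beam, h_beam   # y_beam for monitoring; h_beam is passed to SUB_{B->C}NET
```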

Multi-Decoder Model
In order to show the applicability of our end-to-end framework, we propose the Multi-Decoder model for speech translation. This model predicts a sequence of text translations y^ST from an input sequence of speech x and uses a sequence of text transcriptions y^ASR as an intermediate. In this case, the SUB_{A→B}NET in equation 5 is specified as the ASR sub-net and the SUB_{B→C}NET in equation 6 is specified as the MT sub-net. Since the MT sub-net is also a sequence prediction task, both sub-nets are encoder-decoder models in our architecture (Bahdanau et al., 2015; Vaswani et al., 2017). In Figure 1 we illustrate the schematics of our transformer-based Multi-Decoder ST model, which can be summarized as follows:

    h^{E_ASR} = ENCODER_ASR(x)    (7)
    h^{D_ASR}_l = DECODER_ASR(h^{E_ASR}, ŷ^{ASR}_{1:l-1})    (8)
    h^{E_ST} = ENCODER_ST(h^{D_ASR})    (9)
    h^{D_ST}_l = DECODER_ST(h^{E_ST}, ŷ^{ST}_{1:l-1})    (10)

As we can see from equations 9 and 10, the MT sub-network attends only to the decoder representations, h^{D_ASR}, of the ASR sub-network, which could lead to error propagation from the ASR sub-network to the MT sub-network similar to cascaded systems, as mentioned in §1. To alleviate this problem, we modify equation 10 such that DECODER_ST attends to both h^{E_ST} and h^{E_ASR}:

    h^{D_ST}_l = DECODER_ST(h^{E_ST}, h^{E_ASR}, ŷ^{ST}_{1:l-1})

We use the multi-sequence cross-attention discussed by Helcl et al. (2018), shown on the right side of Figure 1, to condition the final outputs generated from h^{D_ST}_l on both speech and transcript information, in an attempt to allow our network to recover from intermediate mistakes during inference. We call this model the Multi-Decoder w/ Speech-Attention.
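The Speech-Attention variant can be sketched as a decoder block with two cross-attention sub-layers, one over h^{E_ST} and one over h^{E_ASR}. The PyTorch outline below is a minimal sketch following the idea of Helcl et al. (2018); the layer sizes and the simple summation of the two context vectors are illustrative assumptions rather than our exact implementation.

```python
import torch.nn as nn

class MultiSequenceDecoderLayer(nn.Module):
    """One DECODER_ST block attending over both the ST encoder and the ASR encoder."""
    def __init__(self, d_model=256, n_heads=4, ff_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_st = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_asr = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, y, h_e_st, h_e_asr):
        # Self-attention over the partial translation y.
        y = self.norms[0](y + self.self_attn(y, y, y)[0])
        # Cross-attention over the ST encoder (hidden intermediates) and the ASR
        # encoder (speech); here the two context vectors are simply summed.
        ctx_st = self.attn_st(y, h_e_st, h_e_st)[0]
        ctx_asr = self.attn_asr(y, h_e_asr, h_e_asr)[0]
        y = self.norms[1](y + ctx_st + ctx_asr)
        return self.norms[2](y + self.ffn(y))
```

In the full model, a stack of such blocks would replace the standard DECODER_ST blocks of equation 10.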
Experimental Setup

Data Preparation: We use the Fisher-CallHome Spanish-English corpus (Post et al., 2013), which contains 170 hours of Spanish conversational telephone speech, transcriptions, and English translations. All punctuation except apostrophes was removed, and results are reported in terms of detokenized case-insensitive BLEU (Papineni et al., 2002; Post, 2018). We compute BLEU using the 4 references in Fisher (dev, dev2, and test) and the single reference in CallHome (dev and test) (Post et al., 2013; Kumar et al., 2014; Weiss et al., 2017). We use a joint source and target vocabulary of 1K byte pair encoding (BPE) units (Kudo and Richardson, 2018).
We prepare the corpus using the ESPnet library and follow the standard data preparation, where inputs are globally mean-variance normalized log-mel filterbank and pitch features from up-sampled 16kHz audio (Watanabe et al., 2018). We also apply speed perturbation with factors of 0.9 and 1.1 and the SpecAugment SS policy (Park et al., 2019).
Baseline Configuration: All of our models are implemented using the ESPnet library and trained on 3 NVIDIA 2080Ti GPUs for ≈12 hours. For the Baseline Enc-Dec model, discussed in §4, we use an ENCODER_ASR consisting of convolutional sub-sampling by a factor of 4 (Watanabe et al., 2018) and 12 transformer encoder blocks with 2048 feed-forward dimension, 256 attention dimension, and 4 attention heads. The DECODER_ASR and DECODER_ST both consist of 6 transformer decoder blocks with the same configuration as ENCODER_ASR. There are 37.9M trainable parameters. We apply dropout of 0.1 for all components, detailed in the Appendix (A.1). We train our models using an effective batch size of 384 utterances and the Adam optimizer (Kingma and Ba, 2015) with an inverse square root decay learning rate schedule. We set the learning rate to 12.5, warmup steps to 25K, and epochs to 50. We use joint training with hybrid CTC/attention ASR by setting mtl-alpha to 0.3 and asr-weight to 0.5, as defined by Watanabe et al. (2018). During inference, we perform beam search (Seki et al., 2019) on the ST sequences using a beam size of 10, length penalty of 0.2, and max length ratio of 0.3 (Watanabe et al., 2018).

In the intermediate ASR beam search, a beam size of 1, which is a greedy search, results in lower ASR sub-net and overall ST performance. As beam sizes become larger, the gains taper off, as can be seen between beam sizes of 10 and 16.
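For reference, the baseline hyperparameters above can be summarized in a single configuration sketch. The values are copied from this section, but the dictionary itself is only illustrative and does not correspond to an actual ESPnet config file.

```python
baseline_config = {
    # architecture
    "encoder_asr": {"blocks": 12, "ff_dim": 2048, "att_dim": 256, "heads": 4, "subsampling": 4},
    "decoder_asr": {"blocks": 6, "ff_dim": 2048, "att_dim": 256, "heads": 4},
    "decoder_st":  {"blocks": 6, "ff_dim": 2048, "att_dim": 256, "heads": 4},
    "dropout": 0.1,
    # optimization
    "batch_size": 384, "optimizer": "adam", "lr": 12.5, "warmup_steps": 25000, "epochs": 50,
    "mtl_alpha": 0.3, "asr_weight": 0.5,   # hybrid CTC/attention joint training
    # inference (ST beam search)
    "beam_size": 10, "length_penalty": 0.2, "max_len_ratio": 0.3,
}
```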

External models for better search
External models like CTC acoustic models and language models are commonly used for re-scoring encoder-decoder models, due to the difference in their modeling capabilities: CTC directly models transcripts while being conditionally independent of the other outputs given the input, and LMs predict the next token in a sequence. Both variants of the Multi-Decoder improve due to improved ASR sub-net performance when using external CTC and LM models for re-scoring, as shown in Table 3. We use a recurrent neural network LM trained on the Fisher-CallHome Spanish transcripts, with a dev perplexity of 18.8, and the CTC model from the joint loss applied during training. Neither external model incorporates additional data. Although the impact of LM-only re-scoring is not reflected in the ASR WER, it reduces substitution and deletion rates in the ASR output, which is observed to help the overall ST performance.
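The re-scoring itself amounts to interpolating the sub-net, CTC, and LM scores for each candidate token during the intermediate beam search. The sketch below shows this combination in the standard shallow-fusion style; the weights λ_CTC and λ_LM are tunable, and the exact formulation in our toolkit may differ.

```python
def rescored_log_prob(logp_dec, logp_ctc, logp_lm, lam_ctc=0.1, lam_lm=0.1):
    """Combine decoder, CTC, and LM log-probabilities for one candidate token."""
    return (1.0 - lam_ctc) * logp_dec + lam_ctc * logp_ctc + lam_lm * logp_lm
```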

Error propagation avoidance
As discussed in §3, our Multi-Decoder model inherits the error propagation issue, as can be seen in Figure 3, where utterances are bucketed on the x-axis by the utterance-level WER of the Multi-Decoder ASR sub-net.

Generalizability
In this section, we discuss the generalizability of our framework towards out-of-domain data. We also extend our Multi-Decoder model to other sequence tasks like speech recognition. Finally, we apply our ST models to a larger corpus with more language pairs and a different domain of speech.

Robustness through Decomposition
Like cascaded systems, searchable intermediates give our model the ability to adapt individual sub-systems towards out-of-domain data using external in-domain language models, thereby giving access to more in-domain data. For speech translation systems specifically, this means we can use in-domain language models in both the source and target languages. We test the robustness of our Multi-Decoder model, trained on the Fisher-CallHome conversational speech dataset, on the read-speech CoVoST-2 dataset (Wang et al., 2020b). In Table 4 we show that re-scoring the ASR sub-net with an in-domain LM improves ASR with around 10.0% lower WER, improving the overall ST performance by around +2.5 BLEU. Compared to an in-domain ST baseline (Wang et al., 2020a), our out-of-domain Multi-Decoder with in-domain ASR re-scoring demonstrates the robustness of our approach.

Decomposing Speech Transcripts
We apply our generic framework to another decomposable sequence task, speech recognition, and show the results of various levels of decomposition in Table 5. We show that with phoneme, character, or byte-pair encoding (BPE) sequences as intermediates, the Multi-Decoder presents strong results on both Fisher and CallHome test sets. We also observe that the BPE intermediates perform better than the phoneme/character variants, which could be attributed to the reduced search capabilities of encoder-decoder models using beam search on longer sequences (Sountsov and Sarawagi, 2016), like phoneme/character sequences.

Extending to MuST-C Language Pairs
In addition to our results using the 170 hours of the Spanish-English Fisher-CallHome corpus, in Table 6 we show that our decompositional framework is also effective on larger ST corpora. In particular, we use 400 hours of English-German and 500 hours of English-French ST data from the MuST-C corpus (Di Gangi et al., 2019). Our Multi-Decoder model improves by +2.7 and +1.5 BLEU, in German and French respectively, over end-to-end baselines from prior works that do not use additional training data. We show that ASR re-scoring gives an additional +0.1 and +0.4 BLEU improvement. Details of the MuST-C data preparation and model parameters are given in the Appendix (A.4).

By extending our Multi-Decoder models to this MuST-C study, we show the generalizability of our approach across several dimensions of ST tasks. First, our approach consistently improves over baselines across multiple language pairs. Second, our approach is robust to the distinct domains of telephone conversations from Fisher-CallHome and TED talks from MuST-C. Finally, by scaling from 170 hours of Fisher-CallHome data to 500 hours of MuST-C data, we show that the benefits of decomposing sequence tasks with searchable hidden intermediates persist even with more data. Furthermore, the performance of our Multi-Decoder models trained with only English-German or English-French ST data from MuST-C is comparable to other methods which incorporate larger external ASR and MT data in various ways. For instance, Zheng et al. (2021) use 4700 hours of ASR data and 2M sentences of MT data for pretraining and multi-task learning. Similarly, Bahar et al. (2021) use 2300 hours of ASR data and 27M sentences of MT data for pretraining. Our competitive performance without any additional data highlights the data-efficient nature of our proposed end-to-end framework as opposed to the baseline encoder-decoder model, as pointed out by Sperber and Paulik (2020).

Discussion and Relation to Prior Work
Compositionality: A number of recent works have constructed composable neural network modules for tasks such as visual question answering (Andreas et al., 2016), neural MT (Raunak et al., 2019), and synthetic sequence-to-sequence tasks (Lake, 2019). Modules that are first trained separately can subsequently be tightly integrated into a single end-to-end trainable model by passing differentiable soft decisions instead of discrete decisions in the intermediate stage (Bahar et al., 2021). Further, even a single encoder-decoder model can be decomposed into modular components where the encoder and decoder modules have explicit functions (Dalmia et al., 2019).
Joint Training with Sub-Tasks: End-to-end sequence models have been shown to benefit from joint training with sub-tasks as auxiliary loss functions for a variety of tasks like ASR, ST (Salesky et al., 2019; Liu et al., 2020a; Dong et al., 2020; Le et al., 2020), and SLU (Haghani et al., 2018). Such joint training has been shown to induce structure (Belinkov et al., 2020) and improve model performance (Toshniwal et al., 2017), but it may reduce data efficiency if some sub-nets are not included in the final end-to-end model (Wang et al., 2020c). Our framework avoids this sub-net waste at the cost of additional computational load during inference.
Speech Translation Decoders: Prior works have used ASR/MT decoding to improve the overall ST decoding through synchronous decoding (Liu et al., 2020a), dual decoding (Le et al., 2020), and successive decoding (Dong et al., 2020). These works partially or fully decode ASR transcripts and use discrete intermediates to assist MT decoding. Tu et al. (2017) and Anastasopoulos and Chiang (2018) are closest to our multi-decoder ST model, however the benefits of our proposed framework are not entirely explored in these works.
Two-Pass Decoding: Two-pass decoding involves first predicting with one decoder and then re-evaluating with another decoder (Geng et al., 2018; Sainath et al., 2019; Hu et al., 2020; Rijhwani et al., 2020). The two decoders iterate on the same sequence, so there is no decomposition into sub-tasks in this method. On the other hand, our approach provides the subsequent decoder with a more structured representation than the input by decomposing the complexity of the overall task. Like two-pass decoding, our approach provides a sense of the future to the second decoder, which allows it to correct mistakes made by the first decoder.
Auto-Regressive Decoding: As auto-regressive decoders inherently learn a language model along with the task at hand, they tend to be domain specific (Samarakoon et al., 2018;Müller et al., 2020). This can cause generalizability issues during inference (Murray and Chiang, 2018;Yang et al., 2018), impacting the performance of both the task at hand and any downstream tasks. Our approach alleviates these problems through intermediate search, external models for intermediate re-scoring, and multi-sequence attention.

Conclusion and Future Work
We present searchable hidden intermediates for end-to-end models of decomposable sequence tasks. We show the efficacy of our Multi-Decoder model on the Fisher-CallHome Es→En and MuST-C En→De and En→Fr speech translation corpora, achieving state-of-the-art results. We present various benefits of our framework, including sub-net performance monitoring, beam search for better hidden intermediates, external models for better search, and error propagation avoidance. Further, we demonstrate the flexibility of our framework towards out-of-domain tasks with the ability to adapt our sequence model at intermediate stages of decomposition. Finally, we show generalizability by training Multi-Decoder models for the speech recognition task at various levels of decomposition.
We hope insights derived from our study stimulate research on tighter integrations between the benefits of cascaded and end-to-end sequence models. Exploiting searchable intermediates through beam search is just the tip of the iceberg for search algorithms, as numerous approximate search techniques like diverse beam search (Vijayakumar et al., 2018) and best-first beam search (Meister et al., 2020) have been recently proposed to improve diversity and approximation of the most-likely sequence. Incorporating differentiable lattice based search (Hannun et al., 2020) can also allow the subsequent sub-net to digest n-best representations.

Acknowledgements
This work started while Vikas Raunak was a student at CMU; he is now working as a Research Scientist at Microsoft. We thank Pengcheng Guo, Hirofumi Inaguma, Elizabeth Salesky, Maria Ryskina, Marta Méndez Simón, and Vijay Viswanathan for their helpful discussion during the course of this project. We also thank the anonymous reviewers for their valuable feedback. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (Nystrom et al., 2015), which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The work was supported in part by an AWS Machine Learning Research Award. This research was also supported in part by the DARPA KAIROS program from the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

A.1 Training and Inference hyperparameters
We tune training and inference hyperparameters using only the dev sets. We first determined the best hyperparameters for our baseline Enc-Dec implementation and fixed all settings not pertaining to the unique searchable hidden intermediates of our Multi-Decoder. Then, we found the best hyperparameters for our proposed models under these constraints to demonstrate a true comparison against the baseline.

A.4 MuST-C Data Preparation and Model Parameters

We report detokenized case-sensitive BLEU (Post, 2018) on the tst-COMMON set. We apply the same text processing as Inaguma et al. (2020) and use a joint source and target vocabulary of 8K byte pair encoding (BPE) units (Kudo and Richardson, 2018). Similar to §5, we use the ESPnet library to prepare the corpus, and apply the same data preparation and augmentations.
Multi-Decoder Configuration: For the MuST-C experiments, we scaled our Multi-Decoder w/ Speech-Attention configuration from the Fisher-CallHome experiments by increasing ENCODER_ST to 4 transformer encoder blocks. We increased the attention dimension and attention heads of ENCODER_ASR and DECODER_ASR to 512 and 8 respectively, while only increasing the attention dimension to 512 for ENCODER_ST and DECODER_ST. This increased the total trainable parameters to 135M, which we trained on 4 NVIDIA V100 GPUs for ≈3 days. We also found that increasing the attention dropout of the ASR decoder to 0.2 helped with the increased parameter count. We kept the remaining dropout parameters the same as in our previous experiments, and we also kept the remaining training configurations the same, such as the effective batch size, learning rate and warmup steps, loss weighting, and SpecAugment policy. During inference, we use the same beam sizes as in our Fisher-CallHome experiments and perform a search across the length penalty and max length ratio settings using the MuST-C dev sets.

Qualitative examples (Fisher-CallHome Spanish-English):
(1) Ground-Truth: "puedes ayudar para que se haga justicia más rápido" / "you can help so that justice is served quickly"; Multi-Decoder: "puedes ayudar para que sea justicia más rápido" / "you can help so it's faster"; +Speech-Attention: "puedes ayudar para que sea justicia más rápido" / "you can help so that it's faster justice"
(2) Ground-Truth: "pero tiene muchas cosas muy bonitas" / "but there are many beautiful things"; Multi-Decoder: "pero tienen muchas cosas muy bonitas" / "but they have a lot of nice things"; +Speech-Attention: "pero tienen muchas cosas muy bonitas" / "but there are many very beautiful things"
(3) Ground-Truth: "acampar ir a pescar y ir a las montañas a esquiar" / "camping and fishing and going to the mountains to ski"; Multi-Decoder: "acampar y a pescar y y de las montañas esquiar" / "camping and fishing and and the mountains skiing"; +Speech-Attention: "a campar y ir a pescar y ir a las montañas a esquiar" / "camping and go fishing and go to the mountains to ski"

Table 8: Results presenting the performance of our Baseline Enc-Dec implementation and our Multi-Decoder models as evaluated by three metrics: BLEU, METEOR, and Translation Edit Rate (TER). These are the same models as in Table 1, which uses BLEU. All results are from the Fisher-CallHome Spanish-English test corpus.
In the intermediate ASR beam search we use a length penalty of 0.1 and 0.2 for English-German and English-French respectively. In the ST beam search we use a max length ratio of 0.3 and length penalties of 0.6 and 0.5 for English-German and English-French respectively. For our experiments with ASR re-scoring, we use a LM weight of 0.1 and a CTC weight of 0.1. In these re-scoring experiments we also set the ASR length penalty to 0.6 and the ST length penalty to 0.5, while increasing the ST max length ratio to 0.5. The LMs used were trained on the English transcripts of the MuST-C English-German and English-French corpora, with dev perplexities of 32.7 and 23.2 respectively.