Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of this architecture, corresponding to two different levels of dependency between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously reported highest translation performance in the multilingual setting, as well as bilingual one-to-one results. Furthermore, our parallel models demonstrate no trade-off between ASR and ST compared to the vanilla multi-task architecture. Our code and pre-trained models are available at https://github.com/formiel/speech-translation.


Introduction
While cascade speech-to-text translation (ST) systems operate in two steps, source language automatic speech recognition (ASR) followed by source-to-target text machine translation (MT), recent works have attempted to build end-to-end ST without using the source language transcription during decoding (Bérard et al., 2016; Weiss et al., 2017; Bérard et al., 2018). After two years of extensions to these pioneering works, the latest results of the IWSLT 2020 shared task on offline speech translation (Ansari et al., 2020) demonstrate that end-to-end models are now on par with (if not better than) their cascade counterparts. Such a finding motivates even more strongly the work on multilingual (one-to-many, many-to-one, many-to-many) ST (Inaguma et al., 2019; Wang et al., 2020a), for which end-to-end models are well adapted by design. Moreover, these two approaches represent two extreme design choices: cascade systems offer only a very loose integration of ASR and MT (even if lattices or word confusion networks were used between ASR and MT before end-to-end models appeared), while most end-to-end approaches simply ignore the ASR subtask and try to translate directly from source speech to target text. We believe that a tighter coupling of ASR and MT is desirable for future end-to-end ST applications, in which displaying transcripts alongside translations can be beneficial to users (Sperber et al., 2020).

This paper addresses multilingual ST and investigates more closely the interactions between speech transcription (ASR) and speech translation (ST) in a multilingual end-to-end architecture based on the Transformer. While these interactions were previously investigated in a simple multi-task framework for the bilingual case (Anastasopoulos and Chiang, 2018), we propose a dual-decoder model in which an ASR decoder is tightly coupled with an ST decoder, and evaluate its effectiveness on one-to-many ST. Our model is inspired by Liu et al. (2020), but the interaction between the ASR and ST decoders is much tighter. Experiments show that our model outperforms theirs on the MuST-C benchmark (Di Gangi et al., 2019).

Our contributions are summarized as follows: (1) a new model architecture for joint ASR and multilingual ST; (2) an integrated beam search decoding strategy which jointly transcribes and translates, and which is extended to a wait-k strategy where the ASR hypothesis is ahead of the ST hypothesis by k tokens, or vice versa; and (3) competitive performance on the MuST-C dataset in both bilingual and multilingual settings, improving over previous joint ASR/ST work.

Related Work
Multilingual ST Multilingual translation (Johnson et al., 2016) consists in translating between different language pairs with a single model, thereby improving maintainability and the quality of low-resource languages. Earlier work adapts this method to one-to-many multilingual speech translation by adding a language embedding to each source feature vector, and observes that using the source language (English) as one of the target languages improves performance. Inaguma et al. (2019) simplify this approach by prepending a target-language token to the decoder input and apply it to one-to-many and many-to-many speech translation. They do not investigate many-to-one due to the lack of a large corpus for this setting. To fill this void, Wang et al. (2020a) release the CoVoST dataset for ST from 11 languages into English and demonstrate the effectiveness of many-to-one ST.
Joint ASR and ST Joint ASR and ST decoding was first proposed by Anastasopoulos and Chiang (2018) through a multi-task learning framework. Chuang et al. (2020) improve multi-task ST by using word embeddings as an intermediate level instead of text. A two-stage model that first performs ASR and then passes the decoder states as input to a second ST model has also been studied (Anastasopoulos and Chiang, 2018; Sperber et al., 2019). This architecture is closer to cascaded translation while maintaining end-to-end trainability. Sperber et al. (2020) introduce the notion of consistency between transcripts and translations and propose metrics to gauge it. They evaluate different model types for the joint ASR and ST task and conclude that end-to-end models with a coupled inference procedure are able to achieve strong consistency. In addition to existing models with coupled architectures, they also investigate a model where the transcripts are concatenated to the translations and a shared encoder-decoder network learns to predict this concatenated output. Our models have lower latency than this approach, since concatenating the outputs makes the two tasks sequential in nature. Our work is most closely related to that of Liu et al. (2020), who propose an interactive attention mechanism that enables ASR and ST to be performed synchronously: both decoders rely not only on their own previous outputs but also on the outputs predicted for the other task. We highlight three differences between their work and ours: (a) we propose a more general framework of which their model is a special case; (b) our work proposes a tighter integration of ASR and ST; and (c) we experiment in a multilingual ST setting, while previous work on joint ASR and ST only investigated bilingual ST.

Dual-decoder Transformer for Joint ASR and Multilingual ST
We now present the proposed dual-decoder Transformer for jointly performing ASR and multilingual ST. Our models are based on the Transformer architecture (Vaswani et al., 2017) but consist of two decoders. Each decoder is responsible for one task (ASR or ST). The intuition is that the problem at hand consists in solving two different tasks with different characteristics and different levels of difficulty (multilingual ST is considered more difficult than ASR). Having different decoders specialized in different tasks may thus produce better results. In addition, since these two tasks can be complementary, it is natural to allow the decoders to help each other. Therefore, in our models, we introduce a dual-attention mechanism: in addition to attending to the encoder, the decoders also attend to each other.

Model overview
The model takes as input a sequence of speech features $x = (x_1, x_2, \ldots, x_{T_x})$ in a specific source language (e.g. English) and outputs a transcription $y = (y_0, y_1, \ldots, y_{T_y})$ in the same language, as well as translations $z^1, z^2, \ldots, z^M$ in $M$ different target languages (e.g. French, Spanish, etc.). When $M = 1$, this corresponds to joint ASR and bilingual ST. For simplicity, our presentation considers only a single target language with output $z = (z_0, z_1, \ldots, z_{T_z})$. All results, however, apply to the general multilingual case. In the sequel, we denote $y_{<t} = (y_0, y_1, \ldots, y_{t-1})$ and $y_{>t} = (y_{t+1}, y_{t+2}, \ldots, y_{T_y})$ ($y_t$ is included if $<$ and $>$ are replaced by $\leq$ and $\geq$, respectively). In addition, we assume that $y_t$ is ignored if $t$ is outside the interval $[0, T_y]$. The same notations apply to $z$.
The dual-decoder model jointly predicts the transcript and translation in an autoregressive fashion, modeling the joint distribution $p(y, z \mid x)$ as a product of per-step conditionals (see the sketch below). A natural model would consist of a single decoder followed by a softmax layer. However, even if the capacity of the decoder were large enough to handle both ASR and ST generation, a single softmax would require a very large joint vocabulary (of size $V_y V_z$, where $V_y$ and $V_z$ are respectively the vocabulary sizes for $y$ and $z$). Instead, our dual-decoder consists of two sub-decoders that specialize in producing outputs tailored to the ASR and ST tasks separately.

Formally, given a pair of previous outputs $(y_{<s}, z_{<t})$, our model predicts the next output tokens $(\hat{y}_s, \hat{z}_t)$ (where $1 \leq s \leq T_y$ and $1 \leq t \leq T_z$) from the probability distributions $p(y_s \mid y_{<s}, z_{<t}, x)$ and $p(z_t \mid y_{<s}, z_{<t}, x)$, each obtained by applying a softmax over the corresponding output vocabulary to the decoder hidden representations $h^y_s$ and $h^z_t$. Note that $y_s$ and $z_t$ are token indices ($1 \leq y_s \leq V_y$, $1 \leq z_t \leq V_z$). Here we have made an important assumption about the joint probability $p(y_s, z_t \mid \cdot)$: it factorizes into $p(y_s \mid \cdot)\, p(z_t \mid \cdot)$. The joint distribution encoded by the dual-decoder Transformer can therefore be rewritten as a product of these two factors at each step. We have also assumed so far that the two sub-decoders start at the same time, which is the most basic configuration. In practice, however, one may allow one sequence to advance $k$ steps ahead of the other, known as the wait-$k$ policy (Ma et al., 2019). For example, if ST waits for ASR to produce its first $k$ tokens, the joint distribution changes accordingly (see the sketch below).

In the next section, we propose two concrete architectures for the dual-decoder, corresponding to different levels of dependency between the two sub-decoders (ASR and ST). We then show that several known models in the literature are special cases of these architectures.
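For concreteness, the joint distribution, the factorization assumption, and the wait-$k$ variant can be written out as follows. This is a sketch consistent with the description above; in particular, the exact conditioning in the wait-$k$ case is our assumption rather than a verbatim reproduction of the original equations.

```latex
% Joint autoregressive distribution, with T = \max(T_y, T_z):
\[
  p(y, z \mid x) \;=\; \prod_{t=0}^{T} p(y_t, z_t \mid y_{<t}, z_{<t}, x)
\]
% Factorization assumed by the dual-decoder (one softmax per task):
\[
  p(y_t, z_t \mid y_{<t}, z_{<t}, x) \;=\; p(y_t \mid y_{<t}, z_{<t}, x)\; p(z_t \mid y_{<t}, z_{<t}, x)
\]
% Wait-k policy in which ST starts k steps after ASR (assumed form);
% out-of-range tokens are ignored, following the convention of Section 3.1:
\[
  p(y, z \mid x) \;=\; \prod_{t=0}^{k-1} p(y_t \mid y_{<t}, x)
  \;\prod_{t=k}^{T+k} p(y_t \mid y_{<t}, z_{<t-k}, x)\; p(z_{t-k} \mid y_{<t}, z_{<t-k}, x)
\]
```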

Parallel and cross dual-decoder Transformers
The first architecture is called the parallel dual-decoder Transformer, which has the highest level of dependency: one decoder uses the hidden states of the other to compute its outputs, as illustrated in Figure 1a. The encoder consists of an input embedding layer followed by a positional embedding and a number of self-attention and feed-forward network (FFN) layers whose inputs are normalized (Ba et al., 2016). This is almost the same as the encoder of the original Transformer (Vaswani et al., 2017) (we refer to the corresponding paper for further details), except that the embedding layer in our encoder is a small convolutional neural network (CNN) (Fukushima and Miyake, 1982; LeCun et al., 1989) of two layers with ReLU activations and a stride of 2, thus reducing the input length by a factor of 4.

The parallel dual-decoder consists of: (a) two decoders that closely follow the common Transformer decoder structure, and (b) four additional multi-head attention layers (called dual-attention layers). Each dual-attention layer is complementary to a corresponding main attention layer. Recall that an attention layer receives as inputs a query $Q \in \mathbb{R}^{d_k}$, a key $K \in \mathbb{R}^{d_k}$, and a value $V \in \mathbb{R}^{d_v}$, and outputs $\mathrm{Attention}(Q, K, V)$. A dual-attention layer receives $Q$ from the main branch and $K$, $V$ from the other decoder at the same level (i.e. at the same depth in the Transformer architecture) to compute hidden representations that are merged back into the main branch (i.e. one decoder attends to the other in parallel); a minimal sketch is given below. We present this merging operation in more detail in Section 3.4.
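To make the dual-attention wiring concrete, here is a minimal PyTorch-style sketch of one attention slot in the parallel dual-decoder: the dual-attention reuses the main branch's query but takes keys and values from the other decoder's hidden states at the same depth, and its output is merged back into the main branch. The class and parameter names (DualAttentionBlock, merge, lmbda) are illustrative, the merge formulas are assumptions, and attention masks are omitted; this is not the released implementation.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Sketch of a main attention layer paired with its dual-attention layer.

    The query comes from the current decoder (main branch); the dual-attention
    reuses that query but takes keys/values from the other decoder's hidden
    states at the same depth (parallel variant).
    """

    def __init__(self, d_model: int, n_heads: int, merge: str = "sum"):
        super().__init__()
        self.main_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = merge
        self.lmbda = nn.Parameter(torch.tensor(0.3))         # learnable sum weight
        self.concat_proj = nn.Linear(2 * d_model, d_model)   # for the concat merge
        self.norm = nn.LayerNorm(d_model)                    # normalize the dual-attention input

    def forward(self, query, main_kv, other_hidden):
        # Main branch: self-attention (main_kv = query) or source-attention
        # (main_kv = encoder output), depending on where the block sits.
        h_main, _ = self.main_attn(query, main_kv, main_kv)
        # Dual branch: attend to the other decoder's hidden states at the same depth.
        other = self.norm(other_hidden)
        h_dual, _ = self.dual_attn(query, other, other)
        # Merge the dual-attention output back into the main branch.
        if self.merge == "sum":
            return h_main + self.lmbda * h_dual
        return self.concat_proj(torch.cat([h_main, h_dual], dim=-1))
```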
Our second proposed architecture is called the cross dual-decoder Transformer. It is similar to the previous one, except that the dual-attention layers now receive $K$, $V$ from the outputs of the other decoder at previous decoding steps, as illustrated in Figure 1c. Thanks to this design, each prediction step can be performed separately on the two decoders, and the hidden representations $h^y_s$ and $h^z_t$ produced by the decoders decompose into two independent computations, sketched below.
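A plausible form of this decomposition, assuming each decoder attends to the other's already-computed hidden states (denoted with a tilde) from previous steps; the operator names are illustrative:

```latex
\[
  h^y_s \;=\; \mathrm{Decoder}_{\mathrm{asr}}\big(y_{<s},\; \tilde{h}^z_{<t},\; \mathrm{Encoder}(x)\big),
  \qquad
  h^z_t \;=\; \mathrm{Decoder}_{\mathrm{st}}\big(z_{<t},\; \tilde{h}^y_{<s},\; \mathrm{Encoder}(x)\big)
\]
```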

Special cases
In this section, we present some special cases of our dual-decoder architecture and discuss their links to existing models in the literature.

Independent decoders
When there is no dual-attention, the two decoders become independent. In this case, the joint prediction probability factorizes simply as $p(y_s, z_t \mid y_{<s}, z_{<t}, x) = p(y_s \mid y_{<s}, x)\, p(z_t \mid z_{<t}, x)$. All prediction steps are therefore separable, making this the most computationally efficient model. In the literature, this model is often referred to as multi-task (Anastasopoulos and Chiang, 2018; Sperber et al., 2020).
Chained decoders Another special case corresponds to the extreme wait-$k$ policy, in which one decoder waits for the other to finish completely before starting its own decoding. For example, if ST waits for ASR, the joint prediction probability reads $p(y_s, z_t \mid y_{<s}, z_{<t}, x) = p(y_s \mid y_{<s}, x)\, p(z_t \mid z_{<t}, y, x)$. This model is called triangle in previous work (Anastasopoulos and Chiang, 2018; Sperber et al., 2020). A further special case arises when the second decoder in the chain is not directly connected to the encoder, also referred to as two-stage (Sperber et al., 2019; Sperber et al., 2020). The two-stage model has also been called cascade by Anastasopoulos and Chiang (2018); we avoid this term to prevent confusion with common cascade models, which are typically not trained end-to-end, whereas our chained decoders (both triangle and two-stage) are end-to-end.
To summarize the different cases, the joint probability distributions encoded by the presented models, in decreasing order of dependency, are sketched below, where $T = \max(T_y, T_z)$. A similar formalization for the wait-$k$ policy can be obtained in a straightforward manner. Note that for independent decoders, the distribution is the same with or without wait-$k$.
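Written out under the factorizations described above (dual-decoder, independent, and chained with ST waiting for ASR), these distributions take the following form; this is a sketch consistent with the preceding text rather than an exact reproduction of the original display.

```latex
% Parallel / cross dual-decoder, with T = \max(T_y, T_z):
\[
  p(y, z \mid x) \;=\; \prod_{t=0}^{T} p(y_t \mid y_{<t}, z_{<t}, x)\; p(z_t \mid y_{<t}, z_{<t}, x)
\]
% Independent decoders (plain multi-task):
\[
  p(y, z \mid x) \;=\; \prod_{t=0}^{T} p(y_t \mid y_{<t}, x)\; p(z_t \mid z_{<t}, x)
\]
% Chained decoders (ST waits for the full ASR output; triangle / two-stage):
\[
  p(y, z \mid x) \;=\; \prod_{s=0}^{T_y} p(y_s \mid y_{<s}, x) \;\prod_{t=0}^{T_z} p(z_t \mid z_{<t}, y, x)
\]
```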

Variants
The previous section presents special cases of our formulation at a high level. In this section, we introduce different fine-grained variants of the dual-decoder Transformers used in the experiments (Section 5).
Asymmetric dual-decoder Instead of using all the dual-attention layers, one may want to allow only a one-way attention: either ASR attends to ST or the reverse, but not both.
At-self or at-source dual-attention In each decoder block, there are two different attention layers, which we respectively call self-attention (bottom) and source-attention (top); the latter is often referred to as cross-attention in the literature, but we use a different name to avoid confusion with the cross dual-decoder. For each of these, there is an associated dual-attention layer, named respectively dual-attention at self and dual-attention at source. In the experiments, we study the cases where only the at-self or only the at-source dual-attention layers are retained.

Merging operators
The Merge layers shown in Figure 1 combine the outputs of the main attention $H_{\mathrm{main}}$ and the dual-attention $H_{\mathrm{dual}}$. We experimented with two different merging operators: a weighted sum and a concatenation (a formal sketch of both is given below).
For the sum operator in particular, we experiment with both learnable and fixed λ.
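One plausible way to write the two operators, consistent with the description above (a scalar weight λ for the sum, and a linear projection back to the model dimension after concatenation); the exact parameterization is our assumption:

```latex
% Weighted-sum merge (lambda either learnable or fixed):
\[
  \mathrm{Merge}_{\mathrm{sum}}(H_{\mathrm{main}}, H_{\mathrm{dual}})
  \;=\; H_{\mathrm{main}} + \lambda\, H_{\mathrm{dual}}
\]
% Concatenation merge, projected back to the model dimension:
\[
  \mathrm{Merge}_{\mathrm{concat}}(H_{\mathrm{main}}, H_{\mathrm{dual}})
  \;=\; W\,[\,H_{\mathrm{main}}\,;\, H_{\mathrm{dual}}\,] + b
\]
```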
Remark. The model proposed by Liu et al. (2020) is a special case of our cross dual-decoder Transformer with no dual-attention at source, no layer normalization for the input embeddings (Figure 1c), and sum merging with a fixed λ.

Training
The objective, $\mathcal{L}(\hat{y}, \hat{z}, y, z) = \alpha \mathcal{L}_{\mathrm{asr}}(\hat{y}, y) + (1 - \alpha) \mathcal{L}_{\mathrm{st}}(\hat{z}, z)$, is a weighted sum of the cross-entropy ASR and ST losses, where $(\hat{y}, \hat{z})$ and $(y, z)$ denote the predictions and the ground truths for (ASR, ST), respectively. The weight $\alpha$ is set to 0.3 in all experiments. Here we favor the ST task, based on the intuition that it is more difficult to train than ASR, simply because of its multilinguality. A hyperparameter search may further improve the results. We also employ label smoothing (Szegedy et al., 2016) with $\epsilon = 0.1$. For each language pair, the training data is sorted by the number of frames, and each mini-batch contains all languages such that their numbers of frames are roughly the same. We follow Inaguma et al. (2019) and prepend a language-specific token to the target sentence. Preliminary experiments showed that this approach was more effective than adding a target-language embedding along the temporal dimension of the speech feature inputs (Di Gangi et al., 2019).
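As a minimal sketch of this objective, assuming standard PyTorch cross-entropy with label smoothing and a padding index (rather than the exact ESPnet-ST implementation):

```python
import torch.nn.functional as F

def joint_loss(asr_logits, st_logits, asr_targets, st_targets,
               alpha: float = 0.3, label_smoothing: float = 0.1, pad_id: int = 0):
    """Weighted joint objective L = alpha * L_asr + (1 - alpha) * L_st.

    Logits have shape (batch, length, vocab) and targets (batch, length).
    The padding index and the use of F.cross_entropy's label_smoothing
    argument are assumptions, not the exact ESPnet-ST implementation.
    """
    l_asr = F.cross_entropy(asr_logits.transpose(1, 2), asr_targets,
                            ignore_index=pad_id, label_smoothing=label_smoothing)
    l_st = F.cross_entropy(st_logits.transpose(1, 2), st_targets,
                           ignore_index=pad_id, label_smoothing=label_smoothing)
    return alpha * l_asr + (1.0 - alpha) * l_st
```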

Decoding
We present the beam search strategy used by our model. Since there are two different outputs (ASR and ST), one might naturally think of two different beams (with possibly some interaction between them). However, we found that a single joint beam works best for our model. In this beam search strategy, each hypothesis consists of a tuple of ASR and ST sub-hypotheses. The two sub-hypotheses are expanded together, and the score is computed as the sum of the log probabilities of the output token pairs. For a beam size $B$, the $B$ best hypotheses are retained based on this score. In this setup, both sub-hypotheses evolve jointly, which resembles the training process more closely than two separate beams would. A limitation of this joint-beam strategy is that, in extreme cases, one of the tasks (ASR or ST) may be left with only a single hypothesis. Indeed, at a decoding step $t + 1$, we take the $B$ best predictions $(\hat{y}_t, \hat{z}_t)$ in terms of their summed scores $s(y_t, z_t) = \log p(y_t \mid y_{<t}, z_{<t}, x) + \log p(z_t \mid y_{<t}, z_{<t}, x)$; it can happen that, e.g., some $\hat{y}_t$ has such a dominant score that it is selected for all the hypotheses, i.e. the $B$ (different) hypotheses share a single $\hat{y}_t$ and have $B$ different $\hat{z}_t$. We leave the design of a joint-beam strategy with enforced diversity to future work. Finally, to produce translations for multiple target languages, it suffices to feed different language-specific tokens to the dual-decoder at decoding time. A minimal sketch of the joint beam search is given below.
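The following is an illustrative sketch of the single joint beam. `model.step` is an assumed interface returning next-token log-probabilities for both decoders; the length penalty, wait-k shifting, and proper handling of finished hypotheses are omitted for brevity.

```python
import torch

def joint_beam_search(model, x, beam_size: int = 10, max_len: int = 200,
                      bos: int = 1, eos: int = 2):
    """Minimal single-joint-beam sketch for the dual-decoder.

    `model.step(x, y_hyp, z_hyp)` is an assumed interface returning two 1-D
    tensors of log-probabilities (over the ASR and ST vocabularies) for the
    next token pair.
    """
    beams = [([bos], [bos], 0.0)]  # (ASR hypothesis, ST hypothesis, joint score)
    for _ in range(max_len):
        candidates = []
        for y_hyp, z_hyp, score in beams:
            logp_y, logp_z = model.step(x, y_hyp, z_hyp)
            top_y = torch.topk(logp_y, beam_size)
            top_z = torch.topk(logp_z, beam_size)
            # Expand ASR and ST together: the score of a pair is the sum of
            # the two token log-probabilities.
            for sy, ty in zip(top_y.values, top_y.indices):
                for sz, tz in zip(top_z.values, top_z.indices):
                    candidates.append((y_hyp + [ty.item()],
                                       z_hyp + [tz.item()],
                                       score + sy.item() + sz.item()))
        # Keep the B best *joint* hypotheses (a single shared beam).
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_size]
        if all(h[0][-1] == eos and h[1][-1] == eos for h in beams):
            break
    return beams[0]
```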

Dataset
To build a one-to-many model that can jointly transcribe and translate, we use MuST-C (Di Gangi et al., 2019), which is currently the largest publicly available one-to-many speech translation dataset (recently, a very large many-to-many dataset called CoVoST-2 (Wang et al., 2020b) has been released, while its predecessor CoVoST (Wang et al., 2020a) only covers the many-to-one scenario). MuST-C covers language pairs from English to eight different target languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish. Each language direction includes triplets of source input speech, source transcription, and target translation, with sizes ranging from 385 hours (Portuguese) to 504 hours (Spanish). We refer to the original paper for more details.

Implementation details
Our implementation is based on the ESPnet-ST toolkit (Inaguma et al., 2020), available at https://github.com/espnet/espnet. In the following, we provide details for reproducing the results. The pipeline is identical for all experiments.

Models All experiments use the same encoder architecture with 12 layers. The decoder has 6 layers, except for the independent-decoder model, for which we also include an 8-layer version (independent++) to compare the effect of dual-attention against simply increasing the number of model parameters.
Text pre-processing Transcriptions and translations were normalized and tokenized using the Moses tokenizer (Koehn et al., 2007). Transcriptions were lower-cased and punctuation was stripped. A joint BPE model (Sennrich et al., 2016) with 8000 merge operations was learned on the concatenation of the English transcriptions and all target languages. We also experimented with two separate dictionaries (one for English and another for all target languages), but found the results to be worse.
Speech features We used Kaldi (Povey et al., 2011) to extract 83-dimensional features (80-channel log Mel filter-bank coefficients and 3-dimensional pitch features), normalized by the mean and standard deviation computed on the training set. Following common practice (Inaguma et al., 2019; Wang et al., 2020c), utterances with more than 3000 frames or more than 400 characters were removed. For data augmentation, we used speed perturbation (Ko et al., 2015) with factors of 0.9, 1.0, and 1.1, and SpecAugment (Park et al., 2019) with three types of deterioration: time warping (W), time masking (T), and frequency masking (F), where W = 5, T = 40, and F = 30.
Optimization Following standard practice for training Transformers, we used the Adam optimizer (Kingma and Ba, 2015) with the Noam learning rate schedule (Vaswani et al., 2017), in which the learning rate is linearly increased for the first 25K warm-up steps and then decreased proportionally to the inverse square root of the step counter. We set the initial learning rate to 1e−3 and the Adam parameters to β1 = 0.9, β2 = 0.98, ε = 1e−9. We used a batch size of 32 sentences per GPU, with gradient accumulation over 2 training steps. All models were trained on a single machine with 8 GPUs of 32GB each for 250K steps, unless otherwise specified. For model initialization, we trained an independent-decoder model whose two decoders share weights for 150K steps and used its weights to initialize the other models, which resulted in much faster convergence for all models. We also included this shared model in the experiments and, for a fair comparison, trained it for an additional 250K steps. Finally, for decoding, we used a beam size of 10 with a length penalty of 0.5.
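For reference, a small sketch of the Noam-style schedule described above (linear warm-up followed by inverse-square-root decay); the model dimension and the rescaling so that the peak rate equals 1e−3 are assumptions for illustration, not the exact ESPnet-ST code.

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 25000,
            base_lr: float = 1e-3) -> float:
    """Noam-style schedule: linear warm-up for `warmup` steps, then decay
    proportional to the inverse square root of the step counter.
    """
    step = max(step, 1)
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    peak = d_model ** -0.5 * warmup ** -0.5  # value of `scale` at step == warmup
    return base_lr * scale / peak
```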

Results and analysis
In this section, we report detokenized case-sensitive BLEU (Papineni et al., 2002) on the MuST-C dev sets (Table 1). Results on the test sets are discussed in Section 5.4. Following previous work (Inaguma et al., 2020), we remove non-verbal tokens in the evaluation. In Table 1, there are 3 main groups of models, corresponding to the independent-decoder, cross dual-decoder (crx), and parallel dual-decoder (par) architectures, respectively. In particular, independent++ corresponds to an 8-decoder-layer model and serves as our strongest baseline for comparison. Figure 2 shows the relative performance of some representative models with respect to this baseline, together with their validation accuracies. In the following, when comparing models, we implicitly mean "on average" (over the 8 languages), unless otherwise specified.
Parallel vs. cross Under the same configurations, the parallel models outperform their cross counterparts in terms of translation (line 5 vs. line 13, line 6 vs. line 14, and line 7 vs. line 16), showing an improvement of 0.7 BLEU on average. In terms of recognition, however, the parallel architecture has on average a 0.33% higher (worse) WER than the cross models. On the other hand, parallel dual-decoders perform better than independent decoders on both the translation and recognition tasks, except for the asymmetric case (line 12), the at-self and at-source with sum merging configuration (line 17), and the wait-k model where ST is ahead of ASR (line 19). This shows that both tasks can benefit from the tight interaction between the two decoders, i.e. the parallel models achieve no trade-off between BLEU and WER compared to the independent architecture. This is not the case, however, for the cross dual-decoders, which feature weaker interaction than the parallel ones.

Table 1: BLEU and (average) WER on the MuST-C dev set. In the second column (type), crx and parallel denote the cross and parallel dual-decoder, respectively. In the third column (side), st means only ST attends to ASR. Line 1 corresponds to the independent-decoder model where the weights of the two decoders are shared, and independent++ corresponds to the model with 8 decoder layers (instead of 6). Line 10 corresponds to the model proposed by Liu et al. (2020). Values that are better than the baseline (independent++) are underlined and colored in blue, while the best values are highlighted in bold. Legend: no normalization for the dual-attention input; † sum merging with λ = 0.3 fixed; R3: ASR is 3 steps ahead of ST; T3: ST is 3 steps ahead of ASR.
Interestingly, there is a slight trade-off between the parallel and cross designs: the parallel models are better in terms of BLEU but worse in terms of WER. This is to some extent similar to previous work, where models exhibit different types of trade-offs between BLEU and WER (He et al., 2011; Sperber et al., 2020; Chuang et al., 2020). It should be emphasized that most of the dual-decoder models have fewer parameters than, or the same number as, independent++. This confirms our intuition that the tight connection between the two decoders in the parallel architecture improves performance. The cross dual-decoders perform relatively well compared to the baseline of two independent decoders with the same number of layers (6), but not as well compared to the stronger baseline with 8 layers.
Symmetric vs. asymmetric In some experiments, we only allow the ST decoder to attend to the ASR decoder. For the cross dual-decoder, this did not yield a noticeable improvement in terms of BLEU (21.72 at line 4 vs. 21.71 at line 6), while for the parallel architecture the results are worse (21.93 at line 12 vs. 22.70 at line 16). The symmetric models also outperform their asymmetric counterparts in terms of WER (12.7 at line 4 vs. 12.2 at line 6, and 13.0 at line 12 vs. 12.7 at line 16). This confirms again that the two tasks are complementary and can help each other: removing the ASR-to-ST attention hurts performance. In fact, examining the learnable λ in the sum merging operator shows that the decoders learn to attend to each other, though at different rates. We observed that, at the same layer depth, the ST decoder always attends more to the ASR one, and that for both decoders λ increases with the depth of the layer.
At-self dual-attention vs. at-source dual-attention For the parallel dual-decoder, the at-source dual-attention produces better results than the at-self counterpart (BLEU: 22.54 vs. 22.26, WER: 12.7 vs. 12.8 at line 14 vs. line 15), while the combination of both does not improve the results (BLEU 22.16, WER 12.8 at line 17). For the concat merging, using both yields better results in terms of translation but slightly hurts the recognition task (BLEU: 22.70 vs. 22.32, WER: 12.7 vs. 12.5 at line 16 vs. line 13).
Sum vs. concat merging The impact of the merging operators is not consistent across different models. Focusing on the parallel dual-decoder, sum is better for models with only at-source attention (line 13 vs. line 14), while concat is better for models using both at-self and at-source attention (line 16 vs. line 17).

Figure 2: Relative BLEU and validation accuracy of representative models from Table 1. The baseline used for relative BLEU is independent++. One can observe that the parallel models consistently outperform the others in terms of validation accuracy. Best viewed in color.
Input normalization and learnable sum Some experiments confirm the importance of normalizing the input fed to the dual-attention layers (i.e. the LayerNorm layers shown in Figure 1c). The results show that this normalization substantially improves performance (BLEU: 22.17 vs. 21.34, WER: 12.1 vs. 12.3 at line 8 vs. line 11). It is also beneficial to use a learnable weight rather than a fixed value for the sum merging operator.

Wait-k policy We compare a non-wait-k parallel dual-decoder (line 14) with its wait-k (k = 3) counterparts. One can observe that letting ASR be ahead of ST (line 18) improves performance (BLEU: 22.78 vs. 22.54, WER: 12.6 vs. 12.7), while letting ST be ahead of ASR (line 19) considerably worsens the results (BLEU: 21.85, WER: 13.6). This confirms our intuition that the ST task is more difficult and should not take the lead in the dual-decoder models.
ASR results From the results (last column of Table 1), one can observe that the dual-decoder models outperform the baseline independent++, except for the asymmetric case and the wait-k model where ST is 3 steps ahead of ASR. While using a single decoder leads to an average WER of 14.2%, all other symmetric architectures with two decoders (except the ASR-waits-for-ST one) achieve better and rather stable WERs (from 12.1% to 13.0%). Detailed results for each data subset are provided in the Appendix.

Comparison to state-of-the-art
To avoid a hyper-parameter search over the test set, we select only three of our best models, together with the baseline independent++, for evaluation. All three models are symmetric parallel dual-decoders: the first has at-source dual-attention with sum merging, the second has both at-self and at-source dual-attention with concat merging, and the last is a wait-k model in which ASR is 3 steps ahead of ST. These models correspond to lines 5, 6, and 7 of Table 2 and will be referred to as par++, par, and par R3, respectively. For par++ we increase the number of decoder layers from 6 to 8, thus increasing the number of parameters from 48M to 51.2M, matching that of the baseline. We do not do this for par R3 (48M), as this model already has a higher latency due to the wait-k policy. All models are trained for 550K steps, corresponding to 25 epochs. Following Inaguma et al. (2020), we use the average of the five checkpoints with the best validation accuracies on the dev sets for evaluation.
We compare our results with previous work in the multilingual setting. In addition, to demonstrate the competitive performance of our models, we also include the best existing translation results on MuST-C (Inaguma et al., 2020), although these were obtained with bilingual systems and a sophisticated training recipe. Indeed, to obtain the results for each language pair (e.g. en-de), Inaguma et al. (2020) pre-trained an ASR model and an MT model to initialize the weights of (respectively) the encoder and the decoder for ST training. This means that to obtain results for the 8 language pairs, 24 independent trainings had to be performed in total (3 for each language pair).
The results in Table 2 show that our models achieve very competitive performance compared to the bilingual one-to-one models of Inaguma et al. (2020), despite being trained for only half the number of epochs. In particular, the par++ model achieves the best results, consistently surpassing the others for all languages (except Russian, where it is outperformed by the bilingual model). Our results also surpass the previously published multilingual results by a large margin. We observe the largest improvements on Portuguese (+1.94 at line 5, +1.50 at line 6, and +1.47 at line 7, compared to the bilingual result at line 1), which has the least data among the 8 language pairs in MuST-C. This phenomenon is also common in multilingual neural machine translation, where multilingual joint training has been shown to improve performance on low-resource languages (Johnson et al., 2017).

Conclusion
We introduced a novel dual-decoder Transformer architecture for synchronous speech recognition and multilingual speech translation. Through a dual-attention mechanism, the decoders in this model are able to specialize in their own tasks while helping each other. The proposed model also generalizes previously proposed approaches that use two independent (or weakly tied) decoders or that chain ASR and ST. It is also flexible enough to accommodate settings where ASR is ahead of ST, which makes it promising for (one-to-many) simultaneous speech translation. Experiments on the MuST-C dataset showed that our model achieves very competitive performance compared to the state of the art.

Figure 3: The cross dual-decoder Transformer. Unlike the parallel dual-decoder Transformer, here one decoder attends to the previous decoding-step outputs of the other and there is no interaction between their hidden states.

Table 3: Word error rate on the MuST-C dev set. Values that are better than the baseline (independent++) are underlined and colored in blue.