Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation

Speech translation (ST) aims to learn transformations from speech in the source language to the text in the target language. Previous works show that multitask learning improves the ST performance, in which the recognition decoder generates the text of the source language, and the translation decoder obtains the final translations based on the output of the recognition decoder. Because whether the output of the recognition decoder has the correct semantics is more critical than its accuracy, we propose to improve the multitask ST model by utilizing word embedding as the intermediate.


Introduction
Speech translation (ST) has recently received increasing attention from the machine translation (MT) community. To learn the transformation between speech in the source language and text in the target language, conventional systems pipeline an automatic speech recognition (ASR) model and a text-to-text MT model (Bérard et al., 2016). However, such pipeline systems suffer from error propagation.
Previous works show that deep end-to-end models can outperform conventional pipeline systems with sufficient training data (Weiss et al., 2017; Inaguma et al., 2019; Sperber et al., 2019). Nevertheless, well-annotated bilingual data is expensive and hard to collect (Bansal et al., 2018a,b; Duong et al., 2016). Multitask learning plays an essential role in leveraging a large amount of monolingual data to improve representation in ST. Multitask ST models have two jointly learned decoding parts, namely the recognition part and the translation part. The recognition part first decodes the speech of the source language into the text of the source language, and then, based on the output of the recognition part, the translation part generates the text in the target language. Variants of multitask models have been explored (Anastasopoulos and Chiang, 2018) and show improvement in low-resource scenarios.
Although applying the text of the source language as the intermediate information in multitask end-to-end ST has empirically yielded improvement, we question whether this is the optimal solution. Even if the recognition part does not correctly transcribe the input speech into text, the final translation result can still be correct as long as the output of the recognition part preserves sufficient semantic information for translation. Therefore, we explore leveraging word embedding as the intermediate level instead of text.
In this paper, we apply pre-trained word embedding as the intermediate level in the multitask ST model. We propose to constrain the hidden states of the decoder of the recognition part to be close to the pre-trained word embedding. Prior works on word embedding regression show improved results on MT (Jauregi Unanue et al., 2019; Kumar and Tsvetkov, 2018). Experimental results show that the proposed approach improves the ST model. Further analysis also shows that the constrained hidden states are approximately isospectral to the word embedding space, indicating that the decoder achieves speech-to-semantic mappings.

Multitask End-to-End ST model
Our method is based on multitask learning for ST (Anastasopoulos and Chiang, 2018), which includes speech recognition in the source language and translation in the target language, as shown in Fig. 1(a). The input audio feature sequence is first encoded into the encoder hidden state sequence $h = h_1, h_2, \ldots, h_T$ with length $T$ by the pyramid encoder (Chan et al., 2015). To perform speech recognition in the source language, the attention mechanism and a decoder are employed to produce the source decoder state sequence $\hat{s} = \hat{s}_1, \hat{s}_2, \ldots, \hat{s}_M$, where $M$ is the number of decoding steps in the source language. For each decoding step $m$, the probability $P(\hat{y}_m)$ of predicting the token $\hat{y}_m$ in the source language vocabulary can be computed based on the corresponding decoder state $\hat{s}_m$. To perform speech translation in the target language, both the source language decoder state sequence $\hat{s}$ and the encoder state sequence $h$ are attended and treated as the target language decoder's input. The hidden state of the target language decoder can then be used to derive the probability $P(y_q)$ of predicting token $y_q$ in the target language vocabulary for every decoding step $q$.
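Given the ground truth sequence in the source language $\hat{y} = \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_M$ and in the target language $y = y_1, y_2, \ldots, y_Q$ with length $Q$, multitask ST can be trained by maximizing the log likelihood in both domains. Formally, the objective function of multitask ST can be written as:

$$\mathcal{L}_{ST} = \alpha \mathcal{L}_{src} + \beta \mathcal{L}_{tgt} = -\alpha \sum_{m=1}^{M} \log P(\hat{y}_m) - \beta \sum_{q=1}^{Q} \log P(y_q), \tag{1}$$

where $\alpha$ and $\beta$ are the trade-off factors to balance between the two tasks.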
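As an illustration of this objective, the following minimal sketch assumes that the loss is the weighted sum of the negative log-likelihoods of the two reference sequences. The function names, the probabilities, and the values of alpha and beta are hypothetical and only serve the example.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence, given the model probability
    assigned to each reference token."""
    return -sum(math.log(p) for p in token_probs)

def multitask_loss(src_probs, tgt_probs, alpha, beta):
    """Weighted sum alpha * L_src + beta * L_tgt of the two task losses."""
    l_src = sequence_nll(src_probs)  # recognition part (source language)
    l_tgt = sequence_nll(tgt_probs)  # translation part (target language)
    return alpha * l_src + beta * l_tgt

# Hypothetical per-step probabilities of the reference tokens.
src_probs = [0.9, 0.8, 0.5]  # M = 3 source decoding steps
tgt_probs = [0.7, 0.6]       # Q = 2 target decoding steps
loss = multitask_loss(src_probs, tgt_probs, alpha=0.5, beta=0.5)
print(round(loss, 4))
```

Setting alpha to zero in this sketch reduces the objective to the translation loss alone, which is one way to see how the two factors trade the tasks off against each other.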

Proposed Methods
We propose two ways to help the multitask end-to-end ST model capture the semantic relation between word tokens by leveraging the source language word embedding as the intermediate level.
We leverage the pre-trained source language word embedding $\hat{E} = \{\hat{e}_v : v \in V\}$, where $V$ is the vocabulary set and $\hat{e}_v \in \mathbb{R}^D$ is the embedding vector with dimension $D$ for any word $v \in V$, in the recognition task. We choose to reinforce the source language decoder state (embedding) $\hat{s}$ since it is later used in the translation task. To be more specific, we argue that the embedding generated by the source language decoder should be more semantically correct in order to benefit the translation task. Given the pre-trained source language word embedding $\hat{E}$, we propose to constrain the source decoder state $\hat{s}_m$ at step $m$ to be close to its corresponding word embedding $\hat{e}_{\hat{y}_m}$ with the two approaches detailed in the following sections.

Directly Learn Word Embedding
Since semantically related words are close in terms of cosine distance (Mikolov et al., 2018), a simple idea is to minimize the cosine distance (CD) between the source language decoder hidden state $\hat{s}_m$ and the corresponding word embedding $\hat{e}_{\hat{y}_m}$ for every decoding step $m$:

$$\mathcal{L}_{CD} = \sum_{m=1}^{M} \left( 1 - \cos\left(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m}\right) \right), \tag{2}$$

where $f_\theta(\cdot)$ is a learnable linear projection to match the dimensionality of the word embedding and the decoder state. With this design, the network architecture of the target language decoder is not limited by the dimension of the word embedding. Fig. 1(b) illustrates this approach. By replacing $\mathcal{L}_{src}$ in Eq. (1) with $\mathcal{L}_{CD}$, semantic learning from word embedding for source language recognition can be achieved.
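The cosine distance objective can be sketched as follows, assuming the loss sums 1 - cos(f(s_m), e_m) over decoding steps and that the projection is a plain matrix multiplication. The weight matrix, decoder states, and embeddings below are hypothetical toy values, not taken from the experiments.

```python
import math

def project(weights, state):
    """Linear projection f_theta: maps a decoder state to the embedding size."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_loss(states, targets, weights):
    """Sum over decoding steps of 1 - cos(f_theta(s_m), e_m)."""
    return sum(1.0 - cosine(project(weights, s), e)
               for s, e in zip(states, targets))

# Toy setup: 3-dimensional decoder states, 2-dimensional word embeddings.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
states = [[2.0, 0.0, 5.0],   # projects to [2, 0], parallel to its target
          [0.0, 3.0, -1.0]]  # projects to [0, 3], orthogonal to its target
targets = [[1.0, 0.0], [1.0, 0.0]]
loss = cosine_distance_loss(states, targets, weights)
print(loss)  # first step contributes 0, second step contributes 1
```

Because the projection absorbs the change of dimension, the decoder state size (3 here) can differ from the embedding size (2 here).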

Learn Word Embedding via Probability
Ideally, using word embedding as the learning target via minimizing CD can effectively train the decoder to model the semantic relation existing in the embedding space. However, such an approach suffers from the hubness problem (Faruqui et al., 2016) of word embedding in practice (as we later discuss in Sec. 4.5).
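To address this problem, we introduce the cosine softmax (CS) function (Liu et al., 2017a,b) to learn speech-to-semantic embedding mappings. Given the decoder hidden state $\hat{s}_m$ and the word embedding $\hat{E}$, the probability of the target word $\hat{y}_m$ is defined as

$$P_{CS}(\hat{y}_m) = \frac{\exp\left(\cos\left(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m}\right) / \tau\right)}{\sum_{v \in V} \exp\left(\cos\left(f_\theta(\hat{s}_m), \hat{e}_v\right) / \tau\right)}, \tag{3}$$

where $\cos(\cdot)$ and $f_\theta(\cdot)$ are from Eq. (2), and $\tau$ is the temperature of the softmax function. Note that since the temperature $\tau$ re-scales the cosine similarity, the hubness problem can be mitigated by selecting a proper value for $\tau$. Fig. 1(c) illustrates the approach.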
With the probability derived from cosine softmax in Eq. (3), the objective function for the source language decoder can be written as

$$\mathcal{L}_{CS} = -\sum_{m=1}^{M} \log P_{CS}(\hat{y}_m). \tag{4}$$

By replacing $\mathcal{L}_{src}$ in Eq. (1) with $\mathcal{L}_{CS}$, the decoder hidden state sequence $\hat{s}$ is forced to contain semantic information provided by the word embedding.
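A minimal sketch of the cosine softmax and its loss is given below, assuming the decoder states have already been projected to the embedding dimension. The three-word vocabulary, the state vector, and the two temperature values are hypothetical and chosen only to show how the temperature sharpens or flattens the distribution.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_softmax(projected_state, embeddings, tau):
    """Probability of every vocabulary word: softmax over cosine
    similarities divided by the temperature tau."""
    logits = [cosine(projected_state, e) / tau for e in embeddings]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def cosine_softmax_loss(projected_states, embeddings, target_ids, tau):
    """Negative log-probability of the reference words, summed over steps."""
    return -sum(math.log(cosine_softmax(s, embeddings, tau)[t])
                for s, t in zip(projected_states, target_ids))

# Toy vocabulary of three 2-dimensional word embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
state = [0.9, 0.1]  # closest in direction to word 0

sharp = cosine_softmax(state, embeddings, tau=0.1)   # peaked distribution
flat = cosine_softmax(state, embeddings, tau=10.0)   # near-uniform distribution
predicted_word = max(range(len(sharp)), key=lambda v: sharp[v])  # arg max
loss = cosine_softmax_loss([state], embeddings, [0], tau=0.1)
```

In this toy case the small temperature puts nearly all probability mass on the closest word, while the large temperature leaves the three words almost equally likely, which is the re-scaling effect described for Eq. (3).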

Experimental Setup
We used the Fisher Spanish corpus (Graff et al., 2010) to perform Spanish speech to English text translation. We followed previous works (Inaguma et al., 2019) for the pre-processing steps, and used 40/160 hours of the train set and the standard dev and test sets for the experiments. Byte-pair-encoding (BPE) (Kudo and Richardson, 2018) was applied to the target transcriptions to form 10K subwords as the target of the translation part. Spanish word embeddings were obtained from FastText pre-trained on Wikipedia (Bojanowski et al., 2016), and 8000 Spanish words were used in the recognition part.
The encoder is a 3-layer 512-dimensional bidirectional LSTM with additional convolution layers, yielding 8× down-sampling in time. The decoders are 1024-dimensional LSTMs, and we used one layer in the recognition part and two layers in the translation part. The models were optimized using Adadelta with $10^{-6}$ as the weight decay rate. Scheduled sampling with probability 0.8 was applied to the decoder in the translation part. Experiments ran for 1.5M steps, and models were selected by the highest BLEU on four transcriptions per utterance in the dev set.

Speech Translation Evaluation
Baseline: We first built the single-task end-to-end model (SE) to set a baseline for multitask learning, which resulted in 34.50/34.51 BLEU on the dev and test sets respectively, comparable to the results of Salesky et al. (2019). With the multitask end-to-end model (ME), we could see that ME outperforms SE in all conditions.

High-resource: Column (a) in Table 1 shows the results trained on 160 hours of data. CD and CS represent the proposed methods described in Sec. 3.1 and 3.2 respectively. We got mixed results when further applying pre-trained word embedding to ME. CD degraded the performance, to a level even worse than SE, but CS performed the best. The results show that directly learning word embedding via cosine distance is not a good strategy in the high-resource setting, but integrating similarity with the cosine softmax function can significantly improve performance. We leave the discussion to Sec. 4.5.

Low-resource: We also experimented on a 40-hour subset of the training data, as shown in column (b) in Table 1. We could see that ME, CD and CS all outperformed SE by a large margin in the low-resource setting. Although CD degraded performance in the high-resource setting, it showed improvements in the low-resource scenario. CS consistently outperformed ME and CD across data sizes, showing that it robustly improves the ST task.

        (a) 160 hours      (b) 40 hours
        dev     test       dev     test
SE      34.50   34.51      17.41   15.44
ME      35.35   35.49      23.30   20.40
CD      33.06   33.65      23.53   20.87
CS      35.84   36.32      23.54   21.72

Table 1: BLEU scores trained on different size of data.

Analysis of Recognition Decoder Output
In this section, we analyzed the hidden states $\hat{s}$ with existing methods. For each word $v$ in the corpus, we denoted its word embedding $\hat{e}_v$ as the pre-trained embedding, and $e_v$ as the predicted embedding. Note that because a single word $v$ could be mapped from multiple audio segments, we took the average of all its predicted embeddings. We obtained the 500 most frequent words in the whole Fisher Spanish corpus, and tested on the sentences in the test set containing only these words.

Eigenvector Similarity: To verify that our proposed methods can constrain hidden states in the word embedding space, we computed the eigenvector similarity between the predicted embedding space and the pre-trained embedding space. The metric derives from Laplacian eigenvalues and represents how similar two spaces are: the lower the value of the metric, the more approximately isospectral the two spaces. Previous works showed that the metric is correlated with the performance of the translation task (Søgaard et al., 2018; Chung et al., 2019). As shown in Table 2, the predicted embedding is more similar to the pre-trained embedding when models are trained on sufficient data (160 vs. 40 hours). CD is the most similar among the three cases, and ME is the most different. The results indicate that our proposals constrain hidden states in the pre-trained embedding space.

Semantic Alignment: To further verify whether the predicted embedding is semantically aligned to the pre-trained embedding, we applied the Procrustes alignment method (Conneau et al., 2017; Lample et al., 2017) to learn the mapping between the predicted embedding and the pre-trained embedding. The 50 most frequent words were selected as the training dictionary, and we evaluated on the remaining 450 words with the cross-domain similarity local scaling (CSLS) method. Precision@k (P@k, k=1,5) was reported as the measurement. As shown in Table 3, CD performed the best, and ME was the worst. This experiment reinforced that our proposals can constrain hidden states to a structure similar to that of the word embedding space.

        160 hours          40 hours
        P@1     P@5        P@1     P@5
ME      1.85    6.29       1.11    9.62
CD      61.48   77.40      56.30   69.25
CS      17.78   35.19      10.37   25.19

Table 3: Precision@k of semantic alignment on test set.
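The semantic alignment procedure can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the data are synthetic (the "predicted" space is an exact rotation of a random "pre-trained" space), the dimension of 16 is arbitrary, and plain cosine nearest-neighbour retrieval stands in for CSLS. The 50/450 split mirrors the setup described above.

```python
import numpy as np

def procrustes(source, target):
    """Orthogonal matrix W minimizing ||source @ W - target||_F,
    where row i of source is paired with row i of target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def precision_at_k(mapped, target, k):
    """Percentage of rows whose true counterpart is among the k most
    cosine-similar rows of target (simple retrieval, not CSLS)."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = a @ b.T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [i in top_k[i] for i in range(len(mapped))]
    return 100.0 * np.mean(hits)

# Synthetic stand-ins for the two embedding spaces of 500 words.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(500, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
predicted = pretrained @ rotation

# Learn the mapping on the first 50 words, evaluate on the remaining 450.
w = procrustes(predicted[:50], pretrained[:50])
mapped = predicted[50:] @ w
p_at_1 = precision_at_k(mapped, pretrained[50:], k=1)
p_at_5 = precision_at_k(mapped, pretrained[50:], k=5)
```

Since the synthetic spaces differ only by a rotation, the learned mapping recovers the held-out words; with real predicted embeddings the precision depends on how closely the two spaces share structure, which is what Table 3 measures.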

Speech Recognition Evaluation
We further analyzed the results of speech recognition for ME and CS. To obtain the recognition results from Eq. (3), we simply take $\arg\max_v P_{CS}(v)$. The word error rate (WER) of the source language recognition is reported in Table 4. Combining the results shown in Table 1, we could see that CS has worse WER, but higher BLEU compared with ME. We concluded that although leveraging word embedding at the intermediate level instead of text results in worse performance in speech recognition, the semantic information could still help multitask models generate better translations in terms of BLEU; this indicates that the WER of the recognition part does not fully determine the translation performance. We do not include the WER of CD in Table 4 because its WER is poor (>100%), but interestingly, the BLEU of CD is still reasonable, which is further evidence that the WER of the intermediate level is not the key to translation performance.

Cosine Distance (CD) vs. Softmax (CS)
Based on the experimental results, we found that the proposed methods can map speech to a semantic space. When optimizing CS, the model consistently outperformed ME in BLEU, which shows that utilizing semantic information truly helps ST. Directly minimizing cosine distance made the predicted embedding space closest to the pre-trained embedding space, but performed inconsistently in BLEU across data sizes. We inferred that the imbalanced word frequency in training and the hubness problem (Faruqui et al., 2016) in the word embedding space made the hidden states not discriminative enough for the target language decoder, while optimizing CS can alleviate this issue.

Conclusions
Our proposals showed that utilizing word embedding as the intermediate helps with the ST task, and that it is possible to map speech to the semantic space. We also observed that lower WER in source language recognition does not imply higher BLEU in target language translation. This work is the first attempt to utilize word embedding in the ST task, and further techniques can be built upon this idea. For example, cross-lingual word embedding mapping methods can be considered within the ST model to shorten the distance between MT and ST tasks.

Figure 1: (a) Multitask ST model. Dotted arrows indicate steps in the recognition part. Solid arrows indicate steps in the translation part. (b) Directly learn word embedding via cosine distance. (c) Learn word embedding via cosine softmax function. Both (b) and (c) are the recognition part in (a).

Table 4: Word error rate (%) trained on different size of data.