A Coarse-Grained Model for Optimal Coupling of ASR and SMT Systems for Speech Translation

Speech translation is conventionally carried out by cascading an automatic speech recognition (ASR) and a statistical machine translation (SMT) system. The hypotheses chosen for translation are based on the ASR system’s acoustic and language model scores, and typically optimized for word error rate, ignoring the intended downstream use: automatic translation. In this paper, we present a coarse-to-fine model that uses features from the ASR and SMT systems to optimize this coupling. We demonstrate that several standard features utilized by ASR and SMT systems can be used in such a model at the speech-translation interface, and we provide empirical results on the Fisher Spanish-English speech translation corpus.


Introduction
Speech translation is the process of translating speech in the source language to text or speech in the target language. This process is typically structured as a three-step pipeline.
Step one involves training an Automatic Speech Recognition (ASR) system to transcribe speech to text in the source language.
Step two involves extracting an appropriate form of the ASR output to translate. We will refer to this step as the Speech-Translation interface. In the simplest scenario, the ASR 1-best output can be used as the source text to translate. It may also be useful to consider alternative ASR hypotheses, which take the form of an N-best list or a word lattice. An N-best list can be included easily in the tuning and decoding process of a statistical machine translation (SMT) system (Zhang et al., 2004). Several researchers have proposed solutions for incorporating lattices and confusion networks in this process (Saleem et al., 2004; Matusov et al., 2005; Bangalore and Riccardi, 2000; Dyer et al., 2008a; Bertoldi and Federico, 2005; Quan et al., 2005; Mathias and Byrne, 2006). Word lattice input to SMT for tuning and decoding increases the complexity of the decoding process because of the exponential number of alternatives that are present. Finally, step three involves training and tuning a Statistical Machine Translation (SMT) system and decoding the output extracted through the Speech-Translation interface.
This paper presents a featurized model which performs the job of hypothesis selection from the outputs of the ASR system for the input to the SMT system. Our motivation is as follows:
1. Using downstream information: Hypothesis selection for the input to the SMT system should be done jointly by the ASR and the SMT systems. That is, there may exist hypotheses that a trained SMT system finds easier to translate, and produces better translations for, than the ones that are deemed best based on the ASR acoustic and language model scores. Incorporation of knowledge from the downstream process (translation) is vital to selecting translation options, and subsequently producing better translations.
2. Coarse-to-fine grained decoding: An intermediate model which acts as an interface and is a weak (coarse) version of the downstream process may be able to select better hypotheses. In effect, a weak translation decoder can be used as the interface to estimate the expected translation quality of an ASR hypothesis. This method of hypothesis selection should be able to incorporate features from the ASR and the SMT system.
3. Phrase units vs. word units: When a phrase-based SMT system is used for translation, optimization for hypothesis selection at the Speech-Translation interface should be conducted using phrases as the basic unit instead of words.

Coarse-to-Fine Speech Translation
In this section, we describe the featurized model (a coarse-grained MT decoder) for hypothesis selection, which uses information from the ASR and SMT systems (impedance matching). We assume the presence of ASR and SMT systems that have been trained separately. In addition to creating almost no disruption in the traditional pipeline approach, this allows us to incorporate local gains from each system. To elaborate, our method avoids joint optimization of the ASR and the SMT system with respect to a translation metric (Vidal, 1997; Ney, 1999), which is not feasible for larger datasets. Also, considering the dearth of speech translation training datasets, this method allows independent training of the ASR and SMT systems on data created only for ASR training and on parallel data for SMT. We start by introducing the formal machinery that will be used and by presenting a simple example to motivate the model. The complete featurized model follows this exposition. Let $\Sigma$ and $\Gamma$ be alphabets of words and phrases respectively in the source language. Using these, we can define the following finite state machines:
1. Word lattice ($L$): A finite state acceptor that accepts word sequences in the source language ($L : \Sigma^* \to \Sigma^*$). This represents the unpruned ASR word lattice output in our model (Figure 1a).
2. Phrase segmentation transducer ($S$): A cyclic finite state transducer that transduces a sequence of words to phrases in the source language ($S : \Sigma^* \to \Gamma^*$). This is built from the source side of the phrase table. Each path represents one source-side phrase in the phrase table. Traversing a path is equivalent to consuming the words in a phrase and producing the phrase as a token (Figure 1b).
3. Weighted word lattice ($L_{ASR}$): A weighted version of $L$ ($L_{ASR} : \Sigma^* \to \Sigma^* / \mathbb{R}^+$). We use the subscript to denote the nature or source of the weights. Similarly, weighted versions of the phrase lattice $P$ are written $P_{ASR}$ or $P_{MT}$, with the subscript denoting the origin of the weights (Figure 1c).

A simple model: Maximum Spanning Phrases
We motivate our model with a fairly simple scenario. Suppose we believe that covering the SMT input with longer source-side phrases would produce better translations. This may be viewed as a tiling problem in which the tiles are the source phrases in the phrase table and the goal is to select the ASR hypothesis that requires the fewest phrases to cover. To achieve this using our existing machinery, we create $\tilde{S}$, a weighted version of $S$ (Figure 1(b)), such that

$$w(\delta^{(S)}) = \begin{cases} 0 & : \pi_1(\delta^{(S)}) \in \Sigma \text{ and } \pi_2(\delta^{(S)}) = \epsilon \\ 1 & : \pi_2(\delta^{(S)}) \in \Gamma \text{ and } \pi_1(\delta^{(S)}) = \epsilon \end{cases}$$

where $\delta^{(S)}$ is an edge in $\tilde{S}$ and $\pi_1$ and $\pi_2$ are the input and output projections respectively. In other words, edges that consume a word cost nothing, and edges that emit a phrase token cost one. Using this segmentation transducer and an unweighted word lattice $L$ (Figure 1(a)), we produce a phrase lattice $P_{MT}$. Assuming the weights are in the log semiring, the weight of a path $\delta^{(P)*}$ in $P_{MT}$ is

$$w(\delta^{(P)*}) = \sum_{\delta^{(P)} \in \delta^{(P)*}} w(\delta^{(P)})$$

Figure 1(c) shows an example of this phrase lattice. Weights in the phrase lattice follow the same definition as the weights in the segmentation transducer. Hence, the weight of a path in the phrase lattice is simply the number of phrases used to cover this path. The shortest path in the phrase lattice $P_{MT}$ corresponds to the hypothesis we were looking for. This simple example demonstrates how we may be able to use SMT features (source phrase length in this case) to select hypotheses from the phrase lattice.
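The tiling view above can be sketched without a finite-state toolkit. The following is a minimal illustration, not the authors' implementation: it assumes the word lattice is a small acyclic graph whose states are numbered in topological order, and it replaces transducer composition and shortest path with an equivalent dynamic program. The function name `min_phrase_cover`, the toy lattice, and the toy phrase table are all hypothetical.

```python
def min_phrase_cover(lattice, start, final, phrases):
    """Return (phrase_count, phrase_sequence) for the lattice path that needs
    the fewest phrase-table phrases, or None if no path can be covered.

    lattice: dict mapping state -> list of (word, next_state) edges.
             States are assumed to be integers in topological order.
    phrases: set of word tuples (source side of a phrase table).
    """
    max_len = max(len(p) for p in phrases)
    best = {start: (0, [])}  # state -> cheapest (count, phrases) ending there
    for state in sorted(lattice.keys() | {final}):
        if state not in best:
            continue  # no phrase boundary can fall on this state
        count, seq = best[state]
        # Enumerate every word span of at most max_len edges leaving `state`.
        stack = [(state, ())]
        while stack:
            node, words = stack.pop()
            if words in phrases:
                # Emitting one phrase token costs 1, as in the weighted S.
                if node not in best or count + 1 < best[node][0]:
                    best[node] = (count + 1, seq + [words])
            if len(words) < max_len:
                for word, nxt in lattice.get(node, []):
                    stack.append((nxt, words + (word,)))
    return best.get(final)


# Toy lattice with two competing ASR hypotheses:
# "quiero una casa" and "quiero un casa".
lattice = {
    0: [("quiero", 1)],
    1: [("una", 2), ("un", 2)],
    2: [("casa", 3)],
}
phrases = {("quiero",), ("una",), ("un",), ("casa",), ("una", "casa")}

result = min_phrase_cover(lattice, 0, 3, phrases)
print(result)
```

On this toy input the selected hypothesis is "quiero una casa", because it can be tiled with two phrases while the competing path needs three.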

A general featurized model for hypothesis selection
We now present a general framework in which hypothesis selection can be carried out using knowledge (features) from the ASR and the SMT system. As described earlier, this form of 'impedance' matching allows us to select hypotheses from an unpruned ASR word lattice for which the SMT system is more likely to find good translations. Incorporating ASR weights also ensures that we take into account what the ASR system considers to be good hypotheses. We start with the previously discussed idea of a phrase lattice, using weights from the ASR system only. That is,

$$P_{ASR} = L_{ASR} \circ S$$

Now, we use the weighted phrase acceptor $W_{MT}$ to bring in the SMT features. Composing this with the weighted phrase lattice, we get

$$P_{ASR,MT} = P_{ASR} \circ (W_{MT})^*$$

where $(W_{MT})^*$ is the Kleene closure of $W_{MT}$. We assume that the edge weights are in the log semiring. Hence, after these two compositions, the edge weights in $P_{ASR,MT}$ can be represented as

$$w(\delta) = \beta f_{ASR}(\delta) + \gamma f_{MT}(\delta)$$

where $\delta$ is an edge in $P_{ASR,MT}$, $\beta$ and $\gamma$ are feature weights, and $f_{ASR}$ and $f_{MT}$ are features from the ASR and SMT system respectively. This form represents a log-linear model (our features are already assumed to be in log-space):

$$w(\delta) = \sum_i \lambda_i f_i(\delta)$$

where $f_i$ is any feature and $\lambda_i$ is the corresponding feature weight.
We may now extract the one-best, N-best or lattice input for the SMT system from $P_{ASR,MT}$.
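As a rough illustration of how a log-linear score ranks hypotheses, the sketch below scores a handful of already-enumerated paths instead of operating on the lattice itself. It assumes each hypothesis carries path-level feature totals expressed as costs (negative log values), so lower is better. The feature names, feature values, and weights are invented for the example.

```python
def path_cost(features, weights):
    """Log-linear cost of one path: sum_i lambda_i * f_i."""
    return sum(weights[name] * value for name, value in features.items())


def n_best(hypotheses, weights, n=1):
    """Return the n lowest-cost hypotheses under the given feature weights."""
    return sorted(hypotheses, key=lambda h: path_cost(h["features"], weights))[:n]


# Hypothetical path-level feature totals for two ASR hypotheses.
hypotheses = [
    {"text": "quiero un casa", "features": {"asr_cost": 10.0, "phrase_count": 4}},
    {"text": "quiero una casa", "features": {"asr_cost": 10.5, "phrase_count": 2}},
]

asr_only = {"asr_cost": 1.0, "phrase_count": 0.0}
asr_and_mt = {"asr_cost": 1.0, "phrase_count": 0.5}

print(n_best(hypotheses, asr_only)[0]["text"])    # ASR-preferred hypothesis
print(n_best(hypotheses, asr_and_mt)[0]["text"])  # hypothesis preferred once the SMT feature counts
```

With the ASR score alone the first hypothesis wins. Once the phrase-count feature receives a non-zero weight, the second hypothesis becomes cheaper (11.5 versus 12.0), which is the kind of re-ranking the model is designed to allow.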

A discussion about related techniques
1. Decoding (Translation): Our model closely resembles a featurized finite-state transducer based translation model. If we replace the output alphabet of the acceptor $(W_{MT})^*$ with the target-side phrases, we will actually get output in the target language. Even though this model does not explicitly include reordering, the coarse-grained decoder has access to information that can enable better decisions about which hypotheses are better for the downstream process (translation).
2. Lattice decoding: Dyer et al. (2008b) suggest passing the entire word lattice to the SMT system. However, even if these lattices are not pruned, a beam-based decoder might not consider hypotheses that our model may produce through coarse-grained decoding.
3. Language model re-scoring: One may use a bigger source language model to re-score the ASR lattice (or an N-best list). This, however, does not consider any SMT features in re-scoring. With our model, we can simply use this as an additional feature.

Training
Training the hypothesis selection model can be carried out using standard methods for log-linear models on a held-out set. This also requires decoding (translation) of a deep N-best list derived from the held-out set. The objective of training then simply becomes maximization of translation quality under any metric that provides sentence-level scores. Each time our model produces a hypothesis, its score can be looked up from the pre-translated N-best list. Also, whenever the weights are updated, the only structures that need to be rebuilt are $(W_{MT})^*$ and $P_{ASR,MT}$.
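The lookup-based training loop can be illustrated with a deliberately naive tuner. In this sketch, a brute-force grid search stands in for a proper log-linear optimizer, and the mean of per-sentence quality scores stands in for the translation metric; both are simplifying assumptions. The data layout, the `quality` field (a precomputed sentence-level score for each hypothesis's translation), and all numbers are hypothetical.

```python
from itertools import product


def select(nbest, weights):
    """Pick the lowest-cost hypothesis of one sentence under the given weights."""
    return min(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h["features"])))


def tune(dev_set, grid):
    """Grid-search the weight vector whose selections score best on average.

    dev_set: list of N-best lists; each hypothesis has a feature tuple and the
             precomputed quality of its translation, so no SMT decoding is
             needed inside the loop.
    """
    n_features = len(dev_set[0][0]["features"])
    best_weights, best_score = None, float("-inf")
    for weights in product(grid, repeat=n_features):
        score = sum(select(nbest, weights)["quality"] for nbest in dev_set) / len(dev_set)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Features are (asr_cost, phrase_count); quality is a looked-up sentence score.
dev_set = [
    [{"features": (10.0, 4), "quality": 0.30}, {"features": (10.5, 2), "quality": 0.45}],
    [{"features": (8.0, 3), "quality": 0.50}, {"features": (8.2, 4), "quality": 0.40}],
    [{"features": (9.0, 2), "quality": 0.20}, {"features": (5.0, 2), "quality": 0.60}],
]

weights, score = tune(dev_set, grid=(0.0, 0.5, 1.0))
print(weights, round(score, 4))
```

Because every candidate's translation quality is looked up and never recomputed, each weight setting only requires re-ranking, which is what keeps the training loop cheap.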

Features
We use the following features in our implementation of this model. However, any relevant ASR and SMT feature may be readily added to this model.
1. ASR scores: We incorporate the ASR acoustic (AM) and language (LM) model scores as one combined feature,

$$f_{ASR} = LM + \alpha \cdot AM$$

Here, LM and AM are negative log-probabilities and $\alpha$ is the acoustic scaling parameter chosen to minimize ASR word error rate.
2. Source phrase count: As described in the maximum spanning phrases example above, this feature may be used to capture the intuition that using fewer phrases to cover the input sentence may produce better translations.
3. Length-normalized phrase unigram probability: We may use a phrase LM feature by incorporating phrase n-gram probabilities normalized by length, computed for each source-side phrase $f_j$ in the phrase table.
4. Phrase translation entropy: For each source-side phrase $f_j$, we may have multiple translations ($e_i$) in the phrase table with different translation probabilities $p_{tr}(e_i \mid f_j)$. A simple entropy measure can be used as a feature to estimate the confidence that the SMT system has in translating $f_j$:

$$f_{tr}(f_j) = H_{tr}(E \mid f_j) = -\sum_i p_{tr}(e_i \mid f_j) \log p_{tr}(e_i \mid f_j)$$

5. Lexical translation entropy: Similarly, we can use an entropy measure based on the lexical translation probability as a feature.
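To make the entropy feature concrete, the sketch below computes it for a toy phrase table. It is only an illustration: the phrase table entries and probabilities are invented, the natural logarithm is an assumption (the base only rescales the feature), and the function name `translation_entropy` is hypothetical.

```python
import math


def translation_entropy(translation_probs):
    """Entropy of the translation distribution of one source phrase.

    translation_probs: dict mapping target phrase -> p(target | source).
    Zero-probability entries are skipped, following the 0 * log 0 = 0 convention.
    """
    return -sum(p * math.log(p) for p in translation_probs.values() if p > 0.0)


# Hypothetical phrase table fragment: source phrase -> translation distribution.
phrase_table = {
    "casa": {"house": 0.9, "home": 0.1},
    "banco": {"bank": 0.5, "bench": 0.5},
    "gracias": {"thanks": 1.0},
}

for source, translations in phrase_table.items():
    print(source, round(translation_entropy(translations), 4))
```

A phrase with a single translation has zero entropy, a phrase whose probability mass is split evenly between two translations has entropy log 2, and a skewed distribution falls in between. Low values therefore mark phrases the SMT system translates with high confidence.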

Results
We use the Fisher and Callhome Spanish-English Speech Translation Corpus (Post et al., 2013) for our experiments. The Fisher dataset consists of 819 transcribed and translated telephone conversations. The corpus is split into a training set, a dev set and two test sets (dev-2 and test). We use the dev set for training the feature weights of the proposed model. We use the Kaldi speech recognition tools (Povey et al., 2011) to build our Spanish ASR systems.
Our state-of-the-art ASR system is the p-norm DNN system of Zhang et al. (2014). The word error rates on the dev, dev-2 and test sets of the Fisher dataset are 29.80%, 29.79% and 25.30% respectively. For the SMT system, we use the phrase-based translation system of Moses (Koehn et al., 2007) with sparse features. The system is trained and tuned on the train and dev partitions of the Fisher dataset respectively. The BLEU scores of the MT output for the dev-2 and the test partitions are 65.38% and 62.91% respectively. For decoding the ASR output, we tune on the 1-best ASR output for the dev partition. With this modified system, the BLEU scores for the ASR 1-best output of the dev-2 and the test partitions are 40.06% and 40.4% respectively. We use this system as the baseline for our experiments (Table 1). We note that if we were to use the lattice oracle from our ASR system as input to the SMT system, we would get a BLEU score of 46.59% for the dev-2 partition of the Fisher dataset. This indicates that the largest BLEU gain an oracle lattice reranker could achieve on this partition is only 6.53% (46.59% versus 40.06%).
To tune the weights of the coarse decoder, we decode 500-best ASR outputs for the tuning set with the SMT system. This maps each ASR hypothesis to a target language translation. An OOV feature was added to handle words that were not seen by the SMT system. The tuning process was then carried out so as to maximize BLEU, computed between the reference translation and the translation of the ASR hypothesis selected by the coarse-grained decoder. We used ZMERT (Zaidan, 2009) for tuning, which was configured to expect a 300-best list from the decoder at every iteration using the Fisher dev set. 15 iterations of tuning were carried out for each experiment. We then use the tuned weight vector to decode the Fisher dev-2 and test sets using our coarse-grained decoder. We extract the one-best output and use it as input to the pretrained SMT system (described above). Table 1 reports the results achieved by the featurized coarse-grained decoder.

Conclusions
We present a coarse-to-fine featurized model which acts as the interface between ASR and SMT systems. By utilizing information from the upstream (ASR) and the downstream (SMT) systems, this model makes more informed decisions about which hypotheses from the ASR word lattice may result in better translations. Moreover, the model takes the form of a coarse finite state transducer based translation decoder which imitates the downstream system. This enables it to estimate translation quality even before the complete SMT system is used for decoding. Finally, the proposed model is featurized and may accept any features from the ASR and SMT systems that are deemed useful for optimizing translation quality. The Spanish Fisher corpus is one of a few conversational speech translation datasets available, and we start with a strong baseline system. We therefore persevere with the experimental setup described above, even though the maximum (oracle) improvement by any rescoring method is only 6.5% BLEU, as noted above. This partially explains the small gains reported here, and suggests that this method should be evaluated further on another corpus, e.g. the Egyptian Arabic translation dataset, with greater headroom for improvement.