Lattice Transformer for Speech Translation

Recent advances in sequence modeling have highlighted the strengths of the transformer architecture, especially in achieving state-of-the-art machine translation results. However, depending on the upstream systems, e.g., speech recognition or word segmentation, the input to the translation system can vary greatly. The goal of this work is to extend the attention mechanism of the transformer to naturally consume a lattice in addition to the traditional sequential input. We first propose a general lattice transformer for speech translation, where the input is the output of automatic speech recognition (ASR), which contains multiple paths and posterior scores. To leverage the extra information from the lattice structure, we develop a novel controllable lattice attention mechanism to obtain latent representations. On the LDC Spanish-English speech translation corpus, our experiments show that the lattice transformer generalizes significantly better and outperforms both a transformer baseline and a lattice LSTM. Additionally, we validate our approach on the WMT 2017 Chinese-English translation task with lattice inputs from different BPE segmentations. In this task, we also observe improvements over strong baselines.


Introduction
The Transformer-based encoder-decoder framework (Vaswani et al., 2017) for Neural Machine Translation (NMT) has become the state of the art in many translation tasks, significantly improving translation quality in text (Bojar et al., 2018; Fan et al., 2018) as well as in speech (Jan et al., 2018). Most NMT systems fall into the category of Sequence-to-Sequence (Seq2Seq) models (Sutskever et al., 2014), because both the input and output consist of sequential tokens. Therefore, in most neural speech translation systems, such as that of Bojar et al. (2018), the input to the translation system is usually the 1-best hypothesis from the ASR rather than the word lattice output with its corresponding probability scores. How to consume a word lattice rather than a sequential input has been substantially researched in several natural language processing (NLP) tasks, such as language modeling (Buckman and Neubig, 2018), Chinese Named Entity Recognition (NER) (Zhang and Yang, 2018), and NMT (Su et al., 2017). Additionally, some pioneering works (Adams et al., 2016; Sperber et al., 2017; Osamura et al., 2018) demonstrated the potential improvements in speech translation from leveraging the additional information and uncertainty of the packed lattice structure produced by the ASR acoustic model.
Efforts have since continued to push the boundaries of long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) models. More precisely, most previous works are in line with the existing Tree-LSTM method (Tai et al., 2015), adapting it to task-specific Lattice-LSTM variants that can handle lattices and robustly perform better than the original models. However, Lattice-LSTMs remain inherently sequential because of the topological representation of the lattice graph, which precludes long-path dependencies (Khandelwal et al., 2018) and parallelization within training examples; these are the fundamental constraints of LSTMs.
In this work, we introduce a generalization of the standard transformer architecture to accept lattice-structured network topologies. The standard transformer is a transduction model relying entirely on attention modules to compute latent representations; e.g., self-attention computes the intra-attention between every pair of tokens in each sequence example. Recent works such as Yu et al. (2018), Devlin et al. (2018), Lample et al. (2018), and Su et al. (2018) empirically find that the transformer can outperform LSTMs by a large margin, and the success is mainly attributed to self-attention. In our lattice transformer, we propose a lattice relative positional attention mechanism that can incorporate the probability scores of ASR word lattices. The major difference from the self-attention in the transformer encoder is illustrated in Figure 1.
We first borrow the idea of the relative positional embedding (Shaw et al., 2018) to maximally encode the information of the lattice graph into its corresponding relative positional matrix. This design essentially does not allow a token to pay attention to any token that does not appear in a shared path. Secondly, the attention weights depend not only on the query and key representations in the standard attention module, but also on the marginal / forward / backward probability scores (Rabiner, 1989; Post et al., 2013) derived from the upstream systems (such as ASR). Compared with the 1-best hypothesis alone (which is itself based on forward scores), the additional probability scores carry rich information about the distribution of each path (Sperber et al., 2017). It is in principle possible to use them, for example to reweight the attention weights, to increase the uncertainty of the attention toward alternative tokens.
Our lattice attention is controllable and flexible in how each score is used. The lattice transformer can readily consume the lattice input alone if the scores are unavailable. A common application is found in the Chinese NER task, in which a Chinese sentence can have multiple possible word segmentations (Zhang and Yang, 2018). Furthermore, different BPE operations (Sennrich et al., 2016) or probabilistic subwords (Kudo, 2018) can also bring similar uncertainty to subword candidates and form a compact lattice structure.
In summary, this paper makes the following main contributions. i) To the best of our knowledge, we are the first to propose a novel attention mechanism that consumes a word lattice and the probability scores from the ASR system. ii) The proposed approach applies naturally to both the encoder self-attention and the encoder-decoder attention. iii) Another appealing feature is that the lattice transformer can be reduced to a standard lattice-to-sequence model without probability scores, fitting the text translation task. iv) Extensive experiments on speech translation datasets demonstrate that our method outperforms the previous transformer and Lattice-LSTMs. The experiment on the WMT 2017 Chinese-English translation task shows that the reduced model can improve over many strong baselines such as the transformer.

Background
We first briefly describe the standard transformer that our model is built upon, and then elaborate on our proposed approach in the next section.

Transformer
The Transformer follows the typical encoder-decoder architecture using stacked self-attention, point-wise fully connected layers, and encoder-decoder attention layers. Each layer is in principle wrapped by a residual connection (He et al., 2016) and a post-processing layer normalization (Ba et al., 2016). Although in principle it is not necessary to mask self-attention in the encoder, practical implementations must mask the padding positions. Self-attention in the decoder, however, only allows positions up to the current one to be attended to, preventing leftward information flow (i.e., from future positions) and preserving the auto-regressive property. Illegal connections are masked out by setting them to $-10^9$ before the softmax operation.

Dot-product Attention
Suppose that for each attention layer in the transformer encoder and decoder, we have two input sequences represented as two matrices $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$, where $n$ and $m$ are the lengths of the source and target sentences respectively, and $d$ is the hidden size (usually equal to the embedding size). The output is $h$ new sequences $Z_i \in \mathbb{R}^{n \times d/h}$ or $\mathbb{R}^{m \times d/h}$, where $h$ is the number of attention heads. In general, the result of multi-head attention is calculated according to the following procedure:
$$Q = X W^Q \;\text{ or }\; Y W^Q \;\text{ or }\; Y W^Q \tag{1}$$
$$K = X W^K \;\text{ or }\; Y W^K \;\text{ or }\; X W^K \tag{2}$$
$$V = X W^V \;\text{ or }\; Y W^V \;\text{ or }\; X W^V \tag{3}$$
$$Z_i = \mathrm{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d/h}} + I_d M\right) V \tag{4}$$
$$Z = \mathrm{Concat}(Z_1, \ldots, Z_h)\, W^O \tag{5}$$
where the matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d/h}$ and $W^O \in \mathbb{R}^{d \times d}$ are the learnable projection parameters, and the masking matrix $M \in \mathbb{R}^{m \times m}$ is a strictly upper triangular matrix, i.e., zero on and below the diagonal and $-10^9$ above it. Note that i) the three columns on the right-hand side of Eq (1, 2, 3) are used to compute the encoder self-attention, the decoder self-attention, and the encoder-decoder attention respectively, ii) $I_d$ is the indicator function that returns 1 if it computes decoder self-attention and 0 otherwise, iii) the projection parameters are unique per layer and head, iv) the Softmax in Eq (4) is a row-wise matrix operation, computing the attention weights by scaled dot product and resulting in a point on the simplex $\Delta^n$ for each row.
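The single-head computation described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the random inputs, and the reuse of one set of projection weights for all three attention types are assumptions made only to keep the example short.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -inf, as in the text


def softmax(logits):
    """Row-wise softmax; each output row lies on the probability simplex."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_head(x_query, x_memory, W_Q, W_K, W_V, causal=False):
    """One head: queries come from x_query, keys and values from x_memory."""
    Q, K, V = x_query @ W_Q, x_memory @ W_K, x_memory @ W_V
    logits = Q @ K.T / np.sqrt(Q.shape[-1])  # Q.shape[-1] equals d / h
    if causal:
        # Strictly upper-triangular mask: position t cannot attend to positions > t.
        logits = logits + np.triu(np.full(logits.shape, NEG_INF), k=1)
    return softmax(logits) @ V


rng = np.random.default_rng(0)
n, m, d, h = 5, 4, 16, 4
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
# Shared here only for brevity; in a real model they are unique per layer and head.
W_Q, W_K, W_V = (rng.normal(size=(d, d // h)) for _ in range(3))

Z_enc = attention_head(X, X, W_Q, W_K, W_V)               # encoder self-attention
Z_dec = attention_head(Y, Y, W_Q, W_K, W_V, causal=True)  # decoder self-attention
Z_cross = attention_head(Y, X, W_Q, W_K, W_V)             # encoder-decoder attention
```

With the causal mask, the first target position can only attend to itself, so the first row of `Z_dec` equals the first row of the value matrix.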

Lattice Transformer
As motivated in the introduction, our goal is to enhance the standard transformer architecture, which is limited to sequential inputs, to consume lattice inputs with additional information from the upstream ASR systems.

Lattice Representation
Without loss of generality, we assume a word lattice from the ASR system to be a directed, connected, and acyclic graph following a topological ordering such that a child node comes after its parent nodes.
Figure 2: An example of the lattice relative position matrix, where "-inf" in the matrix is a special number denoting that no relative position exists between the corresponding two tokens.
We add two special tokens to each path of the lattice, which represent the start of sentence and the end of sentence (e.g., Figure 1), so that the graph has a single source node and a single end node, where each node is assigned a token.
Given the definition and property described above, we propose to use a relative positional lattice matrix $L \in \mathbb{Z}^{n \times n}$ to encode the graph information, where $n$ is the number of nodes in the graph. For any two nodes $i, j$ in the lattice graph, the matrix entry $L_{ij}$ is the minimum relative distance between them. In other words, if the nodes $i, j$ share at least one path, then we have
$$L_{ij} = \min_{p \,\in\, \text{common paths of } i, j} \left( L^p_{j0} - L^p_{i0} \right), \tag{6}$$
where $L^p_{\cdot 0}$ is the distance to the source node in path $p$. If no common path exists for two nodes, we denote the relative distance as $-\infty$ ($-10^9$ in practice) for subsequent masking in the lattice attention. The reason for choosing "min" in Eq (6) is that in our dataset about 70% of the entries $L_{ij}$ computed by "min" and "max" are identical, and about 20% of the entries differ by just 1. Empirically, our experiments also show no significant difference in performance between the two.
An illustration of computing the lattice matrix for the example in the introduction is shown in Figure 2. The matrix elements equal to 1 indicate the relations between parent and child nodes, so the lattice graph can be deterministically reconstructed from them.
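As a rough illustration of this definition, the sketch below builds the matrix for a toy lattice by brute-force path enumeration. It assumes the sign convention "position of j minus position of i" (consistent with the negative entries in Figure 2) and uses hypothetical helper names; enumerating all paths is exponential in general and is only meant for small examples.

```python
import numpy as np

NO_PATH = -10**9  # finite stand-in for -inf, as in the text


def all_paths(children, source, sink):
    """Yield every source-to-sink path of a small DAG (exponential in general)."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            yield path
        else:
            stack.extend(path + [child] for child in children.get(path[-1], []))


def lattice_position_matrix(children, n, source, sink):
    """L[i, j] = min over shared paths of (position of j) - (position of i)."""
    L = np.full((n, n), NO_PATH, dtype=np.int64)
    for path in all_paths(children, source, sink):
        for pos_i, i in enumerate(path):
            for pos_j, j in enumerate(path):
                dist = pos_j - pos_i
                L[i, j] = dist if L[i, j] == NO_PATH else min(L[i, j], dist)
    return L


# Toy lattice: <s> -> {a | b} -> c -> </s>, nodes numbered in topological order.
children = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
L = lattice_position_matrix(children, n=5, source=0, sink=4)
print(L)
```

In this toy lattice the alternatives `a` (node 1) and `b` (node 2) never share a path, so `L[1, 2]` and `L[2, 1]` keep the `NO_PATH` value, while entries equal to 1 mark exactly the parent-child edges.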

Controllable Lattice Attention
Besides the lattice graph representation, the posterior probability scores can be simultaneously produced from the acoustic model and language model in most ASR systems. We deliberately design a controllable lattice attention mechanism to incorporate such information so that the attention encodes more uncertainty.
In general, we denote the posterior probability of a node $i$ as the forward score $f_i$, where the forward scores of the child nodes of any node sum to 1. Following the recursion rule in (Rabiner, 1989), we can further derive another two useful probabilities, the marginal score $m_i = f_i \sum_{j \in \mathrm{Pa}(i)} m_j$ and the backward score $b_i = m_i / \sum_{k \in \mathrm{Ch}(i)} m_k$, where $\mathrm{Pa}(i)$ and $\mathrm{Ch}(i)$ denote node $i$'s predecessor and successor sets, respectively. Intuitively, the marginal score measures the global importance of the current token compared with its substitutes given all predecessors; the backward score is analogous to the forward score, in that it is only locally associated with the importance of different parents to their children, where the scores of the parent nodes sum to 1. Therefore, our controllable attention aims to employ marginal scores and forward / backward scores.
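The two recursions can be sketched directly in Python. This is an illustrative sketch, not the authors' code: it assumes nodes are numbered in topological order, that the source node receives an incoming mass of 1, and that the sink (which has no children) gets a backward score of 1.0 by convention.

```python
def marginal_and_backward(forward, parents, children):
    """Compute marginal and backward scores from forward scores on a DAG.

    Nodes are assumed to be numbered in topological order (parents first).
    """
    n = len(forward)
    marginal = [0.0] * n
    for i in range(n):
        pa = parents.get(i, [])
        # The source node has no parents; its incoming mass is taken to be 1.
        incoming = sum(marginal[j] for j in pa) if pa else 1.0
        marginal[i] = forward[i] * incoming

    backward = []
    for i in range(n):
        ch = children.get(i, [])
        # The sink has no children; 1.0 is an illustrative convention.
        backward.append(marginal[i] / sum(marginal[k] for k in ch) if ch else 1.0)
    return marginal, backward


# Toy lattice: <s> -> {a (0.7) | b (0.3)} -> c -> </s>
children = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
parents = {1: [0], 2: [0], 3: [1, 2], 4: [3]}
forward = [1.0, 0.7, 0.3, 1.0, 1.0]
marginal, backward = marginal_and_backward(forward, parents, children)
```

For this lattice the backward scores of the two parents of `c` (nodes 1 and 2) sum to 1, matching the property stated in the text.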

Lattice Embedding
We first construct the latent representations of the relative positional lattice matrix $L$. The matrix $L$ can be straightforwardly decomposed into two matrices: one is the mask $L^M$ with only 0 and $-\infty$ values, and the other is the matrix with regular values, i.e., $L^R = L - L^M$. Given a 2D embedding matrix $W^L$, the embedded vector of $L^R_{ij}$ can be written as $W^L[L^R_{ij}, :]$ with NumPy-style indexing. To prevent the lattice embedding from changing dynamically, we clip every entry of $L^R$ to the range $[-c, c]$ with a positive integer $c$, such that $W^L \in \mathbb{R}^{(2c+1) \times d/h}$ has a fixed dimensionality and becomes a learnable parameter.
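A minimal NumPy sketch of this decomposition and lookup is given below. The function name is hypothetical, and the `+ c` shift that maps clipped distances in [-c, c] to row indices 0..2c is an assumption needed to index a table with 2c+1 rows; the text does not spell this detail out.

```python
import numpy as np

NO_PATH = -10**9  # finite stand-in for -inf


def lattice_embedding(L, W_L, c):
    """Split L into mask and regular parts, clip, and look up embeddings."""
    L_M = np.where(L == NO_PATH, NO_PATH, 0)  # mask part: 0 or "-inf"
    L_R = L - L_M                             # regular part: masked entries become 0
    index = np.clip(L_R, -c, c) + c           # shift [-c, c] to [0, 2c] (assumption)
    return W_L[index], L_M                    # E has shape (n, n, d / h)


# Position matrix of the toy lattice <s> -> {a | b} -> c -> </s>.
L = np.array([[0, 1, 1, 2, 3],
              [-1, 0, NO_PATH, 1, 2],
              [-1, NO_PATH, 0, 1, 2],
              [-2, -1, -1, 0, 1],
              [-3, -2, -2, -1, 0]])
c, d_head = 2, 8
W_L = np.random.default_rng(0).normal(size=(2 * c + 1, d_head))
E, L_M = lattice_embedding(L, W_L, c)
```

Distances beyond the clipping range share an embedding row: here the distance 3 between the first and last node is clipped to 2.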

Attention with Probability Scores
Our proposed controllable lattice attention is depicted in the left panel of Figure 3, which shows the computational graph with detailed network modules. More concretely, we first denote the lattice embedding for $L^R$ as a 3D array $E \in \mathbb{R}^{n \times n \times d/h}$. Then, the attention weights adapted from the traditional transformer are integrated with marginal scores that capture the distribution of each path in the lattice. The logits in Eq (4) become the sum of three individual terms (if we temporarily omit the mask matrix),
$$\frac{Q K^\top}{\sqrt{d/h}} + \frac{\mathrm{einsum}(Q, E)}{\sqrt{d/h}} + w_m\, m. \tag{7}$$
The original $QK^\top$ term remains since the word embeddings carry the majority of the significant semantic information. The difficult part in Eq (7) is the new dot product term involving the lattice embedding through the einsum operation (footnote 2), where einsum is a multi-dimensional linear algebraic array operation in Einstein summation convention. In our case, it sums out the hidden-size dimension, resulting in a new 2D array in $\mathbb{R}^{n \times n}$, which is further scaled by $1/\sqrt{d/h}$ as well. In addition, we add the scaled marginal score vector $m \in \mathbb{R}^n$ to obtain the logits. With the new parameterization, each term has an intuitive meaning: term i) represents semantic information, term ii) governs the lattice-dependent positional relation, and term iii) encodes the global uncertainty of the ASR output.
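The three-term logits can be sketched with `numpy.einsum` as below. This is a sketch under stated assumptions: the function name is hypothetical, the inputs are random, and the marginal score vector is assumed to be broadcast over query rows so that each key token j contributes `w_m * m[j]`; the text does not state the broadcasting direction explicitly.

```python
import numpy as np


def lattice_logits(Q, K, E, m, w_m):
    """Attention logits with content, lattice-position, and marginal-score terms."""
    d_head = Q.shape[-1]
    content = Q @ K.T                          # term i: semantic information
    position = np.einsum("ik,ijk->ij", Q, E)   # term ii: sum_k Q[i, k] * E[i, j, k]
    # term iii: marginal score of each key token, broadcast over rows (assumption)
    return (content + position) / np.sqrt(d_head) + w_m * m[None, :]


rng = np.random.default_rng(0)
n, d_head = 5, 8
Q, K = rng.normal(size=(n, d_head)), rng.normal(size=(n, d_head))
E = rng.normal(size=(n, n, d_head))
m = np.array([1.0, 0.7, 0.3, 1.0, 1.0])
logits = lattice_logits(Q, K, E, m, w_m=0.5)
```

Setting `E` to zeros and `w_m` to 0 recovers the standard scaled dot-product logits, which is one way to sanity-check an implementation.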
The attention logits associated with the forward or backward scores differ considerably from those with marginal scores, since they govern the local information between the parent and child nodes. They are represented as a matrix rather than a vector, where the matrix has non-zero values only where nodes $i, j$ have a parent-child relation in the lattice graph. First, an upper or lower triangular mask matrix is used to restrict every token's attention to the forward scores of its successors or the backward scores of its predecessors. This seems counterintuitive, but the reason is that the forward scores of each token's child nodes sum to 1, and so do the backward scores of each token's parent nodes. Secondly, before applying the softmax operation, the lattice mask matrix $L^M$ is added to each set of logits to prevent attention from crossing paths. Eventually, the final attention vector used to multiply the value representation $V$ is a weighted average of the three proposed attention vectors $A_\bullet$ with different probability scores $s_\bullet$,
$$A = s_m A_m + s_f A_f + s_b A_b, \quad \text{s.t. } s_m + s_f + s_b = 1. \tag{8}$$
In summary, the overall architecture of the lattice transformer is illustrated in the right panel of Figure 3.
Footnote 2: This op is available in NumPy, TensorFlow, or PyTorch. In our example, Q and E are 2D and 3D arrays, and the result of this op is a 2D array whose element in the ith row and jth column is $\sum_k Q_{ik} E_{ijk}$.
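One possible reading of this combination step is sketched below. Several details are assumptions rather than facts from the text: how the forward and backward score matrices enter the logits (here they are simply added with hypothetical scalar weights `w_f` and `w_b`), and the function and variable names. Only the path mask and the weighted average of three attention matrices follow the description directly.

```python
import numpy as np

NO_PATH = -10**9


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def controllable_attention(base, L, m, f, b, s, w=(1.0, 1.0, 1.0)):
    """Weighted average of attention built from marginal, forward, backward scores."""
    L_M = np.where(L == NO_PATH, -1e9, 0.0)   # blocks attention across paths
    child = (L == 1)                           # child[i, j]: j is a child of i (upper triangular)
    parent = child.T                           # parent[i, j]: j is a parent of i (lower triangular)
    F = np.where(child, f[None, :], 0.0)       # forward scores of each token's successors
    B = np.where(parent, b[None, :], 0.0)      # backward scores of each token's predecessors
    w_m, w_f, w_b = w                          # w_f and w_b are hypothetical scalars
    A_m = softmax(base + w_m * m[None, :] + L_M)
    A_f = softmax(base + w_f * F + L_M)
    A_b = softmax(base + w_b * B + L_M)
    s_m, s_f, s_b = s                          # expected to sum to 1
    return s_m * A_m + s_f * A_f + s_b * A_b


L = np.array([[0, 1, 1, 2, 3],
              [-1, 0, NO_PATH, 1, 2],
              [-1, NO_PATH, 0, 1, 2],
              [-2, -1, -1, 0, 1],
              [-3, -2, -2, -1, 0]])
m = np.array([1.0, 0.7, 0.3, 1.0, 1.0])
f = np.array([1.0, 0.7, 0.3, 1.0, 1.0])
b = np.array([1.0, 0.7, 0.3, 1.0, 1.0])
base = np.random.default_rng(0).normal(size=(5, 5))  # stands in for the scaled logits
A = controllable_attention(base, L, m, f, b, s=(0.5, 0.25, 0.25))
```

Because every component is a row-stochastic matrix and the weights sum to 1, each row of `A` still sums to 1, and tokens on different paths (nodes 1 and 2 here) receive exactly zero attention from each other.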

Discussion
A critical point for the lattice transformer is whether the model can generalize to other common lattice-based inputs. More specifically, how does the model apply to lattice input without probability scores? And to what extent can we train the lattice model on a regular sequential input? If probability scores are unavailable, we can use the lattice graph representations alone by setting the scalar $w_m = 0$ in Eq (7) and $s_f = s_b = 0$, $s_m = 1$ in Eq (8) as non-trainable constants.
We validate this viewpoint on the Chinese-English translation task, where the Chinese input is a pure lattice structure derived from different tokenizations. A sequential input is just a special case of the lattice graph with only one path. An interesting point is that our encoder-decoder attention also takes the key and value representations from the lattice input and aggregates the marginal scores, though the sequential target prevents us from using lattice self-attention in the decoder. However, we can still visualize how the sequential target attends to the lattice input.
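To make the single-path special case concrete, the short sketch below builds the position matrix of a plain sequence, which reduces to the ordinary relative offset j - i, and lists the constants used when scores are unavailable. The helper name and the dictionary of constants are illustrative only.

```python
import numpy as np


def sequence_position_matrix(n):
    """Position matrix of a single-path lattice: L[i, j] = j - i."""
    positions = np.arange(n)
    return positions[None, :] - positions[:, None]


L_seq = sequence_position_matrix(4)

# Non-trainable constants for inputs without probability scores (per the text).
reduced_constants = {"w_m": 0.0, "s_m": 1.0, "s_f": 0.0, "s_b": 0.0}
```

Since every pair of tokens shares the single path, no entry of `L_seq` is masked, and the model behaves like a transformer with relative positional embeddings.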
A practical point for the lattice transformer is whether the training or inference time for such a seemingly complicated architecture is acceptable. In our implementation, we first preprocess the lattice input to obtain the position matrix for the whole dataset, so this one-time preprocessing adds almost no overhead to training and inference. In addition, the extra einsum operation in the controllable lattice attention is the most time-consuming computation, but it has the same computational complexity as $QK^\top$. Empirically, in the ASR experiments, we found that the training and inference of the most complicated lattice transformer (last row in the ablation study) take about 100% and 40% more time than the standard transformer; in the text translation task, our algorithm takes about 30% and 20% more time during training and inference.

Experiments
We mainly validate our model in two scenarios, speech translation with word lattices and posterior scores, and Chinese to English text translation with different BPE operations on the source side.

Speech Translation
For the speech translation experiment, we use the Fisher and Callhome Spanish-English Speech Translation Corpus from LDC (Post et al., 2013), which is produced from telephone conversations. Our baseline models are the vanilla Transformer with relative positional embeddings (Vaswani et al., 2017; Shaw et al., 2018), and Lattice-LSTMs (Sperber et al., 2017).

Datasets
The Fisher corpus contains conversations between strangers, while the Callhome corpus is primarily between friends and family members. The numbers of sentence pairs in the two datasets are 138,819 and 15,080, respectively. The source-side Spanish corpus consists of four data types: reference (human transcripts), oracle of ASR lattices (the optimal path with the lowest word error rate (WER)), ASR 1-best hypothesis, and ASR lattice. For data processing, we apply case-insensitive tokenization with the standard Moses tokenizer (footnote 3) to both the source and target transcripts, and remove the punctuation on the source side. The sentences of the other three types have already been lowercased and had punctuation removed. To stay consistent with the lattices, we add a token "<s>" at the beginning in all cases.
Table 1:
Setting | Description
R | baseline, trained with human transcripts only
R+1 | fine-tuned on 1-best hypothesis
R+L | fine-tuned on lattices without probability scores
R+L+S | fine-tuned on lattices with probability scores
The settings in Table 1 are trained for both Lattice-LSTMs and the Lattice Transformer. For a fair and comprehensive comparison, we also evaluate all algorithms on inputs of the four types. We initially train the baseline of our lattice transformer with the human transcripts on Fisher/Train data alone, which is equivalent to the modified transformer (Shaw et al., 2018). Then we fine-tune the pre-trained model with the 1-best hypothesis or word lattices (and probability scores) for either the Fisher or the Callhome dataset.
The source and target vocabularies are built respectively from the transcripts of the Fisher/Train and Callhome/Train corpora, with vocabulary sizes 32000 and 20391. The hyper-parameters of our model are the same as Transformer-base, with hidden size 512, 6 attention layers, 8 attention heads, and beam size 4. We use the same optimization strategy as (Vaswani et al., 2017) for pre-training with 4 GPU cards, and apply SGD with a constant learning rate of 0.15 for fine-tuning. We select the best-performing model based on Fisher/Dev or Callhome/Dev, and test on Fisher/Dev2, Fisher/Test, or Callhome/Test.
To better analyze the performance of our approach, we use an intensive cross-evaluation method, i.e., we feed 4 possible inputs to test different models. The cross-evaluation results are put into several 4 × 4 blocks in Tables 2 and 3. As discussed above, if the input is not an ASR lattice, evaluating the model R+L+S requires setting $w_m = s_f = s_b = 0$, $s_m = 1$. If the input is an ASR lattice but is fed into the other three models, the probability scores are in fact discarded.
Footnote 3: https://github.com/moses-smt/mosesdecoder

Results on Fisher and Callhome
We mainly compare our architecture with the previous Lattice-LSTMs (Sperber et al., 2017) and the transformer (Shaw et al., 2018) in Table 2. Since the transformer itself is a powerful architecture for sequence modeling, the baseline (R) already improves BLEU significantly on the test sets. In addition, fine-tuning on lattices without scores does not outperform fine-tuning on the 1-best hypothesis, but gives about 0.5 BLEU improvement on oracle and transcript inputs. We suspect this may be due to the high ASR WER; if the ASR system had a lower WER, fine-tuning on lattices without scores might yield better translations. We leave this as a future research direction on other datasets from better ASR systems. For now, we validate this argument only in the BPE lattice experiments; a detailed discussion is given in the next section. Fine-tuning with both lattices and probability scores increases BLEU by a relatively large margin of 0.9/1.0/0.7 on the Fisher Dev/Dev2/Test sets. Besides, for ASR 1-best inputs, it is still comparable with the R+1 systems, while for oracle and transcript inputs, there are about 0.5-0.9 BLEU improvements.
The results on the Callhome dataset are all fine-tuned from the model pre-trained on the Fisher/Train corpus, since the Callhome data is too small to train a large deep learning model. This is why we adopt a domain adaptation strategy. We use the same method for model selection and testing. The detailed results in Table 3 show consistent performance improvements.

Inference Analysis
On the test datasets of Fisher and Callhome, we run inference to predict translations, and some examples are shown in Table 4. We also visualize the alignment of both the encoder self-attention and the encoder-decoder attention for the input and predicted translation. Two examples are illustrated in Figures 4 and 5. As expected, tokens from different paths do not attend to each other, e.g., "pero" and "perdón" in Figure 4 or "hoy" and "y" in Figure 5. In Figure 4, we observe that the 1-best hypothesis can even result in the erroneous translation "sorry, sorry", which is supposed to be "but in peru". In Figure 5, the translation from the 1-best hypothesis obviously misses the important information "i heard it". We primarily attribute such errors to the insufficient information within the 1-best hypothesis; if the lattice transformer is appropriately trained, the translations from lattice inputs can possibly correct them. Due to limited space, more visualization examples can be found in the supplementary material.

Model Ablation Study
We conduct an ablation study to examine the effectiveness of every module in the lattice transformer. We gradually add one module at a time, from a standard transformer model to our most complicated lattice transformer. From the results in Table 5, we can see that applying marginal scores in the encoder or decoder has the largest impact on lattice fine-tuning. Furthermore, applying marginal scores in both the encoder and the decoder yields an additional gain compared to the individual applications. However, the use of forward and backward scores brings no significant extra benefit in this situation. Perhaps due to overfitting, the most complicated lattice transformer cannot achieve better BLEU than simpler models on the smaller Callhome dataset.

Chinese English Text Translation
In this experiment, we demonstrate the performance of our lattice transformer when the probability scores are unavailable.The comparison baseline method is the vanilla transformer (Vaswani et al., 2017) in both base and big settings.

Datasets and Settings
The Chinese to English parallel corpus for the WMT 2017 news task contains about 20 million sentences after deduplication. For Chinese word segmentation, we use Jieba as the baseline (Zhang et al., 2018; Hassan et al., 2018), while the English sentences are tokenized by the Moses tokenizer. Some data filtering has been applied: we keep sentence pairs whose source-to-target length ratio lies within [1/3, 3] and whose token count on each side is at most 200.
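The two filtering criteria can be written as a small predicate. This is a minimal sketch with a hypothetical function name; whether the original pipeline measured lengths in tokens for the ratio test, and how it treated empty sides, are assumptions.

```python
def keep_pair(src_tokens, tgt_tokens, max_ratio=3.0, max_len=200):
    """Keep a sentence pair if both length criteria from the text hold."""
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)
    if n_src == 0 or n_tgt == 0:
        return False  # assumption: empty sides are dropped
    if n_src > max_len or n_tgt > max_len:
        return False
    return 1.0 / max_ratio <= n_src / n_tgt <= max_ratio


pairs = [
    (["a"] * 10, ["b"] * 12),   # balanced lengths: kept
    (["a"] * 40, ["b"] * 10),   # ratio 4 > 3: dropped
    (["a"] * 201, ["b"] * 150), # source longer than 200 tokens: dropped
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
```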
Then for the Chinese source corpus, we learn the BPE tokenization with 16K / 32K / 48K operations, while for the English target corpus, we only learn the BPE tokenization with 32K operations.In this way, each Chinese input can be represented as three different tokenized results, thus being ready to construct a word lattice.
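The text does not describe how the three segmentations are merged into a lattice, so the following is only one plausible construction, not the authors' procedure: every distinct (start, end, token) character span becomes a node, and an edge links two spans when the first ends where the second begins. The function name and the Latin-alphabet toy input are illustrative.

```python
def segmentation_lattice(segmentations):
    """Merge several segmentations of the same string into one span lattice."""
    spans = set()
    for tokens in segmentations:
        start = 0
        for token in tokens:
            spans.add((start, start + len(token), token))
            start += len(token)
    # Sorting by (start, end) yields a topological order because tokens are non-empty.
    nodes = sorted(spans)
    edges = [(a, b)
             for a, span_a in enumerate(nodes)
             for b, span_b in enumerate(nodes)
             if span_a[1] == span_b[0]]
    return nodes, edges


# Three hypothetical segmentations of the same string "abcd".
nodes, edges = segmentation_lattice([["ab", "cd"], ["a", "b", "cd"], ["abcd"]])
```

Start and end tokens would still need to be added, as described in the lattice representation section, so that the graph has a single source and a single end node. Note that this union can also admit paths that mix pieces from different segmentations.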
The hyper-parameters of our model are the same as the setting for speech translation in the previous experiments. We follow the optimization convention in (Vaswani et al., 2017) and use the Adam optimizer with Noam inverse square root decay. All of our lattice transformers are trained on 4 P-100 GPU cards. As in the baseline methods we compare against, detokenized case-sensitive BLEU is reported in our experiment.
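For reference, the learning-rate schedule from Vaswani et al. (2017), commonly called the Noam schedule, can be sketched as below. The formula is the published one; the default of 4000 warmup steps comes from that paper and is only assumed here, since this text does not state the value used.

```python
def noam_learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse square root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


schedule = [noam_learning_rate(s) for s in (1, 1000, 4000, 16000)]
```

The rate grows linearly until `warmup_steps` and then decays proportionally to the inverse square root of the step number, so it peaks at the end of warmup.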

Results
For our lattice transformer, we have three models trained for comparison. First, we use the 32K BPE Chinese corpus alone to train our lattice transformer, which is equivalent to the standard transformer with relative positional embeddings. Secondly, we train another lattice transformer with the word lattice corpus from scratch. In addition, we follow the convention of the speech translation task in the previous experiments by fine-tuning the first model with the word lattice corpus. For each setting, the model evaluated on the test2017 dataset is the one that performs best on the dev2017 data. The fine-tuning of Lattice Model 3 starts from the best checkpoint of Lattice Model 1. The BLEU evaluation is shown in Table 6, and two examples of attention visualization are shown in Figure 6. Notice that the first two results, for transformer-base and transformer-big, are directly copied from the relevant references. From the results, we can see that our Model 1 in a base setting is comparable with the vanilla transformer-big model, and significantly better than the transformer-base model. We also validate the argument that training from scratch can achieve a better result than most baselines. Empirically, we find an interesting phenomenon: training from scratch converges faster than the other settings.

Conclusions
In this paper, we propose a novel lattice transformer architecture with a controllable lattice attention mechanism that can consume a word lattice and probability scores from the ASR system. The proposed approach applies naturally to both the encoder self-attention and the encoder-decoder attention. We mainly validate our lattice transformer on the speech translation task, and additionally demonstrate its generalization to text translation on the WMT 2017 Chinese-English translation task. In general, the lattice transformer can increase BLEU for translation tasks by a significant margin over many baselines.

Figure 1: Illustration of our proposed attention mechanism (best viewed in color).Our attention depends on the tokens of common paths and forward (blue) / marginal (grey) / backward (orange) probability scores.

Figure 3: Left panel: the controllable lattice attention, where s m , s f , s b are learnable scalars and s m + s f + s b = 1.Right panel: the overall model architecture of lattice transformer.

Figure 6: Attention visualization for Chinese English translation task.

Table 2: Cross-Evaluation of BLEU on Fisher. Note that for the lattice transformer architecture with the R or R+1 setting, the resulting model is equivalent to a standard transformer with relative positional embeddings. The evaluation of oracle inputs is similar to ASR 1-best, but it can indicate an upper bound of the performance. The evaluation results of Lattice LSTM on Fisher dev are not reported in (Sperber et al., 2017).

Table 3: Cross-Evaluation of BLEU on Callhome.

src transcript: qué tal , eh , yo soy guillermo , ¿ cómo estás ? porque como esto tiene que ir avanzando ¿ no ? pues , ¿ y llevas muchos años aquì en atlanta ? quererlo y tener fe .
tgt reference: how are you , eh i 'm guillermo , how are you ? because like this has to be moving forward , no ? well . and you 've been many years here in atlanta ? to love him and have faith .
ASR 1-best: quedar eh yo soy guillermo cómo estás porque como esto tiene que ir avanzando no país lleva muchos años aquí en atlanta quieren lo y tener fe
mt from R+1: stay . eh , i 'm guillermo . how are you ? why do you have to move forward or not ? country has been many years here in atlanta they want to have faith
ASR lattice: quedar que qué eh yo soy dar eh yo tal eh yo soy guillermo cómo comprar con como está estás porque como esto tiene que ir avanzando no país well lleva lleva muchos años aquí en atlanta quieren quererlo lo y tener tenerse fe y tener tenerse fe
mt from R+L+S: how are you ? i 'm guillermo . how are you ? because since this has to move forward , right ? well , you 've been here many years in atlanta loving him and having faith

Table 4: Translation examples on test sets. Note that the presented ASR lattice does not include lattice information.

Table 5: Ablation Experiment BLEU Results. The rows of the Lattice LSTM and the Lattice Transformer represent the 1-best hypothesis fine-tuning, and the BLEUs are evaluated on 1-best inputs and on lattice inputs for the others. The colored BLEU values come from Tables 2 and 3.

Table 6: BLEU on WMT 2017 Chinese-English