Neural Speech Translation using Lattice Transformations and Graph Networks

Speech translation systems usually follow a pipeline approach, using word lattices as an intermediate representation. However, previous work assumes access to the original transcriptions used to train the ASR system, which can limit applicability in real scenarios. In this work we propose an approach for speech translation through lattice transformations and neural models based on graph networks. Experimental results show that our approach reaches competitive performance without relying on transcriptions, while also being orders of magnitude faster than previous work.


Introduction
Translation from speech utterances is a challenging problem that has been studied both under statistical, symbolic approaches (Ney, 1999; Casacuberta et al., 2004; Kumar et al., 2015) and more recently using neural models (Sperber et al., 2017). Most previous work relies on pipeline approaches, using the output of an automatic speech recognition (ASR) system as the input to a machine translation (MT) one. These inputs can be simply the 1-best sentence returned by the ASR system or a more structured representation such as a lattice.
Some recent work on end-to-end systems bypasses the need for intermediate representations, with impressive results (Weiss et al., 2017). However, such a scenario has drawbacks. From a practical perspective, it requires access to the original speech utterances and transcriptions, which can be unrealistic if a user needs to employ an out-of-the-box ASR system. From a theoretical perspective, intermediate representations such as lattices can be enriched through external, textual resources such as monolingual corpora or dictionaries.

Sperber et al. (2017) propose a lattice-to-sequence model which, in theory, can address both problems above. However, their model trains slowly due to the lack of efficient batching procedures, and they rely on transcriptions for pretraining. In this work, we address these two problems by applying lattice transformations and graph networks as encoders. More specifically, we enrich the lattices by applying subword segmentation using byte-pair encoding (Sennrich et al., 2016, BPE) and perform a minimisation step to remove redundant nodes arising from this procedure. Together with the standard batching strategies enabled by graph networks, we decrease training time by two orders of magnitude, which lets us match their translation performance under the same training speed constraints without relying on gold transcriptions.

Approach
Many graph network options exist in the literature (Bruna et al., 2014; Duvenaud et al., 2015; Kipf and Welling, 2017; Gilmer et al., 2017): in this work we opt for a Gated Graph Neural Network (Li et al., 2016, GGNN), which was recently incorporated into an encoder-decoder architecture by Beck et al. (2018). Assume a directed graph, where $\mathcal{L}_V$ and $\mathcal{L}_E$ are respectively vocabularies for nodes and edges, from which node and edge labels ($\ell_v$ and $\ell_e$) are defined. Given an input graph with node embeddings $\mathbf{X}$, a GGNN is defined as

$$
\begin{aligned}
\mathbf{h}^{0}_{v} &= \mathbf{x}_v \\
\mathbf{r}^{t}_{v} &= \sigma\Big( c^{r}_{v} \sum_{u \in N(v)} \mathbf{W}^{r}_{\ell_e} \mathbf{h}^{t-1}_{u} + \mathbf{b}^{r}_{\ell_e} \Big) \\
\mathbf{z}^{t}_{v} &= \sigma\Big( c^{z}_{v} \sum_{u \in N(v)} \mathbf{W}^{z}_{\ell_e} \mathbf{h}^{t-1}_{u} + \mathbf{b}^{z}_{\ell_e} \Big) \\
\widetilde{\mathbf{h}}^{t}_{v} &= \rho\Big( c_{v} \sum_{u \in N(v)} \mathbf{W}_{\ell_e} \big( \mathbf{r}^{t}_{u} \odot \mathbf{h}^{t-1}_{u} \big) + \mathbf{b}_{\ell_e} \Big) \\
\mathbf{h}^{t}_{v} &= (1 - \mathbf{z}^{t}_{v}) \odot \mathbf{h}^{t-1}_{v} + \mathbf{z}^{t}_{v} \odot \widetilde{\mathbf{h}}^{t}_{v}
\end{aligned}
$$

where $e = (u, v, \ell_e)$ is the edge between nodes $u$ and $v$, $N(v)$ is the set of neighbour nodes for $v$, $\rho$ is a non-linear function, $\sigma$ is the sigmoid function and $c_{v} = c^{r}_{v} = c^{z}_{v} = |N(v)|^{-1}$ are normalisation constants. Intuitively, a GGNN reduces to a GRU (Cho et al., 2014) if the graph is a linear chain. Therefore, the GGNN acts as a generalised encoder that updates nodes according to their neighbourhood. Multiple layers can be stacked, allowing information to be propagated through longer paths in the graph. Batching can be done by using adjacency matrices and matrix operations to perform the updates, enabling efficient processing on a GPU.
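To make the propagation rule concrete, below is a minimal single-layer GGNN sketch in PyTorch. This is our illustration rather than the paper's implementation: it assumes dense 0/1 adjacency matrices (one per edge label) with in-degree normalisation, and the class and variable names are ours.

```python
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    """One GGNN propagation step: a GRU-style gated update where every
    node aggregates messages from its in-neighbours, with a separate
    weight matrix per edge label (a sketch, not the authors' code)."""

    def __init__(self, hidden_size, num_edge_labels):
        super().__init__()
        self.w_r = nn.ModuleList(nn.Linear(hidden_size, hidden_size)
                                 for _ in range(num_edge_labels))  # reset gate
        self.w_z = nn.ModuleList(nn.Linear(hidden_size, hidden_size)
                                 for _ in range(num_edge_labels))  # update gate
        self.w_h = nn.ModuleList(nn.Linear(hidden_size, hidden_size)
                                 for _ in range(num_edge_labels))  # candidate state

    def forward(self, h, adj):
        # h:   (num_nodes, hidden_size) current node states
        # adj: (num_edge_labels, num_nodes, num_nodes) with
        #      adj[l, v, u] = 1 iff there is an edge u -> v with label l
        degree = adj.sum(dim=(0, 2)).clamp(min=1.0).unsqueeze(-1)  # |N(v)|
        r = torch.sigmoid(sum(a @ w(h) for a, w in zip(adj, self.w_r)) / degree)
        z = torch.sigmoid(sum(a @ w(h) for a, w in zip(adj, self.w_z)) / degree)
        h_new = torch.tanh(sum(a @ w(r * h) for a, w in zip(adj, self.w_h)) / degree)
        return (1.0 - z) * h + z * h_new

# Toy usage: 3 nodes in a chain, 3 edge labels (forward, reverse, self).
layer = GGNNLayer(hidden_size=8, num_edge_labels=3)
h = torch.randn(3, 8)
adj = torch.zeros(3, 3, 3)
adj[0, 1, 0] = adj[0, 2, 1] = 1.0   # forward edges 0 -> 1 -> 2
adj[1] = adj[0].transpose(0, 1)     # reverse edges
adj[2] = torch.eye(3)               # self loops
for _ in range(2):                  # two stacked propagation steps
    h = layer(h, adj)
```

The dense formulation above is exactly what makes batching easy: the whole update is a handful of matrix products that run efficiently on a GPU.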

Lattice Transformations
As pointed out by Beck et al. (2018), GGNNs can suffer from parameter explosion when the edge label space is large, as the number of parameters is proportional to the number of edge labels. This is a problem for lattices, since most of the information is encoded on the edges. We tackle this problem by transforming the lattices into their corresponding line graphs, which swaps nodes and edges (a procedure also done in Sperber et al. (2017)). After this transformation, we also add start and end symbols, which enable the encoder to propagate information through all possible paths in the lattice. Importantly, we also remove node scores from the lattice in most of our experiments, but we revisit this idea in §3.3.
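The line-graph conversion itself is mechanical. Below is a sketch using networkx; storing words under a `word` attribute and using `<s>`/`</s>` as the start and end symbols are our conventions, not the paper's.

```python
import networkx as nx

def lattice_to_line_graph(lattice: nx.DiGraph) -> nx.DiGraph:
    """Swap edges and nodes: every word-labelled edge of the ASR lattice
    becomes a node, so the encoder needs only a few edge labels instead
    of one per vocabulary item (a sketch)."""
    line = nx.line_graph(lattice)  # line-graph nodes are (u, v) edge pairs
    for u, v in line.nodes:
        # Move the word from the original edge onto the new node.
        line.nodes[(u, v)]["word"] = lattice.edges[u, v]["word"]
    # Connect start/end symbols to all lattice entry and exit paths.
    entries = [n for n in line.nodes if lattice.in_degree(n[0]) == 0]
    exits = [n for n in line.nodes if lattice.out_degree(n[1]) == 0]
    line.add_node("<s>", word="<s>")
    line.add_node("</s>", word="</s>")
    line.add_edges_from(("<s>", n) for n in entries)
    line.add_edges_from((n, "</s>") for n in exits)
    return line
```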
Having lattices as inputs allows us to incorporate additional textual transformation steps. To showcase this, in this work we perform subword segmentation on the lattice nodes using BPE. If a node's word is not present in the subword vocabulary, we split it into subwords and connect them in a left-to-right manner.
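A sketch of this node-splitting step follows; `bpe_segment` stands in for any BPE segmenter that maps a word to a list of subword units (e.g. one trained with the subword-nmt toolkit), and the derived node names are arbitrary.

```python
def split_oov_node(graph, node, bpe_segment):
    """Replace an out-of-vocabulary node by a left-to-right chain of its
    BPE subwords, reconnecting all predecessors and successors (a sketch)."""
    pieces = bpe_segment(graph.nodes[node]["word"])
    if len(pieces) < 2:
        return  # already a single in-vocabulary unit
    preds = list(graph.predecessors(node))
    succs = list(graph.successors(node))
    graph.remove_node(node)
    chain = [f"{node}.{i}" for i in range(len(pieces))]
    for name, piece in zip(chain, pieces):
        graph.add_node(name, word=piece)
    for a, b in zip(chain, chain[1:]):   # left-to-right subword edges
        graph.add_edge(a, b)
    for p in preds:
        graph.add_edge(p, chain[0])
    for s in succs:
        graph.add_edge(chain[-1], s)
```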
The BPE segmentation can lead to redundant nodes in the lattice. Our next transformation step is a minimisation procedure, where such nodes are joined into a single node in the graph. To perform this step, we leverage an efficient algorithm for automata minimisation (Hopcroft, 1971), which traverses the graph detecting redundant nodes by using equivalence classes, running in O(n log n) time, where n is the number of nodes.
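As an illustration of the idea (though not of Hopcroft's algorithm itself), the sketch below repeatedly merges nodes that carry the same subword and share an identical successor set, i.e. a coarse equivalence-class pass run to a fixed point; the real O(n log n) implementation refines equivalence classes far more efficiently.

```python
from collections import defaultdict

def merge_redundant_nodes(graph):
    """Merge nodes with the same word label and identical successors,
    repeating until no merge applies (a simplified stand-in for
    Hopcroft-style minimisation on acyclic lattices)."""
    merged = True
    while merged:
        merged = False
        classes = defaultdict(list)
        for n in graph.nodes:
            signature = (graph.nodes[n]["word"], frozenset(graph.successors(n)))
            classes[signature].append(n)
        for members in classes.values():
            keep, *rest = members
            for n in rest:
                # Redirect incoming edges to the representative, then drop n.
                for p in list(graph.predecessors(n)):
                    graph.add_edge(p, keep)
                graph.remove_node(n)
                merged = True
```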
The final step adds reverse and self-loop edges to the lattice, where these new edges have specific parameters in the encoder. This eases the propagation of information and is standard practice when using graph networks as encoders (Bastings et al., 2017; Beck et al., 2018). We show an example of all the transformation steps in Figure 1.
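A sketch of this last step, assuming each edge's label is kept in a `label` attribute that the encoder maps to label-specific parameters:

```python
def add_reverse_and_self_loops(graph):
    """Label existing edges as 'forward' and add 'reverse' and 'self'
    edges, so information can flow in both directions (a sketch)."""
    for u, v in list(graph.edges):  # snapshot before adding new edges
        graph.edges[u, v]["label"] = "forward"
        graph.add_edge(v, u, label="reverse")
    for n in graph.nodes:
        graph.add_edge(n, n, label="self")
```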
In Figure 2 we show the architecture of our system, using the final lattice from Figure 1 as an example. Nodes are represented as embeddings that are updated according to the lattice structure, resulting in a set of hidden states as the output. Other components follow a standard seq2seq model, using a bilinear attention module (Luong et al., 2015) and a 2-layer LSTM (Hochreiter and Schmidhuber, 1997).

Table 1: Out-of-the-box scenario results, in BLEU scores. "L" corresponds to word lattice inputs; "L+S" and "L+S+M" correspond to lattices after subword segmentation and after minimisation, respectively. Each model is trained using 5 different seeds and we report BLEU (Papineni et al., 2001) results using the median performance according to the dev set and an ensemble of the 5 models. For the word-based models, we remove any tokens with frequency lower than 2 (as in Sperber et al. (2017)), while for subword models we do not perform any threshold pruning. We report all results on the Fisher "dev2" set.

Out-of-the-box ASR scenario
In this scenario we assume only lattices and 1-best outputs are available, simulating a setting where we do not have access to the transcriptions. Table 1 shows that results are consistent with previous work: lattices provide significant improvements over simply using the 1-best output. More importantly, the results also highlight the benefits of our proposed transformations, and we obtain the best ensemble performance using minimised lattices.

Table 2: Results with transcriptions, in BLEU scores. "L+S+M" corresponds to the same results as in Table 1 and "L+S+M+T" is the setting with gold transcriptions added to the training set.

Adding Transcriptions
The out-of-the-box results in §3.1 are arguably more general in terms of applicability in real scenarios. However, in order to compare with the state of the art, we also experiment with a scenario where we have access to the original Spanish transcriptions. To incorporate transcriptions into our model, we convert them into linear chain graphs after segmenting with BPE (see the sketch at the end of this subsection). With this, we can simply take the union of transcriptions and lattices into a single training set. We keep the dev and test sets with lattices only, as this emulates test time conditions.

The results shown in Table 2 are consistent with previous work: adding transcriptions further enhances system performance. We also slightly outperform Sperber et al. (2017) in the setting where they ignore lattice scores, as in our approach. Most importantly, we are able to reach those results while being two orders of magnitude faster at training time: Sperber et al. (2017) report taking 1.5 days per epoch, while our architecture can process each epoch in 15 minutes. The reason is that their model relies on CPU processing, while our GGNN-based model can be easily batched and computed on a GPU.
Given those differences in training time, it is worth mentioning that the best model in Sperber et al. (2017) is surpassed by our best ensemble using lattices only. This means that we can obtain state-of-the-art performance even in an out-of-the-box scenario, under the same training speed constraints. While there are other constraints that may be considered (such as parameter budget), we nevertheless believe this is an encouraging result for real world scenarios.
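For completeness, the promised sketch of the transcription conversion: a gold transcription becomes a linear chain graph whose nodes are its BPE subwords, joined left to right (the `bpe_segment` callable and the sentinel symbols are the same assumptions as before).

```python
import networkx as nx

def transcription_to_chain(transcription, bpe_segment):
    """Turn a gold transcription into a linear chain graph so it can be
    mixed with lattices in a single training set (a sketch)."""
    words = ["<s>"]
    for word in transcription.split():
        words.extend(bpe_segment(word))
    words.append("</s>")
    chain = nx.DiGraph()
    for i, word in enumerate(words):
        chain.add_node(i, word=word)
    for i in range(len(words) - 1):  # left-to-right edges only
        chain.add_edge(i, i + 1)
    return chain
```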

Adding Lattice Scores
Our approach is not without limitations. In particular, the GGNN encoder ignores lattice scores, which can help the model disambiguate between different paths in the lattice. As a simple first approach to incorporating scores, we embed each score with a multilayer perceptron that takes the scalar score as input. This, however, did not produce good results: performance dropped to 32.9 BLEU in the single model setting and 38.4 for the ensemble.
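The strategy we tried corresponds to the following sketch; the layer sizes and the choice to add the score embedding to the node embedding (rather than, say, concatenate it) are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class ScoreEmbedder(nn.Module):
    """Embed a scalar lattice score through a small MLP and inject it
    into the node representation (a sketch of the simple strategy
    described above, which did not help in our experiments)."""

    def __init__(self, hidden_size=256, score_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, score_dim),
            nn.Tanh(),
            nn.Linear(score_dim, hidden_size),
        )

    def forward(self, node_embeddings, scores):
        # node_embeddings: (num_nodes, hidden_size); scores: (num_nodes,)
        return node_embeddings + self.mlp(scores.unsqueeze(-1))
```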
It is worth noting that Sperber et al. (2017) take a more principled approach to incorporating scores, by modifying the attention module. This is arguably a better choice, since the scores can directly inform the decoder about the ambiguity in the lattice. Since this approach does not affect the encoder, it is theoretically possible to combine our GGNN encoder with their attention module; we leave this avenue for future work.

Conclusions and Future Work
In this work we proposed an architecture for lattice-to-string translation by treating lattices as general graphs and leveraging recent advances in neural networks for graphs. Compared to previous similar work, our model permits easy mini-batching and allows one to freely enrich the lattices with additional information, which we exploit by incorporating BPE segmentation and lattice minimisation. We show promising results and outperform baselines in speech translation, particularly in out-of-the-box ASR scenarios, where one has no access to transcriptions.
For future work, we plan to investigate better approaches to incorporate scores in the lattices. The approaches used by Sperber et al. (2017) can provide a starting point in this direction. The same minimisation procedures we employ can be adapted to weighted lattices (Eisner, 2003). Another important avenue is to explore this approach in low-resource scenarios such as ones involving endangered languages (Adams et al., 2017;Anastasopoulos and Chiang, 2018).
Acknowledgements
This work was partly supported by the Jelinek Memorial Summer Workshop on Speech and Language Technologies, hosted at Carnegie Mellon University and sponsored by Johns Hopkins University with unrestricted gifts from Amazon, Apple, Facebook, Google, and Microsoft.