Learning Spoken Language Representations with Neural Lattice Language Modeling

Pre-trained language models have achieved large improvements on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. This paper therefore aims to generalize the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demand for speech data and is more efficient. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs. The code is available at https://github.com/MiuLab/Lattice-ELMo.


Introduction
The task of spoken language understanding (SLU) aims at extracting useful information from spoken utterances. Typically, SLU is decomposed into two stages: 1) an accurate automatic speech recognition (ASR) system transcribes the input speech into text, and then 2) language understanding techniques are applied to the transcribed text. These two modules can be developed separately, so most prior work developed the backend language understanding systems based on manual transcripts (Guo et al., 2014; Mesnil et al., 2014; Goo et al., 2018).
Despite the simplicity of the two-stage method, prior work showed that a tighter integration between the two components can lead to better performance. Researchers have extended the ASR 1-best results to n-best lists or word confusion networks in order to preserve the ambiguity of the transcripts (Tur et al., 2002; Hakkani-Tür et al., 2006; Henderson et al., 2012; Tür et al., 2013; Masumura et al., 2018). Another line of research focused on using lattices produced by ASR systems. Lattices are directed acyclic graphs (DAGs) that represent multiple recognition hypotheses. An example of an ASR lattice is shown in Figure 1. Ladhak et al. (2016) introduced LatticeRNN, which generalizes RNNs to lattice-structured inputs. Other work extended the transformer model (Vaswani et al., 2017) to consume lattice inputs for machine translation. Huang and Chen (2019) proposed adapting a transformer model originally pre-trained on written text to consume lattices in order to improve SLU performance. Buckman and Neubig (2018) also found that utilizing lattices that represent multiple granularities of sentences can improve language modeling.
With the recent introduction of large pre-trained language models (LMs) such as ELMo (Peters et al., 2018), GPT (Radford, 2018) and BERT (Devlin et al., 2019), we have observed large improvements on natural language understanding tasks. These models are pre-trained on large amounts of written text so that they provide the downstream tasks with high-quality representations. However, applying these models to spoken scenarios introduces several discrepancies between the pre-training task and the target task, such as the domain mismatch between written text and spoken utterances with ASR errors. It has been shown that fine-tuning the pre-trained language models on data from the target tasks can mitigate the domain mismatch problem (Howard and Ruder, 2018; Chronopoulou et al., 2019). Siddhant et al. (2018) focused on pre-training a language model specifically for spoken content with a large number of automatic transcripts, which requires a large collection of in-domain speech.
In this paper, we propose a novel spoken language representation learning framework, which focuses on learning contextualized representations of lattices based on our proposed lattice language modeling objective. The proposed framework consists of two stages of LM pre-training to reduce the demand for lattice data. We conduct experiments on benchmark datasets for spoken language understanding, including intent classification and dialogue act recognition. The proposed method consistently achieves superior performance, with relative error reduction ranging from 3% to 42% compared to a pre-trained sequential LM.

Neural Lattice Language Model
This section details the proposed two-stage framework for learning contextualized representations of spoken language.

Problem Formulation
In the SLU task, the model input is an utterance $X$ containing a sequence of words $X = [x_1, x_2, \cdots, x_{|X|}]$, and the goal is to map $X$ to its corresponding class $y$. The inputs can also be stored in lattice form; we use edge-labeled lattices in this work. A lattice $L = \{N, E\}$ is defined by a set of $|N|$ nodes $N = \{n_1, n_2, \cdots, n_{|N|}\}$ and a set of $|E|$ transitions $E = \{e_1, e_2, \cdots, e_{|E|}\}$. A weighted transition is defined as $e = \{\mathrm{prev}[e], \mathrm{next}[e], w[e], P(e)\}$, where $\mathrm{prev}[e]$ and $\mathrm{next}[e]$ denote the previous node and next node respectively, $w[e]$ denotes the associated word, and $P(e)$ denotes the transition probability. We use $\mathrm{in}[n]$ and $\mathrm{out}[n]$ to denote the sets of incoming and outgoing transitions of a node $n$. $L_{<n} = \{N_{<n}, E_{<n}\}$ denotes the sub-lattice which consists of all paths between the starting node and a node $n$.
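The lattice definition above maps naturally onto a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Transition` and `Lattice`, the helper `linear_chain`, and the toy words and probabilities are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    prev: int    # prev[e]: index of the source node
    next: int    # next[e]: index of the target node
    word: str    # w[e]: word carried by the transition
    prob: float  # P(e): transition probability

class Lattice:
    """Edge-labeled lattice L = {N, E} with in[n] and out[n] lookups."""

    def __init__(self, num_nodes, transitions):
        self.num_nodes = num_nodes
        self.transitions = list(transitions)
        self.incoming = defaultdict(list)  # in[n]
        self.outgoing = defaultdict(list)  # out[n]
        for e in self.transitions:
            self.outgoing[e.prev].append(e)
            self.incoming[e.next].append(e)

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph has a cycle."""
        indegree = [len(self.incoming[n]) for n in range(self.num_nodes)]
        ready = [n for n in range(self.num_nodes) if indegree[n] == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for e in self.outgoing[n]:
                indegree[e.next] -= 1
                if indegree[e.next] == 0:
                    ready.append(e.next)
        if len(order) != self.num_nodes:
            raise ValueError("lattice is not acyclic")
        return order

def linear_chain(words):
    """Represent a plain word sequence as a lattice with a single path."""
    return Lattice(len(words) + 1,
                   [Transition(i, i + 1, w, 1.0) for i, w in enumerate(words)])

# Hypothetical toy lattice with two competing hypotheses between nodes 1 and 2.
toy = Lattice(4, [
    Transition(0, 1, "a", 1.0),
    Transition(1, 2, "flight", 0.7),
    Transition(1, 2, "fight", 0.3),
    Transition(2, 3, "to", 1.0),
])
```

Storing `in[n]` and `out[n]` as adjacency lists keeps both the topological traversal and the per-node pooling step linear in the number of transitions.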

LatticeRNN
The LatticeRNN (Ladhak et al., 2016) model generalizes sequential RNNs to lattice-structured inputs. It traverses the nodes and transitions of a lattice in topological order. For each transition $e$, LatticeRNN takes $w[e]$ as input and the representation of its previous node $h[\mathrm{prev}[e]]$ as the previous hidden state, and then produces a new hidden state $h[e]$ for $e$. The representation of a node $h[n]$ is obtained by pooling the hidden states of its incoming transitions. In this work, we employ the WeightedPool variant proposed by Ladhak et al. (2016), which computes the node representation as

$$h[n] = \sum_{e \in \mathrm{in}[n]} P(e) \cdot h[e].$$

Note that we can represent any sequential text as a linear-chain lattice, so LatticeRNN can be seen as a strict generalization of RNNs to DAG-like structures. This property enables us to initialize the weights of a LatticeRNN with the weights of an RNN as long as they use the same recurrent cell.
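The traversal described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: it uses a plain tanh (Elman) cell in place of an LSTM, takes WeightedPool to be the probability-weighted sum of incoming transition states, assumes nodes are already numbered in topological order with node 0 as the start node, and uses made-up embeddings and weights.

```python
import math

def rnn_cell(x, h_prev, W_x, W_h):
    """Elman cell tanh(W_x x + W_h h_prev), a stand-in for any recurrent cell."""
    return [
        math.tanh(
            sum(a * b for a, b in zip(W_x[j], x))
            + sum(a * b for a, b in zip(W_h[j], h_prev))
        )
        for j in range(len(W_h))
    ]

def lattice_rnn(num_nodes, transitions, embed, W_x, W_h):
    """Compute h[n] for every node.

    transitions: (prev, next, word, prob) tuples with prev < next, i.e. the
    node indices are assumed to follow a topological order.
    """
    hidden = len(W_h)
    incoming = {n: [] for n in range(num_nodes)}
    for e in transitions:
        incoming[e[1]].append(e)
    h_node = {0: [0.0] * hidden}  # the start node gets a zero state
    for n in range(1, num_nodes):
        pooled = [0.0] * hidden
        for prev, _, word, prob in incoming[n]:
            h_edge = rnn_cell(embed[word], h_node[prev], W_x, W_h)   # h[e]
            pooled = [p + prob * h for p, h in zip(pooled, h_edge)]  # weighted pooling
        h_node[n] = pooled
    return h_node

# Hypothetical 2-dimensional embeddings and weights, for illustration only.
EMBED = {"a": [1.0, 0.0], "flight": [0.0, 1.0], "fight": [0.5, 0.5]}
W_X = [[0.5, -0.2], [0.1, 0.3]]
W_H = [[0.2, 0.0], [0.0, 0.2]]

lattice = [(0, 1, "a", 1.0), (1, 2, "flight", 0.7), (1, 2, "fight", 0.3)]
states = lattice_rnn(3, lattice, EMBED, W_X, W_H)
```

On a linear-chain lattice every node has exactly one incoming transition with probability 1, so the pooled state equals the ordinary RNN state. This is the property that lets sequential weights be reused.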

Lattice Language Modeling
Language models usually estimate $p(X)$ by factorizing it into

$$p(X) = \prod_{t=1}^{|X|} p(x_t \mid X_{<t}),$$

where $X_{<t} = [x_1, \cdots, x_{t-1}]$ denotes the previous context. Training an LM is essentially asking the model to predict a distribution over the next word given the previous words. We extend the sequential LM analogously to lattice language modeling, where the model is expected to predict the next transitions of a node $n$ given $L_{<n}$. The ground truth distribution is therefore defined as:

$$p(w \mid L_{<n}) = \begin{cases} P(e), & \text{if } \exists\, e \in \mathrm{out}[n] \text{ s.t. } w[e] = w, \\ 0, & \text{otherwise.} \end{cases}$$

LatticeRNN is adopted as the backbone of our lattice language model. Since the node representation $h[n]$ encodes all information of $L_{<n}$, we pass $h[n]$ to a linear decoder to obtain the distribution of next transitions:

$$p_\theta(w \mid h[n]) = \frac{\exp(W_w^\top h[n])}{\sum_{w'} \exp(W_{w'}^\top h[n])},$$

where $\theta$ denotes the parameters of the LatticeRNN and $W$ denotes the trainable parameters of the decoder, with $W_w$ the decoder weight vector for word $w$. We train our lattice language model by minimizing the KL divergence between the ground truth distribution $p(w \mid L_{<n})$ and the predicted distribution $p_\theta(w \mid h[n])$.
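To make the training objective concrete, the sketch below builds the target distribution over next words from a node's outgoing transitions and computes the KL divergence against a softmax prediction. It is a hedged illustration rather than the authors' code: the function names, the toy vocabulary, and the logits are hypothetical, and summing the probabilities of outgoing transitions that share a word is an assumption made here for the duplicate-word case.

```python
import math

def target_distribution(outgoing, vocab):
    """p(w | L_<n): probability mass of the outgoing transitions, per word.

    outgoing: (word, prob) pairs for the transitions in out[n]. Summing the
    mass of transitions that share a word is an assumption of this sketch.
    """
    target = {w: 0.0 for w in vocab}
    for word, prob in outgoing:
        target[word] += prob
    return target

def softmax(logits):
    """Numerically stable softmax over a {word: logit} mapping."""
    m = max(logits.values())
    exps = {w: math.exp(z - m) for w, z in logits.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}

def kl_divergence(target, predicted):
    """KL(target || predicted); words with zero target mass contribute 0."""
    return sum(p * math.log(p / predicted[w])
               for w, p in target.items() if p > 0.0)

# Hypothetical node whose outgoing transitions carry "flight" and "fight".
VOCAB = ["flight", "fight", "to"]
target = target_distribution([("flight", 0.7), ("fight", 0.3)], VOCAB)
predicted = softmax({"flight": 2.0, "fight": 0.5, "to": -1.0})  # stand-in decoder logits
loss = kl_divergence(target, predicted)
```

When a node has a single outgoing transition with probability 1, the target is one-hot and the KL divergence reduces to the usual negative log-likelihood of the next word, which matches the sequential LM objective.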
Note that the objective for training a sequential LM is a special case of the lattice language modeling objective defined above, where the inputs are linear-chain lattices. Hence, a sequential LM can be viewed as a lattice LM trained on linear-chain lattices only. This property inspires us to pre-train our lattice LM in the two-stage fashion described below.

Two-Stage Pre-Training
Inspired by ULMFiT (Howard and Ruder, 2018), we propose a two-stage pre-training method to train our lattice language model. The proposed method is illustrated in Figure 2.
• Stage 1: Pre-train on sequential texts. In the first stage, we follow the recent trend of pre-trained LMs by pre-training a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) LM on a general-domain text corpus. Here the cell architecture is the same as in ELMo (Peters et al., 2018).
• Stage 2: Pre-train on lattices. In this stage, we use a bidirectional LatticeLSTM with the same cell architecture as the LSTM pre-trained in the previous stage. Note that in the backward direction we use reversed lattices as input. We initialize the weights of the LatticeLSTM with the weights of the pre-trained LSTM. The LatticeLSTM is further pre-trained on lattices from the training set of the target task with the lattice language modeling objective described above.
We consider this two-stage method more approachable and efficient than directly pre-training a lattice LM on a large number of lattices because 1) general-domain written data is much easier to collect than lattices, which require spoken data, and 2) LatticeRNNs are less efficient than RNNs because their computation is difficult to parallelize.
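The hand-off between the two stages amounts to copying the sequential LM's parameters into a lattice LM that shares the same cell architecture. The sketch below illustrates that check-and-copy step under simplifying assumptions: parameters are plain nested lists keyed by name, and the names `shape`, `init_from_sequential`, `W_x` and `W_h` are hypothetical rather than taken from the released code.

```python
import copy

def shape(param):
    """Shape of a nested-list parameter, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(param, list):
        dims.append(len(param))
        if not param:
            break
        param = param[0]
    return tuple(dims)

def init_from_sequential(lattice_params, sequential_params):
    """Return lattice LM parameters initialized from a sequential LM.

    Both arguments map parameter names to nested lists. The copy is only
    valid when the two models use the same recurrent cell, so parameter
    names and shapes are checked first.
    """
    if lattice_params.keys() != sequential_params.keys():
        raise ValueError("cell architectures differ: parameter names do not match")
    for name in lattice_params:
        if shape(lattice_params[name]) != shape(sequential_params[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
    return copy.deepcopy(sequential_params)

# Hypothetical stage 1 result (sequential LM) and a freshly created lattice LM.
stage1 = {"W_x": [[0.5, -0.2], [0.1, 0.3]], "W_h": [[0.2, 0.0], [0.0, 0.2]]}
fresh = {"W_x": [[0.0, 0.0], [0.0, 0.0]], "W_h": [[0.0, 0.0], [0.0, 0.0]]}
stage2_init = init_from_sequential(fresh, stage1)
```

Stage 2 would then continue training from `stage2_init` on lattices. The deep copy keeps the stage 1 parameters untouched, which is convenient when comparing against the sequential baseline.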

Target Task Classifier Training
After pre-training, our model is capable of providing representations for lattices. Following Peters et al. (2018), the pre-trained lattice LM is used to produce contextualized node embeddings for downstream classification tasks, as illustrated in the right part of Figure 2. We use the same strategy as Peters et al. (2018) to linearly combine the hidden states from different layers into a representation for each node. The classifier is a newly added 2-layer LatticeLSTM, which takes the node representations as input, followed by max-pooling over nodes, a linear layer and finally a softmax layer. We use the cross-entropy loss to train the classifier on each target classification task. Note that the parameters of the pre-trained lattice LM are fixed during this stage.
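The classifier pipeline described above can be illustrated with a simplified sketch. Several assumptions apply: the 2-layer LatticeLSTM is omitted so that the pooling and output steps stay visible, the layer combination follows the softmax-weighted scalar mix of ELMo, and all function names and numbers are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, layer_logits, gamma=1.0):
    """ELMo-style mix: gamma * sum_j softmax(layer_logits)[j] * layers[j]."""
    s = softmax(layer_logits)
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
            for d in range(dim)]

def max_pool(node_vectors):
    """Element-wise maximum over all node representations."""
    return [max(column) for column in zip(*node_vectors)]

def classify(node_layers, layer_logits, W, b):
    """node_layers[n][j] is the layer-j hidden state of node n.

    Returns class probabilities. The 2-layer LatticeLSTM that sits between
    the mix and the pooling in the paper is left out of this sketch.
    """
    nodes = [scalar_mix(layers, layer_logits) for layers in node_layers]
    pooled = max_pool(nodes)
    logits = [sum(w * x for w, x in zip(row, pooled)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Two nodes, two LM layers, hidden size 2, three classes (all values made up).
node_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [2.0, 1.0]],
]
probs = classify(node_layers, layer_logits=[0.0, 0.0],
                 W=[[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]], b=[0.0, 0.1, -0.1])
```

Because the pre-trained lattice LM is frozen at this stage, only the mixing weights and the classifier parameters (`layer_logits`, `gamma`, `W` and `b` in this sketch) would receive gradient updates.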

Experiments
To evaluate the quality of the pre-trained lattice LM, we conduct experiments on two common tasks in spoken language understanding.

Tasks and Datasets
Intent detection and dialogue act recognition are two common tasks in spoken language understanding. The benchmark datasets used for intent detection are ATIS (Airline Travel Information Systems) (Hemphill et al., 1990; Dahl et al., 1994; Tur et al., 2010) and SNIPS (Coucke et al., 2018). We use the NXT-format of the Switchboard (Stolcke et al., 2000) Dialogue Act Corpus (SWDA) (Calhoun et al., 2010) and the ICSI Meeting Recorder Dialogue Act Corpus (MRDA) (Shriberg et al., 2004) for benchmarking dialogue act recognition. The SNIPS corpus only contains written text, so we synthesize a spoken version of the dataset using a commercial text-to-speech service. We use an ASR system trained on WSJ (Paul and Baker, 1992) with Kaldi (Povey et al., 2011) to transcribe ATIS, and an ASR system released by Kaldi to transcribe the other datasets. The statistics of the datasets are summarized in Table 1. All tasks are evaluated with overall classification accuracy.

Model and Training Details
To conduct a fair comparison with ELMo (Peters et al., 2018), we directly adopt their pre-trained model as our pre-trained sequential LM. The hidden size of the LatticeLSTM classifier is set to 300. We use Adam as the optimizer with a learning rate of 0.0001 for LM pre-training and 0.001 for training the classifier. The checkpoint with the best validation accuracy is used for evaluation.

Results
The results in terms of classification accuracy are shown in Table 2. All reported numbers are averaged over at least three training runs. Rows (a) and (b) can be considered the performance upper bound, where we use manual transcripts to train and evaluate the models. We also use BERT-base (Devlin et al., 2019) as a strong baseline, which takes ASR 1-best as the input (row (g)).
Compared with the results on manual transcripts, using ASR results largely degrades the performance due to recognition errors, as shown in rows (e)-(g). In addition, adding pre-trained ELMo embeddings brings consistent improvement over the biLSTM baseline, except for SNIPS when using manual transcripts (row (b)). The baseline models trained on ASR 1-best are also evaluated on lattice oracle paths. We report the results as the performance upper bound for the baseline models (rows (c)-(d)).
In the lattice setting, the baseline bidirectional LatticeLSTM (Ladhak et al., 2016) (row (h)) consistently outperforms the biLSTM with 1-best input (row (e)), demonstrating the importance of taking lattices into account. Our proposed method achieves the best results on all datasets except for ATIS (row (i)), with relative error reduction ranging from 3.2% to 42% compared to biLSTM+ELMo (row (f)). The proposed method also achieves performance comparable to BERT-base on ATIS. We perform an ablation study for the proposed two-stage pre-training method and report the results in rows (j) and (k). It is clear that skipping either stage degrades the performance on all datasets, demonstrating that both stages are crucial in the proposed framework. We also evaluate the proposed model on 1-best results (row (l)). The results show that it is still beneficial to use lattices as input after fine-tuning.
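For readers unfamiliar with the metric, relative error reduction compares error rates rather than accuracies. The short sketch below shows the arithmetic, assuming the standard definition; the function name is hypothetical and the accuracies in the example are made up for illustration, not taken from Table 2.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error reduction for accuracies given in percent.

    Defined here as (baseline_error - new_error) / baseline_error,
    with error = 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - new_err) / baseline_err

# Hypothetical accuracies: moving from 90% to 92% removes a fifth of the errors.
example = relative_error_reduction(90.0, 92.0)
```

A two-point accuracy gain therefore corresponds to a larger relative error reduction when the baseline is already accurate, which is why the reported range is wide.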

Conclusion
In this paper, we propose a spoken language representation learning framework that learns contextualized representations of lattices. We introduce the lattice language modeling objective and a two-stage pre-training method that efficiently trains a neural lattice language model to provide the downstream tasks with contextualized lattice representations. The experiments show that our proposed framework is capable of providing high-quality representations of lattices, yielding consistent improvement on SLU tasks.