A Bi-Model Based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling

Intent detection and slot filling are the two main tasks in building a spoken language understanding (SLU) system. Multiple deep learning based models have demonstrated good results on these tasks. The most effective algorithms are based on sequence to sequence models (or “encoder-decoder” models), and generate the intents and semantic tags using either separate models or a joint model. Most of the previous studies, however, either treat intent detection and slot filling as two separate parallel tasks, or use a sequence to sequence model to generate both semantic tags and intent. None of these approaches considers the cross-impact between the intent detection task and the slot filling task. In this paper, new Bi-model based RNN semantic frame parsing network structures are designed to perform the intent detection and slot filling tasks jointly, by considering their cross-impact on each other using two correlated bidirectional LSTMs (BLSTMs). Our Bi-model structure with a decoder achieves state-of-the-art results on the benchmark ATIS data, with about 0.5% improvement in intent accuracy and 0.9% improvement in slot filling.


Introduction
Research on spoken language understanding (SLU) systems has progressed extremely fast during the past decades. Two important tasks in an SLU system are intent detection and slot filling. These two tasks are normally treated as parallel tasks but may have cross-impact on each other. Intent detection is treated as an utterance classification problem, which can be modeled using conventional classifiers including regression, support vector machines (SVMs) or even deep neural networks (Haffner et al., 2003; Sarikaya et al., 2011). The slot filling task can be formulated as a sequence labeling problem, and the most popular approaches with good performance use conditional random fields (CRFs) and, in recent works, recurrent neural networks (RNNs) (Xu and Sarikaya, 2013).
Some works also suggest using one joint RNN model to generate the results of the two tasks together, by taking advantage of the sequence to sequence (Sutskever et al., 2014) (or encoder-decoder) model, which also gives decent results in the literature (Liu and Lane, 2016a).
In this paper, Bi-model based RNN structures are proposed to take the cross-impact between the two tasks into account, and hence further improve the performance of modeling an SLU system. These models generate the intent and semantic tags concurrently for each utterance. In our Bi-model structures, two task-networks are built, one for intent detection and one for slot filling. Each task-network includes one BLSTM with or without an LSTM decoder (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005).
The paper is organized as follows: In section 2, a brief overview of existing deep learning approaches for intent detection and slot filling is given. The newly proposed Bi-model based RNN approach is illustrated in detail in section 3. In section 4, two experiments on different datasets are presented. One is performed on the ATIS benchmark dataset, in order to demonstrate state-of-the-art results for both semantic parsing tasks. The other is performed on our internal multi-domain dataset, comparing our new algorithm with the best performing RNN based joint model in the literature for intent detection and slot filling.

Background
In this section, a brief background overview of deep learning and RNN based approaches to intent detection and slot filling is given. The joint model algorithm is also discussed for later comparison.

Deep neural network for intent detection
Using deep neural networks for intent detection is similar to a standard classification problem; the only difference is that the classifier is trained within a specific domain. For example, all data in the ATIS dataset is in the flight reservation domain, with 18 different intent labels. There are mainly two types of models that can be used: one is a feed-forward model that takes the average of all word vectors in an utterance as its input; the other is a recurrent neural network, which takes each word in an utterance as a vector one by one (Xu and Sarikaya, 2014).
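The feed-forward variant described above can be illustrated with a minimal sketch. The function names, the two-dimensional toy embeddings, and the single linear layer with softmax are assumptions made for illustration only; the text specifies just that the averaged word vectors form the classifier input.

```python
import math

def average_embedding(tokens, embeddings):
    """Average the word vectors of an utterance into one fixed-size vector."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_intent(tokens, embeddings, weights, biases):
    """Hypothetical one-layer classifier on top of the averaged embedding."""
    features = average_embedding(tokens, embeddings)
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Toy values, not taken from any trained model.
embeddings = {"book": [1.0, 2.0], "flight": [3.0, 4.0]}
weights = [[0.5, -0.5], [-0.5, 0.5]]   # one row per intent label
biases = [0.0, 0.0]
probs = classify_intent(["book", "flight"], embeddings, weights, biases)
```

Because the average discards word order, this variant cannot distinguish utterances that contain the same words in a different order, which is one motivation for the recurrent alternative.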

Recurrent Neural network for slot filling
The slot filling task differs from intent detection in that there are multiple outputs for the task, hence only an RNN model is a feasible approach for this scenario. The most straightforward way is to use a single RNN model that generates multiple semantic tags sequentially by reading in each word one by one (Liu and Lane, 2015; Mesnil et al., 2015; Peng and Yao, 2015). This approach has the constraint that the number of slot tags generated must equal the number of words in the utterance. One way to overcome this limitation is to use an encoder-decoder model containing two RNN models, an encoder for the input and a decoder for the output (Liu and Lane, 2016a). The advantage is that the system can match an input utterance and output slot tags of different lengths without the need for alignment. Besides RNNs, it is also possible to use a convolutional neural network (CNN) together with a conditional random field (CRF) for the slot filling task (Xu and Sarikaya, 2013).
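To make the one-tag-per-word constraint concrete, the sketch below pairs each word with a tag and groups the tags into slot values. It assumes IOB-style tags (B- opens a slot, I- continues it, O is outside any slot), which the text does not state explicitly; the helper name `extract_slots` and the example utterance are hypothetical.

```python
def extract_slots(words, tags):
    """Group an aligned word/tag sequence into (slot_name, value) pairs."""
    if len(words) != len(tags):
        raise ValueError("sequence labeling needs exactly one tag per word")
    slots, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            current = [tag[2:], [word]]      # open a new slot
            slots.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(word)          # continue the open slot
        else:
            current = None                   # "O" or an inconsistent tag closes it
    return [(name, " ".join(parts)) for name, parts in slots]

words = "flights from boston to new york".split()
tags = ["O", "O", "B-fromloc.city_name", "O",
        "B-toloc.city_name", "I-toloc.city_name"]
slots = extract_slots(words, tags)
```

Here `slots` holds one entry per filled slot, while the tag sequence itself has the same length as the utterance.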

Joint model for two tasks
It is also possible to use one joint model for intent detection and slot filling (Guo et al., 2014; Liu and Lane, 2016a,b; Zhang and Wang, 2016; Hakkani-Tür et al., 2016). One way is to use one encoder with two decoders: the first decoder generates the sequential semantic tags and the second decoder generates the intent. Another approach consolidates the hidden state information from an RNN slot filling model, then generates the intent using an attention model (Liu and Lane, 2016a). Both approaches demonstrate very good results on the ATIS dataset.

Bi-model RNN structures for joint semantic frame parsing
Despite the success of RNN based sequence to sequence (or encoder-decoder) models on both tasks, most of the approaches in the literature still use one single RNN model for each task or for both tasks. They treat intent detection and slot filling as two separate tasks. In this section, two new Bi-model structures are proposed to take their cross-impact into account, and hence further improve their performance. One structure takes advantage of a decoder structure and the other does not.
An asynchronous training approach based on two models' cost functions is designed to adapt to these new structures.

Bi-model RNN Structures
A graphical illustration of the two Bi-model structures, with and without a decoder, is shown in Figure 1. The two structures are quite similar to each other, except that Figure 1a contains an LSTM based decoder, hence there is an extra decoder state $s_t$ to be cascaded besides the encoder state $h_t$.

Remarks:
The concept of using information from multiple models or modalities to achieve better performance has been widely used in deep learning (Dean et al., 2012; Wang, 2017; Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012), system identification (Murray-Smith and Johansen, 1997; Narendra et al., 2014, 2015) and, recently, also in reinforcement learning (Narendra et al., 2016; Wang and Jin, 2018). Instead of using collective information, our work introduces a totally new approach of training multiple neural networks asynchronously by sharing their internal state information.
Bi-model structure with a decoder
The Bi-model structure with a decoder is shown in Figure 1a. There are two inter-connected bidirectional LSTMs (BLSTMs) in the structure: one for intent detection and the other for slot filling. Each BLSTM reads in the input utterance sequence $(x_1, x_2, \cdots, x_n)$ forward and backward, and generates two sequences of hidden states $hf_t$ and $hb_t$. A concatenation of $hf_t$ and $hb_t$ forms the BLSTM hidden state $h_t = [hf_t, hb_t]$ at time step $t$. Hence, our bidirectional LSTM $f_i(\cdot)$ generates a sequence of hidden states $(h^i_1, h^i_2, \cdots, h^i_n)$, where $i = 1$ corresponds to the network for the intent detection task and $i = 2$ to the network for the slot filling task.
In order to detect the intent, the hidden state $h^1_t$ is combined with $h^2_t$ from the other bidirectional LSTM $f_2(\cdot)$ in the slot filling task-network to generate the state $s^1_t$ of the decoder $g_1(\cdot)$ at time step $t$. The output $\hat{y}^1_n$ contains the predicted probabilities for all intent labels at the last time step $n$.
For the slot filling task, a similar network structure is constructed with a BLSTM $f_2(\cdot)$ and an LSTM $g_2(\cdot)$. $f_2(\cdot)$ is the same as $f_1(\cdot)$, reading in a word sequence as its input. The difference is that $g_2(\cdot)$ produces an output $y^2_t$ at each time step $t$, as slot filling is a sequence labeling problem. Here $y^2_t$ is the predicted semantic tag at time step $t$.
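One possible way to write the two decoder updates of this subsection compactly is given below. It is a sketch under stated assumptions: the update functions $\phi$ and $\psi$, the component index $c$, and the exact time indices are introduced here for illustration and follow the wording of the text; $\hat{y}^1_n$ is taken to be computed from $s^1_n$ and $\hat{y}^2_t$ from $s^2_t$. If the previous tag is also fed back to the slot decoder, $y^2_{t-1}$ would appear as an additional argument of $\psi$.

```latex
\begin{align*}
s^1_t &= \phi\left(s^1_{t-1},\, h^1_t,\, h^2_t\right), &
y^1_{\text{intent}} &= \arg\max_{c}\ \hat{y}^{1,c}_n, \\
s^2_t &= \psi\left(s^2_{t-1},\, h^1_t,\, h^2_t\right), &
y^2_t &= \arg\max_{c}\ \hat{y}^{2,c}_t .
\end{align*}
```

The point of the notation is that each decoder state depends on the encoder states of both task-networks, not only on its own.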

Bi-model structure without a decoder
The Bi-model structure without a decoder is shown in Figure 1b. Unlike the previous model, this model has no LSTM decoder.
For the intent task, only one predicted output label $y^1_{\text{intent}}$ is generated from the BLSTM $f_1(\cdot)$ at the last time step $n$, where $n$ is the length of the utterance. As in the model with a decoder, the state value $h^1_t$ and the output intent label are generated using the hidden states of both BLSTMs.
For the slot filling task, the basic structure of the BLSTM $f_2(\cdot)$ is similar to that of $f_1(\cdot)$ for the intent detection task, except that one slot tag label $y^2_t$ is generated at each time step $t$. It takes the hidden states from the two BLSTMs $f_1(\cdot)$ and $f_2(\cdot)$, i.e. $h^1_{t-1}$ and $h^2_{t-1}$, together with the output tag $y^2_{t-1}$, to generate its next state value $h^2_t$ and the slot tag $y^2_t$. In other words, both $h^2_t$ and $y^2_t$ are functions of $h^1_{t-1}$, $h^2_{t-1}$ and $y^2_{t-1}$.
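The dependencies described for the structure without a decoder can be summarized as follows. The slot filling line follows the text directly; the intent line is written by symmetry and is an assumption, as are the generic update functions $\phi$ and $\psi$ and the component index $c$. The input word at each step is omitted for brevity.

```latex
\begin{align*}
h^1_t &= \phi\left(h^1_{t-1},\, h^2_{t-1}\right), &
y^1_{\text{intent}} &= \arg\max_{c}\ \hat{y}^{1,c}_n, \\
h^2_t &= \psi\left(h^1_{t-1},\, h^2_{t-1},\, y^2_{t-1}\right), &
y^2_t &= \arg\max_{c}\ \hat{y}^{2,c}_t .
\end{align*}
```

Here $\hat{y}^1_n$ and $\hat{y}^2_t$ denote the predicted probability vectors over intent labels and slot tags, respectively.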

Asynchronous training
One of the major differences in the Bi-model structure is its asynchronous training, which trains the two task-networks on their own cost functions in an asynchronous manner. The loss function for the intent detection task-network is $L_1$, and for the slot filling task-network it is $L_2$. Both $L_1$ and $L_2$ are defined using cross entropy, where $k$ is the number of intent label types, $m$ is the number of semantic tag types and $n$ is the number of words in a word sequence. In each training iteration, both the intent detection and slot filling networks generate a group of hidden states $h^1$ and $h^2$ from the models of the previous iteration. The intent detection task-network reads in a batch of input data $x_i$ and the hidden states $h^2$, and generates the estimated intent labels $\hat{y}^1_{\text{intent}}$. It then computes its cost using $L_1$ and is trained on that cost. Next, the same batch of data $x_i$ is fed into the slot filling task-network together with the hidden states $h^1$ from the intent task-network, which generates a batch of outputs $y^2_i$ for each time step. Its cost is then computed using $L_2$, and the network is trained on that cost.
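For reference, the standard cross-entropy form consistent with the definitions of $k$, $m$ and $n$ above is sketched below. The notation is an assumption: $\bar{y}$ denotes the one-hot reference label and $\hat{y}$ the predicted probability, with the superscript $i$ indexing label types and the subscript $j$ indexing word positions.

```latex
\begin{align*}
L_1 &= -\sum_{i=1}^{k} \bar{y}^{1,i}_{\text{intent}} \log \hat{y}^{1,i}_{\text{intent}}, \\
L_2 &= -\sum_{j=1}^{n} \sum_{i=1}^{m} \bar{y}^{2,i}_{j} \log \hat{y}^{2,i}_{j}.
\end{align*}
```

$L_1$ has a single sum because there is one intent per utterance, while $L_2$ sums over all $n$ word positions because one tag is predicted per word.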
The asynchronous training approach is used because it is important to keep two separate cost functions for the different tasks. Doing this has two main advantages:
1. It filters the negative impact between the two tasks in comparison to using only one joint model, by capturing more useful information and overcoming the structural limitation of one model.
2. The cross-impact between the two tasks can only be learned by sharing the hidden states of the two models, which are trained using two cost functions separately.
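The alternating update schedule can be illustrated with a deliberately small toy. This is not the model of this paper: each task-network is reduced to three scalar parameters, the labels are binary, and gradients are approximated by finite differences instead of backpropagation. All names are hypothetical. What the sketch keeps is the schedule itself: each network is updated only on its own loss, using the other network's hidden state from the previous iteration as a fixed input.

```python
import math

def hidden(p, x, h_other):
    """Hidden state of one task-network, conditioned on the other network's state."""
    return math.tanh(p["w_x"] * x + p["w_h"] * h_other)

def predict(p, x, h_other):
    """Probability of the positive label for this task."""
    return 1.0 / (1.0 + math.exp(-p["w_out"] * hidden(p, x, h_other)))

def cross_entropy(prob, label):
    """Binary cross entropy for a single example."""
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def sgd_step(p, loss_fn, lr=0.5, eps=1e-5):
    """Update p in place using central finite-difference gradients of loss_fn."""
    grads = {}
    for name in p:
        original = p[name]
        p[name] = original + eps
        loss_up = loss_fn(p)
        p[name] = original - eps
        loss_down = loss_fn(p)
        p[name] = original
        grads[name] = (loss_up - loss_down) / (2 * eps)
    for name in p:
        p[name] -= lr * grads[name]

def train(data, epochs=200):
    net1 = {"w_x": 0.5, "w_h": 0.0, "w_out": 0.1}   # stands in for the intent network
    net2 = {"w_x": 0.5, "w_h": 0.0, "w_out": 0.1}   # stands in for the slot network
    h1 = [0.0] * len(data)                           # shared states from the previous iteration
    h2 = [0.0] * len(data)
    history = []
    for _ in range(epochs):
        loss1 = loss2 = 0.0
        for i, (x, y1, y2) in enumerate(data):
            h1_old, h2_old = h1[i], h2[i]
            loss1 += cross_entropy(predict(net1, x, h2_old), y1)
            loss2 += cross_entropy(predict(net2, x, h1_old), y2)
            # Step 1: train network 1 on L1 only, with h2 held fixed.
            sgd_step(net1, lambda p: cross_entropy(predict(p, x, h2_old), y1))
            # Step 2: train network 2 on L2 only, with h1 held fixed.
            sgd_step(net2, lambda p: cross_entropy(predict(p, x, h1_old), y2))
            # Refresh the shared hidden states for the next iteration.
            h1[i] = hidden(net1, x, h2_old)
            h2[i] = hidden(net2, x, h1_old)
        history.append((loss1 / len(data), loss2 / len(data)))
    return net1, net2, history

data = [(1.0, 1, 0), (-1.0, 0, 1)]   # (input, label for task 1, label for task 2)
net1, net2, history = train(data)
```

Neither loss ever contributes a gradient to the other network's parameters; the only coupling is through the exchanged hidden states.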

Experiments
In this section, our newly proposed Bi-model structures are trained and tested on two datasets: one is the public ATIS dataset (Hemphill et al., 1990), containing audio recordings of flight reservations, and the other is our self-collected dataset in three different domains: Food, Home and Movie. The ATIS dataset used in this paper follows the same format as in (Liu and Lane, 2015; Mesnil et al., 2015; Xu and Sarikaya, 2013; Liu and Lane, 2016a). The training set contains 4978 utterances and the test set contains 893 utterances, with a total of 18 intent classes and 127 slot labels. The size of our self-collected dataset is given in the corresponding experiment section with a more detailed explanation. Performance is evaluated by classification accuracy for the intent detection task and F1-score for the slot filling task.
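The two evaluation metrics can be sketched as follows. The text does not specify the F1 variant; the sketch assumes slot-level (chunk-level) F1 over IOB-style tags, where a predicted slot counts as correct only if its span and type both match. The function names and example tags are hypothetical.

```python
def intent_accuracy(gold, pred):
    """Fraction of utterances whose predicted intent equals the reference."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

def chunks(tags):
    """Return the set of (start, end, slot_type) spans in an IOB tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

def slot_f1(gold_seqs, pred_seqs):
    """Chunk-level F1 over a corpus of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = chunks(gold), chunks(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

acc = intent_accuracy(["flight", "airfare"], ["flight", "flight"])
f1 = slot_f1([["O", "B-from", "O", "B-to"]], [["O", "B-from", "O", "O"]])
```

In the example, one of two intents is correct, and the prediction recovers one of the two reference slots with no spurious slot, giving precision 1 and recall 0.5.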

Training Setup
The layer sizes for both the LSTM and BLSTM networks in our model are set to 200. Based on the size of our dataset, the number of hidden layers is set to 2, and Adam optimization is used as in (Kingma and Ba, 2014). The word embedding size is 300, and the embeddings are initialized randomly at the beginning of the experiment.
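The hyperparameters above can be collected in a small configuration object. The class and field names are hypothetical, and the helper assumes that the layer size of 200 applies per direction of the BLSTM, which the text does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    lstm_units: int = 200          # layer size of the LSTM and BLSTM networks
    num_hidden_layers: int = 2     # chosen based on dataset size
    embedding_dim: int = 300       # word embedding size
    optimizer: str = "adam"        # Adam (Kingma and Ba, 2014)
    embedding_init: str = "random" # embeddings are not pre-trained

    def blstm_state_dim(self) -> int:
        # Forward and backward states are concatenated.
        # Assumption: 200 units in each direction.
        return 2 * self.lstm_units

config = TrainingConfig()
```

Freezing the dataclass keeps the settings immutable, so both task-networks are guaranteed to be built from the same values.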

Performance on the ATIS dataset
Our first experiment is conducted on the ATIS benchmark dataset, and our models are compared with existing approaches by evaluating intent detection accuracy and slot filling F1 score.

Model                               Slot F1    Intent accuracy
... (Liu and Lane, 2015)            ...        ...
Hybrid RNN (Mesnil et al., 2015)    95.06%     NA
RNN-EM (Peng and Yao, 2015)         95.25%     NA
CNN CRF (Xu and Sarikaya, 2013)     95.35%     NA

In the performance comparison on the three domains of data, the Bi-model structure with a decoder gives the best performance in all cases, in terms of both intent accuracy and slot filling F1 score. The intent accuracy improves by at least 0.5%, and the F1 score improves by around 1% to 3% across the different domains.

Conclusion
In this paper, a novel Bi-model based RNN semantic frame parsing model for intent detection and slot filling is proposed and tested. Two substructures, with and without a decoder, are discussed. The Bi-model structures achieve state-of-the-art performance for both intent detection and slot filling on the ATIS benchmark data, and also surpass the previous best SLU model on the multi-domain data. The Bi-model based RNN structure with a decoder also outperforms the Bi-model structure without a decoder on both ATIS and multi-domain data.