Towards Fine-Grained Transfer: An Adaptive Graph-Interactive Framework for Joint Multiple Intent Detection and Slot Filling

In real-world scenarios, users often express multiple intents in the same utterance. Unfortunately, most spoken language understanding (SLU) models either focus mainly on the single-intent scenario or simply incorporate an overall intent context vector for all tokens, ignoring fine-grained integration of multiple-intent information for token-level slot prediction. In this paper, we propose an Adaptive Graph-Interactive Framework (AGIF) for joint multiple intent detection and slot filling, where we introduce an intent-slot graph interaction layer to model the strong correlation between slots and intents. This interaction layer is applied to each token adaptively, which has the advantage of automatically extracting the relevant intent information and thus enables fine-grained intent integration for token-level slot prediction. Experimental results on three multi-intent datasets show that our framework obtains substantial improvements and achieves state-of-the-art performance. In addition, our framework achieves new state-of-the-art performance on two single-intent datasets.


Introduction
Spoken language understanding (SLU) (Young et al., 2013) is a core component of task-oriented dialog systems. It consists of two typical subtasks, intent detection and slot filling (Tur and De Mori, 2011). Take the utterance "Please play happy birthday" for example: intent detection can be seen as a classification task that predicts the intent label (i.e., PlayMusic), while slot filling can be treated as a sequence labeling task that predicts the slot label sequence (i.e., O, O, B-music, I-music). Dominant SLU systems in the literature (Goo et al., 2018; E et al., 2019; Qin et al., 2019) adopt joint models to capture the relation between the two tasks, a direction we follow.
Though achieving promising performance, most prior work focuses only on the simple single-intent scenario: the models are trained under the assumption that each utterance has exactly one intent. In practice, users often express multiple intents in an utterance, and Gangadharaiah and Narayanaswamy (2019) show that 52% of examples in an internal Amazon dataset are multi-intent. Nevertheless, existing single-intent SLU models fail to effectively handle multi-intent settings with their original network structure. Ideally, when an SLU system meets an utterance with multiple intents, as shown in Figure 1(a), the model should directly detect all of its intents (PlayMusic and GetWeather). Hence, it is important to consider multi-intent SLU.
Unlike prior single-intent SLU models, which can simply leverage the utterance's single intent to guide slot prediction (Goo et al., 2018; Qin et al., 2019), multi-intent SLU faces multiple intents and presents a unique challenge worth studying: how to effectively incorporate multiple-intent information to guide slot prediction. To this end, Gangadharaiah and Narayanaswamy (2019) first explored a multi-task framework with the slot-gated mechanism (Goo et al., 2018) for joint multiple intent detection and slot filling. Their model incorporates intent information by simply treating a single intent context vector as the multiple-intent information. While this is a direct method for incorporating multiple-intent information, it does not offer fine-grained intent integration for token-level slot filling, in the sense that every token is guided by the same mixed intent information, as shown in Figure 1(a). In addition, providing the same intent information to all tokens may introduce ambiguity, making it hard for each token to capture the related intent information. As shown in Figure 1(b), the tokens "happy birthday" should focus on the intent "PlayMusic" while the tokens "deepwater bonaire" depend on the intent "GetWeather". Thus, each token should focus on its corresponding intent, and it is critical to perform fine-grained intent integration for token-level slot prediction.
In this paper, we propose an Adaptive Graph-Interactive Framework (AGIF) to address the aforementioned concern. The core module is the proposed adaptive intent-slot graph interaction layer, which is constructed from each token's hidden state in the slot filling decoder and the embeddings of the predicted multiple intents. In this graph, each token's slot node directly connects to all predicted intent nodes to explicitly build the correlation between slots and intents. This interaction graph is applied to each token adaptively, which enables each token to capture different relevant intent information so that fine-grained multiple-intent integration can be achieved. In contrast to prior work that incorporates multiple-intent information statically, using the same intent information to guide all tokens, our intent-slot interaction graph is constructed adaptively with a graph attention network over each token. This encourages our model to automatically filter out irrelevant information and capture the important intents at the token level.
We first conduct experiments on the multi-intent benchmark dataset DSTC4 (Kim et al., 2017b). Then, to verify the generalization of our framework, we empirically construct two large-scale multi-intent datasets, MixATIS (Hemphill et al., 1990) and MixSNIPS (Coucke et al., 2018). The results of these experiments show the effectiveness of our framework, which outperforms the current state-of-the-art method. To the best of our knowledge, there are no public large-scale multi-intent datasets, and we hope their release will push forward research on multi-intent SLU. In addition, our framework achieves state-of-the-art performance on two public single-intent datasets, ATIS (Hemphill et al., 1990) and SNIPS (Coucke et al., 2018), which further verifies the generalization of the proposed model.
To facilitate future research in this area, all datasets and code are publicly available at https://github.com/LooperXX/AGIF.

Approach
The architecture of our framework is illustrated in Figure 2. It consists of a shared encoder, an adaptive intent-slot graph interaction layer and two separate decoders. First, the shared self-attentive encoder (§2.1) represents the utterance, grasping the information shared between intent detection and slot filling. Then, the intent detection decoder (§2.2) performs multi-label classification to detect multiple intents. Finally, we introduce the adaptive intent-slot graph interaction layer (§2.3) to explicitly leverage the multiple-intent information for guiding slot prediction. Intent detection and slot filling are optimized simultaneously via a multi-task learning scheme.

Self-Attentive Encoder
In the self-attentive encoder, following Qin et al. (2019), we use a BiLSTM with a self-attention mechanism to leverage the advantages of both temporal features within word order and contextual information.

Bidirectional LSTM
A bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) consists of two LSTM layers. For the input sequence {x_1, x_2, . . . , x_T} (T is the number of tokens in the input utterance), the BiLSTM reads it forward from x_1 to x_T and backward from x_T to x_1 to produce a series of context-sensitive hidden states H = {h_1, h_2, . . . , h_T}.

Self-Attention
We follow Vaswani et al. (2017) in using a self-attention mechanism over the word embeddings to capture context-aware features. We first map the matrix of input vectors X ∈ R^{T×d} (d is the mapped dimension) to queries Q, keys K and values V using separate linear projection parameters W_q, W_k, W_v. The attention weights are computed by the dot product between Q and K, and the self-attention output A ∈ R^{T×d} is a weighted sum of the values:

A = softmax(QK^T / √d_k) V,

where d_k denotes the dimension of the keys. We concatenate the BiLSTM and self-attention representations as the final encoding representation:

E = H || A,

where E = {e_1, . . . , e_T} ∈ R^{T×2d} and || is the concatenation operation.
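The encoder's self-attention step can be sketched in NumPy as follows. This is a minimal single-head sketch; the shapes, random inputs and the stand-in BiLSTM states `H` are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over token features X of shape (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) attention logits
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V                                      # (T, d) context-aware output A

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
A = self_attention(X, W_q, W_k, W_v)

# Final encoder output: concatenate BiLSTM states H with A, giving E of shape (T, 2d).
H = rng.normal(size=(T, d))
E = np.concatenate([H, A], axis=-1)
```

The concatenation at the end mirrors E = H || A, so each token representation carries both recurrent and attention-based context.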

Intent Detection Decoder
We follow Gangadharaiah and Narayanaswamy (2019) in framing multiple intent detection as a multi-label classification problem. We compute an utterance context vector over E = {e_1, . . . , e_T} ∈ R^{T×2d}. In our case, we use a self-attention module (Zhong et al., 2018; Goo et al., 2018) to capture the relevant context:

p_t = softmax_t(w_e e_t),
c = Σ_t p_t e_t,

where w_e ∈ R^{1×2d} is a trainable parameter and p_t is the corresponding normalized self-attention score. The context vector c, the weighted sum of the elements e_t, is used for intent detection:

y^I = σ(W_i (LeakyReLU(W_c c))),

where W_i and W_c are trainable parameters of the intent decoder, y^I = {y^I_1, . . . , y^I_{N_I}} is the intent output of the utterance, N_I is the number of single intent labels, and σ is the sigmoid activation function.
During inference, we predict the intent set I = {I_1, . . . , I_n}, where an intent I_i is included when its probability y^I_{I_i} is greater than t_u, with 0 < t_u < 1.0 a hyper-parameter tuned on the validation set. For example, if y^I = {0.9, 0.3, 0.6, 0.7, 0.2} and t_u is 0.5, we predict the intents I = {1, 3, 4}.
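This thresholding rule is a one-liner; the sketch below reproduces the paper's worked example (1-based intent indices, as in the text):

```python
def predict_intents(y_I, t_u=0.5):
    """Return 1-based indices of intents whose probability exceeds threshold t_u."""
    return [i + 1 for i, p in enumerate(y_I) if p > t_u]

# Reproduces the paper's example: y^I = {0.9, 0.3, 0.6, 0.7, 0.2}, t_u = 0.5.
assert predict_intents([0.9, 0.3, 0.6, 0.7, 0.2], t_u=0.5) == [1, 3, 4]
```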

Adaptive Intent-Slot Graph Interaction for Slot Filling
In this paper, one of the core contributions is adaptively leveraging multiple intents to guide slot prediction, encouraging each token to capture its relevant intent information. In particular, we adopt the graph attention network (GAT) (Veličković et al., 2017) to model the interaction between intents and slots at the token level.
In this section, we first describe the vanilla graph attention network. Then, we show how to directly leverage multiple intents information for slot prediction with the adaptive intent-slot graph interaction layer.
Vanilla Graph Attention Network For a given graph with N nodes, a one-layer GAT takes the initial node features H = {h_1, . . . , h_N}, h_n ∈ R^F, as input and produces more abstract representations H' = {h'_1, . . . , h'_N}, h'_n ∈ R^{F'}, as output. The graph attention operation on the node representations can be written as:

h'_i = σ(Σ_{j∈N_i} α_ij W_h h_j),
α_ij = exp(LeakyReLU(a^T [W_h h_i || W_h h_j])) / Σ_{j'∈N_i} exp(LeakyReLU(a^T [W_h h_i || W_h h_j'])),

where N_i is the set of first-order neighbors of node i (including i itself), W_h ∈ R^{F'×F} and a ∈ R^{2F'} are trainable weights, α_ij is the normalized attention weight denoting the importance of h_j to h_i, and σ is a nonlinear activation function.
GAT injects the graph structure into the mechanism by performing masked attention, i.e., it only computes α_ij for nodes j ∈ N_i. To stabilize the learning process of self-attention, GAT extends the above mechanism with the multi-head attention of Vaswani et al. (2017):

h'_i = ||_{k=1}^{K} σ(Σ_{j∈N_i} α^k_ij W^k_h h_j),

where α^k_ij is the normalized attention weight computed by the k-th attention head, || is the concatenation operation and K is the number of heads. The output h'_n thus consists of KF' features in the intermediate layers; the final prediction layer employs averaging instead of concatenation to obtain the prediction results.
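A single-head GAT layer can be sketched in NumPy as below. This is a didactic sketch under illustrative shapes (the loop-based neighbourhood attention is not how an efficient implementation would batch it):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(H, adj, W_h, a):
    """One single-head GAT layer.
    H: (N, F) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W_h: (F, F_out) weight matrix; a: (2 * F_out,) attention vector."""
    Wh = H @ W_h
    out = np.zeros_like(Wh)
    for i in range(len(H)):
        nbrs = np.nonzero(adj[i])[0]                      # first-order neighbours N_i
        logits = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                           for j in nbrs])
        alpha = softmax(logits)                           # normalized weights alpha_ij
        out[i] = np.tanh(alpha @ Wh[nbrs])                # weighted sum + nonlinearity
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
adj = np.ones((3, 3))                                     # fully connected toy graph
W_h = rng.normal(size=(4, 4))
a = rng.normal(size=8)
H_out = gat_layer(H, adj, W_h, a)
```

Multi-head attention would run K such layers in parallel and concatenate (or average) their outputs per node.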
Adaptive Intent-Slot Graph Interaction for Slot Prediction We use a unidirectional LSTM as the slot filling decoder. At each decoding step t, the decoder state s_t is computed from the previous decoder state s_{t−1}, the previously emitted slot label distribution y^S_{t−1} and the aligned encoder hidden state e_t:

s_t = LSTM(s_{t−1}, y^S_{t−1}, e_t).

Instead of directly using s_t to predict the slot label, we build a graph structure, the adaptive intent-slot interaction graph, to explicitly leverage the multiple-intent information to guide the t-th slot prediction. In this graph, the slot hidden state at time step t is s_t, and the embeddings of the predicted multiple intents I = {I_1, . . . , I_n}, where n denotes the number of predicted intents, serve together with s_t as the initial node representations at time step t:

H^{[1,t]} = {s_t, φ^emb(I_1), . . . , φ^emb(I_n)} ∈ R^{(n+1)×d},

where d represents the dimension of the node representations and φ^emb(·) is the embedding matrix of the intents. In addition, the predicted intents are connected to each other to model their mutual interaction, since all of them express intents of the same utterance.
For convenience, we use h^{[l,t]}_i to represent node i in the l-th layer of the graph, which consists of the decoder state node and the predicted intent nodes at time step t; h^{[l,t]}_0 is the slot hidden state representation in the l-th layer. To explicitly leverage the multiple-intent information, the slot hidden state node is directly connected to all predicted intent nodes, and the slot node representation in the next layer is computed as:

h^{[l+1,t]}_0 = σ(Σ_{j∈N_0} α_{0j} W_h h^{[l,t]}_j),

where N_0 represents the first-order neighbors of the slot node, i.e., the decoder state node itself and the predicted intent nodes; the update of all node representations follows Equations 6, 7 and 9.
With the L-layer adaptive intent-slot graph interaction, we obtain the final slot hidden state representation h^{[L,t]}_0 at time step t, which adaptively captures the important intent information at the token level. The representation h^{[L,t]}_0 is used for slot filling:

y^S_t = softmax(W_s h^{[L,t]}_0),
o^S_t = argmax(y^S_t),

where W_s is a trainable parameter and o^S_t is the predicted slot label of the t-th word in the utterance.
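The per-token interaction can be sketched as follows: node 0 is the slot decoder state, the remaining nodes are the predicted intents' embeddings, the graph is fully connected, and after a few GAT-style layers the slot node feeds a softmax slot classifier. All shapes, the random inputs and the classifier matrix `W_s` here are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_interaction(s_t, intent_embs, W_h, a, n_layers=2):
    """Per-token intent-slot graph: node 0 is the slot decoder state s_t,
    nodes 1..n are the predicted intents' embeddings. The graph is fully
    connected (slot <-> every intent, intents <-> each other, self-loops)."""
    H = np.vstack([s_t, intent_embs])                     # (1 + n, d) initial nodes
    for _ in range(n_layers):
        Wh = H @ W_h
        n = len(H)
        logits = np.array([[leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                            for j in range(n)] for i in range(n)])
        alpha = softmax(logits)                           # (n, n) attention weights
        H = np.tanh(alpha @ Wh)                           # GAT-style layer update
    return H[0]                                           # final slot node h_0^[L,t]

rng = np.random.default_rng(1)
d, n_slots = 6, 5
s_t = rng.normal(size=d)
intents = rng.normal(size=(2, d))                         # two predicted intents
W_h = rng.normal(size=(d, d))
a = rng.normal(size=2 * d)
W_s = rng.normal(size=(n_slots, d))                       # hypothetical slot classifier
y_slot = softmax(W_s @ adaptive_interaction(s_t, intents, W_h, a))
```

Because the attention weights are recomputed per token, different tokens can attend to different predicted intents, which is the point of the adaptive design.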

Multi-Task Training
Following Qin et al. (2020), we adopt a joint model for the two tasks and update the parameters by joint optimization. The intent detection objective is:

L_1 = − Σ_{i=1}^{N_I} ( ŷ^{(i,I)} log(y^{(i,I)}) + (1 − ŷ^{(i,I)}) log(1 − y^{(i,I)}) ).

Similarly, the slot filling objective is defined as:

L_2 = − Σ_{j=1}^{M} Σ_{i=1}^{N_S} ŷ^{(i,S)}_j log(y^{(i,S)}_j),

where N_I is the number of single intent labels, N_S is the number of slot labels and M is the number of words in the utterance. The final joint objective is formulated as:

L = α L_1 + (1 − α) L_2,

where α is a hyper-parameter.
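The joint objective can be sketched numerically as below (a minimal sketch; the exact placement of the balancing weight α is an assumption, as is the tiny toy data):

```python
import numpy as np

def joint_loss(y_intent, gold_intent, y_slot, gold_slot, alpha=0.5):
    """Joint objective sketch.
    y_intent: (N_I,) sigmoid outputs; gold_intent: (N_I,) 0/1 multi-hot labels.
    y_slot: (M, N_S) per-token softmax distributions; gold_slot: (M,) label ids.
    alpha balances the two objectives (a hyper-parameter in the paper)."""
    eps = 1e-12
    # L1: binary cross-entropy over every single-intent label.
    L1 = -np.sum(gold_intent * np.log(y_intent + eps)
                 + (1 - gold_intent) * np.log(1 - y_intent + eps))
    # L2: token-level cross-entropy for slot filling.
    L2 = -np.sum(np.log(y_slot[np.arange(len(gold_slot)), gold_slot] + eps))
    return alpha * L1 + (1 - alpha) * L2

y_i = np.array([0.9, 0.1, 0.8])                 # predicted intent probabilities
g_i = np.array([1.0, 0.0, 1.0])                 # gold multi-hot intents
y_s = np.array([[0.7, 0.2, 0.1],                # per-token slot distributions
                [0.1, 0.8, 0.1]])
g_s = np.array([0, 1])                          # gold slot label ids
loss = joint_loss(y_i, g_i, y_s, g_s)
```

Sharpening the predictions toward the gold labels drives both terms, and hence the joint loss, toward zero.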

Datasets
Multiple Intent Datasets We conduct experiments on the benchmark DSTC4 (Kim et al., 2017b), which consists of human-human multi-turn dialogues. We adopt the same dataset partition as the DSTC4 main task and regard its speech act attributes as intents. It has 12,759 utterances for training, 4,812 utterances for validation and 7,848 utterances for testing.
To verify the generalization of the proposed model, we construct the multi-intent SLU dataset MixSNIPS. It is built from the Snips personal voice assistant corpus (Coucke et al., 2018) by using conjunctions, e.g., "and", to connect sentences with different intents, ensuring that the ratio of sentences with 1-3 intents is [0.3, 0.5, 0.2]. In total, we obtain 45,000 utterances for training, 2,500 utterances for validation and 2,500 utterances for testing in the MixSNIPS dataset. Similarly, we construct another multi-intent SLU dataset, MixATIS, from the ATIS dataset (Hemphill et al., 1990), with 18,000 utterances for training, 1,000 utterances for validation and 1,000 utterances for testing. The constructed datasets have been released for future research.
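A hypothetical sketch of this construction procedure is below. The function name and the toy corpus are our own illustration; the real pipeline must also merge the slot sequences (e.g., tagging the inserted "and" with an O label), which we only note in a comment:

```python
import random

def mix_utterances(corpus, ratio=(0.3, 0.5, 0.2), seed=0):
    """Sample 1-3 single-intent examples with probabilities `ratio` and join
    them with "and". `corpus` is a list of (utterance, intent) pairs.
    Note: a full construction must also concatenate the slot sequences,
    inserting an O label for each added "and"."""
    rng = random.Random(seed)
    n = rng.choices([1, 2, 3], weights=ratio)[0]
    picks = rng.sample(corpus, n)
    utterance = " and ".join(u for u, _ in picks)
    intent_label = "#".join(i for _, i in picks)    # combined multi-intent label
    return utterance, intent_label

corpus = [("play happy birthday", "PlayMusic"),
          ("what is the weather in boston", "GetWeather"),
          ("rate this book five stars", "RateBook")]
utt, label = mix_utterances(corpus)
```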
Single Intent Datasets In addition, we conduct experiments on two public benchmark single-intent datasets to validate the effectiveness of our proposed model: the ATIS dataset (Hemphill et al., 1990) and the SNIPS dataset (Coucke et al., 2018), both widely used benchmarks in SLU research. For both datasets, we follow the same format and partition as Goo et al. (2018) and Qin et al. (2019).

Experimental Settings
The self-attentive encoder has 256 hidden units on all datasets. The ℓ2 regularization weight is 1 × 10^{−6} and the dropout rate is 0.4 to reduce overfitting. We use Adam (Kingma and Ba, 2014) with its suggested hyper-parameters to optimize the model parameters. The number of graph layers is 3 for the DSTC4 dataset and 2 for the other datasets. For all experiments, we select the model that performs best on the dev set and evaluate it on the test set. All experiments are conducted on TITAN Xp and GeForce RTX 2080Ti GPUs. The number of epochs is 50 for MixSNIPS and 100 for MixATIS and DSTC4.

Baselines
We first compare our model with the existing state-of-the-art multi-intent SLU baseline: Joint Multiple ID-SF. Gangadharaiah and Narayanaswamy (2019) propose a multi-task framework with the slot-gated mechanism for joint multiple intent detection and slot filling.
Then, we compare our framework with the existing state-of-the-art single-intent SLU models: 1) Attention BiRNN. Liu and Lane (2016) propose an alignment-based RNN with an attention mechanism, which implicitly learns the relationship between slots and intent. 2) Slot-Gated Atten. Goo et al. (2018) propose a slot-gated mechanism that applies the intent information to guide slot filling. 3) Bi-Model. Wang et al. (2018) propose two inter-connected bidirectional LSTMs to jointly model the two tasks. 4) SF-ID Network. E et al. (2019) introduce a bi-directional interrelated mechanism between slot filling and intent detection. 5) Stack-Propagation. Qin et al. (2019) adopt a stack-propagation framework that directly leverages token-level intent detection results to guide slot filling. This model achieves the state-of-the-art performance.
To enable the single-intent SLU baselines to handle multi-intent utterances, we follow Gangadharaiah and Narayanaswamy (2019) and concatenate the multiple intent labels with # into a single multi-intent label for a fair comparison; we call this the concatenation version. To further verify the effectiveness of our framework, we also modify the state-of-the-art baseline Stack-Propagation to directly predict multiple intent labels, changing its intent decoder by replacing the softmax with a sigmoid and using the binary cross-entropy loss. We refer to this as the sigmoid-decoder version.
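The two label views can be sketched as follows (the function names and toy vocabulary are our own illustration):

```python
def to_concat_label(intents):
    """'Concatenation' baseline view: one combined label joined with '#'."""
    return "#".join(sorted(intents))

def to_multi_hot(intents, vocab):
    """'Sigmoid-decoder' view: one independent binary target per intent."""
    return [1 if i in intents else 0 for i in vocab]

vocab = ["GetWeather", "PlayMusic", "RateBook"]
intents = {"GetWeather", "PlayMusic"}
assert to_concat_label(intents) == "GetWeather#PlayMusic"
assert to_multi_hot(intents, vocab) == [1, 1, 0]
```

The concatenation view turns multi-intent detection into ordinary single-label classification over the observed combinations, while the multi-hot view keeps one sigmoid output per intent.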
For Attention BiRNN, Slot-Gated Atten, SF-ID Network and Stack-Propagation, we run the official source code to obtain the results. For Bi-Model and Joint Multiple ID-SF, we reimplement the models and obtain the results on the same datasets, because the original papers did not release their code.

Main Results
Following Goo et al. (2018) and Qin et al. (2019), we evaluate slot filling with the F1 score, intent prediction with accuracy and macro F1 score, and sentence-level semantic frame parsing with overall accuracy, which counts an utterance as correct only when all of its slots and intents are predicted correctly. Table 1 shows the experimental results of the proposed model on the MixATIS and MixSNIPS datasets.
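The overall-accuracy metric is strict by design; a minimal sketch (with a made-up two-utterance example) is:

```python
def overall_accuracy(pred, gold):
    """Sentence-level semantic frame accuracy: an utterance counts as correct
    only if its whole slot sequence AND its whole intent set are both right.
    pred/gold: lists of (slot_sequence, intent_set) pairs."""
    hits = sum(1 for (ps, pi), (gs, gi) in zip(pred, gold)
               if ps == gs and pi == gi)
    return hits / len(gold)

gold = [(["O", "O", "B-music", "I-music"], {"PlayMusic"}),
        (["O", "B-city"], {"GetWeather"})]
pred = [(["O", "O", "B-music", "I-music"], {"PlayMusic"}),
        (["O", "B-city"], {"PlayMusic"})]       # intent wrong -> whole frame wrong
assert overall_accuracy(pred, gold) == 0.5
```

Because a single wrong slot or intent fails the whole frame, overall accuracy is much harder to improve than the per-task metrics, which is why the gains reported on it are the most telling.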
From the results, we have three observations: 1) Our framework outperforms the Joint Multiple ID-SF baseline by a large margin and achieves state-of-the-art performance. On the MixATIS dataset, we achieve a 0.6% improvement in Slot (F1), a 0.6% improvement in Intent (F1) and a 2.7% improvement in Intent (Acc). On the MixSNIPS dataset, we achieve a 3.5% improvement in Slot (F1), a 0.4% improvement in Intent (F1) and a 0.8% improvement in Intent (Acc). This indicates that our adaptive intent-slot graph interaction successfully incorporates relevant intent information to improve slot prediction. In addition, we obtain 6.4% and 9.8% improvements in Overall (Acc) on the MixATIS and MixSNIPS datasets, respectively. We attribute this to the fact that our adaptive intent-slot graph interaction mechanism better captures the relationship between the intents and slots, improving SLU as a whole.
2) The concatenation version outperforms the sigmoid-decoder version. This is because concatenation greatly reduces the multi-intent search space, which makes it easier for single-intent systems to predict multiple intents. For example, on the ATIS dataset there are 17 single intents and 4 combined multi-intents in the training data. The multi-intent systems make a binary prediction for each intent, while the concatenation model predicts within the limited combined intent search space of 17 + 4 labels.
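The size gap between the two search spaces is easy to make concrete:

```python
# ATIS training data: 17 single intents, 4 intent combinations observed.
single_intents, seen_combos = 17, 4

# Concatenation baseline: ordinary classification over the observed labels only.
concat_space = single_intents + seen_combos
assert concat_space == 21

# Sigmoid decoder: any non-empty subset of the 17 intents is reachable.
sigmoid_space = 2 ** single_intents - 1
assert sigmoid_space == 131071
```

The sigmoid decoder must rule out thousands of unseen intent combinations, while the concatenation model only chooses among 21 labels, which explains its edge despite being less general.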
3) Despite facing the harder multi-intent prediction problem, our framework outperforms the state-of-the-art single-intent model (Stack-Propagation (concatenation)), which further shows that the proposed token-level adaptive graph interaction layer can improve SLU performance.

Performance on the DSTC4 dataset
To further analyze the performance of the AGIF model, we conduct experiments on the real-world multi-intent SLU dataset DSTC4. The results are shown in Table 2: compared with Joint Multiple ID-SF, we achieve a 5.9% improvement in Slot (F1), a 2.5% improvement in Intent (F1), a 7.1% improvement in Intent (Acc) and a 5.8% improvement in Overall (Acc). This further shows that our adaptive intent-slot graph interaction aggregates the pertinent intent information to enhance token-level slot prediction.

Effectiveness of Intent-Slot Graph Interaction Mechanism
• Graph Attention Mechanism vs. Vanilla Attention Mechanism Instead of adopting GAT to model the interaction between the predicted intents and slots, we use a vanilla attention mechanism to incorporate the intent information for slot filling at the token level, which we call Vanilla Attention Interaction. We first use the hidden state of the slot filling decoder as the query to attend over the intent embeddings, obtaining a context intent vector, and then sum this vector with the decoder hidden state to make the final slot prediction. As the Vanilla Attention Interaction row in Table 3 shows, the overall performance drops by 2.4% on the MixSNIPS dataset. We attribute this to the fact that the multi-layer graph attention network can automatically capture the relevant intent information and better aggregate it for each token's slot prediction.
• Graph Attention Mechanism vs. Graph Convolution Mechanism We replace the graph attention layer with a graph convolution layer and keep the other components unchanged, which we refer to as GCN-based Interaction. As the GCN-based Interaction row in Table 3 shows, performance drops on all metrics on the MixSNIPS dataset. We attribute this to the fact that GCN-based Interaction cannot adaptively assign different weights to each node in the intent-slot graph, whereas our graph attention mechanism can automatically filter out irrelevant intent information for each token.

Effectiveness of Adaptive Intent-Slot Interaction Mechanism
• Adaptive Interaction Mechanism vs. Sentence-Level Augmented Mechanism We first conduct experiments in which the same intent information is statically provided for the slot prediction of all tokens: we sum the predicted intent embeddings and directly add the result to the hidden state of the slot filling decoder. We refer to this as sentence-level augmented.
The result is shown in Table 3. We observe that providing only the overall intent information for slot filling yields worse results, which demonstrates the effectiveness of adaptively incorporating intent information at the token level. We believe the reason is that providing the same intents for all tokens causes ambiguity, making it hard for each token to extract the relevant intent information, whereas our adaptive interaction mechanism achieves fine-grained intent interaction and captures the related intent information to guide slot prediction.
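The static baseline is a one-line operation, which makes its limitation easy to see: every token receives an identical intent vector, with no per-token selection. A minimal sketch (shapes and inputs illustrative):

```python
import numpy as np

def sentence_level_augment(S, intent_embs):
    """Static 'sentence-level augmented' baseline: every token's decoder state
    receives the SAME summed intent vector, with no token-level selection.
    S: (T, d) slot decoder states; intent_embs: (n, d) predicted-intent embeddings."""
    overall = intent_embs.sum(axis=0)        # one vector for the whole sentence
    return S + overall                       # broadcast identically to all T tokens

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 6))
I_emb = rng.normal(size=(2, 6))
out = sentence_level_augment(S, I_emb)
```

By contrast, the adaptive graph interaction recomputes attention over the intent nodes for every token, so the added intent signal differs per token.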
A natural question is whether the additional parameters introduced by AGIF contribute to the final performance. To verify that it is the proposed adaptive interaction mechanism rather than the added parameters that works, we equip the sentence-level augmented model with a multi-layer (2-layer) LSTM slot filling decoder; we name this variant more parameters.
The results in Table 3 show that our framework outperforms the more parameters model in overall accuracy, which verifies that the improvement comes from the proposed adaptive intent-slot interaction mechanism rather than the additional parameters.
• Qualitative Analysis. We provide a case study to intuitively illustrate the token-level adaptive intent-slot interaction mechanism. As shown in Figure 3, AGIF correctly predicts "I-movie_name" as the slot label of "before", while Joint Multiple ID-SF incorrectly predicts it as "I-object_name". We observe that "I-object_name" does not belong to the intent "SearchScreeningEvent" but to the intent "RateBook". We attribute the error to the fact that each token is guided by the same mixed intent information, causing it to confusedly capture information from the other intent "RateBook". In contrast, our adaptive graph interaction mechanism offers fine-grained intent information integration for token-level slot filling and predicts the slot label correctly.

Visualization
To better understand what the adaptive intent-slot graph interaction layer has learned, we visualize the intent attention weights of the slot filling hidden state node in the output head of the adaptive intent-slot graph interaction layer, as shown in Figure 4. For the utterance "can you add confessions to my playlist called clásica and what is the weather forecast for close-by burkina" with the intents "AddToPlaylist" and "GetWeather", we can clearly see that the attention weights successfully focus on the correct intent, which means our graph interaction layer learns to incorporate the correlated intent information at each slot. More specifically, our model properly aggregates the corresponding "AddToPlaylist" intent information at the slots "confessions, my, clásica" and the "GetWeather" intent information at the slots "close-by burkina".

Evaluation on the Single-Intent Datasets
We conduct experiments on two public single-intent benchmarks to evaluate the generalizability of our framework. We compare our model with the single-intent state-of-the-art models, including SF-ID and Stack-Propagation, and the multi-intent model Joint Multiple ID-SF. Table 4 shows the experimental results of the proposed model on the ATIS and SNIPS datasets. From the table, we can see that our model outperforms all compared baselines and achieves state-of-the-art performance. This demonstrates the generalizability and effectiveness of our framework for both multi-intent and single-intent SLU.

Related Work
Intent Detection Intent detection is formulated as an utterance classification problem. Different classification methods, such as support vector machines (SVM) and RNNs (Haffner et al., 2003; Sarikaya et al., 2011), have been proposed to solve it. Xia et al. (2018) adopt a capsule-based neural network with self-attention for intent detection.
However, the above models mainly focus on the single-intent scenario and cannot handle the complex multiple-intent scenario. Xu and Sarikaya (2013b) and Kim et al. (2017a) explore this complex scenario, where multiple intents are assigned to a user's utterance. Xu and Sarikaya (2013b) use log-linear models, whereas we use neural network models. Compared with their work, we jointly perform multi-label intent detection and slot prediction, while they only consider the intent detection subtask.
Slot Filling Slot filling can be treated as a sequence labeling task. The popular approaches are conditional random fields (CRF) (Raymond and Riccardi, 2007) and recurrent neural networks (RNN) (Xu and Sarikaya, 2013a;Yao et al., 2014).
Recently, Tan et al. (2018) introduced the self-attention mechanism for CRF-free sequential labeling. Joint models (E et al., 2019; Qin et al., 2019) consider the cross-impact between slots and intents. Our framework follows this state-of-the-art joint-model paradigm and further focuses on the multiple-intent scenario, which the above joint models do not consider. Recently, Gangadharaiah and Narayanaswamy (2019) proposed a joint model that considers multiple intent detection and slot filling simultaneously, explicitly leveraging the overall intent information with a gate mechanism to guide the slot prediction of all tokens. Compared with this work, the main differences are as follows: 1) Our framework performs fine-grained intent information transfer with a unified graph interaction architecture, while their work simply incorporates the same intent information for the slot prediction of all tokens. 2) As far as we know, their corpus and code are not distributed, which makes their work hard to follow. In contrast, we empirically construct two large-scale multi-intent SLU datasets, and all datasets and code have been released. We hope this will push forward research on multi-intent SLU.

Conclusion
In this paper, we propose a token-level adaptive graph-interactive framework to model the interaction between multiple intents and slots at each token, enabling fine-grained intent information transfer for slot prediction. To the best of our knowledge, this is the first work to explore fine-grained intent information transfer in multi-intent SLU. In addition, we release two multi-intent datasets, which we hope will push forward research in this area. Experiments on four datasets show the effectiveness of the proposed model, which achieves state-of-the-art performance.