Improving Slot Filling by Utilizing Contextual Information

Slot Filling (SF) is one of the sub-tasks of Spoken Language Understanding (SLU) which aims to extract semantic constituents from a given natural language utterance. It is formulated as a sequence labeling task. Recently, it has been shown that contextual information is vital for this task. However, existing models employ contextual information in a restricted manner, e.g., using self-attention. Such methods fail to distinguish the effects of the context on the word representation and on the word label. To address this issue, we propose a novel method that incorporates contextual information at two different levels: the representation level and the task-specific (i.e., label) level. Our extensive experiments on three SF benchmark datasets show the effectiveness of our model, which achieves new state-of-the-art results on all three.


Introduction
Slot Filling (SF) is the task of identifying the semantic constituents expressed in a natural language utterance. It is one of the sub-tasks of spoken language understanding (SLU) and plays a vital role in personal assistant tools such as Siri, Alexa, and Google Assistant. This task is formulated as a sequence labeling problem. For instance, in the given sentence "Play Signe Anderson chant music that is newest.", the goal is to identify "Signe Anderson" as "artist", "chant music" as "music-item" and "newest" as "sort".
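To make the sequence labeling formulation concrete, the following minimal sketch converts the slot spans of the example utterance into one label per word. The BIO tagging scheme, the helper name `spans_to_bio`, and the word-index spans are assumptions made for illustration only; the text above does not specify the labeling scheme.

```python
# Hypothetical helper: the BIO scheme and the span format are assumptions,
# not something specified in the paper.
def spans_to_bio(words, spans):
    """Turn (start, end_exclusive, slot_name) spans into one label per word."""
    labels = ["O"] * len(words)
    for start, end, slot in spans:
        labels[start] = "B-" + slot          # first word of the slot
        for i in range(start + 1, end):
            labels[i] = "I-" + slot          # remaining words of the slot
    return labels


words = "Play Signe Anderson chant music that is newest".split()
spans = [(1, 3, "artist"), (3, 5, "music-item"), (7, 8, "sort")]
labels = spans_to_bio(words, spans)

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```

Under this scheme, every word receives exactly one label, so the model output has the same length as the input utterance.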
Early work on SF employed feature engineering methods to train statistical models, e.g., Conditional Random Field (Raymond and Riccardi, 2007). Later, deep learning emerged as a promising approach for SF (Yao et al., 2014; Peng et al., 2015; Liu and Lane, 2016). The success of deep models could be attributed to pre-trained word embeddings, which generalize across words, and to deep learning architectures, which compose the word embeddings into effective representations. In addition to improving word representation using deep models, Liu and Lane (2016) showed that incorporating the context of each word into its representation could improve the results. Concretely, the effect of using context in word representation is two-fold: (1) Representation Level: As the meaning of a word depends on its context, incorporating the contextual information is vital to represent the true meaning of the word in the sentence. (2) Task Level: For SF, the label of a word is related to the other words in the sentence, so providing information about the other words to the prediction layer could improve the performance. Unfortunately, existing work employs the context in a restricted manner, e.g., via an attention mechanism, which does not let the model distinguish between the two aforementioned effects of the contextual information.
* This work was done when the first author was an intern at Adobe Research.
To address the limitations of prior work in exploiting the context for SF, we propose a multi-task setting to train the model. More specifically, our model is encouraged to explicitly capture the two aforementioned effects of the contextual information for the task of SF. In particular, in addition to the main sequence labeling task, we introduce new sub-tasks to ensure each effect. Firstly, at the representation level, we enforce consistency between the word representations and their context. This is achieved by increasing the Mutual Information (MI) between these two representations. Secondly, at the task level, we propose two new sub-tasks: (1) predicting the label of the word solely from its context and (2) predicting which labels exist in the given sentence in a multi-label classification setting. By doing so, we encourage the model to encode task-specific features in the context of each word. Our extensive experiments on three benchmark datasets empirically demonstrate the effectiveness of the proposed model, which achieves new state-of-the-art results on all three datasets.

Related Work
In the literature, Slot Filling (SF) is categorized as one of the sub-tasks of spoken language understanding (SLU). Early work employed feature engineering for statistical models, e.g., Conditional Random Field (Raymond and Riccardi, 2007). Due to the limited generalisation ability of feature based models, deep learning based models superseded them (Yao et al., 2014; Peng et al., 2015; Kurata et al., 2016; Hakkani-Tür et al., 2016). Joint models that simultaneously predict the intent of the utterance and extract the semantic slots have also gained a lot of attention (Guo et al., 2014; Liu and Lane, 2016; Zhang and Wang, 2016; Wang et al., 2018; Goo et al., 2018; Qin et al., 2019; E et al., 2019). In addition to the supervised settings, other settings such as progressive learning (Shen et al., 2019) and zero-shot learning (Shah et al., 2019) have recently been studied. To the best of our knowledge, none of the existing work introduces multi-task learning solely for SF to incorporate the contextual information at both the representation and task levels.

Model
Our model is trained in a multi-task setting in which the main task is slot filling, i.e., identifying the best possible sequence of labels for the given sentence. In the first auxiliary task, we aim to increase the consistency between the word representation and its context. The second auxiliary task aims to enhance the task-specific information in the contextual information. In this section, we explain each of these tasks in more detail.

Slot Filling
Formally, the input to an SF model is a sequence of words $X = [x_1, x_2, \ldots, x_n]$ and our goal is to predict the sequence of labels $Y = [y_1, y_2, \ldots, y_n]$. In our model, the word $x_i$ is represented by the vector $e_i$, which is the concatenation of the pre-trained word embedding and the POS tag embedding of $x_i$. In order to obtain a more abstract representation of the words, we employ a Bi-directional Long Short-Term Memory (BiLSTM) over the word representations $E = [e_1, e_2, \ldots, e_n]$ to generate the abstract vectors $H = [h_1, h_2, \ldots, h_n]$. The vector $h_i$ is the final representation of the word $x_i$ and is fed into a two-layer feed forward neural net to compute the label scores $s_i$ for the given word, $s_i = FF(h_i)$. As the task of SF is formulated as a sequence labeling task, we exploit a conditional random field (CRF) layer as the final layer of SF prediction. More specifically, the predicted label scores $S = [s_1, s_2, \ldots, s_n]$ are provided as emission scores to the CRF layer to predict the label sequence $\hat{Y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]$. To train the model, the negative log-likelihood is used as the loss function for SF prediction, i.e., $L_{pred}$.
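As an illustration of how a CRF layer can turn emission scores into a label sequence, the sketch below runs standard Viterbi decoding over toy scores. It is a minimal sketch, not the implementation used in our experiments: the function name `viterbi_decode`, the three-label inventory, and all numeric scores are hypothetical, and start/end transition scores are omitted for brevity.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][b]   : emission score s_t for label b at position t
    transitions[a][b] : score of moving from label a to label b
    """
    n, k = len(emissions), len(transitions)
    score = list(emissions[0])      # best score of any path ending in each label
    backpointers = []
    for t in range(1, n):
        new_score, pointers = [], []
        for b in range(k):
            best_a = max(range(k), key=lambda a: score[a] + transitions[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[t][b])
        score = new_score
        backpointers.append(pointers)
    last = max(range(k), key=lambda b: score[b])
    path = [last]
    for pointers in reversed(backpointers):   # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(score)


# Toy example (all values hypothetical): labels 0 = O, 1 = B, 2 = I.
FORBIDDEN = -1e4
transitions = [
    [0.0, 0.0, FORBIDDEN],   # O -> I is not a valid transition
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.5],
    [1.0, 0.0, 0.5],
]
path, best = viterbi_decode(emissions, transitions)
print(path, best)
```

In this toy case, picking the best label independently at each position would give O, I, O, which contains an invalid transition; decoding over the whole sequence instead selects B, I, O. This is the kind of label dependency a CRF layer is meant to capture.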

Consistency between Word and Context
In this sub-task we aim to increase the consistency between the word representation and its context. To obtain the context of each word, we use max pooling over the outputs of the BiLSTM for all words of the sentence excluding the word itself:
$$h^c_i = \mathrm{MaxPooling}(h_1, h_2, \ldots, h_n / h_i)$$
We aim to increase the consistency between the vectors $h_i$ and $h^c_i$. To this end, we propose to maximize the Mutual Information (MI) between the word representation and its context. In information theory, MI evaluates how much information we know about one random variable if the value of another variable is revealed. Formally, the mutual information between two random variables $X_1$ and $X_2$ is obtained by:
$$MI(X_1, X_2) = \int_{X_1} \int_{X_2} P(X_1, X_2) \log \frac{P(X_1, X_2)}{P(X_1) P(X_2)} \, dX_1 \, dX_2 \tag{1}$$
Using this definition, we can reformulate MI as the KL-Divergence between the joint distribution $P_{X_1 X_2} = P(X_1, X_2)$ and the product of the marginal distributions $P_{X_1} \otimes P_{X_2} = P(X_1) P(X_2)$:
$$MI(X_1, X_2) = D_{KL}(P_{X_1 X_2} \,\|\, P_{X_1} \otimes P_{X_2}) \tag{2}$$
Under this view, the more strongly the two random variables depend on each other, the larger the mutual information between them (i.e., the KL-Divergence in Equation 2); it is zero when they are independent. Consequently, if the representations $h_i$ and $h^c_i$ are encouraged to have large mutual information, we expect them to share more information.
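The relation between mutual information and dependence can be checked numerically on small discrete distributions. The sketch below is illustrative only: it replaces the integrals with sums over a hypothetical 2x2 joint probability table, and the function name `mutual_information` is an assumption.

```python
import math


def mutual_information(joint):
    """MI in nats for a discrete joint table, joint[a][b] = P(X1=a, X2=b).

    Computed as the KL-Divergence between the joint distribution and the
    product of its marginals.
    """
    p1 = [sum(row) for row in joint]            # marginal P(X1)
    p2 = [sum(col) for col in zip(*joint)]      # marginal P(X2)
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:                           # terms with p = 0 contribute 0
                mi += p * math.log(p / (p1[a] * p2[b]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]      # X1 tells us nothing about X2
dependent = [[0.5, 0.0], [0.0, 0.5]]            # X1 fully determines X2

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For the independent table the joint equals the product of the marginals, so the divergence vanishes; for the fully dependent table the MI equals the entropy of a fair binary variable, log 2 nats.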
Computing the KL-Divergence in Equation 2 could be prohibitively expensive (Belghazi et al., 2018), so we need to estimate it. To this end, we exploit the adversarial method introduced by Hjelm et al. (2019). In this method, a discriminator is employed to distinguish between samples from the joint distribution and samples from the product of the marginal distributions in order to estimate the KL-Divergence in Equation 2. In our case, the sample from the joint distribution is the concatenation $[h_i : h^c_i]$ and the sample from the product of the marginal distributions is the concatenation $[h_i : h^c_j]$, where $h^c_j$ is a context vector randomly chosen from the words in the mini-batch. Formally:
$$L_{disc} = \frac{1}{n} \sum_{i=1}^{n} -\left( \log D([h_i : h^c_i]) + \log\left(1 - D([h_i : h^c_j])\right) \right)$$
where $D$ is the discriminator. This loss is added to the final loss function of the model.
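One common way to train such a discriminator is a binary cross-entropy objective in which matched pairs are positives and mismatched pairs are negatives. The sketch below follows that reading and is an assumption rather than a confirmed reproduction of our implementation: the linear-plus-sigmoid discriminator, the names `discriminator` and `discriminator_loss`, and all numeric values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator(pair, weights, bias):
    """Hypothetical discriminator: a single linear layer followed by a sigmoid."""
    return sigmoid(sum(w * v for w, v in zip(weights, pair)) + bias)


def discriminator_loss(H, C, weights, bias, rng):
    """Binary cross-entropy between matched and mismatched (word, context) pairs.

    H[i] is the word vector h_i and C[i] its own context vector h^c_i.
    The negative context h^c_j is drawn uniformly at random from C.
    """
    n = len(H)
    total = 0.0
    for i in range(n):
        j = rng.randrange(n)                                  # random context index
        positive = discriminator(H[i] + C[i], weights, bias)  # [h_i : h^c_i]
        negative = discriminator(H[i] + C[j], weights, bias)  # [h_i : h^c_j]
        total += -(math.log(positive) + math.log(1.0 - negative))
    return total / n


H = [[1.0, 0.0], [0.0, 1.0]]            # toy word vectors
C = [[0.0, 1.0], [1.0, 0.0]]            # toy context vectors
rng = random.Random(0)
untrained = discriminator_loss(H, C, [0.0] * 4, 0.0, rng)     # D outputs 0.5 everywhere
print(untrained)
```

With all-zero weights the discriminator outputs 0.5 for every pair, so each word contributes 2 log 2 to the loss; a trained discriminator that separates matched from mismatched pairs would drive this value down.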

Prediction by Contextual Information
In addition to increasing the consistency between the word representation and its context representation, we aim to increase the task-specific information in the contextual representations. To this end, we train the model on two auxiliary tasks. The first one uses the context of each word to predict the label of that word. The second one uses the global context information to predict sentence level labels. We describe each of these tasks in more detail in the following subsections.

Predicting Word Label
In this sub-task, we use the context representation of each word to predict its label. This increases the information about the label of the word that is encoded in its context. We use the same context vector $h^c_i$ for the $i$-th word as described in the previous section. This vector is fed into a two-layer feed forward neural network with a softmax layer at the end to output the probabilities for each class, $P_i(\cdot \mid \{x_1, x_2, \ldots, x_n\} / x_i) = \mathrm{softmax}(FF(h^c_i))$. Finally, we use the following negative log-likelihood as the loss function to be optimized during training:
$$L_{wp} = \frac{1}{n} \sum_{i=1}^{n} -\log P_i(y_i \mid \{x_1, x_2, \ldots, x_n\} / x_i)$$
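The two ingredients of this sub-task, the leave-one-out context vector and the negative log-likelihood over the softmax output, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the feed forward network is omitted (the loss is computed directly from given label scores), and the vectors are toy values.

```python
import math


def context_vectors(H):
    """h^c_i: element-wise max over all word vectors except h_i (needs n >= 2)."""
    n, d = len(H), len(H[0])
    return [
        [max(H[j][k] for j in range(n) if j != i) for k in range(d)]
        for i in range(n)
    ]


def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_label_nll(logits_per_word, gold_labels):
    """Average negative log-likelihood of the gold label of each word."""
    losses = [
        -math.log(softmax(logits)[gold])
        for logits, gold in zip(logits_per_word, gold_labels)
    ]
    return sum(losses) / len(losses)


H = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]]             # toy BiLSTM outputs, n = 3
C = context_vectors(H)
print(C)

# Hypothetical label scores that a feed forward net might produce from C.
uniform_loss = word_label_nll([[0.0, 0.0]] * 3, [0, 1, 0])
print(uniform_loss)
```

Because the word's own vector is excluded from the pooling, the classifier can only succeed if the surrounding words carry information about the label, which is the effect this sub-task is meant to encourage.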

Predicting Sentence Labels
The word label prediction forces the context of each word to contain information about its label, but it lacks a global view of the entire sentence. In order to increase the global information about the sentence in the representations of the words, we aim to predict the labels existing in a sentence from the representations of its words. More specifically, we introduce a new sub-task to predict which labels exist in the given sentence. We formulate this task as a multi-label classification problem. Formally, for each sentence, we predict the binary vector $Y^s = [y^s_1, y^s_2, \ldots, y^s_{|L|}]$, where $L$ is the set of all possible word labels. In the vector $Y^s$, $y^s_i$ is 1 if the sentence $X$ contains the $i$-th label from the label set $L$; otherwise it is 0.
To predict the vector $Y^s$, we first compute the representation of the sentence. This representation is obtained by max pooling over the outputs of the BiLSTM, $H = \mathrm{MaxPooling}(h_1, h_2, \ldots, h_n)$. Afterwards, the vector $H$ is fed into a two-layer feed forward neural net with a sigmoid activation function at the end to compute the probability distribution of $Y^s$ (i.e., $P_k(\cdot \mid x_1, x_2, \ldots, x_n) = \sigma_k(FF(H))$ for the $k$-th label in $L$). Note that since this task is a multi-label classification, the number of neurons at the final layer is equal to $|L|$. We optimize the following binary cross-entropy loss:
$$L_{sp} = -\sum_{k=1}^{|L|} \left( y^s_k \cdot \log P_k(y^s_k \mid x_1, x_2, \ldots, x_n) + (1 - y^s_k) \cdot \log\left(1 - P_k(y^s_k \mid x_1, x_2, \ldots, x_n)\right) \right)$$
Finally, to train the entire model we optimize the following combined loss:
$$\mathcal{L} = L_{pred} + \alpha L_{disc} + \beta L_{wp} + \gamma L_{sp}$$
where $\alpha$, $\beta$ and $\gamma$ are the trade-off parameters to be tuned based on the development set performance.
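The sentence-level target vector, the binary cross-entropy, and the weighted combination of the losses can be sketched as below. This is an illustrative sketch: the function names are hypothetical, the example label set and probabilities are toy values, and the assignment of the three trade-off parameters to the discriminator, word-level and sentence-level losses (in that order) is an assumption.

```python
import math


def sentence_label_vector(word_labels, label_set):
    """Y^s: 1 for every label in label_set that occurs in the sentence, else 0."""
    present = set(word_labels)
    return [1 if label in present else 0 for label in label_set]


def binary_cross_entropy(targets, probs):
    """Sum of per-label binary cross-entropy terms."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(targets, probs)
    )


def combined_loss(l_pred, l_disc, l_wp, l_sp, alpha=0.1, beta=0.1, gamma=0.1):
    """Main SF loss plus the weighted auxiliary losses."""
    return l_pred + alpha * l_disc + beta * l_wp + gamma * l_sp


label_set = ["Action", "Object", "Attribute", "Value"]
word_labels = ["Action", "Object", "Object"]          # toy sentence labels
y_s = sentence_label_vector(word_labels, label_set)
print(y_s)

uninformed = binary_cross_entropy(y_s, [0.5] * len(label_set))
total = combined_loss(1.0, 2.0, 3.0, 4.0)             # toy loss values
print(uninformed, total)
```

Each label is treated as an independent binary decision, which is why a sigmoid per label is used rather than a single softmax over the label set.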

Dataset and Parameters
We evaluate our model on three SF datasets: ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018) and EditMe (Manuvinakurike et al., 2018). ATIS and SNIPS are two widely adopted SF datasets, and EditMe is an SF dataset for editing images with four slot labels (i.e., Action, Object, Attribute and Value). The statistics of the datasets are presented in Appendix A. Based on the experiments on the EditMe development set, the following parameters are selected: GloVe embeddings with 300 dimensions to initialize the word embeddings; 200 dimensions for all hidden layers in the LSTM and the feed forward neural nets; 0.1 for the trade-off parameters α, β and γ; and the Adam optimizer with learning rate 0.001. Following previous work, we use F1-score to evaluate the model.

Baselines
We compare our model with other deep learning based models for SF, including SPTID (Qin et al., 2019). Note that we compare our model with the single-task version of these baselines. We also compare our model with other sequence labeling models which are not specifically proposed for SF, namely CVT (Clark et al., 2018) and GCDT. CVT aims to improve the input representation by training on partial views of the input, and GCDT exploits contextual information to enhance word representations by concatenating the context and word representations.
Table 1 reports the performance of the model and the baselines. The proposed model outperforms all baselines on all datasets. Among the baselines, GCDT achieves the best results on two out of three datasets. This superiority shows the importance of explicitly incorporating the contextual information into the word representation for SF. However, the proposed model improves the performance substantially on all datasets by explicitly encouraging the consistency between a word and its context at the representation level and the task-specific (i.e., label) level. Table 1 also shows that the EditMe dataset is more challenging than the other datasets, despite having fewer slot types. This difficulty could be explained by the limited number of training examples and the greater diversity of sentence structures in this dataset.

Ablation Study
Our model consists of three major components: (1) MI: increasing the mutual information between the word representation and its context representation; (2) WP: predicting the label of the word using its context to increase the word level task-specific information in the word context; (3) SP: predicting which labels exist in the given sentence in a multi-label classification setting to increase the sentence level task-specific information in the word representations. In order to analyze the contribution of each of these components, we evaluate the model performance when we remove one of the components and retrain the model. The results are reported in Table 2. This table shows that all components are required for the model to reach its best performance. Among all components, the word level prediction using the contextual information contributes the most to the model performance. This shows that contextual information trained to be informative about the final task is necessary to obtain representations that boost the performance.

Conclusion
In this work, we introduced a new deep model for the task of Slot Filling (SF). In a multi-task setting, our model increases the mutual information between the word representation and its context, improves label information in the context and predicts which concepts are expressed in the given sentence. Our experiments on three benchmark datasets show the effectiveness of our model by achieving the state-of-the-art results on all datasets for the SF task.

A Dataset Statistics
In our experiments, we employ three benchmark datasets, ATIS, SNIPS and EditMe. Table 3 presents the statistics of these three datasets. Moreover, in order to provide more insight into the EditMe dataset, we report the label statistics of this dataset in Table 4.