Exploring Named Entity Recognition As an Auxiliary Task for Slot Filling in Conversational Language Understanding

Slot filling is a crucial task in the Natural Language Understanding (NLU) component of a dialogue system. Most approaches to this task rely solely on domain-specific datasets for training. We propose a joint model of slot filling and Named Entity Recognition (NER) in a multi-task learning (MTL) setup. Our experiments on three slot filling datasets show that using NER as an auxiliary task improves slot filling performance and achieves results competitive with the state of the art. In particular, NER is effective when supervised at the lower layer of the model. In low-resource scenarios, we found that MTL is effective for only one dataset.


Introduction
Most current dialogue systems depend on an NLU component to extract semantic information from an utterance. Such semantic information is often represented as a semantic frame, which contains the domain, the intent of the user, and predefined attributes (slots). Each word of the utterance is labeled with a slot, which defines a particular attribute (an entity, a time, etc.) of the utterance. Table 1 shows an example of a semantic frame for the sentence "Show me the prices of all flights from Atlanta to Washington DC" with the Begin/In/Out (BIO) representation.
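As a minimal sketch of the BIO representation, the snippet below tags the example sentence from slot spans. The span boundaries and slot names are illustrative assumptions and do not reproduce Table 1; the helper `to_bio` is hypothetical.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label each token with a B-/I- prefixed slot name, or O outside any span.

    Each span is (start, end, slot) with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, slot in spans:
        tags[start] = f"B-{slot}"          # first token of the slot value
        for i in range(start + 1, end):
            tags[i] = f"I-{slot}"          # remaining tokens of the same value
    return tags


tokens = "Show me the prices of all flights from Atlanta to Washington DC".split()
# Assumed annotation for illustration only.
spans = [(8, 9, "fromloc.city_name"), (10, 12, "toloc.city_name")]
tags = to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this assumed annotation, every token outside the two city spans receives the O tag.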
We focus on slot filling, the task of automatically extracting slots from a given utterance. This task can be treated as a sequence labeling problem, and the most successful approach is to employ a conditional random field (CRF) on top of a deep recurrent neural network (RNN). In general, there are two ways of training a slot filling model: (i) train a domain-specific model (Goo et al., 2018; Wang et al., 2018), or (ii) train a model across domains with transfer learning (Jaech et al., 2016; Jha et al., 2018; Kim et al., 2017). One popular transfer learning technique is multi-task learning (MTL) (Caruana, 1997), in which a joint model is trained on a target (main) task and several auxiliary tasks simultaneously to learn better feature representations across tasks. This technique has shown potential on various NLP tasks and offers flexibility, as it allows transfer learning across different domains and tasks (Yang et al., 2017). For slot filling, Jaech et al. (2016) train a single slot filling model on different domains and show that MTL is particularly useful in low-resource scenarios.
Identifying a beneficial auxiliary task for the target task is important when applying MTL (Bingel and Søgaard, 2017). In this work, we investigate the effectiveness of Named Entity Recognition (NER) as an auxiliary task for slot filling. We propose NER for two main reasons. First, slot values are typically named entities, for example airline names and city names. Second, state-of-the-art NER models achieve relatively high performance (Lample et al., 2016; Ma and Hovy, 2016). We therefore expect that the features learned for NER can improve slot filling performance. In addition, NER corpora are easier to obtain than domain-specific slot filling datasets.
We aim to answer the following questions:
• Does NER help the performance of slot filling in the MTL setup? As NER labels are usually more coarse-grained than slot filling labels, a predicted NER label might provide a good signal for the more fine-grained slot labels. For example, the location label LOC in NER can be a strong indicator for the slots fromloc.city_name or toloc.city_name and can filter out other slot labels that are not related to location. We hope that, through MTL, the model can learn more general knowledge first and transfer it to predict more specific slot information.
• What is the effect on slot filling performance of supervising NER at the lower layer of the MTL model? Inspired by the recent work of Søgaard and Goldberg (2016), we investigate the effect of supervising NER at different layers of the model. Our hypothesis is that a more "general" feature is better learned at the lower layer, where it can support a task that depends on a more "specific" feature.
In addition, we experiment with cross-domain slot filling models by jointly training on slot filling datasets from similar domains in an MTL setup. We explore two techniques to measure similarity between domains: domain similarity by Ruder and Plank (2017a) and label embedding mapping by Kim et al. (2015).
We experiment with three datasets from different domains. Our experiments show that, for all datasets, using NER as an auxiliary task benefits slot filling performance. NER is consistently helpful when it is supervised at the lower layer. In the low-resource scenario, we found mixed results: MTL is effective for only one dataset.

Model
This section describes the slot filling model, the multi-task learning setup, and the data selection that we use in our experiments.

Slot Filling Model
For the slot filling model, we adopt a neural model similar to those of Lample et al. (2016) and Ma and Hovy (2016), as it achieves state-of-the-art performance on a sequence labeling task (NER). The recent slot filling model of Jha et al. (2018) also uses a variant of this model. Given an input sentence, we represent each word $w_i$ as a concatenation of its word embedding $e(w_i)$ and its character-level embedding $c(w_i)$: $x_i = [e(w_i); c(w_i)]$. The character-level embeddings are computed using a convolutional neural network (CNN), similar to the one proposed by Kim et al. (2016). We then feed $x_i$ to a bidirectional LSTM (biLSTM) word-level encoder to incorporate the contextual information of $w_i$. The outputs of the backward and forward LSTMs at each time step are concatenated and fed into a CRF layer. The CRF layer computes the final output, i.e. the tag of each input word. We use one hidden layer between the biLSTM and the CRF, as Lample et al. (2016) have shown that it can improve performance.
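To illustrate how a CRF output layer turns per-token scores into a tag sequence, here is a minimal sketch of Viterbi decoding in plain Python. It covers decoding only, not training, and the emission scores, transition scores, and tag set are made-up values standing in for what the encoder and CRF would learn.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the score assigned to `tag` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            step[cur] = best[prev] + transitions[prev][cur] + scores[cur]
            pointers[cur] = prev
        best = step
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tags = ["O", "B-city", "I-city"]
# Assumed transition scores: only the invalid move O -> I-city is penalized.
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I-city"] = -1e4
# Assumed emission scores for the three tokens "to Washington DC".
emissions = [
    {"O": 2.0, "B-city": 0.1, "I-city": 0.0},
    {"O": 0.5, "B-city": 1.0, "I-city": 1.2},
    {"O": 0.2, "B-city": 0.3, "I-city": 1.5},
]
decoded = viterbi(emissions, transitions, tags)
print(decoded)
```

In this toy case, picking the best tag per token independently would label the second token I-city directly after O, whereas the transition scores steer the decoder to a well-formed B-city, I-city span.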

Multi-Task Learning
One simple technique for MTL is to train the target and auxiliary tasks simultaneously. In this setting, the parameters of the model are shared across tasks, pushing the model to learn feature representations that work well across tasks. Figure 1 depicts the MTL setting that we use in our work. The lower parts of the network, i.e. the word embeddings, the character-level embeddings, and the biLSTM encoder, are shared among tasks. After the biLSTM layer, we use a separate CRF layer for each task to predict the task-specific tags (NER or slot filling). We also experiment with an MTL setup that uses a different level of supervision for the auxiliary task (Søgaard and Goldberg, 2016): we use two biLSTM encoder layers, share only the lower layer, and keep the outer layer for the main slot filling task.
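The two sharing schemes can be summarized by listing which components a batch from each task passes through. The sketch below is only a schematic of the description above, not a model implementation: the component names are hypothetical, the hidden layer before the CRF is omitted, and supervising every non-main task at the lower layer is an assumption.

```python
from typing import List


def forward_path(task: str, main_task: str, layered: bool) -> List[str]:
    """List the components used by a batch of `task`, in order."""
    # Shared across all tasks in both setups.
    path = ["word_emb", "char_cnn", "bilstm_lower"]
    if layered and task == main_task:
        # Outer biLSTM layer is reserved for the main slot filling task.
        path.append("bilstm_outer")
    # Each task has its own CRF output layer.
    path.append(f"crf_{task}")
    return path


def shared_components(tasks: List[str], main_task: str, layered: bool) -> List[str]:
    """Components that receive updates from every task."""
    paths = [set(forward_path(t, main_task, layered)) for t in tasks]
    return sorted(set.intersection(*paths))


print(forward_path("slot", "slot", layered=True))
print(forward_path("ner", "slot", layered=True))
print(shared_components(["slot", "ner"], "slot", layered=True))
```

With `layered=False` both tasks use the same encoder and differ only in their CRF layers; with `layered=True` the NER loss never reaches the outer biLSTM layer.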

Data Selection
Ruder and Plank (2017b) demonstrate that selecting data for training the auxiliary task might improve the target task performance. We investigate two data selection techniques for our MTL experiments:
Domain Similarity. We use Jensen-Shannon divergence (JSD; Lin, 1991) to measure domain similarity as proposed by Ruder and Plank (2017b): $JSD(P, Q) = \frac{1}{2}\left[ KL(P \| M) + KL(Q \| M) \right]$, where $M = \frac{1}{2}(P + Q)$ and $KL(P \| Q)$ is the Kullback-Leibler divergence between two distributions P and Q. We use term distributions (Plank and Van Noord, 2011) of each domain to compute P and Q. We select the domain most similar to the main task domain to be used as the auxiliary task.
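A minimal sketch of this selection step, assuming unigram relative frequencies as the term distributions and treating the lowest divergence as the highest similarity. The toy corpora and helper names are hypothetical, and base-2 logarithms are used so that the divergence lies in [0, 1].

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def term_distribution(sentences: Iterable[List[str]]) -> Dict[str, float]:
    """Relative frequency of each lowercased term in a tokenized corpus."""
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def kl(p: Dict[str, float], m: Dict[str, float]) -> float:
    """KL(p || m) in bits; terms with zero probability in p contribute 0."""
    return sum(pv * math.log2(pv / m[t]) for t, pv in p.items() if pv > 0)


def jsd(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence between two term distributions."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def most_similar(target: Dict[str, float],
                 candidates: Dict[str, Dict[str, float]]) -> str:
    """Name of the candidate domain with the lowest divergence from target."""
    return min(candidates, key=lambda name: jsd(target, candidates[name]))


# Toy corpora, for illustration only.
flights = term_distribution([["show", "flights", "to", "boston"],
                             ["cheap", "flights", "from", "denver"]])
restaurants = term_distribution([["cheap", "restaurants", "in", "boston"]])
movies = term_distribution([["movies", "directed", "by", "spielberg"]])

choice = most_similar(flights, {"restaurant": restaurants, "movie": movies})
print(choice, jsd(flights, restaurants), jsd(flights, movies))
```

Distributions with no terms in common reach the maximum divergence of 1, and identical distributions give 0.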
Label Embedding Mapping. In an MTL setup, we sometimes want to keep only the auxiliary labels that are semantically similar to the target task labels and remove the other, irrelevant auxiliary labels. For example, the slot filling label airport.statename is similar to the auxiliary NER label LOC but not to TIME. We employ the label embedding mapping approach of Kim et al. (2015), which uses Canonical Correlation Analysis (CCA). The idea is to construct a matrix in which rows are labels and columns are words in the vocabulary. Each cell holds the pointwise mutual information (PMI) between the label and the word. We then perform a rank-k SVD on the matrix and normalize its rows. Each resulting k-dimensional row is the embedding of a particular label. We use the cosine similarity between two label embeddings to find the nearest neighbor.
Target Task & Auxiliary Tasks. For each MTL experiment, there is exactly one target task and one or more auxiliary tasks. The target task is always a slot filling task, i.e. either ATIS, MIT-R (MIT Restaurant), or MIT-M (MIT Movie). The auxiliary tasks consist of a combination of slot filling tasks from domains other than that of the target task, with or without an NER task. We select the most similar slot filling task for the target task using the domain similarity technique described in §2.3. Table 3 presents the most similar slot filling domain for each slot filling task.

Results and Analysis
Overall Performance. Table 4 summarizes the slot filling performance of our single-task (STL) and MTL models. The results of previous studies are copied directly from their reported numbers. When using the same supervision level for both target and auxiliary tasks, using the most similar domain performs worse than using all domains. In contrast, using NER together with the most similar domain as auxiliary tasks performs better than using all the domains. Experiments with different supervision levels show that using NER as an auxiliary task consistently improves slot filling performance. This result matches our intuition that a task with more coarse-grained labels, such as NER, is better supervised at the lower layer of the model. On the ATIS and MIT-R datasets, MTL achieves better performance than STL. However, on MIT-M, STL outperforms some MTL models.
To better understand the behavior of the models, we analyze the results on the development set. For the ATIS dataset, STL and MTL have the same performance on 44 out of 67 slots in the development set. For the remaining slots, STL performs better mostly on time-related slots such as arrive_time.time and depart_date.month_name, while MTL is better at recognizing location-related slots such as city_name and toloc.state_name. For the MIT Restaurant dataset, MTL performs better on 5 out of 8 slots and does well at identifying slots related to time and location. For the MIT Movie dataset, MTL yields better results for time-related slots, whereas STL gives better results for person-related slots such as character, actor, and director. Overall, although incorporating NER with slot filling shows improvements, the difference is still rather small, especially for the ATIS and MIT Movie datasets. Further work is needed to explore better mechanisms for injecting NER information to help slot filling in the MTL setup. It would also be interesting to compare MTL with a pipeline-based system that uses NER predictions as one of the features of the slot filling model.
Effect of Label Embedding Mapping. We apply label filtering to the auxiliary tasks using the label embedding mapping (§2.3). On the auxiliary datasets, we keep the most similar labels and replace irrelevant labels with O. The MTL setup that we use is the best-performing MTL setup for each dataset in Table 4. As shown in Table 5, the performance of MTL drops when we apply filtering to the auxiliary labels. We suspect that this is due to the quality of the label mapping and to the high number of O labels after the filtering process.
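The filtering step described above can be sketched as follows, assuming BIO-tagged auxiliary data and a set of labels to keep that would come from the label embedding mapping. The helper names and the toy tag sequence are hypothetical.

```python
from typing import Iterable, List


def filter_labels(tags: List[str], keep: Iterable[str]) -> List[str]:
    """Replace every tag whose label is not in `keep` with O."""
    keep = set(keep)
    filtered = []
    for tag in tags:
        # Strip the B-/I- prefix to recover the bare label.
        label = tag.split("-", 1)[1] if "-" in tag else tag
        filtered.append(tag if tag != "O" and label in keep else "O")
    return filtered


def o_ratio(tags: List[str]) -> float:
    """Fraction of tokens tagged O."""
    return tags.count("O") / len(tags)


# Toy auxiliary NER sequence; only LOC is assumed to map to a target label.
aux_tags = ["O", "B-LOC", "I-LOC", "B-TIME", "O"]
kept = filter_labels(aux_tags, keep={"LOC"})
print(kept, o_ratio(aux_tags), o_ratio(kept))
```

The share of O tags grows after filtering, which is the side effect suspected above of weakening the auxiliary signal.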
Low-Resource Scenarios. We experiment with low-resource scenarios in which we limit the training set to 200, 400, or 800 sentences for each dataset. The MTL setup that we use is the best-performing MTL setup for each dataset in Table 4. As shown in Table 6, MTL consistently performs better than STL on the MIT-R dataset. On the ATIS and MIT-M datasets, however, STL mostly gives better results than MTL.

Related Work
Recent studies on slot filling in conversational systems are mostly based on neural models. Wang et al. (2018) introduce a bi-model (RNN) structure to consider the cross-impact between intent detection and slot filling. Liu and Lane (2016) propose an attention mechanism on the encoder-decoder model for joint intent classification and slot filling. Goo et al. (2018) extend the attention mechanism using a slot-gated model to learn the relationship between slot and intent attention vectors. Hakkani-Tür et al. (2016) use a bidirectional RNN as a single model that handles multiple domains by adding a final state that contains a domain identifier. Jha et al. (2018) and Kim et al. (2017) use expert-based domain adaptation, while Jaech et al. (2016) propose a multi-task learning approach to guide the training of a model for a new domain. All of these studies train their models solely on slot filling datasets, while our focus is to exploit a more "general" resource, such as NER, by training the model jointly with slot filling through MTL with different supervision levels.

Conclusion
In this work, we investigate the effectiveness of training a slot filling model jointly with NER as an auxiliary task in an MTL setup. Our experiments demonstrate that NER is helpful for slot filling. In particular, NER is more effective when it is supervised at the lower layer of the MTL model. However, further work is needed to investigate the effectiveness of the domain similarity metric and of label embedding mapping as ways to perform data selection in the preprocessing step.