Leveraging Non-Conversational Tasks for Low Resource Slot Filling: Does it help?

Slot filling is a core operation for utterance understanding in task-oriented dialogue systems. Slots are typically domain-specific, and adding new domains to a dialogue system involves data- and time-intensive processes. A popular technique to address this problem is transfer learning, which assumes the availability of a large slot filling dataset for a source domain, used to help slot filling on a target domain where fewer data are available. In this work, instead, we propose to leverage source tasks based on semantically related non-conversational resources (e.g., semantic sequence tagging datasets), as they are both cheaper to obtain and reusable across several slot filling domains. We show that using auxiliary non-conversational tasks in a multi-task learning setup consistently improves low resource slot filling performance.


Introduction
Language understanding in task-oriented dialogue systems involves recognizing information (i.e., slot filling) expressed in an utterance to accomplish a particular dialogue task. For example, in a flight booking scenario, the utterance "show me all Delta flights from Milan to New York" contains information belonging to slots in the flight domain, namely airline name (Delta), origin (Milan), and destination (New York). Slots are usually predefined and domain-specific; e.g., in a hotel domain the slots can be different, such as room type, length of stay, etc. Although recent neural-based models (Goo et al., 2018; Wang et al., 2018; Liu and Lane, 2016) have shown remarkable performance in slot filling, they still rely on large labeled datasets, which means that training a separate model for each domain is a resource-intensive process. Thus, as more domains are added to the system, methods that can generalize slot filling to new domains with limited labeled data (i.e., low-resource settings) are preferable.
Existing work on low resource slot filling is mostly based on transfer learning (Mou et al., 2016), whose aim is to leverage relatively large resources in a source domain (D_S) for a source task (T_S) to help a task (T_T) in a target domain (D_T), where less data is available. Depending on how the adaptation is performed, there are two notable approaches: data-driven adaptation (Jaech et al., 2016; Goyal et al., 2018; Kim et al., 2016) and model-driven adaptation (Kim et al., 2017; Jha et al., 2018). Essentially, both approaches produce a model on the target domain by training on the same task (slot filling, in our case), i.e., assuming T_S = T_T, although on different domains, i.e., D_S ≠ D_T. All of these approaches assume that a slot filling dataset for the source domain is available, and little effort has been devoted to finding and exploiting cheaper T_S, which is crucial in situations where a slot filling dataset in D_S is not available yet (cold start).
Accordingly, we attempt to leverage non-conversational source tasks (T_S ≠ T_T), i.e., tasks that use widely available non-conversational resources, to help slot filling. These resources are cheaper to obtain than domain-specific slot filling datasets, and many of them are annotated with rich linguistic knowledge, which is potentially useful for slot filling (Chen et al., 2016). Among these resources, we mention PropBank (Palmer et al., 2005) and FrameNet (Baker et al., 1998), which consist of documents annotated with verb-based and frame-based semantic roles, respectively; CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Pradhan et al., 2013), which provide named entity information; and Abstract Meaning Representation (AMR) (Banarescu et al., 2013), which provides a graph-based semantic representation.

Table 1: An example of slot filling annotation from the ATIS (Airline Travel Information System) dataset and author-annotated NER and SemTag in IOB format (Ramshaw and Marcus, 1995). Some ATIS slots correspond to NER or SemTag labels, such as FROM_LOC and TO_LOC with GPE in NER and SemTag. Some slot tags can also be composed of several SemTag labels, such as COST_REL, which is composed of TOP (superlative positive) and IST (intersective adjective).
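To make the IOB scheme of Table 1 concrete, the sketch below converts labeled token spans of the example utterance into per-token IOB tags. This is a minimal illustration; the slot names (e.g., AIRLINE_NAME) are placeholders standing in for ATIS-style tags, not the exact label inventory of the dataset.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end, label) spans over token indices into IOB tags.

    `end` is exclusive; tokens outside any span receive the "O" tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # continuation tokens
    return tags

tokens = ["show", "me", "all", "delta", "flights",
          "from", "milan", "to", "new", "york"]
spans = [(3, 4, "AIRLINE_NAME"), (6, 7, "FROM_LOC"), (8, 10, "TO_LOC")]
tags = spans_to_iob(tokens, spans)
# "new york" becomes a two-token span: B-TO_LOC followed by I-TO_LOC
```

The same per-token format is used for the NER and SemTag annotations, which is what makes joint training over the three tagging tasks straightforward.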
In this work, we leverage non-conversational tasks as auxiliary tasks in a multi-task learning (MTL) (Caruana, 1997) setup. Given appropriate auxiliary tasks, MTL has been shown to be particularly effective in settings where labeled data is scarce, and it has been applied to various NLP tasks such as parsing (Søgaard and Goldberg, 2016), POS tagging (Yang et al., 2016), neural machine translation (Luong et al., 2016), and opinion role labeling (Marasovic and Frank, 2018). While there are potentially many non-conversational tasks that we could use as auxiliary tasks, we focus on those that assign semantic class categories to words, as they are similar in nature to slot filling. In particular, we choose Named Entity Recognition (NER) and the recently introduced Semantic Tagging (SemTag), motivated by the following rationales:
• Both NER and SemTag are semantically related to slot filling. As illustrated in Table 1, slot labels may correspond to either NER or SemTag labels. In addition, SemTag complements NER, as its labels subsume NER labels and could thus be useful to address linguistic phenomena (e.g., comparative expressions, intersective adjectives) that are relevant for slot filling and go beyond named entities.
• Both NER and SemTag can be re-used in many slot filling domains. Labels in both tasks are typically more general (coarse-grained) than labels in slot filling.
• The resources for both tasks are cheaper to obtain than domain-specific slot filling datasets, as there have been several initiatives in constructing large datasets for NER and SemTag, for example OntoNotes (Pradhan et al., 2013) and the Parallel Meaning Bank (PMB), respectively. This is beneficial in a cold-start situation in which no slot filling dataset is yet available in D_S.
Although NER has already been used in slot filling models, most existing approaches (Mesnil et al., 2013, 2015; Zhang and Wang, 2016; Gong et al., 2019; Louvan and Magnini, 2018) incorporate ground truth NER labels or the output of NER systems as features to train a slot filling model. Our work differs in that it learns and leverages such features from disjoint datasets through MTL and evaluates the performance in low-resource settings.
Our contributions are: (i) we propose to leverage non-conversational tasks, namely NER and SemTag, to improve low resource slot filling through MTL; to our knowledge, this MTL combination has not been explored before; (ii) we show that MTL models with NER and SemTag strongly improve over single-task slot filling models on three well-known datasets. While we focus on NER and SemTag, our study sheds light on the potential of non-conversational tasks in general to help low resource slot filling.

Approach
Slot filling is often modeled as a sequence labeling problem. Given a sequence of words x = (x_1, x_2, ..., x_n) as input, a model M predicts the corresponding slot labels y = (y_1, y_2, ..., y_n) as output.

Base Model
State-of-the-art models for sequence labeling are typically built on a bi-directional LSTM (bi-LSTM), on top of which there is a CRF layer (Lample et al., 2016; Ma and Hovy, 2016). The bi-LSTM takes x as input, where each word x_i is represented as an embedding e_i = [w_i; c_i], the concatenation of a word embedding w_i and character embeddings c_i. The bi-LSTM layer produces a forward output state h_i^f and a backward output state h_i^b. The concatenation of the two output states is fed to a feed-forward (FF) layer, followed by a CRF as the final output layer, which predicts the slot label y_i by taking into account both the context information captured by the last FF layer and the slot prediction y_{i-1} for the previous word.
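To make the role of the CRF output layer concrete, the sketch below implements Viterbi decoding over toy emission and transition scores. This is illustrative only: in the actual model these scores are learned jointly with the bi-LSTM, and start/stop transitions (which a full CRF also uses) are omitted here for brevity.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence under a linear-chain scorer.

    emissions: list of {label: score} dicts, one per token.
    transitions: {(prev_label, cur_label): score}; missing pairs score 0.
    """
    labels = list(emissions[0])
    score = {l: emissions[0][l] for l in labels}   # best score ending in each label
    backptr = []                                   # per-step backpointers
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in labels:
            best_prev = max(
                labels, key=lambda p: score[p] + transitions.get((p, cur), 0.0)
            )
            new_score[cur] = (score[best_prev]
                              + transitions.get((best_prev, cur), 0.0)
                              + emit[cur])
            ptr[cur] = best_prev
        score, backptr = new_score, backptr + [ptr]
    # Recover the best path by following backpointers from the best final label.
    best = max(labels, key=score.get)
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]
```

The transition scores are what let the CRF penalize invalid tag sequences, e.g., an I- tag that does not follow a matching B- tag, which per-token softmax classifiers cannot enforce.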

Multi-task Learning Models
In the context of MTL, models for T_S, often referred to as auxiliary tasks, and for T_T, referred to as the target task, are trained simultaneously (Yang et al., 2017). In order to perform adaptation, the MTL model M is partitioned into task-specific parts (M_{T_S} and M_{T_T}) and a task-shared part (M_{T_S ∩ T_T}). We use two notable MTL architectures:
• MTL-Fully Shared Network (MTL-FSN). The word and character embeddings and the bi-LSTM layers belong to M_{T_S ∩ T_T}. The hidden state outputs of the bi-LSTM are passed to each of the CRF output layers in M_{T_S} and M_{T_T}. During training on a mini-batch of a particular task, the output layers of the other tasks are not updated.
• Hierarchical MTL (H-MTL). Inspired by Søgaard and Goldberg (2016) and Sanh et al. (2019), we introduce a hierarchy of tasks in M to create different levels of supervision. Instead of placing the output CRF layers for all tasks after the shared bi-LSTM layer, we add a task-specific bi-LSTM in M_{T_T} after the shared bi-LSTM and then attach the output layer. In other words, we supervise T_S, which has coarse-grained labels, at the lower-level output layer, and T_T, which has more fine-grained labels, at the higher-level output layer.
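The parameter-sharing difference between the two architectures can be sketched as follows. This is a toy illustration of which parameter groups receive gradients on a mini-batch of each task; the module names (`shared_bilstm`, `target_bilstm`, etc.) are hypothetical placeholders, not identifiers from our implementation.

```python
def modules_updated(architecture, task):
    """Return the parameter groups updated on a mini-batch of `task`.

    In MTL-FSN, all tasks share the embeddings and one bi-LSTM, and each
    task owns only its CRF head. In H-MTL, the target task additionally
    owns a task-specific bi-LSTM stacked on top of the shared one.
    """
    shared = ["embeddings", "shared_bilstm"]          # M_{T_S ∩ T_T}
    if architecture == "MTL-FSN":
        return shared + [f"crf_{task}"]
    if architecture == "H-MTL":
        extra = ["target_bilstm"] if task == "slot_filling" else []
        return shared + extra + [f"crf_{task}"]
    raise ValueError(f"unknown architecture: {architecture}")
```

Note that the auxiliary CRF heads are never touched by target-task batches (and vice versa), which is what realizes the task-specific vs. task-shared partition described above.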

Experiments
The main objective of our experiments is to validate the hypothesis that using non-conversational tasks as auxiliary tasks in an MTL setup can help low resource slot filling. In our MTL configuration, the target task (T_T) is slot filling, and the auxiliary tasks (T_S) are set to NER, SemTag, or both.
Baselines. We compare the two MTL approaches (see §2.2) with the following baselines: • Single-Task Learning (STL). The base model is directly trained and tested on T_T, without incorporating any information from T_S. The base model (see §2.1) is a bi-LSTM-CRF, which is the core of many models for slot filling (Goo et al., 2018). [...] (Giampiccolo et al., 2007), and the Bible (Christodoulopoulos and Steedman, 2015). Following previous work on SemTag (Abzianidze and Bos, 2017), we train the SemTag model on the silver data and test on the gold data. For all datasets, we use the provided train/dev/test splits. Table 2 shows the overall statistics of each dataset. To simulate low resource settings, in all experiments we use only 10% of the training data for T_T.
Training. We do not tune hyperparameters but follow the suggestions of, and adapt the implementation by, Reimers and Gurevych (2017). During MTL training, mini-batches alternate (Jaech et al., 2016) between T_T and T_S. Consequently, as the training data of T_S is larger than that of T_T, the same T_T data is reused until the whole of T_S has been used in training. We evaluate the performance by computing the F1-score on the test set using the standard CoNLL-2000 evaluation script (https://www.clips.uantwerpen.be/conll2000/).

Table 3: Average F1-score and standard deviation (numbers in subscript) of the performance on the test sets. For the T_T training split, only 10% of the data is used. Bold indicates the best score for each T_T. N and S in T_S denote NER and SemTag, respectively.
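The span-based F1 used for evaluation can be sketched as follows. This is a minimal pure-Python version of CoNLL-style exact-match span scoring; the official conlleval Perl script handles additional edge cases and also reports per-label precision and recall.

```python
def iob_spans(tags):
    """Extract (start, end, label) spans from an IOB tag sequence (end exclusive)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a trailing span
        cont = start is not None and tag.startswith("I-") and tag[2:] == label
        if start is not None and not cont:
            spans.append((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_f1(gold, pred):
    """Micro-averaged F1 over exact span matches across all sentences."""
    g = {(i,) + s for i, seq in enumerate(gold) for s in iob_spans(seq)}
    p = {(i,) + s for i, seq in enumerate(pred) for s in iob_spans(seq)}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A span counts as correct only if both its boundaries and its label match exactly, so partial overlaps (e.g., tagging only "york" of "new york") contribute nothing to the score.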

Results and Discussion
Overall Performance. Table 3 lists the overall performance of the baselines and of the MTL models. We report the average F1-score and the standard deviation over three runs with different random seeds, as recommended by Reimers and Gurevych (2018). For all T_T, the MTL models with NER or SemTag combinations yield the best results compared to STL. MTL models also outperform the STL + FB baseline, indicating that training the model simultaneously with the auxiliary task is better than incorporating the output of independently trained auxiliary models as features for the slot filling model. In terms of the effectiveness of the auxiliary tasks, using NER produces the best results among the T_S combinations, although the difference between MTL with NER and MTL with SemTag is marginal. Regarding the MTL architectures, on average, H-MTL yields better scores than MTL-FSN on MIT-R and MIT-M, which suggests that supervising tasks with coarse-grained and fine-grained labels on different layers is beneficial.
Slot-wise Performance. One of our motivations for using NER and SemTag is that their labels are coarse-grained and can be re-used across several slot filling domains. We are interested in whether MTL improves the performance on slots related to these coarse-grained concepts. To this end, we manually created a mapping (provided in Appendix A) from the slots to coarse-grained entity concepts used by CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003), including Person, Organization, and Location. For example, in ATIS, the slot airline_name is mapped to Organization, the slot fromloc.city_name is mapped to Location, etc. We perform the analysis on the dev set by re-running the evaluation based on the mapping. Results in Table 4 show that in ATIS and MIT-R, MTL brings improvements on slots related to Location and Organization. However, MTL does not help on slots related to Person names in MIT-M. Based on our inspection of the prediction results, most errors come from misclassifying DIRECTOR slots as ACTOR slots.

Table 4: Performance on slots related to person (PER), location (LOC), and organization (ORG) concepts. We use the best MTL model from Table 3 for each T_T.

Figure 1: Gain (ΔF1) obtained using MTL over STL with increasing training data. Positive numbers mean MTL is better; negative numbers mean MTL is worse. We use the best MTL model from Table 3 for each T_T.
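The slot-wise analysis remaps fine-grained slot tags onto coarse entity concepts before re-running the evaluation. The sketch below shows this remapping; the mapping dictionary here is a small hypothetical sample built from the examples in the text, while the full mapping used in the paper is given in its Appendix A.

```python
# Partial, illustrative slot-to-concept mapping (full version: Appendix A).
SLOT_TO_CONCEPT = {
    "airline_name": "ORG",
    "fromloc.city_name": "LOC",
    "toloc.city_name": "LOC",
}

def remap(tags):
    """Rewrite IOB slot tags as IOB tags over coarse concepts.

    Slots without a mapped concept are dropped to "O", so the re-run
    evaluation scores only PER/LOC/ORG-related spans.
    """
    out = []
    for tag in tags:
        if tag == "O" or tag[2:] not in SLOT_TO_CONCEPT:
            out.append("O")
        else:
            out.append(tag[:2] + SLOT_TO_CONCEPT[tag[2:]])   # keep B-/I- prefix
    return out
```

Applying `remap` to both the gold and predicted tag sequences and then computing span F1 yields the concept-level scores reported in Table 4.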
Performance Gain with Increasing Data Size. We also carried out an experiment increasing the amount of training data for T_T and evaluated the performance on the dev set, to understand the usefulness of MTL at varying data sizes. As shown in Figure 1, as we increase the size of the training data, the gain obtained using MTL tends to decrease. The results suggest that MTL is indeed most useful in very low resource scenarios, in line with our initial hypothesis. Once 40% of the training data is used (around 2K utterances), MTL is less useful. We believe that this is because the slot filling datasets are relatively simple, e.g., the texts are short and most of them express a single specific request; thus, it is relatively easy for the model to capture the regularities.
Impact on Auxiliary Task Performance. We also perform an analysis to understand the effect of MTL on model performance for T_S. The STL performance on OntoNotes and Semantic Tagging is around 89% and 96% F1, respectively. With MTL, on average, the T_S model performance decreases by about 0.7 points on OntoNotes and 0.2 points on Semantic Tagging. This suggests that the T_S models do not benefit from the low resource T_T under the MTL framework and training mechanism that we use. In general, whether MTL can benefit model performance on a target task given auxiliary tasks (or vice versa) is still an open question and beyond the scope of this paper. We refer to Bingel and Søgaard (2017) and Alonso and Plank (2017), which study the characteristics of auxiliary tasks that can potentially help target task performance.

Conclusions
We proposed to leverage non-conversational tasks, namely Named Entity Recognition and Semantic Tagging, through multi-task learning to help low resource slot filling. Our experiments demonstrate that: (i) non-conversational tasks are effective in improving slot filling performance, and they are reusable across different slot filling domains; (ii) incorporating a task hierarchy in the multi-task architecture, based on the granularity of the labels, is beneficial for model performance on two out of three datasets.
In the future, we plan to explore other non-conversational resources such as FrameNet (Baker et al., 1998), which provides a repository of event frames and semantic roles that can be relevant for intent classification and slot filling in task-oriented dialogue systems. Another direction is to apply fine-tuning with recently popular pre-trained language models, e.g., BERT (Devlin et al., 2018).