Recent Neural Methods on Slot Filling and Intent Classification for Task-Oriented Dialogue Systems: A Survey

In recent years, fostered by deep learning technologies and by the high demand for conversational AI, various approaches have been proposed that address the capacity to elicit and understand users' needs in task-oriented dialogue systems. We focus on two core tasks, slot filling (SF) and intent classification (IC), and survey how neural models have rapidly evolved to address natural language understanding in dialogue systems. We introduce three neural architectures: independent models, which model SF and IC separately; joint models, which exploit the mutual benefit of the two tasks simultaneously; and transfer learning models, which scale the model to new domains. We discuss the current state of the research in SF and IC, and highlight challenges that still require attention.


Introduction
The ability to understand users' requests is essential to develop effective task-oriented dialogue systems. For example, in the utterance "I want to listen to Hey Jude by The Beatles", a dialogue system should correctly identify that the user's intention is to give a command to play a song, and that Hey Jude and The Beatles are, respectively, the title of the song and the name of the artist that the user would like to listen to. In a dialogue system this information is typically represented through a semantic-frame structure (Tur and De Mori, 2011), and extracting such a representation involves two tasks: identifying the correct frame (i.e. intent classification (IC)) and filling the correct values for the slots of the frame (i.e. slot filling (SF)).
In recent years, neural-network based models have achieved the state of the art for a wide range of natural language processing tasks, including SF and IC. Various neural architectures have been applied to SF and IC, including RNN-based (Mesnil et al., 2013) and attention-based (Liu and Lane, 2016) approaches, up to the more recent transformer models. Input representations have also evolved from static word embeddings (Mikolov et al., 2010; Collobert and Weston, 2008; Pennington et al., 2014) to contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019). Such progress makes it possible to better address dialogue phenomena involving SF and IC, including context modeling, out-of-vocabulary words, and long-distance dependencies between words, and to better exploit the synergy between SF and IC through joint models. In addition to rapid progress in the research community, the demand for commercial conversational AI is also growing fast, as shown by the variety of available solutions, such as Microsoft LUIS, Google Dialogflow, RASA, and Amazon Alexa. These solutions also use various kinds of semantic frame representations as part of their framework.
Motivated by the rapid explosion of scientific progress, and by unprecedented market attention, we think that a guided map of the approaches on SF and IC can be useful for a large spectrum of researchers and practitioners interested in dialogue systems. The primary goal of the survey is to give a broad overview of recent neural models applied to SF and IC, and to compare their performance in the context of task-oriented dialogue systems. We also highlight and discuss open issues that still need to be addressed in the future. The paper is structured as follows: Section 2 describes the SF and IC tasks, commonly used datasets and evaluation metrics. Sections 3, 4, and 5 elaborate on the progress and state of the art of independent, joint, and transfer learning models for both tasks. Section 6 discusses the performance of existing models and open challenges.

Table 1: Example of SF and IC output for an utterance. Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, while O denotes that the word does not belong to any slot.

Slot Filling and Intent Classification
This section provides some background relevant for SF and IC, sets the scope of the survey with respect to the context in dialogue systems, defines SF and IC as tasks, and introduces the datasets and the metrics that will be used in the rest of the paper.

Background
Task-oriented dialogue systems aim to assist users to accomplish a task (e.g. booking a flight, making a restaurant reservation and playing a song) through dialogue in natural language, either in a spoken or written form. As in most of the current approaches, we assume a system involving a pipeline of components (Young et al., 2010), where the user utterance is first processed by an Automatic Speech Recognition (ASR) module and then processed by a Natural Language Understanding (NLU) component, which interprets the user's needs. Then a Dialogue State Tracker (DST) accumulates the dialogue information as the conversation progresses and may query a domain knowledge base to obtain relevant data. A dialogue policy manager then decides what is the next action to be executed and a Natural Language Generation (NLG) component produces the actual response to the user.
We focus on the NLU component, and we generalize several recent approaches assuming that the output of the NLU process is a partially filled semantic frame (Wang et al., 2005; Tur and De Mori, 2011), corresponding to the intent of the user in a certain portion of the dialogue, with a number of slot-value pairs that need to be filled to accomplish the intent. The notion of intent originates from the idea that utterances can be assigned to a small set of dialogue acts (Stolcke et al., 2000), and it is now largely adopted to identify a task or action that the system can execute in a certain domain. Slot-value pairs, on the other hand, represent the domain of the dialogue, and have been implemented either as an ontology (Bellegarda, 2013), possibly with reasoning services (e.g. checking the constraints over slot values), or simply through a list of entity types that the system needs to identify during the dialogue.
Intents may correspond either to specific needs of the user (e.g. blocking a credit card, transferring money, etc.), or to general needs (e.g. asking for clarification, thanking, etc.). Slots are defined for each intent: for instance, to block a credit card it is relevant to know the name of the owner and the number of the card. Values for the slots are collected through the dialogue, and can be expressed by the user either in a single turn or in several turns. At each user turn in the dialogue the NLU component has to determine the intent of the user utterance (intent classification) and has to detect the slot-value pairs referred to in that particular turn (slot filling). Table 1 shows the expected NLU output for the utterance "I want to listen to Hey Jude by The Beatles".

Scope of the Survey
In Section §2.1, we described a task-oriented system as a pipeline of components, saying that SF and IC are core tasks at the NLU level. Particularly, IC consists of classifying an utterance with a set of predefined intents, while SF is defined as a sequence tagging problem (Raymond and Riccardi, 2007; Mesnil et al., 2013), where each token of the utterance has to be tagged with a slot label. In this scenario training data for SF typically consist of single utterances in a dialogue where tokens are annotated with a predefined set of slot names, and slot values correspond to arbitrary sequences of tokens. In this perspective, it is worth mentioning a research line on dialogue state tracking (Henderson et al., 2014; Mrksic et al., 2015; Budzianowski et al., 2018), where the NLU component is usually embedded into DST. What is relevant for our topic is that in this context SF is defined as a classification problem: given the current utterance and the previous dialogue history, the system has to decide whether or not a certain slot-value pair defined in the domain ontology is referred to in the current utterance. Although promising from the NLU perspective, this research line poses constraints (e.g. all slot-value pairs have to be pre-defined in an ontology) that limit the applicability of SF. For this reason, and because NLU components are the prevalent solution in current task-oriented systems, the focus of our survey will be on SF as a sequence tagging problem, as more precisely defined in the next section.

Task Definition
We formulate SF and IC as follows. Given an input utterance $x = (x_1, x_2, \ldots, x_T)$, SF consists of token-level sequence tagging, where the system has to assign a slot label to each token $x_i$ of the utterance, producing the label sequence $y^{slot} = (y^{slot}_1, y^{slot}_2, \ldots, y^{slot}_T)$. On the other hand, IC is defined as a classification task over utterances, where the system has to assign the correct intent label $y^{intent}$ to the whole utterance $x$. In general, most machine learning approaches learn a probabilistic model to estimate $p(y^{intent}, y^{slot} \mid x, \theta)$, where $\theta$ denotes the parameters of the model. Table 1 shows an example of the expected output of a model for the SF and IC tasks. In the following sections, we outline the main models that have been proposed for SF and IC, and categorize the models into three groups, namely independent models (§3), joint models (§4), and transfer learning based models (§5).
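As a concrete illustration of this formulation, the following minimal Python sketch encodes the running example as a token sequence with one BIO slot label per token and a single intent label. The label names (`PlayMusic`, `track`, `artist`) and the helper `bio_to_spans` are hypothetical choices made for illustration; they are not taken from any specific dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedUtterance:
    tokens: List[str]       # x = (x_1, ..., x_T)
    slot_labels: List[str]  # y^slot: one BIO label per token
    intent: str             # y^intent: one label for the whole utterance


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO labels into (slot_name, start, end_exclusive) spans."""
    spans = []
    name, start = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and name == label[2:]
        if name is not None and not continues:
            spans.append((name, start, i))
            name, start = None, None
        # A "B-" label opens a span; a stray "I-" is leniently treated the same.
        if label.startswith("B-") or (label.startswith("I-") and name is None):
            name, start = label[2:], i
    return spans


example = AnnotatedUtterance(
    tokens="I want to listen to Hey Jude by The Beatles".split(),
    slot_labels=["O", "O", "O", "O", "O",
                 "B-track", "I-track", "O", "B-artist", "I-artist"],
    intent="PlayMusic",  # hypothetical intent name
)

# Recover the slot-value pairs of the semantic frame from the token-level tags.
slot_values = {
    name: " ".join(example.tokens[start:end])
    for name, start, end in bio_to_spans(example.slot_labels)
}
```

Here `slot_values` maps each slot name to the text of its span, which together with `example.intent` forms the filled semantic frame for the utterance.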

Datasets for SF and IC
In this section, according to our task definition, we list available dialogue datasets (most of them are publicly available) where each utterance is assigned to one intent, and tokens are annotated with slot names. Most of such datasets are collections of single turn user utterances (i.e., not multi-turn dialogues). An example of a single-turn utterance annotation is shown in Table 1.
The ATIS (Airline Travel Information System) dataset (Hemphill et al., 1990) is the most widely used single-turn dataset for NLU benchmarking. It contains around 5K utterances consisting of queries related to the airline travel domain, such as searching for a flight, asking for a flight fare, etc. While it has a relatively large number of slot and intent labels, the distribution is quite skewed: more than 70% of the utterances have the flight search intent. The slots are dominated by those that express location names, such as FROMLOCATION and TOLOCATION. The MEDIA dataset (Bonneau-Maynard et al., 2005) is constructed by simulating the conversation between a tourist and a hotel representative in the French language. Compared to ATIS, the MEDIA corpus is around three times larger; however, MEDIA is only annotated with slot labels. The slots are related to hotel booking scenarios such as the number of people, date, hotel facility, relative distance, etc. The MIT corpus (Liu et al., 2013) is constructed through a crowdsourcing platform where crowd workers are hired to create natural language queries in English and annotate the slot labels in the queries. The MIT corpus covers two domains, namely movie and restaurant, in which the utterances are related to finding information about a particular movie or actor, or searching for or booking a restaurant with particular distance and cuisine criteria. The SNIPS dataset (Coucke et al., 2018) was collected by crowdsourcing through the SNIPS voice platform. Intents include requests to a digital assistant to complete various tasks, such as asking about the weather, playing a song, booking a restaurant, asking for a movie schedule, etc. SNIPS is now often used as a benchmark for NLU evaluations.
While most datasets are available in English, recently there has been growing interest in expanding slot filling and intent classification datasets to non-English languages. The original ATIS dataset has been extended to several languages, namely Hindi, Turkish (Upadhyay et al., 2018), and Indonesian (Susanto and Lu, 2017). The MultiATIS++ dataset expands the ATIS dataset to more languages, namely Spanish, Portuguese, German, French, Chinese, and Japanese. Bellomaria et al. (2019) introduce the Italian version of the original SNIPS dataset. The Facebook multilingual dataset (Schuster et al., 2019) provides data in Thai and Spanish across three domains, namely weather, alarm, and reminder. The detailed statistics of each dataset are listed in Appendix A.

Evaluation Metrics
For the IC task, evaluation is performed at the utterance level. The typical evaluation metric for IC is accuracy, calculated as the number of correct predictions made by the model divided by the total number of predictions. For SF, evaluation is performed at the entity level. The common metric is the one introduced in the CoNLL-2003 shared task (Sang and Meulder, 2003) to evaluate Named Entity Recognition (NER), namely the F1 score, which is the harmonic mean of precision and recall. Precision is the percentage of slot predictions from the model that are correct, while recall is the percentage of slots in the corpus that are found by the model. A slot prediction is considered correct when an exact match is found (Sang and Meulder, 2003). As slots are annotated in BIO format to mark slot boundaries (see Table 1), a prediction is counted as correct only when the model predicts the correct slot label at the correct token offsets. Consequently, the exact match metric does not reward cases where the model predicts the correct slot label but gets the slot boundary wrong (partial match).
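The two metrics can be sketched in a few lines of Python. This is a minimal illustration of exact-match span scoring, not a reimplementation of the official CoNLL evaluation script, whose handling of malformed tag sequences may differ; the function names and the toy labels are our own.

```python
from typing import List, Set, Tuple


def spans(labels: List[str]) -> Set[Tuple[str, int, int]]:
    """Extract (slot_name, start, end_exclusive) spans from BIO labels."""
    found = set()
    name, start = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and name == label[2:]
        if name is not None and not continues:
            found.add((name, start, i))
            name, start = None, None
        if label.startswith("B-") or (label.startswith("I-") and name is None):
            name, start = label[2:], i
    return found


def slot_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Span-level F1: a predicted slot counts only if label and boundaries match."""
    n_correct = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = spans(g), spans(p)
        n_correct += len(g_spans & p_spans)
        n_pred += len(p_spans)
        n_gold += len(g_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold_slots = [["O", "B-track", "I-track", "O", "B-artist", "I-artist"]]
# The track span is cut short (partial match); the artist span is exact.
pred_slots = [["O", "B-track", "O", "O", "B-artist", "I-artist"]]
f1 = slot_f1(gold_slots, pred_slots)
```

In this toy case one of two predicted spans is exactly right and one of two gold spans is found, so precision and recall are both 0.5 and the truncated span earns no partial credit.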

Independent Models for SF and IC
Independent models train each task separately, and recent neural models typically use an RNN as the building block for SF and IC. At each time step $t$, the encoder transforms the word representation $x_t$ into the hidden state $h_t$. For SF, the output layer predicts the slot label $y^{slot}_t$ conditioned on $h_t$. For IC, typically the last hidden state $h_T$ is used to predict the intent label $y^{intent}$ of the utterance $x$. Most neural models for SF and IC consist of several layers, namely an input layer, one or more encoder layers, and an output layer. Consequently, the main differences between models are in the specifics of these layers. The most common dataset used for evaluating independent models is ATIS.
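The generic structure described above can be sketched with NumPy as follows. This is a forward pass only, under several assumptions of ours: a simple Elman-style tanh recurrence as the encoder, random untrained weights, and toy dimensions. It is meant to show where the SF and IC predictions come from, not to reproduce any published model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h, n_slots, n_intents = 5, 8, 16, 7, 3  # toy sizes (assumptions)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def encode(x, W_x, W_h, b):
    """Simple recurrent encoder: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)  # shape (T, d_h)


def init_encoder():
    return (rng.normal(scale=0.1, size=(d_h, d_in)),
            rng.normal(scale=0.1, size=(d_h, d_h)),
            np.zeros(d_h))


x = rng.normal(size=(T, d_in))  # stand-in for word embeddings x_1..x_T

# Independent SF model: its own encoder, one prediction per time step from h_t.
sf_encoder = init_encoder()
W_slot = rng.normal(scale=0.1, size=(n_slots, d_h))
slot_probs = softmax(encode(x, *sf_encoder) @ W_slot.T)  # (T, n_slots)

# Independent IC model: a separate encoder, one prediction from the last state h_T.
ic_encoder = init_encoder()
W_intent = rng.normal(scale=0.1, size=(n_intents, d_h))
intent_probs = softmax(W_intent @ encode(x, *ic_encoder)[-1])  # (n_intents,)
```

The two models share nothing: each has its own encoder parameters and would be trained with its own loss, which is what distinguishes this setup from the joint models discussed later.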
In the input layer of neural models each word is mapped into an embedding. Mesnil et al. (2013) compared several embeddings, namely pre-trained SENNA (Collobert et al., 2011), RNN Language Model (RNNLM) (Mikolov et al., 2011), and random embeddings. SENNA gives the best result compared to the other embeddings, and, typically, further fine-tuning the word embeddings improves performance. Yao et al. (2013) report that embeddings learned from scratch directly on ATIS data (task-specific embeddings) are better than SENNA. However, task-specific embeddings are composed not only of words but also of named entities (NE) and syntactic features. NE improves performance significantly, while part-of-speech only adds small benefits. Ravuri and Stolcke (2015) emphasize the importance of character representations to handle out-of-vocabulary (OOV) issues.
For the encoder layer, various RNN architectures have been applied to SF and IC (Mesnil et al., 2013; Mesnil et al., 2015; Liu and Lane, 2015). Mesnil et al. (2013) compare the Elman (Elman, 1990) and Jordan (Jordan, 1997) RNNs. They observe that the performance of the Jordan RNN is marginally better than that of the Elman RNN. They also experiment with a bi-directional version of the Jordan RNN and obtain the best score of 93.89 F1 for SF, outperforming CRF by about 1 absolute F1 point. Xu and Sarikaya (2013) use a Convolutional Neural Network (CNN) (LeCun et al., 1998) to extract 5-gram features and apply max-pooling to obtain the word representation before passing it to the output layer. Compared with RNN (Yao et al., 2013; Mesnil et al., 2013), CNN gives lower performance for SF on ATIS. Other studies (Yao et al., 2014a; Vu et al., 2016) adapt the Long Short-Term Memory Network (LSTM) (Hochreiter and Schmidhuber, 1997) to SF. The LSTM model gives better SF performance compared to CRF, CNN, and RNN. Ravuri and Stolcke (2015) compare the performance of vanilla RNN and LSTM for IC. They find that the vanilla RNN works best for shorter utterances, while LSTM is better for longer utterances.
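For reference, the Elman and Jordan recurrences differ only in what is fed back at each time step. In a common formulation (the notation is ours: $f$ is a non-linearity such as the sigmoid, $W$, $U$, $V$ are weight matrices, and bias terms are omitted), the Elman RNN feeds back the previous hidden state, whereas the Jordan RNN feeds back the previous output distribution:

```latex
\begin{align*}
\text{Elman:}  \quad & h_t = f(W x_t + U h_{t-1}), & y_t &= \mathrm{softmax}(V h_t) \\
\text{Jordan:} \quad & h_t = f(W x_t + U y_{t-1}), & y_t &= \mathrm{softmax}(V h_t)
\end{align*}
```

In the Jordan variant the recurrent matrix $U$ therefore has as many columns as there are output labels rather than hidden units.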
For the output layer, typically a softmax function is used for prediction at a particular time step. Yao et al. (2014b) propose an R-CRF model combining the feature learning power of RNN and the sequence-level optimization of CRF for SF. The RNN + CRF scoring mechanism incorporates the features learned from the RNN and the transition scores of the slot labels. R-CRF outperforms CRF and vanilla RNN on ATIS and on the Bing query understanding dataset. Table 2 summarizes the performance of independent models on SF and IC.
Takeaways on independent SF and IC models:
• Performance of unidirectional RNN encoders ranks as Jordan ≤ Elman < LSTM. Bi-directional encoding further improves the performance of each encoder.
• Incorporating more context information is better for SF performance. Using global context information, such as sentence-level representations and attention mechanisms (Kurata et al., 2016; Liu and Lane, 2016), boosts the performance of bi-directional encoders even further.
• When adding external features is possible, semantic features such as NE are more beneficial than syntactic features for SF. When NE is used, it can boost the model performance for SF significantly.
• The slot filling task is related to the Named Entity Recognition (NER) task (Grishman and Sundheim, 1996), as slot values can be named entities such as an airline name, a city name, etc. If the slot filling task is modeled as a sequence tagging problem, recent neural models proposed for NER can essentially be used for slot filling and vice versa. For more on the recent development of neural NER models, one can consult the survey by Yadav and Bethard (2018).
• The main disadvantage of independent models is that they do not exploit the interaction between intent and slots and may introduce error propagation when they are used in a pipeline.

Joint Models for SF and IC
In Section 3 we reported approaches that treat SF and IC independently. However, as the two tasks always appear together in an utterance and they share information, it is intuitive to think that they can benefit each other. For instance, if the phrase "The Beatles" is recognized as the slot ARTIST, then it is more likely that the intent of the utterance is PLAYSONG rather than BOOKFLIGHT. On the other hand, recognizing that the intent is PLAYSONG would help to recognize "Hey Jude" as a song title rather than as the slot MOVIENAME.
Recent approaches model the relationship between SF and IC simultaneously in a joint model. These approaches promote two-way information sharing between the two tasks instead of a one-way flow (pipeline).
We describe several alternatives to exploit the relation between SF and IC: parameter and state sharing, and gate mechanisms.

Parameter and State Sharing
A pioneering work in joint modeling is Xu and Sarikaya (2013), which performs parameter sharing and captures the relation between SF and IC through Tri-CRF (Jeong and Lee, 2008). The model uses a CNN as a shared encoder for both tasks and the produced hidden states are utilized for SF and IC. In addition to features learned from the NN and from the slot label transitions, Tri-CRF incorporates an additional factor $g$ to learn the correlation between the slot label assigned to each word and the intent assigned to the utterance, which explicitly captures the dependency between the two tasks. A similar approach (Guo et al., 2014) shares the node representations produced by a Recursive Neural Network (RecNN), which operates on the syntactic tree of the utterance. Each node's representation is shared between SF and IC. Zhang and Wang (2016) use a shared bi-GRU encoder and a joint loss function between SF and IC (Figure 1 Left), in which the loss function has weights associated with each task.
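A weighted joint loss of this kind can be sketched as follows. The exact weighting scheme is not given here, so the convex combination with a single weight `alpha` is an assumption made for illustration, as are the function names and toy probabilities.

```python
import numpy as np


def cross_entropy(probs, gold):
    """Mean negative log-likelihood of the gold label indices."""
    probs = np.atleast_2d(probs)
    gold = np.atleast_1d(gold)
    return float(-np.log(probs[np.arange(len(gold)), gold]).mean())


def joint_loss(slot_probs, gold_slots, intent_probs, gold_intent, alpha=0.5):
    """alpha * L_slot + (1 - alpha) * L_intent; alpha is a hypothetical task weight."""
    l_slot = cross_entropy(slot_probs, gold_slots)        # averaged over tokens
    l_intent = cross_entropy(intent_probs, gold_intent)   # one term per utterance
    return alpha * l_slot + (1.0 - alpha) * l_intent


# Toy outputs of a shared encoder with two heads: 2 tokens, 2 slot labels, 2 intents.
slot_probs = np.array([[0.5, 0.5], [0.25, 0.75]])
intent_probs = np.array([0.5, 0.5])
loss = joint_loss(slot_probs, [0, 1], intent_probs, 0, alpha=0.5)
```

Because both terms are computed from the outputs of the same encoder, the gradient of this single scalar updates the shared parameters with signal from both tasks, which is the point of parameter sharing.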
Liu and Lane (2016) use a neural sequence to sequence (encoder-decoder) model with attention mechanism commonly used for neural machine translation. The shared encoder is a bi-directional LSTM, and the last hidden state of the encoder is then used by the decoder to generate a sequence of slot labels, while for IC there is a separate decoder. The attention mechanism is used to learn alignments between slot labels in the decoder and words in the encoder. Hakkani-Tür et al. (2016) also adopt parameter sharing similar to Zhang and Wang (2016), but instead of using GRU they use a shared LSTM and perform predictions for slots, intent, and also domain.
In a recent approach, Wang et al. (2018) propose a bi-model based structure to learn the cross-impact between SF and IC. They argue that a single model for two tasks can hurt performance, and, instead of sharing parameters, they use two task-specific networks to learn the cross-impact between the two tasks and only share the hidden state of the other task. In the model, every hidden state $h^1_t$ in the first network is combined with the hidden state $h^2_t$ of the second network, and vice versa. Training is also done asynchronously, as each model has a separate loss function. Qin et al. (2019) use a self-attentive shared encoder to produce better context-aware representations, then apply IC at the token level and use this information to guide the SF task. They argue that previous work based on a single utterance-level intent prediction is more prone to error propagation: if some token-level intent is incorrectly predicted, the other correct token-level predictions can still be useful for the corresponding SF. For the final IC prediction, they use a voting mechanism that takes into account the IC prediction on each token. In BERT-based joint models (Figure 1 Right), the input is passed through several layers of transformer encoders and the hidden state outputs are used to compute slot and intent labels. The hidden state $h_{CLS}$ is used for IC, while the rest of the hidden states at each time step, $h_i$, serve SF.
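The division of labor between $h_{CLS}$ and the token states can be sketched as below. The encoder is replaced by a random matrix, and the head weights, sizes, and variable names are illustrative assumptions; subword tokenization, which a real BERT-based tagger has to handle, is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_slots, n_intents = 6, 12, 5, 3  # toy sizes (assumptions)

# Stand-in for the transformer encoder output:
# row 0 plays the role of h_CLS, rows 1..T are the token states h_1..h_T.
H = rng.normal(size=(T + 1, d))

W_intent = rng.normal(size=(n_intents, d))  # hypothetical intent head
W_slot = rng.normal(size=(n_slots, d))      # hypothetical slot head

h_cls, h_tokens = H[0], H[1:]
intent_pred = int(np.argmax(W_intent @ h_cls))       # one label per utterance
slot_pred = np.argmax(h_tokens @ W_slot.T, axis=-1)  # one label per token
```

Both heads read from the same encoder output, so fine-tuning on the two objectives adapts a single set of encoder parameters.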

Slot-Intent Gate Mechanism
In addition to parameter and state sharing, a separate network with a slot gating mechanism was introduced by Goo et al. (2018) to model the interaction between SF and IC more explicitly (Figure 1 Middle). In the encoder, a slot context vector for each time step, $c^S_i$, and a global intent context vector $c^I$ are computed using an attention mechanism (Bahdanau et al., 2015). The slot gate $g_s$ is computed as a function of $c^S_i$ and $c^I$: $g_s = v \cdot \tanh(c^S_i + W \cdot c^I)$. Then, $g_s$ is used as a weight between $h_i$ and $c^S_i$ to compute $y^{slot}_i$ as follows: $y^{slot}_i = \mathrm{softmax}(W (h_i + g_s \cdot c^S_i))$. A larger $g_s$ indicates a stronger correlation between $c^S_i$ and $c^I$. E et al. (2019) propose a bi-directional model, the SF-ID (SF-Intent Detection) network, sharing ideas with Goo et al. (2018), with two key differences. First, in addition to the slot-gated mechanism, they add an intent-gated mechanism as well. Second, they use an iterative mechanism between the SF and ID networks, meaning that the gate vector from SF is injected into the ID network and vice versa. This mechanism is repeated for an arbitrary number of iterations. Compared to Goo et al. (2018), the SF-ID network performs better on both SF and IC on ATIS and SNIPS. Another work is also similar to Goo et al. (2018), with two differences. First, it uses a self-attention mechanism (Vaswani et al., 2017) to compute $c^S_i$. Second, it uses a separate network to compute the gate vector $g_s$, but the input of this network is the concatenation of $c^S_i$ and the intent embedding $v^{intent}$, and $g_s$ is defined as $g_s = \tanh(W_g [c^S_i, v^{intent}] + b_s)$. After that, $h_i$ is combined with $g_s$ through element-wise multiplication to compute $y^{slot}_i$ as follows: $y^{slot}_i = \mathrm{softmax}(W_s (h_i \odot g_s) + b_s)$. The authors report a +0.5% improvement on SF over Liu and Lane (2016). A recent work by Zhang et al. (2019) further improves the performance of the BERT-based model by adding a gate mechanism to the BERT model.
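The slot gate of Goo et al. (2018), as written above, can be sketched in NumPy. The context vectors, which the original model obtains through attention, are replaced here by random stand-ins, and all sizes and weight initializations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_slots = 4, 6, 5  # toy sizes (assumptions)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def slot_gate(c_slot, c_intent, v, W):
    """g_s = v . tanh(c_i^S + W c^I): one scalar gate per time step."""
    return np.tanh(c_slot + c_intent @ W.T) @ v  # shape (T,)


def gated_slot_probs(h, c_slot, c_intent, v, W, W_out):
    """y_i^slot = softmax(W_out (h_i + g_s * c_i^S))."""
    g = slot_gate(c_slot, c_intent, v, W)
    return softmax((h + g[:, None] * c_slot) @ W_out.T)  # shape (T, n_slots)


h = rng.normal(size=(T, d))       # encoder hidden states h_i
c_slot = rng.normal(size=(T, d))  # stand-in for slot context vectors c_i^S
c_intent = rng.normal(size=d)     # stand-in for the intent context vector c^I
v = rng.normal(size=d)
W = rng.normal(size=(d, d))
W_out = rng.normal(size=(n_slots, d))

probs = gated_slot_probs(h, c_slot, c_intent, v, W, W_out)
```

When the gate is zero the slot context is ignored and the prediction depends on $h_i$ alone, so the gate controls how much intent-conditioned context reaches each slot decision.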
Table 3 compares the performance of the joint models.

Table 3: Performance comparison of joint models for SF and IC on ATIS and SNIPS-NLU.
Takeaways on joint SF and IC models:
• The overall performance of joint models for SF and IC (Table 3) is competitive with independent models (Table 2). The advantage of joint models is that they have relatively fewer parameters than independent models, as both tasks are trained on a single model.
• When computational power is not an issue, fine-tuning a pre-trained model such as BERT gives the best SF and IC performance. Hybrid methods combining parameter and state sharing with intent gating yield the best performance (Zhang et al., 2019).
• For non-BERT-based models, using state sharing (Wang et al., 2018) is the best on ATIS. However, the disadvantage is that it is actually a bi-model and not a single model.
• Similar to independent models, contextual information is crucial to performance. Adding a self-attention mechanism (Qin et al., 2019) to either parameter and state sharing or to slot-intent gating can boost performance even further.
• When sufficiently large in-domain training data is available, the SF and IC performance on ATIS and SNIPS is already saturated. Therefore, further leaderboard chasing on these datasets is of limited value. We discuss this further in Section 6.
• Most of the work on joint models and also on independent models (Section §3) reports F1 scores for slot filling performance. However, these scores do not reveal in which specific cases these models behave differently, contributing to overall performance. We leave further analysis of model performance as potential future work.

Scaling to New Domains
So far, the models that we consider in Section §3 and Section §4 are designed to be trained on a single domain (e.g. banking, restaurant reservation) and require relatively large labeled data to perform well. In practice, new intents and slots are regularly added to a system to support new tasks and domains, requiring data- and time-intensive processes. Hence, methods to train models for new domains with limited or without labeled data are needed. We refer to this situation as the domain scaling problem.

Figure 2: Left: Data-driven approach (Jaech et al., 2016; Hakkani-Tür et al., 2016). Middle: Model-driven approach with expert models (Kim et al., 2017). Right: Zero-shot model (Bapna et al., 2017).

Transfer Learning Models for SF and IC
A common approach to deal with domain scaling is transfer learning (TF). In the TF setup we have $K$ source domains $D^1_S, D^2_S, \ldots, D^K_S$ and a target domain $D^{K+1}_T$, and we assume abundance of data in $D_S$ and limited data in $D_T$. Instead of training a target model $M_T$ for $D_T$ from scratch, TF aims to adapt the learned model $M_S$ from $D_S$ to produce a model $M_T$ trained on $D_T$. TF is typically applied with various parameter sharing and training mechanisms. For SF and IC two approaches are proposed, namely data-driven and model-driven. As for data-driven techniques, typically we combine data from $D_S$ and $D_T$ and we partition the parameters of the model into parts that are task-specific and parts that are shared across tasks. Some studies (Jaech et al., 2016; Hakkani-Tür et al., 2016; Louvan and Magnini, 2019) apply this technique using multi-task learning (MTL) (Caruana, 1997), and the models are trained simultaneously on $D_S$ and $D_T$ (Figure 2 Left). Results have shown that MTL is particularly effective relative to single-task learning (STL) when the data in $D_T$ is scarce, and that the benefits over STL diminish as more data becomes available. Another technique that is typically used in data-driven approaches is based on pre-train and fine-tune mechanisms. Goyal et al. (2018) train a joint model of SF and IC, $M_S$, on large $D_S$, then fine-tune $M_S$ by replacing the output layer with one corresponding to the label space of $D_T$ and training the model further on $D_T$. Siddhant et al. (2019) also use a fine-tuning mechanism, but the main difference from Goyal et al. (2018) is that they leverage large unlabeled data to learn contextual embeddings, ELMo (Peters et al., 2018), before fine-tuning on $D_T$.
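The pre-train and fine-tune step, in which the encoder of $M_S$ is kept and the output layer is replaced to match the label space of $D_T$, can be sketched as follows. The dictionary-of-arrays model, the parameter names, and the label-set sizes are hypothetical; no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 16  # hidden size (assumption)


def new_output_layer(n_labels):
    """Freshly initialized output weights for a label space of the given size."""
    return rng.normal(scale=0.1, size=(n_labels, d_h))


# Hypothetical source model M_S: encoder weights plus output layers sized
# for the source-domain slot and intent label spaces.
source_model = {
    "encoder": rng.normal(scale=0.1, size=(d_h, d_h)),
    "slot_out": new_output_layer(20),
    "intent_out": new_output_layer(8),
}


def init_target_model(source, n_target_slots, n_target_intents):
    """Build M_T: copy the encoder of M_S, re-initialize the output layers."""
    return {
        "encoder": source["encoder"].copy(),
        "slot_out": new_output_layer(n_target_slots),
        "intent_out": new_output_layer(n_target_intents),
    }


target_model = init_target_model(source_model, n_target_slots=5, n_target_intents=3)
# Training would then continue on D_T, updating all or part of target_model.
```

Only the output layers start from scratch, so the limited data in $D_T$ is used to adapt an encoder that already carries what was learned on $D_S$.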
As the whole model needs to be trained from scratch when adding a new domain, data-driven approaches, especially MTL-based ones, need increasing training time as the number of domains grows. The alternative strategy, the model-driven approach, alleviates the problem by enabling model reusability. Although different domains have different slot schemas, slots such as date, time and location can be shared. In model-driven adaptation, "expert" models (Figure 2 Middle) are first trained on these reusable slots (Kim et al., 2017; Jha et al., 2018) and the outputs of the expert models are used to guide the training of $M_T$ for a new target domain. This way the training of $M_T$ is faster, as its time is proportional to the $D_T$ data size, instead of the larger data size of the whole $D_S$ and $D_T$. In this model-driven setting, Kim et al. (2017) do not treat each expert model on each $D_S$ equally; instead, they use an attention mechanism to learn a weighted combination of the feedback from the expert models. Jha et al. (2018) use a model similar to Kim et al. (2017); however, they do not use an attention mechanism. For training the expert models, instead of using all available $D_S$, they build a repository consisting of common slots, such as date, time, and location slots. The assumption is that these slots are potentially reusable in many target domains. Upon training $M_S$ on this reusable repository, the output of $M_S$ is directly used to guide the training of $M_T$.
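The idea of weighting expert feedback with attention can be sketched as below. The dot-product scoring function, the vector shapes, and the function name are assumptions made for illustration and are not claimed to match the formulation of Kim et al. (2017).

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def combine_experts(h_target, expert_outputs):
    """Attention-weighted sum of K expert feedback vectors.

    h_target: (d,) state of the target model for the current token.
    expert_outputs: (K, d), one feedback vector per source-domain expert.
    Dot-product scoring is an assumption of this sketch.
    """
    scores = expert_outputs @ h_target  # (K,) relevance of each expert
    weights = softmax(scores)           # attention distribution over experts
    return weights @ expert_outputs, weights


h_target = np.array([10.0, 0.0])
expert_outputs = np.array([[1.0, 0.0],   # expert aligned with the target state
                           [0.0, 1.0]])  # expert orthogonal to it
combined, weights = combine_experts(h_target, expert_outputs)
```

In this toy case nearly all attention mass goes to the first expert, so its feedback dominates the combined vector; with uniform weights the model would instead fall back to treating all experts equally.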

Zero-shot Models for SF and IC
While data-driven and model-driven approaches can share knowledge learned on different domains, such models are still trained on a pre-defined set of labels, and cannot handle unseen labels, i.e. labels not mapped to the existing schema. For example, a model trained to recognize a DESTINATION slot cannot be used directly to recognize the slot ARRIVAL LOCATION for a new domain, although both slots are semantically similar. For this reason, researchers have recently been working on zero-shot models, trained on label representations that leverage natural language descriptions of the slots (Bapna et al., 2017; Lee and Jha, 2019). Assuming that accurate slot descriptions are provided, slots with different names that are semantically similar would have similar descriptions as well. Thus, having trained a model for the DESTINATION slot with its description, it is now possible to recognize the slot ARRIVAL LOCATION without training on it, by only supplying the corresponding slot description.
In addition to slot descriptions, other zero-shot approaches explore the use of slot value examples (Guerini et al., 2018). It has been shown that a combination of a small number of slot value examples with a slot description performs better than Bapna et al. (2017) and Lee and Jha (2019) on the SNIPS dataset. Zero-shot models are typically trained on a per-slot basis (Figure 2 Right), meaning that if we have $N$ slots, then the model will output $N$ predictions; therefore, a merging mechanism is needed in case there are prediction overlaps. In order to alleviate the problem of having multiple predictions, Liu et al. (2020b) propose a coarse-to-fine approach, in which the model first learns the slot entity pattern (coarse step) to identify whether or not a particular token belongs to an entity. After that, the model performs a single prediction of the slot type (fine step) based on the similarity between the feature representation and the slot description.
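The fine step, choosing a slot type by similarity between a span representation and slot description representations, can be sketched as follows. The three-dimensional vectors stand in for the outputs of learned encoders, and the slot names, the cosine similarity measure, and the function names are assumptions of this sketch rather than details of Liu et al. (2020b).

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_slot_type(span_vec, description_vecs):
    """Return the slot whose description embedding is most similar to the span."""
    return max(description_vecs,
               key=lambda name: cosine(span_vec, description_vecs[name]))


# Toy embeddings of natural language slot descriptions for an unseen domain.
description_vecs = {
    "arrival_location": np.array([0.9, 0.1, 0.0]),
    "departure_time": np.array([0.0, 0.2, 0.9]),
}
# Toy representation of a span that the coarse step marked as an entity.
span_vec = np.array([0.8, 0.3, 0.1])

predicted = predict_slot_type(span_vec, description_vecs)
```

Since the label set enters only through the description embeddings, a new slot can be supported by adding an entry to `description_vecs` without retraining, and each span receives a single prediction rather than one per slot.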
Takeaways on scaling to new domains:
• Both data-driven methods, MTL and pre-train and fine-tune, improve performance when data in $D_T$ is limited. Both are also flexible, as virtually any number of tasks from different domains can be plugged into these methods. As the number of domains grows, pre-train and fine-tune is more desirable than MTL. However, fine-tuning is more prone to the forgetting problem compared to MTL.
• When the number of domains, $K$, is massive, the pre-train and fine-tune approach and model-driven approaches, such as expert-based adaptation, are preferable with respect to training time.
• When there are $K$ existing domains and no annotation is available in $D_T$, the choice is zero-shot approaches, at the expense of providing meta-information such as slot and intent descriptions.
• As zero-shot models typically perform prediction on a per-slot basis, potential disadvantages are reduced accuracy when there is a prediction overlap and computational inefficiency when dealing with many slots.

State of the Art and Beyond
Based on the results in Tables 2 and 3, it is evident that neural models have achieved outstanding performance on ATIS and SNIPS, showing that it is relatively easy for neural models to capture the patterns needed to recognize slots and intents. ATIS, in particular, is already overused for SF and IC evaluations, and recent analyses (Béchet and Raymond, 2018; Niu and Penn, 2019) have shown that the dataset is relatively simple and that the room for performance improvement is tiny. A similar trend in performance can be noted for other datasets, such as SNIPS, and it is likely that performance improvements will quickly saturate. However, this does not mean that these models have solved SF and IC, or NLU problems in general; rather, the models have merely solved the datasets. There are still a number of issues in SF and IC that need further investigation:
Portable and Data Efficient Models. Instead of evaluating models with the typical leaderboard setup, with fixed (train/dev/test) splits on a specific target domain, it would also be important to test models in different scenarios, so that different aspects of the model can be captured. For example, as neural models are data hungry, more work is still needed on transfer learning scenarios, where evaluation is carried out with little or no labeled data (zero-shot) for a particular target domain. In addition, most models for SF and IC are evaluated on English, which means that more effort is still needed to build models that work well for other languages. Some recent works have started exploring zero-shot cross-lingual methods (Qin et al., 2020; Liu et al., 2020a) and also few-shot scenarios (Hou et al., 2020), and the room for improvement in these scenarios is still large. In short, designing a data-efficient model that is portable across domains and languages remains a challenging problem.
Leveraging unlabeled data from live traffic. In real situations, personal digital assistants such as Google Home, Apple Siri and Amazon Alexa receive live traffic data from real users. This large amount of unlabeled data from live traffic is a potential data source for model training, in addition to in-house annotated data. Unlabeled live data are likely to differ from in-house data, as they can contain more diverse utterances as well as noisy and irrelevant utterances. In this situation, existing methods for tapping into unlabeled data, such as semi-supervised learning, still face unique challenges in handling live data. It is worth noting that a bottleneck in this direction is that working on live data in academic settings is not trivial. Some recent works explore this line of research by applying semi-supervised learning and also a data selection mechanism (Do and Gaspers, 2019).
Generative Models. Most of the proposed models are discriminative. Among the few works on generative models for SF and IC, Raymond and Riccardi (2007) and Yogatama et al. (2017) have shown that a generative model performs relatively better than a discriminative model when data is scarce. One possible direction for generative models is to apply data augmentation to automatically create additional training data (Yoo et al., 2019; Zhao et al., 2019; Hou et al., 2018; Kurata et al., 2016). The main challenge for data augmentation is to generate diverse and fluent synthetic utterances that preserve the semantics of the original utterance.
Evaluation of SF and IC on more complex datasets. Existing neural approaches are typically evaluated on single-intent utterances; however, in a real-world scenario users may express multiple intents in one utterance, e.g. "Show me all flights from Atlanta to London and get the cost" (Gangadharaiah and Narayanaswamy, 2019), or even produce multiple sentences in a single turn. While most datasets for slot filling and intent classification consist of single-turn utterances, there are some recent multi-turn datasets that provide slot annotation at the token level, namely the RESTAURANT-8K, TaskMaster-1 and 2 (Byrne et al., 2019), and Frames (Asri et al., 2017) datasets. The subset of the Schema Guided Dialogue (SGD) dataset (Rastogi et al., 2020) used in DSTC-8 is also annotated with slots at the token level and covers 16 domains. In addition, the TOP dataset (Gupta et al., 2018) is annotated with hierarchical representations, and the MTOP dataset (Li et al., 2020) provides both flat and hierarchical representations in 6 languages across 11 domains.

Conclusion
We have surveyed recent neural-based models applied to SF and IC in the context of task-oriented dialogue systems. We examined three approaches, i.e. independent, joint, and transfer learning based models. Joint models, which exploit the relation between SF and IC simultaneously, have shown relatively better performance than independent models. Empirical results have shown that most joint models nearly "solve" the widely used datasets, ATIS and SNIPS, given sufficient in-domain training data. Nevertheless, there are still several challenges related to SF and IC, especially improving the scalability of models to new domains and languages when limited labeled data are available.