A Two-stage Model for Slot Filling in Low-resource Settings: Domain-agnostic Non-slot Reduction and Pretrained Contextual Embeddings

Learning-based slot filling, a key component of spoken language understanding systems, typically requires a large amount of in-domain hand-labeled data for training. In this paper, we propose a novel two-stage model architecture that can be trained with only a few in-domain hand-labeled examples. The first stage removes non-slot tokens (i.e., O-labeled tokens), as they introduce noise into the input of slot filling models. This stage is domain-agnostic and can therefore be trained by exploiting out-of-domain data. The second stage identifies slot names only for slot tokens by using state-of-the-art pretrained contextual embeddings such as ELMo and BERT. We show that our approach outperforms other state-of-the-art systems on the SNIPS benchmark dataset.


Introduction
Slot filling models, which identify slots in user utterances and predict task-specific names (e.g., artist, time) for them, are a key component of spoken language understanding (SLU) systems. Deep learning approaches (Mesnil et al., 2013; Hakkani-Tür et al., 2016; Zhang and Wang, 2016; Zhu and Yu, 2018; Gupta et al., 2018; Bapna et al., 2017a) for SLU involve training on a large amount of annotated data. Likewise, multi-domain studies (Hakkani-Tür et al., 2016; Liu and Lane, 2017) that rely on deep learning methods require a large amount of data for each domain. Slot filling, however, is a very challenging task if only a few labeled samples are available. This paper therefore proposes methods to address the low-resource domain issue in slot filling.
We aim to improve the performance of slot filling in low-resource settings by exploring the effective use of a few in-domain samples in two scenarios: (1) data from other domains is not available, but a few samples are available in the current domain; (2) data from other domains is available and a few samples are accessible in the current domain. We exploit domain-agnostic syntactic similarities (e.g., the main verb of a sentence cannot be a slot) to learn the conceptual differences between slot and non-slot tokens in order to dismiss non-slot tokens from the input space. Using labeled data (SLOT and O labels) across domains can therefore improve the non-slot token reduction step in the target domain and thereby the slot name prediction step. Accordingly, we propose a novel two-stage model that first reduces the noise introduced by non-slot tokens through a non-slot detection step and then predicts slot names. The identified non-slots are removed from the input space of the name prediction step. Our modeling approach is inspired by Zhai et al. (2017), Dauphin et al. (2013), and Shah et al. (2019).
We suggest using a few annotated samples as training input instead of slot descriptions and slot names as in zero-shot learning studies (Bapna et al., 2017a; Lee and Jha, 2019; Shah et al., 2019). This is for two reasons: (1) The creation of slot descriptions requires linguistic expertise and is thus expensive. (2) The relationship between slot names and the corresponding tokens is not constant. For example, the relationship between the 'genre' slot name and the 'drama' token is hypernymic, whereas the relationship between the 'artist' slot name and the 'Tarkan' token is instance-based. Hence, it may not be valid to learn only one function to represent the different relationships between names and tokens.
As a classification algorithm, we employ the Rocchio classification method (Rocchio, 1971) to label tokens with their domain-specific name labels after removing the non-slot tokens from the input. The Rocchio classifier is a very simple method that represents each class by a centroid, computed as the center of mass of all vectors in the class, i.e., it builds a prototype vector for each class. The decision is made simply on the basis of a distance metric. Because only a small amount of data is available in the current domain and contextual pretrained embeddings provide semantically rich and robust representations, we argue that the Rocchio classifier is sufficient for our task. Furthermore, by using this simple classification method, we show the effectiveness of removing non-slot noise from the input.
Problem Statement

Problem Definition
We partition the slot filling task into two consecutive sub-tasks, called Slot Labeling and Name Labeling. The Slot Labeling task requires predicting, for each token in a sentence, one of the classes S = {O, SLOT}, where SLOT corresponds to slot tokens and O represents non-slot tokens. The Name Labeling task requires predicting one label from a predefined name label set N = {...} for each candidate slot. This implies that candidate slots have already been identified as SLOT by the Slot Labeling task, as shown in Figure 1. While S is shared across domains, N is domain-specific. Therefore, training data can be shared across domains for the Slot Labeling task, but not for the Name Labeling task. Thus, we run into the limited data problem for Name Labeling.

Evaluation
We evaluate the proposed systems by computing the F1 score, i.e., the harmonic mean of precision and recall, over the results of the Name Labeling task, although the system consists of two consecutive models. To assess overall performance, we compute the average F1 score over the 7 domains. Additionally, the reported values are the average F1 over three random data splits.
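The averaging scheme can be sketched as follows. This is a minimal illustration, not the evaluation script: the function names, the order of averaging (over domains within a split, then over splits), and the example numbers are assumptions made here, and the way precision and recall are counted (per token or per slot phrase) is left open.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def average_f1(scores_by_split):
    """scores_by_split: one dict per random split, {domain: (precision, recall)}.

    Averages F1 over the domains of each split, then over the splits.
    """
    per_split = [
        mean(f1_score(p, r) for p, r in split.values())
        for split in scores_by_split
    ]
    return mean(per_split)


# Made-up numbers for two domains and two splits, for illustration only.
example = [
    {"GetWeather": (0.5, 0.5), "PlayMusic": (1.0, 1.0)},
    {"GetWeather": (0.5, 0.5), "PlayMusic": (0.5, 0.5)},
]
print(average_f1(example))  # 0.625
```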

Model Architecture
We define our consecutive model structure as follows: given an utterance with T tokens, we first employ the Slot Labeling model to identify SLOT tokens while eliminating the non-slot tokens of the input utterance. We then predict the slot names of the SLOT tokens received from the Slot Labeling model. Figure 3 gives an overview of the consecutive model architecture with its inputs and outputs, and shows how contextualized word embeddings are used to represent the input tokens.
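The control flow of the two consecutive models can be sketched as below. The function and parameter names are hypothetical, the two labelers are passed in as plain callables, and names are assigned token by token, which is a simplification assumed here; the toy labelers at the end only exercise the flow and carry no linguistic meaning.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def two_stage_slot_filling(
    tokens: List[str],
    embeddings: List[Vector],
    slot_labeler: Callable[[List[Vector]], List[str]],  # one "O"/"SLOT" per token
    name_labeler: Callable[[Vector], str],              # domain-specific name
) -> List[Tuple[str, str]]:
    """Stage 1 filters non-slot tokens; stage 2 names only the SLOT tokens."""
    slot_labels = slot_labeler(embeddings)
    output = []
    for token, vector, slot_label in zip(tokens, embeddings, slot_labels):
        if slot_label == "O":
            # Non-slot tokens never reach the name labeler.
            output.append((token, "O"))
        else:
            output.append((token, name_labeler(vector)))
    return output


# Toy stand-ins for the two models (assumptions, for illustration only).
def toy_slot_labeler(sequence):
    return ["SLOT" if vector[0] > 0.5 else "O" for vector in sequence]


def toy_name_labeler(vector):
    return "artist"


result = two_stage_slot_filling(
    ["play", "Tarkan"], [[0.0], [1.0]], toy_slot_labeler, toy_name_labeler
)
print(result)  # [('play', 'O'), ('Tarkan', 'artist')]
```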

Inputs
Contextualized word representation methods, e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), run a pre-trained network over the sentence to produce embeddings that depend on the current context, instead of using a single, fixed vector per word as in Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). The pre-trained models, usually an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017), can be trained for token-level classification tasks, e.g., named entity recognition and part-of-speech tagging, or for sentence-level classification tasks, e.g., text classification and sentiment analysis. They can either be fine-tuned (Howard and Ruder, 2018) on a domain-specific dataset, starting from the language modeling objective (Peters et al., 2018; Devlin et al., 2019), or be used as feature-based models (Peters et al., 2018; Tenney et al., 2019; Brunner et al., 2020) for downstream tasks. In this study, we employ feature-based BERT and ELMo for slot filling in low-resource domains.
BERT uses a bidirectional Transformer model trained on a masked language modeling task. It uses WordPiece embeddings (Wu et al., 2016), which means that each input word is represented by its sub-tokens. We use the first sub-token to represent the word, as done in Devlin et al. (2019). Additionally, BERT consists of multiple successive layers (24 in the BERT-large-cased model that we use), and each layer captures different linguistic notions of syntax or semantics (Clark et al., 2019). To find the layers that focus on local context (Tenney et al., 2019), we apply the attention visualization tool (Vig, 2019) to randomly selected samples. We select the 10th, 11th, 12th, and 13th layers and concatenate their hidden states to represent the corresponding word.
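The word representation described above can be sketched as follows. The sketch assumes that the per-layer hidden states have already been extracted and that `hidden_states[k]` holds the output of the k-th layer, with index 0 reserved for the embedding output; the function name and the toy values are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def word_representation(
    hidden_states: Sequence[Sequence[Vector]],  # [layer][sub_token] -> vector
    first_subtoken_index: int,
    layers: Sequence[int] = (10, 11, 12, 13),
) -> Vector:
    """Concatenate the selected layers' hidden states at a word's first sub-token."""
    representation: Vector = []
    for layer in layers:
        representation.extend(hidden_states[layer][first_subtoken_index])
    return representation


# Toy hidden states: 25 entries (embedding output + 24 layers), 3 sub-tokens,
# 2 dimensions, where each vector simply encodes [layer, sub_token].
toy_states = [[[float(layer), float(t)] for t in range(3)] for layer in range(25)]
vector = word_representation(toy_states, first_subtoken_index=1)
print(vector)  # [10.0, 1.0, 11.0, 1.0, 12.0, 1.0, 13.0, 1.0]
```

With the 1024-dimensional hidden states of BERT-large, concatenating four layers in this way gives a 4096-dimensional vector per word.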
ELMo concatenates the outputs of two LSTMs independently trained on the bidirectional language modeling task and returns the hidden states for the given input sequence.
The proposed consecutive approach uses two different label sets, S and N, as explained in the Problem Definition section, which are defined over the same sequences in each domain. We run the contextual embedding models on the given utterance and assign the resulting embeddings to their corresponding input tokens. Figure 2 shows the domain-agnostic pattern between non-slot token vectors of GetWeather and SearchScreeningE.: non-slot tokens from GetWeather (g_o) and non-slot tokens from SearchScreeningE. (s_o) are more similar to each other than the slot tokens from both domains are. The Slot Labeling step aims to make efficient use of the existing slot-labeled data from the current and other domains in order to exploit these domain-agnostic semantic frames for the current domain. We therefore employ two different Slot Labeling models, depending on data availability, for two common scenarios: (1) no data from other domains is available, but a few labeled samples exist in the current domain; (2) data from other domains is available in addition to a few labeled samples in the current domain. For the first scenario, we apply Rocchio Slot Labeling, whereas Neural Slot Labeling is employed for the second.

Slot Labeling
Rocchio Slot Labeling: This model uses the few labeled samples available in the current domain and shows the effect of non-slot reduction on slot and name labeling without any additional samples from other domains. Since only a few samples are available to build the classification model for slot labeling, we apply a Rocchio classifier, which assigns to each observation the label of the class whose training-sample centroid is closest to the observation.
Given the slot values {v_1, ..., v_n}, where v_i represents a slot value, the Rocchio classifier is trained to map a given slot value to its slot label by using the centroid (µ_s) of the prototypes (X) of the corresponding slot label.
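In the standard nearest-centroid formulation, which the description above is assumed to follow, the centroid and the decision rule can be written as below. The symbols $X_s$ (the prototypes of label $s$), $e(\cdot)$ (the contextual embedding of a slot value), and $d$ (the distance metric) are illustrative notation introduced here.

```latex
\mu_s = \frac{1}{|X_s|} \sum_{x \in X_s} x,
\qquad
\hat{y}(v_i) = \operatorname*{arg\,min}_{s \in S} \; d\big(e(v_i), \mu_s\big)
```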
Neural Slot Labeling: This model uses the available labeled data from other domains in addition to a few labeled samples of the current domain. The availability of a large amount of labeled data from other domains makes it feasible to use more complex architectures such as neural networks. For ELMo embeddings we use the token classification model proposed by Peters et al. (2018), whereas for BERT embeddings we implement the token classification model proposed by Devlin et al. (2019). Given X = {w_1, w_2, ..., w_T}, the goal is to predict Y_s = (y_1, y_2, ..., y_T), where T is the number of tokens in the input and y_i ∈ S. ELMo embeddings are used with an LSTM+CRF trained by maximizing the conditional log-likelihood, and BERT embeddings are used with a linear layer followed by a softmax function. The aim of the Neural Slot Labeling model is to leverage the domain-agnostic features of different task-oriented domains efficiently, since the available data lets us train the networks to distinguish slot from non-slot tokens.
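The inference step of the linear layer followed by a softmax can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the weights are set by hand rather than learned, and training as well as the LSTM+CRF variant are omitted.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(embeddings, weights, bias, labels=("O", "SLOT")):
    """Apply one linear layer and a softmax to every token embedding.

    weights[k] is the weight vector of labels[k]; bias[k] is its bias term.
    """
    predictions = []
    for x in embeddings:
        logits = [
            sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, bias)
        ]
        probabilities = softmax(logits)
        predictions.append(labels[probabilities.index(max(probabilities))])
    return predictions


# Hand-set toy parameters (not learned), for illustration only.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]
print(classify_tokens([[2.0, 0.0], [0.0, 3.0]], toy_weights, toy_bias))
# ['O', 'SLOT']
```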

Name Labeling
We assume that only a few samples of the current domain are available for training a model. The absence of a large amount of labeled data for the current domain makes it impractical to train neural networks. Therefore, we use the Rocchio classifier, as presented in Equation 1, to map a given slot value to its name label by using the centroid of the prototypes of the corresponding name label.
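A nearest-centroid classifier of this kind can be sketched in a few lines. The function names and the two-dimensional toy vectors are hypothetical, and Euclidean distance is an assumption made here, since the text only refers to a distance metric.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Center of mass of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_rocchio(prototypes: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """Build one prototype (centroid) vector per name label."""
    return {label: centroid(vectors) for label, vectors in prototypes.items()}


def predict_rocchio(centroids: Dict[str, List[float]], vector: Vector) -> str:
    """Return the label whose centroid is closest to the given vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy 2-dimensional "embeddings" of prototype slot values (illustration only).
toy_prototypes = {
    "artist": [[1.0, 0.0], [0.8, 0.2]],
    "genre": [[0.0, 1.0], [0.2, 0.8]],
}
toy_centroids = fit_rocchio(toy_prototypes)
print(predict_rocchio(toy_centroids, [0.7, 0.3]))  # artist
```

Because fitting only averages the prototype vectors, adding a new prototype for a label requires no retraining beyond recomputing that label's centroid.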

Resources
We use the SNIPS dataset (Coucke et al., 2018) as the base dataset in our experiments. SNIPS is an SLU dataset of crowd-sourced user utterances with 39 slots and 7 intents. We split SNIPS in order to create single-domain datasets.
We create Prototype and Test data in order to train the models and evaluate their performance on each domain. Prototype groups of increasing size are generated from SNIPS in order to investigate how performance changes as the number of samples increases. To this end, we randomly select 10 slot samples per label in SNIPS, each embedded in its input sequence (complete sentence). Starting from this initial set of 10 slots per label, we extend the previous set by 5 randomly selected slot samples up to 2 times, resulting in groups of 10, 15, and 20 samples (10-Prototype ⊂ 15-Prototype ⊂ 20-Prototype). A selected slot phrase counts as one sample in the label space even if it consists of more than one token. For example, Wind of Change consists of three tokens, but these three tokens represent one sample. The Test data consists of 1000 randomly selected sentences. Prototype and Test each include two annotation sets, Name and Slot.
Name Set: Provides annotations for the input sequences, with labels such as artist and object name, and O tags.
Slot Set: We convert the labels in the Name Set to SLOT tags while keeping the O tags unchanged.
Auxiliary Slot Set: We use the slot filling datasets of other domains in order to reduce non-slot tokens by exploiting syntactic similarities between domains. For example, the verb of a sentence does not represent a slot in any domain. Therefore, instead of trying to leverage the semantic similarity between slot tokens in different domains, we use the similarity of non-slot tokens to remove them from the input space. We obtain this dataset with the same process that converts the Name Set to the Slot Set.
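The two data preparation steps, converting a Name Set into a Slot Set and drawing nested Prototype groups, can be sketched as below. The function names, the fixed seed, and the use of prefixes of one shuffled list to obtain nested groups are assumptions made for illustration; slot phrases are represented as plain strings, and the surrounding sentences are left out.

```python
import random
from typing import Dict, List, Sequence


def to_slot_set(name_labels: Sequence[str]) -> List[str]:
    """Map every non-O name label to SLOT and keep O unchanged."""
    return ["O" if label == "O" else "SLOT" for label in name_labels]


def nested_prototypes(
    samples_by_label: Dict[str, List[str]],
    sizes: Sequence[int] = (10, 15, 20),
    seed: int = 0,
) -> Dict[int, Dict[str, List[str]]]:
    """Draw random prototype groups per label such that 10 ⊂ 15 ⊂ 20."""
    rng = random.Random(seed)
    groups: Dict[int, Dict[str, List[str]]] = {size: {} for size in sizes}
    for label, samples in samples_by_label.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        for size in sizes:
            # Prefixes of one shuffled list are nested by construction.
            groups[size][label] = shuffled[:size]
    return groups


print(to_slot_set(["O", "artist", "artist", "O"]))  # ['O', 'SLOT', 'SLOT', 'O']

toy_samples = {"artist": [f"artist_{i}" for i in range(30)]}
toy_groups = nested_prototypes(toy_samples)
print([len(toy_groups[size]["artist"]) for size in (10, 15, 20)])  # [10, 15, 20]
```

A label with fewer than 20 available slot phrases simply yields smaller groups in this sketch.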

Proposed Systems

Experimental Setup
We design our experimental settings to investigate the following research questions. The first question focuses on exploring the impact of existing annotated data from different domains on the performance of the slot/non-slot classification step.
The process of using this model is identical to that of the RocchioSlot+Name model. The only difference is that we add Slot Sets from other domains (the Auxiliary Slot Set) and train the Neural Slot Labeling model in order to analyze the impact of out-of-domain samples on the performance of non-slot reduction and of the Name Model. For example, the annotated data of the "AddToPlaylist" and "GetWeather" domains, with their labels converted to SLOT labels, can be used to train the Neural Slot Labeling model for "PlayMusic".

Results
The Non-slot/slot Classification Results (Table 2) and the Two-stage Slot Name Labeling Results (Table 1) show that the proposed models outperform the baseline across domains and sample sizes. Increasing the sample size clearly improves the F1 score in every domain. As can be seen in Domain Avg., our non-slot reduction models RocchioSlot+Name and NeuralSlot+Name outperform the baseline Name Model by more than 20%. In addition, comparing NeuralSlot+Name with RocchioSlot+Name, we see that the NeuralSlot+Name model yields an increase of more than 6% in average performance.
Impact of Different Contextualized Embeddings: ELMo and BERT have comparable performance, with ELMo slightly better on most tasks, as expected from the study of Tenney et al. (2019), while the Transformer-based BERT scores higher on RateBook and SearchCreativeW. consistently across all models.
Comparisons with State-of-the-art Systems: We compared our systems with three previous studies. Table 1 demonstrates that, even though the previous systems use a large amount of data with neural networks, RocchioSlot+Name outperforms the best previous system (CDS) by up to 1% with 20 training examples, whereas the NeuralSlot+Name model outperforms them by up to 8.3%.

Qualitative Analysis
We analyzed the results on individual slots by comparing them across contextualized embeddings and proposed models. We observed that BERT shows consistently lower results for slots such as city and state from BookRestaurant and location name and object location type from SearchScreeningE., whereas it outperforms ELMo in detecting proper names such as object name in the RateBook and SearchCreativeW. domains. The wrong predictions of Name Labeling are noteworthy, in particular the false positives of names (e.g., object select, cuisine, spatial relation) for tokens with the O label. We observe an extreme gap between low precision and relatively high recall. However, precision improves drastically when the RocchioSlot+Name and NeuralSlot+Name models are employed. For example, the slot object select of the RateBook domain has a precision of 0.41 with the Name Model, whereas its precision is 0.69 and 0.93 with the RocchioSlot+Name and NeuralSlot+Name models, respectively.
On the other hand, for the timeRange label of GetWeather, both RocchioSlot+Name and the Name Model fail. Due to the leakage of non-slot tokens, timeRange values are labeled as 'O'. RocchioSlot fails to label the values of timeRange (e.g., eleven months from now) as SLOT because it is a centroid-based method and is not able to capture sequential dependencies. The NeuralSlot+Name model, however, shows a significant improvement. A comparison of the results of both models indicates that wrong predictions of the 'O' label are drastically reduced with the NeuralSlot+Name model. Similar proper nouns in the same domain, e.g., album and track, reveal a weakness of the proposed systems: the NeuralSlot+Name model is not able to distinguish them. For example, album is most frequently mislabeled as track, and track as album.

Related Work

Low-resource Domain in NLP
In NLP, a domain typically refers to some coherent type of dataset that is related to the underlying linguistic distribution (Ramponi and Plank, 2020). When the linguistic distributions of the target and source domains differ, performance drops on the target domain. Therefore, hand-labeled samples are needed for many NLP applications, even though they are expensive to create and often not available for low-resource languages or domains. Many studies have recently been proposed to tackle the low-resource issue with approaches such as transfer learning for domain adaptation (Daumé III and Marcu, 2006; Pan and Yang, 2009) and multi-task learning (Peng and Dredze, 2017a). Here, we review studies on sequence labeling tasks similar to slot filling, such as part-of-speech (POS) tagging and named entity recognition (NER), within domain adaptation and multi-task learning.
The domain adaptation approach transfers the domain-general feature space from source tasks as "prior knowledge" to the target task in order to overcome the scarcity of hand-labeled data (Blitzer et al., 2006; Daumé III and Marcu, 2006; Ramponi and Plank, 2020). For POS tagging, Jiang and Zhai (2007) propose a supervised instance weighting technique that works with or without labeled instances in the target domain, whereas Kann et al. (2018) use character-level and subword-level supervision. Han and Eisenstein (2019), in contrast, demonstrate unsupervised multi-task learning with a domain-adaptive fine-tuning method that uses contextualized word embeddings for new domains. Similarly, NER is a sequence labeling task that is often addressed by domain adaptation and multi-task learning because of low-resource domains. However, most NER tasks have different label spaces. Jia et al. (2019) use cross-domain language modeling to perform cross-task knowledge transfer by extracting knowledge of domain differences from raw text, while Peng and Dredze (2017b) use a multi-task learning approach that learns shared representations across multiple tasks simultaneously in order to generalize better for domain adaptation.
As examined here, most existing work in NLP treats the low-resource issue as a problem of shared feature spaces. The main idea is always to augment the feature intersection in which the source and target domains are most similar and to use this feature space to improve performance on the low-resource target domain (Daumé III, 2009; Ruder and Plank, 2017; Ramponi and Plank, 2020).

Low-resource Domain in Slot Filling
Broadly, two ways of training a model have often been applied to slot filling in low-resource domain scenarios: (1) using a multi-task learning method (Jaech et al., 2016a; Bingel and Søgaard, 2017); (2) training a model that performs well across domains using domain adaptation or transfer learning techniques. For the latter, methods based on external memory (Peng and Yao, 2015), ranking loss (Vu et al., 2016), encoders (Kurata et al., 2016), attention (Zhu and Yu, 2017), multi-task modeling (Jaech et al., 2016b), adversarial training (Kim et al., 2017), and pointer networks (Zhai et al., 2017) have recently been proposed. These methods, however, still require a substantial amount of data for adaptation. Additionally, Louvan and Magnini (2018) propose joint learning with NER as an auxiliary task in a multi-task learning setup and show improvements in slot filling in low-resource scenarios.
Another direction relies on zero-shot learning approaches, i.e., learning methods that use label descriptions or label names, which have recently become popular for slot filling. Zero-shot learning (Socher et al., 2013) is a classification setup in which the model predicts, at test time, samples from classes that were not seen during training. Zero-shot slot filling, which relies on either slot names or slot descriptions, has influenced studies of the domain scaling problem for slot prediction. Bapna et al. (2017b) leverage the encoding of slot names and descriptions within a multi-task deep learned slot filling model to align slots across domains with shared feature extraction. Likewise, Lee and Jha (2019) propose a zero-shot adaptive transfer method for slot tagging that uses slot descriptions to transfer reusable concepts across domains, eliminating the need for labeled examples, whereas Shah et al. (2019) add target-domain samples to the slot descriptions to convey domain-agnostic concepts between intents.

Conclusion and Future Work
We propose a novel two-stage model for slot filling in low-resource domains. Our results demonstrate the importance of non-slot token reduction for slot filling under resource constraints, even with a simple classification method. Furthermore, we demonstrate the benefit of employing slot filling data from other domains for non-slot reduction. In addition, increasing the sample sizes of the Prototypes yields significant improvements. Based on our findings, using multi-domain or limited data could be effective in improving slot filling methods from a non-slot reduction perspective. Additionally, the outcomes of the multi-domain data usage in our study contribute a new perspective to supervised domain adaptation and generalization studies.