Example-Driven Intent Prediction with Observers

A key challenge of dialog systems research is to effectively and efficiently adapt to new domains. A scalable paradigm for adaptation necessitates the development of generalizable models that perform well in few-shot settings. In this paper, we focus on the intent classification problem which aims to identify user intents given utterances addressed to the dialog system. We propose two approaches for improving the generalizability of utterance classification models: (1) observers and (2) example-driven training. Prior work has shown that BERT-like models tend to attribute a significant amount of attention to the [CLS] token, which we hypothesize results in diluted representations. Observers are tokens that are not attended to, and are an alternative to the [CLS] token as a semantic representation of utterances. Example-driven training learns to classify utterances by comparing to examples, thereby using the underlying encoder as a sentence similarity model. These methods are complementary; improving the representation through observers allows the example-driven model to better measure sentence similarities. When combined, the proposed methods attain state-of-the-art results on three intent prediction datasets (banking77, clinc150, hwu64) in both the full data and few-shot (10 examples per intent) settings. Furthermore, we demonstrate that the proposed approach can transfer to new intents and across datasets without any additional training.


Introduction
Task-oriented dialog systems aim to satisfy a user goal in the context of a specific task such as booking flights (Hemphill et al., 1990), providing transit information (Raux et al., 2005), or acting as a tour guide (Budzianowski et al., 2018). Task-oriented dialog systems must first understand the user's goal by extracting meaning from a natural language utterance. This problem is known as intent prediction and is a vital component of task-oriented dialog systems (Hemphill et al., 1990; Coucke et al., 2018). Given the vast space of potential domains, a key challenge of dialog systems research is to effectively and efficiently adapt to new domains (Rastogi et al., 2019). Rather than adapting to new domains by relying on large amounts of domain-specific data, a scalable paradigm for adaptation necessitates the development of generalizable models that perform well in few-shot settings (Casanueva et al., 2020; Mehri et al., 2020).

* Work done while Shikib was at Amazon.
The task of intent prediction can be characterized as a two-step process: (1) representation (mapping a natural language utterance to a semantically meaningful representation) and (2) prediction (inferring an intent given a latent representation). These two steps are complementary and interdependent, thereby necessitating that they be jointly improved. Therefore, to enhance the domain adaptation abilities of intent classification systems, we propose (1) to improve the representation step through observers and (2) to improve the prediction step through example-driven training.
While BERT (Devlin et al., 2018) is a strong model for natural language understanding tasks (Wang et al., 2018), prior work has found that a significant amount of BERT's attention is attributed to the [CLS] and [SEP] tokens, though these special tokens do not attribute much attention to the words of the input until the last layer (Clark et al., 2019; Kovaleva et al., 2019). Motivated by the concern that attending to these tokens is causing a dilution of representations, we introduce observers. Rather than using the latent representation of the [CLS] token, we instead propose to have tokens that attend to the words of the input but are not attended to. In this manner, we disentangle BERT's attention with the objective of improving the semantic content captured by the utterance representations.
A universal goal of language encoders is that inputs with similar semantic meanings have similar latent representations (Devlin et al., 2018). To maintain consistency with this goal, we introduce example-driven training wherein an utterance is classified by measuring similarity to a set of examples corresponding to each intent class. While standard approaches implicitly capture the latent space to intent class mapping in the learned weights (i.e., through a classification layer), example-driven training makes the prediction step an explicit non-parametric process that reasons over a set of examples. By maintaining consistency with the universal goal of language encoders and explicitly reasoning over the examples, we demonstrate improved generalizability to unseen intents and domains.
By incorporating both observers and example-driven training on top of the CONVBERT model 1 (Mehri et al., 2020), we attain state-of-the-art results on three intent prediction datasets: BANKING77 (Casanueva et al., 2020), CLINC150 (Larson et al., 2019), and HWU64, in both full data and few-shot settings. To measure the generalizability of our proposed models, we carry out experiments evaluating their ability to transfer to new intents and across datasets. By simply modifying the set of examples during evaluation, and without any additional training, our example-driven approach attains strong results on both transfer to unseen intents and across datasets. This speaks to the generalizability of the approach. Further, to demonstrate that observers mitigate the problem of diluted representations, we carry out probing experiments and show that the representations produced by observers capture more semantic information than the [CLS] token.
The contributions of this paper are as follows: (1) we introduce observers in order to avoid the potential dilution of BERT's representations by disentangling the attention, (2) we introduce example-driven training, which explicitly reasons over a set of examples to infer the intent, (3) by combining our proposed approaches, we attain state-of-the-art results across three datasets in both full data and few-shot settings, and (4) we carry out experiments demonstrating that our proposed approach is able to effectively transfer to unseen intents and across datasets without any additional training.

1 https://github.com/alexa/DialoGLUE/

Methods
In this section, we describe several methods for the task of intent prediction. We begin by describing two baseline models: a standard BERT classifier (Devlin et al., 2018) and CONVBERT with task-adaptive masked language modelling (Mehri et al., 2020). The proposed model extends the CONVBERT model of Mehri et al. (2020) through observers and example-driven training. Given the aforementioned two-step characterization of intent prediction, observers aim to improve the representation step while example-driven training improves the prediction step.

BERT Baseline
Across many tasks in NLP, large-scale pre-training has resulted in significant performance gains (Wang et al., 2018; Devlin et al., 2018; Radford et al., 2018). To leverage the generalized language understanding capabilities of BERT for the task of intent prediction, we follow the standard fine-tuning paradigm. Specifically, we take an off-the-shelf BERT-base model and perform end-to-end supervised fine-tuning on the task of intent prediction.

Conversational BERT with Task-Adaptive MLM

Despite the strong language understanding capabilities exhibited by pre-trained models, modelling dialog poses challenges due to its intrinsically goal-driven, linguistically diverse, and often informal/noisy nature. To this end, recent work has proposed pre-training on open-domain conversational data (Henderson et al., 2019; Zhang et al., 2019b). Furthermore, task-adaptive pre-training, wherein a model is trained in a self-supervised manner on a dataset prior to fine-tuning on the same dataset, has been shown to help with domain adaptation (Mehri et al., 2019; Gururangan et al., 2020; Mehri et al., 2020). Our models extend the CONVBERT model of Mehri et al. (2020), which (1) pre-trained the BERT-base model on a large open-domain dialog corpus and (2) performed task-adaptive masked language modelling (MLM) as a mechanism for adapting to specific datasets.

Observers
The pooled representation of BERT-based models is computed using the [CLS] token. Analysis of BERT's attention patterns has demonstrated that a significant amount of attention is attributed to the [CLS] and [SEP] tokens (Clark et al., 2019; Kovaleva et al., 2019). It is often the case that over half of the total attention is to these tokens (Clark et al., 2019). Furthermore, the [CLS] token primarily attends to itself and [SEP] until the final layer (Kovaleva et al., 2019). It is possible that attending to these special BERT tokens, in combination with the residual connections of the BERT attention heads, is equivalent to a no-op operation. However, it is nonetheless a concern that this behavior of attending to tokens with no inherent meaning (since [CLS] does not really attend to other words until the final layer) results in the latent utterance-level representations being diluted.
We posit that a contributing factor of this behavior is the entangled nature of BERT's attention: i.e., the fact that the [CLS] token attends to words of the input and is attended to by the words of the input. This entangled behavior may inadvertently cause the word representations to attend to [CLS] in order to better resemble its representation and therefore make it more likely that the [CLS] token attends to the word representations. In an effort to mitigate this problem and ensure the representation contains more of the semantic meaning of the utterance, we introduce an extension to traditional BERT fine-tuning called observers.
Observers, pictured in Figure 1, attend to the tokens of the input utterance at every layer of the BERT-based model; however, they are never attended to. The representation of the observers in the last layer is then used as the final utterance-level representation. In this manner, we aim to disentangle the relationship between the representation of each word in the input and the final utterance-level representation. By removing this bi-directional relationship, we hope to avoid the risk of diluting the representations (by inadvertently forcing them to attend to a meaningless [CLS] token) and therefore capture more semantic information in the final utterance-level representation. Throughout our experiments we use 20 observer tokens (which are differentiated only by their position embeddings) and average their final representations. The positions of the observer tokens are consistent across all utterances (the last 20 tokens in the padded sequence). Specifically, the concept of observers modifies F in Equations 1 and 2: while we maintain the BERT-based model architecture, we produce the utterance-level representation by averaging the representations of the observer tokens, and use that for classification rather than the [CLS] token.
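The disentangled attention can be illustrated with a small sketch of the mask a single self-attention layer would use: every position may attend to the word positions, but no position attends to the observers. This is a minimal NumPy illustration of the idea, not the authors' implementation (special tokens such as [SEP] and padding are ignored here); the pooling helper mirrors the averaging of the final observer representations described above.

```python
import numpy as np

def observer_attention_mask(n_words: int, n_observers: int) -> np.ndarray:
    """Build an asymmetric attention mask for one self-attention layer.

    Rows index queries, columns index keys. mask[i, j] == 1 means
    position i may attend to position j. Word positions come first,
    observer positions last (mirroring the padded-sequence layout
    described above).
    """
    n = n_words + n_observers
    mask = np.zeros((n, n), dtype=np.int64)
    mask[:, :n_words] = 1  # every position may attend to the word positions
    # Observer columns stay 0: observers are never attended to, which
    # removes the bidirectional (entangled) attention path.
    return mask

def pool_observers(hidden: np.ndarray, n_observers: int) -> np.ndarray:
    """Average the final-layer observer representations to obtain the
    utterance-level vector (used in place of the [CLS] vector)."""
    return hidden[-n_observers:].mean(axis=0)
```

Because the mask only removes edges from the attention graph, the underlying BERT architecture and its pre-trained weights are otherwise unchanged.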

Example-Driven Training
A universal goal of language encoders is for inputs with similar semantic meanings to have similar latent representations. BERT (Devlin et al., 2018) has been shown to effectively identify similar sentences (Reimers and Gurevych, 2019) even without additional fine-tuning (Zhang et al., 2019a). Through example-driven training, we aim to reformulate the task of intent prediction to be more consistent with this universal goal of language encoders.
Using a BERT-like encoder, we train an intent classification model to (1) measure the similarity of an utterance to a set of examples and (2) infer the intent of the utterance based on the similarity to the examples corresponding to each intent. Rather than implicitly capturing the latent space to intent class mapping in our learned weights (i.e., through a classification layer), we make this mapping an explicit non-parametric process that reasons over a set of examples. Our formulation, similar to metric-based meta learning (Koch et al., 2015), only performs gradient updates for the language encoder, which is trained for the task of sentence similarity. Through this example-driven formulation, we hypothesize that the model will better generalize in few-shot scenarios, as well as to rare intents.
We are given (1) a language encoder F that encodes an utterance to produce a latent representation, (2) a natural language utterance utt, and (3) a set of n examples {(x_1, y_1), . . . , (x_n, y_n)} where x_1, . . . , x_n are utterances and y_1, . . . , y_n are their corresponding intent labels. With F being a BERT-like model, the following equations describe example-driven intent classification:

    u = F(utt)                          (1)
    X_i = F(x_i)    for i = 1, . . . , n    (2)
    α = softmax(X u)                    (3)
    P(c) = Σ_{i : y_i = c} α_i          (4)

where X is the matrix whose rows are the example encodings X_i and c ranges over the intent classes. The equations above describe a non-parametric process for intent prediction: no classification layer is learned. Instead, through the example-driven formulation (visualized in Figure 2), the underlying language encoder (e.g., BERT) is being trained for the task of sentence similarity. A universal goal of language encoders is that inputs with similar semantic meaning should have similar latent representations. By formulating intent prediction as a sentence similarity task, we are adapting BERT-based encoders in a way that is consistent with this universal goal. We hypothesize that, in contrast to the baseline models, this formulation facilitates generalizability and has the potential to better transfer to new intents and domains.

At training time, we populate the set of examples in a two-step process: (i) for each intent class that exists in the training batch, we sample one different utterance of the same intent class from the training set, and (ii) we randomly sample utterances from the training set until we have a set of examples that is double the size of the training batch (128 example utterances). During inference, our example set is comprised of all the utterances in the training data.
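As a concrete illustration, the prediction step of this formulation can be sketched in a few lines of NumPy: take a softmax over utterance-example similarities and sum the resulting probability mass per intent class. Dot-product similarity is assumed here, and the vectors stand in for the encoder outputs F(utt) and F(x_i).

```python
import numpy as np

def predict_intent(u, example_encodings, example_labels):
    """Non-parametric intent prediction over a set of encoded examples.

    u                 -- encoding of the input utterance, shape (d,)
    example_encodings -- encodings of the n examples, shape (n, d)
    example_labels    -- intent label of each example, length n

    Returns a dict mapping each intent to its predicted probability.
    """
    scores = example_encodings @ u                  # similarity to each example
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over examples
    probs = {}
    for a, label in zip(alpha, example_labels):
        probs[label] = probs.get(label, 0.0) + a    # sum mass per intent class
    return probs
```

Because the only learned parameters are those of the encoder itself, training this model reduces to training a sentence similarity function.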

Datasets
We evaluate our methods on three intent prediction datasets: BANKING77 (Casanueva et al., 2020), CLINC150 (Larson et al., 2019), and HWU64. These datasets span several domains and consist of many different intents, making them more challenging and more reflective of commercial settings than commonly used intent prediction datasets like SNIPs (Coucke et al., 2018). BANKING77 contains 13,083 utterances related to banking with 77 different fine-grained intents. CLINC150 contains 23,700 utterances spanning 10 domains (e.g., travel, kitchen/dining, utility, small talk) and 150 different intent classes. HWU64 includes 25,716 utterances for 64 intents spanning 21 domains (e.g., alarm, music, IoT, news). Casanueva et al. (2020) forego a validation set for these datasets and instead use only a training and testing set. We instead follow the setup of Mehri et al. (2020), wherein a portion of the training set is designated as the validation set.

Experimental Setup
We evaluate in two experimental settings following prior work (Casanueva et al., 2020; Mehri et al., 2020): (1) using the full training set and (2) using 10 examples per intent, or approximately 10% of the training data. In both settings, we evaluate on the validation set at the end of each epoch and perform early stopping with a patience of 20 epochs for a maximum of 100 epochs. Since the few-shot experiments are more sensitive to initialization and hyperparameters, we repeat the few-shot experiments 5 times and take an average over the experimental runs. For the few-shot settings, our models use only the few-shot training data both for masked language modelling and as examples at inference time in the example-driven models (i.e., they do not see any additional data). Our experiments with observers all use 20 observers; however, we include an ablation in the appendix (Table 6; see supplementary materials).
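The 10-shot split described above can be constructed roughly as follows; the exact sampling procedure used by the benchmark may differ, so this is only a sketch of the general recipe.

```python
import random
from collections import defaultdict

def few_shot_split(utterances, labels, k=10, seed=0):
    """Sample up to k training utterances per intent (the 10-shot setting).

    Returns (few_shot_utterances, few_shot_labels).
    """
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for utt, lab in zip(utterances, labels):
        by_intent[lab].append(utt)
    utts, labs = [], []
    for lab, pool in sorted(by_intent.items()):  # deterministic intent order
        for utt in rng.sample(pool, min(k, len(pool))):
            utts.append(utt)
            labs.append(lab)
    return utts, labs
```

Repeating this with different seeds mirrors the averaging over 5 experimental runs noted above.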

Results
Our experimental results, as well as the results obtained by Casanueva et al. (2020) and Mehri et al. (2020) are shown in Table 1. Combining observers and example-driven training results in (1) SoTA results across the three datasets and (2) a significant improvement over the BERT-base model, especially in the few-shot setting (+5.02% on average).
Furthermore, the results show that the use of observers is particularly conducive to the example-driven training setup. Combining these two approaches yields strong improvements over the CONVBERT + MLM model (few-shot: +4.98%, full data: +0.41%). However, when we consider the two proposed approaches independently, there is no consistent improvement for either example-driven training (few-shot: -0.46%, full data: +0.24%) or observers (few-shot: +0%, full data: -0.42%). The fact that these two methods are particularly conducive to each other signifies the importance of using them jointly. The representation step of intent prediction is tackled by observers, which aim to better capture the semantics of an input by disentangling the attention and therefore avoiding the dilution of the representations. The prediction step is improved through example-driven training, which uses the underlying BERT-based model to predict intents by explicitly reasoning over a set of examples. This characterization highlights the importance of addressing both steps of the process jointly. Using observers alone does not lead to significant improvements because the linear classification layer cannot effectively leverage the improved representations. Using example-driven training alone does not lead to significant improvements because the [CLS] representations do not capture enough of the underlying utterance semantics. The enhanced semantic representation of observers is necessary for example-driven training: by improving the latent representations of utterances, it is easier to measure similarity against the set of examples.

Analysis
This section describes several experiments that were carried out to show the unique benefits of observers and example-driven training, as well as to validate our hypotheses regarding the two methods. First, we show that with the example-driven formulation for intent prediction, we can attain strong performance on intents unseen during training. Next, we show that this generalization to new intents also transfers across datasets. We then carry out a probing experiment demonstrating that the latent representation of the observers contains greater semantic information about the input. Finally, we discuss an ablation over the number of observers, which demonstrates that the benefit of observers is primarily a consequence of the disentangled attention.

Transfer to Unseen Intents
By formulating intent prediction as a sentence similarity task, the example-driven formulation allows for the potential to predict intents that are unseen at training time. We carry out experiments in the few-shot setting for each dataset by (1) randomly removing 4-10 intent classes when training in an example-driven manner, (2) adding the utterances for the removed intents back into the set of examples at inference time, and (3) reporting results only on the unseen intents. We repeat this process 30 times for each dataset and report the results in Table 2. It should be noted that we do not perform MLM training on the utterances corresponding to the unseen intents.
These results demonstrate that the example-driven formulation generalizes to new intents without having to re-train the model. The performance on the unseen intents approximately matches the performance of the best model that has seen all intents (denoted BEST FULLY TRAINED MODEL in Table 2). These results highlight a valuable property of the proposed formulation: namely, that new intent classes can be added in an online manner without having to re-train the model. While the off-the-shelf BERT-base and CONVBERT models, which are not fine-tuned on the datasets at all, are able to identify similar sentences to some extent, training in an example-driven manner drastically improves performance.
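Concretely, adding an unseen intent in an online manner amounts to appending its encoded example utterances to the example set; no gradient updates are involved. The sketch below uses toy 2-d vectors in place of encoder outputs and assumes the dot-product-plus-softmax prediction of the example-driven formulation; the intent names are purely illustrative.

```python
import numpy as np

def predict(u, ex_vecs, ex_labels):
    """Softmax over example similarities, summed per intent
    (dot-product similarity assumed)."""
    s = ex_vecs @ u
    a = np.exp(s - s.max())
    a /= a.sum()
    out = {}
    for w, lab in zip(a, ex_labels):
        out[lab] = out.get(lab, 0.0) + w
    return out

# Example set from training: two seen intents (toy 2-d "encodings").
ex_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
ex_labels = ["check_balance", "transfer_money"]

# Adding an unseen intent at inference time is just appending examples;
# the encoder itself is left untouched.
ex_vecs = np.vstack([ex_vecs, [[0.7, 0.7]]])
ex_labels = ex_labels + ["freeze_card"]

probs = predict(np.array([0.7, 0.7]), ex_vecs, ex_labels)
```

An utterance close to the new intent's examples is then assigned to the new class, even though that class never appeared during training.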
The addition of observers, in combination with example-driven training, significantly improves performance on this experimental setting (+18.42%). This suggests that the observers generalize better to unseen intents, potentially because the observers are better able to emphasize words that are key to differentiating between intents (e.g., turn the volume up vs turn the volume down).

Transfer Across Datasets
While transferring to unseen intents is a valuable property, the unseen intents in this experimental setting are still from the same domain. To further evaluate the generalizability of our models, we carry out experiments evaluating the ability of models to transfer to other datasets. Using the full data setting with 10 training utterances per intent, we (1) train a model on a dataset and (2) evaluate the models on a new dataset, using the training set of the new dataset as examples during inference. In this manner, we evaluate the ability of the models to transfer to unseen intents and domains without additional training.
The results in Table 3 demonstrate the ability of the model with observers and example-driven training to transfer to new datasets, which consist of both unseen intents and unseen domains. These results show that the example-driven model performs reasonably well even when transferring to domains and intents that were not seen at training time. These results, in combination with the results shown in Table 2, speak to the generalizability of the proposed methods. Specifically, by formulating intent prediction as a sentence similarity task through example-driven training, we maintain consistency with a universal goal of language encoders (i.e., that utterances with similar semantic meanings have similar latent representations) that effectively transfers to new settings.

Observers Probing Experiment
We hypothesized that by disentangling the attention in BERT-based models, the observers would avoid the dilution of representations (which occurs because words attend to a meaningless [CLS] token) and therefore better capture the semantics of the input. We validate this hypothesis through the experimental evidence presented in Table 2 wherein the use of observers results in a significant performance improvement on unseen intents. To demonstrate that observers better capture the semantics of an input, we carry out a probing experiment using the word-content task of Conneau et al. (2018).
We generate a latent representation of each utterance using models with and without observers. We then train a classifier layer on top of the frozen representations to reproduce the words of the input. Similar to Conneau et al. (2018), we avoid using the entire vocabulary for this probing experiment and instead use only the most frequent 1000 words for each dataset. With infrequent words, there would be uncertainty about whether a performance difference is a consequence of (1) the semantic content of the representation or (2) the quality of the probing model. Since we are concerned with measuring the former, we only consider the most frequent words to mitigate the effect of the latter. Table 4 shows the micro-averaged F-1 score for the task of reproducing the words in the utterance, given the different latent representations.
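A minimal version of such a probe can be written as multi-label logistic regression over the frozen representations: one sigmoid output per vocabulary word, trained with a binary cross-entropy gradient. This sketch is only illustrative; the probing classifier of Conneau et al. (2018) differs in its details.

```python
import numpy as np

def train_word_probe(reps, word_targets, steps=200, lr=0.5):
    """Linear multi-label probe: from a frozen utterance representation,
    predict which of the top-k vocabulary words occur in the utterance.

    reps         -- frozen encodings, shape (n, d)
    word_targets -- binary matrix, shape (n, vocab); 1 if word present
    Returns the learned weight matrix of shape (d, vocab).
    """
    n, d = reps.shape
    W = np.zeros((d, word_targets.shape[1]))
    for _ in range(steps):
        logits = reps @ W
        preds = 1.0 / (1.0 + np.exp(-logits))       # sigmoid per word
        grad = reps.T @ (preds - word_targets) / n  # BCE gradient
        W -= lr * grad                              # note: reps stay frozen
    return W
```

Only W is updated, so any difference in probe accuracy reflects what the frozen representation itself captures.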
A latent representation that better captures the semantics of the input utterance will be better able to reproduce the specific words of the utterance. The results in Table 4 show that the use of observers results in latent representations that better facilitate the prediction of the input words (+1.50, or a 5% relative improvement). These results further validate the hypothesis that the use of observers results in better latent representations.

Number of Observers
To further understand the performance of the observers, we carry out an ablation study over the number of observers. The results shown in Table 6 (in the Appendix) demonstrate that while multiple observers help, even a single observer provides benefit. This suggests that the observed performance gain is primarily a consequence of the disentangled attention rather than of averaging over multiple observers. This ablation provides further evidence that the use of observers mitigates the dilution of the utterance-level representations.

Related Work
Intent Prediction
Intent prediction is the task of mapping a user's natural language utterance to one of several predefined classes in an effort to describe the user's intent (Hemphill et al., 1990; Coucke et al., 2018). Intent prediction is a vital component of pipeline task-oriented dialog systems, since determining the goal of the user is the first step in producing an appropriate response (Raux et al., 2005; Young et al., 2013). Prior to the advent of large-scale pre-training (Devlin et al., 2018; Radford et al., 2018), approaches for intent prediction utilized task-specific architectures and training methodologies (e.g., multi-tasking, regularization strategies) that aim to better capture the semantics of the input (Bhargava et al., 2013; Hakkani-Tür et al., 2016; Gupta et al., 2018; Niu et al., 2019).

The large-scale pre-training of BERT makes it more effective for many tasks within natural language understanding (Wang et al., 2018), including intent prediction (Chen et al., 2019a; Castellucci et al., 2019). However, recent work has demonstrated that leveraging dialog-specific pre-trained models, such as ConveRT (Henderson et al., 2019; Casanueva et al., 2020) or CONVBERT (Mehri et al., 2020), obtains better results. In this paper, we build on a strong pre-trained conversational encoder (CONVBERT) (1) by enhancing its ability to effectively capture the semantics of the input through observers and (2) by re-formulating the problem of intent prediction as a sentence similarity task through example-driven training, in an effort to better leverage the strengths of language encoders and facilitate generalizability.

Observers
Analysis of BERT's attention weights shows that a significant amount of attention is attributed to special tokens, which have no inherent meaning (Clark et al., 2019; Kovaleva et al., 2019). We address this problem by disentangling BERT's attention through the use of observers. Several avenues of recent work have explored disentangling the attention mechanism in Transformers. One line of work explores disentangling the attention heads of a Transformer model conditioned on dialog acts to improve response generation. He et al. (2020) disentangle the attention corresponding to the words and to the position embeddings to attain performance gains across several NLP tasks. Guo et al. (2019) propose an alternative to fully-connected attention, wherein model complexity is reduced by replacing the attention connections with a star-shaped topology.

Example-Driven Training
Recent efforts in NLP have shown relying on an explicit set of nearest neighbors to be effective for language modelling, question answering (Kassner and Schütze, 2020), and knowledge-grounded dialog (Fan et al., 2020). However, these approaches condition on examples only during inference or in a non end-to-end manner. In contrast, we train the encoder to classify utterances by explicitly reasoning over a set of examples.
The core idea of example-driven training is similar to that of metric-based meta learning, which has been explored in the context of image classification, wherein the objective is to learn a kernel function (in our case, BERT) and use it to compute similarity to a support set (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). Beyond being the first to extend this approach to the task of intent prediction, the key difference of example-driven training is that we use a pre-trained language encoder (Mehri et al., 2020) as the underlying sentence similarity model (i.e., kernel function). Ren and Xue (2020) leverage a triplet loss for intent prediction, which ensures that their model learns similar representations for utterances with the same intent. We go beyond this by performing end-to-end prediction in an example-driven manner.
Our non-parametric approach for intent prediction allows us to attain SoTA results and facilitate generalizability to unseen intents and across datasets.

Conclusion
In order to enhance the generalizability of intent prediction models, we introduce (1) observers and (2) example-driven training. We attain SoTA results on three datasets in both the full data and few-shot settings. Furthermore, our proposed approach exhibits the ability to transfer to unseen intents and across datasets without any additional training, highlighting its generalizability. We also carry out a probing experiment which shows that the representations produced by observers better capture the semantic information in the input.
There are several avenues for future work.
(1) Observers and example-driven training can be extended beyond intent prediction to tasks like slot filling and dialog state tracking. (2) Since observers are disentangled from the attention graph, it is worth exploring whether it is possible to force each of the observers to capture a different property of the input (i.e., intent, sentiment, domain, etc.). (3) The mechanism for measuring sentence similarity in our example-driven formulation can be improved.

Ethical Considerations
Our paper presents several approaches for improving performance on the task of intent prediction in task-oriented dialogs. We believe that neither our proposed approaches nor the resulting models give cause for ethical concern. There is limited potential for misuse. Given the domain of our data (i.e., task-oriented dialogs), failure of the models will not result in harmful consequences. Our paper relies on significant experimentation, which may have resulted in a higher carbon footprint; however, this is unlikely to be drastically higher than that of the average NLP paper.

A Qualitative Examples

Table 6 shows examples of predictions on the HWU corpus using both observers and example-driven training. These examples show that semantically similar example utterances are identified, particularly when using observers. Furthermore, the examples in Table 6 show that explicitly reasoning over examples makes intent classification models more interpretable.

B Ablations
We carry out ablations over the number of observers used to train and evaluate the models. Furthermore, we vary the number of examples seen at inference time, as a percentage of the full set of training examples. The results shown in Table 6 demonstrate that while having more observers helps, even a single observer provides benefits. This suggests that the observed performance gain (shown in Table 1) is primarily a consequence of the disentangled attention rather than of averaging over multiple observers.
The ablation over the number of examples used at inference time demonstrates that the models perform reasonably well with far fewer examples (e.g., 5% is <1000 examples, or approximately 5 per intent). The performance drop in the few-shot experiments suggests that it is important to train with more data; however, the results in Table 6