HINT3: Raising the bar for Intent Detection in the Wild

Intent Detection systems in the real world are exposed to the complexities of imbalanced datasets containing varying perceptions of intent, unintended correlations and domain-specific aberrations. To facilitate benchmarking that reflects near real-world scenarios, we introduce 3 new datasets created from live chatbots in diverse domains. Unlike most existing datasets, which are crowdsourced, our datasets contain real user queries received by the chatbots and facilitate penalising unwanted correlations picked up during training. We evaluate 4 NLU platforms and a BERT-based classifier and find that performance saturates at inadequate levels on the test sets because all systems latch on to unintended patterns in the training data.


Introduction
Over the last few years, task-oriented dialogue systems have gained increasing traction for applications like personal assistants, automated customer support agents, etc. This has led to the availability of several commercialised and/or open conversational bot building platforms. Most popular systems today involve intent detection as a vital part of their Natural Language Understanding (NLU) pipeline. Recent advances in transfer learning (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) have enabled systems that perform quite well on existing benchmarking datasets (Larson et al., 2019; Casanueva et al., 2020).
Definitions of intent often vary across users, tasks and domains. Perception of intent can range from a generic abstraction such as "Ordering a product" to extreme granularity such as "Enquiring for a discount on a specific product if ordered using a specific card". Additionally, factors such as imbalanced data distribution in the training set, assumptions made during training data generation, and the diverse backgrounds of the domain experts involved in defining the classes make this task more challenging. During inference, these systems may be deployed to users with diverse cultural backgrounds who might frame their queries differently even when communicating in the same language. Furthermore, apart from correctly identifying in-scope queries, the system is expected to accurately reject out-of-scope (Larson et al., 2019) queries, adding to the challenge.
Most existing datasets for intent detection are generated using crowdsourcing services. To benchmark accurately in real-world settings, we release 3 new single-domain datasets, each spanning multiple coarse and fine-grained intents, with the test sets drawn entirely from actual user queries on live systems at scale instead of being crowdsourced. On these datasets, we find that the performance of existing systems saturates at unsatisfactory levels because they end up learning spurious patterns from the training dataset instead of generalising to the perceived meanings of intents.
We evaluate 4 NLU platforms (Dialogflow, LUIS, Rasa NLU and Haptik) and a BERT (Devlin et al., 2019) based classifier on all 3 datasets and highlight gaps in language understanding. We further probe into queries where all the current systems fail and question the efficacy of the current approach to learning. Additionally, we repeat all our experiments on a subset of the training data and show a performance drop in all the systems despite retaining relevant and sufficient utterances in the training subset. We've made our datasets and code freely accessible on GitHub to promote

Prior Work
Despite intent detection being an important component of most dialogue systems, very few datasets have been collected from real users. The Web Apps, Ask Ubuntu and Chatbot datasets from (Braun et al., 2017) contain a limited number of intents (<10), oversimplifying the task. More recent datasets like HWU64 from (Liu et al., 2019) and CLINC150 from (Larson et al., 2019) span a large number of intents in multiple domains but are generated using crowdsourcing services, and hence are limited in the diversity of user expressions. Such diversity arises from, but is not limited to, domain-specific presumptions, context from how and where the bot is made available, paraphrases emerging from the cultural and ethnic diversity of the user base, conversational slang, etc. Our work has some similarity with CLINC150, in that they also highlight the problem of out-of-scope intent detection, and with BANKING77 from (Casanueva et al., 2020)

Datasets
We introduce HINT3, a collection of datasets shown in Table 2 (SOFMattress, Curekart and Powerplay11), each containing a diverse set of intents in a single domain: mattress products retail, fitness  Table 1 shows a few example intents of varying granularity in the HINT3 datasets, along with examples of training queries created by domain experts and of in-scope and out-of-scope queries received from real users.

Training Data Collection
Training data is prepared by a team of domain experts trying to emulate real users after in-depth research of historical user queries. The experts do not create an explicit set of out-of-scope queries, primarily because the universe of such queries is infinitely large. The training datasets exhibit class imbalance and contain domain-specific words and acronyms 7. All training data queries are in English.

Dataset Variants
In addition to the Full training sets, we create Subset versions of each training set. For each class, after retaining the first query, we iterate over the rest, discarding a query if it has an entailment score (Bowman et al., 2015) greater than 0.6 in both directions with any of the queries retained so far, i.e. the subset version has the following property:

$$\forall i \in I,\ \forall a, b \in X_i,\ a \neq b:\quad \neg\big(E(a, b) > 0.6 \,\wedge\, E(b, a) > 0.6\big)$$

where I is the set of all intents, X_i is the set of queries retained for class i, and E(h, p) is the entailment scoring function with h as hypothesis and p as premise. We use an ELMo model trained on SNLI (Peters et al., 2018; Parikh et al., 2016) for E(h, p). The Subset versions are intended to evaluate performance with only semantically different sentences in the training set, as ideally systems should already understand queries that are semantically similar to the ones present in the training set.

7 github.com/hellohaptik/HINT3/tree/master/data exploration
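The greedy filtering described above can be sketched as follows. This is a minimal illustration, not the authors' released code: `build_subset` and `toy_entail` are hypothetical names, and `toy_entail` is a crude token-overlap stand-in for the ELMo-based entailment model, used only so that the example runs on its own.

```python
from typing import Callable, Dict, List


def toy_entail(hypothesis: str, premise: str) -> float:
    """Stand-in scorer (assumption): share of hypothesis tokens found in the premise."""
    h_tokens = set(hypothesis.lower().split())
    p_tokens = set(premise.lower().split())
    return len(h_tokens & p_tokens) / len(h_tokens) if h_tokens else 0.0


def build_subset(
    queries_by_intent: Dict[str, List[str]],
    entail: Callable[[str, str], float],
    threshold: float = 0.6,
) -> Dict[str, List[str]]:
    """Keep a query only if it does not mutually entail any query retained so far."""
    subset: Dict[str, List[str]] = {}
    for intent, queries in queries_by_intent.items():
        retained: List[str] = []
        for query in queries:
            # The first query is always kept because `retained` is still empty.
            redundant = any(
                entail(query, kept) > threshold and entail(kept, query) > threshold
                for kept in retained
            )
            if not redundant:
                retained.append(query)
        subset[intent] = retained
    return subset


# Hypothetical training queries for a single intent.
full = {
    "ORDER": [
        "i want to order a mattress",
        "want to order a mattress i",  # same tokens, so mutually entailing under toy_entail
        "where is my order",
    ]
}
subset = build_subset(full, toy_entail)
```

With a real entailment model, only the `entail` argument would change; the filtering loop itself follows the procedure in the text, including the dependence of the result on query order.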

Test Data Collection and Annotation
Our test sets contain the first message received by live systems from real users over a period of 15 days. Inter-annotator agreement was 75.8%, 80.0% and 73.4% for SOFMattress, Curekart and Powerplay11 respectively, and conflicts were resolved by domain experts. One major reason for the low inter-annotator agreement was unclear criteria for defining an intent, which sometimes led to overlapping intents at different levels of granularity, even after we had manually merged any conflicting or highly similar intents in the training data. Coming directly from real users, our test set queries also contain messaging slang, acronyms, spelling mistakes, grammatical mistakes and code-mixed language 7. Queries in non-Latin script or code-mixed languages were marked as out of scope (labelled as NO NODES DETECTED). Since live chat systems don't cater to all the queries related to a brand, our test sets contain relevant out-of-scope queries received from users about that domain. Any identifiable information about users or brands was replaced with made-up values in both train and test sets.

Benchmark Evaluation
We evaluated the Dialogflow, LUIS, Rasa NLU and Haptik platforms, as well as a BERT-based classifier, on our datasets. All layers of BERT were fine-tuned with a learning rate of 4e-5 for up to 50 epochs with a warmup period of 0.1 and early stopping.

Out-Of-Scope (OOS) prediction
We use thresholds on the model's probability estimate to predict whether a query is OOS. We report performance at thresholds ranging from 0.1 to 0.9 in steps of 0.1 to show the maximum performance a model can achieve irrespective of how the threshold is chosen.
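A minimal sketch of threshold-based OOS prediction is given below. It assumes that a query is labelled OOS when the top-class probability falls below the threshold; the paper does not spell out the exact comparison, so the use of `>=` is an assumption. The intent names are hypothetical and the OOS label string is copied from the annotation label mentioned in the text.

```python
from typing import Dict, List

OOS_LABEL = "NO NODES DETECTED"  # label string as it appears in the text

# Thresholds 0.1, 0.2, ..., 0.9, rounded to avoid floating point drift.
THRESHOLDS: List[float] = [round(0.1 * k, 1) for k in range(1, 10)]


def predict_with_threshold(probs: Dict[str, float], threshold: float) -> str:
    """Return the top intent, or OOS_LABEL if its probability is below the threshold."""
    best_intent = max(probs, key=probs.get)
    # Assumption: confidence equal to the threshold still counts as in-scope.
    return best_intent if probs[best_intent] >= threshold else OOS_LABEL


def sweep(probs: Dict[str, float]) -> Dict[float, str]:
    """Prediction for one query at every threshold in the sweep."""
    return {t: predict_with_threshold(probs, t) for t in THRESHOLDS}


# Hypothetical class probabilities for a single query.
example_probs = {"ORDER_STATUS": 0.55, "EMI": 0.30, "WARRANTY": 0.15}
predictions = sweep(example_probs)
```

Here the query is assigned `ORDER_STATUS` for thresholds up to 0.5 and is rejected as OOS from 0.6 upward, which shows how the choice of threshold trades in-scope accuracy against OOS recall.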

Metrics
We consider Accuracy and Matthews Correlation Coefficient (MCC) as overall performance metrics for the systems. We use OOS recall (Larson et al., 2019) to evaluate performance on OOS queries and in-scope accuracy to evaluate performance on in-scope queries. Figure 1 presents results for all systems, for both the Full and Subset variants of the datasets. The best Accuracy on all the datasets is in the low 70s. The best MCC varies from 0.4 to 0.6 across datasets, suggesting the systems are far from perfectly understanding natural language.
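The four metrics can be computed from gold and predicted labels as in the sketch below. It assumes that OOS is treated as one additional class for overall Accuracy and MCC, and it uses the standard multi-class generalisation of MCC; the paper does not state which MCC variant was used, so both points are assumptions. The function names are illustrative.

```python
from collections import Counter
from math import sqrt
from typing import List

OOS = "NO NODES DETECTED"  # label string as it appears in the text


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of all queries (in-scope and OOS) labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def in_scope_accuracy(y_true: List[str], y_pred: List[str], oos: str = OOS) -> float:
    """Accuracy restricted to queries whose gold label is in-scope (needs at least one)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != oos]
    return sum(t == p for t, p in pairs) / len(pairs)


def oos_recall(y_true: List[str], y_pred: List[str], oos: str = OOS) -> float:
    """Fraction of gold OOS queries that were predicted as OOS (needs at least one)."""
    preds = [p for t, p in zip(y_true, y_pred) if t == oos]
    return sum(p == oos for p in preds) / len(preds)


def mcc(y_true: List[str], y_pred: List[str]) -> float:
    """Multi-class Matthews Correlation Coefficient; returns 0.0 when undefined."""
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    numerator = correct * total - sum(true_counts[k] * pred_counts[k] for k in true_counts)
    denominator = sqrt(
        (total**2 - sum(v * v for v in pred_counts.values()))
        * (total**2 - sum(v * v for v in true_counts.values()))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical gold labels and predictions for four queries.
gold = ["A", "A", "B", OOS]
pred = ["A", "B", "B", "A"]
scores = {
    "accuracy": accuracy(gold, pred),
    "in_scope_accuracy": in_scope_accuracy(gold, pred),
    "oos_recall": oos_recall(gold, pred),
    "mcc": mcc(gold, pred),
}
```

Unlike Accuracy, MCC accounts for the sizes of all classes, which makes it the more informative of the two overall metrics on imbalanced test sets such as these.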

Results
In Table 3, we consider in-scope accuracy at a very low threshold of 0.1 to see what maximum in-scope accuracy current systems can achieve if false positives on OOS queries did not matter. Our results show that even with such a low threshold, the maximum in-scope accuracy that systems achieve on the Full training set is low, unlike the 90+ in-scope accuracies reported for these systems on other public datasets like CLINC150 (Larson et al., 2019). The in-scope accuracy is even worse for the Subset of the training data. Table 5 shows the percentage drop in in-scope accuracy on Subset data across all systems compared to in-scope accuracy on Full data. The drop varies from 0.6% to 22.3% across datasets and platforms. Ideally, this drop should be close to 0 across all datasets: if a system understands the meaning of the queries in its training data, its performance should not be affected by removing training queries that are semantically similar to ones already present.
Analyzing a few example queries that failed on all platforms (Table 4) suggests that these models aren't actually "understanding" language or capturing "meaning", and are instead capturing spurious patterns in the training data, as was also pointed out by (Bender and Koller, 2020). Predicting based on these spurious patterns, which models latch on to during training, leads to models having high confidence even on OOS queries. Figure 2 shows this behaviour on the SOFMattress Full dataset: a significant percentage of OOS queries have high confidence scores on all systems except LUIS, for which this comes at the cost of in-scope accuracy.
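As a small worked example of the drop reported in Table 5, the sketch below computes a relative percentage drop. The text does not say whether the drop is relative to the Full accuracy or an absolute difference in percentage points, so the relative form used here is an assumption, and the input values are hypothetical rather than taken from the paper.

```python
def percentage_drop(full_accuracy: float, subset_accuracy: float) -> float:
    """Relative drop (in percent) of Subset accuracy with respect to Full accuracy."""
    # Assumption: the drop is measured relative to the Full-data accuracy.
    return (full_accuracy - subset_accuracy) / full_accuracy * 100.0


# Hypothetical in-scope accuracies (percent) on Full and Subset training data.
drop = percentage_drop(full_accuracy=80.0, subset_accuracy=60.0)
```

Under this definition, a system whose Subset accuracy equals its Full accuracy has a drop of 0, which is the ideal case described above.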

Conclusion
This paper analyzed intent detection on 3 new datasets consisting of both in-scope and out-of-scope queries received on 3 live chatbots over a period of 15 days. Our findings indicate that there is a significant gap between performance on crowdsourced datasets and performance in a real-world setup. NLU systems don't seem to be actually "understanding" language or capturing "meaning". We believe our analysis and datasets will lead to the development of better, more robust dialogue systems.