Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection

Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances. Although recent works demonstrate that multi-level matching plays an important role in transferring learned knowledge from seen training classes to novel testing classes, they rely on a static similarity measure and overly fine-grained matching components. These limitations inhibit generalizing capability towards Generalized Few-shot Learning settings where both seen and novel classes are co-existent. In this paper, we propose a novel Semantic Matching and Aggregation Network where semantic components are distilled from utterances via multi-head self-attention with additional dynamic regularization constraints. These semantic components capture high-level information, resulting in more effective matching between instances. Our multi-perspective matching method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances. We also propose a more challenging evaluation setting that considers classification on the joint all-class label space. Extensive experimental results demonstrate the effectiveness of our method. Our code and data are publicly available.


Introduction
Intent Detection (ID) is a crucial task in natural language understanding, whose objective is to extract the underlying intents behind given utterances. The extracted intents provide context for downstream Natural Language Processing tasks such as dialogue state tracking or question answering. Unlike traditional text classification, ID is challenging for two main reasons: (1) utterances are usually short and diversely expressed, and (2) emerging intents occur continuously, especially across different domains (Liu et al., 2019a).
Despite recent advances, state-of-the-art ID methods (Haihong et al., 2019; Goo et al., 2018) require a large amount of annotated data to achieve competitive performance. This requirement inhibits models' capability to generalize to newly emerging intents with no or limited annotations during inference. Re-training or fine-tuning large models on a few samples of emerging classes can easily lead to overfitting.
Motivated by human capability in correctly categorizing new classes with only a few examples (Lake et al., 2011;Gidaris and Komodakis, 2018), few-shot learning (FSL) paradigms are adopted to tackle the scarcity problems of emerging classes. FSL methods take advantage of a small set of labeled examples (support set) to learn how to discriminate unlabeled samples (query samples) between classes, even those not seen during training.
Recent works in FSL (Gao et al., 2019; Ye and Ling, 2019) focus on learning the matching information between the labeled samples (support) and the unlabeled samples (query) to provide additional contextual information for instance-level representations, leading to effective prototype representations. However, these methods only extract similarity based on fine-grained word semantics, failing to capture the diverse expressions of users' utterances. This problem could further lead to overfitting either to seen intents or novel intents, especially in the challenging Generalized Few-shot Intent Detection (GFSID) setting (Xia et al., 2020) where both seen and novel intents are present in a joint label space during inference. Instead, matching support and query samples on coarser-grained semantic components could provide additional informative contexts beyond the word level. For instance, the two utterances "i need to get a table at a pub with southeastern cuisine" and "book a spot for six friends" share a similar intent label "Book Restaurant". While word-level semantics might find similar action words such as "get" and "book", these words do not necessarily contribute to identifying the correct intent. Instead, coarser-grained semantics such as "get a table" and "book a spot" could provide further hints to identify the "Book Restaurant" intent.
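As a toy illustration of this limitation (a purely lexical score, not the paper's model), surface word overlap between the two example utterances above is near zero even though they share the same intent:

```python
def jaccard(u1: str, u2: str) -> float:
    """Word-level Jaccard overlap between two utterances."""
    w1, w2 = set(u1.split()), set(u2.split())
    return len(w1 & w2) / len(w1 | w2)

q = "i need to get a table at a pub with southeastern cuisine"
s = "book a spot for six friends"

# Same intent ("Book Restaurant"), yet almost no shared surface words:
# only "a" overlaps, so any word-overlap-based signal is tiny.
print(jaccard(q, s))  # 1/16 = 0.0625
```

Phrase-level components such as "get a table" and "book a spot" must be matched in a learned semantic space, which is what the proposed semantic components aim to provide.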
As semantic components (SC) could be effectively extracted from multi-head self-attention, matching these SC between support and query can enhance both query and support representations, leading to improvements in generalization from seen training classes to unseen testing classes. To further enhance the dynamics of extracted SC across various domains and diversely expressed utterances, we introduce additional head regularizations. In addition, to overcome the insufficiency of a single similarity measure for matching sentences with diverse semantics, a more comprehensive matching method is further explored.
Our main contributions are summarized as follows: • We propose a Semantic Matching and Aggregation Network that automatically extracts multiple semantic components from support and query sentences via multi-head self-attention. Additional regularizations are introduced to (1) encourage extracted heads to attend to all words of utterances and (2) encourage semantic alignment between utterances with similar intent labels.
• Comprehensive multi-perspective matching is proposed to reduce reliance on a single fixed similarity measure and enhance generalizability towards the Generalized Few-shot Learning (GFSL) setting.
• We also propose a more challenging but realistic FSL and GFSL evaluation setting.

Related Work
Few-shot Learning Few-shot learning refers to problems where classifiers are required to generalize to unseen classes with only a few training examples per class. To overcome the challenge of potential overfitting, most FSL methods adopt a meta-learning approach where knowledge is extracted and transferred across multiple tasks. There are two major approaches to FSL: (1) the metric-based approach, whose goal is to learn a feature extractor that generalizes to emerging classes (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018), and (2) the optimization-based approach, which aims to optimize model parameters from few samples (Santoro et al., 2016; Finn et al., 2017; Ravi and Larochelle, 2017; Mishra et al., 2018). In this work, we focus mostly on the metric-based learning approach. Specifically, we extend Prototypical Network (PN) (Snell et al., 2017) in which prototypes are represented not only by support samples but also by matching information between support and query samples. Traditionally, FSL methods are evaluated via an episodic procedure due to the major principle that test and train conditions must match (Vinyals et al., 2016). Each episode represents a meta-learning task in which the models explicitly "learn to learn" to minimize the loss on an unlabeled (query) set given the labeled (support) set. However, we claim that this evaluation lacks practicality for two main reasons. First, evaluation on random samples cannot help us understand the strengths or weaknesses of the model. For instance, if the trained model overfits a subset of novel classes, it is impossible to pinpoint the overfitting classes with episodic evaluation. Secondly, in realistic applications, there is a need to categorize unlabeled samples into one of the novel/joint classes, rather than a set of sampled classes. Episodic testing does not provide an end-to-end systematic evaluation.
Therefore, in our work, we propose a more challenging but realistic non-episodic evaluation setting where unlabeled samples are inferred only once with a probability distribution over a fixed set of classes in the novel or joint label space.
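The difference between the two protocols can be sketched with a nearest-prototype classifier (toy vectors and intent names are illustrative; only the candidate label set changes between the two settings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(query, prototypes, candidate_labels):
    """Nearest-prototype prediction restricted to candidate_labels."""
    return max(candidate_labels, key=lambda y: cosine(query, prototypes[y]))

prototypes = {"PlayMusic": [1.0, 0.1], "BookRestaurant": [0.1, 1.0], "RateBook": [0.7, 0.7]}
query = [0.9, 0.2]

# Episodic: prediction over a sampled subset of classes (C-way).
episodic = classify(query, prototypes, ["PlayMusic", "BookRestaurant"])
# Non-episodic: a single prediction over the full (novel or joint) label space.
non_episodic = classify(query, prototypes, list(prototypes))
```

Under the non-episodic protocol, every unlabeled sample competes against all classes at once, which is why the reported accuracies in this setting are systematically lower.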
Sentence Matching Recent FSL works adopt multi-level matching and aggregation methods to improve FSL performance (Gao et al., 2019; Ye and Ling, 2019). Instead of constructing prototypes purely from support samples, these works integrate matching information between support and query samples on multiple levels. Gao et al. (2019) introduce feature-level and instance-level attention. A subsequent work introduces additional word-level attention and proposes a more advanced multi-cross attention at the instance level. On the other hand, Ye and Ling (2019) adopt soft matching between support and query samples to build local context representations for both support and query samples. These methods have proven effective in few-shot relation classification tasks. However, they rely on overly fine-grained matching, which potentially causes overfitting towards either the seen or unseen set of classes. Our work mainly differs in two aspects: (1) comprehensive multi-perspective matching for information matching, and (2) matching on coarser-grained semantic-component levels that are extracted dynamically for effective knowledge transfer, especially in GFSL settings.

Problem Formulation
In this section, we provide definitions for both the Few-shot Intent Detection (FSID) and GFSID tasks. The traditional FSL task is defined as a C-way K-shot classification task in which the classifier performs a series of tasks during both training and inference, each involving C randomly chosen classes with only K labeled samples from each class (K ≤ 5). These C · K samples are called support samples. This series of tasks is repeated via episodes (Vinyals et al., 2016). In each episode, the objective is to correctly classify unlabeled samples (query samples) by using only the support samples.
We denote seen label space as Y s , novel label space as Y n , and Y s ∩ Y n = ∅. Given the seen labels (Y s ), we define D s = {(x 1 , y 1 ), (x 2 , y 2 ), ...(x Ns , y Ns )}, where N s denotes the total number of seen samples and (x, y) denotes a pair of utterance and intent label. Similarly, D n = {(x 1 , y 1 ), (x 2 , y 2 ), ...(x Nn , y Nn )}.
Given an unlabeled utterance x, the objective of FSID is to maximize correct prediction for x within the novel label subspace Y n as summarized in (1).

ŷ = argmax_{y ∈ Y n} p(y | x)    (1)
For GFSID, there exists an additional joint label space Y j = Y s ∪ Y n . Unlike FSID, GFSID is more challenging as the test samples could come from either the seen or novel sample space. The objective is modified as in (2).

ŷ = argmax_{y ∈ Y j} p(y | x)    (2)

Methodology
In this section, we introduce our proposed architecture. Specifically, we divide the framework into three main components: Semantic Encoder, Semantic Matching & Aggregation, and Instance Aggregation & Class Matching, as illustrated in Figure 1.

Semantic Encoder
The objective of the Semantic Encoder (SE) is to extract semantic components from the given support or query instances. Given an input support or query instance x = [x 1 , x 2 , ..., x T ] with T words, SE first maps each word into a d w dimensional word embedding. Pre-trained embeddings such as GloVe (Pennington et al., 2014), or even contextualized embeddings such as BERT (Devlin et al., 2018), could be leveraged. In our work, we adopt pre-trained FastText embeddings (Bojanowski et al., 2017).

Figure 1: Semantic components extracted from the Semantic Encoder capture high-level semantics beyond the word level. Matching these components is more effective than word-by-word matching, as contextual phrases are taken into consideration and non-essential words do not distract the matching functions.
To capture semantic and syntactic information of the given instance, we adopt a self-attentive semantic encoder inspired by the multi-head self-attention in (Lin et al., 2017). Specifically, we first use a Bi-Directional Long Short-Term Memory (Bi-LSTM) network to capture contextual information between words within a sentence.
The hidden representation of x (denoted as H ∈ R T ×2d h ) is a concatenation of both forward and backward hidden states where d h is the hidden size.
To capture more fine-grained signals beyond a single sentence vector representation, a self-attention mechanism is adopted to extract important semantic components of the sentence. Each semantic component, denoted as a "head", is learned from the hidden state H via multi-layer perceptrons (MLP):

A = softmax(W s2 tanh(W s1 H^T))    (5)

where W s1 and W s2 are learnable weights with dimensions R da×2d h and R r×da respectively. d a and r can be seen as the hidden size and output size of the embedded feed-forward network; r represents the number of heads, or important features, that the network extracts from the given sentence. The r-head representation M ∈ R r×2d h is the product of the attention matrix and the obtained hidden states: M = AH.
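This extraction step can be sketched in pure Python as follows (toy, hand-written stand-ins for the learned weights W_s1 and W_s2; a real implementation would train these in a deep-learning framework):

```python
import math

def matmul(A, B):
    """Naive matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(A):
    """Row-wise softmax, numerically stabilized."""
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def semantic_heads(H, W_s1, W_s2):
    """A = softmax(W_s2 tanh(W_s1 H^T)); M = A H.

    H:    T x 2d_h hidden states from the Bi-LSTM
    W_s1: d_a x 2d_h, W_s2: r x d_a
    Returns attention A (r x T, rows sum to 1) and heads M (r x 2d_h).
    """
    Ht = [list(col) for col in zip(*H)]                               # 2d_h x T
    inner = [[math.tanh(v) for v in row] for row in matmul(W_s1, Ht)]  # d_a x T
    A = softmax_rows(matmul(W_s2, inner))                             # r x T
    M = matmul(A, H)                                                  # r x 2d_h
    return A, M

# T=3 words, 2d_h=2, d_a=2, r=2 toy example
H = [[0.1, 0.2], [0.4, 0.3], [0.2, 0.9]]
W_s1 = [[1.0, 0.0], [0.0, 1.0]]
W_s2 = [[2.0, 0.0], [0.0, 2.0]]
A, M = semantic_heads(H, W_s1, W_s2)
```

Each of the r rows of A is a distribution over the T words, and the corresponding row of M is the attention-weighted mixture of Bi-LSTM states, i.e. one semantic component.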
Additional regularization terms are introduced to enforce that (1) each head focuses on different aspects of a sentence, (2) all words in an utterance are covered by the extracted heads, and (3) head distributions between query and support samples with the same intent label are similar to one another. These regularization terms are optimized together with the query classification loss (L class ) to further improve the model's performance. In summary, our training loss is:

L = L class + α L self-attn + β L uniform + γ L discr    (6)

where α, β, γ are hyperparameters.
Self-attention regularization An additional regularization term is needed to enforce that each attention head focuses on a different semantic component of the utterance. The most intuitive approach is to minimize the number of "attended" tokens for each head, forcing each head vector to attend to a single aspect of the given sentence (Lin et al., 2017):

L self-attn = ||AA^T − I||²_F    (7)

where A denotes the attention matrix obtained from SE and ||·||²_F denotes the squared Frobenius norm.
Head uniform regularization To ensure that all words of a given utterance are covered by at least one head obtained by multi-head self-attention, we minimize the Kullback-Leibler (KL) divergence between the word probability distribution aggregated over all heads ( Σ r i=1 A i ) and a uniform distribution U:

L uniform = D KL ( ā || U ), where ā ∝ Σ r i=1 A i    (8)
Head uniform regularization is introduced to increase the robustness and dynamics of the extraction behavior by covering even rare words that are not widely used in utterances.
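The two penalties can be sketched numerically as follows (hand-written toy attention matrices; the KL term follows the description above, not any released code). Note the tension between them: perfectly sharp, disjoint heads minimize the self-attention penalty but hurt word coverage, while fully uniform heads do the opposite.

```python
import math

def selfattn_penalty(A):
    """||A A^T - I||_F^2: discourages heads from attending to the same words."""
    r = len(A)
    total = 0.0
    for i in range(r):
        for j in range(r):
            dot = sum(a * b for a, b in zip(A[i], A[j]))
            total += (dot - (1.0 if i == j else 0.0)) ** 2
    return total

def kl_uniform(A):
    """KL(mean head distribution || uniform): all words should be covered."""
    T = len(A[0])
    mean = [sum(col) / len(A) for col in zip(*A)]
    # 0 * log 0 is taken as 0, per the usual KL convention.
    return sum(p * math.log(p * T) for p in mean if p > 0)

# Two orthogonal one-hot heads over T=4 words: zero self-attention penalty,
# but poor word coverage (positive KL from uniform).
sharp = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Two fully uniform heads: perfect coverage (KL = 0), but overlapping heads.
flat = [[0.25] * 4, [0.25] * 4]

assert abs(selfattn_penalty(sharp)) < 1e-9   # heads are disjoint
assert abs(kl_uniform(flat)) < 1e-9          # coverage is uniform
```

Balancing both terms (via α and β in the total loss) pushes the heads to be distinct yet collectively cover the whole utterance.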
Head distribution regularization To encourage semantic alignment between support and query samples of the same intent, we minimize the KL divergence between the head distributions of samples with the same intent while maximizing it between those with different intents.
L Q and L S denote the lengths of the query and support sentences respectively. Ŷ Q and Y S denote the predicted query label and the ground-truth support label respectively. This regularization allows for dynamic multi-head self-attention extraction behavior by incorporating the query label predicted by the downstream task into the objective function.

Semantic Matching & Aggregation
In order to enrich representations for both support and query instances, given SCs extracted from Semantic Encoder, we introduce Semantic Matching & Aggregation module to capture and aggregate matching local contexts between support and query via SCs. Specifically, our module is made up of two components: (1) Multi-perspective Semantic Matching and (2) Semantic Aggregation.
The extracted head representations from SE (matrix M) for both support and query samples are used in this module. We denote the representation of the k-th support sample as S k = [M 1 s k , M 2 s k , ..., M r s k ] and the query sample as Q = [M 1 q , M 2 q , ..., M r q ], where r denotes the number of heads extracted by SE. This module is applied to both support and query samples to build enhanced instance representations Ŝ k and Q̂. For simplicity, we only define the one-way matching (S k → Q).

Multi-perspective Semantic Matching
Following Wang et al. (2017), we define the multi-perspective matching function f m between two vectors as m = f m (v 1 , v 2 ; W), where W ∈ R l×d is a trainable weight parameter and l is a hyperparameter defining the number of perspectives. Each entry of vector m is the cosine similarity between the two input vectors after re-weighting by the corresponding row of W. We define four components of the multi-perspective matching method as follows.
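A minimal sketch of f m (toy weights; each of the l rows of W element-wise re-weights both vectors before a cosine comparison):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def multi_perspective(v1, v2, W):
    """m = f_m(v1, v2; W): one cosine score per perspective (row of W)."""
    return [cosine([w * a for w, a in zip(Wk, v1)],
                   [w * b for w, b in zip(Wk, v2)]) for Wk in W]

# l=3 perspectives over d=2 dimensional head vectors (toy weights)
W = [[1.0, 1.0], [2.0, 0.5], [0.1, 3.0]]
m = multi_perspective([1.0, 0.0], [1.0, 1.0], W)
```

Because each perspective stretches the feature space differently before comparing, the l-dimensional output m is strictly more expressive than a single cosine score.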
Head-wise Matching Each head's forward and backward contextualized embedding of S k is compared with the corresponding head's forward and backward contextualized embedding of Q.
Max-pooling Matching Each head's forward and backward contextualized embedding of S k is compared with all heads' forward and backward contextualized embeddings of Q. However, only the maximum value in each dimension is retained in the matching vector.
Attentive Matching Unlike Max-Pooling Matching, Attentive Matching proceeds in two steps: (1) a head representative is aggregated via similarity scores between the heads of each support and query sample, and (2) the head representative is matched against the support heads. Cosine similarity is used as the similarity measure.
Head representative is defined as a weighted sum of all query heads.
The computed head representative is compared with each head's contextualized embedding of S k .
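Attentive Matching can be sketched as follows (a plain cosine stands in for the multi-perspective function f m for brevity; all names are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def attentive_match(support_heads, query_heads):
    """For each support head: build a query 'head representative' as a
    similarity-weighted sum of all query heads, then compare the two."""
    scores = []
    for s in support_heads:
        sims = [cosine(s, q) for q in query_heads]
        z = sum(sims)
        weights = [w / z for w in sims]
        rep = [sum(w * q[d] for w, q in zip(weights, query_heads))
               for d in range(len(s))]
        scores.append(cosine(s, rep))
    return scores

S = [[1.0, 0.0], [0.0, 1.0]]   # r=2 support heads
Q = [[0.9, 0.1], [0.2, 0.8]]   # r=2 query heads
scores = attentive_match(S, Q)
```

Max-Attentive Matching differs only in the aggregation step: instead of a weighted sum, the single most similar query head is used as the representative.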
Max-Attentive Matching Similar to Attentive Matching, Max-Attentive Matching extracts a head representative as in Equation (13), but instead of a weighted sum, it retains only the query head with the highest cosine similarity as the representative.

Semantic Aggregation
In order to aggregate the matched representations into a single instance representation, we use another Bi-LSTM whose input is the concatenation of the matched representations from the previous section.
Similarly, we obtain the final representation of the query with reverse matching (Q → S k ).

Instance Aggregation & Class Matching
As indicated in previous works, when a class label covers diverse semantics, each support instance contributes differently to the class prototype given the query instance. Therefore, we replace the mean operation over all support instances in PN with attentive aggregation. The attention weight for each support instance Ŝ k is learned via an MLP.
Support prototype (Ŝ) is computed as a weighted sum aggregation via support attention weight and each k-th support instance representation.
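A sketch of this attentive aggregation (a plain dot-product score stands in for the paper's MLP scorer, so the weighting function here is hypothetical):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attentive_prototype(support_reps, query_rep):
    """Weight each support instance by its relevance to the query,
    then take the weighted sum as the class prototype."""
    scores = [sum(s_d * q_d for s_d, q_d in zip(s, query_rep)) for s in support_reps]
    weights = softmax(scores)
    dim = len(support_reps[0])
    return [sum(w * s[d] for w, s in zip(weights, support_reps)) for d in range(dim)]

support = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # K=3 enhanced support instances
proto = attentive_prototype(support, query_rep=[1.0, 0.0])
```

Unlike PN's plain mean, the resulting prototype leans toward the support instances most relevant to the current query, which matters when a single intent covers diverse utterances.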
Another MLP is used as the class matching function, taking the support prototype and the query representation as input. Weights W 9 ∈ R d h and W 10 ∈ R d h ×4d h are shared between instance aggregation (Equation (17)) and class matching (Equation (19)) for optimal performance (Ye and Ling, 2019).

Dataset
We evaluate our proposed model on two real-world datasets for the GFSID task: SNIPS-NLU (SNIPS) and the NLU-Evaluation Dataset (NLUE). Both datasets are widely used as benchmarks for Natural Language Understanding tasks. Statistics of both datasets are summarized in Table 1.
For each dataset, we define Seen-Novel-Joint splits. To build the joint dataset (D j ), we aggregate 20% of the seen intent utterances with the novel intent utterances. The remaining seen intent utterances (80%) are used as training data (reported as N s in Table 1). The support samples (1 or 5 shots) are randomly sampled in advance and not counted in N s , N n or N j . SNIPS-NLU: Following (Xia et al., 2018), we select two intents (RateBook and AddToPlaylist) as novel/emerging intents and the other five intents as seen intents. NLUE: Following (Liu et al., 2019b), we utilize a subset of utterances covering 64 intents. We randomly choose 16 intents as novel intents while the remaining 48 intents are considered seen.
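The split described above can be sketched as follows (`build_splits` is a hypothetical helper, not from the paper's released code; 20% of each seen intent's utterances move to the joint evaluation set):

```python
import random

def build_splits(seen, novel, joint_ratio=0.2, seed=0):
    """seen/novel: dict mapping intent -> list of utterances.
    Returns (train = 80% of seen, joint = 20% of seen + all novel)."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    train, joint = {}, {}
    for intent, utts in seen.items():
        utts = utts[:]
        rng.shuffle(utts)
        cut = int(len(utts) * joint_ratio)
        joint[intent] = utts[:cut]
        train[intent] = utts[cut:]
    for intent, utts in novel.items():
        joint[intent] = utts[:]
    return train, joint

seen = {"PlayMusic": [f"u{i}" for i in range(10)]}
novel = {"RateBook": ["rate this novel a 3"]}
train, joint = build_splits(seen, novel)
```

The joint set thus contains both seen and novel intents, which is what makes the GFSID evaluation harder than plain FSID.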

Baselines
We compare our model with several traditional FSL models, specifically metric-based network models. For fair comparison and consistency, we use our SE proposed in Section 4.1 for all considered baselines, where the final instance embedding is obtained as a mean over all heads. The only exceptions are HAPN and MLMAN, as they require local matching (i.e., word matching) modules. In that case, we use the output of the Bi-LSTM (in Equation (4)) and enhance it with the head regularization terms (Section 4.1) during training.
• Relation Network (RN) (Sung et al., 2018): a few-shot model that uses a neural network to learn a deep metric known as relation scores.
• Hybrid Attention-based Prototypical Network (HATT) (Gao et al., 2019): an early few-shot learning model that integrates feature-level attention and instance-level attention between support and query samples.
• Hierarchical Prototypical Network (HAPN): a few-shot learning paradigm that extracts similarity at the feature, word and instance levels.
• Multi-level Matching and Aggregation Network (MLMAN) (Ye and Ling, 2019): a multi-level matching approach exploiting both fusion and dot-product similarity at the local/word level to enhance instance representations.

Implementation Details
We use 3-fold cross-validation to tune all of the hyperparameters based on S-J accuracy on SNIPS and Fold 1 of the NLUE dataset, as summarized in Table 2. Pre-trained FastText word embeddings are used to initialize the word embeddings and stay fixed during both training and testing for fair comparison between our proposed model and the baselines. We train each model over 1000 randomly sampled episodes with a learning rate of 0.0001. The number of query samples (N Q ) for each episode is 20.
Following (Shi et al., 2019), we evaluate our models on overall Seen-Joint (S-J) and Seen-Novel (S-N) accuracy. The reported S-J accuracy denotes the GFSID evaluation result while S-N indicates the traditional FSID result. The reported h-accuracy is the harmonic mean of S-J and S-N accuracy, used to evaluate the stability of the overall model in both GFSID and FSID settings.
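The h-accuracy is the usual harmonic mean; a quick sketch shows why it rewards models that balance the two settings rather than trading one off against the other:

```python
def h_accuracy(sj: float, sn: float) -> float:
    """Harmonic mean of Seen-Joint and Seen-Novel accuracy."""
    return 2 * sj * sn / (sj + sn) if (sj + sn) else 0.0

# A balanced model scores higher than a lopsided one with the same average:
balanced = h_accuracy(0.70, 0.70)   # 0.70
lopsided = h_accuracy(0.95, 0.45)   # ~0.61
```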
Episodic Evaluation Traditional FSL methods are evaluated in episodes due to the major principle that test and train conditions (C-way K-shot) must match (Vinyals et al., 2016). On the SNIPS dataset, we conduct experiments with K = {1, 5} and C = 2 over 5 random seed initializations and report average accuracy in Table 3. For the NLUE dataset, we average accuracy over 10 folds with the same K and C = 5. The sampling procedure for GFSL follows (Shi et al., 2019).

Non-episodic Evaluation
As mentioned in Section 2, episodic evaluation lacks practicality and does not provide an end-to-end system evaluation. Therefore, we also evaluate the models with our proposed non-episodic procedure, where unlabeled samples are inferred only once and the predicted probability distribution spans the entire Y n or Y j label space.

Experimental Results
As we observe from Tables 3 and 4, our proposed model outperforms the previous baselines by a large margin in both episodic and non-episodic evaluations on both datasets. Our model also shows consistent stability between the FSID and GFSID tasks across both datasets. All of the models suffer a major decrease in accuracy when evaluated with our challenging non-episodic evaluation as compared to the traditional episodic procedure. Specifically, GFSID tasks are most affected by non-episodic evaluation (around a 10% S-J accuracy drop on both datasets). On the SNIPS dataset, since both non-episodic and episodic evaluations on S-N are conducted as 2-way 1-shot or 2-way 5-shot, the reported accuracies are almost identical. On the NLUE dataset, however, as C and |Y n | or |Y j | differ (5 vs 16 or 64), we observe significant differences in reported S-N accuracy across all models.
On the NLUE dataset, S-N accuracy is consistently lower than S-J accuracy across all models. This is mainly because the hyperparameter N Q is higher than the average number of samples per novel class on NLUE (20 > 17.1), affecting training and evaluation on D n .

Ablation Study
Multi-perspective Matching To evaluate the effectiveness of our Semantic Matching Module, we conduct further studies on individual components of our head matching. Table 5 shows that using only a single matching function is not sufficient to capture matching information between query and support samples. By aggregating all four matching methods, we observe a consistent improvement in both FSL and GFSL evaluations.
Head Matching vs Word Matching As introduced in Section 4, each head aims to extract a SC that covers a different aspect of a given sentence. To evaluate the effectiveness of head matching, we compare it with its corresponding word matching. In word matching, the hidden state embedding (h i ) from the Bi-LSTM is used for comparison rather than the head representation (M i ). In addition, instead of head-wise matching, we compare each word's forward and backward embedding of sentence S k with the last (forward) and first (backward) embedding of sentence Q, where T q denotes the last word in sentence Q.

Figure 2: Word-level matching. The Y-axis denotes words of a sample query utterance "i think the chronicle entitled the spirit of st louis should be given a zero rating". The X-axis (left) denotes words of the negative support utterance "book a table at t-rex distant from halsey st" and the X-axis (right) denotes the positive support "rate this novel a 3". The label for the query and positive support is "Rate Book"; the negative support's label is "Book Restaurant". Lighter color implies a higher attention score.

Figure 2 illustrates an example where overly fine-grained matching sends the wrong matching signal, causing mis-classification of a query sample. Although "st" exists in both the query and the negative support sample, it carries different meanings depending on context ("street" vs "saint") and does not contribute to the correct intent "Rate Book". However, word matching assigns it a high matching score, leading to mis-classification of the query sample as the "Book Restaurant" intent. As shown in the right part of Figure 2, word matching also fails to identify indicative matching information with the positive support sample (i.e., "rate" vs "rating"). This observation indicates that matching on overly fine-grained word-level semantics can lead to overfitting, as only query samples with high word overlap with support samples yield high matching scores. As utterances are diversely expressed, word-level semantics are insufficient to capture the similarity between different utterances of the same intent.

Figure 3: Head-level matching between the same query and positive support utterance. The Y-axis denotes 3 heads extracted from the query utterance, labeled with the word distribution of each head. The X-axis denotes 3 heads extracted from the positive support utterance, labeled in the same way. Different curve colors denote head indexes. A lighter cell color in the 3x3 matrix denotes a higher attention score.

On the other hand, when we use the extracted heads for matching, as observed from Figure 3, the importance of "st" is significantly downplayed. Instead, the query heads focus on extracting different aspects of the query: the verbs "should", "be" (head 1), the target object "chronicle" (head 2), and rating-related information "rating" (head 3). These key components are also captured in the positive support: the target object ("novel") and the rating keyword ("rate"). As clearly indicated in Figure 3, the blue head of the query and positive support sample, which both extract important rating-related keywords ("rating" vs "rate"), achieves a high matching score.
This observation confirms our intuitions that (1) each SC extracts essential high-level semantics of a given utterance, and (2) even without sharing word-level similarity, essential keywords for the intent label of query samples are extracted and matched with those from support samples (i.e., "rating" vs "rate") via the intermediate semantic-component level. Further quantitative results in Table 6 validate the effectiveness of head-vs-head matching, as it outperforms its word matching counterpart in all evaluation scenarios. This is mainly because the semantic components extracted from SE effectively capture the most important words in the given utterances, reducing the necessity of matching irrelevant words.

Head Matching Regularization As observed from Table 6, each additional regularization term boosts both GFSL and FSL performance. The head distribution regularization contributes most to the overall performance improvement, mainly due to its ability to align the head distributions of samples with the same class label. Therefore, each extracted head can focus more on an indicative signal of the intent label.

Conclusions
In this paper, we propose an effective Semantic Matching and Aggregation Network for few-shot intent detection. Semantic components extracted via multi-head self-attention capture higher-level contextual information beyond the word level, enhancing the model's generalizability towards both seen and novel intents, especially when utterances are diversely expressed. Our comprehensive multi-perspective matching method thoroughly exploits the similarity between query and support samples for further robust representations. We also propose a more challenging but realistic non-episodic evaluation for both FSL and GFSL beyond the traditional setting. Our model achieves state-of-the-art performance in both evaluation settings on the SNIPS and NLUE benchmark datasets. Further studies of more dynamic semantic extraction and effectively synthesized matching techniques are left as future work.