Composed Variational Natural Language Generation for Few-shot Intents

In this paper, we focus on generating training examples for few-shot intents in the realistic imbalanced scenario. To build connections between existing many-shot intents and few-shot intents, we consider an intent as a combination of a domain and an action, and propose a composed variational natural language generator (CLANG), a transformer-based conditional variational autoencoder. CLANG utilizes two latent variables to represent the utterances corresponding to two different independent parts (domain and action) of the intent, and the latent variables are composed together to generate natural examples. Additionally, to improve generator learning, we adopt a contrastive regularization loss that contrasts in-class with out-of-class utterance generation given the intent. To evaluate the quality of the generated utterances, experiments are conducted on the generalized few-shot intent detection task. Empirical results show that our proposed model achieves state-of-the-art performance on two real-world intent detection datasets.


Introduction
Intelligent assistants have gained great popularity in recent years since they provide a new way for people to interact with the Internet conversationally (Hoy, 2018). However, it is still challenging to answer people's diverse questions effectively. Among all the challenges, identifying user intentions from their spoken language is essential for all the downstream tasks.
Most existing works (Hu et al., 2009; Xu and Sarikaya, 2013; Xia et al., 2018) formulate intent detection as a classification task and achieve high performance on pre-defined intents with sufficient labeled examples. In this ever-changing world, a realistic scenario is that we have imbalanced training data with existing many-shot intents and insufficient few-shot intents. Previous intent detection models (Yin, 2020; Yin et al., 2019) deteriorate drastically in discriminating the few-shot intents. (* Work was done when Congying was a research intern at Salesforce Research.)
To alleviate this scarce annotation problem, several methods (Wei and Zou, 2019; Malandrakis et al., 2019; Yoo et al., 2019) have been proposed to augment the training data for low-resource spoken language understanding (SLU). Wei and Zou (2019) introduce simple data augmentation rules for language transformation like insert, delete, and swap. Malandrakis et al. (2019) and Yoo et al. (2019) utilize variational autoencoders (Kingma and Welling, 2013) built with simple LSTMs (Hochreiter and Schmidhuber, 1997), whose limited model capacity constrains text generation. Furthermore, these models are not specifically designed to transfer knowledge from existing many-shot intents to few-shot intents.
In this paper, we focus on transferable natural language generation by learning how to compose utterances with many-shot intents and transferring to few-shot intents. When users interact with intelligent assistants, their goal is to query some information or execute a command in a certain domain (Watson Assistant, 2017). For instance, the intent of the input "what will be the highest temperature next week" is to ask about the weather. The utterance can be decomposed into two parts: "what will be", corresponding to an action "Query", and "the highest temperature", related to the domain "Weather". These actions or domains are very likely to be shared among different intents, including the few-shot ones (Xu et al., 2019). For example, many actions ("query", "set", "remove") can be combined with the domain "alarm", and the action "query" also exists in multiple domains like "weather", "calendar", and "movie". Ideally, if we can learn the expressions representing a certain action or domain and how they compose an utterance for existing intents, then we can learn how to compose utterances for few-shot intents naturally. Therefore, we define an intent as a combination of a domain and an action. Formally, we denote the domain as y_d and the action as y_a. Each intent can be expressed as y = (y_d, y_a).
A composed variational natural language generator (CLANG) is proposed to learn how to compose an utterance for a given intent with an action and a domain. CLANG is a transformer-based (Vaswani et al., 2017) conditional variational autoencoder (CVAE). It contains a bi-latent variational encoder and a decoder. The bi-latent variational encoder utilizes two independent latent variables to model the distributions of action and domain separately. Special attention masks are designed to guide these two latent variables to focus on different parts of the utterance and disentangle the semantics for action and domain. Through decomposing utterances for existing many-shot intents, the model learns to generate utterances for few-shot intents as a composition of the learned expressions for domain and action.
Additionally, we adopt the contrastive regularization loss to improve our generator learning. During the training, an in-class utterance from one intent is contrasted with an out-of-class utterance from another intent. Specifically, the contrastive loss is to constrain the model to generate the positive example with a higher probability than the negative example with a certain margin. With the contrastive loss, the model is regularized to focus on the given domain and intent and the probability of generating negative examples is reduced.
To quantitatively evaluate the effectiveness of CLANG for augmenting training data in low-resource intent detection, experiments are conducted for the generalized few-shot intent detection task (GFSID) (Xia et al., 2020). GFSID aims to discriminate a joint label space consisting of both existing many-shot intents and few-shot intents.
Our contributions are summarized below. 1) We define an intent as a combination of a domain and an action to build connections between existing many-shot intents and few-shot intents. 2) A composed variational natural language generator (CLANG) is proposed to learn how to compose an utterance for a given intent with an action and a domain. Utterances are generated for few-shot intents via a composed variational inference process. 3) Experiment results show that CLANG achieves state-of-the-art performance on two real-world intent detection datasets for the GFSID task.

Composed Variational Natural Language Generator
In this section, we introduce the composed variational natural language generator (CLANG). As illustrated in Figure 1, CLANG consists of three parts: input representation, bi-latent variational encoder, and decoder.

Input Representation
For a given intent y decomposed into a domain y_d and an action y_a, and an utterance x = (w_1, w_2, ..., w_n) with n tokens, we design the input format, following BERT, as ([CLS], y_d, y_a, [SEP], w_1, w_2, ..., w_n, [SEP]). In the example in Figure 1, the intent has the domain "weather" and the action "query". The utterance is "what will be the highest temperature next week". The input is represented as ([CLS], weather, query, [SEP], what, will, be, the, highest, temperature, next, week, [SEP]). Texts are tokenized into subword units by WordPiece (Wu et al., 2016). The input embeddings of a token sequence are represented as the sum of three embeddings: token embeddings, position embeddings (Vaswani et al., 2017), and segment embeddings (Devlin et al., 2018). The segment embeddings are learned to identify the intent and the utterance with different embeddings.
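As a concrete illustration, the input construction above can be sketched as follows (a hypothetical helper, not the authors' code; a real pipeline would additionally apply WordPiece tokenization and map tokens to vocabulary ids):

```python
def build_input(domain, action, utterance_tokens):
    """Build the BERT-style input sequence and segment ids for CLANG.

    Segment 0 marks the intent part (through the first [SEP]);
    segment 1 marks the utterance part.
    """
    tokens = ["[CLS]", domain, action, "[SEP]"] + utterance_tokens + ["[SEP]"]
    segments = [0] * 4 + [1] * (len(utterance_tokens) + 1)
    return tokens, segments

tokens, segments = build_input(
    "weather", "query",
    ["what", "will", "be", "the", "highest", "temperature", "next", "week"])
```

Position embeddings would then be indexed by each token's offset in `tokens`, and the three embeddings summed per position.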

Bi-latent Variational Encoder
As illustrated in Figure 1, the bi-latent variational encoder encodes the input into two latent variables that contain the disentangled semantics in the utterance corresponding to the domain and the action separately.
Multiple transformer layers (Vaswani et al., 2017) are utilized in the encoder. Through the self-attention mechanism, these transformer layers not only extract semantically meaningful representations for the tokens, but also model the relation between the intent and the utterance. The embeddings for the domain token and the action token output from the last transformer layer are denoted as e_d and e_a. We encode e_d into variable z_d to model the distribution for the domain, and e_a into variable z_a to model the distribution for the action.
Ideally, we want to disentangle the information for the domain and the action, making e_d attend to tokens related to the domain and e_a focus on the expressions representing the action. To achieve that, we modify the attention calculations in the transformer layers to avoid direct interaction between the domain token and the action token in each layer.
Instead of applying full bidirectional attention to the input, an attention mask matrix $M \in \mathbb{R}^{N \times N}$ is added to determine whether a pair of tokens can attend to each other (Dong et al., 2019), where N is the length of the input. For the l-th transformer layer, the output of a self-attention head $A_l$ is computed via:

$$A_l = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,$$

where the attention mask matrix is calculated as:

$$M_{ij} = \begin{cases} 0, & \text{token } i \text{ is allowed to attend to token } j, \\ -\infty, & \text{otherwise.} \end{cases}$$

The output of the previous transformer layer $T_{l-1} \in \mathbb{R}^{N \times d_h}$ is linearly projected to a triple of queries, keys, and values, $Q = T_{l-1}W_Q$, $K = T_{l-1}W_K$, $V = T_{l-1}W_V$, parameterized by matrices $W_Q, W_K, W_V \in \mathbb{R}^{d_h \times d_k}$; $d_h$ is the hidden dimension for the transformer layer, and $d_k$ is the hidden dimension for a self-attention head.
The proposed attention mask for the domain token and the action token is illustrated in Figure 2. The domain y_d and the action y_a are prevented from attending to each other. All the other tokens are allowed full attention. The elements in the mask matrix for the attention between domain and action are −∞, and 0 for all the others.
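This mask can be sketched numerically as follows (a minimal illustration assuming the domain and action tokens sit at positions 1 and 2, right after [CLS]; the helper name is ours):

```python
import numpy as np

def encoder_attention_mask(n_tokens, domain_idx=1, action_idx=2):
    """Build the additive attention mask for CLANG's encoder.

    0 = attention allowed; -inf = blocked. The matrix is added to the
    scaled QK^T scores before the softmax, so a -inf entry drives the
    corresponding attention weight to zero.
    """
    M = np.zeros((n_tokens, n_tokens))
    # only the domain <-> action pair is blocked, in both directions
    M[domain_idx, action_idx] = -np.inf
    M[action_idx, domain_idx] = -np.inf
    return M
```

All other token pairs, including each token with itself, keep full bidirectional attention.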
The disentangled embeddings e_d and e_a are encoded into two latent variables z_d and z_a to model the posterior distributions determined by the intent elements separately: p(z_d|x, y_d) and p(z_a|x, y_a). The latent variable z_d is conditioned on the domain y_d, while z_a is controlled by the action y_a. To approximate the true posteriors with a known distribution that is easy to sample from (Kingma et al., 2014), we constrain the prior distributions, p(z_d|y_d) and p(z_a|y_a), to be multivariate standard Gaussian distributions. A reparametrization trick (Kingma and Welling, 2013) is used to generate the latent vectors z_d and z_a separately. Gaussian parameters (μ_d, μ_a, σ_d², σ_a²) are projected from e_d and e_a:

$$\mu_d = W_{\mu_d} e_d + b_{\mu_d}, \qquad \log \sigma_d^2 = W_{\sigma_d} e_d + b_{\sigma_d},$$
$$\mu_a = W_{\mu_a} e_a + b_{\mu_a}, \qquad \log \sigma_a^2 = W_{\sigma_a} e_a + b_{\sigma_a}.$$

Noisy variables ε_d ∼ N(0, I) and ε_a ∼ N(0, I) are utilized to sample z_d and z_a from the learned distributions:

$$z_d = \mu_d + \sigma_d \odot \varepsilon_d, \qquad z_a = \mu_a + \sigma_a \odot \varepsilon_a.$$

The KL loss regularizes the posterior distributions of the two latent variables to be close to the Gaussian priors:

$$\mathcal{L}_{KL} = \mathrm{KL}\big(p(z_d|x, y_d)\,\|\,p(z_d|y_d)\big) + \mathrm{KL}\big(p(z_a|x, y_a)\,\|\,p(z_a|y_a)\big).$$

A fully connected layer with GELU (Hendrycks and Gimpel, 2016) activation is applied to z_d and z_a to compose these two latent variables together and output z. The composed latent information z is utilized in the decoder for generation.
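The reparametrization trick and the KL regularizer can be sketched in NumPy (illustrative helper names, not the authors' code; in the model, μ and log σ² come from learned linear projections of e_d and e_a):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.

    This is the closed-form KL term used once per latent variable
    (z_d and z_a), then summed to form the total KL loss.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

Sampling through `reparameterize` keeps the draw differentiable with respect to μ and log σ², which is what allows the encoder to be trained by backpropagation.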

Decoder
The decoder utilizes the composed latent information together with the intent to reconstruct the input utterance p(x|z d , z a , y d , y a ). As shown in Figure 1, a residual connection is built from the input representation to the decoder to get the embeddings for all the tokens. To keep a fixed length and introduce the composed latent information z into the decoder, we replace the first [CLS] token with z.
The decoder is built with multiple transformer layers to generate the utterance. Text generation is a sequential process in which we use the left context to predict the next token. To simulate the left-to-right generation process, another attention mask is utilized for the decoder. In this attention mask, tokens in the intent can only attend to intent tokens, while tokens in the utterance can attend to both the intent and all the tokens to their left in the utterance.
The first token z, which holds the composed latent information, is only allowed to attend to itself, in order to mitigate the vanishing latent variable problem. The latent information can be overwhelmed by the information of other tokens when adapting VAEs to natural language generators, whether built on LSTMs (Zhao et al., 2017) or transformers (Xia et al., 2020). To further increase the impact of the composed latent information z and alleviate the vanishing latent variable problem, we concatenate the token representation of z to all the other token embeddings output from the last transformer layer in the decoder.
The hidden dimension increases to 2 × d_h after the concatenation. To reduce the hidden dimension to d_h and obtain the embeddings used to decode the vocabulary, two fully-connected (FC) layers followed by a layer normalization (Ba et al., 2016) are applied on top of the transformer layers. GELU is used as the activation function in these two FC layers. The embeddings output from these two FC layers are decoded into tokens in the vocabulary. The embedding at position i ∈ {1, ..., n − 1} is used to predict the next token at position i + 1, until the [SEP] token is generated.
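The concatenation and the two FC layers can be sketched as follows (an illustrative NumPy sketch with hypothetical names; layer normalization and the final vocabulary projection are omitted):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def decode_states(token_states, z, W1, b1, W2, b2):
    """Concatenate z to each token state, then apply two GELU FC layers.

    token_states: (n, d_h) outputs of the last decoder transformer layer
    z:            (d_h,)   composed latent vector
    W1: (2*d_h, d_h) reduces the concatenated width back to d_h
    W2: (d_h, d_h)   second FC layer
    """
    n = token_states.shape[0]
    h = np.concatenate([token_states, np.tile(z, (n, 1))], axis=1)  # (n, 2*d_h)
    h = gelu(h @ W1 + b1)  # 2*d_h -> d_h
    h = gelu(h @ W2 + b2)  # d_h -> d_h
    return h
```

The returned states would then be layer-normalized and multiplied by the output embedding matrix to score vocabulary tokens at each position.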
To train the decoder to reconstruct the input, a reconstruction loss is formulated as:

$$\mathcal{L}_{r} = -\mathbb{E}\left[\log p(x \mid z_d, z_a, y_d, y_a)\right].$$

Learning with contrastive loss
Although the model can generate utterances for a given intent, such as "are there any alarms set for seven am" for "Alarm Query", some negative utterances are also generated. For example, "am i free between six to seven pm" is generated for the intent "Alarm Query". This is likely because training lacks supervision to distinguish in-class from out-of-class examples, especially for few-shot intents. To alleviate this problem, we adopt a contrastive loss in the objective function to reduce the probability of generating out-of-class samples.
Given an intent y = (y_d, y_a), let x⁺ be an in-class utterance from this intent and x⁻ an out-of-class utterance from another intent. The contrastive loss constrains the model to generate the in-class example x⁺ with a higher probability than x⁻. In the same batch, we feed the in-class example (y_d, y_a, x⁺) and the out-of-class example (y_d, y_a, x⁻) into CLANG to model the likelihoods P(x⁺|y) and P(x⁻|y). The chain rule is used to calculate the likelihood of the whole utterance: p(x|y) = p(w_1|y) p(w_2|y, w_1) ⋯ p(w_n|y, w_1, ..., w_{n−1}). In the contrastive loss, the log-likelihood of the in-class example is constrained to be higher than that of the out-of-class example by a certain margin λ:

$$\mathcal{L}_{con} = \max\left(0,\ \log P(x^-|y) - \log P(x^+|y) + \lambda\right).$$

To leverage challenging out-of-class utterances, we choose the most similar utterance with a different intent as the out-of-class utterance. Three indicators are considered to measure the similarity between the in-class utterance and all the utterances with a different intent: the number of shared uni-grams s_1 and bi-grams s_2 between the utterances, and the number of shared uni-grams between the intent names s_3. The sum of these three numbers, s = s_1 + s_2 + s_3, is utilized to find the out-of-class utterance with the highest similarity. If multiple utterances have the same highest similarity s, we randomly choose one as the negative utterance.
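The similarity score for negative-example selection and the margin loss can be sketched as follows (hypothetical helper names; in the model, the log-likelihoods come from summing the decoder's per-token log-probabilities):

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) of a token list."""
    return set(zip(*(tokens[i:] for i in range(n))))

def similarity(utt_a, utt_b, intent_a, intent_b):
    # s1: shared uni-grams and s2: shared bi-grams between utterances;
    # s3: shared uni-grams between the intent names
    s1 = len(ngrams(utt_a, 1) & ngrams(utt_b, 1))
    s2 = len(ngrams(utt_a, 2) & ngrams(utt_b, 2))
    s3 = len(set(intent_a) & set(intent_b))
    return s1 + s2 + s3

def contrastive_loss(log_p_pos, log_p_neg, margin=0.5):
    # hinge on the log-likelihood gap: penalize when the in-class example
    # is not at least `margin` more likely than the out-of-class one
    return max(0.0, log_p_neg - log_p_pos + margin)
```

The utterance with the highest `similarity` to the in-class example (among those with a different intent) is picked as x⁻, which keeps the contrastive signal focused on the hardest negatives.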
The overall loss function is a summation of the KL loss, the reconstruction loss, and the contrastive loss:

$$\mathcal{L} = \mathcal{L}_{KL} + \mathcal{L}_{r} + \mathcal{L}_{con}.$$

Generalized Few-shot Intent Detection
Utterances for few-shot intents are generated by sampling the two latent variables, z_d and z_a, separately from multivariate standard Gaussian distributions. Beam search is applied for generation. To improve the diversity of the generated utterances, we sample the latent variables s times and save the top k results each time. The overall generation process follows that of Xia et al. (2020). These generated utterances are added to the original training dataset to alleviate the scarce annotation problem. We fine-tune BERT with the augmented dataset to solve the generalized few-shot intent detection task. The whole pipeline is referred to as BERT + CLANG in the experiments.
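The sampling loop can be sketched as follows (a hypothetical sketch: `sample_latents` and `beam_search` stand in for the actual Gaussian prior sampling and the beam-search decoder):

```python
def generate_for_intent(sample_latents, beam_search, s=10, k=30):
    """Sample (z_d, z_a) s times, keep the top-k beam results each time,
    and deduplicate the generated utterances (preserving first-seen order)."""
    generated = []
    seen = set()
    for _ in range(s):
        z_d, z_a = sample_latents()          # draw from N(0, I) priors
        for utt in beam_search(z_d, z_a)[:k]:
            if utt not in seen:
                seen.add(utt)
                generated.append(utt)
    return generated
```

Repeated sampling of the latent variables is what drives diversity: different draws of (z_d, z_a) steer the beam search toward different surface forms of the same intent.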

Experiments
To evaluate the effectiveness of the proposed approach for generating labeled examples for few-shot intents, experiments are conducted for the GFSID task on two real-world datasets. The few-shot intents are augmented with utterances generated from CLANG.

Datasets
Following Xia et al. (2020), two public intent detection datasets are used in the experiments: SNIPS-NLU (Coucke et al., 2018) and NLUED (Xingkun Liu and Rieser, 2019). These two datasets contain utterances from users interacting with intelligent assistants and are annotated with pre-defined intents. Dataset details are given in Table 1. SNIPS-NLU contains seven intents in total. Two of them (RateBook and AddToPlaylist) are regarded as few-shot intents; the others are used as existing intents with sufficient annotation. We randomly choose 80% of the whole data as the training data and 20% as the test data. NLUED is a natural language understanding dataset with 64 intents for human-robot interaction in the home domain, in which 16 intents are randomly selected as the few-shot ones. A sub-corpus of 11,036 utterances with 10-fold cross-validation splits is utilized.

Baselines
We compare the proposed model with a few-shot learning model and several data augmentation methods. 1) Prototypical Network (Snell et al., 2017) (PN) is a distance-based few-shot learning model. It extends naturally to the GFSID task by providing prototypes for all the intents. BERT is used as the encoder for PN to provide a fair comparison, and we fine-tune BERT together with the PN model; this variation is referred to as BERT-PN+. 2) BERT. For this baseline, we oversample the few-shot intents by duplicating the few-shot examples up to the maximum number of training examples for one class. 3) SVAE (Bowman et al., 2015) is a variational autoencoder built with LSTMs. 4) CGT (Hu et al., 2017) adds a discriminator on top of SVAE to classify the sentence attributes. 5) EDA (Wei and Zou, 2019) uses simple data augmentation rules for language transformation; we apply three rules in the experiments: insert, delete, and swap. 6) CG-BERT (Xia et al., 2020) is the first work that combines CVAE with BERT for few-shot text generation. BERT is fine-tuned with the augmented training data for these generation baselines; the whole pipelines are referred to as BERT + SVAE, BERT + CGT, BERT + EDA, and BERT + CG-BERT in Table 2. An ablation study is also provided to understand the importance of the contrastive loss by removing it from CLANG. Table 2: Generalized few-shot intent detection with 1-shot and 5-shot settings on SNIPS-NLU and NLUED. Seen is the accuracy on the seen intents (acc_m), Unseen/Novel is the accuracy on the novel intents (acc_f), and H-Mean is the harmonic mean of the seen and unseen accuracies.

Implementation Details
Both the encoder and the decoder use six transformer layers. Pre-trained weights from BERT-base are used to initialize the embeddings and the transformer layers: the weights from the first six layers of BERT-base initialize the transformer layers in the encoder, and the last six layers initialize the decoder. The Adam optimizer (Kingma and Ba, 2014) is applied for all the experiments. The margin for the contrastive loss is 0.5 for all the settings. All the hidden dimensions used in CLANG are 768. For CLANG, the learning rate is 1e-5 and the batch size is 16. Each epoch has 1000 steps. Fifty examples from the training data are sampled as the validation set. The reconstruction error on the validation set is used to search for the number of training epochs in the range of [50, 75, 100]. The reported performances of CLANG and the ablation without the contrastive loss are both trained with 100 epochs. The hyperparameters for the generation process, including the top index k and the number of sampling times s, are chosen by evaluating the quality of the generated utterances; the quality evaluation is described in the result analysis section. We search s in [10, 20] and k in [20, 30]. We use k = 30 and s = 20 for BERT + CLANG on NLUED, and k = 30 and s = 10 for all the other experiments. When fine-tuning BERT for the GFSID task, we fix the hyperparameters as follows: the batch size is 32, the learning rate is 2e-5, and the number of training epochs is 3.

Experiment Results
The experiment results for the generalized few-shot intent detection task are shown in Table 2. Performance is reported for two datasets under both 1-shot and 5-shot settings. For SNIPS-NLU, the performance is calculated as the average and standard deviation over 5 runs. The results on NLUED are reported over 10 folds.
Three metrics are used to evaluate the model performances: the accuracy on existing many-shot intents (acc_m), the accuracy on few-shot intents (acc_f), and their harmonic mean (H), calculated as:

$$H = \frac{2 \cdot acc_m \cdot acc_f}{acc_m + acc_f}.$$

We choose the harmonic mean as our evaluation criterion instead of the arithmetic mean because the arithmetic mean is dominated by the many-shot accuracy acc_m rather than the few-shot accuracy acc_f (Xian et al., 2017). In contrast, the harmonic mean is high only when the accuracies on both many-shot and few-shot intents are high.
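For reference, the harmonic mean above can be computed as:

```python
def harmonic_mean(acc_m, acc_f):
    """Harmonic mean of many-shot and few-shot accuracies.

    High only when both accuracies are high; 0 if either is 0.
    """
    if acc_m + acc_f == 0:
        return 0.0
    return 2 * acc_m * acc_f / (acc_m + acc_f)
```

For example, acc_m = 0.9 with acc_f = 0.1 yields H ≈ 0.18, far below their arithmetic mean of 0.5, which is why the harmonic mean rewards balanced performance.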
As illustrated in Table 2, the proposed pipeline BERT + CLANG achieves state-of-the-art performance on the accuracy for many-shot intents, few-shot intents, and their harmonic mean on the SNIPS-NLU dataset. On the NLUED dataset, BERT + CLANG outperforms all the baselines on the accuracy for few-shot intents and the harmonic mean, while achieving comparable results on many-shot intents compared with the best baseline. Since the many-shot intents have sufficient training data, the improvement mainly comes from few-shot intents with scarce annotation. For example, the accuracy for few-shot intents on NLUED with the 5-shot setting improves by 5% over the best baseline (BERT + CG-BERT).
Compared to the few-shot learning method, CLANG achieves better performance consistently in all the settings. BERT-PN+ achieves decent performance on many-shot intents but lacks the ability to provide embeddings that generalize from existing intents to few-shot intents.
Among the data augmentation baselines, CLANG obtains the best performance on few-shot intents and the harmonic mean. These results demonstrate the high quality and diversity of the utterances generated from CLANG. CGT and SVAE barely improve the performance for few-shot intents: they only work well with sufficient training data, and the utterances generated by these two models are almost the same as the few-shot examples. The improvement from EDA is also limited since it only provides simple language transformations like insert and delete. Compared with CG-BERT, which incorporates the pre-trained language model BERT, CLANG further improves the ability to generate utterances for few-shot intents with composed natural language generation.
From the ablation study illustrated in Table 3, removing the contrastive loss decreases the accuracy for few-shot intents and the harmonic mean. It shows that the contrastive loss regularizes the generation process and contributes to the downstream classification task.

Result Analysis
To further understand the proposed model, CLANG, result analysis and generation quality evaluation are provided in this section. We take fold 7 of the NLUED dataset with the 5-shot setting as an example. It contains 16 novel intents with 5 examples per intent. The intent in this paper is defined as a pair of a domain and an action. The domain or the action might be shared between the many-shot intents and the few-shot intents. A domain/action that exists in many-shot intents is called a seen domain/action; otherwise, it is called a novel domain/action. To analyze how well our model performs on different few-shot intents, we split the few-shot intents into four types: a novel domain with a seen action (Novel_d), a novel action with a seen domain (Novel_a), both domain and action seen (Dual_s), and both domain and action novel (Dual_u). We compare our proposed model with CG-BERT on these different types. As illustrated in Table 4, CLANG consistently performs better than CG-BERT on all the types. The performance for intents with a seen action and a novel domain improves by 20.90%. This observation indicates that our model is better at generalizing seen actions to novel domains. As a few-shot natural language generation model, diversity is a very important indicator for quality evaluation. We compare the percentage of unique utterances generated by CLANG with that of CG-BERT. In CG-BERT, the top 20 results are generated for each intent by sampling the hidden variable once; there are 257 unique sentences out of 320 utterances (80.3%). In CLANG, the top 30 results for each intent are generated by sampling the latent variables once; we obtain 479 unique sentences out of 480 utterances (99.8%), which is much higher than CG-BERT.
Several generation examples are shown in Table 5. CLANG can generate good examples (indicated by G) that contain new slot values (like time, place, or action) not present in the few-shot examples (indicated by R). For example, G1 has a new time slot and G5 has a new action. Bad cases (indicated by B) like B1 and B5 fill in the sentence with improper slot values. CLANG can also learn sentences from other intents. For instance, G3 transfers the expression in R3 from "Recommendation Events" to "recommendation movies". A case study is further provided for the Alarm Query intent with human evaluation; there are 121 unique utterances generated in total.

Related Work

Several prior works (2019) generate fully annotated utterances to alleviate the data scarcity issue in spoken language understanding tasks. These models utilize LSTMs as encoders (Hochreiter and Schmidhuber, 1997) with limited model capacity. Xia et al. (2020) provide the first work that combines CVAE with BERT to generate utterances for generalized few-shot intent detection.
Recently, large-scale pre-trained language models have been proposed for conditional text generation tasks (Dathathri et al., 2019; Keskar et al., 2019), but they are only evaluated by human examination and do not aim at improving downstream classification tasks in low-resource conditions.

Contrastive Learning in NLP

Contrastive learning, which learns the differences between positive data and negative examples, has been widely used in NLP (Gutmann and Hyvärinen, 2010; Mikolov et al., 2013; Cho et al., 2019). Gutmann and Hyvärinen (2010) leverage the Noise Contrastive Estimation (NCE) metric to discriminate the observed data from artificially generated noise samples. Cho et al. (2019) introduce contrastive learning for multi-document question generation by generating questions closely related to the positive set but far away from the negative set. Different from previous works, our contrastive loss learns a positive example against a negative example together with label information.

Conclusion
In this paper, we propose a novel model, Composed Variational Natural Language Generator (CLANG) for few-shot intents. An intent is defined as a combination of a domain and an action to build connections between existing intents and few-shot intents. CLANG has a bi-latent variational encoder that uses two latent variables to learn disentangled semantic features corresponding to different parts in the intent. These disentangled features are composed together to generate training examples for few-shot intents. Additionally, a contrastive loss is adopted to regularize the generation process. Experimental results on two real-world intent detection datasets show that our proposed method achieves state-of-the-art performance for GFSID.