Data-Efficient Paraphrase Generation to Bootstrap Intent Classification and Slot Labeling for New Features in Task-Oriented Dialog Systems

Recent progress through advanced neural models pushed the performance of task-oriented dialog systems to almost perfect accuracy on existing benchmark datasets for intent classification and slot labeling. However, in evolving real-world dialog systems, where new functionality is regularly added, a major additional challenge is the lack of annotated training data for such new functionality, as the necessary data collection efforts are laborious and time-consuming. A potential solution to reduce the effort is to augment initial seed data by paraphrasing existing utterances automatically. In this paper, we propose a new, data-efficient approach following this idea. Using an interpretation-to-text model for paraphrase generation, we are able to rely on existing dialog system training data, and, in combination with shuffling-based sampling techniques, we can obtain diverse and novel paraphrases from small amounts of seed data. In experiments on a public dataset and with a real-world dialog system, we observe improvements for both intent classification and slot labeling, demonstrating the usefulness of our approach.


Introduction
Intent classification and slot labeling are two fundamental components in task-oriented dialog systems, producing a formal meaning representation for an utterance that the system can act upon to fulfill the user's request. As shown in Figure 1, it is typically modeled by classifying the utterance into a set of supported intents and labeling its sequence of tokens with expressed slots. We refer to the combined output of these two steps as the interpretation of the utterance. While the performance on these tasks has by now reached high accuracies on benchmarks like SNIPS (Coucke et al., 2018), the widespread use of real-world systems like Apple's Siri, Amazon's Alexa and Google's Assistant leads to new research challenges. As such systems are constantly expanding their functionality, bootstrapping new features is a common task that we focus on in this work.
In this paper, we define a new feature as a set of one or more intents and related slots that were not known to the system before. The main challenge when introducing a new feature is that typically only very little seed training data is available, which makes it difficult to train good intent and slot models for the feature. Manually collecting annotated data, however, is expensive and time-consuming, slowing down the expansion of the system's functionality. We therefore aim to reduce that time and effort by automatically augmenting the seed training data, following the idea of recent work to leverage paraphrase generation (Malandrakis et al., 2019). Through paraphrasing, we can automatically increase the amount of training data, but for the data to be useful, we have to ensure that the new utterances are diverse and different from what we already have. If they are not, the data augmentation degenerates to simply upsampling the seed data; that can be helpful on its own, but it does not expose any new input examples that could help the model learn and generalize even better. On the other hand, we also have to ensure that paraphrases remain realistic and natural utterances that are representative of the real-world data distribution.
In contrast to previous work, we propose a data-efficient augmentation approach that can work with very small amounts of seed examples. First, we propose using an interpretation-to-text model that can leverage similarities to existing intents and shared slots when generating paraphrases for new features. In addition, instead of requiring pairs of paraphrases, it can be trained on single utterances annotated with intents and slots as they already exist for the downstream tasks. Second, we propose new paraphrase sampling strategies that increase the amount and diversity of obtained paraphrases by using random sampling and shuffling the input representation.
Third, we obtain token-level slot labels for the paraphrases via alignment-based label projection instead of relying on self-labeling with a baseline model as in prior work. These differences make our approach very data-efficient, such that it can produce diverse and novel training examples for a new feature from just 100 seed utterances.
We evaluate our technique by simulating the introduction of new features on SNIPS, a common benchmark for intent classification and slot labeling in English, and on German data from a real-world dialog system. In both settings, we see substantial improvements for new features without negative effects on existing features, demonstrating the usefulness of our approach despite little seed data being available.

Related Work
Intent classification and slot labeling have been studied for several decades as fundamental building blocks of task-oriented dialog systems, dating back at least to the introduction of the ATIS corpus (Price, 1990) and subsequent work, e.g. (Pieraccini et al., 1992). In recent years, much progress has been made by applying deep learning techniques, such as recurrent networks (Mesnil et al., 2013), and modeling both tasks jointly with a single model (Zhang and Wang, 2016;Hakkani-Tür et al., 2016). Approaches that further improve knowledge sharing between the tasks, such as slot-gates (Goo et al., 2018) or bidirectional task connections (E et al., 2019), continued this line of work. Yet, sufficiently large training datasets are required for such advanced machine learning approaches to be effective.
To overcome the lack of training data for new features and avoid costly data collections, several proposals have been made in the recent past. Machine translation can be used to obtain training examples if the feature already exists for other languages (Gaspers et al., 2018). Cross-lingual transfer learning is another technique to effectively use such existing data (Do and Gaspers, 2019). If a feature is already being actively used, feedback signals from users, such as paraphrases or interruptions, can be identified in user interactions to obtain additional training data (Muralidharan et al., 2019). If unlabeled utterances exist for the feature, pairs of paraphrases between labeled and unlabeled data might be found, which allows deducing labels for the unlabeled utterance from the paraphrase relationship (Qiu et al., 2019).
In our work, we make no assumption about the availability of labeled data in other languages, user feedback or unlabeled data, as none of them is necessarily available when bootstrapping a new feature. The closest existing work to ours therefore consists of two approaches that deal with the setup where only seed examples are available. Malandrakis et al. (2019) propose using conditional variational auto-encoders to generate paraphrases for the seed data and show that the paraphrases increase intent classification performance in their experiment. In contrast to our work, they do not evaluate on slot labeling and do not suggest a technique to add slot labels to the paraphrases. The second approach generates paraphrases for seed examples with a transformer network and self-labels them with a baseline intent and slot model. When added as training data, this combination of paraphrasing with self-supervised learning improves both tasks in their experiment. Our approach differs mainly in terms of data efficiency, as we do not rely on a baseline model for self-labeling, which in turn requires sufficient seed data for training; in addition, we leverage data from existing features and use more advanced sampling strategies to obtain sufficient amounts of high-quality paraphrases from little seed data. We therefore demonstrate that our approach is applicable to earlier stages of bootstrapping by running our experiments on orders of magnitude less seed data than that work. Furthermore, our paraphrase generation approach differs by training an interpretation-to-text model instead of a text-to-text model, such that no corpus of paraphrase pairs is needed. We rely exclusively on the already existing data for the downstream tasks of intent classification and slot labeling.
Beyond our specific use case, the task of paraphrase generation in general has gained much attention in recent years, driven partly by large-scale datasets such as the Quora Question Pairs and WikiAnswers corpora, which enabled training complex models. Prakash et al. (2016) proposed one of the first neural models, relying on residual LSTMs. Further work explored auto-encoders (Gupta et al., 2018), generator-discriminator models trained with reinforcement learning, and decompositions into multiple recurrent and transformer models that paraphrase at different levels. All approaches in this line of work follow the text-to-text approach and assume training data in the form of paraphrase pairs. Because existing large datasets of this type focus on questions, we found that directly applying models trained on them to the dialog domain yields very unnatural paraphrases.
More closely related to the interpretation-to-text approach we use in this paper is work on generating natural language from structured data. Various versions of this task, differing mainly by the type of input, have been studied, e.g. generation from abstract meaning representations (Konstas et al., 2017) or from tabular data, as well as generation in unsupervised settings with denoising autoencoders (Freitag and Roy, 2018). The E2E NLG challenge (Dušek et al., 2020), a recent competition carried out on a dataset from the restaurant domain with slot-based input representations, resembles our paraphrase generation scenario very closely. Among 62 competition entries, the organizers found that sequence-to-sequence models, as also used in our work, were the most popular and also very competitive. However, a difference in that line of work is that paraphrases are evaluated only intrinsically, which does not necessarily reflect their usefulness for intent classification and slot labeling.

Data Augmentation Approach
We formalize the introduction of a new feature as having two sets of training data D N and D O , where the former contains the seed examples for the new feature and the latter all available examples for existing features. An example (u, i, s) ∈ D O ∪ D N consists of a tokenized utterance u, the true intent label i and true slot labels s in BIO format. We refer to the label of a slot as the slot name and the covered tokens as its slot value. Let ŝ be the mapping of slot names to their values, e.g. ŝ = {Movie-Type : movies, Spatial-Rel : closest, Loc-Type : movie theatre} for the example in Figure 1.
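To make the notation concrete, the following minimal sketch derives the slot-name-to-value mapping ŝ from BIO-labelled tokens; the function name and the example labels are illustrative and not taken from the paper's implementation.

```python
def slot_map(tokens, bio_labels):
    """Collect BIO-labelled spans into a slot-name -> slot-value mapping (s-hat)."""
    slots, name, value = {}, None, []
    for tok, lab in zip(tokens, bio_labels):
        if lab.startswith("B-"):
            if name:                       # flush the previous span
                slots[name] = " ".join(value)
            name, value = lab[2:], [tok]
        elif lab.startswith("I-") and name:
            value.append(tok)              # continue the current span
        else:                              # "O" label ends any open span
            if name:
                slots[name] = " ".join(value)
            name, value = None, []
    if name:
        slots[name] = " ".join(value)
    return slots

u = "find the closest movie theatre showing movies".split()
s = ["O", "O", "B-Spatial-Rel", "B-Loc-Type", "I-Loc-Type", "O", "B-Movie-Type"]
print(slot_map(u, s))
# {'Spatial-Rel': 'closest', 'Loc-Type': 'movie theatre', 'Movie-Type': 'movies'}
```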
In our scenario, D N contains one or a few new intents not yet in D O and is very small. We propose a paraphrase generation model PG that produces new examples D P = PG(D N ) with the goal of improving intent classification and slot labeling performance when those models are trained on D O ∪ D N ∪ D P . In this section, we will describe how PG learns to paraphrase utterances, how we sample diverse paraphrases from it and how we project labels onto them to obtain the examples D P .

Interpretation-to-Text Paraphrase Generation
In our interpretation-to-text approach, rather than learning to map utterances to paraphrases, we instead use examples (u, i, s) to model utterance sequences token by token conditioned on their interpretation:

p(u | i, ŝ) = ∏_j p(u_j | u_1, …, u_{j−1}, enc_j(i, ŝ)) (1)

where enc_j(i, ŝ) is an attention-based encoding of the interpretation (i, ŝ) at decoding step j. The idea behind this approach is that the notion of two utterances being paraphrases is, for our purposes, equivalent to them having the same interpretation. Thus, we model a mapping from that unique representation of a set of paraphrases, the shared interpretation, to all its realizations. At inference time, the conditional language model p(u | i, ŝ) can then, once conditioned on a specific interpretation, provide a distribution over all possible realizations from which paraphrases can be sampled.

Figure 2: The interpretation-to-text model encodes an interpretation as a sequence of intent, slot name and slot value embeddings and is trained to predict the utterance. At inference time, we sample token by token and shuffle the input slot order (bottom) to obtain diverse paraphrases.
In terms of data efficiency, an obvious advantage is that the downstream training data can be used directly to train the model. We can include the large set D O in addition to the seed data D N to train a much more powerful paraphrase model than with D N alone or with out-of-domain paraphrase data. Thereby, the model can learn more general properties of the language used with dialog systems and similarities of utterances across intents and slots. At inference time, when sampling paraphrases for interpretations from D N token by token, the model can interpolate between utterances seen in D N or D O and is thus able to create novel utterances that were not seen during training but reflect the interpretation.
We implement p(u | i, ŝ) following the sequence-to-sequence paradigm, using a bi-directional GRU for encoding and a GRU equipped with attention and a pointer mechanism for decoding (See et al., 2017). Embeddings for utterance tokens, slot names and intents are learned. An interpretation is fed into the model as a sequence starting with the embedded intent, followed by vectors that are the sum of a slot name embedding and a slot value token embedding. See Figure 2 for an illustration. The same vocabulary and embeddings are used in encoder and decoder, but only tokens can be generated on the output side. We train by minimizing cross-entropy for the utterances in D O ∪ D N .
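As an illustration of how the interpretation is turned into an input sequence, the sketch below builds it from toy embeddings: the intent embedding first, then one vector per slot-value token formed as the sum of the slot-name and token embeddings (cf. Figure 2). The tiny dimensionality, the vocabulary and the random initialization are assumptions for demonstration only; the actual model learns 300-dimensional embeddings jointly with the network.

```python
import random

DIM = 8  # toy embedding size; the paper uses 300-d embeddings
rng = random.Random(0)

# Hypothetical toy vocabulary; in the model these embeddings are learned.
VOCAB = ["FindMovie", "Spatial-Rel", "Loc-Type", "Movie-Type",
         "closest", "movie", "theatre", "movies"]
emb = {w: [rng.gauss(0, 1) for _ in range(DIM)] for w in VOCAB}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def encode_interpretation(intent, slots):
    """Build the encoder input sequence: the intent embedding first, then one
    vector per slot-value token, each the sum of the slot-name embedding and
    the token embedding."""
    seq = [emb[intent]]
    for name, value in slots.items():
        for tok in value.split():
            seq.append(add(emb[name], emb[tok]))
    return seq

x = encode_interpretation("FindMovie",
                          {"Spatial-Rel": "closest",
                           "Loc-Type": "movie theatre",
                           "Movie-Type": "movies"})
print(len(x), len(x[0]))  # 5 vectors of dimension 8: intent + 4 slot-value tokens
```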

Paraphrase Sampling Strategies
At inference time, sequence-to-sequence models typically use greedy decoding or beam search to find the (approximately) best sequence given an input. In our use case of obtaining additional training data, however, we are not only interested in the most likely realization of an interpretation, but rather want multiple diverse and novel utterances. We therefore rely on input shuffling and random sampling.

Input Representation Shuffling
The fact that feeding a sequence to the model requires us to order the slots in a certain way provides a simple way of obtaining multiple paraphrases: shuffling the order. We observed that although we train the model using one order (corresponding to the utterance), decoding with alternative orders provides paraphrases that also mention the values in that alternative order. A problem is that decoded utterances sometimes miss a slot, which motivates our quality metric partial slot carry-over (PSCO), measuring the fraction of slots in ŝ for which at least one token of the slot value ŝ_j occurs in the decoded utterance u′:

PSCO(u′, ŝ) = |{ŝ_j ∈ ŝ : at least one token of ŝ_j occurs in u′}| / |ŝ| (2)
Using that metric, we collect k paraphrases for a given seed utterance u as follows: First, we create input sequences corresponding to all permutations of the slot values. Next, we decode the best utterance for each input with beam search, compute its PSCO and keep only candidates with a score of 1. From the remaining candidates, we sample k (if there are more) or upsample to k (if there are fewer). The last step ensures that each seed leads to exactly k paraphrases and the initial distribution is thereby preserved.
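The selection procedure above can be sketched as follows, assuming a `decode` callback that stands in for the beam-search decoder; the function names and the example slots are illustrative, not part of the paper's implementation.

```python
import itertools
import random

def psco(slots, paraphrase_tokens):
    """Partial slot carry-over: fraction of slots with at least one value token
    present in the paraphrase."""
    if not slots:
        return 1.0
    hits = sum(any(tok in paraphrase_tokens for tok in value.split())
               for value in slots.values())
    return hits / len(slots)

def select_paraphrases(slots, decode, k, rng=random):
    """Decode one candidate per slot-order permutation, keep only candidates
    with PSCO == 1, then down-/up-sample to exactly k paraphrases."""
    candidates = []
    for order in itertools.permutations(slots.items()):
        cand = decode(dict(order))          # best beam-search hypothesis
        if psco(slots, cand.split()) == 1.0:
            candidates.append(cand)
    if not candidates:
        return []
    if len(candidates) >= k:
        return rng.sample(candidates, k)
    return candidates + rng.choices(candidates, k=k - len(candidates))

slots = {"artist": "taylor swift", "service": "spotify"}
fake_decode = lambda s: "play " + " ".join(s.values())  # decoder stand-in
print(select_paraphrases(slots, fake_decode, k=5))
# 5 paraphrases, each carrying over all slots
```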

Random Sampling
As a second strategy, we replace beam search with random sampling, which samples an utterance token by token according to the probability distribution over the vocabulary rather than trying to find the most likely sequence. Since this decoding process is not deterministic, multiple rounds of sampling yield different utterances, which allows us to obtain multiple paraphrases per seed. When decoding token u_j, we scale the logits z over the vocabulary with a temperature α before applying softmax and then sample from the top-β tokens according to their probabilities. α < 1 makes the distribution spikier, such that the most likely tokens are sampled more often, while α > 1 promotes less likely ones, as does a larger β, which defines the size of the distribution's head to consider. With higher values, a larger part of the distribution is explored, leading to higher novelty and diversity of the utterances, but potentially also to more unnatural sequences.
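A minimal sketch of this sampling step, with the logits given as a token-to-logit mapping; the toy vocabulary and parameter values are assumptions for illustration only.

```python
import math
import random

def sample_token(logits, alpha=1.0, beta=5, rng=random):
    """Temperature-scaled top-beta sampling over a {token: logit} dict.
    alpha < 1 sharpens the distribution, alpha > 1 flattens it; beta caps
    how much of the distribution's head is considered."""
    scaled = {tok: z / alpha for tok, z in logits.items()}
    top = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:beta]
    m = max(z for _, z in top)                      # for numerical stability
    weights = [math.exp(z - m) for _, z in top]     # softmax numerators
    return rng.choices([tok for tok, _ in top], weights=weights, k=1)[0]

logits = {"play": 3.0, "start": 2.5, "the": 1.0, "a": 0.5, "movie": 0.2, "xyz": -4.0}
tok = sample_token(logits, alpha=0.7, beta=3)
print(tok)  # one of: play, start, the
```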

Label Projection
Given a paraphrase u′ for a seed example (u, i, s), we finally need to project the labels i and s to turn u′ into a useful example for the downstream tasks. While this is trivial for intent labels, as we can simply use i, it is more intricate for per-token slot labels because u and u′ differ. Nevertheless, many tokens typically overlap, which motivates a token alignment-based approach inspired by Fürstenau and Lapata (2009). For each source-target token pair (u_j, u′_k), we first compute a similarity sim(u_j, u′_k) ∈ [0, 1]. We use the inverse of the character-level Levenshtein edit distance normalized by length, which works well to identify identical tokens or slight morphological variations. Experiments with word embeddings to capture more semantic similarities made only small differences on our data. Based on the pairwise similarities, we then consider all alignments of source tokens in u to targets in u′ and score them by the average similarity of the chosen alignments, allowing a source to be aligned to a virtual empty target token if no better target is found. Rather than exactly solving this optimization problem as done by Fürstenau and Lapata (2009), we find the best alignment greedily, choosing the most similar target for each source from left to right. Finally, we project the slot labels s along that alignment, restore BIO prefixes, and use the result s′ to form the new training example (u′, i, s′) for D P .
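A simplified sketch of this projection, assuming a similarity threshold (here 0.5) below which a source token is aligned to the virtual empty target; the threshold value and helper names are illustrative, not taken from the paper.

```python
def lev(a, b):
    """Character-level Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(a, b):
    """Length-normalized inverse edit distance in [0, 1]."""
    return 1.0 - lev(a, b) / max(len(a), len(b), 1)

def project_labels(src_tokens, src_labels, tgt_tokens, threshold=0.5):
    """Greedily align each labelled source token (left to right) to its most
    similar unused target token; below the threshold, the source is aligned
    to a virtual empty target and its label is dropped. BIO prefixes are
    restored at the end."""
    tgt_labels = ["O"] * len(tgt_tokens)
    used = set()
    for s_tok, s_lab in zip(src_tokens, src_labels):
        if s_lab == "O":
            continue
        best, best_sim = None, threshold
        for k, t_tok in enumerate(tgt_tokens):
            s = sim(s_tok, t_tok)
            if k not in used and s > best_sim:
                best, best_sim = k, s
        if best is not None:
            used.add(best)
            tgt_labels[best] = s_lab[2:]  # strip the B-/I- prefix for now
    out, prev_name = [], None
    for lab in tgt_labels:
        if lab == "O":
            out.append("O")
            prev_name = None
        else:
            out.append(("I-" if lab == prev_name else "B-") + lab)
            prev_name = lab
    return out

src = "find the closest movie theatre".split()
labels = ["O", "O", "B-Spatial-Rel", "B-Loc-Type", "I-Loc-Type"]
tgt = "where is the closest movie theater".split()
print(project_labels(src, labels, tgt))
# ['O', 'O', 'O', 'B-Spatial-Rel', 'B-Loc-Type', 'I-Loc-Type']
```

Note how the morphological variant "theater" still receives the projected label thanks to the normalized edit distance.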

Experimental Setup
We experiment with our data augmentation approach by simulating the introduction of a new feature on a public dataset with English utterances and on internal data from a real-world dialog system in German.

Data
SNIPS The Snips NLU dataset (Coucke et al., 2018) is a commonly used benchmark for intent classification and slot labeling covering 7 intents and 39 slots. It is typically split into 13,084 train, 700 validation, and 700 test examples. We run 7 experiments, each time picking one of the intents to be the new feature. For each, we use the full training and validation data of the remaining 6 intents as D train O and D val O (together about 11,815 instances), while we sample 5% of the new feature's data to be the seed data D train N and D val N (together 100 instances). We use the regular test split of SNIPS for evaluation. Since SNIPS is moderate in size and our experimental setup involves several randomized parts, we repeat each experiment multiple times and report averaged results: for each of the 7 simulated new features, we sample three different 5% seed subsets and train models with 10 different random seeds, resulting in 210 experiment runs per approach.
Internal Data In addition to SNIPS, we use randomly sampled data from logs of a commercial dialog system in German to simulate the introduction of 5 different features that have been introduced in the past. For each, we use |D train N | = 450 and |D val N | = 50 seed utterances.

Intent and Slot Model
For intent classification and slot labeling, we use a neural model inspired by existing work. Tokens are embedded as 300-dimensional vectors and encoded with a bi-directional GRU with hidden size 512. For intent classification, the concatenated final representations of the GRUs are fed through a 300-d ReLU layer, followed by dropout and a final softmax layer over intent labels. For slot labeling, we use a similar two-layer network, but apply it to the GRU states from each timestep to predict slot labels per token. The model is trained with Adam, a batch size of 64 and dropout of 0.2 until the performance stops improving on the validation data. We train the model separately for intent classification and slot labeling, which departs from recent work that observed improvements through joint training (see Section 2). However, we choose this setup to be able to study the effect of our data augmentation on both tasks in isolation.

Compared Approaches
In our experiments, we compare the following variations of the techniques presented in Section 3:
• Baseline A model trained using just the seed data D N and the existing data D O . The goal for all data augmentation approaches is to improve upon this baseline.
• Upsampling5 A variation of the baseline that repeats each seed example 5 times to match the amount of training data for the new feature obtained with the augmentation techniques below.
• Beam1 A model trained on D O , D N and D P , where the latter comes from decoding one paraphrase for each seed in D N with beam search. We use a beam size of 5 and normalize scores by length.
• Beam5 Same as Beam1, but using the top-5 hypotheses from the beam for each seed.
• Shuffle+Beam5 Same as Beam5, but also using input shuffling. We select 5 paraphrases per seed based on PSCO from the beams across all shuffles as described in Section 3.2.1.
• Shuffle+Rand5 Same as Shuffle+Beam5, but using random sampling instead of beam search. We sample 3 paraphrases per shuffle and then select 5 across all shuffles by PSCO as in Section 3.2.1.
Given the number of seed examples available for the new feature, 100 on SNIPS and 500 on our internal dataset, the augmentation methods and baselines add either 100 or 500 examples on SNIPS and either 500 or 2,500 on the internal dataset. Beyond these two settings, it would be interesting to explore more seed-to-paraphrase ratios and to determine an optimal one, but we leave this for future work and instead focus on the comparison of different sampling techniques.

The paraphrasing model is trained on D N ∪ D O with a vocabulary of 30,000 tokens. Embeddings for tokens, intents and slot names are initialized with 300-dimensional GloVe vectors (Pennington et al., 2014) (or randomly if not found) and updated during training. The encoder and decoder GRUs have a size of 300 and two layers each. We train with a batch size of 64 and a dropout of 0.3 with early stopping.

Evaluation Metrics
Our main evaluation metric is the performance of the downstream tasks, measured by accuracy for intent classification and slot F1-score for slot labeling. We focus on the change in performance when using data augmentation compared to the baseline, as it shows whether the augmentation is helpful to bootstrap a new feature. Because bootstrapping a new feature can be detrimental to existing ones, we separately look at changes for the new feature (should be positive) and for existing features (should not be negative).
To get additional insights into the relative strengths of the different paraphrase generation approaches, we also adopt the following metrics to compare paraphrases:
• ESCO Similar to PSCO defined in Eq. 2, we compute the exact slot carry-over (ESCO), which differs in counting only slots whose complete slot value has been carried over. Where we use PSCO = 1 for candidate selection, this stricter version can still differentiate among the selections.
• Novelty In our use case, new examples need to be different from the seeds; otherwise, the augmentation degenerates to simply upsampling the data. We therefore use BLEU (Papineni et al., 2002) as a measure of similarity and compute for each paraphrase u′ of seed u the score 1 − BLEU4(u, u′). Higher scores indicate higher novelty. We report averages over all sampled paraphrases.
• Diversity In addition, we want paraphrases to be diverse instead of sampling the same paraphrase repeatedly. To measure diversity, we also compute 1 − BLEU4(u′, u′′), but among pairs of paraphrases u′, u′′ of the same seed. We report the average over all pairs; higher means more diversity.
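The novelty score can be sketched with a minimal, unsmoothed sentence-level BLEU-4; real evaluations would typically use an established implementation, so the helper below is only an illustration of the 1 − BLEU4 computation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(ref, hyp):
    """Minimal sentence-level BLEU-4 (uniform weights, brevity penalty).
    Returns 0 if any n-gram precision is zero (no smoothing)."""
    ref, hyp = ref.split(), hyp.split()
    if not hyp:
        return 0.0
    log_p = 0.0
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        if overlap == 0:
            return 0.0
        log_p += math.log(overlap / max(sum(h.values()), 1))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_p / 4)

def novelty(seed, paraphrase):
    """Higher means the paraphrase is more different from the seed."""
    return 1.0 - bleu4(seed, paraphrase)

print(novelty("play the latest song", "play the latest song"))   # 0.0
print(novelty("play the latest song", "put on the newest track"))  # 1.0
```

The diversity score is computed the same way, just between pairs of paraphrases of the same seed instead of seed-paraphrase pairs.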
Note that we report the three metrics introduced above primarily to better understand how paraphrases sampled with different methods differ. Whether such differences, e.g. higher novelty or diversity, are in fact beneficial is reflected by the downstream performance measured as intent accuracy and slot F1.

Figure 3 shows several examples of paraphrases obtained with our approach. As the examples show, utterances range from slight variations of the original utterance, dropping only some tokens (B1, A1) or changing the order as encouraged by input shuffling (A2, A3), to more strongly deviating ones that can in the extreme become unnatural and ungrammatical (B2, C1).

Table 1 compares the generated paraphrases with our quality metrics. As expected, paraphrases from random sampling are more novel and more diverse than those from beam search, but at the expense of carrying over fewer slots, indicating that overly novel paraphrases might no longer fully represent the intended interpretation. Note that by design (see Section 3.2.1), PSCO is 1 for both input shuffling-based methods, but ESCO reveals that the slot carry-over rate is in fact lower for random sampling. While we observe a trade-off between slot carry-over and novelty/diversity in general, it is notable that input shuffling increases novelty and diversity while at the same time also achieving the highest slot carry-over, for both beam search and random sampling. Hence, our proposed combination of shuffling and selection by PSCO appears to be an effective way to improve paraphrases along all dimensions.

Table 2 shows the effect of adding the paraphrases as training data on SNIPS. Comparing the variations of our method, we make the following observations: Across both tasks and both sampling methods, generating more paraphrases per seed utterance (500 vs. 100) yields larger improvements.
With regard to sampling methods, the results show that random sampling helps more than beam search, which is in line with the increased novelty and diversity observed in Table 1. However, that effect is less pronounced for slot labeling, as the lower slot carry-over that comes with higher novelty and diversity is more problematic for learning slot labeling, but almost irrelevant for intent classification. Input shuffling, on the other hand, seems beneficial for slot labeling but not for intent classification, which could be because word order is less relevant for intent classification in general. When evaluating on existing features, there is a performance drop, but it is negligibly small, in particular when compared to the gain on the new feature. Finally, we note that the Upsampling baseline is strong, but, while it beats several of our ablations, its performance is not as good as that of our full method, indicating that the higher novelty and diversity of the sampled paraphrases (see Table 1) are beneficial for the downstream task.

Effect on Downstream Tasks
On the internal data, we compared only our full method including input shuffling across the two sampling methods. As Table 3 shows, the average improvement over all simulated new features is even bigger for intent classification, again with random sampling providing bigger improvements than beam search. Also, none of the improvements for the new feature changes the performance on existing features substantially, as desired. Compared to SNIPS, however, improvements are smaller for slots, and beam search outperforms random sampling. We attribute the smaller improvement to the fact that more (shared) slots are already known to the model, since the set of existing features is larger. Thus, the baseline can already perform better, which makes it harder to improve upon. The breakdown by new feature in Table 3 illustrates that changes differ across features, which we attribute to the degree of variety within utterances for specific features and their similarity to already existing features.

Table 3: Intent classification (IC) and slot labeling (SL) performance change on internal data (German) across 5 (simulated) new features when adding paraphrases generated with Shuffle+Beam5 (S+B5) or Shuffle+Rand5 (S+R5). Numbers are absolute changes with regard to the baseline.

Discussion and Future Work
Our results are promising and show that improvements are possible even with small seed datasets. That being said, we want to emphasize that finding the paraphrases that provide the combination of novelty, diversity and meaning preservation that benefits the downstream task the most is challenging, as two different models, each with its own hyper-parameters, as well as the sampling parameters are involved. Tuning them carefully is therefore very costly, and we leave exploring methods for that to future work. In addition, future work should study the optimal number of paraphrases for a new feature, whether the sampling methods, in particular slot shuffling, work equally well across domains, and how the paraphrase generation model could benefit from language model pretraining. Finally, we would like to point out that in practice, our approach can be combined with any other data augmentation technique discussed in Section 2, such as using machine translation or user feedback, to bootstrap new features even further.

Conclusion
We proposed a data augmentation approach for seed data of new features in dialog systems that relies on interpretation-to-text paraphrase models, shuffling and random sampling to generate paraphrases and alignment-based label projection. We demonstrated that using the resulting new training examples improves performance for intents and slots on an English benchmark and German dialog system data.