Data Augmentation with Atomic Templates for Spoken Language Understanding

Spoken Language Understanding (SLU) converts user utterances into structured semantic representations. Data sparsity is one of the main obstacles in SLU because human annotation is costly, especially when the domain changes or a new domain is introduced. In this work, we propose a data augmentation method with atomic templates for SLU that requires minimal human effort. The atomic templates produce exemplars for fine-grained constituents of semantic representations. We propose an encoder-decoder model to generate the whole utterance from atomic exemplars. Moreover, the generator can be transferred from source domains to help a new domain that has little data. Experimental results show that our method achieves significant improvements on the DSTC 2&3 dataset, which is a domain adaptation setting of SLU.


Introduction
The SLU module is a key component of a spoken dialogue system, parsing user utterances into the corresponding semantic representations in a narrow domain. The typical semantic representation for SLU is a semantic frame (Tur and De Mori, 2011) or a dialogue act (Young, 2007). In this paper, we focus on SLU with dialogue acts, in which a sentence is labelled with a set of act-slot-value triples. For example, the utterance "Not Chinese but I want Thai food please" has an annotation of "deny(food=Chinese), inform(food=Thai)".
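To make the dialogue act representation concrete, the following is a minimal sketch of how such an annotation string could be turned into a list of act-slot-value triples. The `Triple` type, the regular expression, and the function name `parse_dialogue_act` are illustrative assumptions, not part of the DSTC 2&3 tooling; the pattern also accepts triples with an empty value or no slot.

```python
import re
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One act-slot-value triple; slot and value may be absent."""
    act: str
    slot: Optional[str]
    value: Optional[str]


# Matches act(slot=value), act(slot) and act(). Hypothetical surface syntax
# modelled on the example annotation in the text.
_TRIPLE_RE = re.compile(r"(\w+)\(\s*(?:(\w+)\s*(?:=\s*([^)]*?))?)?\s*\)")


def parse_dialogue_act(annotation: str) -> List[Triple]:
    """Parse an annotation string into a list of triples."""
    triples = []
    for act, slot, value in _TRIPLE_RE.findall(annotation):
        # findall returns '' for groups that did not participate in the match
        triples.append(Triple(act, slot or None, value.strip() or None))
    return triples


if __name__ == "__main__":
    for triple in parse_dialogue_act("deny(food=Chinese), inform(food=Thai)"):
        print(triple)
```

A dialogue act is treated here as an unordered collection, so the list order carries no meaning beyond the order in which the triples were written.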
Deep learning has achieved great success in the SLU field (Mesnil et al., 2015; Liu and Lane, 2016; Zhao et al., 2019). However, it is notorious for requiring large amounts of labelled data, which limits the scalability of SLU models. Despite recent advancements and tremendous research activity in semi-supervised learning and domain adaptation, deep SLU models still require massive amounts of labelled data to train. Therefore, data augmentation for SLU becomes appealing, and it needs three kinds of capabilities:
• Expression diversity: There are always various expressions for the same semantic meaning. Lack of expression diversity is an obstacle for SLU.
• Semantic diversity: The data augmentation method should generate data samples with various semantic meanings.
• Domain adaptation: The method should be able to utilize data from other domains and adapt itself to a new domain rapidly.
Figure 1: Workflow of the data augmentation with atomic templates for SLU.
In this work, we propose a new data augmentation method for SLU, which consists of atomic templates and a sentence generator (as shown in Figure 1). The method starts from dialogue acts, which are well-structured and predefined by domain experts. Rich and varied dialogue acts can be created automatically, guided by a domain ontology.
To enhance the domain adaptation capability of the sentence generator, we propose to first interpret dialogue acts in natural language so that new dialogue acts can be understood by the generator. However, interpreting a dialogue act at sentence level is costly, so atomic templates are exploited to reduce human effort. The atomic templates are created at phrase level and produce exemplars for fine-grained constituents (i.e. act-slot-value triples) of semantic representations (i.e. dialogue acts). Thus, the sentence generator can be an encoder-decoder model that paraphrases a set of atomic exemplars into an utterance.
We evaluate our approach on the DSTC 2&3 dataset (Henderson et al., 2014a,b), an SLU benchmark that includes a domain adaptation setting. The results show that our method obtains significant improvements over strong baselines with limited training data.

Related Work
Data augmentation for SLU. Hou et al. (2018) use a sequence-to-sequence model to translate between sentence pairs with the same meaning. However, it cannot generate data for new semantic representations. Yoo et al. (2018) use a variational autoencoder to generate labelled language, but the generated data still lacks variety in semantic meanings. Let x and y denote an input sentence and the corresponding semantic representation respectively. These two works estimate p(x, y) with generation models to produce more labelled samples, but they have two drawbacks: 1) the models cannot control which kinds of semantic meaning should be generated; 2) the models cannot generate data for new semantic representations, which may contain out-of-vocabulary (OOV) labels. We propose to generate utterances from semantic representations by estimating p(x|y), because y is well-structured and easy to synthesize. To overcome the OOV problem of semantic labels, we first map them to atomic exemplars with a little human effort.

Zero-shot learning of SLU. Besides data augmentation, zero-shot learning of SLU (Ferreira et al., 2015; Yazdani and Henderson, 2015), which can adapt to unseen semantic labels, is also related. Yazdani and Henderson (2015) exploit a binary classifier for each possible act-slot-value triple to predict whether it exists in the input sentence. Zhao et al. (2019) propose a hierarchical decoding model for SLU. However, these models still have problems with new acts and slots. Bapna et al. (2017), Lee and Jha (2018), and Zhu and Yu (2018) try to solve this with textual slot descriptions. In this paper, we propose atomic templates to describe act-slot-value triples rather than separate slots or acts.

Method
In this section, we introduce our data augmentation method with atomic templates, which generates additional data samples for training the SLU model.

SLU Model
For dialogue act based SLU, we adopt the hierarchical decoding (HD) model (Zhao et al., 2019) which dynamically parses act, slot and value in a structured way. The model consists of four parts:
• A shared utterance encoder, bidirectional LSTM recurrent neural network (BLSTM) (Graves, 2012);
• An act type classifier on the utterance;
• A slot type classifier with the utterance and an act type as inputs;
• A value decoder with the utterance and an act-slot type pair as inputs.
The value decoder generates the word sequence of the value by utilizing a sequence-to-sequence model with attention (Luong et al., 2015) and a pointer network (Vinyals et al., 2015a), which helps handle out-of-vocabulary (OOV) values.
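The control flow implied by the four parts above can be sketched as follows. This is only an illustration of the act, then slot, then value hierarchy: the three predictors are passed in as plain callables, and the toy rule-based stand-ins (`toy_acts`, `toy_slots`, `toy_value`) are hypothetical placeholders for the neural classifiers and the value decoder, not the model of Zhao et al. (2019).

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, Optional[str], Optional[str]]


def hierarchical_decode(
    utterance: str,
    predict_acts: Callable[[str], List[str]],
    predict_slots: Callable[[str, str], List[str]],
    decode_value: Callable[[str, str, str], Optional[str]],
) -> List[Triple]:
    """Predict acts, then slots for each act, then a value for each act-slot pair."""
    triples: List[Triple] = []
    for act in predict_acts(utterance):
        slots = predict_slots(utterance, act)
        if not slots:
            # An act may need no slot at all
            triples.append((act, None, None))
        for slot in slots:
            triples.append((act, slot, decode_value(utterance, act, slot)))
    return triples


# Toy stand-ins for the three learned components (assumptions for illustration).
def toy_acts(utterance: str) -> List[str]:
    words = utterance.lower().split()
    acts = []
    if "not" in words:
        acts.append("deny")
    if "want" in words:
        acts.append("inform")
    return acts


def toy_slots(utterance: str, act: str) -> List[str]:
    return ["food"] if "food" in utterance.lower().split() else []


def toy_value(utterance: str, act: str, slot: str) -> Optional[str]:
    words = utterance.split()
    lowered = [w.lower() for w in words]
    if act == "deny":
        return words[lowered.index("not") + 1]  # word right after "not"
    return words[lowered.index("food") - 1]     # word right before "food"


if __name__ == "__main__":
    utterance = "Not Chinese but I want Thai food please"
    print(hierarchical_decode(utterance, toy_acts, toy_slots, toy_value))
```

In the actual model all three components share the utterance encoder; the sketch only shows how their outputs are chained.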

Data Generation with Atomic Templates
As illustrated in Figure 1, the workflow of the data augmentation is broken down into: 1) mapping act-slot-value triples to exemplars with atomic templates, and 2) generating the corresponding utterance depending on the atomic exemplars. Let $x = x_1 \cdots x_{|x|}$ denote the utterance (word sequence), and $y = \{y_1, \cdots, y_{|y|}\}$ denote the dialogue act (a set of act-slot-value triples). We wish to estimate $p(x|y)$, the conditional probability of utterance $x$ given dialogue act $y$.
However, there are also some disadvantages to directly using dialogue acts as inputs:
• The dialogue acts from different narrow domains may conflict, e.g. different slot names for the same meaning.
• Act and slot types may not be defined in a literal way, e.g. using arbitrary symbols like "city 1" and "city 2" to represent city names in different contexts.
Thus, it is hard to adapt the model $p(x|y)$ to new act types, slot types, and domains. We propose to interpret the dialogue act in short natural language, and then rephrase it into the corresponding user utterance. Because interpreting the dialogue act at sentence level costs as much as building a rule-based SLU system, we instead interpret act-slot-value triples with atomic templates, which require minimal human effort. Table 1 gives some examples of atomic templates used in the DSTC 2&3 dataset. Atomic templates produce a simple description (atomic exemplar $e_i$) in natural language for each act-slot-value triple¹ $y_i$. If there are multiple templates for triple $y_i$, which generate multiple atomic exemplars $E(y_i)$, we choose the most similar one, $e_i = \arg\max_{e \in E(y_i)} \mathrm{sim}(e, x)$, in the training stage, and randomly select one $e_i$ from $E(y_i)$ in the data augmentation stage. The similarity function $\mathrm{sim}(e, x)$ is computed with the Ratcliff-Obershelp algorithm (Black, 2004). Therefore, $p(x|y) = p(x|\{y_1 \cdots y_{|y|}\}) = p(x|\{e_1 \cdots e_{|y|}\})$.
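The exemplar selection rule can be sketched in a few lines of Python. This is a sketch under stated assumptions: the entries of `ATOMIC_TEMPLATES` are hypothetical and are not the templates of Table 1, and `difflib.SequenceMatcher` from the standard library is used as a stand-in for the similarity function, since its matching procedure is closely related to (but not guaranteed identical to) the Ratcliff-Obershelp algorithm.

```python
import random
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical phrase-level templates keyed by (act, slot).
ATOMIC_TEMPLATES = {
    ("inform", "food"): ["{value} food", "serves {value}"],
    ("deny", "food"): ["not {value}", "no {value} food"],
    ("request", "phone"): ["phone number", "what is the phone number"],
}


def exemplars(triple: Triple) -> List[str]:
    """E(y_i): every atomic exemplar the templates produce for one triple."""
    act, slot, value = triple
    return [t.format(value=value) for t in ATOMIC_TEMPLATES[(act, slot)]]


def sim(exemplar: str, utterance: str) -> float:
    """Similarity ratio in [0, 1]; stand-in for Ratcliff-Obershelp."""
    return SequenceMatcher(None, exemplar.lower(), utterance.lower()).ratio()


def select_exemplar(
    triple: Triple,
    utterance: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Training: most similar exemplar to the utterance. Augmentation: random one."""
    candidates = exemplars(triple)
    if utterance is None:
        return (rng or random).choice(candidates)
    return max(candidates, key=lambda e: sim(e, utterance))


if __name__ == "__main__":
    x = "Not Chinese but I want Thai food please"
    for y_i in [("deny", "food", "Chinese"), ("inform", "food", "Thai")]:
        print(y_i, "->", select_exemplar(y_i, x))
```

Passing an utterance corresponds to the training stage, where a reference sentence is available; omitting it corresponds to the data augmentation stage.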

Sentence generator
An encoder-decoder model is exploited to generate the utterance based on the set of atomic exemplars by estimating $p(x|\{e_1 \cdots e_{|y|}\})$.
1 It should be noted that an act-slot-value triple may have an empty value, e.g. "request(phone)" refers to asking for phone number. The triple may also need no slot, e.g. "bye()".
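As the set of atomic exemplars is unordered, we encode them independently. For each atomic exemplar $e_i = w_{i1} \cdots w_{iT_i}$ (a sequence of words with length $T_i$), we use a BLSTM to encode it. The hidden vectors are recursively computed at the $j$-th time step via:
$$\overrightarrow{h}_{ij} = f_{\mathrm{LSTM}}(\psi(w_{ij}), \overrightarrow{h}_{i,j-1})$$
$$\overleftarrow{h}_{ij} = b_{\mathrm{LSTM}}(\psi(w_{ij}), \overleftarrow{h}_{i,j+1})$$
$$h_{ij} = [\overrightarrow{h}_{ij}; \overleftarrow{h}_{ij}]$$
where $[\cdot; \cdot]$ denotes vector concatenation, $\psi(\cdot)$ is a word embedding function, $f_{\mathrm{LSTM}}$ is the forward LSTM function and $b_{\mathrm{LSTM}}$ is the backward one. The final hidden vectors of the forward and backward passes are utilized to represent each atomic exemplar $e_i$, i.e. $c_i = [\overrightarrow{h}_{iT_i}; \overleftarrow{h}_{i1}]$. After encoding all the atomic exemplars, we have a list of hidden vectors $H(y) = [h_{11} \cdots h_{1T_1}; \cdots; h_{|y|1} \cdots h_{|y|T_{|y|}}]$.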
An LSTM model serves as the decoder (Vinyals et al., 2015b), generating the utterance $x$ word-by-word:
$$p(x|\{e_1 \cdots e_{|y|}\}) = \prod_{t=1}^{|x|} p(x_t | x_{<t}, H(y))$$
Before the generation, the hidden vector of the decoder is initialized as $\frac{1}{|y|}\sum_{i=1}^{|y|} c_i$, the mean of the representations of all the atomic exemplars. To tackle OOV words, we adopt the pointer softmax (Gulcehre et al., 2016), enhanced with targeted feature dropout (Xu and Hu, 2018), which switches dynamically between generating a word and copying one from the input source.
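The idea of switching between generation and copying can be illustrated with a simplified soft mixture of two distributions at a single decoding step. This is a sketch under assumptions, not the exact pointer softmax formulation of Gulcehre et al. (2016): the switch probability `p_generate`, the vocabulary distribution, and the attention weights over source tokens are taken as given inputs, whereas in the real model they are produced by the decoder network.

```python
from typing import Dict, List


def mix_generate_and_copy(
    p_generate: float,
    vocab_probs: Dict[str, float],
    copy_attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Blend a vocabulary distribution with a copy distribution over source tokens.

    A word receives p_generate * P_vocab(word) plus (1 - p_generate) times the
    attention mass of every source position holding that word, so a source
    word missing from the vocabulary can still be produced by copying.
    """
    if len(copy_attention) != len(source_tokens):
        raise ValueError("one attention weight is needed per source token")
    mixed = {word: p_generate * p for word, p in vocab_probs.items()}
    for token, weight in zip(source_tokens, copy_attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_generate) * weight
    return mixed


if __name__ == "__main__":
    vocab_probs = {"i": 0.5, "want": 0.3, "food": 0.2}  # "sushi" is OOV here
    source_tokens = ["sushi", "food"]                    # words of an atomic exemplar
    copy_attention = [0.9, 0.1]
    mixed = mix_generate_and_copy(0.4, vocab_probs, copy_attention, source_tokens)
    print(max(mixed, key=mixed.get))
```

Because both input distributions sum to one, the mixture is again a valid distribution over the vocabulary extended with the source words.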

Data
In our experiments, we use the dataset provided for the second and third Dialog State Tracking Challenge (DSTC 2&3) (Henderson et al., 2014a,b). DSTC2 contains a large number of training dialogues (∼15.6k utterances) related to restaurant search while DSTC3 is designed to address the problem of adaptation to a new domain (tourist information) with only a small amount of seed data (dstc3 seed, 109 utterances). The manual transcriptions are used as user utterances to eliminate the impact of speech recognition errors.
We follow the same data partitioning policy as Zhu et al. (2014), which randomly selects one half of the DSTC3 test data as the oracle training set (∼9.4k utterances) and leaves the other half as the evaluation set (∼9.2k utterances).

Experimental Setup
Delexicalized triples are used for non-enumerable slots, like "inform(food=[food])" in Table 1. There are 41 and 35 delexicalized triples for DSTC2 and DSTC3 respectively. For each triple, we prepare two short templates on average. Note that, compared to designing sentence-level templates, writing atomic templates for triples at phrase level requires much less human effort.
For data augmentation of DSTC3, there are two ways to collect extra dialogue acts as inputs²:
• Seed abridgement
• Combination: As all possible triples are predefined in the domain ontology of DSTC3 by experts, we use a general policy of triple combination, which randomly selects at most $N_c$ triples to make up a dialogue act ($N_c$ is set to 3 empirically). Then we fill each non-enumerable slot in the dialogue act with a randomly chosen value of this slot; the process ends when each value appears at least $N_v$ times in all the collected dialogue acts ($N_v$ is set to 3 empirically).
After that, we have 1420 and 20670 dialogue acts from the seed abridgement and combination respectively. New data samples are generated starting from these dialogue acts, through the atomic templates and the sentence generator (e.g. the 1-best output is kept).
The SLU model and sentence generator use the same hyper-parameters. The dimension of word embeddings (initialized with GloVe³ word vectors) is 100 and the number of hidden units is 128. The dropout rate is 0.5 and the batch size is 20. The maximum norm for gradient clipping is set to 5, and the Adam optimizer is used with an initial learning rate of 0.001. All training consists of 50 epochs with early stopping on the development set. We report the F1-score of extracting act-slot-value triples, computed by the official scoring script from http://camdial.org/~mh521/dstc/.
2 The data splits, atomic templates and extra dialogue acts will be available at https://github.com/sz128/DAAT_SLU
3 http://nlp.stanford.edu/data/glove.6B.zip
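The combination policy can be sketched as a simple sampling loop. The sketch assumes that delexicalized triples carry a placeholder value of the form `[slot]` (as in "inform(food=[food])"), that the ontology is available as a list of triples plus a slot-to-values mapping, and that the function name, the `max_acts` safety cap, and the toy ontology are hypothetical; the released dialogue acts may have been produced differently.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def combine_dialogue_acts(
    ontology_triples: List[Triple],
    slot_values: Dict[str, List[str]],
    n_c: int = 3,
    n_v: int = 3,
    seed: int = 0,
    max_acts: int = 100_000,
) -> List[List[Triple]]:
    """Sample dialogue acts of at most n_c triples until every value of every
    delexicalized slot has been used at least n_v times."""
    rng = random.Random(seed)
    # Only slots that occur in a delexicalized triple can ever be filled.
    delex_slots = sorted({s for _, s, v in ontology_triples if v == f"[{s}]"})
    targets = [(s, v) for s in delex_slots for v in slot_values[s]]
    counts: Counter = Counter()
    acts: List[List[Triple]] = []
    while len(acts) < max_acts and any(counts[t] < n_v for t in targets):
        k = rng.randint(1, min(n_c, len(ontology_triples)))
        act = []
        for a, s, v in rng.sample(ontology_triples, k):
            if v == f"[{s}]":  # fill the non-enumerable slot with a random value
                v = rng.choice(slot_values[s])
                counts[(s, v)] += 1
            act.append((a, s, v))
        acts.append(act)
    return acts


if __name__ == "__main__":
    triples = [
        ("inform", "food", "[food]"),
        ("deny", "food", "[food]"),
        ("request", "phone", ""),
        ("inform", "area", "[area]"),
    ]
    values = {"food": ["thai", "chinese"], "area": ["north", "south"]}
    acts = combine_dialogue_acts(triples, values)
    print(len(acts), acts[:3])
```

Random sampling of this kind can put semantically conflicting triples into the same dialogue act; the sketch makes no attempt to filter them.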

Systems
We first compare two SLU models to explain why the HD model is chosen for dialogue act based SLU:
• ZS: zero-shot learning of SLU (Yazdani and Henderson, 2015), which can adapt to unseen dialogue acts.
• HD: the hierarchical decoding model (Zhao et al., 2019), which is adopted in our system.
We then compare the atomic templates (AT) with other data augmentation methods:
• Naive: For each value that appears both in an utterance of dstc3 seed and in its semantic labels, we replace it with a randomly selected value of the corresponding slot. The process ends when each value appears at least $N_v$ times.
• Human: Zhu et al. (2014) designed a large number of sentence-level templates for DSTC3, which requires substantial human effort.
• Oracle: The oracle training set is used which simulates the perfect data augmentation.
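The naive baseline in the list above amounts to value substitution on the seed samples, which can be sketched as follows. The data layout (an utterance paired with a list of act-slot-value triples), the exact-substring matching, and the `max_samples` cap are assumptions made for illustration rather than details given in the text.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Sample = Tuple[str, List[Triple]]


def naive_augment(
    seed_samples: List[Sample],
    slot_values: Dict[str, List[str]],
    n_v: int = 3,
    seed: int = 0,
    max_samples: int = 100_000,
) -> List[Sample]:
    """Swap a value that occurs in both utterance and labels for a random value
    of the same slot, until each value has been used at least n_v times."""
    rng = random.Random(seed)
    # Replacement sites: triples whose value literally appears in the utterance.
    sites = [
        (utt, triples, i)
        for utt, triples in seed_samples
        for i, (_, slot, value) in enumerate(triples)
        if value and slot in slot_values and value in utt
    ]
    targets = sorted(
        {(triples[i][1], v) for _, triples, i in sites for v in slot_values[triples[i][1]]}
    )
    counts: Counter = Counter()
    augmented: List[Sample] = []
    while sites and len(augmented) < max_samples and any(counts[t] < n_v for t in targets):
        utt, triples, i = rng.choice(sites)
        act, slot, old = triples[i]
        new = rng.choice(slot_values[slot])
        new_triples = list(triples)
        new_triples[i] = (act, slot, new)
        # Simple substring replacement; a real system would match token spans.
        augmented.append((utt.replace(old, new), new_triples))
        counts[(slot, new)] += 1
    return augmented


if __name__ == "__main__":
    seeds = [("i want thai food", [("inform", "food", "thai")])]
    values = {"food": ["thai", "chinese", "italian"]}
    for utterance, labels in naive_augment(seeds, values)[:3]:
        print(utterance, labels)
```

This kind of substitution changes only the slot values, so the set of dialogue act structures stays the same as in the seed data.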
Without data augmentation, the SLU models are pre-trained on DSTC2 dataset (source domain) and finetuned with dstc3 seed set. In our data augmentation method, the sentence generator based on atomic concepts is also pre-trained on DSTC2 dataset and finetuned with the dstc3 seed. The SLU model is first pre-trained on DSTC2 dataset, then finetuned with the augmented dataset and finally finetuned with the dstc3 seed.

Results and Analysis
The main results are illustrated in Table 2. We can see that:
1. The hierarchical decoding (HD) model gets better performance than the zero-shot learning (ZS) method of SLU.
2. The seed data dstc3 seed limits the power of the SLU model, and even the naive augmentation can enhance it.
3. Our data augmentation method with atomic templates (AT) improves the SLU performance dramatically. One reason may be that the generated data has a higher variety of semantic meanings than the naive augmentation. Combination can make up more dialogue acts and shows a better result than Seed abridgement, while Seed abridgement provides more realistic dialogue acts. Thus, their union gives the best result.
4. The best performance of our method is close to that of the human-designed sentence-level templates (Zhu et al., 2014), while our approach needs much less human effort.

Ablation Study
We conduct several ablation studies to analyze the effectiveness of different components of our method, as shown in Table 3. After removing SLU model pretraining on DSTC2 ("-dstc2") and finetuning on the seed data ("-dstc3 seed"), we see a significant decrease in SLU performance. When we subsequently remove the sentence generator ("-sentence generator", i.e. using the atomic exemplars directly as inputs of the SLU model), the SLU performance decreases by 10.3%. This suggests that the sentence generator produces more natural utterances than the raw atomic exemplars. If we replace the atomic exemplars with the corresponding act-slot-value triples ("-atomic templates"), the SLU performance drops sharply. The reason may be that the atomic templates provide a better description of the corresponding semantic meanings than the surface forms of the triples. Examples of generated data samples are in Appendix A.
Number of seed samples. Figure 2 shows the number of seed samples used versus SLU performance on the DSTC3 evaluation set. In the zero-shot case (no seed samples), our method is much better than the baseline. As the number of seed samples increases, our method consistently outperforms the baseline.
Table 3: SLU performances on the DSTC3 evaluation set when removing different modules of our method.

Conclusion
In this paper, we propose a new data augmentation method with atomic templates for SLU. The atomic templates provide exemplars in natural language for act-slot-value triples and require minimal human effort. An encoder-decoder model is exploited to generate utterances based on the atomic exemplars. We believe our method can also be applied to SLU systems with other semantic representations (e.g. semantic frames). Experimental results show that our method achieves significant improvements on the DSTC 2&3 dataset, and that it is very effective for SLU domain adaptation with limited data.