Self-Learning Architecture for Natural Language Generation

In this paper, we propose a self-learning architecture for generating natural language templates for conversational assistants. Generating templates to cover all the combinations of slots in an intent is time-consuming and labor-intensive. To reduce the human labor required for template generation, we examine three models based on our proposed architecture in the IoT domain: a Rule-based model, a Sequence-to-Sequence (Seq2Seq) model and a Semantically Conditioned LSTM (SC-LSTM) model. We demonstrate the feasibility of template generation for the IoT domain using our self-learning architecture. In both automatic and human evaluation, the self-learning architecture outperforms previous works trained with a fully human-labeled dataset. This is promising for commercial conversational assistant solutions.


Introduction
Intelligent conversational assistants are now prevalent and have been integrated into a wide range of IoT devices. Various recent studies in NLG have investigated stochastic language generation. Although these stochastic approaches outperform traditional LSTM language models, they have drawbacks, including the need for large training datasets, the low accuracy of trained models, and a lack of naturalness in the generated sentences. Therefore, most commercial conversational assistant services adopt a template-based approach (Cheyer and Guzzoni, 2014; Mirkovic and Cavedon, 2011) to implement natural language generation. This approach is robust and feasible for commercialization, but it requires creating a large number of templates: one needs to manually write NLG templates that cover all possible combinations of intents and slots.[1] There are statistical approaches for generating templates with 2, 3 and 4 slots (Narayan et al., 2011), but they suffer from exponential complexity as the number of slots increases. In addition, a substitution-based implementation with SimpleNLG (Gatt and Reiter, 2009) can handle some of the cases, but it is not versatile enough. The number of templates required to cover all possible combinations of slots is

\[ \sum_{i=1}^{n} \binom{n}{i} = 2^n - 1 \tag{1} \]

where n is the total number of possible slots. As equation 1 shows, this value is exponential in n. Manually generating an exponential number of templates is undesirable, especially when the intents get more complex. We propose a self-learning architecture for NLG which solves the problem of generating an exponential number of templates. To generate informative and natural sentences, it is arguably more important to generate consistent system responses with limited syntactic variety than to generate error-prone system responses with more variation in their grammatical form.
In our proposed solution, we start with an initial training set containing at most 2 slots per intent, and iteratively build our model to increase its ability to cover more complex inputs. Thus, the required number of human-generated templates decreases from $2^n - 1$ to $\binom{n}{1} + \binom{n}{2} = O(n^2)$.

[1] Throughout this paper, intent denotes the intention of the user utterance, and slot denotes the variable part (slot value) and its name, if any (slot name). A slot can be replaced by another phrase in user utterances or responses. The NLG response is generated from a dialog act, which is written as Intent(SlotName1=SlotValue1; SlotName2=SlotValue2; ...). For example, from the dialog act up(functionname=temperature;devicename='airconditioner';location='bedroom'), the response "I turned up the temperature of the airconditioner in the bedroom" is generated.
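The gap between the two template counts discussed in the Introduction can be checked numerically. The following is a minimal sketch; the function names are ours and purely illustrative.

```python
from math import comb

def full_template_count(n):
    """Templates needed to cover every non-empty slot combination: 2^n - 1."""
    return sum(comb(n, i) for i in range(1, n + 1))

def seed_template_count(n):
    """Templates needed when humans write only the 1-slot and 2-slot cases."""
    return comb(n, 1) + comb(n, 2)

# Compare both counts for a few values of n (the number of possible slots).
for n in (3, 5, 10):
    print(n, full_template_count(n), seed_template_count(n))
```

For n = 10 the full coverage requires 1023 templates, while the one- and two-slot seed set requires 55.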

Related Work
Several approaches have been studied for NLG systems. The template-based approach (McRoy et al., 2003; Channarukul et al., 2002) is most commonly used, because it gives complete control over the output quality. However, that approach has two main disadvantages. First, it is labor-intensive to generate and maintain templates (Galley et al., 2001). Second, it does not scale well when domains are changed or expanded (Channarukul et al., 2002; Reiter, 1996). Recently, various stochastic NLG methods such as Sequence to Sequence generation with attention (Seq2Seq w/ attn) (Dušek and Jurčíček, 2016) and Semantically Conditioned LSTM (SC-LSTM) (Wen et al., 2015) have been studied to overcome the disadvantages of template-based approaches. They also aim to remove the need for manual alignment of the training dataset. They are trained directly on a data corpus and reduce the human effort required for producing templates.
Other End-to-End deep learning approaches have been studied (Gehrmann et al., 2018), along with domain adaptation (Dethlefs, 2017), to solve the problem of domain specificity. Though they produce state-of-the-art results among stochastic approaches, the generated sentences are not natural enough for real-world conversational assistant services, where even the slightest mistakes can be detrimental. Another hurdle while training these models is the issue of finding the right dataset for training (Gatt and Krahmer, 2018). We show that our proposed self-learning architecture can outperform existing neural models and greatly reduce human effort compared to template-based approaches without compromising much on output quality.

Self-Learning Architecture for NLG
In this section, we explain the proposed architecture that addresses the problem discussed above. We also briefly explain the models that we use for response generation, namely the Rule-based, Seq2Seq and SC-LSTM models. Initially, the training dataset contains dialog acts for all combinations of slots with at most k = 2 slots per training instance. As a preparation step, we manually generate this initial training dataset as (DialogAct, Response) pairs for all combinations of slots with up to two slots per instance. Starting with k = 2, we then apply Algorithm 1.
Thus, we have successfully increased the size of our training dataset. Sample data for successive training steps is shown in Table 1. We experimented with three different models for the task of generating responses from Dialog Acts: a Rule-based model, a Seq2Seq model and an SC-LSTM model. Figure 1 depicts the flow of the self-learning architecture.
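The iterative augmentation described in this section can be sketched as follows. This is only an illustration under stated assumptions: `self_learning`, `toy_train`, and `ORDER` are hypothetical names, dialog acts are reduced to sets of slot names, and `toy_train` is a placeholder that concatenates memorised single-slot phrases. It stands in for a real Rule-based, Seq2Seq or SC-LSTM model and is not one of them.

```python
from itertools import combinations

# Hypothetical fixed phrase order used only by the toy model below.
ORDER = ["functionname", "devicename", "location"]

def self_learning(seed, slots, max_slots, train, k=2):
    """Grow the training set from k-slot up to max_slots-slot dialog acts."""
    data = dict(seed)  # dialog act (frozenset of slot names) -> response
    while k < max_slots:
        model = train(data)  # model trained on acts with up to k slots
        for combo in combinations(slots, k + 1):
            act = frozenset(combo)
            data[act] = model(act)  # inferred response joins the training set
        k += 1
    return data

def toy_train(data):
    """Placeholder trainer: memorise single-slot phrases and concatenate them."""
    phrases = {next(iter(act)): resp for act, resp in data.items() if len(act) == 1}
    return lambda act: " ".join(phrases[s] for s in sorted(act, key=ORDER.index))

slots = list(ORDER)
seed = {
    frozenset(["functionname"]): "turn up the temperature",
    frozenset(["devicename"]): "of the airconditioner",
    frozenset(["location"]): "in the bedroom",
}
# Two-slot seed pairs; in the paper these are written by humans,
# here they are derived from the toy model to keep the sketch short.
for pair in combinations(slots, 2):
    seed[frozenset(pair)] = toy_train(seed)(frozenset(pair))

grown = self_learning(seed, slots, max_slots=3, train=toy_train)
print(grown[frozenset(slots)])
```

Passing the trainer in as a callable keeps the loop independent of the model, which mirrors how the same architecture is used with all three models.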

Rule-Based Approach
Algorithm 1 Self-Learning Architecture
while k < max_slots do
    Train a new language generation model that generates responses from dialog acts containing up to k slots
    Generate responses containing k + 1 slots using Dialog Acts containing k + 1 slots and the model trained above
    Augment the training dataset with the inferred sentences containing k + 1 slots from the previous step
    k = k + 1
end while

This method was inspired by (Filippova, 2010), where shortest path finding algorithms are used to summarize multiple sentences.
Although that approach is straightforward to understand and scores well on metrics such as the longest common subsequence (LCS) metric used by (Zhao et al., 2002), its objective differs from ours: we need to conserve all information from the source sentences. Consequently, we use the shortest common supersequence (SCS) for our task. Finding the SCS of more than two sentences is an NP-complete problem (Räihä and Ukkonen, 1981); therefore, we simplify it by considering the SCS of two sentences.
Assumption 1: If sentence X is a good response for slot combination A and sentence Y is one for combination B, then their SCS Z will be a good response for slot combination A ∪ B. An example of A, B, X and Y is:
    A: up(functionname=temperature;devicename='airconditioner')
    B: up(functionname=temperature;location='bedroom')
    X: I will turn up the temperature for the airconditioner
    Y: I will turn up the temperature in the bedroom
The SCS of X and Y is "I will turn up the temperature for the airconditioner in the bedroom", which is a plausible response sentence for A ∪ B.
Assumption 2: SCS Z will be an even better response if |A \ B| and |B \ A| are small. Based on Assumptions 1 and 2 above, it is natural to think of a simple algorithm (see Algorithm 2) which generates a natural response given a slot combination A. There can be numerous candidates, so we opt for the shortest one.

Algorithm 2 (partial listing)
    ...
    for (0 ≤ i < j < |A|) do
        ...
    return Y
end procedure
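The two-sentence SCS step behind Assumption 1 can be sketched with standard dynamic programming. The sketch assumes whitespace tokenization at the word level; the function names are ours, and this is not a reconstruction of Algorithm 2.

```python
def shortest_common_supersequence(x, y):
    """Return one shortest common supersequence of token lists x and y."""
    m, n = len(x), len(y)
    # dp[i][j] = length of the SCS of the suffixes x[i:] and y[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j
            elif j == n:
                dp[i][j] = m - i
            elif x[i] == y[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to rebuild one optimal supersequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if x[i] == y[j]:
            out.append(x[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(x[i])
            i += 1
        else:
            out.append(y[j])
            j += 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out

def is_subsequence(sub, seq):
    """True if the tokens of sub appear in seq in the same order."""
    it = iter(seq)
    return all(tok in it for tok in sub)

X = "I will turn up the temperature for the airconditioner".split()
Y = "I will turn up the temperature in the bedroom".split()
Z = shortest_common_supersequence(X, Y)
print(" ".join(Z))
```

The sketch only guarantees that Z contains both X and Y as subsequences and is as short as possible. At the word level, a repeated function word such as "the" can be shared between the two sentences, so the strictly shortest result may be less fluent than a slightly longer supersequence; the unit of tokenization and the tie-breaking rule therefore matter in practice.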

Sequence to Sequence with Attention
The Seq2Seq model (Sutskever et al., 2014) has been widely used in many machine learning tasks. For our task, we use an LSTM-based encoder and decoder model with an attention mechanism (Dušek and Jurčíček, 2016). The LSTM encoder takes a sequence of Dialog Acts as input, where each slot is a symbol in the vocabulary. At the end of decoding, Dušek and Jurčíček use an LSTM-based re-ranker to calculate slot errors and penalize responses whose slots do not match the input Dialog Act.
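The idea of penalizing candidate responses for slot errors can be illustrated with a much simpler stand-in. The sketch below replaces the learned LSTM-based re-ranker with plain substring matching, which is our simplification and not the method of Dušek and Jurčíček; the function names, the candidate scores, and the slot values "kitchen" and "tv" are all hypothetical.

```python
def slot_errors(act_values, known_values, response):
    """Count slot values missing from, or wrongly present in, a response."""
    text = response.lower()
    missing = [v for v in act_values if v.lower() not in text]
    spurious = [v for v in known_values
                if v not in act_values and v.lower() in text]
    return len(missing) + len(spurious)

def rerank(act_values, known_values, scored_responses, penalty=1.0):
    """Return the response with the best score after the slot-error penalty."""
    def adjusted(item):
        response, score = item
        return score - penalty * slot_errors(act_values, known_values, response)
    return max(scored_responses, key=adjusted)[0]

act_values = ["temperature", "airconditioner", "bedroom"]
known_values = act_values + ["kitchen", "tv"]  # hypothetical extra values
candidates = [  # (response, decoder score); the scores are made up
    ("I turned up the temperature of the airconditioner", -0.5),
    ("I turned up the temperature of the airconditioner in the bedroom", -0.9),
    ("I turned up the temperature of the airconditioner in the kitchen", -0.7),
]
best = rerank(act_values, known_values, candidates)
print(best)
```

Here the candidate that mentions every slot value of the Dialog Act and no other value wins, even though its raw score is lower.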

Semantically Conditioned LSTM
The SC-LSTM model (Wen et al., 2015) deals with the issue of repetitive word generation in LSTM-based NLG. It receives the Dialog Act in the form of a bit vector, where each bit in the vector denotes whether a particular slot-value pair exists in the Dialog Act. The model mitigates the repetition issue by using an additional control cell along with the standard LSTM cell. The control cell helps produce a surface realization that accurately encodes the input information and helps cut off repetitive words.
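The bit-vector input can be illustrated with a small encoder. This is a sketch under stated assumptions: the function names are ours, the intent is ignored, and the slot values "tv" and "kitchen" are hypothetical additions used only to give the vector some zero bits.

```python
def build_index(slot_value_pairs):
    """Assign each known (slot name, slot value) pair a fixed bit position."""
    return {pair: i for i, pair in enumerate(sorted(slot_value_pairs))}

def encode(dialog_act, index):
    """Return a 0/1 list with a 1 for every slot-value pair in the dialog act."""
    vector = [0] * len(index)
    for pair in dialog_act.items():
        vector[index[pair]] = 1
    return vector

known_pairs = [
    ("devicename", "airconditioner"),
    ("devicename", "tv"),        # hypothetical value
    ("functionname", "temperature"),
    ("location", "bedroom"),
    ("location", "kitchen"),     # hypothetical value
]
index = build_index(known_pairs)
act = {
    "functionname": "temperature",
    "devicename": "airconditioner",
    "location": "bedroom",
}
print(encode(act, index))  # [1, 0, 1, 1, 0]
```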

Dataset for Initial Training
We use a small domain-specific dataset for our experiment. Our dataset is derived from the one used by (Georgila et al., 2018), focused on the domain of IoT Home Appliances. We extract a subset of the system responses, extend them for increased coverage, and generate (DialogAct, Response) pairs. The generated (DialogAct, Response) pairs are given to our

Evaluation Metrics
To evaluate our models, we measure BLEU scores and conduct a human evaluation. The BLEU score measures the n-gram overlap between the reference response and the generated response; a higher BLEU score indicates a better model. The whole test set was used to measure BLEU scores. For the human rating, we asked 21 human evaluators to rank the outputs of all five models, considering the grammatical correctness and informativeness of the generated responses. We then calculated the Mean Rank for each model; a lower Mean Rank indicates a better model. Ten randomly chosen responses with 3 to 5 slots were used for the human rating.
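Both metrics can be illustrated with a short sketch. The first function computes clipped n-gram precision, which is the building block of BLEU rather than the full metric (no brevity penalty, no combination across n-gram orders); the second computes the Mean Rank. The function names are ours, and the rankings in the example are invented for illustration and are not results from this paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def mean_ranks(rankings):
    """rankings: one dict per evaluator, mapping model name -> rank (1 = best)."""
    totals = {}
    for ranking in rankings:
        for model, rank in ranking.items():
            totals[model] = totals.get(model, 0) + rank
    return {model: total / len(rankings) for model, total in totals.items()}

reference = "I turned up the temperature of the airconditioner in the bedroom".split()
candidate = "I turned up the temperature in the bedroom".split()
print(clipped_precision(candidate, reference, 1))
print(clipped_precision(candidate, reference, 2))

# Invented rankings from three hypothetical evaluators.
rankings = [
    {"model_a": 1, "model_b": 2, "model_c": 3},
    {"model_a": 2, "model_b": 1, "model_c": 3},
    {"model_a": 1, "model_b": 3, "model_c": 2},
]
print(mean_ranks(rankings))
```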

Results
The results can be found in Table 2 and Table 3. Using our self-learning architecture, we were able to train both deep learning models successfully from a small amount of data. The human-labeled SC-LSTM model has better scores than the corresponding Seq2Seq model. When the self-learning architecture is used, SC-LSTM outperforms all the other models. We think the SC-LSTM model is more suitable for learning from structured data than the other models are. This is because the SC-LSTM model uses an additional gated cell, which plays the role of sentence planning, to produce a surface realization that accurately encodes the input information. When the model encounters a Dialog Act with more slot-value pairs than those in the training dataset, it recognizes the extra slot and generates the response. The Seq2Seq model captures the same relation using attention. However, the Seq2Seq model could not outperform the SC-LSTM model when it encountered a Dialog Act with more slots than the model was originally trained on, and it therefore fails on our self-learning task. The Rule-based model produces reasonable outputs, but its output depends heavily on the nature of the input data and requires human effort. One interesting observation is that the BLEU score was highest when the maximum number of slots was used. We think the BLEU score can increase when the number of target slots rises above a certain level, because the dataset with more slots has more training responses and covers more complex combinations of slots.

Conclusion
This paper presents a self-learning architecture for NLG. We experimented with three different models in the IoT domain: Rule-based, Seq2Seq and SC-LSTM. Our data-efficient architecture not only reduces human effort, but also outperforms models trained on the fully human-labeled dataset. We think the proposed method can reduce the effort of building large-scale NLG systems for commercial conversational assistant solutions. In future work, we plan to apply other neural models to our architecture and extend the architecture to cover multi-domain generation tasks.