Multi-task Learning for Natural Language Generation in Task-Oriented Dialogue

In task-oriented dialogues, Natural Language Generation (NLG) is the final yet crucial step to produce user-facing system utterances. The result of NLG is directly related to the perceived quality and usability of a dialogue system. While most existing systems provide semantically correct responses given goals to present, they struggle to match the variation and fluency of human language. In this paper, we propose a novel multi-task learning framework, NLG-LM, for natural language generation. In addition to generating high-quality responses that convey the required information, it explicitly targets naturalness in generated responses via an unconditioned language model. This can significantly improve the learning of style and variation in human language. Empirical results show that this multi-task learning framework outperforms previous models across multiple datasets. For example, it improves the previous best BLEU score on the E2E-NLG dataset by 2.2%, and on the Laptop dataset by 6.1%.


Introduction
Natural Language Generation (NLG) is the final procedure in the pipeline of task-oriented dialogues. As the result of NLG is directly facing users, its readability and informativeness have a direct impact on users' perception of the entire dialogue system. On one hand, the response must contain the desired information, referred to as meaning representation (MR), in order to provide or request a user's information. On the other hand, the system response needs to mimic the fluency and variation in human language to improve the user experience. To this end, there have been numerous studies on methods to generate natural responses for task-driven dialogues.
Early work primarily employs predefined rules or syntax (Cheyer and Guzzoni, 2014; Langkilde and Knight, 1998). Though these frameworks can provide adequate information, their lack of naturalness and variation in language makes the responses rather rigid. Moreover, these methods usually require non-trivial manual work to create templates, rendering them unscalable across domains.
Recently, corpus-based methods have gained considerable popularity in natural language generation (Wen et al., 2015a,b; Dušek and Jurčíček, 2016). With the increasing availability of rich dialogue task data, corpus-based frameworks design end-to-end trainable systems. With minimal human effort, these methods directly learn the patterns and styles of human responses from the data, while conveying the required task-specific meaning representation information. Furthermore, the boom in deep learning for natural language processing increases these models' capacity to generate sophisticated human-like responses. For instance, Dušek and Jurčíček (2016) employ a sequence-to-sequence structure and an attention mechanism to generate response tokens from the MR sequence. Wen et al. (2015b) use a semantic control vector integrated into an LSTM to guide the response generation process. Li et al. (2015) use maximum mutual information as the objective function to generate diverse and appropriate responses. Wen et al. (2016) propose data counterfeiting to reduce the complexity of transferring trained parameters across multiple domains. However, it remains a challenge in task-oriented dialogue systems to generate truly natural utterances that are indistinguishable from a human's response.
On the other hand, language modeling is a technique typically employed to learn language patterns from text. It has been successfully used to generate natural and semantically sound utterances for text summarization, speech recognition and other NLP tasks (Roark et al., 2004; Rush et al., 2015). As task-oriented dialogue datasets usually contain rich human responses, leveraging language modeling has great potential to boost an NLG model's capacity to mimic human language.
Motivated by recent successes of multi-task learning in NLP (Collobert et al., 2011; Xu et al., 2018), we propose a multi-task scheme to tackle natural language generation in task-oriented dialogues. For the NLG task, we employ a sequence-to-sequence framework. The decoder uses an attention mechanism to carry over information from the encoder on an MR sequence. Therefore, the NLG task generates a response conditioned on the input MR.
The primary contribution of our work is to incorporate a language modeling task on human-generated responses as an unconditioned complementary process that brings in more language-related elements, without the intervention of the required MR information. Furthermore, the unsupervised nature of language modeling means we do not need additional labelled data. Thus, under a multi-task learning framework, we simultaneously train the NLG and language modeling tasks. To facilitate multi-task learning, we carry out the language modeling task in the decoder, where it partially shares parameters with the NLG task.
To evaluate the effectiveness of our model, NLG-LM, we conduct evaluation on 5 task-oriented dialogue NLG tasks: the E2E-NLG (Novikova et al., 2017), TV, Laptop, Hotel and Restaurant datasets (Wen et al., 2015b, 2016). NLG-LM achieves new state-of-the-art results on all 5 datasets. For example, it outperforms Slug (Juraska et al., 2018), the best model in the E2E-NLG competition, by 2.2% in BLEU score. Ablation studies show that introducing the language modeling task during training improves the result by 2.4% in BLEU score on average.

Problem Description
In task-oriented dialogues, the natural language generation (NLG) process produces system utterances as natural language, given system-generated meaning representations from previous steps in the pipeline. Each MR is a slot-value pair, where the slot indicates the category of the information to convey and the value represents the content. For example, (area, city south) is a meaning representation, and the corresponding utterance should indicate city south as the area information.
In addition to meaning representations, dialogue acts (DA) are given to differentiate between types of system actions. Typical examples of dialogue acts include inform, request and confirm. For a given meaning representation, the NLG process should generate different utterances for different dialogue acts. For instance, the confirm dialogue act usually leads to a system response starting with "Let me confirm" or "Correct me if I'm wrong".
In task-oriented dialogues, NLG is framed as a supervised learning problem. Given training samples $(d_i, r_i, u_i)$, where $d_i$ is the dialogue act, $r_i$ is the set of meaning representations, and $u_i$ is a sample utterance generated by human labellers, the goal is to generate an utterance $u$ given a new pair of dialogue act $d$ and meaning representations $r$.
As certain types of meaning representation contain entities like location names and product types that are usually proper nouns, we use the delexicalization technique to replace values with a special slot token, <slot name>, during training and generation. The final response is obtained via a reverse lexicalization process that replaces slot tokens with their corresponding values.
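To make the two steps concrete, the following is a minimal Python sketch of delexicalization and the reverse lexicalization step. The `<slot>` token format, the helper names `delexicalize` and `relexicalize`, and the sample utterance are assumptions made for illustration; they are not taken from the authors' preprocessing code.

```python
def delexicalize(utterance, mr, slots):
    """Replace the value of each listed slot with a placeholder token."""
    for slot in slots:
        value = mr.get(slot)
        if value:
            utterance = utterance.replace(value, f"<{slot}>")
    return utterance


def relexicalize(utterance, mr):
    """Reverse step: substitute slot tokens with their actual values."""
    for slot, value in mr.items():
        utterance = utterance.replace(f"<{slot}>", value)
    return utterance


# Hypothetical example built around the (area, city south) pair from the text.
mr = {"area": "city south", "kidsallowed": "yes"}
text = "The restaurant is in the city south area and allows kids."

# Only the area slot is delexicalized; the binary slot is left untouched.
delex = delexicalize(text, mr, slots=["area"])
restored = relexicalize(delex, mr)
```

Plain string replacement is the simplest possible choice here; a real pipeline would likely need to handle casing and partial matches.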

The NLG task
We approach the NLG problem using the sequence-to-sequence method (Sutskever et al., 2014). Compared with SC-LSTM (Wen et al., 2015b), this method does not need to create an additional one-hot MR vector, and can be much more easily extended across different domains with varying meaning representations.
We first concatenate dialogue act d and meaning representations r as a single input sequence I with m tokens. The output sequence O is similarly obtained from the given utterance u, with n tokens. Both sequences are delexicalized. We put special sentence tokens BOS and EOS around each sequence.
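One possible way to build the two sequences is sketched below in Python. The text does not specify how the dialogue act and slot-value pairs are linearized, so the ordering (act first, then each slot followed by its value), the `<BOS>`/`<EOS>` spellings, and the function names are all assumptions.

```python
BOS, EOS = "<BOS>", "<EOS>"


def build_input_sequence(dialogue_act, mr, delex_slots):
    """Flatten a dialogue act and its slot-value pairs into one token list."""
    tokens = [BOS, dialogue_act]
    for slot, value in mr.items():
        tokens.append(slot)
        # Delexicalized slots carry a placeholder instead of the real value.
        tokens.append(f"<{slot}>" if slot in delex_slots else value)
    tokens.append(EOS)
    return tokens


def build_output_sequence(delexicalized_utterance):
    """Wrap the whitespace-tokenized utterance in sentence boundary tokens."""
    return [BOS] + delexicalized_utterance.split() + [EOS]


# Hypothetical sample: the area slot is delexicalized, the binary slot is not.
input_tokens = build_input_sequence(
    "inform", {"area": "city south", "kidsallowed": "yes"}, {"area"}
)
output_tokens = build_output_sequence("it is in the <area> area and allows kids")
```

Here `len(input_tokens)` plays the role of m and `len(output_tokens)` the role of n.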
The goal is to generate output tokens one at a time, given previously predicted tokens and the input sequence. This can be modeled as maximizing the conditional probability distribution:
$$p(w_1, \dots, w_n \mid I) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}; I) \quad (1)$$
To do this, we employ the encoder-decoder method.
Encoder. We train a dictionary $D$ to map each token to a fixed-length vector of dimension $d$. The input embedding sequence then goes into a bidirectional RNN layer to produce contextualized embeddings. We use GRU (Cho et al., 2014) as the RNN unit and sum up the forward and backward RNN outputs. The output of the encoder is denoted by $(u_1, \dots, u_m) \in \mathbb{R}^{d_h \times m}$, where $d_h$ is the RNN's output dimension and $m$ is its input sequence length.
Decoder. The decoder employs an RNN with an attention mechanism to generate tokens one at a time. It starts with the beginning-of-sentence token and uses the final hidden state of the encoder RNN as its initial hidden state. In the $t$-th step, we use the same dictionary $D$ from the encoder to map the $t$-th output token into a vector $s_t$ and apply dropout. Then, given the previous hidden state $h_{t-1}$, the decoder first computes attention weights $\{\alpha_i\}_{i=1}^{m}$ over the encoder outputs, with parameters $W_1 \in \mathbb{R}^{d_h \times 2d_h}$ and $b \in \mathbb{R}^{d_h}$. The weights are then applied to the encoder outputs to obtain the context vector $c = \sum_{1 \le i \le m} \alpha_i u_i$.
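The weighted sum that yields the context vector can be sketched in plain Python as follows. The attention scoring function is not given in the text, so this sketch takes raw scores as input and assumes they are normalized with a softmax; the function names and toy numbers are illustrative only.

```python
import math


def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_outputs):
    """Compute c = sum_i alpha_i * u_i over the m encoder outputs."""
    dim = len(encoder_outputs[0])
    return [
        sum(w * u[k] for w, u in zip(weights, encoder_outputs))
        for k in range(dim)
    ]


# Toy case: m = 3 encoder outputs of dimension d_h = 2, with equal scores.
scores = [0.0, 0.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = softmax(scores)
c = context_vector(alphas, encoder_outputs)
```

With equal scores the weights are uniform, so the context vector is simply the mean of the encoder outputs.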
The context vector $c$ and the embedded vector $s_t$ are then concatenated and sent into the decoder GRU, which produces an output $g \in \mathbb{R}^{d_h}$ and a new hidden state $h_t \in \mathbb{R}^{d_h}$.
To generate the next token, we reuse the dictionary $D$ through its transposed weights $W_D$ (supposing $D \in \mathbb{R}^{|V| \times d}$, its transposed weights are $W_D \in \mathbb{R}^{d \times |V|}$). We again integrate the context vector to fuse in contextual information, using a parametrized matrix $W_2 \in \mathbb{R}^{d \times 2d_h}$. The result, $p_t$, is the probability distribution of the next token over all tokens in the dictionary.
The loss function is cross entropy. Let $y_t$ be the one-hot vector for the ground truth at the $t$-th step; the loss for each training sample sequence pair is then the cross entropy between $y_t$ and $p_t$ over the $n$ steps.
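A form consistent with the dimensions stated above can be sketched as follows. This is an assumption for illustration rather than the paper's confirmed equations: the concatenation order $[g; c]$, the use of a softmax, the summation over steps, and the symbol $\mathcal{L}_{NLG}$ are all hypothetical. Note that $W_D^{\top} = D \in \mathbb{R}^{|V| \times d}$, so the product maps $[g; c] \in \mathbb{R}^{2d_h}$ to a vector over the vocabulary.

```latex
p_t = \mathrm{softmax}\left( W_D^{\top} \, W_2 \, [g; c] \right),
\qquad
\mathcal{L}_{NLG} = - \sum_{t=1}^{n} y_t^{\top} \log p_t
```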

Coupling with Language Model
The encoder-decoder approach above incorporates the information from dialogue act and meaning representation at each step via attention. However, due to this mechanism, the generated utterance inevitably relies to a great extent on the input sequence, focusing less on the fluency and variation of human language, which is as important as conveying the required information in task-oriented dialogues.
On the other hand, language modeling is typically used to characterize the naturalness of words, phrases and sentences. A well-trained language model assigns natural and semantically sound utterances higher scores than rigid and unnatural sentences. In deep learning, the language modeling task is often solved by a recurrent neural network. However, instead of depending on an input sequence as in Equation (1), the probability of the next token in a language model relies only on the preceding words:
$$p_{LM}(w_1, \dots, w_n) = \prod_{t=1}^{n} p_{LM}(w_t \mid w_1, \dots, w_{t-1}) \quad (8)$$
We propose that by integrating language modeling into the NLG process as an additional objective, the generated sentences will better approximate the style and variation of human responses.
To do this, we add another GRU unit, $\mathrm{GRU}_{LM}$, to the decoder. It has its own hidden state $h^{LM}_{t-1}$ and takes the embedded vector $s_t$ as input. Its output is $g^{LM}$ and its new hidden state is $h^{LM}_t$. The probability distribution of the next token in the language model is computed from $g^{LM}$ with $W_2[:d_h]$, the first $d_h$ columns of $W_2$. As we can see, the context $c$ does not affect the probability computation for language modeling. The language model has its own loss function. Finally, we linearly combine the two loss functions into a single multi-task loss function.
We depict our model structure in Figure 1. As shown, the dictionary $D$ is shared between the NLG task and the language modeling task.
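The parameter sharing and the combined objective can be illustrated with a small pure-Python sketch. Several details are assumptions, not statements from the text: the NLG branch applies $W_2$ to the concatenation $[g; c]$ in that order, the language model loss is also a cross entropy, and the two losses are mixed as `alpha * L_NLG + (1 - alpha) * L_LM`. The text only says the losses are combined linearly with a multi-task coefficient.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def nlg_projection(W2, g, c):
    """NLG branch: apply W2 to the concatenation [g; c] (assumed order)."""
    return matvec(W2, g + c)


def lm_projection(W2, g_lm, d_h):
    """LM branch: use only the first d_h columns of W2, so c plays no role."""
    W2_lm = [row[:d_h] for row in W2]
    return matvec(W2_lm, g_lm)


def cross_entropy(step_distributions, target_ids):
    """Sum of negative log-probabilities of the ground-truth tokens."""
    return -sum(math.log(p[y]) for p, y in zip(step_distributions, target_ids))


def multitask_loss(nlg_probs, lm_probs, target_ids, alpha=0.5):
    """Assumed convex combination of the two losses with coefficient alpha."""
    loss_nlg = cross_entropy(nlg_probs, target_ids)
    loss_lm = cross_entropy(lm_probs, target_ids)
    return alpha * loss_nlg + (1.0 - alpha) * loss_lm


# Toy shapes: d = 2 and d_h = 2, so W2 has 2 rows and 2 * d_h = 4 columns.
W2 = [[1.0, 2.0, 3.0, 4.0],
      [0.0, 1.0, 0.0, 1.0]]
nlg_out = nlg_projection(W2, g=[1.0, 1.0], c=[1.0, 1.0])
lm_out = lm_projection(W2, g_lm=[1.0, 1.0], d_h=2)

# Toy per-step distributions over a 2-token vocabulary.
targets = [0, 1]
nlg_probs = [[0.5, 0.5], [0.25, 0.75]]
lm_probs = [[0.5, 0.5], [0.5, 0.5]]
loss = multitask_loss(nlg_probs, lm_probs, targets, alpha=0.5)
```

Because the language model branch reuses a slice of $W_2$ and the shared dictionary, it adds only the extra GRU's parameters to the model.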

Datasets and settings
We evaluated the models on five datasets from different domains, covering restaurant booking, hotel booking and retail. The largest dataset is from the E2E-NLG task (Novikova et al., 2017), consisting of 51.2K MR-utterance pairs in the restaurant domain. We also use the four datasets from RNN-LG (Wen et al., 2016), which cover dialogue scenarios in TV retail, laptop retail, hotel booking and restaurant booking, with 14.1K, 26.5K, 8.7K and 8.5K samples respectively. For fairness, we use the official evaluation scripts from E2E-NLG and RNN-LG (Wen et al., 2016) to assess models. We use the BLEU-4 (Papineni et al., 2002) and NIST (Przybocki et al., 2009) metrics.
Delexicalization. In the experiments, we do not delexicalize slots that have binary values or are inappropriate for verbatim substitution. For instance, in the E2E-NLG dataset, we only delexicalize the name and near slots. For the TV dataset, we delexicalize all slots except hasusbport. In the Laptop dataset, we delexicalize all slots except isforbusinesscomputing and request. In the Hotel dataset, we delexicalize all slots except acceptscreditcards, dogsallowed and hasinternet. In the Restaurant dataset, we delexicalize all slots except kidsallowed and request.
Baseline. Our baseline models include TGen (Dušek and Jurčíček, 2016), SC-LSTM (Wen et al., 2015b), RALSTM (Tran and Nguyen, 2017) and Slug (Juraska et al., 2018).
Training details. We use Adamax (Kingma and Ba, 2014) as the optimizer. We use teacher forcing: during training, the decoder is always presented with the previous ground-truth token. The language modeling task is used only during training, and it uses utterances from the same batch as the NLG task. Inference uses beam search with a width of 10. We use the multi-task coefficient α = 0.5 in all experiments. The hyperparameters, shown in Table 3, were chosen on the dev set with early stopping.
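For readers unfamiliar with beam search, the following is a minimal Python sketch of the decoding procedure. It is not the authors' implementation: the `step_fn` interface, the absence of length normalization, the `max_len` cutoff, and the toy transition table are assumptions chosen to keep the example self-contained.

```python
import math


def beam_search(step_fn, bos, eos, beam_width=10, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict mapping each candidate next token
    to its probability given the prefix.
    """
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [tok], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        # Keep the top hypotheses; completed ones leave the active beam.
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len or left unexpanded
    return max(finished, key=lambda item: item[1])[0]


def toy_step(prefix):
    """Hypothetical next-token model that depends only on the last token."""
    table = {
        "<BOS>": {"hello": 0.6, "hi": 0.4},
        "hello": {"<EOS>": 0.6, "there": 0.4},
        "hi": {"<EOS>": 1.0},
        "there": {"<EOS>": 1.0},
    }
    return table[prefix[-1]]


wide = beam_search(toy_step, "<BOS>", "<EOS>", beam_width=10)
greedy = beam_search(toy_step, "<BOS>", "<EOS>", beam_width=1)
```

In this toy model a width of 1 commits to "hello" early and ends with probability 0.36, while the wider beam keeps "hi" alive and finds the sequence with probability 0.4.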

Result
We present our experimental results in Tables 1 and 2. As shown, our model, NLG-LM, outperforms the baseline models on all 5 datasets. On the E2E-NLG dataset, it achieves a 2.2% higher BLEU score and a 0.013 higher NIST score than Slug. On the TV, Laptop, Hotel and Restaurant datasets, NLG-LM improves the previous best results by 7.6%, 6.1%, 4.1%, and 1.6%, respectively. We also ran our model without the language modeling task as an ablation study, denoted by w/o LM. As seen, language modeling improves the result by 0.8% to 6.1%, or 2.4% on average, which demonstrates the effectiveness of multi-task learning.
In the appendix, we examine some predicted samples generated by our model, which show that the addition of the language model makes the generated responses more natural and varied.
Efficiency. We compared the training time of NLG-LM with that of w/o LM. As shown in Table 4, the additional language model only introduces 23.6% more training time, since it does not involve expensive attention computation. Therefore, NLG-LM can offer more natural response generation with comparable efficiency.

Conclusions
In this paper, we propose a novel multi-task learning method, NLG-LM. It incorporates a language modeling task into the response generation process as an unconditioned complementary process to boost the naturalness of generated utterances. We fit both tasks into a sequence-to-sequence structure under a multi-task learning scheme. Empirical results show that NLG-LM significantly outperforms previous methods on 5 large-scale datasets with reasonable computational efficiency. Ablation studies show the effectiveness of using the language modeling task within a multi-task scheme.