Learning a Simple and Effective Model for Multi-turn Response Generation with Auxiliary Tasks

We study multi-turn response generation for open-domain dialogues. The existing state-of-the-art addresses the problem with deep neural architectures. While these models have improved response quality, their complexity also hinders their application in real systems. In this work, we pursue a model that has a simple structure yet can effectively leverage conversation contexts for response generation. To this end, we propose four auxiliary tasks, including word order recovery, utterance order recovery, masked word recovery, and masked utterance recovery, and optimize the objectives of these tasks together with maximizing the likelihood of generation. By this means, the auxiliary tasks that relate to context understanding can guide the learning of the generation model to a better local optimum. Empirical studies with three benchmarks indicate that our model can significantly outperform state-of-the-art generation models in terms of response quality under both automatic evaluation and human judgment, and at the same time enjoys a much faster decoding process.


Introduction
As an important topic in conversational AI, open-domain human-machine conversation is gaining increasing attention from both academia and industry. A common approach to building such a system is to learn a response generation model within an encoder-decoder framework using neural sequence architectures (Sutskever et al., 2014; Vaswani et al., 2017). While the encoder-decoder framework has been successfully applied in various text generation tasks such as machine translation (Vaswani et al., 2017), summarization (Rush et al., 2015), and paraphrase generation (Dong et al., 2017), it has to deal with a unique challenge in the task of response generation: modeling conversation contexts. A conversation context often exhibits a hierarchical structure, with dependencies existing on both the word level and the utterance level. Moreover, as indicated in (Xing et al., 2018; Zhang et al., 2019), the information in a context is rather redundant for responding: commonly only a few words and utterances are useful for response generation, and the positions of the relevant words and utterances vary from case to case. To model the hierarchy of conversation contexts, the hierarchical recurrent encoder-decoder (HRED) (Serban et al., 2016) extends the vanilla sequence-to-sequence model with a word-level encoder and an utterance-level encoder. Later, the hierarchical recurrent attention network (HRAN) (Xing et al., 2018) equips the decoder of HRED with word-level attention and utterance-level attention to dynamically highlight the effect of relevant words and utterances in response synthesis. Very recently, ReCoSa (Zhang et al., 2019) further exploits multi-layer multi-head self-attention to model long-term dependencies among utterances and responses. From HRED to HRAN, and then to ReCoSa, the models become better and better in terms of response quality (Zhang et al., 2019), but they also grow more and more complicated. For example, the number of parameters in ReCoSa is more than twice that of HRED. Thus, while we enjoy the performance gains brought by the increased complexity, that same complexity may impede the application of the models in some scenarios (e.g., on mobile devices).
In this work, we study multi-turn response generation and target a model that has a simple structure yet can make use of conversation contexts as well as the existing deep models. The key idea is to transfer the burden of context understanding from modeling to learning by designing several auxiliary tasks, and to leverage these tasks as regularization in model estimation. Specifically, the model we use for response generation concatenates the utterances in a conversation context into a long sequence, and exploits only one-layer self-attention in encoding and one-layer context attention in decoding. In such a frugal setting, the representation capability of the model shrinks considerably compared with deep transformers. As a remedy, we augment the maximum likelihood estimation (MLE) in learning with objectives from four auxiliary tasks: word order recovery, utterance order recovery, masked word recovery, and masked utterance recovery. In the first two tasks, we predict the correct order of words and utterances from a random shuffle of the words in an utterance and a random shuffle of the utterances in a context, respectively. The goal of these two tasks is to enhance understanding of the sequential dependencies among words and utterances within a context. The other two tasks are inspired by the recent breakthrough of BERT (Devlin et al., 2019): we randomly mask a word in an utterance and an utterance in a context, respectively, and predict the masked word and the masked utterance from the remaining words and utterances. These two tasks may encourage the learning process to pay more attention to the semantics of words and utterances in their contexts, and help it find better representations of words and utterances for the generation model. The auxiliary tasks and the MLE task share the encoder of the generation model. Through learning with multiple tasks, optimization for response generation and optimization for context understanding are performed jointly. The context understanding tasks can guide the MLE to a better local optimum, and thus realize superior performance in response generation with a simple neural structure.
We test the proposed approach on three benchmarks: the Ubuntu Dialogue Corpus (Lowe et al., 2015), DailyDialog (Li et al., 2017), and PERSONA-CHAT (Zhang et al., 2018). Evaluation results on all three datasets indicate that our model can significantly outperform state-of-the-art generation models in terms of both automatic evaluation and human judgment. Moreover, with a parameter set even smaller than that of HRED, our model is 2x faster than ReCoSa in response decoding.
Our contributions in this paper are three-fold: (1) a proposal to balance model complexity and model capability in multi-turn response generation; (2) four auxiliary learning tasks that transfer context understanding from modeling to learning; and (3) empirical verification of the effectiveness and efficiency of the proposed model on three benchmarks.

Related Work
End-to-end open-domain dialogue generation is built upon the encoder-decoder architecture (Shang et al., 2015; Vinyals and Le, 2015), and the vanilla sequence-to-sequence structure has been widely extended to address challenges such as generic responses (Li et al., 2015; Xing et al., 2017), context modeling (Serban et al., 2016, 2017; Xing et al., 2018; Zhang et al., 2019), and grounding by persona/emotion/knowledge (Li et al., 2016; Zhang et al., 2018; Zhou et al., 2018; Dinan et al., 2018). In this work, we study how to leverage conversation context for multi-turn response generation, which is a fundamental problem in dialogue generation. Different from existing work that enhances the representation capability of models through neural architecture engineering, we pursue an orthogonal direction: we keep the generation model simple, and optimize the simple structure by learning with auxiliary tasks that encode context understanding. As a result, our model can provide high-quality responses at a low cost. Before us, there have been a few studies on learning a primary task together with auxiliary ones (Rei and Yannakoudakis, 2017; Yu and Jiang, 2016; Ding et al., 2017; Trinh et al., 2018; Mehri et al., 2019; Wu et al., 2019). Our work is unique in that, through extensive empirical studies, we verify that a simple structure learned with auxiliary tasks can work as well as deep architectures in dialogue generation.

Approach
We first formalize the problem in question, and then detail the model and the learning tasks.

Problem Formalization
Suppose that we have a dataset $\mathcal{D} = \{(\mathcal{U}_i, R_i)\}_{i=1}^{N}$, where $\mathcal{U}_i = (U_{i,1}, \ldots, U_{i,n})$ denotes a context with $U_{i,j}$ the $j$-th utterance, and $R_i$ is a response to $\mathcal{U}_i$. The goal is to estimate a generation probability distribution $P(R \mid \mathcal{U})$ from $\mathcal{D}$, so that given a new context $\mathcal{U}$, one can generate a response for $\mathcal{U}$ following $P(R \mid \mathcal{U})$. A common practice is to learn $P(R \mid \mathcal{U})$ by maximizing the log-likelihood of $\mathcal{D}$ (i.e., MLE), which can be formulated as

$$\sum_{i=1}^{N} \log P(R_i \mid \mathcal{U}_i). \quad (1)$$

When $P(R \mid \mathcal{U})$ has a simple structure, learning with MLE alone could be insufficient to obtain a model that captures the syntax and semantics of contexts well. Evidence of this is that simple architectures like HRED are much worse than complicated architectures like ReCoSa in terms of response quality, as reported in existing work (Zhang et al., 2019). Since a simple structure is still favored, we consider aiding the objective given by Equation (1) with extra objectives that can reinforce context understanding in the learning process.

Generation Model
Figure 1 illustrates the architecture of the generation model. In a nutshell, the model is a transformer-based structure (Vaswani et al., 2017) with one attentive layer (in the transformer layer) in the encoder and one attentive layer in the decoder. The auxiliary tasks, which will be presented later, share the encoder with the generation model. We prefer a transformer-based structure over a recurrent structure because the former is easier to parallelize than the latter, which can further enhance the efficiency of the model in an online system.
Encoder: we unfold all words in $(\mathcal{U}, R)$ into $W = (w_1, \ldots, w_m, w_{m+1}, \ldots, w_{m+t})$, where $m$ is the number of words in context $\mathcal{U}$ and $t$ is the number of words in response $R$. $\forall i \in \{1, \ldots, m+t\}$, $w_i$ is represented by a summation of word embedding, position embedding, and segment embedding:

$$E_0(w_i) = WE(w_i) + PE(w_i) + SE(w_i), \quad (2)$$

where $WE(w_i)$ represents the word embedding of $w_i$ initialized using GloVe (Pennington et al., 2014); $PE(w_i)$ is the position embedding of $w_i$, defined by $PE(w_i) = P\,e(w_i)$, where $e(w_i)$ is a one-hot vector whose only non-zero entry indicates the position of $w_i$ in $W$, and $P \in \mathbb{R}^{d \times M_p}$ is a randomly initialized matrix with $M_p$ an upper bound on the number of words in a dialogue; and $SE(w_i)$ is the segment embedding of $w_i$, defined similarly with the one-hot vector indicating the position of the utterance that contains $w_i$. The embedding matrix $E_0 = [E_0(w_1), \ldots, E_0(w_{m+t})]$ is then fed to a transformer layer, which can be formulated as

$$E = \mathrm{FNN}\big(\mathrm{MultiHead}(E_0, E_0, E_0)\big), \quad (3)$$

where $\mathrm{FNN}(\cdot)$ is a feed-forward neural network and $\mathrm{MultiHead}(Q, K, V)$ is a multi-head attention function with $Q$ a query, $K$ a key, and $V$ a value. To control the receptive field of self-attention in different tasks, we add a mask matrix $M \in \mathbb{R}^{(m+t) \times (m+t)}$ (Dong et al., 2019) to the attention computation, and let $M$ determine whether a pair of words can attend to each other according to the learning task. Thus, $\mathrm{MultiHead}(Q, K, V)$ is defined by

$$\mathrm{MultiHead}(Q, K, V) = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h)\,W^O, \quad \mathrm{head}_j = \mathrm{softmax}\Big(\frac{(Q W_j^Q)(K W_j^K)^{\top}}{\sqrt{d_k}} + M\Big)\, V W_j^V, \quad (4)$$

where $\oplus$ refers to a concatenation operation, $W^O$, $W_j^Q$, $W_j^K$, and $W_j^V$ are trainable parameters, and $M$ is given by

$$M_{ij} = \begin{cases} 0, & \text{allow to attend,} \\ -\infty, & \text{prevent from attending.} \end{cases} \quad (5)$$

Decoder: suppose that $(w_{m+1}, \ldots, w_{m+l-1})$ are the words generated up to step $l-1$; then the next word $w_{m+l}$ is predicted according to

$$P(w_{m+l} \mid w_1, \ldots, w_{m+l-1}) = \mathrm{softmax}\big(O(w_{m+l-1})\, W_s\big), \quad (6)$$

where $O(w_{m+l-1})$ is defined by $\mathrm{FNN}\big(\mathrm{MultiHead}(E(w_{m+l-1}), E, E)\big)$ with $E = [E(w_1), \ldots, E(w_{m+l-1})]$ the output of the encoder, and $W_s$ is a trainable parameter.
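To make the masked attention concrete, below is a minimal PyTorch-style sketch of Equations (4), (5), and (8). The per-head projection matrices of Equation (4) are omitted for brevity, and all names are illustrative rather than taken from the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F


def multi_head_attention(Q, K, V, mask, n_heads=8):
    """Q, K, V: (batch, seq, d_model); mask: (seq_q, seq_k), entries 0 or -inf.

    The learned projections W^Q, W^K, W^V, W^O of Equation (4) are omitted
    to keep the sketch short; only the masked attention itself is shown.
    """
    batch, seq_q, d_model = Q.shape
    d_head = d_model // n_heads

    def split(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return x.view(batch, -1, n_heads, d_head).transpose(1, 2)

    q, k, v = split(Q), split(K), split(V)
    # Scaled dot-product scores plus the additive mask of Equation (5):
    # a 0 entry keeps the pair, a -inf entry zeroes it out after the softmax.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head) + mask
    out = F.softmax(scores, dim=-1) @ v
    # Concatenate the heads back together (the oplus in Equation (4)).
    return out.transpose(1, 2).reshape(batch, seq_q, d_model)


def same_utterance_mask(segment_ids):
    """Builds the mask of Equation (8) from per-word utterance indices."""
    same = segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)
    return torch.zeros(same.shape).masked_fill(~same, float("-inf"))
```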

Auxiliary Tasks
To learn a simple structure that can effectively make use of contexts for response generation, we design two kinds of auxiliary tasks: order recovery and masked content recovery. The order recovery tasks aim to enhance the capability of the self-attention module to capture the sequential relationships among words and utterances, while the masked content recovery tasks optimize the self-attention module to strengthen the semantic connections among words and utterances.
Order recovery: a recent study (Sankar et al., 2019) indicates that transformer-based models are insensitive to the ordering of words and utterances, which means that what they learn could be just bag-of-words representations. Thus, we consider recovering the correct order from random shuffles at both the word level and the utterance level, forcing self-attention to be aware of the relative positions of words and utterances in the context.
Word order recovery: Figure 2 (a) illustrates the task. Given an utterance $U = (w_1, \ldots, w_k)$ randomly sampled from a context $\mathcal{U}$, we randomly shuffle the words in $U$ and obtain a disordered utterance $\bar{U} = (\bar{w}_1, \ldots, \bar{w}_k)$. Then we replace $U$ in $\mathcal{U}$ with $\bar{U}$ and form a corrupted context $\bar{\mathcal{U}}$. The goal of the task is to predict $U$ from $\bar{\mathcal{U}}$. The loss of the task can be formulated as

$$\mathcal{L}_{wor} = -\sum_{i=1}^{k} \log P(w_i \mid \bar{\mathcal{U}}), \quad P(w_i \mid \bar{\mathcal{U}}) = \mathrm{softmax}\big(E(\bar{w}_i)\, W_s\big), \quad (7)$$

where $E(\bar{w}_i)$ is obtained from $E(\bar{\mathcal{U}})$, the representation of $\bar{\mathcal{U}}$ given by the encoder of the generation model, and $W_s$ is shared with Equation (6). For this task, the mask matrix $M$ in Equation (4) is defined by

$$M_{ij} = \begin{cases} 0, & w_i \text{ and } w_j \text{ are in the same utterance,} \\ -\infty, & w_i \text{ and } w_j \text{ are in different utterances.} \end{cases} \quad (8)$$

Utterance order recovery: Figure 2 (d) illustrates the task. Given a context $\mathcal{U} = (U_1, \ldots, U_n)$, we randomly shuffle the utterances and obtain a disordered context $\bar{\mathcal{U}} = (U_{o_1}, \ldots, U_{o_n})$. The goal is to predict the correct positions of the utterances in $\bar{\mathcal{U}}$. The prediction model follows a read-process-write framework (Vinyals et al., 2015). In the reading module, the model first represents $\bar{\mathcal{U}}$ as $\bar{E} = (E(w_{1,1}), \ldots, E(w_{n,k_n}))$ via the encoder of the generation model, where $w_{i,j}$ is the $j$-th word of utterance $U_{o_i}$ (words within an utterance remain ordered), and then obtains the representation of utterance $U_{o_i}$ as

$$S_i = \frac{1}{k_i} \sum_{j=1}^{k_i} E(w_{i,j}), \quad (9)$$

where $k_i$ is the number of words in $U_{o_i}$. $S = \{S_i\}_{i=1}^{n}$ forms a sentence memory that is accessible to the processing module. The processing module exploits multi-head self-attention and a GRU to guarantee that the vectors retrieved from the memory $S$ do not change if the memory is randomly shuffled. Formally, the processing module is defined by

$$h_i = \mathrm{GRU}\big(h_{i-1}, \mathrm{MultiHead}(h_{i-1}, S, S)\big), \quad (10)$$

with $h_0$ a zero vector, where the last hidden state $h_n$ is permutation invariant with respect to the input. The writing module is another GRU that decodes $\{o_1, \ldots, o_n\}$ one by one. At step $i$, the hidden state $\tilde{h}_i$ is defined by

$$\tilde{h}_i = \mathrm{GRU}\big(\tilde{h}_{i-1}, [x_i; c_i]\big), \quad (11)$$

where $\tilde{h}_{i-1}$ is the hidden state at step $i-1$ with $\tilde{h}_0 = h_n$, $x_i$ is the embedding of $o_{i-1}$ (i.e., the embedding of the ground-truth position of $U_{o_{i-1}}$), and $c_i$ is a context vector defined via attention over $\{h_t\}_{t=1}^{n}$:

$$c_i = \sum_{t=1}^{n} a_{i,t}\, h_t, \quad a_{i,t} = \frac{\exp\big(V_1^{\top} \tanh(W_1 \tilde{h}_{i-1} + W_2 h_t + b_1)\big)}{\sum_{t'=1}^{n} \exp\big(V_1^{\top} \tanh(W_1 \tilde{h}_{i-1} + W_2 h_{t'} + b_1)\big)}, \quad (12)$$

where $V_1$, $W_1$, $W_2$, and $b_1$ are parameters. The prediction model is finally formulated as a pointer distribution over the $n$ utterances in the memory,

$$P(o_i = t \mid o_{<i}, \bar{\mathcal{U}}) = \mathrm{softmax}_t\big(V_1^{\top} \tanh(W_1 \tilde{h}_i + W_2 S_t + b_1)\big). \quad (13)$$

The loss function of the task is defined by

$$\mathcal{L}_{uor} = -\sum_{i=1}^{n} \log P(o_i \mid o_{<i}, \bar{\mathcal{U}}). \quad (14)$$

For this task and the following ones, $M$ in Equation (4) is defined as a zero matrix, meaning that every pair of words can attend to each other within the context.
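The following is a compact sketch of the read-process-write predictor under the reconstruction above. The mean pooling of Equation (9), the number of processing steps, and the folding of the context attention into the pointer scores are simplifying assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UtteranceOrderPredictor(nn.Module):
    """Read-process-write pointer model for utterance order recovery (sketch)."""

    def __init__(self, d=512, n_heads=8, process_steps=3):
        super().__init__()
        self.process_steps = process_steps
        self.read_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.process = nn.GRUCell(d, d)
        self.write = nn.GRUCell(d, d)
        self.W1 = nn.Linear(d, d)               # W1, W2, V1, b1 of Equation (12)
        self.W2 = nn.Linear(d, d, bias=False)
        self.V1 = nn.Linear(d, 1, bias=False)

    def forward(self, S, gold_order):
        """S: (batch, n, d) shuffled utterance vectors; gold_order: (batch, n) long."""
        b, n, d = S.shape
        h = S.new_zeros(b, d)
        # Process: attention over the memory is permutation invariant, so the
        # final state h_n does not depend on how S was shuffled.
        for _ in range(self.process_steps):
            r, _ = self.read_attn(h.unsqueeze(1), S, S)
            h = self.process(r.squeeze(1), h)
        # Write: decode a pointer over the n positions, one step per utterance.
        loss, x = 0.0, S.new_zeros(b, d)
        for i in range(n):
            h = self.write(x, h)
            scores = self.V1(torch.tanh(self.W1(h).unsqueeze(1) + self.W2(S)))
            loss = loss + F.cross_entropy(scores.squeeze(-1), gold_order[:, i])
            x = S[torch.arange(b), gold_order[:, i]]    # teacher forcing
        return loss / n
```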
Masked content recovery: a major challenge in context understanding is the information omission problem (e.g., coreference) that widely exists in utterances (Su et al., 2019). The challenge requires a model to connect semantically related words and utterances. Thus, we design masked content recovery tasks at both the word level and the utterance level to make the self-attention module more aware of semantic connections.
• Word level: for each utterance in a context, we randomly replace 15% of the words with a special token [MASK].
• Utterance level: we randomly pick an utterance from a context and replace all of its words with a special token [MASK].

Figure 2 (b) and Figure 2 (c) illustrate the task of masked word recovery (mwr) and the task of masked utterance recovery (mur), respectively. Since the only difference between the two tasks is the input, we present them in a uniform way (a sketch of both corruptions follows this paragraph). Given a context $\mathcal{U} = (w_1, \ldots, w_m)$, suppose that the masked context is $\bar{\mathcal{U}} = (w^*_1, \ldots, w^*_m)$, where $w^*_i = \text{[MASK]}$ if $w_i$ is masked and $w^*_i = w_i$ otherwise. The loss of the tasks can then be formulated as

$$\mathcal{L}_{x} = -\sum_{i=1}^{m} \mathbb{I}[w^*_i = \text{[MASK]}]\, \log P(w_i \mid \bar{\mathcal{U}}), \quad P(w_i \mid \bar{\mathcal{U}}) = \mathrm{softmax}\big(E(w^*_i)\, W_s\big), \quad (15)$$

where $E(w^*_i)$ is the representation of $w^*_i$ obtained by passing $\bar{\mathcal{U}}$ through the encoder of the generation model, $x \in \{mwr, mur\}$ indexes the two tasks, $\mathbb{I}[\cdot]$ is an indicator function, and $W_s$ is shared with Equation (6).
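A minimal sketch of the two corruptions, assuming a context is a list of tokenized utterances; the 15% rate follows the text, everything else is illustrative.

```python
import random

MASK = "[MASK]"


def mask_words(context, rate=0.15, rng=random):
    """Word level: replace ~15% of the words in each utterance with [MASK]."""
    return [[MASK if rng.random() < rate else w for w in utt]
            for utt in context]


def mask_utterance(context, rng=random):
    """Utterance level: replace every word of one random utterance with [MASK]."""
    i = rng.randrange(len(context))
    return [[MASK] * len(utt) if j == i else list(utt)
            for j, utt in enumerate(context)]

# Example:
# context = [["how", "are", "you", "?"], ["fine", ",", "thanks"]]
# mask_words(context)     -> e.g. [["how", "[MASK]", "you", "?"], ...]
# mask_utterance(context) -> e.g. [["[MASK]"] * 4, ["fine", ",", "thanks"]]
```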

Learning Objective
The full loss function is finally defined by

$$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \log P(R_i \mid \mathcal{U}_i) + \alpha\,\big(\mathcal{L}_{wor} + \mathcal{L}_{uor} + \mathcal{L}_{mwr} + \mathcal{L}_{mur}\big), \quad (16)$$

where $\alpha$ is a hyper-parameter that trades off MLE against the objectives of the auxiliary tasks. The learning algorithm is summarized in Algorithm 1, where $\Theta$ refers to the set of parameters including both the parameters of the generation model and the parameters of the auxiliary objectives.
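As a sketch of one step of the joint optimization in Equation (16): the per-task loss methods below are hypothetical names standing in for the losses defined above, not an interface from the authors' code.

```python
def train_step(model, batch, optimizer, alpha):
    """One optimization step for Equation (16); method names are assumed."""
    loss = model.mle_loss(batch)                    # -sum log P(R|U)
    loss = loss + alpha * (model.wor_loss(batch)    # word order recovery
                           + model.uor_loss(batch)  # utterance order recovery
                           + model.mwr_loss(batch)  # masked word recovery
                           + model.mur_loss(batch)) # masked utterance recovery
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```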

Experiments
We conduct experiments on DailyDialog (Li et al., 2017), PERSONA-CHAT (Zhang et al., 2018), and the Ubuntu Dialogue Corpus (UDC) (Lowe et al., 2015), and compare our model with state-of-the-art baselines in terms of response quality, parameter size, and decoding speed.

Datasets
Both DailyDialog and PERSONA-CHAT are open-domain datasets. Dialogues in DailyDialog cover a wide range of topics in daily scenarios and resemble human day-to-day communication, while PERSONA-CHAT contains multi-turn chit-chat conversations between turkers conducted according to their assigned profiles. Since the focus of this work is how to leverage conversation history for response generation, we simply append the profiles (the original ones) to the corresponding dialogues as an extension of the contexts. To control the length of the dialogues and increase the number of instances, we slide a window over the training/validation/test dialogues in both datasets and split every dialogue longer than 11 utterances into multiple instances (i.e., the window size is 11); a sketch is given below. Moreover, we truncate long utterances, keeping the first 25 words. Vocabularies are formed from all words appearing in the entire data and are shared by contexts and responses; the vocabulary size is 25,000 for DailyDialog and 18,750 for PERSONA-CHAT. The UDC data are collected from Ubuntu chat logs and consist of two-person multi-turn conversations about Ubuntu-related problems. Here we use the same data as (Zhang et al., 2019). Table 1 reports statistics of the three datasets.
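A sketch of the windowing described above, assuming a dialogue is a list of tokenized utterances; the window size of 11 and the 25-word truncation follow the text, while the stride of one utterance is our assumption.

```python
def split_dialogue(dialogue, window=11, max_words=25):
    """Slide a window over a long dialogue and truncate long utterances."""
    dialogue = [utt[:max_words] for utt in dialogue]
    if len(dialogue) <= window:
        return [dialogue]
    return [dialogue[i:i + window]
            for i in range(len(dialogue) - window + 1)]
```

In each resulting instance, the last utterance presumably serves as the response and the preceding ones as the context.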

Implementation Details
We train the baselines and our model on an RTX 2080, and initialize word embeddings with GloVe vectors (Pennington et al., 2014). In our model, the dimension of all vectors is set to 512, and the number of heads in multi-head attention is set to 8.
We adopt the Adagrad algorithm (Duchi et al., 2011) for optimization with a learning rate of 0.05 and a batch size of 80/60/32 on DailyDialog/PERSONA-CHAT/UDC. All models are tuned on the validation sets according to perplexity, and we stop training if the perplexity does not drop in three consecutive epochs. The GlobalMaxStep $T_1$ is set to 50k, the AuxTrainEpoch $T_2$ to 30, and the BatchNumPerEpoch $N$ to 551/1,595/124,375 for DailyDialog/PERSONA-CHAT/UDC.
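The stated configuration, expressed as a sketch; the early-stopping helper encodes the three-epoch patience rule from the text, while the helper names themselves are illustrative.

```python
import torch


def make_optimizer(model):
    # Adagrad with the stated learning rate of 0.05.
    return torch.optim.Adagrad(model.parameters(), lr=0.05)


def should_stop(val_ppls, patience=3):
    """True if validation perplexity has not dropped in `patience` epochs."""
    if len(val_ppls) <= patience:
        return False
    return min(val_ppls[-patience:]) >= min(val_ppls[:-patience])
```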

Evaluation Metrics
We evaluate the performance of the models in terms of response quality with both automatic metrics and human judgment. For automatic evaluation, besides BLEU-4 (Papineni et al., 2002) and perplexity (Sutskever et al., 2014), we follow (Serban et al., 2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. We also follow (Li et al., 2015) and measure the informativeness of responses with distinct-1 and distinct-2, calculated as the ratios of distinct unigrams and bigrams.
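For reference, a common way to compute distinct-1/distinct-2 is the corpus-level ratio sketched below; minor variants exist in the literature, so this is one plausible formulation rather than necessarily the exact script used here.

```python
def distinct_n(responses, n):
    """responses: list of token lists; returns |unique n-grams| / |n-grams|."""
    ngrams = [tuple(r[i:i + n])
              for r in responses
              for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# distinct-1 and distinct-2 over a set of generated responses:
# d1, d2 = distinct_n(responses, 1), distinct_n(responses, 2)
```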
In human evaluation, we randomly sample 500 dialogues from each of the three test sets and recruit 3 native speakers as annotators. For each context in the 500 dialogues, each annotator compares a response from our model with a response from a baseline model. The two responses are the top-1 results of greedy search and are randomly shuffled to hide their sources. The annotators judge which response is better based on informativeness, consistency, and fluency; if an annotator cannot tell which response is better, he/she is required to label a "tie". Each annotator individually judges 500 pairs for every combination of our model and a baseline model; in total, each annotator labels 2,500 pairs per dataset. Fleiss' kappa (Fleiss and Cohen, 1973) is employed to measure agreement among the annotators.
In addition to response quality, we also compare our model with the baselines on decoding speed. We calculate the average prediction time per word in response generation over all dialogues in the test sets; a sketch of this measurement is given below. The efficiency comparison is conducted in a GPU environment with a single RTX 2080.
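A sketch of the speed measurement: average prediction time per generated word over the test dialogues. The `decode` interface is an assumption, not the authors' API.

```python
import time


def avg_time_per_word(model, contexts):
    """Average per-word decoding time over a set of test contexts."""
    total_time, total_words = 0.0, 0
    for ctx in contexts:
        start = time.perf_counter()
        response = model.decode(ctx)   # greedy decoding; interface assumed
        total_time += time.perf_counter() - start
        total_words += len(response)
    return total_time / max(total_words, 1)
```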

Evaluation Results
Table 2 reports the evaluation results on automatic metrics. Our model outperforms all baseline methods on most of the metrics on all three datasets. The last two columns of the table compare the models in terms of parameter size and decoding speed. Note that in training, the auxiliary tasks contain parameters outside the generation model. Therefore, in the column of parameter size, we report two numbers for our model, with the number before "/" the parameter size in training and the number after "/" the parameter size of the generation model. It is remarkable that the parameter size of our model, even in training, is smaller than that of HRED; in spite of this, the model still outperforms ReCoSa. Compared with SSN (Wu et al., 2019), our approach differs in that SSN only considers utterance order, while we also leverage word order, word content, and utterance content in learning. In fact, we find that the proposed auxiliary tasks can also improve a 2-layer (one for the encoder and one for the decoder) RNN-based seq2seq model, as reported in the supplementary material. On most metrics, RNN with the full auxiliary tasks is better than SSN but worse than the proposed model.
Table 3 summarizes the human evaluation results. Our model outperforms all baseline models, and most of the kappa values exceed 0.6, indicating substantial agreement among the annotators. Based on the annotation results, we find that our model tends to generate diverse and context-consistent responses, indicating the effect of the auxiliary tasks.

Discussions
To further understand the merit of the auxiliary tasks, we analyze the following questions: Q1: how does the simple architecture learned with the auxiliary tasks compare with deep architectures? Q2: can learning with the auxiliary tasks also improve deep architectures? Q3: how do different auxiliary tasks affect the performance of the model?
Answer to Q1: we aim to move one step further and understand how the auxiliary tasks enhance the capability of the simple generation model in context understanding. While this is not trivial for neural models, we assume that one can let a transformer-based model capture more semantics in contexts by stacking more layers in the encoder, and examine to what extent the simple model learned with the auxiliary tasks is equivalent to a deep architecture. Figure 3 compares our model with deep architectures in terms of perplexity on the three datasets, where we obtain the deep architectures by stacking transformer layers in the encoder of our model. The dotted lines represent our model learned with the auxiliary tasks, and the solid lines represent the deep architectures learned with MLE. Approximately, our model is equivalent to a deep model with a 4-layer encoder on DailyDialog, a 6-layer encoder on PERSONA-CHAT, and a 3-layer encoder on UDC.

Answer to Q2: since the auxiliary tasks are useful for the simple model, it is also interesting to check whether they work as well for deep architectures. Figure 3 shows the results, in which the dash-dotted lines represent the deep architectures learned with the full auxiliary tasks. First, we can conclude that the auxiliary tasks are also useful for deep architectures, since there is a clear drop in perplexity between the same models learned with and without (i.e., the solid lines) the auxiliary tasks. Second, the auxiliary tasks are more useful for simple structures, since the gap between the same models learned with and without the tasks becomes smaller and smaller as the number of encoding layers increases. The results indicate that after stacking enough layers, the effect of the auxiliary tasks is overwhelmed by the model itself. Therefore, the merit of the auxiliary tasks is that they allow us to learn a generation model that enjoys both efficacy and efficiency, which is exactly the goal of this work. The improvement with respect to the number of encoder layers on UDC is steadier than that on DailyDialog and PERSONA-CHAT; this is because the training set of UDC is much larger than those of the other two datasets.

Figure 1: Architecture of the generation model.

Figure 2: Illustration of the auxiliary tasks: (a) word order recovery, (b) masked word recovery, (c) masked utterance recovery, and (d) utterance order recovery.

Algorithm 1: Optimization Algorithm. Input: training data $\mathcal{D}$, GlobalMaxStep $T_1$, AuxTrainEpoch $T_2$, InitialRate $\alpha$.

Table 1: Statistics of the datasets.

Table 2: Evaluation results on automatic metrics. Numbers in bold indicate the best performing model on the corresponding metrics.

Table 3: Human evaluation results. The ratios are calculated by combining the annotations from the three judges.