Low-Resource Response Generation with Template Prior

We study open domain response generation with limited message-response pairs. The problem exists in real-world applications but is less explored by existing work. Since the paired data are no longer enough to train a neural generation model, we consider leveraging large-scale unpaired data that are much easier to obtain, and propose response generation with both paired and unpaired data. The generation model is defined by an encoder-decoder architecture with templates as a prior, where the templates are estimated from the unpaired data as a neural hidden semi-Markov model. By this means, response generation learned from the small paired data can be aided by the semantic and syntactic knowledge in the large unpaired data. To balance the effect of the prior and the input message on response generation, we propose learning the whole generation model with an adversarial approach. Empirical studies on question response generation and sentiment response generation indicate that when only a few pairs are available, our model can significantly outperform several state-of-the-art response generation models in terms of both automatic and human evaluation.


Introduction
Human-machine conversation is a long-standing goal of artificial intelligence. Early dialogue systems are designed for task completion, with conversations restricted to specific domains (Young et al., 2013). Recently, thanks to advances in deep learning techniques (Sutskever et al., 2014; Vaswani et al., 2017) and the availability of large amounts of human conversation on the internet, building an open domain dialogue system with data-driven approaches has become the new fashion in the research of conversational AI. Such dialogue systems can generate reasonable responses without any need for rules, and have powered products in industry such as Amazon Alexa (Ram et al., 2018) and Microsoft XiaoIce (Shum et al., 2018).
State-of-the-art open domain response generation models are based on the encoder-decoder architecture (Vinyals and Le, 2015; Shang et al., 2015). On the one hand, with proper extensions to the vanilla structure, existing models are now able to naturally handle conversation contexts (Serban et al., 2016; Xing et al., 2018), and synthesize responses with various styles (Wang et al., 2017), emotions (Zhou et al., 2018), and personas (Li et al., 2016a). On the other hand, all the existing success of open domain response generation builds upon the assumption that large-scale paired data (Shao et al., 2016) or conversation sessions (Sordoni et al., 2015) are available. In this work, we challenge this assumption by arguing that one cannot always obtain enough pairs or sessions for training a neural generation model. For example, although existing work (Li et al., 2016b; Wang et al., 2018) has indicated that question asking in conversation can enhance user engagement, we find that in a public dataset with 5 million conversation sessions crawled from Weibo, only 7.3% of sessions have a question response and thus can be used to learn a question generator for responding. When we attempt to generate responses that express positive sentiment, we only get 360k pairs (18%) with positive responses from a dataset with 2 million message-response pairs crawled from Twitter. Indeed, existing large conversation datasets mix various intentions, styles, emotions, personas, and so on. Thus, we have to face the data sparsity problem as long as we attempt to create a generation model with constraints on responses.
In this work, we jump out of the paradigm of learning from large-scale paired data, and investigate how to build a response generation model with only a few pairs at hand. Aside from the paired data, we assume that a large number of unpaired data are available. The assumption is reasonable, since it is much easier to get questions or sentences with positive sentiment than to get such responses paired with messages. We formalize the problem as low-resource response generation from paired and unpaired data, which is less explored by existing work. Since the paired data are insufficient for learning the mapping from a message to a response, the challenge of the task lies in how to effectively leverage the unpaired data to enhance learning on the paired data. Our solution to the challenge is a two-stage approach, where we first distill templates from the unpaired data and then use them to guide response generation. Targeting an unsupervised approach to template learning, we propose representing the templates as a neural hidden semi-Markov model (NHSMM) estimated by maximizing the likelihood of the unpaired data. Such latent templates encode both the semantics and the syntax of the unpaired data and are then used as a prior in an encoder-decoder architecture for modeling the paired data. With the latent templates, the whole model is end-to-end learnable and can perform response generation in an explainable manner. To ensure the relevance of responses with regard to input messages and at the same time make full use of the templates, we propose learning the generation model with an adversarial approach.
Empirical studies are conducted on two tasks: question response generation and sentiment response generation. For the first task, we exploit the dataset published in (Wang et al., 2018) and augment the data with questions crawled from Zhihu. For the second task, we build a paired dataset from Twitter by filtering responses with an off-the-shelf sentiment classifier, and augment the dataset with tweets in positive sentiment extracted from a large-scale tweet dataset published in (Cheng et al., 2010). Evaluation results on both automatic metrics and human judgment indicate that with limited message-response pairs, our model can significantly outperform several state-of-the-art response generation models. The source code is available online. Our contributions in this work are threefold: (1) proposal of low-resource response generation with paired and unpaired data for open domain dialogue systems; (2) proposal of an encoder-decoder with template prior; and (3) empirical verification of the effectiveness of the model with two large-scale datasets.

Related Work
Inspired by neural machine translation, early work applies the sequence-to-sequence with attention model (Shang et al., 2015) to open domain response generation and gets promising results. Later, the basic architecture is extended to suppress generic responses (Li et al., 2015; Zhao et al., 2017; Xing et al., 2017); to model the structure of conversation contexts (Serban et al., 2016); and to incorporate different types of knowledge into generation (Li et al., 2016a; Zhou et al., 2018). In addition to model design, how to learn a generation model (Li et al., 2016c, 2017) and how to evaluate the models (Liu et al., 2016; Lowe et al., 2017; Tao et al., 2018) are drawing attention in the community of open domain dialogue generation. In this work, we study how to learn a response generation model from limited pairs, which breaks the assumption made by existing work. We propose response generation with paired and unpaired data. As far as we know, this is the first work on low-resource response generation for open domain dialogue systems.
Traditional template-based text generation (Becker, 2002; Foster and White, 2004; Gatt and Reiter, 2009) relies on handcrafted templates that are expensive to obtain. Recently, some work explores how to automatically mine templates from plain text and how to integrate the templates into neural architectures to enhance the interpretability of generation. Along this line, Duan et al. (2017) mine patterns from related questions on community QA websites and leverage the patterns with a retrieval-based approach and a generation-based approach for question generation. Wiseman et al. (2018) exploit a hidden semi-Markov model for joint template extraction and text generation. In addition to structured templates, raw text retrieved from indexes is also used as "soft templates" in various natural language generation tasks (Guu et al., 2018; Pandey et al., 2018; Cao et al., 2018; Peng et al., 2019). In this work, we leverage templates for open domain response generation. Our idea is inspired by (Wiseman et al., 2018), but latent templates estimated from one source are transferred to another source in order to handle the low-resource problem, and the generation model is learned by an adversarial approach rather than by maximum likelihood estimation.
Before us, the low-resource problem has been studied in tasks such as machine translation (Gu et al., 2018a,b), POS tagging (Kann et al., 2018), word embedding (Jiang et al., 2018), automatic speech recognition (Tüske et al., 2014), and task-oriented dialogue systems (Tran and Nguyen, 2018; Mi et al., 2019). In this work, we pay attention to low-resource open domain response generation, which is untouched by existing work. We propose attacking the problem with unpaired data, which is related to the effort in low-resource machine translation with monolingual data (Gulcehre et al., 2015; Sennrich et al., 2015; Zhang and Zong, 2016). Our method is unique in that rather than using the unpaired data through multi-task learning (Zhang and Zong, 2016) or back-translation (Sennrich et al., 2015), we extract linguistic knowledge from the data as latent templates and use the templates as a prior in generation.

Low-Resource Response Generation
In this section, we first formalize the setting upon which we study low-resource response generation and then elaborate the model of response generation with paired and unpaired data, including how to learn latent templates from the unpaired data, and how to perform generation with the templates.

Problem Formalization
Suppose that we have a dataset D_P = {(X_i, Y_i)}_{i=1}^{n}, where ∀i, (X_i, Y_i) is a message-response pair and n represents the number of pairs in D_P. Different from existing work, we assume that n is small (e.g., a few hundred thousand), and further assume that there is another set D_U = {T_i}_{i=1}^{N} with T_i a piece of plain text sharing the same characteristics as {Y_i}_{i=1}^{n} (e.g., both are questions) and N > n. Our goal is to learn a generation probability P(Y|X) with both D_P and D_U. Thus, given a new message X, we can generate a response Y for X following P(Y|X).
Since the limited resource in D_P may not support accurate learning of P(Y|X), we try to transfer the linguistic knowledge in D_U to response generation. The challenges then lie in two aspects: (1) how to represent the linguistic knowledge in D_U; and (2) how to effectively leverage the knowledge extracted from D_U for response generation, given that D_U cannot provide any information about the correspondence between a message X and a response Y. The remaining part of the section describes our solutions to the two problems.

Learning Templates from D_U
In representing the knowledge in D_U, we hope to keep both semantic information and syntactic information. Thus, we consider extracting templates from D_U as the knowledge. A template segments a piece of text into a structured representation. With the templates, semantically and functionally similar text segments are grouped together. Since the templates encode the structure of the language in D_U, they can inform the generation model about how to express a response in a desired way (e.g., as a question or with a specific sentiment). Here, we prefer an unsupervised and parametric approach to learning templates, since "unsupervised" means that the approach is generally applicable to various tasks, and "parametric" allows us to naturally incorporate the templates into the generation model. A natural choice for template learning is then the neural hidden semi-Markov model (NHSMM) (Dai et al., 2016; Wiseman et al., 2018).
NHSMM is an HSMM parameterized with neural networks. HSMM (Murphy, 2002) extends HMM by allowing a hidden state to emit a sequence of observations, and thus can segment a piece of text with the latent variables and group similar segments by the variables. Formally, given an observed sequence Y = (y_1, ..., y_S), the joint distribution of Y and its segmentation is

P(Y, z_{1:S'}, l_{1:S'}) = P(z_1) \prod_{t=2}^{S'} P(z_t | z_{t-1}) \prod_{t=1}^{S'} P(l_t | z_t) P(y_{i(t-1)+1:i(t)} | z_t, l_t),

where z_t ∈ {1, ..., K} is the hidden state for step t, l_t ∈ {1, ..., D} is the duration variable for z_t that represents the number of tokens emitted by z_t, i(t) = \sum_{j=1}^{t} l_j with i(0) = 0 and i(S') = S, and y_{i(t-1)+1:i(t)} is the sequence (y_{i(t-1)+1}, ..., y_{i(t)}).
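To make the factorization concrete, the joint probability above can be computed for one given segmentation. The sketch below uses toy tabular distributions with a per-token categorical emission, which is a simplifying assumption for illustration only; the paper's NHSMM replaces every table with a neural parameterization.

```python
import numpy as np

def hsmm_joint_logprob(y, states, durations, log_pi, log_A, log_dur, log_emit):
    """Log joint probability of sequence y and one segmentation
    (states, durations) under a toy HSMM. Emissions are per-token
    categoricals here, purely for illustration."""
    assert sum(durations) == len(y)
    lp = log_pi[states[0]]                      # P(z_1)
    pos = 0
    for t, (z, l) in enumerate(zip(states, durations)):
        if t > 0:
            lp += log_A[states[t - 1], z]       # transition P(z_t | z_{t-1})
        lp += log_dur[z, l - 1]                 # duration P(l_t | z_t)
        for j in range(pos, pos + l):           # emission of the segment's tokens
            lp += log_emit[z, y[j]]
        pos += l
    return lp
```

For instance, with K = 2 states and maximum duration D = 2, the segmentation (⟨z_1 = 0, l_1 = 1⟩, ⟨z_2 = 1, l_2 = 2⟩) of a three-token sequence scores the product of one initial, one transition, two duration, and three emission probabilities.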
The hidden vector h_j for position j in the segment emitted by state z_t is formulated as

h_j = g_{z_t} ⊙ GRU(e_{z_t}, h_{j-1}),   (1)

where ⊙ refers to element-wise multiplication, e_{z_t} ∈ R^{d_1} is the embedding of state z_t, and g_{z_t} ∈ R^{d_2} is a gate (in total, there are K gate vectors as parameters).
The emission distribution P(y_{i(t-1)+1:i(t)} | z_t, l_t) factorizes over the tokens of the segment, with per-token probabilities given by a softmax parameterized by W_1 ∈ R^{V×d_2} and b_1 ∈ R^{d_2}, with V the vocabulary size. Following Murphy (2002), the marginal distribution of Y can be obtained by the backward algorithm, which is formulated as

β_t(i) = \sum_{j=1}^{K} β^*_t(j) A(i, j),
β^*_t(j) = \sum_{d=1}^{D} β_{t+d}(j) P(d | j) P(y_{t+1:t+d} | j, d),
P(Y) = \sum_{j=1}^{K} β^*_0(j) P(q_1 = j),

where q_t is the hidden state of the t-th word in Y, and the base cases are β_S(i) = 1, ∀i ∈ {1, ..., K}. Specifically, to learn more reasonable segmentations, we parse every sentence with the Stanford parser (Manning et al., 2014) and force the NHSMM not to break syntactic elements such as VPs and NPs. The parameters of the NHSMM are estimated by maximizing the log-likelihood of D_U through backpropagation.
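The backward recursion can likewise be sketched with toy tabular distributions (again an assumption for illustration; the actual model computes the duration and emission probabilities with neural networks). The function below returns the exact marginal P(Y), which can be checked against brute-force enumeration of all segmentations:

```python
import numpy as np

def hsmm_marginal(y, pi, A, dur, emit):
    """Marginal P(Y) via the HSMM backward algorithm.
    beta[t, i]:  prob. of y_{t+1:S} given a segment in state i ends at t
    betas[t, j]: prob. of y_{t+1:S} given a segment in state j starts at t+1"""
    S, K, D = len(y), pi.shape[0], dur.shape[1]
    beta = np.zeros((S + 1, K))
    betas = np.zeros((S + 1, K))
    beta[S, :] = 1.0                                   # base case beta_S(i) = 1
    for t in range(S - 1, -1, -1):
        for j in range(K):                             # duration/emission step
            betas[t, j] = sum(
                beta[t + d, j] * dur[j, d - 1] * np.prod(emit[j, y[t:t + d]])
                for d in range(1, min(D, S - t) + 1))
        for i in range(K):                             # transition step
            beta[t, i] = sum(betas[t, j] * A[i, j] for j in range(K))
    return float(np.sum(betas[0, :] * pi))             # sum_j beta*_0(j) P(q_1 = j)
```

On a short sequence the result matches summing the joint probability over every possible segmentation and state chain, which is a convenient correctness check before moving to neural parameterizations.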

Response Generation with Template Prior
We propose incorporating the templates parameterized by the NHSMM learned from D_U into response generation as a prior. Figure 1 illustrates the architecture of the generation model. In a nutshell, the model first samples a chain of states with durations as a template. The template specifies a segmentation of the response to generate. Then, the hidden representations of the segments defined by Equation (1) are fed to an encoder-decoder architecture for response generation, where the hidden states of the decoder are calculated with both attention over the hidden states of the input message given by the encoder and the hidden representations of the segments given by the template prior.
The template prior acts as a base and assists the encoder-decoder in response generation with regard to an input message when paired information is insufficient for learning the correspondence between a message and a response. Note that, similar to the conditional variational autoencoder (CVAE) (Zhao et al., 2017), our model also exploits hidden variables for response generation. The difference is that the hidden variables in our model are structured and learned from extra resources, and thus encode more semantic and syntactic information.
Specifically, we segment the responses in D_P with the Viterbi algorithm (Zucchini et al., 2016), collect all chains of states as a pool, and sample a chain from the pool uniformly. We do not sample states according to the transition matrix [A(i, j)]_{K×K}, since it is difficult to determine the end of a chain. Suppose that the sampled chain is (z_1, ..., z_{S'}); then ∀ 1 ≤ t ≤ S', we sample an l_t for z_t according to P(l_t | z_t), and finally form a latent template T = (⟨z_1, l_1⟩, ..., ⟨z_{S'}, l_{S'}⟩). Given a message X = (x_1, ..., x_L), the encoder exploits a GRU to transform X into a hidden sequence H_X = (h_{X,1}, ..., h_{X,L}) with the i-th hidden state h_{X,i} = GRU(e_{x_i}, h_{X,i-1}), where e_{x_i} ∈ R^{d_3} is the embedding of word x_i and h_{X,0} = 0. Then, when predicting the t-th word of the response, the decoder calculates the probability P(y_t | y_{1:t-1}, X, T) from its hidden state, which is computed with both attention over H_X and the hidden representations of the segments given by the template prior.
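The template sampling step above can be sketched as follows, where chain_pool (the state chains collected from Viterbi segmentations of D_P) and dur_probs (the table of P(l | z)) are hypothetical stand-ins for quantities produced by the trained NHSMM:

```python
import random

def sample_template(chain_pool, dur_probs, rng=random):
    """Sample a latent template (<z_1, l_1>, ..., <z_S', l_S'>):
    draw a state chain uniformly from the pool, then draw each
    duration l_t from P(l_t | z_t)."""
    chain = rng.choice(chain_pool)              # uniform over collected chains
    template = []
    for z in chain:
        probs = dur_probs[z]                    # probs[d - 1] = P(l = d | z)
        l = rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]
        template.append((z, l))
    return template
```

Sampling chains from a pool rather than from the transition matrix sidesteps the problem of deciding when a chain should end, matching the motivation given above.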

Learning Approach
Intuitively, we can estimate the parameters of the encoder-decoder and fine-tune the parameters of the NHSMM by maximizing the likelihood of D_P (i.e., MLE). However, since D_P only contains a few pairs, the MLE approach may suffer from a dilemma: (1) if we stop training early, then both the template prior and the encoder-decoder are not sufficiently supervised by the pairs; in that case, the linguistic knowledge in D_U will play a more important role in response generation and result in irrelevant responses with regard to messages; or (2) if we let training run long, then the template prior will be overwhelmed by the pairs in D_P, and as a result, the generation model will lose the knowledge obtained from D_U. Since response generation starts from a latent template, we consider learning the model with an adversarial approach (Goodfellow et al., 2014) that can well balance the effect of the latent template and the input message. The learning involves a generator G described in Section 3 and a discriminator D. G is updated with the REINFORCE algorithm (Williams, 1992) with rewards defined by D, and D is updated to distinguish human responses in D_P from responses generated by G.
Generator Pre-training. To improve the stability of adversarial learning, we first pre-train G with MLE on D_P. ∀(X_i, Y_i) ∈ D_P, the template prior T_i is obtained by running the Viterbi algorithm (Zucchini et al., 2016) on Y_i rather than by sampling. Let Y_i = (y_{i,1}, ..., y_{i,S_i}); then the objective of pre-training is to maximize \sum_i \sum_{t=1}^{S_i} \log P(y_{i,t} | y_{i,1:t-1}, X_i, T_i).

Discriminator Update. The discriminator D is defined by a convolutional neural network (CNN) based binary classifier (Kim, 2014). D takes a message-response pair as input and outputs a score that indicates how likely the response is from humans. In the model, the message and the response are separately embedded as vectors by CNNs, and then the concatenation of the two vectors is fed to a 2-layer MLP to calculate the score. Let Ŷ_i be the response generated by G for X_i; then D is updated by maximizing the objective \sum_i [\log D(X_i, Y_i) + \log(1 - D(X_i, Ŷ_i))].

Generator Update. The generator G is updated by the policy gradient method (Yu et al., 2017; Li et al., 2017). Let ŷ_{1:t} be a partial response generated by G via beam search for message X until step t; then we adopt the Monte Carlo (MC) search method and sample N paths that complete ŷ_{1:t} as responses {Ŷ_i}_{i=1}^{N}. The intermediate reward for ŷ_{1:t} is r_t = (1/N) \sum_{i=1}^{N} D(X, Ŷ_i). The gradient for updating G is given by

∇_θ J = \sum_t r_t ∇_θ \log P(ŷ_t | ŷ_{1:t-1}, X, T),   (5)

where θ represents the parameters of G, and T is a sampled template. To control the quality of MC search, we sample from the top 50 most probable words at each step.
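The Monte Carlo reward computation can be sketched as below, where rollout_fn (completing a partial response by sampling from G) and discriminator (the CNN classifier D) are hypothetical stand-ins for the trained components:

```python
def mc_reward(partial, message, rollout_fn, discriminator, n_samples=5):
    """Intermediate reward for a partial response y_{1:t}: complete it
    with N Monte Carlo rollouts and average the discriminator scores,
    i.e., r_t = (1/N) * sum_i D(X, Y_i)."""
    completions = [rollout_fn(partial) for _ in range(n_samples)]
    scores = [discriminator(message, y) for y in completions]
    return sum(scores) / n_samples
```

In REINFORCE, the log-probability gradient at each step is then weighted by this reward, so prefixes whose completions tend to fool D are reinforced.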
The learning algorithm is summarized in Algorithm 1. Note that when learning the generation model from D_P, we freeze the embeddings of states (i.e., e_{z_t} in Equation (1)) and the embeddings of words given by the NHSMM, and update all other parameters in generator pre-training and the following adversarial learning.

Experiments
We test the proposed approach on two tasks: question response generation and sentiment response generation. The first task requires a model to generate a question as a response to a given message, while in the second task, as a showcase, responses should express positive sentiment.

Experiment Setup
Datasets. For the question response generation task, we choose the data published in (Wang et al., 2018) as the paired dataset. The data are obtained by filtering 9 million message-response pairs mined from Weibo with 20 handcrafted question templates, and are split into a training set, a validation set, and a test set with 481k, 5k, and 5k pairs respectively. In addition to the paired data, we crawl 776k questions from Zhihu, a Chinese community QA website featuring high-quality content, as an unpaired dataset. Both datasets are tokenized by the Stanford Chinese word segmenter. We keep the 20,000 most frequent words in the two datasets as a vocabulary for the encoder, the decoder, and the NHSMM. The vocabulary covers 95.8% of the words appearing in the messages, the responses, and the questions. Other words are replaced with "UNK". For the sentiment response generation task, we mine 2 million message-response pairs from the Twitter FireHose, filter responses with positive sentiment using the Stanford Sentiment Annotator toolkit (Socher et al., 2013), and obtain 360k pairs as a paired dataset. As pre-processing, we remove URLs and usernames, and transform each word to lower case. After that, the data are split into a training set, a validation set, and a test set with 350k, 5k, and 5k pairs respectively. Besides, we extract 1 million tweets with positive sentiment from a public corpus (Cheng et al., 2010) as an unpaired dataset. The top 20,000 most frequent words in the two datasets are kept as a vocabulary that covers 99.3% of the words. Words excluded from the vocabulary are treated as "UNK". In both tasks, human responses in the test sets are taken as ground truth for automatic metric calculation. From each test set, we randomly sample 500 distinct messages and recruit human annotators to judge the quality of responses generated for these messages.

Algorithm 1: Learning a generation model with paired and unpaired data.
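The vocabulary construction described above (top 20,000 tokens shared by the encoder, decoder, and NHSMM, with out-of-vocabulary tokens mapped to "UNK") can be sketched as:

```python
from collections import Counter

def build_vocab(token_lists, size=20000, unk="UNK"):
    """Keep the `size` most frequent tokens as the vocabulary, report
    token coverage, and return an encoder that maps rare tokens to UNK."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok for tok, _ in counts.most_common(size)}
    total = sum(counts.values())
    coverage = sum(c for tok, c in counts.items() if tok in vocab) / max(total, 1)

    def encode(tokens):
        return [tok if tok in vocab else unk for tok in tokens]

    return vocab, coverage, encode
```

The coverage number corresponds to the 95.8% and 99.3% figures reported above: the fraction of token occurrences that fall inside the kept vocabulary.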

Evaluation Metrics. We conduct evaluation with both automatic metrics and human judgments. For automatic evaluation, besides BLEU-1 (Papineni et al., 2002) and Rouge-L (Lin, 2004), we follow (Serban et al., 2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. All these metrics are computed by a popular NLG evaluation project available at https://github.com/Maluuba/nlg-eval. In terms of human evaluation, for each task we recruit 3 well-educated native speakers as annotators and let them compare our model with each of the baselines. Every time, we show an annotator a message (in total 500) and two responses, one from our model and the other from a baseline model. Both responses are top-1 results of beam search, and the two responses are presented in random order. The annotator then compares the two responses on three aspects: (1) Fluency: whether the response is fluent without grammatical errors; (2) Relevance: whether the response is relevant to the given message; and (3) Richness: whether the response contains informative and interesting content, and thus may keep the conversation going. For each aspect, if the annotator cannot tell which response is better, he/she is asked to label a "tie". Each pair of responses receives 3 labels on each of the three aspects, and agreement among the annotators is measured by Fleiss' kappa (Fleiss and Cohen, 1973).
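Fleiss' kappa over the annotators' labels can be computed as follows (a standard implementation; the per-item category counts in the example are illustrative):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is a list of per-item category counts,
    e.g. [2, 1, 0] means two annotators chose "win", one chose "lose",
    and none chose "tie" for that item."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # marginal proportion of assignments per category
    p = [sum(item[c] for item in ratings) / (n_items * n_raters)
         for c in range(n_cats)]
    # observed per-item agreement, averaged over items
    P_bar = sum((sum(c * c for c in item) - n_raters) /
                (n_raters * (n_raters - 1)) for item in ratings) / n_items
    P_e = sum(pc * pc for pc in p)  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

Values above 0.6, as reported in the evaluation below, are conventionally read as substantial agreement.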

Baselines
We compare our model with the following baselines: (1) Seq2Seq: the basic sequence-to-sequence with attention architecture (Bahdanau et al., 2015). (2) CVAE: the conditional variational autoencoder that represents the relationship between messages and responses with latent variables (Zhao et al., 2017).
We use the code published at https://github.com/snakeztc/NeuralDialog-CVAE. (3) HTD: the hard typed decoder model proposed in (Wang et al., 2018) that exhibits the best performance on the dataset selected in this work for question response generation. The model estimates distributions over three types of words (i.e., interrogative, topic, and ordinary) and modulates the final distribution during generation. Since our experiments are conducted on the same data as those in (Wang et al., 2018), we run the code shared at https://github.com/victorywys/Learning2Ask_TypedDecoder with the default setting. (4) ECM: the emotional chatting machine proposed in (Zhou et al., 2018). We implement the model with the code published at https://github.com/tuxchow/ecm. Since the model can handle various emotions, we train it with the entire 2 million Twitter message-response pairs labeled with positive, negative, and neutral sentiment. Thus, when we only focus on responses with positive sentiment, ECM actually performs multi-task learning for response generation. In the test, we set the sentiment label to "positive". We name our model S2S-Temp. Besides the full model, we also examine three variants in order to understand the effect of the unpaired data and the role of adversarial learning: (1) S2S-Temp-None: the proposed model trained only with the paired data, where the NHSMM is estimated from responses in the paired data; (2) S2S-Temp-50%: the proposed model trained with 50% of the unpaired data; and (3) S2S-Temp-MLE: the pre-trained generator described in Section 4. These variants are only involved in automatic evaluation.

Implementation Details
In both tasks, we set the number of states (i.e., K) and the maximal number of emissions (i.e., D) in the NHSMM as 50 and 4 respectively. d_1, d_2, and d_3 are set as 600, 300, and 300 respectively. In adversarial learning, we use three types of filters with window sizes 1, 2, and 3 in the discriminator.
The number of filters is 128 for each type. The number of samples obtained from MC search (i.e., N) at each step is 5. We learn all models using the Adam algorithm (Kingma and Ba, 2015), monitor perplexity on the validation sets, and terminate training when perplexity gets stable. In our model, the learning rates for the NHSMM, the generator, and the discriminator are set as 1 × 10^{-3}, 1 × 10^{-5}, and 1 × 10^{-3} respectively.
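The stopping criterion ("terminate training when perplexity gets stable") can be sketched with a patience-based loop; eval_ppl below is a hypothetical callback that trains one epoch and returns validation perplexity:

```python
def train_until_stable(eval_ppl, patience=3, max_epochs=100, tol=1e-3):
    """Run epochs until validation perplexity stops improving by more
    than `tol` for `patience` consecutive epochs; return the best value.
    The patience and tolerance values are illustrative assumptions."""
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        ppl = eval_ppl(epoch)
        if ppl < best - tol:
            best, stale = ppl, 0   # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                break              # perplexity is stable: stop training
    return best
```

This is one common way to operationalize "perplexity gets stable"; the paper does not specify the exact criterion, so patience and tolerance here are placeholders.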

Evaluation Results
Table 1 and Table 2 report the results of automatic evaluation on the two tasks. We can see that on both tasks, S2S-Temp outperforms all baseline models in terms of all metrics, and the improvements are statistically significant (t-test with p-value < 0.01). The results demonstrate that when only limited pairs are available, S2S-Temp can effectively leverage unpaired data to enhance the quality of response generation. Although lacking a fine-grained check, from the comparison among S2S-Temp-None, S2S-Temp-50%, and S2S-Temp, we can conclude that the performance of S2S-Temp improves with more unpaired data. Moreover, without unpaired data, our model is even worse than CVAE, since the structured templates cannot be accurately estimated from so few data; as long as half of the unpaired data are available, the model outperforms the baseline models on most metrics. The results further verify the important role the unpaired data play in learning a response generation model from low resources. S2S-Temp is better than S2S-Temp-MLE, indicating that the adversarial learning approach can indeed enhance the relevance of responses with regard to messages.

Table 3 shows the results of human evaluation. In terms of all three aspects, S2S-Temp is better than all the baseline models. The values of kappa are all above 0.6, indicating substantial agreement among the annotators. When the size of the paired data is small, the basic Seq2Seq model tends to generate more generic responses. That is why the gap between S2S-Temp and Seq2Seq is much smaller on fluency than on the other two aspects. With the latent variables, CVAE brings both content and noise into responses. Therefore, the gap between S2S-Temp and CVAE is more significant on fluency and relevance than on richness. HTD can greatly enrich the content of responses, which is consistent with the results in (Wang et al., 2018), although sometimes the responses might be irrelevant to messages or ill-formed. ECM does not perform well on either automatic evaluation or human judgment.

Case Study
To further understand how S2S-Temp leverages templates for response generation, we show two examples from the test data, one for question response generation in Table 4 and the other for sentiment response generation in Table 5, where subscripts refer to states of the NHSMMs. First, we can see that a template defines a structure for a response. By varying templates, we can obtain responses with different syntax and semantics for a message. Second, some states may have consistent functions across responses. For example, state 36 in question response generation may refer to pronouns, and "I'm" and "it was" correspond to the same state 23 in sentiment response generation. Finally, some templates provide strong syntactic signals to response generation. For example, the segmentation of "Really? I don't believe it" given by the template (48, 36, 32) matches the parsing result "FRAG + LS + VP" given by the Stanford syntactic parser.

Conclusions
We study low-resource response generation for open domain dialogue systems by assuming that paired data are insufficient for modeling the relationship between messages and responses. To augment the paired data, we consider transferring knowledge from unpaired data to response generation through latent templates parameterized as a neural hidden semi-Markov model, and take the templates as a prior in generation. Evaluation results on question response generation and sentiment response generation indicate that when limited pairs are available, our model can significantly outperform several state-of-the-art response generation models.
e_i, e_j, e_o ∈ R^{d_1} are embeddings of states i, j, and o respectively, and b_{i,j}, b_{i,o} are scalar bias terms. In practice, we set b_{i,j} = −∞ ⇔ i = j to disable self-transition, because adjacent states play different syntactic or semantic roles in a desired template. The emission distribution P(y_{i(t-1)+1:i(t)} | z_t, l_t) is defined token by token over the segment.

Table 1:
Automatic evaluation results for the task of question response generation. Numbers in bold mean that the improvement over the best performing baseline is statistically significant (t-test with p-value < 0.01).

Table 2:
Automatic evaluation results for the task of sentiment response generation. Numbers in bold mean that the improvement over the best performing baseline is statistically significant (t-test with p-value < 0.01).

Table 3:
Human annotation results. W, L, and T refer to Win, Lose, and Tie respectively. The first three rows are results on question response generation, and the last three rows are results on sentiment response generation. The ratios are calculated by combining labels from the three judges.

Table 5:
Sentiment response generation with various templates.