Learning to Control the Specificity in Neural Response Generation

In conversation, a general response (e.g., "I don't know") can correspond to a large variety of input utterances. Previous generative conversational models usually employ a single model to learn the relationship between different utterance-response pairs, and thus tend to favor general and trivial responses that appear frequently. To address this problem, we propose a novel controlled response generation mechanism to handle different utterance-response relationships in terms of specificity. Specifically, we introduce an explicit specificity control variable into a sequence-to-sequence model, which interacts with the usage representation of words through a Gaussian kernel layer to guide the model to generate responses at different specificity levels. We describe two ways to acquire distant labels for the specificity control variable in learning. Empirical studies show that our model significantly outperforms state-of-the-art response generation models under both automatic and human evaluations.


Introduction
Human-computer conversation is a critical and challenging task in AI and NLP. There have been two major streams of research in this direction, namely task-oriented dialog and general-purpose dialog (i.e., chit-chat). Task-oriented dialog aims to help people complete specific tasks such as buying tickets or shopping, while general-purpose dialog attempts to produce natural and meaningful conversations with people on a wide range of topics in open domains (Perez-Marin, 2011; Sordoni et al.). In recent years, the latter has attracted much attention in both academia and industry as a way to explore the possibility of developing a general-purpose AI system in language (e.g., chatbots).
A widely adopted approach to general-purpose dialog is learning a generative conversational model from large-scale social conversation data. Most methods in this line are constructed within the statistical machine translation (SMT) framework, where a sequence-to-sequence (Seq2Seq) model is learned to "translate" an input utterance into a response. However, general-purpose dialog is intrinsically different from machine translation. In machine translation, every sentence and its translation are semantically equivalent, so there exists a 1-to-1 relationship between them. In general-purpose dialog, by contrast, a general response (e.g., "I don't know") can correspond to a large variety of input utterances. For example, in the chit-chat corpus used in this study (as shown in Figure 1), the three most frequent responses are "Must support! Cheer!", "Support! It's good.", and "My friends and I are shocked!", where the response "Must support! Cheer!" is used for 1216 different input utterances. Previous Seq2Seq models, which treat all the utterance-response pairs uniformly and employ a single model to learn the relationship between them, will inevitably favor such high-frequency general responses. Although these responses are safe replies to different utterances, they are boring and trivial since they carry little information, and may quickly lead to an end of the conversation.
There have been a few efforts to address this issue in the literature. Li et al. (2016a) proposed to use Maximum Mutual Information (MMI) as the objective to penalize general responses. This can be viewed as a post-processing approach that does not fundamentally solve the generation of trivial responses. Xing et al. (2017) pre-defined a set of topics from an external corpus to guide the generation of the Seq2Seq model. However, it is difficult to ensure that the topics learned from the external corpus are consistent with those in the conversation corpus, which introduces additional noise. Zhou et al. (2017) introduced latent responding factors to model multiple responding mechanisms. However, such latent factors are usually difficult to interpret, and it is hard to decide their number.
In our work, we propose a novel controlled response generation mechanism to handle different utterance-response relationships in terms of specificity. The key idea is inspired by our observation of everyday conversation between humans. In human-human conversation, people often actively control the specificity of responses depending on their own response purpose (which might be affected by a variety of underlying factors such as their current mood, knowledge state, and so on). For example, they may provide interesting and specific responses if they like the conversation, or general responses if they want to end it. They may provide very detailed responses if they are familiar with the topic, or just "I don't know" otherwise. Therefore, we propose to simulate the way people actively control the specificity of the response.
We employ a Seq2Seq framework and further introduce an explicit specificity control variable to represent the response purpose of the agent. Meanwhile, we assume that each word, beyond the semantic representation that relates to its meaning, also has another representation that relates to its usage preference under different response purposes. We name this the usage representation of words. The specificity control variable then interacts with the usage representation of words through a Gaussian kernel layer, and guides the Seq2Seq model to generate responses at different specificity levels. We refer to our model as the Specificity Controlled Seq2Seq model (SC-Seq2Seq). Note that unlike the work of Xing et al. (2017), we do not rely on any external corpus to learn our model. All the model parameters are learned on the same conversation corpus in an end-to-end way.
We employ distant supervision to train our SC-Seq2Seq model since the specificity control variable is unknown in the raw data. We describe two ways to acquire distant labels for the specificity control variable, namely Normalized Inverse Response Frequency (NIRF) and Normalized Inverse Word Frequency (NIWF). By using normalized values, we restrict the specificity control variable to a pre-defined continuous value range in which each end has a clear meaning for specificity. This is significantly different from the discrete latent factors of Zhou et al. (2017), which are difficult to interpret.
We conduct an empirical study on a large public dataset, and compare our model with several state-of-the-art response generation methods. Empirical results show that our model can generate either general or specific responses, and significantly outperform existing methods under both automatic and human evaluations.

Related Work
In this section, we briefly review the related work on conversational models and response specificity.

Conversational Models
Automatic conversation has attracted increasing attention over the past few years. Early research relied on handcrafted rules and templates (Walker et al., 2001; Williams et al., 2013; Henderson et al., 2014). These approaches required little data for training but huge manual effort to build the model, which is very time-consuming. Nowadays, conversational models fall into two major categories: retrieval-based and generation-based. Retrieval-based conversational models search for the most suitable response among candidate responses using different schemas (Kearns, 2000; Wang et al., 2013; Yan et al., 2016). These methods rely on pre-existing responses, and are thus difficult to extend to open domains (Zhou et al., 2017). With the large amount of conversation data available on the Internet, generation-based conversational models developed within an SMT framework (Ritter et al., 2011; Cho et al., 2014; Bahdanau et al., 2015) show promising results. Shang et al. (2015) generated replies for short-text conversation with an encoder-decoder-based neural network with local and global attention. Serban et al. (2016) built an end-to-end dialogue system using a generative hierarchical neural network. Gu et al. (2016) introduced CopyNet to simulate the repeating behavior of humans in conversation. Similarly, our model is also based on the encoder-decoder framework.

Response Specificity
Some recent studies have focused on generating more specific or informative responses in conversation. This is also called the diversity problem: if each response is more specific, responses to different utterances become more diverse. As an early work, Li et al. (2016a) used Maximum Mutual Information (MMI) as the objective to penalize general responses. Later, Li et al. (2017) proposed a data distillation method, which trains a series of generative models at different levels of specificity and uses a reinforcement learning model to choose the model best suited for decoding depending on the conversation context. These methods circumvent the general response issue by using either a post-processing approach or a data selection approach.
Besides, Li et al. (2016b) tried to build a personalized conversation engine by adding extra personal information. Xing et al. (2017) incorporated topic information from an external corpus into the Seq2Seq framework to guide the generation. However, an external dataset may not always be available or topically consistent with the conversation dataset. Zhou et al. (2017) introduced latent responding factors into the Seq2Seq model to avoid generating safe responses. However, these latent factors are usually difficult to interpret, and it is hard to decide their number.
Moreover, Mou et al. (2016) proposed a content-introducing approach to generate a response based on a predicted keyword. Yao et al. (2016) attempted to improve specificity within the reinforcement learning framework by using the averaged IDF score of the words in the response as a reward. Shen et al. (2017) presented a conditional variational framework for generating specific responses based on specific attributes. Unlike these existing methods, we introduce an explicit specificity control variable into a Seq2Seq model to handle different utterance-response relationships in terms of specificity.

Specificity Controlled Seq2Seq Model
In this section, we present the Specificity Controlled Seq2Seq model (SC-Seq2Seq), a novel Seq2Seq model designed for actively controlling the generated responses in terms of specificity.

Model Overview
The basic idea of a generative conversational model is to learn the mapping from an input utterance to its response, typically using an encoder-decoder framework. Formally, given an input utterance sequence X = (x_1, x_2, ..., x_T) and a target response sequence Y = (y_1, y_2, ..., y_{T'}), a neural Seq2Seq model is employed to learn p(Y|X) based on the training corpus D = {(X, Y) | Y is the response of X}. By maximizing the likelihood of all the utterance-response pairs with a single mapping mechanism, the learned Seq2Seq model will inevitably favor those general responses that can correspond to a large variety of input utterances.
To address this issue, we assume that there are different mapping mechanisms between utterance-response pairs with respect to their specificity relation. Rather than involving latent factors, we propose to introduce an explicit variable s into a Seq2Seq model to handle different utterance-response mappings in terms of specificity. By doing so, we hope that (1) s has an explicit meaning for specificity, and (2) s can not only interpret but also actively control the generation of the response Y given the input utterance X. The goal of our model thus becomes learning p(Y|X, s) over the corpus D, where we acquire distant labels for s from the same corpus. The overall architecture of SC-Seq2Seq is depicted in Figure 2, and we detail our model in the following.

Encoder
The encoder maps the input utterance X into a compact vector that captures its essential topics. Specifically, we use a bi-directional GRU (Cho et al., 2014) as the utterance encoder, where each word x_i is first mapped to its semantic representation e_i by the semantic embedding matrix E as the input of the encoder. The encoder then represents the utterance X as a series of hidden vectors {h_t}_{t=1}^{T}, modeling the sequence from both the forward and backward directions. Finally, we use the final backward hidden state as the initial hidden state of the decoder.
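As a concrete illustration, the bi-directional encoding step can be sketched as follows. This is a minimal NumPy sketch under our own simplifications (cell structure, parameter names, and seeds are ours, not the paper's); a real implementation would use a deep learning framework, as the paper's TensorFlow implementation does.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (illustrative; parameter names are ours)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the update, reset, and candidate gates.
        self.W = rng.uniform(-0.08, 0.08, (3, hidden_dim, input_dim))
        self.U = rng.uniform(-0.08, 0.08, (3, hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h)       # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h)       # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))
        return (1 - z) * h + z * h_tilde

def bidirectional_encode(embeddings, fwd_cell, bwd_cell):
    """Run forward and backward GRUs over the embedded utterance.
    Returns the per-step hidden states {h_t} (forward and backward
    concatenated) and the final backward state, which the paper uses
    to initialise the decoder."""
    T = len(embeddings)
    h_f = np.zeros(fwd_cell.hidden_dim)
    h_b = np.zeros(bwd_cell.hidden_dim)
    fwd, bwd = [], [None] * T
    for t in range(T):
        h_f = fwd_cell.step(embeddings[t], h_f)
        fwd.append(h_f)
    for t in reversed(range(T)):
        h_b = bwd_cell.step(embeddings[t], h_b)
        bwd[t] = h_b
    states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
    return states, bwd[0]  # bwd[0] is the final backward hidden state
```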

Decoder
The decoder generates a response Y given the hidden representations of the input utterance X under the specificity level denoted by the control variable s. Specifically, at step t, we define the probability of generating a target word y_t by a "mixture" of probabilities:

p(y_t) = β p_M(y_t) + γ p_S(y_t),

where p_M(y_t) denotes the semantic-based generation probability, p_S(y_t) denotes the specificity-based generation probability, and β and γ are the mixture coefficients. p_M(y_t) is defined the same as in the traditional Seq2Seq model (Sutskever et al., 2014):

p_M(y_t = w) = w^T softmax(W_M^h h_t + W_M^e e_{t-1} + b_M),

where w is a one-hot indicator vector of the word w and e_{t-1} is the semantic representation of the (t-1)-th generated word in the decoder. W_M^h, W_M^e and b_M are parameters. h_t is the t-th hidden state in the decoder, computed by

h_t = f(h_{t-1}, e_{t-1}, c_t),

where f is a GRU unit and c_t is the context vector that allows the decoder to pay different attention to different parts of the input at different steps (Bahdanau et al., 2015).
p_S(y_t) denotes the generation probability of the target word given the specificity control variable s. Here we introduce a Gaussian kernel layer to define this probability. Specifically, we assume that each word, beyond its semantic representation e, also has a usage representation u, mapped by the usage embedding matrix U. The usage representation of a word denotes its usage preference under different specificity levels. The specificity control variable s then interacts with the usage representations through the Gaussian kernel layer to produce the specificity-based generation probability:

p_S(y_t = w) ∝ exp( -(Ψ_S(U, w) - s)^2 / (2σ^2) ),

where σ^2 is the variance, and Ψ_S(·) maps the word usage representation into a real value, with the specificity control variable s acting as the mean of the Gaussian distribution:

Ψ_S(U, w) = σ(W_U (U^T w) + b_U),

where W_U and b_U are parameters to be learned, and σ(·) here denotes the sigmoid function. Note that in general we can use any real-valued function to define Ψ_S(U, w). In this work, we use the sigmoid since we want to define s within the range [0, 1], so that each end has a clear meaning for specificity: 0 denotes the most general response, while 1 denotes the most specific response. In the next section, we keep this property when defining the distant label for the control variable.
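The Gaussian kernel layer can be sketched numerically as follows. This is an illustrative NumPy sketch; the shapes of W_U and b_U (a vector and a scalar mapping each usage vector to one specificity score) are our simplification of Ψ_S, and the function names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def specificity_probs(U, W_U, b_U, s, sigma2=1.0):
    """Specificity-based generation probability p_S over the vocabulary.
    U: (V, d_u) usage embedding matrix; W_U: (d_u,) vector; b_U: scalar.
    Each word's usage vector is squashed to [0, 1] by a sigmoid, then
    scored by a Gaussian kernel centred at the control variable s."""
    psi = sigmoid(U @ W_U + b_U)                   # (V,) specificity scores
    scores = np.exp(-(psi - s) ** 2 / (2.0 * sigma2))
    return scores / scores.sum()                   # normalise over vocabulary

def mixture_probs(p_M, p_S, beta=0.5, gamma=0.5):
    """Final generation distribution: a mixture of the semantic-based
    probability p_M and the specificity-based probability p_S."""
    p = beta * p_M + gamma * p_S
    return p / p.sum()
```

Setting s close to 1 shifts probability mass toward words whose sigmoid-mapped usage score is high, i.e., toward more specific words.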

Distant Supervision
We train our SC-Seq2Seq model by maximizing the log likelihood of generating responses over the training set D:

L = Σ_{(X,Y)∈D} log p(Y|X, s; θ),

where θ denotes all the model parameters. Since s is an explicit control variable in our model, we need triples (X, Y, s) for training. However, s is not directly available in the raw conversation corpus, so we acquire distant labels for s to learn our model. We introduce two ways of distant supervision on the specificity control variable s, namely Normalized Inverse Response Frequency (NIRF) and Normalized Inverse Word Frequency (NIWF).

Normalized Inverse Response Frequency
Normalized Inverse Response Frequency (NIRF) is based on the assumption that a response is more general if it corresponds to more input utterances in the corpus. Therefore, we use the inverse frequency of a response in a conversation corpus to indicate its specificity level. Specifically, we first build the response collection R by extracting all the responses from D. For a response Y ∈ R, let f_Y denote its corpus frequency in R; we compute its Inverse Response Frequency (IRF) as

IRF_Y = log(|R| / f_Y),

where |R| denotes the size of the response collection R. Next, we use min-max normalization (Jain et al., 2005) to obtain the NIRF value:

NIRF_Y = (IRF_Y - min(IRF_R)) / (max(IRF_R) - min(IRF_R)),

where max(IRF_R) and min(IRF_R) denote the maximum and minimum IRF values in R, respectively. The NIRF value is then used as the distant label of s in training. Note that by using normalized values, we constrain the specificity control variable s to lie within the pre-defined continuous range [0, 1].
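The NIRF labeling can be sketched in a few lines. This is a sketch under our assumptions: we take an IDF-style log form for the inverse response frequency; any monotone variant gives the same ordering after min-max normalization.

```python
import math
from collections import Counter

def nirf_labels(responses):
    """Distant NIRF labels for a list of response strings: inverse
    response frequency, min-max normalised to [0, 1]. A frequent
    (general) response gets a label near 0, a rare one near 1."""
    freq = Counter(responses)
    n = len(responses)
    irf = {y: math.log(n / f) for y, f in freq.items()}
    lo, hi = min(irf.values()), max(irf.values())
    span = (hi - lo) or 1.0            # guard against a constant corpus
    return [(irf[y] - lo) / span for y in responses]
```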

Normalized Inverse Word Frequency
Normalized Inverse Word Frequency (NIWF) is based on the assumption that the specificity level of a response depends on the collection of words it contains: a sentence is more specific if it contains more specific words. Hence, we can use the inverse corpus frequency of the words to indicate the specificity level of a response. Specifically, for a word y in the response Y, we first obtain its Inverse Word Frequency (IWF) as

IWF_y = log(|R| / f_y),

where f_y denotes the number of responses in R containing the word y. Since a response usually contains a collection of words, there are multiple ways to define the response-level IWF value, e.g., the sum, average, minimum, or maximum of the IWF values of all the words. In our work, we find that the best performance is achieved by using the maximum IWF over all the words in Y as the response-level IWF:

IWF_Y = max_{y∈Y} IWF_y.

This is reasonable since a response is specific as long as it contains some specific words. We do not require all the words in a response to be specific; thus sum, average, and minimum would not be appropriate operators for computing the response-level IWF. Again, we use min-max normalization to obtain the NIWF value for the response Y.
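The NIWF labeling described above can be sketched as follows. As with NIRF, this is an illustrative sketch: we assume whitespace-tokenized responses and an IDF-style log form for the inverse word frequency.

```python
import math
from collections import Counter

def niwf_labels(responses):
    """Distant NIWF labels for a list of response strings: the max
    inverse word frequency per response, min-max normalised to [0, 1].
    A response containing at least one rare (specific) word gets a
    label near 1."""
    n = len(responses)
    tokenised = [set(r.split()) for r in responses]
    df = Counter()                      # responses containing each word
    for words in tokenised:
        df.update(words)
    iwf = {w: math.log(n / f) for w, f in df.items()}
    raw = [max(iwf[w] for w in words) for words in tokenised]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in raw]
```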

Specificity Controlled Response Generation
Given a new input utterance, we can employ the learned SC-Seq2Seq model to generate responses at different specificity levels by varying the control variable s. In this way, we can simulate human conversations where one actively controls the response specificity depending on his/her own mind. When we apply our model to a chatbot, there are different ways to use the control variable in practice. If we want the agent to always generate informative responses, we can set s to 1 or values close to 1. If we want the agent to be more dynamic, we can sample s within the range [0, 1] to enrich the styles of the responses. We may further employ reinforcement learning techniques to learn to adjust the control variable depending on users' feedback. This would make the agent even more vivid, and we leave it as future work.

Experiment
In this section, we conduct experiments to verify the effectiveness of our proposed model.

Dataset Description
We conduct our experiments on the public Short Text Conversation (STC) dataset released in NTCIR-13. STC maintains a large repository of post-comment pairs from Sina Weibo, one of the most popular Chinese social media sites.

Baseline Methods
We compare our proposed SC-Seq2Seq model against several state-of-the-art baselines: (1) Seq2Seq-att: the standard Seq2Seq model with the attention mechanism (Bahdanau et al., 2015); (2) MMI-bidi: the Seq2Seq model using Maximum Mutual Information (MMI) as the objective function to reorder the generated responses (Li et al., 2016a); (3) MARM: the Seq2Seq model with a probabilistic framework to model the latent responding mechanisms (Zhou et al., 2017); (4) Seq2Seq+IDF: an extension of Seq2Seq-att that optimizes specificity under the reinforcement learning framework, where the reward is calculated as the sentence-level IDF score of the generated response (Yao et al., 2016). We refer to our models trained using NIRF and NIWF as SC-Seq2Seq_NIRF and SC-Seq2Seq_NIWF, respectively.

Implementation Details
As suggested in (Shang et al., 2015), we construct two separate vocabularies for utterances and responses using the 40,000 most frequent words on each side of the training data, covering 97.7% of the words in utterances and 96.1% of the words in responses. All remaining words are replaced by a special <UNK> token. We implemented our model in TensorFlow. We tuned the hyper-parameters on the development set. Specifically, we use one layer of bi-directional GRU for the encoder and another uni-directional GRU for the decoder, with the GRU hidden unit size set to 300 in both the encoder and decoder. The dimension of the semantic word embeddings for both utterances and responses is 300, while the dimension of the usage word embeddings for responses is 50. We apply the Adam algorithm (Kingma and Ba, 2015) for optimization, with the parameters of Adam set as in (Kingma and Ba, 2015). The variance σ^2 of the Gaussian kernel layer is set to 1, and all other trainable parameters are randomly initialized from a uniform distribution within [-0.08, 0.08]. The mini-batch size is set to 128. We clip the gradient when its norm exceeds 5.
Our model is trained on a Tesla K80 GPU card, and we run the training for up to 12 epochs, which takes approximately five days. We select the model that achieves the lowest perplexity on the development dataset, and we report results on the test dataset.

Evaluation Methodologies
For evaluation, we follow existing work and employ both automatic and human evaluations: (1) distinct-1 & distinct-2 (Li et al., 2016a): we count the numbers of distinct unigrams and bigrams in the generated responses, and divide these numbers by the total numbers of generated unigrams and bigrams. The distinct metrics (both the numbers and the ratios) can be used to evaluate the specificity/diversity of the responses. (2) BLEU (Papineni et al., 2002): BLEU has been shown to correlate strongly with human evaluations. BLEU-n measures the average n-gram precision on a set of reference sentences. (3) Average & Extrema (Serban et al., 2017): Average and Extrema project the generated response and the ground truth response into two separate vectors by taking the mean over the word embeddings or the extremum of each dimension, respectively, and then compute the cosine similarity between them. (4) Human evaluation: three labelers with rich Weibo experience were recruited to conduct the evaluation. Responses from different models are randomly mixed for labeling. Labelers refer to 300 randomly sampled test utterances and score the quality of the responses with the following criteria: 1) +2: the response is not only semantically relevant and grammatical, but also informative and interesting; 2) +1: the response is grammatical and can be used as a response to the utterance, but is too trivial (e.g., "I don't know"); 3) +0: the response is semantically irrelevant or ungrammatical (e.g., grammatical errors or UNK). Inter-rater agreement among the three labelers is measured with Fleiss' kappa (Fleiss and Cohen, 1973).
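For reference, the distinct-n metric described in (1) can be computed as follows (a minimal sketch assuming whitespace-tokenized responses):

```python
def distinct_n(responses, n):
    """distinct-n (Li et al., 2016a): the number of distinct n-grams in
    the generated responses, and that number divided by the total number
    of generated n-grams."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    if not ngrams:
        return 0, 0.0
    return len(set(ngrams)), len(set(ngrams)) / len(ngrams)
```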

Evaluation Results
Model Analysis: We first analyze our models trained with different distant supervision signals. For each model, given a test utterance, we vary the control variable s by setting it to five different values (i.e., 0, 0.2, 0.5, 0.8, 1) to check whether the learned model can actually achieve different specificity levels. As shown in Table 2, we find that: (1) The SC-Seq2Seq model trained with NIRF does not work well. The test performances are almost the same under different s values. This is surprising since the NIRF definition seems to correspond directly to the specificity of a response. Further analysis shows that even though the conversation dataset is large, it is still limited, and a general response may appear very few times in the corpus. In other words, the inverse frequency of a response is only weakly correlated with its specificity.
(2) The SC-Seq2Seq model trained with NIWF can achieve our purpose. By varying the control variable s from 0 to 1, the generated responses turn from general to specific as measured by the distinct metrics. The results indicate that the max inverse word frequency in a response is a good distant label for response specificity.
(3) When we compare the generated responses against the ground truth data, we find that the SC-Seq2Seq_NIWF model with the control variable s set to 0.5 achieves the best performance. The results indicate that there are diverse responses in real data in terms of specificity, and it is necessary to take a balanced setting if we want to fit the ground truth.
Baseline Comparison: The performance comparisons between our model and the baselines are shown in Table 3. We have the following observations: (1) By using MMI as the objective, MMI-bidi improves specificity (in terms of distinct ratios) over the traditional Seq2Seq-att model. (2) MARM achieves the best distinct ratios among the baseline methods, but the worst distinct numbers. The results indicate that MARM tends to generate specific but very short responses. Meanwhile, its low BLEU scores also show that the responses generated by MARM deviate significantly from the ground truth. (3) Seq2Seq+IDF uses the IDF information as the reward in training to improve specificity. All the improvements of our model over the baseline models are statistically significant (p-value < 0.01). These results demonstrate the effectiveness as well as the flexibility of our controlled generation model.

Table 4 shows the human evaluation results. We observe that: (1) SC-Seq2Seq_NIWF,s=1 generates the most informative and interesting responses (labeled "+2") and fewer general responses than all the baseline models, while SC-Seq2Seq_NIWF,s=0 generates the most general responses (labeled "+1"); (2) MARM generates the largest number of bad responses (labeled "+0"), which indicates the drawbacks of its unknown latent responding mechanisms; (3) The kappa values of our models are all larger than 0.4, considered "moderate agreement" on the quality of responses. The largest kappa value is achieved by SC-Seq2Seq_NIWF,s=0, which seems reasonable since it is easy to reach agreement on general responses. Sign tests demonstrate that the improvements of SC-Seq2Seq_NIWF,s=1 over the baseline models are statistically significant (p-value < 0.01). The human judgment results again demonstrate the effectiveness of our controlled generation mechanism.

Case Study
To better understand how the different models perform, we conduct some case studies. We randomly sample three utterances from the test dataset and show the responses generated by the different models. As shown in Table 5, we find that: (1) The responses generated by the four baselines are often quite general and short, which may quickly lead to an end of the conversation. (2) SC-Seq2Seq_NIWF with large control variable values (i.e., s > 0.5) can generate very long and specific responses. In these responses, we can find many informative words. For example, in case 2 with s set to 1 and 0.8, we find words like "眼妆 (eye make-up)", "气质 (temperament)" and "雪亮 (bright)", which are quite specific and strongly related to the conversation topic of "beauty". (3) As we decrease the control variable value, the responses generated by our SC-Seq2Seq_NIWF model become more and more general and shorter.

Analysis on Usage Representations
We also conduct some analysis to understand the usage representations of words introduced in our model. We randomly sample 500 words from our SC-Seq2Seq_NIWF model and apply t-SNE (Maaten and Hinton, 2008) to visualize both the usage and the semantic embeddings. As shown in Figure 3, the two distributions are significantly different. In the usage space, words like "脂肪肝 (fatty liver)" and "久坐 (sedentary)" lie close together, as both are specific words, and both are far from general words like "胖 (fat)". In contrast, in the semantic space, "脂肪肝 (fatty liver)" is close to "胖 (fat)" since they are semantically related, and both are far from the word "久坐 (sedentary)". Furthermore, given some sampled target words, we also show the top-5 most similar words under both representations, based on cosine similarity, in Table 6. Again, the nearest neighbors of the same word are quite different under the two representations. Neighbors based on semantic representations are semantically related, while neighbors based on usage representations are less related semantically but have similar specificity levels.
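The cosine-similarity nearest-neighbor lookup used for this kind of analysis can be sketched as follows (an illustrative helper under our own naming; it works on any embedding matrix, whether the semantic matrix E or the usage matrix U):

```python
import numpy as np

def top_k_neighbours(emb, word2id, id2word, target, k=5):
    """Return the k nearest neighbours of `target` under cosine
    similarity in the embedding matrix `emb` of shape (V, d),
    excluding the target word itself."""
    v = emb[word2id[target]]
    norms = np.linalg.norm(emb, axis=1) * np.linalg.norm(v)
    sims = emb @ v / np.maximum(norms, 1e-12)   # cosine similarities
    sims[word2id[target]] = -np.inf             # exclude the word itself
    order = np.argsort(-sims)[:k]
    return [id2word[i] for i in order]
```

Running this once on the semantic matrix and once on the usage matrix makes the contrast described above directly inspectable for any chosen target word.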

Conclusion
We propose a novel controlled response generation mechanism to handle different utterance-response relationships in terms of specificity. We introduce an explicit specificity control variable into the Seq2Seq model, which interacts with the usage representation of words to generate responses at different specificity levels. Empirical results show that our model can generate either general or specific responses, and significantly outperforms state-of-the-art generation methods.

Figure 1 :
Figure 1: Rank-frequency distribution of the responses in the chit-chat corpus, with x and y axes being lg(rank order) and lg(frequency) respectively.

Figure 2 :
Figure 2: The overall architecture of SC-Seq2Seq model.

Figure 3
Figure 3: t-SNE embeddings of usage and semantic vectors.

Table 2 :
Model analysis of our SC-Seq2Seq under the automatic evaluation.

Table 3 :
Comparisons between our SC-Seq2Seq and the baselines under the automatic evaluation.

Table 4 :
Results on the human evaluation.

Table 5 :
Examples of response generation from the STC test data. s = 1, 0.8, 0.5, 0.2, 0 are the outputs of our SC-Seq2Seq_NIWF with different s values.