S2SPMN: A Simple and Effective Framework for Response Generation with Relevant Information

How to generate relevant and informative responses is one of the core topics in response generation. Following the task formulation of machine translation, previous works mainly treat response generation as a mapping from a source sentence to a target sentence. To realize this mapping, existing works tend to design intuitive but complex models. However, the relevant information that exists in a large dialogue corpus is mostly overlooked. In this paper, we propose Sequence to Sequence with Prototype Memory Network (S2SPMN) to exploit the relevant information provided by a large dialogue corpus to enhance response generation. Specifically, we devise two simple approaches in S2SPMN to select relevant information (named prototypes) from the dialogue corpus. These prototypes are then saved into a prototype memory network (PMN). Furthermore, a hierarchical attention mechanism is devised to extract semantic information from the PMN to assist the response generation process. Empirical studies reveal the advantage of our model over several classical and strong baselines.


Introduction
Dialogue systems, or chatbots, are often considered the future of human-computer interaction, and extensive work has been done in this area (Wen et al., 2016; Qiu et al., 2017; Wen et al., 2017; Kreyssig et al., 2018).
As one of the main approaches to dialogue system design, response generation has attracted increasing attention from the research community. Neural network based models such as the Seq2Seq architecture (Vinyals and Le, 2015; Shang et al., 2015) have proven effective at generating valid responses for a dialogue system. However, as revealed in many previous works (Li et al., 2016a; Wu et al., 2018), the "safe reply" is still an open problem, and many efforts have been made to generate more informative responses (Li et al., 2016a; Mou et al., 2016; Li et al., 2016b; Qiu et al., 2017; He et al., 2017). Note that when we say response generation in this paper, we focus on single-turn chit-chat, because other tasks such as multi-turn (Zhang et al., 2018) or goal-oriented generation can be partly considered extensions of single-turn generation.
* Chenliang Li is the corresponding author.
Though the existing works mentioned above are helpful in some ways, they all follow the task formulation proposed by Ritter et al. (2011), which treats response generation (RG) as a mapping from a source sentence to a target sentence, as in machine translation (MT). This formulation ignores a natural difference between MT and RG: MT deals with sentence pairs that share the same meaning, while RG must realize a meaning transformation from a source post to the target response. In this sense, the meaning transformation makes RG more difficult than MT. Hence, many researchers have designed increasingly complex models. However, given a target post, the relevant information covered by the dialogue corpus is usually overlooked. Intuitively, the responses to a similar post provide more contextual information to guide response generation. To this end, we are interested in exploiting relevant responses in the training set as soft prototypes to assist response generation.
Specifically, in this paper, we propose Sequence to Sequence with Prototype Memory Network (S2SPMN). We introduce two Prototype Memory Networks (PMNs) to store the relevant responses extracted from the dialogue corpus: static PMN and dynamic PMN. Tested on a widely used benchmark dataset, the proposed S2SPMN produces more informative responses than the standard and strong baselines. To the best of our knowledge, this is the first work to leverage prototype information from a dialogue corpus for response generation. The contributions of this paper can be summarized as follows:
(1) We propose S2SPMN, a simple yet effective response generation model that leverages relevant information in the dialogue corpus to assist response generation.
(2) Empirical studies indicate the superiority of the proposed S2SPMN over other methods.

Problem Definition
Given a dialogue dataset $\Gamma = \{(X_i, Y_i)\}_{i=1}^{N}$ of $N$ post-response pairs, where $Y_i$ is the response for a post $X_i$, we aim to train a model with $\Gamma$ such that the model can generate an accurate and informative response for a new post $X$. Here, we propose to exploit the relevant information provided by $\Gamma$. Let $T = (r_1, r_2, ..., r_m)$ denote the prototype memory network constructed for post $X$, where $r_i$ is the i-th relevant response (named prototype) extracted from the dialogue dataset $\Gamma$. The goal is to derive a model that generates the response $Y$: $p(Y \mid X) = p(Y \mid T, X)$.
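One way to read this objective, assuming the usual left-to-right decoding of a Seq2Seq model, is as a product of per-token distributions. The factorization below is a standard assumption rather than an equation recovered from the text; $y_j$ (the j-th word of $Y$) and $n$ (the length of $Y$) are symbols introduced here for illustration.

```latex
p(Y \mid T, X) = \prod_{j=1}^{n} p\left(y_j \mid y_{<j},\, T,\, X\right)
```

Under this reading, the prototype memory $T$ conditions every decoding step, not only the first one.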
In the following sections, we first introduce the generation framework with the hierarchical attention mechanism, assuming the PMN has been constructed. We then introduce two kinds of PMNs: static PMN and dynamic PMN.

Sequence-to-Sequence with Prototype Memory Network
S2SPMN is built on a Seq2Seq encoder-decoder framework (Sutskever et al., 2014) with the attention mechanism (Bahdanau et al., 2014). We use LSTM (Hochreiter and Schmidhuber, 1997) to implement both the encoder and the decoder. The hidden state $h_t$ at the t-th encoding step is generated from the previous hidden state $h_{t-1}$ and the current input $x_t$: $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$. For the decoder, at the i-th timestep, $s_i$ is the decoder's hidden state and $p_i$ is the probability distribution over candidate words.
In the decoder, $\mathrm{MLP}(\cdot)$ is a one-layer perceptron, $o_i$ is the hierarchical attention over the entire prototype memory network, which is formalized in the following sections, and $c_i$ is the summarization of the post with respect to the hidden state $s_{i-1}$, computed with attention parameter $v_1$.

Prototype Memory Network
Given a post $X$, a set of responses is selected from the training set as prototypes and saved into the Prototype Memory Network (PMN). We propose two kinds of Prototype Memory Networks.
Static PMN: For the static PMN (SPMN), we randomly select m responses before training starts, and the entire PMN remains unchanged during the training process. That is, we use the same prototypes for all post-response pairs.
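A minimal sketch of the static selection, assuming the training responses are held in a Python list. The function name `build_static_pmn` and the fixed seed are hypothetical choices made for illustration; they are not taken from the paper.

```python
import random


def build_static_pmn(responses, m, seed=0):
    """Pick m distinct responses once; the same list is reused for every post."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the memory reproducible
    return rng.sample(responses, m)


# Toy training responses standing in for the real corpus.
train_responses = ["r%d" % i for i in range(10)]
static_pmn = build_static_pmn(train_responses, m=3)
```

Because the sample is drawn before training, the memory can be encoded once and shared across all mini-batches.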
Dynamic PMN: In the dynamic PMN (DPMN), prototypes are selected by retrieving the most relevant posts. We calculate the cosine similarity, under a TF-IDF weighting scheme, between the given post and all posts in the training set. We take the top-m posts and put their associated responses into the DPMN. This means that the prototypes are specific to each post-response pair.
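The retrieval step can be sketched in plain Python as follows. This is an illustration under stated assumptions: posts are already tokenized, the IDF is a smoothed variant (the text does not specify one), and the names `tfidf_vectors`, `cosine` and `build_dynamic_pmn` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse TF-IDF dict per document, plus the IDF table."""
    n_docs = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed IDF (an assumed variant); always positive.
    idf = {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
            for doc in tokenized_docs]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_dynamic_pmn(query_tokens, train_posts, train_responses, m):
    """Return the responses attached to the m training posts most similar to the query."""
    # Recomputed per query for brevity; a real system would cache these vectors.
    vecs, idf = tfidf_vectors(train_posts)
    query = {tok: tf * idf[tok]
             for tok, tf in Counter(query_tokens).items() if tok in idf}
    ranked = sorted(range(len(train_posts)),
                    key=lambda j: cosine(query, vecs[j]), reverse=True)
    return [train_responses[j] for j in ranked[:m]]


train_posts = [["i", "love", "cats"],
               ["the", "weather", "is", "nice"],
               ["cats", "are", "cute"]]
train_responses = ["me too", "let us go out", "so fluffy"]
prototypes = build_dynamic_pmn(["cats", "love"], train_posts, train_responses, m=2)
```

In this toy example the first post shares two terms with the query and the third shares one, so their responses are the two prototypes retrieved.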
In both SPMN and DPMN, m is a predefined hyper-parameter controlling the size of the PMN. Each prototype is represented by the concatenation of its word embeddings. We perform zero padding for both SPMN and DPMN with a pseudo word (footnote 1), so that the representations of all prototypes have the same length. We denote the prototype memory network as $\mathrm{PMN} = \{r_1, r_2, ..., r_m\}$, in which $r_m$ is the representation of the m-th prototype and m is the size of the PMN. Further, $r_m = \{w_{m,1}, w_{m,2}, ..., w_{m,l}\}$, where $w_{m,i}$ is the embedding of the i-th word and l is the maximum allowable length of a prototype.
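The padding step might look like the sketch below, where each prototype is a list of word-embedding vectors. Truncating prototypes longer than l is an assumption made here, and `pad_prototype` is a hypothetical helper name.

```python
def pad_prototype(word_vectors, max_len, dim):
    """Truncate or zero-pad a list of word embeddings to exactly max_len rows."""
    clipped = word_vectors[:max_len]  # truncation is an assumption, not stated in the text
    # The pseudo word is embedded as an all-zero vector of size dim.
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


# A two-word prototype with 2-dimensional toy embeddings, padded to length 4.
proto = [[0.1, 0.2], [0.3, 0.4]]
padded = pad_prototype(proto, max_len=4, dim=2)
```

After padding, every prototype has the same shape (l rows of embedding size), so the whole memory can be stacked into one m x l x d array.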
For both SPMN and DPMN, we select responses rather than posts, although the two sometimes have similar vocabularies and syntactic structure. We believe that using responses as prototypes helps with the meaning transformation from post to response. In DPMN, all retrieved prototypes can be considered responses to the target post. Intuitively, the generated response should have a representation similar to these prototypes.

Hierarchical Attention Mechanism
We use a two-stage hierarchical attention mechanism to extract useful information from the PMN and integrate it into the decoding process. The first stage is a sentence-level attention over the entire PMN, with attention parameter $v_2$, which generates the abstractive prototype $\hat{r}_i$ at each timestep. The second stage is a word-level attention $o_i$ over the generated $\hat{r}_i = \{\hat{w}_1, \hat{w}_2, ..., \hat{w}_l\}$, with attention parameter $v_3$.
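The two stages can be illustrated with the simplified sketch below. It is not the paper's parameterization: the learned attention parameters $v_2$ and $v_3$ are replaced by plain dot-product scores against the decoder state, each prototype is scored through the mean of its word vectors, and the function name `hierarchical_attention` is hypothetical. Only the overall data flow (sentence-level weights, then an abstractive prototype, then word-level weights, then a summary vector) follows the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hierarchical_attention(state, pmn):
    """state: decoder state of size d; pmn: m prototypes, each l word vectors of size d."""
    m, l, d = len(pmn), len(pmn[0]), len(state)
    # Stage 1: sentence-level weights, scoring each prototype by its mean word vector (assumed).
    means = [[sum(w[k] for w in proto) / l for k in range(d)] for proto in pmn]
    alpha = softmax([dot(state, mu) for mu in means])
    # Abstractive prototype: position-wise weighted sum over the m prototypes.
    r_hat = [[sum(alpha[j] * pmn[j][t][k] for j in range(m)) for k in range(d)]
             for t in range(l)]
    # Stage 2: word-level weights over the l positions of the abstractive prototype.
    beta = softmax([dot(state, w) for w in r_hat])
    o = [sum(beta[t] * r_hat[t][k] for t in range(l)) for k in range(d)]
    return o, alpha, beta


# Two prototypes, each with two 2-dimensional word vectors.
pmn = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
state = [1.0, 0.0]
o, alpha, beta = hierarchical_attention(state, pmn)
```

Here the first prototype is better aligned with the decoder state, so it receives the larger sentence-level weight, and the resulting vector `o` plays the role of $o_i$ in the decoder.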

Experiment Setup
We use a subset of the STC dataset (Shang et al., 2015) crawled from Weibo, the largest social media platform in China. The vocabulary size is set to 8,000 for computational efficiency, and out-of-vocabulary words are replaced by the symbol "unk". We remove sentences longer than 25 words or containing more than 2 unk symbols. After preprocessing, we have 315,980 post-response pairs in the training set, 3,510 pairs in the validation set and 300 in the test set.
Footnote 1: The embedding of the pseudo word is a zero vector.
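The preprocessing described in this section can be sketched as follows. Several details are assumptions made for illustration: the vocabulary keeps the most frequent tokens, the length and unk filters are applied to both the post and the response, and a pair is dropped if either side fails. The names `build_vocab` and `preprocess` are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_sentences, size):
    """Keep the `size` most frequent tokens (frequency-based selection is an assumption)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}


def preprocess(pairs, vocab, max_len=25, max_unk=2, unk="unk"):
    """Map out-of-vocabulary words to unk, then drop pairs that are too long or too noisy."""
    kept = []
    for post, response in pairs:
        mapped = [[tok if tok in vocab else unk for tok in sent]
                  for sent in (post, response)]
        if all(len(s) <= max_len and s.count(unk) <= max_unk for s in mapped):
            kept.append(tuple(mapped))
    return kept


pairs = [(["a", "b"], ["a", "c"]),
         (["x", "y", "z"], ["a"])]
vocab = {"a", "b", "c"}
clean = preprocess(pairs, vocab)
```

In this toy example the second pair is removed because its post maps to three unk symbols, which exceeds the limit of two.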
In our model, we use a one-layer LSTM with the hidden size set to 600 in both the encoder and the decoder. The embedding size is 300 for all words used in our model. We use mini-batch learning with a batch size of 64. We use plain SGD for optimization, with an initial learning rate of 0.2.

Evaluation Metrics
We use two automatic evaluation metrics, Perplexity and Distinct. Human evaluation is also conducted, since human judgement is the only gold standard for response generation.
Perplexity: Following Vinyals and Le (2015), we use perplexity as one of our automatic evaluation metrics. Perplexity measures how well the model fits the data overall; a lower perplexity indicates better generalization performance. Perplexity on both the validation set (PPL-V) and the test set (PPL-T) is presented in Table 1.
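For reference, perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The sketch below assumes token-level normalization and natural-log probabilities; the text does not state which normalization was used, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities assigned to each reference token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


# A model that assigns probability 0.25 to every reference token.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of 4, which is why lower values indicate a sharper and better-fitting model.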
Distinct-1, Distinct-2: Distinct-1 and distinct-2 calculate the ratios of distinct unigrams and bigrams, respectively, in the generated responses (Li et al., 2016a; Wu et al., 2018). A higher score suggests that the generated responses are more diverse and informative. We report the distinct-1 and distinct-2 scores on the entire test set.
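A minimal sketch of the distinct-n computation over a whole set of generated responses, assuming the responses are already tokenized. The function name `distinct_n` is hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across all tokenized responses."""
    ngrams = [tuple(r[i:i + n]) for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


generated = [["i", "am", "fine"], ["i", "am", "ok"]]
d1 = distinct_n(generated, 1)  # 4 unique unigrams out of 6
d2 = distinct_n(generated, 2)  # 3 unique bigrams out of 4
```

Because the ratio is taken over the pooled n-grams of the entire test set, a model that repeats the same generic reply for many posts is penalized.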
Human Annotation: We further recruit human annotators to judge the quality of the generated answers for all the QA pairs in the test set. Responses generated by all the methods are pooled and randomly shuffled for each annotator. A score between 0 and 2 is assigned to each generated answer based on the following criteria:
+2: the answer is natural and relevant to the question.
+0: the answer is irrelevant and unclear in meaning (e.g. too many grammatical errors to understand).

Results Comparison
We use a standard baseline and a strong baseline for comparison.
S2SA: The standard Seq2Seq model with an attention mechanism (Vinyals and Le, 2015).
TAS2S: One of the existing state-of-the-art neural models based on the Seq2Seq architecture. Topical words relevant to the post are considered via an attention mechanism when decoding.
As for our models, we use SPMN to denote the generation method with a static prototype memory network and DPMN to denote the one with a dynamic prototype memory network. The number following a model name is the size of the PMN.
Automatic Evaluation: Table 1 shows the automatic evaluation results. Both SPMN and DPMN obtain large improvements over the two baselines in terms of PPL-V and PPL-T. We also observe that SPMN1000 outperforms SPMN500 on all four automatic metrics. Note that SPMN provides the same prototypes for every post. This result is reasonable, since a relevant response is more likely to be covered when more prototypes are stored in SPMN. As for DPMN, it achieves the best PPL-T among the five methods with only 100 prototypes. This suggests that using a retrieval mechanism to incorporate relevant responses brings more useful information for response generation. Note that S2SA outperforms the others in terms of distinct-1 and distinct-2. Further human evaluation indicates that many responses generated by S2SA are irrelevant and meaningless, which inflates the distinct scores.
Human Annotation: Table 2 shows the human annotation results. Our models (SPMN500, SPMN1000, DPMN100) clearly generate more informative and valid responses, and far fewer meaningless or "safe" responses, than the baseline models (S2SA, TAS2S). Specifically, SPMN500, SPMN1000 and DPMN100 all outperform S2SA and TAS2S by producing more informative and valid responses. We also find that DPMN still outperforms SPMN500 and SPMN1000 with only 100 relevant responses, which is consistent with the observation made in the automatic evaluation (in terms of PPL-V and PPL-T). Table 3 shows several cases generated by different models. Note that the training set and vocabulary used in our experiments are relatively small compared to the millions of QA pairs used in other works (Wu et al., 2018), so it is reasonable that bad cases sometimes occur in the results of the baselines. However, our models, whether static or dynamic, can generate responses that are not only grammatical and informative but also carry some emotional expression, such as the use of punctuation and repetition.

Natural language generation
How to generate grammatical and interesting sentences in different situations is one of the core topics in natural language processing. Extensive work has been done on generating poems, abstracts (Wang and Ling, 2016), arguments (Hua and Wang, 2018), stories (Peng et al., 2018) and so on. Although existing approaches are useful in some ways, it is still difficult to generate natural sentences from scratch, and integrating retrieved results has recently become a popular direction in this area. Hua and Wang (2018) proposed an encoder-decoder style neural argument generation model enriched with externally retrieved evidence from Wikipedia. A Retrieve-Rerank-Rewrite model has also been devised for abstractive summarization, which uses retrieved results as soft templates to assist the decoding process.

Response generation
Hand-crafted rules, retrieval and generation are the three main solutions for conversational AI, and generation is the one attracting the most interest in the current research community. Li et al. (2016a; 2016b) proposed a series of works addressing the "safe reply" problem with different approaches, such as redefining the objective function or leveraging GANs. The topic coherence issue has been addressed by incorporating topical words. Dynamically restricting the target vocabulary is also an interesting idea: Wu et al. (2018) proposed to filter irrelevant words while achieving better computational efficiency. He et al. (2017) introduced a copy mechanism to simulate people's behavior in real conversations, so that the proposed model can copy useful words from source sentences. Emotion has also been identified as important in real dialogues, and an emotional chatting machine was devised to generate emotional responses. A neural knowledge diffusion (NKD) model was proposed to introduce knowledge into dialogue generation.

Conclusion and Future Work
In this paper, we propose S2SPMN, a simple yet effective response generation model that exploits the relevant information contained in a large dialogue dataset. Empirical studies indicate that simply selecting responses from the training set as prototypes and integrating them into the generation process can substantially improve the quality of generated responses. Moreover, our model is flexible and could be adapted to other Seq2Seq-based generation methods. Most importantly, we highlight the intrinsic difference between RG and MT and propose a new way to define response generation.
As the first work trying to help with the meaning transformation between source and target, we have obtained encouraging progress. However, there are still many directions in which the proposed framework can be enriched. In future work, we would like to devise more sophisticated solutions to bridge the semantic gap in RG and to explore linguistic patterns in conversations, as has been done in discourse analysis.