Towards Implicit Content-Introducing for Generative Short-Text Conversation Systems

Human-computer conversation systems are an active research topic. One of the prevailing methods to build such a system is the generative Sequence-to-Sequence (Seq2Seq) model based on neural networks. However, the standard Seq2Seq model is prone to generating trivial responses. In this paper, we aim to generate a more meaningful and informative reply when answering a given question. We propose an implicit content-introducing method which incorporates additional information into the Seq2Seq model in a flexible way. Specifically, we fuse the general decoding and the auxiliary cue word information through our proposed hierarchical gated fusion unit. Experiments on real-life data demonstrate that our model consistently outperforms a set of competitive baselines in terms of BLEU scores and human evaluation.


Introduction
Establishing a conversation system with adequate artificial intelligence is a long-cherished goal for researchers and practitioners. In particular, automatic conversation systems in open domains are attracting increasing attention due to their wide applications, such as virtual assistants and chatbots. In open domains, researchers mainly focus on data-driven approaches, since the diversity and uncertainty make it impossible to prepare the interaction logic and domain knowledge. Basically, there are two mainstream ways to build an open-domain conversation system: 1) to search a pre-established database for candidate responses by query retrieval (Isbell et al., 2000; Wang et al., 2013), and 2) to generate a new, tailored utterance given the user-issued query (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016). In these studies, generation-based conversation systems have shown impressive potential. In particular, the Sequence-to-Sequence (Seq2Seq) model (Sutskever et al., 2014) based on neural networks has been extensively used in practice; the idea is to encode a query as a vector and to decode the vector into a reply. Inspired by previous work, we mainly focus on generative short-text conversation without context information.
* Corresponding author: ruiyan@pku.edu.cn
Despite this, the performance of Seq2Seq generation-based conversation systems is far from satisfactory because the generation process is not controllable; the model responds to a query according to the patterns learned from the training corpus. As a result, the system is likely to generate an unexpected reply with little semantic content, e.g., "I don't know" and "Okay", due to the high frequency of these patterns in the training data (Li et al., 2016a). To address this issue, Li et al. (2016a) proposed to increase diversity in the Seq2Seq model so that more informative utterances have a chance to stand out. A previous study provided a content-introducing approach that generates a reply based on a predicted word. The word is usually enlightening and drives the generated response to be more meaningful. However, this method is to some extent rigid; it requires the predicted word to explicitly occur in the generated utterance. As shown in Table 1, it is sometimes better to generate a semantically related sentence based on the cue word rather than including the word in the reply directly.
For such a content-introducing method, two aspects need to be taken into consideration. 1) How to add the additional cue words during the generation process? One of the prevailing methods is modifying the neural cell with various gating mechanisms (Wen et al., 2015a,b; Xu et al., 2016). However, careful design is needed to ensure the neuron works as expected. 2) How to display the cue words in replies? As mentioned above, the explicit content-introducing approach does not fit all situations well.
In this paper, we present an implicit content-introducing method for generative conversation systems, which incorporates cue words using our proposed hierarchical gated fusion unit (HGFU) in a flexible way. Our main contributions are as follows:
• We propose the cue word GRU, another neural cell, to deal with the auxiliary information. Compared with other gating methods, our cue word GRU is more flexible.
• We focus on the implicit content-introducing method during generation: the information of the cue word will be fused into the generation process but not necessarily occur explicitly. In this way, we change the "hard" content-introducing method into a new "soft" schema.
The rest of the paper is organized as follows. We start by introducing the technical background in Section 2. In Section 3, we describe our proposed method. In Section 4, we present the experimental setup and evaluations against a variety of baselines. Section 5 briefly reviews related work. Finally, we conclude the paper in Section 6.
2 Technical Background

Seq2Seq Model and Attention Mechanism
The Seq2Seq model was first introduced in statistical machine translation; the idea is to encode a source sentence as a vector by a recurrent neural network (RNN) and to decode the vector to a target sentence by another RNN. Conversational generation is now treated as a monolingual translation task (Ritter et al., 2011; Shang et al., 2015). Given a query $Q = (x_1, \ldots, x_n)$, the encoder represents it as a context vector $C$ and then the decoder generates a response $R = (y_1, \ldots, y_m)$ word by word by maximizing the generation probability of $R$ conditioned on $Q$. The objective function of Seq2Seq can be written as:

$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = p(y_1 \mid C) \prod_{t=2}^{m} p(y_t \mid C, y_1, \ldots, y_{t-1}) \tag{1}$$

To be specific, the encoder RNN calculates the context vector by:

$$h_t = f(x_t, h_{t-1}), \qquad C = h_n \tag{2}$$

where $h_t$ is the hidden state of the encoder RNN at time $t$ and $f$ is a non-linear transformation which can be a long short-term memory unit (LSTM) (Hochreiter and Schmidhuber, 1997) or a gated recurrent unit (GRU). In this work, we implement $f$ using GRU. The decoder RNN generates each reply word conditioned on the context vector $C$. The probability distribution $p_t$ of candidate words at every time step $t$ is calculated as:

$$s_t = f(y_{t-1}, s_{t-1}, C), \qquad p_t = \mathrm{softmax}(s_t, y_{t-1}) \tag{3}$$

where $s_t$ is the hidden state of the decoder RNN at time $t$ and $y_{t-1}$ is the generated word in the reply at time $t-1$.
Attention mechanisms have proved effective in improving generation quality. In Seq2Seq with attention, each $y_i$ corresponds to a context vector $C_i$, which is a weighted average of all hidden states of the encoder. Formally, $C_i$ is defined as $C_i = \sum_{j=1}^{T} \alpha_{ij} h_j$, where $T$ is the number of encoder hidden states and $\alpha_{ij}$ is given by:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad e_{ij} = \eta(s_{i-1}, h_j)$$

where $\eta$ is usually implemented as a multi-layer perceptron (MLP) with tanh as an activation function.

Figure 1: The architecture of our system. Based on the constructed corpus, we train our implicit content-introducing conversation system. Given a user-issued query, we first predict the cue word. Then, we incorporate the cue word into the decoding process to generate a meaningful response.
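The attention-weighted context vector described in this subsection can be illustrated with a minimal plain-Python sketch. The function name `attention_context` is hypothetical, and the alignment scores are passed in directly rather than produced by an MLP, which is a simplifying assumption of this example.

```python
import math


def attention_context(scores, encoder_states):
    """Return (weights, context) for one decoding step.

    scores[j] stands in for eta(s_{i-1}, h_j); supplying the scores
    directly (instead of computing them with an MLP) is an assumption
    made to keep the sketch short.
    """
    # Softmax over the scores, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Two encoder states with equal scores receive equal attention weights.
weights, context = attention_context(
    scores=[0.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 1.0]],
)
```

With equal scores the weights are uniform, so the context is the plain average of the encoder states; unequal scores shift the average toward the higher-scored states.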

Pointwise Mutual Information
Pointwise mutual information (PMI) (Church and Hanks, 1990) is a measure of association ratio based on the information-theoretic concept of mutual information. Given a pair of outcomes $x$ and $y$ belonging to discrete random variables $X$ and $Y$, the PMI quantifies the discrepancy between the probability of their coincidence based on their joint distribution and their individual distributions. Mathematically:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x \mid y)}{p(x)}$$

This quantity is zero if $x$ and $y$ are independent, positive if they are positively correlated, and negative if they are negatively correlated.

Implicit Content-Introducing Conversation System
Figure 1 provides an overview of our system architecture. We crawl publicly available conversational data from social media. After filtering and cleaning procedures, we establish the conversational parallel dataset, which consists of a large number of aligned query-reply pairs. Based on the entire set, we first predict the cue word for the given query in Subsection 3.1. Next, we propose the new implicit content-introducing process, which explores when to incorporate the predicted cue word in Subsection 3.2 and how to apply such information in Subsection 3.3.

Cue Word Prediction
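In computational linguistics, PMI has been used for finding collocations and associations between words. As noted in previous work, it is an appropriate statistic for cue word prediction, and we adopt it in this paper to predict a cue word $C_w$ for the given query. Formally, given a query word $w_q$ and a reply word $w_r$, the PMI is computed as:

$$\mathrm{PMI}(w_q, w_r) = \log \frac{p(w_q, w_r)}{p(w_q)\,p(w_r)} = \log \frac{p(w_q \mid w_r)}{p(w_q)}$$

Then, we choose the cue word $C_w$ with the highest PMI score against the query words $w_{q_1}, \ldots, w_{q_n}$ during prediction, i.e., $C_w = \operatorname{argmax}_{w_r} \mathrm{PMI}(w_{q_1}, \ldots, w_{q_n}, w_r)$, where

$$\mathrm{PMI}(w_{q_1}, \ldots, w_{q_n}, w_r) = \log \frac{p(w_{q_1}, \ldots, w_{q_n} \mid w_r)}{p(w_{q_1}, \ldots, w_{q_n})} \approx \log \frac{\prod_{i=1}^{n} p(w_{q_i} \mid w_r)}{\prod_{i=1}^{n} p(w_{q_i})} = \sum_{i=1}^{n} \log \frac{p(w_{q_i} \mid w_r)}{p(w_{q_i})} = \sum_{i=1}^{n} \mathrm{PMI}(w_{q_i}, w_r)$$

The approximation is based on independence assumptions for both the prior distribution $p(w_{q_i})$ and the posterior distribution $p(w_{q_i} \mid w_r)$. Even though the two assumptions may not hold, we use them pragmatically so that the word-level PMI is additive over a whole utterance. PMI penalizes a common word by dividing by its prior probability; hence, it prefers a word which is most "mutually informative" with the query.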
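The additive PMI scoring described above can be sketched in a few lines of Python. This is a toy illustration, not the implementation used in the experiments: the function names, the tiny corpus, the pair-level counting scheme and the `FLOOR` score for unseen word pairs are all assumptions, since the text does not specify how probabilities are estimated or smoothed.

```python
import math
from collections import Counter

FLOOR = -20.0  # assumed score for word pairs that never co-occur


def collect_stats(pairs):
    """Count in how many (query, reply) pairs each word and word pair occurs."""
    q_count, r_count, joint = Counter(), Counter(), Counter()
    for query, reply in pairs:
        q_set, r_set = set(query), set(reply)
        q_count.update(q_set)
        r_count.update(r_set)
        joint.update((wq, wr) for wq in q_set for wr in r_set)
    return q_count, r_count, joint, len(pairs)


def pmi(wq, wr, stats):
    """Word-level PMI: log p(wq, wr) / (p(wq) p(wr)), estimated from counts."""
    q_count, r_count, joint, n = stats
    if joint[(wq, wr)] == 0:
        return FLOOR
    return math.log(joint[(wq, wr)] * n / (q_count[wq] * r_count[wr]))


def predict_cue_word(query, candidates, stats):
    """Pick the reply word whose summed PMI with the query words is highest."""
    return max(candidates,
               key=lambda wr: sum(pmi(wq, wr, stats) for wq in query))


# Toy corpus: "ok" is a frequent reply word, "tissue" is specific to "cry".
pairs = [
    (["cry"], ["tissue"]),
    (["cry", "sad"], ["tissue", "ok"]),
    (["happy"], ["ok"]),
    (["photo"], ["ok"]),
]
stats = collect_stats(pairs)
cue = predict_cue_word(["cry"], ["ok", "tissue"], stats)
```

In this toy corpus the frequent reply word "ok" gets a negative PMI with "cry", while "tissue" gets a positive one, which mirrors the observation that PMI penalizes common words.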

Information Fusion Patterns
To implant the specific information in the conversation system, we consider two types of information fusion patterns, namely 1) local information initialization and 2) global information inception.
Local information initialization. In the local pattern, we fuse the cue word $C_w$ as the auxiliary information only at the beginning of decoding. We describe this pattern by the blue arrowhead in Figure 2. Recurrent neural networks (RNNs) such as gated recurrent units (GRUs) can, to some extent, retain information from the beginning to the end of a sequence. Therefore, the cue word added at the first step of the network can still influence generation at later steps.
Global information inception. However, we observe that, although the network is capable of deciding what to keep in the cell state to affect later generation, the influence of the information added at the beginning of decoding becomes weaker and weaker over time. Therefore, to provide the model a broader and more flexible space for learning, we propose a global information inception pattern, which fuses the cue word $C_w$ as the auxiliary information at every step of decoding. This process is presented by both the blue arrowhead and the green arrowheads in Figure 2.

Figure 3: The structure of an HGFU. The two GRUs at the bottom handle their corresponding input sources, i.e., the last generated word $y_{t-1}$ and the cue word $C_w$. The fusion unit then combines the outputs of the two GRUs to compute the current hidden state $h_t$.

Hierarchical Gated Fusion Unit
In this subsection, we propose our Hierarchical Gated Fusion Unit (HGFU), which incorporates cue words into the generation process and relaxes the constraint from the "hard" content-introducing method into a new "soft" schema. Figure 3 provides an overview of the structure of a HGFU. As seen, the framework consists of three components: the standard GRU, the cue word GRU, and the fusion unit. Among them, standard GRU and cue word GRU take the last generated word y t−1 and cue word C w respectively as the decoder GRU's input; the fusion unit combines the hidden states of both GRUs to predict the next word y t . In the following, we will illustrate these components in detail.

Standard GRU
We adopt the standard gated recurrent unit (GRU) with the attention mechanism in the decoder. Let $h_{t-1}$ be the last hidden state, $y_{t-1}$ be the embedding of the last generated word, and $C_t$ be the current attention-based context. The current hidden state of the general decoding, $h_y$, is computed from these inputs by the standard GRU update, in which the $W$'s $\in \mathbb{R}^{dim \times E}$ and the $U$'s $\in \mathbb{R}^{dim \times dim}$ are weight matrices and the $b$'s $\in \mathbb{R}^{dim}$ are bias terms; $E$ denotes the word embedding dimensionality and $dim$ denotes the number of hidden state units. This general decoding process is presented by the "Standard GRU" in Figure 3.
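As a rough illustration of the general decoding step, the sketch below implements one common form of the GRU update in plain Python. Several details are assumptions of this sketch rather than facts stated in the text: the gate convention (the update gate weights the candidate state), the omission of the attention context $C_t$ from the inputs, and all function and parameter names (`gru_step`, `Wr`, `Ur`, `br`, and so on).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU update; p maps assumed names (Wr, Ur, br, ...) to parameters."""
    # Reset gate and update gate.
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Candidate state computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a)
            for a in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


dim, E = 2, 3  # hidden size and embedding size (toy values)
params = {}
for gate in "rzh":
    params["W" + gate] = zeros(dim, E)    # dim x E input weights
    params["U" + gate] = zeros(dim, dim)  # dim x dim recurrent weights
    params["b" + gate] = [0.0] * dim      # bias terms

# With all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous state.
h_y = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params)
```

Under the same assumptions, the cue word GRU of the next subsection would call the same step with the cue word embedding as `x` and its own, unshared parameter set.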

Cue word GRU
To generate more meaningful and informative replies, we introduce cue words as additional information during generation. Naturally, the key point lies in how to incorporate such information. One of the prevailing methods is modifying the neural cell with various gating mechanisms. However, these approaches are designed for a specific scenario and are not as effective as expected when applied to other tasks. To tackle this issue, we propose the cue word GRU, another independent neural cell, to deal with the auxiliary information. Since this neural cell can easily be replaced by other units, it greatly improves flexibility and reusability.
Given the last hidden state $h_{t-1}$, the additional cue word $C_w$ and the current attention-based context $C_t$, the new hidden state of the auxiliary decoding, $h_w$, is computed by a GRU update in which the cue word $C_w$ serves as the current input, where the $W$'s and $U$'s are weights and the $b$'s are bias terms like those in the standard GRU. Note that the standard GRU does not share parameter matrices with the cue word GRU. The "Cue word GRU" in Figure 3 describes the auxiliary decoding process.

Fusion unit
To combine the general decoding information with the auxiliary decoding information, we apply the fusion unit (Arevalo et al., 2017), which integrates the hidden states of the standard GRU, i.e., $h_y$, and of the cue word GRU, i.e., $h_w$, to compute the current hidden state $h_t$, with $\theta$ the parameters to be learned. Within the unit, the gate neuron $k$ controls the contribution of the information calculated from $h_y$ and $h_w$ to the overall output of the unit.
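A gated fusion of two hidden states in the style of a gated multimodal unit can be sketched as follows. The exact parameterization is an assumption of this example: both states are passed through a tanh layer, the gate $k$ is a sigmoid over their concatenation, and $k$ weights the $h_y$ branch while $1 - k$ weights the $h_w$ branch. Swapping the two roles gives an equivalent parameterization, because $1 - \sigma(a) = \sigma(-a)$. The names `fusion_unit`, `W1`, `W2` and `Wk` are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a list-based matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def fusion_unit(h_y, h_w, W1, W2, Wk):
    """Fuse two hidden states with an element-wise gate k (assumed form)."""
    # Transform each source separately.
    hy = [math.tanh(a) for a in matvec(W1, h_y)]
    hw = [math.tanh(a) for a in matvec(W2, h_w)]
    # Gate computed from the concatenation of both transformed states.
    k = [1.0 / (1.0 + math.exp(-a)) for a in matvec(Wk, hy + hw)]
    # Element-wise convex combination of the two branches.
    h_t = [ki * a + (1 - ki) * b for ki, a, b in zip(k, hy, hw)]
    return h_t, k


identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4 for _ in range(2)]  # dim x 2*dim gate weights

# A zero gate matrix gives k = 0.5, i.e. an even mix of both branches.
h_t, k = fusion_unit([0.0, 0.0], [1.0, -1.0], identity, identity, zero_gate)
```

Because the gate is element-wise, each hidden dimension can lean toward the general decoding state or toward the cue word state independently at every step.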

Model Training
When training on the aligned corpus, we randomly sample a noun in the reply as the cue word. The objective function is the cross-entropy error between the generated word distribution $p_t$ and the actual word distribution $y_t$ in the training corpus.
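The two training ingredients mentioned here, cue word sampling and the cross-entropy objective, can be sketched as below. The POS tag set, the pre-tagged input format, the averaging over time steps and the function names are assumptions made for illustration; the text only states that a noun is sampled and that cross entropy is used.

```python
import math
import random


def sample_cue_word(tagged_reply, rng):
    """Randomly pick one noun from a POS-tagged reply, or None if absent."""
    nouns = [word for word, tag in tagged_reply if tag == "NOUN"]
    return rng.choice(nouns) if nouns else None


def cross_entropy(pred_dists, target_ids):
    """Mean negative log-likelihood of the reference words.

    pred_dists[t] is the predicted distribution p_t over the vocabulary,
    target_ids[t] is the index of the reference word at step t.
    """
    losses = [-math.log(p[t]) for p, t in zip(pred_dists, target_ids)]
    return sum(losses) / len(losses)


# Toy reply with hypothetical POS tags; "tissue" is its only noun.
tagged_reply = [("offer", "VERB"), ("you", "PRON"),
                ("a", "DET"), ("tissue", "NOUN")]
cue = sample_cue_word(tagged_reply, random.Random(0))

# Two decoding steps over a two-word vocabulary.
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```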

Experiments
In this section, we compare our method with state-of-the-art response generation models on a large conversation corpus. The objectives of our experiments are to 1) evaluate the effectiveness of our proposed HGFU model, and 2) explore how cue words affect the process of reply generation.

Figure 4: Heat map and the k gate openness. Bottom: The correlation between the generated reply words and the cue word. Top: The openness of the k gate in the fusion unit.

Experimental setup
We evaluated our model on a massive Chinese dataset of human conversation crawled from the Baidu Tieba forum. There are 500,000 query-reply pairs for training, 2,000 for validation, and another 27,871 unseen samples for testing. In total, we kept about 63,000 distinct words.
In our experiments, the encoder, the standard decoder and the cue word decoder each have 1,000 hidden units; the word embedding dimensionality is 610, and the embeddings were initialized randomly and learned during training. We applied AdaDelta with a mini-batch size of 80 for optimization. These values were mostly chosen empirically. To prevent overfitting, early stopping was implemented using a held-out validation set.

Comparison Methods
In this paper, we conduct extensive experiments to compare our proposed method against several representative baselines. Every method is implemented in two ways of utilizing the cue word: local information initialization and global information inception.
Table 2: An example query, the corresponding cue word in bold, and its candidate replies with human annotation. The query states that people laughed at the author's photo, so it is unsuitable to ask about the ownership of this photo in Reply1. Generally, Reply2 and Reply3 apply to this scenario, but they do not reflect semantic relevance to the cue word. Reply4 talks about the respondent's situation and is related to "Photogenic"; thus it is a suitable response.
… | 我拍照也都是巨丑的！ | My photos are also ugly! | Suitable

rGRU: Through a specially designed Recall gate (Xu et al., 2016) … dialogue act (DA) features during the generation process.
SLGD: We implemented the Stochastic Language Generation in Dialogue (SLGD) method (Wen et al., 2015a), which added additional features in each gate of the neural cell.
FGRU: To explore more fusion strategies, intuitively, we fused the cue word and hidden states by vector concatenation during the decoding process.
Note that rGRU and SCGRU incorporate additional information by gating mechanisms, while SLGD and FGRU fuse the information into each gate of the neural cell directly.

Experiment Evaluation
Objective metrics. To evaluate the performance of different methods on the conversation generation task, we use BLEU (Papineni et al., 2002) as the automatic evaluation metric. BLEU was originally designed for machine translation and evaluates the output by n-gram matching between the output and the reference. Here, we use BLEU-1, BLEU-2 and BLEU-3 in our experiments.
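To make the n-gram matching concrete, the following sketch computes a sentence-level BLEU-N with clipped n-gram precision and a brevity penalty against a single reference. It is a simplified illustration under stated assumptions: the experiments may have used corpus-level BLEU, multiple references, a different tokenization or smoothing, none of which is specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-N: geometric mean of precisions times brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_avg)


candidate = "thanks for your praise".split()
reference = "thanks for your appreciation".split()
bleu1 = bleu(candidate, reference, 1)  # 3 of 4 unigrams match
bleu2 = bleu(candidate, reference, 2)  # combines 3/4 and 2/3
```

For Chinese replies the tokens would be words or characters after segmentation, which is another choice this sketch leaves open.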
Subjective metrics. Since automatic metrics may not consistently agree with human perception (Stent et al., 2005), human testing is essential to assess subjective quality. Hence, we randomly sampled 150 queries from the test set and invited five annotators to judge the replies. For fairness, all of our human evaluation was conducted in a random, blind fashion, i.e., replies obtained from the five evaluated models are pooled and randomly permuted for each annotator. Each reply is assigned one of three scores: 0 = Unsuitable reply, 1 = Neutral reply, and 2 = Suitable reply. To make the annotation task operable, the suitability of the generated reply is judged not only on Grammar and Fluency, Logic Consistency and Semantic Relevance, following Shang et al. (2015), but also on Implicit Relevance, i.e., the generated reply should be semantically relevant to the predicted cue word, regardless of whether the cue word explicitly appears in the reply. If any of the first three criteria is violated, the reply should be labeled as "Unsuitable". Only the replies conforming to all requirements are labeled as "Suitable". Table 2 shows an example of the annotation results of a query and its replies. The first reply is labeled as "Unsuitable" because it violates logic consistency. Reply2 and Reply3 are not semantically related to the cue word, and are therefore annotated as "Neutral".

Overall Performance
The overall results against all baseline methods are listed in Table 4. Our proposed HGFU model in the global schema clearly outperforms the baseline methods; it obtains the highest BLEU scores as well as the highest human score.

 | Chinese Sentence | English Translation
Query | 写的真心棒！(夸奖) | What a nice piece of writing! (Appreciation)
Reply | 谢谢夸奖！么么哒！ | Thanks for your appreciation! Love you!
Query | 还是无法淡定。(内心) | Still cannot calm down. (Heart)
Reply | 内心是崩溃的吧。 | Your heart must be broken.
Query | 我先去哭一会。(纸巾) | I am going to cry for a while. (Tissue)
Reply | 递纸巾！ | Offer you a tissue!
Query | 当初你们不是说过他是诺维斯基吗？(说过) | Didn't you say that he was Nowitzki†? (Say)
Reply | 说过吗？好像没有说过啊！？ | Did I say it? I don't seem to have said it!?
Table 3: The explicit introducing-content cases of our HGFU model. The predicted cue word in bold explicitly occurs in the generated reply. †Nowitzki is an NBA basketball player.
In terms of automatic evaluation, the global-based methods perform much better than the local-based methods, which demonstrates the effectiveness of global information inception. As mentioned above, the global schema provides the model a broader and more flexible space for learning, which benefits information fusion. The human scores lead to conclusions similar to the BLEU results (for the sake of convenience, we only conducted human evaluation in the global schema).
From Table 4, we can see that rGRU does not perform as well as the other systems, while SCGRU outperforms the others in the local pattern and shows comparable performance in the global schema. Both methods augment the standard neural network with a specially designed gate to control the cue word, but their results vary greatly, which reflects a limitation of gating mechanisms: they lack adaptiveness. Besides, SLGD, which adds a cue word term to each gate of the neural cell, obtains results similar to those of FGRU, which concatenates the cue word with the hidden state. Overall, our proposed HGFU shows a significant improvement over the baseline systems. The improvement most probably comes from the cue word GRU: we apply an extra GRU unit to control the auxiliary information instead of fusing it inside the standard GRU, which is more flexible.
So far, we have presented the overall performance of all methods. Next, we take a closer look at some representative cases of our HGFU model for further analysis and discussion.

Analysis and Case Studies
Given a query and the cue word, our HGFU model generates a meaningful and informative response. In Table 3, the predicted cue word occurs in the generated response, and we treat this kind of generation as explicit introducing-content. However, we do not strictly restrict the model to this. As shown in Table 5, our HGFU model also generates replies that do not contain the cue word, while the responses are still related to the cue word and the query. This matches our expectation: the information of the cue word is fused into the generation process but does not necessarily occur explicitly. It demonstrates the characteristics of our proposed "soft" schema, which is more flexible, extensible, and controllable.
We further analyze these explicit cases using a heat map, as shown in Figure 4. We use various shades of blue to present the extent of correlation between the cue word and the generated reply. The darker the blue, the higher the correlation. For the added information in the reply (here exactly the cue word, shown in dark blue), its position and number of occurrences are not fixed but are autonomously controlled by our model. Besides, the rectangular pulse is also a significant presentation of this correlation; it indicates how the k gate in the fusion unit balances the influence of h y and h w . When the rectangular pulse is at the high level, k "opens" the switch of h w to generate the current word; when it is at the low level, the fusion unit mainly takes h y for generation. We observe that the switch corresponds with the heat map: the generated word is more correlated with the cue word when the switch is open.
 | Chinese Sentence | English Translation
Query | … | This photo of Taemin† was also taken as a desktop for a long while. (Screenshot)
Reply | 锁屏吗？ | As the lockscreen?
Query | 混脸熟求勾搭！(小新) | Make acquaintance and seek chances for further relations! (Freshman)
Reply | 同新人！求认识。 | I am also new! Nice to meet you.

5 Related work

Conversation Systems
Automatic human-computer conversation has attracted increasing attention over the past few years. At the very beginning, researchers relied on hand-crafted rules and templates (Walker et al., 2001; Misu and Kawahara, 2007; Williams et al., 2013). These approaches require little or no data for training but huge manual effort to build the model, which is very time-consuming. Currently, approaches to building a conversation system mainly fall into two categories: retrieval-based and generation-based. As information retrieval techniques are developing fast, Leuski et al. (2009) build systems to select the most suitable response from the query-reply pairs using a statistical language model in cross-lingual information retrieval. Other work proposes a retrieval-based conversation system with the deep learning-to-respond schema through a deep neural network framework driven by web data.
Recently, generation-based conversation systems have shown impressive potential. Shang et al. (2015) generate replies for short-text conversation by Seq2Seq-based neural networks with local and global attentions.

Content Introducing
In open domains, Xing et al. (2016) incorporate topic information into the Seq2Seq framework to generate informative and interesting responses. To provide informative clues for content introducing, Li et al. (2016b) detect entities from previous utterances and search for more related entities in a large knowledge graph. In a very recent study similar to ours, the predicted word explicitly occurs in the generated utterance. Unlike the existing work, we explore an implicit content-introducing method for neural conversation systems, which utilizes the additional cue word in a "soft" manner to generate a more meaningful response given a user-issued query.

Conclusion
In this paper, we explore an implicit content-introducing method for generative short-text conversation systems. Given a user-issued query, our proposed HGFU incorporates an additional cue word in a "soft" manner to generate a more meaningful response. The HGFU model consists of three components: the standard GRU, the cue word GRU and the fusion unit. The standard GRU performs the general decoding process, and the cue word GRU imitates this process but treats the predicted cue word as the current input. The fusion unit combines the hidden states of the standard GRU and the cue word GRU to generate the current output word. The experimental results demonstrate the effectiveness of our approach.