Neural Response Generation with Meta-words

We present open domain dialogue generation with meta-words. A meta-word is a structured record that describes attributes of a response, and thus allows us to explicitly model the one-to-many relationship within open domain dialogues and perform response generation in an explainable and controllable manner. To incorporate meta-words into generation, we propose a novel goal-tracking memory network that formalizes meta-word expression as a goal in response generation and manages the generation process to achieve the goal with a state memory panel and a state controller. Experimental results from both automatic evaluation and human judgment on two large-scale data sets indicate that our model can significantly outperform state-of-the-art generation models in terms of response relevance, response diversity, and accuracy of meta-word expression.


Introduction
Human-machine conversation is a fundamental problem in NLP. Traditional research focuses on building task-oriented dialog systems (Young et al., 2013) to achieve specific user goals such as restaurant reservation through limited turns of dialogues within specific domains. Recently, building a chatbot for open domain conversation (Vinyals and Le, 2015) has attracted increasing attention, not only owing to the availability of large amounts of human-human conversation data on the internet, but also because of the success of such systems in real products such as the social bot XiaoIce (Shum et al., 2018) from Microsoft.
Table 1: An example of response generation with meta-words. An underlined word is copied from the message, and a word in bold corresponds to high specificity.

A common approach to implementing a chatbot is to learn a response generation model within an encoder-decoder framework (Vinyals and Le, 2015; Shang et al., 2015). Although the architecture can naturally model the correspondence between a message and a response, and is easy to extend to handle conversation history (Serban et al., 2016; Xing et al., 2018) and various constraints, it is notorious for generating safe responses such as "I don't know" and "me too" in practice. A plausible reason for the "safe response" issue is that there exists a one-to-many relationship between messages and responses: one message could correspond to many valid responses and vice versa (Zhang et al., 2018a). The vanilla encoder-decoder architecture is prone to memorizing high-frequency patterns in the data, and thus tends to generate similar and trivial responses for different messages. A typical method for modeling the relationship between messages and responses is to introduce latent variables into the encoder-decoder framework (Serban et al., 2017; Park et al., 2018). It is, however, difficult to explain what relationship a latent variable represents, nor can one control the responses to generate by manipulating the latent variable. Although a recent study replaces continuous latent variables with discrete ones, it still requires substantial post-hoc human effort to explain the meaning of the variables.
In this work, we aim to model the one-to-many relationship in open domain dialogues in an explainable and controllable way. Instead of using latent variables, we consider explicitly representing the relationship between a message and a response with meta-words. A meta-word is a structured record that characterizes the response to generate. The record consists of a group of variables, each an attribute of the response. Each variable is in the form of (key, type, value), where "key" defines the attribute, "value" specifies the attribute, and "type" ∈ {r, c} indicates whether the variable is real-valued (r) or categorical (c). Given a message, a meta-word corresponds to one kind of relationship between the message and a response, and by manipulating the meta-word (e.g., the values of variables or the combination of variables), one can control responses in a broad way. Table 1 gives an example of response generation with various meta-words, where "Act", "Len", "Copy", "Utts", and "Spe" are variables of a meta-word and refer to dialogue act, response length (including punctuation marks), whether to copy from the message, whether the response is made up of multiple utterances, and specificity level (Zhang et al., 2018a) respectively. The advantages of response generation with meta-words are four-fold: (1) the generation model is explainable, as the meta-words inform the model, developers, and even end users what responses they will have before the responses are generated; (2) the generation process is controllable: the meta-word system acts as an interface that allows developers to customize responses by tailoring the set of attributes; (3) the generation approach is general: by taking dialogue acts, personas, emotions, and specificity (Zhang et al., 2018a) as attributes, our approach can address the problems in the existing literature in a unified form; and (4) generation-based open domain dialogue systems become scalable, since the model supports feature engineering on meta-words.
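To make the record format concrete, a meta-word can be sketched as a small data structure. This is a Python sketch of ours; the variable names follow Table 1, but the class itself is not part of the paper's method:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class MetaVariable:
    key: str                     # attribute name, e.g. "Len"
    type: str                    # "r" (real-valued) or "c" (categorical)
    value: Union[float, str, bool, int]

# A meta-word is simply a group of such variables.
meta_word: List[MetaVariable] = [
    MetaVariable("Act", "c", "Statement-non-opinion"),  # dialogue act
    MetaVariable("Len", "c", 8),       # response length incl. punctuation
    MetaVariable("Copy", "r", 0.24),   # ratio of words copied from message
    MetaVariable("Utts", "c", False),  # multiple utterances?
    MetaVariable("Spe", "r", 0.5),     # specificity level
]

assert all(v.type in ("r", "c") for v in meta_word)
```

Manipulating such a record (changing a value, or adding/removing variables) is what gives developers control over the generated response.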
The challenge of response generation with meta-words lies in how to simultaneously ensure relevance of a response to the message and fidelity of the response to the meta-word. To tackle the challenge, we propose equipping the vanilla sequence-to-sequence architecture with a novel goal tracking memory network (GTMN) and crafting a new loss item for learning GTMN. GTMN sets meta-word expression as a goal of generation and dynamically monitors expression of each variable in the meta-word during the decoding process. Specifically, GTMN consists of a state memory panel and a state controller where the former records status of meta-word expression and the latter manages information exchange between the state memory panel and the decoder. In decoding, the state controller updates the state memory panel according to the generated sequence, and reads out difference vectors that represent the residual of the meta-word. The next word from the decoder is predicted based on attention on the message representations, attention on the difference vectors, and the word predicted in the last step. In learning, besides the negative log likelihood, we further propose minimizing a state update loss that can directly supervise the learning of the memory network under the ground truth. We also propose a meta-word prediction method to make the proposed approach complete in practice.
We test the proposed model on two large-scale open domain conversation datasets built from Twitter and Reddit, and compare it with several state-of-the-art generation models in terms of response relevance, response diversity, accuracy of one-to-many modeling, accuracy of meta-word expression, and human judgment. Evaluation results indicate that our model can significantly outperform the baseline models on most of the metrics on both datasets.
Our contributions in this paper are three-fold: (1) proposal of explicitly modeling the one-to-many relationship and explicitly controlling response generation in open domain dialogues with multiple variables (a.k.a., a meta-word); (2) proposal of a goal tracking memory network that naturally allows a meta-word to guide response generation; and (3) empirical verification of the effectiveness of the proposed model on two large-scale datasets.

Related Work
Neural response generation models are built upon the encoder-decoder framework (Sutskever et al., 2014). Starting from the basic sequence-to-sequence with attention architecture (Vinyals and Le, 2015; Shang et al., 2015), extensions under the framework have been made to combat the "safe response" problem (Mou et al., 2016; Tao et al., 2018); to model the hierarchy of conversation history (Serban et al., 2016, 2017; Xing et al., 2018); to generate responses with specific personas or emotions; and to speed up response decoding. In this work, we also aim to tackle the "safe response" problem, but in an explainable, controllable, and general way. Rather than learning with a different objective, generating from latent variables, or introducing extra content, we explicitly describe the relationship between message-response pairs by defining meta-words and express the meta-words in responses through a goal tracking memory network. Our method allows developers to manipulate the generation process by playing with the meta-words, and provides a general solution to response generation with specific attributes such as dialogue acts.
Recently, controlling specific aspects in text generation has been drawing increasing attention (Hu et al., 2017; Logeswaran et al., 2018). In the context of dialogue generation, Wang et al. (2017) propose steering response style and topic with human-provided topic hints and fine-tuning on small scented data; Zhang et al. (2018a) propose learning to control the specificity of responses; and very recently, See et al. (2019) investigate how controllable attributes of responses affect human engagement with the methods of conditional training and weighted decoding. Our work is different in that (1) rather than playing with a single variable like specificity or topics, our model simultaneously controls multiple variables and can take controlling specificity or topics as special cases; and (2) we manage attribute expression in response generation with a principled approach rather than simple heuristics as in (See et al., 2019), and thus our model can achieve better accuracy in terms of attribute expression in generated responses.

Problem Formalization
Suppose that we have a dataset D = {(X_i, M_i, Y_i)}_{i=1}^N, where X_i is a message, Y_i is a response, and M_i = (m_{i,1}, . . . , m_{i,l}) is a meta-word, with m_{i,j} the j-th variable and m_{i,j}.k, m_{i,j}.t, and m_{i,j}.v the key, the type, and the value of the variable respectively. Our goal is to estimate a generation probability P(Y | X, M) from D, so that given a new message X with a pre-defined meta-word M, one can generate responses for X according to P(Y | X, M). In this work, we assume that M is given as input for response generation. Later, we will describe how to obtain M from X.

Response Generation with Meta-Words
In this section, we present our model for response generation with meta-words. We start from an overview of the model, and then dive into details of the goal tracking memory enhanced decoding. Figure 1 illustrates the architecture of our goal tracking memory enhanced sequence-to-sequence model (GTMNES2S). The model equips the encoder-decoder structure with a goal tracking memory network that comprises a state memory panel and a state controller. Before response decoding, the encoder represents an input message as a hidden sequence through a bi-directional recurrent neural network with gated recurrent units (biGRU) (Chung et al., 2014), and the goal tracking memory network is initialized by a meta-word. Then, during response decoding, the state memory panel tracks expression of the meta-word and gets updated by the state controller. The state controller manages the process of decoding at each step by reading out the status of meta-word expression from the state memory panel and informing the decoder of the difference between the status and the target of meta-word expression. Based on the message representation, the information provided by the state controller, and the generated word sequence, the decoder predicts the next word of the response. In the following section, we will elaborate on the goal tracking memory enhanced decoding, which is the key to having a response that is relevant to the message and at the same time accurately reflects the meta-word.

Goal Tracking Memory Network
The goal tracking memory network (GTMN) dynamically controls response generation according to the given meta-word via cooperation of the state memory panel and the state controller. It informs the decoder at each step to what extent the meta-word has been expressed. For local attributes such as response length (local attributes are attributes whose values are position sensitive during response generation; for example, the length of the remaining sequence varies after each step of decoding, whereas global attributes, such as dialogue acts, are reflected by the entire response), the dynamic control strategy is more reasonable than static strategies such as feeding the embedding of attributes to the decoder as in the conditional training of (See et al., 2019). This is because if the goal is to generate a response with 5 words and 2 words have been decoded, then the decoder needs to know that there are 3 words left rather than always memorizing that 5 words should be generated.

State Memory Panel
Suppose that the given meta-word M consists of l variables; then the state memory panel M consists of l memory cells {M_i}_{i=1}^l, where ∀i ∈ {1, . . . , l}, M_i is in the form of (key, goal, value), denoted as M_i.k, M_i.g, and M_i.v respectively. We define Rep(·) as a representation function that maps a piece of text to a vector in R^d by feeding its bag-of-words representation B(·) through a linear projection followed by the sigmoid function σ(·). M_i is then initialized by setting M_i.k = Rep(m_i.k) and M_i.g = Rep(m_i.v), where m_i is the i-th variable of M. M_i.k ∈ R^d stores the key of m_i, and M_i.g ∈ R^d stores the goal for the expression of m_i in generation; thus, these two items are frozen during decoding. M_i.v ∈ R^d corresponds to the gray part of the progress bar in Figure 1 and represents the progress of the expression of m_i in decoding; hence, it is updated by the state controller after each step of decoding.
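A minimal sketch of this initialization follows. This is our reading of the text: the parameterization of Rep(·) as a sigmoid over a projected bag-of-words vector and the zero initialization of M_i.v are assumptions, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16                       # hidden size and toy vocabulary size
W = rng.standard_normal((d, vocab))    # projection used inside Rep(.)

def bow(token_ids):
    """B(.): bag-of-words vector for a piece of text."""
    v = np.zeros(vocab)
    for t in token_ids:
        v[t] += 1.0
    return v

def rep(token_ids):
    """Rep(.): sigmoid over a linear projection of the bag-of-words."""
    return 1.0 / (1.0 + np.exp(-W @ bow(token_ids)))

# One memory cell per meta-word variable: key/goal frozen, value updated.
def init_cell(key_tokens, value_tokens):
    return {
        "k": rep(key_tokens),          # M_i.k: frozen key representation
        "g": rep(value_tokens),        # M_i.g: frozen expression goal
        "v": np.zeros(d),              # M_i.v: expression progress so far
    }

cell = init_cell([1, 2], [3])
assert cell["v"].sum() == 0.0          # nothing expressed before decoding
```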

State Controller
As illustrated by Figure 1, the state controller sits between the encoder and the decoder, and manages the interaction between the state memory panel and the decoder. Let s_t be the hidden state of the decoder at step t. The state controller first updates M_i.v^{t−1} to M_i.v^t based on s_t with a state update operation. It then obtains the difference between M_i.g and M_i.v^t from the state memory panel via a difference reading operation, and feeds the difference to the decoder to predict the t-th word of the response.
State Update Operation. The operation includes SUB and ADD as two sub-operations. Intuitively, when the status of expression surpasses the goal, the state controller should execute the SUB operation (which stands for "subtract") to trim the status representation, while when the status of expression is inadequate, the state controller should use the ADD operation to enhance the status representation. Technically, rather than comparing M_i.v^{t−1} with M_i.g and adopting one of the operations accordingly, we propose a soft way to update the state memory panel with SUB and ADD, since (1) it is difficult to identify over-expression or under-expression by comparing two distributed representations; and (2) the hard way would break the differentiability of the model. Specifically, we define g_t ∈ R^{d×l} as a gate that controls, per memory cell, how the results of SUB and ADD are mixed when M_i.v^{t−1} is updated to M_i.v^t conditioned on s_t.

Difference Reading Operation. For each variable in the meta-word M, the operation represents the difference between the status of expression and the goal of expression as a vector, and then applies an attention mechanism to the vectors to indicate to the decoder the importance of each variable in generating the next word. Formally, suppose that d_i^t ∈ R^{2d} is the difference vector for m_i ∈ M at step t; then d_i^t is defined as d_i^t = M_i.k ⊕ (M_i.g − M_i.v^t), where ⊕ denotes concatenation. With (d_1^t, . . . , d_l^t) as a difference memory, the difference reading operation takes s_t as a query vector and calculates attention over the memory as o_t = Σ_{i=1}^l a_i^t d_i^t, with a_i^t = exp(e_i^t) / Σ_{j=1}^l exp(e_j^t) and e_i^t = (M_i.g − M_i.v^t)^⊤ U s_t, where (a_1^t, . . . , a_l^t) are attention weights, and U ∈ R^{d×d} is a parameter.
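The two operations can be sketched as follows. This is a toy NumPy sketch under our assumptions: the gate mixes sigmoid-parameterized SUB and ADD sub-operations, the difference vector concatenates the key with the goal-value residual, and attention scores use a bilinear form with U; the exact parameterizations in the paper may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, l = 8, 5                              # cell size, number of variables

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Random placeholder parameters.
W_g, W_sub, W_add = (rng.standard_normal((d, 2 * d)) for _ in range(3))
U = rng.standard_normal((d, d))

def update_values(values, s_t):
    """Soft state update: a gate mixes SUB and ADD per cell."""
    new = []
    for v in values:
        x = np.concatenate([s_t, v])
        g = sigmoid(W_g @ x)             # gate in (0, 1)^d
        sub, add = sigmoid(W_sub @ x), sigmoid(W_add @ x)
        new.append(v - g * sub + (1 - g) * add)
    return new

def difference_read(keys, goals, values, s_t):
    """Attention over difference vectors d_i = k_i concat (g_i - v_i)."""
    diffs = [np.concatenate([k, g - v]) for k, g, v in zip(keys, goals, values)]
    scores = np.array([(g - v) @ U @ s_t for g, v in zip(goals, values)])
    a = softmax(scores)
    return sum(w * dv for w, dv in zip(a, diffs))   # o_t in R^{2d}

keys = [rng.standard_normal(d) for _ in range(l)]
goals = [sigmoid(rng.standard_normal(d)) for _ in range(l)]
vals = [np.zeros(d) for _ in range(l)]
s_t = rng.standard_normal(d)
vals = update_values(vals, s_t)
o_t = difference_read(keys, goals, vals, s_t)
assert o_t.shape == (2 * d,)
```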

Response Decoding
In decoding, the hidden state s_t is calculated by GRU(s_{t−1}, [e(y_{t−1}) ⊕ C_t]), where e(y_{t−1}) ∈ R^d is the embedding of the word predicted at step t−1, and C_t is a context vector obtained from attention over the hidden states of the input message X given by the biGRU-based encoder. Let H_X = (h_{X,1}, . . . , h_{X,T_x}) be the hidden states of X; then C_t is calculated via C_t = Σ_{j=1}^{T_x} α_{t,j} h_{X,j}, with α_{t,j} = exp(e_{t,j}) / Σ_{k=1}^{T_x} exp(e_{t,k}) and e_{t,j} = U_d^⊤ tanh(W_s s_{t−1} + W_h h_{X,j} + b_d), where U_d, W_s, W_h, and b_d are parameters, and s_{t−1} is the hidden state of the decoder at step t−1.
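The context vector computation is standard additive (Bahdanau-style) attention; a sketch consistent with the listed parameter names, using random placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(2)
d, Tx = 8, 6                            # hidden size, message length

U_d = rng.standard_normal(d)            # attention parameters (placeholders)
W_s = rng.standard_normal((d, d))
W_h = rng.standard_normal((d, d))
b_d = rng.standard_normal(d)

H_X = rng.standard_normal((Tx, d))      # encoder hidden states h_{X,1..Tx}
s_prev = rng.standard_normal(d)         # decoder state s_{t-1}

# Score each message position, normalize, and take the weighted sum
# of encoder states as the context vector C_t.
scores = np.array([U_d @ np.tanh(W_s @ s_prev + W_h @ h + b_d) for h in H_X])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
C_t = alpha @ H_X                       # C_t in R^d
assert abs(alpha.sum() - 1.0) < 1e-9
```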
With the hidden state s_t and the difference vector o_t returned by the state controller, the probability distribution for predicting the t-th word of the response is given by p(y_t | y_{<t}, X, M) ∝ exp(e(y_t)^⊤ (W_p [s_t ⊕ o_t] + b_p)), where y_t is the t-th word of the response with e(y_t) its embedding, and W_p and b_p are parameters.
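A sketch of the word prediction step (our reading: each word is scored by the inner product of its embedding with a projection of the concatenated state and difference vector; the exact form in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab = 8, 10
E = rng.standard_normal((vocab, d))     # word embedding table e(.)
W_p = rng.standard_normal((d, 3 * d))   # projects [s_t concat o_t]
b_p = rng.standard_normal(d)

s_t = rng.standard_normal(d)
o_t = rng.standard_normal(2 * d)        # difference vector from the controller

# Score each word, then normalize with softmax to get the distribution.
logits = E @ (W_p @ np.concatenate([s_t, o_t]) + b_p)
p = np.exp(logits - logits.max())
p /= p.sum()
assert abs(p.sum() - 1.0) < 1e-9 and p.shape == (vocab,)
```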

Learning Method
To perform online response generation with metawords, we need to (1) estimate parameters of GTMES2S by minimizing a loss function; and (2) learn a model to predict meta-words for online messages.

Loss for Model Learning
The first loss item is the negative log likelihood (NLL) of D, formulated as L_{NLL}(Θ) = −(1/N) Σ_{i=1}^N log P(Y_i | X_i, M_i), where Θ is the set of parameters of GTMNES2S. By minimizing NLL alone, the supervision signals in D may not sufficiently flow to GTMN, as GTMN is nested within response decoding. Thus, besides NLL, we propose a state update loss that directly supervises the learning of GTMN with D.
The idea is to minimize the distance between the ground-truth status of meta-word expression and the status stored in the state memory panel. Suppose that y_{1:t} is the segment of response Y generated up to step t; then ∀m_i ∈ M, we consider two cases: (1) there exists an F_i(·) that maps y_{1:t} to the space of m_i.v. For example, response length belongs to this case with F_i(y_{1:t}) = t. (2) It is hard to define an F_i(·) that can map y_{1:t} to the space of m_i.v. For instance, dialogue acts belong to this case, since it is often difficult to judge the dialogue act from part of a response. For case (1), we define the state update loss for m_i as Σ_{t=1}^T ‖Rep(F_i(y_{1:t})) − M_i.v^t‖, where T is the length of Y and ‖·‖ refers to the L2 norm; that is, the status in the panel is supervised at every decoding step. For case (2), the loss is defined as ‖M_i.g − M_i.v^T‖; that is, only the final status is required to reach the goal. The full state update loss L_{SU}(Θ) for D sums these items over all variables, with indicator functions I(m_i ∈ C_1) and I(m_i ∈ C_2) selecting the proper case, where C_1 and C_2 represent the sets of variables belonging to case (1) and case (2) respectively. The loss function for learning GTMNES2S is finally defined by L(Θ) = L_{NLL}(Θ) + λ L_{SU}(Θ), where λ acts as a trade-off between the two items.
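One possible reading of the two loss cases, sketched for a single variable (the per-step supervision for case (1) and final-step supervision for case (2) are our interpretation of the text):

```python
import numpy as np

def state_update_loss(values, goal, reps_of_prefix=None):
    """Sketch of the two state update loss cases for one variable.

    values: list of M_i.v^t for t = 1..T (tracked during decoding)
    goal:   the frozen goal vector M_i.g
    reps_of_prefix: for case (1), the ground-truth status Rep(F_i(y_{1:t}))
                    at every step; None selects case (2).
    """
    if reps_of_prefix is not None:      # case (1): supervise every step
        return sum(np.linalg.norm(v - r)
                   for v, r in zip(values, reps_of_prefix))
    # case (2): only the final status should match the goal
    return np.linalg.norm(values[-1] - goal)

T, d = 4, 8
vals = [np.full(d, t / T) for t in range(1, T + 1)]
goal = np.ones(d)
# Case (2): the final value equals the goal, so the loss is zero.
assert state_update_loss(vals, goal) == 0.0
```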

Meta-word Prediction
We assume that values of meta-words are given beforehand. In training, the values can be extracted from the ground truth. In test, however, since only a message is available, we propose sampling values of a meta-word for the message from probability distributions estimated from the training data. The sampling approach not only provides meta-words to GTMNES2S, but also keeps meta-words diverse for similar messages. Formally, let h^p_X be the last hidden state of a message X processed by a biGRU; then ∀m_i ∈ M, we assume that m_i.v obeys a multinomial distribution with probability p_i parameterized as softmax(W_{mul,i} h^p_X + b_{mul,i}). In distribution estimation, we assume that the variables in a meta-word are independent, and jointly maximize the log likelihood of {(M_i | X_i)}_{i=1}^N and the entropy of the distributions as regularization.
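A sketch of the sampling procedure: per-variable softmax heads over a message representation, with one sample drawn per variable. The parameter names and the discretization of each variable's support are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_meta_word(message_vec, heads):
    """Sample one value per variable from per-variable multinomials.

    heads: dict key -> (W, b, support); softmax(W @ h + b) gives the
    distribution over the variable's support (placeholder parameters).
    """
    meta = {}
    for key, (W, b, support) in heads.items():
        logits = W @ message_vec + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        meta[key] = support[rng.choice(len(support), p=p)]
    return meta

d = 8
h = rng.standard_normal(d)              # h^p_X: last encoder state
heads = {
    "RL": (rng.standard_normal((25, d)), np.zeros(25), list(range(1, 26))),
    "MU": (rng.standard_normal((2, d)), np.zeros(2), [False, True]),
}
mw = sample_meta_word(h, heads)
assert 1 <= mw["RL"] <= 25 and mw["MU"] in (False, True)
```

Because sampling is stochastic, similar messages can receive different meta-words, which keeps the generated responses diverse.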

Experiments
We test GTMNES2S on two large-scale datasets.

Datasets
We mine 10 million message-response pairs from the Twitter FireHose, covering the two-month period from June 2016 to July 2016, and sample 10 million pairs from the full Reddit data. As preprocessing, we remove duplicate pairs, pairs with a message or a response longer than 30 words, and messages that correspond to more than 20 responses, to prevent the latter from dominating learning. After that, 4,759,823 pairs are left for Twitter and 4,246,789 pairs for Reddit. On average, each message contains 10.78 words in the Twitter data and 12.96 words in the Reddit data. The average response lengths in the Twitter data and the Reddit data are 11.03 and 12.75 respectively. From the pairs left after pre-processing, we randomly sample 10k pairs as a validation set and 10k pairs as a test set for each dataset, and make sure that there is no overlap between the two sets. After excluding the pairs in the validation and test sets, the remaining pairs are used for model training. The test sets are built for calculating automatic metrics. Besides, we randomly sample 1000 distinct messages from each of the two test sets and recruit human annotators to judge the quality of the responses generated for these messages. For both the Twitter data and the Reddit data, the 30,000 most frequent words in the messages and responses of the training sets are kept as the message vocabulary and the response vocabulary. In the Twitter data, the message vocabulary and the response vocabulary cover 99.17% and 98.67% of the words appearing in messages and responses respectively. The two ratios are 99.52% and 98.8% in the Reddit data. All other words are marked as "UNK".
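The filtering rules can be sketched as follows (toy data; the thresholds follow the description above):

```python
# Sketch of the preprocessing rules: drop duplicates, drop overlong
# pairs, and cap the number of responses per message.
def preprocess(pairs, max_len=30, max_responses=20):
    seen, kept, counts = set(), [], {}
    for msg, resp in pairs:
        if (msg, resp) in seen:                 # drop duplicate pairs
            continue
        seen.add((msg, resp))
        if len(msg.split()) > max_len or len(resp.split()) > max_len:
            continue                            # drop overlong pairs
        if counts.get(msg, 0) >= max_responses: # cap responses per message
            continue
        counts[msg] = counts.get(msg, 0) + 1
        kept.append((msg, resp))
    return kept

pairs = [("hi there", "hello"), ("hi there", "hello"), ("a " * 31, "ok")]
assert preprocess(pairs) == [("hi there", "hello")]
```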

Meta-word Construction
As a showcase of the GTMNES2S framework, we consider the following variables as a meta-word: (1) Response Length (RL): the number of words and punctuation marks in a response. We restrict the range of the variable to {1, . . . , 25} (i.e., responses longer than 25 are normalized to 25), and treat it as a categorical variable. (2) Dialogue Act (DA): we employ the 42 dialogue acts based on the DAMSL annotation scheme (Core and Allen, 1997). The dialogue act of a given response is obtained by a state-of-the-art dialogue act classifier learned from the Switchboard (SW) 1 Release 2 Corpus (Godfrey and Holliman, 1997). DA is a categorical variable. (3) Multiple Utterances (MU): whether a response is made up of multiple utterances. We split a response into utterances according to ".", "?" and "!", and remove utterances with fewer than 3 words. The variable is "true" if more than 1 utterance is left, and "false" otherwise. (4) Copy Ratio (CR): inspired by CopyNet (Gu et al., 2016), which indicates that humans may repeat entity names or even long phrases in conversation, we incorporate a "copy mechanism" into our model by using copy ratio as a soft implementation of CopyNet. We compute the ratio of unigrams shared by a message and its response (divided by the length of the response), with stop words and the 1000 most frequent words in training excluded. CR is a real-valued variable. (5) Specificity (S): following SC-Seq2Seq (Zhang et al., 2018b), we calculate normalized inverse word frequency as a specificity variable. The variable is real-valued. Among the five variables, RL, CR, and S are supervised with the per-step state update loss for case (1), and the others with the final-step loss for case (2).
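Three of the five variables can be extracted with simple string processing; a sketch (the toy stop-word list and whitespace tokenization are ours, and DA and S, which need external models, are omitted):

```python
import re

STOP = {"the", "a", "i", "you", "to"}   # toy stop-word list

def extract_meta(message, response, max_len=25):
    """Extract RL, MU, and CR from a message-response pair."""
    words = response.split()
    rl = min(len(words), max_len)                      # Response Length
    utts = [u for u in re.split(r"[.?!]", response)    # Multiple Utterances
            if len(u.split()) >= 3]
    mu = len(utts) > 1
    msg_words = set(message.lower().split()) - STOP    # Copy Ratio
    shared = [w for w in words
              if w.lower() in msg_words and w.lower() not in STOP]
    cr = len(shared) / max(len(words), 1)
    return {"RL": rl, "MU": mu, "CR": cr}

m = extract_meta("should i pull the ring out ?",
                 "just pull the ring out . it is easy to do .")
assert m["RL"] == 12 and m["MU"] is True
```

In training, such extracted values serve as the ground-truth meta-words; in test, values are sampled as described in the meta-word prediction section.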

Baselines
We compare GTMNES2S with the following baseline models: (1) MMI-bidi: the sequence-to-sequence model with response re-ranking learned with a maximum mutual information objective (Li et al., 2016); (2) SC-Seq2Seq: the specificity controlled Seq2Seq model in (Zhang et al., 2018b); (3) kg-CVAE: the knowledge-guided conditional variational autoencoder in (Zhao et al., 2017); and (4) CT: the conditional training method in (See et al., 2019) that feeds the embedding of pre-defined response attributes to the decoder of a sequence-to-sequence model. Among the baselines, CT exploits the same attributes as GTMNES2S, SC-Seq2Seq utilizes specificity, and kg-CVAE leverages dialogue acts. All models are implemented with the recommended parameter configurations in the existing papers, where for kg-CVAE, we use the code shared at https://github.com/snakeztc/NeuralDialog-CVAE, and for the other models without officially published code, we implement them with TensorFlow. Besides the baselines, we also compare GTMNES2S learned from the full loss (NLL plus the state update loss) with a variant learned only from the NLL loss, in order to check the effect of the proposed state update loss. We denote the variant as GTMNES2S w/o SU.

Evaluation Metrics
We conduct both automatic evaluation and human evaluation. In terms of automatic evaluation, we evaluate models from four aspects: relevance, diversity, accuracy of one-to-many modeling, and accuracy of meta-word expression. For relevance, besides BLEU (Papineni et al., 2002), we follow (Serban et al., 2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. To evaluate diversity, we follow (Li et al., 2016) and use Distinct-1 (Dist1) and Distinct-2 (Dist2) as metrics, which are calculated as the ratios of distinct unigrams and bigrams in the generated responses. For accuracy of one-to-many modeling, we utilize A-bow precision (A-prec), A-bow recall (A-rec), E-bow precision (E-prec), and E-bow recall (E-rec) proposed in (Zhao et al., 2017) as metrics. For accuracy of meta-word expression, we measure accuracy for categorical variables and squared deviation for real-valued variables. Metrics of relevance, diversity, and accuracy of meta-word expression are calculated on the 10k test data based on the top-1 responses from beam search. To measure the accuracy of meta-word expression for a generated response, we extract the values of the meta-word of the response with the methods described in Section 6.2, and compare these values with the oracle ones sampled from the distributions. Metrics of accuracy of one-to-many modeling require a test message to have multiple reference responses. Thus, we filter the test sets by picking out messages that have at least 2 responses, and form two subsets with 166 messages for Twitter and 135 messages for Reddit respectively. On average, each message corresponds to 2.8 responses in the Twitter data and 2.92 responses in the Reddit data. For each message, 10 responses from a model are used for evaluation. In kg-CVAE, we follow (Zhao et al., 2017) and sample 10 times from the latent variable; in SC-Seq2Seq, we vary the specificity in {0.1, 0.2, . . . , 1}; and in both CT and GTMNES2S, we sample 10 times from the distributions.
Top-1 responses from beam search under each sampling or specificity setting are collected as the set for evaluation.
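Distinct-1/2 are straightforward to compute; a sketch:

```python
def distinct_n(responses, n):
    """Distinct-n: ratio of unique n-grams to total n-grams over a
    set of generated responses."""
    total, uniq = 0, set()
    for r in responses:
        toks = r.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        uniq.update(grams)
    return len(uniq) / total if total else 0.0

resps = ["i do not know", "i do not care"]
assert distinct_n(resps, 1) == 5 / 8    # 5 unique unigrams of 8 total
assert distinct_n(resps, 2) == 4 / 6    # 4 unique bigrams of 6 total
```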
In terms of human evaluation, we recruit 3 native speakers to label the top-1 responses of beam search from different models. Responses from all models for all 1000 test messages in both datasets are pooled, randomly shuffled, and presented to each annotator. The annotators judge the quality of the responses according to the following criteria: +2: the response is not only relevant and natural, but also informative and interesting; +1: the response can be used as a reply, but might not be informative enough (e.g., "Yes, I see"); 0: the response makes no sense, is irrelevant, or is grammatically broken. Each response receives 3 labels. Agreement among the annotators is measured by Fleiss' kappa (Fleiss and Cohen, 1973).

Implementation Details
In test, we fix the specificity variable at 0.5 in SC-Seq2Seq, since in (Zhang et al., 2018a) the authors conclude that the model achieves the best overall performance under this setting. For kg-CVAE, we follow (Zhao et al., 2017) and predict a dialogue act for a message with an MLP. GTMNES2S and CT leverage the same set of attributes; thus, for fair comparison, we let them exploit the same sampled values in generation. In GTMNES2S, the size of the hidden units of the encoder and the decoder, and the size of the vectors in memory cells (i.e., d), are 512. Word embeddings are randomly initialized with a size of 512. We adopt the Adadelta algorithm (Zeiler, 2012) in optimization with a batch size of 200. Gradients are clipped when their norms exceed 5. We stop training when the perplexity of a model on the validation data does not drop in two consecutive epochs. Beam sizes are 200 in MMI-bidi (i.e., the size used in (Li et al., 2016)) and 5 in the other models.

Table 2: Results on relevance, diversity, and accuracy of one-to-many modeling. Numbers in bold mean that the improvement over the best baseline is statistically significant (t-test, p-value < 0.01).

Table 3: Results on accuracy of meta-word expression. Numbers in bold mean that the improvement over the best baseline is statistically significant (t-test, p-value < 0.01).

Evaluation Results
Table 2 and Table 3 report the evaluation results on automatic metrics. On most of the metrics, GTMNES2S outperforms all baseline methods, and the improvements are statistically significant (t-test, p-value < 0.01). The results demonstrate that with meta-words, our model can represent the relationship between messages and responses in a more effective and more accurate way, and thus can generate more diverse responses without sacrificing relevance. Despite leveraging the same attributes for response generation, GTMNES2S achieves better accuracy than CT on both one-to-many modeling and meta-word expression, indicating the advantage of the dynamic control strategy over the static control strategy, as analyzed at the beginning of Section 4.2. Without the state update loss, there is a significant performance drop for GTMNES2S. The results verify the effect of the proposed loss in learning. Table 4 summarizes the human evaluation results. Compared with the baseline methods and the variant, the full GTMNES2S model generates many more excellent responses (labeled as "2") and many fewer inferior responses (labeled as "0"). Kappa values of all models exceed 0.6, indicating substantial agreement among the annotators.
The results further demonstrate the value of the proposed model for real human-machine conversation. kg-CVAE gives more informative responses, but also more bad responses, than MMI-bidi and SC-Seq2Seq. Together with the tension between diversity and relevance in Table 2, the results indicate that a latent variable is a double-edged sword: the randomness may bring interesting content to responses, but may also make responses go out of control. On the other hand, there are no random variables in our model, and thus it can enjoy a well-trained language model.

Discussions
In this section, we examine effect of different attributes by adding them one by one to the generation model. Besides, we also illustrate how GTMNES2S tracks attribute expression in response generation with test examples.
Contribution of attributes. Table 5 shows the perplexity (PPL) of GTMNES2S with different sets of attributes on the validation data. We can see that the more attributes are involved in learning, the lower the PPL we get. By leveraging all 5 attributes, we reduce PPL by almost 50% relative to the vanilla encoder-decoder model (i.e., the one without any attributes). The results not only indicate the contribution of different attributes to model fitting, but also suggest the potential of the proposed framework, since it allows further improvement when more well-designed attributes are involved.
Case Study. Figure 2 illustrates how our model controls attributes of responses with the goal tracking mechanism, where the distance between the value of a memory cell (i.e., M_i.v^t) during generation and the goal of the memory cell (i.e., M_i.g) is visualized via heat maps. [Figure 2 shows the message "mm so should i just pull the ring out than ?" with baseline responses (kg-CVAE: "where is the ring ?"; MMI-bidi: "you don't want to that"; SC-Seq2Seq: "you should not do such things") and the responses of GTMNES2S and GTMNES2S w/o SU under the meta-word MU=False, DA=Statement-non-opinion, RL=8, CR=0.24, S=0.5.] In the first example, the full model gradually reduces the distance between the value and the goal of copy ratio expression as generation moves on. As a result, it copies "pull the ring out" from the message, which makes the response informative and coherent. On the other hand, without the state update loss, GTMNES2S w/o SU makes a mistake by copying "ring" twice, and the distance between the value and the goal is out of control. In the second example, we visualize the expression of MU, a categorical attribute. Compared with real-valued attributes, categorical attributes are easier to express. Therefore, both the full model and GTMNES2S w/o SU successfully generate a response with multiple utterances, although the distance between the value and the goal of MU expression in GTMNES2S w/o SU is still in disarray.

Conclusions
We present a goal-tracking memory enhanced sequence-to-sequence model for open domain response generation with meta-words, which explicitly define the characteristics of responses. Evaluation results on two datasets indicate that our model significantly outperforms several state-of-the-art generative architectures in terms of both response quality and accuracy of meta-word expression.