Augmenting Neural Response Generation with Context-Aware Topical Attention

Sequence-to-Sequence (Seq2Seq) models have achieved notable success in generating natural conversational exchanges. Notwithstanding the syntactically well-formed responses generated by these neural network models, they tend to be acontextual, short, and generic. In this work, we introduce a Topical Hierarchical Recurrent Encoder Decoder (THRED), a novel, fully data-driven, multi-turn response generation system intended to produce contextual and topic-aware responses. Our model builds upon the basic Seq2Seq model by augmenting it with a hierarchical joint attention mechanism that incorporates topical concepts and previous interactions into the response generation. To train our model, we provide a clean and high-quality conversational dataset mined from Reddit comments. We evaluate THRED on two novel automated metrics, dubbed Semantic Similarity and Response Echo Index, as well as with human evaluation. Our experiments demonstrate that the proposed model is able to generate more diverse and contextually relevant responses compared to strong baselines.


Introduction
With the recent success of deep neural networks in natural language processing tasks such as machine translation (Sutskever et al., 2014) and language modeling (Mikolov et al., 2010), there has been growing research interest in building data-driven dialogue systems. Fortunately, innovation in deep learning architectures and the availability of large public datasets have produced fertile ground for data-driven approaches to become feasible and quite promising. In particular, the Sequence-to-Sequence (Seq2Seq) neural network model (Sutskever et al., 2014) has driven substantial breakthroughs in improving the performance of conversational agents. Such a model succeeds in learning the backbone of the conversation but lacks any aptitude for producing context-sensitive and diverse conversations. Instead, generated responses are dull, short and carry little information (Li et al., 2016a). Instinctively, humans tend to adapt conversations to their interlocutor not only by looking at the last utterance but also by considering information and concepts covered in the conversation history (Danescu-Niculescu-Mizil and Lee, 2011). Such adaptation increases the smoothness and engagement of the generated responses. We speculate that incorporating conversation history and topic information with our novel model and method will improve generated conversational responses. In this work, we introduce a novel, fully data-driven, multi-turn response generation system intended to produce context-aware and diverse responses. Our model builds upon the basic Seq2Seq model by combining conversational data and external knowledge information trained through a hierarchical joint attention neural model. We find that our method leads to both diverse and contextual responses compared to strong baselines from the literature. We also introduce two novel quantitative metrics for dialogue model development, dubbed Semantic Similarity and Response Echo Index.
While the former measures the capability of the model to be consistent with the context and to maintain the topic of the conversation, the latter assesses how well our approach generates unique and plausible responses which are measurably distant from the input dataset. Used together, they provide a means to reduce the burden of human evaluation and allow rapid testing of dialogue models. We show that such metrics correlate well with human judgment, making a step towards a good automatic evaluation procedure.
The key contributions of this work are:
• We devise a fully data-driven neural conversational model that leverages conversation history and topic information in the response generation process through a hierarchical joint attention mechanism, making the dialogue more diverse and engaging.
• We introduce two novel automated metrics, Semantic Similarity and Response Echo Index, and we show that they correlate well with human judgment.
• We collect, parse and clean a conversational dataset from Reddit comments.

Related Work
Neural generative models have been improved through several techniques. Serban et al. (2016) built upon the Seq2Seq work by introducing a Hierarchical Recurrent Encoder-Decoder neural network (HRED) that accounts for the conversation history. (Li et al., 2016b) used deep reinforcement learning to generate highly-rewarded responses by considering three dialogue properties: ease of answering, informativeness and coherence. (Zhang et al., 2018) addressed the challenge of personalizing the chatbot by modeling human-like behaviour. They presented a persona-based model that aims to handle speaker consistency by integrating a speaker profile vector representation into the Seq2Seq model. (Xing et al., 2017) used a similar idea but added an extra probability value in the decoder to bias the overall distribution towards leveraging topic words in the generated responses. Their architecture does not focus on capturing conversation history. All of these improvements are motivated by the scarcity of diversity and informativeness of the responses. Our work follows on from these works with the additional aim of generating context-aware responses by using a hierarchical joint attention model. An important line of research that we also address in this work is automatically evaluating the quality of dialogue responses. In dialogue systems, automated metrics tend to be borrowed from other NLP tasks such as BLEU (Papineni et al., 2002) from machine translation and ROUGE (Lin, 2004) from text summarization. Yet, such metrics fail, mainly because they focus on the word-level overlap between the machine-generated answer and the human-generated answer, which can be inconsistent with what humans deem a plausible and interesting response. (Liu et al., 2016) showed that these metrics correlate very weakly with human evaluation.
Indeed, word-overlapping metrics achieve their best results when the space of responses is small and lexically overlapping, which is not the case for dialogue system responses. Significant works have looked into this challenge. Examples include ADEM, an evaluation model that learns to score responses from an annotated dataset of human response scores. (Venkatesh et al., 2018) proposed a number of metrics based on user experience, coherence, and topical diversity and showed that these metrics can be used as a proxy for human evaluation. However, their engagement and coherence metrics are estimated via recruiting evaluators. In this work, we propose directly calculable approximations of human evaluation grounded in conversational theories of accommodation and affordance (Danescu-Niculescu-Mizil and Lee, 2011).

Topical Hierarchical Recurrent Encoder Decoder
Topical Hierarchical Recurrent Encoder Decoder (THRED) can be viewed as a hybrid model that conditions the response generation on conversation history captured from previous utterances and on topic words acquired from a Latent Dirichlet Allocation (LDA) model (Blei et al., 2003). The proposed approach extends the standard Seq2Seq model by leveraging topic words in the process of response generation and accounting for conversation history. Figure 1 illustrates our model. We detail below the components of our model.

Message Encoder
Let D = {U_1, ..., U_N} be a sequence of N utterances within a dialogue. Every utterance U_i = {w_{i,1}, ..., w_{i,L_i}} is a sequence of L_i words (L_i being a random variable), where w_{i,k} represents the word embedding vector at position k in utterance U_i. The message encoder sequentially accepts the embedding of each word in the input message U_i and updates its hidden state at every time step t by a bidirectional GRU-RNN (Cho et al., 2014) according to:

h_{i,t} = GRU(h_{i,t-1}, w_{i,t})     (1)

where h_{i,t-1} represents the previous hidden state.

Figure 1: THRED model architecture, in which we jointly model two specifications that presumably make the task of response generation successful: context-awareness (modeled by Context Attention) and diversity (modeled by Topic Attention).
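The GRU recurrence above can be illustrated with a minimal numpy sketch of a single update step. The weight matrices below are hypothetical placeholders (biases omitted); the actual model uses a learned bidirectional GRU-RNN, which would run this step forward and backward over the sequence and concatenate the two hidden states.

```python
import numpy as np

def gru_step(h_prev, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update h_t = GRU(h_{t-1}, x_t); biases omitted for brevity."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h_prev)                # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)                # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))    # candidate state
    # interpolate between the previous state and the candidate
    return (1.0 - z) * h_prev + z * h_tilde
```

Encoding an utterance then amounts to folding `gru_step` over its word embeddings, starting from a zero hidden state.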

Message Attention
Different parts of the conversation history have distinct levels of importance that may influence the response generation process. The message attention in THRED operates by putting more focus on the salient input words with regard to the output. It computes, at step t, a weight value α_{i,j,t} for every encoder hidden state h_{i,j} and linearly combines them to form a vector m_{i,t} according to the Bahdanau attention mechanism (Bahdanau et al., 2015). Formally, m_{i,t} is calculated as:

m_{i,t} = Σ_{j=1}^{L_i} α_{i,j,t} h_{i,j}     (2)

where α_{i,j,t} is computed as:

α_{i,j,t} = exp(η(s_{t-1}, c_{i,t}, h_{i,j})) / Σ_{k=1}^{L_i} exp(η(s_{t-1}, c_{i,t}, h_{i,k}))

where s_{t-1} represents the hidden state of the decoder (further details are provided later), c_{i,t} denotes the hidden state of the context-level encoder (computed in Equation (3)), and η is a multi-layer perceptron with tanh activation. Unlike the Bahdanau attention mechanism, the attentional vector m_{i,t} is based on both the hidden states of the decoder and the hidden states of the context-level encoder. We are motivated by the fact that c_{i,t} may carry important information that could be missing in s_{t-1}. In summary, the attentional vector m_{i,t} is an order-sensitive summary of all the words in the sentence, attending to the more important words in the input messages.
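The joint attention above can be sketched in numpy. The MLP η is reduced here to a single tanh layer parameterized by a hypothetical weight matrix `W` and projection vector `v`; the key point is that the score for each word conditions on both the decoder state `s_prev` and the context-level state `c_i`.

```python
import numpy as np

def joint_attention(s_prev, c_i, H, W, v):
    """Attention over encoder states H (L x d_h), scored against BOTH the
    previous decoder state s_prev and the context-level state c_i.
    W and v are hypothetical parameters of the tanh MLP eta."""
    L = H.shape[0]
    # eta(s_{t-1}, c_{i,t}, h_{i,j}): concatenate the three inputs per word
    feats = np.concatenate([np.tile(s_prev, (L, 1)), np.tile(c_i, (L, 1)), H], axis=1)
    e = np.tanh(feats @ W) @ v                         # one score per word (L,)
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax -> weights
    m = alpha @ H                                      # weighted sum -> m_{i,t}
    return m, alpha
```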

Context-Level Encoder
The context-level encoder takes as input each utterance representation (m_{1,t}, ..., m_{N,t}) and calculates the sequence of recurrent hidden states as shown in Equation (3):

c_{i,t} = GRU(c_{i-1,t}, m_{i,t})     (3)

where c_{i-1,t} denotes the previous hidden state of the context-level encoder and N represents the number of utterances in the conversation history. The resulting vector c_{i,t} summarizes all past information processed up to position i.

Context-Topic Joint Attention
Context Attention: On top of the context-level encoder, a context attention is added to attend to important utterances in the conversation history. Precisely, the context attention assigns weights (γ_{1,t}, ..., γ_{N,t}) to (c_{1,t}, ..., c_{N,t}) and forms a vector r_t as:

r_t = Σ_{i=1}^{N} γ_{i,t} c_{i,t}

where:

γ_{i,t} = exp(η(s_{t-1}, c_{i,t})) / Σ_{k=1}^{N} exp(η(s_{t-1}, c_{k,t}))

Topic Attention: In order to infuse the response with information relevant to the input messages, we enhance the model with topic information. We assign a topic T to the conversation context using a pre-trained LDA model (Hoffman et al., 2010). LDA is a probabilistic topic model that assigns multiple topics to the dialogue history. The LDA parameters were estimated using the collapsed Gibbs sampling algorithm (Zhao et al., 2011). We provide further details on how we train this model in the supplementary material. In our case, the conversation history is a short document, so we believe that the most probable topic is sufficient to model the dialogue. After acquiring topic words for the entire history, we pick the n most probable words under T (we choose n = 100 in our experiments). The topic words {t_1, ..., t_n} are then linearly combined to form a fixed-length vector k:

k = Σ_{i=1}^{n} β_{i,t} t_i

where the weight values β_{i,t} are calculated as follows:

β_{i,t} = exp(η(s_{t-1}, c_{N,t}, t_i)) / Σ_{k=1}^{n} exp(η(s_{t-1}, c_{N,t}, t_k))

where i ∈ {1, ..., n}, c_{N,t} is the last hidden state of the context-level encoder, and s_{t-1} is the (t-1)-th hidden state of the decoder. The topic attention additionally uses the last hidden state of the context-level encoder c_{N,t} in order to diminish the influence of irrelevant topic words and emphasize the ones relevant to the message. Unlike (Xing et al., 2017), our model employs the final context-level encoder hidden state c_{N,t} in order to account for conversation history in the generated response. In summary, the topic words are summarized as a topic vector k representing prior knowledge for response generation.
The key idea of this approach is to shape the generation process not by learning the same conversational pattern for each utterance, but by enriching the responses with topics and words related to the subject of the message, even if those words were never used before.

Decoder
The decoder is responsible for predicting the response utterance U_{N+1} given the previous utterances and the topic words. Following (Xing et al., 2017), we bias the generation probability towards generating the topic words in the response. In particular, we add an extra probability term to the standard generation probability, encouraging the model to account for the topical tokens. Consequently, the generation probability of a word w at time step t is defined as:

p(y_t = w) = p_V(y_t = w) + p_K(y_t = w)

where K and V represent the topic vocabulary and the response vocabulary, respectively, and p_K(y_t = w) = 0 for any w ∉ K. Both p_V and p_K are softmax distributions computed from the decoder hidden state, the attentional vectors, and a tanh nonlinearity σ, following the formulation of (Xing et al., 2017).
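A minimal sketch of the topic-biased output distribution follows. This is an approximation of the mechanism described above, not the exact parameterization of (Xing et al., 2017): the two logit vectors are taken as given inputs, and the combined mass is renormalized at the end so the sketch returns a valid distribution.

```python
import numpy as np

def biased_generation_probs(logits_v, logits_k, vocab, topic_vocab):
    """Add extra probability mass p_K to topic words on top of the standard
    softmax p_V over the response vocabulary (a simplified sketch)."""
    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()
    p_v = softmax(logits_v)                 # over the full response vocab V
    p_k = softmax(logits_k)                 # over the topic vocab K only
    p = dict(zip(vocab, p_v))
    for w, pk in zip(topic_vocab, p_k):     # p(y_t = w) = p_V(w) + p_K(w)
        p[w] = p.get(w, 0.0) + pk
    total = sum(p.values())                 # renormalize into a distribution
    return {w: pw / total for w, pw in p.items()}
```

With uniform logits, a word that appears in both vocabularies ends up with more probability mass than a word that appears only in the response vocabulary, which is the intended bias.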

Datasets
One of the main weaknesses of dialogue systems stems from the paucity of high-quality conversational datasets. The well-known OpenSubtitles dataset (Tiedemann, 2012) lacks speaker annotations, making it more difficult to train conversation systems that demand high-quality speaker and conversation-level tags. Therefore, the assumption of treating consecutive utterances as turn exchanges uttered by two persons (Vinyals and Le, 2015) may not be viable. To enable the study of high-quality and large-scale datasets for dialogue modeling, we have collected a corpus of 35M conversations drawn from Reddit data, where each dialogue is composed of three turn exchanges. The Reddit dataset is composed of posts and comments, where each comment is annotated with rich metadata (i.e., author, number of replies, user's comment karma, etc.). To harvest the dataset, we curated 95 English subreddits out of roughly 1.1M public subreddits. Our choice was based on the top-ranked subreddits that discuss topics such as news, education, business, politics and sports. We processed Reddit over a 12-month period ranging from December 2016 until December 2017. For each post, we retrieved all comments and recursively followed the chain of replies of each comment to recover the entire conversation. The Reddit data is often semantically well-structured and is not riddled with spelling errors, thanks to moderators' efforts; therefore, we do not perform any spelling correction procedure. Due to resource limitations, we randomly sampled 6M dialogues as training data, 700K dialogues as development data, and 40K dialogues as test data.
For OpenSubtitles, we trained the models on the same size of data as for Reddit.
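The recursive traversal of reply chains described above can be sketched as follows. The tree structure (`{"text": ..., "replies": [...]}`) is a hypothetical simplification of the Reddit comment format; every chain of three consecutive turns is emitted as one three-turn dialogue.

```python
def extract_three_turn_dialogues(comment_tree):
    """Walk a comment tree and emit every chain of three consecutive
    utterances (post -> comment -> reply) as a 3-turn dialogue."""
    dialogues = []

    def walk(node, chain):
        chain = chain + [node["text"]]
        if len(chain) >= 3:
            dialogues.append(tuple(chain[-3:]))   # last three turns
        for child in node.get("replies", []):
            walk(child, chain)                    # follow the reply chain

    walk(comment_tree, [])
    return dialogues
```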

Experiments
In this section, we focus on the task of predicting the next utterance given the conversation history. We compare THRED against three open-source baselines, namely standard Seq2Seq with an attention mechanism (Bahdanau et al., 2015), HRED (Serban et al., 2016), and Topic-Aware (TA) Seq2Seq (Xing et al., 2017). As done in (Li et al., 2016b), for standard Seq2Seq and TA-Seq2Seq, we concatenate the dialogue history to account for context in a multi-turn conversation. All experiments are conducted on two datasets (i.e., Reddit and OpenSubtitles). We report results on OpenSubtitles in the supplementary material.

Quantitative Evaluation
In the following subsections, we introduce two metrics that can impartially evaluate THRED and compare it against the different baselines. These metrics were tested on 5000 dialogues randomly sampled from the test dataset. It is worth mentioning that we present word perplexity (PPL) on the test data in Table 4 (along with the diversity metrics). However, we do not believe that it represents a good measure for assessing the quality of responses: perplexity captures how likely the responses are under a generation probability distribution, and does not measure the degree of diversity and engagingness in the responses.

Semantic Similarity
A good dialogue system should be capable of sustaining a coherent conversation with a human by staying on topic and by following a train of thought (Venkatesh et al., 2018). The Semantic Similarity (SS) metric estimates the correspondence between the utterances in the context and the generated response. The intuition behind this metric is that plausible responses should be consistent with the context and should maintain the topic of the conversation. Our response generator THRED, along with the baselines, generates an utterance based on the two previous utterances in the dialogue (i.e., Utt.1 and Utt.2). We compute the cosine distance between the embedding vectors of the test utterances (Utt.1 and Utt.2) and the generated responses from the different models (i.e., THRED, TA-Seq2Seq, HRED and Seq2Seq). Therefore, a low score denotes high coherence. More precisely, for each triple in the test dataset, we test two scenarios: (1) we compute the SS of each generated response with respect to the most recent utterance in the conversation (Utt.2), and (2) we compute the SS of each generated response with respect to the second most recent utterance (Utt.1). To render the semantic representation of an utterance, we leverage the Universal Sentence Encoder (Cer et al., 2018), wherein a sentence is projected to a fixed-dimensional embedding vector.
However, dull and generic responses such as "i'm not sure" tend to be semantically close to many utterances, hindering the effectiveness of the metric. To cope with this negative effect, we manually compiled a set of 60 dull responses and computed the SS score by multiplying the cosine distance with a penalty factor P (akin to the length penalty in (Wu et al., 2016)), computed from L, the length of the response after dropping stop words and punctuation, and L̂, the length of the non-dull part of the response after dropping stop words. The intuition here is that longer utterances, at nearly the same similarity, communicate the intention more unequivocally, since it takes more words to convey the same meaning. The penalized Semantic Similarity score is therefore defined as:

SS(utt_{i,j}, resp_i) = P × (1 − cos(utt_{i,j}, resp_i))

where i represents the index of the dialogue in the test dataset and j denotes the index of the utterance in the conversation history. The results on the Reddit dataset are shown in Table 2. We can observe that THRED is able to generate responses which follow the topic and semantics of the input utterances. In particular, the responses generated by THRED tend to be closer to the context of the conversation (Utt.1 and Utt.2) compared to the responses generated by the baselines. To ensure statistical significance, we conducted Student's t-test over the average values of the SS metric. THRED outperforms all baselines (p < 0.001), especially when the comparison is made against the most recent utterance (Utt.2). On the other hand, THRED is level with the compared models in semantic distance with respect to the second most recent utterance (Utt.1). This makes sense because in a multi-turn dialogue, speakers are more likely to address the last utterance spoken by the interlocutor, which is why THRED tends to favour the most recent utterance over an older one. Additionally, the roughly similar distances for both utterances in standard Seq2Seq and TA-Seq2Seq show that, by concatenating the context into a single input, these models cannot distinguish between early turns and late turns. Similarly, the results achieved on the OpenSubtitles dataset (see Figure 4 in the supplementary material) illustrate that THRED succeeds in staying on topic and accounting for contextual information.

CONTEXT (Reddit): sanctions are an act of war → why do you think that ?
THRED: because it's really a theory that supports terrorism . and this has an effect on the idea of a regime that isn't the same as a government (Excellent, Good, Excellent, Good, Excellent)
HRED: because the war is n't a war . it 's a war . (Good, Poor, Poor, Poor, Poor)
Seq2Seq: because it 's an unpopular opinion , and that 's why it 's a bad thing to say . (Good, Poor, Excellent, Good, Good)
TA-Seq2Seq: because it's a war . (Good, Poor, Excellent, Poor, Good)
Table 1: One cherry-picked dialogue out of 150 conversations along with the generated responses from all models. Human judgments are provided in the brackets. The blue arrow specifies a dialogue exchange and the highlighted words in red represent the topic words acquired from the pre-trained LDA model.
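The penalized SS computation reduces to a cosine distance scaled by the penalty factor. In the sketch below the penalty P is passed in as a precomputed value, since its exact functional form depends on the paper's length-penalty definition; the embedding vectors stand in for Universal Sentence Encoder outputs.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def penalized_ss(utt_vec, resp_vec, penalty):
    """SS(utt, resp) = P * (1 - cos(utt, resp)).
    A LOW score means the response is semantically CLOSE to the utterance;
    `penalty` down-weights short, generic (dull) responses."""
    return penalty * (1.0 - cosine(utt_vec, resp_vec))
```

An identical response and utterance embedding yields SS = 0 (maximal coherence), while orthogonal embeddings yield SS = P.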

Reliability Assurance
In order to ensure that the SS measurement is stable and void of random error, we investigate whether the SS metric is able to yield the same results regardless of the specific test dataset. Following (Papineni et al., 2002), the test dataset is randomly partitioned into 5 disjoint subsets (i.e., each consisting of 1000 test dialogues). Then, we compute the standard deviation of SS over each subset. The results, showcased in Table 3, indicate low standard deviation on the sub-datasets, denoting that the SS metric is a consistent and reliable measure for comparing different dialogue models.
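The partitioning procedure above can be sketched as follows: shuffle the per-dialogue scores, split them into disjoint subsets, and measure how much the subset means disagree. A low standard deviation indicates the metric is stable across test samples.

```python
import random
import statistics

def ss_reliability(ss_scores, n_parts=5, seed=0):
    """Randomly partition per-dialogue SS scores into n disjoint subsets
    and return the standard deviation of the subset means."""
    scores = list(ss_scores)
    random.Random(seed).shuffle(scores)         # random disjoint partition
    size = len(scores) // n_parts
    means = [statistics.mean(scores[i * size:(i + 1) * size])
             for i in range(n_parts)]
    return statistics.pstdev(means)
```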

Response Echo Index
The goal of the Response Echo Index (REI) metric is to detect overfitting to the training dataset. More specifically, we want to measure the extent to which the responses generated by our model repeat the utterances appearing in the training data. Our approach is close to sampling and finding the nearest neighbour in image generative models (Theis et al., 2016). We randomly sampled 10% of the training data of both OpenSubtitles and Reddit. The nearest neighbour is determined via the Jaccard similarity function. Each utterance is represented by a lemmatized bag-of-words where stop words and punctuation marks are omitted. In effect, REI is defined as:

REI(resp) = max_{t ∈ T_{0.1}} J(r̂esp, t̂)

where t̂ is the normalized form of text t, T_{0.1} denotes the sampled training data, and J represents the Jaccard function. REI is expected to be low, since the generated responses should be distant from their nearest neighbours. According to the results, presented in Figure 2, the REI scores of the responses generated by THRED are the lowest compared to the rest of the models. Such observation leads us to the conclusion that THRED is able to generate unique responses which appear to be drawn from the input distribution while being measurably far from the input dataset. This strength of THRED is attributed to the topic attention and to incorporating topic words in response generation. For the same reason, standard Seq2Seq and HRED fall short.
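The REI computation above reduces to a nearest-neighbour search under Jaccard similarity over normalized bags of words. Lemmatization is left out of this sketch; normalization here is just lowercasing and stop-word removal.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def response_echo_index(response_tokens, sampled_training, stop_words=frozenset()):
    """REI = Jaccard similarity between the generated response and its
    nearest neighbour in the sampled training data T_0.1."""
    norm = lambda toks: {t.lower() for t in toks if t.lower() not in stop_words}
    r = norm(response_tokens)
    return max((jaccard(r, norm(t)) for t in sampled_training), default=0.0)
```

A response copied verbatim from the training sample scores 1.0; a response sharing no content words scores 0.0.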

Degree of Diversity & Perplexity
To account further for diversity in generated responses, following (Li et al., 2016a), we calculated distinct-1 and distinct-2 by counting unique unigrams and bigrams, normalized by the number of generated words. The results on Reddit, given in Table 4, indicate that THRED yields content-rich and diverse responses, mainly ascribed to incorporating new topic words into response generation. Furthermore, in perplexity, THRED performs slightly better.
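The distinct-n computation described above can be sketched directly, following the normalization stated in the text (unique n-grams divided by the total number of generated words):

```python
def distinct_n(responses, n):
    """distinct-n = (# unique n-grams across all responses) /
    (total # generated words), as described in the text."""
    ngrams, total_words = set(), 0
    for tokens in responses:
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_words if total_words else 0.0
```

For two identical two-word responses, distinct-1 is 2 unique unigrams over 4 words (0.5), and distinct-2 is 1 unique bigram over 4 words (0.25); repetitive models thus score low.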

Human Evaluation
Besides the quantitative measures, 4-scale and side-by-side human evaluations were carried out. Five human raters were recruited to evaluate the quality of the responses. They were fluent, native English speakers and well-instructed for the judgment task to ensure quality ratings. We showed every judge 300 conversations (150 dialogues from Reddit and 150 dialogues from OpenSubtitles) and two generated responses for each dialogue: one generated by the THRED model and the other generated by one of our baselines. The source models were unknown to the evaluators. The responses were ordered randomly to avoid biasing the judges. Additionally, Fleiss' Kappa is used to gauge the reliability of the agreement between human evaluators (Shao et al., 2017). An example of generated responses from the Reddit dataset is provided in Table 1. For the 4-scale human evaluation, judges were asked to rate the responses from Bad (0) to Excellent (3).

Figure 3 caption: In order to better visualize the density of the points, we added stochastic noise generated by a Gaussian distribution N(0, 0.1) to the human ratings (i.e., the horizontal axis) at the cost of lowering correlation, as done in .
The evaluators rated 32.9% and 36.9% of the THRED responses in OpenSubtitles and Reddit, respectively, as Excellent, which is considerably higher than all baselines (up to 11.6% and 22.7%, respectively). Apart from the 4-scale rating, we conducted side-by-side evaluations to measure the gain of THRED over the strong baselines. Specific comparison instructions are included in the supplementary material. The results, illustrated in Table 5, suggest that THRED is substantially superior to all baselines in producing informative and plausible responses from the human perspective. The high Kappa scores imply that a major agreement prevails among the labelers. In particular, THRED beats the strong baselines in 52% of the test data in Reddit (the percentage is obtained by averaging the win ratios). For the rest of the cases, THRED is equally good as the baselines in 25% of cases in Reddit (calculated similarly from Table 5). Hence, the ratio of cases where THRED is better than or on par with the baselines in terms of quality is 77% in Reddit.

Automated metric vs. Human evaluation
We also carried out an analysis of the correlation between the human evaluator ratings and our quantitative scores. The Semantic Similarity metric, which requires no pre-training, reaches a Pearson correlation of -0.341 with respect to the most recent utterance (Utt.2) on Reddit. A negative correlation is expected here, since higher human ratings correspond to lower semantic distance. This compares with values of 0.351 for Automatic User Ratings (Venkatesh et al., 2018) and 0.436 for ADEM from recent models which required large amounts of training data and computation. The correlations are visualized as scatter plots in Figure 3. In addition, we assessed ADEM on our test datasets using the pre-trained weights provided by the authors (https://github.com/mike-n-7/ADEM). ADEM achieves low correlation with human judgment (ρ = 0.014 on Reddit and ρ = 0.034 on OpenSubtitles), presumably because the quality of its predicted scores highly depends on the corpus on which the model is trained.

Table 5: Side-by-side human evaluation along with 4-scale human evaluation of dialogue utterance prediction on the Reddit dataset (mean preferences ±90% confidence intervals).
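The correlation analysis above uses the standard Pearson coefficient between paired human ratings and metric scores, which can be computed directly:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired samples,
    e.g. human ratings vs. an automated metric's scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A perfectly increasing relationship yields 1.0 and a perfectly decreasing one yields -1.0; the negative SS correlation reported above reflects that lower semantic distance accompanies higher human ratings.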

Comparing Datasets
Finally, we investigate the impact of training datasets on the quality of the responses generated by THRED and all baselines. Table 6 presents results supporting that our cleaner, well-parsed Reddit dataset yields significantly improved responses over our metrics of interest. In particular, we contrast the two datasets in terms of human judgment and the automated metrics across all the models. Regarding human assessment, we took the mean evaluation rating (MER) per response in the test data to draw the comparison between the datasets. As demonstrated in Table 6 (see more details in Figure 6 in the Appendix), the human evaluators scored generated responses from the Reddit dataset higher than utterances generated from the OpenSubtitles dataset, which holds not only for THRED but for all models. Consequently, the training data plays a crucial role in generating high-quality responses. Moreover, in OpenSubtitles, the assumption used for spotting a conversation, as stated in Section 4, tends to include extraneous utterances in the dialogue, impeding the response generation process. While such a presumption may seem valid for two-turn dialogues, it can degrade the quality of conversations in multi-turn dialogues.

Conclusion
In this work, we introduce the Topical Hierarchical Recurrent Encoder Decoder (THRED) model for generating topically consistent responses in multi-turn open conversations. We demonstrate that THRED significantly outperforms current state-of-the-art systems on quantitative metrics and human judgment. Additionally, we evaluate our new model and existing models with two new metrics which prove to be good measures for automatically evaluating the quality of responses. Finally, we present a parsed and cleaned dataset based on conversations from Reddit which improves generated responses. We expect more advanced work to be done in the area of chit-chat dialogue to improve the models, training data, and means of evaluation.

A Supplementary Material
A.1 Experimental Setup

The model parameters are learned by optimizing the log-likelihood of the utterances via the Adam optimizer with a learning rate of 0.0002; we followed (Luong et al., 2015) for decaying the learning rate. The dropout rate is set to 0.2 for both the encoder and the decoder to avoid overfitting. For all the baselines, we used hidden state units of size 1024. For our model, we used encoder and decoder hidden state units of size 800, and the same for the context encoder. During inference, we used standard beam search with beam width 5 and length normalization α = 1 (Wu et al., 2016). We noticed that applying the length normalization resulted in more diverse and longer sentences, but at the expense of the semantic coherence of the response in some cases. Training the LDA model: We trained two LDA models: one on OpenSubtitles and the other on Reddit. Both were trained on 1M dialogues. We set the number of topics to 150, α to 1/150, and γ to 0.01. We filtered out stop words and universal words. We also discarded the 1000 words with the highest frequency from the topic words.
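The topic-word filtering described above (drop stop words and the most frequent corpus words, then keep the n most probable words under the assigned topic) can be sketched as follows. The input `word_probs` (a map from word to P(word | topic)) is a hypothetical stand-in for the per-topic word distribution produced by the trained LDA model.

```python
def topic_words(word_probs, stop_words, frequent_words, n=100):
    """Select the n most probable words under the assigned topic, after
    filtering out stop words and the highest-frequency corpus words."""
    candidates = {w: p for w, p in word_probs.items()
                  if w not in stop_words and w not in frequent_words}
    # rank remaining words by their probability under the topic
    return [w for w, _ in sorted(candidates.items(), key=lambda kv: -kv[1])[:n]]
```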

A.2 Human Evaluation Procedure
For the 4-scale human evaluation, judges were asked to rate the responses from Bad (0) to Excellent (3). Excellent (score 3): The response is very appropriate, on topic, fluent, interesting and shows understanding of the context. Good (score 2): The response is coherent with the context but is neither diverse nor informative. It may imply the answer. Poor (score 1): The response is interpretable and grammatically correct but completely off-topic. Bad (score 0): The response is grammatically broken and does not provide an answer. Regarding the side-by-side evaluation, humans were asked to favor response 1 over response 2 if: (1) response 1 is relevant, logically consistent with the context, fluent and on topic; or (2) both responses 1 and 2 are relevant, consistent and fluent but response 1 is more informative than response 2. If judges cannot tell which one is better, they can rate the responses as Equally Good or Equally Bad.

Figure 4 caption: Semantic Similarity on the OpenSubtitles dataset (cf. Table 2). From left to right, the labels on the horizontal axis are THRED, HRED, Seq2Seq and TA-Seq2Seq. THRED surpasses all baselines in similarity with Utt.2, and works mildly better in similarity with Utt.1. The median for every box plot is also reported between brackets. The order of the values is the same as the order of the methods.
CONTEXT (OpenSubtitles): the giant observatory in hawaii are like time machines capturing the faint light that has travelled towards us through all of cosmic history → the deeper they look into space the farther back they see in time
THRED: as the stars move into space they see the rising sun and the stars rise above the clouds (Excellent, Excellent, Good, Excellent, Poor)
HRED: this is the first time we have ever seen (Poor, Poor, Bad, Good, Bad)
Seq2Seq: the stars and stars of the moon are the stars and stars of the stars and stars and stars and stars and stars and stars (Poor, Bad, Bad, Bad, Poor)
TA-Seq2Seq: they have been searching for a long time for a long time (Good, Poor, Good, Good, Good)

CONTEXT (OpenSubtitles): and closer to home in baltimore prominent psychiatrist and the author of a newly released book on insanity disappeared mysteriously last night after a lecture at the university → former mental patient james cole is wanted for questioning regarding dr cole 's disappearance
THRED: authorities warn that cole has a history of violence and advise anyone to notify authorities (Excellent, Excellent, Excellent, Excellent, Excellent)
HRED: they have been unable to identify him (Good, Poor, Poor, Good, Poor)
Seq2Seq: cole cole is cole is cole (Poor, Bad, Bad, Bad, Bad)
TA-Seq2Seq: authorities warn that cole has been kidnapped (Poor, Good, Good, Good, Good)

Table 7: Two cherry-picked dialogues out of 150 conversations along with the generated responses from all models. Human judgments are provided in the brackets. The blue arrow specifies a dialogue exchange and the highlighted words in red represent the topic words acquired from the pre-trained LDA model.

Caption fragment (cf. Table 4): The numbers in the brackets indicate the gain of distinct-1 and distinct-2 over the second best method (i.e., TA-Seq2Seq).
Table 9: Side-by-side human evaluation along with 4-scale human evaluation of dialogue utterance prediction on the OpenSubtitles dataset (mean preferences ±90% confidence intervals). Results on the Reddit dataset are reported in Table 5.

Figure 6 caption fragment: (cf. Table 6, in which only mean and standard deviation are reported per metric).