Dialog Generation Using Multi-Turn Reasoning Neural Networks

In this paper, we propose a generalizable dialog generation approach that adapts multi-turn reasoning, a recent advance in document comprehension, to generate responses (“answers”) by treating the current conversation session context as a “document” and the current query as a “question”. The main idea is to represent a conversation session as memories over which an attention-based memory reading mechanism can be performed multiple times, so that (1) the user’s query is properly extended by contextual clues and (2) optimal responses are generated step by step. Because a conversation involves more than one speaker, we split the single memory used for document comprehension into different groups for speaker-specific topic and opinion embedding. Namely, we use the queries’ memory, the responses’ memory, and their unified memory, following the time sequence of the conversation session. Experiments on Japanese 10-sentence (5-round) conversation modeling show that multi-turn reasoning can produce more diverse and acceptable responses than state-of-the-art single-turn and non-reasoning baselines.


Introduction
Dialogue systems such as chatbots are a thriving topic that is attracting increasing attention from researchers (Li et al., 2015; Wen et al., 2016). Recent achievements, such as deep neural networks for text generation, user profiling (Li et al., 2014), and natural language understanding, have accelerated progress in this field, which was historically approached with conventional rule-based and/or statistical response ranking strategies.
Response ranking models retrieve the most suitable response(s) from a fixed set of (question, answer) pairs given a dialogue context and the current query from a user (Banchs and Li, 2012; Lowe et al., 2015). Learning-to-rank approaches were applied to compute similarity scores between the (query, context) input and indexed candidate (question, answer) pairs, returning the optimal "answer" to the user. These ranking-based retrieval strategies are an important and widely applied approach to dialogue systems, yet the set of scripted responses is limited and generalizes poorly. On the other hand, statistical machine translation (SMT) systems have been applied to dialogue systems (Ritter et al., 2011), taking the user's query as a source language sentence and the chatbot's response as a target language sentence. Labeled data for learning-to-rank training is then no longer necessary; all that is needed is large-scale (question, answer) pairs.
The sequence-to-sequence model proposed in (Sutskever et al., 2014) applied end-to-end training of neural networks to text generation. This model, further enhanced by an attention mechanism, was generic and allowed its application to numerous sequence-to-sequence learning tasks such as neural machine translation (NMT), image captioning (Donahue et al., 2015; Mao et al., 2015), speech recognition (Chan et al., 2015) and constituency parsing. The simplicity of these models makes them attractive, since "translation" and "alignment" are learned jointly on the fly.
In particular, the sequence-to-sequence model has been applied to conversational modeling with impressive results on various datasets; the model was trained to predict a response given the previous sentence(s). Shang et al. (2015) combined local and global attentions and reported better results than retrieval-based systems. Other work explored three different end-to-end approaches for the problem of predicting the response given a query attached with a single message context.
Multi-turn conversation modeling is considered to be more difficult than machine translation, since there are many more acceptable responses for a given (context, query) input and these often rely on external knowledge and/or contextual reasoning. Dialogue systems trained with a maximum likelihood estimation (MLE) objective function, as most SMT systems use, often learn to reply with generic sentences such as "I don't know" or "sounds good", which occur frequently in the "answer" part of (question, answer) style training datasets. There have been various attempts at diversifying the responses (Li et al., 2016a; Yu et al., 2016; Li et al., 2017), but the lack of variation in the responses remains an essential challenge. We ask whether this problem can be relieved by modeling the prior context in a more fine-grained way.
In the document comprehension field, multi-turn reasoning (also called multi-hop reasoning) has delivered impressive results by assimilating various pieces of information to produce a unified answer (Hill et al., 2015; Dhingra et al., 2016). By reading the document's memory over multiple turns with attention models, the current question can be extended with much richer knowledge, which makes it easier to find the correct answer in that document. Different documents need to be read a different number of times to yield the correct answer for the input question. In particular, Shen et al. (2016) use a dynamic number of turns by introducing a termination gate to control the number of iterations of reading and reasoning.
Motivated by the reasoning network for document comprehension (Shen et al., 2016), we propose multi-turn reasoning neural networks that generate the proper response (or "answer") by attention-based reasoning over the current conversation session ("document") and the current query (identical to the "question" in document comprehension) from the user. In particular, our networks utilize conversation context and explicitly separate speakers' interventions into sentence-level and conversation-level memories. Our first model uses plain single-turn attention to integrate all the memories, and the second approach integrates multi-turn reasoning. The formulation of our proposed approach is designed in a generalized way, allowing for the inclusion of additional information such as external knowledge bases (Yih and Ma, 2016; Ghazvininejad et al., 2017; Han et al., 2015) or emotional memories (Zhou et al., 2017). Moreover, our approach for the two-speaker scenario can be easily extended to group chatting by further speaker-specific memory splitting.
We evaluate the performance of our methods by comparing three configurations trained on a Japanese Twitter conversation session dataset. Each conversation session contains 10 sentences, forming 5 rounds between two real-world speakers. The results provide evidence that multi-turn reasoning neural networks can help improve the consistency and diversity of multi-turn conversation modeling.
This paper is structured as follows: Section 2 gives a general description of multi-turn conversation modeling; Section 3 describes background neural language modeling, text generation, and attention mechanisms; Section 4.1 first introduces a model with multiple attention modules and then explains how the multi-turn reasoning mechanism can be further integrated into the previous models; Sections 5, 6 and 7 describe the experimental settings and results using automatic evaluation metrics, detailed human-evaluation based analysis, and conclusions, respectively.

Multi-turn Conversation Modeling
Consider a dataset $D$ consisting of a list of conversations between two speakers. Each conversation $d \in D$ is an ordered sequence of sentences $s_i$, where $i \in [1, T_d]$ and $T_d$ is the number of sentences in $d$, produced by the two speakers alternately. Thus, for $s_i, s_j \in d$, both sentences are from the same speaker if and only if $i \equiv j \pmod 2$. Note that our definition includes the case in which one speaker expresses a message continuously through several sentences: we simply concatenate these sentences into one to ensure that the conversation is modeled with alternating speakers.
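The alternating-speaker convention above can be illustrated with a minimal sketch. The function names, the `(speaker, sentence)` tuple layout and the use of a single space as the joining separator are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker id, sentence); hypothetical layout


def merge_consecutive_turns(turns: List[Turn]) -> List[Turn]:
    """Concatenate consecutive sentences from the same speaker so that
    the resulting conversation strictly alternates between speakers."""
    merged: List[Turn] = []
    for speaker, sentence in turns:
        if merged and merged[-1][0] == speaker:
            prev_speaker, prev_sentence = merged[-1]
            # Assumption: join with a space; the paper does not say how.
            merged[-1] = (prev_speaker, prev_sentence + " " + sentence)
        else:
            merged.append((speaker, sentence))
    return merged


def same_speaker(i: int, j: int) -> bool:
    """For 1-based sentence indices in an alternating conversation,
    s_i and s_j share a speaker iff i and j have the same parity."""
    return i % 2 == j % 2


raw = [("A", "hi"), ("A", "are you there?"), ("B", "yes"), ("A", "great")]
conversation = merge_consecutive_turns(raw)
```

After merging, the first and third sentences belong to speaker A and the second to speaker B, which matches the parity rule.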
A multi-turn conversation model is trained to search for parameters that maximize the likelihood of every sentence $s_i \in d$ with $i \geq 2$, supposing that the beginning sentence $s_1$ is always given as a precondition:
$$\hat{\theta} = \arg\max_{\theta} \prod_{d \in D} p(d; \theta), \quad (1)$$
where
$$p(d; \theta) = \prod_{i=2}^{T_d} p(s_i \mid s_{<i}; \theta). \quad (2)$$
Here, $s_{<i}$ are the sentences $s_j \in d$ with $j < i$. The probability of each sentence, $p(s_i \mid s_{<i})$, is frequently estimated by a conditional language model. Note that traditional single-turn conversation models or NMT models are a special case of this model, obtained by simply setting $T_d$ to 2. That is, the generation of the next sentence is session-insensitive and is determined only by the single preceding sentence. Another way to understand this contextual conversation model is that the number of reference contextual sentences $s_{<i}$ is not limited to one. Suppose that 9 sentences of a conversation session are already known and we want to generate the 10th sentence; then the probabilities from $p(s_1)$ to $p(s_9)$ are all preconditions and we only need to focus on estimating $p(s_{10} \mid s_{<10})$.
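As a rough sketch of this factorisation, the log-likelihood of one conversation is the sum of the log-probabilities of sentences 2 to $T_d$, each conditioned on all earlier sentences. The callable `sentence_prob` is a hypothetical stand-in for a conditional language model; the constant-probability toy model below exists only to make the example runnable.

```python
import math
from typing import Callable, List, Sequence

SentenceProb = Callable[[str, Sequence[str]], float]


def conversation_log_likelihood(sentences: List[str],
                                sentence_prob: SentenceProb) -> float:
    """Sum of log p(s_i | s_<i) for i >= 2; s_1 is treated as given."""
    total = 0.0
    for i in range(1, len(sentences)):  # 0-based index 1 is s_2
        total += math.log(sentence_prob(sentences[i], sentences[:i]))
    return total


def toy_prob(sentence: str, context: Sequence[str]) -> float:
    """Hypothetical model that assigns probability 0.5 to any sentence."""
    return 0.5


ll_three = conversation_log_likelihood(["a", "b", "c"], toy_prob)
# With T_d = 2 only one conditional remains, which is the single-turn case.
ll_single_turn = conversation_log_likelihood(["query", "reply"], toy_prob)
```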
We adapt sequence-to-sequence neural models (Sutskever et al., 2014) for multi-turn conversation modeling. They are separated into an encoder part and a decoder part. The encoder part applies an RNN to the input sequence(s) $s_{<i}$ to yield prior information. The decoder part estimates the probability of the generated output sequence $s_i$, using the last hidden state of the encoder as the initial hidden state of the decoder. Sutskever et al. (2014) applied this technique to NMT and reported impressive experimental results.
Using Equation 2, we are modeling two chatbots talking with each other, since all of $s_2, \ldots, s_{T_d}$ are modeled step by step on the fly. However, we can add constraints to determine whose responses are generated, either one speaker's or both. That is, when $i$ takes the odd values 1, 3, 5 and so on, we are modeling the first speaker; even values of $i$ indicate generation of responses for the second speaker.

Language Modeling and Text Generation
Language models (LM) are trained to compute the probability that a sequence of tokens (words, characters or other linguistic units) is a linguistically valid sentence. Frequently, the probability of a sentence $s$ with $T_s$ tokens is computed as the product of the probabilities of each token $y_j \in s$ given its contexts $y_{<j}$ and $y_{>j}$:
$$p(s) = \prod_{j=1}^{T_s} p(y_j \mid y_{<j}, y_{>j}). \quad (3)$$
When generating a sequence based on an LM, we can generate one word at a time based on the previously predicted words. In this situation, only the previously predicted words are known, and the probability of the current sequence is approximated by dropping the posterior context. That is,
$$p(y_j \mid y_{<j}, y_{>j}) \approx p(y_j \mid y_{<j}). \quad (4)$$
We construct a sequence generation LM using a sequence-to-sequence neural network $f$. The neural network interleaves linear combinations and non-linear activation functions to estimate the probability mass function. In the encoder part of $f$, the contextual information is represented by a fixed-size hidden vector $h_j$:
$$p(y_j \mid y_{<j}) = f(y_j, h_j; \theta_f), \quad (5)$$
where $\theta_f$ represents $f$'s trainable parameters.
To embed the previous word sequence into a fixed-size vector, recurrent neural networks (RNN) such as long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU) are widely used. These networks repeat a recurrent operation on each input word:
$$h_j = g(h_{j-1}, y_{j-1}; \theta_g), \quad (6)$$
where $\theta_g$ represents the trainable parameters of an RNN function $g$, and $h_j$ is the hidden state of the RNN at time $j$.

Conditional Language Modeling
The hidden state $h$ of an RNN can accumulate information from previous words ($y_j \in s$ with $j < T_s$) or previous sentences ($s_i \in d$ with $i < T_d$), which supports the encoding and decoding processes in sequence-to-sequence models. Since the contextual sentences are already known, the encoder can represent them in both the forward ($\overrightarrow{h}_j$) and backward ($\overleftarrow{h}_j$) directions. The results from both recursions can be combined by concatenation. This is referred to as a bidirectional RNN, abbreviated as BiRNN (Schuster and Paliwal, 1997).
For each sentence $s_i \in d$ ($i < T_d$), we annotate the combination of the final states of each RNN direction as a memory vector $m_i = [\overrightarrow{h}_{T_s}; \overleftarrow{h}_1]$. A projection of the annotation $m_i$ can be used as the decoder's initial state $t_0$, such as $t_0 = \tanh(W_s m_{T'_d})$ with $T'_d < T_d$. Here $W_s$ is a weight matrix that projects $m_{T'_d}$ into a vector with the same dimension as $t_0$. In NMT, $\overleftarrow{h}_1$, the backward encoding of a single source sentence, was used to initialize $t_0 = \tanh(W_s \overleftarrow{h}_1)$.
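The construction of a sentence memory vector from the final states of the two RNN directions can be sketched as follows. The toy elementwise `rnn_step` cell and its weights are assumptions used only to keep the example dependency-free; the paper uses GRUs with learned weight matrices.

```python
import math
from typing import List

Vector = List[float]


def rnn_step(h: Vector, x: Vector, w_h: float = 0.5, w_x: float = 1.0) -> Vector:
    """Toy elementwise recurrent cell (hypothetical stand-in for a GRU)."""
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]


def encode_sentence(inputs: List[Vector]) -> Vector:
    """Run the cell forward and backward over the token embeddings and
    concatenate the two final states into one memory vector."""
    dim = len(inputs[0])
    forward = [0.0] * dim
    for x in inputs:              # left to right: ends at the last token
        forward = rnn_step(forward, x)
    backward = [0.0] * dim
    for x in reversed(inputs):    # right to left: ends at the first token
        backward = rnn_step(backward, x)
    return forward + backward     # list concatenation


memory = encode_sentence([[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]])
```

The memory vector is twice as long as a single hidden state, which is why a projection is needed before it can initialise a decoder state of a different size.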

Attention Mechanism
Summarizing all contextual information into one single fixed-length vector becomes a weaker guide for generating the target sentence as the contextual information grows longer. To tackle this problem, an attention mechanism was applied to NMT for learning to align and translate jointly. In this attention-based model, the conditional probability in Equation 4 is defined as:
$$p(y_i \mid y_{<i}, s) = f(y_{i-1}, t_i, c_i), \quad (7)$$
where
$$t_i = g(t_{i-1}, y_{i-1}, c_i) \quad (8)$$
is an RNN hidden state in the decoder part for time $i$ and
$$c_i = \sum_{j=1}^{T_s} \alpha_i^{(j)} h_j \quad (9)$$
is a context vector, a weighted combination of the annotation set memory $(h_1, \ldots, h_{T_s})$ produced by encoding a source sentence $s$ with length $T_s$. The weight $\alpha_i^{(j)}$ of each annotation $h_j$ is computed by
$$\alpha_i^{(j)} = \frac{\exp(e_i^{(j)})}{\sum_{k=1}^{T_s} \exp(e_i^{(k)})}, \quad (10)$$
where $e_i^{(j)}$ is an alignment model and is implemented by a feed-forward network $a$:
$$e_i^{(j)} = a(t_{i-1}, h_j). \quad (11)$$
This method was applied to single-turn conversation modeling. We use this model, with attention over each immediately previous sentence $s_{i-1} \in d$ for generating $s_i$, as a baseline for our experiments. We refer to this model as SIMPLE in the rest of this paper.
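A minimal sketch of single-turn attention: alignment scores are normalised into weights, and the context vector is the weighted combination of the annotations. The softmax normalisation and the dot-product `alignment_score` are assumptions chosen for brevity; the text describes the alignment model as a feed-forward network, which is not reproduced here.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_score(t_prev: Vector, h: Vector) -> float:
    """Hypothetical stand-in for the alignment network: a dot product."""
    return sum(a * b for a, b in zip(t_prev, h))


def attention_context(t_prev: Vector,
                      annotations: List[Vector]) -> Tuple[List[float], Vector]:
    """Return attention weights and the weighted sum of annotations."""
    weights = softmax([alignment_score(t_prev, h) for h in annotations])
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context


weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With a zero decoder state all scores are equal, so the context vector is simply the average of the annotations.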

Multi-turn Reasoning Network with Multiple Type Memories
The attention mechanism described in Equations 8 and 9 is performed in a single-turn feed-forward fashion. However, for complex contexts and complex queries, human readers often revisit the given context in order to perform deeper inference after a first reading. This real-world reading phenomenon motivated the multi-turn reasoning networks for document comprehension (Shen et al., 2016). For dialog generation with a rich given context, our intuition is that attention can likewise be performed over multiple turns, so that the conversation session is better understood and the simple query, which frequently omits an unknown number of context-sensitive words, can be extended to generate a better response. The domain adaptation from document comprehension to dialog generation is feasible by taking the rich context of the speakers as a "document", the current user's query as a "question" and the chatbot's response as an "answer".
However, there are still several major challenges for this domain adaptation. First, a document is frequently written by a single author with one (hidden) personality, one writing style, and one distribution of engagement rates over the topics appearing in that document. This is not the case in a conversation, where at least two speakers are involved, with different (hidden) personalities, personalized speaking styles, and diverse engagement rate distributions over the topics in that conversation session. Second, for document comprehension, the output is frequently a single named entity (Shen et al., 2016), so a single softmax function suffices for this one-shot ranking problem. In contrast, we need an RNN decoder that uses context vectors to generate the target response sentence as a sequence of tokens (words or characters) rather than a single named entity.
We tackle the first challenge by separating the context into multiple types of memories over which attention models are performed. For the second challenge, we replace the simple softmax output layer with a GRU decoder that employs reasoning-attention context vectors.

Separation of contextual information
The SIMPLE model can use multiple turns of context to infer the response by concatenating them during decoding, using a separator symbol such as EOS for end-of-sentence. Earlier work separated the query message from the previous two context messages when conditioning the response; the previous context messages were concatenated and treated as a single message.
In our proposed models, we use more than three turns for the context. We separate the last message (the query) from the previous turns to produce a set of annotations $h$, one per character in the sentence (most Japanese characters, such as Kanji, carry independent semantic meanings, unlike English letters). While encoding the contextual information, we separate the $m_i$ from each speaker into two sets. The motivation is to capture individual characteristics such as personalized topical information and speaking style (Li et al., 2016b). We refer to the set of annotations from the same speaker as one memory. That is, we denote the set for the sentences whose probabilities are being predicted as $M_r$ (response memory, corresponding to the chatbot's side) and the question set as $M_q$ (query memory, corresponding to the user's side). We further apply an RNN on top of the $m_i$ to produce one more set of vectors $M_c$ (context memory):
$$M_c = \{m^{(c)}_1, \ldots, m^{(c)}_{T_c}\}, \quad (12)$$
in which
$$m^{(c)}_i = \mathrm{RNN}(m_i, m^{(c)}_{i-1}), \quad (13)$$
where $T_c$ is the number of turns (sentences) in the conversation. The initial state $m^{(c)}_0$ is a trainable parameter.
We apply an attention mechanism on each of the memories $M_q$, $M_r$, $M_c$ and $M_h$ (of the current query) separately. Refer to Figure 1 for an intuitive illustration of these memories. Following Shen et al. (2016), we choose the projected cosine similarity function as the attention module. The attention score $a^q_{j,i}$ on memory $m^q_i \in M_q$ for an RNN hidden state $t_j$ in the decoder part is computed as follows:
$$a^q_{j,i} = \mathrm{softmax}_{i = 1, \ldots, |M_q|} \cos(W^q_1 m^q_i, W^q_2 t_j), \quad (14)$$
where $W^q_1$ and $W^q_2$ are weight vectors associated with $m^q_i$ and $t_j$, respectively. Consequently, the attention vector on the query sequences is given by:
$$c^q_j = \sum_{i=1}^{|M_q|} a^q_{j,i} m^q_i. \quad (15)$$
Similarly, the attention scores and attention vectors on the other three memories can be derived by replacing $q$ with $r$, $c$, and $h$ in Equations 14 and 15.
We then concatenate the resulting attention vectors into a final context vector $c^M_j$, which is then applied to Equations 8 and 9. Since the dimension of the updated context vector $c^M_j$ is four times larger, its weight matrix $C$ must be enlarged so that its column dimension matches the dimension of $c^M_j$ and $C c^M_j$ still aligns with the dimension of the hidden layer vector $t_j$. More details on the GRU-style definition of $t_j$ using $c_j$ can be found in (Bahdanau et al., 2015). We refer to this model, which integrates multiple types of memories through separate attention mechanisms, as MULTI.
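The attention over four memories followed by concatenation might be sketched as below. Several details are assumptions rather than the paper's definitions: the projections are applied elementwise with weight vectors, the cosine scores are normalised with a softmax, the decoder state and the memory vectors share one dimension, and all names (`attend`, `multi_memory_context`, `params`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def attend(memory: List[Vector], t: Vector, w1: Vector, w2: Vector) -> Vector:
    """Projected cosine attention over one memory (elementwise projections)."""
    proj_t = [w * x for w, x in zip(w2, t)]
    scores = [cosine([w * x for w, x in zip(w1, m)], proj_t) for m in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(a * m[k] for a, m in zip(weights, memory)) for k in range(dim)]


def multi_memory_context(memories: Dict[str, List[Vector]], t: Vector,
                         params: Dict[str, Tuple[Vector, Vector]]) -> Vector:
    """Concatenate the attention vectors of the q, r, c and h memories."""
    context: Vector = []
    for name in ("q", "r", "c", "h"):
        w1, w2 = params[name]
        context.extend(attend(memories[name], t, w1, w2))
    return context


memories = {"q": [[1.0, 0.0], [0.0, 1.0]], "r": [[1.0, 1.0]],
            "c": [[0.5, 0.5], [0.2, 0.1]], "h": [[0.3, 0.7]]}
params = {name: ([1.0, 1.0], [1.0, 1.0]) for name in memories}
c_multi = multi_memory_context(memories, [1.0, 0.0], params)
```

The concatenated vector is four times the size of a single memory vector, which is the reason the downstream weight matrix has to be widened.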
Note that, by separately embedding the conversation context into multiple types of memories according to the number of speakers, we can easily extend this two-speaker scenario to group chatting, in which tens or hundreds of speakers can be engaged. The only extension is to further separate $M_q$ by speaker. The context vector can then be formed by concatenating the attention vectors obtained from reading all the memories. The theoretical benefit is that the chatbot can softly keep track of each individual speaker's topics and then decide how to respond to that speaker. Another extension would be to use a reinforcement learning policy to determine when the chatbot should respond, and to which speaker, in the current group chat. Generally, the number and type of memories can be enlarged in a reasonable way, such as by introducing external knowledge (Yih and Ma, 2016; Ghazvininejad et al., 2017; Han et al., 2015) or by performing sentiment analysis on the "fact memories" to yield emotional memories (Zhou et al., 2017). A detailed description and experimental verification is out of the scope of this paper.
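The speaker-specific splitting generalises naturally from two speakers to many. The following is a small sketch under the assumption that each sentence memory vector comes with a speaker identifier; the function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


def split_memories_by_speaker(speakers: List[str],
                              sentence_memories: List[Vector]) -> Dict[str, List[Vector]]:
    """Group sentence memory vectors into one memory per speaker,
    preserving the time order of the conversation within each memory."""
    memories: Dict[str, List[Vector]] = defaultdict(list)
    for speaker, memory in zip(speakers, sentence_memories):
        memories[speaker].append(memory)
    return dict(memories)


speakers = ["user1", "bot", "user2", "bot", "user1"]
vectors = [[0.1], [0.2], [0.3], [0.4], [0.5]]
per_speaker = split_memories_by_speaker(speakers, vectors)
```

In the two-speaker case this reduces to the query and response memories; with more speakers, each additional key would receive its own attention module.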

Reasoning Neural Dialogue System
As illustrated in Figure 2, we apply a multi-turn reasoning mechanism, following Shen et al. (2016), to the multiple-type annotation memories. This reasoning mechanism replaces the single-turn attention mechanism. We adapt the idea of using a termination state during inference to dynamically determine how many turns to reason. The termination module can decide whether to continue to the next turn (re-reading the four types of memories) after digesting intermediate topical and speaker-specific information, or to terminate the whole inference process when it concludes that the existing information is sufficient to generate the next word in a response. Generally, the idea is to construct a reasoning attention vector that works as a context vector while generating the next word. This idea is contained in the "reasoning box" in Figure 2. Specifically, $y_{j-1}$ stands for the previous word generated from the hidden state $\hat{s}_{j-1}$ in the GRU decoder, and $E_y$ is the embedding matrix. We use a fully connected layer to map from $\hat{s}_{j-1}$ to the initial reasoning hidden state $h^R_1$, since $h^R_m$ should have the same length as each memory vector in $M_{q,r,c,h}$ and the dimension of $\hat{s}_{j-1}$ is smaller than that. Thus, (1) outside the "reasoning box", we use a GRU decoder to yield $\hat{s}_j$ so that the next word $y_j$ can be generated, and (2) inside the "reasoning box", we read the memories to yield the "optimal" contextual vector. The "reasoning box" takes the memories $M_{q,r,c,h}$ and $\hat{s}_{j-1}$ as inputs and finally outputs $c^R_{j,m}$. The number of reasoning turns for yielding the "reasoning attention vectors" ($c^R_j$, which is further indexed by reasoning steps 1 and 2 in Figure 2) during decoding inference is dynamically parameterized by both the contextual memories and the current query, and is generally related to the complexity of the conversation context and the current query.
The training steps follow the general framework described in Equations 8 and 9. For each reasoning hidden state $h^R_m$, the termination probability $o_m$ is estimated by $f_{tg}(h^R_m; \theta_{tg})$, which is
$$o_m = f_{tg}(h^R_m; \theta_{tg}) = \sigma(w_t^{\top} h^R_m + b_t), \quad (16)$$
where $\theta_{tg} = \{w_t, b_t\}$, $w_t$ is a weight vector, $b_t$ is a bias, and $\sigma$ is the sigmoid logistic function. Then, different hidden states $h^R_m$ are first weighted by their termination probabilities $o_m$ and then summed to produce a reasoning-attention context vector $c^R_j$ (using the equations described previously in Section 3.2), which is then used to construct the next reasoning step's hidden state, $h^R_2 = \mathrm{RNN}(h^R_1, c^R_{j,1})$. The final $c^R_{j,m}$ (where $m \geq 1$ is the final reasoning step) is used in Equations 8 and 9 in the same way as the former attention vectors. In our experiments, we instead used the sum $\sum_{k=1}^{m} o_{k+1} c^R_{j,k}$, from $o_2 \times c^R_{j,1}$ to $o_{m+1} \times c^R_{j,m}$, as the final $c^R_{j,m}$ for next word generation.
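The termination gate and the gate-weighted sum of per-step context vectors used in the experiments can be sketched as follows. The function names and the list-based indexing convention are assumptions; only the sigmoid gate with a weight vector and a bias, and the sum of gate-weighted contexts, come from the text.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def termination_probability(h_r: Vector, w_t: Vector, b_t: float) -> float:
    """Gate value for one reasoning hidden state: sigmoid(w_t . h + b_t)."""
    return sigmoid(sum(w * h for w, h in zip(w_t, h_r)) + b_t)


def gated_context_sum(contexts: List[Vector], gates: List[float]) -> Vector:
    """Sum of gate-weighted context vectors. By convention here,
    contexts[k] is the context read at step k+1 and gates[k] is the gate
    computed from the hidden state that follows it (step k+2)."""
    dim = len(contexts[0])
    return [sum(o * c[d] for o, c in zip(gates, contexts)) for d in range(dim)]


gate = termination_probability([0.0, 0.0], [1.0, 1.0], 0.0)
final_context = gated_context_sum([[1.0, 2.0], [3.0, 4.0]], [0.5, 1.0])
```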
During generation of each word in the response, our network performs a response action $r_m$ at the $m$-th step, which implies that the termination gate variables are $o_{1:m} = (o_1 = 0, o_2 = 0, \ldots, o_{m-1} = 0, o_m = 1)$. A stochastic policy $\pi((o_m, r_m) \mid h^R_m, t_j; \theta)$ with parameters $\theta$ gives a distribution over termination actions, i.e., whether to continue reading the conversation context ($M_{q,r,c,h}$) or to stop, and over response actions $r_m$ for predicting the next word if the model decides to stop at the current step. In our experiments, we set a maximum step parameter $T_{max}$ to 5 as a heuristic to avoid too many reasoning steps. We follow Shen et al. (2016) to compute the expected reward and its gradient for one instance. We refer to this model with multi-turn reasoning attentions as REASON.
Figure 3: A sample 10-sentence conversation, translated from Japanese into English.
A: i'm bored
B: boooring
A: isn't it? (^_^)
B: everybody is asleep
A: really? (^_^) my friends are still awake! :P
B: lucky! xD feels lonely with everyone asleep
A: almost everyone is awake (^v^)
B: whyyy?! my friends go early to bed. Or is it me that's late? xD
A: it's us who are late! xD
B: true, very true xD
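The control flow of reading the memories repeatedly until a termination action is sampled, capped at a maximum number of steps, might look like the sketch below. It is a simplification: the callables `read_memories`, `update_state` and `gate` are hypothetical placeholders for the attention reader, the reasoning RNN and the termination gate, and the reward and gradient computation that follows Shen et al. (2016) is not shown.

```python
import random
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def reasoning_loop(h_init: Vector,
                   read_memories: Callable[[Vector], Vector],
                   update_state: Callable[[Vector, Vector], Vector],
                   gate: Callable[[Vector], float],
                   t_max: int = 5,
                   rng: Optional[random.Random] = None) -> Tuple[List[Vector], Vector]:
    """Read, update, then sample whether to stop; always stop at t_max."""
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    rng = rng if rng is not None else random.Random(0)
    h = h_init
    contexts: List[Vector] = []
    for step in range(1, t_max + 1):
        c = read_memories(h)          # attention over the memories
        contexts.append(c)
        h = update_state(h, c)        # next reasoning hidden state
        if step == t_max or rng.random() < gate(h):
            break                     # termination action sampled
    return contexts, h


def toy_read(h: Vector) -> Vector:
    return [x + 1.0 for x in h]


def toy_update(h: Vector, c: Vector) -> Vector:
    return [0.5 * (a + b) for a, b in zip(h, c)]


stop_early, _ = reasoning_loop([0.0], toy_read, toy_update, lambda h: 1.0)
run_to_cap, _ = reasoning_loop([0.0], toy_read, toy_update, lambda h: 0.0)
```

A gate that always outputs 1 stops after a single read, while a gate that always outputs 0 runs until the cap of five steps.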

Experiments
In our experiments, we used a dataset consisting of Japanese Twitter conversations. Each conversation contains 10 sentences from two real-world alternating speakers. Given its origin, the dataset is quite noisy, containing misspelled words, slang and kaomoji (multi-character sequences of facial emoticons) among meaningful words and characters. Preliminary experiments with a word-based approach resulted in an overly large vocabulary and too many word-breaking errors, so we instead used a character-based approach. Figure 3 shows a sample 10-sentence conversation in which the original Japanese sentences were translated into English while keeping similar spelling patterns (such as boooring for boring and whyyy for why).
We kept the conversations in which all sentences were no more than 20 characters long. This filtering strategy resulted in a dataset of 254K conversations, from which 100 (1K sentences) were taken out for testing and another 100 for validation and hyper-parameter tuning. The training set contains 6,214 unique characters, which are used as our vocabulary with the addition of two special symbols, UNK (out-of-vocabulary unknown word) and EOS (end-of-sentence). Table 1 shows major statistics of the dataset. Training minimizes the negative log-likelihood (NLL) per character on the nine sentences $s_2, \ldots, s_{10}$ of each conversation. One configuration choice in MULTI and REASON is that we use the reference contexts (instead of previously generated sentences) to generate the current sentence. That is, when generating $s_i$, we use the gold contextual sentences from $s_1$ to $s_{i-1}$. The three systems were each trained for 3 epochs (10,000 iterations) with the AdaDelta (Zeiler, 2012) optimizer. The character embedding matrix was shared by the encoder and decoder parts. All the hidden layers, in the encoding/decoding parts and the attention models, were of size 200, and the character embeddings were of size 100. The recurrent units that we used were GRUs. The gradients were clipped at a maximum gradient norm of 1. The reasoning module's maximum number of steps $T_{max}$ was set to 5. The data was iterated over in mini-batches of fewer than 1,500 symbols each.
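The length filter and the character vocabulary with the two special symbols can be illustrated with a short sketch. The function names, the symbol spellings `<UNK>` and `<EOS>`, and the index assignment are assumptions made for this example only.

```python
from typing import Dict, List

Conversation = List[str]


def keep_conversation(conv: Conversation, max_chars: int = 20,
                      num_sentences: int = 10) -> bool:
    """Keep 10-sentence conversations whose sentences all have
    at most max_chars characters."""
    return len(conv) == num_sentences and all(len(s) <= max_chars for s in conv)


def build_vocab(conversations: List[Conversation]) -> Dict[str, int]:
    """Character vocabulary plus two special symbols (hypothetical spellings)."""
    chars = sorted({ch for conv in conversations for s in conv for ch in s})
    vocab = {"<UNK>": 0, "<EOS>": 1}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def encode(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Map characters to ids, unknown characters to UNK, and append EOS."""
    return [vocab.get(ch, vocab["<UNK>"]) for ch in sentence] + [vocab["<EOS>"]]


vocab = build_vocab([["ab", "ba"]])
ids = encode("abz", vocab)
```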
We initialized the recurrent weight matrices in GRUs as random orthogonal matrices. Unless otherwise mentioned, all the elements of the 0-indexed vectors and all bias vectors were initialized to zero. All other weight matrices were initialized by sampling from a Gaussian distribution with mean 0 and variance 0.01.
Figure 4 shows the progression of the NLLs per character during training. The validation costs began converging in the third epoch for the three models. The plot roughly shows lower cost for more complex models.
Better correlation with human evaluation has been reported for BLEU-2 than for BLEU-4. We thus report both of these scores for automatic evaluation and comparison. The character-level BLEU-4 and BLEU-2 scores for the trained models are reported in Table 2. The REASON model achieved consistently better BLEU-2 and BLEU-4 scores on the three datasets. MULTI performed slightly better than SIMPLE on the validation set, yet its performance is less stable than that of REASON. Figure 4 also shows that (1) the final training costs of SIMPLE and MULTI are quite close to each other at iteration 10,000; (2) there is a big margin between the final training cost of REASON and that of SIMPLE or MULTI; and (3) the validation costs follow the order SIMPLE > MULTI > REASON.
Figure 5 illustrates an English translation of a conversation and the responses suggested by each of the described models. This conversation is extracted from the test set. The three responses are different from the reference response, but the one from REASON looks the most consistent with the given context. The response from MULTI contradicts the context of speaker B, who said "Not at all" in an earlier sentence.

Analysis
Since BLEU has been shown not to correlate well with human judgments, we asked three human evaluators to each examine 900 responses from each of the models given their reference contexts. The evaluators were asked to judge (1) whether a response is acceptable and (2) whether a response is better than the other two responses. A summary of this evaluation is displayed in Table 3. The acceptable column refers to the percentage of responses that were considered acceptable by at least two of the human evaluators, while the best-of-three column refers to the percentage of times that each model's response was considered by at least two evaluators to be better than the other two's, among the contexts that had at least one acceptable response. The last two columns make one-to-one comparisons. In 18% of the contexts, none of the models produced an acceptable response. This human evaluation shows that more complex models are more likely to produce acceptable responses. The MULTI and REASON models differ only in the attention mechanism of multi-turn reasoning. The reasoning module performed better than single-turn attention 58% of the time.
Table 3: Human evaluation of the responses generated by the three models. † Percentage over the conversations that had at least one response accepted. ‡ From the cases where either of the two compared models was acceptable. > = "better than".
Table 4 contains the character-level distinct-n (Li et al., 2016a) metrics for n-grams where 1 ≤ n ≤ 5. This metric measures the number of distinct n-grams divided by the total number of n-grams in the generated responses. The displayed results are computed on the concatenation of all the responses to the test-set contexts. The Reference column was computed on the reference responses and represents the optimal human-like ratio.
SIMPLE performed the best at uni-gram diversity. For n-grams with n ≥ 2, REASON produced the most diverse outputs. While the results for REASON were consistently better than those of the other two models, the results for MULTI were not always better than those of SIMPLE. This indicates that MULTI does not always benefit from the augmented context without the multi-turn reasoning attentions.
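The character-level distinct-n ratio described above, the number of distinct n-grams divided by the total number of n-grams over the concatenated responses, can be computed as in this sketch. The function name and the choice to return 0.0 when no n-gram exists are assumptions.

```python
from typing import List


def distinct_n(responses: List[str], n: int) -> float:
    """Distinct character n-grams divided by total character n-grams,
    computed on the concatenation of all responses."""
    text = "".join(responses)
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


unigram_ratio = distinct_n(["abab"], 1)
bigram_ratio = distinct_n(["ab", "ab"], 2)
```

For the string "abab" there are two distinct characters among four, and two distinct bigrams ("ab", "ba") among three, so repeated patterns lower the ratio.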

Conclusions
We have presented a novel approach to multi-turn conversation modeling. Our approach uses multiple explicitly separated memories to represent rich conversational contexts. We also presented multi-turn reasoning attentions to integrate various annotation memories. We ran experiments on three different models with and without the introduced approaches and measured their performance using automatic metrics and human evaluation. Experimental results verified that the increased contexts help produce more acceptable and diverse responses. Driven by the depth of the reasoning attention, the diversity of the responses is significantly improved. We argue that the reasoning attention mechanism helps integrate the multiple pieces of information, as it can combine them in a more complex way than a simple weighted sum. We further observed that as the accuracy of the conversation model improves, the diversity of the generated responses increases.
The proposed approach of multi-turn reasoning over multiple memory attention networks is presented in a general framework that allows the inclusion of memories of multiple resources and types. Applying the approach to group chatting with more than two speakers, and reasoning over emotion embeddings or knowledge vectors drawn from an external knowledge base/graph, are our future directions.