A Neural Network Approach to Context-Sensitive Generation of Conversational Responses

We present a novel response generation system that can be trained end to end on large quantities of unstructured Twitter conversations. A neural network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances. Our dynamic-context generative models show consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.


Introduction
Until recently, the goal of training open-domain conversational systems that emulate human conversation has seemed elusive. However, the vast quantities of conversational exchanges now available on social media websites such as Twitter and Reddit raise the prospect of building data-driven models that can begin to communicate conversationally. The work of Ritter et al. (2011), for example, demonstrates that a response generation system can be constructed from Twitter conversations using statistical machine translation techniques, where a status post by a Twitter user is "translated" into a plausible-looking response. However, an approach such as that presented in Ritter et al. (2011) does not address the challenge of generating responses that are sensitive to the context of the conversation. Broadly speaking, context may be linguistic or involve grounding in the physical or virtual world, but we here focus on linguistic context. The ability to take into account previous utterances is key to building dialog systems that can keep conversations active and engaging. Figure 1 illustrates a typical Twitter dialog where the contextual information is crucial: the phrase "good luck" is plainly motivated by the reference to "your game" in the first utterance. In the MT model, such contextual sensitivity is difficult to capture; moreover, naive injection of context information would entail unmanageable growth of the phrase table at the cost of increased sparsity, and skew towards rarely-seen context pairs. In most statistical approaches to machine translation, phrase pairs do not share statistical weights regardless of their intrinsic semantic commonality.
We propose to address the challenge of context-sensitive response generation by using continuous representations or embeddings of words and phrases.

Related Work
Our work naturally lies in the path opened by Ritter et al. (2011), but we generalize their approach by exploiting information from a larger context. Ritter et al. and our work represent a radical paradigm shift from other work in dialog. More traditional dialog systems typically tease apart dialog management (Young, 2002) from response generation (Stent and Bangalore, 2014), while our holistic approach can be considered a first attempt to accomplish both tasks jointly. While there are previous uses of machine learning for response generation (Walker et al., 2003), dialog state tracking (Young et al., 2010), and user modeling (Georgila et al., 2006), many components of typical dialog systems remain hand-coded: in particular, the labels and attributes defining dialog states. In contrast, the dialog state in our neural network model is completely latent and directly optimized towards end-to-end performance. In this sense, we believe the framework of this paper is a significant milestone towards more data-driven and less hand-coded dialog processing.
Continuous representations of words and phrases estimated by neural network models have been applied to a variety of tasks, including Information Retrieval (IR) (Huang et al., 2013; Shen et al., 2014), Online Recommendation (Gao et al., 2014b), Machine Translation (MT) (Auli et al., 2013; Cho et al., 2014; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014), and Language Modeling (LM) (Bengio et al., 2003; Collobert and Weston, 2008). Gao et al. (2014a) successfully use an embedding model to refine the estimation of rare phrase-translation probabilities, which is traditionally affected by sparsity problems. Robustness to sparsity is a crucial property of our method, as it allows us to capture context information while avoiding unmanageable growth of model parameters.
Our work extends the Recurrent Neural Network Language Model (RLM) of Mikolov et al. (2010), which uses continuous representations to estimate a probability function over natural language sentences. We propose a set of conditional RLMs where contextual information (i.e., past utterances) is encoded in a continuous context vector to help generate the response. Our models differ from most previous work in the way the context vector is constructed. For example, Mikolov and Zweig (2012) and Auli et al. (2013) use a pre-trained topic model. In our models, the context vector is learned along with the conditional RLM that generates the response. Additionally, the learned context encodings do not exclusively capture contentful words. Indeed, even "stop words" can carry discriminative power in this task; for example, all words in the utterance "how are you?" are commonly characterized as stop words, yet this is a contentful dialog utterance.

Recurrent Language Model
We give a brief overview of the Recurrent Language Model (RLM) (Mikolov et al., 2010) architecture that our models extend. An RLM is a generative model of sentences, i.e., given a sentence $s = s_1, \ldots, s_T$, it estimates:
$$p(s) = \prod_{t=1}^{T} p(s_t \mid s_1, \ldots, s_{t-1}). \tag{1}$$
The model architecture is parameterized by three weight matrices, $\Theta_{RNN} = \langle W_{in}, W_{out}, W_{hh} \rangle$: an input matrix $W_{in}$, a recurrent matrix $W_{hh}$ and an output matrix $W_{out}$, which are usually initialized randomly.
The rows of the input matrix $W_{in} \in \mathbb{R}^{V \times K}$ contain the $K$-dimensional embeddings for each word in the language vocabulary of size $V$. Let us denote by $s_t$ both the vocabulary token and its one-hot representation, i.e., a zero vector of dimensionality $V$ with a 1 corresponding to the index of the $s_t$ token. The embedding for $s_t$ is then obtained by $s_t W_{in}$. The recurrent matrix $W_{hh} \in \mathbb{R}^{K \times K}$ keeps a history of the subsequence that has already been processed. The output matrix $W_{out} \in \mathbb{R}^{K \times V}$ projects the hidden state $h_t$ into the output layer $o_t$, which has an entry for each word in the vocabulary. This value is used to generate a probability distribution for the next word in the sequence. Specifically, the forward pass proceeds with the following recurrence, for $t = 1, \ldots, T$:
$$h_t = \sigma(s_t W_{in} + h_{t-1} W_{hh}), \qquad o_t = h_t W_{out}, \tag{2}$$
where $\sigma$ is a non-linear function applied elementwise, in our case the logistic sigmoid. The recurrence is seeded by setting $h_0 = 0$, the zero vector. The probability distribution over the next word given the previous history is obtained by applying the softmax activation function:
$$P(s_{t+1} = w \mid s_1, \ldots, s_t) = \frac{\exp(o_{tw})}{\sum_{v=1}^{V} \exp(o_{tv})}. \tag{3}$$
The RLM is trained to minimize the negative log-likelihood of the training sentence $s$:
$$L(s) = -\sum_{t=1}^{T} \log P(s_t \mid s_1, \ldots, s_{t-1}). \tag{4}$$
The recurrence is unrolled backwards in time using the back-propagation through time (BPTT) algorithm (Rumelhart et al., 1988), and gradients are accumulated over multiple time-steps.
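The forward pass and loss described in this section can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the hidden state is the sigmoid of the current word embedding plus the previous state times the recurrent matrix, with the softmax over the output layer giving the next-word distribution. The function names (`rlm_nll`, `vec_mat`) are hypothetical, and treating the first token as a given start symbol is an assumption made here so that every scored token has a non-empty history.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vec_mat(v, M):
    # Row vector v (length n) times matrix M (n rows, m columns).
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(o):
    top = max(o)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

def rlm_nll(tokens, W_in, W_hh, W_out):
    """Negative log-likelihood of tokens[1:] given tokens[0] (assumed start symbol)."""
    h = [0.0] * len(W_hh)  # h_0 = 0
    nll = 0.0
    for t in range(len(tokens) - 1):
        emb = W_in[tokens[t]]          # one-hot vector times W_in is a row lookup
        rec = vec_mat(h, W_hh)         # contribution of the previous hidden state
        h = [sigmoid(e + r) for e, r in zip(emb, rec)]
        probs = softmax(vec_mat(h, W_out))
        nll -= math.log(probs[tokens[t + 1]])
    return nll
```

With all-zero weights the output layer is constant, so each prediction is uniform over the vocabulary and the loss reduces to the number of scored tokens times log V, which is a convenient sanity check.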

Context-Sensitive Models
We distinguish three linguistic entities in a conversation between two users A and B: the context c, the message m and the response r. The context c represents a sequence of past dialog exchanges of any length; then B emits a message m to which A reacts by formulating its response r (see Figure 1). We use three context-based generation models to estimate a generation model of the response r, $r = r_1, \ldots, r_T$, conditioned on past information c and m:
$$p(r \mid c, m) = \prod_{t=1}^{T} p(r_t \mid r_1, \ldots, r_{t-1}, c, m). \tag{5}$$
These three models differ in the manner in which they compose the context-message pair (c, m).

Tripled Language Model
In our first model, dubbed RLMT, we straightforwardly concatenate each utterance c, m, r into a single sentence s and train the RLM to minimize L(s). Given c and m, we compute the probability of the response as follows: we perform the forward propagation over the known utterances c and m to obtain a hidden state encoding useful information about previous utterances. Subsequently, we compute the likelihood of the response from that hidden state. An issue with this simple approach is that the concatenated sentence s will be very long on average, especially if the context comprises multiple utterances. Modelling such long-range dependencies with an RLM is difficult and is still considered an open problem (Pascanu et al., 2013). We will consider RLMT as an additional context-sensitive baseline for the models we present next.

Dynamic-Context Generative Model I
The above limitation of RLMT can be addressed by strengthening the context bias. In our second model (DCGM-I), the context and the message are encoded into a fixed-length vector representation that is used by the RLM to decode the response. This is illustrated in Figure 3 (left). First, we consider c and m as a single sentence and compute a single bag-of-words representation $b_{cm} \in \mathbb{R}^{V}$. Then, $b_{cm}$ is provided as input to a multilayered non-linear forward architecture that produces a fixed-length representation that is used to bias the recurrent state of the decoder RLM. At training time, both the context encoder and the RLM decoder are learned so as to minimize the negative log-probability of the generated response.
The parameters of the model are $\Theta_{DCGM\text{-}I} = \langle W_{in}, W_{hh}, W_{out}, \{W_f^{\ell}\}_{\ell=1}^{L} \rangle$, where $\{W_f^{\ell}\}_{\ell=1}^{L}$ are the weights for the $L$ layers of the feed-forward context network. The fixed-length context vector $k_L$ is obtained by forward propagation of the network:
$$k_1 = b_{cm} W_f^{1}, \qquad k_{\ell} = \sigma(k_{\ell-1} W_f^{\ell}) \quad \text{for } \ell = 2, \ldots, L. \tag{6}$$
The rows of $W_f^{1}$ contain the embeddings of the vocabulary. These are different from those employed in the RLM and play a crucial role in promoting the specialization of the context encoder to a distinct task. The hidden layer of the decoder RLM takes the following form:
$$h_t = \sigma(h_{t-1} W_{hh} + k_L + s_t W_{in}), \qquad o_t = h_t W_{out}, \qquad p(s_{t+1} \mid s_1, \ldots, s_t, c, m) = \mathrm{softmax}(o_t). \tag{7}$$
This model conditions on the previous utterances via biasing the hidden layer state on the context representation $k_L$. Note that the context representation does not change through time. This is useful because: (a) it forces the context encoder to produce a representation general enough to be useful for generating all words in the response and (b) it helps the RLM decoder to remember context information when generating long responses.
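The encoder and the biased decoder step of DCGM-I can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes the first encoder layer is a linear map of the bag-of-words vector, that later layers apply a sigmoid, and that the context vector is simply added inside the decoder's sigmoid at every time step. The names `encode_context` and `decoder_step` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vec_mat(v, M):
    # Row vector v (length n) times matrix M (n rows, m columns).
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def encode_context(b_cm, W_f):
    """Map a bag-of-words vector to the fixed-length context vector k_L.

    W_f is a list of L weight matrices; the first layer is assumed linear.
    """
    k = vec_mat(b_cm, W_f[0])
    for W in W_f[1:]:
        k = [sigmoid(x) for x in vec_mat(k, W)]
    return k

def decoder_step(h_prev, token, k_L, W_in, W_hh):
    """One decoder update: the same k_L biases the hidden state at every step."""
    rec = vec_mat(h_prev, W_hh)
    emb = W_in[token]  # one-hot vector times W_in is a row lookup
    return [sigmoid(r + k + e) for r, k, e in zip(rec, k_L, emb)]
```

Because `k_L` is computed once and passed unchanged to every call of `decoder_step`, the context acts as a time-invariant bias, which matches the observation that the context representation does not change through time.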

Dynamic-Context Generative Model II
Because DCGM-I does not distinguish between c and m, that model has the propensity to underestimate the strong dependency that holds between m and r.
Our third model (DCGM-II) addresses this issue by concatenating the two linear mappings of the bag-of-words representations $b_c$ and $b_m$ in the input layer of the feed-forward network representing c and m (see Figure 3 right). Concatenating continuous representations prior to deep architectures is a common strategy to obtain order-sensitive representations (Bengio et al., 2003; Devlin et al., 2014).
The forward equations for the context encoder are:
$$k_1 = [b_c W_f^{1}, b_m W_f^{1}], \qquad k_{\ell} = \sigma(k_{\ell-1} W_f^{\ell}) \quad \text{for } \ell = 2, \ldots, L, \tag{8}$$
where $[x, y]$ denotes the concatenation of the vectors $x$ and $y$. In DCGM-II, the bias on the recurrent hidden state and the probability distribution over the next token are computed as described in Eq. 7.
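The difference from DCGM-I lies only in the input layer of the encoder, which can be sketched as below. This is a hedged illustration: it assumes that context and message are embedded with the same first-layer matrix and then concatenated, and that subsequent layers are sigmoid layers. The helper names `bag_of_words` and `encode_dcgm2` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vec_mat(v, M):
    # Row vector v (length n) times matrix M (n rows, m columns).
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def bag_of_words(token_ids, vocab_size):
    # Count vector over the vocabulary; word order inside an utterance is discarded.
    b = [0.0] * vocab_size
    for i in token_ids:
        b[i] += 1.0
    return b

def encode_dcgm2(b_c, b_m, W_f):
    """Embed context and message separately, concatenate, then apply deeper layers.

    Sharing W_f[0] between context and message is an assumption of this sketch.
    """
    k = vec_mat(b_c, W_f[0]) + vec_mat(b_m, W_f[0])  # list concatenation [x, y]
    for W in W_f[1:]:
        k = [sigmoid(x) for x in vec_mat(k, W)]
    return k
```

Swapping the context and message arguments changes the first-layer output, so the encoder can tell which utterance is the message, whereas a single bag of words over both utterances cannot.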
Experimental Setting

Dataset Construction
For computational efficiency and to alleviate the burden of human evaluators, we restrict the context sequence c to a single sentence. Hence, our dataset is composed of "triples" $\tau \equiv (c_\tau, m_\tau, r_\tau)$ consisting of three sentences. We mined 127M context-message-response triples from the Twitter FireHose, covering the 3-month period June 2012 through August 2012.
Only those triples where context and response were generated by the same user were extracted. To minimize noise, we selected triples that contained at least one frequent bigram that appeared more than 3 times in the corpus. This produced a corpus of 29M Twitter triples. Additionally, we hired crowdsourced raters to evaluate approximately 33K candidate triples. Judgments on a 5-point scale were obtained from 3 raters apiece. This yielded a set of 4232 triples with a mean score of 4 or better that was then randomly binned into a tuning set of 2118 triples and a test set of 2114 triples. The mean length of responses in these sets was approximately 11.5 tokens, after cleanup (e.g., stripping of emoticons), including punctuation.

Automatic Evaluation
We evaluate all systems using BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), and supplement these results with more targeted human pairwise comparisons in Section 6.3. A major challenge in using these automated metrics for response generation is that the set of reasonable responses in our task is potentially vast and extremely diverse. The dataset construction method just described yields only a single reference for each status. Accordingly, we extend the set of references using an IR approach to mine potential responses, after which we have human judges rate their appropriateness. As we see in Section 6.3, it turns out that by optimizing systems towards BLEU using mined multi-references, BLEU rankings align well with human judgments. This lays groundwork for interesting future correlation studies.

Multi-reference extraction
We use the following algorithm to better cover the space of reasonable responses. Given a test triple $\tau \equiv (c_\tau, m_\tau, r_\tau)$, our goal is to mine other responses $\{r_{\tilde\tau}\}$ that fit the context and message pair $(c_\tau, m_\tau)$. To this end, we first select a set of 15 candidate triples $\{\tilde\tau\}$ using an IR system. The IR system is calibrated in order to select candidate triples $\tilde\tau$ for which both the message $m_{\tilde\tau}$ and the response $r_{\tilde\tau}$ are similar to the original message $m_\tau$ and response $r_\tau$. Formally, the score of a candidate triple is:
$$s(\tilde\tau, \tau) = d(m_{\tilde\tau}, m_\tau)\,\big(\alpha\, d(r_{\tilde\tau}, r_\tau) + (1 - \alpha)\,\epsilon\big), \tag{9}$$
where $d$ is the bag-of-words BM25 similarity function (Robertson et al., 1995), $\alpha$ controls the impact of the similarity between the responses and $\epsilon$ is a smoothing factor that avoids zero scores for candidate responses that do not share any words with the reference response. We found that this simple formula provided references that were both diverse and plausible. Given a set of candidate triples $\{\tilde\tau\}$, human evaluators are asked to rate the quality of the response within the new triples $\{(c_\tau, m_\tau, r_{\tilde\tau})\}$. After human evaluation, we retain the references for which the score is 4 or better on a 5-point scale, resulting in 3.58 references per example on average (Table 1). The average lengths for the responses in the multi-reference tuning and test sets are 8.75 and 8.13 tokens respectively.
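The candidate scoring step can be sketched as a small function. This is a rough sketch under assumptions: the score is taken to be the message similarity multiplied by a smoothed response similarity, the similarity function is passed in as a parameter, and a simple word-overlap count stands in for BM25, which the section names but does not parameterize. The default values of `alpha` and `eps` are hypothetical placeholders, not values from the paper.

```python
def overlap(a, b):
    # Stand-in similarity (NOT BM25): number of distinct shared words.
    return float(len(set(a) & set(b)))

def candidate_score(d, m_cand, r_cand, m_ref, r_ref, alpha=0.5, eps=0.01):
    """Score a candidate triple against a test triple.

    d      : similarity function over token lists (BM25 in the paper)
    alpha  : weight of the response similarity (hypothetical default)
    eps    : smoothing term that keeps the score non-zero when the
             responses share no words (hypothetical default)
    """
    return d(m_cand, m_ref) * (alpha * d(r_cand, r_ref) + (1.0 - alpha) * eps)

def top_candidates(d, candidates, m_ref, r_ref, n=15):
    # candidates: list of (message_tokens, response_tokens) pairs.
    ranked = sorted(candidates,
                    key=lambda mr: candidate_score(d, mr[0], mr[1], m_ref, r_ref),
                    reverse=True)
    return ranked[:n]
```

The smoothing term matters for diversity: a candidate whose message matches well but whose response shares no words with the reference still receives a small positive score and can therefore be proposed to the human raters.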

Feature Sets
The response generation systems evaluated in this paper are parameterized as log-linear models in a framework typical of statistical machine translation (Och and Ney, 2004). These log-linear models comprise the following feature sets:
MT: MT features are derived from a large response generation system built along the lines of Ritter et al. (2011). We also included MT decoder features specifically motivated by the response generation task: Jaccard distance between source and target phrase, Fisher's exact probability, and a score relating the lengths of source and target phrases.
IR: We also use an IR feature built from an index of triples, whose implementation roughly matches the IR status approach described in Ritter et al. (2011): for a test triple $\tau$, we choose $r_{\tilde\tau}$ as the candidate response iff $\tilde\tau = \arg\max_{\tau'} d(m_{\tau'}, m_\tau)$.
CMM: Neither MT nor IR traditionally takes into account contextual information. Therefore, we take into consideration context and message matches (CMM), i.e., exact matches between c, m and r. We define 8 features as the [1-4]-gram matches between c and the candidate reply r and the [1-4]-gram matches between m and the candidate reply r. These exact matches help capture and promote contextual information in the replies.
RLMT, DCGM-I, DCGM-II: We consider the RLM trained on the concatenated triples, denoted as RLMT (Section 4.1), to be a context-sensitive RLM baseline. Each neural network model contributes an additional feature corresponding to the likelihood of the candidate response given context and message.
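The eight CMM features can be illustrated with a short sketch. The section does not say whether a feature is a match count or a binary indicator, so counting matching n-gram occurrences in the reply is an assumption here, as are the function names and the whitespace tokenization.

```python
def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cmm_features(c, m, r):
    """Return 8 features: 1- to 4-gram matches of r against c, then against m.

    Each feature counts the n-grams of the reply r that also occur in the
    source utterance (counting, rather than a 0/1 indicator, is an assumption).
    """
    feats = []
    for source in (c, m):
        for n in range(1, 5):
            source_set = set(ngrams(source, n))
            feats.append(sum(1 for g in ngrams(r, n) if g in source_set))
    return feats

m = "good night best friend i love you".split()
r = "i love you too good night best friend".split()
features = cmm_features([], m, r)
```

For this message and reply pair, which mirrors the example discussed with Figure 4, the reply shares 1-grams through 4-grams with the message, so all four message-side features are non-zero while the context-side features are zero for the empty context.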

Model Training
The proposed models are trained on a 4M subset of the triple data. The vocabulary consists of the most frequent V = 50K words. In order to speed up training, we use the Noise-Contrastive Estimation (NCE) loss, which avoids repeated summations over V by approximating the probability of the target word (Gutmann and Hyvärinen, 2010). Parameter optimization is done using Adagrad (Duchi et al., 2011) with a mini-batch size of 100 and a learning rate α = 0.1, which we found to work well on held-out data. In order to stabilize learning, we clip the gradients to a fixed range [−10, 10], as suggested in Mikolov et al. (2010). All the parameters of the neural models are sampled from a normal distribution $\mathcal{N}(0, 0.01)$ while the recurrent weight $W_{hh}$ is initialized as a random orthogonal matrix and scaled by 0.01. To prevent over-fitting, we evaluate performance on a held-out set during training and stop when the objective increases. The size of the RLM hidden layer is set to K = 512, and the context encoder is a 512, 256, 512 multilayer network. The bottleneck in the middle compresses context information that leads to similar responses and thus achieves better generalization. The last layer embeds the context vector into the hidden space of the decoder RLM.
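The optimization recipe (gradient clipping to [−10, 10] followed by an Adagrad update with learning rate 0.1) can be sketched for a flat list of parameters. This is a generic sketch of the standard Adagrad rule, not the authors' training code; the small constant `eps` in the denominator is a common numerical-stability convention and an assumption here, and the function names are hypothetical.

```python
import math

def clip(g, lo=-10.0, hi=10.0):
    # Elementwise clipping of a gradient value to a fixed range.
    return max(lo, min(hi, g))

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """In-place Adagrad update on flat lists of floats.

    accum holds the running sum of squared (clipped) gradients per parameter,
    so frequently updated parameters receive progressively smaller steps.
    """
    for i, g in enumerate(grads):
        g = clip(g)
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

params, accum = [1.0, -2.0], [0.0, 0.0]
adagrad_step(params, [100.0, -0.5], accum)
```

On the first update the step size is close to the learning rate regardless of gradient magnitude, because the gradient is divided by its own absolute value; clipping therefore mainly affects how quickly the accumulator grows on later steps.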

Rescoring Setup
We evaluate the proposed models by rescoring the n-best candidate responses obtained using the MT phrase-based decoder and the IR system. In contrast to MT, the candidate responses provided by IR have been created by humans and are less affected by fluency issues. The different n-best lists will provide a comprehensive testbed for our experiments. First, we augment the n-best list of the tuning set with the scores of the model of interest. Then, we run an iteration of MERT (Och, 2003) to estimate the log-linear weights of the new features. At test time, we rescore the test n-best list with the new weights.
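The test-time rescoring step amounts to a weighted sum of features per candidate followed by an argmax. The sketch below assumes the weights have already been tuned (MERT itself is not shown), and the feature layout, weight values, and function names are hypothetical.

```python
def loglinear_score(weights, feats):
    # Dot product of log-linear weights and a candidate's feature vector.
    return sum(w * f for w, f in zip(weights, feats))

def rescore(nbest, weights):
    """Return the highest-scoring response from an n-best list.

    nbest is a list of (response, feature_vector) pairs; a neural model's
    log-likelihood would simply be appended as one more feature.
    """
    return max(nbest, key=lambda item: loglinear_score(weights, item[1]))[0]

# Hypothetical two-feature example: [baseline score, neural log-likelihood feature].
nbest = [("ok", [1.0, 0.0]), ("good luck", [0.5, 2.0])]
best = rescore(nbest, [1.0, 1.0])
```

Changing the weight on the added feature can change which candidate wins, which is why the weights are re-estimated on the tuning set each time a new feature is introduced.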

Lower and Upper Bounds
Table 2 shows the expected upper and lower bounds for this task as suggested by BLEU scores for human responses and a random response baseline. The RANDOM system comprises responses randomly extracted from the triples corpus. HUMAN is computed by choosing one reference amongst the multi-reference set for each context-status pair. Although the scores are lower than those usually reported in SMT tasks, the ranking of the three systems is unambiguous.

BLEU and METEOR
The results of automatic evaluation using BLEU and METEOR are presented in Table 3, where some broad patterns emerge. First, both metrics indicate that a phrase-based MT decoder outperforms a purely IR approach. Second, adding CMM features to the baseline systems helps. Third, the neural network models contribute measurably to improvement: RLMT and DCGM models outperform baselines, and DCGM models provide more consistent gains than RLMT.
MT vs. IR: BLEU and METEOR scores indicate that the phrase-based MT decoder outperforms a purely IR approach, despite the fact that IR proposes fluent human-generated responses. This may be because the IR model only loosely captures important patterns between message and response: it ranks candidate responses solely by the similarity of their message with the message of the test triple (§5.3). As a result, the top-ranked response is likely to drift from the purpose of the original conversation. The MT approach, by contrast, more directly models statistical patterns between message and response.
CMM: MT+CMM, totaling 17 features (9 from MT + 8 CMM), improves by 0.38 BLEU points, a 9.5% relative improvement, over the baseline MT model. IR+CMM, with 10 features (IR + word penalty + 8 CMM), benefits even more, gaining 1.8 BLEU points and 1.5 METEOR points over the IR baseline. Figure 4 (a) and (b) plot the magnitude of the learned CMM feature weights for MT+CMM and IR+CMM. CMM features help in both these hypothesis spaces and especially on the IR n-best list. Figure 4 (b) supports the hypothesis formulated in the previous paragraph: since IR solely captures inter-message similarities, the matches between message and response are important, while context matches help in providing additional gains. The phrase-based statistical patterns captured by the MT system do a good job in explaining away 1-gram and 2-gram message matches (Figure 4 (a)) and the performance gain mainly comes from context matches. On the other hand, we observe that 4-gram matches may be important in selecting appropriate responses. Inspection of the tuning set reveals instances where responses contain long subsequences of their corresponding messages, e.g., m = "good night best friend, I love you", r = "I love you too, good night best friend". Although infrequent, such higher-order n-gram matches, when they occur, may provide a more robust signal of the quality of the response than 1- and 2-gram matches, given the highly conversational nature of our dataset.
RLMT and DCGM: Both RLMT and DCGM models outperform their respective MT and IR baselines. Both models also exhibit similar performance and show improvements over the MT+CMM models, albeit using a lower dimensional feature space. We believe that their similar performance is due to the limited diversity of the MT n-best list together with gains in fluency stemming from the strong language model provided by the RLM. In the case of IR models, on the other hand, there is more headroom for improvement and fluency is already guaranteed. Any gains must come from context and message matches. Hence, RLMT underperforms with respect to both DCGM and IR+CMM. The DCGM models appear to have better capacity to retain contextual information and thus achieve similar performance to IR+CMM despite their lack of exact n-gram match information.
In the present experimental setting, no striking performance difference can be observed between the two versions of the DCGM architecture. If multiple sequences were used as context, we expect that the DCGM-II model would likely benefit more owing to the separate encoding of message and context.
DCGM+CMM: We also investigated whether mixing exact CMM n-gram overlap with semantic information encoded by the DCGM models can bring additional gains. DCGM-{I-II}+CMM systems, each totaling 10 features, show increases of up to 0.48 BLEU points over MT+CMM and up to 0.88 BLEU over the model based on Ritter et al. (2011). METEOR improvements similarly align with BLEU improvements both for MT and IR lists. We take this as evidence that CMM exact matches and DCGM semantic matches interact positively, a finding that comports with Gao et al. (2014a), who show that semantic relationships mined through phrase embeddings correlate positively with classic co-occurrence-based estimations. Analysis of CMM feature weights in Figure 4 (c) and (d) suggests that 1-gram matches are explained away by the DCGM model, but that higher order matches are important. It appears that DCGM models might be improved by preserving word-order information in context and message encodings.

Human Evaluation
Human evaluation was conducted using crowdsourced annotators. Annotators were asked to compare the quality of system output responses pairwise ("Which is better?") in relation to the context and message strings in the 2114 item test set. Identical strings were held out, so that the annotators only saw those outputs that differed. Paired responses were presented in random order to the annotators, and each pair of responses was judged by 5 annotators.
Table 4 summarizes the results of human evaluation, giving the difference in mean scores (pairwise preference margin) between systems and 95% confidence intervals generated using Welch's t-test. Identical strings not shown to raters are incorporated with an automatically assigned score of 0.5. The pattern in these results is clear and consistent: context-sensitive systems (+CMM) outperform non-context-sensitive systems, with preference gains as high as approximately 5.3% in the case of DCGM-II+CMM versus IR, and about 3.1% in the case of DCGM-II+CMM versus MT. Similarly, context-sensitive DCGM systems outperform non-DCGM context-sensitive systems by 1.5% (MT) and 2.3% (IR). These results are consistent with the automated BLEU rankings and confirm that our best performing DCGM models outperform both the raw baseline and the context-sensitive baseline using CMM features.

Discussion
Table 5 provides examples of responses generated on the tuning corpus by the MT-based DCGM-II+CMM system, our best system in terms of both BLEU and human evaluation. Responses from this system are on average shorter (8.95 tokens) than the original human responses in the tuning set (11.5 tokens). Overall, the outputs tend to be generic or commonplace, but are often reasonably plausible in the context as in examples 1-3, especially where context and message contain common conversational elements. Example 2 illustrates the impact of context-sensitivity: the word "book" in the response is not found in the message. Nonetheless, longer generated responses are apt to degrade both syntactically and in terms of content. We notice that longer responses are likely to present information that either conflicts internally within the response itself or is at odds with the context, as in examples 4-5. This is unsurprising, since our model lacks mechanisms both for reflecting agent intent in the response and for maintaining consistency with respect to sentiment polarity. Longer context and message components may also result in responses that wander off-topic or lapse into incoherence as in 6-8, especially when relatively low frequency unigrams ("bass", "threat") are echoed in the response.
In general, we expect that larger datasets and incorporation of more extensive contexts into the model will help yield more coherent results in these cases. Consistent representation of agent intent is outside the scope of this work, but will likely remain a significant challenge.

Conclusion
We have formulated a neural network architecture for data-driven response generation trained from social media conversations, in which generation of responses is conditioned on past dialog utterances that provide contextual information. We have proposed a novel multi-reference extraction technique allowing for robust automated evaluation using standard SMT metrics such as BLEU and METEOR. Our context-sensitive models consistently outperform both context-independent and context-sensitive baselines by up to 11% relative improvement in BLEU in the MT setting and 24% in the IR setting, albeit using a minimal number of features. As our models are completely data-driven and self-contained, they hold the potential to improve fluency and contextual relevance in other types of dialog systems.
Our work suggests several directions for future research. We anticipate that there is much room for improvement if we employ more complex neural network models that take into account word order within the message and context utterances. Direct generation from neural network models is an interesting and potentially promising next step. Future progress in this area will also greatly benefit from thorough study of automated evaluation metrics.

Figure 1: Example of three consecutive utterances occurring between two Twitter users A and B.

Figure 2: Compact representation of an RLM (left) and unrolled representation for two time steps (right).

Figure 3: Compact representations of DCGM-I (left) and DCGM-II (right). The decoder RLM receives a bias from the context encoder. In DCGM-I, we encode the bag-of-words representation of both c and m in a single vector $b_{cm}$. In DCGM-II, we concatenate the representations $b_c$ and $b_m$ on the first layer to preserve order information.

Figure 4: Comparison of the weights of learned CMM features for MT+CMM and IR+CMM systems (a) and (b) and DCGM-II+CMM on MT and IR (c) and (d).

Table 1: Number of triples, average, minimum and maximum number of references for tuning and test corpora.

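Table 3: Context-sensitive ranking results on both MT (left) and IR (right) n-best lists, n = 1000. The subscript feat. indicates the number of features of the models. The log-linear weights are estimated by running one iteration of MERT. We mark by (±%) the relative improvements with respect to the reference system (£).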

Table 4: Pairwise human evaluation scores between System A and B. The first (second) set of results refers to the MT (IR) hypothesis list. The asterisk means agreement between human preference and BLEU rankings.

Table 5: Sample responses produced by the MT-based DCGM-II+CMM system.