How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models

Generative conversational systems are attracting increasing attention in natural language processing (NLP). Recently, researchers have noticed the importance of context information in dialog processing, and built various models to utilize context. However, there is no systematic comparison to analyze how to use context effectively. In this paper, we conduct an empirical study to compare various models and investigate the effect of context information in dialog systems. We also propose a variant that explicitly weights context vectors by context-query relevance, outperforming the other baselines.


Introduction
Recently, human-computer conversation has been attracting increasing attention due to its promising potential and commercial value. Researchers have proposed both retrieval methods (Ji et al., 2014) and generative methods (Ritter et al., 2011; Shang et al., 2015) for automatic conversational systems. With the success of deep learning techniques, neural networks have demonstrated a powerful capability for learning human dialog patterns: given a user-issued utterance as an input query $q$, the network can generate a reply $r$, which is usually accomplished in a sequence-to-sequence (Seq2Seq) manner (Shang et al., 2015).
In the literature, there are two typical research setups for dialog systems: single-turn and multi-turn. Single-turn conversation is perhaps the simplest setting, where the model takes only $q$ into consideration when generating $r$ (Shang et al., 2015; Mou et al., 2016). However, most real-world dialogs comprise multiple turns. Previous utterances (referred to as context in this paper) can also provide useful information about the dialog status and are the key to coherent multi-turn conversation.
Existing studies have recognized the importance of context and proposed several context-aware conversational systems. For example, some directly concatenate context utterances and the current query; others use hierarchical models, first capturing the meaning of individual utterances and then integrating them as discourses. There are several ways of combining context and the current query, e.g., pooling or concatenation. Unfortunately, previous literature lacks a systematic comparison of the above methods.
In this paper, we conduct an empirical study on context modeling in Seq2Seq-like conversational systems. We emphasize the following research questions:
• RQ1. How can we make better use of context information? Our study shows that hierarchical models are generally better than non-hierarchical ones. We also propose a variant of context integration that explicitly weights a context vector by its relevance measure, outperforming simple vector pooling or concatenation.
• RQ2. What is the effect of context on neural dialog systems? We find that context information is useful to neural conversational models. It yields longer, more informative and diversified replies.
To sum up, the contributions of this paper are two-fold:
(1) We conduct a systematic study on context modeling in neural conversational models.
(2) We further propose an explicit context weighting approach, outperforming the other baselines.

Non-Hierarchical Model
To model a few utterances before the current query, several studies directly concatenate these sentences and use a single model to capture the meaning of the context and the query. They are referred to as non-hierarchical models in our paper. Such a method is also used in other NLP tasks, e.g., document-level sentiment analysis (Xu et al., 2016) and machine comprehension (Wang and Jiang, 2017).
Following the classic encoder-decoder framework, we use a Seq2Seq network, which transforms the query and context into a fixed-length vector $v_{\text{enc}}$ by a recurrent neural network (RNN) during encoding; then, in the decoding phase, it generates a reply $r$ with another RNN in a word-by-word fashion (see Figure 1a). In our study, we adopt RNNs with gated recurrent units (GRUs; Cho et al., 2014), which alleviate the long propagation problem of vanilla RNNs. When decoding, we apply beam search with a size of 5.
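To make the decoding step concrete, the following is a minimal sketch of beam search with a beam size of 5. It is an illustration under stated assumptions, not the implementation used in the experiments: `step_fn` is a hypothetical stand-in for the decoder RNN that maps a token prefix to a next-token probability table, the toy table and the `<s>`/`</s>` markers are invented for the example, and finished hypotheses are scored by raw log-probability without length normalization.

```python
import math

def beam_search(step_fn, start, end, beam_size=5, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) -> {token: probability} is a hypothetical decoder interface.
    """
    beams = [(0.0, [start])]          # (log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in step_fn(seq).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), seq + [tok]))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)            # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])[1]

# Toy next-token table (invented): the greedy first choice "a" leads to a worse reply.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def toy_step(seq):
    return TOY[seq[-1]]

print(beam_search(toy_step, "<s>", "</s>", beam_size=5))
# prints ['<s>', 'b', 'z', '</s>']
```

With `beam_size=1` the same function reduces to greedy decoding and returns the lower-probability sequence starting with "a", which is the behavior a wider beam is meant to avoid.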

Hierarchical Model
A more complicated approach to context modeling is to build hierarchical models with a two-step strategy: an utterance-level model captures the meaning of each individual sentence, and then an inter-utterance model integrates context and query information (Figure 1b).
Researchers have tried different ways of combining information during inter-utterance modeling; this paper evaluates several prevailing methods.
Sum pooling. Sum pooling (denoted as Sum) integrates information over a candidate set by summing the values in each dimension (Figure 2a). Given context vectors $v_{c_1}, \cdots, v_{c_n}$ and the query vector $v_q$, the encoded vector $v_{\text{enc}}$ is

$$v_{\text{enc}} = \sum_{i=1}^{n} v_{c_i} + v_q$$

Sum pooling has been used in prior work, where bag-of-words (BoW) features of the context and the query are simply added. In our experiments, sum pooling operates on the features extracted by sentence-level RNNs of context and query utterances, as modern neural networks preserve more information than BoW features.

[Figure 1: context $\{\ldots, c_n\}$ and the current query $q$ with (a) non-hierarchical or (b) hierarchical models.]
Concatenation. Concatenation (Concat) is another method used in prior work. This strategy concatenates all utterance-level vectors $v_{c_i}$ and $v_q$ into one long vector, i.e.,

$$v_{\text{enc}} = [v_{c_1}; \cdots; v_{c_n}; v_q]$$

Compared with sum pooling, vector concatenation can distinguish the different roles of the context and the query, as this operation keeps the inputs separate. One potential shortcoming, however, is that concatenation only works with fixed-length context.

Sequential integration. Yao et al. (2015) and Serban et al. (2015) propose hierarchical dialog systems, where an inter-utterance RNN is built upon the features (last hidden states) of utterance-level RNNs. Training is accomplished by end-to-end gradient propagation, and the process is illustrated in Figure 2c.
Using an RNN to integrate context and query vectors in a sequential manner enables complicated information interaction. Based on the RNN's hidden states, Sum and Concat could also be applied to obtain the encoded vector $v_{\text{enc}}$. However, we found that their performance is worse than using only the last hidden state (denoted as Seq). One plausible reason is that the inter-sentence RNN is not long, so the RNN can preserve this information well. Therefore, this variant is adopted in our experiments, as shown in Figure 2c.
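The difference between the Sum and Concat strategies can be illustrated with a small sketch. The function names and the toy two-dimensional vectors are assumptions made for illustration; in the paper the inputs would be utterance-level RNN features. The Seq variant is not shown, since it requires a trained inter-utterance RNN.

```python
def sum_pool(context_vecs, query_vec):
    """Sum: add all utterance vectors dimension by dimension."""
    vectors = context_vecs + [query_vec]
    return [sum(dims) for dims in zip(*vectors)]

def concat(context_vecs, query_vec):
    """Concat: join utterance vectors end to end, query last."""
    out = []
    for vec in context_vecs + [query_vec]:
        out.extend(vec)
    return out

context = [[1.0, 2.0], [3.0, 4.0]]   # toy context vectors
query = [5.0, 6.0]                   # toy query vector

print(sum_pool(context, query))      # prints [9.0, 12.0]
print(concat(context, query))        # prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The sketch shows the two properties discussed above: the summed vector has the same size regardless of how many utterances are given and does not record which vector was the query, whereas the concatenated vector keeps each utterance in its own slot but grows with the number of context utterances.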

Explicitly Weighting by Context-Query Relevance
In conversation, context utterances may vary in content and semantics: context utterances that are relevant to the query may be useful, while irrelevant ones may bring more noise. Following this intuition, we propose a variant that explicitly weights the context vector by an attention score of context-query relevance. First, we compute the similarity between the context and query by the cosine measure

$$s_{c_i} = \operatorname{sim}(c_i, q) = \frac{e_{c_i}^{\top} e_q}{\|e_{c_i}\| \, \|e_q\|}$$

where $e_{c_i} = \sum_{w \in c_i} e_w$ and $e_q = \sum_{w' \in q} e_{w'}$; that is, the sentence embedding is the sum of word embeddings.
Following the spirit of attention mechanisms, we would like to normalize these similarities by a softmax function and obtain attention probabilities:

$$\alpha_{c_i} = \frac{\exp(s_{c_i})}{\sum_{j} \exp(s_{c_j}) + \exp(s_q)}, \qquad \alpha_q = \frac{\exp(s_q)}{\sum_{j} \exp(s_{c_j}) + \exp(s_q)}$$

where the sums run over all context utterances, and $s_q$ is computed in the same manner as $s_{c_i}$ and is always 1, which is the cosine of a vector with itself. The intuition is that, if the context is less relevant, we should mainly focus on the query itself, but if the context is relevant, we should focus more evenly across the context and the query. In other words, our explicit weighting approach can be viewed as heuristic attention. Akin to Subsection 2.2, we aggregate the weighted context and query vectors by pooling and concatenation, resulting in the following two variants.
• WSeq (sum), where weighted vectors are summed together:

$$v_{\text{enc}} = \sum_{i=0}^{n} \alpha_{c_i} h_{c_i} + \alpha_q h_q$$

• WSeq (concat), where weighted vectors are concatenated:

$$v_{\text{enc}} = [\alpha_{c_0} h_{c_0}; \ldots; \alpha_{c_n} h_{c_n}; \alpha_q h_q] \qquad (7)$$

Notice that the explicit weighting approach can also be applied to sentence embeddings (without the inter-sentence RNN). We denote these variants by WSum and WConcat, respectively; details are not repeated. They are included for comparison in Section 3.2.
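The weighting scheme of this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names, the toy word vectors, and the choice to return a similarity of 0 for an all-zero embedding are all hypothetical. The `vectors` argument stands for either inter-sentence RNN hidden states (WSeq) or utterance-level vectors (WSum, WConcat), ordered with the query last.

```python
import math

def embed(sentence, word_vecs):
    """Sentence embedding as the sum of its word embeddings."""
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    for word in sentence:
        for k, x in enumerate(word_vecs[word]):
            total[k] += x
    return total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat an all-zero embedding as having zero similarity.
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def attention_weights(context_embs, query_emb):
    """Softmax over the context similarities plus s_q, which is fixed at 1."""
    scores = [cosine(e, query_emb) for e in context_embs] + [1.0]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]      # last entry is the query weight

def weighted_sum(vectors, weights):
    """Sum variant: weighted vectors added dimension by dimension."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def weighted_concat(vectors, weights):
    """Concat variant: weighted vectors joined end to end."""
    return [w * x for w, v in zip(weights, vectors) for x in v]

# One context utterance that is unrelated to the query (cosine 0).
print(attention_weights([[1.0, 0.0]], [0.0, 1.0]))
# One context utterance identical to the query (cosine 1).
print(attention_weights([[0.0, 1.0]], [0.0, 1.0]))
```

In the first call the query receives a weight of e / (1 + e), roughly 0.73, and the unrelated context roughly 0.27; in the second call both receive 0.5. This matches the stated intuition: less relevant context shifts the focus toward the query, while relevant context is weighted evenly with it.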

Setup
We conducted all experiments on a Chinese dataset crawled from an online free chatting platform, Baidu Tieba. To facilitate research on the effect of context, we established a multi-turn conversational corpus following prior work, including Serban et al. (2015). A data sample contains three utterances, forming a triple ⟨last context, query, reply⟩. In total, we had […] (2015) and […]: embeddings 620d and hidden states 1000d; we used AdaDelta for optimization.

Results and Analysis
We evaluated model performance by BLEU scores. As this paper compares various models, we could not afford to hire workers to manually annotate their satisfaction. BLEU scores, albeit imperfect for open-domain dialog systems, exhibit some correlation with human satisfaction (Liu et al., 2016; Tao et al., 2017). We present in Table 1 the overall performance of the models introduced in Section 2, and answer our research questions as follows.
RQ1: How can we make better use of context information?
We first observe that context-aware methods generally outperform the context-insensitive one. This implies that context is indeed useful in open-domain, chit-chat-style dialog systems. The results are consistent with previous studies (Serban et al., 2015).
Among context-aware neural conversational models, we have the following findings.
• Hierarchical structures outperform the non-hierarchical one.
Comparing the non-hierarchical and hierarchical structures, we find that (most) hierarchical models outperform the non-hierarchical one by a large margin. The results show that dialog systems are different from other NLP applications, e.g., comprehension (Wang and Jiang, 2017), where non-hierarchical recurrent neural networks are adopted to better integrate information across different sentences. A plausible explanation, as indicated by Meng et al. (2017), is that conversational sentences are not necessarily uttered by the same speaker, and the literature shows consistent evidence of the effectiveness of hierarchical RNNs in dialog systems.
• Keeping the roles of different utterances separate is important.
As mentioned in Section 2, the concatenation operation (Concat) distinguishes the roles of different utterances, while sum pooling (Sum) aggregates information in a homogeneous way. We see that the former outperforms the latter at both the sentence-embedding and inter-sentence RNN levels, showing that sum pooling is not suitable for treating dialog context. Our conjecture is that sum pooling buries illuminating query information under less important context. Hence, keeping them separate generally helps.
• The context-query relevance score benefits conversational systems.
Our explicit weighting approach computes an attention probability by context-query relevance. In all variants (Sum, Concat, and Seq), explicit weighting improves the performance by a large margin (except BLEU-1 for Seq). The results indicate that context-query relevance is useful, as it emphasizes relevant context utterances and down-weights irrelevant ones.

RQ2: What is the effect of context on neural dialog systems?
We now examine how context information affects neural conversational systems. In Table 2, we present three auxiliary metrics, i.e., sentence length, entropy, and diversity. The first two are used in Serban et al. (2016) and Mou et al. (2016), whereas the last is used in Zhang and Hurley (2008).
As shown, context-aware conversational models tend to generate longer, more meaningful and diverse replies compared with context-insensitive models, given that they also improve BLEU scores. This shows an interesting phenomenon of neural sequence generation: an encoder-decoder framework needs sufficient source information for meaningful generation of the target; it simply does not produce meaningful content from less meaningful input. A similar phenomenon is also reported in our previous work (Mou et al., 2016), where we show that the same network will generate more meaningful sentences if it starts from a given (meaningful) keyword. These results also partially explain why a Seq2Seq neural network tends to generate short and universally relevant replies in open-domain conversation, despite its success in machine translation, abstractive summarization, etc.
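The text above does not spell out how the auxiliary metrics are computed, so the following is only a sketch under explicit assumptions. Sentence length is taken as the average number of tokens per reply, and entropy is assumed to be the average per-word negative log2 probability under a unigram model estimated from a reference corpus; the function names, the toy corpus, and the choice to skip words unseen in the reference corpus are all hypothetical. Diversity is left out because its exact definition cannot be recovered from this text.

```python
import math
from collections import Counter

def unigram_probs(corpus):
    """Relative word frequencies over a tokenized reference corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def average_length(replies):
    """Mean number of tokens per generated reply."""
    return sum(len(r) for r in replies) / len(replies)

def average_entropy(replies, probs):
    """Mean per-word -log2 p(w); words missing from probs are skipped (assumption)."""
    total, n = 0.0, 0
    for reply in replies:
        for tok in reply:
            if tok in probs:
                total -= math.log2(probs[tok])
                n += 1
    return total / n if n else 0.0

reference = [["a", "a", "b", "c"]]        # toy reference corpus
replies = [["a", "b"], ["c"]]             # toy generated replies
probs = unigram_probs(reference)
print(average_length(replies))            # prints 1.5
print(round(average_entropy(replies, probs), 4))   # prints 1.6667
```

Under this assumed definition, replies built from rarer words score higher entropy, which is one way to read "more meaningful" in the comparison above.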

Conclusion
In this work, we analyzed the effect of context in generative conversational models. We conducted a systematic comparison among existing methods and our newly proposed variant that explicitly weights context vectors by context-query relevance.
We show that hierarchical RNNs generally outperform non-hierarchical ones, and that explicitly weighting context information can emphasize the relevant context utterances and weaken less relevant ones.
Our experiments also reveal an interesting phenomenon: with context information, neural networks tend to generate longer, more meaningful and diverse replies, which sheds light on neural sequence generation.