NEXUS Network: Connecting the Preceding and the Following in Dialogue Generation

Sequence-to-Sequence (seq2seq) models have become overwhelmingly popular for building end-to-end trainable dialogue systems. Though efficient at learning the backbone of human-computer communication, they suffer from the problem of strongly favoring short generic responses. In this paper, we argue that a good response should smoothly connect both the preceding dialogue history and the following conversations. We strengthen this connection through mutual information maximization. To sidestep the non-differentiability of discrete natural language tokens, we introduce an auxiliary continuous code space and map this code space to a learnable prior distribution for generation purposes. Experiments on two dialogue datasets validate the effectiveness of our model: the generated responses are closely related to the dialogue context and lead to more interactive conversations.


Introduction
With the availability of massive online conversational data, there has been a surge of interest in building open-domain chatbots with data-driven approaches. Recently, the neural network based sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014; Cho et al., 2014) has been widely adopted. In such a model, the encoder, typically a recurrent neural network (RNN), maps the source tokens into a fixed-size continuous vector, based on which the decoder estimates the probabilities on the target side word by word. The whole model can be efficiently trained by maximum likelihood estimation (MLE) and has demonstrated state-of-the-art performance in various domains. However, this architecture is not suitable for modeling dialogues. Recent research has found that while the seq2seq model generates syntactically well-formed responses, they are prone to being off-context, short, and generic (e.g., "I don't know" or "I am not sure") (Li et al., 2016a). The reason lies in the one-to-many alignments in human conversations, where one dialogue context is open to multiple potential responses. When optimizing with the MLE objective, the model tends to have a strong bias towards safe responses, as they can be literally paired with arbitrary dialogue context without semantic or grammatical contradictions. These safe responses break the dialogue flow without bringing any useful information, and people will easily lose interest in continuing the conversation. In this paper, we propose NEXUS Network, which aims at producing more on-topic responses to maintain an interactive conversation flow. Our assumption is that a good response should serve as a "nexus": connecting and being informative to both the preceding dialogue context and the follow-up conversations. (* Equal contribution: X. Shen focused on the algorithm and H. Su on the experiments.)
For example, in Figure 1, the response from B1 is a smooth connection: the first half indicates that the preceding context is a "Do you know" question, and the second half informs that the follow-up will be an introduction to Star Wars. We establish this connection by maximizing the mutual information (MMI) of the current utterance with both the past and future contexts. In this way, generic responses can be largely discouraged, as they contain no valuable information and thus have only weak correlations with the surrounding context. To enable efficient training, two challenges must be addressed.
The first challenge comes from the discrete nature of language tokens, which hinders efficient gradient descent. One strategy is to estimate the gradient with methods like Gumbel-Softmax (Maddison et al., 2017; Jang et al., 2017) or the REINFORCE algorithm (Williams, 1992), which has been applied in many NLP tasks (He et al., 2016; Shetty et al., 2017; Gu et al., 2018; Paulus et al., 2018), but the trade-off between bias and variance of the estimated gradient is hard to reconcile. The resulting model usually relies heavily on sensitive hyperparameter tuning, careful pre-training and task-specific tricks. Li et al. (2016a); Wang et al. (2017) avoid this non-differentiability problem by learning a separate backward model to rerank candidate responses in the testing phase while still adhering to the MLE objective for training. However, the candidate set normally suffers from low diversity, and a huge sample size is needed for good performance (Li et al., 2016b).
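As a concrete illustration of the first family of estimators, the Gumbel-Softmax relaxation can be sketched in a few lines of NumPy. This is illustrative only and not a component of our model; the logits and temperature below are arbitrary toy values.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Differentiable relaxation of a categorical sample (Jang et al., 2017):
    add Gumbel(0, 1) noise to the logits and apply a temperature softmax."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-20, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau             # lower tau -> closer to one-hot
    e = np.exp(y - y.max())            # numerically stable softmax
    return e / e.sum()

sample = gumbel_softmax_sample(np.array([2.0, 0.5, -1.0]), tau=0.5)
```

The returned vector is a "soft" one-hot: it sums to one and concentrates on one entry as the temperature shrinks, but remains differentiable in the logits, which is exactly the source of the bias discussed above.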
The second challenge relates to the unknown future context at testing time. In our framework, both the history and future context need to be explicitly observed in order to compute the mutual information. When applying it to generation tasks where only the history context is given, there is no way to explicitly take the future information into account; therefore, reranking-based models do not apply here. Li et al. (2016c) address future information by policy learning, but the model suffers from high variance due to the enormous sequential search space. Serban et al. (2017); Shen et al. (2017) adopt the variational inference strategy to reduce the training variance by optimizing over latent continuous variables. However, they all stick to the original MLE objective, and no connection with the surrounding context is considered.
In this work, we address both challenges by introducing an auxiliary continuous code space learned from the whole dialogue flow. At each time step, instead of directly optimizing discrete utterances, the current, past and future utterances are all trained to maximize the mutual information with this code space. Furthermore, a learnable prior distribution is simultaneously optimized to predict the corresponding code space, enabling efficient sampling in the testing phase without access to the ground-truth future conversation. Extensive experiments validate the superiority of our framework: the generated responses clearly demonstrate better coherence and diversity.
Model Structure

Motivation
Let u_i be the i-th utterance within a dialogue flow. The dialogue history H_{i−1} contains all the preceding context u_1, u_2, ..., u_{i−1}, and F_{i+1} denotes the future conversation u_{i+1}, ..., u_T. The objective of our model is to find the decoding probability p_θ(u_i | H_{i−1}, F_{i+1}) that maximizes the mutual information I(H_{i−1}, u_i) and I(u_i, F_{i+1}). Formally, the objective is:

    max_θ  λ1 I(H_{i−1}, u_i) + λ2 I(u_i, F_{i+1})    (1)

where λ1 and λ2 adjust the relative weights. Mutual information is defined over p_θ(u_i | H_{i−1}, F_{i+1}) and the empirical distribution p(H_{i−1}, F_{i+1}). For now, we assume the future context F_{i+1} is known when training the decoding probability; we address the unknown-future problem later.
Directly optimizing this objective is unfortunately infeasible: the exact computation of mutual information is intractable, and backpropagating through sampled discrete sequences is notoriously difficult. The discontinuity prevents the direct application of the reparameterization trick (Kingma and Welling, 2014). Low-variance relaxations like Gumbel-Softmax (Jang et al., 2017), semantic hashing (Kaiser et al., 2018) or vector quantization (van den Oord et al., 2017) lead to biased gradient estimates, which accumulate as the sequence grows longer. Monte Carlo simulation is unbiased but suffers from high variance, and designing a reasonable control variate for variance reduction is an extremely tricky task (Mnih and Gregor, 2014; Tucker et al., 2017). For this reason, we propose replacing u_i with a continuous code space c learned from the whole dialogue flow.
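The reparameterization trick mentioned above can be sketched in two lines: a Gaussian sample is written as a deterministic function of its parameters plus independent noise. This is a minimal NumPy illustration; in a real model, mu and log_var are network outputs and gradients flow through them.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1): the sample is a deterministic
    function of (mu, log_var), so gradients can pass through both."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(4), np.zeros(4), rng)   # one draw from N(0, I)
```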

Continuous Code Space
We define the continuous code space c to follow a Gaussian distribution with a diagonal covariance matrix, conditioned on the whole dialogue:

    p_φ(c | H_{i−1}, F_i) = N(c; μ, diag(σ²)),  [μ, log σ²] = MLP_φ([H̄_{i−1}; F̄_i])    (2)

The dialogue history H_{i−1} is encoded into a vector H̄_{i−1} by a forward hierarchical GRU model E_f. The future conversation, including the current utterance, is encoded into F̄_i by a backward hierarchical GRU E_b. H̄_{i−1} and F̄_i are concatenated, and a multi-layer perceptron on top of them estimates the Gaussian mean and covariance parameters. The code space is trained to infer the encoded history H̄_{i−1} and future F̄_{i+1}. The full optimization objective is:

    L(c) = E_{c ∼ p_φ(c | H_{i−1}, F_i)} [ λ1 log p(H̄_{i−1} | c) + λ2 log p(F̄_{i+1} | c) ]    (3)

where H̄_{i−1} and F̄_{i+1} are also assumed to be Gaussian distributed given c, with mean and covariance estimated from multi-layer perceptrons. We infer the encoded vectors instead of the original sequences for three reasons. Firstly, inferring dense vectors is parallelizable and computationally much cheaper than autoregressive decoding, especially since the context sequences can be arbitrarily long. Secondly, sequence vectors can capture more holistic semantic-level similarity than individual tokens. Lastly, it helps alleviate the posterior collapse issue (Bowman et al., 2016) when training variational inference models on text (Chen et al., 2017; Shen et al., 2018), which we will use later. It can be shown that the above objective maximizes a lower bound of λ1 I(H_{i−1}, c) + λ2 I(c, F_{i+1}) given the conditional probability p_φ(c | H_{i−1}, F_i). The proof is a direct extension of the derivation in prior work, followed by the Data Processing Inequality (Beaudry and Renner, 2012): an encoding function can only reduce the mutual information. As the sampling process contains only continuous Gaussian variables, the above objective can be trained with the reparameterization trick (Kingma and Welling, 2014), which provides a low-variance, unbiased gradient estimator (Burda et al., 2015).
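A toy sketch of how the posterior of Eq. 2 could be parameterized is given below. The layer sizes, single tanh layer, and random weights are hypothetical simplifications for illustration (the paper's actual estimators are 3-layer perceptrons and the code dimension is 100).

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, d_hid, d_code = 8, 16, 4      # toy sizes; the paper uses dim(c) = 100

# Hypothetical MLP weights mapping [H_bar; F_bar] to Gaussian parameters.
W1 = 0.1 * rng.standard_normal((2 * d_ctx, d_hid))
W_mu = 0.1 * rng.standard_normal((d_hid, d_code))
W_lv = 0.1 * rng.standard_normal((d_hid, d_code))

def posterior(h_bar, f_bar):
    """Diagonal-Gaussian posterior p_phi(c | H_{i-1}, F_i) over the code space."""
    x = np.tanh(np.concatenate([h_bar, f_bar]) @ W1)
    return x @ W_mu, x @ W_lv        # mean, log-variance

mu, log_var = posterior(rng.standard_normal(d_ctx), rng.standard_normal(d_ctx))
c = mu + np.exp(0.5 * log_var) * rng.standard_normal(d_code)  # reparameterized draw
```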
After training, samples from p φ (c|H i−1 , F i ) hold high mutual information with both the history and future context. The next step is then transferring the continuous code space to reasonable discrete natural language utterances.

Decoding from Continuous Space
Our decoder transfers the code space c into the ground-truth utterance u_i by defining the probability distribution p(u_i | H_{i−1}, c), implemented as a GRU decoder that goes through u_i word by word to estimate the output probability. The encoded history H̄_{i−1} and code c are concatenated as an extra input at each time step. The loss function for the decoder is then:

    L(d) = E_{c ∼ p_φ(c | H_{i−1}, F_i)} [ log p(u_i | H_{i−1}, c) ]    (4)

which can be shown to lower-bound the conditional mutual information I(u_i, c | H_{i−1}). By maximizing this conditional mutual information, c is trained to retain as much information about the target sequence u_i as possible.
Combining Eq. 3 and 4, our model so far can be viewed as optimizing a lower bound of the following objective:

    λ1 I(H_{i−1}, c) + λ2 I(c, F_{i+1}) + I(u_i, c | H_{i−1})    (5)

Compared with the original motivation in Eq. 1, we sidestep the non-differentiability problem by replacing u_i with a continuous code space c, then forcing u_i to contain the same information as c by additionally maximizing the mutual information between them.
Nonetheless, Eq. 5 and Eq. 1 might lead to different optima, as mutual information does not satisfy the transitive law. In the extreme case, different dimensions of c could individually maintain information about the history, current and future conversations while the conversations themselves share no dependency relation. To avoid this issue, we restrict the dimension of c to be smaller than that of the encoded vectors. In this case, optimizing Eq. 5 will favor utterances that have stronger correlations with the surrounding context, since they achieve a higher total mutual information.

Learnable Prior Distribution for Unknown Future
The last problem is the sampling mechanism of c in Eq. 2, which conditions on the ground-truth future conversation. In the testing phase, when we have no access to it, we cannot perform the decoding process of Eq. 4. To allow decoding with only the history context, we need to learn an appropriate prior distribution p_θ(c | H_{i−1}) for c. In the ideal case, we would like

    p_θ(c | H_{i−1}) = p_φ(c | H_{i−1}) = E_{p(F_i | H_{i−1})} [ p_φ(c | H_{i−1}, F_i) ]    (6)

However, p_φ(c | H_{i−1}) is intractable, as it integrates over all possible future conversations. We instead apply variational inference on c to maximize the variational lower bound (Jordan et al., 1999):

    log p(u_i | H_{i−1}) ≥ E_{p_φ(c | H_{i−1}, F_i)} [ log p(u_i | H_{i−1}, c) ] − KL( p_φ(c | H_{i−1}, F_i) ‖ p_θ(c | H_{i−1}) )    (7)

Since the first term is already covered by Eq. 4, this can be reformulated as maximizing L(p) = −KL( p_φ(c | H_{i−1}, F_i) ‖ p_θ(c | H_{i−1}) ): the learnable prior is trained to approximate the posterior by minimizing the KL divergence between them. It also functions as a regularizer that prevents overfitting when learning p_φ(c | H_{i−1}, F_i). In the testing phase, we can sample c from the learned prior distribution p_θ(c | H_{i−1}) and generate a response based on it.
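Since both the posterior and the learnable prior are diagonal Gaussians, the KL term in L(p) has a closed form. A minimal sketch (function name and the log-variance parameterization are ours):

```python
import numpy as np

def kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p):
    """KL( N(mu_q, diag(exp(lv_q))) || N(mu_p, diag(exp(lv_p))) ) in closed form.
    Summed over dimensions; lv_* are log-variances."""
    return 0.5 * np.sum(
        lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0
    )
```

For identical distributions the divergence is zero, and it grows as the prior drifts away from the posterior, which is exactly the pressure that trains p_θ(c | H_{i−1}).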

Summary
To sum up, the total objective function of our model is:

    L = L(c) + L(d) + L(p)

Weighting can be added to the individual loss terms for better performance, but we find it sufficient to keep equal weights and avoid extra hyperparameters. All parameters are updated simultaneously by gradient descent, except for the encoders E_f and E_b, which receive gradients only from L(d): otherwise, the model can easily learn to encode no information in order to obtain a lower reconstruction loss in L(c) and L(p). An overview of our training procedure is depicted in Fig. 2.
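In PyTorch, the gradient blocking described above amounts to detaching the encoder outputs before they enter L(c) and L(p). The tensors and loss heads below are toy stand-ins, not the paper's actual modules:

```python
import torch

# Stand-ins for encoder parameters and two loss heads (toy shapes).
enc_weight = torch.randn(4, 4, requires_grad=True)   # plays the role of E_f / E_b
x = torch.randn(2, 4)
h = x @ enc_weight                                   # encoded context

loss_d = h.pow(2).mean()                 # L(d): gradients DO reach the encoder
loss_cp = h.detach().pow(2).mean()       # L(c), L(p): gradient path is cut here

(loss_d + loss_cp).backward()
# enc_weight.grad now contains contributions only from loss_d.
```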
Relationship to Existing Methods

MMI Decoding
The MMI decoder was proposed by Li et al. (2016a) and further extended by Wang et al. (2017). The basic idea is the same as in our model: maximizing the mutual information with the dialogue context. However, the MMI principle is applied only at the testing phase rather than the training phase. As a result, it can only be used to evaluate the quality of a generation by estimating its mutual information with the context. To apply it in a generative task, we must first sample candidate responses with the seq2seq model, then rerank them by the MMI score. Our model differs in that we directly estimate the decoding probability, so no post-sampling reranking is needed. Moreover, we further include the future context to strengthen the connecting role of the current utterance.

Conditional Variational Autoencoder
The idea of learning an appropriate prior distribution in Eq. 7 is essentially a conditional variational autoencoder (Sohn et al., 2015), where the accumulated posterior distribution is trained to stay close to a prior distribution. It has also been applied to dialogue generation (Serban et al., 2017). However, all the above methods stick to the MLE objective function and do not optimize with respect to the mutual information. As we show in the experiments, they fail to learn the correlation between the utterance and its surrounding context. The generation diversity of these models comes more from the sampling randomness of the prior distribution than from a correct understanding of context correlation. Moreover, they suffer from the posterior collapse problem (Bowman et al., 2016) and require special tricks like KL-annealing, BOW loss or word drop-out (Shen et al., 2018). Our model does not have such problems.
Deep Reinforcement Learning Dialogue Generation
Li et al. (2016c) first considered future success in dialogue generation and applied deep reinforcement learning to encourage more interactive conversations. However, the reward functions are hand-crafted by intuition, the relative weight of each reward needs to be carefully tuned, and the training stage is unstable due to the huge search space. In contrast, our model maximizes the mutual information in the continuous space and trains the prior distribution through the reparameterization trick. As a result, our model can be trained more easily and with lower variance. Throughout our experiments, the training process of NEXUS network is rather stable and much less data-hungry. The MMI objective of our model is theoretically more sound, and no manually defined rules need to be specified.

Dataset and Training Details
We run experiments on the DailyDialog (Li et al., 2017b) and Twitter (Ritter et al., 2011) corpora. DailyDialog contains 13,118 daily conversations under ten different topics. This dataset was crawled from various websites for English learners to practice English in daily life; it is high-quality and less noisy, but relatively small. In contrast, the Twitter corpus is significantly larger but contains more noise. We obtain the dataset as used in Serban et al. (2017) and filter out tweets that have already been deleted, resulting in about 750,000 multi-turn dialogues. The contents have more informal, colloquial expressions, which makes the generation task harder. Both datasets are randomly split into training/validation/test sets in a 10:1:1 ratio.
In order to keep our model comparable with the state of the art, we keep most parameter values the same as in Serban et al. (2017). We build our vocabulary from the most frequent 20,000 words for both corpora and map other words to an UNK token. The dimensionality of the code space c is 100. We use a learning rate of 0.001 for DailyDialog and 0.0002 for the Twitter corpus. The batch size is fixed to 128. The word vector dimension is 300, initialized with the public Word2Vec (Mikolov et al., 2013) embeddings trained on the Google News Corpus. The probability estimators for the Gaussian distributions are implemented as 3-layer perceptrons with the hyperbolic tangent activation function. As mentioned above, when training NEXUS models, we block the gradients from L(c) and L(p) with respect to E_f and E_b to encourage more meaningful encodings. The UNK token is prevented from being generated in the test phase. We implemented all models with the open-source Python library PyTorch (Paszke et al., 2017) and optimized them with the Adam optimizer (Kingma and Ba, 2015).

Compared Models
We conduct extensive experiments to compare our model against several representative baselines.
Seq2Seq: Following the same implementation as in Vinyals and Le (2015), the seq2seq model serves as a baseline. We try both greedy decoding and beam search (Graves, 2012) with beam size 5 at test time. MMI: We implemented the bidirectional MMI decoder as in Li et al. (2016a), which showed better performance than the anti-LM model. The hyperparameter λ is set to 0.5 as suggested, and 200 candidates per context are sampled for reranking.
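The reranking step of the MMI baseline can be sketched as follows. The scoring callables `fwd_logp` and `bwd_logp` stand in for the forward seq2seq model log p(r|c) and the backward model log p(c|r); the candidate strings and scores below are toy values for illustration.

```python
def mmi_rerank(candidates, fwd_logp, bwd_logp, lam=0.5):
    """Rerank sampled candidates by log p(r|c) + lam * log p(c|r)
    (bidirectional MMI score, Li et al., 2016a)."""
    return max(candidates, key=lambda r: fwd_logp(r) + lam * bwd_logp(r))

# Toy stand-in scores: the generic reply has a high forward probability
# but tells us little about the context (low backward probability).
fwd = {"i dont know": -1.0, "star wars is on tonight": -3.0}
bwd = {"i dont know": -9.0, "star wars is on tonight": -2.0}
best = mmi_rerank(list(fwd), fwd.get, bwd.get, lam=0.5)
```

With these toy scores, the informative response wins despite its lower forward probability, which is precisely the effect the MMI criterion is designed to achieve.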
VHRED: The VHRED model is essentially a conditional variational autoencoder with hierarchical encoders (Serban et al., 2017). To alleviate the posterior collapse problem, we apply the KL-annealing trick and early stopping, with the annealing step set to 12,000 for DailyDialog and 75,000 for the Twitter corpus. RL: The deep reinforcement learning chatbot of Li et al. (2016c). We use all three reward functions mentioned in the paper and keep their relative weights as in the original. The policy network is initialized with the above-mentioned MMI model.
NEXUS-H: NEXUS network maximizing mutual information only with the history (λ 2 = 0). NEXUS-F: NEXUS network maximizing mutual information only with the future (λ 1 = 0). NEXUS: NEXUS network maximizing mutual information with both the history and future.
NEXUS-H and NEXUS-F are implemented to help analyze the effects of the different components of our model. The hyperparameters λ1 and λ2 in NEXUS are set to 0.5 and 1 respectively, as we find the history vector is consistently easier to reconstruct than the future vector (Appendix A.6).

Metric-based Performance
Embedding Score We conducted three embedding-based evaluations (average, greedy and extrema), which map responses into a vector space and compute the cosine similarity (Rus and Lintean, 2012). The embedding-based metrics can, to a large extent, capture the semantic-level similarity between generated responses and the ground truth. We represent words using Word2Vec embeddings trained on the Google News Corpus. We also measure the uncertainty of the score by assuming each data point is independently Gaussian distributed; the standard deviation yields the 95% confidence interval (Barany et al., 2007). Table 1 reports the embedding scores on both datasets. NEXUS network significantly outperforms the best baseline model in most cases. Notably, NEXUS absorbs the advantages of both NEXUS-H and NEXUS-F: the history and future information appear to help the model from different perspectives, taking both into account creates no conflict, and the combination leads to an overall improvement. RL performs rather poorly on this metric, which is understandable as it does not target the ground-truth responses during training (Li et al., 2016c).
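Two of the embedding metrics above can be sketched compactly. This is a minimal NumPy version assuming pre-looked-up word vectors (one row per word); the greedy variant is omitted for brevity.

```python
import numpy as np

def embedding_average(resp_vecs, ref_vecs):
    """Embedding-average score: cosine similarity between the mean word
    vectors of the response and of the reference."""
    a = np.mean(resp_vecs, axis=0)
    b = np.mean(ref_vecs, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_extrema(resp_vecs, ref_vecs):
    """Vector-extrema score: keep the per-dimension value with the largest
    magnitude across words, then compare by cosine similarity."""
    def extrema(vecs):
        mx, mn = np.max(vecs, axis=0), np.min(vecs, axis=0)
        return np.where(np.abs(mx) >= np.abs(mn), mx, mn)
    a, b = extrema(np.asarray(resp_vecs)), extrema(np.asarray(ref_vecs))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```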
BLEU Score BLEU is a popular metric that measures the geometric mean of the modified n-gram precision with a length penalty (Papineni et al., 2002). Table 2 reports the BLEU-1 to BLEU-3 scores on DailyDialog and Twitter (multiple-reference BLEU following Lin and Och (2004); the p-value interval is computed based on the altered bootstrap resampling algorithm (Riezler and Maxwell, 2005)). Compared with the embedding-based metrics, the BLEU score quantifies the word overlap between generated responses and the ground truth. One challenge of evaluating dialogue generation with BLEU is the difficulty of obtaining multiple references for the one-to-many alignment relation. Following Sordoni et al. (2015), Zhao et al. (2017) and Shen et al. (2018), for each context, 10 more candidate references are acquired using information retrieval methods (see Appendix A.4 for details). All candidates are then passed to human annotators to filter out unsuitable ones, resulting in 6.74 and 5.13 references on average for the DailyDialog and Twitter datasets respectively. Human annotation is costly, so we evaluate on 1000 sampled test cases for each dataset. As the BLEU score is not the simple mean of individual sentence scores, we compute the 95% significance interval by bootstrap resampling (Koehn, 2004; Riezler and Maxwell, 2005). As can be seen, NEXUS network achieves best or near-best performance with only a greedy decoder. NEXUS-H generally outperforms NEXUS-F, as the connection with the future context is not explicitly addressed by the BLEU metric. MMI and VHRED bring minor improvements over the seq2seq model. Even when evaluated on multiple references, RL still performs worse than most models.
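The modified n-gram precision at the heart of BLEU can be sketched as follows. This is a minimal multi-reference version for illustration; it omits the geometric mean over n and the brevity penalty of full BLEU.

```python
from collections import Counter

def modified_ngram_precision(hyp, refs, n):
    """Core of BLEU (Papineni et al., 2002): each hypothesis n-gram count is
    clipped by its maximum count in any single reference."""
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    h = grams(hyp)
    ref_counts = [grams(r) for r in refs]
    clipped = sum(min(c, max(rc[g] for rc in ref_counts)) for g, c in h.items())
    return clipped / max(sum(h.values()), 1)
```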
Connecting the preceding We define two metrics to evaluate the model's capability of "connecting the preceding context": AdverSuc and Neg-PMI. AdverSuc measures the coherence of generated responses with the provided context by training an adversarial discriminator (Li et al., 2017a) on the same corpus to distinguish coherent responses from randomly sampled ones. We encode the context and response separately with two different LSTM networks and output a binary signal indicating coherent or not. The AdverSuc value is reported as the success rate at which the model fools the classifier into believing its generations are coherent (p(generated = coherent) > 0.5).
Neg-PMI measures the negative pointwise mutual information −log [p(c|r)/p(c)] between the generated response r and the dialogue context c. p(c|r) is estimated by training a separate backward seq2seq model. As p(c) is a constant, we ignore it and only report the value of −log p(c|r). A good model should achieve a higher AdverSuc and a lower Neg-PMI. The results are listed in Table 3. We can see there is still a big gap between the ground truth and synthesized responses. As expected, NEXUS-H leads to the most significant improvement. The MMI model also performs remarkably well, but it requires post-reranking, so its sampling process is much slower. VHRED and NEXUS-F do not help much here and sometimes even slightly degrade the performance. We also tried removing the history context when computing the posterior distribution in VHRED; the resulting model performs similarly on all metrics, which suggests VHRED itself cannot actually learn the correlation pattern with the preceding context. Surprisingly, though RL explicitly sets the coherence score as a reward function, its performance is far from satisfying. We assume RL requires much more data than other models to learn an appropriate policy, and its training process suffers from higher variance; the result is thus hard to guarantee.
Connecting the following We measure the model's capability of "connecting the following context" from two perspectives: the number of simulated turns and the diversity of generated responses. We apply all models to generate multiple turns until a generic response is reached. The set of generic responses is manually examined to include all utterances providing only passive, dull replies. The number of generated turns reflects how long a model can maintain an interactive conversation. The results are shown in the #Turns column of Table 3. As in (Li et al., 2016a), we measure diversity by the percentage of distinct unigrams (Distinct-1) and bigrams (Distinct-2) in all generated responses. Intuitively, a higher score on these three metrics implies a more interactive generation system that better connects to the future context. Again, NEXUS network dominates on most metrics. NEXUS-F has more impact than NEXUS-H, as it explicitly encourages more interactive turns. Most seq2seq models fail to provide an informative response even in the first turn. The MMI decoder does not change much, possibly because the sampling space is not large enough; a more diverse sampling mechanism (Vijayakumar et al., 2018) might help. NEXUS network can effectively continue the conversation for 2.8 turns on DailyDialog and 2.5 turns on Twitter, which is closest to the ground truth (4.8 and 4.0 turns respectively). It also achieves the best diversity scores on both datasets. It is worth mentioning that NEXUS-H also improves over the baselines, though not as significantly as NEXUS-F, so NEXUS is not a trade-off but rather an enhanced combination of NEXUS-H and NEXUS-F.
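The Distinct-n diversity metric used above can be sketched in a few lines (responses are assumed to be pre-tokenized lists of words):

```python
def distinct_n(responses, n):
    """Distinct-n (Li et al., 2016a): unique n-grams divided by total n-grams
    over all generated responses (each response is a list of tokens)."""
    ngrams = [tuple(toks[i:i + n])
              for toks in responses
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

A system that repeats the same safe reply scores low, since the repeated n-grams collapse into a small distinct set.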
In summary, NEXUS network clearly generates higher-quality responses in both coherence and diversity, even on a rather small dataset like DailyDialog. NEXUS-H contributes more to the coherence and NEXUS-F more to the diversity. (Generic responses are identified with a simple rule-matching method (see Appendix A.5); manual inspection on a validation subset finds its accuracy to be above 90%. Similar methods are adopted in Li et al. (2016c).)

Human Evaluation
We also employed crowdsourced judges to evaluate a random sample of 500 items from the DailyDialog test set. Participants were asked to assign a binary score to each context-response pair from three perspectives: whether the response coincides with its preceding context (Pri), whether the response is interesting enough for people to continue the conversation (Post), and whether the response itself is a fluent natural sentence (Flu). Each sample gets one point if judged as yes and zero otherwise. Each pair is judged by three participants, and the majority score is adopted. We also evaluated inter-annotator consistency with Fleiss' kappa (Fleiss, 1971), obtaining scores of 0.452 for Pri, 0.459 for Post (moderate agreement) and 0.621 for Flu (substantial agreement), which implies that most context-response pairs reach a consensus in the evaluation task. We compute the average human score for each model. Unlike the metric-based scores, the human evaluation is conducted only on the DailyDialog corpus, as it contains less noise and can be more fairly evaluated by human judges. Table 3 shows the results in the last three columns. As can be seen, the Pri and Post human scores are highly correlated with the automatic metrics "coherence" and "#turns", verifying the validity of these two metrics. As for fluency, there is no significant difference among most models; as we also manually verified, fluency is not a major problem, and all models produce mostly well-formed sentences. Overall, NEXUS network produces responses that are more acceptable to human judges. We also examined sample outputs from the MMI, VHRED, RL and NEXUS models, and NEXUS network generates more interactive outputs than the other three. Though reranked by the bidirectional language model, the MMI decoder still produces quite a few generic responses. VHRED's utterances are more diverse, but it only cares about answering the immediate query and makes no effort to bring up further topics.
Moreover, it also generates more inappropriate responses than the others. RL provides diverse responses, but they are sometimes not fluent or coherent enough. We do observe that NEXUS sometimes generates over-complex questions that are not very natural, as in the second example, but in most cases it outperforms the others.

Conclusion
In this paper, we propose the NEXUS Network to enable more interactive human-computer conversations. The main goal of our model is to strengthen the "nexus" role of the current utterance, connecting both the preceding and the following dialogue context. We compare our model with MMI, reinforcement learning and CVAE-based models. Experiments show that NEXUS network consistently produces higher-quality responses. The model is easier to train, requires no special tricks, and demonstrates remarkable generalization capability even on a very small dataset.
Our model can be considered as combining the objectives of MMI and CVAE and is compatible with current improvement techniques. For example, mutual information can be maximized under a tighter bound using the Donsker-Varadhan or f-divergence representation (Donsker and Varadhan, 1983; Nowozin et al., 2016; Belghazi et al., 2018). Extending the code space distribution beyond the Gaussian via the importance weighted autoencoder (Burda et al., 2015), inverse autoregressive flow (Kingma et al., 2016) or VampPrior (Tomczak and Welling, 2018) should also help performance.