Response-Anticipated Memory for On-Demand Knowledge Integration in Response Generation

Neural conversation models are known to generate appropriate but non-informative responses in general. A scenario where informativeness can be significantly enhanced is Conversing by Reading (CbR), where conversations take place with respect to a given external document. In previous work, the external document is utilized by (1) creating a context-aware document memory that integrates information from the document and the conversational context, and then (2) generating responses referring to the memory. In this paper, we propose to create the document memory with some anticipated responses in mind. This is achieved using a teacher-student framework. The teacher is given the external document, the context, and the ground-truth response, and learns how to build a response-aware document memory from these three sources of information. The student learns to construct a response-anticipated document memory from the first two sources and the teacher's insight on memory creation. Empirical results show that our model outperforms the previous state-of-the-art for the CbR task.


Introduction
Neural conversation models have achieved promising performance in response generation. However, it is widely observed that the generated responses lack sufficient content and information (Li et al., 2016a). One way to address this issue is to integrate various external information into conversation models. Examples of external information include document topics (Xing et al., 2017), commonsense knowledge graphs (Zhou et al., 2018), and domain-specific knowledge bases (Yang et al., 2019). Conversing by reading (CbR) (Qin et al., 2019) is a recently proposed scenario where external information can be ingested into conversations. In CbR, conversations take place with reference to a document. The key problem in CbR is to learn how to integrate information from the external document into response generation on demand.
To exploit knowledge from documents for conversations, a conventional way is to extend the sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014) with Memory Networks (Sukhbaatar et al., 2015), which store knowledge representations accessible to their decoder (Ghazvininejad et al., 2018; Parthasarathi and Pineau, 2018). Dinan et al. (2018) propose to encode the dialogue context as well as a set of retrieved knowledge by Transformer (Vaswani et al., 2017) to construct the memory. However, these methods only use sentence-level representations of the documents in the memory, which cannot pinpoint accurate token-level document information.
To discover token-level document information, researchers borrow models from other generation tasks, which are adept at extracting segments of sentences for given questions. Moghe et al. (2018) explore the pointer generator network (See et al., 2017) for abstractive summarization and the bidirectional attention flow model (Seo et al., 2017), which is a QA model to predict a span of the document to be contained in the response. Qin et al. (2019) follow the stochastic answer network (SAN) (Liu et al., 2018) in machine reading comprehension (MRC), integrating both context and document information to form the context-aware document memory. This approach obtains the state-of-the-art performance on the CbR task.
However, we should notice the difference between existing generation tasks and CbR. Summarization, QA, and MRC require models to extract exact answers from documents, where documents cover all requisite knowledge. In contrast, CbR expects the model to output a general utterance relevant to both context and document. In the example in Fig. 1, the document refers to actor, films, fans, wealthy and the context mentions disease. Document and context discuss the same person but have no topic overlap; thus we cannot pinpoint document information from the context. If we use SAN as in Qin et al. (2019), SAN can hardly acquire helpful information from context-document interaction. To ingest useful knowledge for response generation, we argue that processing documents should consider not only the interaction between context and document but also the target response. In the example, the document should attend more to fans, wealthy by considering the response.
In this work, we propose a method to construct a response-anticipated memory to contain document information that is potentially more important in generating responses. Particularly, we construct a teacher-student framework based on Qin et al. (2019). The teacher model accesses the ground-truth response, context, and document. It learns to construct a weight matrix that contains information about the importance of tokens in the document to the response. The student model learns to mimic the weight matrix constructed by the teacher without access to the response. That is, the teacher learns to build a response-aware memory, while the student learns to build a response-anticipated memory. During inference on testing data, only the student is applied. Our experiments show that our model exceeds all competing methods.

Related Work
Most neural conversation models in open domain chit-chat scenarios are based on the Seq2Seq model (Sutskever et al., 2014; Shang et al., 2015). A critical issue of these models is the safe response problem, i.e., generated responses often lack enough content and information. To address this issue, previous work encourages response diversity and informativeness by introducing new training objectives (Li et al., 2016b), refining beam search strategies (Li et al., 2016a; Vijayakumar et al., 2018), exploiting information from conversational contexts (Serban et al., 2017), or incorporating retrieval-based conversation systems (Song et al., 2018; Wu et al., 2019b; Tian et al., 2019).
Some researchers augment information in generating responses by external resources. Zhou et al. (2018) utilize the commonsense knowledge graph by their designed graph attention. Agarwal et al. (2018) propose a knowledge encoder to encode query-entity pairs from the knowledge base. Wu et al. (2019a) enrich response generation with knowledge triplets. These works all use knowledge information in structured formats.
External unstructured text information has also been investigated to improve conversation models. Some researchers directly build "document memory" by integrating distributed representations of the knowledge sentences into conversation models (Ghazvininejad et al., 2018; Parthasarathi and Pineau, 2018). Dinan et al. (2018) make use of the Transformer (Vaswani et al., 2017) to encode the knowledge sentences as well as the dialogue context. A knowledge selector has also been designed to construct the document memory on selective knowledge information. As stated in the introduction, some other researchers borrow models from other generation tasks, including abstractive summarization models (Moghe et al., 2018), QA models (Moghe et al., 2018) and MRC models (Meng et al., 2020; Qin et al., 2019). In particular, Qin et al. (2019) obtain the state-of-the-art performance. However, they all construct the document memory relying on connections between context and document without consideration of the response. If the context or document contains many noise tokens irrelevant to the response, which is indeed the case in CbR, the constructed memory may be misled by this noisy information (as in the case in Fig. 1). Therefore, we propose to involve the consideration of responses in the memory construction, which can benefit generating a more desired response.

Methodology
In this section, we will first give an overall description of the proposed teacher-student architecture for CbR, then briefly describe the base model. The detailed teacher model and student model are presented in Sec 3.3 and 3.4. Lastly, we summarize the training updates of the two models in Sec 3.5.

Model Architecture
The CbR task provides a conversation context X and a document D as inputs, requiring the model to generate a response R to X by referring to D.
In the rest of the paper, we use |X|, |D|, and |R| to denote the number of tokens in X, D, and R respectively. To pinpoint accurate document information for response generation, we design a teacher-student framework to construct document memory as follows:
• The teacher model learns a response-aware document memory M used in our base conversation model. Specifically, we construct a response-aware weight matrix $G \in \mathbb{R}^{|D| \times |D|}$, which considers the correlation between context-aware document representations and response representations, and then impose G on the memory matrix M. The teacher model is optimized to reconstruct the response with the use of the response-aware memory M.
• The student model learns to construct a response-anticipated weight matrix to estimate G used in the teacher model but without access to the response. It is a feed-forward neural network with document and context as its input.
The teacher model and the student model are jointly optimized with training data, while only the student model is applied to testing data.

Base Model
Following Qin et al. (2019), we use SAN (Liu et al., 2018) as our base model, which mainly consists of three components:
• Input encoder: We use two bi-directional LSTM encoders to extract token-level representations of the document D and the context X.
• Memory construction: We build the document memory $M \in \mathbb{R}^{|D| \times k}$ ($k$ is the hidden size of the memory), which will be used in the decoder. A cross-attention layer is first applied to the outputs of the two encoders to integrate information from the context into the document. Then, we obtain a set of context-aware document representations $D = [d_1, \ldots, d_{|D|}]$. Since each $d_i$ corresponds to a document token, we treat it as the context-aware token representation of the $i$-th token. Next, a self-attention layer is employed to ingest salient information of the context-aware document representations:
$$M = A D^{\top}, \quad A = \mathrm{softmax}(D^{\top} D), \tag{1}$$
where the $d_i$ form the columns of $D$ and the softmax conducts the normalization over each row of the matrix.
• Output decoder: We use an attentional recurrent decoder to generate response tokens by attending to the memory M. The initial hidden state is set as the summation of token-level context representations. For each decoding step t, we get a hidden state $h_t$, where [;] indicates concatenation in its computation and the cross-attention layer here integrates information from the memory into the recurrent outputs. $e_{t-1}$ is the word embedding at step t − 1. Finally, we generate a token $y_t$ by a softmax on $h_t$.
Our model modifies the memory construction by refining its self-attention layer so that the memory represents more accurate and on-demand knowledge that helps generate the response.
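To make the memory construction step concrete, the following is a minimal sketch of a self-attention memory in plain Python. It assumes dot-product attention scores between the context-aware token vectors and a row-wise softmax; the function names and the toy vectors are illustrative and are not taken from the released code.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax of a list-of-lists matrix."""
    out = []
    for row in scores:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out

def self_attention_memory(doc):
    """doc: list of |D| context-aware token vectors d_i, each of length k.

    Returns (A, M), where A is the |D| x |D| attention matrix (assumed to be
    a row-softmax over pairwise dot products) and M[i] = sum_j A[i][j] * d_j.
    """
    n, k = len(doc), len(doc[0])
    scores = [[sum(a * b for a, b in zip(doc[i], doc[j])) for j in range(n)]
              for i in range(n)]
    A = softmax_rows(scores)
    M = [[sum(A[i][j] * doc[j][c] for j in range(n)) for c in range(k)]
         for i in range(n)]
    return A, M

# Toy document with |D| = 3 tokens and hidden size k = 2.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A, M = self_attention_memory(doc)
```

Each row of `A` is one attention distribution over document tokens, and `M` has one k-dimensional row per document token, matching the stated shape of the memory.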

Teacher Model
To ingest accurate memory information for response generation under the aforementioned base model, our teacher model builds a response-aware weight matrix $G \in \mathbb{R}^{|D| \times |D|}$ given the context-aware document representation D and the response R, then refines the document memory M with G. Elements of G indicate the importance of tokens or token pairs in the document, with consideration of the response information.
First, we describe how to modify the memory matrix M when G is given. The original memory M is constructed by a self-attention operation as in Eq. 1. To facilitate response awareness, we update the attention weight matrix A by element-wise multiplication with G, and then obtain the refined memory M by using the updated attention weights in place of A (Eq. 4).
In the following, we describe two methods to construct the response-aware weight matrix G:
(1) We measure the response-aware token importance (RTI) considering the ground-truth response to construct G.
(2) We measure the response-aware pairwise importance (RPI) of each token pair (i, j), which can be directly assigned to the element $G_{ij}$ in G.
For both methods, matrix elements can be either continuous or binary.

Figure 2: The architecture of our model. Blocks and lines in gray color compose the base model. Blue and gray parts compose the teacher model, while purple parts compose the student model. All components work for training, while only the student model and the decoder work for inference. In the response-aware/anticipated weight matrix, darker grids indicate higher weights. The figure also marks matrix multiplication and element-wise matrix multiplication with dedicated symbols. Component labels in the figure: Encoder, Cross Attention, Self Attention. The running example in the figure uses the document "Jackie was a renowned actor and starred many films, so he had many fans. He's generous and wealthy." and the context "But, he's struggling with diseases now."

Response-Aware Token Importance (RTI)
We denote the response-aware token importance of document tokens as $\beta \in \mathbb{R}^{|D|}$, and measure it by the response R and the context-aware token representation D. To obtain β, we first apply an encoder to obtain the token-level representations of the response as $[r_1, \ldots, r_{|R|}]$ and use its last hidden state $r_{|R|}$ as the sentence-level response representation. The response-aware token importance of token i is defined as the similarity between its context-aware token representation $d_i$ and the response representation $r_{|R|}$. Next, we adjust each attention distribution (i.e., each column of A) with each of its attention weights multiplied by the token importance $\beta_i$. Therefore, the resulting G is obtained by combining β with $\mathbf{1} \in \mathbb{R}^{|D|}$, a vector with all elements equal to 1 (Eq. 5). By plugging the above G into Eq. 4, we can construct a memory matrix with signals borrowed from the response. In this way, the self-attention distributions can adjust to emphasize important tokens, and their corresponding context-aware document token representations become more important in the memory matrix.
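The RTI construction can be sketched as follows. This is only an illustration under stated assumptions: the similarity is taken to be a sigmoid of the dot product between each token vector and the response vector, and G is taken to have entries G[i][j] = beta[j] (one possible reading of "identical column values"). Neither choice is confirmed by the text above, and all function names are hypothetical.

```python
import math

def token_importance(doc, resp_vec):
    """beta_i = sigmoid(d_i . r): an ASSUMED similarity, squashed into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(d, resp_vec))))
            for d in doc]

def rti_weight_matrix(beta):
    """Build G with G[i][j] = beta[j] (ASSUMED orientation of the outer product)."""
    n = len(beta)
    return [[beta[j] for j in range(n)] for _ in range(n)]

def refine_memory(A, G, doc):
    """Refined memory: M[i] = sum_j (A[i][j] * G[i][j]) * d_j,
    i.e. the element-wise product of A and G is used in place of A."""
    n, k = len(doc), len(doc[0])
    return [[sum(A[i][j] * G[i][j] * doc[j][c] for j in range(n))
             for c in range(k)] for i in range(n)]

doc = [[1.0, 0.0], [0.0, 1.0]]      # two context-aware token vectors
resp_vec = [2.0, -2.0]              # sentence-level response representation
A = [[0.5, 0.5], [0.5, 0.5]]        # uniform self-attention for illustration
beta = token_importance(doc, resp_vec)
G = rti_weight_matrix(beta)
M = refine_memory(A, G, doc)
```

In this toy case the first token is more similar to the response, so it receives a larger importance and contributes more to every row of the refined memory than the second token does.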
Recall that the document contains a large amount of noisy information in CbR. Thus the attention distributions may become long-tailed due to the existence of many redundant document tokens. Hence, we can further construct a binary weighting vector based on β. We keep the weight of each element as 1 with probability $\beta_i$ calculated in Eq. 5. If the weight of a token turns to 0, this token is deactivated in calculating the attention distributions. However, the binary weight sampled from the Bernoulli distribution is not differentiable. To enable back-propagation in our model, we apply the Gumbel-Softmax (Jang et al., 2016) to approximate the Bernoulli distribution in the training phase, and sample the binary value from the Bernoulli distribution in the prediction phase (Eq. 6), where g(β) denotes the Gumbel-Softmax approximation used in training.
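The training/prediction split for binary weights can be illustrated with a two-class Gumbel-Softmax relaxation. The sketch below is a generic version of that technique, not the paper's exact g(β): the temperature `tau`, the clipping constants, and the function names are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gumbel_noise(rng):
    """Sample from Gumbel(0, 1) via -log(-log(u)), with u clipped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def relaxed_bernoulli(p, tau, rng):
    """Training-time surrogate: two-class Gumbel-Softmax with classes {keep, drop}.
    Returns a soft weight in [0, 1] that is differentiable w.r.t. p in an
    autodiff framework (here it is only computed numerically)."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    logit = math.log(p) - math.log(1.0 - p)
    return sigmoid((logit + gumbel_noise(rng) - gumbel_noise(rng)) / tau)

def hard_bernoulli(p, rng):
    """Prediction-time sample: exact binary draw with P(1) = p."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)
soft = [relaxed_bernoulli(0.9, 0.5, rng) for _ in range(2000)]
hard = [hard_bernoulli(0.9, rng) for _ in range(2000)]
```

Lower temperatures push the relaxed samples toward 0 or 1, at the cost of higher-variance gradients; the temperature used by the authors is not stated in this section.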
The objective function of the teacher model is to maximize the log-likelihood of responses generated by the response-aware memory constructed with β (Eq. 8), where $f_t$ denotes the operations in Eq. 5 and the operations preceding it. $\theta_t$ consists of all parameters in the layers of $f_t$. φ denotes the parameters in Eq. 1 to Eq. 3. Both φ and $\theta_t$ are learnable parameters for $J_t$.

Response-Aware Pairwise Importance (RPI)
Instead of using token importance, we can construct G by the pairwise importance of token pairs. After obtaining the token representations $[r_1, \ldots, r_{|R|}]$ from the response encoder as in RTI, we can calculate the similarity of each $d_i$ to all $r_j$'s, denoted as $n_i \in \mathbb{R}^{|R|}$. Each element in G can be associated with a weight $B_{ij}$ defined as the inner product between $n_i$ and $n_j$. Thus, we can treat B as the response-aware pairwise importance, and directly set each element in G as $B_{ij}$:
$$G_{ij} = B_{ij} = n_i^{\top} n_j. \tag{9}$$
Compared with response-aware token importance, in which the designed G has identical column values, response-aware pairwise importance allows different values for different index pairs (i, j) in G (but (i, j) and (j, i) have the same value since G is symmetric). Thus, the space of G is larger.
Note that the aforementioned binary processing with each $\beta_i$ can also be applied to each $B_{ij}$ here, in which case the resulting G is binary. By using a binary G in our model, the memory construction can be considered as passing through a Graph Attention Network (GAT) (Veličković et al., 2018), which also constructs a graph and updates its representations relying on the information from itself and neighbors on the graph. However, our neighborhood matrix (i.e., G in our model) is not predefined as in GAT but dependent on the inputs $d_i$'s and $r_j$'s, which involve parameters to be estimated.
The objective of the teacher model for RPI can be modified from Eq. 8 by replacing β with B obtained in Eq. 9.
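The pairwise importance can be sketched in a few lines. The example assumes a plain dot product as the similarity between a document token and a response token (the section does not specify the similarity function), and the helper name is hypothetical.

```python
def pairwise_importance(doc, resp):
    """doc: |D| token vectors d_i; resp: |R| token vectors r_j.

    n_i[j] = d_i . r_j   (ASSUMED similarity: plain dot product)
    B[i][j] = n_i . n_j  (inner product of the two similarity profiles)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    N = [[dot(d, r) for r in resp] for d in doc]
    size = len(doc)
    return [[dot(N[i], N[j]) for j in range(size)] for i in range(size)]

doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
resp = [[1.0, 2.0], [0.0, 1.0]]
B = pairwise_importance(doc, resp)
# B == [[1.0, 2.0, 3.0], [2.0, 5.0, 7.0], [3.0, 7.0, 10.0]]
```

The result is symmetric by construction, and unlike the token-importance case its entries can differ along both rows and columns.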

Student Model
The student model learns to construct a response-anticipated weight matrix to estimate the weight matrix G in the teacher model without access to the ground-truth R. If we employ RTI, the estimated target of the student model is β in Eq. 5. For RPI, the estimated target is B in Eq. 9.
Given D and X as inputs, we apply a bilinear attention layer to obtain a hidden representation matrix H. We apply a two-layer multi-layer perceptron (MLP) with ReLU activation to estimate β; for RPI, we combine two attention outputs by $W_a$ to estimate B. The objective function of the student model is to maximize the log-likelihood of generating responses based on the estimated β or B, and to diminish the gap between the weighting vector or matrix of the student model and that of the teacher model by a mean square loss. Taking the RTI strategy as an example, we optimize an objective that combines these two terms (Eq. 12), where $f_s$ denotes the operation in Eq. 11 and the operations preceding it, $\theta_s$ consists of the layer parameters in $f_s$, and λ balances the two loss terms. For RPI, we replace β and its estimate with B and its estimate.
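As a rough illustration of the student objective, the sketch below writes it as a loss to minimize: the negative log-likelihood of the response plus a λ-weighted mean square error between the student's and the teacher's importance vectors. Placing λ on the mean square term, and the function and argument names, are assumptions for the example.

```python
def student_loss(token_log_probs, beta_student, beta_teacher, lam=1.0):
    """Loss form of the student objective (to be minimized).

    token_log_probs: log-probabilities of the ground-truth response tokens
                     under the decoder fed with the student's memory.
    beta_student:    importance vector estimated by the student.
    beta_teacher:    importance vector produced by the (fixed) teacher.
    lam:             balance between the two terms (ASSUMED to scale the MSE).
    """
    nll = -sum(token_log_probs)
    mse = sum((s - t) ** 2 for s, t in zip(beta_student, beta_teacher)) / len(beta_teacher)
    return nll + lam * mse

loss = student_loss([-0.5, -1.0], [0.2, 0.8], [0.0, 1.0], lam=1.0)
```

With these toy numbers the likelihood term is 1.5 and the mean square term is 0.04, so the total is about 1.54.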

Model Training
We first train the teacher model until it converges, and then train the student model with the use of β or B from the converged teacher model. Next, we repeat the above processes iteratively. In the training of the teacher model, we fix parameters in $\theta_s$ (except parameters shared with $\theta_t$) and train the model subject to $J_t$; for the student model, we fix φ and $\theta_t$ (except parameters shared with $\theta_s$) and train the model subject to $J_s$. For inference, only the student model is used to infer the response-anticipated weight matrix, and the decoder applies it for generating the output response. As discussed for RPI above, RPI has better model capacity because the weight matrix B allows a larger space of G than the token importance vector β in RTI. In terms of optimization, however, RPI requires estimating more parameters, which makes training more difficult.

Dataset
We use the dataset for the CbR task released by Qin et al. (2019). The dataset contains crawled articles and discussions about these articles from Reddit. The articles act as the documents, while the discussions serve as conversational contexts and responses. In total, we have 2.3M/13k/1.5k samples for training/testing/validation.

Implementation Details
For all methods, we set the word embedding dimension to 300 with the pre-trained GloVe (Pennington et al., 2014). Following Qin et al. (2019), our vocabulary contains the top 30k frequent tokens. We use bi-LSTMs with hidden dimensions of 512 and a dropout rate of 0.4 in our encoders. We optimize models by Adam with an initial learning rate of 0.0005 and a batch size of 32. All conversation contexts/responses/documents are truncated to have the maximum length of 30/30/500. For training, we set λ as 1 in the loss of student models after tuning. For inference, we apply a top-k random sampling decoding (Edunov et al., 2018) with k=20. The validation set is used for early stopping. The aforementioned implementation details can be found in our released code (see footnote 1).
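Top-k random sampling, used here for decoding, restricts each step to the k highest-scoring tokens and samples among them in proportion to their softmax probabilities. The following is a generic single-step sketch of that technique; the function name and the toy logits are illustrative only.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`.

    The kept logits are renormalized with a softmax; all other tokens
    receive zero probability.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 3.0, 2.5, -1.0, 0.0]
draws = [top_k_sample(logits, 2, rng) for _ in range(200)]
greedy = top_k_sample(logits, 1, rng)
```

With k=2 only token ids 1 and 2 can ever be drawn, and k=1 reduces to greedy decoding; the paper's setting corresponds to k=20 over the 30k-token vocabulary.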

Competing Methods
1. Seq2Seq (Sutskever et al., 2014). The standard Seq2Seq model that leverages only the conversational context for response generation.
2. MemNet (Ghazvininejad et al., 2018). A knowledge-grounded conversation model that uses a memory network to store knowledge facts.
3. GLKS. It applies a global knowledge selector in encoding and a local selector on every decoding step.
4. Conversation with Machine Reading (CMR) (Qin et al., 2019). The state-of-the-art model on the CbR task, which is also our base model (Sec 3.2). Here, we use the full model of CMR (called CMR+w in Qin et al. (2019)), since the full model outperforms other variants of CMR on most metrics. We further apply the copy mechanism (See et al., 2017) to this base model (CMR+Copy).
5. Four variants of our proposed models: RAM T denotes our Response-Anticipated Memory-based model with RTI, and RAM T+Copy denotes its copy version. RAM P and RAM P+Copy denote our model with RPI and its copy variant.

Footnote 1: https://github.com/tianzhiliang/RAM4CbR

Evaluation Metrics
Following all metrics in Qin et al. (2019), we evaluate all methods by both automatic and human evaluations. For automatic evaluations, we evaluate the responses in three aspects:
1. Appropriateness. We use three metrics to evaluate the overall quality of a response: BLEU-4 (Papineni et al., 2002), Meteor (Banerjee and Lavie, 2005), and NIST (Doddington, 2002). NIST is a variant of BLEU that measures n-gram precision weighted by the informativeness of n-grams.
2. Grounding. We measure the relevance between documents and generated responses to reveal how effectively responses exploit the document information. We define #overlap as the number of non-stopword tokens that appear in both the document D and the generated response but not in the context X. We calculate the precision P and recall R based on #overlap, where S denotes the stopword list. F1 is the harmonic mean of precision P and recall R. We further propose to measure the effectiveness of exploiting the document information considering the ground truth. In this way, we evaluate how much ground-truth information models can exploit from the document. We define #overlap GT as the number of non-stopword tokens that appear in the document D, the generated response, and the ground-truth R but not in the context X. The precision P GT and recall R GT are defined with #overlap GT, and F1 GT is the harmonic mean of precision P GT and recall R GT.
For human evaluations, we hire five annotators from a commercial annotation company to evaluate 200 randomly selected test samples, and results from different models are shuffled. The annotators evaluate on a 5-point scale in three aspects: overall quality (H-Appr), relevance with documents (H-Ground), and informativeness (H-Info).
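The grounding metric can be sketched as below. The definition of #overlap follows the text, but the denominators are assumptions, since the exact formulas are not reproduced in this section: precision is taken relative to the non-stopword tokens of the generated response and recall relative to those of the document, and tokens are counted as distinct types. The function name and the toy stopword list are hypothetical.

```python
def grounding_scores(doc, response, context, stopwords):
    """Return (precision, recall, F1) for document grounding.

    #overlap = non-stopword tokens in both the document and the generated
    response but not in the context. Denominators are ASSUMED: non-stopword
    response tokens for precision, non-stopword document tokens for recall.
    """
    doc_set = set(doc) - stopwords
    resp_set = set(response) - stopwords
    overlap = (doc_set & resp_set) - set(context)
    p = len(overlap) / len(resp_set) if resp_set else 0.0
    r = len(overlap) / len(doc_set) if doc_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

stop = {"a", "was", "he", "has", "is", "with", "now"}
doc = "jackie was a renowned actor with many fans".split()
response = "he has many fans".split()
context = "he is struggling now".split()
p, r, f1 = grounding_scores(doc, response, context, stop)
```

Here both content words of the response come from the document and not from the context, so precision is 1.0, while recall is 2 out of 5 document content words. The ground-truth variant would additionally intersect the overlap with the tokens of the reference response.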

Experimental Results and Analysis
In this part, we first show the performance of all methods in Sec 5.1. Then, we validate the effectiveness of response anticipation on CbR in Sec 5.2 by comparing the tokens most similar to the response with the tokens that receive the highest attention in the memory. We also compare more variants of our model in Sec 5.3, including token importance versus pairwise importance, and each method with continuous weights versus its variant with binary weights. Finally, we conduct a case study in Sec 5.4.

Overall Performance
Results of all models on automatic and human evaluations are shown in Table 1 and Table 3. MemNet outperforms Seq2Seq on most metrics, which validates that it is important to utilize document information in CbR. However, MemNet only slightly improves on Grounding. Both GLKS and CMR outperform MemNet on most metrics, indicating that it matters how to construct the document memory used in conversation models for CbR. Compared with CMR, CMR+Copy is more competitive on Grounding but weaker on other metrics. Our proposed models outperform other competing methods on all metrics, including automatic and human evaluations. For models without the copy mechanism, RAM T performs the best. For models with copy, RAM T+Copy and RAM P+Copy outperform CMR+Copy on most metrics. Overall, our proposed strategy works well on models both with and without the copy mechanism. We will compare RAM T and RAM P in detail in Sec 5.3.

Effectiveness of Response Anticipation
In this section, we investigate whether anticipating the response contributes to building a better document memory. We first calculate the semantic similarity between each document token and the response using their GloVe embeddings, and select the top K document tokens. Next, we accumulate the attention weights of each token in all attention distributions in the self-attention weights A in Eq. 1, i.e., summation over each column of A. Then we select the top K tokens according to their accumulated attention weights. Here, we set K = 10, 20. We apply metrics in Liu et al. (2016) to calculate the similarity of the two token sets extracted above, including maximal token-token embedding similarity (Emb-M) and bag-of-words embedding similarity (Emb-B). A higher similarity score indicates more response information anticipated by the model. Table 4 shows the results of our two models RAM T and RAM P as well as CMR (we use the original self-attention matrix A for the above calculation for CMR). Results demonstrate that our model is able to output more response-anticipated self-attention distributions, which benefits generating a response close to the ground truth.

Table 3 (recovered excerpt): generated responses on the two test cases.
(model label lost) i remember my comment on my post. / and i am not sure why but my point is that he has the best score that will always get a good
CMR+Copy. Case 1: they are , but not the same as the first one . Case 2: he also played the same game, is there title to be a team
RAM T. Case 1: i think we have num teams playing the premier league team. in my opinion he was not a good player, but the united kingdom was in the europa Case 2: i love him the next time i play for num years, so that is probably the only option i understand.
RAM T+Copy. Case 1: they are the best player in the world. Case 2: he also played the second one, but that doesn't mean it was num years ago.
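The accumulation-and-selection step used in this analysis (summing each column of the self-attention matrix and keeping the K tokens with the largest totals) can be sketched as follows. The function name and the toy attention matrix are illustrative; the subsequent embedding-based set similarity is not shown.

```python
def top_k_by_accumulated_attention(A, tokens, k):
    """Rank document tokens by the total attention they receive.

    A[i][j] is the attention that position i pays to token j, so the
    accumulated weight of token j is the sum over column j.
    """
    col_sums = [sum(row[j] for row in A) for j in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda j: col_sums[j], reverse=True)
    return [tokens[j] for j in order[:k]]

A = [[0.1, 0.7, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
tokens = ["the", "premier", "league"]
top2 = top_k_by_accumulated_attention(A, tokens, 2)
```

In this toy example the column totals are roughly 0.6, 1.6, and 0.8, so "premier" and "league" are selected.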

Analysis on Different Model Variants
Token importance vs Pairwise importance.
We compare our model variants with different strategies to construct the response-aware/anticipated weight matrix, i.e., RAM T (Eq. 5) and RAM P (Eq. 9). We compare not only their overall performance under the teacher-student framework (Eq. 8 & 12) but also their performance with the teacher model only (Eq. 8). The first four rows in Table 2 show the results. We have an interesting finding that RAM P underperforms RAM T in the full teacher-student framework, but outperforms RAM T in the setting with the teacher model only on most metrics. This result is consistent with our discussion in Sec 3.5 that RAM P has a higher capacity to carry more information in G, thus its teacher model yields better performance. However, for the student model, RAM P is more difficult to converge to a good local optimum because more parameters need to be estimated, so its overall performance may not exceed that of RAM T.

Continuous weight vs Binary weight.
We also compare the model variants with continuous weight (Eq. 5) and binary weight (Eq. 6). The last two rows in Table 2 give the results of the variants of RAM T and RAM P with a binary G. We can see that both RAM T and RAM P with a binary weight matrix perform better on Appropriateness, which means a sparse G on the attention matrix can help select more concise information to construct the memory. Nevertheless, models with a continuous weight matrix can generate more informative responses owing to their ability to access broader information from the document.

Case Study
Table 3 shows two test samples with generated responses of all models. For Case 1, Seq2Seq and MemNet cannot generate responses relevant to either the document or context. CMR catches the topic "sports", while GLKS and CMR+Copy use "first person" and "first one" to reflect "only two" mentioned in the document. The response of RAM T contains information related to both document ("num teams" and "premier league") and context ("europa"). RAM T+Copy is also highly relevant to the document and the context, and copies "player" from the document. For Case 2, the first four methods have little relation to the document or the context. CMR+Copy mentions "played". Our models mention "played" and "num years". The examined cases suggest that our method yields promising improvements over existing methods. However, generation on the CbR task is very challenging and there is still much room for improvement.
We plot the accumulated attention weights of the document tokens for RAM T and CMR, computed as in Sec 5.2, on Case 1. Fig. 4 shows that RAM T's attention highlights "num" and "premier", and thus it generates these words in its response.

Conclusion
Focusing on the CbR task, we propose a novel response-anticipated document memory to exploit and memorize the document information that is important in response generation. We construct the response-anticipated memory by a teacher-student framework. The teacher accesses the response and learns a response-aware weight matrix; the student learns to estimate the weight matrix in the teacher model and construct the response-anticipated document memory. We verify our model on both automatic and human evaluations and experimental results show our model obtains the state-of-the-art performance on the CbR task.