A Compare Aggregate Transformer for Understanding Document-grounded Dialogue

Unstructured documents serving as external knowledge for dialogue help generate more informative responses. Previous research focused on knowledge selection (KS) from the document given the dialogue. However, dialogue history that is not related to the current utterance may introduce noise into the KS process. In this paper, we propose a Compare Aggregate Transformer (CAT) to jointly denoise the dialogue context and aggregate the document information for response generation. We design two different comparison mechanisms to reduce noise (before and during decoding). In addition, we propose two metrics for evaluating document utilization efficiency based on word overlap. Experimental results on the CMU_DoG dataset show that the proposed CAT model outperforms the state-of-the-art approach and strong baselines.


Introduction
Dialogue systems (DS) attract great attention from industry and academia because of their wide application prospects. Sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014; Serban et al., 2016) have been verified to be an effective framework for the DS task. However, one problem with Seq2Seq models is that they tend to generate generic responses that provide deficient information (Li et al., 2016; Ghazvininejad et al., 2018). Previous researchers proposed different methods to alleviate this issue. One way is to improve models' ability to extract information from conversations. Li et al. (2016) introduced Maximum Mutual Information (MMI) as the objective function for generating diverse responses. Serban et al. (2017) proposed a latent variable model to capture posterior information of the gold response. Zhao et al. (2017) used conditional variational autoencoders to learn discourse-level diversity for neural dialogue models.

Figure 1: An example of document-grounded dialogue.
Document: Movie Name: The Shape of Water. Year: 2017. Director: Guillermo del Toro. Genre: Fantasy, Drama. Cast: Sally Hawkins as Elisa Esposito, a mute cleaner who works at a secret government laboratory. ... Critical Response: one of del Toro's most stunningly successful works ...
Dialogue: S1: I thought The Shape of Water was one of Del Toro's best works. What about you? S2: Yes, his style really extended the story. S1: I agree. He has a way with fantasy elements that really helped this story be truly beautiful. It has a very high rating on rotten tomatoes, too. S2: Sally Hawkins acting was phenomenally expressive. Didn't feel her character was mentally handicapped. S1: The characterization of her as such was definitely off the mark.
The Document-grounded Dialogue (DGD) (Zhou et al., 2018b; Zhao et al., 2019; Li et al., 2019) is a new way to use external knowledge. It establishes a conversation mode in which relevant information can be obtained from the given document. One example of DGD is presented in Figure 1: two interlocutors talk about the given document and freely reference its text segments during the conversation.
To address this task, a DGD model needs to handle two main challenges: 1) determining which historical turns are related to the current turn, and 2) using the current turn and the related history to select proper document information and to generate an informative response. Previous work (Arora et al., 2019; Zhao et al., 2019; Qin et al., 2019; Tian et al., 2020) generally focused on selecting knowledge with all of the conversation. However, the relationship between the dialogue history and the current turn has not been studied enough. For example, in Figure 1, the italicized utterance, "Yes, his style really extended the story.", is related to the dialogue history, while the bold utterance, "Sally Hawkins acting was phenomenally expressive. Didn't feel her character was mentally handicapped.", has no direct relationship with the historical utterances. When this sentence serves as the last utterance, the dialogue history is not conducive to generating a response.
In this paper, we propose a novel Transformer-based (Vaswani et al., 2017) model, named Compare Aggregate Transformer (CAT), for understanding the dialogue and generating informative responses in the DGD. Previous research (Sankar et al., 2019) has shown that the last utterance is the most important guidance for response generation in the multi-turn setting. Hence we divide the dialogue into the last utterance and the dialogue history, then measure the usefulness of the dialogue history. If the last utterance and the dialogue history are related, we need to consider the whole conversation to filter the document information. Otherwise, the dialogue history amounts to noise, and its impact should be eliminated conditionally. For this purpose, on one side, the CAT filters the document information with the last utterance; on the other side, the CAT uses the last utterance to guide the dialogue history and employs the guiding result to filter the given document. We judge the importance of the dialogue history by comparing the two parts, then aggregate the filtered document information to generate the response. Experimental results show that our model generates more relevant and informative responses than competitive baselines, and it proves even more effective when the dialogue history is less relevant to the last utterance. The main contributions of this paper are: (1) We propose a compare aggregate method to determine the relationship between the dialogue history and the last utterance. Experiments show that our model outperforms strong baselines on the CMU_DoG dataset.
(2) We propose two new metrics to evaluate document knowledge utilization in the DGD. Both are based on the N-gram overlap among the generated response, the dialogue, and the document.

Related Work
The DGD maintains a dialogue pattern in which external knowledge can be obtained from a given document. Several DGD datasets have been released recently. Models addressing the DGD task can be classified into two categories based on how they encode the dialogue: parallel modeling and incremental modeling. For the first category, Moghe et al. (2018) used a generation-based model that learns to copy information from the background knowledge and a span prediction model that predicts the appropriate response span in the background knowledge. Another line of work claimed to be the first to unify knowledge triples and long texts as a graph, then employed reinforcement learning for flexible multi-hop knowledge graph reasoning. To improve the use of background knowledge, Zhang et al. (2019) first adopted the encoder state of the utterance history context as a query to select the most relevant knowledge, then employed a modified version of BiDAF (Seo et al., 2017) to point out the most relevant token positions in the background sequence. Other work used a decoding switcher to predict the probabilities of executing reference decoding or generation decoding. Further researchers (Zhao et al., 2019; Arora et al., 2019; Qin et al., 2019) also followed this parallel encoding method. For the second category, Kim et al. (2020) proposed a sequential latent knowledge selection model for knowledge-grounded dialogue. Li et al. (2019) designed an incremental Transformer to encode multi-turn utterances along with knowledge in the related document, with a two-pass deliberation decoder (Xia et al., 2017) for response generation. However, the relationship between the dialogue history and the last utterance is not well studied; in this paper, we propose a compare aggregate method to investigate this problem.
It should be pointed out that when the target response changes the topic, the task is to detect whether the current topic has ended and to initiate a new one (Akasaki and Kaji, 2019). We do not study the conversation-initiation problem in this paper, although we may take it as future work.

Figure 2: The architecture of the CAT model. "utter" is short for utterance; "doc" is short for document.

Problem Statement
The inputs of the CAT model are the given document D, the dialogue history H, and the last utterance L. The task is to generate the response R = (R_1, R_2, ..., R_r) with r tokens with probability:

P(R | D, H, L; Θ) = ∏_{i=1}^{r} P(R_i | R_{<i}, D, H, L; Θ),

where R_{<i} = (R_1, R_2, ..., R_{i-1}) and Θ denotes the model parameters.

Encoder
The structure of the CAT model is shown in Figure 2. The hidden dimension of the CAT model is h. We use the Transformer structure (Vaswani et al., 2017). The self-attention is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Q, K, and V are the query, the key, and the value, respectively, and d_k is the dimension of Q and K. The encoder and the decoder each stack N (N = 3 in our work) identical layers of multi-head attention (MAtt). The encoder of CAT consists of two branches, as shown in Figure 2(a). The left branch learns the information selected by the dialogue history H; the right branch learns the information chosen by the last utterance L. After the self-attention process, we get H_s = MAtt(H, H, H) and L_s = MAtt(L, L, L). Then we employ L_s to guide H: H_1 = MAtt(L_s, H, H), where H_1 is the hidden state of the first layer. We then adopt H_1 to select knowledge from the document D: D_1 = FF(MAtt(H_1, D, D)), where FF is the feed-forward process. In the second layer, D_1 is the input. After N layers, we obtain the information D_n selected by H. In the right branch, we use L_s to filter D directly; after N layers, we obtain D̄_n, the information selected by L.
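To make the two-branch attention flow concrete, here is a minimal NumPy sketch of a single attention head following the update order described above. It omits the multi-head projections, feed-forward layers, and residual connections, and the tensor sizes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Toy hidden states (sequence length x hidden size h).
h = 4
H = np.random.rand(5, h)   # dialogue-history tokens
L = np.random.rand(3, h)   # last-utterance tokens
D = np.random.rand(8, h)   # document tokens

# Self-attention over each input.
H_s = attention(H, H, H)   # MAtt(H, H, H)
L_s = attention(L, L, L)   # MAtt(L, L, L)

# Left branch, first layer: the last utterance guides the history,
# and the guided history then selects document information.
H_1 = attention(L_s, H, H)   # MAtt(L_s, H, H)
D_1 = attention(H_1, D, D)   # MAtt(H_1, D, D), before the feed-forward step

# Right branch: the last utterance filters the document directly.
D_bar_1 = attention(L_s, D, D)
```

Stacking N such layers (with D_1 or D̄_1 fed into the next layer) yields the final D_n and D̄_n used by the comparison modules.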

Comparison Aggregate
As demonstrated by Sankar et al. (2019), the last utterance plays a fundamental role in response generation. We need to preserve the document information filtered by L and determine how much of the information selected by H is needed. We propose two different compare aggregate methods: one is concatenation before decoding, the other is attended comparison in the decoder.

Concatenation
We apply average pooling to H_s and L_s to get their vector representations H_sa, L_sa ∈ R^h, respectively. The concatenation method calculates a relevance score to determine the importance of D_n as follows:

α = [H_sa; L_sa; H_sa * L_sa],    (5)
D_final = [sigmoid(W_α α) * D_n; D̄_n],    (6)

where [X; Y] is the concatenation of X and Y (in Eq. (6), along the sentence dimension), * is element-wise multiplication, and W_α is a learnable parameter. Note that D_n is guided by H; the concatenation method thus performs a second-level comparison between H and L and then transfers the topic-aware D_final to the two-pass Deliberation Decoder (DD) (Xia et al., 2017). The structure of the DD is shown in Figure 2(b). The first pass takes L and D_final as inputs and learns to generate a contextually coherent response R^1. The second pass takes R^1 and the document D as inputs and learns to inject document knowledge. The DD aggregates document, conversation, and topic information to generate the final response R^2. The loss comes from both the first and the second passes:

Loss = −(1/M) Σ_i [log P(R^1_i) + log P(R^2_i)],    (7)

where M is the total number of training examples, and R^1_i and R^2_i are the i-th words generated by the first and second decoder pass, respectively.
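The gating step can be sketched numerically as below. The particular comparison feature (the two pooled vectors plus their element-wise product) and the sigmoid projection are assumptions consistent with the description above, not necessarily the model's exact equations; W_alpha is a hypothetical parameter vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = 4
H_sa = np.random.rand(h)        # average-pooled history representation
L_sa = np.random.rand(h)        # average-pooled last-utterance representation
D_n = np.random.rand(6, h)      # document info selected via the history branch
D_bar_n = np.random.rand(6, h)  # document info selected via the last utterance

# Assumed comparison feature: the two pooled vectors and their
# element-wise product, concatenated into one vector.
alpha = np.concatenate([H_sa, L_sa, H_sa * L_sa])

# A learned projection squashed to (0, 1) gives the relevance score.
W_alpha = np.random.rand(3 * h)   # hypothetical learnable parameters
score = sigmoid(W_alpha @ alpha)

# Down-weight the history-selected document features by the score, then
# concatenate with the utterance-selected features along the sentence axis.
D_final = np.concatenate([score * D_n, D_bar_n], axis=0)
```

When the last utterance and the history are unrelated, the score drives the history-selected document features toward zero, so the decoder relies mostly on D̄_n.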

Attended Comparison
We employ an Enhanced Decoder (Zheng and Zhou, 2019) to perform the attended comparison. The structure of our Enhanced Decoder is illustrated in Figure 2(c). It accepts D_n, D̄_n, and the response R as inputs, applying a different way to compare and aggregate. The merge attention computes weights across all inputs with learnable parameters W_P; the dimension of P is 3. P_R, P_D, and P_D̄ are the softmax results of P and weight the three inputs into the merged value V_merge. V_merge and L are then used for the next attention step, as shown in Figure 2(c). The output of the Enhanced Decoder is connected to the second pass of the DD, and we define this new structure as the Enhanced Deliberation Decoder (EDD). The loss is the same as Eq. (7).
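A minimal sketch of such a three-way merge attention is given below. The exact projection that produces the 3-dimensional score P is not specified above, so the per-source projection used here is an assumption; what the sketch preserves is the stated structure: a softmax over the three sources R, D_n, and D̄_n, followed by a weighted mix into V_merge.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

h, t = 4, 5                      # hidden size, response length
R = np.random.rand(t, h)         # decoder-side response states
D_n = np.random.rand(t, h)       # attended document info, history branch
D_bar_n = np.random.rand(t, h)   # attended document info, utterance branch

# Hypothetical merge attention: project each position's three candidate
# vectors to a 3-way score P, softmax across the three sources, then mix.
W_P = np.random.rand(h, 3)       # assumed learnable parameters
P = np.stack([R @ W_P[:, 0], D_n @ W_P[:, 1], D_bar_n @ W_P[:, 2]], axis=-1)
P_R, P_D, P_Dbar = np.moveaxis(softmax(P, axis=-1), -1, 0)  # each of shape (t,)

# Convex combination of the three sources at every response position.
V_merge = P_R[:, None] * R + P_D[:, None] * D_n + P_Dbar[:, None] * D_bar_n
```

Because the three weights sum to one at each position, the decoder learns per-token how much to trust the history-filtered versus utterance-filtered document information.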

Dataset
We evaluate our model on the CMU_DoG (Zhou et al., 2018b) dataset, which contains 4112 dialogues based on 120 documents. Each document contains 4 sections, such as the movie introduction and scenes. A related section is given for every few consecutive utterances; however, the conversations are not constrained to the given section. In our setting, we use the full document (all 4 sections) as external knowledge. The average document length is around 800 words. We concatenate consecutive utterances by the same person into one utterance. For training, we remove the first two or three rounds of greeting sentences. Each sample contains one document, two or more historical utterances, one last utterance, and one gold response. For testing, we use two versions of the test set. The first follows the processing of the training data; we name it the Reduced version. The second is constructed by comparing the document sections that the conversation turns are grounded in: we preserve the examples in which the dialogue history and the last utterance are based on different document sections (e.g., the dialogue history is based on section 2 while the last utterance and response are based on section 3). We name it the Sampled version; it is used to test our models' ability to comprehend topic transfer in conversations. The data statistics are shown in Table 1; please refer to Zhou et al. (2018b) for more details. It is worth noting that the Sampled version does not represent the proportion of all conversational topic transfers, but it exhibits this problem better than the Reduced version does. We also tested our method on the Holl-E (Moghe et al., 2018) dataset; since the processing of that dataset and the experimental conclusions are similar to CMU_DoG, we do not present those results in this paper.
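The preprocessing described above (merging consecutive same-speaker turns, dropping the opening greeting rounds, and splitting each conversation into history / last utterance / response) could be sketched as follows; build_samples is a hypothetical helper, not the released preprocessing script.

```python
def build_samples(document, turns, skip_greetings=2):
    """Split one conversation into (document, history, last_utterance,
    response) samples. `turns` is a list of (speaker, utterance) pairs;
    the first `skip_greetings` merged turns are treated as greetings."""
    # Merge consecutive turns by the same speaker into one utterance.
    merged = []
    for speaker, text in turns:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))

    samples = []
    for i in range(skip_greetings + 2, len(merged)):
        history = [t for _, t in merged[skip_greetings:i - 1]]
        last_utt = merged[i - 1][1]
        response = merged[i][1]
        if len(history) >= 2:  # each sample keeps two or more history turns
            samples.append((document, history, last_utt, response))
    return samples
```

For example, a seven-turn conversation whose first two merged turns are greetings yields one sample with a two-utterance history, the next turn as the last utterance, and the final turn as the response.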

Baselines
We evaluate several competitive baselines.

RNN-based models
VHRED (Serban et al., 2017): we test three variants. The first injects no document knowledge and is denoted VHRED(-k). For the second, we use the same encoder RNN with shared parameters to learn representations of the document and the utterance, then concatenate their final hidden states as the input of the context RNN; it is denoted VHRED(c). For the third, we use word-level dot attention (Luong et al., 2015) to obtain the document-aware utterance representation and use it as the input of the context RNN; it is termed VHRED(a).

Transformer-based models
T-DD/T-EDD: Both use the Transformer as the encoder, with the concatenation of the dialogue and the document as input. These two models encode the dialogue in parallel without detecting topic transfer. The T-DD uses a Deliberation Decoder (DD); the T-EDD uses an Enhanced Deliberation Decoder (EDD).
ITDD (Li et al., 2019): It uses an Incremental Transformer Encoder (ITE) and a two-pass Deliberation Decoder (DD). The Incremental Transformer uses multi-head attention to incorporate document sections and context into each utterance's encoding process. ITDD models dialogues incrementally without detecting topic transitions.

Evaluation Metrics
Automatic Evaluation: We employ perplexity (PPL) (Bengio et al., 2000), BLEU (Papineni et al., 2002), and ROUGE (Lin, 2004). The PPL of the gold response is measured; lower perplexity indicates better performance. BLEU measures the n-gram overlap between a generated response and the gold response; since there is only one reference for each response, BLEU scores are extremely low. ROUGE measures the n-gram overlap based on recall. Since the conversations are constrained by the background material, ROUGE is reliable.
We also introduce two metrics to automatically evaluate Knowledge Utilization (KU); both are based on N-gram overlaps. For each (document, conversation, generated response) triple (D, C, R) in the test set, the N-gram sets are denoted G^N_d, G^N_c, and G^N_r, respectively. The set of N-grams shared by G^N_d and G^N_r is recorded as G^N_dr. Tuples that are in G^N_dr but not in G^N_c form G^N_{dr-c}. Then KU = len(G^N_{dr-c}) / len(G^N_dr) reflects how many N-grams of the document are used in the generated replies, where len(G) is the number of tuples in G. The larger the KU, the more document N-grams are utilized. Since low-frequency tuples may be more representative of text features, we define the reciprocal of the frequency of each tuple k in G as R^G_k, which represents the importance of the tuple. The Quality of Knowledge Utilization (QKU) is then calculated as:

QKU = (1 / len(G^N_{dr-c})) Σ_{k ∈ G^N_{dr-c}} R^{G_d}_k / R^{G_r}_k.

If a tuple is more important in the response (larger R^{G_r}_k) and less important in the document (smaller R^{G_d}_k), the corresponding term becomes smaller. So a smaller QKU means a higher quality of the used document knowledge.
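The two metrics can be computed per triple as in the sketch below. The KU part follows the definition directly; the QKU aggregation (a per-tuple importance ratio averaged over the reused n-grams) is one plausible reading of the definition above, so treat the exact normalization as an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ku_qku(doc, conv, resp, n=2):
    """Knowledge Utilization and its quality for one (D, C, R) triple,
    where each argument is a list of tokens."""
    g_d, g_c, g_r = (set(ngrams(t, n)) for t in (doc, conv, resp))
    g_dr = g_d & g_r               # document n-grams reused in the reply
    g_dr_minus_c = g_dr - g_c      # ...excluding those already in the dialogue
    ku = len(g_dr_minus_c) / len(g_dr) if g_dr else 0.0

    # Importance of a tuple = reciprocal of its frequency in the text.
    f_d, f_r = Counter(ngrams(doc, n)), Counter(ngrams(resp, n))
    # Assumed aggregation: average of R^{G_d}_k / R^{G_r}_k over the
    # reused tuples; rarer-in-document, common-in-response tuples lower it.
    qku = sum((1 / f_d[k]) / (1 / f_r[k]) for k in g_dr_minus_c) \
        / max(len(g_dr_minus_c), 1)
    return ku, qku
```

For instance, if the response reuses two document bigrams and neither appears in the dialogue, KU is 1.0; if one of them was already mentioned in the dialogue, KU drops to 0.5.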
Human Evaluation: We randomly sampled 100 conversations from the Sampled test set and obtained 800 responses from the eight models. Five graduate students served as judges, scoring each response with access to the previous dialogue and the document on three metrics: Fluency, Coherence, and Informativeness. Fluency measures whether the response is a human-like utterance. Coherence measures whether the response is coherent with the dialogue context. Informativeness measures whether the response contains relevant and correct information from the document. They are scored from 1 to

Experimental Setup
We use OpenNMT-py (Klein et al., 2017) as the code framework. For all models, pre-trained 300-dimensional word embeddings (Mikolov et al., 2013) are shared by the dialogue, the document, and the generated responses, and the hidden size is 300. The RNN-based models use a 3-layer bidirectional GRU encoder and a 3-layer GRU decoder. For the Transformer-based models, both the encoder and the decoder have 3 layers, the number of heads in multi-head attention is 8, and the filter size is 2048. We use Adam (learning rate 0.001, β_1 = 0.9, β_2 = 0.999, ε = 10^-8) (Kingma and Ba, 2015) for optimization. The beam size in the decoder is 5. We truncate documents to 800 words and dialogue utterances to 40 words. All models are trained on a TITAN X (Pascal) GPU; the average training time per epoch is around 40 minutes for the Transformer-based models and around 20 minutes for the RNN-based models.

Experimental Results

Table 2 shows the automatic evaluations of all models on the Reduced (Sampled) dataset, with a dialogue history of 2 rounds. We present only ROUGE-L, as ROUGE-1/2 show the same trend. Through the experiments, we can see that the range of KU-2 (8.0-13.7) is smaller than that of KU-3 (23.1-31.7) on the Reduced data, indicating that KU-3 reflects the amount of knowledge used better than KU-2 does.
Among the RNN-based models, the VHRED(-k) gets the worst PPL/BLEU/ROUGE, which reveals the importance of injecting document knowledge in the DGD task. We did not calculate KU/QKU for the VHRED(-k), since the model does not use document knowledge. The VHRED(a) gets better PPL/BLEU/ROUGE/KU/QKU than the VHRED(c), which means finer-grained extraction of document information benefits response generation more.
Among the Transformer-based models, the ITDD gets better PPL/BLEU/ROUGE-L/KU/QKU than the T-DD, which means the incremental encoding method is stronger than parallel encoding. The CAT-EDD and the CAT-DD achieve better performance than the T-EDD and the T-DD, respectively, indicating that our compare aggregate method helps the model understand the dialogue. The CAT-EDD outperforms the ITDD on all metrics, which indicates that our CAT module automatically learns the topic transfer between the conversation history and the last utterance as we expected. The CAT-EDD does not perform as well as the CAT-DD, which shows that it is necessary to set up an independent mechanism to learn topic transfer, rather than learning it automatically by attention in the decoder.
Compared with the RNN-based models, the Transformer-based models get better PPL/BLEU/ROUGE, which shows they converge better to the ground truth. The VHRED(c) and the VHRED(a) get better KU but worse QKU than the Transformer-based models: the latent variable models increase the diversity of replies and use more document tuples, but their ability to extract distinctive tuples is not as good as the Transformer-based ones. Table 3 shows the manual evaluations of all models on the Reduced (Sampled) dataset. The CAT-DD gets the highest Fluency/Coherence/Informativeness scores. On the Sampled test set, the advantages of our models become greater than on the Reduced version in both automatic and manual evaluations; our model shows more advantages on data with more topic transfer. Table 4 presents the ablation study of the CAT-DD model. w/o-left removes the left branch, so the model degenerates to the T-DD, taking only the last utterance and the document as inputs; all automatic evaluation metrics drop significantly, indicating that the dialogue history cannot simply be ignored. w/o-(5,6) is a model without Eq. (5) and (6), which is equivalent to simply concatenating the outputs of the left and right encoder branches; the results show that the model's ability to distinguish conversational topic transfer is weakened. w/o-(G) removes the utterance attention in the left branch, which means we do not use L to guide H; the structure of the left branch becomes that of the right branch, with H as input. Its performance declines, which indicates that the guiding process is useful. Significance tests (two-tailed Student's t-test) on PPL/BLEU/KU-2/QKU-2 confirm the effectiveness of each component.

History Round Study
We use the CAT-DD model and the Sampled test set to study the influence of the number of historical dialogue rounds. For example, setting the dialogue history to 0 means we use only the last utterance, so the CAT-DD becomes the w/o-left model in the ablation study; setting the dialogue history to N means we use N rounds of dialogue history as the input of the left branch. We set the conversation history to 0/1/2/3/4 to test the responses of the VHRED(a)/ITDD/CAT-DD models. Figure 3 shows the trends of BLEU/KU-3/QKU-3. The top figure shows the BLEU trend: the CAT-DD reaches its maximum at 2 rounds, and further increasing the rounds does not significantly improve generation. In the middle figure, as the dialogue history grows from 0 to 2 rounds, the VHRED(a) and the CAT-DD show a visible improvement on KU-3, which shows that the information contained in the dialogue history can be identified and affects the extraction of document information. The ITDD is not as sensitive as the others on KU-3, indicating that the incremental encoding structure pays more attention to the last utterance. The bottom figure shows the trend of QKU-3: as the dialogue history increases, the ITDD stays stable while the VHRED(a) and the CAT-DD show a declining trend, which again indicates that the VHRED(a) and the CAT-DD are more sensitive to the dialogue history.

Figure 4 shows the average sigmoid(W_α α) value in the CAT-DD model over the Reduced/Sampled test sets and the validation set. A higher value means a stronger correlation between the last utterance and the dialogue history. We can see that on the Reduced test set and the validation set, the relevance score is higher than on the Sampled data, which confirms that the last utterance and the dialogue history are less related in the latter. Our model captures this change and performs better on the Sampled data than on the Reduced data. When the number of historical rounds increases from 1 to 2, the relevance score drops noticeably on all data sets, which means the added dialogue history introduces more unrelated information. When the historical rounds increase from 2 to 6, no data set shows a significant change, indicating that adding more dialogue rounds does not improve the model's ability to recognize topic change.

Figure 4: The rating of dialogue history in the CAT-DD model on the Reduced and Sampled test sets. The abscissa represents the dialogue rounds and the ordinate represents the correlation score in the model.

Figure 5: A case study example from the Sampled test set.
Document: ... sally hawkins as elisa esposito, a mute cleaner who works at a secret government laboratory. michael shannon as colonel richard strickland ... rating rotten tomatoes: 92%. The Shape of Water is a 2017 American fantasy film ... it stars sally hawkins, michael shannon, richard jenkins, doug jones, michael stuhlbarg, and octavia spencer ...
Dialogue history: S1: I wonder if it's a government creation or something captured from the wild. I would assume the wild. S2: It was captured for governmental experiments.
The last utterance: S1: Is it a big name cast?
Ground truth: S2: Sally Hawkins played the role of the mute cleaner; Michael Shannon played the role of Colonel Richard Strickland.

Case Study
In Figure 5, we show a randomly selected example from the Sampled test set for the case study, presenting the document, the dialogue history, the last utterance, and the ground truth. We can observe that the last utterance is irrelevant to the dialogue history. The generated responses of the different models are listed below it. The VHRED(a) and CAT-DD(w/o-(G)) models misunderstand the dialogue and use the wrong document knowledge. The T-DD gives a generic reply. The ITDD answers correctly but without enough document information. The CAT-DD(w/o-(5,6)) model gives a response influenced by the irrelevant dialogue history, which we want to eliminate. Only the CAT-DD generates a reasonable reply and uses the correct document knowledge, which means it correctly understands the dialogue.

Conclusion
We propose the Compare Aggregate method to understand Document-grounded Dialogue (DGD). The dialogue is divided into the last utterance and the dialogue history. The relationship between the two parts is analyzed to denoise the dialogue context and aggregate the document information for response generation. Experiments show that our model outperforms previous work in both automatic and manual evaluations. Our model can better understand the dialogue context and select proper document information for response generation. We also propose Knowledge Utilization (KU) and Quality of Knowledge Utilization (QKU), which are used to measure the quantity and quality of the imported external knowledge, respectively. In the future, we will further study the topic transition problem and the knowledge injecting problem in the DGD.