Conversational Response Re-ranking Based on Event Causality and Role Factored Tensor Event Embedding

We propose a novel method for selecting coherent and diverse responses for a given dialogue context. The proposed method re-ranks response candidates generated from conversational models by using event causality relations between events in the dialogue history and the response candidates (e.g., "be stressed out" precedes "relieve stress"). We use a distributed event representation based on the Role Factored Tensor Model for robust matching of event causality relations, since the system's event causality knowledge is limited. Experimental results showed that the proposed method improved the coherency and dialogue continuity of system responses.


Introduction
While a variety of dialogue models such as the neural conversational model (NCM) (Vinyals and Le, 2015) have been researched widely, such models often generate simple and dull responses because of their limited ability to take dialogue context into account. It is very difficult for these models to generate responses that are coherent with the dialogue history. We tackle this problem with a new architecture that incorporates event causality relations between response candidates and the dialogue history. Typical event causality relations are cause-effect relations between two events, such as "be stressed out" precedes "relieve stress." In this paper, event causality relations are defined such that an effect event is likely to happen after the corresponding cause event happens (Shibata and Kurohashi, 2011; Shibata et al., 2014). Event causality relations have been used in why-question answering systems to focus on causalities between questions and answers (Oh et al., 2013, 2016, 2017). It has also been reported that a conversational model using event causality relations can generate diverse and coherent responses (Fujita et al., 2011). However, the relation between dialogue continuity and the coherency of system responses remains an open problem.
In this paper, we propose a novel method to select an appropriate response from response candidates generated by NCMs. We define a re-ranking score that favors responses having an event causality relation to the dialogue history. Re-ranking effectively improves response reliability in language generation tasks such as why-question answering and dialogue systems (Oh et al., 2013; Jansen et al., 2014; Bogdanova and Foster, 2016; Ohmura and Eskenazi, 2018). We use event causality pairs extracted from a large-scale corpus (Shibata and Kurohashi, 2011; Shibata et al., 2014). We also use a distributed event representation based on the Role Factored Tensor Model (RFTM) (Weber et al., 2018) to realize robust matching of event causality relations, even when those causalities are not included in the extracted event causality pairs. In both human and automatic evaluations, the proposed method outperformed conventional methods in selecting coherent and diverse responses.
Response Re-ranking Using Event Causality Relations

Figure 1 shows an overview of the proposed method. The process consists of four parts. First, N-best response candidates are generated from an NCM given a dialogue history (Figure 1-(1); Section 2.1). Then, events (predicate-argument structures) are extracted by an event parser from both the dialogue history and the response candidates (Figure 1-(2)). We used the Kurohashi-Nagao Parser (KNP) (Kawahara and Kurohashi, 2006; Sasano and Kurohashi, 2011) as the event parser. Next, the extracted events are converted to distributed event representations, and the response candidates are re-ranked using event causality relations (Figure 1-(3), (4); Sections 2.2 and 2.4). We describe these components in more detail below.

[Figure 1: Neural conversational model + re-ranking using event causality; a response that has an event causality relation ("be exhausted" → "relax") to the dialogue history is selected by the re-ranking.]
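As a sketch, the four steps above can be wired together as follows. All function names (`generate_nbest`, `extract_events`, `rerank`) are hypothetical stand-ins for the NCM, the KNP-based event parser, and the causality-based scorer, not the authors' actual code.

```python
def respond(history, generate_nbest, extract_events, rerank, n=20):
    """Return the best response after causality-based re-ranking."""
    # 1) N-best response candidates from the conversational model
    candidates = generate_nbest(history, n)  # [(response, log_prob), ...]
    # 2) predicate-argument structures from the history and candidates
    history_events = extract_events(history)
    scored = []
    for response, log_prob in candidates:
        response_events = extract_events(response)
        # 3)-4) score each candidate by its event causality
        # relations to the dialogue history
        scored.append((rerank(log_prob, history_events, response_events),
                       response))
    return max(scored)[1]
```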

Neural Conversational Model (NCM)
An NCM learns a mapping between input and output word sequences by using recurrent neural networks (RNNs). NCMs can generate N-best response candidates by using beam search or sampling (Macherey et al., 2016).
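A minimal beam-search sketch for N-best generation is shown below. The `step_log_probs` callable is a hypothetical stand-in for the decoder: a real NCM would score next tokens with its RNN decoder.

```python
def beam_search(step_log_probs, start, beam_width, max_len):
    """Keep the beam_width highest-scoring prefixes at each step and
    return them as N-best (sequence, log-probability) pairs."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        expanded = []
        for seq, score in beams:
            # step_log_probs maps a prefix to {token: log_prob}
            for tok, lp in step_log_probs(seq).items():
                expanded.append((seq + [tok], score + lp))
        beams = sorted(expanded, key=lambda x: -x[1])[:beam_width]
    return beams
```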

Event Causality Pairs
The proposed method uses event causality pairs. The events in a pair, which have a cause-effect relation, are extracted from a large-scale corpus on the basis of co-occurrence statistics and case frames (Shibata and Kurohashi, 2011; Shibata et al., 2014). 420,000 entries were extracted from 1.6 billion texts; each entry consists of the information denoted in Table 1. "predicate 1" and "argument 1" are the components of a cause event, and "predicate 2" and "argument 2" are the components of an effect event. Each event consists of a predicate and arguments; the predicate is required, and the arguments are optional. We used arguments in the following roles: nominative, accusative, dative, instrumental, and locative cases. lift is the mutual information score between the two events, which indicates the strength of the causality relation. Using lift, we define the re-ranking score as

score(r) = log p(r | h) + λ log lift(e_h, e_r),   (1)

where p(r | h) is the posterior probability of the response candidate r provided by the NCM, and λ is a hyperparameter that decides the weight of event causality relations. lift(e_h, e_r) is the lift score between an event e_h in the dialogue history h and an event e_r in the response candidate, and is set to 2 if the pair does not appear in the extracted event causality pair pool. Note that lift(e_h, e_r) is log-scaled because it has a wide range of values (10 < lift(e_h, e_r) < 10,000). When more than one event causality relation is recognized between the dialogue history and the response candidate, the score of the candidate is determined by the relation with the highest lift(e_h, e_r). We call this model "Re-ranking."
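As a concrete illustration, the re-ranking score defined above can be sketched as follows. The lift table, the λ value, and the helper names are illustrative assumptions, not the paper's implementation.

```python
import math

def rerank_score(log_prob, lift, lam=0.5):
    """score = log p(r|h) + lam * log lift(e_h, e_r);
    lam=0.5 is an illustrative value, not the paper's setting."""
    return log_prob + lam * math.log(lift)

def best_lift(history_events, response_events, lift_table):
    """Highest lift over all (history event, response event) pairs;
    pairs absent from the pool fall back to lift = 2."""
    return max(lift_table.get((eh, er), 2.0)
               for eh in history_events for er in response_events)

# toy pool: "be stressed out" precedes "relieve stress" with lift 120
pool = {("be stressed out", "relieve stress"): 120.0}
lift = best_lift(["be stressed out"], ["relieve stress", "sleep"], pool)
score = rerank_score(-3.0, lift)
```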

Distributed Event Representation Based on Role Factored Tensor Model (RFTM)
It is difficult to determine event causality relations by using only the pairs observed in an actual corpus. Therefore, we introduce a distributed event representation to improve the robustness of matching events in a dialogue with those in the event causality pair pool. Every event is embedded into a fixed-length vector so that similarities can be calculated. We define an event as either a single predicate or a pair of a predicate and arguments. An argument a of an event is embedded into a vector v_a by using Skip-gram (Mikolov et al., 2013c,a,b). A predicate p of an event is embedded into a vector v_p by using predicate embedding, which is based on a case-unit Skip-gram. Figure 2 shows the model architecture of the predicate embedding. The model learns predicate vector representations that are good at predicting their arguments. To obtain an event embedding from v_p and v_a, we use the RFTM proposed by Weber et al. (2018). The RFTM embeds a predicate and its arguments into a vector e as

e = Σ_a W_a T(v_p, v_a),   (2)

where the relation of a predicate and its arguments is computed with a 3D tensor T and role-specific matrices W_a. If the event has no arguments, e is substituted by v_p. The RFTM is trained to predict an event sequence; thus it can represent the meaning of an event in a particular context.

Event Causality Relation Matching Based on Distributed Event Representation

Figure 3 illustrates the process of matching events on the basis of the distributed event representation. Given an event pair from a response candidate and a dialogue history, the proposed method finds the event causality pair with the highest cosine similarity in the pool. The lift score, the strength of the event causality relation, is extended as

lift_emb(e_h, e_r) = lift(ê_c, ê_e),  (ê_c, ê_e) = argmax_{(e_c, e_e)} cos(e_h, e_c) · cos(e_r, e_e),   (3)

where e_h is an event in the dialogue history, e_r is an event in the response candidate, and e_c and e_e are respectively the cause and effect events of an event causality pair in the pool.

[Figure 3: Event causality relation matching; the lift of the event causality relation in which "be exhausted" precedes "relax" is calculated from the lift of the most similar event causality relation, in which "be stressed out" precedes "relieve stress."]

Replacing lift(e_h, e_r) in Eq. (1) with lift_emb(e_h, e_r), the score using the distributed event representation is defined as

score_emb(r) = log p(r | h) + λ log lift_emb(e_h, e_r).   (4)

We call this model "Re-ranking (emb)."
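The matching step can be sketched as follows. The toy vectors stand in for RFTM event embeddings, and `embed` is a hypothetical lookup; the real method embeds events with the trained RFTM.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lift_emb(e_h, e_r, pool, embed):
    """Back off to the causality pair in the pool whose (cause, effect)
    embeddings are most similar to (e_h, e_r), and return its lift."""
    best = max(pool, key=lambda ce: cos(embed(e_h), embed(ce[0])) *
                                    cos(embed(e_r), embed(ce[1])))
    return pool[best]

# toy 2-D embeddings standing in for RFTM event vectors
vecs = {"be exhausted": np.array([1.0, 0.1]),
        "be stressed out": np.array([0.9, 0.2]),
        "relax": np.array([0.1, 1.0]),
        "relieve stress": np.array([0.2, 0.9]),
        "eat": np.array([-1.0, 0.0])}
pool = {("be stressed out", "relieve stress"): 120.0,
        ("eat", "relax"): 15.0}
lift = lift_emb("be exhausted", "relax", pool, vecs.get)
```

Here the unseen pair ("be exhausted", "relax") borrows the lift of the most similar pool entry, ("be stressed out", "relieve stress").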

Experiments
We conducted automatic and human evaluations to compare responses with and without the re-ranking. We evaluated our proposed re-ranking method on a conventional Encoder-Decoder with Attention (EncDec) model (Bahdanau et al., 2015; Luong et al., 2015) and a Hierarchical Recurrent Encoder-Decoder (HRED) model (Sordoni et al., 2015). While HRED tries to generate responses that are more coherent with the dialogue context than a simple Encoder-Decoder, the diversity of its responses is small due to context constraints. We used Japanese data from a Wikipedia dump for training the Skip-gram and predicate word embeddings of the RFTM, and the Mainichi newspaper dataset 2017 for training the RFTM.

Model Settings
The hidden unit size of the Skip-gram (Mikolov et al., 2013c,a,b), the predicate embedding, and the RFTM (Weber et al., 2018) was 100. We used gated recurrent units (GRUs) (Chung et al., 2014) with 2 layers and a hidden unit size of 256 for the encoder and decoder of the NCMs. The batch size was 100, the dropout probability was 0.1, and the teacher forcing rate was 1.0. We used Adam (Kingma and Ba, 2015) as the optimizer. The gradient clipping threshold was 50, the learning rate for the encoder and the context RNN of HRED was 1e-4, and the learning rate for the decoder was 5e-4. The loss function was the inverse token frequency (ITF) loss (Nakamura et al., 2019). We used sentencepiece (Kudo and Richardson, 2018) as the tokenizer, and the vocabulary size was 32,000. These settings were the same for all models. Repetitive suppression (Nakamura et al., 2019) and length normalization (Macherey et al., 2016) were used at the decoding step. Finally, λ in Eq. (1) was tuned as a hyperparameter.

Diversity of Beam Search
We investigated the internal diversity of the N-best response candidates generated from each dialogue model. The higher the diversity, the more effective re-ranking is expected to be. Hence, we evaluated diversity on the test data with dist-1 and dist-2 (Li et al., 2016). The beam width was set to 20, the same as in the following experiments.
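For reference, dist-n can be computed over an N-best list as below; the tokenized toy candidates are illustrative.

```python
def dist_n(candidates, n):
    """dist-n (Li et al., 2016): the number of distinct n-grams divided
    by the total number of n-grams, here over a tokenized N-best list."""
    ngrams = [tuple(toks[i:i + n])
              for toks in candidates for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

nbest = [["i", "am", "tired"], ["i", "am", "fine"]]
print(dist_n(nbest, 1))  # 4 distinct unigrams out of 6 total
```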
The results are shown in Table 2: Ave. dists are the averages of dist computed over the internal N-best response candidates. The diversity of EncDec is higher than that of HRED.

Comparison in Automatic Metrics

Table 3 shows the results of our evaluation using automatic metrics. We compared the results by referring to the ratio of responses that differ from those of the method without re-ranking ("re-ranked"), bilingual evaluation understudy (BLEU) (Papineni et al., 2002), NIST (Doddington, 2002), and the vector extrema (Gabriel et al., 2014) ("extrema") score. NIST is based on BLEU but heavily weights less frequent N-grams to focus on content words. Vector extrema computes the cosine similarity between the sentence vectors of a reference and of a response generated by a model. Each sentence vector e_s is computed by taking the extrema of the Skip-gram word vectors e_w in each dimension d as

e_sd = max_{w∈s} e_wd   if max_{w∈s} e_wd > |min_{w'∈s} e_w'd|
       min_{w∈s} e_wd   otherwise,   (5)

where e_sd and e_wd are the d-th dimensions of e_s and e_w, respectively. Additionally, we evaluated dist (Li et al., 2016), pointwise mutual information (PMI) (Newman et al., 2010), and the average response length ("length"). Dist and PMI are used to evaluate diversity and coherency, respectively. PMI between a response r and a dialogue history h is defined as

PMI(h, r) = (1 / (|h| |r|)) Σ_{w_h∈h} Σ_{w_r∈r} log [p(w_h, w_r) / (p(w_h) p(w_r))].   (6)

Each method in Table 3 is specified by the NCM, the range of dialogue history used for re-ranking, and the re-ranking method. Methods with "1-best" used neither re-ranking nor event embedding. Those with "Re-ranking" used re-ranking but not event embedding. Those with "Re-ranking (emb)" used both re-ranking and the proposed event embedding method. Re-ranking lowered the scores of similarity to the reference (BLEU, NIST, and extrema): because the NCMs were trained to generate responses similar to the references, the top-1 response before re-ranking should have the highest scores in those similarity metrics. Dist-2 and PMI were improved by re-ranking. This indicates that the words in re-ranked responses are diverse and coherent with the dialogue histories. However, the ratio of re-ranked responses was around 10%; hence, the effect of re-ranking was limited. By introducing the proposed event embedding method, the ratio of re-ranked responses improved drastically (Re-ranking vs. Re-ranking (emb)). Moreover, the re-ranking models with event embedding had the highest dist-1, dist-2, and PMI. As the HRED models had higher BLEU, NIST, and PMI values than the EncDec models for all re-ranking methods, we conducted a human evaluation comparing HRED model-based systems.
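The vector extrema computation can be sketched as follows; the per-dimension rule keeps the maximum only when it dominates the absolute value of the minimum, as described above.

```python
import numpy as np

def extrema_vector(word_vectors):
    """Per dimension, keep the max if it exceeds the absolute value
    of the min, otherwise keep the min."""
    m = np.stack(word_vectors)          # |s| x d matrix of word vectors
    mx, mn = m.max(axis=0), m.min(axis=0)
    return np.where(mx > np.abs(mn), mx, mn)

def extrema_score(ref_vecs, hyp_vecs):
    """Cosine similarity between the extrema vectors of a reference
    and a generated response."""
    a, b = extrema_vector(ref_vecs), extrema_vector(hyp_vecs)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```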

Human Evaluation
It is difficult to evaluate system performance with automatic metrics alone. Hence, we compared a baseline model and our models in a human evaluation to confirm the coherency and dialogue continuity of the responses selected by our proposed methods. We compared the baseline HRED model with our proposed models, re-ranked without and with embedding, using the last five utterances of the dialogue history. To reduce the evaluators' workload, we used test data in which the number of user utterances is less than three, and removed dialogues that require external knowledge to evaluate. We used crowdsourcing for the human evaluation. Ten crowd-workers compared the responses selected by two of the three models on the following two subjective criteria. The first is "which words in a response are more related to the dialogue history" (word coherency), which indicates how coherent a system response is with the dialogue history. The second is "which response is easier to respond to" (dialogue continuity), which indicates how much dialogue continuity a system response affords. These criteria were inspired by those of the Alexa Prize (Ram et al., 2018). The results are shown in Tables 4, 5, and 6. Word coherency was improved by our model without embedding but lowered by the model with embedding. This is because workers acknowledged causality relations included in the event causality pair pool but did not acknowledge causalities generalized by the event embedding. However, dialogue continuity was improved by the proposed re-ranking model with embedding, probably because the proposed model reduced the number of dull responses. As future work, we need to investigate a better similarity threshold for the event embedding to balance coherency and continuity.

Discussion
We analyzed the adequacy of re-ranking using event causality relations through system response examples from our proposed method. In the examples, "()" indicates the original Japanese sentences, "[]" indicates the event causality relations used for re-ranking, and "<>" indicates the responses before re-ranking. All examples are translated from Japanese to English. In the examples, appropriate event causality relations are used to select logical, coherent, and diverse responses. However, we found that such cases are not the majority. Our method sometimes used inadequate event causality relations even when coherent responses were selected as a result ("Conversation 3"). Responses selected by our method are sometimes more unnatural and incoherent than those before re-ranking, as in "Conversation 4", "Conversation 5", and "Conversation 6". Considering the results of the human evaluation and these examples, we hypothesize that our method has two problems in selecting appropriate event causality relations. The first problem is that the event embedding over-generalizes events ("Conversation 4"). The causality in Conversation 4 ("drink alcohol" precedes "can drink alcohol") is obtained by generalizing a causality that "enter restaurant" precedes "order beer", which is included in the event causality pair pool. It is necessary to prevent such over-generalization by improving the embedding architecture. The second problem is that our method focuses only on word coherency, not response naturalness ("Conversation 5" and "Conversation 6"). To solve this problem, our method has to maintain response naturalness while improving the coherency of word choices.

Conclusion
We proposed a method for selecting responses from candidates generated by a neural conversational model (NCM) using event causality relations. The method achieves robust matching of event causality relations by means of a distributed event representation. Experimental results showed that the proposed method selects coherent and diverse responses. The proposed method can be applied to any language with a semantic parser, because it uses event expressions based on predicate-argument structures. However, unnatural responses were sometimes selected due to inadequate event causality relations. Future work will focus on solving this problem by preventing the over-generalization of events and maintaining response naturalness.