Reasoning with Heterogeneous Knowledge for Commonsense Machine Comprehension

Reasoning with commonsense knowledge is critical for natural language understanding. Traditional methods for commonsense machine comprehension mostly focus on only one specific kind of knowledge, neglecting the fact that commonsense reasoning requires simultaneously considering different kinds of commonsense knowledge. In this paper, we propose a multi-knowledge reasoning method that exploits heterogeneous knowledge for commonsense machine comprehension. Specifically, we first mine different kinds of knowledge (including event narrative knowledge, entity semantic knowledge and sentiment coherent knowledge) and encode them as inference rules with costs. We then propose a multi-knowledge reasoning model, which selects inference rules for a specific reasoning context using an attention mechanism and reasons by summarizing all valid inference rules. Experiments on RocStories show that our method significantly outperforms traditional models.


Introduction
Commonsense knowledge is fundamental in artificial intelligence, and has long been a key component in natural language understanding and human-like reasoning. For example, to understand the relation between the sentences "Mary walked to a restaurant" and "She ordered some foods", we need commonsense knowledge such as "Mary is a girl", "restaurant sells food", etc. The task of understanding natural language with commonsense knowledge is usually referred to as commonsense machine comprehension, which has been a hot topic in recent years (Richardson et al., 2013). Recently, RocStories (Mostafazadeh et al., 2016a), a commonsense machine comprehension task, has attracted many researchers' attention due to its significant difference from previous machine comprehension tasks. RocStories focuses on reasoning with implicit commonsense knowledge, rather than matching explicit information in given contexts. In this task, a system is required to choose a sentence, namely the hypothesis, to complete a given commonsense story, called the premise document. Table 1 shows two examples. RocStories provides a challenging benchmark for evaluating commonsense-based language understanding. As investigated by Mostafazadeh et al. (2016a), this dataset does not have any boundary cases and thus yields 100% human performance.
Commonsense machine comprehension, however, is a natural ability for humans but can be very challenging for computers. In general, any world knowledge whatsoever in the reader's mind can affect the choice of an interpretation (Dahlgren et al., 1989). That is, a person can learn any heterogeneous commonsense knowledge and make inferences from given information based on all the knowledge in his mind. For example, to choose the right hypothesis for the first premise document in Table 1, we need the event narrative knowledge that "X does a thorough job" will lead to "commends X", rather than "fire X". Besides, people can further confirm their judgement based on the sentimental coherence between "finish super early" and "job well done". Furthermore, in the second example, even though both hypotheses are consistent with the premise document in both event and sentimental facets, we can still infer the right answer easily using the commonsense knowledge that a "puppy" is a dog, while a "kitten" is a cat. In recent years, many methods have been proposed for commonsense machine comprehension. However, these methods mostly either focus on matching explicit information in given texts (Weston et al., 2014; Wang and Jiang, 2016a,b; Zhao et al., 2017), or pay attention to one specific kind of commonsense knowledge, such as event temporal relations (Chambers and Jurafsky, 2008; Modi and Titov, 2014; Pichotta and Mooney, 2016b; Hu et al., 2017) and event causality (Do et al., 2011; Radinsky et al., 2012; Hashimoto et al., 2015; Gui et al., 2016). As discussed above, it is obvious that the commonsense machine comprehension problem is far from settled by considering only explicit information or a single kind of commonsense knowledge. To achieve human-like comprehension and reasoning, there exist two main challenges: 1) How to mine and represent the different kinds of implicit knowledge that commonsense machine comprehension needs.
For example, to complete the first example in Table 1, a system needs the event narrative knowledge that "commends X" can be inferred from "X does a thorough job", as well as the sentiment coherent knowledge that "insubordination" and "finish super early" are sentimentally incoherent.
2) How to reason with various kinds of commonsense knowledge. As shown above, the knowledge that the reasoning process needs varies across contexts. For human-like commonsense machine comprehension, a system should take various kinds of knowledge into consideration, decide what knowledge to use in a specific reasoning context, and make the final decision by considering all the knowledge it uses.
To address the above problems, this paper proposes a new commonsense reasoning approach, which can mine and exploit heterogeneous knowledge for commonsense machine comprehension. Specifically, we first mine different kinds of knowledge from raw text and relevant knowledge bases, including event narrative knowledge, entity semantic knowledge and sentiment coherent knowledge. This heterogeneous knowledge is encoded into a uniform representation: inference rules between elements under different kinds of relations, each with an inference cost. Then we design a rule selection model using an attention mechanism, modeling which inference rules will be applied in a specific reasoning context. Finally, we propose a multi-knowledge reasoning model, which measures the reasoning distance from a premise document to a hypothesis as the expected cost sum of all inference rules applied in the reasoning process.
By modeling and exploiting heterogeneous knowledge during commonsense reasoning, our method can achieve more accurate and more robust performance than traditional methods. Furthermore, our method is a general framework, which can be extended to incorporate new knowledge easily. Experiments show that our method achieves a 13.7% accuracy improvement on the standard RocStories dataset, a significant improvement over previous work.

Commonsense Knowledge Acquisition for Machine Comprehension
As described above, various kinds of knowledge can be exploited for machine comprehension. In this section, we describe how to mine different knowledge from different sources. Specifically, we mine three types of commonly used commonsense knowledge: 1) Event narrative knowledge, which captures temporal and causal relations between events; 2) Entity semantic knowledge, which captures semantic relations between entities; 3) Sentiment coherent knowledge, which captures sentimental coherence between elements.
In this paper, we represent commonsense knowledge as a set of inference rules of the form X --f--> Y : s, which means that element Y can be inferred from element X under relation f, with an inference cost s. An element can stand for an event, an entity or a sentiment, and this paper represents elements using lemmatized nouns, verbs and adjectives. The lexical element representation can easily be extended to a structural representation, like the one in (Chambers and Jurafsky, 2008), if needed. However, in auxiliary experiments we found that using structural elements results in severe sparseness and noise, which in turn hurts reasoning performance; we therefore leave this to separate work. Table 2 shows several examples of inference rules. In the following, we describe how to mine each type of inference rule.
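The rule representation above can be sketched as a small data structure. All element names, relation labels and costs below are purely illustrative, not mined values:

```python
from collections import namedtuple

# A hypothetical encoding of inference rules of the form X --f--> Y : s.
Rule = namedtuple("Rule", ["antecedent", "relation", "consequent", "cost"])

rules = [
    Rule("restaurant", "associative", "food", 0.1),
    Rule("walk", "narrative", "order", 0.4),
    Rule("puppy", "coref", "dog", 0.0),
]

def rules_for(antecedent, relation, rule_set):
    """Return all rules matching the given antecedent element and relation."""
    return [r for r in rule_set if r.antecedent == antecedent
            and r.relation == relation]

print([r.consequent for r in rules_for("restaurant", "associative", rules)])
# ['food']
```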

Mining Event Narrative Knowledge
Event narrative knowledge captures structured temporal and causal knowledge about stereotypical event sequences, which is fundamental for commonsense machine comprehension. For example, we can infer "X ordered some foods" from "X walked to a restaurant" using event narrative knowledge. Previous work (Chambers and Jurafsky, 2008; Rudinger et al., 2015) shows that event narrative knowledge can be mined from raw texts in an unsupervised way. We therefore propose two models to encode this knowledge as inference rules. The first is based on ordered PMI, as proposed by Rudinger et al. (2015). Given two elements e1 and e2, this model calculates the cost of the inference rule e1 --narrative--> e2 as the negative ordered PMI:

s(e1 --narrative--> e2) = -PMI(e1, e2) = -log [ P(e1, e2) / (P(e1, ·) P(·, e2)) ]

where the probabilities are estimated from C(e1, e2), the order-sensitive count of element e1 occurring before element e2 in different sentences of the same document. The second model is a variant of the skip-gram model (Mikolov et al., 2013). The goal of this model is to find element representations that can accurately predict relevant subsequent elements. Formally, given n asymmetric pairs of elements (e1_1, e1_2), (e2_1, e2_2), ..., (en_1, en_2) identified from the training data, the objective of our model is to maximize the average log probability

(1/n) sum_{i=1..n} log P(ei_2 | ei_1)

where the probability P(e2 | e1) is defined using a softmax function:

P(e2 | e1) = exp(v'_{e2}^T v_{e1}) / sum_e exp(v'_e^T v_{e1})

Here v_e and v'_e are the "antecedent" and "consequent" vector representations of element e, respectively. We use the negative inner product -v'_{e2}^T v_{e1} as the cost of the inference rule e1 --narrative--> e2.
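The ordered-PMI cost can be sketched as follows, assuming the cost is the negative ordered PMI computed from order-sensitive co-occurrence counts. The toy documents (one lemmatized element per sentence, a simplification) are illustrative:

```python
import math
from collections import Counter

def ordered_pmi_costs(documents):
    """Estimate narrative rule costs as negative ordered PMI.

    `documents` is a list of element sequences. C(e1, e2) counts e1
    occurring before e2 within the same document; the cost of the rule
    e1 --narrative--> e2 is -PMI(e1, e2) under those ordered counts.
    """
    pair = Counter()   # C(e1, e2), order-sensitive
    left = Counter()   # C(e1, *)
    right = Counter()  # C(*, e2)
    total = 0
    for doc in documents:
        for i in range(len(doc)):
            for j in range(i + 1, len(doc)):
                pair[(doc[i], doc[j])] += 1
                left[doc[i]] += 1
                right[doc[j]] += 1
                total += 1
    costs = {}
    for (e1, e2), c in pair.items():
        pmi = math.log(c * total / (left[e1] * right[e2]))
        costs[(e1, e2)] = -pmi
    return costs

docs = [["walk", "order", "eat"], ["walk", "order"],
        ["walk", "sleep"], ["run", "sleep"]]
costs = ordered_pmi_costs(docs)
# Frequently ordered pairs get a lower (better) inference cost.
print(costs[("walk", "order")] < costs[("walk", "sleep")])  # True
```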

Mining Entity Semantic Knowledge
Entities, often serving as event participants or environment variables, are important components of commonsense stories. Intuitively, an entity in a hypothesis is reasonable if we can identify semantic relations between it and some parts of the premise document. For example, if a premise document contains "Starbucks", then "coffeehouse" and "latte" will be reasonable entities in the hypothesis, since "Starbucks" is a possible coreference of "coffeehouse" and it is semantically related to "latte". Specifically, we identify two main kinds of semantic relations between entities for commonsense machine comprehension: 1) Coreference relation, which indicates that two elements refer to the same entity in the environment. In stories, besides pronouns, an entity is often referred to by its hypernyms, e.g., the second example in Table 1 uses "dog" to refer to "puppy". Motivated by this observation, we mine coreference knowledge between elements using WordNet (Kilgarriff and Fellbaum, 2000): X --coref--> Y is an inference rule with cost 0 if X and Y are lemmas in the same WordNet synset, or stand in a hyponymy relation in WordNet. Otherwise, the cost of the inference rule between this element pair under this relation is 1.
2) Associative relation, which captures the semantic relatedness between two entities, e.g., "starbucks" → "latte", "restaurant" → "food", etc. This paper mines associative relations between entities from Wikipedia, using the method proposed by Milne and Witten (2008). Specifically, given two entities e1 and e2, we compute the semantic distance dist(e1, e2) between them as:

dist(e1, e2) = (log(max(|E1|, |E2|)) - log(|E1 ∩ E2|)) / (log(|W|) - log(min(|E1|, |E2|)))

where E1 and E2 are the sets of all articles that link to these two entities in Wikipedia, respectively, and W is the entire Wikipedia. We set the cost of the inference rule e1 --associative--> e2 to dist(e1, e2).
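This link-based distance can be sketched directly from incoming-link sets, following Milne and Witten's normalized link distance. The entities and link sets below are made up for illustration:

```python
import math

def semantic_distance(links1, links2, wiki_size):
    """Milne-Witten style distance between two entities.

    links1/links2 are the sets of articles linking to each entity,
    wiki_size is the total number of Wikipedia articles |W|. Returns 1.0
    (maximal distance) when the incoming-link sets do not overlap.
    """
    overlap = len(links1 & links2)
    if overlap == 0:
        return 1.0
    a, b = len(links1), len(links2)
    d = (math.log(max(a, b)) - math.log(overlap)) / \
        (math.log(wiki_size) - math.log(min(a, b)))
    return min(max(d, 0.0), 1.0)  # clamp into [0, 1]

# Illustrative link sets: heavily overlapping entities are close.
starbucks = {"seattle", "coffee", "espresso", "latte_art"}
latte = {"coffee", "espresso", "milk", "latte_art"}
print(semantic_distance(starbucks, latte, wiki_size=1000))
```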

Mining Sentiment Coherent Knowledge
Sentiment is one of the central and pervasive aspects of human experience (Ortony et al., 1990). It plays an important role in commonsense stories, i.e., a reasonable hypothesis should be sentimentally coherent with its premise document. In this paper, we mine sentiment coherence rules using SentiWordNet (Baccianella et al., 2010), in which each synset of WordNet is assigned three sentiment scores: positivity, negativity and objectivity. Concretely, to identify a sentimental coherence rule between two elements e1 and e2, we first compute the positivity, negativity and objectivity scores of every element by averaging the scores of all synsets it belongs to. We then identify an element as subjective if its objectivity score is smaller than a threshold and the distance between its positivity and negativity scores is greater than a threshold. Finally, for an inference rule e1 --senti--> e2, we set its cost to 1 if e1 and e2 are both subjective and have opposite sentimental polarities, to -1 if they are both subjective and their sentimental polarities are the same, and to 0 in all other cases. For example, we will mine the inference rules "good --senti--> happy : -1", "perfect --senti--> sad : 1" and "young --senti--> happy : 0".
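The rule-cost assignment can be sketched as below. The (positivity, negativity, objectivity) triples and thresholds are illustrative stand-ins for scores averaged from SentiWordNet:

```python
def sentiment_cost(e1, e2, scores, obj_thresh=0.5, gap_thresh=0.2):
    """Cost of rule e1 --senti--> e2 from (pos, neg, obj) scores.

    `scores` maps element -> (positivity, negativity, objectivity).
    Both elements subjective with same polarity -> cost -1 (coherent),
    opposite polarity -> cost 1 (incoherent), otherwise -> cost 0.
    """
    def polarity(e):
        pos, neg, obj = scores[e]
        if obj < obj_thresh and abs(pos - neg) > gap_thresh:
            return 1 if pos > neg else -1  # subjective element
        return 0  # objective / unclear element
    p1, p2 = polarity(e1), polarity(e2)
    if p1 == 0 or p2 == 0:
        return 0
    return -1 if p1 == p2 else 1

# Illustrative averaged SentiWordNet-style scores.
scores = {"good": (0.8, 0.1, 0.1), "happy": (0.9, 0.0, 0.1),
          "perfect": (0.9, 0.0, 0.05), "sad": (0.0, 0.9, 0.1),
          "young": (0.1, 0.1, 0.8)}
print(sentiment_cost("good", "happy", scores))   # -1: coherent
print(sentiment_cost("perfect", "sad", scores))  # 1: incoherent
print(sentiment_cost("young", "happy", scores))  # 0: "young" is objective
```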

Metric Learning to Calibrate Cost Measurement
So far, we have extracted many inference rules under different relations. However, because we extract them from different sources and estimate their costs using different measurements, the cost metrics of these rules may not be consistent with each other. To exploit different types of inference rules in a unified framework, we here propose a metric learning based method to calibrate their costs.
Given an input distance function, a metric learning method constructs a new distance function which is "better" than the original one, with supervision regarding an ideal distance (Kulis, 2012). To calibrate inference rule costs, we add a nonlinear layer on top of the original cost s_r of an inference rule r under relation f:

c_r = sigmoid(w_f * s_r + b_f)

Here c_r is the metric-unified inference cost of inference rule r, and w_f and b_f are calibration parameters for inference rules of relation f. We use the sigmoid function in order to normalize costs into the range (0, 1). The calibration parameters are trained along with the other parameters in our model; see Section 3.4 for details.
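A minimal sketch of the calibration layer, assuming the form c_r = sigmoid(w_f * s_r + b_f) with per-relation parameters. The parameter values below are illustrative and untrained:

```python
import math

def calibrate(raw_cost, w_f, b_f):
    """Map a relation-specific raw cost onto the shared (0, 1) scale via
    c_r = sigmoid(w_f * s_r + b_f). w_f and b_f are learned per relation;
    the values passed below are illustrative initial values."""
    return 1.0 / (1.0 + math.exp(-(w_f * raw_cost + b_f)))

# A negative-PMI narrative cost and a sentiment-conflict cost end up
# comparable on the same (0, 1) scale.
print(calibrate(-0.4, w_f=1.0, b_f=0.0))
print(calibrate(1.0, w_f=1.0, b_f=0.0))
```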

Dealing with Negation
One important linguistic phenomenon that needs specific consideration is negation. Here we discuss how to handle negation in our model. We use ¬X to represent an element X modified by a negation word (the existence of negation is detected using dependency relations). Under the event narrative relation and the sentiment coherent relation, the existence of negation reverses the conclusion. So for each original rule X --f--> Y : s we add three additional negation rules: ¬X --f--> ¬Y : s, X --f--> ¬Y : 1 - s, and ¬X --f--> Y : 1 - s. Here s is the calibrated cost of the original inference rule. For entity semantic relations, we simply ignore negation, since it does not affect inference under these relations.
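The rule expansion can be sketched as follows, under the reading that full negation preserves the original cost while half negation takes the complementary cost 1 - s. This reading is an assumption consistent with the text (negation "reverses the conclusion"), and the element names are illustrative:

```python
def negation_rules(x, y, s):
    """Expand rule x --f--> y : s (calibrated cost s in (0, 1)) with its
    negation variants. Assumption: negating both sides keeps cost s, while
    negating exactly one side gets the complementary cost 1 - s.
    Elements are modeled as ("not", element) tuples for negated mentions."""
    return [
        (("not", x), ("not", y), s),   # ¬X --f--> ¬Y : s
        (x, ("not", y), 1.0 - s),      #  X --f--> ¬Y : 1 - s
        (("not", x), y, 1.0 - s),      # ¬X --f-->  Y : 1 - s
    ]

for rule in negation_rules("thorough", "commend", 0.2):
    print(rule)
```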

Machine Comprehension via Commonsense Reasoning
This section describes how to leverage acquired knowledge for commonsense machine comprehension. We first define how to infer from a premise document to a hypothesis using inference rules. Then we model how to choose inference rules for a specific reasoning context. Finally, we describe how to measure the reasoning distance from a premise document to a hypothesis by summarizing the costs of all possible inferences.

Inference from Premise Document to Hypothesis
Given a premise document D = {d_1, d_2, ..., d_m} containing m elements and a hypothesis H = {h_1, h_2, ..., h_n} containing n elements, a valid inference R from D to H is a set of inference rules such that every element in H is inferred from one element in D using one and only one rule in R. This definition means that all elements in H must be covered by the consequents of the inference rules in R, and all antecedents of the inference rules in R must come from D. Figure 1 shows some inference examples, where (a), (b) and (d) are valid inferences, but (c) is not, because its rules cannot cover all elements in the hypothesis. By this definition, the sizes of R and H are equal, so we use r_i to denote the inference rule in R applied to derive element h_i in H, i.e., R = {r_1, r_2, ..., r_n}.
Based on the above definition, we can naturally define the cost of an inference R as the cost sum of all inference rules in R. In Figure 1, the cost for inference (a) is 0.0 + 0.1 + 0.1 = 0.2, and for inference (d) is 0.0 + 0.8 = 0.8.
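The validity check and cost sum above can be sketched directly from the definition. Rules are (antecedent, consequent, cost) triples; the element names and costs are illustrative:

```python
def inference_cost(inference, hypothesis):
    """Return the cost sum if `inference` is a valid inference for
    `hypothesis`: exactly one rule per hypothesis element, with the rule
    consequents covering the hypothesis exactly. Returns None otherwise."""
    if len(inference) != len(hypothesis):
        return None  # one and only one rule per hypothesis element
    if sorted(consequent for _, consequent, _ in inference) != sorted(hypothesis):
        return None  # consequents must cover all hypothesis elements
    return sum(cost for _, _, cost in inference)

hyp = ["order", "food"]
valid = [("walk", "order", 0.1), ("restaurant", "food", 0.1)]
invalid = [("walk", "order", 0.1)]  # does not cover "food"
print(inference_cost(valid, hyp))    # 0.2
print(inference_cost(invalid, hyp))  # None
```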

Modeling Inference Probability using Attention Mechanism
Obviously, there can exist multiple valid inferences for a premise document and a hypothesis. For example, in Figure 1, both (a) and (b) are valid inferences for the same premise document and hypothesis. To identify whether a hypothesis is reasonable, we need to consider all possible inferences. However, in the human reasoning process, not all inference rules are equally likely to be applied: more reasonable inferences are more likely to be proposed. In Figure 1, inference (a) should have a higher probability than inference (b), because it is more reasonable to infer "foods" from "a restaurant" via the associative relation than from "walked to" via the narrative relation. Besides, the possibility of proposing an inference should not depend on its cost, e.g., inference (d) should still be likely to be proposed despite its high cost, because we often infer the event "sleep" from another event using rules under the narrative relation. In other words, the "cost" measures the "correctness" of an inference rule: a rule with low cost is more likely to be reasonable, and a rule with high cost is more likely to contradict commonsense. The "possibility", on the other hand, measures how likely a rule will be applied in a given context, which depends not on the cost but on the nature of the rule and the given context. Motivated by these observations, we endow each inference with a probability P(R|D, H), indicating the possibility that R is chosen to infer hypothesis H from premise document D. For simplicity, we assume that each element in the hypothesis is inferred independently using an individual inference rule, so P(R|D, H) can be written as:

P(R|D, H) = prod_{i=1..n} P(r_i|D, h_i)    (6)

P(r_i|D, h_i) = sum_{d_j in D} P(r_i, d_j|D, h_i)    (7)

Equation (7) shows how an inference rule is selected given the premise document D and the element h_i in the hypothesis: it depends on which element d_j in D is selected and which relation f is used to infer h_i from d_j.
We then refactor the probability P(r_i, d_j|D, h_i) as:

P(r_i, d_j|D, h_i) = g(h_i, d_j, f(r_i); D) / sum_{d in D} sum_{f in F} g(h_i, d, f; D)

Here f(r) is the relation type of inference rule r, and g(h, d, f; D) is defined as:

g(h, d, f; D) = exp( s(h, d) + a(h, f) + a(d, f) )

Here F denotes all relation types of inference rules, and s(e_1, e_2) is a matching function between two elements e_1 and e_2, measured by cosine similarity based on GoogleNews word2vec vectors (Mikolov et al., 2013). a(e, f) is an attention function measuring how likely an element e is to be involved in rules under relation f:

a(e, f) = v_f^T tanh(W_f e + b_f)

where v_f ∈ R^K, W_f ∈ R^{K×F} and b_f ∈ R^K are attention parameters of relation f, and e ∈ R^F is the feature vector of element e. Here K is the size of the attention hidden layer and F is the dimension of the feature vector. We consider three types of features, as shown in Table 3. Using this attention mechanism, our method models the possibility that an inference rule is applied during the inference from a premise document to a hypothesis by considering the relatedness between elements and knowledge categories, as well as the relatedness between two elements, which makes it able to select the most reasonable inference rules for deriving each part of the hypothesis.
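One plausible instantiation of this selection probability is sketched below, assuming g(h, d, f) = exp(s(h, d) + a(h, f) + a(d, f)) normalized over all (premise element, relation) pairs; the exact combination in a full system may differ, and all parameter values, feature vectors and similarities here are illustrative:

```python
import math

def attention(e_feat, v_f, W_f, b_f):
    """a(e, f) = v_f . tanh(W_f e + b_f): per-relation attention score."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, e_feat)) + b)
              for row, b in zip(W_f, b_f)]
    return sum(v * h for v, h in zip(v_f, hidden))

def rule_probs(h_feat, d_feats, relations, params, sim):
    """Normalize g = exp(sim(h, d) + a(h, f) + a(d, f)) over all
    (premise element index j, relation f) pairs into probabilities."""
    g = {}
    for j, d_feat in enumerate(d_feats):
        for f in relations:
            v_f, W_f, b_f = params[f]
            score = (sim[j] + attention(h_feat, v_f, W_f, b_f)
                     + attention(d_feat, v_f, W_f, b_f))
            g[(j, f)] = math.exp(score)
    z = sum(g.values())
    return {key: val / z for key, val in g.items()}

# Illustrative 2-dimensional features and parameters for two relations.
params = {"narrative": ([1.0, -0.5], [[0.3, 0.1], [0.2, 0.4]], [0.0, 0.1]),
          "coref": ([0.2, 0.8], [[0.1, 0.0], [0.5, 0.3]], [0.1, 0.0])}
probs = rule_probs([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]],
                   ["narrative", "coref"], params, sim=[0.9, 0.1])
print(round(sum(probs.values()), 6))  # the scores form a distribution
```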

Reasoning Distance Between Premise Document and Hypothesis
Given a premise document, this section shows how to measure whether a hypothesis is coherent using the above inference model. Given all valid inferences from D to H and the probability P(R|D, H) of selecting inference R to infer H from D, we measure the reasoning distance L(D → H) as the expected cost over all valid inferences:

L(D → H) = sum_R P(R|D, H) * c(R)

where c(R) is the cost sum of all inference rules in R. Then, using Equation (6) and Equation (7), we can further rewrite this as:

L(D → H) = sum_{i=1..n} sum_{d_j in D} sum_{r_i} P(r_i, d_j|D, h_i) * c_{r_i}    (15)

Equation (15) shows that in our framework, the final cost of inferring element h_i in the hypothesis is the expected cost over all valid inference rules which can derive h_i from some element in the premise document.
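The expected-cost distance can be sketched as below: for each hypothesis element, sum probability-weighted rule costs, then sum over elements. The rule probabilities and calibrated costs are illustrative:

```python
def reasoning_distance(hypothesis, rule_probs, rule_costs):
    """L(D -> H) as an expected cost: for each hypothesis element h,
    sum P(rule) * cost(rule) over the rules that can derive h (their
    probabilities summing to 1), then sum over all hypothesis elements."""
    total = 0.0
    for h in hypothesis:
        total += sum(p * rule_costs[r] for r, p in rule_probs[h].items())
    return total

# Illustrative selection probabilities and calibrated rule costs.
probs = {"order": {("walk", "narrative"): 0.7,
                   ("restaurant", "associative"): 0.3},
         "food": {("restaurant", "associative"): 1.0}}
costs = {("walk", "narrative"): 0.4, ("restaurant", "associative"): 0.1}
print(reasoning_distance(["order", "food"], probs, costs))
# 0.7*0.4 + 0.3*0.1 + 1.0*0.1 = 0.41
```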

Model Learning
Following Huang et al. (2013), our model measures the posterior probability of choosing hypothesis H as the answer for premise document D through a softmax function:

P(H|D) = exp(-γ L(D → H)) / sum_{H' in H_D} exp(-γ L(D → H'))

Here H_D is the set of all candidate hypotheses for D, and γ is a positive smoothing factor. We train our model by maximizing the log-likelihood of choosing the correct hypothesis H+ for each D:

L(θ) = sum_{(D, H+)} log P(H+|D)

where θ is the parameter set of our model, including the calibration parameters in Section 2.4 and the attention parameters in Section 3.2. L(θ) is differentiable, so we can estimate θ using any gradient-based optimization algorithm.
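The objective can be sketched as below, assuming P(H|D) is a softmax over negative scaled reasoning distances, so smaller distance means higher probability. The hypothesis ids and distances are illustrative:

```python
import math

def choice_log_likelihood(distances, correct, gamma=0.5):
    """Log P(H+|D) with P(H|D) = softmax(-gamma * L(D -> H)) over the
    candidate hypotheses. `distances` maps hypothesis id -> reasoning
    distance L(D -> H); gamma is the positive smoothing factor."""
    logits = {h: -gamma * d for h, d in distances.items()}
    z = sum(math.exp(v) for v in logits.values())
    return logits[correct] - math.log(z)

# The correct ending has a smaller reasoning distance, hence the higher
# probability; training would push this log-likelihood upward.
ll = choice_log_likelihood({"right": 0.4, "wrong": 1.3}, correct="right")
print(math.exp(ll))  # probability assigned to the right ending
```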

Experimental Settings
Data Preparation. We evaluated our approach on the Test Set Spring 2016 of RocStories, which consists of 1871 commonsense stories, each with two candidate story endings. Because stories in the training set of RocStories do not contain wrong hypotheses, and our model has a compact parameter set, we estimated the parameters of our model on the Validation Set Spring 2016 of RocStories, which also contains 1871 commonsense stories. We mined event narrative knowledge from the Training Set Spring 2016 of RocStories, which consists of 45502 commonsense stories. We performed lemmatization, part-of-speech annotation, named entity tagging and dependency parsing using the Stanford CoreNLP toolkit. We used the Jan. 30, 2010 English version of Wikipedia and processed it according to the method described by Hu et al. (2008).
Model Training. We used normalized initialization (Glorot and Bengio, 2010) to initialize attention parameters in our model. For calibration parameters, we initialized all w f to 1 and b f to 0. The model parameters were trained using minibatch stochastic gradient descent algorithm. As for hyper-parameters, we set the batch size as 32, the learning rate as 1, the dimension of attention hidden layer K as 32, and the smoothing factor γ as 0.5.
Baselines. We compared our approach with the following three baselines: 1) Narrative Event Chain (Chambers and Jurafsky, 2008), which scores hypotheses using PMI scores between events. We used a simplified version of the original model that uses only verbs as events, ignoring the dependency relations between verbs and their participants. We found this simplified version achieved better performance than the original one, whose performance was reported in (Mostafazadeh et al., 2016a).
2) Deep Structured Semantic Model (DSSM) (Huang et al., 2013), which achieved the best performance on RocStories as reported by Mostafazadeh et al. (2016a). This model measures the reasoning score between a premise document D and a hypothesis H by calculating the cosine similarity between the overall vector representations of D and H, and does not consider any other task-relevant knowledge.
3) Recurrent Neural Network (RNN) Model proposed by Pichotta and Mooney (2015), which transforms all events and their arguments into a sequence and predicts upcoming events and arguments using a Long Short-Term Memory network. We used the average generation probability of all elements in H as the reasoning score, and chose the hypothesis with the largest reasoning score as the system answer.


Overall Performance

1) Our model outperforms all baselines significantly. Compared with the baselines, the accuracy improvement on the test set is at least 13.7%. This demonstrates the effectiveness of our model in mining and exploiting heterogeneous knowledge.

2) Event narrative knowledge alone is insufficient for commonsense machine comprehension. Compared with the Narrative Event Chain model, our model achieves a 16.3% accuracy improvement by considering richer commonsense knowledge rather than only narrative event knowledge.
3) It is necessary to distinguish different kinds of commonsense relations for machine comprehension and commonsense reasoning. Compared with DSSM and RNN, which model all relations between two elements using a single semantic similarity score, our model achieves significant accuracy improvements by modeling, distinguishing and selecting different types of commonsense relations between different kinds of elements.

Effects of Different Knowledge
To investigate the effect of different kinds of knowledge in our model, we conducted two groups of experiments.
The first group of experiments used only one kind of knowledge at a time in our model. Table 5 shows the results. We can see that a single kind of knowledge is insufficient for commonsense machine comprehension: no single-knowledge setting achieves performance competitive with the all-knowledge setting.

System                          Accuracy
Event Narrative Knowledge       60.98%
Entity Semantics Knowledge      57.14%
Sentiment Coherent Knowledge    61.30%
Our Model (All Knowledge)       67.02%

Table 5: Comparison of the performance using one single type of knowledge.

The second group of experiments removed one type of knowledge at a time from the full model. Table 6 (Comparison of the performance by removing one single type of knowledge) shows the results. We find that removing any kind of knowledge reduces the accuracy. This verifies that each kind of knowledge contains unique complementary information that cannot be covered by the other types.

Effect of Inference Probability
This section investigates the effect of the inference rule selection probability, and whether our attention mechanism can effectively model the possibility of inference rule selection. We compared our method with the following two heuristic settings: 1) Minimum Cost Mechanism, which measures the reasoning distance by selecting only the inference rule with the minimum cost for each hypothesis element.
2) Average Cost Mechanism, which measures the reasoning distance by setting equal probabilities to all inference rules that can infer a hypothesis element from a premise document element.

System                              Accuracy
Minimum Cost Mechanism              54.84%
Average Cost Mechanism              63.01%
Our Model (Attention Mechanism)     67.02%

Table 7: Comparison of the performance using different inference rule selection mechanisms.

Table 7 shows the results. We can see that: 1) the minimum cost mechanism cannot achieve competitive performance; we believe this is because the selection of rules should not depend on their costs, and considering all valid inferences is critical for reasoning; 2) our attention mechanism can effectively model the inference rule selection possibility. Compared with the average cost mechanism, our method achieved a 6.36% relative accuracy improvement. This also verifies the necessity of an effective inference rule probability model.

Effect of Negation Rules
This section investigates the effect of the special handling of negation described in Section 2.5. To verify the necessity of the proposed negation rules, we removed all negation rules from the original system and measured the change in accuracy.

System                    Accuracy
Our Model                 67.02%
 - w/o Negation Rules     63.12%

Table 8: Comparison of the performance after removing negation rules.

Table 8 shows the results. We can see that removing the negation rules significantly drops the system performance, which confirms the effectiveness of our proposed negation rules.

Related Work
Endowing computers with the ability to understand commonsense stories has long been a goal of natural language processing. There exist two big challenges: 1) Matching explicit information in the given context; 2) Incorporating implicit commonsense knowledge into a human-like reasoning process. Previous machine comprehension tasks (Richardson et al., 2013; Rajpurkar et al., 2016) mainly focus on the first challenge, so their solutions focus on semantic matching between texts (Weston et al., 2014; Kumar et al., 2015; Narasimhan and Barzilay, 2015; Smith et al., 2015; Sukhbaatar et al., 2015; Hill et al., 2015; Wang et al., 2015; Cui et al., 2016; Trischler et al., 2016a,b; Kadlec et al., 2016; Kobayashi et al., 2016; Wang and Jiang, 2016b), but ignore the second. One notable task is SNLI (Bowman et al., 2015), which considers entailment between two sentences. This task, however, only provides shallow context and thus requires only a few kinds of implicit knowledge (Rocktäschel et al., 2015; Wang and Jiang, 2016a; Angeli et al., 2016; Parikh et al., 2016; Henderson and Popa, 2016; Zhao et al., 2017).
Realizing that story understanding needs commonsense knowledge, much work has been devoted to learning structural event knowledge. Chambers and Jurafsky (2008) first proposed an unsupervised approach to learn partially ordered sets of events from raw text. Many extensions have since been introduced, including unsupervised learning of narrative schemas and scripts (Chambers and Jurafsky, 2009; Regneri et al., 2011), event schemas and frames (Chambers and Jurafsky, 2011; Balasubramanian et al., 2013; Sha et al., 2016; Huang et al., 2016; Mostafazadeh et al., 2016b), and generative models that learn latent structures of event knowledge (Cheung et al., 2013; Chambers, 2013; Bamman et al., 2014; Nguyen et al., 2015). Another direction for learning event-centred knowledge is causality identification (Do et al., 2011; Radinsky et al., 2012; Berant et al., 2014; Hashimoto et al., 2015; Gui et al., 2016), which tries to identify causal relations in text.
For reasoning over such knowledge, Jans et al. (2012) introduced skip-grams for collecting co-occurrence statistics. Further improvements include incorporating more information and more complicated models (Radinsky and Horvitz, 2013; Modi and Titov, 2014; Ahrendt and Demberg, 2016). Recent research has also tried to solve the event prediction problem by transforming it into a language modeling paradigm (Pichotta and Mooney, 2014; Rudinger et al., 2015; Hu et al., 2017).
The principal difference between previous work and our method is that we not only take various kinds of implicit commonsense knowledge into consideration, but also provide a highly extensible framework to exploit these kinds of knowledge for commonsense machine comprehension. We also note recent progress on RocStories (Mostafazadeh et al., 2017). Rather than inferring a possible ending generated from the document, recent systems solve this task by discriminatively comparing the two candidates. This enables very strong stylistic features to be added explicitly (Schwartz et al., 2017; Bugert et al., 2017) or implicitly (Schenk and Chiarcos, 2017), which can select a hypothesis without any consideration of the given document. Also, some augmentation strategies have been introduced to produce more training data (Roemmele and Gordon, 2017; Mihaylov and Frank, 2017; Bugert et al., 2017). These methods are dataset-sensitive and are not the main focus of our paper.

Conclusions and Future Work
This paper proposes a commonsense machine comprehension method which performs effective commonsense reasoning by taking heterogeneous knowledge into consideration. Specifically, we mine commonsense knowledge from heterogeneous knowledge sources and simultaneously exploit it via a highly extensible multi-knowledge reasoning framework. Experimental results show that our method surpasses the baselines by a large margin.
Currently, there are few labeled training instances for commonsense machine comprehension; in future work, we aim to address this issue by developing semi-supervised or unsupervised approaches.