KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning

Commonsense reasoning aims to empower machines with the human ability to make presumptions about ordinary situations in our daily life. In this paper, we propose a textual inference framework for answering commonsense questions, which effectively utilizes external, structured commonsense knowledge graphs to perform explainable inferences. The framework first grounds a question-answer pair from the semantic space to the knowledge-based symbolic space as a schema graph, a related sub-graph of external knowledge graphs. It represents schema graphs with a novel knowledge-aware graph network module named KagNet, and finally scores answers with graph representations. Our model is based on graph convolutional networks and LSTMs, with a hierarchical path-based attention mechanism. The intermediate attention scores make it transparent and interpretable, and thus produce trustworthy inferences. Using ConceptNet as the only external resource for BERT-based models, we achieved state-of-the-art performance on CommonsenseQA, a large-scale dataset for commonsense reasoning.


Introduction
Human beings are rational and a major component of rationality is the ability to reason. Reasoning is the process of combining facts and beliefs to make new decisions (Johnson-Laird, 1980), as well as the ability to manipulate knowledge to draw inferences (Hudson and Manning, 2018). Commonsense reasoning utilizes the basic knowledge that reflects our natural understanding of the world and human behaviors, which is common to all humans. Empowering machines with the ability to perform commonsense reasoning has been seen as the bottleneck of artificial general intelligence (Davis and Marcus, 2015). Recently, there have been a few emerging large-scale datasets for testing machine commonsense with various focuses (Zellers et al., 2018; Sap et al., 2019b; Zellers et al., 2019). In a typical dataset, CommonsenseQA (Talmor et al., 2019), given a question like "Where do adults use glue sticks?", with the answer choices being {classroom, office, desk drawer}, a commonsense reasoner is expected to differentiate the correct choice from other "distractive" candidates. False choices are usually highly related to the question context, but just less plausible in real-world scenarios, making the task even more challenging. This paper aims to tackle the research question of how we can teach machines to make such commonsense inferences, particularly in the question-answering setting.
It has been shown that simply fine-tuning large, pre-trained language models such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) can be a very strong baseline method. However, there still exists a large gap between the performance of these baselines and human performance. Reasoning with neural models also lacks transparency and interpretability. It is unclear how they manage to answer commonsense questions, which makes their inferences dubious.
Merely relying on pre-training large language models on corpora cannot provide well-defined or reusable structures for explainable commonsense reasoning. We argue that it would be more beneficial to propose reasoners that can exploit commonsense knowledge bases (Speer et al., 2017; Tandon et al., 2017; Sap et al., 2019a). Knowledge-aware models can explicitly incorporate external knowledge as relational inductive biases (Battaglia et al., 2018) to enhance their reasoning capacity, as well as to increase the transparency of model behaviors for more interpretable results. Furthermore, a knowledge-centric approach is extensible through commonsense knowledge acquisition techniques (Li et al., 2016; Xu et al., 2018).
We propose a knowledge-aware reasoning framework for learning to answer commonsense questions, which has two major steps: schema graph grounding (§3) and graph modeling for inference (§4). As shown in Fig. 1, for each pair of question and answer candidate, we retrieve a graph from external knowledge graphs (e.g., ConceptNet) in order to capture the relevant knowledge for determining the plausibility of a given answer choice. The graphs are named "schema graphs", inspired by the schema theory proposed by Gestalt psychologists (Axelrod, 1973). The grounded schema graphs are usually much more complicated and noisier than the ideal case shown in the figure. Therefore, we propose a knowledge-aware graph network module to further effectively model schema graphs. Our model KagNet is a combination of graph convolutional networks (Kipf and Welling, 2017) and LSTMs, with a hierarchical path-based attention mechanism, which forms a GCN-LSTM-HPA architecture for path-based relational graph representation. Experiments show that our framework achieved a new state-of-the-art performance on the CommonsenseQA dataset. Our model also works better than other methods with limited supervision, and provides human-

Overview
In this section, we first formalize the commonsense question answering problem in a knowledge-aware setting, and then introduce the overall workflow of our framework.

Problem statement
Given a commonsense-required natural language question $q$ and a set of $N$ candidate answers $\{a_i\}$, the task is to choose one answer from the set. From a knowledge-aware perspective, we additionally assume that the question $q$ and choices $\{a_i\}$ can be grounded as a schema graph (denoted as $g$) extracted from a large external knowledge graph $G$, which is helpful for measuring the plausibility of answer candidates. The knowledge graph $G = (V, E)$ can be defined as a fixed set of concepts $V$, and typed edges $E$ describing semantic relations between concepts. Therefore, our goal is to effectively ground and model schema graphs to improve the reasoning process.

Reasoning Workflow
As shown in Fig. 2, our framework accepts a pair of question and answer (QA-pair) denoted as $q$ and $a$. It first recognizes the mentioned concepts within them respectively from the concept set $V$ of the knowledge graph. We then algorithmically construct the schema graph $g$ by finding paths between pairs of mentioned concepts (§3).
The grounded schema graph is further encoded with our proposed knowledge-aware graph network module (§4). We first use a model-agnostic language encoder, which can either be trainable or a fixed feature extractor, to represent the QA-pair as a statement vector. The statement vector serves as an additional input to a GCN-LSTM-HPA architecture for path-based attentive graph modeling to obtain a graph vector. The graph vector is finally fed into a simple multi-layer perceptron to score this QA-pair into a scalar ranging from 0 to 1, representing the plausibility of the inference. The answer candidate with the maximum plausibility score for the same question becomes the final choice of our framework.
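The final selection step of this workflow can be sketched as follows. This is a minimal illustration only: the `plausibility` callable is a hypothetical stand-in for the whole encoder and scoring pipeline, and the toy scores are invented for the example rather than taken from the model.

```python
from typing import Callable, Sequence


def choose_answer(question: str,
                  candidates: Sequence[str],
                  plausibility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest plausibility score in [0, 1]."""
    scores = [plausibility(question, a) for a in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Toy stand-in scorer (hypothetical values, not model outputs).
def toy_plausibility(question: str, answer: str) -> float:
    table = {"classroom": 0.31, "office": 0.74, "desk drawer": 0.22}
    return table.get(answer, 0.0)


choice = choose_answer("Where do adults use glue sticks?",
                       ["classroom", "office", "desk drawer"],
                       toy_plausibility)
```

Each QA-pair is scored independently here, so the candidates are compared only through their scalar scores.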

Schema Graph Grounding
The grounding stage is three-fold: recognizing concepts mentioned in text (§3.1), constructing schema graphs by retrieving paths in the knowledge graph (§3.2), and pruning noisy paths (§3.3).

Concept Recognition
We match tokens in questions and answers to sets of mentioned concepts ($C_q$ and $C_a$ respectively) from the knowledge graph $G$ (for this paper we chose to use ConceptNet due to its generality).
A naive approach to mentioned concept recognition is to exactly match n-grams in sentences with the surface tokens of concepts in $V$. For example, in the question "Sitting too close to watch tv can cause what sort of pain?", the exact matching result $C_q$ would be {sitting, close, watch tv, watch, tv, sort, pain, etc.}. We are aware of the fact that such retrieved mentioned concepts are not always perfect (e.g., "sort" is not a semantically related concept, and "close" is a polysemous concept). How to efficiently retrieve contextually-related knowledge from noisy knowledge resources is still an open research question by itself (Weissenborn et al., 2017; Khashabi et al., 2017), and thus most prior works choose to stop here (Zhong et al., 2018; Wang et al., 2019b). We enhance this straightforward approach with some rules, such as soft matching with lemmatization and filtering of stop words, and further deal with noise by pruning paths (§3.3) and reducing their importance with attention mechanisms (§4.3).
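The n-gram matching with lemmatization and stop-word filtering can be sketched as below. The concept vocabulary, stop-word list, and lemma table are tiny hypothetical stand-ins (a real system would use the ConceptNet vocabulary and a proper lemmatizer), and the longest-first ordering is an assumption of this sketch.

```python
from typing import Dict, List, Set

# Toy concept vocabulary and helper tables (all hypothetical).
CONCEPTS: Set[str] = {"sit", "close", "watch_tv", "watch", "tv", "sort", "pain"}
STOP_WORDS: Set[str] = {"too", "to", "can", "what", "of"}
LEMMAS: Dict[str, str] = {"sitting": "sit"}


def recognize_concepts(sentence: str, max_n: int = 3) -> List[str]:
    """Match lemmatized n-grams (longest first) against the concept vocabulary."""
    tokens = [t.strip("?.,!").lower() for t in sentence.split()]
    tokens = [LEMMAS.get(t, t) for t in tokens if t]
    found: List[str] = []
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip unigrams that are stop words; longer n-grams may contain them.
            if n == 1 and gram[0] in STOP_WORDS:
                continue
            concept = "_".join(gram)
            if concept in CONCEPTS and concept not in found:
                found.append(concept)
    return found


matched = recognize_concepts(
    "Sitting too close to watch tv can cause what sort of pain?")
```

Note that noisy matches such as "sort" still pass this stage; they are left to the later pruning and attention steps.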

Schema Graph Construction
ConceptNet. Before diving into the construction of schema graphs, we would like to briefly introduce our target knowledge graph ConceptNet. ConceptNet can be seen as a large set of triples of the form $(h, r, t)$, like (ice, HasProperty, cold), where $h$ and $t$ represent head and tail concepts in the concept set $V$ and $r$ is a certain relation type from the pre-defined set $R$. We delete and merge the original 42 relation types into 17 types, in order to increase the density of the knowledge graph for grounding and modeling.
Sub-graph Matching via Path Finding. We define a schema graph as a sub-graph $g$ of the whole knowledge graph $G$, which represents the related knowledge for reasoning about a given question-answer pair with minimal additional concepts and edges. One may want to find a minimal spanning sub-graph covering all the question and answer concepts, which is actually the NP-complete "Steiner tree problem" in graphs (Garey and Johnson, 1977). Due to the incompleteness and tremendous size of ConceptNet, we find that it is impractical to retrieve a comprehensive but helpful set of knowledge facts this way. Therefore, we propose a straightforward yet effective graph construction algorithm via path finding among mentioned concepts ($C_q \cup C_a$).
Specifically, for each question concept $c_i \in C_q$ and answer concept $c_j \in C_a$, we can efficiently find paths between them that are shorter than $k$ concepts. Then, we add edges, if any, between the concept pairs within $C_q$ or $C_a$.
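A minimal sketch of bounded path finding between question and answer concepts is given below. The toy triples are hypothetical, edges are followed only in their stored direction (an assumption of this sketch), and `max_hops` is an illustrative parameter rather than the length limit used in the experiments.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy knowledge graph (hypothetical triples in ConceptNet style).
TRIPLES: List[Triple] = [
    ("glue_stick", "AtLocation", "office"),
    ("glue_stick", "RelatedTo", "glue"),
    ("glue", "AtLocation", "desk_drawer"),
    ("adult", "CapableOf", "work"),
    ("work", "AtLocation", "office"),
]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head; edges are followed in their stored direction only."""
    adj: Dict[str, List[Tuple[str, str]]] = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((r, t))
    return adj


def find_paths(adj, source: str, target: str, max_hops: int) -> List[List[Triple]]:
    """Enumerate simple paths from source to target with at most max_hops edges."""
    paths: List[List[Triple]] = []

    def dfs(node: str, path: List[Triple], visited: set) -> None:
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) == max_hops:
            return
        for rel, nxt in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append((node, rel, nxt))
                dfs(nxt, path, visited)
                path.pop()
                visited.remove(nxt)

    dfs(source, [], {source})
    return paths


adj = build_adjacency(TRIPLES)
question_concepts = ["adult", "glue_stick"]
answer_concepts = ["office"]
schema_paths = {
    (q, a): find_paths(adj, q, a, max_hops=3)
    for q in question_concepts for a in answer_concepts
}
```

The union of the triples on all returned paths gives the nodes and edges of the schema graph for this QA-pair.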

Path Pruning via KG Embedding
To prune irrelevant paths from potentially noisy schema graphs, we first utilize knowledge graph embedding (KGE) techniques, like TransE (Wang et al., 2014), to pre-train concept embeddings V and relation type embeddings R, which are also used as initialization for KagNet (§4). In order to measure the quality of a path, we decompose it into a set of triples, the confidence of which can be directly measured by the scoring function of the KGE method (i.e., the confidence of triple classification). Thus, we score a path with the product of the scores of each triple in the path, and then we empirically set a threshold for pruning (§5.3).
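The path scoring idea can be illustrated with the sketch below. The embeddings are toy values, and the mapping from the TransE distance to a confidence in (0, 1] is an assumption made for this example; the confidence actually used comes from the KGE method's triple classification.

```python
import math
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]

# Toy 2-d embeddings (hypothetical values; real ones come from KGE pre-training).
CONCEPT_VEC: Dict[str, Sequence[float]] = {
    "ice": (0.0, 0.0), "cold": (1.0, 0.0), "office": (0.0, 3.0),
}
RELATION_VEC: Dict[str, Sequence[float]] = {
    "HasProperty": (1.0, 0.0), "AtLocation": (0.0, 1.0),
}


def triple_confidence(triple: Triple) -> float:
    """Map the TransE distance ||h + r - t|| to (0, 1]; the mapping is an assumption."""
    h, r, t = triple
    dist = math.sqrt(sum(
        (hv + rv - tv) ** 2
        for hv, rv, tv in zip(CONCEPT_VEC[h], RELATION_VEC[r], CONCEPT_VEC[t])))
    return 1.0 / (1.0 + dist)


def path_score(path: List[Triple]) -> float:
    """Score a path as the product of its triples' confidences."""
    return math.prod(triple_confidence(tr) for tr in path)


def prune(paths: List[List[Triple]], threshold: float) -> List[List[Triple]]:
    """Keep only paths whose score reaches the threshold."""
    return [p for p in paths if path_score(p) >= threshold]


good = [("ice", "HasProperty", "cold")]
noisy = [("ice", "HasProperty", "cold"), ("cold", "AtLocation", "office")]
kept = prune([good, noisy], threshold=0.5)
```

Because the score is a product of values no larger than 1, longer paths are penalized unless every triple on them is highly plausible.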

Knowledge-Aware Graph Network
The core component of our reasoning framework is the knowledge-aware graph network module KagNet. KagNet first encodes plain structures of schema graphs with graph convolutional networks (§4.1) to accommodate pre-trained concept embeddings in their particular context within schema graphs. It then utilizes LSTMs to encode the paths between $C_q$ and $C_a$, capturing multi-hop relational information (§4.2). Finally, we apply a hierarchical path-based attention mechanism (§4.3) to complete the GCN-LSTM-HPA architecture, which models relational schema graphs with respect to the paths between question and answer concepts.

Figure 3: The GCN-LSTM-HPA architecture. Panel labels: Statement Vector s; GCNs Encoding Unlabeled Schema Graphs; Modeling Relational Paths; Path-level Attention; ConceptPair-level Attention.

Graph Convolutional Networks
Graph convolutional networks (GCNs) encode graph-structured data by updating node vectors via pooling features of their adjacent nodes (Kipf and Welling, 2017). Our intuition for applying GCNs to schema graphs is to 1) contextually refine the concept vectors and 2) capture structural patterns of schema graphs for generalization.
Although we have obtained concept vectors by pre-training (§3.3), the representations of concepts still need to be further accommodated to their specific schema graph context. Think of polysemous concepts such as "close" (§3.1), which can either be a verb concept like in "close the door" or an adjective concept meaning "a short distance apart". Using GCNs to update the concept vectors with their neighbors is thus helpful for disambiguation and contextualized concept embedding. Also, the pattern of schema graph structures provides potentially valuable information for reasoning. For instance, shorter and denser connections between question and answer concepts could mean higher plausibility under specific contexts.
As many works show (Marcheggiani and Titov, 2017; Zhang et al., 2018), relational GCNs (Schlichtkrull et al., 2018) usually over-parameterize the model and cannot effectively utilize multi-hop relational information. We thus apply GCNs on the plain version (unlabeled, non-directional) of schema graphs, ignoring relation types on the edges. Specifically, the vector for concept $c_i \in V_g$ in the schema graph $g$ is first initialized by its pre-trained embedding ($h_i^{(0)} = V_i$). Then, we update it at the $(l+1)$-th layer by pooling the features of its neighboring nodes ($N_i$) and its own at the $l$-th layer with a non-linear activation function $\sigma$:
$$h_i^{(l+1)} = \sigma\Big(W_{\text{self}}^{(l)} h_i^{(l)} + \sum_{j \in N_i} \frac{1}{|N_i|} W^{(l)} h_j^{(l)}\Big)$$
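One GCN layer on an unlabeled, non-directional schema graph can be sketched in plain Python as follows. The sketch assumes that a node's own vector is transformed by one matrix and the mean of its neighbors' vectors by another, with ReLU as the activation; the matrix names, the choice of ReLU, and all numeric values are illustrative assumptions.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gcn_layer(h: Dict[str, Vector],
              neighbors: Dict[str, List[str]],
              w_self: Matrix,
              w_neigh: Matrix) -> Dict[str, Vector]:
    """One update: ReLU(W_self h_i + mean over neighbors of W h_j)."""
    out: Dict[str, Vector] = {}
    for node, vec in h.items():
        total = matvec(w_self, vec)
        nbrs = neighbors.get(node, [])
        for nb in nbrs:
            msg = matvec(w_neigh, h[nb])
            total = [t + m / len(nbrs) for t, m in zip(total, msg)]
        out[node] = [max(0.0, x) for x in total]  # ReLU as the activation
    return out


# Toy undirected, unlabeled schema graph with 2-d vectors (hypothetical values).
h0 = {"close": [1.0, 0.0], "tv": [0.0, 2.0], "door": [4.0, 0.0]}
nbrs = {"close": ["tv", "door"], "tv": ["close"], "door": ["close"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(h0, nbrs, identity, identity)
```

After one layer, the vector of the polysemous concept "close" already mixes in information from both of its neighbors, which is the contextual refinement the GCN is meant to provide.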

Relational Path Encoding
In order to capture the relational information in schema graphs, we propose an LSTM-based path encoder on top of the outputs of GCNs. Recall that our graph representation has a special purpose: "to measure the plausibility of a candidate answer to a given question". Thus, we propose to represent graphs with respect to the paths between question concepts $C_q$ and answer concepts $C_a$. We denote the $k$-th path between the $i$-th question concept $c_i^{(q)} \in C_q$ and the $j$-th answer concept $c_j^{(a)} \in C_a$ as $P_{i,j}[k]$, which is a sequence of triples:
$$P_{i,j}[k] = [(c_i^{(q)}, r_0, t_0), \ldots, (t_{n-1}, r_n, c_j^{(a)})]$$
Note that the relations are represented with trainable relation vectors (initialized with pre-trained relation embeddings), and concept vectors are the GCNs' outputs ($h^{(l)}$). Thus, each triple can be represented by the concatenation of the three corresponding vectors. We employ LSTM networks to encode these paths as sequences of triple vectors, taking the concatenation of the first and the last hidden states:
$$R_{i,j} = \frac{1}{|P_{i,j}|} \sum_k \text{LSTM}(P_{i,j}[k])$$
The above $R_{i,j}$ can be viewed as the latent relation between the question concept $c_i^{(q)}$ and the answer concept $c_j^{(a)}$, for which we aggregate the representations of all the paths between them in the schema graph. Now we can finalize the vector representation $\mathbf{g}$ of a schema graph $g$ by aggregating all vectors in the matrix $R$ using mean pooling:
$$T_{i,j} = \text{MLP}([s ; c_i^{(q)} ; c_j^{(a)}]), \qquad \mathbf{g} = \frac{\sum_{i,j} [R_{i,j} ; T_{i,j}]}{|C_q| \times |C_a|}$$
where $[\cdot ; \cdot]$ means concatenation of two vectors.
The statement vector $s$ in the above equation is obtained from a certain language encoder, which can either be a trainable sequence encoder like an LSTM or features extracted from pre-trained universal language encoders like GPT/BERT. To encode a question-answer pair with universal language encoders, we simply create a sentence combining the question and the answer with a special token ("question + [sep] + answer"), and then use the vector of '[cls]' as suggested by prior works (Talmor et al., 2019). We concatenate $R_{i,j}$ with an additional vector $T_{i,j}$ before doing average pooling. $T_{i,j}$ is inspired by the Relation Network (Santoro et al., 2017), which also encodes the latent relational information, yet from the context in the statement $s$ instead of the schema graph $g$. Simply put, we want to combine the relational representations of a pair of question/answer concepts from both the schema graph side (symbolic space) and the language side (semantic space).
Finally, the plausibility score of the answer candidate a to the question q can be computed as score(q, a) = sigmoid(MLP(g)).
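The pooling and scoring steps can be sketched as below, assuming the path-based vectors $R_{i,j}$ and the statement-based vectors $T_{i,j}$ have already been computed. A single linear layer followed by a sigmoid stands in for the MLP, and all names and numbers are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def graph_vector(R: List[List[Vector]], T: List[List[Vector]]) -> Vector:
    """Mean-pool the concatenations [R_ij ; T_ij] over all concept pairs."""
    pairs = [R[i][j] + T[i][j]            # list concatenation
             for i in range(len(R)) for j in range(len(R[i]))]
    dim = len(pairs[0])
    return [sum(p[d] for p in pairs) / len(pairs) for d in range(dim)]


def plausibility(g: Vector, weights: Vector, bias: float) -> float:
    """sigmoid(w . g + b): a one-layer stand-in for the MLP scorer."""
    z = sum(w * x for w, x in zip(weights, g)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two question concepts x one answer concept, 1-d R and T (hypothetical values).
R = [[[2.0]], [[4.0]]]
T = [[[1.0]], [[3.0]]]
g = graph_vector(R, T)
score = plausibility(g, weights=[0.0, 0.0], bias=0.0)
```

With zero weights the scorer returns 0.5 for every input, which makes the sigmoid's midpoint easy to verify; a trained scorer would of course use learned weights.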

Hierarchical Attention Mechanism
A natural argument against the above GCN-LSTM-mean architecture is that mean pooling over the path vectors does not always make sense, since some paths are more important than others for reasoning. Also, it is usually not the case that all pairs of question and answer concepts equally contribute to the reasoning. Therefore, we propose a hierarchical path-based attention mechanism to selectively aggregate important path vectors and then more important question-answer concept pairs. This core idea is similar to the work of Yang et al. (2016), which proposes a document encoder that has two levels of attention mechanisms applied at the word and sentence levels. In our case, we have path-level and concept-pair-level attention for learning to contextually model graph representations. We learn a parameter matrix $W_1$ for path-level attention scores, and the importance of the path $P_{i,j}[k]$ is denoted as $\hat{\alpha}_{(i,j,k)}$:
$$\alpha_{(i,j,k)} = T_{i,j} W_1 \text{LSTM}(P_{i,j}[k]), \quad \hat{\alpha}_{(i,j,\cdot)} = \text{SoftMax}(\alpha_{(i,j,\cdot)}), \quad \hat{R}_{i,j} = \sum_k \hat{\alpha}_{(i,j,k)} \cdot \text{LSTM}(P_{i,j}[k])$$
Afterwards, we similarly obtain the attention over concept-pairs.
The whole GCN-LSTM-HPA architecture is illustrated in Figure 3. To sum up, we claim that KagNet is a graph neural network module with the GCN-LSTM-HPA architecture that models relational graphs for relational reasoning under the context of both knowledge symbolic space and language semantic space.
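The path-level attention step can be sketched as follows. The sketch assumes that each path's score is a bilinear form between a pair-specific context vector and the encoded path vector through a learned matrix; this form, the function names, and the toy values are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(u: Vector, w: Matrix, v: Vector) -> float:
    """Compute u^T W v."""
    return sum(u[a] * w[a][b] * v[b]
               for a in range(len(u)) for b in range(len(v)))


def attend_paths(t_ij: Vector, w1: Matrix, path_vecs: List[Vector]):
    """Path-level attention: softmax over bilinear scores, then weighted sum."""
    alphas = softmax([bilinear(t_ij, w1, p) for p in path_vecs])
    dim = len(path_vecs[0])
    pooled = [sum(a * p[d] for a, p in zip(alphas, path_vecs))
              for d in range(dim)]
    return alphas, pooled


# Two encoded paths for one concept pair (hypothetical 2-d values).
t_ij = [1.0, 0.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
paths = [[3.0, 0.0], [0.0, 3.0]]
alphas, r_hat = attend_paths(t_ij, w1, paths)
```

The concept-pair level can reuse the same `softmax` over one score per pair, so that the graph vector becomes a weighted sum rather than a plain mean.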

Experiments
We introduce our setups of the CommonsenseQA dataset (Talmor et al., 2019), present the baseline methods, and finally analyze experimental results.

Dataset and Experiment Setup
The CommonsenseQA dataset consists of 12,102 (v1.11) natural language questions in total that require human commonsense reasoning ability to answer, where each question has five candidate answers (hard mode).The authors also release an easy version of the dataset by picking two random terms/phrases for sanity check.
CommonsenseQA is directly gathered from real human annotators and covers a broad range of types of commonsense, including spatial, social, causal, physical, temporal, etc. To the best of our knowledge, CommonsenseQA may be the most suitable choice for us to evaluate supervised learning models for question answering.
For the comparisons with the reported results in the CommonsenseQA paper and leaderboard, we use the official split (9,741/1,221/1,140) named (OFtrain/OFdev/OFtest). Note that the performance on OFtest can only be tested by submitting predictions to the organizers. To efficiently test other baseline methods and ablation studies, we choose to use randomly selected 1,241 examples from the training data as our in-house data, forming an (8,500/1,221/1,241) split denoted as (IHtrain/IHdev/IHtest). All experiments use the random-split setting as the authors suggested, and three or more random states are tested on development sets to pick the best-performing one.
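The in-house split amounts to holding out 1,241 of the 9,741 official training examples. A minimal sketch, with a hypothetical helper name, an arbitrary seed, and integer ids standing in for the examples:

```python
import random
from typing import List, Tuple


def in_house_split(train_ids: List[int], n_test: int, seed: int
                   ) -> Tuple[List[int], List[int]]:
    """Hold out n_test randomly chosen training examples as an in-house test set."""
    rng = random.Random(seed)
    shuffled = list(train_ids)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


official_train = list(range(9741))        # stand-in example ids
ih_train, ih_test = in_house_split(official_train, n_test=1241, seed=0)
```

The official development set is left untouched, so the resulting sizes are 8,500 for training, 1,221 for development, and 1,241 for in-house testing.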

Compared Methods
We consider two different kinds of baseline methods as follows:
• Knowledge-agnostic Methods. These methods either use no external resources or only use unstructured textual corpora as additional information, including gathering textual snippets from search engines or large pre-trained language models like BERT-LARGE. QABILINEAR, QACOMPARE, and ESIM are three supervised learning models for natural language inference that can be equipped with different word embeddings including GloVe and ELMo. BIDAF++ utilizes Google web snippets as context and is further augmented with a self-attention layer while using ELMo as input features. GPT/BERT-LARGE are fine-tuning methods with an additional linear layer for classification as the authors suggested. They both add a special token '[sep]' to the input and use the hidden state of '[cls]' as the input to the linear layer. More details about them can be found in the dataset paper (Talmor et al., 2019).
• Knowledge-aware Methods. We also adopt some recently proposed methods of incorporating knowledge graphs for question answering. KV-MEM (Mihaylov and Frank, 2018) is a method that incorporates retrieved triples from ConceptNet at the word level, which uses a key-value memory module to improve the representation of each token individually by learning an attentive aggregation of related triple vectors. CBPT (Zhong et al., 2018) is a plug-in method of assembling the predictions of any models with a straightforward method of utilizing pre-trained concept embeddings from ConceptNet. TEXTGRAPH-CAT (Wang et al., 2019c) concatenates the graph-based and text-based representations of the statement and then feeds them into a classifier. We create sentence templates for generating sentences and then feed retrieved triples as additional text inputs, as a baseline method named TRIPLESTRING. Rajani et al. (2019) propose to collect human explanations for commonsense reasoning from annotators as additional knowledge (COS-E), and then train a language model based on such human annotations for improving the model performance.
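The idea behind a template-based baseline such as TRIPLESTRING can be illustrated with the sketch below. The templates and the fallback wording are hypothetical, since the exact templates are not given here.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical templates; the exact wording used in the experiments is not given.
TEMPLATES: Dict[str, str] = {
    "HasProperty": "{h} is {t}.",
    "AtLocation": "{h} can be found at {t}.",
}


def triples_to_text(triples: List[Triple]) -> str:
    """Render triples as sentences; unknown relations fall back to a generic form."""
    sentences = []
    for h, r, t in triples:
        template = TEMPLATES.get(r, "{h} is related to {t}.")
        sentences.append(template.format(h=h.replace("_", " "),
                                         t=t.replace("_", " ")))
    return " ".join(sentences)


extra_text = triples_to_text([("ice", "HasProperty", "cold"),
                              ("glue_stick", "AtLocation", "office")])
```

The generated text would then be appended to the question and answer as additional input for the sentence encoder.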

Implementation Details of KagNet
Our best (tested on OFdev) settings of KagNet have two GCN layers (100 dim and 50 dim, respectively) and one bidirectional LSTM (128 dim). We pre-train KGE using TransE (100 dimensions) initialized with GloVe embeddings. The statement encoder in use is BERT-LARGE, which works as a pre-trained sentence encoder to obtain fixed features for each pair of question and answer candidate. The paths are pruned with the path-score threshold set to 0.15, keeping 67.21% of the original paths. We did not conduct pruning on concept pairs with fewer than three paths. For the very few pairs with no path, $R_{i,j}$ will be a randomly sampled vector. We learn our KagNet models with Adam optimizers (Kingma and Ba, 2015). In our experiments, we found that the recall of ConceptNet on commonsense questions and answers is very high (over 98% of QA-pairs have more than one grounded concept).
Human Performance - 88.9
Table 2: Comparison with official benchmark baseline methods using the official split on the leaderboard.
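The pruning rule in these settings (threshold 0.15, with no pruning for concept pairs that have fewer than three paths) can be sketched as below. Paths are represented only by their scores, and the scores themselves are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def prune_pairs(scored: Dict[Pair, List[float]],
                threshold: float = 0.15,
                min_paths: int = 3) -> Dict[Pair, List[float]]:
    """Drop low-scoring paths, but leave pairs with fewer than min_paths untouched."""
    kept: Dict[Pair, List[float]] = {}
    for pair, scores in scored.items():
        if len(scores) < min_paths:
            kept[pair] = list(scores)
        else:
            kept[pair] = [s for s in scores if s >= threshold]
    return kept


# Path scores per (question concept, answer concept) pair (hypothetical values).
scored = {
    ("glue_stick", "office"): [0.40, 0.10, 0.20, 0.05],
    ("adult", "office"): [0.08, 0.30],
}
pruned = prune_pairs(scored)
```

Exempting sparsely connected pairs keeps the few paths they have, so that pruning does not leave them without any relational evidence.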

Performance Comparisons and Analysis
Comparison with standard baselines.
As shown in Table 2, we first use the official split to compare our model with the baseline methods reported in the paper and on the leaderboard. BERT and GPT-based pre-training methods score much higher than other baseline methods, demonstrating the ability of language models to store commonsense knowledge in an implicit way. This presumption is also investigated by Trinh and Le (2019) and Wang et al. (2019). Our proposed framework achieves an absolute increment of 2.2% in accuracy on the test data, a state-of-the-art performance.
We conduct the experiments with our in-house splits to investigate whether our KagNet can also work well on other universal language encoders (GPT and BERT-BASE), particularly with different fractions of the dataset (say 10%, 50%, 100% of the training data). Table 1 shows that our KagNet-based methods using fixed pre-trained language encoders outperform fine-tuning the encoders themselves in all settings. Furthermore, we find that the improvement in a small data situation (10%) is relatively limited, and we believe an important future research direction is thus few-shot learning for commonsense reasoning.
Table 3: Comparisons with knowledge-aware baseline methods using the in-house split (both easy and hard mode) on top of BLSTM as the sentence encoder.
Comparison with knowledge-aware baselines.
To compare our model with other adopted baseline methods that also incorporate ConceptNet, we set up a bidirectional LSTM network-based model for our in-house dataset. Then, we add baseline methods and KagNet onto the BLSTMs to compare their abilities to utilize external knowledge. Table 3 shows the comparisons under both easy mode and hard mode, and our methods outperform all knowledge-aware baseline methods by a large margin in terms of accuracy. Note that we compare our model and CoS-E in Table 2. Although CoS-E also achieves a better result than only fine-tuning BERT, it does so by training with human-generated explanations, whereas our proposed KagNet does not use any additional human effort to provide more supervision.
Ablation study on model components.
To better understand the effectiveness of each component of our method, we conduct an ablation study as shown in Table 4. We find that replacing our GCN-LSTM-HPA architecture with traditional relational GCNs, which use separate weight matrices for different relation types, results in worse performance, due to their over-parameterization. The attention mechanisms matter almost equally at the two levels, and pruning also effectively filters noisy paths.
Error analysis.
In the failed cases, there are three kinds of hard problems that KagNet is still not good at.
• negative reasoning: the grounding stage is not sensitive to negation words, and thus the model can choose exactly opposite answers.
• comparative reasoning strategy: for questions with more than one highly plausible answer, the commonsense reasoner should benefit from explicitly investigating the difference between different answer candidates, while the KagNet training method is not capable of doing so.
• subjective reasoning: many answers actually depend on the "personality" of the reasoner. For instance, for the question "Traveling from new place to new place is likely to be what?", the dataset gives the answer as "exhilarating" instead of "exhausting", which we think is more like a personalized subjective inference instead of common sense.

Case Study on Interpretability
Our framework enjoys the merit of being more transparent, and thus provides a more interpretable inference process. We can understand our model behaviors by analyzing the hierarchical attention scores on the question-answer concept pairs and the paths between them. Figure 4 shows an example of how we can analyze our KagNet framework through both pair-level and path-level attention scores. We first select the concept-pairs with the highest attention scores and then look at the (one or two) top-ranked paths for each selected pair. We find that paths located in this way are highly related to the inference process, and that noisy concepts like 'fountain' are diminished during modeling.

Model Transferability.
We study the transferability of a model that is trained on CommonsenseQA (CSQA) by directly testing it with another task while fixing its parameters. Recall that we have obtained a BERT-LARGE model and a KagNet model trained on CSQA. We now denote them as CSQA-BL and CSQA-KN to suggest that they are not trainable anymore.
In order to investigate their transferability, we separately test them on SWAG (Zellers et al., 2018). CSQA-BL has an accuracy of 56.53%, while our fixed CSQA-KN model achieves 59.01%. Similarly, we also test both models on WSC-QA, which is converted from the WSC pronoun resolution task to a multi-choice QA task. CSQA-BL achieves an accuracy of 51.23%, while our model CSQA-KN scores 53.51%. These two comparisons further support our assumption that KagNet, as a knowledge-centric model, is more extensible in commonsense reasoning. As we expect a good knowledge-aware framework to behave, our KagNet indeed enjoys better transferability than only fine-tuning large language encoders like BERT.
Figure 4 (example question): What do you fill with ink to write on an A4 paper? A: fountain pen ✔ (KagNet); B: printer (BERT); C: squid; D: pencil case (GPT); E: newspaper

Recent methods on the leaderboard.
We argue that KagNet utilizes ConceptNet as the only external resource, while other methods are improving their performance in orthogonal directions: 1) we find that most of the other recent submissions (as of Aug. 2019) with public information on the leaderboard utilize larger additional textual corpora (e.g., the top 10 matched sentences in full Wikipedia via information retrieval tools) and fine-tune larger pre-trained encoders, such as XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019); 2) there are also models using multi-task learning to transfer knowledge from other reading comprehension datasets, such as RACE (Lai et al., 2017) and OpenBookQA (Mihaylov et al., 2018).
An interesting fact is that the best performance on the OFtest set is still achieved by the original fine-tuned RoBERTa model, which is pre-trained with corpora much larger than BERT's. All other RoBERTa-extended methods perform worse than it. We also use statement vectors from RoBERTa as the input vectors for KagNet, and find that the performance on OFdev marginally improves from 77.47% to 77.56%. Based on the failed cases in our error analysis above, we believe fine-tuning RoBERTa has reached its limit due to the annotator biases of the dataset and the lack of comparative reasoning strategies.

Related Work
Commonsense knowledge and reasoning. There is a recent surge of novel large-scale datasets for testing machine commonsense with various focuses, such as situation prediction (SWAG) (Zellers et al., 2018), social behavior understanding (Sap et al., 2019a,b), visual scene comprehension (Zellers et al., 2019), and general commonsense reasoning (Talmor et al., 2019), which encourage the study of supervised learning methods for commonsense reasoning. Trinh and Le (2018) find that large language models show promising results in the WSC resolution task (Levesque, 2011), but this approach can hardly be applied in a more general question answering setting and also does not provide the explicit knowledge used in inference. A unique merit of our KagNet method is that it provides grounded explicit knowledge triples and paths with scores, such that users can better understand and put trust in the behaviors and inferences of the model.
Injecting external knowledge for NLU. Our work also lies in the general context of using external knowledge to encode sentences or answer questions. Yang and Mitchell (2017) are among the first to propose encoding sentences by repeatedly retrieving related entities from knowledge bases and then merging their embeddings into LSTM network computations, to achieve better performance on entity/event extraction tasks. Weissenborn et al. (2017), Mihaylov and Frank (2018), and Annervaz et al. (2018) follow this line of work to incorporate the embeddings of related knowledge triples at the word level and improve the performance of natural language understanding tasks. In contrast to our work, they do not explicitly impose graph-structured knowledge into models, but limit its potential within transforming word embeddings to concept embeddings. Some other recent attempts (Zhong et al., 2018; Wang et al., 2019c) to use ConceptNet graph embeddings are adopted and compared in our experiments (§5). Rajani et al. (2019) propose to manually collect more human explanations for correct answers as additional supervision for auxiliary training. The KagNet-based framework focuses on injecting external knowledge as an explicit graph structure, and enjoys the relational reasoning capacity over the graphs.
Relational reasoning. KagNet can be seen as a knowledge-augmented Relation Network (RN) module (Santoro et al., 2017), which was proposed for the visual question answering task requiring relational reasoning (i.e., questions about the relations between multiple 3D objects in an image). We view the concepts in the questions and answers as objects and effectively utilize external knowledge graphs to model their relations from both semantic and symbolic spaces (§4.2), while prior methods mainly work on the semantic one.

Conclusion
We propose a knowledge-aware framework for learning to answer commonsense questions. The framework first constructs schema graphs to represent relevant commonsense knowledge, and then models the graphs with our KagNet module. The module is based on a GCN-LSTM-HPA architecture, which effectively represents graphs for relational reasoning in a transparent, interpretable way, yielding a new state-of-the-art result on a large-scale general dataset for testing machine commonsense. Future directions include better question parsing methods to deal with negation and comparative question answering, as well as incorporating knowledge into visual reasoning.

Figure 1: An example of using external commonsense knowledge ( symbolic space) for inference in natural language commonsense questions (semantic space ).

Figure 2: The overall workflow of the proposed framework with knowledge-aware graph network module.

Figure 3: Illustration of the GCN-LSTM-HPA architecture for the proposed KagNet module.

Figure 4: An example of interpreting model behaviors by hierarchical attention scores.

Table 1: Comparisons with large pre-trained language model fine-tuning with different amounts of training data.
• comparative reasoning strategy: For the

Table 4: Ablation study on the KagNet framework.