Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering

Non-extractive commonsense QA remains a challenging AI task, as it requires systems to reason about, synthesize, and gather disparate pieces of information, in order to generate responses to queries. Recent approaches on such tasks show increased performance, only when models are either pre-trained with additional information or when domain-specific heuristics are used, without any special consideration regarding the knowledge resource type. In this paper, we perform a survey of recent commonsense QA methods and we provide a systematic analysis of popular knowledge resources and knowledge-integration methods, across benchmarks from multiple commonsense datasets. Our results and analysis show that attention-based injection seems to be a preferable choice for knowledge integration and that the degree of domain overlap, between knowledge bases and datasets, plays a crucial role in determining model success.


Introduction
With the recent success of large pre-trained language models (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Liu et al., 2019), model performance has reached or surpassed human-level capability on many previous question-answering (QA) benchmarks (Hermann et al., 2015; Rajpurkar et al., 2016; Lai et al., 2017). However, these benchmarks do not directly challenge model reasoning capability, as they require only marginal use of external knowledge to select the correct answer, i.e., all the evidence required to solve questions in these benchmarks is explicit in the context lexical space. Efforts have been made towards building more challenging datasets that, by design, require models to synthesize external commonsense knowledge and leverage more sophisticated reasoning mechanisms (Zhang et al., 2018; Ostermann et al., 2018), showing that the previous state-of-the-art models often struggle to solve these newer tasks reliably. As a result, commonsense has received a lot of attention in other areas as well, such as natural language inference (Zellers et al., 2018b, 2019) and visual question answering (Zellers et al., 2018a). Despite the importance of commonsense knowledge, however, previous work on QA methods takes a coarse-grained view of commonsense, without considering the subtle differences across the various knowledge types and resources.
* Work was done during an internship at Bosch Research.
Such differences have been discussed at length in AI by philosophers, computational linguists, and cognitive psychologists (see, for instance, Davis, 2014). At a high level, we can identify declarative commonsense, whose scope encompasses factual knowledge, e.g., 'the sky is blue', 'Paris is in France'; taxonomic knowledge, e.g., 'football players are athletes', 'cats are mammals'; relational knowledge, e.g., 'the nose is part of the skull', 'handwriting requires a hand and a writing instrument'; procedural commonsense, which includes prescriptive knowledge, e.g., 'one needs an oven before baking cakes', 'the electricity should be off while the switch is being repaired' (Hobbs et al., 1987); sentiment knowledge, e.g., 'rushing to the hospital makes people worried', 'being on vacation makes people relaxed'; and metaphorical knowledge, e.g., 'time flies', 'raining cats and dogs'. We believe that it is important to identify the most appropriate commonsense knowledge type required for specific tasks, in order to get better downstream performance. Once the knowledge type is identified, we can then select the appropriate knowledge-base(s) and the suitable neural integration mechanisms (e.g., attention-based injection, pre-training, or auxiliary training objectives). Accordingly, in this work we conduct a comparison study of different knowledge bases and knowledge-integration methods, and we evaluate model performance on two multiple-choice QA datasets that explicitly require commonsense reasoning. In particular, we used the ConceptNet (Speer et al., 2016) and the recently-introduced ATOMIC knowledge resources, integrating them with the Option Comparison Network model (OCN; Ran et al., 2019), a recent state-of-the-art model for multiple-choice QA tasks. We evaluate our models on the DREAM (Sun et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets.
An example from DREAM that requires commonsense is shown in Table 1, and an example from CommonsenseQA is shown in Table 2. Our experimental results and analysis suggest that attention-based injection is preferable for knowledge integration and that the degree of domain overlap, between knowledge-base and dataset, is vital to model success.

Table 1: An example from DREAM that requires commonsense (* marks the correct answer).
Dialogue:
M: I hear you drive a long way to work every day.
W: Oh, yes. it's about sixty miles. but it doesn't seem that far, the road is not bad, and there's not much traffic.
Question: How does the woman feel about driving to work?
Answer choices:
A. She doesn't mind it as the road conditions are good.*
B. She is unhappy to drive such a long way everyday.
C. She is tired of driving in heavy traffic.

Related Work
It has been recognized that many recent QA tasks require external knowledge or commonsense to solve, and numerous efforts have been made to inject commonsense into neural models. Bauer et al. (2018) introduced a pipeline for extracting grounded multi-hop commonsense relation paths from ConceptNet and proposed to inject commonsense knowledge into neural models' intermediate representations, using attention. Similarly, Mihaylov and Frank (2018) proposed to extract relevant knowledge triples from ConceptNet and use Key-Value Retrieval (Miller et al., 2016) to gather information from knowledge to enhance the neural representation. Zhong et al. (2018) proposed to pre-train a scoring function using knowledge triples from ConceptNet, to model the direct and indirect relations between concepts. This scoring function was then fused with QA models to make the final prediction. Pan et al. (2019a) introduced an entity discovery and linking system to identify the most salient entities in the question and answer-options. Wikipedia abstracts of these entities are then extracted and appended to the reference documents to provide additional information. Weissenborn et al. (2018) proposed a strategy of dynamically refining word embeddings by reading input text as well as external knowledge, such as ConceptNet and Wikipedia abstracts. More recently, Lin et al. (2019) proposed to extract subgraphs from ConceptNet and embed the knowledge using Graph Convolutional Networks (Kipf and Welling, 2016). The knowledge representation is then integrated with the word representation through an LSTM layer and a hierarchical attention mechanism. Lv et al. (2019) introduced graph-based reasoning modules that take both ConceptNet knowledge triples and Wikipedia text as inputs to refine word representations from a pre-trained language model and make predictions.
Commonsense knowledge integration has also received a lot of attention on many other tasks. Tandon et al. (2018) proposed to use commonsense knowledge as hard/soft constraints to bias the neural model's prediction on a procedural text comprehension task. Ma et al. (2018) proposed to embed affective commonsense knowledge inside the LSTM cell to control the information flow in each gate for a sentiment analysis task. Li and Srikumar (2019) presented a framework to convert declarative knowledge into first-order logic that enhances neural networks' training and prediction. Peters et al. (2019) and Levine et al. (2019) both tried to inject knowledge into language models by pre-training on knowledge bases.
Previous works only focus on using external knowledge sources to improve model performance on certain tasks, disregarding the type of commonsense knowledge and how the domain of the knowledge resource affects results on downstream tasks. In this paper, we examine the roles of knowledge-base domain and specific integration mechanisms on model performance.

Approach Overview
In this section, we describe the model architecture used in our experiments. Next, we introduce two popular knowledge resources, we define our knowledge-extraction method, then we illustrate various neural knowledge-integration mechanisms.

Model architecture
The BERT model (Devlin et al., 2019) has been applied to numerous QA tasks and has achieved very promising performance, particularly on the DREAM and CommonsenseQA datasets. When utilizing BERT on multiple-choice QA tasks, the standard approach is to concatenate the dialogue context and the question with each answer-option, in order to generate a list of tokens which is then fed into the BERT encoder; a linear layer is added on top, in order to predict the answer. One aspect of this strategy is that each answer-option is encoded independently: from a cognitive perspective, this contradicts how humans typically solve multiple-choice QA tasks, namely by weighing each option to find correlations among the options, in addition to correlations with respect to the question. To address this issue, Ran et al. (2019) introduced the Option Comparison Network (OCN), which explicitly models pairwise answer-option interactions, making OCN better suited for multiple-choice QA task structures. We re-implemented OCN while keeping BERT as its upstream encoder.[2] Specifically, given a dialogue $D$, a question $Q$, and an answer-option $O_k$, we concatenate them and encode with BERT to get the hidden representation $T_{enc} \in \mathbb{R}^{n \times d}$:

$$T_{enc} = \mathrm{BERT}(D; Q; O_k) \tag{1}$$

where $d$ is the size of BERT's hidden representation and $n$ is the total number of words. Next, the dialogue encoding $D_{enc} \in \mathbb{R}^{n_d \times d}$, question encoding $Q_{enc} \in \mathbb{R}^{n_q \times d}$, and answer-option encoding $O_{k,enc} \in \mathbb{R}^{n_o \times d}$ are separated from $T_{enc}$. Here, the option encoding consists of both the question and the option, i.e., $Q_{enc} \subseteq O_{k,enc}$ and $n_d + n_o = n$, as suggested by Ran et al. (2019).
[2] Because the newly-released XLNet has out-performed BERT on various tasks, we considered using XLNet as the OCN's encoder. However, in our initial experiments, XLNet was very unstable, in that it easily produced degenerate solutions, a problem noted by Devlin et al. (2019) for small datasets. We found BERT to be more stable in our study.
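The separation of $T_{enc}$ into its parts can be illustrated with a minimal sketch. It assumes the token order dialogue, question, option, ignores any special tokens, and uses a random array as a stand-in for the BERT output; the function name `split_encoding` and the toy sizes are hypothetical and not taken from the original implementation.

```python
import numpy as np

def split_encoding(t_enc, n_d, n_q):
    """Split T_enc (n x d) into dialogue, question, and option encodings.

    Follows the convention in the text: the option encoding also covers the
    question tokens, so n_d + n_o = n and Q_enc is a sub-block of O_enc.
    Assumes rows are ordered as dialogue, then question, then option.
    """
    d_enc = t_enc[:n_d]   # dialogue rows
    o_enc = t_enc[n_d:]   # question + option rows
    q_enc = o_enc[:n_q]   # question rows, contained in o_enc
    return d_enc, q_enc, o_enc

rng = np.random.default_rng(0)
n_d, n_q, n_opt, d = 12, 5, 4, 8                   # toy sizes, not the paper's settings
t_enc = rng.normal(size=(n_d + n_q + n_opt, d))    # stand-in for the BERT output
d_enc, q_enc, o_enc = split_encoding(t_enc, n_d, n_q)
```

Slicing, rather than encoding the three parts separately, keeps every token representation conditioned on the full concatenated input.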
Given a set of options $O_k$ ($k = 1, 2, \ldots$), these options are compared, pairwise, using standard tri-linear attention (Seo et al., 2016):

$$\mathrm{Att}(u, v) = W_1 \cdot u + W_2 \cdot v + W_3 \cdot (u \circ v) \tag{2}$$

where $W_1, W_2, W_3 \in \mathbb{R}^{d}$ are trainable weights and $u \in \mathbb{R}^{x \times d}$, $v \in \mathbb{R}^{y \times d}$ are input matrices; $x$ and $y$ are generic placeholders for input lengths; matrix multiplication and element-wise multiplication are denoted by $(\cdot)$ and $(\circ)$, respectively. Next, we gather information from all other options, to form a new option representation $O_{k,new} \in \mathbb{R}^{n_o \times d}$. Formally, given option $O_{k,enc}$ and another option $O_{l,enc} \in \mathbb{R}^{n_l \times d}$, $O_{k,new}$ is computed with trainable weights $W_c \in \mathbb{R}^{(d + 2d(|O| - 1)) \times d}$, where $|O|$ denotes the total number of options and $n_l$ denotes the number of words in the compared option. Then, a gating mechanism is used to fuse the option-wise correlation information $O_{k,new}$ with the current option encoding $O_{k,enc}$; the gating values are computed with trainable weights $W_g \in \mathbb{R}^{3d \times d}$ and $V_a \in \mathbb{R}^{d \times 1}$. Co-attention (Xiong et al., 2016) is then applied to re-read the dialogue, given the fused option-correlation features, with trainable weights $W_p \in \mathbb{R}^{3d \times d}$. Finally, self-attention (Wang et al., 2017) is used to compute the final option representation $O_f \in \mathbb{R}^{n_o \times d}$. Unlike the vanilla BERT model, which takes the first token to predict the answer, max-pooling is applied on the sequence dimension of $O_f$, in order to generate the final prediction.
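As an illustration of the pairwise comparison step, the following sketch computes a tri-linear similarity matrix between two option encodings in NumPy. It assumes the common form in which entry (i, j) is the sum of a term in u_i, a term in v_j, and a term in their element-wise product; the softmax-weighted read of the other option is likewise an assumption about how the similarity is used, and the random arrays are placeholders for real encodings.

```python
import numpy as np

def trilinear_attention(u, v, w1, w2, w3):
    """Similarity matrix S (x by y): S[i, j] = w1.u_i + w2.v_j + w3.(u_i * v_j)."""
    term_u = u @ w1              # shape (x,)
    term_v = v @ w2              # shape (y,)
    term_uv = (u * w3) @ v.T     # shape (x, y): sum_k w3_k * u_ik * v_jk
    return term_u[:, None] + term_v[None, :] + term_uv

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
o_k = rng.normal(size=(5, d))    # toy encoding of option k (5 tokens)
o_l = rng.normal(size=(7, d))    # toy encoding of option l (7 tokens)
w1, w2, w3 = (rng.normal(size=d) for _ in range(3))

s = trilinear_attention(o_k, o_l, w1, w2, w3)    # (5, 7) similarity matrix
# Assumed use of S: each token of option k reads a weighted summary of option l.
o_l_attended = softmax(s, axis=1) @ o_l          # (5, d)
```

Computing the three terms separately avoids materializing the full concatenated tensor of all token pairs.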

Knowledge bases
The first knowledge-base we consider for our experiments is ConceptNet (Speer et al., 2016). ConceptNet contains over 21 million edges and 8 million nodes (1.5 million nodes in the partition for the English vocabulary), generating triples of the form (C1, r, C2): the natural-language concepts C1 and C2 are associated by commonsense relation r, e.g., (dinner, AtLocation, restaurant). Thanks to its coverage, ConceptNet is one of the most popular semantic networks for commonsense. ATOMIC is a new knowledge-base that focuses on procedural knowledge. Triples are of the form (Event, r, {Effect|Persona|Mental-state}), where head and tail are short sentences or verb phrases and r represents an if-then relation type. An example would be: (X compliments Y, xIntent, X wants to be nice). Since both the DREAM and CommonsenseQA datasets are open-domain and require general commonsense, we think these knowledge-bases are most appropriate for our investigation.

Knowledge elicitation
ConceptNet. For the DREAM dataset, we find ConceptNet relations that connect dialogues and questions to the answer-options. The intuition is that these relation paths would provide explicit evidence that would help the model find the answer. Formally, given a dialogue D, a question Q, and an answer-option O, we find all ConceptNet relations (C1, r, C2), such that C1 ∈ (D + Q) and C2 ∈ O, or vice versa. This rule works well for single-word concepts. However, a large number of concepts in ConceptNet are actually phrases, and finding exactly matching phrases in D/Q/O is much harder. To fully utilize phrase-based ConceptNet relations, we relaxed the exact-match constraint to a partial-match criterion between a concept C and a sequence S. Here, S represents D/Q/O, depending on which sequence we try to match the concept C to. Additionally, when the part-of-speech (POS) tag for a concept is available, we make sure it matches the POS tag of the corresponding word in D/Q/O. For CommonsenseQA, we use the same procedure to find ConceptNet relations for each answer-option, except that only Q is present and used. Table 3 shows the extracted ConceptNet triples for the CommonsenseQA example in Table 2. It is worth noting that we are able to extract the original ConceptNet sub-graph that was used to create the question, along with some extra triples. Although not perfect, the bold ConceptNet triple does provide some clue that could help the model resolve the correct answer.

ATOMIC. We observe that many questions in DREAM inquire about agents' opinions and feelings. Superficially, this particular question type seems well-suited for ATOMIC, whose focus is on folk psychology and related general implications; we could frame our goal as evaluating whether ATOMIC can provide relevant knowledge to help answer these questions.
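The ConceptNet extraction rule can be sketched as follows. The relaxed matching criterion used here, a word-overlap ratio above a threshold of 0.5, is an assumption made for illustration, as are the function names; the POS-tag check is omitted, and apart from the (dinner, AtLocation, restaurant) example the triples are toy data.

```python
def concept_matches(concept, tokens, threshold=0.5):
    """True if more than `threshold` of the concept's words occur in tokens.

    The overlap ratio and the 0.5 default are illustrative assumptions.
    """
    words = concept.replace("_", " ").lower().split()
    if not words:
        return False
    vocab = {t.lower() for t in tokens}
    overlap = sum(w in vocab for w in words)
    return overlap / len(words) > threshold

def extract_triples(triples, context_tokens, option_tokens):
    """Keep (c1, r, c2) linking the context (D + Q) to the option O, or vice versa."""
    kept = []
    for c1, rel, c2 in triples:
        forward = concept_matches(c1, context_tokens) and concept_matches(c2, option_tokens)
        backward = concept_matches(c1, option_tokens) and concept_matches(c2, context_tokens)
        if forward or backward:
            kept.append((c1, rel, c2))
    return kept

triples = [
    ("dinner", "AtLocation", "restaurant"),
    ("book", "AtLocation", "library"),
    ("restaurant", "UsedFor", "have dinner out"),   # toy phrase concept
]
context = "where do people usually have dinner".split()
option = ["restaurant"]
matched = extract_triples(triples, context, option)
```

In this toy run the phrase concept "have dinner out" is kept even though it never appears verbatim in the context, because two of its three words do.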
However, one challenge to this strategy is that heads and tails of knowledge triples in ATOMIC are short sentences or verb phrases, while rare words and person-references are reduced to blanks and PersonX/PersonY, respectively. This calls for a new matching procedure, different from the ConceptNet extraction strategy, for eliciting ATOMIC-specific relations: we rely on the recently-published COMET model (Bosselut et al., 2019) to generate new ATOMIC relations, with intermediate phrasal resolutions. In particular, we first segment all dialogues, questions, and answer-options into sentences. We further segment long sentences into sub-sentences, using commas as separators. Because only verb-phrases satisfy the definition of an "event" in ATOMIC (i.e., relations are only invoked by verbs), we remove all sentences/sub-sentences that do not contain any verb. Next, we use a pre-trained COMET model (Bosselut et al., 2019) to generate all possible ATOMIC relations for all candidate sentences/sub-sentences, and we use greedy decoding to take the 1-best sequences. Table 4 shows sample ATOMIC relations, generated using the DREAM example in Table 1. It is interesting to note that the reaction for the woman agent (second utterance) is identified as happy, since she said that 'the road is not bad.' If we compare the identified attributes for the answer-options, the one from the correct answer seems to be semantically closer than the other two.
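The candidate-selection steps that precede COMET generation (sentence segmentation, comma splitting, and verb filtering) can be sketched as below. The regular-expression sentence splitter and the tiny verb lexicon are stand-ins chosen for illustration; a real pipeline would presumably use a proper sentence splitter and POS tagger, and the COMET call itself is not shown.

```python
import re

def candidate_events(text, has_verb):
    """Split text into sentences, then comma-separated parts, keeping parts with a verb."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = []
    for sentence in sentences:
        for part in sentence.split(","):
            part = part.strip(" .!?")
            if part and has_verb(part):
                candidates.append(part)
    return candidates

# Hypothetical toy lexicon standing in for a POS tagger.
TOY_VERBS = {"is", "it's", "there's", "seem", "drive", "hear"}

def toy_has_verb(fragment):
    """Toy verb detector: checks tokens against the hypothetical lexicon."""
    return any(tok.lower() in TOY_VERBS for tok in re.findall(r"[A-Za-z']+", fragment))

utterance = ("Oh, yes. it's about sixty miles. but it doesn't seem that far, "
             "the road is not bad, and there's not much traffic.")
events = candidate_events(utterance, toy_has_verb)
```

Passing the verb detector in as a function keeps the segmentation logic independent of whichever tagger is used.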

Knowledge injection
Given previously extracted/generated knowledge triples, we need to integrate them with the OCN model. Inspired by Bauer et al. (2018), we propose to use attention-based injection. For ConceptNet knowledge triples, we first convert concept-relation tokens into regular tokens, in order to generate a pseudo-sentence. For example, "(book, AtLocation, library)" would be converted to "book at location library." Next, we use the BERT embedding layer to generate an embedding of this pseudo-sentence, with $C$ denoting a ConceptNet relation, and encode it with an LSTM layer. If we let $H_C \in \mathbb{R}^{1 \times 2l}$ be the concatenation of the final hidden states and $l$ be the number of hidden units in the LSTM layer, then $m$ ConceptNet relations would yield the commonsense knowledge matrix $H_M \in \mathbb{R}^{m \times 2l}$. We adopt the attention mechanism used in QANet (Yu et al., 2018) to model the interaction between $H_M$ and the BERT encoding output $T_{enc}$ (from Equation 1). Specifically, $H_M$ is first projected into the same dimension as $T_{enc}$, using $W_{proj} \in \mathbb{R}^{2l \times d}$. Then, the similarity matrix $S \in \mathbb{R}^{n \times m}$ is computed using tri-linear attention, as in Equation 2. We then use $S$ to compute text-to-knowledge attention $A_m \in \mathbb{R}^{n \times d}$ and knowledge-to-text attention $A_t \in \mathbb{R}^{n \times d}$. Finally, the knowledge-aware textual representation $T_{out} \in \mathbb{R}^{n \times d}$ is computed, using $W_a \in \mathbb{R}^{4d \times d}$. $T_{out}$ is fed to subsequent layers (in place of $T_{enc}$), in order to generate the prediction. The model structure with knowledge-injection is summarized in Figure 1.

For ATOMIC knowledge triples, the injection method is slightly different. Because heads of these knowledge triples are sentences/utterances and the tails contain attributes of the persons (i.e., subject and object of the sentence), it is not possible to directly inject the knowledge triples as-is. We replace the heads of the ATOMIC knowledge triples with the corresponding speaker for dialogues and leave them blank for the answer-options.
Next, we convert the special relation tokens into regular tokens, e.g., "xIntent" ⇒ "intent" and "oEffect" ⇒ "others effect", to make pseudo-sentences. As a result, an ATOMIC relation "(the road is not bad, xReact, happy)" would be converted to "(W, react, happy)." Moreover, as the ATOMIC knowledge triples are associated with dialogues and answer-options independently, we inject option relations into $O_{enc} \in \mathbb{R}^{n_o \times d}$ and dialogue relations into $D_{enc}$, respectively, using the injection method described above.
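The attention-based injection described in this subsection can be sketched in NumPy as follows. The sketch assumes a QANet-style pair of attentions (row-wise and column-wise softmax over the similarity matrix) and a four-way concatenation before the final projection, which is consistent with the stated shape of the final projection but is not confirmed by the text; any non-linearity is omitted, and all arrays are random placeholders.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def trilinear(u, v, w1, w2, w3):
    """Tri-linear similarity between rows of u (n x d) and rows of v (m x d)."""
    return (u @ w1)[:, None] + (v @ w2)[None, :] + (u * w3) @ v.T

def inject_knowledge(t_enc, h_m, w_proj, w_att, w_a):
    """Return a knowledge-aware representation with the same shape as t_enc."""
    k = h_m @ w_proj                   # (m, d): project knowledge to text dimension
    s = trilinear(t_enc, k, *w_att)    # (n, m): similarity matrix S
    s_row = softmax(s, axis=1)         # each token attends over the m relations
    s_col = softmax(s, axis=0)         # each relation attends over the n tokens
    a_m = s_row @ k                    # (n, d): text-to-knowledge attention
    a_t = s_row @ s_col.T @ t_enc      # (n, d): knowledge-to-text attention
    # Assumed fusion: concatenate four d-dimensional views, then project with W_a.
    fused = np.concatenate([t_enc, a_m, t_enc * a_m, t_enc * a_t], axis=1)
    return fused @ w_a                 # (n, d)

rng = np.random.default_rng(0)
n, m, d, l = 10, 3, 8, 4               # toy sizes
t_enc = rng.normal(size=(n, d))        # stand-in for the BERT output
h_m = rng.normal(size=(m, 2 * l))      # stand-in for the LSTM-encoded relations
w_proj = rng.normal(size=(2 * l, d))
w_att = [rng.normal(size=d) for _ in range(3)]
w_a = rng.normal(size=(4 * d, d))
t_out = inject_knowledge(t_enc, h_m, w_proj, w_att, w_a)
```

Because the output keeps the shape of the input encoding, it can replace that encoding in downstream layers without further changes.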

Knowledge pre-training
Pre-training large-capacity models (e.g., BERT, GPT (Radford et al., 2019), XLNet (Yang et al., 2019)) on large corpora, then fine-tuning on more domain-specific information, has led to performance improvements on various tasks. Inspired by this, our goal in this section is to observe the effect of pre-training BERT on commonsense knowledge and refining the model on task-specific content from our DREAM and CommonsenseQA corpora. Essentially, we would like to test whether pre-training on our external knowledge resources can help the model acquire commonsense. For ConceptNet, we found that pre-training BERT on pseudo-sentences formulated from ConceptNet knowledge triples does not provide much gain in performance. Instead, we trained BERT on the Open Mind Common Sense (OMCS) corpus (Singh et al., 2002), the original corpus that was used to create ConceptNet. We extracted about 930K English sentences from OMCS and randomly masked out 15% of the tokens; we then fine-tuned BERT, using a masked language model objective. We then loaded this fine-tuned model into OCN and trained on the DREAM and CommonsenseQA tasks. As for pre-training on ATOMIC, we again use COMET to convert ATOMIC knowledge triples into sentences; we created special tokens for the 9 types of relations as well as for blanks. Next, we randomly masked out 15% of the tokens, masking only tail-tokens. We otherwise follow the same pre-training procedure as for OMCS.
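The tail-only masking used for ATOMIC pre-training can be illustrated with a small sketch. It assumes each tail token is masked independently with probability 0.15 and always replaced by a mask token; the function name, the `[xIntent]` relation token, and the omission of BERT's usual random-replacement variants are simplifications made for illustration.

```python
import random

MASK = "[MASK]"

def mask_tail_tokens(head, relation, tail, mask_prob=0.15, rng=None):
    """Mask tokens in the tail only; head and relation tokens are never masked.

    Returns the token sequence and a parallel list of labels, where a label is
    the original token at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked_tail, tail_labels = [], []
    for tok in tail:
        if rng.random() < mask_prob:
            masked_tail.append(MASK)
            tail_labels.append(tok)
        else:
            masked_tail.append(tok)
            tail_labels.append(None)
    tokens = head + [relation] + masked_tail
    labels = [None] * (len(head) + 1) + tail_labels
    return tokens, labels

head = "PersonX compliments PersonY".split()
tail = "PersonX wants to be nice".split()
tokens, labels = mask_tail_tokens(head, "[xIntent]", tail, rng=random.Random(0))
```

Restricting the mask to the tail means the model always sees the full event and relation and is trained only to predict the inferred attribute.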


Datasets
We choose to evaluate our hypotheses using the DREAM and CommonsenseQA datasets, because some / all questions require commonsense reasoning and because there remains a large gap between state-of-the-art models and human performance. DREAM is a dialogue-based multiple-choice QA dataset, introduced by Sun et al. (2019). It was collected from English-as-a-foreign-language examinations, designed by human experts. The dataset contains 10,197 questions for 6,444 dialogues in total, and each question is associated with 3 answer-options. The authors point out that 34% of questions require commonsense knowledge to answer, which includes social implication, speaker's intention, or general world knowledge.
CommonsenseQA is a multiple-choice QA dataset that specifically measures commonsense reasoning (Talmor et al., 2019). This dataset is constructed based on ConceptNet (Speer et al., 2016). Specifically, a source concept is first extracted from ConceptNet, along with 3 target concepts that are connected to the source concept, i.e., a sub-graph. Crowd-workers are then asked to generate questions, using the source concept, such that only one of the target concepts can correctly answer the question. Additionally, 2 more distractor concepts are selected by crowd-workers, so that each question is associated with 5 answer-options. In total, the dataset contains 12,247 questions. For CommonsenseQA, we evaluate models on the development-set only, since test-set answers are not publicly available.

Training details
For ease of comparison, we borrow hyperparameter settings from Pan et al. (2019b); we used the BERT Whole-Word Masking Uncased model (Devlin et al., 2018) for all experiments. For DREAM experiments, we used a max sequence length of 512, batch-size of 24, learning rate of 1e-5, and we trained the model for 16 epochs. For CommonsenseQA, we used a max sequence length of 60, batch-size of 32, learning rate of 1e-5, and trained for 8 epochs. For pre-training on OMCS, we used a max sequence length of 35, batch-size of 32, learning rate of 3e-5, and trained for 3 epochs. For pre-training on ATOMIC, the max sequence length is changed to 45, other hyperparameters remain the same, and we only use the ATOMIC training set. When using OCN on CommonsenseQA, since there is no dialogue, we compute co-attention with $Q_{enc}$, in place of $D_{enc}$, in order to keep the model structure consistent.
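For reference, the settings listed above can be collected into a single configuration object. The dictionary layout and key names below are hypothetical and chosen only for readability; the values are the ones stated in this section.

```python
# Hyperparameters as stated in the text; key names are illustrative only.
TRAINING_CONFIG = {
    "dream": {"max_seq_length": 512, "batch_size": 24, "learning_rate": 1e-5, "epochs": 16},
    "commonsenseqa": {"max_seq_length": 60, "batch_size": 32, "learning_rate": 1e-5, "epochs": 8},
    "pretrain_omcs": {"max_seq_length": 35, "batch_size": 32, "learning_rate": 3e-5, "epochs": 3},
}

# ATOMIC pre-training reuses the OMCS settings except for the max sequence length.
TRAINING_CONFIG["pretrain_atomic"] = {**TRAINING_CONFIG["pretrain_omcs"], "max_seq_length": 45}
```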

Results
DREAM results are shown in Table 5, and CommonsenseQA results are shown in Table 6. For all of our experiments, we ran 3 trials with different random seeds and we report average scores in the tables.
Evaluated on DREAM, our OCN model got a significant performance boost (+3.0%), compared to BERTlarge from previous work. We think the reasons are that OCN is better-suited for the task and that we used BERT Whole-Word Masking Uncased model. OCN with ConceptNet knowledge-injection achieves slightly better results on the development-set, while ATOMIC knowledge-injection helps achieve a small improvement on the test-set. However, we recognize that these improvements are very limited; to our surprise, OCN pre-trained on OMCS or ATOMIC got significantly lower performance.
As for results on CommonsenseQA, ConceptNet knowledge-injection provides a significant performance boost (+2.8%), compared to the OCN baseline, suggesting that explicit links from question to answer-options help the model find the correct answer. Pre-training on OMCS also provides a small performance boost over the OCN baseline. Since both ConceptNet knowledge-injection and OMCS pre-training are helpful, we combine both approaches with OCN and are able to achieve further improvement (+4.9%). Finally, similar to the results on DREAM, OCN pre-trained on ATOMIC yields a significant performance drop.

Error Analysis
To better understand when a model performs better or worse with knowledge-integration, we analyzed model predictions. The DREAM dataset provides annotations for about 1000 questions: 500 questions in the development-set and 500 in the test-set. Specifically, questions are manually classified into 5 categories: Matching, Summary, Logic inference, Commonsense inference, and Arithmetic inference; each question can be classified under multiple categories. We refer readers to Sun et al. (2019) for additional category information. We extracted model predictions for these annotated questions in the test-set and grouped them by type. The accuracies for each question group are shown in Table 7. Note that we omitted 2 categories that have fewer than 10 questions. For the ConceptNet and the ATOMIC knowledge-injection models, we can see that they did better on questions that involve commonsense (last 3 columns in the table), while performance on other types is about the same or slightly worse, compared to the baseline OCN. As for models pre-trained on the OMCS corpus or the ATOMIC knowledge-base, we already saw that their performance drops, compared to the baseline. When we look at the performance difference for each question type, it is clear that some categories account for the performance drop more than others. For example, for both the OMCS pre-trained model and the ATOMIC pre-trained model, performance drops significantly for Matching questions, in particular. On the other hand, for questions that require both commonsense inference and summarization, both models' performance dropped only slightly or did not change. Based on these results, we infer that commonsense knowledge-injection with attention is making an impact on models' weight distributions.
The model is able to do better on questions that require commonsense but loses performance on other types, suggesting a direction for future research in developing more robust (e.g., conditional) injection methods. Moreover, pre-training on knowledge-bases seems to have a larger impact on models' weight distributions, resulting in inferior performance. This weight distribution shift also favors commonsense, as we see that commonsense types are not affected as much as other types. We also conducted a similar analysis for CommonsenseQA. Since all questions in CommonsenseQA require commonsense reasoning, we classify questions based on the ConceptNet relation between the question concept and the correct answer concept. The intuition is that the model needs to capture this relation in order to answer the question. The accuracies for each question type are shown in Table 8. Note that we have omitted question types that have fewer than 25 questions. We can see that with ConceptNet relation-injection, all question types got performance boosts, for both the OCN model and OCN pre-trained on OMCS, suggesting that knowledge is indeed helpful for the task.
In the case of OCN pre-trained on ATOMIC, although the overall performance is much lower than the OCN baseline, it is interesting to see that performance for the "Causes" type is not significantly affected. Moreover, performance for the "CausesDesire" and "Desires" types actually got much better. As has been noted previously, "Causes" in ConceptNet is similar to "Effects" and "Reactions" in ATOMIC, and "CausesDesire" in ConceptNet is similar to "Wants" in ATOMIC. This result also correlates with our findings from our analysis on DREAM, wherein we found that models with knowledge pre-training perform better on questions that fit the knowledge domain but perform worse on others. In this case, pre-training on ATOMIC helps the model do better on questions that are similar to ATOMIC relations, even though overall performance is inferior. Finally, we noticed that questions of type "Antonym" appear to be the hardest ones. Many questions that fall into this category contain negations, and we hypothesize that the models still lack the ability to reason over negated sentences, suggesting another direction for future improvement.
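The per-type accuracies discussed in this section can be computed by grouping predictions by their annotated types, as in the sketch below. The record layout and field names are hypothetical; each category is counted independently, so a question annotated with several types contributes to each of them, and types with too few questions can be dropped via `min_count`.

```python
from collections import defaultdict

def accuracy_by_type(examples, min_count=1):
    """Accuracy per question type; a question may belong to several types."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        hit = ex["prediction"] == ex["answer"]
        for qtype in ex["types"]:
            total[qtype] += 1
            correct[qtype] += int(hit)
    # Drop types with fewer than min_count questions, as done for the tables.
    return {t: correct[t] / total[t] for t in total if total[t] >= min_count}

# Toy records, not taken from either dataset.
examples = [
    {"types": ["Matching"], "prediction": "A", "answer": "A"},
    {"types": ["Summary", "Commonsense"], "prediction": "B", "answer": "C"},
    {"types": ["Commonsense"], "prediction": "A", "answer": "A"},
    {"types": ["Matching"], "prediction": "C", "answer": "B"},
]
scores = accuracy_by_type(examples)
```

Comparing such per-type scores between a baseline and a knowledge-enhanced model gives the per-type differences used in the analysis.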

Discussion
Based on our experimental results and error analysis, we see that external knowledge is only helpful when there is alignment between questions and knowledge-base types. Thus, it is crucial to identify the question type and apply the best-suited knowledge. In terms of knowledge-integration methods, attention-based injection seems to be the better choice for pre-trained language models such as BERT: even when alignment between knowledge-base and dataset is sub-optimal, performance would not degrade. On the other hand, pre-training on knowledge-bases would greatly shift the language model's weight distribution toward the knowledge-base's own domain. If the task domain does not fit the knowledge-base well, model performance is likely to drop. When the domain of the knowledge-base aligns with that of the dataset perfectly, both knowledge-integration methods bring performance boosts, and a combination of them could bring further gain.

Future Work
We have presented a survey on two popular knowledge bases (ConceptNet and ATOMIC) and recent knowledge-integration methods (attention and pre-training), on commonsense QA tasks. Evaluation on two QA datasets suggests that alignment between knowledge-bases and datasets plays a crucial role in knowledge-integration. We believe it is worth conducting a more comprehensive study of datasets and knowledge-bases and putting more effort towards defining an auxiliary learning objective, in a constrained-optimization (i.e., multi-task learning) framework, that identifies the type of knowledge required, based on data characteristics. In parallel, we are also interested in building a global commonsense knowledge base by aggregating ConceptNet, ATOMIC, and potentially other resources like FrameNet (Baker et al., 1998) and MetaNet (Dodge et al., 2015), on the basis of a shared-reference ontology (following the approaches described in (Gangemi et al., 2010) and (Scheffczyk et al., 2010)): the goal would be to assess whether injecting knowledge structures from a semantically-cohesive lexical knowledge base of commonsense guarantees stable model accuracy across datasets.