Q. Can Knowledge Graphs be used to Answer Boolean Questions? A. It’s complicated!

In this paper we explore the problem of machine reading comprehension, focusing on the BoolQ dataset of Yes/No questions. We carry out an error analysis of a BERT-based machine reading comprehension model on this dataset, revealing issues such as unstable model behaviour and some noise within the dataset itself. We then experiment with two approaches for integrating information from knowledge graphs: (i) concatenating knowledge graph triples to text passages and (ii) encoding knowledge with a Graph Neural Network. Neither of these approaches shows a clear improvement, and we hypothesize that this may be due to a combination of inaccuracies in the knowledge graph, imprecision in entity linking, and the models’ inability to capture additional information from knowledge graphs.


Introduction
Clark et al. (2019) explore the difficulty of Yes/No questions and introduce the BoolQ dataset, which contains 16k questions based on real Google user queries, paired by crowdworkers with passages from Wikipedia. They establish a strong baseline using BERT large and transfer learning from the Multi-Genre Natural Language Inference (MNLI) task (Williams et al., 2018).
In this work, we carry out an error analysis of 200 samples from the BERT large + MNLI baseline model and find that 77% constitute genuine model errors, almost 6% of samples contain an incorrect answer tag, and 8% do not contain enough evidence to answer the question. The remaining 9% we classify as difficult questions, as they involve deep understanding, reasoning, or specific knowledge, and sometimes depend on opinion. Due to the unstable behaviour of the model, error samples vary from run to run, where a run refers to the pipeline of MNLI pre-training, BoolQ fine-tuning, and evaluation of the model. We introduce a stable accuracy metric to evaluate a system across multiple runs with the same hyperparameters. Stable accuracy over n runs refers to the proportion of questions that are always correctly answered. We observed a 3.3% and an 11% drop in stable accuracy over 2 and 10 runs respectively.
* A significant part of this work was done during an internship at Google Research Switzerland in September-December 2019 in collaboration with Massimo Nicosia.
Next we turn our attention to improving machine reading comprehension (MRC) system performance. We hypothesize that the system might benefit from additional information about entities and/or relations between the entities in the question and passage. Consider, for example, (1) where pei is an abbreviation of Prince Edward Island. We propose and evaluate two approaches for augmenting questions and passages with KG information: (1) concatenating the model input with sentences constructed from ConceptNet triples (Speer et al., 2017); and (2) encoding KG entities and relations with the Graph Neural Network (GNN) proposed by Shaw et al. (2019), a model suited to graph-based input. Neither approach shows a significant improvement over the baseline.
We manually analyse 200 errors made by one run of the baseline system (33% of one-run errors) and discover that 6% of them involve an incorrect answer tag and another 8% involve confusing passages which do not give enough support for the answer (see Appendix B for examples). Table 1 shows a categorization of the errors according to the reasoning types provided by Clark et al. (2019). The majority of errors belong to the Paraphrasing type (48.5%). In these cases, the answer is in the passage and only a minimum amount of extra knowledge and reasoning is required to answer the question. The Implicit and Missing Mention types account for 19.5% and 14% of errors respectively. Only about 3.5% of incorrectly answered questions require an understanding of examples given in the passage, 6% require factual reasoning, and 8% require other inference.

Stable Accuracy
We reproduce the results of the baseline BERT large + MNLI model released by Clark et al. (2019). Its accuracy is between 80% and 82% (Fig. 1 (a), G) with an average accuracy of 81.41% over 10 runs (vs. 82.2% reported by Clark et al. (2019)). Our error analysis shows that a significant portion of the correctly answered questions varies from run to run, together with around 40% of errors.
We define stable accuracy as the ratio of the number of questions answered correctly in every one of n runs to the total number of questions. Formally, if $Q$ is the set of all questions and $Q^{i}_{\mathrm{correct}}$ is the set of correctly answered questions at the $i$-th run, the stable accuracy after n runs is defined as (2):
$$\mathrm{StableAcc}_n = \frac{\left|\bigcap_{i=1}^{n} Q^{i}_{\mathrm{correct}}\right|}{|Q|} \qquad (2)$$
The stable accuracy over 10 runs drops to 71% (see Fig. 1 (a)). Ensembling over up to 10 runs (Fig. 1 (a), L) does not outperform the baseline: the values are within the range of 78.09% and 81.77%. We repeat the experiment using the robustly optimized RoBERTa large model implemented by Wolf et al. (2019) and fine-tuned on the MNLI task. This model has a better average accuracy (83.7%) but it is also more unstable: the stable accuracy drops to 64.0% (see Fig. 1 (b)). As with the BERT model, ensembling over 10 runs does not give a performance boost.
This observed behavior means that the system performs well on each run but every time it performs well on a different set of questions. This might be related to the notion of "forgettable" examples described by Toneva et al. (2019). The difference is that they discovered the ability of models to forget the learned examples during the training phase, while we examine stable and unstable examples when the training is finished.
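As a minimal sketch of the metric, the snippet below computes stable accuracy as the size of the intersection of the per-run correct sets divided by the number of questions, alongside a simple majority-vote ensemble. The question ids, labels, function names, and the tie-breaking rule are illustrative assumptions, not taken from the released code.

```python
from collections import Counter
from typing import Dict, List, Set


def stable_accuracy(correct_per_run: List[Set[str]], all_questions: Set[str]) -> float:
    """Share of questions that are answered correctly in every run."""
    if not correct_per_run or not all_questions:
        raise ValueError("need at least one run and one question")
    always_correct = set.intersection(*correct_per_run)
    return len(always_correct & all_questions) / len(all_questions)


def majority_vote_accuracy(runs: List[Dict[str, bool]], gold: Dict[str, bool]) -> float:
    """Accuracy of the per-question majority answer; ties count as "No" (assumption)."""
    hits = 0
    for qid, answer in gold.items():
        votes = Counter(run[qid] for run in runs)
        majority = votes[True] > votes[False]
        hits += majority == answer
    return hits / len(gold)


# Toy data (hypothetical): three runs over four questions.
gold = {"q1": True, "q2": False, "q3": True, "q4": False}
runs = [
    {"q1": True, "q2": False, "q3": False, "q4": False},
    {"q1": True, "q2": True, "q3": True, "q4": False},
    {"q1": True, "q2": False, "q3": True, "q4": True},
]
correct_sets = [{q for q, a in run.items() if a == gold[q]} for run in runs]
per_run_accuracy = [len(c) / len(gold) for c in correct_sets]
stable = stable_accuracy(correct_sets, set(gold))
ensemble = majority_vote_accuracy(runs, gold)
```

In this toy setting every run reaches 75% accuracy while only one question is correct in all three runs, so stable accuracy is 25%. The toy ensemble result is an artefact of the hand-picked data and is not meant to mirror the findings reported above.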

Modeling Knowledge Graph Data
Our manual inspection of the results of one baseline system run reveals that approximately 20% of erroneous cases are questions involving some property of an entity or concept, or some hierarchical relationship between entities. An example of the former is (3) and the latter is (4).
(3) is i 80 in indiana a toll road
(4) is college of william and mary an ivy league school?
We hypothesize that adding knowledge graph data could help in answering such questions, as well as examples such as (1) and (5) below, where the entity in the question is referred to using a different name in the passage.
(5) Question: does smeagol die in lord of the rings
Passage: ... Gollum finally ... but he fell into the fires of the volcano, where both he and the Ring were destroyed.
Answer: Yes
We use the CloudAPI to annotate text with tokens, part-of-speech tags, named entities with Freebase KG identifiers (MIDs), numbers, dates, and VerbNet roles, which can be used for establishing relations between entities.

Extending Passages with ConceptNet
ConceptNet (Liu and Singh, 2004; Speer et al., 2017) is an open semantic network based on DBpedia, Wiktionary, WordNet, and other resources. It captures common-sense knowledge and was created to help computers understand words and concepts in the same way people do. It was designed specifically for use in NLP applications and is widely used in MRC (Weissenborn et al., 2017; Bauer et al., 2018; Mihaylov and Frank, 2018; Lin et al., 2019; Qiu et al., 2019). Partly inspired by Weissenborn et al. (2017), we convert ConceptNet relations into sentences, but instead of embedding them independently, we concatenate them to the baseline model input.

Sentence Extraction and Filtering
ConceptNet has 34 relation types. Each relation has start and end entities and a strength of relation (relevance weight). We look up every annotated entity from questions and passages in ConceptNet. We extract the top 100 relations according to the relevance weight, and select those where both the start and end entities are in English. We remove relations that are not useful, such as "External URLs", or too broad, such as "FormOf". Then we transform ConceptNet relations into simple sentences based on the relation description or, if there is no description, we create a string: [entity1] [relation] [entity2], e.g. the "panda is near a bamboo forest" string is created from the entities "panda" and "bamboo forest" and the relation "LocatedNear". Fig. 2 shows a ConceptNet entity from example (1). The verbalized triples, such as "pei is a synonym of Prince Edward Island", are prepended to the text passage.
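The extraction and verbalization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the `Triple` type, the template strings, and the exact relation identifiers in `EXCLUDED` are hypothetical, since the templates and identifiers actually used are not listed here.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    start: str
    relation: str
    end: str
    weight: float          # ConceptNet relevance weight
    start_lang: str = "en"
    end_lang: str = "en"


# Hypothetical templates standing in for the relation descriptions.
TEMPLATES: Dict[str, str] = {
    "Synonym": "{start} is a synonym of {end}",
    "IsA": "{start} is a type of {end}",
    "LocatedNear": "{start} is near a {end}",
}
# Assumed identifiers for the "not useful" and "too broad" relations.
EXCLUDED = {"ExternalURL", "FormOf"}


def verbalize(triples: List[Triple], top_k: int = 100) -> List[str]:
    """Keep the top_k heaviest triples, drop non-English or excluded ones, render sentences."""
    ranked = sorted(triples, key=lambda t: t.weight, reverse=True)[:top_k]
    sentences = []
    for t in ranked:
        if t.start_lang != "en" or t.end_lang != "en" or t.relation in EXCLUDED:
            continue
        # Fall back to the plain "[entity1] [relation] [entity2]" string.
        template = TEMPLATES.get(t.relation, "{start} {relation} {end}")
        sentences.append(template.format(start=t.start, relation=t.relation, end=t.end))
    return sentences


example = [
    Triple("panda", "LocatedNear", "bamboo forest", 2.0),
    Triple("pei", "Synonym", "Prince Edward Island", 3.0),
    Triple("pei", "FormOf", "pe", 5.0),
    Triple("pei", "Synonym", "île", 1.0, "en", "fr"),
    Triple("a", "RelatedTo", "b", 0.5),
]
sentences = verbalize(example)
```

The resulting sentences would then be prepended to the passage before it is passed to the model.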
Since such new sentences can add noise (see polyetherimide examples in Fig. 2) and a long input might confuse the model (Thayaparan et al., 2019), we aim to add extra sentences to a passage only if they are relevant and can better "explain" the nature of the entities. To select those, we rank all extracted sentences S according to the sum of their similarities with the question q and passage p, as shown in (6):
$$\forall s \in S:\ \mathrm{score}(s) = g(k(s), k(q)) + g(k(s), k(p)) \qquad (6)$$
where $g \in \{\text{correlation}, \text{cosine}\}$ is a similarity measure and $k$ is a semantic embedding function. We use a semantic textual similarity model as $k$. To filter more examples, we add an empirically tuned threshold for similarities and select only those sentences which are ranked as the most similar to the question and passage by both correlation (inner product) and cosine similarity, and for which each score is higher than the established threshold.
Another method of selecting relevant sentences is to consider only the relations which connect an entity in the question to an entity in the passage. We then combine these two strategies: we add sentences only to the examples which meet both criteria (Intersection) or to all that meet at least one of the criteria (Union).
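One possible reading of this selection rule is sketched below. It assumes that the embeddings k(s), k(q), and k(p) are already available as non-zero vectors, and it interprets "ranked as the most similar by both measures" as requiring the same top-ranked sentence under inner product and cosine. The function names and threshold values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Set

Vector = Sequence[float]


def inner(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    # Assumes non-zero vectors.
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


MEASURES = {"inner": inner, "cosine": cosine}


def select_sentence(sent_vecs: List[Vector], q_vec: Vector, p_vec: Vector,
                    thresholds: Dict[str, float]) -> Optional[int]:
    """Index of the sentence ranked first by both measures and above both thresholds."""
    if not sent_vecs:
        return None
    winners = []
    for name, g in MEASURES.items():
        # score(s) = g(k(s), k(q)) + g(k(s), k(p)), as in (6)
        scores = [g(s, q_vec) + g(s, p_vec) for s in sent_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] <= thresholds[name]:
            return None
        winners.append(best)
    return winners[0] if winners[0] == winners[1] else None


def combine(sent_emb: Set[str], qp_match: Set[str], mode: str) -> Set[str]:
    """Example ids to augment under the Intersection or Union criterion."""
    if mode == "intersection":
        return sent_emb & qp_match
    if mode == "union":
        return sent_emb | qp_match
    raise ValueError(f"unknown mode: {mode}")


q, p = [1.0, 0.0], [0.0, 1.0]
candidates = [[1.0, 1.0], [1.5, 0.0], [0.1, 0.3]]
chosen = select_sentence(candidates, q, p, {"inner": 1.0, "cosine": 1.0})
```

Requiring agreement between the two measures makes the filter conservative, which fits the small share of passages that end up being modified.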

Results
Table 2 shows the results averaged over 5 runs. With threshold filtering we add sentences to 21.84% of passages, obtaining an average accuracy of 81.23% (see Table 2: SentEmb). Using relations that connect question and passage entities, 22.58% of QA pairs are affected but the performance is slightly worse (see Table 2: Q&P Match).
The Intersection criterion gives the best performance. By affecting only 1.23% of the data, we obtain 81.46% average accuracy and 82.05% accuracy for the ensemble majority voting scenario. The Union criterion does not show any improvement in accuracy. Neither the improvement of Intersection nor the decreases of SentEmb, Q&P Match, and Union are statistically significant with respect to the baseline. 10
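The significance check referred to here is a two-sample proportion Z-test. A minimal sketch of such a test with a pooled standard error and a two-sided p-value follows; the function names are illustrative and the counts in the example are hypothetical, not the actual evaluation counts.

```python
import math
from typing import Tuple


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def two_proportion_z_test(correct_a: int, n_a: int, correct_b: int, n_b: int) -> Tuple[float, float]:
    """Z statistic and two-sided p-value for the difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, two_sided_p(z)


# Hypothetical counts: system A answers 800 of 1000 correctly, system B 820 of 1000.
z, p = two_proportion_z_test(800, 1000, 820, 1000)
```

Evaluating `two_sided_p(-1.3674)` gives roughly 0.17, in line with the p-value quoted for the maximum difference.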

Modeling Knowledge Graphs with GraphNNs
Facing instability of the BERT-based baseline and low coverage of ConceptNet (see Section 4), we experiment with a new architecture and knowledge graph. To better model graph-based input, such as entities and their relations, we tried a transformer-based seq2seq GNN (Shaw et al., 2019). Entities, relations and input tokens are embedded and fed to a GNN sub-layer that incorporates edge representations, extending the self-attention mechanism. The encoder-decoder attention layer considers both encoder output token and entity representations, jointly normalizing attention weights over tokens and entities. In our case, the GNN decoder simply outputs our expected answers: "Yes" or "No" (see Fig. 3). We initialize the GNN with a pre-trained BERT large model and only fine-tune on BoolQ.
As an alternative to ConceptNet we also tried the Google Knowledge Graph. It has more than 500 billion facts about 5 billion entities. 11 The entities describe real-world objects and concepts like people, places, events, and things. Entities are represented as nodes and connected by relations. The latter can simply indicate that a relation is present, or they may encode the type of relation. We try the first three of the following possible experiments:
1. adding a relation between different entities which have the same MID;
2. only adding connections between entities across the QA pair, as in the ConceptNet Q&P Match experiment;
3. distinguishing different types of relations;
4. adding a relation between different mentions of the same entity;
5. adding entities not mentioned in the text but linked to the mentioned entities.
10 According to the two-sample proportion Z-test, the maximum difference: z = −1.3674, p = 0.17068.
11 https://blog.google/products/search/about-knowledge-graph-and-knowledge-panels/ - l.v. 07/2020
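To make the first two graph constructions in the list more concrete, the sketch below builds untyped edges between entity mentions. It is a loose illustration rather than the actual input pipeline of the GNN: the `Mention` type, the `kg_relations` set of linked MID pairs, and the placeholder MIDs are all assumptions.

```python
from itertools import combinations
from typing import List, NamedTuple, Set, Tuple


class Mention(NamedTuple):
    index: int   # position of the entity node in the model input
    mid: str     # knowledge graph identifier assigned by the entity linker
    side: str    # "question" or "passage"


def build_edges(mentions: List[Mention], kg_relations: Set[Tuple[str, str]],
                cross_only: bool = False) -> Set[Tuple[int, int]]:
    """Connect mentions that share a MID or whose MIDs are linked in the KG.

    With cross_only=True, only question-passage pairs are connected.
    """
    edges = set()
    for a, b in combinations(mentions, 2):
        linked = (a.mid == b.mid
                  or (a.mid, b.mid) in kg_relations
                  or (b.mid, a.mid) in kg_relations)
        if not linked:
            continue
        if cross_only and a.side == b.side:
            continue
        edges.add((a.index, b.index))
    return edges


# Placeholder MIDs, not real identifiers.
mentions = [
    Mention(0, "/m/a", "question"),
    Mention(1, "/m/a", "passage"),
    Mention(2, "/m/b", "passage"),
    Mention(3, "/m/c", "passage"),
]
kg = {("/m/a", "/m/b")}
all_edges = build_edges(mentions, kg)
cross_edges = build_edges(mentions, kg, cross_only=True)
```

Distinguishing relation types, the third experiment, would amount to attaching a label to each edge instead of treating every edge alike.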

Results
The results are presented in Table 3. The first row shows the baseline BERT model with no KG data and the remaining rows show the BERT + GNN system with no KG data, with ConceptNet, or with the Google Knowledge Graph. Adding KG information does not outperform the baseline result. None of the differences from the baseline is statistically significant.

Analysis
ConceptNet. Even after the filtering described in Section 3.1.1, we observe that the relations from ConceptNet are often too general and do not add new information, e.g. "cookie jar is a type of jar". Such relations are already part of the language model. Petroni et al. (2019) show that BERT contains relational knowledge and has a strong ability to recall factual knowledge without fine-tuning. Furthermore, some entities are missing, e.g. there is a "Tom Hanks" entity but no "Meg Ryan" entity, or the entity "dragon ball" contains only non-English connections, confirming the general coverage issue of KGs. 12
Sensitivity. We observe that the GNN is sensitive to the learning rate and other hyper-parameters. Better tuning may compensate for the difference in performance with respect to the BERT baseline.
Entity recognition and linker. We found issues with the entity linker. Named entities are often not covered or the MID is missing. In some cases, the entity has a wrong MID, e.g. in (7) the entity "northern ireland" is not recognised but the entity "ireland" (Republic of Ireland) is mentioned instead, while the entity "great britain" is recognised with the MID of "United Kingdom".
Do KGs affect stable accuracy? We observe a positive tendency towards stable correct answers in the ConceptNet experiments (Table 4). The number of new stable correct answers is higher than the number of new stable errors for all settings except Q&P Match. Also, for all scenarios except Intersection, the number of questions where the predicted answer fluctuates from incorrect to correct is higher than the number of questions where the predicted answer fluctuates from correct to incorrect.
Is a KG necessary? The BoolQ dataset was not originally created to be used with a KG, and the passages were selected such that they contain the information required to answer a question. For some questions, such as (1), the additional information provided by a KG is helpful, and for questions like (7), even though the passage has all the required information, a KG could highlight the relation between entities and help answer the question. However, there are also cases where a KG is not needed or cannot be applied, e.g. (8) and (9). In (8) a question is asked about a number format and the information about the specific last symbol is unlikely to be a part of a KG. (9) contains a very short passage explicitly saying there is a book but it was adapted from the screenplay. In this case, a KG could provide potentially confusing information simply stating that there is a book.
12 https://conceptnet.io/c/en/jar, https://conceptnet.io/c/en/tom_hanks - An English term in ConceptNet 5.8, https://conceptnet.io/c/en/meg_ryan - 'meg ryan' is not a node in ConceptNet, https://conceptnet.io/c/en/dragon_ball - l.v. 07/2020

Conclusion
In this work, we take a closer look at a BERT baseline system on the BoolQ dataset, which reveals some inconsistencies in the data and some instability in the model. We try two approaches to integrating knowledge graph information, one based on augmenting the passage text and another using a Graph Neural Network. Neither is successful. One culprit is the lack of coverage of ConceptNet and another is related to the accuracy of the entity recognition. We also suggest that the number of questions where suitable KG data is needed and could be found might just not be enough for the models to learn from.

Clark et al. (2019) showed the BERT large model outperforming recurrent models with attention, both in their vanilla version and in combination with deep contextualized word representation (Peters et al., 2018).

B Erroneous and Confusing Examples
Some questions in BoolQ are formulated in a context that may change over time. For example, (10) asks about a movie released "this year". As the dataset was released in 2019, the data could have been collected in 2018, in which case the answer is yes; but if the question were asked in 2015 or today (2020), the answer should be no. Another example is (11), where the passage provides information about border crossing requirements for United States citizens but the question does not specify the citizenship of the person asking. In contrast, in example (12) the question and passage provide an unconditional outcome, as a holder of a Schengen visa (information from the question) can enter Montenegro for 30 days (information from the passage). So, in cases like examples (10) and (11), the passage information is not enough to answer the questions unconditionally.
(10) Question: is there a star wars movie this year
Passage: The first film was followed by two successful sequels, The Empire Strikes Back (1980) ...
Some passages look unrelated or do not contain enough information to obtain the answer, e.g. (13-14). The passages are related to the questions but specific information is missing, so the answer "Yes" cannot be confirmed by the passages. We observe that around 8% of questions are confusing or rely on certain assumptions.
Passage: A cordon bleu or schnitzel cordon bleu is a dish of meat wrapped around cheese (or with cheese filling), then breaded and pan-fried or deep-fried. Veal or pork cordon bleu is made of veal or pork pounded thin and wrapped around a slice of ham and a slice of cheese, breaded, and then pan fried or baked. For chicken cordon bleu chicken breast is used instead of veal. Ham cordon bleu is ham stuffed with mushrooms and cheese.

Answer: Yes
There are a few examples of errors (15-17) from the dataset. The first example (15) asks, in a negative form ("is it bad to ..."), whether shower gel can be used instead of shampoo; the passage says that they are perfectly substitutable, so the answer should be No (it is not bad). In the second example (16) the passage explicitly says India does not have a national language, so the answer should be No. In the third example (17) there is nothing that should make the reader believe there were any games outside of Russia, so the answer should be Yes. According to our analysis, 6% of samples have the wrong answer tag.
(15) Question: Is it bad to wash your hair with shower gel?
Passage: ... This means that shower gels can also double as an effective and perfectly acceptable substitute to shampoo, even if they are not labelled as a hair and body wash.