Enhancing Question Answering by Injecting Ontological Knowledge through Regularization

Deep neural networks have demonstrated high performance on many natural language processing (NLP) tasks that can be answered directly from text, but have struggled to solve NLP tasks requiring external (e.g., world) knowledge. In this paper, we present OSCR (Ontology-based Semantic Composition Regularization), a method for injecting task-agnostic knowledge from an Ontology or knowledge graph into a neural network during pre-training. We evaluated the performance of BERT pre-trained on Wikipedia with and without OSCR by measuring the performance when fine-tuning on two question answering tasks involving world knowledge and causal reasoning and one requiring domain (healthcare) knowledge, and obtained 33.3%, 18.6%, and 4% improved accuracy compared to pre-training BERT without OSCR.


The Problem
"The detective flashed his badge to the police officer." The nearly effortless ease with which we, as humans, can understand this simple statement belies the depth of semantic knowledge needed for its understanding: What is a detective? What is a police officer? What is a badge? What does it mean to flash a badge? Why would the detective need to flash his badge to the police officer? Understanding this sentence requires knowing the answer to all these questions and relies on the reader's knowledge about the world: a detective investigates crime, police officers restrict access to the crime scene, and a badge can be a symbol of authority.
As shown in Figure 1, suppose we were interested in determining whether, upon showing the police officer his badge, it is more plausible that the detective would be let into the crime scene or that the police officer would confiscate the detective's badge. To answer this question, we would need to leverage our accumulated expectations about the world: although both scenarios are certainly possible, it would be very extraordinary for the police officer to confiscate the detective's badge rather than allow him to enter the crime scene.

Figure 1: Example of a question requiring commonsense and causal reasoning (Roemmele et al., 2011) with entities highlighted.
Consistent with Grice's Maxim of Quantity (Grice, 1975), this shared knowledge of the world is rarely explicitly stated in text. Fortunately, some of this knowledge can be extracted from Ontologies and knowledge bases. For example, ConceptNet (Speer et al., 2017) indicates that a detective is a TypeOf police officer and is CapableOf finding evidence; that evidence can be LocatedAt a crime scene; and that a badge is a TypeOf authority symbol.
While neural networks have been shown to obtain state-of-the-art performance on many types of question answering and reasoning tasks from raw data (Devlin et al., 2018; Rajpurkar et al., 2016; Manning, 2015), there has been less investigation into how to inject ontological knowledge into deep learning models, with most prior attempts embedding ontological information outside of the network itself (Wang et al., 2017).
In this paper, we present a pre-training regularization method, OSCR, which injects ontological knowledge directly into a neural network without modifying its architecture or input format.

Background and Related Work
The idea of training a model on a related problem before training on the problem of interest has been shown to be effective for many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017; Howard and Ruder, 2018). More recent uses of pre-training adapt transfer learning by first training a network on a language modeling task and then fine-tuning (retraining) that model for a supervised problem of interest (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). Pre-training, in this way, has the advantage that the model can build on previous parameters to reduce the amount of information it needs to learn for a specific downstream task. Conceptually, the model can be viewed as applying what it has already learned from the language modeling task when learning the downstream task. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained neural network that has been shown to obtain state-of-the-art results on eleven natural language processing tasks after fine-tuning (Devlin et al., 2018). BERT relies on two pre-training objectives: (1) a variant of language modeling called Cloze (originally proposed in Taylor, 1953) wherein 15% of the words in a sentence are masked and the model must unmask them, and (2) a next-sentence prediction task wherein the model is given a pair of sentences and must decide if the second sentence immediately follows the first. Despite its strong empirical performance, the architecture of BERT is relatively simple: layers of transformer encoders (Vaswani et al., 2017) are stacked (12 in BERT-BASE) to process each sentence.
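The Cloze corruption can be sketched as follows. This is a simplified illustration, not BERT's exact procedure (BERT additionally keeps some selected tokens unchanged or replaces them with random words, which is omitted here); `mask_tokens` and the fixed 15% rate are illustrative:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK], returning the
    corrupted sequence and the (position, original token) prediction targets."""
    rng = random.Random(seed)
    masked, targets = list(tokens), []
    n = max(1, round(len(tokens) * mask_rate))  # at least one mask per sentence
    for i in sorted(rng.sample(range(len(tokens)), n)):
        targets.append((i, masked[i]))
        masked[i] = mask_token
    return masked, targets

tokens = "the detective flashed his badge to the police officer".split()
corrupted, targets = mask_tokens(tokens)
```

The model is then trained to predict the original token at each masked position from the corrupted sequence.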
In terms of injecting knowledge into pre-training,  explored injecting entity information into BERT using multi-head attention. However, their approach requires explicitly indicating entity boundaries or relation constituents with special input tokens for down-stream fine-tuning. By contrast, OSCR requires no modification of input formats in the host network.  explored modifying BERT's pre-training by masking entire entities and phrases extracted from external knowledge. Meanwhile, Xie et al. (2019) explored projecting propositional knowledge using Graph Convolutional Networks (GCNs). OSCR, instead, introduces a regularization term that can be added to any natural language pre-training objectives, without modifying the architecture of the network or the pre-training objectives themselves.

The Data
Incorporating OSCR into pre-training requires an embedded ontology (or knowledge) graph and one or more natural language pre-training objectives to regularize: in our case, BERT's Cloze and next-sentence prediction tasks. These objectives, in turn, require a document collection.

The Ontology
ConceptNet 5 is a semantic network containing relational knowledge contributed to Open Mind Common Sense (Singh et al., 2002) and to DBpedia (Auer et al., 2007), as well as dictionary knowledge from Wiktionary and the Open Multilingual WordNet (Miller, 1995), the high-level ontology from OpenCyc, and knowledge about word associations from "Games with a Purpose" (von Ahn, 2006). In our experiments, we used ConceptNet 5 as our ontology, relying on an embedded representation known as ConceptNet NumberBatch (Speer et al., 2017), in which embeddings for all entities in ConceptNet were built, using retrofitting, from an ensemble of (a) data from ConceptNet, (b) word2vec (Mikolov et al., 2013), (c) GloVe (Pennington et al., 2014), and (d) OpenSubtitles 2016.

The Documents
Our text corpus was a 2019 dump of English Wikipedia articles with templates expanded as provided by Wikipedia's Cirrus search engine. Preprocessing relied on NLTK's Punkt sentence segmenter (Loper and Bird, 2002) and the WordPiece subword tokenizer provided with BERT.

The Approach
Virtually all neural networks designed for natural language processing represent language as a sequence of words, subwords, or characters. By contrast, Ontologies and knowledge bases encode semantic information about entities, which may correspond to individual nouns (e.g., "badge") or multiword phrases ("police officer"). Consequently, injecting world and domain knowledge from a knowledge base into the network requires semantically decomposing the information about an entity into supporting information about its constituent words. For example, injecting the semantics of "Spanish Civil War" into the network requires learning what information the word "Spanish" introduces to the nominal "Civil War" and what information "Civil" adds to the word "War". To do this, OSCR is implemented using a three-step approach illustrated in Figure 2: Step 1, entities are recognized in a sentence using a Finite State Transducer (FST); Step 2, the sequence of subwords corresponding to each entity is semantically composed to produce an entity-level encoding; and Step 3, the average energy between the composed entity encoding and the pre-trained entity encoding from the ontology is used as a regularization term in the pre-training loss function. By training the model to compose sequences of subwords into entities, during back-propagation the semantics of each entity are decomposed and injected into the network through the neural activations associated with its constituent words.

Entity Detection
We designed OSCR to require as few modifications to the underlying host network (e.g., BERT) as possible. We recognized entities online, during both training and inference, by (1) tokenizing each entity in our ontology using the same tokenizer used to prepare the BERT pre-training data, and (2) compiling a Finite State Transducer (FST) to detect sequences of subword IDs corresponding to entities. The FST, illustrated in Figure 3, allowed us to detect entities on-the-fly without hard-coding a specific ontology and without inducing any discernible change in training or inference time. Although we did not explore it in this work, this potentially allows multiple ontologies to be injected through OSCR during pre-training. In these experiments, due to the simplicity of ConceptNet entities, we relied on exact string matching to detect entities. Formally, let w_1, w_2, · · · , w_n represent the sequence of subwords in a sentence. The FST processes this sequence and returns three sequences: s_1, s_2, · · · , s_m; l_1, l_2, · · · , l_m; and e_1, e_2, · · · , e_m, representing the start offset, length, and pre-trained embedded representation of every mention of any entity in the Ontology.
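The detection step can be approximated with a prefix trie over subword-ID sequences (a simplification of the FST; the `build_trie`/`detect_entities` helpers and the toy token IDs below are illustrative, not the paper's implementation):

```python
def build_trie(entity_token_ids):
    """Build a prefix trie over entity subword-ID sequences; terminal nodes
    store the entity index (used to look up its pretrained embedding)."""
    trie = {}
    for idx, ids in enumerate(entity_token_ids):
        node = trie
        for tid in ids:
            node = node.setdefault(tid, {})
        node["entity"] = idx
    return trie

def detect_entities(token_ids, trie):
    """Scan a sentence's subword IDs, emitting (start, length, entity_idx)
    for every (possibly subsumed/overlapping) entity mention."""
    mentions = []
    for start in range(len(token_ids)):
        node = trie
        for end in range(start, len(token_ids)):
            node = node.get(token_ids[end])
            if node is None:
                break  # no entity continues with this subword
            if "entity" in node:
                mentions.append((start, end - start + 1, node["entity"]))
    return mentions

# Toy example: "Spanish Civil War" -> [5, 6, 7], with subsumed
# "Civil War" -> [6, 7] and "War" -> [7].
trie = build_trie([[5, 6, 7], [6, 7], [7]])
mentions = detect_entities([1, 5, 6, 7, 2], trie)
```

Because matching scans forward from every start position, subsumed mentions fall out for free and can be kept or filtered afterward.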
Entity Subsumption. When detecting entities, it is often the case that multiple entities correspond to overlapping spans of text. As illustrated in Figure 2, the entity "Spanish Civil War" contains the subsumed entities "Spanish", "Civil War", "Civil", and "War". Likewise, because BERT masks 15% of the words in each sentence, it is possible for entities to involve masked words. Note: including or excluding subsumed and de-masked entities (as illustrated in Figure 2) provided no discernible effect in our experiments.
Entity Demasking. Because BERT masks tokens when pre-training, we evaluated the impact of (a) de-masking words before detecting entities and (b) ignoring all entity mentions involving masked words.

Semantic Composition
The role of semantic composition in OSCR is to learn a composed representation ê_1, ê_2, · · · , ê_m for each entity detected in the sentence, such that ê_i = compose(w_{s_i}, w_{s_i + 1}, · · · , w_{s_i + l_i}). We considered three methods of semantic composition, described below.
Recurrent Additive Networks (RANs) are a simplified alternative to LSTM- or GRU-based recurrent neural networks that use only additive connections between successive layers and have been shown to obtain similar performance with 38% fewer learnable parameters (Lee et al., 2017).
Given a sequence of subword embeddings x_1, x_2, · · · , x_n, we use the following layers to accumulate information about how the semantics of each word in an entity contribute to the overall semantics of the entity:

c̃_t = W_c x_t (1a)
i_t = σ(W_i [h_{t−1}; x_t]) (1b)
f_t = σ(W_f [h_{t−1}; x_t]) (1c)
c_t = i_t • c̃_t + f_t • c_{t−1} (1d)
h_t = g(c_t) (1e)

where [·; ·] represents vector concatenation, c̃_t represents the content layer which encodes any new semantic information provided by word t, • indicates an element-wise product, i_t represents the input gate, f_t represents the forget gate, c_t represents the internal memories about the entity, and h_t is the output layer encoding accumulated semantics about word t. We define the composed entity ê_i = c_{s_i + l_i} (i.e., the content vector of the RAN after processing the last token in the entity) for the sequence beginning with w_{s_i}.
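The RAN composition can be sketched as follows. This is a minimal sketch, not the paper's implementation: the gate parameterization (a single weight matrix over the concatenation, no biases) and the choice of tanh for the output nonlinearity g are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ran_compose(X, Wc, Wi, Wf):
    """Compose a (T x d) sequence of subword embeddings X into one entity
    vector with a Recurrent Additive Network (Lee et al., 2017).
    Wc: (d, d); Wi, Wf: (d, 2d), applied to [h_{t-1}; x_t]."""
    d = X.shape[1]
    h = np.zeros(d)
    c = np.zeros(d)
    for x in X:
        hx = np.concatenate([h, x])
        c_tilde = Wc @ x                 # (1a) content layer
        i = sigmoid(Wi @ hx)             # (1b) input gate
        f = sigmoid(Wf @ hx)             # (1c) forget gate
        c = i * c_tilde + f * c          # (1d) additive memory update
        h = np.tanh(c)                   # (1e) output layer (g assumed tanh)
    return c                             # content vector after the last token
```

Note that the returned entity vector is the memory c, matching the definition ê_i = c_{s_i + l_i} above.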

Linear Recurrent Additive Networks
To further reduce model complexity, we considered a second, simpler version of a RAN that omits the content and output layers (i.e., Equations 1a and 1e); Equation 1d is updated to depend on x_t directly: c_t = i_t • x_t + f_t • c_{t−1}. As above, we define the composed entity ê_i = c_{s_i + l_i} for the sequence of subwords beginning with w_{s_i}.
Linear Interpolation. Finally, we considered a third, even simpler form of semantic composition. Inspired by Goodwin and Harabagiu (2016), we represented the semantics of an entity as an unordered linear combination of the semantics of its constituent words, i.e.: ê_i = x_{s_i} + x_{s_i + 1} + · · · + x_{s_i + l_i}.
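The two simplified compositions can be sketched as below. In the Linear RAN sketch, conditioning the gates on [c_{t−1}; x_t] is an assumption (with the output layer removed there is no h_{t−1} to gate on); the paper does not specify the gate inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_ran_compose(X, Wi, Wf):
    """Linear RAN: Equations 1a and 1e are dropped and the memory update
    gates x_t directly: c_t = i_t * x_t + f_t * c_{t-1}."""
    c = np.zeros(X.shape[1])
    for x in X:
        cx = np.concatenate([c, x])   # gate input (assumed): [c_{t-1}; x_t]
        i = sigmoid(Wi @ cx)          # input gate
        f = sigmoid(Wf @ cx)          # forget gate
        c = i * x + f * c             # modified Equation 1d
    return c

def linear_compose(X):
    """Linear interpolation: unordered sum of the constituent word vectors."""
    return X.sum(axis=0)
```

Linear interpolation discards word order entirely, which is why it is the cheapest of the three methods.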

Energy Regularization
We project the composed entities into the same vector space as the pretrained entity embeddings from the Ontology and measure the average energy across all entities detected in the sentence:

(1/m) Σ_{i=1}^{m} E(P ê_i, e_i)

where P is a learned projection matrix and E is an energy function capturing the energy between the (projected) composed entity ê_i and the pretrained entity embedding e_i. We considered three energy functions: (1) the Euclidean distance, (2) the absolute (L1) distance, and (3) the angular distance, which can handle negative values.
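The three energy functions and the averaged penalty can be sketched as follows. The projection matrix is omitted for clarity, and normalizing the angular distance by π (to map it into [0, 1]) is an assumption:

```python
import numpy as np

def euclidean(u, v):
    """Energy 1: Euclidean (L2) distance."""
    return np.linalg.norm(u - v)

def absolute(u, v):
    """Energy 2: absolute (L1) distance; grows with the scale of its inputs."""
    return np.sum(np.abs(u - v))

def angular(u, v):
    """Energy 3: angular distance; scale-invariant and well-defined for
    vectors with negative components (unlike raw cosine similarity)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi  # normalized to [0, 1]

def oscr_penalty(composed, pretrained, energy):
    """Average energy between each composed entity and its pretrained
    ontology embedding: the OSCR regularization term added to the loss."""
    return float(np.mean([energy(c, e) for c, e in zip(composed, pretrained)]))
```

In training, this penalty would be added to the Cloze and next-sentence losses; here it is shown as a standalone computation.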

Experimental Setup
Hyper-parameter Tuning For each fine-tuning task, we used a greedy approach to hyper-parameter tuning by incrementally and independently optimizing: batch size ∈ {8, 16, 32}; initial learning rate ∈ {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵}; whether to include subsumed entities ∈ {yes, no}; and whether to include masked entities ∈ {yes, no}. For CoPA, the Story Cloze task, and RQE, we found an optimal batch size of 16 and an optimal learning rate of 2 × 10⁻⁵. We also found that including subsumed and masked entities was optimal (at a net performance improvement of < 1% accuracy).
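The greedy tuning procedure can be sketched as coordinate-wise search: each hyper-parameter is optimized in turn while the best values found so far are held fixed. The `evaluate` callback and the toy scoring function below are illustrative stand-ins for an actual fine-tuning run:

```python
def greedy_tune(evaluate, search_space, defaults):
    """Greedy coordinate-wise hyper-parameter tuning: optimize one
    hyper-parameter at a time, keeping earlier choices fixed."""
    best = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            trial = dict(best, **{name: value})
            scores[value] = evaluate(trial)   # e.g., dev-set accuracy
        best[name] = max(scores, key=scores.get)
    return best

# Toy surrogate score peaking at batch size 16 and learning rate 2e-5.
score = lambda cfg: -abs(cfg["batch_size"] - 16) - 1e5 * abs(cfg["lr"] - 2e-5)
best = greedy_tune(
    score,
    {"batch_size": [8, 16, 32], "lr": [1e-5, 2e-5, 3e-5]},
    {"batch_size": 8, "lr": 1e-5},
)
```

Greedy search evaluates only the sum, rather than the product, of the candidate counts, at the cost of ignoring interactions between hyper-parameters.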
Pretraining We pretrained BERT using a 2019 Wikipedia dump formatted for Wikipedia's Cirrus search engine. Preprocessing relied on NLTK's Punkt sentence segmenter (Loper and Bird, 2002) and the WordPiece subword tokenizer provided with BERT. We used the vocabulary from BERT-BASE (not LARGE) and a maximum sequence size of 384 subwords, training for 64 000 steps with an initial learning rate of 2 × 10⁻⁵ and 320 warm-up steps.

BERT Modifications
We used a modified version of BERT allowing for mixed-precision training. This necessitated a number of minor changes to improve numerical stability around softmax operations. Training was performed on a single node with 4 Tesla P100 GPUs (multiple variants of OSCR were trained simultaneously using five such nodes at a time). Non-TPU multi-GPU support was added to BERT based on Horovod, relying on Open MPI.

Results
We evaluated the impact of OSCR on three question answering tasks requiring world or domain knowledge and causal reasoning.
Choice of Plausible Alternatives (CoPA), a SemEval 2012 shared task, presents 500 training and 500 test two-choice questions and requires systems to choose the most plausible cause or effect entailed by the premise, as illustrated in Figure 1 (Roemmele et al., 2011). The topics of these questions were drawn from two sources: (1) personal stories taken from a collection of blogs (Gordon and Swanson, 2009); and (2) subject terms from the Library of Congress Thesaurus for Graphic Materials, while the incorrect alternatives were created so as to penalize "purely associative methods".

Figure 4: Example story and candidate endings from the Story Cloze Test.
Premise: Gina misplaced her phone at her grandparents. It wasn't anywhere in the living room. She realized she was in the car before. She grabbed her dad's keys and ran outside.
Ending A: She found her phone in the car.
Ending B: She didn't want her phone anymore.
The Story Cloze Test evaluates story understanding, story generation, and script learning and requires a system to choose the correct ending to a four-sentence story, as illustrated in Figure 4 (Mostafazadeh et al., 2016). In our experiments, we used only the 3,744 labeled stories.

The Impact of External Knowledge.
It is clear from Table 1 that incorporating OSCR provided a significant improvement in accuracy for both common-sense causal reasoning tasks, indicating that OSCR was able to inject useful world knowledge into the network. We also evaluated the impact of OSCR on the Stanford Question Answering Dataset (SQuAD), version 1.1, and observed no discernible change in performance (an accuracy of 86.6% without and 86.5% with OSCR). The lack of impact on SQuAD is unsurprising, as the vast majority of SQuAD questions can be answered directly from surface-level information in the text, but it shows that injecting world knowledge with OSCR does not come at the expense of performance on tasks that require little outside knowledge.

The Impact of Domain Knowledge.
While the improvement was less pronounced than in the general domain, for the clinical domain OSCR provided a modest improvement over standard BERT, and both improved over the state-of-the-art.

The Impact of Entity Subsumption and De-masking
Entity Subsumption We evaluated the impact of including subsumed entities when calculating OSCR and found it provided, on average, only a minor increase in accuracy (< 1 % average relative improvement) at a 10 % increase in total training time. Consequently, we recommend ignoring all subsumed entities.
Entity De-masking De-masking entities had little over-all impact on model performance (< 1% average relative improvement) and no discernible effect on training time. This may be explained by the fact that Wikipedia sentences are typically much longer than standard English sentences, so the likelihood of an important entity being masked is relatively small.

The Role of Semantic Composition
When comparing semantic composition methods, the Linear method had the most consistent performance across both domains; the Recurrent Additive Network (RAN) obtained the lowest performance on the general domain and the highest performance on medical texts, while the Linear RAN exhibited the opposite behavior. While this suggests more complex domains require more complex representations of semantic composition, we recommend Linear composition as it exhibits consistent performance and requires 50% less training time than the RAN and 40% less than the Linear RAN.

The Impact of the Energy Functions
In terms of energy functions, the Euclidean distance was the most consistent, the Angular distance was the best for the Story Cloze and RQE tasks, and the Absolute difference was the best for CoPA. The Angular distance (being scale-invariant) is least affected by the number of subwords constituting an entity while the Absolute distance is most affected. Consequently, we believe the Absolute distance was only effective on the CoPA evaluation because the entities in CoPA are typically very short (single words or subwords). We recommend selecting the energy function based on the average length of entities in the fine-tuning tasks: Angular distance with long entities, Absolute distance with short entities, and Euclidean distance with varied entities.
Finally, we compared the impact of including and excluding subsumed and masked entities and found that neither resulted in any substantial change in model performance (< 1% change in accuracy), while ignoring masked and subsumed entities led to a 20% average reduction in training time.

Limitations and Future Work
In this study, we only considered ConceptNet as our ontology because we were primarily interested in injecting common-sense world knowledge. However, OSCR is not specific to any Ontology. Likewise, we considered only one type of pretrained entity embeddings: ConceptNet NumberBatch (Speer et al., 2017), despite the availability of other, more sophisticated approaches for knowledge graph embedding including TransE (Bordes et al., 2013), TransR (Lin et al., 2015), TransH (Wang et al., 2014), RESCAL (Nickel et al., 2011), and OSRL (Xiong et al., 2018). In future work, we hope to explore the impact of incorporating different Ontologies and knowledge graphs as well as alternative types of entity embeddings (Bordes et al., 2013; Lin et al., 2015; Wang et al., 2014; Nickel et al., 2011; Xiong et al., 2018).

Conclusions
In this paper we presented OSCR (Ontology-based Semantic Composition Regularization), a learned regularization method for injecting task-agnostic knowledge from an Ontology or knowledge graph into a neural network during pretraining. We evaluated the impact of including OSCR when pretraining BERT with Wikipedia articles by measuring the performance when fine-tuning on two question answering tasks involving world knowledge and causal reasoning and one requiring domain (healthcare) knowledge and obtained 33.3 %, 18.6 %, and 4 % improved accuracy compared to pre-training BERT without OSCR.