Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension

Machine reading comprehension (MRC) is a crucial and challenging task in NLP. Recently, pre-trained language models (LMs), especially BERT, have achieved remarkable success, presenting new state-of-the-art results in MRC. In this work, we investigate the potential of leveraging external knowledge bases (KBs) to further improve BERT for MRC. We introduce KT-NET, which employs an attention mechanism to adaptively select desired knowledge from KBs, and then fuses the selected knowledge with BERT to enable context- and knowledge-aware predictions. We believe this combines the merits of both deep LMs and curated KBs towards better MRC. Experimental results indicate that KT-NET offers significant and consistent improvements over BERT, outperforming competitive baselines on the ReCoRD and SQuAD1.1 benchmarks. Notably, it ranks first on the ReCoRD leaderboard, and is also the best single model on the SQuAD1.1 leaderboard at the time of submission (March 4th, 2019).


Introduction
Machine reading comprehension (MRC), which requires machines to comprehend text and answer questions about it, is a crucial task in natural language processing. With the development of deep learning and the increasing availability of datasets (Rajpurkar et al., 2016, 2018; Nguyen et al., 2016; Joshi et al., 2017), MRC has achieved remarkable advancements in the last few years.
Recently, language model (LM) pre-training has caused a stir in the MRC community. These LMs are pre-trained on unlabeled text and then applied to MRC, in either a feature-based (Peters et al., 2018a) or a fine-tuning (Radford et al., 2018) manner, both offering substantial performance boosts. Among different pre-training mechanisms, BERT (Devlin et al., 2018), which uses the Transformer encoder (Vaswani et al., 2017) and trains a bidirectional LM, is undoubtedly the most successful by far, presenting new state-of-the-art results in MRC and a wide variety of other language understanding tasks. Owing to the large amounts of unlabeled data and the sufficiently deep architectures used during pre-training, advanced LMs such as BERT are able to capture complex linguistic phenomena, understanding language better than previously appreciated (Peters et al., 2018b; Goldberg, 2019).
However, as widely recognized, genuine reading comprehension requires not only language understanding, but also knowledge that supports sophisticated reasoning (Chen et al., 2016; Mihaylov and Frank, 2018; Bauer et al., 2018; Zhong et al., 2018). We therefore argue that pre-trained LMs, despite their power, could be further improved for MRC by integrating background knowledge. Fig. 1 gives a motivating example from ReCoRD (Zhang et al., 2018). In this example, the passage describes that Sudan faces trade sanctions from the US due to its past support for North Korea. The cloze-style question states that Sudan is subject to Trump's ban, and asks for the organization by which Sudan is deemed to be a state sponsor of terror. BERT fails on this case as there is not enough evidence in the text. But after introducing the world knowledge "Trump is the person who leads US" and the word knowledge "sanctions has a common hypernym with ban", we can reasonably infer that the answer is "US". This example suggests the importance and necessity of integrating knowledge, even on the basis of a rather strong model like BERT. We refer interested readers to Appendix A for another motivating example from SQuAD1.1 (Rajpurkar et al., 2016).
Thus, in this paper, we devise KT-NET (abbr. for Knowledge and Text fusion NET), a new approach to MRC which improves pre-trained LMs with additional knowledge from knowledge bases (KBs). The aim here is to take full advantage of both linguistic regularities covered by deep LMs and high-quality knowledge derived from curated KBs, towards better MRC. We leverage two KBs: WordNet (Miller, 1995) that records lexical relations between words and NELL (Carlson et al., 2010) that stores beliefs about entities. Both are useful for the task (see Fig. 1). Instead of introducing symbolic facts, we resort to distributed representations (i.e., embeddings) of KBs (Yang and Mitchell, 2017). With such KB embeddings, we could (i) integrate knowledge relevant not only locally to the reading text but also globally about the whole KBs; and (ii) easily incorporate multiple KBs at the same time, with minimal task-specific engineering (see § 2.2 for detailed explanation).
As depicted in Fig. 2, given a question and passage, KT-NET first retrieves potentially relevant KB embeddings and encodes them in a knowledge memory. Then, it employs, in turn, (i) a BERT encoding layer to compute deep, context-aware representations for the reading text; (ii) a knowledge integration layer to select desired KB embeddings from the memory, and integrate them with BERT representations; (iii) a self-matching layer to fuse BERT and KB representations, so as to enable rich interactions among them; and (iv) an output layer to predict the final answer. In this way we enrich BERT with curated knowledge, combine the merits of both, and make knowledge-aware predictions.
We evaluate our approach on two benchmarks: ReCoRD (Zhang et al., 2018) and SQuAD1.1 (Rajpurkar et al., 2016). On ReCoRD, a passage is generated from the first few paragraphs of a news article, and the corresponding question from the rest of the article, which, by design, requires background knowledge and reasoning. On SQuAD1.1, where the best models already outperform humans, the questions that remain unsolved are the really difficult ones. Both are appealing testbeds for evaluating genuine reading comprehension capabilities. We show that incorporating knowledge can bring significant and consistent improvements to BERT, which itself is one of the strongest models on both datasets.
The contributions of this paper are two-fold: (i) We investigate and demonstrate the feasibility of enhancing pre-trained LMs with rich knowledge for MRC. To our knowledge, this is the first study of its kind, indicating a potential direction for future research. (ii) We devise a new approach to MRC, KT-NET. It outperforms competitive baselines, ranks first on the ReCoRD leaderboard, and is also the best single model on the SQuAD1.1 leaderboard at the time of submission (March 4th, 2019).

Our Approach
In this work we consider the extractive MRC task. Given a passage with $m$ tokens $P = \{p_i\}_{i=1}^{m}$ and a question with $n$ tokens $Q = \{q_j\}_{j=1}^{n}$, our goal is to predict an answer $A$ which is constrained as a contiguous span in the passage, i.e., $A = \{p_i\}_{i=a}^{b}$, with $a$ and $b$ indicating the answer boundary.
We propose KT-NET for this task, the key idea of which is to enhance BERT with curated knowledge from KBs, so as to combine the merits of both. To encode knowledge, we adopt knowledge graph embedding techniques (Yang et al., 2015) and learn vector representations of KB concepts. Given passage $P$ and question $Q$, we retrieve for each token $w \in P \cup Q$ a set of potentially relevant KB concepts $C(w)$, where each concept $c \in C(w)$ is associated with a learned vector embedding $\mathbf{c}$.
Based upon these pre-trained KB embeddings, KT-NET is built, as depicted in Fig. 2, with four major components: (i) a BERT encoding layer that computes deep, context-aware representations for questions and passages; (ii) a knowledge integration layer that employs an attention mechanism to select the most relevant KB embeddings, and integrates them with BERT representations; (iii) a self-matching layer that further enables rich interactions among BERT and KB representations; and (iv) an output layer that predicts the final answer.
In what follows, we first introduce the four major components in § 2.1, and leave knowledge embedding and retrieval to § 2.2.

Major Components of KT-NET
KT-NET consists of four major modules: BERT encoding, knowledge integration, self-matching, and final output, detailed as follows.
BERT Encoding Layer. This layer uses the BERT encoder to model passages and questions. It takes as input passage $P$ and question $Q$, and computes for each token a context-aware representation.
Specifically, given passage $P = \{p_i\}_{i=1}^{m}$ and question $Q = \{q_j\}_{j=1}^{n}$, we first pack them into a single sequence of length $m + n + 3$, i.e.,
$$S = [\langle\text{CLS}\rangle, Q, \langle\text{SEP}\rangle, P, \langle\text{SEP}\rangle],$$
where $\langle\text{SEP}\rangle$ is the token separating $Q$ and $P$, and $\langle\text{CLS}\rangle$ the token for classification (not used in this paper). For each token $s_i$ in $S$, we construct its input representation as:
$$\mathbf{h}_i^{0} = \mathbf{s}_i^{\text{tok}} + \mathbf{s}_i^{\text{pos}} + \mathbf{s}_i^{\text{seg}},$$
where $\mathbf{s}_i^{\text{tok}}$, $\mathbf{s}_i^{\text{pos}}$, and $\mathbf{s}_i^{\text{seg}}$ are the token, position, and segment embeddings for $s_i$, respectively. Tokens in $Q$ share the same segment embedding $\mathbf{q}^{\text{seg}}$, and tokens in $P$ the same segment embedding $\mathbf{p}^{\text{seg}}$. Such input representations are then fed into $L$ successive Transformer encoder blocks, i.e.,
$$\mathbf{h}_i^{\ell} = \text{Transformer}(\mathbf{h}_i^{\ell-1}), \quad \ell = 1, 2, \cdots, L,$$
so as to generate deep, context-aware representations for passages and questions. We refer readers to (Devlin et al., 2018; Vaswani et al., 2017) for details. The final hidden states $\{\mathbf{h}_i^{L}\}_{i=1}^{m+n+3}$, with $\mathbf{h}_i^{L} \in \mathbb{R}^{d_1}$, are taken as the output of this layer.
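The packing step and the input representation described above can be illustrated with a minimal sketch. The helper name `pack_inputs`, the toy vocabulary, and the random embedding tables are hypothetical; assigning the special tokens to the question or passage segment follows the usual BERT convention and is an assumption here, not something this section specifies.

```python
from typing import List, Tuple

import numpy as np


def pack_inputs(question: List[str], passage: List[str]) -> Tuple[List[str], List[int], List[int]]:
    """Pack Q and P into S = [CLS] Q [SEP] P [SEP] and build segment/position ids."""
    tokens = ["[CLS]"] + question + ["[SEP]"] + passage + ["[SEP]"]
    # Assumption: [CLS] and the first [SEP] use the question segment (0),
    # the passage and the final [SEP] use the passage segment (1).
    segment_ids = [0] * (len(question) + 2) + [1] * (len(passage) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


question = ["who", "leads", "the", "us", "?"]
passage = ["trump", "leads", "the", "us", "."]
tokens, segment_ids, position_ids = pack_inputs(question, passage)

# Toy embedding tables (random stand-ins for the pre-trained BERT tables).
rng = np.random.default_rng(0)
d1 = 8
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
tok_table = rng.normal(size=(len(vocab), d1))
pos_table = rng.normal(size=(len(tokens), d1))
seg_table = rng.normal(size=(2, d1))

# Input representation: sum of token, position, and segment embeddings.
token_ids = [vocab[t] for t in tokens]
h0 = tok_table[token_ids] + pos_table[position_ids] + seg_table[segment_ids]
```

The packed sequence has m + n + 3 entries, and `h0` holds one d1-dimensional row per token; in the actual model these rows would be passed through the Transformer blocks.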
Knowledge Integration Layer. This layer is designed to further integrate knowledge into BERT, and is a core module of our approach. It takes as input the BERT representations $\{\mathbf{h}_i^{L}\}$ output from the previous layer, and enriches them with relevant KB embeddings, which makes the representations not only context-aware but also knowledge-aware.
Specifically, for each token $s_i$, we get its BERT representation $\mathbf{h}_i^{L} \in \mathbb{R}^{d_1}$ and retrieve a set of potentially relevant KB concepts $C(s_i)$, where each concept $c_j$ is associated with a KB embedding $\mathbf{c}_j \in \mathbb{R}^{d_2}$. (We will describe the KB embedding and retrieval process later in § 2.2.) Then we employ an attention mechanism to adaptively select the most relevant KB concepts. We measure the relevance of concept $c_j$ to token $s_i$ with a bilinear operation, and calculate the attention weight as:
$$\alpha_{ij} \propto \exp(\mathbf{c}_j^{\top} \mathbf{W} \mathbf{h}_i^{L}), \tag{1}$$
where $\mathbf{W} \in \mathbb{R}^{d_2 \times d_1}$ is a trainable weight parameter. As these KB concepts are not necessarily relevant to the token, we follow (Yang and Mitchell, 2017) to further introduce a knowledge sentinel $\bar{\mathbf{c}} \in \mathbb{R}^{d_2}$, and calculate its attention weight as:
$$\beta_i \propto \exp(\bar{\mathbf{c}}^{\top} \mathbf{W} \mathbf{h}_i^{L}).$$
The retrieved KB embeddings $\{\mathbf{c}_j\}$ (as well as the sentinel $\bar{\mathbf{c}}$) are then aligned to $s_i$ and aggregated accordingly, i.e.,
$$\mathbf{k}_i = \sum_j \alpha_{ij} \mathbf{c}_j + \beta_i \bar{\mathbf{c}},$$
with $\sum_j \alpha_{ij} + \beta_i = 1$. Here $\mathbf{k}_i$ can be regarded as a knowledge state vector that encodes extra KB information w.r.t. the current token. We concatenate $\mathbf{k}_i$ with the BERT representation $\mathbf{h}_i^{L}$ and output $\mathbf{u}_i = [\mathbf{h}_i^{L}, \mathbf{k}_i] \in \mathbb{R}^{d_1 + d_2}$, which is by nature not only context-aware but also knowledge-aware.
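As a rough illustration of the attention with a knowledge sentinel, the following sketch computes the weights over the retrieved concepts and the sentinel for a single token and forms the knowledge-enriched vector. The function name `integrate_knowledge`, the toy dimensions, and the random inputs are hypothetical; normalizing the concept and sentinel scores with one joint softmax is an assumption that follows from the stated constraint that the weights sum to one.

```python
import numpy as np


def integrate_knowledge(h, concepts, sentinel, W):
    """Attend over KB concept embeddings plus a sentinel for one token.

    h: (d1,) BERT representation; concepts: (k, d2) retrieved KB embeddings;
    sentinel: (d2,) knowledge sentinel; W: (d2, d1) bilinear weight.
    Returns the enriched vector u = [h, k_state], alpha (k,), and beta (scalar).
    """
    candidates = np.vstack([concepts, sentinel[None, :]])  # (k + 1, d2)
    logits = candidates @ (W @ h)                          # bilinear scores c^T W h
    logits = logits - logits.max()                         # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()        # joint softmax
    alpha, beta = weights[:-1], weights[-1]
    k_state = alpha @ concepts + beta * sentinel           # knowledge state vector
    return np.concatenate([h, k_state]), alpha, beta


rng = np.random.default_rng(0)
d1, d2, num_concepts = 6, 4, 3
h = rng.normal(size=d1)
concepts = rng.normal(size=(num_concepts, d2))
sentinel = rng.normal(size=d2)
W = rng.normal(size=(d2, d1))
u, alpha, beta = integrate_knowledge(h, concepts, sentinel, W)
```

When several KBs are used, the same routine could be run once per KB and the resulting knowledge states appended to the BERT representation, in line with the multi-KB discussion in § 2.2.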
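Self-Matching Layer. This layer takes as input the knowledge-enriched representations $\{\mathbf{u}_i\}$, and employs a self-attention mechanism to further enable interactions among the context components $\{\mathbf{h}_i^{L}\}$ and knowledge components $\{\mathbf{k}_i\}$. It is also an important module of our approach.
We model both direct and indirect interactions. As for direct interactions, given two tokens $s_i$ and $s_j$ (along with their knowledge-enriched representations $\mathbf{u}_i$ and $\mathbf{u}_j$), we measure their similarity with a trilinear function (Seo et al., 2017):
$$r_{ij} = \mathbf{w}^{\top} [\mathbf{u}_i, \mathbf{u}_j, \mathbf{u}_i \odot \mathbf{u}_j],$$
and accordingly obtain a similarity matrix $\mathbf{R}$ with $r_{ij}$ being the $ij$-th entry. Here $\odot$ denotes elementwise multiplication, and $\mathbf{w} \in \mathbb{R}^{3d_1 + 3d_2}$ is a trainable weight parameter. Then, we apply a row-wise softmax operation on $\mathbf{R}$ to get the self-attention weight matrix $\mathbf{A}$, and compute for each token $s_i$ an attended vector $\mathbf{v}_i$, i.e.,
$$a_{ij} = \frac{\exp(r_{ij})}{\sum_j \exp(r_{ij})}, \qquad \mathbf{v}_i = \sum_j a_{ij} \mathbf{u}_j,$$
where $a_{ij}$ is the $ij$-th entry of $\mathbf{A}$. $\mathbf{v}_i$ reflects how each token $s_j$ interacts directly with $s_i$.
Aside from direct interactions, indirect interactions, e.g., the interaction between $s_i$ and $s_j$ via an intermediate token $s_k$, are also useful. To further model such indirect interactions, we conduct a self-multiplication of the original attention matrix $\mathbf{A}$, and compute for each token $s_i$ another attended vector $\bar{\mathbf{v}}_i$, i.e.,
$$\bar{\mathbf{A}} = \mathbf{A}^2, \qquad \bar{\mathbf{v}}_i = \sum_j \bar{a}_{ij} \mathbf{u}_j,$$
where $\bar{a}_{ij}$ is the $ij$-th entry of $\bar{\mathbf{A}}$. $\bar{\mathbf{v}}_i$ reflects how each token $s_j$ interacts indirectly with $s_i$, through all possible intermediate tokens. Finally, we build the output for each token by a concatenation $\mathbf{o}_i = [\mathbf{u}_i, \mathbf{v}_i, \mathbf{u}_i - \mathbf{v}_i, \mathbf{u}_i \odot \mathbf{v}_i, \bar{\mathbf{v}}_i, \mathbf{u}_i - \bar{\mathbf{v}}_i] \in \mathbb{R}^{6d_1 + 6d_2}$.
Output Layer. We follow BERT and simply use a linear output layer, followed by a standard softmax operation, to predict answer boundaries. The probability of each token $s_i$ to be the start or end position of the answer span is calculated as:
$$p_i^{1} = \frac{\exp(\mathbf{w}_1^{\top} \mathbf{o}_i)}{\sum_j \exp(\mathbf{w}_1^{\top} \mathbf{o}_j)}, \qquad p_i^{2} = \frac{\exp(\mathbf{w}_2^{\top} \mathbf{o}_i)}{\sum_j \exp(\mathbf{w}_2^{\top} \mathbf{o}_j)},$$
where $\{\mathbf{o}_i\}$ are output by the self-matching layer, and $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^{6d_1 + 6d_2}$ are trainable parameters. The training objective is to minimize the negative log-likelihood of the true start and end positions:
$$\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \left( \log p_{y_j^{1}}^{1} + \log p_{y_j^{2}}^{2} \right),$$
where $N$ is the number of examples in the dataset, and $y_j^{1}$, $y_j^{2}$ are the true start and end positions of the $j$-th example, respectively. At inference time, the span $(a, b)$ with $a \le b$ that maximizes $p_a^{1} p_b^{2}$ is chosen as the predicted answer.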
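To make the self-matching and output computations more concrete, here is a small NumPy sketch. All function names and toy inputs are hypothetical. The six-part concatenation is an assumption chosen to be consistent with the stated output dimension of 6d_1 + 6d_2, and the exhaustive span search with a maximum answer length is likewise an illustrative choice, not necessarily the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def trilinear_scores(U, w):
    """r_ij = w^T [u_i, u_j, u_i * u_j] for all pairs, without building the concatenations."""
    d = U.shape[1]
    w_i, w_j, w_ij = w[:d], w[d:2 * d], w[2 * d:]
    return (U @ w_i)[:, None] + (U @ w_j)[None, :] + (U * w_ij) @ U.T


def self_matching(U, w):
    """U: (T, d) knowledge-enriched vectors. Returns O: (T, 6d)."""
    A = softmax(trilinear_scores(U, w), axis=1)  # direct interactions (row-wise softmax)
    V = A @ U
    A_bar = A @ A                                # indirect interactions via intermediate tokens
    V_bar = A_bar @ U
    return np.concatenate([U, V, U - V, U * V, V_bar, U - V_bar], axis=1)


def best_span(O, w1, w2, max_answer_len=30):
    """Pick (a, b) with a <= b maximizing p_start[a] * p_end[b]."""
    p_start = softmax(O @ w1, axis=0)
    p_end = softmax(O @ w2, axis=0)
    best, best_score = (0, 0), -1.0
    for a in range(len(p_start)):
        for b in range(a, min(a + max_answer_len, len(p_end))):
            score = p_start[a] * p_end[b]
            if score > best_score:
                best, best_score = (a, b), score
    return best, best_score


rng = np.random.default_rng(0)
T, d = 5, 4                      # d plays the role of d1 + d2
U = rng.normal(size=(T, d))
w = rng.normal(size=3 * d)
w1, w2 = rng.normal(size=6 * d), rng.normal(size=6 * d)
R = trilinear_scores(U, w)
O = self_matching(U, w)
(a, b), score = best_span(O, w1, w2)
```

Splitting the trilinear weight into three parts avoids materializing all T × T concatenated vectors, which matters for sequences of several hundred tokens.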

Knowledge Embedding and Retrieval
Now we introduce the knowledge embedding and retrieval process. We use two KBs: WordNet and NELL, both stored as (subject, relation, object) triples, where each triple is a fact indicating a specific relation between two entities. WordNet stores lexical relations between word synsets, e.g., (organism, hypernym of, animal). NELL stores beliefs about entities, where the subjects are usually real-world entities and the objects are either entities, e.g., (Coca Cola, headquartered in, Atlanta), or concepts, e.g., (Coca Cola, is a, company). Below we shall sometimes abuse terminology and refer to synsets, real-world entities, and concepts as "entities". As we have seen in Fig. 1, both KBs are useful for MRC.

KB Embedding
In contrast to directly encoding KBs as symbolic (subject, relation, object) facts, we choose to encode them in a continuous vector space. Specifically, given any triple $(s, r, o)$, we would like to learn vector embeddings of subject $s$, relation $r$, and object $o$, so that the validity of the triple can be measured in the vector space based on the embeddings. We adopt the BILINEAR model (Yang et al., 2015), which measures the validity via a bilinear function $f(s, r, o) = \mathbf{s}^{\top} \text{diag}(\mathbf{r}) \mathbf{o}$. Here, $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{R}^{d_2}$ are the vector embeddings associated with $s$, $r$, $o$, respectively, and $\text{diag}(\mathbf{r})$ is a diagonal matrix with the main diagonal given by $\mathbf{r}$. Triples already stored in a KB are supposed to have higher validity. A margin-based ranking loss is then accordingly designed to learn the embeddings (refer to (Yang et al., 2015) for details). After this embedding process, we obtain a vector representation for each entity (as well as relation) of the two KBs.
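The scoring function and a margin-based ranking loss can be sketched as follows. This is only an illustration: the function names, the margin value of 1.0, and the use of a single corrupted object as the negative example are assumptions, and the exact training procedure is the one described by Yang et al. (2015).

```python
import numpy as np


def bilinear_score(s, r, o):
    """f(s, r, o) = s^T diag(r) o, i.e. the sum of elementwise products."""
    return float(np.sum(s * r * o))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that pushes a stored triple above a corrupted one by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


rng = np.random.default_rng(0)
d2 = 100  # embedding size used in the experiments of this paper
s, r, o, o_corrupt = (rng.normal(size=d2) for _ in range(4))

pos = bilinear_score(s, r, o)          # triple assumed to be in the KB
neg = bilinear_score(s, r, o_corrupt)  # hypothetical corrupted triple
loss = margin_ranking_loss(pos, neg)
```

Because the relation matrix is diagonal, the score reduces to an elementwise product followed by a sum, so no d2 × d2 matrix ever needs to be stored.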

KB Concepts Retrieval
In this work, we treat WordNet synsets and NELL concepts as knowledge to be retrieved from KBs, similar to (Yang and Mitchell, 2017). For WordNet, given a passage or question word, we return its synsets as candidate KB concepts. For NELL, we first recognize named entities from a given passage and question, link the recognized mentions to NELL entities by string matching, and then collect the corresponding NELL concepts as candidates. Words within the same entity name and subwords within the same word will share the same retrieved concepts, e.g., we retrieve the NELL concept "company" for both "Coca" and "Cola". After this retrieval process, we obtain a set of potentially relevant KB concepts for each token in the input sequence, where each KB concept is associated with a vector embedding.
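The sharing of retrieved concepts across the words of an entity name and across subwords can be illustrated with a toy sketch. The lookup tables, the synset identifier, the subword split, and the function name `retrieve_concepts` are all hypothetical stand-ins for the real WordNet and NELL resources, and the two KBs are merged into one list per token purely for brevity.

```python
from typing import Dict, List, Tuple


def retrieve_concepts(
    words: List[str],
    pieces: List[List[str]],
    mentions: List[Tuple[int, int]],
    word_to_synsets: Dict[str, List[str]],
    entity_to_concepts: Dict[str, List[str]],
) -> List[Tuple[str, List[str]]]:
    """Return (subword, candidate concepts) pairs for a tokenized text."""
    # Word-level lookup (WordNet-style): each word gets its own synsets.
    per_word = [list(word_to_synsets.get(w.lower(), [])) for w in words]
    # Entity-level lookup (NELL-style): link mentions by exact string matching,
    # then share the concepts among all words of the mention.
    for start, end in mentions:
        name = " ".join(words[start:end]).lower()
        for idx in range(start, end):
            per_word[idx].extend(entity_to_concepts.get(name, []))
    # Subwords inherit the concepts of the word they belong to.
    per_piece = []
    for idx, word_pieces in enumerate(pieces):
        for piece in word_pieces:
            per_piece.append((piece, list(per_word[idx])))
    return per_piece


words = ["Coca", "Cola", "sanctions"]
pieces = [["Coca"], ["Cola"], ["sanction", "##s"]]   # hypothetical wordpiece split
mentions = [(0, 2)]                                  # "Coca Cola" as one entity mention
word_to_synsets = {"sanctions": ["sanction_NN_1"]}   # hypothetical synset id
entity_to_concepts = {"coca cola": ["company"]}
result = retrieve_concepts(words, pieces, mentions, word_to_synsets, entity_to_concepts)
```

In this toy run both "Coca" and "Cola" receive the concept "company", and both subwords of "sanctions" receive the same synset.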
Advantages. Previous attempts that leverage extra knowledge for MRC (Bauer et al., 2018; Mihaylov and Frank, 2018) usually follow a retrieve-then-encode paradigm, i.e., they first retrieve relevant knowledge from KBs, and only the retrieved knowledge, which is relevant locally to the reading text, will be encoded and integrated for MRC. Our approach, by contrast, first learns embeddings for KB concepts with consideration of the whole KBs (or at least sufficiently large subsets of KBs). The learned embeddings are then retrieved and integrated for MRC, and are thus relevant not only locally to the reading text but also globally about the whole KBs. Such knowledge is more informative and potentially more useful for MRC.
Moreover, our approach offers a highly convenient way to simultaneously integrate knowledge from multiple KBs. For instance, suppose we retrieve for token $s_i$ a set of candidate KB concepts $C_1(s_i)$ from WordNet, and $C_2(s_i)$ from NELL. Then, we can compute a knowledge state vector $\mathbf{k}_i^{1}$ based on $C_1(s_i)$, and $\mathbf{k}_i^{2}$ based on $C_2(s_i)$, which are further combined with the BERT hidden state $\mathbf{h}_i^{L}$ to generate $\mathbf{u}_i = [\mathbf{h}_i^{L}, \mathbf{k}_i^{1}, \mathbf{k}_i^{2}]$. As such, $\mathbf{u}_i$ naturally encodes knowledge from both KBs (see the knowledge integration layer for technical details).

Datasets
In this paper we empirically evaluate our approach on two benchmarks: ReCoRD and SQuAD1.1.
Table 1: Statistics of ReCoRD and SQuAD1.1.
Dataset     Train      Dev       Test
ReCoRD      100,730    10,000    10,000
SQuAD1.1    87,599     10,570    9,533

ReCoRD, an acronym for the Reading Comprehension with Commonsense Reasoning Dataset, is a large-scale MRC dataset requiring commonsense reasoning (Zhang et al., 2018). It consists of passage-question-answer tuples, collected from CNN and Daily Mail news articles. In each tuple, the passage is formed by the first few paragraphs of a news article, with named entities recognized and marked. The question is a sentence from the rest of the article, with a missing entity specified as the golden answer. The goal is to find the golden answer among the entities marked in the passage, which can be regarded as an extractive MRC task. This data collection process by design generates questions that require external knowledge and reasoning. It also filters out questions that can be answered simply by pattern matching, posing further challenges to current MRC systems. We take it as the major testbed for evaluating our approach.
SQuAD1.1 (Rajpurkar et al., 2016) is a well-known extractive MRC dataset that consists of questions created by crowdworkers for Wikipedia articles. The golden answer to each question is a span from the corresponding passage. In this paper, we focus more on answerable questions than unanswerable ones. Hence, we choose SQuAD1.1 rather than SQuAD2.0 (Rajpurkar et al., 2018).
Table 1 provides the statistics of ReCoRD and SQuAD1.1. On both datasets, the training and development (dev) sets are publicly available, but the test set is hidden. One has to submit the code to retrieve the final test score. As frequent submissions to probe the unseen test set are not encouraged, we only submit our best single model for testing, and conduct further analysis on the dev set. Both datasets use Exact Match (EM) and (macro-averaged) F1 as the evaluation metrics (Zhang et al., 2018).
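For readers unfamiliar with the two metrics, the following is a minimal sketch of EM and token-level F1, macro-averaged over examples. The normalization steps (lowercasing, removing punctuation and English articles) follow the common SQuAD-style convention and are an assumption here; the official evaluation scripts of each benchmark remain the reference.

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: List[str], references: List[List[str]]) -> Tuple[float, float]:
    """Macro-average EM and F1, taking the best score over the gold answers of each example."""
    em = sum(max(exact_match(p, g) for g in golds) for p, golds in zip(predictions, references))
    f1 = sum(max(f1_score(p, g) for g in golds) for p, golds in zip(predictions, references))
    n = len(predictions)
    return 100.0 * em / n, 100.0 * f1 / n


em, f1 = evaluate(["the US", "UN Security Council"], [["US"], ["US"]])
```

In the toy call, the first prediction matches after normalization and the second shares no token with the gold answer, so both metrics come out at 50.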

Experimental Setups
Data Preprocessing. We first prepare pre-trained KB embeddings. We use the resources provided by Yang and Mitchell (2017), where the WordNet embeddings were pre-trained on a subset consisting of 151,442 triples with 40,943 synsets and 18 relations, and the NELL embeddings were pre-trained on a subset containing 180,107 entities and 258 concepts. Both groups of embeddings are 100-dimensional. Refer to (Yang and Mitchell, 2017) for details.
Then we retrieve knowledge from the two KBs. For WordNet, we employ the BasicTokenizer built in BERT to tokenize text, and look up synsets for each word using NLTK (Bird and Loper, 2004). Synsets within the 40,943 subset are returned as candidate KB concepts for the word. For NELL, we link entity mentions to the whole KB, and return associated concepts within the 258 subset as candidate KB concepts. Entity mentions are given as answer candidates on ReCoRD, and recognized by Stanford CoreNLP (Manning et al., 2014) on SQuAD1.1.
Finally, we follow Devlin et al. (2018) and use the FullTokenizer built in BERT to segment words into wordpieces. The maximum question length is set to 64. Questions longer than that are truncated. The maximum input length (|S|) is set to 384. Input sequences longer than that are segmented into chunks with a stride of 128. The maximum answer length at inference time is set to 30.
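The length limits above can be illustrated with a small sliding-window sketch. It assumes that the stride is the offset between the starts of consecutive chunks and that three positions are reserved for the special tokens, as in the BERT reference fine-tuning script; the function name `chunk_passage` is hypothetical.

```python
from typing import List, Tuple


def chunk_passage(
    question_len: int,
    passage_len: int,
    max_query_len: int = 64,
    max_seq_len: int = 384,
    stride: int = 128,
) -> List[Tuple[int, int]]:
    """Return (start, length) windows over the passage tokens."""
    query_len = min(question_len, max_query_len)   # longer questions are truncated
    max_passage_len = max_seq_len - query_len - 3  # room for [CLS] and two [SEP]
    spans, start = [], 0
    while start < passage_len:
        length = min(passage_len - start, max_passage_len)
        spans.append((start, length))
        if start + length == passage_len:
            break
        start += min(length, stride)
    return spans


spans = chunk_passage(question_len=20, passage_len=500)
```

With a 20-token question and a 500-token passage, each window can hold up to 361 passage tokens, so consecutive windows overlap heavily and every passage token appears in at least one chunk.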
Comparison Setting. We evaluate our approach in three settings: KT-NET$_{\text{WordNet}}$, KT-NET$_{\text{NELL}}$, and KT-NET$_{\text{BOTH}}$, which incorporate knowledge from WordNet, NELL, and both KBs, respectively. We take BERT as a direct baseline, in which only the BERT encoding layer and output layer are used, and no knowledge is incorporated. Our BERT follows exactly the same design as the original paper (Devlin et al., 2018). Besides BERT, we further take top-ranked systems on each dataset as additional baselines (detailed in § 3.3).
Training Details. For all three settings of KT-NET (as well as BERT), we initialize parameters of the BERT encoding layer with pre-trained models officially released by Google. These models were pre-trained on the concatenation of BooksCorpus (800M words) and Wikipedia (2,500M words), using the tasks of masked language modeling and next sentence prediction (Devlin et al., 2018). We empirically find that the cased, large model (which is case sensitive and contains 24 Transformer encoding blocks, each with 16 self-attention heads and 1024 hidden units) performs the best on both datasets. Throughout our experiments, we use this setting unless specified otherwise. Other trainable parameters are randomly initialized.
We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-5 and a batch size of 24. The number of training epochs is chosen from {2, 3, 4}, according to the best EM+F1 score on the dev set of each dataset. During training, the pre-trained BERT parameters are fine-tuned together with the other trainable parameters, while the KB embeddings are kept fixed, which is empirically observed to offer the best performance.

Results
On ReCoRD and SQuAD1.1, we compare our approach to BERT and the top 5 (single) models on the leaderboard (exclusive of ours). The results are given in Table 2 and Table 3, respectively, where the scores of the non-BERT baselines are taken directly from the leaderboard and/or literature.
On ReCoRD (Table 2): [...] with 12 self-attention heads and 768 hidden units; (iii) DocQA (Clark and Gardner, 2018) and SAN are two previous state-of-the-art MRC models; (iv) the pre-trained LM ELMo (Peters et al., 2018a) is further used in DocQA. All these models, except for DCReader+BERT, were re-implemented by the creators of the dataset and provided as official baselines (Zhang et al., 2018).
On SQuAD1.1 (Table 3): (i) BERT+TriviaQA is the former best model officially submitted by Google. It is an uncased, large model, and further uses data augmentation with TriviaQA (Joshi et al., 2017); (ii) WD, nlnet, and MARS are three competitive models that have not been published; (iii) QANet is a well-performing MRC model proposed by Yu et al. (2018), and later re-implemented and submitted by Google Brain & CMU.
Results on dev sets show that (i) KT-NET consistently outperforms BERT (which itself already surpasses all the other baselines), irrespective of which KB is used, and on both datasets. Our best KT-NET model offers a 1.38/1.45 improvement in EM/F1 over BERT on ReCoRD, and a 0.74/0.46 improvement in EM/F1 on SQuAD1.1. (ii) Both KBs are capable of improving BERT for MRC, but the best setting varies across datasets. Integrating both KBs performs best on ReCoRD, while using WordNet alone is a better choice on SQuAD1.1.
Results on test sets further demonstrate the superiority of our approach. It significantly outperforms the former top leaderboard system by +2.52 EM/+2.78 F1 on ReCoRD. And on SQuAD1.1,

Case Study
This section provides a case study, using the motivating example described in Fig. 1, to vividly show the effectiveness of KT-NET, and make a direct comparison with BERT. For both methods, we use the optimal configurations that offer their respective best performance on ReCoRD (where the example comes from).

Relevant Knowledge Selection
We first explore how KT-NET can adaptively select the most relevant knowledge w.r.t. the reading text. Recall that given a token $s_i$, the relevance of a retrieved KB concept $c_j$ is measured by the attention weight $\alpha_{ij}$ (Eq. (1)), according to which we can pick the most relevant KB concepts for this token. Fig. 3(a) (left) presents 4 tokens from the question/passage, each associated with the top 3 most relevant concepts from NELL or WordNet. As we can see, these attention distributions are quite meaningful, with "US" and "UN" attending mainly to the NELL concepts of "geopoliticalorganization" and "nongovorganization", respectively, "ban" mainly to the WordNet synset "forbidding NN 1", and "sanction" almost uniformly to the three highly relevant synsets.
Question/Passage Representations. We further examine how such knowledge will affect the final representations learned for the question/passage. We consider all sentences listed in Fig. 1, and content words (nouns, verbs, adjectives, and adverbs) therein. For each word $s_i$, we take its final representation $\mathbf{o}_i$, obtained right before the output layer. Then we calculate the cosine similarity $\cos(\mathbf{o}_i, \mathbf{o}_j)$ between each question word $s_i$ and passage word $s_j$. The resultant similarity matrices are visualized in Fig. 3(a) and Fig. 3(b) (heat maps), obtained by KT-NET and BERT, respectively.
For BERT (Fig. 3(b)), given any passage word, all question words tend to have similar similarities to the given word, e.g., all the words in the question have a low degree of similarity to the passage word "US", while a relatively high degree of similarity to "repealed". This phenomenon indicates that after fine-tuning on the MRC task, BERT tends to learn similar representations for question words, all of which approximately express the meaning of the whole question and are hard to distinguish.
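The similarity analysis described above amounts to computing a question-by-passage cosine similarity matrix over the final token representations. A minimal sketch, with random vectors standing in for the learned representations and a hypothetical function name, could look like this:

```python
import numpy as np


def cosine_similarity_matrix(Q, P):
    """Q: (n_q, d) question word vectors; P: (n_p, d) passage word vectors.

    Returns an (n_q, n_p) matrix whose (i, j) entry is cos(q_i, p_j).
    """
    Q_unit = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    P_unit = P / np.linalg.norm(P, axis=1, keepdims=True)
    return Q_unit @ P_unit.T


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 16))   # stand-ins for final question word representations
P = rng.normal(size=(5, 16))   # stand-ins for final passage word representations
sim = cosine_similarity_matrix(Q, P)
```

Each row of `sim` corresponds to one question word; rows that look nearly identical would reflect the behaviour reported for BERT, where question words are hard to distinguish.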
For KT-NET (Fig. 3(a)), by contrast, different question words can exhibit diverse similarities to a passage word, and these similarities closely reflect their relationships encoded in KBs. For example, we can observe relatively high similarities between: (i) "administration" and "government", which share the same synset, (ii) "ban" and "sanctions", which have a common hypernym, and (iii) "sponsor" and "support", where a synset of the former has the relation "derivationally related form" with the latter, all in WordNet. This phenomenon indicates that after integrating knowledge, KT-NET can learn more accurate representations which enable better question-passage matching.

Final Answer Prediction
Fig. 3(a) and Fig. 3(b) (line charts) list the probability of each word to be the start/end of the answer, predicted by KT-NET and BERT, respectively. BERT mistakenly predicts the answer as "UN Security Council", but our method successfully gets the correct answer "US".
We observed similar phenomena on SQuAD1.1 and report the results in Appendix B.

Related Work
More recently, LMs such as ELMo (Peters et al., 2018a), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018) have been devised. They pre-train deep LMs on large-scale unlabeled corpora to obtain contextual representations of text. When used in downstream tasks including MRC, the pre-trained contextual representations greatly improve the performance in either a fine-tuning or feature-based way. Built upon pre-trained LMs, our work further explores the potential of incorporating structured knowledge from KBs, combining the strengths of both text and knowledge representations.
Incorporating KBs. Several MRC datasets that require external knowledge have been proposed, such as ReCoRD (Zhang et al., 2018), ARC, MCScript (Ostermann et al., 2018), OpenBookQA, and CommonsenseQA (Talmor et al., 2018). ReCoRD can be viewed as an extractive MRC dataset, while the latter four are multi-choice MRC datasets that are smaller than ReCoRD. In this paper, we focus on the extractive MRC task. Hence, we choose ReCoRD and SQuAD in the experiments.
Some previous work attempts to leverage structured knowledge from KBs to deal with the tasks of MRC and QA. Weissenborn et al. (2017), Bauer et al. (2018), Mihaylov and Frank (2018), and Pan et al. (2019) follow a retrieve-then-encode paradigm, i.e., they first retrieve relevant knowledge from KBs, and only the retrieved knowledge relevant locally to the reading text will be encoded and integrated. By contrast, we leverage pre-trained KB embeddings which encode whole KBs. Then we use attention mechanisms to select and integrate knowledge that is relevant locally to the reading text. Zhong et al. (2018) try to leverage pre-trained KB embeddings to solve the multi-choice MRC task. However, the knowledge and text modules are not integrated, but used independently to predict the answer. Moreover, the model cannot be applied to extractive MRC.

Conclusion
This paper introduces KT-NET for MRC, which enhances BERT with structured knowledge from KBs and combines the merits of both. We use two KBs: WordNet and NELL. We learn embeddings for the two KBs, select desired embeddings from them, and fuse the selected embeddings with BERT hidden states, so as to enable context- and knowledge-aware predictions. Our model achieves significant improvements over previous methods, becoming the best single model on the ReCoRD and SQuAD1.1 benchmarks. This work demonstrates the feasibility of further enhancing advanced LMs with knowledge from KBs, which indicates a potential direction for future research.

A Motivating Example from SQuAD1.1
We provide a motivating example from SQuAD1.1 to show the importance and necessity of integrating background knowledge. We restrict ourselves to knowledge from WordNet, which offers the best performance on this dataset according to our experimental results (Table 3). Fig. 4 presents the example. The passage states that the congress aimed to formalize a unified front in trade and negotiations with various Indians, but the plan was never ratified by the colonial legislatures nor approved of by the crown. And the question asks whether the plan was formalized. BERT fails on this case by spuriously matching the two occurrences of "formalize" in the passage and question. But after introducing the word knowledge "ratified is a hypernym of formalized" and "approved has a common hypernym with formalized", we can successfully predict that the correct answer is "never ratified by the colonial legislatures nor approved of by the crown".
Fig. 4:
Passage: [...] The goal of the congress was to formalize a unified front in trade and negotiations with various Indians, since allegiance of the various tribes and nations was seen to be pivotal in the success in the war that was unfolding. The plan that the delegates agreed to was never ratified by the colonial legislatures nor approved of by the crown. [...]
Question: Was the plan formalized?
Original BERT prediction: formalize a unified front in trade and negotiations with various Indians
Prediction with background knowledge: never ratified by the colonial legislatures nor approved of by the crown
Background knowledge: (ratified, hypernym-of, formalized), (approved, common-hypernym-with, formalized)

B Case Study on SQuAD1.1
We further provide a case study, using the above example, to vividly show the effectiveness of our method KT-NET, and make a direct comparison with BERT. We use the same analytical strategy as described in § 4. For both KT-NET and BERT, we use the optimal configurations that offer their respective best performance on SQuAD1.1 (where the example comes from).

Relevant Knowledge Selection
We first explore how KT-NET can adaptively select the most relevant knowledge w.r.t. the reading text. Fig. 5(a) (left) presents 3 words from the question/passage, each associated with the top 3 most relevant synsets from WordNet. Here the relevance of synset $c_j$ to word $s_i$ is measured by the attention weight $\alpha_{ij}$ (Eq. (1)). As we can see, these attention distributions are quite meaningful, with "ratified" attending mainly to the WordNet synset "sign VB 2", "formalized" mainly to the synset "formalize VB 1", and "approved" mainly to the synsets "approve VB 2" and "sanction VB 1".
Question/Passage Representations. We further examine how such knowledge will affect the final representations learned for the question/passage. We consider all sentences listed in Fig. 4, and content words (nouns, verbs, adjectives, and adverbs) therein. For each word $s_i$, we take its final representation [...]
For BERT (Fig. 5(b)), we observe very similar patterns as in the ReCoRD example (§ 4). Given any passage word, all question words tend to have similar similarities to the given word, e.g., all the words in the question have a low degree of similarity to the passage word "never", while a relatively high degree of similarity to "various". This phenomenon indicates, again, that after fine-tuning on the MRC task, BERT tends to learn similar representations for question words, all of which approximately express the meaning of the whole question and are hard to distinguish.
For KT-NET (Fig. 5(a)), although the similarities between question and passage words are generally higher, these similarities still closely reflect their relationships encoded in KBs. For example, we can observe relatively high similarities between: (i) "formalized" and "ratified", where the latter is a hypernym of the former; (ii) "formalized" and "approved", which share a common hypernym in WordNet. This phenomenon indicates, again, that after integrating knowledge, KT-NET can learn more accurate representations which enable better question-passage matching.

Final Answer Prediction
With the learned representations, predicting final answers is a natural next step. Fig. 5(a) and Fig. 5(b) (line charts) list the probability of each word to be the start/end of the answer, predicted by KT-NET and BERT, respectively. As we can see, BERT mistakenly predicts the answer as "formalize a unified front in trade and negotiations with various Indians", but our method successfully gets the correct answer "never ratified by the colonial legislatures nor approved of by the crown".
The phenomena observed here are quite similar to those observed in the ReCoRD example, both demonstrating the effectiveness of our method and its superiority over BERT.
Footnote 10: During visualization, we take the averaged cosine similarity if word $s_i$ or word $s_j$ has subwords. And we use a row-wise softmax operation to normalize similarity scores over all passage tokens.