Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning

In this work, we aim to equip pre-trained language models with structured knowledge. We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme that exploits relational knowledge underlying the text. This is fulfilled by using a linked knowledge graph to select informative entities and then masking their mentions. In addition, we use knowledge graphs to obtain distractors for the masked entities, and propose a novel distractor-suppressed ranking objective that is optimized jointly with the masked language model. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training, to inject language models with structured knowledge via learning from raw text. It is more efficient than retrieval-based methods that perform entity linking and integration during finetuning and inference, and generalizes more effectively than methods that directly learn from concatenated graph triples. Experiments show that our proposed model achieves improved performance on five benchmark datasets, including question answering and knowledge base completion tasks.


INTRODUCTION
Self-supervised pre-trained language models (LMs) like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) learn powerful contextualized representations. With task-specific modules and finetuning, they have achieved state-of-the-art results on a wide range of natural language processing tasks. Nevertheless, open questions remain about what these models have learned, and improvements can be made along several directions. One such direction: when downstream task performance depends on structured relational knowledge (the kind modeled by knowledge graphs (KGs)), directly finetuning a pre-trained LM often yields sub-optimal results, even though some works (Petroni et al., 2019; Davison et al., 2019) show that pre-trained LMs are partially equipped with such knowledge.
To address this shortcoming, several recent works attempt to integrate KGs into pre-trained LMs. These approaches can be coarsely categorized into two classes, as shown in Figure 1. The first line of methods retrieves a KG subgraph (Liu et al., 2019a; Lin et al., 2019; Lv et al., 2019) and/or pre-trained graph embeddings (Zhang et al., 2019; Peters et al., 2019) via entity linking during both training and inference on downstream tasks. While these methods inject domain-specific knowledge directly into language representations, they rely heavily on the performance of the linking algorithm and/or the quality of the graph embeddings. Graph embeddings, to remain tractable over large-scale KGs, are often learned with shallow models (e.g., TransE (Bordes et al., 2013)) of limited expressive power. Besides, the linking and retrieval invoked during both finetuning and inference are costly, limiting these methods' practicality.
The second class of methods (e.g., Yao et al., 2019) uses contextualized representations from pre-trained LMs to enrich graph embeddings and thus alleviate graph sparsity issues. This is especially helpful for commonsense KGs (e.g., ConceptNet (Speer et al., 2017)), which consist of non-canonicalized text and hence suffer from severe sparsity. Specifically, these methods usually feed concatenated triples (e.g., [HEAD, Relation, TAIL]) into LMs for training or finetuning. The drawback is that focusing on knowledge base completion tends to over-adapt the models to this specific task, at the cost of generalization.
In this work, we equip masked language models (MLMs), e.g. BERT, with structured knowledge via self-supervised pre-training on raw text. Compared to the first class, we expose LMs to structured information only during pre-training, thus circumventing costly knowledge retrieval and integration in finetuning and inference; the dependency on the performance of the linking algorithm is also greatly reduced. Compared to the second class, we learn from free-form text through MLMs rather than triples, which fosters generalization to other downstream tasks.
Specifically, given a corpus of raw text and a KG, two KG-guided self-supervision tasks are formulated to inject structured knowledge into MLMs. First, taking inspiration from Baidu-ERNIE (Sun et al., 2019a), we reformulate the masked language modeling objective to an entity-level masking strategy, where entities are identified by linking their text mentions to concepts/phrases in a commonsense KG or named entities in an ontological KG (Bollacker et al., 2008). The role of KG here is to provide a "vocabulary" of entities to be masked. To further exploit implicit relational and logical information underlying raw text, we design a KG-guided masking scheme that selects informative entities by considering both document frequency and mutual reachability of the entities detected in the text. In addition to the new entity-level MLM task above, a novel distractor-suppressed ranking task is proposed. Negative entity samples are derived from the KG and used as distractors for the masked entities to make the learning more effective.
Note that our approach never observes the KG directly, through triples or any other form. Rather, the KG plays a guiding role in our proposed tasks. Its guidance helps the model exploit the text corpus more effectively, as verified in the experiments. If a downstream task can benefit from explicit exposure to the KG, the method of Davison et al. (2019) can be used to transform KG triples into natural, grammatical text that can be passed into our model.
We evaluate our method on five benchmarks, including question answering and knowledge base completion (KBC) tasks. Results show our method achieves state-of-the-art or competitive performance on all benchmarks, followed by analyses on the effects of various modeling choices.

VANILLA MASKED LANGUAGE MODEL
To ground our approach, this section summarizes MLMs for pre-training bidirectional Transformers (Devlin et al., 2019). Compared to causal LMs (Peters et al., 2018) trained unidirectionally, MLMs randomly mask some tokens and predict the masked tokens by considering their context on both sides. Formally, given a piece of text U, a tokenizer, e.g. BPE (Sennrich et al., 2016), is used to produce a sequence of tokens [w_1, ..., w_n]. A certain percentage of the original tokens are then masked and replaced: of those, 80% with the special token [MASK], 10% with a token sampled from the vocabulary V, and the remaining 10% kept unchanged. The masked sequence, denoted as [w_1^{(m)}, ..., w_n^{(m)}], is passed into a Transformer encoder to produce contextual representations for the sequence:

H = \mathrm{TransformerEnc}\big([w_1^{(m)}, \dots, w_n^{(m)}]\big),   (1)

where H \in \mathbb{R}^{d_h \times n} and d_h denotes the hidden size. The training loss \mathcal{L}_M for the MLM task is defined as

\mathcal{L}_M = -\sum_{i \in \mathcal{M}} \log P(w_i \mid H_{:,i}),   (2)

where \mathcal{M} denotes the set of masked token indices, and P(w_i | H_{:,i}) is the probability of predicting the original token w_i given the representations computed from the masked token sequence.
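The 80%/10%/10% replacement rule above can be sketched in plain Python (a minimal illustration; function and variable names are ours, not from the paper):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking: of the selected positions, 80% become [MASK],
    10% become a random vocabulary token, and 10% stay unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token (the prediction targets)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for masking
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = mask_token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)
        # else: keep the original token unchanged
    return masked, targets
```

The loss in Eq.(2) is then computed only over the positions stored in `targets`.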

GRAPH-GUIDED MASKED LANGUAGE MODEL
This section begins with a description of entity-level masked language modeling. Section 3.2 proposes a KG-guided entity masking scheme for the entity-level MLM task. A novel distractor-suppressed ranking task is presented in Section 3.3, which operates on the masked entities and their negative entity samples from the KG. We use a multi-task learning objective combining the two tasks above to jointly train our proposed Graph-guided Masked Language Model (GLM). An illustration of the GLM is shown in Figure 2.

ENTITY-LEVEL MASKED LANGUAGE MODEL
As aforementioned, directly training an MLM with graph triples learns structured knowledge at the cost of the model's generalization to tasks involving natural text such as question answering. Inspired by distantly supervised relation extraction (Mintz et al., 2009) which assumes that any sentence containing two entities can be used to express the relation between these two entities in a KG, we argue that it is possible for an MLM to learn structured knowledge from raw text if guided properly by a KG.
Roughly speaking, we take detected entity mentions as masking candidates, where an entity can be a concept/phrase in a commonsense KG or a named entity in an ontological KG. The intuition is that mentions in text often represent knowledge-grounded, semantically meaningful text spans. Formally, we first use a KG to provide a vocabulary of entities for building an entity linking system. We then detect all entity mentions appearing in a piece of text U from a corpus. This leads to a set of linked entities E = {e_1, e_2, ...} ≜ {e | e ∈ KG ∧ e ∈ U}, with C_{e_j} being the corresponding token indices in U for each entity mention e_j.
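A minimal sketch of the mention-detection step, assuming a simple greedy longest-match against the KG's entity vocabulary (the paper's actual linker uses an inverted index with lemma-based fuzzy matching, described in Appendix A; the function and argument names here are illustrative):

```python
def detect_mentions(tokens, kg_vocab, max_len=4):
    """Greedy longest-match linking of token spans to a KG entity vocabulary.
    Returns a dict entity -> list of token-index spans, i.e. the sets E and
    C_{e_j} described in the text."""
    mentions = {}
    i = 0
    while i < len(tokens):
        # try the longest span first so multi-word entities win over sub-spans
        for span_len in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + span_len]).lower()
            if candidate in kg_vocab:
                mentions.setdefault(candidate, []).append(
                    list(range(i, i + span_len)))
                i += span_len
                break
        else:
            i += 1  # no entity starts here
    return mentions
```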
The idea of entity-level masking is not new; for example, Sun et al. (2019a), among others, randomly mask entity candidates under a uniform distribution when training MLMs. We take this idea further by building our masking scheme with the guidance of a KG, as explained below.

KG-GUIDED ENTITY MASKING SCHEME
In this section, we develop a new entity masking scheme to facilitate structured knowledge learning for MLMs. It explores implicit relational information underlying raw text under the guidance of a KG, and is shown to mask more informative entities than the previous random approach.
In particular, the scheme is designed to avoid or reduce masking two types of entities: trivial and undeducible. Trivial entities, such as have been and what do in ConceptNet, are ubiquitous in corpora, since they are used to compose sentences; however, they carry little semantic content and function similarly to stop words. On the other hand, an undeducible entity is one that can hardly be reached from any other entity detected in the same text within a certain number of hops over the linked KG. Examples include general modifiers and ambiguous entity linking results, shown in green in Figure 2.
Given a masking budget (e.g., 20% of total tokens in our setting), we sample token spans iteratively as follows until the budget is reached: 1) 20% of the time we sample a random token span under a geometric distribution with p = 0.2, and 2) 80% of the time we sample an entity mention from the candidates detected in §3.1. The probability of sampling an entity mention C_{e_j} is defined as

P(C_{e_j}) \propto \mathbb{I}\{\mathrm{DF}(e_j) < R_{thresh}\} \cdot \big[\,|\mathrm{Nb}(e_j)|\,\big]_{R_{min}}^{R_{max}},   (3)

where Nb(e) ≜ {e' | PLen(e ↔ e') < R_{hop} ∧ e' ∈ E}. The term DF(·) denotes document frequency, PLen(e ↔ e') is the length of the shortest undirected path between the two entities, |·| denotes set size, and [x]_a^b ≜ max(a, min(x, b)).
Note that the first part in Eq.(3) is designed to eliminate trivial entities that appear frequently. The second part measures whether an entity can be reached from other entities detected in the same text within R_{hop} hops, and assigns a higher sampling weight to an entity (e.g., criminal in Figure 2) that can more easily be inferred from the others. By guiding the model to favor masking deducible but non-trivial entities, this scheme helps the MLM ingest relational knowledge during representation learning. R_{hop}, R_{thresh}, R_{min}, and R_{max} are hyperparameters that trade off between trivial and undeducible entities.
Finally, it is worth noting that frequently appearing entities excluded via the indicator I{·} in Eq.(3) can still be masked via the 20% random-span masking budget, but now with much smaller probability.
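Under our reading of Eq.(3), the sampling weight combines a document-frequency indicator with a clamped count of reachable neighbors; the sketch below illustrates this (the helper names, the `path_len` lookup, and the default thresholds are our assumptions, not the paper's implementation):

```python
def masking_weights(entities, doc_freq, path_len, r_hop=3, r_thresh=100,
                    r_min=1.0, r_max=2.0):
    """Sampling weight per detected entity: suppress trivial (high
    document-frequency) entities via an indicator, and weight deducible
    entities by their clamped neighbour count.  `path_len[(a, b)]` holds the
    shortest undirected KG path length between entities a and b."""
    weights = {}
    for e in entities:
        trivial = doc_freq.get(e, 0) >= r_thresh
        # neighbours: other detected entities reachable within r_hop hops
        nb = sum(1 for o in entities
                 if o != e and path_len.get((e, o), float("inf")) < r_hop)
        clamped = max(r_min, min(float(nb), r_max))
        weights[e] = 0.0 if trivial else clamped
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()} if total else weights
```

An entity with many nearby detected entities (deducible) gets weight up to `r_max`, while a stop-word-like entity is zeroed out entirely.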

DISTRACTOR-SUPPRESSED RANKING TASK
Empowered by the informative entity-level masks, it is natural to extend the MLM with "negative" entities sampled from the KG, by treating the masked entities as "positive". It has been shown that negative sampling is especially useful for structured knowledge learning in graph embedding approaches (Sun et al., 2019c;Cai and Wang, 2018), but how to effectively integrate negative samples from KGs into MLMs remains open.
Recently, Ye et al. (2019) proposed masking one entity mention in a sentence and then formulating a multiple-choice QA task for structured knowledge learning, treating the masked sentence as the question and the masked entity plus its negative samples as answer candidates. However, this model does not fit the MLM setting well, since only one entity can be masked per text.
Here we propose a distractor-suppressed ranking objective that operates on each pair of a masked entity from §3.2 and its negative sample from the KG. The negative sample can be viewed as a distractor. We use a Transformer encoder to separately produce the embeddings of positive and negative entities from their associated node contents in the KG. We then contrast the positive and negative entity embeddings, u and u', against the masked entity mention's contextual representation, v, using vector similarity as plausibility scores for both entities.
Specifically, given a set of masked entities from §3.2, E_p = {e_1, ..., e_m} ⊆ E, with corresponding entity mentions C_{e_j}, we gather the contextual representation for each masked entity mention by mean-pooling over the representations of its composite tokens, where the representations are generated by the Transformer encoder of the MLM:

v_j = \frac{1}{|C_{e_j}|} \sum_{i \in C_{e_j}} H_{:,i}.   (4)

Here v_j is the resulting contextual representation for e_j. Since each entity's original mention is invisible to the encoder, v_j is rich in contextual features.
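The mean-pooling step is straightforward; a NumPy sketch (here `H` stands in for the d_h × n encoder output and `span` for the mention's token indices C_{e_j}):

```python
import numpy as np

def entity_mention_repr(H, span):
    """Contextual representation v_j of a masked entity mention:
    mean-pool the encoder outputs H (shape d_h x n) over the mention's
    token indices."""
    return H[:, span].mean(axis=1)
```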
We then sample negative sample(s) from the KG for each e_j ∈ E_p and derive a set of positive-negative entity pairs {(e_j, e'_j)}_{j=1}^m. In particular, given a positive entity e_j, the sampling method randomly selects an entity e'_j from the KG as a negative sample. The sampling favors sibling entities sharing the same relation, whose sample weights are twice those of the others; this is similar to Ye et al. (2019) and aims to provide strong distractors. Then, another Transformer encoder, parameter-tied with the MLM in §3.2 but using distinct position embeddings, separately encodes the positive and negative entities. To signal that entity text comes from a KG node rather than natural text, we append a special token to the entity text, i.e., text_j = [CLS] + e_j + [ENT]. We pass text_j into the encoder to obtain the entity embedding for e_j, i.e., u_j = Pool(TransformerEnc(text_j)).
Here, Pool(·) denotes collecting the contextualized embedding from the [CLS] token, as in Devlin et al. (2019). The resulting u_j ∈ R^{d_h} is an LM-augmented entity embedding for e_j. We apply the same process to e'_j to obtain the negative entity embedding u'_j.
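The sibling-weighted negative sampling described above might be sketched as follows (illustrative names; `siblings` is assumed to hold the KG entities sharing a relation with the positive entity):

```python
import random

def sample_negative(pos_entity, kg_entities, siblings, rng=None):
    """Sample one distractor for a positive entity.  Sibling entities
    (those sharing a relation with the positive) get twice the sampling
    weight of other KG entities."""
    rng = rng or random.Random(0)
    candidates = [e for e in kg_entities if e != pos_entity]
    weights = [2.0 if e in siblings else 1.0 for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```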
The procedure above yields a set of tuples {(v_j, u_j, u'_j)}_{j=1}^m. Finally, a BiLinear layer (abbrv. BiLnr) is used as a parameterized metric to calculate a similarity score between v_j and u_j (or u'_j):

s_j = \mathrm{BiLnr}(v_j, u_j), \quad s'_j = \mathrm{BiLnr}(v_j, u'_j),   (6)

where s_j and s'_j are the scores for the positive and negative entities, respectively. The two BiLinear layers in Eq.(6) are parameter-tied. We then use a margin-based hinge loss to train the MLM with the formulated pairwise ranking task, i.e.,

\mathcal{L}_R = \sum_{j=1}^{m} \max(0,\; \lambda - s_j + s'_j),   (7)

where the margin λ is a hyperparameter.
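A sketch of the pairwise scoring and hinge loss, with the BiLinear layer simplified to a fixed weight matrix W (in the actual model W is learned and the computation is batched; names are ours):

```python
import numpy as np

def bilinear_score(v, u, W):
    """Parameterized similarity s = v^T W u (the BiLnr metric)."""
    return float(v @ W @ u)

def ranking_loss(v, u_pos, u_neg, W, margin=1.0):
    """Margin-based hinge loss: penalize whenever the distractor scores
    within `margin` of the positive entity."""
    s_pos = bilinear_score(v, u_pos, W)
    s_neg = bilinear_score(v, u_neg, W)
    return max(0.0, margin - s_pos + s_neg)
```

When the positive entity already out-scores the distractor by at least the margin, the loss is zero and no gradient flows from this pair.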
The proposed distractor-suppressed ranking task has several nice properties. First, only a lightweight BiLinear layer is used to compute the score. Second, training to distinguish positive from negative samples may make the model more effective: intuitively, two neighboring entities in a graph are often assigned similar distributed representations yet behave differently in subtle contexts, and this task helps discriminate between them. Finally, in contrast to the work of Ye et al. (2019), ours is fully compatible with the entity-level MLM training task.

Table 1: Summary statistics of five benchmarks. The first two are multiple-choice question answering tasks; the rest include one link prediction and two triple classification tasks.
The final loss function for our model is defined as a combination of the entity-level MLM loss \mathcal{L}_M and the distractor-suppressed ranking loss \mathcal{L}_R, with the latter weighted by a hyperparameter γ:

\mathcal{L} = \mathcal{L}_M + \gamma \mathcal{L}_R.   (8)

COMPARISON TO PRIOR ENTITY-LEVEL MLMS
Our work differs from prior entity-level MLMs, including SpanBERT (Joshi et al., 2019) and Baidu-ERNIE (Sun et al., 2019a;b), in several ways. While the motivation of previous work is to move beyond the token to another text unit, our method looks for ways to introduce structured knowledge from KGs into language models. In prior works, named entities are recognized via NLP toolkits and simply masked at random, so relational knowledge is unlikely to exist among them. In contrast, entities in GLM are linked to a supporting knowledge graph, and the masking takes into consideration how an entity interacts with its neighbors in the KG. The same holds for the modeling objective: previously proposed objectives, such as the span boundary objective (Joshi et al., 2019), aim at learning text semantics, as in traditional MLM objectives. By exploiting relational knowledge among the recognized entities, we end up with a ranking task that is specially designed for the proposed entity-level MLM to directly acquire structured information.

TRAINING SETUP
In this work we focus on non-canonicalized commonsense KGs, specifically ConceptNet, although the proposed approach is also applicable to ontological KGs such as Freebase. For training efficiency we use two relatively small free-form corpora. One is the Open Mind Common Sense (OMCS) raw corpus, consisting of 800K short sentences. The other is the ARC corpus, containing 14M unordered, science-related sentences. Both corpora are parsed to have their entities linked to ConceptNet using an inverted index built with fuzzy matching (Appx. A).
For downstream tasks, CommonsenseQA (Talmor et al., 2019) and SocialIQA (Sap et al., 2019b) are used to evaluate GLM's performance on natural question answering (QA) tasks. We also experiment with three knowledge base completion (KBC) tasks: WN18RR (Dettmers et al., 2018), WN11 (Bordes et al., 2013), and commonsense knowledge base completion (Li et al., 2016), to assess whether the proposed approach can benefit graph-related tasks. We do not use other benchmarks like FB15k-237 (Dettmers et al., 2018) because they are derived from ontological KGs, and we do not use WN18 (Socher et al., 2013) because it suffers from the "informative value" problem (Dettmers et al., 2018). Statistics of these benchmarks are summarized in Table 1. It is worth mentioning that, although WordNet (Miller, 1998) is included in ConceptNet, the triples in ConceptNet are never used during GLM training, only raw text from OMCS (a standalone source within ConceptNet, independent of WordNet), so the relation labels in WordNet are never seen by the GLM.
For efficiency, we initialize GLM with either BERT or RoBERTa rather than training from scratch, matching the corresponding baseline model (whether it uses BERT or RoBERTa) in each downstream task for fair comparison. In practice, GLM can be initialized with any state-of-the-art pre-trained bidirectional language model.
More training setups are detailed in Appendix A.

QUESTION ANSWERING TASK EVALUATION
CommonsenseQA. Table 2 reports test results from the leaderboard and from our approach. A brief introduction to each approach without a reference can be found on the leaderboard. Compared to the previous best model, RoBERTa+KE, which is likewise trained with an extra in-domain corpus (i.e., OMCS) and uses retrieval during finetuning, our approach achieves a 0.8% absolute improvement, delivering a new state-of-the-art result. In addition, GLM based on RoBERTa-large outperforms its respective baseline, RoBERTa-large, by 2.0%.
Note that approaches using information retrieval (e.g., RoBERTa+KE) must retrieve from Wikipedia during finetuning and inference, which increases computational overhead significantly. In contrast, approaches based on additional self-supervised pre-training are more efficient, but often achieve sub-optimal performance since they lack explicitly retrieved, targeted context. The proposed GLM falls into the latter, high-efficiency group while still outperforming IR-based approaches.
Some prior works (e.g., RoBERTa+CSPT) find that directly finetuning on triples from a KG can hurt performance. This evidence empirically supports our hypothesis that finetuning on triples can over-adapt a model to graph-based tasks and limit its generalization to other tasks.
Our approach is not directly comparable to Lv et al. (2019) and Lin et al. (2019), the first of which achieves 75.3% accuracy. This is because, during finetuning and inference, those methods explicitly find a path from the question to the answer concept in ConceptNet. This helps filter human-generated distractor answers, since those never appear in ConceptNet. In addition, our method never uses ConceptNet during finetuning and only observes a small subgraph of ConceptNet (about 30%∼40% of linked concepts, without relations) during pre-training.

Table 2: CommonsenseQA accuracy (%), Dev / Test (excerpt):
GPT (Radford et al., 2018): 63.3 / 63.0
BERT-base (Devlin et al., 2019): 63.3 / 63.1
BERT-large (Devlin et al., 2019): 66.0 / 64.5
RoBERTa-large (Liu et al., 2019b): 78.

SocialIQA. The dataset is built upon the ATOMIC knowledge graph (Sap et al., 2019a) and focuses on reasoning about people's actions and their social implications. It thus serves as an out-of-domain evaluation task for GLM trained using ConceptNet. Similar to CommonsenseQA, this task is formulated as a multiple-choice QA problem. The evaluation results listed in Table 3 demonstrate that our approach also achieves state-of-the-art performance on this out-of-domain dataset.

GRAPH-RELATED TASK EVALUATION
For this set of tasks we follow KG-BERT (Yao et al., 2019), which finetunes a BERT encoder over the concatenation of a triple's head, relation, and tail, followed by an MLP that computes a confidence score of whether the triple is plausible. Since KG-BERT uses the BERT-base model, for fair comparison we train a GLM from BERT-base, denoted "GLM (BERT-base)", and finetune it on the KBC datasets.
WordNet Knowledge Base Completion. Table 4 lists test results for the WN18RR link prediction task. GLM outperforms KG-BERT by a large margin and sets a new state-of-the-art result. GLM is superior to translation-based graph embedding models (e.g., RotatE and TransE), graph convolutional networks (e.g., R-GCN), and convolution-based methods (e.g., ConvKB and ConvE).
Test results for the WN11 triple classification task are listed in Table 5. Consistent with the results on WN18RR, our approach outperforms translation-based graph embedding models and convolution-based methods, and improves state-of-the-art accuracy by 0.5%.
Commonsense Knowledge Base Completion (CKBC). Finally, we evaluate our approach on the CKBC task, which should directly benefit from commonsense knowledge. Since CKBC is derived from the OMCS corpus, for fair comparison we provide the baseline model with equivalent training data ("+ Data" in Table 6). In addition, we remove raw sentences belonging to CKBC's test set from our GLM training corpora to avoid data leakage. Table 6 shows that our approach outperforms the KG-BERT baseline even when the latter is given equivalent data (increasing training triples from 100K to 600K). There are two possible reasons why performance actually drops with more data: 1) the training triples are sorted w.r.t. annotated confidence, so the additional triples are of lower quality and may introduce noise; and 2) more negative sampling must be done with more training triples, which introduces more false negatives (from 1.25% to 2.42%).

ZERO-SHOT EVALUATION ON CKBC
To explore whether GLM can indeed learn structured information from raw text, we conduct a zero-shot evaluation on CKBC following Davison et al. (2019). For fair comparison, we re-train the GLM (BERT-large) on a new corpus from which all raw texts containing the CKBC test pairs are discarded. We re-implement coherency ranking and PMI estimation (Davison et al., 2019), with results shown in Table 7. Our augmented pre-trained language model significantly outperforms its baseline, which demonstrates our model's capability of retaining structured information in a language model.

ABLATION STUDY
To systematically evaluate the effectiveness of each component of the proposed approach, we conduct an ablation study (Table 8) by pre-training language models with different setups and then finetuning on the CommonsenseQA task.
When the KG-guided entity masking introduced in §3.2 is replaced with random entity masking during GLM pre-training, a 0.4% accuracy drop is observed when the model is subsequently evaluated on the CommonsenseQA dev set. If we instead set γ in Eq.(8) to zero when pre-training GLM, this leads to a 0.9% accuracy drop. When both entity-level masking and distractor-suppressed ranking are removed, the setting becomes equivalent to continual pre-training of BERT-large on our corpora.

When we replace distractor-suppressed ranking with the span boundary objective (SBO), a significant performance decrease (-2.2%) is observed. Further replacing our entity-level masking with random span masking, however, only loses 0.8% in accuracy. It is worth noting that the latter setup is equivalent to continual pre-training with SpanBERT (Joshi et al., 2019) on our corpora. In line with SpanBERT's conclusion that the performance of linguistic masking is not consistent, our KG-guided entity-level MLM (i.e., GLM w/o the ranking task) is worse than random span with SBO (68.1% vs. 68.2%). This suggests that the masking scheme and the learning objective need to be paired, and that linguistic masking can be useful when equipped with an appropriate learning objective (e.g., our distractor-suppressed ranking task) during pre-training.

ANALYSIS OF TRAINING & MASKING SCHEMES
Given the same dev set of texts with the same masked tokens, Figure 3 compares different learning and masking schemes by plotting L_M (left) and L_R (right), defined in Eq.(8), w.r.t. training steps. We observe that our KG-guided entity masking is more efficient than three other masking schemes: random entity masking (Sun et al., 2019a), random span masking (Joshi et al., 2019), and random whole-word masking (Devlin et al., 2019). Table 9 lists a few example sentences with masked tokens highlighted according to the corresponding masking scheme. Both KG-guided and random entity masking mask informative chunks, so long-term dependencies must be modeled in order to infer the masked tokens. In contrast, random span or word masking is likely to mask tokens that can be easily inferred from local context, a much simpler task. Furthermore, our KG-guided entity masking tends to select more informative phrases than random entity masking.

RELATED WORK
Our work is related to Baidu-ERNIE (Sun et al., 2019a) and SpanBERT (Joshi et al., 2019), both of which extend token-level masking to the span level. For example, Baidu-ERNIE does so to improve the model's knowledge learning, using uniformly random masking for phrases and entities. A detailed comparison between our model and span-level pre-trained language models can be found in §3.4.

Table 9: Case study for different masking schemes, with masked text highlighted per scheme (underline: KG-guided entity masking; wavy underline: random entity masking; italic: random span masking; bold: random whole-word masking; text in parentheses is a negative sample for KG-guided entity masking):
1. something you need to do before you get up early is set an alarm ← (leave the office)
2. you would talk with someone far away because you want keep in touch ← (meet strangers)
3. if you want to drill a hole then you should carefully plan where you will drill it ← (cut of beef)

As briefly summarized in §1, existing methods for integrating knowledge into pre-trained LMs
can be coarsely categorized into two classes. For example, Peters et al. (2019) retrieve entities' embeddings according to the similarity between a Transformer's hidden states and pre-trained graph embeddings, then treat the retrieved embeddings as extra inputs to the next layer. In contrast, Bosselut et al. (2019) directly finetune a pre-trained LM on partially masked triples from a KG, aiming at commonsense KBC tasks. In addition, our work is related to the use of negative samples for effective learning (Cai and Wang, 2018).
Our work also differs from works that combine knowledge graphs with text information via joint embedding (e.g., Yamada et al., 2016). These usually use texts containing co-occurrences of entities to enrich graph embeddings that are specially designed for graph-related tasks. For example, Wang et al. (2014) embed entities from a KG and the entities' text contents in the same latent space, but disregard textual co-occurrences and textual relations in natural language corpora. Further accounting for shared sub-structure among textual relations in a large-scale corpus, Toutanova et al. (2015) apply a CNN to the lexicalized dependency paths of a textual relation to obtain an augmented relation representation, which can be fed into any prior graph embedding approach for enhanced KBC performance. We share similar inspirations in utilizing texts containing entity co-occurrences and embedding entities' text contents into a latent space. But beyond shallow joint embeddings, our work takes advantage of pre-trained MLMs and equips them with structured knowledge via two self-supervised objectives built upon raw text; hence it can produce generic text representations that benefit various downstream tasks.

CONCLUSION AND DISCUSSION
In this work, we aim to equip pre-trained LMs with structured knowledge through novel self-supervised tasks. Building upon entity-level MLMs, we propose an entity masking scheme guided by a KG. This method masks informative entity mentions and facilitates learning the structured knowledge underlying free-form text. In addition, we propose a distractor-suppressed ranking objective that uses negative samples from the KG as distractors for effective model training. Experiments show that finetuning our KG-guided pre-trained MLMs yields improved performance on relevant downstream tasks. In the future, we will use a combination of commonsense and ontological KGs, and large-scale corpora (e.g., Wikipedia or Common Crawl), to pre-train an MLM from scratch, which we expect to benefit a wide range of tasks.

A IMPLEMENTATION DETAILS
Pre-trained Transformer-based language models implemented by Huggingface are adapted for our models.
Training Hyperparameters. For continual pre-training, we do not tune hyperparameters due to the computational cost; only a limited set of hyperparameters was tried according to empirical intuition. Currently, R_{hop}/R_{min}/R_{max} in Eq.(3) are set to 3/1.0/2.0 respectively, and R_{thresh} is set to filter out entities in the top 5% of document frequency, thus varying with the corpus. We set λ in Eq.(7) to 1.0 and γ in Eq.(8) to 0.2. Continual pre-training runs for 5 epochs, with a batch size of 128, a learning rate of 3e-5/1e-5 (BERT/RoBERTa), a learning-rate warmup proportion of 10%/5% (BERT/RoBERTa), and weight decay of 0.01. For both BERT and RoBERTa, the max sequence length is set to 80 for sentence-level encoding and 20 for entity embedding. The masking proportion is lifted from 15% (as in BERT and RoBERTa) to 20% without tuning. The intuition is that, since our model is initialized from well-trained language models, a slightly larger masking proportion can accommodate entities with longer text spans and make learning more efficient.
During finetuning on downstream tasks, we conduct grid search for hyperparameters, including batch size, number of epochs/steps, learning rate, which are summarized in Table 10.
Details about KG-Guided Entity Masking. In addition to finding informative, non-trivial masks, the KG-guided entity masking method can be used to filter the corpus: if a piece of text (e.g., a sentence or passage) contains only trivial and undeducible entities, it will hardly contribute to model learning. With this strategy, most (∼90%) sentences in the ARC corpus were filtered out, leading to very efficient training. Training our model based on BERT-large or RoBERTa-large costs only about 1 day on a single V100 GPU with mixed-precision floats, for a total of about 70K steps.

KG-BERT Re-Implementation. In this work, we re-implemented KG-BERT (Yao et al., 2019) to improve the efficiency of test inference and negative sampling, as the original implementation is less optimized: it requires about 3 days for full link prediction inference on the test set, whereas our version reduces this to 20 hours (i.e., a 3× acceleration) on the same single GPU. We also made negative sampling about 3× faster than the original version. The modifications can be summarized as: 1) the ranking basis is changed from logits to probabilities; 2) mixed-precision floats are employed in place of single precision; and 3) negative sampling is re-implemented for efficiency. In addition, we found that in the standard evaluation scripts (Yao et al., 2019; Sun et al., 2019c), the model can successfully rank the candidates as long as it assigns the positive triple one of the largest scores. Two factors explain why the re-implemented KG-BERT achieves much better performance. First, mixed-precision normalized probabilities, which range only from 0 to 1, are used as the ranking basis rather than logits. Second, such encoder-plus-classifier paradigms for link prediction (e.g., KG-BERT) usually suffer from polysemy or ambiguity of entities, which leads to over-confident false positive predictions.
Hence, our re-implementation significantly improves Hits@1 for KG-BERT, while remaining a reasonable setting for verifying and comparing how much structural knowledge is retained in the pre-trained LMs. For a fair comparison with previous works, we also list the results that use logits as the ranking basis in Table 11.
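The change of ranking basis from logits to bounded probabilities can be sketched as follows. This is a minimal illustration, not the authors' code; the candidate triples and their logit values are hypothetical. Note that ranking by positive-class probability (a function of the logit difference) can order candidates differently than ranking by the raw positive logit alone:

```python
import math

def softmax_prob_positive(logits):
    """Convert a (negative-class, positive-class) logit pair into the
    probability of the positive class, bounded in [0, 1]."""
    neg, pos = logits
    m = max(neg, pos)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
    return e_pos / (e_neg + e_pos)

def rank_candidates(candidate_logits):
    """Rank candidate triples by positive-class probability, descending."""
    scored = [(name, softmax_prob_positive(lg)) for name, lg in candidate_logits]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical candidate triples with (negative, positive) class logits.
candidates = [
    ("(cat, IsA, animal)", (-1.0, 3.0)),
    ("(cat, IsA, vehicle)", (2.0, 0.5)),
    ("(cat, IsA, plant)", (0.5, -0.5)),
]
print([name for name, _ in rank_candidates(candidates)])
```

Here the raw positive logits alone would rank "vehicle" (0.5) above "plant" (−0.5), while the normalized probabilities rank "plant" higher, since its negative-class logit is also low.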
KBC Tasks Implementation. For the CKBC dataset, we find the performance is very poor when we directly finetune either BERT-base or our approach on the concatenated triples, i.e., "[CLS] HEAD [SEP] REL [SEP] TAIL [SEP]". Hence, we follow Davison et al. (2019) and transform the triples into natural language sentences, which are then used as input to finetune the pre-trained LMs. For WN18RR, we directly use the data processed by Yao et al. (2019), in which a description sentence is attached to each entity/phrase. For WN11, we follow KG-BERT and directly concatenate the triples before finetuning the pre-trained LMs.
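The triple-to-sentence transformation can be sketched with simple per-relation templates. The templates below are hypothetical placeholders for illustration; the actual templates follow Davison et al. (2019):

```python
# Hypothetical relation templates for rendering ConceptNet-style triples
# as natural language; the real templates follow Davison et al. (2019).
TEMPLATES = {
    "IsA": "{head} is a {tail}.",
    "UsedFor": "{head} is used for {tail}.",
    "AtLocation": "You are likely to find {head} in {tail}.",
}

def triple_to_sentence(head, rel, tail):
    """Render a (head, relation, tail) triple as a natural sentence,
    falling back to plain concatenation for unknown relations."""
    template = TEMPLATES.get(rel, "{head} {rel} {tail}.")
    return template.format(head=head, rel=rel, tail=tail)

# The rendered sentence, rather than "[CLS] HEAD [SEP] REL [SEP] TAIL [SEP]",
# is what gets fed to the pre-trained LM for finetuning.
print(triple_to_sentence("guitar", "UsedFor", "making music"))
```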
Knowledge Graph and Entity Linking. We aim to enhance pre-trained language models with commonsense structured knowledge, so we employ ConceptNet (Speer et al., 2017) as the backend knowledge graph. Since ConceptNet is a multilingual knowledge graph, we first filtered out all triples that include non-English items. In addition, we treated the KG as an undirected graph when determining entities' mutual reachability. As for entity linking, there are many mature entity linking systems for ontological or factoid KGs, such as S-MART (Yang and Chang, 2015), DBpedia Lookup, and DeepType (Raiman and Raiman, 2018). However, for a commonsense KG whose content consists of non-canonicalized, free-form text, no such system exists. Therefore, we built an efficient inverted index based on lemma-based fuzzy matching as our entity linking system, which will be open-sourced.
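A lemma-based inverted index can be sketched in a few lines. The toy lemmatizer below (lowercasing plus stripping a plural "s") and the example entities are assumptions for illustration only; the released system uses proper lemmatization and a larger matching pipeline:

```python
from collections import defaultdict

def lemma(token):
    """Toy lemmatizer: lowercase and strip a trailing plural 's'.
    The real system uses a proper lemmatizer; this is illustrative."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_inverted_index(entities):
    """Map each lemma to the set of KG entities whose name contains it."""
    index = defaultdict(set)
    for entity in entities:
        for tok in entity.split("_"):  # ConceptNet-style multi-word names
            index[lemma(tok)].add(entity)
    return index

def link_entities(sentence, index):
    """Fuzzy-match sentence tokens against the index via shared lemmas."""
    found = set()
    for tok in sentence.split():
        found |= index.get(lemma(tok), set())
    return found

kg_entities = ["musical_instrument", "guitar", "music"]
index = build_inverted_index(kg_entities)
print(link_entities("He plays guitars and loves music", index))
```

Because lookup is a dictionary access per token, linking scales linearly with corpus size, which matters when processing the full pre-training corpus.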
Evaluation Metrics. For the multiple-choice question answering and triple classification tasks, we use accuracy as the metric. For the link prediction task, there are two kinds of metrics: the first comprises mean rank (MR) and mean reciprocal rank (MRR), and the second is Hits@N (H@N), i.e., the proportion of test cases whose correct entity is ranked in the top N when candidates are sorted by predicted confidence. Note that we only report results under the filtered setting (Bordes et al., 2013), which removes all corrupted triples appearing in the training, dev, and test sets.
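These ranking metrics can be sketched as follows, assuming the filtered candidate lists have already been built and ranks are 1-indexed (the example ranks are hypothetical):

```python
def ranking_metrics(ranks, n=10):
    """Compute MR, MRR, and Hits@N from the 1-indexed ranks of the
    correct entity in each (filtered) candidate list."""
    mr = sum(ranks) / len(ranks)                       # mean rank
    mrr = sum(1.0 / r for r in ranks) / len(ranks)     # mean reciprocal rank
    hits = sum(1 for r in ranks if r <= n) / len(ranks)  # Hits@N
    return mr, mrr, hits

# Hypothetical ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 20]
mr, mrr, hits_at_3 = ranking_metrics(ranks, n=3)
print(mr, mrr, hits_at_3)
```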

B MORE EXPERIMENTS
PhysicalIQA. In addition to CommonsenseQA and SocialIQA shown in the main paper, the PhysicalIQA 6 dataset is also used to evaluate our method. It is likewise regarded as out-of-domain with respect to our training corpora. However, our code base cannot reproduce the state-of-the-art results achieved by RoBERTa-large finetuning, possibly due to different pre-processing and input-feeding strategies for pre-trained LMs, e.g., special tokens, the concatenation scheme, and the representation gathering method (Mitra et al., 2019). Hence, for a fair comparison, we only report re-implemented results with the same network structure, data pre-processing, random seed, and hyperparameter grid search. The dev-set accuracy is 78.7% for the RoBERTa-large baseline and 80.2% for GLM (RoBERTa), which further demonstrates the effectiveness of our approach on out-of-domain datasets.
HellaSWAG. We also apply our approach to HellaSWAG 7 (Zellers et al., 2019), a plausible-inference task that requires reasoning over the linguistic context and external knowledge: the goal is to choose the one plausible ending from four candidates. As with PhysicalIQA, we only report dev-set accuracy for a fast, fair comparison. With our implementation, finetuning RoBERTa-large on this dataset achieves 84.1% dev accuracy, which is higher than the best dev accuracy (83.5%) on the leaderboard. However, finetuning our approach achieves 83.9% accuracy, slightly worse than the baseline. We notice that examples in HellaSWAG frequently require inference over multiple consecutive sentences, so our model, trained on single, unordered sentences, may achieve only sub-optimal performance. Moreover, this is another out-of-domain dataset, which may not benefit from our training knowledge graph and corpora.

C ERROR ANALYSIS
Compared with the original pre-trained LMs (e.g., BERT and RoBERTa), our pre-trained models exhibit some limitations, which mainly fall into the following categories.
• Over-masking: Compared to random masking, our KG-guided masking scheme is more likely to mask all key parts of a sentence, which leaves little room for the MLM task.
• Short context: Since our corpora consist of single, unordered sentences, information that spans multiple sentences is not encoded effectively. When downstream tasks rely heavily on consecutive sentences, finetuning our model yields inferior performance. This is empirically supported by the performance drop on the HellaSWAG dataset, which involves long contexts.
• Pipeline model: As with any other method aiming to integrate LMs with a KG, a linking system is first applied to detect entities in text, which inevitably suffers from graph sparsity and leads to error propagation. However, our method is less sensitive to such errors than methods that link entities during both finetuning and inference.

D RELATED WORK (EXTENDED)
This work is in line with Baidu-ERNIE (Sun et al., 2019a) and SpanBERT, which replace word-level masking (Devlin et al., 2019) with span-level masking to capture knowledge information and long-term dependencies. In particular, Baidu-ERNIE uniformly at random masks phrases and entities, whereas SpanBERT directly masks out token spans whose lengths are sampled from a geometric distribution.
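SpanBERT-style geometric span masking can be sketched as below. The parameter values (p = 0.2, spans clipped at 10, ∼15% mask budget) match SpanBERT's reported setup but should be treated as assumptions here; this is an illustrative sketch, not either paper's implementation:

```python
import random

def sample_span_length(p=0.2, max_len=10, rng=random):
    """Sample a span length from a geometric distribution Geo(p),
    clipped at max_len (as in SpanBERT)."""
    length = 1
    while rng.random() > p and length < max_len:
        length += 1
    return length

def mask_spans(tokens, mask_ratio=0.15, rng=random):
    """Mask contiguous spans until ~mask_ratio of the tokens are covered."""
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = list(tokens)
    covered = 0
    while covered < budget:
        # Cap the span so we never exceed the masking budget.
        length = min(sample_span_length(rng=rng), budget - covered)
        start = rng.randrange(0, len(tokens) - length + 1)
        for i in range(start, start + length):
            if masked[i] != "[MASK]":
                covered += 1
                masked[i] = "[MASK]"
    return masked
```

By contrast, the KG-guided scheme in this paper chooses which entity mentions to mask based on the linked graph, rather than sampling span positions at random.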
Petroni et al. (2019) find that, without finetuning, pre-trained LMs (e.g., BERT) contain relational knowledge competitive with traditional NLP methods that have access to oracle knowledge. Nevertheless, how to integrate oracle knowledge into pre-trained LMs for further performance improvement remains an open question.
As briefly summarized in the introduction of the main paper, existing methods can be coarsely categorized into two classes. Methods of the first class retrieve a KG subgraph and/or pre-trained graph embeddings via entity linking during finetuning and inference. K-BERT (Liu et al., 2019a) retrieves a path from the KG as a description for each entity detected in the text, and inserts this description into the input sequence of the Transformer encoder with a carefully designed attention mask and position embeddings. KnowBert (Peters et al., 2019) and THU-ERNIE (Zhang et al., 2019) first retrieve the detected entities' embeddings from pre-trained graph embeddings (Bordes et al., 2013), and then treat these retrieved embeddings as extra inputs to each layer of the Transformer encoder. Lin et al. (2019) and Lv et al. (2019) target commonsense multiple-choice QA: they retrieve a graph path from the entities detected in the question to each answer entry, and then encode these paths (e.g., via an LSTM) as heterogeneous representations for higher-level modules.
The second class of methods use contextualized representations from pre-trained LMs to enrich graph embeddings and thereby alleviate graph sparsity issues. COMET finetunes a pre-trained LM on partially masked triples from a KG, targeting only commonsense knowledge graph completion tasks. Other work performs transfer learning from pre-trained language models to knowledge graphs for enhanced contextual representations of the knowledge. KG-BERT (Yao et al., 2019) directly concatenates the head, relation, and tail of a triple, and finetunes pre-trained LMs on such data with a binary classification objective, i.e., whether a triple is correct or not.
How to generate and utilize negative samples is important for learning graph embeddings and structured knowledge (Sun et al., 2019c; Ye et al., 2019). For example, KBGAN (Cai and Wang, 2018) uses a knowledge graph embedding model as a negative-sample generator to assist the training of the desired model, which acts as the discriminator in a GAN. Rather than using a standalone generator, self-adversarial sampling (Sun et al., 2019c) generates negative samples according to the current entity and relation embeddings. BERT-AMS (Ye et al., 2019) and our proposed ranking task share a similar motivation, namely that a model can effectively learn structured knowledge from negative samples, but they differ in task design: BERT-AMS builds a multiple-choice question answering task from negative samples, imitating the construction procedure of CommonsenseQA (Talmor et al., 2019), and aims to improve performance on that particular dataset. In contrast, our approach is more general and formulates a ranking task alongside the entity-level masked language modeling objective for pre-training knowledge-aware LMs.
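The general idea behind a distractor-suppressed ranking objective can be sketched as a margin loss that pushes the gold entity's score above each KG-derived distractor's score. This is a minimal sketch of the concept, not the paper's exact objective; the scores and margin are hypothetical:

```python
def margin_ranking_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss pushing the true entity's score above every
    distractor's score by at least `margin`. Distractors already
    beaten by more than the margin contribute zero loss, so training
    focuses on the hard negatives."""
    losses = [max(0.0, margin - pos_score + neg) for neg in neg_scores]
    return sum(losses) / len(losses)

# Hypothetical scores: the gold masked entity vs. KG-derived distractors.
loss = margin_ranking_loss(pos_score=2.5, neg_scores=[0.5, 2.2, 3.0])
print(loss)
```

In the paper's setting this ranking term is optimized jointly with the entity-level MLM loss, so the model both reconstructs the masked entity and learns to suppress its graph-selected distractors.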