LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.


Introduction
Many natural language tasks involve entities, e.g., relation classification, entity typing, named entity recognition (NER), and question answering (QA). Key to solving such entity-related tasks is a model that learns effective representations of entities. Conventional entity representations assign each entity a fixed embedding vector that stores information regarding the entity in a knowledge base (KB) (Bordes et al., 2013; Trouillon et al., 2016; Yamada et al., 2016, 2017). Although these models capture the rich information in the KB, they require entity linking to represent entities in a text, and cannot represent entities that do not exist in the KB. By contrast, contextualized word representations (CWRs) based on the transformer (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020), provide effective general-purpose word representations trained with unsupervised pretraining tasks based on language modeling. Many recent studies have solved entity-related tasks using the contextualized representations of entities computed based on CWRs (Zhang et al., 2019; Peters et al., 2019). However, the architecture of CWRs is not well suited to representing entities for the following two reasons: (1) Because CWRs do not output the span-level representations of entities, they typically need to learn how to compute such representations based on a downstream dataset that is typically small. (2) Many entity-related tasks, e.g., relation classification and QA, involve reasoning about the relationships between entities. Although the transformer can capture the complex relationships between words by relating them to each other multiple times using the self-attention mechanism (Clark et al., 2019; Reif et al., 2019), it is difficult to perform such reasoning between entities because many entities are split into multiple tokens in the model. Furthermore, the word-based pretraining task of CWRs is not suitable for learning the representations of entities because predicting a masked word given other words in the entity, e.g., predicting "Rings" given "The Lord of the [MASK]", is clearly easier than predicting the entire entity.
In this paper, we propose new pretrained contextualized representations of words and entities by developing LUKE (Language Understanding with Knowledge-based Embeddings). LUKE is based on a transformer (Vaswani et al., 2017) trained using a large entity-annotated corpus obtained from Wikipedia. An important difference between LUKE and existing CWRs is that it treats not only words, but also entities as independent tokens, and computes intermediate and output representations for all tokens using the transformer (see Figure 1). Since entities are treated as tokens, LUKE can directly model the relationships between entities. LUKE is trained using a new pretraining task, a straightforward extension of BERT's masked language model (MLM) (Devlin et al., 2019). The task involves randomly masking entities by replacing them with [MASK] entities, and trains the model by predicting the originals of these masked entities. We use RoBERTa (Liu et al., 2020) as the base pre-trained model, and conduct pretraining of the model by simultaneously optimizing the objectives of the MLM and our proposed task. When applied to downstream tasks, the resulting model can compute representations of arbitrary entities in the text using [MASK] entities as inputs. Furthermore, if entity annotation is available in the task, the model can compute entity representations based on the rich entity-centric information encoded in the corresponding entity embeddings.

[Figure 1: Architecture of LUKE using the input sentence "Beyoncé lives in Los Angeles." LUKE outputs a contextualized representation for each word and entity in the text. The model is trained to predict randomly masked words (e.g., lives and Angeles in the figure) and entities (e.g., Los Angeles in the figure). Downstream tasks are solved using its output representations with linear classifiers.]
Another key contribution of this paper is that it extends the transformer using our entity-aware self-attention mechanism. Unlike existing CWRs, our model needs to deal with two types of tokens, i.e., words and entities. Therefore, we assume that it is beneficial to enable the mechanism to easily determine the types of tokens. To this end, we enhance the self-attention mechanism by adopting different query matrices based on the attending token and the token attended to.
We validate the effectiveness of our proposed model by conducting extensive experiments on five standard entity-related tasks: entity typing, relation classification, NER, cloze-style QA, and extractive QA. Our model outperforms all baseline models, including RoBERTa, in all experiments, and obtains state-of-the-art results on five tasks: entity typing on the Open Entity dataset (Choi et al., 2018), relation classification on the TACRED dataset (Zhang et al., 2017), NER on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003), clozestyle QA on the ReCoRD dataset (Zhang et al., 2018a), and extractive QA on the SQuAD 1.1 dataset (Rajpurkar et al., 2016). We publicize our source code and pretrained representations at https://github.com/studio-ousia/luke.
The main contributions of this paper are summarized as follows:
• We propose LUKE, new contextualized representations specifically designed to address entity-related tasks. LUKE is trained to predict randomly masked words and entities using a large entity-annotated corpus obtained from Wikipedia.
• We introduce an entity-aware self-attention mechanism, an effective extension of the original mechanism of the transformer. The proposed mechanism considers the types of tokens (words or entities) when computing attention scores.
• LUKE achieves strong empirical performance and obtains state-of-the-art results on five popular datasets: Open Entity, TACRED, CoNLL-2003, ReCoRD, and SQuAD 1.1.

Static Entity Representations
Conventional entity representations assign a fixed embedding to each entity in the KB. They include knowledge embeddings trained on knowledge graphs (Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2016), and embeddings trained using textual contexts or descriptions of entities retrieved from a KB (Yamada et al., 2016, 2017; Cao et al., 2017; Ganea and Hofmann, 2017). Similar to our pretraining task, NTEE (Yamada et al., 2017) and RELIC (Ling et al., 2020) use an approach that trains entity embeddings by predicting entities given their textual contexts obtained from a KB. The main drawbacks of this line of work, when representing entities in text, are that (1) they need to resolve entities in the text to corresponding KB entries to represent the entities, and (2) they cannot represent entities that do not exist in the KB.

Contextualized Word Representations
Many recent studies have addressed entity-related tasks based on the contextualized representations of entities in text computed using the word representations of CWRs (Zhang et al., 2019; Baldini Soares et al., 2019; Peters et al., 2019; Wang et al., 2019b, 2020). Representative examples of CWRs are ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), which are based on deep bidirectional long short-term memory (LSTM) and the transformer (Vaswani et al., 2017), respectively. BERT is trained using an MLM, a pretraining task that masks random words in the text and trains the model to predict the masked words. Most recent CWRs, such as RoBERTa (Liu et al., 2020), XLNet (Yang et al., 2019), SpanBERT (Joshi et al., 2020), ALBERT (Lan et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020), are based on the transformer trained using a task equivalent or similar to the MLM. Similar to our proposed pretraining task that masks entities instead of words, several recent CWRs, e.g., SpanBERT, ALBERT, BART, and T5, have extended the MLM by randomly masking word spans instead of single words. Furthermore, various recent studies have explored methods to enhance CWRs by injecting them with knowledge from external sources, such as KBs. ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) use a similar idea to enhance CWRs using static entity embeddings separately learned from a KB. WKLM (Xiong et al., 2020) trains the model to detect whether an entity name in text is replaced by another entity name of the same type. KEPLER (Wang et al., 2019b) conducts pretraining based on the MLM and a knowledge-embedding objective (Bordes et al., 2013). K-Adapter (Wang et al., 2020) was proposed concurrently with our work, and extends CWRs using neural adapters that inject factual and linguistic knowledge. This line of work is related to ours because our pretraining task also enhances the model using information in the KB.
Unlike the CWRs mentioned above, LUKE uses an improved transformer architecture with an entity-aware self-attention mechanism that is designed to effectively solve entity-related tasks. LUKE also outputs entity representations by learning how to compute them during pretraining. It achieves superior empirical results to existing CWRs and knowledge-enhanced CWRs in all of our experiments.


LUKE

Figure 1 shows the architecture of LUKE. The model adopts a multi-layer bidirectional transformer (Vaswani et al., 2017). It treats words and entities in the document as input tokens, and computes a representation for each token. Formally, given a sequence consisting of m words w_1, w_2, ..., w_m and n entities e_1, e_2, ..., e_n, our model computes D-dimensional word representations h_{w_1}, h_{w_2}, ..., h_{w_m}, where h_w ∈ R^D, and entity representations h_{e_1}, h_{e_2}, ..., h_{e_n}, where h_e ∈ R^D. The entities can be Wikipedia entities (e.g., Beyoncé in Figure 1) or special entities (e.g., [MASK]).

Input Representation
The input representation of a token (word or entity) is computed using the following three embeddings:
• Token embedding represents the corresponding token. We denote the word token embedding by A ∈ R^{V_w×D}, where V_w is the number of words in our vocabulary. For computational efficiency, we represent the entity token embedding by decomposing it into two small matrices, B ∈ R^{V_e×H} and U ∈ R^{H×D}, where V_e is the number of entities in our vocabulary. Hence, the full matrix of the entity token embedding can be computed as BU.
• Position embedding represents the position of the token in a word sequence. A word and an entity appearing at the i-th position in the sequence are represented as C_i ∈ R^D and D_i ∈ R^D, respectively. If an entity name contains multiple words, its position embedding is computed by averaging the embeddings of the corresponding positions, as shown in Figure 1.
• Entity type embedding represents that the token is an entity. The embedding is a single vector denoted by e ∈ R^D.
The input representation of a word is computed by summing the token and position embeddings, and that of an entity by summing the token, position, and entity type embeddings. Following past work (Devlin et al., 2019; Liu et al., 2020), we insert the special tokens [CLS] and [SEP] into the word sequence as the first and last words, respectively.
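To make the three-way sum concrete, the following is a minimal PyTorch sketch of how the input representations could be assembled under the definitions above; the class and method names are our own illustration and are not taken from the LUKE codebase.

```python
import torch
import torch.nn as nn

class LukeInputEmbeddings(nn.Module):
    """Minimal sketch of LUKE's three-part input representation (not the official code)."""

    def __init__(self, V_w=50000, V_e=500000, D=1024, H=256, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(V_w, D)              # word token embedding A
        self.entity_emb = nn.Embedding(V_e, H)            # low-rank factor B
        self.entity_proj = nn.Linear(H, D, bias=False)    # factor U; full matrix is BU
        self.word_pos = nn.Embedding(max_pos, D)          # word position embedding C_i
        self.entity_pos = nn.Embedding(max_pos, D)        # entity position embedding D_i
        self.entity_type = nn.Parameter(torch.zeros(D))   # single entity type vector e

    def embed_words(self, word_ids, positions):
        # Word input = token embedding + position embedding.
        return self.word_emb(word_ids) + self.word_pos(positions)

    def embed_entities(self, entity_ids, position_lists):
        # Entity input = token + position + entity type embeddings. For an entity
        # name spanning several words, the position embedding is the average of
        # the embeddings of the corresponding word positions.
        tok = self.entity_proj(self.entity_emb(entity_ids))
        pos = torch.stack([self.entity_pos(p).mean(dim=0) for p in position_lists])
        return tok + pos + self.entity_type

# Example with toy dimensions: embed an entity spanning word positions 3-4.
emb = LukeInputEmbeddings(V_w=1000, V_e=1000, D=64, H=16)
vec = emb.embed_entities(torch.tensor([42]), [torch.tensor([3, 4])])  # (1, 64)
```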

Entity-aware Self-attention
The self-attention mechanism is the foundation of the transformer (Vaswani et al., 2017), and relates tokens to each other based on the attention score between each pair of tokens. Given a sequence of input vectors x_1, x_2, ..., x_k, where x_i ∈ R^D, each of the output vectors y_1, y_2, ..., y_k, where y_i ∈ R^L, is computed as the weighted sum of the transformed input vectors. Here, each input and output vector corresponds to a token (a word or an entity) in our model; therefore, k = m + n. The i-th output vector y_i is computed as:

\[ y_i = \sum_{j=1}^{k} \alpha_{ij} V x_j \]
\[ \alpha_{ij} = \mathrm{softmax}_j(e_{ij}), \qquad e_{ij} = \frac{(K x_j)^\top Q x_i}{\sqrt{L}} \]

where Q ∈ R^{L×D}, K ∈ R^{L×D}, and V ∈ R^{L×D} denote the query, key, and value matrices, respectively.
Because LUKE handles two types of tokens (i.e., words and entities), we assume that it is beneficial to use the information of target token types when computing the attention scores e_ij. With this in mind, we enhance the mechanism by introducing an entity-aware query mechanism that uses a different query matrix for each possible pair of token types of x_i and x_j. Formally, the attention score e_ij is computed as follows:

\[ e_{ij} = \begin{cases} \frac{(K x_j)^\top Q x_i}{\sqrt{L}}, & \text{if both } x_i \text{ and } x_j \text{ are words} \\ \frac{(K x_j)^\top Q_{w2e} x_i}{\sqrt{L}}, & \text{if } x_i \text{ is a word and } x_j \text{ is an entity} \\ \frac{(K x_j)^\top Q_{e2w} x_i}{\sqrt{L}}, & \text{if } x_i \text{ is an entity and } x_j \text{ is a word} \\ \frac{(K x_j)^\top Q_{e2e} x_i}{\sqrt{L}}, & \text{if both } x_i \text{ and } x_j \text{ are entities} \end{cases} \]

where Q_{w2e}, Q_{e2w}, Q_{e2e} ∈ R^{L×D} are additional query matrices. Note that the computational costs of the original mechanism and our proposed mechanism are identical at inference time; the only additional cost is that of computing gradients and updating the parameters of the additional query matrices at training time.
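The following is an illustrative single-head PyTorch sketch of the entity-aware score computation, assuming the dimensions defined above. It naively computes all four score matrices and then selects per token-type pair, which is the simplest (not the most efficient) formulation; all names are hypothetical.

```python
import math
import torch
import torch.nn as nn

class EntityAwareAttentionScores(nn.Module):
    """Single-head sketch of the entity-aware query mechanism (illustrative only)."""

    def __init__(self, D=1024, L=64):
        super().__init__()
        self.key = nn.Linear(D, L, bias=False)  # K, shared across token-type pairs
        # One query matrix per (attending type, attended type) pair.
        self.q = nn.ModuleDict({
            "w2w": nn.Linear(D, L, bias=False),  # the original Q
            "w2e": nn.Linear(D, L, bias=False),
            "e2w": nn.Linear(D, L, bias=False),
            "e2e": nn.Linear(D, L, bias=False),
        })
        self.L = L

    def forward(self, x, is_entity):
        # x: (k, D) input vectors; is_entity: (k,) bool flags for token type.
        k_vecs = self.key(x)  # (k, L)
        # Scores under each query matrix; entry [i, j] equals (K x_j)^T Q x_i / sqrt(L).
        scores = {n: q(x) @ k_vecs.t() / math.sqrt(self.L) for n, q in self.q.items()}
        i_ent = is_entity.unsqueeze(1)  # attending token type, (k, 1)
        j_ent = is_entity.unsqueeze(0)  # attended token type, (1, k)
        e = torch.where(i_ent & j_ent, scores["e2e"],
            torch.where(i_ent & ~j_ent, scores["e2w"],
            torch.where(~i_ent & j_ent, scores["w2e"], scores["w2w"])))
        return torch.softmax(e, dim=-1)  # attention weights alpha_ij

# Example: four word tokens followed by two entity tokens.
x = torch.randn(6, 1024)
is_entity = torch.tensor([False] * 4 + [True] * 2)
attn = EntityAwareAttentionScores()(x, is_entity)  # (6, 6) attention weights
```

An efficient implementation would apply each query matrix only to the rows of the relevant token type, which is consistent with the note above that the inference cost matches that of the original mechanism.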

Pretraining Task
To pretrain LUKE, we use the conventional MLM and a new pretraining task that is an extension of the MLM to learn entity representations. In particular, we treat hyperlinks in Wikipedia as entity annotations, and train the model using a large entity-annotated corpus retrieved from Wikipedia. We randomly mask a certain percentage of the entities by replacing them with special [MASK] entities and then train the model to predict the masked entities. Formally, the original entity corresponding to a masked entity is predicted by applying the softmax function over all entities in our vocabulary:

\[ \hat{y} = \mathrm{softmax}(B T m + b_o) \]
\[ m = \mathrm{layer\_norm}\big(\mathrm{gelu}(W_h h_e + b_h)\big) \]

where h_e is the representation corresponding to the masked entity, B is the entity token embedding matrix introduced above, T ∈ R^{H×D} and W_h ∈ R^{D×D} are weight matrices, b_o ∈ R^{V_e} and b_h ∈ R^D are bias vectors, gelu(·) is the gelu activation function (Hendrycks and Gimpel, 2016), and layer_norm(·) is the layer normalization function (Lei Ba et al., 2016). Our final loss function is the sum of the MLM loss and the cross-entropy loss on predicting the masked entities, where the latter is computed identically to the former.
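As a concrete reading of the two equations above, here is a hedged PyTorch sketch of the prediction head; the module name and the explicit weight tying with the entity embedding matrix B are our own illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedEntityHead(nn.Module):
    """Sketch of the masked entity prediction head described above (illustrative)."""

    def __init__(self, entity_emb: nn.Embedding, D=1024, H=256):
        super().__init__()
        self.W_h = nn.Linear(D, D)            # W_h and bias b_h
        self.layer_norm = nn.LayerNorm(D)
        self.T = nn.Linear(D, H, bias=False)  # T in the equation above
        self.entity_emb = entity_emb          # B, shared with the input embedding
        self.b_o = nn.Parameter(torch.zeros(entity_emb.num_embeddings))

    def forward(self, h_e):
        # h_e: (batch, D) representations of the [MASK] entities.
        m = self.layer_norm(F.gelu(self.W_h(h_e)))                    # (batch, D)
        logits = self.T(m) @ self.entity_emb.weight.t() + self.b_o    # (batch, V_e)
        return logits  # applying softmax over the entity vocabulary gives y-hat

# Example with toy dimensions.
entity_emb = nn.Embedding(1000, 256)
head = MaskedEntityHead(entity_emb, D=64, H=256)
logits = head(torch.randn(2, 64))  # (2, 1000)
```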

Modeling Details
Our model configuration follows RoBERTa_LARGE (Liu et al., 2020), pretrained CWRs based on a bidirectional transformer and a variant of BERT (Devlin et al., 2019). In particular, our model is based on the bidirectional transformer with D = 1024 hidden dimensions, 24 hidden layers, L = 64 attention head dimensions, and 16 self-attention heads. The number of dimensions of the entity token embedding is set to H = 256. The total number of parameters is approximately 483M, consisting of 355M in RoBERTa and 128M in our entity embeddings. The input text is tokenized into words using RoBERTa's tokenizer with a vocabulary consisting of V_w = 50K words. For computational efficiency, our entity vocabulary does not include all entities but only the V_e = 500K entities appearing most frequently in our entity annotations. The entity vocabulary also includes two special entities, i.e., [MASK] and [UNK].
The model is trained via iterations over Wikipedia pages in a random order for 200K steps. To reduce training time, we initialize the parameters that LUKE has in common with RoBERTa (the parameters in the transformer and the embeddings for words) using RoBERTa. Following past work (Devlin et al., 2019; Liu et al., 2020), we mask 15% of all words and entities at random. If an entity does not exist in the vocabulary, we replace it with the [UNK] entity. We perform pretraining using the original self-attention mechanism rather than our entity-aware self-attention mechanism because we want to conduct an ablation study of the mechanism but cannot afford to run pretraining twice. The query matrices of our self-attention mechanism (Q_{w2e}, Q_{e2w}, and Q_{e2e}) are learned using downstream datasets. Further details of our pretraining are described in Appendix A.
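A minimal sketch of the entity-side masking just described, purely for illustration (the official preprocessing may differ in details such as how masking interacts with word masking):

```python
import random

def mask_entities(entity_ids, vocab, mask_id, unk_id, mask_prob=0.15):
    """Toy sketch of the entity masking described above (not the official code).

    Out-of-vocabulary entities are replaced with [UNK]; 15% of entities are
    replaced with the [MASK] entity and become prediction targets.
    """
    inputs, targets = [], []
    for eid in entity_ids:
        eid = eid if eid in vocab else unk_id
        if random.random() < mask_prob:
            inputs.append(mask_id)   # masked position
            targets.append(eid)      # the model must predict the original entity
        else:
            inputs.append(eid)
            targets.append(-100)     # conventional "ignore" label for the loss
    return inputs, targets

inputs, targets = mask_entities([3, 7, 9999], vocab={3, 7}, mask_id=0, unk_id=1)
```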

Experiments
We conduct extensive experiments using five entity-related tasks: entity typing, relation classification, NER, cloze-style QA, and extractive QA. We use similar model architectures for all tasks, based on a simple linear classifier on top of the representations of words, entities, or both. Unless otherwise specified, we create the input word sequence by inserting the special tokens [CLS] and [SEP] into the original word sequence as the first and last tokens, respectively. The input entity sequence is built using [MASK] entities, special entities introduced for the task, or Wikipedia entities. The token embedding of a task-specific special entity is initialized using that of the [MASK] entity, and the query matrices of our entity-aware self-attention mechanism (Q_{w2e}, Q_{e2w}, and Q_{e2e}) are initialized using the original query matrix Q.

Table 1: Results of entity typing on the Open Entity dataset.

Name                             Prec.  Rec.  F1
UFET (Choi et al., 2018)         77.4   60.6  68.0
BERT (Zhang et al., 2019)        76.4   71.0  73.6
ERNIE (Zhang et al., 2019)       78.4   72.9  75.6
KEPLER (Wang et al., 2019b)      77.2   74.2  75.7
KnowBERT (Peters et al., 2019)   78.6   73.7  76.1
K-Adapter (Wang et al., 2020)    79.3   75.8  77.5
RoBERTa (Wang et al., 2020)      77.6   75.0  76.2
LUKE                             79.9   76.6  78.2

Because we use RoBERTa as the base model in our pretraining, we use it as our primary baseline for all tasks. We omit a description of the baseline models in each section if they are described in Section 2. Further details of our experiments are available in Appendix B.

Entity Typing
We first conduct experiments on entity typing, which is the task of predicting the types of an entity in the given sentence. Following Zhang et al. (2019), we use the Open Entity dataset (Choi et al., 2018), and consider only nine general entity types. Following Wang et al. (2020), we report loose micro-precision, recall, and F1, and employ the micro-F1 as the primary metric.
Model We represent the target entity using the [MASK] entity, and enter words and the entity in each sentence into the model. We then classify the entity using a linear classifier based on the corresponding entity representation. We treat the task as multi-label classification, and train the model using binary cross-entropy loss averaged over all entity types.
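A minimal sketch of this classification head, assuming 1024-dimensional representations; the variable names and random tensors are placeholders:

```python
import torch
import torch.nn as nn

# Illustrative entity-typing head: a linear layer over the [MASK] entity
# representation, trained with binary cross-entropy (multi-label setting).
NUM_TYPES = 9  # the nine general entity types in Open Entity

classifier = nn.Linear(1024, NUM_TYPES)
loss_fn = nn.BCEWithLogitsLoss()

h_mask = torch.randn(8, 1024)                          # [MASK] entity representations
labels = torch.randint(0, 2, (8, NUM_TYPES)).float()   # multi-hot gold types
loss = loss_fn(classifier(h_mask), labels)
```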
Baselines UFET (Choi et al., 2018) is a conventional model that computes context representations using the bidirectional LSTM. We also use BERT, RoBERTa, ERNIE, KnowBERT, KEPLER, and K-Adapter as baselines.

Relation Classification
Relation classification determines the correct relation between head and tail entities in a sentence.
We conduct experiments using the TACRED dataset (Zhang et al., 2017), a large-scale relation classification dataset containing 106,264 sentences with 42 relation types. Following Wang et al. (2020), we report the micro-precision, recall, and F1, and use the micro-F1 as the primary metric.
Model We introduce two special entities, [HEAD] and [TAIL], to represent the head and the tail entities, respectively, and input words and these two entities in each sentence to the model. We then solve the task using a linear classifier based on a concatenated representation of the head and tail entities. The model is trained using cross-entropy loss.
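A hedged sketch of this classifier (batch size and random tensors are toy placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative relation classifier: concatenate the [HEAD] and [TAIL] entity
# representations and apply one linear layer, trained with cross-entropy loss.
NUM_RELATIONS, D = 42, 1024
classifier = nn.Linear(2 * D, NUM_RELATIONS)

h_head, h_tail = torch.randn(8, D), torch.randn(8, D)   # entity representations
logits = classifier(torch.cat([h_head, h_tail], dim=-1))
loss = F.cross_entropy(logits, torch.randint(0, NUM_RELATIONS, (8,)))
```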
Baselines C-GCN (Zhang et al., 2018b) uses graph convolutional networks over dependency tree structures to solve the task. MTB (Baldini Soares et al., 2019) learns relation representations based on BERT through the matching-the-blanks task using a large amount of entity-annotated text. We also compare LUKE with BERT, RoBERTa, SpanBERT, ERNIE, KnowBERT, KEPLER, and K-Adapter.

Results
The experimental results are presented in Table 2. LUKE clearly outperforms our primary baseline, RoBERTa, by 1.4 F1 points, and the previous best published models, namely MTB and KnowBERT, by 1.2 F1 points. Furthermore, it achieves a new state of the art by outperforming K-Adapter by 0.7 F1 points.

Named Entity Recognition
We conduct experiments on the NER task using the standard CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003). Following past work, we report the span-level F1.
Model Following Sohrab and Miwa (2018), we solve the task by enumerating all possible spans (or n-grams) in each sentence as entity name candidates, and classifying them into the target entity types or the non-entity type, which indicates that the span is not an entity. For each sentence in the dataset, we enter words and the [MASK] entities corresponding to all possible spans. The representation of each span is computed by concatenating the word representations of the first and last words in the span, and the entity representation corresponding to the span. We classify each span using a linear classifier with its representation, and train the model using cross-entropy loss. We exclude spans longer than 16 words for computational efficiency. During inference, we first exclude all spans classified into the non-entity type. To avoid selecting overlapping spans, we then greedily select spans in descending order of the logit of their predicted entity type, skipping any span that overlaps with a span already selected. We also use ELMo, BERT, and RoBERTa as baselines. To conduct a fair comparison with RoBERTa, we report its performance using the model described above with the span representation computed by concatenating the representations of the first and last words of the span.
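The greedy decoding step can be summarized with the following illustrative Python sketch (the span offsets, logits, and non-entity index are hypothetical inputs):

```python
def greedy_decode(spans, logits, non_entity_idx):
    """Select non-overlapping spans in descending order of predicted-type logit.

    spans: list of (start, end) word offsets (end exclusive); logits: one row of
    type logits per span. An illustrative re-implementation of the decoding
    described above, not the official code.
    """
    candidates = []
    for (start, end), row in zip(spans, logits):
        pred = max(range(len(row)), key=lambda t: row[t])  # argmax entity type
        if pred != non_entity_idx:                         # drop non-entity spans
            candidates.append((row[pred], start, end, pred))
    selected = []
    # Greedily take the highest-scoring span that does not overlap a chosen one.
    for score, start, end, pred in sorted(candidates, reverse=True):
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, pred))
    return selected

spans = [(0, 1), (0, 2), (1, 3)]
logits = [[0.2, 2.5], [1.0, 0.1], [0.3, 1.7]]            # columns: [non-entity, PER]
print(greedy_decode(spans, logits, non_entity_idx=0))    # [(0, 1, 1), (1, 3, 1)]
```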

Results
The experimental results are shown in Table 3.

Cloze-style Question Answering
We evaluate our model on the ReCoRD dataset (Zhang et al., 2018a), a cloze-style QA dataset consisting of over 120K examples. An interesting characteristic of this dataset is that most of its questions cannot be solved without external knowledge. The following is an example question and its answer in the dataset:

Question: According to claims in the suit, "Parts of 'Stairway to Heaven,' instantly recognizable to the music fans across the world, sound almost identical to significant portions of 'X.'"
Answer: Taurus

Given a question and a passage, the task is to find the entity mentioned in the passage that fits the missing entity (denoted by X in the question above).
In this dataset, annotations of entity spans (start and end positions) in a passage are provided, and the answer is contained in the provided entity spans one or multiple times. Following past work, we evaluate the models using exact match (EM) and token-level F1 on the development and test sets.
Model We solve this task by assigning a relevance score to each entity in the passage and selecting the entity with the highest score as the answer. Following past work, given a question q_1, q_2, ..., q_j and a passage p_1, p_2, ..., p_l, the input word sequence is constructed as: [CLS] q_1, q_2, ..., q_j [SEP] [SEP] p_1, p_2, ..., p_l [SEP]. Further, we input [MASK] entities corresponding to the missing entity and to all entities in the passage. We compute the relevance score of each entity in the passage using a linear classifier with the concatenated representation of the missing entity and the corresponding entity. We train the model using binary cross-entropy loss averaged over all entities in the passage, and select the entity with the highest score (logit) as the answer.
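A hedged PyTorch sketch of this scoring scheme (the 20 passage entities and the gold index are toy placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative ReCoRD scorer: concatenate the representation of the missing
# entity ([MASK]) with that of each passage entity, score with a linear layer,
# and train with binary cross-entropy over all passage entities.
D = 1024
scorer = nn.Linear(2 * D, 1)

h_missing = torch.randn(D)                  # [MASK] entity for the missing answer
h_passage = torch.randn(20, D)              # one [MASK] entity per passage span
pairs = torch.cat([h_missing.expand(20, D), h_passage], dim=-1)
logits = scorer(pairs).squeeze(-1)          # relevance score per passage entity
labels = torch.zeros(20); labels[3] = 1.0   # hypothetical gold answer span
loss = F.binary_cross_entropy_with_logits(logits, labels)
answer = logits.argmax().item()             # entity with the highest score
```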
Baselines DocQA+ELMo (Clark and Gardner, 2018) is a model based on ELMo, the bidirectional attention flow mechanism (Seo et al., 2017), and a self-attention mechanism. XLNet+Verifier (Li et al., 2019) is a model based on XLNet with rule-based answer verification, and is the winner of a recent competition based on this dataset (Ostermann et al., 2019). We also use BERT and RoBERTa as baselines.

Results
The results are presented in Table 4. LUKE significantly outperforms RoBERTa, the best baseline, on the development set by 1.8 EM points and 1.9 F1 points. Furthermore, it achieves superior results to RoBERTa (ensemble) on the test set without ensembling the models.

Extractive Question Answering
Finally, we conduct experiments using the well-known Stanford Question Answering Dataset (SQuAD) 1.1, consisting of 100K question/answer pairs (Rajpurkar et al., 2016). Given a question and a Wikipedia passage containing the answer, the task is to predict the answer span in the passage. Following past work, we report the EM and token-level F1 on the development and test sets.
Model We construct the word sequence from the question and the passage in the same way as in the previous experiment. Unlike in the other experiments, we input Wikipedia entities into the model based on entity annotations automatically generated on the question and the passage using a mapping from entity names (e.g., "U.S.") to their referent entities (e.g., United States). The mapping is automatically created using the entity hyperlinks in Wikipedia as described in detail in Appendix C. We solve this task using the same model architecture as that of BERT and RoBERTa. In particular, we use two linear classifiers independently on top of the word representations to predict the span boundary of the answer (i.e., the start and end positions), and train the model using cross-entropy loss.
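Since the architecture is shared with BERT and RoBERTa, the span-boundary prediction reduces to the following sketch (the sequence length, batch, and gold positions are toy values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative SQuAD head in the style of BERT/RoBERTa: two linear classifiers
# over the word representations predict the start and end of the answer span.
D, seq_len = 1024, 384
start_clf, end_clf = nn.Linear(D, 1), nn.Linear(D, 1)

h_words = torch.randn(2, seq_len, D)             # word representations
start_logits = start_clf(h_words).squeeze(-1)    # (batch, seq_len)
end_logits = end_clf(h_words).squeeze(-1)
gold_start, gold_end = torch.tensor([17, 40]), torch.tensor([19, 42])
loss = F.cross_entropy(start_logits, gold_start) + F.cross_entropy(end_logits, gold_end)
```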
Baselines We compare our models with the results of recent CWRs, including BERT, RoBERTa, SpanBERT, XLNet, and ALBERT. Because the results for RoBERTa and ALBERT are reported only on the development set, we conduct a comparison with these models using this set. To conduct a fair comparison with RoBERTa, we use the same model architecture and hyper-parameters as those of RoBERTa (Liu et al., 2020).

Table 5: Results of extractive QA on the SQuAD 1.1 dataset.

Name                        EM (Dev)  F1 (Dev)  EM (Test)  F1 (Test)
XLNet (Yang et al., 2019)   89.0      94.5      89.9       95.1
ALBERT (Lan et al., 2020)   89.3      94.8      -          -
RoBERTa (Liu et al., 2020)  88.9      94.6      -          -
LUKE                        89.8      95.0      90.2       95.4

Results
The experimental results are presented in Table 5. LUKE outperforms our primary baseline, RoBERTa, by 0.9 EM points and 0.4 F1 points on the development set. Furthermore, it achieves a new state of the art on this competitive dataset by outperforming XLNet by 0.3 points both in terms of EM and F1. Note that XLNet uses a more sophisticated model involving beam search than the other models considered here.

Analysis
In this section, we provide a detailed analysis of LUKE by reporting three additional experiments.

Effects of Entity Representations
To investigate how our entity representations influence performance on downstream tasks, we perform an ablation experiment by addressing NER on the CoNLL-2003 dataset and extractive QA on the SQuAD dataset without inputting any entities. In this setting, LUKE uses only the word sequence to compute the representation for each word. We address the tasks using the same model architectures as those for RoBERTa described in the corresponding sections. As shown in Table 6, this setting clearly degrades performance, i.e., 1.4 F1 points on the CoNLL-2003 dataset and 0.6 EM points on the SQuAD dataset, demonstrating the effectiveness of our entity representations on these two tasks.

Effects of Entity-aware Self-attention
We conduct an ablation study of our entity-aware self-attention mechanism by comparing the performance of LUKE using our mechanism with that using the original mechanism of the transformer. As shown in Table 7, our entity-aware self-attention mechanism consistently outperforms the original mechanism across all tasks. Furthermore, we observe significant improvements on two kinds of tasks, relation classification (TACRED) and QA (ReCoRD and SQuAD). Because these tasks involve reasoning based on relationships between entities, we consider that our mechanism enables the model (i.e., attention heads) to easily focus on capturing the relationships between entities.

Effects of Extra Pretraining
As mentioned in Section 3.4, LUKE is based on RoBERTa with pretraining for 200K steps using our Wikipedia corpus. Because past studies (Liu et al., 2020; Lan et al., 2020) suggest that simply increasing the number of training steps of CWRs tends to improve performance on downstream tasks, the superior experimental results of LUKE compared with those of RoBERTa may simply be due to its greater number of pretraining steps.
To investigate this, we train another model based on RoBERTa with extra pretraining based on the MLM using the Wikipedia corpus for 200K training steps. The detailed configuration used in the pretraining is available in Appendix A. We evaluate the performance of this model on the CoNLL-2003 and SQuAD datasets using the same model architectures as those for RoBERTa described in the corresponding sections. As shown in Table 8, the model achieves similar performance to the original RoBERTa on both datasets, which indicates that the superior performance of LUKE is not owing to its longer pretraining.

Conclusions
In this paper, we propose LUKE, new pretrained contextualized representations of words and entities based on the transformer. LUKE outputs the contextualized representations of words and entities using an improved transformer architecture with a novel entity-aware self-attention mechanism. The experimental results prove its effectiveness on various entity-related tasks. Future work involves applying LUKE to domain-specific tasks, such as those in biomedical and legal domains.


A Details of Pretraining

The hyper-parameters used in the pretraining of LUKE are shown in Table 9. Table 10 shows the hyper-parameters used for the extra pretraining of RoBERTa on our Wikipedia corpus described in Section 5. As shown in the table, we use the same hyper-parameters as the ones used to train LUKE. We train the model for 200K steps and update all parameters throughout the training.

B Details of Experiments
We conduct the experiments using NVIDIA's PyTorch Docker container 19.02 hosted on a server with two Intel Xeon E5-2698 v4 CPUs and eight V100 GPUs. For each dataset, excluding SQuAD, we conduct hyper-parameter tuning using grid search based on the performance on the development set. We evaluate performance using EM on the ReCoRD dataset, and F1 on the other datasets. Because our computational resources are limited, we use the following constrained search space:
• learning rate: 1e-5, 2e-5, 3e-5
• number of training epochs: 2, 3, 5
We do not tune the hyper-parameters for the SQuAD dataset, and use the ones described in Liu et al. (2020). The hyper-parameters and other details, including the training time, number of GPUs used, and the best score on the development set, are shown in Table 11. For the other hyper-parameters, we simply follow Liu et al. (2020) (see Table 12). We optimize the model using AdamW with learning rate warmup and linear decay of the learning rate. We also use early stopping based on performance on the development set. The details of the datasets used in our experiments are provided below.

B.1 Open Entity
The Open Entity dataset used in Zhang et al. (2019) consists of training, development, and test sets, where each set contains 1,998 examples with labels of nine general entity types. The dataset is downloaded from the website for Zhang et al. (2019). We compute the reported results using our code based on that of Zhang et al. (2019).

B.3 CoNLL-2003
The CoNLL-2003 dataset comprises training, development, and test sets, containing 14,987, 3,466, and 3,684 sentences, respectively. Each sentence contains annotations of four entity types, namely person, location, organization, and miscellaneous. The dataset is downloaded from the relevant website. The reported results are computed using the conlleval script obtained from the website.

B.4 ReCoRD
The ReCoRD dataset consists of 100,730 training, 10,000 development, and 10,000 test questions created based on 80,121 unique news articles. The dataset is obtained from the relevant website. We compute the performance on the development set using the official evaluation script downloaded from the website. Performance on the test set is obtained by submitting our model to the leaderboard.

B.5 SQuAD 1.1
The SQuAD 1.1 dataset contains 87,599 training, 10,570 development, and 9,533 test questions created based on 536 Wikipedia articles. The dataset is downloaded from the relevant website. We compute performance on the development set using the official evaluation script downloaded from the website. Performance on the test set is obtained by submitting our model to the leaderboard.

C Adding Entity Annotations to SQuAD dataset
For each question-passage pair in the SQuAD dataset, we first create a mapping from the entity names (e.g., "U.S.") to their referent Wikipedia entities (e.g., United States) using the entity hyperlinks on the source Wikipedia page of the passage. We then perform simple string matching to extract all entity names in the question and the passage, and treat all matched entity names as entity annotations for their referent entities. We ignore an entity name if the name refers to multiple entities on the page. Further, to reduce noise, we also exclude an entity name if its link probability, the probability that the name appears as a hyperlink in Wikipedia, is lower than 1%. We use the March 2016 version of Wikipedia to collect the entity hyperlinks and the link probabilities of the entity names.
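The procedure can be summarized with the following illustrative Python sketch; the example mapping and link probabilities are toy values, not the statistics extracted from Wikipedia:

```python
# Illustrative sketch of the annotation procedure described above. The mapping
# of names to referent entities and the link probabilities would be collected
# from Wikipedia hyperlinks; here they are hypothetical toy values.
name_to_entity = {"U.S.": "United States"}       # names that refer to one entity
link_probability = {"U.S.": 0.45, "the": 0.0001}

def annotate(text, min_link_prob=0.01):
    """Find entity names in text by exact string matching and keep confident ones."""
    annotations = []
    for name, entity in name_to_entity.items():
        if link_probability.get(name, 0.0) < min_link_prob:
            continue  # exclude rarely-linked names to reduce noise
        start = text.find(name)
        while start != -1:
            annotations.append((start, start + len(name), entity))
            start = text.find(name, start + 1)
    return annotations

print(annotate("The U.S. economy grew."))  # [(4, 8, 'United States')]
```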