Scalable Zero-shot Entity Linking with Dense Entity Retrieval

This paper introduces a conceptually simple, scalable, and highly effective BERT-based entity linking model, along with an extensive evaluation of its accuracy-speed trade-off. We present a two-stage zero-shot linking algorithm, where each entity is defined only by a short textual description. The first stage does retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then re-ranked with a cross-encoder that concatenates the mention and entity text. Experiments demonstrate that this approach is state of the art on recent zero-shot benchmarks (6 point absolute gains) and also on more established non-zero-shot evaluations (e.g. TACKBP-2010), despite its relative simplicity (e.g. no explicit entity embeddings or manually engineered mention tables). We also show that bi-encoder linking is very fast with nearest neighbor search (e.g. linking with 5.9 million candidates in 2 milliseconds), and that much of the accuracy gain from the more expensive cross-encoder can be transferred to the bi-encoder via knowledge distillation. Our code and models are available at https://github.


Introduction
Scale is a key challenge for entity linking: there are millions of possible entities to consider for each mention. To efficiently filter or rank the candidates, existing methods use different sources of external information, including manually curated mention tables (Ganea and Hofmann, 2017), incoming Wikipedia link popularity (Yamada et al., 2016), and gold Wikipedia entity categories (Gillick et al., 2019). In this paper, we show that BERT-based models set new state-of-the-art performance levels for large-scale entity linking when used in a zero-shot setup, where there is no external knowledge and a short text description provides the only information we have for each entity. We also present an extensive evaluation of the accuracy-speed trade-off inherent to large pre-trained models, and show it is possible to achieve very efficient linking with a modest loss of accuracy.

* Work done during internship with Facebook.
More specifically, we introduce a two-stage approach for zero-shot linking (see Figure 1 for an overview), based on fine-tuned BERT architectures. In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions (Humeau et al., 2019; Gillick et al., 2019). Each retrieved candidate is then examined more carefully with a cross-encoder that concatenates the mention and entity text, following Logeswaran et al. (2019). This overall approach is conceptually simple but highly effective, as we show through detailed experiments.
Our two-stage approach achieves a new state-of-the-art result on TACKBP-2010, with an over 30% relative error reduction. By simply reading the provided text descriptions, we are able to outperform previous methods that included many extra cues, such as entity name dictionaries and link popularity. We also improve the state of the art on existing zero-shot benchmarks, including a nearly 6 point absolute gain on the recently introduced Wikia corpus (Logeswaran et al., 2019) and a more than 7 point absolute gain on WikilinksNED Unseen-Mentions (Onoe and Durrett, 2019).
Finally, we do an extensive evaluation of the accuracy-speed trade-off inherent in our bi- and cross-encoder models. We show that the two-stage method scales well in a full Wikipedia setting, by linking against all 5.9M Wikipedia entities for TACKBP-2010, while still outperforming existing models with much smaller candidate sets. We also show that bi-encoder linking is very fast with approximate nearest neighbor search (e.g. linking over 5.9 million candidates in 2 milliseconds), and that much of the accuracy gain from the more expensive cross-encoder can be transferred to the bi-encoder via knowledge distillation. We release our code and models, as well as a system to link entity mentions to all of Wikipedia (similar to TagME (Ferragina and Scaiella, 2011)).

[Figure 1: Overview of the approach. The mention "My kids really enjoyed a ride in the Jaguar!" is linked against the Wikipedia dense space; candidates include Jaguar_cars ("Jaguar is the luxury vehicle brand.") and the correct entity ("Jaguar! is a junior roller coaster.").]

Related Work
We follow most recent work in studying entity linking with gold mentions. The entity linking task can be broken into two steps: candidate generation and ranking. Prior work has used frequency information, alias tables, and TF-IDF-based methods for candidate generation. For candidate ranking, He et al. (2013), Sun et al. (2015), Yamada et al. (2016), Ganea and Hofmann (2017), and Kolitsas et al. (2018) established state-of-the-art results using neural networks to model context words, spans, and entities. There is also recent work demonstrating that fine-grained entity typing information helps linking (Raiman and Raiman, 2018; Onoe and Durrett, 2019; Khalife and Vazirgiannis, 2018).

Two recent results are most closely related to our work. Logeswaran et al. (2019) proposed the zero-shot entity linking task. They use cross-encoders for entity ranking, but rely on traditional IR techniques for candidate generation and did not evaluate on large-scale benchmarks such as TACKBP. Gillick et al. (2019) show that dense embeddings work well for candidate generation, but they did not do pre-training and included external category labels in their bi-encoder architecture, limiting their linking to entities in Wikipedia. Our approach can be seen as generalizing both of these lines of work, showing for the first time that pre-trained zero-shot architectures are both highly accurate and computationally efficient at scale.

Humeau et al. (2019) studied architectures for using deep pre-trained bidirectional transformers, performing a detailed comparison of the bi-encoder, poly-encoder, and cross-encoder on the task of sentence selection in dialogue. Inspired by their work, we apply similar architectures to the problem of entity linking and, in addition, demonstrate that the bi-encoder can be a strong model for retrieval.
Instead of using the poly-encoder as a trade-off between the cross-encoder and the bi-encoder, we propose to train a bi-encoder model with knowledge distillation (Buciluă et al., 2006; Hinton et al., 2015) from a cross-encoder model to further improve the bi-encoder's performance.

Definition and Task Formulation
Entity Linking Given an input text document D = {w_1, ..., w_r} and a list of entity mentions M_D = {m_1, ..., m_n}, the output of an entity linking model is a list of mention-entity pairs {(m_i, e_i)}_{i ∈ [1,n]}, where each entity is an entry in a knowledge base (KB) such as Wikipedia, e ∈ E. We assume that the title and description of each entity are available, which is a common setting in entity linking (Ganea and Hofmann, 2017; Logeswaran et al., 2019). We also assume each mention has a valid gold entity in the KB, which is usually referred to as in-KB evaluation. We leave out-of-KB prediction (i.e. nil prediction) to future work.

Zero-shot Entity Linking
We also study zero-shot entity linking (Logeswaran et al., 2019). Here the document setup is the same, but the knowledge bases used at training and test time are disjoint. Formally, letting E_train and E_test denote the knowledge bases used at training and test time, we require E_train ∩ E_test = ∅. The sets of text documents, mentions, and entity dictionaries are separated between training and test, so that the entities being linked at test time are unseen.

Methodology

Figure 1 shows our overall approach. The bi-encoder uses two independent BERT transformers to encode the mention context and the entity into dense vectors, and each entity candidate is scored as the dot product of these vectors. The candidates retrieved by the bi-encoder are then passed to the cross-encoder for ranking. The cross-encoder encodes the mention context and entity jointly in one transformer, and applies an additional linear layer to compute the final score for each pair.

Bi-encoder
Architecture We use a bi-encoder architecture similar to the work of Humeau et al. (2019) to model (mention, entity) pairs. This approach allows for fast, real-time inference, as the candidate representations can be cached. Both the input context and the candidate entity are encoded into vectors:

y_m = red(T_1(τ_m))
y_e = red(T_2(τ_e))

where τ_m and τ_e are the input representations of the mention and entity respectively, and T_1 and T_2 are two transformers. red(·) is a function that reduces the sequence of vectors produced by a transformer into a single vector. Following the experiments in Humeau et al. (2019), we choose red(·) to be the last-layer output at the [CLS] token.
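As a minimal sketch of the reduction and scoring described above, the following uses random stand-in arrays in place of real BERT outputs and a toy hidden size (the paper's transformers produce 768- or 1024-dimensional vectors):

```python
import numpy as np

HIDDEN = 8  # toy hidden size; BERT-base uses 768, BERT-large 1024

def red(transformer_output: np.ndarray) -> np.ndarray:
    """Reduce a (seq_len, hidden) sequence of token vectors to one vector
    by taking the output at the first ([CLS]) position, as in the paper."""
    return transformer_output[0]

# Stand-ins for the two independent transformers T_1 (mention) and T_2
# (entity). A real implementation would run BERT; here we fake outputs
# of the right shape.
rng = np.random.default_rng(0)
T1_out = rng.normal(size=(32, HIDDEN))   # mention context, max length 32
T2_out = rng.normal(size=(128, HIDDEN))  # entity description, max length 128

y_m = red(T1_out)           # mention embedding
y_e = red(T2_out)           # entity embedding
score = float(y_m @ y_e)    # dot-product score s(m, e)
```

Because the two encoders are independent, the entity side can be computed once and cached, which is what makes bi-encoder retrieval fast.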

Context and Mention Modeling
The representation τ_m of the context and mention is composed of the word-pieces of the context surrounding the mention and of the mention itself. Specifically, we construct the input for each mention example as:

[CLS] ctxt_l [M_s] mention [M_e] ctxt_r [SEP]

where mention, ctxt_l, and ctxt_r are the word-piece tokens of the mention and of the context before and after it, and [M_s], [M_e] are special tokens that tag the mention. The maximum length of the input representation is a hyperparameter of our model; we find that a small value such as 32 works well in practice (see Appendix A).
Entity Modeling The entity representation τ_e is likewise composed of the word-pieces of the entity title and description (for Wikipedia entities, we use the first ten sentences as the description). The input to our entity model is:

[CLS] title [ENT] description [SEP]

where title and description are the word-piece tokens of the entity title and description, and [ENT] is a special token separating the title and description representations.
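A token-level sketch of the two input formats described above; the special-token names follow the paper, but the tokenizer here is a plain whitespace split rather than BERT word-pieces, so this is illustrative only:

```python
def mention_input(ctxt_l: str, mention: str, ctxt_r: str, max_len: int = 32):
    """[CLS] ctxt_l [Ms] mention [Me] ctxt_r [SEP], truncated to max_len."""
    tokens = (["[CLS]"] + ctxt_l.split() + ["[Ms]"] + mention.split()
              + ["[Me]"] + ctxt_r.split() + ["[SEP]"])
    return tokens[:max_len]

def entity_input(title: str, description: str, max_len: int = 128):
    """[CLS] title [ENT] description [SEP], truncated to max_len."""
    tokens = (["[CLS]"] + title.split() + ["[ENT]"]
              + description.split() + ["[SEP]"])
    return tokens[:max_len]

m = mention_input("My kids really enjoyed a ride in the", "Jaguar", "!")
e = entity_input("Jaguar!", "Jaguar! is a junior roller coaster ...")
```

A real implementation would also balance ctxt_l and ctxt_r around the mention when truncating; that detail is omitted here.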
Scoring The score of entity candidate e_i is given by the dot product:

s(m, e_i) = y_m · y_{e_i}

Optimization The network is trained to maximize the score of the correct entity relative to the (randomly sampled) entities of the same batch (Lerer et al., 2019; Humeau et al., 2019). Concretely, for each training pair (m_i, e_i) in a batch of B pairs, the loss is computed as:

L(m_i, e_i) = −s(m_i, e_i) + log Σ_{j=1}^{B} exp(s(m_i, e_j))

Lerer et al. (2019) presented a detailed analysis of the speed and memory efficiency of batched random negatives in large-scale systems. In addition to in-batch negatives, we follow Gillick et al. (2019) in also using hard negatives during training, obtained by taking the top 10 predicted entities for each training example. These extra hard negatives are added to the random in-batch negatives.
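A minimal NumPy sketch of the in-batch softmax objective described above, with toy dimensions and a stabilized log-sum-exp (the actual training uses PyTorch and backpropagation; this just shows the forward computation of the loss):

```python
import numpy as np

def in_batch_softmax_loss(y_m: np.ndarray, y_e: np.ndarray) -> float:
    """Mean loss over a batch of B (mention, entity) pairs.

    y_m, y_e: (B, d) embeddings. Row i of y_e is the gold entity for
    mention i; every other row in the batch serves as a random negative,
    so the score matrix is (B, B).
    """
    scores = y_m @ y_e.T                           # s(m_i, e_j) for all i, j
    mx = scores.max(axis=1, keepdims=True)         # log-sum-exp stability
    log_z = (mx + np.log(np.exp(scores - mx).sum(axis=1, keepdims=True))).ravel()
    # loss_i = -s(m_i, e_i) + log sum_j exp(s(m_i, e_j))
    return float(np.mean(log_z - np.diag(scores)))

rng = np.random.default_rng(0)
B, d = 4, 8                                        # toy batch size and dims
loss = in_batch_softmax_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
```

Hard negatives, as described above, would simply append extra score columns for each mention before the log-sum-exp.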
Inference At inference time, the entity representations for all entity candidates can be precomputed and cached. The inference task is then reduced to finding the maximum dot product between the mention representation and the cached entity representations. In Section 5.2.3 we present efficiency/accuracy trade-offs for exact and approximate nearest neighbor search using FAISS (Johnson et al., 2019) in a large-scale setting.
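The inference step above amounts to maximum-inner-product search over a cached matrix. The sketch below does this exhaustively in NumPy with toy sizes; at the paper's 5.9M-entity scale the same computation is delegated to a FAISS index (IndexFlatIP for exact search):

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, d = 1000, 8  # toy stand-ins for 5.9M entities and BERT dims

# Precomputed and cached entity embeddings y_e for the whole candidate pool.
entity_index = rng.normal(size=(n_entities, d))

def retrieve(y_m: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the top-k entities by dot product with y_m."""
    scores = entity_index @ y_m
    return np.argsort(-scores)[:k]

query = rng.normal(size=d)          # a mention embedding
candidates = retrieve(query, k=100)
```

The candidate indices would then be handed to the cross-encoder for re-ranking.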

Cross-encoder
Our cross-encoder is similar to the ones described by Logeswaran et al. (2019) and Humeau et al. (2019). The input is the concatenation of the context/mention representation and the entity representation described in Section 4.1 (we remove the [CLS] token from the entity representation). This allows the model to perform deep cross-attention between the context and the entity description. Formally, our context-candidate embedding y_{m,e} is:

y_{m,e} = red(T_cross(τ_{m,e}))

where τ_{m,e} is the input representation of the mention and entity pair, T_cross is a transformer, and red(·) is the same function defined in Section 4.1.
Scoring To score entity candidates, a linear layer W is applied to the embedding y_{m,e}:

s_cross(m, e) = y_{m,e} W

Optimization As in Section 4.1, the network is trained using a softmax loss to maximize s_cross(m_i, e_i) for the correct entity, given a set of entity candidates (as in Equation 4). Due to its larger memory and compute footprint, we use the cross-encoder only in a re-ranking stage, over a small set (≤ 100) of candidates retrieved with the bi-encoder; the cross-encoder is not suitable for retrieval or other tasks that require fast inference.
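The re-ranking stage can be sketched as follows. The `cross_encode` function here is a deterministic stand-in (a hash seeding a random vector) for the real joint BERT pass over the concatenated pair, so only the control flow and the linear scoring layer reflect the method above:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                     # toy embedding size
W = rng.normal(size=d)    # the linear scoring layer from the paper

def cross_encode(mention: str, entity: str) -> np.ndarray:
    """Stand-in for red(T_cross(tau_{m,e})): a real system runs one
    transformer pass over the concatenated pair; here we hash the pair
    to a fixed pseudo-random vector of the right shape."""
    h = abs(hash((mention, entity))) % (2**32)
    return np.random.default_rng(h).normal(size=d)

def rerank(mention: str, candidates: list) -> list:
    """Score each bi-encoder candidate with s_cross = y_{m,e} . W,
    then sort candidates by descending score."""
    scores = [float(cross_encode(mention, c) @ W) for c in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]

ranked = rerank("... a ride in the Jaguar!", ["Jaguar_cars", "Jaguar! (coaster)"])
```

Because every candidate requires a full joint forward pass, the cost grows linearly in the candidate count, which is why re-ranking is limited to ≤ 100 candidates.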

Knowledge Distillation
To better optimize the accuracy-speed trade-off, we also report knowledge distillation experiments that use a cross-encoder as a teacher for a bi-encoder model. We follow Hinton et al. (2015) to use a softmax with temperature where the target distribution is based on the cross-encoder logits.
Concretely, let z be a vector of logits over a set of entity candidates and T a temperature, and let σ(z, T) be the (tempered) distribution over the entities with

σ(z, T)_i = exp(z_i / T) / Σ_j exp(z_j / T)

The overall loss function, incorporating both the distillation and student losses, is calculated as

L = α · H(e, σ(z_s, 1)) + (1 − α) · H(σ(z_t, T), σ(z_s, T))

where e is the ground-truth label distribution with probability 1 on the gold entity, H is the cross-entropy loss function, and α is the coefficient mixing the distillation loss with the student loss L_st = H(e, σ(z_s, 1)). The student logits z_s are the outputs of the bi-encoder scoring function s(m, e_i); the teacher logits z_t are the outputs of the cross-encoder scoring function s_cross(m, e).
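A small NumPy sketch of the tempered softmax and the mixed loss described above, following Hinton et al. (2015); the exact weighting of the two terms and the example logits are assumptions for illustration:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Tempered softmax: sigma(z, T)_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = z / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """H(p, q) = -sum_i p_i log q_i."""
    return float(-(p * np.log(q + 1e-12)).sum())

def distillation_loss(z_s, z_t, gold: int, T: float = 2.0, alpha: float = 0.5):
    """alpha * student loss + (1 - alpha) * distillation loss."""
    e = np.zeros_like(z_s)
    e[gold] = 1.0                               # one-hot gold distribution
    L_st = cross_entropy(e, softmax(z_s))       # student loss against gold
    L_kd = cross_entropy(softmax(z_t, T), softmax(z_s, T))  # vs. teacher
    return alpha * L_st + (1 - alpha) * L_kd

# Toy bi-encoder (student) and cross-encoder (teacher) logits over 3 candidates.
loss = distillation_loss(np.array([2.0, 0.5, -1.0]),
                         np.array([1.5, 1.0, -2.0]), gold=0)
```

In the paper's experiments, α = 0.5 and T ∈ [2, 5] (Appendix A).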

Experiments
In this section, we perform an empirical study of our model on three challenging datasets.

Datasets
The

Evaluation Setup and Results
We experiment with both BERT-base and BERT-large for our bi-encoders and cross-encoders. The details of the training infrastructure and hyper-parameters can be found in Appendix A. All models are implemented in PyTorch and optimized with Adam (Kingma and Ba, 2014). We use (base) and (large) to indicate the version of our model in which the underlying pre-trained transformer is BERT-base and BERT-large, respectively.

Zero-shot Entity Linking
First, we train our bi-encoder on the training set, initializing each encoder with pre-trained BERT-base. Hyper-parameters are chosen based on Recall@64 on the validation dataset; for specifics, see Appendix A.2. Our bi-encoder achieves much higher recall than BM25, as shown in Figure 2. Following Logeswaran et al. (2019), we use the top 64 retrieved candidates for the ranker, and we report Recall@64 on train, validation, and test in Table 1. After training the bi-encoder for candidate generation, we train our cross-encoder (initialized with pre-trained BERT) on the top 64 candidates retrieved by the bi-encoder for each sample in the training set, and evaluate the cross-encoder on the test dataset. Overall, we obtain a much better end-to-end accuracy, as shown in Table 2, largely due to the improvement in the retrieval stage.

We also report cross-encoder performance with the same retrieval method (BM25) used by Logeswaran et al. (2019) in Table 3, where performance is evaluated on the subset of test instances for which the gold entity is among the top 64 candidates retrieved by BM25. We observe that our cross-encoder obtains slightly better results than those reported by Logeswaran et al. (2019), likely due to implementation and hyper-parameter details.

TACKBP-2010
Following prior work (Sun et al., 2015; Gillick et al., 2019; Onoe and Durrett, 2019), we pre-train our models on Wikipedia data. Data and model training details can be found in Appendix A.1.

After training our model on Wikipedia, we fine-tune it on the TACKBP-2010 training dataset. We use the top 100 candidates retrieved by the bi-encoder as training examples for the cross-encoder, and choose hyper-parameters based on cross-validation. We report accuracy results in Table 4. For ablation studies, we also report the following versions of our model:
1. Bi-encoder only: we use the bi-encoder for candidate ranking instead of the cross-encoder.
2. Full Wikipedia: we use the 5.9M Wikipedia articles as our entity knowledge base, instead of the TACKBP reference knowledge base.
3. Full Wikipedia w/o finetune: same as above, without fine-tuning on the TACKBP-2010 training set.
As expected, the cross-encoder performs better than the bi-encoder at ranking. However, both models exceed previous state-of-the-art performance levels, demonstrating that the overall approach is highly effective. We observe that our model also performs well when we change the underlying knowledge base to full Wikipedia, and even without fine-tuning on the dataset. In Table 5 we show that our bi-encoder model is highly effective at retrieving relevant entities when the underlying knowledge base is full Wikipedia.
There are, however, many other cues that could be added in future work. For example, Khalife and Vazirgiannis (2018) report 94.57% precision on the TACKBP-2010 dataset, but their method rests on the strong assumption that a gold fine-grained entity type is given for each mention (they do not attempt entity type prediction). Indeed, if fine-grained entity type information is given by an oracle at test time, then Raiman and Raiman (2018) report 98.6% accuracy on TACKBP-2010, indicating that improving fine-grained entity type prediction would likely improve entity linking. Our results are achieved without gold fine-grained entity type information; instead, our model learns representations of context, mention, and entities from text alone.

WikilinksNED Unseen-Mentions
Similarly to the approach described in Section 5, we evaluate both our model trained on Wikipedia and applied directly to the test set, and our model trained on this dataset directly, without training on Wikipedia examples. We report our models' accuracy on the test set in Table 6, along with the baseline models presented by Onoe and Durrett (2019). We observe that our model outperforms all the baseline models.
Inference time efficiency To illustrate the efficiency of our bi-encoder model, we profiled retrieval speed on a server with an Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB of memory. At inference time, we first compute the embeddings for the pool of 5.9M entities. This step is resource intensive but can be parallelized; on 8 Nvidia Volta v100 GPUs, it takes about 2.8 hours to compute all entity embeddings. Given a query mention embedding, we use the FAISS (Johnson et al., 2019) IndexFlatIP index type (exact search) to obtain the top 100 entity candidates. On the WikilinksNED Unseen-Mentions test dataset, which contains 10K queries, it takes 9.2 ms on average to return the top 100 candidates per query in batch mode. We also explore approximate search options in FAISS, choosing the IndexHNSWFlat index type following Karpukhin et al. (2020); it takes additional time for index construction but reduces the average time per query. In Table 7, we see that HNSW1 (128 neighbors stored per node; construction-time search depth 200; search depth 256; construction time 2.1h) reduces the average query time to 2.6 ms with less than a 1.2% drop in accuracy and recall, and HNSW2 further reduces the query time to 1.4 ms with less than a 2.1% drop.

Influence of number of candidates retrieved
In a two-stage entity linking system, the number of candidates retrieved influences overall model performance. Prior work often used a fixed number k of candidates, with k ranging from 5 to 100; for instance, Yamada et al. (2016) and Ganea and Hofmann (2017) choose k = 30, while Logeswaran et al. (2019) choose k = 64. When k is larger, retrieval recall increases, but ranking-stage accuracy is likely to decrease; further, increasing k often increases the run time of the ranking stage. We explore different choices of k and present the recall@k curve, ranking-stage accuracy, and overall accuracy in Figure 3. Based on overall accuracy, we find that k = 10 is optimal.

Knowledge Distillation
In this section, we present results on knowledge distillation, using our cross-encoder as a teacher model and bi-encoder as a student model.

Table 8: Example predictions from the bi-encoder and cross-encoder models.

Mention | Bi-encoder | Cross-encoder
But surely the biggest surprise is Ronaldo's drop in value, despite his impressive record of 53 goals and 14 assists in 75 appearances for Juventus. | Ronaldo (Brazilian footballer) | Cristiano Ronaldo
... they spent eleven days in the United Kingdom and Spain, photographing things like Gothic statues, bricks, and stone pavements for use in textures. | Gothic fiction | Gothic art
To many people in many cultures, music is an important part of their way of life. Ancient Greek and Indian philosophers defined music as tones ... | Ancient Greek | Ancient Greek philosophy

We experiment with knowledge distillation on the TACKBP-2010 and WikilinksNED Unseen-Mentions datasets. We use the bi-encoder pre-trained on Wikipedia as the student model, and fine-tune it on each dataset with knowledge distillation from the teacher model: the best-performing cross-encoder model pre-trained on Wikipedia and fine-tuned on the same dataset.
We also fine-tune the student model on each dataset without the knowledge distillation component, as a baseline. As shown in Table 9, the bi-encoder trained with knowledge distillation from the cross-encoder outperforms the bi-encoder trained without it, providing another point on the accuracy-speed trade-off curve for these architectures.

Qualitative Analysis

Table 8 presents some examples of our bi-encoder and cross-encoder model predictions, to provide intuition for how these two models use context and mention information for entity linking.
In the first example, the bi-encoder mistakenly links "Ronaldo" to the Brazilian football player, while the cross-encoder is able to use the context word "Juventus" to disambiguate. In the second example, the cross-encoder identifies from context that the sentence is describing art rather than fiction, where the bi-encoder fails. In the third example, the bi-encoder finds the correct entity "Ancient Greek," whereas the cross-encoder mistakenly links the mention to "Ancient Greek philosophy," likely because the word "philosophers" appears in the context. We observe that the cross-encoder is often better at utilizing context information than the bi-encoder, but can sometimes be misled by context cues.

Conclusion
We proposed a conceptually simple, scalable, and highly effective two-stage approach for entity linking. We show that our BERT-based model outperforms IR methods for entity retrieval, and achieves new state-of-the-art results on the recently introduced zero-shot entity linking dataset, on the WikilinksNED Unseen-Mentions dataset, and on the more established TACKBP-2010 benchmark, without any task-specific heuristics or external entity knowledge. We present evaluations of the accuracy-speed trade-off inherent to large pre-trained models, and show that it is possible to achieve efficient linking with a modest loss of accuracy. Finally, we show that knowledge distillation can further improve bi-encoder performance. Future work includes:
• Enriching entity representations by adding entity type and entity graph information;
• Modeling coherence by jointly resolving mentions in a document;
• Extending our work to other languages and other domains;
• Joint models for mention detection and entity linking.

A Training details and hyper-parameters

Optimization
• Computing infrastructure: we use 8 Nvidia Volta v100 GPUs for model training.
• Bounds for each hyper-parameter: see Table 10. In addition, for our bi-encoders, we use a maximum number of tokens in [32, 64, 128] for the context/mention encoder and 128 for the candidate encoder. In our knowledge distillation experiments, we set α = 0.5 and T in [2, 5].
We use grid search for hyperparameters, for a total number of 24 trials.
• Number of model parameters: see Table 11.
• For all our experiments we use accuracy on validation set as criterion for selecting hyperparameters.

A.1 Training on Wikipedia data
We use Wikipedia data to train our models first, then fine-tune them on each specific dataset; this approach is used in our experiments on the TACKBP-2010 and WikilinksNED Unseen-Mentions datasets. We use the May 2019 English Wikipedia dump, which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as training data for the bi-encoder model (9M examples in total), with a holdout set of 10K examples for validation. We train our cross-encoder model on the top 100 results retrieved by our bi-encoder model on Wikipedia data. For the training of the cross-encoder model,