Zero-shot Entity Linking with Dense Entity Retrieval

We consider the zero-shot entity-linking challenge where each entity is defined by a short textual description, and the model must read these descriptions together with the mention context to make the final linking decisions. In this setting, retrieving entity candidates can be particularly challenging, since many of the common linking cues such as entity alias tables and link popularity are not available. In this paper, we introduce a simple and effective two-stage approach for zero-shot linking, based on fine-tuned BERT architectures. In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then examined more carefully with a cross-encoder that concatenates the mention and entity text. Our approach achieves a nearly 5-point absolute gain on a recently introduced zero-shot entity linking benchmark, driven largely by improvements over previous IR-based candidate retrieval. We also show that it performs well in the non-zero-shot setting, obtaining the state-of-the-art result on TACKBP-2010.


Introduction
Traditional entity linking approaches often assume that entities to be linked at test time are present in the training set. However, for many practical use cases a zero-shot scenario is more appropriate (Logeswaran et al., 2019). Here a short description of the entity is the only piece of information accessible at test time. In this setting, retrieving entity candidates can be particularly challenging, since many of the common linking cues such as entity alias tables and link popularity are not available. Previous work has used TF-IDF-based techniques (Logeswaran et al., 2019), but we show that performance can be significantly boosted by instead doing retrieval in a dense embedding space.
* Work done during internship with Facebook.
We introduce a two-stage approach for zero-shot linking, based on fine-tuned BERT architectures. In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions (Humeau et al., 2019; Gillick et al., 2019). We show that BERT-based models provide very effective bi-encoders, significantly boosting overall recall compared to IR-based methods. Each retrieved candidate is then examined more carefully with a cross-encoder that concatenates the mention and entity text, following Logeswaran et al. (2019). This overall approach is simple but highly effective, as we show through detailed experiments. It is also scalable if efficient nearest neighbor methods are used in the first stage, as long as the candidate entity set can be kept small.
We evaluate performance on the recently introduced Wikia zero-shot corpus, as well as the more established TACKBP-2010 benchmark (Ji et al., 2010). Our two-stage approach achieves a nearly 5 point absolute gain on Wikia, driven largely by improvements in the retrieval component.
We also show that it achieves a new state-of-the-art result on TACKBP-2010, a non-zero-shot setup, with an over 30% relative error reduction. By simply reading the provided text descriptions, we are able to outperform previous methods that included many extra cues such as entity name dictionaries and link popularity.

Related Work
We follow most recent work in studying entity linking with gold mentions. Given mentions, the entity linking task can be broken into two steps: candidate generation and candidate ranking. Prior work has used frequency information, alias tables and TF-IDF-based methods for candidate generation. For candidate ranking, He et al. (2013), Sun et al. (2015), Yamada et al. (2016), Ganea and Hofmann (2017), and Kolitsas et al. (2018) have established state-of-the-art results using neural networks to model context word, span and entity. There is also recent work demonstrating that fine-grained entity typing information helps entity linking (Raiman and Raiman, 2018; Onoe and Durrett, 2019; Khalife and Vazirgiannis, 2018).
[Figure: example mention "My kids really enjoyed a ride in the Jaguar!" with Wikipedia entity descriptions in the dense space: "Jaguar is the luxury vehicle brand." (Jaguar_cars) and "Jaguar! is a junior roller coaster."]
We build directly on two recent results. Logeswaran et al. (2019) proposed the zero-shot entity linking task, where mentions must be linked to unseen entities without in-domain labeled data. They use cross-encoders for entity ranking, but rely on IR techniques for candidate generation. Gillick et al. (2019) show that dense embeddings work well for candidate generation, but use relatively simple neural bi-encoder architectures. Our approach can be seen as a combination of ideas from both of these lines of work. Humeau et al. (2019) studied how to use deep pre-trained bidirectional transformers and performed a detailed comparison of three architectures, namely the Bi-encoder, Poly-encoder, and Cross-encoder, on sentence selection tasks in dialogue. Inspired by Humeau et al. (2019), we apply similar architectures to the problem of entity linking and, in addition, demonstrate that the Bi-encoder can be a strong model for retrieval.

Definition and Task Formulation
Entity Linking
Given an input text document $D = \{w_1, \dots, w_n\}$ of words and a list of entity mentions $M_D = \{m_1, \dots, m_n\}$, the output of an entity linking model is a list of mention-entity pairs $\{(m_i, e_i)\}_{i \in [1, n]}$, where each entity is an entry in a knowledge base (KB) (e.g., Wikipedia), $e \in E$. We assume that the title and description of the entities are available, which is a common setting in entity linking.
We assume each mention has a valid gold entity in the KB, which is usually referred to as in-KB evaluation. We leave out-of-KB prediction (i.e., nil prediction) to future work.

Zero-shot Entity Linking
We also study zero-shot entity linking (Logeswaran et al., 2019). Here the document setup is the same, but the knowledge base differs between training and test time. Formally, let $E_{train}$ and $E_{test}$ denote the knowledge bases used in training and test; we require $E_{train} \cap E_{test} = \emptyset$.
The text documents, mentions, and entity dictionaries are separated between training and test so that the entities being linked at test time are unseen.
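The disjointness requirement can be checked mechanically before training. The following is a minimal sketch, assuming entities are identified by string ids; the function name `zero_shot_overlap` and the example ids are hypothetical and not part of the benchmark tooling.

```python
def zero_shot_overlap(train_entities, test_entities):
    """Return the entity ids that appear in both knowledge bases.

    An empty result means the split satisfies E_train ∩ E_test = ∅.
    """
    return set(train_entities) & set(test_entities)


# Hypothetical entity ids, for illustration only.
train_kb = ["Jaguar_cars", "Roller_coaster"]
test_kb = ["Leopard", "Panther"]
leaked = zero_shot_overlap(train_kb, test_kb)
if leaked:
    raise ValueError(f"split is not zero-shot, shared entities: {sorted(leaked)}")
```

Running such a check once per split is cheap and guards against accidentally evaluating on entities seen during training.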

Methodology
We use BERT base in our bi-encoder and cross-encoder models, as described in Sections 4.1 and 4.2. Figure 1 shows the overall approach. The bi-encoder uses two independent BERT transformers to encode the context/mention and the entity into dense vectors, and each entity candidate is scored as the dot product of these vectors. The cross-encoder encodes the context/mention and the entity in one transformer, and applies an additional linear layer to compute the final score for each pair.

Bi-encoder
Architecture
We use a bi-encoder architecture similar to that of Humeau et al. (2019) to model (mention, entity) pairs. This approach allows for fast, real-time inference, as the candidate representations can be cached. Both the input context and the candidate entity are encoded into vectors:
$$y_m = \mathrm{red}(T_1(\tau_m)), \qquad y_e = \mathrm{red}(T_2(\tau_e)),$$
where $\tau_m$ and $\tau_e$ are the input representations of the mention and the entity respectively, and $T_1$ and $T_2$ are two transformers. $\mathrm{red}(\cdot)$ is a function that reduces the sequence of vectors produced by a transformer into one vector. Following the experiments of Humeau et al. (2019), we choose $\mathrm{red}(\cdot)$ to be the last-layer output at the [CLS] token.

Context and Mention Modeling
The representation of context and mention, $\tau_m$, is composed of the word-pieces of the context surrounding the mention and of the mention itself. Specifically, we construct the input of each mention example as:
[CLS] ctxt_l [M_s] mention [M_e] ctxt_r [SEP]
where mention, ctxt_l, and ctxt_r are the word-piece tokens of the mention, the context before the mention, and the context after the mention respectively, and [M_s] and [M_e] are special tokens that tag the mention. The maximum length of the input representation is a hyper-parameter in our model.
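As an illustration, the mention input could be assembled as in the sketch below. It assumes the layout [CLS] ctxt_l [M_s] mention [M_e] ctxt_r [SEP] and inputs that are already split into word-pieces. The function name `build_mention_input`, the token strings "[Ms]" and "[Me]", and the truncation policy (split the remaining budget evenly and keep the context closest to the mention) are assumptions made for this example, not details confirmed by the text.

```python
def build_mention_input(ctxt_l, mention, ctxt_r, max_len=128):
    """Return word-piece tokens: [CLS] ctxt_l [Ms] mention [Me] ctxt_r [SEP]."""
    # Four special tokens are always present: [CLS], [Ms], [Me], [SEP].
    budget = max_len - 4 - len(mention)
    if budget < 0:
        raise ValueError("mention alone exceeds max_len")
    # Assumed policy: split the remaining budget evenly and keep the
    # context tokens closest to the mention on each side.
    left_quota = budget // 2
    right_quota = budget - left_quota
    left = ctxt_l[max(0, len(ctxt_l) - left_quota):]
    right = ctxt_r[:right_quota]
    return ["[CLS]"] + left + ["[Ms]"] + mention + ["[Me]"] + right + ["[SEP]"]


tokens = build_mention_input(
    ["my", "kids", "enjoyed", "a", "ride", "in", "the"], ["jaguar"], ["!"], max_len=9
)
```

With `max_len=9`, only the two left-context tokens nearest the mention survive, which shows how the length hyper-parameter trades context for sequence cost.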

Entity Modeling
The entity representation $\tau_e$ is similarly composed of the word-pieces of the entity title and description. The input of our entity model is:
[CLS] title [ENT] description [SEP]
where title and description are the word-piece tokens of the entity title and description, and [ENT] is a special token that separates the entity title from the description.
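The entity side can be sketched in the same way, assuming the layout [CLS] title [ENT] description [SEP]. The function name `build_entity_input` and the policy of keeping the title whole while truncating the description are assumptions for illustration.

```python
def build_entity_input(title, description, max_len=128):
    """Return word-piece tokens: [CLS] title [ENT] description [SEP]."""
    # Three special tokens are always present: [CLS], [ENT], [SEP].
    budget = max_len - 3 - len(title)
    if budget < 0:
        raise ValueError("title alone exceeds max_len")
    # Assumed policy: keep the title whole and truncate the description.
    return ["[CLS]"] + title + ["[ENT]"] + description[:budget] + ["[SEP]"]


entity_tokens = build_entity_input(
    ["jaguar", "cars"],
    ["jaguar", "is", "the", "luxury", "vehicle", "brand"],
    max_len=8,
)
```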

Scoring
The score of entity candidate $e_i$ is given by the dot product:
$$s(m, e_i) = y_m \cdot y_{e_i}$$

Optimization
The network is trained using a softmax loss to maximize the score of the correct entity with respect to random entities. For each training pair $(m_i, e_i)$ in a batch of $B$ pairs, the loss is computed as:
$$\mathcal{L}(m_i, e_i) = -s(m_i, e_i) + \log \sum_{j=1}^{B} \exp(s(m_i, e_j))$$
Following previous work (e.g., Lerer et al., 2019; Humeau et al., 2019), in training we consider the other elements of the batch as negatives. Lerer et al. (2019) presented a detailed analysis of the speed and memory efficiency of using batched random negatives in large-scale systems.
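The in-batch objective can be sketched in plain Python as follows. The sketch assumes the standard softmax cross-entropy form, in which the gold entity of mention i sits at position i of the batch and every other entity in the batch acts as a negative. Averaging over the batch and the function names are assumptions for illustration; a real implementation would operate on tensors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(mention_vecs, entity_vecs):
    """Mean over i of  -s(m_i, e_i) + log sum_j exp(s(m_i, e_j))."""
    assert len(mention_vecs) == len(entity_vecs)
    losses = []
    for i, m in enumerate(mention_vecs):
        # Score mention i against every entity in the batch.
        scores = [dot(m, e) for e in entity_vecs]
        # Subtract the max before exponentiating for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


loss = in_batch_softmax_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the negatives are the encoded entities already present in the batch, no extra entity forward passes are needed, which is what makes this scheme cheap for the bi-encoder.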

Inference
At inference time, the entity representations for all entity candidates can be pre-computed and cached. The inference task is then reduced to finding the maximum dot product between the mention representation and the entity candidate representations. We use brute-force search in our experiments; in a large-scale setting, this can be done efficiently using fast nearest neighbor search libraries such as FAISS (Johnson et al., 2019).
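A brute-force version of this search is short enough to sketch directly. The example assumes the cached entity embeddings are held in a dictionary keyed by entity id; the function name `retrieve_top_k` and the toy vectors are hypothetical.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(mention_vec, entity_index, k=64):
    """Exhaustive maximum-dot-product search over cached entity vectors.

    entity_index maps entity id -> pre-computed embedding y_e.
    Returns up to k (entity_id, score) pairs, best first.
    """
    scored = ((eid, dot(mention_vec, vec)) for eid, vec in entity_index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy two-dimensional embeddings, for illustration only.
index = {
    "Jaguar_cars": [0.9, 0.1],
    "Jaguar!": [0.2, 0.8],
    "Leopard": [-0.5, -0.3],
}
top = retrieve_top_k([0.0, 1.0], index, k=2)
```

The cost is linear in the number of entities per mention, which is acceptable for dictionaries of the sizes used here; an approximate nearest neighbor index would replace the exhaustive scan at larger scale.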

Cross-encoder
Our cross-encoder is similar to the ones described by Logeswaran et al. (2019) and Humeau et al. (2019). The input is the concatenation of the context and mention representation and the entity representation described in Section 4.1 (we remove the [CLS] token from the entity representation). This allows the model to have deep cross attention between the context and entity descriptions, and often produces better empirical results compared to the bi-encoder. Formally, we use $y_{m,e}$ to denote our context-candidate embedding:
$$y_{m,e} = \mathrm{red}(T_{cross}(\tau_{m,e}))$$
where $\tau_{m,e}$ is the input representation of mention and entity, $T_{cross}$ is a transformer, and $\mathrm{red}(\cdot)$ is the same function as defined in Section 4.1.

Scoring
To score an entity candidate, a linear layer $W$ is applied to the embedding $y_{m,e}$ to reduce it from a vector to a scalar:
$$s_{cross}(m, e) = y_{m,e} W$$

Optimization
Similar to the method in Section 4.1, the network is trained using a softmax loss to maximize $s_{cross}(m_i, e_i)$ for the correct entity, given a set of entity candidates.
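The scoring and loss for one mention can be sketched as below. The sketch assumes the context-candidate embeddings have already been produced by the transformer, that the linear layer is a single weight vector without a bias term, and that the loss is a softmax over the candidate set; the function names are hypothetical.

```python
import math


def cross_scores(pair_embeddings, w):
    """s_cross(m, e) = y_{m,e} W for each candidate; w is a weight vector."""
    return [sum(y_i * w_i for y_i, w_i in zip(y, w)) for y in pair_embeddings]


def candidate_softmax_loss(scores, gold_index):
    """-s(gold) + log of the sum over candidates of exp(s)."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[gold_index]


# Two candidates for one mention, with toy 2-d embeddings.
scores = cross_scores([[1.0, 2.0], [0.0, 1.0]], [0.5, -1.0])
loss = candidate_softmax_loss(scores, gold_index=1)
```

Each candidate requires its own forward pass through the transformer, so the candidate set per mention has to stay small, in contrast with the bi-encoder, where negatives come for free from the batch.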
Unlike in the bi-encoder, where one can recycle the other entities of the batch as negatives, training the cross-encoder is more memory-intensive. We therefore use the cross-encoder only for the re-ranking stage, applied to the candidates retrieved by the bi-encoder. The cross-encoder is not suitable for retrieval or for tasks that require fast inference.

Experiments
In this section, we perform an empirical study of our model on two challenging datasets.

Datasets
Zero-shot EL dataset was constructed by Logeswaran et al. (2019) from Wikia. The task is to link entity mentions in text to an entity dictionary with provided entity descriptions, in a set of domains. There are 49K, 10K, and 10K examples in the train, validation, and test sets respectively. The entities in the validation and test sets are from different domains than the train set, allowing for evaluation of performance on entirely unseen entities. The entity dictionaries cover different domains and range in size from 10K to 100K entities.
TACKBP-2010 is widely used for evaluating entity linking systems (Ji et al., 2010). Following prior work, we measure in-KB accuracy (P@1). There are 1074 and 1020 annotated mention/entity pairs, derived from 1453 and 2231 original news and web documents, in the training and evaluation datasets respectively. All the entities are from the TAC Reference Knowledgebase, which contains 818,741 entities with titles, descriptions and other meta info.

Zero-shot Entity Linking
First, we train our bi-encoder on the training set, initializing each encoder with pre-trained BERT base. Hyper-parameters are chosen based on Recall@64 on the validation set, following a previously suggested range. Our bi-encoder achieves much higher recall compared to BM25, as shown in Figure 2.
We then train our cross-encoder (initialized with pre-trained BERT base) on the top 64 retrieved candidates for each sample in the training set, and evaluate the cross-encoder on the test set. By improving the retrieval part of the system, we are able to obtain much better end-to-end accuracy, as shown in Table 2.
We also report cross-encoder performance with the same retrieval method (BM25) used by Logeswaran et al. (2019) in Table 3. We observe that our cross-encoder obtains slightly better results than reported by Logeswaran et al. (2019), likely due to implementation and hyper-parameter details.
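The relation between retrieval recall and end-to-end accuracy can be made concrete with a small sketch. It assumes each mention has one gold entity id, a ranked list of retrieved candidates, and a single top prediction from the re-ranker; the function names and toy ids are hypothetical.

```python
def recall_at_k(retrieved, gold, k=64):
    """Fraction of mentions whose gold entity is in the top-k retrieved list."""
    hits = sum(1 for cands, g in zip(retrieved, gold) if g in cands[:k])
    return hits / len(gold)


def end_to_end_accuracy(retrieved, reranked_top1, gold, k=64):
    """A mention counts as correct only if the gold entity was retrieved
    in the top k and the re-ranker placed it first."""
    correct = sum(
        1
        for cands, pred, g in zip(retrieved, reranked_top1, gold)
        if g in cands[:k] and pred == g
    )
    return correct / len(gold)


# Toy data: three mentions, two retrieved candidates each.
retrieved = [["a", "b"], ["c", "d"], ["e", "f"]]
gold = ["a", "d", "z"]
reranked_top1 = ["a", "c", "e"]
recall = recall_at_k(retrieved, gold, k=2)
accuracy = end_to_end_accuracy(retrieved, reranked_top1, gold, k=2)
```

Since a mention whose gold entity is missing from the candidate list can never be linked correctly, end-to-end accuracy is bounded above by Recall@k, which is why gains in the retrieval stage carry through to the final result.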


TACKBP-2010
Following prior work (Sun et al., 2015; Gillick et al., Table 4. We also report a version of our model where we use the bi-encoder for candidate ranking instead of the cross-encoder. As expected, the cross-encoder performs better than the bi-encoder on ranking. However, both models exceed state-of-the-art performance levels, demonstrating that the overall approach is highly effective. There are, however, many other cues that could potentially be added in future work. For example, Khalife and Vazirgiannis (2018) report 94.57% precision on the TACKBP-2010 dataset. However, their method is based on the strong assumption that a gold fine-grained entity type is given for each mention (and they do not attempt to do entity type prediction). Indeed, if fine-grained entity type information is given by an oracle at test time, then Raiman and Raiman (2018) report 98.6% accuracy on TACKBP-2010, indicating that improving fine-grained entity type prediction would likely improve entity linking. Our results are achieved without assuming that fine-grained entity type information is given. Instead, our model learns representations of context, mention and entity based on text only.

Method              Accuracy
He et al. (2013)    81.0
Sun et al. (2015)   83

Conclusion
We proposed a simple, scalable, and effective two-stage approach for entity linking. We show that our BERT-based model outperforms IR methods for entity retrieval and achieves new state-of-the-art results on a recently introduced zero-shot entity linking dataset, as well as on the more established TACKBP-2010 benchmark, without any task-specific heuristics. Future work includes:
• Enriching entity representations by adding entity type information and entity graph information.
• Modeling coherence by jointly resolving mentions in a document.
• Extending our work to other languages and other domains.