Entity Linking in 100 Languages

We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset (http://goo.gle/mewsli-dataset) matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.


Introduction
Entity linking (EL) fulfils a key role in grounded language understanding: Given an ungrounded entity mention in text, the task is to identify the entity's corresponding entry in a Knowledge Base (KB). In particular, EL provides grounding for applications like Question Answering (Févry et al., 2020b) (also via Semantic Parsing (Shaw et al., 2019)) and Text Generation (Puduppully et al., 2019); it is also an essential component in knowledge base population (Shen et al., 2014). Entities have played a growing role in representation learning. For example, entity mention masking led to greatly improved fact retention in large language models (Guu et al., 2020; Roberts et al., 2020).
But to date, the primary formulation of EL outside of the standard monolingual setting has been cross-lingual: link mentions expressed in one language to a KB expressed in another (McNamee et al., 2011; Tsai and Roth, 2016; Sil et al., 2018). The accompanying motivation is that KBs may only ever exist in some well-resourced languages, but that text in many different languages needs to be linked. Recent work in this direction features progress on low-resource languages (Zhou et al., 2020), zero-shot transfer (Sil and Florian, 2016; Zhou et al., 2019) and scaling to many languages (Pan et al., 2017), but commonly assumes a single primary KB language and a limited KB, typically English Wikipedia.
We contend that this popular formulation limits the scope of EL in ways that are artificial and inequitable.
First, it artificially simplifies the task by restricting the set of viable entities and reducing the variety of mention ambiguities. Limiting the focus to entities that have English Wikipedia pages understates the real-world diversity of entities. Even within the Wikipedia ecosystem, many entities only have pages in languages other than English. These are often associated with locales that are already underrepresented on the global stage. By ignoring these entities and their mentions, most current modeling and evaluation work tends to side-step underappreciated challenges faced in practical industrial applications, which often involve KBs much larger than English Wikipedia, with a much more significant zero- or few-shot inference problem.
Second, it entrenches an English bias in EL research that is out of step with the encouraging shift toward inherently multilingual approaches in natural language processing, enabled by advances in representation learning (Johnson et al., 2017;Pires et al., 2019;Conneau et al., 2020).
Third, much recent EL work has focused on models that rerank entity candidates retrieved by an alias table (Févry et al., 2020a), an approach that works well for English entities with many linked mentions, but less so for the long tail of entities and languages.
To overcome these shortcomings, this work makes the following key contributions:
• Reformulate entity linking as inherently multilingual: link mentions in 104 languages to entities in WikiData, a language-agnostic KB.
• Advance prior dual encoder retrieval work with improved mention and entity encoder architecture and improved negative mining targeting.
• Establish new state-of-the-art performance relative to prior cross-lingual linking systems, with one model capable of linking 104 languages against 20 million WikiData entities.
• Introduce Mewsli-9, a large dataset with nearly 300,000 mentions across 9 diverse languages with links to WikiData. The dataset features many entities that lack English Wikipedia pages and which are thus inaccessible to many prior cross-lingual systems.
• Present frequency-bucketed evaluation that highlights zero- and few-shot challenges with clear headroom, implicitly including low-resource languages without enumerating results over a hundred languages.

Task Definition
Multilingual Entity Linking (MEL) is the task of linking an entity mention m in some context language l_c to the corresponding entity e ∈ V in a language-agnostic KB. That is, while the KB may include textual information (names, descriptions, etc.) about each entity in one or more languages, we make no prior assumption about the relationship between these KB languages L_KB = {l_1, ..., l_k} and the mention-side language: l_c may or may not be in L_KB. This is a generalization of cross-lingual EL (XEL), which is concerned with the case where L_KB = {l′} and l_c ≠ l′. Commonly, l′ is English, and V is moreover limited to the set of entities that express features in l′.

MEL with WikiData and Wikipedia
As a concrete realization of the proposed task, we use WikiData (Vrandečić and Krötzsch, 2014) as our KB: it covers a large set of diverse entities, is broadly accessible and actively maintained, and it provides access to entity features in many languages. WikiData itself contains names and short descriptions, but through its close integration with all Wikipedia editions, it also connects entities to rich descriptions (and other features) drawn from the corresponding language-specific Wikipedia pages.
Basing entity representations on features of their Wikipedia pages has been a common approach in EL (e.g. Sil and Florian, 2016;Francis-Landau et al., 2016;Gillick et al., 2019;Wu et al., 2019), but we will need to generalize this to include multiple Wikipedia pages with possibly redundant features in many languages.

WikiData Entity Example
Consider the WikiData entity Sí Ràdio (Q3511500), a now-defunct Valencian radio station. Its KB entry references Wikipedia pages in three languages, including the following Catalan description:
• (Catalan) Sí Ràdio fou una emissora de ràdio musical, la segona de Radio Autonomía Valenciana, S.A. pertanyent al grup Radiotelevisió Valenciana. ("Sí Ràdio was a music radio station, the second of Radio Autonomía Valenciana, S.A., belonging to the Radiotelevisió Valenciana group.")
Note that these Wikipedia descriptions are not direct translations, and contain some name variations. We emphasize that this particular entity would have been completely out of scope in the standard cross-lingual task (Tsai and Roth, 2016), because it does not have an English Wikipedia page.
In our analysis, there are millions of WikiData entities with this property, meaning the standard setting skips over the substantial challenges of modeling these (often rarer) entities, and disambiguating them in different language contexts. Our formulation seeks to address this.

Knowledge Base Scope
Our modeling focus is on using unstructured textual information for entity linking, leaving other modalities or structured information as areas for future work. Accordingly, we narrow our KB to the subset of entities that have descriptive text available: We define our entity vocabulary V as all WikiData items that have an associated Wikipedia page in at least one language, independent of the languages we actually model. This gives 19,666,787 entities, substantially more than in any other task settings we have found: the KB accompanying the entrenched TAC-KBP 2010 benchmark (Ji et al., 2010) has fewer than a million entities, and although English Wikipedia continues to grow, recent work using it as a KB still only contends with roughly 6 million entities (Févry et al., 2020a; Zhou et al., 2020). Further, by employing a simple rule to determine the set of viable entities, we avoid potential selection bias based on our desired test sets or the language coverage of a specific pretrained model.

Supervision
We extract a supervision signal for MEL by exploiting the hyperlinks that editors place on Wikipedia pages, taking the anchor text as a linked mention of the target entity. This follows a long line of work in exploiting hyperlinks for EL supervision (Bunescu and Paşca, 2006;Singh et al., 2012;Logan et al., 2019), which we extend here by applying the idea to extract a large-scale dataset of 684 million mentions in 104 languages, linked to WikiData entities. This is at least six times larger than datasets used in prior English-only linking work (Gillick et al., 2019). Such large-scale supervision is beneficial for probing the quality attainable with current-day high-capacity neural models.

Mewsli-9 Dataset
We facilitate evaluation on the proposed multilingual EL task by releasing a matching dataset that covers a diverse set of languages and entities.
Mewsli-9 (Multilingual Entities in News, linked) contains 289,087 entity mentions appearing in 58,717 originally written news articles from WikiNews, linked to WikiData. The corpus includes documents in nine languages, representing five language families. WikiNews articles constitute a somewhat different text genre from our Wikipedia training data: the articles do not begin with a formulaic entity description, for example, and anchor link conventions are likely different. We treat the full dataset as a test set, avoiding any fine-tuning or hyperparameter tuning, thus allowing us to evaluate our model's robustness to domain drift.

Model
Prior work showed that a dual encoder architecture can encode entities and contextual mentions in a dense vector space to facilitate efficient entity retrieval via nearest-neighbors search (Gillick et al., 2019;Wu et al., 2019). We take the same approach.
The dual encoder maps a mention-entity pair (m, e) to a score,

s(m, e) ∝ φ(m)^T ψ(e),

where φ and ψ are learned neural network encoders that encode their arguments as d-dimensional vectors (d=300, matching prior work). Our encoders are BERT-based Transformer networks (Vaswani et al., 2017; Devlin et al., 2019), which we initialize from a pretrained multilingual BERT checkpoint (multi_cased_L-12_H-768_A-12 from github.com/google-research/bert). For efficiency, we only use the first 4 layers, which results in a negligible drop in performance relative to the full 12-layer stack. The WordPiece vocabulary contains 119,547 symbols covering the top 104 Wikipedia languages by frequency; this is the language set we use in our experiments.
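As a minimal sketch of the retrieval step (plain NumPy, with random toy vectors standing in for real encoder outputs): scoring reduces to a dot product between a mention encoding and a matrix of entity encodings, followed by a top-k sort. A production system would use an approximate nearest-neighbor index over the 20 million entities rather than an exact sort.

```python
import numpy as np

D = 300  # encoding dimension, matching the paper


def score(mention_vec: np.ndarray, entity_vecs: np.ndarray) -> np.ndarray:
    """Dot-product scores s(m, e) = phi(m)^T psi(e) for one mention
    against a matrix of entity encodings of shape (num_entities, D)."""
    return entity_vecs @ mention_vec


def retrieve_top_k(mention_vec, entity_vecs, k=100):
    """Exact top-k retrieval by score (descending)."""
    scores = score(mention_vec, entity_vecs)
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Toy example with random encodings in place of phi(m) and psi(e).
rng = np.random.default_rng(0)
phi_m = rng.normal(size=D)
psi_e = rng.normal(size=(5, D))
ids, vals = retrieve_top_k(phi_m, psi_e, k=3)
```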

Mention Encoder
The mention encoder φ uses an input representation that combines local context (the mention span with surrounding words, ignoring sentence boundaries) and simple global context (the document title). The document title, context, and mention span are marked with special separator tokens as well as identifying token type labels (see Figure 1 for details). Both the mention span markers and the document title have been employed in related work (Agarwal and Bikel, 2020; Févry et al., 2020a). We use a maximum sequence length of 64 tokens, similar to prior work (Févry et al., 2020a), up to a quarter of which are used for the document title. The CLS token encoding from the final layer is projected to the encoding dimension to form the final mention encoding.
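A rough sketch of how such an input sequence might be assembled. The `[E1]`/`[E2]` span-marker names are our own placeholders (Figure 1 has the authoritative layout); the title budget follows the stated quarter of the 64-token limit.

```python
def build_mention_input(title_tokens, left_ctx, mention_tokens, right_ctx,
                        max_len=64, max_title=16):
    """Sketch of the mention encoder input: [CLS] title [SEP] context,
    with the mention span wrapped in hypothetical [E1]/[E2] markers.
    Token type ids distinguish the title segment (0) from context (1)."""
    title = title_tokens[:max_title]
    tokens = ["[CLS]"] + title + ["[SEP]"]
    types = [0] * len(tokens)
    span = left_ctx + ["[E1]"] + mention_tokens + ["[E2]"] + right_ctx
    budget = max_len - len(tokens) - 1  # leave room for the final [SEP]
    tokens += span[:budget] + ["[SEP]"]
    types += [1] * (len(tokens) - len(types))
    return tokens, types
```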

Entity Encoders
We experiment with two entity encoder architectures. The first, called Model F, is a featurized entity encoder that uses a fixed-length text description (64 tokens) to represent each entity (see Figure 1). The same 4-layer Transformer architecture is used (without parameter sharing between mention and entity encoders), and again the CLS token vector is projected down to the encoding dimension. Variants of this entity architecture were employed by Wu et al. (2019) and Logeswaran et al. (2019).
The second architecture, called Model E, is simply a QID-based embedding lookup as in Févry et al. (2020a). This latter model is intended as a baseline. A priori, we expect Model E to work well for common entities, less well for rarer entities, and not at all for zero-shot retrieval. We expect Model F to provide more parameter-efficient storage of entity information and possibly improve on zero- and few-shot retrieval.

Entity Description Choice
There are many conceivable ways to make use of entity descriptions from multiple languages. We limit the scope to using one primary description per entity, thus obtaining a single coherent text fragment to feed into the Model F encoder.
We use a simple data-driven selection heuristic based on observed entity usage. Given an entity e, let n_e(l) denote the number of mentions of e in documents of language l, and n(l) the global number of mentions in language l across all entities. From a given source of descriptions (first Wikipedia, then WikiData), we order the candidate descriptions (t_e^{l_1}, t_e^{l_2}, ...) for e first by the per-entity distribution n_e(l) and then by the global distribution n(l). For the example entity in Section 2.1.1, this heuristic selects the Catalan description because 9/16 training examples link to the Catalan Wikipedia page.
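The heuristic amounts to a two-key sort over the candidate description languages (the function and variable names below are ours, not the paper's):

```python
from collections import Counter


def select_description(descriptions, entity_mention_langs, global_mention_langs):
    """Pick one primary description per entity: rank candidate languages
    first by how often this entity is mentioned in that language, n_e(l),
    then by the global mention count of the language, n(l), as tie-break.

    descriptions: dict mapping language code -> description text.
    """
    n_e = Counter(entity_mention_langs)   # per-entity distribution n_e(l)
    n = Counter(global_mention_langs)     # global distribution n(l)
    best = max(descriptions, key=lambda lang: (n_e[lang], n[lang]))
    return best, descriptions[best]
```

For the Sí Ràdio example, 9 of 16 mentions link to the Catalan page, so Catalan wins on the first key alone.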

Training Process
In all our experiments, we use an 8k batch size with in-batch sampled softmax (Gillick et al., 2018). Models are trained with TensorFlow (Abadi et al., 2016) using the Adam optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019). All BERT-based encoders are initialized from a pretrained checkpoint, but the Model E embeddings are initialized randomly. We doubled the batch size until no further held-out set gains were evident, and chose the number of training steps to keep the training time of each phase under one day on a TPU. Further training would likely yield small improvements. See Appendix B for more detail.
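A minimal NumPy sketch of the in-batch sampled softmax: each entity in the batch serves as a negative for every other mention, and the gold pairs lie on the diagonal of the batch score matrix.

```python
import numpy as np


def in_batch_softmax_loss(mention_encs, entity_encs):
    """In-batch sampled softmax over a batch of aligned (m_i, e_i) pairs:
    for each mention i, the other batch entities e_j (j != i) act as
    negatives. Returns mean cross-entropy with the diagonal as gold."""
    logits = mention_encs @ entity_encs.T                 # (B, B) scores
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With well-separated encodings the loss approaches 0; with uninformative (all-equal) scores it equals log(B), the uniform-guess baseline.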

Experiments
We conduct a series of experiments to gain insight into the behavior of the dual encoder retrieval models under the proposed MEL setting, asking:
• What are the relative merits of the two types of entity representations used in Model E and Model F (embeddings vs. encodings of textual descriptions)?
• Can we adapt the training task and hard-negative mining to improve results across the entity frequency distribution?
• Can a single model achieve reasonable performance on over 100 languages while retrieving from a 20 million entity candidate set?

Evaluation Data
We follow Upadhyay et al. (2018) and evaluate on the "hard" subset of the Wikipedia-derived test set introduced by Tsai and Roth (2016) for cross-lingual EL against English Wikipedia, TR2016-hard. This subset comprises mentions for which the correct entity did not appear as the top-ranked item in their alias table, thus stress-testing a model's ability to generalize beyond mention surface forms. Unifying this dataset with our task formulation and data version requires mapping its gold entities from the provided, older Wikipedia titles to newer WikiData entity identifiers (following intermediate Wikipedia redirection links). This succeeded for all but 233 of the 42,073 queries in TR2016-hard; our model receives no credit on the missing ones.
To be compatible with the pre-existing train/test split, we excluded from our training set all mentions appearing on Wikipedia pages in the full TR2016 test set. This was done for all 104 languages, to avoid cross-lingual overlap between train and test sets. This aggressive scheme holds out 33,460,824 instances, leaving our final training set with 650,975,498 mention-entity pairs. Figure 2 provides a breakdown by language.

Setup and Metrics
In this first phase of experiments we evaluate design choices by reporting the differences in Recall@100 between two models at a time, for conciseness. Note that for final system comparisons, it is standard to use Accuracy of the top retrieved entity (R@1), but to evaluate a dual encoder retrieval model, we prefer R@100 as this is better matched to its likely use case as a candidate generator.
Here we use the TR2016-hard dataset, as well as a portion of the 104-language set held out from our training data, sampled to have 1,000 test mentions per language. (We reserve the new Mewsli-9 dataset for testing the final model in Section 5.5.) Reporting results for 104 languages is a challenge. To break down evaluation results by entity frequency bins, we partition a test set according to the frequency of its gold entities as observed in the training set. This is in line with recent recommendations for finer-grained evaluation in EL (Waitelonis et al., 2016; Ilievski et al., 2018).
We calculate metrics within each bin, and report the macro-average over bins. This is a stricter form of the label-based macro-averaging sometimes used, but it better highlights the zero-shot and few-shot cases. We also report micro-average metrics, computed over the entire dataset, without binning.

Table 2: R@100 differences between pairs of models: (a) Model F (featurized inputs for entities) relative to Model E (dedicated embedding for each entity); (b) adding the cross-lingual entity-entity task on top of the mention-entity task for Model F; (c) controlling label balance per-entity during negative mining (versus not).
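The binned evaluation can be sketched as follows; the bucket edges here are illustrative placeholders, not necessarily the paper's exact bins.

```python
import numpy as np

BIN_EDGES = [0, 1, 10, 100, 1000]  # hypothetical frequency bucket edges


def bin_index(freq):
    """Map a gold entity's training-set frequency to a bucket index:
    [0], [1, 10), [10, 100), [100, 1000), [1000, inf)."""
    for i, edge in enumerate(BIN_EDGES[1:]):
        if freq < edge:
            return i
    return len(BIN_EDGES) - 1


def binned_recall(gold_freqs, hits):
    """hits[i] = 1 if the gold entity of query i was retrieved in the
    top-100, else 0. Returns (per-bin recall, macro-average over bins,
    micro-average over all queries)."""
    per_bin = {}
    for f, h in zip(gold_freqs, hits):
        per_bin.setdefault(bin_index(f), []).append(h)
    bin_recall = {b: float(np.mean(v)) for b, v in per_bin.items()}
    macro = float(np.mean(list(bin_recall.values())))
    micro = float(np.mean(hits))
    return bin_recall, macro, micro
```

Because the macro-average weights each bin equally, a model that fails on zero-shot entities is penalized even if those entities are a tiny fraction of all queries.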

Entity Encoder Comparison
We first consider the choice of entity encoder, comparing Model F to Model E.
Table 2(a) shows that using the entity descriptions as inputs leads to dramatically better performance on rare and unseen entities, in exchange for small losses on entities appearing more than 100 times, and overall improvements in both macro and micro recall.
Note that, as expected, the embedding Model E gives 0% recall in zero-shot cases: those embeddings are randomly initialized and never updated in the absence of any training examples.
The embedding table of Model E has 6 billion parameters, but there is no sharing across entities. Model F has approximately 50 times fewer parameters, but can distribute information in its shared, compact WordPiece vocabulary and Transformer layer parameters. We can think of these dual encoder models as classifiers over 20 million classes where the softmax layer is either parameterized by an ID embedding (Model E) or an encoding of a description of the class itself (Model F). Remarkably, using a Transformer for the latter approach effectively compresses (nearly) all the information in the traditional embedding model into a compact and far more generalizable model. This result highlights the value of analyzing model behavior in terms of entity frequency. When looking at the micro-averaged metric in isolation, one might conclude that the two models perform similarly; but the macro-average is sensitive to the large differences in the low-frequency bins.

Auxiliary Cross-Lingual Task
In seeking to improve the performance of Model F on tail entities, we return to the (partly redundant) entity descriptions in multiple languages. By choosing just one language as the input, we are ignoring potentially valuable information in the remaining descriptions.
Here we add an auxiliary task: cross-lingual entity description retrieval. This reuses the entity encoder ψ of Model F to map two descriptions of an entity e to a score, s(t_e^l, t_e^{l′}) ∝ ψ(t_e^l)^T ψ(t_e^{l′}), where t_e^l is the description selected by the earlier heuristic, and t_e^{l′} is sampled from the other available descriptions for the entity.
We sample up to 5 such cross-lingual pairs per entity to construct the training set for this auxiliary task. This makes richer use of the available multilingual descriptions, and exposes the model to 39 million additional high-quality training examples whose distribution is decoupled from that of the mention-entity pairs in the primary task. The multitask training computes an overall loss by averaging the in-batch sampled softmax loss for a batch of (m, e) pairs and for a batch of (e, e) pairs. Table 2(b) confirms this brings consistent quality gains across all frequency bins, and more so for uncommon entities. Again, reliance on the micro-average metric alone understates the benefit of this data augmentation step for rarer entities.
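A sketch of the multitask objective: average the in-batch softmax loss of the mention-entity batch and of the description-description batch (NumPy, with toy encodings standing in for the outputs of φ and the shared ψ).

```python
import numpy as np


def batch_softmax_loss(queries, targets):
    """In-batch softmax loss over a (B, B) score matrix, with the
    diagonal as gold labels."""
    logits = queries @ targets.T
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def multitask_loss(mention_encs, entity_encs, desc_a_encs, desc_b_encs):
    """Average of the primary mention-entity loss and the auxiliary
    cross-lingual description-description loss; in the paper both
    description batches go through the same entity encoder psi."""
    return 0.5 * (batch_softmax_loss(mention_encs, entity_encs)
                  + batch_softmax_loss(desc_a_encs, desc_b_encs))
```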

Hard-Negative Mining
Training with hard negatives is highly effective in monolingual entity retrieval (Gillick et al., 2019), and we apply the technique they detail to our multilingual setting.

Figure 2: Accuracy of Model F+ on the 104 languages in our balanced Wikipedia held-out set, overlaid on alias table accuracy and Wikipedia training set size. (See Figure B1 in the Appendix for a larger view.)

In its standard form, a certain number of negatives are mined for each mention in the training set by collecting top-ranked but incorrect entities retrieved by a prior model. However, this process can lead to a form of the class imbalance problem, as uncommon entities become over-represented as negatives in the resulting dataset. For example, an entity appearing just once in the original training set could appear hundreds or thousands of times as a negative example. Instead, we control the ratio of positives to negatives on a per-entity basis, mining up to 10 negatives per positive.
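The per-entity balancing could be implemented as a cap on the number of times each entity is kept as a mined negative (a sketch; function and variable names are ours):

```python
from collections import defaultdict


def balanced_negatives(mined, positives_per_entity, ratio=10):
    """Cap mined hard negatives per entity at `ratio` times the entity's
    positive count, so that rare entities are not over-represented as
    negatives. `mined` is a list of (mention_id, negative_entity_id)
    candidates, assumed ordered by retrieval score."""
    kept, used = [], defaultdict(int)
    for mention_id, ent in mined:
        cap = ratio * positives_per_entity.get(ent, 0)  # 0 positives -> never kept
        if used[ent] < cap:
            used[ent] += 1
            kept.append((mention_id, ent))
    return kept
```

An entity with a single positive example thus contributes at most 10 negatives, rather than the hundreds it might otherwise accumulate.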
Table 2(c) confirms that our strategy effectively addresses the imbalance issue for rare entities, with only small degradation for more common entities. We use this model to perform a second, final round of the adapted negative mining, followed by further training, which improves the macro-average by a further +.05 (holdout) and +.08 (TR2016-hard).
The model we use in the remainder of the experiments combines all these findings: Model F with the entity-entity auxiliary task and hard-negative mining with per-entity label balancing, referred to as Model F+.

Linking in 100 Languages
Breaking down the model's performance by language (R@1 on our heldout set) reveals relatively strong performance across all languages, despite greatly varying training sizes (Figure 2). It also shows improvement over an alias table baseline on all languages. While this does not capture the relative difficulty of the EL task in each language, it does strongly suggest effective cross-lingual transfer in our model: even the most data-poor languages have reasonable results. This validates our massively multilingual approach.

Comparison to Prior Work
We evaluate the performance of our final retrieval model relative to previous work on two existing datasets (Tsai and Roth, 2016; Upadhyay et al., 2018), noting that direct comparison is impossible because our task setting is novel.

Cross-Lingual Wikification Setting
We compare to two previously reported results on TR2016 hard : the WIKIME model of Tsai and Roth (2016) that accompanied the dataset, and the XELMS-MULTI model by Upadhyay et al. (2018).
Both models depend at their core on multilingual word embeddings, which are obtained by applying (bilingual) alignment or projection techniques to pretrained monolingual word embeddings. As reported in Table 3, our multilingual dual encoder outperforms the other two by a significant margin. To the best of our knowledge, this is the highest accuracy to date on this challenging evaluation set. (Our comparison is limited to the four languages on which Upadhyay et al. (2018) evaluated their multilingual model.) This is a strong validation of the proposed approach because the experimental setting is heavily skewed toward the prior models: both are rerankers that require a first-stage candidate generation step. They therefore only disambiguate among the resulting ≤20 candidate entities (drawn only from English Wikipedia), whereas our model performs retrieval against all 20 million entities.

Out-of-Domain English Evaluation
We now turn to the question of how well the proposed multilingual model can maintain competitive performance in English and generalize to a domain other than Wikipedia. Gillick et al. (2019) provide a suitable comparison point. Their DEER model is closely related to our approach, but used a more lightweight dual encoder architecture (bags-of-embeddings with feed-forward layers, without attention) and was evaluated on English EL only. On the English WikiNews-2018 dataset they introduced, our Transformer-based multilingual dual encoder matches their monolingual model's performance at R@1 and improves R@100 by 0.01 (reaching 0.99). Our model thus retains strong English performance despite covering many languages and linking against a larger KB. See Table 4.

Evaluation on Mewsli-9

Table 5 shows the performance of our model on our new Mewsli-9 dataset, compared with an alias table baseline that retrieves entities based on the prior probability of an entity given the observed mention string. Table 6 shows the usual frequency-binned evaluation. While overall (micro-average) performance is strong, there is plenty of headroom in zero- and few-shot retrieval.

Example Outputs
We sampled the model's correct predictions on Mewsli-9, focusing on cross-lingual examples where entities do not have an English Wikipedia page (Table 7). These examples demonstrate that the model effectively learns cross-lingual entity representations. Based on a random sample of the model's errors, we also show examples that summarize notable error categories.

Reranking Experiment
Finally, we report a preliminary experiment applying a cross-attention scoring model (CA) to rerank entity candidates retrieved by the main dual encoder (DE), using the same architecture as Logeswaran et al. (2019). We feed the concatenated mention text and entity description into a 12-layer Transformer model, initialized from the same multilingual BERT checkpoint referenced earlier.
The CA model's CLS token encoding is used to classify mention-entity coherence. We train the model with a binary cross-entropy loss, using positives from our Wikipedia training data, and taking for each positive the top-4 DE-retrieved candidates plus 4 random candidates (sampled proportional to the positive distribution) as negatives.
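A sketch of how the reranker's training examples might be assembled per positive; for simplicity we draw the 4 random negatives uniformly here, whereas the paper samples them proportional to the positive entity distribution.

```python
import random


def build_reranker_examples(gold_entity, de_candidates, all_entities, rng,
                            n_top=4, n_random=4):
    """Per positive mention: one (gold, 1) example, plus negatives from
    the top-n_top dual-encoder candidates (excluding the gold entity)
    and n_random randomly drawn entities."""
    negatives = [e for e in de_candidates[:n_top] if e != gold_entity]
    pool = [e for e in all_entities if e != gold_entity]
    negatives += rng.sample(pool, k=n_random)
    return [(gold_entity, 1)] + [(e, 0) for e in negatives]
```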

Outcome (correct): A Spanish mention of "fruit juice" linked to its German description; only "juice" has a dedicated English Wikipedia page.

Outcome (wrong): A legitimately ambiguous mention of "Italy" in Serbian (sports context), for which the model retrieved the water polo and football teams, followed by the expected basketball team entity, all featurized in Italian.

We use the trained CA model to rerank the top-5 DE candidates for Mewsli-9 (Table 6). We observed improvements in most frequency buckets compared to DE R@1, which suggests that the model's few-shot capability can be improved by cross-lingual reading comprehension. This also offers an initial multilingual validation of a similar two-step BERT-based approach recently introduced in a monolingual setting by Wu et al. (2019), and provides a strong baseline for future work.

Conclusion
We have proposed a new formulation for multilingual entity linking that seeks to expand the scope of entity linking to better reflect the real-world challenges of rare entities and low-resource languages. Operationalized through Wikipedia and WikiData, our experiments using enhanced dual encoder retrieval models and frequency-based evaluation provide compelling evidence that it is feasible to perform this task with a single model covering over 100 languages.
Our automatically extracted Mewsli-9 dataset serves as a starting point for evaluating entity linking beyond the entrenched English benchmarks and under the expanded multilingual setting. Future work could investigate the use of non-expert human raters to improve the dataset quality further.
In pursuit of improved entity representations, future work could explore the joint use of complementary multi-language descriptions per entity, methods to update representations in a lightweight fashion when descriptions change, and ways to incorporate relational information stored in the KB.