Design Challenges in Low-resource Cross-lingual Entity Linking

Cross-lingual Entity Linking (XEL), the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia, has seen a lot of research in recent years, with a range of promising techniques. However, current techniques do not rise to the challenges introduced by text in low-resource languages (LRL) and, surprisingly, fail to generalize to text not taken from Wikipedia, on which they are usually trained. This paper provides a thorough analysis of low-resource XEL techniques, focusing on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention. Our analysis indicates that current methods are limited by their reliance on Wikipedia's interlanguage links and thus suffer when the foreign language's Wikipedia is small. We conclude that the LRL setting requires the use of outside-Wikipedia cross-lingual resources and present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs. With experiments on 25 languages, QuEL shows an average increase of 25% in gold candidate recall and of 13% in end-to-end linking accuracy over state-of-the-art baselines.


Introduction
Cross-lingual Entity Linking (XEL) aims at grounding mentions written in a foreign (source) language (SL) into entries in a (target) language Knowledge Base (KB), which we take here to be the English Wikipedia, following Pan et al. (2017); Upadhyay et al. (2018a); Zhou et al. (2020). In Figure 1, for instance, an Odia (an Indo-Aryan language of India) mention ("Chilika Lake") is linked to the corresponding English Wikipedia entry.

Figure 1: The XEL task: in the given sentence we link "Chilika Lake" to its corresponding English Wikipedia entry.

The XEL task typically involves two main steps: (1) candidate generation, retrieving a list of candidate KB entries for the mention, and (2) candidate ranking, selecting the most likely entry from the candidates.
While XEL techniques have been studied heavily in recent years, many challenges remain in the LRL setting. Specifically, existing candidate generation methods perform well on Wikipedia-based datasets but fail to generalize beyond Wikipedia, to news and social media text. Error analysis on existing LRL XEL systems shows that the key obstacle is candidate generation. For example, 79.3%-89.1% of XEL errors in Odia can be attributed to the limitations of candidate generation.
In this paper, we present a thorough analysis of the limitations of several leading candidate generation methods. Although these methods adopt different techniques, we find that all of them rely heavily on Wikipedia interlanguage links as their cross-lingual resource. However, the small size of SL Wikipedias limits their performance in the LRL setting. As shown in Figure 2, while the core challenge of LRL XEL is to link LRL entities (A) to candidates in the English Wikipedia (C), interlanguage links only map the small subset of LRL entities that appear in both the LRL Wikipedia (B) and the English Wikipedia. Therefore, methods that leverage only interlanguage links (B ∩ C) as their main source of supervision cannot cover a wide range of entities.
For example, the Amharic Wikipedia has 14,854 entries, but only 8,176 of them have interlanguage links to English. Furthermore, as we show, existing candidate generation methods perform well on Wikipedia-based datasets but fail to generalize to outside-Wikipedia text such as news or social media. Our observations lead to the conclusion that the LRL setting necessitates the use of outside-Wikipedia cross-lingual resources. Specifically, we propose various ways to utilize the abundant query logs from online search engines to compensate for the lack of supervision. Like Wikipedia, a free online encyclopedia created by Internet users, query logs (QL) are a free resource, collaboratively generated by a large number of users, and mildly curated. However, they are orders of magnitude larger than Wikipedia. In particular, they include all of Wikipedia's cross-lingual resources as a subset, since a search for an SL mention leads to the English Wikipedia entity whenever the corresponding Wikipedia entries are interlanguage-linked.
The main part of this paper, Sec. 3, presents a thorough method-wise evaluation and analysis of leading candidate generation methods, and quantifies their limitations as a function of SL Wikipedia size and the size of the interlanguage cross-lingual resources. Based on these limitations, we analyze QuEL CG, an improved candidate generation method utilizing QL, in Sec. 4, showing that it exceeds the limits of Wikipedia resources on LRLs. For a system-wise XEL comparison, in Sec. 5 we present a simple yet effective zero-shot XEL framework, QuEL, that incorporates QuEL CG. QuEL achieves an average increase of 25% in gold candidate recall and of 13% in end-to-end linking accuracy on outside-Wikipedia text.

Wikipedia anchor text mappings: A clickable text mention in a Wikipedia article is annotated with anchor text linking it to a Wikipedia entry. The retrieval order SL anchor text → SL Wikipedia entry → English Wikipedia entry (where the last step is done via the Wikipedia interlanguage links) allows one to build a bilingual title mapping from SL mentions to English Wikipedia entries, resulting in a probabilistic mapping with scores calculated from total counts (Tsai and Roth, 2016).

XEL Systems for Low-resource Languages
We briefly survey key approaches to XEL below.

Direct Mapping Based Systems, including xlwikifier (Tsai and Roth, 2016) and xelms (Upadhyay et al., 2018a), focus on building an SL-to-English mapping to generate candidates. For candidate ranking, both xlwikifier and xelms combine supervision from multiple languages to learn a ranking model.

Word Translation Based Systems, including Pan et al. (2017) and ELISA (Zhang et al., 2018), extract SL-English name translation pairs and apply an unsupervised collective inference approach to link the translated mention.

Transliteration Based Systems include Tsai and Roth (2018) and translit (Upadhyay et al., 2018b). translit uses a sequence-to-sequence model and bootstrapping to cope with limited data. It is useful when the English and SL word pairs have similar pronunciation.

Pivoting Based Systems, including Zhou et al. (2019) and PBEL PLUS (Zhou et al., 2020), remove the reliance on SL resources and use a pivot language for candidate generation. Specifically, they train the XEL model on a selected high-resource language and apply it to SL mentions through language conversion.

Query Logs
Query logs have long been used in many tasks, such as cross-domain generalization for NER (Rüd et al., 2011) and ontological knowledge acquisition (Alfonseca et al., 2010; Pantel et al., 2012). For English entity linking, Shen et al. (2015) pointed out that Google query logs can be an efficient way to identify candidates, and Dredze et al. (2010) and Monahan et al. (2011) use search results as one of their candidate generation methods for high-resource entity linking. While this earlier work indicates that query logs provide abundant information for NLP tasks, as far as we know they have never been used in cross-lingual tasks as we do here. Moreover, only search information has been used, while we suggest using Maps information too.

Candidate Generation Analysis
In this section, we analyze four leading candidate generation methods: p(e|m) (used by xlwikifier and xelms), name trans, pivoting, and translit (see Table 1 for the systems using each method), and discuss their limitations.

Candidate Generation Methods
Each method listed in Table 1 is discussed below, along with the level of resources it requires.

p(e|m) (Tsai and Roth, 2016) creates a direct probabilistic mapping table between SL and English, using the Wikipedia title mappings and the anchor text mappings. E.g., if the Oromo mention "Itoophiyaatti" is anchor text linked to the Oromo Wikipedia entity "Itoophiyaa", and "Itoophiyaa" has an interlanguage link to the English Wikipedia entity "Ethiopia", then "Ethiopia" is added as a candidate for the mention "Itoophiyaatti". Thus, p(e|m) follows the linking flow: SL mention → SL Wikipedia entity → English Wikipedia entity (a sketch of this construction is given at the end of this subsection).

name trans (Name Translation), as introduced in Pan et al. (2017); Zhang et al. (2018), performs word alignment on the Wikipedia title mappings to induce a fixed word-to-word translation table between SL and English. For instance, to link the Finnish name "Pekingin tekninen instituutti" (Beijing Institute of Technology), it translates each word in the mention ("Pekingin" - Beijing, "tekninen" - Technology, "instituutti" - Institute). At test time, after mapping each word in the given mention to English, it links the translation to the English Wikipedia using an unsupervised collective inference approach.

translit (Upadhyay et al., 2018b) trains a seq2seq model on the Wikipedia title mappings to generate the English entity directly.

pivoting to a related high-resource language (HRL) (Zhou et al., 2020) trains the XEL model on a selected related HRL and applies it to SL mentions through grapheme or phoneme conversion, removing the reliance on SL resources.
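To make the p(e|m) construction concrete, below is a minimal sketch of how such a table can be assembled from anchor-text counts and interlanguage links. The input structures (`anchor_links`, `interlanguage_links`) are hypothetical stand-ins for the parsed Wikipedia dumps, not the authors' released implementation.

```python
from collections import defaultdict

def build_p_e_m(anchor_links, interlanguage_links):
    """Build a probabilistic mention -> English-title table in the spirit
    of Tsai and Roth (2016): counts of (SL anchor text -> SL entry) are
    routed through interlanguage links and normalized per mention."""
    counts = defaultdict(lambda: defaultdict(int))
    for sl_mention, sl_title in anchor_links:          # SL anchor text -> SL entry
        en_title = interlanguage_links.get(sl_title)   # SL entry -> English entry
        if en_title is not None:
            counts[sl_mention][en_title] += 1

    p_e_m = {}
    for mention, entity_counts in counts.items():
        total = sum(entity_counts.values())
        p_e_m[mention] = {e: c / total for e, c in entity_counts.items()}
    return p_e_m

# Toy usage with the Oromo example from the text:
anchors = [("Itoophiyaatti", "Itoophiyaa")]
interlinks = {"Itoophiyaa": "Ethiopia"}
print(build_p_e_m(anchors, interlinks))   # {'Itoophiyaatti': {'Ethiopia': 1.0}}
```

Note how the workflow breaks exactly as described in Sec. 3: if `interlanguage_links` lacks an entry, the mention simply never enters the table.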

Current Methods' Limitations
This section discusses four major limitations that existing methods suffer from, and quantifies them with experimental results. The results use the LORELEI dataset (Strassel and Tracey, 2016), a realistic corpus consisting of news and social media text, all from outside Wikipedia (see Section 5). Some tables also include a comparison with the proposed QL-based candidate generation method QuEL CG, which we describe in Section 4. As the evaluation metric, we use gold candidate recall as defined in Zhou et al. (2020): the proportion of SL mentions whose candidate list contains the gold English entity.

Shortage of Interlanguage Links
As illustrated in Figure 3 (specific numbers are in Appendix A.1), with statistics from the 2019-10-20 wikidump, and in Table 2 for five randomly picked low-resource languages, many LRL Wikipedias have only a few interlanguage links to the English Wikipedia. Consequently, only a few Wikipedia title mappings and anchor text mappings are accessible to all four methods, as shown in Table 1.
For p(e|m), the workflow is: SL mention → SL Wikipedia entity → English Wikipedia entity, and it breaks if one link is missing. For example, "Nawala" has an English Wikipedia page but no corresponding Sinhala page. Given its Sinhala mention, the interlanguage link is missing, and p(e|m) assigns zero probability to "Nawala".
For name trans, translation ability is limited by the tokens contained in the Wikipedia title mappings. When none of an SL mention's tokens appears in the SL Wikipedia titles, the mention has no English translation and thus yields no candidates. As for translit and pivoting, they have fewer data pairs to train on, and model performance suffers.

Small LRL Mention Coverage
In the LRL setting, having few Wikipedia articles leads to few Wikipedia anchor text mappings, reducing the ability of current methods to cover many SL mentions. For instance, the Oromo Wikipedia article for "Laayibeeriyaa" has far fewer hyperlinks than the English Wikipedia article for "Liberia", even though the two are connected by an interlanguage link; Figures 3 and 4 quantify this effect.

To evaluate Wikipedia coverage, we propose a global metric called mention token coverage that can be computed without gold data: the percentage of SL mentions that have at least one token appearing in the Wikipedia title mappings or in the anchor text mappings. For example, in Somali, the mention "Shabeelada hoose" has the token "hoose" covered in the Wikipedia titles, so it counts toward mention token coverage; "Soomaalieed" appears in neither the titles nor the anchor text mappings, so it does not. High mention token coverage tends to guarantee better supervision when training on Wikipedia. Indeed, Figure 4 clearly shows that mention token coverage for LRLs is much smaller than for high-resource languages (consult Figure 3 for the LRL/HRL distinction). A minimal sketch of the metric is given below.
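The following sketch computes the metric, assuming the tokens of the SL Wikipedia title and anchor-text mappings have already been collected into a set; whitespace tokenization is an assumption on our part.

```python
def mention_token_coverage(mentions, wiki_tokens):
    """Fraction of SL mentions with at least one token appearing in the
    Wikipedia title mappings or anchor text mappings (wiki_tokens)."""
    if not mentions:
        return 0.0
    covered = sum(
        1 for m in mentions
        if any(tok in wiki_tokens for tok in m.split())
    )
    return covered / len(mentions)

# Somali toy example from the text: "hoose" appears in Wikipedia titles.
wiki_tokens = {"hoose", "soomaaliya"}
mentions = ["Shabeelada hoose", "Soomaalieed"]
print(mention_token_coverage(mentions, wiki_tokens))  # 0.5
```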
We also compare mention token coverage with each method's gold candidate recall in Figure 4. When a method's gold candidate recall is higher than mention token coverage, the method is able to generalize beyond Wikipedia; when it is lower, the method is limited by Wikipedia resources.
To compare the relation between mention token coverage and gold candidate recall more directly, Figure 5 shows the ratio of gold candidate recall to mention token coverage. Existing methods' ratios range between 0.31 and 1.27, with an average of 0.72, suggesting that existing methods are bounded by Wikipedia resources in most cases. The figure also shows the generalization ability of our QL-based candidate generation method QuEL CG (introduced in Sec. 4) to outside-Wikipedia mentions, with an average ratio of 1.13, ranging between 0.74 and 1.92.

Figure 5: Average gold candidate recall remains 0.5-0.8 times mention token coverage for existing approaches, but the ratio is 1.0-2.0 for QuEL CG. The lines show the ratio of each candidate generation method's gold candidate recall to mention token coverage. The average gold candidate recall of existing methods (p(e|m), name trans, pivoting) is mostly limited by mention token coverage and cannot exceed it, with an average ratio of 0.72. In contrast, our proposed QuEL CG can reach recall up to 2 times mention token coverage, with an average ratio of 1.13 across all languages, statistically significant with p-value < 0.01%.

translit Data Requirements
translit suffers from the inability to satisfy several of its data requirements. Transliteration models typically need many training pairs, which are hard to obtain from LRL Wikipedia titles. They also require SL mentions and gold English entities to map word by word. Table 3 shows results for translit trained on name pairs from the Wikipedia title mappings (translit-Wiki). We only provide results for languages for which the model has been released; even this sample clearly shows that translit's performance drops significantly on LRLs.

pivoting Prerequisites
While pivoting does not suffer from insufficient cross-lingual resources between the SL and English, it is limited by the resources between the related HRL and English and, more importantly, by the availability of an HRL that is similar enough to the SL. pivoting learns through grapheme or phoneme similarity. However, grapheme similarity is not enough, since not every LRL has a related HRL that uses the same script. In these cases, pivoting falls back to phoneme similarity and maps strings to International Phonetic Alphabet (IPA) symbols. For example, Zhou et al. (2020) use Epitran (Mortensen et al., 2018) to convert strings to IPA symbols (see the snippet below), but Epitran supports only 55 of the 309 Wikipedia languages and covers only 8 of the 12 low-resource languages in the LORELEI corpus. As shown in Figure 4, pivoting gains from the language conversion compared with p(e|m), but the increase varies across languages, depending on the choice of related HRL. Most importantly, the related HRL cannot replace the SL, and language conversion may limit linking ability. For example, the language pair Oromo-Indonesian (grapheme) shares a script, yet pivoting still suffers from the scarcity of Oromo resources.
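For reference, Epitran's grapheme-to-IPA interface looks as follows. Hindi ("hin-Deva") is among its supported language-script codes; many LORELEI LRLs are not, which is exactly the coverage gap discussed above.

```python
import epitran

# Epitran supports a limited set of language-script codes; Hindi
# ("hin-Deva") is covered, but many low-resource languages are not.
epi = epitran.Epitran("hin-Deva")
print(epi.transliterate("नमस्ते"))  # IPA rendering of the Hindi word
```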

QuEL CG
As discussed in Section 3, Wikipedia's cross-lingual resources are not enough for XEL to perform well in low-resource settings; outside-Wikipedia resources are essential to compensate for the lack of supervision. We suggest that search engine query logs provide an excellent resource in this situation: they form a (very large) superset of Wikipedia's cross-lingual data; as pointed out in the Introduction, they are a collaboratively generated, mildly curated resource, and their effective size is a function of the number of SL native speakers using the search engine. Thus, query logs can help the candidate generation process map SL mentions to English Wikipedia candidates even when these are not covered by Wikipedia interlanguage links. We propose an improved candidate generation method, QuEL CG, that uses query log mapping files. We obtain a high-quality candidate list through directly searching SL mentions in the query logs and through a query-based pivoting method. While any search engine could be used, here we use Google search and Google Maps. QuEL CG also runs in conjunction with p(e|m) to cover the cases where the query log mapping is not robust.

Query Logs as Search Results
As the first step, we search for the morphologically normalized SL mention in the Google search engine (implementation details in Appendix A.2) and retrieve a list of web-page results, from which we pick the top k (k is usually 1 or 5) Wikipedia web-page results P_k. If a result is an SL Wikipedia article with an interlanguage link to English, we follow the link and mark the corresponding English entity as a candidate. When the SL mention is a geopolitical or location entity, we additionally search its normalized form in Google Maps. Since Maps returns only the English surface form of the location rather than Wikipedia articles, we then search the resulting English surface form in Google search using the same procedure described above. A minimal sketch of this step is given below.
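The sketch below illustrates this search step under stated assumptions: `web_search` and `maps_search` are hypothetical callables wrapping the search engine and Maps APIs (the text does not specify an exact interface), and `interlinks` is the SL-to-English interlanguage link table.

```python
import re

WIKI_URL = re.compile(r"https?://(\w+)\.wikipedia\.org/wiki/(.+)")

def quel_cg_search(mention, k, web_search, interlinks,
                   is_location=False, maps_search=None):
    """Sketch: query the normalized SL mention, keep the top-k Wikipedia
    results, and map SL pages to English via interlanguage links."""
    candidates = []
    for url in web_search(mention):
        m = WIKI_URL.match(url)
        if not m:
            continue                                   # skip non-Wikipedia pages
        lang, title = m.group(1), m.group(2).replace("_", " ")
        if lang != "en":
            title = interlinks.get(title)              # SL entry -> English entry
        if title:
            candidates.append(title)
        if len(candidates) >= k:
            break
    # GPE/LOC mentions: get the English surface form from Maps, re-search it.
    if is_location and maps_search is not None:
        en_surface = maps_search(mention)
        if en_surface:
            candidates += quel_cg_search(en_surface, k, web_search, interlinks)
    return candidates
```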

Query-based Pivoting
We also conduct language-indifferent pivoting using query logs. Note that the pivoting methods described below are different from the pivoting method analyzed in Section 3.
Some LRLs have high-resource languages they are similar to (e.g., Sinhala to Hindi, and Tigrinya to Amharic). To exploit the similarity between an LRL and an HRL without having to commit to one good HRL, we use query logs for pivoting. We first follow the same search steps described above to get the top k Wikipedia web-page results P_k; we then keep results in other (typically related, higher-resource) languages, treat their titles as pivoted mentions, and continue the same candidate generation process on these pivoted mentions. A special case is language-specific pivoting on selected language pairs: we use a simple utf-8 converter (sketched below) to transliterate the SL mention into a related but higher-resource language, such as Odia to Hindi, and then run the candidate generation process described above on the pivoted mention.
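A minimal sketch of the utf-8 converter idea for the Odia-to-Hindi case follows. Unicode's Indic blocks inherit a parallel layout from ISCII, so shifting code points from the Oriya block (U+0B00-U+0B7F) into the Devanagari block (U+0900-U+097F) maps most letters to their Hindi counterparts. This is an approximation we supply for illustration, not necessarily the authors' exact converter: a few code points have no counterpart and pass through unchanged here.

```python
def oriya_to_devanagari(text):
    """Shift Oriya-block code points into the Devanagari block,
    exploiting the parallel ISCII-derived layout of Unicode Indic blocks."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0B00 <= cp <= 0x0B7F:
            out.append(chr(cp - 0x0200))   # Oriya block -> Devanagari block
        else:
            out.append(ch)                 # leave non-Oriya characters as-is
    return "".join(out)

print(oriya_to_devanagari("ଭାରତ"))   # -> "भारत" ("Bharat", i.e. India)
```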

Experiments: System Comparison
Given the analysis of the key candidate generation step in Section 3, this section studies its implications for the overall performance of different LRL XEL systems. We first introduce our LRL XEL framework, QuEL, which combines QuEL CG with a zero-shot candidate ranking module. Our experimental goal is to compare all systems on both outside-Wikipedia data and Wikipedia data. We further analyze the linking results by entity resource distribution and entity type. An ablation study is presented in Appendix A.3.

Datasets
Dataset details are reported in Appendix A.1.

LORELEI dataset (Strassel and Tracey, 2016) is a realistic and challenging dataset that includes news and social media text such as Twitter. We divide its 25 languages into LRLs and HRLs as in Figure 3. Entities in LORELEI are of four types: geopolitical entities (GPE), locations (LOC), persons (PER), and organizations (ORG). The dataset provides a specific English KB to which mentions are linked; we processed the original dataset to link to the English Wikipedia instead. Our processed gold labels will be made available along with the LORELEI dataset (https://catalog.ldc.upenn.edu/LDC2020T10). Given a KB entity, we link it to Wikipedia if the KB provides a Wikipedia link. For a PER or ORG entity without a Wikipedia link, we use its KB-provided English information, e.g., name and description, to search for the Wikipedia entry, and manually check correctness; otherwise, we exclude the entity and remove any mentions linking to it from the EDL dataset. We process these types differently because PER and ORG entities compose only around 5% of the gold entities.

Wikipedia-based dataset, collected by Tsai and Roth (2016), is built upon Wikipedia anchor text mappings. All languages in this dataset are high-resource ones, as defined in Figure 3.

System Comparison
We compare the supervised SOTA systems xlwikifier, xelms, ELISA, and PBEL PLUS, which use the candidate generation methods analyzed earlier, with a new QL-based system that we present below. Implementation details are in Appendix A.4.

A QL-based XEL: QuEL
Given the limitations discussed in Sec. 3, we propose a new XEL system, QuEL, that uses QuEL CG (Sec. 4) along with the following zero-shot candidate ranking module. Given a candidate list $C_m$ (the output of QuEL CG on SL mention $m$), QuEL uses multilingual BERT (Devlin et al., 2018) to score the candidates against $m$: for each candidate $c \in C_m$ it computes a score $W(c, m)$ that measures the "relatedness" between $m$ and $c$, and it picks the candidate with the highest score as its output. Ties are broken by the candidate selection order: Google search results, then Google Maps results, then p(e|m) candidates. We explain the components of $W(c, m)$ below; a sketch of the full scoring rule follows this description.

Candidate Multiplicity Weight. A candidate can be suggested by multiple sources: Google search, Google Maps search, query-based pivoting, or p(e|m). QuEL prefers candidates generated by multiple sources, and we define the multiplicity weight of candidate $c$ as $W_{source}(c) = Num_{source}(c)$, the number of sources that generate $c$.

Contextual Disambiguation. QuEL uses Multilingual BERT (M-BERT) (Devlin et al., 2018) for multilingual embeddings, to compute the similarity between the context of the mention $m$ and the candidate's context in the English Wikipedia. We denote by $W_{context}(c, m)$ the cosine similarity between $m$'s context embedding and $c$'s context embedding (see details in Appendix A.5).

Finally, the score for candidate $c$ is $W(c, m) = W_{source}(c) \cdot W_{context}(c, m)$, and we select the most likely entity $e = \arg\max_{c \in C_m} W(c, m)$ as the output.
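A minimal sketch of the scoring rule, assuming the context embeddings have already been computed (Appendix A.5); `selection_order` is a hypothetical map encoding the tie-breaking order (search results before Maps before p(e|m)).

```python
import numpy as np

def rank_candidates(mention_vec, cand_vecs, num_sources, selection_order):
    """Score each candidate by W(c, m) = W_source(c) * W_context(c, m):
    num_sources counts how many sources (search, Maps, pivoting, p(e|m))
    proposed c; W_context is the cosine similarity of context embeddings."""
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = {c: num_sources[c] * cosine(mention_vec, v)
              for c, v in cand_vecs.items()}
    # Highest score wins; ties broken by the candidate selection order
    # (lower selection_order value = preferred source).
    return max(scores, key=lambda c: (scores[c], -selection_order[c]))
```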

Entity Linking Results
Comprehensive evaluations on both the LORELEI and the Wikipedia-based datasets are shown in Figures 6 and 7 and in Table 4 (scores corresponding to the figures are reported in Appendix A.6). Note that gold candidate recall for xlwikifier and xelms is identical because they use the same candidate generation module, p(e|m).
QuEL significantly improves over existing approaches on both datasets, especially on the more difficult LORELEI dataset, where it improves on almost all languages and shows an average increase of 25% in gold candidate recall and of 13% in linking accuracy. On the Wikipedia-based dataset, QuEL shows an average increase of 4% in gold candidate recall while reaching SOTA linking accuracy. Importantly, most other systems (ELISA, xlwikifier, and xelms) use supervised ranking modules.

Table 4 shows large performance gaps between the two datasets for the SOTA baseline xelms. The roughly 20-point difference in both metrics indicates that the LORELEI dataset is more difficult, as it contains more outside-Wikipedia mentions and focuses more on low-resource languages. One exception is ELISA, which performs worse on the Wikipedia-based dataset; we believe it fails to cover many Wikipedia mentions because it does not use the Wikipedia anchor text mappings. Similarly, comparing the same language across the two datasets (see Figures 6 and 7), e.g., Tamil and Thai, the LORELEI dataset appears harder. However, QuEL achieves similar results on both datasets, and also brings the gold candidate recall for LRLs much closer to that of HRLs. Another important observation is that QuEL performs significantly better on the LORELEI dataset, suggesting that QuEL CG addresses the outside-Wikipedia coverage problem well by exploiting query logs. To understand why QuEL exceeds the baselines by a large margin on the LORELEI dataset, we analyze the results by entity resource and entity type below.

Entity Resource. Considering the insufficient Wikipedia interlanguage links for LRLs in Table 2, we investigate whether QuEL CG helps in this situation. In Table 5, QuEL shows a 6.3% to 27.4% improvement in gold candidate recall, indicating that it can effectively perform XEL without Wikipedia cross-lingual resources, a significant improvement for LRL XEL.

Analysis
Entity Type. Table 6 shows the evaluation on all four entity types. We observe that QuEL improves more on GPE and LOC entities than on PER and ORG entities; we believe this improvement comes from the Google Maps query logs.

Conclusion
This work provides a thorough analysis of existing LRL XEL techniques, focusing on the step of generating English candidates for foreign language mentions. The analysis identifies the inherent lack of sufficient inter-lingual supervision signals as a key shortcoming of current approaches. This leads us to propose a rather simple method that leverages query logs, which proves highly effective in addressing these challenges. Given that our experiments show a 25% increase in candidate generation recall, one future research direction is to improve candidate ranking in LRLs by incorporating coherence statistics and entity types. Moreover, given the effectiveness of query logs, we believe they can be applied to other cross-lingual tasks such as relation extraction and knowledge base completion.

A.2 Implementation Details of QuEL CG
To perform our improved candidate generation method QuEL CG, we first conduct morphological normalization on the SL mention before querying, and use the normalized output as the search input to Google search and Google Maps. We can also customize the search input for better results on extremely low-resource languages (Odia and Ilocano), as described below.
Language-specific morphological normalization is a basic preprocessing step for all candidate generation methods. An entity may have different surface forms in a document, which makes candidate generation difficult. To cope with this, several operations, including removing, adding, or replacing suffixes and prefixes, are applied beforehand.

Customized search input. To better retrieve Wikipedia pages as search results and ignore other web pages, "wiki" or "[Country of SL]" can be appended to the original search input. A minimal sketch combining both steps is shown below.
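In this sketch, the suffix list is purely illustrative (the actual affix operations are language-specific and not enumerated in the text), and `build_query` shows the "wiki"/country customization.

```python
# Hypothetical suffix list for illustration; the real affixes are chosen
# per source language (e.g. case endings) and are not listed in the paper.
SUFFIXES = ["tti", "dhaa", "n"]

def normalize_mention(mention, suffixes=SUFFIXES):
    """Strip known suffixes so surface variants map to one search query."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if mention.endswith(suf) and len(mention) > len(suf) + 2:
            return mention[: -len(suf)]
    return mention

def build_query(mention, country=None, force_wiki=True):
    """Customize the search input: append 'wiki' or the SL country name."""
    parts = [normalize_mention(mention)]
    if force_wiki:
        parts.append("wiki")
    if country:
        parts.append(country)
    return " ".join(parts)

print(build_query("Itoophiyaatti", country="Ethiopia"))
# -> "Itoophiyaa wiki Ethiopia"
```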

A.3 Ablation Study
We now quantify the effect of each component of our candidate generation method; results are in Table 9.

Google Maps. Our model uses the Google Maps cross-lingual resource by default; we test the effect of adding supervision from this QL.

Google top1. Wherever Google query log (QL) results are used, take only the first Wikipedia page result that is in the source or target language as a candidate.

Google top5. Similar to Google top1, but take the top 5 Wikipedia page results as candidates. We can see that the effects of Google top1 and top5 are language dependent.

p(e|m). We test whether adding the p(e|m) module improves linking performance. To isolate its effect, p(e|m) is added under the setting of using QL and the Google Maps KB, without other modules.

Pivoting. Pivoting here refers to our query-based pivoting, distinct from the pivoting method of Section 3. We pick two low-resource languages, Odia and Tigrinya, to explore the pivoting effect, and show results in Table 10. On Odia, language-specific pivoting is used: since we know a priori that Odia and Hindi are similar while the latter has far more resources, a simple utf-8 converter transforms Odia into Hindi, and the Hindi mention is then run through our whole system. On Tigrinya, language-indifferent pivoting is used: after getting the QL results, besides using Google top1 or top5, we further pick Wikipedia page results in any other language with richer cross-lingual supervision than Tigrinya, such as Amharic (which has a similar script) or Scots.
We further examined the effect of transliteration, using models trained by Upadhyay et al. (2018b) on Sinhala and Odia with bilingually mapped Wikipedia titles as supervision; we also used the Google transliteration resource for Odia mentions. However, no increase in linking accuracy was observed, and the absolute increase in gold candidate recall was below 0.5%. Since we only studied Sinhala and Odia, transliteration resources may still be useful for other languages.
Tables 9 and 10 show that we add substantial value beyond the use of Google search: simply using Google search without the other parts of our candidate generation method does not yield good linking results. Indeed, incorporating online search engine query logs effectively into XEL is highly nontrivial. In this context, it is important to note that all existing methods make heavy use of Wikipedia, so using QL as a cross-lingual resource is equally fair. Moreover, as our results show, the use of Wikipedia allows existing systems to perform well only on Wikipedia data, which is uninteresting for practical purposes. As shown in Figure 6 and Tables 11 and 12, our method works well outside Wikipedia!

A.4 Implementation Details of Compared Systems
For xlwikifier, we use different versions of candidate ranking on the two datasets. Since the Wikipedia-based dataset provides training data, we use its provided trained candidate ranking model. The LORELEI dataset, however, has no training data, so no ranking model can be trained, and we simply pick the first candidate as the result. For comparison purposes, note that since xelms uses the same candidate generation module and provides better candidate ranking, xlwikifier results are close to, and mostly upper-bounded by, xelms results.
For xelms (https://github.com/shyamupa/xelms), we use trained ranking modules for most languages when available, and the zero-shot version of the ranking module for the remaining languages (Akan and Kinyarwanda).
For ELISA, we access the system through the API provided by its authors (https://nlp.cs.rpi.edu/software) and call the GET /entity_linking/{identifier} function.
For PBEL PLUS (https://github.com/shuyanzhou/pbel_plus), we test this approach only on the low-resource languages of the LORELEI dataset, because it generates candidates by pivoting to a related high-resource language, and it does not make sense to pivot an already high-resource language to another language.

A.5 Implementation Details of Our System
During candidate ranking, for a mention $m$ in a document $D$, we take the sentence $s_m$ in which $m$ appears and compute its contextualized embedding $v_m = \text{M-BERT}(m, s_m)$. For each $c \in C_m$, we retrieve the list of sentences $S_c = \{s_1, s_2, \ldots, s_n\}$ in the summary of $c$'s Wikipedia page that contain the candidate entity $c$; the contextualized embedding for $c$, denoted $v_c$, is computed analogously from $S_c$.

Note that we picked two representative languages, Odia and Ilocano, for which we have additional LORELEI-provided monolingual text, and trained the M-BERT model (Devlin et al., 2018) using their Wikipedia data along with the LORELEI text. We did not use pre-trained M-BERT for all languages because many low-resource languages are not supported, and for the supported ones the performance increase is much smaller than that of models trained with LORELEI text plus Wikipedia data. This experiment serves to show the gain one can get from additional supervision and, at the same time, highlights the results we achieve when M-BERT is not available, which is the more realistic setting.
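A sketch of the embedding computation with the Hugging Face Transformers library follows. Mean pooling over M-BERT's final hidden states, and mean aggregation over the sentences in $S_c$, are assumptions on our part, since the exact formulas are not recoverable from this text.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed_sentence(sentence):
    """Mean-pooled M-BERT embedding of one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def embed_candidate(summary_sentences):
    """v_c from the sentences in S_c; the mean over sentence embeddings
    is an assumed aggregation, not the authors' stated formula."""
    return torch.stack([embed_sentence(s) for s in summary_sentences]).mean(dim=0)
```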

A.6 Comprehensive Evaluation
This section includes a comprehensive evaluation of the XEL systems.

Table 12: Quantitative evaluation results on 25 languages on the LORELEI dataset (continued). Accu is linking accuracy; Rec@n is gold candidate recall, with n ranging from 2 to 9 for QuEL and 100 for PBEL PLUS. Rec@5 is gold candidate recall when only the top 5 candidates by ranking score are kept.

Table 13: Quantitative evaluation results on 9 languages on the Wikipedia-based dataset. Accu is linking accuracy; Rec@n is gold candidate recall, with n ranging from 2 to 9 for QuEL and 100 for PBEL PLUS. Rec@5 is gold candidate recall when only the top 5 candidates by ranking score are kept.