Towards Zero-resource Cross-lingual Entity Linking

Cross-lingual entity linking (XEL) grounds named entities in a source language to an English Knowledge Base (KB), such as Wikipedia. XEL is challenging for most languages because of the limited availability of requisite resources. However, much work on XEL has been carried out in simulated settings that actually use significant resources (e.g. source language Wikipedia, bilingual entity maps, multilingual embeddings) that are not available in truly low-resource languages. In this work, we first examine the effect of these resource assumptions and quantify how much the availability of these resources affects the overall quality of existing XEL systems. We next propose three improvements to both entity candidate generation and disambiguation that make better use of the limited resources we do have in resource-scarce scenarios. With experiments on four extremely low-resource languages, we show that our model results in gains of 6-20% end-to-end linking accuracy.


Introduction
Entity linking (EL; Bunescu and Paşca (2006); Cucerzan (2007); Dredze et al. (2010); Hoffart et al. (2011)) identifies entity mentions in a document and associates them with their corresponding entries in a structured Knowledge Base (KB) (Shen et al., 2015), such as Wikipedia or Freebase (Bollacker et al., 2008). EL involves two main steps: (1) candidate generation, retrieving a list of candidate KB entries for each entity mention, and (2) disambiguation, selecting the most likely entry from the candidate list.
In this work, we focus on cross-lingual entity linking (XEL; McNamee et al. (2011); Ji et al. (2015)), where the document is in a (source) language that is different from the (target) language of the KB. Following recent work (Sil et al., 2018; Upadhyay et al., 2018), we use English Wikipedia as this KB. Figure 1 shows an example. (Code is available at https://github.com/shuyanzhou/burn_xel.)
XEL to English from major languages such as Spanish and Chinese has been carefully studied, and significant progress has been made. Success in these languages can be largely attributed to the availability of rich resources. Specifically, the following is a list of resources required by recent works (Tsai and Roth, 2016; Pan et al., 2017; Sil et al., 2018; Upadhyay et al., 2018):

English Wikipedia (W_eng): The target KB and a large corpus of text. Importantly, the text is annotated with anchor text linking between entity mentions (e.g. "Holland" in the body text of an article) and the page for the entity (e.g. "Netherlands"). These annotations can be used to extract mention-entity maps for entity candidate generation, and to directly train entity disambiguation systems.

Source Language Wikipedia (W_src): KB and corresponding text in the source language. Similarly to English Wikipedia, this can be used to obtain mention-entity maps or train disambiguation systems, but the size of Wikipedia is relatively small for most low-resource languages.

Bilingual Entity Maps (M): A map between source language entities and English entities. One common source of this map is Wikipedia inter-language links between the source language and English. These inter-language links can directly and unambiguously link entities in the source language KB to the English KB.

Multilingual Embeddings (E): These embeddings map words in different languages to the same vector space.
The availability of these resources varies widely among languages. They are available for high-resource languages such as Spanish and Chinese, which have been widely used as test-beds for XEL. For example, there are over 1.5 million articles in Spanish Wikipedia, which provide an abundance of annotations. However, the situation is not as favorable for most other languages: while W_eng is invariant of the source language to link from, many of the other resources are small or nonexistent. In fact, only about 300 languages (of the roughly 7,000 living languages in the world) have a Wikipedia W_src, and many of these have a limited number of pages. For example, Oromo, a Cushitic language with 30 million speakers, has only 776 Wikipedia pages. It is similarly difficult to obtain exhaustive bilingual entity maps, and for many languages even the monolingual/parallel text necessary to train multilingual embeddings is scarce.
This work makes two major contributions regarding XEL for low-resource languages.
The first major contribution is empirical. We extensively evaluate the effect of resource restrictions on existing XEL methods in true low-resource settings instead of simulated ones (Section 4). We compare the performance of both the candidate generation model and the disambiguation model of our baseline XEL system between two high-resource languages and four low-resource languages. We quantify how much the availability of the aforementioned resources affects the overall quality of the existing methods, and find that with scarce access to these resources, the performance of existing methods drops significantly. This highlights the effect of resource constraints in realistic settings, and indicates that these constraints should be considered more carefully in future system design.
Our second major contribution is methodological. We propose three methods as first steps towards ameliorating the large degradation in performance we see in low-resource settings. (1) We investigate a hybrid candidate generation method, combining existing lookup-based and neural candidate generation methods to improve candidate list recall by 9-24%. (2) We propose a set of entity disambiguation features that are entirely language-agnostic, allowing us to train a disambiguation system on English and transfer it directly to low-resource languages. (3) We design a non-linear feature combination method, which makes it possible to combine features in a more flexible way. We test these three methodological improvements on four extremely low-resource languages (Oromo, Tigrinya, Kinyarwanda, and Sinhala), and find that the combination of these three techniques leads to consistent performance gains in all four languages, amounting to 6-23% improvement in end-to-end XEL accuracy.

Problem Formulation
Given a set of documents D = {D_1, D_2, ..., D_l} in a source language L_s, a set of detected mentions M_D = {m_1, m_2, ..., m_n} for each document D, and the English Wikipedia E_KB, the goal of XEL is to associate each mention with its corresponding entity in English Wikipedia. We denote an entity in English Wikipedia as e and its parallel entity in the source language Wikipedia as e_src.
For each m_i ∈ M_D, candidate generation first retrieves a list of candidate entities e_i = {e_{i,1}, e_{i,2}, ..., e_{i,n}} from E_KB, based on probabilities p_i = {p_{i,1}, p_{i,2}, ..., p_{i,n}}, where p_{i,j} denotes p(e_{i,j}|m_i). Then, the disambiguation model assigns a score s(e_{i,j}|D) to each e_{i,j}. These scores are normalized over e_i to give the probability p(e_{i,j}|D). The entity with the highest score is selected as the prediction. We denote the gold entity as e*.
Performance of candidate generation is measured by gold candidate recall: the proportion of test mentions whose top-n candidate list contains the gold entity. This recall upper-bounds the performance of an entity disambiguation system. In consideration of the computational cost of the more complicated downstream disambiguation model, n is often 30 or smaller (Sil et al., 2018; Upadhyay et al., 2018). The performance of an end-to-end XEL system is measured by accuracy: the proportion of mentions whose predictions are correct. We follow Yamada et al. (2017) and Ganea and Hofmann (2017) and focus on in-KB accuracy; we ignore mentions whose linked entity does not exist in the KB.

The remainder of this section describes existing methods for candidate generation and disambiguation, and our baseline XEL system, which is heavily inspired by existing work (Ling et al., 2015; Globerson et al., 2016; Pan et al., 2017). We investigate the effect of resource constraints on this system in Section 4. Based on empirical observations, we propose our improved XEL system in Section 5 and present its results in Section 6.
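As a concrete reference point, the gold candidate recall metric defined above can be sketched in a few lines. The mention data and entity names below are invented for illustration:

```python
def gold_candidate_recall(mentions, n=30):
    """Fraction of test mentions whose top-n candidate list contains the gold entity."""
    hits = sum(1 for gold, candidates in mentions if gold in candidates[:n])
    return hits / len(mentions)

# Two toy test mentions: (gold entity, ranked candidate list)
mentions = [
    ("Ethiopia", ["Ethiopia", "Ethiopian_Empire"]),           # gold in list: hit
    ("Netherlands", ["Holland_(band)", "Holland,_Michigan"])  # gold missing: miss
]
print(gold_candidate_recall(mentions))  # 0.5
```

Whatever disambiguation model follows, its end-to-end accuracy cannot exceed this number, which is why the paper tracks it separately.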

Candidate Generation
WIKIMENTION: With access to all the resources we list above, there is a straightforward approach to candidate generation used by most state-of-the-art work in XEL (Sil et al., 2018; Upadhyay et al., 2018). Specifically, a monolingual mention-entity map can be extracted from W_src by finding all cross-article links in W_src, using the anchor text as the mention m and the linked entity as e_src. These entities are then redirected to English Wikipedia with M to obtain e. For instance, if the Oromo mention "Itoophiyaatti" is linked to the entity "Itoophiyaa" in some Oromo Wikipedia pages, the corresponding English Wikipedia entity "Ethiopia" will be acquired through M and used as a candidate entity for the mention. The score p(e_{i,j}|m_i) provided by this model is the probability of linking to e_{i,j} when mentioning m_i. Because of its heavy reliance on W_src and M, WIKIMENTION does not generalize well to real low-resource settings. We discuss this in Section 4.1.
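A minimal sketch of this pipeline, with fabricated anchor links and a fabricated bilingual map M (real systems extract these from W_src dumps and Wikipedia inter-language links):

```python
from collections import Counter, defaultdict

# Toy anchor links extracted from W_src: (anchor text, linked source entity)
anchors = [
    ("Itoophiyaatti", "Itoophiyaa"),
    ("Itoophiyaatti", "Itoophiyaa"),
    ("Itoophiyaatti", "Gaanfa_Afrikaa"),
]
# Toy bilingual entity map M: source entity -> English entity
M = {"Itoophiyaa": "Ethiopia", "Gaanfa_Afrikaa": "Horn_of_Africa"}

counts = defaultdict(Counter)
for mention, e_src in anchors:
    if e_src in M:                     # redirect to English Wikipedia via M
        counts[mention][M[e_src]] += 1

# Normalize the counts into the prior p(e|m)
prior = {m: {e: c / sum(ctr.values()) for e, c in ctr.items()}
         for m, ctr in counts.items()}
print(prior["Itoophiyaatti"])  # {'Ethiopia': 0.666..., 'Horn_of_Africa': 0.333...}
```

The sketch also makes the failure mode visible: with a tiny W_src there are few anchors to count, and with an incomplete M many source entities cannot be redirected at all.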
PIVOTING: Recently, Rijhwani et al. (2019) proposed a zero-shot transfer learning method for XEL candidate generation, which uses no resources in the source language. A character-level LSTM is trained to encode entities using a bilingual entity map between some high-resource language and English. If the chosen high-resource language is closely related to the low-resource language (same language family, shared orthography, etc.), zero-shot transfer will often be successful in generating candidates for the low-resource language. In this case, the model generates a similarity score s(e_{i,j}|m_i), which must be further normalized into a probability p(e_{i,j}|m_i) (Section 5.1).
Notably, both methods have advantages and disadvantages, with PIVOTING generally being more robust, and WIKIMENTION being more accurate when resources are available. To take advantage of this, we propose a method for the calibrated combination of these two methods in Section 5.1.

Featurization and Linear Scoring
Next, we move to the entity disambiguation step, which we further decompose into (1) the design of features and (2) the choice of inference model that combines these features together.

Featurization
Unfortunately for low-resource settings, many XEL disambiguation models rely on extensive resources such as E and W_src (Sil et al., 2018; Upadhyay et al., 2018) to obtain features. However, some previous work on XEL does limit its resource usage to W_eng, which is available regardless of the source language. Our baseline follows one such method by Pan et al. (2017).
We use two varieties of features: unary features that reflect properties of a single entity, and binary features that quantify coherence between pairs of entities. The top half of Table 1 shows the unary feature functions, which take one argument e_{i,j} and return a value that represents some property of this entity. The grayed mention-entity prior f_l^1(e_{i,j}) is the main unary feature used by Pan et al. (2017), and we use it in our baseline. Binary features are in the bottom half of Table 1. Each binary feature function f_g^i(e_{i,j}, e_{k,w}) takes two entities as arguments and returns a value that indicates the relatedness between the entities. Similarly, the grayed co-occurrence feature f_g^1(e_{i,j}, e_{k,w}) is used in the baseline. We refer to these two features as BASE.
While these features have proven useful in higher-resource XEL, in lower-resource scenarios we hypothesize that it is more important to design features that make the most of the language-invariant resource W_eng, to make up for the relative lack of other resources in the source language. We discuss more intelligent features in Section 5.2.

Non-iterative Linear Inference Model
While the design of features is resource-sensitive, the choice of an inference model is fortunately resource-agnostic, as it only relies on the existence of features. Our baseline follows existing (X)EL work (Ling et al., 2015; Globerson et al., 2016; Pan et al., 2017) in linearly aggregating unary features into a local score s_l(e|D) and binary features into a global score s_g(e|D). The local score reflects the properties of an independent entity, and the global score quantifies the coherence between an entity and the other linked entities in the document. The score of each entity is defined as:

s(e_{i,j}|D) = s_l(e_{i,j}|D) + s_g(e_{i,j}|D)    (1)

The local score is the linear combination of the unary features f_l^i(e_{i,j}) ∈ Φ(e_{i,j}):

s_l(e_{i,j}|D) = W_l^T Φ(e_{i,j})

where W_l ∈ R^{d_l × 1} and d_l is the number of unary features in the vector.
On the other hand, the global score s_g is an average aggregation of mention evidence s_m across the document. Each s_m(m_k, e_{i,j}) indicates how strongly a context mention m_k supports the j-th candidate entity of mention m_i:

s_g(e_{i,j}|D) = (1 / (|M_D| − 1)) Σ_{k≠i} s_m(m_k, e_{i,j})    (2)

As a mention is in fact the surface form of other candidate entities, s_m(m_k, e_{i,j}) can be measured by the relatedness between e_{i,j} and the candidate entities e_k of m_k. Our baseline inference model follows Ling et al. (2015) and Globerson et al. (2016) in processing this evidence in a GREEDY manner:

s_m(m_k, e_{i,j}) = max_w s_e(e_{i,j}, e_{k,w})    (3)

Similarly to s_l, s_e(e_{i,j}, e_{k,w}) is the linear combination of the binary features f_g^i(e_{i,j}, e_{k,w}) ∈ Ψ(e_{i,j}, e_{k,w}):

s_e(e_{i,j}, e_{k,w}) = W_g^T Ψ(e_{i,j}, e_{k,w})    (4)

The greedy strategy often results in a suboptimal assignment, as the confidence of each candidate entity is not taken into consideration. To solve this problem, we propose iteratively updating the belief about each candidate entity in Section 5.3.
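The baseline linear-and-greedy scorer above can be sketched as follows. All feature values and weights here are toy numbers, not learned parameters:

```python
def local_score(phi, w_l):
    """Linear combination of unary features (the local score s_l)."""
    return sum(f * w for f, w in zip(phi, w_l))

def greedy_global_score(pair_scores):
    """GREEDY mention evidence: take the max pair score per context mention,
    then average the evidence across context mentions."""
    evidence = [max(scores) for scores in pair_scores]
    return sum(evidence) / len(evidence)

phi = [0.5, 0.25]                       # unary features of one candidate
w_l = [1.0, 2.0]                        # toy weights standing in for W_l
# s_e values of this candidate against each context mention's candidates
pair_scores = [[0.25, 0.75], [0.5, 0.25]]
s = local_score(phi, w_l) + greedy_global_score(pair_scores)
print(s)  # 1.0 + 0.625 = 1.625
```

The max inside `greedy_global_score` is exactly the weakness discussed above: a single high-scoring but wrong candidate of a context mention dominates the evidence regardless of how unlikely that candidate is.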
Following Upadhyay et al. (2018) and Sil et al. (2018), we consider WIKIMENTION as the baseline candidate generation model and BASE+GREEDY as the baseline disambiguator. We denote WIKIMENTION+BASE+GREEDY as the end-to-end baseline system.

Experiment I: Real Low-resource Constraints in XEL
In this section, we study the effects of resource constraints in truly low-resource settings; we then evaluate how this changes the conclusions we may draw about the efficacy of existing XEL models. We attempt to answer the following research questions: (1) how does the availability of resources influence the performance of XEL systems, and (2) how do truly low-resource settings diverge from XEL with more resources?
We perform this study within the context of our WIKIMENTION+BASE+GREEDY baseline (which is conceptually similar to previous work).
We carry out the study on several languages and datasets: TAC-KBP: TAC-KBP 2011 for English (en) (Ji et al., 2011), and TAC-KBP 2015 for Spanish (es) and Chinese (zh) (Ji et al., 2015). All contain documents from forums and news.
Detailed experimental settings are in Section 6.1. It is notable that a large number of previous works examine XEL in simulated low-resource settings such as the TAC-KBP datasets for large languages such as Chinese and English (Sil et al., 2018; Upadhyay et al., 2018), while the DARPA-LRL datasets are more reflective of the true constraints in low-resource scenarios.

Results
Table 2 shows various statistics for the baseline system on English, two high-resource XEL languages, and four low-resource XEL languages. The first row of Table 2 shows the gold candidate recall of WIKIMENTION on the seven languages. The Wikipedia sizes of each language are shown in the last row of the table for reference. In general, the gold candidate recall of WIKIMENTION is positively correlated with the size of the available Wikipedia resources. We note that, compared to the four low-resource languages, the statistics of the two high-resource languages are closer to those of English.
End-to-end performance of a system that selects the entity with the highest score according to WIKIMENTION is listed in the second row of the table. This trivial context-insensitive disambiguation method results in performance not far from the upper bound in all six XEL languages. However, the size of the gap between this method and the upper bound differs greatly between high- and low-resource settings: the gap is significant for the high-resource languages, but quite small for the four low-resource languages. Accordingly, in the third row, where we apply the disambiguation method BASE+GREEDY, we find gains of 2-7% on the high-resource languages, but little to no gain on the low-resource languages. This shows that when using a standard candidate generation method such as WIKIMENTION, there is little room for more sophisticated disambiguation models to improve performance, despite the fact that the development of disambiguation methods (rather than candidate generation) has been the focus of much prior work.

Proposed Model Improvements
Next, we introduce our proposed methods: (1) a calibrated combination of two existing candidate generation models, and (2) an XEL disambiguation model that makes the best use of the resources that will be available in extremely low-resource settings.

Calibrated Candidate List Combination
As the gold candidate recall decides the upper bound of an (X)EL system, candidate lists with close to 100% recall are ideal. However, this is hard to achieve for most low-resource languages, where existing candidate generation models only provide candidate lists with low recall (less than 60%, as we show in Section 4.1). Further, combining candidate lists retrieved by different models is non-trivial, as the scores are not comparable between models. For example, the scores of WIKIMENTION have a probabilistic interpretation while the scores of PIVOTING do not.
We propose a simple method to solve this problem: we convert scores without a probabilistic interpretation into ones that are scaled to the zero-one simplex. Given mention m_i and its top-n candidate entity list e_i along with their scores S_i = {s_{i,1}, ..., s_{i,n}}, the re-calibrated scores are:

p(e_{i,j}|m_i) = exp(γ s_{i,j}) / Σ_{j'=1}^{n} exp(γ s_{i,j'})

where γ is a hyper-parameter that controls the peakiness of the distribution. After calibration, it is safe to combine the prior scores of the two models with an average.

Feature Design
Next, we introduce the feature set for our disambiguation model, including features inspired by previous work (Sil and Florian, 2016; Ganea et al., 2016; Pan et al., 2017), as well as novel features specifically designed to tackle the low-resource scenario. We intentionally avoid features that take source language context words into consideration, as these would be heavily reliant on W_src and E and would weaken the transferability of the model. The formulation and resource requirements of the unary and binary features are shown in the top and bottom halves of Table 1, respectively.
For unary features, we consider the number of mentions an entity is related to as f_l^3, where we consider the entity e_{i,j} related to mention m_k if it co-occurs with any candidate entity of m_k (Moro et al., 2014). We also add the entity prior score f_l^2 over the whole of Wikipedia (Yamada et al., 2017) to reflect the entity's overall salience. The exact match number f_l^4 indicates mention coreference. For binary features, we attempt to deal with the noise and sparsity inherent in the co-occurrence counts of f_g^1. To tackle noise, we calculate the smoothed Positive Pointwise Mutual Information (PPMI) (Church and Hanks, 1990; Ganea et al., 2016) between two entities as f_g^2, which robustly estimates how much more often the two entities co-occur than we would expect by chance. To tackle sparsity, we incorporate the English entity embeddings of Yamada et al. (2017), and calculate the embedding similarity between two entities as f_g^3. Similar techniques have also been used in existing work (Ganea and Hofmann, 2017; Kolitsas et al., 2018). We also add the hyperlink count f_g^4 between a pair of entities: if entity e_i's Wikipedia page mentions e_j, the two are likely to be related.
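As an illustration, the smoothed PPMI feature f_g^2 could be computed along the following lines. The co-occurrence counts are fabricated stand-ins for statistics extracted from W_eng, and we assume the common choice of smoothing the second entity's distribution by raising counts to a power alpha (cf. the 0.75 exponent in Table 1):

```python
import math

def smoothed_ppmi(pair_counts, entity_counts, ei, ej, alpha=0.75):
    """max(0, log p(ei, ej) / (p(ei) * p_alpha(ej))), where p_alpha smooths the
    second entity's distribution by raising counts to the power alpha."""
    total_pairs = sum(pair_counts.values())
    p_ij = pair_counts.get((ei, ej), 0) / total_pairs
    if p_ij == 0.0:
        return 0.0
    total = sum(entity_counts.values())
    p_i = entity_counts[ei] / total
    smooth_total = sum(c ** alpha for c in entity_counts.values())
    p_j = entity_counts[ej] ** alpha / smooth_total
    return max(0.0, math.log(p_ij / (p_i * p_j)))

# Fabricated co-occurrence counts standing in for W_eng statistics
pair_counts = {("Ethiopia", "Addis_Ababa"): 50, ("Ethiopia", "Oslo"): 1,
               ("Norway", "Oslo"): 40}
entity_counts = {"Ethiopia": 60, "Addis_Ababa": 55, "Oslo": 45, "Norway": 42}

print(smoothed_ppmi(pair_counts, entity_counts, "Ethiopia", "Addis_Ababa") >
      smoothed_ppmi(pair_counts, entity_counts, "Ethiopia", "Oslo"))  # True
```

The positive cutoff (the "P" in PPMI) zeroes out pairs that co-occur less than chance, so a single stray co-occurrence such as (Ethiopia, Oslo) contributes nothing.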
We name our proposed feature set that includes all features listed in Table 1 as FEAT.

BURN: Feature Combination Model
With the growing number of features, we posit that a linear model with greedy entity pair selection (Section 3.2) is not expressive enough to take advantage of a rich feature set. Yamada et al. (2017) use Gradient Boosted Regression Trees (GBRT; Friedman (2001)) to combine features, but GBRTs do not allow for end-to-end training and thus constrain the flexibility of the model. Ganea et al. (2016) and Ganea and Hofmann (2017) propose using Loopy Belief Propagation (LBP; Murphy et al. (1999)) to estimate the global score (Equation (2)) and non-linear functions to combine the local and global scores (Equation (1)). However, belief propagation is challenging to implement, and previous work has not attempted to combine more fine-grained features (e.g. the unary features Φ(e_{i,j})) non-linearly.
Instead, we propose a belief update recurrent network (BURN) that combines features in a non-linear and iterative fashion. Compared to existing work (Naradowsky and Riedel, 2016; Ganea et al., 2016; Ganea and Hofmann, 2017) as well as our base model, the advantages of BURN are: (1) it is easy to implement with existing neural network toolkits, (2) its parameters can be learned end-to-end, (3) it considers non-linear combinations over more fine-grained features and thus has the potential to fit more complex combination patterns, and (4) it can model (distance) relations between mentions in the document.
Given a unary feature vector Φ(e_{i,j}) with d_l features, BURN replaces the linear combination in the local score with two fully connected layers:

s_l(e_{i,j}|D) = W_l^2 (σ(W_l^1 Φ(e_{i,j})) + W_l^1 Φ(e_{i,j}))

where σ is a non-linear function, for which we use leaky rectified linear units (Leaky ReLU; Maas et al. (2013)). We add a linear addition of the input to alleviate the vanishing gradient problem. Equation (4) is revised in a similar way.
As discussed in Equation (3), our baseline model calculates the mention evidence greedily. However, there may be many candidate entities for each mention, some containing noise. BURN solves this problem by weighting s_e(e_{i,j}, e_{k,w}) with the current entity probability p(e_{k,w}|D). An illustration is in the bottom of Figure 2. The evidence from m_k is now defined as:

s_m(m_k, e_{i,j}) = Σ_w s_e(e_{i,j}, e_{k,w}) p(e_{k,w}|D)    (5)

Instead of simply averaging the mention evidence as in Equation (2), we also use a gating function to control the influence of m_k's mention evidence on m_i (top of Figure 2), giving the score

s_g(e_{i,j}|D) = Σ_{k≠i} g(m_i, m_k) s_m(m_k, e_{i,j})    (6)

The gating function g is essentially a lookup table that has one scalar for each distance (in words) between two mentions. We train this table along with all other parameters of the model. The motivation for this gating function is that a mention is more likely to be coherent with a nearby mention than with a distant one. We assume that this is true for almost all languages, and thus that it will be useful even without training in the language to be processed.
As the mention evidence equation above shows, there is a circular dependency between entities: s_m depends on entity probabilities, which in turn depend on the scores. To solve this problem, we iteratively update the probability of each entity until convergence or until reaching a maximum number of iterations T. In iteration t, the calculation of s_m uses the entity probabilities from iteration t−1:

s_m^t(m_k, e_{i,j}) = Σ_w s_e(e_{i,j}, e_{k,w}) p^{t−1}(e_{k,w}|D)

Unrolling this network through the iterations, we can see that it is in fact a recurrent neural network.
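A toy, dependency-free sketch of this iterative inference. The local scores, pair scores, and gate values below are hand-picked constants, not learned; the real model computes them from the feature vectors with the layers described above:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def burn_inference(local, pair, gate, T=20):
    """local[i][j]: local score s_l of candidate j of mention i.
    pair[(i, j, k, w)]: pair score s_e between candidates (i, j) and (k, w).
    gate[(i, k)]: gate value for the influence of mention k on mention i."""
    probs = [softmax(row) for row in local]          # iteration 0: local scores only
    for _ in range(T):
        new_probs = []
        for i, row in enumerate(local):
            scores = []
            for j, s_l in enumerate(row):
                s_g = 0.0
                for k in range(len(local)):
                    if k == i:
                        continue
                    # mention evidence: pair scores weighted by previous beliefs
                    s_m = sum(pair.get((i, j, k, w), 0.0) * probs[k][w]
                              for w in range(len(probs[k])))
                    s_g += gate[(i, k)] * s_m
                scores.append(s_l + s_g)
            new_probs.append(softmax(scores))
        probs = new_probs
    return probs

# Two mentions with two candidates each: locally, candidate 1 of each mention
# looks slightly better, but the two candidate-0 entities strongly cohere.
local = [[0.0, 0.2], [0.0, 0.2]]
pair = {(0, 0, 1, 0): 2.0, (1, 0, 0, 0): 2.0}
gate = {(0, 1): 1.0, (1, 0): 1.0}
probs = burn_inference(local, pair, gate)
print(probs[0][0] > probs[0][1])  # coherence flips the decision: True
```

Note how the beliefs from iteration t−1 weight the pair scores at iteration t; with an empty `pair` dictionary the update reduces to the local scores alone and the locally preferred candidates win.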
Training BURN: The weights of BURN are learned end-to-end; following the probabilistic formulation above, we minimize the negative log-likelihood of the gold entities:

L = − Σ_{m_i ∈ M_D} log p^T(e_i*|D)

As discussed above, the disambiguation model is fully language-agnostic, and it does not require any annotated EL data or other resources in the source language. The model weights W_l and W_g and the lookup table of the gating function g are trained on the TAC-KBP 2010 English training set (Ji et al., 2010) only and used as-is in other languages. We use the TAC-KBP 2012 English test set (Mayfield and Javier, 2012) as our development set.
Experiment II: Improving Low-resource XEL

Section 4 demonstrated a dramatic performance degradation for XEL in realistic low-resource settings. In this section, we evaluate the utility of our proposed methods for improving low-resource XEL.

Training Details
All models are implemented in PyTorch (Paszke et al., 2017). The size of the pre-trained entity embeddings (Yamada et al., 2017) is 300; they are trained with a window size of 15 and 15 negative samples. The hidden size h of both W_l^1 and W_g^1 is set to 128, and the dropout rate is set to 0.5. For the gating function, we cap mention distances larger than 50 tokens at 50, then bin the distances with a bin size of 4. We only consider the 30 nearest context mentions for each mention. The maximum number of iterations for inference is set to 20. We use the Adam optimizer with the default learning rate (1e-3) to train the model. The γ of the calibrated candidate combination is set to 1. It takes around two hours to train a GREEDY model and ten hours to train a BURN model on a Titan X GPU, regardless of the feature set.

Results
Table 3 compares models on the datasets we introduced in Section 4. Given that the critical issue was the degradation of the candidate recall of the resource-heavy WIKIMENTION method in low-resource settings (Section 4), we first examine the alternative resource-light PIVOTING model. The first rows of blocks 1 and 2 of the table show the gold candidate recall of each method. While PIVOTING greatly exceeds WIKIMENTION on ti, which has only 168 Wikipedia pages, its performance is much lower on si, which has 15k pages. Overall, while these two models can each outperform the other in their respective favorable settings (when a similar pivot language exists for the former, and when a large Wikipedia exists for the latter), it is challenging to decide which is more appropriate in the face of the realistic setting of existent, but scarce, resources. Thus, the third block shows the results for the hybrid candidate generation model, which uses both WIKIMENTION and PIVOTING. Compared to WIKIMENTION, this method improves the gold candidate recall by 9 to 24% over all four low-resource languages. The improvement (>15%) is especially considerable for om and rw. This reflects the fact that there is a significant number of unique candidate entities retrieved by the two candidate generation methods, and that developing a proper way to combine them results in higher-quality candidate lists. Notably, this method has also increased the headroom for a disambiguation model to contribute: in contrast to the WIKIMENTION setting, where the difference between the prior p(e|m) and gold accuracy was minimal, there is now a 3-9% accuracy gap between the two.
Next, we turn to methods that close this gap. Focusing on the third block of the table, we can see that the proposed disambiguation model can take advantage of the better candidate lists and yields significantly better results on all four languages. Notably, we observe that BURN consistently yields the best performance across all languages, improving by 0.2 to 3.3% over GREEDY. This result demonstrates the advantage of iterative non-linear feature combination in low-resource settings. In contrast, there is no consistent improvement from the proposed feature set FEAT compared to the baseline BASE. This is interesting, as FEAT+BURN outperformed BASE+BURN by more than 10% on the English development set on which it was validated. We suspect this is because the feature value distribution of the English training data differs from that of the low-resource languages, leading to sub-optimal transfer. We leave training algorithms for bridging this gap as an interesting avenue for future work.
In the context of the end-to-end system, the combination of our proposed methods brings a 6-23% improvement over the baseline system. For the languages where resources are relatively scarce (ti, om, rw), the improvement is especially considerable, ranging from 13 to 23%, indicating that our work is a promising first step towards improving XEL in realistic low-resource scenarios.

Conclusion
This paper has made two major contributions to the study of low-resource cross-lingual entity linking (XEL). First, we performed an extensive empirical evaluation of the effect of different resource availability assumptions on XEL, and demonstrated that (1) the accuracy of existing systems greatly degrades in true low-resource settings, and (2) standard candidate generation systems constrain the performance of end-to-end XEL. This fact has been under-discussed in existing work, and we argue that more attention should be paid to candidate generation for low-resource XEL. Second, based on our empirical study, we proposed three methodologies for candidate generation and disambiguation that make the best use of the limited resources we will have in realistic settings. Experimental results suggest that our proposed methodologies are effective under extremely limited-resource scenarios, giving improvements of 6-23% in end-to-end linking accuracy over the baseline system.
An immediate future focus is further improving the performance of candidate generation models in realistic low-resource settings. Further, we could consider more sophisticated strategies for cross-lingual training of entity disambiguation systems to fill the gap between English training data and real-world low-resource data.

Figure 1: XEL for two low-resource languages, Oromo and Sinhala, linking source mentions to the entity "Netherlands" in English Wikipedia.

Figure 2 :
Figure 2: Top: the global score of an entity is a weighted aggregation of mention evidence from context mentions, instead of an average. Bottom: each piece of mention evidence is a weighted entity-pair score, instead of the max.

Table 1 :
Unary features (top half) and binary features (bottom half). Gray indicates BASE features. "Variable" means the feature comes from the candidate generation model, and thus its resource dependency is decided by that model. The smoothing constant is set to 1e-7; c(e) is the frequency of an entity among all anchor links in W_eng; c(e_i, e_j) is the co-occurrence count of two entities in W_eng; p(e_i, e_j) is normalized over all entity pairs and p(e_i) is normalized over all entities with smoothing parameter γ = 0.75; V_e represents the entity embedding of e_i; H_{e_i} represents the set of entities in e_i's English Wikipedia page.

Table 2 :
Gold candidate recall of WIKIMENTION over seven languages, accuracy (%) of selecting the highest score entity, and accuracy after end-to-end EL using the BASE+GREEDY method.

Table 3 :
Accuracy (%) of different systems. The "Block Index" and resource columns (W_eng, W_src, M) show the resource requirements of each block. The performance of the end-to-end baseline system is grayed. The performance of the baseline disambiguation for each candidate generation model is underlined, and numbers in bold show the best performance for each setting. p(e|m) refers to the method that chooses the highest prior score provided by the corresponding candidate generation method.