Joint Multilingual Supervision for Cross-lingual Entity Linking

Cross-lingual Entity Linking (XEL) aims to ground entity mentions written in any language to an English Knowledge Base (KB), such as Wikipedia. XEL for most languages is challenging, owing to the limited availability of resources as supervision. We address this challenge by developing the first XEL approach that combines supervision from multiple languages jointly. This enables our approach to: (a) augment the limited supervision in the target language with additional supervision from a high-resource language (like English), and (b) train a single entity linking model for multiple languages, improving upon individually trained models for each language. Extensive evaluation on three benchmark datasets across 8 languages shows that our approach significantly improves over the current state-of-the-art. We also provide analyses in two limited-resource settings: (a) the zero-shot setting, when no supervision in the target language is available, and (b) the low-resource setting, when some supervision in the target language is available. Our analysis provides insights into the limitations of zero-shot XEL approaches in realistic scenarios, and shows the value of joint supervision in low-resource settings.


Introduction
Entity Linking (EL) systems ground entity mentions in text to entries in Knowledge Bases (KB), such as Wikipedia (Mihalcea and Csomai, 2007). Recently, the task of Cross-lingual Entity Linking (XEL) has gained attention (McNamee et al., 2011; Ji et al., 2015; Tsai and Roth, 2016) with the goal of grounding entity mentions written in any language to the English Wikipedia. For instance, Figure 1 shows a Tamil (a language with >70 million speakers) and an English mention (shown [enclosed]) and their mention contexts. XEL involves grounding the Tamil mention (which translates to 'Liverpool') to the football club Liverpool_F.C., and not the city or the university. XEL enables knowledge acquisition directly from documents in any language, without resorting to machine translation.
Figure 1: Tamil and English mention contexts containing [mentions] of the entity Liverpool_F.C. from the respective Wikipedias. Tamil Wikipedia only has 9 mentions referring to Liverpool_F.C., whereas English Wikipedia has 5303 such mentions. Clearly, there is a need to augment the limited contextual evidence in low-resource languages with evidence from high-resource languages like English. The Tamil sentence translates to "Suarez plays for [Liverpool] and Uruguay."
Training an EL model requires grounded mentions, i.e. mentions of entities that are grounded to a Knowledge Base (KB), as supervision (Figure 1). While millions of such mentions are available in English, by virtue of hyperlinks in the English Wikipedia, this is not the case for most languages. This makes learning XEL models challenging, especially for languages with limited resources (e.g., the Tamil Wikipedia is only 1% of the English Wikipedia in size). To overcome this challenge, it is desirable to augment the limited contextual evidence available in the target language with evidence from high-resource languages like English.
We propose XELMS (XEL with Multilingual Supervision) (§2), the first approach that fulfills the above desiderata by using multilingual supervision to train an XEL model. XELMS represents the mention contexts of the same entity from different languages in the same semantic space using a single context encoder (§2.1). Language-agnostic entity representations are jointly learned with the relevant mention context representations, so that an entity and its context share similar representations. Additionally, by encoding freely available structured knowledge, like fine-grained entity types, the entity and context representations can be further improved (§2.2). The ability to use multilingual supervision enables XELMS to learn XEL models for target languages with limited resources by exploiting freely available supervision from high-resource languages (like English). We show that XELMS outperforms existing state-of-the-art approaches that only use target language supervision, across 3 benchmark datasets in 8 languages (§5.1). Moreover, while previous XEL models (McNamee et al., 2011; Tsai and Roth, 2016) train separate models for different languages, XELMS can train a single model for performing XEL in multiple languages (§5.2).
Figure 2: (a) Overview of XELMS, with labels for the type vector t, the entity vector e, TC-Loss and EC-Loss, and the note "Training mention contexts originate from two (or more) languages". (b) The mention context encoder, illustrated on the context "Everton won against [Liverpool] in an FA Cup match."
One of the goals of XEL is to enable understanding of languages with limited resources. We provide experimental analyses in two such settings. In the zero-shot setting (§6.1), where no supervision is available in the target language, we show that the good performance of zero-shot XEL approaches (Sil et al., 2018) can be attributed to the use of prior probabilities. These probabilities are computed from a large number of grounded mentions, which are not available in realistic zero-shot settings. In the low-resource setting (§6.2), where some supervision is available in the target language, we show that even when only a fraction of the available supervision in the target language is provided, XELMS can achieve competitive performance by exploiting supervision from English.
The contributions of our work are:
• A new XEL approach, XELMS, that learns an XEL model for a language with limited resources by exploiting additional supervision from a high-resource language like English.
• The ability to train a single XEL model for multiple languages jointly, which we show improves on separately trained models.
• Analysis of XEL approaches in the zero-shot and low-resource settings. Our analysis reveals that in realistic scenarios, zero-shot XEL is not as effective as previously shown. We also show that in low-resource settings, jointly training with English leads to better utilization of target language supervision.

Cross-lingual EL with XELMS
Given a mention $m$ in a document $D$ written in any language, XEL involves linking $m$ to its gold entity $e^*$ in a KB $K$. An overview of XELMS is shown in Figure 2a. XELMS computes the probability, $P_{\text{context}}(e \mid m)$, of a mention $m$ referring to entity $e \in K$ using a mention context vector $\mathbf{g} \in \mathbb{R}^h$ representing $m$'s context, and an entity vector $\mathbf{e} \in \mathbb{R}^h$ representing the entity $e \in K$ (one vector per entity). XELMS can also incorporate structured knowledge like fine-grained entity types (§2.2) using a multitask learning approach (Caruana, 1998), by learning a type vector $\mathbf{t} \in \mathbb{R}^h$ for each possible type $t$ (e.g., sports_team) associated with the entity $e$. The entity vector $\mathbf{e}$, context vector $\mathbf{g}$ and type vector $\mathbf{t}$ are jointly trained, and interact through appropriately defined pairwise loss terms: an Entity-Context loss (EC-LOSS), a Type-Entity loss (TE-LOSS) and a Type-Context loss (TC-LOSS).
The mention context vector $\mathbf{g}$ is generated by a mention context encoder (§2.1), shown in Figure 2b. The mention context of $m$ in a document $D$ consists of: (a) neighboring words around the mention, which we refer to as its local context, and (b) surfaces of other mentions appearing in $D$, which we refer to as its document context. XELMS is trained using grounded mentions in multiple languages (English and Tamil in Figure 2a), which can be derived from Wikipedia (§4.1).

Mention Context Representation
To learn from mention contexts in multiple languages, we generate mention context representations using a language-agnostic mention context encoder. An overview of the mention context encoder is shown in Figure 2b. Below we describe the components of the mention context encoder, namely multilingual word embeddings and local and document context encoders.
Multilingual Word Embeddings (Ammar et al., 2016b; Smith et al., 2017; Duong et al., 2017) jointly encode words in multiple (≥2) languages in the same vector space such that semantically similar words in the same language, and translationally equivalent words in different languages, are close (per cosine similarity). Multilingual embeddings generalize bilingual embeddings, which do the same for two languages only.
We use FASTTEXT (Bojanowski et al., 2017; Smith et al., 2017), which aligns monolingual embeddings of multiple languages in the same space using a small dictionary (∼2500 pairs) from each language to English. Both monolingual embeddings and the dictionary can be easily obtained for languages with limited resources. We denote the multilingual word embeddings for a set of tokens [notation lost in extraction].
Figure 3: The local context encoder. Recoverable labels: "Multilingual Token Embeddings of the Context on the Right" and "k".
Local Context Representation The local context of a mention $m$, spanning tokens $i$ to $j$, consists of left context (tokens $i − W$ to $j$) and right context (tokens $i$ to $j + W$). For example, for the mention [Liverpool] in Figure 2b, the left and right contexts are "Everton won against Liverpool" and "Liverpool in an FA Cup match" respectively. The local context encoder (Figure 3) encodes the left and the right contexts into vectors $\mathbf{l} \in \mathbb{R}^h$ and $\mathbf{r} \in \mathbb{R}^h$ using a convolutional neural network (CNN). These two vectors are then combined to generate the local context vector $\mathbf{c} \in \mathbb{R}^h$ (Figure 2b).
The CNN convolves contiguous spans of $k$ tokens using a filter matrix $F \in \mathbb{R}^{kd \times h}$ to project the concatenation ($\oplus$ operator) of the token embeddings in the span. The resulting vector is passed through a ReLU unit to generate the convolutional output $O_i$. The outputs $\{O_i\}$ are pooled by averaging, and the left and right context vectors $\mathbf{l}$ and $\mathbf{r}$ are computed using the respective ENC(·) layers [equations lost in extraction].
Context Conditional Probability We compute the probability of a mention $m$ linking to entity $e$ using its context vector $\mathbf{g}$ and the entity vector $\mathbf{e}$ (eq. 5) [equation lost in extraction].
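The encoder described in words above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`project`, `enc`, `context_probs`), the choice of spans starting at every valid position, and the toy dimensions are hypothetical. In particular, the softmax over candidate entities in `context_probs` is an assumption, since the normalization used for the context conditional probability is not recoverable from this text.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major, shape (k*d, h)


def project(x: Vector, F: Matrix) -> Vector:
    """Compute F^T x for a filter matrix F of shape (len(x), h)."""
    h = len(F[0])
    return [sum(x[i] * F[i][j] for i in range(len(x))) for j in range(h)]


def enc(tokens: List[Vector], F: Matrix, k: int) -> Vector:
    """Convolve spans of k token embeddings, apply ReLU, then average-pool."""
    if len(tokens) < k:
        raise ValueError("need at least k tokens")
    outputs = []
    for start in range(len(tokens) - k + 1):
        # Concatenate the k token embeddings in the span (the "⊕" operator).
        span = [x for tok in tokens[start:start + k] for x in tok]
        outputs.append([max(0.0, v) for v in project(span, F)])  # ReLU
    h = len(F[0])
    return [sum(o[j] for o in outputs) / len(outputs) for j in range(h)]


def context_probs(g: Vector, entity_vecs: List[Vector]) -> List[float]:
    """ASSUMPTION: softmax over dot products g . e across the candidates."""
    scores = [sum(a * b for a, b in zip(g, e)) for e in entity_vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [v / total for v in exps]


# Toy example with d = 2, k = 2, h = 2 (all values are made up).
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = enc(tokens, F, k=2)
probs = context_probs(pooled, [[1.0, 0.0], [0.0, 1.0]])
```

In a trained model, separate filter matrices would be learned for the left and right contexts, and the pooled vectors would be combined with the document context to form the context vector; those steps are omitted here.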

Including Type Information
Incorporating the fine-grained types of a mention $m$ can help rank entities of the appropriate type higher than others (Ling et al., 2015; Gupta et al., 2017; Raiman and Raiman, 2018). For instance, knowing the correct type of mention [Liverpool] as sports_team and constraining linking to entities with the relevant type encourages disambiguation to the correct entity.
To make the mention context representation $\mathbf{g}$ type-aware, we predict the set of fine-grained types of $m$, $T(m) = \{t_1, \ldots, t_{|T(m)|}\}$, using $\mathbf{g}$. Each $t_i$ belongs to a pre-defined type vocabulary $\Gamma$. The probability of a type $t$ belonging to $T(m)$ given the mention context is defined as $P(t \mid m) = \sigma(\mathbf{t}^T \mathbf{g})$, where $\sigma$ is the sigmoid function and $\mathbf{t}$ is the learnable embedding for type $t$.
We define a Type-Context loss (TC-LOSS) as $\text{TC-LOSS} = \text{BCE}(T(m), P(t \mid m))$, where BCE is the Binary Cross-Entropy loss. We also incorporate the entity-type information in the entity representations, and define a similar Type-Entity loss (TE-LOSS).
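The type prediction and TC-LOSS above can be illustrated with a short sketch. The names `type_probs` and `bce` and the toy type vocabulary are hypothetical, and averaging the per-type cross-entropy over the vocabulary (rather than summing) is an assumption, since the exact form of the BCE term is not given in this text.

```python
import math
from typing import Dict, List, Set


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def type_probs(g: List[float], type_vecs: Dict[str, List[float]]) -> Dict[str, float]:
    """P(t | m) = sigmoid(t^T g) for every type t in the vocabulary."""
    return {
        t: sigmoid(sum(a * b for a, b in zip(vec, g)))
        for t, vec in type_vecs.items()
    }


def bce(gold_types: Set[str], probs: Dict[str, float], eps: float = 1e-12) -> float:
    """Binary cross-entropy against the gold type set.

    ASSUMPTION: the loss is averaged over the type vocabulary.
    """
    total = 0.0
    for t, p in probs.items():
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        y = 1.0 if t in gold_types else 0.0
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


# Toy example: zero type vectors give P(t | m) = 0.5 for every type.
g = [1.0, 2.0]
type_vecs = {"sports_team": [0.0, 0.0], "city": [0.0, 0.0]}
tc_loss = bce({"sports_team"}, type_probs(g, type_vecs))
```

The same two functions could compute a Type-Entity loss by passing an entity vector in place of `g`, which is one reading of "a similar Type-Entity loss".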
To identify the gold types $T(m)$ of a mention $m$, we make the distant supervision assumption (same as Ling et al. (2015)) and assign the types of the gold entity $e^*$ to be the types of the mention. Gold fine-grained types of the entities can be acquired from resources like Freebase (Bollacker et al., 2008) or YAGO (Hoffart et al., 2013).

Training and Inference
We explain how XELMS generates candidate entities, performs inference, and combines the different training losses.

Candidate Generation
Candidate generation identifies a small number of plausible entities for a mention $m$ to avoid brute-force comparison with all KB entities. Given $m$, candidate generation outputs a list of candidate entities $C(m) = \{e_1, e_2, \ldots, e_K\}$ of size at most $K$ (we use $K = 20$), each associated with a prior probability $P_{\text{prior}}(e_i \mid m)$ indicating the probability of $m$ referring to $e_i$, given only $m$'s surface. $P_{\text{prior}}$ is estimated from counts over the training mentions.
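Estimating the prior from counts can be sketched as follows. This is a deliberately simplified illustration: it looks up candidates by exact surface match only, whereas the strategy adopted from Tsai and Roth (2016) is more involved. The function names and the toy training mentions are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_counts(mentions: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each surface string links to each entity."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for surface, entity in mentions:
        counts[surface][entity] += 1
    return counts


def candidates(surface: str, counts: Dict[str, Counter], K: int = 20) -> List[Tuple[str, float]]:
    """Return up to K (entity, P_prior) pairs, most likely first."""
    entity_counts = counts.get(surface)
    if not entity_counts:
        return []  # unseen surface: no candidates, hence no prior
    total = sum(entity_counts.values())
    return [(e, n / total) for e, n in entity_counts.most_common(K)]


# Toy training mentions (made up for illustration).
train = [("Liverpool", "Liverpool_F.C.")] * 3 + [("Liverpool", "Liverpool")]
cands = candidates("Liverpool", build_counts(train))
```

Note that the prior here is normalized over all entities seen with the surface, not only the top K; whether the original system renormalizes after truncation is not stated.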
We adopt Tsai and Roth (2016)'s candidate generation strategy with some minor modifications (Appendix A). Using other approaches, like CrossWikis (Spitkovsky and Chang, 2012), leads to consistently worse recall. We note that transliteration-based candidate generation (McNamee et al., 2011; Pan et al., 2017; Tsai and Roth, 2018; Upadhyay et al., 2018) can further improve recall.

Inference
We combine the context conditional entity probability $P_{\text{context}}(e \mid m)$ (eq. 5) and the prior probability $P_{\text{prior}}(e \mid m)$ by taking their union: [equation lost in extraction]
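One way to read "taking their union" is as the probability of the union of two independent events, P(A) + P(B) − P(A)P(B). The sketch below implements that reading; it is an assumption, since the combination equation is not present in this text, and the names `combine` and `link` are hypothetical.

```python
from typing import List, Tuple


def combine(p_context: float, p_prior: float) -> float:
    """ASSUMPTION: union of two independent events."""
    return p_context + p_prior - p_context * p_prior


def link(cands: List[Tuple[str, float, float]]) -> str:
    """Pick the candidate (entity, p_context, p_prior) with the highest combined score."""
    return max(cands, key=lambda c: combine(c[1], c[2]))[0]


# Toy candidates (values made up): a strong prior can outweigh the context score.
best = link([("A", 0.2, 0.9), ("B", 0.7, 0.1)])
```

Under this reading the combined score is never lower than either input, so a confident prior or a confident context model alone is enough to rank a candidate highly.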

Training Objective
When only training the mention context encoder and entity vectors, we minimize the EC-LOSS averaged over all training mentions. When using the two type-aware losses, we minimize a weighted sum of EC-LOSS, TE-LOSS, and TC-LOSS, using the weighting scheme of Kendall et al. (2018).
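As a rough illustration of learned loss weighting, the sketch below uses one common parameterization of uncertainty-based multi-task weighting, in which each loss $L_i$ has a learnable scalar $s_i$ (a log-variance) and the total is $\sum_i e^{-s_i} L_i + s_i$. Whether XELMS uses exactly this form, including any constant factors, is an assumption; the function name `weighted_loss` is hypothetical.

```python
import math
from typing import Sequence


def weighted_loss(losses: Sequence[float], log_vars: Sequence[float]) -> float:
    """Sum of exp(-s_i) * L_i + s_i over tasks.

    ASSUMPTION: s_i is a learnable log-variance per loss; constant factors
    that appear in some derivations are omitted.
    """
    if len(losses) != len(log_vars):
        raise ValueError("one log-variance per loss is required")
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))


# With all s_i = 0 the weighted sum reduces to the plain sum of the losses.
total = weighted_loss([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

The additive `s_i` term penalizes large variances, which stops the optimizer from driving every weight to zero.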

Experimental Setup
We briefly describe the training and evaluation datasets, and the previous XEL approaches from the literature used in our comparison.

Training Mentions
Following previous work, we use hyperlinks from Wikipedia (dumps dated 05/20/2017) as our source of grounded mentions for supervision. Wikipedias in different languages have different pages for the same entity, which are resolved by using interlanguage links (e.g., the page 利物浦 in the Chinese Wikipedia resolves to Liverpool in English). Training mention statistics are shown in Table 1.
We evaluate on 8 languages: German (de), Spanish (es), Italian (it), French (fr), Chinese (zh), Arabic (ar), Turkish (tr) and Tamil (ta), each of which has a varying number of grounded mentions from the respective Wikipedia (Table 1). We note that our method is applicable to any of the 293 Wikipedia languages as a target language.

Evaluation Datasets
We evaluate XELMS on the following benchmark datasets, spanning 8 different languages, thus providing an extensive evaluation.
TH-Test A subset of the dataset used in Tsai and Roth (2016), derived from Wikipedia. The mentions in the dataset fall in two categories, easy and hard, where hard mentions are those for which the most likely candidate according to the prior probability (i.e., $\arg\max_e P_{\text{prior}}(e \mid m)$) is not the correct title. Indeed, most Wikipedia mentions can be correctly linked by selecting the most likely candidate (Ratinov et al., 2011). We use all the hard mentions from Tsai and Roth (2016)'s test splits for each language, and collectively call this subset TH-TEST.
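The easy/hard distinction can be expressed as a one-line check on the prior. The sketch below is illustrative only: the function name `is_hard`, the dictionary representation of the prior, and the treatment of mentions without candidates as hard are assumptions.

```python
from typing import Dict


def is_hard(prior: Dict[str, float], gold: str) -> bool:
    """A mention is hard if the top-prior candidate is not the gold title."""
    if not prior:
        return True  # ASSUMPTION: no candidates means the prior cannot be right
    return max(prior, key=prior.get) != gold


# Toy prior for the surface "Liverpool" (values made up).
prior = {"Liverpool": 0.6, "Liverpool_F.C.": 0.4}
hard_case = is_hard(prior, "Liverpool_F.C.")  # prior prefers the city
easy_case = is_hard(prior, "Liverpool")
```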
TAC15-Test TAC 2015 (Ji et al., 2015) dataset for Chinese and Spanish.It contains documents from discussion forum articles and news.
We evaluate all models using linking accuracy on gold mentions, and assume gold mentions are provided at test time. Table 2 summarizes the different domains of the evaluation datasets.
Tuning We avoid any dataset-specific tuning, instead tuning on a development set and applying the same parameters across all datasets. All tunable parameters were tuned on a development set containing the hard mentions from the train split released by Tsai and Roth (2016). We refer the reader to Appendix B for details on tuning.

Comparative Approaches
We compare against the following state-of-the-art (SoTA) approaches, described with the language from which they use mention contexts in (.),

Experiments
We show that: (a) XELMS can train a better entity linking model for a target language on various benchmark datasets by exploiting additional data from a high-resource language like English (§5.1). (b) XELMS can train a single XEL model for multiple related languages and improve upon separately trained models (§5.2). (c) Adding type information as a multi-task loss to XELMS further improves performance (§5.3). In all tables, we report the linking accuracy of XELMS, averaged over 5 different runs, and mark with * the statistical significance (p < 0.01) of the best result (shown bold) against the state-of-the-art (SoTA) using Student's one-sample t-test.
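The significance test mentioned above compares the mean of several runs against a single reference value. A minimal sketch of the one-sample t statistic follows; the function name is hypothetical, the run accuracies in the example are invented purely for illustration, and converting the statistic to a p-value (for example with a t-distribution table at n − 1 degrees of freedom) is left out.

```python
import math
import statistics
from typing import Sequence


def one_sample_t(values: Sequence[float], reference: float) -> float:
    """t = (mean - reference) / (sample_std / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (mean - reference) / (std / math.sqrt(n))


# Invented accuracies from 5 hypothetical runs versus an invented reference score.
runs = [80.1, 80.4, 79.9, 80.3, 80.2]
t_stat = one_sample_t(runs, reference=79.0)
```

With 5 runs there are 4 degrees of freedom, so the statistic would be compared against the critical value for the chosen significance level and sidedness, neither of which is specified beyond p < 0.01 in the text.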

Monolingual and Joint Models
In Tables 3 and 4 we compare XELMS(mono), which uses monolingual supervision in the target language only, and XELMS(joint), which uses supervision from English in addition to the monolingual supervision, with the state-of-the-art approaches.
We see that XELMS(mono) achieves similar or slightly better scores than the respective SoTA on all datasets. The SoTA for MCN-TEST in Turkish and Chinese enhances the model by using transliteration for candidate generation, explaining its superior performance. XELMS(joint) performs substantially better than XELMS(mono) on all datasets (Tables 3 and 4), indicating that using additional supervision from a high-resource language like English leads to better linking performance. In particular, XELMS(joint) outperforms the SoTA on all languages in TH-TEST, on Spanish in TAC15-Test, and on 4 of the 7 languages in MCN-TEST.

Multilingual Training
XELMS is the first approach that can train a single XEL model for multiple languages. To demonstrate this capability, we train a model, henceforth referred to as XELMS(multi), jointly on 5 related languages: Spanish, German, French, Italian and English. We compare XELMS(multi) to the respective XELMS(joint) model for each language. Tables 4 and 5 show that XELMS(multi) is better than (or on par with) XELMS(joint) on all datasets. This shows that XELMS(multi) can make more efficient use of available supervision in related languages than previous approaches, which trained separate models per language.

Adding Fine-grained Type Information
To study the effect of adding fine-grained type information, in Table 4 we compare XELMS(mono) and XELMS(joint) to XELMS(mono+type) and XELMS(joint+type) respectively, which are versions of XELMS(mono) and XELMS(joint) trained using the two type-aware losses.
XELMS(mono+type) and XELMS(joint+type) both improve compared to XELMS(mono) and XELMS(joint) on MCN-TEST and TH-TEST (Table 6 vs Table 3), showing the benefit of using structured knowledge in the form of fine-grained types. Similar trends are also seen on TAC15-TEST (Table 4), where XELMS(joint+type) improves on the SoTA for Spanish and Chinese.

Experiments with Limited Resources
The key motivation of XELMS is to exploit supervision from high-resource languages like English to aid XEL for languages with limited resources. In this section, we examine two such scenarios. (a) Zero-shot setting, i.e., no supervision available in the target language. Our analysis reveals the limitations of zero-shot XEL approaches and finds that prior probabilities, which are unavailable in realistic zero-shot scenarios, play an important role in achieving good performance (§6.1). (b) Low-resource setting, i.e., some supervision available in the target language. We show that by combining supervision from a high-resource language, like English, XELMS can achieve competitive performance with a fraction of the available supervision in the target language (§6.2).

Zero-shot Setting
We first explain how XELMS can perform zero-shot XEL, the implications of our zero-shot setting, and how it is more realistic than previous work.
Zero-shot XEL with XELMS XELMS performs zero-shot XEL by training a model using English supervision and multilingual embeddings for English, and directly applying it to the test data in another language using the respective multilingual word embedding instead of English embeddings.
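The embedding swap that makes zero-shot transfer possible can be illustrated with a toy sketch. Everything here is hypothetical: the lookup tables, the two-dimensional vectors, and the use of average pooling plus a dot product as a stand-in for the trained context encoder and entity scoring. The toy tables are idealized so that translation pairs share identical vectors, which real aligned embeddings only approximate.

```python
from typing import Dict, List

Vector = List[float]


def embed(tokens: List[str], table: Dict[str, Vector], dim: int) -> List[Vector]:
    """Look up each token in a language-specific table; unknown tokens map to zeros."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]


def score(tokens: List[str], table: Dict[str, Vector], entity_vec: Vector) -> float:
    """Stand-in for the trained model: average-pool the embeddings, then dot with the entity vector."""
    if not tokens:
        raise ValueError("empty context")
    dim = len(entity_vec)
    vectors = embed(tokens, table, dim)
    pooled = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return sum(a * b for a, b in zip(pooled, entity_vec))


# Idealized aligned tables: translation pairs share the same vector.
EN = {"plays": [1.0, 0.0], "for": [0.0, 1.0]}
ES = {"juega": [1.0, 0.0], "para": [0.0, 1.0]}
entity_vec = [0.5, 0.5]

english_score = score(["plays", "for"], EN, entity_vec)  # seen during training
spanish_score = score(["juega", "para"], ES, entity_vec)  # zero-shot: only the table changes
```

The point of the sketch is that the scoring function never sees which language the tokens came from; only the lookup table is exchanged at test time.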
No Prior Probabilities Prior probabilities (or prior), i.e., $P_{\text{prior}}$, have been shown to be a reliable indicator of the correct disambiguation in entity linking (Ratinov et al., 2011; Tsai and Roth, 2016). These probabilities are estimated from counts over the training mentions in the target language. In the absence of training data for the target language, as in the zero-shot setting, these prior probabilities are not available to an XEL model.

Comparison to Previous Work
The only other model capable of zero-shot XEL is that of Sil et al. (2018). However, Sil et al. (2018) use prior probabilities and coreference chains for the target language in their zero-shot experiments, neither of which will be available in a realistic zero-shot scenario. Compared to Sil et al. (2018), we evaluate the performance of zero-shot XEL in a more realistic setting, and show that it is adversely affected by the absence of prior probabilities.
Is zero-shot XEL really effective? To evaluate the effectiveness of the zero-shot XEL approach, we perform zero-shot XEL using XELMS on all datasets. Table 7 shows zero-shot XEL results on all datasets, both with and without using the prior during inference. Note that zero-shot XEL (with prior) is close to the SoTA (Sil et al., 2018) on TAC15-TEST, which also uses the prior for zero-shot XEL. However, for zero-shot XEL (without prior), performance drops by more than 20% for TAC15-Test, by 2.4% for TH-Test and by 2.1% for McN-Test. This indicates that zero-shot XEL is not effective in a realistic zero-shot setting (i.e., when the prior is unavailable for inference). We found that the prior is indeed a strong indicator of the correct disambiguation. For instance, simply selecting the most likely candidate using the prior for TAC15-TEST achieved 77.2% and 78.8% for Spanish and Chinese respectively. It is interesting to note that zero-shot XEL, both with and without the prior, performs worse than the best possible model on TH-TEST, because TH-TEST was constructed to ensure prior probabilities are not strong indicators (Tsai and Roth, 2016). On MCN-TEST, we found that an average of 75.9% of mentions have only one (the correct) candidate, making them trivial to link, regardless of the absence of priors.
The results show that most of the XEL performance in zero-shot settings can be attributed to the availability of prior probabilities for the candidates. It is evident that zero-shot XEL in a realistic setting (i.e., when prior probabilities are not available) is still a challenging problem.

Low-resource Setting
We analyze the behavior of XELMS in a low-resource setting, i.e. when some supervision is available in the target language. The aim of this setting is to estimate how much supervision from the target language is needed to get reasonable performance when using it jointly with supervision from English. To discount the effect of prior probabilities, we report all results without the prior.
Figure 4: Results for varying numbers of train mentions in the target language L (= Turkish (tr), Chinese (zh) and Spanish (es)). We compare both XELMS(mono) and XELMS(joint) to the best results using all available supervision, denoted by L-best. To discount the effect of the prior, all results above are without it. For number of train mentions = 0, XELMS(joint) is equivalent to zero-shot without prior. Best viewed in color.
Figure 4 plots results on the TH-Test dataset when training an XELMS(joint) model by gradually increasing the number of mention contexts for target language L (= Spanish, Chinese and Turkish) that are available for supervision. Figure 4 also shows the best results achieved using all available target language supervision (denoted by L-best). For comparison with the monolingually supervised model, we also plot the performance of XELMS(mono), which only uses the target language supervision.
Figure 4 shows that after training on 0.75M mentions from Turkish and Chinese (and 1.0M mentions from Spanish), the XELMS(joint) model is within 2-3% of the respective L-best model, which uses all training mentions in the target language, indicating that XELMS(joint) can reach competitive performance even with a fraction of the full target language supervision. For comparison, an XELMS(mono) model trained on the same number of training mentions is 5-10% behind the respective XELMS(joint) model, showing better utilization of target language supervision by XELMS(joint).
Existing approaches have taken two main directions to obtain supervision for learning XEL models: (a) using mention contexts appearing in the target language (McNamee et al., 2011; Tsai and Roth, 2016), or (b) using mention contexts appearing only in English (Pan et al., 2017; Sil et al., 2018). We describe these directions and their limitations below, and explain how XELMS overcomes these limitations.
McNamee et al. (2011) use annotation projection via parallel corpora to generate mention contexts in the target language, while Tsai and Roth (2016) learn separate XEL models for each language and only use mention contexts in the target language. Both of these approaches have scalability issues for languages with limited resources. Another limitation of these approaches is that they train separate models for each language, which is inefficient when working with multiple languages. XELMS overcomes these limitations as it can use mention contexts from multiple languages simultaneously, and train a single model.
Other approaches only use mention contexts from English. While Pan et al. (2017) compute entity coherence statistics from English Wikipedia, Sil et al. (2018) perform zero-shot XEL for Chinese and Spanish by using multilingual embeddings to transfer a pre-trained English EL model. However, our work suggests that mention contexts in the target language should also be used, if available. Indeed, a recent study (Lewoniewski et al., 2017) found that for language-sensitive topics, the quality of information can be better in the relevant language version of Wikipedia than in the English version. Our analysis also shows that zero-shot XEL approaches like that of Sil et al. (2018) are not effective in realistic zero-shot scenarios where good prior probabilities are unlikely to be available. In such cases, we showed that combining supervision available in the target language with supervision from a high-resource language like English can yield significant performance improvements.

Conclusion
We introduced XELMS, an approach that can combine supervision from multiple languages to train an XEL model. We illustrate its benefits through extensive evaluation on different benchmarks. XELMS is also the first approach that can train a single model for multiple languages, making more efficient use of available supervision than previous approaches, which trained separate models.
Our analysis sheds light on the poor performance of zero-shot XEL in realistic scenarios where the prior probabilities for candidates are unlikely to exist, in contrast to findings in previous work that focused on high-resource languages. We also show how, in low-resource settings, XELMS makes it possible to achieve competitive performance even when only a fraction of the available supervision in the target language is provided.
Several future research directions remain open. For all XEL approaches, the task of candidate generation is currently limited by the existence of a target language Wikipedia and remains a key challenge. A joint inference framework which enforces coherent predictions (Cheng and Roth, 2013; Globerson et al., 2016; Ganea and Hofmann, 2017) could also lead to further improvements for XEL. Similar techniques can be applied to other information extraction tasks, like relation extraction, to extend them to multilingual settings.

Figure 1: Tamil and English mention contexts containing [mentions] of the entity Liverpool_F.C. from the respective Wikipedias. Mentions are shown [enclosed].

Figure 2: (a) Grounded mentions from two or more languages (English and Tamil shown) can be used to supervise XELMS. The context g, entity e and type t vectors interact through Entity-Context loss (EC-LOSS), Type-Context loss (TC-LOSS) and Type-Entity loss (TE-LOSS). The Tamil sentence is the same as in Figure 1, and other mentions in it translate to [Suarez] and [Uruguay]. (b) The Mention Context Encoder (§2.1) encodes the local context (neighboring words) and the document context (surfaces of other mentions in the document) of the mention into g. The internal view of the local context encoder is shown in Figure 3.

Figure 3: Local Context Encoder, for the right context. Figure 2b shows how it fits inside the Mention Context Encoder.
$$P_{\text{context}}(e \mid m) = \frac{\exp(g^{T} e)}{\sum_{e' \in C(m)} \exp(g^{T} e')} \tag{5}$$

where $C(m)$ denotes all candidate entities of the mention $m$ (§3.1 explains how $C(m)$ is generated). We minimize the negative log-likelihood of $P_{\text{context}}(e \mid m)$ with respect to the gold entity $e^{*}$ against the candidate entities $C(m)$, and call it the Entity-Context loss (EC-LOSS),

$$\text{EC-LOSS} = -\log \frac{P_{\text{context}}(e^{*} \mid m)}{\sum_{e' \in C(m)} P_{\text{context}}(e' \mid m)} \tag{6}$$
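To make Equations 5 and 6 concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names, the toy two-dimensional vectors and the candidate set are illustrative assumptions; in the model, g and the entity vectors e are learned.

```python
import math


def context_probabilities(g, candidates):
    """Softmax over the dot products g^T e for every candidate entity (Eq. 5).

    g          -- context vector as a list of floats
    candidates -- dict mapping entity name -> entity vector (hypothetical toy data)
    """
    scores = {name: sum(gi * ei for gi, ei in zip(g, e))
              for name, e in candidates.items()}
    # Subtracting the max score leaves the softmax unchanged but avoids overflow.
    max_score = max(scores.values())
    exps = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}


def ec_loss(g, candidates, gold):
    """Negative log-likelihood of the gold entity against C(m) (Eq. 6)."""
    probs = context_probabilities(g, candidates)
    return -math.log(probs[gold] / sum(probs.values()))


# Toy example: vectors are made up purely for illustration.
g = [1.0, 0.0]
candidates = {
    "Liverpool_F.C.": [2.0, 0.0],
    "Liverpool": [1.0, 1.0],
    "University_of_Liverpool": [0.0, 1.0],
}
probs = context_probabilities(g, candidates)
loss = ec_loss(g, candidates, gold="Liverpool_F.C.")
print(probs, loss)
```

Because the probabilities in Equation 5 already sum to one over C(m), the denominator in Equation 6 equals one in this sketch, so the loss reduces to the negative log-probability of the gold entity.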

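$$P_{\text{model}}(e \mid m) = P_{\text{prior}}(e \mid m) + P_{\text{context}}(e \mid m) - P_{\text{prior}}(e \mid m) \times P_{\text{context}}(e \mid m) \tag{7}$$

Inference for the mention $m$ picks the entity

$$\hat{e} = \arg\max_{e \in C(m)} P_{\text{model}}(e \mid m) \tag{8}$$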
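The combination of prior and context probabilities and the arg max inference rule above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and all probability values are made up, and treating "without prior" as ranking by the context probability alone is an assumption about how the prior is dropped at inference time.

```python
def combine(p_prior, p_context):
    """P_model = P_prior + P_context - P_prior * P_context."""
    return p_prior + p_context - p_prior * p_context


def link(prior, context, use_prior=True):
    """Return the candidate with the highest score.

    prior, context -- dicts mapping candidate entity -> probability (toy values)
    use_prior      -- if False, rank by the context probability only
                      (assumed reading of the "without prior" setting)
    """
    def score(entity):
        if use_prior:
            return combine(prior.get(entity, 0.0), context[entity])
        return context[entity]

    return max(context, key=score)


# Hypothetical numbers: the context model slightly prefers the city,
# while the prior strongly prefers the football club.
prior = {"Liverpool_F.C.": 0.80, "Liverpool": 0.15, "University_of_Liverpool": 0.05}
context = {"Liverpool_F.C.": 0.40, "Liverpool": 0.45, "University_of_Liverpool": 0.15}

with_prior = link(prior, context, use_prior=True)
without_prior = link(prior, context, use_prior=False)
print(with_prior, without_prior)
```

In this toy case the prediction changes once the prior is removed, which is the kind of sensitivity the zero-shot analysis measures when it reports results with and without the prior.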

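Figure 4: Linking accuracy vs. the number of train mentions in the target language L (= Turkish (tr), Chinese (zh) and Spanish (es)). We compare both XELMS(mono) and XELMS(joint) to the best results using all available supervision, denoted by L-best. To discount the effect of the prior, all results above are without it. For number of train mentions = 0, XELMS(joint) is equivalent to zero-shot without prior. Best viewed in color.

These vectors together generate the local context vector $c = F_{2h,h}(l \oplus r)$. Here $F_{d_i,d_o} : v_i \rightarrow v_o$ denotes a feed-forward layer that takes $v_i \in \mathbb{R}^{d_i}$ as input, and outputs $v_o \in \mathbb{R}^{d_o}$. To incorporate this, we define the document context $d_m$ of a mention $m$ appearing in document $D$ to be the bag of all other mentions in $D$. We encode $d_m$ into a dense document context vector $d \in \mathbb{R}^{h}$ by a feed-forward layer $d = F_{|V|,h}(d_m)$. Here $V$ is the set containing all mention surfaces seen during training.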
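The local context vector c and the document context vector d described above can be illustrated with a small sketch. Several details are assumptions rather than facts from the paper: the feed-forward layer is modeled as an affine map followed by ReLU, the bag of mentions is encoded as a binary indicator vector over V, the weights are random instead of learned, and all names (`make_layer`, `feed_forward`, `bag_of_mentions`) are hypothetical.

```python
import random


def make_layer(d_in, d_out, rng):
    """Create a toy layer F_{d_in,d_out} with small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weights, bias


def feed_forward(layer, v):
    """Apply ReLU(W v + b); the ReLU nonlinearity is an assumption."""
    weights, bias = layer
    return [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(weights, bias)]


def bag_of_mentions(doc_mentions, target_index, vocab):
    """Encode all other mention surfaces in the document as a |V|-dim indicator vector."""
    bag = [0.0] * len(vocab)
    for i, surface in enumerate(doc_mentions):
        # Skip the mention being linked and surfaces unseen during training.
        if i != target_index and surface in vocab:
            bag[vocab[surface]] = 1.0
    return bag


rng = random.Random(0)
h = 4  # hidden size, chosen arbitrarily for the example

# Document context: d = F_{|V|,h}(d_m)
vocab = {"Suarez": 0, "Liverpool": 1, "Uruguay": 2}
doc_mentions = ["Suarez", "Liverpool", "Uruguay"]
d_m = bag_of_mentions(doc_mentions, target_index=1, vocab=vocab)
d = feed_forward(make_layer(len(vocab), h, rng), d_m)

# Local context: c = F_{2h,h}(l concatenated with r), with made-up l and r.
left = [0.2, -0.1, 0.4, 0.0]
right = [0.1, 0.3, -0.2, 0.5]
c = feed_forward(make_layer(2 * h, h, rng), left + right)
print(d_m, d, c)
```

For the mention [Liverpool] in the example sentence, the document context marks only [Suarez] and [Uruguay], so the encoder sees which other entities co-occur with the mention being linked.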

Table 1: Number of train mentions (from Wikipedia) in each language, with % size relative to English (51.7M mentions). Train mentions from Wikipedias like Arabic, Turkish and Tamil are <10% the size of those from the English Wikipedia.

Table 2: Evaluation datasets used in our experiments.
Pan et al. (2017) (English only) uses entity coherence statistics from English Wikipedia and the document context of a mention for XEL. Current SoTA on MCN-TEST, except for Italian and Turkish, for which it is McNamee et al. (2011).

Table 3: XELMS(joint) improves upon XELMS(mono) and the current State-of-The-Art (SoTA) on TH-TEST and MCN-TEST, showing the benefit of using additional supervision from English. The best score is shown in bold and * marks statistical significance of best against SoTA. Refer to §4.3 for details on SoTA.

Table 4: Linking accuracy on TAC15-TEST. Numbers for Sil et al. (2018) from personal communication.

Table 5: Linking accuracy of a single XELMS(multi) model for four languages: German, Spanish, French and Italian. Individually trained XELMS(joint) scores are also shown. The best score is shown in bold and * marks statistical significance of best against SoTA. Refer to §4.3 for details on SoTA.

Table 6: Adding fine-grained type information further improves linking accuracy (compare to Table 3). The best score is shown in bold and * marks statistical significance of best against SoTA. Refer to §4.3 for details on SoTA.

Table 7: Linking accuracy of the zero-shot (Z-S) approach on different datasets. Zero-shot (w/ prior) is close to SoTA for datasets like TAC15-TEST, but performance drops in the more realistic setting of zero-shot (w/o prior) (§6.1) on all datasets, indicating most of the performance can be attributed to the presence of prior probabilities. The slight drop in MCN-TEST is due to trivial mentions, which only have a single candidate.