Joint Representation Learning of Cross-lingual Words and Entities via Attentive Distant Supervision

Joint representation learning of words and entities benefits many NLP tasks, but has not been well explored in cross-lingual settings. In this paper, we propose a novel method for joint representation learning of cross-lingual words and entities. It captures mutually complementary knowledge and enables cross-lingual inference among knowledge bases and texts. Our method does not require a parallel corpus; instead, it automatically generates comparable data via distant supervision using multi-lingual knowledge bases. We utilize two types of regularizers to align cross-lingual words and entities, and design knowledge attention and cross-lingual attention to further reduce noise. We conducted a series of experiments on three tasks: word translation, entity relatedness, and cross-lingual entity linking. The results, both qualitative and quantitative, demonstrate the significance of our method.


Introduction
Multi-lingual knowledge bases (KB) store millions of entities and facts in various languages, and provide rich structured background knowledge for understanding texts. On the other hand, text corpora contain a huge amount of statistical information complementary to KBs. Many researchers leverage both types of resources to improve various natural language processing (NLP) tasks, such as machine reading (Yang and Mitchell, 2017) and question answering (Hao et al., 2017).
Most existing work jointly models KB and text corpus to enhance each other by learning word and entity representations in a unified vector space. For example, Yamada et al. (2016) utilize co-occurrence information to align similar words and entities with similar embedding vectors. Toutanova et al. (2015); Wu et al. (2016); Han et al. (2016); Weston et al. (2013a); Wang and Li (2016) represent entities based on their textual descriptions together with the structured relations. These methods focus on mono-lingual settings. For cross-lingual tasks (e.g., cross-lingual entity linking), however, these approaches need additional tools for translation, which incurs extra costs and inevitable errors (Ji et al., 2016).
In this paper, we carry out cross-lingual joint representation learning, which has not been fully researched in the literature. We aim to create a unified space for words and entities in various languages, easing cross-lingual semantic comparison, which benefits from the complementary information in different languages. For instance, two different meanings of the English word center are expressed by two different words in Chinese: center as an activity-specific building is 中心, whereas center as a basketball player role is 中锋.
Our main challenge is the limited availability of parallel corpora, which are usually either expensive to obtain or only available for certain narrow domains (Gouws et al., 2015). Much work has been done to alleviate the problem. One school of methods uses adversarial techniques or domain adaptation to match linguistic distributions (Zhang et al., 2017b; Barone, 2016; Cao et al., 2016). These methods do not require parallel corpora. Their weakness is that the training process is unstable and that the high complexity restricts them to small-scale data. Another line of work uses pre-existing multi-lingual resources to automatically generate "pseudo bilingual documents" (Vulic and Moens, 2015, 2016). However, negative results have been observed due to the occasionally poor quality of the training data (Vulic and Moens, 2016). All of the above methods focus only on words. We consider both words and entities, which makes the parallel data issue more challenging.
In this paper, we propose a novel method for joint representation learning of cross-lingual words and entities. The basic idea is to capture mutually complementary knowledge in a shared semantic space, which enables joint inference among cross-lingual knowledge base and texts without additional translations. We achieve it by (1) utilizing an existing multi-lingual knowledge base to automatically generate cross-lingual supervision data, (2) learning mono-lingual word and entity representations, (3) applying cross-lingual sentence regularizer and cross-lingual entity regularizer to align similar words and entities with similar embeddings. The entire framework is trained using a unified objective function, which is efficient and applicable to arbitrary language pairs that exist in multi-lingual KBs.
In particular, we build a bilingual entity network from inter-language links in KBs for regularizing cross-lingual entities through a variant of the skip-gram model (Mikolov et al., 2013c). Thus, mono-lingual structured knowledge of entities is not only extended to cross-lingual settings, but also augmented from other languages. On the other hand, we utilize distant supervision to generate comparable sentences for the cross-lingual sentence regularizer to model co-occurrence information across languages. Compared with "pseudo bilingual documents", comparable sentences achieve higher quality, because they rely not only on the shared semantics at the document level, but also on cross-lingual information at the sentence level. We further introduce two attention mechanisms, knowledge attention and cross-lingual attention, to select informative data in comparable sentences.
Our contributions can be summarized as follows:
• We propose a novel method that jointly learns representations of not only cross-lingual words but also cross-lingual entities in a unified vector space, so that complementary semantics improve the embedding quality of both.
• Our model introduces distant supervision coupled with attention mechanisms to generate comparable data as cross-lingual supervision, which can benefit many cross-lingual analyses.
• We conduct a qualitative analysis to gain an intuitive impression of our embeddings, and quantitative analyses on three tasks: word translation, entity relatedness, and cross-lingual entity linking. Experimental results show that our method achieves significant improvements on all three tasks.

Related Work
Joint representation learning of words and entities attracts much attention in fields such as entity linking (Zhang et al., 2017a; Cao et al., 2018) and relation extraction (Weston et al., 2013b), yet little work focuses on cross-lingual settings. We therefore draw on cross-lingual word embedding models (Ruder et al., 2017), and classify them into three groups according to the parallel corpora used as supervision: (i) methods requiring a parallel corpus with aligned words as constraints for bilingual word embedding learning (Klementiev et al., 2012; Zou et al., 2013; Wu et al., 2014; Luong et al., 2015; Ammar et al., 2016; Soricut and Ding, 2016); (ii) methods using parallel sentences (i.e., translated sentence pairs) as the semantic composition of multi-lingual words (Gouws et al., 2015; Kociský et al., 2014; Hermann and Blunsom, 2014; Chandar et al., 2014; Shi et al., 2015; Mogadala and Rettinger, 2016); (iii) methods requiring a bilingual lexicon to map words from one language into the other (Mikolov et al., 2013b; Faruqui and Dyer, 2014; Xiao and Guo, 2014).
The major weakness of these methods is the limited availability of parallel corpora. One remedy is to use existing multi-lingual resources (i.e., a multi-lingual KB). Camacho-Collados et al. (2015) combine several KBs (Wikipedia, WordNet and BabelNet) and leverage multi-lingual synsets to learn word embeddings at the sense level through an extra post-processing step. Artetxe et al. (2017) start from a small bilingual lexicon and use a self-learning approach to induce the structural similarity of embedding spaces. Vulic and Moens (2015, 2016) collect comparable documents on the same themes from multi-lingual Wikipedia, then shuffle and merge them to build "pseudo bilingual documents" as training corpora. However, the quality of "pseudo bilingual documents" is difficult to control, resulting in poor performance in several cross-lingual tasks (Vulic and Moens, 2016).
Another remedy matches linguistic distributions via adversarial training (Barone, 2016; Zhang et al., 2017b; Lample et al., 2018) or domain adaptation (Cao et al., 2016). However, these methods suffer from an unstable training process and high complexity, which either limits the scalability of the vocabulary size or relies on a strong distribution assumption. Inspired by Vulic and Moens (2016), we generate high-quality comparable sentences via distant supervision, which is one of the most promising approaches to addressing the issue of sparse training data and performs well in relation extraction (Lin et al., 2017a; Mintz et al., 2009; Zeng et al., 2015; Hoffmann et al., 2011; Surdeanu et al., 2012). Our comparable sentences may further benefit many other cross-lingual analyses, such as information retrieval.

Preliminaries
Given a multi-lingual KB, we take (i) a text corpus, (ii) entities and their relations, and (iii) a set of anchors as inputs, and learn embeddings for each word and each entity in various languages. For clarity, we use English and Chinese as sample languages in the rest of the paper, and use the superscript $y \in \{en, zh\}$ to denote language-specific parameters.
We use multi-lingual Wikipedia as the KB, which includes a set of entities $E^y = \{e^y_i\}$ and their articles. We concatenate these articles to form the text corpus $D^y = \langle w^y_1, \ldots, w^y_i, \ldots, w^y_{|D|} \rangle$. Hyperlinks in articles are denoted by anchors $A^y = \{\langle w^y_i, e^y_j \rangle\}$, each indicating that word $w^y_i$ refers to entity $e^y_j$. $G^y = (E^y, R^y)$ is the mono-lingual Entity Network (EN), where $R^y = \{\langle e^y_i, e^y_j \rangle\}$ contains a pair if there is a link between $e^y_i$ and $e^y_j$. We use inter-language links in Wikipedia as cross-lingual links $R^{en-zh} = \{\langle e^{en}_i, e^{zh}_{i'} \rangle\}$, indicating that $e^{en}_i$ and $e^{zh}_{i'}$ refer to the same thing in English and Chinese. Cross-lingual word and entity representation learning maps words and entities in different languages into a unified semantic space. Each word and each entity obtains an embedding vector, $\mathbf{w}^y_i$ and $\mathbf{e}^y_j$, respectively.

Framework
To alleviate the heavy burden of limited parallel corpora and additional translation efforts, we utilize existing multi-lingual resources to distantly supervise cross-lingual word and entity representation learning, so that the shared embedding space supports joint inference among KB and texts across languages. As shown in Figure 1, our framework has two steps: (1) Cross-lingual Supervision Data Generation builds a bilingual entity network and generates comparable sentences based on cross-lingual links; (2) Joint Representation Learning learns cross-lingual word and entity embeddings using a unified objective function.
Our assumption throughout the entire framework is as follows: The more words/entities two contexts share, the more similar they are.
As shown in Figure 1, we build a bilingual EN $G^{en-zh}$ from $G^{en}$, $G^{zh}$ and the cross-lingual links $R^{en-zh}$. Entities in different languages are thus connected in a unified network, which facilitates cross-lingual entity alignment. Meanwhile, from KB articles, we extract comparable sentences $S^{en-zh} = \{\langle s^{en}_k, s^{zh}_k \rangle\}$ as high-quality parallel data to align similar words in different languages.
Based on the generated cross-lingual data $G^{en-zh}$, $S^{en-zh}$ and the mono-lingual data $D^y$, $A^y$, where $y \in \{en, zh\}$, we jointly learn cross-lingual word and entity embeddings through three components: (1) Mono-lingual Representation Learning, which learns mono-lingual word and entity embeddings for each language by modeling co-occurrence information through a variant of the skip-gram model (Mikolov et al., 2013c). (2) Cross-lingual Entity Regularizer, which aligns entities that refer to the same thing in different languages by extending the mono-lingual model to the bilingual EN. For example, the entity Foust in English and the entity 福斯特 (Foust) in Chinese are closely embedded in the semantic space because they share common neighbors in the two languages, such as All-star and NBA选秀 (draft). (3) Cross-lingual Sentence Regularizer, which models cross-lingual co-occurrence at the sentence level so that translated words receive the most similar embeddings. For example, the English word basketball and the translated Chinese word 篮球 frequently co-occur in pairs of comparable sentences; therefore, their vector representations will be close in the semantic space. The above components are trained jointly under a unified objective function.

Cross-lingual Supervision Data Generation
This section introduces how to build a bilingual entity network $G^{en-zh}$ and comparable sentences $S^{en-zh}$ from a multi-lingual KB.

Bilingual Entity Network Construction
Entities with cross-lingual links refer to the same thing, which implies they are equivalent across languages. Conventional knowledge representation methods only add edges between $e^{en}_i$ and $e^{zh}_{i'}$ to indicate a special "equivalent" relation (Zhu et al., 2017). Instead, we build $G^{en-zh} = (E^{en} \cup E^{zh}, R^{en} \cup R^{zh} \cup R^{en-zh})$ by enriching the neighbors of cross-lingually linked entities. That is, we add edges $R^{en-zh}$ between the two mono-lingual ENs by letting all neighbors of $e^{en}_i$ be neighbors of $e^{zh}_{i'}$, and vice versa, if $\langle e^{en}_i, e^{zh}_{i'} \rangle \in R^{en-zh}$. $G^{en-zh}$ extends $G^{en}$ and $G^{zh}$ to bilingual settings in a natural way. It not only keeps an objective consistent with the mono-lingual ENs (entities, in whichever language, are embedded closely if they share common neighbors) but also enhances each network with more neighbors in the foreign language.
Following the method in Zhu et al. (2017), there would be no edge between the Chinese entity 福斯特 (Foust) and the English entity Pistons, which wrongly implies that 福斯特 (Foust) does not belong to the Pistons. Our method recovers the missing relation between the entities 福斯特 (Foust) and 活塞队 (Pistons) in the incomplete Chinese KB through the corresponding English common neighbors, All-star, NBA, etc., as illustrated in Figure 1.
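The neighbor-enrichment step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_bilingual_network`, the adjacency-set representation, and the choice to treat links as undirected are all assumptions made for the example.

```python
from collections import defaultdict


def build_bilingual_network(edges_en, edges_zh, cross_links):
    """Return adjacency sets for a bilingual entity network.

    edges_en / edges_zh: iterables of (entity, entity) mono-lingual links.
    cross_links: iterable of (en_entity, zh_entity) inter-language links.
    """
    neighbors = defaultdict(set)
    # Keep every mono-lingual edge (treated as undirected: an assumption).
    for a, b in list(edges_en) + list(edges_zh):
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Snapshot the mono-lingual neighborhoods so enrichment does not cascade.
    mono = {e: set(ns) for e, ns in neighbors.items()}
    for e_en, e_zh in cross_links:
        # All neighbors of e_en become neighbors of e_zh, and vice versa.
        for n in mono.get(e_en, set()):
            neighbors[e_zh].add(n)
            neighbors[n].add(e_zh)
        for n in mono.get(e_zh, set()):
            neighbors[e_en].add(n)
            neighbors[n].add(e_en)
    return dict(neighbors)


# Toy data loosely based on the Foust example in the text.
graph = build_bilingual_network(
    edges_en=[("Foust", "Pistons"), ("Foust", "NBA"), ("Foust", "All-star")],
    edges_zh=[("福斯特", "NBA选秀")],
    cross_links=[("Foust", "福斯特")],
)
print(sorted(graph["福斯特"]))
```

In this toy graph, 福斯特 gains Pistons, NBA and All-star as neighbors, while Foust gains NBA选秀, without a direct "equivalent" edge between the two linked entities.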

Comparable Sentences Generation
To supervise the cross-lingual representation learning of words, we automatically generate comparable sentences as cross-lingual training data. Comparable sentences are not translated sentence pairs, but sentences on the same topic in different languages. The middle layer of Figure 1 shows such a pair, whose English side is "Lawrence Michael Foust was an American basketball player who spent 12 seasons in NBA".
Inspired by the distant supervision technique in relation extraction, we assume that a sentence $s^{en}_k$ in the Wikipedia article of entity $e^{en}_i$ explicitly or implicitly describes $e^{en}_i$ (Yamada et al., 2017), and that $s^{en}_k$ expresses a relation between $e^{en}_i$ and $e^{en}_j$ if another entity $e^{en}_j$ appears in $s^{en}_k$. Meanwhile, we find a comparable sentence $s^{zh}_{k'}$ in the other language: a sentence containing $e^{zh}_{j'}$ in the Wikipedia article of the Chinese entity $e^{zh}_{i'}$, where $\langle e^{en}_i, e^{zh}_{i'} \rangle, \langle e^{en}_j, e^{zh}_{j'} \rangle \in R^{en-zh}$. As shown in Figure 1, the sentences in the second level are comparable because they share the theme of the relation between the entities Foust and NBA. To find such sentences, we search the anchors in the English article and the Chinese article of the cross-lingual entity Foust, respectively, and extract the sentences that include another cross-lingual entity, NBA. Comparable sentences can be regarded as cross-lingual contexts.
Unfortunately, comparable sentences suffer from two issues caused by distant supervision:
Wrong labelling. Taking English as an example, there may be several sentences $\{s^{en}_{k,l}\}_{l=1}^{L}$ containing the same entity $e^{en}_j$ in the article of $e^{en}_i$. A straightforward solution is to concatenate them into a longer sentence $s^{en}_k$, but this increases the chance of including unrelated sentences.
Unbalanced information. Sometimes the two sentences of a pair convey unbalanced information, e.g., the English sentence in the middle layer (Figure 1) states that Foust spent 12 seasons in NBA, while the comparable Chinese sentence does not.
To address these issues, we propose knowledge attention and cross-lingual attention to filter out unrelated information at the sentence level and at the word level, respectively.
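As a rough sketch of the pairing rule above, the following Python function matches sentences across two article collections using only anchors and cross-lingual links. The data layout (a dict from entity to a list of `(sentence, anchored_entities)` tuples), the function name `comparable_sentences`, and the Chinese example sentence are hypothetical choices made for illustration; they are not taken from the paper.

```python
def comparable_sentences(articles_en, articles_zh, cross_links):
    """Pair sentences that mention cross-lingually linked entities.

    articles_*: {entity: [(sentence_text, {anchored entities}), ...]}
    cross_links: {en_entity: zh_entity}
    """
    pairs = []
    for e_en, sents_en in articles_en.items():
        e_zh = cross_links.get(e_en)
        if e_zh is None or e_zh not in articles_zh:
            continue  # the article has no counterpart in the other language
        for text_en, anchors_en in sents_en:
            for other_en in anchors_en:
                other_zh = cross_links.get(other_en)
                if other_zh is None:
                    continue  # the mentioned entity is not cross-lingually linked
                for text_zh, anchors_zh in articles_zh[e_zh]:
                    if other_zh in anchors_zh:
                        pairs.append((text_en, text_zh))
    return pairs


# Toy input; the Chinese sentence is invented for this example.
articles_en = {
    "Foust": [
        ("Lawrence Michael Foust was an American basketball player "
         "who spent 12 seasons in NBA", {"NBA"}),
        ("A sentence without any linked entity", set()),
    ]
}
articles_zh = {"福斯特": [("福斯特是一名曾效力于NBA的美国篮球运动员", {"NBA (zh)"})]}
cross_links = {"Foust": "福斯特", "NBA": "NBA (zh)"}

pairs = comparable_sentences(articles_en, articles_zh, cross_links)
print(len(pairs))
```

When an article contains several sentences mentioning the same linked entity, this sketch returns every combination, which is exactly where the wrong-labelling issue discussed next arises.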

Joint Representation Learning
As shown in Figure 2, there are three components in learning cross-lingual word and entity representations, which are trained jointly. In this section, we will describe them in detail.

Mono-lingual Representation Learning
Following Yamada et al. (2016), we learn mono-lingual word/entity embeddings based on the corpus $D^y$, anchors $A^y$ and entity network $G^y$. Capturing the co-occurrence information among words and entities, these embeddings serve as the foundation and are further extended to bilingual settings by the proposed cross-lingual regularizers, which are detailed in the following sections. Mono-lingually, we utilize a variant of the skip-gram model (Mikolov et al., 2013c) to predict the contexts $C(x^y_i)$ given the current word/entity $x^y_i$, where $x^y_i$ is either a word or an entity, and $C(x^y_i)$ denotes: (i) the contextual words in a pre-defined window of $x^y_i$ if $x^y_i \in D^y$, (ii) the neighbor entities linked to $x^y_i$ if $x^y_i \in G^y$, (iii) the contextual words of $w^y_j$ if $x^y_i$ is the entity $e^y_i$ in an anchor $\langle w^y_j, e^y_i \rangle \in A^y$.
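The three kinds of contexts can be illustrated by a short Python sketch that enumerates (item, context) training pairs for a skip-gram style model. The representation is assumed for the example: the corpus is a token list, anchors are `(token_position, entity)` tuples, the network is a dict of neighbor sets, and the name `context_pairs` and the window size are hypothetical.

```python
def context_pairs(corpus, anchors, network, window=2):
    """Collect (item, context) pairs for the three cases of C(x)."""
    pairs = []
    # (i) words: other words inside a fixed window in the corpus.
    for i, word in enumerate(corpus):
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        pairs += [(word, corpus[j]) for j in range(lo, hi) if j != i]
    # (ii) entities: neighbor entities in the entity network.
    for entity, neighbor_set in network.items():
        pairs += [(entity, n) for n in sorted(neighbor_set)]
    # (iii) anchored entities: contextual words of the anchor word.
    for pos, entity in anchors:
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        pairs += [(entity, corpus[j]) for j in range(lo, hi) if j != pos]
    return pairs


corpus = ["Foust", "played", "in", "the", "NBA"]
anchors = [(4, "ENTITY:NBA")]                 # token 4 links to the entity NBA
network = {"ENTITY:NBA": {"ENTITY:Foust"}}    # one mono-lingual link

pairs = context_pairs(corpus, anchors, network)
print(len(pairs))
```

Each pair would then feed the usual skip-gram prediction of a context given the current item; the training loop itself (negative sampling, learning rate, and so on) is left out of this sketch.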

Cross-lingual Entity Regularizer
The bilingual EN $G^{en-zh}$ merges entities in different languages into a unified network, which makes it possible to use the same objective as in the mono-lingual ENs. Thus, we naturally extend the mono-lingual function to cross-lingual settings by predicting the cross-lingual contexts $C'(e^y_i)$ of each entity, where $C'(e^y_i)$ denotes the neighbor entities in the other language that are linked to $e^y_i$. By jointly learning the mono-lingual representation with the cross-lingual entity regularizer, words and entities share more common contexts and will therefore have similar embeddings. As shown in Figure 1, the English entity NBA co-occurs with the words basketball and player in texts, so they are embedded closely in the semantic space. Meanwhile, the cross-lingually linked entities NBA and NBA (zh) have similar representations because they share most of their neighbor entities, e.g., Foust.

Cross-lingual Sentence Regularizer
Comparable sentences provide cross-lingual co-occurrence of words; thus, we can use them to learn similar embeddings for words that frequently co-occur, by minimizing the Euclidean distance between the sentence embeddings $\mathbf{s}^{en}_k$ and $\mathbf{s}^{zh}_{k'}$ of each comparable pair. Taking English as the sample language, we define the sentence embedding as the average of word vectors weighted by the combination of two types of attention.
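A minimal sketch of this regularizer in plain Python is given below. It assumes that the sentence embedding is a weight-normalized average of word vectors and that the penalty is the squared Euclidean distance summed over pairs; both choices, as well as the function names, are assumptions of the example, and the attention weights are simply passed in as numbers.

```python
def sentence_embedding(word_vectors, weights):
    """Weighted average of word vectors (weights stand in for attention)."""
    total = sum(weights)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors)) / total
        for d in range(dim)
    ]


def sentence_regularizer(pairs):
    """Sum of squared Euclidean distances between paired sentence embeddings."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(s_en, s_zh))
        for s_en, s_zh in pairs
    )


# Two toy "sentences" in a 2-dimensional embedding space.
s_en = sentence_embedding([[1.0, 0.0], [0.0, 1.0]], weights=[1.0, 1.0])
s_zh = sentence_embedding([[1.0, 1.0]], weights=[2.0])
loss = sentence_regularizer([(s_en, s_zh)])
print(s_en, s_zh, loss)
```

Minimizing this quantity with respect to the word vectors pulls words that co-occur in comparable sentences toward each other; words with small attention weights contribute little to either sentence embedding and are therefore moved less.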

… the same entities in articles … labelling errors increase … almost irrelevant to … at filtering out wrongly labelled … smaller weights and … weights. Thus, we … similarity between $s^y_{k,l}$ … $\psi(e^y_m, s^y_{k,l})$ … where sim is a similarity … we use cosine similarity … paper. We normalize … $\sum_{l}^{L} \psi(e^y_m, s^y_{k,l}) =$ …
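The surviving text indicates only that knowledge attention scores each candidate sentence by its cosine similarity to an entity and then normalizes the scores. The sketch below is one plausible reading under stated assumptions: negative similarities are clipped to zero, the weights are normalized to sum to one, and the names `cosine` and `knowledge_attention` are hypothetical. It should not be taken as the paper's exact formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knowledge_attention(entity_vec, sentence_vecs):
    """Weights over sentences, proportional to clipped cosine similarity."""
    sims = [max(cosine(entity_vec, s), 0.0) for s in sentence_vecs]
    total = sum(sims)
    if total == 0.0:
        # No sentence is similar to the entity: fall back to uniform weights.
        return [1.0 / len(sims)] * len(sims)
    return [s / total for s in sims]


entity = [1.0, 0.0]
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = knowledge_attention(entity, sentences)
print(weights)
```

Under this reading, a sentence whose embedding is orthogonal to the entity receives zero weight, so it no longer contributes when several sentences are merged into one.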

Cross-lingual Attention
Inspired by self-attention (…, 2017b), we … focusing on potential information … sentences themselves. … aligned words between … words without alignment … to the maximum similarity …

Instead of reliance on parallel data, we automatically generate cross-lingual training data via distant supervision over multi-lingual knowledge bases. We also propose two types of attention, knowledge attention and cross-lingual attention, to select the most informative words and filter out noise, which further improves the performance. In experiments, the separate tasks of word translation and entity relatedness demonstrate the effectiveness of our method with an average gain of 20% and 3% over baselines, respectively. Using entity linking as a case study, the results on a benchmark dataset verify the quality of our embeddings.

Introduction
Multi-lingual knowledge bases (KB), storing millions of entities and their facts in various languages, provide rich structured knowledge for understanding natural language beyond texts. Meanwhile, abundant text corpora contain a large amount of potential knowledge complementary to existing KBs. Therefore, researchers leverage both types of resources to improve various natural language processing (NLP) tasks, such as relation extraction (Weston et al., 2013; Lin et al., 2017) and entity linking (Tsai and Roth, 2016; Yamada et al., 2016; Ji et al., 2016). … learns to represent entities based on their textual descriptions together with the structured relations. However, these methods only focus on mono-lingual settings, and little research has been done in cross-lingual scenarios.
In this paper, we propose to learn cross-lingual word and entity representations in the same semantic space, to enable joint inference among KB and text across languages without any additional translation mechanism, which is usually expensive and may introduce inevitable errors. Our embeddings help to break down language gaps in many tasks, such as cross-lingual entity linking, in which the major challenge lies in measuring the similarity between entities and the corresponding mentioned words in different languages.
The intuition is that words and entities in various languages share some common semantic meanings 1, but there are also ways in which they differ. On the one hand, we utilize their shared semantics to align similar words and entities with similar embedding vectors, regardless of whether they are in the same language. On the other hand, cross-lingual embeddings will benefit from different languages due to the complementary knowledge. For instance, textual ambiguity in one language may disappear in another language.
1 Some pioneering cross-lingual work observes that word embeddings trained separately on monolingual corpora exhibit isomorphic structure across languages (Mikolov et al., 2013; Zhang et al., 2017).

Introduction
Multi-lingual knowledge bases (KB), storing millions of entities and their facts in various languages, provide rich structured knowledge for understanding natural language beyond texts. Meanwhile, abundant text corpus contains large amount of potential knowledge complementary to existing KBs. Therefore, researchers leverage both types of resources to improve various natural language processing (NLP) related tasks, such as relation extraction (Weston et al., 2013;Lin et al., 2017), and entity linking (Tsai and Roth, 2016;Yamada et al., 2016;Ji et al., 2016).
lations. However, these methods only mono-lingual settings, and few researc been done in cross-lingual scenarios.
In this paper, we propose to learn cro word and entity representations in the sam tic space, to enable joint inference amon text across languages without any additio lation mechanism, which is usually expe may introduce inevitable errors. Our em are helpful to break down language gap tasks, such as cross-lingual entity linking the major challenge lies in measuring th ity between entities and corresponding m words in different languages. e NBA , w , w , w The intuition is that, words and e various languages share some common meanings 1 , but there are also ways in w differ. On one hand, we utilize their shar tics to align similar words and entities w lar embedding vectors, no matter they same language or not. On the other ha lingual embeddings will benefit from diff guages due to the complementary knowl instance, textual ambiguity in one lang disappear in another language, e.g., the t 1 Some cross-lingual pioneering work observ embeddings trained separately on monolingual hibit isomorphic structure across languages (M 2013;Zhang et al., 2017). generate cross-lingual training data via distant supervision over multi-lingual knowledge bases. We also propose two types of knowledge attention and cross-lingual attention to select the most informative words and filter out noise, which will further improve the performance. In experiments, separate tasks of word translation and entity relatedness demonstrate the effectiveness of our method with an average gain of 20% and 3% over baselines, respectively. Using entity linking as a case study, the results on benchmark dataset verify the quality of our embeddings.

Introduction
Multi-lingual knowledge bases (KB), storing millions of entities and their facts in various languages, provide rich structured knowledge for understanding natural language beyond texts. Meanwhile, abundant text corpus contains large amount of potential knowledge complementary to existing KBs. Therefore, researchers leverage both types of resources to improve various natural language processing (NLP) related tasks, such as relation extraction (Weston et al., 2013;Lin et al., 2017), and entity linking (Tsai and Roth, 2016;Yamada et al., 2016;Ji et al., 2016). been done in cross-lingual scenarios.
In this paper, we propose to learn cross-lingual word and entity representations in the same semantic space, to enable joint inference among KB and text across languages without any additional translation mechanism, which is usually expensive and may introduce inevitable errors. Our embeddings are helpful to break down language gaps in many tasks, such as cross-lingual entity linking, in which the major challenge lies in measuring the similarity between entities and corresponding mentioned words in different languages.
e NBA , w , w , w The intuition is that, words and entities in various languages share some common semantic meanings 1 , but there are also ways in which they differ. On one hand, we utilize their shared semantics to align similar words and entities with similar embedding vectors, no matter they are in the same language or not. On the other hand, crosslingual embeddings will benefit from different languages due to the complementary knowledge. For instance, textual ambiguity in one language may disappear in another language, e.g., the two mean- text across languages, capturing mutually complementary knowledge. Instead of reliance on parallel data, we automatically generate cross-lingual training data via distant supervision over multi-lingual knowledge bases. We also propose two types of knowledge attention and cross-lingual attention to select the most informative words and filter out noise, which will further improve the performance. In experiments, separate tasks of word translation and entity relatedness demonstrate the effectiveness of our method with an average gain of 20% and 3% over baselines, respectively. Using entity linking as a case study, the results on benchmark dataset verify the quality of our embeddings.

Introduction
Multi-lingual knowledge bases (KB), storing millions of entities and their facts in various languages, provide rich structured knowledge for understanding natural language beyond texts. Meanwhile, abundant text corpus contains large amount of potential knowledge complementary to existing KBs. Therefore, researchers leverage both types of resources to improve various natural language processing (NLP) related tasks, such as relation extraction (Weston et al., 2013;Lin et al., 2017), and entity linking (Tsai and Roth, 2016;Yamada et al., 2016;Ji et al., 2016).
textual descriptions together with the structured relations. However, these methods only focus on mono-lingual settings, and few researches have been done in cross-lingual scenarios.
In this paper, we propose to learn cross-lingual word and entity representations in the same semantic space, to enable joint inference among KB and text across languages without any additional translation mechanism, which is usually expensive and may introduce inevitable errors. Our embeddings are helpful to break down language gaps in many tasks, such as cross-lingual entity linking, in which the major challenge lies in measuring the similarity between entities and corresponding mentioned words in different languages.
e NBA , w , w , w The intuition is that, words and entities in various languages share some common semantic meanings 1 , but there are also ways in which they differ. On one hand, we utilize their shared semantics to align similar words and entities with similar embedding vectors, no matter they are in the same language or not. On the other hand, crosslingual embeddings will benefit from different languages due to the complementary knowledge. For instance, textual ambiguity in one language may disappear in another language, e.g., the two mean- text across languages, capturing mutually complementary knowledge. Instead of reliance on parallel data, we automatically generate cross-lingual training data via distant supervision over multi-lingual knowledge bases. We also propose two types of knowledge attention and cross-lingual attention to select the most informative words and filter out noise, which will further improve the performance. In experiments, separate tasks of word translation and entity relatedness demonstrate the effectiveness of our method with an average gain of 20% and 3% over baselines, respectively. Using entity linking as a case study, the results on benchmark dataset verify the quality of our embeddings.

Introduction
ter in which language, e.g., the English entity Foust and the Chinese entity 福斯特 (Foust) are embedded close together in the semantic space because they share the neighbors NBA, All-star and NBA 选秀 (draft).
(3) Cross-lingual Sentence Regularizer aims to learn similar embeddings for mutually translated words by pushing their cross-lingual contexts (i.e., comparable sentences) together. For example, the English word basketball and its Chinese translation 篮球 frequently co-occur in comparable sentences, and are therefore close in the semantic space. All word and entity embeddings are trained jointly under a unified optimization objective. Next, we introduce how to generate the bi-lingual EN and the comparable sentences, followed by the three components of joint representation learning in turn.

Cross-lingual Supervision Data Generation
This section introduces how to extract more cross-lingual clues from the multi-lingual KB in the form of a bi-lingual EN and comparable sentences.

Bi-lingual Entity Network Construction
Conventional knowledge representation methods normally regard cross-lingual links as a special equivalence relation between two entities (Zhu et al., 2017). However, we argue that this may lead to an inconsistent training objective [...] multiple relations. For example (Figure 1), merely adding the equivalence relation between Foust and 福斯特 creates no direct relation between the Chinese entity 福斯特 (Foust) and the English entity Piston, which contradicts the fact that Foust belongs to Piston, no matter in which language.
Therefore, we build the bi-lingual entity network by letting cross-lingually linked entities inherit all relations from each other. Concretely, we enhance the mono-lingual ENs by adding edges from all neighbors of entity $e^{en}_i$ to $e^{zh}_j$ if $\langle e^{en}_i, e^{zh}_j \rangle \in R^{en-zh}$ (layer 2).
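The inheritance step above can be illustrated with a small sketch. It assumes an undirected, untyped entity network stored as adjacency sets; the function name `build_bilingual_network`, the `en:`/`zh:` identifier prefixes and the toy edges are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def build_bilingual_network(en_edges, zh_edges, cross_links):
    """Merge two mono-lingual entity networks into one bi-lingual network.

    en_edges / zh_edges: iterables of (entity, neighbor) pairs, treated as
        undirected edges (an assumption of this sketch).
    cross_links: iterable of (en_entity, zh_entity) cross-lingual links.
    Returns a dict mapping each entity to the set of its neighbors.
    """
    graph = defaultdict(set)
    for a, b in list(en_edges) + list(zh_edges):
        graph[a].add(b)
        graph[b].add(a)

    # Snapshot the mono-lingual neighborhoods so that inheritance uses only
    # the original edges and does not cascade through newly added ones.
    mono = {node: set(nbrs) for node, nbrs in graph.items()}

    for en_ent, zh_ent in cross_links:
        # Each linked entity inherits the neighbors of its counterpart.
        for nbr in mono.get(en_ent, set()):
            graph[zh_ent].add(nbr)
            graph[nbr].add(zh_ent)
        for nbr in mono.get(zh_ent, set()):
            graph[en_ent].add(nbr)
            graph[nbr].add(en_ent)
    return dict(graph)


# Toy example modeled on the Foust / Piston case discussed above.
graph = build_bilingual_network(
    en_edges=[("en:Foust", "en:Piston"), ("en:Foust", "en:NBA")],
    zh_edges=[("zh:福斯特", "zh:NBA选秀")],
    cross_links=[("en:Foust", "zh:福斯特")],
)
```

In this toy network the Chinese entity gains a direct edge to the English entity Piston, which is the effect the paragraph above asks for; no explicit equivalence edge between the two linked entities is added.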

Comparable Sentences Generation
We utilize distant supervision to generate comparable sentences from Wikipedia articles. As shown in Figure 1, from the articles of the cross-lingually linked entities $e^{en}_{Kobe}$ and $e^{zh}_{Kobe}$, we extract the sentences that mention another pair of cross-lingually linked entities, $e^{en}_{Joe}$ and $e^{zh}_{Joe}$, as comparable sentences $S^{en-zh} = \{\langle s^{en}_k, s^{zh}_k \rangle\}$. The intuition is that each sentence in a Wikipedia article carries a pseudo mention of the page entity, i.e., it says something about that entity (Yamada et al., 2017). Thus, if a sentence also mentions another entity, it implicitly expresses their relation. Therefore, we make a similar assumption as in relation extraction: if two entities participate in a relation, and both of them [...]

[...] into a longer sentence $s^{en}_k$, but this increases the chance of including unrelated sentences. Unbalanced information. Sometimes the two sentences of a pair convey different information, e.g., the English sentence in layer 2 (Figure 1) contains "Foust spent 12 seasons in NBA" while the comparable Chinese sentence does not.
To address the issues, we propose knowledge attention and cross-lingual attention to filter out unrelated information at sentence level and at word level, respectively.
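The distant-supervision step of this section can be sketched as follows. This is a minimal illustration, assuming each article is already split into sentences with their linked entities; the function names, the input layout and the toy sentences are hypothetical and not taken from the paper's data.

```python
def group_by_entity(article):
    """Map each mentioned entity to the sentences of the article mentioning it."""
    grouped = {}
    for text, entities in article:
        for ent in entities:
            grouped.setdefault(ent, []).append(text)
    return grouped


def comparable_sentences(en_article, zh_article, cross_links):
    """Pair sentences from the two articles of one cross-lingually linked entity.

    en_article / zh_article: lists of (sentence_text, mentioned_entities).
    cross_links: dict mapping an English entity id to its Chinese counterpart.
    Returns a list of (en_sentences, zh_sentences) pairs, one per linked
    entity that is mentioned on both pages.
    """
    en_by_entity = group_by_entity(en_article)
    zh_by_entity = group_by_entity(zh_article)
    pairs = []
    for en_ent, en_sents in en_by_entity.items():
        zh_ent = cross_links.get(en_ent)
        if zh_ent is not None and zh_ent in zh_by_entity:
            pairs.append((en_sents, zh_by_entity[zh_ent]))
    return pairs


# Made-up toy sentences, only to show the expected input shape.
pairs = comparable_sentences(
    en_article=[("Kobe is the son of Joe.", ["en:Joe"]),
                ("He played in the NBA.", ["en:NBA"])],
    zh_article=[("科比是乔的儿子。", ["zh:乔"])],
    cross_links={"en:Joe": "zh:乔", "en:NBA": "zh:NBA"},
)
```

Each returned pair keeps all sentences that mention the same entity, so the wrong-labelling and unbalanced-information issues described above remain to be handled by the two attention mechanisms.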

Mono-lingual Representation Learning
Following (Yamada et al., 2016), we learn mono-lingual word/entity embeddings based on the corpus $D^y$, the anchors $A^y$ and the entity network $G^y$. We utilize a variant of the Skip-gram model (Mikolov et al., 2013c) to predict the contexts given the current word/entity (Eq. 1), where $x^y_i$ is either a word or an entity, and $C(x^y_i)$ denotes: (i) the contextual words in a pre-defined window of $x^y_i$ if $x^y_i \in D^y$; (ii) the neighbor entities linked to $x^y_i$ if $x^y_i \in G^y$; (iii) the contextual words of $w^y_j$ if $x^y_i$ is the entity $e^y_i$ in an anchor $\langle w^y_j, e^y_i \rangle \in A^y$.
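For orientation, an objective of this kind can be written compactly. The following is only a sketch, assuming the standard Skip-gram log-likelihood summed over both languages; the symbol $\mathcal{L}_m$, the summation ranges and the per-context factorization are assumptions rather than a transcription of Eq. (1).

```latex
\mathcal{L}_m \;=\; \sum_{y \in \{en,\, zh\}} \;\; \sum_{x^y_i} \;\; \sum_{c \,\in\, C(x^y_i)} \log P\!\left(c \mid x^y_i\right)
```

Here the middle sum ranges over the words and entities that occur in $D^y$, $A^y$ and $G^y$, and $C(x^y_i)$ is the context set defined by cases (i) to (iii) above.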

Cross-lingual Entity Regularizer
The bilingual EN $G^{en-zh}$ merges entities in different languages into a unified network, which makes it possible to use the same objective as in the mono-lingual ENs. Thus, we naturally extend the mono-lingual objective to the cross-lingual setting, where $C'(e^y_i)$ denotes the cross-lingual contexts, i.e., the neighbor entities in different languages that are linked to $e^y_i$. By jointly learning the mono-lingual representations with the cross-lingual entity regularizer, words and entities share more common contexts and will therefore have similar embeddings. As shown in Figure 1, the English entity NBA co-occurs with the words basketball and player in texts, so they are embedded close together in the semantic space. Meanwhile, the cross-lingually linked entities NBA and NBA (zh) have similar representations because they share most of their neighbor entities, e.g., Foust.

Cross-lingual Sentence Regularizer
Comparable sentences provide cross-lingual co-occurrence of words. We therefore learn similar embeddings for words that frequently co-occur by minimizing the Euclidean distance between $\mathbf{s}^{en}_k$ and $\mathbf{s}^{zh}_{k'}$, the embeddings of a pair of comparable sentences. Taking English as an example, we define the sentence embedding as the weighted average of word vectors, with weights given by the combination of two types of attention:

$$\mathbf{s}^{en}_k = \sum_{l \in L} \psi(e^{en}_m, s^{en}_{k,l}) \sum_{w^{en}_i \in s^{en}_{k,l}} \psi'(w^{en}_i, w^{zh}_j)\, \mathbf{w}^{en}_i \qquad (4)$$

where $\{s^{en}_{k,l} \mid l \in L\}$ is the set of sentences containing the same entity (as mentioned in Section 4.2), $\psi(e^{en}_m, s^{en}_{k,l})$ is the knowledge attention, which aims to filter out wrongly labelled sentences, and $\psi'(w^{en}_i, w^{zh}_j)$ is the cross-lingual attention, which deals with unbalanced information through possibly aligned words.
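The attention-weighted sentence embedding of Eq. (4) and the distance it feeds into can be sketched in a few lines. This is an illustration under simplifying assumptions: both attentions are passed in as precomputed weights, the cross-lingual attention is stored per token, and all names and toy vectors are hypothetical.

```python
def weighted_sentence_embedding(sentences, knowledge_weights, word_weights, word_vectors):
    """Attention-weighted sum of word vectors in the style of Eq. (4).

    sentences: list of token lists, the set {s_{k,l}} for one entity.
    knowledge_weights: one weight per sentence (psi).
    word_weights: dict token -> weight (psi'), assumed precomputed.
    word_vectors: dict token -> list of floats.
    """
    dim = len(next(iter(word_vectors.values())))
    embedding = [0.0] * dim
    for tokens, psi in zip(sentences, knowledge_weights):
        for tok in tokens:
            if tok not in word_vectors:
                continue  # skip out-of-vocabulary tokens
            scale = psi * word_weights.get(tok, 0.0)
            for d in range(dim):
                embedding[d] += scale * word_vectors[tok][d]
    return embedding


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy two-dimensional vectors, chosen only for easy arithmetic.
vectors = {"basketball": [1.0, 0.0], "player": [0.0, 1.0]}
s_en = weighted_sentence_embedding(
    sentences=[["basketball", "player"], ["player"]],
    knowledge_weights=[0.75, 0.25],
    word_weights={"basketball": 1.0, "player": 0.5},
    word_vectors=vectors,
)
loss = squared_euclidean(s_en, [0.75, 0.0])
```

Using the squared distance rather than the distance itself is a common choice for a smooth training loss and is an assumption here.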

Next, we will introduce the two types of attentions in detail.

Knowledge Attention
Suppose that sentences $\{s^{en}_{k,l}\}_{l=1}^{L}$ contain the same entities in the article of entity $e^{en}_m$. Wrong labelling errors then increase, because some $s^{en}_{k,l}$ are almost irrelevant to $e^{en}_m$. Knowledge attention assigns smaller weights to wrongly labelled sentences and higher weights to related sentences. Thus, we define it to be proportional to the similarity between $s^{en}_{k,l}$ and $e^{en}_m$:

$$\psi(e^{en}_m, s^{en}_{k,l}) \propto \mathrm{sim}(e^{en}_m, s^{en}_{k,l}),$$

where sim is a similarity measure; we use cosine similarity in this work. Knowledge attention is normalized to satisfy $\sum_{l=1}^{L} \psi(e^{en}_m, s^{en}_{k,l}) = 1$.
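As a rough illustration of the definition above, the following sketch computes knowledge attention weights from an entity vector and a list of sentence vectors. It is not the authors' implementation. How a sentence vector is obtained is not specified here, and clipping negative cosine similarities to zero before normalizing is our own assumption, made so that the weights are non-negative and sum to 1. The names `cosine` and `knowledge_attention` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knowledge_attention(entity_vec, sentence_vecs):
    """Weights proportional to sim(entity, sentence), normalized to sum to 1.

    Assumption (not from the paper): negative similarities are clipped to 0,
    and a uniform distribution is returned if every similarity is 0.
    """
    if not sentence_vecs:
        return []
    scores = [max(cosine(entity_vec, s), 0.0) for s in sentence_vecs]
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [score / total for score in scores]


# Toy example: the first sentence is most related to the entity, the last is unrelated.
entity = [1.0, 0.0]
sentences = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
weights = knowledge_attention(entity, sentences)
```

In this toy example the unrelated third sentence receives zero weight, which mirrors the intended effect of down-weighting wrongly labelled sentences.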

Cross-lingual Attention
Inspired by the self-attention mechanism (Lin et al., 2017b), we propose cross-lingual attention, which focuses on the information available in the comparable sentences themselves. The intuition is to find possibly aligned words between the two languages and to filter out words without alignments. We define it according to the maximum similarity computed with our cross-lingual word embeddings. We discard non-aligned words, i.e., those with $\psi'(w^{en}_i, w^{zh}_j) < \theta$ for a threshold $\theta$, and normalize over the selected words. We set $\theta = 0$ in experiments. Thus, unbalanced information is trimmed to the meaning shared by $s^{en}_k$ and $s^{zh}_{k'}$. For example (Figure 1), the words American, basketball, and player are selected due to their aligned Chinese words 美国, 篮球, and 运动员, while 12 seasons in $s^{en}_k$ and 前 (former) in $s^{zh}_{k'}$ are discarded due to low attention.
There are two reasons for using such a regularizer: (1) the embeddings of cross-lingually aligned words become closer within each pair of comparable sentences, and (2) the distance between their contexts is also minimized, which is consistent with mono-lingual word embedding training, where words sharing more contexts have similar embeddings. In this way, our regularizer follows an assumption similar to that of Gouws et al. (2015): the more frequently two words occur in parallel or comparable sentence pairs, the closer their representations will be.
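The thresholding and normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each word is scored by its maximum cosine similarity to any word in the paired sentence, words scoring below θ get zero weight, and the remaining scores are normalized to sum to 1. The function names and the toy vectors are hypothetical, and the sketch assumes θ ≥ 0 so that kept scores are non-negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_lingual_attention(src_vecs, tgt_vecs, theta=0.0):
    """Attention over source words from their best match in the target sentence.

    Words whose maximum similarity falls below `theta` are discarded (weight 0);
    the rest are normalized to sum to 1. Assumes theta >= 0.
    """
    if not src_vecs or not tgt_vecs:
        return [0.0] * len(src_vecs)
    best = [max(cosine(u, v) for v in tgt_vecs) for u in src_vecs]
    kept = [score if score >= theta else 0.0 for score in best]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(kept)
    return [score / total for score in kept]


# Toy vectors: two English words have close Chinese counterparts, one does not.
english = [[1.0, 0.0], [0.0, 1.0], [-1.0, -0.2]]   # american, basketball, seasons
chinese = [[0.9, 0.1], [0.1, 0.9]]                 # 美国, 篮球
attention = cross_lingual_attention(english, chinese, theta=0.0)
```

Here the third English word has no positively similar Chinese word, so it is filtered out, while the two aligned words share the attention mass equally.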

Training
All the above components are jointly trained under an overall objective function, in which γ is a hyper-parameter that tunes the effect of the cross-lingual sentence regularizer and is set to 1 in our experiments. We use softmax as the probability function, and negative sampling and SGD for efficient optimization (Mikolov et al., 2013a).
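For readers unfamiliar with negative sampling, the sketch below shows the standard skip-gram negative-sampling loss for a single (target, context) pair in the sense of Mikolov et al. (2013a). It illustrates only this generic building block; how the paper combines it with the regularizers and the attention weights is not reproduced here. The function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one positive pair plus sampled negatives.

    loss = -log sigmoid(t . c) - sum_n log sigmoid(-t . n)
    """
    loss = -log_sigmoid(dot(target_vec, context_vec))
    for negative in negative_vecs:
        loss -= log_sigmoid(-dot(target_vec, negative))
    return loss


# The loss is small when the true context is similar and the negative is dissimilar.
target = [1.0, 0.0]
good = negative_sampling_loss(target, [2.0, 0.0], [[-2.0, 0.0]])
bad = negative_sampling_loss(target, [-2.0, 0.0], [[2.0, 0.0]])
```

Minimizing this loss with SGD pulls a target toward its observed contexts and pushes it away from randomly sampled ones, avoiding the full softmax over the vocabulary.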

Experiments
In this section, we present a qualitative analysis of nearest neighbors and quantitative experiments on word translation, entity relatedness, and cross-lingual entity linking, which verify the quality of the cross-lingual word embeddings, the entity embeddings, and the joint inference among them, respectively.
The code of our proposed model can be found at https://github.com/TaoMiner/MultiLingualEmbedding. We choose the April 2017 dump of Wikipedia as the multi-lingual KB and six popular languages for evaluation. Preprocessing consists of the following steps: converting text to lower case, filtering out symbols as well as low-frequency words and entities (fewer than 5 occurrences), and tokenizing the Chinese corpus using Jieba 4 and the Japanese corpus using mecab 5 . The statistics are listed in Table 1. For brevity, we adopt two-letter abbreviations: 'En', 'Zh', 'Es', 'Ja', 'It' and 'Tr' for English, Chinese, Spanish, Japanese, Italian and Turkish, respectively. The token sub-column denotes the total number of words/entities in the entire training corpus, and we use 'm' to denote million and 'b' for billion.
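The preprocessing steps listed above can be sketched as follows. This is an illustrative approximation rather than the released pipeline: whitespace splitting stands in for the language-specific tokenizers (Jieba and mecab), the definition of "symbol" as any character that is neither a word character nor whitespace is our assumption, and the function name `preprocess` is hypothetical.

```python
import re
from collections import Counter

MIN_COUNT = 5  # tokens occurring fewer than 5 times are dropped, as stated in the text


def preprocess(sentences, min_count=MIN_COUNT):
    """Lowercase, strip symbols, tokenize, and drop low-frequency tokens."""
    tokenized = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Assumption: a "symbol" is anything that is not a word character or whitespace.
        cleaned = re.sub(r"[^\w\s]", " ", lowered)
        # Stand-in for a real tokenizer; Chinese and Japanese need word segmentation.
        tokenized.append(cleaned.split())

    counts = Counter(token for sentence in tokenized for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in tokenized
    ]


corpus = ["Kobe plays basketball!"] * 5 + ["Rare word."]
cleaned_corpus = preprocess(corpus)
```

Frequencies are counted over the whole corpus before filtering, so a token is kept or dropped consistently in every sentence.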

Experiment Settings
For cross-lingual settings, we choose five language pairs to compare with state-of-the-art methods, whose statistics are listed in Table 2.
We trained our method using the parameters suggested for the Skip-gram model (Mikolov et al., 2013c) and evaluate the same embeddings on all tasks for fair comparison. We set the number of training epochs to 2 to ensure convergence, which costs nearly 20
4 https://github.com/fxsjy/jieba
5 http://taku910.github.io/mecab/
We manually checked nearest neighbors to get a direct impression of the quality of our embeddings. The nearest neighbors of the English word basketball are listed in Table 3. As Table 3 shows, the correct translation is ranked at top 1 (marked by +), and the listed words as well as the English nearest words are all basketball related, indicating the high quality of our cross-lingual word embeddings. Interestingly, although all nearest entities are sports related, e.g., NBA or Professional sports, there is an obvious cultural divergence between Chinese entities and English entities, such as Hong Kong basketball league vs. All-America.

Word Translation
Following Zhang et al. (2017b), we test our cross-lingual word embeddings on benchmark datasets containing over 2,000 bilingual word pairs on average. The ground truth is obtained from Open Multilingual WordNet 6 or Google translation. We compare all methods using the same vocabulary, and analyze the impact of vocabulary size by using a small vocabulary of nearly 5k and a large vocabulary of 50k. We choose several state-of-the-art methods as baselines, which use different levels of parallel data:
(1) TM (Mikolov et al., 2013b) and IA are pioneering and popular transformation-based methods using a bilingual lexicon.
(2) Bilbowa (Gouws et al., 2015) is a typical method using parallel sentences and performs quite well.
(3) BWESG (Vulic and Moens, 2016) is similar to our method and achieves the best performance among methods using comparable data.
(4) The adversarial model (Zhang et al., 2017b) is the state of the art without parallel data.
In addition, we remove attention from our method to investigate the impact of the attention mechanisms, marked as Ours-noatt.
For fair comparison, we report the results from the original paper (Zhang et al., 2017b) except for Bilbowa and BWESG, which did not report results on the same benchmark datasets. We therefore carefully run them using the released code on the same training corpus as ours with the suggested parameters. Nevertheless, we do not report the performance of Bilbowa on Zh-En, It-En, Tr-En and Ja-Zh due to the lack of the parallel data used in the original paper. As shown in Table 4, we can see:
• Our proposed method significantly outperforms all the baseline methods, with average gains of 21% and 9.1% on the large and small vocabulary. This demonstrates the high quality of our generated cross-lingual data and the effectiveness of our joint framework.
• Language pairs with similar cultures (Es-En, It-En, Tr-En, Ja-Zh) achieve better performance than those with different cultural origins, e.g., Zh-En.
• Languages with richer corpora have better translations, because adequate training data helps to capture more accurate cross-lingual semantics (Es-En, It-En, Tr-En vs. Ja-Zh).
• Our method suffers a smaller performance reduction between the small and large vocabulary than methods based on parallel word pairs, because we adopt a consistent objective function that aligns cross-lingual semantics while preserving the monolingual semantics of each language.
• Attention mechanisms further improve the performance, mainly because they help to select the most informative words and sentences and filter out unrelated data.
6 http://compling.hss.ntu.edu.sg/omw
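A word translation evaluation of this kind can be sketched as a nearest-neighbor search in the shared embedding space. The snippet below is an assumption-laden illustration, not the benchmark's official scorer: it retrieves the target word with the highest cosine similarity and reports top-1 accuracy against gold pairs. The function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def translate(word, src_emb, tgt_emb):
    """Return the target-language word closest to `word` in the shared space."""
    query = src_emb[word]
    return max(tgt_emb, key=lambda candidate: cosine(query, tgt_emb[candidate]))


def top1_accuracy(gold_pairs, src_emb, tgt_emb):
    """Fraction of source words whose nearest target word is the gold translation."""
    correct = sum(
        1 for source, target in gold_pairs
        if translate(source, src_emb, tgt_emb) == target
    )
    return correct / len(gold_pairs)


# Toy shared space with two English and two Chinese words.
en_emb = {"basketball": [0.9, 0.1], "player": [0.1, 0.9]}
zh_emb = {"篮球": [1.0, 0.0], "运动员": [0.0, 1.0]}
accuracy = top1_accuracy([("basketball", "篮球"), ("player", "运动员")], en_emb, zh_emb)
```

Restricting `zh_emb` to the shared evaluation vocabulary corresponds to comparing methods on the same vocabulary, and enlarging it makes retrieval harder, which is one way to read the small versus large vocabulary results.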

Entity Relatedness
To assess our entity embeddings, we evaluate English entity relatedness following Ganea and Hofmann (2017) and Hoffart et al. (2011), using a dataset that contains 3,314 entities, each with 91 candidate entities labeled 1 or 0 to indicate whether they are semantically related. Given an entity, we rank the candidate entities according to their similarity under our embeddings, and evaluate the ranking quality with two standard metrics: normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002) and mean average precision (MAP) (Manning et al., 2008).
For a comprehensive and fair comparison, we choose several widely used and state-of-the-art methods as our baselines, and compare with the results reported in the original papers: (1) WLM (Milne and Witten, 2008), a popular semantic similarity measure based on Wikipedia anchor links.
(2) ALIGN (Yamada et al., 2016) and MPME, state-of-the-art methods that jointly learn word and entity embeddings using mono-lingual EN. (3)
Table 5 shows the results of the baseline methods as well as our method based on different languages. We also test our method without training cross-lingual words, marked as Ours-e. Our method outperforms all baseline methods by introducing cross-lingual information, and all bilingual ENs lead to similar results. Strangely, ALIGN and DJ, which use more embedding dimensions, seemingly fail to capture overall relatedness (their performance drops from top@1 to top@5). The best performance of Ours-e implies that training cross-lingual words slightly harms the performance of entity embeddings. Additional sense embeddings could be introduced in future work.
Although our English entity embeddings achieve favorable improvements, the gains should be smaller than those for other languages: English resources are already rich, and richer than those of many other languages, so contributions from other languages to English are less significant than the reverse. Due to space limitations, we do not report experimental results for the reverse direction.
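The two ranking metrics can be computed from the binary labels of the ranked candidates, as in the following sketch. It uses the common log2-discount form of DCG and the usual definition of average precision; whether the cited evaluation scripts use exactly these variants is an assumption, and the function names are hypothetical. MAP is the mean of `average_precision` over all query entities.

```python
import math


def ndcg_at_k(ranked_labels, k):
    """NDCG@k for a ranked list of binary (or graded) relevance labels."""
    def dcg(labels):
        # Rank positions start at 1, so the discount is log2(position + 1).
        return sum(rel / math.log2(index + 2) for index, rel in enumerate(labels[:k]))

    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def average_precision(ranked_labels):
    """Average of precision@rank over the ranks that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Candidates sorted by embedding similarity; 1 = related, 0 = unrelated.
ranking = [1, 0, 1, 0]
ndcg_1 = ndcg_at_k(ranking, 1)
ndcg_4 = ndcg_at_k(ranking, 4)
ap = average_precision(ranking)
```

For this toy ranking, the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, and NDCG@1 is perfect because the top candidate is relevant.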

Cross-lingual Entity Linking
Entity linking, the task of identifying the language-specific reference entity for mentions in texts, raises the key challenge of comparing the relevance between entities and the contextual words around the mentions (Cao et al., 2015; Nguyen et al., 2016). Recently, the surge of cross-lingual analysis has pushed the entity linking task to cross-lingual settings. Therefore, we comprehensively measure our joint inference ability among words and entities using the tri-lingual EL benchmark dataset KBP2015, which consists of 944 documents and 38,831 mentions, divided into 444 documents for training and 500 for evaluation. Note that the main purpose is not to beat other EL models but to evaluate the quality of our embeddings, so we adopt a simple classifier, a GBRT (Gradient Boost Regression Tree) based method as in Yamada et al. (2016), replace its embeddings with our cross-lingual embeddings, and filter out mentions that are out of our vocabulary.
Table 6: Tri-lingual Entity Linking.
Table 6 shows the top-1 linking accuracy (%). Our method performs much better than the second-ranked system, and is competitive with the top-ranked system. Considering that those systems utilize additional translation tools, we conclude that our embeddings are of high quality for joint inference among entities and words in different languages.

Conclusions
In this paper, we propose a novel method to jointly learn cross-lingual word and entity representations that enables effective inference among cross-lingual knowledge bases and texts. Instead of parallel data, we use distant supervision over a multi-lingual KB to generate high-quality comparable data as cross-lingual supervision signals for two types of regularizers. We introduce attention mechanisms to further improve the training quality. A series of experiments on several tasks verifies the effectiveness of our method as well as the quality of the cross-lingual word and entity embeddings.
In the future, we will enrich the semantics of low-resource languages by cross-lingual linking to rich-resource languages, and extend cross-lingual words and entities to multi-lingual settings.

Acknowledgments
The work is supported by NSFC key project (No. 61533018, U1736204, 61661146007), Ministry of Education and China Mobile Research Fund (No. 20181770250), and THUNUS NExT++ Co-Lab. Partial financial support from the P3ML project funded by BMBF of Germany under grant number 01/S17064 is gratefully acknowledged.