Bridge Text and Knowledge by Learning Multi-Prototype Entity Mention Embedding

Integrating text and knowledge into a unified semantic space has attracted significant research interest recently. However, ambiguity in the common space remains a challenge: the same mention phrase often refers to different entities. In this paper, to deal with the ambiguity of entity mentions, we propose a novel Multi-Prototype Mention Embedding model, which learns multiple sense embeddings for each mention by jointly modeling words from textual contexts and entities derived from a knowledge base. In addition, we design an efficient language-model-based approach to disambiguate each mention to a specific sense. In experiments, both qualitative and quantitative analyses demonstrate the high quality of the word, entity and multi-prototype mention embeddings. Using entity linking as a case study, we apply our disambiguation method as well as the multi-prototype mention embeddings on the benchmark dataset, and achieve state-of-the-art performance.


Introduction
Jointly learning text and knowledge representations in a unified vector space greatly benefits many Natural Language Processing (NLP) tasks, such as knowledge graph completion (Wang and Li, 2016), relation extraction, word sense disambiguation (Mancini et al., 2016), entity classification (Huang et al., 2017) and entity linking (Huang et al., 2015).
Existing work can be roughly divided into two categories. One encodes words and entities into a unified vector space using Deep Neural Networks (DNN). These methods suffer from expensive training and severe limitations on the size of the word and entity vocabularies (Toutanova et al., 2015; Wu et al., 2016). The other learns word and entity embeddings separately, and then aligns similar words and entities into a common space with the help of Wikipedia hyperlinks, so that they share similar representations (Wang et al., 2014; Yamada et al., 2016). However, two major problems arise from directly integrating word and entity embeddings into a unified semantic space. First, mention phrases are highly ambiguous and can refer to multiple entities in the common space. As shown in Figure 1, the same mention independence day (m_1) can refer either to a holiday, Independence Day (US), or to a film, Independence Day (film). Second, an entity often has various aliases when mentioned in different contexts, which implies a much larger mention vocabulary compared with entities. For example, in Figure 1, the documents d_2 and d_3 describe the same entity Independence Day (US) (e_2) with distinct mentions: independence day and July 4th. We observe tens of millions of mentions referring to about 5 million entities in Wikipedia.
To address these issues, we propose to learn multiple embeddings for mentions, inspired by the Word Sense Disambiguation (WSD) task (Reisinger and Mooney, 2010; Huang et al., 2012; Tian et al., 2014; Neelakantan et al., 2014; Li and Jurafsky, 2015). The basic idea is that entities in KBs provide a meaning repository for mentions (i.e. words or phrases) in texts. That is, each mention has one or more meanings, namely mention senses, and each sense corresponds to an entity. Furthermore, we assume that different mentions referring to the same entity express the same meaning and share a common mention sense embedding, which largely reduces the size of the mention vocabulary to be learned. For example, the mentions Independence Day in d_2 and July 4th in d_3 share a common mention sense embedding during training since they refer to the same holiday. Thus, text and knowledge are bridged via mention senses.
In this paper, we propose a novel Multi-Prototype Mention Embedding (MPME) model, which jointly learns the representations of words, entities and mentions at the sense level. Different mention senses are distinguished by taking advantage of both textual context information and knowledge from reference entities. Following the frameworks in (Wang et al., 2014; Yamada et al., 2016), we use separate models to learn the representations of words, entities and mentions, and further align them via a unified optimization objective. Extending the skip-gram and CBOW models (Mikolov et al., 2013a,b), our model can be trained efficiently on a large-scale corpus. In addition, we design a language-model-based approach to determine the sense of each mention in a document based on the multi-prototype mention embeddings.
For evaluation, we first provide a qualitative analysis to verify the effectiveness of MPME in bridging text and knowledge representations at the sense level. Then, separate tasks for words and entities show improvements from using our word, entity and mention representations. Finally, using entity linking as a case study, experimental results on the benchmark dataset demonstrate the effectiveness of our embedding model as well as the disambiguation method.

Preliminaries
In this section, we formally define the input and output of multi-prototype mention embedding.
A knowledge base KB contains a set of entities E = {e_j} and their relations. We use Wikipedia as the knowledge base and organize it as a directed knowledge network: nodes denote entities, and edges are outlinks between Wikipedia pages. In this directed network, we define the entities that point to e_j as its neighbors N(e_j), but ignore the entities that e_j points to, so that repeated computations on the same edge are avoided (as would happen if edges were treated as undirected).
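As a toy illustration of this neighbor definition (the entity names and edge list below are hypothetical, not taken from the paper's data), the in-link neighbor sets N(e_j) can be built as:

```python
from collections import defaultdict

def in_link_neighbors(edges):
    """Given directed KB edges (src, dst) taken from Wikipedia outlinks,
    return N(e): the entities that point *to* e (in-links only), so each
    edge contributes to exactly one neighbor set."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[dst].add(src)  # dst's neighbors are its in-linking pages
    return neighbors

# Toy KB: two pages link to "Independence_Day_(US)", which itself links out.
edges = [("Fireworks", "Independence_Day_(US)"),
         ("July", "Independence_Day_(US)"),
         ("Independence_Day_(US)", "United_States")]
N = in_link_neighbors(edges)
```

Note that the outlink from Independence_Day_(US) contributes only to N(United_States), matching the one-direction convention above.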
A text corpus D is a sequence of words D = {w_1, ..., w_i, ..., w_|D|}, where w_i is the i-th word and |D| is the length of the sequence.
Since an entity mention m_l may consist of multiple words, we define an annotated text corpus as D' = {x_1, ..., x_i, ..., x_|D'|}, where x_i corresponds to either a word w_i or a mention m_l. We define the words around x_i within a predefined window as its context words C(x_i).
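A minimal sketch of extracting the context words C(x_i) under a fixed window (the token values are made up for illustration; multi-word mentions are treated as single tokens, as in the annotated corpus):

```python
def context_words(tokens, i, window=2):
    """C(x_i): tokens within a fixed window around position i, excluding
    x_i itself. Tokens may be words or whole mentions."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

# "independence_day" is a single mention token in the annotated corpus D'.
tokens = ["fireworks", "on", "independence_day", "in", "new_york"]
```

At corpus boundaries the window is simply truncated, so C(x_1) contains only right-hand neighbors.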
An anchor is a Wikipedia hyperlink from a mention m_l linking to its entity e_j, represented as a pair <m_l, e_j> in A. Anchors provide mention boundaries as well as reference entities in Wikipedia articles; these articles are used as the annotated text corpus D' in this paper.
Multi-Prototype Mention Embedding. Given a KB, an annotated text corpus D' and a set of anchors A, we aim to learn multi-prototype mention embeddings, namely multiple sense embeddings s_l^j in R^k for each mention m_l, as well as word embeddings w and entity embeddings e. We use M_l^* = {s_l^j} to denote the sense set of mention m_l, where each sense s_l^j refers to an entity e_j. We use s_j^* to denote the sense shared by all mentions referring to entity e_j; thus, the sense vocabulary is reduced to a fixed size |{s_j^*}| = |E|.
Example. As shown in Figure 1, Independence Day (m_1) has two mention senses s_1^1, s_1^2, and July 4th (m_2) has one mention sense s_2^2. Based on the assumption in Section 1, we have s_2^* = s_1^2 = s_2^2, referring to the entity Independence Day (US) (e_2).
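The bookkeeping behind this definition can be sketched as a small data structure (a minimal illustration; the class and field names are hypothetical). Since senses are keyed by entity, mentions referring to the same entity automatically share one sense slot, and the number of sense vectors is bounded by |E|:

```python
class SenseInventory:
    """Multi-prototype mention senses: each sense is keyed by its entity,
    so mentions referring to the same entity share one sense (s*_j)."""
    def __init__(self):
        self.mention_senses = {}   # mention string -> set of entity ids
        self.sense_vectors = {}    # entity id -> shared sense embedding slot

    def add(self, mention, entity):
        self.mention_senses.setdefault(mention, set()).add(entity)
        self.sense_vectors.setdefault(entity, None)  # one vector per entity

inv = SenseInventory()
inv.add("independence day", "Independence_Day_(US)")    # s_1^2
inv.add("independence day", "Independence_Day_(film)")  # s_1^1
inv.add("july 4th", "Independence_Day_(US)")            # s_2^2 = s_1^2
```

After these three additions there are only two sense vector slots, mirroring |{s_j^*}| = |E| in the running example.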


Note that MPME is flexible enough to utilize pretrained entity embeddings from an arbitrary knowledge representation model, and thus inherits their respective advantages in modeling different aspects of knowledge bases. This is reasonable because we output two separate semantic vector spaces for text and knowledge; the relatedness between a word and an entity can still be obtained indirectly by computing the similarity between the word embedding and the embeddings of the mention senses referring to that entity.

Method
In this section, we present the three main components of MPME: the text model, the knowledge model and the joint model; we then give details of the training process. Finally, we briefly introduce the framework for entity linking.
Text Model

We extend the skip-gram model to learn word embeddings as well as mention sense embeddings, with the following objective that maximizes the probability of observing the context words given either a word w_i or a mention sense s_l^j:

    L_D' = sum over x_i in D' of sum over w_c in C(x_i) of log P(w_c | x_i)

where x_i is either a word w_i or a mention sense s_l^j.

When encountering a mention, inspired by the Word Sense Disambiguation (WSD) task, we use the context information to distinguish among existing mention senses, or to create a new out-of-KB sense. Concretely, each mention sense has an embedding (sense vector) s_l^j and a context cluster with center mu(s_l^j). The representation of a context is defined as the average of the word vectors in it:

    C(w_i) = (1/|C(w_i)|) * sum over w_j in C(w_i) of w_j

When a mention m_l is observed with context C(m_l), we predict its sense by context cluster membership:

    s_l^max = argmax over s_l^j of sim(mu(s_l^j), C(m_l))

We adopt an online non-parametric clustering procedure to learn out-of-KB mention senses: if the distance from the context vector to the nearest sense cluster center is larger than a threshold (a hyper-parameter), we create a new context cluster and a new sense vector that does not belong to any entity-centric sense. The cluster center is the average of all the context vectors belonging to that cluster. We use cosine as the similarity metric in our experiments.
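The sense prediction and the online non-parametric clustering step can be sketched as follows (a minimal sketch, not the paper's implementation; the sense ids, vectors and threshold value are hypothetical, and for simplicity the test uses cosine similarity directly as the acceptance criterion):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (zero-safe)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def assign_sense(context_vec, centers, threshold):
    """Pick the sense whose cluster center is most similar to the context
    vector; if even the best similarity falls below the threshold, open a
    new out-of-KB sense cluster (online non-parametric clustering).
    Returns (sense_id, created_new_cluster)."""
    if centers:
        best = max(centers, key=lambda s: cosine(centers[s], context_vec))
        if cosine(centers[best], context_vec) >= threshold:
            return best, False
    new_id = f"out_of_kb_{len(centers)}"
    centers[new_id] = list(context_vec)  # new cluster seeded by this context
    return new_id, True

# Toy 2-d sense clusters for one ambiguous mention.
centers = {"s_holiday": [1.0, 0.0], "s_film": [0.0, 1.0]}
sense, created = assign_sense([0.9, 0.1], centers, threshold=0.5)
```

A context close to the holiday cluster is assigned to it, while a context dissimilar to every existing center spawns a new out-of-KB sense.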


Framework
Given a knowledge base KB, a text corpus D and a set of anchors A, we jointly learn word, entity and mention representations w, e, s. As shown in Figure 2, our proposed MPME contains four key components. (1) Mention Sense Mapping: given an anchor <m_l, e_j>, we map it to the corresponding mention sense to reduce the mention vocabulary for which embeddings are learned; if only a mention is given, we map it to several mention senses that require disambiguation (Section 3.4). (2) Entity Representation Learning: based on outlinks in Wikipedia pages, we construct a knowledge network to represent the semantic relatedness among entities, and then learn entity embeddings so that similar entities in the graph have similar representations. (3) Mention Representation Learning: given mapped anchors in contexts, we learn mention sense embeddings by incorporating both textual context embeddings and entity embeddings. (4) Text Representation Learning: we extend the skip-gram model to simultaneously learn word and mention sense embeddings on the annotated text corpus D'. Following (Yamada et al., 2016), we use Wikipedia articles as the text corpus, and the anchors provide the annotated mentions. Since words and mention senses are basic units in texts, they are naturally embedded into a unified semantic space, and each entity is mapped to one of the mention senses of its title; thus text and knowledge are combined via the bridge of mentions, and the similarity between a word and an entity, Similarity(w_i, e_j), can be obtained by computing the similarity between the word and its corresponding mention sense, Similarity(w_i, f(e_j)).
We jointly train (2), (3) and (4) using a unified optimization objective. The output embeddings of words and mentions are naturally in the same semantic space, since both are units of the annotated text corpus D' used for text representation learning. Entity embeddings keep their own semantics in another vector space, because we only use them as prediction targets in mention representation learning by extending the continuous bag-of-words (CBOW) model, which will be discussed further below.
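The indirect word-entity similarity described above can be sketched as follows (a minimal sketch with hypothetical toy vectors; f maps an entity to its shared mention sense, which lives in the text space):

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def word_entity_similarity(word_vecs, sense_vecs, word, entity):
    """Similarity(w_i, e_j) computed as Similarity(w_i, f(e_j)): compare the
    word with the entity's shared mention sense s*_j in the text space,
    never with the entity vector from the knowledge space."""
    return cosine(word_vecs[word], sense_vecs[entity])

word_vecs = {"fireworks": [1.0, 0.0]}               # toy text-space vector
sense_vecs = {"Independence_Day_(US)": [0.8, 0.6]}  # shared sense of e_2
sim = word_entity_similarity(word_vecs, sense_vecs,
                             "fireworks", "Independence_Day_(US)")
```

The entity embedding itself never enters the comparison, which is why the two vector spaces can remain separate.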

Figure 2 shows a real example. Word embeddings w_i and entity embeddings e_j keep their own semantic spaces and are naturally bridged via the newly learned mention sense embeddings, which inspires us to globally optimize the probability of choosing the mention senses of all mention phrases in a given document. Since each mention sense corresponds to an entity, the mention sense disambiguation process can also be regarded as linking entities to the knowledge base in an unsupervised way, as detailed below.

Mention Sense Mapping
There are two kinds of mappings: from entities to mention senses, and from mention names to mention senses. The former is predefined at the very beginning: given the knowledge base KB, we extract entity titles {t_l} and initialize each with multiple mention senses, where the number of senses depends on how many entities share a common title. The latter is to find the possible mention senses for a given mention name, which is similar to candidate generation in the entity linking task.
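The first kind of mapping can be sketched as follows (a minimal sketch; the entity ids are hypothetical). Each title is initialized with as many senses as there are entities sharing it:

```python
from collections import defaultdict

def init_title_senses(entities):
    """Map each entity title to its initial mention senses: a title starts
    with one sense per entity that shares it."""
    senses = defaultdict(list)
    for entity_id, title in entities:
        senses[title].append(entity_id)
    return senses

entities = [("e1", "Independence Day"),   # the film
            ("e2", "Independence Day"),   # the US holiday
            ("e3", "Memorial Day")]
senses = init_title_senses(entities)
```

The ambiguous title "Independence Day" starts with two senses, while "Memorial Day" starts with one.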
Conventional candidate generation maintains a list of pairs of mention names and entities, where each pair denotes a candidate reference in the knowledge base for the mention name, and recognizes mention names in text by exact string matching; alternatively, it first recognizes possible mention names in texts using an NER (Named Entity Recognition) tool, and then finds candidate entities via the mention-entity list. We can also annotate the text corpus by using an NER tool (e.g. in NLTK) to recognize mentions and then disambiguating their mapped mention senses as described in Section 3.4; this is ongoing work with the goal of learning additional out-of-KB senses by self-training. In this paper, we focus on the effectiveness of our model and the quality of the three kinds of learned embeddings.
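The exact-string-matching variant of candidate generation can be sketched as follows (a minimal, naive sketch using substring lookup rather than an efficient automaton; the mention-entity table is hypothetical):

```python
def find_mentions(text, mention_entity_list):
    """Conventional candidate generation: scan the text for known mention
    names by exact (case-insensitive) string matching and return
    (mention, candidate entities) pairs."""
    found = []
    lowered = text.lower()
    for name, candidates in mention_entity_list.items():
        if name in lowered:
            found.append((name, candidates))
    return found

table = {"independence day": ["Independence_Day_(US)", "Independence_Day_(film)"],
         "july 4th": ["Independence_Day_(US)"]}
hits = find_mentions("Fireworks on Independence Day in 1996", table)
```

A production system would use an Aho-Corasick-style automaton for one-pass matching over a large mention list; plain substring lookup suffices to show the idea.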

Similar to WSD, we maintain a context cluster for each mention sense, which can be used for disambiguation given the contexts (Section 5). For example, in d_1 of Figure 2, the context cluster of a sense s_j^* consists of all the context vectors of its mentions. We also maintain a context cluster center mu_j^* for each mention sense s_j^*, computed by averaging all the context vectors belonging to the cluster, where a context vector is the average of the context word embeddings, (1/|C(w_i)|) * sum over w_j in C(w_i) of w_j. The cluster center is helpful for inducing mention senses in context: when encountering a mention, we map it to its set of mention senses, and then find the nearest one according to the distance from its context vector to each mention sense cluster center, as will be discussed in Section 5.
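Maintaining a cluster center as the average of its context vectors does not require storing them all; a running mean can be updated incrementally (a minimal sketch, with toy 2-d vectors):

```python
def update_center(center, count, context_vec):
    """Incrementally update a sense cluster center mu*_j as the running
    mean of all context vectors assigned to the cluster. Returns the new
    center and the new member count."""
    new_count = count + 1
    new_center = [(c * count + x) / new_count
                  for c, x in zip(center, context_vec)]
    return new_center, new_count

# One context vector already in the cluster; a second one arrives.
center, n = [1.0, 0.0], 1
center, n = update_center(center, n, [0.0, 1.0])
```

After the update the center is the exact mean of the two context vectors, so online updates and batch averaging agree.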

To reduce the size of the mention vocabulary, each mention is mapped to a set of shared mention senses according to a predefined dictionary. We build the dictionary by collecting entity-mention pairs <m_l, e_j> from Wikipedia anchors and page titles, and create a new mention sense for each distinct entity a mention refers to. The sense number of a mention thus depends on how many different entity-mention pairs it is involved in.
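The dictionary construction described above can be sketched in a few lines. The function name and input format below are hypothetical illustrations under the stated assumptions, not the paper's actual code:

```python
from collections import defaultdict

def build_sense_dict(anchors):
    """Map each mention name to its set of senses, one sense per distinct
    entity the name refers to. `anchors` is an iterable of
    (mention_name, entity_title) pairs collected from Wikipedia anchors
    and page titles (hypothetical input format)."""
    senses = defaultdict(set)
    for mention, entity in anchors:
        # one mention sense per distinct (mention, entity) pair
        senses[mention].add((mention, entity))
    return dict(senses)

# toy anchor list mirroring the paper's running example
anchors = [
    ("Independence Day", "Independence Day (US)"),
    ("Independence Day", "Independence Day (film)"),
    ("July 4th", "Independence Day (US)"),
]
sense_dict = build_sense_dict(anchors)
```

Here "Independence Day" receives two senses while "July 4th" receives one, matching the rule that the sense number depends on the number of distinct entity-mention pairs.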
Formally, we have: M*_l = g(m_l) = g(<m_l, e_j>) = {s*_j}, where g(·) denotes the mapping function from an entity mention to its mention sense given an anchor. We directly use the anchors contained in the annotated text corpus D' for training. As Figure 2 shows, we replace the anchor <July 4th, Independence Day (US)> with the corresponding mention sense s*_Independence Day (US).

Representation Learning. Using KB, A and D' as input, we design three separate models and a unified optimization objective to jointly learn entity, word and mention sense representations in two semantic spaces. As shown in the knowledge space of Figure 2, entity embeddings reflect their relatedness in the network. For example, Independence Day (US) (e_1) and Memorial Day (e_3) are close to each other because they share some common neighbors, such as United States and Public holidays in the United States.
Word and mention embeddings are learned in the same semantic space. As two basic units in D', their embeddings represent their distributed semantics in texts. For example, the mention Independence Day and the word celebrations co-occur frequently when the mention refers to the holiday Independence Day (US), so they have similar representations. Without disambiguating mention senses, some words, such as film, would also share representations similar to Independence Day.

Besides, by introducing entity embeddings into our MPME framework, knowledge information is also distilled into the mention sense embeddings, so that the mention sense Memorial Day will be similar to Independence Day (US).
Mention Sense Disambiguation. According to our predefined dictionary, each mention has been mapped to more than one sense and learned with multiple embedding vectors. Consequently, inducing the correct sense for a mention within a context is critical when using the multi-prototype embeddings, especially in an unsupervised way. Formally, given an annotated document D', we determine one sense ŝ*_j ∈ M*_l for each mention m_l ∈ D', where ŝ*_j is the correct sense. Based on a language model, we design a mention sense disambiguation method that uses no supervision and takes into account three aspects: 1) the sense prior denotes how dominant the sense is, 2) local context information reflects how semantically appropriate the sense is in the context, and 3) global mention information denotes how semantically consistent the sense is with the neighbor mentions. To better utilize the context information, we maintain a context cluster for each mention sense during training, which will be detailed in Section 4.4.
Since each mention sense corresponds to an entity in the given KB, the disambiguation method is equivalent to entity linking. Thus, text and the knowledge base are bridged via the multi-prototype mention embeddings. We will give more analysis in Section 6.4.

Representation Learning
Distributional representation learning plays an increasingly important role in many fields (Bengio et al., 2013; Zhang et al., 2016, 2017) due to its effectiveness for dimensionality reduction and for addressing the sparseness issue. For NLP tasks, this trend has been accelerated by the Skip-gram and CBOW models (Mikolov et al., 2013a,b) thanks to their efficiency and the remarkable semantic compositionality of the embedding vectors. In this section, we first briefly introduce the Skip-gram and CBOW models, and then extend them to three variants for word, mention and entity representation learning.

Skip-Gram and CBOW model
The basic idea of the Skip-gram and CBOW models is to model the predictive relations among sequential words. Given a sequence of words D, the optimization objective of the Skip-gram model is to use the current word to predict its context words by maximizing the average log probability:

L = 1/|D| Σ_{w_i ∈ D} Σ_{w_o ∈ C(w_i)} log P(w_o | w_i)

In contrast, the CBOW model aims to predict the current word given its context words:

L = 1/|D| Σ_{w_i ∈ D} log P(w_i | C(w_i))

Formally, the conditional probability P(w_o | w_i) is defined using a softmax function:

P(w_o | w_i) = exp(w'_o · w_i) / Σ_{w ∈ V} exp(w' · w_i)

where w_i and w'_o denote the input and output word vectors during training, and V is the word vocabulary. Furthermore, these two models can be accelerated by using hierarchical softmax or negative sampling (Mikolov et al., 2013a,b).
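As a concrete illustration of the softmax above, the conditional probability can be computed directly over a toy vocabulary. The vocabulary and 2-d vectors below are made up for illustration only:

```python
import math

def softmax_prob(w_i, w_o, out_vecs):
    """P(w_o | w_i): softmax over dot products between the input vector
    w_i and every output word vector in out_vecs (a name -> vector dict)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(v, w_i)) for w, v in out_vecs.items()}
    return scores[w_o] / sum(scores.values())

# toy 2-d output vectors (illustrative values only)
out_vecs = {"celebrations": [1.0, 0.0], "film": [0.0, 1.0]}
p = softmax_prob([1.0, 0.0], "celebrations", out_vecs)
```

By construction the probabilities over the whole vocabulary sum to one, which is why the full softmax is expensive for large vocabularies and motivates the hierarchical-softmax and negative-sampling approximations mentioned above.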

Entity Representation Learning
Given a knowledge base KB, we aim to learn entity embeddings by modeling "contextual" entities, so that entities sharing more common neighbors tend to have similar representations. Therefore, we extend the Skip-gram model to the network by maximizing the log probability of observing neighbor entities:

L_E = Σ_{e_i ∈ KB} Σ_{e_o ∈ N(e_i)} log P(e_o | e_i)

where N(e_i) denotes the neighbor entities of e_i in the knowledge network, and P(e_o | e_i) is defined with a softmax analogous to the word case.
Clearly, the neighbor entities serve a similar role as the context words in the Skip-gram model. As shown in Figure 2, the entity Memorial Day (e_3) shares two common neighbors, United States and Public holidays in the United States, with the entity Independence Day (US), so their embeddings are close in the knowledge space. These entity embeddings will later be used to learn mention representations.
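A minimal sketch of how training pairs for this extended Skip-gram could be generated from the knowledge network, assuming a simple adjacency-list representation (hypothetical format):

```python
def neighbor_pairs(graph):
    """Yield (entity, neighbor) training pairs from an adjacency-list
    graph; neighbors play the role of context words in Skip-gram."""
    for entity, neighbors in graph.items():
        for n in neighbors:
            yield (entity, n)

# toy knowledge network mirroring the Figure 2 example
graph = {
    "Independence Day (US)": ["United States",
                              "Public holidays in the United States"],
    "Memorial Day": ["United States",
                     "Public holidays in the United States"],
}
pairs = list(neighbor_pairs(graph))
```

Because the two entities emit pairs with the same "context" neighbors, a Skip-gram-style trainer fed these pairs would push their embeddings together, as described above.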

Mention Representation Learning
As mentioned above, the textual context information and reference entities are helpful for distinguishing different senses of a mention. Thus, given an anchor <m_l, e_j> and its context words C(m_l), we combine the mention sense embedding with its context word embeddings to predict the reference entity by extending the CBOW model. The objective function is as follows:

L_M = Σ_{<m_l, e_j> ∈ A} log P(e_j | C(m_l), s*_j)

where s*_j = g(<m_l, e_j>). Thus, if two mentions refer to similar entities and share similar contexts, they tend to be close in the semantic vector space. Take Figure 1 as an example again: the mentions Independence Day and Memorial Day refer to the similar entities Independence Day (US) (e_1) and Memorial Day (e_2), and they also share some similar context words, such as celebrations in documents d_2 and d_3, so their sense embeddings are close to each other in the text space.
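The CBOW-style input used to predict the reference entity — an average over the context word embeddings together with the sense embedding — can be sketched as follows. The vectors are toy values and the function name is an illustrative assumption:

```python
def cbow_input(context_word_vecs, sense_vec):
    """Average the context word embeddings together with the mention
    sense embedding to form the CBOW-style input vector that is then
    used to predict the reference entity."""
    vecs = context_word_vecs + [sense_vec]
    dim = len(sense_vec)
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# two toy 2-d context word vectors plus a toy sense vector
x = cbow_input([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

The resulting vector would be scored against entity embeddings (e.g., with the softmax of Section 4.1) so that the anchor's gold entity receives high probability.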

Text Representation Learning
Instead of directly using a word or a mention to predict the context words, we incorporate mention senses to jointly optimize word and sense representations, which avoids the noise introduced by ambiguous mentions. For example, in Figure 2, without identifying the mention Independence Day as the holiday or the film, dissimilar context words such as celebrations and film in documents d_1 and d_2 would share similar semantics, which would further hurt the entity representations during joint training.
Given the annotated corpus D', we use a word w_i or a mention sense s*_j to predict the context words by maximizing the following objective function:

L_T = Σ_{u ∈ D'} Σ_{w_o ∈ C(u)} log P(w_o | u), u ∈ {w_i, s*_j}

where s*_j = g(<m_l, e_j>) is obtained from anchors in Wikipedia articles.
Thus, words and mention senses share the same vector space, where similar words and mention senses are close to each other, such as celebrations and Independence Day (US), because they frequently occur in the same contexts.
Similar to WSD, we maintain a context cluster for each mention sense, which can be used for mention sense disambiguation (Section 5). The context cluster of a mention sense s*_j contains all the context vectors of its mention m_l. We compute the context vector of an occurrence of m_l as the average of its context word embeddings:

c(m_l) = 1/|C(m_l)| Σ_{w_i ∈ C(m_l)} w_i

Further, the center µ*_j of a context cluster is defined as the average of the context vectors of all mentions that refer to the sense. These context clusters will later be used to disambiguate the sense of a given mention from its contexts.
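Both quantities above are plain averages, which the following minimal sketch computes over toy 2-d embeddings (function names are illustrative assumptions):

```python
def context_vector(context_vecs):
    """Average of the context word embeddings of one mention occurrence."""
    dim = len(context_vecs[0])
    return [sum(v[i] for v in context_vecs) / len(context_vecs)
            for i in range(dim)]

def cluster_center(occurrence_contexts):
    """Center mu*_j of a sense's context cluster: the average of the
    context vectors of every occurrence assigned to that sense."""
    ctxs = [context_vector(c) for c in occurrence_contexts]
    return context_vector(ctxs)  # an average of averages

# two occurrences: one with two context words, one with a single word
center = cluster_center([[[2.0, 0.0], [0.0, 2.0]], [[4.0, 4.0]]])
```

At disambiguation time, the context vector of a new occurrence is compared against each candidate sense's center, and the nearest center wins.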

Joint Training
Considering all of the above representation learning components, we define the overall objective function as a linear combination:

L = L_T + L_E + L_M

The goal of training MPME is to maximize this function, iteratively updating the three types of embeddings. We also use the negative sampling technique for efficiency (Mikolov et al., 2013a). MPME shares the same entity representation learning method with (Yamada et al., 2016), but the role of entities in the entire framework, as well as mention representation learning, differs in three aspects. First, we focus on learning embeddings for mentions, not merely for words as in (Yamada et al., 2016); clearly, MPME integrates text and knowledge base more naturally. Second, we propose to learn multiple embeddings for each mention, denoting its different meanings. Third, we prefer to use both mentions and context words to predict entities, so that the distribution of entities helps improve word embeddings, while avoiding the harm of forcing entity embeddings to fit word embeddings during training (Wang et al., 2014). We will give more analysis in the experiments.
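For the negative sampling acceleration mentioned above, the per-pair objective can be sketched as follows. This is the generic word2vec-style loss, shown under the assumption that scores are dot products of input and output vectors, not the paper's exact code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(pos_score, neg_scores):
    """Negative-sampling loss for one training pair: reward a high
    score for the observed (positive) pair and low scores for the
    k sampled negative pairs."""
    loss = -math.log(sigmoid(pos_score))
    loss += sum(-math.log(sigmoid(-s)) for s in neg_scores)
    return loss

# a well-separated example: positive scores high, negatives low
loss = neg_sampling_loss(2.0, [-1.0, -2.0])
```

The loss shrinks as the positive score grows and the negative scores fall, which is exactly the behavior the softmax objectives above approximate at a fraction of the cost.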

Mention Sense Disambiguation
As mentioned in Section 3, we induce the correct sense ŝ*_j ∈ M*_l for each mention m_l in an annotated document D'. We regard this problem from the perspective of a language model that maximizes the joint probability of all mention senses contained in the document. However, finding the global optimum is expensive, with a time complexity of O(|M||M*_l|). Thus, we approximately identify each mention sense independently:

P(D', ..., s*_j, ...) ≈ ∏ P(D' | s*_j) · P(s*_j) ≈ ∏ P(C(m_l) | s*_j) · P(N(m_l) | s*_j) · P(s*_j)

where P(C(m_l) | s*_j), the local context information (Section 3), denotes the probability of the local contexts of m_l given its mention sense s*_j. We define it proportional to the cosine similarity between the current context vector and the sense's context cluster center µ*_j, as described in Section 4.4. It measures how likely a mention sense is to occur together with the current context words. For example, given the mention sense Independence Day (film), the word film is more likely to appear in the context than the word celebrations.
P(N(m_l) | s*_j), the global mention information, denotes the probability of the contextual mentions of m_l given its sense s*_j, where N(m_l) is the collection of neighbor mentions occurring together with m_l in a predefined context window. We define it proportional to the cosine similarity between the mention sense embedding and the neighbor mention vector, which is computed similarly to the context vector: 1/|N(m_l)| Σ_{m_k ∈ N(m_l)} ŝ*_k, where ŝ*_k is the correct sense of the neighbor mention m_k.
Since there are usually multiple mentions in a document to be disambiguated, the mentions disambiguated first help induce the senses of the remaining mentions. That is, the order in which mentions are disambiguated influences performance. Intuitively, we adopt two orders: 1) L2R (left to right) induces senses for all the mentions in the document following the natural reading order, which varies with the language but is normally left to right in the sequence. 2) S2C (simple to complex) first determines the correct sense for the mentions with fewer senses, which makes the problem easier.
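The S2C order amounts to a simple sort by the number of candidate senses; a sketch with a hypothetical sense dictionary:

```python
def s2c_order(mentions, sense_dict):
    """Simple-to-complex order: disambiguate mentions with fewer
    candidate senses first, so easy decisions inform harder ones."""
    return sorted(mentions, key=lambda m: len(sense_dict.get(m, [])))

# "July 4th" has one candidate sense, "Independence Day" has two
sense_dict = {"July 4th": ["s1"], "Independence Day": ["s1", "s2"]}
order = s2c_order(["Independence Day", "July 4th"], sense_dict)
```

In this toy case the unambiguous mention "July 4th" is resolved first, and its sense then serves as a neighbor signal when disambiguating "Independence Day".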
Global mention information assumes that there should be consistent semantics within a context window, and measures whether all the neighbor mentions are related. For instance, suppose the two mentions Memorial Day and Independence Day occur in the same document. If we already know that Memorial Day denotes a holiday, then Independence Day obviously has a higher probability of being a holiday than a film.
P(s*_j), the sense prior, is the prior probability of sense s*_j, indicating how likely it is to occur without considering any additional information. We define it proportional to the frequency of sense s*_j in Wikipedia anchors:

P(s*_j) = |A_{s*_j}|^γ / Σ_{s*_k ∈ M*_l} |A_{s*_k}|^γ

where A_{s*_j} is the set of anchors annotated with s*_j, and γ is a smoothing hyper-parameter that controls the impact of the prior on the overall probability; it is set by experiments (Section 6.4).
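The smoothed prior can be sketched directly; note how γ < 1 narrows the gap between frequent and rare senses. The counts are toy values and the function name is an illustrative assumption:

```python
def smoothed_prior(anchor_counts, gamma):
    """P(s*_j) proportional to |A_{s*_j}|^gamma, normalized over the
    candidate senses of one mention. anchor_counts maps each sense to
    its anchor frequency |A_{s*_j}|."""
    total = sum(c ** gamma for c in anchor_counts.values())
    return {s: (c ** gamma) / total for s, c in anchor_counts.items()}

# a 9:1 frequency gap shrinks to 3:1 under gamma = 0.5
priors = smoothed_prior({"holiday": 900, "film": 100}, gamma=0.5)
```

With γ = 1 the prior is the raw relative frequency (0.9 vs. 0.1 here); with γ = 0.5 it softens to 0.75 vs. 0.25, giving rare senses a better chance when the context evidence supports them.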

Experiment
Setup. We choose Wikipedia, the March 2016 dump, as the training corpus, which contains nearly 75 million anchors, 180 million edges among entities and 1.8 billion tokens after preprocessing. We then train MPME for 1.5 million words, 5 million entities and 1.7 million mentions. The entire training process of 10 iterations takes nearly 8 hours on a server with a 64-core CPU and 188GB of memory.
We use the default settings of word2vec, set the embedding dimension to 200 and the context window size to 5. For each positive example, we sample 5 negative examples.
Baseline Methods As far as we know, this is the first work to deal with mention ambiguity in the integration of text and knowledge representations, so there are no exact baselines for comparison. We use the method in (Yamada et al., 2016) as a baseline, marked as ALIGN, because (1) it is the most similar work, directly aligning word and entity embeddings, and (2) it achieves state-of-the-art performance on the entity linking task.
To investigate the effect of multiple prototypes, we degrade our method to a single-prototype baseline, which uses one sense to represent all mentions with the same phrase, namely Single-Prototype Mention Embedding (SPME). For example, SPME learns only one sense vector for Independence Day, whether it denotes a holiday or a film.
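The difference between the two settings can be illustrated with hypothetical embedding tables (the keys and vectors below are invented): SPME keys a sense vector on the mention phrase alone, while MPME keys one on (phrase, sense), so each sense of Independence Day gets its own vector.

```python
# Hypothetical embedding tables: SPME has one entry per mention phrase,
# MPME has one entry per (phrase, sense) pair.
spme = {"Independence Day": [0.5, 0.5]}
mpme = {
    ("Independence Day", "holiday"): [0.9, 0.1],
    ("Independence Day", "film"):    [0.1, 0.9],
}

# SPME returns the same vector regardless of the intended sense
assert spme["Independence Day"] == [0.5, 0.5]
# MPME keeps the two senses apart
assert mpme[("Independence Day", "holiday")] != mpme[("Independence Day", "film")]
```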

Qualitative Analysis
We use cosine similarity to measure the similarity between two vectors, and present the top 5 nearest words and entities for the two most popular senses of the mention Independence Day. Because ALIGN is incapable of dealing with multi-word mentions, we only present the results of SPME and MPME.
As shown in Figure 1, without considering mention senses, the mention Independence Day shows only a dominant holiday sense under SPME, ignoring all other senses. Instead, MPME successfully learns two clear and distinct senses. For the sense Independence Day (US), all of its nearest words and entities, such as parades, celebrations, and Memorial Day, are holiday related, while for the sense Independence Day (film), its nearest words and entities, like robocop and The Terminator, are all science fiction films. These results verify the effectiveness of our framework in learning mention embeddings at the sense level.
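The nearest-neighbor lists themselves come from a simple cosine ranking over the joint embedding space; a minimal sketch with invented 2-d vectors (our own function and variable names):

```python
import math

def top_k_nearest(query, vectors, k=5):
    # rank items by cosine similarity to the query vector
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return sorted(vectors, key=lambda name: cosine(vectors[name], query),
                  reverse=True)[:k]

# toy 2-d vectors: the holiday sense should retrieve holiday terms first
vectors = {
    "parades":        [0.9, 0.1],
    "Memorial Day":   [0.8, 0.2],
    "The Terminator": [0.1, 0.9],
}
assert top_k_nearest([1.0, 0.0], vectors, k=2) == ["parades", "Memorial Day"]
```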

Entity Relatedness
To evaluate the quality of entity embeddings, we conduct experiments on the dataset designed for measuring entity relatedness (Ceccarelli et al., 2013; Huang et al., 2015; Yamada et al., 2016). The dataset contains 3,314 entities, and each mention has 91 candidate entities on average, with gold-standard labels indicating whether they are semantically related. We compute cosine similarity between entity embeddings to measure their relatedness, and rank them in descending order. To evaluate the ranking quality, we use two standard metrics: normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002) and mean average precision (MAP) (Schütze, 2008).
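Both metrics are computed from the ranked relevance labels; a minimal sketch (our own implementations of the standard definitions, not code from this work):

```python
import math

def ndcg(relevances):
    """Normalized discounted cumulative gain of a ranked list of
    relevance labels (higher is better, 1.0 for a perfect ranking)."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevances):
    """Average precision of a ranked list of binary relevance labels;
    MAP is this value averaged over all queries."""
    hits, total = 0, 0.0
    for i, r in enumerate(relevances):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0

# a perfect ranking scores 1.0 on both metrics
assert ndcg([1, 1, 0, 0]) == 1.0
assert average_precision([1, 1, 0, 0]) == 1.0
```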
We design another baseline method, Entity2vec, which learns entity embeddings using the method described in Section 4.2 without joint training with word and mention sense embeddings. As shown in Table 2, ALIGN achieves lower performance than Entity2vec, because it does not consider mention phrase ambiguity and introduces considerable noise when forcing entity embeddings to match word embeddings in the unified space. For example, the entity Gente (magazine) should be more relevant to the entity France, the place where its publishing company is located. However, ALIGN mixes various meanings of the mention Gente (e.g., the song) and ranks some bands higher (e.g., the entity Poolside (band)). SPME also does not consider mention ambiguity, yet achieves comparable results with Entity2vec; we find that it avoids some noise by using word embeddings to predict entities. MPME outperforms all the other methods, which demonstrates that unambiguous textual information helps refine the entity embeddings.

Word Analogical Reasoning
Following (Mikolov et al., 2013a; Wang et al., 2014), we use the word analogical reasoning task to evaluate the quality of word embeddings. The dataset consists of 8,869 semantic questions (e.g., "Paris":"France"::"Rome":?) and 10,675 syntactic questions (e.g., "sit":"sitting"::"walk":?). We solve a question by finding the word vector w_? closest to w_France − w_Paris + w_Rome according to cosine similarity, and report the accuracy of the top 1 nearest word. We also adopt Word2vec as an additional baseline method, which provides a standard to measure the impact of the other components on word embeddings. Table 3 shows the results. We can see that ALIGN, SPME and MPME achieve higher performance on semantic questions, because relations among entities (e.g., the country-capital relation between the entities France and Paris) enhance the semantics of word embeddings through joint training. On the other hand, their performance on syntactic questions is weakened, because the more accurate semantics biases the models toward predicting semantic relations even for syntactic queries. For example, given the query "pleasant":"unpleasant"::"possibly":?, our model tends to return a word (e.g., probably) highly semantically related to the query words, instead of the syntactically similar word impossibly. In this scenario, we are more concerned with the semantic task of incorporating knowledge of reference entities into word embeddings, and this issue could be tackled, to some extent, by using syntactic tools such as stemming. The word embeddings of MPME achieve the best performance on semantic questions mainly because (1) text representation learning has better generalization ability, owing to the larger number of training examples than entities (e.g., 1.8b vs. 0.18b) and the relatively smaller vocabulary (e.g., 1.5m vs. 5m), and (2) unambiguous mention embeddings capture both textual context information and knowledge, and thus enhance word and entity embeddings.
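The vector-offset evaluation described above can be sketched with toy vectors (the vocabulary and 2-d vectors are invented for illustration; real embeddings are 200-d):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def solve_analogy(vocab, a, b, c):
    # find the word w (excluding the query words) whose vector is
    # closest to v_b - v_a + v_c by cosine similarity
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

# toy vectors in which capital -> country is a shared offset
vocab = {
    "Paris":  [1.0, 0.0],
    "France": [1.0, 1.0],
    "Rome":   [0.0, 1.0],
    "Italy":  [0.0, 2.0],
    "sit":    [0.5, 0.5],
}
assert solve_analogy(vocab, "Paris", "France", "Rome") == "Italy"
```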

A Case Study: Entity Linking
Entity linking is a core NLP task that identifies the reference entity in a predefined knowledge base for each mention in text. The main difficulty lies in the ambiguity of various entities sharing the same mention phrase. Previous work addressed this issue by taking advantage of the similarity between words and entities (Francis-Landau et al., 2016; Sun et al., 2015), and/or the relations among entities (Thien Huu Nguyen, 2016; Cao et al., 2015). Therefore, we use entity linking as a case study for a comprehensive evaluation of the multi-prototype mention embeddings.
We use the public AIDA dataset created by (Hoffart et al., 2011), which includes 1,393 documents and 27,816 mentions referring to Wikipedia entries. The dataset is divided into 946, 216 and 231 documents for training, development and testing, respectively. Following (Pershina et al., 2015; Yamada et al., 2016), we use a publicly available dictionary to generate candidate entities and mention senses. For evaluation, we rank the candidate entities for each mention and report both standard micro precision (aggregated over all mentions) and macro precision (aggregated over all documents) over top-ranked entities.

Supervised Entity Linking
Yamada et al. (2016) designed a list of features for each mention and candidate entity pair. By incorporating these features into a supervised learning-to-rank algorithm, Gradient Boosting Regression Tree (GBRT), each pair is assigned a relevance score indicating whether the mention and entity should be linked. Following their recommended parameters, we set the number of trees to 10,000, the learning rate to 0.02 and the maximum depth of each tree to 4.
Based on the word and entity embeddings learned by ALIGN, the key features in (Yamada et al., 2016) come from two aspects: (1) the cosine similarity between context words and the candidate entity, and (2) the coherence among "contextual" entities in the same document.
To evaluate the performance of the multi-prototype mention embeddings, we incorporate the following features into GBRT for comparison: (1) the cosine similarity between the current context vector and the sense context cluster center µ^*_j, which denotes how likely the mention sense refers to the candidate entity, and (2) the cosine similarity between the current context vector and the mention sense embedding. As shown in Table 4, ALIGN performs better than SPME. This is because SPME learns word embeddings and entity embeddings in separate semantic spaces, and thus fails to measure the similarity between context words and candidate entities. MPME, by contrast, computes the similarity between context words and mention senses instead of entities, and achieves the best performance, which also demonstrates the high quality of the mention sense embeddings.
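A minimal sketch of these two embedding features follows (our own function names; the 2-d vectors are toy stand-ins for the learned context vector, sense embedding and cluster center):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mention_features(context_vec, sense_vec, sense_cluster_center):
    # feature (1): current context vs. the sense's context cluster center
    # feature (2): current context vs. the mention sense embedding
    return [cosine(context_vec, sense_cluster_center),
            cosine(context_vec, sense_vec)]

feats = mention_features([0.7, 0.3], [0.6, 0.4], [0.8, 0.2])
assert len(feats) == 2 and all(-1.0 <= f <= 1.0 for f in feats)
```

These per-pair similarities would then be combined with the other features and fed to the learning-to-rank GBRT.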

Unsupervised Entity Linking
Linking a mention to a specific entity is equivalent to disambiguating mention senses, since each candidate entity corresponds to a mention sense. As described in Section 5, we disambiguate senses in two orders: (1) L2R (from left to right) and (2) S2C (from simple to complex).
We evaluate our unsupervised disambiguation methods on the entire AIDA dataset. For a fair comparison, we choose the state-of-the-art unsupervised methods proposed in (Hoffart et al., 2011; Alhelbawy and Gaizauskas, 2014; Cucerzan, 2007; Kulkarni et al., 2009; Masumi Shirakawa and Nishio, 2011), which use the same dataset. Table 5 shows the results. Our two methods outperform all the others. MPME (L2R) is more efficient and easier to apply, while MPME (S2C) slightly outperforms it, because the additional step of ranking mentions by their number of candidates guarantees higher disambiguation performance for simple mentions, which in turn helps disambiguate complex mentions through the global mention information in Equation 8.
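The S2C ordering itself amounts to sorting mentions by their number of candidate senses; a sketch with invented candidate lists:

```python
def s2c_order(mentions, candidates):
    """Simple-to-complex ordering: disambiguate mentions with fewer
    candidate senses first, so that confident early decisions can guide
    harder ones via the global mention information."""
    return sorted(mentions, key=lambda m: len(candidates[m]))

# invented candidate lists for the mentions in one document
cands = {
    "Asian Cup": ["tournament"],
    "Japan": ["country", "national football team"],
    "Syria": ["country", "national football team", "ancient region"],
}
assert s2c_order(list(cands), cands) == ["Asian Cup", "Japan", "Syria"]
```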
Analyzing the results, we observe a disambiguation bias toward popular senses. For example, there are three mentions in the sentence "Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday", where the country names Japan and Syria actually denote their national football teams, while the tournament name Asian Cup has little ambiguity. Compared to the team sense, the country sense occurs more frequently and has a dominant prior, which greatly affects disambiguation. By incorporating local context information and global mention information, both the context words (e.g., defence and match) and the neighbor mention (e.g., Asian Cup) provide enough clues to identify the football-related mention senses instead of the countries.

Influence of Smoothing Parameter
As mentioned above, a mention sense may have a dominant prior that greatly affects disambiguation, so we introduce a smoothing parameter γ to control its importance in the overall probability. Figure 3 shows the linking accuracy under different values of γ on the AIDA dataset. γ = 0 indicates that we do not use any prior knowledge, and γ = 1 indicates no smoothing.
We can see that both micro and macro accuracy decrease considerably without smoothing (γ = 1), while using only the local and global probabilities for disambiguation (γ = 0) achieves comparable performance. When γ = 0.05, both accuracies reach their peaks; we therefore use this value as the optimal default in our experiments.
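Selecting γ then amounts to a one-dimensional grid search on a development set; a sketch (the accuracy values below are invented, and only the peak at γ = 0.05 mirrors Figure 3):

```python
def tune_gamma(gammas, evaluate):
    """Grid-search the smoothing parameter: pick the gamma that gives
    the best linking accuracy under the supplied evaluation function."""
    return max(gammas, key=evaluate)

# hypothetical dev-set accuracies keyed by gamma
curve = {0.0: 0.80, 0.05: 0.86, 0.5: 0.82, 1.0: 0.75}
assert tune_gamma(list(curve), curve.get) == 0.05
```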

Conclusions and Future Work
In this paper, we propose a novel Multi-Prototype Mention Embedding model that jointly learns word, entity and mention sense embeddings. These mention senses capture both textual context information and knowledge from reference entities, and provide an efficient approach to disambiguating mention senses in text. We conduct a series of experiments to demonstrate that multi-prototype mention embeddings improve the quality of both word and entity representations. Using entity linking as a case study, we apply our disambiguation method as well as the multi-prototype mention embeddings to the benchmark dataset, and achieve state-of-the-art performance.
In the future, we will improve the scalability of our model, learn multi-prototype embeddings for mentions without reference entities in a knowledge base, and introduce compositional approaches to model the internal structures of multi-word mentions.