Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation

Named Entity Disambiguation (NED) refers to the task of resolving multiple named entity mentions in a document to their correct references in a knowledge base (KB) (e.g., Wikipedia). In this paper, we propose a novel embedding method specifically designed for NED. The proposed method jointly maps words and entities into the same continuous vector space. We extend the skip-gram model by using two models. The KB graph model learns the relatedness of entities using the link structure of the KB, whereas the anchor context model aims to align vectors such that similar words and entities occur close to one another in the vector space by leveraging KB anchors and their context words. By combining contexts based on the proposed embedding with standard NED features, we achieved state-of-the-art accuracy of 93.1% on the standard CoNLL dataset and 85.2% on the TAC 2010 dataset.


Introduction
Named Entity Disambiguation (NED) is the task of resolving ambiguous mentions of entities to their referent entities in a knowledge base (KB) (e.g., Wikipedia). NED has lately been extensively studied (Cucerzan, 2007;Mihalcea and Csomai, 2007;Milne and Witten, 2008b;Ratinov et al., 2011) and used as a fundamental component in numerous tasks, such as information extraction, knowledge base population (McNamee and Dang, 2009;Ji et al., 2010), and semantic search (Blanco et al., 2015). We use Wikipedia as KB in this paper.
The main difficulty in NED is ambiguity in the meaning of entity mentions. For example, the mention "Washington" in a document can refer to various entities, such as the state, or the capital of the US, the actor Denzel Washington, the first US president George Washington, and so on. In order to resolve these ambiguous mentions into references to the correct entities, early approaches focused on modeling textual context, such as the similarity between contextual words and encyclopedic descriptions of a candidate entity (Bunescu and Pasca, 2006;Mihalcea and Csomai, 2007). Most state-of-the-art methods use more sophisticated global approaches, where all mentions in a document are simultaneously disambiguated based on global coherence among disambiguation decisions. Word embedding methods are also becoming increasingly popular (Mikolov et al., 2013a;Mikolov et al., 2013b;Pennington et al., 2014). These involve learning continuous vector representations of words from large, unstructured text corpora. The vectors are designed to capture the semantic similarity of words when similar words are placed near one another in a relatively lowdimensional vector space.
In this paper, we propose a method to construct a novel embedding that jointly maps words and entities into the same continuous vector space. In this model, similar words and entities are placed close to one another in a vector space. Hence, we can measure the similarity between any pair of items (i.e., words, entities, and a word and an entity) by simply computing their cosine similarity. This enables us to easily measure the contextual information for NED, such as the similarity between a context word and a candidate entity, and the relatedness of entities required to model coherence.
Our model is based on the skip-gram model (Mikolov et al., 2013a;Mikolov et al., 2013b), a recently proposed embedding model that learns to predict each context word given the target word. Our model consists of the following three models based on the skip-gram model: 1) the conventional skip-gram model that learns to predict neighboring words given the target word in text corpora, 2) the KB graph model that learns to estimate neighboring entities given the target entity in the link graph of the KB, and 3) the anchor context model that learns to predict neighboring words given the target entity using anchors and their context words in the KB. By jointly optimizing these models, our method simultaneously learns the embedding of words and entities.
Based on our proposed embedding, we also develop a straightforward NED method that computes two contexts using the proposed embedding: textual context similarity, and coherence. Textual context similarity is measured according to vector similarity between an entity and words in a document. Coherence is measured based on the relatedness between the target entity and other entities in a document. Our NED method combines these contexts with several standard features (e.g., prior probability) using supervised machine learning.
We tested the proposed method using two standard NED datasets: the CoNLL dataset and the TAC 2010 dataset. Experimental results revealed that our method outperforms state-of-the-art methods on both datasets by significant margins. Moreover, we conducted experiments to separately assess the quality of the vector representation of entities using an entity relatedness dataset, and discovered that our method successfully learns the quality representations of entities.

Joint Embedding of Words and Entities
In this section, we first describe the conventional skip-gram model for learning word embedding. We then explain our method to construct an embedding that jointly maps words and entities into the same continuous d-dimensional vector space. We extend the skip-gram model by adding the KB graph model and the anchor context model.

Skip-gram Model for Word Similarity
The training objective of the skip-gram model is to find word representations that are useful to predict context words given the target word. Formally, given a sequence of T words w 1 , w 2 , ..., w T , the model aims to maximize the following objective function: where c is the size of the context window, w t denotes the target word, and w t+j is its context word. The conditional probability P (w t+j |w t ) is computed using the following softmax function: where W is a set containing all words in the vocabulary, and V w ∈ R d and U w ∈ R d denote the vectors of word w in matrices V and U, respectively.
The skip-gram model is trained to optimize the above function L w , and V are used as the resulting vector representations of words.

Extending the Skip-gram Model
We extend the skip-gram model to learn the vector representations of entities. We expand matrices V and U to include the vectors of entities V e ∈ R d and U e ∈ R d in addition to the vectors for words.

KB Graph Model
We use an internal link structure in KB to enable the model to learn the relatedness between pairs of entities. Wikipedia Link-based Measure (WLM) (Milne and Witten, 2008a) is a method to measure entity relatedness based on its link structure. It has been used as a standard method to compute the relatedness of entities for modeling coherence in past NED studies. The relatedness between two entities is computed using the following function: where E is the set of all entities in KB and C e is the set of entities with a link to an entity e. Intuitively, WLM assumes that entities with similar incoming links are related. Despite its simplicity, WLM yields state-of-the-art performance (Hoffart et al., 2012). Inspired by WLM, the KB graph model simply learns to place entities with similar incoming links near one another in the vector space. We formalize this as the following objective function: We compute the conditional probability P (e o |e i ) using the following softmax function: We train the model to predict the incoming links C e given an entity e. Therefore, C e plays a similar role to context words in the skip-gram model.

Anchor Context Model
If we add only the KB graph model to the skipgram model, the vectors of words and entities do not interact, and can be placed in different subspaces of the vector space. To address this issue, we introduce the anchor context model to place similar words and entities near one another in the vector space.
The idea underlying this model is to leverage KB anchors and their context words to train the model. As mentioned in Section 1, we use Wikipedia as a KB. It contains many internal anchors that can be safely treated as unambiguous occurrences of referent KB entities. By using these anchors, we can easily obtain many occurrences of entities and their corresponding context words directly from the KB.
As in the skip-gram model, we simply train the model to predict the context words of an entity pointed to by the target anchor. The objective function is as follows: where A denotes a set of anchors in the KB, each of which contains a pair of a referent entity e i and a set of its context words Q. Here, Q contains the previous c words and the next c words. Note that |A| equals the number of internal anchors in the KB. As in past models, the conditional probability P (w o |e i ) is computed using the softmax function: Using the proposed model, we align the vector representations of words and entities by placing words and entities with similar context words close to one another in the vector space.

Training
Considering the three model components mentioned above, we propose the following objective function by linearly combining the above objective functions: The training of the model is intended to maximize the above function, and the resulting matrix V is used to embed words and entities.
One of the problems in training our model is that the normalizers contained in the softmax functions P (w t+j |w t ), P (e o |e i ), and P (w o |e i ) are computationally very expensive because they involve summation over all words W or entities E. To address this problem, we use negative sampling (NEG) (Mikolov et al., 2013b) to convert original objective functions into computationally feasible ones. NEG is defined by the following objective function: where σ(x) = 1/(1 + exp(−x)) and g is the number of negative samples. We replace the log P (w t+j |w t ) term in Eq. (1) with the above objective function. Consequently, the objective function is transformed from that in Eq. (1) to a simple objective function of the binary classification to distinguish the observed word w t from words drawn from noise distribution P neg (w). We also replace log P (e o |e i ) in Eq. (4) and log P (w o |e i ) in Eq. (6) in the same manner.
Note that NEG takes a negative distribution P neg (w) as a free parameter. Following (Mikolov et al., 2013b), we use the unigram distribution of words (U (w)) raised to the 3/4 th power (i.e., U (w) 3/4 /Z, where Z is a normalization constant) in the skip-gram model and the anchor context model. In the KB graph model, we use a uniform distribution over KB entities E as the negative distribution.
We use Wikipedia to train all the above models. Optimization is carried out simultaneously to maximize the transformed objective function by iterating over Wikipedia pages several times. We use stochastic gradient descent (SGD) for the optimization. The optimization is performed using a multiprocess-based implementation of our model using Python, Cython, and NumPy configured with OpenBLAS with storing matrices V and U in the shared memory. To improve speed, we decide not to introduce locks to the shared matrices.

Named Entity Disambiguation Using Embedding
In this section, we explain our NED method using our proposed embedding. Let us formally define the task. Given a set of entity mentions M = {m 1 , m 2 , ..., m N } in a document d with an entity set E = {e 1 , e 2 , ..., e K } in the KB, the task is defined as resolving mentions (e.g., "Washington") into their referent entities (e.g., Washington D.C.). We introduce two measures that have been frequently observed in past NED studies: entity prior P (e) and prior probability P (e|m). We define entity prior P (e) = |A e, * |/|A * , * | where A * , * denotes all anchors in the KB and A e, * is the set of anchors that point to entity e. Prior probability is defined as P (e|m) = |A e,m |/|A * ,m | where A * ,m represents all anchors with the same surface as mention m in KB and A e,m is a subset of A * ,m that points to entity e. We separate the NED task into two sub-tasks: candidate generation and mention disambiguation. In candidate generation, candidates of referent entities are generated for each mention. Details of candidate generation are provided in Section 4.3.1.

Mention Disambiguation
Given a document d and mention m with its candidate referent entities {e 1 , e 2 , ..., e k } generated in the candidate generation step, the task is to disambiguate mention m by selecting the most relevant entity from the candidate entities.
The key to improving the performance of this task is to effectively model the context. We propose two novel methods to model the context using the proposed embedding. Further, we combine these two models with several standard NED features using supervised machine learning described in 3.1.3.

Modeling Textual Context
Textual context is designed based on the assumption that an entity is more likely to appear if the context of a given mention is similar to that of the entity.
We propose a method to measure the similarity between textual context and entity using the proposed embedding by first deriving the vector representation of the context and then computing the similarity between the context and the entity using cosine similarity. To derive the vector of context, we average the vectors of context words: where W cm is a set of the context words of mention m and v w ∈ V denotes the vector representation of word w. We use all noun words in document d as context words. 1 Moreover, we ignore a context word if the surface of mention m contains it.
We then measure the similarity between candidate entity and the derived textual context by using cosine similarity between v cw and the vector of entity v e .

Modeling Coherence
It has been revealed that effectively modeling coherence in the assignment of entities to mentions is important for NED. However, this is a chickenand-egg problem because the assignment of entities to mentions, which is required to measure coherence, is not possible prior to performing NED.
Similar to past work (Ratinov et al., 2011), we address this problem by employing a simple twostep approach: we first train the machine learning model using the coherence score among unambiguous mentions 2 , in addition to other features, and then retrain the model using the coherence score among the predicted entity assignments instead.
To estimate coherence, we first calculate the vector representation of the context entities and measure the similarity between the vector of the context entities and that of the target entity e. Note that context entities are unambiguous entities in the first step, and predicted entities are used instead in the second step.
To derive the vector representation of context entities, we average their vector representations: where E cm denotes the set of context entities described above.
To estimate the coherence score, we again use cosine similarity between the vector of entity v e and that of context entities v ce .

Learning to Rank
To combine the proposed contextual information described above with standard NED features, we employ a method of supervised machine learning to rank the candidate entities given mention m and document d.
In particular, we use Gradient Boosted Regression Trees (GBRT) (Friedman, 2001), a stateof-the-art point-wise learning-to-rank algorithm widely used for various tasks, which has been recently adopted for the sort of tasks for which we employ it here (Meij et al., 2012). GBRT consists of an ensemble of regression trees, and predicts a relevance score given an instance. We use the GBRT implementation in scikit-learn 3 and the logistic loss is used as the loss function. The main parameters of GBRT are the number of iterations η, the learning rate β, and the maximum depth of the decision trees ξ.
With regard to the features of machine learning, we first use prior probability (P (e|m)) and entity prior (P (e)). Further, we include a feature representing the maximum prior probability of the candidate entity e of all mentions in the document. We also add the number of entity candidates for mention m as a feature. The above set of four features is called base features in the rest of the paper.
We also use several string similarity features used in past work on NED (Meij et al., 2012). These features aim to capture the similarity between the title of entity e and the surface of mention m, and consist of the edit distance, whether the title of entity e exactly equals or contains the surface of mention m, and whether the title of entity e starts or ends with the surface of mention m.
Finally, we include contextual features measured using the proposed embedding. We use cosine similarity between the candidate entity and the textual context (see Section 3.1.1), and similarity between an entity and contextual entities (see Section 3.1.2). Furthermore, we include the rank of entity e among candidate entities of mention m, sorted according to these two similarity scores in descending order.

Experiments
In this section, we describe the setup and results of our experiments. In addition to experiments on the NED task, we separately assessed the quality of pairwise entity relatedness in order to test the 3 http://scikit-learn.org/ effectiveness of our method in capturing pairwise similarity between pairs of entities. We first describe the details of the training of the embedding and then present the experimental results.

Training for the Proposed Embedding
To train the proposed embedding, we used the December 2014 version of the Wikipedia dump 4 . We first removed the pages for navigation, maintenance, and discussion, and used the remaining 4.9 million pages. We parsed the Wikipedia pages and extracted text and anchors from each page. We further tokenized the text using the Apache OpenNLP tokenizer. We also filtered out rare words that appeared fewer than five times in the corpus. We thus obtained approximately 2 billion tokens and 73 million anchors. The total number of words and entities in the embedding were approximately 2.1 million and 5 million, respectively. Consequently, the number of rows of matrices V and U were 7.1 million. The number of dimensions d of the embedding was set to 500. Following (Mikolov et al., 2013b), we also used learning rate α = 0.025 which linearly decreased with the iterations of the Wikipedia dump. Regarding the other parameters, we set the size of the context window c = 10 and the negative samples g = 30. The model was trained online by iterating over pages in the Wikipedia dump 10 times. The training lasted approximately five days using a server with a 40-core CPU on Amazon EC2.

Entity Relatedness
To test the quality of the vector representation of entities, we conducted an experiment using a dataset for entity relatedness created by Ceccarelli et al. (Ceccarelli et al., 2013). The dataset consists of training, test, and validation sets, and we only use the test set for this experiment. The test set contains 3,314 entities, where each entity has 91 candidate entities with gold-standard labels indicating whether the two entities are related.
Following , we obtained the ranked order of the candidate entities using cosine similarity between the target entity and each of the candidate entities, and computed the two standard measures: normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002) and mean average precision (MAP) (Manning et al., 2008). We adopted WLM as baseline. Table 1 shows the results. The score for WLM was obtained from Huang et al. . Our method clearly outperformed WLM. The results show that our method accurately captures pairwise entity relatedness.

Setup
We now explain our experimental setup for the NED task. We tested the performance of our proposed method on two standard NED datasets: the CoNLL dataset and the TAC 2010 dataset. The details of these datasets are provided below. Moreover, as with the corpus used in the embedding, we used the December 2014 version of the Wikipedia dump as the referent KB, and to derive the prior probability as well as the entity prior.
To find the best parameters for our machine learning model, we ran a parameter search on the CoNLL development set. We used η = 10, 000 trees, and tested all combinations of the learning rate β = {0.01, 0.02, 0.03, 0.05} and the maximum depth of the decision trees ξ = {3, 4, 5}. We computed their accuracy on the dataset, and found that the parameters did not significantly affect performance (1.0% at most). We used β = 0.02 and ξ = 4 which yielded the best performance.
CoNLL The CoNLL dataset is a popular NED dataset constructed by Hoffart et al. (Hoffart et al., 2011). The dataset is based on NER data from the CoNLL 2003 shared task, and consists of training, development, and test sets, containing 946, 216, and 231 documents, respectively. We trained our machine learning model using the training set and reported its performance using the test set. We also used the development set for the parameter tuning described above. Following (Hoffart et al., 2011), we only used 27,816 mentions with valid entries in the KB and reported the standard micro-(aggregates over all mentions) and macro-(aggregates over all documents) accuracies of the top-ranked candidate entities to assess disambiguation performance. For candidate generation, we use the fol-lowing two resources: 1) a public dataset recently built by Pershina et al. (Pershina et al., 2015) (denoted by PPRforNED) for the sake of compatibility with their state-of-the-art results, and 2) a dictionary built using a standard YAGO means relation dataset (Hoffart et al., 2011) (denoted by YAGO). Moreover, we used PPRforNED for the parameter tuning of the machine learning model and for error analysis.

TAC 2010
The TAC 2010 dataset is another popular NED dataset constructed for the Text Analysis Conference (TAC) 5 (Ji et al., 2010). The dataset is based on news articles from various agencies and Web log data, and consists of a training and a test set containing 1,043 and 1,013 documents, respectively. Following past work (He et al., 2013;Chisholm and Hachey, 2015), we used mentions only with a valid entry in the KB, and reported the micro-accuracy score of the top-ranked candidate entities. We trained our model using the training set and assessed its performance using the test set. Consequently, we evaluated our model on 1,020 mentions contained in the test set. For candidate generation, we used a dictionary that was directly built from the Wikipedia dump mentioned previously. Similar to past work, we retrieved possible mention surfaces of an entity from (1) the title of the entity, (2) the title of another entity redirecting to the entity, and (3) the names of anchors that point to the entity. We retained the top 50 candidates through their entity priors for computational efficiency.

Comparison with State-of-the-art Methods
We compared our method with the following recently proposed state-of-the-art methods: • Hoffart et al. (Hoffart et al., 2011) is a graphbased approach that finds a dense subgraph of entities in a document to address NED.
• He et al. (He et al., 2013) uses deep neural networks to derive the representations of entities and mention contexts and applies them to NED.
• Chisholm and Hachey (Chisholm and Hachey, 2015) uses a Wikilinks dataset (Singh et al., 2012) to improve the performance of NED.   Table 3: Accuracy scores of the proposed method and the state-of-the-art methods.
• Pershina et al. (Pershina et al., 2015) improved NED by modeling coherence using the personalized page rank algorithm, and achieved the best-known accuracy on the CoNLL dataset. Table 2 shows the experimental results of our proposed method. Our method successfully achieved enhanced performance on both the CoNLL and the TAC 2010 datasets. Moreover, we found that the choice of candidate generation method considerably affected performance on the CoNLL dataset. Further, Table 3 shows the experimental results of our proposed method as well as those of stateof-the-art methods. Our method outperformed all the state-of-the-art methods on both datasets.

Feature Study
We conducted a feature study on our method. We began with base features, added various features to our system incrementally, and reported their impact on performance. We then introduced our twostep approach to achieve the final results. Table 4 shows the results. Surprisingly, we attained results comparable with those of some state-of-the-art methods on the both datasets by only using base features. Adding string similarity features slightly further improved performance.
We observed significant improvement when adding textual context features based on our proposed embedding. Our method outperformed  some state-of-the-art methods without using coherence. Further, coherence based on unambiguous entity mentions and our two-step approach significantly improved performance on the CoNLL dataset. However, it did not contribute to performance on the TAC 2010 dataset. This was because of the significant difference in the density of entity mentions between the datasets. The CoNLL dataset contains approximately 20 entity mentions per document, but the TAC 2010 only contains approximately one mention per document which is unarguably insufficient to model coherence.

Error Analysis
We also conducted an error analysis on the CoNLL test set with candidate generation using PPRforNED dataset. We observed that approximately 48.6% errors were caused by metonymy mentions (Ling et al., 2015) (i.e., mentions with more than one plausible annotation). In particular, our NED method often erred when an incorrect entity was highly popular and exactly matched the mention surface (e.g., "South Africa" referring to the entity South Africa national rugby union team rather than the entity South Africa). This makes sense because our machine learning model uses the popularity statistics of the KB (i.e., prior probability and entity prior), and the string similarity between the title of the entity and the mention surface. This problem is discussed further in (Ling et al., 2015).
Furthermore, because our method depends on the presence of KB anchors in order to learn entity representation, it arguably fails to learn satisfactory representations of tail entities (i.e., entities rarely referred to by anchors), thus resulting in disambiguation errors. We discovered that nearly 9.6% errors were due to referent entities with less than 10 inbound KB anchors, and 4.5% involved entities with no inbound KB anchor. These errors might be addressed using KB data other than KB anchors, such as the description of the entities and the KB categories in order to avoid dependence on the KB anchors. This remains part of our future work.

Related Work
Early NED methods addressed the problem as a well-studied word sense disambiguation problem (Mihalcea and Csomai, 2007). These methods primarily focused on modeling the similarity of textual (local) context. Most recent stateof-the-art methods focus on modeling coherence among disambiguated entities in the same document (Cucerzan, 2007;Milne and Witten, 2008b;Hoffart et al., 2011;Ratinov et al., 2011). These approaches have also been called collective or global approaches in the literature.
Learning the representations of entities for NED has been addressed in past literature. Guo and Barbosa (Guo and Barbosa, 2014) used random walks on KB graphs to construct vector representations of entities and documents to address NED. Blanco et al. (Blanco et al., 2015) proposed a method to map entities into the word embedding (i.e., Word2vec (Mikolov et al., 2013b)) space using entity descriptions in the KB and applied it for NED. He et al. (He et al., 2013) used deep neural networks to compute representations of entities and contexts of mentions directly from the KB. Similarly, Sun et al.  proposed a method based on deep neural networks to model representations of mentions, contexts of mentions, and entities. Huang et al.  also leveraged deep neural networks to learn entity representations such that the consequent pairwise entity relatedness was more suitable than of a standard method (i.e., WLM) for NED. Further, Hu et al. (Hu et al., 2015) used hierarchical information in the KB to build entity embedding and applied it to model coherence. Unlike these methods, our proposed approach involves jointly learning vector representations of entities as well as words, hence enabling the accurate computation of the semantic similarity among its items to model both the textual context and coherence.
Moreover, Yaghoobzadeh and Schutze (Yaghoobzadeh and Schütze, 2015) addressed an entity typing task by building an embedding of words and entities on a corpus with annotated entities (i.e., FACC1 (Gabrilovich et al., 2013)) using the skip-gram model. Compared to our method, in addition to the significant difference between their task and NED, their embedding does not incorporate the link graph data of KB, which is known to be highly important for NED.
Furthermore, in the context of knowledge graph embedding, another tenor of recent works has been published (Bordes et al., 2011;Socher et al., 2013;. These methods focus on learning vector representations of entities to primarily address the link prediction task that aims to predict a new fact based on existing facts in KB. Particularly, Wang et al. (Wang et al., 2014) have recently revealed that the joint modeling of the embedding of words and entities can improve performance in several tasks including the link prediction task, which is somewhat analogous to our experimental results.

Conclusions
In this paper, we proposed an embedding method to jointly map words and entities into the same continuous vector space. Our method enables us to effectively model both textual and global contexts. Further, armed with these context models, our NED method outperforms state-of-the-art NED methods.
In future work, we intend to improve our model by leveraging relevant knowledge, such as relations in a knowledge graph (e.g., Freebase).