Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks

A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts. We present a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity. These convolutional networks operate at multiple granularities to exploit various kinds of topic information, and their rich parameterization gives them the capacity to learn which n-grams characterize different topics. We combine these networks with a sparse linear model to achieve state-of-the-art performance on multiple entity linking datasets, outperforming the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014).


Introduction
One of the major challenges of entity linking is resolving contextually polysemous mentions. For example, Germany may refer to a nation, to that nation's government, or even to a soccer team. Past approaches to such cases have often focused on collective entity linking: nearby mentions in a document might be expected to link to topically-similar entities, which can give us clues about the identity of the mention currently being resolved (Ratinov et al., 2011; Hoffart et al., 2011; He et al., 2013; Cheng and Roth, 2013; Durrett and Klein, 2014). But an even simpler approach is to use context information from just the words in the source document itself to make sure the entity is being resolved sensibly in context. In past work, these approaches have typically relied on heuristics such as tf-idf (Ratinov et al., 2011), but such heuristics are hard to calibrate and they capture structure in a coarser way than learning-based methods. Source code is available at github.com/matthewfl/nlp-entity-convnet.
In this work, we model semantic similarity between a mention's source document context and its potential entity targets using convolutional neural networks (CNNs). CNNs have been shown to be effective for sentence classification tasks (Kalchbrenner et al., 2014; Kim, 2014; Iyyer et al., 2015) and for capturing similarity in models for entity linking (Sun et al., 2015) and other related tasks (Dong et al., 2015; Shen et al., 2014), so we expect them to be effective at isolating the relevant topic semantics for entity linking. We show that convolutions over multiple granularities of the input document are useful for providing different notions of semantic context. Finally, we show how to integrate these networks with a preexisting entity linking system (Durrett and Klein, 2014). Through a combination of these two distinct methods into a single system that leverages their complementary strengths, we achieve state-of-the-art performance across several datasets.

Model
Our model focuses on two core ideas: first, that topic semantics at different granularities in a document are helpful in determining the genres of entities for entity linking, and second, that CNNs can distill a block of text into a meaningful topic vector.
Our entity linking model is a log-linear model that places distributions over target entities t given a mention x and its containing source document. For now, we take P(t|x) ∝ exp(w · f_C(x, t; θ)), where f_C produces a vector of features based on CNNs with parameters θ, as discussed in Section 2.1. Section 2.2 describes how we combine this simple model with a full-fledged entity linking system. At the heart of f_C, shown in the middle of Figure 1, is a cosine similarity between a topic vector associated with the source document and a topic vector associated with the target entity. These vectors are computed by distinct CNNs operating over different subsets of relevant text.
Figure 1 shows an example of why different kinds of context are important for entity linking. In this case, we are considering whether Pink Floyd might link to the article Gavin Floyd on Wikipedia (imagine that Pink Floyd might be a person's nickname). If we look at the source document, we see that the immediate source document context around the mention Pink Floyd is referring to rock groups (Led Zeppelin, Van Halen) and the target entity's Wikipedia page is primarily about sports (baseball starting pitcher). Distilling these texts into succinct topic descriptors and then comparing those helps tell us that this is an improbable entity link pair. In this case, the broader source document context actually does not help very much, since it contains other generic last names like Campbell and Savage that might not necessarily indicate the document to be in the music genre. However, in general, the whole document might provide a more robust topic estimate than a small context window does.

Convolutional Semantic Similarity
Figure 1 shows our method for computing topic vectors and using those to extract features for a potential Wikipedia link. For each of three text granularities in the source document (the mention, that mention's immediate context, and the entire document) and two text granularities on the target entity side (title and Wikipedia article text), we produce vector representations with CNNs as follows. We first embed each word into a d-dimensional vector space using standard embedding techniques (discussed in Section 3.2), yielding a sequence of vectors w_1, ..., w_n. We then map those words into a fixed-size vector using a convolutional network parameterized with a filter bank M_g ∈ R^{k×dℓ}. We put the result through a rectified linear unit (ReLU) and combine the results with sum pooling, giving the following formulation:

conv_g(w_{1:n}) = Σ_{j=1}^{n−ℓ} max(0, M_g w_{j:j+ℓ})    (1)

where w_{j:j+ℓ} is a concatenation of the given word vectors and the max is element-wise. Each convolution granularity (mention, context, etc.) has a distinct set of filter parameters M_g.
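The sum-pooled ReLU convolution described above can be sketched in a few lines (a minimal NumPy sketch; variable names are illustrative, not taken from the authors' code):

```python
import numpy as np

def conv_topic_vector(word_vecs, M, ell):
    """Sum-pooled ReLU convolution over n-grams of width ell.

    word_vecs: (n, d) array of word embeddings.
    M:         (k, d * ell) filter bank (one filter per row).
    Returns a k-dimensional topic vector."""
    n, d = word_vecs.shape
    out = np.zeros(M.shape[0])
    for j in range(n - ell):
        window = word_vecs[j:j + ell].reshape(d * ell)  # concatenate ell word vectors
        out += np.maximum(0.0, M @ window)              # element-wise ReLU, then sum-pool
    return out
```

Because the pooled activations are sums of non-negative terms, the resulting topic vector is itself non-negative, regardless of input length.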
This process produces multiple representative topic vectors s_ment, s_context, and s_doc for the source document and t_title and t_doc for the target entity, as shown in Figure 1. All pairs of these vectors between the source and the target are then compared using cosine similarity, as shown in the middle of Figure 1. This yields the vector of features f_C(s, t_e) which indicates the different types of similarity; this vector can then be combined with other sparse features and fed into a final logistic regression layer (maintaining end-to-end inference and learning of the filters). When trained with backpropagation, the convolutional networks should learn to map text into vector spaces that are informative about whether the document and entity are related or not.
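The all-pairs cosine comparison that produces f_C can be sketched as follows (illustrative NumPy; the 3 × 2 pairing follows Figure 1):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def conv_features(source_vecs, target_vecs):
    """f_C: cosine similarity for every (source, target) topic-vector pair.

    With sources [s_ment, s_context, s_doc] and targets [t_title, t_doc],
    this yields the six similarity features fed to the final layer."""
    return np.array([cosine(s, t) for s in source_vecs for t in target_vecs])
```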

Integrating with a Sparse Model
The dense model presented in Section 2.1 is effective at capturing semantic topic similarity, but it is most effective when combined with other signals for entity linking. An important cue for resolving a mention is the use of link counts from hyperlinks in Wikipedia (Cucerzan, 2007; Milne and Witten, 2008; Ji and Grishman, 2011), which tell us how often a given mention was linked to each article on Wikipedia. This information can serve as a useful prior, but only if we can leverage it effectively by targeting the most salient part of a mention. For example, we may have never observed President Barack Obama as a linked string on Wikipedia, even though we have seen the substring Barack Obama and it unambiguously indicates the correct answer.
Following Durrett and Klein (2014), we introduce a latent variable q to capture which subset of a mention (known as a query) we resolve. Query generation includes potentially removing stop words, plural suffixes, punctuation, and leading or trailing words. This process generates on average 9 queries for each mention. Conveniently, this set of queries also defines the set of candidate entities that we consider linking a mention to: each query generates a set of potential entities based on link counts, whose union is then taken to give the set of possible entity targets for each mention (including the null link). In the example shown in Figure 1, the query phrases are Pink Floyd and Floyd, which generate Pink Floyd and Gavin Floyd as potential link targets (among other options that might be derived from the Floyd query).
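A toy sketch of this query-generation step (the heuristics below are illustrative stand-ins, not the exact rules of Durrett and Klein (2014)):

```python
def generate_queries(mention):
    """Generate candidate query substrings of a mention by dropping
    leading/trailing words and crudely stripping plural suffixes.
    Real systems also handle stop words, punctuation, and POS cues."""
    words = mention.strip(".,!?").split()
    queries = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            q = " ".join(words[start:end])
            queries.add(q)
            if q.endswith("s"):   # crude plural-suffix removal
                queries.add(q[:-1])
    return queries
```

For the mention Pink Floyd this yields the queries Pink Floyd, Pink, and Floyd; each query's link counts then contribute candidate entities such as Pink Floyd and Gavin Floyd.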
Our final model has the form P(t|x) = Σ_q P(t, q|x). We parameterize P(t, q|x) in a log-linear way with three separate components: f_Q and f_E are both sparse feature vectors and are taken from previous work (Durrett and Klein, 2014); f_C is as discussed in Section 2.1. Note that f_C has its own internal parameters θ because it relies on CNNs with learned filters; however, we can compute gradients for these parameters with standard backpropagation. The whole model is trained to maximize the log likelihood of a labeled training corpus using Adadelta (Zeiler, 2012).
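The marginalization over latent queries can be sketched as a softmax over (t, q) pairs whose probability mass is summed per target (illustrative NumPy; feature extraction is abstracted away):

```python
import numpy as np

def p_target(pair_feats, w):
    """P(t|x) = sum_q P(t,q|x), with P(t,q|x) proportional to exp(w · f(t,q,x)).

    pair_feats: dict mapping (target, query) pairs to feature vectors."""
    scores = {tq: np.exp(w @ f) for tq, f in pair_feats.items()}
    Z = sum(scores.values())          # partition function over all (t, q) pairs
    probs = {}
    for (t, q), s in scores.items():  # marginalize out the latent query q
        probs[t] = probs.get(t, 0.0) + s / Z
    return probs
```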
The indicator features f_Q and f_E are described in more detail in Durrett and Klein (2014). f_Q only impacts which query is selected and not the disambiguation to a title. It is designed to roughly capture the basic shape of a query to measure its desirability, indicating whether suffixes were removed and whether the query captures the capitalized subsequence of a mention, as well as standard lexical, POS, and named entity type features. f_E mostly captures how likely the selected query is to correspond to a given entity based on factors like anchor text counts from Wikipedia, string match with proposed Wikipedia titles, and discretized cosine similarities of tf-idf vectors (Ratinov et al., 2011). Adding the tf-idf indicators is the only modification we made to the features of Durrett and Klein (2014).

Experimental Results
We performed experiments on four entity linking datasets:
• CoNLL-YAGO (Hoffart et al., 2011): This corpus is based on the CoNLL 2003 dataset; the test set consists of 231 news articles and contains a number of rarer entities.
• WP (Heath and Bizer, 2011): This dataset consists of short snippets from Wikipedia.
• ACE: The fourth dataset, which we also use for our analysis of learned convolutional filters (Table 1).
• Wikipedia (Ratinov et al., 2011): This dataset consists of text drawn from Wikipedia articles.

Our results outperform those of Durrett and Klein (2014) and Nguyen et al. (2014). In general, we also see that the convolutional networks by themselves can outperform the system using only sparse features, and in all cases the two stack to give a substantial benefit.
We use standard train-test splits for all datasets except for WP, where no standard split is available.
In this case, we randomly sample a test set. For all experiments, we use word vectors computed by running word2vec (Mikolov et al., 2013) on all of Wikipedia, as described in Section 3.2.

Table 2 shows results for two baselines and three variants of our system. Our main contribution is the combination of indicator features and CNN features (Full). We see that this system outperforms the results of Durrett and Klein (2014) and the AIDA-LIGHT system of Nguyen et al. (2014). We can also compare to two ablations: using just the sparse features (a system which is a direct extension of Durrett and Klein (2014)) or using just the CNN-derived features. Our CNN features generally outperform the sparse features and improve even further when stacked with them. This reflects that they capture orthogonal sources of information: for example, the sparse features can capture how frequently the target document was linked to, whereas the CNNs can capture document context in a more nuanced way. These CNN features also clearly supersede the sparse features based on tf-idf (taken from Ratinov et al. (2011)), showing that CNNs are indeed better at learning semantic topic similarity than heuristics like tf-idf.

Table 3: Comparison of using only topic information derived from the document and target article, only information derived from the mention itself and the target entity title, and the full set of information (six features, as shown in Figure 1). Neither the finest nor coarsest convolutional context can give the performance of the complete set. Numbers are reported on a development set.
In the sparse feature system, the highest weighted features are typically those indicating the frequency that a page was linked to and those indicating specific lexical items in the choice of the latent query variable q.This suggests that the system of Durrett and Klein (2014) has the power to pick the right span of a mention to resolve, but then is left to generally pick the most common link target in Wikipedia, which is not always correct.By contrast, the full system has a greater ability to pick less common link targets if the topic indicators distilled from the CNNs indicate that it should do so.

Multiple Granularities of Convolution
One question we might ask is how much we gain by having multiple convolutions on the source and target side. Table 3 compares our full suite of CNN features, i.e., the six features specified in Figure 1, with two specific convolutional features in isolation. Using convolutions over just the source document (s_doc) and target article text (t_doc) gives a system that performs, in aggregate, comparably to using convolutions over just the mention (s_ment) and the entity title (t_title). These represent two extremes of the system: consuming the maximum amount of context, which might give the most robust representation of topic semantics, and consuming the minimum amount of context, which gives the most focused representation of topic semantics (and which more generally might allow the system to directly memorize train-test pairs observed in training). However, neither performs as well as the combination of all CNN features, showing that the different granularities capture complementary information.

Embedding Vectors
We also explored two different sources of embedding vectors for the convolutions. Table 4 shows that word vectors trained on Wikipedia outperformed Google News word vectors trained on a larger corpus. Further investigation revealed that the Google News vectors had much higher out-of-vocabulary rates. For learning the vectors, we use the standard word2vec toolkit (Mikolov et al., 2013) with vector length set to 300, window set to 21 (larger windows produce more semantically-focused vectors (Levy and Goldberg, 2014)), 10 negative samples, and 10 iterations through Wikipedia. We do not fine-tune word vectors during training of our model, as that was not found to improve performance.

Analysis of Learned Convolutions
One downside of our system compared to its purely indicator-based variant is that its operation is less interpretable. However, one way we can inspect the learned system is by examining what causes high activations of the various convolutional filters (rows of the matrices M_g from Equation 1). Table 1 shows the n-grams in the ACE dataset leading to maximal activations of three of the filters from M_doc. Some filters tend to learn to pick up on n-grams characteristic of a particular topic. In other cases, a single filter might be somewhat inscrutable, as with the third column of Table 1. There are a few possible explanations for this. First, the filter may generally have low activations and therefore have little impact in the final feature computation. Second, the extreme points of the filter may not be characteristic of its overall behavior, since the bulk of n-grams will lead to more moderate activations. Finally, such a filter may represent the superposition of a few topics that we are unlikely to ever need to disambiguate between; in a particular context, this filter will then play a clear role, but one which is hard to determine from the overall shape of the parameters.
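The inspection described here — ranking n-grams by how strongly they activate a given filter — can be sketched as follows (illustrative NumPy; names are not from the authors' code):

```python
import numpy as np

def top_ngrams_for_filter(ngram_vecs, M, filter_idx, topn=5):
    """Rank n-grams by the ReLU activation of one filter (a row of M_g).

    ngram_vecs: dict mapping an n-gram string to its concatenated
    word vectors (a length d*ell array)."""
    row = M[filter_idx]
    acts = {ng: max(0.0, float(row @ v)) for ng, v in ngram_vecs.items()}
    return sorted(acts, key=acts.get, reverse=True)[:topn]
```

Running this over every n-gram in a corpus for each row of M_doc produces tables of maximally-activating n-grams like Table 1.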

Conclusion
In this work, we investigated using convolutional networks to capture semantic similarity between source documents and potential entity link targets.
Using multiple granularities of convolutions to evaluate the compatibility of a mention in context and several potential link targets gives strong performance on its own; moreover, such features also improve a pre-existing entity linking system based on sparse indicator features, showing that these sources of information are complementary.

Figure 1 :
Figure 1: Extraction of convolutional vector space features f_C(x, t_e). Three types of information from the input document and two types of information from the proposed title are fed through convolutional networks to produce vectors, which are systematically compared with cosine similarity to derive real-valued semantic similarity features.

Table 2 :
Performance of the system in this work (Full) compared to two baselines from prior work and two ablations.

Table 1 :
Five-grams representing the maximal activations from different filters in the convolution over the source document (M_doc, producing s_doc in Figure 1). Some filters tend towards singular topics as shown in the first and second columns, which focus on weapons and trials, respectively. Others may have a mix of seemingly unrelated topics, as shown in the third column, which does not have a coherent theme. However, such filters might represent a superposition of filters for various topics which never co-occur and thus never need to be disambiguated between.

Table 4 :
Results of the full model (sparse and convolutional features) comparing word vectors derived from Google News vs. Wikipedia on development sets for each corpus.