Deep Joint Entity Disambiguation with Local Neural Attention

We propose a novel deep learning model for joint document-level entity disambiguation, which leverages learned neural representations. Key components are entity embeddings, a neural attention mechanism over local context windows, and a differentiable joint inference stage for disambiguation. Our approach thereby combines the benefits of deep learning with those of more traditional approaches such as graphical models and probabilistic mention-entity maps. Extensive experiments show that we are able to obtain competitive or state-of-the-art accuracy at moderate computational costs.


Introduction
Entity disambiguation (ED) is an important stage in text understanding which automatically resolves references to entities in a given knowledge base (KB). This task is challenging due to the inherent ambiguity between surface form mentions such as names and the entities they refer to. This many-to-many ambiguity can often be captured partially by name-entity co-occurrence counts extracted from entity-linked corpora.
ED research has largely focused on two types of contextual information for disambiguation: local information based on words that occur in a context window around an entity mention, and global information, exploiting document-level coherence of the referenced entities. Many state-of-the-art methods aim to combine the benefits of both, which is also the philosophy we follow in this paper. What is specific to our approach is that we use embeddings of entities as a common representation to assess local as well as global evidence.
In recent years, many text and language understanding tasks have been advanced by neural network architectures. However, despite recent work, competitive ED systems still largely employ manually designed features. Such features often rely on domain knowledge and may fail to capture all relevant statistical dependencies and interactions. The explicit goal of our work is to use deep learning in order to learn basic features and their combinations from scratch. To the best of our knowledge, our approach is the first to carry out this program with full rigor.

Contributions and Related Work
There is a vast body of prior research on entity disambiguation, highlighted by (Ji, 2016). We will focus here on a discussion of our main contributions in relation to prior work.
Entity Embeddings. We have developed a simple, yet effective method to embed entities and words in a common vector space. This follows the popular line of work on word embeddings, e.g. (Mikolov et al., 2013; Pennington et al., 2014), which was recently extended to entities and ED by (Yamada et al., 2016; Fang et al., 2016; Zwicklbauer et al., 2016; Huang et al., 2015). In contrast to the above methods that require data about entity-entity co-occurrences, we avoid the use of such additional data and rather bootstrap entity embeddings from their canonical entity pages and the local context of their hyperlink annotations. This allows for more efficient training and alleviates the need to compile co-linking statistics. Also, we avoid hand-engineered features, multiple disambiguation steps, or the need for additional ad hoc heuristics.
Context Attention. We present a novel attention mechanism for local ED. Inspired by memory networks (Sukhbaatar et al., 2015) and insights of (Lazic et al., 2015), our model deploys attention to select words that are informative for the disambiguation decision. A learned combination of the resulting context-based entity scores and a mention-entity prior yields the final local scores. Our local model achieves better accuracy than the local probabilistic model of (Ganea et al., 2016), as well as the feature-engineered local model of (Globerson et al., 2016). As an added benefit, our model also has a smaller memory footprint. There have been other deep learning approaches to define better context models for ED. For instance, (Francis-Landau et al., 2016; He et al., 2013) use convolutional neural networks (CNNs) and stacked denoising auto-encoders, respectively, to learn representations of textual documents and canonical entity pages. Entities for each mention are locally scored based on cosine similarity with the respective document embedding. In a similar local setting, (Sun et al., 2015) embed mentions, their immediate contexts and their candidate entities using word embeddings and CNNs. However, their entity representations are built from entity titles and entity categories only.
Collective Disambiguation. Third, we propose a novel deep learning architecture for global ED. Mentions in a document are resolved jointly, using a conditional random field (Lafferty et al., 2001) with parametrized potentials. We suggest to learn the latter by casting loopy belief propagation (LBP) (Murphy et al., 1999) as a rolled-out deep network. This is inspired by similar approaches in computer vision, e.g. (Domke, 2013), and allows us to backpropagate through the (truncated) message passing, thereby optimizing the CRF potentials to work well in conjunction with the inference scheme. Our model is thus trained end-to-end with the exception of the pre-trained word and entity embeddings. Previous work has investigated different approximation techniques, including: random graph walks (Guo and Barbosa, 2016), personalized PageRank (Pershina et al., 2015), inter-mention voting (Ferragina and Scaiella, 2010), graph pruning (Hoffart et al., 2011), integer linear programming (Cheng and Roth, 2013), and ranking SVMs (Ratinov et al., 2011). Most closely related to our approach are (Ganea et al., 2016), where LBP is used for inference (but not learning) in a probabilistic graphical model, and (Globerson et al., 2016), where a single round of message passing with attention is performed. To our knowledge, we are one of the first to investigate differentiable message passing for NLP problems.

Learning Entity Embeddings
In a first step, we propose to train entity vectors that can be used for the ED task (and potentially for other tasks). These embeddings compress the semantic meaning of entities and drastically reduce the need for manually designed features or co-occurrence statistics.
Entity embeddings are bootstrapped from word embeddings and are trained independently for each entity. A few arguments motivate this decision: (i) there is no need for entity co-occurrence statistics that suffer from sparsity issues and/or large memory footprints; (ii) vectors of entities in a subset domain of interest can be trained separately, obtaining potentially significant speed-ups and memory savings that would otherwise be prohibitive for large entity KBs; (iii) entities can be easily added in an incremental manner, which is important in practice; (iv) the approach extends well into the tail of rare entities with few linked occurrences; (v) empirically, we achieve better quality compared to methods that use entity co-occurrence statistics.
Our model embeds words and entities in the same low-dimensional vector space in order to exploit geometric similarity between them. We start with a pre-trained word embedding map x : W → R^d that is known to encode the semantic meaning of words w ∈ W; specifically, we use word2vec pre-trained vectors (Mikolov et al., 2013). We extend this map to entities E, i.e. x : E → R^d, as described below.
Suppose that words that co-occur with the mention of an entity e are governed by a conditional distribution p(w|e). Empirically, we assume that words with high probability p(w|e) appear (i) on canonical KB description pages of the entity, and (ii) within windows of fixed size surrounding mentions of the entity. This distribution is then approximated from word-entity co-occurrence counts collected from these two sources: p(w|e) ∝ #(w, e). Next, let q(w) be a modified unigram word distribution used for sampling "negative" words, i.e. ones unrelated to a specific entity. As in (Mikolov et al., 2013), we choose q(w) ∝ p(w)^α for α ∈ (0, 1). We suggest to use a max-margin objective to infer the optimal entity embeddings. Let w+ ∼ p(w|e) and w− ∼ q(w); the embedding of e is then obtained as

$$\mathbf{x}_e \leftarrow \arg\min_{\mathbf{z}:\,\|\mathbf{z}\|=1} \; \mathbb{E}_{w^+ \sim p(w|e)} \, \mathbb{E}_{w^- \sim q(w)} \Big[ \gamma - \big\langle \mathbf{z},\, \mathbf{x}_{w^+} - \mathbf{x}_{w^-} \big\rangle \Big]_+ \quad (1)$$

where γ > 0 is a margin parameter and [·]_+ is the soft-plus function. The above loss is optimized using stochastic gradient descent with projection over sampled pairs (w+, w−).
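For concreteness, the following is a minimal sketch of one projected-SGD step on this objective (PyTorch; function and variable names are ours, and the word id batches are assumed to have been sampled from p(w|e) and q(w) beforehand):

```python
import torch
import torch.nn.functional as F

def embedding_step(z, pos_ids, neg_ids, word_vecs, gamma=0.1, lr=0.3):
    """One projected-SGD step on the max-margin objective for a single entity e.

    z         : (d,) current entity vector, unit norm
    pos_ids   : (n,) word ids sampled from p(w|e)
    neg_ids   : (n,) word ids sampled from q(w)
    word_vecs : (|W|, d) fixed pre-trained word embeddings
    """
    z = z.detach().clone().requires_grad_(True)
    x_pos, x_neg = word_vecs[pos_ids], word_vecs[neg_ids]   # (n, d) each
    # margin violation gamma - <z, x_{w+} - x_{w-}>, passed through the soft-plus [.]_+
    loss = F.softplus(gamma - (x_pos - x_neg) @ z).mean()
    loss.backward()
    with torch.no_grad():
        z -= lr * z.grad
        z /= z.norm()                                       # project back onto the unit sphere
    return z.detach()
```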
We empirically assess the quality of our entity embeddings on entity similarity and ED tasks as detailed in Section 7.2, Table 1 and Appendix A. The technique described in this section can also be applied, in principle, to computing embeddings of general text documents, but a comparison with such methods is left as future work.

Local Model with Neural Attention
We now explain our local ED approach that uses embeddings to steer a neural attention mechanism. We build on the insight that only a few context words are informative for resolving an ambiguous mention, something that has been exploited before in (Lazic et al., 2015). Focusing only on those words helps reduce noise and improves disambiguation. (Yamada et al., 2016) observe the same problem and adopt the restrictive strategy of removing all non-nouns. Here, we assume that a context word may be relevant if it is related to at least one of the entity candidates of a given mention.
Context Scores. Let us assume that we have a mention-entity prior p(e|m) available and that for each mention m, a pruned candidate set of entities Γ(m), |Γ(m)| ≤ S, has been identified. Our model, depicted in Figure 1, computes a score for each e ∈ Γ(m) based on the local context c = w_1, . . ., w_K surrounding m as well as the prior. It is a composition of differentiable functions, thus it is smooth from input to output, allowing us to easily compute gradients and backpropagate through it.
Each word w ∈ c and entity e ∈ Γ(m) is mapped to its embedding via the pre-trained map x (cf. Section 3). We then compute an unnormalized support score for each word in the context as follows:

$$u(w) = \max_{e \in \Gamma(m)} \mathbf{x}_e^{\top} \mathbf{A} \, \mathbf{x}_w \quad (2)$$

where A is a parameterized diagonal matrix. The score is high if a word is strongly related to at least one candidate entity. We then apply a softmax function, (hard) pruned to the R ≤ K words which receive the highest scores. Denote the reduced context by c̄. Then we get explicitly

$$\beta(w) = \begin{cases} \dfrac{\exp[u(w)]}{\sum_{v \in \bar{c}} \exp[u(v)]} & \text{if } w \in \bar{c} \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

We then define a β-weighted context-based entity-mention score via

$$\Psi(e, c) = \sum_{w \in \bar{c}} \beta(w) \, \mathbf{x}_e^{\top} \mathbf{B} \, \mathbf{x}_w \quad (4)$$

where B is another trainable diagonal matrix. We will later use the same architecture for unary scores of our global ED model.
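The following sketch illustrates Eqs. (2)-(4) for a single mention, assuming the candidate entity embeddings and the context word embeddings are given as dense tensors (shapes and names are ours):

```python
import torch

def local_context_score(E, W, A_diag, B_diag, R):
    """Context score Psi(e, c) from Eqs. (2)-(4), a sketch with assumed shapes.

    E      : (S, d) candidate entity embeddings for one mention
    W      : (K, d) context word embeddings
    A_diag : (d,) diagonal of A;  B_diag : (d,) diagonal of B
    R      : number of context words kept by hard attention
    """
    # Eq. (2): u(w) = max_e  x_e^T A x_w   (diagonal A applied elementwise)
    u = ((E * A_diag) @ W.t()).max(dim=0).values     # (K,) best supporting entity per word
    # Hard pruning to the R highest-scoring words, then softmax (Eq. (3))
    top_vals, top_idx = u.topk(R)
    beta = torch.zeros_like(u)
    beta[top_idx] = torch.softmax(top_vals, dim=0)
    # Eq. (4): Psi(e, c) = sum_w beta(w) x_e^T B x_w
    return (E * B_diag) @ (W.t() @ beta)             # (S,) one score per candidate entity
```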
Local Score Combination. We integrate these context scores with the context-independent scores encoded in p(e|m). Our final (unnormalized) local model is a combination of both Ψ(e, c) and log p(e|m):

$$\Psi(e, m, c) = f\big(\Psi(e, c), \log p(e|m)\big) \quad (5)$$

Here, we found a flexible choice for f to be important and superior to any naïve combination model. We therefore used a neural network with two fully connected layers of 100 hidden units and ReLU non-linearities, which we regularized as suggested in (Denton et al., 2015) by constraining the sum of squares of all weights in the linear layer. We use standard projected SGD for training. The same network is also used in Section 5.
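A possible instantiation of f, following the layer sizes given above, is sketched below (the scalar output head is our assumption, and the weight-norm constraint of the regularizer is omitted):

```python
import torch
import torch.nn as nn

# Sketch of the combination network f: it maps the pair
# (context score Psi(e, c), log prior log p(e|m)) to the final local score Psi(e, m, c).
f_network = nn.Sequential(
    nn.Linear(2, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)

def local_score(psi, log_prior):
    """psi, log_prior: tensors of shape (S,) over the candidates of one mention."""
    features = torch.stack([psi, log_prior], dim=-1)   # (S, 2)
    return f_network(features).squeeze(-1)             # (S,) scores Psi(e, m, c)
```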
Learning the Local Model. Entity and word embeddings are pre-trained as discussed in Section 3. Thus, the only learnable parameters are the diagonal matrices A and B, plus the parameters of f. Having few parameters helps to avoid overfitting and to be able to train with little annotated data. We assume that a set of known mention-entity pairs {(m, e*)} with their respective context windows has been extracted from a corpus. For model fitting, we then utilize a max-margin loss that ranks ground truth entities higher than other candidate entities. This leads us to the objective:

$$\theta^{*} = \arg\min_{\theta} \sum_{(m, e^{*})} \sum_{e \in \Gamma(m)} g(e, m), \qquad g(e, m) := \big[\gamma - \Psi(e^{*}, m, c) + \Psi(e, m, c)\big]_+ \quad (6)$$

where γ > 0 is a margin parameter. So we aim to find a Ψ (parameterized by θ) such that the score of the correct entity e* referenced by m is at least γ higher than that of any other candidate entity e. Whenever this is not the case, the margin violation becomes the experienced loss.
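A sketch of this per-mention hinge loss is given below (the margin value is illustrative; the gold entity's own term is masked out since it only contributes a constant γ):

```python
import torch

def local_ranking_loss(scores, gold_idx, gamma=0.01):
    """Max-margin loss of Eq. (6) for one mention (a sketch).

    scores   : (S,) local scores Psi(e, m, c) for the mention's candidates
    gold_idx : position of the ground-truth entity e* in the candidate list
    """
    mask = torch.ones_like(scores)
    mask[gold_idx] = 0.0                                   # drop the constant term for e*
    violations = torch.clamp(gamma - scores[gold_idx] + scores, min=0.0)
    return (violations * mask).sum()
```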

Document-Level Deep Model
Next, we address global ED assuming document coherence among entities. We therefore introduce the notion of a document as consisting of a sequence of mentions m = m_1, . . ., m_n, along with their context windows c = c_1, . . ., c_n. Our goal is to define a joint probability distribution over Γ(m_1) × · · · × Γ(m_n) ∋ e. Each such e selects one candidate entity for each mention in the document. Obviously, the state space of e grows exponentially in the number of mentions n.

CRF Model
Our model is a fully-connected pairwise conditional random field, defined on the log scale as

$$g(\mathbf{e}, \mathbf{m}, \mathbf{c}) = \sum_{i=1}^{n} \Psi(e_i, c_i) + \sum_{i < j} \Phi(e_i, e_j) \quad (7)$$

The unary factors are the local scores described in Eq. (4). The pairwise factors are bilinear forms of the entity embeddings,

$$\Phi(e, e') = \frac{2}{n-1} \, \mathbf{x}_e^{\top} \mathbf{C} \, \mathbf{x}_{e'} \quad (8)$$

where C is again a diagonal matrix. Similar to (Ganea et al., 2016), the above normalization helps balancing the unary and pairwise terms across documents with different numbers of mentions.
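For illustration, the log-score of one complete entity assignment could be computed as in the following sketch (tensor shapes and names are ours; at least two mentions are assumed):

```python
import torch

def crf_log_score(psi, E, C_diag):
    """Log-score g(e, m, c) of Eq. (7) for one complete assignment (sketch, n >= 2).

    psi    : (n,) unary scores Psi(e_i, c_i) of the selected entity for each mention
    E      : (n, d) embeddings of the selected entities e_1, ..., e_n
    C_diag : (d,) diagonal of the trainable matrix C
    """
    n = E.shape[0]
    pairwise = (E * C_diag) @ E.t() * (2.0 / (n - 1))   # Eq. (8) for all entity pairs
    iu = torch.triu_indices(n, n, offset=1)             # count each unordered pair once
    return psi.sum() + pairwise[iu[0], iu[1]].sum()
```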

Differentiable Inference
Exact training and prediction in CRF models with pairwise factors, as defined above, are in general NP-hard.
Therefore, in learning one usually maximizes a likelihood approximation, and at prediction time one may use an approximate inference procedure, often based on message passing. Among the many challenges of these approaches, it is worth pointing out that weaknesses of the approximate inference procedure are generally not captured during learning. Inspired by (Domke, 2011, 2013), we use truncated fitting of LBP to a fixed number of message passing iterations. Our model directly optimizes the marginal likelihoods, using the same networks for learning and prediction. As noted by (Domke, 2013), this method is robust to model mis-specification, avoids inherent difficulties of partition functions and is faster compared to double-loop likelihood training (where, for each stochastic update, inference is run until convergence is achieved).
Our architecture is shown in Figure 2. A neural network with T layers encodes T message passing iterations of synchronous max-product LBP (sum-product and mean-field performed worse in our experiments), which is designed to find the most likely (MAP) entity assignments. We also use message damping, which is known to speed up and stabilize convergence of message passing. Formally, in iteration t, mention m_i votes for entity candidate e ∈ Γ(m_j) of mention m_j using the log-message m^{t+1}_{i→j}(e), computed as:

$$m_{i \to j}^{t+1}(e) = \max_{e' \in \Gamma(m_i)} \Big\{ \Psi(e', c_i) + \Phi(e, e') + \sum_{k \neq j} \bar{m}_{k \to i}^{t}(e') \Big\} \quad (9)$$

Herein the first part just reflects the CRF potential, whereas the second part, the damped normalized log-message, is defined as

$$\bar{m}_{i \to j}^{t}(e) = \log\Big[ \delta \cdot \mathrm{softmax}\big(m_{i \to j}^{t}(e)\big) + (1 - \delta) \cdot \exp\big(\bar{m}_{i \to j}^{t-1}(e)\big) \Big] \quad (10)$$

where δ ∈ (0, 1] is a damping factor. Note that, without loss of generality, we simplified the LBP procedure by dropping the factor nodes. After T iterations, the beliefs (marginals) are computed as:

$$\mu_i(e) = \Psi(e, c_i) + \sum_{k \neq i} \bar{m}_{k \to i}^{T}(e) \quad (11)$$

$$\mu_i(e) \leftarrow \frac{\exp[\mu_i(e)]}{\sum_{e' \in \Gamma(m_i)} \exp[\mu_i(e')]} \quad (12)$$

Similar to the local case, we obtain an accuracy improvement when combining the mention-entity prior p(e|m) with the marginal µ_i(e) using the non-linear combination function f described in Section 4. The learned function f for global ED is non-trivial (see Figure 3), showing that the influence of the prior tends to weaken for larger µ_i(e), whereas it has a dominating influence whenever the document-level evidence is weak.
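The unrolled inference network can be sketched as follows, under the simplifying assumption that all mentions share the same number of candidates so that messages fit in one dense tensor (names and the zero initialization of the damped messages are our own choices):

```python
import torch

def unrolled_lbp(psi, phi, T=10, delta=0.5):
    """Truncated synchronous max-product LBP (Eqs. (9)-(12)), unrolled for T steps.

    psi : (n, S)       unary scores Psi(e, c_i)
    phi : (n, n, S, S) phi[i, j, a, b] = Phi(candidate a of mention i,
                                             candidate b of mention j)
    Returns the softmax-normalized marginals mu of shape (n, S).
    """
    n, S = psi.shape
    m_bar = torch.zeros(n, n, S)                     # damped log-messages  m_bar[i, j, e]
    diag = torch.eye(n, dtype=torch.bool).unsqueeze(2)
    for _ in range(T):
        # Eq. (9): max over e' of  Psi_i(e') + Phi(e', e) + sum_{k != j} m_bar[k, i, e']
        incoming = m_bar.sum(dim=0).unsqueeze(1) - m_bar.transpose(0, 1)  # (n, n, S) over e'
        cand = psi[:, None, :, None] + phi + incoming.unsqueeze(3)        # (n, n, S_e', S_e)
        msg = cand.max(dim=2).values                                      # (n, n, S) over e
        # Eq. (10): damping in probability space, then back to the log domain
        m_bar_new = torch.log(delta * torch.softmax(msg, dim=2)
                              + (1.0 - delta) * torch.exp(m_bar))
        m_bar = torch.where(diag, torch.zeros_like(m_bar_new), m_bar_new)  # no self-messages
    # Eqs. (11)-(12): beliefs from all incoming messages, followed by normalization
    mu = psi + m_bar.sum(dim=0)
    return torch.softmax(mu, dim=1)
```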
Parameters of our global model are the diagonal matrices A, B, C and the weights of the f network. As before, we found a margin-based objective to be most effective and we suggest to fit the parameters by minimizing a ranking loss defined over the marginals

$$\rho_i(e) := f\big(\mu_i(e), \, p(e|m_i)\big) \quad (13)$$

as follows:

$$L(\theta) = \sum_{D \in \mathcal{D}} \sum_{m_i \in D} \sum_{e \in \Gamma(m_i)} \big[\gamma - \rho_i(e_i^{*}) + \rho_i(e)\big]_+ \quad (14)$$
Computing this objective is straightforward: we run the steps described by Eqs. (9) and (10) T times, followed at the end by the belief computation in Eqs. (11) and (12). Each step is differentiable, and the gradient over the model parameters can be computed on the resulting marginals and back-propagated over the messages using the chain rule.
At prediction time, disambiguation is done independently for each mention m_i by maximizing the ρ_i(e) score, i.e. ê_i = argmax_{e ∈ Γ(m_i)} ρ_i(e).

Candidate Selection
We make use of a mention-entity prior p(e|m) both as a feature and for entity candidate selection. It is computed by averaging probabilities from two indexes built from mention-entity hyperlink statistics from Wikipedia and a large Web corpus (Spitkovsky and Chang, 2012), plus the YAGO index of (Hoffart et al., 2011) (with uniform prior).
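A sketch of how such an averaged prior could be computed from two mention-entity count indexes is shown below (the dictionary layout is our assumption; the YAGO candidates would additionally be mixed in with a uniform prior):

```python
def mention_entity_prior(mention, wiki_counts, web_counts):
    """Averaged mention-entity prior p(e|m) from two hyperlink count indexes (a sketch).

    wiki_counts, web_counts : dict mapping mention string -> {entity: count},
    assumed to be pre-built from Wikipedia hyperlinks and the Web corpus.
    """
    prior = {}
    for counts in (wiki_counts, web_counts):
        entities = counts.get(mention, {})
        total = sum(entities.values())
        if total == 0:
            continue
        for entity, count in entities.items():
            # each index contributes half of its conditional probability mass
            prior[entity] = prior.get(entity, 0.0) + 0.5 * count / total
    return prior
```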
Candidate selection, i.e. the building of Γ(m), is done for each input mention as follows: first, the top 30 candidates are selected based on p(e|m). Then, in order to optimize for memory and run time (LBP has complexity quadratic in S), we keep only 7 of these entities based on the following heuristic: (i) the top 4 entities based on p(e|m) are selected, and (ii) the top 3 entities based on the local context-entity similarity measured as in Eq. (4) are selected (for this step we use a simpler context representation, namely the average of all its constituent word vectors). We refrain from annotating mentions without any candidate, implying that precision and recall can differ in our case.
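The two-stage pruning heuristic can be sketched as follows (the context_score helper is hypothetical and stands for the Eq. (4) similarity computed with the averaged context vector):

```python
def select_candidates(mention, prior, context_score,
                      n_prior=4, n_context=3, shortlist_size=30):
    """Two-stage candidate selection for one mention (a sketch).

    prior         : dict entity -> p(e|m) for this mention
    context_score : callable entity -> context-entity similarity as in Eq. (4)
    """
    # Stage 1: shortlist the 30 most likely entities under the prior
    shortlist = sorted(prior, key=prior.get, reverse=True)[:shortlist_size]
    # Stage 2: keep the top 4 by prior and the top 3 by context similarity (<= 7 in total)
    by_prior = shortlist[:n_prior]
    by_context = sorted(shortlist, key=context_score, reverse=True)[:n_context]
    return list(dict.fromkeys(by_prior + by_context))
```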
In a few cases, generic mentions of persons (e.g. "Peter") are coreferences of more specific mentions (e.g. "Peter Such") from the same document. We employ a simple heuristic to address this issue: for each mention m, if there exist longer mentions of persons in the same document that contain m as a continuous subsequence of words, we use the candidate set of the longest such mention in place of the one for m.

ED Datasets
We validate our ED models on some of the most popular available datasets used by our predecessors. We provide statistics in Table 2.

Training Details and (Hyper)Parameters
We explain the training details of our approach. All models are implemented in the Torch framework. We first train each entity vector on the entity's canonical description page (title words included) for 400 iterations. Subsequently, Wikipedia hyperlinks of the respective entities are used for learning until the validation score (described below) stops improving. In each iteration, 20 positive words, each with 5 negative words, are sampled and used for optimization as explained in Section 3. We use Adagrad (Duchi et al., 2011) with a learning rate of 0.3. We choose embedding size d = 300, pre-trained (fixed) Word2Vec word vectors, α = 0.6, γ = 0.1 and a window size of 20 for the hyperlinks. We remove stop words before training. Learning vectors for all candidate entities in all datasets (270,000 entities) takes 20 hours on a single TitanX GPU with 12GB of memory. We test and validate our entity embeddings on the respective parts of the entity relatedness dataset of (Ceccarelli et al., 2013). It contains 3319 and 3673 queries for the test and validation sets, respectively. Each query consists of one target entity and up to 100 candidate entities with gold standard binary labels. Our ED models are then trained for 1250 epochs over AIDA-train.

Baselines & Results
We compare with systems that report state-of-the-art results on the datasets. Some baseline scores in Table 4 are taken from (Guo and Barbosa, 2016). The best results for the AIDA datasets are reported by (Yamada et al., 2016) and (Globerson et al., 2016). We do not compare against (Pershina et al., 2015) since, as noted also by (Globerson et al., 2016), their mention index artificially includes the gold entity (guaranteed gold recall).
For a fair comparison with prior work, we use in-KB accuracy and micro F1 (averaged per mention) metrics to evaluate our approach. Results are shown in Tables 3 and 4. We run our system 5 times; each time we pick the best model on the validation set and report results on the test set for these models. We obtain state-of-the-art accuracy on AIDA, which is the largest and hardest (by the accuracy of the p(e|m) baseline) manually created ED dataset. We are also competitive on the other datasets. It should be noted that all the other methods use, at least partially, engineered features. The merit of our proposed method is to show that, with the exception of the p(e|m) feature, a neural network is able to learn the best features for ED without requiring expert input.
To gain further insight, we analyzed the accuracy on the AIDA-B dataset for situations where gold entities have a low frequency or a low mention prior. Table 6 shows that our method performs well in these harder cases.

Hyperparameter Studies
In Table 5, we analyze the effect of two hyper-parameters. First, we see that hard attention (i.e. R < K) helps reduce the noise from uninformative context words. Second, we see that a small number of LBP iterations (hard-coded in our network) is enough to obtain good accuracy. This speeds up training and testing compared to traditional methods that run LBP until convergence.

Qualitative Analysis of Local Model
In Table 7 we show some examples of context words attended by our local model for hard cases (where the mention prior of the correct entity is low).

Conclusion
We have proposed a novel deep learning architecture for entity disambiguation that combines entity embeddings, a contextual attention mechanism, an adaptive local score combination, as well as unrolled differentiable message passing for global inference. Compared to many other methods, we do not rely on hand-engineered features, nor on an extensive corpus for entity co-occurrences or relatedness. Our system is fully differentiable, although we chose to pre-train word and entity embeddings. Extensive experiments show the competitiveness of our approach across a wide range of corpora. In the future, we would like to extend this system to perform nil detection, coreference resolution and mention detection.

Figure 1: Local model with neural attention. Inputs: context word and candidate entity embeddings. Outputs: entity scores. All parts are differentiable and trainable with backpropagation.

Figure 2: Global model: unrolled LBP deep network that is end-to-end differentiable and trainable.

Figure 3: Non-linear scoring function of the belief and mention prior learned with a neural network. Achieves a 1.7% improvement on the AIDA-B dataset compared to a weighted average scheme.

Table 1: Entity relatedness results on the test set of (Ceccarelli et al., 2013). WLM is a well-known similarity method of (Milne and Witten, 2008).

Table 2: ED datasets. Gold recall is the percentage of mentions for which the entity candidate set contains the ground truth entity. The WNED datasets are automatically extracted, thus less reliable.

Table 3: In-KB accuracy for the AIDA-B test set. All baselines use the KB+YAGO mention-entity index. For our method we show 95% confidence intervals obtained over 5 runs.
Local models: prior p(e|m): 71.9; (Lazic et al., 2015): 86.4; (Globerson et al., 2016): 87.9; (Yamada et al., 2016): 87.2; our (local, K=100, R=50): 88.8.
Global models: (Huang et al., 2015): 86.6; (Ganea et al., 2016): 87.6; (Chisholm and Hachey, 2015): 88.7; (Guo and Barbosa, 2016): 89.0; (Globerson et al., 2016): 91.0; (Yamada et al., 2016): 91.5; our (global): 92.22 ± 0.14.

Table 4: Micro F1 results for the other datasets.

Table 5: Effects of two of the hyper-parameters. Left: a low T (e.g. 5) is already sufficient for accurate approximate marginals. Right: hard attention improves the accuracy of a local model with K=100.

Table 6: ED accuracy on AIDA-B for our best system, split by Wikipedia hyperlink frequency and mention prior of the ground truth entity, in cases where the gold entity appears in the candidate set.

Table 7: Examples of context words selected by our local attention mechanism. Distinct words are sorted decreasingly by attention weights and only words with non-zero weights are shown.
Japan: player Shizuoka Yokohama played Asian USISL Saitama Okada Nakamura Tokyo Pele matches Japanese Korea players Tanaka soccer Chunnam game Suwon Takuya Kawaguchi Mizuno match Qatar team Eto Eiji football playing Confederations tournament Kagawa Chiba
Apple: apple fruit berry grape varieties apples crop pear potato blueberry strawberry growers peach orchards pears Prunus grower Rubus citrus spinosa tomato berries Blueberry peaches grapes almond juice melon bean apricot insect vegetable strawberries olive pomegranate Vaccinium cherries potatoes Strawberry plums cultivar Apples harvest figs cultivars sunflower beet
Apple Inc.: Apple software computer Microsoft Adobe hardware company iPod PC product Dell laptop Mac computers Macintosh Flash video desktop iPhone Digital Windows app PCs Intel technology device iTunes Motorola Sony digital Multimedia iPad HP licensing multimedia Nokia apps smartphone laptops Computer previewed products application Jobs devices startup
Queen (band): U2 band singer Avenged Rockers Coldplay concert Lynyrd Kiss Metallica Killers rerecorded song Beatles rock Stones recording Slash Singer touring musician music CD Dirty Moby rockers Sting Blackest songs rocker
Germany: Germany Berlin German Munich Hamburg Austria Cologne Bavaria Hessen country Europe Wernigerode Saxony western Germans Schwaben Switzerland TuS Heilbronn Realschule Westfalen Deutschland Brandenburg eastern Rudolf Glarus Wolfgang Esslingen Kaserne Swabia Schwerin Andreas Poland Helmut Palatinate history Darmstadt Rhein Harald Ludwigsburg Kiel
Barack Obama: Obama campaign President presidential endorsed Democrat Clinton nominee Presidential inauguration Senator senator administration speech Barack Democratic appointee Washington Republican vote Tuesday Secretary election Administration elect nomination Bush November president congressman Senate endorsing announcement candidacy
Leicestershire: curacy town Yeomanry Buckinghamshire Leicestershire Bedfordshire Lichfield Wiltshire Shropshire almshouses Lancashire Stonyhurst
Leicestershire County Cricket Club: Warwickshire batsman England Hampshire Leicestershire Trott Glamorgan Nottinghamshire Northants Lancashire Middlesex Essex Giles fielding Porterfield Test Surrey cricketer centurion Gough Bevan Sussex Gloucestershire bowled Worcestershire Tests Martyn Croft Derbyshire Clarke overs bowler Lancastrian played Northamptonshire Kent Vaughan Fletcher captaining internationals batting Gilchrist Notts batted cricket

Table 8: Closest words to a given entity. Words with a frequency of at least 500 in the Wikipedia corpus are shown.