Effective Use of Context in Noisy Entity Linking

To disambiguate between closely related concepts, entity linking systems need to effectively distill cues from their context, which may be quite noisy. We investigate several techniques for using these cues in the context of noisy entity linking on short texts. Our starting point is a state-of-the-art attention-based model from prior work; while this model’s attention typically identifies context that is topically relevant, it fails to identify some of the most indicative surface strings, especially those exhibiting lexical overlap with the true title. Augmenting the model with convolutional networks over characters still leaves it largely unable to pick up on these cues compared to sparse features that target them directly, indicating that automatically learning how to identify relevant character-level context features is a hard problem. Our final system outperforms past work on the WikilinksNED test set by 2.8% absolute.


Introduction
Effectively using an entity mention's context to disambiguate it is the crux of the entity linking task: in isolation, the mention Richard Wright could refer to three possible entities in Wikipedia's knowledge base corresponding to an artist, a musician, or an author. Previous work in this area has distilled context information by exploiting tf-idf features (Cucerzan, 2007; Milne and Witten, 2008; Ratinov et al., 2011), global link coherence (Hoffart et al., 2011; Sil and Florian, 2016), cues from coreference (Cheng and Roth, 2013; Hajishirzi et al., 2013; Durrett and Klein, 2014), convolutional neural networks (Sun et al., 2015; Francis-Landau et al., 2016), or more sophisticated neural architectures (Gupta et al., 2017; Sil et al., 2018).
These approaches typically focus on aggregating information from a mix of sources, including long-range information from the textual context or other linked entities. While this approach is suitable for entity linking settings such as newswire (Bentivogli, 2010) and Wikipedia (Ratinov et al., 2011), we cannot always rely on this information in other settings like Twitter (Guo et al., 2013; Fang and Chang, 2014; Huang et al., 2014; Dredze et al., 2016), Snapchat (Moon et al., 2018), other web platforms (Eshel et al., 2017), or dialogue systems (Bowden et al., 2018). We need models that can make effective use of limited context windows in noisy settings.
In this work, we investigate this problem of effectively using context in the setting of the WikilinksNED dataset from Eshel et al. (2017). The examples in this dataset, which consists of 3.2 million entity disambiguation examples derived from Wikilinks (Singh et al., 2012), have at most 20 words of context on either side and usually no other mentions of the entity being disambiguated. We build off a state-of-the-art attentive LSTM model from prior work (Eshel et al., 2017) and show that despite its good performance, it fails to resolve some examples that human readers would find trivial. For example, disambiguating the identity of the song Down in Figure 1 is easy if we can recognize the nearby string Jay Sean in the context, but the model sometimes fails to do this.
We explore the performance of a standard attention mechanism as well as two modifications. First, we inject character information into the model through character-level CNNs; these give the model a deeper ability to recognize character correspondences between the context and entity title. However, these convolutional filters struggle to learn useful features in this noisy context and ultimately do not help performance. By contrast, sparse features explicitly targeting these overlaps do improve performance: our final system combining attention with these features achieves state-of-the-art results on this dataset.

Basic Model
The WikilinksNED dataset consists of entity mentions in context scraped from the web, with gold annotation derived from the fact that those mentions originally appeared with hyperlinks to Wikipedia. We denote the mention text (i.e., anchor text of the hyperlink) by m, and denote the left and right context of the mention by c_l and c_r respectively; these are at most 20 words each. For this dataset, we can assume that the possible linked titles for a mention have been seen in training, and the main task is instead to disambiguate between them and identify the gold title t*. We therefore follow prior work (Eshel et al., 2017) and take as candidates all gold entities in the training set whose mention was m, rather than relying on a separate candidate generation scheme. Our model places a distribution over titles P(t | m, c_l, c_r), where t takes values in the set of candidate Wikipedia titles for that mention. This model, depicted in Figure 1, roughly follows that of Eshel et al. (2017), with some key differences, as we discuss in the rest of this section.
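This candidate-set construction is simple enough to sketch directly. In the hypothetical Python snippet below, every mention string is mapped to the set of gold titles it links to anywhere in the training data; at test time, a mention's candidates are exactly this set (the example mentions and titles are illustrative):

```python
from collections import defaultdict

def build_candidate_sets(training_examples):
    """Map each mention string to the set of gold titles it was linked
    to anywhere in training. At test time, the candidate set for a
    mention m is candidates[m]; no separate generation step is needed."""
    candidates = defaultdict(set)
    for mention, gold_title in training_examples:
        candidates[mention].add(gold_title)
    return candidates

train = [("Richard Wright", "Richard Wright (author)"),
         ("Richard Wright", "Richard Wright (musician)"),
         ("Down", "Down (Jay Sean song)")]
cands = build_candidate_sets(train)
```

This makes explicit why the task reduces to disambiguation: any test mention draws its label space from titles already observed in training.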
Embedding contexts Given an example of the form (m, c_l, c_r), our model first uses a GRU layer (Cho et al., 2014) over each context to convert c_l and c_r into continuous vector representations l and r, respectively. Our word embeddings are trained over Wikipedia as described in the following paragraph.
Embedding entities We follow the method of Eshel et al. (2017) for generating entity embeddings, using word2vecf (Levy and Goldberg, 2014) to jointly train word and entity embeddings over Wikipedia article text. Each title t is associated in turn with each content word w in the article, yielding a set of (w, t) pairs that are consumed by the training procedure. This yields a set of title embeddings e_t which we can treat as distributed representations of entities.
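A minimal sketch of the (w, t) pair extraction described above; the stopword filter here is an assumption standing in for whatever content-word criterion is actually applied before word2vecf training:

```python
def title_word_pairs(article_text, title, stopwords=frozenset()):
    """Pair a Wikipedia title with each content word of its article,
    producing (word, title) pairs for joint word/entity embedding
    training in the style of word2vecf."""
    return [(w, title) for w in article_text.lower().split()
            if w not in stopwords]

pairs = title_word_pairs("the limbu are an ethnic group",
                         "Limbu People",
                         stopwords=frozenset({"the", "an", "are"}))
```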
Entity-context comparison We systematically compare the representations l, r, and e_t as follows:

f = [l · e_t ; r · e_t ; l ⊙ e_t ; r ⊙ e_t]

where · denotes the conventional dot product and ⊙ denotes the elementwise product. These features form the input to a final feedforward layer which produces a real-valued score s_t for the given title. Repeating this computation for each title, our model's distribution is P(t | m, c_l, c_r) = softmax_t(s_t).
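One plausible instantiation of this comparison-and-scoring step, sketched in NumPy; the exact feature set, layer shapes, and the single linear layer standing in for the feedforward layer are assumptions for illustration:

```python
import numpy as np

def compare(l, r, e_t):
    """Comparison features between the context representations l, r
    and the title embedding e_t: the two dot products plus the two
    elementwise products."""
    return np.concatenate([[l @ e_t], [r @ e_t], l * e_t, r * e_t])

def score_titles(l, r, title_embs, W, b):
    """Score each candidate with a linear layer over the comparison
    features, then softmax over candidates for P(t | m, c_l, c_r)."""
    scores = np.array([W @ compare(l, r, e) + b for e in title_embs])
    exp = np.exp(scores - scores.max())      # stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
d = 4
l, r = rng.normal(size=d), rng.normal(size=d)
title_embs = rng.normal(size=(3, d))          # 3 candidate titles
W, b = rng.normal(size=2 + 2 * d), 0.0        # 2 dots + 2 elementwise blocks
p = score_titles(l, r, title_embs, W, b)
```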
Training Because our model involves substantial computation for each possible title, we want to limit the set of titles considered during training. For each example, we construct a set T containing the gold title and 4 negative "distractor" titles from the candidate set. Unlike Eshel et al. (2017), we structure training as a multiclass decision among these titles rather than a binary prediction problem over each title as gold or not. We run our model over the candidates t ∈ T to produce the distribution P(t | m, c_l, c_r) and train to maximize the log probability log P(t* | m, c_l, c_r) of the gold title.
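The objective above amounts to a standard multiclass negative log-likelihood over the gold title plus sampled distractors. The sketch below makes this concrete; the uniform distractor sampling is a simplification, not necessarily the scheme used in the paper:

```python
import random
import numpy as np

def nll_over_candidates(scores, gold_index):
    """Multiclass NLL over the candidate set T (gold + distractors):
    -log softmax(scores)[gold_index], computed stably."""
    shifted = scores - scores.max()
    log_z = np.log(np.exp(shifted).sum())
    return -(shifted[gold_index] - log_z)

def sample_distractors(candidates, gold, k=4, seed=None):
    """Draw up to k negative titles from the mention's candidate set."""
    negatives = [c for c in candidates if c != gold]
    random.Random(seed).shuffle(negatives)
    return negatives[:k]

# A higher gold score yields a lower loss than an uninformative one.
loss = nll_over_candidates(np.array([2.0, 0.0, 0.0, 0.0, 0.0]), 0)
```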

Results
The model set forth in this section is the basis for the remaining models in this paper; we call it the GRU model, as that is the only context encoding mechanism it uses. As shown in Table 1, this GRU model gets a score of 73.4 on the WikilinksNED development set. In the next section, we explore techniques for using the context in a more sophisticated way to improve further on this result.

[Table 1: Accuracy on test (%) for each model.]

Attention
One way to improve over the basic GRU model is to use attention over the context based on the title under consideration. The attention we use is a modified version of the dot-product attention (Luong et al., 2015) used by Eshel et al. (2017), allowing the model to weight the importance of the outputs of the GRU at each time step. Each context (left and right) has its own attention weights. For a given side of context and candidate t, the attention first computes a transformation of the entity embedding e_t as follows: q_t = tanh(W e_t). This allows the model to learn an attention query q_t distinct from the candidate embedding e_t. The model then computes attention probabilities α_i for each GRU output o_i, normalized over the entire sequence of GRU outputs (of length n):

α_i = exp(o_i · q_t) / Σ_{j=1}^{n} exp(o_j · q_t)

The resulting probability distribution is used to take a weighted sum of GRU outputs to get a representation a:

a = Σ_{i=1}^{n} α_i o_i

We compute a_l and a_r independently and symmetrically for the left and right context. These vectors are then fed forward through the model as the final continuous representation of the left or right context, l or r respectively.
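A NumPy sketch of this attention computation, assuming `outputs` stacks the GRU outputs o_1..o_n row-wise and `W` is the learned query transformation (all shapes here are illustrative):

```python
import numpy as np

def attend(outputs, e_t, W):
    """Dot-product attention over GRU outputs, queried by the learned
    transformation q_t = tanh(W e_t) of the title embedding. Returns
    the attention distribution and the weighted-sum representation a."""
    q = np.tanh(W @ e_t)
    logits = outputs @ q                  # o_i · q_t for each time step
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                  # normalize over the sequence
    a = alpha @ outputs                   # weighted sum of GRU outputs
    return alpha, a

rng = np.random.default_rng(0)
outputs = rng.normal(size=(5, 4))   # n = 5 GRU outputs of dimension 4
e_t = rng.normal(size=3)            # entity embedding
W = rng.normal(size=(4, 3))         # maps entity space to GRU space
alpha, a = attend(outputs, e_t, W)
```

The same function is applied separately to the left and right context to produce a_l and a_r.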

Results
In Table 1, we see that our model with attention (GRU+ATTN) outperforms our basic GRU model by around 1% absolute. It also outperforms the roughly similar model of Eshel et al. (2017) on the test set: this gain is due to a combination of factors including the improved training procedure and some small modeling changes. However, our attention scheme is not without its shortcomings, as we now discuss.

Shortcomings of Attention
One common and frustrating error our model makes is failing to correctly disambiguate mentions whose contexts share words or character-level overlap with the gold entity's actual Wikipedia title. In these instances, the model fails to attend to words that we, as human readers, would most likely see as disambiguating terms. For instance, in this example's left context: ...known also for the B.P. Koirala Institute of Health Sciences, one of the biggest government hospital. The indigenous people of Dharan are Limbu ... the model fails to identify people as a critical term for disambiguation. This failure is partially due to the model's sole reliance on distributed representations: the embedding for people and the title embedding for Limbu People need to somehow contain enough common information for the model to associate these, identify people as an important token, and use it to disambiguate between candidate titles such as Limbu People, Limbu Language, and Limbu Alphabet. Moreover, with such noisy, unstructured context, it is difficult for the model to learn to rely on other grammatical or semantic cues (such as the word are indicating that the title is probably a plural noun, which alphabet and language are not).

Character CNNs
One way to address these issues is to exploit more fine-grained character-level information. This circumvents the need to separately learn a distributed correspondence between terms with lexical overlap, and is especially useful when these terms are unknown words; for example, a year mentioned within a context is often unknown and therefore assigned an UNK embedding, even if that year matches exactly with a year in the gold candidate's title. One solution is to allow our model to consult character-level information, which past models have done successfully for named entity recognition (Chiu and Nichols, 2015; Lample et al., 2016; Ma and Hovy, 2016), text classification (Zhang et al., 2015), and POS tagging (Santos and Zadrozny, 2014). We use convolutional neural networks (CNNs) to distill character representations of words into vectors that we concatenate with our word representations. We additionally use character CNNs over entity titles and concatenate these representations with the title embeddings e_t, allowing the model to learn to characterize similarities between contexts and entity titles. Our CNNs use windows of size 6 and 100 filters each; these values were selected through hyperparameter tuning on the development set.

Table 1 shows the impact of incorporating character CNNs (GRU+ATTN+CNN). Surprisingly, these have a mild negative impact on performance. One possible explanation is that they cause the model to split its attention between semantically important and lexically similar context terms. Consider the following example: really think Final Fight could be a lot of a fun as a vigilante justice movie with a high quotient of hand-to-hand fight sequences. Think The Warriors The gold title is The Warrior (film) and the base model correctly places 90% of its attention weight on the word movie when calculating attention for this title.
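A minimal sketch of the per-word character CNN described above: characters are embedded, windows of 6 characters are convolved against each filter, and the result is max-pooled over positions. The padding scheme and character vocabulary are assumptions:

```python
import numpy as np

def char_cnn(word, char_emb, filters, window=6):
    """Max-pooled character CNN for one word. char_emb has one row per
    character id; each row of filters covers one flattened window of
    `window` character embeddings. Short words are zero-padded so at
    least one window fits."""
    ids = [ord(c) % char_emb.shape[0] for c in word]
    mat = char_emb[ids]
    pad = max(0, window - mat.shape[0])
    if pad:
        mat = np.vstack([mat, np.zeros((pad, mat.shape[1]))])
    windows = np.stack([mat[i:i + window].ravel()
                        for i in range(mat.shape[0] - window + 1)])
    return (windows @ filters.T).max(axis=0)   # one value per filter

rng = np.random.default_rng(0)
char_emb = rng.normal(size=(128, 8))      # toy character embeddings
filters = rng.normal(size=(100, 6 * 8))   # 100 filters, window of 6
vec = char_cnn("people", char_emb, filters)
```

The resulting 100-dimensional vector would be concatenated with the word's embedding (and, for titles, with e_t).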
However, the character-level CNN model only places 60% of its attention weight on it, distributing its attention values more evenly across the rest of the words. Such cases are frequent: the average highest weight given by attention in GRU+ATTN+CNN is about 6% lower than the average highest attention weight given by GRU+ATTN. The CNNs seem to have generally decreased the model's confidence in what context clues are key for disambiguation, leading to lower performance. We will return to more analysis of this in Section 4.

Lexical Feature Set
To determine whether character-level overlap between the entity title and context is useful, we take a more direct approach to incorporating that information into our model and build a set of sparse features that directly target it. Figure 2 shows an example of how our features are computed. We fire a feature on each word in the context that is either an exact match or a substring of a word in the candidate title; people is the relevant token here. We conjoin that match information with whether the word is in the left or right context along with the bucketed offset of the word from the mention. This feature set is then appended to the vector comparison features to form the input to the model's feedforward layer (see Figure 1).

Table 1 shows the results of stacking these features on top of our model with attention (GRU+ATTN+FEATS). This model achieves our highest development set performance and correspondingly high test performance. This indicates that character-level information is useful for disambiguation, but character CNNs as we incorporated them are not able to distill it as effectively as sparse features can. Our model augmented with these sparse features achieves state-of-the-art results on the test set.
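A sketch of this sparse feature extraction. The bucket edges are illustrative assumptions, and offsets simply count words outward from the mention on each side:

```python
def lexical_features(context, side, title, buckets=(1, 2, 4, 8)):
    """For each context word that exactly matches or is a substring of
    a title word, fire a feature conjoining the match type, the side of
    the mention ("left"/"right"), and a bucketed word offset. For the
    left context, offsets count backwards from the mention."""
    title_words = title.lower().split()
    words = context if side == "right" else list(reversed(context))
    feats = []
    for offset, w in enumerate(words, start=1):
        w = w.lower()
        if w in title_words:
            match = "exact"
        elif any(w in tw for tw in title_words):
            match = "substr"
        else:
            continue
        bucket = next((b for b in buckets if offset <= b), buckets[-1])
        feats.append((match, side, bucket))
    return feats
```

The resulting indicator features would be appended to the vector comparison features before the feedforward layer.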

Attention and "Obvious" Terms
Now that we have identified features which seem useful for this entity linking problem, we can ask how the tokens attended to by our attention mechanism compare to those singled out by the features.

[Figure 3: Examples of our models putting high attention weight on irrelevant context words, not acknowledging the relevance of disambiguating terms that share lexical overlap with the correct title. We display the weight given to the top 4 attended words above each word for two of our models.]

For each model, we measure how often the most heavily attended word contains one of our lexical features, out of all examples where such a feature fires anywhere. The reported probability mass is the total attention mass that the model puts on words associated with lexical features, averaged over all examples where such features exist. We see that the model frequently fails to exploit this information, and moreover the addition of CNNs does not strongly improve this. Figure 3 shows examples of this behavior. In the first example, rather than identifying cheese as a salient term, both models instead focus more heavily on milk and like. Similarly, in the second example, the model fails to recognize the importance of robot in the context.
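The per-example measurement described above can be sketched as follows, assuming `alpha` holds one attention distribution and `feature_positions` indexes the context words that fire lexical features (both hypothetical inputs for illustration):

```python
def overlap_attention_stats(alpha, feature_positions):
    """For one example, report (a) whether the most heavily attended
    word carries a lexical feature, and (b) the total attention mass on
    feature-bearing words. Averaging (a) and (b) over all examples with
    at least one feature gives the statistics discussed in the text."""
    top = max(range(len(alpha)), key=lambda i: alpha[i])
    mass = sum(alpha[i] for i in feature_positions)
    return top in feature_positions, mass
```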
One possible reason that CNNs don't help more is that the sparse features only trigger on a subset of examples. Because the CNNs process every example, they may not see enough examples of lexical overlap to pick up on it, and instead try to augment what the word embedding model is already doing with subword information, which ends up being unstable for this task. Naturally, words with these overlap characteristics are not always the most disambiguating terms. However, in noisy contexts, when the standard representation of context is insufficient for the model to disambiguate, we want the model to be able to leverage this character-level information to help it make intuitive decisions, which the CNN fails to do.

Conclusion
In this paper, we observed that in noisy entity linking settings on short texts, neural models relying on attention do not always pick up on the correct context clues, even when those clues exhibit very obvious surface overlap with the correct entity title. These models can perform better when augmented with sparse features explicitly targeting this kind of lexical overlap: our system using these features achieves state-of-the-art disambiguation accuracy on the WikilinksNED dataset. By contrast, automatically learning fine-grained character-level features with CNNs in this context is hard. More exploration is needed to better understand what inductive biases are necessary for an entity linking system to make maximally effective use of the information available to it.