Mixing Context Granularities for Improved Entity Linking on Question Answering Data across Entity Categories

The first stage of every knowledge base question answering approach is to link entities in the input question. We investigate entity linking in the context of question answering task and present a jointly optimized neural architecture for entity mention detection and entity disambiguation that models the surrounding context on different levels of granularity. We use the Wikidata knowledge base and available question answering datasets to create benchmarks for entity linking on question answering data. Our approach outperforms the previous state-of-the-art system on this data, resulting in an average 8% improvement of the final score. We further demonstrate that our model delivers a strong performance across different entity categories.


Introduction
Knowledge base question answering (QA) requires a precise modeling of the question semantics through the entities and relations available in the knowledge base (KB) in order to retrieve the correct answer. The first stage for every QA approach is entity linking (EL), that is the identification of entity mentions in the question and linking them to entities in KB. In Figure 1, two entity mentions are detected and linked to the knowledge base referents. This step is crucial for QA since the correct answer must be connected via some path over KB to the entities mentioned in the question.
The state-of-the-art QA systems usually rely on off-the-shelf EL systems to extract entities from the question (Yih et al., 2015 Figure 1: An example question from a QA dataset that shows the correct entity mentions and their relationship with the correct answer to the question, Qxxx stands for a knowledge base identifier for question answering (e.g. DBPedia Spotlight 1 , AIDA 2 ). However, these systems have certain drawbacks in the QA setting: they are targeted at long well-formed documents, such as news texts, and are less suited for typically short and noisy question data. Other EL systems focus on noisy data (e.g. S-MART, Yang and Chang, 2015), but are not openly available and hence limited in their usage and application. Multiple error analyses of QA systems point to entity linking as a major external source of error (Berant and Liang, 2014;Reddy et al., 2014;Yih et al., 2015).
The QA datasets are normally collected from the web and contain very noisy and diverse data (Berant et al., 2013), which poses a number of challenges for EL. First, many common features used in EL systems, such as capitalization, are not meaningful on noisy data. Moreover, a question is a short text snippet that does not contain broader context that is helpful for entity disambiguation. The QA data also features many entities of various categories and differs in this respect from the Twitter datasets that are often used to evaluate EL systems.
In this paper, we present an approach that tackles the challenges listed above: we perform entity mention detection and entity disambiguation jointly in a single neural model that makes the whole process end-to-end differentiable. This ensures that any token n-gram can be considered as a potential entity mention, which is important to be able to link entities of different categories, such as movie titles and organization names.
To overcome the noise in the data, we automatically learn features over a set of contexts of different granularity levels. Each level of granularity is handled by a separate component of the model. A token-level component extracts higher-level features from the whole question context, whereas a character-level component builds lower-level features for the candidate n-gram. Simultaneously, we extract features from the knowledge base context of the candidate entity: character-level features are extracted for the entity label and higher-level features are produced based on the entities surrounding the candidate entity in the knowledge graph. This information is aggregated and used to predict whether the n-gram is an entity mention and to what entity it should be linked.
Contributions The two main contributions of our work are: (i) We construct two datasets to evaluate EL for QA and present a set of strong baselines: the existing EL systems that were used as a building block for QA before and a model that uses manual features from the previous work on noisy data.
(ii) We design and implement an entity linking system that models contexts of variable granularity to detect and disambiguate entity mentions. To the best of our knowledge, we are the first to present a unified end-to-end neural model for entity linking for noisy data that operates on different context levels and does not rely on manual features. Our architecture addresses the challenges of entity linking on question answering data and outperforms state-of-the-art EL systems.
Code and datasets Our system can be applied on any QA dataset. The complete code as well as the scripts that produce the evaluation data can be found here: https://github.com/UKPLab/ starsem2018-entity-linking.

Motivation and Related Work
Several benchmarks exist for EL on Wikipedia texts and news articles, such as ACE (Bentivogli et al., 2010) and CoNLL-YAGO (Hoffart et al., 2011). These datasets contain multi-sentence documents and largely cover three types of entities: Location, Person and Organization. These types are commonly recognized by named entity recognition systems, such as Stanford NER Tool . Therefore in this scenario, an EL system can solely focus on entity disambiguation. In the recent years, EL on Twitter data has emerged as a branch of entity linking research. In particular, EL on tweets was the central task of the NEEL shared task from 2014 to 2016 (Rizzo et al., 2017). Tweets share some of the challenges with QA data: in both cases the input data is short and noisy. On the other hand, it significantly differs with respect to the entity types covered. The data for the NEEL shared task was annotated with 7 broad entity categories, that besides Location, Organization and Person include Fictional Characters, Events, Products (such as electronic devices or works of art) and Things (abstract objects).  2 also shows the entity categories present in two QA datasets. The distribution over the categories is more diverse in this case. The WebQuestions dataset includes the Fictional Character and Thing categories which are almost absent from the NEEL dataset. A more even distribution can be observed in the GraphQuestion dataset that features many Events, Fictional Characters and Professions. This means that a successful system for EL on question data needs to be able to recognize and to link all categories of entities. Thus, we aim to show that comprehensive modeling of different context levels will result in a better generalization and performance across various entity categories.
Existing Solutions The early machine learning approaches to EL focused on long well-formed documents (Bunescu and Pasca, 2006;Cucerzan, 2007;Han and Sun, 2012;Francis-Landau et al., 2016). These systems usually rely on an off-theshelf named entity recognizer to extract entity mentions in the input. As a consequence, such approaches can not handle entities of types other than those that are supplied by the named entity recognizer. Named entity recognizers are normally trained to detect mentions of Locations, Organizations and Person names, whereas in the context of QA, the system also needs to cover movie titles, songs, common nouns such as 'president' etc.
To mitigate this, Cucerzan (2012) has introduced the idea to perform mention detection and entity linking jointly using a linear combination of manually defined features. Luo et al. (2015) have adopted the same idea and suggested a probabilistic graphical model for the joint prediction. This is essential for linking entities in questions. For example in "who does maggie grace play in taken?", it is hard to distinguish between the usage of the word 'taken' and the title of a movie 'Taken' without consulting a knowledge base. Sun et al. (2015) were among the first to use neural networks to embed the mention and the entity for a better prediction quality. Later, Francis-Landau et al. (2016) have employed convolutional neural networks to extract features from the document context and mixed them with manually defined features, though they did not integrate it with mention detection. Sil et al. (2018) continued the work in this direction recently and applied convolutional neural networks to cross-lingual EL.
The approaches that were developed for Twitter data present the most relevant work for EL on QA data. Guo et al. (2013b) have created a new dataset of around 1500 tweets and suggested a Structured SVM approach that handled mention detection and entity disambiguation together.  describe the winning system of the NEEL 2014 competition on EL for short texts: The system adapts a joint approach similar to Guo et al. (2013b), but uses the MART gradient boosting algorithm instead of the SVM and extends the feature set. The current state-of-the-art system for EL on noisy data is S-MART (Yang and Chang, 2015) which extends the approach from  to make structured predictions. The same group has subsequently applied S-MART to extract entities for a QA system (Yih et al., 2015).
Unfortunately, the described EL systems for short texts are not available as stand-alone tools. Consequently, the modern QA approaches mostly rely on off-the-shelf entity linkers that were designed for other domains. Reddy et al. (2016) have employed the Freebase online API that was since deprecated. A number of question answering systems have relied on DBPedia Spotlight to extract entities (Lopez et al., 2016;Chen et al., 2016). DB-Pedia Spotlight (Mendes et al., 2011) uses document similarity vectors, word embeddings and manually defined features such as entity frequency. We are addressing this problem in our work by presenting an architecture specifically targeted at EL for QA data.
The Knowledge Base Throughout the experiments, we use the Wikidata 3 open-domain KB (Vrandečić and Krötzsch, 2014). Among the previous work, the common choices of a KB include Wikipedia, DBPedia and Freebase. The entities in Wikidata directly correspond to the Wikipedia articles, which enables us to work with data that was previously annotated with DBPedia. Freebase was discontinued and is no longer up-todate. However, most entities in Wikidata have been annotated with identifiers from other knowledge sources and databases, including Freebase, which establishes a link between the two KBs.

Entity Linking Architecture
The overall architecture of our entity linking system is depicted in Figure 3. From the input question x we extract all possible token n-grams N up to a x = what are taylor swift's albums?
Step 1. consider all n-grams Step 2. entity candidates for an n-gram C = entity candidates(n) wikidata Full text search Step 3. score the n-gram with the model Step 4. compute the global assignment of entities G = global assignment(p n , p c , n, x|n ∈ N ) Figure 3: Architecture of the entity linking system certain length as entity mention candidates (Step 1). For each n-gram n, we look it up in the knowledge base using a full text search over entity labels (Step 2). That ensures that we find all entities that contain the given n-gram in the label. For example for a unigram 'obama', we retrieve 'Barack Obama', 'Michelle Obama' etc. This step produces a set of entity disambiguation candidates C for the given ngram n. We sort the retrieved candidates by length and cut off after the first 1000. That ensures that the top candidates in the list would be those that exactly match the target n-gram n.
In the next step, the list of n-grams N and the corresponding list of entity disambiguation candidates are sent to the entity linking model (Step 3). The model jointly performs the detection of correct mentions and the disambiguation of entities.

Variable Context Granularity Network
The neural architecture (Variable Context Granularity, VCG) aggregates and mixes contexts of different granularities to perform a joint mention detection and entity disambiguation. Figure 4 shows the layout of the network and its main components. The input to the model is a list of question tokens x, a token n-gram n and a list of candidate entities C. Then the model is a function M(x, n,C) that produces a mention detection score p n for each n-gram and a ranking score p c for each of the candidates c ∈ C: p n , p c = M(x, n,C).
Dilated Convolutions To process sequential input, we use dilated convolutional networks (DCNN). Strubell et al. (2017) have recently shown that DCNNs are faster and as effective as recurrent models on the task of named entity recognition. We define two modules: DCNN w and DCNN c for processing token-level and character-level input respectively. Both modules consist of a series of convolutions applied with an increasing dilation, as described in Strubell et al. (2017). The output of the convolutions is averaged and transformed by a fully-connected layer.
Context components The token component corresponds to sentence-level features normally defined for EL and encodes the list of question tokens x into a fixed size vector. It maps the tokens in x to d w -dimensional pre-trained word embeddings, using a matrix W ∈ R |V w |×d w , where |V w | is the size of the vocabulary. We use 50-dimensional GloVe embeddings pre-trained on a 6 billion tokens corpus (Pennington et al., 2014). The word embeddings are concatenated with d p -dimensional position embeddings P w ∈ R 3×d p that are used to denote the tokens that are part of the target n-gram. The concatenated embeddings are processed by DCNN w to get a vector o s .
The character component processes the target token n-gram n on the basis of individual characters. We add one token on the left and on the right to the target mention and map the string of characters to d z -character embeddings, Z ∈ R |V z |×d z . We concatenate the character embeddings with d p -dimensional position embeddings P z ∈ R |x|×d p and process them with DCNN c to get a feature vector o n .
We use the character component with the same learned parameters to encode the label of a candidate entity from the KB as a vector o l . The parameter sharing between mention encoding and entity label encoding ensures that the representation of a mention is similar to the entity label.
The KB structure is the highest context level included in the model. The knowledge base structure component models the entities and relations that are connected to the candidate entity c. First, we map a list of relations r of the candidate entity to d r -dimensional pre-trained relations embeddings, using a matrix R ∈ R |V r |×d r , where |V r | is the number of relation types in the KB. We transform the relations embeddings with a single fullyconnected layer f r and then apply a max pooling operation to get a single relation vector o r per entity. Similarly, we map a list of entities that are immediately connected to the candidate entity e  to d e -dimensional pre-trained entity embeddings, using a matrix E ∈ R |V e |×d e , where |V e | is the number of entities in the KB. The entity embeddings are transformed by a fully-connected layer f e and then also pooled to produce the output o e . The embedding of the candidate entity itself is also transformed with f e and is stored as o d . To train the knowledge base embeddings, we use the TransE algorithm (Bordes et al., 2013).
Finally, the knowledge base lexical component takes the labels of the relations in r to compute lexical relation embeddings. For each r ∈ r, we tokenize the label and map the tokens x r to word embeddings, using the word embedding matrix W.
To get a single lexical embedding per relation, we apply max pooling and transform the output with a fully-connected layer f rl . The lexical relation embeddings for the candidate entity are pooled into the vector o rl .

Context Aggregation
The different levels of context are aggregated and are transformed by a sequence of fully-connected layers into a final vector o c for the n-gram n and the candidate entity c. The vectors for each candidate are aggregated into a matrix O = [o c |c ∈ C]. We apply element-wise max pooling on O to get a single summary vector s for all entity candidates for n.
To get the ranking score p c for each entity candidate c, we apply a single fully-connected layer g c on the concatenation of o c and the summary vector s: p c = g c (o c s). For the mention detection score for the n-gram, we separately concatenate the vectors for the token context o s and the character context o n and transform them with an array of fully-connected layers into a vector o t . We concatenate o t with the summary vector s and apply another fully-connected layer to get the mention detection score p n = σ (g n (o t s)).

Global entity assignment
The first step in our system is extracting all possible overlapping n-grams from the input texts. We assume that each span in the input text can only refer to a single entity and therefore resolve overlaps by computing a global assignment using the model scores for each n-gram (Step 4 in Figure 3).
If the mention detection score p n is above the 0.5-threshold, the n-gram is predicted to be a correct entity mention and the ranking scores p c are used to disambiguate it to a single entity candidate. N-grams that have p n lower than the threshold are filtered out.
We follow Guo et al. (2013a) in computing the global assignment and hence, arrange all n-grams selected as mentions into non-overlapping combinations and use the individual scores p n to compute the probability of each combination. The combination with the highest probability is selected as the final set of entity mentions. We have observed in practice a similar effect as descirbed by Strubell et al. (2017), namely that DCNNs are able to capture dependencies between different entity mentions in the same context and do not tend to produce overlapping mentions.

Composite Loss Function
Our model jointly computes two scores for each n-gram: the mention detection score p n and the disambiguation score p c . We optimize the parameters of the whole model jointly and use the loss function that combines penalties for the both scores for all n-grams in the input question: where t n is the target for mention detection and is either 0 or 1, t c is the target for disambiguation and ranges from 0 to the number of candidates |C|.
For the mention detection loss M , we include a weighting parameter α for the negative class as the majority of the instances in the data are negative: The disambiguation detection loss D is a maximum margin loss: where m is the margin value. We set m = 0.5, whereas the α weight is optimized with the other hyper-parameters.

Datasets
We compile two new datasets for entity linking on questions that we derive from publicly available question answering data: WebQSP (Yih et al., 2016) and GraphQuestions (Su et al., 2016). WebQSP contains questions that were originally collected for the WebQuestions dataset from web search logs (Berant et al., 2013). They were manually annotated with SPARQL queries that can be executed to retrieve the correct answer to each question. Additionally, the annotators have also selected the main entity in the question that is central to finding the answer. The annotations and the query use identifiers from the Freebase knowledge base.
We extract all entities that are mentioned in the question from the SPARQL query. For the main entity, we also store the correct span in the text, as annotated in the dataset. In order to be able to use Wikidata in our experiments, we translate the Freebase identifiers to Wikidata IDs.
The second dataset, GraphQuestions, was created by collecting manual paraphrases for automatically generated questions (Su et al., 2016). The dataset is meant to test the ability of the system to understand different wordings of the same question. In particular, the paraphrases include various references to the same entity, which creates a challenge for an entity linking system. The following  are three example questions from the dataset that contain a mention of the same entity: (1) a. what is the rank of marvel's iron man? b. iron-man has held what ranks? c. tony stark has held what ranks?
GraphQuestions does not contain main entity annotations, but includes a SPARQL query structurally encoded in JSON format. The queries were constructed manually by identifying the entities in the question and selecting the relevant KB relations. We extract gold entities for each question from the SPARQL query and map them to Wikidata.
We split the WebQSP training set into train and development subsets to optimize the neural model. We use the GraphQuestions only in the evaluation phase to test the generalization power of our model. The sizes of the constructed datasets in terms of the number of questions and the number of entities are reported in Table 1. In both datasets, each question contains at least one correct entity mention.

Evaluation Methodology
We use precision, recall and F1 scores to evaluate and compare the approaches. We follow Carmel et al. (2014) and Yang and Chang (2015) and define the scores on a per-entity basis. Since there are no mention boundaries for the gold entities, an extracted entity is considered correct if it is present in the set of the gold entities for the given question. We compute the metrics in the micro and macro setting. The macro values are computed per entity class and averaged afterwards.
For the WebQSP dataset, we additionally perform a separate evaluation using only the information on the main entity. The main entity has the information on the boundary offsets of the correct mentions and therefore for this type of evaluation, we enforce that the extracted mention has to over-emb. size filter size d w d z d e d r d p DCNN w DCNN c α 50 25 50 50 5 64 64 0.5 Table 3: Best configuration for the VCG model lap with the correct mention. QA systems need at least one entity per question to attempt to find the correct answer. Thus, evaluating using the main entity shows how the entity linking system fulfills this minimum requirement.

Baselines
Existing systems In our experiments, we compare to DBPedia Spotlight that was used in several QA systems and represents a strong baseline for entity linking 4 . In addition, we are able to compare to the state-of-the-art S-MART system, since their output on the WebQSP datasets was publicly released 5 . The S-MART system is not openly available, it was first trained on the NEEL 2014 Twitter dataset and later adapted to the QA data (Yih et al., 2015). We also include a heuristics baseline that ranks candidate entities according to their frequency in Wikipedia. This baseline represents a reasonable lower bound for a Wikidata based approach.
Simplified VCG To test the effect of the end-toend context encoders of the VCG network, we define a model that instead uses a set of features commonly suggested in the literature for EL on noisy data. In particular, we employ features that cover (1) frequency of the entity in Wikipedia, (2) edit distance between the label of the entity and the token n-gram, (3) number of entities and relations immediately connected to the entity in the KB, (4) word overlap between the input question and the labels of the connected entities and relations, (5) length of the n-gram. We also add an average of the word embeddings of the question tokens and, separately, an average of the embeddings of tokens of entities and relations connected to the entity candidate. We train the simplified VCG model by optimizing the same loss function in Section 3.3 on the same data.

Practical considerations
The hyper-parameters of the model, such as the dimensionality of the layers and the size of embed- 4 We use the online end-point: http://www. dbpedia-spotlight.org/api 5 https://github.com/scottyih/STAGG    Table 3 lists the main selected hyperparameters for the VCG model 6 and we also report the results for each model's best configuration on the development set in Table 2. We observe that our model achieves the most gains in precision compared to the baselines and the previous stateof-the-art for QA data. VCG constantly outperforms the simplified VCG baseline that was trained by optimizing the same loss function but uses manually defined features. Thereby, we confirm the advantage of the mixing context granularities strategy that was suggested in this work. Most importantly, the VCG model achieves the best macro result which indicates that the model has a consistent performance on different entity classes. We further evaluate the developed VCG architecture on the GraphQuestions dataset against the DBPedia Spotlight. We use this dataset to evaluate VCG in an out-of-domain setting: neither our system nor DBPedia Spotlight were trained on it.

Results
The results for each model are presented in Table 5. We can see that GraphQuestions provides a much more difficult benchmark for EL. The VCG model shows the overall F-score result that is better than the DBPedia Spotlight baseline by a wide margin. It is notable that again our model achieves higher precision values as compared to other approaches and manages to keep a satisfactory level of recall.
Analysis In order to better understand the performance difference between the approaches and the gains of the VCG model, we analyze the results per entity class (see Figure 5). We see that the S-MART system is slightly better in the disambiguation of Locations, Person names and a similar category of Fictional Character names, while it has  Table 6: Ablation experiments for the VCG model on WEBQSP a considerable advantage in processing of Professions and Common Nouns. Our approach has an edge in such entity classes as Organization, Things and Products. The latter category includes movies, book titles and songs, which are particularly hard to identify and disambiguate since any sequence of words can be a title. VCG is also considerably better in recognizing Events. We conclude that the future development of the VCG architecture should focus on the improved identification and disambiguation of professions and common nouns.
To analyze the effect that mixing various context granularities has on the model performance, we include ablation experiment results for the VCG model (see Table 6). We report the same scores as in the main evaluation but without individual model components that were described in Section 3.
We can see that the removal of the KB structure information encoded in entity and relation embeddings results in the biggest performance drop of almost 10 percentage points. The character-level information also proves to be highly important for the final state-of-the-art performance. These aspects of the model (the comprehensive representation of the KB structure and the character-level information) are two of the main differences of our approach to the previous work. Finally, we see that excluding the token-level input and the lexical information about the related KB relations also decrease the results, albeit less dramatically.

Conclusions
We have described the task of entity linking on QA data and its challenges. The suggested new approach for this task is a unifying network that models contexts of variable granularity to extract features for mention detection and entity disambiguation. This system achieves state-of-the-art results on two datasets and outperforms the previous best system used for EL on QA data. The results further verify that modeling different types of context helps to achieve a better performance across various entity classes (macro f-score).
Most recently, Peng et al. (2017) and Yu et al. (2017) have attempted to incorporate entity linking into a QA model. This offers an exciting future direction for the Variable Context Granularity model.