Collective Entity Disambiguation with Structured Gradient Tree Boosting

We present a gradient-tree-boosting-based structured learning model for jointly disambiguating named entities in a document. Gradient tree boosting is a widely used machine learning algorithm that underlies many top-performing natural language processing systems. Surprisingly, most works limit the use of gradient tree boosting to regular classification or regression problems, despite the structured nature of language. To the best of our knowledge, our work is the first to employ the structured gradient tree boosting (SGTB) algorithm for collective entity disambiguation. By defining global features over previous disambiguation decisions and jointly modeling them with local features, our system is able to produce globally optimized entity assignments for mentions in a document. Exact inference is prohibitively expensive for our globally normalized model. To solve this problem, we propose Bidirectional Beam Search with Gold path (BiBSG), an approximate inference algorithm that is a variant of the standard beam search algorithm. BiBSG makes use of global information from both past and future to perform better local search. Experiments on standard benchmark datasets show that SGTB significantly improves upon published results. Specifically, SGTB outperforms the previous state-of-the-art neural system by nearly 1% absolute accuracy on the popular AIDA-CoNLL dataset.


Introduction
Entity disambiguation (ED) refers to the process of linking an entity mention in a document to its corresponding entity record in a reference knowledge base (e.g., Wikipedia or Freebase). As a core information extraction task, ED plays an important role in the language understanding pipeline, underlying a variety of downstream applications such as relation extraction (Mintz et al., 2009; Riedel et al., 2010), knowledge base population (Ji and Grishman, 2011; Dredze et al., 2010), and question answering (Berant et al., 2013; Yih et al., 2015). This task is challenging because of the inherent ambiguity between mentions and the referred entities. Consider, for example, the mention 'Washington', which can be linked to a city, a state, a person, a university, or a lake (Fig. 1).
Fortunately, simple and effective features have been proposed to address this ambiguity. They are designed to model the similarity between a mention (and its local context) and a candidate entity, as well as the relatedness between entities that co-occur in a single document. These are typically statistical features estimated from entity-linked corpora, and similarity features that are pre-computed using distance metrics such as cosine. For example, a key feature for ED is the prior probability of an entity given a specific mention, which is estimated from mention-entity co-occurrence statistics. This simple feature alone can yield 70% to 80% accuracy on both news and Twitter texts (Lazic et al., 2015; Guo et al., 2013).
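The prior probability feature can be illustrated with a small sketch. It assumes that co-occurrence statistics are available as a list of (mention, entity) link pairs; the function name `mention_priors` and the toy counts are hypothetical and are not taken from any real corpus.

```python
from collections import Counter, defaultdict


def mention_priors(links):
    """Estimate p(entity | mention) from (mention, entity) link pairs."""
    counts = defaultdict(Counter)
    for mention, entity in links:
        counts[mention][entity] += 1
    priors = {}
    for mention, entity_counts in counts.items():
        total = sum(entity_counts.values())
        priors[mention] = {e: c / total for e, c in entity_counts.items()}
    return priors


# Toy, made-up link statistics (illustration only).
links = [
    ("Washington", "Washington,_D.C."),
    ("Washington", "Washington,_D.C."),
    ("Washington", "Washington_(state)"),
    ("Washington", "George_Washington"),
]
priors = mention_priors(links)
# A prior-only baseline simply picks the most frequent entity for the mention.
best = max(priors["Washington"], key=priors["Washington"].get)
```

A prior-only system of this kind ignores context entirely, which is why the local and global features discussed next are needed.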
To capture the non-linear relationships between the low-dimensional dense features like statistical features, sophisticated machine learning models such as neural networks and gradient tree boosting are preferred over linear models. In particular, gradient tree boosting has been shown to be highly competitive for ED in recent work (Yang and Chang, 2015; Yamada et al., 2016). However, although achieving appealing results, existing gradient-tree-boosting-based ED systems typically operate on each individual mention, without attempting to jointly resolve all entity mentions in a document. Joint entity disambiguation has been shown to significantly boost performance when used in conjunction with other machine learning techniques (Ratinov et al., 2011; Hoffart et al., 2011). However, how to train a global gradient tree boosting model that produces coherent entity assignments for all the mentions in a document is still an open question.
In this work, we present, to the best of our knowledge, the first structured gradient tree boosting (SGTB) model for collective entity disambiguation. Building on the general SGTB framework introduced by Yang and Chang (2015), we develop a globally normalized model for ED that employs a conditional random field (CRF) objective (Lafferty et al., 2001). The model permits the utilization of global features defined between the current entity candidate and the entire decision history for previous entity assignments, which enables global optimization over all the entity mentions in a document. As discussed in prior work (Smith and Johnson, 2007; Andor et al., 2016), globally normalized models are more expressive than locally normalized models.
As in many other global models, our SGTB model suffers from the difficulty of computing the partition function (normalization term) for training and inference. We adopt beam search to address this problem, in which we keep track of multiple hypotheses and sum over the paths in the beam. In particular, we propose the Bidirectional Beam Search with Gold path (BiBSG) technique, which is specifically designed for SGTB model training. Compared to standard beam search strategies, BiBSG reduces model variance and can consider both past and future information when predicting an output.
Our contributions are:
• We propose an SGTB model for collectively disambiguating entities in a document. By jointly modeling local decisions and global structure, SGTB is able to produce globally optimized entity assignments for all the mentions.
• We present BiBSG, an efficient algorithm for approximate bidirectional inference. The algorithm is tailored to SGTB models and reduces model variance by generating more point-wise functional gradients for estimating the auxiliary regression models.
• SGTB achieves state-of-the-art (SOTA) results on various popular ED datasets, and it outperforms the previous SOTA systems by 1-2% absolute accuracy on the AIDA-CoNLL (Hoffart et al., 2011) dataset.

Model
In this section, we present an SGTB model for collective entity disambiguation. We first formally define the task of ED, and then describe a structured learning formulation for producing globally coherent entity assignments for mentions in a document. Finally, we show how to optimize the model using functional gradient descent.
For an input document, assume that we are given all the mentions of named entities within it. Also assume that we are given a lexicon that maps each mention to a set of entity candidates in a given reference entity database (e.g., Wikipedia or Freebase). The ED system maps each mention in the document to an entry in the entity database. Since a mention is often ambiguous on its own (i.e., the lexicon maps the mention to multiple entity candidates), the ED system needs to leverage two types of contextual information for disambiguation: local information based on the entity mention and its surrounding words, and global information that exploits the document-level coherence of the predicted entities. Note that modeling entity-entity coherence is very challenging, as the long-range dependencies between entities correspond to an exponentially large search space.
We formalize this task as a structured learning problem. Let $x$ be a document with $T$ target mentions, and $y = \{y_t\}_{t=1}^{T}$ be the entity assignments of the mentions in the document. We use $S(x, y)$ to denote the joint scoring function between the input document and the output structure. In traditional NLP tasks, such as part-of-speech tagging and named entity recognition, we often rely on low-order Markov assumptions to decompose the global scoring function into a summation of local functions. ED systems, however, are often required to model nonlocal phenomena, as any pair of entities is potentially interdependent. Therefore, we choose the following decomposition:

$$S(x, y) = \sum_{t=1}^{T} F(x, y_t, y_{1:t-1}), \tag{1}$$

where $F(x, y_t, y_{1:t-1})$ is a factor scoring function. Specifically, a local prediction $y_t$ depends on all the previous decisions, $y_{1:t-1}$, in our model, which resembles recurrent neural network (RNN) models (Elman, 1990; Hochreiter and Schmidhuber, 1997). We adopt a CRF loss objective, and define a distribution over possible output structures as follows:

$$p(y \mid x) = \frac{\exp\{\sum_{t=1}^{T} F(x, y_t, y_{1:t-1})\}}{Z(x)}, \tag{2}$$

where $Z(x) = \sum_{y' \in Gen(x)} \exp\{\sum_{t=1}^{T} F(x, y'_t, y'_{1:t-1})\}$ and $Gen(x)$ is the set of all possible sequences of entity assignments depending on the lexicon. $Z(x)$ is then a global normalization term. As shown in previous work, globally normalized models are very expressive, and also avoid the label bias problem (Lafferty et al., 2001; Andor et al., 2016). The inference problem is to find

$$\arg\max_{y \in Gen(x)} \sum_{t=1}^{T} F(x, y_t, y_{1:t-1}). \tag{3}$$

Figure 1: (a) Example document $x$ with entity candidates for each mention (gold entities are in bold); (b) the $m$-th SGTB update iteration: (i) conduct beam search to sample candidate entity sequences (§ 3), (ii) compute point-wise functional gradients for each candidate sequence, (iii) fit a regression tree to the negative functional gradient points with input features, $\phi$, (iv) update the factor scoring function, $F$, by adding the trained regression tree.
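The decomposition of the joint score and the global normalization can be made concrete on a toy problem that is small enough to enumerate Gen(x) exactly. This is a minimal sketch: the factor function `toy_factor`, the score tables `LOCAL` and `RELATED`, and the entity names are hypothetical stand-ins for a trained factor scoring function.

```python
import itertools
import math


def joint_score(factor, x, y):
    """S(x, y): sum of factor scores F(x, y_t, y_{1:t-1}) over all steps."""
    return sum(factor(x, y[t], y[:t]) for t in range(len(y)))


def exact_distribution(factor, x, candidates):
    """Enumerate Gen(x) and return {sequence: p(sequence | x)}."""
    sequences = list(itertools.product(*candidates))
    scores = [joint_score(factor, x, y) for y in sequences]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    z = sum(weights)  # partition function Z(x), up to the constant shift
    return {y: w / z for y, w in zip(sequences, weights)}


# Hypothetical local scores and pairwise relatedness (illustration only).
LOCAL = {"Washington,_D.C.": 1.0, "Washington_(state)": 0.6,
         "Seattle": 1.0, "Seattle_(album)": 0.1}
RELATED = {("Washington_(state)", "Seattle"): 2.0}


def toy_factor(x, y_t, history):
    """Local score of y_t plus its relatedness to every earlier decision."""
    coherence = sum(RELATED.get((prev, y_t), 0.0) for prev in history)
    return LOCAL[y_t] + coherence


candidates = [["Washington,_D.C.", "Washington_(state)"],
              ["Seattle", "Seattle_(album)"]]
dist = exact_distribution(toy_factor, None, candidates)
best = max(dist, key=dist.get)
```

In this toy example the locally preferred candidate for the first mention loses to the globally coherent assignment. Enumeration grows exponentially with the number of mentions, which is the reason approximate inference is needed for real documents.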

Structured gradient tree boosting
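An overview of our SGTB model is shown in Fig. 1. The model minimizes the negative log-likelihood of the data,

$$L(y^*, S(x, y)) = -\log p(y^* \mid x) = \log Z(x) - S(x, y^*), \tag{4}$$

where $y^*$ is the gold output structure.
In a standard CRF, the factor scoring function is typically assumed to have this form: $F(x, y_t, y_{1:t-1}) = \theta^\top \phi(x, y_t, y_{1:t-1})$, where $\phi(x, y_t, y_{1:t-1})$ is the feature function and $\theta$ are the model parameters. The key idea of SGTB is that, instead of defining a parametric model and optimizing its parameters, we can directly optimize the factor scoring function $F(\cdot)$ iteratively by performing gradient descent in function space. In particular, suppose $F(\cdot) = F_{m-1}(\cdot)$ in the $m$-th iteration; we update $F(\cdot)$ as follows:

$$F_m(x, y_t, y_{1:t-1}) = F_{m-1}(x, y_t, y_{1:t-1}) - \eta_m \, g_m(x, y_t, y_{1:t-1}), \tag{5}$$

where

$$g_m(x, y_t, y_{1:t-1}) = \frac{\partial L(y^*, S(x, y))}{\partial F(x, y_t, y_{1:t-1})} = p(y_{1:t} \mid x) - \mathbb{1}[y_{1:t} = y^*_{1:t}] \tag{6}$$

is the functional gradient, $\eta_m$ is the learning rate, and $\mathbb{1}[\cdot]$ represents an indicator function, which returns 1 if the predicted sequence matches the gold one, and 0 otherwise. We initialize $F(\cdot)$ to 0 ($F_0(\cdot) = 0$). We can approximate the negative functional gradient $-g_m(\cdot)$ with a regression tree model $h_m(\cdot)$ by fitting the training data $\{\phi(x, y_t, y_{1:t-1})\}$ to the point-wise negative functional gradients (also known as residuals) $\{-g_m(x, y_t, y_{1:t-1})\}$. Then the factor scoring function can be obtained by

$$F(x, y_t, y_{1:t-1}) = \sum_{m=1}^{M} \eta_m \, h_m(x, y_t, y_{1:t-1}), \tag{7}$$

where $h_m(x, y_t, y_{1:t-1})$ is called a basis function. We set $\eta_m = 1$ in this work.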
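The form of the functional gradient follows from differentiating the negative log-likelihood with respect to a single factor score. A short derivation sketch, writing $L$ for the negative log-likelihood and $y'$ for a generic sequence in $Gen(x)$, and assuming (as the factorization suggests) that each factor is identified by the prefix $y_{1:t}$:

$$
\begin{aligned}
\frac{\partial L}{\partial F(x, y_t, y_{1:t-1})}
&= \frac{\partial \log Z(x)}{\partial F(x, y_t, y_{1:t-1})} - \frac{\partial S(x, y^*)}{\partial F(x, y_t, y_{1:t-1})} \\
&= \sum_{y' \in Gen(x):\; y'_{1:t} = y_{1:t}} \frac{\exp S(x, y')}{Z(x)} - \mathbb{1}[y_{1:t} = y^*_{1:t}] \\
&= p(y_{1:t} \mid x) - \mathbb{1}[y_{1:t} = y^*_{1:t}].
\end{aligned}
$$

The negative gradient is therefore non-negative for gold prefixes, which pushes their scores up, and non-positive for every other prefix in proportion to its current probability.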

Training
Training the SGTB model requires computing the point-wise functional gradients with respect to training documents and candidate entity sequences. This is challenging, due to the exponential output structure search space. First, we are not able to enumerate all possible candidate entity sequences. Second, computing the conditional probabilities shown in Eq. 6 is intractable, as it is prohibitively expensive to compute the partition function Z(x) in Eq. 2. Beam search can be used to address these problems. We can compute point-wise functional gradients for candidate entity sequences in the beam, and approximately compute the partition function by summing over the elements in the beam.
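The beam approximation described above can be sketched as follows. This is a simplified illustration that assumes the beam holds complete sequences with their joint scores; the function name `beam_gradients` and the data layout are hypothetical.

```python
import math
from collections import defaultdict


def beam_gradients(beam, gold):
    """Approximate g = p(y_{1:t} | x) - 1[y_{1:t} = y*_{1:t}] over a beam.

    beam: list of (sequence, joint_score) pairs, each sequence a tuple.
    gold: the gold sequence y* (a tuple of the same length).
    Returns {prefix: gradient} for every prefix of every beam sequence.
    """
    max_score = max(score for _, score in beam)
    weights = {seq: math.exp(score - max_score) for seq, score in beam}
    z = sum(weights.values())  # beam approximation of Z(x)

    # The probability of a prefix is the total mass of beam sequences
    # that start with that prefix.
    prefix_prob = defaultdict(float)
    for seq, w in weights.items():
        for t in range(1, len(seq) + 1):
            prefix_prob[seq[:t]] += w / z

    return {
        prefix: prob - (1.0 if prefix == gold[:len(prefix)] else 0.0)
        for prefix, prob in prefix_prob.items()
    }


# Two equally scored hypotheses that share their first decision.
grads = beam_gradients([(("a", "b"), 0.0), (("a", "c"), 0.0)], gold=("a", "b"))
```

The negated values would serve as regression targets for the next tree, paired with the feature vectors of the corresponding prefixes.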
In this section, we present a bidirectional beam search training algorithm that always keeps the gold sequence in the beam. The algorithm is tailored to SGTB, and improves on standard training methods in two respects: (1) it reduces model variance by collecting more point-wise functional gradients to train a regression tree; (2) it leverages information from both past and future to conduct better local search.

Beam search with gold path
The early update (Collins and Roark, 2004) and LaSO (Daumé III and Marcu, 2005; Xu and Fern, 2007) strategies are widely adopted with beam search for updating model parameters in previous work. Both methods keep track of the location of the gold path in the beam while decoding a training sequence. A gradient update step is taken if the gold path falls out of the beam at a specific time step t or after the last step T. Adapting the strategies to SGTB training is straightforward: we compute point-wise functional gradients for all candidate entity sequences after time step T or when the gold sequence falls out of the beam. Both early update and LaSO are typically applied to online learning scenarios, in which model parameters are updated after passing one or a few training sequences. SGTB training, however, fits the batch learning paradigm. In each training epoch, an SGTB model is updated only once, using the regression tree model fit on the point-wise negative functional gradients. The gradients are calculated with respect to the output sequences obtained from beam search. We propose a simple training strategy that computes and collects point-wise functional gradients at every step of a training sequence. In addition, instead of passively monitoring the gold path, we always keep the gold path in the beam to ensure that we have valid functional gradients at each time step. The new beam search training method, Beam Search with Gold path (BSG), generates many more point-wise functional gradients than early update or LaSO, which can reduce the variance of the auxiliary regression tree model. As a result, SGTB trained with BSG consistently outperforms early update or LaSO in our exploratory experiments, and it also requires fewer training epochs to converge.
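A single decoding step of beam search with the gold path kept in the beam might look like the sketch below. The text does not specify how the gold path is re-inserted when it would otherwise be pruned; replacing the lowest-scoring hypothesis is an assumption made here, and the function name and argument layout are hypothetical.

```python
def beam_step_with_gold(beam, candidates_t, factor, x, gold_prefix, beam_size):
    """Extend every hypothesis by one mention and keep the top `beam_size`
    sequences, forcing the gold prefix to stay in the beam.

    beam: list of (prefix_tuple, joint_score) pairs.
    candidates_t: entity candidates for the current mention.
    gold_prefix: gold assignments up to and including the current mention.
    """
    expanded = [
        (prefix + (y_t,), score + factor(x, y_t, prefix))
        for prefix, score in beam
        for y_t in candidates_t
    ]
    expanded.sort(key=lambda item: item[1], reverse=True)
    new_beam = expanded[:beam_size]
    if all(prefix != gold_prefix for prefix, _ in new_beam):
        # Assumption: the gold hypothesis replaces the lowest-scoring one.
        gold_item = next(
            (item for item in expanded if item[0] == gold_prefix), None
        )
        if gold_item is not None:
            new_beam[-1] = gold_item
    return new_beam


def toy_factor(x, y_t, prefix):
    """Hypothetical factor that ignores the history."""
    return {"a": 2.0, "b": 1.0, "c": 0.0}[y_t]


# The gold entity "c" scores lowest but is still retained.
step = beam_step_with_gold([((), 0.0)], ["a", "b", "c"], toy_factor,
                           None, gold_prefix=("c",), beam_size=2)
```

Because the gold prefix is present at every step, a gradient with a positive target is available at every time step, not only at the point where the gold path would have been pruned.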

Bidirectional beam search
During beam search, if we consider a decision made at time step $t$, the joint probability $p(y \mid x)$ can be factorized around $t$ as follows:

$$p(y \mid x) = p(y_{1:t-1} \mid x) \cdot p(y_t \mid y_{1:t-1}, x) \cdot p(y_{t+1:T} \mid y_t, y_{1:t-1}, x).$$
Traditional beam search performs inference in a unidirectional (left-to-right) fashion. Since the beam search at time step t considers only the beam sequences that were committed to so far, {y 1:t−1 }, it effectively approximates the above probability by assuming that all futures are equally likely, i.e. p(y t+1:T |y t , y 1:t−1 , x) is uniform. Therefore, at any given time, there is no information from the future when incorporating the global structure.
In this work, we adopt a Bidirectional Beam Search (BiBS) methodology that incorporates multiple beams to take future information into account (Sun et al., 2017). It makes two simplifying assumptions that better approximate the joint probability above while remaining tractable: (1) future predictions are independent of past predictions given $y_t$; (2) $p(y_t)$ is uniform. These yield the following approximation:

$$p(y_{t+1:T} \mid y_t, y_{1:t-1}, x) = p(y_{t+1:T} \mid y_t, x) \propto p(y_t \mid y_{t+1:T}, x) \cdot p(y_{t+1:T} \mid x).$$
In Sun et al. (2017), these two terms are retrieved from forward and backward recurrent networks, whereas in our work we use the joint scores (log probabilities shown in Eq. 1) computed for partial sequences from forward and backward beams.
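The approximation can be traced back to Bayes' rule. A short derivation sketch, reading assumption (2) as saying that $p(y_t \mid x)$ is constant in $y_t$ (the conditioning on $x$ is an interpretive assumption, since the text writes only $p(y_t)$):

$$
\begin{aligned}
p(y_{t+1:T} \mid y_t, y_{1:t-1}, x)
&= p(y_{t+1:T} \mid y_t, x) && \text{(assumption 1)} \\
&= \frac{p(y_t \mid y_{t+1:T}, x)\, p(y_{t+1:T} \mid x)}{p(y_t \mid x)} && \text{(Bayes' rule)} \\
&\propto p(y_t \mid y_{t+1:T}, x)\, p(y_{t+1:T} \mid x) && \text{(assumption 2)}.
\end{aligned}
$$

Since $p(y_{1:t-1} \mid x)\, p(y_t \mid y_{1:t-1}, x) = p(y_{1:t} \mid x)$ and $p(y_t \mid y_{t+1:T}, x)\, p(y_{t+1:T} \mid x) = p(y_{t:T} \mid x)$, substituting into the factorization gives $p(y \mid x) \propto p(y_{1:t} \mid x) \cdot p(y_{t:T} \mid x)$ under these assumptions. This is consistent with combining a forward score for the prefix ending at $y_t$ with a backward score for the suffix starting at $y_t$.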
Algorithm 1: Bidirectional Beam Search with Gold path (BiBSG)

The full inference algorithm, Bidirectional Beam Search with Gold path (BiBSG), is presented in Alg. 1. When performing the forward pass to update the forward beam, forward joint scores, $S(x, y_{1:t})$, are computed with respect to the current forward beam, and backward joint scores, $S(x, y_{T:t})$, are computed with respect to the previous backward beam. A similar procedure is used for the backward pass. The search converges very fast, and we use two rounds of bidirectional search as a good approximation. Finally, SGTB-BiBSG compares the conditional probabilities $p(y^{(\cdot)} \mid x)$ of the best scoring output sequences $y^{(F)}$ and $y^{(B)}$ obtained from the forward and backward beams. The final prediction is the sequence with the higher conditional probability score.

Implementation
We provide implementation details of our SGTB systems, including entity candidate generation, adopted local and global features, and some efforts to make training and inference faster.

Candidate selection
We use a mention prior p(y|x) to select entity candidates for a mention x. Following Ganea and Hofmann (2017), the prior is computed by averaging mention prior probabilities built from mention-entity hyperlink statistics from Wikipedia and a large Web corpus (Spitkovsky and Chang, 2012). Given a mention, we select the top 30 entity candidates according to p(y|x).
We also use a simple heuristic proposed by Ganea and Hofmann (2017) to improve candidate selection for persons: for a mention x, if there are mentions of persons that contain x as a continuous subsequence of words, then we consider the candidate set obtained from the longest mention for the mention x.
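The candidate selection steps can be sketched as below. Several details are assumptions made for illustration: an entity missing from one prior is treated as having probability 0 in it, mention length is measured in words, and all function names are hypothetical.

```python
def average_priors(prior_a, prior_b):
    """Average two {entity: probability} priors (a missing entity counts as 0)."""
    entities = set(prior_a) | set(prior_b)
    return {e: 0.5 * (prior_a.get(e, 0.0) + prior_b.get(e, 0.0)) for e in entities}


def top_candidates(prior, k=30):
    """Return the k entities with the highest prior, best first."""
    return sorted(prior, key=lambda e: (-prior[e], e))[:k]


def contains_as_contiguous_words(longer, shorter):
    """True if the words of `shorter` appear consecutively in `longer`."""
    long_words, short_words = longer.split(), shorter.split()
    n = len(short_words)
    return any(long_words[i:i + n] == short_words
               for i in range(len(long_words) - n + 1))


def person_heuristic(mention, person_mentions):
    """Return the longest person mention that contains `mention`, else `mention`."""
    containing = [m for m in person_mentions
                  if m != mention and contains_as_contiguous_words(m, mention)]
    return max(containing, key=lambda m: len(m.split()), default=mention)


# Made-up priors from two resources for the same mention.
wiki_prior = {"A": 0.8, "B": 0.2}
web_prior = {"A": 0.4, "C": 0.6}
candidates = top_candidates(average_priors(wiki_prior, web_prior), k=2)
lookup = person_heuristic("Obama", ["Barack Obama", "Michelle"])
```

With the person heuristic, the candidate set for the short mention would be looked up under the longer mention returned by `person_heuristic`.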

Features
The feature function φ(x, y t , y 1:t−1 ) can be decomposed into the summation of a local feature function φ L (x, y t ) and a global feature function φ G (y t , y 1:t−1 ).

Local features
We consider standard local features that have been used in prior work, including mention priors p(y|x) obtained from different resources; entity popularity features based on Wikipedia page view count statistics; named entity recognition (NER) type features given by an in-house NER system trained on the CoNLL 2003 NER data (Tjong Kim Sang and De Meulder, 2003); entity type features based on Freebase type information; and three textual similarity features proposed by Yamada et al. (2016).

Global features
Three features are utilized to characterize entity-entity relationships: entity-entity co-occurrence counts obtained from Wikipedia, and two cosine similarity scores between entity vectors, based on the entity embeddings from Ganea and Hofmann (2017) and the Freebase entity embeddings released by Google, respectively. We denote the entity-entity features between entities $y_t$ and $y_{t'}$ as $\phi_E(y_t, y_{t'})$.
At step $t$ of a training sequence, we quantify the coherence of $y_t$ with respect to previous decisions $y_{1:t-1}$ by first extracting entity-entity features between $y_t$ and $y_{t'}$, where $1 \le t' \le t - 1$, and then aggregating this information into a global feature vector $\phi_G(y_t, y_{1:t-1})$ of a fixed length, formed by concatenating ($\oplus$) the aggregated feature vectors.
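One way to turn a variable number of entity-entity feature vectors into a fixed-length vector is to pool them and concatenate the pooled results. The sketch below uses element-wise mean and max pooling; these two operators are an assumption chosen for illustration, since the text here only states that aggregated vectors are concatenated. The function name is hypothetical, and the case with no previous decisions is left to the caller.

```python
def global_features(pairwise):
    """Aggregate entity-entity feature vectors into one fixed-length vector.

    pairwise: list of equal-length vectors phi_E(y_t, y_t') for all t' < t.
    Returns the element-wise mean concatenated with the element-wise max
    (assumed pooling operators, not confirmed by the source text).
    """
    if not pairwise:
        raise ValueError("need at least one previous decision")
    dim = len(pairwise[0])
    mean = [sum(vec[i] for vec in pairwise) / len(pairwise) for i in range(dim)]
    maximum = [max(vec[i] for vec in pairwise) for i in range(dim)]
    return mean + maximum  # list concatenation plays the role of the ⊕ operator


# Two previous decisions, three entity-entity features each.
phi_g = global_features([[1.0, 0.0, 0.5], [3.0, 1.0, 0.5]])
```

The output length is twice the number of entity-entity features regardless of how many previous decisions exist, which is what a regression tree with a fixed input dimension requires.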

Efficiency
Global models are powerful and effective, but often at a cost of efficiency. We discuss ways to speed up training and inference for SGTB models. Many of the adopted features such as mention priors and entity-entity co-occurrences can be extracted once and retrieved later with just a hash map lookup. The most expensive features are the cosine similarity features based on word and entity embeddings. By normalizing the embeddings to have a unit norm, we can obtain the similarity features using dot products. We find this simple preprocessing makes feature extraction faster by two orders of magnitude.
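The normalization trick can be checked on a toy example: once embeddings have unit norm, the dot product equals the cosine similarity, so the norms need not be recomputed for every pair. The embedding values and entity names below are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are left as-is)."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity computed from scratch, norms included."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up 2-dimensional embeddings.
embeddings = {"Seattle": [3.0, 4.0], "Washington_(state)": [4.0, 3.0]}

# One-off preprocessing: normalize every embedding once.
unit = {name: normalize(vec) for name, vec in embeddings.items()}

fast = dot(unit["Seattle"], unit["Washington_(state)"])
slow = cosine(embeddings["Seattle"], embeddings["Washington_(state)"])
```

The actual speedup depends on the embedding dimension and on how the vectors are stored, so the pure-Python code here only demonstrates equivalence, not performance.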
SGTB training can be easily parallelized, as the computation of functional gradients is independent for different documents. During each training iteration, we randomly split training documents into different partitions, and then calculate the point-wise functional gradients for documents of different partitions in parallel.
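The partitioning scheme can be sketched with the Python standard library. A thread pool is used here only to keep the example short and self-contained; for CPU-bound gradient computation a process pool would be the more usual choice. The function names and the `gradient_fn` callback are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(documents, num_partitions, seed=0):
    """Shuffle the documents and deal them into `num_partitions` groups."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return [docs[i::num_partitions] for i in range(num_partitions)]


def gradients_for_partition(partition, gradient_fn):
    """Collect the (features, residual) points of every document in a partition."""
    points = []
    for doc in partition:
        points.extend(gradient_fn(doc))
    return points


def parallel_gradients(documents, gradient_fn, num_partitions=4):
    """Compute point-wise gradients per partition in parallel and merge them."""
    partitions = split_into_partitions(documents, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        chunks = pool.map(
            lambda part: gradients_for_partition(part, gradient_fn), partitions
        )
        return [point for chunk in chunks for point in chunk]


# Stand-in gradient function: one dummy point per document.
points = parallel_gradients(range(10), lambda doc: [(doc, -0.5)], num_partitions=3)
```

The merged points from all partitions would then be used to fit a single regression tree for the current iteration.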

Experiments
In this section, we evaluate SGTB on some of the most popular datasets for ED. After describing the experimental setup, we compare SGTB with previous state-of-the-art (SOTA) ED systems and present our main findings in § 5.3.

Data
We use six publicly available datasets to validate the effectiveness of SGTB. AIDA-CoNLL (Hoffart et al., 2011) is a widely adopted dataset for ED based on the CoNLL 2003 NER dataset (Tjong Kim Sang and De Meulder, 2003), further split into training (AIDA-train), development (AIDA-dev), and test (AIDA-test) sets. AQUAINT (Milne and Witten, 2008), MSNBC (Cucerzan, 2007), and ACE (Ratinov et al., 2011) are three datasets for Wikification, which also contain Wikipedia concepts beyond named entities. These datasets were recently cleaned and updated by Guo and Barbosa (2016). WIKI and CWEB are automatically annotated datasets built from the Wikipedia and ClueWeb corpora, respectively, by Guo and Barbosa (2016). The statistics of these datasets are available in Table 1.

Experimental settings
Following previous work (Guo and Barbosa, 2016; Ganea and Hofmann, 2017), we evaluate our models on both in-domain and cross-domain testing settings. In particular, we train our models on the AIDA-train set, tune hyperparameters on the AIDA-dev set, and test on the AIDA-test set (in-domain testing) and all other datasets (cross-domain testing). We follow prior work and report in-KB accuracies for AIDA-test and Bag-of-Title (BoT) F1 scores for the other test sets. Two AIDA-CoNLL specific resources have been widely used in previous work. In order to have fair comparisons with these works, we also adopt them only for the AIDA datasets. First, we use a mention prior obtained from aliases to candidate entities released by Hoffart et al. (2011) along with the two priors described in § 4.1. Second, we also experiment with PPRforNED, an entity candidate selection system released by Pershina et al. (2015). It is unclear how candidates were pruned, but the entity candidates generated by this system have high recall and low ambiguity, and they contribute to some of the best results reported for AIDA-test (Yamada et al., 2016; Sil et al., 2018).

Competitive systems
We implement four competitive ED systems, and three of them are based on variants of our proposed SGTB algorithm. Gradient tree boosting is a local model that employs only local features to make independent decisions for every entity mention. Note that our local model is different from that presented by Yamada et al. (2016), where they treat ED as binary classification for each mention-entity pair. SGTB-BS is a Structured Gradient Tree Boosting model trained with Beam Search with early update strategy. SGTB-BSG uses Beam Search with Gold path training strategy presented in § 3.1. Finally, SGTB-BiBSG exploits Bidirectional Beam Search with Gold path to leverage information from both past and future for better local search.
In addition, we compare against the best published results on all the datasets. To ensure fair comparisons, we group results according to the candidate selection system that each ED system adopts.

Parameter tuning
We tune all the hyperparameters on the AIDA-dev set. We use recommended hyperparameter values from scikit-learn to train regression trees, except for the maximum depth of the tree, which we choose from {3, 5, 8}. After a set of preliminary experiments, we select the beam size from {3, 4, 5, 6}. The best values for the two hyperparameters are 3 and 4 respectively. As mentioned in § 2, the learning rate is set to 1. We train SGTB for at most 500 epochs (i.e., fit at most 500 regression trees). During training, we check the performance on the development set every 25 epochs to perform early stopping. Training takes 3 hours for SGTB-BS and SGTB-BSG, and takes 9 hours for SGTB-BiBSG on 16 threads.
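The training schedule with periodic development-set checks can be sketched as a simple loop. The stopping rule used here (stop at the first check that fails to improve) is an assumption, since the text does not state the exact criterion; `fit_one_tree` and `evaluate_dev` are hypothetical callbacks.

```python
def train_with_early_stopping(fit_one_tree, evaluate_dev,
                              max_epochs=500, check_every=25):
    """Fit up to `max_epochs` trees; return (best_num_trees, best_dev_score).

    fit_one_tree(epoch): fits and adds the regression tree for this epoch.
    evaluate_dev(epoch): returns the development-set score of the current model.
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        fit_one_tree(epoch)
        if epoch % check_every == 0:
            score = evaluate_dev(epoch)
            if score <= best_score:
                break  # assumed rule: stop once the dev score stops improving
            best_score, best_epoch = score, epoch
    return best_epoch, best_score


# Made-up development curve that peaks after 75 trees.
fitted = []
curve = {25: 0.90, 50: 0.93, 75: 0.95, 100: 0.94}
result = train_with_early_stopping(fitted.append, lambda e: curve.get(e, 0.0))
```

At prediction time only the first `best_num_trees` trees would be summed, so trees fitted after the best checkpoint are discarded.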

Results
In-domain results
In-domain evaluation results are presented in Table 2. As shown, SGTB achieves much better performance than all previously published results. Specifically, SGTB-BiBSG outperforms the previous SOTA system (Ganea and Hofmann, 2017) by 0.8% accuracy, and improves upon the best published results when employing the PPRforNED candidate selection system by 1.9% accuracy. Global information is clearly useful, as it helps to boost the performance by 2-4 points of accuracy, depending on the candidate generation system. In terms of beam search training strategies, BiBSG consistently outperforms BSG and beam search with early update. By employing more point-wise functional gradients to train the regression trees and leveraging global information from both past and future to carry out local search, BiBSG is able to find better global solutions than alternative training strategies.
Our implementations are based on the scikit-learn package (Pedregosa et al., 2011).

Cross-domain results
As presented in Table 3, cross-domain experimental results are a little more mixed. SGTB-BS and SGTB-BSG perform quite competitively compared with SGTB-BiBSG. In a cross-domain evaluation setting, the test data is drawn from a different distribution than the training data. Therefore, less expressive models may be preferred as they may learn more abstract representations that will generalize better to out-of-domain data. Nevertheless, our SGTB models achieve better performance than the best published results on three of the five popular ED datasets. Specifically, SGTB-BS outperforms the prior SOTA system by absolute 4% F1 on the CWEB dataset, and SGTB-BiBSG performs consistently well across different datasets.

Related work

Entity disambiguation
Most ED systems consist of a local component that models relatedness between a mention and a candidate entity, as well as a global component that produces coherent entity assignments for all mentions within a document. Recent research has largely focused on joint resolution of entities, which is usually performed by maximizing the global topical coherence between entities. As discussed above, directly optimizing the coherence objective is computationally intractable, and several heuristics and approximations have been proposed to address the problem. Hoffart et al. (2011) use an iterative heuristic to remove unpromising mention-entity edges. Yamada et al. (2016) employ a two-stage approach, in which global information is incorporated in the second stage based on local decisions from the first stage. Approximate inference techniques have been widely adopted for ED. Cheng and Roth (2013) use an integer linear program (ILP) solver. Belief propagation (BP) and its variant loopy belief propagation (LBP) have been used by Ganea et al. (2016) and Ganea and Hofmann (2017) respectively. We employ another standard approximate inference algorithm, beam search, in this work. To make beam search a better fit for SGTB training, we propose BiBSG, which improves the stability and effectiveness of beam search training.

Structured gradient tree boosting
Gradient tree boosting has been used in some of the most accurate systems for a variety of classification and regression problems (Babenko et al., 2011; Wu et al., 2010; Yamada et al., 2016). However, gradient tree boosting is seldom studied in the context of structured learning, with only a few exceptions. Dietterich et al. (2004) propose TreeCRF that replaces the linear scoring function of a CRF with a scoring function given by a gradient tree boosting model. TreeCRF achieves comparable or better results than CRF on some linear chain structured prediction problems. Bagnell et al. (2007) extend the Maximum Margin Planning (MMP; Ratliff et al., 2006) algorithm to structured prediction problems by learning new features using gradient boosting machines. Yang and Chang (2015) present a general SGTB framework that is flexible in the choice of loss functions and specific structures. They also apply SGTB to the task of tweet entity linking with a special non-overlapping structure. By decomposing the structures into local substructures, exact inference is tractable in all the aforementioned works. Our work shows that we can train SGTB models efficiently and effectively even with approximate inference. This extends the utility of SGTB models to a wider range of interesting structured prediction problems.

Conclusion and future work
In this paper, we present a structured gradient tree boosting model for entity disambiguation. Entity coherence modeling is challenging, as exact inference is prohibitively expensive due to the pairwise entity relatedness terms in the objective function. We propose an approximate inference algorithm, BiBSG, that is designed specifically for SGTB to solve this problem. Experiments on benchmark ED datasets suggest that the expressive SGTB models are highly effective for ED. SGTB significantly outperforms all previous systems on the AIDA-CoNLL dataset, and it also achieves SOTA results on many other ED datasets even in the cross-domain evaluation setting. SGTB is a family of structured learning algorithms that can potentially be applied to other core NLP tasks. In the future, we would like to investigate the effectiveness of SGTB on other information extraction tasks, such as relation extraction and coreference resolution.