Learning Dynamic Context Augmentation for Global Entity Linking

Despite the recent success of collective entity linking (EL) methods, these "global" inference methods may yield sub-optimal results when the "all-mention coherence" assumption breaks, and often suffer from high computational cost at the inference stage due to the complex search space. In this paper, we propose a simple yet effective solution, called Dynamic Context Augmentation (DCA), for collective EL, which requires only one pass through the mentions in a document. DCA sequentially accumulates context information to make efficient, collective inference, and can be combined with different local EL models as a plug-and-enhance module. We explore both supervised and reinforcement learning strategies for learning the DCA model. Extensive experiments show the effectiveness of our model with different learning settings, base models, decision orders and attention mechanisms.


Introduction
Linking mentions of entities in text to knowledge base entries (i.e., entity linking, or EL) is critical to understanding and structuring text corpora. In general, EL is approached by first obtaining candidate entities for each mention, and then identifying the true referent among the candidate entities. Prior distribution and local contexts, either in the form of hand-crafted features (Ratinov et al., 2011; Shen et al., 2015) or dense embeddings (He et al., 2013; Nguyen et al., 2016; Francis-Landau et al., 2016), play key roles in distinguishing different candidates. However, in many cases, local features can be too sparse to provide sufficient information for disambiguation.
To alleviate this problem, various collective EL models have been proposed to globally optimize the inter-entity coherence between mentions in the same document (Hoffart et al., 2011; Cheng and Roth, 2013; Nguyen et al., 2014; Alhelbawy and Gaizauskas, 2014; Pershina et al., 2015). Despite their success, existing global EL models try to optimize the entire linking configuration of all mentions, with extra assumptions of either all-mention coherence or pairwise coherence (Phan et al., 2018). Such assumptions run against human intuition, as they imply that no inference can be made until all mentions in a document have been observed. Also, there usually exists a trade-off between accuracy and efficiency: state-of-the-art collective/global models suffer from high time complexity. From the perspective of computational efficiency, optimal global configuration inference is NP-hard. Approximation methods, such as loopy belief propagation (Ganea and Hofmann, 2017) or iterative substitutions (Shen et al., 2015), are still computationally expensive due to the huge hypothesis space, and thus can hardly be scaled to large corpora. Many previous works have discussed the urgent need for more efficient linking systems for production, in terms of both time complexity (Hughes et al., 2014) and memory consumption (Blanco et al., 2015).

Figure 1: An illustration of the Dynamic Context Augmentation process. A traditional global EL model jointly optimizes the linking configuration after iterative calculations over all mentions, which is computationally expensive. In contrast, the DCA process only requires one pass of the document to accumulate knowledge from previously linked mentions to enhance fast future inference.
In this paper, we propose a simple yet effective Dynamic Context Augmentation (DCA) process to incorporate global signals for EL. As Figure 1 shows, in contrast to traditional global models, DCA only requires one pass through all mentions to achieve comparable linking accuracy. The basic idea is to accumulate knowledge from previously linked entities as dynamic context to enhance later decisions. Such knowledge comes not only from the inherent properties (e.g., description, attributes) of previously linked entities, but also from their closely related entities, which gives the model important associative abilities. In real scenarios, some previously linked entities may be irrelevant to the current mention, and some falsely linked entities may even introduce noise. To alleviate error propagation, we further explore two strategies: (1) soft/hard attention mechanisms that favor the most relevant entities; (2) a reinforcement learning-based ranking model, an approach that has been reported to be effective in other information extraction tasks.
Contributions. The DCA model forms a new linking strategy from the perspective of data augmentation and thus can serve as a plug-and-enhance module for existing linking models. The major contributions of this work are as follows: (1) DCA can introduce topical coherence into local linking models without reshaping their original designs or structures; (2) Compared to global EL models, DCA only requires one pass through all mentions, yielding better efficiency in both training and inference; (3) Extensive experiments show the effectiveness of our model under different learning settings, base models, decision orders and attention mechanisms.

Problem Definition
Given a set of entity mentions $M = \{m_1, ..., m_T\}$ in corpus $D$, Entity Linking aims to link each mention $m_t$ to its corresponding gold entity $e_t^*$. Such a process is usually divided into two steps: Candidate generation first collects a set of possible (candidate) entities $E_t = \{e_t^1, ..., e_t^{|E_t|}\}$ for $m_t$; Candidate ranking is then applied to rank all candidates by likelihood. The linking system selects the top-ranked candidate as the predicted entity $\hat{e}_t$. The key challenge is to capture high-quality features of each entity mention for accurate entity prediction, especially when local contexts are too sparse to disambiguate all candidates.
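The two-step process above can be illustrated with a minimal sketch. The function name `link_mention`, the dictionary-based candidate index, the `"NIL"` fallback label and the toy prior values are all hypothetical and introduced only for illustration; they are not part of the systems discussed in this paper.

```python
from typing import Callable, Dict, List


def link_mention(
    mention: str,
    candidate_index: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> str:
    """Return the top-ranked candidate entity for a single mention."""
    # Step 1: candidate generation (here a plain dictionary lookup).
    candidates = candidate_index.get(mention, [])
    if not candidates:
        return "NIL"  # hypothetical label for mentions without candidates
    # Step 2: candidate ranking; the highest-scoring candidate is predicted.
    return max(candidates, key=lambda entity: score(mention, entity))


# Toy data, invented for illustration only.
toy_index = {"Jordan": ["Jordan_(country)", "Michael_Jordan"]}
toy_prior = {
    ("Jordan", "Jordan_(country)"): 0.6,
    ("Jordan", "Michael_Jordan"): 0.4,
}


def toy_score(mention: str, entity: str) -> float:
    """Score a candidate by a made-up mention-entity prior."""
    return toy_prior[(mention, entity)]


prediction = link_mention("Jordan", toy_index, toy_score)
```

With only a prior as the scoring function, the most frequent sense always wins, which is exactly the situation where additional context features become necessary.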
We build our DCA model based on two existing local EL models. In this section, we first introduce the architecture of the base models, then present the proposed DCA model under the standard supervised learning framework. Since the DCA process can be naturally formed as a sequential decision problem, we also explore its effectiveness under the Reinforcement Learning framework. Detailed performance comparison and ablation studies are reported in Section 6.

Local Base Models for Entity Linking
We apply the DCA process in two popular local models with different styles: the first is a neural attention model named ETHZ-Attn (Ganea and Hofmann, 2017), the other is the Berkeley-CNN (Francis-Landau et al., 2016) model which is made up of multiple convolutional neural networks (CNN).

ETHZ-Attn.
For each mention $m_t$ and a candidate $e_t^j \in E_t$, three local features are considered: (1) Mention-entity Prior $\hat{P}(e_t^j | m_t)$ is the empirical distribution estimated from a massive corpus (e.g. Wikipedia); (2) Context Similarity $\Psi_C(m_t, e_t^j)$ measures the textual similarity between $e_t^j$ and the local context of $m_t$; (3) Type Similarity $\Psi_T(m_t, e_t^j)$ considers the similarity between the type of $e_t^j$ and the contexts around $m_t$. $\hat{P}(e_t^j | m_t)$ and $\Psi_C(m_t, e_t^j)$ are calculated in the same way as in Ganea and Hofmann (2017). For $\Psi_T(m_t, e_t^j)$, we first train the typing system proposed by Xu and Barbosa (2018) on the AIDA-train dataset, yielding 95% accuracy on the AIDA-A dataset. In the testing phase, the typing system predicts the probability distribution over all types (PER, GPE, ORG and UNK) for $m_t$, and outputs $\Psi_T(m_t, e_t^j)$ for each candidate accordingly. All local features are integrated by a two-layer feedforward neural network with 100 hidden units, as described in Ganea and Hofmann (2017).
Berkeley-CNN. The only difference between ETHZ-Attn and Berkeley-CNN is that the latter uses CNNs at different granularities to capture the context similarity $\Psi_C(m_t, e_t^j)$ between a mention's context and its candidate entities.

Dynamic Context Augmentation
As Figure 1 demonstrates, the basic idea of DCA is to accumulate knowledge from previously linked entities as dynamic context to enhance later decisions. Formally, denote the list of previously linked entities as $S_t = \{\hat{e}_1, ..., \hat{e}_t\}$, where each $\hat{e}_i$ is represented as an embedding vector. The augmented context can be represented by the accumulated features of all previous entities and their neighbors (e.g., in the simplest case, by averaging their embeddings). In practice, some entities in $S_t$ are irrelevant, if not harmful, to the linking result of $m_{t+1}$. To highlight relevant entities while filtering out noise, we also apply a neural attention mechanism on dynamic contexts (Figure 2). For mention $m_{t+1}$, candidates that are more coherent with $S_t$ are preferred. More specifically, we calculate the relevance score for each $\hat{e}_i \in S_t$ as

$$u(\hat{e}_i) = \max_{e_{t+1}^j \in E_{t+1}} (e_{t+1}^j)^\top \cdot A \cdot \hat{e}_i,$$

where $A$ is a parameterized diagonal matrix. The top $K$ entities in $S_t$ are kept to form the dynamic context while the others are pruned. The relevance scores are transformed to attention weights with

$$a(\hat{e}_i) = \frac{\exp[u(\hat{e}_i)]}{\sum_{\hat{e}_k \in S_t} \exp[u(\hat{e}_k)]}.$$

Thus, we can define a weighted coherence score between $e_{t+1}^j \in E_{t+1}$ and $S_t$ as

$$\Phi(e_{t+1}^j, S_t) = \sum_{\hat{e}_i \in S_t} a(\hat{e}_i) \cdot (e_{t+1}^j)^\top \cdot R \cdot \hat{e}_i,$$

where $R$ is a learnable diagonal matrix. Such a coherence score will later be incorporated in the final representation of $e_{t+1}^j$.

To give the linking model associative ability, aside from previously linked entities, we also incorporate entities that are closely associated with entities in $S_t$. Specifically, for each $\hat{e}_i \in S_t$, we collect its neighborhood $N(\hat{e}_i)$ consisting of Wikipedia entities that have inlinks pointing to $\hat{e}_i$. Denoting $S'_t$ as the union of $\{N(\hat{e}_i) \mid \hat{e}_i \in S_t\}$, we define a similar weighted coherence score between $e_{t+1}^j \in E_{t+1}$ and $S'_t$ as

$$\Phi'(e_{t+1}^j, S'_t) = \sum_{\hat{e}_i \in S'_t} a'(\hat{e}_i) \cdot (e_{t+1}^j)^\top \cdot R' \cdot \hat{e}_i,$$

where $a'$ is defined similarly to $a$, and $R'$ is a learnable diagonal matrix. The final representation $h_0(m_{t+1}, e_{t+1}^j)$ is the concatenation of $\Phi(e_{t+1}^j, S_t)$, $\Phi'(e_{t+1}^j, S'_t)$, $\Psi_T(m_{t+1}, e_{t+1}^j)$, $\Psi_C(m_{t+1}, e_{t+1}^j)$ and $\log \hat{P}(e_{t+1}^j | m_{t+1})$.
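As a rough illustration of how dynamic context could be accumulated in a single pass, the sketch below scores candidates against previously linked entities with hard (top-K) and soft (softmax) attention. It rests on several assumptions that are not taken from the paper: relevance is a bilinear score maximized over the current candidates, the softmax is applied after pruning, local and coherence scores are simply added (rather than fed to a neural ranker), and the neighbor set of linked entities is omitted. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def bilinear(x: Vector, diag: Vector, y: Vector) -> float:
    """Compute x^T diag(d) y, with the diagonal matrix stored as a vector."""
    return sum(xi * di * yi for xi, di, yi in zip(x, diag, y))


def dca_coherence(
    candidates: Sequence[Vector],
    linked: Sequence[Vector],
    a_diag: Vector,
    r_diag: Vector,
    top_k: int,
) -> List[float]:
    """Attention-weighted coherence of each candidate with linked entities.

    Assumes at least one candidate and top_k >= 1.
    """
    if not linked:
        return [0.0] * len(candidates)  # no dynamic context yet
    # Relevance of each linked entity: its best match over all candidates.
    relevance = [max(bilinear(c, a_diag, e) for c in candidates) for e in linked]
    # Hard attention: keep only the top_k most relevant linked entities.
    kept = sorted(range(len(linked)), key=relevance.__getitem__, reverse=True)[:top_k]
    # Soft attention: softmax over the kept relevance scores (assumed order).
    peak = max(relevance[i] for i in kept)
    exps = {i: math.exp(relevance[i] - peak) for i in kept}
    total = sum(exps.values())
    return [
        sum(exps[i] / total * bilinear(c, r_diag, linked[i]) for i in kept)
        for c in candidates
    ]


def link_document(
    candidate_sets: Sequence[Sequence[Vector]],
    local_scores: Sequence[Sequence[float]],
    a_diag: Vector,
    r_diag: Vector,
    top_k: int,
) -> List[int]:
    """One pass over the mentions; returns the chosen candidate index for each."""
    linked: List[Vector] = []  # embeddings of previously linked entities
    decisions: List[int] = []
    for candidates, local in zip(candidate_sets, local_scores):
        coherence = dca_coherence(candidates, linked, a_diag, r_diag, top_k)
        # Assumption: local and coherence scores are combined by addition.
        totals = [l + c for l, c in zip(local, coherence)]
        best = max(range(len(candidates)), key=totals.__getitem__)
        decisions.append(best)
        linked.append(candidates[best])  # augment the dynamic context
    return decisions


toy_candidates = [[1.0, 0.0], [0.0, 1.0]]
toy_decisions = link_document(
    [toy_candidates, toy_candidates],
    [[1.0, 0.0], [0.0, 0.1]],
    a_diag=[1.0, 1.0],
    r_diag=[1.0, 1.0],
    top_k=1,
)
```

In the toy run, the second mention slightly prefers its second candidate on local evidence alone, but the entity linked for the first mention shifts the decision toward the coherent candidate.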

Model Learning for DCA
In this section, we explore different learning strategies for the linking model. Specifically, we present a Supervised Learning model, where the model is given all gold entities for training, and a Reinforcement Learning model, where the model explores possible linking results by itself in a longterm planning task.

Supervised Ranking Method
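Given a mention-candidate pair $(m_t, e_t^j)$, the ranking model parameterized by $\theta$ accepts the feature vector $h_0(m_t, e_t^j)$ as input, and outputs the probability $P_\theta(e_t^j | m_t)$. In this work, we use a two-layer feedforward neural network as the ranking model. We apply the max-margin loss

$$L_\theta = \sum_{D \in \mathcal{D}} \sum_{m_t \in D} \sum_{e_t^j \in E_t} g_\theta(e_t^j, m_t), \qquad g_\theta(e_t^j, m_t) = \max\big(0,\ \gamma - P_\theta(e_t^* | m_t) + P_\theta(e_t^j | m_t)\big),$$

where $\gamma$ is the rank margin. The learning process is to estimate the optimal parameter $\theta^* = \arg\min_\theta L_\theta$.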
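A minimal sketch of a max-margin ranking loss for a single mention is given below, assuming the standard hinge form in which every non-gold candidate should score at least a margin below the gold candidate. The function name and the choice to skip the gold candidate itself (which would only contribute a constant) are assumptions made for illustration.

```python
from typing import Sequence


def max_margin_loss(scores: Sequence[float], gold_index: int, margin: float) -> float:
    """Hinge ranking loss for one mention over its candidate scores."""
    gold = scores[gold_index]
    # Each non-gold candidate is penalized when it comes within `margin`
    # of the gold candidate's score (or exceeds it).
    return sum(
        max(0.0, margin - gold + s)
        for j, s in enumerate(scores)
        if j != gold_index
    )


# Toy scores: the third candidate sits inside the margin of the gold one.
toy_loss = max_margin_loss([0.9, 0.5, 0.895], gold_index=0, margin=0.01)
```

Only the near-miss candidate contributes to the toy loss; clearly separated candidates add nothing, so training effort concentrates on hard negatives.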
Note that, in the Supervised Ranking model, dynamic contexts are provided by previous gold entities. In the testing phase, however, we do not have access to gold entities. Wrongly linked entities can introduce noisy contexts to future linking steps. To account for such long-term influences, we introduce an alternative Reinforcement Learning model in the next section.

Reinforcement Learning Method
Naturally, the use of dynamic context augmentation forms a sequential decision problem, as each linking step depends on previous linking decisions. Correct linking results provide valuable information for future decisions, while previous mistakes can lead to error accumulation. Reinforcement Learning (RL) algorithms have been shown to alleviate such accumulated noise in the decision sequence in many recent works (Narasimhan et al., 2016; Feng et al., 2018). In this work, we propose an RL ranking model for DCA-enhanced entity linking.
Agent: The Agent is a candidate ranking model with an architecture similar to that of Clark and Manning (2016), aiming to output the action preference of each linking action. It is a 2-layer feedforward neural network with the following components:

Input Layer: For each $(m_t, e_t^j)$ pair, DCA-RL extracts context-dependent features from $S_{t-1}$ and $S'_{t-1}$, and concatenates them with other context-independent features to produce an $I$-dimensional input vector $h_0(m_t, e_t^j)$.

Hidden Layers: Let $\mathrm{Drop}(x)$ be the dropout operation (Srivastava et al., 2014) and $\mathrm{ReLU}(x)$ be the rectifier nonlinearity (Nair and Hinton, 2010). The output $h_1$ of the hidden layer is defined as

$$h_1 = \mathrm{Drop}(\mathrm{ReLU}(W_1 \cdot h_0 + b_1)),$$

where $W_1$ is an $H_1 \times I$ weight matrix and $b_1$ is a bias vector.

Output Layers: The scoring layer is also a fully connected layer, of size 1:

$$h_2 = W_2 \cdot h_1 + b_2,$$

where $W_2$ is a $1 \times H_1$ weight matrix and $b_2$ is a bias term. Finally, all action preferences are normalized together with a softmax distribution to obtain the action probabilities $\pi_\theta(A_t^j | S_{t-1}, S'_{t-1})$:

$$\pi_\theta(A_t^j | S_{t-1}, S'_{t-1}) = \frac{\exp(h_2^j)}{\sum_{k} \exp(h_2^k)},$$

where $h_2^j$ denotes the preference computed for candidate $e_t^j$. According to policy approximation methods, the best approximate policy may be stochastic. We therefore sample actions from the softmax distribution during training, and select the action with the highest ranking score at test time.
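The action-selection rule described above (stochastic sampling during training, greedy choice at test time) can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the action preferences are passed in as plain numbers rather than computed by a network.

```python
import math
import random
from typing import List, Sequence


def softmax(preferences: Sequence[float]) -> List[float]:
    """Normalize action preferences into a probability distribution."""
    peak = max(preferences)  # subtract the maximum for numerical stability
    exps = [math.exp(p - peak) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(
    preferences: Sequence[float], training: bool, rng: random.Random
) -> int:
    """Pick a candidate index: sampled when training, greedy otherwise."""
    probabilities = softmax(preferences)
    if training:
        # Stochastic policy: sample an action according to its probability.
        return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
    # Deterministic policy: take the highest-probability action.
    return max(range(len(probabilities)), key=probabilities.__getitem__)


toy_preferences = [0.1, 2.0, -1.0]
greedy_action = choose_action(toy_preferences, training=False, rng=random.Random(0))
sampled_action = choose_action(toy_preferences, training=True, rng=random.Random(0))
```

Sampling during training lets the agent occasionally explore lower-ranked candidates, while the greedy rule keeps test-time predictions deterministic.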
Reward. The reward signals are quite sparse in our framework. For each trajectory, the Agent only receives a reward signal after it finishes all the linking actions in a given document. Therefore the immediate reward of action $t$ is $R_t = 0$ for $0 \le t < T$, and $R_T = -(|M_e|/T)$, where $T$ is the total number of mentions in the source document and $|M_e|$ is the number of incorrectly linked mentions. The value $G_t$ (expected reward) of each previous state $S_t$ can then be traced back from $R_T$ with a discount factor $\rho$:

$$G_t = \rho^{T-t} R_T.$$

To maximize the expected reward of all trajectories, the Agent uses the REINFORCE algorithm (Sutton and Barto, 1998) to compute the Monte Carlo policy gradient over all trajectories and performs gradient ascent on its parameters:

$$\theta \leftarrow \theta + \alpha \sum_{t} G_t \nabla_\theta \ln \pi_\theta(A_t^j | S_{t-1}, S'_{t-1}),$$

where $\alpha$ is the learning rate.

In the following sections, to fully investigate the effectiveness of the proposed method, we report and compare the performance of both the Supervised Learning model and the Reinforcement Learning model.
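To make the sparse-reward setup concrete, the sketch below computes the per-step returns and a REINFORCE-style gradient signal. It assumes that, because only the final step carries a non-zero reward, the return at step t is the final reward discounted by the number of remaining steps, with steps indexed from 1 to T. The gradient helper uses the standard identity for a softmax policy (gradient of the log-probability with respect to the preferences equals a one-hot vector minus the probabilities). Function names are hypothetical.

```python
from typing import List, Sequence


def discounted_returns(num_mentions: int, num_errors: int, rho: float) -> List[float]:
    """Returns G_1..G_T when only the final step is rewarded."""
    final_reward = -(num_errors / num_mentions)  # R_T = -(|M_e| / T)
    # Assumed form: G_t = rho^(T - t) * R_T for t = 1..T.
    return [
        rho ** (num_mentions - t) * final_reward
        for t in range(1, num_mentions + 1)
    ]


def reinforce_preference_gradient(
    probabilities: Sequence[float], action: int, step_return: float
) -> List[float]:
    """G_t * d log pi(action) / d preferences for a softmax policy."""
    return [
        step_return * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probabilities)
    ]


toy_returns = discounted_returns(num_mentions=4, num_errors=1, rho=0.9)
toy_gradient = reinforce_preference_gradient([0.5, 0.5], action=0, step_return=-1.0)
```

Because the reward is non-positive, a document with errors yields negative returns, so gradient ascent lowers the probability of the actions taken in that trajectory, with later actions weighted more heavily than earlier ones.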

Analysis of Computational Complexity
For each document $D$, the training and inference of global EL models rely heavily on the inter-entity coherence graph $\Phi_g$. Many studies (Ratinov et al., 2011; Globerson et al., 2016; Yamada et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018) obtain $\Phi_g$ by computing pairwise scores between two arbitrary elements $e_i^x$ and $e_j^y$ sampled independently from candidate sets $E_i$ and $E_j$ in the given document. Computing $\Phi_g$ exhaustively quickly becomes intractable: its computational complexity is $O(\Phi_g) = O(T^2 \times |E|^2 \times I)$, where $|E|$ is the average number of candidates per mention and $I$ is the unit cost of the pairwise function $\Phi$. In order to reduce $O(\Phi_g)$, most previous models (Hoffart et al., 2011; Ganea and Hofmann, 2017; Le and Titov, 2018; Fang et al., 2019) have to hard-prune their candidate sets to an extremely small size (e.g. $|E| = 5$). This reduces the gold recall of the candidate sets and is also unsuitable for large-scale production (e.g. entity disambiguation for dynamic web data like Twitter).
In contrast, the computational complexity of our model is $O(T \times |E| \times I \times K)$, where $K$ is the key hyper-parameter described in Section 3 and is usually set to a small number. This indicates that the response time of our method grows linearly as a function of $T \times |E|$.

[...] and uses a deep learning model combined with a neural attention mechanism and graphical models. Ment-Norm (Le and Titov, 2018) improves the Deep-ED model by modeling latent relations between mentions.
For a fair comparison with prior work, we use the same input as WNED, Deep-ED and Ment-Norm (models proposed after 2016), and report the performance of our model with both Supervised Learning (DCA-SL) and Reinforcement Learning (DCA-RL). We do not compare our models with RLEL (Fang et al., 2019), a deep reinforcement learning based LSTM model, for two reasons: 1) RLEL uses optimized candidate sets with a smaller candidate size and higher gold recall than ours and the listed baselines; 2) RLEL uses an additional training set from Wikipedia data. Fang et al. (2019) do not release either their candidate sets or their updated training corpus, so a comparison with their work would not be fair.
Hyper-parameter Setting. We coarsely tune the hyper-parameters according to model performance on AIDA-A. We set the dimensions of word embeddings and entity embeddings to 300, where the word embeddings and entity embeddings are publicly released by Pennington et al. (2014) and Ganea and Hofmann (2017) respectively. Hyper-parameters of the best validated model are: $K = 7$, $I = 5$, $H_1 = 100$, and the probability of dropout is set to 0.2. Besides, the rank margin is $\gamma = 0.01$ and the discount factor is $\rho = 0.9$. We also regularize the Agent model as in Ganea and Hofmann (2017) by constraining the sum of squares of all weights in the linear layer with MaxNorm = 4. When training the model, we use Adam (Kingma and Ba, 2014) with a learning rate of 2e-4 until validation accuracy exceeds 92.8%, afterwards setting it to 5e-5.

Overall Performance Comparison
Starting with an overview of the end-task performance, we compare DCA (using SL or RL) with several state-of-the-art systems on in-domain and cross-domain datasets. We follow prior work and report in-KB accuracy for AIDA-B and micro F1 scores for the other test sets. Table 2 summarizes results on the AIDA-B dataset, and shows that DCA-based models achieve the highest in-KB accuracy and outperform the previous state-of-the-art neural system by nearly 1.6% absolute accuracy. Moreover, compared with the base models, dynamic context augmentation significantly improves absolute in-KB accuracy for Berkeley-CNN (more than [...]

Table 3 shows the results on the five cross-domain datasets. As shown, no existing method consistently wins on all datasets. DCA-based models achieve state-of-the-art performance on the MSNBC and the ACE2004 datasets. On the remaining datasets, DCA-RL achieves performance comparable with other complex global models. In addition, RL-based models show on average a 1.1% improvement in F1 score over the SL-based models across all the cross-domain datasets. At the same time, DCA-based methods are much more efficient, both in time complexity and in resource requirements. A detailed efficiency analysis is presented in the following sections.

Performance Analysis
1. Impact of decision order. As the DCA model consecutively links and adds all the mentions in a document, the linking order may play a key role in the final performance. In this work, we try three different linking orders: Offset links all mentions in their natural order in the original document; Size first links mentions with smaller candidate sets, as they tend to be easier to link; Random, the baseline, links all mentions in a random order. Figure 3 shows the performance comparison on the AIDA-B and the CWEB datasets. In general, Size leads to better performance than Offset and Random. However, the DCA-SL model performs poorly on the CWEB dataset with the Size order. This is mainly because the CWEB dataset is automatically generated rather than curated by humans, and thus contains many noisy mentions. Some mentions in CWEB with fewer than three candidates are actually bad cases, where none of the candidates is the gold entity. Such mentions always introduce wrong information to the model, which leads to worse performance. In contrast, the AIDA-B dataset does not have such cases. The DCA-RL model, however, still performs strongly on the CWEB dataset, which highlights its robustness to potential noise.
2. Effect of neighbor entities. In contrast to traditional global models, we include both previously linked entities and their close neighbors as global signals. Table 4 shows the effectiveness of this strategy. We observe that incorporating these neighbor entities (2-hop) significantly improves performance (compared to 1-hop) by introducing more related information. Our analysis shows that the average relative improvements of 2-hop DCA-(SL/RL) over 1-hop DCA-(SL/RL) and over baseline-SL (without DCA), 0.72% and 3.56% respectively, are statistically significant (p-value < 0.005). This is consistent with our design of DCA.
3. Study of different attention mechanisms. Table 5 shows the performance comparison when the attention module described in Section 3 is replaced with different variants. Average Sum treats all previously linked entities equally with a uniform distribution. Soft Attention skips the pruning step for entities with low weight scores. Soft&Hard Attention stands for the strategy used in our model. The attention mechanism has a positive influence on linking performance compared with Average Sum, and hard pruning brings a slight further improvement.
4. Impact of decision length. As wrongly linked entities can introduce noise to the model, there exists a trade-off in DCA: involving more previous entities (a longer historical trajectory) provides more information, but also more noise. Figure 4(a) shows how the performance of DCA changes with the number of previous entities involved. We observe that longer historical trajectories usually have a positive influence on the performance of DCA. The reason is that our attention mechanism can effectively assess and select relevant contexts for each entity mention on the fly, thus reducing potential noise.
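The three decision orders compared in item 1 above (Offset, Size and Random) can be sketched as a simple reordering of the mention list before the linking pass. The dictionary layout of a mention (`"offset"` and `"candidates"` keys), the function name and the fixed seed are hypothetical choices made for this illustration.

```python
import random
from typing import Dict, List


def order_mentions(mentions: List[Dict], strategy: str, seed: int = 0) -> List[Dict]:
    """Return the mentions in the order in which they should be linked."""
    if strategy == "offset":
        # Natural order of appearance in the document.
        return sorted(mentions, key=lambda m: m["offset"])
    if strategy == "size":
        # Mentions with fewer candidates first, as they tend to be easier.
        return sorted(mentions, key=lambda m: len(m["candidates"]))
    if strategy == "random":
        shuffled = list(mentions)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown strategy: {strategy}")


# Toy mentions, invented for illustration only.
toy_mentions = [
    {"text": "Jordan", "offset": 0, "candidates": ["a", "b", "c"]},
    {"text": "Chicago Bulls", "offset": 25, "candidates": ["d"]},
    {"text": "NBA", "offset": 12, "candidates": ["e", "f"]},
]
by_offset = [m["text"] for m in order_mentions(toy_mentions, "offset")]
by_size = [m["text"] for m in order_mentions(toy_mentions, "size")]
```

Under the Size order, a mention whose small candidate set lacks the gold entity is linked early and contaminates every later decision, which matches the failure mode observed on CWEB.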

Analysis on Time Complexity
As discussed in Sec. 5, the running time of a DCA-enhanced model may rise linearly when the average number of candidates per mention (i.e., $|E|$) increases, while that of a global EL model increases exponentially. To validate this analysis, we empirically investigate the scalability of DCA, and select two global EL models, Ment-Norm (Le and Titov, 2018) and Deep-ED (Ganea and Hofmann, 2017), as our baselines. The reason for this choice is that our final model shares the same local model as these models, which excludes confounding factors such as implementation details. As Figure 4(c) shows, when $|E|$ increases, the running time of these two global EL models increases sharply, while that of our DCA model grows linearly. We also observe that the resources required by the DCA model are insensitive to $|E|$. For example, as shown in Figure 4(b), the memory usage of Ment-Norm and Deep-ED rises significantly as more candidates are considered, while the DCA model maintains relatively low memory usage throughout. We also measure the power consumption of the Ment-Norm and DCA models, and find that the DCA model saves up to 80% of the energy consumed by Ment-Norm, which is another advantage for large-scale production.
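The difference in scaling can be illustrated by counting score evaluations. The sketch below assumes that a global model scores every pair of candidates across every pair of mentions, and that DCA scores each candidate of each mention against at most K retained context entities; both counts are simplifications that ignore constant factors and the number of inference iterations. Function names are hypothetical.

```python
def global_pairwise_cost(num_mentions: int, num_candidates: int, unit_cost: int = 1) -> int:
    """Score evaluations when all candidate pairs of all mention pairs are scored."""
    return num_mentions ** 2 * num_candidates ** 2 * unit_cost


def dca_cost(num_mentions: int, num_candidates: int, top_k: int, unit_cost: int = 1) -> int:
    """Score evaluations for one pass with at most top_k context entities."""
    return num_mentions * num_candidates * top_k * unit_cost


# Doubling the candidate set size for a 10-mention document.
global_small = global_pairwise_cost(10, 5)
global_large = global_pairwise_cost(10, 10)
dca_small = dca_cost(10, 5, top_k=7)
dca_large = dca_cost(10, 10, top_k=7)
```

Under these assumptions, doubling the number of candidates doubles the DCA count but quadruples the global count, so the gap widens as candidate sets grow.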

Related Work
Local EL methods disambiguate each mention independently according to its local context (Yamada et al., 2016; Chen et al., 2017; Globerson et al., 2016; Raiman and Raiman, 2018). Their performance is limited when sparse local contexts fail to provide sufficient disambiguation evidence.

To alleviate this problem, global EL models jointly optimize the entire linking configuration. The key idea is to maximize a global coherence/similarity score between all linked entities (Hoffart et al., 2011; Ratinov et al., 2011; Cheng and Roth, 2013; Nguyen et al., 2014; Alhelbawy and Gaizauskas, 2014; Pershina et al., 2015; Guo and Barbosa, 2016; Globerson et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018; Fang et al., 2019; Xue et al., 2019). Despite their significant improvements in accuracy, such global methods suffer from high complexity. To this end, some works try to relax the assumption of all-mention coherence, e.g. with pairwise coherence, to improve efficiency (Phan et al., 2018), but exact inference remains an NP-hard problem. Approximation methods have hence been proposed to achieve reasonably good results at lower cost. Shen et al. (2012) propose the iterative substitution method, which greedily substitutes the linking assignment of one mention at a time whenever doing so improves the global objective. Another common practice is to use Loopy Belief Propagation for inference (Ganea and Hofmann, 2017; Le and Titov, 2018). Both approximation methods iteratively improve the global assignment, but are still computationally expensive, with an unbounded number of iterations. In contrast, the proposed DCA method only requires one pass through the document. Global signals are accumulated as dynamic contexts for local decisions, which significantly reduces computational complexity and memory consumption.

Conclusions
In this paper we propose Dynamic Context Augmentation as a plug-and-enhance module for local Entity Linking models. In contrast to existing global EL models, DCA only requires one pass through the document. To incorporate global disambiguation signals, DCA accumulates knowledge from previously linked entities for fast inference. Extensive experiments on several public benchmarks with different learning settings, base models, decision orders and attention mechanisms demonstrate both the effectiveness and efficiency of DCA-based models. The scalability of DCA-based models makes it possible to handle large-scale data with long documents. Related code and data have been published, and we hope they will benefit the community.