Aggregated Semantic Matching for Short Text Entity Linking

The task of entity linking aims to identify concepts mentioned in text fragments and link them to a reference knowledge base. Entity linking in long text has been well studied in previous work. However, short text entity linking is more challenging since the texts are noisy and less coherent. To better utilize the local information provided in short texts, we propose a novel neural network framework, Aggregated Semantic Matching (ASM), in which two different aspects of semantic information between the local context and the candidate entity are captured via representation-based and interaction-based neural semantic matching models, and the two matching signals then work jointly for disambiguation through a rank aggregation mechanism. Our evaluation shows that the proposed model outperforms state-of-the-art methods on public tweet datasets.


Introduction
The task of entity linking aims to link a mention that appears in a piece of text to an entry (i.e., entity) in a knowledge base. For example, as shown in Table 1, given the mention Trump in a tweet, it should be linked to the entity Donald Trump¹ in Wikipedia. Recent research has shown that entity linking can help better understand the text of a document (Schuhmacher and Ponzetto, 2014) and benefits several tasks, including named entity recognition (Luo et al.) and information retrieval (Xiong et al., 2017b). Research on entity linking mainly considers two types of documents: long text (e.g., news articles and web documents) and short text (e.g., tweets). In this paper, we focus on short text, particularly tweet entity linking.

* Correspondence author is Rong Pan. This work was done when the first and second author were interns and the third author was an employee at Microsoft Research Asia.
¹ https://en.wikipedia.org/wiki/Donald_Trump

Tweet: The vile #Trump humanity raises its gentle face in Canada ... chapeau to #Trudeau
Candidates: Donald Trump, Trump (card games), ...
Table 1: An illustration of short text entity linking, with mention Trump underlined.
One of the major challenges in the entity linking task is ambiguity, where an entity mention could denote multiple entities in a knowledge base. As shown in Table 1, the mention Trump can refer to the U.S. president Donald Trump and also to the card-game term Trump (card games). Many recent approaches for long text entity linking take advantage of global context, which captures the coherence among the mapped entities for a set of related mentions in a single document (Cucerzan, 2007; Han et al., 2011; Globerson et al., 2016; Heinzerling et al., 2017). However, short texts like tweets are often concise and less coherent, and therefore lack the information that global methods need. In the NEEL dataset (Weller et al., 2016), there are only 3.4 mentions in each tweet on average. Several studies (Huang et al., 2014) investigate collective tweet entity linking by pre-collecting and considering multiple tweets simultaneously. However, multiple texts are not always available for collection, and the process is time-consuming. Thus, we argue that an efficient entity disambiguation method that requires only a single short text (e.g., a tweet) and makes good use of local contexts is better suited to real-world applications.
In this paper, we investigate entity disambiguation in a setting where only local information is available. Recent neural approaches have shown their superiority in capturing rich semantic similarities from mention contexts and entity contents. Sun et al. (2015) and Francis-Landau et al. (2016) proposed using convolutional neural networks (CNN) with a Siamese (symmetric) architecture to capture the similarity between texts. These approaches can be viewed as representation-focused semantic matching models. A representation-focused model first builds a representation for a single text (e.g., a context or an entity description) with a neural network, and then conducts matching between the abstract representations of two pieces of text. Even though such models capture distinguishable information from both the mention side and the entity side, some concrete matching signals (e.g., exact match) are lost, since the matching between two texts happens after their individual abstract representations have been obtained. To enhance the representation-focused models, inspired by recent advances in information retrieval (Guo et al., 2016; Xiong et al., 2017a), we propose using an interaction-focused approach to capture the concrete matching signals. The interaction-focused method builds local interactions (e.g., cosine similarity) between two pieces of text, and then uses neural networks to learn the final matching score based on the local interactions.
The representation-focused and interaction-focused approaches capture abstract-level and concrete-level matching signals respectively, so they complement each other if combined appropriately. One straightforward way to combine multiple semantic matching signals is to apply a linear regression layer that learns a static weight for each matching signal (Francis-Landau et al., 2016). However, we observe that the importance of different signals varies from case to case. For example, as shown in Table 1, the context word Canada is the most important word for the disambiguation of Trudeau. In this case, the concrete-level matching signal is required. In contrast, for the tweet "#Star-Wars #theForceAwakens #StarWarsForceAwakens @StarWars", @StarWars is linked to the entity Star Wars. In this case, the whole tweet describes the same topic "Star Wars", so the abstract-level semantic matching signal is helpful. To address this issue, we propose using a rank aggregation method to dynamically combine multiple semantic matching signals for disambiguation.
In summary, we focus on entity disambiguation by leveraging only the local information. Specifically, we propose using both a representation-focused model and an interaction-focused model for semantic matching and view them as complementary to each other. To overcome the issue of static weights in linear regression, we apply rank aggregation to combine multiple semantic matching signals captured by the two neural models on multiple text pairs. We conduct extensive experiments to examine the effectiveness of our proposed approach, ASM, on both the NEEL dataset and the MSR tweet entity linking (MSR-TEL for short) dataset.

Notations
A tweet t contains a set of identified queries $Q = (q_1, ..., q_n)$. Each query q in a tweet t consists of m and ctx, where m denotes an entity mention and ctx denotes the context of the mention, i.e., a piece of text surrounding m in the tweet t. An entity is an unambiguous page (e.g., Donald Trump) in a referent Knowledge Base (KB). Each entity e consists of ttl and desc, where ttl denotes the title of e and desc denotes the description of e (e.g., the article defining e).

An Overview of the Linking System
Typically, an entity linking system consists of three components: mention detection, candidate generation, and entity disambiguation. In this section, we briefly present the existing solutions for the first two components. In the next section, we introduce our proposed aggregated semantic matching for entity disambiguation.

Mention Detection
Given a tweet t with a sequence of words $w_1, ..., w_n$, our goal is to identify the possible entity mentions in the tweet t. Specifically, every word $w_i$ in tweet t requires a label indicating whether it is an entity mention word or not. Therefore, we view it as a traditional named entity recognition (NER) problem and use the BIO tagging scheme. Given the tweet t, we aim to assign labels $y = (y_1, ..., y_n)$ to the words in the tweet t:
$$y_i = \begin{cases} \text{B} & \text{if } w_i \text{ is a begin word of a mention,} \\ \text{I} & \text{if } w_i \text{ is a non-begin word of a mention,} \\ \text{O} & \text{if } w_i \text{ is not a mention word.} \end{cases}$$
In our implementation, we apply an LSTM-CRF based NER tagging model which automatically learns contextual features for sequence tagging via recurrent neural networks (Lample et al., 2016).
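To make the BIO scheme concrete, the following minimal sketch shows how a predicted label sequence could be decoded into mention spans. It is only an illustration: the function name `bio_to_mentions` and the toy sentence are assumptions, and the tagger itself (the LSTM-CRF model) is not reproduced here.

```python
from typing import List, Tuple


def bio_to_mentions(words: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-word BIO labels into (start, end, text) spans; `end` is exclusive.

    An "I" label with no open mention is treated as starting a new mention,
    which is one common (assumed) way to handle malformed label sequences.
    """
    mentions = []
    start = None  # index where the currently open mention began
    for i, label in enumerate(labels):
        if label == "B" or (label == "I" and start is None):
            if start is not None:  # close the previous mention first
                mentions.append((start, i, " ".join(words[start:i])))
            start = i
        elif label == "O" and start is not None:
            mentions.append((start, i, " ".join(words[start:i])))
            start = None
        # label == "I" with an open mention simply extends it
    if start is not None:  # mention running to the end of the tweet
        mentions.append((start, len(words), " ".join(words[start:])))
    return mentions


# Hypothetical toy input
words = ["chapeau", "to", "Justin", "Trudeau", "in", "Canada"]
labels = ["O", "O", "B", "I", "O", "B"]
mentions = bio_to_mentions(words, labels)
```

Each decoded span then becomes the mention m of one query, with the rest of the tweet serving as its context ctx.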

Candidate Generation
Given a mention m, we use several heuristic rules to generate candidate entities, similar to Bunescu and Pasca (2006), Huang et al. (2014), and Sun et al. (2015). Specifically, given a mention m, we retrieve an entity from the KB as a candidate if it matches one of the following conditions: (a) the entity title exactly matches the mention, (b) the anchor text of the entity exactly matches the mention, (c) the title of the entity's redirected page exactly matches the mention. Additionally, we add a special candidate NIL for each mention, which refers to a new entity outside the KB. Since multiple candidates can be retrieved for a given mention, entity disambiguation is needed.
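The three matching rules and the NIL candidate can be sketched as a simple lookup. This is an assumed, simplified illustration: the in-memory structures `titles`, `anchors`, and `redirects` are hypothetical stand-ins for the real knowledge base, and case-insensitive comparison is an assumption about what "exactly matches" means.

```python
from typing import Dict, List, Set

NIL = "NIL"  # special candidate for entities outside the KB


def generate_candidates(
    mention: str,
    titles: List[str],               # entity titles in the KB
    anchors: Dict[str, Set[str]],    # anchor text -> entities it links to
    redirects: Dict[str, str],       # redirect page title -> target entity
) -> List[str]:
    key = mention.casefold()  # assumption: matching ignores case
    candidates: List[str] = []

    def add(entity: str) -> None:
        if entity not in candidates:
            candidates.append(entity)

    # (a) the entity title matches the mention
    for title in titles:
        if title.casefold() == key:
            add(title)
    # (b) an anchor text of the entity matches the mention
    for anchor, entities in anchors.items():
        if anchor.casefold() == key:
            for entity in sorted(entities):
                add(entity)
    # (c) the title of a redirect page matches the mention
    for redirect_title, target in redirects.items():
        if redirect_title.casefold() == key:
            add(target)

    candidates.append(NIL)  # every mention also gets the NIL candidate
    return candidates


# Hypothetical toy knowledge base
titles = ["Donald Trump", "Trump (card games)"]
anchors = {"Trump": {"Donald Trump", "Trump (card games)"}}
redirects = {"The Donald": "Donald Trump"}
cands = generate_candidates("Trump", titles, anchors, redirects)
```

A production system would index the three sources in a hash map rather than scanning them per mention; the linear scans here only keep the sketch short.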

Aggregated Semantic Matching Model
In this paper, we investigate entity disambiguation using only the local information provided in short texts. Here, the local information includes a mention and its context in a tweet. Similar to Francis-Landau et al. (2016), given a query q and an entity e, we consider semantic matching on four text pairs for disambiguation: (1) the similarity sim(m, ttl) between the mention and the entity title, (2) the similarity sim(m, desc) between the mention and the entity description, (3) the similarity sim(ctx, desc) between the context and the entity description, (4) the similarity sim(ctx, ttl) between the context and the entity title. Fig. 1 illustrates an overview of our proposed Aggregated Semantic Matching for entity disambiguation. First, we use a representation-focused model and an interaction-focused neural model for semantic matching on the four text pairs. Then, we introduce a pairwise rank aggregation to combine the multiple semantic matching signals captured by the two neural models on the four text pairs.

Semantic Matching
Formally, given two texts $T_1$ and $T_2$, the semantic similarity of the two texts is measured as a score produced by a matching function based on the representation of each text:
$$\mathrm{match}(T_1, T_2) = F(\Phi(T_1), \Phi(T_2)) \tag{1}$$
where $\Phi$ is a function to learn the text representation, and $F$ is the matching function based on the interaction between the representations. Existing neural semantic matching models can be categorized into two types: (a) the representation-focused model, which takes a complex representation learning function and uses a relatively simple matching function, (b) the interaction-focused model, which usually takes a simple representation learning function and uses a complex matching function. In the remainder of this section, we present the details of a representation-focused model (M-CNN) and an interaction-focused model (K-NRM). We also discuss the advantages of these two models in the entity linking task.

Convolution Neural Matching with Max Pooling (M-CNN)
Given two pieces of text $T_1 = \{w^1_1, ..., w^1_n\}$ and $T_2 = \{w^2_1, ..., w^2_m\}$, M-CNN aims to learn compositional and abstract representations ($\Phi$) for $T_1$ and $T_2$ using a convolutional neural network with a max-pooling layer (Francis-Landau et al., 2016). Figure 2a illustrates the architecture of the M-CNN model. Given a sequence of words $w_1, ..., w_n$, we embed each word into a d-dimensional vector, which yields a sequence of word vectors $v_1, ..., v_n$. We then map those word vectors into a fixed-size vector using a convolution network with a filter bank $M \in \mathbb{R}^{u \times ld}$, where l is the window size and u is the number of filters. The convolution feature matrix $H \in \mathbb{R}^{u \times (n-l+1)}$ is obtained by concatenating the convolution outputs $\vec{h}_j$:
$$\vec{h}_j = \max\{0, M v_{j:j+l}\}, \qquad H = [\vec{h}_1, ..., \vec{h}_{n-l+1}] \tag{2}$$
where $v_{j:j+l}$ is a concatenation of the given word vectors and the max is element-wise. In this way, we extract word-level n-gram features of $T_1$ and $T_2$ respectively. To capture the distinguishable information of $T_1$ and $T_2$, a max-pooling layer is applied, which yields fixed-length vectors $\vec{z}_1$ and $\vec{z}_2$ for $T_1$ and $T_2$. The semantic similarity between $T_1$ and $T_2$ is measured using cosine similarity:
$$\mathrm{match}(T_1, T_2) = \cos(\vec{z}_1, \vec{z}_2) = \frac{\vec{z}_1 \cdot \vec{z}_2}{\lVert \vec{z}_1 \rVert \, \lVert \vec{z}_2 \rVert}$$
In summary, M-CNN extracts distinguishable information representing the overall semantics (i.e., representations) of a text by using a convolutional neural network with max-pooling. However, the concrete matching signals (e.g., exact match) are lost, as the matching happens after the individual representations have been computed. We therefore introduce an interaction-focused model to better capture the concrete matching in the next section.
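The M-CNN forward pass described above (n-gram convolution with ReLU, max-pooling over positions, cosine similarity) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the filter weights and word vectors below are hand-picked toy values rather than learned parameters, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def conv_relu(word_vecs: List[Vector], filters: List[Vector], window: int) -> List[Vector]:
    """One ReLU-activated feature vector per n-gram window of `window` words."""
    outputs = []
    for j in range(len(word_vecs) - window + 1):
        # Concatenate `window` consecutive word vectors into one long vector.
        ngram = [x for vec in word_vecs[j:j + window] for x in vec]
        outputs.append(
            [max(0.0, sum(w * x for w, x in zip(f, ngram))) for f in filters]
        )
    return outputs


def max_pool(features: List[Vector]) -> Vector:
    """Element-wise max over all n-gram positions."""
    return [max(column) for column in zip(*features)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mcnn_match(t1: List[Vector], t2: List[Vector], filters: List[Vector], window: int) -> float:
    z1 = max_pool(conv_relu(t1, filters, window))
    z2 = max_pool(conv_relu(t2, filters, window))
    return cosine(z1, z2)


# Toy setting: d = 2, window l = 2, u = 2 filters of length l * d = 4.
filters = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
t1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t2 = [[0.0, 1.0], [1.0, 1.0]]
z1 = max_pool(conv_relu(t1, filters, 2))
score = mcnn_match(t1, t2, filters, 2)
```

Because the two texts are compared only through the pooled vectors, any exact word overlap between them influences the score only indirectly, which is the limitation noted above.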

Neural Relevance Model with Kernel Pooling (K-NRM)
As shown in Fig. 2b, K-NRM captures the local interactions between $T_1$ and $T_2$, and then uses a kernel-pooling layer (Xiong et al., 2017a) to softly count the frequencies of the local patterns. The final matching score is computed from these patterns, so the concrete matching information is captured. Unlike M-CNN, K-NRM builds the local interactions between $T_1$ and $T_2$ based on the word-level n-gram feature matrix calculated in Eq. 2. Formally, we construct a translation matrix M, where each element $M_{ij}$ is the cosine similarity between an n-gram feature vector $\vec{h}^1_i$ of $T_1$ and an n-gram feature vector $\vec{h}^2_j$ of $T_2$:
$$M_{ij} = \cos(\vec{h}^1_i, \vec{h}^2_j)$$
Then, a scoring feature vector $\phi(M)$ is generated by a kernel-pooling technique:
$$\phi(M) = \sum_{i} \sqrt{\vec{K}(M_i)}, \qquad \vec{K}(M_i) = \{K_1(M_i), ..., K_K(M_i)\}$$
Here $\vec{K}(M_i)$ applies K kernels to the i-th row of the translation matrix and generates a K-dimensional scoring feature vector for the i-th n-gram feature in the query. The sqrt-sum of the scoring feature vectors of all n-gram features in the query forms the scoring feature vector $\phi$ for the whole query, where the sqrt reduces the range of the values in each kernel vector. Note that the effect of $\vec{K}$ depends on the kernel used. We use the RBF kernel in this paper:
$$K_k(M_i) = \sum_{j} \exp\left(-\frac{(M_{ij} - \mu_k)^2}{2\sigma_k^2}\right)$$
The RBF kernel $K_k$ calculates how the pairwise similarities between n-gram feature vectors are distributed around its mean $\mu_k$: the more similarities are close to $\mu_k$, the higher the output value. The kernel functions act as 'soft-TF' bins, where $\mu$ defines the similarity level that the 'soft-TF' focuses on and $\sigma$ defines the range of its 'soft-TF' count. The semantic similarity is then captured with a linear layer, $\mathrm{match}(T_1, T_2) = w^T \phi(M) + b$, where $\phi(M)$ is the scoring feature vector.
In summary, K-NRM captures the concrete matching signals based on word-level n-gram feature interactions between $T_1$ and $T_2$. In contrast, M-CNN captures the compositional and abstract meaning of a whole text. Thus, we produce the semantic matching signals using both models to capture different aspects of semantics that are useful for entity linking.
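A minimal sketch of kernel pooling with RBF kernels and the sqrt-sum described above may help. The kernel means and widths, the toy translation matrix, and the linear-layer weights are all assumed values chosen for illustration, not parameters from the experiments.

```python
import math
from typing import List


def rbf_kernel(similarities: List[float], mu: float, sigma: float) -> float:
    """Soft count of how many similarities in one row lie near `mu`."""
    return sum(math.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2)) for s in similarities)


def kernel_pooling(M: List[List[float]], mus: List[float], sigmas: List[float]) -> List[float]:
    """sqrt-sum over the rows of the translation matrix, one value per kernel."""
    phi = [0.0] * len(mus)
    for row in M:  # one row per n-gram feature of the query side
        for k, (mu, sigma) in enumerate(zip(mus, sigmas)):
            phi[k] += math.sqrt(rbf_kernel(row, mu, sigma))
    return phi


def knrm_score(M, mus, sigmas, w: List[float], b: float) -> float:
    """Linear layer on top of the kernel-pooled features."""
    phi = kernel_pooling(M, mus, sigmas)
    return sum(wk * pk for wk, pk in zip(w, phi)) + b


# Toy translation matrix: one query n-gram compared with two entity n-grams.
M = [[1.0, 0.0]]
mus = [1.0, 0.0]      # kernel 0 focuses on (near-)exact matches, kernel 1 on no match
sigmas = [0.1, 0.1]
phi = kernel_pooling(M, mus, sigmas)
score = knrm_score(M, mus, sigmas, w=[2.0, -1.0], b=0.5)
```

With a narrow kernel centered at 1, the first component of `phi` behaves like a soft count of exact matches, which is the concrete signal that max-pooled representations tend to lose.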

Normalization Scoring Layer
We compute four types of semantic similarities between the query q and the candidate entity e (i.e., sim(m, ttl), sim(m, desc), sim(ctx, ttl), sim(ctx, desc)) with each of the above two semantic matching models. We thus obtain 8 semantic matching signals in total, denoted as $f_1(q, e), ..., f_8(q, e)$. Each semantic matching signal $f_i(q, e)$ is then converted into a normalized ranking score by normalizing over the candidate entities e′ of the given mention m. This produces 8 semantic matching scores for each candidate entity of m, denoted as $S_{q,e} = \{s_1, ..., s_8\}$.
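As an illustration of normalizing one matching signal over the candidates of a mention, the sketch below uses a softmax. The softmax form is an assumption made for this example; the text only states that each signal is normalized over the candidate entities.

```python
import math
from typing import List


def normalize_signal(raw_scores: List[float]) -> List[float]:
    """Softmax over the candidates of one mention (assumed normalization)."""
    top = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores f_i(q, e') of one signal for three candidates
raw = [2.0, 1.0, 0.0]
normalized = normalize_signal(raw)
```

Normalizing per signal puts the outputs of M-CNN and K-NRM on a comparable scale while preserving the ordering of candidates within each signal.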

Rank Aggregation
Given a query q, we obtain multiple semantic matching signals for each entity candidate after the previous step. To take advantage of different semantic matching models on different text pairs, a straightforward approach is to use a linear regression layer to combine the multiple semantic matching signals (Francis-Landau et al., 2016). The linear combination learns a static weight for each matching signal. However, as we pointed out previously, the importance of different signals varies across queries: in some cases the abstract-level signals are important, while in other cases the concrete-level signals matter more. To address this issue, we introduce a pairwise rank aggregation method to aggregate multiple semantic matching signals.
In the area of information retrieval, rank aggregation combines rankings from multiple retrieval systems to produce a better new ranking (Carterette and Petkova, 2006). In our problem, given a query q, we have one ranking of the entity candidates for each semantic matching signal. We aim to find the final ranking by aggregating these rankings. Specifically, given a ranking of entities for one semantic matching signal, $e_1 \succ e_2 \succ e_3 \succ \dots$, where $e_i \succ e_j$ means entity $e_i$ is ranked above $e_j$, we extract all entity pairs $(e_i, e_j)$ from the ranking and assume that if $e_i \succ e_j$, then $e_i$ is preferred to $e_j$. We union all pairwise preferences generated from the multiple rankings into a single set, from which the final ranking is learned. In this paper, we apply TrueSkill (Herbrich et al., 2006), which is a Bayesian skill rating model. We present a two-layer version of TrueSkill with no draws.
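The extraction of pairwise preferences from per-signal rankings can be sketched as follows. The data layout (a dictionary from entity to its list of normalized scores) and the function names are assumptions; the sketch also keeps duplicate pairs in a list, so that a preference supported by several signals is counted several times, which is one possible reading of the union step.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def rankings_from_scores(scores: Dict[str, List[float]]) -> List[List[str]]:
    """Build one ranking of entities (best first) per matching signal."""
    n_signals = len(next(iter(scores.values())))
    return [
        sorted(scores, key=lambda e: scores[e][i], reverse=True)
        for i in range(n_signals)
    ]


def pairwise_preferences(rankings: List[List[str]]) -> List[Tuple[str, str]]:
    """Collect (preferred, other) pairs from every ranking."""
    preferences = []
    for ranking in rankings:
        # combinations() keeps input order, so the first element outranks the second.
        preferences.extend(combinations(ranking, 2))
    return preferences


# Hypothetical normalized scores for two signals
scores = {
    "Donald Trump": [0.7, 0.4],
    "Trump (card games)": [0.2, 0.5],
    "NIL": [0.1, 0.1],
}
rankings = rankings_from_scores(scores)
prefs = pairwise_preferences(rankings)
```

Each pair can then be treated as the outcome of one two-player game when fitting the skill rating model.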
TrueSkill assumes that the practical performance of each player in a game follows a normal distribution $N(\mu, \sigma^2)$, where $\mu$ is the skill level of the player and $\sigma$ is the uncertainty of the estimated skill level. In our setting, each candidate entity is a player, and each pairwise preference is the outcome of one game. TrueSkill learns the skill levels of players by leveraging Bayes' theorem. Given the current estimated skill levels of two players (prior probability) and the outcome of a new game between them (likelihood), the TrueSkill model updates its estimation of the player skill levels (posterior probability). TrueSkill updates the skill level $\mu$ and the uncertainty $\sigma$ intuitively: (a) if the outcome of a new competition is expected, i.e., the player with the higher skill level wins the game, it causes small updates in skill level $\mu$ and uncertainty $\sigma$; (b) if the outcome of a new competition is unexpected, i.e., the player with the lower skill level wins the game, it causes large updates in skill level $\mu$ and uncertainty $\sigma$. According to these intuitions, the equations to update the skill level $\mu$ and uncertainty $\sigma$ are as follows:
$$\mu_{winner} \leftarrow \mu_{winner} + \frac{\sigma^2_{winner}}{c} \cdot v\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)$$
$$\mu_{loser} \leftarrow \mu_{loser} - \frac{\sigma^2_{loser}}{c} \cdot v\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)$$
$$\sigma^2_{winner} \leftarrow \sigma^2_{winner} \cdot \left[1 - \frac{\sigma^2_{winner}}{c^2} \cdot w\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right]$$
$$\sigma^2_{loser} \leftarrow \sigma^2_{loser} \cdot \left[1 - \frac{\sigma^2_{loser}}{c^2} \cdot w\left(\frac{t}{c}, \frac{\varepsilon}{c}\right)\right]$$
where $t = \mu_{winner} - \mu_{loser}$ and $c^2 = 2\beta^2 + \sigma^2_{winner} + \sigma^2_{loser}$. Here, $\varepsilon$ is a parameter representing the probability of a draw in one game, and $v$ and $w$ are weighting factors for the skill level $\mu$ and the standard deviation $\sigma$ respectively. $\beta$ is a parameter representing the range of skills. In this paper, we set the initial values of the skill level $\mu$ and the standard deviation $\sigma$ of each player to the default values used in Herbrich et al. (2006). We use µ − 3β to rank entities, following Herbrich et al. (2006).
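A single two-player TrueSkill update can be sketched as below. The sketch makes several assumptions that should be read as such: it uses the standard no-draw weighting functions $v(t, \varepsilon) = \mathcal{N}(t - \varepsilon) / \Phi(t - \varepsilon)$ and $w(t, \varepsilon) = v(t, \varepsilon)\,(v(t, \varepsilon) + t - \varepsilon)$, a draw margin of zero, and the commonly cited default values $\mu_0 = 25$, $\sigma_0 = 25/3$, $\beta = 25/6$. It is not the full two-layer model.

```python
import math
from typing import Tuple

Rating = Tuple[float, float]  # (mu, sigma)


def _pdf(x: float) -> float:
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def v_win(t: float, eps: float) -> float:
    """Weighting factor for the mean update (no-draw case)."""
    return _pdf(t - eps) / _cdf(t - eps)


def w_win(t: float, eps: float) -> float:
    """Weighting factor for the variance update (no-draw case)."""
    v = v_win(t, eps)
    return v * (v + t - eps)


def trueskill_update(winner: Rating, loser: Rating, beta: float, eps: float = 0.0):
    """Return the updated (winner, loser) ratings after one game."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = math.sqrt(2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = v_win(t, eps / c)
    w = w_win(t, eps / c)
    new_winner = (
        mu_w + sigma_w ** 2 / c * v,
        sigma_w * math.sqrt(1.0 - sigma_w ** 2 / c ** 2 * w),
    )
    new_loser = (
        mu_l - sigma_l ** 2 / c * v,
        sigma_l * math.sqrt(1.0 - sigma_l ** 2 / c ** 2 * w),
    )
    return new_winner, new_loser


# Assumed defaults: mu0 = 25, sigma0 = 25/3, beta = 25/6
MU0, SIGMA0, BETA = 25.0, 25.0 / 3.0, 25.0 / 6.0
new_w, new_l = trueskill_update((MU0, SIGMA0), (MU0, SIGMA0), BETA)

# Expected vs. unexpected outcome: the upset moves the winner's mean much more.
expected, _ = trueskill_update((30.0, SIGMA0), (20.0, SIGMA0), BETA)
upset, _ = trueskill_update((20.0, SIGMA0), (30.0, SIGMA0), BETA)
```

Applying this update once per pairwise preference, with the preferred entity as the winner, yields a rating for every candidate from which the final ranking is read off.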

Experiments
In this section, we describe our experimental results on tweet entity linking. Particularly, we investigate the difference between two semantic matching models and the effectiveness of jointly combining these two semantic matching signals.

Datasets & Evaluation Metric
In our experiments, we evaluate our proposed model ASM on the following two datasets.
NEEL (Weller et al., 2016). We use the dataset of the Named Entity Extraction & Linking Challenge 2016. The training dataset consists of 6,025 tweets and includes 6,374 non-NIL queries and 2,291 NIL queries. The validation dataset consists of 100 tweets and includes 253 non-NIL queries and 85 NIL queries. The testing dataset consists of 300 tweets and includes 738 non-NIL queries and 284 NIL queries.
MSR-TEL (Guo et al., 2013). This dataset consists of 428 tweets and 770 non-NIL queries. We add MSR-TEL as a second evaluation dataset because the NEEL testing dataset has a distribution bias problem: 384 out of its 1,022 queries refer to three entities, 'Donald Trump', 'Star Wars' and 'Star Wars (The Force Awakens)'.
In this paper, we use accuracy as the major evaluation metric for entity disambiguation. Formally, we denote N as the number of queries and M as the number of correctly linked mentions given the gold mentions (i.e., the top-ranked entity is the gold entity), and define accuracy = M / N. In addition, we use precision, recall and F1 to evaluate the end-to-end system. Formally, we denote N′ as the number of mentions identified by a system and M′ as the number of correctly linked mentions among them. Thus, precision = M′ / N′, recall = M′ / N and F1 = 2 · precision · recall / (precision + recall).
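The metrics defined above can be computed as in the following sketch. The representation of mentions as (start, end) spans mapped to entity names, and the toy gold and predicted annotations, are assumptions made for illustration.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]


def disambiguation_accuracy(gold: Dict[Span, str], predicted: Dict[Span, str]) -> float:
    """accuracy = M / N, with gold mentions given to the system."""
    correct = sum(1 for span, entity in gold.items() if predicted.get(span) == entity)
    return correct / len(gold)


def end_to_end_prf(gold: Dict[Span, str], predicted: Dict[Span, str]):
    """Precision, recall and F1 when the system detects its own mentions."""
    correct = sum(1 for span, entity in predicted.items() if gold.get(span) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0.0 else 0.0
    return precision, recall, f1


# Hypothetical annotations for one tweet
gold = {(0, 1): "Donald Trump", (5, 6): "Canada", (9, 10): "Justin Trudeau"}
predicted = {(0, 1): "Donald Trump", (5, 6): "Canada (band)"}
precision, recall, f1 = end_to_end_prf(gold, predicted)
```

In the end-to-end setting a mention counts as correct only if both its span and its linked entity agree with the gold annotation, so mention detection errors lower recall even when disambiguation is perfect.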

Data Preprocessing
Tweet data. All tweets are normalized in the following way. First, we use the Twitter-aware tokenizer in NLTK to tokenize words in a tweet. We convert each hyperlink in tweets to a special token URL. Since hashtags usually do not contain any space between words, we follow Guo et al. (2013) and use a web service to break hashtags into tokens (e.g., the service breaks '#TheForceAwakens' into 'the force awakens'). Regarding usernames (@) in tweets, we replace them with their screen names (e.g., the screen name of the user '@jimmyfallon' is 'jimmy fallon').
Wikipedia data. We use the Wikipedia dump of December 2015 as the reference knowledge base. Since the most important information about an entity is usually at the beginning of its Wikipedia article, we use only the first 200 words of the article as its entity description. We use the default English word tokenizer in NLTK to tokenize each Wikipedia article.
Word embedding. We use the word2vec toolkit (Mikolov et al., 2013) to pre-train word embeddings on the whole English Wikipedia dump. The dimensionality of the word embeddings is set to 400. Note that we do not update the word embeddings during training.
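The tweet normalization steps described above can be approximated with the standard library alone. This is a rough stand-in, not the pipeline used in the experiments: whitespace splitting replaces the NLTK tokenizer, a camel-case regular expression replaces the hashtag-segmentation web service, and the `screen_names` lookup table is a hypothetical substitute for resolving usernames.

```python
import re
from typing import Dict

URL_RE = re.compile(r"https?://\S+")
# Splits camel-case hashtags such as "TheForceAwakens" into word pieces.
CAMEL_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_hashtag(tag: str) -> str:
    """Break a hashtag into lowercase tokens (camel-case heuristic only)."""
    return " ".join(piece.lower() for piece in CAMEL_RE.findall(tag.lstrip("#")))


def normalize_tweet(text: str, screen_names: Dict[str, str]) -> str:
    tokens = []
    for token in text.split():  # stand-in for a Twitter-aware tokenizer
        if URL_RE.fullmatch(token):
            tokens.append("URL")  # special token for hyperlinks
        elif token.startswith("#") and len(token) > 1:
            tokens.append(split_hashtag(token))
        elif token.startswith("@") and len(token) > 1:
            user = token[1:]
            tokens.append(screen_names.get(user.lower(), user))
        else:
            tokens.append(token)
    return " ".join(tokens)


# Hypothetical username lookup table
screen_names = {"jimmyfallon": "jimmy fallon"}
clean = normalize_tweet(
    "#TheForceAwakens with @jimmyfallon https://t.co/abc", screen_names
)
```

The camel-case heuristic fails on all-lowercase hashtags such as '#theforceawakens', which is one reason a dedicated segmentation service is preferable in practice.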

Experimental Setup
In our main experiment, we compare our proposed approaches with the following baselines: (a) The officially ranked 1st and 2nd systems in the NEEL 2016 challenge. We denote these two systems as Rank1 and Rank2. (b) TagMe (Ferragina and Scaiella, 2010), an end-to-end linking system that jointly performs mention detection and entity disambiguation. It focuses on short texts, including tweets.
To compare fairly with the Cucerzan and M-CNN baselines, we use the same mention detection and candidate generation for them and for our approaches. We train an LSTM-CRF based tagger (Lample et al., 2016) for mention detection using the NEEL training dataset. The precision, recall, and F1 of mention detection on the NEEL testing dataset are 96.1%, 89.2%, and 92.6% respectively. The precision, recall, and F1 of mention detection on the MSR-TEL dataset are 80.3%, 83.8%, and 82% respectively. As described in the previous section, we use heuristic rules for candidate generation. The recall of candidate generation on NEEL and MSR-TEL is 88.7% and 92.5% respectively.
When training our model, we use the stochastic gradient descent algorithm and the AdaDelta optimizer (Zeiler, 2012). The gradients are computed via back-propagation. The dimensionality of the hidden units in the convolutional neural network is set to 300. All parameters are initialized from a uniform distribution U(−0.01, 0.01). Since the dataset contains NIL entities, we tune a NIL threshold for the prediction of NIL entities on the validation dataset.

Main Results
The end-to-end performance of various approaches on the two datasets is shown in Table 2. For Rank1 and Rank2, we give only the F1 scores of these two systems on the NEEL dataset according to Weller et al. (2016). Note that the baseline systems Rank1, Rank2 and TagMe use different mention detection. The systems Rank1, Rank2, TagMe and Cucerzan are feature engineering based approaches, while M-CNN and ASM are neural approaches. From Table 2, we can observe that the neural approaches are superior to the feature engineering based approaches. Table 2 also shows that ASM outperforms the neural method M-CNN. Our proposed method ASM also shows improvements over Ensemble, which indicates the necessity of combining representation-focused and interaction-focused models in entity disambiguation.
Moreover, we pre-train M-CNN, Ensemble and ASM using 0.5 million anchors in Wikipedia, and fine-tune the model parameters using the non-NIL queries in the NEEL training dataset. From Table 2, we can observe that pre-training improves the performance of the neural models. The results in Table 2 also show that our proposed ASM is still superior to M-CNN and Ensemble in the pre-training setting. Since entity disambiguation is our focus, we also give the disambiguation accuracy of different approaches using the golden mentions in Table 3. Similarly, we observe that our proposed ASM outperforms the baseline systems.
Table 3: The accuracy of entity disambiguation with golden mentions on the two datasets.

Model Analysis
In this section, we discuss several key observations based on the experimental results, and we mainly report the entity disambiguation accuracy when given the golden mentions.

Effect of Different Semantic Matching Methods
We empirically analyze the difference between the two semantic matching models (M-CNN and K-NRM) and show the benefits of combining the semantic matching signals from these two models.
We first compare the performance of the two semantic matching models on two text pairs: (a) (m, ttl) and (b) (ctx, desc). These two pairs represent two extremes of the information used in the systems: (m, ttl) consumes the minimum amount of information from a query and an entity, while (ctx, desc) consumes the maximum amount of information from a query and an entity. From the first two columns in Table 4, we can observe that M-CNN performs comparably with K-NRM on the two text pairs. ASM, which combines the two models, obtains performance gains on the two individual text pairs. The third column in Table 4 also shows that ASM gives performance gains when using all text pairs. This indicates that M-CNN and K-NRM capture complementary information for entity disambiguation.
Moreover, we observe that the performance gains differ between the two pairs (m, ttl) and (ctx, desc). The gain on (ctx, desc) is relatively larger. This indicates that M-CNN and K-NRM capture more distinct information when the text is long. Additionally, we show the win-loss analysis of the two semantic matching models for non-NIL queries on (ctx, desc) in Table 5. The two models disagree on 12.1% (= 6.3% + 5.8%) of the queries, which confirms the necessity of combining them.

              M-CNN win    M-CNN loss
K-NRM win     58.3%        6.3%
K-NRM loss    5.8%         29.6%
Table 5: Win-loss analysis of M-CNN and K-NRM for non-NIL queries on (ctx, desc).

To further investigate the difference between the two semantic matching models on short text, we conducted a case study. Table 6 gives two examples. In the first example, the correct answer is 'Justin Trudeau', whose entity description contains the words 'Canada' and 'trump'. However, M-CNN fails to capture this concrete matching information, since the concrete information in the text might be lost after the convolution layer and the max-pooling layer. In contrast, K-NRM builds n-gram level local interactions between the texts, and thus successfully captures the concrete matching information (e.g., exact match) that leads to a correct linking result. In the second example, the entity descriptions of both candidate entities 'Star Wars' and 'Comparison of Star Trek and Star Wars' contain the phrase 'Star Wars' multiple times. In this case, K-NRM fails to distinguish the correct entity 'Star Wars' from the wrong entity 'Comparison of Star Trek and Star Wars', because it relies too much on the soft-TF information for matching, and the soft-TF information in the descriptions of the two entities is similar. In contrast, M-CNN captures the whole meaning of the text and links the mention to the correct entity. A detailed analysis of n-grams extracted from M-CNN is provided in the Appendix. Table 4 shows that the combination of multiple semantic matching signals yields the best performance.
Table 7 compares two different ways of combining the M-CNN and K-NRM models. The results show that the rank aggregation method outperforms the linear combination. The rank aggregation method dynamically summarizes the win-loss results for each signal and generates the final overall ranking by considering all win-loss results. The improvement of our method over the linear combination confirms that the importance of different semantic signals varies across queries, and that our method is more suitable for combining multiple semantic signals.

Related Work
Existing entity linking methods roughly fall into two categories. Early work focuses on local approaches, which handle one mention at a time and disambiguate each mention separately using hand-crafted features (Bunescu and Pasca, 2006; Ji and Grishman, 2008; Milne and Witten, 2008; Zheng et al., 2010). Recent work on entity linking has largely focused on global methods, which take the mentions in a document as inputs and find their corresponding entities simultaneously by considering the coherence of entity assignments within the document (Cucerzan, 2007; Hoffart et al., 2011; Globerson et al., 2016; Ganea and Hofmann, 2017).
Global models can tap into highly discriminative semantic signals (e.g., coreference and entity relatedness) that are unavailable to local methods, and have significantly outperformed the local approach on standard datasets (Globerson et al., 2016). However, global approaches are difficult to apply in domains where only short and noisy text is available (e.g., tweets). Many techniques have been proposed for short texts, including tweets. Huang et al. (2014) investigate collective tweet entity linking by considering multiple tweets simultaneously. Meij et al. (2012) and Guo et al. (2013) perform joint detection and disambiguation of mentions for tweet entity linking using feature based learning methods.
Recently, some neural network methods have been applied to entity linking to model the local contextual information. He et al. (2013) investigate Stacked Denoising Auto-encoders to learn entity representations. Sun et al. (2015) and Francis-Landau et al. (2016) apply convolutional neural networks to entity linking. Eshel et al. (2017) use recurrent neural networks to model the mention contexts. Nie et al. (2018) use a co-attention mechanism to select informative contexts and entity descriptions for entity disambiguation. However, none of these methods combine representation-focused and interaction-focused semantic matching methods to capture the semantic similarity for entity linking, or use a rank aggregation method to combine multiple semantic signals.

Conclusion
We propose an aggregated semantic matching framework, ASM, for short text entity linking. The combination of the representation-focused semantic matching method and the interaction-focused semantic matching method captures both compositional and concrete matching signals (e.g., exact match). Moreover, pairwise rank aggregation is applied to better combine multiple semantic signals. We have shown the effectiveness of ASM on two datasets through comprehensive experiments. In the future, we will apply our model to long text entity linking.