UNIBA: Combining Distributional Semantic Models and Sense Distribution for Multilingual All-Words Sense Disambiguation and Entity Linking

This paper describes the participation of the UNIBA team in Task 13 of SemEval-2015 on Multilingual All-Words Sense Disambiguation and Entity Linking. We propose an algorithm able to disambiguate both word senses and named entities by combining the simple Lesk approach with information coming from both a distributional semantic model and the usage frequency of meanings. The results for both English and Italian show satisfactory performance.


Introduction
SemEval-2015 Task 13 (Moro and Navigli, 2015) aims to evaluate systems that provide a comprehensive representation of text through the linking of both words and entities with concepts in a knowledge base. Besides the traditional difficulties of word sense disambiguation, this task requires methods capable of tackling the challenges posed by the named entity recognition, disambiguation, and linking steps. This paper proposes a unified strategy for word sense and named entity disambiguation which leverages BabelNet, a multilingual resource that encompasses both encyclopedic and lexicographic knowledge (Navigli and Ponzetto, 2012). Our approach relies on the Distributional Lesk (DL-WSD) algorithm (Basile et al., 2014), which disambiguates a word occurrence by computing the similarity between the word context and the glosses associated with all possible word meanings. Such a similarity is computed through a Distributional Semantic Model (DSM) (Sahlgren, 2006).
In this work we describe an extension of the DL-WSD algorithm that exploits a specific module for entity discovery given a list of possible surface forms. In particular, we build an index in which each surface form (i.e. candidate entity) is paired to the list of all its possible meanings in a semantic network. This index of surface forms is exploited to look up all candidate entities in a text.
The rest of this paper is structured as follows: Section 2 provides details about the adopted strategy, and describes the two main steps: 1) Entity Recognition and 2) Disambiguation. An experimental evaluation, along with details about results, is presented in Section 3, while conclusions close the paper.

Methodology
Our methodology is a two-step algorithm consisting of an initial identification of all possible entities mentioned in a text, followed by the disambiguation of both words and named entities through the DL-WSD algorithm. The semantic network is exploited twice, in order to 1) extract all the possible surface forms related to entities, and 2) retrieve the glosses used in the disambiguation process.

Entity Recognition
In order to speed up the entity recognition step, we build an index in which each surface form (entity) is paired with the set of all its possible meanings in the semantic network. Lucene is exploited to build the index: for each surface form (lexeme) occurring in BabelNet, a document composed of two fields is created. The first field stores the surface form, while the second contains the list of all possible BabelSynsets that refer to the surface form in the first field. The index is built separately for each language, Italian and English. The entity recognition module exploits this index to find entities in a text. Given a text fragment, the module performs the following steps:

• Building all n-grams up to five words;

• Querying the index and retrieving the list of the top t matching surface forms for each n-gram. A multi-match strategy can be enabled; for example, the 3-gram "European Union Commission" can match two entities: "European Union" and "European Union Commission". The multi-match strategy provides disambiguation for all the possible entities; otherwise, only the longest surface form is selected;

• Scoring each surface form through one of two approaches: EXACT MATCH computes a linear combination of the score provided by the search engine and a string similarity based on the Levenshtein distance between the n-gram and the candidate surface form in the index; PARTIAL MATCH computes a linear combination of the EXACT MATCH score and the Jaccard index over the words shared by the n-gram and the candidate surface form;

• Filtering the candidate entities recognized in the previous steps: entities are removed if their score is below a given threshold and/or the sequence of PoS-tags of the n-gram does not match a set of predefined patterns;

• Assigning to each candidate entity two additional scores based on the percentage of: 1) stop words, and 2) words that do not contain at least one upper-case character. A threshold can be fixed for each score to filter out some entities.
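The two scoring strategies can be sketched as follows. This is a minimal sketch in Python: the mixing weights alpha and beta and the normalised form of the Levenshtein similarity are illustrative assumptions, not the system's tuned parameters.

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def string_sim(a, b):
    # similarity in [0, 1] derived from the Levenshtein distance (assumed normalisation)
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

def exact_match(engine_score, ngram, surface, alpha=0.5):
    # linear combination of the search-engine score and string similarity
    return alpha * engine_score + (1 - alpha) * string_sim(ngram, surface)

def partial_match(engine_score, ngram, surface, alpha=0.5, beta=0.5):
    # combine the EXACT MATCH score with the Jaccard index over common words
    w1, w2 = set(ngram.lower().split()), set(surface.lower().split())
    jaccard = len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0
    return beta * exact_match(engine_score, ngram, surface, alpha) + (1 - beta) * jaccard
```

With equal weights, a surface form that covers more of the n-gram's words scores higher under PARTIAL MATCH, which is what makes the multi-match strategy prefer longer candidates such as "European Union" over "Union".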
Moreover, for each entity we build a set of alternatives. For example, given the candidate entity "European Union" we create the set of alternative surface forms {European, Union, EU, E.U.}. Then, we add all the BabelSynsets of "European Union" to the list of possible meanings of those words that follow the candidate entity and belong to the set of alternative forms.
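A possible implementation of the alternative-form generation is shown below; the text does not spell out the exact rule, so the acronym construction is our guess at how variants such as "EU" and "E.U." are produced.

```python
def alternative_forms(entity):
    # tokens of the entity plus acronym variants ("EU", "E.U.");
    # the acronym rule is a hypothetical reconstruction, not the
    # system's documented behaviour
    tokens = entity.split()
    forms = set(tokens)
    initials = "".join(t[0] for t in tokens if t[:1].isupper())
    if len(initials) > 1:
        forms.add(initials)                  # "EU"
        forms.add(".".join(initials) + ".")  # "E.U."
    return forms
```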
The output of the entity recognition module is a list of candidate entities in which a set of possible meanings (BabelSynset) is assigned to each surface form in the list. The set of named entities extracted by this module and the list of all the words in the text are the input to the DL-WSD algorithm.

DL-WSD
We exploit the distributional Lesk algorithm proposed by Basile et al. (2014) for disambiguating words and named entities. The algorithm replaces the concept of word overlap originally introduced by Lesk (1986) with the broader concept of semantic similarity computed in a distributional semantic space. Let w_1, w_2, ..., w_n be a sequence of words/entities; the algorithm disambiguates each target word/entity w_i by computing the semantic similarity between the glosses of the senses associated with the target and its context. This similarity is computed by representing both the gloss and the context in a DSM as the sum of the words they are composed of; the similarity thus takes into account co-occurrence evidence previously collected from a corpus of documents. The corpus plays a key role: the richer it is, the higher the probability that each word is fully represented in all its contexts of use. We exploit the word2vec tool (Mikolov et al., 2013) to build the DSM by analyzing all the pages in the latest English/Italian Wikipedia dump. The correct sense for a word is the one whose gloss maximizes the semantic similarity with the word/entity context. However, a sense gloss can be too short for a meaningful comparison with the context. Following this observation, we adopt an approach inspired by the adapted Lesk (Banerjee and Pedersen, 2002) and enrich the gloss of each sense with the glosses of related meanings, duly weighted to reflect their distance from the original sense. The algorithm consists of the following steps.
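The core similarity computation can be sketched as follows; the toy 2-dimensional vectors stand in for the word2vec embeddings the system actually uses, and the names are illustrative.

```python
import numpy as np

def text_vector(words, dsm):
    # a text is represented as the sum of the vectors of its words;
    # out-of-vocabulary words are simply skipped
    dim = len(next(iter(dsm.values())))
    vecs = [dsm[w] for w in words if w in dsm]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def gloss_context_sim(gloss_words, context_words, dsm):
    # similarity between a sense gloss and the target's context in the DSM
    return cosine(text_vector(gloss_words, dsm),
                  text_vector(context_words, dsm))
```

In this representation a gloss about money and deposits scores higher against a financial context than a gloss about rivers does, even with no literal word overlap, which is exactly what the distributional extension of Lesk buys over word overlap.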
Building the glosses. We retrieve the set S_i = {s_i1, s_i2, ..., s_ik} of senses associated with the word/entity w_i. For named entities this set is provided by the entity recognition module, while for words it is obtained by first looking up the WordNet portion of BabelNet and, if no sense is found, falling back to Wikipedia senses. For each sense s_ij, the algorithm builds the extended gloss representation g*_ij by adding to the original gloss g_ij the glosses of related meanings retrieved through the BabelNet function getRelatedMap, with the exception of antonym senses. Each word in g*_ij is weighted by a function inversely proportional to the distance d between s_ij and the related gloss in which the word occurs. Moreover, in order to emphasize words that discriminate among the different senses, the weight incorporates a variation of the inverse document frequency (idf) used in retrieval, which we name inverse gloss frequency (igf). The igf for a word w_k occurring gf*_k times in the set of extended glosses of all the senses in S_i (the sense inventory of w_i) is computed as:

IGF_k = 1 + log_2(|S_i| / gf*_k)

The final weight for the word w_k appearing h times in the extended gloss g*_ij is given by:

weight(w_k, g*_ij) = (h / (1 + d)) × IGF_k

Building the context. The context C for the word w_i is represented by all the words that occur in the text.
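In code, the gloss-word weighting reads as below; note that the h/(1 + d) factor is our reading of "inversely proportional to the distance d" and should be taken as an assumption about the exact functional form.

```python
import math

def igf(num_senses, gloss_freq):
    # inverse gloss frequency: IGF_k = 1 + log2(|S_i| / gf*_k)
    return 1.0 + math.log2(num_senses / gloss_freq)

def word_weight(h, d, num_senses, gloss_freq):
    # weight of a word appearing h times in an extended gloss, attenuated
    # by the distance d of the related sense that contributed it and
    # boosted by its IGF; the h/(1+d) form is assumed, not verbatim
    return (h / (1.0 + d)) * igf(num_senses, gloss_freq)
```

A word occurring in every extended gloss gets IGF = 1 (no boost), while a word appearing in only one of the |S_i| glosses is the most discriminative and gets the largest boost.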
Building the vector representations. The context C and each extended gloss g*_ij are represented as vectors in the semantic space built through the DSM.
Sense ranking. The algorithm computes the cosine similarity between the vector representation of each extended gloss g*_ij and that of the context C. The cosine similarity is then linearly combined with a function that takes into account how frequently the meaning is used in the language. This function computes the probability assigned to each synset given a word/named entity as follows:

Word. We exploit a synset-tagged corpus and attempt to map each word occurrence to WordNet (Miller, 1995). Then, we select the WordNet synset with the maximum probability.
Named Entity. We retrieve from BabelNet the Wikipedia pages related to the BabelSynset and count the number of times each Wikipedia page is linked from another page. In this way we use Wikipedia as a synset-tagged corpus.
We define the probability p(s_ij | w_i), which takes into account the sense distribution of s_ij given the word/entity w_i. The sense distribution is computed as the number of times the word/entity w_i is tagged with each sense; zero probabilities are avoided through additive (Laplace) smoothing. The probability is computed as:

p(s_ij | w_i) = (t(w_i, s_ij) + 1) / (Σ_k t(w_i, s_ik) + |S_i|)

where t(w_i, s_ij) is the number of times the word/entity w_i is tagged with the sense s_ij.
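The smoothed sense distribution and its combination with the gloss similarity can be sketched as follows. The mixing weight alpha is illustrative, not the system's tuned value, and `counts` is assumed to list every sense in the inventory S_i, with zero for unseen senses.

```python
def sense_probability(counts, sense):
    # p(s_ij | w_i) with additive (Laplace) smoothing over the sense
    # inventory S_i: (t(w_i, s_ij) + 1) / (sum_k t(w_i, s_ik) + |S_i|);
    # counts maps every sense in S_i to its tag count (possibly 0)
    total = sum(counts.values())
    return (counts.get(sense, 0) + 1) / (total + len(counts))

def rank_score(cos_sim, prob, alpha=0.5):
    # linear combination of gloss-context similarity and sense probability;
    # alpha is an illustrative mixing weight, not the system's parameter
    return alpha * cos_sim + (1 - alpha) * prob
```

The smoothing guarantees that a rare sense never receives probability zero, so the distributional similarity can still promote it when the context strongly supports it.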

Evaluation
The evaluation compares the system output against a gold standard manually annotated with synsets from BabelNet 2.5.1. Test data consist of four documents belonging to three different domains: biomedical, maths and computer science, and general. The idea is to evaluate the algorithm performance in both general and specific domains. We submitted three runs with different parameter settings that mainly affect the entity recognition module. System settings are reported in Table 1; all runs fix the number of entities retrieved by the search engine to 25, and the thresholds for the stop-word and lower-case filters to 0.3.

Table 2 reports the official results released by the task organizers. Our best system ranks 4th among 17 submissions for English, and 4th among 8 for Italian. As reported in Table 2, our system is not scored on adjectives. This is due to a PoS-tag mismatch: in the trial data adjectives are tagged with 'A', while in the test data they are tagged with 'J'. We inadvertently failed to account for this change in our system during testing. After the release of the gold standard, we fixed the issue and performed a new experiment, whose results are reported in Table 3. Since the results for nouns, verbs and adverbs are not affected by the fix, they are not reported again in the table. Considering the new results in Table 3, our system ranks 3rd for English and 2nd for Italian.

Another goal of the task is to evaluate system performance on different domains. In particular, three domains were provided: biomedical (bio), maths and computer science (math), and general (gnr). Results for each domain and language are reported in Table 4. Our performance on each domain follows a trend very similar to that of the best system for each language: the math/computer science domain is the hardest to disambiguate, while the biomedical one seems to be the easiest.
A deeper analysis of the domain results shows that our system is the best at disambiguating named entities in the Italian biomedical and math/computer science domains, while it achieves the lowest performance in the general domain for both Italian and English. It is worth noting that the system settings seem not to affect the overall performance, while an analysis focused on named entities alone reveals slight differences between settings. This behaviour is due to the different methods used to recognize named entities. The task description paper reports more details about the results (Moro and Navigli, 2015).

Conclusions
We presented a unified approach to entity linking and word sense disambiguation that relies on a distributional extension of the simple Lesk disambiguation algorithm, extended with a module able to recognize candidate named entities. We evaluated three different configurations of the recognition module in SemEval-2015 Task 13. The experimental evaluation showed competitive results, with our best run ranked among the top systems.