Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison

Word Sense Disambiguation is a long-standing task in Natural Language Processing, lying at the core of human language understanding. However, the evaluation of automatic systems has been problematic, mainly due to the lack of a reliable evaluation framework. In this paper we develop a unified evaluation framework and analyze the performance of various Word Sense Disambiguation systems in a fair setup. The results show that supervised systems clearly outperform knowledge-based models. Among the supervised systems, a linear classifier trained on conventional local features still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting neural networks on unlabeled corpora achieve promising results, surpassing this hard baseline in most test sets.


Introduction
Word Sense Disambiguation (WSD) has been a long-standing task in Natural Language Processing (NLP). It lies at the core of language understanding and has been studied from many different angles (Navigli, 2009; Navigli, 2012). However, the field seems to be slowing down due to the lack of groundbreaking improvements and the difficulty of integrating current WSD systems into downstream NLP applications (de Lacalle and Agirre, 2015). In general, the field lacks a clear path, partially owing to the fact that identifying real improvements over existing approaches is hard with current evaluation benchmarks. This is mainly due to the lack of a unified framework, which prevents direct and fair comparison among systems. Even though many evaluation datasets have been constructed for the task (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Navigli et al., 2007; Pradhan et al., 2007; Agirre et al., 2010a; Navigli et al., 2013; Moro and Navigli, 2015, inter alia), they tend to differ in format, construction guidelines and underlying sense inventory. In the case of the datasets annotated using WordNet (Miller, 1995), the de facto sense inventory for WSD, we encounter the additional barrier of text annotated with different versions. These divergences are mostly resolved case by case, by using or constructing automatic mappings. Checking the quality of such mappings, however, tends to be impractical, and this leads to mapping errors which introduce additional system inconsistencies into the experimental setting. This issue extends directly to the training corpora used by supervised systems. In fact, results obtained by supervised or semi-supervised systems reported in the literature are not completely reliable, because the systems may not have been trained on the same corpus, or the corpus may have been preprocessed differently, or annotated with a sense inventory different from that of the test data.
Together, the foregoing issues prevent us from drawing reliable conclusions about different models, as in some cases ostensible improvements may have been obtained as a consequence of the nature of the training corpus, the preprocessing pipeline or the version of the underlying sense inventory, rather than of the model itself. Moreover, because of these divergences, current systems tend to report results on a few datasets only, making a direct quantitative comparison hard to perform. This paper offers two main contributions. First, we provide a complete evaluation framework for all-words Word Sense Disambiguation which overcomes all the aforementioned limitations by (1) standardizing the WSD datasets and training corpora into a unified format, (2) semi-automatically converting annotations from any dataset to WordNet 3.0, and (3) preprocessing the datasets consistently with the same pipeline. Second, we use this evaluation framework to perform a fair quantitative and qualitative empirical comparison of the main techniques proposed in the WSD literature, including the latest advances based on neural networks.

State of the Art
The task of Word Sense Disambiguation consists of associating words in context with the most suitable entry in a pre-defined sense inventory. Depending on their nature, WSD systems are divided into two main groups: supervised and knowledge-based. In what follows we summarize the current state of these two types of approach.

Supervised WSD
Supervised models are trained on features extracted from manually sense-annotated corpora. These features have mostly been based on the information provided by the words surrounding the target word (Keok and Ng, 2002; Navigli, 2009) and its collocations. Recently, more complex features based on word embeddings trained on unlabeled corpora have also been explored (Taghipour and Ng, 2015b; Rothe and Schütze, 2015; Iacobacci et al., 2016). These features are generally taken as input to train a linear classifier (Zhong and Ng, 2010; Shen et al., 2013). In addition to these conventional approaches, the latest developments in neural language models have motivated some researchers to include them in their WSD architectures (Kågebäck and Salomonsson, 2016; Melamud et al., 2016; Yuan et al., 2016). Supervised models have traditionally been able to outperform knowledge-based systems (Navigli, 2009). However, obtaining sense-annotated corpora is highly expensive, and in many cases such corpora are not available for specific domains. This is the reason why some of these supervised methods have started to rely on unlabeled corpora as well. These approaches, often classified as semi-supervised, aim at overcoming the knowledge acquisition bottleneck of conventional supervised models (Pilehvar and Navigli, 2014). In fact, there is a line of research specifically aimed at automatically obtaining large amounts of high-quality sense-annotated corpora (Taghipour and Ng, 2015a; Raganato et al., 2016; Camacho-Collados et al., 2016a).
In this work we compare supervised systems and study the role of their underlying sense-annotated training corpus. Since semi-supervised models have been shown to outperform fully supervised systems in some settings (Taghipour and Ng, 2015b; Başkaya and Jurgens, 2016; Iacobacci et al., 2016; Yuan et al., 2016), we evaluate and compare models using both manually-curated and automatically-constructed sense-annotated corpora for training.

Knowledge-based WSD
In contrast to supervised systems, knowledge-based WSD techniques do not require any sense-annotated corpus. Instead, these approaches rely on the structure or content of manually-curated knowledge resources for disambiguation. One of the first approaches of this kind was Lesk (1986), which in its original version consisted of calculating the overlap between the context of the target word and its definitions as given by the sense inventory. Based on the same principle, various works have adapted the original algorithm by also taking into account definitions of related words (Banerjee and Pedersen, 2003), or by calculating the distributional similarity between definitions and the context of the target word (Basile et al., 2014; Chen et al., 2014). Distributional similarity has also been exploited in different settings in various works (Miller et al., 2012; Camacho-Collados et al., 2015; Camacho-Collados et al., 2016b). In addition to these approaches based on distributional similarity, an important branch of knowledge-based systems bases its techniques on the structural properties of semantic graphs built from lexical resources (Agirre and Soroa, 2009; Guo and Diab, 2010; Ponzetto and Navigli, 2010; Agirre et al., 2014; Moro et al., 2014; Weissenborn et al., 2015; Tripodi and Pelillo, 2016). Generally, these graph-based WSD systems first create a graph representation of the input text and then apply graph-based algorithms (e.g., PageRank) over the given representation to perform WSD.
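The graph-based family of approaches can be illustrated with a minimal, self-contained sketch: a toy power-iteration PageRank over a hypothetical synset graph. The node names and edges below are invented for illustration; real systems operate on WordNet- or BabelNet-scale graphs and use more elaborate variants such as Personalized PageRank.

```python
# Toy sketch of graph-based WSD: rank the candidate senses of a target
# word by their centrality in a small semantic graph (hypothetical ids).

def pagerank(graph, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over an undirected adjacency dict."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / len(graph[m])
                               for m in graph if n in graph[m])
            for n in nodes
        }
    return rank

# Hypothetical graph linking senses of "bank" to senses of context words.
graph = {
    "bank#finance": {"money#1", "deposit#1", "account#1"},
    "bank#river":   {"water#1"},
    "money#1":      {"bank#finance", "deposit#1"},
    "deposit#1":    {"bank#finance", "money#1"},
    "account#1":    {"bank#finance"},
    "water#1":      {"bank#river"},
}

candidates = ["bank#finance", "bank#river"]
scores = pagerank(graph)
best = max(candidates, key=scores.get)  # most central candidate sense
```

Here the denser financial cluster makes `bank#finance` the more central candidate, which mirrors how graph connectivity serves as the disambiguation signal in such systems.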

Standardization of WSD datasets
In this section we explain our pipeline for transforming any given evaluation dataset or sense-annotated corpus into a preprocessed unified format. In our pipeline we do not make any distinction between evaluation datasets and sense-annotated training corpora, as the pipeline can be applied equally to both types. For simplicity we will refer to both evaluation datasets and training corpora as WSD datasets. Figure 1 summarizes our pipeline to standardize a WSD dataset. The process consists of four steps:
1. Most WSD datasets in the literature use a similar XML format, but they diverge in how they encode the information.
For instance, the SemEval-15 dataset (Moro and Navigli, 2015) was developed for both WSD and Entity Linking, and its format was especially designed for the latter task. Therefore, we decided to convert all datasets to a unified format. As unified format we use the XML scheme of the SemEval-13 all-words WSD task (Navigli et al., 2013), in which preprocessing information of a given corpus is also encoded.
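As a rough illustration of this kind of unified XML scheme (the element and attribute names below are an approximation for illustration, not the official SemEval-13 schema), the annotated text can be parsed with a few lines of standard-library code:

```python
# Illustrative sketch of a SemEval-13-style unified XML format.
import xml.etree.ElementTree as ET

sample = """<corpus lang="en">
  <text id="d000">
    <sentence id="d000.s000">
      <wf lemma="the" pos="DET">The</wf>
      <instance id="d000.s000.t000" lemma="bank" pos="NOUN">bank</instance>
      <wf lemma="lend" pos="VERB">lends</wf>
      <wf lemma="money" pos="NOUN">money</wf>
    </sentence>
  </text>
</corpus>"""

root = ET.fromstring(sample)
# Collect the instances to disambiguate, with their lemma and PoS tag;
# <wf> elements carry preprocessing information for non-target tokens.
instances = [(i.get("id"), i.get("lemma"), i.get("pos"))
             for i in root.iter("instance")]
```

Encoding lemma and PoS directly in the markup is what allows all systems to consume identical preprocessed data.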
2. Once the dataset is converted to the unified format, we map the sense annotations from their original WordNet version to 3.0, the latest WordNet version used in evaluation datasets. This mapping is carried out semi-automatically. First, we use automatically-constructed WordNet mappings (Daude et al., 2003). These mappings provide confidence values, which we use to initially map senses whose mapping confidence is 100%. Then, the annotations of the remaining senses are manually checked, and re-annotated or removed whenever necessary. Additionally, in this step we decided to remove all annotations of auxiliary verbs, following the annotation guidelines of the latest WSD datasets.
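The confidence-based mapping step can be sketched as follows; the mapping table and sense keys are hypothetical entries standing in for the automatically-constructed WordNet mappings:

```python
# Sketch of the semi-automatic sense mapping step: annotations whose
# mapping confidence is 100% are converted automatically; the rest are
# queued for manual inspection. Keys and confidences are hypothetical.

mapping = {  # old-version sense key -> (WordNet 3.0 key, confidence)
    "bank%1:17:01::": ("bank%1:17:01::", 1.00),
    "keep%2:40:01::": ("keep%2:40:11::", 0.83),
}

def map_annotations(annotations):
    mapped, to_review = {}, []
    for inst_id, old_key in annotations.items():
        new_key, conf = mapping.get(old_key, (None, 0.0))
        if conf == 1.00:
            mapped[inst_id] = new_key   # automatic conversion
        else:
            to_review.append(inst_id)   # manual check, re-annotation or removal
    return mapped, to_review

mapped, to_review = map_annotations(
    {"t000": "bank%1:17:01::", "t001": "keep%2:40:01::"})
```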
3. Third, we preprocess each dataset with the Stanford CoreNLP toolkit (Manning et al., 2014) for Part-of-Speech (PoS) tagging and lemmatization. This step is performed in order to ensure that all systems use the same preprocessed data.
4. Finally, we developed a script to check that the final dataset conforms to the aforementioned guidelines. In this final verification we also ensured that the sense annotations match the lemma and the PoS tag provided by Stanford CoreNLP, automatically fixing all divergences.
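A minimal sketch of such a verification script, assuming hypothetical instance tuples and gold sense keys (and checking only the lemma component of the sense key for brevity), might look like this:

```python
# Sketch of the final verification step: flag instances whose gold sense
# annotation disagrees with the lemma produced by the preprocessing
# pipeline. Field layout and sense keys are hypothetical; a real check
# would also verify the PoS tag.

def validate(instances, gold):
    """Return ids whose gold sense key disagrees with the lemma."""
    divergent = []
    for inst_id, lemma, pos in instances:
        sense_key = gold.get(inst_id)
        if sense_key is None:
            continue  # unannotated instance, nothing to check
        key_lemma = sense_key.split("%")[0]  # lemma part of the sense key
        if key_lemma != lemma:
            divergent.append(inst_id)
    return divergent

bad = validate([("t0", "bank", "NOUN"), ("t1", "keep", "VERB")],
               {"t0": "bank%1:17:01::", "t1": "hold%2:40:01::"})
```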

Data
In this section we summarize the WSD datasets used in the evaluation framework. To all these datasets we apply the standardization pipeline described in Section 3. First, we enumerate all the datasets used for the evaluation (Section 4.1). Second, we describe the sense-annotated corpora used for training (Section 4.2). Finally, we show some relevant statistics extracted from these resources (Section 4.3).

WSD evaluation datasets
For our evaluation framework we considered five standard all-words fine-grained WSD datasets from the Senseval and SemEval competitions:
• Senseval-2 (Edmonds and Cotton, 2001). This dataset was originally annotated with WordNet 1.7. After standardization, it consists of 2,282 sense annotations, including nouns, verbs, adverbs and adjectives.
• Senseval-3 task 1 (Snyder and Palmer, 2004).
Table 1: Statistics of the WSD datasets used in the evaluation framework (after standardization).
• SemEval-07 task 17 (Pradhan et al., 2007). This is the smallest among the five datasets, containing 455 sense annotations for nouns and verbs only. It was originally annotated using WordNet 2.1 sense inventory.
• SemEval-13 task 12 (Navigli et al., 2013). This dataset includes thirteen documents from various domains. In this case the original sense inventory was WordNet 3.0, which is the same as the one that we use for all datasets. The number of sense annotations is 1644, although only nouns are considered.
• SemEval-15 task 13 (Moro and Navigli, 2015). This is the most recent WSD dataset available to date, annotated with WordNet 3.0. It consists of 1022 sense annotations in four documents coming from three heterogeneous domains: biomedical, mathematics/computing and social issues.

Sense-annotated training corpora
We now describe the two WordNet sense-annotated corpora used for training the supervised systems in our evaluation framework:
• SemCor (Miller et al., 1994). SemCor is a manually sense-annotated corpus divided into 352 documents, for a total of 226,040 sense annotations. It was originally tagged with senses from the WordNet 1.4 sense inventory. SemCor is, to our knowledge, the largest corpus manually annotated with WordNet senses, and is the main corpus used in the literature to train supervised WSD systems (Agirre et al., 2010b; Zhong and Ng, 2010).
• OMSTI (Taghipour and Ng, 2015a). OMSTI (One Million Sense-Tagged Instances) is a large corpus annotated with senses from the WordNet 3.0 inventory. It was automatically constructed by applying an alignment-based WSD approach (Chan and Ng, 2005) to a large English-Chinese parallel corpus (Eisele and Chen, 2010, MultiUN corpus). OMSTI has already shown its potential as a training corpus by improving the performance of supervised systems which add it to existing training data (Taghipour and Ng, 2015a; Iacobacci et al., 2016).
Table 1 shows some statistics of the WSD datasets and training corpora which we use in the evaluation framework. The number of sense annotations varies across datasets, ranging from 455 annotations in the SemEval-07 dataset to 2,282 annotations in the Senseval-2 dataset. As regards sense-annotated corpora, OMSTI is made up of almost 1M sense annotations, a considerable increase over the number of sense annotations of SemCor. However, SemCor is much more balanced in terms of unique senses covered (3,730 covered by OMSTI in contrast to over 33K covered by SemCor). Additionally, while OMSTI was constructed automatically, SemCor was built manually and, hence, its quality is expected to be higher. Finally, we calculated the ambiguity level of each dataset, computed as the total number of candidate senses (i.e., senses sharing the surface form of the target word) divided by the number of sense annotations. The highest ambiguity is found in OMSTI, which, despite being constructed automatically, has a high coverage of ambiguous words. As far as the evaluation competition datasets are concerned, the ambiguity may give a hint as to how difficult a given dataset may be. In this case, SemEval-07 displays the highest ambiguity level among all evaluation datasets.
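The ambiguity statistic defined above (total number of candidate senses divided by the number of sense annotations) is straightforward to compute; the candidate-sense counts below are toy stand-ins for a real sense-inventory lookup:

```python
# Sketch of the ambiguity-level statistic. In a real implementation the
# candidate counts would come from the WordNet 3.0 sense inventory; the
# values here are illustrative.

candidate_senses = {("bank", "NOUN"): 10, ("keep", "VERB"): 22,
                    ("water", "NOUN"): 6}

def ambiguity(annotations):
    """annotations: list of (lemma, pos) pairs for each annotated instance."""
    total = sum(candidate_senses[(lemma, pos)] for lemma, pos in annotations)
    return total / len(annotations)

level = ambiguity([("bank", "NOUN"), ("keep", "VERB"), ("water", "NOUN")])
```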

Evaluation
The evaluation framework consists of the WSD evaluation datasets described in Section 4.1. In this section we use this framework to perform an empirical comparison among a set of heterogeneous WSD systems. The systems used in the evaluation are described in detail in Section 5.1, the results are shown in Section 5.2 and a detailed analysis is presented in Section 5.3.

Comparison systems
We include three supervised (Section 5.1.1) and three knowledge-based (Section 5.1.2) all-words WSD systems in our empirical comparison.

Supervised
To ensure a fair comparison, all supervised systems use the same corpora for training: SemCor and SemCor+OMSTI (see Section 4.2). As already noted by Taghipour and Ng (2015a), supervised systems trained on OMSTI alone obtain lower results than when SemCor is added, mainly due to OMSTI's lack of coverage of target word types. In the following we describe the three supervised WSD systems used in the evaluation:
• IMS (Zhong and Ng, 2010) uses a Support Vector Machine (SVM) classifier over a set of conventional WSD features. IMS (we used the original implementation available at http://www.comp.nus.edu.sg/~nlp/software.html) is built on a flexible framework which allows an easy integration of different features. The default implementation includes surrounding words, PoS tags of surrounding words, and local collocations as features.
• IMS+embeddings (Taghipour and Ng, 2015b; Rothe and Schütze, 2015; Iacobacci et al., 2016). These approaches have shown the potential of using word embeddings in the WSD task. Iacobacci et al. (2016) carried out a comparison of different strategies for integrating word embeddings as a feature in WSD. In this paper we consider the two best configurations of Iacobacci et al. (2016): using all IMS default features either including or excluding surrounding words (IMS+emb and IMS-s+emb, respectively).
In both cases word embeddings are integrated using exponential decay (i.e., word weights drop exponentially as the distance from the target word increases). Likewise, we use Iacobacci et al.'s suggested learning strategy and hyperparameters to train the word embeddings: the Skip-gram model of Word2Vec (Mikolov et al., 2013) with 400 dimensions, ten negative samples and a window size of ten words. As the unlabeled corpus to train the word embeddings we use the English ukWaC corpus (Baroni et al., 2009), which is made up of two billion words from paragraphs extracted from the web.
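The exponential-decay integration can be sketched as follows, with toy 2-dimensional token embeddings and a hypothetical decay factor `alpha` (the actual feature construction in IMS+emb is more involved):

```python
# Sketch of exponential-decay weighting: each context word's embedding
# contributes with a weight that decays exponentially with its distance
# from the target word. Embeddings are toy 2-d vectors; `alpha` is a
# hypothetical decay factor.

def context_vector(sentence_vectors, target_index, alpha=0.5):
    """Weighted sum of the embeddings of all context tokens."""
    dim = len(sentence_vectors[0])
    acc = [0.0] * dim
    for i, vec in enumerate(sentence_vectors):
        if i == target_index:
            continue  # skip the target word itself
        weight = alpha ** abs(i - target_index)  # exponential decay
        for d in range(dim):
            acc[d] += weight * vec[d]
    return acc

vecs = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]  # one embedding per token
ctx = context_vector(vecs, target_index=1)   # target is the middle token
```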
• Context2Vec (Melamud et al., 2016). Neural language models have recently shown their potential for the WSD task (Kågebäck and Salomonsson, 2016; Yuan et al., 2016). In this experiment we replicated the approach of Melamud et al. (2016, Context2Vec), for which the code is publicly available. This approach is divided into three steps. First, a bidirectional LSTM recurrent neural network is trained on an unlabeled corpus (we used the same ukWaC corpus as for the previous comparison system). Then, a context vector is learned for each sense annotation in the training corpus. Finally, the sense annotation whose context vector is closest to the context vector of the target word is selected as the intended sense.
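The final decision rule of this kind of approach (pick the sense whose stored context vector is most similar to that of the test instance) can be sketched with toy vectors and cosine similarity:

```python
# Sketch of a Context2Vec-style decision rule: choose the sense whose
# pre-computed context vector is closest, by cosine similarity, to the
# context vector of the test instance. All vectors are toy values; in
# the real system they come from a bidirectional LSTM.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

sense_vectors = {            # hypothetical per-sense context vectors
    "bank#finance": [0.9, 0.1],
    "bank#river":   [0.1, 0.9],
}

def predict(test_context):
    return max(sense_vectors,
               key=lambda s: cosine(sense_vectors[s], test_context))

pred = predict([0.8, 0.3])   # context vector of the test instance
```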

Knowledge-based
In this section we describe the three knowledge-based WSD models used in our empirical comparison:
• Lesk (Lesk, 1986) is a simple knowledge-based WSD algorithm that bases its calculations on the overlap between the definitions of a given sense and the context of the target word. For our experiments we replicated the extended version of the original algorithm, in which definitions of related senses are also considered and the conventional term frequency-inverse document frequency (Jones, 1972, tf-idf) is used for word weighting (Banerjee and Pedersen, 2003, Lesk_ext). We also evaluate a variant, Lesk_ext+emb, which computes the similarity between definitions and the context of the target word via word embeddings; for this variant we used the same word embeddings described in Section 5.1.1 for IMS+emb and the implementation from https://github.com/pippokill/lesk-wsd-dsm, in which additional definitions from BabelNet are considered.
• UKB (Agirre and Soroa, 2009; Agirre et al., 2014) is a graph-based system which applies random walks (Personalized PageRank) over the WordNet semantic graph. We used the latest implementation available at http://ixa2.si.ehu.es/ukb/.
• Babelfy (Moro et al., 2014) is a graph-based disambiguation approach which exploits random walks to determine connections between synsets. Specifically, Babelfy (we used the Java API from http://babelfy.org) uses random walks with restart (Tong et al., 2006) over BabelNet (Navigli and Ponzetto, 2012), a large semantic network integrating WordNet among other resources such as Wikipedia or Wiktionary. Its algorithm is based on a densest-subgraph heuristic for selecting high-coherence semantic interpretations of the input text. The best configuration of Babelfy takes into account not only the sentence in which the target word occurs, but also the whole document.
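The overlap principle behind the original Lesk algorithm can be sketched in a few lines; the glosses and sense names below are abridged, illustrative stand-ins, and the extended variants additionally use related-sense definitions and tf-idf weighting:

```python
# Toy sketch of the original Lesk overlap: score each candidate sense by
# how many words its definition shares with the context of the target
# word. Definitions are invented glosses, not actual WordNet text.

definitions = {
    "bank#finance": "a financial institution that accepts deposits",
    "bank#river": "sloping land beside a body of water",
}

def lesk(context, candidates):
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(definitions[sense].lower().split()))
    return max(candidates, key=overlap)  # sense with the largest overlap

sense = lesk("the bank approved the deposits of the institution",
             ["bank#finance", "bank#river"])
```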
As a knowledge-based baseline we included the WordNet first sense. This baseline simply selects the candidate sense that is listed first in WordNet 3.0. Even though the sense order was decided on the basis of semantically-tagged text, we consider it knowledge-based in this experiment because this information is already available in WordNet; in fact, knowledge-based systems like Babelfy include it in their pipeline. Despite its simplicity, this baseline has been shown to be hard to beat for automatic WSD systems (Navigli, 2009; Agirre et al., 2014).

Results
Table 2 shows the F-Measure performance of all comparison systems on the five all-words WSD datasets. Since not all test word instances are covered by the corresponding training corpora, supervised systems have a maximum F-Score (ceiling in the table) they can achieve. Nevertheless, supervised systems consistently outperform knowledge-based systems across datasets, confirming the results of Pilehvar and Navigli (2014). A simple linear classifier over conventional WSD features (i.e., IMS) proves to be robust across datasets, consistently outperforming the MFS baseline. The recent integration of word embeddings as an additional feature is beneficial, especially as a replacement for the feature based on the surface form of surrounding words (i.e., IMS-s+emb). Moreover, recent advances in neural language models (in the case of Context2Vec, a bidirectional LSTM) appear highly promising for the WSD task, as Context2Vec outperforms IMS in most datasets.
On the other hand, it is also interesting to note the performance inconsistencies of systems across datasets, as in all cases there is a large performance gap between the best and the worst performing dataset. As explained in Section 4.3, the ambiguity level may give a hint as to how difficult the corresponding dataset is. In fact, WSD systems obtain relatively low results on SemEval-07, which is the most ambiguous dataset (see Table 1). However, this is the dataset in which supervised systems achieve the largest margin with respect to the MFS baseline, which suggests that, in general, the MFS heuristic does not perform accurately on highly ambiguous words.
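For reference, the WordNet first sense baseline discussed in Section 5.1.2 amounts to a lookup in a frequency-ordered sense list; the inventory below is a toy stand-in for WordNet 3.0:

```python
# Sketch of the WordNet first sense baseline: always answer with the
# first-listed sense of the target lemma. The inventory and sense keys
# are illustrative placeholders for WordNet 3.0's ordered sense lists.

inventory = {  # (lemma, pos) -> senses in WordNet order, most frequent first
    ("bank", "NOUN"): ["bank%1:14:00::", "bank%1:17:01::"],
    ("keep", "VERB"): ["keep%2:40:01::", "keep%2:42:00::"],
}

def first_sense(lemma, pos):
    return inventory[(lemma, pos)][0]

pred = first_sense("bank", "NOUN")
```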

Analysis
To complement the results from the previous section, we additionally carried out a detailed analysis of the performance of each system, both globally and divided by PoS tag. To this end, we concatenated all five datasets into a single dataset. This resulted in a large evaluation dataset of 7,253 instances to disambiguate (see Table 3). Table 4 shows the F-Measure performance of all comparison systems on the concatenation of all five WSD evaluation datasets, divided by PoS tag. IMS-s+emb trained on SemCor+OMSTI achieves the best overall results, slightly above Context2Vec trained on the same corpus. In what follows we describe some of the main findings extracted from our analysis.
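The F-Measure used throughout the evaluation can be made concrete with a small scoring sketch (precision over attempted instances, recall over all gold instances, and their harmonic mean), using toy gold annotations and system answers:

```python
# Sketch of WSD scoring: precision is computed over attempted instances,
# recall over all gold instances, and F-Measure is their harmonic mean.
# A prediction of None denotes an unanswered instance. When a system
# answers everything, precision, recall and F-Measure coincide.

def score(gold, predictions):
    attempted = {i: s for i, s in predictions.items() if s is not None}
    correct = sum(1 for i, s in attempted.items() if gold.get(i) == s)
    p = correct / len(attempted) if attempted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {"t0": "a", "t1": "b", "t2": "c", "t3": "d"}  # toy sense labels
p, r, f = score(gold, {"t0": "a", "t1": "b", "t2": "x", "t3": None})
```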
Training corpus. In general, the results of supervised systems trained on SemCor only (manually annotated) are lower than those obtained by training simultaneously on both SemCor and OMSTI (automatically annotated). This is a promising finding, which confirms the results of previous works (Raganato et al., 2016; Iacobacci et al., 2016; Yuan et al., 2016) and encourages further research on developing reliable automatic or semi-automatic methods to obtain large amounts of sense-annotated corpora in order to overcome the knowledge-acquisition bottleneck. For instance, Context2Vec improves by 0.4 points overall when the automatically sense-annotated OMSTI is added to the training corpus, suggesting that more data, even if not perfectly clean, may be beneficial for neural language models.
Knowledge-based vs. Supervised. One of the main conclusions that can be drawn from the evaluation is that supervised systems clearly outperform knowledge-based models. This may be due to the fact that in many cases the main disambiguation clue is given by the immediate local context. This is particularly problematic for knowledge-based systems, as they take all the words within a sentence (or document, in the case of Babelfy) equally into account. For instance, in the following sentence, both UKB and Babelfy fail to predict the correct sense of state: In sum, at both the federal and state government levels at least part of the seemingly irrational behavior voters display in the voting booth may have an exceedingly rational explanation. In this sentence, state is annotated with its administrative districts of a nation sense in the gold standard. The main disambiguation clue seems to be given by its immediately preceding and following words (federal and government), which tend to co-occur with this particular sense. However, knowledge-based WSD systems like UKB or Babelfy give the same weight to all words in context, underrating the importance of this local disambiguation clue in the example. For instance, UKB disambiguates state with the sense defined as the way something is with respect to its main attributes, probably biased by words which are not immediately next to the target word within the sentence, e.g., irrational, behavior, rational or explanation.
Low overall performance on verbs. As can be seen from Table 4, the F-Measure performance of all systems on verbs is in all cases below 58%. This can be explained by the high granularity of verbs in WordNet. For instance, the verb keep has 22 different meanings in WordNet 3.0, six of them denoting "possession and transfer of possession". In fact, the average ambiguity level of all verbs in this evaluation framework is 10.4 (see Table 3), considerably greater than the ambiguity of other PoS tags, e.g., 4.8 for nouns. Nonetheless, supervised systems manage to comfortably outperform the MFS baseline, which does not seem to be reliable for verbs given their high ambiguity.
Influence of preprocessing. As mentioned in Section 3, our evaluation framework provides a preprocessing of the corpora with Stanford CoreNLP. This ensures a fair comparison among all systems but may introduce some annotation inaccuracies, such as erroneous PoS tags. For English, however, these errors are minimal: the global error rate of the Stanford PoS tagger over all disambiguation instances is 3.9%, and these divergences were fixed as explained in Section 3.
Bias towards the Most Frequent Sense. After carrying out an analysis of the influence of the MFS on WSD systems, we found that all supervised systems suffer a strong bias towards the MFS, with all IMS-based systems disambiguating over 75% of instances with their MFS. Context2Vec is slightly less affected by this bias, with 71.5% (SemCor) and 74.7% (SemCor+OMSTI) of answers corresponding to the MFS. Interestingly, this MFS bias is also present in graph-based knowledge-based systems. In fact, Calvo and Gelbukh (2015) had already shown that the MFS correlates strongly with the number of connections in WordNet.
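The MFS-bias statistic reported above can be computed with a simple sketch; the MFS table and system answers below are illustrative:

```python
# Sketch of the MFS-bias analysis: the fraction of a system's answers
# that coincide with the most frequent sense of the target lemma. The
# MFS table and sense names are hypothetical.

mfs = {"bank": "bank#finance", "keep": "keep#hold",
       "state": "state#condition"}

def mfs_rate(answers):
    """answers: list of (lemma, predicted_sense) pairs."""
    hits = sum(1 for lemma, sense in answers if mfs.get(lemma) == sense)
    return hits / len(answers)

rate = mfs_rate([("bank", "bank#finance"), ("keep", "keep#hold"),
                 ("state", "state#region"), ("bank", "bank#finance")])
```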
Knowledge-based systems. For knowledge-based systems, the WN first sense baseline still proves to be extremely hard to beat. The only knowledge-based system that manages to beat this baseline overall is Babelfy, which, in fact, uses information about the first sense in its pipeline. Babelfy's default pipeline includes a confidence threshold to decide whether to disambiguate or to back off to the first sense. In total, Babelfy backs off to the WN first sense in 63% of all instances. Nonetheless, it is interesting to note the high performance of Babelfy and Lesk_ext+emb on noun instances (outperforming the first sense baseline by 1.0 and 2.2 points, respectively) in contrast to their relatively lower performance on verbs, adjectives and adverbs. We believe that this is due to the nature of the lexical resource used by these two systems, i.e., BabelNet. BabelNet includes Wikipedia as one of its main sources of information. However, while Wikipedia provides a large amount of semantic connections and definitions for nouns, this is not the case for verbs, adjectives and adverbs, which are not included in Wikipedia and whose information mostly comes from WordNet only.

Conclusion and Future Work
In this paper we presented a unified evaluation framework for all-words WSD. This framework is based on evaluation datasets taken from the Senseval and SemEval competitions, as well as on manually and automatically sense-annotated corpora. In this evaluation framework all datasets share a common format, sense inventory (i.e., WordNet 3.0) and preprocessing pipeline, which makes it easier for researchers to evaluate their models and, more importantly, ensures a fair comparison among all systems. The whole evaluation framework, including guidelines for researchers to include their own sense-annotated datasets and a script to validate their conformity to the guidelines, is available at http://lcl.uniroma1.it/wsdeval. We used this framework to perform an empirical comparison among a set of heterogeneous WSD systems, including both knowledge-based and supervised ones. Supervised systems based on neural networks achieve the most promising results. Given our analysis, we foresee two potential research avenues focused on semi-supervised learning: (1) exploiting large amounts of unlabeled corpora for learning word embeddings or training neural language models, and (2) automatically constructing high-quality sense-annotated corpora to be used by supervised WSD systems. As far as knowledge-based systems are concerned, enriching knowledge resources with semantic connections for non-nominal mentions may be an important step towards improving their performance.
For future work we plan to further extend our unified framework to languages other than English, including SemEval multilingual WSD datasets, as well as to other sense inventories such as Open Multilingual WordNet, BabelNet and Wikipedia, which are available in different languages.