SyntagNet: Challenging Supervised Word Sense Disambiguation with Lexical-Semantic Combinations

Current research in knowledge-based Word Sense Disambiguation (WSD) indicates that performances depend heavily on the Lexical Knowledge Base (LKB) employed. This paper introduces SyntagNet, a novel resource consisting of manually disambiguated lexical-semantic combinations. By capturing sense distinctions evoked by syntagmatic relations, SyntagNet enables knowledge-based WSD systems to establish a new state of the art which challenges the hitherto unrivaled performances attained by supervised approaches. To the best of our knowledge, SyntagNet is the first large-scale manually-curated resource of this kind made available to the community (at http://syntagnet.org).


Introduction
Word Sense Disambiguation (WSD) is one of the most challenging Natural Language Processing (NLP) tasks. It deals with lexical ambiguity, and it is core to achieving the much-sought-for goal of Natural Language Understanding (Navigli, 2018). In a broad sense, two major approaches can be adopted for performing WSD: the supervised and the knowledge-based ones.
Supervised methods, which learn a classifier from training data, have improved from 65% (Snyder and Palmer, 2004) to over 71% accuracy, thus proving to perform best for WSD purposes (Yuan et al., 2016; Melacci et al., 2018; Uslu et al., 2018). However, supervision depends heavily on large quantities of reliable sense-annotated data which, especially for languages other than English, are scarcely available.
As a major alternative to supervised WSD, knowledge-based approaches drop the requirement for large amounts of training data by drawing on rich Lexical Knowledge Bases (LKBs) such as WordNet (Fellbaum, 1998), and allow scaling to multiple languages thanks to multilingual resources such as BabelNet (Navigli and Ponzetto, 2012). The structure of an LKB plays a key role in increasing the overall disambiguation performance. For the purposes of WSD, a critical consideration concerns the nature of the relations connecting concepts: while LKBs tend to focus on the paradigmatic dimension of language, such resources fall short with respect to syntagmatic relations, which are also crucial for sense disambiguation because they interconnect co-occurring words (Navigli and Lapata, 2010).
In this paper we address this deficiency and present, for the first time, a manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words. Importantly, we prove the effectiveness of our resource by achieving the state of the art in multilingual knowledge-based WSD and by matching supervised WSD performances when integrated into an LKB made up of WordNet and the Princeton WordNet Gloss Corpus.

Related work
Several studies on knowledge-based algorithms have indicated that the LKB structure is of vital importance in determining the accuracy of sense disambiguation. In particular, it has been demonstrated that WSD performance improves dramatically when employing an LKB with a larger number of high-quality lexical-semantic relations, i.e., more connections between concepts (Boyd-Graber et al., 2006; Lemnitzer et al., 2008; Ponzetto and Navigli, 2010). During the last two decades, a certain amount of work has been carried out aimed at enriching LKBs with new lexical-semantic relations. To this end, knowledge has been (semi-)automatically extracted from large collections of data and integrated into lexical resources such as WordNet.
As far as semi-automatic approaches are concerned, Mihalcea and Moldovan (2001) conceived eXtended WordNet, a resource providing disambiguated glosses by means of a classification ensemble combined with human supervision. A set of manually disambiguated glosses, called the Princeton WordNet Gloss Corpus (PWNG), which inherently included syntagmatic content, was subsequently also made available in 2008.
The rationale behind the creation of such resources was substantiated in a knowledge-based WSD study conducted by Navigli and Lapata (2010), who hypothesized an improvement in performance by several points when enriching a semantic network with tens of lexical-semantic relations for each target word sense. To achieve this demanding goal, endeavors in the literature focused on the fully-automatic production of semantic combinations, such as those obtained by disambiguating topic signatures (Cuadros and Rigau, 2008; Cuadros et al., 2012, KnowNet and deep-KnowNet) or by disentangling the concepts in ConceptNet (Chen and Liu, 2011).
More recently, Espinosa-Anke et al. (2016) aimed at automatically enriching WordNet with collocational information by leveraging the relations between sense-level embedding spaces (Col-WordNet), while Simov et al. (2016) addressed the enhancement of LKBs by exploiting relations over semantically-annotated corpora as contextual information. To the same end, Simov et al. (2018) employed grammatical role embeddings to gather new syntagmatic relations.
The lack of syntagmatic information in semantic networks was also tackled by the extension of a lexical database by means of phrasets, i.e., sets of free combinations of words recurrently used to express a concept (Bentivogli and Pianta, 2004).
Unfortunately, due to their (semi-)automatic nature, the aforementioned resources could not inherently offer wide coverage and high precision at the same time. Compared to other resources geared towards knowledge-based WSD, the novel resource we contribute in this work features: (i) wide coverage with a broad spectrum of possible lexical combinations, and (ii) high precision thanks to being entirely manually curated.

SyntagNet: a wide-coverage lexical-semantic combination resource
In this Section, we present SyntagNet, a knowledge resource created starting from lexical combinations extracted from the English Wikipedia and the British National Corpus (Leech, 1992, BNC), and manually disambiguated according to the WordNet 3.0 sense inventory.

Methodology
Lexical combination extraction. First of all, we employed the Stanford CoreNLP pipeline (Manning et al., 2014) to extract the dependency trees for all the sentences in both Wikipedia and the BNC. Then, in order to identify relevant combinations, we determined the strength of correlation between pairs of POS-tagged, lemmatized content words $w_1$, $w_2$ co-occurring within a sliding window of 3 words. Each candidate pair $(w_1, w_2)$ was weighted using Dice's coefficient multiplied by a logarithmic factor of the co-occurrence frequency:

$$\mathrm{score}(w_1, w_2) = \log_2(1 + n_{w_1 w_2}) \cdot \frac{2\, n_{w_1 w_2}}{n_{w_1} + n_{w_2}}$$

where $n_{w_i}$ ($i \in \{1, 2\}$) is the frequency of $w_i$ and $n_{w_1 w_2}$ is the frequency of the two words co-occurring within a window. Three filters were then applied in order to slim down the list of pairs: (i) we filtered out English stopwords according to the Natural Language Toolkit (Loper and Bird, 2002, NLTK 3.4); (ii) we discarded verb-verb combinations; (iii) we discarded combinations not linked by any of the five most frequent dependencies in our list, namely: compound, dobj (direct object), iobj (indirect object), nsubj (nominal subject) and nmod (nominal modifier).
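The weighting step can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes that the logarithmic factor is log2(1 + n), that a window of 3 words means a word plus the two tokens following it, and that sentences arrive as lists of already POS-tagged lemmas. The function name `cooccurrence_scores` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_scores(sentences, window=3):
    """Score ordered word pairs by Dice's coefficient times log2(1 + count).

    Assumptions (not taken from the paper): the window covers a word and
    the `window - 1` tokens after it, and the log factor is log2(1 + n).
    """
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        for i, w1 in enumerate(tokens):
            # Pair w1 with every later token that falls inside the window.
            for w2 in tokens[i + 1 : i + window]:
                pair_freq[(w1, w2)] += 1

    scores = {}
    for (w1, w2), n12 in pair_freq.items():
        dice = 2 * n12 / (word_freq[w1] + word_freq[w2])
        scores[(w1, w2)] = math.log2(1 + n12) * dice
    return scores


# Toy input: three two-token "sentences" of POS-tagged lemmas.
toy = [["run_v", "program_n"], ["run_v", "race_n"], ["run_v", "program_n"]]
scores = cooccurrence_scores(toy)
```

On the toy input, the pair seen twice receives a higher score than the pair seen once, because both the Dice term and the logarithmic term grow with the co-occurrence count.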
Finally, we ranked the resulting lexical combination list according to the geometric mean between i) the logarithmic Dice scores and ii) the frequency count of a pair in a given POS tag/dependency combination. We show some examples with $w_1$ = run_v, together with their final correlation score, in Table 1 (left). We then repeated the whole process described above, with the following changes: i) we set a sliding window of 6 words; ii) we removed the constraint on the dependency selection; iii) we filtered out all pairs already occurring within the first list; iv) we selected only items attested in multiple English monolingual and collocation dictionaries.

Table 1: Examples of lexical combinations with $w_1$ = run_v and their correlation scores (left), with the senses chosen by the annotators (right).
word 1 | word 2    | score | sense 1                                   | sense 2
run_v  | program_n | 18.07 | run_v^19 (carry out a process or program) | program_n^7 (a sequence of instructions)
run_v  | race_n    | 11.55 | run_v^37 (compete in a race)              | race_n^2 (a contest of speed)
run_v  | farm_n    |  3.50 | run_v^4 (direct or control)               | farm_n^1 (workplace with farm buildings)

Manual disambiguation. We asked eight annotators to manually disambiguate the top-ranking 20,000 lexical combinations from the first list and 58,000 lexical combinations from the second list, i.e., to associate each word in a pair $(w_1, w_2)$ with its most appropriate senses in WordNet (in Table 1 (right) we show the senses chosen by the annotators for the corresponding lexical combinations).
The eight annotators shared a background in linguistics (Master's Degree with a minimum C1 English proficiency level) and were well acquainted with WordNet. In order to facilitate the annotation process, we provided each annotator with a unique batch of lexical combinations in a simple interface; for each pair, the annotators were shown all the synsets for each word of the combination (along with WordNet definitions and examples), and a context of up to 25 random sentences from which the combination was extracted. The annotators were asked to input the sense numbers associated with their chosen synsets for both the words in a given pair. Since the combinations can carry different meanings depending on the context, the annotators were allowed to assign multiple senses to the same word in a given combination (e.g., judge in the "public official" sense vs. the "evaluator" sense in the (judge_n, decide_v) lexical combination).
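The ranking step of the first pass can be sketched as below. The helper `rank_combinations` and its input layout are hypothetical, and the numbers in the usage lines are illustrative placeholders rather than values from the resource.

```python
import math


def rank_combinations(candidates):
    """Rank pairs by the geometric mean of their two signals.

    `candidates` is a list of (w1, w2, log_dice_score, pattern_frequency)
    tuples, where pattern_frequency is the count of the pair in a given
    POS tag/dependency combination. Both values must be non-negative.
    """
    ranked = [
        (w1, w2, math.sqrt(score * freq))
        for w1, w2, score, freq in candidates
    ]
    # Highest geometric mean first.
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Illustrative placeholder values only.
ranked = rank_combinations([
    ("run_v", "farm_n", 1.0, 4),
    ("run_v", "program_n", 4.0, 9),
])
```

Using a geometric rather than an arithmetic mean keeps a pair from ranking highly on the strength of a single signal: a pair needs both a strong association score and a frequent syntactic pattern.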
As a further measure to ensure quality, the annotators were also asked to skip the annotation of lexical combinations (i) carrying mistakes due to the automatic parsing process, (ii) for which none of the available senses in WordNet would fit the context, (iii) reflecting idiomatic expressions, (iv) which were multi-word Named Entities.
We periodically timed the annotators by considering the number of annotations produced on a daily basis, obtaining an average value of 42 disambiguated combinations per hour (1 minute and 26 seconds per word pair). Overall, the annotation process took a period of 9 months.
To determine the reliability of the annotations, we calculated the minimum inter-annotator agreement between pairs of annotators on a random sample of 500 combinations. For each of the 500 lexical combinations used to compute the inter-annotator agreement, the annotators were exceptionally asked to disambiguate the two target words in all of the 25 sentences provided, thus leading to a figure of 25,000 single instances disambiguated per annotator, resulting in a substantial agreement (κ = 0.71). Moreover, we found that most of the disagreement instances arose out of valid alternative tags, rather than factual errors, due to the fine granularity of the WordNet sense inventory.
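As an illustration of how a pairwise agreement figure of this kind can be computed, the sketch below implements Cohen's kappa for two annotators. That the reported κ is Cohen's variant is an assumption on our part, and the sense labels are toy values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Toy sense labels for four instances (hypothetical).
annotator_1 = ["s1", "s1", "s2", "s2"]
annotator_2 = ["s1", "s1", "s2", "s1"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Taking the minimum of this value over all annotator pairs would give a conservative agreement estimate of the kind reported above.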

Experimental setup
We now present the setup of our evaluation, carried out to assess the effectiveness of SyntagNet when employed for knowledge-based WSD.
Disambiguation algorithm. We performed our experiments employing UKB (Agirre et al., 2014), a state-of-the-art system for knowledge-based WSD, which applies the Personalized PageRank (PPR) algorithm (Haveliwala, 2002) to an input LKB. We used its PPRw2w single-sentence context disambiguation method, which initializes the PPR vector using the context of the target word in a given sentence, while excluding the contribution of the target word itself.

Table 2: F1 scores (%) for English all-words fine-grained WSD (left) and for multilingual all-words fine-grained WSD (right). Each row displays results scored by a specific resource combined with the WNG (WordNet+PWNG) baseline. Statistically-significant differences, according to a χ² test (p < 0.01), compared to the baseline (first row), are underlined. The second column reports the number of relations of the added resource.
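The idea behind word-to-word Personalized PageRank can be sketched on a toy graph. This is not UKB's implementation: the damping factor, the iteration count, the uniform spreading of seed mass over the senses of each context word, and all node names are assumptions made for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PPR.

    `graph` maps each node to its neighbour list (undirected edges are
    listed in both directions; every node is assumed to have a neighbour).
    `seeds` maps nodes to non-negative teleport weights.
    """
    total = sum(seeds.values())
    if total == 0:
        raise ValueError("at least one seed with positive weight is required")
    teleport = {node: seeds.get(node, 0.0) / total for node in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {node: (1 - damping) * teleport[node] for node in graph}
        for node, neighbours in graph.items():
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


def disambiguate_w2w(graph, word_senses, sentence, target):
    """Pick the best sense of `target`, seeding only the other words."""
    seeds = {}
    for word in sentence:
        if word == target:
            continue  # the target word does not contribute to its own context
        senses = word_senses[word]
        for sense in senses:
            seeds[sense] = seeds.get(sense, 0.0) + 1.0 / len(senses)
    rank = personalized_pagerank(graph, seeds)
    return max(word_senses[target], key=lambda sense: rank[sense])


# Toy LKB with one syntagmatic edge per sense of "run" (hypothetical names).
graph = {
    "run#compete": ["race#contest"],
    "run#execute": ["program#software"],
    "race#contest": ["run#compete"],
    "program#software": ["run#execute"],
}
word_senses = {
    "run": ["run#execute", "run#compete"],
    "race": ["race#contest"],
}
best = disambiguate_w2w(graph, word_senses, ["run", "race"], "run")
```

In this toy setting the edge between the "compete" sense of run and the "contest" sense of race is what carries probability mass to the correct sense, which is the role syntagmatic relations play in the LKB.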

Evaluation benchmarks and measures
We used five test sets standardized with WordNet 3.0 (Raganato et al., 2017a), including the English all-words tasks from Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013) and SemEval-2015 (Moro and Navigli, 2015). To run experiments on multilingual WSD, we used the last two of the foregoing datasets, which also include German, Spanish, French and Italian, employing, as sense inventory, the synset lexicalizations provided in BabelNet 4.0. As customary, we computed precision, recall and F1, which in our case coincided, because UKB always outputs a sense for each target word.
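The reason the three measures coincide can be checked with a short sketch. The function name, the instance identifiers and the sense labels below are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """WSD scoring where a system may leave some instances unanswered.

    `gold` maps instance ids to the set of acceptable senses;
    `predicted` maps instance ids to the single sense the system chose.
    """
    correct = sum(
        1 for instance, sense in predicted.items()
        if sense in gold.get(instance, ())
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {"t1": {"a"}, "t2": {"b"}, "t3": {"c"}, "t4": {"d"}}

# A system that answers every instance: the denominators are equal.
full = precision_recall_f1(gold, {"t1": "a", "t2": "b", "t3": "x", "t4": "d"})

# A system that abstains on two instances: precision and recall diverge.
partial = precision_recall_f1(gold, {"t1": "a", "t2": "b"})
```

When every instance receives an answer, precision and recall share the same numerator and denominator, so their harmonic mean equals both.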

Experimental results
English WSD. As shown in Table 2 (left), SyntagNet enabled UKB to achieve the best results in the English all-words disambiguation tasks, scoring 4.4 points above the WNG baseline overall, which is the only statistically-significant improvement across LKBs. Furthermore, results for the individual datasets exhibit statistically-significant improvements over the baseline on two out of five datasets. We attribute this result to the fully manual nature of SyntagNet, in contrast to the noisy character of the other LKBs. A further justification of our results comes from an analysis we performed on relation samples from the various LKBs: we collected 500 random relations for each LKB we experimented with, and manually tagged each of them as syntagmatic or paradigmatic, revealing that their syntagmatic contribution ranges from 39% (deepKnowNet) to 54% (eXtended WordNet). The fully syntagmatic nature of SyntagNet, instead, effectively blends in with the complementary information available in the baseline (63% of the relations in WNG are paradigmatic).
Table 3 compares UKB + SyntagNet against the best supervised English WSD systems (Yuan et al., 2016; Melacci et al., 2018; Uslu et al., 2018): none of the differences across datasets between the best performing supervised system and SyntagNet is statistically significant according to a χ² test (p < 0.01), meaning that SyntagNet enables knowledge-based WSD to rival current supervised approaches.

Table caption (fragment): ... (2015), ‡ result obtained by aggregating the outputs of the best systems for each dataset. Statistically-significant differences against our results are underlined according to a χ² test, p < 0.01.

Multilingual WSD. As regards our multilingual evaluation, SyntagNet enabled UKB to attain the best overall result (see Table 2 (right)), which is a statistically-significant improvement of 2.1 points over the baseline. With respect to the comparison against the best systems (Table 4), SyntagNet provides a statistically-significant boost of 4.6 points in relation to the aggregate score of the compared systems (second to last row), attaining state-of-the-art results on five out of the six datasets taken into account.
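One plausible way to run a χ² comparison between two systems is a 2×2 contingency table of correct and incorrect answers, sketched below. The exact test configuration is not specified here, so the table layout, the absence of a continuity correction and the counts in the usage lines are all assumptions. The threshold 6.635 is the standard critical value for one degree of freedom at p = 0.01.

```python
CRITICAL_VALUE_P01_DF1 = 6.635  # chi-square critical value, df = 1, p = 0.01


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square statistic for two systems' correct/incorrect counts."""
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [total_a, total_b]
    grand_total = total_a + total_b
    col_totals = [
        correct_a + correct_b,
        grand_total - correct_a - correct_b,
    ]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("an expected cell count is zero")
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts, not results from this paper.
statistic = chi_square_2x2(correct_a=70, total_a=100, correct_b=50, total_b=100)
significant = statistic > CRITICAL_VALUE_P01_DF1
```

A difference is treated as significant at p < 0.01 when the statistic exceeds the critical value.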

Impact of LKB size
Finally, we graphed the increase in WSD performance obtained when progressively enriching the baseline UKB graph with random samples of 10,000 SyntagNet relations at each step. As illustrated in Figure 1, the improvements in both the English and the multilingual settings present a growing trend according to a linear regression analysis of the data. This indicates that our relations are high-quality and effective for WSD, while leaving room for further improvement as more relations are added in the future.
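The enrichment procedure and the trend analysis can be sketched as follows. The helper names, the fixed random seed and the numbers in the usage lines are hypothetical; in particular, the F1 values are made-up placeholders and not results from this paper.

```python
import random


def cumulative_samples(relations, step=10_000, seed=0):
    """Yield growing random subsets of `relations`, `step` more each time."""
    shuffled = list(relations)
    random.Random(seed).shuffle(shuffled)
    for end in range(step, len(shuffled) + step, step):
        yield shuffled[:end]


def linear_trend(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Sizes of the enriched relation sets for a toy pool of 25 relations.
sizes = [len(sample) for sample in cumulative_samples(range(25), step=10)]

# Placeholder scores (NOT from the paper), one per enrichment step.
slope, intercept = linear_trend([0, 10, 20, 25], [60.0, 61.0, 61.5, 62.0])
```

A positive fitted slope corresponds to the growing trend described above. Because each sample is a prefix of the same shuffled list, every enrichment step strictly extends the previous one.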

Conclusions
In this paper we put forward two main contributions: 1) we presented SyntagNet (http://syntagnet.org), a new wide-coverage, manually-curated resource of lexical-semantic combinations; 2) we showed that SyntagNet enables state-of-the-art knowledge-based WSD, rivaling the best supervised system on English and surpassing the overall performance of the best multilingual systems by 4.6 points. As future work, we plan to: i) enrich SyntagNet with more combinations, so as to surpass supervised English WSD; ii) include information from adjectives; iii) establish a common evaluation framework to compare the contribution of lexical-semantic combination resources; iv) employ and assess SyntagNet in other NLP tasks, such as word and sense similarity (Navigli and Martelli, 2019) or Semantic Role Labeling (where a newly released WordNet-linked resource, VerbAtlas (Di Fabio et al., 2019), would greatly benefit from collocational information).