Personalized PageRank with Syntagmatic Information for Multilingual Word Sense Disambiguation

Exploiting syntagmatic information is an encouraging research focus to be pursued in an effort to close the gap between knowledge-based and supervised Word Sense Disambiguation (WSD) performance. We follow this direction in our next-generation knowledge-based WSD system, SyntagRank, which we make available via a Web interface and a RESTful API. SyntagRank leverages the disambiguated pairs of co-occurring words included in SyntagNet, a lexical-semantic combination resource, to perform state-of-the-art knowledge-based WSD in a multilingual setting. Our service provides both a user-friendly interface, available at http://syntagnet.org/, and a RESTful endpoint to query the system programmatically (accessible at http://api.syntagnet.org/).


Introduction
In Natural Language Processing, Word Sense Disambiguation (WSD) is an open problem concerning lexical ambiguity. It aims to determine which sense, among a finite inventory of many, is evoked by a given word in context (Navigli, 2009). This challenge has been tackled by exploiting huge amounts of hand-annotated data in a supervised fashion (Raganato et al., 2017b; Bevilacqua and Navigli, 2019; Vial et al., 2019; Bevilacqua and Navigli, 2020) or, alternatively, by harnessing structured information (Agirre et al., 2014; Moro et al., 2014), such as that available within existing lexical knowledge bases (LKBs) like WordNet (Fellbaum, 1998). Despite achieving better overall results, supervised systems require tremendous effort to produce data for several languages (Navigli, 2018; Pasini, 2020), whereas knowledge-based approaches can easily be applied in multilingual environments thanks to the wide array of languages covered by LKBs like BabelNet (Navigli and Ponzetto, 2012) or the Open Multilingual WordNet (Bond and Foster, 2013).
Moreover, it is widely acknowledged that the performance of a knowledge-based WSD system is strongly correlated with the structure of the LKB employed (Boyd-Graber et al., 2006; Lemnitzer et al., 2008; Navigli and Lapata, 2010; Ponzetto and Navigli, 2010). Indeed, the knowledge available within LKBs reflects the fact that words can be linked via two types of semantic relations. Paradigmatic relations, which are the most frequently encountered relations in LKBs, concern the substitution of lexical units, and determine to which level in a hierarchy a language unit belongs by semantic analogy with units similar to it. Conversely, syntagmatic relations concern the positioning of such units, by linking elements belonging to the same hierarchical level (e.g., words) which appear in the same context (e.g., a sentence).
As a case in point, a paradigmatic relation exists, independently of a given context, between the words farm_n and workplace_n (where a farm is a type of workplace), whereas a syntagmatic relation holds between the words work_v and farm_n, e.g., in the sentence 'her husband works in a farm as a labourer.' In our most recent study (Maru et al., 2019, SyntagNet), we provided further evidence that the nature of LKBs impacts on system performance: the injection of syntagmatic relations, in the form of disambiguated pairs of co-occurring words, into an existing LKB biased towards paradigmatic knowledge enables knowledge-based systems to rival their supervised counterparts.
To make the above results accessible to the research community, in this paper we introduce a Web interface and a RESTful API for SyntagRank, our multilingual WSD system, which applies the Personalized PageRank (PPR) algorithm (Haveliwala, 2002) to an LKB made up of WordNet, the Princeton WordNet Gloss Corpus (PWNG) and the lexical-semantic syntagmatic combinations available in the SyntagNet resource. SyntagRank is the first system to perform multilingual WSD by leveraging an underlying LKB connecting a sizeable amount of syntagmatically-related concepts.

Preliminaries
Our disambiguation algorithm relies on an LKB, i.e. a graph in which each node represents a concept, and each connection between nodes represents a semantic relation. In this Section we describe the LKBs whose resulting union we use as our reference graph, and then go on to provide details of the PPR algorithm.

Lexical Knowledge Bases
WordNet (Fellbaum, 1998) is a lexical-semantic database of English, in which concepts are expressed by means of sets of cognitive synonyms (synsets) that are interlinked to form a semantic network through relation edges.
Relations in WordNet are mainly of a hierarchical, and thus paradigmatic, nature, with the most frequently encoded relation being the super-subordinate relation (instantiated in terms of hypernymy and hyponymy; see also Section 1). Other relations linking concepts in WordNet include part-whole relations (meronymy, e.g. between wheel_n and car_n), antonymy relations and cross-part-of-speech relations holding among semantically similar words sharing a stem with the same meaning (e.g. between speed_n and speedy_a). As of today, WordNet is the most widely used and de facto standard sense inventory for the WSD task (Raganato et al., 2017a).
Princeton WordNet Gloss Corpus (PWNG) is the semantically-annotated gloss corpus made available by WordNet since its 3.0 release. Glosses are short definitions providing proper meanings for synsets, and in PWNG they have been tagged according to the senses in WordNet. Following Agirre et al. (2014), we induce new WordNet relations from PWNG by linking the synset to which the gloss refers to each of the synsets that have been tagged in the gloss itself.
In this way, additional contextual relations are provided, inadvertently covering syntagmatic relations, too.
SyntagNet (Maru et al., 2019) is a database containing almost 90,000 pairs of manually disambiguated lexical collocations and free word associations. Pairs in SyntagNet link nouns to other nouns or verbs, tagged according to the WordNet 3.0 sense inventory, and such pairs can therefore be exploited as new relation paths connecting nodes (synsets) in a WordNet-based semantic network. For our purposes, we are especially interested in the fact that SyntagNet is the only high-quality resource to systematically provide syntagmatic information in the form of lexical-semantic combinations. This kind of information becomes particularly valuable when used to enrich semantic networks otherwise biased towards paradigmatic knowledge, by creating direct routes between those concepts whose lexicalizations tend to appear together in the same context more often than by mere chance.
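As a rough illustration of how the three resources described in this section can be merged into a single reference graph, the sketch below unions several edge lists into one adjacency map. The synset identifiers and example edges are invented for illustration and are not taken from WordNet, PWNG or SyntagNet; treating every relation as an undirected, unweighted edge is likewise an assumption.

```python
from collections import defaultdict


def build_lkb(*edge_lists):
    """Union several (synset, synset) edge lists into one undirected graph."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for u, v in edges:
            if u == v:
                continue  # ignore self-loops
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)


def gloss_edges(synset, tagged_gloss_synsets):
    """PWNG-style relations: link a synset to every synset tagged in its gloss."""
    return [(synset, other) for other in tagged_gloss_synsets]


# Hypothetical identifiers, for illustration only.
wordnet_edges = [("farm.n", "workplace.n")]                    # paradigmatic
pwng_edges = gloss_edges("farm.n", ["crop.n", "livestock.n"])  # gloss-derived
syntagnet_edges = [("work.v", "farm.n")]                       # syntagmatic

lkb = build_lkb(wordnet_edges, pwng_edges, syntagnet_edges)
```

In this toy graph the syntagmatic pair gives work.v a direct route to farm.n, which the paradigmatic edges alone would not provide.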

Personalized PageRank
The original PageRank (Brin and Page, 1998) is an algorithm which uses the connectivity of a graph to assess the probability that each of its nodes will be reached and visited starting from a random position. The initial probability mass (distribution) over the graph nodes is uniform; then, iteratively, the number of ingoing and outgoing connections serves as a means to increase or decrease the relative weight of each node. In order to apply this approach to WSD, following Agirre et al. (2014), SyntagRank uses a variant of the PageRank algorithm, the Personalized PageRank (PPR), in which the initial probability mass is distributed over a restricted set of specific nodes (i.e. the nodes representing the content words to be disambiguated in a given context). Hence, given an initial set of nodes, the outcome of the PPR algorithm is a vector encoding all the information concerning the probability distributions of all the nodes in the graph.
[...] the most appropriate sense of a given word in context. This approach, already discussed by Agirre and Soroa (2009), is here presented in an optimized, rebuilt version, employing the LKBs described in Section 2.1 to achieve state-of-the-art knowledge-based performance across five languages: English, German, French, Spanish, and Italian. Our architecture (Figure 1) is composed of three main modules: (i) multilingual NLP pipeline, (ii) candidate retrieval, and (iii) disambiguator.
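The Personalized PageRank computation underlying the system can be sketched as a plain power iteration. This is a minimal illustration and not the SyntagRank implementation: the function name, the dictionary-based graph, the handling of nodes without neighbours and the toy graph are all assumptions, while the damping factor of 0.85 and the convergence threshold of 10^-4 are the values reported in the paper.

```python
def personalized_pagerank(graph, seeds, damping=0.85, tol=1e-4, max_iter=1000):
    """Power iteration for PPR on an adjacency map.

    graph: dict mapping node -> set of neighbours (every neighbour is a key)
    seeds: dict mapping seed node -> initial probability mass (sums to 1)
    """
    nodes = list(graph)
    teleport = {n: seeds.get(n, 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(max_iter):
        # Teleportation step: part of the mass always returns to the seeds.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # Assumption: a node with no neighbours returns its mass to the seeds.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
                continue
            share = damping * rank[n] / len(neighbours)
            for m in neighbours:
                new_rank[m] += share
        delta = max(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break  # no node changed by more than the threshold
    return rank


# Toy path graph a - b - c with all the initial mass on "a".
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
toy_rank = personalized_pagerank(toy_graph, {"a": 1.0})
```

Because the mass restarts from the seed node, "a" ends up with a higher score than "c" even though the two nodes have the same degree.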

Multilingual NLP Pipeline
In order to allow the user to provide an unprocessed text as input for SyntagRank to disambiguate, our system employs a multilingual NLP pipeline which preliminarily performs tokenization, sentence splitting, lemmatization and Part of Speech (PoS) tagging. Depending on the input language, SyntagRank utilizes either the Stanford CoreNLP suite (Manning et al., 2014), or the models provided by The Italian NLP Tool (Palmero Aprosio and Moretti, 2016, TINT).

Candidate Retrieval
English Candidate Retrieval. With each token in the input text already pre-processed, and considering that each node in our graph corresponds to a unique WordNet synset (see Section 2), in this phase we can retrieve, for each content word (target word) in a single sentence, all those candidate concepts (synsets) for which a coincident lexicalization exists. In doing so, in line with the word-to-word heuristics described by Agirre et al. (2014), we exclude the target word when retrieving the candidate concepts so as to avoid the probability mass being distributed across the most frequent sense of the target word. The resulting set of collected concepts C, which will now include all the possible senses for the non-target words in the input sentence, thus establishes the starting nodes for the PPR algorithm.
In view of the fact that, according to the Linearity Theorem (Jeh and Widom, 2003), the PPR vector computed starting from a set of nodes C is equivalent to the weighted average of the PPR vectors calculated using each of the nodes in C as single starting points, all the PPR vectors in SyntagRank have been preliminarily determined for each node in the graph, with the purpose of minimizing execution times [5]. Thus, the PPR vector for a precise context (i.e. an input sentence) is calculated simply by determining the weighted average of the pre-computed PPR vectors for each of its nodes [6]. The weight factor p(w, s), for each candidate s associated with a content word w, is computed from N, the number of content words in the input sentence, and from senses_w, the set of sense candidates associated with w. Moreover, since the graph connectivity gets denser around most frequent senses (MFS), according to their distribution in SemCor [7] (Miller et al., 1993), and in view of the fact that unsupervised systems tend to have a strong bias towards the MFS (Calvo and Gelbukh, 2015; Postma et al., 2016), we accounted for potential skew towards MFS by including the parameter freq_ws, i.e. the normalized value resulting from the number of occurrences for a given word sense in SemCor, divided by the total number of occurrences for all the senses of the same word.
[5] All the pre-computed PPR vectors are stored in binary format, and are accessed via a memory-mapped file supported by a Least Recently Used (LRU) cache.
[6] With regard to our PPR implementation details, we opted for a damping factor of 0.85. In addition, the algorithm performs a variable number of iterations (random walks) over the graph until reaching convergence, i.e. when the difference between the scores of any node computed at two successive iterations falls below a threshold of 10^-4.
[7] SemCor is the largest, manually sense-annotated corpus of English, and is currently the de facto standard reference dataset for several WSD applications.
Multilingual Candidate Retrieval. Concepts represented in a semantic network are language independent by definition. Still, in order to retrieve sense candidates for words in specific languages, we need the nodes in the graph to be mapped to lexicalizations in those languages. As mentioned in Section 2.1, WordNet provides this information for the English language only; therefore, in order to retrieve the lexicalizations in languages other than English, we exploited the BabelNet semantic network, which inherently aligns lexicalizations in 284 distinct languages to the original WordNet 3.0 synsets. Nevertheless, this approach has two main flaws: (i) the lexicalizations in BabelNet are induced from automatically-linked resources, hence their quality might be sub-optimal, and (ii) no SemCor equivalent exists for other languages, which means we do not have any accessible MFS information to exploit when computing the weighted average between vectors. In order to address both these flaws concurrently, we devised a strategy to mimic the MFS ranking function by associating a confidence score with each of the lexical resources from which BabelNet derives its lexicalizations (e.g. Wikidata, OmegaWiki or Wikipedia, among others). To this end, after conducting an empirical study to assess the quality of random translation samples provided by each individual resource mapped to BabelNet, we assigned a normalized confidence score to each resource. Consequently, for each unique lexicalization, we have been able to compute its "MFS" score as the average confidence among all the resources providing that lexicalization for a specific concept.
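The two averaging steps discussed in this section, namely combining pre-computed PPR vectors and deriving an "MFS"-like score from resource confidences, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy vectors, the weights and the confidence values are all hypothetical, and the weights are passed in as given rather than derived from the paper's weight factor p(w, s).

```python
def combine_ppr(precomputed, weights):
    """Weighted average of per-node PPR vectors (Linearity Theorem).

    precomputed: dict mapping seed node -> {node: score}
    weights: dict mapping seed node -> non-negative weight (normalized here)
    """
    total = sum(weights.values())
    combined = {}
    for seed, weight in weights.items():
        for node, score in precomputed[seed].items():
            combined[node] = combined.get(node, 0.0) + (weight / total) * score
    return combined


def pseudo_mfs(resource_confidence, sources):
    """Average confidence over the resources that provide a lexicalization."""
    return sum(resource_confidence[s] for s in sources) / len(sources)


def pick_sense(combined, target_candidates):
    """Select the target-word candidate with the highest combined score."""
    return max(target_candidates, key=lambda s: combined.get(s, 0.0))


# Hypothetical pre-computed vectors for two context senses (s1, s2),
# scored over two candidate senses of the target word (t1, t2).
precomputed = {"s1": {"t1": 0.2, "t2": 0.1}, "s2": {"t1": 0.0, "t2": 0.4}}
combined = combine_ppr(precomputed, {"s1": 0.5, "s2": 0.5})
best = pick_sense(combined, ["t1", "t2"])

# Hypothetical confidence values; the paper does not report the actual scores.
score = pseudo_mfs({"Wikipedia": 0.9, "OmegaWiki": 0.7}, ["Wikipedia", "OmegaWiki"])
```

Since the per-node vectors are computed once, disambiguating a new sentence reduces to a lookup and a weighted sum, which is the reason given in the paper for pre-computing them.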

Disambiguator
After retrieving the PPR vectors for each candidate sense and computing their weighted average (as described in Section 3.2), the last module of SyntagRank serves to: (i) extract the probability values for the senses of the target word from the averaged PPR vector, and (ii) select the sense with the highest probability value as the result of the disambiguation for the target word.

A. Query
The system takes as input the text to be processed [8]. Users can enter either single words, multiword expressions (MWEs), or full sentences as input queries. If the input text is a sentence, it will be processed by the disambiguator and the system will return a disambiguated sentence (see Paragraph C). Otherwise, if the query matches an entry in the SyntagNet database, the interface will switch to the SyntagNet Explorer (see Section 4.1) to display all the lexical-semantic combinations available for all the senses of the word/MWE provided as input query.

B. Language Selection
The drop down menu allows the user to select the language in which the input text is provided. Currently, SyntagRank offers disambiguation in five different languages: English, German, French, Spanish and Italian.

C. Disambiguated Sentence
If an input text has been provided, the interface will display the results of the disambiguation here, with tokens highlighted in different colors for Concepts (blue) and Named Entities (orange).

D. Disambiguated Token
Each disambiguated token is accompanied by a tooltip which shows the image, word sense and definition, as retrieved from the corresponding entry in BabelNet 4.0.
[8] The Web interface only allows raw text as input.

E. View Selection
The Web interface allows the user to display the disambiguated sentence in extended or compact form. In the extended view, the focus is placed on the tokens: the disambiguated sentence is shown as a horizontal slider, navigable by means of arrows located on the left and right ends of the container, and the user is thereby given a means to quickly leaf through all the disambiguation results at the same time. Instead, when selecting the compact view, the focus is shifted to the sentence. In this mode, the information associated with the disambiguated tokens will be shown only if the user hovers the mouse cursor over a highlighted token.

SyntagNet Explorer
In addition to the SyntagRank disambiguation system, our Web interface also provides users with full access to the SyntagNet database. When the user types into the query bar a word or MWE which is present in SyntagNet (an autocomplete function will provide the user with search suggestions), the interface switches to the SyntagNet Explorer (Figure 3). The SyntagNet Explorer displays a list of boxes, each containing a sense of the input word/MWE. Senses in the list are ordered according to (i) PoS tag and (ii) sense frequency (in line with BabelNet 4.0). On the left side (blue background), the boxes show information for word senses, along with PoS tags, sense definitions and illustrations. Clicking on a sense name opens the corresponding BabelNet entry in a separate tab. On the right side (white background), all the lexical-semantic items (collocates) linked with the corresponding word senses via SyntagNet are listed. Further information about collocates is provided by hovering the mouse over each item. Finally, clicking on a collocate will start a new query with the selected word.
Table 1: F1 scores (%) for English all-words fine-grained WSD (left) and for multilingual all-words fine-grained WSD (right). Statistically-significant differences against our results are underlined according to a χ² test, p < 0.01. Results under "All" refer to the concatenation of the English (left) and multilingual (right) datasets.

Usage of the RESTful API
The RESTful API we provide can be used effectively to query the SyntagRank system programmatically. Unlike the Web interface, our API allows the user to input a pre-processed text in addition to performing standard queries with raw text. For the full documentation of the RESTful API, along with the required parameters description, please refer to Appendix A: API Documentation.
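A raw-text query of the kind described above can be assembled along the following lines. The base URL, the `text` and `lang` parameter names, the five language codes and the 1,500-character limit come from the paper; the endpoint path `disambiguate` is a hypothetical placeholder, since the actual path and HTTP method are not given in this text, and the sketch therefore only builds the URL without sending a request.

```python
from urllib.parse import urlencode, urljoin

API_BASE = "http://api.syntagnet.org/"   # base URL given in the paper
TEXT_ENDPOINT = "disambiguate"           # hypothetical path, not given in the text
SUPPORTED_LANGS = {"EN", "DE", "FR", "ES", "IT"}
MAX_TEXT_LENGTH = 1500                   # character limit stated in Appendix A


def build_text_query(text, lang):
    """Return a query URL for a raw-text disambiguation request."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang}")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text exceeds the maximum length")
    query = urlencode({"text": text, "lang": lang})
    return urljoin(API_BASE, TEXT_ENDPOINT) + "?" + query


url = build_text_query("her husband works in a farm as a labourer", "EN")
```

Validating the language code and text length on the client side avoids a round trip for requests that the documented limits would rule out anyway.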
In Table 1, we report F1 scores for SyntagRank in the English (left) and multilingual (right) settings, along with comparisons to the best configurations of two distinct graph-based disambiguation systems: Babelfy (Moro et al., 2014) and UKB (Agirre et al., 2014). As can be seen, SyntagRank outperforms its direct competitors by a considerable margin in both the English and multilingual settings. These results substantiate the idea that applying the PPR algorithm to a graph injected with high-quality syntagmatic knowledge is crucial to enhancing disambiguation performance.

Conclusion
In this paper we presented and described the architecture of SyntagRank, our state-of-the-art knowledge-based system for multilingual Word Sense Disambiguation using syntagmatic information. We also provided details concerning the use of SyntagRank's Web interface and RESTful API, accessible at http://syntagnet.org/ and http://api.syntagnet.org, respectively.

A API Documentation
In what follows we describe the typical usage of our RESTful API and its parameters. The SyntagRank API allows the user to perform two distinct requests: (i) Disambiguate Text and (ii) Disambiguate Tokens.
Disambiguate Text. With Disambiguate Text, SyntagRank will process a raw text provided as input, given a target language among the five currently supported: EN (English), DE (German), FR (French), ES (Spanish), and IT (Italian). Method type, URL, parameters and response description are specified in detail in Table 2. Figure 4 shows an example of a success response for the Disambiguate Text query.
Disambiguate Tokens. With Disambiguate Tokens, SyntagRank will accept a pre-processed text as input to be disambiguated. As for Disambiguate Text, language specification is required. Each token must carry information concerning index (id), word form (word), lemma form (lemma), PoS tag (pos), and a boolean indicating whether the token is a content word to be disambiguated (isTargetWord). In Table 3, we provide exhaustive details concerning method type, URL parameters, token parameters and response description for Disambiguate Tokens. Additionally, Figures 5 and 6 show, respectively, an example of a typical request, and its success response.

Table 2 (Disambiguate Text)
URL parameters
text: The text to be disambiguated (max length: 1,500 characters). E.g.: text=this is a text.
lang (String): The language of the input text, among the currently supported: EN, DE, FR, ES and IT.
Response description
language: The language of the disambiguated tokens.
tokens: Contains a list of disambiguated tokens.
senseID: Identifies the WordNet 3.0 offset for the concept assigned to the token.
position: Contains information concerning the token positioning.
charOffsetBegin: Highlights the position where a given term instance starts. Expressed as char offset.
charOffsetEnd: Highlights the position where a given term instance ends. Expressed as char offset.

Table 3 (Disambiguate Tokens)
URL parameters
lang (String): The language of the input text, among the currently supported: EN, DE, FR, ES and IT.
words (List<Token>): Contains a list of words, each representing a single token of the input text.
Token parameters
id (String): Identifies the position of the token in the input text.
word (String): Identifies the token, as it appears in the input text.
lemma (String): The lemmatized form of the token.
pos (String): The Part of Speech (PoS) of the token.
isTargetWord (boolean): If true, identifies a token (for a content word) to be disambiguated.
Response description
result: Contains a list of disambiguated tokens.
id: Identifies the position of the disambiguated token according to the input text.
synset: Identifies the WordNet 3.0 offset for the concept assigned to the token.
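To make the Disambiguate Tokens parameters more concrete, the sketch below assembles a request body and reads a response using the field names listed above (lang, words, id, word, lemma, pos, isTargetWord, result, synset). The JSON layout, the PoS labels and the sample response, including its placeholder synset values, are assumptions for illustration; the actual request and response examples are those shown in Figures 5 and 6 of the paper.

```python
import json


def make_token(idx, word, lemma, pos, is_target):
    """Build one token entry with the fields required by Disambiguate Tokens."""
    return {
        "id": str(idx),          # id is documented as a String
        "word": word,
        "lemma": lemma,
        "pos": pos,              # tagset is an assumption
        "isTargetWord": is_target,
    }


def build_tokens_request(lang, tokens):
    """Serialize the language code and token list as a JSON body (assumed layout)."""
    return json.dumps({"lang": lang, "words": tokens})


def senses_by_id(response_json):
    """Map each disambiguated token id to the synset assigned to it."""
    payload = json.loads(response_json)
    return {item["id"]: item["synset"] for item in payload["result"]}


tokens = [
    make_token(0, "works", "work", "VERB", True),
    make_token(1, "in", "in", "ADP", False),
    make_token(2, "farm", "farm", "NOUN", True),
]
body = build_tokens_request("EN", tokens)

# Hypothetical response with placeholder synset values.
sample_response = '{"result": [{"id": "0", "synset": "X1"}, {"id": "2", "synset": "X2"}]}'
assigned = senses_by_id(sample_response)
```

Only the tokens flagged with isTargetWord are expected back in the result list, so the mapping is keyed by token id rather than by list position.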