Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data

Annotating large numbers of sentences with senses is the heaviest requirement of current Word Sense Disambiguation. We present Train-O-Matic, a language-independent method for generating millions of sense-annotated training instances for virtually all meanings of words in a language’s vocabulary. The approach is fully automatic: no human intervention is required and the only type of human knowledge used is a WordNet-like resource. Train-O-Matic achieves consistently state-of-the-art performance across gold standard datasets and languages, while at the same time removing the burden of manual annotation. All the training data is available for research purposes at http://trainomatic.org.


Introduction
Word Sense Disambiguation (WSD) is a key task in computational lexical semantics, inasmuch as it addresses the lexical ambiguity of text by making explicit the meaning of words occurring in a given context (Navigli, 2009). Anyone who has struggled with frustratingly unintelligible translations from an automatic system, or with the meaning bias of search engines, can understand the importance for an intelligent system to go beyond the surface appearance of text.
There are two mainstream lines of research in WSD: supervised and knowledge-based WSD. Supervised WSD frames the problem as a classical machine learning task in which, first, a training phase learns a classification model from sentences annotated with word senses and, second, the model is applied to previously unseen sentences focused on a target word. A key difference from many other problems, however, is that the classes to choose from (i.e., the senses of a target word) vary for each word, therefore requiring a separate training process to be performed on a word-by-word basis. As a result, hundreds of training instances are needed for each ambiguous word in the vocabulary. This would necessitate a million-item training set to be manually created for each language of interest, an endeavour that is currently beyond reach even in resource-rich languages like English.
The second paradigm, i.e., knowledge-based WSD, takes a radically different approach: the idea is to exploit a general-purpose knowledge resource like WordNet (Fellbaum, 1998) to develop an algorithm which can take advantage of the structural and lexical-semantic information in the resource to choose among the possible senses of a target word occurring in context. For example, a PageRank-based algorithm can be developed to determine the probability of a given sense being reached starting from the senses of its context words. Recent approaches of this kind have been shown to obtain competitive results (Agirre et al., 2014; Moro et al., 2014). However, due to its inherent nature, knowledge-based WSD tends to adopt bag-of-words approaches which do not exploit the local lexical context of a target word, including function and collocation words, which limits this approach in some cases.
In this paper we get the best of both worlds and present Train-O-Matic, a novel method for generating huge high-quality training sets for all the words in a language's vocabulary. The approach is language-independent, thanks to its use of a multilingual knowledge resource, BabelNet (Navigli and Ponzetto, 2012), and it can be applied to any kind of corpus. The training sets produced with Train-O-Matic are shown to provide performance competitive with that of manually and semi-automatically tagged corpora. Moreover, state-of-the-art performance is also reported for low-resourced languages (i.e., Italian and Spanish) and domains, where manual training data is not available.

Building a Training Set from Scratch
In this Section we present Train-O-Matic, a language-independent approach to the automatic construction of a sense-tagged training set. Train-O-Matic takes as input a corpus C (e.g., Wikipedia) and a semantic network G = (V, E). We assume a WordNet-like structure of G, i.e., V is the set of concepts (i.e., synsets) such that, for each word w in the vocabulary, Senses(w) is the set of vertices in V that are expressed by w, e.g., the WordNet synsets that include w as one of their senses.
Train-O-Matic consists of three steps:
• Lexical profiling: for each vertex in the semantic network, we compute its Personalized PageRank vector, which provides its lexical-semantic profile (Section 2.1).
• Sentence scoring: for each sentence containing a word w, we compute a probability distribution over all the senses of w based on its context (Section 2.2).
• Sentence ranking and selection: for each sense s of a word w in the vocabulary, we select those sentences that are most likely to use w in the sense of s (Section 2.3).

Lexical profiling
In terms of semantic networks, the probability of reaching a node v′ starting from v can be interpreted as a measure of relatedness between the synsets v and v′. Thus we define the lexical profile of a vertex v in a graph G = (V, E) as the probability distribution over all the vertices v′ in the graph. This distribution is computed by applying the Personalized PageRank (PPR) algorithm, a variant of the traditional PageRank (Brin and Page, 1998). While the latter is equivalent to performing random walks with uniform restart probability on every vertex at each step, PPR makes the restart probability non-uniform, thereby concentrating more probability mass in the surroundings of those vertices having higher restart probability. Formally, (P)PR is computed as follows:
$$v^{(t+1)} = (1 - \alpha)\, v^{(0)} + \alpha M v^{(t)} \qquad (1)$$
where M is the row-normalized adjacency matrix of the semantic network, the restart probability distribution is encoded by vector $v^{(0)}$, and α is the well-known damping factor usually set to 0.85 (Brin and Page, 1998). If we set $v^{(0)}$ to a unit probability vector (0, . . . , 0, 1, 0, . . . , 0), i.e., restart is always on a given vertex, PPR outputs the probability of reaching every vertex starting from the restart vertex after a certain number of steps. This approach has been used in the literature to create semantic signatures (i.e., profiles) of individual concepts, i.e., vertices of the semantic network (Pilehvar et al., 2013), and then to determine the semantic similarity of concepts. As also done by Pilehvar and Collier (2016), we instead use the PPR vector as an estimate of the conditional probability $P(w' \mid s, w)$ of a word w′ given the target sense s ∈ V of word w (Formula 2), where $Z = \sum_{w''} P(w'' \mid s, w)$ is a normalization constant, $v_s$ is the vector resulting from an adequate number of random walks used to calculate PPR, and $v_s(s')$ is the vector component corresponding to sense s′. To fix the number of iterations needed to have a sufficiently accurate vector, we follow Lofgren et al. (2014) and set the error δ = 0.00001 and the number of iterations to 1/δ = 100,000.
As a result of this lexical profiling step we have a probability distribution over vocabulary words for each given word sense of interest.
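The lexical profiling step can be illustrated with a minimal Python sketch. It computes Personalized PageRank by plain power iteration on a toy graph, then turns the resulting vector into a distribution over words. Several things here are assumptions made for illustration and are not taken from the paper: the four-vertex graph, the function names `personalized_pagerank` and `word_distribution`, the fixed number of iterations (the paper instead follows the random-walk setting of Lofgren et al., 2014), and in particular the choice of scoring a word by the largest PPR component among its senses.

```python
def personalized_pagerank(adj, restart, alpha=0.85, iterations=100):
    """Power-iteration PPR on a small dense graph.

    adj[i][j] is the non-negative weight of edge i -> j; every row is
    assumed to contain at least one positive weight.
    restart is the restart distribution v(0) (non-negative, sums to 1).
    """
    n = len(adj)
    # Row-normalize so that each row is a transition distribution.
    m = [[w / sum(row) for w in row] for row in adj]
    v = list(restart)
    for _ in range(iterations):
        # Restart mass plus mass propagated along outgoing edges.
        nxt = [(1 - alpha) * restart[j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                nxt[j] += alpha * v[i] * m[i][j]
        v = nxt
    return v


def word_distribution(profile, senses_of):
    """Turn a PPR vector into a distribution over words.

    Assumption (not from the paper): a word is scored by the largest
    component among its senses; scores are then normalized to sum to 1.
    """
    raw = {w: max(profile[s] for s in ids) for w, ids in senses_of.items()}
    z = sum(raw.values())
    return {w: score / z for w, score in raw.items()}


# Hypothetical 4-synset chain 0 - 1 - 2 - 3 (undirected edges).
toy_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Restart always on vertex 0, i.e. the profile of synset 0.
profile = personalized_pagerank(toy_adj, restart=[1, 0, 0, 0])

# Hypothetical word-to-synset mapping for the toy graph.
senses_of = {"match": [0], "fire": [1, 2], "game": [3]}
dist = word_distribution(profile, senses_of)
```

In this toy graph, vertex 3 is the farthest from the restart vertex and receives the least probability mass, so the word attached to it gets the lowest probability in `dist`.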

Sentence scoring
The objective of the second step is to score the importance of word senses for each of the corpus sentences which contain the word of interest. Given a sentence $\sigma = w_1, w_2, \ldots, w_n$, for a given target word w in the sentence (w ∈ σ), and for each of its senses s ∈ Senses(w), we compute the probability P(s|σ, w). Thanks to Bayes' theorem we can determine the probability of sense s of w given the sentence as follows:
$$
\begin{aligned}
P(s \mid \sigma, w) &= \frac{P(w_1, \ldots, w_n \mid s, w)\, P(s \mid w)}{P(w_1, \ldots, w_n \mid w)} && (3)\\
&\propto P(w_1, \ldots, w_n \mid s, w)\, P(s \mid w) && (4)\\
&\approx P(w_1 \mid s, w) \cdots P(w_n \mid s, w)\, P(s \mid w) && (5)
\end{aligned}
$$
where Formula 4 is proportional to the original probability (due to removing the constant in the denominator) and is approximated with Formula 5 due to the assumption of independence of the words in the sentence. $P(w_i \mid s, w)$ is calculated as in Formula 2 and $P(s \mid w)$ is set to 1/|Senses(w)| (recall that s is a sense of w). For example, given the sentence σ = "A match is a tool for starting a fire", the target word w = match and its set of senses $S_{match} = \{s^1_{match}, s^2_{match}\}$, where $s^1_{match}$ is the sense of lighter and $s^2_{match}$ is the sense of game match, we want to calculate the probability of each $s^i_{match} \in S_{match}$ of being the correct sense of match in the sentence σ. Following Formula 5 we have:
$$
\begin{aligned}
P(s^1_{match} \mid \sigma, \mathit{match}) &\approx P(\mathit{tool} \mid s^1_{match}, \mathit{match}) \cdot P(\mathit{start} \mid s^1_{match}, \mathit{match}) \cdot P(\mathit{fire} \mid s^1_{match}, \mathit{match}) \cdot P(s^1_{match} \mid \mathit{match})\\
&= 2.1 \cdot 10^{-4} \cdot 2 \cdot 10^{-3} \cdot 10^{-2} \cdot 5 \cdot 10^{-1} = 2.1 \cdot 10^{-9}\\
P(s^2_{match} \mid \sigma, \mathit{match}) &\approx P(\mathit{tool} \mid s^2_{match}, \mathit{match}) \cdot P(\mathit{start} \mid s^2_{match}, \mathit{match}) \cdot P(\mathit{fire} \mid s^2_{match}, \mathit{match}) \cdot P(s^2_{match} \mid \mathit{match})\\
&= 10^{-5} \cdot 2.9 \cdot 10^{-4} \cdot 10^{-6} \cdot 5 \cdot 10^{-1} = 1.45 \cdot 10^{-15}
\end{aligned}
$$
As can be seen, the first sense of match has a much higher probability due to its stronger relatedness to the other words in the context (i.e., start, fire and tool). Note also that each word probability for the second sense is roughly one order of magnitude or more below the corresponding probability for the first sense.
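The worked example above can be reproduced with a short Python sketch of the naive-Bayes scoring: a uniform prior over the senses multiplied by the per-word probabilities. The function name `sense_scores`, the sense labels `match_1` and `match_2`, and the dictionary layout are hypothetical; the probability values are the ones quoted in the example.

```python
import math


def sense_scores(context_words, word_probs, senses):
    """Unnormalized score for each sense: uniform prior times the product
    of the per-word probabilities P(w_i | s, w) of the context words."""
    prior = 1.0 / len(senses)
    return {
        s: prior * math.prod(word_probs[s][w] for w in context_words)
        for s in senses
    }


# Word probabilities quoted in the example for the two senses of "match".
word_probs = {
    "match_1": {"tool": 2.1e-4, "start": 2e-3, "fire": 1e-2},   # lighter
    "match_2": {"tool": 1e-5, "start": 2.9e-4, "fire": 1e-6},   # game match
}

scores = sense_scores(["tool", "start", "fire"], word_probs,
                      ["match_1", "match_2"])
best = max(scores, key=scores.get)
```

The sketch multiplies raw probabilities, which is fine for three context words. For long sentences one would probably sum log-probabilities instead to avoid underflow, and would need some policy for context words that have no probability under a sense; neither point is specified in the text.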

Sense-based sentence ranking and selection
Finally, for a given word w and a given sense $s_1$ ∈ Senses(w), we score each sentence σ in which w appears and $s_1$ is its most likely sense according to a formula that takes into account the difference between the first (i.e., $s_1$) and the second most likely sense of w in σ:
$$\Delta_{s_1}(\sigma) = P(s_1 \mid \sigma, w) - P(s_2 \mid \sigma, w) \qquad (6)$$
where $s_1 = \arg\max_{s \in Senses(w)} P(s \mid \sigma, w)$, and $s_2 = \arg\max_{s \in Senses(w) \setminus \{s_1\}} P(s \mid \sigma, w)$. We then sort all sentences based on $\Delta_{s_1}(\cdot)$ and return a ranked list of sentences where word w is most likely to be sense-annotated with $s_1$. Although we recognize that other scoring strategies could have been used, this was experimentally the most effective one when compared to alternative strategies, i.e., the sense probability, the number of words related to the target word w, the sentence length or a combination thereof.
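A minimal Python sketch of this selection step follows, under the reading that a sentence's score is the gap between the probabilities of its two most likely senses. The function names, the toy sentences and their sense probabilities are all hypothetical, and the sketch assumes every target word has at least two senses.

```python
def confidence_margin(sense_probs):
    """Return (most likely sense, gap to the second most likely sense)."""
    ranked = sorted(sense_probs.items(), key=lambda kv: kv[1], reverse=True)
    (s1, p1), (_, p2) = ranked[0], ranked[1]
    return s1, p1 - p2


def rank_sentences(scored_sentences, target_sense):
    """Keep the sentences whose top sense is target_sense and sort them
    by decreasing margin."""
    kept = []
    for sentence, sense_probs in scored_sentences:
        best, margin = confidence_margin(sense_probs)
        if best == target_sense:
            kept.append((margin, sentence))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in kept]


# Hypothetical sentences with already-computed sense probabilities.
scored = [
    ("sentence A", {"s1": 0.90, "s2": 0.10}),
    ("sentence B", {"s1": 0.55, "s2": 0.45}),
    ("sentence C", {"s1": 0.20, "s2": 0.80}),
]
ranked_s1 = rank_sentences(scored, "s1")
ranked_s2 = rank_sentences(scored, "s2")
```

Sentence A ranks above sentence B for sense `s1` because its two senses are further apart, while sentence C is only a candidate for `s2`.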

Creating a Denser and Multilingual Semantic Network
In the previous Section we assumed that WordNet was our semantic network, with synsets as vertices and edges represented by its semantic relations. However, while its lexical coverage is high, with a rich set of fine-grained synsets, at the relation level WordNet provides mainly paradigmatic information, i.e., relations like hypernymy (is-a) and meronymy (part-of). It lacks, on the other hand, syntagmatic relations, such as those that connect verb synsets to their arguments (e.g., the appropriate senses of eat_v and food_n), or pairs of noun synsets (e.g., the appropriate senses of bus_n and driver_n). Intuitively, Train-O-Matic would suffer from such a lack of syntagmatic relations, as the relevance of a sense for a given word in a sentence depends directly on the possibility of visiting senses of the other words in the same sentence (cf. Formula 5) via random walks as calculated with Formula 1. Such reachability depends on the connections available between synsets. Because syntagmatic relations are sparse in WordNet, if it were used on its own, we would end up with a poor ranking of sentences for any given word sense. Moreover, even though the methodology presented in Section 2 is language-independent, Train-O-Matic would lack information (e.g., senses for a word in an arbitrary vocabulary) for languages other than English.
Table 1: Top-ranking synsets of the PPR vectors computed on WordNet (first and third columns) and WordNet_BN (second and fourth columns) for mouse as animal (left) and as device (right).
To cope with these issues, we exploit BabelNet, a huge multilingual semantic network obtained from the automatic integration of WordNet, Wikipedia, Wiktionary and other resources (Navigli and Ponzetto, 2012), and create the BabelNet subgraph induced by the WordNet vertices. The result is a graph whose vertices are BabelNet synsets that contain at least one WordNet synset and whose edge set includes all those relations in BabelNet coming either from WordNet itself or from links in other resources mapped to WordNet (such as hyperlinks in a Wikipedia article connecting it to other articles). The greatest contribution of syntagmatic relations comes, indeed, from Wikipedia, as its articles are linked to related articles (e.g., the English Wikipedia Bus article is linked to Passenger, Tourism, Bus lane, Timetable, School, and many more).
Because not all Wikipedia (and other resources') pages are connected with the same degree of relatedness (e.g., countries are often linked, but they are not necessarily closely related to the source article in which the link occurs), we apply a weighting strategy (Formula 7) to each edge (s, s′) ∈ E of our WordNet-induced subgraph. The weighting relies on the weighted overlap (WO) measure, which calculates the similarity between two synsets from $r^1_i$ and $r^2_i$, the rankings of the i-th synset in the set S of the components in common between the vectors associated with s and s′, respectively. Because at this stage we still have to calculate our synset vector representation, we use the precomputed NASARI vectors (Camacho-Collados et al., 2015) to calculate WO. This choice is due to WO's higher performance over cosine similarity for vectors with explicit dimensions (Pilehvar et al., 2013).
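As a rough illustration of the overlap measure, the Python sketch below implements Weighted Overlap in the form commonly attributed to Pilehvar et al. (2013): the sum of inverse rank sums over the shared dimensions, divided by its maximum attainable value. This is an assumed formulation and may differ in detail from the one used to weight the edges here. The sparse-dictionary representation, the choice of ranking each dimension within its full vector, the function name and the toy vectors are likewise illustrative assumptions.

```python
def weighted_overlap(vec_a, vec_b):
    """Weighted Overlap between two sparse vectors (dict: dimension -> weight).

    Assumed form: sum over shared dimensions of 1 / (rank_a + rank_b),
    normalized by sum_{i=1..|S|} 1 / (2i). Rank 1 is the heaviest
    dimension of a vector.
    """
    shared = set(vec_a) & set(vec_b)
    if not shared:
        return 0.0

    def ranks(vec):
        ordered = sorted(vec, key=vec.get, reverse=True)
        return {dim: pos for pos, dim in enumerate(ordered, start=1)}

    rank_a, rank_b = ranks(vec_a), ranks(vec_b)
    numerator = sum(1.0 / (rank_a[d] + rank_b[d]) for d in shared)
    denominator = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return numerator / denominator


# Hypothetical explicit vectors for two synsets.
bus = {"vehicle": 0.5, "driver": 0.3, "road": 0.2}
driver = {"driver": 0.6, "vehicle": 0.3, "license": 0.1}

wo = weighted_overlap(bus, driver)        # shared: vehicle, driver
wo_same = weighted_overlap(bus, bus)      # identical rankings
wo_none = weighted_overlap(bus, {"piano": 1.0})
```

Because only ranks matter, the measure is insensitive to the absolute scale of the weights, which is one plausible reason for preferring it over cosine similarity on vectors with explicit dimensions.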
As a result, each row of the original adjacency matrix M of G will be replaced with the weights calculated in Formula 7 and then normalized in order to be ready for PPR calculation (see Formula 1). An idea of why a denser semantic network has more useful connections and thus leads to better results is provided by the example in Table 1, where we show the highest-probability synsets in the PPR vectors calculated with Formula 1 for two different senses of mouse (its animal and device senses) when WordNet (left) and our WordNet-induced subgraph of BabelNet (WordNet_BN, right) are used as the underlying semantic network for PPR computation. Note that WordNet's top synsets are related to the target synset via paradigmatic (i.e., hypernymy and meronymy) relations, while WordNet_BN includes many syntagmatically-related synsets (e.g., experiment for the animal, and operating system and Windows for the device sense, among others).
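The normalization mentioned at the start of this paragraph can be sketched in a few lines of Python. The sparse dictionary-of-dictionaries layout, the function name `row_normalize`, the synset labels and the edge weights are hypothetical; the only point illustrated is that each row of weights is rescaled to sum to 1 before PPR is run.

```python
def row_normalize(weighted_adj):
    """Rescale the outgoing edge weights of every vertex so they sum to 1.

    weighted_adj maps a source vertex to a dict {target: weight}.
    Vertices without positive outgoing weight are left with no edges.
    """
    normalized = {}
    for source, edges in weighted_adj.items():
        total = sum(edges.values())
        if total > 0:
            normalized[source] = {t: w / total for t, w in edges.items()}
        else:
            normalized[source] = {}
    return normalized


# Hypothetical weighted edges around a "mouse (device)" synset.
weighted = {
    "mouse_device": {"computer": 0.9, "windows": 0.6, "country": 0.1},
    "computer": {"mouse_device": 0.9},
}
transition = row_normalize(weighted)
```

After normalization, a weakly related neighbour such as the hypothetical `country` edge receives only a small share of the transition probability, so random walks rarely follow it.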

Experimental Setup
Corpora for sense annotation We used two different corpora to extract sentences: Wikipedia and the United Nations Parallel Corpus (Ziemski et al., 2016). The first is the largest and most up-to-date encyclopedic resource, containing definitional information; the second, on the other hand, is a public collection of parliamentary documents of the United Nations. The application of Train-O-Matic to the two corpora produced two sense-annotated datasets, which we named T-O-M_Wiki and T-O-M_UN, respectively.

Semantic Network
We created sense-annotated corpora with Train-O-Matic both when using PPR vectors computed from vanilla WordNet and when using WordNet_BN, our denser network obtained from the WordNet-induced subgraph of BabelNet (see Section 3).
Gold standard datasets We performed our evaluations using the framework made available by Raganato et al. (2017a) on five different all-words datasets, namely: the Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 and SemEval-2015 (Moro and Navigli, 2015) WSD datasets. We focused on nouns only, given the fact that Wikipedia provides connections between nominal synsets only, and therefore contributes mainly to syntagmatic relations between nouns.
Comparison sense-annotated corpora To show the impact of our T-O-M corpora in WSD, we compared their performance on the above gold standard datasets against training with:
• SemCor (Miller et al., 1993), a corpus containing about 226,000 words annotated manually with WordNet senses.
• One Million Sense-Tagged Instances (Taghipour and Ng, 2015, OMSTI), a sense-annotated dataset obtained via a semi-automatic approach based on the disambiguation of a parallel corpus, i.e., the United Nations Parallel Corpus, performed by exploiting manually translated word senses. Because OMSTI integrates SemCor to increase coverage, to keep a level playing field we excluded the latter from the corpus.
We note that T-O-M, instead, is fully automatic and does not require any WSD-specific human intervention nor any aligned corpus.
Reference system In all our experiments, we used It Makes Sense (Zhong and Ng, 2010, IMS), a state-of-the-art WSD system based on linear Support Vector Machines, as our reference system for comparing its performance when trained on T-O-M, against the same WSD system trained on other sense-annotated corpora (i.e., SemCor and OMSTI). Following the WSD literature, unless stated otherwise, we report performance in terms of F1, i.e., the harmonic mean of precision and recall.
We note that it is not the purpose of this paper to show that T-O-M, when integrated into IMS, beats all other configurations or alternative systems, but rather to fully automatize the WSD pipeline with performances which are competitive with the state of the art.
Baseline As a traditional baseline in WSD, we used the Most Frequent Sense (MFS) baseline given by the first sense in WordNet. The MFS is a very competitive baseline, due to the sense skewness phenomenon in language (Navigli, 2009).
Number of training sentences per sense Given a target word w, we sorted its senses Senses(w) following the WordNet ordering and selected the top $k_i$ training sentences for the i-th sense according to Formula 6, where $k_i$ depends on two parameters, K = 500 and z = 2, which were tuned on a separate small in-house development dataset.

Impact of syntagmatic relations
The first result we report regards the impact of vanilla WordNet vs. our WordNet-induced subgraph of BabelNet (WordNet_BN) when calculating PPR vectors. As can be seen, using vanilla WordNet lowers the overall performance of IMS by 0.5 to around 4 points across the five datasets, with an overall loss of 1.6 F1 points. Similar performance losses were observed when using T-O-M_UN (see Table 3). This corroborates our hunch discussed in Section 3 that a resource like BabelNet can contribute important syntagmatic relations that are beneficial for identifying (and ranking high) sentences which are semantically relevant for the target word sense. In the following experiments, we report only results using WordNet_BN.

Comparison against sense-annotated corpora
We now move to comparing the performance of T-O-M, which is fully automatic, against corpora which are annotated manually (SemCor) and semi-automatically (OMSTI). In Table 3 we show the F1-score of IMS on each gold standard dataset in the evaluation framework and on all datasets merged together (last row), when it is trained with the various corpora described above.
As can be seen, T-O-M_Wiki and T-O-M_UN obtain higher performance than OMSTI (up to 5.5 points above) on 3 out of 5 datasets, and, overall, T-O-M_Wiki scores 1 point above OMSTI. The MFS is in the same ballpark as T-O-M_Wiki, performing better on some datasets and worse on others. We note that IMS trained on T-O-M_Wiki succeeds in surpassing or obtaining the same results as IMS trained on SemCor on SemEval-15 and SemEval-13. We view this as a significant achievement given the total absence of man-

Performance without backoff strategy
IMS uses the MFS as a backoff strategy when no sense can be output for a target word in context (Zhong and Ng, 2010). Consequently, the performance of the MFS is mixed up with that of the SVM classifier. As shown in Table 4, OMSTI is able to provide annotated sentences for roughly half of the tokens in the datasets. Train-O-Matic, on the other hand, is able to cover almost all words in each dataset with at least one training sentence. This means that in around 50% of cases OMSTI gives an answer based on the IMS backoff strategy.
To determine the real impact of the different training data, we therefore decided to perform an additional analysis of the IMS performance when the MFS backoff strategy is disabled. Because we suspected the system would not always return a sense for each target word, in this experiment we measured precision, recall and their harmonic mean, i.e., F1. The results in Table 5 confirm our hunch, showing that OMSTI's recall drops heavily, thereby affecting F1 considerably. T-O-M performances, instead, remain high in terms of precision, recall and F1. This confirms that OMSTI relies heavily on data (those obtained for the MFS and from SemCor) that are produced manually, rather than semi-automatically.
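The effect of disabling the backoff on the three measures can be seen in a small Python sketch. It assumes a simplified setting with exactly one gold sense per instance and represents an abstention as `None`; the function name and the toy instances are hypothetical and the numbers are not taken from the paper's tables.

```python
def precision_recall_f1(predictions, gold):
    """Score a system that may abstain.

    predictions maps instance id -> predicted sense, or None (or a missing
    key) when no answer is given; gold maps instance id -> correct sense.
    Precision is computed over answered instances, recall over all of them.
    """
    answered = [i for i in gold if predictions.get(i) is not None]
    correct = sum(1 for i in answered if predictions[i] == gold[i])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy gold standard with four target instances.
gold = {"d1.t1": "bank_1", "d1.t2": "mouse_2", "d1.t3": "match_1", "d1.t4": "bus_1"}

# A system with training data for only half of the instances and no backoff.
half_coverage = {"d1.t1": "bank_1", "d1.t2": "mouse_2", "d1.t3": None, "d1.t4": None}

p, r, f1 = precision_recall_f1(half_coverage, gold)
```

In this toy case the system is always right when it answers, yet it answers only half of the instances, so precision is 1.0 while recall is 0.5 and F1 falls to about 0.67. With a backoff that answers every instance, precision, recall and F1 coincide.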

Domain-oriented WSD
To further inspect the ability of T-O-M to enable disambiguation in different domains, we decided to evaluate on specific documents from the various gold standard datasets which could be clearly assigned a domain label. Specifically, we tested on 13 SemEval-13 documents from various domains and 2 SemEval-15 documents (namely, maths & computers, and biomedicine) and carried out two separate tests and evaluations of T-O-M on each domain: once using the MFS backoff strategy, and once not using it. In Tables 6 and 7 we report the results of both T-O-M_Wiki and T-O-M_UN to determine the impact of the corpus type.
As can be seen in the tables, T-O-M_Wiki systematically attains higher scores than OMSTI (except for the biology domain), and, in most cases, attains higher scores than MFS when the backoff is used, with a drastic, systematic increase over OMSTI with both Train-O-Matic configurations.
Results To perform our evaluation we chose the most recent multilingual task (SemEval 2015 task 13), which includes gold data for Italian and Spanish. As can be seen from Table 8, Train-O-Matic enabled IMS to perform better than the best participating system (Manion and Sainudiin, 2014, SUDOKU) in all three settings (All domains, Maths & Computer and Biomedicine). Its performance was, in fact, 1 to 3 points higher, with a 6-point peak on Maths & Computer in Spanish and on Biomedicine in Italian. This demonstrates the ability of Train-O-Matic to enable supervised WSD systems to surpass state-of-the-art knowledge-based WSD approaches in low-resourced languages without relying on manually curated data for training.

Related Work
There are two mainstream approaches to Word Sense Disambiguation: supervised and knowledge-based approaches. Both suffer in different ways from the so-called knowledge acquisition bottleneck, that is, the difficulty in obtaining an adequate amount of lexical-semantic data: for training in the case of supervised systems, and for enriching semantic networks in the case of knowledge-based ones (Pilehvar and Navigli, 2014; Navigli, 2009).
State-of-the-art supervised systems include Support Vector Machines such as IMS (Zhong and Ng, 2010) and, more recently, LSTM neural networks with attention and multitask learning (Raganato et al., 2017b) as well as LSTMs paired with nearest neighbours classification (Melamud et al., 2016; Yuan et al., 2016). The latter also integrates a label propagation algorithm in order to enrich the sense-annotated dataset. The main difference from our approach is its need for a manually annotated dataset to start the label propagation algorithm, whereas Train-O-Matic is fully automatic. An evaluation against this system would have been interesting, but neither the proprietary training data nor the code is available at the time of writing.
In order to generalize effectively, these supervised systems require large numbers of training instances annotated with senses for each target word occurrence. Overall, this amounts to millions of training instances for each language of interest, a number that is not within reach for any language. In fact, no supervised system has been submitted in major multilingual WSD competitions for languages other than English (Moro and Navigli, 2015). To overcome this problem, new methodologies have recently been developed which aim to create sense-tagged corpora automatically. Raganato et al. (2016) developed seven heuristics to grow the number of hyperlinks in Wikipedia pages. Otegi et al. (2016) applied a different disambiguation pipeline for each language to parallel text in Europarl (Koehn, 2005) and QTLeap (Agirre et al., 2015) in order to enrich them with semantic annotations. Taghipour and Ng (2015), the work closest to ours, exploits the alignment from English to Chinese sentences of the United Nations Parallel Corpus (Ziemski et al., 2016) to reduce the ambiguity of English words and sense-tag English sentences. The assumption is that the second language is less ambiguous than the first one and that hand-made translations of senses are available for each WordNet synset. This approach is, therefore, semi-automatic and relies on certain assumptions, in contrast to Train-O-Matic which is, instead, fully automatic and can be applied to any kind of corpus (and language) depending on the specific need. Earlier attempts at the automatic extraction of training samples were made by Agirre and De Lacalle (2004) and Fernández et al. (2004). Both exploited the monosemous relatives method (Leacock et al., 1998) in order to retrieve sentences from the Web which contained a given monosemous noun or a relative monosemous word (e.g., a synonym, a hypernym, etc.).
As shown by Fernández et al. (2004), this approach can lead to the retrieval of very accurate examples, but its main drawback lies in the number of senses covered. In fact, for all those synsets that do not have any monosemous relative, the system is unable to retrieve examples, thus heavily affecting the performance in terms of recall and F1.
Knowledge-based WSD, instead, bypasses the heavy requirement of sense-annotated corpora by applying algorithms that exploit a general-purpose semantic network, such as WordNet, which encodes the relational information that interconnects synsets via different kinds of relation. Approaches include variants of Personalized PageRank (Agirre et al., 2014) and densest subgraph approximation algorithms (Moro et al., 2014) which, thanks to the availability of multilingual resources such as BabelNet, can easily be extended to perform WSD in arbitrary languages. Other approaches to knowledge-based WSD exploit the definitional knowledge contained in a dictionary. The Lesk algorithm (Lesk, 1986) and its variants (Banerjee and Pedersen, 2002; Kilgarriff and Rosenzweig, 2000; Vasilescu et al., 2004) aim to determine the correct sense of a word by comparing each word sense definition with the context in which the target word appears. The limit of knowledge-based WSD, however, lies in the absence of mechanisms that can take into account the very local context of a target word occurrence, including non-content words such as prepositions and articles. Furthermore, recent studies seem to suggest that such approaches are barely able to surpass supervised WSD systems when they enrich their networks starting from a comparable amount of annotated data (Pilehvar and Navigli, 2014). With T-O-M, rather than further enriching an existing semantic network, we exploit the information available in the network to annotate raw sentences with sense information and train a state-of-the-art supervised WSD system without task-specific human annotations.

Conclusion
In this paper we presented Train-O-Matic, a novel approach to the automatic construction of large training sets for supervised WSD in an arbitrary language. Train-O-Matic removes the burden of manual intervention by leveraging the structural semantic information available in the WordNet graph enriched with additional relational information from BabelNet, and achieves performance competitive with that of semi-automatic approaches and, in some cases, of manually curated training data. T-O-M was shown to provide training data for virtually all the target ambiguous nouns, in marked contrast to alternatives like OMSTI, which in many cases covers around half of the tokens, resorting to the MFS otherwise. Moreover, Train-O-Matic has proven to scale well to low-resourced languages, for which no manually annotated dataset exists, surpassing the current state of the art of knowledge-based systems.
We believe that the ability of T-O-M to overcome the current paucity of annotated data for WSD, coupled with video games with a purpose for validation purposes (Vannella et al., 2014), paves the way for high-quality multilingual supervised WSD. All the training corpora, including approximately one million sentences which cover English, Italian and Spanish, are made available to the community at http://trainomatic.org.
As future work we plan to extend our approach to verbs, adjectives and adverbs. Following Bennett et al. (2016), we will also experiment with more realistic estimates of P(s|w) in Formula 5, and revisit other assumptions made in our work.