Just “OneSeC” for Producing Multilingual Sense-Annotated Data

The well-known problem of knowledge acquisition is one of the biggest issues in Word Sense Disambiguation (WSD), where annotated data are still scarce in English and almost absent in other languages. In this paper we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning. Our automatically-generated data consistently lead a supervised WSD model to state-of-the-art performance when compared with other automatic and semi-automatic methods. Moreover, our approach outperforms its competitors on multilingual and domain-specific settings, where it beats the existing state of the art on all languages and most domains. All the training data are available for research purposes at http://trainomatic.org/onesec.


Introduction
The problem of acquiring knowledge (i.e., the knowledge acquisition bottleneck) is an open issue in Natural Language Processing (NLP). This problem has become even more critical with the advent of deep learning, as a bigger amount of data is needed to meet the requirements of more and more difficult tasks and increasingly complex models. Word Sense Disambiguation (WSD), i.e., the task of associating a word with its meaning in a context (Navigli, 2009), is one of the most affected research areas (Navigli, 2018). The interest in this field has grown remarkably due to the variety of applications that can benefit from it, such as Machine Translation (Neale et al., 2016) or Information Extraction (Delli Bovi et al., 2015).
Most approaches to WSD are either supervised or knowledge-based. The former frames the problem as a classification (Zhong and Ng, 2010) or sequence learning (Raganato et al., 2017b) task, in which either a target word or all the content words in a sequence have to be tagged with one of their possible meanings. The latter, instead, exploits graph algorithms on knowledge bases, such as the Personalized PageRank method (Haveliwala, 2002; Agirre et al., 2014), or the densest subgraph heuristic (Moro et al., 2014). Hence, knowledge-based approaches rely on semantic networks such as WordNet (Miller et al., 1990), a manually-curated resource where synonyms are grouped into so-called synsets, or BabelNet (Navigli and Ponzetto, 2010), a large multilingual encyclopedic dictionary that merges together different resources like WordNet, Wikipedia, Wikidata, etc. Therefore, in one form or another both approaches to WSD need lexical-semantic data. This is especially crucial in the case of supervised systems, which have proved capable of attaining higher results on English, for which annotated data are available, whereas they fall behind knowledge-based approaches when tested on other languages.
Unfortunately, carrying out semantic annotations for a target language requires time, resources and expertise in the field. Thus, in the last few years new approaches have been developed to mitigate the burden of knowledge acquisition by providing automatically or semi-automatically tagged corpora. The main goal of such techniques is to infer the meaning of words occurring in raw sentences by leveraging information drawn from different sources of knowledge, i.e., parallel corpora (Taghipour and Ng, 2015; Delli Bovi et al., 2017), or semantic networks (Pasini and Navigli, 2017; Pasini et al., 2018). Although supervised models achieve competitive results when trained on automatically and semi-automatically annotated datasets, a major limitation concerning these approaches is that they are strictly dependent on knowledge sources, which are in their turn difficult to harvest. In fact, on the one hand, parallel corpora require human intervention for translating a collection of texts into one or more different languages. On the other hand, semantic networks rely on manually-annotated lexical-semantic data for enriching the network itself.
In this paper we tackle the knowledge acquisition bottleneck by extending the hypotheses introduced in the two seminal papers by Gale et al. (1992b, One Sense Per Discourse) and Yarowsky (1993, One Sense Per Collocation) to Wikipedia categories, thereby making the following four contributions:
1. We formulate the new assumption of One Sense per Wikipedia Category, i.e., all the occurrences of a word across Wikipedia pages in a category share the same word meaning.
2. We propose OneSeC (One Sense per Category), a novel fully-automatic method that produces multilingual sense-annotated datasets on a large scale by mapping Wikipedia categories to word senses.
3. We eliminate the dependency on the structure of a semantic network by relying only on the association between Wikipedia pages and categories and on a sparse vector representation of concepts, i.e., NASARI 3 (Camacho Collados et al., 2016).
4. We prove that OneSeC achieves state-of-the-art results on multilingual WSD and outperforms its automatic and semi-automatic alternatives on English.

One Sense Per Category
Given the whole Wikipedia together with its associations between pages and categories and given a lexicon of words L, our approach computes a semantically-tagged dataset, where words in L are annotated with their correct meaning, by performing the following three steps:
• Category Representation, which represents a lexeme-category pair (l, C) as the Bag Of Words of the sentences of the category C in which the lemma l appears (Section 2.1).
• Sense Assignment, which assigns a sense s of the lemma l to each lexeme-category pair (l, C) (Section 2.2).
• Sentence Sampling, which extracts a certain number of sentences for each sense s of each lemma l in the lexicon L by exploiting the association between lexeme-category pairs and word senses computed in the previous step (Section 2.3).

Category Representation
The first step aims at representing each lexeme-category pair (l, C) with a Bag Of Words (BOW). To that end, we lemmatise and POS tag the text of each page in C and retain only the content words in each sentence. Then, we consider all the sentences of C in which l appears at least once and we count the frequency of each other lemma occurring in the selected sentences. Finally, we build the BOW of (l, C) in which each dimension corresponds to a lemma that is associated with its frequency, thus giving greater importance to more frequent words. For example, the pair (spring#n, MECHANICS) contains words such as force and gravity, while the pair (match#n, SPORTS LAW) includes team and play.
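The counting step can be illustrated with a minimal sketch. It assumes the sentences of a category have already been lemmatised and POS tagged by some external tool, and that each sentence is given as a list of (lemma, tag) pairs; the tag set `CONTENT_POS`, the function name `build_bow` and the toy sentences are illustrative assumptions, and for simplicity the target is matched on its lemma only.

```python
from collections import Counter

# Assumed tag set for content words; the actual tagger and tags are not
# specified in the text.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def build_bow(target_lemma, category_sentences):
    """Count the content lemmas that co-occur with target_lemma.

    category_sentences: list of sentences, each a list of (lemma, pos) pairs
    (hypothetical input format).
    """
    bow = Counter()
    for sentence in category_sentences:
        # Retain only the content words of the sentence.
        content = [lemma for lemma, pos in sentence if pos in CONTENT_POS]
        if target_lemma not in content:
            continue  # keep only sentences in which the target lemma appears
        # Count every other lemma occurring in the selected sentence.
        bow.update(lemma for lemma in content if lemma != target_lemma)
    return bow


# Toy sentences, invented for illustration.
sentences = [
    [("the", "DET"), ("spring", "NOUN"), ("exert", "VERB"), ("a", "DET"), ("force", "NOUN")],
    [("gravity", "NOUN"), ("pull", "VERB"), ("the", "DET"), ("mass", "NOUN")],
    [("force", "NOUN"), ("stretch", "VERB"), ("the", "DET"), ("spring", "NOUN")],
]
bow = build_bow("spring", sentences)
```

In this toy run the second sentence is skipped because it does not contain the target lemma, so gravity never enters the BOW, while force is counted twice.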

Sense Assignment
The second step aims at assigning a sense distribution to each lexeme-category pair. We exploit the BOW we computed and the NASARI lexical vectors (Camacho Collados et al., 2016) to represent categories and synsets, respectively. NASARI leverages Wikipedia pages to provide a sparse representation of BabelNet synsets, having words as their dimensions weighted by their lexical specificity (Lafon, 1980). NASARI has been used to compute the semantic similarity between two concepts (Pilehvar et al., 2013) in combination with the Weighted Overlap (WO), which has proven to work better than cosine similarity for comparing sparse vectors. It takes as input two vectors v_1 and v_2 and computes their similarity by considering the ranks of the components shared by both vectors. However, as it takes into account only the common dimensions, it also gives a high similarity value when the two vectors share just a few dimensions with similar rankings. In light of this, we modified the original formula and added a weight factor Ψ as follows:

$$WO(v_1, v_2) = \Psi \cdot \frac{\sum_{w \in O} \left(r_w^{v_1} + r_w^{v_2}\right)^{-1}}{\sum_{i=1}^{|O|} (2i)^{-1}}$$

where O is the intersection set between the dimensions of v_1 and v_2, r_w^{v_i} is the rank of the dimension corresponding to the word w in the vector v_i and Ψ is a logarithmic function that depends on the size of O and is defined as Ψ = ln(|O| + 1).
For example, given the BOW for a category related to the animal mouse and the two NASARI vectors for the animal and device senses of mouse as in Table 1, the standard weighted overlap scores the animal sense 0.93 and the device sense 1.00, even though the latter has only the first dimension in common.When we add the logarithmic factor Ψ, instead, the first sense is scored 1.80 while the second is scored 0.69.
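The effect of the factor Ψ can be reproduced with a small sketch. It assumes sparse vectors are stored as plain dictionaries from words to weights and that ranks are 1-based positions after sorting by decreasing weight; the function names and the toy vectors are invented for illustration and are not the values of Table 1.

```python
import math


def ranks(vector):
    """Map each dimension to its 1-based rank by decreasing weight."""
    ordered = sorted(vector, key=vector.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}


def weighted_overlap(v1, v2, use_psi=True):
    """Weighted Overlap of two sparse vectors, optionally scaled by ln(|O| + 1)."""
    r1, r2 = ranks(v1), ranks(v2)
    overlap = set(r1) & set(r2)  # O: dimensions shared by both vectors
    if not overlap:
        return 0.0
    numerator = sum(1.0 / (r1[w] + r2[w]) for w in overlap)
    denominator = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    score = numerator / denominator
    return math.log(len(overlap) + 1) * score if use_psi else score


# Toy vectors (invented weights).
category_bow = {"mouse": 10, "rat": 8, "rodent": 6, "cat": 4}
animal_sense = {"mouse": 9, "rodent": 7, "rat": 5, "tail": 3}
device_sense = {"mouse": 9, "click": 7, "cursor": 5}

plain_animal = weighted_overlap(category_bow, animal_sense, use_psi=False)
plain_device = weighted_overlap(category_bow, device_sense, use_psi=False)
psi_animal = weighted_overlap(category_bow, animal_sense)
psi_device = weighted_overlap(category_bow, device_sense)
```

With these toy vectors the device sense shares a single top-ranked dimension with the category and therefore wins under the plain measure, whereas the Ψ-weighted score favours the animal sense, which shares three dimensions. This mirrors the behaviour described above, though the numbers differ from those in the paper.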
Therefore, for each lexeme-category pair (l, C) we compute the WO between B_C, i.e., the BOW representation of the category C (see Section 2.1), and each NASARI vector associated with a given sense of l. Thus, given a set of weighted overlap scores {WO(B_C, s_1), ..., WO(B_C, s_n)}, where s_1, ..., s_n are the senses of l, we assign to (l, C) the sense that maximises the similarity with the category BOW as follows:

$$\hat{s} = \operatorname*{argmax}_{s_i \in \{s_1, \ldots, s_n\}} WO(B_C, s_i)$$

In Table 2 we show the distribution of senses for one category of spring#n and match#n, respectively. As one can see, given the pair (spring#n, SEASONS) we select the season sense of spring#n as it is the highest ranked one in terms of WO, while the formal contest meaning of match#n is selected for (match#n, SPORTS LAW).
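As a rough illustration of the assignment, the sketch below picks the highest-scoring sense for each category of a lemma and then groups the categories by assigned sense, sorted by score, which is the reversed view used in the Sentence Sampling step. The function name, the input layout and all scores are assumptions made for the example; only the category names are taken from the paper.

```python
from collections import defaultdict


def assign_and_invert(scores_by_category):
    """Assign each category its best sense, then list categories per sense.

    scores_by_category: {category: {sense: WO score}} for a single lemma
    (hypothetical layout). Returns {sense: [categories by decreasing score]}.
    """
    categories_by_sense = defaultdict(list)
    for category, scores in scores_by_category.items():
        if not scores:
            continue  # no sense vector available for this pair
        best = max(scores, key=scores.get)  # argmax over the senses
        categories_by_sense[best].append((category, scores[best]))
    return {
        sense: [c for c, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]
        for sense, pairs in categories_by_sense.items()
    }


# Invented scores for the lemma mouse#n.
mouse_scores = {
    "RODENTS": {"animal": 1.8, "device": 0.69},
    "MICE": {"animal": 2.1, "device": 0.5},
    "POINTING DEVICES": {"animal": 0.4, "device": 1.9},
}
result = assign_and_invert(mouse_scores)
```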

Sentence Sampling
Once each lexeme-category pair (l, C) is associated with one sense, we can reverse the relation, obtaining, for each sense of l, a list of categories C_1, ..., C_m sorted by weighted overlap. For example, in Table 3 we show an excerpt of the most related categories for the animal and the device meanings of the lemma mouse#n. As one can see, the animal sense is mostly related to categories that concern the animal world, e.g. MICE, RODENTS, etc., while the device sense is related to the electronic device world, e.g. COMPUTING INPUT DEVICES, POINTING DEVICES, etc. Therefore, for each sense s_i of l we sample a set of K_{s_i} sentences from C_1, ..., C_{m_i}, where K_{s_i} depends on the BabelNet ordering of senses. Following Pasini and Navigli (2017) we compute K_{s_i} applying a Zipfian distribution:

$$K_{s_i} = K \cdot \frac{1}{i^{z}} \qquad (2)$$

where K and z are two system parameters that define, respectively, the number of examples to assign to the first sense of a lemma and how fast the function decreases, while i is the sense position in BabelNet. In the case that we find only β sentences for the first sense of l, with β < K_{s_1}, we scale down all K_{s_i} by setting K = β, i.e., we consider the maximum number of examples as those that are actually available for the first sense. For example, if we have z = 2.0 and K = 500 but we can retrieve only 100 sentences for the sense s_1, we set K = 100 when computing K_{s_i} for i > 1. Hence, the number of sentences to be associated with s_2 is 25, rather than 125, thus maintaining the distribution across senses balanced.
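The worked example can be checked with a few lines of code. This is a sketch under two assumptions that the text does not state: the function name `examples_per_sense` is made up, and fractional values are rounded to the nearest integer.

```python
def examples_per_sense(num_senses, K, z, available_first=None):
    """Number of examples per sense position i = 1..num_senses, as K / i**z.

    available_first: sentences actually found for the first sense (beta);
    when it is smaller than K, K is scaled down to it.
    """
    if available_first is not None and available_first < K:
        K = available_first
    # Rounding is an assumption of this sketch.
    return [round(K / (i ** z)) for i in range(1, num_senses + 1)]


unconstrained = examples_per_sense(3, K=500, z=2.0)
scaled = examples_per_sense(3, K=500, z=2.0, available_first=100)
```

Here the second sense receives 125 examples in the unconstrained case and 25 once only 100 sentences are available for the first sense, matching the figures in the example above.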
In order to provide different contexts of use for a given sense s_i, we sample K_{s_i}^{C_j} sentences from each category C_j. K_{s_i}^{C_j} is computed as follows:

$$K_{s_i}^{C_j} = K_{s_i} \cdot \frac{j^{-1}}{\sum_{k=1}^{m_i} k^{-1}}$$

where the second term is a smoothed version of the category rank reciprocal, i.e., it is normalised by the sum of the reciprocal of each category rank (from 1 to m_i).
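A small sketch of this allocation follows. It returns fractional shares because the text does not say how they are turned into integer counts; the function name is an assumption.

```python
def examples_per_category(k_sense, num_categories):
    """Split k_sense examples over categories ranked 1..num_categories.

    Each category of rank j receives a share proportional to 1 / j,
    normalised by the sum of the reciprocals of all ranks.
    """
    norm = sum(1.0 / k for k in range(1, num_categories + 1))
    return [k_sense * (1.0 / j) / norm for j in range(1, num_categories + 1)]


shares = examples_per_category(25, 3)
```

The shares add up to the budget of the sense, and higher-ranked categories contribute more sentences than lower-ranked ones.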
Once we have determined the number of examples to draw from each category, we sample the sentences according to their perplexity, which we compute with a Neural Language Model trained on WikiText103 (Howard and Ruder, 2018).
The result of the above three steps is a semantically-annotated corpus where each meaning s of each lemma l ∈ L is associated with a set of sentences in which l is tagged with s.

Experimental Setup
We exploited the Word Sense Disambiguation task to assess the quality of our automatically-generated corpus. Therefore, we trained a reference WSD model on the data generated by OneSeC and compared the results against those achieved by the same model trained on other resources.
In what follows we introduce the reference Word Sense Disambiguation system, the test bed, the comparison systems and how we tuned the two parameters K and z.

Reference system
We carried out the evaluation with two different WSD models: the SVM-based system It Makes Sense (Zhong and Ng, 2010, IMS) and the Bi-LSTM-based model introduced by Raganato et al. (2017b). For the latter we used MUSE embeddings (Lample et al., 2018) in the input layer, a learning rate of 0.5 and followed Raganato et al. (2017b) for all the other hyperparameters. Depending on the setting, English or multilingual, we chose the best-performing system on a development set: Senseval-2 for English and an in-house development set for all the other languages. For both models, unless differently stated, we used the Most Frequent Sense (MFS) of a lemma, i.e., its first-ranked meaning in BabelNet, as backoff strategy when the system was not able to provide an answer. Following the literature, we report the F1 measure on all the test sets unless stated differently.

English parameter tuning
We tuned the parameters K and z introduced in Section 2.3 so as to maximise the performance of the reference system on the development set. We used Senseval-2 as tuning corpus and varied K between 100 and 900 in steps of 200 and z between 2.0 and 3.0 in steps of 0.1. We ran both models, IMS and Bi-LSTM, for each parameter value and chose the one that performed best. In Figure 1 (left) we show the results of the two systems when trained on OneSeC where z is set to 2.0 and K is increased from 100 to 900. As can be seen, the Bi-LSTM trend increases more rapidly than the IMS one. However, its results are always lower than those attained by its alternative. IMS, in fact, scores almost 5 points higher starting from K = 100 and maintains its lead through all the values of K. It reaches a plateau when K = 700, which we interpret as the plateau of knowledge. Indeed, increasing the number of examples beyond this value degrades IMS performance as no more informative sentences are found for a given sense. Once K was set to 700 both for IMS and Bi-LSTM, we ran the same experiment varying z. As one can see in Figure 1 (right), IMS achieves the highest score when z = 2.1, while Bi-LSTM does so when z = 2.9. While IMS seems sensitive to this parameter, attaining better performance when the distribution of classes in training is more balanced, the neural model trend is almost constant, indicating it is less dependent on the sense distribution.
Therefore, we chose IMS as our WSD reference system as it consistently outperformed its neural-network alternative. In the following we report the results of IMS trained on OneSeC with K = 700 and z = 2.1.
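The tuning procedure, as we read it from the description above, first sweeps K with z fixed to 2.0 and then sweeps z with K fixed to the best value found. The sketch below encodes that two-stage search; `evaluate` is a hypothetical callback that would train the reference system with the given parameters and return its development-set F1, and the stand-in scorer used here is a fake function chosen only so that the example runs.

```python
def tune(evaluate):
    """Two-stage search over K and z (a reading of the tuning description).

    evaluate(K, z) -> development-set score; hypothetical callback.
    """
    k_values = range(100, 901, 200)  # 100, 300, 500, 700, 900
    z_values = [round(2.0 + 0.1 * i, 1) for i in range(11)]  # 2.0, 2.1, ..., 3.0
    best_k = max(k_values, key=lambda k: evaluate(k, 2.0))
    best_z = max(z_values, key=lambda z: evaluate(best_k, z))
    return best_k, best_z


def fake_dev_score(K, z):
    """Stand-in scorer, NOT real results: peaks at K = 700 and z = 2.1."""
    return -abs(K - 700) - abs(z - 2.1)


best_k, best_z = tune(fake_dev_score)
```

A full grid over all (K, z) pairs would be a straightforward alternative; the two-stage form is used here only because it matches the order in which the two panels of Figure 1 are described.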

Multilingual parameter tuning
We varied K and z as for English and computed the performance separately on each language-specific development dataset. We then chose the parameters leading the reference model to the highest results averaged across all languages. Contrary to what was the case for English, the Bi-LSTM model outperformed IMS on most of the settings and achieved the highest score with K = 200 and z = 2.0. Hence, we report multilingual results attained by the Bi-LSTM model when trained on OneSeC with K = 200 and z = 2.0.

Comparison systems
We compared OneSeC with a manual, a semi-automatic and a fully-automatic alternative:
• SemCor (Miller et al., 1993): the most used training corpus in WSD, which provides more than 200K manual annotations.
• OMSTI (Taghipour and Ng, 2015): a semi-automatic approach that extracts semantically-annotated data by exploiting parallel data to reduce the ambiguity of the target language. Since the resource contains SemCor by default, we considered only the semi-automatically generated examples in order to guarantee a fair comparison with OneSeC.
For the multilingual setting, instead, due to the lack of manually sense-annotated data for non-English languages, we compared OneSeC directly against the best participating system in each task and Train-O-Matic. To set a level playing field, we also report the results attained by the Bi-LSTM model when trained on Train-O-Matic corpora for the tested languages.

English All-Words WSD
We proceed by testing the reference WSD system on the data provided by OneSeC, Train-O-Matic, OMSTI and SemCor on the English all-words tasks.
In Table 5 we compare the results of IMS when trained on different corpora. As one can see, OneSeC achieves the best results on ALL when compared to automatic and semi-automatic approaches, and ranks second only with respect to SemCor. Interestingly enough, OneSeC beats its manual competitor on SemEval-2013 by 1 point and on SemEval-2015 by 4.7 points, an impressive result considering that OneSeC does not involve any human intervention during the generation of the corpus. In Table 5 we also report the statistical significance between OneSeC and its competitors on the ALL dataset by juxtaposing a † symbol next to the score. In order to do so, we performed McNemar's test (McNemar, 1947) with significance level α = 0.01 between OneSeC and SemCor. The difference is not statistically significant, meaning that IMS trained on OneSeC is in the same ballpark as when trained on SemCor. We note that the goal of this work was not to achieve state-of-the-art results on English WSD compared to manually-annotated corpora. However, performing competitively on standard benchmarks represents one step further towards getting rid of the limitation imposed by resources like SemCor. Moreover, our approach outperforms Train-O-Matic (TOM), our direct competitor, on all the datasets, with the highest increment of 3.7 points on SemEval-2007, while scoring almost 2 points higher than TOM overall. OneSeC also attains higher results when compared with a semi-automatic approach like OMSTI. In fact, OMSTI is surpassed on all the datasets but Senseval-2 and scores 2.6 F1 points less on the ALL dataset. This is per se a remarkable result as OneSeC is automatic, while OMSTI relies on parallel corpora and manual effort to align senses across languages. Furthermore, we show that OneSeC results are statistically significant in comparison to those attained by TOM and OMSTI. We also note that, similarly to TOM, OneSeC covers almost all the lemmas in each test set (see Table 6), while OMSTI is able to provide annotated examples for only half of the instances. Therefore, IMS, when trained on OMSTI, resorts heavily to the MFS backoff strategy. In light of this, we computed precision (P), recall (R) and their harmonic mean (F1) when no backoff strategy was used, as shown in Table 4. As one can see, OMSTI's performance drops heavily by roughly 30 points, confirming the figures in Table 6. Train-O-Matic's results, in contrast, remain consistent, scoring 1.5 F1 points less than SemCor overall and managing to beat it on 2 datasets. OneSeC, instead, leads IMS to the highest results overall, managing to surpass those achieved not only by its direct competitors, but also by SemCor.
The results attest to the high quality of our corpus, hence crowning OneSeC as the best choice over its competitors and even over manually-curated corpora when no backoff strategy is available.
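To make the role of the backoff concrete, the sketch below computes precision, recall and F1 in the usual WSD sense: precision over the instances the system answered, recall over all instances. It is a simplified illustration with one gold sense per instance; the function name, the data layout and the four toy instances are assumptions, not the evaluation script used in the experiments.

```python
def precision_recall_f1(gold, predicted):
    """gold: {instance_id: sense}; predicted: {instance_id: sense or None}.

    Instances missing from predicted (or mapped to None) count as unanswered.
    """
    answered = [i for i in gold if predicted.get(i) is not None]
    correct = sum(1 for i in answered if predicted[i] == gold[i])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the system has training examples for only two of four instances.
gold = {"d1": "a", "d2": "b", "d3": "c", "d4": "d"}
no_backoff = {"d1": "a", "d2": "b"}
with_backoff = {"d1": "a", "d2": "b", "d3": "c", "d4": "x"}  # MFS fills the gaps

p_nb, r_nb, f1_nb = precision_recall_f1(gold, no_backoff)
p_b, r_b, f1_b = precision_recall_f1(gold, with_backoff)
```

In the toy data, disabling the backoff halves recall even though every answered instance is correct, which is the kind of drop a low-coverage training corpus would be expected to show.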

Augmenting SemCor
To further investigate the quality of the examples provided by OneSeC, we augmented SemCor with our automatically-tagged sentences (SemCor+OneSeC). We added examples to SemCor in two cases:
1. When a word in the OneSeC lexicon never appears tagged in SemCor.
2. When not all senses of a word are covered by at least one example in SemCor.
In the first case we provided annotated sentences for all the senses of the target word with K = 700 and z = 2.1. In the second case, instead, we generated examples only for those senses s_i of a word w that are missing in SemCor. We determined the number of examples for s_i by following the Zipfian distribution in Formula 2 with z = 2.1 and K = |examples(s_1, w)|, i.e., the number of examples in SemCor where w occurs tagged with its most frequent sense s_1. SemCor+OneSeC achieves 70.7 F1 points on ALL, beating SemCor alone (70.4) and SemCor+OMSTI (70.5).

Domain-Specific Evaluation
In Table 7 we show the results achieved by IMS on each specific domain of SemEval-2013 and SemEval-2015. As shown in the two tables, when compared with TOM and OMSTI, OneSeC leads IMS to consistently outperform all the other approaches on SemEval-2015 and most of the domains of SemEval-2013. In fact, OneSeC scores lower in only 2 out of the 7 SemEval-2013 domains, where Train-O-Matic scores 0.1 and 2 points higher. However, when the MFS is disabled (second row of each domain), OneSeC is the best system across the board, demonstrating it can also provide valuable examples for those words that are specific to a domain.

Multilingual All-Words WSD
Finally, we move our focus to testing the ability of OneSeC to scale to different languages. In Tables 8 and 9 we show the results obtained by Bi-LSTM trained on OneSeC and Train-O-Matic (TOM-Bi-LSTM) when the MFS backoff strategy is disabled. We compare the aforementioned approaches with the best participating system in SemEval-2013 and SemEval-2015, i.e., UMCC-DLSI's (Gutiérrez Vázquez et al., 2010) best run for the Spanish test set of SemEval-2013 and IMS trained on Train-O-Matic for all other datasets (Pasini et al., 2018). OneSeC proved, once again, to be the best system across the board, achieving state-of-the-art results on all languages. Our approach outperforms its competitors on all datasets, with the highest increment of 7.4 points on the French test set for SemEval-2013, while scoring on average 3.2 F1 points higher compared to the existing state of the art.
Results show that OneSeC is a robust approach that is able to scale across languages and domains. It goes beyond the findings of Train-O-Matic and raises the state-of-the-art bar in multilingual WSD.

Related Work
Word Sense Disambiguation is a well-established task in the field of Natural Language Processing and it has been tackled from many different angles over the past years. One of the major problems concerning WSD has been the so-called knowledge acquisition bottleneck (Gale et al., 1992a), i.e., the paucity of lexical-semantic data. In fact, semantic resources are mainly exploited by WSD models in one of two different ways: as structured knowledge to identify the meaning of a word in a context in knowledge-based models (Moro et al., 2014; Agirre et al., 2014; Chaplot and Salakhutdinov, 2018), and as training data to fit the parameters of a classifier in supervised models (Zhong and Ng, 2010; Yuan et al., 2016; Raganato et al., 2017b; Luo et al., 2018).
On the one hand, knowledge-based models have proved to be more versatile when it comes to disambiguating less frequent words and texts in low-resourced languages, even though they suffer from the lack of statistical evidence of lexical context. On the other hand, supervised models have consistently attained higher results in English WSD (Raganato et al., 2017a), however at the cost of less flexibility and lower results when scaling to other languages (Raganato et al., 2017b). Thus, research has recently been focused on new techniques that aim at mitigating the effects of the knowledge-acquisition bottleneck by automatically creating high-quality, sense-annotated training corpora. Some earlier attempts consisted of annotating examples from the Web by exploiting the target words' monosemous relatives (Agirre and Martínez, 2004). But a major drawback of this kind of approach is its limited coverage. In fact, a training example can be provided only for those senses with at least one monosemous related concept. Raganato et al. (2016) presented in their paper a method for the automatic construction of a Semantically Enriched Wikipedia (SEW), where the number of hyperlink annotations was enlarged by means of a set of heuristics. As an outcome they released a corpus containing more than 200 million annotations for approximately 4 million concepts and named entities. Another approach was developed by Otegi et al. (2016) to enrich the multilingual text of Europarl (Koehn, 2005) and QTLeap (Agirre et al., 2014) with several features, including semantic annotations in 6 different languages. Parallel corpora were exploited also in the more recent work of Taghipour and Ng (2015, OMSTI), who presented a semi-automatic approach that creates a novel semantically-annotated dataset by leveraging the manual effort made to align senses across different languages.
In contrast, recent methods have been able to fully automatise the whole process while simultaneously producing high-quality resources. For example, Delli Bovi et al. (2017) exploited an external WSD system, i.e., Babelfy (Moro et al., 2014), and the richer context provided by aligned sentences, to carry out semantic annotations for Europarl. Instead, Pasini and Navigli completely removed the need for parallel corpora (Pasini and Navigli, 2017; Pasini et al., 2018) and for the WordNet backoff strategy (Pasini and Navigli, 2018) by introducing Train-O-Matic and two automatic methods for inducing the sense distribution.
Our work follows this latter line of research and, similarly to the aforementioned approaches, automatically provides multilingual sense-annotated data on a large scale. OneSeC stands out from its alternatives as it does not depend either on the structure of a semantic network (like Train-O-Matic), or on external WSD models (like EuroSense). In our approach, in fact, we only rely on Wikipedia categories and NASARI vectors to inject semantic information at sentence level.

Conclusions
In this paper we presented OneSeC, a novel method for the automatic creation of multilingual sense-annotated corpora on a large scale. Our approach relieves the burden of human intervention, hence mitigating the knowledge acquisition bottleneck besetting WSD training data. Moreover, we take a further step towards removing any dependency on a semantic-network structure by exploiting only Wikipedia categories and a sparse vector representation of concepts for creating our datasets. OneSeC outperforms its automatic and semi-automatic alternatives on the English WSD task, and achieves results in the same ballpark as those attained when manually-curated corpora are used for training. Furthermore, OneSeC scales to multiple languages without any additional human effort. Indeed, our approach also proved to be capable of producing high-quality training data for low-resourced languages, leading a WSD supervised model to achieve state-of-the-art results on all the datasets of the multilingual WSD tasks. We release more than one million tagged sentences for English, Spanish, Italian, French and German at http://trainomatic.org/onesec.
As future work we plan to exploit a subset of the Wikipedia categories as coarse-grained sense inventory and enrich our dataset with coarser labels, hence enabling WSD at different granularities.

Figure 1: Performance on the development set of IMS and the Bi-LSTM model trained on OneSeC when z = 2.0 and K ranges between 100 and 900 (left) and when K = 700 and z ranges between 2.0 and 3.0 (right).
and Computer keyboard pages are grouped under the same category, namely, COMPUTING INPUT DEVICES. Similarly, the MONARCHS OF THE UNITED KINGDOM category groups together all the past and present monarchs of the country, e.g. Elizabeth II, Queen Victoria, etc. Based on this, in what follows we refer to the sentences of a category C as those sentences contained in all the pages of C, and we refer to the occurrences of a lemma in a category C as the occurrences of its inflected forms in the sentences of C.
3 http://lcl.uniroma1.it/nasari/

Table 1: Excerpt of the sorted components of an example category's BOW (first line) and two NASARI vectors (second and third line).

Table 2: Excerpt of the sense distribution of spring#n and match#n for one of their categories.

Table 3: Excerpt of the most related categories for the device and animal senses of mouse.

Table 4: Performance of IMS trained on different corpora on the English all-words WSD tasks when the MFS is disabled.

Table 5: Results of IMS trained on different corpora on the English all-words WSD tasks. † marks statistical significance between OneSeC and its competitors.

Table 6: Number of nominal lemmas covered by each corpus.
