Corpus-based Check-up for Thesaurus

In this paper we discuss the usefulness of applying a checking procedure to existing thesauri. The procedure is based on the analysis of discrepancies between corpus-based and thesaurus-based word similarities. We applied the procedure to more than 30 thousand words of the Russian wordnet and found serious errors in word sense descriptions, including inaccurate relationships and missing senses of ambiguous words.


Introduction
Large thesauri such as Princeton WordNet (Fellbaum, 1998) and wordnets created for other languages (Bond and Foster, 2013) are important instruments for natural language processing. Developing and maintaining such resources is a very expensive and time-consuming procedure. At the same time, contemporary computational systems, which can translate texts with almost human quality (Castilho et al., 2017), cannot automatically create such thesauri from scratch with a structure comparable to resources created by professionals (Camacho-Collados, 2017; Camacho-Collados et al., 2018).
But if such a thesaurus exists, its developers need approaches to maintain and improve it. Previous work has studied various methods for the lexical enrichment of thesauri (Snow et al., 2006; Navigli and Ponzetto, 2012). But another issue has hardly been discussed: how to find mistakes in existing thesaurus descriptions, such as incorrect relations or significant senses of ambiguous words that were accidentally omitted or have appeared recently.
In fact, revealing missed or novel senses and wrong relations is much more difficult than detecting novel words (Frermann and Lapata, 2016; Lau et al., 2014). It is known that such missed senses are often discovered during the semantic annotation of a corpus, where they pose an additional problem (Snyder and Palmer, 2004; Bond and Wang, 2014).
In this paper, we consider an approach that uses embedding models to reveal problems in a thesaurus.
Previously, distributional and embedding methods were evaluated against manually created data (Baroni and Lenci, 2011; Panchenko et al., 2015). But we can use them in the opposite direction: utilize embedding-based similarities to detect problems in a thesaurus.
We study such similarities for more than 30 thousand words presented in the Russian wordnet RuWordNet (Loukachevitch et al., 2018). RuWordNet was created in 2016 on the basis of another Russian thesaurus, RuThes, which had been developed as a tool for natural language processing for more than 20 years (Loukachevitch and Dobrov, 2002). Currently, the published version of RuWordNet includes 110 thousand Russian words and expressions.
Related Work

Several approaches have been proposed to improve the sense representation in a specific semantic resource. Lau et al. (2014) study the task of finding unattested senses in a dictionary. At first, they apply a method of word sense induction based on LDA topic modeling. Each extracted sense is represented as the top-N words of a constructed topic. To compute the similarity between a dictionary sense and a topic, the words in the sense definition are converted into a probability distribution, and the two probability distributions (gloss-based and topic-based) are compared using the Jensen-Shannon divergence. It was found that the proposed novelty measure could identify target lemmas with high- and medium-frequency novel senses. But the authors evaluated their method on word sense definitions from the Macmillan dictionary (https://www.macmillandictionary.com/) and did not check the quality of relations presented in a thesaurus.
A series of works has been devoted to studying semantic changes in word senses (Gulordava and Baroni, 2011; Mitra et al., 2015; Frermann and Lapata, 2016). Gulordava and Baroni (2011) study the semantic change of words using the Google n-gram corpus. They compared frequencies and distributional models based on word bigrams from the 1960s and the 1990s. They found that significant growth in frequency often reveals the appearance of a novel sense. It was also found that sometimes the senses of words did not change, but the context of their use changed significantly.
In (Mitra et al., 2015), the authors study the detection of word sense changes by analyzing digitized book archives. They constructed networks based on a distributional thesaurus over eight different time windows, clustered these networks, and compared the clusters to identify the emergence of novel senses. The performance of the method was evaluated manually as well as by comparison with WordNet and a list of slang words. But Mitra et al. (2015) did not check whether WordNet misses some senses.

Comparison of Distributional and Thesaurus Similarities
To compare distributional and thesaurus similarities for Russian according to RuWordNet, we used a collection of 1 million news articles as a reference collection. The collection was lemmatized. For our study, we took thesaurus words with a frequency of more than 100 in the corpus, which yielded 32,596 words (nouns, adjectives, and verbs).

For each of these words, all words located on three-step relation paths (including synonyms, hyponyms, hypernyms, cohyponyms, indirect hyponyms and hypernyms, cross-categorial synonyms, and some others) were considered related words according to the thesaurus. For ambiguous words, the paths for all senses were considered and collected together. In this way, for each word we collected the thesaurus-based "bag" of similar words (TBag).

Then we calculated embeddings with a word2vec model using a context window of 3 words, since we planned to study paradigmatic relations (synonyms, hypernyms, hyponyms, cohyponyms). Using this model, we extracted the twenty most similar words w_i to the initial word w_0, requiring that each w_i also be present in the thesaurus. In this way, we obtained the distributional (word2vec) "bag" of similar words for w_0 (DBag), together with their word2vec similarities to w_0.

Now we can calculate the intersection between TBag and DBag and sum up the word2vec similarities in the intersection. Figure 1 shows the distribution of words according to the similarity score of the TBag-DBag intersection. The X axis denotes the total similarity in the TBag-DBag intersection: it can exceed 17 for some words, denoting high correspondence between corpus-based and thesaurus-based similarities.
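The TBag/DBag construction and the intersection score can be sketched as follows. This is a minimal illustration with toy data: the `thesaurus` adjacency map and the `similar` dictionary are hypothetical stand-ins for RuWordNet relations and a trained word2vec model (where DBag would come from `model.most_similar(word)`), not the actual resources used in the paper.

```python
def tbag(thesaurus: dict, word: str, max_steps: int = 3) -> set:
    """Collect all words reachable from `word` within `max_steps`
    thesaurus relation steps (synonyms, hypernyms, hyponyms, ...)."""
    frontier, seen = {word}, set()
    for _ in range(max_steps):
        frontier = {n for w in frontier
                    for n in thesaurus.get(w, ())} - seen - {word}
        seen |= frontier
    return seen

def dbag(similar: dict, word: str, vocab: set, top_n: int = 20) -> list:
    """Top-N distributionally similar words that also occur in the
    thesaurus, as (word, similarity) pairs."""
    pairs = sorted(similar.get(word, {}).items(), key=lambda p: -p[1])
    return [(w, s) for w, s in pairs if w in vocab][:top_n]

def intersection_score(thesaurus: dict, similar: dict, word: str) -> float:
    """Sum of word2vec similarities over the TBag-DBag intersection."""
    t = tbag(thesaurus, word)
    return sum(s for w, s in dbag(similar, word, set(thesaurus)) if w in t)
```

For example, with a four-word toy thesaurus around "car", the score sums the similarities of the distributional neighbours ("vehicle", "truck") that also lie on short thesaurus paths, while an out-of-thesaurus neighbour like "banana" is ignored.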
Relative adjectives corresponding to geographical names have the highest similarity values in the TBag-DBag intersection, for example, samarskii (related to Samara city), vologodskii (related to Vologda city), etc. Nouns denoting cities, citizens, nationalities, and nations also have very high similarity values in the TBag-DBag intersection. Among verbs, the highest similarity values (more than 13) belong to verbs of thinking, movement (drive, fly), informing (say, inform, warn), and value change (decrease, increase), which belong to large semantic fields.
At the same time, the rise of the curve at low similarity values reveals the segment of problematic words.

Analyzing Discrepancies between Distributional and Thesaurus Similarities
We are interested in cases where the TBag-DBag intersection is absent or contains only one word with a small word2vec similarity (less than a threshold of 0.5). We consider such a difference between the similarity bags a problem that should be explained. We obtained 2,343 such problematic words. Table 1 shows the distribution of these words according to part of speech.
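The flagging rule above can be sketched as a small predicate. The function name and input format are our own illustration: `intersection` is assumed to map each word in the TBag-DBag overlap to its word2vec similarity with the target word.

```python
THRESHOLD = 0.5  # word2vec similarity threshold from the paper

def is_problematic(intersection: dict) -> bool:
    """A word is flagged when its TBag-DBag intersection is empty, or
    contains a single word whose similarity is below the threshold."""
    if not intersection:
        return True
    if len(intersection) == 1:
        return next(iter(intersection.values())) < THRESHOLD
    return False
```

Any word with two or more overlapping neighbours, or one sufficiently similar neighbour, passes the check; the remaining words form the ProblemList candidates.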
It can be seen that verbs have a very low share in this group of words. This can be explained by the fact that in Russian, most verbs have two aspect forms (perfective and imperfective) and also frequently have sense-related reflexive verbs. All these verb variants (perfective, imperfective, reflexive) are presented as different entries in RuWordNet. Therefore, in most cases they should together easily overcome the established discrepancy threshold. At the same time, if some verbs are found in the list of problematic words, their thesaurus descriptions have real problems. To classify the causes of discrepancies, we ordered the list of problematic words by decreasing similarity of their first most similar word from the thesaurus, so that the words with the largest discrepancies are gathered at the beginning (we further refer to this list as the ProblemList). Table 2 shows the share of found problems in the first 100 words of this list.

In the following subsections, we consider specific reasons that can explain discrepancies between thesaurus-based and corpus-based similarities.

Morphological Ambiguity and Misprints
The most evident source of discrepancies is morphological ambiguity, when two different words w_1 and w_2 share a wordform and the words from the DBag of w_1 are in fact semantically related to w_2 (usually w_2 has a larger frequency). For example, in Russian there are two words bank (financial organization) and banka (a kind of container). All similar words in the DBag of banka are from the financial domain: gosbank (state bank), sberbank (savings bank), bankir (banker), etc. The analyzed list of problematic words includes about 90 such words; 32 of them are located at the top of the ProblemList.
A technical reason for some discrepancies is frequent misprints. For example, the frequent Russian word zayavit (to proclaim) is often erroneously written as zavit (to curl). Therefore, the DBag of the word zavit includes many words similar to zayavit, such as soobshchit' (to inform) or otmetit (to remark). Another example is the pair of words statistka (showgirl) and statistika (statistics). Two such words were found in the top-100 of the ProblemList. Such cases can be easily excluded from further analysis.
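One way such misprint cases could be excluded automatically is to check whether a problematic word has a much more frequent corpus neighbour within a small edit distance, as in the zavit/zayavit and statistka/statistika pairs. The Levenshtein helper below is standard; the frequency-ratio filtering rule itself is our hedged assumption, not a procedure described in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def likely_misprints(word: str, freq: dict,
                     max_dist: int = 2, ratio: int = 10) -> list:
    """Candidate intended spellings: vocabulary words within `max_dist`
    edits that are at least `ratio` times more frequent than `word`."""
    return [w for w, f in freq.items()
            if w != word and edit_distance(w, word) <= max_dist
            and f >= ratio * freq.get(word, 1)]
```

With toy frequencies, zavit would be traced back to zayavit (two insertions away and far more frequent), so its DBag discrepancy can be set aside before manual analysis.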

Named Entities and Multiword Expressions
A natural reason for discrepancies is named entities whose names coincide with ordinary words, are not described in the thesaurus, and are frequent in the corpus under analysis. For example, mistral is described in RuWordNet as a specific wind, but in the current corpus the French helicopter carrier Mistral is actively discussed.
Frequent examples of such named entities are the names of football, hockey, and other teams popular in Russia that coincide with ordinary Russian words or geographical names (Zenith, Dynamo, etc.). Some teams have nicknames that are written in lowercase in Russian and cannot be revealed as named entities. For example, the Russian word iriska means a kind of candy. At the same time, it is the nickname of Everton Football Club (The Toffees).
Some discrepancies can be caused by frequent multiword expressions, which can be present or absent in the thesaurus. A component w_1 of a multiword expression can be distributionally similar to other words frequently co-occurring with the other component w_2, or to words related to the whole phrase w_1 w_2.
For example, the word toplenyi (rendered) occurs in the phrase toplenoe maslo (rendered butter) in 78 of its 112 total occurrences. Because of this, it is most similar to the word mindalnyi (adjective from almond), which occurs in the phrase mindalnoe maslo (almond oil) in 57 of its 180 occurrences. But the two words toplenyi and mindalnyi cannot be considered sense-related.
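The phrase effect above can be quantified as the share of a word's occurrences that fall inside one fixed bigram; the counts for toplenyi and mindalnyi come from the paper, while the function names and the 0.5 dominance cut-off are our own illustrative assumptions.

```python
def phrase_dominance(word_freq: int, phrase_freq: int) -> float:
    """Fraction of the word's corpus occurrences accounted for by a
    single multiword expression containing it."""
    return phrase_freq / word_freq

def dominated_by_phrase(word_freq: int, phrase_freq: int,
                        threshold: float = 0.5) -> bool:
    """Flag component words whose distribution is likely skewed by one
    phrase (hypothetical threshold, not from the paper)."""
    return phrase_dominance(word_freq, phrase_freq) >= threshold
```

Under this sketch, toplenyi (78 of 112 occurrences in toplenoe maslo, about 70%) would be flagged as phrase-dominated, while mindalnyi (57 of 180, about 32%) would not, even though the phrase still colours its distributional neighbours.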

Correcting Thesaurus Relations
In some cases, the reason for the distributional similarity is clear, but no revision can be made in the thesaurus. We found two types of such cases. First, an epithet such as gigant (giant) is applied in the current corpus mainly to large companies (IT giant, cosmetics giant, etc.). But it would be strange to introduce a relation between the words giant and company in a thesaurus. The second case can be seen in the similarity list for the word massazhistka (female masseur), which comprises such words as hairdresser, housekeeper, etc. These are specialists in specific personal services, but it seems that an appropriate covering word or expression does not exist in Russian. So we do not have any language means to create a more detailed classification of such specialists.
Another interesting example of a similarity grouping is the group of "flaws in the appearance": the word tsellyulit (cellulite, https://en.wikipedia.org/wiki/Cellulite) is most similar to the words morshchina (crease of the skin), perkhot' (dandruff), kariyes (dental caries), oblyseniye (balding), and vesnushki (freckles). It can be noted that a bald head or freckles are not necessarily flaws of a specific person, but on average they are considered flaws. On the other hand, the phrase nedostatki vneshnosti (flaws in the appearance) is quite frequent on Internet pages according to global search engines. Therefore, it could be useful to introduce a corresponding synset to correctly describe the conceptual system of the modern personality.
But real problems of thesaurus descriptions were also found. They included word relations that could be presented more accurately (6 cases in the top-100). For example, the word tamada (toastmaster) was linked to a more general word instead of veduschii (master of ceremonies), which was revealed by the ProblemList analysis.

Senses Unattested in Thesaurus
Significant missed senses, including serious errors for verbs, were also found. As mentioned before, in Russian there are groups of related verbs: perfective, imperfective, and reflexive. These verbs usually have a set of related senses and can also have their own separate senses. In the comparison of discrepancies between the TBags and DBags of verbs, it was found that for at least 25 verbs some senses were unattested in the current version of the thesaurus, which can be considered evident mistakes. For example, the imperfective sense of the verb otpravlyatsya (depart) was not present in the thesaurus.
Several dozen novel senses that are the most frequent senses in the current collection were identified. Most such senses are jargon (sports or journalism) senses, e.g., derbi (derby as a game between main regional teams) or naves as a type of pass in football (high-cross pass). Several novel senses belonging to information technologies were also detected: proshivka (firmware), socset' (an abbreviation of sotsial'naya set', social network).
Several colloquial (but well-known) word senses absent from RuWordNet were found. For example, the verb obzech'sya in its literary sense means 'burn oneself'. In its DBag, the colloquial sense 'make a mistake' is clearly seen.
For the word korrektor (corrector), the two most frequent senses were revealed to be unattested. The DBag of this word looks like a mixture of cosmetics and stationery terms: guash' (gouache), kistochka (tassel), tonal'nyy (tonal), chernila (ink), tipografskiy (typographic), etc. Currently, about 90 evident missed senses (different from named entities), which are the most frequent senses of the corresponding words in the collection, have been identified. Among them, 10 words are in the top-100 of the ProblemList. Table 3 presents examples of found ambiguous words with missed senses that should be added to RuWordNet.

Other Cases
In some cases, paths longer than 3 steps should be used to provide a better correspondence between thesaurus-based and corpus-based similar words (10 words in the top 100 words of the ProblemList), for example, 4-step paths such as two hypernym steps followed by two hyponym steps.
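A 4-step "two hypernyms up, two hyponyms down" path collects second-degree cousins in the taxonomy. A minimal sketch over a hypothetical child-to-parents hypernym map (the data structure is our illustration, not the RuWordNet API):

```python
def cousins(hypernyms: dict, word: str, up: int = 2) -> set:
    """Words sharing an ancestor `up` levels above `word`, found by
    taking `up` hypernym steps and then `up` hyponym steps back down."""
    # invert the child -> parents map to get parent -> children
    hyponyms: dict = {}
    for child, parents in hypernyms.items():
        for p in parents:
            hyponyms.setdefault(p, set()).add(child)

    ancestors = {word}
    for _ in range(up):                     # climb `up` hypernym steps
        ancestors = {a for w in ancestors for a in hypernyms.get(w, ())}
    descendants = set(ancestors)
    for _ in range(up):                     # descend `up` hyponym steps
        descendants = {d for w in descendants for d in hyponyms.get(w, ())}
    return descendants - {word}
```

In a toy taxonomy where poodle and beagle are dogs and siamese is a cat, the 4-step path from poodle reaches both beagle (same parent) and siamese (same grandparent), words a 3-step bag would partly miss.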
Four words in the top-100 have strange corpus-based similarities. We suppose that this is because of the presence of some news articles in Ukrainian.

Conclusion
In this paper we discussed the usefulness of applying a checking procedure to existing thesauri. The procedure is based on the analysis of discrepancies between corpus-based and thesaurus-based word similarities. We applied the procedure to more than 30 thousand words of the Russian wordnet RuWordNet, classified the sources of differences between word similarities, and found serious errors in word sense descriptions, including inaccurate relationships and missing senses of ambiguous words. We highly recommend using this procedure for checking wordnets: it makes it possible to find a lot of unexpected knowledge about the language and the thesaurus.
In the future, we plan to develop an automatic procedure for finding thesaurus regularities in the DBags of problematic words, which can make it more evident which relations or senses are missing in the thesaurus.