Studying Taxonomy Enrichment on Diachronic WordNet Versions

Ontologies, taxonomies, and thesauri are used in many NLP tasks. However, most studies focus on the creation of these lexical resources rather than the maintenance of existing ones. Thus, we address the problem of taxonomy enrichment. We explore the possibilities of taxonomy extension in a resource-poor setting and present methods which are applicable to a large number of languages. We create novel English and Russian datasets for training and evaluating taxonomy enrichment models and describe a technique for creating such datasets for other languages.


Introduction
Nowadays, construction and maintenance of lexical resources (ontologies, knowledge bases, thesauri) have become essential for the NLP community. In particular, it is crucial to enrich the most acknowledged lexical databases, such as WordNet (Miller, 1998) and its variants for almost 50 languages, or collaboratively created lexical resources such as Wiktionary. Resources of this kind are widely used in multiple NLP tasks: Word Sense Disambiguation, Entity Linking (Moro and Navigli, 2015), Named Entity Recognition, and Coreference Resolution (Ponzetto and Strube, 2006).
There already exist several initiatives on WordNet extension, for example, the Open English WordNet with thousands of new manually added entries or plWordNet (Maziarz et al., 2014) which includes a mapping to an enlarged Princeton WordNet. However, the manual annotation process is too costly: it is time-consuming and requires language or domain experts. On the other hand, automatically created datasets and resources usually lag in quality compared to manually labelled ones. Therefore, it would be beneficial to assist manual work by introducing automatic annotation systems to keep valuable lexical resources up-to-date. In this paper, we analyse the approaches to automatic enrichment of wordnets.
Formally, the goal of the Taxonomy Enrichment task is as follows: given words that are not included in a taxonomy (hereinafter orphan words), we need to associate each word with the appropriate hypernyms from the taxonomy. For example, given the input word "duck", we need to provide a list of the most probable hypernyms the word could be attached to, e.g. "waterfowl", "bird". A word may have multiple hypernyms.
SemEval-2016 task 14 (Jurgens and Pilehvar, 2016) was the first effort to evaluate this task in a controlled environment, but it provided an unrealistic task scenario. Namely, the participants were given definitions of the words to be added to the taxonomy, which are often unavailable in the real world. The majority of the presented methods heavily depended on these definitions. In contrast, we present a resource-poor scenario and solutions which conform to it. Our contributions are as follows:

Related Work
The existing studies on taxonomies can be divided into three groups. The first one addresses the Hypernym Discovery problem (Camacho-Collados et al., 2018): given a word and a text corpus, the task is to identify hypernyms in the text. However, in this task the participants are not given any predefined taxonomy to rely on. The second group of works deals with the Taxonomy Induction problem (Bordea et al., 2015; Bordea et al., 2016; Velardi et al., 2013), in other words, the creation of a taxonomy from scratch. Finally, the third direction of research is the Taxonomy Enrichment task: the participants extend a given taxonomy with new words. Our methods tackle this task.
Unlike the former two groups, the latter has garnered less attention. Until recently, the only dataset for this task was created under the scope of SemEval-2016. It contained definitions for new words, so the majority of models solving this task used the definitions. For instance, Tanev and Rotondi (2016) computed a definition vector for the input word and compared it with the vectors of candidate definitions from WordNet using cosine similarity. Another example is the TALN team (Espinosa-Anke et al., 2016), which also makes use of the definition by extracting noun and verb phrases for candidate generation.
This scenario may be unrealistic for manual annotation because annotators usually write a definition for a new word and add the word to the taxonomy simultaneously. Having a list of candidate hypernyms would not only speed up the annotation process but also help identify the range of possible senses. Moreover, words not yet included may have no definition in any other source: they can be very rare ("apparatchik", "falanga"), relatively new ("selfie", "hashtag"), or come from a narrow domain ("vermiculite").
Thus, following the RUSSE-2020 shared task (Nikishina et al., 2020), we stick to a more realistic scenario where we have no definitions of new words, only examples of their usage. The organisers of the shared task provided a baseline as well as training and evaluation datasets based on RuWordNet (Loukachevitch et al., 2016). The task exploited words which had recently been added to the latest release of RuWordNet and for which hypernym synsets had already been identified by qualified annotators. The participants of the competition were asked to find synsets which could be used as hypernyms.
The participants of this task mainly relied on vector representations of words and the intuition that words used in similar contexts have close meanings. They cast the task either as a classification problem in which a word is assigned one or more hypernyms (Kunilovskaya et al., 2020) or as a ranking problem in which all hypernyms are ranked by their suitability for a particular word (Dale, 2020). They also used a range of additional resources, such as Wiktionary (Arefyev et al., 2020), dictionaries, and additional corpora. Interestingly, only one of the well-performing models (Tikhomirov et al., 2020) used context-informed embeddings (BERT).
However, the best-performing model (denoted as Yuriy in the workshop description paper) extensively used external tools such as online Machine Translation (MT) and search engines. This approach is difficult to replicate because their performance for different languages can vary significantly. Thus, we exclude features which relate to search engines and compute the model performance without them.
The same applies to pre-trained word embeddings to some extent, but in this case we know which data was used for their training and can make more informed assumptions about the downstream performance. Moreover, the embeddings are trained on unlabelled text corpora which exist in abundance for many languages, so training high-quality vector representations is an easier task than developing a well-performing MT or search engine. Therefore, we would like to work out methods which do not depend on resources which exist only for a small number of well-resourced languages (e.g. semantic parsers or knowledge bases) or data which should be prepared specifically for the task (descriptions of new words). At the same time, we want our methods to benefit from the existing data (e.g. corpora, pre-trained embeddings, Wiktionary).

Diachronic WordNet Datasets
We build two diachronic datasets: one for English and one for Russian, based on the Princeton WordNet (Miller, 1998) and RuWordNet taxonomies, respectively. Each dataset consists of a taxonomy and a set of novel words not included in the taxonomy.

English Dataset
We choose two versions of WordNet and then select words which appear only in the newer version. For each word, we get its hypernyms from the newer WordNet version and consider them as gold standard hypernyms. We add words to the dataset only if their hypernyms appear in both versions. We do not consider adjectives and adverbs, because they often introduce abstract concepts and are difficult to interpret by context. Besides, the taxonomies for adjectives and adverbs are less well connected than those for nouns and verbs, making the task more difficult.
In order to find the most suitable pairs of releases, we compute WordNet statistics (see Table 1). The number of new words demonstrates the difference between the current and the previous WordNet version. For example, it shows that the dataset generated by "subtraction" of WordNet 2.1 from WordNet 3.0 would be too small: the versions differ by only 678 nouns and 33 verbs. Therefore, we create several datasets by skipping one or more WordNet versions. The statistics for each dataset are provided in Table 2.
As gold standard hypernyms, we use not only the immediate hypernyms of each lemma but also the second-order hypernyms: hypernyms of the hypernyms. We include them in order to make the evaluation less restricted. According to our empirical observations, the task of automatically identifying the exact hypernym might be too challenging, and finding the region where a word belongs ("parents" and "grandparents") can already be considered a success.
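The version-subtraction procedure above can be sketched as follows. The two WordNet releases are represented here as plain dicts mapping a lemma to its direct hypernyms; this is illustrative toy data, not the actual WordNet loading code:

```python
def build_dataset(old_wn, new_wn):
    """Return {orphan: gold hypernym set} for words new in `new_wn`.

    A word qualifies only if it is absent from the older release and all
    its direct hypernyms exist in both releases. The gold standard
    includes direct and second-order hypernyms, as described above.
    """
    dataset = {}
    for word, hypernyms in new_wn.items():
        if word in old_wn:
            continue  # not an orphan: already in the older release
        if not all(h in old_wn for h in hypernyms):
            continue  # hypernyms must appear in both versions
        gold = set(hypernyms)
        for h in hypernyms:
            gold.update(new_wn.get(h, []))  # add "grandparents"
        dataset[word] = gold
    return dataset


# Toy example: "duck" is new in the later release.
old_wn = {"bird": ["animal"], "animal": [], "waterfowl": ["bird"]}
new_wn = {"bird": ["animal"], "animal": [], "waterfowl": ["bird"],
          "duck": ["waterfowl"]}
print(build_dataset(old_wn, new_wn))
# the gold standard for "duck" is {"waterfowl", "bird"}
```

The second-order expansion here is the reason a model guessing either "waterfowl" or "bird" for "duck" can be counted as correct in the evaluation below.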

Russian Dataset
Our method of dataset construction does not use any language-specific or database-specific features, so it can be transferred to other wordnets or taxonomies with timestamped releases. We demonstrate this for Russian: we create an analogous dataset extending the one by Nikishina et al. (2020) based on RuWordNet. The original dataset excludes short words (fewer than 4 symbols), diminutives, named entities, and words matching other constraints described in the shared task paper. We remove those constraints and present a non-restricted Russian dataset together with a symmetrical English dataset built from the WordNet database (cf. Table 2).

Taxonomy Enrichment Methods
Our method is based on the baseline distributional model from the RUSSE-2020 shared task, extended with ranking of synset candidates and with information from Wiktionary and various types of embeddings.

Baseline
According to Cai et al. (2018) and Aly et al. (2019), co-hyponyms (words or phrases that share a hypernym) usually have similar contexts. Correspondingly, the distributional hypothesis states that words that occur in similar contexts tend to have similar meanings (Harris, 1954). Our approach is built on top of the baseline method by Nikishina et al. (2020). In this method, the top k = 10 nearest neighbours of the input word are taken from the pre-trained embedding model (according to the above considerations, they should be co-hyponyms). Subsequently, hypernyms of those co-hyponyms are extracted from the taxonomy. These hypernyms can also be considered hypernyms of the input word.
There is no one-to-one mapping between words and synsets: on the one hand, several hypernyms can belong to one synset; on the other hand, one word can occur in multiple synsets. Thus, all synsets associated with the list of extracted hypernyms are extracted. Then, the vector representation of a synset is computed by averaging the embeddings of all lemmas belonging to it. The model returns the top k closest synsets instead of the top k words. Despite its simplicity, this method turned out to be a strong baseline, outperforming over half of the participating models.
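A minimal sketch of this baseline follows, with a toy embedding table and taxonomy standing in for the pre-trained fastText model and WordNet; all vectors, words, and synset identifiers are illustrative:

```python
import numpy as np


def baseline_hypernyms(orphan, embeddings, word2synsets, synset2lemmas, k=10):
    """Top-k candidate hypernym synsets for an orphan word (baseline)."""
    def vec(w):
        v = embeddings[w]
        return v / np.linalg.norm(v)

    # 1. nearest neighbours of the orphan (assumed to be co-hyponyms)
    neighbours = sorted((w for w in embeddings if w != orphan),
                        key=lambda w: -float(vec(orphan) @ vec(w)))[:k]

    # 2. hypernym synsets of those neighbours, taken from the taxonomy
    candidates = {s for w in neighbours for s in word2synsets.get(w, [])}

    # 3. synset vector = average of its lemma vectors; rank by similarity
    def synset_vec(s):
        m = np.mean([vec(l) for l in synset2lemmas[s]], axis=0)
        return m / np.linalg.norm(m)

    return sorted(candidates,
                  key=lambda s: -float(vec(orphan) @ synset_vec(s)))[:k]


emb = {"duck": np.array([1.0, 0.1]), "goose": np.array([0.9, 0.2]),
       "chair": np.array([0.0, 1.0]), "waterfowl": np.array([0.8, 0.3]),
       "furniture": np.array([0.1, 0.9])}
word2synsets = {"goose": ["waterfowl.n.01"], "chair": ["furniture.n.01"]}
synset2lemmas = {"waterfowl.n.01": ["waterfowl"],
                 "furniture.n.01": ["furniture"]}
print(baseline_hypernyms("duck", emb, word2synsets, synset2lemmas, k=2))
# → ['waterfowl.n.01']
```

The neighbours of "duck" are "goose" and "waterfowl"; the hypernym synset of the co-hyponym "goose" becomes the candidate for "duck", as in the description above.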

Ranking Extended Hypernyms List by Weighted Similarity
This baseline has a shortcoming: it does not sort the extracted candidates. The rank of a synset is determined only by the rank of the corresponding nearest neighbour.
We improve the described model by ranking the generated synset candidates. In addition to that, we extend the list of candidates with second-order hypernyms (hypernyms of each hypernym). The direct hypernyms of the word's nearest neighbours can be too specific, whereas second-order hypernyms are likely to be more abstract concepts which the input word and its neighbours have in common. After forming a list of candidates, we score each of them as follows:

score(h_i) = n \cdot sim(v_o, v_{h_i}),

where v_x is a vector representation of a word or synset x, h_i is a hypernym, n is the number of occurrences of this hypernym in the merged list, and sim(v_o, v_{h_i}) is the cosine similarity of the vector of the orphan word o and the hypernym vector v_{h_i}. By computing this score, we assume that the most frequent and most similar candidates are the true hypernyms of the word. We sort the hypernyms by this score and return the top k.
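The scoring step, which multiplies each candidate's frequency in the merged candidate list by its cosine similarity to the orphan word, can be sketched as follows (a hypothetical minimal implementation; the candidate lists come from the neighbours' first- and second-order hypernyms):

```python
from collections import Counter

import numpy as np


def rank_candidates(orphan_vec, candidate_lists, synset_vecs, k=10):
    """Rank hypernym synset candidates by count * cosine similarity.

    `candidate_lists` holds one list of candidate synsets per nearest
    neighbour (first- and second-order hypernyms, with repetitions).
    """
    counts = Counter(s for lst in candidate_lists for s in lst)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = {s: n * cos(orphan_vec, synset_vecs[s])
              for s, n in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


v_o = np.array([1.0, 0.0])                       # orphan word vector
synset_vecs = {"bird.n.01": np.array([0.9, 0.1]),
               "animal.n.01": np.array([0.5, 0.5])}
lists = [["bird.n.01", "animal.n.01"], ["bird.n.01"]]
print(rank_candidates(v_o, lists, synset_vecs))
# → ['bird.n.01', 'animal.n.01']
```

Here "bird.n.01" wins both on frequency (it was produced by two neighbours) and on similarity, matching the intuition behind the score.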

Features Extracted from Wiktionary
One of the promising multilingual resources from which taxonomy enrichment models could benefit is Wiktionary. We choose it because Wiktionary is the only large web-based free-content dictionary that exists for 175 languages, including English (6,334,384 entries) and Russian (1,076,156 entries). More importantly, each Wiktionary page usually comprises a definition and lists of hypernyms, hyponyms, and synonyms, which can be useful for our task. We implement the following Wiktionary features:
• the candidate is present in the Wiktionary hypernym list for the input word (binary feature),
• the candidate is present in the Wiktionary synonym list (binary feature),
• the candidate is present in the Wiktionary definition (binary feature),
• the average cosine similarity between the candidate and the Wiktionary hypernyms of the input word.
We do not use the definitions directly, as their texts are too noisy: they often include usage examples which cannot be separated from the definitions and can distort their vector representations.
We extract lists of hypernym synset candidates using the baseline procedure and compute the four Wiktionary features for them. In addition, we use the score from the previous approach as a feature. To define the feature weights, we train a Logistic Regression model with L2 regularisation on a training dataset constructed from the older (known) versions of WordNet. This dataset is built analogously to the evaluation datasets described in Section 3, using all leaf synsets from the older WordNet. For each lemma of such synsets, we extract their gold standard hypernym synsets. As a result, our dataset comprises 79,000 positive and 79,000 negative examples of word-candidate pairs for nouns and verbs for English, and 59,914 positive and 59,914 negative examples for Russian.
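The feature extraction for one (orphan, candidate synset) pair can be sketched as follows. The `wiki_entry` dict structure is an assumption made for illustration, not the actual Wiktionary parsing code; the logistic regression itself can then be trained on these vectors with any standard library:

```python
import numpy as np


def wiktionary_features(candidate_lemmas, wiki_entry, cand_vec, emb,
                        base_score):
    """5-dim feature vector for one (orphan, candidate) pair.

    `wiki_entry` holds the orphan's Wiktionary "hypernyms", "synonyms"
    and "definition" fields (hypothetical structure); `emb` maps words
    to vectors; `base_score` is the weighted-similarity ranking score.
    """
    cand = set(candidate_lemmas)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    hyp_sims = [cos(cand_vec, emb[h])
                for h in wiki_entry["hypernyms"] if h in emb]
    return np.array([
        float(bool(cand & set(wiki_entry["hypernyms"]))),           # binary
        float(bool(cand & set(wiki_entry["synonyms"]))),            # binary
        float(bool(cand & set(wiki_entry["definition"].split()))),  # binary
        float(np.mean(hyp_sims)) if hyp_sims else 0.0,  # avg hypernym sim
        base_score,                                     # ranking score
    ])


entry = {"hypernyms": ["waterfowl"], "synonyms": [],
         "definition": "a waterfowl bird"}
emb = {"waterfowl": np.array([1.0, 0.0]), "bird": np.array([0.9, 0.1])}
print(wiktionary_features(["bird"], entry, emb["bird"], emb, base_score=0.8))
```

In this toy example, the candidate lemma "bird" is absent from the hypernym and synonym lists but occurs in the definition, so the third binary feature fires.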
In order to understand the contribution of the Wiktionary features to the final score, we compute the share of orphans found in Wiktionary (97% to 100%) and the share of orphans with at least one hypernym in the Wiktionary fields (2-18% for the "hypernyms" field, 1-2% for the "synonyms" field, and 26-35% in the definition).

Pre-trained Embeddings
We test our methods with non-contextualised fastText (Bojanowski et al., 2017) and contextualised BERT (Devlin et al., 2019) embeddings. We choose fastText because the pre-trained models are easy to deploy and do not require additional data or training for out-of-vocabulary words. In this paper, we use the fastText embeddings from the official website for both English and Russian, trained on Common Crawl from 2019 and Wikipedia CC, which also include lexicon from previous periods.
While fastText embeddings can be generated for individual words, BERT requires a context for a word (i.e. a sentence containing it) to generate its embedding. For the experiments with the English datasets, we extract contexts from Wikipedia. For the experiments with Russian, we use a news corpus provided by the organisers of RUSSE'2020, which contains at least 50 occurrences of each word in the dataset.
We use the pre-trained BERT-base model for English (Devlin et al., 2019). For Russian, we use the RuBERT model (Kuratov and Arkhipov, 2019), which has been shown to outperform the Multilingual BERT from the original paper. To compute BERT embeddings for orphans and synsets, we extract sentences containing them from the corresponding corpora. If a word is absent from the corpus, we compute the average of the lemma embeddings without context for synsets, and the embedding of the input word without context for orphans. We also average word-piece embeddings for OOV words. We lemmatise the corpora with UDPipe (Straka and Straková, 2017) to find not only exact word matches but also their grammatical forms. We rely on UDPipe because it supports many languages and shows reasonable performance on our data. In case of multiple occurrences of the same orphan, we average the retrieved contextualised embeddings.
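The occurrence-averaging step can be sketched as follows; `encode` is a stand-in for the actual BERT encoder (which would return the word-piece-averaged vector of the target word in a sentence), and the cap of 50 contexts mirrors the corpus construction described above:

```python
import numpy as np


def word_embedding(word, sentences, encode, max_contexts=50):
    """Average contextualised vectors of `word` over its occurrences.

    `encode(sentence, word)` is assumed to return the vector of `word`
    in that sentence (e.g. from a BERT model, with word-pieces averaged).
    """
    vecs = [encode(s, word) for s in sentences if word in s.split()]
    if not vecs:
        # fallback for out-of-corpus words: encode the word without context
        return encode(word, word)
    return np.mean(vecs[:max_contexts], axis=0)


# Toy encoder: the "vector" is just the word's position in the sentence.
stub = lambda s, w: np.array([float(s.split().index(w))])
print(word_embedding("duck", ["the duck swims", "a duck"], stub))
# → [1.]
```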

Evaluation Metric
We consider the Taxonomy Enrichment task as a soft ranking problem and use the Mean Average Precision (MAP) score for quality measurement:

AP = \frac{1}{M} \sum_{i=1}^{N} prec_i \cdot I[y_i = 1], \qquad MAP = \frac{1}{|W|} \sum_{w \in W} AP_w,

where N and M are the numbers of predicted and ground truth values, respectively, prec_i is the fraction of ground truth values in the predictions from 1 to i, y_i is the label of the i-th answer in the ranked list of predictions, I is the indicator function, and W is the set of orphan words over which the per-word AP scores are averaged.
This metric is widely acknowledged in the Hypernym Discovery shared tasks, where systems are also evaluated over the top candidate hypernyms (Camacho-Collados et al., 2018). The MAP score takes into account the whole range of possible hypernyms and their rank in the candidate list.
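Using the variable definitions above (N predictions, M ground truth hypernyms, prec_i counted at correct answers), a minimal AP/MAP implementation could look like this sketch:

```python
def average_precision(predictions, gold):
    """AP for one orphan word.

    `predictions` is the ranked candidate list (length N), `gold` the
    set of ground truth hypernyms (size M).
    """
    hits, ap = 0, 0.0
    for i, p in enumerate(predictions, start=1):
        if p in gold:
            hits += 1
            ap += hits / i  # prec_i, accumulated only at correct answers
    return ap / len(gold)


def mean_average_precision(all_predictions, all_gold):
    """Average the per-word AP scores over all orphan words."""
    return (sum(average_precision(p, g)
                for p, g in zip(all_predictions, all_gold))
            / len(all_gold))


print(average_precision(["bird", "tool", "waterfowl"],
                        {"bird", "waterfowl"}))
# ≈ 0.833, i.e. (1/1 + 2/3) / 2
```

A correct candidate pushed down the ranked list lowers its prec_i term, so the metric rewards both finding the gold hypernyms and ranking them high.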
However, the design of our dataset disagrees with the MAP metric. As we described in Section 3, the gold-standard hypernym list is extended with second-order hypernyms (parents of parents). This extension can distort MAP. If we consider all gold standard answers as compulsory for the maximum score, we demand that models find both direct and second-order hypernyms. This disagrees with the original motivation for including second-order hypernyms in the gold standard: it was intended to make the task easier by allowing a model to guess either a direct or a second-order hypernym.
On the other hand, if we decide that guessing any synset from the gold standard yields the maximum MAP score, we will not be able to provide an adequate evaluation for words with multiple direct hypernyms. There exist two cases thereof:
1. the target word has two or more hypernyms which are co-hyponyms, or one is a hypernym of the other: this word has a single sense, but the annotator decided that multiple related hypernyms are needed to reflect all shades of the meaning;
2. the target word has two or more hypernyms which are not directly connected in the taxonomy, and neither are their hypernyms. This happens if:
(a) the word's sense is a composition of the senses of its hypernyms, e.g. "impeccability" possesses two components of meaning: ("correctness", "propriety") and ("morality", "righteousness");
(b) the word is polysemous and different hypernyms reflect different senses, e.g. "pop-up" is a book with three-dimensional pages ("book, publication") and a baseball term ("fly, hit").
While case 2a corresponds to a monosemous word and case 2b indicates polysemy, this difference does not affect the evaluation process. We suggest that in both cases, in order to get the maximum MAP score, a model should capture all the unrelated hypernyms which correspond to different components of the sense. At the same time, we should bear in mind that guessing a direct hypernym and guessing a second-order hypernym are equally good options. Therefore, following Nikishina et al. (2020), we evaluate our models with a modified MAP. It transforms the list of gold standard hypernyms into a list of connected components. Each of these components includes hypernyms (both direct and second-order) which form a connected component in the taxonomy graph. (In graph theory, a connected component is a subgraph in which there is a path between any two nodes.) Thus, in case 1 we have a single connected component, and a model should guess any hypernym from it to get the maximum MAP score. In cases 2a and 2b we have multiple components, and a model should guess any hypernym from each of the components.
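The transformation of a gold standard list into connected components can be sketched as a breadth-first search over the hypernymy edges restricted to the gold synsets (a minimal sketch; synset names are illustrative):

```python
from collections import deque


def gold_components(gold, taxonomy_edges):
    """Split gold hypernyms into connected components of the taxonomy.

    `taxonomy_edges` are (undirected) hypernymy links; only edges whose
    endpoints are both in `gold` matter for the grouping.
    """
    adj = {g: set() for g in gold}
    for a, b in taxonomy_edges:
        if a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)

    components, seen = [], set()
    for g in gold:
        if g in seen:
            continue
        comp, queue = set(), deque([g])
        while queue:  # BFS over the gold-restricted graph
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            seen.add(node)
            queue.extend(adj[node] - comp)
        components.append(comp)
    return components


# "impeccability" example: two unrelated components of meaning.
gold = {"correctness", "propriety", "morality", "righteousness"}
edges = [("correctness", "propriety"), ("morality", "righteousness")]
print(len(gold_components(gold, edges)))
# → 2
```

A model then needs one hit per component for the maximum score, which is exactly the behaviour described for cases 1, 2a, and 2b.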

Results
We test the models suggested in Section 4 on our newly created English and Russian datasets as well as on the RUSSE'2020 dataset. Table 3 compares the performance of our models with fastText and BERT embeddings on RUSSE'2020 and on the symmetrical subset of the English dataset where named entities and short words are excluded. Table 5 demonstrates the results for the non-restricted datasets with the best-performing embeddings. Table 4 illustrates the top-10 candidates generated by the best-performing system. Due to space limitations, we report the results only for the WordNet 2.0-3.0 dataset, but the performance of all models on the other English datasets is consistent with that on this corpus.
From Tables 3 and 5 we see that our methods consistently improve hypernym detection for both nouns and verbs across different datasets. Extending the list of hypernym candidates with second-order hypernyms and ranking them increases MAP by a large margin, especially for nouns. Adding Wiktionary features further boosts the performance of the models. However, the use of contextualised word embeddings does not guarantee high results in this task. The models which use BERT vector representations perform worse than the same approaches using fastText. This holds for all datasets and parts of speech. Apparently, fastText is good at modelling the most common and popular word senses, whereas BERT embeddings aggregate information from different contexts, which results in mixing different senses and confusing word representations. For example, for the word "смайлик" (emoji) the predicted candidates are completely unsuitable (person, device, flatterer, hypocrite, visual materials) in comparison with the fastText predictions and the correct hypernyms (graphical sign, image, symbol). Therefore, the contexts (in a broad sense) encoded in the pre-trained fastText embeddings are sufficient to attach new words to the taxonomy.

Our methods were not able to outperform the best-performing systems from the RUSSE'2020 shared task for Russian. However, as discussed in Section 2, the best approach for nouns relies on external sources which are difficult to reproduce. In contrast to that, our approach is based on pure fastText vectors, word similarities, and Wiktionary, which is available for multiple languages. At the same time, the approach by Dale (2020) was ranked first in the verbs track. It does not use additional sources, but it only suits Russian verbs: it performed below the baseline on Russian nouns and on the whole English dataset. Our method suits both verbs and nouns and is stable across languages.

Error Analysis
To better understand the differences in system performance and the main difficulties, we perform a quantitative and qualitative analysis of the results.

Performance on Different Classes of Words
We notice that for certain words hypernym discovery is an easier task. In particular, named entities and some other categories of words seem less challenging for our models. To test this, we divide our datasets into several parts: named entities, short words (fewer than 4 letters), and the rest. We compute MAP separately for each of these groups (see Table 5).
The MAP scores vary significantly across groups. Since the MAP for a dataset is the average of the MAPs for individual words, we can directly compare the scores for different subsets. Thus, we see that for both languages it is easier to find hypernyms for named entities. This happens because their hypernyms often contain a word from the same named entity. For instance, the named entity "Massif Central" has "massif.n.01" as one of its true hypernyms. The performance on short nouns and short verbs also differs. Whereas short nouns are often polysemous (hence more challenging), short verbs usually have one sense, are uncommon, and their sense can sometimes be deduced from their form (e.g. "to aah" is "to produce an 'aah' sound").
Finally, the performance on all other nouns and verbs, which have no such lexical cues, is lower than on the whole list of words. This trend is particularly marked for nouns, where the less challenging group (named entities) constitutes less than two fifths of both datasets. Thus, in order to evaluate taxonomy enrichment models, we should check their quality on different groups of words.

Distribution of Scores
The differences in word semantics make the dataset uneven. In addition, we would like to understand whether the performance of the models depends on the number of connected components (possible meanings) of each word. Thus, we examine how many words with more than one meaning can be predicted by the system. Figure 1 depicts the distribution of synsets over the number of senses they convey. As we can see, the vast majority of words are monosemous. For Russian nouns, the system correctly identifies almost half of them, whereas for the other datasets the share of correctly predicted monosemous words is below 30%. This stems from the fact that it is difficult for distributional models to capture multiple senses in one vector: they usually capture the most widespread sense of a word. Therefore, the number of correctly predicted synsets with two or more senses is extremely low. A similar power-law distribution would be obtained with BERT embeddings, as we still average embeddings over all contexts. This may be one of the reasons why the contextualised models did not perform better than the fastText models, which capture only the main meaning, but do it well.

Error Types
In order to understand why a large number of word hypernyms (at least 60%) are too difficult for the models to predict, we turn to manual analysis of the system outputs. We find that the errors can be divided into two groups: system errors caused by the limitations of distributional models, and taxonomy inaccuracies. Overall, we come across five main error types.
Type 1. Extracted nearest neighbours can be semantically related words but not necessarily co-hyponyms:
• delist (WordNet); expected senses: get rid of; predicted senses: remove, delete;
• хэштег (hashtag, RuWordNet); expected senses: отличительный знак, пометка (distinctive mark, label); predicted senses: символ, короткий текст (symbol, short text).
The above-mentioned mistakes and inaccuracies may dramatically decrease the scores of automatic metrics. In order to check how useful the predicted synsets are for a human annotator (i.e. whether a short list of possible hypernyms can speed up the manual extension of a taxonomy), we conduct a manual evaluation of 10 random nouns and 10 random verbs for both languages (the words are listed in Table 6). We focus on lower-quality cases and thus select words whose MAP score is below 1. Annotators with expertise in the field and knowledge of English and Russian were provided with guidelines and asked to evaluate the outputs of our best-performing system. Each word was labelled by 4 expert annotators; Fleiss's kappa is 0.63 (substantial agreement) for both datasets.
We compute the Precision@k score (the share of correct answers in the generated list from position 1 to k) for k from 1 to 10, shown in Figure 2. We can see that even for words with MAP below 1, our model manages to extract useful hypernyms.
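For reference, the Precision@k computation over the manually approved candidates is straightforward (a minimal sketch with illustrative data):

```python
def precision_at_k(predictions, approved, k):
    """Share of manually approved hypernyms among the top-k predictions."""
    return sum(p in approved for p in predictions[:k]) / k


# Toy example: two of the top-2 ranked candidates, one approved.
print(precision_at_k(["bird", "tool", "waterfowl"],
                     {"bird", "waterfowl"}, k=2))
# → 0.5
```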

Conclusion
In this work, we deal with the taxonomy enrichment task: the automatic extension of an existing taxonomy with new terms. The novelty of our work is the use of diachronic versions of wordnets in two languages: this setting reflects the real process of development of the resources over time and allows us to study whether machines could perform a similar taxonomy completion task automatically. Toward this end, we present a simple method for solving the task that can be applied to multiple languages. Its results are close to the state of the art for Russian and stable across languages. An interesting finding is that the task does not benefit from context-informed embeddings, whereas context-free vector representations like fastText often successfully identify hypernyms. On the other hand, the availability of Wiktionary substantially boosts the performance, yet models based solely on embeddings still yield competitive results. Error analysis reveals that for some groups of words (e.g. named entities) it is easier to find a hypernym, while polysemous words are significantly more challenging. At the same time, our models were able to identify some cases of polysemy which were not reflected in the wordnets. Promising directions of future work are (i) applying the developed methods to automate the work of lexicographers constructing wordnets and (ii) further improving the results for verbs.