Leveraging WordNet Paths for Neural Hypernym Prediction

We formulate the problem of hypernym prediction as a sequence generation task, where the sequences are taxonomy paths in WordNet. Our experiments with encoder-decoder models show that training to generate taxonomy paths can improve the performance of direct hypernym prediction. Despite its simplicity, our hypo2path model achieves state-of-the-art performance, outperforming the best benchmark by 4.11 points in hit-at-one (H@1).


Introduction
Hypernymy, or the IS-A relation, is one of the most important lexical relations. It is used to create taxonomies of terms, and it is the main organizational criterion of nouns and verbs in WordNet (Fellbaum, 1998). Learning hypernymy is also important in practice, as knowing a word's hypernyms gives an approximation of its meaning and enables inferences in downstream tasks such as question answering and reading comprehension. Predicting hypernymy is still a challenging task for word embeddings (Pinter and Eisenstein, 2018; Bernier-Colborne and Barriere, 2018), and previous studies have shown that it is more difficult to predict hypernymy than other lexical relations (Balažević et al., 2019). Hypernymy prediction is often evaluated against a given taxonomy, typically WordNet (Fellbaum, 1998). The main hypothesis that we pursue in this paper is that knowledge of this taxonomy, in particular of taxonomy paths, will be helpful for hypernymy prediction. We therefore introduce two simple encoder-decoder based models for hypernym prediction that make use of the information in full taxonomy paths.
There has been much recent work on modeling lexical relations based on distributed representations (Pinter and Eisenstein, 2018; Bernier-Colborne and Barriere, 2018). However, the task formulations and evaluation datasets have differed widely, making it hard to compare different approaches. We focus on evaluating on hypernymy, rather than jointly on many relations, which can mask strong performance differences across relations. We evaluate our encoder-decoder models against several previous models that have not been evaluated in the same setting before. Like many other approaches, we use WordNet as the basis for our experiments. We formulate the task as finding the correct point at which to attach a new node (synset) to the WordNet taxonomy. We build on the existing WN18RR dataset (Dettmers et al., 2018), but filter its hypernymy pairs to produce WN18RR-hp, a subset that is leak-free with respect to approaches that, like ours, use taxonomy paths during training.
We find that one of our new models, hypo2path, achieves state-of-the-art performance on hypernym prediction on WN18RR-hp, exceeding the best performance of the benchmark models by 4.11 points in accuracy of the highest-ranked prediction (hit-at-one, H@1). In particular, we observe the greatest performance gain in noun hypernym prediction, where it improves over the best benchmark by 5.17 H@1 points.

Table 1: We frame hypernym prediction as a sequence generation problem. Given a query hyponym (e.g., pizza.n.01), the hypo2path rev model generates its taxonomy path, from its direct hypernym (dish.n.02) to the root node (entity.n.01). ✓ and ✗ indicate a correct and an incorrect prediction, respectively. In each example, an underlined synset corresponds to what the model predicted as a direct hypernym.

Hypernym Prediction
Several tasks related to hypernymy have been proposed under different names: extracting is-a relations from text (hypernym discovery) (Hearst, 1992; Snow et al., 2005; Camacho-Collados et al., 2018), binary classification of whether two given words are in a hypernym relation (hypernym detection) (Weeds et al., 2014; Shwartz et al., 2016; Roller et al., 2018), and constructing or extending a taxonomy (taxonomy induction) (Snow et al., 2006; Jurgens and Pilehvar, 2016). Another recently introduced task is hierarchical path completion (Alsuhaibani et al., 2019), where, given a hypernym path of length 4 from WordNet, the task is to predict the correct hyponym(s).
While hypernymy has long been studied in computational lexical semantics, another thread of recent research on hypernymy comes from the literature on knowledge base completion (Bordes et al., 2013; Nickel and Kiela, 2017; Pinter and Eisenstein, 2018; Dettmers et al., 2018). Here, hypernymy is considered one of multiple semantic relations between two nodes in a graph. Extending this line of research, we also consider the relation prediction task in a semantic graph, but focus on only one relation of interest, hypernymy.
Like previous work in knowledge base completion (Bordes et al., 2013; Nickel and Kiela, 2017; Pinter and Eisenstein, 2018; Balažević et al., 2019), we take WordNet as our experimental space, so we learn hypernymy between synsets rather than raw lemmas. A synset is a basic lexical unit in WordNet, defined as a set of lemmas that are synonymous to each other. A synset thus also functions as one of the senses for each of the lemmas in the set. For example, the synset mark.n.01 (a number or letter indicating quality) consists of the three lemmas mark, grade, and score. Given a new synset, which we call the source node, our task is to predict its direct hypernym or target node from among the synsets in WordNet. For example, for the source node woolly daisy.n.01, the model should identify wildflower.n.01 as the target node in the graph.
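As a concrete illustration of this data (a hedged sketch using NLTK's WordNet interface, not part of our models; exact outputs depend on the WordNet version):

    from nltk.corpus import wordnet as wn

    synset = wn.synset('mark.n.01')
    print(synset.definition())     # e.g., "a number or letter indicating quality"
    print(synset.lemma_names())    # e.g., ['mark', 'grade', 'score']

    # The target node for a source synset is its direct hypernym:
    print(synset.hypernyms())

    # Full taxonomy paths from the root (entity.n.01) down to the synset:
    for path in synset.hypernym_paths():
        print([s.name() for s in path])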
While some previous approaches have predicted indirect hypernyms using the transitive closure of WordNet (Vendrov et al., 2016; Li et al., 2019), we focus on predicting direct hypernyms. Datasets for indirect hypernymy often include hypernyms that are too generic and not informative enough, where some semantically distant concepts are trivially mapped to the root node (entity.n.01) or a high-level hypernym near the root. We also restrict ourselves to modeling hypernym relations specifically, unlike much work on WN18RR which learns hypernymy as one of 11 lexical relations (Bordes et al., 2013; Pinter and Eisenstein, 2018; Balažević et al., 2019). As we discuss in Section 5, we observe that hypernymy is more effectively learned when trained on its own.

Models
In this section we introduce our two new path-based models, hypo2path and Path Encoder, along with the four benchmark models.

Path Generators: hypo2path and hypo2hyper
In our first model, we treat hypernym prediction as a sequence generation task: given a hyponym, the goal is to generate the entire path in the WordNet taxonomy starting from the root node (entity.n.01 for nouns) and ending with the direct hypernym. For example, flock.n.02 should be mapped to its hypernym path by generating entity.n.01 → abstraction.n.06 → group.n.01 → biological group.n.01 → animal group.n.01. The model is thus tasked with translating source synsets to target synset sequences. We denote this model as hypo2path. Our intuition behind this model is that training with a more difficult objective (i.e., entire hypernym path prediction rather than direct hypernym prediction) may result in a stronger model.
We use a standard LSTM-based sequence-to-sequence model (Sutskever et al., 2014) with Luong-style attention (Luong et al., 2015). The encoder maps the synset embedding of a hyponym into a hidden state, which is taken as the initial state of the LSTM decoder. The decoder generates synsets sequentially, conditioned on previously generated synsets. While the attention mechanism normally assigns weights to the source tokens, here we only have a single source token (the query hyponym). In our task, the attention mechanism serves as a way to avoid "forgetting" the source hyponym while decoding long paths.
Reversing the order of the source or target sequences can improve the performance of encoder-decoder models, since the encoded hidden state is closer to the first target token (Sutskever et al., 2014; Gillick et al., 2016). Motivated by this, we experiment with a model variant called hypo2path rev, in which we reverse the target path to generate a sequence of hypernyms starting from the direct hypernym of the source. This frames every generation step as direct hypernym prediction, which the decoder may more easily learn. Examples of generated reversed paths are shown in Table 1.
In order to determine whether generating an entire hypernym path as an auxiliary task helps to accurately predict a synset's direct hypernym, or whether generating only the direct hypernym is enough, we also perform experiments with hypo2hyper, a variant of hypo2path. Here the encoder-decoder model is trained to generate only a direct hypernym (i.e., both the source and target sequences are of length 1).

Path Encoder
We also examine the reverse approach: training an LSTM encoder that learns vector representations of hypernym paths. Given a query hyponym, we construct an embedding of a hypernym path (from the root node down), and the model is tasked to distinguish the gold path (which ends at its direct hypernym) from distractor paths. We construct path embeddings using a bidirectional LSTM followed by a fully-connected layer. The output corresponding to the last hidden state is the encoded path vector. We denote this model as Path Encoder.
Given the training set $S$ consisting of pairs $(x, p)$ of a hyponym and a path, we train the model by minimizing the Euclidean distance between the encoded path vector $V_p$ and the embedding vector $V_x$ of the hyponym. In addition, the model is trained to maximize the distance between $V_p$ and the embeddings of negative examples, generated by pairing each path with a randomly drawn hyponym $x'$ from $S$. The model is optimized with the following margin ranking loss:

$$\mathcal{L} = \sum_{(x, p) \in S} \max\left(0,\; \gamma + \|V_p - V_x\|_2 - \|V_p - V_{x'}\|_2\right),$$

where $\gamma$ is a positive margin hyperparameter.
During prediction, the model first encodes all hypernym paths $\{p \mid (x, p) \in S\}$. Then, for each query hyponym $x$, the path $p$ that minimizes $\|V_p - V_x\|_2$ is returned as the predicted hypernym path. From this path, we take the predicted direct hypernym of $x$ for evaluation, just as we do for hypo2path.

Benchmark Models
TransE In the TransE model (Bordes et al., 2013), a semantic relationship is interpreted as a vector translation in embedding space. Given a triplet (s, r, t) of a source node s, a relation r, and a target node t, the model learns embeddings for the nodes and relation such that the target vector t is near s + r.
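As a minimal sketch (our notation, not the benchmark implementation), the TransE scoring idea can be written as:

    import numpy as np

    def transe_score(s_vec, r_vec, t_vec):
        # Lower is better: distance between the translated source (s + r)
        # and the target t.
        return np.linalg.norm(s_vec + r_vec - t_vec)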
M3GM Max-Margin Markov Graph Models (M3GM) (Pinter and Eisenstein, 2018) exploit graph motif properties in WordNet (e.g., number of cycles of length 2 or 3) to predict different semantic relations. As a global feature model, M3GM reranks the top N candidates predicted by a local distributional feature model such as TransE (Bordes et al., 2013).
CRIM Bernier-Colborne and Barriere (2018) proposed a hybrid system which exploits both unsupervised pattern-based hypernym discovery and supervised projection learning (Ustalov et al., 2017). The core idea of the supervised algorithm is to learn multiple projection matrices which map a query embedding to a target hypernym. Their system ranked first on the three subtasks in SemEval-2018 Task 9 (Camacho-Collados et al., 2018).

Text2edges
The approach most similar to ours is that of Prokhorov et al. (2019), who represent each hyponym using its textual definition from WordNet and map it to its taxonomy path from the root to its parent node. Given the definition of a query hyponym, a bidirectional LSTM encoder-decoder with attention is used to generate the taxonomic path starting from the root node. For example, the definition of swift ("a small bird that resembles a swallow and is noted for its rapid flight") is mapped to the sequence 'animal, chordate, vertebrate, bird, apodiform bird'. Their best system, text2edges with pre-trained ConceptNet numberbatch embeddings (Speer et al., 2017), uses a reduced set of artificial edge label symbols rather than the original node labels.
Similar to our approach, text2edges uses a sequence-to-sequence model with an attention mechanism. However, there are several important differences. First, it encodes the definition of an input hyponym, while our prediction conditions only on the vector representation of the synset itself, without looking at its definition. Obtaining a definition of an unknown word is not always feasible, especially for domain-specific jargon and neologisms that are frequently used but seldom defined in dictionaries. On the other hand, computing their embeddings is less challenging when using approaches such as fastText (Bojanowski et al., 2017), which works even for words seen only once because it interpolates from the embeddings of each word piece. Another key difference is that text2edges can only be applied to rooted tree graphs with a single root. For this reason, it cannot be trained on the verb taxonomy in WordNet, which has more than one root.

Other Path Encoding Approaches
We next discuss some related path encoding approaches that we do not compare against in our experiments. First, Das et al. (2017) proposed a model for link prediction that is similar to Path Encoder. Given a multirelational knowledge base, their task is to assign the correct relation to entity pairs that are linked by an entity-relation path but lack a direct relation between them. Here, sequences of entities and relations are encoded with an LSTM, and the relation whose vector is closest to the encoded path is returned. In our task, however, concepts are linked by only one relation (hypernymy or instance hypernymy). This prevents us from drawing additional information from other relation paths between synsets.

Alsuhaibani et al. (2019) also use a path-based model, but predict hyponyms rather than hypernyms in WordNet. Their model learns hierarchical embeddings over the taxonomy, where each leaf node is represented as the sum of the embeddings of its first four hypernyms (i.e., one direct and three indirect hypernyms in the taxonomy path). The model predicts the hyponym of a given path by maximizing the distance between the sum of the hypernym vectors and candidate hyponym vectors. However, this approach cannot be used for hyponyms that never appear in the taxonomy.


Dataset

Dettmers et al. (2018) constructed WN18RR by removing seven inverse relations (such as hyponymy) from WN18 that caused a test leakage problem.
For this work, we only use two relations in WN18RR: hypernymy and instance hypernymy. Instance hypernymy can be thought of as a special type of hypernymy; it only holds between an instance that is a terminal node (e.g., specific persons, countries, and geographic entities) and its hypernym (common noun). Also, we evaluate verb and noun hypernymy separately because verb hypernymy (troponymy) in WordNet is conceptually distinct from noun hypernymy, as it expresses a manner relation rather than an IS-A relation (Fellbaum, 2002).
To avoid giving an unfair advantage to the path-based models, we filtered both validation and test sets to only include hyponym queries that are unseen anywhere in the full taxonomy paths of the training data. By eliminating the queries observed during path training, we made sure that all evaluated queries are equally new to both path-based models (e.g., hypo2path) and non-path models (e.g., hypo2hyper). We also exclude hyponyms from the test and validation sets which appear as hyponyms in the training set 7 to prevent the models from merely copying. We denote this subset as WN18RR-hp.
In sum, we use three different types of hypernym relation sets (noun, instance noun, verb) in our experiments.

The WordNet taxonomy is a directed acyclic graph in which many hypernyms have multiple paths to the root. When training the path models (i.e., hypo2path, Path Encoder, and text2edges), we included every existing path to a query's parent as an individual target instance.
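For illustration, the following hedged sketch enumerates such training instances with NLTK (the function name and output format are ours):

    from nltk.corpus import wordnet as wn

    def path_instances(synset):
        # One target sequence per existing root-to-parent path.
        instances = []
        for hyper in synset.hypernyms() + synset.instance_hypernyms():
            for path in hyper.hypernym_paths():
                # Each path runs from the root down to the direct hypernym.
                instances.append((synset.name(), [s.name() for s in path]))
        return instances

    print(path_instances(wn.synset('flock.n.02')))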

Evaluation Metrics
We use two different measures that represent the accuracy of model predictions: a hard accuracy measure (hit-at-one, H@1) and a "ballpark match" (soft accuracy) measure (WuP).
H@1 score The hits-at-k score (H@k), one of the most commonly used evaluation metrics for relation prediction, is the proportion of correct predictions (hits) within the top k ranked predictions. As the most intuitive and practical measure of a model's performance, we consider only the accuracy of the top-ranked prediction (H@1), rather than larger values of k (e.g., H@10).
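Concretely, H@1 reduces to top-1 accuracy; a minimal sketch (allowing multiple gold targets per query, as in WN18RR):

    def hits_at_1(ranked_predictions, gold_sets):
        # Fraction of queries whose top-ranked prediction is a gold target.
        hits = sum(preds[0] in gold
                   for preds, gold in zip(ranked_predictions, gold_sets))
        return hits / len(gold_sets)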
Wu & Palmer similarity (WuP) The Wu and Palmer score (Wu and Palmer, 1994) is a similarity measure between two nodes in a taxonomy that ranges between 0 and 1. To quantify how close two nodes are, it considers how deep the two nodes and their closest common ancestor are in the taxonomy. For instance, the WuP score for orange.n.01 and lemon.n.01 is 0.75, while it is only 0.35 for orange.n.01 and car.n.01. To assess how close a prediction is to the gold hypernym, we compute the WuP score between them using NLTK's implementation. 8 We report the averaged WuP score for each system.
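A hedged sketch of our WuP computation with NLTK, including the ad-hoc identity check described in footnote 8 (the helper name is ours):

    from nltk.corpus import wordnet as wn

    def wup(pred_name, gold_name):
        pred, gold = wn.synset(pred_name), wn.synset(gold_name)
        if pred == gold:
            return 1.0  # NLTK does not always return 1.0 for identical nodes
        return pred.wup_similarity(gold)

    print(wup('orange.n.01', 'lemon.n.01'))  # ~0.75
    print(wup('orange.n.01', 'car.n.01'))    # ~0.35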

Synset Embeddings
Averaged Lemma Vectors Following Pinter and Eisenstein (2018), 9 we computed an embedding for each synset by averaging the pretrained fastText embeddings (Bojanowski et al., 2017) 10 of its synonyms (lemmas). If a synset contained any multi-word lemma, the words within the lemma were averaged. We used these synset embeddings for all our reported experiments. 11

7 These cases exist because some queries have multiple hypernyms. WN18RR allows such queries with multiple gold targets to appear in train and evaluation sets.
8 When two nodes are identical, NLTK's implementation of the WuP score does not necessarily return 1.0. We added an ad-hoc check to make sure we get a WuP score of 1.0 in such cases.
9 We used embed_from_words.py released by the first author at https://github.com/yuvalpinter/m3gm
10 https://fasttext.cc/docs/en/pretrained-vectors.html
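A minimal sketch of this embedding scheme, assuming a loaded fastText model (e.g., via the fasttext Python package; the helper name is ours):

    import numpy as np
    from nltk.corpus import wordnet as wn

    def synset_embedding(synset, ft_model):
        lemma_vecs = []
        for lemma in synset.lemma_names():
            # Multi-word lemmas (e.g., 'animal_group') are averaged word-wise.
            words = lemma.split('_')
            lemma_vecs.append(np.mean([ft_model.get_word_vector(w)
                                       for w in words], axis=0))
        # The synset embedding is the average over its lemma vectors.
        return np.mean(lemma_vecs, axis=0)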

Model Details
Baselines We include two simple baselines, closest vector and closest co-hyponym, in order to gauge reasonable lower bounds for our metrics. The closest vector baseline is obtained by predicting the hypernym whose vector is closest (using the Euclidean norm) to the given synset, choosing from among all the hypernyms in the training set. On the other hand, closest co-hyponym is obtained by predicting the hypernym of the closest hyponym in the training set, i.e., under the assumption that nearby vectors are co-hyponyms.
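Both baselines reduce to nearest-neighbor search over precomputed synset embeddings; a hedged sketch (array names are ours):

    import numpy as np

    def closest_vector(query_vec, hypernym_vecs, hypernym_ids):
        # Predict the training hypernym whose vector is nearest to the query.
        dists = np.linalg.norm(hypernym_vecs - query_vec, axis=1)
        return hypernym_ids[np.argmin(dists)]

    def closest_cohyponym(query_vec, hyponym_vecs, their_hypernym_ids):
        # Predict the hypernym of the nearest training hyponym, assuming
        # that nearby vectors are co-hyponyms.
        dists = np.linalg.norm(hyponym_vecs - query_vec, axis=1)
        return their_hypernym_ids[np.argmin(dists)]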
TransE, M3GM, CRIM, and text2edges Our replications of the benchmark models are based on the original source code, 12 keeping the default hyperparameters except for the following changes. For TransE and CRIM, we tuned the number of training epochs. We increased the early stopping threshold to five epochs for TransE, as the model stopped too early with the default setting. Also, we trained only the supervised part of CRIM, since its unsupervised part requires an external corpus for training. All models except text2edges used fastText embeddings to compute synset embeddings. For text2edges, we replicated their best system, which uses the pretrained ConceptNet numberbatch embeddings to represent words in node definitions.
For all models except M3GM, we trained a separate model for each relation of WN18RR-hp (noun, instance noun, verb). On the other hand, we trained all three relations in a single M3GM model, as it employs graph motif features of different relations. We trained this multi-relational M3GM model as a re-ranker of a TransE model that was also trained on all three relations. 13 We did not run post-hoc tuning of the graph score weights in M3GM.
Path Generators: hypo2path and hypo2hyper We implemented a sequence-to-sequence model with Luong attention in Keras, which we used for the hypo2path and hypo2hyper experiments. We used a single-layer unidirectional LSTM with 256 hidden units and a dropout rate of 0.3 for both the encoder and the decoder. We trained the network with teacher forcing and used the Adam optimizer with a learning rate of 0.001 and batch size of 256. The embedding layer was frozen during training. Synsets without pretrained embeddings were assigned random vectors with elements sampled uniformly from [−.25, .25]. Greedy decoding was used to generate sequences. 14 We did not perform any hyperparameter tuning for these models.
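A condensed sketch of this architecture (not our released code; the vocabulary size and embedding matrix below are stand-ins):

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    VOCAB, EMB_DIM, HIDDEN = 40000, 300, 256   # stand-in vocabulary size
    emb_matrix = np.random.uniform(-.25, .25, (VOCAB, EMB_DIM))

    # Shared, frozen synset embedding layer.
    embed = layers.Embedding(
        VOCAB, EMB_DIM, trainable=False,
        embeddings_initializer=tf.keras.initializers.Constant(emb_matrix))

    # Encoder: a single source token (the query hyponym synset).
    enc_in = layers.Input(shape=(1,), dtype="int32")
    enc_out, h, c = layers.LSTM(HIDDEN, return_sequences=True,
                                return_state=True, dropout=0.3)(embed(enc_in))

    # Decoder: generates the hypernym path, trained with teacher forcing.
    dec_in = layers.Input(shape=(None,), dtype="int32")
    dec_out = layers.LSTM(HIDDEN, return_sequences=True,
                          dropout=0.3)(embed(dec_in), initial_state=[h, c])

    # Luong-style (dot-product) attention over the encoder output.
    context = layers.Attention()([dec_out, enc_out])
    logits = layers.Dense(VOCAB, activation="softmax")(
        layers.Concatenate()([dec_out, context]))

    model = Model([enc_in, dec_in], logits)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy")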
Path Encoder The Path Encoder model was implemented in PyTorch, using a single-layer bidirectional LSTM to encode the path and a fully-connected layer to map the output to the target embedding space. The dimension of the LSTM cell was 1024 for nouns and instance nouns and 512 for verbs.
The learning rate (from {0.01, 0.001}) and margin (from {0.1, 0.3, 0.5, 0.7}) were tuned separately for the noun, instance noun, and verb experiments. We used the same dropout rate, batch size, and choice of optimizer as in the hypo2path experiments.
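A minimal PyTorch sketch of the encoder and ranking loss (as given above; class and function names are ours):

    import torch
    import torch.nn as nn

    class PathEncoder(nn.Module):
        def __init__(self, emb_dim=300, hidden=1024):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.fc = nn.Linear(2 * hidden, emb_dim)  # map to synset space

        def forward(self, paths):           # paths: (batch, seq_len, emb_dim)
            out, _ = self.lstm(paths)
            return self.fc(out[:, -1, :])   # last output -> path vector

    def ranking_loss(v_p, v_x, v_x_neg, gamma=0.5):
        # Hinge loss: the gold hyponym should be closer to the path vector
        # than a randomly paired hyponym, by at least the margin gamma.
        pos = torch.norm(v_p - v_x, dim=1)
        neg = torch.norm(v_p - v_x_neg, dim=1)
        return torch.clamp(gamma + pos - neg, min=0).mean()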

Results and Discussion
Results Overall, hypo2path rev shows the highest aggregate (micro-averaged) H@1 (dev: 24.43, test: 25.59) across the three hypernymy relations (nouns, instance nouns, and verbs), while CRIM has the best aggregate ballpark correctness (WuP) scores, closely followed by hypo2path rev (Table 6).

11 We also trained hypo2path with randomly initialized embeddings (trained with the rest of the network), and observed a large drop in H@1. This suggests there is information in the embeddings which cannot be learned solely from the hypernym generation task, in line with observations of Pinter and Eisenstein (2018).
12 TransE and M3GM: https://github.com/yuvalpinter/m3gm/ CRIM: https://github.com/gbcolborne/hypernym_discovery/ text2edges: https://github.com/VictorProkhorov/Text2Path/
13 This multi-relational TransE model is different from the single-relational TransE models reported in the results section.
14 Preliminary experiments on nouns showed no improvement when using beam search (with beam widths up to 6).

Table 6: Aggregated scores across all three groups.
For both nouns (Table 3) and instance nouns (Table 4), hypo2path rev is the clear winner in terms of H@1. Despite being a simple model, it achieves the best H@1, with notable improvements over more complex benchmarks. We observe large gains (5.17 points) on nouns over CRIM and some gains (1.27 points) on instance nouns over text2edges. With respect to ballpark accuracy (WuP), hypo2path shows similar performance to CRIM and text2edges on nouns, while text2edges does slightly better on instance nouns.
For nouns, the reversed model (hypo2path rev) achieved the best performance on H@1, while the reversed and non-reversed versions performed similarly in terms of WuP. Without reversing the path, the model's H@1 degrades slightly and is closer to hypo2hyper's results. The performance of Path Encoder followed closely behind the three proposed encoder-decoder models.
On instance nouns, the performance of all models is overall much higher than on nouns. The best results are observed for the hypo2hyper, hypo2path rev, and text2edges models. That hypo2hyper and hypo2path rev perform similarly is not surprising: instance hypernymy is less likely to be learned from path generation, as it is a special type of hypernymy that only holds between a leaf and its parent node.
On the other hand, none of the models does well on verb hypernymy (Table 5). CRIM achieved the highest overall scores for verbs. Consistent with the experiments on nouns and instance nouns, hypo2path rev had relatively strong performance, with 9.22 H@1 on the test set, comparable to the best H@1 (9.71). However, the best H@1 is still below 10%, and the best WuP score is not much higher than the closest co-hyponym baseline.
Discussion Despite being trained with about ten times less data, scores for instance hypernymy are generally much higher than for noun hypernymy. Our finding that it is easier to predict hypernyms for individual entities than for common nouns is consistent with previous work (Boleda et al., 2017;Camacho-Collados et al., 2018;Balažević et al., 2019;Nguyen et al., 2019).
On the other hand, scores for hypernymy amongst verbs are very low, despite having about twice as much training data as instance hypernyms. This could be due both to the fact that verbs have more complex semantics than nouns and to the structure of the WordNet verb hierarchy: while the noun subgraph is a single tree rooted at entity.n.01, the verb subgraph consists of 599 shallow trees. The WordNet verb hierarchy also has a number of annotation errors, described in Richens (2008). In addition, verbs are more polysemous than nouns (Fellbaum, 1990), and the hypernym relation for verbs (troponymy) encompasses a diverse set of heterogeneous subsumption relations (Fellbaum, 2002; Richens, 2008).
Our results also suggest that it may be more effective to learn hypernymy separately from other lexical relations, rather than in a multi-relational setup. For example, M3GM trained with all 11 relations of WN18RR achieved a high aggregate validation H@1 of 39.88 in our replication, but a much lower H@1 for hypernymy (1.19) and instance hypernymy (3.74). These were evaluated on the original WN18RR, following Pinter and Eisenstein (2018), so these scores are not comparable to Tables 3, 4, 5, and 6. Unlike Pinter and Eisenstein (2018), we evaluated M3GM in one direction (i.e., only hypernym prediction, rather than both hyponym and hypernym prediction). These results are in line with the H@10 scores reported by Nguyen et al. (2019) and Balažević et al. (2019), which are substantially lower for hypernymy than for the other lexical relations in WN18RR.

Error Analysis
Here we examine the predictions of the best performing model, hypo2path rev, on the validation set for nouns in WN18RR-hp (647 synsets).
Path Validity Regardless of whether the predicted direct hypernym was correct, we observe that every generated path, from the predicted hypernym to the root, is a valid path in WordNet. This is not surprising, since all such paths appear in the training data. Thus, while the model always generated valid paths in the graph, those paths did not necessarily start at the correct node (i.e., the gold direct hypernym).
Nearby Nodes For noun hypernymy, 14.4% of the errors are due to predicting an indirect hypernym (Table 7). The remaining incorrect predictions are not on the path from the hyponym to the root: these include co-hyponyms ("siblings", or nodes that share the same parent) and "cousins" (nodes that share the same non-parent ancestor). 3.6% of incorrect predictions are co-hyponyms (also in Table 7). About half of all predicted cousins had a common ancestor with the query hyponym that was within four steps.
Similar Synset Embeddings Some synsets were similar enough to mislead the model: polysemous lemmas were shared across different synsets (17.4% of the total errors), and some multi-word lemmas had shared words (15.4%) (Table 7). In these cases, the model incorrectly returned a hypernym of a "confounding hyponym" that has a representation similar to the query. This type of error is attributable to the way synset embeddings are computed (i.e., by averaging lemma vectors), which we adopted from Pinter and Eisenstein (2018).

Lemma Overlap and Polysemy
Although none of the queries (synsets) are shared between the training and evaluation (validation/test) sets, some lemma-level overlaps exist (5.4% of the training set overlaps with the validation set). We checked whether our model takes advantage of any lemma overlap by computing the correlation between prediction correctness and the lemma overlap rate (i.e., how many lemmas of a query hyponym synset are already seen anywhere in the training set) of each predicted pair in the validation set. If the correlation is positive, this indicates that the model did get some hints from lemmas seen in the training set.
However, we observed a significant negative correlation (-0.228; p-value < 0.001). This is related to the point on polysemy discussed above. In nearly half of the validation instances where a query had a lemma overlap with the training set, the model incorrectly predicted the hypernym of a confounding hyponym.
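This check is a point-biserial correlation, i.e., a Pearson correlation between binary correctness and the continuous overlap rate; a sketch with toy placeholder arrays:

    from scipy.stats import pearsonr

    correct = [1, 0, 0, 1, 0]            # toy: 1 if the prediction was correct
    overlap = [0.0, 0.5, 1.0, 0.0, 0.5]  # toy: lemma overlap rate per query
    r, p = pearsonr(correct, overlap)
    print(r, p)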
Rare Synsets Is hypernym prediction more difficult for infrequent synsets (i.e., synsets whose lemmas rarely appear in the corpus from which the word vectors were derived)? We define a synset's frequency as the average of the frequencies of its lemmas. We find that the 164 synsets with frequency under 2,000 have an H@1 score of 15.2 (an 8.4 point drop). To further quantify the effect of synset frequency on performance, we ranked and binned every prediction by synset frequency (Figure 1). There is a clear upward trend, suggesting that methods designed to learn better embeddings with sparse data (Herbelot and Baroni, 2017) could improve performance substantially.

Conclusion and Future Work
In this paper we have considered the hypernym prediction task: given a synset's embedding, identify its correct direct hypernym in a taxonomy. In terms of evaluation, we have focused on both "exact match" (H@1) and "ballpark match" (WuP) metrics, and we have for the first time provided a comparison of existing models that had not previously been evaluated with the same metrics or on the same datasets.
We have introduced two simple encoder-decoder based models for hypernym prediction that make use of the information in full taxonomy paths, finding that hypo2path rev in particular shows state-of-the-art performance on the WN18RR-hp dataset. For nouns, it improves over the best benchmark model by 5.17 test H@1 points. For verbs, we find that no model achieves high performance. Instance nouns, on the other hand, are the easiest to predict. Encouragingly, the "ballpark match" (WuP) score for instance nouns is over 87 points.
There are several directions for future work. Encoding lemmas separately and with attention, rather than with a single embedding, could allow the attention mechanism to assign lower weights to less informative, misleading, or polysemous lemma names (which were involved in over a third of all errors). One potential way to handle polysemy could involve encoding both synsets and glosses, using either classical word embeddings or contextualized word embeddings. Another extension of this work is to use multi-task learning with multiple decoders and different tasks, similar to Luong et al. (2016). For example, in addition to learning to decode a path of hypernyms leading to a given synset, a model could also generate the synset's co-hyponyms or its hypernym's lexname (WordNet lexicographer file name).
Our results suggest that hypernym prediction is challenging for words which occur less frequently in the corpus used to compute embeddings. Methods are needed which can learn more effectively from low-frequency data, where the "unknown word" does not appear very often (Herbelot and Baroni, 2017;Kabbach et al., 2019).