Generationary or: “How We Went beyond Word Sense Inventories and Learned to Gloss”

Mainstream computational lexical semantics embraces the assumption that word senses can be represented as discrete items of a predefined inventory. In this paper we show this need not be the case, and propose a unified model that is able to produce contextually appropriate definitions. In our model, Generationary, we employ a novel span-based encoding scheme which we use to fine-tune an English pre-trained Encoder-Decoder system to generate glosses. We show that, even though we drop the need of choosing from a predefined sense inventory, our model can be employed effectively: not only does Generationary outperform previous approaches in the generative task of Definition Modeling in many settings, but it also matches or surpasses the state of the art in discriminative tasks such as Word Sense Disambiguation and Word-in-Context. Finally, we show that Generationary benefits from training on data from multiple inventories, with strong gains on various zero-shot benchmarks, including a novel dataset of definitions for free adjective-noun phrases. The software and reproduction materials are available at http://generationary.org.


Introduction
Virtually all modern approaches to Word Sense Disambiguation (WSD), i.e. the task of automatically mapping a word in context to its meaning (Navigli, 2009), use predetermined word senses from a machine lexicon, both in supervised (Huang et al., 2019; Bevilacqua and Navigli, 2020; Scarlini et al., 2020b) and in knowledge-based settings (Tripodi and Navigli, 2019; Scarlini et al., 2020a; Scozzafava et al., 2020). Nevertheless, researchers in Natural Language Processing (NLP), lexical semantics, and lexicography have long been warning the community about the cognitively inaccurate nature of discrete sense boundaries (Rosch and Mervis, 1975; Kilgarriff, 1997; Tyler and Evans, 2001). As Kilgarriff (2007) argued, different language users have different understandings of words. This fact explains why inter-annotator agreement (ITA) estimates on WSD annotation tasks have never exceeded the figure of 80% (Edmonds and Kilgarriff, 2002; Navigli et al., 2007). Moreover, this casts doubt upon the reliability of human-made inventories and "gold standard" evaluation datasets (Ramsey, 2017). Having no indisputable way of determining where one sense of a word ends and another begins, together with the fact that little consensus about how to represent word meaning has hitherto existed (Pustejovsky, 1991; Hanks, 2000; Nosofsky, 2011), are issues lying at the core of what makes WSD hard (Jackson, 2019). Moreover, while English inventories of senses and corpora are widely available, the same cannot be said for other languages (Scarlini et al., 2019; Barba et al., 2020; Pasini, 2020), and this limits the scalability of Natural Language Understanding tasks to multiple languages (Navigli, 2018).

* These authors contributed equally.
In this paper we overcome these limitations by proposing a unified approach to computational lexical semantics that has as its central focus Definition Modeling (DM), i.e. the task of generating a gloss from static or contextual embeddings (Noraset et al., 2017). Generating a meaning description (definiens) to define a given term in context (definiendum) avoids many of the concerns highlighted above, in that we are not limited to a preexisting list of meanings. We show that we can use a single generation model, i.e. Generationary, not just to compete on the DM benchmarks, but also to achieve strong results on fully-discriminative tasks such as WSD and the recently-proposed Word-in-Context (Pilehvar and Camacho-Collados, 2019, WiC). This, in turn, results in a more solid assessment of the generation quality, a notorious problem in Natural Language Generation (NLG) evaluation (Gatt and Krahmer, 2018).
In contrast to previous approaches in DM (Gadetsky et al., 2018), we dispense with the requirement of having the definiendum represented by a single vector, and we condition gloss generation on a context of which the definiendum is an arbitrary span. This allows for the generation of contextual definitions for items that are rarely covered by sense inventories, such as free word combinations (e.g. clumsy apology or nutty complexion). Finally, the generative formulation makes it possible to train on several lexicographic resources at once, resulting in a versatile model that performs well across inventories, datasets, and tasks.
The main contributions of our approach are as follows:
1. We propose the use of a single conditional generation architecture to perform English DM, WSD and WiC;
2. Our model achieves competitive to state-of-the-art results despite dropping the need of choosing from a predefined sense inventory;
3. Thanks to our encoding scheme, we can represent the definiendum as a span in the context, thus enabling definition generation for arbitrary-sized phrases, and seamless usage of BART (Lewis et al., 2019), a pre-trained Encoder-Decoder model;
4. Additionally, we release a new evaluation dataset to rate glosses for adjective-noun phrases.
We envision many possible applications for Generationary, such as aiding text comprehension, especially for second-language learners, or extending the coverage of existing dictionaries.

Related Work
Recent years have witnessed the blossoming of research in Definition Modeling (DM), whose original aim was to make static word embeddings interpretable by producing a natural language definition (Noraset et al., 2017). While subsequently released datasets have included usage examples to account for polysemy (Gadetsky et al., 2018; Chang et al., 2018), many of the approaches to "contextual" DM have nevertheless exploited the context merely in order to select a static sense embedding from which to generate the definition (Gadetsky et al., 2018; Chang et al., 2018; Zhu et al., 2019). Such embeddings, however, are non-contextual.
Other works have made a fuller use of the sentence surrounding the target, with the goal of explaining the meaning of a word or phrase as embedded in its local context (Ni and Wang, 2017;Mickus et al., 2019;Ishiwatari et al., 2019). However, these approaches have never explicitly dealt with WSD, and have shown limits regarding the marking of the target in the context encoder, preventing an effective exploitation of the context and making DM overly reliant on static embeddings or surface form information. For example, in the model of Ni and Wang (2017), the encoder is unaware of the contextual target, whereas Mickus et al. (2019) use a marker embedding to represent targets limited to single tokens. Finally, Ishiwatari et al. (2019) replace the target with a placeholder, and the burden of representing it is left to a character-level encoder and to static embeddings. This latter approach is interesting, in that it is the only one that can handle multi-word targets; however, it combines token embeddings via order-invariant sum, and thus it is suboptimal for differentiating instances such as pet house and house pet.
Recent approaches have explored the use of large-scale pre-trained models to score definitions with respect to a usage context. For example, Chang and Chen (2019) proposed to recast DM as a definition ranking problem. A similar idea was applied in WSD by Huang et al. (2019), leading to state-of-the-art results. However, both of these approaches fall back on the assumption of discrete sense boundaries, and are therefore unable to define targets outside a predefined inventory.
With Generationary, by contrast, we are the first to use a single Encoder-Decoder model to perform diverse lexical-semantic tasks such as DM, WSD and WiC. Moreover, we address the issue of encoding the target in context by using a simple, yet effective, encoding scheme which makes use of special tokens to mark the target span, producing a complete and joint encoding of the context without the need for other components. This allows the effective usage of a pre-trained model, which we fine-tune to generate a gloss given the context.

Generationary
With this work we present a new approach to computational lexical semantics, by means of which we generate glosses for arbitrary-sized phrases in context. Our work has a wider scope than its predecessors, in that we put forward a unified method that overcomes the limits of both DM and WSD. With respect to DM, our full sequence-to-sequence framing of the task enables us to deal with units having different compositional complexity, from single words to compounds and phrases. Thus, Generationary can gloss a definiendum that is not found in dictionaries, such as starry sky, with the appropriate definiens, e.g.: 'The sky as it appears at night, especially when lit by stars'.
As regards WSD, instead, we are no longer bound by the long-standing limits of predefined sense inventories. Thus, it is possible to give (i) a meaningful answer for words that are not in the inventory, and (ii) one that fits the meaning and the granularity required by a given context better than any sense in the inventory. Consider the following:
(1) (a) Why cannot we teach our children to read, write and reckon?
    (b) Mark or trace on a surface.
    (c) To be able to mark coherent letters.
The target word in (1 a) is associated 3 with the gold gloss (1 b) from WordNet (Fellbaum, 1998), the most used sense inventory in WSD. However, Generationary arguably provides a better gloss (1 c). In what follows, we detail our approach.

Gloss Generation
In this work we address the task of mapping an occurrence of a target word or phrase $t$ (in a context $c$) to its meaning, by reducing it to that of generating a textual gloss $g$ which defines $\langle c, t \rangle$. The target $t$ is a span in $c$, i.e. a pair of indices $\langle i, j \rangle$ corresponding to the first and the last token which make up the target in $c$. Formally, we propose to apply the standard sequence-to-sequence conditional generation formulation, in which the probability of a gloss, given a context-target pair, is computed by factorising it auto-regressively:

$$P(g \mid c, t) = \prod_{k=1}^{|g|} P(g_k \mid g_{0:k-1}, c, t) \qquad (1)$$

where $g_k$ is the $k$th token of $g$ and $g_0$ is a special start token. By means of this procedure we can readily perform contextual DM ($t \neq \langle 1, |c| \rangle$), as well as "static" DM, i.e. when the target encompasses the whole context ($t = \langle 1, |c| \rangle$).

3 According to the human annotators of the Senseval-2 WSD evaluation dataset (Edmonds and Cotton, 2001).
To learn the function in Eq. (1) we employ a recent Encoder-Decoder model, i.e. BART (Lewis et al., 2019), which is pre-trained to reconstruct text spans on massive amounts of data. The use of a pre-trained model is particularly important in our case, as successfully generating a gloss for a wide range of different context-target pairs requires a model which can wield vast amounts of semantic and encyclopedic knowledge. BART can be fine-tuned to perform specific kinds of conditional generation by minimizing the cross-entropy loss on new training input-output pairs. In our approach we give as input to BART a $\langle c, t \rangle$ pair, and train to produce the corresponding gold gloss $g$, with $\langle c, t \rangle$ and $g$ being gathered from various sources (see Section 4.1). We devise a simple encoding scheme that allows us to make the model aware of the target boundaries, without architectural modifications to BART. In particular, we encode $\langle c, t \rangle$ pairs as sequences of subword tokens in which the boundaries of the $t$ span in $c$ are marked by two special tokens, i.e. <define> and </define>. For example, the sentence I felt like the fifth wheel, with the phrase fifth wheel as the target, will be encoded as I felt like the <define> fifth wheel </define>.
We fine-tune BART to generate the corresponding gloss g: (idiomatic, informal) Anything superfluous or unnecessary.
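The encoding step can be illustrated with a minimal sketch. It assumes a whitespace-tokenised context and 0-based, inclusive token indices, whereas the actual system works on BART subword tokens; the function name `mark_target` is hypothetical and is not taken from the released software.

```python
def mark_target(tokens, start, end, open_tag="<define>", close_tag="</define>"):
    """Wrap tokens[start..end] (inclusive, 0-based) in the two marker tokens.

    Hypothetical helper: whitespace tokens stand in for BART subwords.
    """
    if not 0 <= start <= end < len(tokens):
        raise ValueError("target span must lie inside the context")
    marked = (
        tokens[:start]
        + [open_tag]
        + tokens[start:end + 1]
        + [close_tag]
        + tokens[end + 1:]
    )
    return " ".join(marked)


context = "I felt like the fifth wheel".split()
encoded = mark_target(context, 4, 5)  # target span: "fifth wheel"
print(encoded)
```

Passing the first and last index of the whole context yields the "static" setting, in which the target covers the entire input.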

Discriminative Sense Scoring
In this section we introduce three distinct techniques by means of which Generationary tackles discriminative tasks without additional training.

Gloss Probability Scoring
With Eq. (1) we are able to compute the probability of a certain gloss $g$ given a pair $\langle c, t \rangle$. Thus, we can perform classification by picking the sense which is associated with the gloss with the highest probability. Formally, we select:

$$\hat{s} = \operatorname*{argmax}_{s \in S_t} P(G(s) \mid c, t) \qquad (2)$$

where $S_t \subset S$ is the set of applicable senses for target $t$ from the full inventory $S$, and $G : S \to \mathcal{G}$ is a function mapping senses to glosses ($G$, $\mathcal{G}$, $S$ and $S_t$ are determined by the reference dictionary).
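As a rough illustration of probability-based sense selection, the sketch below scores each candidate gloss by the sum of its per-token log-probabilities and picks the best one. The per-token probabilities are made-up numbers standing in for the values a fine-tuned model would assign under teacher forcing; the sense identifiers and function names are hypothetical. No length normalisation is applied here, and this section does not state whether the authors use one.

```python
import math


def gloss_log_prob(token_probs):
    """Log-probability of a gloss: sum of the log-probabilities of its tokens."""
    return sum(math.log(p) for p in token_probs)


def pick_sense(candidate_token_probs):
    """Return the sense whose gloss has the highest log-probability.

    candidate_token_probs maps a sense id to the list of conditional
    probabilities of its gloss tokens (assumed to come from the model).
    """
    return max(
        candidate_token_probs,
        key=lambda s: gloss_log_prob(candidate_token_probs[s]),
    )


# Made-up probabilities for two candidate glosses of the same target.
toy_candidates = {
    "sense_a": [0.5, 0.4, 0.5],
    "sense_b": [0.9, 0.2, 0.1],
}
predicted = pick_sense(toy_candidates)
print(predicted)
```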

Gloss Similarity Scoring
The usage of model gloss probability does not take into account the definitions that are actually generated. Thus, we adopt a simple best match approach where we compute similarity scores between the system-generated gloss and the glosses associated with the candidates, and we predict the candidate with the highest similarity. We employ a cosine similarity between the gloss vectors produced via the recently introduced Sentence-BERT model (Reimers and Gurevych, 2019, SBERT), and select a predicted sense $\hat{s}$ as follows:

$$\hat{s} = \operatorname*{argmax}_{s \in S_t} \mathrm{sim}(\hat{g}, G(s)) \qquad (3)$$

where $\hat{g}$ is the most probable output found by beam-search decoding, and $\mathrm{sim}$ is the SBERT similarity.
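The best-match rule can be sketched as follows. The three-dimensional vectors are toy stand-ins for SBERT sentence embeddings, and the function and sense names are hypothetical; only the cosine similarity and the argmax over candidates reflect what the text describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(generated_vec, candidate_vecs):
    """Return the sense whose gloss vector is closest to the generated gloss."""
    return max(candidate_vecs, key=lambda s: cosine(generated_vec, candidate_vecs[s]))


# Toy embeddings (assumed values, not real SBERT output).
generated = [1.0, 0.2, 0.0]
candidates = {
    "sense_a": [0.9, 0.1, 0.1],
    "sense_b": [0.0, 1.0, 0.5],
}
predicted = best_match(generated, candidates)
print(predicted)
```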

Gloss Similarity Scoring with MBRR
Using just the most probable sequence in the decoding process for the best match search is suboptimal, as more probability mass might be cumulatively assigned to a cluster of very similar outputs. To take this into account we propose the use of a simple approach inspired by Minimum Bayes Risk Re-Ranking (Kumar and Byrne, 2004, MBRR), which considers the mutual (dis)similarity within the set $\hat{G}$ of $k$ generated outputs decoded with beam search. This is done by rescoring each output as the sum of the dissimilarities over all $k$ outputs, weighted by their conditional probability:

$$\hat{g} = \operatorname*{argmin}_{g \in \hat{G}} \sum_{g' \in \hat{G}} \left(1 - \mathrm{sim}(g, g')\right) P(g' \mid c, t) \qquad (4)$$

The new prediction $\hat{g}$ is then plugged into Eq. (3) as in simple similarity-based scoring.
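The re-ranking idea can be sketched in a few lines. The sketch assumes that dissimilarity is taken as one minus similarity, and it uses a word-overlap (Jaccard) similarity as a self-contained stand-in for SBERT; the beam outputs and their probabilities are invented for illustration, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity; a toy stand-in for the SBERT similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def mbrr_select(outputs, probs, sim):
    """Pick the output with the lowest probability-weighted dissimilarity
    to all decoded outputs (dissimilarity assumed to be 1 - sim)."""
    def risk(g):
        return sum((1.0 - sim(g, other)) * p for other, p in zip(outputs, probs))
    return min(outputs, key=risk)


# Invented beam outputs with invented conditional probabilities.
beam = ["a domestic animal", "a domesticated animal", "a building for people"]
beam_probs = [0.35, 0.25, 0.40]
selected = mbrr_select(beam, beam_probs, jaccard)
print(selected)
```

In this toy case the single most probable output is the third one, but the first is selected because it sits in a cluster of similar outputs that jointly carry more probability mass.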

Dictionary Gloss Datasets
We now move on to describe the datasets which we use to train Generationary models by fine-tuning BART. Each dataset includes $\langle c, t, g \rangle$ triples, which are used as our input and output for training.
CHA (Chang and Chen, 2019) is an online dataset (github.com/MiuLab/GenDef) of examples and definitions from oxforddictionaries.com. It comes with two settings, each with its own train/dev/test splits: in the Seen setting (CHA$_S$), definitions in the training set are also present in the test set, while the Unseen setting (CHA$_U$) has a zero-shot test of lemmas not featured in the training set.
SEM is a dataset built by exploiting the SemCor corpus (Miller et al., 1993), which is manually tagged with WordNet senses, to associate sentence-level contexts with definitions. We filter out NER-like sense annotations (e.g. those mapping proper names such as Frank Lloyd Wright to the general sense of person). Moreover, to improve coverage, since not all WordNet senses appear in SemCor, we use synonymy information to build additional contexts, e.g. <define> separate, part, split </define> → go one's own way; move apart.
UNI is the concatenation of the train splits of SEM and CHA, plus the following: (i) a cleaned-up January 2020 dump of Wiktionary, from which circular definitions (e.g. starting with synonym of) have been filtered out, and (ii) the training split containing data from the GNU Collaborative International Dictionary of English (GCIDE), included in the dataset of Noraset et al. (2017).
We use CHA and SEM since they were employed by state-of-the-art approaches to DM (Chang and Chen, 2019) and WSD (Huang et al., 2019). With UNI, instead, we bring together diverse sense inventories to create a dataset that is less dependent on the idiosyncrasies of each of its sources. We report statistics in Table 1.

The Hei++ Evaluation Dataset
As of now, there is no publicly available dataset enabling the assessment of definition generation quality on free phrases (e.g. exotic cuisine), which are not commonly found in traditional dictionaries and benchmarks. Thus, we present Hei++, a dataset which associates human-made definitions with adjective-noun phrases. With Hei++ we can test the ability of Generationary to generate glosses, in a zero-shot setting, for items which are not featured in the training set. We encourage the community to use it for future evaluations.
As a first step in building Hei++ we retrieve the test split of the HeiPLAS dataset (Hartung, 2016), which we choose as our starting point since it contains commonly used adjective-noun phrases. After removing duplicates and discarding ill-formed phrases, we ask an expert lexicographer to write a single definition for each adjective-noun pair. At the end of the annotation process we obtain a dataset made up of 713 adjective-noun phrases with their definitions to be used as a gold standard.

Quantitative Experiments
We first perform a threefold automatic evaluation to test the strengths of Generationary in different settings. On the one hand, we assess its ability to produce suitable definitions by testing the generation quality on the DM setting (Section 5.1). On the other, we aim to further appraise how well the generated outputs describe the contextual meaning, by evaluating the performance they bring about on the discriminative benchmarks of WSD (Section 5.2) and WiC (Section 5.3).

Definition Modeling
In this experiment we use different NLG measures to automatically assess how well generated definitions match gold glosses. We evaluate on the Seen (CHA$_S$) and Unseen (CHA$_U$) test splits of CHA, which is the largest contextual DM benchmark released so far. Moreover, we report results on our Hei++ (HEI) dataset of adjective-noun phrases. We do not include results on the datasets of Noraset et al. (2017) and Gadetsky et al. (2018), as the former only includes targets with no surrounding context, and the latter is largely included in CHA.

Systems
For each evaluation dataset D we test two Generationary models: one trained on the corresponding train split (Gen-D), and one trained on UNI (Gen-UNI). We compare against (i) a random baseline which predicts, for each test item, a random definition taken from the same test set; (ii) the model of Ishiwatari et al. (2019), which we have re-trained on the same data as Generationary (Ishiwatari-D); and (iii) the state-of-the-art approach of Chang and Chen (2019, Chang). On HEI, which has no training split, we only evaluate Gen-UNI and the random baseline, since Ishiwatari-UNI generates strings consisting of mostly unknown word placeholders (<unk>), and Chang and Chen (2019) cannot handle multi-word targets.

Measures
Previous approaches have employed both perplexity (PPL) and string-matching measures (e.g. BLEU) for scoring DM systems. PPL is very appropriate when, as in DM, there are many possible "good" answers. PPL, however, produces a score just on the basis of a pre-existing gold definition, by collecting teacher forcing probabilities without taking into account any actual output generated through beam-search decoding, and thus not assessing the quality of the generation. To evaluate this quality, BLEU and ROUGE-L (Lin, 2004) are also reported. Note, however, that these two measures are based on simple string matches which, in many cases, are not good indicators of output quality. To counteract this problem, we also report results with METEOR (Banerjee and Lavie, 2005), which uses stemming and WordNet synonyms, and BERTScore (Zhang et al., 2019), which uses vector-based contextual similarities. Finally, to present a complete comparison against the ranking-based approach of Chang and Chen (2019), we report results (precision@k) on their retrieval task of recovering the correct definition, for the target in context, from the whole inventory of 79,030 unique glosses in their dataset. We rank definitions by applying the MBRR plus cosine similarity strategy described in Section 3.2.3.

Results
As shown in Table 2, Generationary models outperform competitors in every setting. On CHA$_S$, our specialized model (Gen-CHA$_S$) shows much better results than Gen-UNI, because NLG measures give high scores to glosses which are lexically similar to the gold ones, while multi-inventory training will, instead, expose the model to many other, differently formulated, but equally valid definitions. Note, moreover, that our Gen-CHA$_S$ model outperforms both Ishiwatari et al. (2019) and Chang and Chen (2019), even though the latter, being a ranking model, is obviously at an advantage, since it gets a perfect score when it ranks the gold definition first. In CHA$_U$ we observe that the Gen-UNI model attains higher performance than Gen-CHA$_U$, indicating that, when 'overfitting' on the inventory is factored out, multi-inventory training enables the model to generalize better in a zero-shot setting. Furthermore, figures for HEI are in the same ballpark as those on CHA$_U$, demonstrating that Generationary can easily deal not only with unseen lemmas, but also with entirely different kinds of target.
Additionally, in Table 3 we report the results of the precision@k evaluation when macro-averaging on lemmas (left) and senses (right). Figures on the two different splits of CHA show very different trends. On the CHA$_S$ setting, the base model from Chang and Chen (2019) achieves, in most cases, the highest recovery rate. However, with k = 1, which is the most realistic case, Gen-CHA$_S$ outperforms the competitor by 4.6 points when macro-averaging on senses, i.e. items with the same gold definition. On the more challenging zero-shot CHA$_U$ setting, both Generationary models strongly outperform Chang (large), more than doubling the performance on k = 1 and showing an improvement of more than 75% on k = 10. Gen-UNI, which was underperforming Gen-CHA$_S$ in the Seen setting, now achieves better results across the board, since it can exploit the supervision of a wide array of different glosses from multiple inventories.
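For readers unfamiliar with macro-averaged retrieval scores, the following sketch shows one way to compute them. It assumes that, with a single gold definition per item, precision@k is counted as 1 when the gold definition appears among the top k ranked candidates and 0 otherwise; the exact definition used in the evaluation is not spelled out in this section. The field names and toy data are hypothetical.

```python
from collections import defaultdict


def hit_at_k(ranked, gold, k):
    """1.0 if the gold definition is among the top-k ranked ones, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def macro_precision_at_k(items, k, group_key):
    """Average the per-group mean hit rate, so every group (e.g. every lemma
    or every sense) counts equally regardless of how many items it has."""
    groups = defaultdict(list)
    for item in items:
        groups[item[group_key]].append(hit_at_k(item["ranked"], item["gold"], k))
    return sum(sum(hits) / len(hits) for hits in groups.values()) / len(groups)


# Toy test items: two for the lemma "bank", one for the lemma "run".
toy_items = [
    {"lemma": "bank", "ranked": ["g1", "g2", "g3"], "gold": "g1"},
    {"lemma": "bank", "ranked": ["g2", "g1", "g3"], "gold": "g1"},
    {"lemma": "run", "ranked": ["g7", "g8"], "gold": "g7"},
]
p_at_1 = macro_precision_at_k(toy_items, 1, "lemma")
p_at_2 = macro_precision_at_k(toy_items, 2, "lemma")
print(p_at_1, p_at_2)
```

Grouping by a sense identifier instead of the lemma would give the sense-level macro-average.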

WSD Evaluation
We now move to the assessment of Generationary in a traditional WSD setting. Even though our approach goes beyond fixed sense inventories, here we want to show that this degree of freedom does not come at the expense of performance when presented with the task of choosing a sense from a finite predefined list. We test on the five datasets collected in the evaluation framework of Raganato et al. (2017), namely: Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013), and SemEval-2015 (Moro and Navigli, 2015), which are all annotated with WordNet 3.0 senses (or converted to its inventory). We denote with ALL and ALL$^-$ the concatenation of all evaluation datasets, including or excluding, respectively, SemEval-2007, which is our development set for this experiment. Moreover, we test on the subset of ALL$^-$ containing instances whose lemmas are not covered in SemCor (0-shot).

Systems
To choose a possible sense from WordNet and perform WSD, we evaluate the techniques presented in Section 3.2, i.e. probability scoring (Prob.), simple similarity scoring (Sim.), and similarity scoring with MBRR. We evaluate our Gen-SEM, which is trained on examples specifically tagged according to the WordNet inventory, and Gen-UNI, which includes definitions from many different inventories. We compare against recent WSD approaches which make use of gloss knowledge, i.e. LMMS (Loureiro and Jorge, 2019) and the state-of-the-art approach of GlossBERT (Huang et al., 2019).

Results
We report the results of the WSD evaluation in Table 4. The MBRR scoring strategy proves to be the most versatile, with Gen-SEM (MBRR) achieving a higher F1 than Gen-SEM (Prob.) on almost every dataset, and outperforming Gen-SEM (Sim.) on the 0-shot set. As both Sim. and MBRR outscore Prob., it is clear that generating a gloss and ranking candidates with similarity is a better strategy than directly ranking with model probability, which leaves room for further improvement as better similarity measures are developed. On another note, Gen-SEM (MBRR) achieves performances which are overall comparable with those of the state of the art (GlossBERT) without having been explicitly trained to perform WSD. Compared to Gen-SEM (MBRR), Gen-UNI (MBRR) sacrifices 0.4 and 0.2 points on, respectively, ALL and ALL$^-$, but obtains 8 points more on the zero-shot set, also improving over GlossBERT by 4.3 points. This demonstrates that, when using Generationary with data from multiple inventories, (i) performances remain in the same ballpark as those of a state-of-the-art system, and (ii) much improved generalizability is achieved, as shown by the state-of-the-art results on the zero-shot setting.

Table 4: [...] (2) Generationary. Columns: datasets in the evaluation framework (S2 to S15), ALL w/ and w/o the dev set (ALL/ALL$^-$), zero-shot set (0-shot), and results by PoS on ALL (N/V/A/R). F1 is reported. Bold: best. *: re-computed with the original code.

Word-in-Context
In the task of Word-in-Context (WiC) (Pilehvar and Camacho-Collados, 2019), predefined sense inventories are not required and meaning identification is reduced to a binary problem in which, given two contexts, both featuring an occurrence of the same lemma, a model has to predict whether the two targets have the same meaning. We compare against Chang and Chen (2019), which is the only DM approach reporting results for WiC, following their setting in which no task-specific training is performed and the training set for the task is used for testing. Results are reported for both Gen-CHA$_S$, which is trained on the same data as Chang and Chen (2019), and Gen-UNI (in this experiment we have excluded Wiktionary, which was used to build the WiC dataset, from the UNI training set). To perform the task, for each pair in the WiC dataset we generate two sets, $\gamma$ and $\gamma'$, each of 10 glosses, for the two respective sentences in the pair. Then, for each generated gloss $\hat{g} \in \gamma$, we compute the score $z_{\hat{g}}$ as the mean SBERT similarity between $\hat{g}$ and the 10 generated glosses in $\gamma'$. Analogously, we compute $z_{\hat{g}'}$ as the mean similarity between $\hat{g}' \in \gamma'$ and the glosses in $\gamma$. For each gloss $g$ we normalize $z_g$ by subtracting an approximate mean similarity of $g$ with random glosses, computed as the mean similarity between $g$ and all other unrelated glosses in the batch. If the mean score $(\sum_{\hat{g} \in \gamma} z_{\hat{g}} + \sum_{\hat{g}' \in \gamma'} z_{\hat{g}'})/20$ exceeds a threshold $t$ (tuned on the dev set), we predict that a WiC pair shares the same sense.
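The scoring procedure for a WiC pair can be sketched as below. A word-overlap (Jaccard) similarity stands in for SBERT so that the example is self-contained, the baseline similarity with unrelated glosses is passed in as a function rather than estimated from a batch, and each toy set holds one gloss instead of ten. All names and numbers are illustrative assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity; a toy stand-in for the SBERT similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def wic_score(gamma, gamma_prime, sim, baseline):
    """Mean normalised cross-set similarity between two sets of glosses.

    baseline(g) is assumed to approximate the mean similarity of g with
    unrelated glosses; it is subtracted from each gloss score.
    """
    def z(g, others):
        return sum(sim(g, o) for o in others) / len(others) - baseline(g)

    total = sum(z(g, gamma_prime) for g in gamma)
    total += sum(z(g, gamma) for g in gamma_prime)
    return total / (len(gamma) + len(gamma_prime))


def same_sense(gamma, gamma_prime, sim, baseline, threshold):
    """Predict that the two targets share a sense if the score is above threshold."""
    return wic_score(gamma, gamma_prime, sim, baseline) > threshold


def flat_baseline(g):
    return 0.1  # assumed constant similarity with unrelated glosses


score_same = wic_score(["a financial institution"], ["a financial institution"],
                       jaccard, flat_baseline)
score_diff = wic_score(["a financial institution"], ["sloping land beside water"],
                       jaccard, flat_baseline)
print(score_same, score_diff)
```

With ten glosses per set the denominator becomes 20, matching the mean score described in the text.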
Gen-CHA$_S$, with an accuracy of 69.2, outperforms Chang and Chen (2019), which achieves 68.6, demonstrating the strength of our approach in this setting. Moreover, Gen-UNI, which attains a result of 71.1, outscores both Gen-CHA$_S$ and the competitor, once again bearing witness to the versatility of training on multiple inventories.

Qualitative Experiment
Given that the ability of Generationary to produce fluent and meaningful definitions is its key asset, in addition to the automatic evaluation reported in Section 5 we devised a qualitative experiment on two distinct datasets we constructed. While our previous experiments shed light upon the quality of Generationary in comparison with other automatic systems, here we employ human annotators to compare definitions produced with our approach against glosses written by human lexicographers.
The datasets that we use in this experiment are (i) our Hei++ dataset of definitions for adjective-noun phrases (Section 4.2) and (ii) SamplEval, i.e. a sample of 1,000 random instances made up of 200 items for each of the five WSD datasets included in ALL (see Section 5.2), with at most one instance per sense in total. With Hei++ we assess the ability of Generationary to accurately gloss complex expressions, such as free phrases (e.g. wrong medicine or hot forehead), that are rarely covered by traditional dictionaries. With SamplEval, instead, we test whether generated glosses can improve over gold definitions associated with gold senses in WordNet.

Table 5: Qualitative evaluation results. Columns: dataset, average Likert for gold and Generationary, % of Generationary scores equal or better than gold (≥).

dataset      gold   Gen.   ≥
Hei++        4.43   3.58   29.9
SamplEval    3.75   3.62   51.3

Annotators and Annotation Scheme
For each context-target pair in Hei++ and SamplEval we have two definitions: a gold one, written by a lexicographer, and one generated by Gen-UNI, which is not tied to any specific inventory and has proven the most versatile model across tasks. We hired three annotators with Master's Degrees in Linguistics and effective operational proficiency in English and, in a similar fashion to Erk and McCarthy (2009), we asked them to assign a graded value to the definitions based on their pertinence to describing the target $t$ in $c$, according to a five-level Likert scale (see Appendix F). 13 The annotators received a wage in line with the standards of their country of residence, and worked an overall amount of 90 person-hours (30 per annotator). The ITA was substantial, with an average pairwise Cohen's κ of 0.69 (SamplEval) and 0.67 (Hei++).
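As a reminder of how an average pairwise agreement figure of this kind is obtained, the sketch below computes unweighted Cohen's κ for every pair of annotators and averages the results. Whether the reported figures use the unweighted or a weighted variant is not stated here, so the unweighted form is an assumption; the ratings are invented.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels.

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average kappa over all unordered pairs of annotators."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Invented Likert ratings from three annotators on four items.
annotators = [
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 2, 1],
]
average_kappa = mean_pairwise_kappa(annotators)
print(average_kappa)
```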

Results
As can be seen in Table 5, the quality of Generationary glosses in the SamplEval dataset is comparable to those drawn from WordNet. Note that, although it would be expected for gold annotations to come close to the top of the scale, this is not the case, as they received an average score of 3.75 out of 5, demonstrating the suboptimal nature of "ready-made" meaning distinctions. We report comparable scores on the Hei++ dataset. The gap with respect to gold definitions here is wider, probably because (i) Generationary is not specifically trained on complex expressions, and (ii) the gold score is higher since phrases are less ambiguous than single words. Interestingly, the annotators rated Generationary glosses at least as high as their gold counterparts on 51.3% and on 29.9% of the total cases on SamplEval and Hei++, respectively: this result provides evidence for the reliability of Generationary definitions as valid alternatives to glosses taken from established inventories of discrete word senses.

13 We presented glosses for each sentence in random order.

Table 6
$c_1$: [...] I scooted them into the dog run.
$\hat{g}_1$: Cause to move along by pushing.
$g_1$: Run or move very quickly or hastily.
$c_2$: Exotic cuisine.
$\hat{g}_2$: A style of cooking that is out of the ordinary and unusual (as if from another country).
$g_2$: Cuisine involving unfamiliar foods.
$c_3$: He was never the same after the accident.
$\hat{g}_3$: Indicates that a person has lost the good qualities that were present before the accident.
$c_4$: Sam is in a better place now.
$\hat{g}_4$: A phrase used to express that one has learned about another's death.
$c_5$: Yesterday I had to undergo a beardectomy.
$\hat{g}_5$: The surgical removal of the beard.
$c_6$: You've got a hard coconut to smash here, Dr. Yang!
$\hat{g}_6$: Something difficult to deal with.
$c_7$: The mind is haunted by the ghosts of the past.
$\hat{g}_7$: People's memories of the past are still present in their mind, even after they have ceased to exist.
$c_8$: The fault, dear Brutus, is not in our stars, but in ourselves.
$\hat{g}_8$: The responsibility for a problem lies with the people who cannot see it themselves.

Generation Examples
In Table 6 we show a sample of definitions generated via our Gen-UNI model for various spans in context. As can be seen, the glosses $\hat{g}_1$ and $\hat{g}_2$ (extracted from SamplEval and Hei++, respectively) demonstrate that Generationary can indeed provide better, more specific definitions than gold standard ones. The following reported examples show the strength of our model on contexts which do not resemble those it is trained on: Generationary is proficient at (i) handling fixed or semi-fixed idioms of different lengths ($\hat{g}_3$, $\hat{g}_4$) and (ii) defining non-conventional words and phrases ($\hat{g}_5$, $\hat{g}_6$); interestingly, Generationary is also able to (iii) provide high-level explanations for whole figurative contexts ($\hat{g}_7$, $\hat{g}_8$), which goes well beyond what is commonly referred to as glossing. This might result in interesting applications beyond the scope of this work, e.g. for paraphrase generation and metaphor interpretation (Rai and Chakraverty, 2020).

Error Analysis
To have a broader picture of the quality of the outputs produced by means of Generationary, we perform behavioural testing for our Gen-UNI model, in the spirit of Ribeiro et al. (2020). As a result, we can identify two main trends behind failures to generate an appropriate contextual definition, which we refer to as disambiguation errors and hallucinations, respectively.
Disambiguation errors. When the model predicts a perfectly good definition for the target, but one that fits another common context of occurrence, a disambiguation error arises. For instance, given the ⟨c, t⟩ pair in (2a), with the second occurrence of the word pupil as the target, the model fails to identify the "person receiving instruction" reading in context only if it outputs the wrong sense; here the target is the first occurrence, pupils, for which the model fails to identify the "aperture in the iris of the eye" sense, and instead produces an output gloss which is compatible with the meaning of the homograph (2b):
(2) (a) The teacher stared into the pupils of her pupil.
    (b) A person receiving instruction, especially in a school.
Hallucinations. Other errors stem from the fact that the model can only rely on the knowledge about possible definienda that it is able to store in its parameters during the pre-training and training stages. Thus, if the contextual knowledge is not sufficient to extrapolate a definition, the model, which is required to always generate an output, will hallucinate an answer on the basis of contextual clues, incurring the risk of introducing nonfactualities. This particularly concerns named entities and domain-specific concepts, but the clearest examples can be seen with targets that do not correspond to any existing entity, fictional or non-fictional. For example, given the input sentence (3):
(3) Squeaky McDuck wasn't happy about it,
the model outputs the following:
(4) The title character in the "Squeaky Squeakety-Squeakiness" cartoon series.
In this case, the model picked up on the cartoonish sound of the name Squeaky McDuck, and hallucinated the name of a plausible cartoon series for it. Note that neither Squeaky McDuck nor the cartoon series actually exists.

Conclusion
With this work, we showed that generating a definition can be a viable, suitable alternative to the traditional use of sense inventories in computational lexical semantics, and one that better reflects the non-discrete nature of word meaning. We introduced Generationary, an approach to automatic definition generation which, thanks to a flexible encoding scheme, can (i) encode targets of arbitrary length (including unseen multi-word expressions), and (ii) exploit the vast amount of knowledge encoded in the BART pre-trained Encoder-Decoder, through fine-tuning.
From two points of view, Generationary represents a unified approach: first, it exploits multiple inventories simultaneously, hence going beyond the quirks of each one; second, it is able to tackle both generative (Definition Modeling) and discriminative tasks (Word Sense Disambiguation and Word-in-Context), obtaining competitive to state-of-the-art results, with particularly strong performances on zero-shot settings. Finally, human evaluation showed that Generationary is often able to provide a definition that is on a par with or better than one written by a lexicographer.
We make the software and reproduction materials, along with a new evaluation dataset of definitions for adjective-noun phrases (Hei++), available at http://generationary.org.

Table 7 (Hei++; c: phrase; ĝ: Generationary output; g: gold definition)
c1: Good news. ĝ1: (New Testament) The gospel as revealed by Jesus to the apostles. g1: Any news that arouses feelings of joy or eases anxiety.
c2: Uneven margin. ĝ2: A margin that is not uniform. g2: A margin that is not perfectly leveled.
c3: Early diagnosis. ĝ3: The diagnosis of a condition before symptoms appear. g3: A diagnosis that is made at an initial stage of a disease.
c4: Sincere friendship. ĝ4: A friendship that is not based on deceit or hypocrisy. g4: Friendship marked by genuine feelings of benevolence.
c5: Painful performance. ĝ5: A performance of a piece of music that is difficult to play. g5: A performance that is exceptionally bad.
c6: Courageous heart. ĝ6: A heart that is strong enough to endure adversity. g6: The feelings of a person that is not afraid of getting hurt.
c7: Inaccurate thermometer. ĝ7: A thermometer that is inaccurate in measuring temperature. g7: A thermometer that indicates the wrong temperature.
c8: New friend. ĝ8: A friend who has recently come into one's life. g8: A recently made friend.
c9: Familiar guest. ĝ9: A person who is a regular customer or client of a hotel, restaurant, etc. g9: A well known guest.
c10: Vivacious hostess. ĝ10: A woman who entertains guests at their home and makes them feel welcome. g10: A woman host who shows liveliness.

References

A Generation Examples
In the evaluation of NLG systems, human qualitative assessment is very important. Therefore, we choose to report a fair number of non-cherry-picked, zero-shot generation examples, produced by means of our Gen-UNI model. In Table 7 we show Generationary outputs and gold definitions for 10 randomly sampled phrases in the Hei++ dataset. In addition, in Table 8 we report gloss generation examples for random words and noun phrases taken from the webtext corpus included in the NLTK suite (Loper and Bird, 2002). We exclude swear words, slurs, numbers, and noun phrases consisting entirely of named entities. Moreover, every sampled item whose target was featured in our training set was filtered out.

B Reproducibility Details
To train our models we employ the fairseq library. Generationary has the same number of parameters as BART (Lewis et al., 2019), i.e. ca. 458M. For fine-tuning, we use the same hyperparameters used in Lewis et al. (2019) for summarization,15 except that:
• the learning rate is set to 5 × 10^-5 on the basis of preliminary experiments;
• due to memory concerns, we feed the input in batches of 1,024 tokens, updating every 16 iterations;
• we use inverse square root learning rate scheduling, which does not require setting a maximum number of iterations a priori;
• we double the number of warmup steps to 1,000.
Training is performed for at most 50 epochs. We employ a single NVIDIA GeForce RTX 2080 Ti GPU to perform all the reported experiments, with average runtimes per epoch of BART fine-tuning ranging from ca. 50 minutes (Gen-SEM) to >120 minutes (Gen-UNI).
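To make the schedule and batching settings above concrete, here is a minimal sketch of an inverse square root learning rate schedule with warmup. The peak learning rate, warmup length, batch size, and update frequency come from the text; the linear warmup from zero and the function name are assumptions made for illustration, and this is not a copy of the fairseq implementation.

```python
import math

PEAK_LR = 5e-5           # learning rate stated in the text
WARMUP_UPDATES = 1000    # warmup steps stated in the text
TOKENS_PER_BATCH = 1024  # batch size in tokens stated in the text
UPDATE_FREQ = 16         # batches accumulated per parameter update (from the text)


def inverse_sqrt_lr(update: int,
                    peak_lr: float = PEAK_LR,
                    warmup: int = WARMUP_UPDATES,
                    init_lr: float = 0.0) -> float:
    """Learning rate at a given 1-based update number.

    Assumption: the rate grows linearly from init_lr to peak_lr during
    warmup, then decays proportionally to 1 / sqrt(update).
    """
    if update < 1:
        raise ValueError("update numbers start at 1")
    if update < warmup:
        return init_lr + (peak_lr - init_lr) * update / warmup
    return peak_lr * math.sqrt(warmup / update)


# Upper bound on the tokens contributing to one update under gradient accumulation.
max_tokens_per_update = TOKENS_PER_BATCH * UPDATE_FREQ

for step in (1, 500, 1000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
print("max tokens per update:", max_tokens_per_update)
```

Because the decay depends only on the current update number, no total number of updates has to be fixed in advance, which is the property the list above points to.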
On the DM task, we evaluate on the best epoch, i.e. the one with the lowest cross-entropy loss on the dev set, with no hyperparameter tuning.
On the WSD task, instead, we perform minimal hyperparameter tuning, with search trials just on beam size (testing with values of 1, 10, 25, and 50), choosing as the best configuration the one with the highest F1 on our dev set, SemEval-2007; with simple similarity scoring, the best Gen-SEM has a beam size of 10, while, with MBRR similarity scoring, the best Gen-SEM has a beam size of 25. We use only MBRR with Gen-UNI, with a beam size of 10, resulting in the best performance on the development set.
On the WiC task we only tune the threshold on the dev set, by trying every value in the range between the lowest and the highest z score, with a minimum step of 0.025. We compute similarities in batches of 125 pairs.
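The threshold search can be sketched as a simple sweep over the observed z-score range. The following is an illustrative sketch only: the function name is hypothetical, and it assumes that a pair is predicted as "same meaning" when its z score is at or above the threshold and that dev-set accuracy is the selection criterion.

```python
from typing import Sequence, Tuple


def tune_threshold(z_scores: Sequence[float],
                   labels: Sequence[bool],
                   step: float = 0.025) -> Tuple[float, float]:
    """Return (threshold, accuracy) maximizing accuracy on the dev pairs.

    Candidates run from the lowest to the highest z score in fixed steps;
    on ties, the lowest threshold is kept.
    """
    if len(z_scores) != len(labels) or not z_scores:
        raise ValueError("need two non-empty sequences of equal length")
    lo, hi = min(z_scores), max(z_scores)
    n_steps = int((hi - lo) / step + 1e-9)
    best_threshold, best_accuracy = lo, -1.0
    for k in range(n_steps + 1):
        threshold = lo + k * step
        predictions = [z >= threshold for z in z_scores]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Toy dev set (made-up z scores and labels).
dev_z = [-1.0, -0.5, 0.0, 0.5, 1.0]
dev_labels = [False, False, True, True, True]
threshold, accuracy = tune_threshold(dev_z, dev_labels)
print(threshold, accuracy)
```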
For training and prediction of the models of Ishiwatari et al. (2019), we use the code provided by the authors.16 We use the same hyperparameters, except that we increase the vocabulary size to 39,000, which results in much improved performance on our benchmarks.

C Additional Results on DM
In Table 9 we report results, for the DM evaluation described in Section 5.1, on two additional datasets.
NOR (Noraset et al., 2017)   pairs, in which the context coincides with the word to be defined. Nonetheless, each lemma can be associated with multiple definitions. GAD (Gadetsky et al., 2018) collects context-target pairs and definitions from oxforddictionaries.com. The target lemma is not present in all contexts, so in these cases we prepend the lemma according to the following template: 'lemma: context'. 17
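The 'lemma: context' template used for GAD can be illustrated with a short helper. This is a sketch under assumptions: the function name is hypothetical, and the presence check (a case-insensitive match against the word tokens of the context, without handling inflected forms) is a simplification that may differ from the actual preprocessing.

```python
import re


def ensure_lemma_in_context(lemma: str, context: str) -> str:
    """Prepend 'lemma: ' to the context when the lemma does not occur in it."""
    tokens = re.findall(r"\w+", context.lower())
    if lemma.lower() in tokens:
        return context
    return f"{lemma}: {context}"


print(ensure_lemma_in_context("bank", "She sat on the bank."))
print(ensure_lemma_in_context("run", "He sprinted home."))
```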

D Perplexity
Perplexity captures the confidence of the model in outputting a certain sequence. In approaches with word-level tokenization, evaluated at word level, perplexity can be computed by exponentiating the negative log-likelihood that is used for training:

\[ \mathrm{PPL}_w(\bar{w}) = \exp\Big(-\sum_{w \in V} \bar{P}(w \mid c, t, \bar{h}) \ln P(w \mid c, t, \bar{h})\Big) = \exp\big(-\ln P(\bar{w} \mid c, t, \bar{h})\big) \]

where c is the context, t is the target, V is the vocabulary, \bar{w} is the gold word, and \bar{h} is the gold history of previous tokens. Here \bar{P} denotes the gold distribution, which places all its mass on \bar{w}, so that the sum reduces to a single term. Generationary employs subword-level tokenization, but we can still obtain the word-level probabilities by applying the chain rule of conditional probability:

\[ \mathrm{PPL}_w(\bar{w}) = \exp\Big(-\sum_{i=1}^{n} \ln P(\bar{w}^*_i \mid c, t, \bar{h}, \bar{w}^*_{1:i-1})\Big) \]

where \bar{w}^* is the n-tuple that is the subword split of \bar{w}, e.g. ⟨token, ##ization⟩ for tokenization.

Do we maintain full comparability? There are two issues here. The first stems from the fact that, thanks to the application of the chain rule, the vocabulary is open, i.e. the support of the subword model is the set of possible words, so that every item receives non-zero probability. In contrast, a word-level model without some kind of backoff strategy has a closed vocabulary. If the evaluation set includes a word outside V, the closed vocabulary model has a special <unk> token, on which it is trained to concentrate all the probability mass that the open vocabulary model, instead, would spread over all the possible words which are not in V. This gives the closed vocabulary model an unfair advantage over the open vocabulary one.

Moreover, there is an additional complication arising from the fact that, while subword tokenizers are usually deterministic, i.e. any word is always split in the same way, there might be multiple legal subword splits depending on the vocabulary, and to obtain the probability of the word we would need to marginalize over all splits. In other words, we would need to sum the probabilities of ⟨token, ##ization⟩, ⟨token, ##iz, ##ation⟩, ⟨to, ##ken, ##ization⟩ and so on. This is very burdensome, and in practice we only consider the deterministic split produced by the tokenizer. In doing this, we underestimate the probability of the word and, thus, overestimate the perplexity of the subword-level model.
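The two points made in this appendix, recovering word-level probabilities through the chain rule and overestimating perplexity when only one split is scored, can be illustrated with a toy computation. All probabilities below are invented for illustration, the function names are hypothetical, and averaging log-probabilities over words is the usual corpus-level extension rather than something stated in the text.

```python
import math
from typing import Sequence


def word_log_prob(subword_log_probs: Sequence[float]) -> float:
    """Chain rule: ln P(word) is the sum of its conditional subword log-probs."""
    return sum(subword_log_probs)


def perplexity(word_log_probs: Sequence[float]) -> float:
    """Exponential of the negative mean word-level log-probability."""
    if not word_log_probs:
        raise ValueError("need at least one word")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# Invented conditional probabilities for the tokenizer's split <token, ##ization>.
single_split = word_log_prob([math.log(0.5), math.log(0.4)])  # ln 0.2

# Invented probabilities for every legal split of "tokenization".
split_probs = {
    ("token", "##ization"): 0.20,
    ("token", "##iz", "##ation"): 0.04,
    ("to", "##ken", "##ization"): 0.01,
}
marginal = math.log(sum(split_probs.values()))  # ln 0.25

ppl_single = perplexity([single_split])  # about 5.0
ppl_marginal = perplexity([marginal])    # about 4.0
print(ppl_single, ppl_marginal)
```

Scoring only the tokenizer's split can never yield a higher word probability than the marginal over all splits, so the resulting perplexity is an upper bound on the one obtained by marginalizing.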

E NLG Measures Details
In order to ensure comparability, here we report the BLEU, ROUGE, METEOR, and BERTScore configurations that we used. A scorer is available as part of the provided software.

F Likert Scale
We employ a five-level Likert scale to rate glosses in both annotation experiments, on SamplEval and Hei++ (see Section 6.1). In Table 10 we show one of the annotation examples that were provided to the annotators to be used as guidelines.
Context: Was he going to be saddled with a creep for a bar-buddy?
Score 1: Wrong gloss. May refer to a homonym of the target.
Example gloss: A heating element in an electric fire.