ESC: Redesigning WSD with Extractive Sense Comprehension

Word Sense Disambiguation (WSD) is a historical NLP task aimed at linking words in contexts to discrete sense inventories and it is usually cast as a multi-label classification task. Recently, several neural approaches have employed sense definitions to better represent word meanings. Yet, these approaches do not observe the input sentence and the sense definition candidates all at once, thus potentially reducing the model performance and generalization power. We cope with this issue by reframing WSD as a span extraction problem — which we called Extractive Sense Comprehension (ESC) — and propose ESCHER, a transformer-based neural architecture for this new formulation. By means of an extensive array of experiments, we show that ESC unleashes the full potential of our model, leading it to outdo all of its competitors and to set a new state of the art on the English WSD task. In the few-shot scenario, ESCHER proves to exploit training data efficiently, attaining the same performance as its closest competitor while relying on almost three times fewer annotations. Furthermore, ESCHER can nimbly combine data annotated with senses from different lexical resources, achieving performances that were previously out of everyone’s reach. The model along with data is available at https://github.com/SapienzaNLP/esc.


Introduction
Being able to link a piece of raw text to a knowledge base is fundamental in NLP (Navigli, 2009; McCoy et al., 2019; Bender and Koller, 2020), as it can help neural models ground their representations on structured resources and enable Natural Language Understanding (Navigli, 2018). A task that is key to achieving this goal is Word Sense Disambiguation (WSD), where, given a sentence with a target word, a model has to predict its most suitable meaning from a predefined set of labels, i.e., its senses. WSD has not only considerably improved its performance with the advent of deep learning (by around 15 F1 points in 15 years), but it has also shown its benefits in downstream applications such as Neural Machine Translation (Liu et al., 2018; Pu et al., 2018) and Information Extraction (Moro and Navigli, 2013; Delli Bovi et al., 2015), while also being leveraged to enrich the contextual representations of neural models (Peters et al., 2019; Zhang et al., 2019). However, WSD has mostly been framed as a multi-label classification task (Raganato et al., 2017b; Hadiwinoto et al., 2019) over a very large vocabulary of discrete senses. This formulation may limit a model's capability to properly represent word meanings, as each sense is only defined by means of its occurrences in a training set, while its inherent meaning remains linguistically unexpressed. Furthermore, rare or unseen senses are either poorly modeled or cannot be modeled at all. These problems have recently been mitigated by integrating sense definitions (glosses) within neural architectures (Kumar et al., 2019; Huang et al., 2019; Blevins and Zettlemoyer, 2020). Yet, despite their large improvements, none of these models attends to all the possible definitions of a target word at once, and therefore each lacks the ability to represent both the input context and the candidate definitions together.
Inspired by the Extractive Reading Comprehension framework (Rajpurkar et al., 2016) in the field of Question Answering (QA), we cope with these issues and reframe the WSD problem as a novel text extraction task, which we have called Extractive Sense Comprehension (ESC). In this setting, a model receives as input a sentence with a target word and all its possible sense definitions. Then, we request the model to extract the text span associated with the gloss expressing the target word's most suitable meaning. Within this framework, we also propose a transformer-based architecture (ESCHER) that implements the ESC task by attending to the input context and target word definitions jointly. Through an extensive experimental setting, we show that ESCHER surpasses former state-of-the-art approaches by a large margin while, at the same time, requiring almost three times fewer training instances to attain performance comparable to its strongest competitor in a few-shot setting. Furthermore, thanks to our new formulation, the proposed model can effectively carry out predictions across different sense repositories and combine distinct inventories with unmatched nimbleness, attaining even higher results than when limited to a single resource only.
To summarize, this paper brings the following novel contributions:
1. The Extractive Sense Comprehension task (ESC), i.e., a reframing of the Word Sense Disambiguation problem.
2. ESCHER: a transformer-based architecture for ESC, outperforming all the other modern architectures on the WSD task.
3. An extensive study of the proposed model in different training regimes, i.e., in zero-shot, few-shot and fully-supervised settings.
4. A study on combining data annotated with distinct lexicographic resources.
Besides its performance advantages, ESC also comes with other benefits: it does not require a large output vocabulary, and it eases the joint use of corpora annotated with different inventories.

Related Work
Word Sense Disambiguation (WSD) is one of the long-standing problems in lexical semantics, introduced for the first time in the context of Machine Translation by Weaver (1949). WSD aims at linking a word in context to its most suitable meaning in a predefined sense inventory, which is usually a dictionary where each entry defines a concept via a definition (gloss) and a set of examples. Most approaches to WSD rely on WordNet (Miller et al., 1990) as the underlying inventory of senses for the English language, and SemCor (Miller et al., 1993) as training corpus. WordNet organizes lexical-semantic information by means of a graph where sets of synonyms are grouped into synsets (concepts) and edges are typed semantic relations.
While early neural models used WordNet as a mere repository of senses (Raganato et al., 2017b; Hadiwinoto et al., 2019), more recent approaches have started to exploit sense definitions (Kumar et al., 2019; Blevins and Zettlemoyer, 2020) and relational information (Bevilacqua and Navigli, 2020; Conia and Navigli, 2021). Sense definitions, in particular, have been shown to be effective for modeling word senses (Luo et al., 2018; Kumar et al., 2019), as they provide information orthogonal to that available in the training data. This has been further investigated under different perspectives by Huang et al. (2019, GlossBERT), Blevins and Zettlemoyer (2020, BEM) and Bevilacqua et al. (2020, Generationary). GlossBERT casts the WSD problem as a binary classification task where, given a word in context and one of its dictionary definitions, it determines whether this definition matches the word meaning expressed in the context. BEM employs a bi-encoder to represent the target word and its sense definitions within the same space. Generationary, instead, drops the requirement of a predefined sense inventory and directly generates a definition given a word in its context. The strength of these approaches lies in the fact that glosses allow senses that are under-represented within the training corpus to be modeled, hence mitigating the long-standing paucity of sense-annotated data (Pasini, 2020). Nevertheless, none of the above approaches can exploit all definitions at once: indeed, glosses are either provided one at a time (GlossBERT), modeled with one vector only and independently from each other (BEM), or used individually as target text to be generated (Generationary).
Our new formulation (ESC) for the WSD problem stands out from previous approaches inasmuch as it is the first to access the input context and all the target word's definitions together, while, at the same time, dropping the requirement of a predefined sense inventory. Indeed, differently from its competitors, our proposed approach (ESCHER) can scale effectively across different lexical resources even when they were not available at the time of training.

Methodology
In what follows, we first formalize the Extractive Sense Comprehension task (Section 3.1), then introduce ESCHER, a transformer-based architecture for ESC (Section 3.2), and finally put forward a novel approach for mitigating the bias towards the most frequent meanings (Kilgarriff, 2004) within training data (Section 3.3).

Extractive Sense Comprehension
To unleash the full potential of attention-based models on the Word Sense Disambiguation task, we reframe WSD as a span-extraction problem. Formally, given a sense inventory $S$, we first define the definitional context $D_{\hat{w}}$ for the target word $\hat{w}$ as the concatenation of all the possible definitions $d_1, \dots, d_k$ in $S$ for $\hat{w}$, i.e.,

$$D_{\hat{w}} = w^{d_1}_1, \dots, w^{d_1}_{|d_1|}, \dots, w^{d_k}_1, \dots, w^{d_k}_{|d_k|}$$

where $w^{d_z}_i$ is the $i$-th word of the gloss $d_z$ ($1 \le z \le k$) and $|d_z|$ is the length of $d_z$. Then, we reformulate the task as follows: given a target word $\hat{w}$, a context $c$ in which $\hat{w}$ occurs and the definitional context $D_{\hat{w}}$, a model has to find the interval $[i^*, j^*]$ in $D_{\hat{w}}$ which identifies the correct definition $d^* \in D_{\hat{w}}$ of $\hat{w}$ in $c$. This formulation, on the one hand, helps to better characterize word meanings, thanks to the inclusion of all the target word definitions as additional input. On the other hand, it also relieves the burden of a large output vocabulary (typically in the order of tens of thousands of meanings), which makes the classification cumbersome.
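The construction of the definitional context can be illustrated with a minimal sketch. The function name `build_definitional_context`, the whitespace tokenization and the toy glosses are assumptions made for illustration only; they are not taken from the released code.

```python
from typing import List, Tuple


def build_definitional_context(
    glosses: List[List[str]],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Concatenate tokenized glosses d_1..d_k into a single token list.

    Returns the concatenated tokens and, for each gloss, its inclusive
    (start, end) interval inside the concatenation.
    """
    tokens: List[str] = []
    spans: List[Tuple[int, int]] = []
    for gloss in glosses:
        start = len(tokens)
        tokens.extend(gloss)
        spans.append((start, len(tokens) - 1))
    return tokens, spans


# Toy example (hypothetical glosses for the target word "bank").
glosses = [
    "A financial institution .".split(),
    "Sloping land beside a body of water .".split(),
]
tokens, spans = build_definitional_context(glosses)

# If the second gloss is the correct one, [i*, j*] is simply its interval.
gold_span = spans[1]
```

Under this view the gold label of an instance is an interval over the input rather than an index into a sense vocabulary, which is why no inventory-sized output layer is needed.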

ESCHER
We now introduce a transformer-based model for the ESC task (Figure 1). It takes as input a context $c$ with a target word $\hat{w}$ concatenated with $D_{\hat{w}}$. The target word $\hat{w}$ is surrounded by the tags <t> and </t> and each definition in $D_{\hat{w}}$ has the first letter capitalized and a period at the end. We separate the context $c$ and the definitional context $D_{\hat{w}}$ with the special symbol </s> and surround the whole text with the tags <s> and </s>. Formally, given the input

m = <s> $w_1$ ... <t> $\hat{w}$ </t> ... $w_n$ </s> $D_{\hat{w}}$ </s>

of length $l$, the model computes the span $(i, j)$ containing the predicted gloss for the target word $\hat{w}$ as follows:

$$H = \mathrm{transformer}(m)$$
$$[Z^s; Z^e] = W^\top H + b$$

where transformer can be any transformer-based architecture, $H \in \mathbb{R}^{f \times l}$ is the matrix of hidden states, and $W \in \mathbb{R}^{f \times 2}$ and $b \in \mathbb{R}^2$ are trainable parameters. $Z^s$ and $Z^e$ are two variables containing the logits for each word $w_u$ indicating, respectively, whether it is the start or the end of the correct definition for target word $\hat{w}$.
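As a rough illustration of the two steps just described, the sketch below formats an input string and applies a linear span head to a stand-in hidden-state matrix. Everything here is hypothetical: `format_input` and `span_logits` are illustrative names, the hidden states are hand-written numbers rather than transformer outputs, and the real system operates on subword tokens.

```python
from typing import List, Sequence, Tuple


def format_input(context: List[str], target_index: int, glosses: List[str]) -> str:
    """Mark the target word with <t>...</t> and append the normalized glosses."""
    marked = list(context)
    marked[target_index] = f"<t> {marked[target_index]} </t>"
    # Capitalize the first letter and make sure each gloss ends with one period.
    definitions = " ".join(g[0].upper() + g[1:].rstrip(".") + "." for g in glosses)
    return f"<s> {' '.join(marked)} </s> {definitions} </s>"


def span_logits(
    H: Sequence[Sequence[float]],  # hidden states, shape f x l
    W: Sequence[Sequence[float]],  # projection, shape f x 2
    b: Sequence[float],            # bias, shape 2
) -> Tuple[List[float], List[float]]:
    """Compute W^T H + b and return its two rows as (Z_s, Z_e)."""
    f, l = len(H), len(H[0])
    Z = [
        [sum(W[r][c] * H[r][u] for r in range(f)) + b[c] for u in range(l)]
        for c in range(2)
    ]
    return Z[0], Z[1]


text = format_input(
    ["He", "sat", "on", "the", "bank"],
    4,
    ["a financial institution", "sloping land beside water."],
)

# Stand-in hidden states with f = 2 features and l = 3 positions.
z_s, z_e = span_logits([[1, 2, 3], [0, 1, 0]], [[1, 0], [0, 2]], [0.5, -1.0])
```

The only task-specific parameters in this head are the f x 2 projection and its bias, independent of how many senses the inventory contains.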
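Finally, we train the model by averaging two distinct cross-entropy losses that we compute for the start and end indices:

$$\mathcal{L}^s = -Z^s_{i^*} + \log \sum_{u=1}^{l} \exp(Z^s_u), \qquad \mathcal{L}^e = -Z^e_{j^*} + \log \sum_{u=1}^{l} \exp(Z^e_u), \qquad \mathcal{L} = \frac{\mathcal{L}^s + \mathcal{L}^e}{2}$$

where $Z^s_{i^*}$ and $Z^e_{j^*}$ are the scores associated with the correct start and end indices.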
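The averaged start/end objective can be written in a few lines. This is a plain-Python sketch of standard cross-entropy over logits, with illustrative function names; an actual implementation would rely on the loss provided by a deep learning framework.

```python
import math
from typing import Sequence


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(z)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy(logits: Sequence[float], gold: int) -> float:
    """Negative log-probability of the gold index under softmax(logits)."""
    return log_sum_exp(logits) - logits[gold]


def esc_loss(z_s: Sequence[float], z_e: Sequence[float], i_star: int, j_star: int) -> float:
    """Average of the start and end cross-entropy losses."""
    return 0.5 * (cross_entropy(z_s, i_star) + cross_entropy(z_e, j_star))


# With uniform logits over 4 positions each loss equals log(4).
uniform = esc_loss([0.0] * 4, [0.0] * 4, 1, 2)
# Raising the logits of the gold positions lowers the loss.
confident = esc_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
```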
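At prediction time, rather than allowing the system to output a span that does not correspond precisely to any definition in $D_{\hat{w}}$, the model outputs a pair $(i, j)$ such that a definition $d_z \in D_{\hat{w}}$ starts in $i$ and ends in $j$ and its probability is the maximum across all the other gloss spans in $D_{\hat{w}}$. Formally, denoting by $(i_z, j_z)$ the start and end indices of the definition $d_z$, the model selects its output as follows:

$$(i, j) = \operatorname*{argmax}_{(i_z, j_z),\ 1 \le z \le k} P(w_{i_z} = \mathrm{start} \mid Z^s) \cdot P(w_{j_z} = \mathrm{end} \mid Z^e)$$

where $P(w_u = \mathrm{start} \mid Z^s)$ and $P(w_u = \mathrm{end} \mid Z^e)$ indicate the probability that $w_u$ is the start or the end of any of the $k$ definitions, respectively.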
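A minimal sketch of this constrained selection follows. It assumes that a span is scored by the product of its start and end probabilities, which is a common choice in extractive QA but an assumption here; `predict_gloss` and the toy logits are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_gloss(
    z_s: Sequence[float],
    z_e: Sequence[float],
    gloss_spans: Sequence[Tuple[int, int]],
) -> Tuple[int, Tuple[int, int]]:
    """Pick the gloss whose (start, end) pair has the highest score.

    Only spans that coincide with a whole definition are considered, so the
    output always maps back to exactly one candidate sense.
    """
    p_s, p_e = softmax(z_s), softmax(z_e)
    # Assumption: span score = P(start) * P(end).
    scores = [p_s[i] * p_e[j] for i, j in gloss_spans]
    best = max(range(len(gloss_spans)), key=scores.__getitem__)
    return best, gloss_spans[best]


# Two glosses covering positions 0-2 and 3-5.
gloss_spans = [(0, 2), (3, 5)]
z_s = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
z_e = [0.0, 0.0, 2.5, 0.0, 0.0, 1.0]
best_index, best_span = predict_gloss(z_s, z_e, gloss_spans)
```

In this toy case an unconstrained argmax would pair start 3 with end 2, which is not a valid definition; restricting the search to gloss boundaries avoids that outcome.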

Rebalancing the Most Frequent Sense Bias
While our approach already allows all the possible definitions of a word to be contextualized by jointly encoding them together with the context sentence, it may still suffer from the strong imbalance in sense distribution (Kilgarriff, 2004) and be biased towards the most frequent definition regardless of its contextualization. Our framework allows this issue to be dealt with in an elegant way, which we have called Gloss Noise (GN). GN counterbalances this bias by lowering the prior probability of the most frequent glosses. That is, inspired by the negative sampling technique (Mikolov et al., 2013), GN adds, to each training example, $k$ frequent definitions that are not related to the target word. We sample the $k$ glosses from the following multinomial distribution:

$$P(d_i) = \frac{f_{d_i}}{\sum_{d_j \in D} f_{d_j}}$$

where $D$ is the set of all possible definitions in the training set and $f_{d_i}$ is the frequency of the $i$-th definition in a sense-tagged corpus. The value of $k$, instead, is sampled from a Poisson distribution with $\lambda = 1$, so that the expected number of added definitions is equal to 1. This allows the discrepancy between the training and prediction phases to be kept as small as possible, while also introducing negative signals for frequent senses. Indeed, Gloss Noise ensures that the expected number of times a definition is added as a negative example is equal to the number of times it is seen as a correct one, thereby counterbalancing the high rate at which frequent definitions are seen only as positive examples without overly affecting rare senses.
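The sampling procedure behind Gloss Noise can be sketched as follows. The helper names are illustrative, and two details are assumptions rather than facts from the paper: glosses of the target word are excluded from the pool before sampling, and noise glosses are drawn with replacement.

```python
import math
import random
from typing import Dict, List, Set


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def gloss_noise(
    candidate_glosses: Set[str],
    gloss_freq: Dict[str, int],
    rng: random.Random,
    lam: float = 1.0,
) -> List[str]:
    """Sample k ~ Poisson(lam) unrelated glosses, proportionally to frequency."""
    # Assumption: the target word's own glosses are removed from the pool.
    pool = [g for g in gloss_freq if g not in candidate_glosses]
    k = sample_poisson(lam, rng)
    if k == 0 or not pool:
        return []
    weights = [gloss_freq[g] for g in pool]
    # Assumption: sampling with replacement.
    return rng.choices(pool, weights=weights, k=k)


rng = random.Random(0)
gloss_freq = {"g1": 10, "g2": 5, "g3": 1}  # toy corpus frequencies
noise = gloss_noise({"g1"}, gloss_freq, rng)
```

Because k has mean 1 and each gloss is drawn proportionally to its corpus frequency, a gloss that is correct n times in the corpus is expected to be added as a distractor roughly n times, which matches the balancing argument above.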

Standard WSD Evaluation
In this Section we introduce the experimental setting we use to evaluate the proposed framework and neural architecture.

Setup
Data We use the evaluation suite made available by Raganato et al. (2017a) for the English Word Sense Disambiguation task. It includes SemCor (Miller et al., 1993) for training, i.e., a corpus containing 33,362 sentences and 226,036 instances annotated manually with senses from WordNet 3.0. As is common practice, we use SemEval-2007 (SE07; Pradhan et al., 2007) as the development set and test on the remaining datasets of the suite, i.e., Senseval-2, Senseval-3, SemEval-2013 and SemEval-2015 (SE15; Moro and Navigli, 2015), and on their concatenation (ALL). In order to measure the extent to which systems generalize to rare and unseen words and definitions (zero-shot settings), we also consider five other test sets that we created from the ALL dataset: i) MFS, which contains test instances tagged with the most frequent sense for the target word in the training set; ii) LFS, which contains test instances that are tagged with a sense that is not the most frequent for the target word and that was seen at least once during training; iii) 0-lex, which contains test instances whose lexeme 5 was never seen as a target word during training; iv) 0-lex-def, 6 which contains test instances with a definition that was never seen associated with the target lexeme during training; v) 0-def, which contains test instances whose definition has never been seen during training. We note that 0-def differs from 0-lex-def as a definition is tied in WordNet to a synset, i.e., a set of synonymous senses, rather than to a sense; therefore the same definition may be seen associated with different lexemes.
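One possible reading of the five subset definitions is sketched below. The data layout (instances as (lexeme, definition) pairs), the function names and the tie-breaking rule for the most frequent sense are assumptions; under this reading an instance with an unseen lexeme also falls into 0-lex-def.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Instance = Tuple[Hashable, str]  # (lexeme, definition)


def build_stats(train: Iterable[Instance]) -> Tuple[Dict[Hashable, Counter], Set[str]]:
    """Count definitions per lexeme and collect every definition seen in training."""
    by_lexeme: Dict[Hashable, Counter] = defaultdict(Counter)
    seen_defs: Set[str] = set()
    for lexeme, definition in train:
        by_lexeme[lexeme][definition] += 1
        seen_defs.add(definition)
    return by_lexeme, seen_defs


def categorize(lexeme, definition, by_lexeme, seen_defs) -> Set[str]:
    """Return the evaluation subsets a test instance belongs to."""
    labels: Set[str] = set()
    counts = by_lexeme.get(lexeme)
    if counts is None:
        labels.add("0-lex")
    if counts is None or definition not in counts:
        labels.add("0-lex-def")
    elif counts.most_common(1)[0][0] == definition:
        # Assumption: ties are resolved by Counter.most_common ordering.
        labels.add("MFS")
    else:
        labels.add("LFS")
    if definition not in seen_defs:
        labels.add("0-def")
    return labels


bank, shore, river = ("bank", "n"), ("shore", "n"), ("river", "n")
train = [(bank, "d1"), (bank, "d1"), (bank, "d2"), (shore, "d2")]
by_lexeme, seen_defs = build_stats(train)
```

The `(shore, "d1")` case in the toy data shows the difference noted above: the definition was seen in training, but only with another lexeme, so the instance is 0-lex-def without being 0-def.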
Comparison Systems As baselines, we consider the Most Frequent Sense computed on the training set (MFS SemCor) and two neural models featuring BERT large and BART large as text encoders, with a linear classifier over the whole sense vocabulary on top. As for the BERT large baseline, we follow Blevins and Zettlemoyer (2020) and keep BERT large weights fixed, while for BART large we finetune the whole model. As competitors, we consider, among others, BEM 8 (Blevins and Zettlemoyer, 2020) and EWISER (Bevilacqua and Navigli, 2020), which take advantage of external knowledge such as glosses and semantic relations. We note that EWISER uses a different development set, hence its results are not fully comparable with the others. Finally, we also consider two nearest-neighbour approaches based on synset embedding and vector similarity, i.e., LMMS (Loureiro and Jorge, 2019) and ARES (Scarlini et al., 2020).

ESCHER Setting
We use BART large (Lewis et al., 2020; Wolf et al., 2020) as transformer architecture 9 owing to the fact that it is among the strongest models on reading comprehension tasks such as SQuAD (Rajpurkar et al., 2016) and it allows us to feed sequences up to 1024 subtokens long. We use the output of its last decoder layer to represent the input tokens and compute the start and end token distributions. We note that ESCHER is directly comparable to the BART large baseline in terms of model complexity as both use the same transformer model with one linear layer on top.
5 A (lemma, part of speech) pair.
6 We identify a sense as a pair (lexeme, definition).
7 Similarly to Blevins and Zettlemoyer (2020), we report the best results of the SVC single model trained on SemCor only.
8 BEM is the state-of-the-art model in this setting at the time of writing.
9 Please see Appendix A for experiments with different transformer pretrained models.
We finetune the whole ESCHER architecture with the Rectified Adam (Liu et al., 2020) optimizer with learning rate set to 1e-5 for up to 300,000 steps, 20 steps of gradient accumulation and batches made of 700 tokens. In what follows, we report the results for our model with and without Gloss Noise (Section 3.3), denoting them as ESCHER and ESCHER No-GN, respectively.

Results
Framework Benchmark In Table 1 we report the F1 scores of ESCHER, ESCHER No-GN and all the other systems. By comparing BART large and ESCHER, we can measure the effectiveness of our proposed framework, ESC, on the performance of a transformer-based architecture. Indeed, the two architectures are nearly identical except for the last layer, where, for each token, BART large makes a prediction across the whole sense vocabulary, while ESCHER performs a binary classification. Thus, the large difference between the two models (8.5 F1 points) suggests that the Extractive Sense Comprehension formulation of WSD allows the potential of transformer-based architectures to be fully exploited, and, therefore, attain better performance.
When Gloss Noise is enabled (ESCHER row), our model gains 1 F1 point in comparison to when it is disabled (ESCHER No-GN ). This highlights that directly mitigating the bias towards the Most Frequent Senses during training is fundamental to making our approach as effective as possible.
Finally, thanks to our new formulation of the WSD problem, a simple model such as ESCHER outperforms all the other approaches by a large margin on the ALL dataset, beating the previous state of the art (BEM) by 1.7 points. This corroborates our hunch that the Extractive Sense Comprehension task is an extremely effective formulation of WSD for transformer-based architectures.
In Table 2 we report the results of the three best-performing models, i.e., ESCHER, ESCHER No-GN and BEM, on five datasets, measuring how well models perform when dealing with rare words and meanings in different situations (cf. Section 4.1). ESCHER No-GN manages to outperform BEM on most datasets, hence already demonstrating that our new framing allows transformers to better generalize on rare words and senses. When enabling Gloss Noise, ESCHER achieves even higher performance on all datasets, falling behind BEM only on the MFS dataset. Interestingly enough, the comparison with BEM on the 0-lex-def and 0-def datasets shows that ESCHER can easily predict definitions that were either seen associated only with lexemes different from the input ones or not seen at all, while, in direct contrast, BEM performs poorly in both scenarios. A similar pattern is observed for the Least Frequent Senses (LFS) dataset, where ESCHER outperforms BEM by 3.6 F1 points at the cost of only 1 point less in predicting the most frequent meanings.

Merging Multiple Knowledge Bases
Being able to combine datasets tagged with different inventories is a desirable ability for a model, as it avoids having to train one system for each inventory. However, merging distinct lexicographic resources is not a straightforward task and requires its own complex pipeline. An easier approach could be to concatenate datasets tagged with different vocabularies, which, nonetheless, would expose models to possibly different definitions for nearly identical meanings and to different levels of sense granularity. In this Section we therefore investigate the ability of ESCHER to manage data annotated with distinct sense inventories when simply joining them. To this end, we train ESCHER on the concatenation of SemCor and the Oxford Dictionary dataset (Chang et al., 2018) and compare its performance with the state-of-the-art system at the moment of writing, i.e., BEM, when trained on the same corpus.

The Oxford Dictionary Dataset
The Oxford Dictionary dataset of Chang et al. (2018) is split into train (Oxford train), dev (Oxford dev) and test 12 (Oxford test). In Table 3 we report its statistics together with those of the training (SemCor), development (SE07) and test (ALL) sets of the standard evaluation suite. Specifically, we show the average polysemy of each dataset (Polysemy), the expressed polysemy (Exp. Polysemy), the number of distinct senses (#Senses) and the number of instances (#Instances). To compute the expressed polysemy, for each lexeme we divide the number of its senses that appear in the dataset by the number of possible senses it can assume in the reference vocabulary, and we then average across all lexemes. As one can see, Oxford train contains more than two times the instances and senses of SemCor, while having roughly half of SemCor's polysemy but a higher expressed polysemy. As for Oxford test, it contains a larger number of instances than ALL, and also a higher polysemy and expressed polysemy.
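The expressed polysemy statistic can be computed directly from its definition. The sketch below uses a hypothetical data layout (instances as (lexeme, sense) pairs and an inventory mapping each lexeme to its possible senses) and a toy inventory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def expressed_polysemy(
    instances: Iterable[Tuple[str, str]],
    inventory: Dict[str, Set[str]],
) -> float:
    """Average, over lexemes in the dataset, of observed senses / possible senses."""
    observed: Dict[str, Set[str]] = defaultdict(set)
    for lexeme, sense in instances:
        observed[lexeme].add(sense)
    ratios = [len(senses) / len(inventory[lexeme]) for lexeme, senses in observed.items()]
    return sum(ratios) / len(ratios)


inventory = {"bank": {"b1", "b2", "b3", "b4"}, "run": {"r1", "r2"}}
instances = [("bank", "b1"), ("bank", "b1"), ("bank", "b2"), ("run", "r1"), ("run", "r2")]
value = expressed_polysemy(instances, inventory)  # (2/4 + 2/2) / 2
```

A value close to 1 means that most senses licensed by the inventory actually occur in the dataset, which is the sense in which Oxford train is said to have a higher expressed polysemy than SemCor.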

Setup
We analyze three different scenarios: i) Standard, where the system is trained on the same inventory with which it is tested, e.g., trained on Oxford train and tested on Oxford test; ii) Zero-shot, in which the system is trained on one sense inventory and tested on the other, e.g., trained on SemCor and tested on Oxford test; and iii) Joint, in which the system is jointly trained with the two sense inventories. In order to combine the two different inventories, we train the model by alternating the batches made up of either SemCor or Oxford train instances. Since the number of instances in SemCor is lower than that in Oxford train, we oversample SemCor by repeating its instances. Finally, we select the model with the best macro F1 averaged on the two validation datasets (SE07 and Oxford dev). We add the subscript S, OT and S + OT to models trained on SemCor, Oxford train and their concatenation, respectively.
12 We refer to the one named test_easy in the original paper.
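The joint training schedule and the model-selection criterion might look like the following sketch. It assumes strict one-to-one alternation and oversampling by cycling over the smaller corpus; the batch contents, function names and F1 values are placeholders, not reported numbers.

```python
from itertools import cycle
from typing import Iterable, Iterator, List, TypeVar

Batch = TypeVar("Batch")


def alternate_batches(small: List[Batch], large: Iterable[Batch]) -> Iterator[Batch]:
    """Interleave batches from two corpora, repeating the smaller one as needed.

    Assumes `small` is non-empty.
    """
    repeated_small = cycle(small)
    for large_batch in large:
        yield next(repeated_small)
        yield large_batch


def selection_score(f1_first_dev: float, f1_second_dev: float) -> float:
    """Unweighted mean of the F1 scores on the two development sets."""
    return (f1_first_dev + f1_second_dev) / 2


semcor_batches = ["s1", "s2"]          # placeholder batch identifiers
oxford_batches = ["o1", "o2", "o3"]
schedule = list(alternate_batches(semcor_batches, oxford_batches))
```

Averaging the two development scores with equal weight keeps the larger validation set from dominating checkpoint selection.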

Results
As one can see from Table 4, ESCHER outperforms BEM in all settings. That is, when trained with one inventory and tested on a dataset tagged with the other inventory (BEM S and ESCHER S on the Oxford test and BEM OT and ESCHER OT on ALL), ESCHER attains 6 and 3 points higher performance, respectively, than its competitor. This result is not only important per se, but it also suggests that ESC does not bind the model to a single lexical knowledge base. Indeed, by extensively leveraging sense definitions, it allows a transformer-based model to scale on multiple inventories as long as they provide at least one definition for each meaning. BEM, instead, by encoding each gloss independently, falls short in representing definitions that were previously unseen, as also shown in Section 4.2.
When trained on SemCor and Oxford train together, not only can ESCHER handle the two inventories that coexist in the training set effectively, but it also leverages them at its own convenience, achieving 81.5 F1 points on ALL, in contrast to BEM which performs slightly worse than when trained in the Standard scenario.

Few-Shot Evaluation
We now move to analyzing the performances of ESCHER in a few-shot scenario, i.e., when the number of samples available for each sense is limited.
Setting We compare ESCHER against BEM, and report the F1 scores on the ALL dataset when varying the number k of training instances per sense in {1, 3, 5, 10, unlimited}. We show in Table 5 the number of instances drawn from SemCor that are seen at training time for each k.
We also report the F1 scores of ESCHER on the MFS, LFS and 0-lex-def datasets in the same scenario in order to investigate the extent to which the difference in the number of occurrences for each sense impacts the ability of the model to generalize on rare senses.
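A few-shot training set of this kind can be obtained by capping the number of instances per sense. The sketch below assumes random selection with a fixed seed; how the retained instances are chosen is not stated above, so this detail and the function name are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Instance = Tuple[str, str]  # (instance id, sense)


def few_shot_subset(
    instances: Sequence[Instance], k: Optional[int], seed: int = 0
) -> List[Instance]:
    """Keep at most k instances per sense; k=None keeps everything."""
    if k is None:
        return list(instances)
    rng = random.Random(seed)
    by_sense: Dict[str, List[Instance]] = defaultdict(list)
    for instance in instances:
        by_sense[instance[1]].append(instance)
    subset: List[Instance] = []
    for group in by_sense.values():
        # Assumption: uniform random choice among the instances of a sense.
        subset.extend(group if len(group) <= k else rng.sample(group, k))
    return subset


data = [(f"a{n}", "sense_a") for n in range(5)] + [(f"b{n}", "sense_b") for n in range(2)]
sizes = {k: len(few_shot_subset(data, k)) for k in (1, 3, 5, 10, None)}
```

Because rare senses have fewer than k instances to begin with, raising k mostly adds examples of frequent senses, which is the skew discussed in the results that follow.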

Results
As one can see from Figure 2a, 13 ESCHER makes much more efficient use of training data than BEM, needing roughly one third of the instances to attain the same results. In fact, BEM needs more than 5 instances per sense (83,068 instances) to reach the same performance (73.9 F1 points) as that of ESCHER trained with k = 1 (33,206 instances). Furthermore, with roughly half of the instances (k = 10) ESCHER attains results that are in the same ballpark as the current state of the art. Interestingly enough, by looking at Figure 2b, we see that ESCHER's accuracy on the MFS instances rises when adding more examples. This is due to the fact that frequent senses get increasingly represented within the training set, therefore better matching the sense distribution in the test set. Similarly, the performance on the Least Frequent Senses also rises from k = 1 to k = 10, but slightly drops when considering the whole dataset. By manually inspecting the data we notice that this happens because most of the instances added to the dataset with k = 10 are tagged with the most frequent sense, therefore drastically skewing the sense distribution. Finally, the performance on 0-lex-def remains stable for all k, hence showing that, despite increasingly skewing the distribution towards the most frequent definitions, our approach can still provide meaningful representations for unseen senses.
13 BEM chart from the original paper.

Error Analysis
In order to get a clear picture of the model's pitfalls and gain insights into possible directions for future work, we perform an analysis of ESCHER misclassifications on the ALL dataset. We find that the mistaken predictions belong to three main categories: most frequent sense bias, insufficient context and WordNet sense granularity. Since we already discussed the first of these in the previous sections, we focus here on the latter two.
Insufficient Context Annotators often compiled the WSD evaluation datasets by considering each instance in the context of the documents they appear in. In contrast, WSD models typically take into account only the sentence surrounding the target word, discarding a large portion of the available context. This behavior causes a discrepancy where sentences do not provide enough information to disambiguate the target words therein. Indeed, ESCHER mistakes most often appear in sentences with an average length of 27 tokens, i.e., roughly 5 tokens less than the average length in ALL (32).
This suggests that moving the disambiguation context from sentences to documents may improve the performances of models as long as they are capable of handling longer sequences.
WordNet Sense Granularity The granularity of WordNet senses has been considered one of the main reasons behind the complexity of the WSD task. To measure the extent to which this affects ESCHER's performance, we utilize the 45 domain-based labels introduced by Lacerra et al. (2020, CSI), which define macro categories for each WordNet sense. For instance, in the CSI inventory, the sense argument%1:10:03:: belongs to the following domains: Culture Anthropology and Society, Language and Linguistics and Communication and Telecommunication.
To better understand the relation between ESCHER predictions and the gold annotations, for each misclassified instance in ALL, we compute the Jaccard similarity between the CSI labels assigned to the gold annotation of that instance and those assigned to the sense predicted by ESCHER, and then average over all such instances. As an example, ESCHER misclassified an instance annotated with the sense argument%1:10:03::, assigning to it the sense argument%1:10:00::. Examining the domains to which the predicted sense belongs, we can see a considerable overlap (and consequently a high Jaccard similarity) with the domains of the gold sense (i.e., argument%1:10:03::): Culture Anthropology and Society, Politics Government and Nobility, Language and Linguistics and Communication and Telecommunication.
As a term of comparison, we repeat the same procedure when considering a random baseline as WSD model, i.e., one that predicts for each instance a random sense among those of the target word. We find that ESCHER predictions have an average Jaccard similarity with the gold annotations of 0.49, whereas the random baseline achieves 0.27. This suggests that, even when providing a formally mistaken output, ESCHER still predicts a sense that is correlated, according to CSI labels, to the gold sense. Our analysis calls for further work to improve evaluation in WSD as the F1 score cannot discriminate between predictions that are clearly wrong and predictions that are just slightly different from the gold sense.
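The similarity measure used in this analysis is easy to reproduce. The sketch below computes the Jaccard similarity for the argument example discussed above, using the domain labels as listed in the text; the function names are illustrative, and treating two empty label sets as identical is an assumption.

```python
from typing import Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as identical
    return len(a & b) / len(a | b)


def average_jaccard(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Mean Jaccard similarity over (gold labels, predicted labels) pairs."""
    scores: List[float] = [jaccard(gold, predicted) for gold, predicted in pairs]
    return sum(scores) / len(scores)


gold_domains = {
    "Culture Anthropology and Society",
    "Language and Linguistics",
    "Communication and Telecommunication",
}
predicted_domains = gold_domains | {"Politics Government and Nobility"}

similarity = jaccard(gold_domains, predicted_domains)  # 3 shared labels out of 4
```

For this pair the two senses share three of four domains, so the prediction counts as an error under F1 even though it is close to the gold sense in terms of CSI labels.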

Conclusion
In this paper, we introduced a novel framing for the Word Sense Disambiguation problem inspired by the Extractive Reading Comprehension task in QA: given a word in a sentence and a text containing all its possible definitions, a model has to identify the span containing the correct definition for the target word. For this new formulation, which we called Extractive Sense Comprehension (ESC), we devised a transformer-based architecture (ESCHER), which, differently from previous approaches, can look at all the target word definitions at once, alongside the input sentence. ESCHER surpasses the current state of the art by 1.7 points on the standard English all-words WSD task, thanks to its more efficient use of the training data. Also, when provided with only a few examples for each sense, ESCHER attains remarkable levels of performance, requiring roughly three times fewer annotated instances than its direct competitor to reach the same performance. Furthermore, our new formulation allows ESCHER to scale across different inventories and to combine them effectively. Indeed, when provided with data annotated with multiple vocabularies, it achieves even better results than when limited to one inventory only, reaching 81.5 F1 points on ALL.
As future work we plan to expand this framework so as to condition the prediction not only on the target word context and definitions, but also on the possible senses of its surrounding words.
The pretrained model, along with code and data, is available at https://github.com/SapienzaNLP/esc.

A Transformer Architectures
In this Section we show the results attained when using RoBERTa large and XLNet large (Yang et al., 2019) as the pretrained model for ESCHER. We note that 73 training examples out of 226,036 could not fit in 512 BPE tokens, i.e., the maximum input length for both models, and we therefore discard them. In Table 6, we report the results on the English WSD evaluation framework of Raganato et al. (2017a) for ESCHER initialized with the aforementioned models along with its performance when using BART large.

B Training Details
We use BART large as transformer architecture, which consists of 12 encoder layers and 12 decoder layers with a hidden size of 1024. We train the model with a constant learning rate of 0.00001, Rectified Adam as optimizer and a batch size of 700 tokens. We accumulate the gradient for 20 steps and clip it at 10. The model is trained for a maximum of 300,000 steps. We compute the F1 score on the validation dataset every 2,000 steps and stop the training if the model does not improve for 15 consecutive tests (30,000 steps). The whole training is done with half precision and an amp-level of O1. It is worth noting that the training of ESCHER on SemCor (Miller et al., 1993) took less than 5 hours on a GeForce RTX 2080ti.
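The stopping rule described above (validation every 2,000 steps, patience of 15 evaluations, hard cap at 300,000 steps) can be sketched as a small function. The name `stopping_step` and the toy F1 sequences are illustrative; the sketch only models when training would end, not the training itself.

```python
from typing import Sequence


def stopping_step(
    f1_scores: Sequence[float],
    eval_every: int = 2000,
    patience: int = 15,
    max_steps: int = 300_000,
) -> int:
    """Return the training step at which training would stop.

    `f1_scores[n - 1]` is the validation F1 measured at step n * eval_every.
    """
    best, stale = float("-inf"), 0
    for n, f1 in enumerate(f1_scores, start=1):
        step = n * eval_every
        if step >= max_steps:
            return max_steps
        if f1 > best:
            best, stale = f1, 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return min(len(f1_scores) * eval_every, max_steps)


# Best score at the second evaluation, then 15 evaluations without improvement.
plateau = [70.0, 75.0] + [74.0] * 15
stopped_at = stopping_step(plateau)  # 17 evaluations * 2000 steps
```

With these settings the patience window corresponds to 15 x 2,000 = 30,000 steps, matching the figure given in parentheses above.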