With More Contexts Comes Better Performance: Contextualized Sense Embeddings for All-Round Word Sense Disambiguation

Contextualized word embeddings have been employed effectively across several tasks in Natural Language Processing, as they have proved to carry useful semantic information. However, it is still hard to link them to structured sources of knowledge. In this paper we present ARES (context-AwaRe Embeddings of Senses), a semi-supervised approach to producing sense embeddings for the lexical meanings within a lexical knowledge base that lie in a space comparable to that of contextualized word vectors. ARES representations enable a simple 1-Nearest-Neighbour algorithm to outperform state-of-the-art models, not only in the English Word Sense Disambiguation task, but also in the multilingual one, whilst training on sense-annotated data in English only. We further assess the quality of our embeddings in the Word-in-Context task, where, when used as an external source of knowledge, they consistently improve the performance of a neural model, leading it to compete with other more complex architectures. ARES embeddings for all WordNet concepts and the automatically-extracted contexts used for creating the sense representations are freely available at http://sensembert.org/ares.


Introduction
Contextualized word embeddings have proved to be highly beneficial to the majority of Natural Language Processing tasks (Wang et al., 2019). Indeed, language models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) enable architectures built on top of them to attain performances that were previously out of reach (Wang et al., 2019). The main reason behind this great success is the fact that contextualized embeddings of words encode the semantics defined by their input context (Reif et al., 2019). Indeed, when tested on the Word-in-Context (WiC) task (Pilehvar and Camacho-Collados, 2019), i.e., a binary classification problem where a model has to decide whether a target word is used with the same meaning in two different sentences, contextualized word embeddings placed themselves as the best approaches across the board. Nevertheless, these latent representations do not provide any explicit information regarding the meaning expressed by the word in context, hence making it difficult to link texts to structured sources of knowledge such as lexical knowledge bases (LKBs).
The task of associating a word in context with the most suitable meaning from a predefined sense inventory is better known as Word Sense Disambiguation (Navigli, 2009, WSD), and is usually tackled by two kinds of approach: knowledge-based and supervised ones. On the one hand, knowledge-based approaches are able to scale across languages, since they do not need sense-annotated corpora and rely only on the information within their underlying LKB. On the other hand, supervised models (Huang et al., 2019) have proved to achieve state-of-the-art results on the English benchmarks by taking advantage of manually-annotated data for the task and machine learning algorithms. However, supervised approaches are mostly focused on English (Navigli, 2018; Pasini, 2020) and have only recently been applied to lower-resourced languages thanks to automatically-produced datasets (Scarlini et al., 2019; Barba et al., 2020). Another effective approach in this direction is SensEmBERT (Scarlini et al., 2020), a knowledge-based approach to building sense embeddings without relying on sense-annotated data. Since it is not tied to semantic annotations, SensEmBERT scales over different languages. However, it is limited to nominal concepts only and provides different representations for the same concepts across different languages, which hinders its applicability to cross-lingual tasks.

In this paper we present ARES (context-AwaRe Embeddings of Senses), a semi-supervised approach to producing sense embeddings for all the word senses in a language vocabulary. ARES makes up for the paucity of manually-annotated examples for a large portion of words' meanings by coupling the information within a knowledge base with the representational power of a pre-trained language model. This enables reliable representations to be built for those senses not appearing in manually-curated resources, while at the same time enriching the vectors for all the other concepts.
We tested our embeddings on the two tasks that measure a model's capabilities to encode word meanings, i.e., WSD and WiC. In both tasks, ARES representations prove to be of great benefit. In WSD, while employing a simple 1-Nearest-Neighbour (1-NN) algorithm, they attain state-of-the-art results on English, even beating dedicated architectures with long and expensive fine-tuning procedures. In WiC they lead a simple BERT-based model to perform in the same ballpark as other state-of-the-art alternatives which rely on more complex architectures. Furthermore, by taking advantage of pre-trained multilingual models we provide unified representations of meanings across languages, which, while using English data only, outperform their competitors and achieve the state of the art on all the languages available in the all-words multilingual WSD tasks, i.e., French, German, Italian and Spanish.

Related Work
Word Sense Disambiguation (WSD) is a core task in lexical semantics and has mainly been tackled by two kinds of approach: knowledge-based and supervised ones. Knowledge-based methods build upon lexical knowledge bases, such as WordNet (Miller et al., 1990) and BabelNet (Navigli and Ponzetto, 2012), and employ graph algorithms to address the ambiguity of words in texts (Moro et al., 2014; Agirre et al., 2014; Tripodi and Navigli, 2019). These approaches do not rely on semantically-tagged training data and are hence able to scale over all the languages supported by their underlying knowledge base. Nevertheless, they lag behind their supervised counterparts on English in terms of performance. Supervised approaches, by framing WSD as a classification task, have established themselves as the state of the art in English (Hadiwinoto et al., 2019; Huang et al., 2019; Blevins and Zettlemoyer, 2020), outperforming their knowledge-based competitors by several points. Recently, Pilehvar and Camacho-Collados (2019) provided a new formulation of WSD as a binary classification problem where, given a target word and two contexts, a model has to predict whether the target word is used with the same meaning. This setting has the advantage of not drawing on sense inventories and provides an effective testbed for context-based word embeddings (Peters et al., 2019; Levine et al., 2020).
Contextualized word embeddings have recently been employed to compute sense representations that can be applied directly to disambiguation. Some of the first approaches of this kind were proposed by Melamud et al. (2016) and Peters et al. (2018), who exploited the semantically-tagged sentences of SemCor (Miller et al., 1993) and neural language models to create embeddings for the senses in WordNet. Similarly, Loureiro and Jorge (2019, LMMS) computed sense embeddings using BERT (Devlin et al., 2019) and the relations in a lexical knowledge base in order to also provide vectors for those meanings that do not appear in SemCor. The most recent effort in this direction is SensEmBERT (Scarlini et al., 2020), which drops the need for sense-annotated corpora by exploiting the BabelNet mapping between WordNet senses and Wikipedia pages so as to collect contextual information for the senses in WordNet. Since it does not rely on manually-annotated data, SensEmBERT can scale over different languages, being limited, however, to nominal senses only.
In this work we continue along this latter line of research and propose a novel method for producing sense embeddings which, by relying on English data only, also proves to be able to model meanings across languages. Rather than leveraging WordNet relations as LMMS does, ARES creates vector representations for all senses by automatically providing usage examples for the synsets within a knowledge base. In contrast to SensEmBERT, instead, ARES covers all four WordNet POS tags and, at the same time, dispenses with the resources required by SensEmBERT, such as NASARI and the Wikipedia category graph.

Preliminaries
We now describe the resources that we use to build ARES embeddings.
WordNet (Miller et al., 1990) is the most widely used lexical knowledge base for English. It can be viewed as a graph where nodes are concepts, i.e., synsets, and edges are semantic relations between them. Each synset contains a set of synonyms, e.g., the synset defined as A natural flow of ground water comprises the lemmas spring, fountain and natural spring. We use the notation {l_1, ..., l_n} (g) to refer to the concept with gloss g expressed by the lemmas l_1, ..., l_n. We define a sense as a lemma-gloss pair, i.e., a meaning that is specific to a given lemma; e.g., fountain-(A natural flow of ground water) is a sense of {spring, fountain, natural spring} (A natural flow of ground water).
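The synset/sense notation above can be made concrete with a minimal sketch (plain Python with a hand-built toy synset, rather than an actual WordNet interface such as NLTK's):

```python
from dataclasses import dataclass

# Toy model of the notation: a synset is a set of lemmas plus a gloss,
# and a sense is a (lemma, gloss) pair specific to one lemma.
@dataclass(frozen=True)
class Synset:
    lemmas: tuple   # lexicalizations l_1, ..., l_n
    gloss: str      # definition g

    def senses(self):
        # Pairing each lemma with the gloss yields the synset's senses.
        return [(lemma, self.gloss) for lemma in self.lemmas]

spring = Synset(("spring", "fountain", "natural spring"),
                "A natural flow of ground water")

# fountain-(A natural flow of ground water) is one of its three senses:
assert ("fountain", "A natural flow of ground water") in spring.senses()
```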
SyntagNet (Maru et al., 2019) is a repository containing approximately 88K lexical-semantic collocations, i.e., pairs of WordNet synsets that co-occur more frequently than would be expected. For example, the concepts {coach, bus, autobus} (A vehicle carrying many passengers) and {driver, motorist} (The operator of a motor vehicle) appear in SyntagNet as they form a collocation.
UKB (Agirre et al., 2014) is a knowledge-based approach to WSD based on the Personalized PageRank algorithm (Haveliwala et al., 2002). We set WordNet as its underlying knowledge base, disable the Most Frequent Sense backoff strategy and set the parameters according to Agirre et al. (2018).
SemCor (Miller et al., 1993) is the standard manually-curated corpus for WSD, including more than 220K words tagged with 25K distinct WordNet meanings, hence providing annotated contexts for roughly 15% of the synsets in WordNet.
BERT (Devlin et al., 2019) is a deep neural architecture trained with the masked language model objective. Given a text, it provides contextual embeddings for the subtokens therein. We choose BERT because it has proven to capture the semantics of a word in context (Reif et al., 2019), while also being able to effectively generalize cross-lingually thanks to its multilingual representations (Pires et al., 2019).

ARES
We now introduce ARES, a semi-supervised approach for creating sense embeddings that cover all the senses in a language vocabulary. Given as input a corpus C of raw sentences and a synset s ∈ WordNet together with its lexicalizations L_s, ARES operates in the following three steps:

1. Context extraction, which exploits the representation capabilities of BERT and the collocational information comprised in SyntagNet to extract a meaningful set of contexts where s is likely to appear (Section 4.1);
2. Synset embedding, which creates the embedding of the synset s by encoding the contextual information of the sentences gathered in the previous step (Section 4.2);
3. Sense embedding, which combines the sense-annotated contexts in SemCor, the definitional information of the glosses and our synset embeddings to create the final sense representation (Section 4.3).

Context Extraction
In this section we describe our approach for automatically retrieving contexts for WordNet's synsets. First, as in SensEmBERT (Scarlini et al., 2020), we utilize BERT and UKB to find contexts that are similar to each other and link them to a meaning in WordNet. Then, we enrich the set of contexts retrieved for a given synset s by exploiting the semantic collocations available in SyntagNet.
Similarity-Based Extraction Given a synset s and one of its lexicalizations l, we collect the occurrences of l in the input corpus C and compute their contextualized representations by means of BERT. We then cluster the contextualized vectors of l's occurrences by using the k-means algorithm. We note that the sentences comprised within the same cluster define similar contexts for the target word, hence implying that l is very likely to be used with the same meaning across sentences. Therefore, we associate each cluster with one of l's meanings and a disambiguation score. To this end, we apply UKB (see Section 3) to the set of words that most characterize the given cluster, i.e., the top n most frequent words among its sentences. Once each cluster has been disambiguated with one meaning of l, we retain only those clusters that are associated with s. Then, we associate each sentence with the disambiguation score provided by UKB for its cluster and sample t sentences according to their score, creating a set of contexts Φ_{l,s} for the lemma l in the synset s.

We note that it might happen that none of the clusters of l is associated with s. This limits both the number and the diversity of contexts available for the target synset. To overcome this issue and increase coverage, we sample a set of ξ sentences from ∪_{l' ∈ L_s} Φ_{l',s} and replace the lexicalizations l' of s that appear therein with the lemma l. For example, let {spring, fountain, natural spring} (A natural flow of ground water) be the input synset, and the sentences in Table 1 (top) be the contexts retrieved thanks to the clustering and disambiguation steps; we replace some occurrences of spring and fountain with natural spring, as shown in the bottom part of the Table.

Table 1: Sentences for {spring, fountain, natural spring}. Top: retrieved contexts. Bottom: contexts after lemma replacement.
Springs that contain significant amounts of minerals are called 'mineral springs'.
The forcing of the spring to the surface can be the result of a confined aquifer.
Other fountains are the result of pressure from an underground source in the earth.
---
Natural springs that contain significant amounts of minerals are called 'mineral springs'.
Other natural springs are the result of pressure from an underground source in the earth.

Collocation-Aware Extraction We now enrich the set Φ_{l,s} by leveraging the semantic collocations available in SyntagNet (see Section 3) for the synset s. To this end, we first retrieve from SyntagNet all the synsets s' that collocate with s, and then extract all the sentences in C where any of the lemmas l and l' of s and s', respectively, appear within a small window w. Finally, we disambiguate each occurrence of l with its synset s. For example, given the concepts {play} (Play on an instrument) and {guitar} (A stringed instrument), which are in collocation in SyntagNet, we search for all the occurrences of play and guitar in the sentences of the input corpus and retain only those where the two words appear within a window of size 3. In Table 2 we show an excerpt of the sentences extracted for the two aforementioned synsets.
Each occurrence of play in those sentences is hence disambiguated with {play} (Play on an instrument).
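The collocation-aware step can be sketched as a simple window check (an illustrative simplification: the real pipeline matches lemmas after morphological analysis, while here we match lowercased surface tokens):

```python
def window_cooccurrences(sentences, lemma, coll_lemma, w=3):
    """Return the sentences where `lemma` and `coll_lemma` occur within
    a window of `w` tokens of each other, as in the SyntagNet-based
    extraction step (toy version on lowercased surface tokens)."""
    hits = []
    for sent in sentences:
        tokens = sent.lower().split()
        pos_a = [i for i, t in enumerate(tokens) if t == lemma]
        pos_b = [i for i, t in enumerate(tokens) if t == coll_lemma]
        if any(abs(i - j) <= w for i in pos_a for j in pos_b):
            hits.append(sent)
    return hits

corpus = [
    "he learned to play the guitar as a child",
    "the children play far away from the guitar shop",
    "she would play her guitar every morning",
]
matches = window_cooccurrences(corpus, "play", "guitar")
# keeps the first and third sentences; the second exceeds the window
```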
At the end of this step, the synset s is associated with the set of sentences Φ_s = ∪_{l ∈ L_s} Φ_{l,s} where any of the lemmas of s is disambiguated with s.

Synset Embedding
In this step we exploit the contexts retrieved for a target synset s in order to compute its latent representation.
First, we create the set L̂_s containing the lexicalizations of the synsets that are collocated with s in SyntagNet. For example, given the synset s = {spring, fountain, natural spring} (A natural flow of ground water), we consider the lexicalizations of its related concepts in SyntagNet, i.e., flow and flowing from the synset {flow, flowing} (The motion characteristic of fluids), and create L̂_s = {flow, flowing}.
Then, we leverage the contexts in Φ_{l,s} and the lemmas in both L_s and L̂_s to compute the vector representation v_s^C for the synset s as follows:

v_s^C = (1/Z) ( Σ_{l ∈ L_s} Σ_{σ ∈ Φ_{l,s}} BERT(l, σ) + Σ_{l' ∈ L̂_s} Σ_{σ ∈ Φ̂_{l',s}} BERT(l', σ) )

where Φ̂_{l',s} is a subset of Φ_{l,s} containing all the sentences where the lemma l' ∈ L̂_s appears in collocation with l ∈ L_s, Z = Σ_{l ∈ L_s} |Φ_{l,s}| + Σ_{l' ∈ L̂_s} |Φ̂_{l',s}|, and BERT(λ, σ) is the contextualized embedding for the lemma λ in the context σ.
At the end of this step, the synset s is associated with a vector v_s^C created as shown above.
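The averaging behind the synset embedding can be sketched with toy vectors in place of real BERT outputs (`toy_bert` below is a deterministic stub standing in for the contextual encoder, not the actual model):

```python
def synset_embedding(contexts, bert):
    """Average the contextualized vectors of the synset's lemmas over
    all extracted contexts. `contexts` is a list of (lemma, sentence)
    pairs drawn from Phi_{l,s} and the collocational Phi-hat sets."""
    vecs = [bert(lemma, sent) for lemma, sent in contexts]
    z = len(vecs)                  # the normalization term Z
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / z for i in range(dim)]

# Toy 3-dimensional "BERT" for illustration only.
def toy_bert(lemma, sentence):
    return [float(len(lemma)), float(len(sentence.split())), 1.0]

v_s = synset_embedding(
    [("spring", "mineral springs are rich in minerals"),
     ("fountain", "the fountain flowed from the rock")],
    toy_bert,
)
# → [7.0, 6.0, 1.0], the mean of [6, 6, 1] and [8, 6, 1]
```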

Sense Embedding
In this final step, we first create sense-level representations by leveraging the contexts in SemCor and the WordNet glosses, and then enrich them with our synset embeddings.

Table 3: Statistics (Synsets, Sentences, Annotations, Avg sentences per synset) of the contexts extracted by the similarity-based (Cluster) and collocation-aware (SyntagNet) extraction steps in Section 4.1.
For each sense θ of s we create its embedding from its contextual occurrences within SemCor and its definition in WordNet. As for the SemCor part, we apply Peters et al. (2018)'s method to compute its representation v_SC^θ, i.e., we average the BERT embeddings of all the words in SemCor tagged with θ. As regards the sense gloss part, instead, we follow Loureiro and Jorge (2019) and prepend to the gloss of s both the lemma of θ and all the lexicalizations of s, and compute the sense gloss embedding v_G^θ by averaging the BERT representations of the words therein. For example, given the spring sense of the synset {spring, fountain, natural spring} (A natural flow of ground water), its sense gloss embedding is the average of the BERT representations of the following enriched gloss: "spring - spring, fountain, natural spring - A natural flow of ground water".
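The gloss enrichment above is a simple string construction; a minimal sketch (the " - " separator mirrors the example in the text):

```python
def enriched_gloss(sense_lemma, synset_lemmas, gloss):
    """Prepend the sense lemma and all the synset's lexicalizations to
    the gloss before encoding it, following Loureiro and Jorge (2019)."""
    return f"{sense_lemma} - {', '.join(synset_lemmas)} - {gloss}"

text = enriched_gloss("spring",
                      ["spring", "fountain", "natural spring"],
                      "A natural flow of ground water")
# → "spring - spring, fountain, natural spring - A natural flow of ground water"
```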
We compute the representation ARES_θ for the sense θ of the synset s as follows:

ARES_θ = (v_SC^θ ⊕ v_G^θ) ∘ v_s^C

where ⊕ represents the mean between two vectors and ∘ their concatenation. If a sense does not occur in SemCor, we replace v_SC^θ with v_G^θ and apply the above formula. We recall from Section 3 that SemCor covers only 15% of WordNet's senses; nevertheless, ARES is able to generalize over all the senses in WordNet thanks to the glosses and its automatically-retrieved contexts.
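With toy vectors, the combination step (mean of the SemCor and gloss vectors, concatenated with the synset vector) can be sketched as:

```python
def ares_embedding(v_sc, v_g, v_s):
    """Mean of the SemCor and gloss vectors, concatenated with the
    synset vector. If the sense never occurs in SemCor, pass the gloss
    vector in place of v_sc."""
    mean = [(a + b) / 2 for a, b in zip(v_sc, v_g)]
    return mean + list(v_s)   # concatenation doubles the vector size

# With 1024-dimensional BERT-large vectors this yields 2048-dimensional
# ARES vectors; shown here with 2-dimensional toys.
emb = ares_embedding([1.0, 3.0], [3.0, 5.0], [0.5, 0.5])
# → [2.0, 4.0, 0.5, 0.5]
```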

Statistics
In Table 3 we report the statistics of the sentences extracted as in Section 4.1. As one can see, our automatically-extracted annotations cover 65% of WordNet synsets (77,195 out of 117,659), providing at least one annotated example for 56,022 synsets that are not covered by SemCor. The total number of distinct tagged sentences is more than 10M, for a total of 13M annotations. Most synsets have around 150 annotated examples, as shown in Figure 1.

WSD Experimental Setup
We now report the setup of the evaluation we conducted on the English and multilingual WSD tasks.

ARES Configuration
We used Wikipedia as input corpus since it is the largest general-domain resource currently available. Regarding the context extraction step (see Section 4.1), we set the number of clusters k for a lexeme l as the number of its senses in WordNet. We varied the number of words n to give as input to UKB between 5 and 25 in steps of 5, and selected the value n = 5 by manually assessing the quality of a sample of the clusters' disambiguation output. As for the numbers of sentences t and ξ, we varied them between 50 and 300 in steps of 50 and selected the values that maximized the F1 performance of ARES on SemEval-07, i.e., t = 150 and ξ = 50. As regards the window size w, we followed Maru et al. (2019) and set w = 3.
Concerning BERT representations, we used the BERT large-cased model for English. To scale across languages, instead, we made use of BERT base-multilingual-cased (mBERT) so as to build unified representations that are shared across languages, i.e., ARES_m. For our multilingual representations, we focused on synset embeddings rather than sense ones. In fact, senses are language-specific as they are tied to one of the lemmas of the synset. Hence, we built ARES_m synset embeddings by averaging the representations of their English senses. We note that, while the pre-trained model differs between the two representations, the sentences used to create the embeddings are the same as the ones used for English. Following Loureiro and Jorge (2019), we took as BERT representation the sum of the last four hidden layers.
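The layer pooling above (sum of the last four hidden layers) can be sketched over dummy per-layer vectors; with a real encoder these would be the per-layer hidden states for one token:

```python
def word_representation(hidden_states):
    """Sum the last four hidden layers of the encoder for one token.
    `hidden_states` is a list of per-layer vectors, ordered from the
    first to the last layer."""
    last_four = hidden_states[-4:]
    dim = len(last_four[0])
    return [sum(layer[i] for layer in last_four) for i in range(dim)]

# Six dummy "layers" of a 2-dimensional encoder:
layers = [[float(k), float(-k)] for k in range(6)]
vec = word_representation(layers)   # sums layers with k = 2, 3, 4, 5
# → [14.0, -14.0]
```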

WSD Setup
To test ARES on the WSD task, we employed the 1-NN algorithm. To this end, we computed the BERT representation of each word w in the test sentences and compared it with the embeddings corresponding to the senses of w in WordNet. Since ARES vectors are made of the concatenation of two BERT representations (Section 4.3), we repeated the embedding of w in order to match the shape of ARES vectors. Thus, we took as prediction the sense that maximizes the similarity with w's representation. For languages other than English, we considered as candidate synsets for a lemma those associated with it in BabelNet 4.0, i.e., a multilingual knowledge base providing lexicalizations of concepts in different languages.
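The 1-NN step can be sketched as follows (cosine similarity; the word vector is repeated once so that its shape matches the concatenated sense vectors; the sense identifiers and vectors are toy values):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def disambiguate(word_vec, candidate_senses):
    """candidate_senses maps a sense id to its ARES vector, which is
    twice as long as a plain contextual word vector."""
    query = list(word_vec) + list(word_vec)   # match ARES's shape
    return max(candidate_senses,
               key=lambda s: cosine(query, candidate_senses[s]))

senses = {
    "spring_water": [1.0, 0.0, 1.0, 0.0],
    "spring_season": [0.0, 1.0, 0.0, 1.0],
}
pred = disambiguate([0.9, 0.1], senses)   # → "spring_water"
```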
Comparison systems We compared ARES with both knowledge-based and supervised approaches on English. As knowledge-based systems, we considered UKB with SyntagNet's relations (Scozzafava et al., 2020, UKB+Syn) and SensEmBERT (Scarlini et al., 2020), along with its supervised version, i.e., SensEmBERT_sup. SensEmBERT and SensEmBERT_sup cover only nominal senses, so we used the Most Frequent Sense (MFS) backoff strategy, i.e., predicting the most frequent sense of a lemma in WordNet, for tagging instances with other POS tags.
Among supervised systems, we tested against EWISE ConvE (Kumar et al., 2019), KnowBERT (Peters et al., 2019), the vocabulary compression model by Vial et al. (2019, BERT_hyp), GlossBERT (Huang et al., 2019), LMMS (Loureiro and Jorge, 2019) and a 1-NN approach over SemCor-based BERT embeddings (BERT k-NN). We also report the performance of these two latter approaches using mBERT instead of BERT large, i.e., LMMS_mBERT and mBERT k-NN. All supervised systems under comparison use SemCor only as training corpus.
We performed additional comparisons by using Peters et al. (2018)'s method with BERT on SemCor+OMSTI (Taghipour and Ng, 2015, SemCor+OMSTI_BERT), a semi-automatically generated extension of SemCor, and OneSeC (Scarlini et al., 2019, OneSeC_BERT), an automatically-tagged corpus. On the multilingual WSD tasks we compared against SensEmBERT and UKB augmented with SyntagNet's relations (Scozzafava et al., 2020, UKB+Syn). We also built a supervised baseline (mBERT) by training a classifier on English data and applying it directly to the other languages. To this end, we used mBERT with frozen weights followed by a linear layer with swish activation and an unbiased softmax classifier on top (see Appendix A.2 for training details). In addition, we report the performance of LMMS_mBERT and mBERT k-NN on the multilingual datasets.

WSD Results
We now report the results of the evaluation we carried out on the English and multilingual WSD tasks, along with an ablation study of ARES components.

English all-words WSD
In Table 4 we report the results attained by the systems under comparison on the all-words English WSD datasets. Our direct competitors, i.e., SensEmBERT_sup and LMMS, score, respectively, 5.1 and 2.5 F1 points lower than ARES. This comparison shows the effectiveness of different approaches in coping with the paucity of sense-annotated data for WSD. On the one hand, the SensEmBERT approach is effective in modeling nominal meanings; however, it cannot scale over other POS tags due to the limitations of its underlying resources. On the other hand, LMMS shows that the WordNet topology can be exploited to propagate the latent representations of frequent meanings towards those not appearing in sense-annotated corpora. Nevertheless, these less frequent senses do not have a specific characterization and thus their representations are less refined, as we also show in Section 7.2. Our approach overcomes both these limitations, being able to create better-characterizing representations across senses with different POS tags. This leads ARES to outperform the state of the art at the time of writing, i.e., GlossBERT, by almost 1 point on ALL by simply employing a 1-NN algorithm, and hence requiring no expensive fine-tuning procedure.

WSD on Infrequent Words and Senses
To test the ability of ARES and its competitors to scale over rare words and senses, we extracted two new test sets from ALL: i) ALL_LFS, which includes the 1139 instances in ALL associated with a sense not in SemCor; ii) ALL_LFW, which includes the 222 instances in ALL associated with a non-monosemous word not tagged in SemCor. As shown in Table 5, ARES proves to be the best system across the board, achieving the highest result on both datasets. This shows that the contexts extracted by ARES help balance the quality of meanings' representations across senses with different frequencies, without disadvantaging rare senses in favor of the more frequent ones. In contrast, both LMMS and GlossBERT are more biased towards those representations in SemCor, hence losing ground on both datasets, with a gap of 3.6 and 3.2 points, respectively, on ALL_LFS when compared to ARES. The latter, instead, by taking advantage of its automatically-retrieved contexts, scales better over rare words and senses, and outperforms its competitors on both datasets, with the highest result of 81.1 on ALL_LFW.

Table 6: Ablation in terms of F1 of the different components of ARES on the ALL dataset. ∘ indicates the concatenation while ⊕ the average.

Ablation Study
We now measure the impact that each part of our vectors has on the final results by means of an ablation study on the ALL dataset. The upper side of Table 6 compares the two kinds of contexts that we automatically retrieve (Section 4.1). As one can see, Cluster_cont alone, i.e., the sentences retrieved by means of the similarity-based step, already attains good results. When combined with the contexts extracted thanks to SyntagNet, i.e., Syn_cont, it gains 0.6 extra points. In the lower part of the Table, we show different combinations of the vectors built from SemCor, our contexts and the WordNet glosses. We indicate with A_CS and Gloss the vectors built from our extracted contexts (see Section 4.2) and the sense gloss (see Section 4.3), respectively. SemCor alone attains 69.2 points, 1.3 points less than Cluster_cont. This is because SemCor does not provide examples for all WordNet meanings, therefore having a lower recall. When combining SemCor with WordNet's glosses (SemCor ⊕ Gloss) and with A_CS (SemCor ∘ A_CS), we obtain improvements of 6.5 and 7.9 points, respectively. Finally, when combining the three components, we obtain our best score of 77.9 F1 points on ALL.

Multilingual all-words WSD
Finally, we investigate the ability of ARES_m to scale across languages by testing it on the multilingual WSD datasets of SemEval-13 and SemEval-15. As shown in Table 7, ARES_m is the best system across the board, achieving state-of-the-art results on all languages of both datasets but the Italian nominal instances of SemEval-15. On average, ARES_m scores almost 2.0 F1 points higher than the second best performing system, i.e., mBERT. When compared to LMMS_mBERT, ARES_m achieves 7.0 F1 points higher on average. This may be due to the fact that our automatically-retrieved sentences provide a better contextualization of meanings than the propagation technique employed by LMMS, hence allowing our embeddings to scale effectively across languages. Finally, we surpass SensEmBERT and attain state-of-the-art performance on all languages of the multilingual all-words WSD tasks, while at the same time keeping the quality on nouns high.

Table 7: F1 on each language of the multilingual WSD tasks (SemEval-13 and SemEval-15) and the macro F1 score computed across all languages. Statistically-significant differences between ARES_m's and mBERT's recalls are underlined (χ² with p < 0.05). *: Recomputed on the latest version of the datasets.

The evaluation carried out shows how beneficial our embeddings are to the English and the multilingual WSD tasks. ARES, in fact, proves to carry high-quality semantic information within its representations, which enables it to generalize over both words and languages, and achieve state-of-the-art results in all the tested settings.

WiC Experimental Setup
In this section we further inspect the properties of our embeddings by measuring the improvements they bring to the Word-in-Context (WiC) task.

Evaluation Dataset We tested on the Word-in-Context dataset (Pilehvar and Camacho-Collados, 2019, WiC), i.e., a binary classification problem where, given a target word w and two contexts c_1 and c_2, the task is to determine whether w occurs with the same meaning in the two contexts.

Model We built BERT_ARES, a BERT-based classifier whose input is the concatenation of the sentence-level embedding and the two representations of the target word w in c_1 and c_2. As additional features, we considered the senses s_1 and s_2 of w in c_1 and c_2, respectively, which we predicted by means of ARES as in Section 6. Then, we applied a dense layer, which we trained during fine-tuning, to the ARES embeddings of s_1 and s_2 and reduced their dimensionality to 1024. Finally, we concatenated the input of the classifier with these two new representations.

Comparison Systems We compared ARES against the best performing models on the WiC task. We considered three pre-trained language models fine-tuned on WiC, i.e., BERT_LARGE (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019), and two language models which leverage external knowledge while pre-training, i.e., KnowBert (Peters et al., 2019) and SenseBERT_LARGE (Levine et al., 2020).
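The feature construction for the WiC classifier can be sketched as follows (all dimensions are toy-sized; the dense projection is shown as a hand-written map standing in for the layer trained during fine-tuning, and the concatenation order is an assumption on our part):

```python
def wic_features(sent_vec, w_c1, w_c2, ares_s1, ares_s2, proj):
    """Concatenate the sentence-level embedding, the two contextual
    vectors of the target word, and the projected ARES vectors of its
    predicted senses in the two contexts."""
    return (list(sent_vec) + list(w_c1) + list(w_c2)
            + proj(ares_s1) + proj(ares_s2))

# Toy projection halving the dimensionality (stands in for the dense
# layer that reduces the 2048-dim ARES vectors to 1024).
def toy_proj(v):
    return [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]

feats = wic_features([0.1], [0.2], [0.3],
                     [1.0, 3.0], [2.0, 4.0], toy_proj)
# → [0.1, 0.2, 0.3, 2.0, 3.0]
```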

WiC Results
The results show that ARES embeddings consistently improve the performance of the underlying model, and that they can be integrated in a straightforward manner at fine-tuning time. Moreover, BERT_ARES performs better than or on a par with its closest competitors, i.e., KnowBert, SenseBERT_LARGE and T5 (Large and 3B), which, instead, rely on more complex architectures, specific pre-training phases and between 40M and 3000M more parameters. T5-11B is the only model achieving better results than BERT_ARES, mainly due to the large difference in terms of trainable weights (T5-11B being roughly 30 times bigger).

Conclusion
In this paper we presented ARES, a semi-supervised approach for producing embeddings of senses in English and across different languages. ARES couples the information within sense-annotated corpora with that automatically retrieved by means of a cluster-based algorithm, so as to produce high-quality latent representations for the concepts within a lexical knowledge base. Our experiments showed that, despite relying on English data only, ARES outperforms all its alternatives. It achieves state-of-the-art results on both English and multilingual WSD benchmarks, leveraging BERT large and mBERT, respectively, as underlying pre-trained language models. We further tested our embeddings on the WiC task, where they lead a baseline neural model to outperform its closest competitors that rely on larger architectures or dedicated pre-training routines. Our embeddings computed with BERT large and mBERT and the automatically-extracted contexts are available at http://sensembert.org/ares.
As future work, we plan to exploit the information brought by our embeddings in other downstream tasks, such as multilingual Semantic Role Labeling (Di Fabio et al., 2019) and cross-lingual Semantic Parsing (Blloshmi et al., 2020).