Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations

To date, the most successful word, word sense, and concept modelling techniques have used large corpora and knowledge resources to produce dense vector representations that capture semantic similarities in a relatively low-dimensional space. Most current approaches, however, suffer from a monolingual bias, with their strength depending on the amount of data available across languages. In this paper we address this issue and propose Conception, a novel technique for building language-independent vector representations of concepts which places multilinguality at its core while retaining explicit relationships between concepts. Our approach results in high-coverage representations that outperform the state of the art in multilingual and cross-lingual Semantic Word Similarity and Word Sense Disambiguation, proving particularly robust on low-resource languages. Conception – its software and the complete set of representations – is available at https://github.com/SapienzaNLP/conception.


Introduction
Word vector representations, in particular dense representations or word embeddings (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017), play a key role in a wide range of tasks, including Text Similarity (Kenter and de Rijke, 2015; Nguyen et al., 2019), Word Sense Disambiguation (Iacobacci et al., 2016; Raganato et al., 2017a), Semantic Role Labeling (He et al., 2017; Marcheggiani et al., 2017), Question Answering (Zhou et al., 2015) and Machine Translation (Mikolov et al., 2013b; Bahdanau et al., 2015). This is especially the case when they are used as the underlying input representation. Word embedding techniques map each word to a relatively low-dimensional space in which two semantically or syntactically similar words lie close together. Due to their latent nature, however, most embeddings are commonly considered uninterpretable (Levy and Goldberg, 2014), as the properties captured by each dimension are often unclear. More recent studies have shed some light on their interpretability (Rothe and Schütze, 2016; Wallace et al., 2019) or included interpretability directly in the learning process (Park et al., 2017; Koç et al., 2018), but the opaqueness of dense vectors is still a key reason why research has not completely given up on sparse representations (Faruqui et al., 2015; Derby et al., 2018).
Moreover, most current embedding techniques rely on large corpora which are available in only a few languages, such as English or Chinese, strongly limiting their robustness on low-resource languages (Speer and Lowry-Duda, 2017). In an attempt to solve this issue, researchers have turned to multilingual word representations by making use of parallel vocabularies (Mikolov et al., 2013b; Ammar et al., 2016; Smith et al., 2017), exploiting multilingual knowledge graphs, and exploring unsupervised methods to align monolingual embeddings in a single shared distributional space (Conneau et al., 2017) or to directly learn multilingual embeddings (Chen and Cardie, 2018).
Nevertheless, a well-known pitfall of both monolingual and multilingual word representations is the so-called meaning conflation deficiency (Camacho-Collados and Pilehvar, 2018): a word may be ambiguous, that is, it may have multiple meanings, but those possibly unrelated meanings cannot be told apart since they are conflated into a single representation. As a result, contextualized word representations have garnered attention (Melamud et al., 2016), enjoying great success in the form of pretrained language models like BERT (Devlin et al., 2019) or XLM (Conneau and Lample, 2019). At the same time, modelling techniques for individual word senses, concepts and named entities have also gained traction (Camacho-Collados et al., 2016; Scarlini et al., 2020a), though their integration into downstream NLP applications is still the subject of ongoing investigation (Li and Jurafsky, 2015).
The requirement of massive amounts of training data and the lack of interpretability hinder most of the above-mentioned approaches. To address these limits, we introduce Conception, a novel knowledge-based technique for modelling concepts and named entities through concepts and named entities. Our approach places multilinguality at its core by leveraging the mutually-reinforcing information coming from different languages, enabling seamless and robust cross-lingual scaling, while also providing explicit and easily interpretable semantic dimensions. In contrast to most word-based embeddings, in Conception: i. each component in a vector represents a (weighted) concept or named entity, therefore making our representations fully interpretable; ii. vector representations are explicitly linked to BabelNet (Navigli and Ponzetto, 2012a), a multilingual semantic network which provides coverage for words and multiword expressions in 284 languages; iii. each concept and named entity is defined as a language-independent unit, so the same representation can be used across languages.
We evaluate Conception on multilingual and cross-lingual Semantic Word Similarity, finding that our approach outperforms supervised, unsupervised and knowledge-based state-of-the-art techniques for both sparse and dense vector representations. Furthermore, we show that these improvements translate into the downstream task of Word Sense Disambiguation, where Conception surpasses the state of the art among supervised and knowledge-based techniques, showing that our semantics-first representations contain meaningful information even when compared against BERT-based techniques.

Related Work
Multilingual word embeddings. The advantages of multilinguality in representation learning were first noticed by Mikolov et al. (2013b), who exploited similarities in the structures of the distributional spaces of different languages to learn cross-lingual word embeddings by taking advantage of purposely-built parallel vocabularies. Since then, multilinguality has become increasingly important in learning robust representations: Faruqui and Dyer (2014) used canonical correlation analysis to project independently-constructed distributional spaces for two languages onto a common space; Ammar et al. (2016) extended previous work to over fifty languages; Smith et al. (2017) reduced the need for bilingual supervision by compiling a pseudo-dictionary from the identical strings that appear in two languages; Jawanpuria et al. (2019) proposed a geometric approach to embedding alignment that leverages language-specific transformations; Singhal et al. (2019) learned multilingual word embeddings from image-text data. While these methods still require annotated cross-lingual data or parallel vocabularies, Conneau et al. (2017) and Artetxe et al. (2018) found success by employing unsupervised methods and adversarial training.
Contextualized word embeddings. The above-mentioned approaches produce word-level representations that are independent of the specific context a word appears in, and such "static" representations often show a strong bias towards the most frequent sense of a word. Instead, context-aware word representation techniques, such as context2vec (Melamud et al., 2016) or ELMo (Peters et al., 2018), dynamically create a representation for a word in its sentential or document-level context. Contextualized embeddings have witnessed a dramatic rise in popularity thanks to the advent and wide availability of language models pretrained on massive amounts of text, such as BERT (Devlin et al., 2019), immediately followed by multilingual language models, such as m-BERT and XLM (Conneau and Lample, 2019). Contextualized word embeddings are able to capture the many facets of a polysemous word in context (Pilehvar and Camacho-Collados, 2019), but their implicitly encoded meanings are still disconnected from human-curated knowledge bases, even if more recent efforts have shown promising results in imparting structured semantic knowledge to contextualized representations (Peters et al., 2019; Levine et al., 2019).
A further step towards semantic representations involves modelling individual word senses as vectors which are explicitly linked to a knowledge resource. Early approaches to sense embeddings adapted existing work to project words and word senses onto a shared distributional space (Iacobacci et al., 2015; Iacobacci and Navigli, 2019), while more recent studies exploited the inner states of pretrained language models (Loureiro and Jorge, 2019; Scarlini et al., 2020a). Instead of modelling language-specific units like words or senses, NASARI (Camacho-Collados et al., 2016) represents language-independent concepts using sparse lexical vectors. Its most notable shortcoming, however, is that each lexical vector is built from a single source language, and therefore each concept has a separate representation depending on the source language of choice. While the NASARI lexical vectors provide language-specific representations for language-independent concepts, Camacho-Collados et al. (2016) also proposed a "unified" variant where the vector dimensions are concepts obtained by semantically clustering the words of the corresponding lexical vector based on the hypernymy relation. However, such representations still start from a single language, therefore failing to exploit the multilingual content available in resources such as BabelNet, and are not interrelated with one another. With Conception, we tackle all these issues and propose an integrated, multilingually-enhanced representation of concepts and entities.

Preliminaries
Conception relies on the concept inventory of BabelNet and the lexical vectors of NASARI to build its representations, both of which we introduce hereafter.

BabelNet (Navigli and Ponzetto, 2012a) is a multilingual semantic network that brings together heterogeneous resources, such as Wikipedia, WordNet, and Open Multilingual WordNet, with 284 languages supported in the current version 4.0. Each node in the BabelNet graph represents a concept or named entity and is defined as a multilingual synset, i.e., the set of synonymous lexicalizations used in different languages to express the same concept or named entity. For example, the concept MOTOR VEHICLE is defined as the multilingual synset containing the terms { car_EN, motorcar_EN, coche_ES, voiture_FR, macchina_IT, ..., 自動車_JP }. In BabelNet, synsets are connected to other synsets through a variety of relations, from hypernymy (generalization, or is-a) to hyponymy (specialization, or has-kind), from meronymy (part-whole) to antonymy (opposite-of), together with general relatedness relations extracted from Wikipedia page links, among others. While some relation types may arguably be considered more important than others, for the sake of simplicity we do not distinguish between synset relation types.
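To make this data model concrete, the following minimal Python sketch shows one way to represent a multilingual synset with untyped relations; the class, its field names and the synset id are our own illustration and do not reflect the actual BabelNet API.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    """A BabelNet-style multilingual synset: one language-independent
    concept with its lexicalizations and (untyped) related synsets."""
    synset_id: str
    # Lexicalizations keyed by language code, e.g. "EN" -> {"car"}.
    lexicalizations: dict = field(default_factory=dict)
    # Ids of related synsets; per Section 3, relation types are not
    # distinguished.
    related: set = field(default_factory=set)

motor_vehicle = Synset(
    synset_id="bn:motor_vehicle",  # hypothetical id, for illustration only
    lexicalizations={
        "EN": {"car", "motorcar"},
        "ES": {"coche"},
        "FR": {"voiture"},
        "IT": {"macchina"},
        "JP": {"自動車"},
    },
)
```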
NASARI (Camacho-Collados et al., 2016), as previously mentioned, represents a concept c in the form of a sparse vector v_c^l whose components are the weights of lexical items (words and multiword expressions) expressed in a given language l. The weight of a lexical item w is computed as its lexical specificity (Lafon, 1980) in the subcorpus of Wikipedia articles which define c and its related concepts using language l. Lexical specificity is based on the hypergeometric distribution over word frequencies in such corpora and is computed as

$$\mathrm{spec}(T, t, F, f) = -\log_{10} P(X \geq f)$$

where X is a random variable following a hypergeometric distribution over the word frequencies, T and t are the sizes of Wikipedia and the subcorpus, respectively, and F and f are the frequencies of w in the two respective corpora. For each language l in Wikipedia, NASARI can produce a distinct representation v_c^l of a concept c. However, since the vocabularies of different languages are mostly non-overlapping, the components or lexical items of v_c^l and v_c^{l'} cannot be directly compared across any two languages l and l'. Notably, v_c^l may include knowledge that is missing from v_c^{l'} and, at the same time, the lexical items of v_c^{l'} can help disambiguate the lexical items of v_c^l, as observed by Navigli and Ponzetto (2012b).
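As a concrete reading of the formula above, the sketch below computes lexical specificity with SciPy's hypergeometric survival function. This is our own rendering of Lafon's measure under the stated definitions, not the NASARI reference implementation.

```python
import math

from scipy.stats import hypergeom

def lexical_specificity(T: int, t: int, F: int, f: int) -> float:
    """Lexical specificity of a word: -log10 P(X >= f), where X counts
    occurrences of a word with corpus frequency F inside a subcorpus of
    size t sampled from a corpus of size T (hypergeometric null model)."""
    # hypergeom(M=T, n=F, N=t); sf(f - 1) gives P(X >= f).
    p_at_least_f = hypergeom.sf(f - 1, T, F, t)
    return -math.log10(p_at_least_f) if p_at_least_f > 0 else math.inf

# Example: a word occurring 500 times in a 10M-token Wikipedia and
# 40 times in a 50k-token subcorpus is highly specific to the latter.
print(lexical_specificity(10_000_000, 50_000, 500, 40))
```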

Conception
The key innovation we put forward is that, with Conception, multilinguality is an integral part of the learning process: instead of deriving concept representations from within a single language, we leverage the mutually-reinforcing information available across languages to create human-readable and language-independent concept-level representations. Our approach results in representations where each concept c is described by a vector where each dimension corresponds to a concept: a larger magnitude for the i-th component denotes a stronger relation between c and the i-th concept c_i, that is, each concept is described by the concepts it is most related to. By modelling individual concepts rather than words, Conception does not suffer from the conflation of senses that affects word-level representations, such as word2vec and GloVe. At the same time, since concepts are language-independent units, using them as the dimensions of our representations addresses the language-specificity issue of other sparse representations such as the NASARI lexical vectors (see Section 3).
In the remainder of this Section, we describe the four steps of Conception (a running example, discussed in what follows, is shown in Figure 1).

Retrieving concepts from any language
Starting from the NASARI lexical vectors (available in 5 languages, namely English, French, German, Italian, and Spanish), the first step of Conception obtains sparse vectors whose dimensions are concepts instead of words or multiword expressions. Given a concept c and a language l, let v_c^l be the lexical vector representing c in l. For each such vector v_c^l and for each non-null lexical item w in v_c^l, we consider each concept c' that has a lexicalization w in l according to BabelNet, and score the relevance of c' with respect to c as:

$$\mathrm{score}_l(c' \mid c) = \frac{1}{R_l(w \mid c)}$$

where R_l(w | c) is the ranking of the lexical item w among the components of v_c^l sorted by decreasing magnitude. The language-independent aggregated score of c' with respect to c is computed across the set of source languages L as follows:

$$\mathrm{SCORE}(c' \mid c) = \sum_{l \in L} \mathrm{score}_l(c' \mid c)$$

As a result, for a given concept c, we can create an initial semantic vector v_c whose components are v_c[c'] = SCORE(c'|c), for each c' in BabelNet. This step does not disambiguate lexical items, so a score is assigned to all the concepts collected from the items of any lexical vector v_c^l, leading to noisy representations that include undesired concept relations. For example, if the English lexical representation of the THEATER PLAY concept includes the lexical item play_EN, then the initial Conception representation of THEATER PLAY will have a positive score for the dimension corresponding to THEATRICAL WORK, but also for MATCH and PERFORM MUSIC, which are clearly unrelated to THEATER PLAY, as shown in Figure 1 (top left; dimensions related/unrelated to THEATER PLAY are shown in green/red for illustrative purposes).
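The sketch below mirrors our reconstruction of this first step: every candidate concept of every non-null lexical item receives the reciprocal of the item's rank, summed over the source languages. The function and parameter names are illustrative, and `senses` stands in for a BabelNet word-to-synset lookup.

```python
from collections import defaultdict

def initial_vector(lexical_vectors, senses):
    """Step 1 (Section 4.1, as reconstructed above): build the initial
    semantic vector v_c by scoring every candidate concept of every
    lexical item with the reciprocal of the item's rank, aggregated
    over all source languages.

    lexical_vectors: language -> lexical items of v_c^l, sorted by
                     decreasing weight.
    senses:          callable (word, language) -> set of concept ids,
                     a stand-in for the BabelNet word-to-synset lookup.
    """
    v_c = defaultdict(float)
    for l, items in lexical_vectors.items():
        for rank, w in enumerate(items, start=1):
            for c_prime in senses(w, l):
                # score_l(c'|c) = 1 / R_l(w|c), summed across languages
                v_c[c_prime] += 1.0 / rank
    return dict(v_c)
```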

Cross-lingual concept disambiguation
The objective of the second step of Conception is, therefore, to refine the previously created vectors by excluding all the unsuitable components that were included due to lexical ambiguity. To do this, we exploit multilinguality so as to retain only those concept components that span across languages.
Consider again a concept c and its lexical vectors, one for each language. Then, for each lexical item w whose score is non-null in any lexical vector of c, we assume that the most relevant meaning c' of w with respect to c will appear the largest number of times across the different lexical vectors of c. More formally, we define the language span SPAN(c'|c) of c' with respect to c as the number of languages where c' appears as the meaning of a word w with a non-zero score in the lexical vectors of c:

$$\mathrm{SPAN}(c' \mid c) = \left| \{\, l \in L : \exists w \in V_l \text{ s.t. } v_c^l[w] > 0 \wedge c' \in \mathrm{SENSES}(w \mid l) \,\} \right|$$

where V_l is the vocabulary of the words in language l, and SENSES(w|l) is the set of possible meanings of w in l. We can then build a new filtered vector \hat{v}_c where each component c' is zeroed if it is deemed unrelated to c:

$$\hat{v}_c[c'] = \begin{cases} v_c[c'] & \text{if } c' \in \operatorname{argmax}_{c'' \in \mathrm{SENSES}(w \mid l)} \mathrm{SPAN}(c'' \mid c) \text{ for some } w \in V_l,\ l \in L \text{ with } v_c^l[w] > 0 \\ 0 & \text{otherwise} \end{cases}$$

The resulting vector \hat{v}_c is a more accurate version of v_c in that some of its components have been zeroed based on the disambiguation of each word across languages. Following the previous example, the ambiguous word tragedy_EN from the English lexical vector can be disambiguated thanks to drame_FR from French. As shown in Figure 1 (top right), these two words share only the DRAMA meaning across languages, therefore Conception zeroes out all the other false positive components in the representation of THEATER PLAY.
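Continuing the illustration, here is a sketch of the cross-lingual filter under our reconstruction: for every lexical item, only the candidate sense(s) with the widest language span survive. Again, `senses` is a stand-in for the BabelNet lookup.

```python
from collections import defaultdict

def filter_vector(v_c, lexical_vectors, senses):
    """Step 2 (Section 4.2, as reconstructed above): zero out components
    of v_c whose concepts never win the language-span vote for any
    lexical item of c."""
    # SPAN(c'|c): number of languages in which c' is a candidate sense
    # of some word with a non-zero score in the lexical vectors of c.
    span = defaultdict(int)
    for l, items in lexical_vectors.items():
        for c_prime in {c for w in items for c in senses(w, l)}:
            span[c_prime] += 1
    # For each word, keep only the sense(s) with maximal language span.
    keep = set()
    for l, items in lexical_vectors.items():
        for w in items:
            candidates = senses(w, l)
            if candidates:
                best = max(span[c] for c in candidates)
                keep |= {c for c in candidates if span[c] == best}
    return {c: score for c, score in v_c.items() if c in keep}
```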

Exploiting concept relations
After selecting the most important dimensions for each semantic vector, Conception takes advantage of the semantic relations defined in the BabelNet graph in order to directly inject explicit semantic knowledge into the representations. More formally, given a concept c and its vector \hat{v}_c, for each concept c' corresponding to a component of \hat{v}_c, Conception takes into account the value in \hat{v}_c of the neighboring concepts c_n ∈ N(c') of c' in the BabelNet graph:

$$n(c' \mid c) = \frac{1}{|N(c')|} \sum_{c_n \in N(c')} \hat{v}_c[c_n]$$

Then, for each concept c, we create a new vector \tilde{v}_c from \hat{v}_c by adding the neighbors scored as above:

$$\tilde{v}_c[c'] = \hat{v}_c[c'] + n(c' \mid c)$$

for each concept c', independently of whether \hat{v}_c[c'] is 0 or not. As a result of this step, semantic information that was previously left unexpressed is made explicit, resulting in a new vector \tilde{v}_c where the number of non-zero dimensions is larger than in \hat{v}_c. For instance, the representation of THEATER PLAY is now richer, as it includes WORK OF ART as a hypernym of THEATRICAL WORK, and ACT as a meronym of DRAMA (see Figure 1, bottom left).
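Under our reconstruction of the two formulas above, the enrichment step can be sketched as follows, with `neighbors` abstracting the BabelNet adjacency lookup; note that a concept absent from the filtered vector can still receive a non-zero score through its neighbors.

```python
def inject_relations(v_hat, neighbors):
    """Step 3 (Section 4.3, as reconstructed above): build tilde-v_c from
    hat-v_c by adding, for each candidate concept c', the average hat-v_c
    weight of its BabelNet neighbors. `neighbors` is a callable mapping a
    concept id to the set of its adjacent concept ids."""
    # Candidates: current components plus their direct neighbors, since
    # tilde-v_c[c'] may become non-zero even when hat-v_c[c'] is zero.
    candidates = set(v_hat)
    for c_prime in v_hat:
        candidates |= set(neighbors(c_prime))
    v_tilde = {}
    for c_prime in candidates:
        ns = neighbors(c_prime)
        n_score = sum(v_hat.get(c_n, 0.0) for c_n in ns) / len(ns) if ns else 0.0
        value = v_hat.get(c_prime, 0.0) + n_score
        if value > 0.0:
            v_tilde[c_prime] = value
    return v_tilde
```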

Symmetrizing concept representations
The previous step is "local" in that the injection of semantic knowledge into a concept representation does not depend on the representation of any other concept. In this final step, Conception enhances each concept representation with information contained in the representations of other concepts.
Let G = (V, E) be a directed weighted graph where V is a set of concepts and E is a set of weighted relations between pairs of concepts. The vectors we have created so far can be seen as the weighted adjacency lists of each concept in the graph G. Given a concept c ∈ V and its representation \tilde{v}_c, if \tilde{v}_c[c'] > 0, then there exists a relationship edge e = (c, c') ∈ E. We assume that, if e exists, then there should also exist an edge ē = (c', c) ∈ E, that is, \tilde{v}_{c'}[c] should be non-zero. If ē already exists in \tilde{v}_{c'}, we increase the weight of ē, otherwise we connect c' to c by creating ē in the semantic vector of c'. In both cases, the updated weight of ē depends on its previous weight (possibly null) and the importance of c relative to c':

$$\tilde{v}_{c'}[c] = \tilde{v}_{c'}[c] + f(c, c') \cdot \tilde{v}_c[c']$$

where f(c, c') is the ratio between the weights of the two concepts c and c', f(c, c') = σ(c) / σ(c'), and σ(c) computes the importance of a concept c over the whole BabelNet graph based on its weights in the vectors built as explained in Section 4.3:

$$\sigma(c) = \sum_{c'' \in V} \tilde{v}_{c''}[c]$$

Getting back to our example, let the value of the GLOBE THEATRE dimension in the representation of THEATER PLAY be non-null. As shown in Figure 1 (bottom right), this step "connects" GLOBE THEATRE and THEATER PLAY by increasing the (possibly null) score of the THEATER PLAY dimension in the representation of the GLOBE THEATRE concept.
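A sketch of the symmetrization under our reconstruction: every edge (c, c') induces or reinforces the reverse edge (c', c), scaled by the importance ratio f(c, c').

```python
from collections import defaultdict

def symmetrize(vectors):
    """Step 4 (Section 4.4, as reconstructed above): for every edge
    (c, c'), create or reinforce the reverse edge (c', c), weighted by
    the importance ratio f(c, c') = sigma(c) / sigma(c')."""
    # sigma(c): global importance of c, i.e. the total mass assigned to
    # c across all the vectors built in Section 4.3.
    sigma = defaultdict(float)
    for v in vectors.values():
        for c_prime, weight in v.items():
            sigma[c_prime] += weight
    # Snapshot the edges first so that updates do not feed back into
    # the same pass.
    edges = [(c, c_p, w) for c, v in vectors.items() for c_p, w in v.items()]
    for c, c_prime, weight in edges:
        if sigma[c_prime] > 0.0:
            f = sigma[c] / sigma[c_prime]
            reverse = vectors.setdefault(c_prime, {})
            reverse[c] = reverse.get(c, 0.0) + f * weight
    return vectors
```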

Semantic Word Similarity
We evaluate Conception on the Semantic Word Similarity task across 6 languages for a total of 6 multilingual and 10 cross-lingual datasets. Semantic Word Similarity is one of the most popular intrinsic benchmarks for the evaluation of representation techniques. Given two lexical items (words, multiword expressions or named entities), the task involves measuring their semantic closeness.

Experimental Setup
The evaluation of word-level representations in Semantic Word Similarity is often straightforward, since the semantic closeness of two lexical items can be measured directly by comparing the two corresponding representations. Instead, the application of concept-level representations like Conception's to word similarity requires consideration of all the possible senses of the two lexical items to be compared.
Word comparison. In the context of word sense and concept representations, the semantic distance between two words is traditionally computed as the similarity between their closest senses (Resnik, 1995; Budanitsky and Hirst, 2006). We use a variant of this comparison strategy so as to give more importance to the more frequent senses of a word:

$$\mathrm{sim}(w_1, w_2) = \max_{c_{w_1},\, c_{w_2}} \frac{\mathrm{SIM}(c_{w_1}, c_{w_2})}{R_s(c_{w_1}, w_1) \cdot R_s(c_{w_2}, w_2)}$$

which computes the maximum similarity when considering all pairs of concepts c_{w_1} and c_{w_2} for words w_1 and w_2, where R_s(c_w, w) is the ranking of c_w among the senses of w sorted by decreasing value of σ(c_w) (see Section 4.4). We measure the semantic closeness of two senses (SIM in the above formula) using the square-rooted Absolute Weighted Overlap (Camacho-Collados et al., 2016) on their sparse vector representations.
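The following sketch renders this comparison strategy as we reconstructed it, with the square-rooted Weighted Overlap as the sense-level similarity; both functions are illustrative renderings rather than the authors' reference code.

```python
import math

def weighted_overlap(v1, v2):
    """Square-rooted Absolute Weighted Overlap between two sparse
    vectors, following our reading of Camacho-Collados et al. (2016)."""
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0
    rank1 = {c: r for r, c in enumerate(sorted(v1, key=v1.get, reverse=True), 1)}
    rank2 = {c: r for r, c in enumerate(sorted(v2, key=v2.get, reverse=True), 1)}
    num = sum(1.0 / (rank1[c] + rank2[c]) for c in overlap)
    den = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return math.sqrt(num / den)

def word_similarity(senses_w1, senses_w2, vectors):
    """Rank-discounted maximum sense-pair similarity, as reconstructed
    above. senses_w1/senses_w2: concept ids sorted by decreasing sigma;
    vectors: concept id -> sparse Conception vector."""
    best = 0.0
    for r1, c1 in enumerate(senses_w1, start=1):
        for r2, c2 in enumerate(senses_w2, start=1):
            sim = weighted_overlap(vectors[c1], vectors[c2]) / (r1 * r2)
            best = max(best, sim)
    return best
```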
Comparison systems. Vector representations for words, word senses and concepts can be split into two categories: sparse and dense representations. Conception, NASARI lexical and NASARI unified (see Sections 2 and 3) belong to the former category, so they are the most natural competitors in a comparison. However, over the last few years, dense vector representations have emerged as the most empirically effective type of representation for capturing syntactic and semantic relations between words. For this reason, we also compare Conception with the current state of the art in multilingual dense vector representation techniques. We include in the comparison multilingual word embeddings from the works of Conneau et al. (2017), created by aligning fastText embeddings in a unified space, Jawanpuria et al. (2019), obtained with language-specific transformations, and Speer et al. (2017, Conceptnet Numberbatch), built by retrofitting pre-trained word embeddings to the multilingual ConceptNet graph. We include both Conceptnet Numberbatch 19.08, which is the latest version of the embeddings, and Conceptnet Numberbatch SE17, which uses a complex strategy for out-of-vocabulary (OOV) words. To set a level playing field for all the systems, we assign the same score (0.5) to any OOV word pair.

Multilingual Word Similarity
Datasets. We evaluate Conception on SemEval-2017 Task 2.a (Camacho-Collados et al., 2017), a challenging multilingual word similarity benchmark that provides hundreds of word pairs in 5 languages, namely English, Farsi, German, Italian and Spanish. SemEval-2017 is the ideal test bed for Conception since it features a low-resource language (Farsi) and, unlike other popular datasets such as SimLex-999 (Hill et al., 2015) and its translations, it also includes multiword expressions and named entities, which are difficult to model with word representations and are therefore often treated separately or ignored.
Results. Table 1a reports the average Pearson and Spearman correlation performance of Conception and all comparison systems on 5 languages (detailed per-language results are reported in the Appendix).
Conception outperforms NASARI (both lexical and unified), the current state of the art in sparse representations of concepts, by a remarkable margin (+5% across all languages). Our sparse vectors also outperform the state-of-the-art dense vectors of Conceptnet Numberbatch (+5% across all languages), while also providing considerably wider lexical coverage (+11% in Table 1a, last column). More generally, as long as BabelNet can provide a non-empty word-to-sense mapping, Conception can gracefully scale across 284 languages.

Cross-lingual Word Similarity
Datasets. We also evaluate Conception on SemEval-2017 Task 2.b (Camacho-Collados et al., 2017, Cross-lingual Semantic Word Similarity). This task is similar to the multilingual Word Similarity task described in Section 5.2, with the key difference that the two lexical items to be compared belong to different languages. SemEval-2017 includes 10 cross-lingual datasets covering 5 languages, namely English, German, Spanish, Italian and Farsi. Each dataset contains around 1,000 entries that compare words, multiword expressions and named entities across the aforementioned languages.
Results. Conception provides a notable increase in correlation performance over the state-of-the-art sparse vector representations of NASARI (both lexical and unified) across all the cross-lingual datasets of the task, averaging a 5% and a 6% absolute improvement in Pearson and Spearman correlation, respectively, as shown in Table 1b (see the Appendix for the complete, consistently state-of-the-art pairwise figures). Furthermore, our sparse vector representations outperform the state-of-the-art dense vector representations of Conceptnet Numberbatch (CNNB), Conneau et al. (2017) and Jawanpuria et al. (2019) in every language pair of the task except English-German, where the results are comparable with those of CNNB. The difference in performance is remarkable in the evaluations that involve Farsi, which once again highlights the robustness of Conception on low-resource languages. In Table 1b, we also report the score of XLM (Conneau and Lample, 2019), a language model trained with an explicit cross-lingual objective.

Analysis
Ablation study. In order to better appreciate the contribution of each step of the Conception algorithm to the final results, we analyzed the difference in performance between NASARI and the representations created after a) selecting the concepts through cross-lingual disambiguation (Section 4.2), b) exploiting the semantic relations in the BabelNet graph to inject knowledge (Section 4.3), and c) symmetrizing concept relations (Section 4.4), i.e., the vectors produced at the end of the Conception algorithm. Table 1a (bottom) reports a comparison of the results of the above representations in the multilingual word similarity task of SemEval-2017. Each step of the Conception algorithm incrementally improves over the representations of the previous step, and this is particularly evident for the last step. We observed comparable improvements in the cross-lingual setting.
Case study. We conducted an in-depth analysis of the correlation performance obtained by Conception, CNNB, and the NASARI lexical vectors on the SemEval-2017 Italian benchmark (subtask 2.a), with the aim of understanding where Conception makes the difference. Conception shines in capturing semantic similarity between word pairs with a high gold similarity score (≥ 0.75). In contrast to Conception, for example, CNNB struggles to capture semantic similarity in synonym pairs (sim_gold = 1.0) which involve highly-ambiguous words such as schermo_IT - monitor_IT (sim_pred < 0.5), multiword expressions such as sclerosi multipla_IT - sclerosi a placche_IT (sim_pred = 0.5), or named entities such as DeepMind_IT - Google DeepMind_IT (sim_pred = 0.5). Another representative example is the synonym pair borsa_IT - mercato azionario_IT, where the first term means either HANDBAG or STOCK MARKET, while the second term assumes the latter meaning only. In Table 2 (right), we show how the two concepts of HANDBAG and STOCK MARKET are modelled separately by Conception based on their closest meanings. The STOCK MARKET meaning of borsa_IT enables Conception to capture the synonymy of borsa_IT - mercato azionario_IT (sim_pred = sim_gold = 1.0). In contrast, CNNB fails to do so (sim_pred = 0.42): the reason is that, in a similar vein to other word-level representations, CNNB conflates the various meanings of an ambiguous word into a single vector where the predominant meaning may overshadow the others. In our example, the STOCK MARKET concept is overshadowed by the HANDBAG concept in CNNB's word representation of borsa_IT (Table 2, left). Conversely, Conception models the two concepts independently (Table 2, right), and is consequently able to correctly capture similarity in more difficult settings.

Word Sense Disambiguation
Word Sense Disambiguation (WSD), the task of assigning the correct meaning to a target word in context, is considered a fundamental step towards natural language understanding (Navigli, 2018). As with many other tasks, WSD has benefited greatly from recent advances in other areas, such as language modelling (Scarlini et al., 2020b), game theory (Tripodi and Navigli, 2019), structured knowledge integration, definition modelling and label propagation (Barba et al., 2020), inter alia. Our experiments show that Conception can be used to create state-of-the-art sense embeddings, demonstrating empirically that our approach provides high-quality knowledge that is still not captured by recent language models.
Experimental setup. We start from state-of-the-art, precomputed sense embeddings and adopt a simple strategy to enrich such representations with Conception in order to evaluate its effectiveness in WSD. First, we create an embedding e_c for a concept c by averaging the precomputed embeddings e_s of each word sense s that can be used to express c:

$$e_c = \frac{\sum_{s \in c} e_s}{|\{s \in c\}|}$$

Then, given a word sense s and its corresponding concept c, we build a new word sense embedding e'_s by adding to e_s each concept embedding e_{c'} weighted by the ranking of the concept c' in the Conception representation of c:

$$e'_s = e_s + \alpha \sum_{c'} \frac{e_{c'}}{R(c', v_c)} \qquad (1)$$

where α = 0.5, and R(c', v_c) is the ranking of the concept c' in the Conception representation of c.
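A sketch of this enrichment, matching our reconstruction of Eq. 1; `sense_emb` and `concept_emb` stand in for the precomputed LMMS or SensEmBERT vectors and the averaged concept embeddings, respectively.

```python
import numpy as np

def concept_embedding(senses_of_c, sense_emb):
    """e_c: average of the precomputed embeddings of the senses of c."""
    vecs = [sense_emb[s] for s in senses_of_c if s in sense_emb]
    return np.mean(vecs, axis=0) if vecs else None

def enrich_sense_embedding(e_s, v_c, concept_emb, alpha=0.5):
    """Eq. 1, as reconstructed above: add to a precomputed sense
    embedding the concept embeddings of the components of v_c,
    each discounted by its rank R(c', v_c)."""
    e_prime = np.asarray(e_s, dtype=float).copy()
    # Components of v_c sorted by decreasing weight define the ranking R.
    ranked = sorted(v_c, key=v_c.get, reverse=True)
    for rank, c_prime in enumerate(ranked, start=1):
        if c_prime in concept_emb:
            e_prime += alpha * concept_emb[c_prime] / rank
    return e_prime
```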
Comparison systems. We consider the following state-of-the-art sense embeddings: LMMS (Loureiro and Jorge, 2019), a supervised technique that combines BERT contextualized embeddings with knowledge from WordNet; SensEmBERT (Scarlini et al., 2020a), a knowledge-based approach that enriches BERT contextualized embeddings with knowledge from Wikipedia, BabelNet and the NASARI vectors; and BERT Large, as reported by Loureiro and Jorge (2019). We compare the above embeddings against those obtained when enriching the LMMS and SensEmBERT embeddings with Conception based on Eq. 1 (Conception LMMS and Conception SensEmBERT hereafter). For all embeddings, WSD is performed as is customary in the literature: given a word w in context, we choose the sense s of w whose vector is closest to the contextual BERT representation of w according to cosine similarity. We also include KnowBERT (Peters et al., 2019), a language model which exploits multiple knowledge bases.
Datasets. We empirically assess our sense embeddings on the unified evaluation framework for English WSD proposed by Raganato et al. (2017b), which comprises Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013), and SemEval-2015 (Moro and Navigli, 2015).
Results. Table 3 shows that, on the concatenation of all the English WSD datasets, Conception LMMS surpasses the state-of-the-art supervised representations of LMMS (+1.0% F1), while Conception SensEmBERT outperforms the state-of-the-art knowledge-based representations of SensEmBERT (+1.1% F1). In particular, we would like to highlight three main findings: i) Conception encodes a non-trivial amount of knowledge that is not included in LMMS, even though the latter already exploits BERT and the WordNet semantic graph; ii) Conception builds its representations from NASARI, which only covers nominal concepts, and nevertheless it also produces robust representations for verbal concepts (+1.6% F1 score over LMMS); iii) while Conception and SensEmBERT are both knowledge-based techniques relying on BabelNet, the former is still able to inject meaningful knowledge into the latter. In general, the knowledge included by Conception is orthogonal to that of existing sense representations. While we have opted here to enrich existing sense embeddings with a simple yet effective technique, we envisage that more sophisticated uses of Conception can lead to further improvements in WSD.

Conclusion
In this paper we presented Conception, a novel knowledge-based technique for modelling concepts and named entities. Its key innovation lies in setting multilinguality as the cornerstone of the learning process to build language-agnostic and human-readable concept vector representations.
Evaluated across multiple multilingual and cross-lingual Semantic Word Similarity datasets, Conception achieves state-of-the-art results not only against concept representations such as NASARI, but also against multilingual word embeddings such as Conceptnet Numberbatch and cross-lingual language models such as XLM. Additionally, our concept representations are particularly robust on resource-poor languages, like Farsi, along the lines of recent work in Semantic Parsing and Semantic Role Labeling aimed at bridging the gap between languages (Blloshmi et al., 2020). Finally, Conception can be seamlessly applied to a downstream task: in Word Sense Disambiguation, it improves over state-of-the-art supervised and knowledge-based sense embeddings, showing that Conception encodes information that is still not captured by BERT-based contextualized representations. Furthermore, our approach produces much more than concept representations: since each concept is described by the relationships it has with other concepts, Conception can be seen as a weighted directed graph where each node is a concept whose vector representation is also its weighted adjacency list. This paves the way to a whole set of possible applications for Conception: from graph-based concept embeddings to semantics-first sentence and document representations. The complete set of concept vectors is available at https://github.com/SapienzaNLP/conception.

A Multilingual Word Similarity

Table 4 includes the complete results over all the 5 languages in the Multilingual Word Similarity subtask of SemEval-2017. As can be seen, Conception shows a much less abrupt decrease in performance when evaluated on Farsi compared to other knowledge-based, purely distributional and knowledge-enhanced approaches such as NASARI (Camacho-Collados et al., 2016), GeoMM (Jawanpuria et al., 2019), and Conceptnet Numberbatch (Speer and Lowry-Duda, 2017), respectively.

B Cross-lingual Word Similarity

Table 5 includes the complete results over all the 10 language pairs in the Cross-lingual Word Similarity subtask of SemEval-2017. Conception achieves a new state of the art in 9 of the 10 language pairs, and, once again, it is particularly robust in language pairs involving Farsi, a low-resource language.

Table 5: Pearson (r) and Spearman (ρ) correlation performance of Conception compared with the current state of the art in the cross-lingual Semantic Word Similarity task of SemEval-2017 (subtask 2.b). The scores of the SemEval-2017 baseline and Conceptnet Numberbatch SE17 are taken directly from Camacho-Collados et al. (2017) and Speer and Lowry-Duda (2017), respectively.