A Unified Multilingual Semantic Representation of Concepts

Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.


Introduction
Semantic representation, i.e., the task of representing a linguistic item (such as a word or a word sense) in a mathematical or machine-interpretable form, is a fundamental problem in Natural Language Processing (NLP). The Vector Space Model (VSM) is a prominent approach for semantic representation, with widespread popularity in numerous NLP applications. The prevailing methods for the computation of a vector space representation are based on distributional semantics (Harris, 1954). However, these approaches, whether in their conventional co-occurrence based form (Salton et al., 1975; Turney and Pantel, 2010; Landauer and Dooley, 2002), or in their newer predictive branch (Collobert and Weston, 2008; Mikolov et al., 2013; Baroni et al., 2014), suffer from a major drawback: they are unable to model individual word senses or concepts, as they conflate different meanings of a word into a single vectorial representation. This hinders the functionality of this group of vector space models in tasks such as Word Sense Disambiguation (WSD) that require the representation of individual word senses. There have been several efforts to adapt and apply distributional approaches to the representation of word senses (Pantel and Lin, 2002; Brody and Lapata, 2009; Reisinger and Mooney, 2010; Huang et al., 2012). However, none of these techniques provides representations that are already linked to a standard sense inventory, and consequently such mapping has to be carried out either manually, or with the help of sense-annotated data. Chen et al. (2014) addressed this issue and obtained vectors for individual word senses by leveraging WordNet glosses. NASARI (Camacho-Collados et al., 2015) is another approach that obtains accurate sense-specific representations by combining the complementary knowledge from WordNet and Wikipedia.
Graph-based approaches have also been successfully utilized to model individual words (Hughes and Ramage, 2007; Yeh et al., 2009), or concepts (Pilehvar et al., 2013; Pilehvar and Navigli, 2014), drawing on the structural properties of semantic networks. The applicability of all these techniques, however, is usually either constrained to a single language (usually English), or to a specific task.
We put forward MUFFIN (Multilingual, UniFied and Flexible INterpretation), a novel method that exploits both structural knowledge derived from semantic networks and distributional statistics from text corpora to produce effective representations of individual word senses or concepts. Our approach provides multiple advantages in comparison to previous VSM techniques:

1. Multilingual: it enables sense representation in dozens of languages;

2. Unified: it represents a linguistic item, irrespective of its language, in a unified semantic space having concepts as its dimensions, permitting direct comparison of different representations across languages, and hence enabling cross-lingual applications;

3. Flexible: it can be readily applied to different NLP tasks with minimal adaptation.
We evaluate our semantic representation on two different tasks in lexical semantics: semantic similarity and Word Sense Disambiguation. To assess the multilingual capability of our approach, we also perform experiments on languages other than English on both tasks, and across languages for semantic similarity. We report state-of-the-art performance on multiple datasets and settings in both frameworks, which confirms the reliability and flexibility of our representations.

Methodology
Figure 1 illustrates our procedure for constructing the vector representation of a given concept. We use BabelNet (version 2.5) as our main sense repository. BabelNet (Navigli and Ponzetto, 2012a) is a multilingual encyclopedic dictionary which merges WordNet with other lexical resources, such as Wikipedia and Wiktionary, thanks to its use of an automatic mapping algorithm. BabelNet extends the WordNet synset model to take multilinguality into account: a BabelNet synset contains the words that, in the various languages, express the given concept.
Our approach for modeling a BabelNet synset consists of two main steps. First, for the given synset we gather contextual information from Wikipedia by exploiting knowledge from the BabelNet semantic network (Section 2.1). Then, by analyzing the corresponding contextual information and comparing and contrasting it with the whole Wikipedia corpus, we obtain a vectorial representation of the given synset (Section 2.2).

A Wikipedia sub-corpus for each concept
Let c be a concept, which in our setting is a BabelNet synset, and let Wc be the set containing the Wikipedia page p corresponding to the concept c and all the Wikipedia pages having an outgoing link to p. We further enrich Wc with the corresponding Wikipedia pages of the hypernyms and hyponyms of c in the BabelNet network. Wc is the set of Wikipedia pages whose contents are exploited to build a representation for the concept c. We refer to the bag of content words in all the Wikipedia pages in Wc as the sub-corpus SCc for the concept c.
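The construction of Wc and SCc can be sketched as follows. The accessor functions (inlinks, related, page_of, text_of) are hypothetical stand-ins for BabelNet and Wikipedia lookups, not part of any actual API:

```python
def build_subcorpus(concept, wiki_page, inlinks, related, page_of, text_of):
    """Sketch of the sub-corpus construction (Section 2.1).
    inlinks(p): pages with an outgoing link to p;
    related(c): hypernyms and hyponyms of concept c in the BabelNet network;
    page_of(c): the Wikipedia page of a concept, or None;
    text_of(p): the bag of content words of page p."""
    # Wc: the page of c, all pages linking to it, and the pages of the
    # hypernyms and hyponyms of c.
    w_c = {wiki_page} | set(inlinks(wiki_page))
    w_c |= {page_of(r) for r in related(concept) if page_of(r) is not None}
    # SCc: the bag of content words over all pages in Wc.
    sub_corpus = []
    for page in w_c:
        sub_corpus.extend(text_of(page))
    return sub_corpus
```

In practice the accessors would be backed by a BabelNet dump and a parsed Wikipedia snapshot; here they are plain callables so the construction logic stands on its own.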

Vector construction: lexical specificity
Lexical specificity (Lafon, 1980) is a statistical measure based on the hypergeometric distribution. Due to its efficiency in extracting a set of highly relevant words from a sub-corpus, the measure has recently gained popularity in different NLP applications, such as textual data analysis (Lebart et al., 1998), term extraction (Drouin, 2003), and domain-based term disambiguation (Camacho-Collados et al., 2014; Billami et al., 2014). We leverage lexical specificity to compute the weights in our vectors. In our earlier work (Camacho-Collados et al., 2015), we conducted different experiments which demonstrated the improvement that lexical specificity can provide over the popular term frequency-inverse document frequency weighting scheme (Jones, 1972, tf-idf). Lexical specificity computes the vector weights for an item, i.e., a word or a set of words, by comparing and contrasting its contextual information with a reference corpus. In our setting, we take the whole Wikipedia as our reference corpus RC (we use the October 2012 Wikipedia dump).
Let T and t be the respective total number of tokens in RC and SCc, and let F and f denote the frequency of a given item in RC and SCc, respectively. Our goal is to compute a weight denoting the association of an item with the concept c. For notational brevity, we use the following expression to refer to positive lexical specificity:

spec(T, t, F, f) = −log10 P(X ≥ f)

where X represents a random variable following a hypergeometric distribution of parameters F, t and T. As we are only interested in a set of items that are representative of the concept being modeled, we follow Billami et al. (2014) and only consider in our final vector the items which are relevant to SCc with a confidence higher than 99% according to the hypergeometric distribution (P(X ≥ f) ≤ 0.01).
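Under the definitions above, the specificity weight and the 99% confidence filter can be computed directly from the hypergeometric tail. The following is a minimal sketch using only the standard library; it performs exact big-integer arithmetic, so it is meant for illustration rather than web-scale corpora:

```python
from math import comb, log10

def specificity(T, t, F, f):
    """Positive lexical specificity: -log10 P(X >= f), where X follows a
    hypergeometric distribution with parameters F, t and T.
    T: tokens in the reference corpus RC; t: tokens in the sub-corpus SCc;
    F: frequency of the item in RC; f: frequency of the item in SCc."""
    # P(X >= f) = sum_{k=f}^{min(F, t)} C(F, k) * C(T-F, t-k) / C(T, t)
    tail = sum(comb(F, k) * comb(T - F, t - k) for k in range(f, min(F, t) + 1))
    return -log10(tail / comb(T, t))

def is_relevant(T, t, F, f, alpha=0.01):
    """Keep only items relevant to SCc with >99% confidence, i.e.
    P(X >= f) <= alpha, which is spec >= -log10(alpha)."""
    return specificity(T, t, F, f) >= -log10(alpha)
```

For corpora of realistic size one would switch to a log-space survival function (e.g., via log-gamma) rather than exact binomial coefficients, but the filtering logic is the same.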
On the basis of lexical specificity we put forward two types of representations: lexical and unified. The lexical vector representation lexc of a concept c has lemmas as its individual dimensions: we apply lexical specificity to every lemma in SCc in order to estimate the relevance of each lemma to our concept c. We use the lexical representation for the task of WSD (see Section 3.2). We describe the unified representation in the next subsection.

Unified representation
Unlike the lexical version, our unified representation has concepts as individual dimensions. Algorithm 1 shows the construction process of a concept's unified vector. The algorithm first clusters together those words that have a sense sharing the same hypernym (h in the algorithm) according to the BabelNet taxonomy (lines 2-4). Next, the specificity is computed for the set of all the hyponyms of h, even those that do not appear in the sub-corpus SCc (lines 6-14). Here, F and f denote the aggregated frequencies of all the hyponyms of h in the whole Wikipedia (i.e., the reference corpus RC) and the sub-corpus SCc, respectively.
Algorithm 1 (sketch): a hypernym h is selected if there exist two distinct lemmas l1, l2 ∈ SCc that are hyponyms of h; for each such h, the frequencies F and f are aggregated over every hyponym hypo of h and each of its lexicalizations lex; the corresponding dimension is weighted as uc(h) ← specificity(T, t, F, f), and the vector uc is returned.

Our binding of a set of sibling words into a single cluster represented by their common hypernym provides two advantages. Firstly, it transforms the representations to a unified semantic space. This space has concepts as its dimensions, enabling their comparability across languages. Secondly, the clustering can be viewed as an implicit disambiguation process, whereby a set of potentially ambiguous words are disambiguated into their intended sense on the basis of the contextual clues of the neighbouring content words, resulting in more accurate representations of meaning.
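The clustering-and-aggregation step of Algorithm 1 can be sketched in a few lines. The data structures here are hypothetical simplifications (a lemma-to-hypernyms dict stands in for the BabelNet taxonomy, and lexicalizations are collapsed into single lemmas); the weighting function is passed in so any specificity implementation can be plugged in:

```python
from collections import Counter, defaultdict

def unified_vector(sub_corpus, ref_corpus, hypernyms, specificity, threshold=2.0):
    """Sketch of the unified vector construction for a concept c.
    sub_corpus / ref_corpus: lists of lemmas (SCc and RC);
    hypernyms: lemma -> set of hypernym concepts (stand-in for BabelNet);
    specificity: weighting function spec(T, t, F, f);
    threshold: 2.0 corresponds to the 99% confidence filter, -log10(0.01)."""
    T, t = len(ref_corpus), len(sub_corpus)
    ref_freq, sub_freq = Counter(ref_corpus), Counter(sub_corpus)
    # Cluster together the sub-corpus lemmas that share a hypernym h.
    clusters = defaultdict(set)
    for lemma in sub_freq:
        for h in hypernyms.get(lemma, ()):
            clusters[h].add(lemma)
    u_c = {}
    for h, hypos in clusters.items():
        if len(hypos) < 2:          # require two distinct hyponyms in SCc
            continue
        # Aggregate frequencies over ALL hyponyms of h, not only those in SCc.
        all_hypos = {l for l, hs in hypernyms.items() if h in hs}
        F = sum(ref_freq[l] for l in all_hypos)
        f = sum(sub_freq[l] for l in all_hypos)
        w = specificity(T, t, F, f)
        if w >= threshold:
            u_c[h] = w
    return u_c
```

The resulting dictionary maps hypernym concepts (the dimensions of the unified space) to their specificity weights.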
Example. Table 1 lists the top-weighted concepts, represented by their relevant lexicalizations, in the unified vectors generated for the bird and machine senses of the noun crane and for three different languages. A comparison of concepts across the two senses indicates the effectiveness of our representation in identifying relevant concepts in different languages, while guaranteeing a clear distinction between the two meanings.

Applications
Thanks to their VSM nature and sense-level functionality, our concept representations are highly flexible, allowing us to apply them to different NLP tasks with minimal adaptation. In this section we explain how we use our representations in the tasks of semantic similarity (Section 3.1) and WSD (Section 3.2).
Associating concepts with words. Given that our representations are for individual word senses, a preliminary step for both tasks is to associate a given word w with its set of concepts, i.e., BabelNet synsets, Cw = {c1, ..., cn}. When w exists in the BabelNet dictionary, we obtain the set of associated senses of the word as defined in the BabelNet sense inventory.
Table 1: Top-weighted concepts, i.e., BabelNet synsets, for the bird and machine senses of the noun crane. We represent each synset by one of its word senses. Word senses marked with the same symbol across languages correspond to the same BabelNet synset.

In order to enhance the coverage in the case of words that are not defined in the BabelNet dictionary, we also exploit the so-called Wikipedia piped links. A piped link is a hyperlink appearing in the body of a Wikipedia article, providing a link to another Wikipedia article. For example, the piped link [[Crane (machine)|dockside crane]] is a hyperlink that appears as dockside crane in the text, but takes the user to the Wikipedia page titled Crane (machine). These links provide Wikipedia editors with the ability to represent a Wikipedia article through a suitable lexicalization that preserves the grammatical structure, contextual coherency, and flow of the sentence. This property provides an effective means of obtaining a set of concepts for the words not covered by BabelNet. In the case of our example, the BabelNet out-of-vocabulary word w = dockside crane will have in its set of associated concepts Cw the BabelNet synset corresponding to the Wikipedia page titled Crane (machine).
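Harvesting these lexicalizations from wiki markup is straightforward. The sketch below assumes the standard MediaWiki piped-link order, [[target page|displayed text]], and builds an index from surface lexicalizations to the page titles they can denote:

```python
import re

# [[target|anchor]]: group 1 is the linked page title, group 2 the surface text
PIPED_LINK = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

def lexicalization_index(wiki_text):
    """Map each surface lexicalization (lower-cased) to the set of Wikipedia
    page titles it links to, e.g. [[Crane (machine)|dockside crane]] maps
    the out-of-vocabulary term "dockside crane" to "Crane (machine)"."""
    index = {}
    for target, anchor in PIPED_LINK.findall(wiki_text):
        index.setdefault(anchor.strip().lower(), set()).add(target.strip())
    return index
```

Aggregated over a full Wikipedia dump, such an index yields the candidate concept set Cw for words absent from the BabelNet dictionary.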

Semantic Similarity
Once we have the set Cw of concepts associated with each word w, we first retrieve the set of corresponding unified vector representations. We then follow Camacho-Collados et al. (2015) and use square-rooted Weighted Overlap (Pilehvar et al., 2013, WO) as our vector comparison method, a metric that has been shown to suit specificity-based vectors better than the conventional cosine. WO compares two vectors on the basis of their overlapping dimensions, which are harmonically weighted by their relative ranking:

WO(v1, v2) = ( Σ_{q ∈ O} (rank(q, v1) + rank(q, v2))⁻¹ ) / ( Σ_{i=1}^{|O|} (2i)⁻¹ )

where O is the set of overlapping dimensions (i.e., concepts) between the two vectors and rank(q, vi) is the rank of dimension q in the vector vi. Finally, the similarity between two words w1 and w2 is calculated as the similarity of their closest senses, a prevailing approach in the literature (Resnik, 1995; Budanitsky and Hirst, 2006):

Sim(w1, w2) = max_{c1 ∈ Cw1, c2 ∈ Cw2} √WO(uc1, uc2)

where w1 and w2 can belong to different languages. This cross-lingual similarity measurement is possible thanks to the unified language-independent space of concepts of our semantic representations.
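The Weighted Overlap comparison and the closest-senses maximization can be sketched over sparse vectors represented as dicts from concept dimensions to weights:

```python
def weighted_overlap(v1, v2):
    """Weighted Overlap (Pilehvar et al., 2013) between two sparse vectors.
    Overlapping dimensions are harmonically weighted by their ranks, which
    are computed within each full vector by decreasing weight (1 = top)."""
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0
    def ranks(v):
        order = sorted(v, key=v.get, reverse=True)
        return {dim: r for r, dim in enumerate(order, start=1)}
    r1, r2 = ranks(v1), ranks(v2)
    num = sum(1.0 / (r1[q] + r2[q]) for q in overlap)
    den = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return num / den

def word_similarity(senses1, senses2):
    """Closest-senses similarity: the maximum square-rooted WO over all
    pairs of unified sense vectors of the two words."""
    return max(weighted_overlap(u1, u2) ** 0.5
               for u1 in senses1 for u2 in senses2)
```

Since the dimensions are language-independent concepts, the two sense lists may come from different languages, which is exactly what enables the cross-lingual setting.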

Multilingual Word Sense Disambiguation
In order to apply our approach to WSD, we use the lexical vector lexc for each concept c. The reason for choosing lexical vectors in this setting is that they enable a direct comparison of a candidate sense's representation with the context, which is in the same lexical form. Algorithm 2 summarizes the general framework of our approach. Given a target word w to disambiguate, our approach proceeds by the following steps:

1. Retrieve Cw, the set of concepts associated with the target word w (line 1);

2. Obtain the lexical vector lexc for each concept c ∈ Cw (cf. Section 2);

3. Calculate, for each candidate concept c, a confidence score (scorec) based on the harmonic sum of the ranks of the overlapping words between its lexical vector lexc and the context of the target word (line 5 in Algorithm 2).

Algorithm 2 MUFFIN for WSD
Input: a target word w and a document d (context of w)
Output: ĉ, the intended sense of w
1: for each concept c ∈ Cw
2:   scorec ← 0
3:   for each lemma l ∈ d
4:     if l ∈ lexc then
5:       scorec ← scorec + rank(l, lexc)⁻¹
6: return ĉ ← argmax_{c ∈ Cw} scorec

Thanks to the use of BabelNet, our approach is applicable to arbitrary languages. For the task of WSD, we focus on two major sense inventories integrated in BabelNet: Wikipedia and WordNet.
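Algorithm 2 can be rendered as a runnable sketch, with candidate senses given as lexical vectors (lemma-to-weight dicts) and the harmonic weighting of ranks described in Section 3.2:

```python
def disambiguate(context_lemmas, candidate_senses):
    """Sketch of Algorithm 2: score each candidate concept by the harmonic
    sum of the ranks of the context lemmas found in its lexical vector.
    candidate_senses: dict mapping concept id -> lexical vector
    (lemma -> weight). Returns (best_concept, score); a caller may back
    off to the most frequent sense when the score is too low."""
    best, best_score = None, float("-inf")
    for c, lex_c in candidate_senses.items():
        # rank of each lemma in lex_c by decreasing weight (1 = top-weighted)
        rank = {l: r for r, l in enumerate(
            sorted(lex_c, key=lex_c.get, reverse=True), start=1)}
        score = sum(1.0 / rank[l] for l in context_lemmas if l in rank)
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```

Lemmas absent from a sense's lexical vector simply contribute nothing, so no smoothing is needed.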
Wikipedia sense inventory. In this case, we obtain the set of candidate senses for a target word by following the procedure described at the beginning of this section (i.e., associating concepts with words). However, we do not consider those BabelNet synsets that are not associated with Wikipedia pages.
WordNet sense inventory. Similarly, when restricted to the WordNet inventory, we discard those BabelNet synsets that do not contain a WordNet synset. In this setting, we also leverage relations from WordNet's semantic network and its disambiguated glosses in order to obtain a richer set of Wikipedia articles in the sub-corpus construction. The enrichment of the semantic network with the disambiguated glosses has been shown to be beneficial in various graph-based disambiguation tasks (Navigli and Velardi, 2005; Pilehvar et al., 2013).

Experiments
We assess the reliability of MUFFIN in two standard evaluation benchmarks: semantic similarity (Section 4.1) and Word Sense Disambiguation (Section 4.2).

Semantic Similarity
As our semantic similarity experiment we opted for word similarity, which is one of the most popular evaluation frameworks in lexical semantics. Given a pair of words, the task in word similarity is to automatically judge their semantic similarity and, ideally, this judgement should be close to that given by humans.

Datasets
Monolingual. We picked the RG-65 dataset (Rubenstein and Goodenough, 1965) as our monolingual word similarity dataset. The dataset comprises 65 English word pairs which have been manually annotated by several annotators according to their similarity on a scale of 0 to 4. We also perform evaluations on the French (Joubarne and Inkpen, 2011) and German (Gurevych, 2005) adaptations of this dataset.

Cross-lingual. Hassan and Mihalcea (2009) developed two sets of cross-lingual datasets based on the English MC-30 (Miller and Charles, 1991) and WordSim-353 (Finkelstein et al., 2002) datasets, for four different languages: English, German, Romanian, and Arabic. However, the construction procedure they adopted, consisting of translating the pairs to other languages while preserving the original similarity scores, has led to inconsistencies in the datasets. For instance, the Spanish dataset contains the identical pair mediodía-mediodía with a similarity score of 3.42 (on the scale [0,4]). Additionally, the datasets contain several orthographic errors, such as despliege and grua (instead of despliegue and grúa) and incorrect translations (e.g., the English noun implement translated into the Spanish verb implementar). Kennedy and Hirst (2012) proposed a more reliable procedure that leverages two existing aligned monolingual word similarity datasets for the construction of a new cross-lingual dataset. For each aligned pair of word pairs a-b and a'-b' in the two datasets, if the difference between the corresponding scores is greater than one, the pairs are discarded. Otherwise, two new pairs a-b' and a'-b are created, with a score equal to the average of the two original pairs' scores. In the case of repeated pairs, we merge them into a single pair with a similarity equal to the average of their scores. Using this procedure as a basis, Kennedy and Hirst (2012) created an English-French dataset consisting of 100 pairs.
We followed the same procedure and built two datasets for the English-German (consisting of 125 pairs) and German-French (comprising 96 pairs) language pairs.
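The cross-lingual dataset construction procedure can be sketched as follows; the input format (aligned pair ids mapping to ((word_a, word_b), score)) is a hypothetical simplification of how the two monolingual datasets would be stored:

```python
def build_crosslingual_pairs(dataset1, dataset2, max_diff=1.0):
    """Sketch of the Kennedy and Hirst (2012) construction: from two aligned
    monolingual datasets, discard aligned pairs whose scores differ by more
    than max_diff, and create the crossed pairs a-b' and a'-b with the
    average of the two original scores; repeated pairs are merged by
    averaging."""
    crossed = {}
    for pair_id, ((a, b), s1) in dataset1.items():
        if pair_id not in dataset2:
            continue
        (a2, b2), s2 = dataset2[pair_id]
        if abs(s1 - s2) > max_diff:
            continue  # inconsistent judgements across the two languages
        avg = (s1 + s2) / 2.0
        for pair in ((a, b2), (a2, b)):
            if pair in crossed:
                crossed[pair] = (crossed[pair] + avg) / 2.0
            else:
                crossed[pair] = avg
    return crossed
```

Each resulting entry pairs a word from one language with a word from the other, annotated with a score on the original [0,4] scale.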

Comparison systems
Monolingual. We benchmark our system against four other approaches that exploit [...] (Granada et al., 2014). We also provide results for systems that use distributional semantics for modeling words, both the conventional co-occurrence based approaches, i.e., PMI-SVD (Baroni et al., 2014), PMI and SOC-PMI (Joubarne and Inkpen, 2011), and Retrofitting (Faruqui et al., 2015), and the newer word embeddings, i.e., Word2Vec (Mikolov et al., 2013). For Word2Vec and PMI-SVD, we use the pre-trained models obtained by Baroni et al. (2014). As for WordNet-based approaches, we report results for Resnik (Resnik, 1995) and ADW (Pilehvar et al., 2013), which take advantage of its structural information, and Lesk hyper (Gurevych, 2005), which leverages definitional information in WordNet for similarity computation. Finally, we also report the performance of our earlier work NASARI (Camacho-Collados et al., 2015), which combines knowledge from WordNet and Wikipedia for the English language, in its setting without the Wiktionary synonyms module.
Cross-lingual. We compare the performance of our approach against the best configuration of the CL-MSR-2.0 system (Kennedy and Hirst, 2012), which exploits Pointwise Mutual Information (PMI) on a parallel corpus obtained from the English and French versions of WordNet. Since two of our cross-lingual datasets are newly created, we developed three baseline systems to enable a more meaningful comparison. To this end, we first use Google Translate to translate the non-English side of each dataset into English. Accordingly, three state-of-the-art graph-based and corpus-based approaches were used to measure the similarity of the resulting English pairs. As English similarity measurement systems, we opted for ADW (Pilehvar et al., 2013), and the best predictive (Mikolov et al., 2013, Word2Vec) and co-occurrence (i.e., PMI-SVD) models obtained by Baroni et al. (2014); for the latter two, we report the best configuration on the RG-65 dataset out of their 48 configurations (the corpus used to train the models contained 2.8 billion tokens, including Wikipedia). In our experiments we refer to these systems as pivot, since they use English as a pivot for computing semantic similarity. As a comparison, we also show results for MUFFIN pivot, which is the variant of our system applied to the same automatically translated monolingual datasets. (SSA, one of the comparison systems, involves several parameters tuned on datasets that are constructed on the basis of MC-30 and RG-65.)

Results
Monolingual. We show in Table 2 the performance of different systems in terms of Spearman and Pearson correlations on the English, German, and French RG-65 datasets. On the German and French datasets, our system outperforms the comparison systems according to both evaluation measures. It achieves considerable Spearman and Pearson correlation leads of 0.1 and 0.2, respectively, on the French dataset in comparison to the best comparison system. On the English RG-65 dataset, our system also attains competitive performance according to both Spearman and Pearson correlations. We note that most state-of-the-art systems on the dataset (e.g., ADW) are restricted to the English language only.
Cross-lingual. Pearson correlation results on the three cross-lingual RG-65 datasets are presented in Table 3. Similarly to the monolingual experiments, our system proves highly reliable in the cross-lingual setting, improving the performance of the comparison systems on all three language pairs. Moreover, MUFFIN pivot attains the best results among the pivot systems on all datasets, confirming the reliability of our system in the monolingual setting. We note that since the cross-lingual datasets were built by translating the word pairs in the original English RG-65 dataset, the pivot-based comparison systems proved to be highly competitive, outperforming the CL-MSR-2.0 system by a considerable margin.

Word Sense Disambiguation

Wikipedia
In this setting, we selected the SemEval-2013 all-words WSD task as our evaluation benchmark. The task provides datasets for five different languages: Italian, English, French, Spanish and German. There are on average 1123 words to disambiguate in each language's dataset.
As comparison systems, we provide results for the best-performing participating system on each language. We also show results for the state-of-the-art WSD system of Moro et al. (2014, Babelfy), which relies on random walks on the BabelNet semantic network and a set of graph heuristic algorithms. Finally, we also report results for the Most Frequent Sense (MFS) baseline provided by the task organizers. We follow Moro et al. (2014) and back off to the MFS baseline when our system's judgement does not meet a threshold θ. Similarly to Babelfy, we tuned the value of the threshold θ on the trial dataset provided by the organizers of the task. We tuned θ with step size 0.05 (hence, 21 possible values in [0,1]), obtaining an optimal value of 0.85 on the trial set, a value which we use across all languages.

Table 4 lists the F1 percentage performance of different systems on the five datasets of the SemEval-2013 all-words WSD task. Despite not being tuned to the task, our representations provide competitive results on all datasets, outperforming the sophisticated Babelfy system on the Spanish and German languages. The variant of our system that does not utilize the MFS information in the disambiguation process (θ = 0) also shows competitive results, outperforming the best participating system on all languages. Interestingly, this variant proves highly effective on the French language, not only surpassing the performance of our system using the MFS information, but also attaining the best overall performance.

WordNet
As regards the WordNet disambiguation task, we take as our benchmark the two recent SemEval English all-words WSD tasks: the SemEval-2013 task on Multilingual WSD and the SemEval-2007 English Lexical Sample, SRL and All-Words task (Pradhan et al., 2007). The all-words datasets of the two tasks contain 1644 instances (SemEval-2013) and 162 noun instances (SemEval-2007), respectively.
As comparison systems, we report the performance of the best configuration of the top-performing system in the SemEval-2013 task, i.e., UMCC-DLSI (Gutiérrez et al., 2013). We also show results for the state-of-the-art supervised system (Zhong and Ng, 2010, IMS), as well as for two graph-based approaches that are based on random walks on the WordNet graph (Agirre and Soroa, 2009, UKB w2w) and the BabelNet semantic network (Moro et al., 2014, Babelfy). We follow Babelfy and also exploit WordNet's sense frequency information from the SemCor sense-annotated corpus (Miller et al., 1993). However, instead of simply backing off to the most frequent sense, we propose a more meaningful exploitation of this information. To this end, we compute the relevance of a specific sense as the average of its normalized sense frequency and its corresponding score (scorec in Algorithm 2) given by our system. The sense with the highest overall relevance value is then picked as the intended sense. Additionally, we put forward a hybrid system that combines our system with IMS, hence benefiting from the judgements made by two systems that utilize complementary information: our system makes judgements based on global contexts, whereas IMS exploits the local context of the target word. To this end, we compute the relevance of a specific sense as the average of the normalized scores given by IMS and our system (scorec in Algorithm 2). We refer to this hybrid system as MUFFIN+IMS.

Table 5 reports the F1 percentage performance of different systems on the datasets of the SemEval-2013 and SemEval-2007 English all-words WSD tasks. We also report the results for the MFS baseline, which always picks the most frequent sense of a word. Similarly to the disambiguation task on the Wikipedia sense inventory, MUFFIN proves to be quite competitive on the WordNet disambiguation task, surpassing the performance of all the comparison systems on the SemEval-2013 dataset.
On the SemEval-2007 dataset, IMS achieves the best performance, thanks to its usage of large amounts of manually and semi-automatically tagged data. Finally, our hybrid system, MUFFIN+IMS, provides the best overall performance on the two datasets, showing that our combination of the two WSD systems that utilize different types of knowledge was beneficial.
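The score-averaging combination used both for the sense-frequency variant and for MUFFIN+IMS can be sketched as follows. The normalization scheme (scaling each system's scores over the candidate senses to sum to one) is an assumption, since the text does not specify how the scores are normalized:

```python
def combine_scores(scores_a, scores_b):
    """Sketch of the relevance combination: normalize each system's scores
    over the candidate senses (assumed: scale to sum to one), average the
    two normalized scores per sense, and return the top-scoring sense."""
    def normalize(scores):
        total = sum(scores.values())
        return {s: v / total for s, v in scores.items()} if total else scores
    na, nb = normalize(scores_a), normalize(scores_b)
    relevance = {s: (na.get(s, 0.0) + nb.get(s, 0.0)) / 2.0
                 for s in set(na) | set(nb)}
    return max(relevance, key=relevance.get)
```

The same function covers both hybrids: pass MUFFIN's scorec values together with either SemCor sense frequencies or IMS scores.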

Related work
We briefly review the recent literature on the two NLP tasks to which we applied our representations, i.e., Word Sense Disambiguation and semantic similarity.
WSD. There are two main categories of WSD techniques: knowledge-based and supervised (Navigli, 2009). Supervised systems such as IMS (Zhong and Ng, 2010) analyze sense-annotated data and model the context in which the various senses of a word usually appear. Despite their accuracy for the words that are provided with suitable amounts of sense-annotated data, their applicability is limited to those words and languages for which such data is available, practically limiting them to a small subset of words mainly in the English language. Knowledge-based approaches (Sinha and Mihalcea, 2007; Navigli and Lapata, 2007) significantly improve the coverage of supervised systems. However, similarly to their supervised counterparts, knowledge-based techniques are usually limited to the English language. Recent years have seen a growing interest in cross-lingual and multilingual WSD (Lefever and Hoste, 2010; Lefever and Hoste, 2013). Multilinguality is usually offered by methods that exploit the structural information of large-scale multilingual lexical resources such as Wikipedia (Gutiérrez et al., 2013; Manion and Sainudiin, 2013; Hovy et al., 2013). Babelfy (Moro et al., 2014) is an approach with state-of-the-art performance that relies on random walks on the BabelNet multilingual semantic network (Navigli and Ponzetto, 2012a) and densest subgraph heuristics. However, the approach is limited to the WSD and Entity Linking tasks. In contrast, our approach is global as it can be used in different NLP tasks, including WSD.
Semantic similarity. Semantic similarity of word pairs is usually computed either on the basis of the structural properties of lexical databases and thesauri, or by comparing vectorial representations of words learned from massive text corpora. Structural approaches usually measure the similarity on the basis of the distance information on semantic networks, such as WordNet (Budanitsky and Hirst, 2006), or thesauri, such as Roget's (Morris and Hirst, 1991; Jarmasz and Szpakowicz, 2003). The semantic network of WordNet has also been used in more sophisticated techniques such as those based on random graph walks (Pilehvar et al., 2013), or coupled with the complementary knowledge from Wikipedia (Camacho-Collados et al., 2015). However, these techniques are either limited in the languages to which they can be applied, or in their applicability to tasks other than semantic similarity (Navigli and Ponzetto, 2012b).
Corpus-based techniques are more flexible, enabling the training of models on corpora other than English. However, these approaches, either in their conventional co-occurrence based form (Gabrilovich and Markovitch, 2007; Landauer and Dumais, 1997; Turney and Pantel, 2010; Bullinaria and Levy, 2012), or the more recent predictive models (Mikolov et al., 2013; Collobert and Weston, 2008; Pennington et al., 2014), are restricted in two ways: (1) they cannot be used to compare word senses; and (2) they cannot be directly applied to cross-lingual semantic similarity. Though the first problem has been solved by multi-prototype models (Huang et al., 2012), or by the sense-specific representations obtained as a result of exploiting WordNet glosses (Chen et al., 2014), the second problem remains unaddressed. In contrast, our approach models word senses and concepts effectively, while providing a unified representation for different languages that enables cross-lingual semantic similarity.

Conclusions
This paper presented MUFFIN, a new multilingual, unified and flexible representation of individual word senses. Thanks to its effective combination of distributional statistics and structured knowledge, the approach can compute efficient representations of arbitrary word senses, with high coverage and irrespective of their language. We evaluated our representations on two different NLP tasks, i.e., semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several datasets. Experimental results demonstrated the reliability of our unified representation approach, while at the same time also highlighting its main advantages: multilinguality, owing to its effective application within and across multiple languages; and flexibility, owing to its robust performance on two different tasks.