Sew-Embed at SemEval-2017 Task 2: Language-Independent Concept Representations from a Semantically Enriched Wikipedia

This paper describes Sew-Embed, our language-independent approach to multilingual and cross-lingual semantic word similarity as part of the SemEval-2017 Task 2. We leverage the Wikipedia-based concept representations developed by Raganato et al. (2016), and propose an embedded augmentation of their explicit high-dimensional vectors, which we obtain by plugging in an arbitrary word (or sense) embedding representation, and computing a weighted average in the continuous vector space. We evaluate Sew-Embed with two different off-the-shelf embedding representations, and report their performances across all monolingual and cross-lingual benchmarks available for the task. Despite its simplicity, especially compared with supervised or overly tuned approaches, Sew-Embed achieves competitive results in the cross-lingual setting (3rd best result in the global ranking of subtask 2, score 0.56).


Introduction
Semantic similarity is a well-established research area of Natural Language Processing, concerned with measuring the extent to which two linguistic items are similar (Budanitsky and Hirst, 2006). In particular, word similarity is nowadays a widely used evaluation benchmark for word and sense representations (Turney and Pantel, 2010).
In this respect Wikipedia, as one of the most popular semi-structured resources in the field (Hovy et al., 2013), provides a convenient bridge to multilinguality, with several million inter-language links among articles referring to the same concept or entity. In fact, a number of successful approaches to semantic similarity make explicit use of Wikipedia, from ESA (Gabrilovich and Markovitch, 2007) to NASARI (Camacho Collados et al., 2016). Others, like SENSEMBED (Iacobacci et al., 2015), report state-of-the-art results when trained on an automatically disambiguated version of a Wikipedia dump.
Regardless of whether Wikipedia is seen as a multilingual semantic network of concepts and entities or as a sense-annotated corpus, hyperlinks (inter-page links) constitute its key structural property. In light of this, Raganato et al. (2016) addressed the sparsity problem of original hyperlinks and developed SEW, a semantically enriched Wikipedia where the overall number of linked mentions has been more than tripled by solely exploiting the structure of Wikipedia itself and the wide-coverage sense inventory of BabelNet (Navigli and Ponzetto, 2012). In addition to building the corpus, the authors used SEW's sense annotations to construct vector representations of concepts and entities from the BabelNet sense inventory, and tested them on multiple semantic similarity tasks. Being defined at the concept level, SEW's representations are inherently multilingual; however, they consist of high-dimensional sparse vectors, which are not immediately comparable with existing approaches (especially those based on word embeddings) and are less flexible to use within downstream applications.
In this paper we propose SEW-EMBED, an embedded augmentation of SEW's original representations in which sparse vectors, defined in the high-dimensional space of Wikipedia pages, are mapped to continuous vector representations via a weighted average of embedded vectors from an arbitrary, pre-specified word (or sense) representation. Regardless of the particular representation used, the resulting vectors are still defined at the concept level, and hence immediately applicable in a multilingual and cross-lingual setting.
We describe and evaluate SEW-EMBED with two off-the-shelf embedded representations: the popular word embeddings of Word2Vec (Mikolov et al., 2013a) and the embedded concept representations of NASARI (Camacho Collados et al., 2016).³ We report and discuss the results obtained by both versions on all monolingual and cross-lingual benchmarks available for the task (Camacho Collados et al., 2017), and include a comparison with the original explicit representations of Raganato et al. (2016).
³ http://lcl.uniroma1.it/nasari

Background: Developing a Semantically Enriched Wikipedia
The approach used by Raganato et al. (2016) to develop SEW relies on a cascade of hyperlink propagation heuristics, applied to an English Wikipedia dump after some standard pre-processing. In general terms, each propagation heuristic identifies a list of BabelNet synsets to be propagated across a given Wikipedia page p; then, for each synset, occurrences of any of its potential lexicalizations are detected and added as new sense annotations for p. Raganato et al. (2016) distinguish between intra-page and inter-page heuristics (depending on whether the synsets propagated across p are collected from the same page), but all of them share a common assumption: every occurrence of an ambiguous mention within p refers to the same underlying sense (one sense per page) and hence is annotated with the same synset. After all heuristics have been applied, overlapping mentions and duplicates are removed by enforcing a conservative policy which favors intra-page annotations over inter-page ones, and selects the longest match in case of overlapping annotations of the same type. The result of this process is SEW, a Wikipedia-based corpus with over 200 million sense annotations of BabelNet synsets for all open-class parts of speech (nouns, verbs, adjectives, and adverbs).
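As an illustration of the conservative policy just described, the following is a minimal sketch and not the SEW implementation. The `Annotation` type, its fields, and the greedy selection strategy are assumptions made for the example; only the two preferences (intra-page over inter-page, then longest match) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    synset: str       # synset identifier (placeholder strings in this sketch)
    intra_page: bool  # True if produced by an intra-page heuristic


def resolve_overlaps(annotations):
    """Greedily keep a set of non-overlapping annotations.

    Assumed priority: intra-page before inter-page, then longer mentions
    first. Exact duplicates are dropped before ranking.
    """
    unique = set(annotations)
    ranked = sorted(
        unique,
        key=lambda a: (not a.intra_page, -(a.end - a.start), a.start, a.synset),
    )
    kept = []
    for cand in ranked:
        # Keep the candidate only if it does not overlap anything already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda a: a.start)
```

A greedy pass is only one way to realize such a policy; it is chosen here because it makes the two preferences explicit in a single sort key.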

SEW-EMBED: Building Vectors from Sense Annotations
In this section we provide the details of SEW-EMBED. We start by briefly describing the original explicit representations based on SEW (Section 3.1) and then our embedded augmentation (Section 3.2). The workflow of our procedure is depicted in Figure 1 with an illustrative example.

Explicit Representation
As a starting point, we consider the Wikipedia-based representation (WB-SEW) by Raganato et al. (2016), in which each concept or entity s in the BabelNet sense inventory is represented as a vector v_s where dimensions are Wikipedia pages. For each Wikipedia page p in SEW, the corresponding component of v_s is computed as the estimated frequency of s appearing as sense annotation in p. Frequency is estimated using lexical specificity (Lafon, 1980), a statistical measure based on the hypergeometric distribution, particularly suitable for extracting an accurate set of representative terms for a given subcorpus SC of a reference corpus RC. We applied the procedure described by Camacho Collados et al. (2016), with the single page p as SC and the whole SEW as RC. As a result we obtain v_s, a rather sparse vector in which non-zero components correspond to the Wikipedia pages where s appears as a hyperlink; the weight ω_p associated with each component reflects the representativeness of s in the context described by p (Figure 1a).
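To make the weighting concrete, here is a minimal sketch of lexical specificity under a common formulation, which we assume here rather than take from the text: the negative base-10 logarithm of the hypergeometric tail probability P(X ≥ f). The parameter names T (size of RC), t (size of SC), F (frequency of s in RC), and f (frequency of s in SC) are illustrative.

```python
from math import comb, log10


def hypergeom_tail(f, T, t, F):
    """P(X >= f) for X ~ Hypergeometric(population T, successes F, draws t)."""
    upper = min(t, F)
    total = comb(T, t)
    # comb(n, k) returns 0 when k > n, so impossible outcomes add nothing.
    tail = sum(comb(F, k) * comb(T - F, t - k) for k in range(f, upper + 1))
    return tail / total


def lexical_specificity(f, T, t, F):
    """Assumed formulation: -log10 of the hypergeometric tail probability.

    Larger values mean that observing f occurrences in the subcorpus is
    less likely by chance, i.e. the item is more specific to it.
    """
    return -log10(hypergeom_tail(f, T, t, F))


# Toy numbers: a reference corpus of 1000 annotations, a page with 50,
# and a synset that occurs 20 times overall and 5 times on the page.
weight = lexical_specificity(f=5, T=1000, t=50, F=20)
```

Exact integer arithmetic keeps the sketch simple but does not scale to a corpus the size of SEW, where the tail probability would also underflow; a practical version would work in log space.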

Embedded Representation
In order to compute the embedded augmentation of an explicit vector v_s, obtained as in Section 3.1 for a given concept or entity s, we follow Camacho Collados et al. (2016) and exploit the compositionality of word embeddings (Mikolov et al., 2013b). According to this property, the representation of an arbitrary compositional phrase can be expressed as the combination (typically the average) of its constituents' representations. We build on this property and plug a pre-trained embedding representation into the explicit representation of Raganato et al. (2016). In particular, we consider each dimension p (i.e. Wikipedia page, cf. Section 3.1) of v_s and map it to the embedding space E provided by the pre-trained representation to obtain an embedded vector e_p. Such mapping depends on the specific embedding representation:
• In case of a word embedding representation we consider the Wikipedia page title as lexicalization of p and then retrieve the associated pre-trained embedding. If the title is a multi-word expression and no embedding is available for the whole expression, we exploit compositionality again and average the embedding vectors of its individual tokens;
• In case of a sense or concept embedding representation we instead exploit BabelNet's inter-resource links, and map p to the target sense inventory for which the corresponding embedding vector can be retrieved.
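The word-embedding case of this mapping can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: titles are lowercased, multi-word keys are joined with underscores, a trailing parenthetical qualifier is dropped, and tokens without an embedding are skipped. The function name and the dictionary-based embedding table are hypothetical.

```python
import re


def embed_page_title(title, embeddings):
    """Map a Wikipedia page title to a vector, or None if nothing is found.

    `embeddings` is a plain dict from strings to lists of floats.
    """
    # Assumption: "Bank (geography)" is lexicalized as "bank".
    surface = re.sub(r"\s*\([^)]*\)\s*$", "", title).lower()
    key = surface.replace(" ", "_")
    if key in embeddings:
        return list(embeddings[key])
    # Fall back on compositionality: average the tokens that have a vector.
    token_vectors = [embeddings[t] for t in surface.split() if t in embeddings]
    if not token_vectors:
        return None
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]


toy = {"new_york": [1.0, 0.0], "wall": [1.0, 3.0], "street": [3.0, 1.0]}
whole = embed_page_title("New York", toy)        # found as a single expression
composed = embed_page_title("Wall Street", toy)  # averaged from its tokens
```

The sense-embedding case is not sketched, since it depends on BabelNet's inter-resource links rather than on string matching.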
The embedded representation e_s of s (Figure 1b) is then computed as the weighted average over all the embedded vectors e_p associated with the dimensions of v_s:
\[ e_s = \frac{\sum_{p \in v_s} \omega_p \, e_p}{\sum_{p \in v_s} \omega_p} \]
where ω_p is the lexical specificity weight of dimension p. In contrast to a simple average, here we exploit the ranking of each dimension p (represented by ω_p) and hence give more importance to the higher weighted dimensions of v_s.
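The weighted average above can be sketched in a few lines. The data layout (a dict from page to ω_p and a dict from page to e_p) and the decision to skip pages that have no embedded vector are assumptions of this example.

```python
def weighted_average(explicit_vector, page_embeddings):
    """Combine page vectors e_p using the weights of an explicit vector.

    explicit_vector: dict mapping page -> lexical specificity weight.
    page_embeddings: dict mapping page -> embedded vector (list of floats).
    Returns None when no dimension of the explicit vector can be embedded.
    """
    total_weight = 0.0
    acc = None
    for page, weight in explicit_vector.items():
        e_p = page_embeddings.get(page)
        if e_p is None:
            continue  # assumption: dimensions without an embedding are ignored
        if acc is None:
            acc = [0.0] * len(e_p)
        for i, x in enumerate(e_p):
            acc[i] += weight * x
        total_weight += weight
    if acc is None or total_weight == 0.0:
        return None
    return [x / total_weight for x in acc]


v_s = {"Page A": 3.0, "Page B": 1.0}
e_s = weighted_average(v_s, {"Page A": [1.0, 0.0], "Page B": [0.0, 1.0]})
```

In this toy case the result lies three times closer to the vector of "Page A" than to that of "Page B", which is the intended effect of weighting by specificity.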

Word Similarity
In order to calculate similarity at the word level, we follow other sense-based approaches (Pilehvar et al., 2013; Camacho Collados et al., 2016) and adopt a strategy that selects, for a given word pair w_1 and w_2, the closest pair of candidate senses:
\[ \mathrm{Sim}(w_1, w_2) = \max_{s_1 \in S_{w_1},\, s_2 \in S_{w_2}} \sigma(\vec{s}_1, \vec{s}_2) \]
where S_w is the set of candidate senses of w in the BabelNet sense inventory, and \vec{s} is the vector representation associated with s ∈ S_w. As similarity measure σ we use standard cosine similarity for SEW-EMBED (Section 3.2), and weighted overlap (Pilehvar et al., 2013) for the explicit representations based on SEW (Section 3.1). Finally, we rely on a back-off strategy that sets Sim(w_1, w_2) = 0.5 (i.e. the middle point in our similarity scale) when no candidate sense is found for either w_1 or w_2.
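The closest-senses strategy with cosine similarity and the 0.5 back-off can be sketched as below. The dictionaries standing in for the sense inventory and the vector table are hypothetical, and treating a sense without a vector as missing is an assumption of the example.

```python
from math import sqrt


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(w1, w2, candidate_senses, sense_vectors, backoff=0.5):
    """Similarity of the closest pair of candidate senses of w1 and w2.

    candidate_senses: dict word -> list of sense identifiers.
    sense_vectors: dict sense identifier -> vector (list of floats).
    """
    senses1 = [s for s in candidate_senses.get(w1, []) if s in sense_vectors]
    senses2 = [s for s in candidate_senses.get(w2, []) if s in sense_vectors]
    if not senses1 or not senses2:
        return backoff  # no usable candidate sense for at least one word
    return max(
        cosine(sense_vectors[s1], sense_vectors[s2])
        for s1 in senses1
        for s2 in senses2
    )
```

Taking the maximum over all sense pairs means that an ambiguous word is compared through whichever of its senses is closest to the other word.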

Experiments
In this section we report and discuss the performance of SEW-EMBED on the monolingual and cross-lingual benchmarks of SemEval-2017 Task 2 (Camacho Collados et al., 2017).
We consider two versions of SEW-EMBED: one based on the pre-trained word embeddings of Word2Vec (Mikolov et al., 2013a), SEW-EMBED_w2v, and another one based on the embedded concept vectors of NASARI (Camacho Collados et al., 2016), SEW-EMBED_Nasari. In all test sets, the figures of SEW-EMBED_w2v correspond to the results of SEW-EMBED reported in the task description paper (Camacho Collados et al., 2017). We additionally include the results obtained by the original explicit representations based on SEW (cf. Section 3.1) and by the NASARI baseline, and use them as comparison systems across Sections 4.1 and 4.2.

Subtask 1: Multilingual Word Similarity
Table 1 shows the overall performance on multilingual word similarity for each monolingual dataset.
Both SEW-EMBED_w2v and SEW-EMBED_Nasari achieve comparable results: their correlation figures are in the same ballpark as the NASARI baseline for Italian, Farsi, and Spanish, whereas they lag behind in English and German. Most surprisingly, however, the explicit representations based on SEW show an impressive performance, and reach the best result overall in 4 out of 5 benchmarks: this might suggest that many word pairs across the test sets are actually being associated with concepts or entities that are well connected in the semantically enriched Wikipedia, and hence the corresponding sparse vectors are representative enough to provide meaningful comparisons. In general, the performance decrease on German and Farsi for all comparison systems is connected to the lack of coverage: both SEW and SEW-EMBED use the back-off strategy (cf. Section 3.3) 70 times (14%) for Farsi and 54 times (10.8%) for German.

Subtask 2: Cross-lingual Word Similarity
Table 2 reports the overall performance on cross-lingual word similarity for each language pair. Consistently with the multilingual evaluation (Section 4.1), both SEW-EMBED_w2v and SEW-EMBED_Nasari achieve comparable results in the majority of benchmarks. All approaches based on SEW seem to perform globally better in a cross-lingual setting: on average, the harmonic mean of r and ρ is 2.2 points below the NASARI baseline (compared to 3.2 points in the evaluation of Section 4.1). This suggests the potential of Wikipedia as a bridge to multilinguality: in fact, even though SEW was constructed automatically on the English Wikipedia, knowledge transfers rather well via inter-language links and has a considerable impact on the cross-lingual performance.
Again, the best figures are consistently achieved by the explicit representations based on SEW: the improvement in terms of harmonic mean of r and ρ is especially notable in benchmarks that include a less-resourced language such as Farsi (+11.75% on average compared to the NASARI baseline). This improvement does not occur with SEW-EMBED, since in that case sparse vectors are eventually mapped to an embedding space trained specifically on an English corpus.
Table 2: Results on the cross-lingual word similarity benchmarks (subtask 2) of SemEval-2017 Task 2, in terms of Pearson correlation (r), Spearman correlation (ρ), and the harmonic mean of r and ρ.
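The figures discussed in both subtasks combine Pearson (r) and Spearman (ρ) correlation through their harmonic mean, and the task's global score averages this value over a system's best individual benchmarks (four for subtask 1, six for subtask 2). The sketch below illustrates that computation in plain Python; it is not the official scorer, and the function names, the tie handling by average ranks, and the error raised when too few benchmarks are available are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_mean(r, rho):
    """Harmonic mean of two correlations (meaningful when both are positive)."""
    if r + rho == 0:
        return 0.0
    return 2 * r * rho / (r + rho)


def global_score(benchmark_scores, best_n):
    """Average of the best_n highest per-benchmark harmonic means."""
    if len(benchmark_scores) < best_n:
        raise ValueError("fewer benchmark scores than best_n")
    top = sorted(benchmark_scores, reverse=True)[:best_n]
    return sum(top) / best_n
```

For instance, five per-benchmark scores of 0.6, 0.5, 0.7, 0.4, and 0.3 with `best_n=4` would be averaged over the four highest values only.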

General Discussion
Overall, SEW-EMBED reached the 4th and 3rd positions in the global rankings of subtask 1 and 2 respectively (with scores 0.552 and 0.558, not including the NASARI baseline). Thus, perhaps surprisingly, the embedded augmentation yielded a considerable decrease in global performance in both subtasks: the original explicit representations of SEW achieved a global score of 0.615 in subtask 1 and of 0.63 in subtask 2 (cf. Sections 4.1-4.2).⁷ Intuitively, multiple factors might have influenced this negative result:
• Dimensionality Reduction. Converting an explicit vector (with around 4 million dimensions) into a latent vector of a few hundred dimensions leads inevitably to losing some valuable information, and hence to a decrease in the representational power of the model. Such a phenomenon was also shown by Camacho Collados et al. (2016), where the lexical and unified representations of NASARI tend to outperform the embedded representation on several word similarity and sense clustering benchmarks;
• Lexical Ambiguity. While the original concept vectors of SEW are defined in the unambiguous semantic space of Wikipedia pages, we constructed their embedded counterparts via the word-level representations of their lexicalized dimensions (Section 3.2); hence, when moving to the word level, we ended up conflating the different meanings of an ambiguous word or expression;⁸
• Non-Compositionality. The compositional properties of word embeddings that we assumed in Section 3.2 fall short in many cases, such as idiomatic expressions or named entity mentions (e.g. Wall Street, or New York). The explicit vectors of SEW, instead, do not require the compositional assumption and always consider a multi-word expression as a whole.
Even though the embedded representations of SEW do not match the accuracy of the explicit ones on experimental benchmarks, they are more convenient in terms of compactness and flexibility (due to the reduced dimensionality), and also in terms of comparability, as they are defined in the same vector space as Word2Vec-based representations such as the embedded vectors of NASARI (Camacho Collados et al., 2016) or DECONF (Pilehvar and Collier, 2016).
⁷ The global score is computed as the average harmonic mean of Pearson and Spearman correlation on the best four (subtask 1) and six (subtask 2) individual benchmarks (Camacho Collados et al., 2017).
⁸ E.g., in SEW-EMBED_w2v, the distinct explicit dimensions represented in SEW by the Wikipedia pages BANK and BANK (GEOGRAPHY) were both mapped to the Word2Vec embedding of bank.

Conclusion
In this paper we presented SEW-EMBED, a language-independent concept representation approach which we put forward as a competitor system in SemEval-2017 Task 2 (Camacho Collados et al., 2017). SEW-EMBED is tied to a Wikipedia-based sense-annotated corpus, SEW (Raganato et al., 2016), obtained automatically by exploiting the hyperlink structure of Wikipedia and the wide-coverage sense inventory of BabelNet. SEW is used to construct sparse vector representations in the space of Wikipedia pages, which are then mapped to an embedded representation by plugging in an arbitrary word (or sense) embedding model and computing a weighted average. We described and evaluated SEW-EMBED on all benchmarks available for the task, together with the explicit sparse vectors originally proposed by Raganato et al. (2016). In spite of the methodological simplicity of the approach (which was designed as an extrinsic test bed for the quality of SEW's annotations), global figures put SEW-EMBED close to, or on par with, state-of-the-art approaches such as NASARI. In particular, we showed that a cross-lingual setting yields the best overall improvement for concept representations based entirely on SEW, suggesting its potential for multilingual and cross-lingual applications.