RUFINO at SemEval-2017 Task 2: Cross-lingual lexical similarity by extending PMI and word embeddings systems with a Swadesh’s-like list

The RUFINO team proposed an unsupervised, conceptually simple and low-cost approach for addressing the Multilingual and Cross-lingual Semantic Word Similarity challenge at SemEval 2017. The proposed systems were cross-lingual extensions of popular monolingual lexical similarity approaches such as PMI and word2vec. The extensions were made possible by a small parallel list of concepts similar to the Swadesh list, which we obtained in a semi-automatic way. In spite of its simplicity, our approach proved effective, obtaining statistically significant and consistent results in all datasets proposed for the task. In addition, we provide some research directions for improving this novel and affordable approach.


Introduction
Pairwise semantic lexical similarity is a core component in NLP systems that tackle fundamental NLP tasks such as word sense disambiguation (Camacho-Collados et al., 2015), semantic textual similarity (Agirre et al., 2017) and many others. For more than two decades, the problem has been addressed mainly for the English language, and only recently have other languages been considered. Task 2 in SemEval 2017 (Camacho-Collados et al., 2017) proposes a public challenge for this task in 5 languages (English, Spanish, Italian, German and Farsi) and an additional cross-lingual challenge in their 10 possible combinations. This paper describes the participating systems of the RUFINO team in these challenges.
Lexical-similarity systems receive two words as input and return a numerical score that reflects the similarity or relatedness between them. Cross-lingual systems extend the idea to words in different languages. The evaluation of such systems consists of measuring the correlation between the scores obtained for several word pairs and the consensus of human judgments (gold standard).

[Figure 1: Architecture of the cross-lingual system]
The main resources used by lexical similarity systems are monolingual corpora, parallel corpora and knowledge-based resources such as WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012). Among them, monolingual corpora are the cheapest and most widely available resource for the majority of languages. Aiming to propose a lexical similarity system that is easy to replicate beyond the 5 languages of the challenge, the RUFINO team proposed a system based mainly on monolingual corpora.
Like monolingual lexical similarity, the cross-lingual variant of this task aims to establish quantitatively the degree of similarity between two words, but with the added complexity that the words belong to different languages. This task contributes to solving higher-level tasks such as cross-lingual text similarity and entailment (Jimenez et al., 2012, 2013). However, to the best of our knowledge, it is not possible to build a cross-lingual system between two dissimilar languages based solely on monolingual corpora. To bridge this gap, we proposed a resource inspired by the well-known Swadesh list. The Swadesh list (Swadesh, 1950, 1952) comprises approximately 200 concepts intended to be universal, culturally independent and transverse to almost any language for the purposes of comparative linguistics. We used Wikipedia and Google Translate to build a list for the 5 languages of the competition containing 66 concepts with properties similar to the ones proposed by Swadesh. We chose Google Translate because the alignment of concepts grouped into synsets among WordNets in different languages is not always available. For a language not supported by Google Translate (which covers more than 100 languages), we estimate that building an automatic translator from a parallel corpus would be more feasible and economical than manually constructing a WordNet for that language. Nevertheless, a resource like BabelNet, for instance, could also provide accurate translations of transverse concepts. Our goal is to build cross-lingual systems starting from monolingual systems connected across languages by the proposed list of concepts. Figure 1 provides an overview of the general architecture of the proposed system.
The organizers of the challenge proposed benchmark corpora for the sake of comparison of the participating systems. The systems proposed by our team were trained on the Wikipedias in the 5 languages, which are the benchmark corpora for the monolingual sub-task. The benchmark corpus for the cross-lingual systems is the Europarl parallel corpus (http://www.statmt.org/europarl/). Instead, our cross-lingual systems used the proposed list of language-transverse concepts, which is considerably smaller, simpler and cheaper than the Europarl corpus. Although the results obtained by our systems were in the middle range of the general ranking of official results, all of them were statistically significant and consistent across all datasets. Moreover, in some cases our results were comparable to those of other systems relying on considerably larger, more complex and more expensive resources.
The rest of the paper contains the following sections. In section 2, we present the motivation for our approach. Section 3 contains the detailed description of our participating systems. In section 4 the obtained results are presented and discussed. Finally, in section 5 we provide some concluding remarks.

Motivation
A concept list of basic vocabulary items showing the universality of certain parts of the lexicon of human languages was initially proposed by Morris Swadesh (Swadesh, 1950, 1952). Swadesh claimed that certain morphemes and everyday words such as mother, son, hand, head, sun, warm, water, tree, etc., connected with concepts and experiences common to all human groups, are relatively stable over time. Since then, many concept lists with the same characteristics have been compiled for several purposes in descriptive linguistics. Concepts are not only transverse to languages; semantically close concepts also remain close regardless of the language. For instance, mother and son are semantically closer than mother and sun independently of the language. Our approach is based on the idea that a set of concepts transverse to languages can serve as a support to index a vectorial representation of the words of a given language. To obtain such a representation for a single language, two things are required: the lexicalization of that set of transverse concepts and a lexical-similarity (or distance) system for that language. This semantic representation is cross-lingual since it only depends on the relative similarities (or distances) of each word to be represented to the set of transverse concepts. Therefore, the representation of a particular word w is a vector where each dimension corresponds to the similarity score between w and one word from the set of transverse concepts.
Intuitively, we considered three conditions that a set of transverse concepts for a set of languages should satisfy. First, these concepts should be relatively frequent in all the given languages, because infrequent words tend to produce low-quality measurements in the required lexical similarity systems, whether knowledge-based or corpus-based. Second, it is preferable that the transverse concepts are lexicalized in each one of the languages with just one word. This condition avoids problems arising from the language-specific usage rules of multiword expressions. Third, the monolingual lexical similarity systems should be similar in their construction and in the resources they use. This last condition makes it more likely that the distances and similarities among concepts are proportional across the different languages.
As a result, the list of transverse concepts, a relatively simple resource to obtain, can be useful to turn a set of monolingual systems into a crosslingual system.

Methods
We built two groups of monolingual lexical similarity systems and two groups of cross-lingual systems. For both the monolingual and cross-lingual sub-tasks, the systems labeled as run1 rely mainly on Pointwise Mutual Information (PMI) (Church and Hanks, 1990), and those labeled as run2 are based on Polyglot's word embeddings (Al-Rfou et al., 2013). The following subsections describe these systems.

Monolingual systems
3.1.1 run1: PMI and common contexts
PMI is a simple corpus-based information-theoretical method for finding associations between pairs of words using the distributional hypothesis, which states that associations between words depend on the co-occurrences of the words in a large corpus. The PMI score between two words a and b can be computed with this formula:

$$PMI(a, b) = \log \frac{P(a \wedge b)}{P(a) \, P(b)}$$

Probabilities can be estimated by the following expressions:

$$P(a) = \frac{o_a}{N}, \qquad P(b) = \frac{o_b}{N}, \qquad P(a \wedge b) = \frac{o_{a \wedge b}}{N}$$

where $o_a$ and $o_b$ are the numbers of occurrences of words a and b in the corpus, $o_{a \wedge b}$ is the number of co-occurrences, and N is the total number of words in the corpus (all occurrences). We used the benchmark corpora proposed for the task, that is, Wikipedia's dumps for the 5 languages downloaded in October 2016. The preprocessing comprised lower-casing and stopword removal. For obtaining $o_{a \wedge b}$, each co-occurrence of a followed by b or vice versa was counted. N was the total number of non-stopwords in each corpus.
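As an illustration of the counting scheme described above, the following is a minimal sketch, not the implementation used for the submitted runs. The function names `count_occurrences` and `pmi`, the toy token list, and the choice of a base-2 logarithm are assumptions made for the example; the corpus is assumed to be already lower-cased and free of stopwords.

```python
import math
from collections import Counter


def count_occurrences(tokens):
    """Count words, adjacent co-occurrences and corpus size for a
    lower-cased, stopword-free token list."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for left, right in zip(tokens, tokens[1:]):
        # "a b" and "b a" are counted as the same co-occurrence.
        pairs[frozenset((left, right))] += 1
    return unigrams, pairs, len(tokens)


def pmi(a, b, unigrams, pairs, n):
    """PMI with maximum-likelihood probability estimates.

    Raises ZeroDivisionError if a or b never occurs, and ValueError
    (logarithm of zero) if the two words never co-occur.
    """
    p_a = unigrams[a] / n
    p_b = unigrams[b] / n
    p_ab = pairs[frozenset((a, b))] / n
    return math.log2(p_ab / (p_a * p_b))  # base 2 is an assumption


# Toy corpus, invented for illustration only.
tokens = ["mother", "son", "hand", "mother", "son", "sun"]
unigrams, pairs, n = count_occurrences(tokens)
print(pmi("mother", "son", unigrams, pairs, n))
```

The two exceptions are left uncaught on purpose, so that a caller can detect failed pairs and substitute a fallback score.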
PMI scores computed from co-occurrences are a low-cost and effective tool for finding word associations. However, associations between synonyms or words in the same category cannot be detected with this method because such words do not tend to occur consecutively in text. For capturing these second-order relationships, we proposed an association measure based on the proportion of common contexts between pairs of words. For that, we defined the context of a word as the pair formed by its left and right neighbor words (after removing stopwords). While defining the context, we also tried other settings such as two neighbor words before and after, two before and one after, one before and two after, just one before, just one after, just two before and just two after (we even tried not removing the stopwords). However, on the trial data, the best-performing setting was one neighbor word before and one after. Thus, we collected for each word a the set of its contexts $C_a$. The Jaccard coefficient (Jaccard, 1901; Jimenez et al., 2016) was used for comparing pairs of words represented as their sets of contexts:

$$JCC(a, b) = \frac{|C_a \cap C_b|}{|C_a \cup C_b|}$$

The final similarity score for a pair of words was the average of the previously scaled PMI and JCC scores:

$$sim(a, b) = \frac{1}{2} \left( \frac{PMI(a, b)}{\max_{PMI}} + \frac{JCC(a, b)}{\max_{JCC}} \right)$$
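The context-based measure and the averaging step can be sketched as follows. This is a hedged illustration rather than the code behind the submitted runs: the names `context_sets`, `jcc` and `combined_scores` are hypothetical, sentence boundaries are ignored, returning 0 for two words without any context is an assumption, and scaling is assumed to mean division by the maximum score over the word pairs being compared (both maxima are assumed to be positive).

```python
def context_sets(tokens):
    """Map each word to the set of (left neighbor, right neighbor)
    pairs it occurs between, for a stopword-free token list."""
    contexts = {}
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts.setdefault(word, set()).add((left, right))
    return contexts


def jcc(a, b, contexts):
    """Jaccard coefficient between the context sets of a and b."""
    c_a = contexts.get(a, set())
    c_b = contexts.get(b, set())
    union = c_a | c_b
    # Assumption: two words with no recorded contexts score 0.
    return len(c_a & c_b) / len(union) if union else 0.0


def combined_scores(pmi_scores, jcc_scores):
    """Average of PMI and JCC after dividing each by its maximum
    over the dataset of word pairs (assumed positive)."""
    max_pmi = max(pmi_scores)
    max_jcc = max(jcc_scores)
    return [0.5 * (p / max_pmi + j / max_jcc)
            for p, j in zip(pmi_scores, jcc_scores)]


# Toy corpus, invented for illustration only.
tokens = ["big", "cat", "sleeps", "big", "dog", "sleeps", "big", "cat", "runs"]
contexts = context_sets(tokens)
print(jcc("cat", "dog", contexts))
```

In the toy corpus, cat and dog never occur next to each other, so PMI cannot relate them, yet they share the context (big, sleeps), which is the kind of second-order relationship the measure is meant to capture.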
In the scaled average, $\max_{PMI}$ and $\max_{JCC}$ are the maximum scores of the corresponding measures within the entire dataset of word pairs being compared. In our implementation, if PMI produced a mathematical error such as a division by zero or an undefined logarithm, the PMI score was replaced by the average of the scores obtained by the same measure for the other, non-erroneous word pairs in the dataset.

3.1.2 run2: word embeddings
The run2 systems used Polyglot's word embeddings (Al-Rfou et al., 2013), which were obtained using the word2vec algorithm (Mikolov et al., 2013) applied to Wikipedia as corpus for a large number of languages. For each pair of target words a and b, their 64-dimensional vector representations (64 is the number of dimensions in Polyglot's vectors) were obtained from Polyglot's files and then compared using cosine similarity. If a target word started with a capital letter and was not found in the database of embeddings, the word was lower-cased and searched again. Similarly, if a multiword target was not found, we used the vector sum of the representations of its component words. After that, if some target was still not found, we used, as before, the average score of the non-erroneous word pairs in the dataset.
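The lookup fallbacks for the embedding-based runs can be sketched as below. This is an illustrative sketch under stated assumptions: `embeddings` is a plain dictionary from word to vector (loading Polyglot's files is not shown), the toy vectors are 2-dimensional instead of 64-dimensional, the function names are hypothetical, and applying the lower-casing fallback to each component of a multiword target is an assumption that the text does not confirm.

```python
import math


def lookup_word(word, embeddings):
    """Exact lookup, then a lower-cased retry for capitalized words."""
    if word in embeddings:
        return embeddings[word]
    if word[:1].isupper():
        return embeddings.get(word.lower())
    return None


def lookup_target(target, embeddings):
    """Lookup with the multiword fallback: sum the component vectors."""
    vector = lookup_word(target, embeddings)
    if vector is None and " " in target:
        parts = [lookup_word(part, embeddings) for part in target.split()]
        if all(part is not None for part in parts):
            vector = [sum(dims) for dims in zip(*parts)]
    return vector


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def score_pairs(word_pairs, embeddings):
    """Cosine score per pair; pairs with a missing target receive the
    average score of the pairs that could be computed."""
    scores = []
    for a, b in word_pairs:
        u = lookup_target(a, embeddings)
        v = lookup_target(b, embeddings)
        scores.append(cosine(u, v) if u is not None and v is not None else None)
    found = [s for s in scores if s is not None]
    fallback = sum(found) / len(found)  # assumes at least one pair was found
    return [fallback if s is None else s for s in scores]


# Toy 2-dimensional vectors, invented for illustration only.
embeddings = {"new": [1.0, 0.0], "york": [0.0, 1.0],
              "city": [1.0, 1.0], "paris": [1.0, 0.0]}
print(score_pairs([("New York", "city"), ("paris", "york"),
                   ("paris", "zzz")], embeddings))
```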

Obtaining a Swadesh-like list
To obtain a list of concepts with properties similar to those of the Swadesh list, we first collected the lists of the 5000 most frequent terms from Wikipedia for each one of the 5 target languages. Next, each word in each list was translated into the other 4 languages, and the translations were translated back into the original language. All translations were obtained using the GOOGLETRANSLATE() function in the spreadsheet editor of Google Drive. In each list, we preserved only the rows in which all 4 back-translations coincided with the original word. Finally, the 5 lists were merged and aligned to identify the terms that occurred in the 5 languages. Only the terms occurring in all 5 languages were preserved.
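The automatic part of this procedure can be sketched as follows. The sketch is only one possible reading of the merge-and-align step (a row is kept when it is produced from the frequent-term list of every language), and `translate` is a hypothetical callable standing in for whatever translation service is used; the toy dictionary below is invented and has nothing to do with the actual list.

```python
def round_trip_stable(word, source, targets, translate):
    """True if translating word into every target language and back
    returns the original word."""
    for target in targets:
        forward = translate(word, source, target)
        if translate(forward, target, source) != word:
            return False
    return True


def aligned_concepts(frequent_terms, translate):
    """frequent_terms maps a language code to its list of frequent words.
    Returns one dict (language -> word) per concept found from every list."""
    languages = sorted(frequent_terms)
    rows_per_source = []
    for source in languages:
        targets = [lang for lang in languages if lang != source]
        rows = set()
        for word in frequent_terms[source]:
            if round_trip_stable(word, source, targets, translate):
                row = {source: word}
                for target in targets:
                    row[target] = translate(word, source, target)
                rows.add(tuple(row[lang] for lang in languages))
        rows_per_source.append(rows)
    # Keep only the rows that were obtained starting from every language.
    shared = set.intersection(*rows_per_source)
    return [dict(zip(languages, row)) for row in sorted(shared)]


# Toy two-language translator, invented for illustration only.
TOY = {
    ("en", "es"): {"water": "agua", "sun": "sol", "bank": "banco"},
    ("es", "en"): {"agua": "water", "sol": "sun", "banco": "bench"},
}


def toy_translate(word, source, target):
    return TOY[(source, target)].get(word)


terms = {"en": ["water", "sun", "bank"], "es": ["agua", "sol", "banco"]}
print(aligned_concepts(terms, toy_translate))
```

In the toy data, bank is discarded because its round trip through Spanish returns bench, which mirrors how ambiguous words are filtered out by the back-translation check.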
From the previous selection, we obtained a list containing 172 concepts with their lexicalizations in the 5 target languages. This initial list was purged manually by removing proper names, cardinals, stopwords and other unwanted forms. The final result is an aligned list of 66 concepts of frequent words in 5 languages. Moreover, within each concept, every pair among the 5 words consists of common translations of each other. The obtained list is shown in Table 1. This is the proposed list of language-transverse concepts used for enhancing the previously described monolingual lexical-similarity systems to support cross-linguality.

[Table 2: Results for the monolingual sub-task (values are the harmonic mean between Pearson's and Spearman's correlation coefficients).]

Cross-lingual systems
The proposed cross-lingual lexical systems were built by combining the monolingual systems described in subsections 3.1.1 and 3.1.2 with the list of 66 language-transverse concepts proposed in the previous subsection. The method is straightforward and is depicted in Figure 1. To obtain a vectorial representation of a word in a particular language, the word is compared, using a monolingual lexical-similarity system for that language, against the 66 lexicalizations of the transverse concepts in that language. The result is a 66-dimensional vector, which is a language-independent representation of the word. For comparing a pair of words in two different languages, their language-independent vectorial representations are obtained using their respective monolingual systems and the aligned list of concepts. Then the final similarity score is obtained by computing the cosine similarity between the two vectors. We built two cross-lingual systems: run1, using the monolingual systems described in subsection 3.1.1, and run2, using the systems described in subsection 3.1.2.
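The construction just described can be sketched in a few lines. This is a minimal sketch, assuming that each monolingual system is available as a function taking two words and returning a score; the names `concept_vector` and `cross_lingual_similarity`, the two-concept list and the toy similarity tables are all invented for illustration (the actual list has 66 concepts).

```python
import math


def concept_vector(word, concept_words, similarity):
    """One dimension per transverse concept: the monolingual similarity
    between the word and that concept's lexicalization."""
    return [similarity(word, concept) for concept in concept_words]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cross_lingual_similarity(word_a, lang_a, word_b, lang_b, concepts, systems):
    """concepts: list of dicts (language -> word), aligned across languages.
    systems: dict mapping a language to its monolingual similarity function."""
    vec_a = concept_vector(word_a, [row[lang_a] for row in concepts], systems[lang_a])
    vec_b = concept_vector(word_b, [row[lang_b] for row in concepts], systems[lang_b])
    return cosine(vec_a, vec_b)


# Toy resources, invented for illustration only.
concepts = [{"en": "water", "es": "agua"}, {"en": "sun", "es": "sol"}]
SIM_EN = {("river", "water"): 0.8, ("river", "sun"): 0.1}
SIM_ES = {("río", "agua"): 0.8, ("río", "sol"): 0.1,
          ("día", "agua"): 0.1, ("día", "sol"): 0.8}
systems = {"en": lambda a, b: SIM_EN[(a, b)],
           "es": lambda a, b: SIM_ES[(a, b)]}

print(cross_lingual_similarity("river", "en", "río", "es", concepts, systems))
print(cross_lingual_similarity("river", "en", "día", "es", concepts, systems))
```

Because the two vectors are indexed by the same aligned concepts, no parallel corpus is needed at comparison time; only the monolingual scores against the concept list are used.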

Results and discussion
Results obtained by our monolingual systems (run1 and run2) are shown in Table 2. Run1 averaged relatively close to the baseline, which in [...] The system that outperformed the baseline was the PMI-JCC monolingual system (run1) for German. Run2, based on Polyglot's embeddings, was consistently worse than run1. Although both systems use the same corpora, the difference in performance is significant. Regarding our runs, we hypothesize that PMI-JCC takes better advantage of small corpora than the word2vec algorithm used in the construction of Polyglot's embeddings. Unlike in the monolingual task, run1 and run2 performed similarly in the cross-lingual task, and both were considerably below the baseline (see Table 3). An interesting question is to what extent the results of monolingual systems predict the performance of bilingual systems. To answer this question, we measured Pearson's correlation (r) between the result of each bilingual system and the minimum of the results of its two monolingual systems, over the 10 language combinations. The result was $r_{run1} = 0.883$, $r_{run2} = 0.263$, and $r_{baseline} = 0.950$. Clearly, the results of the monolingual systems based on PMI-JCC and NASARI are good predictors of the results of the corresponding bilingual systems.
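The predictor analysis described above can be reproduced with a short script such as the following sketch. The function names are hypothetical, and the language codes and scores are placeholders invented for the example; they are not the values reported in Tables 2 and 3.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def predictor_correlation(mono, bilingual):
    """mono: language -> monolingual score.
    bilingual: (language, language) -> cross-lingual score.
    Correlates each bilingual score with the weaker of its two
    monolingual scores."""
    combos = sorted(bilingual)
    weakest = [min(mono[a], mono[b]) for a, b in combos]
    observed = [bilingual[combo] for combo in combos]
    return pearson(weakest, observed)


# Placeholder scores, invented for illustration only.
mono = {"xx": 0.2, "yy": 0.4, "zz": 0.6}
bilingual = {("xx", "yy"): 0.1, ("xx", "zz"): 0.1, ("yy", "zz"): 0.2}
print(predictor_correlation(mono, bilingual))
```

Taking the minimum reflects the intuition that a cross-lingual pair can be no better than its weaker monolingual side.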

Conclusions and future directions
From our participation in Task 2 of SemEval 2017 we can draw several conclusions. First, the proposed monolingual lexical systems, based respectively on PMI-JCC and Polyglot's embeddings (i.e. word2vec), obtained considerably different results in spite of being constructed using the same corpus (i.e. Wikipedia). This result suggests that, for inferring lexical relationships, relatively small corpora can be better exploited by simpler methods such as PMI, which is convenient for under-resourced languages. Second, the proposed approach of using a parallel list of language-transverse concepts for building cross-lingual lexical systems from monolingual resources proved effective, with a good cost-benefit ratio. Third, there is an important performance gap between the proposed approach and the knowledge-based baseline approach.
However, the monolingual versions of both our approach (run1) and that baseline share the property of being good predictors of the performance of their cross-lingual versions. Therefore, we conclude that a straightforward way to improve the proposed system is to use better monolingual systems. Additionally, the method for selecting the set of language-transverse concepts can be improved by considering the transversality of the relationships and by using size-balanced multilingual corpora.