HHU at SemEval-2017 Task 2: Fast Hash-Based Embeddings for Semantic Word Similarity Assessment

This paper describes the HHU system that participated in Task 2 of SemEval 2017, Multilingual and Cross-lingual Semantic Word Similarity. We introduce our unsupervised embedding learning technique and describe how it was employed and configured to address the problems of monolingual and multilingual word similarity measurement. We also report results from empirical evaluations on the benchmark provided by the task's organizers.


Introduction
The goal of Task 2 of SemEval-2017 is to provide a reliable benchmark for the evaluation of monolingual and multilingual semantic representations (Camacho-Collados et al., 2017). The proposed evaluation benchmark goes beyond classic semantic relatedness tests by providing both monolingual and cross-lingual data sets that include multiword expressions, domain-specific terms, and named entities for five languages. To measure 'semantic similarity' between pairs of lexical items, the HHU system uses the algorithm proposed in (QasemiZadeh et al., 2017), which is based on a derandomization of the 'random positive-only projections' method proposed by QasemiZadeh and Kallmeyer (2016).
Word embedding techniques (i.e., using distributional frequencies to produce word vectors of reduced dimensionality) are one of the most popular approaches to semantic word similarity problems. These methods are often rationalized using Harris' Distributional Hypothesis that words of similar linguistic properties appear with/within a similar set of 'contexts' (Harris, 1954). For example, words of related meanings co-occur with similar context words $\{c_1, \ldots, c_n\}$.
This hypothesis implies that if these context words are grouped randomly into m buckets, e.g. $\{\{c_1, \ldots, c_x\}_1, \ldots, \{c_y, \ldots, c_n\}_m\}$, then co-related words still co-occur with similar sets of buckets. QasemiZadeh and Kallmeyer (2016) exploit this assumption and propose random positive-only projections for building word vectors directly at a reduced dimensionality m. In this paper, we propose a derandomization of this method and a hash-based technique for learning word embeddings. In Section 2, we describe our method. In Section 3, we report results obtained by applying this method to the shared-task benchmark. Finally, we conclude in Section 4.

Method
Our method consists of two logical routines: (a) a text skimmer to collect co-occurrence information; and (b) a hash-based encoder to build low-dimensional vectors from the co-occurrences collected in (a). Evidently, these procedures can be merged and ordered differently to meet the requirements of an application.
To build an m-dimensional embedding for an entity w (such as a word or phrase) that co-occurs with (or within) some context elements c (resulting from the skimming routine), we take the following steps (Algorithm 1): the vector w is initialized as an m-dimensional zero vector and then, for each context element c, we compute $d = \mathrm{abs}(\mathrm{hash}(c) \,\%\, m)$ and increment $w_d$ by one.
Here, $w_d$ is the dth component of w. The hash function assigns a hash code (e.g., an integer) to each context element c. The abs function returns the absolute value of its input number, and % is the modulus operator, which gives the remainder of the division of the generated hash code by the chosen value m. Our choice of hash function is motivated by its low collision rate for short words (byte sequences) and by the closer resemblance of the computed values of d to independent and identically distributed (i.i.d.) draws.
It can be verified that the procedure proposed above implements a derandomization of QasemiZadeh and Kallmeyer's POP method: the moduli of the hash codes generated from context elements constitute a random positive-only projection matrix, and the component-wise additions compute the multiplication of this randomly generated matrix with the original high-dimensional vectors (QasemiZadeh et al., 2017).
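The encoding step described above (hash each context element, take the modulus by m, and add to that component) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `hash_code` and `encode` are hypothetical, and CRC32 from the standard library is used only as a deterministic stand-in for the paper's hash function, which is not reproduced here.

```python
import zlib


def hash_code(context: str) -> int:
    # Assumption: CRC32 over the UTF-8 bytes stands in for the paper's hash.
    # Python's built-in hash() is avoided because it is salted per process.
    return zlib.crc32(context.encode("utf-8"))


def encode(contexts, m):
    """Build an m-dimensional count vector from an iterable of context elements."""
    w = [0] * m                      # start from the zero vector
    for c in contexts:
        d = abs(hash_code(c) % m)    # bucket index in [0, m)
        w[d] += 1                    # component-wise addition
    return w


# Four observed context elements folded into an 8-dimensional vector.
w = encode(["bank", "river", "bank", "money"], m=8)
```

Because the bucket of a context element depends only on its hash code, no projection matrix has to be stored, and vectors built in separate passes over a corpus remain compatible.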

Computing Similarities
Once the vectors w are constructed, they are weighted by the expected and marginal frequencies, e.g., using positive pointwise mutual information (PPMI) (Church and Hanks, 1990; Turney, 2001). Let $W_{p \times m}$ (consisting of p row vectors w of dimensionality m) be the set of embeddings in our model (i.e., the output of Algorithm 1). The PPMI weight for a component $w_{xy}$ in W is given by:
$$\mathrm{ppmi}(w_{xy}) = \max\left(0, \log \frac{w_{xy} \times \sum_{i=1}^{p} \sum_{j=1}^{m} w_{ij}}{\sum_{i=1}^{p} w_{iy} \times \sum_{j=1}^{m} w_{xj}}\right).$$
For this task, however, we adopt cascaded PPMI weightings: PPMI-weighted vectors are weighted once more using the above-mentioned formula, i.e., we compute $\mathrm{ppmi}(\mathrm{ppmi}(W_{p \times m}))$. We believe this cascaded weighting yields better results by providing a well-balanced scaling of the original PPMI weights. Note that the weighting process is fast since it is carried out on vectors of small dimensionality m.
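As an illustration of the weighting step, the following sketch applies the standard PPMI definition to a small matrix of non-negative counts and then applies it a second time to obtain the cascaded variant. The function names `ppmi` and `cascaded_ppmi` and the toy matrix are assumptions made for this example; the natural logarithm is used, and the base is likewise an assumption.

```python
import math


def ppmi(W):
    """Return a PPMI-weighted copy of a non-negative matrix W (a list of rows)."""
    total = sum(sum(row) for row in W)
    row_sums = [sum(row) for row in W]
    col_sums = [sum(col) for col in zip(*W)]
    weighted = []
    for x, row in enumerate(W):
        new_row = []
        for y, value in enumerate(row):
            if value > 0:
                # A positive cell implies positive row and column marginals.
                pmi = math.log(value * total / (row_sums[x] * col_sums[y]))
                new_row.append(max(0.0, pmi))
            else:
                new_row.append(0.0)  # log(0) is undefined; PPMI clips it to 0
        weighted.append(new_row)
    return weighted


def cascaded_ppmi(W):
    """Weight the PPMI-weighted matrix once more: ppmi(ppmi(W))."""
    return ppmi(ppmi(W))


counts = [[4, 0, 1],
          [0, 3, 1]]
weights = cascaded_ppmi(counts)
```

Both passes operate on the p × m matrix only, so the cost of the second pass is the same as that of the first.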
Finally, we compute similarities between these weighted vectors using a correlation measure. QasemiZadeh and Kallmeyer (2016) suggest Pearson's r for PPMI-weighted vectors. Later, QasemiZadeh et al. (2017) suggest Goodman and Kruskal's γ coefficient (Goodman and Kruskal, 1954). To compute γ, concordant and discordant pairs must be counted. Given any two pairs $(x_i, y_i)$ and $(x_j, y_j)$ from two m-dimensional vectors x and y, and the value
$$v = (x_i - x_j)(y_i - y_j),$$
the two pairs are concordant if $v > 0$ and discordant if $v < 0$; if $v = 0$, the pair is neither concordant nor discordant. Let p and q be the number of concordant and discordant pairs; then γ is given by (Chen and Popovich, 2002, p. 86):
$$\gamma = \frac{p - q}{p + q}.$$
In this paper, we suggest a new estimator, $\mathrm{sim}_{lin}$, based on Lin's information-theoretic definition of similarity (Lin, 1998).
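The counting of concordant and discordant pairs behind Goodman and Kruskal's γ can be sketched as follows. The function name is hypothetical, the quadratic loop over all index pairs is the naive formulation rather than an optimized one, and returning 0 when every pair is tied is an assumption of this sketch (γ is undefined in that case).

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Goodman and Kruskal's gamma for two equal-length sequences of numbers."""
    p = q = 0
    for i, j in combinations(range(len(x)), 2):
        v = (x[i] - x[j]) * (y[i] - y[j])
        if v > 0:
            p += 1      # concordant pair
        elif v < 0:
            q += 1      # discordant pair
        # v == 0: tied pair, counted as neither
    if p + q == 0:
        return 0.0      # assumption: all pairs tied, gamma is undefined
    return (p - q) / (p + q)


gamma = goodman_kruskal_gamma([1, 2, 3], [1, 3, 2])
```

With vectors of dimensionality m, this loop visits m(m − 1)/2 pairs, which is one reason a small m keeps the similarity computation cheap.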

Extending the Method to Cross-Lingual Tasks
The proposed method can also be employed in a cross-lingual setting. However, this requires a small dictionary (translation-memory) and an additional pre-processing step.
In the pre-processing step, all pairs of lexical items in the input dictionary must first be mapped onto a common symbol space. Let us assume that the input dictionary consists of entries of the form $l \rightarrow \{t_1, \ldots, t_n\}$ (i.e., l is a lexical item in the source language which has a number of translations $t_i$ in the target language). To build the common symbol space, we generate all possible $(l, t_i)$ tuples and assign them unique identifiers, i.e., $(l, t_i) \rightarrow s$. Finally, these tuples and their assigned identifiers are flattened in a symbol table t: for instance, if $(l, t_i)$ is assigned the unique identifier s, then the entries $(l, s)$ and $(t_i, s)$ are stored in this table t. Note that the mappings in t are not necessarily one-to-one.
To build cross-lingual vectors for lexical items w in any of the input languages, similar to the monolingual setting, input corpora are scanned to collect context elements c. However, only those context elements that can be found in t are encoded into models. If t contains an identifier symbol s for a given context element c, then s is passed to Algorithm 1 to update vector w.
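The construction of the common symbol space and the filtering of context elements can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function names, the identifier format (`s0`, `s1`, ...), and the romanized placeholder translations are hypothetical, and passing every symbol of an ambiguous context element to the encoder is an assumption.

```python
from collections import defaultdict


def build_symbol_table(dictionary):
    """Map {source item: [translations]} to {lexical item: set of symbol ids}."""
    table = defaultdict(set)
    next_id = 0
    for l, translations in dictionary.items():
        for t_i in translations:
            s = f"s{next_id}"       # one unique identifier per (l, t_i) tuple
            next_id += 1
            table[l].add(s)         # entry (l, s)
            table[t_i].add(s)       # entry (t_i, s)
    return dict(table)


def map_contexts(contexts, table):
    """Drop contexts missing from the table; replace the rest by their symbols."""
    symbols = []
    for c in contexts:
        symbols.extend(sorted(table.get(c, ())))
    return symbols


# Hypothetical toy dictionary with romanized placeholders for the target side.
table = build_symbol_table({"bank": ["baank", "saahel"]})
symbols = map_contexts(["the", "bank", "saahel"], table)
```

Since a source item with several translations receives several identifiers, the table is not one-to-one, which matches the note in the text; the resulting symbols are what would be handed to the hash-based encoder in place of the raw context words.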

General Settings
As input, we use the Wikipedia text corpora provided by the task organizers. In our reports, we include results from the sense-based NASARI vectors (i.e., the baseline introduced by the organizers): 300-dimensional embeddings obtained using a hybrid approach (Camacho-Collados et al., 2016). The evaluation metric is the harmonic mean (H) of the Pearson's r and Spearman's ρ correlations between the gold scores in the test datasets (i.e., scores assigned by humans to word pairs) and the corresponding system-generated scores.
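For reference, the official score H is the harmonic mean of the two correlation coefficients. The sketch below uses a hypothetical helper name and, as input, the r and ρ values reported for the first Farsi run in the Monolingual Subtask section; because those inputs are already rounded, recomputed scores may differ from the official ones in the last digit.

```python
def harmonic_mean(r, rho):
    """Harmonic mean of Pearson's r and Spearman's rho (the task's score H)."""
    if r + rho == 0:
        return 0.0  # assumption: avoid division by zero for degenerate input
    return 2 * r * rho / (r + rho)


# r and rho as reported for the first Farsi run.
h = harmonic_mean(0.541, 0.585)
```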
We treat multi-word expressions in the same way as single-token words. Given a list of tokens, instead of collecting co-occurrence information only for single tokens, we extend our scan of input corpora to contiguous n-gram sequences of tokens, where the maximum n is decided by the maximum length of items in the evaluation test sets. In effect, we limit the active vocabulary of our system and collect co-occurrence information only for those lexical items in the task's test sets.

Monolingual Subtask
To collect co-occurrence information from the input corpora, given their small size, we adopt a greedy approach. Input corpora are read line by line; if a lexical item $w_t$ in our target vocabulary appears in a line at span i to j, we update $w_t$ by passing the following items as context elements to Algorithm 1:
Feature Sets:
• The whole line (as one unit): this is done to capture information about possible co-occurrences of test lexical items within a large context (as is done in word-by-document models).
• All the tokens from position i − 20 to j + 20 (including $w_t$), i.e., the classic sliding context window. We include $w_t$ to enforce similarity between pairs of multiword lexical items that share constituent tokens.
• All n-grams (n ∈ {3, 4}) generated from each of the tokens appearing in the above sliding context window: this is done to capture information about the morphological structure of the context words.
Table 1: Results for our official submissions.
Table 1 summarizes the results and configurations that we have used in our official submissions. For Farsi, for the first run, we built vectors of dimensionality m = 2000, weighted them using cascaded-PPMI (see Section 2.1), and used Pearson's r as a similarity measure. Evaluated by the organisers, this resulted in r = 0.541, ρ = 0.585, and the official score of H = 0.562. In the second run, however, we built vectors of dimensionality m = 2500 and, after cascaded-PPMI weighting, similarities were computed using $\mathrm{sim}_{lin}$. This resulted in scores of r = 0.606, ρ = 0.601, and H = 0.604. To choose these configurations, we relied on the trial data as well as the resources introduced in Camacho-Collados et al. (2015). For English, we observed that adding n-gram features deteriorates results; hence, we removed this set of features from our model of dimensionality m = 2500. In both runs, we used cascaded-PPMI. As a similarity measure, we used $\mathrm{sim}_{lin}$ and Pearson's r in the first and second run, respectively. This produced scores of r = 0.71, ρ = 0.699, and H = 0.704 for the first run, and r = 0.656, ρ = 0.697, and H = 0.676 for the second run. Note that for both languages, we could not build vectors for a number of lexical items since they did not occur in the input corpora (see the last column of Table 2 for details).
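The three feature sets listed in this section can be sketched as a single extraction function. This is a simplified illustration under stated assumptions: tokens come from whitespace splitting, the n-grams are read as character n-grams of each token (suggested by the reference to morphological structure), and the names `char_ngrams` and `context_features` are hypothetical. The `use_ngrams` flag mirrors the English configuration, in which the n-gram features were removed.

```python
def char_ngrams(token, sizes=(3, 4)):
    """All character n-grams of the given sizes (assumed reading of 'n-grams')."""
    return [token[k:k + n] for n in sizes for k in range(len(token) - n + 1)]


def context_features(tokens, i, j, window=20, use_ngrams=True):
    """Context elements for a target spanning tokens[i..j] (inclusive)."""
    features = [" ".join(tokens)]            # 1. the whole line as one unit
    lo = max(0, i - window)
    hi = min(len(tokens), j + window + 1)
    window_tokens = tokens[lo:hi]            # 2. sliding window, target included
    features.extend(window_tokens)
    if use_ngrams:
        for tok in window_tokens:            # 3. n-grams of each window token
            features.extend(char_ngrams(tok))
    return features


tokens = "the river bank was flooded".split()
feats = context_features(tokens, i=2, j=2, window=1)
```

Each returned string would then be passed as a context element c to the encoder that updates the vector of the target item.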

Extended Evaluations
While our official submissions are limited to English and Farsi, to provide a better understanding of the method's performance, we provide results for all five languages in the monolingual subtask. To build models, we use the feature sets described in the previous section. The remaining hyper-parameter of our method is m (the dimensionality of models); we report results for m ∈ {300, 700, 2000}. Results obtained using various combinations of weighting techniques and similarity measures are summarized in Tables 2 to 7. Note that slight improvements in the results for Farsi are due to homogenizing character encoding: zero-width non-joiner characters (U+200C) are replaced by the space character (U+0020), the Arabic letter Kaf (U+0643) is replaced by the Farsi letter Kaf (U+06A9), and the Arabic letter Yeh (U+064A) is replaced by the Farsi letter Yeh (U+FBFC).
Regardless of the choice of weighting technique and similarity measure, an increase in m often produces better results, but at the expense of higher computational cost. In addition, as suggested in Section 2.1, by comparing the results in Tables 2 to 4 with those in Tables 5 to 7, we observe that using cascaded-PPMI weighting instead of simple PPMI weighting often yields better scores. The only exception is when m is small (e.g., m = 300) and γ is used to measure similarities. For m = 300, the combination of PPMI weighting and γ gives the best performance (Table 3); we observe that for m = 300, this combination also gives the best results for Camacho-Collados et al.'s data sets.
Table 2: Results for vectors of various dimensionality (denoted by dim), when using PPMI for weighting and Pearson's r for measuring similarity between them. H denotes the harmonic mean of r and ρ (i.e., the task's official score). #M is the number of lexical items which have not occurred in our input corpora; for pairs containing these items, we use 0 as a default value for similarity. Those settings that yield better results than the baseline are marked using ↑.
Table 4: Method's performance when using the combination of PPMI and $\mathrm{sim}_{lin}$.

Cross-Lingual Subtask
We applied the methodology described in Section 2.2 to build cross-lingual embeddings for the pair English and Farsi. To build the common symbol space, we extracted an English-to-Farsi translation dictionary from the English Wiktionary dump of January 2017, containing translations for 7500 lexical items in English. These 7500 entries were converted to a symbol table t of size 17,760. We then augmented this table with Wikipedia's title translations. As a result, the number of entries in t increased to 1,299,770.
For each w in the test data set, we collected co-occurrences from a context window (extending 20 tokens on each side of w) for both words and multiword expressions that appear in t. Note that the sole input to our method was unaligned text from the English and Farsi Wikipedia corpora (similar to the monolingual setting). In both runs, we used vectors of dimensionality m = 3000 and the proposed $\mathrm{sim}_{lin}$ measure to compute similarities between vectors. To weight vectors, we used cascaded-PPMI in the first run and simple PPMI in the second run. Table 8 provides a summary of the method's performance. Surprisingly, our simple methodology performs at least as well as the baseline technique. The results in Table 8 can easily be improved by feeding in additional input, particularly parallel corpora. For instance, we observe that using the Open Subtitles corpus in addition to the Wikipedia corpus can enhance the results for the combination of cascaded-PPMI and $\mathrm{sim}_{lin}$ (Run 1) from H = 0.505 to 0.575.

Conclusion
This paper described the methodology behind the HHU system that participated in the SemEval 2017 shared task on semantic word similarity. The proposed technique uses a hash-based algorithm for building embeddings. The method is fast and simple, and it demands only a small amount of computational resources to build a model. Empirical evaluations show that the method achieves acceptable performance in semantic similarity tasks. Our code is available for download (https://user.phil.hhu.de/~zadeh/material/hash-vectors/) so that the results reported in this paper can be replicated.