QLUT at SemEval-2017 Task 2: Word Similarity Based on Word Embedding and Knowledge Base

This paper describes our system submissions to task 2 of SemEval-2017. We participated in subtask 1, the English monolingual subtask, which evaluates the semantic similarity of two linguistic items. The runs are assessed by standard Pearson and Spearman correlation against the official gold standard set. The best performance among our runs is 0.781 (Final). Our runs mainly make use of word embeddings and a knowledge-based method. The results demonstrate that the combined method is effective for computing word similarity, while the word embeddings and knowledge-based techniques each need further refinement in their details.


Introduction
Semantic word similarity aims at measuring the extent to which two words are similar (Camacho-Collados et al., 2017). Given two words, a run in this competition should output a score indicating their similarity, which is then evaluated against the official gold standard set. The task does not offer any annotated corpus, and the organizers encourage systems to utilize unlabeled corpora. With the development of word embedding techniques, more and more attention has been paid to them (Mikolov et al., 2013a; Mikolov et al., 2013b). We also adopt the word embeddings method in our runs.
Besides the word embeddings method, we propose a knowledge-based method built on BabelNet (Navigli and Ponzetto, 2012). Integrating Wikipedia and WordNet, BabelNet is a multilingual encyclopedic and lexicographic knowledge base that builds an enormous semantic network linking concepts and named entities through a large number of semantic relations.
Based on the word embedding method and the knowledge-based method, a combined method is implemented, which achieves the best performance.

System Overview
In subtask 1 (English monolingual word similarity) of this task, we submitted two system runs, both of which are unsupervised. We mainly utilize the word embeddings method and the combined method.
Figure 1 shows the framework of our system runs. In the top part of the figure, word1 and word2 are the inputs to our systems. Run1 uses the word embeddings method; Run2 uses the combined method, which builds on both the word embeddings method and the knowledge-based method.

Dataset
Test Set: In this task, we submit our runs on the English monolingual word similarity dataset, which includes 500 tab-separated word pairs. Each item in a pair may be a concept or a named entity.
Gold Standard Set: This set is annotated by the official annotators. Each line holds a similarity value, on a [0-4] rating scale, for the corresponding pair in the test set described above:
4: the two words are very similar, i.e., synonyms;
3: the two words are similar, but have slightly different details;
2: the two words are slightly similar, sharing a topic/domain/function or related ideas and concepts;
1: the two words are dissimilar, having only some small details in common;
0: the two words are totally dissimilar.

Word Embeddings Method
In this competition, we use the word2vec toolkit (https://code.google.com/p/word2vec/) to train word embeddings on the English Wikipedia corpus (https://sites.google.com/site/rmyeid/projects/polyglot). Before training, we preprocess the text of the corpus to convert its character encoding to UTF-8, which is the encoding the word2vec toolkit expects by default. We set the training window size to 5, the dimensionality to 200, and choose the Skip-gram model. After training on the corpus, the word2vec toolkit generates a word embeddings file in which each word in the Wikipedia corpus is mapped to a 200-dimensional word embedding whose components are double-precision values.
Word Similarity: Mikolov has shown that word embeddings carry semantic meaning (Mikolov et al., 2013a). Therefore, given two words, the semantic word similarity can be easily obtained as the cosine of their word embeddings:

sim(w1, w2) = v(w1) · v(w2) / (|v(w1)| |v(w2)|)

where v(w1) is the word embedding of word w1, and |v(w1)| and |v(w2)| are the lengths (norms) of v(w1) and v(w2), respectively.
Phrase Similarity: As Mikolov has shown that a phrase vector can be easily obtained by simple vector addition (Mikolov et al., 2013b), we compute the similarity between two phrases as the cosine of their averaged word vectors:

sim(p1, p2) = cos( (1/|p1|) Σ_{w∈p1} v(w), (1/|p2|) Σ_{w∈p2} v(w) )

where |p1| and |p2| are the numbers of words contained in phrases p1 and p2, respectively, and w ranges over the words of each phrase.
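As a minimal sketch of these two computations, the following uses toy low-dimensional vectors standing in for the trained 200-dimensional embeddings (the words and values below are illustrative only, not taken from our trained model):

```python
import numpy as np

# Toy 4-dimensional embeddings standing in for the 200-dimensional
# word2vec vectors trained on Wikipedia (values are illustrative only).
vec = {
    "car":   np.array([0.9, 0.1, 0.0, 0.2]),
    "auto":  np.array([0.8, 0.2, 0.1, 0.3]),
    "fruit": np.array([0.1, 0.9, 0.4, 0.0]),
}

def cosine(a, b):
    """sim(w1, w2) = v(w1) . v(w2) / (|v(w1)| |v(w2)|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_sim(w1, w2):
    return cosine(vec[w1], vec[w2])

def phrase_vec(phrase):
    # A phrase vector is the average of its word vectors (vector addition
    # followed by division by the number of words).
    words = phrase.split()
    return sum(vec[w] for w in words) / len(words)

def phrase_sim(p1, p2):
    return cosine(phrase_vec(p1), phrase_vec(p2))

print(word_sim("car", "auto"), word_sim("car", "fruit"))
```

With real embeddings the same functions apply unchanged; only the `vec` lookup table would come from the trained word2vec model instead.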

Knowledge-based Method
Thanks to BabelNet, which provides large coverage of concepts and named entities connected by many semantic relations, such as synonymy, hypernymy, and hyponymy, we can obtain the semantic relations between two given words (each being a concept or named entity) through the BabelNet API. In order to compute the similarity of two words easily, we implement the following algorithm.
Lines 1-2: if the system cannot cover a word in the evaluation data, we assign a similarity of 0.5, following the official suggestion.
Lines 3-4: if the two input items belong to the same synset, we assign a similarity of 1.0.
Lines 5-6: otherwise, the program iteratively searches the related synsets of word1 and word2, respectively, until they share a common related synset or the number of search steps exceeds a preset threshold γ. Due to the large cost of the subsequent graph computation, we simply set γ to a maximum of 10 steps.
Lines 7-8: if the number of steps exceeds γ, we consider that it may cost more than 10 steps to find a common synset connecting the two words in the graph, or that no such synset exists; in other words, the two words are at best weakly similar, so we simply set their similarity to 0.0.
Lines 9-14: if the number of steps does not reach γ, we construct a graph from word1, word2, and their related synsets by means of the JUNG toolkit (http://jrtom.github.io/jung/), and then traverse the graph to obtain the Dijkstra shortest path length path between word1 and word2. We take the reciprocal of path raised to the power μ as their similarity:

sim = 1 / path^μ

where path is the Dijkstra shortest path length described above and μ is manually set to 1.4, which adjusts the similarity sim into a proper range (see 2.1).
Line 15: finally, the algorithm returns the similarity sim.
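The shortest-path scoring above can be sketched with a toy graph (the node names below are illustrative, not real BabelNet synsets; since the toy graph is unweighted, breadth-first search stands in for Dijkstra):

```python
from collections import deque

# Toy undirected semantic graph standing in for BabelNet synset links
# (illustrative nodes only, not real BabelNet identifiers).
edges = {
    "dog": ["canine"],
    "canine": ["dog", "carnivore"],
    "carnivore": ["canine", "feline"],
    "feline": ["carnivore", "cat"],
    "cat": ["feline"],
}

def shortest_path_len(g, src, dst):
    """Unweighted shortest path via BFS (Dijkstra reduces to BFS here)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nb in g.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # no connecting path found

MU = 1.4  # manually chosen exponent (Section 2.3)

def kb_sim(w1, w2):
    if w1 not in edges or w2 not in edges:
        return 0.5                    # word not covered: official default
    if w1 == w2:
        return 1.0                    # same synset
    d = shortest_path_len(edges, w1, w2)
    return 0.0 if d is None else 1.0 / d ** MU   # sim = 1 / path^mu
```

In the toy graph, "dog" and "cat" are four edges apart, so their score is 1 / 4^1.4, illustrating how longer paths decay toward the 0.0 assigned to unconnected pairs.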

Combined Method
This method is generated directly by combining the two methods described above, i.e., the word embeddings method and the knowledge-based method, in order to leverage the strengths of both. More specifically, we use the following equation to obtain the final similarity.
sim_final = α · sim_kb + (1 − α) · sim_we

where sim_kb represents the semantic similarity given by the knowledge-based method and sim_we stands for the semantic similarity given by the word embeddings method. The parameter α is a manually set factor for balancing the results of the two methods; it is set to 0.6.
The value sim_final is the final result.


Evaluation

Run1: This run uses the word embeddings method described in Section 2.2. Given two words or phrases, it obtains the semantic similarity by computing the cosine between their vectors.
Run2: This run uses the combined method described in Section 2.4, leveraging both the word embeddings method and the knowledge-based method.
Runkb: This run uses the knowledge-based method described in Section 2.3.
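The weighting that Run2 applies (Section 2.4) amounts to a simple linear interpolation of the two scores; a minimal sketch, assuming the two per-pair scores are already computed (the helper name combined_sim is ours):

```python
ALPHA = 0.6  # manually set weight on the knowledge-based score (Section 2.4)

def combined_sim(sim_kb, sim_we, alpha=ALPHA):
    """sim_final = alpha * sim_kb + (1 - alpha) * sim_we."""
    return alpha * sim_kb + (1 - alpha) * sim_we

# e.g. a pair scored 1.0 by the knowledge base and 0.5 by the embeddings
print(combined_sim(1.0, 0.5))
```

Lowering alpha shifts weight toward the embedding score, which is how the post-hoc setting of 0.4 reported in the evaluation changes the mixture.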
The runs are evaluated according to the standard Pearson and Spearman correlation measures; the final score is given in the last column of Table 1, and the last row is the baseline system created by the task organizers. As Table 1 shows, the system Run2 makes a 9.5% (Final) improvement over the baseline system (NASARI) and achieves the best performance. The performance of Run1 does not exceed the baseline. Table 2 shows that Runkb achieves its best performance when μ is set to 1.4 (see 2.3). Table 3 shows that Run2 achieves its best performance when α is set to 0.4 instead of 0.6 (see 2.4). These results show that neither the word embeddings method nor the knowledge-based method is effective enough on its own, while their combination reaches the best performance of 0.781 among all our runs.
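As a minimal sketch of these measures in plain Python (assuming, per the task description, that the Final column is the harmonic mean of the Pearson and Spearman scores; the rank transform below ignores ties for simplicity):

```python
def pearson(x, y):
    """Standard Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Rank transform; ties are not averaged, which suffices for a sketch.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def final_score(x, y):
    """Harmonic mean of Pearson and Spearman (the 'Final' column)."""
    p, s = pearson(x, y), spearman(x, y)
    return 2 * p * s / (p + s)
```

A system's 500 similarity scores and the 500 gold values would be passed as `x` and `y`; a monotone but nonlinear relationship yields a Spearman score above the Pearson score, which the harmonic mean then balances.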

Conclusions and Future Work
Our best run achieves a performance of 0.781 (Final). This shows that the combined method is more effective for computing word similarity than either the word embeddings method or the knowledge-based method alone. There is still large room to improve the performance of the word embeddings method and the knowledge-based method. In the future, we will refine the various relations among words to improve the knowledge-based method.