QLUT at SemEval-2017 Task 1: Semantic Textual Similarity Based on Word Embeddings

This paper reports the details of our submissions to Task 1 of SemEval-2017, which aims at assessing the semantic textual similarity of two sentences or texts. We submit three unsupervised systems based on word embeddings; the runs differ only in the preprocessing applied to the evaluation data. The best performance of these systems, measured by Pearson correlation, is 0.6887. Unsurprisingly, the results of our runs demonstrate that data preprocessing, such as tokenization, lemmatization, extraction of content words and removal of stop words, is helpful and plays a significant role in improving the performance of the models.


Introduction
Semantic Textual Similarity (STS), a fundamental task in the natural language processing (NLP) field, has been held at SemEval since 2012 (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015; Agirre et al., 2016). It aims at computing the semantic similarity of two short texts or sentences, and the result is evaluated against a gold standard set produced by several official annotators (Cer et al., 2017). In recent years, word embeddings (Mikolov et al., 2013a), an unsupervised method, have become increasingly popular at SemEval (Jimenez, 2016; Wu et al., 2016).
This paper describes the submission of our systems to STS 2017, all of which utilize the word embedding method. Unlike the teams mentioned above who have used word embeddings, what we pay attention to is the preprocessing of the evaluation data. With this consideration, we process the evaluation data with different methods in order to verify whether preprocessing helps. The framework of our systems is shown in Figure 1 and is briefly described as follows.

Figure 1: Framework of the system.

Tokenization: This step tokenizes the two sentences given as the system's input. Although English words are naturally separated by spaces, punctuation marks are not. For instance, the sentence "A person is on a baseball team." is tokenized to "A person is on a baseball team .".
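The tokenization step amounts to separating punctuation from adjacent words. A minimal regex-based sketch is shown below; it is only an illustration, as our actual systems rely on an established tokenizer rather than this hand-written rule.

```python
import re

def tokenize(sentence):
    """Split a sentence into tokens, separating punctuation from words."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation mark, so "team." becomes the two tokens "team" and "."
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("A person is on a baseball team."))
# ['A', 'person', 'is', 'on', 'a', 'baseball', 'team', '.']
```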
Extraction of content words: In this step, the content words of the tokenized sentence are extracted. For example, the tokenized sentence "A person is on a baseball team ." turns into "person is baseball team". In this paper, content words comprise nouns, verbs, adverbs and adjectives.
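Content-word extraction can be sketched as filtering by part-of-speech tag. The example below assumes Penn Treebank tags as produced by a POS tagger (the tagged sentence here is written out by hand for illustration):

```python
# Content words: nouns (NN*), verbs (VB*), adjectives (JJ*), adverbs (RB*)
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def extract_content_words(tagged_tokens):
    """Keep tokens whose Penn Treebank POS tag marks a content word."""
    return [word for word, tag in tagged_tokens
            if tag.startswith(CONTENT_PREFIXES)]

# Tags as a POS tagger would produce them for the running example
tagged = [("A", "DT"), ("person", "NN"), ("is", "VBZ"), ("on", "IN"),
          ("a", "DT"), ("baseball", "NN"), ("team", "NN"), (".", ".")]
print(extract_content_words(tagged))
# ['person', 'is', 'baseball', 'team']
```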
Lemmatization: Words in English sentences appear in a variety of forms. This step lemmatizes each word to its base form; for example, "made" and "making" are both changed to "make". In addition, this step converts uppercase to lowercase, so "Make" becomes "make".
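The lemmatization step can be illustrated with a toy lookup table. The table below is a hand-written stand-in for demonstration only; our systems use a full lemmatizer rather than such a table.

```python
# Toy lemma table for illustration; a real lemmatizer covers the whole
# vocabulary rather than a hand-picked handful of forms.
LEMMAS = {"made": "make", "making": "make", "is": "be", "teams": "team"}

def lemmatize(token):
    """Lowercase a token and map it to its base form if known."""
    token = token.lower()
    return LEMMAS.get(token, token)

print([lemmatize(t) for t in ["Made", "making", "Make"]])
# ['make', 'make', 'make']
```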
Word embeddings: This step uses the word2vec toolkit (https://code.google.com/p/word2vec/) to train on the Wikipedia corpus, from which the word embeddings are obtained.
Sentence similarity: The similarity of two sentences is computed as the cosine of their sentence embeddings, which are easy to obtain (see Section 2.3).
Normalization: Because the results of the runs fall in a different range, the similarity scores are normalized to meet the official standard.

System Overview
In STS 2017, we submit three system runs, all of which are unsupervised and utilize word embedding method after preprocessing.

Data Set
Test Set: The test set of Track 5 (English monolingual pairs) consists of 250 sentence pairs. Each sentence pair occupies one line, with the two sentences separated by a tab.
Gold Standard Set: This set contains the gold standard similarity scores of the 250 sentence pairs in the test set. The scores range from 0 to 5. More specifically, 0 denotes that the two sentences are completely dissimilar; 1 means that they only share the same topic; 2 represents that they only have some details in common; 3 shows that they are approximately equivalent but differ in some important details; 4 implies that they are roughly equivalent and their remaining differences are unimportant; 5 indicates that they are completely equivalent.

Wikipedia Corpus
We use an unlabeled corpus, namely the English Wikipedia corpus, which has been processed by Rami Al-Rfou' (https://sites.google.com/site/rmyeid/projects/polyglot). The processed Wikipedia dumps have been tokenized into plain-text format for all the languages considered in the evaluation. What we use in the system runs is the English Wikipedia dump; after decompression, it yields a text file of 15.8 GB.

Method
In this competition, we run the word2vec toolkit on the Wikipedia corpus described above to train word embeddings. Before training, we convert the character encoding of the text file to UTF-8, the encoding under which we run the word2vec toolkit. We set the training window size to 5 and the dimensionality to 200, and choose the Skip-gram model. After training on the corpus, word2vec generates a word embeddings file in which each word of the corpus is mapped to a 200-dimensional word embedding; each dimension is a double-precision floating-point number.
Mikolov has shown that word embeddings carry semantic meaning (Mikolov et al., 2013a). Therefore, given two words, their semantic similarity can be easily obtained as the cosine of their word embeddings. Moreover, we can extend this to semantic sentence similarity. Inspired by (Mikolov et al., 2013b; Wu et al., 2016), the sentence embedding of a sentence is obtained by accumulating the word embeddings of all its words. The semantic sentence similarity is then the cosine of the two sentence embeddings:

sim(s1, s2) = cos( (1/|s1|) Σ_{w ∈ s1} v(w), (1/|s2|) Σ_{w ∈ s2} v(w) )

where |s1| and |s2| are the numbers of tokens in sentences s1 and s2, respectively, and v(w) denotes the word embedding of word w.
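The sentence similarity computation can be sketched as follows. The embeddings here are toy 3-dimensional vectors chosen for brevity; the actual system uses the 200-dimensional embeddings trained on Wikipedia.

```python
import numpy as np

# Toy 3-dimensional embeddings; the real ones are 200-dimensional
EMB = {"person":   np.array([0.9, 0.1, 0.0]),
       "be":       np.array([0.1, 0.8, 0.1]),
       "baseball": np.array([0.2, 0.1, 0.9]),
       "team":     np.array([0.3, 0.2, 0.8])}

def sentence_embedding(tokens):
    """Average the embeddings of all tokens in the sentence."""
    return sum(EMB[t] for t in tokens) / len(tokens)

def similarity(s1, s2):
    """Cosine of the two sentence embeddings."""
    v1, v2 = sentence_embedding(s1), sentence_embedding(s2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Identical sentences have cosine similarity 1
print(round(similarity(["person", "be", "team"], ["person", "be", "team"]), 3))
```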

Runs
All of our runs utilize the same method described above, i.e., the word embeddings method. The only difference among them is the preprocessing applied to the evaluation data, described in detail below.
Run1: We first tokenize the sentence pairs in the evaluation data with the Stanford CoreNLP toolkit (Manning et al., 2014), then extract the content words of the sentence pairs.
Run2: As in Run1, we tokenize the sentence pairs and extract their content words. Beyond that, we lemmatize these content words with the Stanford CoreNLP toolkit.
Run3: The only operation is to tokenize the sentence pairs of the evaluation data. Compared with Run1, all words are retained in this run.
Finally, in order to carry out the official evaluation, we normalize the output of these systems from [0, 1] to [0, 5].
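The normalization step is a simple rescaling. The clipping of out-of-range values below is our own assumption for robustness; the paper does not state how such values are handled.

```python
def normalize(score):
    """Map a similarity score in [0, 1] to the official [0, 5] scale.
    Clipping to [0, 1] first is an assumption not stated in the paper."""
    return 5.0 * max(0.0, min(1.0, score))

print(normalize(0.4))  # 2.0
```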
The three runs are submitted to the official evaluation; they are compared in Table 1.
In order to further examine the influence of stop words, we perform another group of experiments. Based on the runs in Table 1, we remove the stop words listed in the NLTK package. The corresponding results are shown in Table 2.
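Stop-word removal can be sketched as set-based filtering. The stop-word set below is a small hand-picked subset shown for illustration; the experiments use the full English list from NLTK's `stopwords` corpus.

```python
# A small illustrative subset of an English stop-word list; the
# experiments use the full list from nltk.corpus.stopwords instead.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["A", "person", "is", "on", "a", "baseball", "team"]))
# ['person', 'baseball', 'team']
```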

Evaluation
In this task, the official evaluation is based on Pearson correlation: a system run on each test set is evaluated by its Pearson correlation with the officially provided gold standard set.
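The evaluation metric can be computed directly with NumPy; the toy scores below are invented for illustration and are not the actual evaluation data.

```python
import numpy as np

def pearson(system_scores, gold_scores):
    """Pearson correlation between system output and the gold standard."""
    return float(np.corrcoef(system_scores, gold_scores)[0, 1])

# A toy run whose scores are a linear function of the gold scores
# correlates perfectly (Pearson correlation 1.0)
print(pearson([0.0, 2.5, 5.0], [1.0, 2.0, 3.0]))
```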
The results in Table 1 show that Run2 achieves the best performance of 0.6433. Compared with Run1, Run2 achieves a 2.78% improvement, which implies that lemmatizing content words is helpful. The 12.31% difference between Run1 and Run3 indicates that the extraction of content words yields an even larger improvement in the similarity computation of the sentence pairs.
In order to further examine the effect of lemmatization on Run3, we build the system Run3'. The only difference between them is that in preprocessing the data, Run3' lemmatizes the sentence pairs, whereas Run3 does not. The contrast between Run3 and Run3' again confirms that lemmatization is effective for computing the similarity of the sentence pairs.
As shown in Table 2, the relative performance of each run is similar to that in Table 1. Run2 gets the best performance of 0.6887, which demonstrates the effectiveness of content word extraction and lemmatization. Each run in Table 2 achieves a better performance than its counterpart in Table 1, which demonstrates that it is necessary to remove stop words.

Conclusions and Future Work
The best Pearson correlation of our runs is 0.6887. Although our runs do not reach the state-of-the-art performance, their results are acceptable and show that the word embeddings method is effective. Besides, from the competition we can conclude that appropriate preprocessing of the data (such as tokenization, lemmatization, extraction of content words and removal of stop words) is helpful and necessary. In the future, with the help of word embeddings, we will explore improved methods to achieve better performance.