SEF@UHH at SemEval-2017 Task 1: Unsupervised Knowledge-Free Semantic Textual Similarity via Paragraph Vector

This paper describes our unsupervised, knowledge-free approach to the SemEval-2017 Task 1 competition. The proposed method uses Paragraph Vector to assess the semantic similarity between pairs of sentences. We experimented with various vector dimensionalities and three state-of-the-art similarity metrics. For the cross-lingual tasks, we trained a model for each of the two languages involved and combined the models by averaging their similarity scores. In terms of Pearson correlation, our submitted runs score above the median for five out of seven test sets. Moreover, one of our runs performed best on the Spanish-English-WMT test set, ranking first among the 53 runs submitted in total by all participants.


Introduction
Semantic Textual Similarity (STS) aims to assess the degree to which two snippets of text are related in meaning to each other. The SemEval annual competition offers a track on STS (Cer et al., 2017) where submitted STS systems are evaluated in terms of the Pearson correlation between machine assigned semantic similarity scores and human judgments.
We participated in both the monolingual and the cross-lingual sub-tracks. Given a pair of sentences, the SemEval STS task is to assign it a similarity score ranging from 0 to 5, with 0 indicating that the meanings of the sentences are completely independent and 5 denoting semantic equivalence (Cer et al., 2017). The cross-lingual side of STS is similar to the monolingual task, but differs in that the two input sentences come from two different languages.
This year's shared task features six sub-tasks: Arabic-Arabic, Arabic-English, Spanish-Spanish, Spanish-English (two test sets), English-English and a surprise task (Turkish-English) for which no annotated data is offered.
For example, for the English monolingual STS track, the pair of sentences below had a score of 3 assigned by human annotators, meaning that the two sentences are roughly equivalent, but some essential information differs or is missing (Cer et al., 2017).
Bayes' theorem was named after Rev Thomas Bayes and is a method used in probability theory.
As an official theorem, Bayes' theorem is valid in all universal interpretations of probability.
We present an unsupervised, knowledge-free approach that utilizes Paragraph Vector (Le and Mikolov, 2014) to represent sentences as continuous distributed vectors. In addition to experimenting with feature spaces of different dimensionality, we also compare three state-of-the-art similarity metrics (Cosine, Bray-Curtis and Correlation) for calculating the STS scores. We make use of no lexical or semantic resources, and of no hand-annotated corpora beyond the non-annotated text on which the distributed representations are trained. The approach gives promising results on all sub-tasks, with our submitted systems ranking first out of 53 for one Spanish-English sub-track and scoring above the median for five out of seven test sets.
We first briefly summarize related work in STS and describe Paragraph Vector in Section 2. We then present our method in Section 3, along with the corpora used to train the Paragraph Vector models. Section 4 contains an overview of the evaluation and the results.

Semantic Textual Similarity
In this section we review prior work on the STS task that uses Paragraph Vector, since it is the most relevant to ours. King et al. (2016), for instance, make use of Paragraph Vectors as one approach in the English monolingual sub-task. They report results for a single vector size, with the Cosine metric employed to obtain the similarity score between sentences. Brychcín and Svoboda (2016) follow a similar approach but also apply it to the cross-lingual task.
We raise three research questions regarding the usage of Paragraph Vector in STS:
• To what degree does the vector size matter?
• Is there a better alternative to the traditional Cosine metric for measuring the similarity between two vectors (obtained with Doc2Vec)?
• Given a cross-lingual task, does averaging the similarity scores obtained from the Doc2Vec models trained on both language corpora result in an improvement over using only the scores from one model?

Paragraph Vector
In order to assess the semantic textual similarity of two sentences, the way they are represented is crucial. Le and Mikolov (2014) propose Paragraph Vector, a continuous, distributed vector representation of phrases, sentences and documents. It continues the work of Mikolov et al. (2013a), where word vectors (embeddings) were introduced to semantically represent words. The strength of word embeddings in capturing word semantics is visible not only for words with similar meaning, like "strong" and "powerful" (Le and Mikolov, 2014), but also in learning relationships such as male/female, where the vector computed as King - Man + Woman lies very close to the vector for Queen (Mikolov et al., 2013b).
In the Paragraph Vector framework, the paragraph vectors are concatenated with the word vectors to form one vector. The paragraph vector acts as a memory of what is missing in the current context. The word vectors are shared across all paragraphs, while the paragraph vector is shared across all contexts generated from the same paragraph. The vectors are trained using stochastic gradient descent with backpropagation (Le and Mikolov, 2014).
Since the STS task requires assigning a similarity score between two sentences, we apply Paragraph Vector at the sentence level. The models are trained using the Gensim library (Řehůřek and Sojka, 2010).

Semantic Textual Similarity via Paragraph Vector

Corpora
For training the Doc2Vec models we used various corpora available for the different language pairs. Following the rationale of Lau and Baldwin (2016), we also concatenated the test sets to the training corpora, since Doc2Vec training is purely unsupervised. Except for Commoncrawl and SNLI (Bowman et al., 2015), the corpora we used are made available by OPUS (Tiedemann, 2012): Wikipedia (Wolk and Marasek, 2014), TED, MultiUN (Eisele and Chen, 2010), EUBookshop (Skadiņš et al., 2014), SETIMES, Tatoeba, WMT and News Commentary. The following table presents which corpora were used and how many sentences they consist of; corpora marked with * were used only for the third run. The SNLI, WMT and News Commentary corpora were added for run 3 in some sub-tasks, where we aimed to assess whether using more data makes a difference. For training the English models, only the EN side of the ES-EN language pair was used.

Preprocessing
For the sub-tasks that included the Arabic language we utilized the Stanford Arabic Segmenter (Monroe et al., 2014) in order to reduce lexical sparsity. For all the other sub-tasks, we performed text normalization, tokenization and lowercasing using the scripts available in the Moses Machine Translation Toolkit (Koehn et al., 2007).
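For illustration, a rough Python approximation of this normalization pipeline might look as follows. The actual system uses the Moses Perl scripts, not this code, and real Moses tokenization handles many more cases:

```python
import re

def preprocess(sentence):
    # Lowercase, detach common punctuation from words, and normalize whitespace,
    # roughly mimicking Moses-style tokenization plus lowercasing.
    sentence = sentence.strip().lower()
    sentence = re.sub(r"([.,!?;:()\"])", r" \1 ", sentence)  # split off punctuation
    return re.sub(r"\s+", " ", sentence).strip().split()

tokens = preprocess("Bayes' theorem is valid, as an official theorem.")
```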

Methods
We assess the semantic similarity between two sentences based on their continuous vector representations obtained by means of various Paragraph Vector models. A similarity metric is applied afterwards in order to determine the proximity between the two vectors. This measure is directly used as the similarity score of the two sentences.
For all sub-tasks we experiment with the PV-DBOW training algorithm, various vector sizes (200, 300 and 400) and several state-of-the-art similarity metrics (Cosine, Bray-Curtis, Correlation), defined as:

Cosine(u, v) = (u · v) / (||u|| ||v||)

Bray-Curtis(u, v) = 1 − (Σ_i |u_i − v_i|) / (Σ_i |u_i + v_i|)

Correlation(u, v) = ((u − ū) · (v − v̄)) / (||u − ū|| ||v − v̄||)

where u and v are the vector representations of the two sentences, ū and v̄ denote the mean value of the elements of u and v, and x · y is the dot product of x and y.
The Cosine metric is directly available from the Gensim library, while the Bray-Curtis and Correlation metrics are part of SciPy's spatial module. Since SciPy provides dissimilarity scores instead of the required similarity measures, we invert the scores it produces.
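The metric computation and the inversion of SciPy's dissimilarities can be sketched as follows (a minimal example on toy vectors, not the full scoring pipeline):

```python
import numpy as np
from scipy.spatial import distance

def similarities(u, v):
    # scipy.spatial.distance returns dissimilarities, so each distance is
    # inverted (1 - d) to obtain the corresponding similarity measure.
    return {
        "cosine": 1.0 - distance.cosine(u, v),
        "bray-curtis": 1.0 - distance.braycurtis(u, v),
        "correlation": 1.0 - distance.correlation(u, v),
    }

u = np.array([0.2, 0.4, 0.4])
v = np.array([0.2, 0.4, 0.4])
scores = similarities(u, v)  # identical vectors: all similarities are 1
```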
Given a monolingual sub-task L1-L1 and multiple bilingual corpora, the L1 side of the corpora is used to train the Doc2Vec models. For the cross-lingual sub-tasks L1-L2, we used Google Translate to translate the test set from L1 to L2 and vice versa. We then trained the Doc2Vec models for the two languages separately and combined the similarity scores obtained from the two models by averaging. Since the scores lie in the range (0, 1], we multiply them by 5 in order to return a continuous similarity score on the 0-to-5 scale the competition requires.
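The score combination step reduces to a one-line average followed by rescaling. The sketch below assumes `sim_l1` and `sim_l2` are the (hypothetical) similarity scores produced by the two monolingual Doc2Vec models for a translated sentence pair:

```python
def combine(sim_l1, sim_l2):
    # Average the per-language similarity scores, then rescale from (0, 1]
    # to the competition's 0-to-5 range.
    return 5.0 * (sim_l1 + sim_l2) / 2.0

score = combine(0.8, 0.6)
```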
We submitted three runs to the competition; their configurations are detailed in the following section.

Evaluation and Results
The similarity scores are evaluated by computing the Pearson Correlation between them and human judgments for the same sentence pairs. This section presents our results for all sub-tasks of the 2017 test sets and also for the STS Benchmark 9 (Cer et al., 2017).
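The official evaluation measure is straightforward to reproduce with SciPy. The scores below are purely hypothetical and only illustrate the computation:

```python
from scipy.stats import pearsonr

system_scores = [4.5, 1.0, 3.2, 0.5]  # hypothetical system outputs (0-5 scale)
gold_scores = [5.0, 0.0, 3.0, 1.0]    # hypothetical human judgments

# Pearson correlation between machine-assigned scores and human judgments.
r, p_value = pearsonr(system_scores, gold_scores)
```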

STS 2017 Test Sets
When considering all 85 submitted runs (including the monolingual runs and the baseline), our best runs ranked 26th out of 49 for AR-AR, 21st out of 45 for AR-EN, 22nd out of 48 for ES-ES, 28th out of 53 for ES-EN-a, 1st out of 53 for ES-EN-b, 35th out of 77 for EN-EN and 16th out of 48 for TR-EN (Cer et al., 2017). We conducted experiments with sizes 200, 300 and 400 for the Doc2Vec vectors, training on both sides of the corpora for the cross-lingual tasks and applying the Cosine, Bray-Curtis and Correlation similarity metrics. Table 3 details the Pearson correlation scores obtained.
The results indicate that the Bray-Curtis metric performs better than the other two on five out of seven test sets, with a tie on the EN-EN test set. Regarding the dimensionality of the Doc2Vec vectors, no clear conclusion can be drawn from these results: size 200 leads to the best results for ES-ES, ES-EN-a and EN-EN, size 300 gives the best results for AR-AR, size 400 for AR-EN and ES-EN-b, and sizes 300 and 400 tie for TR-EN. Averaging the similarity scores for the source and the target language also seems to be a promising approach; this combination led to the best Pearson correlation scores for two of the four cross-lingual test sets (AR-EN and ES-EN-a).
We report in Table 4 the Pearson correlation results of the runs we submitted to the competition. For the first two runs we used Cosine for computing the similarity between the sentence pairs, and for the third run we used Bray-Curtis. For the cross-lingual tasks, the non-English side of the corpora was used to train the Doc2Vec models in the first two runs, while for the third run we trained the Doc2Vec models on the English side of the corpora. In the third run we also included additional data (except for AR-AR and ES-ES) in order to assess how the size of the training corpus for the Doc2Vec models influences the results. For the AR-EN, ES-EN-b and TR-EN sub-tasks the scores improved when using more training data, but the differences were small.

STS Benchmark
The SemEval STS organizers made the STS Benchmark available for the EN-EN task, with the purpose of enabling state-of-the-art approaches to be compared on standard data sets. The benchmark data consist of a selection of the data sets used in the competition between 2012 and 2017.
Since the methods we presented are unsupervised and knowledge-free, we did not make use of the annotated training data when computing the similarity scores for the development and test sets. We tested two approaches for obtaining similarity scores on the EN-EN sub-task: the first infers the vectors for the development and test set sentences from the already trained Doc2Vec models (Post-training inference) and the other one retrains from scratch new models by adding the development and test sets to the initial Doc2Vec training data (New-Model).
As can be noted in Table 4, the best Pearson correlation result for EN-EN was obtained using the settings from our submitted run 1. These settings also gave the best results on the STS Benchmark test data.

Conclusions
We presented in this paper our unsupervised, knowledge-free approach to the STS task. A wide range of experiments was carried out in order to assess the impact of the similarity metric when Paragraph Vector is used to represent sentences. Our results indicate that Bray-Curtis might be a good choice, because it outperformed the commonly used Cosine metric on five out of seven test sets. Moreover, training the Doc2Vec models on both sides of the language corpora and averaging their similarity scores seems to be a promising approach for the cross-lingual STS task. The proposed method achieved encouraging results: we ranked first on the ES-EN-b sub-task and obtained Pearson correlation scores above the median for five out of seven test sets.