DCU: Using Distributional Semantics and Domain Adaptation for the Semantic Textual Similarity SemEval-2015 Task 2

We describe the work carried out by the DCU team on the Semantic Textual Similarity task at SemEval-2015. We learn a regression model to predict a semantic similarity score between a sentence pair. Our system exploits distributional semantics in combination with tried-and-tested features from previous tasks in order to compute sentence similarity. Our team submitted three runs for each of the five English test sets. For two of the test sets, belief and headlines, our best system ranked second and fourth, respectively, out of the 73 submitted systems. Our best submission, averaged over all test sets, ranked 26th out of the 73 systems.


Introduction
This paper describes DCU's participation in the SemEval-2015 English Semantic Textual Similarity (STS) task, whose goal is to predict how similar in meaning two sentences are (Agirre et al., 2014). The semantic similarity between two sentences is defined on a scale from 0 (no relation) to 5 (semantic equivalence). Thus, given a sentence pair, our aim is to learn a model which outputs a score between 0 and 5 reflecting the semantic similarity between the two sentences.
We explore distributional representations of words computed using neural networks, specifically Word2Vec vectors (Mikolov et al., 2013), and we design features which attempt to encode semantic similarity at the sentence level. We also experiment with several methods of data selection, both for training word embeddings and for selecting training data for our regression models. We submitted three runs for this task: for all three runs, the features used are identical, and the only difference between them is the training instance selection method used.

Data and Resources
The training data for the task comprises all the corpora from the previous years' STS tasks: STS-2012, STS-2013 and STS-2014 (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014). The test data is taken from five domains: answers-forums, answers-students, belief, headlines and images. Two domains (headlines and images) have some training data available from the previous STS tasks; the other three have been introduced for the first time.
We use the Word2Vec (W2V) representation to compute the semantic similarity between two words, and then extend it to the similarity between two sentences. Using W2V, a word is represented as a vector of D dimensions, with each dimension capturing some aspect of the word's meaning in the form of concepts learnt by the trained model. We use the gensim W2V implementation (Řehůřek and Sojka, 2010).
We use the text8 Wikipedia corpus to train our general W2V model. This corpus comprises 100MB of compressed Wikipedia data. We use the UMBC corpus (Han et al., 2013) for building domain-specific W2V models.

Pre-processing
We perform minimal pre-processing, replacing all hyphens and apostrophes with spaces, and removing all non-alphanumeric symbols from the data. Our general domain model uses the NLTK 3 stop word list for stop word removal and the Porter stemming algorithm (Porter, 1980). Word2Vec handles stem variations to some extent when it learns the vector representations from the raw input data. Thus, for the domain-specific models, we only remove stop words and do not stem.
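The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny stop word set stands in for the NLTK list, lowercasing and the restriction to ASCII characters are assumptions, and the optional `stem` argument is a hypothetical hook where a Porter stemmer would be plugged in for the general-domain model.

```python
import re

# Hypothetical, tiny stop word set for illustration only (the paper uses the NLTK list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def preprocess(sentence, stop_words=STOP_WORDS, stem=None):
    """Tokenise a sentence following the pre-processing steps described above."""
    # Assumption: lowercase first so that stop word matching is case-insensitive.
    text = sentence.lower()
    # Replace hyphens and apostrophes with spaces.
    text = re.sub(r"[-']", " ", text)
    # Remove every remaining character that is not an ASCII letter, digit or whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = [t for t in text.split() if t not in stop_words]
    # Optional stemming (general-domain model only); `stem` maps a token to its stem.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens
```

Calling `preprocess(sentence)` without `stem` corresponds to the domain-specific setting, where stop words are removed but tokens are not stemmed.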

Feature Design
To predict a semantic similarity score, we learn a regression model using the M5P algorithm. 4 We represent a sentence pair using the features described in the following subsections.

Cosine Similarity
We have two features representing the cosine similarity between two sentences, s1 and s2, where the sentences are represented as binary vectors with each dimension indicating the presence of a word. The first feature is the basic cosine similarity between the two sentence vectors and the second is the weighted cosine similarity between the two vectors, where each word is weighted by its inverse collection frequency (ICF). 5
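As a rough sketch of these two features, the function below computes the cosine between binary bag-of-words vectors, with an optional per-word weight. The `weights` dictionary is a hypothetical stand-in for precomputed ICF values, and giving unseen words a default weight of 1.0 is an assumption, since the paper does not say how they are handled.

```python
import math


def weighted_cosine(tokens1, tokens2, weights=None, default_weight=1.0):
    """Cosine between binary word-presence vectors, optionally weighted per word."""
    vocab1, vocab2 = set(tokens1), set(tokens2)

    def w(word):
        # Without weights every present word counts as 1 (plain binary vector).
        if weights is None:
            return 1.0
        return weights.get(word, default_weight)

    # Only words present in both sentences contribute to the dot product.
    dot = sum(w(t) ** 2 for t in vocab1 & vocab2)
    norm1 = math.sqrt(sum(w(t) ** 2 for t in vocab1))
    norm2 = math.sqrt(sum(w(t) ** 2 for t in vocab2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

With `weights=None` this gives the basic cosine feature; passing a dictionary of ICF values gives the weighted variant.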

Word2Vec
Sum W2V: For a given sentence, we represent each word by its W2V representation and then sum the word vectors in the sentence to find the centroid of the word vectors representing the entire sentence. The cosine of the centroids of the two sentences indicates the similarity between them. Using the sum approach, two features, sum and sum icf, are calculated, one corresponding to the basic cosine similarity between the vectors, and the other representing the weighted cosine similarity where, before calculating the centroid, each word vector is multiplied by its ICF weight.
Product W2V: Given s1 and s2, we take the element-wise product of each word vector in s1 and s2 and store the maximum product value for each word in s1 and similarly for s2. The Product W2V feature is the average of the maximum weights between each word of s1 with s2 and vice versa.
The sum and product W2V models are inspired by the composition models of Mitchell and Lapata (2008) and the semantic similarity measures of Mihalcea et al. (2006).
3 http://www.nltk.org/
4 We used the Weka implementation (http://www.cs.waikato.ac.nz/ml/weka/) without performing any extra hyper-parameter optimization.
5 ICF is calculated using word frequencies from the Wikipedia 2011 dump.
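The Sum W2V and Product W2V features might be sketched as below. The `vectors` and `icf` dictionaries are hypothetical lookups (word to vector, word to ICF weight), and skipping out-of-vocabulary words is an assumption. For Product W2V, the code reads "product value" as the cosine between two word vectors (the summed element-wise product of length-normalised vectors); this is one interpretation of the description, not a confirmed detail of the authors' system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_vector(tokens, vectors, icf=None):
    """Sum the (optionally ICF-weighted) vectors of the in-vocabulary tokens."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return None
    total = [0.0] * len(vectors[known[0]])
    for t in known:
        weight = 1.0 if icf is None else icf.get(t, 1.0)  # assumed default weight
        for i, x in enumerate(vectors[t]):
            total[i] += weight * x
    return total


def sum_w2v(tokens1, tokens2, vectors, icf=None):
    """Cosine between the summed word vectors of the two sentences."""
    v1 = sentence_vector(tokens1, vectors, icf)
    v2 = sentence_vector(tokens2, vectors, icf)
    if v1 is None or v2 is None:
        return 0.0
    return cosine(v1, v2)


def product_w2v(tokens1, tokens2, vectors):
    """Average, in both directions, of each word's best match in the other sentence."""
    def directed(src, tgt):
        src = [t for t in src if t in vectors]
        tgt = [t for t in tgt if t in vectors]
        if not src or not tgt:
            return 0.0
        best = [max(cosine(vectors[s], vectors[t]) for t in tgt) for s in src]
        return sum(best) / len(best)

    return 0.5 * (directed(tokens1, tokens2) + directed(tokens2, tokens1))
```

Calling `sum_w2v` without `icf` corresponds to the sum feature, and with `icf` to the sum icf feature.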
Domain-specific Cosine Similarity: Good coverage is obtained using the text8 corpus to train the W2V model. However, we also want to explore the performance with respect to an in-domain W2V model. So, for each of the test corpora, we first extract a corpus of similar sentences from the UMBC corpus by selecting up to 500 sentences for each content word in the test corpus, and then use the extracted dataset to train a W2V model that has better coverage of the test domain. Using the domain-specific W2V corpus, we compute the feature domain w2v cosine similarity in a similar fashion to the Sum W2V feature: we compute the centroid vector of the content words in each sentence and then compute the cosine between the two centroids.
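The sentence extraction step could look roughly like the sketch below. The function name and the single-pass strategy are assumptions: the paper only states that up to 500 sentences are selected per content word, not how ties or overlapping matches are handled.

```python
from collections import defaultdict


def select_domain_sentences(content_words, corpus_sentences, max_per_word=500):
    """Collect corpus sentences until each content word has up to max_per_word examples.

    corpus_sentences is an iterable of token lists (e.g. tokenised UMBC sentences).
    """
    wanted = set(content_words)
    counts = defaultdict(int)
    selected = []
    for tokens in corpus_sentences:
        # Content words in this sentence that still need more examples.
        hits = {t for t in tokens if t in wanted and counts[t] < max_per_word}
        if hits:
            # A sentence kept for one word may also contain a word whose quota is
            # already full, so a frequent word can end up slightly over the cap.
            selected.append(tokens)
            for t in hits:
                counts[t] += 1
    return selected
```

The returned token lists would then serve as the training sentences for a domain-specific W2V model, for instance with gensim.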
Syntax: We also hypothesize that two semantically similar sentences should have high overlap between their nouns, verbs, adjectives and adverbs. For each coarse-grained POS tag (NN*, VB*, JJ* and RB*) we calculate the W2V cosine similarity between all words from s1 and s2 which have the same POS tag (using the Sum W2V combination method). For each coarse-grained POS tag, we also calculate the number of lexical matches with that particular POS tag.
We also parse each sentence using the Stanford parser (Manning et al., 2014) and look for dependency relation overlap between s1 and s2. 6 We concentrate on six dependency relations: nsubj, dobj, det, prep, amod and aux. For each relation we calculate the degree of overlap between the occurrences of this relation in the two sentences. We have two notions of relation overlap: a non-lexicalized version which just counts the relation itself (e.g. nsubj) and a lexicalized version which counts the relation and the two tokens it connects (e.g. nsubj word1 word2).
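The two notions of relation overlap can be illustrated with the sketch below. The input format, a list of `(relation, head, dependent)` triples, is hypothetical and not the Stanford parser's output format, and measuring the "degree of overlap" as a multiset Jaccard ratio is an assumption, since the paper does not define the measure.

```python
from collections import Counter

# The six dependency relations considered in the paper.
RELATIONS = ("nsubj", "dobj", "det", "prep", "amod", "aux")


def relation_overlap(deps1, deps2, relation, lexicalized=False):
    """Overlap of one dependency relation between two parsed sentences.

    deps1 and deps2 are lists of (relation, head, dependent) triples.
    """
    def items(deps):
        if lexicalized:
            # Count the relation together with the two tokens it connects.
            return Counter((r, h, d) for r, h, d in deps if r == relation)
        # Count only occurrences of the relation label itself.
        return Counter(r for r, _, _ in deps if r == relation)

    c1, c2 = items(deps1), items(deps2)
    shared = sum((c1 & c2).values())  # multiset intersection
    total = sum((c1 | c2).values())   # multiset union
    return shared / total if total else 0.0
```

Computing `relation_overlap` for every relation in `RELATIONS`, once with `lexicalized=False` and once with `lexicalized=True`, would yield twelve overlap features per sentence pair under these assumptions.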

Monolingual Alignment
We compute the monolingual alignment between the two sentences using the word aligner introduced by Sultan et al. (2014). Their system aligns related words in a sentence pair by exploiting semantic and contextual similarities of the words. From the aligned sentences, we then extract two features: percent aligned source and percent aligned target, which represent the fraction of tokens in each sentence which have an alignment in the other sentence. The intuition behind these features is that sentences which are semantically similar should have a higher fraction of aligned tokens, since alignments constitute either identical strings or paraphrases.
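Given the output of a word aligner, the two alignment features reduce to simple ratios, as in the sketch below. The representation of alignments as `(source_index, target_index)` pairs is an assumption made for illustration and is not the aligner's actual interface.

```python
def alignment_features(alignments, len_source, len_target):
    """Return (percent_aligned_source, percent_aligned_target).

    alignments is an iterable of (source_index, target_index) token index pairs.
    """
    pairs = list(alignments)
    # A token counts once even if it takes part in several alignments.
    aligned_source = {i for i, _ in pairs}
    aligned_target = {j for _, j in pairs}
    pct_source = len(aligned_source) / len_source if len_source else 0.0
    pct_target = len(aligned_target) / len_target if len_target else 0.0
    return pct_source, pct_target
```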

TakeLab
The TakeLab system (Šarić et al., 2012) was the top-performing system in the STS-2012 task. Their system used support vector regression models with multiple features measuring word overlap similarity and syntactic similarity. We find that adding the TakeLab features provides additional knowledge to our system and improves performance on the training datasets. We add the 21 features of the TakeLab system to our feature set.

Training instance selection
After designing features to model semantic similarity between two sentences, the next important task is to select the training corpus for learning the weights for these features. Out of the five test sets for STS-2015, we only have in-domain training corpora for the headlines and images data sets. We hypothesize that the vocabulary similarity between an entire training corpus and an entire test corpus can be used to select the training corpora that are most similar to the test data. We calculate the similarity between each of the corpora we have from previous STS tasks and each of the test corpora. Using the entire corpus vocabulary as a vector, we find the cosine similarity between different corpora using the TFIDF (Manning et al., 2008), LSI (Hofmann, 1999), LDA (Blei et al., 2003b) and HLDA (Blei et al., 2003a) measures.
6 Parsing is carried out on the raw sentences.
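For the TFIDF measure, the corpus-level comparison might be sketched as follows, treating each corpus as one document. The smoothed IDF formula is an assumption (the paper does not give its weighting scheme), and the LSI, LDA and HLDA variants are not covered here.

```python
import math
from collections import Counter


def corpus_tfidf_similarity(corpora):
    """Pairwise TF-IDF cosine between corpora.

    corpora maps a corpus name to its list of tokens; the result maps
    (name_a, name_b) to the cosine similarity of the two corpora.
    """
    n = len(corpora)
    tf = {name: Counter(tokens) for name, tokens in corpora.items()}
    # Document frequency: in how many corpora does each term occur?
    df = Counter(term for counts in tf.values() for term in counts)
    # Assumed smoothed IDF, so terms found in every corpus keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    vecs = {name: {t: c * idf[t] for t, c in counts.items()}
            for name, counts in tf.items()}

    def cos(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    names = list(corpora)
    return {(a, b): cos(vecs[a], vecs[b]) for a in names for b in names if a != b}
```

Ranking the training corpora by their similarity to a given test corpus then gives a simple way to pick the most similar ones.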
Next, we describe the mechanism we used for training data selection for each run: corpus (similar instances are computed using cosine similarity between the feature vectors). By combining these five training instances for all test instances and removing duplicates, we form a more focused training set which is expected to capture the test set diversity more effectively.
3. Run-3: In this variant, we do not want to limit ourselves to just the top three corpora, so we merge all the training data and then look for the five most similar training instances for each test instance to form a focused training set.
Table 1 shows the results of our systems on the five test sets. For the answers-forums and belief test sets, there was a considerable difference in the results across the three runs, indicating that selecting training instances has a significant effect on performance. For these two datasets, the absolute difference in the Pearson coefficient between two of the runs is about 10% for answers-forums and about 20% for belief. Overall, our best system ranks 26th out of 73. If we look at the results for individual test sets, our approach works well for the belief, headlines and images test sets but performs poorly for the answers-students and answers-forums test sets. For the belief test set our Run-2 was ranked 2nd overall, and for the headlines test set our Run-1 was ranked 4th overall. For the images test set, the results are competitive: the absolute difference in the Pearson value between our best run and the best system is only 0.03. Thus, apart from two corpora, answers-students and answers-forums, our approach performed quite well.
We analyzed the features using GradientBoostingRegressor for all the training sets. The feature importance varies slightly across different domains. For all datasets, we remove features with Gini importance < 0.01, and then look for the features which are retained in at least three of this year's test domains. The features that performed well are shown in Table 2.
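The instance selection described for Run-3 above (five nearest training instances per test instance, merged and de-duplicated) can be sketched as below. The function names are hypothetical, and the brute-force search is chosen for clarity rather than speed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_training_instances(train_features, test_features, k=5):
    """Indices of training instances that are among the k nearest of any test instance."""
    selected = set()  # a set removes duplicates across test instances
    for test_vec in test_features:
        ranked = sorted(
            range(len(train_features)),
            key=lambda i: cosine(train_features[i], test_vec),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)
```

The returned indices identify the focused training set on which the regression model would then be trained.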

Conclusions
All of our runs have the same features, but use different training corpora to learn the weights. We thus show that training data selection can have an impact on the performance of a model, especially for a novel genre. Using Word2Vec to find semantic similarity between a sentence pair proved to be effective. Furthermore, composing W2V features in different ways can help to reveal new information about semantic similarity. Investigating the test sets where we failed to perform well, answers-forums and answers-students, reveals that we need to handle phrasal information more effectively by, for example, handling negation, devising measures to compare the sentences at the entity level and making better use of parser output.