WOLVESAAR at SemEval-2016 Task 1: Replicating the Success of Monolingual Word Alignment and Neural Embeddings for Semantic Textual Similarity

This paper describes the WOLVESAAR systems that participated in the English Semantic Textual Similarity (STS) task at SemEval-2016. We replicated the top systems from the last two editions of the STS task and extended the model using GloVe word embeddings and dense vector space LSTM-based sentence representations. We compared the performance of the replicated system against that of the extended variants. Our variants of the replicated system show improved correlation scores, and all of our submissions outperform the median scores across all participating systems.


Introduction
Semantic Textual Similarity (STS) is the task of assigning a real number score to quantify the semantic likeness of two text snippets. Similarity measures play a crucial role in various areas of text processing and translation technologies, ranging from improving information retrieval rankings (Lin and Hovy, 2003; Corley and Mihalcea, 2005) and text summarization to machine translation evaluation and enhancing matches in translation memories and terminologies (Resnik et al., 1999; Ma et al., 2011; Banchs et al., 2015; Vela and Tan, 2015). The annual SemEval STS task (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015) provides a platform where systems are evaluated on the same data and evaluation criteria.

DLS System from STS 2014 and 2015
For the past two editions of the STS task, the top performing submissions were from the DLS@CU team (Sultan et al., 2014b; Sultan et al., 2015).
Their STS2014 submission is based on the proportion of overlapping content words between the two sentences, treating semantic similarity as a monotonically increasing function of the degree to which the two sentences contain semantically similar units and these units occur in similar semantic contexts (Sultan et al., 2014b). Essentially, their semantic metric is based on the proportion of aligned content words between the two sentences, formally defined as:

$$prop^{(1)}_{Al} = \frac{\left|\{\, i : [\,\exists j : (i,j) \in Al\,] \text{ and } w_i^{(1)} \in C \,\}\right|}{\left|\{\, i : w_i^{(1)} \in C \,\}\right|} \qquad (1)$$

where $prop^{(1)}_{Al}$ is the proportion of aligned semantic units from a set of alignments $Al$ that maps the positions of the words $(i, j)$ between the sentences $S^{(1)}$ and $S^{(2)}$, given that the aligned units belong to a set of content words, $C$. Since the proportion is directional, Equation 1 only provides the proportion of aligned semantic units for $S^{(1)}$. The alignment pairs in $Al$ are automatically annotated by a monolingual word aligner (Sultan et al., 2014a) that uses word similarity measures based on contextual evidence from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and syntactic dependencies.
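The directional proportion in Equation 1 can be illustrated with a small sketch. This is not the authors' implementation: the alignment format (a set of aligned token positions) and the content-word test are simplifying assumptions.

```python
def alignment_proportion(sentence, aligned_positions, content_words):
    """Equation 1 (sketch): the fraction of content-word positions in
    `sentence` (a list of tokens) that appear in `aligned_positions`."""
    content_positions = [i for i, w in enumerate(sentence) if w in content_words]
    if not content_positions:
        return 0.0
    aligned = [i for i in content_positions if i in aligned_positions]
    return len(aligned) / len(content_positions)

# Toy example: 4 content words, 3 of them covered by the alignment.
s1 = ["the", "cat", "sat", "on", "the", "mat", "today"]
content = {"cat", "sat", "mat", "today"}
aligned_positions = {1, 2, 5}  # "cat", "sat", "mat" are aligned
prop = alignment_proportion(s1, aligned_positions, content)  # 3/4 = 0.75
```

The same function would be called a second time with the alignment positions of the other sentence to obtain the proportion for $S^{(2)}$.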
The same computation needs to be made for $S^{(2)}$. An easier formulation of the equation, without the formal logic symbols, is:

$$prop^{(1)}_{Al} = \frac{\#\,\text{aligned content words in } S^{(1)}}{\#\,\text{content words in } S^{(1)}} \qquad (2)$$

Since the semantic similarity between $(S^{(1)}, S^{(2)})$ should be a single real number, Sultan et al. (2014b) combined the two proportions using the harmonic mean:

$$sim(S^{(1)}, S^{(2)}) = \frac{2 \cdot prop^{(1)}_{Al} \cdot prop^{(2)}_{Al}}{prop^{(1)}_{Al} + prop^{(2)}_{Al}} \qquad (3)$$

Instead of simply using the alignment proportions, Sultan et al. (2015) extended their hypothesis by leveraging pre-trained neural network embeddings (Baroni et al., 2014). They posited that the semantics of a sentence can be captured by the centroid of its content words, computed as the element-wise sum of the content word embeddings normalized by the number of content words in the sentence. Together with the similarity score from Equation 3 and the cosine similarity between the two sentence embeddings, they trained a Bayesian ridge regressor to learn the similarity scores between text snippets.

In replicating Sultan et al.'s (2015) work, we first have to tokenize and lemmatize the text. The pre-processing choices were undocumented in their paper, so we tokenized the datasets with the NLTK tokenizer (Bird et al., 2009) and lemmatized them with the PyWSD lemmatizer (Tan, 2014). We use the lemmas to retrieve the word embeddings from the COMPOSES vector space (Baroni et al., 2014). Similar to Equation 2 (changing only the numerator), we compute the sentence embedding as the centroid of its content word embeddings:

$$v(S^{(1)}) = \frac{\sum_{w_i \in S^{(1)} \cap C} v(w_i)}{\#\,\text{content words in } S^{(1)}} \qquad (4)$$

where $v(S^{(1)})$ refers to the dense vector space representation of the sentence $S^{(1)}$ and $v(w_i)$ refers to the word embedding of word $w_i$ provided by the COMPOSES vector space. The same computation has to be done for $S^{(2)}$.
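The harmonic mean of Equation 3 and the centroid of Equation 4 can be sketched in NumPy. The toy two-dimensional vectors stand in for the 400-dimensional COMPOSES embeddings; the numbers are illustrative only.

```python
import numpy as np

def harmonic_mean(p1, p2):
    # Equation 3 (sketch): combine the two directional proportions.
    return 0.0 if p1 + p2 == 0 else 2 * p1 * p2 / (p1 + p2)

def sentence_centroid(content_word_vectors):
    # Equation 4 (sketch): element-wise sum of the content word
    # embeddings, normalized by the number of content words.
    return np.sum(content_word_vectors, axis=0) / len(content_word_vectors)

vecs = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy 2-d "embeddings"
centroid = sentence_centroid(vecs)          # [0.5, 0.5]
sim = harmonic_mean(0.75, 0.5)              # 0.6
```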
Intuitively, if either sentence contains more or fewer content words than the other, the numerator changes, but the denominator changes with it. The difference between $v(S^{(1)})$ and $v(S^{(2)})$ thus contributes to the distributional semantic distance.
To calculate a real-valued similarity score between the sentence vectors, we take the dot product between them to compute the cosine similarity:

$$sim(S^{(1)}, S^{(2)}) = \frac{v(S^{(1)}) \cdot v(S^{(2)})}{\lvert v(S^{(1)}) \rvert \; \lvert v(S^{(2)}) \rvert} \qquad (5)$$

There was no clear indication of which vector space Sultan et al. (2015) chose to compute the similarity score in Equation 5. Thus we compute two similarity scores using both COMPOSES vector spaces, trained with these configurations:

• 5-word context window, 10 negative samples, subsampling, 400 dimensions
• 2-word context window, PMI weighting, no compression, 300K dimensions

In this case, we extracted two similarity features for every sentence pair. With the harmonic proportion feature from Equation 3 and the two similarity scores from Equation 5, we trained a linear ridge regression on the 3 features using the STS 2012 to 2015 datasets and submitted the outputs of this model as our baseline submission in the English STS Task at SemEval-2016.

Instead of using the COMPOSES vector space, we experimented with replacing the $v(w_i)$ component in Equation 4 with GloVe vectors, $v_{glove}(w_i)$, such that:

$$v_{glove}(S^{(1)}) = \frac{\sum_{w_i \in S^{(1)} \cap C} v_{glove}(w_i)}{\#\,\text{content words in } S^{(1)}} \qquad (6)$$
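The cosine similarity of Equation 5 is the standard one; a minimal NumPy sketch over toy vectors:

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Equation 5 (sketch): dot product normalized by the vector magnitudes.
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = np.array([1.0, 1.0, 0.0])   # toy sentence centroid for S1
b = np.array([1.0, 0.0, 0.0])   # toy sentence centroid for S2
score = cosine_similarity(a, b)  # 1/sqrt(2), roughly 0.7071
```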

Replacing COMPOSES with GloVe
The novelty lies in the use of a global co-occurrence matrix to capture corpus-wide phenomena that might not be captured by a local context window. The model leverages both the non-zero elements of the word-word co-occurrence matrix (not a sparse bag-of-words matrix) and the individual context window vectors, similar to the word2vec model (Mikolov et al., 2013).

Similarity Using Tree LSTM
Recurrent Neural Networks (RNNs) can process sentences of arbitrary length (Elman, 1990), but early work on RNNs suffered from the vanishing/exploding gradients problem (Bengio et al., 1994). Hochreiter and Schmidhuber (1997) introduced multiplicative input and output gate units to solve the vanishing gradients problem. While RNNs and LSTMs process sentences sequentially, the Tree-LSTM extends the LSTM architecture by processing the input sentence through its syntactic structure. We use the ReVal metric (Gupta et al., 2015) implementation of the Tree-LSTM (Tai et al., 2015) to generate the similarity score.
ReVal represents both sentences $(h_1, h_2)$ using Tree-LSTMs and predicts a similarity score $\hat{y}$ with a neural network that considers both the distance and angle between $h_1$ and $h_2$:

$$h_\times = h_1 \odot h_2$$
$$h_+ = \lvert h_1 - h_2 \rvert$$
$$h_s = \sigma\left(W^{(\times)} h_\times + W^{(+)} h_+ + b^{(h)}\right)$$
$$\hat{p}_\theta = \mathrm{softmax}\left(W^{(p)} h_s + b^{(p)}\right)$$
$$\hat{y} = r^T \hat{p}_\theta \qquad (7)$$

where $\sigma$ is the sigmoid function, $\hat{p}_\theta$ is the estimated probability distribution vector and $r^T = [1\ 2 \ldots K]$. The cost function $J(\theta)$ is defined over the probability distributions $p$ and $\hat{p}_\theta$ using regularised Kullback-Leibler (KL) divergence.
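The prediction layer above can be sketched in NumPy. The weights here are randomly initialized stand-ins, not ReVal's trained parameters, and the dimensions are illustrative; the point is only the distance/angle feature construction and the expectation $\hat{y} = r^T \hat{p}_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, K = 4, 8, 5  # toy dimensions; K = 5 similarity classes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Randomly initialized weights standing in for trained parameters.
W_times = rng.normal(size=(hidden, d))
W_plus = rng.normal(size=(hidden, d))
b_h = np.zeros(hidden)
W_p = rng.normal(size=(K, hidden))
b_p = np.zeros(K)
r = np.arange(1, K + 1)  # r^T = [1 2 ... K]

def predict_similarity(h1, h2):
    h_times = h1 * h2          # element-wise product (angle information)
    h_abs = np.abs(h1 - h2)    # absolute difference (distance information)
    h_s = sigmoid(W_times @ h_times + W_plus @ h_abs + b_h)
    p_hat = softmax(W_p @ h_s + b_p)
    return float(r @ p_hat)    # expected class: a score in [1, K]

y_hat = predict_similarity(rng.normal(size=d), rng.normal(size=d))
```

Because $\hat{p}_\theta$ is a probability distribution, the prediction is always bounded in $[1, K]$.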
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left(p^{(i)} \,\middle\|\, \hat{p}^{(i)}_\theta\right) + \frac{\lambda}{2} \lVert\theta\rVert_2^2 \qquad (8)$$

In Equation 8, $i$ is the index of each training pair, $n$ is the number of training pairs and $p$ is the sparse target distribution such that $y = r^T p$, defined as:

$$p_j = \begin{cases} y - \lfloor y \rfloor, & j = \lfloor y \rfloor + 1 \\ \lfloor y \rfloor - y + 1, & j = \lfloor y \rfloor \\ 0, & \text{otherwise} \end{cases}$$

for $1 \le j \le K$, where $y \in [1, K]$ is the similarity score of a training pair. This gives a similarity score in the range $[1, K]$, which is mapped to $[0, 1]$. Please refer to Gupta et al. (2015) for training details.
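The sparse target distribution can be sketched as follows. The fractional part of the gold score $y$ is split across the two neighbouring classes so that $r^T p$ recovers $y$ exactly; $K = 5$ here is an assumption matching the usual STS scoring range.

```python
import math
import numpy as np

def target_distribution(y, K=5):
    """Sparse distribution p (0-indexed array) with r^T p == y, y in [1, K]."""
    p = np.zeros(K)
    floor_y = math.floor(y)
    if floor_y == y:
        p[floor_y - 1] = 1.0           # integer score: all mass on one class
    else:
        p[floor_y] = y - floor_y       # class floor(y) + 1
        p[floor_y - 1] = floor_y - y + 1  # class floor(y)
    return p

p = target_distribution(3.6)   # mass 0.4 on class 3, 0.6 on class 4
r = np.arange(1, 6)
recovered = float(r @ p)       # recovers 3.6
```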

Submission
Our baseline submission uses the similarity scores from Equations 3 and 5 as features to train a linear ridge regression. It achieved an overall Pearson correlation of 0.69244 across all domains.
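The baseline regressor can be sketched as a closed-form ridge regression over the three-feature matrix. This is a NumPy sketch with synthetic features and targets; our submission used an off-the-shelf implementation rather than this hand-rolled solver.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.random((100, 3))        # 3 features: Equation 3 + two Equation 5 scores
true_w = np.array([0.5, 0.3, 0.2])
y = X @ true_w                  # synthetic gold similarity scores
w = ridge_fit(X, y, alpha=0.01)
pred = X @ w                    # predicted similarity scores
```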
Extending the baseline implementation, we included the similarity scores from Equations 6 and 8 in the feature set and trained a boosted tree ensemble (Friedman, 2001) to produce our Boosted submission. Finally, we used the same feature set to train an eXtreme Gradient Boosted tree ensemble (XGBoost) (Chen and He, 2015; Chen and Guestrin, 2015) model.
We annotated the STS 2012 to 2015 datasets with the similarity scores from Equations 2, 3, 5, 6 and 8. The annotations and our open source implementation of the system are available at https://github.com/alvations/stasis/blob/master/notebooks/STRIKE.ipynb


Results

Table 1 shows the results of our submissions to the English STS task in SemEval-2016, covering the answer-answer, headlines, plagiarism, postediting and question-question domains; the median and best scores are computed across all participating teams in the task. Our baseline system performs reasonably well, outperforming the median scores in most domains. Our extended variant of the baseline using a boosted tree ensemble performs better in the answer-answer, headlines and postediting domains but worse in the others. Comparatively, it improves the overall correlation score marginally, by 0.002.
The system using XGBoost performs best of the three models, but it underperforms the median scores in the headlines and plagiarism domains.
Generally, we did not achieve outstanding scores in the task compared to the top performing team, DLS@CU, in the English STS 2015. Our XGBoost system remains far from the best scores of the top systems. However, our overall correlation scores are higher than the median scores across all submissions for the task.
As a post-hoc test, we evaluated our baseline system by training on the STS 2012 to 2014 datasets and testing on the STS 2015 dataset, achieving a weighted mean Pearson correlation of 0.76141 across all domains. Compared to the Sultan et al. (2015) result of 0.8015, we fall 0.04 points short, which would technically rank our system 20th out of the 70+ submissions to the STS 2015 task.⁴
Machine Translation (MT) evaluation metrics have shown competitive performance in previous STS tasks (Barrón-Cedeño et al., 2013; Huang and Chang, 2014; Bertero and Fung, 2015). Tan et al. (2016) annotated the STS datasets with MT metric scores for every pair of sentences in the training and evaluation data. We extended our XGBoost model with these MT metric annotations and achieved a higher score in every domain, leading to an overall Pearson correlation score of 0.73050 (+Saarsheff in Table 1).

⁴ Our replication attempt obtained better results than our STS 2015 submission (MiniExperts), which used a Support Vector Machine regressor trained on a number of linguistically motivated features (Gupta et al., 2014); it achieved a 0.7216 mean score (Béchara et al., 2015).

Conclusion
In this paper, we have presented our findings on replicating the top system of the STS 2014 and 2015 tasks and evaluated our replica of the system in the English STS task of SemEval-2016. We introduced variants of and extensions to the replica system using various state-of-the-art word and sentence embeddings. Our systems trained on (eXtreme) Boosted Tree ensembles outperform the replica system that uses linear regression. Although our replica of the previous best system did not achieve stellar scores, all our systems outperform the median scores computed across all participating systems.