Samsung Poland NLP Team at SemEval-2016 Task 1: Necessity for diversity; combining recursive autoencoders, WordNet and ensemble methods to measure semantic similarity.

This paper describes our solutions designed for the STS core track of the SemEval 2016 English Semantic Textual Similarity (STS) task. Our method of similarity detection combines recursive autoencoders with a WordNet award-penalty system that accounts for semantic relatedness, and an SVM classifier, which produces the final score from similarity matrices. This solution is further supported by an ensemble classifier, combining an aligner with a bi-directional Gated Recurrent Neural Network and additional features, which then performs Linear Support Vector Regression to determine another set of scores.


Introduction
The tasks of the Semantic Textual Similarity (STS) contest have always attracted keen interest from the NLP community. The goal is to measure the semantic similarity between two given sentences on a scale from 0 to 5, emulating the idea of degrees of similarity and thus replicating human language understanding.
After processing two pieces of text, semantic textual similarity software captures degrees of semantic equivalence. One of the goals of the STS task is to create a unified framework for extracting and measuring semantic similarity. Improvements achieved in the course of this task can be useful in many research areas, such as question answering (Marsi and Krahmer, 2005), machine translation (Callison-Burch, 2008), and plagiarism detection (Clough et al., 2002).
We present a solution designed to detect both the similarity between single words and longer, multiword phrases. It employs two important components: the unfolding recursive autoencoder (RAE) (Socher et al., 2011) and the penalty-award weight system based on WordNet (Miller et al., 2002). First, RAE is used to perform unsupervised learning on parse trees, then the WordNet module adjusts the distances of RAE vectors using awards and penalties based on semantic similarities of words. The complete pipeline includes a deep net (RAE) module, a WordNet module, a normalization module and a sentence similarity matrices computing module.
Another solution, which ran in parallel to the RAE pipeline, was the monolingual word aligner (in some cases we used its corrected version with additional features, including a bag-of-words). Finally, an ensemble classifier was used to perform Linear Support Vector Regression (Drucker et al., 1996) over the results from all the other classifiers. These included the base word aligner (Sultan et al., 2015), a bi-directional Gated Recurrent Neural Network (Chung et al., 2014), the RAE with WordNet features and the corrected aligner.

System Overview
This section describes the modules that constitute our three runs. Detailed information about the configuration of these runs can be found in Section 3.

RAE with WordNet Features
RAE with the WordNet module is composed of two major parts: a recursive autoencoder (RAE) for unsupervised training of sentence representations and an additional WordNet-based submodule for enhancing the performance of the RAE.
The RAE takes unlabeled parse trees and word vectors as input and learns phrase features for each node in the tree. The learned features can be used to recursively reconstruct the vectors at each node in the tree. The encoding part follows the Semantic Dependency Tree Recursive Neural Network (SDT-RNN) structure described in . In the decoding part, the tree structure used to encode the sentence is mirrored. The reconstruction error (the total error of the network) is computed for all subtrees as the summed Euclidean distance between a subtree's decoded terminal nodes and the original word vectors. The network learns to encode representations of meaningful phrases in tree nodes. We use the word representation vectors published by (Pennington et al., 2014).
The RAE is first trained in an unsupervised way on the Corpus of Contemporary American English (Davies, 2008) combined with SemEval STS training sets released before 2015 (only sentences without labels), then a sentence similarity matrices computing module (Section 2.1.2) is used to generate similarity scores for two candidate sentences. Our first experiment involved a procedure described in (Socher et al., 2011) with unmodified word representation vectors, using Euclidean distance as a measure of word-to-word similarity.
However, we noticed that some pairs of vectors representing related concepts (e.g., 'lady' and 'woman') were located surprisingly far from each other in the Euclidean space, while others were too close. As shown in , words and phrases which merely belong to the same class of concepts without being exact synonyms have a low distance in Euclidean space (e.g., the phrases 'a few seconds' and 'two years' are grouped together). The representation vectors returned by the RAE do not remedy this problem. For this reason, we created an additional module which uses WordNet (Miller, 1995) to enhance our word similarity measures in RAE trees. In Table 1 we show the influence of individual modules.

WordNet awards and penalties
The WordNet module adjusts the Euclidean distance between RAE vectors with awards and penalties based on the semantic similarity of pairs of words. We combined the following ideas:
• awarding pairs of words with positive semantic similarity;
• penalizing out-of-context words and disjoint similar concepts;
• propagating scores to higher nodes of the dependency trees.
The concept of semantic similarity reflects the work of (Han et al., 2013), while out-of-context words and disjoint similar concepts reflect the ideas presented in (Han et al., 2015), but there are differences in both implementation and usage.
If none of the conditions are met, the semantic distance D is set to a negative value, which facilitates the counting of an award A described below. Thus, effectively, the value of D is an integer such that D(x, y) ∈ {−1, 0, 1, 2, 3, 4, 5}.
The semantic distance D(x, y) is transformed into an award A(x, y) using the formula $e^{-\alpha D(x, y)}$ introduced by (Li et al., 2003), where α is set to 0.25, as this value seemed to yield the best results:

$$A(x, y) = \begin{cases} \beta e^{-\alpha D(x, y)} & \text{if } D(x, y) \geq 0 \\ 0 & \text{if } D(x, y) < 0 \end{cases}$$

where β is a positive number (5 by default) used to control the level of adjustment made by the WordNet-related score. If β = 5, the maximum score for A(x, y) is 5. The Euclidean distances of RAE vectors are usually in the range [0, 10], so this parameter ensures that the WordNet-related similarity carries sufficient weight.
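The mapping from semantic distance to award can be illustrated with a minimal sketch. It assumes that the award is β·e^(−αD) for non-negative distances and 0 for the negative "no relation" value; the function name `award` is illustrative and not taken from the original system.

```python
import math


def award(distance: int, alpha: float = 0.25, beta: float = 5.0) -> float:
    """Return the WordNet award A for a semantic distance D in {-1, 0, ..., 5}."""
    if distance < 0:
        # Assumption: a negative distance marks a pair with no WordNet relation,
        # so it earns no award.
        return 0.0
    # Exponential decay: identical concepts (D = 0) receive the full award beta.
    return beta * math.exp(-alpha * distance)


# Print the award for every admissible distance value.
for d in range(-1, 6):
    print(d, round(award(d), 3))
```

With the default β = 5, the award stays within [0, 5], which is comparable in magnitude to the typical range of the RAE distances.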
Penalties. The out-of-context penalty for a word x, OOC(x), is defined as a penalty for a word that is not paired with any word in the second input sentence SS (Han et al., 2015). The word is not paired if its semantic similarity (or award) A is 0 with all the words (indexed by i) in the second sentence.
We allow three strategies for out-of-context penalization: penalize all recognizable parts of speech (nouns, verbs, adjectives and adverbs), penalize only nouns, or penalize only physical objects (i.e., words which have physical object in their WordNet hypernym path). The third option is used by default, since both the original research of (Han et al., 2015) and ours suggest that it is usually the best option (although, in a minority of tests, penalizing all out-of-context nouns yields better results).
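The out-of-context test can be sketched as follows. This is only an illustration: the penalty magnitude (−1.0), the `is_penalizable` predicate standing in for the three strategies, and the toy award table are all assumptions, not values from the original system.

```python
from typing import Callable, Sequence


def out_of_context_penalty(
    word: str,
    other_sentence: Sequence[str],
    award_fn: Callable[[str, str], float],
    is_penalizable: Callable[[str], bool] = lambda w: True,
    penalty: float = -1.0,  # placeholder magnitude, not taken from the paper
) -> float:
    """Penalize a word that earns a zero award with every word of the other sentence."""
    if not is_penalizable(word):
        # The word is outside the chosen strategy (e.g. not a physical object).
        return 0.0
    if all(award_fn(word, other) == 0 for other in other_sentence):
        return penalty
    return 0.0


# Hypothetical award table standing in for the WordNet-based award function.
TOY_AWARDS = {frozenset({"lady", "woman"}): 5.0}


def toy_award(x: str, y: str) -> float:
    if x == y:
        return 5.0
    return TOY_AWARDS.get(frozenset({x, y}), 0.0)


second_sentence = ["the", "woman", "smiled"]
print(out_of_context_penalty("lady", second_sentence, toy_award))   # paired with "woman"
print(out_of_context_penalty("piano", second_sentence, toy_award))  # paired with nothing
```

Passing a different `is_penalizable` predicate (for instance, one that accepts only nouns) switches between the penalization strategies without changing the pairing test.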
We also penalize disjoint similar concepts. Disjoint similar concepts DSC(x, y) are defined as 'special care antonyms' or words of disjoint meaning (e.g., Monday and Tuesday). In our solution they are found using WordNet's hypernym hierarchy: if two words have a common direct hypernym, they are disjoint similar concepts (e.g., both Monday and Tuesday have weekday as a common hypernym in the WordNet hierarchy). By default, the DSC(x, y) function returns a penalty of −2 when two words are found to be antonyms or disjoint similar concepts, and 0 otherwise. Thus, the penalty P for two words x and y combines the out-of-context penalties with the disjoint similar concepts penalty, and the complete framework of WordNet-related scores combines the award A with the penalty P.
Propagation. In the sentence similarity matrices computing module (Section 2.1.2), scores are calculated for all sentence subtrees, while WordNet awards and penalties are calculated for words. Thus, to use the WordNet scores on all subtrees, the awards and penalties have to be propagated up until they affect all nodes. To propagate the WordNet scores on a tree containing more than one word, we define a function that sums the depth-weighted leaf scores:

$$\sum_{i=1}^{n} \frac{sim_{WN}(i)}{depth(i)}$$

where i is an index for a single leaf in the subtree, n is the total number of leaves in the subtree, and $sim_{WN}(i)$ is the similarity score for leaf i. The scores for particular leaves are divided by their depth, depth(i), relative to the root of the subtree to account for their importance in more complex trees, and then the scores are summed up at the root. For example, if we have a leaf with a score of 0.75, then this score is added at the root level with its full weight only if the leaf is a direct child of the root. If it is located deeper in the subtree, the score is divided by the leaf's depth relative to the root of the subtree. The same procedure applies to all the leaves in the current subtree. In Table 1 we compare the efficiency of the WordNet module with and without propagation.
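Both ideas in this part can be sketched in a few lines. The hypernym and antonym tables below are hypothetical stand-ins for real WordNet lookups, the rule that identical words are never penalized is an assumption, and the tree is represented as nested lists whose leaves are already-computed WordNet scores.

```python
from typing import List, Union

# Hypothetical lookup tables standing in for WordNet queries.
DIRECT_HYPERNYMS = {
    "monday": {"weekday"},
    "tuesday": {"weekday"},
    "dog": {"canine"},
}
ANTONYMS = {frozenset({"hot", "cold"})}


def dsc_penalty(x: str, y: str, penalty: float = -2.0) -> float:
    """Return the penalty for antonyms or words sharing a direct hypernym."""
    if x == y:
        return 0.0  # assumption: identical words are not disjoint concepts
    shared = DIRECT_HYPERNYMS.get(x, set()) & DIRECT_HYPERNYMS.get(y, set())
    if shared or frozenset({x, y}) in ANTONYMS:
        return penalty
    return 0.0


Tree = Union[float, List["Tree"]]


def propagate(subtree: Tree, depth: int = 0) -> float:
    """Sum leaf scores, each divided by its depth relative to the subtree root."""
    if isinstance(subtree, (int, float)):
        # A lone leaf (depth 0) keeps its full score.
        return float(subtree) if depth == 0 else subtree / depth
    return sum(propagate(child, depth + 1) for child in subtree)


print(dsc_penalty("monday", "tuesday"))
# The leaf 0.75 is a direct child of the root; 0.5 and 1.0 sit one level deeper.
print(propagate([0.75, [0.5, 1.0]]))
```

In the example tree, the direct child contributes its full score while the two deeper leaves contribute half of theirs, which mirrors the depth weighting described above.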
An alternative strategy for incorporating the WordNet scores is to refine the vectors associated with particular leaves and then use these refined vectors to recompute complex tree nodes using the RAE. For example, given the WordNet similarity for the words 'woman' and 'lady', and the vectors A for 'woman' and B for 'lady', the vector A is refined using the following formula: The vector B stays the same. Our tests showed that the former strategy (score propagation) offers better results, so we adopted it.

Sentence Similarity Matrices Computing
As the first step of our full algorithm, the RAE computes vectors for every node in the dependency parse tree. The subtrees of these trees are then used to create the distance matrix in a number of steps: the trees are traversed in level order, the subtrees are sorted by depth, and the leaves representing stop words are removed (the stop word list contains about 60 of the most common words in the training data set and all punctuation characters). The remaining subtrees are used to construct the distance matrix, which is filled with the Euclidean distances d between each pair of subtrees x and y.
The above score is further transformed in two ways: it is made certain that the score value falls within the range 0-5, and the distance is set to 0 if the WordNet similarity $sim_{WN}$ has its maximum value. The original score $D_{RAE}$ is replaced with the adjusted score in the distance matrix. Finally, we use a dynamic pooling module as described in (Socher et al., 2011). The pooling module accounts for the varying lengths of the two trees.
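A minimal sketch of the matrix construction and pooling steps follows. It assumes that each subtree is already represented by a fixed-length vector and that pooling takes the minimum over a grid of roughly equal regions; this is a simplified variant of dynamic pooling, not the exact procedure of the original system, and it omits the WordNet adjustment. All function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def distance_matrix(nodes_1: Sequence[Vector], nodes_2: Sequence[Vector]) -> List[List[float]]:
    """Euclidean distance between every subtree vector of sentence 1 and sentence 2."""
    return [[math.dist(u, v) for v in nodes_2] for u in nodes_1]


def chunk_bounds(n: int, k: int) -> List[Tuple[int, int]]:
    """Split range(n) into k nearly equal chunks; chunks repeat indices when n < k."""
    return [(i * n // k, max((i + 1) * n // k, i * n // k + 1)) for i in range(k)]


def dynamic_min_pool(matrix: List[List[float]], size: int) -> List[List[float]]:
    """Reduce (or expand) a non-empty matrix to size x size by min-pooling each region."""
    row_chunks = chunk_bounds(len(matrix), size)
    col_chunks = chunk_bounds(len(matrix[0]), size)
    return [
        [
            min(matrix[r][c] for r in range(r0, r1) for c in range(c0, c1))
            for (c0, c1) in col_chunks
        ]
        for (r0, r1) in row_chunks
    ]


# Toy two-dimensional node vectors for two sentences of different lengths.
sentence_1 = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
sentence_2 = [(0.0, 0.0), (6.0, 8.0)]
distances = distance_matrix(sentence_1, sentence_2)
print(dynamic_min_pool(distances, 2))
```

Whatever the lengths of the two sentences, the pooled matrix has a fixed shape, so its cells can be used directly as features for a regressor.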

Final RAE-based solution
The final score was produced by Linear Support Vector Regression over cells from the distance matrices after pooling as well as 12 additional features:
• adjustment of roots (the Euclidean distance between WordNet-adjusted tree roots);
• cosine distance between vectors representing tree roots of sentences;
• information about the negation status of the two sentences (if both sentences contain or both do not contain negation = true, otherwise = false);
• mean out-of-context penalty over the full tree;
• mean disjoint penalty over the full tree;
• mean WordNet similarity score over the full tree;
• score from the aligner (Section 2.2);
• if both sentences agree on the numbers (true for no numbers or the same numbers; false otherwise);
• if both sentences have the same numbers (binary);
• the absolute difference in tokens between the two sentences;
• if the numbers in one sentence contain the numbers from the second sentence (binary);
• the percentage of token similarity between the two sentences.
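The number-related and length-related features in the list above could be computed along the following lines. This is a sketch under stated assumptions: numbers are recognized only when written as digits, tokens are split on whitespace, and "same numbers" is interpreted as a non-empty, identical set of numbers. The feature names are illustrative.

```python
import re
from typing import Dict, Union

# Assumption: only digit sequences (optionally with a decimal part) count as numbers.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_features(sentence_1: str, sentence_2: str) -> Dict[str, Union[bool, int]]:
    """Compute simple number-agreement and length-difference features."""
    numbers_1 = set(NUMBER_RE.findall(sentence_1))
    numbers_2 = set(NUMBER_RE.findall(sentence_2))
    return {
        # True when neither sentence has numbers or both have the same ones.
        "numbers_agree": numbers_1 == numbers_2,
        # Assumed reading: both sentences contain numbers and the sets match.
        "same_numbers": bool(numbers_1) and numbers_1 == numbers_2,
        # True when the numbers of one sentence are a subset of the other's.
        "numbers_contained": numbers_1 <= numbers_2 or numbers_2 <= numbers_1,
        # Absolute difference in whitespace-separated token counts.
        "token_count_diff": abs(len(sentence_1.split()) - len(sentence_2.split())),
    }


print(number_features("3 dogs ran", "3 dogs and 2 cats ran"))
```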
A new SVM classifier was created for every test set, since a different subset of training sets was used for every test set. Distance matrices for all classifiers were created and normalized independently:

$$norm_c = 0.4 \left( \frac{\max(\min(c - \mu, 3\sigma), -3\sigma)}{3\sigma} + 1 \right) + 0.1 \quad (9)$$

First, the mean µ and standard deviation σ of the matrix were calculated; next, Equation 9 was applied to every cell c of the matrix. The equation comes from (Socher et al., 2013) and normalizes the values to the range [0.1, 0.9].
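The normalization can be sketched as below, reading the garbled term as c − µ so that the result falls in [0.1, 0.9] as stated. The use of the population standard deviation and the handling of a constant matrix (σ = 0) are assumptions made for this illustration.

```python
import statistics
from typing import List


def normalize_matrix(matrix: List[List[float]]) -> List[List[float]]:
    """Clip each cell to mu +/- 3*sigma and map it linearly into [0.1, 0.9]."""
    cells = [c for row in matrix for c in row]
    mu = statistics.fmean(cells)
    sigma = statistics.pstdev(cells)  # assumption: population standard deviation
    if sigma == 0:
        # Assumption: a constant matrix maps to the midpoint of the target range.
        return [[0.5 for _ in row] for row in matrix]

    def norm(c: float) -> float:
        clipped = max(min(c - mu, 3 * sigma), -3 * sigma)
        return 0.4 * (clipped / (3 * sigma) + 1) + 0.1

    return [[norm(c) for c in row] for row in matrix]


print(normalize_matrix([[0.0, 10.0], [5.0, 5.0]]))
```

Cells equal to the mean land at 0.5, and cells three or more standard deviations away from the mean are clipped to the ends of the range.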

Aligner
As a monolingual word aligner we use two algorithms: a basic aligner and a corrected aligner. Both are based on the aligner described in (Sultan et al., 2014). The basic algorithm performs the following steps for the two sentences: align identical word sequences, align named entities, align content words using dependencies and align content words using surrounding words. Scoring is calculated according to (Sultan et al., 2015):

$$score(S_1, S_2) = \frac{n_a(S_1) + n_a(S_2)}{n(S_1) + n(S_2)}$$

where $n(S_i)$ and $n_a(S_i)$ are the number of content words and aligned content words in a sentence $S_i$, respectively.
An aligner using only the basic algorithm could not handle negations and antonyms well, so we modified it by adding two modules. The negation module checks whether a negation component is present in only one sentence, and if so, reduces the score to 0. The antonym module verifies whether the two sentences contain at least one pair of antonyms from a list based on WordNet, and if so, also reduces the score to 0.
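The scoring rule with the two overrides can be summarized in a short sketch. The function and parameter names are illustrative; detection of negation and antonym pairs is assumed to happen elsewhere and is passed in as flags, and returning 0 for sentences without content words is an assumption.

```python
def aligner_score(
    aligned_1: int,
    aligned_2: int,
    content_1: int,
    content_2: int,
    negation_in_one_only: bool = False,
    has_antonym_pair: bool = False,
) -> float:
    """Proportion of aligned content words, zeroed by the negation and antonym modules."""
    if negation_in_one_only or has_antonym_pair:
        return 0.0
    total_content = content_1 + content_2
    if total_content == 0:
        return 0.0  # assumption: no content words means no measurable similarity
    return (aligned_1 + aligned_2) / total_content


# Three aligned content words on each side, out of 4 and 5 content words.
print(aligner_score(3, 3, 4, 5))
# The same alignment, but only one sentence contains a negation.
print(aligner_score(3, 3, 4, 5, negation_in_one_only=True))
```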
The corrected aligner is a Linear Support Vector Regression (Drucker et al., 1996) using the following features: the modified basic aligner feature, Bag of Words features inspired by (Han et al., 2015) (element-wise absolute value difference between vectors for words and bigrams, sentences' length difference, percentage of exact lemma to lemma matches) and additional features used in (Hänig et al., 2015):
• length of the longest common subsequence of characters (some characters may be skipped),
• length of the longest common sequence of characters,
• cosine similarity between vectors of words,
• edit distance between sentences,
• WordNet word overlap (Šarić et al., 2012).
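Three of the string-based features listed above have standard dynamic-programming formulations, sketched here. Applying them at the character level (rather than the token level) and using unit costs for the edit distance are assumptions; the function names are illustrative.

```python
def lcs_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters may be skipped)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters shared contiguously by both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            run = prev[j - 1] + 1 if ch_a == ch_b else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


s1, s2 = "kitten", "sitting"
print(lcs_subsequence_length(s1, s2), longest_common_substring_length(s1, s2), edit_distance(s1, s2))
```

Each function keeps only two rows of the dynamic-programming table, so memory use grows with the length of one sentence rather than with the product of both lengths.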

Ensemble
The ensemble classifier was a Linear Support Vector Regression over the results from the other classifiers used for semantic similarity measurement. Each of them returned a score from 0 to 5. The following classifiers were chosen for the ensemble approach:
• modified basic aligner, presented in Section 2.2;
• bi-directional Gated Recurrent Neural Network (Chung et al., 2014) with the output neural network described in (Tai et al., 2015);
• RAE with WordNet Features, described in Section 2.1;
• corrected aligner, described in Section 2.2.
The training data set was split into 75% and 25% parts. All classifiers except the aligner (which does not need to be trained) were trained on the 75% part. The ensemble classifier was trained on a subset of the remaining data set.
Scores returned by the above classifiers were used as features in the Linear Support Vector Regression. The final result was rescaled to yield a score in the [0, 5] range.
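The data flow around the ensemble regressor can be sketched as follows. The regressor itself is left out; the sketch only shows a seeded 75%/25% split, the stacking of base-classifier scores into feature rows, and a final clipping step. Treating the rescaling as simple clipping to [0, 5] is an assumption, and all names are illustrative.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_holdout(
    pairs: Sequence[T], train_fraction: float = 0.75, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the sentence pairs and split them into base-training and held-out parts."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def stack_scores(base_scores: Dict[str, List[float]]) -> List[List[float]]:
    """Turn per-classifier score lists into one feature row per sentence pair."""
    names = sorted(base_scores)  # fixed column order
    return [list(row) for row in zip(*(base_scores[name] for name in names))]


def clip_score(value: float, low: float = 0.0, high: float = 5.0) -> float:
    """Assumed rescaling: force the regressor output into the [0, 5] range."""
    return max(low, min(high, value))


train_part, held_out = split_train_holdout(range(100))
rows = stack_scores({"aligner": [1.0, 4.0], "rae": [2.0, 3.5]})
print(len(train_part), len(held_out), rows, clip_score(5.7))
```

Training the ensemble on held-out pairs keeps its input scores representative of how the base classifiers behave on unseen data.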

Evaluation
In order to separately train models for each evaluation set, we created reference test sets that are similar to the evaluation sets, e.g. for headlines we used  These randomly selected samples that constituted the test sets were removed from the training sets. We used these reference test sets to find the best parameters of our models and chose the best model for run EN2. The final model uses all samples from sets assigned to each evaluation set in Table 2.
Our final results are presented in Table 3. For the AE run, we used RAE with WordNet Features, as described in Section 2.1. For each test, a separate classifier was created with its own training set, as presented in Table 2. The mapping was based on the average number of words per sentence in the set.
For the EN1 run, we used the ensemble model described in Section 2.3. In the EN2 model we chose either RAE or the ensemble based on the results for the test sets matched with the evaluation sets.
As shown in Table 3, the ensemble model (EN1) yields better results than RAE (AE) for all sets. Thus, the merged model (EN2) falls between the two.
We also present the results (Table 4) of our solution for SemEval 2015 sets. Comparing them with the best run from SemEval 2015 competition (weighted mean), we concluded that bi-GRU yields the worst results. Second and third worst results came from the modified aligner and RAE respectively. The ensemble and merged model yield the best results that surpass the performance of the 2015 winning solution.

Conclusions and Future Work
Our solution uses vector similarity features derived from word embeddings without losing the information contained in lexical similarity relations. As it turned out, one of the primary limitations of our paraphrase detection system is its heavy reliance on word order, which makes the solution less universal in its application. Another drawback of converting words to word vectors is the inability to account for situations where the same information is formatted differently (for instance, units of measurement, time expressions, etc.). Thus, our future work includes improving our preprocessing module so that it produces a unified input, e.g., all numbers written in words will be converted into numerals and all dates will be unified into one format. We will also use specifically designed training modes to prevent overfitting and create a new curriculum learning dataset to make RAE training easier.