UWB at SemEval-2016 Task 1: Semantic Textual Similarity using Lexical, Syntactic, and Semantic Information

We present our UWB system for the Semantic Textual Similarity (STS) task at SemEval 2016. Given two sentences, the system estimates the degree of their semantic similarity. We use state-of-the-art algorithms for meaning representation and combine them with the best performing approaches to STS from previous years. These methods benefit from various sources of information: lexical, syntactic, and semantic. In the monolingual task, our system achieves a mean Pearson correlation of 75.7% with human annotators. In the cross-lingual task, our system achieves a correlation of 86.3% and is ranked first among 26 systems.


Introduction
Semantic Textual Similarity (STS) is one of the core disciplines in Natural Language Processing (NLP). Given two textual fragments (phrases, sentences, paragraphs, or full documents), the goal is to estimate the degree of their semantic similarity.
STS systems are usually evaluated against manually annotated data. In the case of SemEval, the data consist of pairs of sentences with a score between 0 and 5, where a higher number means higher semantic similarity. For example, the English pair Two dogs play in the grass. / Two dogs playing in the snow. has a score of 2.8, i.e. the sentences are not equivalent but share some information.
This year, SemEval's STS is extended with a Spanish-English cross-lingual subtask, where e.g. the pair Tuve el mismo problema que tú. / I had the same problem. has a score of 4.8, meaning the sentences are nearly equivalent.
STS has been among the most popular tasks at each SemEval competition. The best STS system at SemEval 2012 (Bär et al., 2012) used lexical similarity and Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007). At SemEval 2013, the best model (Han et al., 2013) used semantic models such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990), external information sources (WordNet), and n-gram matching techniques. At SemEval 2014 and 2015 the best system came from (Sultan et al., 2014a; Sultan et al., 2014b; Sultan et al., 2015). They introduced a new algorithm that aligns the words between two sentences and showed that this approach can also be used efficiently for STS. An overview of systems participating in previous SemEval competitions can be found in (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015).
The best performing systems from previous years are based on various architectures benefiting from lexical, syntactic, and semantic information. In this work we take the best techniques presented in recent years, enhance them, and combine them into a single model.

Lexical and Syntactic Similarity
This section presents the techniques exploiting lexical and syntactic information in the text. Some of them have been successfully used by (Bär et al., 2012). Many of the following techniques benefit from weighting the words in a sentence by Term Frequency–Inverse Document Frequency (TF-IDF) (Manning and Schütze, 1999).
• Lemma n-gram overlaps: We compare word n-grams in both sentences using the Jaccard Similarity Coefficient (JSC) (Manning and Schütze, 1999), separately for orders n ∈ {1, 2, 3, 4}. The Containment Coefficient (Broder, 1997) is used for orders n ∈ {1, 2}. We extend the original metrics by weighting the n-grams: we define the weight of an n-gram as the sum of the IDF values of its words, so an n-gram match is not counted as 1 but as the weight of that n-gram. According to our experiments, this weighting significantly improves performance.
We also use the length of the Longest Common Subsequence relative to the lengths of the sentences.
• POS n-gram overlaps: In the same way as for lemmas, we calculate the Jaccard Similarity Coefficient and Containment Coefficient for n-grams of part-of-speech (POS) tags. Again, we use n-gram weighting and n ∈ {1, 2, 3, 4}. These features exploit the syntactic similarity of the sentences.
• Character n-gram overlaps: Similarly to lemma or POS n-grams, we use the Jaccard Similarity Coefficient and Containment Coefficient to compare common substrings in both sentences. Here the IDF weights are computed at the character n-gram level. We use n-gram weighting and n ∈ {2, 3, 4, 5}.
We also enrich these features with Greedy String Tiling (Wise, 1996), which can deal with reordered text parts, and with the Longest Common Substring (LCS), measuring the ratio between the LCS and the lengths of the sentences.
• TF-IDF: For each word in a sentence we calculate its TF-IDF value. Given the word vocabulary V, a sentence is represented as a vector of dimension |V| holding the TF-IDF values of the words present in the sentence. The similarity between two sentences is the cosine similarity between the corresponding TF-IDF vectors.
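The IDF-weighted n-gram overlap described above can be sketched as follows. This is a minimal illustration under our own assumptions: function names are ours, and words missing from the IDF table default to a weight of 1.0.

```python
def ngrams(tokens, n):
    # all contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_weight(gram, idf):
    # weight of an n-gram = sum of the IDF values of its words
    # (words missing from the IDF table default to 1.0 -- our assumption)
    return sum(idf.get(w, 1.0) for w in gram)

def weighted_jaccard(tokens1, tokens2, n, idf):
    # Jaccard Similarity Coefficient where each matched n-gram counts
    # as its IDF weight rather than as 1
    a, b = set(ngrams(tokens1, n)), set(ngrams(tokens2, n))
    union = sum(ngram_weight(g, idf) for g in a | b)
    if union == 0.0:
        return 0.0
    return sum(ngram_weight(g, idf) for g in a & b) / union
```

With a uniform IDF table this reduces to the plain Jaccard Similarity Coefficient; the Containment Coefficient variant would divide by the weight of only one of the two sets.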

Semantic Similarity
In this section we describe in detail the techniques that are more semantically oriented and are based on the Distributional Hypothesis. This principle states that we can induce (to some degree) the meaning of words from their distribution in text. The claim has theoretical roots in psychology, structural linguistics, and lexicography (Firth, 1957; Rubenstein and Goodenough, 1965; Miller and Charles, 1991).
• Semantic composition: This approach is based on Frege's principle of compositionality, which states that the meaning of a complex expression is determined by the composition of its parts, i.e. words. To represent the meaning of a sentence we use a simple linear combination of word vectors, with weights given by the TF-IDF values of the corresponding words. We use state-of-the-art word embedding methods, namely Continuous Bag of Words (CBOW) (Mikolov et al., 2013) and Global Vectors (GloVe) (Pennington et al., 2014), and compare the resulting vectors by cosine similarity.
• Paragraph2Vec: Paragraph vectors were proposed in (Le and Mikolov, 2014) as an unsupervised method of learning text representations. The resulting feature vector has a fixed dimension while the input text can be of any length. The paragraph vector and word vectors are concatenated to predict the next word in a context; the paragraph token acts as a memory that remembers what information is missing from the current context. We use cosine similarity to compare two paragraph vectors.
• Tree LSTM: Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) with a complex computational unit. We use the tree-structured LSTM presented in (Tai et al., 2015), where the tree models the sentence structure. An RNN processes input sentences of variable length via recursive application of a transition function on a hidden state vector h_t. For each sentence pair, the Tree-LSTM model creates sentence representations h_L and h_R. Given these representations, the model predicts the similarity score with a neural network that considers both the distance and the angle between the vectors.
• Word alignment: The method presented in (Sultan et al., 2014a; Sultan et al., 2014b; Sultan et al., 2015) has been very successful in recent years. Given two sentences to compare, it finds and aligns the words that have a similar meaning and a similar role in the sentences.
Unlike the original method, we assume that not all word alignments are equally important for the meaning of the sentences. The weight of a set of words A is the sum of the IDF values of its words, ω(A) = Σ_{w∈A} IDF(w). The sentence similarity is then given by

sim(S_1, S_2) = (ω(A_1) + ω(A_2)) / (ω(S_1) + ω(S_2)),

where S_1 and S_2 are the input sentences (represented as sets of words), and A_1 and A_2 denote the sets of aligned words in S_1 and S_2, respectively. This weighting of alignments improves our results significantly.
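The IDF-weighted alignment score can be sketched as below. The aligner itself (the monolingual word aligner of Sultan et al.) is treated as a black box here; only the weighted scoring is shown, with sentences simplified to word sets and unknown words defaulting to an IDF of 1.0 (our assumptions).

```python
def omega(words, idf):
    # omega(A) = sum of the IDF values of the words in A
    return sum(idf.get(w, 1.0) for w in words)

def alignment_similarity(s1, s2, a1, a2, idf):
    # s1, s2: input sentences as word sets; a1, a2: their aligned words,
    # as produced by some external word aligner
    denom = omega(s1, idf) + omega(s2, idf)
    return (omega(a1, idf) + omega(a2, idf)) / denom if denom else 0.0
```

With a uniform IDF table this reduces to the original unweighted proportion of aligned words; rare (high-IDF) words otherwise contribute more to the score.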

Similarity Combination
The combination of STS techniques is in fact a regression problem: the goal is to find a mapping from an input space x_i ∈ R^d of d-dimensional real-valued vectors (each value x_{i,a}, 1 ≤ a ≤ d, represents a single STS technique) to an output space y_i ∈ R of real-valued targets (the desired semantic similarity). This mapping is learned from training data {x_i, y_i}_{i=1}^N of size N. Many regression methods exist; we experiment with several of them:
• Linear Regression: Linear Regression (LR) is probably the simplest regression method. It is defined as y_i = λ · x_i, where λ is a vector of weights that can be estimated, for example, by the least squares method.
• Gaussian Processes Regression: Gaussian process regression (GPR) is a nonparametric kernel-based probabilistic model for non-linear regression (Rasmussen and Williams, 2005).
• SVM Regression: We use Support Vector Machines (SVM) for regression with radial basis functions (RBF) as the kernel. For parameter estimation we use the improved Sequential Minimal Optimization (SMO) algorithm introduced in (Shevade et al., 2000).
• Decision Trees Regression: The output of the Decision Trees Regression (DTR) (Breiman et al., 1984) is predicted by the sequence of decisions organized in a tree.
• Perceptron Regression: The Multilayer Perceptron (MLP) is a feed-forward artificial neural network trained by back-propagation.
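As a rough illustration of this combination step, the sketch below fits the listed regressors on synthetic STS feature vectors. Scikit-learn is used here only as a stand-in (the paper uses the WEKA implementations), and the feature matrix and targets are placeholders, not real STS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

np.random.seed(0)
X = np.random.rand(100, 5)      # 100 sentence pairs, 5 STS features each (toy data)
y = X.mean(axis=1) * 5.0        # toy similarity targets on the 0-5 scale

# one model per regression family mentioned in the text
models = {
    "LR": LinearRegression(),
    "GPR": GaussianProcessRegressor(),
    "SVR (RBF)": SVR(kernel="rbf"),
    "DTR": DecisionTreeRegressor(),
    "MLP": MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```

In the real system each row of X would hold the outputs of the individual STS techniques for one sentence pair, and y the gold similarity score.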

System Description
This section describes the settings of our final STS system. For the monolingual STS task we submitted two runs: the first is based on supervised learning, the second is an unsupervised system.
• UWB sup: A supervised system based on SVM regression with an RBF kernel. We use all of the techniques described in the previous two sections as features for the regression.
During the regression we also use a simple trick: we create additional features as the product of each pair of features, x_{i,a} × x_{i,b} for a ≠ b, to better model the dependencies between individual features. Together, we have 301 STS features. The system is trained on all SemEval datasets from prior years (see Table 1).
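The pairwise-product expansion can be sketched as follows (a minimal illustration; the function name is our own):

```python
from itertools import combinations

def expand_features(x):
    # augment a feature vector with the product of each distinct pair of
    # features, turning d features into d + d*(d-1)/2
    return list(x) + [x[a] * x[b] for a, b in combinations(range(len(x)), 2)]
```

For example, a 3-dimensional vector gains the three products of its feature pairs, for 6 features in total.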
We handled the cross-lingual STS task with Spanish-English bilingual sentence pairs in two steps. First, we translated the Spanish sentences to English via Google Translate. The English sentences were then preprocessed, i.e. tokenized, lemmatized, and POS tagged. Most of our STS techniques (apart from word alignment and POS n-gram overlaps) work with lemmas instead of word forms, which leads to slightly better performance.
Some of our STS techniques are based on unsupervised learning and thus need large unannotated corpora for training. We trained the Paragraph2Vec, GloVe, and CBOW models on the One Billion Word Benchmark presented in (Chelba et al., 2014). The vector dimension for all these models was set to 300. TF-IDF values were also estimated on this corpus.
All regression methods mentioned in Section 2.3 are implemented in WEKA (Hall et al., 2009).

Results and Discussion
This section presents the results of our systems for both the English monolingual and the Spanish-English cross-lingual STS tasks of SemEval 2016. In addition, we present detailed results on the test data from SemEval 2015. As the evaluation measure we use the Pearson correlation between the system output and the human annotations.
In the tables we present the correlation for each individual test set. The Mean column is the weighted mean of all correlations, where the weight of each dataset is the ratio of its size to the total size of all datasets. This mean of Pearson correlations is also the main evaluation measure for ranking the system submissions.
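The Mean column is thus computed as a size-weighted average, which can be written as (a trivial sketch, names our own):

```python
def weighted_mean_correlation(correlations, sizes):
    # weight each dataset's Pearson correlation by its share of the total size
    total = sum(sizes)
    return sum(c * s for c, s in zip(correlations, sizes)) / total
```

For instance, correlations of 0.8 and 0.6 on datasets of 3000 and 1000 pairs give a mean of 0.75, not the unweighted 0.7.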
Table 2 shows the results on the test data from 2015; here we trained our systems on the SemEval STS data from 2012-2014 (i.e. all years apart from 2015, see Table 1). We compare the individual STS techniques as well as the different types of regression. Clearly, SVM regression and Gaussian process regression perform best; with our feature set they are 1% better than the winning system of SemEval 2015. The best performing single technique is indisputably the weighted word alignment, with a 79.6% correlation with the gold data. Without weighting, we achieved only 74.2% on these data, while the original result reported by the authors of this approach was 79.2%; the difference is probably caused by some inaccuracies in our implementation. Still, the weighting improves the correlation even compared with the authors' original results.
The results for the monolingual STS task of SemEval 2016 are shown in Table 3. At the time of writing, the ranks of the submitted systems were not known, so we present only our correlations. Our supervised system (SVM regression) performs approximately 3% better than the unsupervised one (weighted word alignment). On the SemEval 2015 data this difference was less pronounced (approximately 1.5%).
Finally, the results for the cross-lingual STS task of SemEval 2016 are shown in Table 4. We achieved very high correlations. Frankly, we expected much lower correlations, since we rely on machine translation via Google Translate, which certainly introduces some inaccuracies (at least in the syntax of the sentences). On the other hand, this shows that our model generalizes the learned patterns efficiently. Here there is almost no difference in performance between the supervised and unsupervised versions of the submitted systems. Our submitted runs finished first and second among 26 competing systems.

Conclusion
In this paper we described our UWB system participating in the Semantic Textual Similarity task of the SemEval 2016 competition. We participated in both the monolingual and cross-lingual parts of the competition. Our best results were achieved by SVM regression over various STS techniques based on lexical, syntactic, and semantic information. This approach works well for both subtasks.