SOPA: Random Forests Regression for the Semantic Textual Similarity task

This paper describes the system used by the LIPN-IIMAS team in Task 2, Semantic Textual Similarity, at SemEval 2015, in both the English and Spanish sub-tasks. We included some features based on alignment measures and tested different learning models, in particular Random Forests, which proved the best among those used in our participation.


Introduction
Our participation in SemEval 2015 focused on solving the technical problems that afflicted our previous participation (Buscaldi et al., 2014) and on including additional features based on alignments, such as the Sultan similarity (Sultan et al., 2014b) and the measure available in CMU Sphinx-4 (Lamere et al., 2003) for speech recognition. We named the new system SOPA, from the Spanish word for "soup", since it uses a heterogeneous mix of features. Well aware of the importance that the training corpus and the regression algorithms have for the STS task, we used language models to select the most appropriate training corpus for a given text, and we explored some alternatives to the ν-Support Vector Regression (ν-SVR) (Schölkopf et al., 1999) used in our previous participations, specifically the Multi-Layer Perceptron (Bishop and others, 1995) and Random Forest (Breiman, 2001) regression algorithms. The results show that Random Forests outperform the other algorithms on every test set. We describe all the features in Section 2; the learning algorithms and the training corpus selection process are described in Section 3, and the results obtained by the system are detailed in Section 4.

Similarity Measures
In this section we describe the measures used as features in our system. The total number of features was 16 in English and 14 in Spanish. Since most measures were already used in our previous participation, we provide only a basic overview and refer the reader to the description of our previous system for further details. When POS tagging and NE recognition were required, we used Stanford CoreNLP for English and Spanish (Manning et al., 2014).

WordNet-based Conceptual Similarity
This measure was introduced to quantify the similarity between concepts with respect to an ontology. It is calculated as follows: first, the words in sentences $p$ and $q$ are lemmatised and mapped to the related WordNet synsets. All noun synsets are put into the set of synsets associated with the sentence, $C_p$ and $C_q$, respectively. If the synsets belong to one of the other POS categories (verb, adjective, adverb), we look for their derivationally related forms in order to find a related noun synset: if one exists, we put this synset in $C_p$ (or $C_q$). No disambiguation is carried out, so we take all possible meanings into account.
Given $C_p$ and $C_q$ as the sets of concepts contained in sentences $p$ and $q$, respectively, with $|C_p| \geq |C_q|$, the conceptual similarity between $p$ and $q$ is calculated as:

$$ss(p, q) = \frac{\sum_{c_1 \in C_p} \max_{c_2 \in C_q} s(c_1, c_2)}{|C_p|}$$

where $s(c_1, c_2)$ is a conceptual similarity measure. Concept similarity can be calculated in different ways. We used a variation of the Wu-Palmer formula (Wu and Palmer, 1994) named "ProxiGenea3", introduced by Dudognon et al. (2010), which is inspired by the analogy between a family tree and the concept hierarchy in WordNet. The ProxiGenea3 measure is defined as:

$$pg_3(c_1, c_2) = \frac{1}{1 + d(c_1) + d(c_2) - 2 \cdot d(c_0)}$$

where $c_0$ is the most specific concept that is present in the synset paths of both $c_1$ and $c_2$ (that is, the Least Common Subsumer or LCS). The function returning the depth of a concept is denoted by $d$.
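As a rough illustration of the two steps just described, the sketch below scores concept pairs from their depths and then aggregates the scores over two concept sets. It assumes that ProxiGenea3 takes the form 1 / (1 + d(c1) + d(c2) - 2 d(c0)) and that the sentence-level score averages, over the larger set, the best match found in the smaller one. The toy hierarchy and all function names are hypothetical and stand in for WordNet.

```python
# Toy is-a hierarchy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "dog": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal",
    "animal": "entity",
    "car": "vehicle", "vehicle": "entity",
    "entity": None,
}

def path_to_root(concept):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    """Depth d(c) of a concept; the root has depth 1."""
    return len(path_to_root(concept))

def lcs(c1, c2):
    """Least Common Subsumer: most specific concept on both root paths."""
    ancestors = set(path_to_root(c1))
    for concept in path_to_root(c2):
        if concept in ancestors:
            return concept
    return None

def pg3(c1, c2):
    """ProxiGenea3-style score (assumed form, see the lead-in)."""
    c0 = lcs(c1, c2)
    return 1.0 / (1 + depth(c1) + depth(c2) - 2 * depth(c0))

def conceptual_similarity(cp, cq):
    """Average, over the larger set, of the best match in the smaller set."""
    if len(cp) < len(cq):
        cp, cq = cq, cp
    if not cq:
        return 0.0  # assumption: no concepts to compare
    return sum(max(pg3(c1, c2) for c2 in cq) for c1 in cp) / len(cp)
```

With this toy hierarchy, identical concepts score 1 and the score decreases as the path through the common subsumer gets longer.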

IC-based Similarity
This measure was proposed by Mihalcea et al. (2006) as a corpus-based measure which uses Resnik's Information Content (IC) and the Jiang-Conrath (Jiang and Conrath, 1997) similarity metric. It is more precise than the one introduced in the previous subsection because it also takes into account the importance of concepts, not only their relative position in the hierarchy. We refer to our previous system description and to (Mihalcea et al., 2006) for a detailed description of the measure. The idf weights for the words were calculated using the Google Web 1T (Brants and Franz, 2006) frequency counts, while the IC values are those calculated by Ted Pedersen (Pedersen et al., 2004) on the British National Corpus.

Syntactic Dependencies
This measure tries to capture the syntactic similarity between two sentences using dependencies. Previous experiments showed that converting constituents to dependencies still achieved the best results on out-of-domain texts (Le Roux et al., 2012), so we decided to use a two-step architecture to obtain syntactic dependencies. First, we parsed pairs of sentences with the LORG parser. Second, we converted the resulting parse trees to Stanford dependencies.
Given the sets of parsed dependencies $D_p$ and $D_q$ for sentences $p$ and $q$, a dependency $d \in D_x$ is a triple $(l, h, t)$, where $l$ is the dependency label (for instance, dobj or prep), $h$ the governor and $t$ the dependant. The similarity measure between two syntactic dependencies $d_1 = (l_1, h_1, t_1)$ and $d_2 = (l_2, h_2, t_2)$ is the Levenshtein distance between the labels $l_1$ and $l_2$ multiplied by the average of $idf_h \cdot s(h_1, h_2)$ and $idf_t \cdot s(t_1, t_2)$, where $idf_h$ and $idf_t$ are the inverse document frequencies calculated on Google Web 1T for the governors and the dependants (we retain the maximum for each pair), respectively, and $s$ is the ProxiGenea3 measure. NOTE: This measure was used only in the English sub-task.

Information Retrieval-based Similarity
Let us consider two texts $p$ and $q$, an IR system $S$ and a document collection $D$ indexed by $S$. This measure is based on the assumption that $p$ and $q$ are similar if the documents retrieved by $S$ for the two texts, used as input queries, are ranked similarly. Let $L_p$ and $L_q$ be the sets of the top $K$ documents retrieved by $S$ for texts $p$ and $q$, respectively. Let us define $s_p(d)$ and $s_q(d)$ as the scores assigned by $S$ to a document $d$ for the queries $p$ and $q$, respectively. Then, the similarity score is calculated as:

$$sim_{IR}(p, q) = 1 - \frac{\sum_{d \in L_p \cap L_q} \frac{\sqrt{(s_p(d) - s_q(d))^2}}{\max(s_p(d), s_q(d))}}{|L_p \cap L_q|}$$

For the participation in the English sub-task we indexed a collection composed of the AQUAINT-2 and the English NTCIR-8 document collections, using the Lucene 4.2 search engine with BM25 similarity. We also indexed DBPedia abstracts and the UkWaC (Ferraresi et al., 2008), but they were used to produce two additional features (separate from the basic IR one). The Spanish index was created using the Spanish QA@CLEF 2005 (Agencia EFE 1994-95, El Mundo 1994-95) and multiUN (Eisele and Chen, 2010) collections. The K value was set to 70 after a study detailed in (Buscaldi, 2013). Another IR-based feature was derived from the rank-biased overlap measure introduced by Webber et al. (2010), which compares rankings without the need for weights. In total, we had 4 IR-based measures for English and 2 for Spanish.
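The score-based comparison of the two result lists can be sketched as follows. The sketch assumes that the similarity is one minus the average relative score difference over the documents shared by both top-K lists, that retrieval scores are positive, and that two lists with no document in common yield 0. The function names and the example scores are hypothetical; no search engine is queried here.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring documents."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def ir_similarity(scores_p, scores_q, k=70):
    """Compare the scores two queries assign to their shared top-k documents."""
    shared = top_k(scores_p, k) & top_k(scores_q, k)
    if not shared:
        return 0.0  # assumption: no common documents means no similarity
    penalty = sum(
        abs(scores_p[d] - scores_q[d]) / max(scores_p[d], scores_q[d])
        for d in shared
    )
    return 1.0 - penalty / len(shared)

# Hypothetical retrieval scores (document id -> score) for two input texts.
scores_p = {"doc1": 2.0, "doc2": 1.0}
scores_q = {"doc1": 1.0, "doc3": 3.0}
similarity = ir_similarity(scores_p, scores_q)
```

Only `doc1` is shared in this example, so the result depends solely on how differently the two queries score it.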

N-gram Based Similarity
This measure tries to capture the fact that similar sentences have similar n-grams, even if they are not placed in the same positions. The measure is based on the Clustered Keywords Positional Distance (CKPD) model proposed in (Buscaldi et al., 2009) for the passage retrieval task.
The similarity between a text fragment $p$ and another text fragment $q$ is calculated as:

$$Sim(p, q) = \frac{\sum_{x \in Q} h(x, P) \frac{1}{d(x, x_{max})}}{\sum_{i=1}^{n} w_i}$$

where $P$ is the set of the heaviest n-grams in $p$ whose terms are all also contained in $q$; $Q$ is the set of all the possible n-grams in $q$, and $n$ is the total number of terms in the longest sentence. The weight of each term is calculated as $w_i = 1 - \frac{\log(n_i)}{1 + \log(N)}$, where $n_i$ is the frequency of term $t_i$ in the Google Web 1T collection and $N$ is the frequency of the most frequent term in the same collection. The weight of each n-gram, $h(x, P)$, with $|P| = j$, is calculated as:

$$h(x, P) = \begin{cases} \sum_{k=1}^{j} w_k & \text{if } x \in P \\ 0 & \text{otherwise} \end{cases}$$

The function $d(x, x_{max})$ determines the minimum distance between an n-gram $x$ and the heaviest one $x_{max}$ as the number of words between them.
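To make the term weighting concrete, the following sketch evaluates w_i = 1 - log(n_i) / (1 + log(N)) for a few terms. The frequency counts are invented for illustration and merely stand in for Google Web 1T counts; the function name is hypothetical.

```python
import math

def term_weight(freq, max_freq):
    """w_i = 1 - log(n_i) / (1 + log(N)); rarer terms get higher weights."""
    return 1.0 - math.log(freq) / (1.0 + math.log(max_freq))

# Hypothetical counts standing in for Google Web 1T frequencies.
counts = {"the": 1_000_000, "treaty": 5_000, "sovietize": 10}
N = max(counts.values())
weights = {term: term_weight(n, N) for term, n in counts.items()}
```

The most frequent term receives a small positive weight, while a term seen only once would receive the maximum weight of 1.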

Geographical Context Similarity
This measure tries to determine whether the two sentences refer to events that took place in the same geographical area. It is based on the observation that the compatibility of the geographic context between the sentences is an important clue to determine whether the sentences are related or not, especially in news. We built a database of geographically-related entities, using geo-WordNet (Buscaldi and Rosso, 2008) and expanding it with all the synsets that are related to a geographically grounded synset. This implies that adjectives and verbs may also be used as clues for the identification of the geographical context of a sentence. For instance, "Afghan" is associated with "Afghanistan", "Sovietize" with "Soviet Union", etc. The Named Entities of type PER (Person) are also used as clues: we use Yago (http://www.mpi-inf.mpg.de/yago-naga/yago/) to check whether the NE corresponds to a famous leader or not, and in the affirmative case we add the related nation to the geographical context of the sentence. For instance, "Merkel" is mapped to "Germany". Given $G_p$ and $G_q$, the sets of places found in sentences $p$ and $q$, respectively, the geographical context similarity is calculated as follows:

$$sim_{geo}(p, q) = 1 - \log_K \left( 1 + \frac{\sum_{x \in G_p} \min_{y \in G_q} d(x, y)}{\max(|G_p|, |G_q|)} \right)$$

where $d(x, y)$ is the spherical distance in km between $x$ and $y$, and $K$ is a normalization factor set to 10000 km to obtain similarity values between 1 and 0. If no toponyms or geographically groundable entities are found in either sentence, then the geographic context similarity is set to 1.
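A minimal sketch of the distance-based part of this measure is given below. It assumes that places have already been resolved to (latitude, longitude) pairs, that the spherical distance is computed with the haversine formula, and that the score is one minus the base-K logarithm of one plus the average distance from each place in one sentence to its nearest place in the other. The function names, the Earth radius constant and the coordinates are illustrative assumptions; the lookup in geo-WordNet or Yago is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(a, b):
    """Spherical distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def geo_similarity(gp, gq, k=10000.0):
    """Geographic context similarity between two lists of (lat, lon) places."""
    if not gp or not gq:
        return 1.0  # assumption: nothing to compare, so no geographic mismatch
    total = sum(min(haversine_km(x, y) for y in gq) for x in gp)
    return 1.0 - math.log(1.0 + total / max(len(gp), len(gq)), k)

# Illustrative coordinates (degrees).
paris = (48.8566, 2.3522)
berlin = (52.52, 13.405)
same_place = geo_similarity([paris], [paris])
different_places = geo_similarity([paris], [berlin])
```

Under these assumptions, distances beyond roughly K km would push the value slightly below 0, so a real system might clamp the result.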

Word Alignment Similarity
This similarity metric is based on the work of (Sultan et al., 2014b; Sultan et al., 2014a). The metric calculates a similarity score based on an alignment between two texts. It starts by aligning similar words, then proceeds to align similar named entities, then words with similar content, and finally stop words. In the case of content words, it uses the syntactic context to identify similar words. At the end, the similarity is calculated as the harmonic mean of the ratios of aligned words from sentence one to sentence two, and from sentence two to sentence one.

CMU Sphinx-4 (Lamere et al., 2003) is a speech recognition system that includes an alignment function used to align speech transcriptions with text. We took one of the sentences as the reference and the other one as the transcription, and we used the output of the Sphinx alignment measure as a feature.
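The final harmonic-mean step of the word alignment score can be sketched as follows. The sketch assumes that an aligner has already produced the number of aligned words on each side; the aligner itself is not implemented, and the function name and arguments are hypothetical.

```python
def alignment_similarity(aligned_1, length_1, aligned_2, length_2):
    """Harmonic mean of the aligned-word ratios of two sentences."""
    if length_1 == 0 or length_2 == 0:
        return 0.0  # assumption: an empty sentence cannot be aligned
    ratio_1 = aligned_1 / length_1  # share of sentence one that is aligned
    ratio_2 = aligned_2 / length_2  # share of sentence two that is aligned
    if ratio_1 + ratio_2 == 0:
        return 0.0
    return 2 * ratio_1 * ratio_2 / (ratio_1 + ratio_2)

# Example: 3 of 4 words aligned in sentence one, 3 of 6 in sentence two.
score = alignment_similarity(3, 4, 3, 6)
```

The harmonic mean penalises pairs in which one sentence is largely covered by the alignment while the other is not.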

Other Measures
In addition to the above text similarity measures, we also used the difference in size between sentences and the following measures:

Cosine
Cosine distance calculated between $p = (w_{p1}, \ldots, w_{pn})$ and $q = (w_{q1}, \ldots, w_{qn})$, the vectors of tf.idf weights associated with sentences $p$ and $q$, with idf values calculated on Google Web 1T.
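A small sketch of this feature is shown below: it builds sparse tf.idf vectors and computes the cosine of the angle between them. The idf table is invented for illustration (the system used values derived from Google Web 1T), the default idf of 1.0 for unknown words is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse tf.idf vector; words missing from `idf` get a weight of 1.0."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 1.0) for term, count in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical idf values.
idf = {"a": 0.1, "dog": 2.0, "cat": 2.0, "barks": 3.0, "sleeps": 2.5}
p = tfidf_vector("a dog barks".split(), idf)
q = tfidf_vector("a cat sleeps".split(), idf)
value = cosine(p, q)
```

Because only the low-idf word "a" is shared, the two example sentences obtain a value close to 0.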

Edit Distance
This similarity measure is calculated using the Levenshtein distance on characters between the two sentences.
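The character-level Levenshtein distance can be computed with the usual dynamic programme, sketched below. Turning the distance into a similarity by dividing by the length of the longer sentence is an assumption of this sketch, since the normalisation is not specified here; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```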

Named Entity Overlap
This is a per-class overlap measure (in this way, "France" as an Organization does not match "France" as a Location) calculated using the Dice coefficient between the sets of NEs found, respectively, in sentences p and q.
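One way to obtain a per-class overlap is to represent each entity as a (text, class) pair, so that the same string with two different classes never matches. The sketch below follows that reading; the pair representation, the value returned when both sets are empty and the example entities are assumptions.

```python
def dice(a, b):
    """Dice coefficient between two sets."""
    if not a and not b:
        return 0.0  # assumption: no entities on either side
    return 2.0 * len(a & b) / (len(a) + len(b))

# Entities as (surface form, NE class) pairs; the examples are illustrative.
entities_p = {("France", "LOC"), ("Merkel", "PER")}
entities_q = {("France", "ORG"), ("Merkel", "PER")}
overlap = dice(entities_p, entities_q)
```

Here only "Merkel" as a person is shared, so "France" contributes nothing to the overlap.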

Skip-gram Similarity
This measure is obtained as the Dice coefficient calculated between the sets of skip-grams contained in the two sentences.
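As an illustration, the sketch below extracts skip-bigrams (ordered word pairs separated by a bounded number of skipped words) and compares two sentences with the Dice coefficient. The use of bigrams and the `max_skip` parameter are assumptions, since the skip-gram configuration is not specified here; the function names are hypothetical.

```python
def skip_bigrams(tokens, max_skip=2):
    """Ordered token pairs with at most `max_skip` tokens between them."""
    grams = set()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            grams.add((left, tokens[j]))
    return grams

def dice(a, b):
    """Dice coefficient between two sets."""
    if not a and not b:
        return 0.0  # assumption: no skip-grams on either side
    return 2.0 * len(a & b) / (len(a) + len(b))

grams_p = skip_bigrams(["a", "b", "c"], max_skip=1)
grams_q = skip_bigrams(["a", "c"], max_skip=1)
score = dice(grams_p, grams_q)
```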

Learning Models
Besides the ν-Support Vector Regression (ν-SVR) (Schölkopf et al., 1999) used in our previous participations, we used the Multilayer Perceptron and Random Forests. The Multilayer Perceptron (Bishop and others, 1995) is a neural network model which has several interesting properties, such as robustness and nonlinearity. Our implementation uses a simple gradient descent learning algorithm with backpropagation and one hidden layer with 5 units. Random Forests (Breiman, 2001) are an ensemble learning method based on bagging of decision trees grown with random feature selection. In our experiments, we used Random Forests with 10 bootstrap samples.
In our runs, we selected a subset of the training set according to a similarity measure between the test and the training set, based on a 1- to 3-gram language model and the average sentence length. The idea behind this selection process is that learning sentence similarities on a specific type of text will increase the accuracy of predictions on text with similar characteristics: image descriptions are usually written in a very different form than word definitions or forum answers. For each coherent subset of the training set, we built a language model $L_m = (G_1, G_2, G_3)$, where $G_n$ is the frequency distribution of n-grams in the subset. We obtained the same for the input dataset ($L_i$) and we calculated a score $S(L_m, L_i)$ based on the Bhattacharyya distance between pairs of distributions $F_1$ and $F_2$. We selected only those training datasets where $S(L_m, L_i) > 0.2$. In Table 3 we show the comparison of the results obtained with such selection (the official ones) and those obtained using the complete training set (not submitted).

The complete English training set was composed of the data from SemEval STS 2012, 2013 and 2014. In Spanish, we used our 2014 training set, which included the automatically translated English 2012-2013 pairs from STS and a corpus we made from RAE definitions, and the 2014 Spanish test set.

Tables 1 and 2 present the results of our runs in SemEval 2015 (Agirre et al., 2015). Our participation consisted of three runs for three different machine learning approaches to regression: Support Vector Regression (LIPN-SVM), Multi-Layer Perceptron (LIPN-MLP) and Random Forest (LIPN-RF). The LIPN-RF configuration was our best one and was ranked 25th run-wise and 14th system-wise for the English corpora; 5th run-wise and 3rd system-wise for Spanish. Our English system had better overall performance than the Spanish one. The best performance was reached on the Belief dataset in English and on the News dataset in Spanish.
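The n-gram distributions used for training-corpus selection in this section, and the Bhattacharyya distance between them, can be sketched as follows. The sketch only shows how one pair of distributions is compared; how the values for 1-, 2- and 3-grams and the average sentence length are combined into the selection score is not reproduced, and all function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_distribution(sentences, n):
    """Relative frequency of each n-gram over a list of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def bhattacharyya_coefficient(f1, f2):
    """Sum of sqrt(f1(g) * f2(g)) over n-grams present in both distributions."""
    return sum(math.sqrt(p * f2[gram]) for gram, p in f1.items() if gram in f2)

def bhattacharyya_distance(f1, f2):
    """-ln of the coefficient; infinite when the distributions do not overlap."""
    coefficient = bhattacharyya_coefficient(f1, f2)
    return float("inf") if coefficient == 0 else -math.log(coefficient)

# Hypothetical tokenised corpora.
training = [["a", "dog", "runs"], ["a", "cat", "runs"]]
test_set = [["a", "dog", "sleeps"]]
model = [ngram_distribution(training, n) for n in (1, 2, 3)]
target = [ngram_distribution(test_set, n) for n in (1, 2, 3)]
distances = [bhattacharyya_distance(m, t) for m, t in zip(model, target)]
```

In this toy example the unigram and bigram distributions overlap partially, while no trigram is shared, so the trigram distance is infinite.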

Results
Part of our proposal uses a language model to select a subset of the corpus used to train the regression. Table 3 shows the performance with the full dataset and with the selected training corpus on the English dataset for the three regression approaches. The overall score indicates that the corpus selection was not beneficial. The most significant improvement was concentrated on the Answers-students dataset; in this case the performance fell by 0.0588 points.

We checked the contribution of each feature using the Relief attribute selection measure (Kononenko, 1994) over the English training set. The best feature was the WordNet-based one, followed by the Sultan and IC-based similarities. The worst features were Rank-biased Overlap, followed by NE Overlap and the Geographic context similarity (however, apart from RBO, these do not have complete coverage). The other features have a statistically similar contribution.

Conclusions and Future Work
The new learning models adopted were particularly effective, outperforming the Support Vector Regression algorithm that we used in our previous participation. The alignment measure based on Sultan was also very effective, as indicated by feature selection. On the other hand, our corpus selection strategy did not prove useful in general, although it provided slight improvements on specific corpora. We will need to further analyse these results to understand how SOPA can still be improved.