SERGIOJIMENEZ at SemEval-2016 Task 1: Effectively Combining Paraphrase Database, String Matching, WordNet, and Word Embedding for Semantic Textual Similarity

In this paper, we describe a system for semantic textual similarity that participated in Task 1 of SemEval 2016 (monolingual and cross-lingual sub-tasks). The system contains a preprocessing step that simplifies text using PPDB 2.0 and detects negations. In addition, six lexical similarity functions were constructed using string matching, word embeddings, and synonym-antonym relations from WordNet. These lexical similarity functions are projected to the sentence level using a new method called Polarized Soft Cardinality, which supports negative similarities between words to model opposites. We also introduce a novel L2-norm "cardinality" for vector space representations. The system extracts a set of 660 features from each pair of text snippets using the proposed cardinality measures. From this set, a subset of 12 features was selected in a supervised manner. These features are combined by SVR and, alternatively, by the arithmetic mean to produce similarity predictions. Our team ranked second in the cross-lingual sub-task and got close to the best official results in the monolingual sub-task.


Introduction
Semantic Textual Similarity (STS) is a fundamental task in natural language processing that has been addressed in SemEval competitions uninterruptedly since 2012 (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015). The task is to compare two text fragments and produce a similarity score that is assessed against human judgment. This year (Agirre et al., 2016), a new cross-lingual sub-task in English and Spanish was proposed in addition to the traditional monolingual English task. In SemEval 2015, the most popular approach among the best systems was the use of word alignments between sentences, combining resources such as WordNet (Miller, George A., 1995), neural word embeddings (Mikolov, Tomas et al., 2013) and the Paraphrase Database (Pavlick et al., 2015). This paper describes our system submission to STS 2016, which uses a cardinality-based approach (instead of word alignments) for combining the resources mentioned above. Several teams used soft cardinality successfully in previous STS competitions from 2012 to 2014 (Jimenez et al., 2012; Jimenez et al., 2013a; Jimenez et al., 2013b; Lynum et al., 2014). For the proposed system, we extended the soft cardinality model to allow negative values in the lexical similarity component, modeling opposites due to antonymy and negation. Figure 1 shows the overall architecture of the proposed system. Yellow labels in the upper-left corner of each process component (blue squares) indicate the sections of this document where the module is discussed. In this figure, the processing pipeline is represented vertically in three stages: preprocessing, feature extraction and model learning. Red parallelograms represent the inputs and outputs of each process, from the pair of text snippets to be evaluated, through different intermediate representations (bag-of-words, vectors, etc.), to the final similarity score predictions.
The left side of the figure shows the external resources, each linked to the process that makes use of it.

Paraphrase Simplification
The Paraphrase Database (PPDB) is a list of pairs of words, short phrases, or syntactic rules, where each pair is semantically equivalent to some degree (Pavlick et al., 2015). In PPDB, each paraphrase pair {e1, e2} is obtained from translation models, exploiting the observation that if e1 and e2 are frequently translated to the same word or phrase in a foreign language, then there is a high probability that e1 and e2 are paraphrases of each other. In PPDB 2.0, each pair is labeled with −log P(e1|e2) and −log P(e2|e1) obtained from the translation models, where P(e1|e2) is the inferred probability that e1 is a good paraphrase of e2, and conversely for P(e2|e1). The motivation for using this resource is that text pairs that can be paraphrased into simplified versions should be easier to analyze for downstream modules. For example, consider the paraphrase pair e1 = "interdisciplinary" and e2 = "cross-disciplinary", labeled in PPDB with −log P(e2|e1) = 4.28 and −log P(e1|e2) = 0.82. Now, consider a pair of sentences for STS evaluation where these paraphrases occur: "The study was interdisciplinary." and "Our research is cross-disciplinary." Given that e1 is a higher-scoring paraphrase of e2 than the contrary (i.e. −log P(e2|e1) > −log P(e1|e2)), e2 can be replaced by e1 in the second sentence: "Our research is interdisciplinary". As a result, the pair of sentences now contains more frequent words and shares more words, thereby facilitating the subsequent STS analysis.
Let A and B be a pair of text snippets for STS evaluation and {e1, e2} a pair of paraphrases from PPDB. The pair {e1, e2} occurs in {A, B} if e1 ⊂ A ∧ e2 ⊂ B or e2 ⊂ A ∧ e1 ⊂ B, taking care of the special cases when e1 ⊂ e2 or e2 ⊂ e1 (whole-word matching is used in those cases). The operator "⊂" means that the left argument is a sub-string of the right one. The input sentence pairs for the STS task were preprocessed by looking for occurrences of paraphrases from PPDB and replacing the less probable paraphrase with the more probable one. For that, we used the top-ranked lexical and phrasal paraphrases from the M-size version of PPDB 2.0 1 (syntactic rules were not used). We determined the number of top-ranking lexical and phrasal paraphrases to use experimentally, by running the overall STS system described in this paper trained and tested on STS datasets from previous years. A consistent increase in performance, measured by mean correlation, was observed as the number of paraphrases used increased. The average relative improvement stabilized at around 2% when using 150,000 lexical paraphrases and 3,000,000 phrasal paraphrases. Using these thresholds, we processed the roughly 14 thousand sentence pairs in the training data and found 3,294 occurrences of lexical paraphrases and 1,778 of phrasal paraphrases.
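The replacement rule above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the two-entry `ppdb` table and its scores are stand-ins for real PPDB 2.0 data, and the whole-word special cases mentioned above are omitted for brevity.

```python
# Sketch of the paraphrase-simplification step.
def simplify_pair(a, b, ppdb):
    """For each paraphrase pair occurring across (a, b), replace the
    lower-probability side with the higher-probability one.
    `ppdb` maps (e1, e2) -> (-log P(e1|e2), -log P(e2|e1))."""
    for (e1, e2), (nl_e1_given_e2, nl_e2_given_e1) in ppdb.items():
        # Keep the side with the smaller negative log-probability,
        # i.e. the more probable paraphrase direction.
        keep, drop = (e1, e2) if nl_e1_given_e2 < nl_e2_given_e1 else (e2, e1)
        if drop in a and keep in b:
            a = a.replace(drop, keep)
        if drop in b and keep in a:
            b = b.replace(drop, keep)
    return a, b

# Illustrative PPDB fragment using the scores from the example above.
ppdb = {("interdisciplinary", "cross-disciplinary"): (0.82, 4.28)}
a = "The study was interdisciplinary."
b = "Our research is cross-disciplinary."
print(simplify_pair(a, b, ppdb))
```

After simplification, both sentences contain "interdisciplinary", so they share more words for the downstream similarity functions.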

Tokenizing, Stop-words Removal and Negation Detection
The preprocessing continues by tokenizing sentences, removing stop-words and labeling negated words. For this stage, we use the tokenizer and stopwords list from NLTK 2 augmented with the following words: should, now, 's, 't, 've, something, would and also. Once stop-words are removed from the text, each word preceded by a negation token is labeled as a negated word. The negation tokens we use are: not, n't, nor, null, neither, barely, scarcely, hardly, no, none, nobody, nowhere, nothing, never and without. The negation tagged tokens are used by subsequent modules for modeling oppositeness between negated and non-negated forms (e.g. "not running" and "running").
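The tokenization, stop-word removal and negation-tagging steps can be sketched as below. This is a simplified stand-in for the actual pipeline: the stop-word set here is only a tiny illustrative subset of the augmented NLTK list, and a regular expression replaces the NLTK tokenizer.

```python
import re

# Minimal sketch of the preprocessing stage described above.
STOPWORDS = {"the", "is", "a", "should", "now", "'s", "'t", "'ve",
             "something", "would", "also"}
NEGATIONS = {"not", "n't", "nor", "null", "neither", "barely", "scarcely",
             "hardly", "no", "none", "nobody", "nowhere", "nothing",
             "never", "without"}

def preprocess(text):
    """Tokenize, drop stop-words, and tag each word preceded by a
    negation token as ('word', True)."""
    tokens = re.findall(r"n't|'\w+|\w+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    out, negate_next = [], False
    for t in tokens:
        if t in NEGATIONS:
            negate_next = True      # the negation token itself is consumed
        else:
            out.append((t, negate_next))
            negate_next = False
    return out

print(preprocess("The car is not running"))
# -> [('car', False), ('running', True)]
```

The boolean flag attached to each token is what the lexical similarity wrapper later uses to model oppositeness between negated and non-negated forms.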

Lexical Similarity
The analysis of short texts based on soft cardinality relies only on a similarity function between lexical units (Jimenez et al., 2010). Therefore, the first component of the proposed STS system is a set of four lexical similarity functions that compare a pair of words and yield a numerical value in the [−1, 1] interval. A value of 1 means that the two words can be considered identical, 0 indicates unrelated words, negative values represent opposition, and other values represent intermediate degrees of similarity or opposition. This section describes the four lexical similarity functions we use.

Lexical String Matching Boosted with Synonyms and Antonyms
The NLP community has widely recognized that the use of lemmas or stems, instead of surface words, is desirable in many text processing applications. Therefore, before comparing any pair of words we reduce them to their stems using Porter's algorithm (Porter, 1980). Let x and y be two stemmed words, each represented as a sequence of characters. The first proposed lexical similarity function, S1, replaces this basic representation with the set of character tri-grams and tetra-grams of each word, a representation that was used successfully for addressing the STS task with purely string-based approaches (Jimenez et al., 2012). For example, the word country is stemmed to x = countri, and its [3:4]-gram representation is x = {cou, oun, unt, ntr, tri, coun, ount, untr, ntri}. Once x and y are represented in this way, S1 compares the two n-gram sets with a set-overlap coefficient. The second lexical similarity function, S2, is the well-known Jaro-Winkler similarity (Winkler, 1990):

d_j(x, y) = 1/3 · ( m/len(x) + m/len(y) + (m − t)/m )
d_w(x, y) = d_j(x, y) + lp · p · (1 − d_j(x, y)) if d_j(x, y) > b_t, and d_w(x, y) = d_j(x, y) otherwise

where len(x) is the number of characters in word x, m is the number of matching characters between x and y, t is the number of transpositions between x and y, lp is the length of the common prefix, p = 0.1, and b_t = 0.7 is the "boost threshold". The number of matching characters m counts the common characters between x and y whose occurrences are no farther apart than max(len(x), len(y))/2 − 1 positions. The number of transpositions t is the number of matching characters that occur in a different sequence order in each string. Clearly, m ≤ len(x), m ≤ len(y), and t ≤ m; therefore d(x, y) is defined only in the [0, 1] interval (if m = 0, then d(x, y) is set to 0). As with S1, S2 is used to compare stems instead of words.
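Both string-level similarities can be sketched compactly. Note two assumptions here: the Dice coefficient in `s1` is one plausible concrete choice for the n-gram set-overlap measure (the paper's exact coefficient is not shown in this excerpt), and `s2` implements the standard Jaro-Winkler formulation, in which the transposition count is halved.

```python
def ngrams_3_4(stem):
    """Set of character tri-grams and tetra-grams of a stemmed word."""
    return {stem[i:i + n] for n in (3, 4) for i in range(len(stem) - n + 1)}

def s1(x, y):
    """Set-overlap similarity over [3:4]-grams; the Dice coefficient is an
    assumed concrete choice for the overlap measure."""
    gx, gy = ngrams_3_4(x), ngrams_3_4(y)
    return 2 * len(gx & gy) / (len(gx) + len(gy)) if gx or gy else 0.0

def s2(x, y, p=0.1, boost=0.7):
    """Standard Jaro-Winkler similarity."""
    if x == y:
        return 1.0
    window = max(len(x), len(y)) // 2 - 1  # match window
    mx, my = [False] * len(x), [False] * len(y)
    m = 0
    for i, cx in enumerate(x):
        for j in range(max(0, i - window), min(len(y), i + window + 1)):
            if not my[j] and y[j] == cx:
                mx[i] = my[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    xs = [c for c, f in zip(x, mx) if f]
    ys = [c for c, f in zip(y, my) if f]
    t = sum(a != b for a, b in zip(xs, ys)) // 2  # halved transpositions
    jaro = (m / len(x) + m / len(y) + (m - t) / m) / 3
    if jaro <= boost:
        return jaro
    lp = 0  # length of common prefix, capped at 4 as usual
    while lp < min(4, len(x), len(y)) and x[lp] == y[lp]:
        lp += 1
    return jaro + lp * p * (1 - jaro)

print(len(ngrams_3_4("countri")))        # 9 n-grams, as in the text
print(round(s2("martha", "marhta"), 4))  # 0.9611 (classic JW example)
```

The "countri" example reproduces the nine [3:4]-grams listed above, and the "martha"/"marhta" pair gives the textbook Jaro-Winkler value.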
Both S1 and S2 return 1 when x and y are identical, 0 when x and y have no common characters, and intermediate values otherwise. To a certain extent, this stem string similarity reflects semantic similarity between words. To improve this property, a wrapper function S̃ was built over S1 and S2 to include information from the synonym and antonym relationships in WordNet and the negation feature extracted in the preprocessing stage (see subsection 2.2). If x and y are synonyms in WordNet, the wrapper function S̃ overwrites the result of S1 and S2 with 1 (identical meaning). When x and y are antonyms, the wrapper function should return a negative value to represent the opposition between x and y (opposite meaning). Unlike synonymy and identity, the relation between antonymy and numerical oppositeness is rather unclear, because most antonym pairs are also semantically similar (e.g. small-large) (Mohammad et al., 2008). The natural choice for this negative value is −1 (Yih et al., 2012). However, instead of fixing −1 to represent opposition between two words, we treated this value as a parameter to be determined experimentally. For that, we used the overall STS system described in this paper with the STS datasets from previous years. The value that optimized the mean correlation was −0.2, in a search range from −1 to 1. The negation feature of the words is used to add negation logic to S̃. For example, if x and y are synonyms but x is negated, they are considered antonyms. Some examples: S̃(car, auto) = 1, S̃(¬car, ¬auto) = 1, S̃(¬car, auto) = −0.2, S̃(love, hate) = −0.2, S̃(¬love, ¬hate) = −0.2 and S̃(¬love, hate) = 1 (¬ signifies "negated word"). In the remaining cases, when x and y are neither synonyms nor antonyms, the wrapper function returns S_i(x, y), except when either x or y is negated. In that case, the wrapper function returns 0.26 × S_i(x, y), where 0.26 is a scaling factor for modeling negation, determined experimentally in the same way as the opposition value of −0.2. For example: S1(skater, skateboard) = 0.489 and S̃1(¬skater, skateboard) = 0.489 × 0.26 = 0.127. Henceforth, functions S1 and S2 are assumed to be overwritten by the described wrapper function.
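The wrapper logic above can be sketched as a small decision function. The tiny synonym and antonym sets below are illustrative stand-ins for WordNet lookups, and `base` is a stand-in for the underlying string similarity S_i.

```python
# Sketch of the wrapper S-tilde over a base similarity S_i.
SYNONYMS = {frozenset({"car", "auto"})}
ANTONYMS = {frozenset({"love", "hate"})}
OPPOSITE = -0.2   # experimentally determined opposition value
NEG_SCALE = 0.26  # experimentally determined negation scaling factor

def wrap(base_sim, x, y, x_negated=False, y_negated=False):
    pair = frozenset({x, y})
    # An odd number of negations flips synonymy into antonymy and back.
    flipped = x_negated != y_negated
    if pair in SYNONYMS:
        return OPPOSITE if flipped else 1.0
    if pair in ANTONYMS:
        return 1.0 if flipped else OPPOSITE
    score = base_sim(x, y)
    return NEG_SCALE * score if (x_negated or y_negated) else score

base = lambda x, y: 0.489  # stand-in for S1(skater, skateboard)
print(wrap(base, "car", "auto"))                   # 1.0
print(wrap(base, "love", "hate", x_negated=True))  # 1.0
print(wrap(base, "skater", "skateboard", True))    # 0.489 * 0.26
```

Running the examples from the text through this sketch reproduces all six S̃ values listed above.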

Word Embedding
Two additional lexical similarity functions were built using word-embedding representations. Let x⃗ and y⃗ be the vector representations in R^n of words x and y. The lexical similarity between a pair of words is the cosine of the angle between these vectors:

S3(x, y) = (x⃗ · y⃗) / (‖x⃗‖ ‖y⃗‖)

S3 is computed using the publicly available word embeddings pre-trained on the Google News corpus 3 with the word2vec tool (Mikolov, Tomas et al., 2013). We also include a similar function S4, defined identically to S3 but using embeddings pre-trained on a Twitter corpus 4 with the GloVe tool (Pennington, Jeffrey et al., 2014). These cosine-based similarities produce scores in the range from −1 to 1.
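The cosine computation underlying S3 and S4 is a one-liner; the sketch below uses toy 3-dimensional vectors in place of the 300-dimensional word2vec and 50-dimensional GloVe embeddings.

```python
import numpy as np

def cosine_sim(x_vec, y_vec):
    """Cosine between two embedding vectors; the basis of S3 and S4."""
    return float(np.dot(x_vec, y_vec) /
                 (np.linalg.norm(x_vec) * np.linalg.norm(y_vec)))

a = np.array([1.0, 2.0, 3.0])
print(round(cosine_sim(a, 2 * a), 4))  # parallel vectors -> 1.0
print(cosine_sim(np.array([1.0, 0.0, 0.0]),
                 np.array([0.0, 1.0, 0.0])))  # orthogonal -> 0.0
```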

Polarized Soft Cardinality
Lexical similarity can be leveraged to address sentence similarity by aggregating lexical similarity scores. One successful mechanism for doing this is soft cardinality (Jimenez et al., 2010), a generalization of classic set cardinality that takes similarities between elements into account. The soft cardinality of a bag of words A = {a1, a2, ..., an} (i.e. a sentence), given a similarity function between words S(ai, aj), is defined as:

|A|_S = Σ_{i=1..n} ( Σ_{j=1..n} S(ai, aj)^p )^(−1)

where p is the softness-control parameter, which is positive with default value p = 1. Soft cardinality generalizes classic cardinality because, as p increases, |A|_S approaches |A|. The soft cardinality of the union of two bags of words, |A ∪ B|_S, is simply the soft cardinality of the concatenation of the bags. The soft cardinality of the intersection of two bags is defined as |A ∩ B|_S = |A|_S + |B|_S − |A ∪ B|_S. This model is restricted to positive lexical similarity functions, because negative values could lead to a division by zero if Σ_{j=1..n} S(ai, aj)^p = 0 for some ai. Given that the lexical similarity functions S1 to S4 (described in Section 3) can return negative values for words with opposite semantics, a new soft cardinality model that supports such negative similarities was proposed for this competition. In the polarized soft cardinality model, each inner sum is split into separate positive and negative parts: the functions pos(s, p) and neg(s, p) filter respectively the positive and negative values of s, then raise them to the p-th power ignoring the sign, so that opposite elements add to the cardinality rather than cancelling it. Note that if S(ai, aj) is strictly positive, this model is equivalent to soft cardinality. The new model inserts dummy or "ghost" elements in A when A contains opposite elements. For example, consider A = {auto, love, hate} with S(auto, love) = S(auto, hate) = S(love, hate) = 0. Clearly, |A|_S = 3. However, if S(love, hate) = −1, then |A|_S = 4. This increment in soft cardinality reflects the presence of a dummy element, because love and hate are opposites.
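The idea can be sketched in code. A loud caveat: the exact way the positive and negative parts combine below is an assumption made for illustration (the paper's precise polarized formula is not reproduced in this excerpt); the combination was chosen so that it reduces to classic soft cardinality for positive similarities and reproduces the {auto, love, hate} worked example above.

```python
def polarized_soft_cardinality(words, sim, p=1.0):
    """Sketch of polarized soft cardinality. The (1 + 0.5 * neg) / pos
    combination is an ASSUMED form matching the worked example, not the
    paper's verbatim equation."""
    total = 0.0
    for a in words:
        # pos/neg filter the similarity values by sign, then apply p.
        pos = sum(max(sim(a, b), 0.0) ** p for b in words)
        neg = sum((-min(sim(a, b), 0.0)) ** p for b in words)
        # Positive part generalizes classic soft cardinality; the
        # negative part adds half a "ghost" element per opposition.
        total += (1.0 + 0.5 * neg) / pos
    return total

def sim_neutral(a, b):
    return 1.0 if a == b else 0.0

def sim_opposites(a, b):
    if a == b:
        return 1.0
    return -1.0 if {a, b} == {"love", "hate"} else 0.0

A = ["auto", "love", "hate"]
print(polarized_soft_cardinality(A, sim_neutral))    # 3.0
print(polarized_soft_cardinality(A, sim_opposites))  # 4.0 (one ghost)
```

With all cross-similarities at zero the bag counts three elements; flipping S(love, hate) to −1 adds exactly one ghost element, as described in the text.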
Using the lexical similarity functions presented in Section 3, four soft cardinality functions can be built: |·|_S1, |·|_S2, |·|_S3 and |·|_S4, with softness-control parameter values p_S1 = 1.05, p_S2 = 0.85, p_S3 = 0.5 and p_S4 = 0.65, obtained experimentally using STS data from previous SemEval campaigns. These soft cardinality functions are used to extract numerical features from each pair of sentences to be evaluated (see Section 6).

L 2 -norm Cardinality
Given two sentences A and B represented as bags of words, the proposed L2-norm cardinality measures the amount of information in A, B, A ∩ B and A ∪ B. It is analogous to soft cardinality but uses vector representations of the words and vector operations in its formulation: instead of exploiting pairwise similarities between words as soft cardinality does, L2-norm cardinality uses the vector representations of the words in the "bag" to assess its cardinality. Let A = {a⃗1, a⃗2, ..., a⃗p} and B = {b⃗1, b⃗2, ..., b⃗q} be bags of vector-represented words in R^n, where p and q are the number of words in A and B respectively. First, A and B obtain single representations in R^n by adding up the vectors in their respective bags, i.e. Ā = Σ_{i=1..p} a⃗i and B̄ = Σ_{i=1..q} b⃗i. The L2-norm cardinality is then defined by the following expressions:

|A|_n = ‖Ā‖_2,  |A ∪ B|_n = ‖Ā + B̄‖_2,  |A ∩ B|_n = |A|_n + |B|_n − |A ∪ B|_n

Two L2-norm cardinality functions can be built by reusing the word embeddings used in S3 and S4. Thus, |·|_300 is obtained from the pre-trained word2vec vectors and |·|_50 from the pre-trained GloVe vectors. The L2-norm cardinalities |·|_300 and |·|_50 are added to the four previously proposed soft cardinality functions used for extracting numerical features from sentence pairs.
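The definitions above translate directly to a few lines of numpy. The sketch uses toy 3-dimensional vectors in place of the 300-dimensional word2vec and 50-dimensional GloVe embeddings, and treats the union as bag concatenation, mirroring the soft cardinality convention.

```python
import numpy as np

def l2_cardinality(bag):
    """|A|_n: L2 norm of the sum of the word vectors in the bag."""
    return float(np.linalg.norm(np.sum(bag, axis=0)))

A = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
B = [np.array([1.0, 0.0, 0.0])]

card_a = l2_cardinality(A)
card_b = l2_cardinality(B)
card_union = l2_cardinality(A + B)          # union = bag concatenation
card_inter = card_a + card_b - card_union   # inclusion-exclusion
print(round(card_a, 4), round(card_union, 4), round(card_inter, 4))
```

Note that, unlike classic cardinality, near-parallel word vectors partially cancel into a smaller union, which is what makes the intersection term informative.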

Feature Extraction
The six cardinality functions proposed in Section 4 and Section 5 can be used to build a variety of similarity measures for STS. For example, for sentences A and B, the expression sim(A, B) = |A ∩ B|_S1 / |A ∪ B|_S1 is a possible STS measure based on Jaccard's coefficient. However, the space of possible similarity functions that can be built from the cardinalities |A|_S1, |B|_S1, |A ∪ B|_S1 and |A ∩ B|_S1 is huge. We explore a limited portion of this space by generating similarity expressions from a set of 11 factors (see Table 1). The parameter c in Table 1 is the sub-index identifying the cardinality function, c ∈ {S1, S2, S3, S4, 300, 50}. The set of expressions used for combining these factors is heuristic but motivated by the formulations of existing cardinality-based similarity measures (e.g. Jaccard, Dice, matching and cosine, among others). For each of the six cardinality functions, these factors were combined into ratio expressions, generating a total of 11 × 10 = 110 features of the form f_i/f_j, i ≠ j, per cardinality function (660 features in total).
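The ratio-feature generation can be sketched as follows. The factor names and values below are placeholders standing in for the 11 factors of Table 1 evaluated on one sentence pair under one cardinality function.

```python
from itertools import permutations

def ratio_features(factors):
    """All ordered ratios f_i / f_j with i != j (11 * 10 = 110 features)."""
    return {f"{ni}/{nj}": (vi / vj if vj != 0 else 0.0)
            for (ni, vi), (nj, vj) in permutations(factors.items(), 2)}

# Hypothetical factor values, e.g. f1 = |A|, f2 = |B|, f3 = |A ∩ B|, ...
factors = {f"f{i}": float(i) for i in range(1, 12)}
features = ratio_features(factors)
print(len(features))   # 110
```

Repeating this for each of the six cardinality functions yields the full 660-feature set mentioned in the abstract.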

Feature Selection
The feature selection process consists of selecting the k best features from the set of 110 features for each cardinality function. We used the SelectKBest 5 method from the Scikit-learn machine learning toolkit (Pedregosa, Fabian et al., 2011). The data used for this selection process was the concatenation of 20 gold-standard STS datasets from the past SemEval campaigns from 2012 to 2015 (14,437 sentence pairs with gold-standard annotations). The process was performed with 10-fold cross-validation, repeating the selection ten times with different randomly selected fold partitions. The k features selected most often in the k-best selection across all runs were retained for the final model. Preliminary experiments with the overall STS system on the same data suggested that a good value for k is two. Table 2 shows the selected features for each cardinality function and their results on the mean-correlation performance measure, as assessed under previous STS shared-task evaluation settings. Although none of the features outperformed the best official results, the results of the |·|_S1 and |·|_300 cardinality functions are highly competitive. It is important to note that, although the feature selection procedure is supervised, the selected features by themselves are inherently unsupervised.
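The k-best idea can be sketched without external dependencies. The system itself used Scikit-learn's SelectKBest; here a univariate |Pearson correlation| score stands in for its scoring function, and the synthetic data is illustrative only.

```python
import numpy as np

def select_k_best(X, y, k=2):
    """Return indices of the k features most correlated with the target."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[-k:].tolist())

# Synthetic data: only features 1 and 4 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.1, size=200)
print(select_k_best(X, y, k=2))   # -> [1, 4]
```

In the real pipeline this selection is repeated over ten random fold partitions, and the most frequently chosen features are kept.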

Feature Combination
The 12 selected features shown in Table 2 were combined to produce the predictions for our three participating runs. Run1 and run2 were SVR (support vector regression) models with an RBF kernel (Drucker, Harris et al., 1997), built with the 14,437 sentence pairs available for training. The difference between run1 and run2 lies in the values of the SVR parameters C and γ. For run1, we used C = 0.6 and γ = 0.004, obtained by optimizing the weighted average of Pearson correlations using each available STS dataset in turn for testing and the remaining pairs for training. For run2, we used C = 53 and γ = 0.012, obtained using all 14,437 sentence pairs as a single dataset with 5-fold cross-validation over 5 randomly selected fold partitions.
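The run1 model class can be sketched with Scikit-learn. The synthetic feature matrix and gold scores below are stand-ins for the real 12 selected features and the [0, 5] similarity annotations; only the model class and the run1 parameter values come from the text.

```python
import numpy as np
from sklearn.svm import SVR

# Stand-in data: 300 training pairs, 12 selected features, scores in [0, 5].
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(300, 12))
y_train = X_train.mean(axis=1) * 5.0

# RBF-kernel SVR with the run1 parameter values reported above.
model = SVR(kernel="rbf", C=0.6, gamma=0.004)
model.fit(X_train, y_train)

X_test = rng.uniform(size=(10, 12))
preds = model.predict(X_test)
print(preds.shape)   # one similarity prediction per test pair
```

Run2 uses the same model class with C = 53 and γ = 0.012; run3 bypasses the regressor entirely and averages the features, as described next.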
Unlike run1 and run2, run3 was effectively unsupervised: it simply averages the 12 feature values after multiplying by −1 the values of the features with negative correlations in Table 2.


Cross-lingual Sub-task
The predictions of the three runs submitted to the STS cross-lingual sub-task were produced with the same systems that produced the predictions for the monolingual (English) sub-task. For that, the texts in Spanish were translated into English using Google's public translation service. 6


Results
Table 3 shows the results obtained with the same systems used to produce runs 1, 2 and 3, but using datasets from the STS competitions from 2012 to 2015. The results labeled "held-out" were obtained by holding out each dataset for testing and using the remaining datasets for training, including the three training datasets from 2012. The results labeled "same-data" were obtained using the same training data available during each historical STS evaluation. All "held-out" systems consistently outperformed the best official results obtained by a single system in each year, measured by weighted mean correlation. Among the "same-data" systems, run1 outperformed the historical official results from 2013 to 2015, while run2 and run3 did so only for 2013 and 2014. Table 4 shows the results of the same systems (in the "held-out" testing setting) and the best official results obtained on each of the 20 individual evaluation datasets from prior STS competitions. In this tougher comparison, the proposed systems obtained state-of-the-art results in 10 of the 20 individual datasets, and competitive results for the majority of the remaining ones. Finally, Table 5 and Table 6 show the results obtained by our systems on the 2016 datasets along with the best official results of the competition. Comparing our three runs, none of them was consistently better, with the three typically obtaining similar results. Therefore, it is possible to conclude that the contribution of SVR was not considerable, with the exception of the answer-answer dataset.

Conclusion
The proposed STS system effectively combined the most popular resources used by the top systems in the SemEval 2015 STS shared task. Results show that the proposed system outperformed all past systems in a per-system comparison, and obtained state-of-the-art results in half of the datasets from past STS competitions at SemEval.

Table 6: Official results for our participating systems in the cross-lingual sub-task (English/Spanish).