SemantiKLUE: Semantic Textual Similarity with Maximum Weight Matching

This paper describes the SemantiKLUE system (Proisl et al., 2014) used for the SemEval-2015 shared task on Semantic Textual Similarity (STS) for English. The system was developed for SemEval-2013 and extended for SemEval-2014, where it participated in three tasks and ranked 13th out of 38 submissions for the English STS task. While this year's submission ranks 46th out of 73, further experiments on the selection of training data led to notable improvements, showing that the system could have achieved rank 22 out of 73. We report a detailed analysis of those training selection experiments, in which we tested different combinations of all the available STS datasets, as well as the results of a qualitative analysis conducted on a sample of the sentence pairs for which SemantiKLUE gave wrong STS predictions.


Introduction
The SemEval-2015 task on "Semantic Textual Similarity for English" (Agirre et al., 2015) is a rerun of the corresponding task from SemEval-2014 with new test data and updated categories. The predictions of participating systems were evaluated against manually annotated and subsequently filtered data. STS was measured on a scale ranging from 0 (no similarity at all) to 5 (total equivalence). SemantiKLUE, developed in 2014, uses a distributional bag-of-words model as well as a word-to-word alignment for each pair of sentences based on a maximum weight matching algorithm.
Our SemEval-2015 submission for all 5 test categories (headlines, images, belief, answers-forums, answers-students) was based on the training data set from 2014, with 2234 sentence pairs from 3 categories: paraphrase sentence pairs (MSRpar), sentence pairs from video descriptions (MSRvid) and MT evaluation sentence pairs (SMTeuroparl). Follow-up experiments conducted after the submission deadline showed that this training configuration was far from optimal and that our system would have benefited considerably from better training data, as we managed to significantly improve the overall scores. With the best training configuration, SemantiKLUE would have ranked 22nd out of 73 submissions (11th out of 28 teams), with a weighted mean of Pearson correlation coefficients over all test categories of 0.7508 (best system: 0.8015). In the following sections, we first give a short overview of the system (Section 2), and then we describe the follow-up experiments that allowed us to define the best training data set in terms of its subsets (Section 3); finally, we present the results of a qualitative analysis of the performance of our system (Section 4).
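The weighted mean of Pearson correlation coefficients mentioned above can be illustrated with a minimal sketch. It assumes that each test category is weighted by its number of sentence pairs; the function names and the toy scores are invented for illustration and are not taken from the task data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def weighted_mean_pearson(categories):
    """Combine per-category correlations into one score.

    `categories` maps a category name to (gold_scores, system_scores).
    Assumption: each category is weighted by its number of sentence pairs.
    """
    total_pairs = sum(len(gold) for gold, _ in categories.values())
    return sum(
        len(gold) / total_pairs * pearson(gold, predicted)
        for gold, predicted in categories.values()
    )


# Toy scores, invented for illustration only.
toy_categories = {
    "headlines": ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]),  # perfectly correlated
    "images": ([0.0, 5.0], [5.0, 0.0]),  # perfectly anti-correlated
}
score = weighted_mean_pearson(toy_categories)
```

In this toy setting the larger category contributes twice the weight of the smaller one, so a poor correlation on a small category lowers the overall score less than the same correlation on a large category would.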

System Description
SemantiKLUE combines supervised and unsupervised approaches for the computation of textual similarity: a number of similarity measures are computed and passed to a support vector regression learner, which is trained on the available training data and test sets of previous years. The learnt weights are then used to generate semantic similarity scores for the test data in the desired range.
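The regression step can be sketched as follows, using scikit-learn's support vector regressor with the RBF kernel and penalty C = 0.7 reported for the system. This is an illustration under assumptions, not the system's actual code: the feature matrix is random toy data standing in for the 39 similarity measures, the gold scores are synthetic, and clipping the predictions to [0, 5] is one possible way of keeping scores in the desired range. scikit-learn's SVR uses its `degree` parameter only for the polynomial kernel, so the sketch does not set it.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: 39 random features per sentence pair stand in for the real
# similarity measures; the gold scores are synthetic as well.
rng = np.random.default_rng(0)
X_train = rng.random((200, 39))
y_train = 5.0 * X_train.mean(axis=1)
X_test = rng.random((20, 39))

# RBF kernel and C = 0.7 follow the setting reported in the paper.
model = SVR(kernel="rbf", C=0.7)
model.fit(X_train, y_train)

# Assumption: out-of-range predictions are clipped to the STS scale [0, 5].
predictions = np.clip(model.predict(X_test), 0.0, 5.0)
```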

Training Data and Preprocessing
The system was trained on manually annotated sentence pairs from the STS task at SemEval 2014. All sentence pairs were preprocessed with Stanford CoreNLP for part-of-speech annotation and lemmatization. Each sentence was represented as a graph using the CCprocessed variant of the Stanford Dependencies (collapsed dependencies with propagation of conjunct dependencies), implemented with the NetworkX module. This graph representation was involved in the computation of all 39 similarity measures for words and tokens in each sentence. Prepositions, articles and conjunctions, as well as auxiliary verbs like be and have, were ignored in the computation of token-based measures.
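A sentence graph of this kind can be sketched with NetworkX as follows. The tokens and collapsed dependencies are written by hand for one example sentence and are not actual CoreNLP output. The attribute names, the set of ignored part-of-speech tags and the lemma-based treatment of auxiliaries are assumptions made for illustration.

```python
import networkx as nx

# Hand-written annotation for "A white dog runs across the water":
# (token index, word form, lemma, part-of-speech tag).
tokens = [
    (1, "A", "a", "DT"), (2, "white", "white", "JJ"), (3, "dog", "dog", "NN"),
    (4, "runs", "run", "VBZ"), (5, "across", "across", "IN"),
    (6, "the", "the", "DT"), (7, "water", "water", "NN"),
]
# Hand-written collapsed dependencies: (head index, dependent index, relation).
dependencies = [
    (3, 1, "det"), (3, 2, "amod"), (4, 3, "nsubj"),
    (7, 6, "det"), (4, 7, "prep_across"),
]

graph = nx.DiGraph()
for index, form, lemma, pos in tokens:
    graph.add_node(index, form=form, lemma=lemma, pos=pos)
for head, dependent, relation in dependencies:
    graph.add_edge(head, dependent, relation=relation)

# Assumption: function words are identified by Penn Treebank tag, auxiliaries
# by lemma. A real system would have to tell auxiliary uses of "be" and "have"
# apart from main-verb uses.
IGNORED_POS = {"IN", "DT", "CC"}
AUXILIARY_LEMMAS = {"be", "have"}


def content_lemmas(sentence_graph):
    """Return the lemmas that enter token-based measures, in sentence order."""
    return [
        data["lemma"]
        for _, data in sorted(sentence_graph.nodes(data=True), key=lambda node: node[0])
        if data["pos"] not in IGNORED_POS and data["lemma"] not in AUXILIARY_LEMMAS
    ]


lemmas = content_lemmas(graph)
```

Note that the preposition "across" remains a node in this sketch but is filtered out of the token list, while its meaning survives in the collapsed relation label `prep_across`.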

Similarity Measures: Overview
A detailed description of all 39 similarity measures used as features in SemantiKLUE is provided in Proisl et al. (2014, Sections 2.2-2.7). Similarity measures used by our system include:
• Heuristic similarity measures: word form overlap and lemma overlap between two texts computed with the Jaccard coefficient; difference in text length as used by Gale and Church (1993); a binary feature to treat negation in each sentence pair.
• Document similarity measures based on two distributional models: a model based on non-lemmatized information, built from the second release of the Google Books N-Grams database (Lin et al., 2012); a lemmatized model, built from a 10-billion-word Web corpus.
• Alignment-based measures: one-to-one alignment and one-to-many alignment for both words and lemmata, computed via maximum weight matching, with the cosine similarity between two words in paired sentences as edge weight. Figure 1 visualizes a one-to-many alignment based on lemmatized data. The colors of the connections correspond to different cosine ranges, reported in the legend to the right of the plot.
• WordNet-based similarity measures: Leacock and Chodorow's (1998) normalized path length and Lin's (1998) universal similarity measure. Using these similarity measures, the best one-to-one and the best one-to-many alignment are computed. After that, the arithmetic mean of the similarities between the aligned words from text A and text B, with and without identical word pairs, is calculated. An additional WordNet-based feature is the number of unknown words in both texts.
• Dependency-based heuristic measures: overlap of dependency relation labels between the two texts; arithmetic mean of the similarities between the best aligned one-to-one dependency relations based on Leacock and Chodorow's normalized path lengths; average overlap of neighbors for all aligned word pairs based on a one-to-one alignment created with similarity scores from the lemma-based DSM.
• Experimental features: cosine similarities for each pair of sentences; average neighbor rank based on the rank of text A among the nearest neighbors from text B and vice versa.
Figure 1: One-to-many alignment plot. Sentences: "A black and white dog is jumping into the water", "A white dog runs across the water"; Subset: Images; Gold Score: 2.8; SemantiKLUE score: 2.93.
The feature set described above was processed by the support vector regressor implemented in the scikit-learn library (Pedregosa et al., 2011; http://scikit-learn.org/). All the experiments presented in this paper rely on the best support vector setting identified by Proisl et al. (2014), namely: RBF kernel of degree 2 and penalty C = 0.7. In what follows, we describe the procedures adopted to adjust training data and find the best training configurations.
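The one-to-one alignment via maximum weight matching listed among the alignment-based measures can be sketched as follows. The sketch relies on NetworkX's `max_weight_matching`; whether the system uses this particular implementation is an assumption. The five-dimensional word vectors are invented toy values standing in for a distributional model, and the mean similarity computed at the end is only one conceivable way of turning an alignment into a feature.

```python
import math
import networkx as nx


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_to_one_alignment(words_a, words_b, vectors):
    """Align words of text A with words of text B by maximum weight matching."""
    graph = nx.Graph()
    # Nodes are (side, position) pairs so that repeated words stay distinct.
    for i, word_a in enumerate(words_a):
        for j, word_b in enumerate(words_b):
            weight = cosine(vectors[word_a], vectors[word_b])
            if weight > 0:
                graph.add_edge(("A", i), ("B", j), weight=weight)
    pairs = []
    for u, v in nx.max_weight_matching(graph):
        if u[0] == "B":  # matched pairs come back in arbitrary order
            u, v = v, u
        pairs.append((words_a[u[1]], words_b[v[1]], graph[u][v]["weight"]))
    return sorted(pairs)


# Toy vectors, invented for illustration; not taken from a real model.
toy_vectors = {
    "dog": (1, 0, 0, 0, 0), "water": (0, 1, 0, 0, 0), "white": (0, 0, 1, 0, 0),
    "black": (0, 0, 0.8, 0, 0.6), "jump": (0, 0, 0, 1, 0.5), "run": (0, 0, 0, 1, 0),
}
text_a = ["black", "white", "dog", "jump", "water"]
text_b = ["white", "dog", "run", "water"]

alignment = one_to_one_alignment(text_a, text_b, toy_vectors)
mean_similarity = sum(weight for _, _, weight in alignment) / len(alignment)
```

With these toy vectors, "black" stays unaligned: its only plausible partner, "white", is matched to the identical word in the other text, which yields a higher total weight.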

Experiments
This section describes all post-hoc experiments on the STS 2015 test data performed to improve the predictions of the system. The abbreviations used in the following tables reporting experiment results are listed in Table 1. All 39 similarity measures were used by the regression learner to train the system. SemantiKLUE was tested on different training data with various combinations of training and test sets from 2013 and 2014. Results for the submitted system are typeset in italics in Table 2; the best results in each column are typeset in bold font.
The best results would have been obtained by training on the MSR data from SemEval 2014 for all test sets. Considerable improvements can be achieved by removing the SMTeuroparl category from the training set. This category consists of MT sentence pairs; excluding it would have given the system rank 37 (weighted mean of .7148) instead of 46 (.6717) out of 73 submissions.
We turned the test data from SemEval 2014 into a training set for the 2015 test data (see Table 3). The figures in Table 3 [...] conducted more fine-grained experiments to look for the best combination of training data for the 2015 test sets. We combined training and test data of SemEval 2014 with the best training categories of SemEval 2013 (see Table 4) to test the performance of the system on the optimal training subset defined for SemEval 2014. That optimal training configuration consists of the FNWN, headlines, MSR and OnWN data sets: the corresponding performance is typeset in italics. Comparable or even better results can be achieved with a combination of test and training categories of SemEval 2014 only. Thus, combining the training category MSR (mp + mv) with another test category of 2014 (such as tweets or headlines) results in an improvement of about 1.5%-2%. A more precise investigation helped us to find the best combination, consisting of the MSR, headlines, images, and tweet-news categories. This brought our system to a weighted mean of .7508, corresponding to the 11th place out of 28 teams. We tried to further improve these results by adding the optimal training categories found in 2014, extending the best training set defined for 2015 with FNWN (mp+mv+hl+img+tn+fn), but this led to slightly worse results in all test categories.
A further set of experiments was aimed at testing different subsets of similarity measures used at the machine learning stage. Results showed that the use of fewer similarity features (exclusion of all identical words in each pair of sentences from the calculation of similarity scores) resulted in worse performance of the whole system. Our system is based on a relatively large feature set, but we were also interested in discovering how well SemantiKLUE would have performed if trained on a single feature. We tested a feature based on the cosine similarity between the two centroid vectors as a measure of semantic similarity for each sentence pair, as suggested by Schütze (1998), using either tokens or lemmas (see Table 5). We selected the cosine between centroid vectors as a candidate feature because it is the most intuitive one and naturally connects to the representation of topical information, which is crucial in capturing textual similarity.
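The single-feature setup can be illustrated with a minimal sketch of the cosine between centroid vectors. It assumes that a sentence centroid is the unweighted mean of the vectors of its known words and that unknown words are skipped. The three-dimensional vectors are invented toy values, not taken from either distributional model.

```python
import math


def centroid(words, vectors):
    """Unweighted mean of the vectors of all known words in a sentence."""
    dimensions = len(next(iter(vectors.values())))
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return [0.0] * dimensions
    return [sum(vec[d] for vec in known) / len(known) for d in range(dimensions)]


def centroid_cosine(words_a, words_b, vectors):
    """Cosine similarity between the centroid vectors of two sentences."""
    a = centroid(words_a, vectors)
    b = centroid(words_b, vectors)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors, invented for illustration only.
toy_vectors = {"dog": (1.0, 0.0, 0.0), "water": (0.0, 1.0, 0.0), "white": (0.0, 0.0, 1.0)}

same_content = centroid_cosine(["white", "dog"], ["dog", "white", "runs"], toy_vectors)
different_content = centroid_cosine(["white", "dog"], ["water"], toy_vectors)
```

Because the centroid discards word order and alignment information, two sentences with the same content words receive the maximal score regardless of how those words are arranged.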
We found that, regardless of the alignment (one-to-one or one-to-many, for both lemmas and tokens), the weighted mean of Pearson correlation coefficients for the cosine similarity calculated with lemma-based centroid vectors is low (.6904 for the one-to-one alignment). It is still higher, however, than the score achieved in our submission by the more complex system, which combined the large feature set with a poor training set (mp+mv+smt; .6717; see Table 4 for comparison). As we were interested in identifying the most balanced training sets among the test categories of 2014, we tested all categories against each other. Results are shown in Table 6: the rows of the table correspond to test subsets, while columns represent training sets. The results typeset in italics show that there is a high level of overtraining in the cases in which training and test data are identical. The most balanced and robust test data are those of the image and OnWN categories: they can be used as training data for future experiments.
To sum up, our results show that the best training configuration for SemEval 2015 involves the MSR, headlines, images, and tweet-news categories (see Table 4). The scatter plots in Figures 2 to 4 relate the similarity score in the gold standard (x-axis) to the relatedness score produced by SemantiKLUE (y-axis) in its best training configuration, for three of

Qualitative Analysis
In this section, we report the results of a qualitative analysis conducted on sentence pairs for which SemantiKLUE, in the optimal training configuration identified in Section 3, made wrong predictions. Our goal was to identify a taxonomy of SemantiKLUE's problems. Broadly speaking, there are two possibilities for SemantiKLUE to make a wrong similarity guess: the system can overestimate the similarity between the two sentences (thus generating a relatedness score higher than the speakers' judgments) or it can underestimate similarity (generating a score lower than the gold standard). In the process of interpretation and classification, we relied on the inspection of alignment plots (cf. Figure 1) and on our knowledge of the dynamics of the features within SemantiKLUE.
The analysis was conducted manually on a selected sample of sentence pairs from the test data. We selected sentences for which the absolute difference between the similarity score in the gold standard and the relatedness score produced by SemantiKLUE was between 1.5 and 2.5 points. That range was identified by inspecting the distribution of gold standard/relatedness score differences in the five subsets (corresponding plots are not shown here for reasons of space). Within this range, we randomly picked 40 items (sentence pairs) per subset: 20 with positive difference (underestimation) and 20 with negative difference (overestimation).
Let us start with the cases in which SemantiKLUE overestimated STS. We list the identified mistake categories, providing a short description for the cases in which the label is not self-explanatory, and report the percentage of affected sentences. Each item can be affected by more than one mistake type.
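The sampling procedure described above can be sketched as follows. The sketch assumes that the difference is computed as gold score minus system score and that the bounds of the 1.5 to 2.5 range are inclusive. The function name, the fixed random seed and the toy items are invented for illustration.

```python
import random


def sample_errors(items, low=1.5, high=2.5, per_direction=20, seed=0):
    """Pick sentence pairs whose prediction error falls within [low, high].

    `items` is a list of (pair_id, gold_score, system_score) tuples.
    Returns (underestimated, overestimated) samples, where the difference
    is taken as gold score minus system score (an assumption).
    """
    rng = random.Random(seed)
    under = [item for item in items if low <= item[1] - item[2] <= high]
    over = [item for item in items if low <= item[2] - item[1] <= high]
    return (
        rng.sample(under, min(per_direction, len(under))),
        rng.sample(over, min(per_direction, len(over))),
    )


# Toy items, invented for illustration only.
toy_items = [
    ("p1", 4.0, 2.0),  # gold exceeds prediction by 2.0: underestimation
    ("p2", 1.0, 3.0),  # prediction exceeds gold by 2.0: overestimation
    ("p3", 3.0, 2.8),  # error too small to be selected
    ("p4", 5.0, 1.0),  # error too large to be selected
]
underestimated, overestimated = sample_errors(toy_items)
```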
• One or two words (often very frequent and with generic meaning) dominate the alignment, or one sentence is practically a subset of the other: 56% of the items.
• Wrong alignments: 25% of the items.
• Modification: presence of identical modifiers with different heads boosts overall similarity. This mistake type affects 7% of the cases.
• Same frame, different participants: the sentences depict the same event, but the participants (or the background) determine a significant difference in meaning that our system fails to capture. This problem affects 8% of the items.
• Same participants, different frames: 11% of the items.
• Negation: 10% of the items.
• (Near) Antonyms: 8% of the items.
• Proper Names: 18% of the items.
• Amounts: when building the alignment, SemantiKLUE ignores numerical values, which are in some cases crucial in determining (dis)similarities between sentences that are otherwise nearly identical (e.g., "2 people killed.." vs. "100 people killed"). This problem affects 18% of the items.
We now proceed to cases of underestimation, for which we identified the following mistake types:
• Collocations (e.g., "heads up", "make sense") negatively affect the alignment process: SemantiKLUE would have performed better if multiwords had entered the alignment process as a whole, and not as individual edges. This mistake type affects 10% of the items.
• Crucial alignments missing or weaker than expected: 17% of the items.
• The similarity between the sentences is due to logical form, compositionality or world knowledge. This problem affects 16% of the items.
• Different register makes alignment problematic, even if the sentences are content-wise similar: 12% of the items.
• Displacement of different pieces of information between two sentences otherwise centered on the same topic makes them less similar for SemantiKLUE than for the raters: 28% of the items.
• Spelling mistakes prevent otherwise straightforward alignments: 10% of the items.
• Difficult cases, for which the alignment would simply suggest a score higher than the one predicted by the regressor. Such cases (15%) require further investigation.

Conclusion
In this paper, we presented the results of our evaluation experiments on the performance of the SemantiKLUE system (Proisl et al., 2014) on the SemEval-2015 STS task. Our experiments showed that the performance of our system is heavily dependent on the choice of the training set, as we managed to significantly improve the performance of our system with respect to the original submission. The qualitative evaluation sketched in Section 4 provided interesting insights into specific features of the STS data, and it allowed us to identify some idiosyncrasies (e.g., the behavior of the system in case of alignment of identical words) and weaknesses (e.g., the treatment of multiwords in the process of alignment) that we are already working on improving.