FCICU: The Integration between Sense-Based Kernel and Surface-Based Methods to Measure Semantic Textual Similarity

This paper describes the FCICU team's participation in the SemEval 2015 Semantic Textual Similarity challenge. Our main contribution is a word-sense similarity method using BabelNet relationships. In the English subtask, we submitted three systems (runs) to assess the proposed method. In Run1, we used our proposed method coupled with a string kernel mapping function to calculate the textual similarity. In Run2, we used the method with a tree kernel function. In Run3, we averaged Run1 with a previously proposed surface-based approach as a form of integration. The three runs ranked 41st, 57th, and 20th of 73 systems, with mean correlations of 0.702, 0.597, and 0.759 respectively. For the interpretable subtask, we submitted a modified version of Run1 that achieved mean F1 scores of 0.846, 0.461, 0.722, and 0.44 for alignment, type, score, and score with type respectively.


Introduction
Semantic Textual Similarity (STS) is the task of measuring the similarity between two text snippets according to their meaning. Humans have an intrinsic ability to recognize the degree of similarity and difference between texts. Simulating this process of human judgment in computers is still an extremely difficult task and has recently drawn much attention. STS is important because a wide range of NLP applications, such as information retrieval, question answering, and machine translation, rely heavily on it. This paper describes the STS systems with which we participated in two subtasks of the STS task (Task 2) at SemEval 2015, namely English STS and Interpretable STS. The former calculates a graded similarity score from 0 to 5 between two sentences (with 5 being the most similar), while the latter is a pilot subtask that requires aligning the chunks of two sentences, describing what kind of relation exists between each pair of chunks, and assigning a similarity score to each pair of chunks (Agirre et al., 2015).
The sense or meaning of natural language text can be inferred from several linguistic concepts, including lexical, syntactic, and semantic knowledge of the language. Our approach employs those aspects to calculate the similarity between the senses of text constituents (phrases or words), relying mainly on BabelNet senses. The similarity between two text snippets is first calculated using kernel functions, which map a text snippet to the feature space based on a proposed word sense similarity method. In addition, the resulting sense-based similarity score is combined with a surface-based similarity score to study the effect of their integration on the STS task.
The paper is organized as follows. Section 2 explains our proposed word sense similarity method. Section 3 describes the proposed systems. Section 4 presents the experiments conducted and analyzes the results achieved. Section 5 concludes the paper and suggests some future directions.

The proposed Word-Sense Similarity (WSS) Method
Several semantic textual similarity (STS) methods have been proposed in the literature. Sense-based methods are well suited to cases where different words are used to convey the same meaning in different texts (Pilehvar et al., 2013). Surface-based methods mostly fail to identify similarity between texts with maximal semantic overlap but minimal lexical overlap. We present a sense-based STS approach that produces a similarity score between texts by means of a kernel function (Shawe-Taylor and Cristianini, 2004). Then, we integrate the sense-based approach with the surface-based soft cardinality approach presented in (Jimenez et al., 2012) to demonstrate that sense-based and surface-based similarity methods are complementary to each other. The design of our kernel function relies on the hypothesis that the greater the similarity of word senses between two texts, the higher their semantic equivalence will be. Accordingly, our kernel maps a text to the feature space using a similarity measure between word senses. We propose a WSS measure that computes the similarity score between two word senses ($ws_i$, $ws_j$) using the arithmetic mean of two measures: Semantic Distance ($sim_D$) and Contextual Similarity ($sim_C$). That is:

$$WSS(ws_i, ws_j) = \frac{sim_D(ws_i, ws_j) + sim_C(ws_i, ws_j)}{2} \qquad (1)$$

Semantic Distance
This measure computes the similarity between word senses based on the distance between them in a multilingual semantic network named BabelNet (Navigli and Ponzetto, 2010). BabelNet is a rich semantic knowledge resource that covers a wide range of concepts and named entities connected by large numbers of semantic relations. Concepts and relations are gathered from WordNet (Miller, 1995) and Wikipedia. The semantic knowledge is encoded as a labeled directed graph, where vertices are BabelNet senses (concepts), and edges connect pairs of senses with a label indicating the type of the semantic relation between them. Our semantic distance measure is a function of two similarity scores: $sim_{Bn}$ and $sim_{NBn}$.
The first score ($sim_{Bn}$) is based on the distance between two word senses, $ws_i$ and $ws_j$: the shorter the distance between them, the more semantically related they are. It is computed from $len(ws_i, ws_j)$ and $Maxlen$, where $Maxlen$ is the maximum path length connecting two senses in BabelNet, and $len(ws_i, ws_j)$ is the length of the shortest path between the two senses in BabelNet in both directions, i.e. $ws_i \rightarrow ws_j$ and $ws_j \rightarrow ws_i$. The shortest path is calculated using Dijkstra's algorithm.

The second score ($sim_{NBn}$) represents the degree of similarity between the neighbors of $ws_i$ and the neighbors of $ws_j$, which influences the degree of similarity between the two senses. Hence, $sim_{NBn}$ is calculated by taking the arithmetic mean of the similarities of all neighbor pairs. That is:

$$sim_{NBn}(ws_i, ws_j) = \frac{1}{n_i \, n_j} \sum_{ws_k \in NS_i} \sum_{ws_l \in NS_j} sim_{WuP}(ws_k, ws_l)$$

where $NS_i$ and $NS_j$ are the sets of the most semantically related senses directly connected to $ws_i$ and $ws_j$ respectively in BabelNet; $n_i = |NS_i|$ and $n_j = |NS_j|$; and $sim_{WuP}(ws_k, ws_l)$ is the Wu and Palmer similarity measure (Wu and Palmer, 1994).
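The two ingredients described above, the shortest path length taken in both directions and the mean similarity over neighbor pairs, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the function names, the unit edge weights, and the choice to take the shorter of the two directed paths are all assumptions made for illustration, and the pair similarity is passed in as a parameter rather than computed with the Wu and Palmer measure.

```python
import heapq
from itertools import product


def shortest_path_len(graph, src, dst):
    """Dijkstra's algorithm with unit edge weights (assumption).

    Returns the number of edges on the shortest directed path src -> dst,
    or None if dst cannot be reached from src.
    """
    best = {src: 0}
    heap = [(0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            return dist
        if dist > best[node]:
            continue  # stale queue entry
        for neighbor in graph.get(node, ()):
            candidate = dist + 1
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return None


def path_len(graph, ws_i, ws_j):
    """Shortest path considering both directions (assumed: take the shorter)."""
    lengths = [shortest_path_len(graph, ws_i, ws_j),
               shortest_path_len(graph, ws_j, ws_i)]
    reachable = [n for n in lengths if n is not None]
    return min(reachable) if reachable else None


def neighbor_similarity(ns_i, ns_j, pair_sim):
    """Arithmetic mean of pair_sim over all neighbor pairs (NS_i x NS_j)."""
    if not ns_i or not ns_j:
        return 0.0
    total = sum(pair_sim(ws_k, ws_l) for ws_k, ws_l in product(ns_i, ns_j))
    return total / (len(ns_i) * len(ns_j))


# Hypothetical toy graph standing in for BabelNet: sense -> connected senses.
TOY_GRAPH = {
    "dog": ["canine"],
    "canine": ["animal"],
    "cat": ["feline"],
    "feline": ["animal"],
    "animal": [],
}
```

In this toy graph, `path_len(TOY_GRAPH, "animal", "dog")` finds the path through the reverse direction, while `path_len(TOY_GRAPH, "dog", "cat")` returns `None` because neither sense can reach the other along directed edges.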
The values of the two scores presented above determine how the semantic distance measure ($sim_D$) is calculated for a word-sense pair ($ws_i$, $ws_j$). When both scores are zero, $sim_D$ is simply equal to the Wu and Palmer similarity measure, i.e. $sim_D(ws_i, ws_j) = sim_{WuP}(ws_i, ws_j)$. In general, for non-zero similarity scores, $sim_D$ is calculated as the arithmetic mean of the two scores.
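The case analysis above can be written as a short Python sketch. The function name and argument names are hypothetical, and the text does not say what happens when exactly one of the two scores is zero; this sketch assumes that case also falls back to the arithmetic mean.

```python
def semantic_distance(sim_bn, sim_nbn, sim_wup):
    """Combine the two BabelNet-based scores into sim_D.

    sim_bn  : path-based score between the two senses
    sim_nbn : neighbor-based score between the two senses
    sim_wup : Wu and Palmer similarity, used only as a fallback
    """
    if sim_bn == 0 and sim_nbn == 0:
        # Both scores are zero: fall back to Wu and Palmer similarity.
        return sim_wup
    # Assumption: any other combination uses the arithmetic mean.
    return (sim_bn + sim_nbn) / 2
```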

Contextual Similarity
This measure calculates the similarity between the word-sense pair ($ws_i$, $ws_j$) based on the overlap between their contexts derived from a corpus. The overlap coefficient used is the Jaccard coefficient. That is:

$$sim_C(ws_i, ws_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$$

where $C_i$ is the set of: 1) all the word senses that co-occur with $ws_i$ in the corpus, and 2) all senses directly connected to $ws_i$ in BabelNet; $C_j$ is defined similarly.
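As a small illustration, the Jaccard overlap of two context sets and the final averaging step of the WSS measure can be sketched in Python as follows. The function names are hypothetical, the context sets are assumed to be already built, and returning 0 for two empty contexts is a choice made here rather than something the text specifies.

```python
def contextual_similarity(context_i, context_j):
    """Jaccard coefficient between two context sets of word senses."""
    union = context_i | context_j
    if not union:
        return 0.0  # assumption: two empty contexts share nothing
    return len(context_i & context_j) / len(union)


def wss(sim_d, sim_c):
    """Word-sense similarity: arithmetic mean of sim_D and sim_C."""
    return (sim_d + sim_c) / 2
```

For example, the contexts `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct senses, so their contextual similarity is 0.5.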

Text Preprocessing
The given input sentences are first preprocessed to map the raw natural language text into a structured, annotated representation. This process includes several tasks: tokenization, lemmatization, Part-of-Speech tagging, and word-sense tagging. All tasks except word-sense tagging are carried out using Stanford CoreNLP (Manning et al., 2014). Sense tagging is the task of attaching a sense to a word or token. It is performed by selecting the most commonly used BabelNet sense that matches the part of speech (POS) of the word. Accordingly, we restricted sense tagging to nouns, verbs, adjectives, and adverbs.
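The sense-tagging step can be pictured with the following minimal Python sketch. It does not call BabelNet or Stanford CoreNLP: the sense inventory is a hypothetical toy dictionary whose sense lists are assumed to be ordered from most to least common, and the coarse POS labels are a simplification of the tags a real tagger would produce.

```python
# Only content words are sense-tagged.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

# Hypothetical toy inventory: (lemma, POS) -> senses, most common first.
TOY_SENSES = {
    ("bank", "NOUN"): ["bank#financial_institution", "bank#river_side"],
    ("run", "VERB"): ["run#move_fast", "run#operate"],
}


def tag_senses(tokens, inventory=TOY_SENSES):
    """Attach the most common POS-matching sense to each content word.

    tokens is a list of (lemma, pos) pairs; words with no matching sense
    in the inventory are skipped.
    """
    tagged = []
    for lemma, pos in tokens:
        if pos not in CONTENT_POS:
            continue
        senses = inventory.get((lemma, pos))
        if senses:
            tagged.append((lemma, senses[0]))
    return tagged
```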

English STS Subtask
We submitted three systems in this subtask, named Run1, Run2, and Run3.

Sense-based String Kernel (Run1)
Given two sentences, $s_1$ and $s_2$, the similarity score produced by this system is the value of a designed string kernel function between the two sentences. This kernel is defined by an embedded mapping from the space of sentences, possibly to a vector space $F$, whose coordinates are indexed by a set $I$ of word senses contained in $s_1$ and $s_2$; i.e. $\phi : s \mapsto (\phi_{ws}(s))_{ws \in I} \in F$. Thus, a sentence $s$ can be represented by a row vector $\phi(s) = (\phi_{ws_1}(s), \phi_{ws_2}(s), \ldots, \phi_{ws_N}(s))$, in which each entry records how similar a particular word sense ($ws \in I$) is to the sentence $s$. The mapping $\phi_{ws}(s)$ is computed from $WSS(ws, ws_i)$, our word sense similarity method (Eq. (1)), over the $n$ word senses $ws_i$ contained in sentence $s$. The string kernel between two sentences $s_1$ and $s_2$ is calculated as (Shawe-Taylor and Cristianini, 2004):

$$\kappa_S(s_1, s_2) = \langle \phi(s_1), \phi(s_2) \rangle = \sum_{ws \in I} \phi_{ws}(s_1) \, \phi_{ws}(s_2)$$

The last remaining step is to normalize the kernel (i.e. to the range [0,1]) to avoid any bias toward sentence length. The normalized string kernel $\kappa_{NS}(s_1, s_2)$ is calculated by (Shawe-Taylor and Cristianini, 2004):

$$\kappa_{NS}(s_1, s_2) = \frac{\kappa_S(s_1, s_2)}{\sqrt{\kappa_S(s_1, s_1) \, \kappa_S(s_2, s_2)}}$$
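To make the kernel computation concrete, here is a minimal Python sketch under stated assumptions. The exact form of the mapping is not recoverable from the text, so the sketch assumes each feature is the maximum word-sense similarity between the index sense and the senses of the sentence; the `aggregate` parameter makes that choice replaceable. It also assumes both feature vectors are built over the shared index set before normalizing, and it uses an exact-match function as a stand-in for the WSS measure.

```python
import math


def feature_vector(sentence_senses, index_senses, wss, aggregate=max):
    """Map a sentence to one feature per index sense.

    Assumption: a feature is aggregate(WSS(ws, ws_i)) over the senses ws_i
    of the sentence, with max as the default aggregate.
    """
    if not sentence_senses:
        return [0.0] * len(index_senses)
    return [aggregate(wss(ws, ws_i) for ws_i in sentence_senses)
            for ws in index_senses]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalized_string_kernel(s1, s2, wss, aggregate=max):
    """Inner product of the two feature vectors, normalized by their norms."""
    index = sorted(set(s1) | set(s2))  # index set I: senses of both sentences
    phi1 = feature_vector(s1, index, wss, aggregate)
    phi2 = feature_vector(s2, index, wss, aggregate)
    norm = math.sqrt(dot(phi1, phi1) * dot(phi2, phi2))
    return dot(phi1, phi2) / norm if norm else 0.0


def exact_match(ws_a, ws_b):
    """Toy stand-in for the WSS measure: 1 for identical senses, else 0."""
    return 1.0 if ws_a == ws_b else 0.0
```

With the toy similarity, two sentences sharing one of their two senses, such as `["dog", "bark"]` and `["dog", "sleep"]`, receive a normalized score of 0.5, and identical sentences receive 1.0 regardless of their length.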

Sense-based Tree Kernel (Run2)
This system applies a tree kernel instead of a string kernel. Tree kernels generally map a tree to the feature space of subtrees. Various types of tree kernel have been designed in the literature, among them the all-subtree kernel presented in (Shawe-Taylor and Cristianini, 2004). The all-subtree kernel is defined by an embedded mapping from the space of all finite syntactic trees to a vector space $F$, whose coordinates are indexed by a subset $T$ of syntactic subtrees; i.e. $\phi : t \mapsto (\phi_{st}(t))_{st \in T} \in F$. The mapping $\phi_{st}(t)$ is a simple exact matching function that returns 1 if $st$ is a subtree of $t$, and returns 0 otherwise. We modified the mapping of the all-subtree kernel to capture the semantic similarity between subtrees instead of the structural similarity. The semantic similarity between subtrees is calculated recursively bottom-up from the leaves to the root, where the similarity between leaves is calculated using our word sense similarity method.
From this point, the remaining steps are identical to the string kernel steps followed in the first system. Hence, given two sentences $s_1$ and $s_2$, their similarity score is the normalized kernel value between their syntactic parse trees $t_1$ and $t_2$.

Sense-based with Surface-based (Run3)
This system takes the arithmetic mean of: 1) our sense-based string kernel (Run1); and 2) the surface-based similarity function proposed by Jimenez et al. (2012). The approach presented in (Jimenez et al., 2012) represents sentence words as sets of q-grams, to which the notion of Soft Cardinality is applied. In this system, all the calculations in the approach are used unchanged with the following parameter setup: p=2, bias=0, and α=0.5. Accordingly, the similarity function is the Dice overlap coefficient on q-grams; i.e.

$$sim_{SC}(s_1, s_2) = \frac{2 \, |s_1 \cap s_2|'}{|s_1|' + |s_2|'}$$

where $|\cdot|'$ denotes soft cardinality.
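The shape of this integration can be sketched in Python as follows. This is a deliberately simplified illustration: it applies the Dice coefficient to ordinary (crisp) sets of character q-grams, not to the soft cardinality of Jimenez et al. (2012), and the value q=2 and all function names are assumptions. The last function shows the Run3 averaging step on two scores that are assumed to lie on the same scale.

```python
def qgrams(word, q=2):
    """Set of character q-grams of a word (q=2 is an assumed value)."""
    if len(word) < q:
        return {word}
    return {word[i:i + q] for i in range(len(word) - q + 1)}


def dice(a, b):
    """Dice overlap coefficient on crisp sets.

    Simplification: the original approach uses soft cardinality here.
    """
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def run3_score(sense_score, surface_score):
    """Arithmetic mean of the sense-based and surface-based scores."""
    return (sense_score + surface_score) / 2
```

For instance, "night" and "nacht" each yield four bigrams and share only "ht", giving a Dice score of 0.25.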

Interpretable STS Subtask
The interpretable STS is a pilot subtask that aims to determine which parts of sentences (chunks) are equivalent in meaning and which are not. This is twofold: (a) aligning corresponding chunks, and (b) assigning a similarity score and a type to each alignment. Given two sentences split into gold standard chunks, our system carries out the task using our sense-based string kernel by treating each chunk as a text snippet. First, the similarity between all possible chunk pairs is calculated, upon which the chunks are aligned: chunk pairs with a high similarity score are aligned first, followed by pairs with lower similarity. Thereafter, for each alignment of chunks $c_1$ and $c_2$, the alignment type is determined according to the following rules:
• If the similarity score between $c_1$ and $c_2$ is 5, the type is EQUI.
• If all word senses of $c_1$ match word senses in $c_2$, the type is SPEC2; similarly for SPEC1.
• If both $c_1$ and $c_2$ contain a single word sense, and the two senses are directly connected by an antonym relation in BabelNet, then the type is OPPO.
• If the similarity score between $c_1$ and $c_2$ is in the range [3, 5), the type is SIM; if it is in the range (0, 3), the type is REL.
• If a chunk has no corresponding chunk in the other sentence, then the type is either NOALI or ALIC, based on the alignment restriction in the subtask.
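The alignment procedure and the type rules can be sketched in Python as shown below. Several details are assumptions rather than statements from the text: alignments are taken to be one-to-one, pairs with zero similarity are left unaligned, the rules are applied in the order listed, "all word senses match" is read as a proper subset relation, and the antonym relation is supplied as a hypothetical set of sense pairs. The NOALI/ALIC decision for unaligned chunks is left out.

```python
def greedy_align(chunks1, chunks2, sim):
    """Align chunk pairs in order of decreasing similarity (one-to-one)."""
    pairs = sorted(
        ((sim(c1, c2), i, j)
         for i, c1 in enumerate(chunks1)
         for j, c2 in enumerate(chunks2)),
        key=lambda item: -item[0],
    )
    used1, used2, alignments = set(), set(), []
    for score, i, j in pairs:
        if score > 0 and i not in used1 and j not in used2:
            alignments.append((i, j, score))
            used1.add(i)
            used2.add(j)
    return alignments


def alignment_type(score, senses1, senses2, antonyms=frozenset()):
    """Label an aligned chunk pair from its score (0 to 5) and sense sets."""
    if score == 5:
        return "EQUI"
    if senses1 and senses1 < senses2:   # every sense of c1 appears in c2
        return "SPEC2"
    if senses2 and senses2 < senses1:   # every sense of c2 appears in c1
        return "SPEC1"
    if len(senses1) == 1 and len(senses2) == 1:
        a, b = next(iter(senses1)), next(iter(senses2))
        if (a, b) in antonyms or (b, a) in antonyms:
            return "OPPO"
    if 3 <= score < 5:
        return "SIM"
    if 0 < score < 3:
        return "REL"
    return None  # unaligned chunks (NOALI / ALIC) are handled elsewhere


# Hypothetical chunk similarities on the 0-5 scale.
TOY_SIM = {("a dog", "a puppy"): 4.0, ("in the park", "the park"): 5.0}


def toy_sim(c1, c2):
    """Look up a toy similarity; unknown pairs score 0."""
    return TOY_SIM.get((c1, c2), 0.0)
```

Calling `greedy_align(["a dog", "in the park"], ["the park", "a puppy"], toy_sim)` aligns the highest-scoring pair first and then the remaining pair.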

English STS
The main evaluation measure selected by the task organizers was the mean Pearson correlation between the system scores and the gold standard scores calculated on the test set (3000 sentence pairs from five datasets). Table 1 presents the official results of our submissions in this subtask on the SemEval-2015 test set. It also includes the results of the Soft Cardinality STS approach (SC) on the same test set for analysis. Our best system (Run3) achieved 0.7595 and ranked 20th out of 73 systems.
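For readers unfamiliar with the measure, the following Python sketch computes the Pearson correlation for one dataset and a plain average over several datasets. The function names and the example scores are hypothetical, and the unweighted average is an assumption; the text does not state how the per-dataset correlations were combined.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


def mean_correlation(per_dataset):
    """Unweighted mean of per-dataset correlations (assumed combination).

    per_dataset maps a dataset name to a (system_scores, gold_scores) pair.
    """
    values = [pearson(system, gold) for system, gold in per_dataset.values()]
    return sum(values) / len(values)
```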
We conducted preliminary experiments on the SemEval-2015 training dataset to evaluate our sense-based string and tree kernel similarity methods, and the integration of each of them with the SC approach. The results of those experiments led to the final submission of the two kernels separately (Run1 and Run2) and of the string kernel method integrated with SC (Run3). Table 2 focuses on the results obtained on the training data by our integrated system (Run3) and the SC approach, but also includes the more recent SC approach (SC-ML) proposed in (Jimenez et al., 2014).
It is noteworthy from the tables that Run3 improved on the SC system results on both the training and test sets for all settings of the alpha value in the SC approach. A possible reason, based on our observations on the training datasets, is that the two systems have opposite strengths and weaknesses. Figure 1 depicts the similarity scores produced by the Run1, Run3, and SC systems along with the gold standard scores (GS) for some sentence pairs from the images dataset. The figure shows that Run1 outperforms SC for semantically equivalent sentence pairs (i.e. scores > 3.5), while SC outperforms Run1 for less related sentence pairs (i.e. scores < 2). Hence, integrating them by taking their average (Run3) improves on the performance of each system individually and does not reduce the SC results. Moreover, although this integration is simple, it outperformed SC-ML, which applies machine learning to extracted text features.

Conclusions and Future work
Our experiments showed that sense-based and surface-based similarity methods are complementary to each other in STS. We also found that the string kernel is more beneficial than the tree kernel. Our potential future work includes: 1) enhancing our sense-based kernel approach, and 2) further improving the integration between SC and our sense-based approach.