BIT at SemEval-2017 Task 1: Using Semantic Information Space to Evaluate Semantic Textual Similarity

This paper presents three systems for semantic textual similarity (STS) evaluation at the SemEval-2017 STS task. One is an unsupervised system and the other two are supervised systems that simply build on the unsupervised one. All our systems mainly depend on the semantic information space (SIS), which is constructed based on the semantic hierarchical taxonomy in WordNet, to compute the non-overlapping information content (IC) of sentences. Our team ranked 2nd among 31 participating teams on the primary score, the mean Pearson correlation coefficient (PCC) over 7 tracks, and achieved the best performance on the Track 1 (AR-AR) dataset.


Introduction
Given two snippets of text, semantic textual similarity (STS) measures the degree of equivalence in the underlying semantics. STS is a basic but important issue with a multitude of application areas in natural language processing (NLP), such as example-based machine translation (EBMT), machine translation evaluation, information retrieval (IR), question answering (QA), text summarization and so on.
The SemEval STS task has become the best-known venue for STS evaluation in recent years, and the STS shared task has been held annually since 2012 (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017) as part of the SemEval/*SEM family of workshops. The organizers have set up publicly available datasets of sentence pairs with similarity scores from human annotators, which amount to more than 16,000 sentence pairs for training and evaluation, and have attracted a large number of teams with a variety of systems to participate in the competitions.
Generally, STS systems can be divided into two categories. One kind is unsupervised systems (Li et al., 2006; Mihalcea et al., 2006; Islam and Inkpen, 2008; Han et al., 2013; Sultan et al., 2014b; Wu and Huang, 2016), some of which appeared long ago, when there was not enough training data. The other kind is supervised systems (Bär et al., 2012; Šarić et al., 2012; Sultan et al., 2015; Rychalska et al., 2016; Brychcín and Svoboda, 2016) that apply machine learning algorithms, including deep learning, now that adequate training data has been constructed. Each kind of method has its own advantages and application areas. In this paper, we present three systems: one unsupervised system and two supervised systems that simply make use of the unsupervised one.

Preliminaries
Following the standard argumentation of information theory, Resnik (1995) proposed the definition of the information content (IC) of a concept as follows:

IC(c) = -log P(c)    (1)

where P(c) refers to the statistical frequency of concept c.
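As a toy illustration of Equation (1), the sketch below computes IC from a hypothetical frequency table; in practice P(c) is estimated from corpus counts propagated up the WordNet hierarchy.

```python
# A toy sketch of Resnik's IC, assuming hypothetical concept frequencies;
# each concept's count includes its whole subtree, so the root covers
# every occurrence.
import math

freq = {"entity": 1000, "animal": 300, "dog": 40}
N = freq["entity"]  # total count at the root

def ic(concept):
    """IC(c) = -log P(c), with P(c) = freq(c) / N."""
    return -math.log(freq[concept] / N)

print(ic("entity"))  # the root carries zero information
print(ic("dog"))     # rarer, more specific concepts carry more
```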
Computing the IC of multiple words, which sums the IC of non-overlapping concepts, is computationally difficult for knowledge-based methods. For a long time, IC-related methods were therefore usually used for word similarity (Resnik, 1995; Jiang and Conrath, 1997; Lin, 1997) or word weighting (Li et al., 2006; Han et al., 2013) rather than as the core evaluation modules of sentence similarity methods (Wu and Huang, 2016).

STS evaluation using SIS
To apply the non-overlapping IC of sentences in STS evaluation, we construct the semantic information space (SIS), which employs the super-subordinate (is-a) relation from the hierarchical taxonomy of WordNet (Wu and Huang, 2016). The space size of a concept is the information content of the concept. SIS is not a traditional orthogonal multidimensional space; rather, it is a space with inclusion relations among concepts. Sentences in SIS are represented as a real physical space instead of a point in a vector space.
We have the following intuitions about similarity: the similarity between A and B is related to their commonality and their differences; the more commonality and the fewer differences they have, the more similar they are; and the maximum similarity is reached when A and B are identical, no matter how much commonality they share (Lin, 1998). The principle of the Jaccard coefficient (Jaccard, 1908) is in accordance with these intuitions, and we define the similarity of two sentences S_a and S_b based on it:

sim(S_a, S_b) = IC(S_a ∩ S_b) / IC(S_a ∪ S_b)    (2)

The quantity of the intersection of the information provided by the two sentences can be obtained from that of the union:

IC(S_a ∩ S_b) = IC(S_a) + IC(S_b) - IC(S_a ∪ S_b)    (3)

So the remaining problem is how to compute the quantity of the union of the non-overlapping information of a sentence. We calculate it by employing the inclusion-exclusion principle from combinatorics for the total IC of sentence S_a, and the same way is used for sentence S_b and for both sentences together:

totalIC(c_1, · · · , c_n) = Σ_i IC(c_i) - Σ_{i<j} commonIC(c_i, c_j) + · · · + (-1)^{n+1} commonIC(c_1, · · · , c_n)    (4)

For the IC of an n-concept intersection in Equation (4), we use the following equation¹:

commonIC(c_1, · · · , c_n) = max_{c_j ∈ subsum(c_1, · · · , c_n)} IC(c_j)    (6)

where subsum(c_1, · · · , c_n) is the set of concepts that subsume all of the concepts c_1, · · · , c_n in SIS.

Algorithm 1: getInExTotalIC(S)
Input: S: {c_i | i = 1, 2, . . . , n; n = |S|}
Output: tIC: total IC of input S
1 if S = ∅ then
2     return 0
3 Initialize: tIC ← 0
4 for i = 1; i ≤ n; i++ do
5     foreach comb in C(n, i)-combinations do
6         tIC ← tIC + (-1)^{i+1} · commonIC(comb)
7 return tIC

Algorithm 1 follows Equations (4) and (6); here C(n, i) is the number of combinations of i concepts out of n concepts, and commonIC(comb) is calculated through Equation (6).

¹ Because of the high computational complexity introduced by Equation (4), we simplify the calculation of the common IC of n concepts and use the approximate formula in Equation (6); the accurate formula, Equation (5), computes the common IC as the total IC of all the common subsumers c'_j ∈ subsum(c_1, · · · , c_n).
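The inclusion-exclusion computation of Algorithm 1 can be sketched as follows, over a hypothetical toy taxonomy with made-up IC values, approximating the common IC of a concept tuple by the most informative common subsumer in the Equation (6) style:

```python
# A sketch of getInExTotalIC: total non-overlapping IC of a concept set via
# the inclusion-exclusion principle. The toy taxonomy and IC values are
# hypothetical, not from WordNet.
from itertools import combinations

parent = {"dog": "animal", "cat": "animal", "animal": "entity", "entity": None}
ic = {"entity": 0.0, "animal": 1.2, "dog": 3.2, "cat": 3.0}

def subsumers(c):
    """All concepts subsuming c, including c itself."""
    out = set()
    while c is not None:
        out.add(c)
        c = parent[c]
    return out

def common_ic(concepts):
    """Shared IC of a concept tuple: max IC among common subsumers."""
    shared = set.intersection(*(subsumers(c) for c in concepts))
    return max(ic[c] for c in shared)

def total_ic(concepts):
    """Inclusion-exclusion over all i-concept combinations."""
    if not concepts:
        return 0.0
    t = 0.0
    for i in range(1, len(concepts) + 1):
        for comb in combinations(concepts, i):
            t += (-1) ** (i + 1) * common_ic(comb)
    return t

print(total_ic(["dog", "cat"]))  # overlap via 'animal' is not double-counted
```

Note the exponential number of combinations, which motivates the efficient algorithm described next.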
For more details about this section, please refer to Wu and Huang (2016).
Searching for subsumers in the hierarchical taxonomy of WordNet is the most time-consuming operation. Define one subsumer search between two concepts as the minimum computational unit. Considering subsumer searching among multiple concepts, the real computational complexity is more than 0·C(n, 1) + 1·C(n, 2) + · · · + (n - 1)·C(n, n).
Note that the computational complexity under the inclusion-exclusion principle is therefore more than O(2^n). To decrease the computational complexity, we exploit an efficient algorithm for the precise computation of the non-overlapping IC of sentences by making use of the idea of the difference set in a hierarchical network (Wu and Huang, 2017).
We add the concepts into SIS one by one, summing the IC gain ICG(c_i) from each newly added concept c_i (Algorithm 2, getTotalIC(S), with input S: {c_i | i = 1, 2, . . . , n; n = |S|} and output tIC, the total IC of S). For a sentence S = {c_i | i = 1, 2, . . . , n; n = |S|}, where c_i is the i-th concept in S and |S| is the concept count of S, ICG(c_i) is given by Equation (8). For convenience in expressing ICG(c_i), we define some functions. Root(c_i) denotes the set of paths, where each path is the node list from c_i to the root in the nominal hierarchical taxonomy of WordNet; Root(n) is the short form of Root(c_1, · · · , c_n). Formally, let Set(p) be the set of nodes in path p; then Root(n) = {p_k | ∀p_k ∈ Root(c_i), p_t ∈ Root(c_j), Set(p_k) ⊆ Set(p_t), i = 1, 2, . . . , n; j = 1, 2, . . . , n}. |Root(c_i)| is the number of paths in Root(c_i). HSN(c_i) denotes the set of nodes on any path in Root(c_i), and HSN(n) is the short form of HSN(c_1, · · · , c_n). Let depth(c) be the maximum depth from concept c to the root, and let totalIC(c_1, · · · , c_n) be the quantity of the total information of n concepts; the incremental computation of the total IC is then given by Equation (9). Algorithms 2 and 3 follow Equations (8) and (9). Algorithm 3 costs approximately one subsumer search between two concepts, so the overall complexity drops dramatically compared with the inclusion-exclusion computation. Open-source implementations of Algorithms 2 and 3 with the related library are available on GitHub.
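A simplified analogue of the incremental idea behind Algorithms 2 and 3 can be sketched for a tree taxonomy: distribute IC along edges as increments over the parent, so that a newly added concept contributes only the increments of nodes not yet covered (its IC gain). This is a stand-in illustration, not the paper's exact ICG formula; the taxonomy and IC values are hypothetical.

```python
# Incremental total IC over a toy tree: each node carries the IC increment
# over its parent, and a concept set's total IC is the sum of increments
# over the union of all ancestor nodes, accumulated concept by concept.
parent = {"dog": "animal", "cat": "animal", "animal": "entity", "entity": None}
ic = {"entity": 0.0, "animal": 1.2, "dog": 3.2, "cat": 3.0}

def increment(c):
    """IC gained on the edge from parent(c) down to c."""
    p = parent[c]
    return ic[c] - (ic[p] if p is not None else 0.0)

def ancestors(c):
    """Node list from c up to the root, inclusive."""
    out = []
    while c is not None:
        out.append(c)
        c = parent[c]
    return out

def total_ic(concepts):
    """Add concepts one by one, summing the IC gain of newly covered nodes."""
    covered, t = set(), 0.0
    for c in concepts:
        gain = sum(increment(n) for n in ancestors(c) if n not in covered)
        covered.update(ancestors(c))
        t += gain
    return t

print(total_ic(["dog", "cat"]))  # the shared 'animal' path is counted once
```

Each concept costs only one walk to the root, avoiding the exponential blow-up of inclusion-exclusion.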
A theoretical framework of lemmas and theorems has been established to support the correctness of Equations (8) and (9). For more details, please refer to Wu and Huang (2017).

Increasing Word Recall Rate for SIS
We made improvements in three aspects in another previous work of ours. First, we utilize WordNet to directly obtain the nominal forms of content words that are not nouns, mainly through the derivational pointers in WordNet. This word formation helps enhance the recall rate of known content words in sentence-to-SIS mappings. Second, a named entity (NE) recognition tool (Manning et al., 2014) and the alignment tool (Sultan et al., 2014a) are employed to obtain non-overlapping unknown NEs, which are used for simulating non-overlapping IC in SIS. The alignment tool is mainly used for finding NEs that are actually the same but have different string forms, as well as inconsistent NE annotations from the NE recognition tool. Using statistics of known NEs of the same kinds from previous datasets, we simulate the IC of out-of-vocabulary NEs in SIS. Finally, sentence IC is augmented by word weights, which can be deemed the importance of words.
The content of this section is mainly based on work that is currently under review.

System Overview
We submitted three systems: one is the unsupervised system exploiting non-overlapping IC in SIS; the other two are supervised systems making use of sentence alignment and word embeddings, respectively.

Preprocessing
First of all, we translated all the non-English sentences into English by employing the Google machine translation system and preprocessed the test datasets with tokenizer.perl and truecase.perl, the tools from the Moses machine translation toolkit (Koehn et al., 2007). We then used the preprocessed datasets for POS tagging and lemmatization with NLTK (Bird, 2006), and finally made use of the lemmas for sentence alignment (Sultan et al., 2014a) and named entity recognition (Manning et al., 2014). For simplicity, we use the lemma instead of the original word in all situations where words are needed.
We also developed a word spelling correction module based on Levenshtein distance, tailored to the spelling mistakes in STS datasets. It proved important for the final performance in previous years; however, it was not so critical this year.
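Such a module can be sketched as follows; the vocabulary and distance cutoff here are illustrative, not the exact configuration of our module:

```python
# A minimal Levenshtein-distance spelling corrector: map an out-of-vocabulary
# word to the closest in-vocabulary word within a small edit-distance cutoff.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocab, max_dist=2):
    """Return the closest in-vocabulary word within max_dist, else word."""
    if word in vocab:
        return word
    best = min(vocab, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else word

vocab = {"similarity", "semantic", "sentence"}
print(correct("similarty", vocab))
```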

Run 1: Unsupervised SIS
Run 1 is from the unsupervised system constructed using the framework described in Section 2, and the implementation is as follows. Word IC calculation employs Equation (1), and the probability of a concept c is:

P(c) = (Σ_{w ∈ words(c)} freq(w)) / N

where words(c) is the set of all the words contained in concept c and its sub-concepts in WordNet, freq(w) is the corpus frequency of word w, and N is the sum of the frequencies of the words contained in all the concepts in the hierarchy of the semantic net. The word statistics are from the British National Corpus (BNC), obtained via NLTK (Bird, 2006). Sentence IC computation applies Equation (9). For simplicity, instead of word sense disambiguation (WSD), we choose the concept of a word with the minimal IC, which denotes the most common sense of the word, in all circumstances of word-to-concept conversion and of selection between two aligned words.
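The concept probability above can be sketched over a toy hierarchy; the hierarchy and counts are hypothetical stand-ins for WordNet plus BNC frequencies, and the minimal-IC sense choice is included:

```python
# A sketch of Run 1's concept probability: P(c) sums the frequencies of the
# words in concept c and its sub-concepts, divided by the hierarchy total N.
import math

children = {"entity": ["animal", "artifact"], "animal": ["dog"],
            "artifact": [], "dog": []}
word_freq = {"entity": 5, "animal": 20, "artifact": 25, "dog": 50}

def subtree_freq(c):
    """Frequency mass of concept c: its own words plus all sub-concepts'."""
    return word_freq[c] + sum(subtree_freq(ch) for ch in children[c])

N = subtree_freq("entity")  # total frequency over the whole hierarchy

def ic(c):
    """Equation (1): IC(c) = -log P(c)."""
    return -math.log(subtree_freq(c) / N)

def most_common_sense(concepts):
    """Pick the concept with minimal IC, standing in for WSD."""
    return min(concepts, key=ic)

print(ic("entity"))                          # root: P = 1, IC = 0
print(most_common_sense(["dog", "animal"]))  # the more general sense wins
```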

Run 2: Supervised IC and Alignment
As the aligner of Sultan et al. (2014a) has been successfully applied in STS evaluation, we wished to leverage its advantage of finding potential aligned word pairs from both sentences, especially those with different surface forms. However, we did not obtain the global inverse document frequency (IDF) data in time, so we did not employ the aligner of Brychcín and Svoboda (2016), an improved version of Sultan et al. (2014a) that introduces the IDF information of words into the similarity formula.
In this run, we use support vector machines (SVM) (Chang and Lin, 2011) for regression, more specifically sequential minimal optimization (SMO) (Shevade et al., 2000). There are two features: one is the output of SIS, and the other is that of the unsupervised method of Sultan et al. (2015).
We actually tested some other regression methods and found that LR and SVM always outperform the others. The regression methods are implemented in WEKA (Hall et al., 2009).
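The feature-combination step of the supervised runs can be sketched with a pure-Python least-squares regressor standing in for WEKA's SMO/LR; the feature values and gold scores below are made up for illustration:

```python
# Combine two similarity features (e.g. the SIS score and a second score)
# into one prediction with ordinary least squares via the normal equations.
def solve(A, b):
    """Gauss-Jordan elimination for a small linear system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(X, y):
    """Ordinary least squares with intercept: y ~ w0 + w1*x1 + w2*x2."""
    Xa = [[1.0] + row for row in X]
    k = len(Xa[0])
    XtX = [[sum(r[i] * r[j] for r in Xa) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Hypothetical (sis_score, second_feature) -> gold similarity pairs.
X = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.6], [0.7, 0.4]]
y = [4.5, 0.5, 2.8, 3.2]
w = fit(X, y)
print(predict(w, [0.9, 0.8]))  # combined similarity estimate
```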

Run 3: Supervised IC and Embeddings
Deep learning has become a hot topic in recent years, and many supervised STS methods incorporate deep learning models. At the SemEval 2016 STS task, at least the top 5 teams included deep learning modules, according to incomplete statistics (Agirre et al., 2016).
Table 1: Test sets at SemEval 2017 STS task.

In this run, we take advantage of embeddings that obtain information from large-scale corpora and train a linear regression (LR) model. There are two features: one is the output of SIS, and the other comes from a modified version of a basic sentence embedding, the simple combination of word embeddings. The word embedding vectors are generated by word2vec (Mikolov et al., 2013) over the 5th edition of Gigaword (LDC2011T07) (Parker et al., 2011); we also preprocess the Gigaword data with tokenizer.perl and truecase.perl. We modify the basic sentence embedding by importing domain IDF information. The domain IDF of a word can be obtained from the current test dataset by deeming each sentence a document. We did not directly use the domain IDF d as the weight of a word embedding: on previous datasets, we found that d^0.8 as the weight performed nearly the best.
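The modified sentence embedding of Run 3 can be sketched as follows; the three-dimensional vectors and the mini test set are hypothetical stand-ins for word2vec embeddings and the real data, while the IDF exponent 0.8 follows the run description:

```python
# IDF-weighted sentence embedding: sum word vectors with weight idf(w)**0.8,
# where the "domain" IDF treats each sentence of the test set as a document.
import math

sentences = [["a", "dog", "barks"], ["a", "cat", "sleeps"], ["a", "dog", "sleeps"]]
vec = {"a": [0.1, 0.0, 0.0], "dog": [0.0, 1.0, 0.2], "cat": [0.0, 0.9, 0.3],
       "barks": [0.5, 0.1, 0.0], "sleeps": [0.4, 0.2, 0.1]}

def domain_idf(word):
    """IDF over the current test set, one sentence = one document."""
    df = sum(word in s for s in sentences)
    return math.log(len(sentences) / df)

def embed(sentence, power=0.8):
    """Weighted sum of word vectors, weight = idf**power."""
    out = [0.0, 0.0, 0.0]
    for w in sentence:
        wt = domain_idf(w) ** power
        out = [o + wt * v for o, v in zip(out, vec[w])]
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

print(cosine(embed(sentences[0]), embed(sentences[2])))
```

Note that a word occurring in every sentence (here "a") gets zero weight, so uninformative words vanish from the representation.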

Data
SemEval 2017 STS task assesses the ability of systems to determine the degree of semantic similarity between monolingual and cross-lingual sentences in Arabic, English, Spanish and a surprise language of Turkish. The shared task is organized into a set of secondary sub-tracks and a single combined primary track. Each secondary subtrack involves providing STS scores for monolingual sentence pairs in a particular language or for cross-lingual sentence pairs from the combination of two particular languages. Participation in the primary track is achieved by submitting results for all of the secondary sub-tracks (Cer et al., 2017).
As shown in Table 1, the SemEval 2017 STS shared task contains 1750 pairs with gold standard (GS) out of 2000 pairs in total from 7 different tracks. Systems were required to annotate all the pairs, and performance was evaluated on all pairs or the subset with GS in the datasets. The GS for each pair ranges from 0 to 5, with the following interpretations: 5 indicates complete equivalence; 4, mostly equivalent, with differences only in some unimportant details; 3, roughly equivalent, but with differences in some important details; 2, not equivalent, but sharing some details; 1, sharing only the same topic; and 0, no overlap in similarity.

Evaluation
The evaluation metric is the Pearson product-moment correlation coefficient (PCC) between the machine-assigned semantic similarity scores and human judgements. PCC is computed for each individual test set, and the primary evaluation is the weighted mean of PCC over all datasets (Cer et al., 2017).
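The metric can be sketched as follows, with made-up per-track scores and sizes standing in for the real submissions:

```python
# Per-track Pearson correlation, combined into the primary score by a
# weighted mean over tracks. Scores and track sizes are illustrative.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

tracks = {  # track -> (system scores, gold scores), hypothetical values
    "AR-AR": ([4.1, 2.0, 0.5], [4.5, 2.2, 0.3]),
    "EN-EN": ([3.0, 1.0, 5.0], [2.8, 0.9, 4.7]),
}
sizes = {"AR-AR": 250, "EN-EN": 250}

pccs = {t: pearson(s, g) for t, (s, g) in tracks.items()}
primary = sum(sizes[t] * pccs[t] for t in tracks) / sum(sizes.values())
print(round(primary, 4))
```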
The performances of our three runs on each SemEval 2017 STS test set are shown in Table 2. Bold numbers represent the best scores from any of our systems on each test set, including the primary scores. Cosine Baseline utilizes the basic sentence embedding method for monolingual similarity (Tracks 1, 3 and 5), provided officially by the STS organizers. Best System denotes the scores of the state-of-the-art system. All Systems Best means the best scores among all participating systems in each track, regardless of whether they come from the same system. Differences indicates the gaps between the best scores from our three systems and All Systems Best in each track; the primary difference is between our best system and the state-of-the-art system. Team Rankings shows the rankings of our best scores relative to those of other teams; the Primary team ranking may be the most important ranking for participants who submitted scores for all tracks.
Our team ranked 2nd on the primary score and achieved the best performance on Track 1 (Arabic-Arabic). Track 1 is the only track that involved an entirely new language with no references from past tasks (the cross-lingual evaluations contain English sentences).
Our worst performance is on Track 4b. We conjecture the reasons could be the following, and further research is needed on this issue: 1) Our methods, especially the unsupervised SIS, ignore some important information that embedding methods capture and are currently not suited to complicated post-editing sentences. We tested the basic sentence embedding method in isolation, which achieved a score of more than 0.16, much better than our IC-based systems of Run 1 (0.0758) and Run 2 (0.0584), which are without embedding modules.
2) The translation quality for long sentences from machine translation may not be as good as that for short sentences. The translation results may lose some of the information in the original sentences needed by SIS and may introduce more noise.

STS benchmark
In order to provide a standard benchmark for comparing state-of-the-art systems in semantic textual similarity for English, the organizers of the SemEval STS tasks set up a leaderboard this year which includes the results of selected systems. The benchmark comprises a selection of the English datasets used in the STS tasks in the context of SemEval from 2012 to 2017, and it is organized into train, development and test sets (Cer et al., 2017).
Our systems were selected by the organizers to submit results for the STS benchmark. We employ the models described above, with one small difference in Run 3: d^0.9 was used as the weight of the word embeddings, which achieved the best performance of cosine similarity over the summed word embeddings in isolation. As our models do not need hyperparameter tuning on a separate development set, the train part is used for tuning parameters and training models, while both the development part and the test part are used for testing the final systems. Table 3 shows the performances of our systems.
From the table we can see that Run 3 provides the best performance on the benchmark, which is in accordance with the results at the SemEval 2017 STS task. Our best system ranks 2nd at present. More details about the STS benchmark and the real-time leaderboard can be found on the official website: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Conclusions
At the SemEval 2017 STS task, we introduced an unsupervised knowledge-based method, SIS, which may be new at SemEval. SIS is an extension of information content for STS evaluation. The performance of SIS is quite good on the STS test sets, especially considering that it is a new unsupervised method with room to improve. Currently, our main concern is how to capture the information contained in word embeddings, which may be lost in the knowledge-based SIS, and combine it with SIS to improve STS performance.