ExB Themis: Extensive Feature Extraction from Word Alignments for Semantic Textual Similarity

We present ExB Themis – a word alignment-based semantic textual similarity system developed for SemEval-2015 Task 2: Semantic Textual Similarity. It combines both string and semantic similarity measures as well as alignment features using Support Vector Regression. It occupies the first three places on Spanish data and additionally places second on English data. ExB Themis proved to be the best multilingual system among all participants.


Introduction
Semantic Textual Similarity (STS) is the task of measuring the degree of semantic equivalence of a sentence pair and is applicable to problems in Machine Translation and Summarization, among others. STS has drawn a lot of attention in the last few years, leading to the availability of multilingual training and test data and to the development of a variety of approaches. These approaches fall broadly into three categories (Han et al., 2013): Vector space approaches: Texts are represented as bag-of-words vectors and a vector similarity, e.g. cosine similarity, is used to compute a similarity score between two texts (Meadow et al., 1992).
Alignment approaches: Words and phrases in two texts are aligned and the quality or coverage of the resulting alignments is used as a similarity measure (Mihalcea et al., 2006; Sultan et al., 2014).

Machine Learning approaches: Multiple similarity measures and features are combined using supervised Machine Learning (ML). This approach relies on the availability of training data (Bär et al., 2012; Šarić et al., 2012).
ExB Themis combines advantages of all three categories: we implemented a complex alignment algorithm with a focus on named entities, temporal expressions and measurement expressions, as well as dedicated negation handling. Unlike other alignment-based approaches, we extract a variety of features to better model the properties of alignments instead of providing only one alignment feature (see Section 4.1).
Moreover, we employ a variety of similarity measures based on strings and lexical items (see Section 4.2). Our system integrates two well-known lexical resources, WordNet and ConceptNet (Speer and Havasi, 2012). Additionally, it uses word embeddings to cope with data sparseness and insufficient lexical overlap between sentences.
Finally, we train a Support Vector Regression (SVR) model using these features (see Section 5).

Preprocessing
Our text preprocessing comprises tokenization, case correction (e.g. US Flying Surveillance Missions to Help Find Kidnapped Nigerian Girls is corrected to US flying surveillance missions to help find kidnapped Nigerian girls), unsupervised part-of-speech (POS) tagging based on SVD2 (Lamar et al., 2010), supervised POS tagging using the Stanford Maximum Entropy tagger, and lemmatization using Stanford CoreNLP for English and IXA Pipes for Spanish. We also identify measurements (e.g. 55.8 g/mol), temporal expressions (e.g. last week) and data set-specific stop words (e.g. A close-up of for the images dataset) using in-house algorithms, as well as named entities as described by Hänig et al. (2014) and their titles (e.g. President Barack Obama).

ExB Themis Alignment
Our word alignment is direction-dependent and not restricted to one-to-one alignments. Different mapping types are distinguished and handled differently during feature extraction (see Section 4.1). We use the same type labels as provided by the organizers for the third subtask (interpretable STS) of this task (Agirre et al., 2015): EQUI denotes semantically equivalent chunks, oppositional meaning is labeled with OPPO, and SPE1/2 denote similar meaning of the chunks where the chunk in sentence 1/2 is more specific than the other one. SIMI and REL denote similar and related meanings, respectively. ALIC is not used because our algorithm is not restricted to one-to-one alignments. Finally, all unaligned chunks are labeled with NOALI.
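For illustration, the alignment types and aligned spans could be represented as follows; the class and attribute names are assumptions for this sketch, not the system's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum

class AlignType(Enum):
    """Type labels of the interpretable STS subtask."""
    EQUI = "EQUI"    # semantically equivalent chunks
    OPPO = "OPPO"    # oppositional meaning
    SPE1 = "SPE1"    # chunk in sentence 1 is more specific
    SPE2 = "SPE2"    # chunk in sentence 2 is more specific
    SIMI = "SIMI"    # similar meaning
    REL = "REL"      # related meaning
    NOALI = "NOALI"  # unaligned chunk

@dataclass
class Alignment:
    """One aligned pair of token spans (start/end indices, end exclusive)."""
    span1: tuple
    span2: tuple
    type: AlignType
    score: float = 1.0  # e.g. a word2vec similarity score
```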
Similar to Sultan et al. (2014), our alignment process follows a strict order: Named entities are aligned to each other first. Because we did not observe text pairs with potentially ambiguous name alignments (e.g. Michael in one text and both Michael Jackson and Michael Schumacher in the other) in the training data, we simply align all name pairs that share at least one identical token.
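A minimal sketch of this name-pair matching, assuming named entities are given as token tuples; the function name and representation are hypothetical:

```python
def align_named_entities(entities1, entities2):
    """Align named entities that share at least one identical token.

    `entities1` / `entities2` are lists of token tuples, e.g.
    [("Michael", "Jackson"), ("London",)].  Returns index pairs.
    Token comparison is case-insensitive in this sketch.
    """
    pairs = []
    for i, e1 in enumerate(entities1):
        tokens1 = {t.lower() for t in e1}
        for j, e2 in enumerate(entities2):
            if tokens1 & {t.lower() for t in e2}:
                pairs.append((i, j))
    return pairs
```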
Normalized temporal expressions are aligned iff they denote the same point in time or the same time interval (e.g. 14:03 and 2.03 pm).
Measurement expressions are aligned iff they express the same absolute value (e.g. $100k and 100.000$).
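A minimal sketch of such value-based matching after normalization; the unit table and function names are illustrative assumptions, not the system's actual normalizer:

```python
# Measurement expressions are aligned iff they denote the same absolute value
# after normalization.  The unit table below is purely illustrative.
UNIT_FACTORS = {"$": 1.0, "k$": 1_000.0, "m$": 1_000_000.0}

def normalize_measurement(value: float, unit: str) -> float:
    """Convert a (value, unit) pair to a canonical absolute value."""
    return value * UNIT_FACTORS[unit]

def measurements_align(v1, u1, v2, u2, eps=1e-9) -> bool:
    """True iff both measurements express the same absolute value."""
    return abs(normalize_measurement(v1, u1) - normalize_measurement(v2, u2)) < eps

# "$100k" and "100.000$" (European thousands separator) both normalize to 100000
assert measurements_align(100, "k$", 100_000, "$")
```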
Arbitrary token sequence alignment consists of multiple steps and is very time-consuming. We apply a high-precision test for identical sequences based on Sultan et al. (2014): our test uses synonym lookups and ignores case information, punctuation characters and symbols. This enables us to match expressions like long term and long-term. If one of the two sequences consists of exactly one all-caps token, we test whether it is an acronym of the other sequence (e.g. US and United States).
We used WordNet and ConceptNet to obtain information about synonymy, antonymy and hypernymy and equip the resulting alignments with the corresponding type. We additionally created a small database containing high-frequency synonyms (e.g. does and do), antonyms (e.g. doesn't and does) and negations (e.g. don't, never, no).
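The acronym test could look roughly like this; a heuristic sketch, not the actual implementation:

```python
def is_acronym(token: str, sequence: list) -> bool:
    """Heuristic acronym test: an all-caps token matches a sequence iff its
    letters equal the initial letters of the sequence's tokens.
    """
    if not token.isupper():
        return False
    initials = "".join(w[0].upper() for w in sequence if w[0].isalpha())
    return token == initials

# e.g. "US" vs. ["United", "States"]
assert is_acronym("US", ["United", "States"])
```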
Negations can significantly affect the semantic similarity of two sentences (e.g. You are a Christian. vs. Therefore you are not a Christian.). Therefore, we explicitly model negations in our alignment. Some negations are handled during arbitrary token sequence alignment. We resolve the scope of all remaining negations using co-occurrence analysis: if exactly one of the two neighboring tokens $w^{1/2}_{n-1}$ and $w^{1/2}_{n+1}$ is already aligned, the negation $w^{1/2}_{n}$ is attached to it and we invert the alignment type (e.g. EQUI becomes OPPO and vice versa). If both neighboring tokens are aligned, we pick the one contained in the bigram out of $(w^{1/2}_{n-1}, w^{1/2}_{n})$ and $(w^{1/2}_{n}, w^{1/2}_{n+1})$ that yields the higher co-occurrence significance score.
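A minimal sketch of this negation attachment rule; the co-occurrence scoring function and the data representation are assumptions:

```python
def attach_negation(neg_idx, tokens, aligned, cooc_score):
    """Attach a negation token to an aligned neighbour and invert the type.

    `aligned`    : dict mapping token index -> alignment type string
    `cooc_score` : hypothetical function (w_a, w_b) -> co-occurrence
                   significance of the bigram (w_a, w_b)
    Returns the index of the chosen neighbour, or None.
    """
    left, right = neg_idx - 1, neg_idx + 1
    candidates = [i for i in (left, right) if 0 <= i < len(tokens) and i in aligned]
    if not candidates:
        return None
    if len(candidates) == 1:
        target = candidates[0]
    else:
        # Both neighbours are aligned: keep the bigram with the higher
        # co-occurrence significance score.
        left_sig = cooc_score(tokens[left], tokens[neg_idx])
        right_sig = cooc_score(tokens[neg_idx], tokens[right])
        target = left if left_sig >= right_sig else right
    # Invert the alignment type (EQUI <-> OPPO) of the chosen neighbour.
    flip = {"EQUI": "OPPO", "OPPO": "EQUI"}
    aligned[target] = flip.get(aligned[target], aligned[target])
    return target
```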
Remaining content words are aligned using cosine similarity on word2vec vectors (Mikolov et al., 2013). Analogously to Han et al. (2013), we align each content word to the content word of the other sentence with the same POS tag that yields the highest similarity score. To prevent weak alignments, we reject alignments with a similarity less than 1/3.
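A sketch of this word2vec-based alignment step, assuming a simple word-to-vector lookup; the function name and representation are hypothetical:

```python
import numpy as np

def align_content_words(words1, words2, pos1, pos2, vec, threshold=1/3):
    """Align each remaining content word to the most similar word with the
    same POS tag in the other sentence; `vec` maps word -> np.ndarray.
    Alignments below the similarity threshold of 1/3 are rejected.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    alignments = []
    for i, (w1, p1) in enumerate(zip(words1, pos1)):
        best_j, best_sim = None, threshold
        for j, (w2, p2) in enumerate(zip(words2, pos2)):
            if p1 != p2 or w1 not in vec or w2 not in vec:
                continue
            sim = cos(vec[w1], vec[w2])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            alignments.append((i, best_j, best_sim))
    return alignments
```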

Feature Extraction
Some approaches to STS relying on word alignment are unsupervised and compute a single score directly from the alignment process (e.g. proportions of aligned content words (Sultan et al., 2014)); others extract a single feature from the alignment and use it along with other features to train a regression model (e.g. the align-and-penalize approach (Han et al., 2013; Kashyap et al., 2014)). Unlike these approaches, we extract 40 features from our alignment (see Section 4.1) to (a) build a complex model that is capable of modeling phenomena like alignments of different types and negations, and (b) avoid combining alignment properties arbitrarily.
We additionally extract 51 non-alignment features (see Section 4.2) leading to a total of 91 features.

Alignment Features
To encode the properties of a set of alignments $A$ of sentences $s_1$ and $s_2$ as comprehensively as possible, we extract proportions of aligned content words, where $C$ is the set of all content words (see Sultan et al. (2014) for details on the formulae; type-filtered subsets of $A$ are denoted $A_{type}$). We extract these features for alignments of type EQUI, OPPO, SPE1/2, REL (with each content word weighted by the word2vec similarity score for this type) and NOALI (5 features).
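A rough sketch of such per-type proportion features, loosely following Sultan et al. (2014); the exact formulae and the data representation here are assumptions:

```python
def proportion_features(alignments, content_idx1, content_idx2,
                        types=("EQUI", "OPPO", "SPE1", "SPE2", "REL", "NOALI")):
    """Proportion of content words covered by alignments of each type.

    `alignments`   : list of (idx1, idx2, type) triples; unaligned words may
                     appear as (i, None, "NOALI") or (None, j, "NOALI").
    `content_idx1/2`: sets of content-word indices (the set C) per sentence.
    """
    n_content = len(content_idx1) + len(content_idx2)
    feats = {}
    for t in types:
        covered1 = {i for i, _, at in alignments if at == t and i in content_idx1}
        covered2 = {j for _, j, at in alignments if at == t and j in content_idx2}
        feats[t] = (len(covered1) + len(covered2)) / n_content if n_content else 0.0
    return feats
```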
Frequency features are encoded in binary format.
UMBC align-and-penalize features: We also include two features based on Han et al. (2013): we use their $T$ as is and integrate a simplified version of $P$ with $P_A^i = \sum_{\langle t, g(t)\rangle \in A_i} \frac{1 + wp(t)}{2 \cdot |s_i|}$ and $P_B^i = \frac{|\{\langle t, g(t)\rangle \in B_i\}|}{2 \cdot |s_i|}$ (2 features). Splitting $STS = T - P$ into two features $T$ and $P$ achieves better results than keeping it in the original form.
All proportion features, binary frequency features of REL-alignments, unaligned content words and unaligned negations were additionally computed and extracted for nouns only (16 features).
UMBC: We use several features described in Han et al. (2013): word n-gram similarity for n = 1, 2, 3, 4 (4 features). Moreover, we use word n-gram similarity for n = 1 where only nouns or only verbs are taken into account (2).

Readability Indicators:
We use several features that are typically used as indicators for readability (Oelke et al., 2012): relative difference in sentence length, average word length in characters, number of nouns per sentence, number of verbs per sentence and noun-verb-ratio (5).
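A sketch of these readability indicators; how the per-sentence values are combined for a sentence pair is an assumption here:

```python
def readability_features(tokens1, pos1, tokens2, pos2):
    """Readability-style indicators for a sentence pair (sketch).

    Uses Penn Treebank-style POS tags to count nouns (NN*) and verbs (VB*).
    """
    def stats(tokens, pos):
        nouns = sum(1 for p in pos if p.startswith("NN"))
        verbs = sum(1 for p in pos if p.startswith("VB"))
        avg_len = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
        return nouns, verbs, avg_len

    n1, v1, a1 = stats(tokens1, pos1)
    n2, v2, a2 = stats(tokens2, pos2)
    rel_len_diff = abs(len(tokens1) - len(tokens2)) / max(len(tokens1), len(tokens2), 1)
    noun_verb_ratio = (n1 + n2) / max(v1 + v2, 1)
    return [rel_len_diff, (a1 + a2) / 2, n1 + n2, v1 + v2, noun_verb_ratio]
```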

STS Model
We compute STS scores using ν-SVR (Schölkopf et al., 2000) as implemented by LibSVM. We use LibSVM's default SVR parameter settings without further optimization.
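For illustration, the regression step might look as follows. scikit-learn's NuSVR wraps LIBSVM and stands in here for a direct LibSVM call; the feature matrices are placeholders for the 91-dimensional vectors described above:

```python
import numpy as np
from sklearn.svm import NuSVR

X_train = np.random.rand(200, 91)        # placeholder feature matrix
y_train = np.random.uniform(0, 5, 200)   # placeholder gold STS scores

model = NuSVR()                          # defaults: nu=0.5, C=1.0, RBF kernel
model.fit(X_train, y_train)
scores = model.predict(np.random.rand(10, 91))
```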

Interpretable STS Model
We align chunks using our word alignment (see Section 3). Because our word alignment itself does not rely on chunks, we extend its alignments using the given chunk boundaries. If alignments overlap, we choose the longest alignment and discard the others. We do not differentiate between SIMI and REL: all REL alignments are treated as SIMI alignments. For chunking we use the OpenNLP chunker with the default model trained on CoNLL-2000 shared task data (Sang and Buchholz, 2000).
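A minimal sketch of the overlap resolution (keep the longest alignment, discard overlapping ones); the chunk alignment representation is an assumption:

```python
def resolve_overlaps(chunk_alignments):
    """Keep the longest chunk alignment whenever alignments overlap.

    `chunk_alignments` is a list of ((s1_start, s1_end), (s2_start, s2_end), type)
    tuples with exclusive end offsets.
    """
    def length(a):
        (s1, e1), (s2, e2), _ = a
        return (e1 - s1) + (e2 - s2)

    def overlaps(a, b):
        (a1, a2), (a3, a4), _ = a
        (b1, b2), (b3, b4), _ = b
        # Overlap on either sentence side counts as a conflict.
        return (a1 < b2 and b1 < a2) or (a3 < b4 and b3 < a4)

    kept = []
    for cand in sorted(chunk_alignments, key=length, reverse=True):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return kept
```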

Results
For English we train on all available data sets from the STS challenges in 2012, 2013 (Agirre et al., 2013) and 2014 (Agirre et al., 2014). For Spanish, each run trains on a different setting. Mean Pearson correlation is used as the evaluation metric. Table 1 presents the official scores of our system. Run default uses our system as is. Run themis relies only on alignment features for the model of the belief dataset; all other models are the same as for default. Our third run, themisexp, is identical to run themis except for one improvement: it penalizes the scores of the answers-students dataset exponentially to counteract the high ratio of common content words, which leads to an over-estimation of similarity scores.

Subtask 2c -Interpretable STS
Our three runs differ only in the applied alignment scoring method: we use the average similarity score per alignment type as observed in STSint training data, the most frequent similarity score per alignment type as observed in STSint training data, and an STS regression model per alignment type trained on all available English STS data sets. For the gold chunks subtrack, our runs score 0.4885 to 0.4883 (F1 TYPE + SCORE) on headlines (ranks 10-12 out of 14) and 0.4296 to 0.4246 on images (ranks 8-10). Using system chunks we achieve scores of 0.4290 to 0.4284 on headlines (ranks 4-6 out of 10) and 0.3870 to 0.3806 on images (ranks 4-6).

Conclusions & Future Work
We presented our alignment-based STS system ExB Themis. Our system outperformed all other participants by a large margin on Spanish data. Furthermore, our system placed second on English data. ExB Themis proved to be the best multilingual STS system and can easily be adapted to further languages. We conclude that extensive feature extraction from word alignments is a very robust approach, especially when applied to languages that lack high-quality resources.
In future work, we will investigate the influence of particular features in more detail and we want to enrich our model with structural information (Severyn et al., 2013;Sultan et al., 2014) and improved phrase similarity computation.