VRep at SemEval-2016 Task 1 and Task 2: A System for Interpretable Semantic Similarity

VRep is a system designed for SemEval 2016 Task 1 - Semantic Textual Similarity (STS) and Task 2 - Interpretable Semantic Textual Similarity (iSTS). STS quantifies the semantic equivalence between two snippets of text, and iSTS provides a reason why those snippets of text are similar. VRep makes extensive use of WordNet for both STS, where the Vector relatedness measure is used, and for iSTS, where features are extracted to create a learned rule-based classifier. This paper outlines the VRep algorithm, provides results from the 2016 SemEval competition, and analyzes the performance contributions of the system components.


Introduction
VRep competed in SemEval 2016 Task 1 - Semantic Textual Similarity (STS) and Task 2 - Interpretable Semantic Textual Similarity (iSTS). Both tasks involve computing the STS between two fragments of text; Task 2 expands upon Task 1 by also requiring a reason for their similarity. VRep uses an STS measure based on the Vector relatedness measure (Pedersen et al., 2004), and a reasoning system based on JRIP (Cohen, 1995), an implementation of the RIPPER rule-induction algorithm.
For Task 1, we are provided with paired sentences, and for each pair VRep assigns a number indicating their STS. The number ranges from 0 to 5, with 0 indicating no similarity and 5 indicating equivalence.
For Task 2, we are provided with paired sentences and align the chunks of one sentence to the most similar chunks of the other sentence. Next, a reason and a score are computed for each alignment. A chunk is a fragment of text that conveys a single meaning, such as a noun phrase or verb phrase; in the iSTS data, chunks are marked with brackets. Alignment reasons are selected from a small list of possible labels created by the event organizers (Agirre et al., 2015): EQUI, OPPO, SPE1, SPE2, SIMI, REL, and NOALI. As in Task 1, the scores range from 0 to 5, with 0 indicating no similarity and 5 indicating equivalence. VRep makes extensive use of WordNet (Fellbaum, 2005) to compute STS and to assign a label in iSTS. VRep is written in Perl and is freely available for download.

Algorithm Description
The same measure of STS is used for both Task 1 and Task 2; however, the algorithm for Task 1 is simpler and consists of only the first two steps: Preprocessing and Semantic Textual Similarity. The steps are outlined below and are expanded on in subsequent subsections.
1. Preprocessing - text is standardized.
2. Semantic Textual Similarity - the STS between two chunks or two sentences is computed. This is the final step for Task 1.
3. Chunk Alignment - each chunk of one sentence is aligned to a chunk in the other sentence. If no chunks are similar, no alignment (NOALI) is assigned.
4. Alignment Reasoning - a label is assigned to each aligned chunk pair.
5. Alignment Scoring - an alignment score is assigned on a 0-5 scale.
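
The pipeline can be summarized by the following minimal, self-contained Python sketch. VRep itself is written in Perl; all function names and the toy token-overlap similarity are illustrative stand-ins, with the real components described in the subsections below.

```python
# Minimal, self-contained sketch of the five-step pipeline (Task 2).
# All names and the toy similarity are illustrative, not VRep's code.

LABEL_SCORES = {"EQUI": 5.0, "NOALI": 0.0}  # full mapping in Alignment Scoring

def preprocess(chunk):
    # Step 1 (simplified): tokenize on spaces and lowercase.
    return chunk.lower().split()

def chunk_sim(c1, c2):
    # Step 2 stand-in: token overlap in place of the WordNet-based Equation (1).
    if not c1 or not c2:
        return 0.0
    overlap = sum(1 for w in c2 if w in c1)
    return overlap / min(len(c1), len(c2))

def classify_pair(c1, c2):
    # Step 4 stand-in: the real system applies JRIP-learned rules.
    return "EQUI" if c1 == c2 else "NOALI"

def vrep_task2(chunks1, chunks2):
    c1s = [preprocess(c) for c in chunks1]
    c2s = [preprocess(c) for c in chunks2]
    alignments = []
    for c1 in c1s:
        sims = [chunk_sim(c1, c2) for c2 in c2s]
        best = max(sims) if sims else 0.0
        if best == 0.0:                      # Step 3: NOALI if nothing matches
            alignments.append((c1, None, "NOALI", 0.0))
            continue
        c2 = c2s[sims.index(best)]
        label = classify_pair(c1, c2)        # Step 4
        alignments.append((c1, c2, label, LABEL_SCORES[label]))  # Step 5
    return alignments

print(vrep_task2(["The dog ran", "quickly"], ["The dog ran", "home"]))
```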

Preprocessing
In the first step, data is prepared for processing as outlined below:
1. Tokenization - spaces are used as a delimiter.
2. Lowercase All Characters - standardizes string equivalence testing and prevents incorrect part of speech (POS) tagging. The POS tagger tends to tag most words that have a capital letter as a proper noun, which is often incorrect. This is particularly problematic with the headlines data set.
3. Part of Speech Tagging - each token is assigned a POS tag.
4. Lemmatization - words are reduced to their base forms for WordNet lookup.
5. Stop Word Removal - remove any words that are not tagged as a noun, verb, adjective, or adverb. This reduces chunks and sentences to content words.
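
As an illustration, the following Python sketch performs these steps using NLTK's off-the-shelf tagger and lemmatizer as stand-ins for the tools used by the Perl implementation:

```python
import nltk                                   # requires: nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer      # requires: nltk.download('wordnet')

# Penn Treebank tag prefixes for content words: noun, verb, adjective, adverb.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def preprocess(text):
    tokens = text.split(" ")                  # 1. tokenize on spaces
    tokens = [t.lower() for t in tokens]      # 2. lowercase
    tagged = nltk.pos_tag(tokens)             # 3. POS tag
    lemmatizer = WordNetLemmatizer()
    lemmas = [(lemmatizer.lemmatize(w), tag)  # 4. lemmatize (default noun POS)
              for w, tag in tagged]
    # 5. stop word removal: keep only content words
    return [w for w, tag in lemmas if tag.startswith(CONTENT_PREFIXES)]

print(preprocess("The Brown Dog ran across the street"))
# -> ['brown', 'dog', 'ran', 'street']
```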

Semantic Textual Similarity (STS)
STS is computed in the same way for both tasks; however, it is computed between two sentences for Task 1 and between two chunks for Task 2. While describing the computation of STS we refer to chunks; for Task 1 a sentence can be conceptualized as a single chunk. VRep's STS computation is shown in Equation (1) and is similar to the methods described by NeRoSim (Banjade et al., 2015) and Ştefănescu et al. (2014). chunkSim takes two chunks ($c_1$, $c_2$) as input and computes the weighted sum of maximum word-to-word similarities, $sim(w_i, w_j)$. To do this, $sim(w_i, w_j)$ is found for each word of $c_2$ against every word of $c_1$, and the maximum is added to a running sum.
$$\mathrm{chunkSim}(c_1, c_2) = \frac{1}{\min(n, m)} \sum_{j=1}^{m} \max_{1 \le i \le n} sim(w_i, w_j) \quad (1)$$

where $c_1$ and $c_2$ are two chunks, $n$ and $m$ are the number of words in $c_1$ and $c_2$, $w_i$ is word $i$ of $c_1$, and $w_j$ is word $j$ of $c_2$.

$sim(w_i, w_j)$ is defined differently for words in WordNet and words not in WordNet. For words in WordNet, $sim(w_i, w_j)$ is the Vector relatedness measure (Pedersen et al., 2004) with a threshold applied. The Vector measure was chosen for several reasons. First, it returns values scaled between 0 and 1, which is beneficial for applying thresholds in both chunk alignment and alignment reasoning; a known scale also allows a direct mapping from the weighted sum to the answer space of Task 1 (scaled 0-5). Second, the Vector measure works well when $w_i$ and $w_j$ are different parts of speech because it does not rely on the WordNet hierarchies. When calculating $sim(w_i, w_j)$, all possible senses of both $w_i$ and $w_j$ are used, and $sim(w_i, w_j)$ is taken as the maximum value over all sense pairs. This eliminates the need for word sense disambiguation (WSD). After computing the measure, a threshold is applied that reduces any value less than 0.9 to 0.0. This value was tuned separately on the training data for both tasks via a grid search, and 0.9 was found to be optimal for both. The threshold prevents dissimilar terms from impacting the STS, which improves accuracy and prevents noisy chunk alignments.
For words not in WordNet, $sim(w_i, w_j)$ is a binary value: 1 if the two words match exactly, 0 otherwise. Words not in WordNet tend to be proper nouns, abbreviations, or short words such as "he" or "she", "is" or "in", all of which are generally spelled identically across sentences, making exact matching a suitable measure.
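
The word-to-word measure can be sketched as follows. Since the Vector measure is implemented in the Perl WordNet::Similarity package rather than in NLTK, the sketch below substitutes NLTK's Wu-Palmer similarity as a stand-in while keeping the same structure: maximum over all sense pairs, the 0.9 threshold, and the exact-match fallback.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

THRESHOLD = 0.9  # values below this are reduced to 0.0

def sim(w1, w2):
    senses1, senses2 = wn.synsets(w1), wn.synsets(w2)
    if not senses1 or not senses2:
        # Fallback for words not in WordNet: exact string match.
        return 1.0 if w1 == w2 else 0.0
    # Max over all sense pairs removes the need for disambiguation.
    # Wu-Palmer stands in for the Vector measure here; unlike Vector it
    # can return None across parts of speech, which we treat as 0.
    best = max((s1.wup_similarity(s2) or 0.0)
               for s1 in senses1 for s2 in senses2)
    return best if best >= THRESHOLD else 0.0

print(sim("dog", "cat"))  # falls below the 0.9 threshold, so prints 0.0
```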
chunkSim is defined as the sum of maximum word-to-word similarities normalized by the number of words in the shorter chunk of the pair. Normalization prevents similarity scores from increasing as chunk length increases. It also scales chunkSim within a predictable range of approximately 0.0 to 1.0.
chunkSim is used directly in Task 1, where it is linearly scaled by 5 to produce the final output. We experimented with multiple regression fits (linear, exponential, logarithmic, power, and polynomial) between our chunkSim output and the provided gold standard values; these yielded little to no improvement, so the linear scaling by 5 was chosen for simplicity.
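
Reusing the `sim` function sketched above, a minimal sketch of Equation (1) and the Task 1 scaling is:

```python
def chunk_sim(c1, c2):
    # c1, c2: preprocessed chunks as lists of tokens (see subsection 2.1).
    if not c1 or not c2:
        return 0.0
    # For each word of c2, add the best similarity against any word of c1,
    # then normalize by the length of the shorter chunk.
    total = sum(max(sim(w1, w2) for w1 in c1) for w2 in c2)
    return total / min(len(c1), len(c2))

def task1_score(sentence1_tokens, sentence2_tokens):
    # Task 1 output: chunkSim linearly scaled to the 0-5 answer space.
    return 5.0 * chunk_sim(sentence1_tokens, sentence2_tokens)
```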

Chunk Alignment
chunkSim is computed between every pair of chunks across the two paired sentences, and the chunk with the highest chunkSim is selected for alignment. Multiple alignments are allowed for a single chunk. If all chunks have a similarity of 0, no alignment (NOALI) is assigned. Due to the high $sim(w_i, w_j)$ threshold, no threshold on chunkSim is required, as there is in NeRoSim (Banjade et al., 2015).
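
A sketch of the alignment step, reusing `chunk_sim` from the previous sketch (names are illustrative):

```python
def align_chunks(chunks1, chunks2):
    # chunks1, chunks2: lists of preprocessed chunks (lists of tokens).
    # Returns (index_in_chunks1, index_in_chunks2 or None) pairs; a chunk
    # in chunks2 may be chosen by several entries, since multiple
    # alignments are allowed for a single chunk.
    alignments = []
    for i, c1 in enumerate(chunks1):
        sims = [chunk_sim(c1, c2) for c2 in chunks2]
        best = max(sims) if sims else 0.0
        if best == 0.0:
            alignments.append((i, None))   # NOALI: nothing is similar
        else:
            alignments.append((i, sims.index(best)))
    return alignments
```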

Alignment Reasoning
Alignment Reasoning takes as input a chunk pair and provides a reason why that chunk pair is aligned. VRep's alignment reasoning is inspired by NeRoSim (Banjade et al., 2015) and SVCSTS (Karumuri et al., 2015). Both of these systems classify a chunk pair using features extracted from the chunk pair itself. NeRoSim's features tend to focus more on the semantic relationship between chunk pairs, such as whether or not the two chunks contain antonyms, synonyms, etc. The features of SVCSTS focus more on the syntactic form of the chunks, such as the number of words or counts of parts of speech in a chunk pair. VRep combines the two approaches and extracts a total of 72 syntactic and semantic features for each chunk pair.
Gold standard chunk pairs from the SemEval 2015 Task 2 test data were used to train our classifier, WEKA's (Hall et al., 2009) JRIP algorithm (Cohen, 1995), which creates a decision list for classification. The classifier uses only 24 of the original 72 features and a series of 10 rules.
JRIP was chosen as a classifier due to its performance (see Table 5) and its concision. The rules generated are human readable, which provides insight into how the classification occurs and into the types of features that are discriminative. Classifiers were trained with chunk pairs from every data set (student answers, headlines, and images), both individually and combined. The best performing classifier for each topic was generated from the combined data. In the rules, $\alpha$ and $\beta$ designate the individual chunks in the chunk pair being classified, and $x_i$ denotes feature $i$ of the feature vector created from a chunk pair.
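
The full 72-feature set and the induced rules are not reproduced here. The sketch below computes a handful of hypothetical features of each family, in the spirit of SVCSTS (syntactic) and NeRoSim (semantic); these are representative examples, not VRep's actual feature definitions.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def has_antonym_pair(c1, c2):
    # Semantic feature (NeRoSim-style): does any word of c1 have a
    # WordNet antonym that appears in c2?
    for w in c1:
        for lemma in (l for s in wn.synsets(w) for l in s.lemmas()):
            for ant in lemma.antonyms():
                if ant.name() in c2:
                    return 1
    return 0

def extract_features(c1, c2):
    # c1, c2: preprocessed chunks as lists of tokens.
    shared = len(set(c1) & set(c2))
    return [
        len(c1),                  # syntactic (SVCSTS-style): chunk length
        len(c2),
        abs(len(c1) - len(c2)),   # length difference
        shared,                   # semantic: shared content words
        int(set(c1) == set(c2)),  # identical word sets
        has_antonym_pair(c1, c2), # antonym presence
    ]
```

In the real system, feature vectors of this kind are exported to WEKA, and JRip induces an ordered rule list over them.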
It is interesting to note that there is no classifier rule for the REL class. The data set was heavily skewed towards the EQUI class, which constituted 60% of the total data, leaving a small percentage to be divided among the remaining 5 classes, with only around 5% being REL. With a larger training set we would expect rules for REL to be generated.

Alignment Scoring
Alignment scores are assigned as either the required scores, 0 for NOALI and 5 for EQUI, or the average alignment score for each class, as in (Karumuri et al., 2015). The average alignment scores for the classes were computed both for each topic alone and for all topics combined. The best performing set of scores for all topics came from the images data set alone. The scores used for each class are as follows: EQUI = 5.00, OPPO = 4.00, SPE1 = 3.24, SPE2 = 3.69, SIMI = 2.975, REL = 3.00, NOALI = 0.00.
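
In code, the scoring step reduces to a simple lookup over the class scores above (a minimal sketch):

```python
# Class-to-score lookup from the values reported above.
ALIGNMENT_SCORES = {
    "EQUI": 5.00, "OPPO": 4.00, "SPE1": 3.24, "SPE2": 3.69,
    "SIMI": 2.975, "REL": 3.00, "NOALI": 0.00,
}

def score_alignment(label):
    return ALIGNMENT_SCORES[label]
```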

Results
The performance of VRep is shown below for the SemEval 2016 Task 1 and Task 2 test data sets. The baseline described by the task organizers (Agirre et al., 2015) is shown for comparison for Task 2. Baseline results were not made available for Task 1.

Task 1 -Semantic Similarity
For Task 1, the Pearson correlation coefficient between VRep's results and the gold standard results is reported for the 2016 Task 1 test data. A value of 1.0 indicates perfect correlation and 0.0 indicates no correlation. We ran VRep on five data sets, with the results for each data set shown in Table 1. More details on the data sets and evaluation metrics are provided in the competition summary (Agirre et al., 2016).
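
For reference, this metric can be computed with SciPy; the numbers below are made up for illustration and are not system output:

```python
from scipy.stats import pearsonr

# Hypothetical system scores and gold standard values (0-5 scale).
system = [4.8, 1.2, 3.0, 0.4, 2.5]
gold   = [5.0, 1.0, 3.5, 0.0, 2.0]

r, p_value = pearsonr(system, gold)
print(f"Pearson r = {r:.3f}")
```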

Task 2 -Interpretable Semantic Similarity
For Task 2, we report results for the Gold Chunks scenario (data is pre-chunked). Each data set is evaluated using the F1 score in four categories:
1. Ali (Alignment) - F1 score of the chunk alignment.
2. Type (Alignment Type) - F1 score of the alignment reasoning.
3. Score (Alignment Scoring) - F1 score of the alignment scoring.
4. Typ+Scor (Alignment Type and Score) - a combined F1 score of the alignment reasoning and scoring.
F1 scores range from 0.0 to 1.0, with 1.0 being the best score. Data sets are available online, and the evaluation metrics are described in more detail in the competition summary (Agirre et al., 2015).

STS Component Analysis
The Pearson correlation coefficient of the Task 1 STS scores and the Task 2 alignment F1 (Ali) are used as evaluation metrics for the STS portion of VRep. Tables 3 and 4 show the effects of adding a component to the Basic system. Each component and the Basic system are described below:

1. Basic - as a baseline, a Basic system which only applies Equation (1) is used. For Task 1 the result is scaled by 5; for Task 2 each chunk is aligned with the chunk having the highest chunkSim. No thresholding or preprocessing is performed.
2. Threshold - adds a threshold to $sim(w_i, w_j)$ in Equation (1). A modest threshold of 0.4 was used; the optimum threshold of 0.9 used in the final system was found with the system as a whole, and we did not perform a grid search to optimize the threshold for each component test.

3. Stop Removal - adds stop word removal as described in subsection 2.1.

4. Levenshtein - modifies $sim(w_i, w_j)$ for words not in WordNet. Rather than using a binary value for exact string matching, the Levenshtein measure shown in Equation (2) below is used. This allows for slight differences in spelling, plurality, tenses, etc. The measure requires a threshold parameter, $\beta$, which limits the maximum Levenshtein distance ($\delta$) and scales the Levenshtein measure between 0.0 and 1.0. $\beta = 2.0$ was found via a grid search to perform best. The Levenshtein measure turns out to be unnecessary for these tasks, most likely because, as noted in subsection 2.2, words not in WordNet tend to be proper nouns or abbreviations whose spelling is identical across sentences, and for short words such as "he" or "she", "is" or "in", even a small edit distance can transform the word into a completely unrelated one.

5. Word Sense Disambiguation (WSD) - should help to reduce noisy alignments by using the correct synset when computing the Vector relatedness measure. We used the entire sentence (all chunks) as input to SenseRelate::AllWords (Patwardhan et al., 2003). WSD improves results when used as a single component, but when used in combination with a threshold (Threshold + WSD) the results are worse than with a threshold alone. This is likely because both WSD and thresholding aim to reduce noisy STS scores and chunk alignments; used individually they both achieve this, but in combination WSD errors reduce performance.
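
One formulation of Equation (2) consistent with the description in item 4 (scaled between 0.0 and 1.0, with values beyond the threshold reduced to 0) is:

$$sim_{lev}(w_i, w_j) = \begin{cases} \dfrac{\beta - \delta}{\beta} & \text{if } \delta \le \beta \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where $\delta$ is the Levenshtein distance between the two words, and $\beta$ is the threshold used.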
Analysis of the test data indicated that the addition of these extra components was unnecessary; however, to further analyze their contributions, three runs were submitted for both Task 1 and Task 2. Run 1 used the basic system, run 2 eliminated the stop removal preprocessing step, and run 3 used the basic system with the Levenshtein measure described above. Test results were mixed and data set dependent; see the respective competition summaries (Agirre et al., 2016) for complete results.

Alignment Reasoning Component Analysis
For alignment reasoning, only the assignment of a label (Type) to a chunk pair is evaluated. We used the gold standard alignments provided for each data set, converted each gold standard chunk pair to its feature representation, and classified it; the results are shown in Table 5. The baseline score is calculated by simply assigning the most common class, EQUI.

Conclusions and Future Work
In future iterations, more analysis should be done to refine the features used in classification. Using JRIP and other analysis criteria, we can see why certain features are discriminative and develop more informative features. Rather than relying solely on the Levenshtein measure for words outside of WordNet, additional metrics, such as word2vec (Mikolov et al., 2013), could be incorporated.
Additional data should be added for training classifiers. The top performing classifier was generated from all data combined, indicating that additional samples are necessary. It is likely that, given more data, topic-specific classifiers will outperform the general classifier we evaluated. Additional data will also help to reduce the class imbalance and will likely result in a set of rules for the REL class.
Since VRep already makes use of WordNet, it could be easily expanded to compete in the polarity subtask by implementing a polarity classifier using SentiWordNet (Baccianella et al., 2010).