DT_Team at SemEval-2017 Task 1: Semantic Similarity Using Alignments, Sentence-Level Embeddings and Gaussian Mixture Model Output

We describe our system (DT Team) submitted to SemEval-2017 Task 1, the Semantic Textual Similarity (STS) challenge for English (Track 5). We developed three different models with various features, including similarity scores calculated using word and chunk alignments, word/sentence embeddings, and a Gaussian Mixture Model (GMM). The correlation between our system's output and the human judgments was up to 0.8536, which is more than 10% above the baseline and almost as good as the best-performing system at 0.8547 (a difference of only about 0.1%). Our system also produced leading results when evaluated on a separate STS benchmark dataset. The features based on word alignment and sentence embeddings were found to be very effective.


Introduction
Measuring Semantic Textual Similarity (STS) means quantifying the semantic equivalence between a given pair of texts (Banjade et al., 2015; Agirre et al., 2015). For example, a similarity score of 0 means that the texts are not similar at all, while a score of 5 means that they have the same meaning. In this paper, we describe our system DT Team and the three different runs that we submitted to this year's SemEval shared task on the STS English track (Track 5; Agirre et al. (2017)). We applied Support Vector Regression (SVR), Linear Regression (LR) and Gradient Boosting Regressor (GBR) with various features (see § 3.4) in order to predict the semantic similarity of the texts in a given pair. We also report the results of our models when evaluated on a separate STS benchmark dataset created recently by the STS task organizers.

Preprocessing
The preprocessing step involved tokenization, lemmatization, POS tagging, named-entity recognition and normalization (e.g., pc, pct and % are all normalized to pc). The preprocessing steps were the same as in our DTSim system.

Feature Generation
We generated various features, including similarity scores produced by different methods. We describe next the word-to-word and sentence-to-sentence similarity methods used in our system.

Word Alignment Method
We lemmatized all content words and aligned them optimally using the Hungarian algorithm (Kuhn, 1955) implemented in the SEMILAR Toolkit (Rus et al., 2013). The process is the same as finding the maximum weight matching in a weighted bipartite graph. The nodes are words and the weights are the similarity scores between the word pairs, computed as described in § 3.1. In order to avoid noisy alignments, we reset similarity scores below 0.5 (an empirically set threshold) to 0. The similarity score was computed as the sum of the scores for all aligned word pairs divided by the total length of the given sentence pair.
In some cases, we also applied a penalty for unaligned words, which we describe in § 3.3.
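The alignment step can be sketched as follows. This is a minimal illustration, not the SEMILAR implementation: it finds the maximum weight one-to-one matching by exhaustive search, which gives the same optimum as the Hungarian algorithm but is only practical for short sentences. The function name `align_words` is hypothetical, and reading "total length of the sentence pair" as |A| + |B| is an assumption.

```python
from itertools import permutations


def align_words(sim, threshold=0.5):
    """Align words of A (rows) with words of B (columns).

    sim[i][j] is the word-to-word similarity between word i of A and
    word j of B. Returns (aligned index pairs, sentence similarity).
    """
    n_a, n_b = len(sim), len(sim[0])
    # Reset noisy similarities below the threshold to 0.
    w = [[s if s >= threshold else 0.0 for s in row] for row in sim]
    # Put the shorter sentence on the rows so every row gets one column.
    transposed = n_a > n_b
    if transposed:
        w = [list(col) for col in zip(*w)]
    rows, cols = len(w), len(w[0])
    best_total, best_perm = -1.0, None
    # Exhaustive search over one-to-one assignments (stand-in for the
    # Hungarian algorithm; same optimum, exponential cost).
    for perm in permutations(range(cols), rows):
        total = sum(w[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    # Keep only pairs that survived thresholding.
    pairs = [(i, j) for i, j in enumerate(best_perm) if w[i][j] > 0.0]
    if transposed:
        pairs = [(j, i) for i, j in pairs]
    # Assumption: "total length" of the pair is |A| + |B|.
    score = best_total / (n_a + n_b)
    return pairs, score


pairs, score = align_words([[0.9, 0.2], [0.4, 0.8]])
print(pairs, round(score, 3))
```

For realistic sentence lengths, a polynomial-time assignment solver would replace the exhaustive loop without changing the result.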

Interpretable Similarity Method
We aligned chunks across sentence pairs and labeled the alignments, such as Equivalent or Specific as described in . Then, we computed the interpretable semantic score as in the DTSim system.

Gaussian Mixture Model Method
Similar to the GMM model we proposed for assessing open-ended student answers (Maharjan et al., 2017), we represented the sentence pair as a feature vector consisting of feature sets {7, 8, 9, 10, 14} from § 3.4 and modeled the semantic equivalence levels [0, 5] as multivariate Gaussian densities of feature vectors. We then used the GMM to compute membership weights for each of these semantic levels for a given sentence pair. Finally, the GMM score is computed as:
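The scoring formula is not reproduced in this text. One natural reading, assumed in the sketch below, is that the GMM score is the membership-weighted average of the six levels 0 to 5. The function name `gmm_score` and the normalization by the sum of weights are assumptions for illustration only.

```python
def gmm_score(membership_weights):
    """Membership-weighted average of the semantic levels 0..5.

    membership_weights[k] is the (non-negative) weight with which the
    sentence pair belongs to level k. Assumed formula, not confirmed
    by the source.
    """
    if len(membership_weights) != 6:
        raise ValueError("expected one weight per level 0..5")
    total = sum(membership_weights)
    if total <= 0:
        raise ValueError("membership weights must sum to a positive value")
    weighted = sum(level * w for level, w in enumerate(membership_weights))
    return weighted / total


# A pair split evenly between levels 3 and 4.
print(gmm_score([0.0, 0.0, 0.0, 0.5, 0.5, 0.0]))
```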

Compositional Sentence Vector Method
We used both the Deep Structured Semantic Model (DSSM; Huang et al. (2013)) and the DSSM with convolutional pooling (CDSSM; Shen et al. (2014)) in the Sent2vec tool to generate continuous vector representations for the given texts. We then computed the similarity score as the cosine similarity of their representations.
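For reference, the cosine similarity between two sentence vectors can be computed as in the short sketch below. The vectors here are toy values rather than Sent2vec output, and returning 0 for a zero vector is an assumption made to keep the function total.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: undefined case mapped to 0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```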

Tuned Sentence Representation Based Method
We first obtained the continuous vector representations $V_A$ and $V_B$ for the sentence pair A and B using the Sent2vec DSSM or CDSSM models or the skip-thought model. Inspired by Tai et al. (2015), we then represented the sentence pair by the features formed by concatenating the element-wise product $V_A \odot V_B$ and the absolute difference $|V_A - V_B|$. We used these features in our logistic regression model, which produces the output $\hat{p}_\theta$. Then, we predicted the similarity between the texts in the target pair as $\hat{y} = r^T \hat{p}_\theta$, where $r^T = [1, 2, 3, 4, 5]$ is the ordinal scale of similarity. To enforce that $\hat{y}$ is close to the gold rating $y$, we encoded $y$ as a sparse target distribution $p$ such that $y = r^T p$:

$$p_i = \begin{cases} y - \lfloor y \rfloor, & i = \lfloor y \rfloor + 1 \\ \lfloor y \rfloor - y + 1, & i = \lfloor y \rfloor \\ 0, & \text{otherwise} \end{cases}$$

where $1 \le i \le 5$ and $\lfloor y \rfloor$ is the floor operation. For instance, $y = 3.2$ gives the sparse $p = [0\ 0\ 0.8\ 0.2\ 0]$. To build the logistic model, we used the training data set from our previous DTSim system and used the image test data from STS-2014 and STS-2015 as the validation data set.
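The pair features and the sparse target encoding can be illustrated with the sketch below. It follows the encoding of Tai et al. (2015), which is consistent with the worked example y = 3.2; the function names are hypothetical, and rejecting ratings below 1 is an assumption that follows from the scale r = [1, ..., 5].

```python
import math


def pair_features(v_a, v_b):
    """Concatenate the element-wise product and absolute difference."""
    product = [a * b for a, b in zip(v_a, v_b)]
    abs_diff = [abs(a - b) for a, b in zip(v_a, v_b)]
    return product + abs_diff


def encode_target(y, num_classes=5):
    """Encode a gold rating y in [1, num_classes] as a sparse distribution."""
    if not 1 <= y <= num_classes:
        raise ValueError("rating must lie on the ordinal scale 1..num_classes")
    p = [0.0] * num_classes
    low = math.floor(y)
    p[low - 1] = low - y + 1      # mass on class floor(y)
    if low < num_classes:
        p[low] = y - low          # remaining mass on class floor(y) + 1
    return p


def decode_prediction(p):
    """Expected rating r^T p for the scale r = [1, 2, ..., len(p)]."""
    return sum((i + 1) * p_i for i, p_i in enumerate(p))


p = encode_target(3.2)
print([round(x, 3) for x in p], round(decode_prediction(p), 3))
```

By construction, decoding an encoded rating recovers the rating, so a model whose predicted distribution matches the target also reproduces the gold score.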

Similarity Vector Method
We generated a vocabulary $V$ of unique words from the given sentence pair (A, B). Then, we generated sentence vectors as follows: Otherwise, $w_{ia}$ is the maximum similarity between $word_i$ and any of the words in A, computed as $w_{ia} = \max_{1 \le j \le |A|} \mathrm{sim}(w_j, word_i)$. Here, $\mathrm{sim}(w_j, word_i)$ is the cosine similarity score computed using the word2vec model. Similarly, we compute $V_B$ from sentence B.
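A minimal sketch of the similarity vector construction follows. The first case of the definition is not given in this text; the sketch assumes that the entry is 1 when the vocabulary word occurs in the sentence, which is an assumption and not confirmed by the source. The similarity function `toy_sim` is a hypothetical stand-in for word2vec cosine similarity.

```python
def build_vocabulary(sentence_a, sentence_b):
    """Unique words of both sentences, in order of first occurrence."""
    vocab = []
    for word in sentence_a + sentence_b:
        if word not in vocab:
            vocab.append(word)
    return vocab


def similarity_vector(vocab, sentence, sim):
    """One entry per vocabulary word for the given sentence."""
    vector = []
    for word in vocab:
        if word in sentence:
            vector.append(1.0)  # assumption: elided first case
        else:
            # Maximum similarity to any word of the sentence.
            vector.append(max(sim(w, word) for w in sentence))
    return vector


# Hypothetical toy similarity table standing in for word2vec.
TOY_SIMILARITY = {frozenset(("dog", "puppy")): 0.8}


def toy_sim(w1, w2):
    return TOY_SIMILARITY.get(frozenset((w1, w2)), 0.0)


a, b = ["dog", "runs"], ["puppy", "runs"]
vocab = build_vocabulary(a, b)
print(similarity_vector(vocab, a, toy_sim), similarity_vector(vocab, b, toy_sim))
```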

Weighted Resultant Vector Method
We combined word2vec word representations to obtain sentence-level representations through vector algebra. We weighted the word vectors corresponding to content words. We generated the resultant vector for A as $R_A = \sum_{i=1}^{|A|} \theta_i \cdot word_i$, where the weight $\theta_i$ for $word_i$ was chosen by its part of speech: noun = 1.0, verb = 1.0, adj = 0.2, adv = 0.4, others (e.g., number) = 1.0. Similarly, we computed the resultant vector $R_B$ for text B. The weights were set empirically from training data. We then computed a similarity score as the cosine of $R_A$ and $R_B$. Finally, we penalized the similarity score by the unalignment score (see § 3.3).
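The weighted sum can be sketched as below, using the part-of-speech weights stated above. The embeddings are two-dimensional toy vectors rather than word2vec output, the tag names are hypothetical labels, and skipping out-of-vocabulary words is an assumption.

```python
import math

# Weights from the text; any other tag (e.g. number) gets 1.0.
POS_WEIGHTS = {"noun": 1.0, "verb": 1.0, "adj": 0.2, "adv": 0.4}
OTHER_WEIGHT = 1.0


def resultant_vector(tagged_words, embeddings):
    """Weighted sum of the word vectors of (word, pos) pairs."""
    dim = len(next(iter(embeddings.values())))
    result = [0.0] * dim
    for word, pos in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: words without an embedding are skipped
        theta = POS_WEIGHTS.get(pos, OTHER_WEIGHT)
        for k, x in enumerate(vec):
            result[k] += theta * x
    return result


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


toy_embeddings = {"dog": [1.0, 0.0], "fast": [0.0, 1.0], "puppy": [0.9, 0.1]}
r_a = resultant_vector([("fast", "adj"), ("dog", "noun")], toy_embeddings)
r_b = resultant_vector([("puppy", "noun")], toy_embeddings)
print(r_a, round(cosine(r_a, r_b), 3))
```

With these weights, the adjective contributes only a fifth of its vector, so the nouns dominate the direction of the resultant.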

Penalty
We applied the following two penalization strategies to adjust the sentence-to-sentence similarity score. Note that only certain similarity scores used as features of our regression models were penalized; we did not penalize the scores produced by our final models. Unless otherwise specified, similarity scores were not penalized.

Crossing Score
Crossing measures the spread of the distance between the aligned words in a given sentence pair. In most cases, sentence pairs with a higher degree of similarity have aligned words in the same or neighboring positions. We define crossing crs as: where aligned($w_i$, $w_j$) indicates that word $w_i$ at index $i$ in A and word $w_j$ at index $j$ in B are aligned. Then, the similarity score was reset to 0.3 if crs > 0.7. The threshold of 0.7 was set empirically based on evaluations using the training data.
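The crossing formula is not reproduced in this text. The sketch below assumes one plausible definition: the mean absolute index distance between aligned words, normalized by the length of the longer sentence. This definition and the function names are assumptions for illustration; only the reset rule (score set to 0.3 when crs > 0.7) comes from the text.

```python
def crossing_score(alignments, len_a, len_b):
    """Assumed crossing: mean |i - j| over aligned pairs, scaled to [0, 1)."""
    if not alignments:
        return 0.0
    spread = sum(abs(i - j) for i, j in alignments)
    return spread / (max(len_a, len_b) * len(alignments))


def apply_crossing_penalty(similarity, crs, threshold=0.7, reset_value=0.3):
    """Reset the similarity score when the alignment crosses too much."""
    return reset_value if crs > threshold else similarity


# Words aligned in place: no crossing, score unchanged.
print(apply_crossing_penalty(0.9, crossing_score([(0, 0), (1, 1)], 2, 2)))
# First and last words swapped in 4-word sentences: heavy crossing.
print(apply_crossing_penalty(0.9, crossing_score([(0, 3), (3, 0)], 4, 4)))
```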

Unalignment Score
We define the unalignment score similarly to the alignment score (see § 3.2.1), but this time the score is calculated using the unaligned words in both A and B as: $\text{unalign\_score} = \frac{|A| + |B| - 2 \cdot (\#\text{alignments})}{|A| + |B|}$. Then, the similarity score was penalized as follows: where the weight 0.4 was empirically chosen.
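The unalignment score follows directly from the definition above. The penalization formula is not reproduced in this text, so the multiplicative discount in `penalize` below is an assumption; only the weight 0.4 comes from the text. Both function names are hypothetical.

```python
def unalignment_score(len_a, len_b, num_alignments):
    """Fraction of words in A and B that are left unaligned."""
    total = len_a + len_b
    if total == 0:
        return 0.0
    return (total - 2 * num_alignments) / total


def penalize(similarity, unalign_score, weight=0.4):
    """Assumed penalty: discount the score in proportion to unalignment."""
    return similarity * (1.0 - weight * unalign_score)


u = unalignment_score(len_a=4, len_b=6, num_alignments=3)
print(u, round(penalize(0.8, u), 3))
```

Under this assumed form, a fully aligned pair keeps its score and a pair with no alignments loses 40% of it.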

Feature Selection
We generated and experimented with many features. We describe here only those features used directly or indirectly by our three submitted runs, which we describe in § 4. Unless stated otherwise, we used word2vec representations together with WordNet synonyms and antonyms for word similarity.
14. {mmr_t}: min-to-max ratio $C_{t1} / C_{t2}$, where $C_{t1}$ and $C_{t2}$ are the counts of type $t \in$ {all, adjectives, adverbs, nouns, verbs} for the shorter text 1 and the longer text 2, respectively.
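The ratio feature in item 14 can be sketched as follows for POS-tagged token lists. The tag labels, the feature naming, and returning 0 when the longer text has no token of a type are assumptions for illustration.

```python
COUNTED_TYPES = ("adj", "adv", "noun", "verb")


def mmr_features(tagged_a, tagged_b):
    """Count ratios (shorter text over longer text) per token type."""
    shorter, longer = sorted((tagged_a, tagged_b), key=len)
    features = {}
    for t in ("all",) + COUNTED_TYPES:
        if t == "all":
            c1, c2 = len(shorter), len(longer)
        else:
            c1 = sum(1 for _, pos in shorter if pos == t)
            c2 = sum(1 for _, pos in longer if pos == t)
        # Assumption: ratio is 0 when the longer text has no such token.
        features["mmr_" + t] = c1 / c2 if c2 else 0.0
    return features


a = [("dog", "noun"), ("runs", "verb")]
b = [("the", "other"), ("quick", "adj"), ("dog", "noun"), ("runs", "verb")]
print(mmr_features(a, b))
```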

Model Development
Training Data. We used data released in previous shared tasks (see Table 1) for model development (see § 5 for STS benchmarking).

Models and Runs. Using combinations of the features described in § 3.4, we built three different models corresponding to the three submitted runs (R1-3).

R1. Linear SVM regression model (SVR; ε = 0.1, C = 1.0) with a set of 7 features: overlap_pen, ppdb_wa_pen_ua, dssm, dssm_lr, noali, abs_diff_all_tkns, mmr_all_tkns.
R2. Linear regression model (LR; default Weka settings) with a set of 8 features: dssm, cdssm, gmm, res_vec, skipthought_lr, sim_vec, aligned, noun_wa.
R3. Gradient boosted regression model (GBR; estimators = 1000, max depth = 3), which adds 3 features to the feature set used in R2: w2v_wa, ppdb_wa, overlap.

We used the SVR and LR models in Weka 3.6.8 and the GBR model from the sklearn Python library. We evaluated our models on the training data using 10-fold cross-validation. The correlation scores on the training data were 0.797, 0.816 and 0.845 for R1, R2, and R3, respectively.

Table 2 presents the correlation (r) of our system outputs with human ratings on the evaluation data (250 sentence pairs from the Stanford Natural Language Inference data (Bowman et al., 2015)). The correlation scores of all three runs are 0.83 or above, on par with the top-performing systems. All of our systems outperform the baseline by a large margin of more than 10%. Interestingly, the R1 system is on par with the 1st-ranked system, differing by a very small margin of 0.0011 (<0.2%).

Figure 1 plots the R1 system output against human judgments (gold scores). It shows that our system predicts relatively better for similarity scores between 3 and 5, while it slightly overshoots for gold ratings in the range of 0 to 2. In general, our system works well across all similarity levels.
Figure 1: R1 system output in evaluation data plotted against human judgments (in ascending order).
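The correlation r reported in this section can be computed as in the sketch below, assuming r denotes the Pearson correlation coefficient, the standard measure in the STS shared tasks. The scores in the example are made-up toy values, not system output.

```python
import math


def pearson_r(predicted, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0.0 or var_g == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_g)


# Toy scores for illustration only.
print(round(pearson_r([4.6, 1.2, 3.1, 0.4], [5.0, 1.0, 3.5, 0.0]), 4))
```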
above when compared with gold scores in test data. In Table 3, we list only those features having correlations of 0.8 or above. Similarity scores computed using the word alignment and compositional sentence vector methods were the most predictive features.

STS Benchmark (Agirre et al., 2017). We also evaluated our models on a benchmark dataset which consists of 1379 pairs and was created by the task organizers. We trained our three runs on the benchmark training data under identical settings. We used the benchmark development data only for generating the features from § 3.2.5 (as the validation dataset). The correlation scores for the R1, R2 and R3 systems were 0.800, 0.822 and 0.830 on Dev and 0.755, 0.787 and 0.792 on Test. All of our systems outperformed the best baseline benchmark system (Dev = 0.77, Test = 0.72). Interestingly, R3 performed best and R1 performed worst among the three. As such, generalization was found to improve with an increasing number of features (7, 8 and 11 features for R1, R2 and R3, respectively).

Conclusion
We presented our DT Team system submitted to SemEval-2017 Task 1. We developed three different models using SVM regression, linear regression and gradient boosted regression to predict textual semantic similarity. Overall, the outputs of our models correlate highly with human ratings (correlation up to 0.85 on the STS 2017 test data and up to 0.792 on the benchmark data). Indeed, our methods yielded highly competitive results.