STS-UHH at SemEval-2017 Task 1: Scoring Semantic Textual Similarity Using Supervised and Unsupervised Ensemble

This paper reports the STS-UHH participation in the SemEval 2017 shared Task 1 on Semantic Textual Similarity (STS). Overall, we submitted 3 runs covering monolingual and cross-lingual STS tracks. Our participation involves two approaches: an unsupervised approach, which estimates a word alignment-based similarity score, and a supervised approach, which combines dependency graph similarity and coverage features with lexical similarity measures using regression methods. We also present a way of ensembling both models. Out of 84 submitted runs, our team's best multi-lingual run was ranked 12th in overall performance with a correlation of 0.61, and 7th among 31 participating teams.


Introduction
Semantic Textual Similarity (STS) measures the degree of semantic equivalence between a pair of sentences. Accurate estimation of semantic similarity would benefit many Natural Language Processing (NLP) applications such as textual entailment, information retrieval, paraphrase identification and plagiarism detection (Agirre et al., 2016). To support research efforts in STS, the SemEval STS shared task (Agirre et al., 2017) offers an opportunity to develop creative new sentence-level semantic similarity approaches and to evaluate them on benchmark datasets. Given a pair of sentences, the task is to provide a similarity score on a scale of 0 to 5 according to the extent to which the two sentences are considered semantically similar, with 0 indicating that the semantics of the sentences are completely unrelated and 5 signifying semantic equivalence. Final performance is measured by computing Pearson's correlation (ρ) between machine-assigned semantic similarity scores and gold standard scores provided by human annotators.
Since last year, the STS task has been extended to include additional subtasks for cross-lingual STS. Similar to the monolingual STS task, the cross-lingual task requires measuring the semantic similarity of two snippets of text that are written in different languages. In contrast to last year's edition (Agirre et al., 2016), the task is organized into 6 secondary sub-tracks and a primary track, whose score is the average of all secondary sub-track results. The secondary sub-tracks involve scoring similarity for monolingual sentence pairs in one language (Arabic, English, Spanish) and for cross-lingual sentence pairs from the combination of two different languages (Arabic-English, Spanish-English, Turkish-English).
Our paper proposes both supervised and unsupervised systems for automatically scoring the semantic similarity of monolingual and cross-lingual short sentences. The two systems are then combined with an average ensemble to strengthen the similarity scoring performance. Out of 84 submissions, our system is placed 12th with an overall primary score of 0.61.
* These authors contributed equally to this work.
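As an illustration of the evaluation measure mentioned above, the following minimal sketch computes Pearson's correlation between system scores and gold scores in plain Python. The function name `pearson` and the example scores are hypothetical and are not taken from the task data or the official evaluation script.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical gold and system scores for four sentence pairs (0 to 5 scale).
gold = [0.0, 1.5, 3.0, 5.0]
predicted = [0.4, 1.0, 3.6, 4.5]
print(round(pearson(gold, predicted), 3))
```

Because the measure is invariant to linear rescaling, a system is rewarded for ranking and spacing pairs consistently with the annotators rather than for matching the absolute 0 to 5 values.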

Related Work
Since 2012 (Agirre et al., 2012), the STS shared task has been one of the official shared tasks in SemEval and has attracted many researchers from the computational linguistics community (Agirre et al., 2017). Most state-of-the-art approaches focus on training regression models on traditional lexical surface overlap features. Recently, deep learning models have achieved very promising results in semantic textual similarity. The top three systems from STS 2016 used sophisticated deep learning-based models (Rychalska et al., 2016; Brychcín and Svoboda, 2016; Afzal et al., 2016). The highest correlation score was obtained by Rychalska et al. (2016). They proposed a textual similarity model that combines recursive auto-encoders (RAE) from deep learning with a WordNet award penalty, which helps to adjust the Euclidean distance between word vectors.

System Description
Our contribution to the STS shared task includes three different systems: supervised, unsupervised and a supervised-unsupervised ensemble. Our models are mainly developed to measure semantic similarity between monolingual sentences in English. For the cross-lingual tracks, we leverage the Google Translate API to automatically translate the other languages into English. In the following subsections, we describe our data preprocessing and present our three systems.

Data Preprocessing
We use all the datasets released since 2012 to train and evaluate our models. The total number of training examples is 14,619. We use the Stanford CoreNLP¹ pipeline to tokenize and dependency parse the dataset and to annotate it with lemmas, part-of-speech (POS) tags, and named entities (NE). Stopwords are removed for the purpose of topic modeling and TfIdf computation.

Unsupervised Model
Inspired by Sultan et al. (2015) and Brychcín and Svoboda (2016), our unsupervised solution calculates a similarity score based on the alignment of the input pair of sentences. As presented in Figure 1, given a pair of sentences S1, S2, the alignment task builds a set of matched pairs of words match(w_i, w_j), where w_i is a word in sentence S1 and w_j is a word in sentence S2. Each matched pair has a score in the range [0, 1]. This matching score indicates the strength of the semantic similarity between the aligned pair of words, with 1 representing the highest similarity match.
As shown in Figure 2, after preprocessing, the system starts by matching identical words (lemmas) and words that share a similar WordNet hierarchy (synonyms, hyponyms, and hypernyms). We treat both types of alignment as exact matches with score 1. As a last step of the alignment process, we handle the words that have not been matched in the preceding steps. The solution uses GloVe word embeddings (Pennington et al., 2014) to calculate the matching score. The GloVe model (840B tokens, 2.2M vocabulary) represents each word as a 300-dimensional vector. We calculate the cosine distance between the unmatched words and all the words in the other sentence. Using a greedy strategy, we pick the best match for each word. The global similarity is calculated from the weighted match scores, as shown in equation (1).
Here w_i ranges over all words in S1 or S2, and match(w_i, w_j) is the best match score for w_i with a word w_j from the other sentence. TfIdf(S1, S2) is the sum of the term frequency-inverse document frequency (TfIdf) values of the words in S1 and S2. The final alignment score lies in [0, 1], so we scale it to the [0, 5] range.
¹ http://stanfordnlp.github.io/CoreNLP/
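The alignment procedure described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the system's actual implementation: the global score is assumed to be the TfIdf-weighted sum of best-match scores divided by the sum of TfIdf weights; the `related` set is a hypothetical stand-in for WordNet synonym, hyponym and hypernym lookups; and the two-dimensional vectors and default weights are toy values replacing GloVe embeddings and corpus-based TfIdf.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(word, others, vectors, related):
    """Best match score in [0, 1] for `word` against the words in `others`."""
    best = 0.0
    for other in others:
        # Identical lemma or WordNet-style relation: exact match, score 1.
        if word == other or frozenset((word, other)) in related:
            return 1.0
        if word in vectors and other in vectors:
            # Greedy choice of the closest embedding; negatives are clamped to 0.
            best = max(best, min(1.0, cosine(vectors[word], vectors[other])))
    return best


def alignment_score(s1, s2, vectors, related, weight):
    """Assumed form: sum(w * match) / sum(w) over both sentences, scaled to [0, 5]."""
    total = 0.0
    matched = 0.0
    for sentence, other in ((s1, s2), (s2, s1)):
        for word in sentence:
            w = weight.get(word, 1.0)  # hypothetical TfIdf weight, default 1.0
            total += w
            matched += w * best_match(word, other, vectors, related)
    return 5.0 * matched / total if total else 0.0


# Toy, hand-made inputs (lemmas with stopwords already removed).
s1 = ["man", "play", "guitar"]
s2 = ["person", "play", "instrument"]
related = {frozenset(("man", "person"))}
vectors = {"guitar": [1.0, 0.0], "instrument": [0.8, 0.6]}
print(alignment_score(s1, s2, vectors, related, weight={}))
```

With uniform weights the score reduces to the mean best-match score; realistic TfIdf weights would let rare, informative words dominate the result.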

Supervised Model
To generate our supervised model, we extract the following features:
I Bag-of-Words: For each sentence, a |V|-dimensional vector is generated, where V includes the unique vocabulary from both sentences. Each entry of a vector corresponds to the frequency of the word in the respective sentence. The cosine similarity between these vectors serves as a feature.
II Distributional Thesaurus (DTs) Expansion Feature: Each non-stopword is expanded to its 10 most similar words using the API for the Distributional Thesaurus (DTs) by Biemann and Riedl (2013).
III POS Tags Longest Common Subsequence: We measure the length of the longest common subsequence of POS tags between sentence pairs. Additionally, we normalize this length by dividing it by the total number of tokens in each sentence separately.
IV Topic Similarity Feature: To model the topical similarity between two documents, we use a Latent Dirichlet Allocation (LDA; Blei et al., 2003) model trained on a recent Wikipedia dump. To guarantee the stability of the topic distribution, we run LDA for 100 repeated inferences. Then, for each token, we assign the most frequent topic ID (Riedl and Biemann, 2012).
V Dependency-Graph Features: Following Kohail (2015), each sentence S is converted into a graph using dependency relations obtained from the parser. We define the dependency graph G_S = {V_S, E_S}, where the graph vertices V_S = {w_1, w_2, ..., w_n} represent the tokens in a sentence, and E_S is a set of edges. Each edge e_iy represents a directed dependency relation between w_i and w_y. We calculate TfIdf on three levels and weight our dependency graph using the following conditions:
Word TfIdf: only words that satisfy the condition TfIdf(w_i) > α_1 are considered.
Pair TfIdf: word pairs are filtered based on the condition TfIdf(w_i, w_y) > α_2.
Triplet TfIdf: only triples (a word pair and its relation) that satisfy the condition TfIdf(w_i, w_y, e_iy) > α_3 are considered.
Similarities are then measured on three levels by representing each sentence as a vector of words, pairs and triples, where each entry in a vector is weighted using TfIdf. We used New York Times articles from the years 2004-2006 as a background corpus for TfIdf calculation.
VI Coverage Features: As a text gets longer, term frequency factors increase, and thus a high similarity score is likelier for longer texts than for shorter ones. Coverage features measure the number of one-to-one token, edge and relation correspondences between the dependency graphs of a pair of sentences, as described by Kohail and Biemann (2017).
VII NE Similarity: We measure similarity based on the named entities shared by the two texts.
VIII Unsupervised Dependency Alignment Score: Using GloVe word embeddings, we include the score of the cosine similarity between the syntactic heads of the matched words aligned in the unsupervised model (Sec. 3.2), as presented in equation (2).
For every w_i in S1 or S2, we calculate the weighted cosine similarity between the syntactic dependency head of w_i and the syntactic head of its matched word w_j.
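Two of the simpler features listed above, the bag-of-words cosine (I) and the POS-tag longest common subsequence (III), can be sketched as follows. This is a minimal illustration in plain Python; the function names are hypothetical, and the inputs are assumed to be already tokenized and POS-tagged.

```python
from collections import Counter
from math import sqrt


def bow_cosine(tokens1, tokens2):
    """Cosine similarity between the word-frequency vectors of two sentences."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    vocab = set(c1) | set(c2)  # V: unique vocabulary of both sentences
    dot = sum(c1[t] * c2[t] for t in vocab)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pos_lcs_features(tags1, tags2):
    """Raw LCS length plus the length normalized by each sentence's token count."""
    length = lcs_length(tags1, tags2)
    ratio1 = length / len(tags1) if tags1 else 0.0
    ratio2 = length / len(tags2) if tags2 else 0.0
    return length, ratio1, ratio2


# Hypothetical tokens and Penn Treebank-style tags for two short sentences.
print(bow_cosine(["a", "dog", "barks"], ["a", "big", "dog", "barks"]))
print(pos_lcs_features(["DT", "NN", "VBZ"], ["DT", "JJ", "NN", "VBZ"]))
```

The two normalized LCS values differ when the sentences have different lengths, which gives the regressor a cue about asymmetric structural overlap.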
These features are fed into three different regression methods: a Multilayer Perceptron (MLP) neural network, Linear Regression (LR) and a Regression Support Vector Machine (RegSVM). To evaluate our models before testing, we perform 10-fold cross-validation.
Table 1: Results obtained in terms of Pearson correlation for the three runs on all six sub-tracks, in comparison with the baseline and the top correlation obtained in each track. The primary score represents the weighted mean correlation. Ens.* represents the results after adding the expansion and topic modeling features.
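The 10-fold cross-validation used for pre-testing can be sketched as below. This is only an illustration under stated assumptions: a single-feature least-squares line (`fit_line`, a hypothetical helper) stands in for the MLP, LR and RegSVM regressors, and the folds are contiguous and unshuffled, which may differ from the setup actually used.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    start = 0
    for fold in range(k):
        # The first n_items % k folds receive one extra item.
        size = n_items // k + (1 if fold < n_items % k else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        yield train, test
        start = stop


def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature (stand-in regressor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # constant feature: predict the mean score
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def cross_validated_predictions(xs, ys, k=10):
    """Predict each item's score with a model trained on the other folds."""
    predictions = [0.0] * len(xs)
    for train, test in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predictions[i] = slope * xs[i] + intercept
    return predictions
```

The held-out predictions can then be compared with the gold scores using Pearson's correlation, mirroring the official evaluation measure.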

Ensembling Supervised and Unsupervised models
We create an ensemble model by averaging the predictions of the supervised and unsupervised models.
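A minimal sketch of such an average ensemble is given below. The function name `ensemble_average` is hypothetical, and clamping the result to the [0, 5] range is an added assumption (a regressor may produce values slightly outside the scale), not something stated for the submitted system.

```python
def ensemble_average(supervised, unsupervised, low=0.0, high=5.0):
    """Average two lists of similarity scores pair by pair."""
    if len(supervised) != len(unsupervised):
        raise ValueError("both models must score the same sentence pairs")
    averaged = []
    for s, u in zip(supervised, unsupervised):
        mean = (s + u) / 2.0
        # Assumption: keep the final score inside the task's 0 to 5 scale.
        averaged.append(min(high, max(low, mean)))
    return averaged


# Hypothetical predictions of the two models for three sentence pairs.
print(ensemble_average([4.0, 1.0, 5.4], [3.0, 2.0, 5.0]))
```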

Experimental Results
We report our results in Table 1. Overall we submitted 3 runs: Run1 uses the unsupervised approach discussed earlier in Sec. 3.2, Run2 uses a supervised MLP neural network trained as described in Sec. 3.3, and Run3 uses the ensemble average system described in Sec. 3.4. Due to time constraints and technical issues, only the evaluation for the English monolingual track was provided. Additionally, we were not able to compute the topic modeling and expansion features. We included the missing features after the task deadline. Final ensemble results are given under Ens.*. According to the results, we can make the following observations:
• Our results outperform the baseline provided by the task organizers for the monolingual tracks by a large margin.
• The ensemble outperforms the individual ensemble members.
• Results obtained in the monolingual tracks, especially English, are markedly higher than in the cross-lingual tracks. This might be due to noise introduced by the automatic translation.
• Results for track 4b appear to be considerably worse than those for the other tracks. In addition to the machine translation accuracy challenge, the difficulty of this track lies in its longer sentences, which have less informative surface overlap than the sentences in other tracks.

Conclusion
We have presented and discussed our results on the task of Semantic Textual Similarity (STS). We have shown that combining supervised and unsupervised models in an ensemble provides better results than using each in isolation. 31 teams participated in the task with 84 runs. Our best system achieves an overall mean Pearson's correlation of 0.61, ranking 7th among all teams and 12th among all submissions. Future work includes building a truly multi-lingual model by projecting phrases from different languages into the same embedding space. In the current solution, we consider hyponyms and hypernyms as synonyms, and the system gives an exact match score to these word pairs. In the future, we plan to assign dynamically calculated scores to such alignments so that they are not treated as equal to exact matches.