NUIG-UNLP at SemEval-2016 Task 1: Soft Alignment and Deep Learning for Semantic Textual Similarity

We present a multi-feature system for computing the semantic similarity between two sentences. We introduce the use of soft alignment for computing text similarity, and also evaluate different methods to produce it. The main features used by our system are based on alignment and Explicit Semantic Analysis. Our system was above the median scores for 4 out of the 5 datasets at SemEval 2016 STS Task 1.


Introduction
Semantic textual similarity is the task of deciding whether two sentences express a similar or identical meaning, and achieving high performance requires a deep understanding of each sentence and its meaning. Recent successful approaches to this problem have been based on the idea of creating monolingual alignments (Sultan et al., 2014a) indicating which words in the two sentences correspond to each other. This works well when many words share a lemma; however, when synonymous or semantically similar terms are used, it is much harder to construct an alignment. For this reason, we propose the use of soft alignments: instead of producing a hard linking between individual words, we produce a score indicating how likely a word in one sentence is to be aligned to a word in the other sentence. We examine several methods that can be used to learn these alignments, including word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and deep learning models that have been suggested for machine translation. In addition, we look into recent models for sentence and document similarity that can leverage large amounts of loosely aligned text, in particular those based on Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) and recent extensions aimed at generating orthogonal representations (McCrae et al., 2013; Aggarwal et al., 2015). While these novel techniques alone can achieve high performance on the task, we note that simple metrics such as the number of overlapping terms can also produce reasonable performance. For added robustness, we combine features based on simple metrics with the novel methods explored in this work in a multi-feature regression framework, which we solve by means of an M5 decision tree (Wang and Witten, 1996; Quinlan, 1992). The rest of the paper is structured as follows: we present our system in Section 2.
We then present both our internal evaluation results and the official Task 1 results in Section 3 and finally we conclude in Section 4.

Methods
For convenience, we assume that semantic textual similarity consists of finding a function that maps two strings, a and b, of lengths n_a and n_b, to a single value y ∈ [0, 1]. We will use A to denote the set of words in a and B for the set of words in b. We assume we have a dataset D consisting of triples of the form (a_i, b_i, y_i).

Baseline features
The basic level of our system is the construction of baseline features that can be evaluated quickly and cheaply to find candidate pairs that are likely to be highly similar, which may be useful in search applications. Our features were partially based on those of Hänig et al. (2015), but simplified so that we do not require a part-of-speech tagger. The features we used are as follows:

Longest Common Subsequence: The length in tokens of the longest common subsequence between the two sentences.

n-gram Overlap: The number of n-grams that occur in both sentences divided by the length of the shorter sentence.
Jaccard: The Jaccard index of the sentences using a bag-of-words model (|A ∩ B| / |A ∪ B|).

Containment: The containment of the sentences using a bag-of-words model (|A ∩ B| / min(|A|, |B|)).
Average Word Length Ratio: The average length of the words, using the symmetric ratio as above.
Greedy String Tiling: As in Wise (1993), we used Arun Kumar Jayapal's implementation 1, where similarity is given as

    sim(A, B) = (2 × coverage) / (|A| + |B|)

where coverage is the number of tokens covered by the tiling.
Source and Target Length: The length (in tokens) of each of the two strings.

Keypairs: For each word pair (a, b) where a ∈ A and b ∈ B, we calculated a score λ_{a,b}. We took only the word pairs with the 20 highest absolute values of λ_{a,b}; the feature consisted of the number of occurrences of these keypairs.

1 https://github.com/arunjeyapal/GreedyStringTiling
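The simpler bag-of-words features above can be sketched directly. The following is a minimal illustration; the helper names are ours, not taken from the system's codebase:

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard index over bag-of-words sets: |A ∩ B| / |A ∪ B|."""
    A, B = set(a_tokens), set(b_tokens)
    return len(A & B) / len(A | B)

def containment(a_tokens, b_tokens):
    """Containment: |A ∩ B| / min(|A|, |B|)."""
    A, B = set(a_tokens), set(b_tokens)
    return len(A & B) / min(len(A), len(B))

def ngram_overlap(a_tokens, b_tokens, n=2):
    """Shared n-grams divided by the token length of the shorter sentence."""
    a_ngrams = {tuple(a_tokens[i:i + n]) for i in range(len(a_tokens) - n + 1)}
    b_ngrams = {tuple(b_tokens[i:i + n]) for i in range(len(b_tokens) - n + 1)}
    return len(a_ngrams & b_ngrams) / min(len(a_tokens), len(b_tokens))
```

Note that the n-gram overlap description is ambiguous about whether the denominator counts tokens or n-grams; the sketch assumes tokens.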

Hard alignment
As we believe that hard alignments are also useful, we included features from hard alignment, firstly using a model based on Sultan et al. (2014b)'s system. We simplified this method, using only the Word Similarity Aligner (wsAlign) and Named Entity Aligner (neAlign) components of Sultan's method; we found that this agreed with Sultan's implementation 2 to an F-measure of 93.8%, and our observations and internal results suggested that the differences in alignments did not correspond to obvious improvements in alignment accuracy. 3 We also use the alignments given by the Jacana aligner (Yao et al., 2013) 4 directly as further alignments in our system. Jacana is a discriminatively trained monolingual word aligner that uses a Conditional Random Field (CRF) model to globally decode the best alignment. It uses features based on WordNet and part-of-speech tags.

WordSim
Semantic relatedness measures can be used directly to compute soft alignments between the sentences. In this approach, we compare pre-trained neural word embeddings to compute the relatedness between words across the two sentences, thus producing the soft alignment matrix. We use cosine similarity for this purpose, with the neural embeddings 5 developed by Baroni et al. (2014). For each word pair (a, b) where a ∈ A and b ∈ B, let a and b denote the neural word embeddings of a and b respectively. The soft alignment can then be defined formally as a matrix S of size n_a × n_b, where s_ij = a · b gives the similarity between the (normalised) word vectors.
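Under this definition, the soft-alignment matrix is simply a matrix of cosine similarities between the two sentences' word vectors. A minimal sketch, assuming the embeddings are given as one row per word:

```python
import numpy as np

def soft_alignment(A_vecs, B_vecs):
    """Cosine-similarity matrix S of size n_a x n_b between word embeddings.

    A_vecs: array of shape (n_a, d), one embedding per word of sentence a.
    B_vecs: array of shape (n_b, d), one embedding per word of sentence b.
    """
    A = np.asarray(A_vecs, dtype=float)
    B = np.asarray(B_vecs, dtype=float)
    # Length-normalise the rows so that the dot product is the cosine.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T
```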

BiLSTM
Soft alignments based only on word similarity do not take the word's context in the sentence into account. Therefore, with the help of a bidirectional recurrent neural network (BiRNN), we produce context-dependent word representations. A BiRNN consists of a forward recurrent neural network (RNN) and a separate backward RNN that read the sentence in the forward and backward directions respectively. Figure 1 shows a BiRNN encoding of a sentence. Here, a_i refers to the pre-trained neural word embedding and h_i to the BiRNN representation of the i-th word in the sentence, produced by concatenating the forward and backward RNN representations. The approach followed here is the same as that used to produce variable-length sentence representations for the soft attention mechanism in neural machine translation. However, here we use the BiRNN representations of the words only for producing the soft alignments. Such representations can be considered context-dependent, as they carry a summary of the sentence at the position of each word. After producing the BiRNN representations for both sentences, we simply compute cosine similarities between these representations at each word position across the two sentences to produce the soft alignment matrix. As in Section 2.3.1, the soft alignment can be defined as a matrix S of size n_a × n_b, where s_ij = h · g, and h and g are the BiRNN-based contextual word representations. We use long short-term memory (Hochreiter and Schmidhuber, 1997, LSTM) units for our experiments, using our own implementation. For learning the BiLSTM-based representations, we use a collection of sentences from previous years' Semantic Textual Similarity tasks.
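To make the construction concrete, here is a minimal sketch of a plain bidirectional tanh RNN (a simplification of the BiLSTM actually used; the weight names are illustrative) that produces the concatenated forward/backward states from which the cosine-similarity alignment matrix is then computed exactly as for the word-embedding case:

```python
import numpy as np

def rnn_pass(X, W, U, reverse=False):
    """Run a simple tanh RNN over a sequence X of shape (n, d).

    Returns the hidden states as an (n, h) array, aligned with the
    original word order even when reading the sentence backwards.
    """
    seq = X[::-1] if reverse else X
    h = np.zeros(U.shape[0])
    states = []
    for x in seq:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    H = np.array(states)
    return H[::-1] if reverse else H

def birnn_states(X, W_f, U_f, W_b, U_b):
    """Concatenate forward and backward hidden states at each position,
    giving one context-dependent representation per word."""
    return np.concatenate([rnn_pass(X, W_f, U_f),
                           rnn_pass(X, W_b, U_b, reverse=True)], axis=1)
```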

Features
In order to apply soft alignment in a machine learning setting, we wish to transform these alignments into a set of features. In particular, each feature should be a single value bounded in some range, regardless of the relative sizes of the two target sentences. As we generate a non-square matrix of variable size, we hypothesize that good alignments should resemble hard alignments. For hard alignments, we used the number of rows (i.e., tokens) that had at least one alignment; for the soft alignment, we therefore used the row maxima of the alignment matrix, parameterised by an exponent φ. For our experiments we calculated four features, with the values φ = 0.1, 0.5, 1, 2.
In addition, we experimented with other sparsity measures proposed by Hurley and Rickard (2009), and included the H_G metric and a modified version of the -ℓ_p metric, which we call col-ℓ_p, as they gave good correlations with the sentence scores. We used p = 2, 10, giving two features for col-ℓ_p. Finally, we noted that in many cases the most important alignments were those between low-frequency words; we therefore transformed our alignments by multiplying them by the inverse document frequency of the words.
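The exact row-max formula is not reproduced above, so the exponentiated row-max below is our interpretation of the feature; the IDF reweighting, by contrast, follows the description directly:

```python
import numpy as np

def row_max_feature(S, phi):
    # One plausible reading of the row-max feature: average the per-row
    # maxima of the soft-alignment matrix S, each raised to the power phi.
    # (The original formula is elided in the text; this is an assumption.)
    return float(np.mean(np.max(S, axis=1) ** phi))

def idf_weight(S, idf_a, idf_b):
    # Reweight each alignment score by the IDF of the two words involved,
    # emphasising alignments between low-frequency words.
    return S * np.outer(idf_a, idf_b)
```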

ESA Similarity
Gabrilovich and Markovitch (2007) introduced the ESA model, which represents the semantics of a word as a distributional vector over Wikipedia concepts. We use a snapshot of English Wikipedia from 1 October 2013, which contains 13.9 million articles (concepts), and built an index of all Wikipedia articles using Lucene. We retrieve the distributional vector of a sentence by searching over this Lucene index; a Lucene ranking score thus gives the magnitude of the vector dimension corresponding to the retrieved Wikipedia article. We used Lucene's built-in scoring function to obtain the top K = 1000 articles, and to obtain the semantic relatedness between two sentences we compute the cosine similarity between their distributional vectors.
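The final cosine between the two sparse ESA vectors can be sketched as follows, assuming each vector is represented as a dict mapping a retrieved article (concept) to its Lucene ranking score:

```python
import math

def esa_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse ESA vectors.

    vec_a, vec_b: dicts mapping article id -> retrieval score, holding
    only the top-K retrieved concepts (all other dimensions are zero).
    """
    dot = sum(w * vec_b.get(concept, 0.0) for concept, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```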

Classifying with M5 Trees
Finally, having the baseline features, the features extracted from the hard and soft alignments, and the ESA similarity, we combine all of our features into a single vector and thus transform the problem into a traditional regression task. We experimented with various learners using the Weka toolkit (Hall et al., 2009) and found that in nearly all experiments the strongest performance was obtained with the M5 decision tree method (Wang and Witten, 1996; Quinlan, 1992), so we adopted it for all our experiments.

Internal Evaluation
We conducted a series of evaluations using data from previous SemEval challenges (Agirre et al., 2014) as a baseline, as shown in Table 1. These results present the following configurations using 10-fold cross-validation: Sultan Only (-DF), using the baseline features, which are also used in all experiments, and Sultan et al.'s aligner. In addition, we noticed that different datasets tended to have different distributions of scores. As such, we tried evaluating in two modes: macrotraining, where we trained the decision tree on all datasets simultaneously, and microtraining, where a decision tree was trained for each dataset and used only for that dataset. As the microtraining results were much stronger, for the task we developed a lightweight domain classifier that found the nearest training dataset to each test dataset and used the classifier trained on that dataset in our evaluation runs. This classifier used the size of the intersection of the sets of the 100 most frequent words in each dataset; it obtained 100% accuracy in identifying datasets under 10-fold cross-validation, i.e., we choose the classifier trained on the training set that maximizes this word-overlap similarity to the test set.
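The domain classifier described above amounts to a top-k word-overlap comparison. A minimal sketch (the dataset names and helper functions are illustrative):

```python
from collections import Counter

def top_words(sentences, k=100):
    """The set of the k most frequent words across a dataset's sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, _ in counts.most_common(k)}

def nearest_dataset(test_sents, train_sets, k=100):
    """Pick the training dataset whose top-k word set has the largest
    intersection with the test dataset's top-k word set."""
    test_top = top_words(test_sents, k)
    return max(train_sets,
               key=lambda name: len(test_top & top_words(train_sets[name], k)))
```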

SemEval results
We submitted three runs to the SemEval task; the results are reported in Table 2. The configurations were as follows:

m5all3: This run uses all the aligners (Sultan, Jacana, WordSim) together with ESA, and trains a single decision tree on all datasets simultaneously, without using the domain classifier.
m5dom1: This run uses the Sultan and WordSim aligners together with ESA, and trains a decision tree per test dataset on the nearest training dataset using the domain classifier.
m5dom2: This run uses all the aligners together with ESA, and trains a decision tree per test dataset on the nearest training dataset using the domain classifier.
We notice that our system has above-median performance on most datasets; however, we had significant difficulties on the answer-answer dataset, which we attribute to the length and complexity of its sentences. We also see that while the domain classifier was helpful in a few cases, notably headlines, it did not in general improve the results. Finally, we find that the combination of all classifiers was helpful.

Conclusion
We proposed a combination of different approaches, and our internal results suggest that combining multiple approaches can improve overall system performance. However, we notice that for some datasets performance is still very low, and we hypothesise that in these cases a new approach is needed. We also observe that adapting to the domain proved extremely effective in the cross-fold setting but not in the general case; our proposed method was very simplistic, however, and further improvements may be achieved by a more sophisticated algorithm.