L2F/INESC-ID at SemEval-2017 Tasks 1 and 2: Lexical and semantic features in word and textual similarity

This paper describes our approach to the SemEval-2017 “Semantic Textual Similarity” and “Multilingual Word Similarity” tasks. In the former, we test our approach in both English and Spanish, using a linguistically rich set of features that ranges from lexical to semantic. In particular, we try to take advantage of the recent Abstract Meaning Representation and the SMATCH measure. Although we do not reach state-of-the-art results, we introduce semantic structures in textual similarity and analyze their impact. Regarding word similarity, we target the English language and combine WordNet information with word embeddings. Although it does not match the best systems, our approach proved to be simple and effective.


Introduction
In this paper we present two systems that competed in SemEval-2017 tasks "Semantic Textual Similarity" and "Multilingual Word Similarity", using supervised and unsupervised techniques, respectively.
For the first task we used lexical features, as well as a semantic feature based on the Abstract Meaning Representation (AMR) and on the SMATCH measure. AMR is a semantic formalism, structured as a graph (Banarescu et al., 2013). SMATCH is a metric for the comparison of AMRs. To the best of our knowledge, these had not yet been applied to Semantic Textual Similarity. In this paper we focus on the contribution of the SMATCH score as a semantic feature for Semantic Textual Similarity, relative to a model based on lexical clues only.
For word similarity, we test semantic equivalence functions based on WordNet (Miller, 1995) and Word Embeddings (Mikolov et al., 2013). Experiments were performed on the test data provided in the SemEval-2017 tasks and yielded competitive results, although our systems were outperformed by other approaches in the official ranking.
The document is organized as follows: in Section 2 we briefly discuss some related work; in Sections 3 and 4, we describe our systems regarding the "Semantic Textual Similarity" and "Multilingual Word Similarity" tasks, respectively. In Section 5 we present the main conclusions and point to future work.

Related work
The general architecture of our STS system is similar to that of Brychcín and Svoboda (2016), Potash et al. (2016) or Tian and Lan (2016), but we employ more lexical features and AMR semantics. Brychcín and Svoboda (2016) model feature dependence in Support Vector Machines by using the product between pairs of features as new features, while we rely on neural networks. In Potash et al. (2016) it is concluded that feature based systems have better performance than structural learning with syntax trees. A fully-connected neural network is employed on hand engineered features and on an ensemble of predictions from feature based and structural based systems. We also employ a similar neural network on hand engineered features, but use semantic graphs to obtain one of such features.
For word similarity, our approach isolates the micro view seen in Tian and Lan (2016), where word embeddings are applied to measure the similarity of word pairs in an unsupervised manner. This work also describes supervised experiments on a macro/sentence view, which employ hand engineered features and the Gradient Boosting algorithm, as in our STS system. Henry and Sands (2016) employ WordNet for their sentence and chunk similarity metric, as also occurs in our system for word similarity.

Task 1 - Semantic textual similarity
In this section we describe our participation in Task 1 of SemEval-2017 (Cer et al., 2017), aimed at assessing the ability of a system to quantify the semantic similarity between two sentences, using a continuous value from 0 to 5 where 5 means semantic equivalence. This task is defined for monolingual and cross-lingual pairs. We participated in the monolingual evaluation for English, and we also report results for Spanish, both with test sets composed of 250 pairs. Most of our lexical features are language independent, thus we use the same model for both languages.
For a pair of sentences, our system collects the numeric output of metrics that assess their similarity relative to lexical or semantic aspects. Such features are supplied to a machine learning algorithm to: a) build a model, using pairs labeled with an equivalence value (compliant with the task), or b) predict such value, using the model.

Features
In our system, the similarity between two sentences is represented by multiple continuous values, obtained from metrics designed to leverage lexical or semantic analysis on the comparison of sequences or structures. Lexical features are also applied to alternative views of the input text, such as character or metaphone sequences. A total of 159 features were gathered, of which one relies on semantic representations.
Lexical features are obtained from INESC-ID@ASSIN (Fialho et al., 2016), such as TER, edit distance and 17 others. These are applied to 6 representations of an input pair, totaling 96 features, since not all representations are valid for all metrics (for instance, TER is not applicable to character trigrams). Its metrics and input representations rely on linguistic phenomena, such as the BLEU score on metaphones of input sentences.
We also gather lexical features from HARRY, where 21 similarity metrics are calculated for bits, bytes and tokens of a pair of sentences, except for the Spectrum kernel on bits (as it is not a valid combination), resulting in 62 of our 159 features.
The only semantic feature is the SMATCH score, which represents the similarity between two AMR graphs (Banarescu et al., 2013). The AMR for each sentence in a pair is generated with JAMR, and then supplied to SMATCH, which returns a numeric value between 0 and 1 denoting their similarity.
In SMATCH, an AMR is translated into triples that represent variable instances, their relations, and global attributes such as the start node and literals. The final SMATCH score is the maximum F score of matching triples, according to various variable mappings, obtained by comparing their instance tokens. These are converted into lower case and then matched for exact equality.
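The triple matching described above can be illustrated with a minimal sketch. It is not the SMATCH tool itself: it assumes a simplified AMR encoding (a dictionary of variable instances plus a set of relation triples), ignores the start node and literal attributes, and replaces the tool's heuristic search over variable mappings with an exhaustive search that is only practical for very small graphs. The function name `smatch_like_f1` is hypothetical.

```python
from itertools import permutations


def smatch_like_f1(instances_a, relations_a, instances_b, relations_b):
    """Best F score of matching triples over all one-to-one variable mappings.

    instances_*: dict mapping a variable to its instance token (concept).
    relations_*: set of (label, source_variable, target_variable) triples.
    Simplification (assumption): top-node and literal triples are ignored.
    """
    vars_a, vars_b = list(instances_a), list(instances_b)
    total_a = len(instances_a) + len(relations_a)
    total_b = len(instances_b) + len(relations_b)
    # None entries let a variable of the first AMR remain unmapped.
    candidates = vars_b + [None] * len(vars_a)
    best = 0.0
    for image in set(permutations(candidates, len(vars_a))):
        mapping = dict(zip(vars_a, image))
        matched = 0
        # Instance triples match when the lower-cased tokens are identical.
        for var, concept in instances_a.items():
            target = mapping[var]
            if target is not None and instances_b[target].lower() == concept.lower():
                matched += 1
        # Relation triples match when label and both mapped variables agree.
        for label, source, target in relations_a:
            if (label, mapping[source], mapping[target]) in relations_b:
                matched += 1
        if matched:
            precision, recall = matched / total_a, matched / total_b
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# AMRs of example B, encoded by hand in the simplified format assumed here.
first_instances = {"a": "and", "p": "pose-02", "w": "woman", "c": "camera"}
first_relations = {("op1", "a", "p"), ("ARG0", "p", "w"), ("ARG1", "p", "c")}
second_instances = {"p": "pose-02", "w": "woman", "c": "camera"}
second_relations = {("ARG0", "p", "w"), ("location", "p", "c")}

score_b = smatch_like_f1(first_instances, first_relations,
                         second_instances, second_relations)
```

Under these simplifications, four triples match (three instances and one relation) out of seven and five, so `score_b` is 2/3; the value reported by the actual SMATCH tool may differ, since it also counts the triples left out here.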

Experimental setup
We applied all metrics to the train, test and trial examples of the SICK corpus (Marelli et al., 2014) and to the train and test examples from previous Semantic Textual Similarity tasks in SemEval, as compiled by Tan et al. (2015).
Thus, our training dataset is comprised of 24623 vectors (with 9841 from SICK) assigned to a continuous value ranging from 0 to 5. Each vector contains our 159 feature values for the similarity among the sentences in an example pair.
We standardized the features by removing the mean and scaling to unit variance and norm. Then, machine learning algorithms were applied to the feature sets to train a model of our Semantic Textual Similarity representations. Namely, we employed ensemble learning by gradient boosting with decision trees, and feedforward neural networks (NN) with 1 and 2 fully connected hidden layers.
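As an illustration of the preprocessing step, the sketch below standardizes each feature to zero mean and unit variance and then rescales each vector to unit norm. Reading "unit variance and norm" as per-feature standardization followed by per-vector L2 normalization is an assumption, and the helper names are hypothetical.

```python
import math


def standardize_columns(rows):
    """Remove the mean of each feature and scale it to unit variance."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n
        # Constant features would divide by zero, so they are left unscaled.
        stds.append(math.sqrt(variance) or 1.0)
    return [[(row[j] - means[j]) / stds[j] for j in range(n_features)]
            for row in rows]


def l2_normalize_rows(rows):
    """Scale each feature vector to unit Euclidean norm (assumed meaning of 'norm')."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        normalized.append([value / norm for value in row])
    return normalized


# Two toy vectors with two features each (the real vectors have 159 features).
toy_features = [[1.0, 10.0], [3.0, 30.0]]
prepared = l2_normalize_rows(standardize_columns(toy_features))
```

In practice, the statistics (means and standard deviations) would be estimated on the training vectors and reused unchanged on the test vectors.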
SMATCH is not available for Spanish, therefore this feature was left out when evaluating Spanish pairs (es-es). For English pairs (en-en), the scenarios include: a) only lexical features, or b) an ensemble with lexical features and the SMATCH score (without differentiation).
Gradient boosting was applied with the default configuration provided in scikit-learn (Pedregosa et al., 2011). NN were configured with single and multiple hidden layers, both with a rectifier as activation function. The first layer combines the 159 input features (or 158 when not using SMATCH) into 270 neurons, which feed either a second layer with 100 neurons or the output layer (with 1 neuron). Finally, we employed the mean squared error cost function and the ADAM optimizer (Kingma and Ba, 2014), and fit a model in 100 epochs with batches of 5.
Our experiments were run with TensorFlow 0.11 (Abadi et al., 2015), with NN implementations from the Keras framework. The gradient boosting implementation is from scikit-learn.
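The two network topologies can be sketched in plain Python as a forward pass, to make the layer sizes concrete. This is not the Keras code used in the experiments: the weight initialization is arbitrary, the output neuron is assumed to be linear, and training with ADAM is omitted. All function names are hypothetical.

```python
import random


def init_layer(n_inputs, n_outputs, rng):
    """Random weights (one row per output neuron) and zero biases."""
    scale = (1.0 / n_inputs) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    return weights, [0.0] * n_outputs


def build_network(layer_sizes, seed=0):
    """Fully connected layers between consecutive sizes, e.g. [159, 270, 1]."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


def predict(network, features):
    """Forward pass: rectifier on hidden layers, linear output (assumption)."""
    activation = features
    for index, (weights, biases) in enumerate(network):
        activation = [sum(w * x for w, x in zip(row, activation)) + b
                      for row, b in zip(weights, biases)]
        if index < len(network) - 1:
            activation = [max(0.0, value) for value in activation]
    return activation[0]


def mean_squared_error(predictions, targets):
    """Cost function minimized during training."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


nn_1 = build_network([159, 270, 1])        # one hidden layer
nn_2 = build_network([159, 270, 100, 1])   # two hidden layers
example_score = predict(nn_2, [0.0] * 159)
```

When the SMATCH feature is left out, the first entry of `layer_sizes` would be 158 instead of 159.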

Results
System performance in the Semantic Textual Similarity task was measured with the Pearson coefficient. A selection of results is shown in Table 1, featuring our different scenarios/configurations, our official scores (in bold), and systems that achieved results similar to ours or are the best of each language/track. Variations of our system are identified by the "l2f" prefix. We should mention that, afterwards, we ran our experiments with Theano 0.8.2, which yielded different results. As an example, on the English track, using the same settings (network topology, training data and normalization) as run "l2f NN-2 (+smatch)" resulted in a Pearson score of 0.72374. More recently, TensorFlow released version 1.0, which resulted in a score of 0.70437 for the same setup.
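For reference, the Pearson coefficient used as the evaluation measure can be computed as in the following sketch. The predicted values in the usage example are hypothetical and serve only to show the call; they are not outputs of our system.

```python
import math


def pearson(predictions, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(gold)
    mean_p = sum(predictions) / n
    mean_g = sum(gold) / n
    covariance = sum((p - mean_p) * (g - mean_g)
                     for p, g in zip(predictions, gold))
    spread_p = sum((p - mean_p) ** 2 for p in predictions)
    spread_g = sum((g - mean_g) ** 2 for g in gold)
    return covariance / math.sqrt(spread_p * spread_g)


gold_scores = [2.8, 4.0, 0.8]          # gold scores of three test pairs
hypothetical_predictions = [3.1, 3.6, 1.5]  # made-up values for illustration
correlation = pearson(hypothetical_predictions, gold_scores)
```

The coefficient is undefined when either list is constant, in which case this sketch raises a `ZeroDivisionError`.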
In order to evaluate the contribution of SMATCH, we analyzed some examples where SMATCH led to a lower deviation from the gold standard, and, at the same time, higher deviation from runs without SMATCH.
On 15 pairs, SMATCH based predictions were consistently closer to the gold standard, across all learning algorithms, with an average difference of 0.27 from non SMATCH predictions. However, after analyzing the resulting AMR of some of these cases, we noticed that information was lost during AMR conversion. For instance, consider the following examples, which led to the results presented in Table 2.
(A) The player shoots the winning points. / The basketball player is about to score points for his team., with a gold score of 2.8.
(B) A woman jumps and poses for the camera. / A woman poses for the camera., with a gold score of 4.0.
(C) Small child playing with letter P / 2 young girls are sitting in front of a bookcase and 1 is reading a book., with a gold score of 0.8.
Considering example A, we can see the information lost during the AMR conversion in the following.
(w / win-01 :ARG1 (p / point))
vs.
(b / basketball :ARG1-of (s / score-01 :ARG2 (t / team :location-of (p / point)))) (1)
The top structure (until "vs.") is the AMR for the first sentence, where "winning" is incorrectly identified as a verb, and the actual verb ("shoot") and its subject ("player") are missing. The same subject is also missing in the bottom AMR. Information is also lost in example B, where the verb "jumps" is absent from the AMR of the first sentence:
(a / and :op1 (p / pose-02 :ARG0 (w / woman) :ARG1 (c / camera)))
vs.
(p / pose-02 :ARG0 (w / woman) :location (c / camera)) (2)
Thus, we could not identify specific situations to which AMR explicitly contributed, since examples where using SMATCH yielded better results reveal that SMATCH was applied to AMR with less information than in the source sentence.
Finally, 20 pairs were consistently better predicted without SMATCH, with an average difference of 0.38 from SMATCH based predictions.

Task 2.1 - Multilingual word similarity: English
In this section we report the experiments conducted for the second task of SemEval-2017 (Camacho-Collados et al., 2017). Given a pair of words, the task consists of automatically measuring their semantic similarity in a continuous range of [0, 4], from unrelated to totally similar. The test set was composed of 500 pairs of tokens (which can be words or multiple-word expressions); a small trial set of 18 pairs was also provided by the organizers. For this task we used a family of equivalence functions, from now on $equiv(t_1, t_2)$, where $t_1$ and $t_2$ are the tokens to be compared. The equiv functions return a value in the range [0, 1]. This value was later scaled into the goal's range. Then, we analyzed how to combine them. In the following subsections we detail our approach.

Equivalence functions
Two functions were considered:
• $equiv_{WN}$, which uses WordNet (Miller, 1995).
• $equiv_{W2V}$, which employs Word2Vec vectors (Mikolov et al., 2013) to compare the two tokens; we use the available pre-trained vector model, trained on the Google News dataset.
$equiv_{WN}(t_1, t_2)$ is defined as:
$$equiv_{WN}(t_1, t_2) = \begin{cases} 1 & \text{if } syn(t_1) = syn(t_2) \\ x & \text{if } hyp(t_1) \cap hyp(t_2) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$
where:
• $syn(t)$ gives the synset of the token $t$;
• $hyp(t)$ gives the hypernyms of $t$;
• $x = 1 - \max(n \times 0.1, m \times 0.1)$, with $n$ and $m$ being the number of nodes traversed in the synsets of $t_1$ and $t_2$, respectively.
$equiv_{WN}$ thus matches two tokens if they have a common hypernym (Resnik, 1995) in their synset path. We compute the path distance by traversing the synsets upwards until finding the least common hypernym. For each node up, a decrement of 0.1 is applied, starting at 1.0. If no concrete common hypernym is found, then 0 is returned. For example, laptop and notebook have the common synset Portable Computer, one node above both words, which results in a score of 1 − 0.1 = 0.9. Crocodile and lizard return 0.8, as one needs to go up two nodes in both tokens to find the common synset Diapsid. We do not consider generic synsets such as artifact or item.
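The WordNet-based function can be sketched as follows, assuming a toy single-parent hypernym table instead of the real WordNet. The table entries are illustrative only: the intermediate nodes ("crocodilian", "saurian") and the exact chains are hypothetical, chosen so that the two examples above are reproduced, and a real implementation would also have to deal with multiple senses and multiple hypernyms per synset.

```python
# Hypothetical child -> hypernym table standing in for WordNet.
TOY_HYPERNYMS = {
    "laptop": "portable computer",
    "notebook": "portable computer",
    "portable computer": "computer",
    "computer": "artifact",
    "crocodile": "crocodilian",
    "crocodilian": "diapsid",
    "lizard": "saurian",
    "saurian": "diapsid",
    "diapsid": "reptile",
    "telescope": "artifact",
}
GENERIC_SYNSETS = {"artifact", "item"}  # too generic to count as a match


def upward_path(synset, hypernyms):
    """Map each synset on the upward path to the number of nodes traversed."""
    steps_to = {}
    steps = 0
    while synset is not None and synset not in steps_to:
        steps_to[synset] = steps
        synset = hypernyms.get(synset)
        steps += 1
    return steps_to


def equiv_wn(t1, t2, hypernyms=TOY_HYPERNYMS, generic=GENERIC_SYNSETS):
    """1 - 0.1 * max(n, m) for the closest concrete common hypernym, else 0."""
    path_1 = upward_path(t1, hypernyms)
    path_2 = upward_path(t2, hypernyms)
    common = [s for s in path_1 if s in path_2 and s not in generic]
    if not common:
        return 0.0
    nodes_up = min(max(path_1[s], path_2[s]) for s in common)
    return max(0.0, 1.0 - 0.1 * nodes_up)


laptop_notebook = equiv_wn("laptop", "notebook")    # common synset one node up
crocodile_lizard = equiv_wn("crocodile", "lizard")  # common synset two nodes up
laptop_telescope = equiv_wn("laptop", "telescope")  # only a generic synset in common
```

With this toy table the three calls give 0.9, 0.8 and 0, matching the behaviour described in the text.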
Regarding $equiv_{W2V}$, it computes the cosine similarity between the vectors representing the two tokens:
$$equiv_{W2V}(t_1, t_2) = \frac{W2V(t_1) \cdot W2V(t_2)}{\|W2V(t_1)\| \, \|W2V(t_2)\|}$$
where $W2V(t)$ is the vector representing the word embedding for the token $t$. If the token is composed of more than one word, their vectors are added before computing the cosine similarity. For example, self-driving car and autonomous car obtain a cosine similarity of 0.53 (showing a degree of similarity, resulting from multiple-word tokens), while brainstorming and telescope result in a score of 0.04, which means the tokens are not related. Note that the scores are rounded to 0 if they are negative.
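A minimal sketch of the embedding-based function is given below. It assumes a plain dictionary of toy 3-dimensional vectors in place of the pre-trained Word2Vec model, splits multiple-word tokens on whitespace, and does not handle out-of-vocabulary words; the vector values are made up and do not reproduce the scores reported above.

```python
import math

# Hypothetical low-dimensional embeddings; the real model has far more dimensions.
TOY_VECTORS = {
    "autonomous": [0.9, 0.1, 0.0],
    "self-driving": [0.8, 0.3, 0.1],
    "car": [0.2, 0.9, 0.1],
    "up": [1.0, 0.0, 0.0],
    "down": [-1.0, 0.0, 0.0],
}


def token_vector(token, vectors):
    """Add the vectors of all words in a (possibly multiple-word) token."""
    words = token.split()
    total = [0.0] * len(vectors[words[0]])
    for word in words:
        for index, value in enumerate(vectors[word]):
            total[index] += value
    return total


def equiv_w2v(t1, t2, vectors=TOY_VECTORS):
    """Cosine similarity between token vectors, with negative scores set to 0."""
    v1, v2 = token_vector(t1, vectors), token_vector(t2, vectors)
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if norms == 0.0:
        return 0.0
    return max(0.0, dot / norms)


multiword_score = equiv_w2v("self-driving car", "autonomous car")
opposite_score = equiv_w2v("up", "down")  # cosine is -1, so the result is 0
```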

Combining the equivalence functions
We started by applying the equiv functions to the trial set. Table 3 shows some results for this experiment. As one can see, in certain cases it would be better to use $equiv_{WN}$, and in others the $equiv_{W2V}$ function.
Just these few examples show how hard it is to combine these functions. Although we did not expect to accomplish relevant results with such an approach, we decided to train a linear regression model in Weka (Hall et al., 2009) with the (very small) provided example set.
The final result obtained was $C_1 = 5.0381 \times equiv_{W2V} + 0.6355$, which only uses one of the functions. We used a capped version of this equation, $C = \min(C_1, 4)$, in one of our runs, RunW2V.
Believing that $equiv_{WN}$ had the potential to be important in certain cases, we manually designed a weighted function to combine both functions. The threshold was decided by analyzing the trial set only. We ended up with a decision function based on the following idea: when $equiv_{W2V}$ is below a threshold (set to 0.12), we use $equiv_{WN}$. Then either $equiv_{WN}$ does not find a relation as well (and probably has a value of 0.0), or it finds one and it is probably correct (see sunset/string in Table 3). This led to our second run, RunMix.
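The two runs can be sketched as below. The RunW2V function follows the regression equation and the cap at 4 given above. For RunMix, only the threshold rule is taken from the text; how the selected value is mapped to the [0, 4] range is an assumption of this sketch (a linear scaling by 4 for the WordNet branch, and the RunW2V formula otherwise), and the function names are hypothetical.

```python
W2V_THRESHOLD = 0.12  # threshold chosen on the trial set


def run_w2v_score(equiv_w2v_value):
    """Linear regression learned on the trial set, capped at the maximum score of 4."""
    c1 = 5.0381 * equiv_w2v_value + 0.6355
    return min(c1, 4.0)


def run_mix_score(equiv_wn_value, equiv_w2v_value):
    """Threshold rule: fall back to the WordNet function when W2V is low.

    Assumption: the WordNet value is scaled linearly into [0, 4] and the
    other branch reuses the RunW2V formula; the text does not specify this.
    """
    if equiv_w2v_value < W2V_THRESHOLD:
        return 4.0 * equiv_wn_value
    return run_w2v_score(equiv_w2v_value)


related_pair = run_w2v_score(0.53)     # similarity reported for self-driving car / autonomous car
unrelated_pair = run_w2v_score(0.04)   # similarity reported for brainstorming / telescope
fallback_pair = run_mix_score(0.9, 0.05)  # low W2V value, so the WordNet value is used
```

Note that the regression intercept gives unrelated pairs a score of at least 0.6355 in RunW2V, whereas the threshold rule allows a score of 0 when neither function finds a relation.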

Results
Results for the task are presented in Table 4, with our runs in bold as submitted (run1 is RunW2V and run2 is RunMix). Both our runs attain a similar score, which is somewhat surprising given how differently the scores were calculated. We placed in the middle of the table, although only a few points short of the 5th best ranked run: a difference of less than 0.04 on both Pearson and final score. This is an interesting result, given how simple our approach was and the lack of data to properly learn a function to combine our equiv functions.

Conclusions and Future Work
In this paper we present our results on two tasks of the SemEval-2017 competition, "Semantic Textual Similarity" and "Multilingual Word Similarity". Our systems yielded competitive results, although they were outperformed by other approaches in the official ranking.
For the "Semantic Textual Similarity" task, our models performed similarly for multilingual data, since most features are language independent, and essentially rely on matching tokens among input sentences. Therefore, our method is feasible for all monolingual pairs.
We could not identify situations where the SMATCH metric improved the results, although in 15 cases SMATCH based predictions were closer to the gold standard, across all learning algorithms.
Future work includes replacing the exact instance matching in SMATCH with our word similarity module, and using the SMATCH representation in a structural learning method such as Tree-LSTM (Tai et al., 2015), or in a more balanced/weighted ensemble with the lexical features.
Regarding the "Multilingual Word Similarity" task, we believe that our participation was simple, but still effective. We used two semantic resources (WordNet and Word2Vec), a weighting function learned on a small trial set, and a hand-crafted formula to combine the similarity scores of our two functions, which leaves the approach without solid grounding. The results were still promising, given the simplicity of our approach.
As future work, the word similarity module itself could be largely improved by automatically learning a set of weights to combine the two functions. For instance, the gold standard, now available, can be a useful tool for this task, as can other large datasets such as SimLex-999 (Hill et al., 2014).