LIPN-IIMAS at SemEval-2017 Task 1: Subword Embeddings, Attention Recurrent Neural Networks and Cross Word Alignment for Semantic Textual Similarity

In this paper we report our attempt to apply state-of-the-art neural approaches proposed for measuring Semantic Textual Similarity (STS), alongside an unsupervised, linguistically motivated cross-word alignment approach that we propose. The neural approaches are divided into two main stages. The first stage constructs neural word embeddings, the components of sentence embeddings. The second stage constructs a semantic similarity function relating pairs of sentence embeddings. Unfortunately, our competition results were poor in all tracks; we therefore concentrated our research on improving them for Track 5 (EN-EN).


Introduction
Semantic Textual Similarity (STS) refers to the Natural Language Processing (NLP) task aimed at measuring the degree of similarity/dissimilarity between two text units (Agirre et al., 2012, 2016). In other words, given a pair of text snippets (generally a pair of sentences), the task is to determine a real value (the semantic similarity score) in the interval between 0.0 and 5.0, which represents how similar the two sentences of the pair are.
There are two main types of systems proposed in prior editions of the competition: supervised and unsupervised. While supervised systems are expected to be highly reliable because they use human-annotated gold standards, unsupervised systems are also highly reliable while using modest levels of linguistic knowledge. In this work we report results from both unsupervised and supervised systems.
Currently the STS task involves tracks of different natures, i.e. monolingual and cross-lingual ones. In this paper we investigate the underlying properties of text that are relevant for measuring semantic similarity; thus we focus our main efforts on the English-English Track 5.

Data
We tested a couple of supervised systems. We prepared the STS monolingual English datasets from years 2012, 2013, 2015 and 2016. After discarding sentence pairs whose similarity score was absent from the corresponding gold standard files, we obtained a dataset consisting of 10,592 sentence pairs (6,858 already marked as training pairs and 3,734 already marked as test pairs).
In order to obtain subword embeddings we trained the "fastText" method for 20, 50, 100, 200 and 300 dimensions on the English Wikipedia (Bojanowski et al., 2016). We decided to take advantage of the capability of this method to infer vectors for out-of-vocabulary words. This advantage is mainly due to fastText's character-level n-gram approach, which makes a meaningful performance difference both in training and in testing.
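The character-level n-gram idea can be illustrated with a short sketch. This is not the fastText implementation: the hashing function (`zlib.crc32`), the table size, the random untrained table and the plain averaging are assumptions made only for illustration, and the names `char_ngrams` and `subword_vector` are hypothetical. The n-gram lengths 3 to 6 follow fastText's default settings.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def subword_vector(word, table):
    """Average the table rows selected by hashing each n-gram of the word."""
    rows = [zlib.crc32(g.encode("utf-8")) % table.shape[0]
            for g in char_ngrams(word)]
    return table[rows].mean(axis=0)

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 50))  # hypothetical, untrained n-gram table

# Any string gets a vector, even one never seen during training.
vec = subword_vector("unseenword", table)
```

Because the vector is built from n-grams rather than looked up by whole word, an out-of-vocabulary word still receives a representation that shares components with words of similar spelling.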

Systems Description
Multiple neural network architectures were used to model similarity measurement in supervised settings. In addition, an unsupervised system was tested directly on this year's evaluation dataset.

Word embeddings + RNN
We see Recurrent Neural Networks (RNNs) as intuitive models for observing the relevance of sentence elements, in particular Long Short-Term Memories (LSTMs). These kinds of networks are well documented as suitable for modeling the sequentiality of lexical units within sentences while avoiding the vanishing gradient of long-term patterns (Hochreiter and Schmidhuber, 1997).
Attention LSTMs capture additional features of the sequential process they model. The additional features are encoded into an attention vector. This attention vector indicates to the network which segments of the sequence (sentence) are statistically more relevant than the others according to the training set.
In this paper we used the architecture proposed by Vinyals et al. (2015), where the authors used a stacked Attention LSTM for PoS tagging. In Figure 1 we show a modified version of this architecture, which consists of two attention LSTM layers at the bottom, one Gated Recurrent Unit (GRU) in the middle (Cho et al., 2014) and a simple RNN on top. Notice that this description corresponds to each of the twin networks shown in the figure, which is our adaptation to the STS task. This recurrent architecture is followed by a Maxout Network (Goodfellow et al., 2013), which has a monolithic output layer (i.e. the similarity score y_i ∈ [1, 5] ⊂ R).
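To make the role of the attention vector concrete, the following is a minimal sketch of a generic dot-product attention over the hidden states of a sequence. It is not the attention layer used in our networks: the scoring function, the query vector `q` and the function name `attention_pool` are assumptions chosen for brevity.

```python
import numpy as np

def attention_pool(hidden_states, query):
    """Weight each time step by a softmax over dot-product scores."""
    scores = hidden_states @ query             # one score per time step, shape (T,)
    scores = scores - scores.max()             # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ hidden_states          # weighted sum of states, shape (h,)
    return weights, context

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))   # 7 time steps (words), 16 hidden units
q = rng.normal(size=16)        # hypothetical learned query vector

weights, context = attention_pool(H, q)
```

The entries of `weights` sum to one, so a larger entry marks a segment of the sentence that contributes more to the summary vector `context`.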

Sentence embeddings + MLP
The word/sentence embedding stage was modeled via the doc2vec method (Le and Mikolov, 2014), which is based on the word2vec word embedding method (Mikolov et al., 2013). For each pair of sentences, we obtained a pair of sentence embeddings (s_a, s_b) ∈ R^d × R^d. Each pair was then concatenated to form a pair vector p_i = s_a ∥ s_b ∈ R^{2d}. In this way, we obtained a training set (p_1, y_1), ..., (p_m, y_m), which was fed to a simple MLP. The output layer of the MLP is a 6-node softmax, so we have six possible output similarity values, i.e. y_i ∈ {0, ..., 5}.
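The construction of a pair vector and the 6-node softmax output can be sketched as follows. The sentence embeddings here are random stand-ins for doc2vec vectors, and the single tanh hidden layer, its size and the untrained weights are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(p, W1, b1, W2, b2):
    """One hidden tanh layer followed by a softmax output layer."""
    hidden = np.tanh(p @ W1 + b1)
    return softmax(hidden @ W2 + b2)

d, n_hidden, n_classes = 100, 32, 6       # n_hidden is a hypothetical choice
rng = np.random.default_rng(0)

s_a = rng.normal(size=d)                  # stand-in for a doc2vec sentence embedding
s_b = rng.normal(size=d)
p_i = np.concatenate([s_a, s_b])          # pair vector in R^{2d}

W1 = rng.normal(scale=0.1, size=(2 * d, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

probs = mlp_forward(p_i, W1, b1, W2, b2)  # distribution over the scores 0..5
y_hat = int(np.argmax(probs))             # predicted integer similarity score
```

With this output layer the prediction is one of six integer scores, so real-valued gold scores have to be mapped to classes before training.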

Cross word aligner
We propose an unsupervised system motivated by linguistic elements that we identified as highly relevant according to linguistic theories. General linguistics states that we can know what is being said about something by looking at the predicative structure. The theories of Harris (1968) inspire NLP algorithms in which word use is said to lead to meaning (commonly interpreted as word co-occurrence). Harris also stated that the combinatorics of words is more informative in predicates, where redundancy is needed by speakers to provide integrity to a message.
In an attempt to follow these statements, and also inspired by the success obtained by authors like Han et al. (2013) and Rychalska et al. (2016), we implemented a word alignment system. Unlike previous works, our system considers that verbs operate on nouns. We used Open Information Extraction (openIE) algorithms to detect predicates (P_a, P_b) of the form (NP, VP, NP) within each sentence of the pair (S_a, S_b) (Fader et al., 2011).
Similarly to the word analogies commonly used for word embedding evaluations (Mikolov et al., 2013), our system considers that verbs frequently operate on nouns. Thus, we measure how similar each verb v_a ∈ P_a of a sentence S_a is with respect to its combination with each noun n_b ∈ P_b of a sentence S_b, i.e. d_c(S_a, S_b). Given that the relationship d_c(·, ·) is not commutative, this similarity is also computed from S_b to S_a, i.e. d_c(S_b, S_a). Here θ(·, ·) is the cosine similarity and v_a, n_a ∈ R^d are word embeddings categorized as verbs and nouns within the sentence S_a, respectively. N_{v,a}, N_{n,a} are the numbers of verbs and nouns considered in S_a (same for S_b). Overall, equations (1a) and (1b) are the average vector similarities of cross word alignments with respect to structural categories between S_a and S_b. For example, in Figure 2 the sentence "People are ready for change" [S_a] is compared against the phrase "people change" [S_b]. The aim is to quantify how the word "people" is used along with the conjugated form "are" (which forms a predicate together with the noun phrase "ready for change").
Figure 2: General scheme for the vector similarities of cross word alignments with respect to structural categories.
The kinds of predicates shown in Figure 2 are often part of more complex sentences, e.g. "It is clear that the future is near and people are ready for change". We extracted these predicates by using the openIE algorithm implemented in the CoreNLP² library.
There are cases in the STS corpora where no extractions are made. This is due to the low recall that openIE systems have offered so far (Xu et al., 2013). That is, many openIE algorithms can extract neither implicit relations (e.g. "Mexico City, where Aztecs live") nor short phrases (e.g. "The white house"). We assume that these snippets are expressed in their minimum form, so things like "people changes" are embedded word by word. The embeddings are then compared either to embeddings of other equally short phrases or to embeddings of openIE extractions. The global score is simply the average of all distances.
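One possible reading of the cross word alignment score can be sketched as follows. This is an assumption about the general shape of the computation rather than a transcription of equations (1a) and (1b): each directed score is taken as the plain average of cosine similarities between the verbs of one sentence and the nouns of the other, and the global score as the mean of the two directions. The function names and the toy two-dimensional vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def directed_score(verbs_a, nouns_b):
    """Average cosine similarity of every verb of S_a with every noun of S_b."""
    sims = [cosine(v, n) for v in verbs_a for n in nouns_b]
    return sum(sims) / len(sims)

def global_score(verbs_a, nouns_a, verbs_b, nouns_b):
    """Mean of the two directed scores (assumed aggregation)."""
    return 0.5 * (directed_score(verbs_a, nouns_b)
                  + directed_score(verbs_b, nouns_a))

# Toy embeddings standing in for fastText vectors of verbs and nouns.
verbs_a = [np.array([1.0, 0.0])]
nouns_a = [np.array([0.0, 1.0])]
verbs_b = [np.array([0.0, 1.0])]
nouns_b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

score = global_score(verbs_a, nouns_a, verbs_b, nouns_b)
```

Since `directed_score(verbs_a, nouns_b)` and `directed_score(verbs_b, nouns_a)` generally differ, both directions are needed, which reflects the non-commutativity noted for d_c(·, ·).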

Results
Our systems passed through several refinement stages. Unfortunately, the submitted runs corresponded to early stages and did not reach competitive performance, as can be seen in Table 1. We translated the multilingual data into English using the Google Translate API and trained a single model on the resulting data. We submitted two LSTM models, with and without an attention mechanism. The models were selected by monitoring the best test score after 25 training epochs. Additional systems were tested after the competition. Our best results are considered as such given their absolute value (inverse correlations can be reinterpreted in the system in case we reach higher values).
² http://stanfordnlp.github.io/CoreNLP/

Word embeddings + RNN
A sentence can be seen as a sequence of word embeddings which are stacked in order to form a sentence matrix. For this system we used fastText word embeddings. Given a sentence pair, each sentence matrix is fed to one of the multi-layered RNNs described in Section 3.1. We used the last hidden states (or time steps) of the top layers of the two networks as sentence embeddings and concatenated them. In this way, we obtained pair vectors {p_1, ..., p_m} ⊂ R^{2t} that were fed to the top Maxout network (herein t is the number of hidden states that each of the top RNN layers has in Figure 1).
The networks in Table 2 were trained on 1500 pairs from the data described in Section 2 (1050 for training and 450 for test). As shown in the table, we fed the networks with word embeddings of 200, 100 and 50 dimensions. Results are much better for the architecture formed by word embeddings of 200 dimensions, 50 hidden states and 100 hidden Maxout nodes.
Table 2: Twin Attention LSTM-GRU-RNN-Maxout architecture and performance (after official evaluation) on the 2017 track 5.
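The last step of this pipeline, from the two top hidden states to a similarity score, can be sketched with a plain Maxout layer. The sizes t = 50 and 100 Maxout nodes follow the best configuration reported above; the number of linear pieces `k`, the random weights, the linear output and the random vectors standing in for RNN states are assumptions made for illustration.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer: elementwise maximum over k affine maps.

    W has shape (k, n_in, n_out) and b has shape (k, n_out).
    """
    z = np.einsum("i,kio->ko", x, W) + b   # k candidate activations per unit
    return z.max(axis=0)

t, n_units, k = 50, 100, 2                 # k is a hypothetical choice
rng = np.random.default_rng(0)

h_a = rng.normal(size=t)                   # stand-in for the last top state of network A
h_b = rng.normal(size=t)                   # stand-in for the last top state of network B
p = np.concatenate([h_a, h_b])             # pair vector in R^{2t}

W = rng.normal(scale=0.1, size=(k, 2 * t, n_units))
b = np.zeros((k, n_units))
w_out = rng.normal(scale=0.1, size=n_units)

features = maxout(p, W, b)                 # 100 Maxout activations
score = float(features @ w_out)            # single real-valued output (untrained)
```

Each Maxout unit keeps the largest of its `k` affine responses, so the layer learns a piecewise linear activation instead of using a fixed one.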

Sentence embeddings + MLP
In Table 3 we report the Mean Squared Error (MSE) for the test set and the Pearson's weighted correlation coefficient for the track 5 evaluation. Many architecture combinations tried during training showed that even the minimum test MSE is very high. Therefore our Doc2vec+MLP setting did not allow for good generalization.
Table 3 columns: Hidden layers, MSE (%), Correlation.
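For reference, the two quantities reported here can be computed as in the sketch below. It uses the plain (unweighted) Pearson coefficient, and the gold and predicted scores are made-up values for illustration, not results from our experiments.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between gold and predicted scores."""
    return float(np.mean((y_true - y_pred) ** 2))

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between gold and predicted scores."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp)))

# Hypothetical gold scores and system outputs on the 0 to 5 scale.
gold = np.array([0.0, 1.5, 3.0, 4.2, 5.0])
pred = np.array([0.4, 1.0, 2.5, 4.0, 4.8])

error = mse(gold, pred)
correlation = pearson(gold, pred)
```

A low correlation together with a high MSE indicates that the predictions neither rank the pairs well nor match the scale of the gold scores.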

Cross word aligner
The cross word alignment system is unsupervised, and we tested it directly on some of the most popular datasets from past years. We used fastText word embeddings of different dimensions. A good choice for semantic assessment is 100 dimensions (Bojanowski et al., 2016). Additionally, we report results for 300, 200, 50 and 20 dimensions.
At the top of Table 4 we show our best result (after official evaluation), which is that for 200 dimensions. Furthermore, we noticed that our engineered features are sensitive to text properties, e.g. domains.

Conclusions
Despite the success that RNNs have recently shown, we observed that, although they do not require feature engineering, they do require training time, large amounts of data, high computational power and architecture engineering. The results we showed in Section 4.1 are not good. The reason is very probably one of the aforementioned requirements, and it needs to be addressed. We think the amount of sequential patterns with which we trained our networks was not enough. Such patterns are based on specific lexical items (each particular word embedding), but not on generalized sequential and semantic patterns.
Our cross word alignment system is based on feature engineering. We showed that when a simple cosine similarity focuses on relevant segments of sentences, the performance can be progressively improved (probably by improving feature engineering and adding external resources not considered at this moment). This reasoning is consistent with many other unsupervised approaches. It should be noted that even though we performed simple feature engineering, a critical part of our method was the use of word embeddings, which are barely based on linguistic feature engineering.