Neobility at SemEval-2017 Task 1: An Attention-based Sentence Similarity Model

This paper describes a neural-network model that performed competitively (top 6) in the SemEval-2017 cross-lingual Semantic Textual Similarity (STS) task. Our system employs an attention-based recurrent neural network and optimizes sentence similarity directly. In this paper, we describe our participation in the multilingual STS task, which measures similarity across English, Spanish, and Arabic.


Introduction
Semantic textual similarity (STS) measures the degree of equivalence between the meanings of two text sequences (Agirre et al., 2016). The similarity of a text pair can be represented as a discrete or continuous value ranging from irrelevance (0) to exact semantic equivalence (5). It is widely applicable to many NLP tasks, including summarization (Wong et al., 2008; Nenkova et al., 2011), generation and question answering (Vo et al., 2015), paraphrase detection (Fernando and Stevenson, 2008), and machine translation (Corley and Mihalcea, 2005).
In this paper, we describe a system that learns context-sensitive features within the sentences. We encode the sequential information with a recurrent neural network (RNN) and apply an attention mechanism (Bahdanau et al., 2015) to the RNN outputs of both sentences; the attention mechanism increases the sensitivity of the system to words that matter for similarity. We also optimize directly on the Pearson correlation score as part of our neural-based approach. Moreover, we include a pair-feature adapter module that can be used to add further features and improve performance; for this competition, however, we include only the TakeLab features (Šarić et al., 2012). Our system and data are available at github.com/iamalbert/semval2017task1.

Related Works
Most approaches proposed in the past adopted a hybrid of varying text unit sizes, ranging from character-based and token-based to knowledge-based similarity measures (Gomaa and Fahmy, 2013). The linguistic depth of these measures often varies between the lexical, syntactic, and semantic levels.
Most solutions include an ensemble of modules that employ features drawn from different unit sizes and depths. More recent approaches generally include word embedding-based similarity (Liebeck et al., 2016; Brychcín and Svoboda, 2016) as a component of the final ensemble. The top-performing team in 2016 (Rychalska et al., 2016) used an ensemble of multiple modules, including recursive autoencoders with WordNet and a monolingual aligner (Sultan et al., 2016). UMD-TTIC-UW (He et al., 2016) proposed the MPCNN model, which requires no feature engineering and performed competitively, placing 6th. MPCNN extracts hidden information with a Convolutional Neural Network (CNN) and adds an attention layer to capture salient similarity information.

Model
Given two sentences I_1 = {w^1_1, w^1_2, ..., w^1_{n_1}} and I_2 = {w^2_1, w^2_2, ..., w^2_{n_2}}, where w^i_j denotes the j-th token of the i-th sentence, each token is embedded using a function φ that maps it to a D-dimensional trainable vector. Both sentences are encoded with the same sentence encoder, described below.

Sentence Encoder: For each sentence, the RNN first converts w^i_j into x^i_j ∈ R^{2H} using a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014), feeding the tokens into the unit sequentially in both the forward and backward directions and concatenating the two hidden states. The superscripts of w, x, a, u, and n are omitted below for clarity of notation. We then attend to each word x_j with a salience weight a_j and blend the memories x_{1:n} into a sentence embedding u:

a_j = softmax_j( r^⊤ tanh(W x_j) ),   u = Σ_j a_j x_j,

where W ∈ R^{2H×2H} and r ∈ R^{2H} are trainable parameters.
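As a concrete illustration, the encoder and attention pooling above can be sketched in PyTorch as follows. This is a minimal sketch, not the released implementation; the class name is ours, and padding masks are omitted, but the default dimensions match the reported settings (300-dimensional embeddings, H = 200).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveGRUEncoder(nn.Module):
    """Bidirectional GRU encoder with attention pooling (illustrative sketch)."""
    def __init__(self, vocab_size, emb_dim=300, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # phi: token -> D-dim vector
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.W = nn.Linear(2 * hidden, 2 * hidden, bias=False)  # W in R^{2H x 2H}
        self.r = nn.Parameter(torch.randn(2 * hidden))          # r in R^{2H}

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        w = self.embed(tokens)                 # (batch, seq_len, D)
        x, _ = self.gru(w)                     # (batch, seq_len, 2H) contextual memories
        e = torch.tanh(self.W(x)) @ self.r     # (batch, seq_len) unnormalized salience
        a = F.softmax(e, dim=1)                # attention weights a_j over the sequence
        u = (a.unsqueeze(-1) * x).sum(dim=1)   # sentence embedding u = sum_j a_j x_j
        return u
```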
Surface Features: Inspired by the simple system described in Šarić et al. (2012), we also extract surface features from the sentence pair, as follows:

• Ngram Overlap Similarity: These features draw on external knowledge such as WordNet (Miller, 1995) and Wikipedia. We use both PathLen similarity (Leacock and Chodorow, 1998) and Lin similarity (Lin, 1998) to compute the similarity between pairs of words w^1_i and w^2_j in I_1 and I_2, respectively. We apply the suggested preprocessing steps (Šarić et al., 2012) and add both WordNet and corpus-based information to the ngram overlap scores, which are obtained as the harmonic mean of the degree of overlap between the two sentences.

• Semantic Sentence Similarity: We also compute token-based alignment overlap and vector-space sentence similarity (Šarić et al., 2012). Semantic alignment similarity is computed greedily between all pairs of tokens using both knowledge-based and corpus-based similarity, and the scores are further enhanced with the aligned-pair information. We obtain weighted latent semantic analysis vectors (Turney and Pantel, 2010) for each word w before computing the cosine similarity, so the sentence similarity scores are enhanced with corpus-based information for the tokens.

The features are concatenated into a vector, denoted m.
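A rough, self-contained sketch of two of these surface cues using NLTK's WordNet interface is given below. The path-based measure stands in for the PathLen/Lin measures, and the TakeLab preprocessing, information-content weighting, and corpus-based components are not reproduced here.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def ngram_overlap(tokens1, tokens2, n=1):
    """Harmonic mean of the two directional n-gram overlap ratios (sketch)."""
    g1 = {tuple(tokens1[i:i + n]) for i in range(len(tokens1) - n + 1)}
    g2 = {tuple(tokens2[i:i + n]) for i in range(len(tokens2) - n + 1)}
    inter = len(g1 & g2)
    if inter == 0:
        return 0.0
    return 2.0 / (len(g1) / inter + len(g2) / inter)

def wordnet_sim(w1, w2):
    """Max path similarity over the synsets of two words (placeholder measure)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 1.0 if w1 == w2 else 0.0
    return max((a.path_similarity(b) or 0.0) for a in s1 for b in s2)

def greedy_alignment_score(tokens1, tokens2):
    """Greedy token alignment overlap, averaged over the first sentence (sketch)."""
    if not tokens1 or not tokens2:
        return 0.0
    return sum(max(wordnet_sim(t1, t2) for t2 in tokens2) for t1 in tokens1) / len(tokens1)
```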
Scoring: Let S be a discrete random variable over {0, 1, ..., 5} describing the similarity of the given sentence pair {I_1, I_2}. The representation of the pair is the concatenation of u^1, u^2, and m, which is fed into an MLP with one hidden layer to calculate the estimated distribution of S.
Therefore, the score y is the expected value of S: y = E[S] = v^⊤ p, where v = [0, 1, 2, 3, 4, 5]^⊤. The entire system is shown in Figure 1.
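A minimal sketch of this scoring step is shown below; the hidden width of the MLP is a placeholder, not a value reported in the paper.

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """MLP over [u1; u2; m] followed by an expectation over score classes 0..5."""
    def __init__(self, pair_dim, hidden=128, n_classes=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pair_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_classes)
        )
        self.register_buffer("v", torch.arange(n_classes, dtype=torch.float))  # v = [0..5]

    def forward(self, u1, u2, m):
        p = torch.softmax(self.mlp(torch.cat([u1, u2, m], dim=-1)), dim=-1)  # P(S = k)
        y = p @ self.v                                                       # y = E[S] = v^T p
        return y, p
```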

Word Embedding
We explore initializing word embeddings either randomly or with pre-trained word2vec vectors (Mikolov et al., 2013) of dimension 50, 100, and 300. We find that the system works best with 300-dimensional word2vec embeddings.
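For instance, initializing an embedding matrix from pre-trained word2vec vectors could look like the following sketch; the gensim loader and the GoogleNews file name are assumptions, not details taken from the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

def build_embedding_matrix(vocab, path="GoogleNews-vectors-negative300.bin", dim=300):
    """Initialize an embedding matrix from pre-trained word2vec; unseen words stay random."""
    kv = KeyedVectors.load_word2vec_format(path, binary=True)
    matrix = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype("float32")
    for word, idx in vocab.items():   # vocab maps each token to its row index
        if word in kv:
            matrix[idx] = kv[word]
    return matrix
```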

Optimization
Let p^n and y^n be the predicted probability distribution and expected score, and ŷ^n the annotated gold score, of the n-th sample. Most previous learning-based models are trained to minimize one of the following objectives over a batch of N samples:

• Negative log-likelihood (NLL) of the gold class under p (Aker et al., 2016), which views the task as a classification problem over six classes:
L_NLL = −(1/N) Σ_n log p^n_{t^n},
where t^n is ŷ^n rounded to the nearest integer.

• Mean squared error (MSE) between the predicted and gold scores:
L_MSE = (1/N) Σ_n (y^n − ŷ^n)².

• Kullback-Leibler divergence (KLD) between p^n and a gold distribution p̂^n estimated from ŷ^n (Li and Huang, 2016; Tai et al., 2015):
L_KLD = (1/N) Σ_n KL(p̂^n ‖ p^n),
where p̂^n spreads ŷ^n over its two neighboring integer classes (p̂^n_k = ŷ^n − ⌊ŷ^n⌋ for k = ⌊ŷ^n⌋ + 1, p̂^n_k = ⌊ŷ^n⌋ + 1 − ŷ^n for k = ⌊ŷ^n⌋, and 0 otherwise). When, for each n, there is some k such that p̂^n_k = 1 and p̂^n_h = 0 for all h ≠ k, KLD is identical to NLL.
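For reference, the gold distribution p̂ used by the KLD objective is commonly built by spreading the real-valued score over its two neighboring integer classes (following Tai et al., 2015); the sketch below may differ from the exact construction used in our system.

```python
import torch

def gold_distribution(score, n_classes=6):
    """Spread a real-valued gold score in [0, 5] over its two neighboring integer classes."""
    p = torch.zeros(n_classes)
    lower = int(score)              # floor of the score
    if lower == score:              # integer scores give a one-hot target (KLD == NLL case)
        p[lower] = 1.0
    else:
        p[lower] = lower + 1 - score
        p[lower + 1] = score - lower
    return p
```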
However, the evaluation metric of this task is the Pearson correlation coefficient (PCC), which is invariant to changes in the location and scale of y^n, a property that none of the above objectives captures. We use an example to illustrate that MSE and KLD can even show an inverse tendency: in Table 1, group A has lower MSE and KLD loss than group B, but its PCC is also lower.
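As a toy check (with made-up predictions, not the contents of Table 1), the following snippet shows that a lower MSE does not imply a higher PCC.

```python
import numpy as np

gold = np.array([1.0, 2.0, 3.0])
pred_a = np.array([2.0, 2.1, 2.0])   # hugs the mean: small errors, wrong ordering
pred_b = np.array([0.0, 1.0, 2.0])   # shifted by 1: larger errors, perfect ordering

for name, pred in [("A", pred_a), ("B", pred_b)]:
    mse = np.mean((pred - gold) ** 2)
    pcc = np.corrcoef(pred, gold)[0, 1]
    print(name, round(mse, 3), round(pcc, 3))
# Group A has the lower MSE (~0.67 vs 1.0) but also the lower PCC (0.0 vs 1.0).
```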
To solve this problem, we train the model to maximize PCC directly. Hence, the loss function is

L_PCC = −PCC(y, ŷ) = − Σ_n (y^n − ȳ)(ŷ^n − μ) / ( √(Σ_n (y^n − ȳ)²) · √(Σ_n (ŷ^n − μ)²) ),

where ȳ and μ are the batch means of the predicted and gold scores. Since N is fixed for every batch, L_PCC is differentiable with respect to y^n, which means we can apply backpropagation to train the network. To the best of our knowledge, we are the first team to adopt this training objective.

We gather data from SICK (Marelli et al., 2014) and past STS shared tasks since 2012 (Agirre et al., 2012) for both the cross-lingual and monolingual subtasks. We shuffle and split the data 80:20 into a training set and a validation set, respectively. Table 2 reports the sizes of the training and validation sets. All non-English sentences appearing in the training, validation, and test sets are translated into English with the Google Cloud Translation API.
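The L_PCC objective defined above can be implemented as a batch-level loss; a minimal PyTorch sketch (not the released code) is:

```python
import torch

def pcc_loss(y_pred, y_gold, eps=1e-8):
    """Negative Pearson correlation over a mini-batch; minimizing it maximizes PCC."""
    yp = y_pred - y_pred.mean()        # center the predicted scores
    yg = y_gold - y_gold.mean()        # center the gold scores
    pcc = (yp * yg).sum() / (yp.norm() * yg.norm() + eps)
    return -pcc
```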

Experiments
In the experiments, the size of the GRU output is set to H = 200. We use the ADAM algorithm to optimize the parameters with mini-batches of 125. The learning rate is set to 10^-4 initially and halved every 5 epochs. We train the network for 15 epochs.
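The reported schedule (ADAM, initial learning rate 10^-4, halved every 5 epochs, 15 epochs) corresponds, for example, to the following PyTorch setup; the model below is a placeholder standing in for the full network.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder for the full network described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# halve the learning rate every 5 epochs, as described in the text
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(15):
    # ... one pass over mini-batches of 125 sentence pairs goes here ...
    scheduler.step()
```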

Word embeddings: In Table 3, we show that the system performs better with pre-trained word vectors (WI) than with randomly initialized ones (RI).

Table 3: System performance with different dimensions of word embeddings, using either randomly initialized or pre-trained word embeddings.
Loss function: We compare systems optimized with KLD, MSE, and PCC. As shown in Table 4 and Figure 2, when L_PCC is used as the training objective, our system not only performs best but also converges fastest.
Loss function    PCC
L_KLD            0.6839
L_MSE            0.7863
L_PCC            0.8174

Table 4: Influence of different loss objectives on system performance, measured by PCC on our validation set.

Final System Results
We tune the model on the validation set and select the set of hyper-parameters that yields the best performance to obtain the scores on the test data. We report the official provisional results in Table 5. There is an obvious performance drop on track 4b, which affects all teams. We hypothesize that the sentences in track 4b (en-es) are collected from a special domain, due to the fact that the number of

Conclusion
We propose a simple neural-based system with a novel means of optimization. We adopt a simple neural network with surface features, which leads to promising performance. We also review several popular training objectives and empirically show that optimizing directly on Pearson's correlation coefficient achieves the best scores and performs competitively on STS 2017.