HCTI at SemEval-2017 Task 1: Use convolutional neural network to evaluate Semantic Textual Similarity

This paper describes our convolutional neural network (CNN) system for the Semantic Textual Similarity (STS) task. We calculated the semantic similarity score between two sentences by comparing their semantic vectors. We generated the semantic vector of each sentence by max pooling over every dimension of its word vectors. There are two key points in our system. One is that we trained a CNN to transfer the GloVe word vectors into a form better suited to the STS task before pooling. The other is that we trained a fully-connected neural network (FCNN) to transfer the difference of the two semantic vectors into a probability distribution over similarity scores. We decided all hyperparameters empirically. In spite of the simplicity of our neural network system, we achieved good accuracy and ranked 3rd in the primary track of SemEval 2017.


Introduction
Semantic Textual Similarity (STS) is the task of determining the degree of semantic similarity between two sentences. The STS task is a building block of many natural language processing (NLP) applications and has therefore received a significant amount of attention in recent years. STS tasks in SemEval have been held from 2012 to 2017 (Cer et al., 2017). Successfully estimating the degree of semantic similarity between two sentences requires a very deep understanding of both sentences. Well-performing STS methods can be applied to many other natural language understanding tasks, including paraphrasing, entailment detection, answer selection, hypothesis evidencing, machine translation (MT) evaluation and quality estimation, summarization, question answering (QA) and short answer grading.
Measuring sentence similarity is challenging for two reasons. One is the variability of linguistic expression, and the other is the limited amount of annotated training data. Therefore, conventional NLP approaches, such as sparse, hand-crafted features, are difficult to use. However, neural network systems (He et al., 2015a; He and Lin, 2016) can alleviate data sparseness with pre-training and distributed representations. We propose a convolutional neural network system with 5 components: 1) Enhance GloVe word vectors by adding hand-crafted features. 2) Transfer the enhanced word vectors to a more suitable form with a convolutional neural network. 3) Max pool over every dimension of all word vectors to generate a semantic vector. 4) Generate a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the two semantic vectors. 5) Transfer the semantic difference vector to a probability distribution over similarity scores with a fully-connected neural network. Figure 1 provides an overview of our system. The two sentences to be semantically compared are first pre-processed as described in subsection 2.1. Then the CNN described in subsection 2.2 combines the word vectors from each sentence into an appropriate sentence level embedding. After that, the methods described in subsection 2.3 are used to compute representations that compare paired sentence level embeddings. Then, a fully-connected neural network (FCNN) described in subsection 2.4 transfers the semantic difference vector to a probability distribution over similarity scores. All hyperparameters in our system were empirically tuned for the STS task and are shown in Table 1. We implemented our neural network system using Keras (Chollet, 2015) and TensorFlow (Abadi et al., 2016).

Pre-process
Several text preprocessing operations were performed before feature engineering: 1) All punctuation is removed. 2) All sentences are tokenized by the Natural Language Toolkit (NLTK) (Bird et al., 2009). 3) All words are replaced by pre-trained GloVe word vectors (Common Crawl, 840B tokens) (Pennington et al., 2014); words that do not exist in the pre-trained embeddings are set to the zero vector. 4) All sentences are padded to a static length l = 30 with zero vectors (He et al., 2015a).
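As a concrete illustration, the preprocessing steps above can be sketched as follows. This is a minimal sketch, not the actual system: a toy embedding table stands in for the pre-trained GloVe vectors, and lowercased whitespace tokenization stands in for NLTK tokenization.

```python
import numpy as np

SENT_LEN = 30   # static padding length l = 30 from the paper
DIM = 4         # toy dimension; the real GloVe vectors are 300-d

# Toy embedding table standing in for GloVe (Common Crawl, 840B tokens).
toy_glove = {
    "a": np.ones(DIM),
    "cat": np.full(DIM, 2.0),
    "sat": np.full(DIM, 3.0),
}

def preprocess(sentence):
    """Remove punctuation, tokenize, look up word vectors, pad to SENT_LEN."""
    cleaned = "".join(c for c in sentence.lower() if c.isalnum() or c.isspace())
    tokens = cleaned.split()
    # OOV words and padding positions are set to the zero vector.
    vecs = [toy_glove.get(t, np.zeros(DIM)) for t in tokens[:SENT_LEN]]
    vecs += [np.zeros(DIM)] * (SENT_LEN - len(vecs))
    return np.stack(vecs)   # shape: (SENT_LEN, DIM)
```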
Several hand-crafted features are added to enhance the GloVe word vectors: 1) If a word appears in both sentences, add a TRUE flag to the word vector, otherwise, add a FALSE flag.
2) If a word is a number, and the same number appears in the other sentence, add a TRUE flag to the word vector of the matching number in each sentence, otherwise, add a FALSE flag.
3) The part-of-speech (POS) tag of every word according to NLTK is added as a one-hot vector.
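The first two flag features can be sketched as below (a minimal illustration assuming tokenized input; the POS one-hot feature is omitted here for brevity):

```python
import numpy as np

def overlap_flags(tokens_a, tokens_b):
    """Feature 1: one flag per word of sentence A, TRUE (1.0) if the
    word also appears in sentence B, else FALSE (0.0)."""
    other = set(tokens_b)
    return np.array([1.0 if t in other else 0.0 for t in tokens_a])

def number_flags(tokens_a, tokens_b):
    """Feature 2: flag a number in sentence A if the same number
    appears in sentence B."""
    other_nums = {t for t in tokens_b if t.isdigit()}
    return np.array([1.0 if t.isdigit() and t in other_nums else 0.0
                     for t in tokens_a])
```

In the full system these flags (plus the POS one-hot vector) are concatenated onto each 300-d GloVe vector before the CNN.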

Convolutional neural network (CNN)
Our CNN consists of n = 300 one-dimensional filters. The length of the filters is set to the dimension of the enhanced word vectors. The activation function of the CNN is ReLU (Nair and Hinton, 2010). We did not use any regularization or dropout; early stopping triggered by model performance on validation data was used to avoid overfitting. The number of layers is set to 1. We used the same model weights to transfer each of the words in a sentence. Sentence level embeddings are calculated by max pooling (Scherer et al., 2010) over every dimension of the transformed word level embeddings.

1 http://github.com/fchollet/keras
2 http://github.com/tensorflow/tensorflow
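Because each filter spans the full enhanced word vector and the weights are shared across positions, the convolution reduces to a per-word linear map followed by ReLU. A minimal NumPy sketch (the 305-d input size is an illustrative assumption for the enhanced vectors, and the he-style random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FILTERS, DIM, SENT_LEN = 300, 305, 30

# One weight vector per filter, shared across all word positions.
W = rng.normal(size=(N_FILTERS, DIM)) * np.sqrt(2.0 / DIM)

def semantic_vector(word_vectors):
    """(SENT_LEN, DIM) enhanced word vectors -> (N_FILTERS,) sentence embedding."""
    transformed = np.maximum(word_vectors @ W.T, 0.0)   # shared filters + ReLU
    return transformed.max(axis=0)                      # max pool per dimension
```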

Comparison of semantic vectors
To calculate the semantic similarity score of two sentences, we generate a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the corresponding paired sentence level embeddings:

SDV = [ |SV1 - SV2| ; SV1 ∘ SV2 ]

Here, SDV is the semantic difference vector, SV1 and SV2 are the semantic vectors of the two sentences, and ∘ is the Hadamard product, which generates the element-wise multiplication of the two semantic vectors.
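A minimal sketch of this comparison step:

```python
import numpy as np

def semantic_difference_vector(sv1, sv2):
    """Concatenate |SV1 - SV2| with the Hadamard product SV1 * SV2.
    Two 300-d semantic vectors yield the 600-d vector fed to the FCNN."""
    return np.concatenate([np.abs(sv1 - sv2), sv1 * sv2])
```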

Fully-connected neural network (FCNN)
An FCNN is used to transfer the semantic difference vector (600 dimensions) to a probability distribution over the six similarity labels used by STS. The number of layers is set to 2. The first layer uses 300 units with a tanh activation function. The second layer produces the similarity label probability distribution with 6 units and a softmax activation function. We train without using regularization or dropout.
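A NumPy sketch of the FCNN (randomly initialized weights for illustration; in the real system they are learned):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(300, 600)) * np.sqrt(2.0 / 600)   # layer 1: 300 units
b1 = np.zeros(300)
W2 = rng.normal(size=(6, 300)) * np.sqrt(2.0 / 300)     # layer 2: 6 units
b2 = np.zeros(6)

def fcnn(sdv):
    """600-d semantic difference vector -> probabilities over 6 labels."""
    h = np.tanh(W1 @ sdv + b1)            # tanh hidden layer
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()
```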

Experiments and Results
We randomly split each dataset file of SemEval 2012-2015 (Agirre et al., 2012, 2013, 2014, 2015) into ten parts. We used the data preparation from (Baudis et al., 2016). We used 90% of the pairs in each individual dataset file for training and the other 10% for validation. We tested our model on the English dataset of SemEval-2016 (Agirre et al., 2016). Our objective function is the Pearson correlation coefficient computed over each batch. Adam was used as the gradient descent optimization method, with all parameters set to the values suggested by (P. Kingma and Ba, 2015): learning rate 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-08. he_uniform (He et al., 2015b) was used as the initialization function of the layers. We ran the experiment 8 times and chose the model that achieved the best performance on the validation dataset. Our system obtained a Pearson correlation coefficient of 0.7192±0.0062. We also used the same model design to take part in all tracks of SemEval-2017. We submitted two runs: one with machine translation (MT) and another without (non-MT). In the MT run, we translated all other languages in the test dataset into English with Google Translate and used the English model to evaluate all similarity scores. For the monolingual tracks, we also tried the non-MT run, in which we trained the models directly on the English, Spanish and Arabic data. Here, we independently trained another English model for each run. The difference between the English-English performance of the MT and non-MT runs is caused by the random shuffling of data during training.
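The batch-level Pearson objective can be sketched as follows (a minimal version computed on predicted and gold similarity scores, negated so that minimizing it with an optimizer such as Adam maximizes the correlation):

```python
import numpy as np

def pearson_objective(y_true, y_pred):
    """Negative Pearson correlation over a batch of similarity scores."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    denom = np.sqrt((yt ** 2).sum()) * np.sqrt((yp ** 2).sum()) + 1e-12
    return -(yt * yp).sum() / denom
```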
We also trained another English model with the same design to evaluate the STS benchmark dataset (Cer et al., 2017). We used only the Train part for training and the Dev part for fine-tuning. We also ran our system without any hand-crafted features, as a purely sentence representation system.

Discussion
The differences between our model's performance and that of the best participating systems are relatively small for all tracks except tracks 4b and 6. We note that the sentences in track 4b are significantly longer than the sentences in other tracks. We speculate that the results of our system in track 4b were pulled down by the decision to use static padding of length 30 within our model.
Another observable trend is that the non-MT results were likely harmed by the smaller amounts of available training data. We had over 10,000 training pairs for English, but only 1634 pairs in Spanish and 1104 in Arabic. Correspondingly, for our non-MT models, we achieved our best Pearson correlation scores on English, with diminished results on Spanish and our worst results on Arabic. Notably, the results obtained by combining our English model with MT to handle Spanish and Arabic were not affected by the limited amount of training data for these two languages and provided better performance.

Conclusion
We proposed a simple convolutional neural network system for the STS task. First, it uses a convolutional neural network to transfer GloVe word vectors enhanced with hand-crafted features. Then, it calculates a semantic vector representation of each sentence by max pooling over every dimension of its transformed word vectors. After that, it generates a semantic difference vector for two paired sentences by concatenating the element-wise absolute difference and the element-wise multiplication of their semantic vectors. Next, it uses a fully-connected neural network to transfer the semantic difference vector to a probability distribution over similarity scores.
In spite of the simplicity of our neural network system, the difference in performance between our proposed model and the best performing systems that participated in the STS shared task is less than 0.1 absolute in almost all STS tracks, resulting in our model being ranked 3rd in the primary track of SemEval STS 2017.