ECNU at SemEval-2017 Task 1: Leverage Kernel-based Traditional NLP features and Neural Networks to Build a Universal Model for Multilingual and Cross-lingual Semantic Textual Similarity

To address semantic similarity on multilingual and cross-lingual sentences, we first translate the foreign languages into English and then feed our monolingual English system with various interactive features. Our system is further strengthened by combining it with deep learning semantic similarity models, and our best run achieves a mean Pearson correlation of 73.16% in the primary track.


Introduction
Sentence semantic similarity is a building block of natural language understanding. Previous Semantic Textual Similarity (STS) tasks in SemEval focused on judging sentence pairs in English and achieved great success. The SemEval-2017 STS shared task concentrates on the evaluation of sentence semantic similarity in multilingual and cross-lingual settings (Agirre et al., 2017). There are two challenges in modeling multilingual and cross-lingual sentence similarity. On the one hand, the task requires human linguistic expertise to design language-specific features because languages differ in their characteristics. On the other hand, the lack of sufficient training data for a particular language leads to poor performance.
The SemEval-2017 STS shared task assesses the ability of participant systems to estimate the degree of semantic similarity between monolingual and cross-lingual sentences in Arabic, English and Spanish. It is organized into a set of six secondary sub-tracks (Track 1 to Track 6) and a single combined primary track (Primary Track), which is achieved by submitting results for all of the secondary sub-tracks. Specifically, Tracks 1, 3 and 5 determine STS scores for monolingual sentence pairs in Arabic, Spanish and English, respectively. Tracks 2, 4 and 6 involve estimating STS scores for cross-lingual sentence pairs from the combination of two particular languages, i.e., Arabic-English, Spanish-English and surprise language (here, Turkish)-English cross-lingual pairs. Given two sentences, a continuous-valued similarity score on a scale from 0 to 5 is returned, with 0 indicating that the semantics of the sentences are completely independent and 5 signifying semantic equivalence. A system is assessed by computing the Pearson correlation between the similarity scores it returns and human judgements.
To address this task, we first translate all sentences into English using a state-of-the-art machine translation (MT) system, Google Translator. Then we adopt a combination method to build a universal model for estimating semantic similarity, which consists of traditional natural language processing (NLP) methods and deep learning methods. For the traditional NLP methods, we design multiple effective NLP features to depict the degree of semantic matching, and supervised machine learning-based regressors are then trained to make predictions. For the neural network methods, we first obtain a distributed representation for each sentence in a sentence pair and then feed these representations into end-to-end neural networks to output similarity scores. Finally, the scores returned by the regressors with traditional NLP methods and by the neural network models are equally averaged to obtain the final semantic similarity score.

System Description
Figure 1 shows the overall architecture of our system, which consists of the following three modules:
Traditional NLP Module extracts two kinds of NLP features. The sentence pair matching features directly calculate the similarity of two sentences from several aspects, while the single sentence features first represent each sentence with NLP methods and then apply kernel-based methods to calculate the similarity of the two sentences. All these NLP-based similarity scores act as features for building regressors that make the prediction.
Deep Learning Module encodes input sentence pairs into distributed vector representations and then trains end-to-end neural networks to obtain similarity scores.
Ensemble Module equally averages the outputs of the above two modules to get a final score.
Figure 1: The system architecture
Next, we describe the system in detail.

Traditional NLP Module
In this section, we give the details of feature engineering and learning algorithms.

Sentence Pair Matching Features
Five types of sentence pair matching features are designed to directly calculate the similarity of two sentences based on the overlaps of character/word/sequence, syntactic structure, alignment and even MT metrics.
N-gram Overlaps: Let S_i be the set of consecutive n-grams in sentence i. The n-gram overlap (denoted as ngo) is defined as (Šarić et al., 2012):
$$\mathrm{ngo}(S_1, S_2) = 2 \cdot \left( \frac{|S_1|}{|S_1 \cap S_2|} + \frac{|S_2|}{|S_1 \cap S_2|} \right)^{-1}$$
We obtain n-grams at three different levels (i.e., the original word, the lemmatized word and the character level), where n = {1, 2, 3} is used for the two word levels and n = {2, 3, 4, 5} is used for the character level. In total, we collect 10 features.
Sequence Features: Sequence features are designed to capture richer sequence information than the n-gram overlaps. We compute the longest common prefix / suffix / substring / subsequence and the Levenshtein distance for each sentence pair. Note that stopwords are removed and each word is lemmatized so as to estimate sequence similarity more accurately. As a result, we get 5 features.
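The n-gram overlap can be illustrated with a minimal Python sketch. It assumes the harmonic-mean form of ngo from Šarić et al. (2012), whitespace tokenization and lowercasing; the function names are hypothetical, and the lemmatized-word level is omitted because it needs an external lemmatizer.

```python
def ngrams(items, n):
    """Return the set of consecutive n-grams of a sequence (list of words or a string)."""
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def ngram_overlap(a, b, n):
    """Harmonic mean of the coverage ratios |S1 & S2| / |S1| and |S1 & S2| / |S2|."""
    s1, s2 = ngrams(a, n), ngrams(b, n)
    common = len(s1 & s2)
    if common == 0:
        return 0.0  # no shared n-gram (also covers sentences shorter than n)
    return 2.0 / (len(s1) / common + len(s2) / common)


def ngram_features(sent1, sent2):
    """Word-level (n = 1..3) and character-level (n = 2..5) overlaps: 7 of the 10 features."""
    t1, t2 = sent1.lower(), sent2.lower()
    w1, w2 = t1.split(), t2.split()  # assumption: whitespace tokenization
    feats = [ngram_overlap(w1, w2, n) for n in (1, 2, 3)]
    feats += [ngram_overlap(t1, t2, n) for n in (2, 3, 4, 5)]
    return feats
```

For instance, `ngram_features("The respite was short", "The respite was brief")` yields seven values, the first of which is the word unigram overlap of the two sentences.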
Syntactic Parse Features: In order to model tree structured similarity between two sentences rather than sequence-based similarity, inspired by Moschitti (2006), we adopt tree kernels to calculate the similarity between two syntactic parse trees. In particular, we calculate the number of common substructures in three different kernel spaces, i.e., subtree (ST), subset tree (SST), partial tree (PT). Thus we get 3 features.
Alignment Features: Sultan et al. (2015) used a word aligner to align matching words across a pair of sentences and then computed the proportion of aligned words as follows:
$$\mathrm{sim}(S_1, S_2) = \frac{n_a(S_1) + n_a(S_2)}{n(S_1) + n(S_2)}$$
where n_a(S) and n(S) are the number of aligned words and the number of non-repeated words in sentence S, respectively. To assign appropriate weights to different words, we adopt two weighting methods: i) weighted by five POS tags (i.e., noun, verb, adjective, adverb and others): we first group the words in the two sentences into 5 POS categories, then for each POS category we compute the proportion of aligned words, which gives 5 features; ii) weighted by IDF values (calculated in each dataset separately). In total, we collect 7 alignment features.
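As a rough illustration of the unweighted alignment feature, the sketch below computes the proportion of aligned words from a list of index pairs. The exact-match aligner is only a toy stand-in (an assumption for this example) for the monolingual word aligner of Sultan et al. (2015), and both function names are hypothetical.

```python
def exact_match_alignment(words1, words2):
    """Toy aligner: link each word to the first unused identical word in the other sentence."""
    used, pairs = set(), []
    for i, w in enumerate(words1):
        for j, v in enumerate(words2):
            if j not in used and w == v:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


def aligned_proportion(words1, words2, alignment):
    """(n_a(S1) + n_a(S2)) / (n(S1) + n(S2)), counting non-repeated words only."""
    aligned1 = {words1[i] for i, _ in alignment}
    aligned2 = {words2[j] for _, j in alignment}
    total = len(set(words1)) + len(set(words2))
    if total == 0:
        return 0.0
    return (len(aligned1) + len(aligned2)) / total
```

The POS-weighted variants would apply the same ratio to the subset of words in each POS category, and the IDF-weighted variant would replace the word counts with sums of IDF values.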
MT based Features: Following previous work (Zhao et al., 2014; Zhao et al., 2015), we use MT evaluation metrics to measure the semantic equivalence of the given sentence pairs. Nine MT metrics (i.e., BLEU, GTM-3, NIST, -WER, -PER, Ol, -TERbase, METEOR-ex, ROUGE-L) are used to assess the similarity. These 9 MT based features are calculated using the Asiya Open Toolkit.
Finally, we collect a total of 34 sentence pair matching features.

Single Sentence Features
Unlike the sentence pair matching features above, which directly estimate a matching score between two sentences, the single sentence features represent each sentence in the same vector space so that sentence similarity can be calculated from the two representations. We design the following three types of features.
BOW Features: Each sentence is represented as a Bag-of-Words (BOW) and each word (i.e., dimension) is weighted by its IDF value.
Dependency Features: For each sentence, its dependency tree is interpreted as a set of triples, i.e., (dependency-label, governor, subordinate). Similar to BOW, we treat triples as words and represent each sentence as Bag-of-Triples.
Word Embedding Features: Each sentence is represented by concatenating the min/max/average pooling of the vector representations of its words. Note that each word vector is weighted by the IDF value of the word. Table 1 lists the four state-of-the-art pretrained word embeddings used in this work.
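The pooled sentence representation can be sketched in plain Python as follows. The embedding and IDF lookups are passed in as dictionaries; skipping out-of-vocabulary words and defaulting a missing IDF value to 1.0 are assumptions made for this example, not details given in the text.

```python
def pooled_sentence_vector(words, embeddings, idf, dim):
    """Concatenate min, max and average pooling of IDF-weighted word vectors."""
    vectors = [
        [idf.get(w, 1.0) * x for x in embeddings[w]]  # assumption: default IDF of 1.0
        for w in words
        if w in embeddings  # assumption: out-of-vocabulary words are skipped
    ]
    if not vectors:
        return [0.0] * (3 * dim)
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    avgs = [sum(c) / len(c) for c in columns]
    return mins + maxs + avgs
```

With 300-dimensional embeddings this gives a 900-dimensional vector per sentence, which is one reason the single sentence features are so much larger than the pair matching features.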

Embedding                          Dimension  Source
word2vec (Mikolov et al., 2013)    300d       GoogleNews-vectors-negative300.bin
GloVe (Pennington et al., 2014)    100d       glove.6B.100d.txt
                                   300d       glove.6B.300d.txt
paragram (Wieting et al., 2015)    300d       paragram 300 sl999.txt
Table 1: Four pretrained word embeddings

However, in comparison with the number of sentence pair matching features (34 features), the dimensionality of the single sentence features is huge (more than 71K features), and it would therefore suppress the discriminating power of the sentence pair matching features. To reduce this high dimensionality, for each single sentence feature we use 11 kernel functions to calculate sentence pair similarities. Table 2 lists the 11 kernel functions used in this work. In total, we collect 33 single sentence features.
Finally, these 67 NLP features are standardized into [0, 1] using max-min normalization before building the regressor models.
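To make the dimensionality reduction concrete, the sketch below turns two high-dimensional sentence vectors into scalar kernel scores and then rescales a feature column with max-min normalization. The 11 kernels of Table 2 are not reproduced in this text, so the cosine and RBF kernels here are illustrative assumptions rather than the authors' exact choices, and the `gamma` value is arbitrary.

```python
import math


def cosine_kernel(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is a hypothetical setting."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def min_max_normalize(column):
    """Rescale one feature column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]
```

Each kernel applied to each of the three single sentence representations yields one scalar feature per sentence pair, which is how 3 representations and 11 kernels give 33 features.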

Regression Algorithms
Five learning algorithms for regression are explored, i.e., Random Forests (RF), Gradient Boosting (GB), Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and XGBoost (XGB). Specifically, the first four algorithms are implemented in the scikit-learn toolkit, and XGB is implemented in xgboost. In preliminary experiments, SVM and SGD underperformed the other three algorithms, and thus we adopt RF, GB and XGB in the following experiments.

Deep Learning Module
Unlike the above method, which adopts manually designed NLP features, the deep learning based models calculate the semantic similarity score with pretrained word vectors as inputs. The four pretrained word embeddings listed in Table 1 are explored, and the paragram embeddings achieved better results in preliminary experiments. A possible reason is that the paragram embeddings are trained on the Paraphrase Database, an extensive semantic resource that consists of many phrase pairs. Therefore, we use the paragram embeddings to initialize word vectors.
Based on the pretrained word vectors, we adopt the following four methods to obtain a single sentence vector, following Wieting et al. (2015): (1) simply averaging the word vectors in the sentence; (2) multiplying the averaged vector from (1) by a projection matrix; (3) using a deep averaging network (DAN, Iyyer et al. (2015)) consisting of multiple layers as well as nonlinear activation functions; (4) using a long short-term memory network (LSTM, Hochreiter and Schmidhuber (1997)) to capture long-distance dependency information.
To obtain the vector of a sentence pair, given the two single sentence vectors, we first compute their element-wise difference and their element-wise product, and then concatenate the two resulting vectors as the final sentence pair representation. Finally, we use a fully-connected neural network that outputs the probability of similarity through a softmax function. Thus we obtain 4 deep learning based scores.
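The pair representation and the softmax output can be sketched without any deep learning framework. The sketch assumes a signed element-wise difference, and it assumes that the softmax is taken over the six integer similarity levels 0 to 5 with the final score read off as the expected value; these choices follow common practice in Tai et al. (2015) style models and are not spelled out in the text.

```python
import math


def pair_representation(u, v):
    """Concatenate the element-wise difference and element-wise product of two sentence vectors."""
    diff = [a - b for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs):
    """Expected similarity when probs[k] is the probability of integer score k (assumed 0..5)."""
    return sum(k * p for k, p in enumerate(probs))
```

In a full model, a fully-connected layer would map `pair_representation(u, v)` to six logits before `softmax` is applied.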
To learn the model parameters, we minimize the KL-divergence between the outputs and the gold labels, as in Tai et al. (2015). We adopt Adam (Kingma and Ba, 2014) as the optimization method and set the learning rate to 0.01.
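The training objective can be sketched as follows: a real-valued gold score is spread over its two neighbouring integer levels, in the manner of Tai et al. (2015), and the loss is the KL-divergence from this target to the predicted distribution. Using six classes for the 0 to 5 scale and clipping predictions at a small `eps` are assumptions made for this example.

```python
import math


def target_distribution(score, num_classes=6):
    """Sparse target over integer levels 0..num_classes-1 whose expectation equals score."""
    low = int(math.floor(score))
    p = [0.0] * num_classes
    if low >= num_classes - 1:  # score is the top level, e.g. 5.0
        p[num_classes - 1] = 1.0
        return p
    p[low] = low + 1 - score
    p[low + 1] = score - low
    return p


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), skipping zero-probability target entries and clipping q at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

For a gold score of 3.6, the target puts mass 0.4 on level 3 and 0.6 on level 4, so a model that reproduces this distribution exactly incurs zero loss.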

Ensemble Module
The NLP-based scores and the deep learning based scores are averaged in the ensemble module to obtain the final score.
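A minimal sketch of this equal-weight averaging is given below; the function name and the list-of-lists input layout are hypothetical.

```python
def ensemble_average(score_lists):
    """Average per-pair predictions across models.

    score_lists holds one list of predictions per model, all aligned on the same sentence pairs.
    """
    n_models = len(score_lists)
    return [sum(scores) / n_models for scores in zip(*score_lists)]
```

For the EN-seven configuration, `score_lists` would contain seven lists, three from the regressors and four from the deep learning models.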

Experimental Settings
Datasets: SemEval-2017 provided 7 tracks in monolingual and cross-lingual language pairs. We first translate all sentences into English via Google Translator and then build a universal model on English pairs only. The training set we use is all the monolingual English data from the SemEval STS tasks (2012-2015), consisting of 13,592 sentence pairs.
For each track, we take the training datasets provided by SemEval-2017 as the development set. Almost all test data is from SNLI, except for Track 4b, which is from WMT. This can explain why the performance on Track 4b SP-EN-WMT is very poor. We therefore perform 10-fold cross validation (CV) on Track 4b SP-EN-WMT.
Preprocessing: All sentences are translated into English via Google Translator. Stanford CoreNLP is used for tokenization, lemmatization, POS tagging and dependency parsing.
Evaluation: For Track 1 to Track 6, Pearson correlation coefficient is used to evaluate each individual test set. For Primary Track, since it is achieved by submitting results of all the secondary sub-tracks, a macro-averaged weighted sum of all correlations on sub-tracks is used for evaluation.
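For reference, the per-track metric can be computed as in the sketch below. Returning 0.0 when either score list has zero variance is a choice made for this example, since the coefficient is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between system scores and gold scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # undefined correlation; reported as 0.0 in this sketch
    return cov / (std_x * std_y)
```

Because the coefficient is invariant to linear rescaling, a system is rewarded for ranking and spacing pairs like the annotators do, not for matching the absolute 0 to 5 values.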

Results on Training Data
A series of comparison experiments on English STS 2016 training set have been performed to explore different features and algorithms. Table 4 lists the results of different NLP features with GB learning algorithm. We find that: (1) the simple BOW Features with kernel functions are effective for sentence semantic similarity. (2) The combination of all these NLP features achieved the best results, which indicates that all features make contributions. Therefore we do not perform feature selection and use all these NLP features in following experiments. Table 5 lists the results of different algorithms using all NLP features as well as deep learning scores. We find:

Comparison of Learning Algorithms
(1) Regarding machine learning algorithms, RF and GB achieve better results than XGB. GB performs the best on 3 and RF performs the best on 2 of 5 datasets.
(2) Regarding deep learning models, DL-word and DL-proj outperform the other two non-linear models on all five datasets. This result is consistent with the findings of Wieting et al. (2015): "In out-of-domain scenarios, simple architectures such as word averaging vastly outperform LSTMs."
(3) All ensemble methods significantly improve the performance. The ensemble of the 3 machine learning algorithms (RF+GB+XGB) outperforms any single learning algorithm. Similarly, the ensemble of the 4 deep learning models (DL-all) raises the performance to 75.28%, which is significantly better than any single model and is comparable to the result obtained using expert knowledge. Furthermore, the ensemble of the 3 machine learning algorithms and the 4 deep learning models, obtained by averaging these 7 scores (EN-seven), achieves the best results on all development sets of English STS 2016. This suggests that the traditional NLP methods and the deep learning models are complementary to each other and that their combination achieves the best performance.

Results on Cross-lingual Data
To address cross-lingual similarity, we first translate cross-lingual pairs into monolingual pairs and then adopt the universal model to estimate semantic similarity. Thus, translation is critical to the performance. The first, straightforward translation strategy (Strategy 1) is to translate the foreign language into English. We observe that Strategy 1 is likely to introduce synonyms. For example, the English-Spanish pair
The respite was short.
La tregua fue breve.
is translated into the English-English pair
The respite was short.
The respite was brief.
where short and brief are synonyms introduced by MT rather than a difference in the literal meaning expressed in the original languages. Recall that an MT system may favour certain words and that it can also translate English into the foreign language. Thus we propose Strategy 2 for translation, i.e., we first translate the English sentence into the foreign target language and then translate it back into English via MT. Under Strategy 2, both sentences of the above English-Spanish pair are translated into the same English sentence: The respite was brief. Table 6 compares the results of the two strategies on cross-lingual data. It is clear that Strategy 2 achieves better performance, which indicates that the synonym differences introduced by MT in cross-lingual pairs do not carry the same semantic difference as synonym differences in monolingual pairs.
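The two translation strategies can be contrasted in a short sketch. The `translate(text, source_lang, target_lang)` callable is a hypothetical stand-in for an MT service, and the lookup table below is a toy stub that encodes only the example from the text; it does not reproduce any real MT API.

```python
def strategy_1(english, foreign, translate, foreign_lang):
    """Keep the English sentence as-is and translate only the foreign sentence into English."""
    return english, translate(foreign, foreign_lang, "en")


def strategy_2(english, foreign, translate, foreign_lang):
    """Round-trip the English sentence through the foreign language before comparing."""
    round_trip = translate(translate(english, "en", foreign_lang), foreign_lang, "en")
    return round_trip, translate(foreign, foreign_lang, "en")


# Toy MT stub (assumption): a fixed lookup table standing in for a real MT system.
TOY_TABLE = {
    ("The respite was short.", "en", "es"): "La tregua fue breve.",
    ("La tregua fue breve.", "es", "en"): "The respite was brief.",
}


def toy_translate(text, source_lang, target_lang):
    return TOY_TABLE[(text, source_lang, target_lang)]
```

With the toy stub, Strategy 1 leaves short against brief, while Strategy 2 maps both sides to the same sentence, because both pass through the same MT system's word preferences.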

Results on Spanish-English WMT
On the Spanish-English WMT dataset, the system performance dropped dramatically. A possible reason is that the training and test data are from different domains. Therefore, we use 10-fold cross validation on this dataset for evaluation. Table 7 lists the results on Spanish-English WMT, where the last column (wmt(CV)) shows that using the in-domain dataset achieves better performance. Taking a closer look at this dataset, we find that several original Spanish sentences are meaningless. For example, the English-Spanish pair
His rheumy eyes began to cloud.
A sus ojos rheumy comenzó a nube.
has a score of 1, as the second sentence is not a proper Spanish sentence. Since there are many meaningless Spanish sentences in this dataset, which is sourced from MT evaluation, we speculate that these meaningless sentences were created to be used as negative training samples for MT models. Thus, only on this dataset, we take Spanish as the target language and translate the English sentences into Spanish. After that, we use the 9 MT evaluation metrics (mentioned in Section 2.1) to generate MT based features. These 9 MT metrics are then averaged as the similarity score (MT(es) 3 ).
Table 7: Pearson correlations on Spanish-English WMT. MT(es) 3 is calculated using their translated Spanish-Spanish form. We did not perform cross validation in deep learning models and did not ensemble them due to time constraint.
From Table 7, we see that the MT(es) 3 score alone achieves 0.2858 on wmt in terms of Pearson correlation, which even surpasses the best performance (0.2677) of ensemble model. Based on this, we also combine the ensemble model with MT(es) 3 and their averaged score achieves 0.3789 in terms of Pearson correlation.

System Configuration
Based on the above results, we configure the following three systems:
Run 1: all features using the RF algorithm. (RF)
Run 2: all features using the GB algorithm. (GB)
Run 3: ensemble of three algorithms and four deep learning scores. (EN-seven)
In particular, we train Track 4b SP-EN-WMT using the wmt dataset provided in SemEval-2017, and Run 2 and Run 3 on this track are combined with MT(es) 3 features. Table 8 lists the results of our submitted runs on the test datasets. We find that: (1) GB achieves slightly better performance than RF, which is consistent with the results on the training data; (2) the ensemble model significantly improves the performance on all datasets and enhances the performance on the Primary Track by about 3% in terms of Pearson coefficient; (3) on Track 4b SP-EN-WMT, combining with MT(es) 3 significantly improves the performance.

Results on Test Data
The last three rows list the results of the two top systems and one baseline system provided by the organizers. The baseline uses the cosine similarity of one-hot vector representations of sentence pairs. On all language pairs, our ensemble system achieves the best performance. This indicates that both the traditional NLP methods and the deep learning methods contribute to the performance improvement.

Conclusion
To address monolingual and cross-lingual sentence semantic similarity evaluation, we build a universal model that combines traditional NLP methods with deep learning methods. The extensive experimental results show that this combination not only improves the performance but also increases the robustness of modeling the similarity of multilingual sentences. Our future work will concentrate on learning reliable sentence pair representations in deep learning.