MITRE at SemEval-2017 Task 1: Simple Semantic Similarity

This paper describes MITRE’s participation in the Semantic Textual Similarity task (SemEval-2017 Task 1), which evaluated machine learning approaches to the identification of similar meaning among text snippets in English, Arabic, Spanish, and Turkish. We detail the techniques we explored ranging from simple bag-of-ngrams classifiers to neural architectures with varied attention and alignment mechanisms. Linear regression is used to tie the systems together into an ensemble submitted for evaluation. The resulting system is capable of matching human similarity ratings of image captions with correlations of 0.73 to 0.83 in monolingual settings and 0.68 to 0.78 in cross-lingual conditions, demonstrating the power of relatively simple approaches.


Introduction
Semantic Textual Similarity (STS) measures the degree to which two snippets of text convey the same meaning. Cross-lingual STS measures the same for sentence pairs written in two different languages. Automatic identification of semantically similar text has practical applications in domains such as evaluation of machine translation outputs, discovery of parallel sentences in comparable corpora, essay grading, and news summarization. It serves as an easily explained assay for systems modeling semantics.
SemEval-2017 marked the sixth consecutive year of a shared task measuring progress in STS. Current machine learning approaches to measuring semantic similarity vary widely. One design decision for STS systems is whether to explicitly align words between paired sentences. Wieting et al. (2016) demonstrate that sentence embeddings without explicit alignment or attention can often provide reasonable performance on STS tasks. Related work in textual entailment offers evidence that neural models with soft alignment outperform embeddings-only approaches (Chen et al., 2016; Parikh et al., 2016). However, these results were obtained on a dataset multiple orders of magnitude larger than existing STS datasets. In the absence of large datasets, word alignments similar to those used in statistical machine translation have proven to be useful (Zarrella et al., 2015; Itoh, 2016).
In this effort we explored diverse methods for aligning words in pairs of candidate sentences: translation-inspired hard word alignments as well as soft alignments learned by deep neural networks with attention. We also examined a variety of approaches for comparing aligned words, ranging from bag-of-ngrams features leveraging hand-engineered lexical databases to recurrent and convolutional neural networks operating over distributed representations. Although an ideal cross-lingual STS system might operate directly on input sentences in their original language, we used machine translation to convert all the inputs into English. The paucity of in-domain training data and the simplicity of the image caption genre made the translation approach reasonable. Our contribution builds on approaches developed for English STS but points a way forward for progress on knowledge-lean, fully-supervised methods for semantic comparison across different languages.

Task, Data and Evaluation
Semantic Textual Similarity was a shared task organized within SemEval-2017 (Agirre et al., 2017). The task organizers released 1,750 sentence pairs of evaluation data organized into six tracks: Arabic, Spanish, and English monolingual, as well as Arabic-English, Spanish-English, and Turkish-English cross-lingual.
Most of this evaluation data was sourced from the Stanford Natural Language Inference corpus (Bowman et al., 2015). The sentences are English-language image captions, grouped into pairs and human-annotated on a scale of 0 to 5 for semantic similarity. In the monolingual English task, the average sentence length was 8.7 words, and the average rating was 2.3 (e.g. "The woman had brown hair." and "The woman has gray hair."). There was a roughly balanced distribution of highly rated pairs (e.g. "A woman is bungee jumping." and "A girl is bungee jumping.") and poorly rated pairs (e.g. "The yard has a dog." and "The dog is running after another dog."). Annotated sentence pairs were manually translated from English into other languages to create additional tracks.
For each pair, task participants predicted a similarity score. Systems were evaluated by Pearson correlation with the human ratings.
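As a concrete illustration of the evaluation metric, the following minimal sketch computes Pearson correlation between gold ratings and system predictions. The function name `pearson` and the toy ratings are assumptions made for this example; they are not taken from the official scoring script.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical gold ratings (0 to 5 scale) and system predictions.
gold = [4.8, 2.3, 0.5, 3.0]
pred = [4.1, 2.9, 1.2, 3.3]
score = pearson(gold, pred)
```

Because Pearson correlation is invariant to positive linear rescaling of the predictions, a system is rewarded for ranking and spacing pairs consistently with the annotators rather than for matching the 0 to 5 scale exactly.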

System Overview
We created an ensemble of five systems which each independently predicted a similarity score. Some features were reused among many components, including word embeddings, machine translations, alignments, and dependency parses.

English Word Embeddings
We used word2vec (Mikolov et al., 2013) to learn distributed representations of words from the text of the English Wikipedia. We applied word2phrase twice to identify phrases of up to four words, and trained a skip-gram model of size 256 for the 630,902 vocabulary items which appeared at least 100 times, using a context window of 10 words and 15 negative samples per example.

Machine Translation
Sentences in the image caption genre tend to be short and use a simple vocabulary. To test the extent to which this is true of SNLI data, we trained a small unregularized neural language model which achieved a perplexity of 18.9 on a held-out test set. The same parameterization achieved a perplexity of 114.5 in experiments on the Penn Treebank (Zaremba et al., 2014). We proceeded to translate all non-English sentences to English, recognizing that modern MT systems are sufficient to provide high quality translations for simple sentences. We used the Google Translate API in mid-January 2017.

Dependency Parses
The dependency parse arcs were used as features to assist in aligning and comparing pairs of words. The Stanford Parser library produced these typed dependency representations (Chen and Manning, 2014). The English PCFG model with basic dependencies was used rather than the default collapsed dependencies to ensure that the parser gave us exactly one parse arc for each token.

Alignment
Comparing sentences can be a tallying process. One can find all associated atomic pairs in the left hand and right hand sides, cross them off, and judge the dissimilarity based on the remaining residuals. This process is reminiscent of finding translation equivalences for training machine translation systems (Al-Onaizan et al., 1999).
To this end, we built an alignment system on top of word embeddings. First, the min alignment is produced to maximize the sum of shifted cosine similarities, $\mathrm{sim}(w_i, w_j) = 1 + \cos(w_i, w_j)$, of the word vectors corresponding to aligned word pairs, under the constraint that no word is aligned more than once. The max alignment is constrained such that each word must be paired with at least one other, and the total number of edges in the alignment can be no more than the word count of the longer string. In both cases, LPSOLVE was employed to find the assignment maximizing these criteria (Berkelaar et al., 2004).
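The min alignment objective can be illustrated with a small sketch. Note the assumptions: this example replaces the linear program solved with LPSOLVE by an exhaustive search over one-to-one assignments, which is only practical for very short inputs, and the names `sim` and `min_alignment` and the toy two-dimensional vectors are hypothetical.

```python
import math
from itertools import permutations

def sim(u, v):
    """Shifted cosine similarity 1 + cos(u, v), which lies in [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 + dot / norm

def min_alignment(left, right):
    """Best one-to-one alignment by exhaustive search (tiny inputs only).

    Returns (total_similarity, pairs), where each pair (i, j) aligns
    left[i] with right[j] and no index is used more than once.
    """
    swapped = len(left) > len(right)
    short, long_ = (right, left) if swapped else (left, right)
    best_score, best_pairs = -1.0, []
    # Try every way of assigning each word of the shorter side
    # to a distinct word of the longer side.
    for perm in permutations(range(len(long_)), len(short)):
        score = sum(sim(short[i], long_[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = [(j, i) if swapped else (i, j)
                          for i, j in enumerate(perm)]
    return best_score, best_pairs

# Toy word vectors standing in for the 256-dimensional embeddings.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
total, pairs = min_alignment(left, right)
```

Because the shifted similarity is never negative, the best assignment always uses every word of the shorter sentence, so the residual unaligned words all come from the longer one.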
Dependency parses constructed in Section 3.3 were aligned in a similar way. Consider dependency arcs $a_i : \mathrm{head} \rightarrow \mathrm{dep}$. Instead of the sum of cosine similarities as atoms in the linear program, however, we used $\mathrm{sim}(a_1, a_2) = \mathrm{sim}(\mathrm{head}(a_1), \mathrm{head}(a_2)) + 10\,\mathrm{sim}(\mathrm{dep}(a_1), \mathrm{dep}(a_2))$ to give preference to matching dependency arcs $a_1$ and $a_2$ with similar heads.
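The arc-level similarity can be sketched as follows. This is an illustrative reading of the formula as printed, with the factor of 10 applied to the dependent term; representing an arc as a `(head_vector, dep_vector)` tuple and the names `sim` and `arc_sim` are assumptions made for the example.

```python
import math

def sim(u, v):
    """Shifted cosine similarity 1 + cos(u, v), which lies in [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 + dot / norm

def arc_sim(arc1, arc2, dep_weight=10.0):
    """Similarity of two dependency arcs given as (head_vector, dep_vector).

    Combines head similarity with a weighted dependent similarity,
    following the formula in the text.
    """
    head1, dep1 = arc1
    head2, dep2 = arc2
    return sim(head1, head2) + dep_weight * sim(dep1, dep2)

# Toy arcs: identical dependents, identical vs. opposite heads.
arc_a = ([1.0, 0.0], [0.0, 1.0])
arc_b = ([1.0, 0.0], [0.0, 1.0])
arc_c = ([-1.0, 0.0], [0.0, 1.0])
same = arc_sim(arc_a, arc_b)
different_head = arc_sim(arc_a, arc_c)
```

These arc scores would then take the place of the word-pair scores as the quantities summed in the alignment objective.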

Ensemble Components
TakeLab
The open source TakeLab Semantic Text Similarity System was incorporated as a baseline (Šarić et al., 2012). Specifically, we used LIBSVM to train a support vector regression model with an RBF kernel, cost parameter of 20, gamma of 0.2, and epsilon of 0.5. Input features consisted of TakeLab-computed ngram overlap and word similarity metrics.

Recurrent Convolutional Neural Network
We recreated the recurrent neural network (RNN) model described in Zarrella et al. (2015) and trained it using the embeddings and parse-aware alignments described above. Briefly, this 16-dimensional RNN operates over a sequence of aligned word pairs, comparing each pair according to features that encode embedding similarity, word position, and unsupervised string similarity.
We extended this model with four new feature categories. The first was a binary variable that indicates whether both words in the pair were determined to have the same dependency type in their respective parses. We also added three convolutional recurrent neural networks (CRNNs), each of which receives as input a sequence of word embeddings and learns STS features via 256 1D convolutional filters connected (with 50% dropout) to a 128-dimensional LSTM. For each aligned word pair, the first CRNN operates on the embeddings of the aligned words, the second CRNN operates on the squared difference of the embeddings of the aligned words, and the final CRNN operates on the embeddings of the parent words selected by the dependency parse. All of the above RNN outputs were concatenated to form a sequence of 400-dimensional (16 + 3 × 128) timesteps, which fed a 128-dimensional LSTM connected to a single sigmoidal output unit.
We unrolled this network to a zero-padded sequence length of 60 and trained it to convergence using Adam (Kingma and Ba, 2014) with a mean absolute error loss function. The embeddings were not updated during training. We ensembled eight instances of this network trained from different random initializations.

Paris: String Similarity
More than a decade ago, MITRE entered a system based on string similarity metrics in the 2004 Pascal RTE competition (Bayer et al., 2005). The libparis code base implements eight different string similarity and machine translation evaluation algorithms. Measures include an implementation of the MT evaluation metric BLEU (Papineni et al., 2002); WER, a common speech recognition word error rate based on Levenshtein distance (Levenshtein, 1966); WER-g (Foster et al., 2003); ROUGE (Lin and Och, 2004); a simple position-independent error rate similar to PER (Leusch et al., 2003); and both global and local similarity metrics often used for biological string comparison (Gusfield, 1997).
Finally, there are precision and recall measures based on bags of all substrings (or n-grams in word tokenization).
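To make the bag-based measures concrete, here is a minimal sketch of precision and recall over bags of all contiguous n-grams. It is not the libparis implementation, whose exact definitions are not given here; the clipped multiset intersection and the names `ngram_bag` and `bag_precision_recall` are assumptions for illustration.

```python
from collections import Counter

def ngram_bag(tokens):
    """Multiset of every contiguous n-gram (all lengths) in a sequence."""
    bag = Counter()
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            bag[tuple(tokens[start:end])] += 1
    return bag

def bag_precision_recall(candidate, reference):
    """Precision and recall of the candidate's n-gram bag against the reference's."""
    cand, ref = ngram_bag(candidate), ngram_bag(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

# Word-level comparison of two captions quoted earlier in the paper.
p, r = bag_precision_recall("a woman is bungee jumping".split(),
                            "a girl is bungee jumping".split())
```

Passing a plain string instead of a token list yields the character-level variant, since slicing a string produces its substrings. Precision and recall differ when the two inputs have different lengths, which is one reason an asymmetric metric can be worth running in both directions.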
In total, the package computes 22 metrics for a pair of strings. The metrics were run on both casefolded and original versions as well as on word tokens and characters, yielding 88 string similarity features. Some of the metrics are not symmetric, so they were run both forward and reversed based on presentation in the dataset, yielding 176 features. Finally, for each feature value x, log(x) was added as a feature, producing a final count of 352 string similarity features. LIBLINEAR used these features to build an L1-regularized logistic regression model. This system was unchanged, except for retraining, from the system described in Zarrella et al. (2015).

Simple Alignment Measures
Section 3.4 describes the methods we used for aligning two strings. L2-regularized logistic regression was used to combine 16 simple features calculated as side effects of alignment. Details are described in Zarrella et al. (2015).

Enhanced BiLSTM Inference Model (EBIM)
We recreated the neural model described in Chen et al. (2016), which reports state-of-the-art performance on the task of finding entailment in the SNLI corpus. The model encodes each sentence with a bidirectional LSTM over word embeddings, uses a parameter-less attention mechanism to produce a soft alignment matrix for the two sentences, and then does inference over each timestep and its alignment using another LSTM. Two fully connected layers complete the prediction. Chen et al. (2016) improve performance by concatenating the final LSTM representation from EBIM with that of a similar model where a modified LSTM operates over a syntax tree; we did not include this extension in our submission.
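The parameter-less soft alignment step can be sketched as dot-product attention followed by a softmax. This is a simplified illustration, not the submitted model: in the full architecture the inputs are BiLSTM hidden states, whereas the toy vectors and the names `softmax` and `soft_align` below are assumptions for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(a, b):
    """Softly align each vector in `a` to the vectors in `b`.

    Returns (weights, aligned): weights[i][j] is the attention of a[i]
    on b[j], and aligned[i] is the attention-weighted sum of `b`.
    No parameters are learned; scores are plain dot products.
    """
    dim = len(b[0])
    weights, aligned = [], []
    for a_i in a:
        scores = [sum(x * y for x, y in zip(a_i, b_j)) for b_j in b]
        w = softmax(scores)
        weights.append(w)
        aligned.append([sum(w_j * b_j[k] for w_j, b_j in zip(w, b))
                        for k in range(dim)])
    return weights, aligned

# Toy encodings: one timestep in sentence a, two in sentence b.
a = [[1.0, 0.0]]
b = [[5.0, 0.0], [0.0, 5.0]]
weights, aligned = soft_align(a, b)
```

Calling `soft_align(b, a)` gives the alignment in the opposite direction, so each sentence obtains a soft counterpart from the other before the inference LSTM compares each timestep with its alignment.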
Our implementation kept most hyperparameters described in the paper. However, we used the word2vec embeddings described above and found that freezing the embeddings produced better performance for this small dataset. We also found our models worked better without dropout on the embedding layer. Where the original model chooses a class via softmax, we output a semantic similarity score trained to minimize mean squared error.

Table 2: Factored and ablated system components evaluated on our dev set and the official test set.

Ensemble
The semantic similarity estimates of the predictors described above contributed to the final prediction with a weighting determined by L2-regularized logistic regression.

Experiment Details
We used as training data a selection of English monolingual sentence pairs released during prior SemEval STS evaluations. Specifically, we trained on 6,898 pairs of news and caption genre data from the 2012-2014 and 2016 evaluations. We used an additional 400 and 350 captions from the 2015 evaluation as development and tuning sets, respectively. We did not use out-of-genre data (e.g. dictionary definitions, Europarl, web forums, student essays) or the newly-released multilingual 2017 training data. The dev set was used to select hyperparameters for individual components, while the tuning set was used to select the hyperparameters for the final ensemble.

Results
The evaluation of our components on the competition test set is shown in Table 1. The official similarity score produced by this approach achieved 0.6590 correlation with expert judgment averaged across all tracks. A misfiling during construction of the ensemble submission for tracks 5 and 6 reduced the official score from 0.6687. The dev columns of Table 2 show the ability of each individual system in isolation on the dev data ("Factored") as well as the performance of the ensemble when the individual system was removed ("Ablated"). Note that the Align system should have been ablated from the final system to achieve a higher score. Presumably its capability was strictly dominated by the CRNNs that used many of the same features.
The test scores for individual CRNN models ranged from 0.605 to 0.636, highlighting the volatility inherent in the process. The CRNN ensemble improved slightly over the best single model, with a score of 0.638.

Conclusion
Five models of semantic similarity constructed from 2004 to 2016 were combined for paraphrase detection in image captions. The TakeLab bag-of-features SVM developed and open-sourced in 2012, when trained on our selection of in-genre data and evaluated on a machine-translated version of the test set, performed well enough in isolation to place fourth out of seventeen in the Primary Track of the Semantic Textual Similarity competition organized within SemEval-2017 Task 1, which had submissions from 31 teams in total.
Inclusion of explicit word alignments, a neural attention model, and recurrent networks accounting for sequences of syntactic dependencies yielded an improvement in Pearson correlation from 0.650 to 0.672, a modest gain which raised the corrected system's ranking to third. This surprising result is perhaps an indication that image captions have few of the complex linguistic dependencies that typically make estimating semantic similarity a difficult task. Future work could focus on testing whether this result holds when performing cross-lingual STS without explicit machine translation.