SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).


Introduction
Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other. The STS task is motivated by the observation that accurately modeling the meaning similarity of sentences is a foundational language understanding problem relevant to numerous applications including: machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. STS enables the evaluation of techniques from a diverse set of domains against a shared interpretable performance criterion. Semantic inference tasks related to STS include textual entailment (Bentivogli et al., 2016; Bowman et al., 2015; Dagan et al., 2010), semantic relatedness (Bentivogli et al., 2016) and paraphrase detection (Xu et al., 2015; Ganitkevitch et al., 2013; Dolan et al., 2004). STS differs from both textual entailment and paraphrase detection in that it captures gradations of meaning overlap rather than making binary classifications of particular relationships. While semantic relatedness expresses a graded semantic relationship as well, it is non-specific about the nature of the relationship with contradictory material still being a candidate for a high score (e.g., "night" and "day" are highly related but not particularly similar).
To encourage and support research in this area, the STS shared task has been held annually since 2012, providing a venue for evaluation of state-of-the-art algorithms and models (Agirre et al., 2012, 2013, 2014, 2015, 2016). During this time, diverse similarity methods and data sets have been explored. Early methods focused on lexical semantics, surface form matching and basic syntactic similarity (Bär et al., 2012; Šarić et al., 2012a; Jimenez et al., 2012a). During subsequent evaluations, strong new similarity signals emerged, such as Sultan et al. (2015)'s alignment based method. More recently, deep learning became competitive with top performing feature engineered systems. The best performance tends to be obtained by ensembling feature engineered and deep learning models (Rychalska et al., 2016).
Significant research effort has focused on STS over English sentence pairs. 2 English STS is a well-studied problem, with state-of-the-art systems often achieving 70 to 80% correlation with human judgment. To promote progress in other languages, the 2017 task emphasizes performance on Arabic and Spanish as well as cross-lingual pairings of English with material in Arabic, Spanish and Turkish. The primary evaluation criterion combines performance on all of the different language conditions except English-Turkish, which was run as a surprise language track. Even with this departure from prior years, the task attracted 31 teams producing 84 submissions.

Task Overview
STS is the assessment of pairs of sentences according to their degree of semantic similarity. The task involves producing real-valued similarity scores for sentence pairs. Performance is measured by the Pearson correlation of machine scores with human judgments. The ordinal scale in Table 1 guides human annotation, ranging from 0 for no meaning overlap to 5 for meaning equivalence. Intermediate values reflect interpretable levels of partial overlap in meaning. The annotation scale is designed to be accessible by reasonable human judges without any formal expertise in linguistics. Using reasonable human interpretations of natural language semantics was popularized by the related textual entailment task (Dagan et al., 2010). The resulting annotations reflect both pragmatic and world knowledge and are more interpretable and useful within downstream systems.

Evaluation Data
The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is the primary evaluation data source with the exception that one of the cross-lingual tracks explores data from the WMT 2014 quality estimation task (Bojar et al., 2014). Sentence pairs in SNLI derive from Flickr30k image captions (Young et al., 2014) and are labeled with the entailment relations: entailment, neutral, and contradiction. Drawing from SNLI allows STS models to be evaluated on the type of data used to assess textual entailment methods. However, since entailment strongly cues for semantic relatedness (Marelli et al., 2014), we construct our own sentence pairings to deter gold entailment labels from informing evaluation set STS scores.

Table 1: Similarity scores with explanations and English examples from Agirre et al. (2013).

5
The two sentences are completely equivalent, as they mean the same thing.
  The bird is bathing in the sink.
  Birdie is washing itself in the water basin.

4
The two sentences are mostly equivalent, but some unimportant details differ.
  Two boys on a couch are playing video games.
  Two boys are playing a video game.

3
The two sentences are roughly equivalent, but some important information differs/missing.
  John said he is considered a witness but not a suspect.
  "He is not a suspect anymore." John said.

2
The two sentences are not equivalent, but share some details.
  They flew out of the nest in groups.
  They flew into the nest together.

1
The two sentences are not equivalent, but are on the same topic.
  The woman is playing the violin.
  The young lady enjoys listening to the guitar.

0
The two sentences are completely dissimilar.
  The black dog is running through the snow.
  A race car driver is driving his car through the mud.

2 [...] pilot track on cross-lingual Spanish-English STS. The English tracks attracted the most participation and have the largest use of the evaluation data in ongoing research.
Track 4b investigates the relationship between STS and MT quality estimation by providing STS labels for WMT quality estimation data. The data includes Spanish translations of English sentences from a variety of methods including RBMT, SMT, hybrid-MT and human translation. Translations are annotated with the time required for human correction by post-editing and Human-targeted Translation Error Rate (HTER) (Snover et al., 2006). Participants are not allowed to use the gold quality estimation annotations to inform STS scores.
[...] WMT's quality estimation task. Track 6 is a surprise language track with no annotated training data and the identity of the language pair first announced when the evaluation data was released.

Data Preparation
This section describes the preparation of the evaluation data. For SNLI data, this includes the selection of sentence pairs, annotation of pairs with STS labels and the translation of the original English sentences. WMT quality estimation data is directly annotated with STS labels.

Arabic, Spanish and Turkish Translation
Sentences from SNLI are human translated into Arabic, Spanish and Turkish. Sentences are translated independently from their pairs. Arabic translation is provided by CMU-Qatar by native Arabic speakers with strong English skills. Translators are given an English sentence and its Arabic machine translation, which they post-edit to correct errors. Spanish translation is completed by a University of Sheffield graduate student who is a native Spanish speaker and fluent in English. Turkish translations are obtained from SDL.

Embedding Space Pair Selection
We construct our own pairings of the SNLI sentences to deter gold entailment labels from being used to inform STS scores. The word embedding similarity selection heuristic from STS 2016 (Agirre et al., 2016) is used to find interesting pairs. Sentence embeddings are computed as the sum of individual word embeddings, $v(s) = \sum_{w \in s} v(w)$. Sentences with likely meaning overlap are identified using cosine similarity, Eq. (1).
$$\mathrm{sim}_v(s_1, s_2) = \frac{v(s_1) \cdot v(s_2)}{\lVert v(s_1) \rVert_2 \, \lVert v(s_2) \rVert_2} \tag{1}$$
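The selection heuristic can be sketched in a few lines of Python. This is a minimal illustration rather than the organizers' code: the toy two-dimensional vectors, the names `sentence_embedding` and `cosine`, and the decision to skip out-of-vocabulary words are all assumptions made for the example.

```python
import math

def sentence_embedding(tokens, word_vectors, dim):
    """Sum the vectors of the known tokens: v(s) = sum of v(w) for w in s."""
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:  # assumption: out-of-vocabulary words are skipped
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total

def cosine(u, v):
    """Cosine similarity as in Eq. (1); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy embeddings invented for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0],
    "puppy": [0.9, 0.1],
    "runs": [0.0, 1.0],
    "car": [-1.0, 0.2],
}

s1 = sentence_embedding(["dog", "runs"], toy_vectors, dim=2)
s2 = sentence_embedding(["puppy", "runs"], toy_vectors, dim=2)
s3 = sentence_embedding(["car"], toy_vectors, dim=2)
print(cosine(s1, s2), cosine(s1, s3))
```

Candidate pairs with a high cosine would then be passed on for annotation; how the threshold is chosen is not specified here.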

Annotation
Annotation of pairs with STS labels is performed using crowdsourcing, with the exception of Track 4b, which uses a single expert annotator.

Crowdsourced Annotations
Crowdsourced annotation is performed on Amazon Mechanical Turk. Annotators examine the STS pairings of English SNLI sentences. STS labels are then transferred to the translated pairs for cross-lingual and non-English tracks. The annotation instructions and template are identical to Agirre et al. (2016). Labels are collected in batches of 20 pairs with annotators paid $1 USD per batch. Five annotations are collected per pair. The MTurk master qualification is required to perform the task. Gold scores average the five individual annotations.

Expert Annotation
Spanish-English WMT quality estimation pairs for Track 4b are annotated for STS by a University of Sheffield graduate student who is a native speaker of Spanish and fluent in English. This track differs significantly in label distribution and the complexity of the annotation task. Sentences in a pair are translations of each other and tend to be more semantically similar. Interpreting the potentially subtle meaning differences introduced by MT errors is challenging. To accurately assess STS performance on MT quality estimation data, no attempt is made to balance the data by similarity scores.

Training Data
The following summarizes the training data: Table 3 English; Table 4 Spanish; Table 5 Spanish-English; Table 6 Arabic; and Table 7 Arabic-English. Arabic-English parallel data is supplied by translating English training data, Table 8. English, Spanish and Spanish-English training data pulls from prior STS evaluations. Arabic and Arabic-English training data is produced by translating a subset of the English training data and transferring the similarity scores. For the MT quality estimation data in track 4b, Spanish sentences are translations of their English counterparts, differing substantially from existing Spanish-English STS data. We release one thousand new Spanish-English STS pairs sourced from the 2013 WMT translation task and produced by a phrase-based Moses SMT system (Bojar et al., 2013). The data is expert annotated and has a similar label distribution to the track 4b test data with 17% of the pairs scoring an STS score of less than 3, 23% scoring 3, 7% achieving a score of 4 and 53% scoring 5.

Training vs. Evaluation Data Analysis
Evaluation data from SNLI tend to have sentences that are slightly shorter than those from prior years of the STS shared task, while the track 4b MT quality estimation data [...]
Similarity scores for our pairings of the SNLI sentences are slightly lower than in recent shared task years and much lower than in early years. The change is attributed to differences in data selection and filtering. The average 2017 similarity score is 2.2 overall and 2.3 on the track 5 English data. Prior English data has the following average similarity scores: 2016, 2.4; 2015, 2.4; 2014, 2.8; 2013, 3.0; 2012, 3.5. Translation quality estimation data from track 4b has an average similarity score of 4.0.

System Evaluation
This section reports participant evaluation results for the SemEval-2017 STS shared task.

Participation
The task saw strong participation with 31 teams producing 84 submissions. 17 teams provided 44 systems that participated in all tracks. Table 9 summarizes participation by track. Traces of the focus on English are seen in 12 teams participating just in track 5, English. Two teams participated exclusively in tracks 4a and 4b, Spanish-English. One team took part solely in track 1, Arabic.

Evaluation Metric
Systems are evaluated on each track by their Pearson correlation with gold labels. The overall ranking averages the correlations across tracks 1-5 with tracks 4a and 4b individually contributing.
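As an illustration of the metric, the sketch below computes Pearson's r for one track and then averages per-track correlations. The averaging is read here as an unweighted mean over six values (tracks 1, 2, 3, 4a, 4b and 5), which is an assumption about the exact scheme; the function names and all numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between system scores and gold labels."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def overall_score(track_correlations):
    """Unweighted mean over tracks; 4a and 4b count as separate entries."""
    return sum(track_correlations.values()) / len(track_correlations)

# Invented gold labels (0-5) and a system that scores on a 0-1 scale.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
system = [0.1, 0.3, 0.55, 0.8, 0.95]
r = pearson(system, gold)

# Invented per-track correlations for a single submission.
toy_tracks = {"1": 0.70, "2": 0.65, "3": 0.80, "4a": 0.75, "4b": 0.30, "5": 0.82}
print(round(r, 3), round(overall_score(toy_tracks), 3))
```

Because Pearson's r is unchanged by a positive linear rescaling of either variable, a system that reports scores on a 0-1 scale is not penalized relative to one that reports on the 0-5 gold scale.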

CodaLab
As directed by the SemEval workshop organizers, the CodaLab research platform hosts the task.

Baseline
The baseline is the cosine of binary sentence vectors with each dimension representing whether an individual word appears in a sentence. For cross-lingual pairs, non-English sentences are translated into English using state-of-the-art machine translation. The baseline achieves an average correlation of 53.7 with human judgment on tracks 1-5 and would rank 23rd overall out of the 44 system submissions that participated in all tracks.
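A minimal sketch of such a baseline follows. For binary vectors the dot product is the number of shared word types and each norm is the square root of the number of word types in a sentence, so no explicit vectors are needed. The lowercased whitespace tokenization and the name `binary_cosine` are assumptions; the tokenization actually used for the baseline is not described here.

```python
import math

def binary_cosine(sentence_a, sentence_b):
    """Cosine of binary bag-of-words vectors, computed from word-type sets."""
    words_a = set(sentence_a.lower().split())  # assumed tokenization
    words_b = set(sentence_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    shared = len(words_a & words_b)
    return shared / math.sqrt(len(words_a) * len(words_b))

s1 = "Two boys on a couch are playing video games."
s2 = "Two boys are playing a video game."
print(binary_cosine(s1, s2))
```

With this naive tokenization "games." and "game." do not match, which illustrates why surface overlap alone is a weak similarity signal.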

Rankings
Participant performance is provided in Table 10. [...] track 4b's MT quality estimation data. This highlights the difficulty and importance of making fine grained distinctions for certain downstream applications. Assessing STS methods for quality estimation may benefit from using alternatives to Pearson correlation for evaluation. 14
Results tend to decrease on cross-lingual tracks. The baseline drops by more than 10% relative on Arabic-English and Spanish-English (SNLI) vs. monolingual Arabic and Spanish. Many participant systems show smaller decreases. ECNU's top ranking entry performs slightly better on Arabic-English than Arabic, with a slight drop from Spanish to Spanish-English (SNLI).

Methods
Participating teams explore techniques ranging from state-of-the-art deep learning models to elaborate feature engineered systems. Prediction signals include surface similarity scores such as edit distance and matching n-grams, scores derived from word alignments across pairs, assessment by MT evaluation metrics, estimates of conceptual similarity as well as the similarity between word and sentence level embeddings. For cross-lingual and non-English tracks, MT was widely used to convert the two sentences being compared into the same language. 15 Select methods are highlighted below.
14 e.g., Reimers et al. (2016) report success using STS labels with alternative metrics such as normalized Cumulative Gain (nCG), normalized Discounted Cumulative Gain (nDCG) and F1 to more accurately predict performance on the downstream tasks: text reuse detection, binary classification of document relatedness and document relatedness within a corpus.
15 Within the highlighted submissions, the following use a monolingual English system fed by MT: ECNU, BIT, HCTI and MITRE. HCTI submitted a separate run using ar, es and en trained models that underperformed using their en model with MT for ar and es. CompiLIG's model is cross-lingual but includes a word alignment feature that depends on MT. SEF@UHH built ar, es, en and tr models and use MT for the cross-lingual pairs. LIM-LIG and DT Team only participate in monolingual tracks.

Table 10: STS 2017 rankings ordered by average correlation across tracks 1-5. Performance is reported by convention as Pearson's r × 100. For tracks 1-6, the top ranking result is marked with a • symbol and results in bold have no statistically significant difference with the best result on a track, p > 0.05 Williams' t-test (Diedenhofen and Musch, 2015).
[...] 83.02• 15.50
compiLIG 76.84 14.64
compiLIG 79.10 14.94
DT TEAM (Maharjan et al., 2017) 85.36
DT TEAM (Maharjan et al., 2017) 83.60
DT TEAM (Maharjan et al., 2017) 83.29
FCICU (Hassan et al., 2017) 82.17
ITNLPAiKF (Liu et al., 2017) 82.31
ITNLPAiKF (Liu et al., 2017) 82.31
ITNLPAiKF (Liu et al., 2017) 81.59
L2F/INESC-ID (Fialho et al., 2017 [...]

ECNU (Tian et al., 2017) The best overall system is from ECNU and ensembles well performing feature engineered models with deep learning methods. Three feature engineered models use Random Forest (RF), Gradient Boosting (GB) and XGBoost (XGB) regression methods with features based on: n-gram overlap; edit distance; longest common prefix/suffix/substring; tree kernels (Moschitti, 2006); word alignments (Sultan et al., 2015); summarization and MT evaluation metrics (BLEU, GTM-3, NIST, WER, METEOR, ROUGE); and kernel similarity of bags-of-words, bags-of-dependencies and pooled word embeddings. ECNU's deep learning models are differentiated by their approach to sentence embeddings using either: averaged word embeddings, projected word embeddings, a deep averaging network (DAN) (Iyyer et al., 2015) or LSTM (Hochreiter and Schmidhuber, 1997). Each network feeds the element-wise multiplication, subtraction and concatenation of paired sentence embeddings to additional layers to predict similarity scores. The ensemble averages scores from the four deep learning and three feature engineered models.
BIT Second place overall is achieved by BIT primarily using sentence information content (IC) informed by WordNet and BNC word frequencies. One submission uses sentence IC exclusively. Another ensembles IC with Sultan et al. (2015)'s alignment method, while a third ensembles IC with cosine similarity of summed word embeddings with an IDF weighting scheme. Sentence IC in isolation outperforms all systems except those from ECNU. Combining sentence IC with word embedding similarity performs best.
HCTI (Shao, 2017) Third place overall is obtained by HCTI with a model similar to a convolutional Deep Structured Semantic Model (CDSSM) (Chen et al., 2015;Huang et al., 2013). Sentence embeddings are generated with twin convolutional neural networks (CNNs). The embeddings are then compared using cosine similarity and elementwise difference with the resulting values fed to additional layers to predict similarity labels. The architecture is abstractly similar to ECNU's deep learning models. UMDeep (Barrow and Peskov, 2017) took a similar approach using LSTMs rather than CNNs for the sentence embeddings.
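Both ECNU's and HCTI's networks compare a pair of sentence embeddings before passing the result to further layers. The sketch below shows one plausible reading of those comparison steps on plain Python lists. The function names, the ordering of the concatenated parts, and the inclusion of the raw embeddings in the ECNU-style vector are assumptions; neither team's exact feature layout is given here.

```python
import math

def ecnu_style_features(u, v):
    """Element-wise product and difference, concatenated with both embeddings."""
    product = [a * b for a, b in zip(u, v)]
    difference = [a - b for a, b in zip(u, v)]
    return product + difference + list(u) + list(v)  # assumed layout

def hcti_style_features(u, v):
    """Cosine similarity followed by the element-wise difference."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = dot / norm if norm else 0.0
    return [cosine] + [a - b for a, b in zip(u, v)]

# Toy sentence embeddings; real ones would come from a DAN, LSTM or CNN encoder.
u = [1.0, 2.0]
v = [3.0, 4.0]
print(ecnu_style_features(u, v))
print(hcti_style_features(u, v))
```

In either case the resulting vector would be the input to the additional layers that predict the similarity score.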
MITRE (Henderson et al., 2017) Fourth place overall is MITRE, which, like ECNU, takes an ambitious feature engineering approach complemented by deep learning. Ensembled components include: alignment similarity; TakeLab STS (Šarić et al., 2012b); string similarity measures such as matching n-grams, summarization and MT metrics (BLEU, WER, PER, ROUGE); an RNN and recurrent convolutional neural networks (RCNN) over word alignments; and a BiLSTM that is state-of-the-art for textual entailment (Chen et al., 2016).
FCICU (Hassan et al., 2017) Fifth place overall is FCICU, which computes a sense-based alignment using BabelNet (Navigli and Ponzetto, 2010). BabelNet synsets are multilingual, allowing non-English and cross-lingual pairs to be processed similarly to English pairs. Alignment similarity scores are used with two runs: one that combines the scores within a string kernel and another that uses them with a weighted variant of Sultan et al. (2015)'s method. Both runs average the BabelNet based scores with soft-cardinality (Jimenez et al., 2012b).
CompiLIG The best Spanish-English performance on SNLI sentences was achieved by CompiLIG using the following cross-lingual features: conceptual similarity using DBNary (Serasset, 2015), MultiVec word embeddings (Berard et al., 2016) and character n-grams. MT is used to incorporate a similarity score based on Brychcin and Svoboda (2016)'s improvements to Sultan et al. (2015)'s method.
LIM-LIG (Nagoudi et al., 2017) Using only weighted word embeddings, LIM-LIG took second place on Arabic. Arabic word embeddings are summed into sentence embeddings using uniform, POS and IDF weighting schemes. Sentence similarity is computed by cosine similarity. POS and IDF outperform uniform weighting. Combining the IDF and POS weights by multiplication is reported by LIM-LIG to achieve r = 0.7667, higher than all submitted Arabic (track 1) systems.
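The IDF-weighted summation can be sketched as follows. This is an illustrative toy, not LIM-LIG's implementation: the formula idf(w) = log(N / df(w)), the English toy corpus standing in for Arabic text, the two-dimensional vectors and the function names are all assumptions.

```python
import math
from collections import Counter

def idf_table(tokenized_corpus):
    """idf(w) = log(N / df(w)), with df counted once per sentence."""
    n_sentences = len(tokenized_corpus)
    doc_freq = Counter(w for sent in tokenized_corpus for w in set(sent))
    return {w: math.log(n_sentences / df) for w, df in doc_freq.items()}

def weighted_sentence_embedding(tokens, vectors, weights, dim):
    """Sum of weight(w) * v(w) over tokens that have both a vector and a weight."""
    total = [0.0] * dim
    for token in tokens:
        if token in vectors and token in weights:
            for i, value in enumerate(vectors[token]):
                total[i] += weights[token] * value
    return total

# Toy corpus and embeddings invented for illustration.
corpus = [["the", "dog", "runs"], ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
vectors = {"the": [1.0, 1.0], "dog": [1.0, 0.0], "runs": [0.0, 1.0]}

idf = idf_table(corpus)
embedding = weighted_sentence_embedding(["the", "dog", "runs"], vectors, idf, dim=2)
print(idf["the"], embedding)
```

A word that occurs in every sentence, such as "the" here, receives weight zero and drops out of the sum. A combined scheme in the spirit of the multiplication LIM-LIG reports would pass `idf[w] * pos_weight[w]` as the weight, where `pos_weight` is a hypothetical per-word table derived from part-of-speech tags.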

Analysis
19 For the cross-lingual tracks with language pair L1-L2, Duma and Menzel (2017) report additional experiments that vary the language choice for the paragraph vector model, using either L1 or L2. Experimental results are also provided that average the scores from the L1 and L2 models as well as that use vector correlation to compute similarity.
Figure 1 plots model similarity scores against human STS labels for the top 5 systems from tracks 5 (English), 1 (Arabic) and 4b (Spanish-English MT). While many systems return scores on the same scale as the gold labels, 0-5, others return scores between approximately 0 and 1. Lines on the graphs illustrate perfect performance for both a 0-5 and a 0-1 scale. Mapping the 0 to 1 scores to range from 0-5, approximately 80% of the scores from top performing English systems are within 1.0 pt of the gold label. Errors for Arabic are more broadly distributed, particularly for model scores between 1 and 4. The Spanish-English MT plot shows the weak relationship between the predicted and gold scores. Table 12 provides examples of difficult sentence pairs for participant systems and illustrates common sources of error for even well-ranking systems including: (i) word sense disambiguation: "making" and "preparing" are very similar in the context of "food", while "picture" and "movie" are not similar when picture is followed by "day"; (ii) attribute importance: "outside" vs. "deserted" are smaller details when contrasting "The man is in a deserted field" with "The man is outside in the field"; (iii) compositional meaning: "A man is carrying a canoe with a dog" has the same content words as "A dog is carrying a man in a canoe" but carries a different meaning; (iv) negation: systems score ". . . with goggles and a swimming cap" as nearly equivalent to ". . . without goggles or a swimming cap". Inflated similarity scores for examples like "There is a young girl" vs. "There is a young boy with the woman" demonstrate (v) semantic blending, whereby appending "with a woman" to "boy" brings its representation closer to that of "girl". For multilingual and cross-lingual pairs, these issues are magnified by translation errors for systems that use MT followed by the application of a monolingual similarity model. For track 4b Spanish-English MT pairs, some of the poor performance can in part be attributed to many systems using MT to re-translate the output of another MT system, obscuring errors in the original translation.

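The "within 1.0 pt of the gold label" statistic from the analysis can be reproduced along the following lines. The linear rescaling from a 0-1 range to 0-5 is an assumption, since the exact mapping used for the analysis is not given here, and the function names and toy scores are hypothetical.

```python
def rescale_to_gold_range(score, low=0.0, high=1.0, gold_max=5.0):
    """Linearly map a score from [low, high] onto [0, gold_max] (assumed mapping)."""
    return (score - low) / (high - low) * gold_max

def fraction_within(predictions, gold, tolerance=1.0):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, g in zip(predictions, gold) if abs(p - g) <= tolerance)
    return hits / len(gold)

# Invented scores from a system that reports on a 0-1 scale.
raw_scores = [0.5, 1.0, 0.0]
gold_labels = [3.0, 3.0, 0.5]

rescaled = [rescale_to_gold_range(s) for s in raw_scores]
print(rescaled, fraction_within(rescaled, gold_labels))
```
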
Contrasting Cross-lingual STS with MT Quality Estimation
Since MT quality estimation pairs are translations of the same sentence, they are expected to be minimally on the same topic and have an STS score ≥ 1. 22 The actual distribution of STS scores is such that only 13% of the test instances score below 3, 22% of the instances score 3, 12% score 4 and 53% score 5. The high STS scores indicate that MT systems are surprisingly good at preserving meaning. However, even for a human, interpreting changes caused by translation errors can be difficult due both to disfluencies and subtle errors with important changes in meaning. The Pearson correlation between the gold MT quality scores and the gold STS scores is 0.41, which shows that translation quality measures and STS are only moderately correlated. Differences are in part explained by translation quality scores penalizing all mismatches between the source segment and its translation, whereas STS focuses on differences in meaning. However, the difficult interpretation work required for STS annotation may increase the risk of inconsistent and subjective labels. The annotations for MT quality estimation are produced as a by-product of post-editing. Humans fix MT output and the edit distance between the output and its post-edited correction provides the quality score. This post-editing based procedure is known to produce relatively consistent estimates across annotators.
22 The evaluation data for track 4b does in fact have STS scores that are ≥ 1 for all pairs. In the 1,000 sentence training set for this track, one sentence received a score of zero.
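To make the post-editing based quality score concrete, the sketch below computes a word-level edit distance between MT output and its post-edited correction, normalized by the length of the correction. This is a deliberate simplification: it uses plain Levenshtein distance, whereas TER-style metrics such as HTER also allow block shifts. Whitespace tokenization and the function names are assumptions.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simplified_hter(mt_output, post_edited):
    """Edits per post-edited word; shift operations are not modeled."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited sentence must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)

print(simplified_hter("the cat sat on mat", "the cat sat on the mat"))
```

A score of zero means the translation needed no correction. A single dropped negation costs one edit under such a measure yet reverses the meaning, which is the kind of mismatch between quality scores and STS discussed above.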

STS Benchmark
The STS Benchmark is a careful selection of the English data sets used in SemEval and *SEM STS shared tasks between 2012 and 2017. Tables 11 and 13 provide details on the composition of the benchmark. The data is partitioned into training, development and test sets. 23 The development set can be used to design new models and tune hyperparameters. The test set should be used sparingly and only after a model design and hyperparameters have been locked against further changes. Using the STS Benchmark enables comparable assessments across different research efforts and improved tracking of the state-of-the-art.
Table 14 shows the STS Benchmark results for some of the best systems from Track 5 (EN-EN) 24 and compares their performance to competitive baselines from the literature. All baselines were run by the organizers using canonical pre-trained models made available by the originator of each method, 25 with the exception of PV-DBOW that [...] (2016) and InferSent which was reported independently. When multiple pre-trained models are available for a method, we report results for the one with the best dev set performance. For each method, input sentences are preprocessed to closely match the tokenization of the pre-trained models. 26 Default inference hyperparameters are used unless noted otherwise. The averaged word embedding baselines compute a sentence embedding by averaging word embeddings and then using cosine to compute pairwise sentence similarity scores.
While state-of-the-art baselines for obtaining sentence embeddings perform reasonably well on the benchmark data, improved performance is obtained by top 2017 STS shared task systems. There is still substantial room for further improvement. To follow the current state-of-the-art, visit the leaderboard on the STS wiki. 27

Table 14 [...]
[...] 2015) 74.3 63.9
Averaged Word Embedding Baselines
LexVec: Weighted matrix factorization of PPMI (Salle et al., 2016a,b) 68.9 55.8
FastText: Skip-gram with sub-word character n-grams (Joulin et al., 2016) 65.2 53.9
Paragram: Paraphrase Database (PPDB) fit word embeddings (Wieting et al., 2015) 63.0 50.1
GloVe: Word co-occurrence count fit embeddings (Pennington et al., 2014) 52.4 40.6
Word2vec: Skip-gram prediction of words in a context window (Mikolov et al., 2013a,b) 70.0 56.5
* 10-fold cross-validation on combination of dev and training data.

23 Similar to the STS shared task, while the training set is provided as a convenience, researchers are encouraged to incorporate other supervised and unsupervised data as long as no supervised annotations of the test partitions are used.
24 Each participant submitted the run which did best in the development set of the STS Benchmark, which happened to be the same as their best run in Track 5 in all cases.
25 [...] whether GloVe, LexVec, or Word2Vec word embeddings were used; C-PHRASE: http://clic.cimec.unitn.it/composes/cphrase-vectors.html; PV-DBOW: https://github.com/jhlau/doc2vec, AP-NEWS trained apnews_dbow.tgz; LexVec: https://github.com/alexandres/lexvec, embeddings lexvec.commoncrawl.300d.W.pos.vectors.gz; FastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, Wikipedia trained embeddings from wiki.en.vec; Paragram: http://ttic.uchicago.edu/~wieting/, embeddings trained on PPDB and tuned to WS353 from Paragram-WS353; GloVe: https://nlp.stanford.edu/projects/glove/, Wikipedia and Gigaword trained 300 dim. embeddings from glove.6B.zip; Word2vec: https://code.google.com/archive/p/word2vec/, Google News trained embeddings from GoogleNews-vectors-negative300.bin.gz.
26 Sent2Vec: results shown here tokenized by tweetTokenize.py contrasting dev experiments used wikiTokenize.py, both distributed with Sent2Vec. LexVec: numbers were converted into words, all punctuation was removed, and text is lowercased; FastText: sentences are prepared using the normalize_text() function within FastText's get-wikimedia.sh script and lowercased; Paragram: Joshua (Matt Post, 2015) pipeline to pre-process and tokenize English text; C-PHRASE, GloVe, PV-DBOW & SIF: PTB tokenization provided by Stanford CoreNLP with post-processing based on dev OOVs; Word2vec: Similar to FastText, to our knowledge, the preprocessing for the pre-trained Word2vec embeddings is not publicly described. We use the following heuristics for the Word2vec experiment: All numbers longer than a single digit are converted into a '#' (e.g., 24 → ##) then prefixed, suffixed and infixed punctuation is recursively removed from each token that does not match an entry in the model's lexicon.
27 http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Conclusion
We have presented the results of the 2017 STS shared task. This year's shared task differed substantially from previous iterations of STS in that the primary emphasis of the task shifted from English to multilingual and cross-lingual STS involving four different languages: Arabic, Spanish, English and Turkish. Even with this substantial change relative to prior evaluations, the shared task obtained strong participation. 31 teams produced 84 system submissions with 17 teams producing a total of 44 system submissions that processed pairs in all of the STS 2017 languages. For languages that were part of prior STS evaluations (e.g., English and Spanish), state-of-the-art systems are able to achieve strong correlations with human judgment. However, we obtain weaker correlations from participating systems for Arabic, Arabic-English and Turkish-English. This suggests further research is necessary in order to develop robust models that can both be readily applied to new languages and perform well even when less supervised training data is available. To provide a standard benchmark for English STS, we present the STS Benchmark, a careful selection of the English data sets from previous STS tasks (2012-2017). To assist in interpreting the results from new models, a number of competitive baselines and select participant systems are evaluated on the benchmark data.
Ongoing improvements to the current state-of-the-art are available from an online leaderboard.