SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation

Semantic Textual Similarity (STS) seeks to measure the degree of semantic equivalence between two snippets of text. Similarity is expressed on an ordinal scale that spans from semantic equivalence to complete unrelatedness. Intermediate values capture specifically defined levels of partial similarity. While prior evaluations constrained themselves to just monolingual snippets of text, the 2016 shared task includes a pilot subtask on computing semantic similarity on cross-lingual text snippets. This year's traditional monolingual subtask involves the evaluation of English text snippets from the following four domains: Plagiarism Detection, Post-Edited Machine Translations, Question-Answering and News Article Headlines. From the question-answering domain, we include both question-question and answer-answer pairs. The cross-lingual subtask provides paired Spanish-English text snippets drawn from the same sources as the English data as well as independently sampled news data. The English subtask attracted 43 participating teams producing 119 system submissions, while the cross-lingual Spanish-English pilot subtask attracted 10 teams resulting in 26 systems.


Introduction
Semantic Textual Similarity (STS) assesses the degree to which the underlying semantics of two segments of text are equivalent to each other. This assessment is performed using an ordinal scale that ranges from complete semantic equivalence to complete semantic dissimilarity. The intermediate levels capture specifically defined degrees of partial similarity, such as topicality or rough equivalence, but with differing details. The snippets being scored are approximately one sentence in length, with their assessment being performed outside of any contextualizing text. While STS has previously just involved judging text snippets that are written in the same language, this year's evaluation includes a pilot subtask on the evaluation of cross-lingual sentence pairs. (The authors of this paper are listed in alphabetic order.)
The systems and techniques explored as a part of STS have a broad range of applications including Machine Translation (MT), Summarization, Generation and Question Answering (QA). STS allows for the independent evaluation of methods for computing semantic similarity drawn from a diverse set of domains that would otherwise be only studied within a particular subfield of computational linguistics. Existing methods from a subfield that are found to perform well in a more general setting as well as novel techniques created specifically for STS may improve any natural language processing or language understanding application where knowing the similarity in meaning between two pieces of text is relevant to the behavior of the system.
Paraphrase detection and textual entailment are both highly related to STS. However, STS is more similar to paraphrase detection in that it defines a bidirectional relationship between the two snippets being assessed, rather than the non-symmetric propositional logic like relationship used in textual entailment (e.g., P → Q leaves Q → P unspecified). STS also expands the binary yes/no categorization of both paraphrase detection and textual entailment to a finer grained similarity scale. The additional degrees of similarity introduced by STS are directly relevant to many applications where intermediate levels of similarity are significant. For example, when evaluating machine translation system output, it is desirable to give credit for partial semantic equivalence to human reference translations. Similarly, a summarization system may prefer short segments of text with a rough meaning equivalence to longer segments with perfect semantic coverage.

Table 1

Score 5: The two sentences are completely equivalent, as they mean the same thing.
    English: The bird is bathing in the sink. / Birdie is washing itself in the water basin.
    Cross-lingual Spanish-English: El pájaro se esta bañando en el lavabo. / Birdie is washing itself in the water basin.
Score 4: The two sentences are mostly equivalent, but some unimportant details differ.
    English: In May 2010, the troops attempted to invade Kabul. / The US army invaded Kabul on May 7th last year, 2010.
    Cross-lingual Spanish-English: [...] / The US army invaded Kabul on May 7th last year, 2010.
Score 3: The two sentences are roughly equivalent, but some important information differs/missing.
    English: John said he is considered a witness but not a suspect. / "He is not a suspect anymore." John said.
    Cross-lingual Spanish-English: John dijo que él es considerado como testigo, y no como sospechoso. / "He is not a suspect anymore." John said.
Score 2: The two sentences are not equivalent, but share some details.
    English: They flew out of the nest in groups. / They flew into the nest together.
    Cross-lingual Spanish-English: Ellos volaron del nido en grupos. / They flew into the nest together.
Score 1: The two sentences are not equivalent, but are on the same topic.
    English: The woman is playing the violin. / The young lady enjoys listening to the guitar.
    Cross-lingual Spanish-English: La mujer está tocando el violín. / The young lady enjoys listening to the guitar.
Score 0: The two sentences are completely dissimilar.
    English: John went horse back riding at dawn with a whole group of friends. / Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
    Cross-lingual Spanish-English: Al amanecer, Juan se fue a montar a caballo con un grupo de amigos. / Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
STS is related to research into machine translation evaluation metrics. This subfield of machine translation investigates methods for replicating human judgements regarding the degree to which a translation generated by a machine translation system corresponds to a reference translation produced by a human translator. STS systems could plausibly be used as a drop-in replacement for existing translation evaluation metrics (e.g., BLEU, MEANT, METEOR, TER). The cross-lingual STS subtask that is newly introduced this year is similarly related to machine translation quality estimation.
The STS shared task has been held annually since 2012, providing a venue for the evaluation of state-of-the-art algorithms and models (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015). During this time, a diverse set of genres and data sources have been explored (i.a., news headlines, video and image descriptions, glosses from lexical resources including WordNet (Miller, 1995; Fellbaum, 1998), FrameNet (Baker et al., 1998), OntoNotes (Hovy et al., 2006), web discussion forums, and Q&A data sets). This year's evaluation adds new data sets drawn from plagiarism detection and post-edited machine translations. We also introduce an evaluation set on Q&A forum question-question similarity and revisit news headlines and Q&A answer-answer similarity. The 2016 task includes both a traditional monolingual subtask with English data and a pilot cross-lingual subtask that pairs together Spanish and English texts.

Task Overview
STS presents participating systems with paired text snippets of approximately one sentence in length. The systems are then asked to return a numerical score indicating the degree of semantic similarity between the two snippets. Canonical STS scores fall on an ordinal scale with 6 specifically defined degrees of semantic similarity (see Table 1). While the underlying labels and their interpretation are ordinal, systems can provide real valued scores to indicate their semantic similarity prediction.
Participating systems are then evaluated based on the degree to which their predicted similarity scores correlate with STS human judgements. Algorithms are free to use any scale or range of values for the scores they return. They are not punished for outputting scores outside the range of the interpretable human annotated STS labels. This evaluation strategy is motivated by a desire to maximize the flexibility in the design of machine learning models and systems for STS. It reinforces the assumption that computing textual similarity is an enabling component for other natural language processing applications, rather than being an end in itself.
Table 1 illustrates the ordinal similarity scale the shared task uses. Both the English and the cross-lingual Spanish-English STS subtasks use a 6 point similarity scale. A similarity label of 0 means that two texts are completely dissimilar; this can be interpreted as two sentences with no overlap in their meanings. The next level up, a similarity label of 1, indicates that the two snippets are not equivalent but are topically related to each other. A label of 2 indicates that the two texts are still not equivalent but agree on some details of what is being said. The labels 3 and 4 both indicate that the two sentences are approximately equivalent. However, a score of 3 implies that there are some differences in important details, while a score of 4 indicates that the differing details are not important. The top score of 5 denotes that the two texts being evaluated have complete semantic equivalence.
In the context of the STS task, meaning equivalence is defined operationally as two snippets of text that mean the same thing when interpreted by a reasonable human judge. The operational approach to sentence level semantics was popularized by the recognizing textual entailment task (Dagan et al., 2010). It has the advantage that it allows the labeling of sentence pairs by human annotators without any training in formal semantics, while also being more useful and intuitive to work with for downstream systems. Beyond just sentence level semantics, the operationally defined STS labels also reflect both world knowledge and pragmatic phenomena.
As in prior years, 2016 shared task participants are allowed to make use of existing resources and tools (e.g., WordNet, Mikolov et al. (2013)'s word2vec). Participants are also allowed to make unsupervised use of arbitrary data sets, even if such data overlaps with the announced sources of the evaluation data.

English Subtask
The English subtask builds on four prior years of English STS tasks. Task participants are allowed to use all of the trial, train and evaluation sets released during prior years as training and development data. As shown in Table 2, this provides 14,250 paired snippets with gold STS labels. The 2015 STS task annotated between 1,500 and 2,000 pairs per data set that were then filtered based on annotation agreement and to achieve better data set balance. The raw annotations were released after the evaluation, providing an additional 5,500 pairs with noisy STS annotations (8,500 total for 2015).
The 2016 English evaluation data is partitioned into five individual evaluation sets: Headlines, Plagiarism, Postediting, Answer-Answer and Question-Question. Each evaluation set has between 209 and 254 pairs. Participants annotate a larger number of pairs for each dataset without knowledge of which pairs will be included in the final evaluation.

Table 2: [...] (2012, 2013, 2014, 2015) and test (2016) data sets.

Data Collection
The data for the English evaluation sets are collected from a diverse set of sources. Data sources are selected that correspond to potentially useful domains for application of the semantic similarity methods explored in STS systems. This section details the pair selection heuristics as well as the individual data sources we use for the evaluation sets.

Selection Heuristics
Unless otherwise noted, pairs are heuristically selected using a combination of lexical surface form and word embedding similarity between a candidate pair of text snippets. The heuristics are used to find pairs sharing some minimal level of either surface or embedding space similarity. An approximately equal number of candidate sentence pairs are produced using our lexical surface form and word embedding selection heuristics. Both heuristics make use of a Penn Treebank style tokenization of the text provided by CoreNLP.
Surface Lexical Similarity: Our surface form selection heuristic uses an information theoretic measure based on unigram overlap (Lin, 1998). As shown in equation (1), surface level lexical similarity between two snippets \(s_1\) and \(s_2\) is computed as a log probability weighted sum of the words common to both snippets divided by a log probability weighted sum of all the words in the two snippets.
\[ \mathrm{sim}_l(s_1, s_2) = \frac{2 \times \sum_{w \in s_1 \cap s_2} \log P(w)}{\sum_{w \in s_1} \log P(w) + \sum_{w \in s_2} \log P(w)} \qquad (1) \]
Unigram probabilities are estimated over the evaluation set data sources and are computed without any smoothing.
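The surface similarity heuristic described above can be illustrated with a minimal Python sketch. It is not the organizers' implementation: the function names, the toy corpus, and the choice to treat each snippet as a set of word types are assumptions made for illustration, and simple whitespace splitting stands in for the CoreNLP tokenization.

```python
import math
from collections import Counter


def unigram_log_probs(snippets):
    """Estimate unsmoothed unigram log probabilities from tokenized snippets."""
    counts = Counter(tok for snippet in snippets for tok in snippet)
    total = sum(counts.values())
    return {tok: math.log(n / total) for tok, n in counts.items()}


def surface_similarity(s1, s2, log_p):
    """Log-probability weighted overlap: twice the sum over shared words,
    divided by the sum over the words of both snippets.
    Assumption: each snippet is treated as a set of word types."""
    w1, w2 = set(s1), set(s2)
    shared = sum(log_p[w] for w in w1 & w2)
    total = sum(log_p[w] for w in w1) + sum(log_p[w] for w in w2)
    return 2.0 * shared / total if total else 0.0


# Toy corpus (hypothetical); probabilities are estimated over it without smoothing.
corpus = [
    "the bird is bathing in the sink".split(),
    "birdie is washing itself in the water basin".split(),
    "the woman is playing the violin".split(),
]
log_p = unigram_log_probs(corpus)
score = surface_similarity(corpus[0], corpus[1], log_p)
```

Because rarer words have more negative log probabilities, a shared rare word contributes more to the score than a shared frequent word such as "the".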
Word Embedding Similarity: As our second heuristic, we compute the cosine between a simple embedding space representation of the two text snippets. Equation (2) illustrates the construction of the snippet embedding space representation, \(v(s)\), as the sum of the embeddings for the individual words, \(v(w)\), in the snippet. The cosine similarity can then be computed as in equation (3).
\[ v(s) = \sum_{w \in s} v(w) \qquad (2) \]
\[ \mathrm{sim}_v(s_1, s_2) = \frac{v(s_1) \cdot v(s_2)}{\lVert v(s_1) \rVert \, \lVert v(s_2) \rVert} \qquad (3) \]
Three hundred dimensional word embeddings are obtained by running the GloVe package (Pennington et al., 2014) with default parameters over all the data collected from the 2016 evaluation sources. 3
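As a rough illustration of the embedding heuristic, the sketch below sums word vectors into a snippet vector and takes the cosine between two snippet vectors. The three-dimensional toy embeddings and the function names are hypothetical; the paper uses 300-dimensional GloVe vectors. Skipping words that lack an embedding is an assumption, since the text does not say how such words are handled.

```python
import math


def snippet_vector(tokens, embeddings):
    """Sum the word vectors of a snippet.
    Assumption: tokens without an embedding are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {
    "bird": [1.0, 0.0, 0.0],
    "sink": [0.0, 1.0, 0.0],
    "basin": [0.0, 0.9, 0.1],
    "guitar": [0.0, 0.0, 1.0],
}
close = cosine(
    snippet_vector(["bird", "sink"], toy_embeddings),
    snippet_vector(["bird", "basin"], toy_embeddings),
)
far = cosine(
    snippet_vector(["bird", "sink"], toy_embeddings),
    snippet_vector(["guitar"], toy_embeddings),
)
```

Unlike the surface heuristic, this measure can score a pair highly even when the two snippets share few exact words, provided their words lie close together in the embedding space.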

Newswire Headlines
The newswire headlines evaluation set is collected from the Europe Media Monitor (EMM) (Best et al., 2005) using the same extraction approach taken for STS 2015 (Agirre et al., 2015) over the date range July 28, 2014 to April 10, 2015. The EMM clusters identify related news stories. To construct the STS pairs, we extract 749 pairs of headlines that appear in the same cluster and 749 pairs of headlines associated with stories that appear in different clusters. For both groups, we use a surface level string similarity metric found in the Perl package String::Similarity (Myers, 1986) to select an equal number of pairs with high and low surface similarity scores.

Plagiarism
The plagiarism evaluation set is based on Clough and Stevenson (2011)'s Corpus of Plagiarised Short Answers. This corpus provides a collection of short answers to computer science questions that exhibit varying degrees of plagiarism from related Wikipedia articles. 4 The short answers include text that was constructed by each of the following four strategies: 1) copying and pasting individual sentences from Wikipedia; 2) light revision of material copied from Wikipedia; 3) heavy revision of material from Wikipedia; 4) non-plagiarised answers produced without even looking at Wikipedia. This corpus is segmented into individual sentences using CoreNLP.

Postediting
The Specia (2011) EAMT 2011 corpus provides machine translations of French news data using the Moses machine translation system (Koehn et al., 2007) paired with postedited corrections of those translations. 5 The corrections were provided by human translators instructed to perform the minimum number of changes necessary to produce a publishable translation. STS pairs for this evaluation set are selected both using the surface form and embedding space pairing heuristics and by including the existing explicit pairs of each machine translation with its postedited correction.
3 [...] useful for finding semantically similar text snippets that differ in surface form.
4 Questions: A. What is inheritance in object orientated programming? B. Explain the PageRank algorithm that is used by the Google search engine. C. Explain the Vector Space Model that is used for Information Retrieval. D. Explain Bayes Theorem from probability theory. E. What is dynamic programming?
5 The corpus also includes English news data machine translated into Spanish and the postedited corrections of these translations. We use the English-Spanish data in the cross-lingual task.

Question-Question & Answer-Answer
The question-question and answer-answer evaluation sets are extracted from the Stack Exchange Data Dump (Stack Exchange, Inc., 2016). The data include long form Question-Answer pairs on a diverse set of topics ranging from highly technical areas such as programming, physics and mathematics to more casual topics like cooking and travel.
Pairs are constructed using questions and answers from the following less technical Stack Exchange sites: academia, cooking, coffee, diy, english, fitness, health, history, lifehacks, linguistics, money, movies, music, outdoors, parenting, pets, politics, productivity, sports, travel, workplace and writers. Since both the questions and answers are long form, often being a paragraph in length or longer, heuristics are used to select a one sentence summary of each question and answer. For questions, we use the title of the question when it ends in a question mark. For answers, a one sentence summary of each answer is constructed using LexRank (Erkan and Radev, 2004) as implemented by the Sumy package.

Cross-lingual Subtask
The pilot cross-lingual subtask explores the expansion of STS to paired snippets of text in different languages. The 2016 shared task pairs snippets in Spanish and English, with each pair containing exactly one Spanish and one English member. A trial set of 103 pairs was released prior to the official evaluation window containing pairs of sentences randomly selected from prior English STS evaluations, but with one of the snippets being translated into Spanish by human translators. The similarity scores associated with this set are taken from the manual STS annotations within the original English data. Participants are allowed to use the labeled STS pairs from any of the prior STS evaluations. This includes STS pairs from all four prior years of the English STS subtasks as well as data from the 2014 and 2015 Spanish STS subtasks.

Data Collection
The cross-lingual evaluation data is partitioned into two evaluation sets: news and multi-source. The news data set is manually harvested from multilingual news sources, while the multi-source dataset is sampled from the same sources as the 2016 English data, with one of the snippets being translated into Spanish by human translators. As shown in Table 3, the news set has 301 pairs, while the multi-source set has 294 pairs. For the news evaluation set, participants are provided with exactly the 301 pairs that will be used for the final evaluation. For the multi-source dataset, we take the same approach as the English subtask and release 2,973 pairs for annotation by participant systems, without providing information on which pairs will be included in the final evaluation.

Cross-lingual News
The cross-lingual news dataset is manually culled from less mainstream news sources such as Russia Today, in order to pose a more natural challenge in terms of machine translation accuracy. Articles on the same or differing topics are collected, with particular effort being spent to find articles on the same or somewhat similar story (many times written by the same author in English and Spanish), that exhibit a natural writing pattern in each language by itself and do not amount to an exact translation. When compared across the two languages, such articles exhibit different sentence structure and length. Additional paragraphs are also included by the writer that would cater to the readers' interests. For example, in the case of articles written about the Mexican drug lord Joaquin "El Chapo" Guzman, who was recently captured, the English articles typically have fewer extraneous details, focusing more on facts, while the articles written in Spanish provide additional background information with more narrative. Such articles allow for the manual extraction of high quality pairs that enable a wider variety of testing scenarios: from exact translations, to paraphrases exhibiting a different sentence structure, to somewhat similar sentences, to sentences sharing common vocabulary but no topic similarity, and ultimately to completely unrelated sentences. This ensures that semantic similarity systems that rely heavily on lexical features (which have been also typically used in STS tasks to derive test and train datasets) are at a disadvantage, and rather systems that actually explore semantic information receive due credit.

Multi-source
The raw multi-source data sets annotated by participating systems are constructed by first sampling 250 pairs from each of the following four data sets from the English task: Answer-Answer, Plagiarism, Question-Question and Headlines. One sentence from each sampled pair is selected at random for translation into Spanish by human translators. 11 An additional 1,973 pairs are drawn from the English-Spanish section of EAMT 2011. We include all pairings of English source sentences with their human post-edited Spanish translations, resulting in 1,000 pairs. We also include pairings of English source sentences with their Spanish machine translations. This produces only an additional 973 pairs, since 27 of the pairs are already generated by human postedited translations that exactly match their corresponding machine translations. The gold standard data are selected by randomly drawing 60 pairs belonging to each data set within the raw multi-source data, except for EMM where only 54 pairs were drawn. 12

Annotation
Annotation of pairs with STS scores is performed using crowdsourcing on Amazon Mechanical Turk. 13 This section describes the templates and annotation parameters we use for the English and cross-lingual Spanish-English pairs, as well as how the gold standard annotations are computed from multiple annotations from crowd workers.

English Subtask
The annotation instructions for the English subtask are modified from prior years in order to accommodate the annotation of question-question pairs. Figure 1 illustrates the new instructions. References to statements are replaced with snippets. The new instructions remove the wording suggesting that annotators "picture what is being described" and provide tips for navigating the annotation form quickly. The annotation form itself is also modified from prior years to make use of radio buttons to annotate the similarity scale rather than drop-down lists.
The English STS pairs are annotated in batches of 20 pairs. For each batch, annotators are paid $1 USD. Five annotations are collected per pair. Only workers with the MTurk master qualification are allowed to perform the annotation, a designation by the MTurk platform that statistically identifies workers who perform high quality work across a diverse set of MTurk tasks. Gold annotations are selected as the median value of the crowdsourced annotations after filtering out low quality annotators. We remove annotators with correlation scores < 0.80 using a simulated gold annotation computed by leaving out the annotations from the worker being evaluated. We also exclude all annotators with a kappa score < 0.20 against the same simulated gold standard.
11 Inspection of the data suggests the translation service provider may have used a postediting based process.
12 Our annotators work on batches of 7 pairs. Drawing 54 pairs from the EMM data results in a total number of pairs that is cleanly divisible by 7.
13 https://www.mturk.com/
The official task evaluation data are selected from pairs having at least three remaining labels after excluding the low quality annotators. For each evaluation set, we attempt to select up to 42 pairs for each STS label. Preference was given to pairs with a higher number of STS labels matching the median label. After the final pairs are selected, they are spot checked with some of the pairs having their STS score corrected.
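The English gold-label procedure can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the organizers' code. The simulated gold for a worker is taken here to be the median of the other workers' labels on each pair, which the text does not specify. The filter is applied in a single pass. The kappa-based filter is omitted because the text does not say which kappa variant is used. All function names and the toy annotations are hypothetical.

```python
import math
from statistics import median


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant (a simplification)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def reliable_workers(labels, min_r=0.80):
    """labels maps worker -> {pair_id: score}. Keep workers whose scores correlate
    with a simulated gold built from the other workers' scores on the same pairs."""
    kept = set()
    for worker, own in labels.items():
        mine, simulated_gold = [], []
        for pair_id, score in own.items():
            others = [l[pair_id] for w, l in labels.items() if w != worker and pair_id in l]
            if others:
                mine.append(score)
                simulated_gold.append(median(others))  # assumption: median of the others
        if len(mine) >= 2 and pearson(mine, simulated_gold) >= min_r:
            kept.add(worker)
    return kept


def gold_labels(labels, min_r=0.80, min_labels=3):
    """Median of the retained annotations, for pairs with at least min_labels of them."""
    kept = reliable_workers(labels, min_r)
    by_pair = {}
    for worker in kept:
        for pair_id, score in labels[worker].items():
            by_pair.setdefault(pair_id, []).append(score)
    return {p: median(s) for p, s in by_pair.items() if len(s) >= min_labels}


# Toy annotations (hypothetical): four consistent workers and one noisy worker.
annotations = {
    "a": {1: 5, 2: 4, 3: 2, 4: 0, 5: 3},
    "b": {1: 5, 2: 3, 3: 2, 4: 1, 5: 3},
    "c": {1: 4, 2: 4, 3: 1, 4: 0, 5: 3},
    "d": {1: 5, 2: 4, 3: 2, 4: 0, 5: 2},
    "noisy": {1: 0, 2: 5, 3: 4, 4: 5, 5: 1},
}
gold = gold_labels(annotations)
```

Leaving the evaluated worker out of the simulated gold prevents a worker's own labels from inflating the agreement estimate for that worker.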

Cross-lingual Spanish-English Subtask
The Spanish-English pairs are annotated using a slightly modified template from the 2014 and 2015 Spanish STS subtasks. Given the multilingual nature of the subtask, the guidelines consist of alternating instructions in either English or Spanish, in order to dissuade monolingual annotators from participating (see Figure 2). The template is also modified to use the same six point scale used by the English subtask, rather than the five point scale used in the Spanish subtasks in the past (which did not attempt to distinguish between differences in unimportant details). Judges are also presented with the cross-lingual example pairs and explanations listed in Table 1.
The cross-lingual pairs are annotated in batches of 7 pairs. Annotators are paid $0.30 USD per batch and each batch receives annotations from 5 workers. The annotations are restricted to workers who have completed 500 HITs on the MTurk platform and have less than 10% of their lifetime annotations rejected. The gold standard is computed by averaging over the 5 annotations collected for each pair.

System Evaluation
This section reports the evaluation results for the 2016 STS English and cross-lingual Spanish-English subtasks.

Participation
Participating teams are allowed to submit up to three systems. For the English subtask, there were 119 systems from 43 participating teams. The cross-lingual Spanish-English subtask saw 26 submissions from 10 teams. For the English subtask, this is a 45% increase in participating teams from 2015. The Spanish-English STS pilot subtask attracted approximately 53% more participants than the monolingual Spanish subtask organized in 2015.

Evaluation Metric
On each test set, systems are evaluated based on their Pearson correlation with the gold standard STS labels. The overall score for each system is computed as the average of the correlation values on the individual evaluation sets, weighted by the number of data points in each evaluation set.
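The evaluation metric can be sketched as below: Pearson correlation per evaluation set, then a mean weighted by the number of pairs in each set. The function names and the toy scores are hypothetical, and this is an illustration rather than the official scoring script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def overall_score(per_set):
    """per_set maps evaluation-set name -> (system_scores, gold_scores).
    Returns the mean of per-set correlations, weighted by set size."""
    total = sum(len(gold) for _, gold in per_set.values())
    return sum(len(gold) * pearson(pred, gold) for pred, gold in per_set.values()) / total


# Toy scores (hypothetical). System outputs need not lie on the 0-5 scale.
results = {
    "headlines": ([4.1, 0.3, 2.2, 3.0], [5, 0, 2, 4]),
    "plagiarism": ([0.2, 0.9, 0.5], [1, 5, 3]),
}
overall = overall_score(results)
```

Because Pearson correlation is invariant to positive linear rescaling of the predictions, a system is not penalized for the range of values it outputs, which matches the task's policy of accepting scores on any scale.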

Baseline
Similar to prior years, we include a baseline built using a very simple vector space representation. For this baseline, both text snippets in a pair are first tokenized by white-space. The snippets are then projected to a one-hot vector representation such that each dimension corresponds to a word observed in one of the snippets. If a word appears in a snippet one or more times, the corresponding dimension in the vector is set to one and is otherwise set to zero. The textual similarity score is then computed as the cosine between these vector representations of the two snippets.
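A minimal sketch of such a baseline is given below. For binary word-presence vectors, the cosine reduces to the number of shared word types divided by the square root of the product of the two vocabulary sizes, so no explicit vectors are needed. The function name is hypothetical, and no lowercasing or punctuation stripping is applied because the description mentions only whitespace tokenization.

```python
import math


def baseline_similarity(snippet_a, snippet_b):
    """Cosine between binary bag-of-words vectors of two whitespace-tokenized snippets."""
    words_a, words_b = set(snippet_a.split()), set(snippet_b.split())
    if not words_a or not words_b:
        return 0.0
    # Dot product of binary vectors = number of shared word types;
    # the norm of a binary vector = sqrt(number of word types).
    return len(words_a & words_b) / math.sqrt(len(words_a) * len(words_b))


# Example pair taken from the score 1 row of Table 1.
score = baseline_similarity(
    "The woman is playing the violin.",
    "The young lady enjoys listening to the guitar.",
)
```

In this example the only shared tokens are "The" and "the", so the baseline's score is driven entirely by function words.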

English Subtask
The rankings for the English STS subtask are given in Tables 4 and 5. The baseline system ranked 100th. Table 6 provides the best and median scores for each of the individual evaluation sets as well as overall. 15 The table also provides the difference between the best and median scores to highlight the extent to which top scoring systems outperformed the typical level of performance achieved on each data set.
The best overall performance is obtained by Samsung Poland NLP Team's EN1 system, which achieves an overall correlation of 0.778 (Rychalska et al., 2016). This system also performs best on three out of the five individual evaluation sets: answer-answer, headlines, plagiarism. The EN1 system achieves competitive performance on the postediting data with a correlation score of 0.83516. The best system on the postediting data, RICOH's Run-n (Itoh, 2016), obtains a score of 0.867. Like all systems, EN1 struggles on the question-question data, achieving a correlation of 0.687. Another system submitted by the Samsung Poland NLP Team named EN2 achieves the best correlation on the question-question data at 0.747.
The most difficult data sets this year are the two Q&A evaluation sets: answer-answer and question-question. Difficulty on the question-question data was expected as this is the first year that question-question pairs are formally included in the evaluation of STS systems. 16 Interestingly, the baseline system has particular problems on this data, achieving a correlation of only 0.038. This suggests that surface overlap features might be less informative on this data, possibly leading to prediction errors by the systems that include them. An answer-answer evaluation set was previously included in the 2015 STS task. Answer-answer data is included again in this year's evaluation specifically because of the poor performance observed on this type of data in 2015.
15 The median scores reported here do not include late or corrected systems. The median scores for the on-time systems without corrections are: ALL 0.68923; plagiarism 0.78949; answer-answer 0.48018; postediting 0.81241; headlines 0.76439; question-question 0.57140.
In Table 6, it can be seen that the difficult Q&A data sets are also the data sets that exhibit the biggest difference between the top performing system for that evaluation set and typical system performance, as captured by the median on the same data. For the easier data sets, news headlines, plagiarism, and postediting, there is only a relatively modest gap between the best system for that data set and typical system performance, ranging from about 0.05 to 0.06. However, on the difficult Q&A data sets the difference between the best system and typical systems jumps to 0.212 for answer-answer and 0.170 for question-question. This suggests these harder data sets may be better at discriminating between different approaches, with most systems now being fairly competent on assessing easier pairs.
16 The pair selection criteria from prior years did not explicitly exclude the presence of questions or question-question pairs. Of the 19,189 raw pairs from prior English STS evaluations as trial, train or evaluation data, 831 (4.33%) of the pairs include a question mark within at least one member of the pair, while 319 of the pairs (1.66%) include a question mark within both members.

Methods
Participating systems vary greatly in the approaches they take to solving STS. The overall winner, Samsung Poland NLP Team, proposes a textual similarity model that is a novel hybrid of recursive auto-encoders from deep learning with penalty and reward signals extracted from WordNet (Rychalska et al., 2016). To obtain even better performance, this model is combined in an ensemble with a number of other similarity models including a version of Sultan et al. (2015)'s very successful STS model enhanced with additional features found to work well in the literature.
The team in second place overall, UWB, combines a large number of diverse similarity models and features (Brychcin and Svoboda, 2016). Similar to Samsung, UWB combines manually engineered NLP features (e.g., character n-gram overlap) with sophisticated models from deep learning (e.g., Tree LSTMs). The third place team, MayoNLPTeam, also achieves their best results using a combination of a more traditionally engineered NLP pipeline with a deep learning based model (Afzal et al., 2016). Specifically, MayoNLPTeam combines a pipeline that makes use of linguistic resources such as WordNet and well understood concepts such as the information content of a word (Resnik, 1995) with a deep learning method known as Deep Structured Semantic Model (DSSM) (Huang et al., 2013).
The next two teams in overall performance, ECNU and NaCTeM, make use of large feature sets, including features based on word embeddings. However, they did not incorporate the more sophisticated deep learning based models explored by Samsung, UWB and MayoNLPTeam (Tian and Lan, 2016;Przybyła et al., 2016).
The next team in the rankings, UMD-TTIC-UW, only makes use of a single deep learning model (He et al., 2016). The team extends a multi-perspective convolutional neural network (MPCNN) (He et al., 2015) with a simple word level attentional mechanism based on the aggregate cosine similarity of a word in one text with all of the words in a paired text. The submission is notable for how well it performs without any manual feature engineering.
Finally, the best performing system on the postediting data, RICOH's Run-n, introduces a novel IR-based approach for textual similarity that incorporates word alignment information (Itoh, 2016).

Cross-lingual Spanish-English Subtask
The rankings for the cross-lingual Spanish-English STS subtask are provided in Table 7. Recall that the multi-source data is drawn from the same sources as the monolingual English STS pairs. It is interesting to note that the performance of cross-lingual systems on this evaluation set does not appear to be significantly worse than the monolingual submissions even though the systems are being asked to perform the more challenging problem of evaluating cross-lingual sentence pairs. While the correlations are not directly comparable, they do seem to motivate a more direct comparison between cross-lingual and monolingual STS systems.
In terms of performance on the manually culled news data set, the highest overall rank is achieved by an unsupervised system submitted by team UWB (Brychcin and Svoboda, 2016). The unsupervised UWB system builds on the word alignment based STS method proposed by Sultan et al. (2015). However, when calculating the final similarity score, it weights both the aligned and unaligned words by their inverse document frequency. This system is able to attain a 0.912 correlation on the news data, while ranking second on the multi-source data set. For the multi-source test set, the highest scoring submission is a supervised system from the UWB team that combines multiple signals originating from lexical, syntactic and semantic similarity approaches in a regression-based model, achieving a 0.819 correlation. This is modestly better than the second place unsupervised approach that achieves 0.808.
Approximately half of the submissions are able to achieve a correlation above 0.8 on the news data. On the multi-source data, the overall correlation trend is lower, but with half the systems still obtaining a score greater than 0.6. Due to the diversity of the material embedded in the multi-source data, it seems to amount to a more difficult testing scenario.
Nonetheless, there are cases of: 1) systems performing much worse on the news data set: the FBK HLT-MT systems experience an approximately 0.25 drop in correlation on the news data as compared to the multi-source setting; 2) systems performing evenly on both data sets.

Methods
In terms of approaches, most runs rely on a monolingual framework. They automatically translate the Spanish member of a sentence pair into English and then compute monolingual semantic similarity using a system developed for English. In contrast, the CNRC team (Lo et al., 2016) provides a true cross-lingual system that makes use of embedding space phrase similarity, the score from XMEANT, a cross-lingual machine translation evaluation metric (Lo et al., 2014), and precision and recall features for material filling aligned cross-lingual semantic roles (e.g., action, agent, patient). The FBK HLT team (Ataman et al., 2016) proposes a model combining cross-lingual word embeddings with features from QuEst (Specia et al., 2013), a tool for machine translation quality estimation. The RTM system (Biçici, 2016) also builds on methods developed for machine translation quality estimation and is applicable to both cross-lingual and monolingual similarity. The GWU NLP team (Aldarmaki and Diab, 2016) uses a shared cross-lingual vector space to directly assess sentences originating in different languages. 18

Conclusion
We have presented the results of the 2016 STS shared task. This year saw a significant increase in participation. There are 119 submissions from 43 participating teams for the English STS subtask. This is a 45% increase in participating teams over 2015. The pilot cross-lingual Spanish-English STS subtask has 26 submissions from 10 teams, which is impressive given that this is the first year such a challenging subtask was attempted. Interestingly, the cross-lingual STS systems appear to perform competitively with monolingual systems on pairs drawn from the same sources. This suggests that it would be interesting to perform a more direct comparison between cross-lingual and monolingual systems.
18 The GWU NLP team includes one of the STS organizers.