SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)

In this shared task, we present evaluations of systems on two related tasks for Twitter data: Paraphrase Identification (PI) and Semantic Textual Similarity (SS). Given a pair of sentences, participants are asked to produce a binary yes/no judgement or a graded score measuring their semantic equivalence. The task features a newly constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs. A total of 19 teams participated, submitting 36 runs to the PI task and 26 runs to the SS task. The evaluation shows encouraging results and open challenges for future research. The best systems scored an F1-measure of 0.674 for the PI task and a Pearson correlation of 0.619 for the SS task, compared to a strong logistic regression baseline of 0.589 F1 and 0.511 Pearson; by contrast, the best SS systems can often reach > 0.80 Pearson on well-formed text. This shared task also provides insights into the relation between the PI and SS tasks and suggests the importance of bringing these two research areas together. We make all the data, baseline systems and evaluation scripts publicly available.


Introduction
The ability to identify paraphrases, i.e. alternative expressions of the same (or similar) meaning, and the degree of their semantic similarity has proven useful for a wide variety of natural language processing applications (Madnani and Dorr, 2010). It is particularly useful to overcome the challenge of high redundancy in Twitter and the sparsity inherent in its short texts (e.g. oscar nom'd doc ↔ Oscar-nominated documentary; some1 shot a cop ↔ someone shot a police). Emerging research shows paraphrasing techniques applied to Twitter data can improve tasks like first story detection (Petrović et al., 2012), information retrieval (Zanzotto et al., 2011) and text normalization (Xu et al., 2013; Wang et al., 2013).
Previously, many researchers have investigated ways of automatically detecting paraphrases in more formal texts, such as newswire. The ACL Wiki gives an excellent summary of state-of-the-art paraphrase identification techniques. These can be categorized into supervised methods (Qiu et al., 2006; Wan et al., 2006; Das and Smith, 2009; Socher et al., 2011; Blacoe and Lapata, 2012; Madnani et al., 2012; Ji and Eisenstein, 2013) and unsupervised methods (Mihalcea et al., 2006; Rus et al., 2008; Fernando and Stevenson, 2008; Islam and Inkpen, 2007; Hassan and Mihalcea, 2011). A few recent studies have highlighted the potential and importance of developing paraphrase identification (Zanzotto et al., 2011; Xu et al., 2013) and semantic similarity techniques (Guo and Diab, 2012) specifically for tweets. They also indicated that the very informal language used in social media, especially its high degree of lexical variation, poses serious challenges to both tasks.
The dataset provided in this shared task (see Table 1) is the same as the Twitter Paraphrase Corpus we developed earlier in (Xu, 2014). This PIT-2015 paraphrase dataset is distinct from the data used in previous studies in many aspects: (i) it contains sentences that are opinionated and colloquial, representing realistic informal language usage; (ii) it contains paraphrases that are lexically diverse; and (iii) it contains sentences that are lexically similar but semantically dissimilar. It raises many interesting research questions and could lead to a better understanding of everyday language and how semantics can be captured in such language. We believe that such a common testbed will make it easier to compare different approaches, lead to a better understanding of how semantics are conveyed in natural language, and help advance other NLP techniques for noisy user-generated text in the long run.

Task Description and Evaluation Metrics
The task has two sentence-level sub-tasks: a paraphrase identification task and an optional semantic textual similarity task. The two sub-tasks share the same data but differ in annotation and evaluation.

Task A - Paraphrase Identification (PI)
Given two sentences, determine whether they express the same or very similar meaning. Following the literature on paraphrase identification, we evaluate system performance by the F-1 score (harmonic mean of precision and recall) against human judgements.

Task B - Semantic Textual Similarity (SS)
Given two sentences, determine a numerical score between 0 (no relation) and 1 (semantic equivalence) to indicate their semantic similarity. Following the literature, the system outputs are compared by Pearson correlation with human scores. We also compute the maximum F-1 score over the precision-recall curve as an additional data point.
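
As an illustration of the three metrics named above, the following is a minimal sketch in plain Python. It is not the official evaluation script; the function names are hypothetical, and taking the candidate cutoffs for the maximum F-1 from the system scores themselves is an assumption.

```python
from math import sqrt

def precision_recall_f1(gold, pred):
    """gold, pred: equal-length sequences of booleans (True = paraphrase)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pearson(xs, ys):
    """Pearson correlation between system scores and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def max_f1(gold, scores):
    """Best F-1 reachable by thresholding the graded scores (assumed cutoffs: the scores)."""
    best = 0.0
    for cutoff in sorted(set(scores)):
        pred = [s >= cutoff for s in scores]
        best = max(best, precision_recall_f1(gold, pred)[2])
    return best
```

A system that outputs only graded scores can thus still be compared on a binary decision, since `max_f1` searches for the cutoff that suits it best.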

Corpus
In this shared task, we use the Twitter Paraphrase Corpus that we first presented in (Xu, 2014). Table 2 shows the basic statistics of the corpus. The sentences are preprocessed with tokenization, POS tagging and named entity tagging. The training and development sets together consist of 17,790 sentence pairs posted between April 24th and May 3rd, 2013 from 500+ trending topics featured on Twitter (excluding hashtags); the two sets are a random split of these pairs. Each sentence pair is annotated by 5 different crowdsourcing workers. For the test set, we obtain both crowdsourced and expert labels on 972 sentence pairs from 20 randomly sampled Twitter trending topics between May 13th and June 10th, 2013. We use expert labels in this SemEval evaluation. Our dataset is more realistic and balanced, containing about 70% non-paraphrases vs. the 34% non-paraphrases in the benchmark Microsoft Paraphrase Corpus (MSR corpus) derived from news articles by Dolan et al. (2004). As noted in (Das and Smith, 2009), the lack of natural non-paraphrases in the MSR corpus creates bias towards certain models.

Annotation
In this section, we describe our data collection and annotation methodology. Since Twitter users are free to talk about anything regarding any topic, a random pair of sentences about the same topic has a low chance of expressing the same meaning (empirically, this is less than 8%). This causes two problems: a) it is expensive to obtain paraphrases via manual annotation; b) non-expert annotators tend to loosen the criteria and are more likely to make false positive errors. To address these challenges, we design a simple annotation task and introduce two selection mechanisms to select sentences which are more likely to be paraphrases, while preserving diversity and representativeness.

Figure 1: A heat-map showing overlap between expert and crowdsourcing annotation. The intensity along the diagonal indicates good reliability of crowdsourcing workers for this particular task; and the shift above the diagonal reflects the difference between the two annotation schemas. For crowdsourcing (turk), the numbers indicate how many annotators out of 5 picked the sentence pair as paraphrases; 0 and 1 are considered non-paraphrases; 3, 4 and 5 are paraphrases. For expert annotation, 0, 1 and 2 are non-paraphrases; 4 and 5 are paraphrases. Medium-scored cases (2 for crowdsourcing; 3 for expert annotation) are discarded in the system evaluation of the PI sub-task.
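
The thresholds stated in the Figure 1 caption can be written down as a short sketch. The function names and the `(sentence1, sentence2, expert_score)` tuple layout are hypothetical and chosen only for illustration.

```python
def crowd_label(votes):
    """votes: how many of the 5 crowd workers (0-5) marked the pair as a paraphrase."""
    if votes >= 3:
        return True    # 3, 4, 5: paraphrase
    if votes <= 1:
        return False   # 0, 1: non-paraphrase
    return None        # 2: medium-scored, discarded in PI evaluation

def expert_label(score):
    """score: expert rating on the 0-5 similarity scale."""
    if score >= 4:
        return True    # 4, 5: paraphrase
    if score <= 2:
        return False   # 0, 1, 2: non-paraphrase
    return None        # 3: medium-scored, discarded in PI evaluation

def pi_evaluation_pairs(pairs):
    """Keep only pairs with a decided expert label; pairs are (s1, s2, expert_score)."""
    return [(s1, s2, expert_label(score))
            for s1, s2, score in pairs
            if expert_label(score) is not None]
```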

Raw Data from Twitter
We crawl Twitter's trending topics and their associated tweets using public APIs. According to Twitter, trends are determined by an algorithm which identifies topics that are immediately popular, rather than those that have been popular for longer periods of time or which trend on a daily basis. We tokenize the tweets, remove emoticons and split each tweet into sentences.

Task Design on Mechanical Turk
We show the annotator an original sentence, then ask them to pick sentences with the same meaning from 10 candidate sentences. The original and candidate sentences are randomly sampled from the same topic. For each such 1 vs. 10 question, we obtain binary judgements from 5 different annotators, paying each annotator $0.02 per question. On average, each question takes one annotator about 30 ∼ 45 seconds to answer.

Annotation Quality
We remove problematic annotators by checking their Cohen's Kappa agreement (Artstein and Poesio, 2008) with other annotators. We also compute inter-annotator agreement with an expert annotator on the test dataset of 972 sentence pairs. In the expert annotation, we adopt a 5-point Likert scale to measure the degree of semantic similarity between sentences, which is defined by Agirre et al. (2012) as follows:
5: Completely equivalent, as they mean the same thing;
4: Mostly equivalent, but some unimportant details differ;
3: Roughly equivalent, but some important information differs/missing;
2: Not equivalent, but share some details;
1: Not equivalent, but are on the same topic;
0: On different topics.
Although the two scales of expert and crowdsourcing annotation are defined differently, their Pearson correlation coefficient reaches 0.735 (two-tailed significance 0.001). Figure 1 shows a heat-map representing the detailed overlap between the two annotations. It suggests that the graded similarity annotation task could be reduced to a binary choice in a crowdsourcing setup. As for the binary paraphrase judgements, the integrated judgement of five crowdsourcing workers achieves an F1-score of 0.823, precision of 0.752 and recall of 0.908 against expert annotations.
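
For reference, Cohen's Kappa, which is used in this section to screen annotators, can be computed for two annotators as in the sketch below. How agreement is aggregated across annotator pairs and which cutoff marks an annotator as problematic are not specified here, so the sketch covers only the pairwise statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(1 for x, y in zip(labels_a, labels_b) if x == y) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k]
                   for k in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.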

Automatic Summarization Inspired Sentence Filtering
We filter the sentences within each topic to select more probable paraphrases for annotation. Our method is inspired by a typical problem in extractive summarization, namely that the salient sentences are likely redundant (paraphrases) and need to be removed from the output summaries. We employ the scoring method used in SumBasic (Nenkova and Vanderwende, 2005; Vanderwende et al., 2007), a simple but powerful summarization system, to find salient sentences. For each topic, we compute the probability of each word P(w_i) by simply dividing its frequency by the total number of all words in all sentences. Each sentence s is scored as the average of the probabilities of the words in it, i.e.
$$\mathrm{Salience}(s) = \frac{1}{|s|} \sum_{w_i \in s} P(w_i)$$
where |s| is the number of words in s.
We then rank the sentences by this score and pick the original sentence randomly from the top 10% most salient sentences, and the candidate sentences from the top 50%, to present to the annotators.
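
The scoring and selection steps described above can be sketched as follows. This is an illustrative reading, not the original implementation: the function names are hypothetical, sentences are assumed to be non-empty token lists, and the average is taken over tokens.

```python
from collections import Counter
import random

def salience_scores(sentences):
    """sentences: list of non-empty token lists from one topic."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    # Score each sentence by the average probability of its words.
    return [sum(prob[w] for w in s) / len(s) for s in sentences]

def pick_question(sentences, n_candidates=10, rng=random):
    """Return (index of original sentence, indices of candidate sentences)."""
    scores = salience_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top10 = ranked[:max(1, len(ranked) // 10)]   # top 10% most salient
    top50 = ranked[:max(1, len(ranked) // 2)]    # top 50% most salient
    original = rng.choice(top10)
    pool = [i for i in top50 if i != original]
    candidates = rng.sample(pool, min(n_candidates, len(pool)))
    return original, candidates
```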
In a trial experiment on 20 topics, the filtering technique doubles the yield of paraphrases, from 152 to 329 out of 2000 sentence pairs, over naïve random sampling (Figure 2 and Figure 3). We also use PINC (Chen and Dolan, 2011) to measure the quality of the paraphrases collected (Figure 4). PINC was designed to measure n-gram dissimilarity between two sentences, and in essence it is the inverse of BLEU. In general, the cases with high PINC scores include more complex and interesting rephrasings.
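
To make the PINC measure concrete, here is a minimal sketch of its usual formulation: the average, over n-gram orders, of the fraction of candidate n-grams that do not appear in the source sentence. The maximum order of 4 and the skipping of orders for which the candidate is too short are assumptions of this sketch.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """n-gram dissimilarity of candidate with respect to source, in [0, 1]."""
    terms = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            continue  # candidate shorter than n tokens (assumed: skip this order)
        overlap = len(ngrams(source, n) & cand)
        terms.append(1 - overlap / len(cand))
    return sum(terms) / len(terms) if terms else 0.0
```

Identical sentences score 0 and sentences with no shared words score 1, so a higher value means a more lexically novel rephrasing.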

Topic Selection using Multi-Armed Bandits (MAB) Algorithm
Another approach to increasing paraphrase yield is to choose more appropriate topics. This is particularly important because the number of paraphrases, and thus the chance of encountering paraphrases during annotation, varies greatly from topic to topic (Figure 2). We treat this topic selection problem as a variation of the Multi-Armed Bandit (MAB) problem (Robbins, 1985) and adapt a greedy algorithm, the bounded ε-first algorithm of Tran-Thanh et al. (2012), to accelerate our corpus construction. Our strategy consists of two phases. In the first, exploration phase, we dedicate a fraction ε of the total budget B to exploring randomly chosen arms of each slot machine (trending topic on Twitter), each m times. In the second, exploitation phase, we sort all topics according to their estimated proportion of paraphrases, and sequentially annotate the $\frac{(1-\epsilon)B}{l-m}$ arms that have the highest estimated reward until reaching the maximum of l = 10 annotations for any topic, to ensure data diversity.
We tune the parameter m to be 1 and ε to be between 0.35 and 0.55 through simulation experiments, by artificially duplicating a small amount of real annotation data. We then apply this MAB algorithm in the real world: we explore 500 random topics and then exploit 100 of them. The yield of paraphrases rises to 688 out of 2000 sentence pairs by using MAB and sentence filtering, a 4-fold increase compared to only using random selection (Figure 3).
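
The two-phase strategy can be sketched as below. This is a simplified illustration rather than the algorithm of Tran-Thanh et al. (2012) in full: the budget is assumed to be counted in annotation questions of unit cost, and `annotate` is a hypothetical callback that returns the observed paraphrase yield of one question on a topic.

```python
import random

def bounded_epsilon_first(topics, annotate, budget, epsilon=0.45, m=1, l=10, rng=random):
    """Return a dict mapping each explored topic to its list of observed yields."""
    # Exploration: spend about epsilon * budget pulling random topics m times each.
    n_explore = min(len(topics), int(epsilon * budget) // m)
    explored = rng.sample(topics, n_explore)
    results = {t: [annotate(t) for _ in range(m)] for t in explored}

    # Exploitation: spend the rest on the topics with the best estimated yield,
    # topping each one up to l annotations in total.
    n_exploit = int((1 - epsilon) * budget) // (l - m)
    ranked = sorted(explored, key=lambda t: sum(results[t]) / m, reverse=True)
    for t in ranked[:n_exploit]:
        results[t].extend(annotate(t) for _ in range(l - m))
    return results
```

Capping every topic at l annotations keeps a few highly productive topics from dominating the corpus, which is the diversity constraint mentioned above.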

Baselines
We provide three baselines: a random baseline, a strong supervised baseline and a state-of-the-art unsupervised system:

Random:
This baseline provides a random real number in [0, 1] for each test sentence pair as the semantic similarity score, and uses 0.5 as the cutoff for binary paraphrase identification output.

Logistic Regression:
This is a supervised logistic regression (LR) baseline used by Das and Smith (2009). It uses simple n-gram overlap features (on both surface and stemmed forms) but shows very competitive performance on the MSR news paraphrase corpus. It uses 0.5 as the cutoff to create binary outputs for the paraphrase identification task.
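
A minimal sketch of what n-gram overlap features for a sentence pair might look like is given below. The exact feature set of the baseline is not listed here, so the choice of precision, recall and F1 for n = 1 to 3 is an assumption, and the `stem` argument is a hypothetical hook for whichever stemmer is used.

```python
def ngram_set(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_features(sent1, sent2, max_n=3, stem=None):
    """Precision, recall and F1 of n-gram overlap for n = 1..max_n."""
    if stem is not None:
        # Optional stemmed variant of the same features.
        sent1 = [stem(w) for w in sent1]
        sent2 = [stem(w) for w in sent2]
    feats = []
    for n in range(1, max_n + 1):
        a, b = ngram_set(sent1, n), ngram_set(sent2, n)
        common = len(a & b)
        p = common / len(a) if a else 0.0
        r = common / len(b) if b else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        feats.extend([p, r, f])
    return feats
```

The resulting vector (surface features, optionally concatenated with the stemmed ones) would then be fed to a logistic regression classifier whose output probability is cut at 0.5.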

Weighted Textual Matrix Factorization (WTMF):
The third baseline is a state-of-the-art unsupervised method developed by Guo and Diab (2012). It is specially developed for short sentences and models the semantic space of both the words that are present in and the words that are absent from the sentences. The model was learned from WordNet (Fellbaum, 2010), OntoNotes (Hovy et al., 2006), Wiktionary and the Brown corpus (Francis and Kucera, 1979). It uses 0.5 as the cutoff in the binary paraphrase identification task.

Systems and Results
A total of 18 teams participated in the PI task (required), 13 of which also submitted to the SS task (optional). Every team submitted 2 runs except one (up to 2 runs were allowed). Table 3 shows the evaluation results. We use the F1-score and Pearson correlation as the primary evaluation metrics for the PI and SS tasks respectively. The results are encouraging: most systems outperformed the two strong baselines we chose, while still showing room for improvement towards the human upper bound estimated by the crowdsourcing workers' performance.

Discussion
Most participants chose supervised methods, except for MathLingBp, who use a semi-supervised method, and Columbia and Yamraj, who use unsupervised methods. While the best-performing systems are supervised, the best unsupervised system still outperforms some supervised systems and the state-of-the-art unsupervised baseline. About half of the systems use word embeddings and many use neural networks.
To the best of our knowledge, this is the first evaluation in which a large number of systems address the two related tasks, paraphrase identification and semantic similarity, side by side for comparison. One interesting observation is that the performance of the same system on the two tasks ("F1 vs. Pearson") is not necessarily related. For example, ASOBEK ranked 1st (out of 35 runs) and 18th (out of 25 runs) in the PI and SS tasks respectively, RTM-DCU ranked 27th and 3rd, while the MITRE system ranked 3rd and 1st. None of the pairings "F1 vs. max-F1", "Pearson vs. max-F1" or "F1 vs. Pearson" shows a strong correlation. This implies that (i) high-performance PI systems can be developed by focusing on the binary classification problem without focusing on the degree of similarity; (ii) it is crucial to select the threshold that balances precision and recall for the PI binary classification problem; (iii) it is important for SS systems to handle the debatable cases appropriately.

Participants' Systems
A total of 19 teams participated:
AJ: This team utilizes TERp and BLEU, automatic evaluation metrics for Machine Translation. The system uses a logistic regression model and performs threshold selection.

AMRITACEN: This team uses Recursive Auto Encoders (RAEs). The matrix generated for the given input sentences is of variable size and is then converted to an equal-sized matrix using the repeat matrix concept.
ASOBEK (Eyecioglu and Keller, 2015): This team uses an SVM classifier with simple lexical word overlap and character n-gram features.
CDTDS (Karampatsis, 2015): This team uses support vector regression trained only on the training set, using the number of positive votes out of the 5 crowdsourcing annotations.
Columbia: This system maps each original sentence to a low-dimensional vector using Orthogonal Matrix Factorization (Guo et al., 2014), and then computes a similarity score based on the low-dimensional vectors.
Depth: This team uses a neural network that learns representations of sentences, then computes similarity scores based on the hidden vector representations of the two sentences.
EBIQUITY (Satyapanich et al., 2015): This team trains supervised SVM and logistic regression models using features of semantic similarities between sentence pairs.

Table 3: Evaluation results. The first column presents the rank of each team in the two tasks based on each team's best system. The superscripts are the ranks of systems, ordered by F1 for the Paraphrase Identification (PI) task and by Pearson for the Semantic Similarity (SS) task. A marker symbol indicates an unsupervised or semi-supervised system. In total, 19 teams participated in the PI task, of which 14 teams also participated in the SS task. Note that although the two sub-tasks share the same test set of 972 sentence pairs, the PI task ignores 134 debatable cases (those that received a medium score from the expert annotator) and uses only 838 pairs (663 paraphrases and 175 non-paraphrases) in evaluation, while the SS task uses all 972 pairs. As a result, the F1-score in the PI task can be higher than the maximum F1-score in the SS task. Also note that the F1-scores of the baselines in the PI task are higher than those reported in Table 2 of our earlier work, because the latter reported maximum F1-scores on the PI task, ignoring the debatable cases.

ECNU (Zhao and Lan, 2015): This team adopts typical machine learning classifiers and uses a variety of features, such as surface text, semantic level, textual entailment, and word distributional representations learned by deep learning methods.
FBK-HLT (Ngoc Phuoc An Vo and Popescu, 2015): This team uses a supervised learning model with different features for the 2 runs, such as n-gram overlap, word alignment and edit distance.
Hassy: This team uses a bag-of-embeddings approach via supervised learning. Two sentences are first embedded into a vector space, and then the system computes the dot-product of the two sentence embeddings.
HLTC-HKUST (Bertero and Fung, 2015): This team uses supervised classification with a standard two-layer neural network classifier. The features used include translation metrics, lexical, syntactic and semantic similarity scores, the latter with an emphasis on aligned semantic roles comparison.
MathLingBp: This team implements the align-and-penalize architecture described by Han et al. (2013) with slight modifications and makes use of several word similarity metrics. One metric relies on a mapping of words to vectors built from the Rovereto Twitter N-Gram corpus, another on a synonym list built from Wiktionary's translations, while a third approach derives word similarity from concept graphs built using the 4lang lexicon and the Longman Dictionary of Contemporary English (Kornai et al., 2015).
MITRE (Zarrella et al., 2015): A recurrent neural network models semantic similarity between sentences using the sequence of symmetric word alignments that maximize cosine similarity between word embeddings. We include features from local similarity of characters, random projection, matching word sequences, pooling of word embeddings, and alignment quality metrics. The resulting ensemble uses both semantic and string matching at many levels of granularity.
RTM-DCU (Biçici, 2015): This team uses referential translation machines (RTM) and a machine translation performance prediction system (MTPP) for predicting semantic similarity, where indicators of translatability are used as features (Biçici and Way, 2014) and instance selection for RTM is performed with FDA5 (Biçici and Yuret, 2014). RTM works as follows: FDA5 → MTPP → ML training → predict.
Rob (van der Goot and van Noord, 2015): This system is inspired by a state-of-the-art semantic relatedness prediction system by Bjerva et al. (2014). It combines features from different parses with lexical and compositional distributional features using a logistic regression model.

STANFORD: This team uses a supervised system with sentiment, phrase similarity matrix, and alignment features. Similarity metrics are based on vector space representations of phrases, which were trained on a large corpus.
TkLbLiiR (Glavaš et al., 2015): This team uses a supervised model with about 15 comparison-based numeric features. The most important features are the distributional features weighted by the topic-specific information.
WHUHJP: This team uses the word2vec tool to train a vector model on the training data, then computes distributed representations of sentences in the test set and their cosine similarity.
Yamraj: This team uses word and phrase vectors pre-trained on the Google News data set (about 100 billion words) and Wikipedia articles. The system relies on the cosine distance between vectors representing the sentences, computed using the open-source toolkit Gensim.
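
Several of the systems above score a sentence pair by the cosine similarity of sentence-level vectors. The sketch below illustrates the general idea in plain Python; composing a sentence vector by averaging word vectors is an assumption made for illustration and is not a description of any particular team's system, and `vectors` is a hypothetical word-to-vector dictionary.

```python
from math import sqrt

def sentence_vector(tokens, vectors):
    """Average the vectors of the known words; returns None if no word is known."""
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In practice the word vectors would come from a trained embedding model rather than a hand-written dictionary.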

Conclusions and Future Work
We have presented the task definition, data annotation and evaluation results for the first Paraphrase and Semantic Similarity in Twitter (PIT) shared task.
Our analysis provides some initial insights into the relation and the difference between the paraphrase identification and semantic similarity problems. We make all the data, baseline systems and evaluation scripts publicly available. In the future, we plan to extend the task to allow the use of more information from social networks, for example, by providing the full tweets (and their ids) that are associated with each sentence and with each topic.