A Continuously Growing Dataset of Sentential Paraphrases

A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity, as it gets rid of the classifier or human in the loop needed to select data before annotation and subsequent application of paraphrase identification algorithms in the previous work. We present the largest human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.

In this paper, we address a major challenge in paraphrase research: the lack of parallel corpora. There are only two publicly available datasets of naturally occurring sentential paraphrases and non-paraphrases: the MSRP corpus derived from clustered news articles (Dolan and Brockett, 2005) and the PIT-2015 corpus from Twitter trending topics (Xu et al., 2014, 2015). Our goal is not only to create a new annotated paraphrase corpus, but also to identify a new data source and method that can narrow down the search space of paraphrases without the classifier-biased or human-in-the-loop data selection used in MSRP and PIT-2015, so that sentential paraphrases can be conveniently and continuously harvested in large quantities to benefit downstream applications. We present an effective method to collect sentential paraphrases from tweets that refer to the same URL and contribute a new gold-standard annotated corpus of 51,524 sentence pairs, which is the largest to date (Table 1). We show how the characteristics of this new dataset contrast with those of the two existing corpora through the first systematic study of paraphrase identification across multiple datasets. Our new corpus is complementary to previous work, as it contains multiple references of both formal well-edited and informal user-generated texts. This is also the first work that provides a continuously growing collection, with more than 30,000 new sentential paraphrases per month automatically labeled at ∼70% precision. We demonstrate that up-to-date phrasal paraphrases can then be extracted via word alignment (see examples in Table 2). We plan to continue collecting paraphrases using our method and release a constantly updating paraphrase resource.

Table 1: Summary of publicly available large sentential paraphrase corpora with manual quality assurance. Our Twitter News URL Corpus has the advantages of including both meaningful non-paraphrases (Non-Para.) and multiple references (Multi-Ref.), which are important for training paraphrase identification and evaluating paraphrase generation, respectively.

Microsoft Research Paraphrase Corpus [MSRP] (Dolan et al., 2004; Dolan and Brockett, 2005) This corpus contains 5,801 pairs of sentences from news articles, with 4,076 for training and the remaining 1,725 for testing. It was created from clustered news articles by using an SVM classifier (with features including string similarity and WordNet synonyms) to gather likely paraphrases, which were then annotated by humans for semantic equivalence. The MSRP corpus has a known deficiency: it is skewed toward over-identification (Das and Smith, 2009), because the "purpose was not to evaluate the potential effectiveness of the classifier itself, but to identify a reasonably large set of both positive and plausible 'near-miss' negative examples" (Dolan and Brockett, 2005). It contains a large portion of sentence pairs that share many ngrams.
Twitter Paraphrase Corpus [PIT-2015] (Xu et al., 2014, 2015) This corpus was derived from Twitter's trending topic data. The training set contains 13,063 sentence pairs on 400 distinct topics, and the test set contains 972 sentence pairs on 20 topics. As numerous Twitter users spontaneously talk about varied topics, this dataset contains many lexically divergent paraphrases. However, this method requires a manual step of selecting topics to ensure the quality of collected paraphrases, because many topics detected automatically are either incorrect or too broad. For example, the topic "New York" relates to tweets with a wide range of information and cannot narrow the search space down enough for human annotation and the subsequent application of classification algorithms.

Constructing the Twitter URL Paraphrase Corpus
For paraphrase acquisition, it has been crucial to find a simple and effective way to locate paraphrase candidates (see related work in Section 6). We show the efficacy of tracking URLs in Twitter. This method does not rely on automatic news clustering as in MSRP or topic detection as in PIT-2015, but it keeps collecting good candidate paraphrase pairs in large quantities.

Twitter News URL Corpus
Original Tweet: Samsung halts production of its Galaxy Note 7 as battery problems linger
Paraphrase: #Samsung temporarily suspended production of its Galaxy #Note7 devices following reports
Paraphrase: News hit that @Samsung is temporarily halting production of the #GalaxyNote7.
Paraphrase: Samsung still having problems with their Note 7 battery overheating. Completely halt production.
Paraphrase: SAMSUNG HALTS PRODUCTS OF GALAXY NOTE 7 . THE BATTERIES ARE * STILL * EXPLODING .
Non-Paraphrase: in which a phone bonfire in 1995-a real one-is a metaphor for samsung's current note 7 problems
Non-Paraphrase: samsung decides, "if we don't build it, it won't explode."
Non-Paraphrase: Samsung's Galaxy Note 7 Phones AND replacement phones have been going up in flames due to the defective batteries

Table 4: A representative set of tweets linked by a URL in streaming data (generally poor readability).

Data Source: News Tweets vs. Streaming
We extracted the embedded URL in each tweet and used Twitter's Search API to retrieve all tweets that contain the same URL. Some tweets use shortened URLs, which we resolve to full URLs. We tracked 22 English news accounts on Twitter to create the paraphrase corpus in this paper (see examples in Table 3). We will extend the corpus to include other languages and domains in future work. As shown in Table 5, nearly all the tweets posted by news agencies have embedded URLs. About 51.17% of posts contain two URLs, usually one pointing to a news article and the other to media such as a photo or video. Although close to half of the tweets in Twitter streaming data contain at least one URL, most of them are very hard to read (see examples in Table 4).
Table 5: Statistics of tweets in Twitter's streaming data and news account data. Many tweets contain more than one URL because media such as photo or video is also represented by URLs.
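The URL-linking step can be illustrated with a minimal sketch. It assumes tweets have already been retrieved and their shortened URLs resolved; the dictionary layout (`text`, `urls`) and the function names are hypothetical and are not the Twitter API or the released code.

```python
from collections import defaultdict


def group_by_url(tweets):
    """Map each (already resolved) URL to the tweets that embed it."""
    groups = defaultdict(list)
    for tweet in tweets:
        for url in set(tweet["urls"]):  # count a repeated URL once per tweet
            groups[url].append(tweet)
    return groups


def candidate_pairs(news_tweet, groups):
    """Pair a news-account tweet with every distinct tweet sharing one of its URLs."""
    seen = {news_tweet["text"]}
    pairs = []
    for url in news_tweet["urls"]:
        for other in groups.get(url, []):
            if other["text"] not in seen:  # skip the tweet itself and exact copies
                seen.add(other["text"])
                pairs.append((news_tweet["text"], other["text"]))
    return pairs
```

Each resulting pair is only a paraphrase candidate; retweet filtering and labeling still have to follow.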

Filtering of Retweets
Retweeting is an important feature in Twitter. There are two types: automatic and manual retweets. An automatic retweet is done by clicking the retweet button on Twitter and is easy to remove using the Twitter API. A manual retweet occurs when the user creates a new tweet by copying and pasting the original tweet and possibly adding some extras, such as hashtags, usernames or comments. It is crucial to remove these redundant tweets with minor variations, which otherwise represent a significant portion of the data (Table 6). We preprocessed the tweets using a tokenizer (Gimpel et al., 2011) and an in-house sentence splitter. We then filtered out manual retweets using a set of rules, checking if one tweet was a substring of the other, or if it only differed in punctuation, or the contents of the "twitter:title" or "twitter:description" tag in the linked HTML file of the news article. Table 6 shows the effectiveness of the filtering. We used PINC, a standard paraphrase metric, to measure ngram-based dissimilarity (Chen and Dolan, 2011), and the Jaccard metric to measure token-based string similarity (Jaccard, 1912). After filtering, the dataset contains tweets with more significant rephrasing, as indicated by higher PINC and lower Jaccard scores.
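The filtering rules and the two metrics can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: lowercasing, stripping all ASCII punctuation (which also removes `#` and `@`), the default of up to 4-grams in PINC, and skipping n-gram orders longer than the candidate are choices made here. The function names are hypothetical.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().translate(_PUNCT).split())


def is_manual_retweet(tweet, original):
    """True if one text contains the other once punctuation differences are ignored."""
    a, b = normalize(tweet), normalize(original)
    return a in b or b in a


def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a_tokens, b_tokens):
    """Token-set overlap: |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 1.0


def pinc(source_tokens, candidate_tokens, max_n=4):
    """Average over n of the share of candidate n-grams NOT found in the source."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate_tokens, n)
        if not cand:  # candidate shorter than n tokens: skip this order
            continue
        overlap = len(cand & ngrams(source_tokens, n))
        scores.append(1.0 - overlap / len(cand))
    return sum(scores) / len(scores) if scores else 0.0
```

A candidate that copies the source verbatim gets a PINC of 0 and a Jaccard score of 1, which is the pattern the filter is meant to remove.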

Gold Standard Corpus
To get the gold-standard paraphrase corpus, we obtained human labels on Amazon Mechanical Turk. We showed annotators an original sentence, and asked them to select sentences with the same meaning from 10 candidate sentences. For each question, we recruited 6 annotators and paid $0.03 to each worker. On average, each question took about 53 seconds to finish. For each sentence pair, we aggregated the paraphrase and non-paraphrase labels using the majority vote. We constructed the largest gold standard paraphrase corpus to date, with 42,200 tweets of 4,272 distinct URLs annotated in the training set and 9,324 tweets of 915 distinct URLs in the test set. The training data was collected between 10/10/2016 and 11/22/2016, and testing data between 01/09/2017 and 01/19/2017. In Section 4, we contrast the characteristics of our data against existing paraphrase corpora.
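Label aggregation by majority vote can be sketched as below. With six annotators a 3-3 tie is possible, and the text here does not say how ties are resolved, so this sketch simply returns `None` for a tie. That behavior is an assumption, as is the function name.

```python
def majority_vote(labels):
    """Aggregate boolean paraphrase labels; return None when the vote is tied."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == negatives:
        return None  # tie handling is not specified in the text (assumption)
    return positives > negatives
```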
Quality Control We evaluated the annotation quality of each worker using Cohen's kappa agreement (Artstein and Poesio, 2008) against the majority vote of other workers. We asked the best workers (the top 528 out of 876) to label more data by republishing the questions done by workers with low reliability (Cohen's kappa <0.4).
Inter-Annotator Agreement In addition, we had 300 sampled sentence pairs independently annotated by an expert. The annotation agreement between the expert and the majority vote of 6 crowdsourcing workers is 0.739 by Cohen's kappa. If we assume the expert annotation is gold, the precision of the worker vote is 0.871, the recall is 0.787, and F1 is 0.827, similar to those of PIT-2015.
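For reference, Cohen's kappa and the F1 measure used above follow their standard definitions. The sketch below is a plain implementation of those definitions, not the authors' evaluation script.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Plugging in the reported precision and recall, `f1(0.871, 0.787)` rounds to 0.827, which matches the F1 stated above.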

Continuous Harvesting of Sentential Paraphrases
Since our method directly applies to raw tweets, it can continuously extract sentential paraphrases from Twitter. In Section 4, we show that this approach can produce a silver-standard paraphrase corpus at about 70% precision that grows by more than 30,000 new sentential paraphrases per month. Section 5 presents experiments demonstrating the utility of these automatically identified sentential paraphrases.

Comparison of Paraphrase Corpora
Though paraphrasing has been widely studied, supporting analyses and experiments have thus far often only been conducted on a single dataset. In this section, we present a comparative analysis of our newly constructed gold-standard corpus with two existing corpora by 1) individually examining the instances of paraphrase phenomena and 2) benchmarking a range of automatic paraphrase identification approaches.

Paraphrase Phenomena
In order to show the differences across these three datasets, we sampled 100 sentential paraphrases from each training set and counted occurrences of each phenomenon in the following categories: Elaboration (textual pairs can differ in total information content, such as Trump's ex-wife Ivana and Ivana Trump), Phrasal (alternates of phrases, such as taking over and replaces), Spelling (spelling variants, such as Trump and Trumpf), Synonym (such as said and told), Anaphora (a full noun phrase in one sentence that corresponds to the counterpart, such as @MarkKirk and Kirk) and Reordering (when a word, a phrase or the whole sentence is reordered, even if only logically, such as Matthew Fishbein questioned him and under questioning by Matthew Fishbein). We report the average number of occurrences of each paraphrase type per sentence pair for each corpus in Table 7. As sentences tend to be longer in MSRP and shorter in PIT-2015, we also normalized the numbers by the length of sentences to be more comparable to the URL dataset. These three datasets exhibit distinct and complementary compositions of paraphrase phenomena. MSRP has more synonyms, because authors of different news articles may use different and rather sophisticated words. PIT-2015 contains many phrasal paraphrases, probably because most tweets under the same trending topic are written spontaneously and independently. Our URL dataset shows more elaboration, spelling and anaphora paraphrase phenomena, showing that many URL-embedded tweets are created by users with a conscious intention to rephrase the original news headline.

Automatic Paraphrase Identification
We provide a benchmark on paraphrase identification to better understand various models, as well as the characteristics of our new corpus compared to the existing ones. We focus on binary classification of paraphrase/non-paraphrase, and report the maximum F1 measure of any point on the precision-recall curve.
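The evaluation measure, the maximum F1 over all points on the precision-recall curve, can be computed by sweeping a decision threshold over the system scores. The following is a minimal sketch. The function name and the convention of predicting "paraphrase" when the score is at or above the threshold are assumptions.

```python
def max_f1(scores, labels):
    """Best F1 over all thresholds; labels are 1 (paraphrase) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    best, true_pos = 0.0, 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # A threshold cannot split tied scores, so evaluate only after the last tie.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos:
            precision, recall = true_pos / i, true_pos / total_pos
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```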

Models
We chose several representative technical approaches for automatic paraphrase identification:
GloVe (Pennington et al., 2014) This is a word representation model trained on aggregated global word-word co-occurrence statistics from a corpus. We used 300-dimensional word vectors trained on Common Crawl and Twitter, summed the vectors for each sentence, and computed the cosine similarity.
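The GloVe baseline reduces to summing word vectors and taking a cosine. A minimal sketch follows, assuming the embeddings are available as a plain dictionary from token to a list of floats. Loading the pretrained vectors is omitted, and skipping out-of-vocabulary tokens is an assumption.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Sum the word vectors of a sentence; unknown tokens contribute nothing."""
    vec = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(embeddings.get(token, ())):
            vec[i] += value
    return vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```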
LR The logistic regression (LR) model incorporates 18 features based on 1-3 gram overlaps between two sentences ($s_1$ and $s_2$) (Das and Smith, 2009). The features are of the form precision$_n$ (number of n-gram matches divided by the number of n-grams in $s_1$), recall$_n$ (number of n-gram matches divided by the number of n-grams in $s_2$), and $F_n$ (harmonic mean of recall and precision). The model also includes lemmatized versions of these features.

WMF/OrMF
Specifically, for the (vec) version, the vectors of a pair of sentences, $v_1$ and $v_2$, are converted into one feature vector, $[v_1 + v_2, |v_1 - v_2|]$, by concatenating the element-wise sum $v_1 + v_2$ and absolute difference $|v_1 - v_2|$. We also provide the (sim) variation, which directly uses the single cosine similarity score between two sentence vectors.
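The two pair representations can be sketched directly from this description. The sentence vectors are assumed to come from a WMF/OrMF model that is not shown here, and the function names are hypothetical.

```python
import math


def vec_features(v1, v2):
    """(vec): concatenate the element-wise sum and the absolute difference."""
    total = [a + b for a, b in zip(v1, v2)]
    diff = [abs(a - b) for a, b in zip(v1, v2)]
    return total + diff


def sim_feature(v1, v2):
    """(sim): a single cosine similarity score between the two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```

Both the sum and the absolute difference are symmetric, so the (vec) features do not depend on which sentence of the pair comes first.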

LEX-WMF/LEX-OrMF
This is an open-sourced adaptation (Xu et al., 2014) of LEXDISCRIM (Ji and Eisenstein, 2013) that has shown comparable performance. It combines WMF/OrMF with n-gram overlapping features to train an LR classifier.
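The n-gram overlap features used by the LR model, and combined with WMF/OrMF here, can be sketched as follows. Counting matches with clipped multiset intersection is an assumption, and lemmatization is left to the caller: the hypothetical `pair_features` helper expects token and lemma lists that were produced elsewhere.

```python
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_features(s1, s2, max_n=3):
    """Precision, recall and F for 1..max_n grams (9 values when max_n is 3)."""
    feats = []
    for n in range(1, max_n + 1):
        g1, g2 = ngram_counts(s1, n), ngram_counts(s2, n)
        matches = sum((g1 & g2).values())  # clipped match count (assumption)
        precision = matches / sum(g1.values()) if g1 else 0.0
        recall = matches / sum(g2.values()) if g2 else 0.0
        total = precision + recall
        feats += [precision, recall, 2 * precision * recall / total if total else 0.0]
    return feats


def pair_features(tokens1, tokens2, lemmas1, lemmas2):
    """Surface-form features followed by the same features on lemmas (18 in total)."""
    return overlap_features(tokens1, tokens2) + overlap_features(lemmas1, lemmas2)
```

Three n-gram orders, three measures, and two token views give the 18 features mentioned for the LR model.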
MultiP MultiP (Xu et al., 2014) is a multi-instance learning model suited for short messages on Twitter. The at-least-one-anchor assumption in this model looks for two sentences that have a topical phrase in common, plus at least one pair of anchor words that carry a similar key meaning. This model achieved the best performance on the PIT-2015 dataset (Xu et al., 2014).
DeepPairwiseWord He et al. (2016) developed a deep neural network model that focuses on important pairwise word interactions across input sentences. This model innovates in proposing a similarity focus layer and a 19-layer very deep convolutional neural network to guide model attention to important word pairs. It has shown state-of-the-art performance on several textual similarity measurement datasets.

Model Performance and Dataset Difference
The results on the three benchmark paraphrase corpora are shown in Tables 8, 9 and 10. The random baseline reflects that close to 80% of sentence pairs are paraphrases in the MSRP corpus. This is atypical of real-world text data and may cause false positive predictions.
Both the edit distance and the LR models exploit surface word features. In particular, the LR model, which uses lemmatization and ngram overlap features, achieves very competitive performance on all datasets. Figure 1 takes a closer look at ngram differences across datasets as measured by the PINC metric (Chen and Dolan, 2011), which is essentially the inverse of BLEU (Papineni et al., 2002). MSRP consists of paraphrases with more ngram overlap (lower PINC), while PIT-2015 contains shorter and more lexically dissimilar sentences. Our new URL corpus is in between the two, and is more similar to PIT-2015. It includes users' intentional rephrasing of an original tweet from a news agency with some words untouched, as well as some dramatic paraphrases that are challenging for any automatic identification method, such as CO2 levels mark 'new era' in the world's changing climate and CO2 levels haven't been this high for 3 to 5 million years.
MultiP exploits a restrictive constraint that the candidate sentence pairs share the same topical phrase. It achieves the best performance on PIT-2015, which naturally contains such phrases. For the MSRP and URL datasets, we used the named entity tagged with the longest span as an approximation of a shared topic phrase, and the model thus suffered a performance drop.
Both GloVe and WMF/OrMF utilize the underlying co-occurrence statistics of the text corpus. WMF/OrMF use global matrix factorization to project sentences into a lower-dimensional space and show great advantages in measuring sentence-level semantic similarities over GloVe, which focuses on word representations. Figure 2 shows the fine-grained distribution of the OrMF-based cosine similarities and that the URL-linked Twitter data works well with OrMF to yield sentential paraphrases. Once combined with ngram overlap features, LEX-WMF and LEX-OrMF show consistently high performance across different datasets, close to the more complicated DeepPairwiseWord. The similarity focus mechanism on important pairwise word interactions in DeepPairwiseWord is more helpful for the two Twitter datasets, because they contain lexically divergent paraphrases while MSRP has an artificial bias toward sentences with high n-gram overlap.

Extracting Phrasal Paraphrases
We can apply paraphrase identification models trained on our gold standard corpus to unlabeled Twitter data and continuously harvest sentential paraphrases in large quantities. We used the open-sourced LEX-OrMF model and obtained 114,025 sentential paraphrases (system predicted probability ≥ 0.5 and average precision = 69.08%) from the free 1% sample of raw Twitter data between 10/10/2016 and 01/10/2017. To demonstrate the utility of this data, we show that we can extract up-to-date lexical and phrasal paraphrases from it.

Phrase Extraction and Ranking
One of the most successful ideas for obtaining lexical and phrasal paraphrases in large quantities is to apply word alignment and then rank the results for better quality. This approach was proposed by Bannard and Callison-Burch (2005) and previously applied to bilingual parallel data to create PPDB (Ganitkevitch et al., 2013; Pavlick et al., 2015). There has been little previous work utilizing monolingual parallel data to learn paraphrases since it is not as naturally available as bitexts.
We used the GIZA++ word aligner in the Moses machine translation toolkit (Koehn et al., 2007) and extracted 245,686 phrasal paraphrases. Some examples are shown in Table 2. We additionally explored two supervised monolingual aligners: Jacana aligner (Yao et al., 2013) and Md Sultan's aligner (Sultan et al., 2014). We ranked the phrase pairs using four different scores:
• Language Model Score Let $w_{-2} w_{-1} \, p \, w_1 w_2$ be the context of the phrase $p$. We considered a phrase $p'$ to be a good substitute for $p$ if $w_{-2} w_{-1} \, p' \, w_1 w_2$ is a likely sequence according to a language model (Heafield, 2011) trained on Twitter data.
• Our Score We trained a supervised SVM regression model using 500 phrase pairs with human ratings. We used the language model, translation, and GloVe scores as features, and additionally used the inverse phrase translation probability $\phi(p' \mid p)$ and the lexical weightings $\mathrm{lex}(p \mid p')$ and $\mathrm{lex}(p' \mid p)$ from Moses.
Figure 3 compares the different ranking methods against the human judgments on 200 phrase pairs randomly sampled from GIZA++.
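The Language Model Score can be illustrated with a short sketch: the candidate phrase is substituted into the two-word context on each side of the original phrase, and the resulting window is scored by a language model. The language model is passed in as a callable `log_prob(tokens)`. That callable, the function name, and the index convention are hypothetical, and no specific language-model library API is implied.

```python
def substitution_score(tokens, start, end, replacement, log_prob, window=2):
    """Score `replacement` in place of tokens[start:end] within a +/- `window` context.

    `log_prob` is any callable that maps a list of tokens to a log-probability.
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return log_prob(left + list(replacement) + right)
```

For the sentence "samsung will halt production of the note" and the replacement "suspend" for "halt", the scored window is "samsung will suspend production of".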

Paraphrase Quality Evaluation
We compared the quality of paraphrases extracted by our method with the closest previous work (BUCC-2013) (Xu et al., 2013), in which a similar phrase table was created using Moses from monolingual parallel tweets that contain the same named entity and calendar date. We randomly sampled 500 phrase pairs from each phrase table and collected human judgements on a 5-point Likert scale. Table 11 shows the evaluation results. We focused on the highest-quality paraphrases, those rated as 5 ("all of the meaning of the original phrase is retained, and nothing is added"), and their presence among all extracted paraphrases sorted by ranking scores. We were also interested in how these phrasal paraphrases compared with those in PPDB. We sampled 420 paraphrase pairs each from our phrase tables and from PPDB, and then checked what percentage of the total 840 could be found in our phrase tables and in PPDB, respectively. As shown in Table 12, there is little overlap between the URL data and PPDB: only 1.3% (51.3% − 50%) plus 0.8% (50.8% − 50%). Our Twitter URL data complements existing paraphrase resources such as PPDB, which are primarily derived from well-edited texts.

Related Work
Sentential Paraphrase Data Researchers have found several data sources from which to collect sentential paraphrases: multiple news agencies reporting the same event (MSRP) (Dolan et al., 2004; Dolan and Brockett, 2005), multiple translated versions of a foreign novel (Barzilay and Elhadad, 2003; Barzilay and Lee, 2003) or other texts (Cohn et al., 2008), multiple definitions of the same concept (Hashimoto et al., 2011), descriptions of the same video clip from multiple workers (Chen and Dolan, 2011) or rephrased sentences (Burrows et al., 2013; Toutanova et al., 2016). However, all these data collection methods are incapable of obtaining sentential paraphrases on a large scale (e.g., because of the limited number of news agencies or of books with multiple translated versions), and/or lack meaningful negative examples. Both of these properties are crucial for developing machine learning models that identify paraphrases and measure semantic similarities.
Table 12: Coverage comparison of phrasal paraphrases extracted from Twitter URL data (sampled 1:1:1 from GIZA++, Jacana and Sultan's aligner outputs) and the PPDB (Ganitkevitch et al., 2013).

Conclusion and Future Work
In this paper, we show how a simple method can effectively and continuously collect large-scale sentential paraphrases from Twitter. We rigorously evaluated our data with automatic paraphrase identification models and various measurements. We will share our new dataset with the research community; it includes 51,524 manually labeled sentence pairs and grows by about 30,000 automatically labeled sentential paraphrases per month. Future work could include expanding into the many different languages present in social media and developing language-independent automatic paraphrase identification models.