Acquiring Predicate Paraphrases from News Tweets

We present a simple method for the ever-growing extraction of predicate paraphrases from news headlines on Twitter. Analysis of the output of ten weeks of collection shows that the accuracy of the extracted paraphrases ranges from 60% to 86%, depending on their support level. We also demonstrate that our resource is to a large extent complementary to existing resources, providing many novel paraphrases. Our resource is publicly available and continuously expands based on daily news.


Introduction
Recognizing that various textual descriptions across multiple texts refer to the same event or action can benefit NLP applications such as recognizing textual entailment (Dagan et al., 2013) and question answering. For example, to answer "when did the US Supreme Court approve same-sex marriage?" given the text "In June 2015, the Supreme Court ruled for same-sex marriage", approve and ruled for should be identified as describing the same action.
To that end, much effort has been devoted to identifying predicate paraphrases, some of which resulted in released resources of predicate entailment or paraphrases. Two main approaches have been proposed: the first leverages the similarity in argument distribution across a large corpus between two predicates (e.g. [a] 0 buy [a] 1 / [a] 0 acquire [a] 1 ) (Lin and Pantel, 2001; Berant et al., 2010). The second approach exploits bilingual parallel corpora, extracting as paraphrases pairs of texts that were translated identically to foreign languages (Ganitkevitch et al., 2013).
While these methods have produced exhaustive resources which are broadly used by applications, their accuracy is limited. Specifically, the first approach may extract antonyms, which also have similar argument distributions (e.g. [a] 0 raise to [a] 1 / [a] 0 fall to [a] 1 ), while the second may conflate multiple senses of the foreign phrase.
A third approach was proposed to harvest paraphrases from multiple mentions of the same event in news articles. This approach assumes that various redundant reports make different lexical choices to describe the same event. Although there has been some work following this approach (e.g. Shinyama et al., 2002; Shinyama and Sekine, 2006; Roth and Frank, 2012; Zhang and Weld, 2013), it was investigated less exhaustively and did not result in paraphrase resources.
In this paper we present a novel unsupervised method for the ever-growing extraction of lexically-divergent predicate paraphrase pairs from news tweets. We apply our methodology to create a resource of predicate paraphrases, exemplified in Table 1.
Analysis of the resource obtained after ten weeks of acquisition shows that the set of paraphrases reaches an accuracy of 60-86% at different levels of support. Comparison to existing resources shows that, even though our resource is still orders of magnitude smaller than existing resources, it complements them with non-consecutive predicates (e.g. take [a] 0 from [a] 1 ) and paraphrases which are highly context-specific. The resource and the source code are available at http://github.com/vered1986/Chirps. As of the end of May 2017, it contains 456,221 predicate pairs in 1,239,463 different contexts. Our resource is ever-growing and is expected to contain around 2 million predicate paraphrases within a year. Until it reaches a large enough size, we will release a daily update; at a later stage, we plan to release periodic updates.

Existing Paraphrase Resources
A prominent approach to acquire predicate paraphrases is to compare the distribution of their arguments across a corpus, as an extension to the distributional hypothesis (Harris, 1954). DIRT (Lin and Pantel, 2001) is a resource of 10 million paraphrases, in which the similarity between predicate pairs is estimated by the geometric mean of the similarities of their argument slots. Berant (2012) constructed an entailment graph of distributionally similar predicates by enforcing transitivity constraints and applying global optimization, releasing 52 million directional entailment rules (e.g. [a] 0 shoot [a] 1 → [a] 0 kill [a] 1 ).
A second notable source for extracting paraphrases is multiple translations of the same text (Barzilay and McKeown, 2001).
The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) is a huge collection of paraphrases extracted from bilingual parallel corpora. Paraphrases are scored heuristically, and the database is available for download in six increasingly large sizes according to score (the smallest size being the most accurate). In addition to lexical paraphrases, PPDB also contains 140 million syntactic paraphrases, some of which include predicates with non-terminals as arguments.

Using Multiple Event Descriptions
Another line of work extracts paraphrases from redundant comparable news articles (e.g. Shinyama et al., 2002; Barzilay and Lee, 2003). The assumption is that multiple news articles describing the same event make different lexical choices, providing a good source for paraphrases. Heuristics such as lexical overlap and identical publication dates are applied to recognize that two news articles discuss the same event (Shinyama and Sekine, 2006). Given such a pair of articles, predicates connecting the same arguments are likely to be paraphrases, as in the following example:

1. GOP lawmakers introduce new health care plan
2. GOP lawmakers unveil new health care plan

Zhang and Weld (2013) and Zhang et al. (2015) introduced methods that leverage parallel news streams to cluster predicates by meaning, using temporal constraints. Since this approach acquires paraphrases from descriptions of the same event, it is potentially more accurate than methods that acquire paraphrases from an entire corpus or a translation phrase table. However, no paraphrase resource has been released based on this approach. Finally, Xu et al. (2014) developed a supervised model to collect sentential paraphrases from Twitter. They used Twitter's trending topic service, and considered two tweets from the same topic as paraphrases if they shared a single anchor word.

Resource Construction
We present a methodology to automatically collect binary verbal predicate paraphrases from Twitter. We first obtain news-related tweets (§3.1), from which we extract propositions (§3.2). For a candidate pair of propositions, we assume that if both arguments can be matched, then the predicates are likely paraphrases (§3.3). Finally, we rank the predicate pairs according to the number of instances in which they were aligned (§3.4).

Obtaining News Headlines
We use Twitter as a source of readily available news headlines. The 140-character limit makes tweets concise, informative, and independent of each other, obviating the need to resolve document-level entity coreference. We query the Twitter Search API using Twitter's news filter, which retrieves tweets containing links to news websites, and limit the search to English tweets.
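As a rough illustration of this collection step, the sketch below queries the v1.1 Search API with the `filter:news` operator. The endpoint, operator, and parameter values are assumptions based on the historical Twitter API, not the authors' exact setup:

```python
import json
import urllib.parse
import urllib.request

# Assumed historical v1.1 endpoint; current API versions differ.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_news_search(bearer_token):
    """Return (url, headers) for an English news-tweet search."""
    params = {
        "q": "filter:news",        # only tweets linking to news websites
        "lang": "en",              # restrict to English tweets
        "result_type": "recent",
        "count": "100",
    }
    url = SEARCH_URL + "?" + urllib.parse.urlencode(params)
    headers = {"Authorization": "Bearer " + bearer_token}
    return url, headers

def fetch_news_tweets(bearer_token):
    """Fetch one page of news tweets (requires valid credentials)."""
    url, headers = build_news_search(bearer_token)
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("statuses", [])
```

In a daily collection setting, `fetch_news_tweets` would be called on a schedule and its results deduplicated by tweet ID before proposition extraction.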

Proposition Extraction
We extract propositions from news tweets using PropS, which simplifies dependency trees by conveniently marking a wide range of predicates (e.g., verbal, adjectival, non-lexical) and positioning them as direct heads of their corresponding arguments. Specifically, we run PropS over dependency trees predicted by spaCy (https://spacy.io) and extract predicate types (as in Table 1) composed of verbal predicates, datives, prepositions, and auxiliaries.
Finally, we employ a pre-trained argument reduction model to remove non-restrictive argument modifications. This is essential for our subsequent alignment step, as short and concise phrases tend to match more frequently than longer, more specific arguments. Figure 1 exemplifies some of the phenomena handled by this process, along with the automatically predicted output.
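PropS and the argument reduction model are not reproduced here; the toy extractor below, which reads binary propositions (subject, verb plus attached particles/prepositions, object) off a pre-annotated dependency parse, only illustrates the general idea. The `Token` class and the dependency-label choices are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Token:
    i: int      # position in the sentence
    text: str
    dep: str    # dependency label, e.g. "nsubj", "dobj", "prt"
    head: int   # index of the syntactic head
    pos: str    # coarse part of speech, e.g. "VERB"

def extract_binary_propositions(tokens):
    """Extract (arg0, predicate, arg1) triples for verbal predicates.

    A verb with both a subject and an object yields one binary
    proposition; particles and prepositions attached to the verb are
    folded into the predicate (cf. non-consecutive predicates like
    "take ... from").
    """
    props = []
    for tok in tokens:
        if tok.pos != "VERB":
            continue
        subj = [t for t in tokens if t.head == tok.i and t.dep == "nsubj"]
        obj = [t for t in tokens if t.head == tok.i and t.dep in ("dobj", "obj")]
        prt = [t for t in tokens if t.head == tok.i and t.dep in ("prt", "prep")]
        if subj and obj:
            pred = " ".join([tok.text] + [p.text for p in sorted(prt, key=lambda t: t.i)])
            props.append((subj[0].text, pred, obj[0].text))
    return props
```

A real pipeline would take `tokens` from a parser such as spaCy and then apply argument reduction to the extracted argument spans.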

Generating Paraphrase Instances
Following the assumption that different descriptions of the same event are bound to be redundant (as discussed in Section 2.2), we consider two predicates as paraphrases if: (1) they appear on the same day, and (2) each of their arguments aligns with a unique argument of the other predicate, either by strict matching (short edit distance, abbreviations, etc.) or by looser matching (partial token matching or WordNet synonyms). Table 2 shows examples of predicate paraphrase instances in the resource.
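A minimal sketch of the argument alignment test, assuming a Levenshtein threshold for strict matching and token overlap for loose matching (the threshold value is an assumption, and the abbreviation and WordNet synonym checks are omitted):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]

def strict_match(a, b, max_dist=2):
    # the distance threshold is an assumption, not the paper's exact value
    return edit_distance(a.lower(), b.lower()) <= max_dist

def loose_match(a, b):
    # partial token overlap; a WordNet synonym lookup would also go here
    return bool(set(a.lower().split()) & set(b.lower().split()))

def arguments_align(args1, args2):
    """Each argument must align with a unique argument of the other
    predicate, either in the same order or crossed."""
    def m(x, y):
        return strict_match(x, y) or loose_match(x, y)
    straight = m(args1[0], args2[0]) and m(args1[1], args2[1])
    crossed = m(args1[0], args2[1]) and m(args1[1], args2[0])
    return straight or crossed
```

Two same-day propositions whose arguments pass `arguments_align` would then contribute one paraphrase instance for their predicate pair.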

Resource Release
The resource release consists of two files:

1. Instances: the specific contexts in which the predicates are paraphrases (as in Table 2). In practice, to comply with Twitter policy, we release predicate paraphrase pair types along with their arguments and tweet IDs, and provide a script for downloading the full texts.

2. Types: predicate paraphrase pair types (as in Table 1). The types are ranked in descending order by a heuristic accuracy score based on count, the number of instances in which the predicate types were aligned (Section 3.3); d, the number of different days on which they were aligned; and N, the number of days since resource collection began.
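The ranking step can be sketched as follows; note that the particular combination of count, d, and N shown here (count weighted by the fraction of distinct alignment days) is a plausible stand-in, not the paper's published formula:

```python
def heuristic_score(count, days_aligned, days_total):
    """Score a predicate pair type by its support.

    Assumption: more instances and more distinct alignment days raise
    the score, normalized by the length of the collection period so
    that scores stay comparable as the resource grows.
    """
    return count * days_aligned / days_total

def rank_types(types, days_total):
    """types: dict mapping (p1, p2) -> (count, days_aligned).
    Returns pairs sorted by descending score."""
    return sorted(types,
                  key=lambda p: heuristic_score(*types[p], days_total),
                  reverse=True)
```

Under this stand-in, a pair aligned on many distinct days outranks one with the same instance count concentrated in a single day, which matches the stated motivation for including d.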
Taking into account the number of different days on which predicates were aligned reduces the noise caused by two entities that undergo two different actions on the same day, as happened, for example, with tweets published on the day of Chuck Berry's death.

Analysis of Resource Quality
We estimate the quality of the resource obtained after ten weeks of collection by annotating a sample of the extracted paraphrases. The annotation task was carried out on Amazon Mechanical Turk. To ensure the quality of workers, we applied a qualification test and required a 99% approval rate for at least 1,000 prior tasks. We assigned each annotation to three workers and used the majority vote to determine the correctness of paraphrases.
We followed an approach similar to instance-based evaluation (Szpektor et al., 2007), and let workers judge the correctness of a predicate pair (e.g. [a] 0 purchase [a] 1 / [a] 0 acquire [a] 1 ) through 5 different instances (e.g. Intel purchased Mobileye / Intel acquired Mobileye). We considered a type as correct if at least one of its instance pairs was judged as correct. The rationale behind this type of evaluation is that predicate pairs are difficult to judge out of context.
Unlike Szpektor et al. (2007), we used the instances in which the paraphrases originally appeared, as those are available in the resource.
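The decision rule described above (majority vote per instance, type correct if any instance passes) can be sketched as:

```python
from collections import Counter

def majority_vote(labels):
    """Majority label among an odd number of worker judgments."""
    return Counter(labels).most_common(1)[0][0]

def type_is_correct(instance_annotations):
    """A paraphrase type counts as correct if a majority of workers
    judged at least one of its sampled instances correct.

    instance_annotations: one list of boolean judgments per instance,
    e.g. [[True, True, False], [False, False, False]].
    """
    return any(majority_vote(labels) for labels in instance_annotations)
```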

Quality of Extractions and Ranking
To evaluate the resource accuracy, and following the instance-based evaluation scheme, we only considered paraphrases that occurred in at least 5 instances (which currently constitute 10% of the paraphrase types). We partition the types into four increasingly large bins according to their scores (the smallest bin being the most accurate), similarly to PPDB (Ganitkevitch et al., 2013), and annotate a sample of 50 types from each bin. Figure 2(a) shows that the frequent types achieve up to 86% accuracy.
The accuracy expectedly increases with the score, except for the lowest-score bin ((0, 10]), which is more accurate than the next one ((10, 20]). At the current stage of the resource there is a long tail of paraphrases that appeared only a few times. While many of them are incorrect, there are also true paraphrases that are infrequent and therefore receive a low accuracy score. We expect that some of these paraphrases will occur again in the future and their accuracy scores will increase.

Size and Accuracy Over Time
To estimate future usefulness, Figure 2(b) plots the resource size (in terms of types and instances) and estimated accuracy at each of the first 10 weeks of collection.
The accuracy at a specific time was estimated by annotating a sample of 50 predicate pair types with an accuracy score ≥ 20 in the resource obtained at that time, which roughly corresponds to the top-ranked 1.5% of types. Figure 2(b) demonstrates that these types maintain an accuracy of around 80%. The resource growth rate (i.e. the number of new types) is expected to change with time. We predict that the resource will contain around 2 million types within one year.

Comparison to Existing Resources
The resources most similar to ours are that of Berant (2012), a resource of predicate entailments, and PPDB (Pavlick et al., 2015), a resource of paraphrases, both described in Section 2.
We expect our resource to be more accurate than resources based on the distributional approach (Berant, 2012; Lin and Pantel, 2001). In addition, in comparison to PPDB, we specialize in binary verbal predicates, and apply an additional phase of proposition extraction, handling phenomena such as non-consecutive particles and minimality of arguments. Berant (2012) evaluated their resource against a dataset of predicate entailments (Zeichner et al., 2012), using a recall-precision curve to show the performance obtained with a range of thresholds on the resource score. This kind of evaluation is less suitable for our resource: first, predicate entailment is directional, causing paraphrases with the wrong entailment direction to be labeled negative in the dataset; second, since our resource is still relatively small, it is unlikely to have sufficient coverage of the dataset at this point. We therefore leave this evaluation to future work.
To demonstrate the added value of our resource, we show that even in its current size, it already contains accurate predicate pairs which are absent from the existing resources. Rather than comparing against labeled data, we use types with score ≥ 50 from our resource (1,778 pairs), which were assessed as accurate (Section 4.2).
We checked whether these predicate pairs are covered by Berant and PPDB. To neutralize directionality, we looked for types in both directions, i.e. for a predicate pair (p1, p2) we searched for both (p1, p2) and (p2, p1). Overall, we found that 67% of these types do not appear in Berant, 62% do not appear in PPDB, and 49% appear in neither. One such pair involves the verb get, in a context where [a] 0 is a person and [a] 1 is the prison term they are about to serve. Given that get has a broad distribution of argument instantiations, this paraphrase and similar ones are less likely to exist in resources that rely on the distribution of arguments across an entire corpus.
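The direction-agnostic coverage check can be sketched as:

```python
def coverage(pairs, resource_pairs):
    """Fraction of predicate pairs found in a resource, checking both
    directions so that entailment directionality does not matter."""
    resource = set(resource_pairs)
    covered = sum(1 for (p1, p2) in pairs
                  if (p1, p2) in resource or (p2, p1) in resource)
    return covered / len(pairs) if pairs else 0.0
```

Running this against each existing resource, and against their union, yields the per-resource and combined coverage figures reported above.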

Conclusion
We presented a new unsupervised method to acquire fairly accurate predicate paraphrases from news tweets discussing the same event. We release a growing resource of predicate paraphrases. Qualitative analysis shows that our resource adds value over existing resources. In the future, when the resource is comparable in size to the existing resources, we plan to evaluate its intrinsic accuracy on annotated test sets, as well as its extrinsic benefits in downstream NLP applications.