Paraphrasing vs Coreferring: Two Sides of the Same Coin

We study the potential synergy between two different NLP tasks, both confronting predicate lexical variability: identifying predicate paraphrases, and event coreference resolution. First, we used annotations from an event coreference dataset as distant supervision to re-score heuristically-extracted predicate paraphrases. The new scoring gained more than 18 points in average precision over the ranking produced by the original scoring method. Then, we used the same re-ranking features as additional inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model's performance. The results suggest a promising direction for leveraging the data and models of each task to the benefit of the other.


Introduction
Recognizing that mentions of different lexical predicates discuss the same event is challenging (Barhom et al., 2019). Lexical resources such as WordNet (Miller, 1995) capture such synonyms (say, tell) and hypernyms (whisper, talk), as well as antonyms, which can be used to refer to the same event when the arguments are reversed ([a_0] beat [a_1], [a_1] lose to [a_0]). However, WordNet's coverage is insufficient, in particular missing context-specific paraphrases (e.g. (hide, launder) in the context of money). Conversely, distributional methods enjoy broader coverage, but their precision for this purpose is limited, because distributionally similar terms may often be mutually exclusive (born, die) or may refer to different event types that are only temporally or causally related (sentenced, convicted).
Two prominent lines of work pertaining to identifying predicates whose meaning or referents can be matched are cross-document (CD) event coreference resolution and recognizing predicate paraphrases. The former identifies and clusters event mentions, across multiple documents, that refer to the same event within their respective contexts. The latter task, on the other hand, collects pairs of event expressions that, at the generic lexical level, may refer to the same event in certain contexts. Table 1 illustrates this difference with examples of co-referable predicate paraphrases, whose mentions obviously do not always co-refer.

  Tara Reid has checked into ✓ Promises Treatment Center.
  Actress Tara Reid entered ✓ well-known Malibu rehab center.
  Lindsay Lohan checked into ✗ rehab in Malibu, California.

  Director Chris Weitz is expected to direct ✓ New Moon.
  Chris Weitz will take on ✓ the sequel to "Twilight".
  Gary Ross is still in negotiations to direct ✗ the sequel.

Table 1: Examples from ECB+ (a cross-document coreference dataset) that illustrate the context-sensitive nature of event coreference. The illustrated predicates are co-referable, and hence may be used to refer to the same event in certain contexts, but obviously not all their mentions corefer.
Cross-document event coreference resolution systems are typically supervised, usually trained on the ECB+ dataset, which contains clusters of news articles on different topics (Cybulska and Vossen, 2014). Recent systems rely on neural representations of the mentions and their contexts (Kenyon-Dean et al., 2018; Barhom et al., 2019), while earlier approaches leveraged WordNet and other lexical resources to obtain a signal on whether a pair of mentions may be coreferring (e.g. Bejan and Harabagiu, 2010; Yang et al., 2015).
Approaches for acquiring predicate paraphrases, in the form of pairs of paraphrastic predicates or predicate templates, were based mostly on unsupervised signals. These included similarity between argument distributions (Lin and Pantel, 2001; Berant, 2012), backtranslation across languages (Barzilay and McKeown, 2001; Ganitkevitch et al., 2013; Mallinson et al., 2017), or leveraging redundant news reports on the same event, which are likely to refer to the same events and entities using different words (Shinyama et al., 2002; Shinyama and Sekine, 2006; Barzilay and Lee, 2003; Zhang and Weld, 2013; Xu et al., 2014; Shwartz et al., 2017). In some cases, the paraphrase collection phase includes a step of validating a subset of the paraphrases and training a model on these gold paraphrases to re-rank the entire resource (Lan et al., 2017).
In this paper, we study the potential synergy between predicate paraphrases and event coreference resolution. We show that the data and models for one task can benefit the other. In one direction (Section 3), we use event coreference annotations from the ECB+ dataset as distant supervision to learn an improved scoring of predicate paraphrases in the unsupervised Chirps resource (Shwartz et al., 2017). The distantly supervised scorer significantly improves upon ranking by the original Chirps scores, adding 18 points to average precision over a test sample.
In the other direction (Section 4), we incorporate data from Chirps, represented in the Chirps re-scorer feature vector, into a state-of-the-art event coreference system (Barhom et al., 2019). Chirps has a substantial coverage over the ECB+ coreferring mention pairs, and consequently, the incorporation yields a modest but consistent improvement across the various coreference metrics. 1

Background and Motivation
In this section we provide background on the cross-document coreference resolution and paraphrase identification (acquisition) tasks, which is relevant to our approaches for synergizing these two tasks.

Event Coreference Resolution
Event coreference resolution aims to identify and cluster event mentions that, within their respective contexts, refer to the same event. The task has two variants: one in which coreferring mentions occur within the same document (within-document), and another in which coreferring mentions may occur in different documents (cross-document, CD), on which we focus in this paper.
The standard datasets used for CD event coreference training and evaluation are ECB+ (Cybulska and Vossen, 2014), and its predecessors, EECB (Lee et al., 2012) and ECB (Bejan and Harabagiu, 2010). ECB+ contains a set of topics, each containing a set of documents describing the same global event. Both event and entity coreferences are annotated in ECB+, within and across documents.
The current state-of-the-art model of Barhom et al. (2019) iteratively and intermittently learns to cluster events and entities. A mention representation m_i consists of several components, representing both the mention span and its surrounding context. The interdependence between clustering event vs. entity mentions is encoded into the mention representation, such that an event mention representation contains a component reflecting the current entity clustering, and vice versa. Using this representation, the model trains a pairwise mention scoring function that predicts the probability that two mentions refer to the same event.

Paraphrase Identification and Acquisition
Paraphrases are differing textual realizations of the same meaning (Ganitkevitch et al., 2013), typically phrases or sentences (Dolan et al., 2005). A prominent approach for identifying and collecting paraphrases, backtranslation, assumes that if two (say) English phrases translate to the same term in a foreign language, across multiple foreign languages, this indicates that these two phrases are paraphrases. This approach was first suggested by Barzilay and McKeown (2001), later adapted to acquire the large PPDB resource (Ganitkevitch et al., 2013), and was also shown to work well with neural machine translation (Mallinson et al., 2017).
Paraphrase Identification through Event Coreference. An alternative approach for paraphrase identification, on which we focus in this paper, leverages multiple news documents discussing the same event. The underlying assumption is that such redundant texts may refer to the same entities or events using lexically-divergent mentions. Coreferring mentions are identified heuristically and extracted as candidate paraphrases. When long documents are used, the first step in this approach is to align each pair of documents by sentences. This was done by finding sentences with shared named entities (Shinyama et al., 2002) or lexical overlap (Barzilay and Lee, 2003;Shinyama and Sekine, 2006), and by aligning pairs of predicates or arguments (Zhang and Weld, 2013;Recasens et al., 2013). In more recent work, Xu et al. (2014) and Lan et al. (2017) extracted sentential paraphrases from Twitter by heuristically matching pairs of tweets discussing the same topic.
Predicate Paraphrases. In contrast to sentential paraphrases, it is also beneficial to identify differing textual templates with the same meaning. In this paper we focus on binary predicate paraphrases such as ("[a_0] quit from [a_1]", "[a_0] resign from [a_1]").
Earlier approaches for acquiring predicate paraphrases considered a pair of predicate templates as paraphrases if the distributions of their argument instantiations were similar. For instance, in "[a_0] quit from [a_1]", [a_0] would typically be instantiated by people's names and [a_1] by employer organizations or job titles. A paraphrastic template like "[a_0] resign from [a_1]" is hence expected to have similar argument distributions, and can thus be detected by a distributional similarity approach (Lin and Pantel, 2001; Szpektor et al., 2004; Berant, 2012). Yet, as mentioned earlier, predicates with similar argument distributions are not necessarily paraphrastic, which introduces a substantial level of noise when acquiring paraphrase pairs with this approach.
In this paper, we follow the potentially more reliable paraphrase acquisition approach, which tries to heuristically identify concrete co-referring predicate mentions. Identifying such mention pairs, detected as actually being used to refer to the same event, can provide a strong signal for identifying these predicates as paraphrastic (vs. the quite noisy corpus-level signal of distributional similarity). In particular, we utilize the Chirps paraphrase acquisition method and resource, which follows this approach as described next in some detail.
Chirps: a Coreference-Driven Paraphrase Resource. Chirps (Shwartz et al., 2017) is a resource of predicate paraphrases extracted heuristically from Twitter. Chirps aims to recognize coreferring events by relying on the redundancy of news headlines posted on Twitter on the same day. It extracts binary predicate-argument tuples from each tweet and aligns pairs of predicate mentions whose arguments match, by some lexical matching criteria. The matched pairs of arguments are termed supporting pairs. Each predicate paraphrase pair is assigned a score proportional to the number of supporting pair instances in which the two templates were paired (n), as well as the number of different days on which such pairings were found (d), where N is the number of days over which the resource was collected. The Chirps resource provides the scored predicate paraphrases as well as the supporting pairs for each paraphrase.
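As a concrete illustration, the scoring described above can be sketched as follows; the exact functional form is our assumption, since the text only states that the score grows with both n and the day ratio d/N:

```python
def chirps_score(n, d, N):
    """Heuristic Chirps-style paraphrase score (sketch).

    n: number of supporting pair instances in which the two templates were paired
    d: number of distinct days on which such pairings were found
    N: total number of days over which the resource was collected

    NOTE: the exact functional form is an assumption; the text only states
    that the score is proportional to both n and the day ratio d / N.
    """
    return n * (d / N)

# a pair matched 12 times over 25 distinct days of a 100-day collection
print(chirps_score(12, 25, 100))  # 3.0
```

Under this form, a pair aligned many times but only on a single day (a likely one-off mistaken alignment) is penalized relative to a pair aligned across many different days.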
Chirps has acquired more than 5 million distinct paraphrase pairs over the last 3 years. Human evaluation showed that this scoring is effective and that the percentage of correct paraphrases is higher for highly scored paraphrases. At the same time, due to the heuristic collection and scoring of predicate paraphrases in Chirps, entries in the resource may suffer from two types of errors: (1) type 1 error, i.e., the heuristic recognized pairs of non-paraphrastic predicates as paraphrases. This happens when the same arguments participate in multiple different events, as in the following paraphrases: "[Police] 0 arrest [man] 1 " and "[Police] 0 shoot [man] 1 "; and (2) type 2 error, when the scoring function assigned a low score to a rare but correct paraphrase pair, as in "[a 0 ] outgun [a 1 ]" and "[a 0 ] outperform [a 1 ]", for which only a single supporting pair was found.

Chirps*: Leveraging Coreference Information for Paraphrasing
Our goal in this section is to improve paraphrase scoring, in the context of Chirps, by leveraging available information and methods for cross-document event coreference resolution. To that end, we introduce Chirps*, a new supervised scorer for Chirps candidate paraphrases, whose novelties are twofold. First, we extract a richer feature representation for a candidate paraphrase pair (Section 3.1), which is fed into a supervised classifier over the candidates. Second, we collect, semi-automatically, distantly supervised training data for paraphrase classification, derived from the ECB+ cross-document coreference training set by leveraging the close relationship between the two tasks (Section 3.2). Finally, we provide some implementation details (Section 3.3).

Features
As described above, the original heuristic Chirps scorer relied on only a couple of features to score a candidate paraphrase pair. Our goal is to obtain a richer signal about the likelihood that a candidate predicate pair is indeed paraphrastic. To that end, we collect a set of features from the available data, focusing on assessing whether the instances from which the candidate pair was extracted indeed constitute cross-document coreferences.
Each candidate paraphrase pair consists of two predicate templates p1 and p2, accompanied by the n supporting pair instances associated with this predicate paraphrase pair, each consisting of a pair of argument terms: support-pairs(p1, p2) = {(t_1^1, t_2^1), ..., (t_1^n, t_2^n)}. Each tweet included in Chirps links to a news article, whose content we retrieve. When representing a pair of predicate templates, we include both local features (based on a single supporting pair) and global features (based on all supporting pairs). Table 2 presents our 17 features, yielding a feature representation f_{p1,p2} ∈ R^17 for a paraphrase pair, grouped by different sources of information. The first group includes features derived from the statistics provided by the original Chirps resource. The other four sources of information are described in the following paragraphs.
Named Entity Coverage. While the original Chirps method did not utilize the content of the linked article, we find it useful for retrieving more information about the event. Specifically, it might help mitigate errors in Chirps' argument matching mechanism, which relies on argument alignment considering only the text of the two tweets. We found that the original mechanism worked particularly well for named entities, while being more error-prone for common nouns, which might require additional context.
Given (t_1^i, t_2^i) ∈ support-pairs(p1, p2), we use spaCy (Honnibal and Montani, 2017) to extract sets of named entities, NE_1 and NE_2, from the first paragraph of the news article linked from each tweet, respectively. We define a Named Entity Coverage score, NEC, as the maximum ratio of named entity coverage of one article by the other:

NEC = max(|NE_1 ∩ NE_2| / |NE_1|, |NE_1 ∩ NE_2| / |NE_2|)

We manually annotated a small balanced training set of 121 tweet pairs and used it to tune a score threshold T = 0.26, such that pairs of tweets whose NEC is at least T are considered coreferring. Finally, we include the following features: the number of coreferring tweet pairs (whose NEC score exceeds T) and the average NEC score of these pairs.
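The NEC computation and the two derived features can be sketched as follows; the set-overlap formula is our reconstruction of the prose definition, and the toy named-entity sets are illustrative:

```python
def nec(ne1, ne2):
    """Named Entity Coverage of two linked articles: the maximum ratio at
    which the named entities of one article cover those of the other.
    (Sketch reconstructed from the prose definition; the exact formula is
    our assumption.)"""
    ne1, ne2 = set(ne1), set(ne2)
    if not ne1 or not ne2:
        return 0.0
    overlap = len(ne1 & ne2)
    return max(overlap / len(ne1), overlap / len(ne2))

T = 0.26  # threshold tuned on the small manually annotated set

# toy named-entity sets extracted from two pairs of linked articles
pairs = [({"Tara Reid", "Promises"}, {"Tara Reid", "Malibu"}),
         ({"Gary Ross"}, {"Chris Weitz", "Twilight"})]
scores = [nec(a, b) for a, b in pairs]
coreferring = [s for s in scores if s >= T]
# features: number of coreferring tweet pairs and their average NEC score
print(len(coreferring), sum(coreferring) / len(coreferring))  # 1 0.5
```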

Cross-document Coreference Resolution
We apply the state-of-the-art cross-document coreference model from Barhom et al. (2019) to data constructed such that each tweet constitutes a document and each pair of tweets corresponding to t j 1 and t j 2 in support-pairs(p 1 , p 2 ) forms a topic, to be analyzed for coreference. As input for the model, in each tweet, we mark the corresponding predicate span as an event mention and the two argument spans as entity mentions. The model outputs whether the two event mentions corefer (yielding a single event coreference cluster for the two mentions) or not (yielding two singleton clusters). Similarly, it clusters the four arguments to entity coreference clusters.
Differently from Chirps, this model makes its event clustering decision based on the predicate, the arguments, and the context of the full tweet, as opposed to considering the arguments alone. Thus, we expect it not to cluster predicates whose arguments match lexically if their contexts or predicates do not match (first example in Table 3). In addition, the model's mention representations might help identify lexically-divergent yet semantically-similar arguments (second example in Table 3).
For a given pair of tweets, we extract the following binary features with respect to the predicate mentions: Event Perfect, when the predicates are assigned to the same cluster, and Event No Match, when each predicate forms a singleton cluster. For argument mentions, we extract the following features: Entity Perfect, if the two a_0 arguments belong to one cluster and the two a_1 arguments belong to another cluster; Entity Reverse, if at least one of the a_0 arguments is clustered as coreferring with the a_1 argument of the other tweet; and Entity No Match otherwise. In addition, we extract Perfectly Clustered with NE Coverage, which combines named entity coverage with coreference resolution by counting the number of pairs whose events are perfectly clustered and whose NEC score is at least T.

Table 2: The features representing a candidate paraphrase pair, grouped by source of information.

Chirps statistics:
- # Supporting pairs: the total number of supporting pairs of p1 and p2 across the template variants.
- # Days: the total number of days d on which p1 and p2 were matched in Chirps across the template variants.
- # Available supporting pairs: the number of supporting pairs of p1 and p2 across the template variants that were still available to download.
- # Days of available pairs: the total number of days d on which the supporting pairs above occurred in the available tweets.
- Score: the maximal Chirps score across the template variants.

Named Entity Coverage:
- # NEC above threshold: the number of pairs with an NEC score of at least T.
- Average above threshold: the average of the NEC scores of pairs with a score of at least T.

Coreference resolution:
- # Event Perfect: the number of event pairs with a perfect match.
- # Event No Match: the number of event pairs with no match.
- # Entity Perfect: the number of entity pairs with a perfect match.
- # Entity Reverse: the number of entity pairs with a reverse match.
- # Entity No Match: the number of entity pairs with no match.
- # Perfectly Clustered + NEC: the number of pairs with an NEC score of at least T and perfect clustering for event coreference resolution.

Connected components:
- # Connected components: the number of connected components in G_{p1,p2}.
- Average component size: the average size of the connected components in G_{p1,p2}.

Clique:
- # In Clique: the number of pairs in support-pairs(p1, p2) that are in a clique.

Connected Components
The original Chirps score of a predicate paraphrase pair is proportional to two quantities: (1) the number of supporting pairs; and (2) the ratio between the number of days on which supporting pairs were matched and the entire collection period. The latter lowers the score of paraphrase pairs that might have been mistakenly aligned on relatively few days (e.g. due to misleading argument alignments in particular events). The number of days on which the predicates were aligned is taken as a proxy for the number of different events in which the predicates co-refer. Here, we aim at a more reliable partition of tweets into different events by constructing a graph with tweets as nodes and supporting tweet pairs as edges, and looking for connected components.
To that end, we define a bipartite graph G_{p1,p2} = (V, E) for a candidate paraphrase pair, where V = tweets(p1, p2) contains all the tweets in which p1 or p2 appeared, and E = support-pairs(p1, p2). We compute C, the set of connected components in G_{p1,p2}, and define ConComp = {c ∈ C : |c| > 2}, the set of connected components of size greater than 2. From this set we derive two features: #connected(p1, p2) = |ConComp|, the number of such connected components, and avg_connected(p1, p2), their average size. A larger number of connected components indicates that the two predicates were aligned across a large number of likely different events.
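The two connected-component features can be sketched with a plain graph traversal (a minimal illustration with toy tweet identifiers):

```python
from collections import defaultdict

def component_features(tweets, support_pairs):
    """Compute the two connected-component features over G_{p1,p2}:
    nodes are tweets, edges are supporting pairs; only components with
    more than two tweets are kept (sketch with toy tweet identifiers)."""
    adj = defaultdict(set)
    for t1, t2 in support_pairs:
        adj[t1].add(t2)
        adj[t2].add(t1)
    seen, sizes = set(), []
    for start in tweets:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                      # depth-first traversal
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if size > 2:                      # ConComp = {c : |c| > 2}
            sizes.append(size)
    count = len(sizes)
    return count, (sum(sizes) / count if count else 0.0)

# one component of size 3 and one of size 2 (the latter is filtered out)
tweets = ["t1", "t2", "t3", "t4", "t5"]
edges = [("t1", "t2"), ("t2", "t3"), ("t4", "t5")]
print(component_features(tweets, edges))  # (1, 3.0)
```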
Clique. We similarly build a global tweet graph over all the predicate pairs, G_all = (V', E'), where V' = ∪_{(p1,p2)} tweets(p1, p2) and E' = ∪_{(p1,p2)} support-pairs(p1, p2). We compute Q, the set of cliques in G_all of size greater than 2. We assume that a pair of tweets is more likely to be coreferring if it is part of a bigger clique, whereas tweets that were paired by mistake would not share many neighbors. We extract the following clique coverage feature for a candidate paraphrase pair: CLC(p1, p2) = |{(t_1^j, t_2^j) ∈ support-pairs(p1, p2) : ∃q ∈ Q such that t_1^j ∈ q ∧ t_2^j ∈ q}|.
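The CLC feature can be sketched as follows. Since a pair of adjacent tweets belongs to some clique of size greater than 2 exactly when the two tweets share a common neighbor (forming a triangle), the sketch below uses that test instead of enumerating all cliques; tweet identifiers are illustrative:

```python
from collections import defaultdict

def clique_coverage(global_support_pairs, pair_support):
    """CLC(p1, p2): the number of supporting pairs of (p1, p2) whose two
    tweets lie inside a clique of size > 2 in the global tweet graph.
    An adjacent pair is in such a clique iff it has a common neighbor."""
    adj = defaultdict(set)
    for t1, t2 in global_support_pairs:
        adj[t1].add(t2)
        adj[t2].add(t1)
    return sum(1 for t1, t2 in pair_support if adj[t1] & adj[t2])

# t1-t2-t3 form a triangle; t4-t5 is an isolated edge
g_all = [("t1", "t2"), ("t2", "t3"), ("t1", "t3"), ("t4", "t5")]
print(clique_coverage(g_all, [("t1", "t2"), ("t4", "t5")]))  # 1
```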

Distantly Supervised Labels
In order to learn to score the paraphrases, we need gold standard labels, i.e., labels indicating whether a pair of predicate templates collected by Chirps is indeed a paraphrase. Instead of collecting manual annotations for a sample of the Chirps data, we chose a low-budget distant supervision approach.
To that end, we leverage the similarity between the predicate paraphrase extraction and the event coreference resolution tasks, and use the annotations from the ECB+ dataset. Our dataset consists of the predicate paraphrases from Chirps that appear in ECB+ (denoted ch-ECB+). As positive examples we consider all pairs of predicates p 1 , p 2 from Chirps that appear in the same event cluster in ECB+, e.g., from {talk, say, tell, accord to, confirm} we extract (talk, say), (talk, tell), ..., (accord to, confirm).
Obtaining negative examples is a bit trickier. As candidate negative examples, we consider pairs of predicates p1, p2 from Chirps that appear under the same topic, but in different event clusters in ECB+, e.g., given the clusters {specify, reveal, say} and {get}, we extract (specify, get), (reveal, get), and (say, get).
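The extraction of positive and negative candidates from one topic's event clusters can be sketched as follows (toy clusters; the helper name is ours):

```python
from itertools import combinations, product

def extract_label_candidates(topic_clusters):
    """Derive distantly supervised paraphrase labels from the event clusters
    of one ECB+ topic: predicates within a cluster yield positive pairs;
    predicates across clusters of the same topic yield negative candidates.
    (Sketch; the helper name and toy clusters are ours.)"""
    positives, negatives = [], []
    for cluster in topic_clusters:
        positives.extend(combinations(sorted(cluster), 2))
    for c1, c2 in combinations(topic_clusters, 2):
        negatives.extend(product(sorted(c1), sorted(c2)))
    return positives, negatives

pos, neg = extract_label_candidates([{"specify", "reveal", "say"}, {"get"}])
print(len(pos), len(neg))  # 3 3
```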
Note that the ECB+ annotations are context-dependent. Thus, a pair of predicates that are in principle coreferable may be annotated as non-coreferring in ECB+. We therefore manually validated the candidate pairs via crowdsourcing. Following Shwartz et al. (2017), we annotated the templates while presenting 3 argument instantiations from their original tweets. Thus, we only included in the final data predicate pairs with at least 3 supporting pairs. We required that workers have a 99% approval rate on at least 1,000 prior tasks and pass a qualification test. Each example was annotated by 3 workers. We aggregated the per-instantiation annotations using majority vote and considered a pair as positive if at least one instantiation was judged as positive. The data statistics are given in Table 4. This validation phase balanced the positive-negative proportion of instances in the data, from approximately 1:7 to approximately 4:5.

Model
We trained a random forest classifier (Breiman, 2001) implemented in the scikit-learn framework (Pedregosa et al., 2011). To tune the hyperparameters, we ran a 3-fold cross-validation randomized search, yielding the following values: 157 estimators, a max depth of 8, a minimum of 1 sample per leaf, and a minimum of 10 samples per split. 2
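A minimal sketch of this tuning setup, using toy random features in place of the real 17-dimensional vectors (the search ranges and n_iter here are illustrative, not the values used in the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 17)                    # toy stand-ins for the 17-dim features
y = rng.randint(0, 2, 60)               # toy paraphrase / non-paraphrase labels

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": list(range(50, 200)),
        "max_depth": list(range(2, 12)),
        "min_samples_leaf": list(range(1, 5)),
        "min_samples_split": list(range(2, 12)),
    },
    n_iter=5,                           # the real search would use more iterations
    cv=3,                               # 3-fold cross-validation, as in the paper
    random_state=0,
)
search.fit(X, y)
print(sorted(search.best_params_))
```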

Evaluation
We used the model for two purposes: (1) classification: determining whether a pair of predicate templates are paraphrases; and (2) ranking the pairs by the predicted positive class score. We consider the ranking evaluation more informative, as we expect the ranking to reflect the number of contexts in which a pair of predicates may be coreferring. That is, predicate pairs that corefer in many contexts will be ranked higher than those that corefer in just a few contexts. We compare our model with two baselines: the original Chirps scores, and a baseline that assigns each pair of predicates the cosine similarity between the predicates' GloVe embeddings (Pennington et al., 2014). 3 For the classification decisions of the two baseline scorers, we learn a threshold over the training set that yields the best accuracy, above which a pair of predicates is classified as positive.

Table 5 displays the accuracy, precision, recall and F1 scores for the classification evaluation, and the Average Precision (AP) for the ranking evaluation. Our scorer dramatically improves upon the baselines in all metrics.
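For reference, the Average Precision used in the ranking evaluation can be computed as follows (a standard implementation sketch; the example label lists are illustrative):

```python
def average_precision(ranked_labels):
    """Average Precision over a ranked list of gold labels (1 = correct
    paraphrase), used here for the ranking evaluation (standard sketch)."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank       # precision at each correct hit
    positives = sum(ranked_labels)
    return total / positives if positives else 0.0

# a scorer that ranks the true paraphrases higher obtains a higher AP
print(round(average_precision([1, 1, 0, 1, 0]), 3))  # 0.917
print(round(average_precision([0, 1, 0, 1, 1]), 3))  # 0.533
```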
To show that the improved scoring generalizes beyond examples that appear in the ECB+ dataset, we selected a random subset of 500 predicate pairs with at least 6 supporting pairs from the entire Chirps resource and annotated them using the method described in Section 3.2. Evaluated on this subset, the ranker gained 8 points in AP relative to the original Chirps ranking. All results are statistically significant using bootstrap and permutation tests with p < 0.001 (Dror et al., 2018). Table 6 exemplifies highly ranked predicate pairs under our Chirps* scorer, the original Chirps scorer, and the GloVe scorer, illustrating the improved ranking performance of Chirps* (as measured in Table 5 by the AP score).
Ablation Test To evaluate the importance of each type of feature, we perform an ablation test. Table 7 displays the performance of various ablated models, each of which with one set of features (Section 3.1) removed from the representation. In the classification task, removing the named entity coverage features somewhat improved the performance, mostly by increasing the recall. However, in terms of the (primary) ranking evaluation, each set of features contributed to the performance, with the full model performing best.

Leveraging a Paraphrasing Resource to Improve Coreference
In Section 3 we showed that leveraging CD event coreference annotations and a coreference model improves predicate paraphrase ranking. In this section, we show that this co-dependence can be exploited in both directions, and that using Chirps* as an external resource can improve the performance of a CD coreference model. As a preliminary analysis, we computed Chirps' coverage of lexically-divergent pairs of coreferring event mentions in ECB+. We found approximately 30% coverage overall, and above 50% coverage for coreferring verbal mentions. 4 This indicates a substantial coverage of the lexically-divergent positive coreferrability decisions that need to be made in ECB+.

Integration Method
The state-of-the-art CD coreference resolution model by Barhom et al. (2019) trains a pairwise mention scoring function, MLP_scorer(m_i, m_j), which predicts the probability that two mentions m_i, m_j refer to the same event. The mention representation includes a lexical component (GloVe embeddings) as well as a contextual component (ELMo embeddings; Peters et al., 2018). The mention pair representation v_{i,j}, which is fed to the pairwise scorer, combines the two separate mention representations. We extended the model by changing the input of the pairwise event mention scoring function to include information regarding the mention pair from Chirps*, as illustrated in Figure 1. We defined the extended representation as [v_{i,j}; c_{i,j}], where c_{i,j} = MLP_ch(f_{m_i,m_j}) denotes the Chirps* features: f_{m_i,m_j} is the feature vector representing the pair of predicates (m_i, m_j) if there is an entry for it in Chirps, and a zero vector otherwise. MLP_ch is an MLP with a single hidden layer of size 50 and an output layer of size 100, which transforms the discrete values in f_{m_i,m_j} into the same embedding space as v_{i,j}. The rest of the model remains the same, including the model architecture, training, and inference. 5
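A hedged sketch of this integration in PyTorch (the original implementation may differ; pair_dim, the scorer's hidden size, and the toy inputs are our assumptions, while the 17/50/100 dimensions follow the text):

```python
import torch
import torch.nn as nn

class ChirpsAugmentedPairScorer(nn.Module):
    """Sketch of the extended pairwise scorer: the 17-dim Chirps* feature
    vector is embedded by MLP_ch (hidden size 50, output size 100, per the
    text) and concatenated to the mention pair representation v_ij.
    pair_dim, the scorer's hidden size, and the toy inputs are assumptions."""
    def __init__(self, pair_dim):
        super().__init__()
        self.mlp_ch = nn.Sequential(nn.Linear(17, 50), nn.ReLU(),
                                    nn.Linear(50, 100))
        self.scorer = nn.Sequential(nn.Linear(pair_dim + 100, 128), nn.ReLU(),
                                    nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, v_ij, f_ij):
        # f_ij is a zero vector when the predicate pair has no Chirps entry
        c_ij = self.mlp_ch(f_ij)
        return self.scorer(torch.cat([v_ij, c_ij], dim=-1))

scorer = ChirpsAugmentedPairScorer(pair_dim=300)
prob = scorer(torch.zeros(4, 300), torch.zeros(4, 17))
print(prob.shape)  # torch.Size([4, 1])
```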

Evaluation
We evaluate event coreference performance on ECB+ using the official CoNLL scorer (Pradhan et al., 2014).

Figure 1: The extended mention pair representation. One vector is the original representation from Barhom et al. (2019), and the left one is our Chirps* extension, which is transformed through MLP_ch into the same embedding space. The two vectors are concatenated to form the mention pair representation, which is fed to the scoring function MLP_scorer.
We compare the integrated model to the original model and to the lemma baseline, which clusters together mentions that share the same mention-head lemma. The results in Table 8 show that the Chirps-enhanced model improves over the lemma baseline by 3.5 points and slightly improves upon Barhom et al. (2019) in all F1 score measures. The greatest improvement is in the link-based MUC measure, which counts corresponding links between mentions. The Chirps component helps link more coreferring mentions (improving recall) and prevents the linking of some wrong mentions (improving precision).
Although the gap between our model and the original model by Barhom et al. (2019) is statistically significant (bootstrap and permutation tests with p < 0.001), it is rather small. We can attribute it partly to the coverage of Chirps over ECB+ (around 30%), which entails that the majority of event mention pairs still have the same representation as in the original model. We also note that ECB+ suffers from annotation errors, as was observed by Barhom et al. (2019) and others.

Conclusion and Future Work
We studied the synergy between the tasks of identifying predicate paraphrases and event coreference resolution, both concerned with matching the meanings of lexically-divergent predicates, and showed that they can benefit each other. Using event coreference annotations as distant supervision, we learned to re-rank predicate paraphrases that were initially ranked heuristically, substantially increasing their average precision. In the other direction, we incorporated knowledge from our re-ranked predicate paraphrase resource into a model for event coreference resolution, yielding a small improvement upon previous state-of-the-art results. We hope that our study will encourage future research to make further progress on both tasks jointly.