Adding Semantics to Data-Driven Paraphrasing

We add an interpretable semantics to the paraphrase database (PPDB). To date, the relationship between phrase pairs in the database has been weakly defined as approximately equivalent. We show that these pairs represent a variety of relations, including directed entailment (little girl/girl) and exclusion (nobody/someone). We automatically assign semantic entailment relations to entries in PPDB using features derived from past work on discovering inference rules from text and semantic taxonomy induction. We demonstrate that our model assigns these relations with high accuracy. In a downstream RTE task, our labels rival relations from WordNet and improve the coverage of a proof-based RTE system by 17%.


Motivation
A basic precursor to language understanding is the ability to recognize when two expressions mean the same thing. Handling different expressions of the same information is the central problem addressed by paraphrasing and by the closely related task of recognizing textual entailment (RTE). In RTE, a system is given two pieces of text, often called the text (T) and the hypothesis (H), and asked to determine whether T entails H, T contradicts H, or T and H are unrelated (Figure 1). In contrast, data-driven paraphrasing typically sidesteps developing a clear definition of "meaning the same thing" and instead "assume[s] paraphrasing is a coherent notion and concentrate[s] on devices that can produce paraphrases" (Barzilay, 2003). Recent work on paraphrase extraction has resulted in enormous paraphrase collections (Lin and Pantel, 2001; Dolan et al., 2004; Ganitkevitch et al., 2013), but the usefulness of these collections is limited by the fast-and-loose treatment of the meaning of paraphrases.

Figure 1: An example sentence pair for the RTE task. T: Riots in Denmark were sparked by 12 editorial cartoons that were offensive to Muhammad. H: Twelve illustrations insulting the prophet caused unrest in Jordan. In order for a system to conclude that the premise (top) does not entail the hypothesis (bottom), it should recognize that sparked implies caused but that in Denmark precludes in Jordan. These phrase-level entailment relationships are modeled by natural logic.
One concrete definition that is sometimes used for paraphrases requires that they be bidirectionally entailing (Androutsopoulos and Malakasiotis, 2010). That is, in terms of RTE, it is assumed that if P is a paraphrase of Q, then P entails Q and Q entails P. In reality, paraphrases are often more nuanced (Bhagat and Hovy, 2013), and the entries in most paraphrase resources certainly do not match this definition. For instance, Lin and Pantel (2001) extracted 12 million "inference rules" from monolingual text by exploiting shared dependency contexts. Their method learns paraphrases that are truly meaning-equivalent, but it just as readily learns contradictory pairs such as ⟨X rises, X falls⟩. Ganitkevitch et al. (2013) extract over 150 million paraphrase rules by pivoting through foreign translations. This bilingual method often learns hypernym/hyponym pairs, e.g. due to variation in the discourse structure of translations (Callison-Burch, 2007), and unrelated pairs, e.g. due to misalignments or polysemy in the foreign language.

The unclear semantics severely limits the applicability of paraphrase resources to natural language understanding (NLU) tasks. Some efforts have been made to identify the directionality of paraphrases (Bhagat et al., 2007; Kotlerman et al., 2010), but tasks like RTE require even richer semantic information. For example, in the T/H pair shown in Figure 1, a system needs information not only about equivalent words (12/twelve) and asymmetric entailments (riots/unrest), but also semantic exclusion (Denmark/Jordan). Such lexical entailment relations are captured by natural logic, a formalism which views natural language itself as a meaning representation, eschewing external representations such as first-order logic (FOL). This makes it a great fit for automatically extracted paraphrases, since the phrase pairs themselves can serve as the semantic representation with minimal additional annotation. But as is, paraphrase resources lack such annotation.
As a result, NLU systems rely on manually built resources like WordNet, which are limited in coverage and often lead to incorrect inferences (Kaplan and Schubert, 2001). In fact, in the most recent RTE challenge, over half of the submitted systems used WordNet (Pontiki et al., 2014). Even the NatLog system (MacCartney and Manning, 2007), which popularized natural logic for RTE, relied on WordNet and did not solve the problem of assigning natural logic relations at scale.
The main contributions of this paper are:

• We add a concrete, interpretable semantics to the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013), the largest paraphrase resource currently available. We give each entry in the database a label describing the entailment relationship between the phrases.

• We develop a statistical model to predict these relations. The enormous size of PPDB (over 77 million phrase pairs) makes it impossible to perform this task manually. Our wide range of monolingual and bilingual features results in high intrinsic accuracy.

• We demonstrate improvements to a proof-based RTE system, showing that our automatic labels increase the number of proofs that it is able to find by 17%, while maintaining the same accuracy as when using gold-standard, manual labels.

Related Work
Lexical entailment resources Approaches to paraphrase identification have exploited signal from distributional contexts (Lin and Pantel, 2001; Szpektor et al., 2004), comparable corpora (Dolan et al., 2004; Xu et al., 2014), and graph structures (Berant et al., 2011; Brockett et al., 2013). These approaches are scalable, but they often assume that all relations are equivalence relations (Madnani and Dorr, 2010). Several efforts have attempted to build or augment lexical ontologies automatically, in order to discover other types of lexical relations, such as hypernyms. Most of these approaches rely on lexico-syntactic patterns. Hearst (1992) searched for hand-written patterns (e.g. "an X is a Y") in a large corpus in order to learn taxonomic relations between nouns. Snow et al. (2006) used dependency parses to automatically learn such patterns, which they used to augment WordNet with new hypernym relations. Similar monolingual signals have been used to learn fine-grained relationships between verbs, such as enablement and happens-before (Chklovski and Pantel, 2004; Hashimoto et al., 2009).
Figure 2: Distribution of entailment relations in different sizes of PPDB. Distributions are estimated from our manual annotations of randomly sampled pairs. PPDB-XXXL contains over 77 million paraphrase pairs (where the majority type is independent), compared to only 700K in PPDB-S (where the majority type is equivalent).

Recognizing Textual Entailment The shared RTE tasks (Dagan et al., 2006) have been a springboard for research in natural language inference,
using data motivated by applications to information retrieval, information extraction, summarization, machine translation evaluation, and more recently, question answering (Giampiccolo et al., 2007) and essay grading (Clark et al., 2013). RTE systems vary considerably in their choice of representation and inference procedure. In the most recent shared task on RTE, some systems used deep logical representations of text, allowing them to invoke theorem provers (Bjerva et al., 2014) or Markov Logic Networks (Beltagy et al., 2014) to perform the inference, while others used shallower representations, relying on machine learning to perform inference (Lai and Hockenmaier, 2014; Zhao et al., 2014). Systems based on natural logic (MacCartney and Manning, 2007) use natural language itself as the representation, but still perform inference using a structured algebra rather than a statistical model. Regardless of the inference procedure, improvements to external lexical resources can improve RTE systems across the board (Clark et al., 2007).

The Paraphrase Database (PPDB)
PPDB is currently the largest available collection of paraphrases. Compared to other paraphrase resources such as the DIRT database (12 million rules) (Lin and Pantel, 2001) and the MSR paraphrase phrase table (13 million rules) (Dolan et al., 2004), PPDB contains over 150 million paraphrase rules covering three paraphrase types: lexical (single word), phrasal (multiword), and syntactic restructuring rules. We focus on the lexical and phrasal paraphrases, of which there are over 77 million rules. Of these, a large fraction are true paraphrases, either equivalent (distant/remote) or asymmetrically entailing (girl/little girl), but many are not: PPDB contains some pairs which are related by semantic exclusion (nobody/someone), some which are related by something other than entailment (swim/water), and some which are simply unrelated (car/family). Table 1 gives examples of pairs in PPDB falling into each of these categories. PPDB is released in six sizes (S, M, L, XL, XXL, and XXXL), which fall roughly on a continuum from highest precision and lowest recall to lowest precision and highest recall. Figure 2 shows how the distribution of entailment relations differs across the sizes of PPDB. Our goal is to make these relations explicit by providing annotations for each phrase pair. Because of the enormous scale of PPDB, this annotation must be done automatically.
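For readers who want to work directly with the raw release: PPDB rules are distributed as plain text with |||-delimited fields. The sketch below is a minimal loader under the assumed field layout of the PPDB 1.0 release (verify the exact columns against the release README); it is not the authors' code.

```python
# Minimal sketch for reading lexical/phrasal rules from a PPDB release file.
# Assumed field layout (an assumption; check the release README):
#   LHS ||| phrase ||| paraphrase ||| features ||| alignment
def read_ppdb_rules(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 5:
                continue  # skip malformed lines
            lhs, phrase, paraphrase, feats, alignment = fields[:5]
            # Features are space-separated key=value pairs, e.g. p(e|f)=1.2
            features = dict(kv.split("=", 1) for kv in feats.split() if "=" in kv)
            yield phrase, paraphrase, features

# Example: keep only short (<= 3 word) pairs, as this paper does.
# rules = [(p1, p2) for p1, p2, _ in read_ppdb_rules("ppdb-1.0-s-lexical")
#          if len(p1.split()) <= 3 and len(p2.split()) <= 3]
```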

Selection of Paraphrases
In this paper we focus on paraphrase pairs from PPDB that occur in RTE data. We use the recent SICK dataset from the 2014 SemEval RTE challenge (Marelli et al., 2014) for our experiments. The data consists of 10K sentence pairs split roughly evenly into training and testing sets. The sentence pairs are labeled using a 3-way entailment classification: ENTAILMENT (29%), CONTRADICTION (15%), or NEUTRAL (56%). We consider all phrase pairs ⟨p1, p2⟩ from PPDB up to three words in length such that there is some T/H sentence pair in which p1 appears in T and p2 appears in H. Roughly 55% of the word types and 5% of the phrase (bigram and trigram) types in the SICK data appear in PPDB. This gives us a list of 9,600 pairs, half from the training sentences, which we use for development in Section 6, and half from the test sentences, which we use for evaluation in Section 7.

Figure 3: Summary of the features used by our classifier.
Lexical: We use the lemmas, POS tags, and phrase lengths of p1 and p2, the substrings shared by p1 and p2, and the Levenshtein, Jaccard, and Hamming distances between p1 and p2.
Distributional: Given dependency context vectors for p1 and p2, we compute the number of shared contexts, and the Jaccard, Cosine, Lin1998, Weeds2004, Clarke2009, and Szpektor2008 similarities between the vectors.
Paraphrase: We include 33 paraphrase features distributed with PPDB, which include the paraphrase probabilities as computed in Bannard and Callison-Burch (2005). We refer the reader to Ganitkevitch and Callison-Burch (2014) for a complete description of all of the features included with PPDB.
Translation: We include the number of foreign-language "pivots" (translations) shared by p1 and p2 for each of the 24 languages used in the construction of PPDB, as a fraction of the total number of translations observed for each of p1 and p2.
Path: We include a sparse vector of all lexico-syntactic patterns (paths through a dependency parse) which are observed between p1 and p2 in the Annotated Gigaword corpus (Napoles et al., 2012).
WordNet: We include binary features indicating whether WordNet classifies p1 and p2 according to any of the following relations: synonym, hypernym, hyponym, antonym, holonym, meronym, cause, entailment, derivationally-related, similar-to, also-see, or attribute.
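To make the string-distance features in the Lexical group concrete, here is a small self-contained sketch. These are standard implementations for illustration, not the authors' feature-extraction code.

```python
# Illustrative implementations of the string-distance features described above.
def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    # Jaccard distance over the sets of tokens.
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def hamming(a, b):
    # Hamming distance over character positions, padded to equal length.
    n = max(len(a), len(b))
    return sum(ca != cb for ca, cb in zip(a.ljust(n), b.ljust(n)))

print(levenshtein("little girl", "girl"))  # 7
print(jaccard("little girl", "girl"))      # 0.5
print(hamming("girl", "girls"))            # 1
```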
The SICK data has a relatively small vocabulary, with 86% of word types and <1% of the phrase types covered by WordNet. Still, over half of the words in SICK which are covered by PPDB do not appear in WordNet. In general, PPDB covers a much larger vocabulary (1.6 million words) than does WordNet (155K words), and we expect the potential benefit of using PPDB in addition to, or in place of, WordNet to be larger on datasets with richer vocabularies.

Entailment Relations
We use the relations from Bill MacCartney's thesis on natural language inference as the basis for our categorization of relations (MacCartney, 2009). He outlines seven basic entailment relations:

• Equivalence (P ≡ Q): P and Q are mutually entailing (couch/sofa).
• Forward entailment (P ⊏ Q): P is strictly more specific than Q (crow/bird).
• Reverse entailment (P ⊐ Q): P is strictly more general than Q (bird/crow).
• Negation (P ^ Q): P and Q are exhaustive and mutually exclusive (able/unable).
• Alternation (P | Q): P and Q are mutually exclusive but not exhaustive (cat/dog).
• Cover (P ⌣ Q): P and Q are exhaustive but not mutually exclusive (animal/non-cat).
• Independence (P # Q): all other cases (hungry/hippo).
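In MacCartney's formulation, each relation has a crisp set-theoretic definition over the denotations of P and Q (with U the universe of discourse). The summary below is a transcription of those definitions from the thesis, typeset in LaTeX:

```latex
% Set-theoretic semantics of the seven basic entailment relations
% (P, Q are the denotations of the two expressions; U is the universe).
\begin{align*}
P \equiv Q          &\iff P = Q                                              && \text{equivalence}\\
P \sqsubset Q       &\iff P \subset Q                                        && \text{forward entailment}\\
P \sqsupset Q       &\iff P \supset Q                                        && \text{reverse entailment}\\
P \mathbin{\wedge} Q &\iff P \cap Q = \emptyset \text{ and } P \cup Q = U    && \text{negation}\\
P \mid Q            &\iff P \cap Q = \emptyset \text{ and } P \cup Q \neq U  && \text{alternation}\\
P \smile Q          &\iff P \cap Q \neq \emptyset \text{ and } P \cup Q = U  && \text{cover}\\
P \mathbin{\#} Q    &\iff \text{otherwise}                                   && \text{independence}
\end{align*}
```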
These relations are based on the theory of natural logic, meaning they are defined between pairs of natural language expressions rather than requiring an external formal representation. This makes them an ideal fit for the phrase pairs in PPDB and similar automatically constructed paraphrase resources.
Table 2: Column 1 gives the semantics of each label under MacCartney's natural logic. Column 2 gives the notation we use throughout the remainder of this paper. Column 3 gives the description that was shown to Turkers.

Nat. Log. | This work | MTurk description
≡ | ≡ | X is the same as Y
⊏ | ⊏ | X is more specific than / is a type of Y
⊐ | ⊐ | X is more general than / encompasses Y
^ | ¬ | X is the opposite of Y
| | ¬ | X is mutually exclusive with Y
# | ∼ | X is related in some other way to Y
# | # | X is not related to Y
Annotation We use Amazon Mechanical Turk (MTurk) to collect labels for our phrase pairs. We asked workers to choose between the options shown in Table 2, which represent a modified version of MacCartney's relations. We replace negation (^) with the weaker notion of "opposites," effectively merging it with the alternation (|) relation; we split the independent (#) class into two cases: truly independent phrases and phrases which are related by something other than entailment (which we denote ∼). We omit the cover (⌣) relation entirely, as its practicality is not obvious. We show each pair to 5 workers, taking the majority label as truth. Each HIT included two control questions taken from WordNet. Workers achieved good accuracy on our controls (82% overall) and moderate levels of agreement (Fleiss's κ = 0.56) (Landis and Koch, 1977). For a fuller discussion of the annotation, refer to the supplementary material.
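For reference, Fleiss's κ can be computed from the raw annotation counts as follows. This is a standard textbook implementation, not the authors' analysis code; the input matrix is items × categories, with each row summing to the number of annotators.

```python
# Standard Fleiss's kappa: rows are items, columns are label categories,
# cell (i, j) counts how many of the n annotators gave item i label j.
def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])               # annotators per item (5 in our setup)
    n_total = n_items * n_raters
    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / n_total
           for j in range(len(counts[0]))]
    # Per-item observed agreement.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items              # mean observed agreement
    p_e = sum(p * p for p in p_j)           # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Two toy items, 5 annotators, 3 categories:
print(fleiss_kappa([[5, 0, 0], [2, 2, 1]]))  # ~0.13
```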

Automatic Classification
We aim to build a classifier to automatically assign entailment types to entries in PPDB, and to demonstrate that it performs well both intrinsically and extrinsically. We fix the direction of the ⊏ and ⊐ relations to create a single class, and we train a logistic regression classifier to distinguish between the 5 classes {#, ≡, ⊐, ¬, ∼}. We compute a variety of basic lexical features and WordNet features (summarized in Figure 3). We categorize the remaining features into two broad groups: monolingual features, which are based on observed usage in the Annotated Gigaword corpus (Napoles et al., 2012), and bilingual features, which are based on translation probabilities observed in bilingual parallel corpora. Full descriptions of all the features used are provided in the supplementary material.
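As a minimal sketch of this setup, the snippet below trains a 5-way logistic regression with scikit-learn. The feature extractor is a stand-in for the full feature set of Figure 3, and the toy training pairs are purely illustrative; none of this is the authors' actual code.

```python
# Minimal sketch of the 5-way entailment classifier. extract_features is a
# stand-in for the lexical/distributional/paraphrase/translation/path/WordNet
# features described in Figure 3.
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = ["≡", "⊐", "¬", "∼", "#"]

def extract_features(p1, p2):
    w1, w2 = set(p1.split()), set(p2.split())
    return [len(p1), len(p2),
            len(w1 & w2) / len(w1 | w2),       # token Jaccard similarity
            float(p1 in p2), float(p2 in p1)]  # substring containment

# Toy training pairs with gold label indices into LABELS.
pairs = [("couch", "sofa", 0), ("girl", "little girl", 1),
         ("nobody", "someone", 2), ("swim", "water", 3), ("car", "family", 4)]
X = np.array([extract_features(p1, p2) for p1, p2, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(LABELS[clf.predict([extract_features("boy", "little boy")])[0]])
```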

Monolingual features
Path features Snow et al. (2004) used lexico-syntactic patterns to mine taxonomic relations (hypernyms and hyponyms) between noun pairs. They were able to verify the earlier work of Hearst (1992), which found that certain patterns, e.g. X and other Y, are strong indicators of hypernymy. Using similar path features, we learn new patterns to differentiate between more subtle relations. For example, we learn that the pattern separate X from Y is highly indicative of the ¬ relation. We learn that the pattern X including Y suggests ⊐ more than it suggests ≡, whereas the pattern X known as Y suggests ≡ more than ⊐. Table 4 gives examples of some of the paths most indicative of the ¬ relation.

Table 4: Examples of lexico-syntactic paths indicative of the ¬ relation, with an instance of each.

Pattern | Example
in X and in Y | in foods and in beverages
separate X from Y | separate the old from the young
to X and/or to Y | to the left or to the right
from X to Y | from 7 a.m. to 10 p.m.
more/less X than Y | more harm than good
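To illustrate how such paths can be extracted, the sketch below finds the dependency path between two words in a sentence using spaCy. This is our choice of parser for illustration only; the paper's features come from the Annotated Gigaword preprocessing, and the exact path encoding here is an assumption.

```python
# Sketch: extract the dependency path between two target words, in the spirit
# of Snow et al. (2004). Uses spaCy (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_path(sentence, w1, w2):
    doc = nlp(sentence)
    t1 = next(t for t in doc if t.text == w1)
    t2 = next(t for t in doc if t.text == w2)
    # Each token plus its chain of ancestors up to the root.
    a1 = [t1] + list(t1.ancestors)
    a2 = [t2] + list(t2.ancestors)
    shared = next(t for t in a1 if t in a2)   # lowest common ancestor
    up = [t.dep_ for t in a1[:a1.index(shared)]]
    down = [t.dep_ for t in a2[:a2.index(shared)]]
    return "<".join(up) + ":" + shared.lemma_ + ":" + ">".join(reversed(down))

# e.g. dep_path("They separate the old from the young.", "old", "young")
# yields a pattern anchored on the lemma "separate" (details depend on the parser).
```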
Distributional features Lin and Pantel (2001) attempted to mine inference rules from text by finding paths in a dependency tree which connect the same nouns. The intuition is that good paraphrases should tend to modify, and be modified by, the same words. Given context vectors, Lin and Pantel (2001) used a symmetric similarity metric (Lin, 1998) to find candidate paraphrases. We build dependency context vectors for each word in our data and compute both symmetric and more recently proposed asymmetric similarity measures (Weeds et al., 2004; Szpektor and Dagan, 2008; Clarke, 2009), which are potentially better suited for identifying ⊐ paraphrases. Table 3 gives a comparison of the pairs which are considered "most similar" according to several of these metrics.
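The contrast between symmetric and asymmetric similarity can be seen in a toy implementation. The asymmetric function below is in the spirit of the precision-based measure of Weeds et al. (2004); the context vectors are fabricated for illustration.

```python
# Sketch of symmetric vs. asymmetric distributional similarity over sparse
# context-count vectors (dicts mapping context -> count). Illustrative only.
import math

def cosine(u, v):
    # Symmetric: high when u and v occur in mostly the same contexts.
    dot = sum(u[c] * v.get(c, 0) for c in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def weeds_precision(u, v):
    # Asymmetric: how much of u's weight falls in contexts shared with v.
    # High weeds_precision(narrow, broad) suggests narrow ⊏ broad.
    shared = sum(w for c, w in u.items() if c in v)
    return shared / sum(u.values())

girl = {"the_X_smiled": 8, "little_X": 5, "X_laughed": 4}
little_girl = {"the_X_smiled": 2, "X_laughed": 1}
print(cosine(girl, little_girl))
print(weeds_precision(little_girl, girl))  # 1.0: contexts fully nested
print(weeds_precision(girl, little_girl)) # lower: girl has extra contexts
```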

Bilingual features
We explore a variety of bilingual features, which we expect to provide complementary signal to the monolingual features. Each pair in PPDB is associated with several paraphrase probabilities, which are based on the probabilities of aligning each word to the foreign "pivot" phrase (a foreign translation shared by the two phrases), computed as described in Bannard and Callison-Burch (2005). We also compute the total number of shared foreign translations for each phrase pair. Table 3 shows the highest-ranked pairs by this bilingual similarity score, in comparison to several of the monolingual scores.

Table 5 shows an ablation analysis. The bilingual features are especially important for distinguishing the ≡ class, and the path and WordNet features are important for the ¬ class. The lexical features show strong performance across the board; this is often because they capture negation words (e.g. no) and substring features (little boy ⊏ boy). Table 3 shines some light onto the differences between monolingual and bilingual similarities. While the monolingual asymmetric metrics are good for identifying ⊐ pairs, the symmetric metrics consistently identify ¬ pairs; none of the monolingual scores we explored were effective in making the subtle distinction between ≡ pairs and the other types of paraphrase. In contrast, the bilingual similarity metric is fairly precise for identifying ≡ pairs, but provides less information for distinguishing between types of non-equivalent paraphrase. These differences are further exhibited in the confusion matrices shown in Figure 4; when the classifier is trained using only monolingual features, it misclassifies 26% of ¬ pairs as ≡, whereas the bilingual features make this error only 6% of the time. On the other hand, the bilingual features completely fail to predict the ⊐ class, calling over 80% of such pairs ≡ or ∼.
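As a concrete sketch of the shared-pivot signal described above: for each pivot language, the feature is the fraction of each phrase's observed translations that the other phrase shares. The function and the toy translation sets below are illustrative, not PPDB's actual counts.

```python
# Sketch of the shared-translation ("pivot") feature for one pivot language.
def pivot_overlap(trans1, trans2):
    # trans1, trans2: sets of foreign translations observed for p1 and p2.
    shared = trans1 & trans2
    return len(shared) / len(trans1), len(shared) / len(trans2)

# Toy German translation sets for an equivalent pair:
p1_translations = {"Couch", "Sofa"}
p2_translations = {"Sofa", "Couch", "Liege"}
print(pivot_overlap(p1_translations, p2_translations))  # (1.0, 0.67): evidence for ≡
```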

Intrinsic Evaluation
We test the performance of our classifier intrinsically, through its ability to reproduce the human labels for the phrase pairs from the SICK test sentences. Table 7 shows the precision and recall achieved by the classifier for each of our 5 entailment classes. The classifier achieves an overall accuracy of 79%, reaching >70% precision while maintaining good levels of recall on all classes. Figure 4 shows the classifier's confusion matrix, and Table 6 shows some examples of common and interesting error cases. The majority of errors (26%) come from confusing the ∼ class with the # class. This mistake is not too concerning from an RTE perspective, since ∼ can be treated as a special case of # (Section 5). There are very few cases in which the classifier makes extreme errors, e.g. confusing ≡ with ¬ or with #; some interesting examples of such errors arise when the phrases contain pronouns (e.g. girl ≡ she) or when the relation relies on a highly infrequent word sense (e.g. photo ≡ still).

The Nutcracker RTE System
To further test our classifier, we evaluate the usefulness of the automatic entailment predictions in a downstream RTE task. We run our experiments using Nutcracker, a state-of-the-art RTE system based on formal semantics (Bjerva et al., 2014). In the SemEval 2014 RTE challenge, this system performed in the top 5 of the more than 20 participating systems (Marelli et al., 2014). Given a text/hypothesis (T/H) pair, Nutcracker (NC) uses the Boxer parser (Bos, 2008) to produce a formal semantic representation of both T and H, which it translates into standard first-order logic. The logical formulae are passed to an off-the-shelf theorem prover, which searches for a logical entailment, and to a model builder, which attempts to find a logical contradiction. By default, when the system fails to find a proof of either entailment or inconsistency, it predicts the most frequent class (in our case, NEUTRAL). Therefore, NC relies heavily on lexical entailment resources to improve the recall of the theorem prover and model builder.

Figure 8: Per-class precision and recall for each source of axioms. Baselines are in gray, this work in blue, human references in gold. PPDB-XL refers to a run in which every pair which appears in PPDB is assumed to be equivalent. PPDB-H refers to a run in which manual labels were used to generate axioms. PPDB+ refers to runs in which the automatic classifications were used to generate axioms. In some cases, better proof coverage causes NC to find incorrect proofs, illustrated by the decreased performance on CONTRADICTION when using PPDB-H. For example, using PPDB-H, NC finds an inconsistency for the pair Someone is not playing piano. / A person is playing a keyboard. Using PPDB+, in which piano/keyboard is falsely classified as #, NC fails to find a proof and so correctly guesses NEUTRAL.
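Schematically, NC's decision procedure reduces to a few lines. The sketch below is a reconstruction from the description above, not the system's actual code; the prover and model builder are abstracted as caller-supplied callables.

```python
# Schematic of Nutcracker's decision procedure (a reconstruction, not the
# actual implementation). proves(premises, goal) wraps the theorem prover;
# inconsistent(formulas) wraps the model builder.
def classify(t_fol, h_fol, axioms, proves, inconsistent):
    if proves(axioms + [t_fol], h_fol):
        return "ENTAILMENT"        # a logical proof of H from T was found
    if inconsistent(axioms + [t_fol, h_fol]):
        return "CONTRADICTION"     # T and H cannot both hold given the axioms
    return "NEUTRAL"               # no proof either way: guess the majority class
```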
Baselines The most frequent class baseline is achieved by labeling every sentence pair as NEUTRAL, and results in an accuracy of 56%. A stronger baseline is obtained by running NC alone, without any external axioms; in this case, words are only equivalent if they are lemma-identical.
As an additional baseline, we generate a "basic" PPDB-XL knowledge base (KB), which consists exclusively of axioms expressing synonym relationships. That is, for every pair of phrases ⟨p1, p2⟩ in PPDB-XL, the PPDB-XL KB contains the equivalence axiom syn(p1, p2). We also generate a WordNet (WN) KB containing axioms derived from the WordNet relations. To build our main KB, we run our classifier over every pair in both directions, and we choose whichever direction and relation receives the highest confidence score to be the final prediction. We refer to this set of automatically-predicted axioms as PPDB+.

Table 8: Nutcracker's overall system accuracy and proof coverage when using different sources of axioms. Coverage is measured as the percent of sentence pairs for which NC's theorem prover or model builder is able to find a complete logical proof of either entailment or contradiction. When NC fails to find either type of proof, it guesses the most frequent class, NEUTRAL. NC alone uses no axioms. PPDB+ refers to the axioms generated automatically using the classifier described in this paper. PPDB-H refers to axioms generated using the human labels on which the classifier was trained.

Table 9: Examples of T/H pairs on which predictions differed depending on the source of axioms, e.g. Someone is playing a piano. / There is no one playing a piano. and There is no man pouring oil into a pan. / A man is pouring oil into a skillet.
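A sketch of how such a KB can be assembled from classifier output is given below. Here predict_relation is a hypothetical wrapper around our classifier, and the axiom predicates other than syn are illustrative stand-ins, not Nutcracker's actual axiom syntax.

```python
# Sketch of building the PPDB+ knowledge base. predict_relation(p1, p2) is a
# hypothetical wrapper returning (label, confidence) from our classifier.
AXIOM_TEMPLATES = {
    "equivalent": "syn({0}, {1})",   # bidirectional entailment
    "entails":    "isa({0}, {1})",   # directed entailment: {0} entails {1}
    "exclusive":  "excl({0}, {1})",  # semantic exclusion
}

def build_kb(pairs, predict_relation):
    kb = []
    for p1, p2 in pairs:
        # Run the classifier in both directions; keep the more confident call.
        lab_f, conf_f = predict_relation(p1, p2)
        lab_b, conf_b = predict_relation(p2, p1)
        lab, a, b = (lab_f, p1, p2) if conf_f >= conf_b else (lab_b, p2, p1)
        if lab in AXIOM_TEMPLATES:   # independent/other-related yield no axiom
            kb.append(AXIOM_TEMPLATES[lab].format(a, b))
    return kb
```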
To calibrate our improvements, we also generate a KB using the human labels collected from MTurk, which we refer to as PPDB-Human or PPDB-H. Table 8 reports NC's overall prediction accuracy and the number of proofs found when using each of the described KBs. Figure 8 shows the performance in terms of the precision and recall achieved for each of the three entailment classes: ENTAILMENT, CONTRADICTION, and NEUTRAL. Table 9 provides some examples of T/H pairs on which predictions differed using the PPDB+ KB compared to the WN KB, and Figure 9 shows some illustrative misclassifications.

Results
Our automatic labels result in a 4% improvement in accuracy over the baseline of using NC alone (Figure 8), and a 15-point improvement in F1 measure for the entailment class (Table 8). By all performance measures, PPDB+ also outperforms WordNet as a source of axioms for NC. Moreover, adding PPDB+ to WordNet gives a 17% relative increase in the number of proofs found compared to using WordNet alone (Table 8). These additional proofs lead NC to make a greater number of correct predictions for the "right reasons" (i.e. finding a proof or contradiction) rather than by lucky guessing (recall that NC guesses the most frequent class when it cannot find a proof).
For comparison, we run the same experiments using a KB of oracle human labels in place of the predicted labels in PPDB+. Using PPDB+, NC comes very close to the performance achieved when using PPDB-Human, demonstrating that the automatically generated PPDB+ provides as much utility to the end-to-end system as does a gold-standard resource.

Data Release
Upon publication, we are releasing a new version of PPDB fully annotated with semantic relations. We are also releasing the set of 14K manually labeled phrase pairs occurring in RTE data, as well as our software for extracting features and running the classifier, so that researchers can apply our model to their own paraphrase collections. This will constitute the largest lexical entailment resource available, while also offering the new fine-grained annotation necessary for challenging NLU tasks. An evaluation of the predicted relations appearing in the entire Paraphrase Database (not just those occurring in RTE data) is given in the supplementary material.

Conclusion
We argue that a significant failing of recent work on data-driven paraphrasing is the weak definition of paraphrases as being more-or-less equivalent. In this paper, we show how a clear concept of semantics can be applied to large-scale paraphrase resources. In particular, the entailment relations given by natural logic are a great fit for paraphrase resources, since natural logic operates on pairs of natural language expressions (like the entries in PPDB). By classifying paraphrase entries with entailment relations, we provide them with an interpretable semantics. Our classifier uses extensive feature sets to scale natural logic to the enormous number of phrase pairs in PPDB. We rigorously evaluate our model, demonstrating high accuracy on an intrinsic task. On an extrinsic RTE task, our model's predictions allow an RTE system to find 17% more proofs and achieve a higher overall accuracy than when using WordNet's manual relations. Our new release of PPDB, annotated with semantic entailments, will dramatically improve PPDB's utility for NLU tasks.