POLY: Mining Relational Paraphrases from Multilingual Sentences

Language resources that systematically organize paraphrases for binary relations are of great value for various NLP tasks and have recently been advanced in projects like PATTY, WiseNet and DEFIE. This paper presents a new method for building such a resource and the resource itself, called POLY. Starting with a very large collection of multilingual sentences parsed into triples of phrases, our method clusters relational phrases using probabilistic measures. We judiciously leverage fine-grained semantic typing of relational arguments for identifying synonymous phrases. The evaluation of POLY shows significant improvements in precision and recall over the prior works on PATTY and DEFIE. An extrinsic use case demonstrates the benefits of POLY for question answering.


Introduction
Motivation. Information extraction from text typically yields relational triples: a binary relation along with its two arguments. Often the relation is expressed by a verb phrase, and the two arguments are named entities. We refer to the surface form of the relation in a triple as a relational phrase. Repositories of relational phrases are an asset for a variety of tasks, including information extraction, textual entailment, and question answering. This paper presents a new method for systematically organizing a large set of such phrases. We aim to construct equivalence classes of synonymous phrases, analogously to how WordNet organizes unary predicates as noun-centric synsets (a.k.a. semantic types). For example, the following relational phrases should be in the same equivalence class: sings in, is vocalist in, and voice in, all denoting a relation between a musician and a song.
State of the Art and its Limitations. Starting with the seminal work on DIRT (Lin and Pantel, 2001), there have been various attempts at building comprehensive resources for relational phrases. Recent works include PATTY (Nakashole et al., 2012), WiseNet (Moro and Navigli, 2012) and DEFIE (Bovi et al., 2015). Out of these, DEFIE is the cleanest resource. However, its equivalence classes tend to be small, prioritizing precision over recall. On the other hand, PPDB (Ganitkevitch et al., 2013) offers the largest repository of paraphrases. However, its paraphrases are not relation-centric and they are not semantically typed. It thus misses out on the opportunity of using types to distinguish identical phrases with different semantics, for example, performance in with argument types musician and song versus performance in with types athlete and competition.
Our Approach. We start with a large collection of relational triples, obtained by shallow information extraction. Specifically, we use the collection of Faruqui and Kumar (2015), obtained by combining the OLLIE tool with Google Translate and projecting multilingual sentences back to English. Note that the task addressed in that work is relational triple extraction, which is orthogonal to our problem of organizing the relational phrases in these triples into synonymy sets.
We canonicalize the subject and object arguments of triples by applying named entity disambiguation and word sense disambiguation wherever possible. Using a knowledge base of entity types, we can then infer prevalent type signatures for relational phrases. Finally, based on a suite of judiciously devised probabilistic distance measures, we cluster phrases in a type-compatible way using a graph-cut technique. The resulting repository contains ca. 1 million relational phrases, organized into ca. 160,000 clusters.
Contribution. Our salient contributions are: i) a novel method for constructing a large repository of relational phrases, based on judicious clustering and type filtering; ii) a new linguistic resource, coined POLY, of relational phrases with semantic typing, organized in equivalence classes; iii) an intrinsic evaluation of POLY, demonstrating its high quality in comparison to PATTY and DEFIE; iv) an extrinsic evaluation of POLY, demonstrating its benefits for question answering. The POLY resource is publicly available at www.mpi-inf.mpg.de/yago-naga/poly/.

Method Overview
Our approach consists of two stages: relational phrase typing and relational phrase clustering. In Section 3, we explain how we infer semantic types of the arguments of a relational phrase. In Section 4, we present the model for computing synonyms of relational phrases (i.e., paraphrases) and organizing them into clusters.
A major asset for our approach is a large corpus of multilingual sentences from the work of Faruqui and Kumar (2015). That dataset contains sentences from Wikipedia articles in many languages. Each sentence has been processed by an Open Information Extraction method (Banko et al., 2007), specifically the OLLIE tool (Mausam et al., 2012), which produces a triple of surface phrases that correspond to a relational phrase candidate and its two arguments (subject and object). Each non-English sentence has been translated into English using Google Translate, thus leveraging the rich statistics that Google has obtained from all kinds of parallel multilingual texts. Altogether, the data from Faruqui and Kumar (2015) provides 135 million triples in 61 languages and in English (from the translations of the corresponding sentences). This is the noisy input to our method. Figure 1 shows two Spanish sentences, the extracted triples of Spanish phrases, the sentences' translations to English, and the extracted triples of English phrases.
The figure shows that identical phrases in the foreign language ("fue filmado por") may be translated into different English phrases, "was shot by" vs. "was filmed by", depending on the context in the respective sentences. This is the main insight that our method builds on. The two resulting English phrases have a certain likelihood of being paraphrases of the same relation. However, this is only an uncertain hypothesis, given the ambiguity of language, the noise induced by machine translation and the potential errors of the triple extraction. Therefore, our method needs to de-noise these input phrases and quantify to what extent the relational phrases are indeed synonymous. We discuss this in Sections 3 and 4.
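The pivoting idea can be illustrated with a minimal sketch: English relational phrases that share a foreign source phrase become candidate paraphrase pairs. The input format (a list of foreign/English phrase pairs) and the function name candidate_pairs are assumptions made for illustration, not part of the POLY implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical aligned extractions: (foreign relational phrase, English relational phrase).
aligned = [
    ("fue filmado por", "was shot by"),
    ("fue filmado por", "was filmed by"),
    ("fue filmado por", "was filmed by"),
    ("fue escrito por", "was written by"),
]

def candidate_pairs(aligned_phrases):
    """Pair up distinct English phrases that were translated from the same foreign phrase."""
    by_source = defaultdict(set)
    for foreign, english in aligned_phrases:
        by_source[foreign].add(english)
    pairs = set()
    for english_set in by_source.values():
        # Sorting gives each unordered pair a single canonical representation.
        for e1, e2 in combinations(sorted(english_set), 2):
            pairs.add((e1, e2))
    return pairs

pairs = candidate_pairs(aligned)
print(pairs)
```

Such pairs are only hypotheses; whether they are kept depends on the typing and similarity steps described in the following sections.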

Relation Typing
This section explains how we assign semantic types to relational phrases. For example, the relational phrase wrote could be typed as <author> wrote <paper>, as one candidate. The typing helps us to disambiguate the meaning of the relational phrase and later find correct synonyms. The relational phrase shot could have synonyms directed or killed with a gun. However, they represent different senses of the phrase shot. With semantic typing, we can separate these two meanings and determine that <person> shot <person> is a synonym of <person> killed with a gun <person>, whereas <director> shot <movie> is a synonym of <director> directed <movie>.
Relation typing has the following steps: argument extraction, argument disambiguation, argument typing and type filtering. The output is a set of candidate types for the left and right arguments of each English relational phrase.

Argument Extraction
For the typing of a relational phrase, we have to determine words in the left and right arguments that give cues for semantic types. To this end, we identify named entities, whose types can be looked up in a knowledge base, and the head words of common noun phrases. As output, we produce a ranked list of entity mentions and common nouns.
Figure 1: Multilingual input sentences and triples
To create this ranking, we perform POS tagging and noun phrase chunking using Stanford CoreNLP (Manning et al., 2014) and Apache OpenNLP (opennlp.apache.org/). For head noun extraction, we use the YAGO Javatools (mpi-inf.mpg.de/yago-naga/javatools/) and a set of manually crafted regular expressions. Since the input sentences result from machine translation and are often ungrammatical, we could not use dependency parsing.
Finally, we extract all noun phrases which contain the same head noun. These noun phrases are then sorted according to their lengths.
For example, for input phrase contemporary British director who also created "Inception", our method would yield contemporary British director, British director, director in decreasing order.
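The ranking step can be sketched as follows, assuming the head noun position has already been determined (the paper uses the YAGO Javatools and regular expressions for that; here it is simply passed in). The function name ranked_noun_phrases and the token-list input are illustrative assumptions.

```python
def ranked_noun_phrases(tokens, head_index):
    """Return all noun phrases that end at the head noun, longest first.

    tokens: tokens of the argument phrase.
    head_index: position of the head noun (assumed to be known).
    """
    chunk = tokens[: head_index + 1]
    # Every suffix of the chunk keeps the head noun and drops leading modifiers.
    candidates = [" ".join(chunk[start:]) for start in range(len(chunk))]
    return sorted(candidates, key=lambda phrase: len(phrase.split()), reverse=True)

tokens = ["contemporary", "British", "director", "who", "also", "created", "Inception"]
ranking = ranked_noun_phrases(tokens, head_index=2)
print(ranking)
```

Trying the longest phrase first favors the most specific sense that can be found in the lexicon, with the bare head noun as the fallback.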

Argument Disambiguation
The second step is responsible for the disambiguation of the noun phrase and named entity candidates. We use the YAGO3 knowledge base (Mahdisoltani et al., 2015) for named entities, and WordNet (Fellbaum, 1998) for noun phrases. We proceed in the ranking order of the phrases from the first step.
Candidate senses are looked up in YAGO3 and WordNet, respectively, and each candidate is scored. The scores are based on:
• Frequency count prior: This is the number of Wikipedia incoming links for named entities in YAGO3, or the frequency count of noun phrase senses in WordNet.
• Wikipedia prior: We increase scores of YAGO3 entities whose URL strings (i.e., Wikipedia titles) occur in the Wikipedia page from which the triple was extracted.
• Translation prior: We boost the scores of senses whose translations occur in the original input sentence. For example, the word stage is disambiguated as opera stage rather than phase, because the original German sentence contains the word Bühne (German word for a concert stage) and not Phase. The translations of word senses are obtained from Universal WordNet (de Melo and Weikum, 2009).
We prefer WordNet noun phrases over YAGO3 named entities since noun phrases have lower type ambiguity (fewer possible types). The final score of a sense s is:
$$\mathrm{score}(s) = \alpha \cdot \mathrm{freq}(s) + \beta \cdot \mathrm{wiki}(s) + \gamma \cdot \mathrm{trans}(s) \tag{1}$$
where freq(s) is the frequency count of s, and wiki(s) and trans(s) equal the maximal frequency count if the Wikipedia prior and Translation prior conditions hold, respectively (and are set to 0 otherwise). α, β, γ are tunable hyper-parameters (set using withheld data). Finally, from the list of candidates, we generate a disambiguated argument: either a WordNet synset or a YAGO3 entity identifier.
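A minimal sketch of the sense scoring, assuming the three priors are combined as a weighted sum with weights α, β, γ. The frequency counts, the default weights of 1.0 and the function name sense_score are hypothetical values chosen only to make the example concrete.

```python
def sense_score(freq, max_freq, wiki_holds, trans_holds,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted combination of the three priors (weights are placeholders).

    wiki and trans take the maximal frequency count among the candidates
    when their condition holds, and 0 otherwise.
    """
    wiki = max_freq if wiki_holds else 0
    trans = max_freq if trans_holds else 0
    return alpha * freq + beta * wiki + gamma * trans

# Hypothetical candidate senses for the word "stage" with made-up frequency counts.
candidates = {
    "stage (theatre)": {"freq": 30, "wiki": False, "trans": True},   # "Bühne" is in the source sentence
    "stage (phase)": {"freq": 50, "wiki": False, "trans": False},
}
max_freq = max(c["freq"] for c in candidates.values())
scores = {
    sense: sense_score(c["freq"], max_freq, c["wiki"], c["trans"])
    for sense, c in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setting the translation prior outweighs the higher raw frequency of the "phase" sense, which mirrors the Bühne example above.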

Argument Typing
In the third step of relation typing, we assign candidate types to the disambiguated arguments. To this end, we query YAGO3 for semantic types (incl. transitive hypernyms) for a given YAGO3 or WordNet identifier.
The type system used in POLY consists of a subset of the WordNet noun hierarchy. We restrict ourselves to 734 types, chosen semi-automatically as follows. We selected the 1000 most frequent WordNet types in YAGO3 (incl. transitive hypernyms). Redundant and non-informative types were filtered out by the following technique: all types were organized into a directed acyclic graph (DAG), and we removed a type when the frequency count of some of its children was higher than 80% of the parent's count. For example, we removed type trainer since more than 80% of trainers in YAGO3 are also coaches. In addition, we manually removed a few non-informative types (e.g. expressive style).
As output, we obtain lists of semantic types for the two arguments of each relational phrase.
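The redundancy filter for the type system can be sketched as below. The counts are invented for illustration, the DAG is given as a parent-to-children mapping, and the function name prune_redundant_parents is an assumption; only the 80% rule comes from the text.

```python
def prune_redundant_parents(counts, children, threshold=0.8):
    """Drop a type when one of its children covers more than `threshold` of its count."""
    removed = set()
    for parent, kids in children.items():
        if any(counts[kid] > threshold * counts[parent] for kid in kids):
            removed.add(parent)
    return {t for t in counts if t not in removed}

# Hypothetical frequency counts (not taken from YAGO3).
counts = {"person": 1000, "trainer": 100, "coach": 85, "athlete": 400}
children = {"person": ["trainer", "athlete"], "trainer": ["coach"]}

kept = prune_redundant_parents(counts, children)
print(sorted(kept))
```

Here trainer is dropped because coach accounts for more than 80% of its instances, so keeping both types would add little information.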

Type Filtering
In the last step, we filter types one more time. This time we filter candidate types separately for each distinct relational phrase, in order to choose the most suitable specific type signature for each phrase. This choice is made by type tree pruning.
For each relational phrase, we aggregate all types of the left arguments and all types of the right arguments, summing up their frequency counts. This information is organized into a DAG, based on type hypernymy. Then we prune types as follows (similarly to Section 3.3): i) remove a parent type when the relative frequency count of one of the children types is larger than 80% of the parent's count; ii) remove a child type when its relative frequency count is smaller than 20% of the parent's count.
For each of the two arguments of the relational phrase we allow only those types which are left after the pruning. The final output is a set of relational phrases where each has a set of likely type signatures (i.e., pairs of types for the relation's arguments).
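The two pruning rules can be sketched for one argument slot of one relational phrase. The sketch assumes that both rules are evaluated against the original counts in a single pass (the text does not say whether pruning is iterative), and the counts and the name filter_types are hypothetical.

```python
def filter_types(counts, edges, high=0.8, low=0.2):
    """Apply the two pruning rules over (child, parent) hypernymy edges.

    i)  remove the parent if a child has more than `high` of the parent's count;
    ii) remove the child if it has less than `low` of the parent's count.
    """
    removed = set()
    for child, parent in edges:
        ratio = counts[child] / counts[parent]
        if ratio > high:
            removed.add(parent)
        elif ratio < low:
            removed.add(child)
    return set(counts) - removed

# Hypothetical type counts for the left argument of the phrase "shot".
counts = {"person": 100, "director": 90, "athlete": 5}
edges = [("director", "person"), ("athlete", "person")]

signature_types = filter_types(counts, edges)
print(signature_types)
```

Rule i) pushes the signature towards the most specific type that still covers most arguments, while rule ii) discards rare types that are likely noise.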

Relation Clustering
The second stage of POLY addresses the relation clustering. The algorithm takes semantically typed relational phrases as input, quantifies the semantic similarity between relational phrases, and organizes them into clusters of synonyms. The key insight that our approach hinges on is that synonymous phrases have similar translations in a different language. In our setting, two English phrases are semantically similar if they were translated from the same relational phrases in a foreign language and their argument types agree (see Figure 1 for an example). Similarities between English phrases are cast into edge weights of a graph with phrases as nodes. This graph is then partitioned to obtain clusters.

Probabilistic Similarity Measures
The phrase similarities in POLY are based on probabilistic measures. We use the notation:
• F: a set of relational phrases from a foreign language F
• E: a set of translations of relational phrases from language F to English
• c(f, e): number of times relational phrase f ∈ F is translated into relational phrase e ∈ E
• c(f), c(e): frequency counts for relational phrase f ∈ F and its translation e ∈ E
• p(e|f) = c(f, e) / c(f): (estimator for the) probability of translating f ∈ F into e ∈ E
• p(f|e) = c(f, e) / c(e): (estimator for the) probability of e ∈ E being a translation of f ∈ F
We define:
$$p(e_1 \mid e_2) = \sum_{f \in F} p(e_1 \mid f) \cdot p(f \mid e_2) \tag{2}$$
as the probability of generating relational phrase e_1 ∈ E from phrase e_2 ∈ E. Finally we define:
$$\mathrm{confidence}(e_1, e_2) = \frac{2}{\frac{1}{p(e_1 \mid e_2)} + \frac{1}{p(e_2 \mid e_1)}} \tag{4}$$
Confidence is the final similarity measure used in POLY. We use the harmonic mean in Equation 4 to dampen the similarity of pairs whose two directional probabilities from Equation 2 differ widely. Typically, pairs e_1, e_2 with such wide gaps in their probabilities come from subsumptions, not synonymous phrases. Finally, we compute the support and confidence for every pair of English relational phrases which have a common source phrase of translation. We prune phrase pairs with low support (below a threshold), and rank the remaining pairs by confidence.
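The confidence measure can be sketched from raw translation counts. The sketch assumes the pivot-style estimate p(e1|e2) = Σ_f p(e1|f) · p(f|e2) with p(e|f) = c(f, e)/c(f) and p(f|e) = c(f, e)/c(e); the German phrases, their counts and all function names are hypothetical. Support is not computed here.

```python
from collections import Counter

def count_tables(aligned):
    """Build c(f, e), c(f) and c(e) from (foreign, english) phrase pairs."""
    c_fe = Counter(aligned)
    c_f = Counter(f for f, _ in aligned)
    c_e = Counter(e for _, e in aligned)
    return c_fe, c_f, c_e

def pivot_probability(e1, e2, c_fe, c_f, c_e):
    """p(e1 | e2): sum over foreign phrases f of p(e1 | f) * p(f | e2)."""
    if c_e[e2] == 0:
        return 0.0
    return sum(
        (c_fe[(f, e1)] / c_f[f]) * (c_fe[(f, e2)] / c_e[e2])
        for f in c_f
    )

def confidence(e1, e2, c_fe, c_f, c_e):
    """Harmonic mean of the two directional probabilities."""
    a = pivot_probability(e1, e2, c_fe, c_f, c_e)
    b = pivot_probability(e2, e1, c_fe, c_f, c_e)
    if a == 0 or b == 0:
        return 0.0
    return 2 / (1 / a + 1 / b)

# Hypothetical toy counts.
aligned = (
    [("wurde gedreht von", "was shot by")] * 2
    + [("wurde gedreht von", "was filmed by")] * 2
    + [("wurde erschossen von", "was shot by")] * 2
)
c_fe, c_f, c_e = count_tables(aligned)
conf = confidence("was shot by", "was filmed by", c_fe, c_f, c_e)
print(round(conf, 3))
```

In this toy example the two directions differ (0.5 vs. 0.25) because was shot by also translates a second, unrelated source phrase; the harmonic mean pulls the score towards the smaller value.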

Graph Clustering
To compute clusters of relational phrases, we use modularity-based graph partitioning. Specifically, we use the partitioning algorithm of Blondel et al. (2008). The resulting clusters (i.e., subgraphs) are then ranked by their weighted graph density multiplied by the graph size (Equation 5). An example of a cluster is shown in Table 1.
Table 1: Example of a cluster of relational phrases
Cluster of relational phrases
<location> is the heart of <location>
<location> is situated in <location>
<location> is enclosed by <location>
<location> is located amidst <location>
<location> is surrounded by <location>
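The cluster ranking can be sketched as follows. Since the exact form of Equation 5 is not reproduced here, the sketch assumes that weighted density is the sum of intra-cluster edge weights divided by the number of possible node pairs, and that graph size is the number of nodes. The partitioning itself (Blondel et al., 2008) is not implemented; the clusters, weights and the name cluster_score are hypothetical.

```python
def cluster_score(nodes, weights):
    """Weighted density times size (an assumed reading of the ranking criterion).

    weights: dict mapping frozenset({u, v}) to the edge weight (confidence).
    """
    node_set = set(nodes)
    n = len(node_set)
    if n < 2:
        return 0.0
    total = sum(w for edge, w in weights.items() if edge <= node_set)
    density = total / (n * (n - 1) / 2)
    return density * n

# Hypothetical clusters produced by some graph partitioning step.
weights = {
    frozenset({"is situated in", "is located amidst"}): 0.9,
    frozenset({"is located amidst", "is surrounded by"}): 0.6,
    frozenset({"is situated in", "is surrounded by"}): 0.3,
    frozenset({"was shot by", "was filmed by"}): 0.8,
}
clusters = {
    "A": ["is situated in", "is located amidst", "is surrounded by"],
    "B": ["was shot by", "was filmed by"],
}
ranked = sorted(clusters, key=lambda name: cluster_score(clusters[name], weights), reverse=True)
print(ranked)
```

Multiplying by size keeps small but very dense clusters from always outranking larger clusters that are slightly less tightly connected.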

Evaluation
For the experimental evaluation, we primarily chose triples from the German language (and their English translations). With about 23 million triples, German is the language with the largest number of extractions in the dataset, and there are about 2.5 million distinct relational phrases from the German-to-English translation. The POLY method is implemented using Apache Spark, so it scales out to handle such large inputs. After applying the relation typing algorithm, we obtain around 10 million typed relational phrases. If we ignored the semantic types, we would have about 950,000 distinct phrases. On this input data, POLY detected 1,401,599 pairs of synonyms. The synonyms were organized into 158,725 clusters.
In the following, we present both an intrinsic evaluation and an extrinsic use case. For the intrinsic evaluation, we asked human annotators to judge whether two typed relational phrases are synonymous or not. We also studied source languages other than German. In addition, we compared POLY against PATTY (Nakashole et al., 2012) and DEFIE (Bovi et al., 2015) on the relation paraphrasing task. For the extrinsic evaluation, we considered a simple question answering system and studied to what extent similarities between typed relational phrases can contribute to answering more questions.

Precision of Synonyms
To assess the precision of the discovered synonymy among relational phrases (i.e., clusters of paraphrases), we sampled POLY's output. We assessed the 250 pairs of synonyms with the highest similarity scores. We also assessed a sample of 250 pairs of synonyms, randomly drawn from POLY's output. These pairs of synonyms were shown to several human annotators to check their correctness. Relational phrases were presented by showing the semantic types, the textual representation of the relational phrase and sample sentences where the phrase was found. The annotators were asked whether two relational phrases have the same meaning or not. They could also abstain.
The results of this evaluation are shown in Table 2 with (lower bounds and upper bounds of) the 0.95-confidence Wilson score intervals (Brown et al., 2001). This evaluation task had good inter-annotator agreement, with Fleiss' Kappa around 0.6. Table 3 shows anecdotal examples of synonymous pairs of relational phrases.
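For reference, the Wilson score interval for a sample proportion can be computed as in the following sketch. The input of 200 correct judgments out of 250 is a made-up example, not a result from the paper, and z = 1.96 is the usual normal quantile for a 0.95 confidence level.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical sample: 200 of 250 assessed pairs judged correct.
low, high = wilson_interval(200, 250)
print(round(low, 3), round(high, 3))
```

Unlike the simple normal approximation, the interval is not centered exactly on the observed proportion and stays within [0, 1] even for proportions near the boundaries.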
These results show that POLY's quality is comparable with state-of-the-art baseline resources. WiseNet (Moro and Navigli, 2012) is reported to have precision of 0.85 for 30,000 clusters. This is also the only prior work where the precision of synonymy of semantically typed relational phrases was evaluated. The other systems did not report that measure. However, they evaluated subsumption, entailment or hypernymy relationships, which are related to synonymy. Subsumptions in PATTY have precision of 0.83 for top 100 and 0.75 for a random sample. Hypernyms in RELLY are reported to have precision of 0.87 for top 100 and 0.78 for a random sample. DEFIE performed separate evaluations for hypernyms generated directly from WordNet (precision 0.87) and hypernyms obtained through a substring generalization algorithm (precision 0.9).
Typical errors in the paraphrase discovery of POLY come from incorrect translations or extraction errors. For example, heard and belongs to were clustered together because they were translated from the same semantically ambiguous German word gehört. An example for extraction errors is that took and participated in were clustered together because took was incorrectly extracted from a sentence with the phrase took part in. Other errors are caused by swapped order of arguments in a triple (i.e., mistakes in detecting passive form) and incorrect argument disambiguation.
Table 3: Examples of synonyms of semantically typed relational phrases
Id | Relational phrase | Synonymous relational phrase
1 | <location> is surrounded by <region> | <location> is the heart of <region>
2 | <artifact> is reminiscent of <time period> | <artifact> recalls <time period>
3 | <painter> was a participant in <show> | <painter> has participated in <show>
4 | <group> maintains a partnership with <district> | <group> has partnered with <district>
5 | <movie> was shot at <location> | <movie> was filmed in <location>
6 | <person> was shot by <group> | <person> was shot dead by <group>
7 | <movie> was shot by <film director> | <movie> was directed by <film director>

Comparison to Competitors
To compare POLY with the closest competitors PATTY and DEFIE, we designed an experiment along the lines of the evaluation of Information Retrieval systems (e.g. TREC benchmarks). First, we randomly chose 100 semantically typed relational phrases with at least three words (to focus on the more interesting multi-word case, rather than single verbs). These relational phrases had to occur in all three resources. For every relational phrase we retrieved synonyms from all of the systems, forming a pool of candidates. Next, to remove minor syntactic variations of the same phrase, the relational phrases were lemmatized. In addition, we removed all leading prepositions, modal verbs, and adverbs. We manually evaluated the correctness of the remaining paraphrase candidates for each of the 100 phrases. Precision was computed as the ratio of the number of correct synonyms returned by one system to the number of all synonyms provided by that system. Recall was computed as the ratio of the number of correct synonyms returned by one system to the number of all correct synonyms in the candidate pool from all three systems.
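The pooled evaluation can be sketched as below, with per-phrase precision and recall averaged over all sampled phrases (macro-averaging). The system names sys_a and sys_b, the placeholder synonyms and the convention that an empty denominator yields 0.0 are all assumptions made for illustration.

```python
def macro_precision_recall(outputs, correct, system):
    """Macro-averaged precision and recall of one system over all sampled phrases.

    outputs: {phrase: {system_name: set of returned synonyms}}
    correct: {phrase: set of pooled candidates judged correct}
    """
    precisions, recalls = [], []
    for phrase, by_system in outputs.items():
        returned = by_system.get(system, set())
        hits = len(returned & correct[phrase])
        precisions.append(hits / len(returned) if returned else 0.0)
        recalls.append(hits / len(correct[phrase]) if correct[phrase] else 0.0)
    n = len(outputs)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical pool with two phrases and two systems.
outputs = {
    "phrase 1": {"sys_a": {"x", "y"}, "sys_b": {"z", "w"}},
    "phrase 2": {"sys_a": {"u", "v"}, "sys_b": {"v"}},
}
correct = {"phrase 1": {"x", "y", "z"}, "phrase 2": {"u"}}

p_a, r_a = macro_precision_recall(outputs, correct, "sys_a")
p_b, r_b = macro_precision_recall(outputs, correct, "sys_b")
print(p_a, r_a, p_b, r_b)
```

Because recall is measured against the pooled correct candidates only, it is a relative measure: synonyms that none of the compared systems returns do not count against any of them.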
The results are presented in Table 4. All results are macro-averaged over the 100 sampled phrases. We performed a paired t-test for precision and recall of POLY against each of the systems and obtained p-values below 0.05. POLY and DEFIE of-  However, DEFIE's synonyms often do not fit the semantic type signature of the given relational phrase and are thus incorrect. For example, was assumed by was found to be a synonym of <group> was acquired by <group>. PATTY, on the other hand, has higher recall due to its variety of prepositions attached to relational phrases; however, these also include spurious phrases, leading to lower precision. For example, succeeded in was found to be a synonym of <person> was succeeded by <leader>.
Overall, POLY achieves much higher precision and recall than both of these baselines.

Ablation Study
To evaluate the influence of different components, we performed an ablation study. We consider versions of POLY where Wikipedia prior and Translation prior (Section 3.2) are disregarded (− disambiguation), where the type system (Section 3.3) was limited to the 100 most frequent YAGO types (Type system 100) or to the 5 top-level types from the YAGO hierarchy (Type system 5), or where the type filtering parameter (Section 3.4) was set to 70% or 90% (Type filtering 0.7/0.9). The evaluation was done on random samples of 250 pairs of synonyms. Table 5 shows the results with the 0.95-confidence Wilson score intervals. Without our argument disambiguation techniques, the precision drops heavily. When weakening the type system, our techniques for argument typing and type filtering are penalized, resulting in lower precision. So we see that all components of the POLY architecture are essential for achieving high-quality output. Lowering the type-filtering threshold yields results with comparable precision. However, increasing the threshold results in a worse noise filtering procedure.
Table 5 (excerpt):
Type filtering 0.7 | 0.81 ± 0.05 | 192,117
Type filtering 0.9 | 0.73 ± 0.05 | 2,061,257

Evaluation with Other Languages
In addition to paraphrases derived from German, we evaluated the relational phrase synonymy derived from a few other languages with lower numbers of extractions. We chose French, Hindi, and Russian (cf. Faruqui and Kumar, 2015). The results are presented in Table 6, again with the 0.95-confidence Wilson score intervals.
Synonyms derived from French are of similar quality to those from German. This is plausible, as one would assume that French and German have similar quality in translation to English. Synonyms derived from Russian and Hindi have lower precision due to the lower translation quality. The precision for Hindi is lower, as the Hindi input corpus has far fewer sentences than those of the other languages.

Extrinsic Evaluation: Question Answering
As an extrinsic use case for the POLY resource, we constructed a simple Question Answering (QA) system over knowledge graphs such as Freebase, and determined the number of questions for which the system can find a correct answer. We followed the approach presented by Fader et al. (2014). The system consists of question parsing, query rewriting and database look-up stages. We disregard the stage of ranking answer candidates, and merely test whether the system could return the right answer (i.e., would return with the perfect ranking).
In the question parsing stage, we use 10 high-precision parsing operators by Fader et al. (2014), which map questions (e.g., Who invented papyrus?) to knowledge graph queries (e.g., (?x, invented, papyrus)). Additionally, we map question words to semantic types. For example, the word who is mapped to person, where to location, when to abstract entity and the rest of the question words are mapped to type entity.
We harness synonyms and hyponyms of relational phrases to paraphrase the predicate of the query. The paraphrases must be compatible with the semantic type of the question word. In the end, we use the original query, as well as the paraphrases found, to query a database of subject, predicate, object triples. As the knowledge graph for this experiment we used the union of several collections: a triples database from OpenIE (Fader et al., 2011), Freebase (Bollacker et al., 2008), Probase (Wu et al., 2012) and NELL (Carlson et al., 2010). In total, this knowledge graph contained more than 900 million triples.
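The query rewriting step can be sketched as follows. The mapping from question words to types follows the text; the paraphrase table, the exact-match notion of type compatibility (a real system could also accept hypernyms) and the function names are assumptions for illustration.

```python
QUESTION_WORD_TYPES = {"who": "person", "where": "location", "when": "abstract entity"}

def question_type(question_word):
    """Map a question word to a semantic type; other words default to 'entity'."""
    return QUESTION_WORD_TYPES.get(question_word.lower(), "entity")

def rewrite_queries(question_word, predicate, obj, paraphrases):
    """Return the original query plus type-compatible paraphrased queries.

    paraphrases: {predicate: [(subject_type, paraphrased_predicate), ...]}
    """
    answer_type = question_type(question_word)
    queries = [("?x", predicate, obj)]
    for subject_type, phrase in paraphrases.get(predicate, []):
        if subject_type == answer_type:  # simplistic compatibility check
            queries.append(("?x", phrase, obj))
    return queries

# Hypothetical typed paraphrases for the predicate "invented".
paraphrases = {
    "invented": [("person", "was the inventor of"), ("organization", "developed")],
}
queries = rewrite_queries("Who", "invented", "papyrus", paraphrases)
print(queries)
```

Each resulting query would then be looked up in the triple store; the type check keeps paraphrases whose subject type conflicts with the question word from generating spurious queries.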
We compared six systems for paraphrasing semantically typed relational phrases:
• Basic: no paraphrasing at all, merely using the originally generated query.
• RELLY: using the subset of the PATTY taxonomy with additional entailment relationships between phrases (Grycner et al., 2015).
• POLY DE: using synonyms of relational phrases derived from the German language.
• POLY ALL: using synonyms of relational phrases derived from the 61 languages.
Since DEFIE's relational phrases are represented by BabelNet (Navigli and Ponzetto, 2012) word sense identifiers, we generated all possible lemmas for each identifier.
We ran the paraphrase-enhanced QA system for three benchmark sets of questions:
• TREC: the set of questions used for the evaluation of information retrieval QA systems (Voorhees and Tice, 2000)
• WikiAnswers: a random subset of questions from WikiAnswers (Fader et al., 2013).
From these question sets, we kept only those questions which can be parsed by one of the 10 question parsing templates and have a correct answer in the gold-standard ground truth. In total, we executed 451 questions for TREC, 516 for WikiAnswers and 1979 for WebQuestions. For every question, each paraphrasing system generates a set of answers. We measured for how many questions we could obtain at least one correct answer. Table 7 shows the results.
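The success criterion (at least one correct answer per question) can be written down as a short sketch. The question identifiers, answers and the name answered_questions are hypothetical.

```python
def answered_questions(answers_by_question, gold_by_question):
    """Count questions for which at least one returned answer is in the gold set."""
    return sum(
        1
        for question, gold in gold_by_question.items()
        if answers_by_question.get(question, set()) & gold
    )

# Hypothetical system output and gold answers.
answers = {"q1": {"egyptians", "romans"}, "q2": {"paris"}, "q3": set()}
gold = {"q1": {"egyptians"}, "q2": {"london"}, "q3": {"1969"}}

num_answered = answered_questions(answers, gold)
print(num_answered)
```

This measure ignores answer ranking, which matches the decision above to test only whether the right answer could be returned at all.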
The best results were obtained by POLY ALL. We performed a paired t-test for the results of POLY DE and POLY ALL against all other systems. The differences between POLY ALL and the other systems are statistically significant with p-value below 0.05.
Additionally, we evaluated two combined paraphrasing systems: one that uses all of the described datasets, and one that uses all of them except POLY. The difference between these two versions suggests that POLY contains many paraphrases which are available in none of the competing resources.

Related Work
Knowledge bases (KBs) contribute to many NLP tasks, including Word Sense Disambiguation (Moro et al., 2014), Named Entity Disambiguation (Hoffart et al., 2011), Question Answering (Fader et al., 2014), and Textual Entailment (Sha et al., 2015). Widely used KBs are DBpedia (Lehmann et al., 2015), Freebase (Bollacker et al., 2008), YAGO (Mahdisoltani et al., 2015), Wikidata (Vrandecic and Krötzsch, 2014) and the Google Knowledge Vault (Dong et al., 2014). KBs have rich information about named entities, but are rather sparse on relations. In the latter regard, manually created resources such as WordNet (Fellbaum, 1998), VerbNet (Kipper et al., 2008) or FrameNet (Baker et al., 1998) are much richer, but still face the limitation of labor-intensive input and human curation. The paradigm of Open Information Extraction (OIE) was developed to overcome the weak coverage of relations in automatically constructed KBs. OIE methods process natural language texts to produce triples of surface forms for the arguments and relational phrase of binary relations. The first large-scale approach along these lines, TextRunner (Banko et al., 2007), was later improved by ReVerb (Fader et al., 2011) and OLLIE (Mausam et al., 2012). The focus of these methods has been on verbal phrases as relations, and there is little effort to determine lexical synonymy among them.
The first notable effort to build up a resource for relational paraphrases is DIRT (Lin and Pantel, 2001), based on Harris' Distributional Hypothesis to cluster syntactic patterns. RESOLVER (Yates and Etzioni, 2009) introduced a probabilistic relational model for predicting synonymy. Yao et al. (2012) incorporated latent topic models to resolve the ambiguity of relational phrases. Other probabilistic approaches employed matrix factorization for finding entailments between relations (Riedel et al., 2013; Petroni et al., 2015) or used probabilistic graphical models to find clusters of relations (Grycner et al., 2014). All of these approaches rely on the co-occurrence of the arguments of the relation.
Recent endeavors to construct large repositories of relational paraphrases are PATTY, WiseNet and DEFIE. PATTY (Nakashole et al., 2012) devised a sequence mining algorithm to extract relational phrases with semantic type signatures, and organized them into synonymy sets and hypernymy hierarchies. WiseNet (Moro and Navigli, 2012) tapped Wikipedia categories for a similar way of organizing relational paraphrases. DEFIE (Bovi et al., 2015) went even further and used word sense disambiguation, anchored in WordNet, to group phrases with the same meanings.
Translation models have previously been used for paraphrase detection. Barzilay and McKeown (2001) utilized multiple English translations of the same source text for paraphrase extraction. Bannard and Callison-Burch (2005) used the bilingual pivoting method on parallel corpora for the same task. Similar methods were applied at a much larger scale by the Paraphrase Database (PPDB) project (Pavlick et al., 2015). Unlike POLY, the focus of these projects was not on paraphrases of binary relations. Moreover, POLY considers the semantic type signatures of relations, which are missing in PPDB.
Research on OIE for languages other than English has received little attention. Kim et al. (2011) used Korean-English parallel corpora for cross-lingual projection. Gamallo et al. (2012) developed an OIE system for Spanish and Portuguese using rules over shallow dependency parsing. The recent work of Faruqui and Kumar (2015) extracted relational phrases from Wikipedia in 61 languages using cross-lingual projection. Lewis and Steedman (2013) clustered semantically equivalent English and French phrases, based on the arguments of relations.

Conclusions
We presented POLY, a method for clustering semantically typed English relational phrases using a multilingual corpus, resulting in a repository of semantically typed paraphrases with high coverage and precision. Future work includes jointly processing all 61 languages in the corpus, rather than considering them pairwise, to build a resource for all languages. The POLY resource is publicly available at www.mpi-inf.mpg.de/yago-naga/poly/.