Annotating Relation Inference in Context via Question Answering

We present a new annotation method for collecting data on relation inference in context. We convert the inference task to one of simple factoid question answering, allowing us to easily scale up to 16,000 high-quality examples. Our method corrects a major bias in previous evaluations, making our dataset much more realistic.


Introduction
Recognizing entailment between natural-language relations (predicates) is a key challenge in many semantic tasks. For instance, in question answering (QA), it is often necessary to "bridge the lexical chasm" between the asker's choice of words and those that appear in the answer text. Relation inference can be notoriously difficult to recognize automatically because of semantic phenomena such as polysemy and metaphor:

Q: Which drug treats headaches?
A: Aspirin eliminates headaches.
In this context, "eliminates" implies "treats", and the answer is indeed "aspirin". However, this rule does not always hold: "eliminates patients" has a very different meaning from "treats patients". Hence, context-sensitive methods are required to solve relation inference.
Many methods have tried to address relation inference, from DIRT (Lin and Pantel, 2001) through Sherlock (Schoenmackers et al., 2010) to the more recent work on PPDB (Pavlick et al., 2015b) and RELLY (Grycner et al., 2015). However, the way these methods are evaluated remains largely inconsistent. Some papers that deal with phrasal inference in general (Beltagy et al., 2013; Pavlick et al., 2015a; Kruszewski et al., 2015) use an extrinsic task, such as a recent recognizing textual entailment (RTE) benchmark (Marelli et al., 2014). By nature, extrinsic tasks incorporate a variety of linguistic phenomena, making it harder to analyze the specific issues of relation inference.
The vast majority of papers that do focus on relation inference perform some form of post-hoc evaluation (Lin and Pantel, 2001; Szpektor et al., 2007; Schoenmackers et al., 2010; Weisman et al., 2012; Lewis and Steedman, 2013; Riedel et al., 2013; Rocktäschel et al., 2015; Grycner and Weikum, 2014; Grycner et al., 2015; Pavlick et al., 2015b). Typically, the proposed algorithm generates several inference rules between two relation templates, which are then evaluated manually. Some studies evaluate the rules out of context (is the rule "X eliminates Y" → "X treats Y" true?), while others apply them to textual data and evaluate the validity of the rule in context (given "aspirin eliminates headaches", is "aspirin treats headaches" true?). Not only are these post-hoc evaluations oblivious to recall, but their "human in the loop" approach also makes them expensive and virtually impossible to replicate accurately.
Hence, there is a real need for pre-annotated datasets for intrinsic evaluation of relation inference in context. Zeichner et al. (2012) constructed such a dataset by applying DIRT-trained inference rules to sampled texts, and then crowd-annotating whether each original text (premise) entails the text generated by applying the inference rule (hypothesis). However, this process is biased: by using DIRT to generate examples, the dataset is inherently blind to the many cases where relation inference exists but is not captured by DIRT.
We present a new dataset for evaluating relation inference in context, which is unbiased towards one method or another, and natural to annotate. To create this dataset, we design a QA setting where annotators are presented with a single question and several automatically-retrieved text fragments. The annotators' goal is to mark which of the text fragments provide a potential answer to the question (see Figure 1). Since the entities in the text fragments are aligned with those in the question, this process implicitly annotates which relations entail the one in the question. For example, in Figure 1, if "[US PRESIDENT] increased taxes" provides an answer to "Which US president raised taxes?", then "increased" implies "raised" in that context. Because this task is so easy to annotate, we were able to scale up to 16,371 annotated examples (3,147 positive) with 91.3% precision for only $375 via crowdsourcing.
Finally, we evaluate a collection of existing methods and common practices on our dataset, and observe that even the best combination of methods cannot recall more than 25% of the positive examples without dipping below 80% precision. This places into perspective the huge amount of relevant cases of relation inference inherently ignored by the bias in Zeichner et al. (2012). Moreover, this result shows that while our annotation task is easy for humans, it is difficult for existing algorithms, making it an appealing challenge for future research on relation inference. Our code and data are publicly available.

Relation Inference Datasets
To the best of our knowledge, there are only three pre-annotated datasets for evaluating relation inference in context. Each example in these datasets consists of two binary relations, premise and hypothesis, and a label indicating whether the hypothesis is inferred from the premise. These relations are essentially Open IE (Banko et al., 2007) assertions, and can be represented as (subject, relation, object) tuples.

Berant et al. (2011) annotated inference between typed relations ("[DRUG] eliminates [SYMPTOM]" → "[DRUG] treats [SYMPTOM]"), restricting the definition of "context". They also used the non-standard type system from Schoenmackers et al. (2010), which limits the dataset's applicability to other corpora. Levy et al. (2014) annotated inference between instantiated relations sharing at least one argument ("aspirin eliminates headaches" → "drugs treat headaches"). While this format captures a more natural notion of context, it also conflates the task of relation inference with that of entity inference ("aspirin" → "drug"). Both datasets were annotated by experts.

Zeichner et al. (2012) annotated inference between instantiated relations sharing both arguments:

✓ aspirin eliminates headaches → aspirin treats headaches
✗ aspirin eliminates headaches → aspirin murders headaches

This format provides a broad definition of context on one hand, while isolating the task of relation inference on the other. In addition, methods that can be evaluated on this type of data can also be directly embedded into downstream applications, motivating subsequent work to use it as a benchmark (Melamud et al., 2013; Abend et al., 2014; Lewis, 2014). We therefore create our own dataset in this format.
The main drawback of Zeichner et al.'s process is that it is biased towards a specific relation inference method, DIRT (Lin and Pantel, 2001). Essentially, Zeichner et al. conducted a post-hoc evaluation of DIRT and recorded the results. While their approach does not suffer from the major disadvantages of post-hoc evaluation (cost and irreplicability), it ignores instances that do not behave according to DIRT's assumptions. These invisible examples, which our approach covers, amount to an enormous chunk of the inference performed when answering questions (see §4).

Collection & Annotation Process
Our data collection and annotation process is designed to achieve two goals: (1) to efficiently sample premise-hypothesis pairs in an unbiased manner; (2) to allow for cheap, consistent, and scalable annotations based on an intuitive QA setting.

Methodology Overview
We start by collecting factoid questions. Each question is captured as a tuple q = (q_type, q_rel, q_arg); for example, "Which US president raised taxes?" is captured as:

q_type = US president
q_rel  = raised
q_arg  = taxes

In addition to "Which?" questions, this template captures other WH-questions such as "Who?" (q_type = person).
We then collect a set of candidate answers for each question q. A candidate answer is also represented as a tuple (a_answer, a_rel, a_arg) or (a_arg, a_rel, a_answer), for example:

a_arg    = chocolate
a_rel    = is made from
a_answer = the cocoa bean

We collect answer candidates according to the following criteria:

1. a_arg = q_arg
2. a_answer is a type of q_type
3. a_rel ≠ q_rel

These criteria isolate the task of relation inference from additional inference tasks, because they ensure that a's arguments entail q's. In addition, the first two criteria ensure that enough candidate answers actually answer the question, while the third discards trivial cases. In contrast to Zeichner et al. (2012) and post-hoc evaluations, these criteria do not impose any bias on the relation pair (a_rel, q_rel). Furthermore, we show in §3.2 that a and q are both independent naturally-occurring texts, and are not machine-generated by applying a specific set of inference rules.
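The three criteria above can be sketched as a simple filter. The helper `is_a` and the toy taxonomy below are illustrative stand-ins; the actual pipeline uses the WordNet-based type check described in §3.2:

```python
# Illustrative sketch of the three candidate-answer criteria.
# TOY_TAXONOMY and is_a are stand-ins for the WordNet check of section 3.2.

TOY_TAXONOMY = {"the cocoa bean": "food"}

def is_a(term, q_type):
    return TOY_TAXONOMY.get(term) == q_type

def is_candidate(a_arg, a_rel, a_answer, q_type, q_rel, q_arg):
    return (a_arg == q_arg              # criterion 1: shared argument
            and is_a(a_answer, q_type)  # criterion 2: answer has the right type
            and a_rel != q_rel)         # criterion 3: discard trivial cases

print(is_candidate("chocolate", "is made from", "the cocoa bean",
                   "food", "contains", "chocolate"))  # True
```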
For each (a, q) pair, Mechanical Turk annotators are asked whether a provides an answer to q. This natural approach also enables batch annotation; for each question, several candidate answers can be presented at once without shifting the annotator's focus. To make sure that the annotators do not use their world knowledge about a_answer, we mask it during the annotation phase and replace it with q_type (see Figure 1 and §3.3).
Finally, we instantiate q_type with a_answer, so that each (a, q) pair fits Zeichner's format: instantiated predicates sharing both arguments.

Data Collection
We automatically collected 30,703 pairs of questions and candidate answers for annotation. Our process is largely inspired by Fader et al. (2014).
Questions We collected 573 questions by manually converting questions from TREC (Voorhees and Tice, 2000), WikiAnswers (Fader et al., 2013), and WebQuestions to our "Which q_type q_rel q_arg?" format. Though many questions did fit our format, a large portion of them were about sports and celebrities, which were not applicable to our choice of corpus (Google books) and taxonomy (WordNet).

Corpus QA requires some body of knowledge from which to retrieve candidate answers. We follow Fader et al. (2013), and use a collection of Open IE-style assertions (Banko et al., 2007) as our knowledge base. Specifically, we used hand-crafted syntactic rules to extract over 63 million unique subject-relation-object triplets from Google's Syntactic N-grams (Goldberg and Orwant, 2013). The assertions may include multiword phrases as relations or arguments, as illustrated earlier. This process yields some ungrammatical or out-of-context assertions, which are later filtered during annotation (see §3.3).
Answer Candidates In §3.1 we defined three criteria for matching an answer candidate to a question, which we now translate into a retrieval process. We begin by retrieving all assertions where one of the arguments (subject or object) is equal to q_arg, ignoring stopwords and inflections. The matching argument is named a_arg, while the other (non-matching) argument becomes a_answer.
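A minimal sketch of this retrieval step, with a deliberately crude stand-in for stopword removal and lemmatization (the function names and normalization below are illustrative, not from our released code):

```python
# Sketch of the argument-matching step: keep assertions where one argument
# equals q_arg after dropping stopwords and crude plural inflections.

STOPWORDS = {"the", "a", "an"}

def normalize(phrase):
    """Crude stand-in for stopword removal and lemmatization."""
    words = [w.lower().rstrip("s") for w in phrase.split()
             if w.lower() not in STOPWORDS]
    return " ".join(words)

def match_question(assertions, q_arg):
    """Return assertions matching q_arg, marking which side is a_arg."""
    matches = []
    for subj, rel, obj in assertions:
        if normalize(subj) == normalize(q_arg):
            matches.append((subj, rel, obj, "subject"))  # subj is a_arg
        elif normalize(obj) == normalize(q_arg):
            matches.append((subj, rel, obj, "object"))   # obj is a_arg
    return matches

assertions = [("Eritrea", "invaded", "Ethiopia"),
              ("chocolate", "is made from", "the cocoa bean")]
print(match_question(assertions, "Ethiopia"))
```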
To implement the second criterion (a_answer is a type of q_type) we require a taxonomy T, as well as a word-sense disambiguation (WSD) algorithm to match natural-language terms to entities in T. In this work, we employ WordNet's hypernymy graph (Fellbaum, 1998) as T and Lesk (Lesk, 1986) for WSD (both via NLTK (Bird et al., 2009)). While automatic WSD is prone to some errors, these cases are usually annotated as nonsensical in the final phase.
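To illustrate the idea (with toy data rather than the actual NLTK calls), the sketch below runs a simplified Lesk gloss-overlap disambiguation over hand-made senses, then walks a toy hypernym chain to test criterion 2:

```python
# Illustrative sketch of the type check: simplified Lesk disambiguation
# over toy glosses, then a walk up a toy hypernym chain. The real
# pipeline uses NLTK's WordNet hypernymy graph and Lesk implementation.

TOY_SENSES = {
    "bank": [
        ("bank.n.01", "financial institution that accepts deposits", "organization"),
        ("bank.n.02", "sloping land beside a river", "location"),
    ],
}
TOY_HYPERNYMS = {"organization": "entity", "location": "entity"}

def simplified_lesk(word, context_words):
    """Pick the sense whose gloss has the largest word overlap with the context."""
    best, best_overlap = None, -1
    for sense, gloss, hypernym in TOY_SENSES[word]:
        overlap = len(set(gloss.split()) & set(context_words))
        if overlap > best_overlap:
            best, best_overlap = (sense, hypernym), overlap
    return best

def has_type(hypernym, q_type):
    """Walk up the hypernym chain to test criterion 2 (a_answer is a q_type)."""
    while hypernym is not None:
        if hypernym == q_type:
            return True
        hypernym = TOY_HYPERNYMS.get(hypernym)
    return False

sense, hyp = simplified_lesk("bank", ["the", "river", "sloping", "land"])
print(sense, has_type(hyp, "location"))  # bank.n.02 True
```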
Lastly, we remove instances where a_rel = q_rel.

Crowdsourced Annotation
Masking Answers We noticed that exposing a_answer to the annotator may skew the annotation; rather than annotating whether a_rel implies q_rel in the given context, the annotator might annotate whether a_answer answers q according to her general knowledge. For example:

Q: Which country borders Ethiopia?
A: Eritrea invaded Ethiopia.
An annotator might be misled by knowing in advance that Eritrea borders Ethiopia. Although an invasion typically requires land access, it does not imply a shared border, even in this context; "Italy invaded Ethiopia" also appears in our corpus, but it is not true that "Italy borders Ethiopia". Effectively, what the annotator might be doing in this case is substituting q_type ("country") with a_answer ("Eritrea") and asking herself whether the assertion (a_answer, q_rel, q_arg) is true ("Does Eritrea border Ethiopia?"). As demonstrated, this question may have a different answer from the inference question in which we are interested ("If a country invaded Ethiopia, does that country border Ethiopia?"). We therefore mask a_answer during annotation by replacing it with q_type as a placeholder:

A: [COUNTRY] invaded Ethiopia.

This forces the annotator to ask herself whether a_rel implies q_rel in this context, i.e. does invading Ethiopia imply sharing a border with it?
Labels Each annotator was given a single question with several matching candidate answers (20 on average), and asked to mark each candidate answer with one of three labels:

✓ The sentence answers the question.
✗ The sentence does not answer the question.
? The sentence does not make sense, or is severely non-grammatical.

Figure 1 shows several annotated examples. The third label (?) was useful in weeding out noisy assertions (23% of candidate answers).
Aggregation Overall, we created 1,500 questionnaires spanning a total of 30,703 (a, q) pairs. (Each of our 573 questions had many candidate answers; these were split into smaller chunks, questionnaires, of fewer than 25 candidate answers each.) Each questionnaire was annotated by 5 different people, and aggregated using the unanimous-up-to-one (at least 4/5) rule. Examples that did not exhibit this kind of inter-annotator agreement were discarded, and so were examples determined to be nonsensical/ungrammatical (annotated with ?). After aggregating and filtering, we were left with 3,147 positive (✓) and 13,224 negative (✗) examples.

To evaluate this aggregation rule, we took a random subset of 32 questionnaires (594 (a, q) pairs) and annotated them ourselves (expert annotation). We then compared the aggregated crowdsourced annotation on the same (a, q) pairs to our own. The crowdsourced annotation yielded 91.3% precision against our expert annotations (i.e. only 8.7% of the crowd-annotated positives were expert-annotated as negative), while recalling 86.2% of expert-annotated positives.
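The unanimous-up-to-one rule can be sketched as follows, encoding the labels ✓/✗/? as "+", "-", and "?" (an illustrative sketch, not the actual aggregation script):

```python
# Illustrative sketch of the unanimous-up-to-one aggregation rule.
# "+" = answers, "-" = does not answer, "?" = nonsensical/ungrammatical.

def aggregate(labels):
    """Aggregate five crowd labels; return '+' or '-' on at least 4/5
    agreement, or None to discard the example."""
    for label in ("+", "-"):
        if labels.count(label) >= 4:
            return label
    return None  # no 4/5 agreement, or dominated by '?' votes

print(aggregate(["+", "+", "+", "+", "-"]))  # +
print(aggregate(["+", "+", "-", "-", "?"]))  # None
```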

Performance of Existing Methods
To provide a baseline for future work, we test the performance of two inference-rule resources and two methods of distributional inference on our dataset, as well as a lemma-similarity baseline.

Baselines
Lemma Baseline We implemented a baseline that takes into account four features of the premise relation (a_rel) and the hypothesis relation (q_rel) after they have been lemmatized: (1) Does a_rel contain all of q_rel's content words? (2) Do the relations share a verb? (3) Does the relations' active/passive voice match their arguments' alignment? (4) Do the relations agree on negation? The baseline classifies an example as positive only if all four features are true.
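A minimal sketch of this baseline, with lemmatization, content-word filtering, and voice detection stubbed out for brevity (feature 2, the shared verb, is folded into the toy content-word check, and feature 3 is passed in as a flag; the real implementation operates on lemmatized text):

```python
# Toy sketch of the lemma baseline's conjunction of boolean features.
# STOPWORDS and NEGATIONS are illustrative stubs.

NEGATIONS = {"not", "never", "no"}
STOPWORDS = {"is", "the", "a", "of", "to", "by"}

def content_words(rel):
    return {w for w in rel.split() if w not in STOPWORDS}

def has_negation(rel):
    return bool(NEGATIONS & set(rel.split()))

def lemma_baseline(a_rel, q_rel, voices_match=True):
    contains_all = content_words(q_rel) <= content_words(a_rel)   # feature 1 (+2 in this toy version)
    negation_agrees = has_negation(a_rel) == has_negation(q_rel)  # feature 4
    return contains_all and voices_match and negation_agrees      # feature 3 is the flag

print(lemma_baseline("certainly treats", "treats"))  # True
print(lemma_baseline("does not treat", "treat"))     # False
```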
PPDB 2.0 We used the largest collection of paraphrases (XXXL) from PPDB (Pavlick et al., 2015b). These paraphrases include argument slots for cases where word order changes (e.g. passive/active).

Entailment Graph We used the publicly-available inference rules derived from Berant et al.'s (2011) entailment graph. These rules contain typed relations and can also be applied in a context-sensitive manner. However, ignoring the types and applying the inference rules out of context worked better on our dataset, perhaps because Berant et al.'s taxonomy was learned from a different corpus.
Relation Embeddings Similar to DIRT (Lin and Pantel, 2001), we create vector representations for relations, which are then used to measure relation similarity. From the set of assertions extracted in §3.2, we create a dataset of relation-argument pairs, and use word2vecf (Levy and Goldberg, 2014) to train the embeddings. We also tried to use the arguments' embeddings to induce a context-sensitive measure of similarity, as suggested by Melamud et al. (2015); however, this method did not improve performance on our dataset.
Word Embeddings Using Google's Syntactic N-grams (Goldberg and Orwant, 2013), from which candidate answers were extracted, we trained dependency-based word embeddings with word2vecf (Levy and Goldberg, 2014). We used the average word vector to represent multi-word relations, and cosine to measure their similarity.
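The multi-word relation representation can be sketched as follows, with toy 3-dimensional vectors standing in for the trained word2vecf embeddings:

```python
# Sketch of the multi-word relation representation: average the word
# vectors and compare relations with cosine similarity. TOY_VECS stands
# in for the dependency-based embeddings trained with word2vecf.
import math

TOY_VECS = {
    "treats":     [0.9, 0.1, 0.0],
    "eliminates": [0.8, 0.2, 0.1],
    "is":         [0.1, 0.1, 0.9],
    "made":       [0.2, 0.8, 0.3],
    "from":       [0.1, 0.7, 0.4],
}

def relation_vector(rel):
    """Average the word vectors of a (possibly multi-word) relation."""
    vecs = [TOY_VECS[w] for w in rel.split()]
    return [sum(dims) / len(vecs) for dims in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim = cosine(relation_vector("treats"), relation_vector("eliminates"))
print(round(sim, 3))
```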

Results
Under the assumption that collections of inference rules are more precision-oriented, we also try different combinations of rule-based and embedding-based methods: we first apply the rules, and then calculate the embedding-based similarity only on instances that were not identified as positive by the rules. Since the embeddings produce a similarity score rather than a classification, we plot all methods' performance on a single precision-recall curve (Figure 2).
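The rule-then-embedding combination can be sketched as follows (the names are illustrative): rule hits receive maximal confidence, and everything else falls back to the graded embedding similarity:

```python
# Illustrative sketch of combining rule-based and embedding-based methods.

def combined_score(a_rel, q_rel, rules, embedding_sim):
    """rules: set of (premise, hypothesis) pairs;
    embedding_sim: similarity function over relation pairs."""
    if (a_rel, q_rel) in rules:
        return 1.0  # a rule fired: maximal confidence
    return embedding_sim(a_rel, q_rel)  # fall back to graded similarity

RULES = {("eliminates", "treats")}

def stub_sim(a, b):
    return 0.4  # stand-in for the embedding cosine similarity

print(combined_score("eliminates", "treats", RULES, stub_sim))  # 1.0
print(combined_score("cures", "treats", RULES, stub_sim))       # 0.4
```

Sweeping a threshold over these scores yields the single precision-recall curve described above.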
All methods used the lemma baseline as a first step to identify positive examples; without it, performance drops dramatically. This is probably more of a dataset artifact than an observation about the baselines; just as we filtered examples where a_rel = q_rel, we could have used a more aggressive policy and removed all pairs that share lemmas.
It seems that most methods provide little value beyond the lemma baseline, the exception being Berant et al.'s (2011) entailment graph. Unifying the entailment graph with PPDB (and, implicitly, the lemma baseline) slightly improves performance, and provides a significantly better starting point for the method based on word embeddings. Even so, performance is still quite poor in absolute terms, with less than 25% recall at 80% precision. (In Figure 2, All Rules denotes the union of PPDB and the entailment graph, and Rules + W Embs denotes the combination of All Rules with our word embeddings.)

The Ramifications of Low Recall
These results emphasize the huge false-negative rate of existing methods. This suggests that a massive amount of inference examples, which are necessary for answering questions, are inherently ignored by Zeichner et al. (2012) and by post-hoc evaluations. Our dataset remedies this bias, and poses a new challenge for future research on relation inference.