CaRB: A Crowdsourced Benchmark for Open IE

Open Information Extraction (Open IE) systems have traditionally been evaluated via manual annotation. Recently, an automated evaluator with a benchmark dataset (OIE2016) was released; it scores Open IE systems automatically by matching system predictions with predictions in the benchmark dataset. Unfortunately, our analysis reveals that its data is rather noisy and that the tuple matching in the evaluator has issues, making the results of automated comparisons less trustworthy. We contribute CaRB, an improved dataset and framework for testing Open IE systems. To the best of our knowledge, CaRB is the first crowdsourced Open IE dataset, and it also makes substantive changes in the matching code and metrics. Judged against annotations by NLP experts, CaRB's dataset is more accurate than OIE2016's. Moreover, we find that on one pair of Open IE systems, the CaRB framework provides results that contradict OIE2016. Human assessment verifies that CaRB's ranking of the two systems is the accurate one. We release the CaRB framework along with its crowdsourced dataset.


Introduction
Open Information Extraction (Open IE) refers to the task of forming relational tuples from sentences, without a fixed relation vocabulary (Banko et al., 2007). Open IE has numerous downstream applications such as knowledge base construction, relation extraction, summarisation and learning word embeddings (Stanovsky et al., 2015; Mausam, 2016). Many Open IE systems have been built to date, such as TextRunner (Banko et al., 2007), ReVerb, OLLIE (Mausam et al., 2012), ClausIE (Del Corro and Gemulla, 2013), OpenIE 4 (Pal and Mausam, 2016), OpenIE 5 (Saha et al., 2017; Saha and Mausam, 2018), PropS, NST (Jia et al., 2018), Neural Open IE (Cui et al., 2018), and more. With the advent of so many systems, it is imperative to have a standardized mechanism for automatic evaluation so that they can be compared.
* Joint first author. † Presently an AI Resident at Google.
Traditionally, these systems have been evaluated over small manually curated gold datasets (e.g., Mausam et al., 2012). There are two problems with this approach. First, it is not reliable due to the small size of the annotation. Second, it lacks standardization, since there is no single gold dataset over which all systems are evaluated. Moreover, the annotation guidelines may vary across datasets and annotators. Recently, some standard benchmark datasets and evaluators have been proposed: OIE2016, RelVis (Schneider et al., 2017), and Wire57 (Léchelle et al., 2018). Unfortunately, these datasets are either too small or too noisy to meaningfully compare Open IE systems.
For instance, since its release in 2016, OIE2016 has been considered the de facto standard for Open IE evaluation (e.g., OIE2016 is used by the recent NST and Neural Open IE systems). However, upon close analysis, we find several issues with this benchmark. Its gold dataset contains significant errors and misses a large number of important tuples. This can be attributed to the fact that this dataset was not manually curated for Open IE; rather, QA-SRL data was adapted for this task. There are also issues with its evaluation rules, which we detail later.
In response, we propose a new benchmark system, CaRB (Crowdsourced automatic open Relation extraction Benchmark), which has a good-sized and high-quality dataset, along with better

Sent. #3: The main reason for this adoption over mainline gimp was its support for high bit depths which can be required for film work .
OIE2016: ( high bit depths ; required ; film work )
CaRB: ( this adoption ; has support for ; high bit depths )
      ( high bit depths ; can be required for ; film work )
      ( this adoption ; was over ; mainline gimp )
      ( mainline gimp ; has no support for ; high bit depths )
      ( its support for high bit depths which can be required for film work ; was The main reason for ; this adoption over mainline gimp )
Sent. #4: The number of ones equals the number of zeros plus one , since the state containing only zeros can not occur .

Our MTurk task has an automated system for training and qualifying workers, which makes crowdsourcing this annotation feasible. Two Open IE experts (authors of this paper) manually annotate 50 random sentences, which are then used as expert ground truth to evaluate the respective tuples in OIE2016's and CaRB's gold datasets (Tables 4 and 5). We find that CaRB outperforms OIE2016 by 21 points in precision and 16 points in recall under token-level match. This demonstrates that CaRB's gold dataset is significantly more accurate than OIE2016's. Additionally, when evaluating all systems using our benchmark, we notice that CaRB reverses OIE2016's ranking of PropS and ClausIE. Human verification, again through crowdsourcing, confirms that the two systems are ranked more accurately by CaRB. We release CaRB's dataset, along with its evaluator, as a novel benchmark for further use by the research community. 1

Related Work
To the best of our knowledge, there are three benchmark systems available for comparing Open IE systems. Of them, the first and the most prominent is OIE2016. This has been widely adopted as the standard evaluation framework to test new systems on. In OIE2016, gold tuples are generated using an automated rule-based system built on top of a QA-SRL dataset (He et al., 2015). In early analysis we find this dataset to be rather noisy. Table 1 illustrates some sample sentences from this gold dataset. These tuples look obviously wrong, and unfit to be in the gold set.
1 https://github.com/dair-iitd/CaRB
In addition to the dataset, Stanovsky and Dagan (2016) release a scorer that compares a set of gold tuples with a set of system tuples to estimate word-level precision and recall. This scorer has been identified as not penalizing long extractions. It also does not penalize extractions for misidentifying parts of a relation in an argument slot (or vice versa), leading to trivial systems that score much better than genuine Open IE systems (Léchelle et al., 2018). We also observe that the scorer compares words all-to-all, allowing multiple copies of the same word in an extraction to match a single corresponding word in the gold. Thus, simply repeating a word in the extraction will give it a high precision score. Finally, the scorer loops over gold tuples in an arbitrary order, and matches them to predicted extractions in a sequential manner. Once a gold tuple is matched to a predicted extraction, it is rendered unavailable for any subsequent, potentially better-matched, extraction.
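The effect of all-to-all word comparison can be illustrated with a small sketch. This is not the OIE2016 scorer's code; the function names, the example sentence, and the multiset-based alternative are assumptions introduced purely for illustration.

```python
from collections import Counter

def all_to_all_precision(pred_tokens, gold_tokens):
    # Every predicted token counts as a hit if it occurs anywhere in the gold,
    # so repeated copies of one gold word are all rewarded.
    gold_vocab = set(gold_tokens)
    hits = sum(1 for tok in pred_tokens if tok in gold_vocab)
    return hits / len(pred_tokens) if pred_tokens else 0.0

def multiset_precision(pred_tokens, gold_tokens):
    # One possible remedy (an assumption, not a documented fix): each gold
    # token can be matched at most as many times as it occurs in the gold.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    return overlap / len(pred_tokens) if pred_tokens else 0.0

gold = "Obama was born in Hawaii".split()
degenerate = ["Obama", "Obama", "Obama"]  # a prediction that just repeats a word

print(all_to_all_precision(degenerate, gold))  # 1.0
print(multiset_precision(degenerate, gold))    # 0.333...
```

Under all-to-all comparison the degenerate prediction receives perfect precision, whereas counting each gold word at most once reduces it to one third.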
Another dataset is RelVis (Schneider et al., 2017), a benchmark that borrows its data from four different datasets including OIE2016. Since OIE2016 forms a major part of this dataset, it has similar issues with noise. Its scorer makes some modifications to OIE2016. However, it does not reward partial coverage of gold tuples, and forces one system prediction to match just one gold. It also does not penalize overlong extractions.
Finally, Wire57 (Léchelle et al., 2018) makes further improvements in the scorer. It penalises overlong extractions and assigns a token-level precision and recall score to all gold-prediction pairs for a sentence. Moreover, it considers all pairs of extractions in its matching phase. However, it still forces one prediction to match just one gold. It also reports just one score for a system, ignoring the confidence values of the individual predictions that make the precision-recall curve of OIE2016 possible. Our scorer is inspired by theirs, with some changes. More importantly, the dataset used in Wire57 is manually curated, but with only 57 sentences, which is too small to suffice as a comprehensive test dataset.

Crowdsourcing CaRB Dataset
To overcome the shortcomings of dataset noise and size, we crowdsource a high-quality gold dataset for Open IE. We ask workers over Amazon Mechanical Turk (MTurk) to annotate extractions for the 1,282 sentences in dev and test splits of OIE2016. The workers annotate tuples in the form (arg1, rel, arg2), and also annotate location and time attributes for each tuple, when possible.
Open IE annotations are not easy to obtain from non-expert workers. To get acceptable quality, we train workers using a tutorial 2 that doubles up as a qualification test. Their performance in the test is automatically graded. Only workers that pass this are allowed to move on to the main task. The qualification is integrated with the task so that a new worker is served the tutorial and test first, but a qualified worker is directly taken to the main task. This makes the crowdsourcing process scalable.
We divide the task of annotating a sentence into three steps: (1) identifying the relation, (2) identifying the arguments for that relation, and (3) optionally identifying the location and time attributes for the tuple. The training process for the annotators is split into four steps, each of which focuses on a different guideline for Open IE.
We also develop a user-friendly interface for annotating the sentences, which almost eliminates the need for workers to type anything. We note that several workers became frustrated with our qualification test, could not understand the task, and left the job. Nevertheless, several good workers completed the task successfully and annotated a significant amount of high-quality data for us.
For sentences involving reporting verbs like said, told, asked, etc., some systems annotate additional attributional context for every utterance (Mausam et al., 2012). For this, we create a separate task, so as to prevent workers from being bombarded with all the rules at the same time.
We post-process the data to remove obviously incorrect annotations, such as ones with a missing arg1 or rel. We also follow the convention of ending a relation with a preposition instead of beginning arg2 with one, so all prepositions are shifted to rel.
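The two post-processing rules can be sketched as follows. This is a minimal illustration rather than the released pipeline: the function name, the string-based tuple representation, and the short preposition list are all assumptions, and only a single leading preposition of arg2 is moved.

```python
# Hypothetical, deliberately short preposition list (an assumption).
PREPOSITIONS = {"in", "on", "at", "for", "to", "of", "by", "with", "from", "over"}

def postprocess(tuples):
    """Drop tuples lacking arg1 or rel; move a leading preposition of arg2 into rel."""
    cleaned = []
    for arg1, rel, arg2 in tuples:
        if not arg1.strip() or not rel.strip():
            continue  # obviously incorrect annotation: missing arg1 or rel
        words = arg2.split()
        if words and words[0].lower() in PREPOSITIONS:
            rel = f"{rel.strip()} {words[0]}"  # relation now ends with the preposition
            arg2 = " ".join(words[1:])
        cleaned.append((arg1.strip(), rel.strip(), arg2.strip()))
    return cleaned

raw = [
    ("high bit depths", "can be required", "for film work"),
    ("", "was over", "mainline gimp"),          # missing arg1, dropped
    ("this adoption", "", "high bit depths"),   # missing rel, dropped
]
print(postprocess(raw))
# [('high bit depths', 'can be required for', 'film work')]
```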

The CaRB Scorer
We now describe CaRB's approach for scoring system predictions against the gold. Instead of greedily matching gold tuples to system tuples in arbitrary order, CaRB creates an all-pair matching table, with each column representing a system tuple and each row a gold tuple. It computes precision and recall scores between each pair of tuples. Then, for computing overall recall, the maximum recall score is taken in each row, and these maxima are averaged. By taking the maximum, the recall computation matches a gold tuple with the closest system extraction. For computing precision, the system predictions are matched one-to-one with gold tuples, in order from best match score to worst. The match precision scores are then averaged to compute precision. To compute the precision-recall curve, this computation is repeated at different confidence thresholds of the system extractions.
In this way, CaRB's recall computation uses the notion of multi-match, wherein a gold tuple can match multiple system extractions. This helps avoid penalizing a system too heavily if it stuffs information from multiple gold tuples into a single extraction. Table 2 displays an example wherein system 1 combines information from two gold tuples in a single extraction, and system 2 only extracts one of the gold tuples. One-to-one match (OIE2016) is indifferent between the two, which means that for OIE2016, adding more information to the same extraction has no value at all. However, multi-match (CaRB) assigns higher recall to system 1, since it contains strictly more information, and higher precision to system 2, since its prediction exactly matches a gold extraction.
On the other hand, CaRB uses single match for precision. This is because CaRB's gold tuples are atomic, and cannot be further divided into more tuples. By single matching for precision, CaRB penalizes Open IE systems that produce several very similar and redundant extractions.
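The multi-match recall and single-match precision described above can be sketched for one sentence as follows. This is an illustrative simplification, not the released scorer: tuples are flattened to token lists, the pairwise score is a multiset token overlap, pairs are ordered by their precision, and unmatched predictions contribute zero precision. All of these choices, the function names, and the example sentences are assumptions.

```python
from collections import Counter

def token_scores(gold, pred):
    """(precision, recall) of one prediction against one gold, by token overlap."""
    overlap = sum((Counter(gold) & Counter(pred)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

def score_sentence(golds, preds):
    if not golds or not preds:
        return 0.0, 0.0  # degenerate case, handled arbitrarily in this sketch
    # All-pair table: rows are gold tuples, columns are system tuples.
    table = [[token_scores(g, p) for p in preds] for g in golds]
    # Recall (multi-match): best recall in each row, averaged over gold tuples.
    recall = sum(max(r for _, r in row) for row in table) / len(golds)
    # Precision (single match): greedy one-to-one pairing, best pair first.
    candidates = sorted(
        ((table[i][j][0], i, j) for i in range(len(golds)) for j in range(len(preds))),
        reverse=True,
    )
    used_gold, used_pred, total = set(), set(), 0.0
    for prec, i, j in candidates:
        if i not in used_gold and j not in used_pred:
            used_gold.add(i)
            used_pred.add(j)
            total += prec
    return total / len(preds), recall

golds = ["I ate an apple".split(), "I ate an orange".split()]
system1 = ["I ate an apple and an orange".split()]  # merges both gold tuples
system2 = ["I ate an apple".split()]                # extracts only one gold tuple

print(score_sentence(golds, system1))  # precision 4/7, recall 1.0
print(score_sentence(golds, system2))  # precision 1.0, recall 0.875
```

In this made-up example, the merged extraction obtains the higher recall and the exact extraction the higher precision, mirroring the behaviour attributed to multi-match recall and single-match precision.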
Another significant change from the OIE2016 scorer is the use of tuple match instead of lexical match. CaRB matches relation with relation and arguments with arguments, whereas OIE2016 serializes the tuples into a sentence and computes only lexical matches. Table 3 illustrates an example in which the arguments are shuffled: lexical match (OIE2016) shows no effect, but tuple match (CaRB) rightfully decreases the scores. To avoid spurious matches, CaRB considers only matches with at least one common word in the relation field.
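The difference between lexical match and tuple match under shuffled arguments can be sketched as below. The functions are hypothetical simplifications (recall only, whitespace tokenization, multiset overlap) and are not taken from either scorer; the example tuple is borrowed from the sample annotations shown earlier.

```python
from collections import Counter

def overlap(a, b):
    """Number of shared tokens between two strings, respecting multiplicity."""
    return sum((Counter(a.split()) & Counter(b.split())).values())

def lexical_recall(gold, pred):
    # Serialize each tuple into one string and compare bags of words.
    gold_sent, pred_sent = " ".join(gold), " ".join(pred)
    return overlap(gold_sent, pred_sent) / len(gold_sent.split())

def tuple_recall(gold, pred):
    # Compare slot by slot: arg1 with arg1, rel with rel, arg2 with arg2.
    if overlap(gold[1], pred[1]) == 0:
        return 0.0  # no common word in the relation field: not a match
    matched = sum(overlap(g, p) for g, p in zip(gold, pred))
    total = sum(len(g.split()) for g in gold)
    return matched / total

gold = ("mainline gimp", "has no support for", "high bit depths")
shuffled = ("high bit depths", "has no support for", "mainline gimp")

print(lexical_recall(gold, shuffled))  # 1.0, the shuffle goes unnoticed
print(tuple_recall(gold, shuffled))    # 0.444..., only the relation words match
```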
Finally, some Open IE systems extract n-ary tuples and others do not. To treat all systems on equal footing, we follow previous work and append all higher numbered arguments into arg2.
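A minimal sketch of this flattening step, assuming extractions are represented as sequences of strings and that the extra arguments are joined with single spaces (both are assumptions; the function name is hypothetical):

```python
def to_binary(extraction):
    """Collapse (arg1, rel, arg2, arg3, ...) into (arg1, rel, 'arg2 arg3 ...')."""
    arg1, rel, *rest = extraction
    return (arg1, rel, " ".join(rest))

print(to_binary(("John", "gave", "a book", "to Mary")))
# ('John', 'gave', 'a book to Mary')
print(to_binary(("John", "sleeps")))
# ('John', 'sleeps', '')
```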

Dataset Quality
We first estimate the overall quality of the crowdsourced dataset. To this end, two authors of this paper annotate 50 dev sentences from OIE2016 to create an expert dataset. They first independently annotate tuples from these sentences, achieving an agreement F1 score of 83. They then resolve the differences and merge these independent sets. This is taken as an expert gold against which both OIE2016 and CaRB datasets are assessed.
Tables 4 and 5 estimate the dataset quality of OIE2016 and CaRB. We find that CaRB has substantially higher precision and recall than OIE2016, suggesting that it is a much cleaner dataset. Table 1 compares the crowdsourced annotations and OIE2016 gold annotations for some sample sentences. While there is still scope for improvement, the CaRB dataset appears much better than OIE2016's gold.
The OIE2016 authors remark that their gold dataset reaches an F1 of 95.8 on their expert annotation, whereas our assessment suggests values around 60. We surmise that this discrepancy is due to the different gold-prediction scoring schemes used. In the original OIE2016 paper, the authors "match an automated extraction with a gold proposition if both agree on the grammatical head of all of their elements (predicate and arguments)". 3 The head match criterion is a much laxer scheme than ours and can explain the very high F1 score against their expert annotation.

Comparison of Open IE Systems
We test the different Open IE systems depicted in , using the CaRB dataset and scorer. The p-r curves obtained using OIE2016 and CaRB are shown in Figures 1 (reproduced from Stanovsky et al. (2018)) and 2. Precision, recall and F1 scores (at the max-F1 point) and the area under the precision-recall curve are reported in Table 6. It can be seen that the curve for PropS lies above that of ClausIE at all times in OIE2016, but PropS performs the worst of all systems in CaRB. To verify that CaRB indeed gives the correct ranking, we turn to human verification.
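For readers who wish to reproduce such summary numbers from a list of curve points, the sketch below computes the maximum F1 and the area under a precision-recall curve. The trapezoidal rule, the function names, and the toy points are assumptions for illustration; the released evaluator may compute these quantities differently.

```python
def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def summarize_curve(points):
    """points: (recall, precision) pairs, one per confidence threshold."""
    pts = sorted(points)  # order by increasing recall
    # Area under the curve via the trapezoidal rule (an assumed choice).
    auc = sum((r2 - r1) * (p1 + p2) / 2 for (r1, p1), (r2, p2) in zip(pts, pts[1:]))
    best_f1 = max(f1(p, r) for r, p in pts)
    return auc, best_f1

toy_curve = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]  # made-up points, not real results
print(summarize_curve(toy_curve))  # (0.5, 0.5)
```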

Human Verification
Through human verification, our goal is to learn the accurate ranking for ClausIE and PropS. We randomly select 100 test sentences and evaluate both system extractions on this subset.
We assess the correct ranking between PropS and ClausIE using MTurk. Four workers are shown the extractions from both systems in random order and asked to either choose one of the systems as the better one or indicate that both are equal. The majority opinion of these four is considered the correct ranking for that sentence, with an equal split leading to a tie. In this experiment, we only allow MTurk workers who have been trained for Open IE for the crowdsourcing task to participate.
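The per-sentence aggregation can be sketched as follows. The label strings and the function name are hypothetical, and two details are assumptions not specified above: a plurality (for example, two votes against one and one) is treated as the majority, and a winning "equal" vote is reported as a tie.

```python
from collections import Counter

def majority_label(votes):
    """Aggregate worker votes ('PropS', 'ClausIE' or 'equal') for one sentence."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    # An equal split between the top two options leads to a tie.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    winner = counts[0][0]
    return "tie" if winner == "equal" else winner

print(majority_label(["ClausIE", "ClausIE", "ClausIE", "PropS"]))  # ClausIE
print(majority_label(["ClausIE", "ClausIE", "PropS", "PropS"]))    # tie
```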
Of these 100 sentences, PropS is judged to have performed better on 15 and ClausIE on 69, whereas 16 ended in a tie. ClausIE is indeed considered the better system in human evaluation, and we verify that CaRB gives a more accurate ranking of these two systems than OIE2016.

Conclusion
We contribute CaRB, a crowdsourced dataset for evaluation and comparison of Open IE systems. We assess this dataset against an expert-annotated dataset and find that it is dramatically more accurate than the existing OIE2016 benchmark dataset.
We also implement a scorer that computes precision, recall and area under the p-r curve for a given system output by matching it with the CaRB dataset. In designing our scorer, we make several design choices that deviate from prior work, both in computing match scores and in finding the best match for a tuple. We believe our scheme treats various systems fairly. In one case where CaRB and OIE2016 give different rankings to two Open IE systems, we demonstrate via human evaluation that the ranking given by CaRB is the accurate one. We release the dataset and scorer for further use by the research community.
We expect that crowdsourced annotation will also be able to help the training of Open IE systems as it has helped their evaluation; we leave the creation of a suitably large crowdsourced training set for Open IE to future work.