Bootstrapping for Numerical Open IE

We design and release BONIE, the first open numerical relation extractor, for extracting Open IE tuples where one of the arguments is a number or a quantity-unit phrase. BONIE uses bootstrapping to learn the specific dependency patterns that express numerical relations in a sentence. BONIE's novelty lies in task-specific customizations, such as inferring implicit relations that are clear from context such as units (e.g., 'square kilometers' suggests area, even if the word 'area' is missing from the sentence). BONIE obtains 1.5x yield and a 15 point precision gain on numerical facts over a state-of-the-art Open IE system.


Introduction
Open Information Extraction (Open IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary (Etzioni et al., 2008; Mausam, 2016), by constructing the relation phrases and arguments from within the sentences themselves. Early works on Open IE such as REVERB extract verb-mediated relations via a handful of human-defined patterns. OLLIE improves recall by learning dependency patterns, using bootstrapping over REVERB extractions (Mausam et al., 2012). Open IE 4.2, a state-of-the-art open information extractor, is based on a combination of SRLIE, a verb-mediated extractor over SRL frames, and RELNOUN 2.0, which performs special linguistic processing for extraction from complex noun phrases (Pal and Mausam, 2016).
In this work, we present and release the first system for open numerical extraction, which we name BONIE for Bootstrapping-based Open Numerical Information Extractor. It is important to note that existing Open IE systems, like Open IE 4.2, may also extract numerical facts. However, they are oblivious to the presence of numbers in arguments. Therefore, they may miss important extractions and may not always output the best numerical facts. Table 1 compares extractions generated by Open IE 4.2 and BONIE on some sample sentences.
Table 1 (sentence #1). Sentence: Hong Kong's labour force is 3.5 million. Open IE 4.2: (Hong Kong's labour force; is; 3.5 million). BONIE: (Hong Kong; has labour force of; 3.5 million).
At a high level, BONIE follows OLLIE's design of identifying seed facts, constructing training data by bootstrapping sentences that may mention a seed fact, pattern learning, and ranking. Madaan et al. (2016) note that bootstrapping for numerical IE is challenging: it can lead to high noise, since numbers can easily match out of context, and to missed recall, since numbers may fail to match due to approximations. In response, similar to most previous works (e.g., LUCHS (Hoffmann et al., 2010)), BONIE matches a number if it is within a percentage threshold. Additionally, BONIE uses a quantity extractor (Roy et al., 2015), which provides the units mentioned in the sentence; BONIE bootstraps a sentence only when the units match.
When compared to OLLIE, BONIE contributes several customizations specific to numerical IE. (1) Since no open facts are available for this task, we first manually define a set of high-precision seed patterns, which are run over a large corpus to generate seed facts. (2) Not all seeds are fit for bootstrapping; many do not even have an entity as the first argument. We develop heuristics to identify an informative subset of these. After bootstrapping and pattern learning, we find that we are still missing important tuples. E.g., sentence #3 in Table 1 has no explicit relation word: the relation "has length of" is implicit via the adjective 'long'. Similarly, sentence #5 expresses the relation 'area' via the units. (3) BONIE identifies implicit relations using additional processing of units and adjectives. (4) Finally, BONIE can tag a quantity as a count and prepend "number of" to the relation phrase (sentence #2).

Related Work
One of the first Open IE systems to obtain substantial recall is OLLIE (Mausam et al., 2012), a pattern learning approach based on bootstrapped training data built from high-precision verb-based extractions. Other methods augment the linguistic knowledge in the systems: Exemplar (de Sá Mesquita et al., 2013) adds new rules over dependency parses, and SRLIE develops extraction logic over SRL frames. Several works identify clauses and operate over restructured sentences (Schmidek and Barbosa, 2014; Corro and Gemulla, 2013; Bast and Haussmann, 2013). Other approaches use tree kernels (Xu et al., 2013), qualia-based patterns (Xavier et al., 2015), and simple within-sentence inference (Bast and Haussmann, 2014). However, none of them handle numbers specifically, and hence they do not work for our problem.
Numerical Relations: Numbers play an important role in extracting information from text. Early work focused on understanding numbers that express temporal information (Ling and Weld, 2010). More recently, the focus has been on numbers that express physical quantities or measures, either mentioned in text (Chaganty and Liang, 2016) or in the context of web tables (Ibrahim et al., 2016; Neumaier et al., 2016), or on numbers that represent cardinalities of relations (Mirza et al., 2017).
One prior work that applies to generic numerical relations is LUCHS (Hoffmann et al., 2010), which uses distant supervision to create 5,000 relation extractors, including extractors for numerical relations. Researchers have also developed numerical relation extractors specifically for relations where one of the arguments is a quantity (Vlachos and Riedel, 2015; Intxaurrondo et al., 2015; Madaan et al., 2016). However, all of them extract only ontology relations, and hence are not directly applicable to Open IE.

Open Numerical Relation Extraction
The goal of Open Numerical Relation Extraction is to process a sentence that has a quantity mention in it, and extract any tuple of the form (Arg1, relation phrase, Arg2) where Arg2 (or Arg1) is a quantity. As a first step, BONIE learns patterns where Arg2 is a quantity, as most English sentences tend to express numerical facts in active voice. Figure 1 outlines BONIE's algorithm, which operates in two phases: training and extraction. BONIE's training includes creation of seed facts, generation of training data via bootstrapping, and pattern learning over dependency parses. In the extraction phase, BONIE performs pattern matching and parse-based expansion to construct numerical tuples. These numerical tuples are made more coherent by a novel relation construction step.
As an example, the sentence "India has a population of 1.2 billion" matches seed pattern #2 (from Figure 2) to create a seed fact (India; population; 1.2 billion; null). This 'null' represents that the quantity needs no unit. While bootstrapping, this seed fact may match a sentence "India is the second most populous country in the world, with a population of 1.25 billion." in the corpus. This training example will help learn a new pattern. This pattern, when applied to the sentence "Microsoft Windows is the most popular operating system, with a customer base of 300 million users", will extract (Microsoft Windows; has customer base of; 300 million users).
While BONIE's skeleton broadly resembles that of OLLIE (Mausam et al., 2012), it brings in customizations specific to the problem of numerical extraction, such as a modified pattern language, heuristics for generating a high-quality seed set and training data, special processing for non-noun relations, and a novel relation construction step. We now describe BONIE's algorithm in detail.

Generation of Seed Facts
Since open numerical facts are not readily available, we first write a handful of high-precision dependency patterns (see Figure 2 for a list). Each dependency pattern encodes the minimal sub-tree of the dependency parse connecting the relation, quantity and argument in that sentence. BONIE encodes a node in a pattern via '<depLabel>#<word>#<POSTag>', where 'depLabel' is the edge connecting the node to its parent, 'word' is the word at the node, 'POSTag' is its part of speech tag; '#' is a delimiter separating them. {rel}, {arg} and {quantity} in the patterns are placeholders for relation, argument and quantity headwords, respectively. BONIE generates seed facts by parsing the corpus and matching seed patterns with the parse. In case of a successful match, a seed fact of the form (arg headword; relation headword; quantity; unit) is generated. Argument and relation headwords are extracted directly from the parse. For the other two, it uses Illinois Quantifier (Roy et al., 2015), which returns both the quantity and unit separately.
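The node encoding described above can be illustrated with a minimal sketch. The names PatternNode, parse_node and node_matches are hypothetical and are not taken from BONIE's code. The sketch also assumes that the POS field is a regular expression (as the '.+' in the partial pattern shown later suggests) and that an empty depLabel matches any edge; neither assumption is confirmed by the text.

```python
import re
from typing import NamedTuple

# Placeholders that stand for a headword rather than a literal word.
PLACEHOLDERS = {"{rel}", "{arg}", "{quantity}"}


class PatternNode(NamedTuple):
    dep_label: str  # edge connecting the node to its parent
    word: str       # literal word or a placeholder
    pos_tag: str    # POS tag, treated here as a regular expression (assumption)


def parse_node(encoded: str) -> PatternNode:
    """Split '<depLabel>#<word>#<POSTag>' on the '#' delimiter."""
    parts = encoded.split("#")
    if len(parts) != 3:
        raise ValueError(f"expected three '#'-separated fields, got: {encoded!r}")
    return PatternNode(*parts)


def node_matches(node: PatternNode, dep_label: str, word: str, pos_tag: str) -> bool:
    """Check one parse node against one pattern node."""
    # Assumption: an empty depLabel (e.g. at the pattern root) matches any edge.
    label_ok = node.dep_label == "" or node.dep_label == dep_label
    # A placeholder accepts any word; a literal word must match exactly.
    word_ok = node.word in PLACEHOLDERS or node.word == word.lower()
    pos_ok = re.fullmatch(node.pos_tag, pos_tag) is not None
    return label_ok and word_ok and pos_ok
```

For instance, parse_node("dobj#{quantity}#.+") yields a node that accepts any word on a dobj edge, whatever its POS tag.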
Since seed facts form the basis of our training task, they must be as clean as possible, so BONIE adds several filters to reduce noise. It considers a seed fact valid only when the quantity node in the pattern is within some quantity span given by Illinois Quantifier. It also rejects any fact whose argument is not a proper noun.
After these filters, BONIE gets high-precision extractions, but not necessarily good seeds: many seeds are generic and may easily match unrelated sentences. E.g., (Michael; drove; 20; kms) isn't a good seed, since 'Michael' isn't specific, and could erroneously match sentences mentioning another Michael with some unrelated reference to a 20 km drive. To improve the set, BONIE checks for the presence of a seed fact in the Yago KB (Hoffart et al., 2013) and keeps only those that are common. Since Yago has many numerical facts for height, area, latitude, GDP, etc., this gives BONIE a diverse set of clean facts for further training.
Finally, some numerical facts may be expressed without using a nominal relation word. BONIE uses WordNet (Miller, 1995) to generate new seeds from such seed facts using the derivationally related noun form of the relation headword. For example, (Brown; tall; 13; inches) gets transformed to (Brown; height; 13; inches), which gets added as a seed fact.

Bootstrapping
Similar to OLLIE, BONIE finds sentences that contain all words in a seed fact and generates (sentence, fact) pairs. But unlike OLLIE, BONIE has quantities and units, and matching them as words isn't appropriate. Illinois Quantifier performs an internal normalization for both, e.g., it changes 'dollars' and '$' to 'US$', and '%' to 'percent'. Since seed facts also have normalized units, we run Illinois Quantifier on candidate sentences and match normalized units directly. Moreover, BONIE maintains a percentage threshold δ to control the allowed difference between the quantity in the sentence and the quantity in the seed fact. Once all constituents of a fact match a sentence, BONIE generates the (sentence, fact) pair.
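The matching step can be sketched as follows. This is an illustration under stated assumptions rather than BONIE's implementation: the function names are hypothetical, the difference is measured relative to the seed value (the text does not say which value is the reference), and the sentence's quantities are assumed to be already extracted and normalized into (value, unit) pairs by a quantity extractor.

```python
def quantities_match(seed_value, sentence_value, delta=0.02):
    """True if the two values differ by at most `delta`, relative to the seed.

    Assumption: the relative difference is computed against the seed value.
    """
    if seed_value == 0:
        return sentence_value == 0
    return abs(sentence_value - seed_value) / abs(seed_value) <= delta


def fact_matches_sentence(seed, tokens, quantities, delta=0.02):
    """Decide whether a sentence can be paired with a seed fact.

    seed:       (arg headword, relation headword, value, normalized unit or None)
    tokens:     the words of the candidate sentence
    quantities: (value, normalized unit or None) pairs found in the sentence
    """
    arg, rel, value, unit = seed
    words = {t.lower() for t in tokens}
    # The argument and relation headwords must both occur as words.
    if arg.lower() not in words or rel.lower() not in words:
        return False
    # Units must match exactly; values must match within the threshold.
    return any(
        u == unit and quantities_match(value, v, delta) for v, u in quantities
    )
```

With the seed fact (India; population; 1.2 billion; null) and a sentence mentioning 1.25 billion, the relative difference is about 4.2%, so under this sketch the pair is generated with δ=5% but not with δ=2%.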

Open Pattern Learning
For each (sentence, fact) pair, BONIE parses the sentence, and replaces the argument and relation words of the fact with '{arg}' and '{rel}' placeholders. For quantity and unit words, BONIE replaces the one at a higher level in the parse with '{quantity}'. The minimal path containing '{arg}', '{rel}' and '{quantity}' is learned as a pattern. Since quantity and unit are typically expected to remain close to each other in a sentence, BONIE rejects all such patterns where the distance between them exceeds a certain threshold value.
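As a rough illustration of the minimal path, the sketch below collects the nodes connecting the three placeholder nodes through their lowest common ancestor. The function names are hypothetical, and the dependency parse is hand-written for illustration (words double as node identifiers); it is not the output of the parser used in this work.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def minimal_subtree(targets, parent):
    """Nodes of the smallest connected subtree containing all `targets`."""
    paths = [path_to_root(t, parent) for t in targets]
    # Ancestors (or the nodes themselves) shared by every target.
    shared = set(paths[0]).intersection(*paths[1:])
    # The lowest common ancestor is the first shared node on any upward path.
    lca = next(n for n in paths[0] if n in shared)
    nodes = set()
    for path in paths:
        for n in path:
            nodes.add(n)
            if n == lca:
                break
    return nodes


# Hand-written parse of "India has a population of 1.2 billion" (child -> parent).
PARENT = {
    "has": None,
    "India": "has",
    "population": "has",
    "a": "population",
    "of": "population",
    "billion": "of",
    "1.2": "billion",
}

# {arg} = India, {rel} = population, {quantity} = billion
subtree = minimal_subtree(["India", "population", "billion"], PARENT)
```

Here the subtree keeps 'has' and 'of' as connecting nodes and drops 'a' and '1.2', which lie off the connecting paths.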
Some patterns are learned with specific words, such as 'contains' in the (partial) example <(#contains#verb)<(dobj#{quantity}#.+)<... Since we believe that such a pattern should work with all inflections and synonyms of 'contain', BONIE uses WordNet to expand the pattern by including all inflections and synset synonyms. Each pattern is scored based on the number of times it is learned from the data.

Constructing Extractions
After matching a pattern to a new sentence, similar to OLLIE, arg/rel phrases are completed by expanding the extracted head nouns on poss, det, num, neg, amod, quantmod, nn, and pobj edges. If one of the children of the argument headword is connected by a prep, rcmod or partmod edge, the whole subtree under that child is extracted. The quantity phrase is extracted by Illinois Quantifier, but if any sibling node of the quantity node is connected by a prep edge with the word 'of', BONIE expands the entire subtree below it. This allows "10 percent of 100 dollars" to be included in the quantity phrase.
Relation Phrase Construction: Whenever the relation headword is an adjective or an adverb, BONIE uses WordNet to get its derivationally related noun form, and that becomes the new relation. This transforms the tuple (Donald Trump; old; about 70 years) to (Donald Trump; has age of; about 70 years).
Sometimes, sentences don't use a numerical relation word because the relation is obvious from the units. E.g., sentence #4 on page 1 expresses the 'area' relation implicitly. BONIE infers these implicit relations using the unit analysis in UnitTagger (Sarawagi and Chakrabarti, 2014). Whenever BONIE sees a unit (sq kms) being mistreated as a relation, it uses UnitTagger to infer the relation from the unit and post-processes the extraction accordingly. As a result, the extraction changes from (James Valley; has sq kms of; 5 of fruit orchards) to (James Valley; has area of fruit orchards; 5 sq kms).
Finally, in cases when a plural noun relation word also appears as a unit in the quantifier, BONIE hypothesizes that it is a count extraction, and prepends 'number of' to the relation headword and removes the unit from the quantity phrase. E.g., (Microsoft; has employees; 100,000 employees) from sentence #2 becomes (Microsoft; has number of employees; 100,000).
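The count rule can be sketched as below. The function name and its signature are hypothetical, and the use of Penn Treebank tags (NNS, NNPS) to detect a plural noun is an assumption; the text does not state how plurality is determined.

```python
def apply_count_rule(extraction, rel_head, rel_pos, unit):
    """Rewrite a count extraction; return the tuple unchanged otherwise.

    extraction: (argument phrase, relation phrase, quantity phrase)
    rel_head:   relation headword
    rel_pos:    POS tag of the relation headword (Penn Treebank style, assumed)
    unit:       unit reported by the quantity extractor, or None
    """
    arg, rel_phrase, qty_phrase = extraction
    is_plural_noun = rel_pos in ("NNS", "NNPS")
    if not (is_plural_noun and unit and rel_head.lower() == unit.lower()):
        return extraction
    # Prepend 'number of' to the relation headword.
    new_rel = rel_phrase.replace(rel_head, f"number of {rel_head}", 1)
    # Drop the unit from the quantity phrase.
    new_qty = " ".join(t for t in qty_phrase.split() if t.lower() != unit.lower())
    return (arg, new_rel, new_qty)


before = ("Microsoft", "has employees", "100,000 employees")
after = apply_count_rule(before, "employees", "NNS", "employees")
```

Applied to the example above, the sketch returns (Microsoft; has number of employees; 100,000), and it leaves extractions such as (India; has population of; 1.2 billion) untouched.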

Experiments
We build BONIE over data from ClueWeb12, filtered so as to keep only the sentences that contain numbers. We further remove those where the quantity represents a date, time, or duration, and where the quantity is accompanied by document words like 'Section', 'Table', or 'Figure'. We use the dependency parser from ClearNLP. We generate about 21,000 seed facts from roughly 20 million numerical sentences. These are matched against 7 million numerical sentences, obtaining about 18,500 (sentence, fact) pairs.
We tried different values of δ (the matching threshold) and found that the results are not sensitive to it as long as δ stays in the range of 2% to 5%. We therefore set δ=2% for the final evaluation. The distance threshold between the quantity and unit mentioned in Section 2.3 is set to 3, based on our general understanding of parse trees.
BONIE learns around 7,000 new patterns. Since pattern frequency is a good indicator of pattern quality (Wu and Weld, 2010), we rank the patterns by frequency and take the top 1,000 patterns for further analysis. We find that almost all patterns beyond the top 1,000 are learned only once or twice on our training set. We ignore all patterns beyond the top 1,000 so that each retained pattern has a support of at least three.
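Frequency-based ranking of this kind can be sketched with a counter over the learned patterns. The function names and the toy pattern identifiers are hypothetical; only the idea of ranking by frequency and keeping the top k comes from the text.

```python
from collections import Counter


def rank_patterns(learned, top_k=1000):
    """Return the `top_k` most frequent patterns as (pattern, support) pairs.

    learned: one entry per (sentence, fact) pair, naming the pattern it produced
    """
    return Counter(learned).most_common(top_k)


def minimum_support(ranked):
    """Support of the least frequent pattern that was kept."""
    return ranked[-1][1] if ranked else 0


# Toy input: pattern identifiers are placeholders, not real BONIE patterns.
ranked = rank_patterns(["p1", "p2", "p1", "p3", "p1", "p2"], top_k=2)
```

Checking minimum_support on the retained list is one way to confirm that a chosen cutoff leaves every pattern with the desired support.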
We sample a random testset of 2,000 numerical sentences from ClueWeb12 (not used in training). Two annotators with NLP experience annotate each extraction for correctness. We obtain an inter-annotator agreement of 97%, and report the results on the subset where both annotators agree.
Since there are no open numerical extractors available, we compare BONIE against an Open IE system and a closed numerical IE system.
We also perform an additional ablation study to evaluate the value of each component. The seed patterns by themselves have a significantly higher precision but a much smaller yield. This is expected, since the seeds must be highly precise for bootstrapping. If Yago matching and the other seed filtering heuristics are turned off, the precision of the system goes down drastically due to a very noisy bootstrapped set. If the post-processing of relation phrase construction is turned off, there is a 5 point precision loss and about a 7% yield reduction, due to some incorrectly extracted tuples that post-processing would otherwise correct. Finally, WordNet-based expansion yields a marginal increase in yield and a slight precision loss.
Open IE 4.2 associates a confidence value with each extraction; ranking the extractions by this value generates a precision-yield curve. For BONIE, we rank the patterns so that the seed patterns are at the top, followed by the learned patterns ordered by frequency. Figure 3 reports the curves for both systems, and we find that BONIE has a larger area under the curve than Open IE 4.2.
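A precision-yield curve over a ranked list of extractions can be computed as in the sketch below. It assumes that yield means the number of correct extractions output so far, which is a common convention in Open IE evaluation but is not defined in the text; the function name is hypothetical.

```python
def precision_yield_curve(labels):
    """Compute (yield, precision) after each extraction in ranked order.

    labels: correctness judgments (True/False), best-ranked extraction first
    """
    curve = []
    correct = 0
    for seen, is_correct in enumerate(labels, start=1):
        correct += int(is_correct)
        # Yield = correct extractions so far; precision = correct / seen.
        curve.append((correct, correct / seen))
    return curve


curve = precision_yield_curve([True, True, False, True])
```

Each point corresponds to cutting the ranked list at that position, so a system whose curve stays higher for longer has a larger area under the curve.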
Estimating recall in Open IE is difficult since it requires annotators to exhaustively tag all open extractions in a sentence. To get an estimate, an author manually tagged 100 sentences with all numerical extractions. We find that BONIE's recall is about 48%. Two-thirds of missed recall is because of missing conjuncts. E.g., it misses the tuple relating retirement age with 68 years in "The retirement age for men is 65 years and 68 years for women." Other missed recall is due to complexity of sentences or inaccuracy of parsers.

Conclusions
We release BONIE, the first open numerical relation extractor, and other resources for further research. BONIE is based on bootstrapping and pattern learning and follows previous similar works such as OLLIE. However, for effective bootstrapping and training, it implements various customizations specific to numerical relations in the curation of the seed fact set, the matching of sentences, and the construction of the relation phrase at extraction time. BONIE significantly outperforms both open non-numerical IE and closed numerical IE systems, with 1.5x yield and a 15 point precision gain over a state-of-the-art Open IE system. We find that better conjunction processing is an important future step for improving BONIE's recall even further.