EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference

Quantitative reasoning is a higher-order reasoning skill that any intelligent natural language understanding system can reasonably be expected to handle. We present EQUATE (Evaluating Quantitative Understanding Aptitude in Textual Entailment), a new framework for quantitative reasoning in textual entailment. We benchmark the performance of 9 published NLI models on EQUATE, and find that on average, state-of-the-art methods do not achieve an absolute improvement over a majority-class baseline, suggesting that they do not implicitly learn to reason with quantities. We establish a new baseline, Q-REAS, that manipulates quantities symbolically. In comparison to the best-performing NLI model, it achieves success on numerical reasoning tests (+24.2%), but has limited verbal reasoning capabilities (-8.1%). We hope our evaluation framework will support the development of models of quantitative reasoning in language understanding.


Introduction
Numbers play a vital role in our lives. We reason with numbers in day-to-day tasks ranging from handling currency to reading news articles to understanding sports results, elections, and stock markets. As numbers are used to communicate information accurately, reasoning with them is an essential core competence in understanding natural language (Levinson, 2001; Frank et al., 2008; Dehaene, 2011). A benchmark task in natural language understanding is natural language inference (NLI), also known as recognizing textual entailment (RTE) (Cooper et al., 1996; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al., 2006), wherein a model determines if a natural language hypothesis can be justifiably inferred from a given premise. Making such inferences often necessitates reasoning about numbers.
Consider the example: P: With 99.6% of precincts counted, Dewhurst held 48% of the vote to 30% for Cruz. H: Lt. Gov. David Dewhurst fails to get 50% of primary vote.
To conclude that the hypothesis is inferable, a model must reason that since 99.6% of the precincts are counted, even if all the remaining precincts vote for Dewhurst, he would still fail to get 50% of the primary vote. Scant attention has been paid to building datasets that evaluate this reasoning ability.
Quantitative Reasoning in NLI
Our interpretation of "quantitative reasoning" draws from cognitive testing and education (Stafford, 1972; Ekstrom et al., 1976), which consider it a "verbal problem-solving ability". While inextricably linked to mathematics, it is an inclusive skill involving everyday language rather than a specialized lexicon. To excel at quantitative reasoning, one must interpret quantities expressed in language, perform basic calculations and judge their accuracy, and justify quantitative claims using both verbal and numeric reasoning.
Based on these requirements, natural language inference lends itself as a test bed for the study of quantitative reasoning. Conversely, the ability to reason quantitatively is important for NLI (Sammons et al., 2010; Clark, 2018). Motivated by this interplay, we present the EQUATE (Evaluating Quantitative Understanding Aptitude in Textual Entailment) framework.

The EQUATE Dataset
EQUATE consists of five NLI test sets featuring quantities. These sets (Table 2) are drawn from diverse sources and exhibit a wide range of quantitative reasoning phenomena. Some sets are controlled synthetic tests (§3.2, §3.4) that examine model ability to handle phenomena such as quantifiers, approximations, or arithmetic reasoning. EQUATE also includes tests featuring text from news articles and social media (§3.3, §3.5, §3.6) to examine reasoning about quantities expressed verbally in the wild. Two main restrictions are imposed during test creation. First, we remove all sentences requiring temporal reasoning, since specialized knowledge is needed to reason about time. Second, we focus on sentences containing quantity mentions with numerical values.
(This restriction is not detrimental, but it reduces the probability of observing phenomena such as vague quantification.)

Stress Test
We include the numerical reasoning stress test of Naik et al. (2018) as a sanity check. It requires models to match entities from the hypothesis to the premise, and to reason with quantifiers.

RTE-Quant
This test set is constructed from the RTE sub-corpus for quantity entailment (Roy, 2017), originally drawn from the RTE2-RTE4 datasets (Dagan et al., 2006). The original sub-corpus conflates temporal and quantitative reasoning. Pairs requiring temporal reasoning are discarded, resulting in a set of 166 entailment pairs.

AwpNLI
To evaluate the arithmetic ability of NLI models, we repurpose data from arithmetic word problems (Roy and Roth, 2016), which have a characteristic structure. First, they establish a world and optionally update its state. Then, a question is posed about the world. This structure forms the basis of our pair creation process (Fig 1). World-building and update statements form the premise. A hypothesis template is generated by first identifying modal/auxiliary verbs in the question, and subsequent verbs, which we refer to as secondary verbs. We identify the agent in the sentence and conjugate the secondary verb in the present tense, followed by the identified unit, to form the final template. For every template, the correct guess is used to create an entailed hypothesis. Contradictory hypotheses are generated by randomly sampling a wrong guess (x ∈ Z+ if the correct guess is an integer, and x ∈ R+ if it is real) from a uniform distribution over an interval of 10 surrounding the correct guess (or an interval of 5 for numbers less than 5). We manually examine the dataset for grammaticality, finding only 2% of hypotheses ungrammatical; these are manually corrected, leaving a final test set of 722 pairs.
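As a minimal sketch of the contradictory-guess sampling step described above (the function name and the clamping to positive values are our own assumptions; the narrower interval for small numbers is omitted):

```python
import random

def wrong_guess(correct, width=10):
    """Sample an incorrect answer near the correct one, for building a
    contradictory hypothesis. Integer answers get an integer sample,
    real-valued answers get a real-valued one."""
    lo, hi = correct - width / 2, correct + width / 2
    lo = max(lo, 0)  # answers to arithmetic word problems are positive
    while True:
        if float(correct).is_integer():
            guess = random.randint(int(lo), int(hi))
        else:
            guess = round(random.uniform(lo, hi), 2)
        if guess != correct and guess > 0:
            return guess
```

The rejection loop simply resamples until the guess differs from the correct answer, so the generated hypothesis is guaranteed to be contradictory under event coreference.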

NewsNLI
This test set is created from the CNN corpus (Hermann et al., 2015) of news articles with abstractive summaries. We identify summary points with quantities, filtering out temporal expressions. For each summary point, the two most similar sentences from the article (by Jaccard similarity) are chosen, flipping pairs where the premise begins with a first-person pronoun. The top 50% of similar pairs are retained to avoid lexical overlap bias. We crowdsource annotations for a subset of this data from Amazon Mechanical Turk. To ensure quality, we require that annotators have an approval rate of 95% on at least 100 prior tasks and pass a qualification test. Crowdworkers are shown two sentences, and asked to determine whether the second sentence is definitely true, definitely false, or not inferable given the first. We collect 5 annotations per pair, and consider pairs with the lowest token overlap between premise and hypothesis and the least difference in premise-hypothesis lengths when stratified by entailment label. The top 1000 samples meeting these criteria form our final test set. To validate crowdsourced labels, experts are asked to annotate a subset of 100 pairs. Crowdsourced gold labels match expert gold labels in 85% of cases, while individual crowdworker labels match expert gold labels in 75.8%.

RedditNLI
This test set is sourced from the popular social forum Reddit. Since reasoning about quantities is important in domains like finance and economics, we scrape all headlines from posts on /r/economics, considering titles that contain quantities and do not have meta-forum information. Titles appearing within three days of each other are clustered by Jaccard similarity, and the top 300 pairs are extracted. After filtering out nonsensical titles, such as concatenated stock prices, we are left with 250 sentence pairs. Similar to RTE-Quant, two expert annotators label these pairs, achieving a Cohen's kappa of 0.82. Disagreements are discussed to resolve final labels.
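The pairing-by-similarity step used for both NewsNLI and RedditNLI can be sketched as follows (a minimal illustration; function names are ours, and the three-day windowing is omitted for brevity):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two titles."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def top_pairs(titles, k=300):
    """Rank all candidate title pairs by Jaccard similarity and keep
    the top k non-zero-overlap pairs (a sketch of the clustering step)."""
    pairs = [(jaccard(t1, t2), t1, t2)
             for i, t1 in enumerate(titles)
             for t2 in titles[i + 1:]]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [(t1, t2) for s, t1, t2 in pairs[:k] if s > 0.0]
```

Pairing by surface similarity keeps the two headlines topically aligned, so label differences tend to hinge on the quantities rather than on unrelated content.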

Models
We describe the 9 NLI models used in this study, and our new baseline. The interested reader is invited to refer to the corresponding publications for further details.

Quantity Segmenter
Inspired by Barwise and Cooper (1981), we consider a quantity as having a number, a unit, and an optional approximator. We extract quantity mentions by identifying all least-common-ancestor noun phrases in the constituency parse of the sentence that contain cardinal numbers.
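As a rough illustration of the segmentation step, substituting a token-level regex for the constituency parse (so this is only an approximation of the described method, with the number's right neighbor standing in for the enclosing noun phrase):

```python
import re

# Crude stand-in for parse-based segmentation: match cardinal numbers
# written as digits or as common number words.
CARDINAL = re.compile(
    r'\d[\d,\.]*|one|two|three|four|five|six|seven|eight|nine|ten', re.I)

def quantity_mentions(sentence):
    """Return rough quantity mentions: each cardinal number plus the
    token that follows it (a proxy for the unit-bearing noun phrase)."""
    mentions = []
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if CARDINAL.fullmatch(tok.strip('.,')):
            span = [tok] + tokens[i + 1:i + 2]
            mentions.append(' '.join(span))
    return mentions
```

A constituency parser would instead return the smallest noun phrase dominating the cardinal number, which handles modifiers such as "U.S." between number and head noun.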

Quantity Parser
Our quantity parser constructs a grounded representation, henceforth a NUMSET, for each quantity mention in the premise and hypothesis. A NUMSET can also be a composition of other NUMSETS. A NUMSET consists of (val, unit) tuples with:
1. val ∈ [R, R]: the quantity represented as a range
2. unit ∈ S: the unit noun associated with the quantity
To extract values for a quantity, we extract cardinal numbers, recording contiguity, and normalize the number. We also handle simple ratios such as quarter, half, etc., and extract bounds (e.g., "less than 10 apples" is parsed to [−∞, 10] apples). To extract units, we examine tokens adjacent to cardinal numbers in the quantity mention and identify known units. If no known units are found, we assign the token in a numerical-modifier relationship with the cardinal number; otherwise, we assign the nearest noun to the cardinal number as the unit. A quantity is determined to be approximate if the word in an adverbial-modifier relation with the cardinal number appears in a gazetteer of approximators ('roughly', 'approximately', 'about', 'nearly', 'roundabout', 'around', 'circa', 'almost', 'approaching', 'pushing', 'more or less', 'in the neighborhood of', 'in the region of', 'on the order of', 'something like', 'give or take (a few)', 'near to', 'close to', 'in the ballpark of'). If approximate, the range is extended to ±2% of the current value.

Table 4: Mathematical validity constraints for the ILP framework used in our quantity composition module. Functions op1() and op2() return the left and right operands for an operator respectively.
- First two operands: c_0 + r_0 = 1 and c_1 + r_1 = 1
- Last operator: x_{L−1} ≥ N − 1 (the last operator should be one of {=, ⊆})
- Last operand: x_{L−2} = M − 1 (the last operand should be the hypothesis quantity)
- Other operators: where o_i = 1 and l is the largest index such that l ≤ (i − 2) and d_l = d_i
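The grounded NUMSET representation, bound extraction, and approximator widening can be sketched as follows (a simplified illustration; the class and function names are ours, and only two bound patterns are covered):

```python
from dataclasses import dataclass

@dataclass
class NumSet:
    lo: float          # lower bound of the value range
    hi: float          # upper bound of the value range
    unit: str
    approx: bool = False

    def widened(self, pct=0.02):
        """Extend an approximate quantity's range by +/-2%, the margin
        described above; exact quantities are returned unchanged."""
        if not self.approx:
            return self
        return NumSet(self.lo * (1 - pct), self.hi * (1 + pct),
                      self.unit, True)

def parse_bound(phrase, value, unit):
    """Ground simple bound expressions, e.g. 'less than 10 apples'
    -> the range [-inf, 10] apples. A sketch covering two patterns."""
    if phrase.startswith(('less than', 'under')):
        return NumSet(float('-inf'), value, unit)
    if phrase.startswith(('more than', 'over', 'at least')):
        return NumSet(value, float('inf'), unit)
    return NumSet(value, value, unit)  # exact value: degenerate range
```

Representing every value as a range lets exact quantities, bounds, and approximations share one comparison mechanism downstream.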

Quantity Pruner
The pruner constructs "compatible" premise-hypothesis NUMSET pairs. Consider the pair "Insurgents killed 7 U.S. soldiers, set off a car bomb that killed four Iraqi policemen" and "7 US soldiers were killed, and at least 10 Iraqis died". Our parser extracts NUMSETS corresponding to "four Iraqi policemen" and "7 US soldiers" from the premise and hypothesis respectively. But these NUMSETS should not be compared, as they involve different units. The pruner discards such incompatible pairs. Heuristics to detect unit-compatible NUMSET pairs include direct string match, and synonymy and hypernymy relations from WordNet. Like Roy (2017), we consider two units compatible if one is a nationality or a job (using lists scraped from Wikipedia) and the other unit is synonymous with people, person, or citizen/worker.
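The nationality/job compatibility heuristic can be sketched as follows (a minimal illustration; the word lists here are small hypothetical stand-ins for the Wikipedia-scraped lists, and the WordNet synonymy/hypernymy checks are omitted):

```python
PEOPLE_SYNONYMS = {'people', 'person', 'persons',
                   'citizen', 'citizens', 'worker', 'workers'}
# Hypothetical stand-ins for the scraped nationality and job lists.
NATIONALITIES = {'iraqi', 'iraqis', 'american', 'americans'}
JOBS = {'policeman', 'policemen', 'soldier', 'soldiers'}

def units_compatible(u1, u2):
    """Heuristic unit compatibility: exact string match, or one unit is
    a nationality/job while the other is a 'people' synonym."""
    u1, u2 = u1.lower(), u2.lower()
    if u1 == u2:
        return True
    for a, b in ((u1, u2), (u2, u1)):
        if a in NATIONALITIES | JOBS and b in PEOPLE_SYNONYMS:
            return True
    return False
```

Under this rule "Iraqis" and "people" are comparable, while "soldiers" and "policemen" are not, matching the pruning behavior described above.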

Quantity Composition
The composition module detects whether a hypothesis NUMSET is justified by composing "compatible" premise NUMSETS. Our framework generates postfix arithmetic equations from premise NUMSETS which justify the hypothesis NUMSET; direct comparisons are incorporated by adding "=" as an operator. Note that the set of possible equations is exponential in the number of NUMSETS, making exhaustive generation intractable. A large number of equations are invalid, as they violate constraints such as unit consistency. Thus, our framework uses integer linear programming (ILP) to constrain the equation space. It is inspired by prior work on algebra word problems (Koncel-Kedziorski et al., 2015), with some key differences:
1. Arithmetic equations: We focus on arithmetic equations instead of algebraic equations for NLI.

Table 5: Linguistic consistency constraints (type consistency, type assignment) for the ILP framework used in our quantity composition module. Functions op1() and op2() return the left and right operands for an operator respectively.
2. Range arithmetic: Quantitative reasoning involves ranges, which are handled by representing them as endpoint-inclusive intervals and adding four operators (∪, ∩, \, ⊆).
3. Hypothesis quantity-driven: We optimize an ILP model for each hypothesis NUMSET, because a sentence pair is marked "entailment" iff every hypothesis quantity is justified.
Definitional, syntactic, and operand access constraints ensure mathematical validity, while type and operator consistency constraints add linguistic consistency. Constraint formulations are provided in Tables 4 and 5. We limit tree depth to 3 and retrieve up to 50 solutions per hypothesis NUMSET, then solve to determine whether each equation is mathematically correct. We discard equations which use invalid operations (division by 0) or add unnecessary complexity (multiplication/division by 1). The remaining equation trees are considered plausible justifications.
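The endpoint-inclusive range arithmetic in difference 2 can be illustrated as follows (a sketch covering addition, subtraction, intersection, and the subset test; union and set difference are omitted since they may not yield a single interval):

```python
def interval_add(a, b):
    """[a0, a1] + [b0, b1] = [a0 + b0, a1 + b1]."""
    return (a[0] + b[0], a[1] + b[1])

def interval_sub(a, b):
    """[a0, a1] - [b0, b1] = [a0 - b1, a1 - b0]."""
    return (a[0] - b[1], a[1] - b[0])

def interval_intersect(a, b):
    """Intersection of two closed intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def interval_subset(a, b):
    """a ⊆ b for endpoint-inclusive intervals."""
    return b[0] <= a[0] and a[1] <= b[1]
```

An exact quantity is the degenerate interval [v, v], so for instance the Dewhurst example reduces to checking [48, 48] ⊆ [−∞, 50).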

Global Reasoner
The global reasoner predicts the final entailment label, on the assumption that every NUMSET in the hypothesis must be justified for entailment, as a necessary but not sufficient condition. If any NUMSET in the hypothesis does not have a justification, the label is predicted as neutral, whereas if any NUMSET is contradicted by the premise, the prediction is contradiction. Relying on this intuition, the global reasoner collects justifications from the pruner and composition modules and decides the final entailment label using the procedure described in Algorithm 1.

To test whether models truly reason about quantities, we select entailed pairs containing quantities in both premise and hypothesis, and perturb the quantity in the hypothesis to generate contradictory pairs. For example, the entailed pair P: "In addition to 79 fatalities, some 170 passengers were injured." H: "The crash took the lives of 79 people and injured some 170." is changed to a contradictory pair with the hypothesis "The crash took the lives of 77 people and injured some 170.", assuming scalar implicature and event coreference.
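The decision rule of the global reasoner can be sketched as follows (a simplified stand-in for Algorithm 1; the set-valued inputs are assumed to come from the pruner and composition modules):

```python
def entailment_label(hyp_numsets, justified, contradicted):
    """Decide the entailment label: 'contradiction' if any hypothesis
    NUMSET is contradicted by the premise, 'entailment' only if every
    hypothesis NUMSET has a justification, otherwise 'neutral'."""
    if any(q in contradicted for q in hyp_numsets):
        return 'contradiction'
    if all(q in justified for q in hyp_numsets):
        return 'entailment'
    return 'neutral'
```

Because justification of every hypothesis quantity is necessary but not sufficient, verbal mismatches outside the quantities can still make this prediction wrong, which is exactly the failure mode discussed in the error analysis below.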

Results and Discussion
Our perturbed test set contains 261 pairs. On this set, OpenAI (the best-performing neural model on NewsNLI) achieves an accuracy of 32.33%, as compared to 71.26% on the unperturbed set. This suggests that the model relies on verbal rather than numerical reasoning. In comparison, Q-REAS achieves an accuracy of 67.2% on the perturbed set, compared to 55.93% on the unperturbed set, highlighting its reliance on quantities rather than verbal information. Closer examination reveals that OpenAI switches to predicting the 'neutral' category for perturbed samples instead of entailment, accounting for 51.7% of its errors, possibly symptomatic of lexical bias issues (Naik et al., 2018).
• What Quantitative Phenomena Are Hard? We sample 100 errors made by Q-REAS on each test in EQUATE (Table 7) to identify phenomena not addressed by simple quantity comparison. On natural datasets containing sentences with complex linguistic structure, the segmenter and parser cause most errors (66% on average), indicating that identifying quantities, or parsing them into a representation, is more difficult on these datasets. Conversely, the composition module has a higher error rate on synthetic data (24.5%) than on natural data (4.7%). Our analysis of the causes of error suggests avenues for future research:
1. Incorporating real-world knowledge: Lack of real-world knowledge causes errors in identifying quantities and valid comparisons. Errors include the inability to map abbreviations to correct units (e.g., "m" to "meters"), to detect part-whole coreference (e.g., "seats" can be used to refer to "buses"), or to correctly resolve hypernymy/hyponymy (e.g., "young men" to "boys").

[Algorithm 1: Decision procedure of the global reasoner.]

2. Inferring underspecified quantities: A quantity can be implicitly specified, requiring inference to generate a complete representation. Consider "A mortar attack killed four people and injured 80". A system must infer that the quantity "80" refers to people. On RTE-Quant, 20% of such cases stem from zero anaphora, a hard problem even in coreference resolution.

3. Arithmetic comparison limitations: These examples require composition between incompatible quantities. For example, consider the pair "There were 3 birds and 6 nests" and "There were 3 more nests than birds". To correctly label this pair, "3 birds" and "6 nests" must be composed.
4. Integrating verbal reasoning: No model integrates complex verbal and quantitative reasoning. For example, consider the pair "Two people were injured in the attack" and "Two people perpetrated the attack". The quantities "two people" and "two people" are unit-compatible, but must not be compared. Numbers and language are intricately interleaved, and developing a reasoner capable of handling this complex interplay is challenging.

Conclusion
In this work, we present EQUATE, an evaluation framework to estimate the ability of models to reason quantitatively in textual entailment. We observe that existing neural approaches rely on the verbal aspect of the task rather than reasoning about quantities. We also present Q-REAS, a baseline that reasons symbolically about quantities; while it achieves some success at numerical reasoning, it lacks sophisticated verbal reasoning capabilities, indicating the complexity of the inference task. We believe a promising avenue is to combine the strengths of neural models and specialized reasoners in hybrid architectures, though it remains unclear how this is best achieved. We hope our insights, and the EQUATE evaluation framework, lead to the development of models that can more precisely reason about quantities in natural language.
Baseline implementation performances on MultiNLI-Dev Matched.All reimplementations closely match performance reported in the original publications.

Figure 1: The construction of the AwpNLI dataset.

Table 1: Examples drawn from four evaluation sets in the EQUATE framework.
RTE-QUANT P: After the deal closes, Teva will generate sales of about $7 billion a year, the company said. H: Teva earns $7 billion a year.
AWP-NLI P: Each of farmer Cunningham's 6048 lambs is either black or white and there are 193 white ones. H: 5855 of Farmer Cunningham's lambs are black.
NEWSNLI P: With 99.6% of precincts counted, Dewhurst held 48% of the vote to 30% for Cruz. H: Lt. Gov. David Dewhurst fails to get 50% of primary vote.
REDDITNLI P: Oxfam says richest one percent to own more than rest by 2016. H: Richest 1% To Own More Than Half World's Wealth By 2016 Oxfam.

Table 3: Input, output, and variable definitions for the ILP framework used in our quantity comparator.

Table 6: Accuracies (%) of 9 NLI models on five tests for quantitative reasoning in entailment. M and D represent models and datasets respectively. ∆ captures the improvement over the majority-class baseline for a dataset. Column Avg. reports the average accuracy (%) of each model across all 5 evaluation sets in EQUATE.