How Much is 131 Million Dollars? Putting Numbers in Perspective with Compositional Descriptions

How much is 131 million US dollars? To help readers put such numbers in context, we propose a new task of automatically generating short descriptions known as perspectives, e.g."$131 million is about the cost to employ everyone in Texas over a lunch period". First, we collect a dataset of numeric mentions in news articles, where each mention is labeled with a set of rated perspectives. We then propose a system to generate these descriptions consisting of two steps: formula construction and description generation. In construction, we compose formulae from numeric facts in a knowledge base and rank the resulting formulas based on familiarity, numeric proximity and semantic compatibility. In generation, we convert a formula into natural language using a sequence-to-sequence recurrent neural network. Our system obtains a 15.2% F1 improvement over a non-compositional baseline at formula construction and a 12.5 BLEU point improvement over a baseline description generation.


Introduction
When posed with a mention of a number, such as "Cristiano Ronaldo, the player who Madrid acquired for [. . . ] a $131 million" (Figure 1), it is often difficult to comprehend the scale of large (or small) absolute values like $131 million (Paulos, 1988;Seife, 2010). Studies have shown that providing relative comparisons, or perspectives, such as "about the cost to employ everyone in Texas over a lunch period" significantly improves comprehension when measured in terms of memory retention or outlier detection (Barrio et al., 2016).  Figure 1: An overview of the perspective generation task: given a numeric mention, generate a short description (a perspective) that allows the reader to appreciate the scale of the mentioned number. In our system, we first construct a formula over facts in our knowledge base and then generate a description of that formula.
Previous work in the HCI community has relied on either manually generated perspectives (Barrio et al., 2016) or present a fact as is from a knowledge base (Chiacchieri, 2013). As a result, these approaches are limited to contexts in which a relevant perspective already exists.
In this paper, we generate perspectives by composing facts from a knowledge base. For example, we might describe $100,000 to be "about twice the median income for a year", and describe $5 million to be the "about how much the average person makes over their lifetime". Leveraging compositionality allows us to achieve broad coverage of numbers from a relatively small collection of familiar facts, e.g. median income and a person's lifetime.
Using compositionality in perspectives is also concordant with our understanding of how people learn to appreciate scale. Jones and Taylor (2009) find that students learning to appreciate scale do so mainly by anchoring with familiar concepts, e.g. $50,000 is slightly less than the median income in the US, and by unitization, i.e. improvising a system of units that is more relatable, e.g. using the Earth as a measure of mass when describing the mass of Jupiter to be that of 97 Earths. Here, compositionality naturally unitizes the constituent facts: in the examples above, money was unitized in terms of median income, and time was unitized in a person's lifetime. Unitization and anchoring have also been proposed by Chevalier et al. (2013) as the basis of a design methodology for constructing visual perspectives called concrete scales.
When generating compositional perspectives, we must address two key challenges: constructing familiar, relevant and meaningful formulas and generating easy-to-understand descriptions or perspectives. We tackle the first challenge using an overgenerate-and-rank paradigm, selecting formulas using signals from familiarity, compositionality, numeric proximity and semantic similarity. We treat the second problem of generation as a translation problem and use a sequence-tosequence recurrent neural network (RNN) to generate perspectives from a formula.
We evaluate individual components of our system quantitatively on a dataset collected using crowdsourcing. Our formula construction method improves on F 1 over a non-compositional baseline by about 17.8%. Our generation method improves over a simple baseline by 12.5 BLEU points.

Problem statement
The input to the perspective generation task is a sentence s containing a numeric mention x: a span of tokens within the sentence which describes a quantity with value x.value and of unit x.unit. In Figure 1, the numeric mention x is "$131 million", x.value = 1.31e8 and x.unit = $. The output is a description y that puts x in perspective.
We have access to a knowledge base K with numeric tuples t = (t.value, t.unit, t.description).  people, etc.). The first step of our task, described in Section 4, is to construct a formula f over numeric tuples in K that has the same value and unit as the numeric mention x. A valid formula comprises of an arbitrary multiplier f.m and a sequence of tuples f.tuples. The value of a formula, f.value, is simply the product of the multiplier and the values of the tuples, and the unit of the formula, f.unit, is the product of the units of the tuples. In Figure 1, the formula has a multiplier of 1 and is composed of tuples 1 , 2 and 3 ; it has a value of 1.3e8 and a unit of $.
The second step of our task, described in Section 5, is to generate a perspective y, a short noun phrase that realizes f . Typically, the utterance will be formed using variations of the descriptions of the tuples in f.tuples.

Dataset construction
We break our data collection task into two steps, mirroring formula selection and description generation: first, we collect descriptions of formulas constructed exhaustively from our knowledge base (for generation), and then we use these descriptions to collect preferences for perspectives (for construction).
Collecting the knowledge base. We manually constructed a knowledge base with 142 tuples and 9 fundamental units 1 from the United States Bu-reau of Statistics, the orders of magnitude topic on Wikipedia and other Wikipedia pages. The facts chosen are somewhat crude; for example, though "the cost of an employee" is a very context dependent quantity, we take its value to be the median cost for an employer in the United States, $71,000. Presenting facts at a coarse level of granularity makes them more familiar to the general reader while still being appropriate for perspective generation: the intention is to convey the right scale, not necessarily the precise quantity.
The values and units of the numeric mentions in each sentence were normalized and converted to fundamental units (e.g. from miles to length). We then randomly selected up to 200 mentions of each of the 9 types in bins with boundaries 10 −3 , 1, 10 3 , 10 6 , 10 9 , 10 12 leading to 4,931 mentions that are stratified by unit and magnitude. 2 Finally, we chose mentions which could be described by at least one numeric expression, resulting in the 2,041 mentions that we use in our experiments ( Figure 2). We note that there is a slight bias towards mentions of money and people because these are more common in the news corpus.
Generating formulas. Next, we exhaustively generate valid formulas from our knowledge base. We represent the knowledge base as a graph over units with vertices and edges annotated with tuples ( Figure 3). Every vertex in this graph is labeled with a unit u and contains the set of tuples with this unit: {t ∈ K : t.unit = u}. Additionally, for every vertex in the graph with a unit of the form u 1 /u 2 , where u 2 has no denominator, we add an edge from u 1 /u 2 to u 1 , annotated with all tuples of type u 2 : in Figure 3 we add an edge from money/person to money annotated with the three person tuples in Table 1. The set of formulas with unit u is obtained by enumerating all paths in the graph which terminate at the vertex u. The multiplier of the formula is set so that the value of were well represented in the corpus.
2 Some types had fewer than 200 mentions for some bins.  Table 1.
the formula matches the value of the mention. For example, the formula in Figure 1 was constructed by traversing the graph from money/time/person to money: we start with a tuple in money/time/person (cost of an employee) and then multiply by a tuple with unit time (time for lunch) and then by unit person (population of Texas), thus traversing two edges to arrive at money. Using the 142 tuples in our knowledge base, we generate a total of 1,124 formulas sans multiplier.
Collecting descriptions of formulas. The main goal of collecting descriptions of formulas is to train a language generation system, though these descriptions will also be useful while collecting training data for formula selection. For every unit in our knowledge base and every value in the set {10 −7 , 10 −6 . . . , 10 10 }, we generated all valid formulas. We further restricted this set to formulas with a multiplier between 1/100 and 100, based on the rationale that human cognition of scale sharply drops beyond an order of magnitude (Tretter et al.,Figure 4: A screenshot of the crowdsourced task to generate natural language descriptions, or perspectives, from formulas. 2006). In total, 5000 formulas were presented to crowdworkers on Amazon Mechanical Turk, with a prompt asking them to rephrase the formula as an English expression (Figure 4). 3 We obtained 5-7 descriptions of each formula, leading to a total of 31,244 unique descriptions.
Collecting data on formula preference. Finally, given a numeric mention, we ask crowdworkers which perspectives from the description dataset they prefer. Note that formulas generated for a particular mention may differ in multiplier with a formula in the description dataset. We thus relax our constraints on factual accuracy while collecting this formula preference dataset: for each mention x, we choose a random perspective from the description dataset described above corresponding to a formula whose value is within a factor of 2 from the mention's value, x.value. A smaller factor led to too many mentions without a valid comparison, while a larger one led to blatant factual inaccuracies. The perspectives were partitioned into sets of four and displayed to crowdworkers along with a "None of the above" option with the following prompt: "We would like you to pick up to two of these descriptions that are useful in understanding the scale of the highlighted number" (Figure 5). A formula is rated to be useful by simple majority. 4 Figure 6 provides a summary of the dataset collected, visualizing how many formulas are useful, controlling for the size of the formula. The exhaustive generation procedure produces a large number of spurious formulas like "20 × trash generated in the US × a minute × number of employees on Medicare". Nonetheless, compositional formulas are quite useful in the appropriate context; Table 2 presents some mentions with highly rated perspectives and formulas.

Formula selection
We now turn to the first half of our task: given a numeric mention x and a knowledge base K, select a formula f over K with the same value and unit as the mention. It is easy to generate a very large number of formulas for any mention. For the example, "Cristiano Ronaldo, the player who Madrid acquired for [. . . ] $131 million.", the small knowledge base in Table 1 can generate the 12 different formulas, 5 including the following: 1. 1 × the cost of an employee × the population of Texas × the time taken for lunch.
2. 400 × the cost of an employee × average household size × a week.
3. 1 × the cost of an employee × number of employees at Google × a week.
4. 1 × cost of property in the Bay Area × area of a city block.
Some of the formulas above are clearly worse than others: the key challenge is picking a formula that will lead to a meaningful and relevant perspective.
Criteria for ranking formulas. We posit the following principles to guide our choice in features (Table 3).

Sentence
That's about . . . Formula The Billings-based Stillwater Mining produced 601,000 ounces of platinum.
4 times the weight of an elephant.
4 × weight of an elephant.
Authorities estimate there are about 60 million guns in Yemen.
twice the gun ownership of the population of Texas 2 × gun ownership × population of Texas Water is flowing into Taihu lake at a rate of 150 cubic meters per second.
how much water would flow from a tap left on for a week.
rate of flow of water from tap × a week The bank had held auctions, selling around US$1 billion worth of three-month bills.
half the cost of employing the population of Texas for a work day.   Number of formulas Formulas of length 1 Formulas of length 2 Formulas of length 3 Figure 6: A histogram comparing formula length to ratings of usefulness (clipped for readability). Non-compositional perspectives with a single tuple are broadly useful. Useful compositional perspectives tend to be more context-specific than non-compositional ones, and many of the formulas that can be generated from the knowledge base are spurious.
Proximity: A numeric perspective should be within an order of magnitude of the mentioned value. Conception of scale quickly fails with quantities that exceed "human scales" (Tretter et al., 2006): numbers that are significantly away from 1/10 and 10. We use this principle to prune formulas with multipliers not in the range [1/100, 100] (e.g. example 2 above) and introduce features for numeric proximity.

Familiarity I[t]
142 Compatibility I [t, t ] 20022 Similarity wvec(s) wvec(t.description) 1 Table 3: Feature templates used to score a formulas f and their counts (#), where f.m is the formula's multiplier and t, t ∈ f.tuples are tuples in the formula.
Familiarity: A numeric perspective should be composed of concepts familiar to the reader. The most common technique cited by those who do well at scale cognition tests is reasoning in terms of familiar objects (Tretter et al., 2006;Jones and Taylor, 2009;Chevalier et al., 2013). Intuitively, the average American reader may not know exactly how many people are in Texas, but is familiar enough with the quantity to effectively reason using Texas' population as a unit. On the other hand, it is less likely that the same reader is familiar with even the concept of Angola's population.
Of course, because it is so personal, familiarity is difficult to capture. With additional information about the reader, e.g. their location, it is possible to personalize the chosen tuples (Kim et al., 2016). Without this information, we back off to a global preference on tuples by using indicator features for each tuple in the formula.  Compatibility: Similarly, some tuple combinations are more natural ("median income × a month") while others are less so ("weight of a person × population of Texas"). We model compatibility between tuples in a formula using an indicator feature.
Similarity: A numeric perspective should be relevant to the context. Apart from helping with scale cognition, a perspective should also place the mentioned quantity in appropriate context: for example, NASA's budget of $17 billion could be described as 0.1% of the United States' budget or the amount of money it could cost to feed Los Angeles for a year. While both perspectives are appropriate, the former is more relevant than the latter.
We model context relevance using word vector similarity between the tuples of the formula and the sentence containing the mention as a proxy for semantic similarity. Word vectors for a sentence or tuple description are computed by taking the mean of the word vectors for every non-stop-word token. The word vectors at the token level are computed using word2vec (Mikolov et al., 2013).
Evaluation. We train a logistic regression classifier using the features described in Table 3 using the perspective ratings collected in Section 3. Recall that the formula for each perspective in the dataset is assigned a positive ("useful") label if it was labeled to be useful to the majority of the workers. Table 5a presents results on classifying formulas as useful with a feature ablation. 6 Familiarity and compatibility are the most useful features when selecting formulas, each having a significant increase in F 1 over the proximity baseline. There are minor gains from combining these two features. On the other hand, semantic similarity does not affect performance relative to the baseline. We find that this is mainly due to the disproportionate number of unfamiliar formulas present in the dataset that drown out any signal. Table 4 presents two examples of the system's ranking of formulas.

Perspective generation
Our next goal is to generate natural language descriptions, also known as perspectives, given a formula. Our approach models the task as a sequence-to-sequence translation task from formulas to natural language. We first describe a rulebased baseline and then describe a recurrent neural network (RNN) with an attention-based copying mechanism (Jia and Liang, 2016).
Baseline. As a simple approach to generate perspectives, we just combine tuples in the formula with the neutral prepositions of and for, e.g. "1/5th of the cost of an employee for the population of Texas for the time taken for lunch." Sequence-to-sequence RNN. We use formulaperspective pairs from the dataset to create a sequence-to-sequence task: the input is composed using the formula's multiplier and descriptions of its tuples connected with the symbol '*'; the output is the perspective (Figure 7).
Our system is based on the model described in Jia and Liang (2016). Given a sequence of input tokens (x = (x i )), the model computes a contextdependent vector (b = (b i )) for each token using a bidirectional RNN with LSTM units. We then generate the output sequence (y j ) left to right as follows. At each output position, we have a hidden state vector (s j ) which is used to produce an "attention" distribution (α j = (α ji )) over input tokens: α ji = Attend(s j , b i ). This distribution is used to generate the output token and update the hidden state vector. To generate the token, we ei- (a) the formula construction system. Precision, Recall and F1 are crossvalidated on 10-folds. * significant F1 versus P and S with p < 0.01. + significant F1 versus P, S and F with p < 0.01. † significant F1 versus P, S, F and C with p < 0.05.

System
Train BLEU (b) the description generation system. * significant BLEU score versus the baseline with p < 0.01.   Figure 7: We model description generation as a sequence transduction task, with input as formulas (at bottom) and output as perspectives (at top). We use a RNN with an attention-based copying mechanism.
ther sample a word from the current state or copy a word from the input using attention. Allowing our model to copy from the input is helpful for our task, since many of the entities are repeated verbatim in both input and output. We refer the reader to Jia and Liang (2016) for more details.
Evaluation. We split the perspective description dataset into a training and test set such that no formula in the test set contains the same set of tuples as a formula in the training set. 7 Table 5b compares the performance of the baseline and sequence-to-sequence RNN using BLEU. 7 Note that formulas with the same set of tuples can occur multiple times in the either the training or test set with different multipliers.
The sequence-to-sequence RNN performs significantly better than the baseline, producing more natural rephrasings. Table 6 shows some output generated by the system (see Table 6).

Human evaluation
In addition to the automatic evaluations for each component of the system, we also ran an end-toend human evaluation on an independent set of 211 mentions collected using the same methodology described in Section 3. Crowdworkers were asked to choose between perspectives generated by our full system (LR+RNN) and those generated by the baseline of picking the numerically closest tuple in the knowledge base (BASELINE). They could also indicate if either both or none of the shown perspectives appeared useful. 8 Table 7 summarizes the results of the evaluation and an error analysis conducted by the authors. Errors were characterized as either being errors in generation (e.g. Table 6) or violations of the criteria in selecting good formulas described in Section 4 ( Table 7c). The other category mostly contains cases where the output generated by LR+RNN appears reasonable by the above criteria but was not chosen by a majority of workers. A few of the mentions shown did not properly describe a numeric quantity, e.g. ". . . claimed responsibility for a 2009 gun massacre . . . " and were labeled invalid mentions. The most common error is the selection of a formula that is not contextually relevant to the mentioned text because no such Input formula Generated perspective 7 × the cost of an employee × a week 7 times the cost of employing one person for one week 1/10 × the cost of an employee × the population of California × the time taken for a football game one tenth the cost of an employee during a football game by the population of California 1 × coffee consumption × a minute × population of the world the amount of coffee consumed in one minute on the world 6 × weight of a person × population of California six times the weight of the people who is worth  one half of an area of an average farm Table 8: Examples of perspectives generated by our system that frame the mentioned quantity to be larger or smaller (top to bottom) than initially the authors thought.
formula exists within the knowledge base (within an order of magnitude of the mentioned value): a larger knowledge base would significantly decrease these errors.

Related work and discussion
We have proposed a new task of perspective generation. Compositionality is the key ingredient of our approach, which allows us synthesize information across multiple sources of information. At the same time, compositionality also poses problems for both formula selection and description generation.
On the formula selection side, we must compose facts that make sense. For semantic compatibility between the mention and description, we have relied on simple word vectors (Mikolov et al., 2013), but more sophisticated forms of semantic relations on larger units of text might yield better results (Bowman et al., 2015).
On the description generation side, there is a long line of work in generating natural language descriptions of structured data or logical forms Wong and Mooney (2007); Chen and Mooney (2008); Lu and Ng (2012); Angeli et al. (2010). We lean on the recent developments of neural sequence-to-sequence models (Sutskever et al., 2014;Bahdanau et al., 2014;Luong et al., 2015). Our problem bears some similarity to the semantic parsing work of Wang et al. (2015), who connect generated canonical utterances (representing logical forms) to real utterances.
If we return to our initial goal of helping people understand numbers, there are two important directions to explore. First, we have used a small knowledge base, which limits the coverage of perspectives we can generate. Using Freebase (Bol-lacker et al., 2008) or even open information extraction (Fader et al., 2011) would dramatically increase the number of facts and therefore the scope of possible perspectives.
Second, while we have focused mostly on basic compatibility, it would be interesting to explore more deeply how the juxtaposition of facts affects framing. Table 8 presents several examples generated by our system that frame the mentioned quantities to be larger or smaller than the authors originally thought. We think perspective generation is an exciting setting to study aspects of numeric framing (Teigen, 2015).
Reproducibility All code, data, and experiments for this paper are available on the CodaLab platform at https: //worksheets.codalab.org/worksheets/ 0x243284b4d81d4590b46030cdd3b72633/.