Representing Numbers in NLP: a Survey and a Vision

NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by over a dozen previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.


Introduction
Numbers are an integral part of text. To understand a simple sentence like I woke up at 11, we need not just literacy but also numeracy. We must decode the string 11 to the quantity 11 and infer 11 to denote a time of the day, probably 11 a.m. We need commonsense to reason that 11 a.m. is quite late in the morning. This interpretation of 11 is strongly contextual, as I earn $11 per month evokes different units and value expectations. Note how the semantics remains the same for both sentences if 11 were replaced by 10, i.e., the context is tolerant to some variability.
Numbers are everywhere. Reasoning with quantities and counts is crucial to understanding the world. Evolutionary learning has given numerical cognition skills to several animals, including human beings (Dehaene, 2011). Our ancient ancestors furthered numeracy by developing multiple number systems, similar to but independent from the evolution of languages. Numeracy is an essential skill for language understanding, since numbers are often interspersed in text: the 6 million pages in English Wikipedia have over 150 million numbers.
Numbers are neglected. In NLP, however, numbers are either filtered out explicitly during preprocessing (Graff et al., 2003), or treated the same as words, often collapsing them into an UNK token. Subword tokenization approaches like BPE (Sennrich et al., 2016) and WordPiece (Wu et al., 2016) instead retain numbers, but split them into arbitrary tokens; for example, 1234 might be split into two tokens as 12-34, 123-4, or 1-234.
Recent work has shown that these are suboptimal number representations (Wallace et al., 2019; Zhang et al., 2020). On the DROP Question Answering benchmark, BERT performs five times worse when the answer is a number instead of a span of text (Dua et al., 2019). Relatively simple strategies like switching from subword to char-level tokenization (Geva et al., 2020), or from decimal to scientific notation (Zhang et al., 2020) already boost performance. Such results warrant a deeper study into the best number representations.
Numbers are important. Given the ubiquity of numbers and their fundamental differences with words, enabling NLP systems to represent them effectively is beneficial for domains like scientific articles (Spithourakis and Riedel, 2018) and financial documents (Chen et al., 2019; Jiang et al., 2020). Number understanding is also useful to detect sarcasm (Dubey et al., 2019) and to model dialogues involving price negotiations (Chawla et al., 2020).
Recent NLP progress towards numeracy has been sporadic but encouraging. In this paper, we survey prior work and highlight the kind of numeracy targeted (e.g., arithmetic, measurement, numeration) as well as the kind of representation used (e.g., value embeddings, DigitRNNs). We provide the first NLP-centric taxonomy of numeracy tasks (Section 2) and of number representations (Section 3) for the reader to succinctly comprehend the challenge posed by numeracy. We synthesize key takeaways (Section 5) and propose a unifying vision for future research (Section 6).
Table 1: Seven numeracy tasks, arranged along the axes of granularity (rows: exact vs approximate) and units (columns: abstract vs grounded). We also list downstream applications requiring a similar granularity of numeracy.

Tasks
There are several different aspects of numeracy. The DROP dataset alone offers a wide variety of numeric reasoning questions such as retrieval-based (How many yards did Brady run?), count-based (How many goals were scored? given a comprehension describing multiple goals), and simple arithmetic (How many years after event 1 did event 2 occur? given dates of both events). Besides downstream applications, there have also been probing experiments to evaluate whether NLP models can decode numbers from strings (e.g., 19 to 19.0), or estimate quantities (e.g., how tall are lions?).
Such a diverse range of abilities is usually referred to collectively as numeracy, which gives rise to confusion. We limit this abuse of terminology and provide a neat taxonomy for arranging the different tasks proposed under numeracy.

Our Taxonomy of Tasks
Drawing from work in cognitive science (Feigenson et al., 2004), we propose the following two dimensions to organize tasks within numeracy:
1. Granularity: whether the encoding of the number is (1) exact, e.g., birds have two legs, or (2) approximate, e.g., Jon is about 180 cm tall.
2. Units: whether the numbers are (1) abstract, e.g., 2+3=5, or (2) grounded, e.g., 2 apples + 3 apples = 5 apples. While abstract mathematical tasks are easy to probe and create artificial datasets for, numbers grounded in units are challenging since they need to be understood in the context of words.

Survey of Existing Tasks
We now describe 7 numeracy tasks, arranged according to our taxonomy in Table 1, as well as downstream tasks (right-most column in the table).
Simple Arithmetic is the task of addition, subtraction, etc. over numbers alone. It is convenient to create synthetic datasets involving such math operations for both masked (Geva et al., 2020) and causal language models (GPT-3; Brown et al., 2020).
Numeration or Decoding refers to the task of mapping a string form to its numeric value, e.g., 19 to 19.0. Within NLP, this task is set up as a linear regressor probe over a (frozen) representation of the string. Numeration has been probed for in static word embeddings, contextualized language models (Wallace et al., 2019), and multilingual number words, e.g., nineteen or dix-neuf (Johnson et al., 2020).
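The probing setup described above can be sketched as a least-squares regressor fit on frozen vectors. This is only an illustrative sketch: the "frozen" embeddings below are synthetic stand-ins (a hypothetical random projection of each value plus noise), not the output of any real language model, and the helper names are our own.

```python
import numpy as np

def add_bias(embeddings):
    # Append a constant column so the probe can learn an intercept.
    return np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])

def fit_linear_probe(embeddings, values):
    # Least-squares fit of a linear map from embeddings to numeric values.
    coef, *_ = np.linalg.lstsq(add_bias(embeddings), values, rcond=None)
    return coef

def probe_predict(coef, embeddings):
    return add_bias(embeddings) @ coef

# Synthetic stand-in for frozen representations of the numbers 1..100
# (assumption: magnitude is linearly recoverable, plus a little noise).
rng = np.random.default_rng(0)
values = np.arange(1.0, 101.0)
direction = rng.normal(size=16)
frozen = np.outer(values, direction) + 0.01 * rng.normal(size=(100, 16))

# Fit the probe on half of the numbers and evaluate on the held-out half.
train, test = slice(0, None, 2), slice(1, None, 2)
coef = fit_linear_probe(frozen[train], values[train])
pred = probe_predict(coef, frozen[test])
rmse = float(np.sqrt(np.mean((pred - values[test]) ** 2)))
print(f"held-out RMSE: {rmse:.4f}")
```

With real embeddings, a low held-out error would suggest that magnitude is linearly decodable from the representation; the embedding model itself stays untouched throughout.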
Magnitude Comparison is the ability to tell which of two (or more) numbers is larger. For language models, this has been probed in an argmax setup (choose the largest of five numbers) as well as a binary classification task, e.g., given 23 and 32, pick the label 1 to indicate that 32 > 23 (Wallace et al., 2019).
Arithmetic Word Problems (AWP) are the grounded version of simple arithmetic that we find in school textbooks, e.g., Mary had two cookies. She gave one away. How many does she have left? There exist several NLP datasets on math word problems (Amini et al., 2019; Saxton et al., 2019; Hendrycks et al., 2021).
Exact Facts in the context of numeracy involves commonsense knowledge such as dice have 6 faces or birds have two legs. An approximate sense of quantity would be of little help here since assertions like dice have 5 faces or birds have three legs are factually incorrect. Two recent datasets for numeric commonsense facts are Numbergame (Mishra et al., 2020) and NumerSense (Lin et al., 2020).
Measurement Estimation is a task in psychology in which subjects are asked to approximately guess measures of objects along certain dimensions, e.g., number of seeds in a watermelon or weight of a telephone (Bullard et al., 2004). VerbPhysics (Forbes and Choi, 2017) is a benchmark of binary comparisons between physical attributes of various objects, e.g., ball $<_{size}$ tiger. DoQ is a web-extracted dataset of Distributions over Quantities, which can be used as a benchmark for language models' measurement estimation abilities (Zhang et al., 2020). Lastly, MC-TACO (Zhou et al., 2020) is a collection of temporal-specific measurement estimates, e.g., going for a vacation spans a few days/weeks.
Numerical Language Modeling in its literal sense is not a task but a setup, analogous to masked/causal language modeling for words. Other tasks could be modeled as numeric language modeling, e.g., arithmetic (2+3=[MASK]) and measurement estimation (lions weigh [MASK] pounds). In practice, numerical language modeling refers to the task of making numeric predictions for completing unlabelled, naturally occurring text.
Word predictions in language modeling are typically evaluated with classification metrics such as accuracy or perplexity. Numeric predictions, on the other hand, are evaluated with regression metrics such as mean absolute error, root mean squared error, or their log and percentage variants (Spokoyny and Berg-Kirkpatrick, 2020). Spithourakis and Riedel (2018) also propose an Adjusted Perplexity metric to cancel the effect of the out-of-vocabulary rate on the perplexity of numeric tokens.
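As a rough illustration of these regression metrics, the sketch below computes mean absolute error, root mean squared error, and one possible log variant and percentage variant. The exact definitions of the log and percentage variants (base-10 logs, error relative to the gold value) are our assumptions here and may differ from those used in the cited work.

```python
import math

def mean_absolute_error(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

def root_mean_squared_error(gold, pred):
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))

def log_mean_absolute_error(gold, pred):
    # Assumed log variant: absolute error between base-10 logarithms,
    # i.e., an error measured in orders of magnitude (positive values only).
    return sum(abs(math.log10(g) - math.log10(p))
               for g, p in zip(gold, pred)) / len(gold)

def mean_absolute_percentage_error(gold, pred):
    # Assumed percentage variant: absolute error relative to the gold value.
    return 100.0 * sum(abs(g - p) / abs(g) for g, p in zip(gold, pred)) / len(gold)

gold = [10.0, 100.0, 1000.0]
pred = [20.0, 100.0, 500.0]
print(mean_absolute_error(gold, pred))             # dominated by the largest number
print(root_mean_squared_error(gold, pred))
print(log_mean_absolute_error(gold, pred))         # treats 10->20 and 1000->500 alike
print(mean_absolute_percentage_error(gold, pred))
```

Note how the linear-scale metrics are dominated by the error on the largest number, whereas the log variant penalizes a factor-of-two miss equally at every magnitude.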
Downstream Applications for numeracy abound. Dubey et al. (2019) detect sarcasm in tweets based on numbers. Chen et al. (2020) identify claims in financial documents using alternative number representations and the auxiliary task of numeral understanding or categorization (Chen et al., 2018). Similarly, simple arithmetic and math word problems serve as auxiliary tasks for GenBERT (Geva et al., 2020) towards improving its score on the DROP QA benchmark.

Other Numeracy Tasks
Here, we describe foundational numeracy-related tasks that cut across our taxonomy of tasks:
(Numeric) Paraphrasing is what we call the task of identifying one-to-one correspondences between different surface forms of the same number. Twelve is the same as '12', also referred to as a dozen. This task cuts across all the tasks we discussed, since the same number, expressed in several different ways, should nevertheless be identified by an NLP model before any subsequent reasoning. Similar to how WordNet (Miller, 1995) provides a huge list of synonyms, numeric paraphrases can be obtained by libraries which convert numerals to words, words to numerals, etc. One could also envision this as a learning task given a large enough corpus, such as the NumGen dataset (Williams and Power, 2010) containing 2000 fact-aligned numeric expressions over 110 articles.
Quantity Entailment tasks, analogous to Natural Language Inference, require understanding of not equivalence (as in paraphrasing) but deeper relations like entailment and contradiction, e.g., the premise he was 16 yrs old entails the hypothesis he was a teenager. On similar lines, Mishra et al. (2020) modify the QuaRel dataset to force models to perform quantity entailment, e.g., dog1 is light, dog2 is heavy is replaced with dog1 weighs 70 lbs, dog2 weighs 90 lbs.
Numeral Understanding is the task of categorizing numbers into percentages, prices, dates, times, quantities, etc. and their respective subcategories (Chen et al., 2018).
Fused-Head Resolution for numbers is essential to ground them when the context is implicit. For example, the sentence I woke up at 11 has a.m. or o'clock as the fused head to be resolved (Elazar and Goldberg, 2019).
Counting is the task of keeping track of discrete instances of some object. When kids count a set of objects, they quickly learn to keep track, say on their fingers, but struggle with realizing the Cardinal Principle, i.e., the last counter value denotes the number of entities being considered (Wynn, 1990). Similarly, LSTMs (Suzgun et al., 2019) and transformers (Bhattamishra et al., 2020) have been shown to possess counting skills, but in order to answer counting questions, they must also learn to map the counts to number words or numerals. Counting tasks have been proposed in computer vision (Testolin et al., 2020) as well as in NLP (Postma et al., 2018; Talmor et al., 2020).
Domain-specific tasks require background knowledge in addition to exact mathematical skills. Numbergame (Mishra et al., 2020) includes questions on Physics (find the distance travelled in 2 hrs by a train moving at 50 mph) and Chemistry (find the mass percentage of H in C6H6). Project Aristo solves elementary and high school science problems, which often involve numeric reasoning.

Methods
Analogous to our taxonomy of subtasks in the previous section, here we attempt to arrange the wide variety of alternative number representations proposed in recent literature. We limit our analysis to methods of encoding (numbers → embeddings) and/or decoding (embeddings → numbers) numbers. We do not discuss, for example, methods that use symbolic reasoning (Andor et al., 2019) or modify activation functions to enhance numeracy (Trask et al., 2018).
A typical example of the base architecture could be BERT (Devlin et al., 2019), the workhorse of modern NLP. We assume that there exists an independent parallel process of mapping words into embeddings, such as subword tokenization followed by lookup embeddings in BERT.

Our Taxonomy
We look at two kinds of representations: string-based and real-based. Real-based representations perform some computation involving the numerical value of the number. The string-based representations instead see numbers in their surface forms; they must assign arbitrary token IDs and look up their embeddings to feed into the architecture.

String Based
By default, language models treat numbers as strings, the same as words. However, within string representations, one could make simple changes:
Notation: The number 80 could be written in Hindu-Arabic numerals (80), Roman numerals (LXXX), scientific notation (8e1), English words (eighty), or with base 20 as in French (quatre-vingts). Nogueira et al. (2021) exclusively study the effect of many such notation choices in language models, on the task of simple arithmetic.
Tokenization: Word level tokenizations are ineffective for numbers, since they are likely to map most numbers to an UNK token, except for a few commonly occurring ones (e.g., 1, 2, 5, 10, 100). Other possibilities are subword tokenizations like BPE and WordPiece, as well as character (or digit) level tokenizations.
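The three tokenization choices can be contrasted with a small sketch. The vocabularies below are toy, hypothetical ones, and the subword routine is a simplified greedy longest-match segmentation in the spirit of WordPiece (it omits continuation markers such as "##" and is not the actual BPE or WordPiece algorithm).

```python
# Toy vocabularies (hypothetical; real vocabularies are learned from a corpus).
WORD_VOCAB = {"1", "2", "5", "10", "100"}
SUBWORD_VOCAB = {str(d) for d in range(10)} | {"12", "34", "100", "123"}

def word_level(number, vocab=WORD_VOCAB):
    # A whole number is either in the vocabulary or collapses to [UNK].
    return [number] if number in vocab else ["[UNK]"]

def digit_level(number):
    # Character-level tokenization: one token per digit.
    return list(number)

def greedy_subword(number, vocab=SUBWORD_VOCAB):
    # Repeatedly take the longest vocabulary entry that prefixes the remainder.
    pieces, start = [], 0
    while start < len(number):
        end = len(number)
        while end > start and number[start:end] not in vocab:
            end -= 1
        if end == start:  # no vocabulary entry matches at this position
            return ["[UNK]"]
        pieces.append(number[start:end])
        start = end
    return pieces

for n in ["100", "1234"]:
    print(n, word_level(n), greedy_subword(n), digit_level(n))
```

Under this toy vocabulary, 1234 is an unknown word at the word level, splits as 123-4 at the subword level, and as 1-2-3-4 at the digit level; a different learned vocabulary would yield a different, equally arbitrary, subword split.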
Pooling: The pooling dimension of variation springs up after analyzing the effect of tokenization. With subword and character level tokenizations, a single number may now correspond to multiple tokens, e.g., 100 segmented into 10-0 or 1-0-0. Prior work (Spithourakis and Riedel, 2018) has argued for using RNNs or CNNs to instead pool the embeddings of these tokens into a single embedding before feeding to the language model. By default, language models see numbers the same way as words, hence no pooling is applied.

Real Based
Real-based number encoders can be expressed as $f : \mathbb{R} \to \mathbb{R}^d$ whereas decoders can be expressed as $g : \mathbb{R}^d \to \mathbb{R}$. Real-based methods proposed in literature can vary on account of direction (whether they encode, decode or both), scale (linear vs log), and discretization (binning vs continuous valued).
Scale: Inspired by cognitive science literature (Dehaene, 2011), several methods have attempted to model numbers in the log (instead of linear) scale, i.e., to perform mathematical operations on the logarithm of the number to be represented. The first operation in a log-scaled f is log(·) and the last operation in a log-scaled g is exp(·). We discuss more scales in the following subsection, such as the stabilized log scale (Jiang et al., 2020) and the learned scale/flow (Spokoyny and Berg-Kirkpatrick, 2020).
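A minimal sketch of a log-scaled encoder and decoder pair follows, under strong simplifying assumptions: the encoder is a single linear layer with fixed random (untrained) weights, the decoder is its least-squares inverse, and only positive numbers are handled. The names encode_log and decode_log are ours, and in practice both maps would be learned.

```python
import numpy as np

D = 8                               # embedding size (arbitrary choice)
rng = np.random.default_rng(0)
W = rng.normal(size=D)              # stand-in for learned encoder weights
b = rng.normal(size=D)              # stand-in for learned encoder bias

def encode_log(x):
    # f: R -> R^d, where the first operation is log(.)
    return W * np.log(x) + b

def decode_log(h):
    # g: R^d -> R, where the last operation is exp(.)
    # Here the linear layer is undone by projecting back onto W.
    log_x = W @ (h - b) / (W @ W)
    return float(np.exp(log_x))

h = encode_log(1234.5)
print(h.shape, decode_log(h))
```

Because the embedding is linear in log(x), multiplying a number by a constant factor shifts its embedding by the same vector regardless of magnitude, which is the sense in which the representation is log-scaled.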

Survey of Existing Methods
Having established dimensions of variance of number representations, we describe some key string-based and real-based methods used in prior work. Table 2 depicts these methods as individual rows, with the first three columns showing their position in our taxonomy (§ 3.1). The last seven columns correspond to the seven tasks (§ 2.2), with each cell denoting a representative work that introduces it.
Notes: Prototype* is encoder-only but reuses embeddings for the decoder (Jiang et al., 2020). GMM** has been discretized (Spithourakis and Riedel, 2018) as well as continuous valued (Spokoyny and Berg-Kirkpatrick, 2020).
GenBERT Geva et al. (2020) present GenBERT, a question answering model with pretrained BERT serving as both its encoder and decoder. GenBERT tokenizes numbers at the digit level, and is finetuned on auxiliary tasks of arithmetic word problems and simple arithmetic.
NumBERT Zhang et al. (2020) pretrain BERT from scratch over a modified dataset such that all numbers have been converted into scientific notation, i.e., 314.1 is expressed as 3141[EXP]2. NumBERT hence follows a scientific notation, subword tokenization, and no pooling.
we refer to as DigitRNN-sci in Table 2), as well as a simpler alternative: exponent embedding. The latter merely learns a lookup embedding for the exponent, completely ignoring the mantissa.
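The scientific-notation rewriting (314.1 to 3141[EXP]2) and the exponent-only view can be sketched as follows. Only the 314.1 example comes from the text above; the treatment of trailing zeros and of negative numbers is our assumption, and the function names are hypothetical.

```python
from decimal import Decimal

def to_scientific(numeral: str) -> str:
    # Drop trailing zeros, then read off the mantissa digits and the
    # power of ten of the most significant digit.
    d = Decimal(numeral).normalize()
    sign, digits, _ = d.as_tuple()
    mantissa = "".join(str(digit) for digit in digits)
    prefix = "-" if sign else ""
    return f"{prefix}{mantissa}[EXP]{d.adjusted()}"

def exponent_of(numeral: str) -> int:
    # The only information an exponent embedding would look up.
    return Decimal(numeral).normalize().adjusted()

for s in ["314.1", "80", "0.05", "1234"]:
    print(s, to_scientific(s), exponent_of(s))
```

Using exponent_of as a lookup key means that 314.1 and 999 would share one embedding, which is exactly the information loss that an exponent embedding accepts by ignoring the mantissa.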

Real-based methods
DICE Deterministic Independent-of-Corpus Embeddings (Sundararaman et al., 2020) is an attempt to handcraft a number encoder f so as to preserve the relative magnitude between two numerals and their embeddings. Given two scalars i and j, and their embeddings f(i) and f(j), the cosine distance between f(i) and f(j) is intended to monotonically increase/decrease with the Euclidean distance between i and j. DICE is offered as not only a deterministic encoding but also as an auxiliary loss function for softly training number embeddings alongside, say, SQuAD (Rajpurkar et al., 2016).
Value Embedding The most intuitive parameterized encoder for real numbers is one that feeds the scalar magnitude of the number through a shallow neural network. The converse of value embedding is to learn a shallow neural network mapping $g : \mathbb{R}^d \to \mathbb{R}$. This decoder is simply the probe used for the decoding/numeration task.
The idea of projecting number magnitudes into an NLP model that otherwise inputs only lookup embeddings may appear flawed. But Vaswani et al. (2017) have (rather successfully) encoded positional information into transformers using both learned embeddings (similar to Value) and fixed ones (similar to DICE).
Log Value Wallace et al. (2019) also experiment with a log-scaled value encoder in addition to the one on a linear scale. Zhang et al. (2020) experiment with a log value decoder for measurement estimation, which they call the RGR (regress) method. Log scaling has a neuroscientific inspiration since observations of human (and animal) understanding of numbers are better modelled by a log-scale representation (Dehaene, 2011).
Log Laplace In contrast to the point estimate output of the RGR decoder, models can also be used to parameterize a distribution over numbers. Such a formulation is helpful when estimating approximate quantities. Vectors representing some context can be used to parameterize, say, the mean and variance of a Gaussian or Laplace distribution. Spokoyny and Berg-Kirkpatrick (2020) instead transform the space being modeled by parameterizing the location parameter of a Log-Laplace distribution L(X, 1) where X is the context representation of unmasked tokens, in a masked (numerical) language modelling setup. When inferring or decoding a number, they sample a point z~L(X, 1) and exponentiate it, such that the output is exp(z).
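The Log-Laplace decoding step can be illustrated with a short sketch: draw z from a Laplace distribution with unit scale, then exponentiate. The scalar location below is a hypothetical stand-in for the value that a model would emit from the context representation.

```python
import numpy as np

def sample_log_laplace(location, rng, size=None):
    # z ~ Laplace(location, 1), then the decoded number is exp(z).
    z = rng.laplace(loc=location, scale=1.0, size=size)
    return np.exp(z)

rng = np.random.default_rng(0)
location = np.log(100.0)  # hypothetical model output for some masked number
samples = sample_log_laplace(location, rng, size=100_000)
print(samples.min() > 0, float(np.median(samples)))
```

Every sample is positive, and the median of the samples sits near exp(location), so the location parameter can be read as the log of the decoder's typical prediction.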
Flow Laplace The expressivity of number decoders can be expanded or contracted by merely parameterizing a different distribution. Spokoyny and Berg-Kirkpatrick (2020) propose a more expressive decoder where instead of the log scale, the model learns its own density mapping. After sampling z~L(X, 1), the output is transformed by a learned mapping whose parameters a, b, and c are also emitted by the same model.
MCC or multi-class classification is another number decoder which outputs a distribution, but a discrete one: over log-scaled bins of numbers, e.g., 1-10, 10-100, and so on (Zhang et al., 2020). Previously described decoders either output a point estimate or a unimodal distribution, thus failing to hedge their predictions for a multimodal ground truth. Given a masked number prediction problem We went to the restaurant at [MASK] p.m., MCC is better equipped to estimate two peaks: one around lunch time (say, 1-2 p.m.) and another around dinner (say, 7-9 p.m.).
Discrete Latent Exponent (DExp) is another potentially multimodal distribution (Spokoyny and Berg-Kirkpatrick, 2020) where the model parameterizes a multinomial distribution for the exponent (similar to MCC) and uses it to sample an exponent e, which then acts as a latent variable for emitting the mean µ of a Gaussian (standard deviation fixed at 0.05). This Gaussian is finally used to sample the output number z ~ N(µ, 0.05).
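The log-scaled bins used by MCC (1-10, 10-100, and so on) can be sketched as below. The clipping of out-of-range values to the first or last bin and the helper names are our assumptions, and the toy values are invented for illustration.

```python
import math

def mcc_bin(value, num_bins):
    # Bin k covers [10^k, 10^(k+1)); values outside the range are clipped.
    k = math.floor(math.log10(value))
    return min(max(k, 0), num_bins - 1)

def empirical_bin_distribution(values, num_bins):
    # Discrete target distribution over bins, as a classifier would be trained on.
    counts = [0] * num_bins
    for v in values:
        counts[mcc_bin(v, num_bins)] += 1
    return [c / len(values) for c in counts]

# Toy ground truth with two clusters of magnitudes (invented numbers).
observed = [3.0, 5.0, 4000.0, 6000.0]
print(empirical_bin_distribution(observed, num_bins=5))
```

The resulting distribution places mass on two non-adjacent bins, a shape that a categorical output can represent directly but that a point estimate or a single unimodal density cannot.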
GMM Another attempt to circumvent the unimodal Gaussians or point estimates is to learn a Gaussian mixture model. Spithourakis and Riedel (2018) learn a mixture of K Gaussians by pretraining their means ($\mu_i$) and variances ($\sigma_i^2$) over the training corpus with Expectation Maximization algorithms, while the mixing weights $\pi_i$ are derived from the model. Next, to sample a single number from the GMM probability mass function $q(u) = \sum_{i=1}^{K} \pi_i \mathcal{N}(u; \mu_i, \sigma_i)$, the authors first sample the precision (number of decimal places) from yet another Gaussian and use that to discretize the probability mass function into equal sized bins, over which the probabilities are summed. If the sampled precision is, say 2, then the probability of emitting a number 3.14 is given by $\int_{3.135}^{3.145} q(u)\,du$. This likelihood estimate is used to train a causal language model. Spokoyny and Berg-Kirkpatrick (2020)'s GMM implementation is slightly different: it alters the last inference step by sampling directly from the mixture of Gaussians, as they did with Log Laplace, Flow Laplace, and DExp.
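The discretization step can be made concrete with a short sketch that integrates a Gaussian mixture over the bin implied by a precision, so that 3.14 at precision 2 corresponds to the interval from 3.135 to 3.145. The mixture parameters below are toy values chosen for illustration, not taken from any trained model, and the function names are ours.

```python
import math

def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def gmm_bin_probability(number, precision, weights, means, stds):
    # Probability mass of the bin [number - h, number + h], with
    # h = 0.5 * 10^(-precision), under the mixture of Gaussians.
    half_width = 0.5 * 10.0 ** (-precision)
    low, high = number - half_width, number + half_width
    return sum(w * (normal_cdf(high, m, s) - normal_cdf(low, m, s))
               for w, m, s in zip(weights, means, stds))

# Toy two-component mixture (illustrative values only).
weights, means, stds = [0.7, 0.3], [3.0, 100.0], [1.0, 20.0]
print(gmm_bin_probability(3.14, 2, weights, means, stds))
print(gmm_bin_probability(3.0, 0, weights, means, stds))
```

A lower precision widens the bin and therefore raises the probability of the emitted numeral, which is why the precision must be chosen before the mass of any particular surface form is defined.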
GMM-prototype by Jiang et al. (2020) similarly pretrains (with EM/hard-EM) the means and the variances, but also the mixture weights $\pi_i$, of a GMM over the training corpus. They then learn K prototype embeddings $e_i$ corresponding to the K Gaussians. When encoding a new numeral n, its (input) embedding is calculated as a weighted combination of the prototype embeddings $e_i$, where the weights $w_i$ are induced from the GMM.
Thus the difference between GMM and GMM-prototype is that after fixing the means and standard deviations of the Gaussian mixtures, in GMM the model learns to predict the mixture weights $\pi_i$ for each individual number prediction, whereas in GMM-prototype, the $\pi_i$ are frozen and the model learns prototype embeddings $e_i$. Note that prototype embeddings are encoder-only. To decode numbers, the authors implement weight-sharing across input and output embeddings, similar to how word vectors are trained (Mikolov et al., 2013), i.e., finding out which of the numerals in the corpus has the closest embedding.
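One way to picture the prototype encoder is sketched below. It rests on two assumptions that are ours rather than confirmed details of the cited method: that the weights are the posterior responsibilities of each Gaussian for the numeral, and that the embedding is the correspondingly weighted sum of prototype embeddings. All parameter values are toy ones.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def prototype_weights(n, mix, means, stds):
    # Assumed form: posterior probability of each component given the numeral n.
    joint = mix * gaussian_pdf(n, means, stds)
    return joint / joint.sum()

def prototype_embedding(n, mix, means, stds, prototypes):
    # Assumed form: weighted sum of the K prototype embeddings.
    return prototype_weights(n, mix, means, stds) @ prototypes

# Toy frozen GMM with K = 2 components and 2-dimensional prototype embeddings.
mix = np.array([0.5, 0.5])
means = np.array([0.0, 100.0])
stds = np.array([10.0, 10.0])
prototypes = np.array([[1.0, 0.0],
                       [0.0, 1.0]])

print(prototype_embedding(0.0, mix, means, stds, prototypes))
print(prototype_embedding(50.0, mix, means, stds, prototypes))
```

A numeral close to one component's mean is embedded almost exactly as that prototype, while a numeral halfway between the two means receives an even blend. For numerals far from every component the densities can underflow to zero, so a more careful version would compute the weights in log space.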
SOM-prototype GMM-prototype, in effect, merely uses the mixture of Gaussians to infer prototypes and to get the weights $w_i$. Jiang et al. (2020) tried another variant by identifying prototype numerals with Self Organizing Maps (Kohonen, 1990) and by defining the weights as $w_i = |g(x_i) - g(n)|^{-1}$, where $x_i$ is the ith prototype, n is the number to be encoded, and g is a log-based squashing function.

Results
Having organized the landscape of numeracy tasks and methods, we now present some key results for each numeracy task in NLP from previously published experiments over a subset of the described number representations:
Abstract Probes Word Embeddings vastly outperform random embedding baselines on abstract probes such as numeration, magnitude comparison, and sorting (Wallace et al., 2019). DICE, Value and Log Value embeddings excel at these probes, which makes intuitive sense given that they explicitly encode the numbers' magnitude, although Value embeddings do not easily extrapolate to larger numbers, possibly due to instability in training. The best number encoders with respect to these probes were found to be DigitCNNs, and character-tokenized models, e.g., ELMo, in general outperform subword ones, e.g., BERT (Wallace et al., 2019).
Arithmetic GPT-3 (Brown et al., 2020) performs extremely well at zero-shot simple arithmetic, as long as the number of digits in the operands is low. The tokenization scheme could be the cause for limited extrapolation, since language models get better at arithmetic when numbers are tokenized at the digit/character level (Nogueira et al., 2021; Wallace et al., 2019). For arithmetic word problems, state of the art solvers rely on predicting an equation, which is then filled in with specific numeric values from the question (Patel et al., 2021), altogether bypassing the need for encoding numbers into embeddings.
Masked Language Modelling Zhang et al. (2020) show that BERT pretrained over datasets where numbers are in scientific notation (NumBERT) converges to the same loss as BERT on the masked language modelling objective, and scores nearly the same on GLUE language understanding benchmarks. For (causal) numeric language modelling, Spithourakis and Riedel (2018) show that Gaussian Mixture Models are the best decoders. For (masked) numeric language modelling, Spokoyny and Berg-Kirkpatrick (2020) show that modelling the mantissa in scientific notation may be overkill, since exponent embeddings alone outperform DigitRNN-sci over financial news and scientific articles.
Measurement Estimation Zhang et al. (2020) train a regression probe to predict measurements of objects over the CLS embeddings of BERT/NumBERT. Given a template-lexicalized sentence such as "the dog is heavy," the model must predict the weight of a typical dog, against ground truth from the Distribution over Quantities dataset. They find that NumBERT is a better text encoder than BERT for measurement estimation, the only difference between them being the notation used by the respective pretraining corpora. They also experiment with two number decoders: MCC (multi-class classification) and RGR (regression / Log Value embedding). MCC performs better when trying to predict Distributions over Quantities (perhaps due to the ground truth resembling the predicted Gaussians) but not on VerbPhysics, where the ground truth is less noisy. Lastly, even static word embeddings like GloVe have been shown to contain enough knowledge of measurement estimates to contrast two objects, e.g., classifying whether a car is bigger/heavier/faster than a ball (Goel et al., 2019).
Exact Facts BERT and RoBERTa capture limited numerical commonsense, evident over NumerSense (Lin et al., 2020) sentences such as a tricycle has [MASK] wheels, with the answer choices limited to the integers 0-10. Results can be further improved by finetuning over a Wikipedia-extracted dataset of numeric information. Mishra et al. (2020) find commonsense question answering to be one of the hardest tasks in their Numbergame challenge, using the NumNetv2 model (Ran et al., 2019) which is commonly used for DROP question answering. Both of these experiments evaluate on exact match metrics, hence it remains to be seen if representing approximate magnitudes yields benefit in modelling numeric facts.

Recommendations
Based on the above results, we now synthesize key insights into a set of directed takeaways to guide practitioners' design of number representations:
Rule of thumb for string-based methods? Scientific notation is superior to decimal notation (Zhang et al., 2020) since models can learn to attend mostly to the exponent embedding rather than the mantissa (Spokoyny and Berg-Kirkpatrick, 2020). Character level tokenization outperforms subword level (Nogueira et al., 2021; Wallace et al., 2019; Geva et al., 2020). Pooled representations (DigitRNN, DigitCNN) lack a controlled study with unpooled ones (NumBERT, GenBERT) which makes it hard to proclaim a winner among the two.
Rule of thumb for real-based methods? Log scale is preferred over linear scale (Zhang et al., 2020; Jiang et al., 2020; Wallace et al., 2019; Spokoyny and Berg-Kirkpatrick, 2020), which makes intuitive sense but lacks as rigorous a study as has been undertaken in the cognitive science community (Feigenson et al., 2004). Regarding discretization, Zhang et al. (2020) show that binning (dense cross entropy loss) works better than continuous value prediction (MAE loss) on datasets where ground truth distributions are available. Lastly, modeling continuous predictions is notoriously hard for large ranges (Wallace et al., 2019) but Spithourakis and Riedel (2018) offer a way of binning such distributions by picking a precision level.
Encoding vs Decoding numbers? In our simplified discussions above, we avoid differentiating between methods for encoding and decoding numbers. Value Embedding, for instance, can be used to encode numbers (projecting scalars onto vector space) as well as to decode numbers (collapsing a vector into a scalar). On the other hand, manually-designed encoders like DICE are not easily reversible into decoding methods. Even with reversible methods, the encoders and decoders must usually be independently parameterized, unlike the input and output word embeddings which often share weights (Press and Wolf, 2016). Prototype embeddings by Jiang et al. (2020) are an exception, which share input/output embeddings for a fixed vocabulary of numbers.
Can we mix-and-match multiple methods? Given the wide range of number representations, an obvious next step is to try an ensemble of embeddings. Spokoyny and Berg-Kirkpatrick (2020) show that for encoding numbers, exponent embeddings added to DigitRNN (scientific notation) embeddings barely outperform the exponent embeddings alone. Similar experiments with a mix of real and string methods are yet to be seen.
Which methods for which tasks? Based on our taxonomy of tasks in Table 1, abstract tasks are good early probes for the grounded ones, e.g., finetuning GenBERT (Geva et al., 2020) on simple arithmetic helps it do well on downstream question answering, and the high scores of DICE (Sundararaman et al., 2020) on numeration and magnitude comparison are an indicator of similar boosts on (numeric) language modelling. With respect to granularity, real-based methods work well for approximate tasks such as measurement estimation and language modeling (Zhang et al., 2020;Spokoyny and Berg-Kirkpatrick, 2020) but not for exact tasks like arithmetic word problems or commonsense. DigitRNNs are broad-purpose number encoders, whereas distribution modeling methods like DExp are effective at decoding numbers.

Vision for Unified Numeracy in NLP
Numeracy is a core system of human intelligence (Kinzler and Spelke, 2007). Teaching numeracy to students works best when taught holistically, while less effective teachers deal with areas of mathematics discretely (Askew and Askew, 1997). While the NLP community genuinely strives to improve language models' numeric skills, not all aspects of numeracy have been sufficiently targeted. It is evident from the sparsity in Table 2 that the community is far from achieving a holistic solution to numeracy. In this section, we outline our vision for such a unified solution, in the form of three prerequisites to consider for numerical NLU:
Evaluation. The first step towards a holistic solution to numeracy requires a benchmark covering its different subtasks. Aggregated leaderboards in NLP like GLUE (Wang et al., 2018) and SuperGLUE have incentivized research on natural language understanding, with scores categorized into semantic, syntactic, logical, and background knowledge.
An analogous leaderboard could be constructed to evaluate models on numeric reasoning tasks, categorized according to the skills evaluated, e.g., exact vs approximate granularity, or abstract vs grounded numeracy. Numbergame (Mishra et al., 2020) is one such aggregation focusing on exact numeracy benchmarks, as evaluated by F1 and exact match scores in a reading comprehension setup. Both Numbergame and our own list of tasks (Section 2.2) are preliminary attempts at teasing apart the different aspects of numeracy. We encourage researchers to extend and refine such taxonomies.
A suite of numeracy tasks, matched with evaluations of their respective numerical skills, can enable testing model generalization from one skill to another. Some progress has already been made in this transfer learning setup, e.g., GenBERT (Geva et al., 2020), finetuned on a synthetic dataset of arithmetic problems, is found to score higher on DROP QA. Similarly, DICE (Sundararaman et al., 2020), optimized for numeration, improves the score on the Numeracy600K order-of-magnitude prediction task. Going forward, we need several such studies, ideally one for each pair of tasks, to see whether some numeracy skills help models generalize to others.
Design Principles. Number representations vary based on design trade-offs between inductive biases and data-driven variance. The default BERT setup, with subword tokenization and lookup embeddings, occupies the variance end of the spectrum, allowing freedom in representing numbers. Value embeddings and DICE encodings, on the other hand, are closer to the bias end of the spectrum, since the inductive bias of continuity on the number line constrains the learning space. It is important to identify where on the bias-variance scale any representation stands, for a fair comparison.
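The two ends of this spectrum can be illustrated with a minimal sketch. It is not the implementation of any cited method: `lookup_embedding` stands in for a free per-token table (random and untrained here), while `angle_embedding` is a hypothetical fixed map, loosely in the spirit of value-based encodings, that places a number on a half circle so that numerically close values stay close. The function names and the range [0, 100] are assumptions made for illustration.

```python
import math
import random


def lookup_embedding(vocab, dim=2, seed=0):
    """Variance end: one free vector per token (random and untrained here)."""
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}


def angle_embedding(x, lo=0.0, hi=100.0):
    """Bias end: a fixed map from a value to a point on a half circle.

    The range [lo, hi] is an illustrative assumption; values outside it
    are clipped to the nearest endpoint.
    """
    x = min(max(x, lo), hi)
    theta = math.pi * (x - lo) / (hi - lo)
    return [math.cos(theta), math.sin(theta)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# With the fixed map, 10 is closer to 11 than to 90 by construction.
near = cosine(angle_embedding(10), angle_embedding(11))
far = cosine(angle_embedding(10), angle_embedding(90))

# With a lookup table, any such ordering has to be learned from data.
table = lookup_embedding(["10", "11", "90"])
```

In the fixed map, the similarity ordering is built in before any training, which is the constraint on the learning space described above; the lookup table carries no such guarantee.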
Following parallel work in cognitive science, the community could explore whether exact and approximate numeracy require two specialized modules (Feigenson et al., 2004) or could be handled with a single representation (Cordes et al., 2001).
Model designers must also make a choice on coverage: whether to target a broad or a narrow range of numbers to be represented. Multi-class classification (Zhang et al., 2020) over a fixed number of bins restricts the range of numbers expressed, as do DICE embeddings (Sundararaman et al., 2020). Value embeddings are continuous and theoretically unrestricted, but in practice must be capped for bug-free training. On the other hand, string-based representations could always fall back to subword/char-level token embeddings to represent not only floats but also irrational (√2) and complex (1 + 2ι) numbers. The Quantity-Value Representation format was introduced to allow closed and open ranges alongside scalar point numbers.
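The coverage trade-off can be made concrete with a small sketch of the three options. The helper names, the choice of order-of-magnitude bins, and the cap of 1e6 are all assumptions for illustration and are not taken from the cited systems.

```python
def magnitude_bin(x, num_bins=8):
    """Classification over a fixed number of bins (here: orders of magnitude).

    Anything at or beyond 10**(num_bins - 1) collapses into the last bin,
    so the expressible range is restricted by design.
    """
    if x < 1:
        return 0
    order = len(str(int(x))) - 1  # number of integer digits, minus one
    return min(order, num_bins - 1)


def capped_value(x, cap=1e6):
    """Continuous value input, clipped to [-cap, cap] (cap is hypothetical)."""
    return min(max(x, -cap), cap)


def char_tokens(text):
    """String-based fallback: one token per character, no range limit."""
    return list(text)


print(magnitude_bin(1234))    # order of magnitude 3
print(magnitude_bin(10**12))  # clipped to the last bin, 7
print(capped_value(5e9))      # clipped to the cap
print(char_tokens("1+2i"))    # a surface form that no bin or cap can cover
```

The binned and capped variants both lose information outside their range, whereas the character-level view keeps every surface form at the cost of leaving the notion of magnitude to be learned.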
Broader Impact. Numbers are ubiquitous in natural language and are easily identified, at least in numeral forms. But they are by no means the only class of ordered concepts required for natural language understanding. Successful number representations can inspire work on incorporating more continuous domains into natural language processing systems. For instance, gradable adjectives like good, great, amazing, etc. are arguably on some cardinal scale, which can be mapped using value embeddings or Gaussian mixture models (Sharp et al., 2018; de Marneffe et al., 2010). Days of the week (Mon-Sun) and months of a year (Jan-Dec) form periodic patterns which can be modeled with sinusoidal functions (Martinez et al., 2020).
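As a minimal sketch of the periodic case, a day or month index can be mapped to a point on the unit circle so that the end of a cycle lands next to its start. This is a generic sine/cosine encoding and is not claimed to be the formulation used in the cited work; `cyclic_encoding` and `distance` are hypothetical helper names.

```python
import math


def cyclic_encoding(index, period):
    """Map a position in a cycle to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * (index % period) / period
    return (math.sin(angle), math.cos(angle))


def distance(u, v):
    """Euclidean distance between two 2-D points."""
    return math.hypot(u[0] - v[0], u[1] - v[1])


# Days of the week, with Mon=0 ... Sun=6.
mon, tue, thu, sun = (cyclic_encoding(d, 7) for d in (0, 1, 3, 6))

# Sunday is as close to Monday as Monday is to Tuesday, which a plain
# integer index (6 vs 0) would not capture.
wrap_gap = distance(sun, mon)
step_gap = distance(mon, tue)
mid_gap = distance(mon, thu)
```

The same function covers months by setting `period=12`.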
Lastly, numeracy is essential for natural language understanding. Consider the sentence: "Programmers earn $200,000 versus $100,000 for researchers." An intelligent agent with numeracy skills would identify that $100k is half of $200k, that $100k possibly denotes annual salary, and infer that higher salaries lead to higher standards of living. In short, it would learn something about the two concepts programmers and researchers, by crossing the continuous semantic space of numbers! The agent could now make use of this knowledge in a number-free situation, e.g., the mask in "She could not afford a car for several years after earning a CS degree because she took a job as a [MASK]" might better be filled with the word researcher than with programmer. A key goal of imparting numeracy to NLP models is to help them understand more about the world, using numbers.

Conclusion
This paper summarizes and contextualizes recent work on numeracy in NLP. We propose the first taxonomy of tasks and methods concerning text-centric numeric reasoning. We highlight key takeaways from the several experiments in the literature, along with caveats and scope for confirming some of the observed trends. We make the case that a holistic solution to numeracy in NLP is lacking, and put forward a set of aspects to consider when working towards one. We draw the following two major conclusions from our study: (1) the default subword segmentation with lookup embeddings used to represent words is clearly suboptimal for numbers; and (2) there are several unanswered research questions on the level of specificity, coverage, and inductive bias needed to holistically solve numeracy.

Acknowledgements
This work was funded by the Defense Advanced Research Projects Agency with award N660011924033. We are grateful for the many suggestions we received during preliminary presentations at MLSS 2020, WeCNLP 2020, and GSS 2020, as well as through email correspondence with Biplav Srivastava, Antoine Bosselut, and Harsh Agarwal. We would like to thank the anonymous NAACL 2021 reviewers (particularly #3) for pointing out blind spots in our submission, which we have tried our best to rectify.

Ethical Considerations
This work revolves around the Hindu-Arabic numeral system and English number words, which are not the only number systems still in use today. We encourage follow-up work to take other number systems into consideration, along the lines of Johnson et al. (2020) and Nefedov (2020).