Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers

Numeracy is the ability to understand and work with numbers. It is a necessary skill for composing and understanding documents in clinical, scientific, and other technical domains. In this paper, we explore different strategies for modelling numerals with language models, such as memorisation and digit-by-digit composition, and propose a novel neural architecture that uses a continuous probability density function to model numerals from an open vocabulary. Our evaluation on clinical and scientific datasets shows that using hierarchical models to distinguish numerals from words improves a perplexity metric on the subset of numerals by 2 and 4 orders of magnitude, respectively, over non-hierarchical models. A combination of strategies can further improve perplexity. Our continuous probability density function model reduces mean absolute percentage errors by 18% and 54% in comparison to the second best strategy for each dataset, respectively.

Numeracy and literacy refer to the ability to comprehend, use, and attach meaning to numbers and words, respectively. Language models exhibit literacy by being able to assign higher probabilities to sentences that are both grammatical and realistic. However, modelling numerals with a categorical distribution over a fixed vocabulary maps all unseen numerals to the same unknown type and ignores the smoothness of continuous attributes, as shown in Figure 1.

Figure 1: Modelling numerals with a categorical distribution over a fixed vocabulary maps all out-of-vocabulary numerals to the same type, e.g. UNK, and does not reflect the smoothness of the underlying continuous distribution of certain attributes.

In that respect, existing work on language modelling does not explicitly evaluate or optimise for numeracy. Numerals are often neglected and low-resourced, e.g. they are often masked (Mitchell and Lapata, 2009), and there are only 15,164 (3.79%) numerals among GloVe's 400,000 embeddings pretrained on 6 billion tokens (Pennington et al., 2014). Yet, numbers appear ubiquitously, from children's magazines (Joram et al., 1995) to clinical reports (Bigeard et al., 2015), and grant objectivity to the sciences (Porter, 1996).
Previous work finds that numerals have higher out-of-vocabulary rates than other words and proposes solutions for representing unseen numerals as inputs to language models, e.g. using numerical magnitudes as features (Spithourakis et al., 2016b,a). Such work identifies that the perplexity of language models on the subset of numerals can be very high, but does not directly address the issue. This paper focuses on evaluating and improving the ability of language models to predict numerals. The main contributions of this paper are as follows:
1. We explore different strategies for modelling numerals, such as memorisation and digit-by-digit composition, and propose a novel neural architecture based on continuous probability density functions.
2. We propose the use of evaluations that adjust for the high out-of-vocabulary rate of numerals and account for their numerical value (magnitude).
3. We evaluate on a clinical and a scientific corpus and provide a qualitative analysis of learnt representations and model predictions. We find that modelling numerals separately from other words can drastically improve the perplexity of LMs, that different strategies for modelling numerals are suitable for different textual contexts, and that continuous probability density functions can improve the LM's prediction accuracy for numbers.

Language Models
Let s_1, s_2, ..., s_L denote a document, where s_t is the token at position t. A language model estimates the probability of the next token given the previous tokens, i.e. p(s_t | s_1, ..., s_{t-1}). Neural LMs estimate this probability by feeding embeddings, i.e. vectors that represent each token, into a Recurrent Neural Network (RNN) (Mikolov et al., 2010).
Token Embeddings Tokens are most commonly represented by a D-dimensional dense vector that is unique for each word from a vocabulary V of known words. This vocabulary includes special symbols (e.g. 'UNK') to handle out-of-vocabulary tokens, such as unseen words or numerals. Let w_s be the one-hot representation of token s, i.e. a sparse binary vector with a single element set to 1 for that token's index in the vocabulary, and E ∈ R^{D×|V|} be the token embeddings matrix. The token embedding for s is the vector e^{token}_s = E w_s.
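As a concrete illustration of the lookup e^{token}_s = E w_s, here is a minimal sketch in Python with NumPy; the toy vocabulary, dimensionality, and random embedding matrix are all hypothetical:

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies hold thousands of types and
# include special symbols for out-of-vocabulary words and numerals.
vocab = ["UNK_word", "UNK_numeral", "the", "patient", "3.5"]
D = 4  # embedding dimension (the paper uses D = 50)

rng = np.random.default_rng(0)
E = rng.normal(size=(D, len(vocab)))  # token embedding matrix E, shape D x |V|

def one_hot(token):
    """One-hot vector w_s: a single 1 at the token's vocabulary index."""
    idx = vocab.index(token) if token in vocab else vocab.index("UNK_word")
    w = np.zeros(len(vocab))
    w[idx] = 1.0
    return w

# The embedding e_s = E w_s reduces to selecting one column of E.
e = E @ one_hot("patient")
```

The matrix product is written out for fidelity to the equation; in practice, frameworks implement it as a direct column (or row) lookup.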
Character-Based Embeddings A representation for a token can be built from its constituent characters (Luong and Manning, 2016; Santos and Zadrozny, 2014). Such a representation takes into account the internal structure of tokens.
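A minimal sketch of a character-based embedding, under the simplifying assumption of mean-pooling the character vectors; the cited work composes characters with convolutional or recurrent networks, and all names and values here are hypothetical:

```python
import numpy as np

# Hypothetical character inventory for numerals: digits, decimal point, EOS.
CHARS = list("0123456789") + [".", "<eos>"]
D = 4
rng = np.random.default_rng(1)
char_emb = {c: rng.normal(size=D) for c in CHARS}

def char_based_embedding(token):
    """Compose a token vector from character vectors by mean pooling.
    Note: mean pooling ignores character order ('12' and '21' coincide);
    an RNN-based composition, as used in the paper, does not."""
    return np.mean([char_emb[c] for c in token], axis=0)

# Any unseen numeral still receives a vector, unlike with a fixed
# token vocabulary that must fall back to an UNK symbol.
v = char_based_embedding("1.5")
```

The order-insensitivity of mean pooling is exactly why recurrent composition is preferred for numerals, where digit position determines magnitude.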

Recurrent and Output Layer
The computation of the conditional probability of the next token involves recursively feeding the embedding e_{s_t} of the current token and the previous hidden state h_{t-1} into a D-dimensional token-level RNN to obtain the current hidden state h_t. The output probability is estimated using the softmax function, i.e.

p(s_t = s | h_t) = exp(ψ(s, h_t)) / Σ_{s'∈V} exp(ψ(s', h_t)),   (1)

where ψ(.) is a score function.
Training and Evaluation Neural LMs are typically trained to minimise the cross entropy on the training corpus:

H = -(1/N) Σ_t log p(s_t | s_1, ..., s_{t-1}),   (2)

A common performance metric for LMs is per-token perplexity, evaluated on a test corpus:

PP = exp(H),   (3)

Perplexity can also be interpreted as a branching factor: the size of an equally weighted distribution with equivalent uncertainty, i.e. how many sides a fair die needs for its rolls to be as uncertain as the model's distribution.
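The cross entropy and its branching-factor reading can be checked numerically; the token probabilities below are made up for illustration:

```python
import math

# Hypothetical probabilities a model assigns to four test-corpus tokens.
token_probs = [0.25, 0.5, 0.125, 0.25]

# Per-token cross entropy (in nats) and perplexity as its exponential.
N = len(token_probs)
cross_entropy = -sum(math.log(p) for p in token_probs) / N
perplexity = math.exp(cross_entropy)

# Branching-factor interpretation: these four predictions carry exactly the
# uncertainty of a fair 4-sided die (the product of probabilities is 2^-8,
# so H = 2 ln 2 and PP = 4).
```

Replacing any probability with a smaller one raises both the cross entropy and the perplexity, which is why rare, poorly modelled numerals can dominate the metric.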

Strategies for Modelling Numerals
In this section we describe models with different strategies for generating numerals and propose the use of number-specific evaluation metrics that adjust for the high out-of-vocabulary rate of numerals and account for numerical values. We draw inspiration from theories of numerical cognition. The triple code theory (Dehaene et al., 2003) postulates that humans process quantities through two exact systems (verbal and visual) and one approximate number system that semantically represents a number on a mental number line. Tzelgov et al. (2015) identify two classes of numbers: i) primitives, which are holistically retrieved from long-term memory; and ii) non-primitives, which are generated online. An in-depth review of numerical and mathematical cognition can be found in Kadosh and Dowker (2015) and Campbell (2005).

Softmax Model and Variants
This class of models assumes that numerals come from a finite vocabulary that can be memorised and retrieved later. The softmax model treats all tokens (words and numerals) alike and directly uses Equation 1 with score function:

ψ(s, h_t) = h_t^⊤ E_out w_s,

where E_out ∈ R^{D×|V|} is an output embeddings matrix. The summation in Equation 1 is over the complete target vocabulary, which requires mapping any out-of-vocabulary tokens to special symbols, e.g. 'UNK_word' and 'UNK_numeral'.

Softmax with Digit-Based Embeddings
The softmax+rnn variant considers the internal syntax of a numeral's digits by adjusting the score function:

ψ(s, h_t) = h_t^⊤ E^{RNN}_out w_s,

where the columns of E^{RNN}_out are composed of character-based embeddings for in-vocabulary numerals and token embeddings for the remaining vocabulary. The character set comprises the digits (0-9), the decimal point, and an end-of-sequence character. The model still requires normalisation over the whole vocabulary, and the special unknown tokens are still needed.
Hierarchical Softmax A hierarchical softmax (Morin and Bengio, 2005a) can help us decouple the modelling of numerals from that of words. The probability of the next token s_t is decomposed into the probability of its class c_t and the probability of the exact token from within that class:

p(s_t | h_t) = p(c_t | h_t) p(s_t | c_t, h_t),  with  p(c_t = word | h_t) = σ(h_t^⊤ b),

where the valid token classes are C = {word, numeral}, σ is the sigmoid function, and b is a D-dimensional vector. Each of the two branches of p(s_t | c_t, h_t) can now be modelled by an independently normalised distribution. The hierarchical variants (h-softmax and h-softmax+rnn) use two independent softmax distributions for words and numerals. The two branches share no parameters, and thus words and numerals are embedded into separate spaces. The hierarchical approach allows us to use any well-normalised distribution to model each of its branches. In the next subsections, we examine different strategies for modelling the branch of numerals, i.e. p(s_t | c_t = numeral, h_t). For simplicity, we abbreviate this to p(s).
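A tiny numerical sketch of the two-branch decomposition; the gate value and per-branch scores are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical class gate p(c_t = word | h_t) = sigma(h_t . b) and
# per-branch scores over toy word and numeral vocabularies.
p_word = sigmoid(0.3)
word_scores = [1.0, 0.2, -0.5]
num_scores = [0.1, 0.9]

# Each branch is independently normalised, so the full distribution
# p(s_t | h_t) = p(c_t | h_t) * p(s_t | c_t, h_t) still sums to one.
p_words = [p_word * p for p in softmax(word_scores)]
p_nums = [(1 - p_word) * p for p in softmax(num_scores)]
total = sum(p_words) + sum(p_nums)
```

Because normalisation is per-branch, the numeral branch can later be swapped for any other normalised model (d-RNN, MoG) without touching the word branch.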

Digit-RNN Model
Let d_1, d_2, ..., d_N be the digits of numeral s. A digit-by-digit composition strategy estimates the probability of the numeral from the probabilities of its digits:

p(s) = ∏_{n=1}^{N} p(d_n | d_1, ..., d_{n-1}).

The d-RNN model feeds the hidden state h_t of the token-level RNN into a character-level RNN (Graves, 2013; Sutskever et al., 2011) to estimate this probability. This strategy can accommodate an open vocabulary, i.e. it eliminates the need for an UNK_numeral symbol, as the probability is normalised one digit at a time over the much smaller vocabulary of digits (0-9, the decimal separator, and end-of-sequence).
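The chain-rule factorisation over digits can be sketched as follows; the uniform digit distribution stands in for a trained character-level RNN and is purely illustrative:

```python
# Digit alphabet: ten digits, the decimal separator, and end-of-sequence.
ALPHABET = list("0123456789") + [".", "<eos>"]

def digit_probs(prefix):
    """Stand-in for the character-level RNN: p(d_n | d_1..d_{n-1}).
    A real model also conditions on the token-level hidden state h_t;
    here we return a fixed uniform distribution for the sketch."""
    return {ch: 1.0 / len(ALPHABET) for ch in ALPHABET}

def numeral_prob(numeral):
    """p(s) = prod_n p(d_n | d_1..d_{n-1}), including end-of-sequence."""
    p, prefix = 1.0, ""
    for ch in list(numeral) + ["<eos>"]:
        p *= digit_probs(prefix)[ch]
        prefix += ch
    return p

# Normalisation happens one character at a time over the 12-symbol alphabet,
# so any numeral string gets non-zero probability: an open vocabulary.
p = numeral_prob("3.5")
```

Note how the end-of-sequence factor is what makes the distribution over numerals of all lengths sum to one.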

Mixture of Gaussians Model
Inspired by the approximate number system and the mental number line (Dehaene et al., 2003), our proposed MoG model computes the probability of numerals from a probability density function (pdf) over the real numbers, using a mixture of Gaussians as the underlying pdf:

q(v) = Σ_{k=1}^{K} π_k N(v; μ_k, σ_k^2),  with  π = softmax(B^⊤ h_t),

where K is the number of components, π_k are mixture weights that depend on the hidden state h_t of the token-level RNN, N(.; μ_k, σ_k^2) is the pdf of the normal distribution with mean μ_k ∈ R and variance σ_k^2 ∈ R, and B ∈ R^{D×K} is a matrix.
The difficulty with this approach is that, for any continuous random variable, the probability that it equals a specific value is always zero. To resolve this, we consider a probability mass function (pmf) that discretely approximates the pdf:

p(v | r) = F(v + ε_r) - F(v - ε_r),

where F(.) is the cumulative density function of q(.), and ε_r = 0.5 × 10^{-r} is the number's precision. The level of discretisation r, i.e. how many decimal digits to keep, is a random variable in N with distribution p(r). The mixed joint density is:

p(s) = p(r) p(v | r).

Figure 2 summarises this strategy: we model the level of discretisation by converting the numeral into a pattern and use an RNN to estimate the probability of that pattern sequence.

Figure 2: Mixture of Gaussians model. The probability of a numeral is decomposed into the probability of its decimal precision and the probability that an underlying number will produce the numeral when rounded at the given precision.
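The discretisation can be sketched by evaluating the mixture CDF at v ± ε_r; the two-component mixture parameters are hypothetical, and the sketch computes p(v | r) only (the full model also multiplies by p(r)):

```python
import math

def norm_cdf(x, mu, sigma):
    """Cumulative density of a normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mog_cdf(x, weights, mus, sigmas):
    """F(x) for a mixture of Gaussians: the weighted sum of component CDFs."""
    return sum(w * norm_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def numeral_pmf(value, r, weights, mus, sigmas):
    """p(v | r): probability that the underlying continuous number rounds
    to `value` when keeping r decimal digits."""
    eps = 0.5 * 10 ** (-r)  # half-width of the rounding interval
    return (mog_cdf(value + eps, weights, mus, sigmas)
            - mog_cdf(value - eps, weights, mus, sigmas))

# Hypothetical two-component mixture (weights, means, standard deviations).
w, mu, sd = [0.7, 0.3], [60.0, 150.0], [10.0, 25.0]
p = numeral_pmf(62.5, 1, w, mu, sd)
```

Because the rounding intervals at a fixed precision tile the real line, the discretised probabilities sum to one, so p(v | r) is a proper distribution.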

Combination of Strategies
Different mechanisms might be better for predicting numerals in different contexts. We propose a combination model that can select among the different strategies for modelling numerals:

p(s) = Σ_{m∈M} α_m p_m(s),  with  α = softmax(A^⊤ h_t),

where M = {h-softmax, d-RNN, MoG} and A ∈ R^{D×|M|}. Since both d-RNN and MoG are open-vocabulary models, the unknown numeral token can now be removed from the vocabulary of h-softmax.
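A sketch of the strategy mixture; the per-strategy probabilities for a target numeral and the selection scores are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical probabilities each strategy assigns to the same target numeral,
# and context-dependent selection scores alpha = softmax(A^T h_t).
strategy_probs = {"h-softmax": 0.02, "d-RNN": 0.008, "MoG": 0.015}
selection_scores = [0.5, -0.1, 1.2]  # one score per strategy in M

weights = softmax(selection_scores)
p_combined = sum(wt * p for wt, p in zip(weights, strategy_probs.values()))
```

The mixture remains a proper distribution because each component is independently normalised and the weights sum to one; the combined probability always lies between the smallest and largest component probabilities.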

Evaluating the Numeracy of LMs
Numeracy skills are centred around the understanding of numbers and numerals. A number is a mathematical object with a specific magnitude, whereas a numeral is its symbolic representation, usually in the positional decimal Hindu-Arabic numeral system (McCloskey and Macaruso, 1995). In humans, the link between numerals and their numerical values boosts numerical skills (Griffin et al., 1995).
Perplexity Evaluation Test perplexity evaluated only on numerals is informative of the symbolic component of numeracy. However, model comparisons based on a naive evaluation using Equation 3 can be problematic: perplexity is sensitive to the out-of-vocabulary (OOV) rate, which might differ among models, e.g. it is zero for open-vocabulary models. As an extreme example, in a document where all words are out of vocabulary, the best perplexity is achieved by a trivial model that predicts everything as unknown. Ueberla (1994) proposed Adjusted Perplexity (APP), also known as unknown-penalised perplexity, to cancel the effect of the out-of-vocabulary rate on perplexity. The APP is the perplexity of an adjusted model that uniformly redistributes the probability of each out-of-vocabulary class over all different types in that class:

p'(s_t | h_t) = p(UNK_c | h_t) / |OOV_c|  for  s_t ∈ OOV_c,

where OOV_c is an out-of-vocabulary class (e.g. words and numerals) and |OOV_c| is the cardinality of each OOV set. Equivalently, adjusted perplexity can be calculated as:

APP = PP × ∏_c |OOV_c|^{|s ∈ OOV_c| / N},

where N is the total number of tokens in the test set and |s ∈ OOV_c| is the count of test-set tokens belonging to each OOV set.
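A small worked example of the adjustment; the per-token probabilities and OOV set sizes are hypothetical:

```python
import math

# Three hypothetical test tokens: the second is an out-of-vocabulary numeral.
# For each OOV token, the probability mass of its UNK symbol is spread
# uniformly over all |OOV_c| types of that class.
log_probs = [math.log(0.2), math.log(0.05), math.log(0.1)]
oov_flags = [None, "numeral", None]
oov_sizes = {"word": 500, "numeral": 200}  # |OOV_c| per class

N = len(log_probs)
adjusted = [
    lp - math.log(oov_sizes[c]) if c else lp
    for lp, c in zip(log_probs, oov_flags)
]
app = math.exp(-sum(adjusted) / N)
pp = math.exp(-sum(log_probs) / N)

# Equivalent closed form: APP = PP * prod_c |OOV_c|^(count_c / N).
app_alt = pp * oov_sizes["numeral"] ** (1 / N)
```

The adjustment penalises each OOV prediction by the size of its OOV set, so a model cannot look good merely by funnelling probability into UNK symbols.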
Evaluation on the Number Line While perplexity looks at symbolic performance on numerals, this evaluation focuses on numbers and particularly on their numerical value, which is their most prominent semantic content (Dehaene et al., 2003;Dehaene and Cohen, 1995).
Let v_t be the numerical value of token s_t from the test corpus. Also, let v̂_t be the value of the most probable numeral under the model, ŝ_t = argmax_s p(s | h_t, c_t = numeral). To evaluate on the number line, we can use any evaluation metric from the regression literature. In reverse order of tolerance to extreme errors, some of the most popular are the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Median Absolute Error (MdAE):

RMSE = sqrt( (1/N) Σ_t (v_t - v̂_t)^2 ),
MAE = (1/N) Σ_t |v_t - v̂_t|,
MdAE = median_t |v_t - v̂_t|.

The above are sensitive to the scale of the data. If the data contains values from different scales, percentage metrics are often preferred, such as the Mean/Median Absolute Percentage Error (MAPE/MdAPE):

MAPE = (100/N) Σ_t |(v_t - v̂_t) / v_t|,
MdAPE = median_t 100 |(v_t - v̂_t) / v_t|.
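These metrics are straightforward to compute; the gold and predicted values below are hypothetical, chosen so that one extreme error dominates the mean-based metrics:

```python
import math
import statistics

def number_line_metrics(true_vals, pred_vals):
    """RMSE, MAE, MdAE, MAPE and MdAPE between gold and predicted numbers."""
    errs = [abs(t - p) for t, p in zip(true_vals, pred_vals)]
    pct = [100.0 * abs(t - p) / abs(t) for t, p in zip(true_vals, pred_vals)]
    return {
        "RMSE": math.sqrt(sum(e * e for e in errs) / len(errs)),
        "MAE": sum(errs) / len(errs),
        "MdAE": statistics.median(errs),
        "MAPE": sum(pct) / len(pct),
        "MdAPE": statistics.median(pct),
    }

# One extreme prediction (400 for a gold value of 4) blows up RMSE, MAE and
# MAPE, while the median-based metrics stay informative.
m = number_line_metrics([10.0, 50.0, 200.0, 4.0], [11.0, 45.0, 210.0, 400.0])
```

This sensitivity ordering is exactly what the results section observes: RMSE and MAE are dominated by a few huge errors, whereas MdAE and MdAPE describe the typical case.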

Data
To evaluate our models, we created two datasets with documents from the clinical and scientific domains, where numbers abound (Bigeard et al., 2015;Porter, 1996). Furthermore, to ensure that the numbers will be informative of some attribute, we only selected texts that reference tables.
Clinical Data Our clinical dataset comprises clinical records from the London Chest Hospital. The records were accompanied by tables with 20 numeric attributes (age, heart volumes, etc.), which the records partially describe, while also including numbers not found in the tables. Numeric tokens constitute only a small proportion of each sentence (4.3%), but account for a large part of the vocabulary of unique token types (>40%) and suffer high OOV rates.

Experimental Results and Discussion
We set the vocabularies to the 1,000 and 5,000 most frequent token types for the clinical and scientific datasets, respectively. We use gated token-character embeddings (Miyamoto and Cho, 2016) for the input of numerals and token embeddings for the input and output of words, since the scope of our paper is numeracy. We set the models' hidden dimensions to D = 50 and initialise all token embeddings to pretrained GloVe embeddings (Pennington et al., 2014). All RNNs are LSTMs (Hochreiter and Schmidhuber, 1997), with the biases of the LSTM forget gate initialised to 1.0 (Józefowicz et al., 2015). We train using mini-batch gradient descent with the Adam optimiser (Kingma and Ba, 2014) and regularise with early stopping and a 0.1 dropout rate (Srivastava, 2013) in the input and output of the token-level RNN. For the mixture of Gaussians, we select the means and variances to summarise the data at different granularities by fitting 7 separate mixture-of-Gaussians models on all numbers, each with twice as many components as the previous one, for a total of 2^{7+1} - 2 = 254 components. These models are initialised at percentile points from the data and trained with the expectation-maximisation algorithm. The means and variances are then fixed and not updated when we train the language model.

Quantitative Results
Perplexities Table 2 shows perplexities evaluated on the subsets of words, numerals and all tokens of the test data. Overall, all models performed better on the clinical than on the scientific data. On words, all models achieve similar perplexities in each dataset.
On numerals, softmax variants perform much better than other models in PP, which is an artefact of the high OOV-rate of numerals. APP is significantly worse, especially for non-hierarchical variants, which perform about 2 and 4 orders of magnitude worse than hierarchical ones.
For open-vocabulary models, i.e. d-RNN, MoG, and the combination, PP is equivalent to APP. On numerals, d-RNN performed better than the softmax variants in both datasets. The MoG model performed twice as well as the softmax variants on the clinical dataset, but had the third-worst performance on the scientific dataset. The combination model had the best overall APP results for both datasets.
Evaluations on the Number Line To factor out model-specific decoding processes for finding the best next numeral, we use our models to rank a set of candidate numerals: we compose the union of in-vocabulary numbers and 100 percentile points from the training set, and we convert the numbers into numerals by considering all formats up to n decimal points. We select n to represent 90% of numerals seen at training time, which yields n=3 and n=4 for the clinical and scientific data, respectively. Table 3 shows evaluation results, where we also include two naive baselines that make constant predictions: the mean and the median of the training data. For both datasets, RMSE and MAE were too sensitive to extreme errors to allow drawing safe conclusions, particularly for the scientific dataset, where both metrics were in the order of 10^9. MdAE can be of some use, as 50% of the errors are smaller than it in absolute value.
Among percentage metrics, MoG achieved the best MAPE in both datasets (18% and 54% better than the second best) and was the only model to perform better than the median baseline for the clinical data. However, it had the worst MdAPE, which means that MoG mainly reduced the larger percentage errors. The d-RNN model came third and second in the clinical and scientific datasets, respectively. In the latter, it achieved the best MdAPE, i.e. it was effective at reducing errors for 50% of the numbers. The combination model did not perform better than its constituents, possibly because MoG is the only strategy that takes into account the numerical magnitudes of the numerals.

Learnt Representations
Softmax versus Hierarchical Softmax Figure 3 visualises the cosine similarities of the output token embeddings of numerals for the softmax and h-softmax models. The simple softmax enforced high similarities among all numerals and the unknown numeral token, so as to make them more dissimilar to words, since the model embeds both in the same space. This is not the case for h-softmax, which uses two different spaces: similarities are concentrated along the diagonal and fan out as the magnitude grows, with the exception of numbers with special meanings, e.g. years and percentile points. Figure 4 shows the cosine similarities between the digits of the d-RNN output model. We observe that each primitive digit is most similar to its previous and next digits. Similar behaviour was found for the digit embeddings of all models.

Predictions from the Models
Next Numeral Figure 5 shows the probabilities of different numerals under each model for two examples from the clinical development set. Numerals are grouped by number of decimal points. The h-softmax model's probabilities are spiked, d-RNN's are saw-tooth-like, and MoG's are smooth, with the occasional spike whenever a narrow component allows for it. Probabilities rapidly decrease for more decimal digits, which is reminiscent of the theoretical expectation that the probability of an exact value for a continuous variable is zero. Table 4 shows development set examples with high selection probabilities for each strategy of the combination model, along with the numerals with the highest average selection per model. The h-softmax model is mostly responsible for integers with special functions, while the open-vocabulary strategies showed affinity to different indices from catalogues of astronomical objects: d-RNN mainly to NGC (Dreyer, 1888) and MoG to various other indices, such as GL (Gliese, 1988) and HIP (Perryman et al., 1997). In this case, MoG was wrongly selected for numerals with a labelling function, which also highlights a limitation of evaluating on the number line when a numeral is not used to represent its magnitude. Figure 5 also shows the distributions of the most significant digits under the d-RNN model and from data counts, overlaid with the theoretical estimate according to Benford's law (Benford, 1938). Also called the first-digit law, it applies to many real-life collections of numerals and predicts that the first digit is 1 with higher probability (about 30%) than 9 (<5%), weakening towards uniformity at higher digits. Model probabilities closely follow the estimates from the data. Violations of Benford's law can be due to rounding (Beer, 2009) and can be used as evidence for fraud detection (Lu et al., 2006).
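The Benford probabilities quoted above follow directly from the law's closed form, p(d) = log10(1 + 1/d) for the first significant digit d:

```python
import math

# Benford's law for the first significant digit d in 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Digit 1 leads about 30.1% of real-life numerals, digit 9 under 5%,
# and the nine probabilities telescope to a sum of exactly log10(10) = 1.
```

The same closed form generalises to later digit positions, where the distribution flattens towards uniform, matching the weakening effect described above.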

Related Work
Numerical quantities have been recognised as important for textual entailment (Lev et al., 2004;Dagan et al., 2013). Roy et al. (2015) proposed a quantity entailment sub-task that focused on whether a given quantity can be inferred from a given text and, if so, what its value should be. A common framework for acquiring common sense about numerical attributes of objects has been to collect a corpus of numerical values in pre-specified templates and then model attributes as a normal distribution (Aramaki et al., 2007;Davidov and Rappoport, 2010;Iftene and Moruz, 2010;Narisawa et al., 2013;de Marneffe et al., 2010). Our model embeds these approaches into a LM that has a sense for numbers.
Other tasks that deal with numerals are numerical information extraction and solving mathematical problems. Numerical relations have at least one argument that is a number, and the aim of the task is to extract all such relations from a corpus, which can range from identifying a few numerical attributes (Nguyen and Moschitti, 2011; Intxaurrondo et al., 2015) to generic numerical relation extraction (Hoffmann et al., 2010; Madaan et al., 2016). Our model does not extract values, but rather produces a probabilistic estimate.
Much work has been done on solving arithmetic (Mitra and Baral, 2016; Hosseini et al., 2014; Roy and Roth, 2016), geometric (Seo et al., 2015), and algebraic problems (Zhou et al., 2015; Koncel-Kedziorski et al., 2015; Shi et al., 2015) expressed in natural language. Such models often use mathematical background knowledge, such as linear system solvers. The output of our model is not based on such algorithmic operations, but could be extended to do so in future work. In language modelling, generating rare or unknown words has been a challenge, similar to our unknown numeral problem. Gulcehre et al. (2016) and Gu et al. (2016) adopted pointer networks to copy unknown words from the source in translation and summarisation tasks. Merity et al. (2016) and Lebret et al. (2016) have models that copy from context sentences and from Wikipedia's infoboxes, respectively. Other work proposed an LM that retrieves unknown words from facts in a knowledge graph, drawing attention to the inappropriateness of perplexity when OOV rates are high and instead proposing an adjusted perplexity metric that is equivalent to APP. Other methods aim at speeding up LMs to allow for larger vocabularies, such as hierarchical softmax (Morin and Bengio, 2005b) and target sampling (Jean et al., 2014), but these still suffer from the unknown word problem. Finally, the problem is resolved when predicting one character at a time, as done by the character-level RNNs (Graves, 2013; Sutskever et al., 2011) used in our d-RNN model.

Conclusion
In this paper, we investigated several strategies for LMs to model numerals and proposed a novel open-vocabulary generative model based on a continuous probability density function. We provided the first thorough evaluation of LMs on numerals on two corpora, taking into account their high out-of-vocabulary rate and their numerical value (magnitude). We found that modelling numerals separately from other words through a hierarchical softmax can substantially improve the perplexity of LMs, that different strategies are suitable for different contexts, and that a combination of these strategies can improve perplexity further. Finally, we found that using a continuous probability density function can improve the prediction accuracy of LMs for numbers by substantially reducing the mean absolute percentage error.
Our approaches to modelling and evaluation can be used in future work on tasks such as approximate information extraction, knowledge base completion, numerical fact checking, numerical question answering, and fraud detection. Our code and data are available at: https://github.com/uclmr/numerate-language-models.