Methods for Numeracy-Preserving Word Embeddings

Word embedding models are typically able to capture the semantics of words via the distributional hypothesis, but fail to capture the numerical properties of numbers that appear in text. This leads to problems with numerical reasoning in tasks such as question answering. We propose a new methodology to assign and learn embeddings for numbers. Our approach creates Deterministic, Independent-of-Corpus Embeddings (referred to as DICE) for numbers, such that their cosine similarity reflects the actual distance on the number line. DICE outperforms a wide range of pre-trained word embedding models across multiple examples of two tasks: (i) evaluating the ability to capture numeration and magnitude; and (ii) performing list maximum, decoding, and addition. We further explore the utility of these embeddings in downstream applications by initializing numbers with our approach for the task of magnitude prediction. We also introduce a regularization approach to learn model-based embeddings of numbers in a contextual setting.

While word embeddings effectively capture semantic relationships between words, they are less effective at capturing the numeric properties associated with numbers. Though numbers represent a significant percentage of tokens in a corpus, they are often overlooked. In non-contextual word embedding models, they are treated like any other word, which leads to misinterpretation. For instance, they exhibit unintuitive similarities with other words and do not carry strong prior information about the magnitude of the number they encode. In sentence similarity and reasoning tasks, failure to handle numbers causes as much as 29% of contradictions (De Marneffe et al., 2008). In other data-intensive tasks where numbers are abundant, like neural machine translation, they are masked to hide the translation model's inefficiency in dealing with them (Mitchell and Lapata, 2009).
A variety of tests have been proposed to measure how well embeddings capture numerical properties. For instance, Naik et al. (2019) show that GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013b), and fastText (Joulin et al., 2016; Bojanowski et al., 2017) fail to capture the numeration and magnitude properties of a number. Numeration is the property of associating numbers with their corresponding word representations ("3" and "three"), while magnitude represents a number's actual value (3 < 4). Further, Wallace et al. (2019) propose several tests for analyzing the numerical reasoning abilities of number embeddings, including list maximum, decoding, and addition.
In this paper, we experimentally demonstrate that if the cosine similarity between the word embeddings of two numbers reflects their actual distance on the number line, the resultant embeddings are useful in downstream tasks. We first demonstrate how Deterministic, Independent-of-Corpus Embeddings (DICE) can be constructed such that they almost perfectly capture the properties of numeration and magnitude. These non-contextual embeddings also perform well on related tests for numeracy (Wallace et al., 2019).
To demonstrate the efficacy of DICE in downstream tasks, we explore its utility in two experiments. First, we design a DICE-initialized Bi-LSTM network to classify the magnitude of masked numbers in the Numeracy-600K dataset (Chen et al., 2019). Second, given the popularity of modern contextual model-based embeddings, we devise a regularization procedure that emulates the hypothesis proposed by DICE and can be employed in any task-based fine-tuning process. We demonstrate that adding such regularization helps the model internalize notions of numeracy while learning task-based contextual embeddings for the numbers present in the text. We find promising results in a numerical reasoning task that involves numerical question answering based on a sub-split of the popular SQuAD dataset (Rajpurkar et al., 2016).

Our contributions can be summarized as follows:

• We propose a deterministic technique to learn numerical embeddings. DICE embeddings are learned independently of corpus and effectively capture properties of numeracy.
• We demonstrate experimentally that the resultant embeddings improve a model's ability to reason about numbers in a variety of tasks, including numeration, magnitude, list maximum, decoding, and addition.
• We also demonstrate that properties of DICE can be adapted to contextual models, like BERT (Devlin et al., 2018), through a novel regularization technique for solving tasks involving numerical reasoning.

Related Work
The major research lines in this area have been dedicated to (i) devising probing tests and curating resources to evaluate the numerical reasoning abilities of pre-trained embeddings, and (ii) proposing new models that learn these properties. Naik et al. (2019) surveyed a number of non-contextual word embedding models and highlighted their failure to capture two essential properties of numbers: numeration and magnitude. Chen et al. (2019) created a novel dataset named Numeracy-600K, a collection of approximately 600,000 sentences from market comments with a diverse set of numbers representing age, height, weight, year, etc. The authors use neural network models, including a GRU, BiGRU, CRNN, CNN-capsule, GRU-capsule, and BiGRU-capsule, to classify the magnitude of each number. Wallace et al. (2019) compare and contrast the numerical reasoning ability of a variety of non-contextual and contextual embedding models. The authors also propose three tests (list maximum, decoding, and addition) to judge the numerical reasoning ability of numeral embeddings, and infer that word embedding models that perform best on these tests have captured the numerical properties of numbers well. We therefore include these tests in our evaluation. Spithourakis and Riedel (2018) used a variety of models to distinguish numbers from words, and demonstrated that this ability reduces model perplexity in neural machine translation. Weiss et al. (2018) found that neural networks are capable of reasoning about numbers with explicit supervision.
Numerically Augmented QANet (NAQANet) (Dua et al., 2019) was built by adding an output layer on top of QANet (Yu et al., 2018) to predict answers based on addition and subtraction over numbers in the DROP dataset. Our work, in contrast, offers a simple methodology that can be added to any model as a regularization technique. Our work is more similar to Jiang et al. (2019), where the embedding of a number is learned as a simple weighted average of its prototype embeddings. Such embeddings have been used in tasks like word similarity and sequence labeling and have proven effective.

Methods
To overcome NLP models' inefficiency in dealing with numbers, we form embeddings with our method, DICE. To begin, we embed numerals and word forms of numbers as vectors e_i ∈ R^D, where i indexes numerals identified within a corpus. We first preprocess by parsing the corpora associated with each of our tasks (described below) for numbers in numeral and word form to populate a number vocabulary. Then, the dimensionality of the embeddings required for that task is fixed. We explicitly constrain a numeral and the word forms of that number to share the same embedding.

DICE embeddings
In designing embeddings that capture the aforementioned properties of numeration and magnitude, we consider a deterministic, handcrafted approach (depicted in Figures 1a and 1b). This method relies on the fact that tests for both numeration and magnitude are concerned with the correspondence in similarity between numbers in token space and numbers in embedding space. In token space, two numbers x, y ∈ R, in numeral or word form (with the latter being mapped to its corresponding numeral form for comparison), can be compared using absolute difference, i.e.:

$$d_n(x, y) = |x - y|.$$

The absolute value ensures that two numbers are treated as equally distant regardless of whether x ≥ y or y ≥ x. On the other hand, two embeddings $\mathbf{x}, \mathbf{y} \in \mathbb{R}^D$ are typically compared via cosine similarity, given by:

$$\cos \theta = \frac{\mathbf{x}^{\top}\mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}, \qquad d_e(\mathbf{x}, \mathbf{y}) = 1 - \cos \theta,$$

where θ is the angle between $\mathbf{x}$ and $\mathbf{y}$ and $d_e(\mathbf{x}, \mathbf{y})$ is their cosine distance. Normalization by the vector lengths ensures that the metric is independent of the lengths of the two vectors.
Note that numerals are compared in terms of distance while their embeddings are compared by similarity. As cosine distance increases, the angle between x and y increases monotonically. A distance of zero is achieved when x and y are oriented in the same direction. When x ⊥ y, the cosine distance is 1; and when x and y are antiparallel, cosine distance is 2.
We seek a mapping $(x, y) \mapsto (\mathbf{x}, \mathbf{y})$ such that $d_e$ monotonically increases as $d_n$ increases. We first bound the range of numbers for which we wish to compute embeddings by [a, b] ⊂ R and, without loss of generality, restrict $\mathbf{x}$ and $\mathbf{y}$ to be of unit length (i.e., $\|\mathbf{x}\|_2 = \|\mathbf{y}\|_2 = 1$). Since the cosine function decreases monotonically between 0 and π, we can simply employ a linear mapping to map distances $s_n \in [0, |a - b|]$ to angles $\theta \in [0, \pi]$:

$$\theta = \frac{\pi \, s_n}{|a - b|}.$$

This mapping achieves the desired direct relationship between $s_n$ and $d_e$. Since there are infinitely many choices for $\mathbf{x}$ and $\mathbf{y}$ with angle θ, we simply fix the direction of the vector corresponding to the numeral a. Numbers that fall outside [a, b] are mapped to a random angle in [−π, π]. In the corpora we considered, a and b are chosen such that numbers outside [a, b] represent a small fraction of the total set of numbers (approximately 2%). We employ this mapping to generate numeral embeddings in $\mathbb{R}^D$.

Figure 1a shows deterministic, independent-of-corpus embeddings of rank 2 (DICE-2). In this approach we represent angles as vectors in $\mathbb{R}^2$ using the polar-to-Cartesian coordinate transformation

$$\mathbf{v} = (r \cos \theta, \; r \sin \theta),$$

where we choose r = 1 without loss of generality. We then sample a random matrix $M \in \mathbb{R}^{D \times D}$, where D ≥ 2 and $m_{ij} \sim \mathcal{N}(0, 1)$, and perform a QR decomposition on M to obtain a matrix Q whose columns $q_i$, i = 1, ..., D, constitute an orthonormal basis for $\mathbb{R}^D$. The DICE-2 embedding $\mathbf{e} \in \mathbb{R}^D$ of each numeral is then given by $\mathbf{e} = Q_{1:2} \mathbf{v}$, where the subscript on Q indicates taking the first two columns of Q.
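To make the construction concrete, the following is a minimal NumPy sketch of DICE-2 under the assumptions above; the function and variable names (`dice_2`, `Q`, the default range [0, 10000]) are ours for illustration, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared random rotation: QR-decompose a random Gaussian matrix to obtain an
# orthonormal basis Q of R^D (D >= 2). The same Q is reused for every numeral.
D = 300
M = rng.standard_normal((D, D))
Q, _ = np.linalg.qr(M)

def dice_2(x, a=0.0, b=10_000.0):
    """DICE-2 embedding of the number x (unit length by construction)."""
    if a <= x <= b:
        theta = np.pi * (x - a) / (b - a)        # linear map of [a, b] onto [0, pi]
    else:
        theta = rng.uniform(-np.pi, np.pi)       # out-of-range numbers: random angle
    v = np.array([np.cos(theta), np.sin(theta)]) # polar -> Cartesian with r = 1
    return Q[:, :2] @ v                          # e = Q_{1:2} v

# Cosine similarity then tracks distance on the number line, e.g.
# dice_2(3) is closer to dice_2(4) than to dice_2(400).
```

Mapping each number to the angle π(x − a)/(b − a) realizes the distance-to-angle map above, since the angle between the embeddings of x and y is then π|x − y|/|a − b|.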
In Figure 1b we consider DICE-D, in which we generate vectors in $\mathbb{R}^D$ by applying a polar-to-Cartesian transformation in D dimensions (Blumenson, 1960):

$$v_d = \begin{cases} \cos\theta \, \sin^{d-1}\theta, & 1 \le d < D, \\ \sin^{D-1}\theta, & d = D, \end{cases}$$

where the subscripts indicate the coordinate in $\mathbf{v}$.
We again apply a QR decomposition on a random matrix M generated as above, except here we project $\mathbf{v}$ using all D basis vectors. This allows for a random rotation of the embeddings to avoid bias due to choosing $e_{a,1} = 1$ and $e_{a,i} = 0 \; \forall i \neq 1$. We employ DICE-D embeddings throughout this paper, as in practice word embeddings have dimensionality much greater than 2.
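A corresponding sketch of DICE-D, reusing `rng`, `D`, and `Q` from the DICE-2 sketch above; again an illustration under the same assumptions, with the closed form taken from the D-dimensional transform reconstructed above.

```python
def dice_d(x, a=0.0, b=10_000.0):
    """DICE-D embedding: D-dimensional polar-to-Cartesian transform with a
    single angle theta, rotated by the full orthonormal basis Q."""
    if a <= x <= b:
        theta = np.pi * (x - a) / (b - a)
    else:
        theta = rng.uniform(-np.pi, np.pi)
    # v_d = cos(theta) * sin(theta)^(d-1) for d < D, and v_D = sin(theta)^(D-1);
    # the resulting vector has unit norm by the spherical-coordinate identity.
    v = np.cos(theta) * np.sin(theta) ** np.arange(D)
    v[-1] = np.sin(theta) ** (D - 1)
    return Q @ v                                 # rotate with all D basis vectors
```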

Experiments
To observe the numerical properties of DICE, we consider two tasks: Task 1 deals with the numeration (NUM) and magnitude (MAG) properties proposed by Naik et al. (2019); Task 2 performs list maximum, decoding, and addition as proposed by Wallace et al. (2019). We then experiment on two additional tasks to demonstrate the applications of DICE.

Task 1: Exploring Numeracy
In this task, proposed by Naik et al. (2019), three tests examine each of the properties of numeration (NUM, 3 = "three") and magnitude (MAG, 3 < 4). For each test, a target number, in its word or numeral form, is evaluated against other numbers as follows (a code sketch of the three checks follows the list):

• One-vs-All (OVA): The distance between the embedding vector of the target and its nearest neighbor should be smaller than the distance between the target and any other numeral in the data.
• Strict Contrastive (SC): The distance of the embedding vector of the target from its nearest neighbor should be smaller than its distance from its second-nearest neighbor.
• Broad Contrastive (BC): The distance of the embedding vector of the target numeral from its nearest neighbor should be smaller than its distance from its furthest neighbor.
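The following sketch makes the three criteria concrete. It is our own illustration, assuming a lookup `emb` from numerals to embedding vectors and that `others` is sorted by distance from the target on the number line (hypothetical helpers, not part of the benchmark code).

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance 1 - cos(theta) between two embedding vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def numeracy_tests(target, others, emb):
    """OVA, SC, and BC checks for one target numeral. `others` is sorted so
    that others[0] is the target's nearest neighbor on the number line."""
    d = [cos_dist(emb[target], emb[o]) for o in others]
    ova = d[0] < min(d[1:])   # nearest neighbor beats every other numeral
    sc  = d[0] < d[1]         # nearest neighbor beats the second-nearest
    bc  = d[0] < max(d[1:])   # nearest neighbor beats the furthest neighbor
    return ova, sc, bc
```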
Training Details. We use the Gigaword corpus, obtained from the Linguistic Data Consortium, to populate our number vocabulary. Parsing was performed using the text2digits Python module. As done by Naik et al. (2019), we employ D = 300 for the DICE-D embeddings. Embeddings of numerals are assigned using the principle explained in Section 3.1, while the embedding of a word that denotes a number (word form) simply points to the embedding of that numeral itself. We then perform the six tests (OVA, SC, and BC for each of NUM and MAG).

Results. Table 1 compares the performance of embeddings created by each of the DICE methods on the MAG tests. Both DICE methods outperform all commonly employed non-contextual word embedding baselines on the OVA, SC, and BC tests. This is attributed to the cosine distance property built into the DICE embeddings. Specifically, because the magnitude of a number is linearly related to its angle, sweeping through numbers in order guarantees an increase in angle along each axis. Numbers that are close to each other in magnitude are rotated further, but in proportion to their magnitude. Thus, small and large numbers are ensured to lie near other small and large numbers, respectively, in terms of cosine distance.

On the NUM tests, DICE achieves perfect accuracy. The primary reason DICE embeddings perform so well on numeracy tasks is that the preprocessing steps allow us to parse a corpus for word forms of numbers and explicitly set matching embeddings for both word and numeral forms. Each of these embeddings is guaranteed to be unique, since a number's embedding is based on its magnitude: the larger the magnitude, the greater the angle of the embedding, with a maximum angle of π. This ensures that the numeral form of a number is always able to correctly identify its word form among all word forms in the corpus as that with the smallest cosine distance (which equals zero). Performance on OVA-NUM is a lower bound on the performance of SC-NUM and BC-NUM, so those tests are guaranteed to pass under our approach.

Task 2: List Maximum, Decoding, and Addition
This task considers the three operations proposed by Wallace et al. (2019): list maximum, decoding, and addition. List maximum deals with predicting the maximum number given the embeddings of five different numbers. Decoding deals with regressing the value of a number given its embedding. Addition deals with predicting the sum of two numbers given their embeddings.
Training Details. The list-maximum test presents to a Bi-LSTM network a set of five numbers of similar magnitude, and the network is trained to report the index of the maximum number. In the decoding test, a linear model and a feed-forward network are each trained to output the numeral corresponding to the word form of a number based on its embedding. Finally, in the addition test, a feed-forward network is trained to take the embeddings of two numbers as input and report their sum as output (a sketch of this probe follows). Each test is performed on three ranges of integers, [0, 99], [0, 999], and [0, 9999], with an 80/20 train/test split sampled randomly. The network is fed the embeddings of the numbers; the task is either classification (in the case of list maximum) or prediction of a continuous value (in the case of addition and decoding). We replicate the exact experimental conditions and perform the three tests with DICE embeddings. For consistency with the tests proposed by Wallace et al. (2019), we also deal only with positive numbers in this experiment.
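A minimal PyTorch sketch of the addition probe described above; the hidden size and training details are our assumptions, since the exact architecture follows Wallace et al. (2019) rather than being specified here.

```python
import torch
import torch.nn as nn

class AdditionProbe(nn.Module):
    """Feed-forward probe: embeddings of two numbers in, predicted sum out."""
    def __init__(self, dim=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e_x, e_y):
        # The number embeddings themselves stay frozen; only the probe learns.
        return self.net(torch.cat([e_x, e_y], dim=-1)).squeeze(-1)

# Trained with MSE against x + y; reported as RMSE on the held-out 20% split.
```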
Evaluation. List maximum again uses accuracy as its metric while decoding and addition use root mean squared error (RMSE), since predictions are continuous.
Results. Given the strong performance of the DICE-D method on the NUM and MAG tests, we next consider its performance on tasks involving neural network models. In their empirical study, Wallace et al. (2019) compared a wide range of models: a random baseline; character-level models such as a character-CNN and character-LSTM, both untrained and trained; a so-called value embedding model, in which numbers are embedded as their scalar value; traditional non-contextual word embedding models, including word2vec and GloVe; contextual word embedding models, including ELMo and BERT; and the Numerically Aware Question Answering (NAQA) network, a strong numerical reasoning model proposed on the Discrete Reasoning over Paragraphs (DROP) dataset.
We compare the performance of our DICE-D embeddings to that of the other models on each of the three tasks proposed by Wallace et al. (2019). Results are presented in Table 2. We find that our DICE embeddings exceed the performance of more sophisticated models by large margins in all but four cases; in two of those four, our model falls short by only a few percentage points. We attribute the success of the DICE-D approach to the fact that the model is, by design, engineered to handle numeracy. Just as the value embedding model, which proved reasonably successful in all three tasks across a wide range of numbers, captures numeracy through the magnitude of its embeddings, our model captures numeracy through the angle corresponding to the embeddings. The value embedding model, however, breaks down as the range of the processed numbers grows. This is likely because, as demonstrated by Trask et al. (2018), networks trained on numeracy tasks typically struggle to learn an identity mapping. We reason that our model outperforms the value embedding model because the network learns to associate features between the set of inputs such that the input vectors can be scaled, rotated, and translated in D dimensions to achieve the desired goal.
More precisely, for a neural network to learn addition, numbers must be embedded such that their vector embeddings can be consistently shifted, rotated, and scaled to yield the embedding of another number (see Figure 1c). The choice of embedding is essential, as it may be impractical for a network to learn a transformation for all embeddings that obeys this property (without memorization). DICE is quite similar to the value embedding scheme, which directly encodes a number's value in its embedding. However, DICE performs better due to its compatibility with neural networks, whose layers are better suited to learning rotations and scalings than identity mappings.
Finally, both the value embedding model on small number ranges and the character-level models remain somewhat competitive, suggesting that exploring a digit-by-digit embedding of numerals may provide a means of improving our model further.

Magnitude Classification
We examine the importance of good initialization for number embedding vectors (Kocmi and Bojar, 2017), particularly for better contextual understanding. Specifically, we experiment on the magnitude classification task, which requires predicting the magnitude of masked numbers. The task is based on the Numeracy-600K dataset proposed by Chen et al. (2019) and requires classification into one of seven categories corresponding to the powers of 10 in {0, 1, 2, 3, 4, 5, 6}.
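For instance, under this labeling scheme the class of a number is the power of ten of its order of magnitude; the helper below is our reading of the setup, not code from the dataset release.

```python
import math

def magnitude_class(n):
    """Class label for the magnitude task: floor(log10(n)), clipped to the
    seven categories {0, ..., 6} (our reading of the labeling scheme)."""
    return min(max(int(math.floor(math.log10(n))), 0), 6)

assert magnitude_class(7) == 0           # units
assert magnitude_class(370) == 2         # hundreds
assert magnitude_class(2_500_000) == 6   # millions
```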
Training Details. We use a Bi-LSTM (Hochreiter and Schmidhuber, 1997) with soft attention (Chorowski et al., 2015) to classify the magnitude of masked numbers. Numerals are initialized with their corresponding DICE embeddings, and the target number is masked by substituting a random vector. Each token x_n in a sequence of length N is associated with a forward and a backward LSTM cell. The hidden state of each token is given by the sum of the hidden states of the forward and backward cells:

$$\mathbf{h}_n = \overrightarrow{\mathbf{h}}_n + \overleftarrow{\mathbf{h}}_n.$$

To generate a context vector c for the entire sentence, we compute attention scores α_n by taking the inner product of each hidden state h_n with a learned weight vector w. The resulting scores are passed through a softmax function, and the weights are used to form a convex combination of the h_n that represents the context c of the sentence. Logits are obtained by taking the inner product of c with trained embeddings for each of the seven categories, and cross-entropy loss is minimized. More details on training can be found in Appendix A.
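A PyTorch sketch of this classifier as we read the description; the hidden size is our assumption, and the exact training settings are in Appendix A.

```python
import torch
import torch.nn as nn

class MagnitudeClassifier(nn.Module):
    """Bi-LSTM with soft attention over DICE-initialized token embeddings."""
    def __init__(self, dim=300, hidden=256, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.w = nn.Parameter(torch.randn(hidden))               # attention vector
        self.cls = nn.Parameter(torch.randn(n_classes, hidden))  # class embeddings

    def forward(self, x):                     # x: (batch, N, dim) token embeddings
        out, _ = self.lstm(x)                 # (batch, N, 2 * hidden)
        fwd, bwd = out.chunk(2, dim=-1)
        h = fwd + bwd                         # h_n = forward + backward hidden states
        alpha = torch.softmax(h @ self.w, dim=1)    # attention weights over tokens
        c = (alpha.unsqueeze(-1) * h).sum(dim=1)    # convex combination -> context c
        return c @ self.cls.T                 # logits for the 7 magnitude classes

# Cross-entropy over these logits is minimized during training.
```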
Evaluation. Following Chen et al. (2019), we use micro and macro F1 scores for classifying the magnitude of a number.

Results. From Table 4, the Bi-LSTM with DICE model achieves the best micro-F1 score when the embedding dimension is 256, while the macro-F1 score peaks when the embedding dimension is 512. These results suggest that while DICE embeddings yield superior performance on non-contextual numerical tasks, such as computing the maximum and performing basic mathematical operations, data-agnostic embeddings like DICE may not be ideal for textual reasoning tasks in which the words surrounding a number provide important information about its magnitude. Hence, we introduce a model-based regularization method that utilizes the DICE principles to learn number embeddings in Section 5.2.

Model-Based Numeracy Embeddings
In the previous section, we demonstrated how DICE can be explicitly incorporated for the numbers in a text. Here, we propose a methodology that helps models implicitly internalize the properties of DICE. Our approach is a regularization method (an auxiliary loss) that can be adopted in the fine-tuning of any contextual NLP model, such as BERT. Auxiliary losses have been shown to work well for a variety of downstream NLP tasks.
During the task-specific training of any model, the proposed auxiliary loss $\mathcal{L}_{num}$ can be applied to the input embeddings of the numbers available in a minibatch. For any two numbers x and y with contextual embeddings $\mathbf{x}, \mathbf{y}$ obtained from the final hidden layer of the model, the $\mathcal{L}_{num}$ loss for the pair is calculated as:

$$\mathcal{L}_{num} = \left| \, d_{\cos}(\mathbf{x}, \mathbf{y}) - s \, |x - y| \, \right|,$$

where $d_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x}^{\top}\mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}$ is the cosine distance between the embeddings and s is a scaling factor that normalizes the magnitude distance. In essence, $\mathcal{L}_{num}$ follows the same motivation as DICE, where the cosine distance between the embeddings of two numbers is encouraged to be proportional to their (scaled) absolute distance on the number line.
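A sketch of the auxiliary loss for one sampled pair; `scale` stands in for the normalization of the magnitude distance, whose exact form is our assumption, and λ (Appendix B) weights the loss against the span objective.

```python
import torch
import torch.nn.functional as F

def l_num(x_emb, y_emb, x_val, y_val, scale):
    """Auxiliary numeracy loss for a pair of in-context numbers: push the
    cosine distance of their contextual embeddings toward the (scaled)
    absolute difference of their values."""
    d_cos = 1.0 - F.cosine_similarity(x_emb, y_emb, dim=-1)
    d_num = torch.abs(x_val - y_val) / scale
    return torch.abs(d_cos - d_num).mean()

# total_loss = span_loss + lambda_num * l_num(...)   # lambda_num = 1e-3 worked best
```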
Training Details. To evaluate the proposed $\mathcal{L}_{num}$, we test the regularization on a question answering (QA) task involving numerical answers. In particular, we take the popular Stanford Question Answering Dataset (SQuAD 1.1) (Rajpurkar et al., 2016) and create sub-splits (with answer values in the range [1, 30000]) where (i) the training QA pairs have answers strictly containing numerical digits (Sub-split 1, fewer than 10K examples), and (ii) the training QA pairs have answers containing a number as one of their tokens, e.g., "10 apples" (Sub-split 2, slightly more than 10K examples). We create these splits to evaluate the BERT model's numerical reasoning in picking these answers. We choose BERT-base-uncased as the baseline model and train it on both datasets. Within each batch, we calculate $\mathcal{L}_{num}$ by randomly sampling a pair of numbers x, y from the numbers available in the contexts. The corresponding embeddings $\mathbf{x}$ and $\mathbf{y}$ are extracted from the last hidden layer of the BERT model. We then use $\mathcal{L}_{num}$ to encourage the distance between the embeddings to match the scaled difference between the number values. Scores are reported on the development set (fewer than 1000 examples), as the test set cannot be pruned for our purpose. The assumption here is that the BERT model needs to perform numerical reasoning to come up with answers for these particular kinds of QA pairs. The models were trained on an Nvidia Tesla P100 GPU. More details on choosing the hyperparameter for BERT + $\mathcal{L}_{num}$ are discussed in Appendix B.

Figure 2: Qualitative examples where BERT + $\mathcal{L}_{num}$ performed better than BERT-base. (A) A Newcastle demographics passage with the question "What is the smallest number of Bolivians it's estimated live in Newcastle?"; ground truth 500, BERT predicts "between 500 and 2,000", BERT + $\mathcal{L}_{num}$ predicts 500. (B) A passage on steam engine efficiency.
Evaluation. Exact Match is a binary measure (i.e., true/false) of whether the predicted output matches the ground truth answer exactly, evaluated after string normalization (lowercasing, article removal, etc.). F1 is the harmonic mean of precision and recall.

Results. Results in Table 5 show that the BERT model with the numeracy objective achieves an improvement of 0.48 F1 points when the answers are purely numerical digits. When the BERT model is trained on QA pairs whose answers contain a number among other words, and is evaluated on pairs whose answers contain only numbers, we see an improvement of 1.12 F1 points over the baseline model.
The BERT-base model on the original SQuAD data was fine-tuned for 3 epochs owing to the dataset's complexity. However, we find that 1 epoch is sufficient to capture the complexity of the pruned SQuAD data. Table 5 shows that BERT + $\mathcal{L}_{num}$ consistently performs better than BERT-base across epochs.
Interestingly, BERT-base performs worse when fine-tuned on QA pairs containing a mix of words and numbers as answers (Sub-split 2). This tells us that the baseline model learns to pick pure numbers better but does not do as well when fine-tuned on a mix of words and numbers. In both cases, the evaluation set consists of pruned SQuAD dev-set QA pairs whose answers strictly contain numerical digits only. We find that BERT + $\mathcal{L}_{num}$ gives the maximum improvement on Sub-split 2, highlighting the efficiency of our regularization technique in learning numerical embeddings. Figure 2 shows some qualitative examples where BERT + $\mathcal{L}_{num}$ performs better than BERT-base (Sub-split 2). In this analysis, we found that the baseline model picks the whole sentence or paragraph containing the numerical value (Figure 2B) as the answer. Our method picks numbers within the classification span (Figure 2B) and sometimes helps the BERT model accurately pick the correct numbers (Figure 2A), contributing to Exact Match and F1. More such examples are shown in Appendix C.
During our experiments, we observed the potential issue of weak signals from the loss when the availability of numerical pairs is sparse. In the future, our efforts would be to overcome this issue to ensure further gains.

Conclusion
In this work, we systematically assign and learn embeddings for numbers so that they reflect numerical properties. We validate our proposed approach with several experiments that test number embeddings.
The tests that evaluate the numeral embeddings are fundamentally applicable to all real numbers. Finally, we introduced an approach to jointly learn embeddings of numbers and words that preserve numerical properties and evaluated them on a contextual word embedding based model. In future work, we would like to extend this idea to numbers unseen in the vocabulary as a function of seen ones.

A Training details for Magnitude Classification Experiment
The Bi-LSTM with attention model initialized with DICE embeddings was trained on the market comments data for a fixed 9 epochs. We found that the micro and macro F1 scores peaked at a certain epoch and then flattened out; we report the best micro and macro pair the model obtained at that epoch.

B Hyperparameter for BERT + $\mathcal{L}_{num}$
Our model involves a regularization method (an auxiliary loss) adopted in the fine-tuning of BERT. This loss was weighted by a hyperparameter λ and added to the existing BERT classification loss for detecting the correct span. The hyperparameter search space was [0, 1]. We swept through values manually within the search space, guided by observed performance, and found that the best model, giving the maximum improvement in F1 scores, used λ = 10^{-3}. Performance faded as the hyperparameter was set to higher values (closer to 1).
C Examples for BERT vs. BERT + $\mathcal{L}_{num}$

Figure 3 provides additional samples where BERT + $\mathcal{L}_{num}$ outperformed the baseline BERT model. Similar to previous observations, our regularized approach is able to pinpoint the correct number, as opposed to selecting a substring via pattern matching.

Figure 3: Additional qualitative examples where BERT + $\mathcal{L}_{num}$ outperformed BERT-base, with questions including "In what years did Spain and Portugal join the European Union?", "When did Tesla make the induction motor?", and "What are the total number of votes to be counted during the voting process?" (the last over a passage on Council voting rules in the EU).