Exploring Numeracy in Word Embeddings

Word embeddings are now pervasive across NLP subfields as the de facto method of forming text representations. In this work, we show that existing embedding models are inadequate at constructing representations that capture salient aspects of mathematical meaning for numbers, which is important for language understanding. Numbers are ubiquitous and frequently appear in text. Inspired by cognitive studies on how humans perceive numbers, we develop an analysis framework to test how well word embeddings capture two essential properties of numbers: magnitude (e.g. 3<4) and numeration (e.g. 3=three). Our experiments reveal that most models capture an approximate notion of magnitude, but are inadequate at capturing numeration. We hope that our observations provide a starting point for the development of methods which better capture numeracy in NLP systems.


Introduction
Word embeddings operationalize the distributional hypothesis, where a word is characterized by "the company it keeps" (Harris, 1954; Firth, 1957), and have been shown to capture semantic regularities in vector space (Mikolov et al., 2013c). They have been a driving force in NLP in recent years, and enjoy widespread use in a variety of semantic tasks (Rumelhart et al.; Mikolov et al., 2013a,b; Collobert and Weston, 2008; Glorot et al., 2011; Turney and Pantel, 2010; Turney, 2013).
However, to what extent do these word representations capture numeric properties? Numbers often need to be dealt with precisely, and understanding the meaning of text also requires a careful understanding of the quantities involved. Numbers have been identified as playing an important role in textual entailment, a benchmark natural language understanding task. Marneffe et al. (2008) extract pairs of contradictions that occur naturally on Wikipedia and Google News, and find that as many as 29% of contradictions arise due to numeric discrepancies. They also identify that on many Recognizing Textual Entailment (RTE) datasets, 8.8% of contradictory pairs feature numeric contradictions. Naik et al. (2018) find that the inability of models to do numerical reasoning causes 4% of the errors made by state-of-the-art models in Natural Language Inference. Spithourakis and Riedel (2018) emphasize the importance of numeracy in language modeling. Yet, numbers are often forgotten and even masked in NLP applications (Mitchell and Lapata, 2009).

* The first two authors contributed equally to this work.
In several domains, such as economics, finance and scientific articles, numbers can play a crucial role in text. Take, for example, a recent news headline: "Met Office: Global Warming could exceed 1.5 C within five years". Ideally, the text representation we use should be able to capture that global warming can exceed 1.5 C, not 100 C. Magnitude is an essential aspect of a number's meaning (Dehaene et al., 1998; Whalen et al., 1999; Cantlon and Brannon, 2006; Gross, 2011; Cutini and Bonato, 2012; Agrillo et al., 2012; Feigenson et al., 2004). Systems should also be able to draw valid inferences irrespective of whether the text uses "five" or "5". This requires an understanding of the symbolic representations used to record numbers in text. Such representation systems are called numeration systems, and individual symbols within a system are called numerals. Systems must handle numeration, i.e. associations between distinct symbols used for the same number under different systems (3=three).
In this work, we examine the extent to which word embeddings are capable of representing numeracy attributes, asking the question: if pretrained word embeddings are utilized for representing text across NLP tasks, what can they represent about numbers? Our framework formulates triples of numbers to probe word embeddings on their ability to represent magnitude, and their robustness to differences in numeration. We hope this analysis highlights the limitations of current pretrained word embeddings at capturing numeracy, and will motivate future research to develop more careful treatments of quantities in text.

Analysis Framework
We construct an analysis framework to evaluate embeddings on their ability to capture magnitude and numeration. Numbers follow a well-defined ordering under a mathematical system, which holds independent of textual context (e.g., 0 < 1 < 2 < ...). This ordering is established by magnitude (Izard and Dehaene, 2008; Russell, 2009) and is consistent across numeration systems. Therefore, an embedding representation that captures magnitude and numeration precisely should maintain this ordering across numeration systems in the embedding space. We evaluate this ability by constructing contrastive tests (Zhu et al., 2018).
A contrastive test for a property $p$ is defined as a triple $(x, x^+, x^-)$ such that $x$ is closer to $x^+$ than to $x^-$ under $p$. If embeddings capture $p$, $x$ will be closer to $x^+$ than to $x^-$ in the embedding space, indicating that the embedding method passes the test. We propose three categories of tests, which differ in the choice of $x^-$:
1. OVA-MAG: $(3, 4, x^-)$, such that $x^- \in \{y \mid y \in X \setminus \{3\},\ y \neq 4\}$
2. SC-MAG: $(3, 4, 5)$
3. BC-MAG: $(3, 4, 1000000)$
Similarly for numeration,
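The magnitude tests can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the toy vectors in `emb` are hand-made, the helper names (`cosine`, `magnitude_tests`, `passes`) are hypothetical, candidates are ranked by absolute difference with ties broken toward the smaller number, and an OVA-MAG test is treated as passed only if $x^+$ beats every other candidate.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def magnitude_tests(x, numbers):
    """Return x_plus and the negatives used by each test category for anchor x.

    Assumption: candidates are ranked by |y - x|, ties broken toward smaller y.
    """
    others = sorted((y for y in numbers if y != x), key=lambda y: (abs(y - x), y))
    x_plus = others[0]
    return x_plus, {
        "OVA-MAG": others[1:],    # every number other than x and x_plus
        "SC-MAG": [others[1]],    # second-closest number
        "BC-MAG": [others[-1]],   # furthest number
    }

def passes(emb, x, x_plus, negatives):
    """A test passes if x is more similar to x_plus than to every negative."""
    pos = cosine(emb[x], emb[x_plus])
    return all(pos > cosine(emb[x], emb[n]) for n in negatives)

# Toy, hand-made 2-d vectors (illustrative only, not real embeddings).
emb = {
    3: [1.0, 0.10],
    4: [1.0, 0.12],
    5: [1.0, 0.20],
    1000000: [0.1, 1.0],
}

x_plus, tests = magnitude_tests(3, list(emb))
results = {name: passes(emb, 3, x_plus, negs) for name, negs in tests.items()}
print(x_plus, tests, results)
```

With the anchor 3 and this toy number set, the sketch reproduces the example triples (3, 4, 5) for SC-MAG and (3, 4, 1000000) for BC-MAG; accuracy over a real vocabulary would be the fraction of anchors for which `passes` returns true.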

Representations
We evaluate the following embedding methods:
Skipgram (Mikolov et al., 2013a): Feedforward network trained to predict words within a fixed window surrounding the current word, with hidden weights used as embeddings. We evaluate with window sizes in {2, 5}, dependency

Table 2: Performance (% accuracy) of various embedding models on magnitude tests. We also report the performance of a random embedding baseline.

Retrained Word Vectors
We retrain all models on GigaWord and English Wikipedia, under the setting: window size=5; dimensionality=100. To evaluate whether having more occurrences of numerals in the training data correlates with better representations, we train two variants of each model: one on sentences containing numbers (56M in total; 1.5B tokens) (Num), and another on 56M sentences (1.5B tokens) subsampled without constraints (All).
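One way the Num and All training subsets might be selected is sketched below. The criterion for "containing a number" is an assumption (a digit-based token or a word from a short, illustrative list of English number words), as are the names `contains_number` and `build_variants`; the paper's actual preprocessing is not specified here.

```python
import random
import re

# Assumption: a numeric token is a run of digits, optionally with ',' or '.'.
DIGIT_TOKEN = re.compile(r"^\d[\d,.]*$")
# Assumption: a short illustrative list of English number words.
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def contains_number(sentence):
    """True if any whitespace-separated token looks like a number."""
    return any(
        DIGIT_TOKEN.match(tok) is not None or tok.lower() in NUMBER_WORDS
        for tok in sentence.split()
    )

def build_variants(sentences, seed=0):
    """Return (num, all_): number-bearing sentences, and an equally sized
    sample drawn from the full corpus without constraints."""
    num = [s for s in sentences if contains_number(s)]
    rng = random.Random(seed)
    all_ = rng.sample(sentences, k=len(num))
    return num, all_

corpus = [
    "Global warming could exceed 1.5 C",
    "The cat sat on the mat",
    "She bought three apples",
    "Prices rose sharply",
]
num, all_ = build_variants(corpus)
print(num)
```

Matching the size of the unconstrained sample to the number-bearing subset keeps the two variants comparable in training volume, which mirrors the equal sentence counts reported above.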

Experiments
How many numerals have representations? Table 1 shows the proportion of English and Arabic numerals in the vocabulary of each model. Overall, numerals make up less than 5% of the vocabulary in all models. Despite this, all variants contain representations for enough numerals to allow us to apply our framework.
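A proportion of this kind could be computed roughly as follows. This is a sketch under assumptions: Arabic numerals are taken to be all-digit tokens, English numerals are matched against a small illustrative word list, and the function name `numeral_proportions` is hypothetical.

```python
import re

ARABIC = re.compile(r"^\d+$")  # assumption: all-digit tokens only
# Assumption: a small illustrative list of English number words.
ENGLISH_NUMERALS = {"one", "two", "three", "four", "five", "six",
                    "seven", "eight", "nine", "ten", "hundred", "thousand"}

def numeral_proportions(vocab):
    """Fraction of vocabulary entries that are Arabic or English numerals."""
    vocab = list(vocab)
    arabic = sum(1 for w in vocab if ARABIC.match(w))
    english = sum(1 for w in vocab if w.lower() in ENGLISH_NUMERALS)
    return {"arabic": arabic / len(vocab), "english": english / len(vocab)}

vocab = ["the", "3", "three", "42", "cat", "ten", "dog", "run", "blue", "1000"]
props = numeral_proportions(vocab)
print(props)
```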

Evaluating Off-The-Shelf Embeddings
Tables 2 and 3 present the performance of off-the-shelf embeddings on magnitude and numeration tests respectively. We use cosine similarity as the distance metric. High performance on BC-MAG indicates that all models capture an approximate notion of magnitude, distinguishing between very large and very small numbers. We speculate that this might be because numbers from different magnitude classes often appear in different contexts (see §5.1). As tests become stricter, model performance drops sharply: models perform nearly 10x worse on OVA-MAG than on BC-MAG. This suggests that models are unable to capture magnitude precisely. Across models, Skip-Gram variants and FastText-Wiki perform best on BC-MAG. However, GloVe outperforms all others on OVA-MAG and SC-MAG. On numeration tests, models fare much worse. With the exception of GloVe models on BC-NUM, no model significantly outperforms a random baseline.

Table 4 presents the performance of retrained embeddings and a random embedding baseline on magnitude and numeration tests. There is no significant difference in performance between the Num and All variants, suggesting that seeing more numerals during training does not necessarily result in better representations. Results follow similar trends to those of off-the-shelf embeddings. All models capture an approximate notion of magnitude (high performance on BC-MAG), but do not capture numeration. Across models, FastText variants fare best.

Performance on Magnitude Tests
Tables 2 and 4 show that most models do not capture magnitude precisely (low performance on OVA-MAG and SC-MAG), but seem to learn an approximate notion of magnitude (high performance on BC-MAG) [10]. To examine the difference in contexts that separates numbers of vastly varying magnitudes, we sample 1 million sentences containing numbers from English Wikipedia and GigaWord and compute pointwise mutual information (PMI), defined as

$$\mathrm{PMI}(\text{number}, \text{class}) = \log \frac{p(\text{number}, \text{class})}{p(\text{number}, \cdot)\, p(\cdot, \text{class})}$$

between the contexts of primitive numbers (numbers 1-10) and large numbers (>500, >1000, >3000, >10000, >100000), as shown in Table 5. We consider the word immediately following the number as context, since it appears in the context of the number across embedding methods, regardless of sliding window size. We apply add-100 smoothing to identify contexts with maximum discriminatory power.

[10] Cognitive studies show that human babies initially start recognizing numbers by approximation, and that their ability to identify numbers precisely improves over their lifespan (Halberda et al., 2012). Moyer and Landauer (1967) were the first to observe that humans took longer to distinguish between closer numbers (e.g., 8 and 9) than between numbers which were further apart (e.g., 2 and 9). This finding has since been replicated several times (Dehaene, 2011). In our framework, models find it harder to distinguish between closer numbers (SC-MAG) than distant numbers (BC-MAG); however, the differences here likely arise from the different contexts in which numbers of vastly varying magnitudes are used.
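The PMI computation with add-k smoothing can be sketched as below. Several details are assumptions made for illustration: the joint distribution is taken over (following word, number class) pairs, the smoothing constant is added to every cell of that table, the toy observations are invented, and the names `number_class` and `smoothed_pmi` are hypothetical. The toy run uses k=1 because add-100 smoothing would swamp such small counts.

```python
import math
from collections import Counter

def number_class(value, threshold):
    """Assumption: 1-10 are 'primitive', values above `threshold` are 'large';
    anything in between is ignored (returns None)."""
    if 1 <= value <= 10:
        return "primitive"
    if value > threshold:
        return "large"
    return None

def smoothed_pmi(pairs, k=100):
    """pairs: iterable of (context_word, class_label).
    Returns {(word, cls): PMI} after adding k to every (word, class) cell."""
    joint = Counter(pairs)
    words = sorted({w for w, _ in joint})
    classes = sorted({c for _, c in joint})
    cell = {(w, c): joint[(w, c)] + k for w in words for c in classes}
    total = sum(cell.values())
    p_word = {w: sum(cell[(w, c)] for c in classes) / total for w in words}
    p_cls = {c: sum(cell[(w, c)] for w in words) / total for c in classes}
    return {
        (w, c): math.log((cell[(w, c)] / total) / (p_word[w] * p_cls[c]))
        for (w, c) in cell
    }

# Invented (number, following word) observations, for illustration only.
observations = [(3, "percent"), (5, "percent"), (7, "days"), (2, "days"),
                (2000, "election"), (2016, "election"), (1990, "census")]
pairs = [(w, number_class(v, 1000)) for v, w in observations
         if number_class(v, 1000) is not None]
pmi = smoothed_pmi(pairs, k=1)
print(sorted(pmi.items(), key=lambda kv: -kv[1])[:3])
```

Sorting context words by their PMI with each class gives the most discriminative terms for that class, which is the kind of listing the context analysis relies on.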
We observe in Table 5 that terms separating primitives from larger numbers fall into categories such as days in a month, which are at most 31, or percentages, which are <= 100. In comparison, contexts of larger numbers include terms like 'election', 'census' and 'world'. As we move beyond numbers that are likely to be dates (>3000), we observe terms such as 'ZIP', occurring with ZIP codes in text; 'block', occurring in contexts such as 'house in 9600 block of Washington Boulevard'; and 'Refugees', which appears in contexts such as 'relocate about 125,000 refugees away from the border'. We observe that different contexts characterize classes of numbers, and speculate that this may allow embeddings to distinguish between numbers that appear consistently in vastly different contexts, leading to good performance on BC-MAG.

[Table 5 columns: Primitives | λ = 500, repeated for λ = 1000, 3000, 10000 and 100000]

Recovering magnitude information from nearest neighbours
Model performance on SC-MAG and BC-MAG indicates whether ordering relationships between a number and its closest, second-closest, and furthest numbers are maintained. However, there are infinitely many numbers, making it infeasible to construct contrastive tests that check ordering relationships between all triples. To mitigate this, we experiment with a paradigm that performs regression over a number's nearest neighbors to predict its magnitude. If magnitude can be recovered from the structure of the embedding space, this provides evidence that magnitude ordering relations are maintained to some extent. For this experiment, we divide the set of 2260 numbers common across off-the-shelf variants [11] into training (80%) and test (20%) sets and run a kNN (k-nearest neighbor) regressor to predict magnitude. R2 scores are shown in Table 6. Most models show reasonably high R2 scores, indicating that some ordering relationships must be maintained, helping embeddings capture approximate notions of magnitude. While this property of current embedding models is interesting, their failure to capture precise magnitude is an important issue. Word embeddings are used for semantic tasks such as natural language inference or reading comprehension, wherein models might need to reason more precisely about numbers.
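The nearest-neighbour regression idea can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' setup: neighbours are found by cosine similarity, the target is the raw numeric value, k=3, the 80/20 split is deterministic, and the toy embeddings are synthetic vectors whose angle grows with magnitude. All helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_predict(query_vec, train, k=3):
    """train: list of (vector, magnitude) pairs.
    Predict the mean magnitude of the k most cosine-similar training vectors."""
    nearest = sorted(train, key=lambda item: cosine(query_vec, item[0]), reverse=True)[:k]
    return sum(m for _, m in nearest) / len(nearest)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic embeddings (assumption): the vector's angle increases with the number.
emb = {n: [math.cos(n * math.pi / 200), math.sin(n * math.pi / 200)]
       for n in range(1, 61)}

# Deterministic 80/20 split (assumption): every fifth number is held out.
test_numbers = [n for n in emb if n % 5 == 0]
train = [(emb[n], n) for n in emb if n % 5 != 0]

preds = [knn_predict(emb[n], train) for n in test_numbers]
r2 = r2_score(test_numbers, preds)
print(round(r2, 3))
```

Because the synthetic space is ordered by construction, the neighbours of a held-out number are numerically close to it and R2 is high; in a space with no magnitude structure, the neighbours' values would carry little information about the target.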

Conclusion
Current NLP systems rely heavily on word embeddings. In this work we demonstrate that three popular embedding models are inadequate at dealing precisely with numbers, in two aspects: magnitude and numeration. We hope this work will promote a more careful treatment of language, and serve a cautionary purpose against using word embeddings in downstream tasks without recognizing their limitations. This work also raises important questions about other categories of word-like tokens that need to be treated like special cases. We hope the community will carefully consider that distributed word representations cannot be relied upon in all scenarios.

[11] We do this to compare results across all models. Retrained variants contain embeddings for all 2260 numbers.

Model          R2 Score
GloVe-6B-50D   0.53

Table 6: Results of kNN Regression Test for Magnitude

B Numeration Tests with Euclidean Distance
Tables 8 and 9 describe the performance of word embedding models on numeration tests with Euclidean distance.

Table 9: Performance (% accuracy) of embedding models on BC-NUM