Do Language Embeddings capture Scales?

Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but fall short of the capability required for general common-sense reasoning. We identify contextual information in pre-training and numeracy as two key factors affecting their performance, and show that a simple method of canonicalizing numbers can have a significant effect on the results.


Introduction
The success of contextualized pretrained Language Models like BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018) on tasks like Question Answering and Natural Language Inference has led to speculation that they are good at Common Sense Reasoning (CSR).
On one hand, recent work has approached this question by measuring the ability of LMs to answer questions about physical common sense (Bisk et al., 2020) ("How to separate egg whites from yolks?"), temporal reasoning (Zhou et al., 2020) ("How long does a basketball game take?"), and numerical common sense (Lin et al., 2020). On the other hand, after realizing that such high-level reasoning skills may be difficult to learn from a language-modeling objective alone, Geva et al. (2020) inject numerical reasoning skills into LMs by additional pretraining on automatically generated data. All of these skills are prerequisites for CSR.

Figure 1: Scalar probing example. The mass of "dog" is a distribution (gray histogram) concentrated around 10-100kg. We train a linear model over a frozen encoder (shown by the snowflake in the figure) to predict this distribution (orange histogram) using either a dense cross-entropy or a regression loss (Section 3).
Here, we address a simpler task that is another prerequisite for CSR: the prediction of scalar attributes, a task we call Scalar Probing. Given an object (such as a "wedding ring") and an attribute with continuous numeric values (such as Mass or Price), can an LM's representation of the object predict the value of that attribute? Since, in general, there may not be a single correct value for such attributes due to polysemy ("crane" as a bird versus construction equipment) or natural variation (e.g. different breeds of dogs), we interpret this as a task of predicting a distribution of possible values for the attribute, and compare it to a ground-truth distribution of such values. An overview of the scalar probing task is shown in Figure 1. Examples of ground-truth distributions and model predictions for different objects and attributes are shown in Figure 2.
Our analysis shows that contextual encoders, like BERT and ELMo, perform better on scalar probing than non-contextual ones, like Word2Vec (Mikolov et al., 2013), despite the task being non-contextual. Further, we show that using scientific notation to represent numbers in pre-training can have a significant effect on results (though sensitive to the evaluation metric used). Put together, these results imply that scale representation in contextual encoders is mediated by transfer of magnitude information from numbers to nouns in pre-training, and that making this mechanism more robust could improve performance on this and other CSR tasks. We also show improvements by zero-shot transfer from our probes to two related tasks: relative comparisons (Forbes and Choi, 2017) and product price prediction (Jianmo Ni, 2019), indicating that our results are robust across datasets.

Problem Definition and Data
We define the scalar probing task (see Figure 1) as the problem of predicting a distribution over values of a scalar attribute of an object. We map these values into 12 logarithmically-spaced buckets, so that our task is equivalent to predicting (the distribution of) the order of magnitude of the target value. We explore both models that predict the full distribution and models that predict a point estimate of the value, which is essentially a distribution with all of its mass concentrated in one bucket.
Our primary resource for the scalar probing task is Distributions over Quantities (DoQ; Elazar et al., 2019), which consists of empirical counts of scalar attribute values associated with >350K nouns, adjectives, and verbs over 10 different attributes, collected from web data. In this work, we focus only on nouns (which we refer to as objects) over the scalar attributes (or scales) of MASS (in grams), LENGTH (in meters) and PRICE (in US Dollars). For each object and scale, DoQ provides an empirical distribution over possible values (e.g. Figure 2) that we map into the 12 aforementioned buckets and treat as "ground truth". We note that DoQ is derived heuristically from web text and contains noise; however, we use it as a starting point to evaluate the performance of different models. Moreover, we validate our findings with the transfer experiments in Section 6, using DoQ to train a probe that is evaluated on the ground-truth data of Forbes and Choi (2017) and Jianmo Ni (2019).
To explore the role of context in scalar probing, we also trained specialized probing models on a subset of DoQ data in narrow domains: MASS of Animals and PRICE of Household products.

Probing Model
We probe three different embedding models: Word2vec (Mikolov et al., 2013), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), the latter two of which are contextualized encoders. For each encoder, the input layer extracts an embedding of the object and the probing layer predicts the scalar magnitude.
Input representations For Word2vec, we follow the standard practice of averaging the embeddings of each word in the object's name. If an object name is a full phrase in the dictionary, we use its embedding instead. As BERT and ELMo are contextual text encoders operating on full sentences, we generate artificial sentences with the following templates:
• MASS: The X is heavy.
• PRICE: The X is expensive.
• LENGTH: The X is big.
and use the CLS token embedding (for BERT) or final state embedding (for ELMo) as the input representation. For LENGTH, we use "big" instead of "long", since LENGTH measurements in DoQ can be widths or heights as well. Variations of these templates with different adjectives and sentence structures (e.g. "The X is small." or "What is the length of X?" for LENGTH) led to very similar performance in our evaluations.
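The templating step above can be sketched as follows (the function and dictionary names are our own illustrative choices, not from any released code):

```python
# Template sentences used to place an object name into a probe sentence,
# one per scalar attribute, as listed above.
TEMPLATES = {
    "MASS": "The {} is heavy.",
    "PRICE": "The {} is expensive.",
    "LENGTH": "The {} is big.",
}

def probe_sentence(obj: str, scale: str) -> str:
    """Build the artificial input sentence for a given object and scale."""
    return TEMPLATES[scale].format(obj)
```

For example, probe_sentence("wedding ring", "PRICE") yields "The wedding ring is expensive.", which is then encoded and the CLS (BERT) or final state (ELMo) embedding is used as the probe input.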
Probes We use linear probes in all cases, following much previous probing work (Shi et al., 2016; Ettinger et al., 2016; Pimentel et al., 2020), since we want a simple probe that surfaces easily accessible information in a representation. Hewitt and Liang (2019) also demonstrate that linear probes achieve relatively high selectivity compared to non-linear ones such as MLPs.
We experiment with two different approaches for predicting scales:
Regression (rgr) For the point estimate, we use a standard linear regression model trained on the median of the empirical distribution.
Multi-class Classification (mcc) We take a non-parametric approach to modeling the full distribution of scalar values and treat the prediction of which bucket a measurement falls into as a multi-class classification task, with one class per bucket. A similar approach was shown by Van Oord et al. (2016) to perform well for modeling image pixel values. This approach discards the relationship between adjacent bucket values, but it allows us to use the full empirical distribution as soft labels. We train a linear model with softmax output, using a dense cross-entropy loss against the empirical distribution from DoQ.
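A minimal sketch of the mcc probe: a linear layer with softmax output, trained by gradient descent on a dense cross-entropy loss against soft bucket labels. The learning rate, step count, and function names here are illustrative assumptions; the actual training setup is described in the Appendix.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_mcc_probe(X, Y, lr=0.1, steps=500):
    """X: (n, d) frozen object embeddings; Y: (n, k) empirical bucket
    distributions used as soft labels (k = 12 buckets in the paper).
    Returns (W, b) of a linear softmax probe trained with dense
    cross-entropy loss by full-batch gradient descent."""
    n, d = X.shape
    k = Y.shape[1]
    W, b = np.zeros((d, k)), np.zeros(k)
    for _ in range(steps):
        P = softmax(X @ W + b)
        G = (P - Y) / n              # gradient of cross-entropy wrt logits
        W -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    return W, b
```

The predicted distribution for a new embedding x is softmax(x @ W + b); its argmax is the predicted bucket used by the Accuracy metric in Section 5.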
More details of the model and training procedure are in the Appendix.

Numeracy through Scientific Notation
Wallace et al. (2019) showed that BERT and ELMo had a limited amount of numeracy or numerical reasoning ability, when restricted to numbers of small magnitude. Intuitively, it seems that significant model capacity is expended in parsing the natural representation of numbers as Arabic numerals, where higher and lower order digits are given equal prominence. As further evidence of this, it is shown in Appendix B of Wallace et al. (2019) that the simple intervention of left-padding numbers in ELMo instead of the default right-padding used in Char-CNNs greatly improves accuracy on these tasks.
To examine the effect of numerical representations on scalar probing, we trained a new version of the BERT model (which we call NumBERT) by replacing every instance of a number in the training data with its representation in scientific notation, a combination of an exponent and mantissa (for example 314.1 is represented as 3141[EXP]2 where [EXP] is a new token introduced into the vocabulary). This enables the BERT model to more easily associate objects in the sentence directly with the magnitude expressed in the exponent, ignoring the relatively insignificant mantissa. This model converged to a similar loss on the original BERT Masked LM+NSP pre-training task and a standard suite of NLP tasks (See Appendix) as BERT-base, demonstrating that it was not over-specialized for numerical reasoning tasks.
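The rewriting can be sketched as follows. This is our own minimal reimplementation of the idea, not the released preprocessing code; it handles plain decimal literals only and ignores signs, commas, and exponent syntax already present in the text.

```python
import re
from decimal import Decimal

def to_scientific(num_str: str) -> str:
    """Rewrite a decimal literal as mantissa-digits [EXP] exponent,
    e.g. "314.1" -> "3141[EXP]2" (i.e. 3.141 x 10^2)."""
    d = Decimal(num_str)
    if d == 0:
        return "0[EXP]0"
    sign, digits, exp = d.as_tuple()
    mantissa = "".join(map(str, digits)).rstrip("0") or "0"
    lead_exp = len(digits) + exp - 1  # power of ten of the leading digit
    return ("-" if sign else "") + mantissa + f"[EXP]{lead_exp}"

def rewrite_numbers(text: str) -> str:
    """Replace every decimal number in the text with scientific notation."""
    return re.sub(r"\d+(?:\.\d+)?", lambda m: to_scientific(m.group()), text)
```

For instance, rewrite_numbers("It weighs 314.1 grams") produces "It weighs 3141[EXP]2 grams", exposing the order of magnitude directly as the token after [EXP].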

Evaluation
We offer the following aggregate baseline to help interpret our results: for each attribute, we compute the empirical distribution over buckets across all objects in the training set, and use it as the predicted distribution for all objects in the test set (this is a stronger version of the majority baseline used in classification tasks). Since we are comparing results from regression and classification models, we report results on three disparate metrics that together give a full picture of performance:
Accuracy For mcc, we use the highest-scoring bucket of the predicted distribution as the predicted bucket, while for rgr we map the predicted scalar to its single containing bucket. Accuracy is then computed between the predicted bucket and the ground-truth bucket, i.e., the highest-scoring bucket of the empirical distribution in DoQ.
Mean Square Error (MSE) When used to compare distributions, this is also known as the Cramer-von Mises distance (Baringhaus and Henze, 2017). It ignores the difference in magnitude between different buckets (a difference in probability mass between buckets i and i+1 is weighted the same as the same difference between bucket i and any other bucket), but is upper-bounded by 1, making it easier to interpret. To calculate MSE for rgr, we assume that it assigns a probability of 1 to the single containing bucket.
Earth Mover's Distance (EMD) Also known as the Wasserstein distance (Rubner et al., 1998). Given two probability densities p_1 and p_2 on Ω, and some distance measure d on Ω, the Earth Mover's Distance is defined as

EMD(p_1, p_2) = inf_π ∫_{Ω×Ω} d(x, y) dπ(x, y),

where the infimum is over all non-negative measures π on Ω×Ω whose marginals are p_1 and p_2. Intuitively, EMD measures how much "work" needs to be done to move the probability mass of p_1 to p_2, while MSE measures the pointwise difference in densities. EMD thus accounts for the distance between buckets: predictions in neighboring buckets are penalized less than those further away. EMD is favored in the statistics literature because of its better convergence properties (Rubner et al., 1998), and there is evidence that it is more robust to adversarial perturbations of the data distribution (Liu et al., 2019), which is relevant for our transfer tasks described below.
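For the bucketed distributions used here, both metrics reduce to simple array computations; a sketch (assuming unit distance between adjacent buckets, under which 1-D EMD equals the L1 distance between the cumulative distribution functions):

```python
import numpy as np

def mse(p1, p2):
    """Mean pointwise squared difference between two bucket distributions."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return float(np.mean((p1 - p2) ** 2))

def emd_1d(p1, p2):
    """Earth Mover's Distance for 1-D histograms with unit-spaced buckets:
    the sum of absolute differences between the two CDFs."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return float(np.abs(np.cumsum(p1 - p2)).sum())
```

This makes the contrast concrete: moving all mass one bucket over gives emd_1d([1,0,0],[0,1,0]) = 1 while moving it two buckets gives emd_1d([1,0,0],[0,0,1]) = 2, yet mse is identical in both cases.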
Transfer experiments We also evaluate models trained on DoQ on two datasets containing ground-truth labels of scalar attributes. The first is a human-labeled dataset of relative comparisons (e.g. (person, fox, weight, bigger)) (Forbes and Choi, 2017). Predictions for this task are made by comparing the point estimates for rgr and the highest-scoring buckets for mcc. The second is an empirical distribution of product price data extracted from the Amazon Review Dataset (Jianmo Ni, 2019). We retrained a model on DoQ prices using 12 power-of-4 buckets to support finer-grained predictions.

Results

Table 1 shows results of scalar probing on DoQ data. For MSE and EMD the best possible score is 0, while for accuracy we take a loose upper bound to be the performance of a model that samples from the ground-truth distribution and is evaluated against its mode. This method achieves accuracies of 0.570 for lengths, 0.537 for masses, and 0.476 for prices. Compared to the baseline, mcc over the best encoders captures about half (as measured by accuracy) to a third (by MSE and EMD) of the distance to the upper bound, suggesting that while a significant amount of scalar information is available, there is a long way to go to support robust common-sense reasoning. From Table 1, we also see that the more expressive mcc models consistently beat rgr, with the latter frequently unable to improve upon the Aggregate baseline. This shows that scale information is present in the embeddings, but training on the median alone is not enough to reliably extract it; the full data distribution is needed.

Comparing results by encoder, we see that Word2Vec performs significantly worse than the contextual encoders (even though the task is non-contextual), indicating that contextual information during pre-training improves the representation of scales.
Despite being weaker than BERT on downstream NLP tasks, ELMo does better on scalar probing, consistent with it being better at numeracy (Wallace et al., 2019) due to its character-level tokenization.
NumBERT does consistently better than ELMo and BERT on the EMD metric, but worse on MSE and Accuracy. This is in contrast to other standard benchmarks such as Q/A and NLI, where NumBERT made no difference relative to BERT. Our key takeaway is that the numerical representation has an impact on scale prediction (see Figure 2 for qualitative differences), but the direction of the effect is sensitive to the choice of evaluation metric. As discussed in Section 5, we believe EMD to be the most robust metric a priori, but this finding highlights the need to still examine the full range of metrics.
Results on Animal Masses (Table 1) show that training models only on objects in a narrow domain can significantly improve scalar prediction, underscoring the importance of context. For example, while "crane" in general can refer to either a bird or a piece of construction equipment, only the former is relevant in the animal domain, giving the model a simpler distribution of masses to predict.
Note that, despite significant differences in the raw numbers for each scale (mass/length/price), the relative behavior of encoders, metrics and probes is the same, indicating that our conclusions are broadly applicable.
Transfer experiments On the F&C relative comparison task (Table 2), rgr+NumBERT performed best, approaching the performance of using DoQ as an oracle, though short of specialized models for this task (Yang et al., 2018). Scalar probes trained with mcc perform poorly, possibly because a finer-grained model of the predicted distribution is not useful for the 3-class comparative task. On the Amazon price dataset (Table 3), which is a full distribution prediction task, mcc+NumBERT did best on all three metrics. On both zero-shot transfer tasks, NumBERT was the best encoder across all configurations of metric and objective, suggesting that manipulating numeric representations can significantly improve performance on scalar prediction.

Conclusion
From our novel scalar probing experiments, we find that object embeddings contain a significant amount of scale information, but that a sizable gap remains before LMs achieve this prerequisite of CSR: the signal, while non-trivial, is too weak to support common-sense scale understanding. Our analysis points to improvements in modeling context and numeracy, mediated by the transfer of scale information from numbers to nouns, as directions in which progress can be made. The NumBERT intervention has a measurable impact on scalar probing results, and our transfer experiments suggest that it is an improvement. In future work, we would like to extend our models to predict scales for sentences bearing relevant context about scalar magnitudes, e.g. "I saw a baby elephant".

A Model Hyperparameters
Here we provide the model hyperparameters we use for reproducibility.

A.1 Probing Layer of the Scalar Probing Model
For the regression model, we use ridge regression with a regularization strength of 1. For the multi-class classification model, we use a linear classifier with a softmax activation function and a regularization strength of 0.01. For experiments on the narrow domains with smaller datasets, we first use PCA to reduce embeddings to 150 dimensions before training the probing model.
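For reference, ridge regression with strength 1 has a familiar closed form; a minimal sketch in plain NumPy (our implementation used standard library routines, and details such as intercept handling may differ):

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^{-1} X^T y.
    X: (n, d) frozen object embeddings; y: (n,) scalar targets
    (log-scale attribute values). alpha is the regularization strength."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```

With alpha=1 and reasonably many training examples, the regularization bias is small and the probe recovers the underlying linear relationship when one exists.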

A.2 NumBERT
NumBERT is pretrained on the Wikipedia and Books corpora used by the original BERT paper (Devlin et al., 2018). The BERT configuration is the same as BERT-Base (L=12, H=768, A=12, Total Parameters=110M). The language model masking is applied after WordPiece tokenization with a uniform masking rate of 15%. Maximum sequence length (number of tokens) is 128. We train with a batch size of 64 sequences for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. All other hyperparameters and implementation details (optimizer, warm-up steps, etc.) are the same as in the original BERT implementation. Table 4 shows a comparison of NumBERT vs. a re-implementation of BERT-Base with identical settings as above, on a suite of standard NLP benchmarks, and we conclude that the two models reach similar performance on these tasks.

B Datasets

Table 5 shows the statistics of the 3 datasets/resources we use in this paper. For DoQ, we take the original resource and obtain each subset by filtering on the corresponding dimensions and/or object types (e.g. all objects, animals, product categories, etc.). Only objects with more than 100 values collected in the resource are used. For the F&C Cleaned dataset, we use the data and the train/dev/test splits from Elazar et al. (2019).

C Complete Experimental Results
We model the distributions of those scalar attributes as categorical distributions over 12 categories. We first take the base-10 logarithm of all the values and then round them to the nearest integer (between -2 and 9 for all scales). We treat each integer as a bucket and use the normalized counts in each bucket as the true distribution for that scalar attribute of the object.
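The bucketing above can be sketched as follows (the function name is ours; clipping out-of-range values to the end buckets is our assumption for safety, since the text reports all observed values fall between -2 and 9):

```python
import numpy as np

def bucket_distribution(values, lo=-2, hi=9):
    """Map raw attribute values to 12 order-of-magnitude buckets:
    take log10, round to the nearest integer, clip to [lo, hi], and
    return normalized counts over the 12 buckets."""
    b = np.clip(np.round(np.log10(values)).astype(int), lo, hi)
    counts = np.bincount(b - lo, minlength=hi - lo + 1)
    return counts / counts.sum()
```

For example, raw masses of 95g, 100g and 1000g map to buckets 2, 2 and 3, giving a distribution with two-thirds of its mass in the 10^2 bucket.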
To explore the effect of ambiguity, we divide all the data in DoQ for each scale into 2 sets: Unimodal, where the distribution has one well-defined peak, and Multimodal, where multiple peaks are present. The number of peaks was identified by a simple hill-climbing algorithm.
As words often have more than one sense depending on context or modifiers, their corresponding DoQ distributions should reflect the different senses if these appear often enough in the data. When the senses are different enough, the distribution may have multiple modes (e.g. "ice cream" has mainly one meaning and its size does not vary much, whereas "truck" can refer to a toy truck, which is very small, or an actual vehicle, which is very large). To better understand our results, we wish to separate objects with multiple modes from objects with a single mode.
To estimate modality, we take the bucketed DoQ distribution and smooth it into a probability density function. We then find the local maxima of the fitted density and classify a distribution as multi-modal if there is more than one maximum, and as single-modal otherwise.
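A minimal sketch of the peak count used for this split. For simplicity it applies a strict local-maximum test directly to the bucketed distribution, omitting the smoothing step described above, so it is a simplification rather than the exact procedure:

```python
def count_peaks(dist):
    """Count strict local maxima in a bucketed distribution,
    treating out-of-range neighbors as zero mass."""
    padded = [0.0] + list(dist) + [0.0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

def is_multimodal(dist):
    """A distribution is multi-modal if it has more than one peak."""
    return count_peaks(dist) > 1
```

Under this test, a distribution like [0.1, 0.4, 0.05, 0.35, 0.1] (two separated mass concentrations, as with the two senses of "truck") counts as multi-modal, while a single concentrated hump does not.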
The complete experimental results, including the multimodal experiments, are in Table 6.