Are Distributional Representations Ready for the Real World? Evaluating Word Vectors for Grounded Perceptual Meaning

Distributional word representation methods exploit word co-occurrences to build compact vector encodings of words. While these representations enjoy widespread use in modern natural language processing, it is unclear whether they accurately encode all necessary facets of conceptual meaning. In this paper, we evaluate how well these representations can predict perceptual and conceptual features of concrete concepts, drawing on two semantic norm datasets sourced from human participants. We find that several standard word representations fail to encode many salient perceptual features of concepts, and show that these deficits correlate with word-word similarity prediction errors. Our analyses provide motivation for grounded and embodied language learning approaches, which may help to remedy these deficits.


Introduction
Distributional approaches to meaning representation have enabled substantial progress in natural language processing in recent years. (All project code is available at github.com/lucy3/grounding-embeddings.) They center on a classic insight dating to at least Harris (1954) and Firth (1957): "You shall know a word by the company it keeps" (Firth, 1957, p. 11). Popular distributional analysis methods which exploit this intuition, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), have been critical to the success of many recent
large-scale natural language processing applications (e.g. Turney and Pantel, 2010; Turian et al., 2010; Collobert and Weston, 2008; Socher et al., 2013; Goldberg, 2016). These methods operationalize distributional meaning via tasks in which words are optimized to predict the words that co-occur with them in text corpora. They yield compact word representations (vectors in some high-dimensional space) which are optimized to solve these prediction tasks. These vector representations form the foundation of practically all modern deep learning models applied within natural language processing.
Despite the success of distributional representations in standard natural language processing tasks, a small but growing consensus within the artificial intelligence community suggests that these methods cannot be sufficient to induce adequate representations of words and concepts (e.g. Gauthier and Mordatch, 2016; Lazaridou et al., 2015). These sorts of claims, which often draw on experimental evidence from cognitive science (see e.g. Barsalou, 2008), are used to back up arguments for multimodal learning (at the weakest) or complete embodiment (at the strongest). One such position claims the following:

. . . the best way for acquiring human-level semantics is to have machines learn through (physical) experience: if we want to teach a system the true meaning of "bumping into a wall," we simply have to bump it into walls repeatedly.
Discussions like the one above have an intuitive pull: certainly "bump" is best understood through a sense of touch, just as "loud" is best understood through a sense of sound. It seems inefficient, or perhaps just wrong, to learn these sorts of concepts from distributional evidence alone.
Despite the intuitive pull, there is not much evidence from a computational perspective that grounded or multimodal learning actually earns us anything in terms of general meaning representation. Will our robots and chat-bots be worse off for not having physically bumped into walls before they hold discussions on wall-collisions? Will our representation of the concept loud somehow be faulty unless we explicitly associate it with certain decibel levels experienced in the real world? Before we proceed to embed our learning agents in multimodal games and robot-shells, it is important that we have some concrete idea of how grounding actually affects meaning.
This paper presents a thorough analysis of the contents of distributional word representations with respect to this question. Our results suggest that several common distributional word representations may indeed be deficient in the sort of grounded meaning necessary for language-enabled agents deployed in the real world.

Related work
This paper uses semantic norm datasets to evaluate the content of distributional word representations. Semantic norm datasets consist of concepts and norms concerning their perceptual and conceptual features, as provided by human participants. They are a popular resource within psychology and cognitive science as models of human concept representation, and have been used to explain psycholinguistic phenomena from semantic priming and interference (Vigliocco et al., 2004) to the structure of early word learning in child language acquisition (Hills et al., 2009). Andrews et al. (2009) show how "experiential" semantic norm information can be used to model human judgments of concept similarity. They show that this semantic norm data provides information distinct from the information found in basic word representations. Our work extends the findings of Andrews et al. to a larger semantic norm dataset and evaluates particular implications within natural language processing.
A small NLP literature has compared distributional representations with semantic norm datasets and other external resources. Rubinstein et al. (2015) confirm that word representations are especially effective at predicting taxonomic features versus attributive features. Collell and Moens (2016) find that word representations fail to predict many visual features of concepts, and show how representations from computer vision models can help improve these predictions. Several studies have used distributional representations to reconstruct aspects of these semantic norm datasets (Herbelot and Vecchi, 2015; Fagarasan et al., 2015; Erk, 2016). The majority of the NLP work in this space has focused on the downstream task of augmenting word representations with novel grounded information, often evaluating on standard semantic similarity datasets (Agirre et al., 2009; Bruni et al., 2012; Faruqui et al., 2015). Young et al. (2014) develop an alternative operationalization of denotational meaning using image captioning datasets, and demonstrate gains over distributional representations on textual similarity and entailment datasets.
This applied work has demonstrated that something worthwhile is indeed gained by augmenting distributional representations with some orthogonal grounded or multimodal information. We believe it is critical to analyze the original successes and failures of distributional representations in order to motivate this move to grounded meaning representation.

Meaning representations 3.1 Distributional meaning
This paper examines representations produced by two popular unsupervised distributional methods. Table 1 shows the statistics of the corpora used to generate these vectors.
GloVe: GloVe (Pennington et al., 2014) estimates word representations w_i by using them to reconstruct a word-word co-occurrence matrix X collected from a large text corpus:

J = \sum_{i,j} f(X_{ij}) \left( w_i^\top w_j + b_i + b_j - \log X_{ij} \right)^2    (1)

where f(X_{ij}) is a weighting function on word pairs and b_i, b_j are learned per-word bias terms. We use two pre-trained GloVe vector datasets: one trained on a concatenation of Wikipedia 2014 and Gigaword 5 (GloVe-WG), and another trained on a Common Crawl dump (GloVe-CC). 1

word2vec: word2vec (Mikolov et al., 2013) estimates word representations by optimizing a skip-gram objective to predict all words w_{i+j} within a context window c of a word w_i given their word representations:

\frac{1}{T} \sum_{i=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{i+j} \mid w_i)    (2)

where T is the total number of words in a corpus. We use a publicly available word2vec dataset trained on the Google News corpus. 2
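As a concrete illustration of Equation (1), the GloVe weighting function and per-pair loss can be sketched as follows. This is a minimal NumPy sketch, not the reference implementation: the `x_max` and `alpha` values follow the defaults reported by Pennington et al. (2014), and the toy vectors are hypothetical.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): down-weights rare pairs, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for a single co-occurrence cell X_ij."""
    inner = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * inner ** 2

# Hypothetical example: two random 5-dimensional word vectors.
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=5), rng.normal(size=5)
loss = pair_loss(w_i, w_j, 0.0, 0.0, x_ij=50.0)
```

The full GloVe objective sums this per-pair loss over all nonzero cells of the co-occurrence matrix.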

Semantic norms
Semantic feature norm datasets consist of reports from human participants about the semantic features of various natural kinds. A proportion of the features contained in these datasets are properties of concepts which may be obvious to humans but are perhaps difficult to find written in text corpora. For this reason, we selected two semantic norm datasets to serve as gold-standard comparisons of concept meaning. Table 2 displays basic statistics about the semantic norm datasets we use in this paper.
McRae Our initial experiments use the semantic norm dataset from McRae et al. (2005), which consists of 541 concrete noun concepts with associated feature norms, collected from 725 participants. For a given concept, the McRae dataset includes all feature norms which were reported independently by at least five participants (2,526 in total). After removing concepts indicated to have ambiguous meanings to mitigate polysemy effects (such as tank (army) and tank (container)) and one concept without a GloVe representation (dunebuggy), we had a resulting set of 515 concepts for analysis. The dataset groups features into several perceptual and non-perceptual categories: taxonomic, encyclopedic, function, visual-motion, visual-form and surface, visual-colour, sound, tactile, and taste (McRae et al., 2005). We use the McRae dataset and feature categories to perform basic pilot analyses and form hypotheses about the nature of the distributional representations tested.
CSLB We reproduce and extend our results on a second semantic norm dataset collected by the Cambridge Centre for Speech, Language and the Brain (CSLB; Devereux et al., 2014). CSLB contains 638 concepts provided by 123 participants. Their data collection closely followed McRae et al. (2005), though features were included if at least 2 participants named that feature. We removed concepts with two-word names, ambiguous meanings, or missing vector representations to yield a vocabulary of 597 concepts from this dataset. CSLB also includes a feature categorization schema, though the categories are broader than those in McRae: visual perceptual, other perceptual, functional, taxonomic, and encyclopedic.
The mapping between the two categorization schemes is far from perfect. While some perceptual features in McRae are categorized as perceptual features in CSLB, other features (e.g. those related to swimming, flying, eating) are reclassified as "functional" in CSLB. The two datasets disagree on abstract conceptual properties as well. For example, CSLB classifies is for football as a functional property, while McRae classifies the comparable feature associated with football games as encyclopedic.
The encyclopedic category is somewhat difficult to distinguish in both datasets. It is composed mainly of abstract factual features, but also contains attributive features such as is cold-blooded and does use electricity, as well as is scary and is cool.
Meanwhile, the functional category mixes features describing behaviors of the concept itself (does dive) with functions that people perform on or with the concept (is hit). This classification system may need some readjustment to provide a clear understanding of what is perceptual and what is conceptual, and it may be that some features, such as has a steering wheel, are both.
Given the significant noise of this classification scheme, we focus our investigation on a single contrast between features in clearly perceptual categories (visual, tactile, sound, etc.) and non-perceptual categories (functional and taxonomic). Because the encyclopedic category contains an ambiguous mix of both sorts, we exclude it from our formal predictions later in the paper.

The feature view
We first investigate how well distributional word representations directly encode information about semantic norms. For each feature in a semantic norm dataset, we construct a binary classification problem which predicts the presence or absence of the feature for each concept. Concretely, for each feature f_i we have a label vector y_i ∈ {0, 1}^{n_c}, where n_c is the total number of concepts in the dataset, and y_{ij} is 1 when concept j has feature f_i and 0 otherwise. We build label vectors only for features with five or more associated concepts. After filtering, we have n_f = 267 label vectors in the McRae dataset and n_f = 775 in CSLB.
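The label-vector construction above can be sketched as follows. This is a minimal NumPy sketch; the concept-feature pairs are hypothetical toy data, and only the five-concept threshold follows the text.

```python
import numpy as np

def build_label_vectors(concepts, feature_map, min_concepts=5):
    """For each feature, build y_i in {0,1}^{n_c}; keep only features
    with at least `min_concepts` associated concepts."""
    idx = {c: j for j, c in enumerate(concepts)}
    labels = {}
    for feat, cs in feature_map.items():
        if len(cs) < min_concepts:
            continue  # filtered out, as described in the text
        y = np.zeros(len(concepts), dtype=int)
        for c in cs:
            y[idx[c]] = 1
        labels[feat] = y
    return labels

# Hypothetical toy data: 6 concepts, 2 candidate features.
concepts = ["dog", "cat", "horse", "crow", "apple", "chair"]
feature_map = {
    "has_legs": ["dog", "cat", "horse", "crow", "chair"],  # 5 concepts -> kept
    "is_red": ["apple"],                                   # 1 concept  -> dropped
}
labels = build_label_vectors(concepts, feature_map)
```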
For each feature, we construct a binary logistic regression model p_i which predicts the presence or absence of the feature for a concept given its word representation x_j:

p_i(x_j) = \sigma\left( \theta_i^\top x_j + \beta_i \right)    (3)

This base model is extremely prone to overfitting, as most features have only several associated concepts (that is, each classifier has only a few positive examples) and the input word representations are of a high dimensionality. In order to prevent overfitting, we add an independent L2 regularization term to each regression model. For each feature f_i, we use leave-one-out cross-validation to select the regularization parameter \lambda_i which maximizes the following modified logistic objective:

L_i = \sum_{j : y_{ij} = 1} \left[ \log p_{i,\lambda_i}^{-j}(x_j) + \frac{1}{|\{k : y_{ik} = 0\}|} \sum_{k : y_{ik} = 0} \log\left( 1 - p_{i,\lambda_i}^{-j}(x_k) \right) \right]    (4)

Here p_{i,\lambda_i}^{-j}(\cdot) represents a regression model (Equation (3)) trained without example (x_j, y_{ij}) in the training set and with regularization parameter \lambda_i. The first term of the summand calculates the log-probability of the left-out concept having the desired feature, and the second term calculates the average log-probability that any other concept (outside of the feature group f_i) does not have the feature. The regularization terms \lambda_i are selected independently for each feature to maximize the objective L_i.

The remainder of this paper describes a general analysis performed on both the McRae and CSLB datasets. We used McRae as a pilot dataset to form hypotheses, and checked these hypotheses on the CSLB dataset as a test set. All of the graphs and numbers reported in this paper correspond to results on CSLB.
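The leave-one-out selection of \lambda_i can be sketched roughly as follows. This is a simplified scikit-learn sketch, not the paper's exact procedure: for brevity it scores held-out log-likelihood rather than the modified objective in Equation (4), and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_select_lambda(X, y, lambdas):
    """Pick the L2 strength whose leave-one-out log-likelihood is highest.
    (sklearn's C is the inverse of the regularization strength lambda.)"""
    best_lam, best_ll = None, -np.inf
    for lam in lambdas:
        ll = 0.0
        for j in range(len(y)):
            mask = np.arange(len(y)) != j
            clf = LogisticRegression(C=1.0 / lam).fit(X[mask], y[mask])
            # log-probability assigned to the held-out example's true label
            ll += clf.predict_log_proba(X[j:j + 1])[0, y[j]]
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

# Synthetic toy data: 12 examples in 4 dimensions, two noisy classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (6, 4)), rng.normal(-1.0, 1.0, (6, 4))])
y = np.array([1] * 6 + [0] * 6)
lam = loo_select_lambda(X, y, lambdas=[0.01, 1.0, 100.0])
```

In the paper's setting, X would hold the distributional vectors of all concepts and y the binary label vector for one feature.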
After fitting the regularized logistic regression models, we calculate a set of "feature fit" metrics. For each feature f i , we evaluate the binary F1 score of its classifier's predictions p i (y i ). Figure 1 shows each feature as a point in a swarm-plot (grouped by feature category).
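The binary F1 "feature fit" score can be computed as below (a minimal self-contained sketch; the prediction and label vectors are hypothetical):

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier output for one feature across 8 concepts.
score = binary_f1([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

Because each feature has few positive concepts, F1 (which ignores true negatives) is a more informative score here than raw accuracy.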
Pilot tests with the McRae dataset suggested that the categories associated with strictly perceptual features were not well encoded in the distributional representations relative to strictly non-perceptual categories (taxonomic and functional features).
We use the CSLB dataset as a test set for this prediction. We perform a bootstrap confidence interval test on the difference between the median feature fit scores for CSLB features in non-perceptual and perceptual categories. The 95% confidence intervals on this bootstrap are positive for two of the three representations tested (GloVe-CC and word2vec). 4 Figure 1 shows the feature fit scores on CSLB evaluated with GloVe-CC, and the word2vec evaluation shows effectively the same result: taxonomic and functional features score higher on average than strictly perceptual features. This comparison failed on GloVe-WG, however, where features classed as "functional" scored far lower on average than those in perceptual categories. Across all three sets of distributional representations, the median score of encyclopedic features fell between the perceptual and non-perceptual categories.

It is obvious from Figure 1 that each category contains a wide range of feature fit values. As discussed earlier in Section 3.2, this categorization of features is far from perfect. Many of the lower-scoring features classed as "encyclopedic" are simple attributive features not deserving of the category label, such as is fresh and is filling. Many of the higher-scoring encyclopedic features seem genuinely encyclopedic, such as is found on farms; other high-scoring features are arguably "functional," such as does grow on trees. Many of the higher-scoring visual perceptual features state structural part-whole relations, such as has legs and has an engine. Table 3 provides more examples of low- and high-scoring features in each category. Despite the rather noisy classification scheme used in this dataset, we still found a regular trend in two of three evaluations, matching our expectations from prior pilot experiments. We believe that a revised classification scheme could help to sharpen these comparisons.

Matching word representation sources
For each feature, we compare its feature fit score evaluated with GloVe-CC word vectors and its score evaluated with word2vec vectors in Figure 2. The trend in the figure suggests that both representations have similar feature fit deficiencies and strengths, though the trend becomes weaker near the (100%, 100%) corner: the two representations correlate well at low feature fit scores, and seem to fan out at higher scores. A large group of points also sit in the figure at y = 100 or x = 100; these features are perfectly captured by one representation and not by the other. This correlation is somewhat surprising, given that the word2vec and GloVe vectors are the products of different algorithms executed on very different corpora. There are two likely explanations behind this correlation:

1. Some features in the CSLB semantic norm data are unusually difficult, or are perhaps missing associated concepts. GloVe and word2vec correlate in performance because neither captures these noisy or incomplete features.

2. There are systematic deficiencies in the word vectors due to their shared reliance on the distributional method.
It is difficult to differentiate these two explanations on these small semantic norm datasets, but we hope to distinguish these in the future by testing new predictions for concepts not covered in these datasets. We will return to this idea in the conclusion of the paper.

The concept view
The previous section demonstrated that several classes of perceptual features are not well encoded on average by distributional word representations, and that these deficiencies systematically match across representations. How does this deficiency in feature representation carry over into computations on the word representations themselves?
We evaluate the matching between distributional representations and representations from other sources by comparing their predictions of word-word similarity. For distributional word representations, we compute word-word similarity by cosine similarity:

\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}    (5)

We derive compact concept representations from the semantic norm datasets with LSA (Landauer et al., 1998). We compute a truncated SVD on the feature matrix Y ∈ {0, 1}^{n_c × n_f}, which is the concatenation of the binary feature label vectors introduced in Section 4. We define concept-concept similarity by the cosine similarity between the corresponding LSA vectors.
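The LSA step can be sketched as follows. This is a minimal sketch with NumPy's SVD; the binary feature matrix is hypothetical, and the number of retained dimensions `k` is an illustrative choice rather than the paper's setting.

```python
import numpy as np

def lsa_vectors(Y, k):
    """Truncated SVD of the binary concept-by-feature matrix Y.
    Rows of the result are compact concept representations."""
    U, S, Vt = np.linalg.svd(Y.astype(float), full_matrices=False)
    return U[:, :k] * S[:k]

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-concept x 5-feature norm matrix with two blocks.
Y = np.array([[1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1]])
V = lsa_vectors(Y, k=2)
sim_01 = cosine_sim(V[0], V[1])  # concepts sharing features
sim_02 = cosine_sim(V[0], V[2])  # concepts sharing no features
```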
As a secondary data source, we also compute word-word similarity judgments from the WordNet taxonomy (Miller, 1995). We use the Resnik metric (Resnik et al., 1999) to compute the similarity between concept names c_i, c_j:

\mathrm{sim}(c_i, c_j) = \max_{c \in S(c_i, c_j)} \left[ -\log p(c) \right]    (6)

where S(c_i, c_j) selects the common ancestors of the concepts in the WordNet taxonomy, and p(c) is the unigram probability of a concept as computed on an external corpus. This selects the ancestor of the two concepts in the taxonomy which has maximal information content (surprisal). We use WordNet as additional verification that the trends observed between semantic norms and distributional representations are non-coincidental.
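Resnik similarity can be illustrated on a toy taxonomy as follows. This is a self-contained sketch: the miniature taxonomy and probability values are hypothetical stand-ins for WordNet and corpus-derived unigram counts.

```python
import math

# Hypothetical taxonomy: child -> parent (the root has no parent).
parent = {"dog": "canine", "wolf": "canine", "cat": "feline",
          "canine": "animal", "feline": "animal", "animal": None}
# Hypothetical unigram probabilities p(c) of each concept in a corpus.
prob = {"dog": 0.05, "wolf": 0.01, "cat": 0.05,
        "canine": 0.08, "feline": 0.07, "animal": 0.4}

def ancestors(c):
    """The set containing c and all of its ancestors."""
    out = set()
    while c is not None:
        out.add(c)
        c = parent[c]
    return out

def resnik(ci, cj):
    """Information content -log p(c) of the most informative common ancestor."""
    common = ancestors(ci) & ancestors(cj)
    return max(-math.log(prob[c]) for c in common)

sim_dog_wolf = resnik("dog", "wolf")  # share the rarer ancestor "canine"
sim_dog_cat = resnik("dog", "cat")    # share only the frequent "animal"
```

Pairs joined only at a frequent (low-surprisal) ancestor score lower, matching the intuition that "animal" tells us less than "canine".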
We use these similarity metrics to compute pairwise distance measures for concepts present in the semantic norm datasets. For each metric, we produce a symmetric pairwise distance matrix D ∈ R nc×nc , where an element D ij indicates the distance between concepts i and j according to the metric.
We next compute how well each concept's pairwise similarity is correlated between the various metrics.
For a given concept, we compute the Pearson correlation between the concept's GloVe/word2vec pairwise distance vector and the LSA and WordNet pairwise distance vectors. 5 The correlation values of interest are m(GloVe/word2vec, CSLB) and m(GloVe/word2vec, WordNet), that is, the correlations between the pairwise distance vectors for GloVe/word2vec and CSLB and between the pairwise distance vectors for GloVe/word2vec and WordNet. Figure 3a plots both of these correlation values evaluated with GloVe-CC for all concepts. The two m measures are evidently positively correlated, though with some noise (r = 0.6160). This is to be expected, as the CSLB dataset and WordNet overlap only partially in the semantic features they encode.
Each concept in Figure 3 is colored according to the median feature fit score of its associated features. In Figure 3b, we show this feature fit metric on the vertical axis. There is a positive relationship here between feature fit scores and the correlation metric m(GloVe-CC, CSLB) (r = 0.3323). Because the correlation between m(·, CSLB) and feature fit metrics is weaker than expected, we run post-hoc multiple regression significance tests for each distributional representation. An F-test shows that the regression feature m(·, CSLB) significantly improves predictions of feature fit values.

(The Pearson correlation between two vectors is equivalent to the cosine similarity between their mean-centered forms.)
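The per-concept correlation measure m can be sketched as follows. This is a minimal NumPy sketch; the two toy distance matrices are hypothetical stand-ins for the GloVe/word2vec and CSLB/WordNet pairwise distances.

```python
import numpy as np

def pairwise_correlations(D_a, D_b):
    """For each concept i, Pearson-correlate its distances to all other
    concepts under metric a with its distances under metric b."""
    n = D_a.shape[0]
    ms = []
    for i in range(n):
        mask = np.arange(n) != i  # drop the zero self-distance
        ms.append(np.corrcoef(D_a[i, mask], D_b[i, mask])[0, 1])
    return np.array(ms)

# Toy symmetric distance matrices over 4 concepts (hypothetical values).
D_a = np.array([[0., 1., 4., 5.],
                [1., 0., 3., 6.],
                [4., 3., 0., 2.],
                [5., 6., 2., 0.]])
D_b = 2.0 * D_a + 1.0  # a second metric, perfectly correlated by construction
m = pairwise_correlations(D_a, D_b)
```

A concept with low m under one pairing but not the other indicates the metrics disagree about which neighbors that concept is close to.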

Domain-level analysis
We next investigate whether some domains of concepts are particularly affected by the deficiencies discussed in the previous sections. We perform agglomerative clustering on concepts from the CSLB dataset using a custom distance metric:

d(i, j) = \alpha \, d_{\cos}(\mathrm{LSA}_i, \mathrm{LSA}_j) + (1 - \alpha) \, |\mathrm{FF}_i - \mathrm{FF}_j|    (7)

where LSA_i is the LSA vector representation computed from the semantic norm data for concept i as introduced earlier in this section, and FF_i is the median feature fit score for concept i. We select the weight α manually to produce the most semantically coherent clusters. Figure 4 shows the distribution of feature fit scores for each of the resulting 40 domains. We find that settings of α which yield semantically coherent clusters also yield groups of concepts with very low variance in feature fit scores. In Table 4 we list select domains and their median feature fit scores. This clustering suggests that deficiencies at the feature level affect entire coherent semantic domains of concepts.
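The clustering step can be sketched as follows. This is a simplified SciPy sketch under stated assumptions: the LSA vectors, feature fit scores, α value, and average linkage are all illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cosine

def combined_distance(lsa, ff, alpha):
    """Condensed distance vector mixing cosine distance between LSA
    vectors with the gap in feature fit scores, weighted by alpha."""
    n = len(ff)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            d = alpha * cosine(lsa[i], lsa[j]) + (1 - alpha) * abs(ff[i] - ff[j])
            out.append(d)
    return np.array(out)

# Hypothetical LSA vectors and median feature fit scores for 6 concepts,
# forming two obvious groups.
lsa = np.array([[1., 0.], [0.9, 0.1], [1., 0.1],
                [0., 1.], [0.1, 0.9], [0., 0.9]])
ff = np.array([0.8, 0.75, 0.82, 0.3, 0.35, 0.28])
Z = linkage(combined_distance(lsa, ff, alpha=0.5), method="average")
domains = fcluster(Z, t=2, criterion="maxclust")
```

The feature fit term in the metric pulls concepts with similar encoding quality into the same cluster, which is what makes the low within-domain variance observable.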

Conclusion
This paper has analyzed how well various standard distributional representations encode aspects of grounded meaning. We chose to use semantic norm datasets as a gold standard of grounded meaning, and tested how word representations predicted features within these datasets. We grouped these features into high-level categories and found that, despite large within-category variance, several standard distributional representations underperformed on average in predicting perceptual features. The difference in prediction performance proved statistically significant on two of the three representations we evaluated. These deficiencies in feature encoding matched between GloVe and word2vec representations trained on different corpora, suggesting that certain classes of features may be poorly represented by distributional methods in general.
We also examined the consequences of these deficiencies in feature encoding for the word representations themselves. We compared the word-word similarity predictions made with distributional representations against those made with the semantic norm dataset and with WordNet, and found that concepts whose features were badly encoded in the distributional representations were also likely to yield similarity predictions that diverged from those of these two external resources. A final domain-level concept analysis suggested that some semantic domains are particularly impacted by these issues in feature encoding.
The semantic norm datasets used in this paper are subject to saliency biases: they only contain the concept-feature mappings which experimental subjects think to mention when queried. These saliency effects add noise to our results, as mentioned in Section 4.1, and may have caused us to generally underestimate the performance of distributional models within all feature categories. In future work, we plan to repeat the sorts of tests conducted in this paper while avoiding possible saliency confounds. We also plan to develop a causal explanation for the deficiencies in the word embeddings found in this paper, showing how co-occurrence information (or the lack thereof) present in the training corpus can bias performance on these tasks. Both of these studies will help determine whether the results we have found are due to deficiencies in distributional methods rather than in the datasets used here.
We think these deficiencies should be worrying: if neural models of language are to have any knowledge about concepts, it ought to be in their word embeddings. Our findings show that these embeddings are lacking in basic features of perceptual meaning. These results suggest that distributional meaning (as operationalized by modern distributional models) may miss out on fundamental elements of semantics. We hope they will help motivate further work in developing multimodal representations which can prepare us to deploy more fluent language agents in the real world.