Feature2Vec: Distributional semantic modelling of human property knowledge

Feature norm datasets of human conceptual knowledge, collected in surveys of human volunteers, yield highly interpretable models of word meaning and play an important role in neurolinguistic research on semantic cognition. However, these datasets are limited in size due to practical obstacles associated with exhaustively listing properties for a large number of words. In contrast, the development of distributional modelling techniques and the availability of vast text corpora have allowed researchers to construct effective vector space models of word meaning over large lexicons. However, this comes at the cost of interpretable, human-like information about word meaning. We propose a method for mapping human property knowledge onto a distributional semantic space, which adapts the word2vec architecture to the task of modelling concept features. Our approach gives a measure of concept and feature affinity in a single semantic space, which makes for easy and efficient ranking of candidate human-derived semantic properties for arbitrary words. We compare our model with a previous approach, and show that it performs better on several evaluation tasks. Finally, we discuss how our method could be used to develop efficient sampling techniques to extend existing feature norm datasets in a reliable way.


Introduction
Distributional semantic modelling of word meaning has become a popular method for building pretrained lexical representations for downstream Natural Language Processing (NLP) tasks (Baroni and Lenci, 2010; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). In this approach, meaning is encoded in a dense vector space model, such that words (or concepts) whose vector representations are spatially close together are similar in meaning. A criticism of these real-valued embedding vectors is the opaqueness of their representational dimensions and their lack of cognitive plausibility and interpretability (Murphy et al., 2012; Şenel et al., 2018). In contrast, human conceptual property knowledge is often modelled in terms of relatively sparse and interpretable vectors, based on verbalizable, human-elicited features collected in property knowledge surveys (McRae et al., 2005). However, gathering and collating human-elicited property knowledge for concepts is very labour intensive, limiting both the number of words for which a rich feature set can be gathered and the completeness of the feature listings for each word. Neural embedding models, on the other hand, learn from large corpora of text in an unsupervised fashion, allowing very detailed, high-dimensional semantic models to be constructed for a very large number of words. In this paper, we propose Feature2Vec, a computational framework that combines information from human-elicited property knowledge with information from distributional word embeddings, allowing us to exploit the strengths of both approaches. Feature2Vec maps human property norms onto a pretrained vector space model of word meaning. Embedding feature-based information in the pretrained embedding space makes it possible to rank the relevance of features using cosine similarity, and we demonstrate how simple composition of features can be used to approximate concept vectors.

Related Work
Several property-listing studies have been conducted with human participants in order to build property norms, that is, datasets of normalized, human-verbalizable feature listings for lexical concepts (McRae et al., 2005). One use of feature norms is to critically examine distributional semantic models on their ability to encode grounded, human-elicited semantic knowledge. For example, Rubinstein et al. (2015) demonstrated that state-of-the-art distributional semantic models fail to predict attributive properties of concept words (e.g. the properties is-red and is-round for the word apple) as accurately as taxonomic properties (e.g. is-a-fruit). Similarly, Sommerauer and Fokkens (2018) investigated the types of semantic knowledge encoded within pretrained word embeddings, concluding that some properties cannot be learned by supervised classifiers. Collell and Moens (2016) compared linguistic and visual representations of object concepts on their ability to represent different types of property knowledge. Research has shown that state-of-the-art distributional semantic models built from text corpora fail to capture important aspects of meaning related to grounded perceptual information, as this kind of information is not adequately represented in the statistical regularities of text data (Li and Gauthier, 2017; Kelly et al., 2014). Motivated by these issues, Silberer (2017) constructed multimodal semantic models from text and image data, with the goal of grounding word meaning using visual attributes. More recently, Derby et al. (2018) built similar models with the added constraint of sparsity, demonstrating that sparse multimodal vectors provide a more faithful representation of human semantic representations. Finally, the work that most resembles ours is that of Fagarasan et al. (2015), who use Partial Least Squares Regression (PLSR) to learn a mapping from a word embedding model onto specific conceptual properties.
Concurrent work recently undertaken by Li and Summers-Stay (2019) replaces the PLSR model with a feedforward neural network. In our work, we instead map property knowledge directly into vector space models of word meaning, rather than learning a supervised predictive function from concept embedding dimensions to feature terms.

Method
We make primary comparison with the work of Fagarasan et al. (2015), although their approach differs from ours in that they map from an embedding space onto the feature space, while we learn a mapping from the feature domain onto the embedding space. We outline both methods below.

Distributional Semantic Models
For our experiments, we make use of the pretrained GloVe embeddings (Pennington et al., 2014) provided in the spaCy package, trained on the Common Crawl. The GloVe model includes 685,000 tokens with embedding vectors of dimension 300, providing excellent lexical coverage with a rich set of semantic representations.
In our analyses we use both the McRae property norms (McRae et al., 2005), which contain 541 concepts and 2526 features, and the CSLB norms, which have 638 concepts with 2725 features. For both sets of norms, a feature is listed for a concept if it has been elicited by five or more participants in the property norming study. The number of participants listing a given feature for a given concept name is termed the production frequency for that concept×feature pair. This gives sparse production frequency vectors for each concept over all the features in the norms.
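As an illustration of how such production frequency vectors might be assembled, the following minimal sketch builds a small concept-by-feature matrix from (concept, feature, frequency) triples. The toy triples, the variable names, and the underscore-style feature labels are assumptions made for this example and are not taken from the McRae or CSLB data files.

```python
import numpy as np

# Hypothetical toy norms: (concept, feature, production frequency) triples.
norm_triples = [
    ("apple", "is_red", 12),
    ("apple", "is_a_fruit", 25),
    ("banana", "is_a_fruit", 27),
    ("banana", "is_yellow", 20),
    ("banana", "is_red", 2),  # produced by fewer than five participants
]
MIN_PARTICIPANTS = 5  # a feature is listed only if at least 5 participants gave it

# Keep only concept-feature pairs that meet the listing threshold.
kept = [(c, f, pf) for c, f, pf in norm_triples if pf >= MIN_PARTICIPANTS]

# Fix a row order for concepts and a column order for features.
concepts = sorted({c for c, _, _ in kept})
features = sorted({f for _, f, _ in kept})
concept_index = {c: i for i, c in enumerate(concepts)}
feature_index = {f: j for j, f in enumerate(features)}

# F[i, j] is the production frequency of feature j for concept i (0 if unlisted).
F = np.zeros((len(concepts), len(features)))
for c, f, pf in kept:
    F[concept_index[c], feature_index[f]] = pf
```

With real norms almost all entries of such a matrix are zero, so a sparse matrix type may be preferable to the dense array used here.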

Partial Least Squares Regression (PLSR)
Fagarasan et al. (2015) used partial least squares regression (PLSR) to map between the GloVe embedding space and property norm vectors. Suppose we have two real-valued matrices G ∈ R^{n×m} and F ∈ R^{n×k}. In this context, G and F represent GloVe embedding vectors and property norm feature vectors, respectively. For n available concept words, G is a matrix which consists of stacked pretrained embeddings from GloVe and F is the (sparse) matrix of production frequencies for each concept×feature pair. G and F share the same row indexing for concept words. For a new dimension size p ∈ N, a partial least squares regression learns two new subspaces with dimensions n × p, which have maximal covariance between them. The algorithm solves this problem by learning a mapping from the matrix G onto F, similar to a regression model. The fitted regression model thus provides a framework for predicting vectors in the feature space from vectors in the embedding space.
In this work, we use the PLSR approach as a baseline for our model. In implementing PLSR, we set the intermediate dimension size to 50, following Fagarasan et al. (2015). We also build a PLSR model using 120 dimensions, which in preliminary experimentation we found gave the best performance from a range of values tested.

Skip-Gram Word2Vec
Mikolov et al. (2013) proposed learning word embeddings using a predictive neural-network approach. In particular, the skip-gram implementation with negative sampling mines word co-occurrences within a text window, and the network must learn to predict the surrounding context from the target word. More specifically, for a vocabulary V, two sets of embeddings are learned through gradient descent, one for target embeddings and one for context embeddings. Given a target word w ∈ V and a context word c ∈ V in its window, the network calculates the dot product of the embeddings for w and c, and a sigmoid activation is applied to the output (Fig. 1(a)). Negative samples are also generated for training, where the context is not in the target word's window. Let (w, c) ∈ D be the positive word and context pairs and (w, c) ∈ D' be the negative word and context pairs. Then, using binary cross entropy loss, we learn a parameterization θ of the neural network that maximizes the function
$$\sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)$$
where σ is the sigmoid function and v_w and v_c are the corresponding real-valued embeddings for the target words and context words.
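The skip-gram objective with negative sampling can be written as a few lines of NumPy. The sketch below assumes the standard formulation, in which positive pairs contribute log σ(v_c · v_w) and negative pairs contribute log σ(−v_c · v_w); the function and argument names are illustrative only.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sgns_objective(target_vecs, context_vecs, pos_pairs, neg_pairs):
    """Negative-sampling objective to be maximised.

    target_vecs, context_vecs: arrays of shape (vocab_size, dim).
    pos_pairs, neg_pairs: iterables of (target_id, context_id) index pairs.
    """
    # Observed pairs should receive a high sigmoid score.
    pos = sum(log_sigmoid(target_vecs[w] @ context_vecs[c]) for w, c in pos_pairs)
    # Sampled negative pairs should receive a low sigmoid score.
    neg = sum(log_sigmoid(-(target_vecs[w] @ context_vecs[c])) for w, c in neg_pairs)
    return pos + neg
```

Maximising this quantity is equivalent to minimising a binary cross entropy loss in which positive pairs are labelled 1 and negative pairs are labelled 0.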
In this work, we adapt this skip-gram approach to the task of constructing semantic representations of human property norms by mapping properties into an embedding space (Figure 1). We achieve this by using a neural network to predict the properties from the input word, using the skip-gram architecture with negative sampling on the properties. We replace context embeddings and windowed co-occurrence counts from the conventional skip-gram architecture with property embeddings and concept-feature production frequencies. The loss function for training remains the same; however, there are two modifications to the learning process. First, the target embeddings for the concept words are pretrained (i.e. the GloVe embeddings), and gradients are not applied to this embedding matrix. The layer for the property norms is randomly initialized, and gradients are applied to these vectors to learn a semantic representation for properties aligned to the pretrained distributional semantic space for the words. Second, the negative samples are generated from randomly sampled properties. We downweight negative samples by multiplying their associated loss by one over the negative sampling rate, so that the system pays more attention to real cases and less to the incorrect negative examples. Due to the sparsity of word-feature production frequencies, we generate all positive instances and randomly sample negative examples after each epoch to create a new set of training samples. We name this approach Feature2Vec. We use a learn-
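The training procedure described in this section can be approximated by the following NumPy sketch. It is not the authors' implementation. It assumes plain stochastic gradient descent, treats every listed concept-feature pair as a single positive example (ignoring the production frequency values), and samples negatives uniformly over all features. The function name and the hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_feature_embeddings(concept_vecs, positives, n_features,
                             neg_rate=5, epochs=50, lr=0.05, seed=0):
    """Learn feature embeddings in the space of fixed concept embeddings.

    concept_vecs: (n_concepts, dim) pretrained vectors; never updated here.
    positives: list of (concept_id, feature_id) pairs listed in the norms.
    """
    rng = np.random.default_rng(seed)
    dim = concept_vecs.shape[1]
    # Only the feature layer is randomly initialised and trained.
    feature_vecs = rng.normal(scale=0.1, size=(n_features, dim))
    neg_weight = 1.0 / neg_rate  # downweight the loss of negative samples

    for _ in range(epochs):
        # All positives are used; negatives are resampled every epoch.
        samples = [(w, f, 1.0, 1.0) for w, f in positives]
        for w, _ in positives:
            for f in rng.integers(0, n_features, size=neg_rate):
                samples.append((w, int(f), 0.0, neg_weight))

        for i in rng.permutation(len(samples)):
            w, f, label, weight = samples[i]
            score = sigmoid(concept_vecs[w] @ feature_vecs[f])
            # Gradient of the weighted binary cross entropy w.r.t. the feature vector.
            feature_vecs[f] -= lr * weight * (score - label) * concept_vecs[w]

    return feature_vecs
```

Because negatives are drawn uniformly, a sampled "negative" can coincide with a true property of the concept; the downweighting by `1 / neg_rate` limits the influence of such mislabelled examples.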

Experiments
We train with the McRae and CSLB property norms separately and report evaluations for each dataset. For the McRae dataset we use 400 randomly selected concepts for training and the remaining 141 for testing, and for the CSLB dataset we use 500 randomly selected concepts for training and the remaining 138 for testing 4 .

Predicting Feature Vectors
We first evaluate how well the baseline PLSR model performs on the feature vector reconstruction task used by Fagarasan et al. (2015). In this evaluation, the feature vector for a test concept is predicted and we test whether the real concept vector is within the top N most similar neighbours of the predicted vector. We report results over both 50 (as in Fagarasan et al. (2015)) and 120 dimensions for a range of values of N (Table 1).
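The top-N neighbour test used in this evaluation can be sketched as follows. Cosine similarity is assumed as the similarity measure, and all names are illustrative.

```python
import numpy as np

def cosine_similarities(query, matrix):
    # Cosine similarity between one query vector and every row of `matrix`.
    # Assumes no all-zero vectors (every concept has at least one feature).
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def in_top_n(predicted, gold_index, candidates, n):
    """True if the gold vector is among the n nearest neighbours of `predicted`.

    candidates: (n_concepts, dim) array of real concept vectors.
    gold_index: row of `candidates` holding the true vector for the test concept.
    """
    sims = cosine_similarities(predicted, candidates)
    top = np.argsort(-sims)[:n]  # indices of the n most similar candidates
    return bool(gold_index in top)

def top_n_accuracy(predictions, gold_indices, candidates, n):
    # Proportion of test concepts whose true vector is retrieved in the top n.
    hits = [in_top_n(p, g, candidates, n) for p, g in zip(predictions, gold_indices)]
    return sum(hits) / len(hits)
```

Reporting `top_n_accuracy` for several values of `n` gives one score per value of N for each model.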

Constructing Concept Representations from Feature Vectors
For Feature2Vec, we embed property norm features into the GloVe semantic space, giving a representation of properties in terms of GloVe dimensions. To predict a held-out concept embedding, we build a representation of the concept word by averaging the learned feature embedding vectors for that word using the ground truth information from the property norm dataset. This gives a method to construct embeddings for new words using property knowledge and associated production frequencies (for example, for a held-out word unicorn, its GloVe embedding vector might be predicted from all features of horse, along with the features is-white, has-a-horn, and is-fantastical).
We compare these predicted embeddings to the held-out GloVe embeddings (Table 1). However, we note that this approach is different to the PLSR models, so we do not make a direct comparison between PLSR and Feature2Vec nearest neighbour results. Nevertheless, the results show that the word embeddings composed from the learned Feature2Vec feature embeddings appear relatively frequently amongst the most similar neighbour words in the pretrained GloVe space, indicating that feature embedding composition approximates the original word embeddings reasonably well.
4 We use the Python pickle package to store the numpy state for reproducible results in our code.
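Composing a concept vector from feature embeddings amounts to an average over the rows for that concept's features. The sketch below shows a plain average and, as an optional assumption, an average weighted by production frequency; the text does not state which weighting is used, and the names are illustrative.

```python
import numpy as np

def compose_concept_vector(feature_vecs, feature_ids, weights=None):
    """Approximate a concept embedding from the embeddings of its features.

    feature_vecs: (n_features, dim) array of learned feature embeddings.
    feature_ids: indices of the features listed for the concept.
    weights: optional per-feature weights (e.g. production frequencies);
             None gives an unweighted mean.
    """
    vecs = feature_vecs[feature_ids]
    return np.average(vecs, axis=0, weights=weights)
```

The composed vector can then be compared with the held-out pretrained embeddings using a nearest-neighbour search.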

Predicting Property Knowledge
The evaluation task that we are most interested in is how well the models can predict feature knowledge for concepts, given the distributional semantic vectors. More specifically, for a given concept with K features, we take the top K predicted features according to each method and record the overlap with the true property norm listing. In this evaluation, we make direct comparisons between all three models (PLSR 50, PLSR 120, and Feature2Vec). For the PLSR models, we predict the feature vector for a given target word using the embedding vector as input and take the top K weighted features. For Feature2Vec, we rank all feature embeddings by their cosine similarity to the embedding for the target word and take the top K most similar features (Table 2). The results demonstrate that Feature2Vec outperforms the PLSR models on property knowledge prediction, for both training and testing datasets.

Table 3. Top 10 predicted properties for sample concepts.
Concepts: Properties
Kingfisher: has wings, does fly, has a beak, has feathers, is a bird, does eat, has a tail, does swim, has legs, does lay eggs
Avocado: is eaten edible, is tasty, does grow, is green, is healthy, is used in cooking, has skin peel, is red, is food, is a vegetable
Door: made of metal, has a door doors, is useful, has a handle handles, made of wood, made of plastic, is heavy, is furniture, does contain hold, is found in kitchens
Dragon: is big large, is an animal, has a tail, does eat, is dangerous, has legs, has claws, is grey, is small, does fly
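The top-K overlap evaluation for Feature2Vec described in this section can be sketched as below. Expressing the overlap as a proportion of K is an assumption about how the scores are normalised, and the function names are illustrative.

```python
import numpy as np

def top_k_features(concept_vec, feature_vecs, k):
    # Rank every feature embedding by cosine similarity to the concept embedding.
    sims = (feature_vecs @ concept_vec) / (
        np.linalg.norm(feature_vecs, axis=1) * np.linalg.norm(concept_vec)
    )
    return np.argsort(-sims)[:k]

def overlap_score(concept_vec, feature_vecs, gold_features):
    """Proportion of the K gold features recovered among the top K predictions.

    gold_features: collection of feature indices listed for the concept in the norms.
    """
    k = len(gold_features)
    predicted = set(top_k_features(concept_vec, feature_vecs, k).tolist())
    return len(predicted & set(gold_features)) / k
```

For the PLSR models the same `overlap_score` logic would apply, with the ranking taken from the predicted feature weights instead of cosine similarities.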

Analysis
Following previous work, we provide the top 10 feature predictions for a few sample concepts, displayed in Table 3. Properties underlined and in bold represent features that match the available ground truth data (i.e., the concept×feature pair occurs in the norms). The first two words in Table 3 were sampled from the CSLB norms test set, whilst the last two words were randomly sampled from the word embedding lexicon and are not concept words appearing in the CSLB norms. We find that the predicted features that are not contained within the ground truth property set still tend to be quite reasonable, even for the two concepts not in the test dataset. As property norms do not represent an exhaustive listing of property knowledge, this is not surprising, and predicted properties not in the norms are not necessarily errors (Devereux et al., 2009; Fagarasan et al., 2015). Moreover, the set of features used within the norms is dependent on the concepts that were presented to the human participants. It is therefore notable that the conceptual representations predicted by our model for the two out-of-norms concept words are particularly plausible, even though the attributes were never intended to conceptually represent these words. Our analysis supports the view that such supervised models could be utilised as an assistive tool for surveying much larger vocabularies of words.

Conclusion
We proposed a method for constructing distributional semantic vectors for human property norms from a pretrained vector space model of word meaning, which outperforms previous methods for predicting concept features on two property norm datasets. As discussed by Fagarasan et al. (2015) and others, it is clear that property norm datasets provide only a semi-complete picture of human conceptual knowledge, and more extensive surveys may provide additional useful property knowledge information. By predicting plausible semantic features for concepts through the leveraging of corpus-derived word embedding data, our method offers a useful tool for guiding the expensive and laborious process of collecting property norm listings. For example, existing property norm datasets can be extended through human verification of features predicted with high confidence by Feature2Vec, with these features being added to the norms and subsequently incorporated into Feature2Vec in an iterative, semi-supervised manner (Kelly et al., 2012). Thus, Feature2Vec provides a useful heuristic to add interpretable feature-based information to these datasets for new words in a practical and efficient way.