Inferring Psycholinguistic Properties of Words

We introduce a bootstrapping algorithm for regression that exploits word embedding models. We use it to infer four psycholinguistic properties of words: Familiarity, Age of Acquisition, Concreteness and Imagery, and further populate the MRC Psycholinguistic Database with these properties. The approach achieves 0.88 correlation with human-produced values, and the inferred psycholinguistic features lead to state-of-the-art results when used in a Lexical Simplification task.


Introduction
Throughout the last three decades, much has been learned about how the psycholinguistic properties of words influence cognitive processes in the human brain when a subject is presented with their written or spoken forms. A word's Age of Acquisition is one example. The findings in (Carroll and White, 1973) reveal that objects whose names are learned earlier in life can be named faster in later stages of life. Zevin and Seidenberg (2002) show that words learned at an early age are orthographically or phonologically very distinct from those learned in adult life.
Other examples of psycholinguistic properties, such as Familiarity and Concreteness, influence one's proficiency in word recognition and text comprehension. The experiments in (Connine et al., 1990; Morrel-Samuels and Krauss, 1992) show that words with high Familiarity yield lower reaction times in both visual and auditory lexical decision, and require less hand gesticulation in order to be described. Begg and Paivio (1969) found that humans are less sensitive to changes in wording made to sentences with high Concreteness words.
When quantified, these aspects can be used as features in various Natural Language Processing (NLP) tasks. One example is a Lexical Simplification approach that combined various collocational features with psycholinguistic measures extracted from the MRC Psycholinguistic Database (Coltheart, 1981) to train a ranker (Joachims, 2002), reaching first place in the English Lexical Simplification task at SemEval 2012. Semantic Classification tasks have also benefited from the use of such features: by combining Concreteness with other features, Hill and Korhonen (2014) reached state-of-the-art performance in Semantic Composition (denotative/connotative) and Semantic Modification (intersective/subsective) prediction.
Despite the evident usefulness of psycholinguistic properties of words, resources describing such properties are rare. The most extensively developed resource for English is the MRC Psycholinguistic Database (Section 2). However, it is far from complete, most likely due to the inherent cost of manually entering such properties. In this paper we propose a method to automatically infer these missing properties. We train regressors by performing bootstrapping (Yarowsky, 1995) over the existing features in the MRC database, exploiting word embedding models and other linguistic resources for this purpose (Section 3). This approach outperforms various strong baselines (Section 4) and the resulting properties lead to significant improvements when used in Lexical Simplification models (Section 5).

The MRC Psycholinguistic Database
Introduced by Coltheart (1981), the MRC (Machine Readable Dictionary) Psycholinguistic Database is a digital compilation of lexical, morphological and psycholinguistic properties for 150,837 words. The 27 psycholinguistic properties in the resource range from simple frequency measures (Rudell, 1993) to elaborate measures estimated by humans, such as Age of Acquisition and Imagery (Gilhooly and Logie, 1980). However, despite various efforts to populate the MRC Database, these properties are only available for small subsets of the 150,837 words.
We focus on four manually estimated psycholinguistic properties in the MRC Database:
• Familiarity: The frequency with which a word is seen, heard or used daily. Available for 9,392 words.
• Age of Acquisition: The age at which a word is believed to be learned. Available for 3,503 words.
• Concreteness: How "palpable" the object the word refers to is. Available for 8,228 words.
• Imagery: The intensity with which a word arouses images. Available for 9,240 words.
All four properties are real values, determined based on different quantifiable metrics. We focus on these properties since they have been proven useful and are some of the most scarce in the MRC Database. As we discussed in Section 1, these properties have been successfully used in various approaches for Lexical Simplification and Semantic Classification, and yet are available for no more than 6% of the words in the MRC Database.

Bootstrapping with Word Embeddings
In order to automatically estimate the missing psycholinguistic properties in the MRC Database, we resort to bootstrapping. We base our approach on that of Yarowsky (1995), a bootstrapping algorithm which aims to learn a classifier from a reduced set of annotated training instances (or "seeds"). It does so by performing the following five steps:
1. Initialise training set S with the available seeds.
2. Train a classifier over S.

3. Predict values for a set of unlabelled instances U.
4. Add to S all instances from U for which the prediction confidence c is equal to or greater than ζ.
5. If at least one instance was added to S, go to step 2; otherwise, return the resulting classifier.
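The five steps above can be sketched as a generic self-training loop. This is a minimal sketch, not the original implementation; all function and parameter names are ours:

```python
def bootstrap(seeds, unlabelled, train, predict, confidence, zeta):
    """Yarowsky-style self-training loop (sketch).

    seeds: dict mapping instance -> label
    unlabelled: set of unlabelled instances
    train: fits a model on a labelled dict
    predict/confidence: model's prediction and confidence for an instance
    zeta: minimum confidence threshold
    """
    labelled = dict(seeds)                    # 1. initialise S with the seeds
    pool = set(unlabelled)
    while True:
        model = train(labelled)               # 2. train over S
        added = []
        for x in pool:                        # 3. predict for unlabelled U
            if confidence(model, x) >= zeta:  # 4. keep confident predictions
                labelled[x] = predict(model, x)
                added.append(x)
        pool -= set(added)
        if not added:                         # 5. stop once S no longer grows
            return model
```

With classifiers, `confidence` would typically be a distance to the decision boundary; the next section replaces it with an embedding-based similarity for regression.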
One critical difference between this approach and ours is that our task requires regression algorithms instead of classifiers. In classification, the prediction confidence c is often calculated as the maximum signed distance between an instance and the estimated hyperplanes. There is, however, no analogous confidence estimation technique for regression problems. We address this problem by using word embedding models.
Embedding models have proven effective in capturing linguistic regularities of words (Mikolov et al., 2013b). In order to exploit these regularities, we assume that the quality of a regressor's prediction on an instance is directly proportional to how similar the instance is to those in the labelled set. Since the inputs to the regressors are words, we compute the similarity between a test word and the words in the labelled dataset as the maximum cosine similarity between the test word's vector and the vectors in the labelled set.
Let M be an embeddings model trained over vocabulary V , S a set of training seeds, ζ a minimum confidence threshold, sim(w, S, M ) the maximum cosine similarity between word w and S with respect to model M , R a regression model, and R(w) its prediction for word w. Our bootstrapping algorithm is depicted in Algorithm 1.
Algorithm 1: Regression Bootstrapping
input: M, V, S, ζ
output: R
repeat
    train R over S
    foreach word w ∈ V not in S do
        if sim(w, S, M) ≥ ζ then add (w, R(w)) to S
until no word is added to S
return R

We found that 64,895 out of the 150,837 words in the MRC database were not present in either WordNet or our word embedding models. Since our bootstrappers use features extracted from both of these resources, we were only able to predict the Familiarity, Age of Acquisition, Concreteness and Imagery values of the remaining 85,942 words in MRC.
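The confidence estimate sim(w, S, M) reduces to a maximum cosine similarity, which can be sketched in a few lines of numpy (toy vectors; the function name is ours):

```python
import numpy as np

def sim(word_vec, seed_vecs):
    """Max cosine similarity between a word vector and the seed vectors.

    word_vec: 1-D array; seed_vecs: 2-D array, one seed vector per row.
    """
    w = word_vec / np.linalg.norm(word_vec)
    S = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    return float(np.max(S @ w))

# Toy example: the test word equals one of the seeds, so sim == 1.0.
seeds = np.array([[1.0, 0.0], [0.0, 1.0]])
print(sim(np.array([1.0, 0.0]), seeds))  # → 1.0
```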

Evaluation
Since we were unable to find previous work on this task, in these experiments we compare the performance of our bootstrapping strategy to various baselines. For training, we use the Ridge regression algorithm (Tikhonov, 1963). As features, our regressor uses the word's raw embedding values, along with the following 15 lexical features:
• The word's length and number of syllables, as determined by the Morph Adorner module of LEXenstein (Paetzold and Specia, 2015).
• Minimum, maximum and average distance between the word's senses in WordNet and the thesaurus' root sense.
• Number of images found for the word in the Getty Images database.
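A rough sketch of how such a lexical feature vector might be assembled; the naive syllable counter below merely stands in for Morph Adorner, and the sense depths and image count are assumed to be supplied by the caller:

```python
import re

def count_syllables(word):
    """Naive vowel-group count, a stand-in for LEXenstein's Morph Adorner."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_features(word, sense_depths, image_count):
    """Partial feature vector: length, syllables, min/max/avg WordNet
    sense depth, and an image count. The paper's full 15-feature set
    is not reproduced here."""
    feats = [len(word), count_syllables(word)]
    feats += [min(sense_depths), max(sense_depths),
              sum(sense_depths) / len(sense_depths)]
    feats.append(image_count)
    return feats

print(lexical_features("table", [5, 7], 120))  # → [5, 2, 5, 7, 6.0, 120]
```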
We train our embedding models using word2vec (Mikolov et al., 2013a) over a corpus of 7 billion words composed of the SubIMDB corpus, UMBC webbase, News Crawl, SUBTLEX (Brysbaert and New, 2009), Wikipedia and Simple Wikipedia (Kauchak, 2013). We use 5-fold cross-validation to optimise the parameters: ζ, the embeddings model architecture (CBOW or Skip-Gram), and the word vector size (from 300 to 2,500 in intervals of 200). We include five strong baseline systems in the comparison:
• Max. Similarity: The test word is assigned the property value of the closest word in the training set, i.e. the word with the highest cosine similarity according to the word embeddings model.
• Avg. Similarity: Test word is assigned the average property value of the n closest words in the training set, i.e. the words with the highest cosine similarity according to the word embeddings model. The value of n is decided through 5-fold cross validation.
• Simple SVM: Test word is assigned the property value as predicted by an SVM regressor (Smola and Vapnik, 1997) with a polynomial kernel trained with the 15 aforementioned lexical features.
• Simple Ridge: Test word is assigned the property value as predicted by a Ridge regressor trained with the 15 aforementioned lexical features.
• Super Ridge: Identical to Simple Ridge, the only difference being that it also includes the word embeddings in the feature set. We note that this baseline uses the exact same features and regression algorithm as our bootstrapped regressors.
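The two similarity baselines can be sketched in a few lines of numpy (toy vectors and values; `similarity_baseline` is our name, not from the paper):

```python
import numpy as np

def similarity_baseline(test_vec, train_vecs, train_values, n=1):
    """n=1 gives Max. Similarity; n>1 gives Avg. Similarity over the
    n nearest training words by cosine similarity."""
    T = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    w = test_vec / np.linalg.norm(test_vec)
    sims = T @ w
    nearest = np.argsort(sims)[::-1][:n]
    return float(np.mean(np.asarray(train_values)[nearest]))

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
vals = [100.0, 200.0, 300.0]
print(similarity_baseline(np.array([1.0, 0.05]), vecs, vals, n=1))  # → 100.0
print(similarity_baseline(np.array([1.0, 0.05]), vecs, vals, n=2))  # → 150.0
```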
The parameters of all baseline systems are optimised following the same method as for our approach. We also measure the correlation between each of the aforementioned lexical features and the psycholinguistic properties. For each psycholinguistic property, we create a training and a test set by splitting the labelled instances available in the MRC Database into two equally sized portions. All training instances are used as seeds in our approach. As evaluation metrics, we use Spearman's (ρ) and Pearson's (r) correlation. Pearson's correlation is the most important indicator of performance: an effective regressor should predict values that vary linearly with a given psycholinguistic property.
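Both correlation metrics are available in scipy; a toy illustration of the evaluation step, with made-up gold values and predictions:

```python
from scipy.stats import pearsonr, spearmanr

gold = [1.0, 2.0, 3.0, 4.0, 5.0]   # human-produced property values
pred = [1.1, 2.3, 2.9, 4.2, 4.8]   # regressor predictions

r, _ = pearsonr(gold, pred)        # linear association
rho, _ = spearmanr(gold, pred)     # rank (monotonic) association
print(round(r, 3), round(rho, 3))
```

Since the toy predictions preserve the gold ranking, ρ is 1.0 while r stays just below 1.0, illustrating why Pearson's r is the stricter, linearity-sensitive indicator.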
The results are shown in Table 1. While the similarity-based approaches tend to perform well for Concreteness and Imagery, typical regressors capture Familiarity and Age of Acquisition more effectively. Our approach, on the other hand, is consistently superior for all psycholinguistic properties, with both Spearman's and Pearson's correlation scores varying between 0.82 and 0.88. The difference in performance between the Super Ridge baseline and our approach confirms that our bootstrapping algorithm can in fact improve on the performance of a regressor. The parameters used by our bootstrappers, reported below, highlight the importance of parameter optimisation in our bootstrapping strategy: its performance peaked with very different configurations for most psycholinguistic properties:
• Familiarity: 300 word vector dimensions with a Skip-Gram model, and ζ = 0.9.
• Age of Acquisition: 700 word vector dimensions with a CBOW model, and ζ = 0.7.
Interestingly, frequency in the SubIMDB corpus, composed of over 7 million sentences extracted from subtitles of "family" movies and series, has good linear correlation with Familiarity and Age of Acquisition, much higher than any other feature. For Concreteness and Imagery, on the other hand, the results suggest something different: the further a word is from the root of a thesaurus, the more likely it is to refer to a physical object or entity.

Psycholinguistic Features for LS
Here we assess the effectiveness of our bootstrappers in the task of Lexical Simplification (LS). As prior work has shown, psycholinguistic features can help supervised ranking algorithms capture word simplicity. Using the parameters described in Section 4, we train bootstrappers for these two properties using all instances in the MRC Database as seeds. We then train three rankers with (W) and without (W/O) psycholinguistic features:
• Horn (Horn et al., 2014): Uses an SVM ranker trained on various n-gram probability features.
• Glavas (Glavaš and Štajner, 2015): Ranks candidates using various collocational and semantic metrics, and then re-ranks them according to their average rankings.
• Paetzold (Paetzold and Specia, 2015): Ranks words according to their distance to a decision boundary learned from a classification setup inferred from ranking examples. Uses n-gram frequencies as features.
We use data from the English Lexical Simplification task of SemEval 2012 to assess the systems' performance. The goal of the task is to rank words in different contexts according to their simplicity. The training and test sets contain 300 and 1,710 instances, respectively. The official metric from the task, TRank, is used to measure the systems' performance. As discussed in (Paetzold, 2015), this metric best represents LS performance in practice. The results in Table 2 show that the addition of our features leads to performance increases with all rankers. Performing F-tests over the rankings estimated for the simplest candidate in each instance, we found these differences to be statistically significant (p < 0.05). Using our features, the Paetzold ranker reaches the best published results for the dataset, significantly superior to those of the best system at SemEval 2012.

Conclusions
Overall, the proposed bootstrapping strategy for regression has led to very positive results, despite its simplicity. It is therefore a cheap and reliable alternative to manually producing psycholinguistic properties of words. Word embedding models have proven very useful in bootstrapping, both as surrogates for confidence predictors and as regression features. Our findings also indicate the usefulness of individual features and resources: word frequencies in the SubIMDB corpus have a much stronger correlation with Familiarity and Age of Acquisition than those in previously used corpora, while the depth of a word in a thesaurus hierarchy correlates well with both its Concreteness and Imagery.
In future work we plan to employ our bootstrapping solution in other regression problems, and to further explore potential uses of automatically learned psycholinguistic features.