Discovering Phonesthemes with Sparse Regularization

We introduce a simple method for extracting non-arbitrary form-meaning representations from a collection of semantic vectors. We treat the problem as one of feature selection for a model trained to predict word vectors from subword features. We apply this model to the problem of automatically discovering phonesthemes, which are submorphemic sound clusters that appear in words with similar meaning. Many of our model-predicted phonesthemes overlap with those proposed in the linguistics literature, and we validate our approach with human judgments.


Introduction
Linguists have long held that language is arbitrary, that is, that a word's phonetic and orthographic forms have no relation to its meaning (de Saussure, 1916). For example, there is nothing about an apple that suggests that apple is the proper word for it; this link between meaning and its representation in language is arbitrary. Arbitrariness is a defining feature of human language, and it is one of the design features of language proposed by Hockett (1960).
Despite this, work over the last decades has revealed several exceptions to the arbitrariness of language. One such exception is iconicity, where the form of a word directly resembles its meaning. For example, Ohala (1984) showed that speakers tend to associate vowels with high acoustic frequency with smaller objects, while vowels with low acoustic frequency are associated with larger objects. In this case, speakers make a link between the phonetic form of a word and its perceived meaning because of an innate belief that smaller entities emit higher-frequency vowels while larger entities tend to emit low-frequency vowels.
Similarly, Köhler (1929) and Ramachandran and Hubbard (2001) observed a non-arbitrary connection between the shapes of objects and speech sounds. American college undergraduates and Tamil speakers were presented with a jagged shape and a rounded shape and asked which is "kiki" and which is "bouba". In both groups, 95% to 98% selected the jagged shape as "kiki" and the rounded shape as "bouba", demonstrating that the human brain connects sounds to shapes in a consistent way. D'Onofrio (2014) posits that the rounded shape is commonly named "bouba" since the mouth forms a rounded shape in producing the word, whereas pronouncing "kiki" requires a tighter, more angular mouth shape that seems more apt for the jagged object. In this case, there is a strong, non-arbitrary link between the articulatory properties of the sounds and their perceived meaning.
Phonesthemes are another exception to the arbitrariness of language. Phonesthemes are noncompositional, submorphemic phonetic units that consistently occur in words with similar meanings. For example, the word-initial cluster gl- occurs at the beginning of many English words relating to light or vision, such as glint, glitter, gleam, and glamour (Hutchins, 1998; Bergen, 2004). The work of Hutchins (1998) includes a compilation of 46 phonesthemes proposed by linguists.
There is a body of previous work suggesting that phonesthemes are units in the mental lexicon of native speakers. For example, the work of Hutchins (1998), Magnus (2000), and Bergen (2004) uses priming experiments and other methods from psycholinguistics to demonstrate that phonesthemes significantly affect native speaker reaction times in a range of language processing tasks. In another line of work, Otis and Sagi (2008) and Abramova and Fernández (2016) verify phonesthemes by analyzing whether the words containing a given phonestheme are more semantically similar than expected by chance, where semantic similarity is derived from a distributional semantic model.
While there has been much work on verifying previously proposed phonesthemes, there has been little work on automatically discovering new ones. In this work, our goal is to identify the likely phonesthemes of a language from a collection of semantic vectors. We do this by identifying the character or phoneme sequences that are predictive of word meaning: we train a model to predict word vectors from subword features. Then, we use standard feature selection techniques to find a subset of features that best predict the vectors; this subset of features contains the model-predicted phonesthemes. Lastly, we validate the model-predicted English phonesthemes with human judgments and also find that many of our predicted phonesthemes overlap with those documented in previous work.

Method
To extract phonesthemes from a set of vectors, we want to find submorphemic units (e.g., character or phoneme n-grams) that are highly predictive of word meaning. We approach this problem through the lens of feature subset selection: given a model capable of predicting semantic vectors from submorpheme information, our goal is to select the subset of submorphemes (model features) that are most predictive. Intuitively, if a submorpheme is especially predictive of the word vectors, then it may be a meaning-bearing phonestheme.
We use linear regression to predict word vectors from binary feature vectors that encode the submorphemes occurring in a surface form. We use sparse regularization to select relevant features from this model, which enables it to automatically choose a subset of the submorpheme features that predict the vectors (our predicted phonesthemes).
Specifically, we regularize our linear regression model with the elastic net (Zou and Hastie, 2005), which combines L1 and L2 penalties on the model weights. We use scikit-learn (Pedregosa et al., 2011) to train our models, and we tune the L1 and L2 regularization strengths on held-out error in 5-fold cross-validation.
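As a rough illustration of this setup, the following sketch fits an elastic-net-regularized linear model to toy data and reads off the features that receive a nonzero weight. The use of scikit-learn's MultiTaskElasticNetCV (which zeroes out a feature across all output dimensions at once), the toy data, and the grid of regularization strengths are assumptions made for this example; the text above specifies only that scikit-learn, the elastic net, and 5-fold cross-validation are used.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

rng = np.random.default_rng(0)

# Toy data (hypothetical): 240 words, 6 candidate prefixes, 5-dim "word vectors".
n_words, n_candidates, dim = 240, 6, 5
prefix_ids = np.arange(n_words) % n_candidates
X = np.eye(n_candidates)[prefix_ids]  # one binary indicator per candidate

# Only candidates 0 and 1 shift the word vectors in a consistent direction.
Y = 0.1 * rng.standard_normal((n_words, dim))
Y[prefix_ids == 0] += 3.0
Y[prefix_ids == 1] -= 3.0

# Assumed grid of regularization settings, tuned with 5-fold cross-validation.
model = MultiTaskElasticNetCV(l1_ratio=[0.5, 0.9], alphas=[0.01, 0.05, 0.1], cv=5)
model.fit(X, Y)

# coef_ has shape (n_targets, n_features); take one weight norm per feature.
weights = np.linalg.norm(model.coef_, axis=0)
selected = np.flatnonzero(weights > 0)  # features kept by the sparse penalty
ranked = np.argsort(-weights)           # features ordered by absolute weight

print("selected features:", selected.tolist())
print("top two by weight:", ranked[:2].tolist())
```

In this toy problem the two informative prefixes should receive by far the largest weights; whether the uninformative ones are driven exactly to zero depends on the regularization strength chosen by cross-validation.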
Mitigating the Effect of Morphemes
A principal concern is that the model will detect morphemes rather than phonesthemes. Many past studies on the relationship between form and meaning in language (Shillcock et al., 2001; Monaghan et al., 2014; Gutiérrez et al., 2016; Dautriche et al., 2017) mitigated this concern by only considering monomorphemic words, discarding a large fraction of the lexicon in the process.
We take a different approach to this problem by proposing a two-step model designed to mitigate the effect of morphemes. We begin by training an unregularized linear regression model to predict semantic vectors from morpheme-level features. Then, we use the residuals of this first-stage morpheme-level model as the new target vectors for the sparsely regularized phonestheme extraction model. This removes the components of the word vector that are predictable from morpheme-level information, leaving only the aspects of word meaning not covered by morphology.
We use the morphological analyses in the CELEX lexical database (Baayen et al., 1996) to compile a list of morphemes, which is used to create the morpheme-level feature vectors. We also use this list to remove any morphemes that may appear in the final model output.
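The two-stage idea can be sketched in a few lines of NumPy. The morpheme indicators and word vectors below are synthetic, and the inclusion of an intercept column is an assumption of this example rather than a detail given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 100, 4

# Hypothetical binary indicators for three morphemes (columns) per word (rows).
M = rng.integers(0, 2, size=(n_words, 3)).astype(float)

# Synthetic word vectors: a morpheme-driven part plus everything else.
morpheme_effect = rng.standard_normal((3, dim))
other_meaning = rng.standard_normal((n_words, dim))
Y = M @ morpheme_effect + other_meaning

# Stage 1: unregularized least squares from morpheme features (plus intercept).
M1 = np.hstack([M, np.ones((n_words, 1))])
B, *_ = np.linalg.lstsq(M1, Y, rcond=None)

# Residuals: the part of each vector that morphemes cannot linearly predict.
residuals = Y - M1 @ B

# Stage 2 would fit the sparse phonestheme model with `residuals` as targets.
print(np.abs(M1.T @ residuals).max())  # close to zero, up to rounding
```

Because least-squares residuals are orthogonal to the regressors, no linear function of the morpheme features can explain any remaining variation in the second-stage targets.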

Data
For our experiments, we use 300-dimensional GloVe (Pennington et al., 2014) English word embeddings trained on the cased Common Crawl. Many of the terms in the set of pretrained vectors are not English words. As a first step toward removing non-English words and named entities, we discard types that are not alphabetical or not completely lowercased. In addition, it is unlikely that rare words or very common words will contribute to the formation of sound-meaning associations (Hutchins, 1998). To filter out these rare or common words (and remove additional non-English types), we remove types that either occur fewer than 1,000 times in the Gigaword corpus or occur in more than half of all Gigaword documents. Lastly, we remove any type whose lemma differs from the type itself and is also in the set of filtered word vectors, so that each lemma is represented only once. After this process, we are left with 7,889 types out of the original 2.2 million.
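The filtering steps above can be sketched as follows. The function name, the per-type statistics, and the lemma lookup are hypothetical stand-ins; in practice the counts would come from Gigaword and the lemmas from a lemmatizer, neither of which is specified in more detail here.

```python
def keep_type(word, corpus_count, doc_fraction,
              min_count=1000, max_doc_fraction=0.5):
    """Apply the surface-form and frequency filters to one type."""
    if not word.isalpha() or not word.islower():
        return False  # not alphabetical, or not completely lowercased
    if corpus_count < min_count:
        return False  # too rare
    if doc_fraction > max_doc_fraction:
        return False  # appears in more than half of all documents
    return True


# Hypothetical statistics: type -> (corpus count, fraction of documents).
stats = {
    "glimmer": (5000, 0.01),
    "glimmers": (1500, 0.005),
    "the": (10_000_000, 0.99),
    "Paris": (80_000, 0.05),
    "abc123": (3000, 0.01),
    "glaucous": (40, 0.0001),
}
lemmas = {"glimmers": "glimmer"}  # hypothetical lemma lookup

kept = {w for w, (count, frac) in stats.items() if keep_type(w, count, frac)}

# Drop inflected forms whose lemma also survived the filters.
kept = sorted(w for w in kept if lemmas.get(w, w) == w or lemmas[w] not in kept)
print(kept)
```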
We phonemicize our vectors by associating each word's vector with the word's ARPAbet symbol sequence, as provided in the CMU Pronouncing Dictionary (Carnegie Mellon University, 2014). If multiple types have the same ARPAbet symbol sequence (and are thus homophones), we discard them all. We also discard types that are not in the CMU Pronouncing Dictionary. Phonemicizing the filtered set of vectors results in a set of 6,633 vectors. Note that our model can be applied using either orthographic or phonemicized vectors. Phonesthemes are an inherently phonetic phenomenon, which suggests that it is ideal to model the features at the phoneme level. However, character-level features will be a reasonable approximation in some cases, especially since many of our extracted phonesthemes have a consistent orthographic representation. We release code for preprocessing data and training the models at http://nelsonliu.me/papers/phonesthemes/.
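A minimal sketch of the phonemicization step, assuming a small hand-written pronunciation table in the style of the CMU Pronouncing Dictionary. The entries, the stripped stress markers, and the out-of-dictionary word are illustrative assumptions.

```python
from collections import Counter

# Hypothetical excerpt of a pronunciation table (ARPAbet, stress digits removed).
pronunciations = {
    "glow": ("G", "L", "OW"),
    "gleam": ("G", "L", "IY", "M"),
    "flour": ("F", "L", "AW", "ER"),
    "flower": ("F", "L", "AW", "ER"),
}
vocab = ["glow", "gleam", "flour", "flower", "glorp"]

# Types missing from the dictionary are dropped.
in_dict = [w for w in vocab if w in pronunciations]

# Count how many types share each phoneme sequence; shared ones are homophones.
counts = Counter(pronunciations[w] for w in in_dict)

# Keep only types whose phoneme sequence is unique (all homophones are discarded).
phonemicized = {
    w: pronunciations[w] for w in in_dict if counts[pronunciations[w]] == 1
}
print(sorted(phonemicized))
```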

Experiments and Results
The candidate phonesthemes considered by the model are the word-initial phoneme bigram sequences that occur more than five times in our set of phonemicized vectors; we set a frequency threshold for feature inclusion since rare prefixes are unlikely to carry meaning. Each word's feature vector is a one-hot encoding of its bigram phoneme prefix. We choose to focus on word-initial bigrams since the bulk of prior work in linguistics has also focused on phonesthemes in this position. However, our method easily extends to larger subword units (e.g., trigrams), to candidate phonesthemes within or at the end of a word, and even to other languages; we leave analysis of phonesthemes of other sizes, in different positions, and of different languages for future work.
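The construction of candidate features can be sketched as below. The function names and the toy lexicon are hypothetical, and giving an all-zero row to words whose prefix falls below the frequency threshold is an assumption of this example.

```python
from collections import Counter


def bigram_prefix(phonemes):
    """Return the word-initial phoneme bigram (shorter if the word is shorter)."""
    return tuple(phonemes[:2])


def build_features(phonemicized, min_count=5):
    """Select prefixes occurring more than `min_count` times and encode each word."""
    prefix_counts = Counter(
        bigram_prefix(p) for p in phonemicized.values() if len(p) >= 2
    )
    candidates = sorted(p for p, c in prefix_counts.items() if c > min_count)
    index = {p: i for i, p in enumerate(candidates)}

    rows = {}
    for word, phonemes in phonemicized.items():
        row = [0] * len(candidates)
        prefix = bigram_prefix(phonemes)
        if prefix in index:
            row[index[prefix]] = 1  # one-hot indicator for the word's prefix
        rows[word] = row
    return candidates, rows


# Toy lexicon; a lower threshold is used here only because the data is tiny.
toy = {
    "glow": ("G", "L", "OW"),
    "gleam": ("G", "L", "IY", "M"),
    "glint": ("G", "L", "IH", "N", "T"),
    "snore": ("S", "N", "AO", "R"),
    "a": ("AH",),
}
candidates, rows = build_features(toy, min_count=2)
print(candidates, rows["glow"], rows["snore"])
```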
We train our two-stage model on the phonemicized vectors; the features that are assigned a nonzero weight are our model-predicted phonesthemes. The features of our morpheme-level model are binary indicator features corresponding to 181 different morphemes extracted from the CELEX2 database. In total, our phonestheme extraction model considers 307 candidate phonesthemes; tuning the regularization strength on held-out error in 5-fold cross-validation results in a model that selects 123 candidate phonesthemes as predictive. The phoneme bigrams corresponding to the 30 features with the highest absolute model weight are in Table 1. Qualitatively, for each selected phonestheme candidate, the words containing it that have the lowest error under the model seem semantically coherent.
Many of the phonesthemes identified by our model have been proposed and validated by past work. 13 of the top 15 model-predicted phonesthemes were in Hutchins' set of 17 proposed word-initial phoneme bigram phonesthemes. This is an improvement over past work; Otis and Sagi (2008) identified 8 as statistically significant, with a hypothesis space restricted to 50 pre-specified word beginnings and endings. Gutiérrez et al. (2016) also identified 8, but with a much larger hypothesis space of 225 candidates. Our model considers an even larger hypothesis space of 307 candidate phonesthemes, which are all automatically extracted from the set of word vectors.
Validating Phonesthemes with Human Judgments
Following the method of Hutchins (1998) and Gutiérrez et al. (2016), we empirically evaluate our phonesthemes by soliciting naïve human judgments about how well-suited a word's form is to its meaning. We randomly selected 5 words containing each of the top 15 model-selected phonesthemes and 5 words containing each of 15 random phonestheme candidates that were not selected by the model, for a total of 150 words. We recruited native English-speaking participants through Amazon Mechanical Turk and asked them to judge how well each word fits its meaning on a Likert scale from 1 to 5. 150 words is too many judgments for a single HIT (annotators would become fatigued and words might start to lose meaning). As a result, we randomly divided the task into 10 different HITs, each with 15 of the words to be tested. We required Mechanical Turk Masters status for the crowdworkers and compensated them $0.20 per HIT; each word received 30 ratings.
Following Hutchins (1998), we compute ratings for each candidate phonestheme by averaging the rating of the words that contain it. On average, model-predicted phonesthemes were rated 0.58 points higher than unselected phonestheme candidates (3.66 versus 3.08, respectively). To assess whether this difference is statistically significant, we use the one-tailed Mann-Whitney U test (Mann and Whitney, 1947) since the data is ordinal and unpaired. Based on the results of the test, we reject the null hypothesis that the average rating of words containing model-selected phonesthemes is not greater than the average rating of words that contain phonesthemes not selected by the model (p < 10^-9). Figure 1 plots the human ratings of the top 15 model-selected phonesthemes against their absolute weight under the model; there is a weak positive correlation (r = 0.081).
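The aggregation and significance test can be sketched with SciPy's mannwhitneyu. All ratings below are invented for illustration, and running the test over per-word mean ratings (rather than over individual annotator ratings) is an assumption of this example.

```python
from statistics import mean

from scipy.stats import mannwhitneyu

# Hypothetical mean ratings per word (1-5 Likert scale), grouped by candidate.
selected = {"gl": [4.1, 3.8, 3.9, 3.5], "sn": [4.0, 3.6, 3.7, 4.2]}
unselected = {"ka": [3.0, 3.2, 2.9, 3.4], "di": [3.1, 2.8, 3.3, 2.7]}

# Rating of a candidate: the average rating of the words that contain it.
candidate_rating = {c: mean(r) for c, r in {**selected, **unselected}.items()}

selected_words = [r for ratings in selected.values() for r in ratings]
unselected_words = [r for ratings in unselected.values() for r in ratings]

# One-tailed test: are words with selected candidates rated higher?
result = mannwhitneyu(selected_words, unselected_words, alternative="greater")
gap = mean(selected_words) - mean(unselected_words)

print(f"mean gap = {gap:.2f}, p = {result.pvalue:.5f}")
```

The Mann-Whitney U test compares ranks rather than raw values, which is why it suits ordinal Likert ratings from two unpaired groups of words.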
2 of the 15 model-predicted phonesthemes with the highest absolute weight were not previously proposed by Hutchins (1998): br- and wi-. Both of these sound clusters seem like plausible phonesthemes. To the authors, the br- cluster evokes the idea of a raw, almost uncultured force, with words like "brags," "brutish," and "brusque" appearing among the words with the lowest error under the model. The types containing the word-initial wi- cluster with the lowest error under the model seem to convey fragility: "wimpy," "wince," and "weak." From Figure 1, we can see that the br- phonestheme candidate received a very high model weight, but received lower ratings on average from human annotators. On the other hand, the average human rating of the wi- phonestheme candidate seems in line with its assigned model weight. Future work could further explore whether br- and wi- have psychological reality to native speakers.

Related Work
Several psycholinguistic studies have shown that native speakers associate certain sounds with a particular meaning, and phonesthemes have been identified in languages from English (Wallis, 1699; Firth, 1930) to Swedish (Abelin, 1999) and Japanese (Hamano, 1998). Bergen (2004) additionally demonstrates that phonesthemes affect online implicit language processing, and Parault and Schwanenflugel (2006) suggest that they play a role in language acquisition.
In recent years, the work of Otis and Sagi (2008) and Abramova and Fernández (2016) used computational methods to automatically detect and validate phonesthemes by examining whether words that contain a candidate phonestheme are more semantically similar than predicted by chance, according to a distributional semantic model. Dautriche et al. (2017) analyze lexicons of Dutch, English, German, and French and find that the space of monomorphemic word forms is clumpier than what would be expected by chance, according to lexical, phonological, and network measures.
Most similar to our work is that of Gutiérrez et al. (2016), who introduce an algorithm for learning weighted string edit distances that minimize kernel regression error and use it to detect systematic form-meaning relationships within language. Our model uses linear regression between candidate phonestheme features and semantic vectors. In addition, our model directly selects the predicted phonesthemes with sparse regularization; their model instead provides a systematicity score for each type, and they extract phonesthemes by taking the word-beginnings with mean errors lower than predicted by a random distribution of errors across the lexicon.

Conclusion
In this work, we present a simple model for extracting non-arbitrary form-meaning relationships from a collection of word vectors. Our model is a sparsely regularized linear regression model that seeks to predict a word's semantic vector from a feature vector that encodes information about the candidate phonesthemes it contains; the sparse solutions of the regression problem have the effect of automatically selecting the features that are most predictive of word meaning, which we take as predicted phonesthemes.
We also develop a simple and effective two-stage approach for mitigating the effect of morphemes in the model. We initially train a model to map from morpheme-level features to word vectors, and then use the residuals of the morpheme-level model as the targets for the downstream phonestheme extraction model.
We compare our model's predicted phonesthemes with those documented in the linguistics literature and find that many were previously proposed by linguists. We also validate our results with human judgments of words containing model-selected and unselected phonestheme candidates: annotators judge that words with a model-selected phonestheme "fit their meaning" better than words that contain a candidate phonestheme that was not selected by the model.