Meaning to Form: Measuring Systematicity as Information

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram ‘gl’ have any systematic relationship to the meaning of words like ‘glisten’, ‘gleam’ and ‘glow’? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: Our approximate effect size (measured in bits) is quite small—despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language.


Introduction
Saussure (1916) expounded on the arbitrariness of the sign. Seen as a critical facet of human language (Hockett, 1960), the idea posits that a sign in human language (a word, in our inquiry) is structured at two levels: the signified, which captures its meaning, and the signifier, which has no meaning but manifests the form of the sign. Saussure himself, however, also documented instances of sound symbolism in language (Saussure, 1912). In this paper, we present computational evidence of relevance to both aspects of Saussure's work.
While dominant among linguists, arbitrariness has been subject to both long theoretical debate (Wilkins, 1668; Eco, 1995; Johnson, 2004; Pullum and Scholz, 2007) and numerous empirical and experimental studies (Hutchins, 1998; Bergen, 2004; Monaghan et al., 2011; Abramova and Fernández, 2016; Blasi et al., 2016; Gutierrez et al., 2016; Dautriche et al., 2017). Taken as a whole, these studies suggest non-trivial interactions in the form-meaning interface between the signified and the signifier (Dingemanse et al., 2015).
Figure 1: We use two independent language models to estimate the mutual information between word forms and meaning, i.e., systematicity, as per our definition. The language models provide upper bounds on H(W) and H(W | V), which can be used to estimate I(W; V). This estimate is as good as the upper bounds are tight; see discussion in §3.4.
Although studies of form-meaning associations range across multiple languages, methods and working hypotheses, they all converge on two important dimensions:
1. The description of meaning is parametrized with pre-defined labels, e.g., by using existing ontologies such as List et al. (2016).
2. The description of forms is restricted to the presence, absence or sheer number of occurrences of particular units (such as phones, syllables or handshapes).
We take an information-theoretic approach to quantifying the relationship between form and meaning using flexible representations in both domains, rephrasing the question of systematicity: How much does certainty of one reduce uncertainty of the other? This gives an operationalization as the mutual information between form and meaning, when treating both as random variables: the signifier as a word's phone string representation in the International Phonetic Alphabet (IPA), and the signified as a distributed representation (Mikolov et al., 2013) for that word's lexical semantics, devoid of morphological or other subword information. We show how to estimate mutual information as the difference in entropy of two phone-level LSTM language models, one of which is conditioned on the semantic representation. This operationalization, depicted in Figure 1, allows us to express the global effect of meaning on form in vocabulary datasets with wide semantic coverage.
In addition to this lexicon-level characterization of systematicity, we also show that this paradigm can be leveraged for studying more narrowly defined form-meaning associations such as phonesthemes (submorphemic, meaning-bearing units) in the style of Gutierrez et al. (2016). These short sound sequences typically suggest some aspect of meaning in the words that contain them, like -ump for rounded things in English. Previous computational studies, whether focusing on characterizing the degree of systematicity (Monaghan et al., 2014b,a, 2011; Shillcock et al., 2001), discovering phonesthemes (Liu et al., 2018), or both (Gutierrez et al., 2016), have invariably framed systematicity in terms of distances and/or similarities: the relation between word-form distance/similarity on the one hand (e.g., based on string edit distance) and semantic distance/similarity on the other (e.g., as defined within a semantic vector space). Our methods have the virtue of not relying on some predefined notion of similarity or distance in either domain for our measurement of systematicity.
Empirically, we focus on two experimental regimes. First, we focus on a large corpus (CELEX) of phone transcriptions in Dutch, English, and German. In these three languages, we find a significant yet small mutual information even when controlling for grammatical category. Second, we perform a massively multilingual exploration of sound-meaning systematicity (§5.1) on the NorthEuraLex corpus (Dellert and Jäger, 2017). This corpus contains expanded Swadesh lists in 106 languages using a unified alphabet of phones. It contains 1016 words in each language, which is often not enough to detect systematicity; we trade the coverage of CELEX for the breadth of languages. Nevertheless, using our information-theoretic operationalization, in most of the languages considered (87 of 106), we find a statistically significant reduction in entropy of phone language modeling by conditioning on a word's meaning (§5.2). Finally, we find a weak positive correlation between our computed mutual information and human judgments of form-meaning relatedness.
Systematic form-meaning associations

Arbitrariness
The lack of a forceful association between form and meaning is regarded as a design feature of language (Hockett, 1960). This arbitrariness of the sign is thought to provide a flexible and efficient way for encoding new referents (Monaghan et al., 2011). It has been claimed that it enhances learnability because newly acquired concepts can be paired to any word, instead of devising the word that properly places the concept in one's constellation of concepts (Gasser et al., 2005), and that it facilitates mental processing compared to an icon-based symbol system, in that the word-meaning map can be direct (Lupyan and Thompson-Schill, 2012). Most importantly, decoupling form from meaning allows communication about things that are not directly grounded in percepts (Clark, 1998; Dingemanse et al., 2015). This opens the door to another of Hockett's (1960) design features of language: duality of patterning (Martinet, 1949), the idea that language exists on the level of meaningless units (the distinctive; typically phonemes) composed to form the level of meaningful units (the significant; typically morphemes).

Non-arbitrariness and systematicity
Contemporary research has established that non-arbitrary form-meaning associations in vocabulary are more common and diverse than previously thought (Dingemanse et al., 2015). Some non-arbitrary associations might be found repeatedly across unrelated languages, presumably due to species-wide cognitive biases (Blasi et al., 2016); others are restricted to language-specific word classes that allow for more or less transparent iconic mappings (so-called ideophones; see Dingemanse, 2012, 2018); and yet others might emerge from properties of discourse and usage rather than meaning per se (Piantadosi et al., 2011).
Systematicity is meant to cover all cases of non-arbitrary form-meaning associations of moderate to large presence in a vocabulary within a language (Dingemanse et al., 2015). In morphology-rich languages, systematic patterns are readily apparent: for instance, across a large number of languages recurring TAM markers or transitivity morphemes could be used to detect verbs, whereas case markers or nominalizing morphemes can serve as a cue for nouns. Yet a sizable portion of research on systematicity is geared towards subtle patterns at the word root level, beyond any ostensive rules of grammar.
By and large, systematicity is hailed as a trait easing language acquisition. It reduces the radical uncertainty humans find when first encountering a new word by providing clues about category and meaning (Monaghan et al., 2014a). Systematic patterns can display a large scope within a language: for instance, systematic associations distinguishing nouns from verbs have been found in every language where a comparison was performed systematically (e.g. Monaghan et al., 2007). But at its extreme, systematicity would manifest as an ontology encoded phonetically, e.g., all plants begin with the letter 'g', and animals with the letter 'z' (Wilkins, 1668;Eco, 1995). As Dingemanse et al. (2015) note, a system of similar forms expressing similar meanings "would lead to high confusability of the very items most in need of differentiation".

Phonesthemes
One particular systematic pattern comes in the form of phonesthemes (Firth, 1964). These are submorphemic and mostly unproductive affixal units, usually flagging a relatively small semantic domain. A classic example in English is gl-, a prefix for words relating to light or vision, e.g. glimmer, glisten, glitter, gleam, glow and glint (Bergen, 2004).

Estimating Systematicity with Information Theory

Notation and formalization
Following Shillcock et al. (2001), we define a sign as a tuple $(v^{(i)}, w^{(i)})$ of a word's distributional semantic representation (a vector) and its phone string representation (a word form). For a natural language with a set of phones $\Sigma$ (including a special end-of-string token), we take the space of word forms to be $\Sigma^*$, with $w^{(i)} \in \Sigma^*$. We treat the semantic space as a high-dimensional real vector space $\mathbb{R}^d$, with $v^{(i)} \in \mathbb{R}^d$. The particular $v^{(i)}$ and $w^{(i)}$ are instances of random variables $V$ and $W$. Further, we want to hunt down potential phonesthemes; we define these to be phone sequences which, compared to others of their length, have a larger mutual information with their meaning. We eliminate positional confounds by examining only words' prefixes $w_{<k}$ and suffixes $w_{>k}$.

A variational upper bound
Entropy, the workhorse of information theory, captures the uncertainty of a probability distribution. In our language modeling case, the quantity is
$$H(W) = \sum_{w \in \Sigma^*} \Pr(w) \log \frac{1}{\Pr(w)}. \tag{1}$$
Entropy is the average number of bits required to represent a string in the distribution, under an optimal coding scheme. When computing it, we are faced with two problems: We do not know the distribution over word forms $\Pr(W)$ and, even if we did, computing Equation 1 requires summing over the infinite set of possible strings $\Sigma^*$. We follow Brown et al. (1992) in tackling these problems together. Approximating $\Pr(W)$ with any known distribution $Q(W)$, we get a variational upper bound on $H(W)$ from their cross-entropy, i.e.
\begin{align}
H(W) &\le H_Q(W) \tag{2a} \\
&= \sum_{w \in \Sigma^*} \Pr(w) \log \frac{1}{Q(w)}. \tag{2b}
\end{align}
Equation 2b still requires knowledge of $\Pr(W)$ and involves an infinite sum, though. Nonetheless, we can use a finite set $\tilde{W} = \{w^{(i)}\}_{i=1}^{N}$ of samples from $\Pr(W)$ to get an empirical estimate of this value:
$$H_Q(W) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{Q(w^{(i)})}, \tag{3}$$
with equality if we let $N \to \infty$. We now use Equation 3 as an estimate for the entropy of a lexicon.
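The empirical estimate can be illustrated with a small sketch. It is not the authors' implementation: the LSTM is swapped for an add-one smoothed unigram phone model as a toy Q, letters stand in for phones, and `EOS`, `fit_unigram`, `word_surprisal` and `cross_entropy` are hypothetical names introduced here.

```python
import math
from collections import Counter

EOS = "#"  # hypothetical end-of-string token, not a symbol from the paper

def fit_unigram(train_words, alphabet):
    """Add-one smoothed unigram phone model: a toy stand-in for Q."""
    counts = Counter(p for w in train_words for p in list(w) + [EOS])
    total = sum(counts.values()) + len(alphabet)
    return {p: (counts[p] + 1) / total for p in alphabet}

def word_surprisal(word, q):
    """-log2 Q(w) in bits, counting the end-of-string token."""
    return -sum(math.log2(q[p]) for p in list(word) + [EOS])

def cross_entropy(words, q):
    """Empirical estimate (1/N) * sum_i -log2 Q(w_i) over held-out words."""
    return sum(word_surprisal(w, q) for w in words) / len(words)

# Toy data (letters standing in for phones); purely illustrative.
alphabet = sorted(set("abglmow") | {EOS})
train = ["glow", "gloam", "blow", "mow"]
held_out = ["glob", "low"]

q = fit_unigram(train, alphabet)
h_q = cross_entropy(held_out, q)
print(f"estimated cross-entropy H_Q(W): {h_q:.3f} bits per word")
```

Evaluating on held-out words matters here: scoring the training words would make the estimate optimistically low for any model flexible enough to memorize them.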

Conditional entropy
Conditional entropy reflects the average additional number of bits needed to represent an event, given knowledge of another random variable. If $V$ completely determines $W$, then the quantity is 0. Conversely, if the variables are independent, then $H(W) = H(W \mid V)$. Analogously to the unconditional case, we can get an upper bound for the conditional entropy by approximating $\Pr(W \mid V)$ with another distribution $Q$:
$$H(W \mid V) \le H_Q(W \mid V) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{Q(w^{(i)} \mid v^{(i)})}. \tag{4}$$

Systematicity as mutual information
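Mutual information (I) measures the amount of information (bits) that the knowledge of either form or meaning provides about the other. It is the difference between the entropy and conditional entropy:
\begin{align}
I(W; V) &= H(W) - H(W \mid V) \tag{5a} \\
&\approx H_Q(W) - H_Q(W \mid V), \tag{5b}
\end{align}
where $H_Q$ denotes the cross-entropy upper bounds obtained with an approximating distribution $Q$. Systematicity will thus be framed as (statistically significant) nonzero mutual information $I(V; W)$.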
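As a hedged illustration of this difference of entropies, the sketch below averages per-word differences between an unconditional surprisal, -log2 Q(w), and a conditional one, -log2 Q(w | v). The function name `mutual_information` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mutual_information(surprisal_uncond, surprisal_cond):
    """Mean per-word difference between -log2 Q(w) and -log2 Q(w | v), in bits."""
    diffs = [u - c for u, c in zip(surprisal_uncond, surprisal_cond)]
    return sum(diffs) / len(diffs), diffs

# Hypothetical held-out surprisals (bits) for four words; not real results.
uncond = [12.0, 9.5, 14.0, 10.5]
cond = [11.5, 9.75, 13.0, 10.25]

mi, per_word = mutual_information(uncond, cond)
print(mi)        # average estimated mutual information in bits
print(per_word)  # individual words can contribute negative values
```

Keeping the per-word differences, rather than only their mean, is convenient because a significance test can later operate on them directly.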

Learning Q
Our method relies on decomposing mutual information into a difference of entropies, as shown in Equation 5b. We use upper bounds on both the entropy and conditional entropy measures, so our calculated mutual information is an estimate. This estimate is as good as our bounds are tight, being perfect when Pr(W) = Q(W) and Pr(W | V) = Q(W | V). Still, as we subtract two upper bounds, we cannot guarantee that our MI estimate approaches the real MI from above or below, because we do not know which of the two entropy bounds is tighter. There is nothing principled that we can say about the result, except that it is consistent.
The procedure for learning the distribution Q is, thus, essential to our method. We must first define a family of distributions Ψ from which Q is learned. Then, we learn Q ∈ Ψ by minimizing the right-hand side of Equation 2b, which corresponds to maximum likelihood estimation. In this work, we employ a state-of-the-art phone-level LSTM language model as our Ψ to approximate Pr(W) as closely as possible.

Recurrent neural LM
A phone-level language model (LM) provides a probability distribution over $\Sigma^*$:
$$\Pr(w) = \prod_{i=1}^{|w|} \Pr(w_i \mid w_{<i}).$$
Recurrent neural networks are great representation extractors, being able to model long dependencies (up to a few hundred tokens; Khandelwal et al., 2018) and complex distributions $\Pr(w_i \mid w_{<i})$ (Mikolov et al., 2010; Sundermeyer et al., 2012). We choose LSTM language models in particular, the state of the art for character-level language modeling (Merity et al., 2018). Our architecture embeds a word (a sequence of tokens $w_i \in \Sigma$) using an embedding lookup table, resulting in vectors $z_i \in \mathbb{R}^d$. These are fed into an LSTM, which produces high-dimensional representations of the sequence (hidden states):
$$h_i = \mathrm{LSTM}(z_{i-1}, h_{i-1}),$$
where $h_0$ is the zero vector. Each hidden state is linearly transformed and fed into a softmax function, producing a distribution over the next phone:
$$\Pr(w_i \mid w_{<i}) = \mathrm{softmax}(\mathbf{W} h_i + \mathbf{b}).$$
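The chain-rule factorization and the softmax output layer can be sketched without any neural network library. In this minimal example the LSTM is replaced by a placeholder scoring function, `uniform_logits`, which is an assumption made purely for illustration, as are `EOS` and the other names.

```python
import math

EOS = "#"  # hypothetical end-of-string token

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def word_log2_prob(word, alphabet, next_logits):
    """Chain rule: sum over positions of log2 Pr(w_i | w_<i), ending with EOS."""
    symbols = list(word) + [EOS]
    total = 0.0
    for i, sym in enumerate(symbols):
        probs = softmax(next_logits(symbols[:i]))  # distribution over next phone
        total += math.log2(probs[alphabet.index(sym)])
    return total

alphabet = ["g", "l", "a", EOS]

def uniform_logits(prefix):
    """Placeholder scorer that ignores the prefix; an LSTM would go here."""
    return [0.0] * len(alphabet)

print(word_log2_prob("gla", alphabet, uniform_logits))
```

A conditional model would differ only in how the scorer is initialized (for example, from a meaning vector), while the scoring loop stays the same.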

Datasets
We first analyze the CELEX database (Baayen et al., 1995), which provides many word types for Dutch, English, and German. In measuring systematicity, we control for morphological variation by only considering monomorphemic words, as in Dautriche et al. (2017). Our type-level resource contains lemmata, eliminating the noisy effect of morphologically inflected forms. CELEX contains 6040 English, 3864 German, and 3603 Dutch lemmata for which we have embeddings.
While CELEX is a large, well-annotated corpus, it only spans three lexically related languages. The NorthEuraLex database (Dellert and Jäger, 2017) is thus appealing. It is a lexicon of 1016 "basic" concepts, written in a unified IPA scheme and aligned across 107 languages that span 21 language families (including isolates).[4] While we cannot restrict NorthEuraLex to monomorphemic words (because it was not annotated by linguists and segmentation models are weak for its low-resource languages), it mainly contains word types for basic concepts, e.g., animal names or verbs, so we are comfortable in the modeling assumption that the words are not decomposable into multiple morphemes.
Unlike Dautriche et al. (2017), who draw lexicons from Wikipedia, or Otis and Sagi (2008), we directly use a phone string representation, rather than their proxy of using each language's orthography. This makes our work the first to quantify the interface between phones and meaning in a massively multilingual setting.
Blasi et al. (2016) is the only large-scale exploration of phonetic representations that we find. They examine 40 aligned concepts over 4000 languages and identify that sound correspondences exist across the vast majority. Their resource (Wichmann et al., 2018) does not have enough examples to train our language models, and we add to their findings by measuring a relationship between form and meaning, rather than form given meaning.
[4] We omit Mandarin; the absence of tone annotations leaves its phonotactics greatly underspecified. All reported results are for the remaining 106 languages.

Embeddings
We use pre-trained WORD2VEC representations as meaning vectors for the basic concepts. For CELEX, specific representations were pre-trained for each of the three languages. For NorthEuraLex, as its words are concept-aligned, we use the same English vectors for all languages. Pragmatically, we choose English because its vectors have the largest coverage of the lexicon. This does not mean that we assume semantic spaces across languages to be strictly comparable. In fact, we would expect that more direct methods of estimating these vectors would be preferable if they were practical.
Note that the methods described above are likely underestimating the semantic systematicity in the data, for a couple of reasons. First, WORD2VEC and other related methods have been shown to do a better job at capturing general relatedness rather than semantic similarity per se (Hill et al., 2015). Second, our use of the English vectors across the concept-aligned corpora is a somewhat coarse expedient. To the extent that English serves as a poor model for the other languages, we should expect smaller MI estimates. In short, we have chosen easy-to-replicate methods based on commonly used models, rather than extensively tuning our approach for these experiments, possibly at the expense of the size of the effect we observe.
To reduce spurious fitting to noise in the dataset, we reduce the dimensionality of these vectors from the original 300 to d while capturing maximal variance, using principal components analysis (PCA).
These resulting d-dimensional vectors are kept fixed while training the conditional language model. Each d-dimensional vector $v$ is linearly transformed to serve as the initial hidden state of the conditional LSTM language model:
$$h_0 = \mathbf{W}^{(v)} v + \mathbf{b}^{(v)}.$$
We reject morphologically informed embeddings (e.g., Bojanowski et al., 2017) because this would be circular: We cannot question the arbitrariness of the form-meaning interface if the meaning representations are constructed with explicit information from the form. This is the same reason that we do not fine-tune the embeddings: our goal is to enforce as clean a separation as possible of model and form, then suss out what is inextricable.

Controlling for grammatical category
The value of WORD2VEC comes from distilling more than just meaning. It also encodes the grammatical classes of words. Unfortunately, this is a trivial source of systematicity: if a language's lemmata for some class follow a regular pattern (such as the verbal infinitive endings in Romance languages), our model will have uncovered something meaningless. Prior work (e.g., Dautriche et al., 2017; Gutierrez et al., 2016) does not account for this. To isolate factors like these, we can estimate the mutual information between word form and meaning, while conditioning on a third factor. The expression is similar to Equation 5a:
$$I(W; V \mid C) = H(W \mid C) - H(W \mid V, C),$$
where $C$ is our third factor, in this case grammatical class.[6] Both CELEX and NorthEuraLex are annotated with grammatical classes for each word. We create a lookup embedding for each class in a language, then use the resulting representation as an initial hidden state to the LSTM ($h_0 = c$). When conditioning on both meaning and class, we concatenate half-sized representations of the meaning (pre-trained) and class to create the first hidden state.

Hypothesis testing
We follow Gutierrez et al. (2016) and Liu et al. (2018) in using a permutation test to assess our statistical significance. In it, we randomly swap the sign of I values for each word, showing mutual information is significantly positive. Our null hypothesis, then, is that this value should be 0. Recomputing the average mutual information over many shufflings gives rise to an empirical p-value: asymptotically, it will be twice the fraction of permutations with a higher mutual information than the true lexicon. In our case, we used 100,000 random permutations.
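The sign-swapping test described above can be sketched as follows. This is an illustrative reading of the procedure, not the released implementation: the function name `sign_flip_p_value`, the toy inputs, the use of ">=" in the comparison, and the clipping of the doubled fraction at 1 are all assumptions.

```python
import random

def sign_flip_p_value(per_word_mi, n_perm=10_000, seed=0):
    """Two-sided p-value for mean(per_word_mi) != 0 via random sign flips."""
    rng = random.Random(seed)
    n = len(per_word_mi)
    observed = sum(per_word_mi) / n
    higher = 0
    for _ in range(n_perm):
        # Under the null, each word's contribution is equally likely to be +x or -x.
        flipped = sum(x if rng.random() < 0.5 else -x for x in per_word_mi) / n
        if flipped >= observed:
            higher += 1
    # Twice the fraction of shufflings at least as large as the observed mean.
    return observed, min(1.0, 2 * higher / n_perm)

# Hypothetical per-word mutual information values (bits); not real results.
toy_mi = [0.5, 0.25, 1.0, 0.25, 0.75, 0.5, 0.25, 1.0]
observed, p = sign_flip_p_value(toy_mi)
print(observed, p)
```

The paper reports 100,000 permutations; the default here is smaller only to keep the toy run fast.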

Hyperparameters and optimization
We split both datasets into ten folds, using one fold for validation, another for testing, and the rest for training. We optimize all hyper-parameters with 50 rounds of Bayesian optimization; this includes the number of layers in the LSTM, its hidden size, the PCA size d used to compress the meaning vectors, and a dropout probability. Such an optimization is important to get tighter bounds for the entropies, as discussed in §3.4. We use a Gaussian process prior and maximize the expected improvement on the validation set, as in Snoek et al. (2012).[7]
[6] If markers of subclasses within a given part of speech are frequent, these may also emerge.

Identifying systematicity
We find statistically significant nonzero mutual information in all three CELEX languages (Dutch, English, and German), using a permutation test to establish significance. This gives us grounds to reject the null hypothesis. We also find statistically significant mutual information when conditioning the entropies on words' grammatical classes. These results are summarized in Table 1.
But how much could the mutual information have been? A raw number of bits is not easily interpretable, so we provide another information-theoretic quantity, the uncertainty coefficient, expressing the fraction of bits we can predict given the meaning:
$$U(W \mid V) = \frac{I(W; V)}{H(W)}.$$
The mutual information I(W; V) is upper-bounded by the language's entropy H(W), so the uncertainty coefficient is between zero and one.[8] For the CELEX data, we give the uncertainty coefficients with and without conditioning on part of speech in Table 1.
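A minimal sketch of the uncertainty coefficient, computed from two entropy estimates, is given below. The function name and the input values are hypothetical and chosen only for illustration.

```python
def uncertainty_coefficient(h_w, h_w_given_v):
    """U(W|V) = I(W;V) / H(W), with I(W;V) estimated as H(W) - H(W|V)."""
    return (h_w - h_w_given_v) / h_w

# Hypothetical entropy estimates in bits per word; not values from the paper.
print(uncertainty_coefficient(16.0, 15.5))
# With estimated bounds the conditional model can score worse, giving a negative value.
print(uncertainty_coefficient(16.0, 16.5))
```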
By comparing results with and without conditioning on grammatical category, we see the importance of controlling for known factors of systematicity. As expected, all systematicity (mutual information) results are smaller when we condition on part of speech. After conditioning, systematicity remains present, though. In English, we can guess about 3.25% of the bits encoding the phone sequence, given the meaning. In Dutch and German, these quantities are higher. The effect size of systematicity in these languages, though, is small.

Broadly multilingual analysis
On the larger set of languages in NorthEuraLex, we see that 87 of the 106 languages have statistically significant systematicity (p < 0.05), after Benjamini-Hochberg (1995) corrections. When we control for grammatical classes (I(W; V | POS)), we still get significant systematicity across languages ($p < 10^{-3}$). A per-language analysis, though, only finds statistical significance for 17 of them, after Benjamini-Hochberg (1995) corrections. This evinces the importance of conditioning on grammatical category; without doing so, we would find a spurious result due to crafted, morphological systematicity. We present kernel density estimates for these results in Figure 2 and give full results in Appendix A. Across all languages, the average uncertainty coefficient was 1.37% (Cohen's d 0.1936). When controlling for grammatical classes, though, it was only 0.2% (Cohen's d 0.0287). There were only 970 concepts with corresponding WORD2VEC representations in this dataset, and our language models easily overfit when conditioned on these. As we optimize the used number of PCA components (d) for these word embeddings, we can check its 'optimum' size. The average d across NorthEuraLex languages was only ≈ 22, while on CELEX it was ≈ 153. This might imply that the model could not find systematicity in some languages due to the dataset's small size; models were too prone to overfitting.
[7] Our implementation is available at https://github.com/tpimentelms/meaning2form.
[8] Because of our estimation, it may be less than zero.
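For readers unfamiliar with the Benjamini-Hochberg correction applied to the per-language p-values, a minimal sketch of the standard step-up procedure follows. The function name and the example p-values are hypothetical; this is not taken from the released code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value is at most (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for five languages; not results from the paper.
print(benjamini_hochberg([0.001, 0.045, 0.02, 0.2, 0.008]))
```

Because the procedure is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value further down the ranking passes.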

Fantastic phonesthemes and where to find them
As a phonestheme is, by definition, a sequence of phones that suggests a particular meaning, we expect phonesthemes to have higher mutual information values, measured in bits per phone, when compared to other k-grams in the lexicon. To identify that a prefix of length k, $w_{\le k}$, is a phonestheme, we compare it to all such prefixes, being interested in the mutual information $I(W_{\le k}; V)$. For each prefix in our dataset, we compute the average mutual information over all n words it appears in. We then sample $10^5$ other sets of n words and get their average mutual information. Each prefix is identified as a phonestheme with a p-value of $r / 10^5$, where r is the number of comparisons in which it has a lower systematicity than the random set. Table 2 shows identified phonesthemes for English, Dutch, and German.
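The random-set comparison can be sketched as below. The sketch assumes that a per-word mutual information value for the prefix position has already been computed; the word list, the values, the function name `prefix_p_value`, and the use of ">=" for ties are hypothetical choices, and letters stand in for phones.

```python
import random

def prefix_p_value(words, per_word_mi, prefix, n_samples=10_000, seed=0):
    """Fraction of random n-word sets whose mean MI is at least the prefix's."""
    rng = random.Random(seed)
    members = [mi for w, mi in zip(words, per_word_mi) if w.startswith(prefix)]
    n = len(members)
    observed = sum(members) / n
    r = 0
    for _ in range(n_samples):
        sample = rng.sample(per_word_mi, n)  # n words drawn without replacement
        if sum(sample) / n >= observed:
            r += 1
    return r / n_samples

# Hypothetical lexicon and per-word MI values (bits per phone); illustrative only.
words = ["glow", "gleam", "glint", "blow", "mow", "tree", "cat", "dog", "sun", "map"]
mi = [0.9, 0.8, 0.85, 0.1, 0.0, -0.1, 0.05, 0.1, 0.0, 0.05]

print(prefix_p_value(words, mi, "gl"))  # small: the prefix stands out
print(prefix_p_value(words, mi, "m"))   # large: indistinguishable from random sets
```

Matching the random sets to the prefix's own n keeps the comparison fair, since averages over few words vary more than averages over many.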
Inspecting the German data, it is clear that some of these prefixes and affixes that we find are fossilized pieces of derivational etymology. Further, many of the endings in German are simply the verb ending -/@n/ with an additional preceding phone. Dutch and English are less patterned. While we find few examples in Dutch, all are extremely significant. It can be argued that two examples (-/@l/ and -/xt/) are not semantic markers but rather categorizing heads in the framework of distributed morphology (Marantz and Halle, 1993), i.e., suggestions that the words are nouns. Further, in English, we find other examples of fossilized morphology (/k@n/- and /In/-). In this sense, our found phonesthemes are related to another class of restricted-application subword: bound morphemes (Bloomfield, 1933; Aronoff, 1976; Spencer, 1991), which carry known meaning and cannot occur alone.
From the list of English prefix phonesthemes we present here, all but /In/- and /k@n/- find support in the literature (Hutchins, 1998; Otis and Sagi, 2008; Gutierrez et al., 2016; Liu et al., 2018). Furthermore, an interesting case is the suffix -/mp/, which is identified with a high confidence. This might be picking up on phonesthemes -/ump/ and -/amp/ from Hutchins's (1998) list.

Correlation with human judgments
As a final, albeit weak, validation of our model, we consider how well our computed systematicity compares to human judgments (Hutchins, 1998; Gutierrez et al., 2016; Liu et al., 2018). We turn to the survey data of Liu et al. (2018), in which workers on Amazon Mechanical Turk gave a 1-to-5 judgment of how well a word's form suited its meaning. For each of their model's top 15 predicted phonesthemes and 15 random non-predicted phonesthemes, the authors chose five words containing the prefix for workers to evaluate. Comparing these judgments to our model-computed estimates of mutual information $I(W_{<2}; V)$, we find a weak, positive Spearman's rank correlation (ρ = 0.352 with p = 0.03). This shows that prefixes for which we find higher systematicity (according to mutual information) also tend to have higher human-judged systematicity.
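Spearman's rank correlation is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. A small self-contained sketch follows; the helper names and the input numbers are hypothetical and do not reproduce the survey data, and no p-value is computed.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-prefix MI estimates and mean human ratings; illustrative only.
mi_estimates = [0.1, 0.4, 0.2, 0.8, 0.5]
human_ratings = [2.0, 3.5, 3.0, 4.5, 3.0]
print(spearman(mi_estimates, human_ratings))
```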

Conclusion
We have revisited the linguistic question of the arbitrariness, and the systematicity, of the sign. We have framed the question on information-theoretic grounds, estimating entropies by state-of-the-art neural language modeling. We find evidence in 87 of 106 languages for a significant systematic pattern between form and meaning, reducing approximately 5% of the phone-sequence uncertainty of German lexicons and 2.5% in English and Dutch, when controlling for part of speech.
We have identified meaningful phonesthemes according to our operationalization, and we have good precision-all but two of our English phonesthemes are attested in prior work. An avenue for future work is connecting our discovered phonesthemes to putative meanings, as done by Abramova et al. (2013) and Abramova and Fernández (2016).
The low uncertainty reduction suggests that the lexicon is still largely arbitrary. According to the