How (Non-)Optimal is the Lexicon?

The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf’s law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world’s languages. Despite their importance in shaping lexical structure, the relative contributions of these factors have not been fully quantified. Taking a coding-theoretic view of the lexicon and making use of a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon’s optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes—as measured by code length.


Introduction
Communication through language can be modeled under Shannon's classic communication framework (Shannon, 1948). Under this perspective, linguistic utterances are codes, which need to be decoded by a receiver (listener) who is interested in the message (meaning) they encode. Famously, Zipf (1949) posited that language users shape these codes so as to accommodate the principle of least effort. The most widely discussed and investigated empirical evidence for this feature is the so-called law of abbreviation, an ostensive negative correlation between word frequency and word length (Zipf, 1935; Bentz and Ferrer-i-Cancho, 2016). Communication effort decreases by encoding frequent messages in shorter words. This correlation, however, is characteristically modest. There are many instances of short low-frequency words, like wen and jib in English, and long frequent words, like happiness and anything. While the lexicon might be shaped by economy of expression, it is clearly not fully optimized for it. There are multiple (possibly competing) reasons why this could be the case.

[Figure caption, partially recovered: "... (Finnish). The distance between the baselines can be thought of as the cost of each constraint added to the system."]
First of all, the sequences of speech sounds, signs, or orthographic characters that serve as building blocks in a language comply with specific rules. These are referred to as phonotactics (in the case of speech sounds) and graphotactics (in written language). On top of these constraints, the lexicons of many languages of the world re-use sub-parts of words; these sub-parts can be productively composed to produce new meanings, which is referred to as morphological composition. This largely determines the family of wordforms associated with a given basic meaning: for instance, given the wordform health and its meaning, the nominal morphology of English readily provides the forms for many of its derived meanings, including healthy, unhealthy, healthier, etc.
Beyond these well-attested constraints, it might be argued that the negative correlation between the length of a word and its frequency is not the locus of optimization given the economy of expression pressure. Instead, wordforms might be efficiently encoding meanings based on their contextual surprise rather than frequency (Piantadosi et al., 2011). Finally, there is no reason to expect lexicons to be fully optimized for the economy of expression: this factor might steer languages in a given direction, but there is certainly room for non-compliance. Languages are, after all, not engineered systems but cultural artifacts.
In this paper, we examine how marginally nonoptimal the lexicon is by taking the vantage point of the law of abbreviation. We develop a method to quantify the role of different linguistic constraints (including morphology and phonotactics/graphotactics) in determining wordforms, and we produce estimates of how compressible the lexicon could be in their absence. We thus define an upper bound for the compressibility of a lexicon optimized purely for word length efficiency.

(Non-)Optimality in the Lexicon
As stated above, our notion of optimality is derived from Zipf's principle of least effort in the form of the law of abbreviation (Zipf, 1949;Mandelbrot, 1953;Ferrer-i-Cancho et al., 2020). However, this is by no means the only theory under which wordforms are optimized for encoding their messages.
One influential hypothesis is that languages optimize for uniform information density (Fenk and Fenk, 1980; Aylett and Turk, 2004; Levy and Jaeger, 2007), roughly keeping the amount of information conveyed in a unit of time constant. In an information-theoretic setting, this would be equivalent to maximizing the use of a noisy channel between the speaker and an audience, keeping the transmission rate close to the channel capacity.
Under this view, it is not necessarily the case that words should be as short as possible. Rather, words that are infrequent or typically less predictable in context should be longer and take more time to produce-perhaps because the increased duration makes them more robust to noise. Consistent with this perspective, it has been shown that, in production, words with higher information content take longer to pronounce (Bell et al., 2003;Jurafsky et al., 2001;Gahl, 2008). Additionally, words which are typically predictable in context are shorter than words which are less predictable in context (Piantadosi et al., 2011).
On another note, a purely coding-theoretically efficient language could make the lexical codes context dependent (Piantadosi et al., 2012), since context often disambiguates words (Dautriche et al., 2018; Pimentel et al., 2020a). Additionally, the meaning or message being conveyed by a given word might bias its form. Within languages, there seems to be a pressure for more semantically similar words to also be more phonologically similar (Monaghan et al., 2014; Dingemanse et al., 2015; Dautriche et al., 2017; Pimentel et al., 2019). Across languages, words for the same referents exhibit detectable patterns in terms of their phonological makeup (Blasi et al., 2016), phonotactics (Pimentel et al., 2021b), as well as word length (Lewis and Frank, 2016); this is driven by semantic features such as size, quality or complexity. Finally, there is a cross-linguistic tendency for lexicons to place higher surprisal in word-initial segments (van Son and Pols, 2003a,b; King and Wedel, 2020; Pimentel et al., 2021a), making words more constrained in their choice of final segments. These aspects of language might also collide with a purely Zipfian conception of lexicon optimality.
In this work, however, we consider optimality exclusively in the Zipfian sense of compressibility, and we ask how far natural language lexicons are from conforming to this paradigm. We build a number of models that differ in whether they accommodate the law of abbreviation, compositional morphology and graphotactics. The comparison among these systems allows us to explore the extent to which each part of the linguistic system contributes to the overall cost of the linguistic code. It should be noted, though, that the consequences of unmodeled sources of structure in the lexicon (such as persistent sound-meaning associations or the adaptation of the code to surprisal effects) will necessarily be confounded with the overall lack of fit between our models and the data.
The morphological cost (i.e. the cost of morphology to a code's length) is associated with the fact that, across many languages, words are often constructed of meaningful sub-parts that are productively reused across the lexicon. Practically, this means that the wordforms of different meanings might not be independent if they overlap in a particular dimension that is captured by the morphology of the language. For instance, most wordforms that express two or more referents of a kind share a word-final suffix -s in English (towers, cats, ideas, etc). We treat this cost by considering optimal codes where the basic unit in the lexicon is not the word but sub-pieces, as determined by the unsupervised morphological parser Morfessor (Creutz and Lagus, 2007; Smit et al., 2014). Under this regime, a word like unmatched is parsed into the tokens un, match, and ed.
The graphotactic cost concurrently imposes a set of additional constraints, determining which sequences of grapheme segments can constitute a valid wordform in a given language. While the main driver of these lexical constraints is actually phonotactics (which imposes rules dictating the possible phoneme sequences), we focus on graphotactics because our object of study is written language corpora. The degree to which phonotactics and graphotactics mirror each other varies substantially across languages; thus, in this work (which uses corpora from Wikipedia) we make our claims about language in the written modality and leave it to future work to generalize this work to the phonological domain. This could be done by applying the same method to phonemic representations of words.

A Coding-theoretic View of the Lexicon
This paper treats the lexicon as a set of pairs: $L = \{(m_n, w_n)\}_{n=1}^{N}$. In general, this set will be infinite; $m_n$ refers to a lexical meaning, taken from an abstract set $\mathcal{M}$, and $w_n$ refers to a wordform, taken from $\Sigma^*$, the Kleene closure of a grapheme alphabet $\Sigma$. When the exact index is unnecessary in context, we will drop the subscripted $n$; and we make use of uppercase letters to refer to random variables (e.g. $M$ or $W$) where necessary. We will write meanings in typewriter font, e.g. cat, and wordforms in italics: cat (English), kissa (Finnish).
Viewing the lexicon from a coding-theoretic perspective, we consider the mapping from meaning to form as a code: $C : \mathcal{M} \to \Sigma^*$. Every language comes endowed with a natural code $C_{\mathrm{nat}}$, which is the observed mapping from lexical meanings to forms. As an example, consider the meaning cat and its Finnish form: we have $C_{\mathrm{nat}}(\text{cat}) = \text{kissa}$. The topic of interest in this paper is the efficiency of language's natural codes.
The space of meanings and lexical ambiguity. The space of meanings $\mathcal{M}$ is non-trivial to define, but could be operationalized as $\mathbb{R}^d$, which is infinite, continuous and uncountable (Pilehvar and Camacho-Collados, 2020). Meanwhile, the space of wordforms $\Sigma^*$ is also infinite, but discrete and countable. As such, many meanings $m_n$ must be mapped to the same form, resulting in lexical ambiguity. See Pimentel et al. (2020a) for a longer discussion on these operationalizations. In this work, though, we do not engage with such ambiguity, considering $\mathcal{M}$ as an abstract set of meanings, each of which is defined by a distinct wordform, i.e. the code $C_{\mathrm{nat}}$ is a bijection. A consequence of this strategy is that we take the space of meanings to be infinite, but discrete and countable; we only distinguish as many meanings as there are words, and therefore we end up with a countable number of meanings. Additionally, by considering a distinct meaning $m_n$ for each wordform $w_n$ in the lexicon, we only consider codes with as much lexical ambiguity as in the original language.

Words as Meanings
The unigram distribution represents the frequency of each wordform in a text, i.e. the probability of a token without conditioning on context, $p(W = \text{kissa})$. In this work, though, we assume the unigram distribution is a distribution over $\mathcal{M}$, e.g. $p(M = \text{cat})$; this way we can analyze how changing the code $C$ would affect its efficiency.
As stated above, though, we take $C_{\mathrm{nat}}$ to be a bijection. Such an assumption implies there is a deterministic function from wordforms to meanings in a specific lexicon, $C_{\mathrm{nat}}^{-1}(w) = m$. Probabilistically speaking, we write

$$p(m \mid w) = \mathbb{1}\{m = C_{\mathrm{nat}}^{-1}(w)\}$$

This mapping implies

$$p(m) = p(w), \quad \text{where } w = C_{\mathrm{nat}}(m)$$

Given this equality, we can reduce the problem of estimating the unigram distribution over meanings $p(m)$ to the one over wordforms $p(w)$.

Code-length and optimality
As stated above, we assume the unigram distribution to be a distribution over $\mathcal{M}$. We now define the cost of a code as its expected length:

$$\mathrm{cost}(C) = \sum_{m \in \mathcal{M}} p(m)\, |C(m)| \qquad (4)$$

A smaller cost, then, implies a more efficient code. The famous source-coding theorem of Shannon (1948) gives us a theoretical limit on coding cost:

$$H(M) \le \mathrm{cost}(C^\star) < H(M) + 1$$

where we define $C^\star$ to be the most efficient code, and where $H(M)$ is the entropy of distribution $p$:

$$H(M) = -\sum_{m \in \mathcal{M}} p(m) \log_{|\Sigma|} p(m)$$

According to the source-coding theorem, if we know the true distribution $p$ over lexical meanings, then we know how to optimally code them. This turns the problem of estimating the efficiency of the lexicon into the one of estimating the entropy of an unknown discrete distribution $p$, a well-defined task with a pool of previous work (Miller, 1955; Antos and Kontoyiannis, 2001; Paninski, 2003; Archer et al., 2014). Because the distributions over wordforms and meanings are equivalent, we estimate the entropy $H(M)$ using wordforms:

$$H(M) = H(W) = -\sum_{w \in \Sigma^*} p(w) \log_{|\Sigma|} p(w)$$
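To make the cost and entropy definitions concrete, here is a minimal sketch on a toy lexicon. The four meanings, their probabilities and the two-letter alphabet are hypothetical and chosen only so that the numbers are easy to check by hand; nothing here comes from the paper's data.

```python
import math

def expected_cost(probs, code):
    """Expected code length: sum over meanings of p(m) * |C(m)|."""
    return sum(p * len(code[m]) for m, p in probs.items())

def entropy(probs, base):
    """Shannon entropy of a distribution, with logarithms taken in `base`."""
    return -sum(p * math.log(p, base) for p in probs.values() if p > 0)

# Hypothetical toy lexicon over the two-letter alphabet {a, b}.
probs = {"cat": 0.5, "dog": 0.25, "wen": 0.125, "jib": 0.125}
code = {"cat": "a", "dog": "ba", "wen": "bba", "jib": "bbb"}

cost = expected_cost(probs, code)   # measured in characters
h = entropy(probs, base=2)          # base equals the alphabet size
```

Because every probability in this toy example is a power of 1/2 and the code is prefix-free, the cost meets the entropy lower bound; for real lexicons the optimal cost generally lies strictly between $H(M)$ and $H(M) + 1$.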

Finite and Infinite Support
This section reviews a few technical results as regards the construction of codes from a probability distribution. If $p$ had finite support (i.e. there were a finite set of possible meanings or wordforms), a simple Huffman encoding (Huffman, 1952) would give us an optimal code for our lexicon. However, this is not the case, since $p(w)$ has support on all of $\Sigma^*$, so we might need a more complex strategy to get such a code. Linder et al. (1997) proved the existence of an optimal encoding for a distribution with infinite support, given that it has finite entropy.
Proposition 1. If distribution $p(w)$ has finite entropy, i.e. $H(W) < \infty$, then there exists an optimal encoding for it such that: $\mathrm{cost}(C^\star) < H(M) + 1$.
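For the finite-support case, a Huffman code can be built in a few lines. The sketch below constructs a binary Huffman code, which is an assumption made for simplicity: the paper measures length in characters over the alphabet $\Sigma$, which would call for a $|\Sigma|$-ary variant. The four-word distribution is hypothetical.

```python
import heapq
import math
from itertools import count

def huffman_code(probs):
    """Build a binary Huffman code for a finite distribution {symbol: prob}."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in probs}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical finite distribution over four wordforms.
probs = {"the": 0.4, "of": 0.3, "cake": 0.2, "wen": 0.1}
code = huffman_code(probs)
avg_len = sum(p * len(code[s]) for s, p in probs.items())
h_bits = -sum(p * math.log2(p) for p in probs.values())
```

The average length `avg_len` falls between the entropy `h_bits` and `h_bits + 1`, which is the finite-support counterpart of the bound in Proposition 1.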
Luckily, under a weak assumption, this is the case for a well-trained language model.
This fairly weak assumption, ε-smoothness, states that partial wordforms have a lower bound ε > 0 on their probability of ending. As such, there is an upper bound on the probability of a wordform which decreases exponentially with its length. Armed with this assumption, we can now show that any ε-smooth language model has a finite entropy (Proposition 2).
Proof. See App. C.
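The exponential upper bound mentioned above follows in one line. This is only a sketch of that single step, not the proof in App. C, and it assumes that ε-smoothness means $p(\textsc{eow} \mid w_{<t}) \ge \varepsilon$ for every prefix, where $\textsc{eow}$ is a hypothetical name for the end-of-word symbol:

```latex
p(w) \;=\; p(\textsc{eow} \mid w) \prod_{t=1}^{|w|} p(w_t \mid w_{<t})
\;\le\; \prod_{t=1}^{|w|} \bigl(1 - p(\textsc{eow} \mid w_{<t})\bigr)
\;\le\; (1 - \varepsilon)^{|w|}
```

Each character competes with the end-of-word symbol for probability mass, so no single transition can exceed $1 - \varepsilon$, and the final factor $p(\textsc{eow} \mid w)$ is at most 1.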
Safe-guarded by Propositions 1 and 2, we now train a model to capture the unigram distribution. We will then use this model to estimate the code-length of an optimal lexicon.

Modeling the Unigram Distribution and its Challenges
Zipf's (1935) law states that the frequency of a word in a corpus is inversely proportional to its rank, resulting in a power-law distribution where a small subset of the words dominate the corpus. As such, naïvely training a character-level model on a language's tokens (i.e. predicting non-contextual wordforms with their natural corpus frequencies) would be unlikely to capture morphological regularities (Goldwater et al., 2011). Furthermore, it would burden the model to learn a mostly arbitrary assignment between form and frequency.
As an example, the English verb make is much more common than the nouns cake and lake, even if graphotactically they may be equally probable.
A closer inspection of English shows that the most frequent words tend to come from closed lexical classes including articles, pronouns, prepositions, and auxiliaries, such as the, of, it and be (Sinclair, 1999). These words tend to be short and manifest fossilized graphotactics (and phonotactics), as well as a higher prevalence of otherwise rare segments, such as the voiced and voiceless dental fricatives (orthographically expressed with th). These rare segments would be overrepresented in such a naïve training regime, making it hard for the character-level model to correctly represent the language's graphotactics.
In order to address the problem of skewed frequencies, we use a novel neuralization of Goldwater et al.'s (2011) two-stage model to capture the unigram distribution. This model consists of two components: a wordform generator and a token frequency adaptor. The generator is a character-level model which produces wordforms, for which we use an LSTM; this model should place similar probability mass on graphotactically "good" wordforms, such as make, cake, and lake. Meanwhile, the adaptor sets the frequency with which these wordforms will appear as tokens. Following Goldwater et al., we base our adaptor on the Pitman-Yor Chinese restaurant process (PYCRP; Pitman and Yor, 1997), which allows the adaptor to model a power-law distribution; this model is then responsible for capturing the fact that make is a more frequent token than cake and lake.

A Two-stage Model
The generative process of our two-stage model is presented graphically in Fig. 2. Our generator is a character-level LSTM language model, which generates a potentially infinite number of i.i.d. wordforms $\{\ell_k\}_{k=1}^{K}$. Independently, the PYCRP adaptor assigns each observed token in a dataset to a cluster $\{z_n\}_{n=1}^{N}$. In the literature, the value of $z_n$ is the "table assignment" of the $n$-th token. These clusters are then used as lookup indices to the wordforms, producing the observed word tokens $\{w_n\}_{n=1}^{N}$, where $w_n = \ell_{z_n}$. In general $N \gg K$, so tokens with the same wordform are grouped in a few clusters. In this way, the adaptor sets the frequency with which wordforms appear as tokens in a corpus by defining each cluster's probability.
Generating Wordforms. As mentioned above, wordforms are sampled i.i.d. from a distribution $p_\phi$ over strings defined by the generator. LSTMs have been shown to model phonotactics well (Pimentel et al., 2020b), and so we expect them to also work well with graphotactics. Specifically, this distribution over forms is defined as follows:

$$p_\phi(\ell) = \prod_{t=1}^{|\ell|} p(\ell_t \mid \ell_{<t}) \qquad (8)$$

where $\ell$ is a vector of characters forming a word and $\ell_t$ is its $t$-th character. Each of these characters is encoded with a lookup vector, producing representations $\mathbf{e}_t \in \mathbb{R}^{d_1}$, where $d_1$ is the embedding size. These embeddings are then used as input to an LSTM (Hochreiter and Schmidhuber, 1997), producing the representations $\mathbf{h}_t \in \mathbb{R}^{d_2}$, where $d_2$ is the size of the LSTM's hidden layer. The LSTM output is further used to obtain the distribution over potential characters:

$$p(\ell_t \mid \ell_{<t}) = \mathrm{softmax}(\mathbf{W} \mathbf{h}_t + \mathbf{b})$$

In this equation, both $\mathbf{W} \in \mathbb{R}^{|\Sigma| \times d_2}$ and $\mathbf{b} \in \mathbb{R}^{|\Sigma|}$ are learnable parameters and the zero vector is used as the initial hidden state $\mathbf{h}_0$. The distribution $p_\phi$, representing the generator, is then used to generate the set of wordforms $\{\ell_k\}_{k=1}^{K}$, which is expected to represent the graphotactics and morphology of the language. Notably, these wordforms do not explicitly capture any notion of token frequency.

Adapting Word Frequencies. The adaptor is responsible for modeling the word frequencies, and it has no explicit notion of the wordforms themselves. The PYCRP assigns each token $n$ to a cluster $z_n$. Each cluster $z_n$, in turn, has an associated wordform $\ell_{z_n}$, sampled from the generator. Consequently, all instances in a cluster share the same wordform. The probability of an instance $n$ being assigned to cluster $z_n$ is defined as follows:

$$p(Z_n = z_n \mid z_{<n}) \propto \begin{cases} c_{<n}^{(z_n)} - a & 1 \le z_n \le K_{<n} \text{ (old cluster)} \\ a \cdot K_{<n} + b & z_n = K_{<n} + 1 \text{ (new cluster)} \end{cases}$$

In this equation, $K_{<n}$ is the current number of populated clusters, while $c_{<n}^{(z_n)}$ is the number of instances currently assigned to cluster $z_n$. The PYCRP has two hyperparameters: $0 \le a < 1$ and $b \ge 0$. The parameter $a$ controls the rate at which the clusters grow (Teh, 2006), while $b$ controls an initial preference for dispersion. Together, these ensure the formation of a long tail, producing a power-law distribution for the cluster frequencies. This property allows a cluster with wordform make, for example, to have an exponentially larger frequency than its graphotactic neighbor cake.
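The cluster-assignment rule of the adaptor can be simulated directly. The following is a minimal sketch of sampling from a Pitman-Yor Chinese restaurant process using the old-cluster and new-cluster weights described above; the function name, the 0-based cluster indices and the chosen values of `a` and `b` are illustrative assumptions, not the paper's settings.

```python
import random

def sample_pycrp(num_tokens, a, b, seed=0):
    """Sample cluster assignments z_1..z_N from a Pitman-Yor CRP.

    Clusters are indexed from 0 here (the text indexes them from 1).
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of tokens currently in cluster k
    assignments = []
    for _ in range(num_tokens):
        if not counts:
            # The first token always opens the first cluster.
            counts.append(1)
            assignments.append(0)
            continue
        k_old = len(counts)
        weights = [c - a for c in counts]   # old clusters: c - a
        weights.append(a * k_old + b)       # new cluster: a * K + b
        z = rng.choices(range(k_old + 1), weights=weights)[0]
        if z == k_old:
            counts.append(1)
        else:
            counts[z] += 1
        assignments.append(z)
    return assignments, counts

assignments, counts = sample_pycrp(5000, a=0.5, b=1.0)
```

Sorting `counts` in decreasing order and plotting them against rank on log-log axes is one way to inspect the long tail that the hyperparameters `a` and `b` produce.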
Modeling Word Tokens. Finally, given the set of wordforms and the cluster assignments, defining the form associated with a token is deterministic. Since each cluster only contains instances of one wordform, the form of a token is defined by looking up the label of the cluster it was assigned to, $\ell_{z_n}$:

$$p(W_n = w_n \mid z_n, \ell) = \mathbb{1}\{w_n = \ell_{z_n}\} \qquad (11)$$

This way, the adaptor captures the frequency information of the words in the corpus, whereas the generator can focus on learning the language's graphotactics and morphology.
Model training. Unfortunately, we cannot directly infer the parameters of our model with a closed form solution. We thus use a solution akin to expectation maximization (Wei and Tanner, 1990): we freeze our LSTM generator while learning the PYCRP parameters, and vice versa. The PYCRP is trained using Gibbs sampling. For each token, we fix all cluster assignments $z_{-n}$ except for one, $z_n$. This cluster is then re-sampled from the marginal $p(Z_n = z_n \mid z_{-n}, \ell, w_n)$, where we have access to $w_n$ since it is an observed variable. During this optimization small clusters may vanish, and new clusters $z_n = K + 1$ (previously with no samples) may be created. This procedure, thus, may also produce new sets of wordforms $\{\ell_k\}_{k=1}^{K'}$, composed of the populated clusters' labels (where $K'$ is the new number of clusters). We assume the distribution of these wordforms, which have dampened frequencies, to be more balanced than in the original full set of word tokens. The LSTM is trained using stochastic gradient descent, minimizing the cross-entropy of precisely this set of clusters' wordforms. As such, it is expected to be a more representative model of a language's graphotactics; the irregular common words are less dominant in this training set. We give a longer explanation of our model training procedure, together with the hyperparameters used, in App. B.

A More Intuitive Explanation
Despite its slightly odd formulation, the two-stage model has an intuitive interpretation. Once we have learned (and fixed) its parameters, we obtain the marginal probability of a wordform as:

$$p(w \mid z, \ell) = \frac{c_w - n_w \cdot a}{N + b} + \frac{a \cdot K + b}{N + b}\, p_\phi(w)$$

In this equation, $c_w$ is the count of tokens with form $w$ in the training set, while $n_w$ is the number of distinct clusters with this same form. The model interpolates between a smoothed unigram corpus frequency and the probability an LSTM gives the analyzed wordform. This interpolation enables the model to place a non-zero probability mass on all possible wordforms (thus modeling an open vocabulary and having infinite support) while also placing a large probability mass on frequent wordforms. Furthermore, the smoothing factors per word type, together with the interpolation weight, are holistically learned by the PYCRP model using the training set.

Experimental Setup

Evaluation
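The value in which we are interested in this work is the expected cost of a code, given in eq. (4). We can easily estimate this value for a natural code by using its sample estimate:

$$\mathrm{cost}(C_{\mathrm{nat}}) \approx \frac{1}{N} \sum_{n=1}^{N} |w_n| \qquad (13)$$

For an optimal code, we can upper-bound it using the entropy of the distribution, while the entropy itself can be upper-bounded by the cross-entropy of a model on it. We can compute this upper bound with a sample estimate of the cross-entropy:

$$\mathrm{cost}(C^\star) < H(W) + 1 \le H_\theta(W) + 1 \approx 1 - \frac{1}{N} \sum_{n=1}^{N} \log_{|\Sigma|} p_\theta(w_n)$$

where $p_\theta$ denotes the trained two-stage model and $H_\theta(W)$ its cross-entropy. In practice, we get a tighter estimate by using the Shannon (1948) code's lengths directly:

$$\mathrm{cost}(C^\star) \lesssim \frac{1}{N} \sum_{n=1}^{N} \left\lceil -\log_{|\Sigma|} p_\theta(w_n) \right\rceil \qquad (15)$$

where $\lceil \cdot \rceil$ is the ceiling operation.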
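The two sample estimates used in the evaluation can be sketched as follows: the mean token length for the natural code, and the mean Shannon code length (the ceiling of the negative log-probability in base $|\Sigma|$) for the optimal code. The token list, the probability table and the alphabet size of 26 are hypothetical stand-ins; in the paper the probabilities would come from the trained two-stage model.

```python
import math

def natural_cost(tokens):
    """Sample estimate of the natural code's cost: mean token length."""
    return sum(len(w) for w in tokens) / len(tokens)

def shannon_cost(tokens, prob, alphabet_size):
    """Mean Shannon code length, ceil(-log_{|Sigma|} p(w)), over tokens."""
    total = 0
    for w in tokens:
        # Small tolerance guards against floating-point noise just above
        # an exact integer before taking the ceiling.
        total += math.ceil(-math.log(prob[w], alphabet_size) - 1e-12)
    return total / len(tokens)

# Hypothetical held-out tokens and hypothetical model probabilities
# (they need not sum to one: an open-vocabulary model reserves mass
# for unseen wordforms).
tokens = ["the", "cat", "the", "happiness", "the", "cat", "of", "of"]
prob = {"the": 0.4, "cat": 0.05, "of": 0.3, "happiness": 0.01}

nat = natural_cost(tokens)
opt = shannon_cost(tokens, prob, alphabet_size=26)
```

In this toy example the gap between `nat` and `opt` is the quantity of interest: how many characters per token the natural code spends beyond what an unconstrained optimal code would need.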

Morphological Constraints
As mentioned in §2, we use Morfessor (Smit et al., 2014) to tokenize our corpus into morphological units. Morfessor is a method for finding morphological segmentations from raw text data. As an unsupervised model, Morfessor is inherently noisy, but we take it as a proxy for a language's morphological segmentation. To compare the robustness of our results across different unsupervised segmentation algorithms, though, we also run our experiments using byte pair encoding (BPE; Gage, 1994; Sennrich et al., 2016) and WordPieces (Schuster and Nakajima, 2012). We train Morfessor on all pre-tokenized sentences in our language-specific Wikipedia corpus (described in §5.4). With this pre-trained model in hand, we tokenize all words in our training, development and test sets. We get a set of morpheme tokens $\{u_{n,j}\}_{j=1}^{J_n}$ for each word $w_n$, where this word is split into $J_n$ morphological units.
We can now get the optimal length of a morphologically constrained code. To this end, we first train a fresh version of our two-stage model on the full set of morphological unit tokens, i.e. $\{u_{n,j} \mid n \le N, j \le J_n\}$, as opposed to the set of full word tokens, $\{w_n\}_{n=1}^{N}$. We estimate the length of this code with the following equation:

$$\mathrm{cost}(C_{\mathrm{morph}}) \lesssim \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{J_n} \left\lceil -\log_{|\Sigma|} p_\theta(u_{n,j}) \right\rceil$$

where $p_\theta$ denotes the two-stage model trained on morphological units. Note that this cost estimate is still the average code length per word token; as such, we take the expectation over the meanings distribution. Each word's code length, though, is now defined as the sum of the lengths of its constituent morphemes.
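The morphologically constrained cost can be sketched in the same way: each word token contributes the sum of the Shannon code lengths of its units, and the average is still taken per word token. The segmentations and unit probabilities below are hypothetical; they stand in for Morfessor's output and for the two-stage model trained on units.

```python
import math

def morph_cost(segmented_tokens, unit_prob, alphabet_size):
    """Mean per-word code length when each word is coded unit by unit."""
    total = 0
    for units in segmented_tokens:
        total += sum(
            math.ceil(-math.log(unit_prob[u], alphabet_size) - 1e-12)
            for u in units
        )
    # Average over word tokens, not over morphological units.
    return total / len(segmented_tokens)

# Hypothetical segmented word tokens and unit probabilities.
segmented = [["un", "match", "ed"], ["match"], ["tower", "s"], ["cat", "s"]]
unit_prob = {
    "un": 0.05, "match": 0.01, "ed": 0.1,
    "tower": 0.005, "s": 0.2, "cat": 0.02,
}

cost_morph = morph_cost(segmented, unit_prob, alphabet_size=26)
```

Because every unit pays its own ceiling, a word split into many units can cost more than the same word coded whole, which is one way to read the "cost of morphology" discussed in the text.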

Graphotactic Constraints
The second linguistic constraint we would like to impose on our codes is graphotactic well-formedness, i.e. we wish our code to be composed only of sequences of characters that comply with the regularities observed in the language, such as vowel harmony, syllable structure, or word-initial and word-final constraints. We use our generator LSTM for this. As mentioned before, this model is trained on wordforms with dampened frequencies; we thus expect it to learn a language's graphotactic patterns above a minimum quality threshold. We use this character-level model to sample (without replacement) as many unique wordforms as there are word types in that language (see Tab. 3 in App. A). We assign each of these sampled wordforms $w_n$, ordered by word length, to one of the language's meanings $m_n$, inversely ordered by unigram probability, i.e. $C_{\mathrm{graph}}(m_n) = w_n$, thus generating an optimally Zipfian frequency-length correlation. With these assignments, we estimate the cost of a graphotactically constrained code:

$$\mathrm{cost}(C_{\mathrm{graph}}) = \sum_{m \in \mathcal{M}} p(m)\, |C_{\mathrm{graph}}(m)|$$

Analogously, with the generator trained on morpheme units we get an optimal code under both morphological and graphotactic constraints.
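The assignment step described above amounts to two sorts and a zip. In this minimal sketch the sampled wordforms are a hypothetical hand-written list standing in for samples from the generator, and the function names are illustrative.

```python
def zipfian_assignment(meaning_probs, sampled_forms):
    """Pair the most probable meanings with the shortest sampled wordforms."""
    meanings = sorted(meaning_probs, key=meaning_probs.get, reverse=True)
    # Deduplicate, then order by length (alphabetical order breaks ties).
    forms = sorted(set(sampled_forms), key=lambda w: (len(w), w))
    if len(forms) < len(meanings):
        raise ValueError("need at least one unique wordform per meaning")
    return dict(zip(meanings, forms))

def expected_length(meaning_probs, code):
    """Expected code length under the unigram distribution over meanings."""
    return sum(p * len(code[m]) for m, p in meaning_probs.items())

# Hypothetical unigram distribution and hypothetical generator samples.
meaning_probs = {"the": 0.5, "make": 0.3, "cake": 0.15, "happiness": 0.05}
sampled = ["blorin", "ta", "stell", "o"]

graph_code = zipfian_assignment(meaning_probs, sampled)
graph_cost = expected_length(meaning_probs, graph_code)
```

Since the forms are drawn from a graphotactic model rather than enumerated from shortest to longest, the resulting code is well-formed by construction but longer than the unconstrained optimum.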

Dataset
We use Wikipedia data in our experiments. The data is preprocessed by first splitting it into sentences and then into tokens using SpaCy's language-specific sentencizer and tokenizer (Honnibal et al., 2020). After this, all punctuation is removed and the words are lower-cased. We subsample (without replacement) one million sentences of each language for our experiments, due to computational constraints. We then use an 80-10-10 split for our training, validation and test sets. We choose typologically diverse languages for our experiments, each from a different language family: English, Finnish, Hebrew, Indonesian, Tamil, Turkish and Yoruba. Dataset statistics are presented in App. A. These languages vary in their graphotactic tendencies and morphological complexity. In order to improve our data quality, we hand-defined an alphabet for each language and filtered sentences with it, only considering sentences consisting exclusively of valid characters.

Note on sampling wordforms (footnote 11): our LSTMs use a softmax non-linearity to assign probabilities and, as such, cannot produce zeros. Furthermore, due to the compositional nature of wordform probabilities (see eq. (8)), short implausible forms may have larger probability mass than long plausible ones. To mitigate this effect, when sampling wordforms we impose a minimum threshold of 0.01 on each transition probability $p(\ell_t \mid \ell_{<t})$.
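The cleaning and splitting steps of the dataset preparation can be sketched with the standard library alone. This is an approximation under stated assumptions: sentences are taken to be already tokenized (the paper uses SpaCy for that step), punctuation is limited to ASCII punctuation, and the split is done by shuffling with a fixed seed, which the text does not specify.

```python
import random
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentence(tokens, alphabet):
    """Lower-case, strip punctuation, and reject sentences with invalid characters."""
    words = []
    for tok in tokens:
        tok = tok.lower().translate(PUNCT_TABLE)
        if not tok:
            continue  # token was pure punctuation
        if any(ch not in alphabet for ch in tok):
            return None  # sentence contains a character outside the alphabet
        words.append(tok)
    return words or None

def split_80_10_10(sentences, seed=0):
    """Shuffle and split into training, validation and test sets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical alphabet for illustration: plain lowercase ASCII letters.
alphabet = set(string.ascii_lowercase)
kept = clean_sentence(["The", "cat", ",", "sat", "!"], alphabet)
dropped = clean_sentence(["naïve", "cat"], alphabet)
train, dev, test = split_80_10_10([[str(i)] for i in range(10)])
```

Filtering at the sentence level, rather than dropping individual tokens, keeps the remaining sentences intact at the price of discarding some data.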

Summary
In this paper we consider the following codes:

Optimal. An information-theoretically optimal code under our two-stage model, estimated as defined by eq. (15). This is our most compressed code and does not include either morphological or graphotactic constraints.

Morph+Graph. A code constrained by both morphology and graphotactics; defined by eq. (18).
Natural. The natural code-equivalent to the average token length and defined by eq. (13). This is the code length actually observed in our corpora.
Zipfian. A code estimated by re-pairing wordforms with meanings based on their frequencies; we then compute eq. (13) in this new code. This would be equivalent to the natural code length if lexicons had a perfect word length-frequency correlation (i.e., a Spearman's rank correlation of 1).

Shuffle. A code estimated by randomly re-pairing wordforms with meanings and computing eq. (13) in this new code. This would be equivalent to the natural code length if Zipf's law of abbreviation did not exist, i.e. lexicons had no word length-frequency correlation.

Note on alphabets (footnote 13): the sets of valid characters for each language are defined based on Wikipedia entries for the languages and the alphabets available in https://r12a.github.io/app-charuse/.
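The Natural, Zipfian and Shuffle codes all reuse the observed wordforms and differ only in how forms are paired with meanings. A minimal sketch, using a hypothetical eleven-token corpus in which each observed word doubles as the identifier of its meaning:

```python
import random
from collections import Counter

def mean_token_length(counts, code):
    """Average code length per token under a meaning -> wordform mapping."""
    total = sum(counts.values())
    return sum(c * len(code[m]) for m, c in counts.items()) / total

def zipfian_code(counts):
    """Re-pair forms so the most frequent meaning gets the shortest form."""
    by_freq = sorted(counts, key=lambda m: (-counts[m], m))
    by_len = sorted(counts, key=lambda w: (len(w), w))
    return dict(zip(by_freq, by_len))

def shuffled_code(counts, seed=0):
    """Re-pair forms with meanings uniformly at random."""
    forms = list(counts)
    random.Random(seed).shuffle(forms)
    return dict(zip(counts, forms))

# Hypothetical corpus in which a long word happens to be frequent.
tokens = "happiness is the cat and happiness is the jib and happiness".split()
counts = Counter(tokens)

natural = mean_token_length(counts, {w: w for w in counts})
zipfian = mean_token_length(counts, zipfian_code(counts))
shuffle = mean_token_length(counts, shuffled_code(counts))
```

By the rearrangement inequality the Zipfian pairing can never be longer than any other pairing of the same forms, so it lower-bounds both `natural` and `shuffle`; in practice one would average the Shuffle cost over several random seeds.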

Results
The average length for each considered code is presented in Fig. 3 and Tab. 1. As expected, we find that the average code length across natural languages is shorter than the shuffle condition and longer than the optimal condition. Interestingly, the codes produced by the other conditions investigated here also follow the same order across all analyzed languages.
Adding morphological constraints on the code incurs no more than one extra character over the optimal condition, except for Finnish, for which the cost of morphology is slightly above one character. Notably, the use of unsupervised morphological segmentation may introduce some noise into our measurements. Consistent with our expectations, though, Yoruba (a morphologically poor language) pays the smallest cost for its morphology, while Finnish (a morphologically rich one) pays the largest.
BPE and WordPiece systematically produce shorter codes than Morfessor. This is sensible, since the first two would keep most frequent wordforms intact, generating a unique code for each of them. This would lead to codes in which the morphological productivity of frequent and infrequent words differs, amplifying frequency effects encountered in natural languages (Lieberman et al., 2007).
The graphotactic condition yields systematically longer codes than the morphological one, although here there are important differences between languages: English, Hebrew and Indonesian have similar code lengths for both code constraints; in the other languages the graphotactic code is substantially longer than the morphological one.
In all cases, the natural code is longer than the one with both graphotactic and morphological constraints, suggesting languages are not optimally compressed, even when accounting for these constraints. That said, all of the natural languages are considerably more compressed than a lexicon produced by randomly reassigning wordforms.

Discussion and Conclusion
In this paper, we introduced a model-based strategy to assess the relative contribution of different constraints on word (code) length at large. In particular, we evaluated how much natural languages differ from systems optimized for Zipf's law of abbreviation. Our proposed model improves upon an old method used to consider the efficiency of the lexicon: random typing models (Miller, 1957;Moscoso del Prado, 2013;Ferrer-i-Cancho et al., 2020). Miller introduced the idea of monkeys typing randomly on a keyboard and analyzed the properties of its resulting language. The monkeys' text, however, has no morphological or graphotactic constraints (but see Caplan et al., 2020) and does not follow a language's unigram distribution (Howes, 1968). As such, it cannot directly encode the same meanings or messages as the original language.
Our results show that, while natural languages do tend to map frequent messages to shorter words, the magnitude of this effect varies widely across our set of diverse languages. Notably, the distance between natural languages and the optimal codes is Figure 5: Fraction of code length accounted for by the combined morphology and graphotactics model larger than the distance between natural languages and their corresponding shuffled code (see Fig. 4). In other words, natural codes are closer to not being optimized (in the Zipfian sense) than to being maximally compressed.
That said, our morphological and graphotactic baselines, when combined, yield codes that display mean code lengths that are (in most cases) closer to the natural code than to the optimal (see Fig. 5). If our models are indeed able to capture the true patterns in our data, then this means that (compositional) morphology and graphotactics, along with the law of abbreviation, are sufficient to account for most of the length of natural codes-as observed in real languages. Graphotactic (primarily) and morphological constraints are enough to derive a code with a similar complexity to that of natural languages, which suggests the other factors discussed above (associated with, e.g., surprisal and non-arbitrary form-meaning mappings) likely play a more modest role in pushing natural languages away from the optimal Zipfian code.
The optimality of the lexicon occupies a major place in the scientific study of the structure and functional evolution of languages (Bentz and Ferrer-i-Cancho, 2016; Gibson et al., 2019; Mahowald et al., 2020). We hope that the method presented here, which allows for a more precise quantification of the (non-)optimality of lexicons, will be used to further the goal of understanding why languages are structured in the ways that they are, while offering insight into the functional tradeoffs that underlie language variation and change.

Ethical Concerns
This paper investigates the optimality of lexicons from the perspective of Zipf's law of abbreviation. As we focus on computational linguistic experiments, we see no clear ethical concerns here. Nonetheless, we note that Wikipedia (from which we collect data) is not a fully representative source of a language's data; the biases in the data will likely also be present in our results.

B Model Training
As mentioned in the main text, we cannot directly infer the parameters of our model, so we use a solution similar to expectation maximization (Wei and Tanner, 1990). We freeze our LSTM generator while learning the PYCRP parameters and then freeze the PYCRP to train the LSTM model.
Expectation step. This step uses a Gibbs sampling procedure to estimate the parameters of the PYCRP. For each token in our dataset, we fix all cluster assignments $\mathbf{z}_{-n}$ except for the given token's own assignment $z_n$. We then re-sample this token's cluster based on the marginal probability $p(z_n \mid \mathbf{z}_{-n}, \boldsymbol{\ell}, w_n)$. We do this for 5 epochs, and use the assignments which result in the best development set cross-entropy. This process can both remove clusters and create new ones by replacing tokens. The set of populated clusters (together with their wordform labels) then allows creating a new wordform dataset of size $K$, where the distribution of the token frequencies is expected to be less skewed. In practice, this wordform dataset is thus created from the resulting set of cluster labels $\{\ell_k\}_{k=1}^{K}$, i.e. a word will appear in this new dataset as many times as it was assigned as a cluster label.
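To make the re-sampling step concrete, the following is a minimal sketch of one Gibbs sweep. It assumes the standard Pitman-Yor Chinese restaurant seating rule for two-stage models: an existing cluster whose label matches the token's wordform is chosen with weight proportional to its count minus a, and a new cluster with weight proportional to (aK + b) times the generator probability of the wordform. The source does not spell out this conditional, so the rule is an assumption. The names `gibbs_sweep`, `seating`, `tables` and `gen_prob` are hypothetical, `gen_prob` stands in for the frozen LSTM generator, and the sketch does not implement the optimized algorithm of Blunsom et al. (2009).

```python
import itertools
import random


def gibbs_sweep(tokens, seating, tables, gen_prob, a=0.5, b=10_000.0, rng=random):
    """Re-sample the cluster of every token once (hypothetical helper).

    tokens:   list of wordforms, one per corpus token
    seating:  seating[n] is the cluster id of token n (updated in place)
    tables:   dict wordform -> {cluster id: token count} (updated in place)
    gen_prob: callable wordform -> probability; stand-in for the frozen generator
    Returns the number of populated clusters after the sweep.
    """
    largest_id = max((cid for t in tables.values() for cid in t), default=-1)
    fresh_ids = itertools.count(largest_id + 1)
    num_clusters = sum(len(t) for t in tables.values())

    for n, word in enumerate(tokens):
        word_tables = tables.setdefault(word, {})

        # Remove token n from its cluster; an emptied cluster disappears.
        old = seating[n]
        word_tables[old] -= 1
        if word_tables[old] == 0:
            del word_tables[old]
            num_clusters -= 1

        # Assumed seating rule: only clusters labelled with `word` are eligible.
        candidates = list(word_tables)
        weights = [word_tables[cid] - a for cid in candidates]
        candidates.append(None)  # None marks "open a new cluster"
        weights.append((a * num_clusters + b) * gen_prob(word))

        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice is None:
            choice = next(fresh_ids)
            word_tables[choice] = 0
            num_clusters += 1
        word_tables[choice] += 1
        seating[n] = choice

    return num_clusters
```

Under these assumptions, repeating the sweep for several epochs and keeping the assignments with the best development cross-entropy would mirror the procedure described above. The default values of `a` and `b` are the ones reported in the hyperparameter paragraph.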
Maximization step. We use the set of populated cluster labels to train the generator LSTM, assuming that this allows learning a more representative model of a language's graphotactics, as the irregular common words are less dominant in its training set. In other words, at each epoch, the generator will be trained on a wordform as many times as it has been assigned as a cluster label.
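The construction of the generator's training set can be sketched as follows. This is an illustrative sketch only: the helper name `cluster_label_dataset` and the flat list representation of assignments are assumptions, not part of the original implementation.

```python
from collections import Counter


def cluster_label_dataset(tokens, assignments):
    """Return one wordform per populated cluster (hypothetical helper).

    tokens:      list of wordforms, one per corpus token
    assignments: assignments[n] is the cluster id of token n
    All tokens in a cluster share the cluster's wordform label, so the
    first token seen for a cluster determines that label.
    """
    labels = {}
    for word, cluster in zip(tokens, assignments):
        labels.setdefault(cluster, word)
    return sorted(labels.values())


# Toy example: four tokens of "the" occupy only two clusters.
toy_tokens = ["the", "the", "the", "cat", "the"]
toy_assignments = [0, 0, 1, 2, 1]
toy_dataset = cluster_label_dataset(toy_tokens, toy_assignments)
token_counts = Counter(toy_tokens)
label_counts = Counter(toy_dataset)
```

In this toy example, "the" accounts for four of five tokens but only two of three cluster labels, which illustrates why the label dataset is expected to be less skewed than the token distribution.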

Hyperparameters and implementation details.
For the PYCRP, we fix hyper-parameters a = 0.5 and b = 10,000, and we use the optimized Gibbs sampling algorithm designed by Blunsom et al. (2009). As our generator, we use a three-layer LSTM with an embedding size of 128, a hidden size of 512 and a dropout of 0.33. This LSTM is trained using AdamW (Loshchilov and Hutter, 2019), and we warm-start it by initially training on the set of word types in the training set (the set of unique wordforms in it).

C Proof of Proposition 2
We present here the proof of Proposition 2. This proposition is repeated here for convenience: Proposition 2. If a language model p(w) is ε-smooth, then its entropy is finite, i.e. H(W) < ∞.
Proof. To prove this, we break the entropy of a string into two parts, the entropy of the first character plus the entropy of the following ones given the first, as in: We first bound the entropy of the first character using a uniform-distribution upper bound: ≤ − log |Σ| We now use the ε-smoothness property to upper-bound the entropy of the following characters: Finally, with simple algebraic manipulations we complete the proof:
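For readers who want to see why a bound of this kind holds, the following is a hedged sketch of one possible argument. It is not a reconstruction of the derivation above. It assumes that ε-smoothness means the end-of-string symbol EOS receives probability at least ε > 0 after every prefix, and it bounds the entropy position by position rather than splitting off the first character. The symbols T (the position at which EOS is emitted) and W_t (the t-th symbol, taking values in Σ ∪ {EOS}) are introduced here for illustration only.

```latex
% Assumption (not taken from the source): p(EOS | w_{<t}) >= \varepsilon > 0
% for every prefix w_{<t}, so that P(T >= t) <= (1 - \varepsilon)^{t-1}.
% p(w_{<t}) denotes the probability that the string starts with w_{<t}.
\begin{align}
H(W) &= \sum_{t=1}^{\infty} \sum_{w_{<t} \in \Sigma^{t-1}}
        p(w_{<t}) \, H(W_t \mid W_{<t} = w_{<t})
        && \text{(chain rule)} \\
     &\le \log(|\Sigma| + 1) \sum_{t=1}^{\infty} P(T \ge t)
        && \text{(uniform upper bound at each position)} \\
     &\le \log(|\Sigma| + 1) \sum_{t=1}^{\infty} (1 - \varepsilon)^{t-1}
        && \text{(assumed $\varepsilon$-smoothness)} \\
     &= \frac{\log(|\Sigma| + 1)}{\varepsilon} < \infty
        && \text{(geometric series)}
\end{align}
```

Under this reading, the finite bound follows because the probability of a string continuing past position t decays geometrically, while each position contributes at most log(|Σ| + 1) to the entropy.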