Modeling Color Terminology Across Thousands of Languages

There is an extensive history of scholarship into what constitutes a “basic” color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically-grounded computational linguistic metrics we design, as well as their aggregation, correlate strongly with both the Berlin and Kay basic/secondary color term partition (γ = 0.96) and their hypothesized universal acquisition sequence. The measures and results provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.


Introduction
How many colors are in the rainbow? An infinite number, but each language divides up perceptual space into a finite number of categories by giving names to colors. The seminal work on color categories, by Berlin and Kay (1969, hereafter B&K), characterizes a universal evolutionary sequence for languages' core colors (their basic color terms) and their corresponding categories, at each stage refining the partition of color space.
A handful of criteria define basic color terms, including abstractness, monomorphemicity, and not being subsumed by a broader basic term. (See §2 for the complete list.) These criteria are accused of biasing analyses of color systems, especially in non-Western societies (Wierzbicka, 2006). To mitigate this bias, a pan-lingual approach to analyzing color systems may reveal general ("universal") trends more reliably than smaller datasets. While data are hard to find in the long tail of languages, we still aim to consider more than ever before: 2491 languages and dialects. We leverage natural language processing tools to operationalize longstanding literature on language universals. We provide a three-pronged investigation of the classic criteria for basic color terms, examining the degree to which color words are abstract (§5), monomorphemic/monolexemic (§6), and salient (§7). Our operationalization of these B&K criteria shows that individual features do not reflect the basic/non-basic divide. Nor is this divide binary, as B&K suggest: We show that abstractness, monomorphemicity, and even salience do not cleanly divide colors.

Language     Color Word    Literal Gloss
Welsh        brown         brown
Italian      marrone       chestnut
Persian      قهوه‌ای         coffee + of
Cantonese    咖啡色         coffee + color
Nonetheless, by treating basicness as a spectrum and aggregating these features (like human-judged concreteness, frequency of compounding, and word length) into basicness scores ( §8), we can largely distinguish between basic and non-basic colors (validating our measures), and our scores recreate the historical sequence of color acquisition in language. The sequence is in no way directly encoded in the criteria for basic color terms; as such, recreating it is a separate and novel empirical discovery.

Color Terminology
Not all languages have the same number of color words; for instance, a single Korean color word (pureu-n) applies to both grass and sky, an unusual concept for native English speakers. Similarly, Russian distinguishes between two families of what English speakers call "blue": the lighter goluboy and the darker siniy. Reaction time experiments show the cognitive importance of these categories (Gilbert et al., 2006; Winawer et al., 2007), and the existence of a named category both aids (Brown and Lenneberg, 1954) and guides (Bae et al., 2015; Cibelli et al., 2016) color judgment and memory.
Color terms may be concrete (i.e., derived from a real-world referent like "blood" or "sky") or abstract. Diachronic processes can weaken the link between a concrete term and its referent, until a new cohort of speakers believes the term to be abstract. Indeed, this process explains the development of English color words (Casson, 1994). In addition to metonymy with named things, the words may be borrowed, compounded, or inherited from an ancestor language.
While industrialized societies' languages possess a wealth of color words (Hardin, 2014), only a handful are considered basic color terms; the remainder are secondary. A basic color term (BCT) must satisfy four obligatory criteria (B&K):
1. It must be monolexemic (and monomorphemic). "Light blue" and "blue-green" each contain two lexemes and do not qualify.
2. It may not possess any color hypernyms (superordinate color terms). (E.g., "lavender" has the hypernym "purple".)
3. It may not be limited in application to a narrow class of objects. "Blond(e)" may only be applied to a handful of referents like hair, wood, and beer, for example.
4. It must be psychologically salient. This implies that the color term has a stable range of reference across speakers and has an entry in the lexemic inventory of most (if not all) native speakers' respective idiolects.
Additional criteria are introduced in cases of doubt (Kay and McDaniel, 1978), though these are subjectively applied (Crawford, 1982). Among these: (5) a BCT is not the name of an object that characteristically has a particular color; in other words, the color must be abstract, and not grounded in some concrete object (which rules out colors like gold and salmon). Additionally, (6) recent foreign loan words "are suspect", and (7) if the lexemic status of the word is difficult to judge, then multimorphemic words are also "suspect".
Figure 1: The diachronic sequence of color acquisition (Berlin and Kay, 1969).
In addition to this definition, B&K surveyed speakers of 20 languages in the San Francisco Bay Area, and made a sweeping examination of the literature, to find a sequence in the emergence of color words in language. Cultures with two color words universally used them to distinguish light and warm colors from dark and cool ones; the third color was universally red, and the sequence continued until matching the set of eleven colors represented by English basic color terms. We present their partial ordering in Figure 1, though later authors have proposed alterations (Heider, 1972; Kay, 1975).
We are not the first to assess the notion of a basic color term. Crawford (1982) gives a point-by-point rebuttal on pragmatic grounds: the criteria are hard for a field worker to assess, and many introduce subjectivity that will bias data collection. Lucy (1997) argues that the definition provides more of a post-hoc screening tool for when the "denotational net" of elicitation has captured too many terms, as opposed to a morphosyntactically informed approach (e.g., Conklin, 1955). Finally, Wierzbicka (2006) argues that other societies may not share the Western conception of hue-based color terms, making the application of the concept inappropriate. In addition to these postulatory objections, a vast literature of similarity judgments, reaction times, and other human measures debates the question from a cognitive perspective (Heider, 1972; Jameson, 2005; Roberson et al., 2005, 2008; Goldstein et al., 2009; Loreto et al., 2012; Persaud and Hemmer, 2014, inter alia).
By contrast, we examine the conditions empirically, broadly and automatically on a massively multilingual scale (versus manually and theoretically). Our evidence for assessing B&K's criteria of abstractness, monomorphemicity, and salience comes from a multilingual dragnet of color terms.

Data
We investigate the three aspects of our theory assessment (abstractness, monomorphemicity, and salience) through multilingual dictionaries. We additionally leverage English corpora to explore abstractness and salience. We use these to construct a dataset of color senses and translations, with scores along numerous axes. As a final resource to investigate salience, we use a global elicitation of color terms from pre-industrialized societies.
In English, the basic color terms are red, orange, yellow, green, blue, purple, brown, pink, black, white, and grey. These align to the eleven basic color categories identified by Berlin and Kay (1969). In addition to these eleven, we consider a list of 92 second-tier color terms identified by Casson (1994). These were elicited from 30 speakers over several days to ensure salience, then filtered by a dictionary to keep only conventional (rather than novel) color descriptors. We omit 12 of these which do not appear in the datasets we employ.
Translation may act as a useful resource for disambiguating word senses (Diab and Resnik, 2002). By translating English BCTs into other languages, we can find their basic color terms. Then, back-translating to English, we obtain a list of the potential senses of a given color term. We draw translations of color terms from two large type-level dictionary resources, PanLex (Baldwin et al., 2010; Kamholz et al., 2014) and Wiktionary, which together provide color word translations for 2491 languages or dialects.
For each of the 11 basic color concepts and 80 secondary terms $e$ in English, we translate it into every available foreign language $\ell$ by dictionary lookup to get a set of non-English color words $F_\ell$. We then back-translate each term into English (again by lookup) to get a set of English back-translations. The final dataset contains tuples of the form (English color word, foreign language, foreign word, English back-translation).
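The lookup and back-translation steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the two toy dictionaries, the ISO 639-3 language codes, and the function name `build_tuples` are all assumptions, and real PanLex or Wiktionary access would look different.

```python
# Hypothetical toy dictionaries standing in for PanLex/Wiktionary lookups.
EN_TO_FOREIGN = {
    "brown": {"ita": ["marrone"], "cym": ["brown"]},
}
FOREIGN_TO_EN = {
    ("ita", "marrone"): ["brown", "chestnut"],
    ("cym", "brown"): ["brown"],
}


def build_tuples(english_terms, en_to_foreign, foreign_to_en):
    """Return (English word, language, foreign word, back-translation) tuples."""
    rows = []
    for e in english_terms:
        # Translate the English term into every language that has an entry.
        for lang, words in sorted(en_to_foreign.get(e, {}).items()):
            for f in words:
                # Back-translate each foreign word into English by lookup.
                for back in foreign_to_en.get((lang, f), []):
                    rows.append((e, lang, f, back))
    return rows


rows = build_tuples(["brown"], EN_TO_FOREIGN, FOREIGN_TO_EN)
for row in rows:
    print(row)
```

Back-translations other than the original English word (here, "chestnut") are what expose the additional senses of a color term.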
to qualitative observations, our experiments evaluate these qualities through several metrics, illuminating flaws in the definition of "basic color term". When averaging the 14 features together, the implied total ordering is suggestive of the original B&K sequence.
Goodman and Kruskal's gamma. We measure correlation between basicness and our features with Goodman and Kruskal's gamma (Goodman and Kruskal, 1954, 1963, 1972), which is well suited for comparing binary variables to ordinal ones. It is a pair-counting measure which ignores tied values. We compute it by maximum likelihood estimation, giving an expression:
$$\gamma = \frac{N_s - N_d}{N_s + N_d}$$
where $N_s$ is the number of color pairs for which basicness and a feature agree in their ranking; $N_d$ is the number of pairs ranked in opposite orders. Arbitrarily, we represent basicness as 1 and nonbasicness as 0. We will be concerned only with the magnitude, not the direction of correlation. A score of ±1 thus indicates perfect correlation with basicness.
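As a rough sketch of this pair-counting measure, the function below counts concordant and discordant pairs and ignores any pair tied on either variable. The function name and the example scores are invented for illustration; they are not taken from the paper's data.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Gamma = (Ns - Nd) / (Ns + Nd); pairs tied on either variable are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables rank the pair the same way
        elif product < 0:
            discordant += 1      # the variables rank the pair in opposite orders
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Invented example: 1 = basic, 0 = non-basic, with a made-up feature score.
basic = [1, 1, 1, 0, 0, 0]
score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
gamma = goodman_kruskal_gamma(basic, score)
print(round(gamma, 3))
```

Because basicness is binary, every pair of two basic (or two non-basic) colors is tied and drops out, so only basic versus non-basic pairs contribute.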
A remark on noise Each of our measures is imperfect; it possesses a bias. Combining the results from many weak indicators gives robustness to the noise in any given measure. Foreshadowing future discussion, we see this supported empirically by the effect of aggregation on Goodman and Kruskal's gamma.

Abstractness
The basic color terms in English did not start as abstract concepts. For instance, "orange" and "pink" were originally derived from the concrete color of a fruit (Citrus sinensis) and flower (Dianthus plumarius) respectively, and earlier the English color term "black" had its origin in a word for soot, but the abstract color senses of these words rose in relative popularity; many supplanted the original definitions as the more common word sense (Casson, 1994). It could be that many "basic" color terms emerged metonymically from concrete, real-world referents, as in English.

Concreteness judgments
As a first pass, we directly look up the concreteness of each color word on our list in a dataset of concreteness ratings for 40,000 English lemmas (Brysbaert et al., 2014), as rated on a Likert scale by Amazon Mechanical Turk workers in the United States. As expected, we find that concreteness negatively correlates with basicness (γ = 0.58). Many nonbasic colors are less concrete than basic terms; for instance, "beige" is the least concrete color (3.41), and most words are less concrete than "orange".

A hologeistic perspective
A word may have multiple senses, which we hope to capture by taking a pan-lingual, or hologeistic, perspective, getting at the concept itself rather than any surface form. To do this, we find the human-judged concreteness of each of our backtranslations, then average these for each color term, weighted by the number of languages for which a word is a back translation. This balances between a single frequent sense and multiple infrequent senses.
Performing this averaging over languages and senses magnifies our correlation to γ = 0.62; clearly, exploiting the diversity of senses is beneficial. Still, there is no clear separation between basic and secondary color terms. Concreteness or abstractness thus provides incomplete evidence of basicness.
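The language-weighted average over back-translations might be computed along these lines. This is a sketch under stated assumptions: the function name is hypothetical, back-translations without a concreteness rating are simply skipped (the paper does not say how they are handled), and the ratings and counts are invented rather than taken from Brysbaert et al.

```python
def hologeistic_concreteness(back_translations, concreteness):
    """Average concreteness of back-translations, weighted by language count.

    back_translations: {english_word: number of languages yielding it}
    concreteness: {english_word: human rating}; unrated words are skipped.
    """
    weighted_sum = 0.0
    total_weight = 0
    for word, n_languages in back_translations.items():
        if word in concreteness:
            weighted_sum += n_languages * concreteness[word]
            total_weight += n_languages
    if total_weight == 0:
        return None  # no rated back-translation is available
    return weighted_sum / total_weight


# Invented numbers for illustration only.
ratings = {"orange": 4.0, "fruit": 5.0}
counts = {"orange": 30, "fruit": 10, "unrated-word": 5}
value = hologeistic_concreteness(counts, ratings)
print(value)
```

Weighting by language count lets one very frequent sense dominate without discarding the evidence from many rarer senses.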

Part of speech as a proxy for concreteness
Adjectives are, on average, perceived as more abstract than nouns (Darley et al., 1959). We affirm this finding: in the Brysbaert et al. judgments, nouns are less abstract than adjectives on average (mean concreteness 3.53 versus 2.50). Because of this, we are comfortable using color words' part of speech as a coarse hint of their abstractness.
We collect part-of-speech annotations from two sources: the Google Books Ngram Corpus, containing about 4% of all books ever printed (Michel et al., 2011), and the Penn Treebank (Marcus et al., 1993). The former is machine-annotated for part of speech (and thus noisier); the latter is annotated by linguists. For each corpus, we compute the ratio of adjectival to nominal uses of each color word, as well as its total frequency.
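A minimal sketch of the adjective-to-noun ratio for one word is given below. The coarse `ADJ`/`NOUN` tag names, the token format, and the function name are assumptions made for illustration; the Penn Treebank, for instance, uses a finer tag set that would first need to be mapped.

```python
from collections import Counter


def adjective_noun_ratio(tagged_tokens, word):
    """Return (adjective count / noun count, total count) for one word."""
    tags = Counter(tag for token, tag in tagged_tokens if token.lower() == word)
    adjectives = tags["ADJ"]
    nouns = tags["NOUN"]
    # A word never tagged as a noun gets an infinite ratio in this sketch.
    ratio = adjectives / nouns if nouns else float("inf")
    return ratio, sum(tags.values())


# Hypothetical tagged tokens with coarse part-of-speech labels.
tokens = [("orange", "NOUN"), ("orange", "ADJ"), ("Orange", "ADJ"),
          ("sky", "NOUN"), ("orange", "ADJ")]
ratio, total = adjective_noun_ratio(tokens, "orange")
print(ratio, total)
```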

Morphology
In this section, we ask whether there are affixes that are highly correlated with color; these can be either general derivational affixes or sequences specific to color terms, as in Table 1 and Table 4. The presence of subword structure in basic concepts' translations would show that they violate the monomorphemicity criterion.
Although the B&K criteria demand monolexemic and monomorphemic words, color terms in many languages are formed by some derivational process from a concrete term. To discover the components in an unsupervised fashion, as is necessary for languages in the long tail of linguistic resource availability, we look for constituent morphemes and constituent lexemes by segmentation and compound detection, respectively, on each foreign language of our dataset, then use these to score colors' basicness.

Affix discovery
Segmentation and affix discovery are a challenge for the low-resource languages in our study. To give signal to the model, we leverage our other metrics. We compute the percentage of the time that a color word's translation occurs with a suffix that is strongly associated with one of the 10 highest-ranking colors on the basicness scale, according to the aggregation we mention in §4 and detail in §8. In other words, words that associate with a typical color affix in that language tend to be colors. This is not a test for basicness, per se, but rather for being a likely color, so it supplements the part of speech measures. But it does diminish the rank of almost all words with senses/translations that are not primarily colors. In addition, this measure has the highest correlation of any of ours with basicness (γ = 0.92), though this is not surprising, as it was computed by bootstrapping from the other results, which already correlate well in aggregate.
As another tack, we identify likely color-related and general derivational affixes by unsupervised morphological segmentation. We define a probability distribution over segmentations. Let $S = s_1 s_2 \ldots s_m$ segment the word $W = \text{BOS}\, w_1 w_2 \ldots w_n\, \text{EOS}$. We seek to find the optimal $S^*$. To do this, we decode a model using the Viterbi algorithm, where the individual segment probabilities are maximum a posteriori estimates under a Dirichlet prior ($\alpha = 0.01$). The model, inspired by Ge et al. (1999), is similar to other unigram segmentation models (Creutz and Lagus, 2005; Kudo, 2018). We then search for these affixes across the terms recorded in that language, to determine whether the affix is broadly derivational or specific to color terms. Select results are given in Table 3.
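The decoding step can be illustrated with a small Viterbi search over a unigram segment model. This is only a sketch: the segment probabilities are set by hand rather than estimated under the Dirichlet prior, the BOS and EOS markers are omitted, and the `max_len` cap and the English example are assumptions added for illustration.

```python
import math


def viterbi_segment(word, segment_logprob, max_len=10):
    """Return the highest-probability segmentation under a unigram segment model."""
    n = len(word)
    # best[i] holds (best log-probability of word[:i], start of its last segment).
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece not in segment_logprob or best[start][0] == -math.inf:
                continue
            score = best[start][0] + segment_logprob[piece]
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from known segments
    segments, end = [], n
    while end > 0:  # follow backpointers from the end of the word
        start = best[end][1]
        segments.append(word[start:end])
        end = start
    return segments[::-1]


# Hand-set probabilities for illustration only.
logp = {seg: math.log(p) for seg, p in {
    "green": 0.1, "ish": 0.2, "greenish": 0.001, "gree": 0.01, "nish": 0.01,
}.items()}
segments = viterbi_segment("greenish", logp)
print(segments)
```

With these toy values, the product of the probabilities for "green" and "ish" exceeds the probability of the unsegmented word, so the two-segment analysis wins.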
We also see derivational morphemes, which are applied to words from a given part-of-speech class to convert them to another class, e.g., "тут" in Archi (aqc) in Table 3, which is a fused morpheme denoting adjectivalization and marking Archi's fourth gender. This morpheme appears in Archi's terms for black and white. As with Nahuatl, this implies that Archi lacks basic color terms, according to a strict interpretation of the criteria.

Compound detection
To identify color names which are formed by compounding of words, we extend a model for compound discovery to identify color terms which were produced compositionally. This lets us ask two questions about the BCT definition: (1) Across languages, are there "basic" color terms that are not monolexemic? and (2) Are "basic" terms less likely to be compounds? The answer to both, we find, is "yes".
Wu and Yarowsky (2018) propose a multilingual compound analysis and generation method that only requires a readily available multilingual dictionary. They first extract potential compounds by splitting any word into three substrings corresponding to a left component, glue, and right component. These compounds are used to construct compound "recipes". For example, they discovered that the concept of 'hospital' is frequently represented across a variety of languages as a compound of 'sick'/'disease' and 'house'/'home' in their respective language. They use these recipes to score their initial list of potential compounds, filter out low-scoring, unlikely compounds, and perform a second pass of recipe construction, resulting in a higher-quality compound dataset. Compound analysis is performed in a similar manner as recipe construction. Compound generation takes into account language-specific knowledge of which components and glues are common in that language.
Wu and Yarowsky (2018) consider only single- or zero-character glue between the two components. By contrast, we allow glues of arbitrary length, exhaustively searching through all segmentations into three parts. This increases the algorithmic complexity by a factor of $\Theta(K)$, where $K$ is the length of the word. Searching through our PanLex and Wiktionary foreign translations of only our basic color terms, we find several examples of compounded words. Some examples are given in Table 4. The frequency of a color being expressed by compounding, though, turns out to be a weak indicator of basicness (γ = 0.35).
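The exhaustive three-way split with glue of arbitrary length can be sketched as below. This covers only candidate extraction, not recipe construction or scoring; the function name and the tiny lexicon are assumptions, and the German word is our own illustration of the 'sick' + 'house' pattern rather than an example taken from the paper's tables.

```python
def three_way_splits(word, lexicon):
    """Yield (left, glue, right) splits whose left and right parts are known words."""
    n = len(word)
    for i in range(1, n):          # i marks the end of the left component
        left = word[:i]
        if left not in lexicon:
            continue
        for j in range(i, n):      # j marks the start of the right component
            right = word[j:]
            if right in lexicon:
                # The glue word[i:j] may be empty or arbitrarily long.
                yield left, word[i:j], right


# Hypothetical lexicon; "krankenhaus" is German for 'hospital'.
splits = list(three_way_splits("krankenhaus", {"krank", "haus"}))
print(splits)
```

The nested loops enumerate a quadratic number of (i, j) pairs in the word length, whereas restricting the glue to at most one character would leave only a linear number.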

An aside on borrowings
One of B&K's lower-tier criteria for color terms was that borrowings were "suspect". Here we examine how often color terms are borrowed, as well as other avenues for color construction.
In addition to translations and definitions, Wiktionary provides etymologies for many languages. These relations have been parsed and extracted as EtymDB (Sagot, 2017).
We report the aggregation of EtymDB's parsed etymologies in Table 5; these are broken down by color in the supplementary material. Basic colors are more often created by borrowing, suffixing, and compounding than secondary colors; nevertheless, this should be taken with a grain of salt: The annotations for secondary colors are less complete, so while we use these scores, we do not take the criterion of borrowings as a prong of our criticism.

Salience
B&K assessed salience by a color's tendency to appear earlier when asking speakers to list their language's colors. More general assessments are outlined by Hays et al. (1972): word length, frequency of use, ethnographic frequency, and correlation of vocabulary size with cultural complexity. While the final one of these is beyond our scope, we present simple experiments to test the other three, based on our translation dataset.

Word length
Durbin (1972) supposed, based on Zipf's Law of Abbreviation (Zipf, 1932, 1949), that word length would decrease for more salient, broadly used color terms. We test this across over 2400 languages by computing the mean length of all translations for each English color word, regardless of script.
With the exception of grey, the first six colors of the B&K sequence (black, white, red, yellow, green, and blue) have lower average lengths than the subsequent five. This supports Durbin's two-phase theory of basic color terms. Still, beyond this handful, there is no clear separation between basic and non-basic colors. There is only a moderate correlation (γ = 0.41) with basicness.
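The mean translation length per English color word could be computed roughly as follows. The row format and function name are assumptions, the four example translations are ours, and counting Unicode code points with `len` is one simple reading of "regardless of script" that may differ from the paper's handling of combining characters.

```python
from collections import defaultdict
from statistics import mean


def mean_translation_length(rows):
    """Map each English color word to the mean length of its distinct translations."""
    translations = defaultdict(set)
    for english, language, foreign_word in rows:
        translations[english].add((language, foreign_word))
    # len() counts Unicode code points, whatever the script.
    return {english: mean(len(word) for _, word in pairs)
            for english, pairs in translations.items()}


# Small illustrative sample of (English word, language code, translation).
rows = [
    ("brown", "ita", "marrone"), ("brown", "cym", "brown"),
    ("red", "ita", "rosso"), ("red", "cym", "coch"),
]
lengths = mean_translation_length(rows)
print(lengths)
```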

Frequency: Usage and ethnography
Neither English corpora nor multilingual dictionaries give a complete picture of the world's languages; after all, only 3 to 4 thousand of them have writing systems (Lewis et al., 2015). To augment our analyses, we consider the grounded speaker elicitation performed in the World Color Survey (WCS; Cook et al., 2005). In it, 2568 speakers from 110 pre-industrialized societies gave their color naming judgments about colored chips on a stimulus palette (see Figure 2) which evenly varied hue and brightness. Field workers then transcribed the utterances and applied the B&K criteria to ascertain which colors were basic. With the WCS, we can discover the synchronic homogeneity of a term's use for the labeling task.
Figure 2: The stimulus palette used to collect color namings by Berlin and Kay (1969) and later the World Color Survey. Colors vary vertically in lightness and horizontally in hue. All are fully saturated. Bottom: The extensions of the six colors identified by one speaker of the Iduna language, spoken in Papua New Guinea.
Do all speakers agree that a given term should be used? We examine the consensus of use among words elicited in the World Color Survey. For each elicited color word in the language, we count the number of speakers who used it. Indeed, the distribution is varied. In Figure 3, we give a histogram of the total number of color terms elicited from each language, as well as the consensus for each individual color. We see a lexical core and periphery for most languages. Most color terms are unique, and languages may have up to 79 terms given among their speakers. It is natural for a speaker to use an unexpected word, still believing the color to fit in a more basic category (as with 'turquoise' and 'blue' in English), but it is surprising that these words would not be sifted out by field linguists' "denotational net".
Do all speakers have the same number of colors? We look at the variability of sizes of each speaker's inventory for a language, rather than the consensus on each term. The standard deviation of inventory sizes from the speakers of each language shows notable variation, especially when the mean inventory size exceeds 6. Heterogeneity in within-language color inventories is unsurprising; as Kay (1975) noted, younger members of numerous language communities possess a broader inventory of basic color terms, following Berlin and Kay's evolutionary sequence of inventories. Nevertheless, basic color terms are defined on a per-language level; they ignore this synchronic variation. This raises questions for future work: Are there categories salient only to young speakers, not the old? What would be a population's (or language's) basic color terms?
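Both questions reduce to simple counts over per-speaker naming data, as in the sketch below. The input layout (a mapping from speaker to the terms that speaker used), the function names, and the use of the population rather than the sample standard deviation are assumptions; the actual World Color Survey files are organized differently.

```python
from collections import Counter
from statistics import pstdev


def term_consensus(namings):
    """namings: {speaker: terms used}. Return the fraction of speakers using each term."""
    n_speakers = len(namings)
    counts = Counter(term for terms in namings.values() for term in set(terms))
    return {term: count / n_speakers for term, count in counts.items()}


def inventory_size_spread(namings):
    """Population standard deviation of per-speaker inventory sizes."""
    return pstdev([len(set(terms)) for terms in namings.values()])


# Invented namings for three speakers of one hypothetical language.
namings = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["a", "b", "c", "d"],
}
consensus = term_consensus(namings)
spread = inventory_size_spread(namings)
print(consensus, round(spread, 3))
```

Terms with consensus near 1 would form the lexical core, while terms used by a single speaker fall in the periphery.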

Aggregation of Features
We have operationalized some B&K criteria for basic color terms. Independently, each fails to match up with the known set of basic color terms. Now, we aggregate our independent scores to create a robust measure of basicness from the weak measures.
We do not cherry-pick our measures; some that we include actually harm the ordering. Instead, we operationalize each criterion, then see what shakes out. We take an unweighted average of normalized scores for the same reason. This makes the results more evocative; they come purely from our operationalization, rather than targeted tuning.
Using the aggregated measure produces the highest correlation with both basicness and the order of the B&K sequence; see Table 6. We recover the first six colors in the evolutionary sequence from Figure 1: white and black, red, green and yellow, and blue! An extended ordering is given in Table 7; it suggests that the most primary of the non-basic terms are gold, scarlet, crimson, and beige. Some further inquiry is possible. We see that orange is the 24th ranked color out of 91. This is a stark separation from the ten other basic color terms; beyond this, it has the highest concreteness of the basic terms.
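An unweighted average of normalized feature scores might look like the sketch below. The paper does not state which normalization it uses, so min-max scaling is an assumption here, as are the function names, the requirement that every feature be oriented so that higher means more basic, and the invented feature values.

```python
def min_max(values):
    """Rescale a {color: score} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


def aggregate(features):
    """features: {feature_name: {color: score}}, oriented so higher = more basic."""
    normalized = [min_max(scores) for scores in features.values()]
    # Only colors scored by every feature are aggregated in this sketch.
    colors = set.intersection(*(set(n) for n in normalized))
    return {c: sum(n[c] for n in normalized) / len(normalized) for c in colors}


# Invented feature values for illustration only.
features = {
    "abstractness": {"red": 3.0, "teal": 2.0, "salmon": 1.0},
    "shortness": {"red": 10.0, "teal": 4.0, "salmon": 2.0},
}
scores = aggregate(features)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Leaving every feature unweighted means no feature is tuned toward the known basic terms, at the cost of letting weak features dilute strong ones.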

Discussion
While the notion of a basic color term has been widely used, its validity has been taken as given. With NLP techniques in the broadest multilingual survey by far, we add to the literature investigating the definition of basic color terms. The ability to produce color templates shows that monomorphemicity is an unreasonable criterion. The concreteness of many color words' back-translations violates abstractness. Finally, the heterogeneity of color naming data contends with the salience requirement. None of the traditional criteria for basic color terms hold up robustly. Despite this, when taken in aggregate (and operationalized as we do), they suggest the traditional sequence of color terms and a coarse division between basic and non-basic colors.
As color terms are often decomposable, we can turn the decomposition on its head to generate missing color words. We have shown cross-lingual patterns of word formation that future work can exploit, giving plausible entries in a bilingual dictionary (Wu and Yarowsky, 2018): Without the word for "hospital", one can convey the concept by "sick"+"house"; likewise, without the word for "gray", one can use "ash"+DERIVATIONAL AFFIX. Future work will investigate generation and validation of unseen color terms.
Finally, given that the divide between basic and secondary color terms is so blurred, future computational models of these should employ models of graded membership, such as fuzzy set theory.

Conclusion
This paper has investigated the universal basic color term theories of Berlin and Kay (1969) and others. It provides empirically-grounded computational linguistic metrics with evidence from 2491 languages, harnessing multiple on-line resources of varying quality. We have shown that although the obligatory criteria do not in fact cleanly separate basic from non-basic colors, our features' aggregation correlates strongly with the Berlin and Kay basic/secondary color term partition (γ = 0.96). The aggregation also largely predicts the Berlin and Kay hypothesized universal acquisition sequence, which is in no way directly entailed by the basicness criteria. Thus, we provide further empirical evidence from computational linguistics in support of the B&K claims, while also providing additional nuance and perspective thereon.