Quantifying the Semantic Core of Gender Systems

Many of the world's languages employ grammatical gender on the lexeme. For example, in Spanish, the word for 'house' (casa) is feminine, whereas the word for 'paper' (papel) is masculine. To a speaker of a genderless language, this assignment seems to exist with neither rhyme nor reason. But is the assignment of inanimate nouns to grammatical genders truly arbitrary? We present the first large-scale investigation of the arbitrariness of noun-gender assignments. To that end, we use canonical correlation analysis to correlate the grammatical gender of inanimate nouns with an externally grounded definition of their lexical semantics. We find that 18 languages exhibit a significant correlation between grammatical gender and lexical semantics.


Introduction
In his semi-autobiographic work about his time traveling through Germany, A Tramp Abroad, Twain (1880) recounted his difficulty when learning the German gender system: "Every noun has a gender, and there is no sense or system in the distribution; so the gender of each must be learned separately and by heart. In German, a young lady has no sex, while a turnip has. Think what overwrought reverence that shows for the turnip, and what callous disrespect for the girl." Although this humorous take on German grammatical gender is clearly a caricature, the quote highlights the fact that the relationship between the grammatical gender of nouns and their lexical semantics is often quite opaque.
As arbitrary as certain noun-gender assignments may appear overall, a relatively clear relationship often exists between grammatical gender and lexical semantics for some of the lexicon. The portion of the lexicon where this relationship is clear usually consists of animate nouns; nouns referring to people morphologically reflect the sociocultural notion of "natural genders." This portion of the lexicon, the "semantic core," seems to be present in all gendered languages (Aksenov, 1984; Corbett, 1991). But how many inanimate nouns can also be included in the semantic core? Answering this question requires investigating whether there is a correlation between grammatical gender and lexical semantics for inanimate nouns.
Our primary technical contribution is demonstrating that grammatical gender and lexical semantics can be correlated using canonical correlation analysis (CCA), a standard method for computing the correlation between two multivariate random variables. We consider 18 gendered languages, following 3 steps for each: first, we encode each inanimate noun as a one-hot vector representing the noun's grammatical gender in that language; we then create 5 operationalizations of each noun's lexical semantics using word embeddings in 5 genderless "donor" languages (English, Japanese, Korean, Mandarin Chinese, and Turkish); finally, for each genderless language, we use CCA to compute the desired correlation between grammatical gender and lexical semantics. This process yields a single value for each of the 90 gendered-genderless language pairs, revealing a significant correlation between grammatical gender and lexical semantics for 55 of these language pairs. Secondarily, we investigate semantic similarities between the 18 languages' gender systems, i.e., their assignments of nouns to grammatical genders. We analyze the projections of lexical semantics (operationalized as word embeddings in English) obtained via CCA, finding that phylogenetically similar languages have more similar projections.

Grammatical Gender
Languages range from employing no grammatical gender on inanimate nouns, like English, Japanese, Korean, Mandarin Chinese, and Turkish, to drawing grammatical distinctions between tens of gender-like classes (Corbett, 1991). Although there are many theories about the assignment of inanimate nouns to grammatical genders, to the best of our knowledge, the linguistics literature lacks any large-scale, quantitative investigation of the arbitrariness of noun-gender assignments. However, with the advent of modern NLP methods, particularly with advancements in distributional approaches to semantics (Harris, 1954; Firth, 1957), and with the copious amounts of text available on the internet, it is now possible to conduct such an investigation. We focus on languages that have either two (masculine-feminine) or three (masculine-feminine-neuter) genders, to which nouns are exhaustively assigned, and investigate whether a correlation exists between grammatical gender and lexical semantics for inanimate nouns, i.e., whether noun-gender assignments are arbitrary or not.
In many languages, a noun's grammatical gender can be predicted from its spelling and pronunciation (Cucerzan and Yarowsky, 2003; Nastase and Popescu, 2009). For example, almost all Spanish nouns ending in -a are feminine, whereas Spanish nouns ending in -o are usually masculine. These assignments are non-arbitrary; indeed, Corbett (1991, Ch. 4) provides a thorough typological description of how phonology pervades gender systems. We emphasize that these assignments are not the subject of our investigation. Rather, we are concerned with the relationship between grammatical gender and lexical semantics, i.e., when asking why the Spanish word casa is feminine, we do not consider that it ends in -a.
Finally, our investigation is related to that of Kann and Wolf-Sonkin, which assumes that noun-gender assignments are non-arbitrary and examines the predictability of grammatical gender from lemmatized word embeddings; in contrast, we investigate the arbitrariness of noun-gender assignments.

Lexical Semantics via Word Embeddings
The NLP community has widely adopted word embeddings as a way of representing lexical semantics. The underlying motivation behind this adoption is the observation that words with similar meanings will be embedded as vectors that are closer together. As we explain in §3, our investigation requires a definition of lexical semantics that is independent of grammatical gender. However, in many gendered languages, word embeddings effectively encode grammatical gender because this information is trivially recoverable from distributional semantics. For example, in Spanish, singular masculine nouns tend to occur after the article el, whereas singular feminine nouns tend to occur after the article la.
For this reason, we use an externally grounded definition of lexical semantics: we create 5 operationalizations of each noun's lexical semantics using word embeddings in 5 genderless "donor" languages (English, Japanese, Korean, Mandarin Chinese, and Turkish). We use 5 languages that are phylogenetically distinct and spoken in distinct regions to minimize any spurious correlations.
Our investigation is based on the linguistic assumption that word embeddings in a genderless "donor" language are a good proxy for genderless lexical semantics. In practice, however, this assumption is generally false: word embeddings are largely a reflection of the text with which they were trained. For example, the embedding of the word snow will differ depending on whether the training text was written by people near the equator or people near the North Pole, even if both groups speak the same language. Such differences will be more pronounced for rare words, which are arguably more language- and culture-specific than many common words. For this reason, we limit the scope of our investigation to only those inanimate nouns that are likely to be used consistently across different languages. To implement this limitation, we use a Swadesh list (Buck, 1949; Swadesh, 1950, 1952, 1955, 1971/2006), a list of words constructed to contain only very frequent words that are as close to culturally neutral as possible. By limiting the scope of our investigation to only those inanimate nouns that appear in a Swadesh list, we can be reasonably confident that their word embeddings in English, Japanese, Korean, [...] (Bojanowski et al., 2017; Grave et al., 2018). 4 For each gendered-genderless language pair, we limit the scope of our investigation to only those inanimate nouns that occur in both our Swadesh list and in FASTTEXT; we provide the resulting counts in Table 1.
Finally, we randomly partition the set of nouns for each language pair into a 75%-25% training-testing split.
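The 75%-25% partition can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `train_test_split_indices`, the fixed seed, and the rounding rule are all assumptions introduced here.

```python
import random


def train_test_split_indices(n_nouns, train_fraction=0.75, seed=0):
    """Randomly partition noun indices 0..n_nouns-1 into train and test sets.

    The seed and the rounding of the cut point are illustrative assumptions.
    """
    indices = list(range(n_nouns))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    cut = int(round(train_fraction * n_nouns))
    return sorted(indices[:cut]), sorted(indices[cut:])


train_idx, test_idx = train_test_split_indices(100)
```

The two index lists can then be used to select the matching columns of the gender and embedding matrices for a given language pair.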

Notation
We first establish the requisite notation. Let $\mathcal{V}_{\ell,m} = \{1, \ldots, V_{\ell,m}\}$ denote a set of integers representing the inanimate nouns for gendered language $\ell$ and genderless language $m$. Let $\mathcal{G}_\ell$ denote the (arbitrarily ordered) genders in language $\ell$; for example, let $\mathcal{G}_{\text{spanish}} = (\text{MSC}, \text{FEM})$. Given an inanimate noun $n \in \mathcal{V}_{\ell,m}$, let $g_\ell(n)$ denote a one-hot vector representing $n$'s grammatical gender in language $\ell$, so that the $i$th entry corresponds to the $i$th gender in $\mathcal{G}_\ell$. Similarly, let $e_m(n) \in \mathbb{R}^{50}$ denote the 50-dimensional word embedding representing the lexical semantics of $n$ in language $m$. Let $G_\ell \in \mathbb{R}^{|\mathcal{G}_\ell| \times V_{\ell,m}}$ collectively denote the inanimate nouns' grammatical genders in language $\ell$, so that the $n$th column is $g_\ell(n)$, and let $E_m \in \mathbb{R}^{50 \times V_{\ell,m}}$ collectively denote the inanimate nouns' lexical semantics, so that the $n$th column is $e_m(n)$. Finally, let $G_\ell^{\text{train}}$ and $E_m^{\text{train}}$ respectively denote the columns of $G_\ell$ and $E_m$ that correspond to the inanimate nouns in the training set, and let $G_\ell^{\text{test}}$ and $E_m^{\text{test}}$ respectively denote the columns of $G_\ell$ and $E_m$ that correspond to the inanimate nouns in the testing set.

2 http://compling.hss.ntu.edu.sg/omw/summx.html
3 This imbalance is a limitation of our investigation.
4 The FASTTEXT word embeddings were trained using Common Crawl and Wikipedia data, using CBOW with position weights, with character n-grams of length 5. For more information, see http://fasttext.cc/docs/en/crawl-vectors.html.
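To make the notation concrete, the one-hot gender matrix can be built as in the sketch below. The helper `one_hot_gender_matrix`, the constant `GENDERS_SPANISH`, and the three-noun toy input are hypothetical names and data introduced for illustration; NumPy is assumed to be available.

```python
import numpy as np

# Arbitrarily ordered genders, mirroring the Spanish example in the text.
GENDERS_SPANISH = ("MSC", "FEM")


def one_hot_gender_matrix(noun_genders, genders):
    """Return a |genders| x N matrix whose n-th column is the one-hot
    gender vector of the n-th noun."""
    row_of = {gender: i for i, gender in enumerate(genders)}
    G = np.zeros((len(genders), len(noun_genders)))
    for n, gender in enumerate(noun_genders):
        G[row_of[gender], n] = 1.0
    return G


# Toy input: the genders of three nouns, listed in noun order.
G = one_hot_gender_matrix(["FEM", "MSC", "FEM"], GENDERS_SPANISH)
```

Each column of the result contains exactly one nonzero entry, so the matrix has the shape described above: one row per gender and one column per noun.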

Canonical Correlation Analysis
CCA is a standard method for computing the correlation between two multivariate random variables. In our investigation, we are interested in the correlation between grammatical gender and lexical semantics for each gendered-genderless language pair. To compute this correlation, we start by solving the following optimization problem:

$$(a^\star, b^\star) = \operatorname*{argmax}_{a \in \mathbb{R}^{|\mathcal{G}_\ell|},\; b \in \mathbb{R}^{50}} \operatorname{corr}\left(a^\top G_\ell^{\text{train}},\; b^\top E_m^{\text{train}}\right)$$

Although this optimization problem is non-convex, it can be solved in closed form using singular value decomposition (SVD). We use a standard implementation of CCA (Pedregosa et al., 2011).
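The closed-form SVD solution can be sketched in NumPy as follows. This is an illustrative reimplementation under stated assumptions, not the library implementation cited above: the function names are hypothetical, matrices store one noun per column, and null directions are dropped because a centered one-hot matrix is rank-deficient.

```python
import numpy as np


def _orthonormal_basis(X, tol=1e-10):
    """Center X (samples x features) and return its thin SVD factors,
    keeping only directions with non-negligible singular values."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s.max()
    return U[:, keep], s[keep], Vt[keep].T


def first_canonical_pair(G, E):
    """G: (#genders x N) one-hot matrix; E: (d x N) embedding matrix.
    Returns projections a, b and the first canonical correlation."""
    Ug, sg, Vg = _orthonormal_basis(G.T)
    Ue, se, Ve = _orthonormal_basis(E.T)
    # Singular values of Ug^T Ue are the canonical correlations.
    P, rho, Qt = np.linalg.svd(Ug.T @ Ue, full_matrices=False)
    a = Vg @ (P[:, 0] / sg)  # map back from whitened to original coordinates
    b = Ve @ (Qt[0] / se)
    return a, b, rho[0]


def projected_correlation(a, b, G, E):
    """Pearson correlation between the projected genders and embeddings."""
    return np.corrcoef(a @ G, b @ E)[0, 1]
```

On the data used to fit the projections, `projected_correlation` reproduces the first canonical correlation; applying the same projections to held-out columns gives an out-of-sample estimate.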
Having found the projections $a^\star \in \mathbb{R}^{|\mathcal{G}_\ell|}$ and $b^\star \in \mathbb{R}^{50}$ that maximize the correlation, we then use them to compute the correlation between grammatical gender and lexical semantics as follows:

$$\rho_{\ell,m} = \operatorname{corr}\left({a^\star}^\top G_\ell^{\text{test}},\; {b^\star}^\top E_m^{\text{test}}\right)$$

To establish statistical significance, we follow the approach of Monteiro et al. (2016). We create $B = 100{,}000$ permutations of the columns of $G_\ell^{\text{train}}$; for each permutation $b$, we then repeat the steps above to obtain $\rho_{\ell,m}^{(b)}$; finally, we compute the permutation-test p-value $p$ by comparing $\rho_{\ell,m}$ with the permuted values $\rho_{\ell,m}^{(b)}$. Because our investigation involves testing 90 different hypotheses, we use Bonferroni correction (Dror et al., 2017), i.e., we multiply $p$ by 90. If the resulting Bonferroni-corrected p-value is small, then we can reject the null hypothesis that there is no correlation between grammatical gender and lexical semantics for that language pair.
Secondarily, we investigate semantic similarities between the 18 languages' gender systems by analyzing their projections of lexical semantics. For each pair of gendered languages $\ell$ and $\ell'$, we compute the correlation (cosine distance) between $b_\ell$ and $b_{\ell'}$ for each of the 5 genderless languages.
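A permutation test with Bonferroni correction can be sketched as below. Several details are assumptions rather than the procedure of the cited work: the p-value uses the common add-one estimator, `statistic` is a hypothetical callable standing in for the fit-then-evaluate steps described above, and the default number of permutations is reduced for illustration.

```python
import numpy as np


def permutation_p_value(statistic, G, E, n_permutations=1000, seed=0):
    """Estimate a p-value by permuting the columns of G.

    `statistic(G, E)` is a user-supplied callable returning a correlation.
    The add-one estimator below is an assumption, chosen so that the
    estimate is never exactly zero.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(G, E)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(G.shape[1])  # shuffle noun-gender pairing
        if statistic(G[:, perm], E) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_permutations)


def bonferroni(p, n_tests):
    """Multiply p by the number of hypotheses tested, capped at 1."""
    return min(1.0, p * n_tests)
```

With 90 language pairs, a raw p-value must fall below roughly 0.05 / 90 for the corrected value to stay under 0.05, which is why a large number of permutations is needed.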

Results
We find a significant correlation between grammatical gender and lexical semantics (i.e., the Bonferroni-corrected p-value is less than 0.05) for 55 of the 90 gendered-genderless language pairs. These results are depicted in Figure 1. For Slovak, Croatian, and Ukrainian, we find no correlation for any of the genderless languages; for Slovenian, we find a significant correlation for only Mandarin Chinese. We suspect that these results are due to the relatively small number of inanimate nouns considered for each of these language pairs (see Table 1 for the counts). We also find slightly different patterns of correlation for the different genderless languages that we use to create our 5 operationalizations of lexical semantics. For Japanese, we find significant correlations for 13 of the 18 gendered languages; for English and Chinese, we find significant correlations for 12; for Korean and Turkish, we find significant correlations for 9 of the gendered languages.
For each pair of gendered languages $\ell$ and $\ell'$, Figure 2 depicts the correlation (cosine distance) between $b_\ell$ and $b_{\ell'}$ for English. We find higher correlations for pairs of languages that are phylogenetically similar. For example, French has higher correlations with Spanish and Italian than with Polish. This is likely because phylogenetically similar languages exhibit historical similarities in their gender systems as a result of a common linguistic origin (Fodor, 1959; Ibrahim, 2014; Stump, 2015).
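The pairwise comparison of projections can be sketched as follows. The sketch computes cosine similarity, on the assumption that this is the quantity for which larger values indicate more similar projections; the function names and the dictionary layout are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two projection vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pairwise_similarities(projections):
    """projections: dict mapping a language name to its projection vector.
    Returns a dict keyed by (language, language) pairs."""
    names = sorted(projections)
    return {
        (x, y): cosine_similarity(projections[x], projections[y])
        for x in names
        for y in names
    }
```

Because canonical projections are only determined up to sign, a practical comparison may need to fix a sign convention before interpreting negative values.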

Conclusion
Our investigation is the first to quantitatively demonstrate that there is a significant correlation between grammatical gender and lexical semantics for inanimate nouns. Although our results provide evidence for the non-arbitrariness of noun-gender assignments, they must be contextualized. Unlike for animate nouns, it is not clear that a single cross-linguistic category explains our results. Moreover, we limit the scope of our investigation to frequent inanimate nouns. These nouns tend to be distributed across genders, whereas less frequent inanimate nouns tend to be assigned to a single gender (Dye et al., 2015). We leave the investigation of less frequent inanimate nouns for future work.