Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages' gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945) that enable us to craft novel information-theoretic metrics for measuring similarity between gender systems. We first validate our metrics, then use them to measure gender system similarity in 20 languages. Finally, we ask whether our gender system similarities alone are sufficient to reconstruct historical relationships between languages. Towards this end, we make phylogenetic predictions on the popular, but thorny, problem from historical linguistics of inducing a phylogenetic tree over extant Indo-European languages. Languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.


Introduction
As many as half the world's languages carve nouns up into classes (Corbett, 2013). In these languages, nouns are subdivided into gender categories, which together comprise the language's grammatical gender system. A gender system tends to use a small, fixed number of categories with fixed usage across speakers. Such categories, like 'feminine', can be defined extensionally, and are reflected by agreement with other words within the noun phrase (i.e., concord). Gender exhaustively divides up the language's nouns; that is, the union of gender categories is the entire nominal lexicon. Taken this way, a gender system can be viewed as a partition of the lexicon into communities of same-gendered nouns. Given this, a lexical typologist might naturally wish to ask: how similar are two languages' gender systems?
Figure 1: Two gender systems partitioning N = 6 concepts. German (a) has three communities: Obst (fruit) and Gras (grass) are neuter, Mond (moon) and Baum (tree) are masculine, Blume (flower) and Sonne (sun) are feminine. Spanish (b) has two communities: fruta (fruit), luna (moon), and flor (flower) are feminine, and césped (grass), árbol (tree), and sol (sun) are masculine.
Using modern statistical and information-theoretic tools from the community detection literature, we offer the first cluster evaluation (Jardine et al., 1971) perspective on grammatical gender, and quantify the overlap of gender systems. We can compare the pairwise overlap of partitions of gender systems using a rich literature of measures, such as mutual information and several variants (Meilă, 2003; Vinh et al., 2010; McCarthy et al., 2019a), which we survey and contrast. Individual partitions of lexicons can also be framed as members of distributions over partitions: for instance, the distribution consisting of all partitions of N items, or of all partitions of N items into K gender clusters, as in Figure 1. For example, Spanish is bi-gendered (with masculine and feminine): a lexicon of Spanish nouns (N = 1000) and their genders would come from a distribution over partitions of N = 1000 items into K = 2 clusters.
The same lexicon translated into German, a tri-gendered language, would come from a distribution of N = 1000 items partitioned into K = 3 clusters. Indeed, the fact that languages need different numbers of gender clusters makes this problem non-trivial. Given these distributions, we can compare the observed similarity to what we would expect for the same lexica if nouns were randomly supplied with gender specifications. That way, we can distinguish meaningful relationships from noise.
Armed with the first way to quantify community-wise similarity of gender systems, we ask: Do gender system similarities reflect linguistic phylogeny, or something else, like areal effects? Across 20 languages, we find that our pairwise overlap results measurably align with standard pairwise phylogenetic relationships. Zooming in on Indo-European, we find that we can recast pairwise similarities into an accurate phylogenetic tree, simply by measuring distance between gender systems and performing hierarchical agglomerative clustering (see §6.2).
The primary contribution of this work is a novel metric for lexical typology that measures the pairwise similarity of gender systems. We operationalize gender systems as partitions over a shared set of nouns (§3). We design and evaluate our measurements of gender system similarity under this formulation (§4), drawing on insights from community detection. Then we recover robust phylogenetic relationships between pairs of gender systems by applying these to 20 gendered languages (§6) and find that similarity between Slavic and Romance gender systems does not exceed chance levels. Finally, we show that our quantification of gender system similarity allows us to construct phylogenetic trees that closely resemble those posited for Indo-European in historical linguistics (e.g., Pagel et al. 2000; Gray and Atkinson 2003; Serva and Petroni 2008).

Background: Grammatical Gender
Grammatical gender is a highly fixed classification system for nouns. Native speakers rarely make errors in gender recall, which might tentatively argue against tremendous arbitrary variation (Corbett, 1991). Some regularity can surely be found in the associations between gender and various features of the noun, such as orthographic or phonological form, or semantics. With respect to form-based regularities, Cucerzan and Yarowsky (2003a) devise a system for inferring noun gender (masculine or feminine) from contextual clues and character representations, even in inflected forms of the noun. Nastase and Popescu (2009) also find that phonological form can lead to predictability of gender in two three-gender systems. With respect to word semantics, prior work quantifies the relationship between the gender of inanimate nouns and their distributional word vectors.
We can't rely on form. Using phonological or orthographic form to derive gender is fraught with complications. Particular to our study, epicene nouns (i.e., words that can appear in multiple genders) can pose issues. In German, only gender concord on the definite article and adjectives can disambiguate the gender of some nouns; the same wordform Band means "volume" when masculine, "ribbon" when neuter, and "band, musical group" when feminine. Another complication with determining gender from the phonological or orthographic form of the noun is that correspondences between form and gender are rarely absolute. For example, even though nouns ending in -e are usually 'feminine' in German, this is not universally the case: Affe and Löwe, among others, are masculine. To sidestep these complications, we abstract away from particular word forms and observe the objective consequences of gender over sets of cross-lingual concepts (i.e., indices, not word forms), and compare those across gender systems (see Figure 1).
Which gender systems are likely to be similar? Several accounts highlight similarities between the gender systems of phylogenetically related languages (Fodor, 1959; Ibrahim, 2014) and argue that they are likely to be at least partially due to historical relations between communities and socio-political factors governing language use. Given this, can we recover phylogenetic similarities across gender systems using our methods? If so, this should provide validation that we are indeed measuring at least some of the genuine similarity that exists between gender systems.

Gender Systems as Partitions
Any concept can be related to its referents either intensionally or extensionally. While linguistic research has historically sought to uncover the rules for associating a noun with gender in terms of surface features or semantics (see Corbett 1991 for an overview), we take an extensional approach. That is, we treat a gender category in a language solely as the set of words it covers. This maps directly to the notion of a community in the network science task of community detection: A community is defined by membership, not by other arbitrary properties, just as a gender here is defined by the union of all nouns it subsumes, not by its phonological realization or contributions to semantics. The disjoint set of communities forms a partition of the set of nouns: Each noun is a member of one and only one cluster.
Although some epicene nouns are present in our investigated languages (see §2), these are very rare. We thus make the simplifying modeling assumption of identifying each word with only a single gender (in our case, the most frequent). This assumption is necessary for our reduction of gender system comparison to clustering evaluation. Without it, words like German der/die/das Band would force us to consider overlapping or "fuzzy" partitions, an intriguing option that we leave for future work.
Notation. A language's gender system is a partition, named in sans serif (e.g., $\mathsf{A}$). A gender system $\mathsf{A}$ has $K$ components called gender classes (i.e., communities, e.g., $\{\mathsf{A}_{\mathrm{MSC}}, \mathsf{A}_{\mathrm{FEM}}, \ldots\}$); these are in turn sets whose members are items drawn from a finite base set $A \subseteq L$, where $A$ is a sublexicon selected from the full lexicon $L$. In our case, $A$ holds all inanimate concepts in our data (see §5). We use $\Omega$ to name the set of all partitions of $N = |A|$ items (in our case, inanimate nouns) into $K$ communities. When comparing two languages' respective gender systems, we will use the letters $\mathsf{A}$ and $\mathsf{B}$.
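As an illustration of this extensional view, the following minimal Python sketch builds the two partitions of Figure 1 from per-concept gender labels and checks that the resulting communities are disjoint and exhaustive. The function names, the label strings, and the dictionary layout are our own illustrative choices, not part of any released code.

```python
from collections import defaultdict

# Toy data from Figure 1: concept -> gender label (label strings are illustrative).
GERMAN = {"fruit": "NEUT", "grass": "NEUT", "moon": "MASC",
          "tree": "MASC", "flower": "FEM", "sun": "FEM"}
SPANISH = {"fruit": "FEM", "moon": "FEM", "flower": "FEM",
           "grass": "MASC", "tree": "MASC", "sun": "MASC"}


def to_partition(labels):
    """Group concepts by gender label: {label: frozenset of concepts}."""
    communities = defaultdict(set)
    for concept, gender in labels.items():
        communities[gender].add(concept)
    return {g: frozenset(members) for g, members in communities.items()}


def is_partition(communities, base_set):
    """True if communities are non-empty, pairwise disjoint, and cover base_set."""
    blocks = list(communities.values())
    union = set().union(*blocks)
    no_overlap = sum(len(b) for b in blocks) == len(union)
    return all(blocks) and no_overlap and union == set(base_set)


german = to_partition(GERMAN)    # K = 3 communities
spanish = to_partition(SPANISH)  # K = 2 communities
print(len(german), len(spanish), is_partition(german, GERMAN))
```

Because a community is identified only by its members, renaming "FEM" to any other string leaves the partition unchanged.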

Comparing Partitions
A partition groups items into a set of disjoint categories. We could compare any two gender systems (i.e., partitions) which organize the same nouns by determining how similar their gender labelings are. A first pass at quantifying the similarity of two gender partitions would be to measure simple overlap. We could ask: What fraction of A agrees in gender across languages? That is, for each noun in our multilingual vocabulary, do both languages lexicalize it with the same gender? This is an easily interpretable, accuracy-like measure, bounded by 0 and 1. Still, it has no capacity for comparing systems with different numbers of categories; the measure would be handicapped when comparing two-gender systems to three-gender ones.
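The simple overlap measure can be sketched in a few lines of Python. The function name and the label strings below are hypothetical, and the data are the six concepts of Figure 1. The sketch also makes the limitation concrete: German neuter nouns can never agree with a two-gender system, however the labels are aligned.

```python
def label_overlap(labels_a, labels_b):
    """Fraction of shared concepts that carry the same gender label."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("the two systems share no concepts")
    agree = sum(labels_a[c] == labels_b[c] for c in shared)
    return agree / len(shared)


# Toy data from Figure 1 (label strings are illustrative).
german = {"fruit": "NEUT", "grass": "NEUT", "moon": "MASC",
          "tree": "MASC", "flower": "FEM", "sun": "FEM"}
spanish = {"fruit": "FEM", "moon": "FEM", "flower": "FEM",
           "grass": "MASC", "tree": "MASC", "sun": "MASC"}

# Only "tree" (masculine) and "flower" (feminine) agree: 2 of 6 concepts.
print(label_overlap(german, spanish))
```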
Comparing systems with different numbers of categories, though, is a well-known problem in the field of community detection. While this looks insurmountable from the gender perspective, where gender categories refer to something we recognize, in community detection the labels themselves are meaningless: there is no notion of a so-called "Cluster 2". The field has circumvented issues arising from comparing systems differing in number of categories by introducing information-theoretic measures to compare partitions. Cluster evaluation functions in community detection are, by and large, based on information-theoretic concepts. We define a gender system $\mathsf{A}$'s entropy as:
$$H(\mathsf{A}) = -\sum_{k=1}^{K} \frac{|\mathsf{A}_k|}{N} \log \frac{|\mathsf{A}_k|}{N} \qquad (1)$$
where we observe the standard convention that $0 \log 0 \overset{\mathrm{def}}{=} 0$. How is this notion of entropy for partitions related to the entropy of a probability distribution? These are connected through maximum-likelihood estimation (MLE). In our case, the maximum-likelihood estimate of the probability that an inanimate noun $a$ is located in a given community turns out to be the size of that community divided by $N$; e.g., we have $p_{\mathrm{MLE}}(\mathrm{MSC}) = |\mathsf{A}_{\mathrm{MSC}}| / N$. Recall that the Shannon entropy of a distribution $p$ is defined as
$$H(p) = -\sum_{x} p(x) \log p(x) \qquad (2)$$
We have equality between Eq. 1 and Eq. 2 when we plug the definition of $p_{\mathrm{MLE}}$ into Eq. 2, which is why Eq. 1 is considered the entropy of a partition.
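The entropy of a partition can be computed directly from community sizes. The sketch below is a minimal illustration, assuming a partition is given as one label per item and that logarithms are taken base 2 (bits); both are choices made for this example.

```python
import math
from collections import Counter


def partition_entropy(labels):
    """Entropy (bits) of a partition given as one community label per item."""
    n = len(labels)
    sizes = Counter(labels).values()  # |A_k| for each non-empty community
    # Empty communities never appear in the Counter, so the 0 log 0
    # convention is never exercised here.
    return -sum((size / n) * math.log2(size / n) for size in sizes)


# Figure 1, concepts ordered: fruit, grass, moon, tree, flower, sun.
german = ["N", "N", "M", "M", "F", "F"]   # three communities of size 2
spanish = ["F", "M", "F", "M", "F", "M"]  # two communities of size 3

print(partition_entropy(german))   # log2(3) bits
print(partition_entropy(spanish))  # 1 bit
```

Each term `size / n` is exactly the maximum-likelihood estimate of a community's probability, which is why this quantity coincides with the Shannon entropy of that estimated distribution.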

Mutual information (MI)
Mutual information is a workhorse for quantifying similarity between two probability distributions, measuring how much information (in bits) is shared between two random variables. Now we consider the case of the similarity between two partitions. If we have two partitions $\mathsf{A}$ and $\mathsf{B}$, we may generalize the entropy of a single partition to the mutual information between two partitions as follows:
$$\mathrm{MI}(\mathsf{A}, \mathsf{B}) = \sum_{i=1}^{K_A} \sum_{j=1}^{K_B} \frac{|\mathsf{A}_i \cap \mathsf{B}_j|}{N} \log \frac{N \, |\mathsf{A}_i \cap \mathsf{B}_j|}{|\mathsf{A}_i| \, |\mathsf{B}_j|} = \sum_{i=1}^{K_A} \sum_{j=1}^{K_B} p_{\mathrm{MLE}}(i, j) \log \frac{p_{\mathrm{MLE}}(i, j)}{p_{\mathrm{MLE}}(i) \, p_{\mathrm{MLE}}(j)} \qquad (3)$$
As the equality above shows, we find, again, that Eq. 3 has an interpretation as the standard definition of probabilistic mutual information applied to the maximum-likelihood estimate of the joint partition membership distribution. To foreshadow future discussion, we note the mutual information between any two clusterings on N items is bounded below by 0 and above by log N. Beyond its interpretation as shared information, mutual information gives little in terms of interpretability: It has no consistent reference points, beyond that the minimum possible MI is zero. Therefore, several variants of MI are preferred in community detection.
Normalization. Furthermore, MI is often normalized to increase its interpretability, as:
$$\mathrm{NMI}(\mathsf{A}, \mathsf{B}) = \frac{\mathrm{MI}(\mathsf{A}, \mathsf{B})}{\sqrt{H(\mathsf{A}) \, H(\mathsf{B})}} \qquad (4)$$
While our denominator is the geometric mean, any generalized mean of the partitions' entropies can be used as a bound to normalize MI (Yang et al., 2016). As we divide bits by bits (or nats by nats), normalized mutual information (NMI) is unitless, unlike entropy and MI. It expresses the amount of revealed information as a percentage. Unfortunately, NMI has both theoretical and empirical flaws (Peel et al., 2017; McCarthy, 2017; McCarthy et al., 2019b); namely, it suffers from the finite-size effect: the baseline rises as N increases. (Recall that MI is bounded above by log N.) The reward for guessing even the trivial partition into singleton clusters rises with it, making the measure (like vanilla mutual information, as in Eq. 3) difficult to interpret. Because of these flaws, we exclude NMI in favor of the following MI-based measures, which are both more interpretable and more pertinent.
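A minimal Python sketch of mutual information between two partitions follows, assuming each partition is given as a list of labels aligned by concept and that logarithms are base 2. The function name is illustrative. On the six concepts of Figure 1, every German gender splits evenly across the two Spanish genders, so the two toy systems share no information.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (bits) between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same items")
    n = len(labels_a)
    size_a = Counter(labels_a)               # |A_i|
    size_b = Counter(labels_b)               # |B_j|
    joint = Counter(zip(labels_a, labels_b)) # |A_i ∩ B_j|, non-empty cells only
    return sum(
        (n_ij / n) * math.log2(n * n_ij / (size_a[i] * size_b[j]))
        for (i, j), n_ij in joint.items()
    )


# Figure 1, concepts ordered: fruit, grass, moon, tree, flower, sun.
german = ["N", "N", "M", "M", "F", "F"]
spanish = ["F", "M", "F", "M", "F", "M"]

print(mutual_information(german, spanish))  # 0 bits: the toy systems are independent
print(mutual_information(german, german))   # equals the entropy of German, log2(3)
```

Only the co-occurrence counts matter, so the label strings of the two languages never need to be aligned with one another.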

Adjusted mutual information (AMI)
Spurious correlations between two gender systems can mislead the results, showing a higher-than-deserved agreement. We select a measure which adjusts for these chance clusterings: the adjusted mutual information (AMI):
$$\mathrm{AMI}(\mathsf{A}, \mathsf{B}) = \frac{\mathrm{MI}(\mathsf{A}, \mathsf{B}) - \mathbb{E}\left[\mathrm{MI}(\mathsf{A}', \mathsf{B}')\right]}{\max \mathrm{MI}(\mathsf{A}', \mathsf{B}') - \mathbb{E}\left[\mathrm{MI}(\mathsf{A}', \mathsf{B}')\right]} \qquad (5)$$
where the expectation is taken under the uniform distribution over $\Omega$, all clusterings on $N$ items with $K_A$ and $K_B$ clusters (Gates and Ahn, 2017). The maximum is also taken over $\Omega$. This distinguishes it from the textbook form of AMI, where the expectation is over a subset of $\Omega$: only those partitions whose community sizes match those of the arguments. As we have subtracted the mean, the expected numerator is centered at 0; the denominator serves to re-normalize the measure. The measure thus compares the mutual information for the observed pair of gender systems to all others within their family. Using AMI also lends some beneficial properties in cluster evaluation:
Remark 1. AMI has a fixed maximum score 1.0 for exactly matching gender systems.
Remark 2. The mathematical expectation of AMI is 0 so spurious correlations are not rewarded.
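The chance baseline that AMI subtracts can be approximated by simulation. The sketch below is a Monte Carlo estimate, not the exact computation: it assumes that Ω contains partitions into exactly K non-empty clusters, samples them uniformly by rejection, and averages MI over random pairs. All function names, the sample count, and the seed are illustrative assumptions. It estimates only the expectation term, not the maximum used in the denominator.

```python
import math
import random
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (bits) between two partitions of the same items."""
    n = len(labels_a)
    size_a, size_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum((n_ij / n) * math.log2(n * n_ij / (size_a[i] * size_b[j]))
               for (i, j), n_ij in joint.items())


def random_partition(n, k, rng):
    """Uniform sample over partitions of n items into exactly k clusters.

    Uniform labelings that use all k labels are kept; every such partition
    corresponds to the same number (k!) of labelings, so the draw is uniform.
    """
    while True:
        labels = [rng.randrange(k) for _ in range(n)]
        if len(set(labels)) == k:
            return labels


def expected_mi(n, k_a, k_b, samples=2000, seed=0):
    """Monte Carlo estimate of E[MI] under the uniform model (an approximation)."""
    rng = random.Random(seed)
    total = sum(
        mutual_information(random_partition(n, k_a, rng),
                           random_partition(n, k_b, rng))
        for _ in range(samples)
    )
    return total / samples


baseline = expected_mi(n=6, k_a=3, k_b=2)
observed = mutual_information(["N", "N", "M", "M", "F", "F"],
                              ["F", "M", "F", "M", "F", "M"])
chance_gap = observed - baseline  # the numerator of the adjusted measure
print(baseline, chance_gap)
```

For small N the baseline is noticeably above zero, which is the finite-size effect that motivates adjusting for chance.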

Variation of Information (VI)
Unlike MI and AMI, Variation of Information (Meilă, 2003) is a distance (metric), meaning each language becomes a point in a metric space whose underlying set is all possible partitions of N items. VI is useful because it satisfies the triangle inequality (Meilă, 2007). Additionally, as a metric, it guarantees identity of indiscernibles: if two partitions are at a distance 0, then they are identical. VI is defined as
$$\mathrm{VI}(\mathsf{A}, \mathsf{B}) = H(\mathsf{A} \mid \mathsf{B}) + H(\mathsf{B} \mid \mathsf{A}) \qquad (6)$$
and is the summation of two conditional entropies. It can also be normalized by dividing by the joint entropy, $H(\mathsf{A}, \mathsf{B})$. (This measure would be topologically equivalent to Eq. 6.) We do not adjust VI for chance, as doing so would deprive it of its metric property because of the subtraction in the numerator.
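Variation of Information can be sketched as the sum of the two conditional entropies, computed from the same co-occurrence counts as MI. As before, the function name, the label-list representation, and the choice of base-2 logarithms are assumptions made for illustration.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI (bits) = H(A|B) + H(B|A) for two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same items")
    n = len(labels_a)
    size_a, size_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (i, j), n_ij in joint.items():
        p_ij = n_ij / n
        # -p(i,j) log p(j|i) contributes to H(B|A);
        # -p(i,j) log p(i|j) contributes to H(A|B).
        vi -= p_ij * (math.log2(n_ij / size_a[i]) + math.log2(n_ij / size_b[j]))
    return vi


# Figure 1, concepts ordered: fruit, grass, moon, tree, flower, sun.
german = ["N", "N", "M", "M", "F", "F"]
spanish = ["F", "M", "F", "M", "F", "M"]

print(variation_of_information(german, spanish))  # H(German) + H(Spanish), as MI is 0
print(variation_of_information(german, german))   # 0 for identical partitions
```

Because the toy systems of Figure 1 share no information, their VI is simply the sum of their entropies, 1 + log2(3) bits.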

Data
Swadesh lists & NorthEuraLex. Our starting point is Swadesh lists (Buck, 1949; Swadesh, 1950, 1952, 1955, 1971/2006): concept-aligned minimal inventories of common, "core" or "basic" terminology thought to be "frequent, universal, and resistant to change over time" (Kaplan, 2017). For our purposes, concept-aligned sources are appealing, because they ensure a consistently present base set A across all our languages, maximizing comparability. We also use the NorthEuraLex dataset (Dellert and Jäger, 2017), essentially an extended Swadesh list covering 1016 concepts, to further validate our findings on the original Swadesh lists. Because grammatical gender on animate nouns has the added complication that it generally matches "natural" gender (or expressed preference) of living creatures across languages (Corbett, 1991; Romaine, 1997; Kramer, 2015), we omit animate nouns to remove semantic confounds from our investigation of cross-lingual gender assignments. We now take the base set A from the larger concept list in a broader swath of languages. We have 69 inanimate nouns in the Swadesh lists and 387 in NorthEuraLex.
Gender dictionaries. We choose a corpus-based approach to identifying a word's gender. We study the gendered languages available in Universal Dependencies v2.3 (Nivre et al., 2018), resulting in a sample of 20 (Hebrew, Greek, Hindi, Lithuanian, Latvian, Polish, Croatian, Slovak, Ukrainian, Russian, Slovenian, Bulgarian, Swedish, Danish, Romanian, French, Catalan, Italian, Spanish, Portuguese). This sample is somewhat skewed based on family, with all but one language (Hebrew) belonging to Indo-European. All are members of the Standard Average European Sprachbund (Whorf, 1997; Haspelmath, 2001), except Hebrew, Hindi, and Greek, which are the only representatives of their groups. Why the Indo-European focus? First, we needed aligned concept lists with gender and animacy annotations in languages which possess a gender system. Second, it is natural to test unsupervised methods on a sample with a known ground truth. Indo-European phylogeny, while not without its debates, is relatively well studied, making it a strong testbed for verifying our methods. Future work can enable greater linguistic diversity by scraping annotated dictionaries. Gender labels are drawn from the MarMoT contextual morphological tagger (Müller et al., 2013) trained on Universal Dependencies corpora (Nivre et al., 2018) in each language and applied to Wikipedia in that language. In the case of epicene words and polysemy, we select the consensus gender (Cucerzan and Yarowsky, 2003b) for the character sequence, that is, its most frequent gender label. We fill gaps manually using bilingual English-target language dictionaries. When multiple words are given to express a concept in a language, we select the most frequent.

Experiments
We apply each measure to the gender systems from our Swadesh lists, then validate our results on NorthEuraLex. We apply validation to ensure that they are picking up robust similarities as opposed to just reflecting properties of particular word lists. (See github.com/aryamccarthy/gender-partitions.) We then reconstruct phylogenetic trees of the languages involved. The trees show high agreement with ground truth, compared to random baselines.

Similarity measures
We apply the three evaluation measures (§4) to the partitions computed for our languages over the common conceptual lexicon. Figure 2 shows the pairwise scores for languages' gender systems (on the Swadesh list) as partitions. The rows and columns have been reordered according to a "ground truth" of pairwise distances (Serva and Petroni, 2008), for reasons we will explain in the next subsection. Regardless of measure, a few clusters emerge along the diagonal. The (Balto-)Slavic branch (i.e., Polish, Croatian, Slovak, Ukrainian, Slovenian, Russian, and Bulgarian) is present at the top left, and the Romance branch (i.e., French, Catalan, Italian, Spanish, and Portuguese) appears at the bottom right. Outside of these blocks, AMI shows us that the similarity of gender systems is no better than a chance relationship; at the whole-lexicon level, influence from the common Indo-European root is absent.
We also apply our measures to the wider swath of languages and larger aligned inventories of NorthEuraLex. The Romance languages again form a block, as do the Balto-Slavic languages. Figure 3 shows similar separation into families for both MI (a) and AMI (c), though this is less pronounced for Variation of Information (b). Variation of Information shows some surprising associations not present in AMI, such as associating Hebrew and Slovene highly with the Romance block.
Romanian deserves particular note: It is a Romance language but has been geographically isolated from its family for over a millennium, instead sharing membership in the Balkan Sprachbund with Greek and Bulgarian. As such, we may ask whether its phylogeny or its areal effects are reflected in the gender similarity metrics. While Romanian differs from other Romance languages in many ways (Dinu and Dinu, 2005; Dobrovie-Sorin, 2011), for example in possessing three genders instead of two, it is still more similar to its phylogenetically related Romance relatives than to Balto-Slavic languages. This is easiest to discern in the Variation of Information plot: weak connections surface between Romanian and both Slovene and Ukrainian, but the majority of the Balto-Slavic languages are quite distant from it.

Phylogeny
Inspired by the findings in the previous section (especially the high similarity among Romance languages), we further validate our measure, asking whether the resulting similarities reflect known phylogenetic ground truth, namely the developmental history of Indo-European languages. Obviously, there are many more facets to languages' relatedness than their gender systems, so it is interesting to find signal this strong from a single category. Rabinovich et al. (2017) cluster languages based on simple features of their translations into a common target language to craft phylogenetic trees. We take a similar approach, asking whether the pairwise similarities of gender systems are enough to reveal phylogenetic truth or some other relationship. We create phylogenetic trees through agglomerative hierarchical clustering, using both VI and one minus the AMI as distance measures. We use the weighted pair group method of averages (Sokal and Michener, 1958; Müllner, 2011) as implemented in the SciPy library (Jones et al., 2001).
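The clustering step can be sketched with SciPy, whose `linkage` function implements the weighted pair group method of averages under `method="weighted"`. The four language names and the distance values below are invented toy data standing in for a matrix of VI or 1 − AMI scores; they are not results from our experiments. The call to `optimal_leaf_ordering` corresponds to reordering leaves for display and does not change the tree itself.

```python
import numpy as np
from scipy.cluster.hierarchy import (fcluster, leaves_list, linkage,
                                     optimal_leaf_ordering)
from scipy.spatial.distance import squareform

# Hypothetical languages and a made-up symmetric distance matrix.
languages = ["lang_a", "lang_b", "lang_c", "lang_d"]
dist = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])

# linkage expects the condensed (upper-triangular) form of the matrix.
condensed = squareform(dist)

# WPGMA agglomerative clustering; each row of `tree` records one merge.
tree = linkage(condensed, method="weighted")

# Reorder leaves so that adjacent leaves are as similar as possible.
tree = optimal_leaf_ordering(tree, condensed)
leaf_order = [languages[i] for i in leaves_list(tree)]

# Cut the tree into two flat groups to inspect the top-level split.
groups = fcluster(tree, t=2, criterion="maxclust")
print(leaf_order, groups)
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding dendrogram.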
The resulting trees ("dendrograms") can be visualized showing the sequence of cluster formations during hierarchical clustering (Figure 4 and Figure 5). In a dendrogram, any ordering of the leaves maintains fidelity to the computed tree structure, so long as the branching is still correct. We choose to improve upon this by optimally ordering the leaves, swapping subtrees to convey similarity both within and across subtrees (Bar-Joseph et al., 2001). On the whole, our dendrograms recover known phylogenetic relationships between the languages we consider; this serves to largely validate our measures as having uncovered some meaningful similarity between the languages' gender systems. Indeed, in every case, we reconstruct the subtree of Romance languages with high fidelity. The only difference is that on NorthEuraLex, Catalan is more similar to Portuguese and Spanish than Italian is. In all trees, Romanian is always grouped with the Romance languages, matching its ancestry. The Balto-Slavic subtree is less perfect. MI and AMI recover similarities between Russian and Ukrainian (Eastern Slavic), Slovak and Polish (Western Slavic), and Croatian and Bulgarian (South Slavic) fairly well. Further, the Slavic and Baltic languages are properly joined to form a Balto-Slavic group. We take this as validation of our method.
When measuring with Variation of Information, though, things go awry. While it correctly pairs Russian and Ukrainian and recreates the same Romance subtree as the other measures, there are some major discrepancies. Hebrew, the only non-Indo-European language, is found to be closer to the Romance languages than to the Balto-Slavic cluster. Hindi's closeness to others is similarly exaggerated. In fact, everything seems to be close for VI, except Greek! As the other measures better capture the phylogeny, we suggest that similarity measured with Variation of Information is ill suited to our main task.

Quantitative Evaluation
Our proposals to measure similarity of gender systems give rise to dendrograms that resemble phylogenetic trees. But how much so? We answer this by measuring the similarity to the ground truth tree. To measure the similarity of two trees $T_1$ and $T_2$, we use Rabinovich et al. (2017)'s extension of the $L_2$ norm to leaf pair distance. Here, we sum the number of edges on a path between two nodes to get their distance $d$. We then compute the total distance as the sum of squared differences between the two trees' leaf pair distances:
$$\mathrm{Dist}(T_1, T_2) = \sum_{i \neq j} \left( d_{T_1}(i, j) - d_{T_2}(i, j) \right)^2$$
where each $i$ identifies one language (or leaf).
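A minimal sketch of this tree distance follows, under several assumptions of our own: trees are written as nested tuples of leaf names, the distance between two leaves counts edges on the path through their lowest common ancestor, and the sum runs over ordered pairs of distinct leaves. The two toy trees and the language codes are illustrative only.

```python
from itertools import permutations


def leaf_paths(tree, prefix=()):
    """Map each leaf name to its path of child indices from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def edge_distance(path_i, path_j):
    """Number of edges between two leaves, via their lowest common ancestor."""
    shared = 0
    for step_i, step_j in zip(path_i, path_j):
        if step_i != step_j:
            break
        shared += 1
    return (len(path_i) - shared) + (len(path_j) - shared)


def tree_distance(tree_1, tree_2):
    """Sum of squared differences of leaf-pair distances (ordered pairs)."""
    paths_1, paths_2 = leaf_paths(tree_1), leaf_paths(tree_2)
    if paths_1.keys() != paths_2.keys():
        raise ValueError("trees must have the same leaves")
    return sum(
        (edge_distance(paths_1[i], paths_1[j])
         - edge_distance(paths_2[i], paths_2[j])) ** 2
        for i, j in permutations(paths_1, 2)
    )


# Toy trees over four leaves; the second swaps "pt" and "ru".
tree_a = (("es", "pt"), ("ru", "uk"))
tree_b = (("es", "ru"), ("pt", "uk"))
print(tree_distance(tree_a, tree_a), tree_distance(tree_a, tree_b))
```

Summing over unordered pairs instead would halve every score without changing any comparison between trees.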
We show that the distance according to any of our three measures is significantly more like the ground truth (from Serva and Petroni, 2008) than chance by comparing the computed trees to 1000 randomly generated trees on the same set of languages. (We report mean and standard deviation of distance from the ground truth. We use Rabinovich et al. (2017)'s unweighted distance.) For each combination of dataset and measure, we use McNemar's test for significance and find p < 0.0001.

Related Work
There is a baffling dearth of work on quantifying similarity of gender systems. There is, however, ample work on characterizing intensional gender systems, i.e., sets of grammatical rules, that can be divided (Corbett, 1991) into sets of rules based on morphology (Tucker et al., 1977; Gregersen, 1967; Wald, 1975; Plank, 1986, i.a.) and on phonology (Bidot, 1925; Tucker et al., 1977; Newman, 1979; Hayward and Corbett, 1988; Marchese, 1988). Intensional approaches, particularly those with typological leanings, contribute very fine-grained research on particular pairwise similarities for particular languages and dialects. Although we cannot survey these in detail here, we would love for our measures to contribute findings that can complement these approaches.
Relatedly, other recent works have investigated grammatical gender and other types of noun classification systems with information-theoretic tools. For example, Williams et al. (2020b) use mutual information to quantify the strength of the relationships between declension class, grammatical gender, distributional semantics, and orthographic form, respectively, in several languages. Williams et al. (2020a), which is arguably closest to this work, measure the strength of semantic relationships between inanimate nouns and the verbs or adjectives that take those nouns as arguments. That work can be seen as comparing the similarity of nouns clustered by their gender with the same nouns clustered by the adjectives that modify them or the verbs that take them as arguments.
Although we adopt information-theoretic measures here, there are two other major classes of cluster evaluation measures: set-matching measures, and pair-counting measures, which tally which pairs of items are in the same or different communities. One popular set-matching measure in information retrieval, purity (Manning et al., 2008), is asymmetric and biased by the size and number of communities (Danon et al., 2005). Its symmetric form, the F-measure (Artiles et al., 2007), has clear bounds but gives no indication of average-case performance.
The adjusted Rand index (ARI; Hubert and Arabie, 1985) is the preeminent pair-counting measure. It is related to AMI, adjusting the Rand index in the same way that AMI adjusts MI. ARI also computes an expectation, which can be computed over the proper distribution (Gates and Ahn, 2017), but it is empirically better suited to large, balanced clusters. In our case of small and uneven clusters, AMI should be preferred (Romano et al., 2016). We can only survey a representative handful of the numerous cluster evaluation measures in the limited space we have here. See McCarthy et al. (2019b) for an outline of desiderata for comparing partitions, as well as a general class of appropriate measures, and for further motivation for AMI using a different null model: languages have a fixed number of gender classes, so we select one over N items with K communities, rather than an arbitrary number of communities.

Conclusion
We have presented a clean method for comparing grammatical gender systems across languages: By defining gender classes extensionally, we reduced the problem to cluster evaluation from community detection. We validate three metrics by recovering known phylogenetic relationships in our languages, with measurable success. Separate Indo-European branches are no more similar than chance.
We emphasize that our methods are not specifically tailored to gender systems. One could apply them more broadly to other aspects of the lexicon, e.g., to Indo-European verb classes, Bantu noun classes, or diachronic time slices of a single language's gender system, data permitting. A related challenge is East and Southeast Asian numeral classifier systems, which associate nouns with classifiers based largely on the semantic properties of the nouns (Kuo and Sera, 2009; Zhan and Levy, 2018; Liu et al., 2019). They display more idiolectal variation, and often more than one classifier can accompany a given noun (Hu, 1993), unlike for gender (where this is rare). We note that we could further extend our measures to fuzzy partitions, which remain less explored in community detection, but are a promising avenue for future work.