Statistical Modeling of Creole Genesis



Introduction
While most linguistic applications of computational phylogeny rely on lexical data (Gray and Atkinson, 2003;Bouckaert et al., 2012), there is a growing trend to make use of typological data (Tsunoda et al., 1995;Dunn et al., 2005;Teh et al., 2008;Longobardi and Guardiano, 2009;Murawaki, 2015). One advantage of typological features over lexical traits (cognates) is that they allow us to compare an arbitrary pair of languages even if they do not share enough cognates. For this reason, they have the potential of uncovering external relations involving language isolates and tiny language families such as Ainu, Basque, and Japanese.
However, our understanding of typological changes is far from satisfactory in at least two respects. First, typological changes are less intuitive than the birth and death of a lexical trait. Modeling word-order change with a single transition matrix (Maurits and Griffiths, 2014), for example, appears to be an oversimplification because some complex mechanisms must be hidden behind the changes (Murawaki, 2015).
The second point, the main focus of this paper, is that it is not clear whether typological data fit into the traditional tree model for a group of languages, which has long been used as the default choice to summarize evolutionary history (Schleicher, 1853). To be precise, regardless of whether typological features are involved, linguists have viewed the tree model with suspicion. A central problem of the tree model is its assumption that after a branching event, two resultant languages evolve completely independently. However, linguists have noted that horizontal contact is a constitutive part of evolutionary history. Various models for contact phenomena have been proposed to address this problem, including the wave theory (Schmidt, 1872) and the gravity model (Trudgill, 1974). As for linguistic typology, areal linguistics has worked on the diffusion of typological features across languages within a geographical area (Campbell, 2006).
In this paper, we study creole languages as an extreme case of non-tree-like evolution (Wardhaugh and Fuller, 2015). A creole is developed as a result of intense contact between multiple languages: typically one socioculturally dominant language (superstrate) and several low-prestige languages (substrates). Superstrates are also known as lexifiers because the lexicon of a creole is largely derived from its superstrate. In spite of this, the grammar of a creole is drastically different from that of its lexifier. It is often said that creole grammars are simpler than non-creole ones although it is not easy to measure grammatical simplicity.
Creoles are not irrelevant to historical linguistics because some have speculated about the plausibility of creole status for Middle English (Bailey and Maroldt, 1977) and (pre-)Old Japanese (Kawamoto, 1974;Akiba-Reynolds, 1984;Kawamoto, 1990) while they have been criticized harshly by others.
This controversy can only be settled by fully understanding creole genesis, or the question of how creoles emerge, which also remains unresolved (Wardhaugh and Fuller, 2015). One theory called the gradualist hypothesis suggests their staged development from pidgin languages. A pidgin arises when speakers of different languages with no common language try to have a makeshift conversation, which results in a drastic simplification of grammar called pidgin formation. A pidgin is then acquired by children as their first language and is transformed into a fully functional language. This process of grammatical elaboration is known as creole formation.
We follow Bakker, Daval-Markussen and colleagues in taking data-driven approaches to this problem (Bakker et al., 2011; Daval-Markussen and Bakker, 2012; Daval-Markussen, 2013). They argue that creoles are typologically distinct from non-creoles. They compare four theories of creole genesis:
1. Superstratist. The lexifier plays a major role in creole genesis.

Figure 1: NeighborNet graph of creoles (c), lexifiers (L), substrates (S) and other non-creole languages (X). A language type (c/L/S/X) is followed by a three-letter language code. The bottom-up clustering (i.e., basically tree-building) method produced a cluster of creoles on the right.

To test these explanations, they apply Neighbor-Net (Bryant and Moulton, 2004), a bottom-up, distance-based clustering method developed in the field of computational biology. By demonstrating that, as in Figure 1, creoles form a cluster distinguishable from lexifiers, substrates and other non-creole languages, they argue for the universalist position and for creole distinctiveness.
However, we find both theoretical and methodological problems in their discussion. Theoretically, the synchronic question of creole distinctiveness is confused with the diachronic question of creole genesis. If creoles are distinct from noncreoles, then something specific to creoles (e.g., restructuring universals) would play a role, but not vice versa. It is logically possible that even with restructuring universals, creoles are indistinguishable from (a subgroup of) non-creoles. Methodologically speaking, NeighborNet does not straightforwardly explain fundamentally non-tree-like evolution because it does assume tree-like evolution even though it represents conflicting signals with reticulations.
We begin by addressing the question of creole (non-)distinctiveness. Whether one is distinct from another is straightforwardly formalized as a binary classification problem. We show that an SVM classifier fails to separate creoles from non-creoles. Following a practice in population genetics, we visualize the data using principal component analysis (PCA). The result suggests that although creoles have a substantially different distribution from non-creoles, they nevertheless overlap.
Next, we propose to model creole genesis with mixture models. In this approach, a creole is stochastically generated by mixing its lexifier, substrate(s) plus a special restructurer. Conceptually, this is the opposite of the tree model, as illustrated in Figure 2. Specifically, we present two Bayesian models. The first one considers one mixing proportion per creole, and the other decomposes the proportions into per-feature and per-creole factors. Our experimental results suggest that the restructurer dominates creole genesis, dismissing the superstratist, substratist and feature pool theories. We also find some statistical universals in the restructurer although we refrain from identifying them as restructuring universals. In this way, we represent a first step toward understanding the complex process of creole genesis through statistical models.

Like Bakker, Daval-Markussen and colleagues (Bakker et al., 2011; Daval-Markussen and Bakker, 2012; Daval-Markussen, 2013), we borrow ideas from computational biology. For reasons unknown to us, they chose clustering models that basically assume tree-like evolution (Saitou and Nei, 1987; Bryant and Moulton, 2004). However, creole genesis is more comparable to models that explicitly take into account genetic admixture (i.e., contact phenomena). See Jones et al. (2015), for example, for standard practices in population genetics.
Population genetic analysis of genotype data (binary sequences comparable to sets of linguistic features) can be grouped into two types: population-level and individual-level analysis. Populations, such as Sardinian, Yoruba and Japanese, are predefined sets of individuals. Population-level analysis utilizes genetic variation within a population (Patterson et al., 2012). From a modeling point of view, languages are more comparable to individuals. Although a language is spoken by a population, no linguistic data available are comparable to a set of individuals.
Individual-level analysis, where population labels are used only for the purpose of visualization, is often done using PCA and admixture analysis. PCA is used for dimensionality reduction: by selecting the first two principal components, high-dimensional sequences are projected onto an informative two-dimensional diagram (Patterson et al., 2006).
Admixture analysis (Pritchard et al., 2000;Alexander et al., 2009) closely resembles topic models, most notably Latent Dirichlet Allocation (LDA) (Blei et al., 2003), in NLP. It assumes that each individual is a mixture of K ancestral components (i.e., topics). One difference is that while each LDA topic is associated with a single word distribution (K distributions in total), each SNP (i.e., feature type) has its own distribution (K × J distributions in total for sequences with length J).

Linguistic Typology and Non-tree-like Evolution
Like lexical data, typological features are usually analyzed with a tree model, but Reesink et al. (2009) are a notable exception. They applied admixture analysis to Australian and Papuan languages, for which tree-building techniques had not been successful. They related inferred ancestral components to putative prehistoric dispersals and contacts. Independently of biologically-inspired studies, Daumé III (2009) incorporated linguistic areas into a phylogenetic tree. In his Bayesian generative model, each feature of a language has a latent variable which determines whether it is derived from an areal cluster or the tree. Thus his model can be seen as a mixture model.

Data and Preprocessing
We used the online edition of the Atlas of Pidgin and Creole Language Structures (APiCS) (Michaelis et al., 2013), a database of pidgin and creole languages. It was larger than the datasets of Bakker et al. (2011). As of 2015, it contained 76 languages (104 varieties). It was essentially a pidgin-and-creole version of the online edition of the World Atlas of Language Structures (WALS) (Haspelmath et al., 2005), but it contained sociolinguistic features, phonological inventories and example texts in addition to typological features. As APiCS did not mark creoles, we used the sociolinguistic feature "Ongoing creolization of pidgins" as a criterion to select creoles. Specifically, we filtered out languages whose feature value was neither "Not applicable (because the language is not a pidgin)" nor "Widespread." In APiCS, 48 out of 130 typological features were mapped to WALS features. We used these features to combine creoles from APiCS with non-creoles from WALS. Since the WALS database was sparse, we selected languages for which at least 30% of the features were present. As a result, we obtained 64 creoles and 541 non-creoles.
We imputed missing data using the R package missMDA (Josse et al., 2012). It handled missing values using multiple correspondence analysis (MCA). Specifically, we used the imputeMCA function to predict missing feature values.
When investigating creole distinctiveness, we used binary representations of features. Using a one-of-K encoding scheme, we transformed 48 categorical features into 220 binary features.
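The one-of-K encoding can be illustrated with a minimal sketch. The function name `one_of_k`, the integer coding of feature values, and the toy input are assumptions made for illustration; the text does not say which tool was used for this step.

```python
import numpy as np

def one_of_k(X_cat, n_values):
    """Expand integer-coded categorical features into binary indicator columns.

    X_cat    : (N, J) integer array, X_cat[i, j] in {0, ..., n_values[j] - 1}
    n_values : number of possible values for each of the J feature types
    Returns an (N, sum(n_values)) 0/1 matrix with exactly J ones per row.
    """
    N, J = X_cat.shape
    # Starting column of each feature type's block in the binary matrix.
    offsets = np.concatenate(([0], np.cumsum(n_values)[:-1]))
    X_bin = np.zeros((N, int(np.sum(n_values))), dtype=np.int8)
    rows = np.repeat(np.arange(N), J)
    cols = (X_cat + offsets).ravel()
    X_bin[rows, cols] = 1
    return X_bin

# Toy example: two languages, two feature types with 2 and 3 possible values.
toy = np.array([[0, 2],
                [1, 0]])
encoded = one_of_k(toy, [2, 3])
```

With 48 feature types whose value counts sum to 220, the same function would yield the 220 binary features mentioned above.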
Our mixture models require each creole to be associated with a lexifier and substrate(s). Unfortunately, APiCS described these languages in an obscure way (and many of them are indeed not fully resolved). We had no choice but to manually select several modern languages as proxies for them. For simplicity, we chose only one substrate per creole, but it is not difficult to extend our model for multiple substrates. We are aware that these are oversimplifications, but we believe they would be adequate for a proof-of-concept demonstration.

Creole Non-distinctiveness

Binary Classification
To determine whether creoles are distinct from non-creoles, we apply a linear SVM classifier to the typological data. Here, linearity is assumed for two reasons. First, since the supposed distinctiveness is explained by restructuring universals, there is no way for creoles to have an XOR-like distribution. Second, Daval-Markussen (2013) claims that as few as three features are sufficient to distinguish creoles from non-creoles. If this is correct, it is expected that given 48 categorical features, even a simple linear classifier can work nearly perfectly. The classifier is trained to classify whether a given language, represented by binarized features, is a creole (+1) or non-creole (−1). We use 5-fold cross validation with grid search to tune hyperparameters.

Table 1: Confusion matrix (C: creoles, NC: non-creoles).
        Reference
        C      NC
C       54     10
NC      7      534
In our experiments, the accuracy, recall, precision and F1-measure were 97.2%, 88.5%, 84.4% and 86.4%, respectively. Table 1 shows the confusion matrix. Although the classifier worked well, it failed to separate creoles from non-creoles perfectly: borderline cases remained.
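As a worked check, the reported figures can be recomputed from the four counts in Table 1. The sketch below assumes that the rows of the table are classifier outputs and the columns are reference labels; this reading is an assumption, chosen because it reproduces the reported recall and precision. The variable names are illustrative.

```python
# Counts from Table 1, read as rows = classifier output, columns = reference
# (an assumption; the extracted table does not label the row axis).
tp, fp = 54, 10   # predicted creole:     reference creole / reference non-creole
fn, tn = 7, 534   # predicted non-creole: reference creole / reference non-creole

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%} recall={recall:.1%} "
      f"precision={precision:.1%} F1={f1:.1%}")
```

Under this reading the four values round to 97.2%, 88.5%, 84.4% and 86.4%, matching the text.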

PCA
For exploratory analysis and visualization, we applied PCA to creoles and non-creoles, again represented by binarized features. Figure 3 depicts the scatterplot of the first two principal components. We can see that creoles were characterized by quite a different distribution from that of non-creoles. The creoles were concentrated on the lower center while most non-creoles belonged to one of two clusters in the middle. However, the distribution of creoles did overlap with that of non-creoles.
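A two-dimensional projection of this kind can be sketched with a singular value decomposition of the centered binary matrix. This is a generic PCA sketch, not necessarily the implementation behind Figure 3; the function name `pca_project` and the random toy matrix are assumptions for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal components.

    X : (N, D) matrix, e.g. languages by binarized features.
    Returns (scores, explained), where scores is (N, n_components) and
    explained holds the fraction of variance captured by each component.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T        # coordinates for the scatterplot
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]

# Toy stand-in for the languages-by-binary-features matrix.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(20, 6)).astype(float)
scores, explained = pca_project(X_toy)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, with creole and non-creole labels used only for coloring, would give a diagram in the spirit of Figure 3.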
Having a closer look at the diagram, we found that Negerhollands (Dutch), Cape Verdean Creole of Brava (Portuguese) and Vincentian Creole (English) were among the most "typical" creoles (lexifiers in parentheses). Tok Pisin (English) and Bislama (English) were at the periphery of the cluster.

Forsaking the quest for synchronic distinctiveness, we take a more direct approach to the diachronic question of creole genesis. Since multiple languages are involved in creole genesis, it is reasonable to apply a mixture model. We assume that a creole is stochastically generated by mixing three sources: (1) a lexifier, (2) a substrate and (3) a global restructurer. Under this assumption, the main question is with what proportions these sources are mixed.
An unusual property of our model as a mixture model is that not only outcomes (creoles) but most sources (lexifiers and substrates) are observed. We only need to infer the restructurer. Thus another question is what the restructurer looks like. Note that our model is constructed such that it does not commit to a particular theory of creole genesis. If the superstratist theory is correct, then lexifiers would dominate the inferred mixing proportions. The same is true of the substratist theory. Similarly, the feature pool theory entails that the restructurer only occupies negligible portions. Also note that even if the restructurer plays a significant role, it does not necessarily imply the universalist position. The restructurer is a set of catch-all feature distributions for those which are explained neither by the lexifier nor by the substrate (that is why we avoid calling it restructuring universals). In order for it to be linguistic universals, it must show some consistent patterns in its distributions.

MONO Model
Our idea is materialized in two Bayesian generative models. The first one, called MONO, is similar to the STRUCTURE algorithm of admixture analysis (Pritchard et al., 2000). Every language in the model is represented by a sequence of categorical features. The number of possible values varies among feature types. For feature $j$ of creole $i$, the latent assignment variable $z_{i,j}$ determines from which source the feature is derived: a lexifier (L), a substrate (S) or the restructurer (R). Each creole $i$ is associated a priori with a lexifier and a substrate. Let $y_{i,j,L}$ and $y_{i,j,S}$ be the values of feature $j$ of creole $i$'s lexifier and substrate, respectively. If the source is the lexifier (or substrate), the creole simply copies $y_{i,j,L}$ (or $y_{i,j,S}$). For the sake of uniformity, we can think of a lexifier (or substrate) as a set of feature distributions each of which concentrates all probability mass on its observed value (i.e., the δ function). The remaining source, the restructurer, is a set of categorical feature distributions each of which is drawn from a Dirichlet prior.
The assignment variable $z_{i,j}$ is generated from $\theta_i$, which in turn is generated from a Dirichlet prior. $\theta_i = (\theta_{i,L}, \theta_{i,S}, \theta_{i,R})$ is the parameter of a categorical distribution which specifies the mixing proportion of the three sources for creole $i$. (By letting another categorical distribution subdivide $\theta_{i,S}$, we can incorporate multiple substrates into the model.) More concretely, the generative story of MONO is as follows:
1. For each feature type $j \in \{1, \cdots, J\}$ of the restructurer:
(a) draw a distribution from a symmetric Dirichlet distribution: $\phi_j \sim \mathrm{Dir}(\beta_j)$
2. For each creole $i \in \{1, \cdots, N\}$:
(a) draw a mixing proportion from a symmetric Dirichlet distribution: $\theta_i \sim \mathrm{Dir}(\alpha_i)$
(b) then for each feature type $j \in \{1, \cdots, J\}$:
i. draw a topic assignment $z_{i,j} \sim \mathrm{Categorical}(\theta_i)$
ii. draw a feature value $x_{i,j}$: copy $y_{i,j,L}$ if $z_{i,j} = L$, copy $y_{i,j,S}$ if $z_{i,j} = S$, and draw $x_{i,j} \sim \mathrm{Categorical}(\phi_j)$ if $z_{i,j} = R$

As usual, we marginalize out $\phi_j$ and $\theta_i$ using conjugacy of Dirichlet and categorical distributions (Griffiths and Steyvers, 2004). We use Gibbs sampling to infer $z_{i,j}$, whose probability conditioned on the rest is proportional to

$$P(z_{i,j} = k \mid \cdot) \propto \left(\alpha_i + c^{-(i,j)}_{i,k}\right) \times \begin{cases} \mathbb{1}[x_{i,j} = y_{i,j,k}] & \text{if } k \in \{L, S\} \\ \dfrac{\beta_j + c^{-(i,j)}_{R,j,l}}{\sum_{l'} \left(\beta_j + c^{-(i,j)}_{R,j,l'}\right)} & \text{if } k = R \end{cases} \qquad (1)$$

where $l = x_{i,j}$, $c^{-(i,j)}_{i,k}$ is the number of assignments for creole $i$, except $z_{i,j}$, whose values are $k$, and $c^{-(i,j)}_{R,j,l}$ is the number of observed features for feature type $j$, except $x_{i,j}$, that are derived from the restructurer and have $l$ as their value. Intuitively, the first term gives priority to the source from which many other features of creole $i$ are derived. The second term concerns how likely the source is to generate the feature value. For the lexifier or the substrate, it is 1 only if the source shares the same feature value with the creole; otherwise 0. To tune hyperparameters $\alpha_i$ and $\beta_j$, we set a vague gamma prior Gamma(1, 1) and sample these parameters using slice sampling (Neal, 2003).
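The Gibbs update for the assignment variables can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes a single fixed `alpha` and `beta` shared across creoles and feature types (the slice sampling of hyperparameters is omitted), integer-coded feature values, and the hypothetical source codes `L, S, R = 0, 1, 2`. The restructurer term uses the standard collapsed Dirichlet-categorical predictive probability.

```python
import numpy as np

L, S, R = 0, 1, 2  # hypothetical integer codes for the three sources

def gibbs_sweep(x, y_lex, y_sub, z, n_values, alpha, beta, rng):
    """One collapsed Gibbs sweep over all assignments z[i, j], updated in place.

    x, y_lex, y_sub : (N, J) integer arrays with the feature values of each
                      creole, its lexifier and its substrate
    z               : (N, J) integer array of current source assignments
    n_values        : number of possible values per feature type
    alpha, beta     : fixed symmetric Dirichlet hyperparameters (assumption)
    """
    N, J = x.shape
    # Per-creole counts of assignments to each source.
    c_src = np.stack([(z == k).sum(axis=1) for k in (L, S, R)], axis=1).astype(float)
    # Per-feature-type counts of values currently explained by the restructurer.
    c_R = [np.zeros(v) for v in n_values]
    for i in range(N):
        for j in range(J):
            if z[i, j] == R:
                c_R[j][x[i, j]] += 1
    for i in range(N):
        for j in range(J):
            # Remove the current assignment from the counts.
            k_old = z[i, j]
            c_src[i, k_old] -= 1
            if k_old == R:
                c_R[j][x[i, j]] -= 1
            # Second term: how likely each source is to generate x[i, j].
            like = np.array([
                float(x[i, j] == y_lex[i, j]),
                float(x[i, j] == y_sub[i, j]),
                (beta + c_R[j][x[i, j]]) / (n_values[j] * beta + c_R[j].sum()),
            ])
            p = (alpha + c_src[i]) * like
            k_new = rng.choice(3, p=p / p.sum())
            # Add the new assignment back to the counts.
            z[i, j] = k_new
            c_src[i, k_new] += 1
            if k_new == R:
                c_R[j][x[i, j]] += 1
    return z
```

Initializing every assignment to `R` is always valid, since the restructurer can generate any value; the lexifier and substrate can only be chosen where their value matches the creole's.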

FACT Model
It is said that some features are more easily borrowed than others (Matras, 2011). For creoles, some seem to reflect substrate influence on phonology while reduced inflections might be attributed to the restructurer. Inspired by these observations, we extend the MONO model such that some feature types can have strong connections to particular sources. We call this extended model FACT.
To do this, we decompose the mixing proportions into per-feature and per-creole factors. We apply additive operations to these factors in log-space in a way similar to the Sparse Additive Generative model (Eisenstein et al., 2011). As a result of this extension, every feature $j$ of creole $i$ has its own mixing proportion, $\theta_{i,j} = (\theta_{i,j,L}, \theta_{i,j,S}, \theta_{i,j,R})$:

$$\theta_{i,j,k} = \frac{\exp(m_{j,k} + n_{i,k})}{\sum_{k'} \exp(m_{j,k'} + n_{i,k'})}$$

where $m_{j,k}$ is a factor specific to feature type $j$ and $n_{i,k}$ is the one specific to creole $i$. To penalize extreme values, we put Laplacian priors on $m_{j,k}$ and $n_{i,k}$, with mean 0 and scale $\gamma$.
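The additive combination of per-feature and per-creole factors in log-space can be sketched as a softmax over the three sources. The function name `fact_mixing_proportions` and the array layout are assumptions for illustration.

```python
import numpy as np

def fact_mixing_proportions(m, n):
    """Combine per-feature and per-creole factors into mixing proportions.

    m : (J, 3) per-feature factors, columns ordered as (L, S, R)
    n : (N, 3) per-creole factors, same column order
    Returns theta with shape (N, J, 3); theta[i, j] sums to 1.
    """
    logits = n[:, None, :] + m[None, :, :]                 # add factors in log-space
    logits = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 2 creoles, 4 feature types, all factors at the prior mean 0.
theta = fact_mixing_proportions(np.zeros((4, 3)), np.zeros((2, 3)))
```

With all factors at zero the three sources are equally likely; a large positive `m[j, 2]`, for example, ties feature type `j` to the restructurer for every creole.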
To sum up, the generative story of FACT is as follows:

Table 2: Summary of mixing proportions. The arithmetic mean of 50 samples after 5,000 iterations, with an interval of 100 iterations.
For inference, a modification is needed to infer $z_{i,j}$: the first term $\alpha_i + c^{-(i,j)}_{i,k}$ of Equation (1) is replaced with $\theta_{i,j,k}$. The factors $m_{j,k}$ and $n_{i,k}$ are sampled using the Metropolis algorithm, with a Gaussian proposal distribution centered at the previous value. Hyperparameter $\gamma$ is set to 10.
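A single Metropolis update for one scalar factor can be sketched as below. The likelihood is passed in as a function because its exact form depends on the current assignments; the proposal standard deviation `step` is a hypothetical tuning parameter not given in the text, and `scale=10.0` mirrors the stated value of the Laplacian scale.

```python
import math
import random

def laplace_log_prior(value, scale):
    """Log density of a Laplace prior with mean 0 and the given scale."""
    return -abs(value) / scale - math.log(2.0 * scale)

def metropolis_step(current, log_likelihood, scale=10.0, step=0.1, rng=random):
    """Propose from a Gaussian centered at the current value, then accept or reject.

    log_likelihood : function mapping a candidate factor value to the log
                     probability of the current assignments (supplied by the caller)
    """
    proposal = rng.gauss(current, step)
    log_ratio = (log_likelihood(proposal) + laplace_log_prior(proposal, scale)
                 - log_likelihood(current) - laplace_log_prior(current, scale))
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) < min(0.0, log_ratio):
        return proposal
    return current
```

Because the Gaussian proposal is symmetric, the acceptance ratio involves only the posterior densities, which is what distinguishes the Metropolis algorithm from the more general Metropolis-Hastings scheme.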

Table 2 summarizes mixing proportions.
For MONO and FACT (combined), we use the fraction of assignment variables pointing to a particular source. Per-feature and per-creole factors are converted into probabilities as follows: per-feature proportions $\hat{\phi}_j = (\hat{\phi}_{j,L}, \hat{\phi}_{j,S}, \hat{\phi}_{j,R})$, where $\hat{\phi}_{j,k} = \exp(m_{j,k}) / \sum_{k'} \exp(m_{j,k'})$. Similarly, per-creole proportions $\hat{\theta}_i = (\hat{\theta}_{i,L}, \hat{\theta}_{i,S}, \hat{\theta}_{i,R})$, where $\hat{\theta}_{i,k} = \exp(n_{i,k}) / \sum_{k'} \exp(n_{i,k'})$. We can see that the overwhelming majority of features were derived from the restructurer both in MONO and FACT (combined). The restructurer was followed by lexifiers, and substrates were the least influential. These results can be interpreted as counter-evidence to the superstratist, substratist and feature pool theories.

Figure 4: Mixing proportions of MONO projected onto a simplex. Each point denotes a creole. It is the parameter of the posterior predictive distribution of an assignment variable: $\hat{\theta}_i = (\ldots)$. One sample after 10,000 iterations.
MONO and FACT (combined) exhibited similar patterns. When the mixing proportions are decomposed into per-feature and per-creole factors, per-creole factors exhibited less uneven distributions than per-feature factors. This implies heterogeneous behavior of features in creole genesis. Table 3 lists top-5 feature types for each source. Figure 4 plots creoles on a simplex of mixing proportions in MONO. Creoles were scattered across the simplex but leaned toward the restructurer. This implies that a lexifier cannot be mixed with substrates without interference from the restructurer.
Compared with MONO, FACT tended to push points to the edges of the simplex. This can be confirmed in Figure 5. In particular, Figure 5(c) is directly comparable to Figure 4. It is possible that halfway points in MONO were artifacts of its limited expressive power.

Figure 5: ... Table 2. (c) $N$ points for per-creole factors $\hat{\theta}_i$.

Table 4 lists the top-10 feature type-value pairs that were derived from the restructurer. In other words, we stochastically removed the influence of the lexifiers and substrates from creole data. These features can be regarded as (statistical) universals although our model leaves the possibility that they were not restructuring universals. To answer this question, we need to break down the restructurer by types of linguistic universals.

Among the 10 feature type-value pairs, only four apply to Japanese (Negative Indefinite Pronouns and Predicate Negation, Intensifiers and Reflexive Pronouns, Alignment of Case Marking of Pronouns, and Order of Numeral and Noun). For reference, English has seven. Combined with the PCA analysis in Section 4.2, this suggests that Japanese is a very non-creole-like language. However, we are unsure if the possibility of creole status for (pre-)Old Japanese is completely rejected. This question might be answered if we figure out how long it takes for creole-like traits to disappear.
It is often said that creoles have SVO word order. According to APiCS, the number of creoles with SVO order was 61 (exclusive) and 71 (exclusive plus shared) in the 76-language dataset. However, this feature value only gained the ratio of 67.3%. This is mainly because SVO is the word order of most lexifiers, but it can also be attributed to data representation: since WALS did not allow multi-valued features (e.g., SVO and SOV), some creoles with multiple word orders were mapped to a separate category "No dominant order," underestimating the influence of SVO.

Table 4: Top-10 features derived from the restructurer in FACT. The ratio of the feature type-value pair $(j, l)$ is defined as $|\{i \mid x_{i,j} = l, z_{i,j} = R\}| / N$. The arithmetic mean of 50 samples after 5,000 iterations, with an interval of 100 iterations.
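The ratio reported for a feature type-value pair can be computed from one sample of assignments as sketched below. The function name, the integer code `2` for the restructurer, and the toy arrays are assumptions for illustration; averaging the result over the retained samples would give a figure comparable to those in Table 4.

```python
import numpy as np

def restructurer_ratio(x, z, j, l, restructurer=2):
    """Fraction of the N creoles whose feature j has value l AND is
    assigned to the restructurer in this sample of z."""
    N = x.shape[0]
    hits = (x[:, j] == l) & (z[:, j] == restructurer)
    return np.count_nonzero(hits) / N

# Toy sample: 4 creoles, 1 feature type; value 0 stands in for some feature value.
x_toy = np.array([[0], [0], [0], [1]])
z_toy = np.array([[2], [0], [2], [2]])   # 0 = lexifier, 2 = restructurer
ratio = restructurer_ratio(x_toy, z_toy, j=0, l=0)
```

In the toy sample three creoles have the value, but one of them is explained by its lexifier, so only two of four count toward the ratio. This is the same mechanism by which a widespread value such as SVO can receive a ratio below its raw frequency.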

Discussion
The main contribution of our work is the introduction of mixture models to creole studies. This is, however, only the first step toward understanding the complex process of creole genesis by means of statistical modeling. Better data are needed with respect to proxies for substrates, missing values, multi-valued features among others.
With better data, more elaborate models could uncover the detailed process of creole genesis. Our models mix several sources in one step, but we may want to model the staged development of pidgin formation and creole formation. As a result of continued influence from its superstrate, a creole might undergo decreolization. It is argued that pidgins themselves have several development stages, from each of which creoles can emerge (Mühlhäusler, 1997). Hopefully, these hypotheses could be tested with statistical models.
Our finding that the restructurer plays a dominant role in creole genesis has a negative implication for tree-based inference of language relationships. If most features of a language come from nowhere, we are unable to trace its origin back into the deep past. Meanwhile, it has been argued that creole genesis only occurred in modern and early-modern, exceptional circumstances and cannot be responsible for most historical changes. Thus identifying the social conditions under which creoles arise (Tria et al., 2015) is another research direction to be explored.

Conclusion
In this paper, we presented several statistical models of linguistic typology to answer questions concerning creole genesis. First, we formalized creole (non-)distinctiveness as a binary classification problem. Second, we proposed to model creole genesis with mixture models, which makes more sense than tree-building techniques.
Recent studies on linguistic applications of computational phylogeny have been heavily influenced by computational biology. They often depend on ready-to-use software packages developed in that field. We observe that, as a result, linguistic phenomena that lack exact counterparts in biology tend to be left untouched. In this paper, we have hopefully demonstrated that computational linguists could fill the gap.