Uncovering Probabilistic Implications in Typological Knowledge Bases

The study of linguistic typology is rooted in the implications we find between linguistic features, such as the fact that languages with object-verb word ordering tend to have postpositions. Uncovering such implications typically amounts to time-consuming manual processing by trained and experienced linguists, which potentially leaves key linguistic universals unexplored. In this paper, we present a computational model which successfully identifies known universals, including Greenberg universals, but also uncovers new ones, worthy of further linguistic investigation. Our approach outperforms baselines previously used for this problem, as well as a strong baseline from knowledge base population.


Introduction
Linguistic typology is concerned with mapping out the relationships between languages with reference to structural and functional properties (Croft, 2002). A typologist may ask, for instance, how a language encodes syntactic features and relationships. Does it place its verbs before objects or after, and does it have prepositions or postpositions? It is well established that many features of languages are highly correlated, sometimes to the extent that they imply each other. Based on this observation, Greenberg (1963) establishes the notion of implicational universals, i.e., cases where the presence of one feature strictly implies the presence of another.
Universals are important to investigate as they offer insight into the inner workings of language and define the space of plausible languages. Universals can aid cognitive scientists examining the underlying processes of language, as there arguably is a cognitive reason for why, e.g., languages with OV ordering are postpositional (Greenberg, 1963). In the context of natural language processing (NLP), when creating synthetic data for multilingual NLP, one should consider universals to maintain the plausibility of the data (Wang and Eisner, 2016). Computational typology can furthermore be used to induce language representations, useful in, e.g., language modelling (Östling and Tiedemann, 2017) and syntactic parsing (de Lhoneux et al., 2018).
In this paper, we argue that the deterministic Greenbergian view of implications (Greenberg, 1963) is outdated. Instead, we suggest that a probabilistic view of implications is more suitable, and define the notion of a probabilistic typological implication as a certain conditional probability distribution. We do this by first placing a joint distribution over the vector of typological features, and then marginalising out all features other than the two under consideration. This computation is made tractable by learning a tree-structured graphical model (Figure 1) with the PC algorithm of Neapolitan (2004) and then applying the belief propagation (BP) algorithm (Pearl, 1982). We draw inspiration from manual linguistic approaches to this problem (Greenberg, 1963; Lehmann, 1978), as well as from previous computational methods (Daumé III and Campbell, 2007; Bjerva et al., 2019a). Additionally, we provide a qualitative analysis of predicted implications and perform an empirical evaluation on typological feature prediction, comparing to strong baselines.
arXiv:1906.07389v1 [cs.CL] 18 Jun 2019

From a Generative Model to Probabilistic Implications
Notation. We now seek a probabilistic formalisation of typological implications. First, we introduce the relevant notation. Let $\ell$ be a language. We seek to explain the observed, language-specific binary vector of typological features, or parameters, $\pi^{\ell}$, where $\pi^{\ell}_i = 1$ indicates that the $i$-th typological feature is "on" in language $\ell$. When it is unambiguous, we drop the superscript $\ell$. Note that we call the vector $\pi$ due to a spiritual similarity to the principles-and-parameters framework of Chomsky (1981).
A Generative Model of Typology. We construct a simple generative probability model over the vector of typological features $\pi$, which factorises according to some tree structure $T$. We will discuss the provenance of $T$ below. Concretely, this distribution is defined as
$$p(\pi) = \prod_{i} p(\pi_i \mid \mathrm{pa}_T[\pi_i]),$$
where $\mathrm{pa}_T[\cdot]$ is a function that returns the parents of $\pi_i$, if any, in the tree $T$. Each conditional $p(\pi_i \mid \mathrm{pa}_T[\pi_i])$ is treated tabularly with one parameter per table entry: each table entry is a unique configuration of the feature $\pi_i$ and its parents $\mathrm{pa}_T[\pi_i]$. We place a symmetric Dirichlet prior with concentration parameter $\alpha = 5$ over the table entries of each $p(\pi_i \mid \mathrm{pa}_T[\pi_i])$. This corresponds to add-5 smoothing.
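As a minimal illustration of this factorisation, the sketch below evaluates the joint probability of a feature vector as a product of smoothed conditional tables. The three-feature tree, the counts and the names `parents`, `counts`, `cpt` and `joint` are hypothetical and chosen only for illustration; the smoothing adds 5 to every table entry, following the add-5 reading given in the text.

```python
from itertools import product

# Hypothetical toy tree over three binary features: 0 -> 1 and 0 -> 2.
parents = {0: None, 1: 0, 2: 0}

# Made-up counts, indexed as counts[feature][parent_value][feature_value].
# The root has no parent, so its single row is stored under the key None.
counts = {
    0: {None: [30, 70]},
    1: {0: [25, 5], 1: [10, 60]},
    2: {0: [20, 10], 1: [35, 35]},
}

ALPHA = 5  # pseudo-count added to every table entry (assumed add-5 smoothing)


def cpt(i, parent_value):
    """Smoothed distribution of feature i given the value of its parent."""
    row = counts[i][parent_value]
    total = sum(row) + ALPHA * len(row)
    return [(c + ALPHA) / total for c in row]


def joint(pi):
    """p(pi): one conditional factor per feature, multiplied together."""
    prob = 1.0
    for i, value in enumerate(pi):
        parent = parents[i]
        parent_value = None if parent is None else pi[parent]
        prob *= cpt(i, parent_value)[value]
    return prob


# The joint sums to one over all 2^3 possible toy "languages".
total_mass = sum(joint(pi) for pi in product([0, 1], repeat=3))
```

Because every conditional table is normalised separately, the product is a proper distribution without any global normalisation step.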
Probabilistic Implications. Although the original Greenbergian view of implications is deterministic, we argue that a probabilistic approach is more suitable. Indeed, logical implications are a special case of conditional probabilities that only take the values 0 and 1, rather than values in [0, 1]. Specifically, we argue that probabilistic implications should take the form of the following conditional probability distribution:
$$p(\pi_i \mid \pi_j) \propto \sum_{\bar{\pi}} p(\pi_i, \pi_j, \bar{\pi}),$$
where $\bar{\pi}$ is the subvector of $\pi$ that omits the indices $i$ and $j$. In words, our goal is to sum over all possible languages, holding two typological features, $\pi_i$ and $\pi_j$, fixed. We note that since our model $p$ factorises according to the tree $T$, this sum may be performed in polynomial time using dynamic programming, specifically the belief propagation algorithm (Pearl, 1982). We contend that this improves upon the ideas of Daumé III and Campbell (2007), who only considered pairwise interactions of features: our definition of probabilistic implications marginalises out all other features.
Discovering Probabilistic Implications. How can we use a generative model to discover typological implications? What we would like to know is how often $p(\pi_i \mid \pi_j)$ is significantly different from $p(\pi_i)$. We note that $p(\pi_i)$ can also be computed with BP. We now reduce the search for typological implications to asking when the quantity $|p(\pi_i \mid \pi_j) - p(\pi_i)|$ is statistically significantly greater than 0. Given a sufficiently expressive generative model $p$, this allows for a richer notion of implication than Greenberg originally proposed, as it admits the softer notion of typological influence.
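The quantity above can be illustrated on a toy model. The sketch below computes a conditional, a marginal and their absolute difference by brute-force enumeration over all feature vectors, which stands in for belief propagation and is only feasible for a handful of features. The three-feature chain, its probability tables and the helper `prob` are hypothetical.

```python
from itertools import product

# Hypothetical chain 0 -> 1 -> 2 with made-up conditional tables.
p0 = [0.4, 0.6]                          # p(pi_0)
p1_given_0 = [[0.9, 0.1], [0.2, 0.8]]    # indexed [pi_0][pi_1]
p2_given_1 = [[0.7, 0.3], [0.1, 0.9]]    # indexed [pi_1][pi_2]


def joint(pi):
    """Joint probability of a full feature vector under the chain."""
    return p0[pi[0]] * p1_given_0[pi[0]][pi[1]] * p2_given_1[pi[1]][pi[2]]


def prob(i, value_i, given=None):
    """p(pi_i = value_i | given), where given maps feature index -> value.

    All features that are neither queried nor fixed are summed out.
    """
    given = given or {}
    numerator = denominator = 0.0
    for pi in product([0, 1], repeat=3):
        if all(pi[j] == v for j, v in given.items()):
            denominator += joint(pi)
            if pi[i] == value_i:
                numerator += joint(pi)
    return numerator / denominator


conditional = prob(2, 1, given={0: 1})   # p(pi_2 = 1 | pi_0 = 1)
marginal = prob(2, 1)                    # p(pi_2 = 1)
gap = abs(conditional - marginal)        # size of the candidate implication
```

Here feature 0 influences feature 2 only through feature 1, yet the gap is non-zero, which is the kind of indirect influence that summing out intermediate features is meant to capture.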
Learning the Structure of p. There are many ways to learn the tree structure $T$, and we choose the PC algorithm of Neapolitan (2004). This algorithm works in two steps. First, it learns a skeleton graph with undirected edges from the data (in our case, a typological database). Next, it orients these edges so as to form a directed acyclic graph. Once we have fit this graph so as to represent $p(\pi)$, we are left with a tractable model we can use to predict held-out typological features and discover typological implications.
Parameter Estimation. We apply maximum a posteriori (MAP) inference in order to estimate the parameters of our model. If all the data were observed, i.e., if there were no missing values in WALS, this could be achieved by counting and normalising across the typological database in question with the previously mentioned Dirichlet prior. (The prior simply corresponds to add-λ smoothing.) However, in many cases we do have missing data. In fact, we almost never observe all the values in WALS. Thus, we must rely on expectation-maximisation to perform MAP estimation (Dempster et al., 1977). The gist of the algorithm is simple: we compute "pseudocounts" for the missing entries using belief propagation, which we smooth as if they had been observed values. Using these pseudocounts, we get a new estimate of the parameters by count-and-divide as in the fully supervised case. We iterate between updating the pseudocounts and performing count-and-divide. This is a standard technique in the literature.
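The loop described above can be sketched on a two-feature toy model. This is an illustration under assumptions, not the authors' implementation: the data are made up, the expected counts are obtained by enumerating completions rather than by belief propagation, and the smoothing constant is added directly to each expected count.

```python
from itertools import product

ALPHA = 5  # pseudo-count per table entry (assumed add-5 smoothing)

# Made-up data for a model pi_0 -> pi_1; None marks a missing annotation.
data = [(1, 1), (1, 1), (0, 0), (1, None), (None, 0), (0, 0), (1, 1), (None, None)]


def joint(pi, p0, p1):
    """p(pi_0, pi_1) = p(pi_0) * p(pi_1 | pi_0)."""
    return p0[pi[0]] * p1[pi[0]][pi[1]]


def em_step(p0, p1):
    """One E-step (expected counts) followed by one M-step (count-and-divide)."""
    c0 = [0.0, 0.0]                    # expected counts for pi_0
    c1 = [[0.0, 0.0], [0.0, 0.0]]      # expected counts for (pi_0, pi_1)
    for row in data:
        # E-step: distribute one unit of count over all completions that
        # agree with the observed entries, weighted by the current model.
        completions = [pi for pi in product([0, 1], repeat=2)
                       if all(o is None or o == v for o, v in zip(row, pi))]
        z = sum(joint(pi, p0, p1) for pi in completions)
        for pi in completions:
            w = joint(pi, p0, p1) / z
            c0[pi[0]] += w
            c1[pi[0]][pi[1]] += w
    # M-step: smooth the expected counts and normalise each table row.
    new_p0 = [(c + ALPHA) / (sum(c0) + 2 * ALPHA) for c in c0]
    new_p1 = [[(c + ALPHA) / (sum(r) + 2 * ALPHA) for c in r] for r in c1]
    return new_p0, new_p1


# Start from uniform tables and iterate a fixed number of times.
p0, p1 = [0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]
for _ in range(50):
    p0, p1 = em_step(p0, p1)
```

Fully observed rows contribute whole counts, while partially observed rows contribute fractional counts that change from one iteration to the next as the tables are re-estimated.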
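Decoding. In Section 4, we are interested in predicting typological features given others. If we wish to predict $\pi_i$ given the observed features $\pi_{\mathrm{obs}}$ of a language, we compute
$$\hat{\pi}_i = \arg\max_{\pi_i \in \{0, 1\}} p(\pi_i \mid \pi_{\mathrm{obs}}) = \arg\max_{\pi_i \in \{0, 1\}} \sum_{\pi_{\mathrm{unobs}}} p(\pi_i, \pi_{\mathrm{unobs}} \mid \pi_{\mathrm{obs}}),$$
where we marginalise out all those features $\pi_{\mathrm{unobs}}$ that are unobserved or held out in a given language. The conditional may be computed with belief propagation and the argmax is over the set {0, 1}. This makes the computation tractable.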
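A small sketch of this decoding rule follows, assuming a hypothetical star-shaped tree in which feature 0 is the parent of features 1, 2 and 3. The unobserved parent is summed out by enumeration, which again stands in for belief propagation; the tables and the function `decode` are illustrative only.

```python
from itertools import product

# Hypothetical star tree: feature 0 is the parent of features 1, 2 and 3.
p_root = [0.5, 0.5]                      # p(pi_0)
p_child = [[0.8, 0.2], [0.3, 0.7]]       # p(pi_c | pi_0), shared by all children


def joint(pi):
    """Joint probability of a full four-feature vector."""
    prob = p_root[pi[0]]
    for child_value in pi[1:]:
        prob *= p_child[pi[0]][child_value]
    return prob


def decode(i, observed, n=4):
    """Predict feature i from a dict {feature index: observed value}.

    Returns the argmax over {0, 1} and the full posterior p(pi_i | observed).
    Features that are neither observed nor queried are marginalised out.
    """
    scores = [0.0, 0.0]
    for pi in product([0, 1], repeat=n):
        if all(pi[j] == v for j, v in observed.items()):
            scores[pi[i]] += joint(pi)
    z = sum(scores)
    posterior = [s / z for s in scores]
    prediction = max((0, 1), key=lambda v: posterior[v])
    return prediction, posterior


# Features 1 and 2 are observed as "on"; feature 0 is unobserved.
prediction, posterior = decode(3, {1: 1, 2: 1})
```

The two observed siblings shift belief about their shared parent, and that belief in turn determines the prediction for the held-out feature.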

WALS: A Typological Database
Before explaining our experimental setup, we first describe the data set we use in evaluation. We evaluate on the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013), which is the largest openly available typological database. It comprises approximately 200 linguistic features with annotations for more than 2,500 languages. These annotations have been made by expert typologists through meticulous study of grammars and field work. WALS is quite sparse, however, as only 100 of these languages have annotations for all features. For instance, Figure 2 shows the distribution of consonant inventory sizes across the languages for which this feature is annotated. Although this is not our main contribution, the fact that we can predict held-out features offers a way to fill in the feature value gaps which exist for the vast majority of languages.
Pre-processing. We pre-process our data similarly to Daumé III and Campbell (2007). We filter out features which are not encoded for at least 100 languages, and feature values which occur for fewer than 10% of the languages. The reason for this is that any implications found for exceedingly rare features are likely to be inconclusive. We further follow Daumé III and Campbell (2007) in that we binarise features with more than 7 feature values such that they simply encode whether or not a language has a feature. For instance, implicants are not likely to determine the number of tones in a language, but rather the presence or absence of tones. Finally, Daumé III and Campbell (2007) take into account that languages are not independent, as phylogenetic similarity can help infer features in closely related languages. We do not use this information, as we are interested in finding implications which ought to be independent of language relatedness.
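The filtering steps above might be sketched as follows. Several details are assumptions rather than facts from the text: the table layout (feature name mapped to a dictionary of language annotations), the `absent_value` argument naming which value of a many-valued feature counts as absence, and the choice to compute the 10% share relative to the languages annotated for that feature.

```python
from collections import Counter


def preprocess(table, absent_value=None, min_languages=100,
               min_share=0.10, max_values=7):
    """Filter and binarise a toy feature table.

    table maps feature name -> {language: value}; missing annotations are
    simply absent from the inner dictionary.
    """
    absent_value = absent_value or {}
    kept = {}
    for feature, annotations in table.items():
        if len(annotations) < min_languages:
            continue  # feature is annotated for too few languages
        if len(set(annotations.values())) > max_values:
            # Collapse many-valued features to presence versus absence.
            none = absent_value.get(feature)
            annotations = {lang: ("absent" if v == none else "present")
                           for lang, v in annotations.items()}
        counts = Counter(annotations.values())
        # Assumption: the share is relative to languages annotated for this feature.
        frequent = {v for v, c in counts.items()
                    if c / len(annotations) >= min_share}
        filtered = {lang: v for lang, v in annotations.items() if v in frequent}
        if filtered:
            kept[feature] = filtered
    return kept


# Made-up table: "A" has one rare value, "B" has too few languages,
# and "tones" has nine distinct values and is therefore binarised.
toy = {
    "A": {f"lang{i}": ("x" if i < 100 else "y" if i < 115 else "z")
          for i in range(120)},
    "B": {f"lang{i}": "x" for i in range(50)},
    "tones": {f"lang{i}": ("none" if i < 80 else f"t{i % 8}")
              for i in range(160)},
}
clean = preprocess(toy, absent_value={"tones": "none"})
```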

Two Typological Experiments
In order to evaluate our probabilistic approach to typological implications, we define two tasks. First, we carry out an empirical evaluation based on predicting features, so as to obtain an objective measure of our model that is comparable both to previous work and to other strong baselines. Second, we include a qualitative evaluation, as we are interested in uncovering both known and novel typological implications.

Predicting Typological Features
Feature prediction is a commonly used task in evaluating how well a given model is able to explain the typological features of languages (Daumé III and Campbell, 2007; Malaviya et al., 2017; Cotterell and Eisner, 2018; Ponti et al., 2018; Bjerva et al., 2019a). This is an important task which can highlight the extent to which a model has learned interdependencies between languages and features. We include this evaluation to first show that our model has predictive power which surpasses strong baselines, before investigating the main research question of this work, i.e., the extent to which we can uncover probabilistic implications. We evaluate the models on feature prediction by fitting our model on 80% of the languages in WALS, and holding out 10% of the languages each for development and testing. We split the evaluation of our model across the feature categories present in WALS. These cover areas such as phonology and morphology, and are listed in Table 1. During the typological feature prediction experiments, we consider a single such WALS category at a time. We vary the number of implicants by allowing the model to observe 2 to 6 features from within this category, as well as the values of features in other categories. We restrict the number of observed in-category features because having access to, e.g., all other word-order features when predicting a final word-order feature would make the task much easier than in our setting. Hence, our experiment shows the extent to which increasing the number of features from the current feature category affects predictive power. For a category with a total of $n$ features and $k$ implicants, with $k$ ranging from 2 to 6, this gives us $\binom{n}{k}$ implicant sets. For each such set, we attempt to predict all held-out features in that category in a leave-one-out-style evaluation. This results in $\binom{n}{k}(n - k)$ predictions to make per category per number of implicants $k$.
Performance is measured by averaging the accuracy of predictions of all held-out features over the entire test set, across categories.
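To make the size of this evaluation concrete, the sketch below enumerates the (implicant set, held-out feature) pairs for one category and checks the count against the closed form. The feature names and the helper `evaluation_pairs` are hypothetical.

```python
from itertools import combinations
from math import comb


def evaluation_pairs(features, k):
    """Yield (implicants, target) pairs for one category and one value of k.

    Every size-k subset of the category serves as the implicant set, and
    each remaining feature of the category is predicted in turn.
    """
    for implicants in combinations(features, k):
        for target in features:
            if target not in implicants:
                yield implicants, target


# A hypothetical category with n = 8 features, evaluated with k = 3 implicants.
features = [f"f{i}" for i in range(1, 9)]
pairs = list(evaluation_pairs(features, 3))
expected = comb(len(features), 3) * (len(features) - 3)
```

The number of predictions grows quickly with the size of the category, which is one practical reason to cap the number of implicants.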
Baseline #1: Most frequent. Since many typological features have low-entropy distributions, a most frequent class baseline is a relatively strong lower bound for prediction of typological features. For instance, this yields an accuracy of 45% when predicting the canonical subject-object-verb ordering in a language.
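A most frequent class baseline can be sketched in a few lines. The word-order counts below are made up so that the majority class covers 45% of the toy test set; they are not WALS figures, and the function name is hypothetical.

```python
from collections import Counter


def most_frequent_baseline(train_values, test_values):
    """Predict the majority training value for every test language."""
    majority = Counter(train_values).most_common(1)[0][0]
    correct = sum(value == majority for value in test_values)
    return majority, correct / len(test_values)


# Made-up annotations for a single word-order feature.
train = ["SOV"] * 45 + ["SVO"] * 40 + ["VSO"] * 15
test = ["SOV"] * 9 + ["SVO"] * 8 + ["VSO"] * 3
majority, accuracy = most_frequent_baseline(train, test)
```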
Baseline #2: Pairwise prediction. We implement a simple baseline based on pairwise prediction of typological features. This is inspired by the approach in Daumé III and Campbell (2007). As their code was not publicly available, we provide our own non-Bayesian implementation.
Baseline #3: PRA. Since WALS can be seen as a knowledge base, we apply a strong baseline from the field of knowledge base population. The Path Ranking Algorithm (PRA) finds relation paths by traversing the knowledge graph, which can then be used to predict implications and feature values (Lao and Cohen, 2010; Lao et al., 2011). We train PRA using the standard hyperparameters of the existing implementation, which includes regularising with $\ell_1 = 0.001$ and $\ell_2 = 0.001$, as well as using negative sampling.
Baseline #4: Language embeddings. Although we aim to predict implications, and not only feature values, we compare with previous work on predicting typological features in WALS (Bjerva and Augenstein, 2018a). As their setup is different, we use their highest reported score as a baseline.
Feature Prediction Results. Table 1 contains the results from feature prediction across the chapters outlined in WALS. Our implementation is able to predict features across categories above baseline levels. As the number of implicants increases, predictive power tends to increase. This is not the case for all feature categories, however. One such case is Nominal Syntax, in which performance peaks at 3 implicants. This is expected, as correlations only exist between some features; at a certain point, access to more typological features no longer helps performance. Note that although the baseline numbers are based on predicting the same features as our model, the baseline models do not observe the same features during prediction. For instance, Baseline #4 does not make predictions based on other feature values, but is trained on one feature at a time.

Discovering Typological Implications
Having established that our method bests several competitive baselines for prediction of typological features, we next look at what implications our probabilisation of typology allows us to find. We search for those conditional probabilities where the quantity $|p(\pi_i \mid \pi_j) - p(\pi_i)|$ is statistically significantly greater than 0, as found with an independent two-tailed t-test. After adjusting for multiple tests with the Bonferroni correction, we report those implications where p < 0.05. We report the full list of implications found by our model in the Supplements and show a subset of these in Table 2. We note that we are able to find the same implications listed by Daumé III and Campbell (2007), some of which are listed in the table. These implications include Greenberg universals (Greenberg, 1963), showing that our approach to probabilisation of linguistic universals is suitable to replicate previous work.
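The multiple-testing step can be illustrated with a short sketch of the Bonferroni correction, in which each raw p-value is multiplied by the number of tests before being compared to the threshold. The candidate implications and their p-values below are made up, and how the raw p-values are obtained from the t-test is outside the scope of this sketch.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the tests that stay significant after Bonferroni correction.

    p_values maps a test name to its raw p-value; each value is multiplied
    by the number of tests and capped at 1.
    """
    m = len(p_values)
    adjusted = {name: min(1.0, p * m) for name, p in p_values.items()}
    return {name: p for name, p in adjusted.items() if p < alpha}


# Made-up raw p-values for four candidate implications.
raw = {
    "OV > Postpositions": 0.0001,
    "OV > SV": 0.01,
    "Postpositions > SV": 0.02,
    "Tones > SV": 0.4,
}
significant = bonferroni(raw)
```

With four tests, a raw p-value of 0.02 is adjusted to 0.08 and no longer passes the 0.05 threshold, whereas 0.01 is adjusted to 0.04 and is retained.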
Transitivity across implications. At first glance, it is not clear why postpositions should imply SV word order, as stated in #4. Yet, #2 is a well-established universal (Greenberg, 1963) and #3 comes with strong statistical evidence: SV order is much more frequent than VS word order in OV languages (98.44% of these are predominantly SV). Our model has thus used transitive reasoning of the
The power of multiple implicants. Implications #10 and #11 concern the order between nouns and their numeral modifiers. The two main alternatives here, Noun-Numeral and Numeral-Noun, are of comparable frequency in WALS; they occur in 607 and 479 languages, respectively, i.e., Noun-Numeral holds the majority with only 55%. If we consider each of the three implicants listed in implication #11 on their own, the strongest statistical power goes to the Degree word-Adjective feature: conditioned on this feature, the Numeral-Noun order holds in 79% of the relevant languages. The combination of all three implicants, on the other hand, results in a subset of languages with 91% Numeral-Noun order. The Numeral-Noun order can thus be implied with considerably more confidence from a combination of multiple implicants.

Related Work
Typological implications outline the space of possible languages, based on evidence from observed languages, as recorded and classified by linguists (Greenberg, 1963; Lehmann, 1978; Hawkins, 1983). While work in this direction has traditionally been manual, typological knowledge bases now exist (Dryer and Haspelmath, 2013; Partick Littel and Levin, 2016), which allows for automated discovery of implications. Although previous computational work exists (Daumé III and Campbell, 2007), we are the first to introduce a probabilisation of typological implications. In addition to work on finding implications based on known features, there is an increasing amount of work on computational methods for discovering typological features (Ponti et al., 2018). Work in this area includes unsupervised discovery of word order (Östling, 2015) or other linguistic features (Asgari and Schütze, 2017), typological probing of language representations (Bjerva et al., 2019b; Beinborn and Choenni, 2019), and prediction of typological features in WALS (Georgi et al., 2010; Malaviya et al., 2017; Bjerva and Augenstein, 2018a,b; Cotterell and Eisner, 2017, 2018; Bjerva et al., 2019a).

Conclusions
We defined the notion of probabilistic implications, and presented a computational model which successfully identifies known universals, including Greenberg universals, but also uncovers new ones, worthy of further investigation by typologists. Additionally, our approach outperforms strong baselines for prediction of typological features.