Real-Valued Logics for Typological Universals: Framework and Application

This paper proposes a framework for the expression of typological statements which uses real-valued logics to capture the empirical truth value (truth degree) of a formula on a given data source, e.g. a collection of multilingual treebanks with comparable annotation. The formulae can be arbitrarily complex expressions of propositional logic. To illustrate the usefulness of such a framework, we present experiments on the Universal Dependencies treebanks for two use cases: (i) empirical (re-)evaluation of established formulae against the spectrum of available treebanks and (ii) evaluating new formulae (i.e. potential candidates for universals) generated by a search algorithm.

The availability of comparable treebanks (syntactically annotated corpora) for a growing number of typologically distinct languages, most prominently in the collaborative Universal Dependencies project (Nivre et al., 2016), has led to a recent surge of interest in computational work aiming to detect systematic patterns in the grammatical systems of natural languages and/or to test hypotheses from theoretical work in language typology against empirical evidence. The treebank-based approach (Liu, 2010; Lochbihler, 2017; Gerdes et al., 2019; Bjerva et al., 2019c; Hahn et al., 2020) adds a more data-driven perspective to a strand of research in computational typology (Daumé and Campbell, 2007; Malaviya et al., 2017; Oncevay et al., 2019; Bjerva et al., 2019a; Bjerva et al., 2019b) that is based on carefully curated typological databases such as WALS (Dryer and Haspelmath, 2013) or URIEL.
The research strand in computational typology which relies on databases essentially builds on the language features that the long tradition of typological research has identified as most relevant for identifying language universals. Examples of such features are the relative order of verbs and their objects, and the order of nouns and their dependents such as adjectives, numerals and genitives. In the computational research relying on typological knowledge bases, the features are typically assumed to be Boolean and universals are formulated as propositional formulae. A major focus has been on (a) detecting universals that have the form of an implication between two typological variables, and (b) predicting the value of unknown features in typological databases based on systematic patterns in attested grammatical systems. Graphical models have been widely used to calculate the strength of an implication (Daumé and Campbell, 2007; Lu, 2013; Bjerva et al., 2019b; Bjerva et al., 2019a). While this approach is suitable if one wants to marginalize out the influence of confounding variables, it also constrains the investigated universals to have the form of an implication consisting of one implicand and usually one (but possibly multiple) implicant(s).
In principle, comparable treebanks can provide the basis for observing the empirical distribution of arbitrary grammatical patterns and thus for exploring a much larger space of potential candidates for universal typological properties, including combinations of more than two variables that cannot be reduced to logical implication. However, an integration of such an approach with linguistically grounded hypothesis checking has to address two related challenges: (i) a theoretically guided way of navigating the enormous space of candidate propositions has to be developed, and (ii) a perspicuous framework is required for expressing multi-variable propositions and for evaluating them empirically against a full collection of comparable treebanks, while doing justice to the possibility of language-internal variation and tentative preferences by modeling features as real-valued. This paper proposes an expressive framework that addresses the latter challenge. We specify a formalism and its semantics to evaluate typological formulae of arbitrary logical complexity. The core elements of our framework are customizable, which gives prospective users the freedom to use their own implementations.
We also include a method for counteracting the bias in the sample of well-studied languages (which are much more likely to be included among the languages with a treebank), many of which are phylogenetically closely related, while other language families are only very sparsely represented.
To demonstrate its usefulness for empirical hypothesis testing, we run a number of experiments: 1) We re-evaluate established universals to test whether the evidence for universals provided by the framework reflects the broad consensus. 2) Universals are always evaluated on a subset of all natural languages. We investigate how reliably an evaluation result obtained from a subset of languages or language families transfers to unknown languages. 3) We use the framework to search for new potential universals.

Language-property matrix
Universals are modelled as logical formulae which consist of variables and logical connectives. The variables are organized in an |L| × |P| matrix V, where L is the set of investigated languages and P is the set of typological properties/features. The variable $v_{\ell p}$ represents the value of the property p for the language ℓ. In our framework, these values are truth values in the range [0, 1]. In the simplest case, they are binary truth values as in the example matrix (1).
Formulae can be constructed from the language-property matrix by logical connectives, such as the conjunction in (2).

Valuation function
The valuation function $\mathcal{V}$ maps formulae to truth values. In particular, it defines the logical connectives for negation (¬), conjunction (∧), disjunction (∨) etc. For example, the valuation for conjunction in Boolean logic is defined as
$$\mathcal{V}(\varphi \wedge \psi) = \begin{cases} 1 & \text{if } \mathcal{V}(\varphi) = 1 \text{ and } \mathcal{V}(\psi) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
With the language-property matrix in (1) and the valuation in (3), the formula in (2) evaluates to 1.
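As a rough illustration of how a language-property matrix and a Boolean valuation interact, consider the following sketch. The language names, property names and values are hypothetical placeholders (they are not the example matrix (1)), and `conj_bool` and `evaluate_conjunction` are illustrative helper names only.

```python
# Hypothetical toy matrix: rows are languages, columns are Boolean properties.
# These values are invented for illustration and are NOT the paper's matrix (1).
V = {
    "lang_a": {"SVO": 1, "Prep": 1},
    "lang_b": {"SVO": 0, "Prep": 0},
}


def conj_bool(x: int, y: int) -> int:
    """Boolean conjunction: 1 only if both operands are 1, otherwise 0."""
    return 1 if x == 1 and y == 1 else 0


def evaluate_conjunction(matrix, lang, p, q):
    """Evaluate the formula v_{lang,p} AND v_{lang,q} for a single language."""
    return conj_bool(matrix[lang][p], matrix[lang][q])


result_a = evaluate_conjunction(V, "lang_a", "SVO", "Prep")
result_b = evaluate_conjunction(V, "lang_b", "SVO", "Prep")
```

Swapping `conj_bool` for a real-valued connective is the only change needed to move from Boolean to real-valued truth values, which is the sense in which the valuation is a customizable component.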

Averaging over languages
Let $\mathcal{V}(\varphi_\ell)$ be the truth value of the formula φ for the language ℓ ∈ L under the valuation $\mathcal{V}$. Then |L| truth values can be calculated for φ, one for every language. A universal is a statement that is supposed to hold for "all" languages, in the statistical rather than the absolute sense, but a universal quantification (i.e. a conjunction of the values) would be misleading since it is not robust to outliers. The score for a formula is hence calculated as the (weighted) average
$$s(\varphi) = \frac{\sum_{\ell \in L} w(\ell)\,\mathcal{V}(\varphi_\ell)}{\sum_{\ell \in L} w(\ell)}$$
with w being the weight function. An unweighted average corresponds to a weighted average with all weights being 1.
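The contrast between universal quantification and averaging can be made concrete with a small sketch. The per-language truth values below are invented, and `score` is a hypothetical helper that implements a plain weighted mean, with all weights defaulting to 1.

```python
def score(truth_values, weights=None):
    """Weighted average of per-language truth values.

    truth_values: dict mapping language -> truth value in [0, 1]
    weights:      dict mapping language -> non-negative weight (default: all 1)
    """
    if weights is None:
        weights = {lang: 1.0 for lang in truth_values}
    total_weight = sum(weights[lang] for lang in truth_values)
    weighted_sum = sum(weights[lang] * truth_values[lang] for lang in truth_values)
    return weighted_sum / total_weight


# Hypothetical truth values of one formula for three languages.
truth = {"lang_a": 1.0, "lang_b": 1.0, "lang_c": 0.0}

# A conjunction over all languages (min in the real-valued case) collapses to 0
# because of a single outlier, whereas the average still reflects the majority.
as_conjunction = min(truth.values())
as_average = score(truth)
as_weighted = score(truth, {"lang_a": 0.5, "lang_b": 0.5, "lang_c": 1.0})
```

Here a single counterexample drives the conjunction to 0, while the unweighted average remains at two thirds; changing the weights shifts the score without changing the per-language truth values.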

Implementation
The framework allows implementing individual definitions for the variable matrix V, the valuation $\mathcal{V}$ and the weight function w. We experimented with various implementations for all of those (Dönicke, 2020); this section explains the set-up used for the experiments in Section 3. The UD treebanks v2.5 (Zeman et al., 2019) consist of dependency treebanks for 90 languages from 20 families and 39 subfamilies. The treebanks have a uniform annotation, which allows defining properties based on dependency constructions and extracting statistics for various languages of the world. This data can be used to investigate syntactic universals.

Variables from Universal Dependencies
We extracted a language-property matrix from the UD treebanks. The properties are specific constructions and the values are their relative frequencies. We extracted two types of properties:
• Single-link property: relative frequency of a construction involving one head and one dependent.
• Double-link property: relative frequency of a construction involving one head and two dependents.
In the definitions of these properties, #[*](ℓ) denotes how often the construction * appears for the language ℓ. The subscripts of v represent the constructions: hyphens connect words, the head is represented by its part-of-speech tag, and each dependent is represented by its dependency relation and its part-of-speech tag, combined by a colon. For example, $v_{\text{nsubj:noun-verb-obj:noun}}$ denotes the double-link property in which a nominal subject precedes and a nominal object follows a verbal head (S-V-O order).
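The step from construction counts to relative frequencies can be sketched as follows. This is a minimal sketch under the assumption that the denominator is the total count of all competing orderings of the same head and dependents; the counts are invented, and `relative_frequencies` is a hypothetical helper, not part of any UD tooling.

```python
def relative_frequencies(counts):
    """Normalise raw counts of competing orderings to relative frequencies.

    Assumption: all keys in `counts` are orderings of the same head and
    dependents, so they share one denominator (their total count).
    """
    total = sum(counts.values())
    if total == 0:
        # No evidence for any ordering in this language.
        return {construction: 0.0 for construction in counts}
    return {construction: n / total for construction, n in counts.items()}


# Hypothetical counts for one language (not taken from any treebank).
single_link_counts = {"amod:adj-noun": 30, "noun-amod:adj": 10}
double_link_counts = {
    "verb-nsubj:noun-obj:noun": 10,
    "nsubj:noun-verb-obj:noun": 70,
    "nsubj:noun-obj:noun-verb": 10,
    "verb-obj:noun-nsubj:noun": 5,
    "obj:noun-verb-nsubj:noun": 5,
    "obj:noun-nsubj:noun-verb": 0,
}

single = relative_frequencies(single_link_counts)
double = relative_frequencies(double_link_counts)
```

Because the orderings of one construction share a denominator, their relative frequencies sum to 1, which is what makes them usable as mutually exclusive real-valued variables.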

Fuzzy logic
The relative frequencies are in the interval [0, 1], hence our valuation function has to define logical connectives for a real-valued logic. A common example of a real-valued logic is fuzzy logic (Zadeh, 1965), which defines negation, conjunction and disjunction as follows:
$$\mathcal{V}(\neg\varphi) = 1 - \mathcal{V}(\varphi)$$
$$\mathcal{V}(\varphi \wedge \psi) = \min(\mathcal{V}(\varphi), \mathcal{V}(\psi))$$
$$\mathcal{V}(\varphi \vee \psi) = \max(\mathcal{V}(\varphi), \mathcal{V}(\psi))$$
Implication and equivalence are shortcuts for combinations of the three previous connectives:
$$\mathcal{V}(\varphi \Rightarrow \psi) = \mathcal{V}(\neg\varphi \vee \psi)$$
$$\mathcal{V}(\varphi \Leftrightarrow \psi) = \mathcal{V}((\varphi \Rightarrow \psi) \wedge (\psi \Rightarrow \varphi))$$
All of these valuation functions are generalizations of Boolean logic. Other examples of many-valued logics are described in e.g. Smith (2012) and many other works. We chose fuzzy logic, in contrast to e.g. probabilistic logic (Mizraji, 1992), because the definitions of the fuzzy connectives do not make any assumption about the dependence of typological properties (cf. Dubois and Prade (1993)). Properties like "SVO" and "prepositions" could both be considered a subtype of head-initiality and are therefore not independent from a linguistic point of view. In probabilistic logic, the truth value of a conjunction is defined as the product of each conjunct's truth value, which assumes the independence of the conjuncts.
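The connectives can be written down in a few lines. The sketch below assumes the standard Zadeh operators (1 − x, min, max) with implication and equivalence derived from them; the function names are illustrative, and the product conjunction is included only to show the contrast with probabilistic logic mentioned above.

```python
def f_not(x):
    """Fuzzy negation."""
    return 1.0 - x


def f_and(x, y):
    """Fuzzy conjunction (minimum)."""
    return min(x, y)


def f_or(x, y):
    """Fuzzy disjunction (maximum)."""
    return max(x, y)


def f_implies(x, y):
    """Implication as a shortcut for (NOT x) OR y."""
    return f_or(f_not(x), y)


def f_equiv(x, y):
    """Equivalence as the conjunction of both implications."""
    return f_and(f_implies(x, y), f_implies(y, x))


def p_and(x, y):
    """Product conjunction, which presupposes independent conjuncts."""
    return x * y


# Hypothetical values: a language with 90% SVO order and 90% prepositions.
svo, prep = 0.9, 0.9
fuzzy_conj = f_and(svo, prep)    # stays at the weaker conjunct
product_conj = p_and(svo, prep)  # is pushed down by the independence assumption
```

On Boolean inputs these functions reproduce the classical truth tables, and for two strongly correlated properties the minimum does not penalise the conjunction the way the product does.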

Phylogenetic weighting
To confirm a typological universal, it is not sufficient to validate it on as many languages as possible; it is also necessary to validate it on languages from many language families. For example, if a candidate for a universal is tested on English, German, French, Italian, Spanish and Japanese and the universal holds for all of them but Japanese, then Japanese is not simply an outlier: it is also possible that the "universal" only holds within the Indo-European languages. Traditionally, typologists counteract the influence of overrepresented language families through different sampling methods (cf. Bickel (2011), Song (2018)), e.g. sampling only languages with different values (on the properties of interest) for each family (Dryer, 1989; Bickel, 2008). This "genealogical sampling", however, requires binary/categorical values and a representative database of the world's languages, and is not applicable in our experiments with the UD treebanks. Thus, we give each language family equal importance through the choice of the weight function w. The weights for the current example are shown in Figure 1. This approach does not undersample languages; instead, all available data is used. To the best of our knowledge, this is a novel method that maximizes the usage of all available data while alleviating the sampling bias in the data. We believe its utility is worth further examination; in our preliminary experiments, however, the phylogenetic weighting demonstrates better agreement with the existing universals from the literature than the unweighted average.
As Bickel (2011) and Dryer (2018) point out, geography is also an important factor, i.e. neighboring languages are more likely to have common properties than distant languages. However, both works agree that measuring geographic distance of languages brings its own complications and therefore it is not taken up in this paper.
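To see the effect of family-level weighting on the six-language example above, consider the following sketch. It rests on an assumption: each language receives the weight 1 / (number of sampled languages in its family), which is one simple way of giving every family the same total weight and ignores subfamilies. The exact weight function used in the experiments may differ, and `family_weights` and `score` are hypothetical helper names.

```python
from collections import Counter

# The six languages from the example, with their families.
family = {
    "English": "Indo-European",
    "German": "Indo-European",
    "French": "Indo-European",
    "Italian": "Indo-European",
    "Spanish": "Indo-European",
    "Japanese": "Japonic",
}

# The candidate universal holds for all languages except Japanese.
truth = {lang: 1.0 for lang in family}
truth["Japanese"] = 0.0


def family_weights(family_of):
    """ASSUMED weighting: every family gets a total weight of 1, shared equally."""
    sizes = Counter(family_of.values())
    return {lang: 1.0 / sizes[fam] for lang, fam in family_of.items()}


def score(truth_values, weights):
    """Weighted average of per-language truth values."""
    total = sum(weights[lang] for lang in truth_values)
    return sum(weights[lang] * truth_values[lang] for lang in truth_values) / total


unweighted = score(truth, {lang: 1.0 for lang in truth})
weighted = score(truth, family_weights(family))
```

Under this assumption the unweighted score is about 0.83, while the family-weighted score drops to 0.5, because the evidence comes from only two families, one of which contradicts the candidate.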

Experiments
The goal of our experiments is to demonstrate the usefulness of the proposed framework for empirical exploration of the language-typological space beyond simple implications and Boolean variables. The framework is to serve as a methodological tool and should ultimately be complemented with a theoretically motivated agenda for exploring systematic correlations among underexplored properties of grammatical systems. To establish how the framework can be used we proceed in three steps: First, we re-evaluate some famous universals of Greenberg (1963) and discuss the interpretations of varying truth values and standard deviations (Sec. 3.1).
In the second step, we demonstrate a scenario for testing unexplored candidates for typological universals (which can consist of arbitrary formulae from propositional logic) against the full spectrum of treebanks, taking into account variability with respect to each property in the real-valued logic. We work with a simple procedure for generating candidate formulae: taking over the shape of potentially relevant logical combinations from established universals and replacing the properties they include with new ones. A framework is useful for empirical exploration if its diagnostics are robust: a statement capturing a systematic relationship should generalize from one sufficiently large sample of languages to another. We test this by splitting the set of treebanks into an estimation set and a held-out set (Sec. 3.2).
Finally, we enumerate implications of two variables from a selected set of single-link properties and discuss the top-scoring formulae, also in comparison to the findings from previous work on binary properties (Sec. 3.3).
The formulae are ranked by framework score, i.e. the weighted average truth value. Eleven of the formulae score 0.90 or higher, i.e. they could be classified as "very true". #2 is an equivalence and #16 is a quasi-equivalence, both of which are generally harder to fulfil than one-way implications (in fuzzy logic, the truth value of an equivalence is defined as the minimum of the truth values of the two composed implications). The formula with the lowest score, #13, is an example of two conflicting universal tendencies, namely 1) the tendency to order subordinate clause and main verb the same as object and verb (this is predicted by different linguistic theories, e.g. the head-dependent theory), and 2) the tendency to put clauses after their single-word siblings (e.g. Dryer (2003)). One can reformulate #13 to take both tendencies into account by flipping the word orders on both sides of the implication or, equivalently, reversing the direction of the implication. This yields s = 0.95 (σ = 0.09).
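The equivalence of the two reformulations can be checked directly. The short derivation below is a sketch under two assumptions: flipping a word order corresponds to negating the property (the two orders are complementary, so their values sum to 1), and implication is valued as $\max(1 - \mathcal{V}(\varphi), \mathcal{V}(\psi))$. The symbols a and b are placeholders for the two sides of an implication such as #13.

```latex
\begin{align*}
\mathcal{V}(\neg a \Rightarrow \neg b)
  &= \max\bigl(1 - \mathcal{V}(\neg a),\ \mathcal{V}(\neg b)\bigr) \\
  &= \max\bigl(1 - (1 - \mathcal{V}(a)),\ 1 - \mathcal{V}(b)\bigr) \\
  &= \max\bigl(1 - \mathcal{V}(b),\ \mathcal{V}(a)\bigr) \\
  &= \mathcal{V}(b \Rightarrow a)
\end{align*}
```

Contraposition thus carries over from Boolean logic to these real-valued connectives, so both reformulations receive the same truth value for every language and hence the same score.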
σ denotes the (weighted) standard deviation, which ranges between 0 and 0.5. A value of 0 means that the formula is equally true for all languages; a value of 0.5 means that the formula is absolutely true for half of the languages and absolutely false for the other half. Compare e.g. #3 (VSO ⇒ prepositions) and #4 (SOV ⇒ postpositions), and the respective Figures 2 (a) and (b), which have similar values for s but different values for σ. Most of the 90 languages in the treebanks show (near) 0% VSO order or 100% prepositions and therefore get a truth value of 1 for #3. Only some languages (at the bottom-left corner of Figure 2 (a)) get lower truth values for #3, which raises the standard deviation to 0.01. Regarding #4, there are more languages showing mid-range values for SOV order and postpositions, and therefore also mid-range truth values for #4. Since most languages still have a high truth value for #4, the score remains at 0.96 but the standard deviation of 0.12 signals a greater number of languages with truth values diverging from that score.
Addition is not a logical connective but is used here to calculate the value of properties which cannot be directly extracted from the UD treebanks. For example, the degree to which a language has S-O order is the percentage occupied by V-S-O, S-V-O and S-O-V constructions. Thus, the value $v_{\text{nsubj:noun-obj:noun}}$ can be calculated as $v_{\text{verb-nsubj:noun-obj:noun}} + v_{\text{nsubj:noun-verb-obj:noun}} + v_{\text{nsubj:noun-obj:noun-verb}}$. Addition is restricted to variables that are mutually exclusive, i.e. that are calculated with the same denominator (compare eq. (7)). That said, (out-of-logic) addition in our framework is comparable to addition of probabilities of disjoint events in probability theory; it is not as powerful and does not serve the same purpose as (in-logic) addition ("strong disjunction") in certain real-valued logics, e.g. Łukasiewicz logic (Łukasiewicz and Tarski (1930); cf. Bergmann (2008, p. 179)), where it is possible to add arbitrary variables. The result of an addition thus expresses the truth value for any (combination) of several alternative constructions being present in a language.
Greenberg (1963), as others, uses the terms "dominant" and "alternative" to indicate two degrees of relative frequency of word orders. This binary distinction cannot be directly expressed with UD variables because they already express infinitely many degrees of relative frequency. We denote "p is a dominant word order in ℓ" as $v_{\ell p}$ and "p is an alternative word order of q in ℓ" as $v_{\ell p} + v_{\ell q}$, since we think that these are the most appropriate real-valued equivalents.
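The range of the standard deviation σ described in this section can be verified with a small sketch. It assumes the population-style weighted formula (squared deviations from the weighted mean, normalised by the sum of the weights); the helper names are illustrative.

```python
import math


def weighted_mean(values, weights):
    """Weighted average of the per-language truth values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_std(values, weights):
    """Weighted standard deviation around the weighted mean.

    Assumption: squared deviations are normalised by the sum of the weights.
    """
    mean = weighted_mean(values, weights)
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(variance)


# Formula absolutely true for half of the languages and false for the rest:
# this is the extreme case for values in [0, 1].
split_case = weighted_std([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])

# Formula equally true for all languages, regardless of the weights.
uniform_case = weighted_std([0.7, 0.7, 0.7], [1.0, 2.0, 3.0])
```

The first case reaches the upper bound of 0.5 and the second gives (numerically) 0, matching the two boundary interpretations given for σ.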

Universal prediction
For the second experiment, we generate over 200k formulae from the Greenbergian universals in Table 1 by keeping the structures and varying the variables. (The aim of keeping the structures is to reduce the search space and to generate only formulae that have the "usual shape" of traditional universals.) Logical tautologies are excluded by prohibiting variables from appearing more than once within the same formula. We evaluate these formulae in six runs. In each run, we randomly split the languages into two disjoint sets for every formula and evaluate it on both sets separately. The idea is to predict the score for the second set of languages from the score on the first set of languages. Ideally, the scores from both sets should be identical because this means that the initial score is reliable even for unknown data. After each run, we take the two values $s_1(\varphi)$, $s_2(\varphi)$ for every formula φ and calculate the overall root-mean-square error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\Phi|} \sum_{\varphi \in \Phi} \bigl(s_1(\varphi) - s_2(\varphi)\bigr)^2}$$
where Φ denotes the set of evaluated formulae. This measures how similar our predictions are to the actual values. Since we are especially interested in high-scoring formulae, we additionally calculate the RMSE on only those value pairs where $s_1(\varphi) \geq 0.90$.
The six runs differ in the split method. 1) The languages are split either on the language, subfamily or family level. For the latter two, the languages of a (sub)family are either completely in the first or the second set. 2) The languages/subfamilies/families are either split 50%/50% or 20%/80%. The 20%/80% split is the more realistic one, since data is often available for only a few languages/families, from which one wants to predict how true a formula is in general.
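The split-and-compare procedure can be sketched as follows. This is an illustrative sketch, not the experimental code: `split_by_group` and `rmse` are hypothetical helpers, the language and family names are placeholders, and rounding the number of groups in the first set (with a minimum of one) is an assumption.

```python
import math
import random


def split_by_group(group_of, fraction, rng):
    """Randomly split languages into two disjoint sets along group boundaries.

    group_of: dict language -> group (family, subfamily, or the language itself
              for a language-level split)
    fraction: share of groups that goes into the first set (e.g. 0.5 or 0.2)
    """
    groups = sorted(set(group_of.values()))
    rng.shuffle(groups)
    k = max(1, round(fraction * len(groups)))  # assumed rounding rule
    first_groups = set(groups[:k])
    first = {lang for lang, g in group_of.items() if g in first_groups}
    second = set(group_of) - first
    return first, second


def rmse(pairs):
    """Root-mean-square error over (s1, s2) score pairs, one pair per formula."""
    pairs = list(pairs)
    return math.sqrt(sum((s1 - s2) ** 2 for s1, s2 in pairs) / len(pairs))


# Placeholder languages in four placeholder families.
family_of = {
    "l1": "fam_a", "l2": "fam_a", "l3": "fam_b", "l4": "fam_b",
    "l5": "fam_c", "l6": "fam_c", "l7": "fam_d", "l8": "fam_d",
}
first, second = split_by_group(family_of, 0.5, random.Random(0))

# Hypothetical (s1, s2) pairs for three formulae.
error = rmse([(0.95, 0.90), (0.50, 0.80), (0.10, 0.10)])
```

Passing a mapping from each language to itself gives the language-level split, and restricting the pairs to those with a high first score before calling `rmse` gives the second, filtered error measure.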
The results in Table 2 illustrate two things. The less surprising finding is that evaluations on two balanced sets are more similar than evaluations on imbalanced sets, arguably because an evaluation on 20% of the data is much less representative than an evaluation on 80% of the data, whereas the two sets of a 50/50 split are about equally representative. Also, splitting on the language level yields more similar evaluation results than evaluating on two sets with completely different (sub)families. The second finding is that the RMSE is generally lower for formulae that are very true in at least one set of languages. This is also visible in Figure 3: the data points scatter in the shape of a lens around the identity graph, i.e. formulae that are found to be very true or very false on one set of languages are likely to achieve a similar score on a completely different set of languages. Formulae with mid-range truth values (0.2-0.8), on the other hand, exhibit a much larger variance. This indicates that certain formulae have a higher potential of being universals than others. For subfamilies and families, the prediction error increases (the appendix contains scatter plots for all six runs).

Search for universals
Implications between two variables have been investigated in many previous studies. In the third experiment, we choose ten dependency relations and calculate pairwise implications between the corresponding single-link properties. Since each relation can be left- or right-directed, there are 20 × 20 possible implications (the appendix contains the table with all implications). Table 3 ranks the implications by average truth value. As a matter of fact, the subject precedes its head in the majority of UD languages, resulting in high scores for implications where "HEAD-nsubj" is the antecedent or "nsubj-HEAD" is the consequent (principle of explosion). Those implications are responsible for about half of the rows in Table 3. For example, (#19) HEAD-advmod ⇒ nsubj-HEAD and (#20) advmod-HEAD ⇒ nsubj-HEAD are quite uninteresting implications since both achieve a score of 0.92 but have complementary antecedents. Thus, they only reveal that most languages exhibit subject-verb order, independently of the position of adverbs. However, there are also some formulae that were already discovered in other works and are also included in the Universals Archive (UA), such as
(#22) a. amod-HEAD ⇒ nummod-HEAD
b. "IF the descriptive adjective precedes the noun, THEN, with overwhelmingly more than chance frequency, the demonstrative and the numeral do likewise." (Universals Archive, #57),
which is part of the 18th universal of Greenberg, part of the 57th in the Universals Archive and the 7th in Daumé and Campbell (2007).
The implication with the highest score,
(#1) a. acl-HEAD ⇒ nmod-HEAD
b. "IF the relative clause precedes the noun, THEN the genitive precedes the noun." (Universals Archive, #176),
concerning the order of adjectival clauses and nominal noun modifiers and first described by Hawkins (1983), is listed in the UA but was not detected by the computer-assisted models of Daumé and Campbell (2007) and Bjerva et al. (2019b). This shows that corpus statistics can have an advantage over knowledge bases in some cases. Last but not least, the ranking also suggests universals that, as far as we know, have not been described so far, e.g. the following four and their composition in (1).
Although the suggested universals are not guaranteed to be linguistically sound or meaningful, they could certainly provide inspiration to typologists for careful examination and interpretation.
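The enumeration of pairwise implications, and the reason why a dominant subject-before-head order inflates many scores, can be sketched as follows. The frequencies are invented, only two relations are used instead of ten, the average is unweighted for brevity, and the implication is assumed to be valued as max(1 − x, y); all function names are illustrative.

```python
from itertools import product

# Hypothetical share of the dependent-before-head order per language.
freq = {
    "lang_a": {"nsubj": 0.95, "advmod": 0.30},
    "lang_b": {"nsubj": 0.90, "advmod": 0.80},
    "lang_c": {"nsubj": 0.85, "advmod": 0.50},
}


def directed_properties(lang_freq):
    """Turn each relation into a left- and a right-directed single-link property."""
    props = {}
    for rel, v in lang_freq.items():
        props[f"{rel}-HEAD"] = v        # dependent precedes head
        props[f"HEAD-{rel}"] = 1.0 - v  # dependent follows head
    return props


def implies(x, y):
    """Assumed fuzzy implication: (NOT x) OR y."""
    return max(1.0 - x, y)


def rank_implications(freq):
    """Score every ordered pair of properties (including self-pairs) and sort."""
    per_lang = {lang: directed_properties(f) for lang, f in freq.items()}
    names = sorted(next(iter(per_lang.values())))
    scores = {}
    for antecedent, consequent in product(names, repeat=2):
        values = [implies(p[antecedent], p[consequent]) for p in per_lang.values()]
        scores[(antecedent, consequent)] = sum(values) / len(values)  # unweighted
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_implications(freq)
scores = dict(ranking)
```

With two relations there are 4 × 4 = 16 implications. In this toy data both `HEAD-advmod ⇒ nsubj-HEAD` and `advmod-HEAD ⇒ nsubj-HEAD` score highly even though their antecedents are complementary, which mirrors the observation about #19 and #20: the score is driven by the consequent alone.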

Conclusion
This paper proposes a framework for the expression of typological statements which uses real-valued logics to capture the empirical truth value of a formula in a given collection of comparable treebanks. We demonstrate the application of the framework by evaluating established formulae as well as new formulae generated from a search algorithm. The components of the framework can be customized: users not working with the Universal Dependencies can simply exchange the data component; the weighting approach we provide can also be easily exchanged.
If a user wants to empirically check their own linguistic features, it is straightforward to obtain the relative frequencies from the treebank collection, complementing existing curated databases such as WALS. Very little manual work is needed. Both approaches have their pros and cons: typological knowledge bases could be subjective or incomplete, while data from the treebanks are subject to problems such as annotation inconsistencies, genre differences etc.
This paper discussed some experiments which have the purpose of providing an illustration of the expressiveness of the framework, focusing here on word-order variables. However, other variables could be expressed as well, and the framework also supports inclusion and combination of variables from completely different sources (e.g. phonological properties of a language), as long as the values range between 0 (false) and 1 (true). This especially allows using variables from corpora and from typological knowledge bases together.