NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task

This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with the prediction of linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy of 0.66 on 149 test languages.


Introduction
The rapidly developing field of computational linguistic typology (Ramat, 2011) is becoming increasingly popular in natural language processing (NLP) research (Bender, 2016; Ponti et al., 2019), where typological corpora such as the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) and AUTOTYP (Bickel et al., 2020) are seeing increased use (Naseem et al., 2012; Burdick et al., 2020). Despite their popularity, typological corpora are very sparse. According to Murawaki and Yamauchi (2018), less than 15% of the feature values are present for the 2,679 languages represented in WALS. These databases are human-curated and depend on grammatical descriptions for their development; sparsity is often due to linguistic sources lacking data on particular features for a given language. Developing methodologies for accurately predicting missing typological features on the basis of existing knowledge is therefore crucial for a wider adoption of typological resources in NLP tasks and beyond (Evans and Levinson, 2009). This paper presents the work done by the "NEMO Team" (Google London and Tokyo) on the constrained subtask of the SIGTYP 2020 Shared Task (Bjerva et al., 2020). We experimented with a variety of machine learning models using only the features provided in the training, development and test sets. Our features included genetic features (genus and family), areal features (clusters of languages within a radius of the target language), and derived implicational universals. Originally introduced by Greenberg (1963) and demonstrated to capture syntactic typology well (Dryer, 1992, 2009; Dunn et al., 2011), the framework of universal implications is used here, similarly to others (Daumé III and Campbell, 2007), to capture correlations between other types of typological features as well. We describe in detail how these features were derived in Section 3.
As we report below in Section 4, the performance of machine learning algorithms varied across different feature predictions, with some algorithms working better for some features and less well for others. On balance, however, we found that ridge regression (Hoerl and Kennard, 1970), also known as Tikhonov regularization (Franklin, 1974), was the most useful approach.

Related Work
Here we review the approaches corresponding to the constrained subtask of the shared task, where no external data, such as unlabeled texts, is allowed.
A popular approach to modeling the typological diversity of the world's languages is based on Bayesian probabilistic inference. Despite recently drawing some criticism on linguistic grounds by Ono (2020), this approach possesses impressive explanatory power. In what possibly represents the earliest model-based typological feature imputer, Daumé III and Campbell (2007) introduced a probabilistic Bayesian model for uncovering universal implications in WALS data by associating random variables with individual WALS features and discovering the inter-feature correlations from statistical dependence between the random variables. The modeling power of the Bayesian approach was further demonstrated by Daumé III (2009) in a nonparametric hierarchical Bayesian model combining linguistic areas and phylogeny. Murawaki (2015) proposed a deep learning approach to phylogenetic inference by mapping the language vectors to a latent space using an autoencoder trained with a typologically-inspired objective on WALS data, with missing values imputed using a regularized iterative variant of Multiple Correspondence Analysis (MCA) (Husson, 2012; Josse et al., 2012). In later work, Murawaki (2017, 2019) and Murawaki and Yamauchi (2018) abandoned the earlier model in favor of a Bayesian autologistic approach and demonstrated the superiority of the Bayesian predictor over MCA. In this approach, the languages are represented as random variables that are explained in terms of other languages related to each other through phylogenetic and spatial neighborhood graphs. Bjerva et al. (2019) introduce a generative model inspired by the Chomskyan principles-and-parameters framework, drawing on the correlations between typological features of languages to tackle the novel task of typological collaborative filtering, a concept borrowed from the area of recommender systems.
While most state-of-the-art missing feature imputation methods are model-based, recently Buis and Hulden (2019) employed an iterative technique based on singular value decomposition (SVD) from the well-studied area of low-rank matrix completion and reported performance on par with the prior art. Although lacking explanatory power, similarly to MCA, such techniques are attractive due to their simplicity and computational efficiency. Takamura et al. (2016) took a standard machine learning approach by training multinomial logistic regression classifiers for individual WALS features based on other features present in the database under various experimental conditions, hypothesizing that the classifiers would capture feature correlations implicit in the data.
The frequentist approach we take is similar in some respects to the work of Takamura et al. (2016) in that we also train vector space classifiers, but there are a few notable differences: we (i) use a typologically-motivated "probabilistic" frequency-based input space by explicitly representing areal and phylogenetic associations and implicational universals, and (ii) explore a wider range of classification approaches.
Although for this task there is special interest in the geographic aspect of the modeling, for our final submission we limited ourselves to the rather orthodox approach of representing language associations through fixed neighborhoods, in order not to overcomplicate our method (described in the next section), which can be difficult to reconcile with more sophisticated models, such as the very promising model of language evolution from Kauhanen et al. (2019) and the findings emerging from the fields of dialectology and dialectometry (Szmrecsanyi, 2011; Wieling and Nerbonne, 2015; Nerbonne et al., 2020).

Method
Here we outline the details of the approach used to generate the final submission. The open-source implementation of our training and evaluation pipeline has been released into the public domain.¹

Precomputation of Features
Typological features were preprocessed to find likely associations between genetic and areal properties of the language. For each typological feature f and value v from the set V_f of its values we computed the following probability estimates:

P(v | f, family) = count(f = v, family) / count(f, family),   (1)
P(v | f, genus) = count(f = v, genus) / count(f, genus),   (2)
P(v | f, area) = count(f = v, area) / count(f, area),   (3)

where count(f, ·) = Σ_{v ∈ V_f} count(f = v, ·). Here, "family" and "genus" in equations 1 and 2 were as given in the data, and "area" in equation 3 comprised all languages within a 2,500 kilometer radius around the target language's latitude and longitude, computed using the Haversine formula (Robusto, 1957). In addition we computed a set of implicational universals (Greenberg, 1963). For each feature value v_j of feature f_j, we compute the probability of v_j given f_j and each (f_i, v_i) pair from the set of known feature-value pairs in the data:

P(v_j | f_j, f_i, v_i) = count(f_j = v_j, f_i = v_i) / count(f_i = v_i).   (4)
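The estimates in equations 1–3 are plain relative frequencies over groups of languages, with the areal group gated by Haversine distance. A minimal sketch in Python follows; the record layout and all function names are ours for illustration, not those of the released pipeline:

```python
import math
from collections import Counter, defaultdict

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def value_priors(languages, feature, group_of):
    """P(v | feature, group): relative frequency of each value v among the
    languages that `group_of` maps to the same family/genus/area key."""
    counts = defaultdict(Counter)
    for lang in languages:
        value = lang["features"].get(feature)
        if value is not None:
            counts[group_of(lang)][value] += 1
    return {g: {v: n / sum(c.values()) for v, n in c.items()}
            for g, c in counts.items()}

def neighborhood(languages, target, radius_km=2500.0):
    """Areal grouping: all languages within `radius_km` of the target."""
    return [l for l in languages
            if haversine_km(l["lat"], l["lon"], target["lat"], target["lon"]) <= radius_km]
```

For the family priors of equation 1 one would call `value_priors(langs, feature, lambda l: l["family"])`; the genus and areal cases differ only in the grouping function.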
For each of the genetic (family and genus), areal (neighborhood) and universal implication types of associations we kept a separate table where, for all the known features, we stored:
• the feature f,
• the total number of samples in the data of the given type with f, denoted N(f),
• the value with the highest estimated probability per the above equations, denoted v_maj, and
• the prior p_maj corresponding to v_maj.
Examples of the most likely associations computed above are shown in the first three rows of Table 1. For the Niger-Congo language family, the most likely value for the feature Green_and_Blue (observed 9 times) is 3 Black/green/blue, with the corresponding prior 0.667. For the Bantoid language genus, the feature Green_and_Blue was observed twice, both times with the same value 3 Black/green/blue. The areal example corresponds to the neighborhood of Yoruba,² (φ, λ) = (8.0, 4.3) (where φ denotes latitude), for which 13 Green_and_Blue features were observed, with the most likely value 3 Black/green/blue and prior 0.692.

² A reviewer suggested ranking neighboring languages according to closeness and using weighted clustering instead. We agree that in principle more sophisticated approaches would be nice, but one should bear in mind that the geographic centroids for languages provided in the data are at best crude, and so doing anything more sophisticated seemed to us hard to justify. Also, to do this properly, distance is really not sufficient: one would also need to account for the presence of possible barriers to contact, including impassable mountain ranges, seas, and hostile neighbors, elements that would be hard to model.
In addition, for the implicational features we stored the prior probability of v_j given the conditional feature from equation 4, and the total count for f_j. As an example of a weak implicational preference, consider the example given in the fourth row of Table 1, which means that if a language has 2 Red/yellow for the feature Red_and_Yellow, then there is a slight preference (p = 0.583, estimated on the basis of 12 examples) for having 3 Black/green/blue as the value for Green_and_Blue. In other words, 2 Red/yellow ⊃ 3 Black/green/blue. There were 54 cases of Green_and_Blue in the training data, for which the estimated a priori probability of 3 Black/green/blue is 0.148. Table 2 shows the overall sizes of the association tables for the genetic, areal and implicational types described above for the different partitions of the shared task data.
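The conditional estimates of equation 4 can be accumulated from pairwise co-occurrences of known feature-value pairs within each language. A sketch under our own toy data layout:

```python
from collections import Counter

def implication_priors(languages):
    """Estimate P(fj = vj | fi = vi) as
    count(fi = vi, fj = vj) / count(fi = vi with fj known),
    for every ordered pair of distinct features known for the same language."""
    joint = Counter()  # ((fi, vi), fj, vj) -> co-occurrence count
    cond = Counter()   # ((fi, vi), fj)     -> count of fi = vi with fj known
    for lang in languages:
        items = list(lang["features"].items())
        for fi, vi in items:
            for fj, vj in items:
                if fi == fj:
                    continue
                joint[((fi, vi), fj, vj)] += 1
                cond[((fi, vi), fj)] += 1
    return {key: joint[key] / cond[(key[0], key[1])] for key in joint}
```

With real WALS-style data, `priors[(('Red_and_Yellow', '2 Red/yellow'), 'Green_and_Blue', '3 Black/green/blue')]` would then hold the estimated strength of the implication 2 Red/yellow ⊃ 3 Black/green/blue.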

Sparse Language Vectors
For each typological feature we train a separate feature estimator, resulting in 185 estimators overall. When training an individual estimator, we represent each language in the training and development sets as a sparse feature vector. The languages whose features need to be predicted at test time are represented likewise.
The makeup of the individual language vector for a given typological feature and a language is shown in Table 3. The vector consists of dense and sparse subvectors; the components shown in the first four rows of the table are mostly dense. The first subvector consists of the language's latitude and longitude coordinates, represented as two numeric features exactly as given in the shared task data. There is no particular rationale for this choice other than that we previously found, for a different task (speech synthesis), that choosing alternative representations for the language location (e.g., distances to all other languages in the training set) in the input features did not significantly improve the results (Gutkin and Sproat, 2017).

[Table 3: makeup of the language vector, with columns Subvector, Size, Components, Type.]

The next three subvectors, representing genus, family and area, are structured similarly using the three components stored in the association tables described previously: the majority value v_maj(f), represented as a categorical feature, the prior p_maj(f) corresponding to this value, and the feature frequency N(f), both represented as numeric features. For these three subvectors, missing values are represented by the three-tuple (v_∅, 10⁻⁶, 0), where v_∅ denotes a global dummy typological feature value.³

The first four subvectors described above are followed by multiple subvectors representing individual universal implications, as shown in the fifth row of Table 3. Each implicational, describing the dependence of feature f_j on f_i, is represented as a five-tuple whose elements are stored in the associations table for implicational universals: the most likely value v_maj(f_j) of f_j corresponding to the highest conditional probability P(v_j | f_j, f_i, v_i) (interpreted as the probability of f_j taking value v_j given that f_i is v_i), the total count of f_j when f_i is v_i, the prior for v_j given f_j alone, and the total count for f_j. A missing implicational is represented as the five-tuple (v_∅, 10⁻⁶, 0, 10⁻⁶, 0). As mentioned above, the implicational portion of the language vector is very sparse because, for a feature f ∈ F that the language vector belongs to, one needs to compute all its correlations to the other features g ∈ F, where F is the set of all 182 known features. Since the typological database is very sparse, most of the observed correlations between f and g for a given language are poorly instantiated.

³ The choice of 10⁻⁶ for a missing value prior is arbitrary. Any small nonzero value valid for a log transform is suitable.
The categorical features in the language vector are represented using a one-hot encoding and the numeric features are scaled to zero mean and unit variance. For the probability components, prior to numeric feature scaling, the probability features are transformed into the log domain.
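The encoding of one association-table subvector (categorical majority value, log-transformed prior, scaled count) can be sketched with scikit-learn primitives; the toy values below, including the 10⁻⁶ missing-value prior, are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One subvector per language: majority value, its prior, and the count N(f).
v_maj = np.array([["3 Black/green/blue"], ["3 Black/green/blue"], ["v_missing"]])
p_maj = np.array([[0.667], [1.0], [1e-6]])  # 1e-6 marks the missing-value prior
n_f = np.array([[9.0], [2.0], [0.0]])

one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(v_maj).toarray()
log_prior = StandardScaler().fit_transform(np.log(p_maj))  # log domain, then z-score
count = StandardScaler().fit_transform(n_f)
subvector = np.hstack([one_hot, log_prior, count])  # 2 one-hot + 1 + 1 columns
```

The small nonzero prior for missing values exists precisely so that the `np.log` step above stays well-defined.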
The overall representation results in language vectors of rather high dimensionality. For example, the language vectors for the Order_of_Subject,_Object_and_Verb feature have a dimension of 4,111. As we shall see below from the shared task details, this representation may already be too specific given a training set that only contains 1,125 data points. This observation also explains our choice of representing features and implicationals with the attributes associated with their most likely majority values rather than with all the values observed for that feature in the data, as suggested by a reviewer: doing so would dramatically increase the dimension of the input feature space even further and render our approach completely intractable.

Experiments and Discussion
In what follows, we provide a brief overview of the datasets, introduce the baseline systems we evaluated against during the development of our method, report the evaluation results of miscellaneous machine learning algorithms on the development set that guided our final model selection, describe the two configurations submitted to the constrained subtask and, finally, mention the approaches that did not work well in our setting.

Data Overview
Some of the properties of the data used in our experiments⁴ are shown in Table 4, where we summarize the counts for three partitions of the data: the training and development sets, and the blind test set. For each of the sets, the total number of languages is shown along with the number of unique language families and genera. The count of typological feature types is displayed along with the number of unique feature values and the total number of observed values. As can be seen from the table, the size of the data is very small, which precludes us from using state-of-the-art deep learning methods, especially given our approach of representing each language as a point in a high-dimensional space.
The 1,357 longitude and latitude language coordinates provided in the data are displayed in Figure 1 (best viewed in color) using the Mercator projection. The languages in the training set are shown in blue, the development set languages in green and the test set languages in yellow.

Baselines
For our baselines, five simple prediction algorithms based on deterministic search were implemented. Initially we evaluated these systems on the development (DEV) set and later on the test set, after the golden truth data was released by the organizers.
The global majority class predictor (denoted B1) accumulates the frequencies of all the typological feature values in the training data and predicts the most frequent value for the feature in question, irrespective of its phylogenetic or areal attributes. The clade majority class predictor (denoted B2) extends the previous predictor by also taking into account language genera and families, predicting the majority value within the language's clade. Predictor B3 only relies on the distances between languages' geographic coordinates defined in the data. For each typological feature in question, the predictor returns the value belonging to the closest language, according to the Haversine formula (Robusto, 1957), with a known value for that feature. The Haversine distance d(l_i, l_j) is computed between each pair of languages l_i and l_j, represented as points on an ideal sphere, using their respective latitude and longitude coordinates (φ_i, λ_i) and (φ_j, λ_j). A naïve approach to combining areal and phylogenetic knowledge is implemented by predictor B4, which performs its search in two additional clusters of inter-language distances, grouped by genera and families: the algorithm first searches for the known feature within the geographically closest languages of the same genus, followed by a search within the closest languages from the same family, falling back to the closest languages from the global set.
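The simplest of these baselines can be sketched as follows (the record layout and function names are ours): B1 ignores geography entirely, while B3 reduces to nearest-neighbor search under the Haversine metric.

```python
import math
from collections import Counter

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) pairs given in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def predict_b1(train, feature):
    """B1: globally most frequent value of `feature` in the training data."""
    counts = Counter(l["features"][feature] for l in train if feature in l["features"])
    return counts.most_common(1)[0][0]

def predict_b3(train, target, feature):
    """B3: value of `feature` from the geographically closest language that has it."""
    known = [l for l in train if feature in l["features"]]
    nearest = min(known, key=lambda l: haversine_km(l["coord"], target["coord"]))
    return nearest["features"][feature]
```

B2 and B4 follow the same pattern, restricting the counting or the search to the genus and family clusters first.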
The final baseline configuration B5 uses an ensemble approach, combining areal and phylogenetic information through majority voting. The areal estimate is provided by the majority estimate from the candidate's neighborhood (as described in Section 3.1) or, if that information is not available due to the nearest language being located beyond the neighborhood radius, by the known feature from the closest languages outside the neighborhood. The phylogenetic clade estimates are provided by the majority class values from genera and families (B2), respectively. The final estimate is produced by majority voting, where at least two predictions have to agree; otherwise the predictor falls back to the global majority class estimates (B1).
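The B5 voting rule itself is tiny; a sketch under our own naming, assuming the three component estimates have already been computed:

```python
from collections import Counter

def predict_b5(areal, genus_majority, family_majority, global_majority):
    """B5: majority vote over the areal and two clade estimates; at least two
    of the three must agree, otherwise fall back to the global majority (B1)."""
    votes = Counter([areal, genus_majority, family_majority])
    value, support = votes.most_common(1)[0]
    return value if support >= 2 else global_majority
```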
The micro-averaged accuracies for the five baseline systems are shown in Table 5. We performed the evaluation by first obtaining the estimates from a set shown in the first column and then predicting each known feature value for the languages from a set shown in the second column (DEV or golden TEST) by pretending that this value is unknown. Best accuracies are highlighted in bold. As can be seen from the table, the best baseline configuration on the DEV set is the B5 ensemble method, with the pure clade-based B2 coming second. On the golden TEST set, B2 is the winning configuration, while the second best configuration is B4. It is interesting to note that the methods that completely ignore phylogeny (B1 and B3) did not perform well in either evaluation.
As one reviewer notes, the apparent predominance of phylogeny in our results may seem surprising given that areal features are well known to be important in many areas of the world, e.g. India (Emeneau, 1956). This is likely due at least in part to the fact that many of the language families in the sample are small families spoken in a relatively limited geographic area, or else are families like Pama-Nyungan, which are more or less isolated from unrelated neighbors. This would tend to confound the influence of phylogeny versus geography, since it is only when part of a language family is spoken in an area that is populated by speakers of unrelated families that one will see robust effects of geography.

Model Selection Using Development Set
We approach the feature prediction task by training 185 multi-class classifiers, one for each typological feature. Several standard machine learning algorithms were evaluated on the development set to establish the optimal algorithm for the task. In particular, we trained decision trees (Breiman et al., 1984), Naive Bayes (Langley et al., 1992) with Gaussian mixtures, Gaussian processes (Rasmussen and Williams, 2006), classifiers based on linear and quadratic discriminant analysis (LDA and QDA, respectively) (Tharwat, 2016), support vector machines (SVM) with a linear kernel (Suykens and Vandewalle, 1999), multinomial logistic regression (Böhning, 1992), ridge regression (Hoerl and Kennard, 1970) and simple feed-forward neural networks with a single layer of 200 units (DNN). In addition, three ensemble configurations were also evaluated: multi-class AdaBoost (with 100 estimators) (Zhu et al., 2009), random forests (with 200 estimators, a minimum of three samples per leaf and information gain as the splitting criterion) (Breiman, 2001) and bagging ensembles (Breiman, 1999). For all the algorithms we used the implementation provided by the scikit-learn toolkit.⁵ We mostly used the default hyperparameters, performing no special tuning. The evaluation results for each of the classifiers are shown in Table 6. As can be seen from the table, there are only five algorithms out of twelve which are remotely comparable to our development set baselines summarized in Table 5: DNN, linear SVM, multinomial logistic regression, random forest and ridge regression. Of those, the random forest ensemble method and ridge regression are the only competitive models, with the ridge regression estimator strongly outperforming all of our baselines on the development set, achieving a micro-averaged accuracy of 75.88%. On the basis of this result we chose the ridge regression algorithm for our experiments on the test set and the final submission.
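A per-feature estimator in this setup amounts to a multi-class `RidgeClassifier` trained on that feature's language vectors. A sketch on synthetic data (the dimensions and data below are ours, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature's language vectors: 60 languages, 8 dims.
X = rng.normal(size=(60, 8))
y = rng.integers(0, 3, size=60)  # 3 possible values for this feature
X[y == 1] += 2.0                 # make one class roughly linearly separable

clf = RidgeClassifier(alpha=1.0, class_weight="balanced")
clf.fit(X, y)
predictions = clf.predict(X)
```

In the full system one such classifier is fitted independently for each of the 185 typological features.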
It is worth noting, however, that it is not difficult to find cases where the ridge regression estimator does not perform as well as other methods. For example, for the feature SVNegO_Order, the DNN (93% accuracy) and SVM (87% accuracy) both outperform ridge regression (80%). We hypothesized that a better alternative to fixing a single method for all the features is to employ an ensemble of feature-specific classifiers. Although we include this approach in our implementation, so far we have only tested it in a scenario of performing iterative stratified fold cross-validation (Witten et al., 2011) over the training, rather than the development, set. The optimal feature-specific classifiers obtained by this method did not fare well on the development set, which is probably an indicator that the development set was not an optimal reflection of the training data.
Aside from the ridge regression and random forest ensemble methods, the poor average performance of the other classification algorithms on the development set is rather unexpected. We hypothesize that this may be due to several confounding factors. The first issue is that our approach of representing the training data for each typological feature as sparse language vectors results in small amounts of high-dimensional training data, which may not be enough to reliably train most of the classifiers in our particular setting. The second issue is the high feature sparsity, which adversely affects multi-class classification for most of the algorithms. Consider the Order_of_Subject,_Object_and_Verb feature.
The frequencies of its seven values (corresponding to class label counts) in the training data, represented as an ordered set, are {311, 226, 113, 51, 16, 6, 1}. Although we employ balanced class weighting for all the algorithms, using the label values to adjust the weights inversely proportionally to the class frequencies observed in the data, this may not be enough. Some unsatisfactory attempts to further remedy this are described below.
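The balanced weighting referred to above scales each class weight inversely to its frequency, w_c = N / (K · n_c) for N samples and K classes; for the counts just given this can be checked directly with scikit-learn's helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class label counts for Order_of_Subject,_Object_and_Verb in the training data.
counts = np.array([311, 226, 113, 51, 16, 6, 1])
y = np.repeat(np.arange(len(counts)), counts)  # expand to 724 per-sample labels

# "balanced": w_c = n_samples / (n_classes * count_c)
weights = compute_class_weight("balanced", classes=np.arange(len(counts)), y=y)
```

The rarest class (a single example) receives weight 724/7 ≈ 103.4, roughly 311 times that of the most frequent one, which illustrates how extreme the imbalance is for this feature.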
We further note that the relatively good performance of the ridge regression and random forest classifiers on the development data without any hyperparameter fine-tuning may possibly be explained by the relative robustness of both methods to sparseness and collinearity effects (which severely affect the other types of parametric and nonparametric predictors in our experiment), as previously analyzed in detail by Tomaschek et al. (2018) and observed by others in a typological setting (Burdick et al., 2020) and elsewhere (Josifoski et al., 2019). We briefly re-evaluate these assumptions on the released test set in Section 4.6.

Shared Task Submission
Our submission to the shared task consists of two ridge regression classifier configurations, denoted NEMO_system1 and NEMO_system2.⁶ The difference between the two configurations is in how the training data is generated from the original shared task subsets.
The training data for the NEMO_system1 classifier consists of the language vectors generated from the original training (train.csv) and development (dev.csv) sets only. For NEMO_system2, the training data also includes the phylogenetic, areal and implicational relations computed from the known features in the blind test set (test_blinded.csv).

Approaches That Disappointed
In addition to the experiments described above, we also tried applying missing feature imputation algorithms to the data provided, as a preprocessing step prior to training the classifiers described above. Neither of the feature imputation algorithms that we tried produced satisfactory results that could improve upon our best baseline. One of the imputation approaches that we evaluated was Multiple Imputation with Denoising Autoencoders (MIDAS) by Lall and Robinson (2020). This approach is particularly attractive because it natively supports categorical features. However, due to its reliance on a denoising autoencoder architecture (Vincent et al., 2010), the network requires large amounts of training data to be estimated reliably, which prevented us from obtaining adequate performance on the provided typological data.⁷ Similarly disappointing were the attempts to address the class imbalance problem during multi-class classifier training using class resampling techniques. Similar to the feature imputation above, resampling was applied as a preprocessing step prior to training the classifiers. In particular, neither the application of the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002) nor adaptive synthetic sampling (He et al., 2008) produced any improvements over our best baseline.

⁶ Submitted as NEMO_system1_assoc-train-dev_constrained and NEMO_system2_assoc-train-dev-test_constrained, respectively.
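SMOTE and ADASYN live in the imbalanced-learn package; to keep the sketch dependency-free we show plain random oversampling, the simplest member of this family of preprocessing steps (SMOTE instead interpolates synthetic minority points between neighbors). The function name and layout are ours:

```python
import numpy as np
from sklearn.utils import resample

def oversample_to_majority(X, y, seed=0):
    """Random oversampling: resample every class up to the majority-class size.
    (SMOTE/ADASYN generate synthetic interpolated points instead.)"""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        X_c = X[y == c]
        X_parts.append(resample(X_c, replace=True, n_samples=n_max, random_state=seed))
        y_parts.append(np.full(n_max, c))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

As with SMOTE in our experiments, such a step would be applied to each feature's training matrix before fitting the classifier.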
We also extended our method to use the country code information provided with the task data for each language. This was achieved by accumulating per-country typological feature priors, similar to the other areal and phylogenetic features, and representing the country codes as categorical features in classification, following the process described in Section 3. The ridge regression estimator that included country code information achieved a micro-averaged accuracy of 75.66% on the development set, a small deterioration of 0.22% compared to our best result (75.88%). This may be due to multiple errors in the provided data; e.g., the country code for the Chepang language from the Sino-Tibetan family spoken in Nepal (Caughley, 1982) is defined as US in the development set.

Discussion
We start with some observations about the rather apparent differences between the test data and the training and development data. In Table 7 we list the six language families evidenced in the test data, and the numbers of languages for each represented in the training and development data versus the test data. As can be seen, some of the language families are significantly more represented in the test data than they are in the data that was released earlier. This is particularly the case for Northern Pama-Nyungan, which comprised two languages in the training-development data, and twenty-four languages in the test data. This skew can also be seen in Figure 1, where the yellow dots represent the test data. There are islands of yellow in Australia, Northern South Asia, Central East Africa and in Central America, surrounded by seas of green and blue.⁸ Turning now to our results, we discuss here the performance of our best model NEMO_system2 (overall micro-averaged accuracy 0.66). First of all, we note that there is very little correlation between the number of exemplars of a feature in training and development, and our system's performance on that feature in the test set (correlation coefficient r = 0.21): see Figure 2.⁹ Second, comparing our results against the best baseline provided by the task organizers, frequency-baseline_constrained (overall micro-averaged accuracy 0.51), we list in Tables 8 and 9 the five features for which our system had the largest win and loss, respectively, in terms of absolute accuracy difference. Note that three of the five features for which we showed the largest gains relate to word order. While this may be due to the automatically derived implicational features used in our models, word-order-related features being probably among the most robust of the implicational universals, it is also true that these features are better instantiated in the data.
One reviewer notes that it would be interesting to see the results of ablation studies for some of the high-dimensional features discussed in Section 3.2. We agree, but leave this for future work.
Finally, recall from the discussion of development set-based model selection in Section 4.3 that ridge regression was the best performing model, followed by the nonparametric random forest predictor (see Table 6). Once the golden test data was released, we evaluated the same set of twelve classifiers on the test set, with the results shown in Table 10. As can be seen from the table, the ranking of the three best classifiers on the development set is preserved on the held-out test set as well: ridge regression is the best-performing classifier in our task, followed by random forest and, finally, multinomial logistic regression.

Conclusion
We have presented the NEMO submission to the 2020 SIGTYP shared task. Our system used genetic, geographical and automatically deduced implicational universals, and a range of classifiers, of which ridge regression yielded the best overall performance. Our method achieved 0.66 overall accuracy on the task, compared to the best baseline provided by the organizers, which had an accuracy of 0.51. Our system tended to do better on test language families that were better represented in the training and development data, though interestingly the correlation between the number of instances of features in the training/development data and performance on those features was not strong.
As discussed above, we include in the training package methods for ensembling classifiers so that one can find per-feature optimal classifiers, but we did not make use of this functionality for our submitted results. We do think, however, that further experimentation with this functionality would be useful.
Finally, as one of the reviewers notes, it would be interesting to ask of what use work along the lines presented here could be to the field linguist/typologist who is interested in testing potential relationships between languages based on shared features. Many of the systems reported in Section 2 had this aim in mind. While we do not want to overstress the value of this sort of work compared to good linguistic intuition, we do think that knowing the Bayesian priors of associations, as discussed in Section 3.1, could at least serve as a reminder that some feature settings may be shared simply because they are very common anyway.