Automatic Extraction of Rules Governing Morphological Agreement

Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, it is also a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results. Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to one created with large amounts of gold-standard annotated data. We confirm this finding with human expert evaluations of the rules that our framework produces, which have an average accuracy of 78%. We release an interface demonstrating the extracted rules at https://neulab.github.io/lase/.


Introduction
While the languages of the world are amazingly diverse, one thing they share is their adherence to grammars: sets of morphosyntactic rules specifying how to create sentences in the language. Hence, an important step in the understanding and documentation of languages is the creation of a grammar sketch, a concise and human-readable description of the unique characteristics of that particular language (e.g. Huddleston (2002) for English, or Brown and Ogilvie (2010) for the world's languages).
One aspect of morphosyntax that is widely described in such grammatical specifications is agreement, the process whereby a word or morpheme selects morphemes in correspondence with another word or phrase in the sentence (Corbett, 2009). Languages have varying degrees of agreement, ranging from none (e.g. Japanese, Malay) to a large amount (e.g. Hindi, Russian, Chichewa). Patterns of agreement also vary across syntactic subcategories. For instance, regular verbs in English agree with their subject in number and person, but modal verbs such as "will" show no agreement.
Having a concise description of these rules is of obvious use not only to linguists but also language teachers and learners. Furthermore, having such descriptions in machine-readable format will further enable applications in natural language processing (NLP) such as identifying and mitigating gender stereotypes in morphologically rich languages (Zmigrod et al., 2019).
The notion of describing a language "in its own terms" based solely on raw data has an established tradition in descriptive linguistics (e.g. Harris (1951)). In this work we present a framework (outlined in Figure 1) that automatically creates a first-pass specification of morphological agreement rules for various morphological features (Gender, Number, Person, etc.) from a raw text corpus for the language in question. First, we perform syntactic analysis, predicting part-of-speech (POS) tags, morphological features, and dependency trees. Using this analyzed data, we then learn an agreement prediction model that contains the desired rules. Specifically, we devise a binary classification problem of identifying whether agreement will be observed between a head and its dependent token on a given morphological property. We use decision trees as our classification model because they are easy to interpret and we can easily extract the classification rules from the tree leaves to get an initial set of potential agreement rules. Finally, we perform rule labeling of the extracted rules, identifying which tree leaves correspond to probable agreement. This is required because not all agreeing head/dependent token pairs are necessarily due to some underlying rule. For instance, in Figure 1's example of Greek gender agreement, the head and its dependent token (Ιταλίας→Αλβανίας) both have feminine gender, but this agreement is purely by chance, as correctly identified by our framework.

Figure 1: An overview of our method's workflow for gender agreement in Greek. The example sentence translates to "The port of Igoumenitsa is connected to many ports in Italy and Albania." First, we dependency parse and morphologically analyze raw text to create training data for our binary agreement classification task. Next, we learn a decision tree to extract the rule set governing gender agreement, and label the extracted leaves as either representing required or chance agreement. Finally, these rules are presented to a linguist for perusal.
The quality of the learnt rules depends crucially on the quality and quantity of dependency-parsed data, which is often not readily available for low-resource languages. Therefore, we experiment not only with gold-standard treebanks, but also with trees generated automatically by models trained using cross-lingual transfer learning. This assesses the applicability of the proposed method in a situation where a linguist may want to explore the characteristics of agreement in a language that does not have a large annotated dependency treebank.
We evaluate the correctness of the extracted rules by conducting a human evaluation with linguists for Greek, Russian, and Catalan. In addition to this manual verification, we also devise a new metric for automatic evaluation of the rules over unseen test data. Our contributions can be summarized as follows:
1. We propose a framework to automatically extract agreement rules from raw text, and release these rules for 55 languages as part of an interface (https://neulab.github.io/lase/) which visualizes the rules in detail along with examples and counter-examples.
2. We design a human evaluation interface to allow linguists to easily verify the extracted rules. Our framework produces a decent first-pass grammatical specification, with the extracted rules having an average accuracy of 78%. We also devise an automated metric to evaluate our framework when human evaluation is infeasible.
3. We evaluate the quality of extracted rules under real zero-shot conditions (on Breton, Buryat, Faroese, Tagalog, and Welsh) as well as low-resource conditions (with simulation experiments on Spanish, Greek, Belarusian, and Lithuanian), varying the amount of training data. Using cross-lingual transfer, rules extracted with as few as 50 sentences of gold-standard syntactic analysis are nearly equivalent to the rules extracted when hundreds or thousands of gold-standard sentences are available.

Problem Formulation
For a head h and a dependent d that are in a dependency relation r, we say that they agree on a morphological property f if they share the same value for that particular property, i.e. $f_h = f_d$. Some agreements that we observe in parsed data can be attributed to an underlying grammatical rule. For example, in Figure 2 the Spanish example A.1 shows a case where the subject (enigmas) and verb (son) must agree on number. We refer to such rules as required-agreement. Such a required-agreement rule dictates that an example like A.2 is ungrammatical and would not appear in well-formed Spanish sentences, since the subject and the verb do not have the same number marking. However, not all word pairs that agree do so because of some underlying rule, and we refer to such cases as chance-agreement. For example, in Figure 2 the object (perro) and verb (tiene) in B.1 agree in number only by chance, and example B.2 (where the object of a singular verb is plural) is perfectly acceptable.
Our goal is to extract, from textual examples, the set of rules $R_f^l$ that concisely describes the agreement process for feature $f$ in language $l$. Concretely, this indicates for which head-dependent pairs the language displays required-agreement and for which we will observe at most chance-agreement. Canonically, agreement rules are defined over syntactic features of a language, as seen in Figure 2, where we have the following rule for Spanish: "subjects agree with their verbs on number." (Agreement is sometimes governed by semantic features instead: in "The United Nations is ...", the subject is treated as singular for purposes of agreement despite being plural in form.) To formalize this notion, we define a rule to be a set of features defined over the dependency relation and the head and dependent token types. In this paper, we make the simplifying assumption that head and dependent tokens are represented only by part-of-speech features, as we would like our extracted rules to be concise and easily interpretable downstream, although this assumption could be relaxed in future work.

The rule discovery process consists of two major steps: a rule extraction step followed by a rule labeling and merging step (see also Figure 1).

Rule Extraction
To create our training data for rule extraction, we first annotate raw text with part-of-speech (POS) tags, morphological analyses, and dependency trees. We then base our training data on these annotations by converting each dependency relation into a triple $\langle h, d, r \rangle$, indicating the head token, the dependent/child token, and the dependency relation between them. From the whole treebank, we thus have input features $X_f = \{\langle h_1, d_1, r_1 \rangle, \ldots, \langle h_n, d_n, r_n \rangle\}$ and binary output labels $Y = y_1, \ldots, y_n$, where $y_i = 1$ if the head and the dependent token agree on feature $f$ (i.e. $f_h = f_d$) and $y_i = 0$ otherwise. We filter out the tuples where either of the linked tokens does not display the morphological feature $f$. We train a model for $p(Y|X)$ using decision trees (Quinlan, 1986) learnt with the CART algorithm (Breiman et al., 1984). A major advantage of decision trees is that they are easy to interpret, and we can visualize the exact features used to split nodes. The decision tree induces a distribution of agreement over the training samples in each leaf, e.g. 99% agree, 1% do not agree in Leaf-3 for gender agreement in Spanish (Figure 3(a)).
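To make this step concrete, the following is a minimal sketch of the training data construction in Python. The record layout and the name build_training_data are illustrative assumptions, not our released implementation; any real pipeline would read these fields from CoNLL-U-style parses.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def build_training_data(parsed_triples, feature):
    """Convert <head, dependent, relation> triples into (X, y) for one
    morphological feature, keeping only triples where both tokens mark it."""
    X, y = [], []
    for head, dep, rel in parsed_triples:
        if feature in head["feats"] and feature in dep["feats"]:
            X.append([head["pos"], rel, dep["pos"]])
            y.append(int(head["feats"][feature] == dep["feats"][feature]))
    return X, y

# Example: a Spanish subject-verb pair agreeing on Number ("enigmas ... son").
triples = [
    ({"pos": "VERB", "feats": {"Number": "Plur"}},   # head
     {"pos": "NOUN", "feats": {"Number": "Plur"}},   # dependent
     "subj"),
]
X, y = build_training_data(triples, "Number")

encoder = OneHotEncoder(handle_unknown="ignore")      # categorical -> one-hot
tree = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=1e-3)
tree.fit(encoder.fit_transform(X), y)
```

Because the inputs are purely categorical (POS tags and relation labels), one-hot encoding keeps each decision-tree split interpretable as a test on a single linguistic feature.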

Rule Labeling
Now that we have constructed a decision tree in which each leaf corresponds to a salient partition of the possible syntactic structures in the language, we label these leaves as required-agreement or chance-agreement. For this we apply a threshold on the ratio of agreeing training samples within a leaf: if the ratio exceeds the threshold, the leaf is judged to be required-agreement. We experiment with two types of thresholds:

Hard Threshold: We set a hard threshold on the ratio that is identical for all leaves. In all experiments, we set this threshold to 90%, based on manually inspecting some resulting trees to find a value that limited the number of non-agreeing syntactic structures being labeled as required-agreement.
Statistical Threshold: Leaves with very few examples may exceed the hard threshold purely by chance. In order to better determine whether the agreements are indeed due to a true pattern of required agreement, we devise a thresholding strategy based on significance testing. For all agreement-majority leaves, we apply a chi-squared goodness-of-fit test to compare the observed output distribution with the expected probability distribution specified by a null hypothesis. Our null hypothesis $H_0$ is that any agreement we observe is due to chance. If we reject the null hypothesis, we conclude from the alternate hypothesis $H_1$ that there exists a grammatical rule requiring agreement for this leaf's cases:

$H_0$: The leaf has chance-agreement.
$H_1$: The leaf has required-agreement.

If there is no rule requiring agreement, we assume that the morphological properties of the head and the dependent token are independent and identically distributed discrete random variables following a categorical distribution. We compute the probability of chance agreement based on the number of values that the specific morphological property $f$ can take. Since morphological feature values are not equally probable, we use a probability proportional to the observed value counts. For a binary number property where 90% of all observed occurrences are singular and 10% are plural, the probability of chance agreement is $0.9 \times 0.9 + 0.1 \times 0.1 = 0.82$, which gives the expected output distribution $p = [0.18, 0.82]$. Using $p$ we compute the expected frequency counts $E_i = n p_i$, where $n$ is the total number of samples in the given leaf, $i \in \{0, 1\}$ is the output class, and $p_i$ is the hypothesized proportion of observations for class $i$. The chi-squared test calculates the test statistic as

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency count for class $i$ in the given leaf. The test outputs a p-value, the probability of observing a sample statistic at least as extreme as the test statistic. If the p-value is smaller than a chosen significance level (we use 0.01), we reject the null hypothesis and label the leaf as required-agreement.
The chi-squared test especially helps in being cautious with leaves that have very few examples. However, for leaves with larger numbers of examples, statistical significance alone is insufficient, because there are many cases with small but significant deviations from the ratio of agreement expected by chance. Therefore, in addition to the p-value we also compute the effect size, which provides a quantitative measure of the magnitude of an effect (Sullivan and Feinn, 2012). Cramér's phi $\phi_c$ (Cramér, 1946) is a commonly used measure of effect size:

$$\phi_c = \sqrt{\frac{\chi^2}{N(k-1)}}$$

where $\chi^2$ is the test statistic computed from the chi-squared test, $N$ is the total number of samples within a leaf, and $k$ is the number of output classes (2 in our case). Cohen (1988) provides rules of thumb for interpreting effect sizes: for instance, $\phi_c > 0.5$ is considered a large effect size, and a large effect size suggests that the difference between the two hypotheses is important. Therefore, a leaf is labeled as required-agreement when the p-value is less than the significance level and the effect size is greater than 0.5. With this criterion, Leaf-1 in Figure 3(b) is correctly identified as chance-agreement.
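The full labeling criterion fits in a few lines; below is a hedged sketch using scipy, where label_leaf, the leaf counts, and value_counts (corpus-wide counts of the feature's values) are illustrative stand-ins rather than our exact implementation.

```python
from scipy.stats import chisquare

def label_leaf(n_agree, n_total, value_counts, alpha=0.01, min_phi=0.5):
    """Label one leaf via the statistical threshold described above."""
    total = sum(value_counts.values())
    # Probability of agreement if head/dependent values were independent draws.
    p_agree = sum((c / total) ** 2 for c in value_counts.values())
    expected = [n_total * (1 - p_agree), n_total * p_agree]
    observed = [n_total - n_agree, n_agree]
    chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
    phi_c = (chi2 / (n_total * (2 - 1))) ** 0.5  # Cramer's phi, k = 2 classes
    if p_value < alpha and phi_c > min_phi:
        return "required-agreement"
    return "chance-agreement"

# 99 of 100 samples in a leaf agree on Gender; values are 55% Masc, 45% Fem.
print(label_leaf(99, 100, {"Masc": 550, "Fem": 450}))  # required-agreement
```

Note that under a heavily skewed feature distribution (e.g. 90% singular), even near-perfect agreement can fail the effect-size check, which is exactly the cautious behavior intended here.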
Rule Merging: Because we are aiming to have a concise, human-readable representation of agreement rules of a language, after labeling the tree leaves we merge sibling leaves with the same label as shown in Figure 3(c). Further, we collapse tree nodes having all leaves with the same label thereby reducing the apparent depth of the tree.

Experimental Settings and Evaluation
Our experiments aim to answer the following research questions: (1) can our framework extract linguistically plausible agreement rules across diverse languages? and (2) can it do so even if gold-standard syntactic analyses are not available?
To answer the first question we evaluate rules extracted from gold-standard syntactic analyses (Sec. §4). For the second question we experiment in low-resource and zero-shot scenarios, using cross-lingual transfer to obtain parsers for the languages of interest, and evaluate the effect of noisy parsing results on the quality of the rules (Sec. §5).

Settings
Data We use the Surface-Syntactic Universal Dependencies (SUD) treebanks (Gerdes et al., 2018, 2019) as the gold-standard source of complete syntactic analysis. The SUD treebanks are derived from Universal Dependencies (UD) (Nivre et al., 2016, 2018), but unlike the UD treebanks, which favor content words as heads, the SUD ones express dependency labels and links using purely syntactic criteria, which is more conducive to our goal of learning syntactic rules. We use the tool of Gerdes et al. (2019) to convert UD v2.5 (Nivre et al., 2020) into SUD. We only use the training portion of the treebanks for learning our rules.

Rule Learning
We use sklearn's (Pedregosa et al., 2011) implementation of decision trees and train a separate model for each morphological feature f for a given language. We experiment with six morphological features (Gender, Person, Number, Mood, Case, Tense) which are most frequently present across several languages. We perform a grid search over the decision tree parameters (detailed in Appendix A.1) and select the model performing best on the validation set. We report results with the Statistical Threshold because on manual inspection we find the trees to be more reliable than the ones learnt from the Hard Threshold (see Appendix A.5 for an example).

Evaluation
We explore two approaches to evaluate the extracted rules, one based on expert annotations, and an automated proxy evaluation.
Expert Evaluation: Ideally, we would collect annotations for all head-relation-dependent triples in a treebank, but this would involve annotating hundreds of triples, requiring a large time commitment from linguists in each language we wish to evaluate. Instead, for each language/treebank we extract and evaluate the top 20 most frequent ⟨head POS, dependency relation, dependent POS⟩ triples for each of the six morphological features, amounting to 120 sets of triples to be annotated (on average, these cover approximately 95% of the triples where the feature is active). We then present these triples with 10 randomly selected illustrative examples and ask a linguist to annotate whether there is a rule in the language governing agreement between the head-dependent pair for this relation. The allowed labels are: Almost always agree, if the construction must almost always exhibit agreement on the given feature; Sometimes agree, if the linked arguments sometimes must agree but sometimes do not have to; and Need not agree, if any agreement on the feature is random. An example of the annotation interface is shown in Appendix A.4.
For each human-annotated triple $t$ marking feature $f$, we extract the label assigned to it by the learnt decision tree $T$: we find the leaf to which the triple belongs and assign that leaf's label to the triple, denoted $l_{tree,f,t}$. The human evaluation score (HS) for each triple is

$$HS_{f,t} = \mathbb{1}[l_{tree,f,t} = l_{human,f,t}]$$

where $l_{human,f,t}$ is the label assigned to the triple $t$ by the human annotator. These scores are then averaged across all annotated triples $T_f$ to get the human evaluation metric (HRM) for feature $f$:

$$HRM_f = \frac{1}{|T_f|} \sum_{t \in T_f} HS_{f,t}$$

Automated Evaluation: As an alternative to the infeasible manual evaluation of all rules in every language, we propose an automated rule metric (ARM) that evaluates how well the rules extracted from decision tree $T$ fit unseen gold-annotated test data. For each triple $t$ marking feature $f$, we first retrieve all examples from the test data corresponding to that triple. Next, we calculate the empirical agreement $q_{f,t}$ as the fraction of these test samples that exhibit agreement. For a required-agreement leaf, we expect most test samples satisfying that rule to show agreement, though there are exceptions: for example, when the head or dependent is a multiword expression (MWE), dependency parsers might miss it or pick only one of its constituents as head/dependent, or the MWE may be syntactically idiosyncratic. To account for such exceptions and/or parsing-related errors, we use a threshold as a proxy for deciding whether the given triple denotes required agreement, keeping a 5% margin based on feedback from the annotators: if $q_{f,t} > 0.95$ we assign the test label $l_{test,f,t}$ as required-agreement, and otherwise as chance-agreement. As in the human evaluation, we compute a score for each triple and average across all annotated triples in $T_f$ to get the ARM score for each feature $f$:

$$ARM_f = \frac{1}{|T_f|} \sum_{t \in T_f} \mathbb{1}[l_{tree,f,t} = l_{test,f,t}]$$
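As a concrete illustration of ARM, the sketch below computes the metric from per-triple test counts. The data structures (test_counts, tree_label) are hypothetical stand-ins for our actual pipeline, and the example numbers are invented for illustration.

```python
def arm_score(test_counts, tree_label, threshold=0.95):
    """ARM for one feature: fraction of triples where the tree's label
    matches the label derived from held-out test data."""
    matches = 0
    for triple, (agreeing, total) in test_counts.items():
        q = agreeing / total                          # empirical agreement
        test_label = "required" if q > threshold else "chance"
        matches += int(tree_label[triple] == test_label)
    return matches / len(test_counts)

counts = {("VERB", "subj", "NOUN"): (98, 100),        # q = 0.98 -> required
          ("VERB", "comp:obj", "NOUN"): (60, 100)}    # q = 0.60 -> chance
labels = {("VERB", "subj", "NOUN"): "required",
          ("VERB", "comp:obj", "NOUN"): "chance"}
print(arm_score(counts, labels))  # 1.0
```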

Experiments with Gold-Standard Data
In this section, we evaluate the quality of the rules induced by our framework, using gold-standard syntactic analyses and learning the decision trees over triples obtained from the training portion of all SUD treebanks. As a baseline, we compare against trees that predict chance-agreement for all leaves.
The extracted rules achieve an ARM score of 0.574 (averaged across all treebanks and features), outperforming the baseline by 0.074 ARM points. Of the 451 decision trees across all treebanks and features, 78% outperform the baseline trees. In Figure 4, we show the improvements over the baseline averaged across language families/genera. In families with extensive agreement systems such as Slavic and Baltic, our models clearly outperform the baseline by discovering correct rules, as they do for the other Indo-European genera, Indo-Aryan and Germanic. For mood and tense, the chance-agreement baseline performs on par with our method. This is not surprising, because little agreement is observed for these features given that only verbs and auxiliary verbs mark them. We find that for both tense and mood in the Indo-Aryan family, our model identifies required-agreement primarily for conjoined verbs, which mostly need to agree only if they share the same subject. However, subsequent analysis revealed that in the treebanks nearly 50% of the agreeing verbs do not share the same subject but agree by chance.
Agreement in Indo-European languages like Hindi and Russian is well documented (Comrie, 1984; Crockett, 1976) and is reflected in our large improvements over the baseline (Figure 5). Similarly, Arabic exhibits extensive agreement within noun phrases, including determiners and adjectives (Aoun et al., 1994). We find that for Arabic gender the lower ARM scores of our method are an artifact of the small test set.
North Sami is an interesting test bed: as a Uralic language, case agreement would be somewhat unexpected, and indeed our model's predictions are not better than the baseline. Nevertheless, with our interface we find patterns of rare positive paratactic constructions with required agreement, where demonstrative pronouns overwhelmingly agree with their heads (see Leaf 3 at https://bit.ly/34mHTeG). The case decision tree also uncovers interesting patterns of 100% agreement on Tamil constructions with nominalized verbs (gerunds), where the markings propagate to the whole phrase.

Figure 6: Correlation between the size of the decision trees constructed by our framework and the morphological complexity of languages.

Conciseness of Extracted Rules
We further analyze the decision trees learnt by our framework for conciseness and find that the trees grow more complex with increasing morphological complexity of languages, as seen in Figure 6. To compute the morphological complexity of a language, we use the word entropy measure proposed by Bentz et al. (2016), which measures the average information content of words:

$$H(D) = -\sum_{i=1}^{|V|} p(w_i) \log_2 p(w_i)$$

where $V$ is the vocabulary, $D$ is the monolingual text extracted from the training portion of the respective treebank, and $p(w_i)$ is the word type frequency normalized by the total number of tokens. Since this entropy does not account for unseen word types, Bentz et al. (2016) estimate $p$ with a shrinkage estimator:
$$\hat{p}(w_i) = \lambda \cdot p_{target} + (1 - \lambda) \cdot p_{ML}(w_i)$$

where $\lambda \in [0, 1]$, $p_{target}$ denotes the maximum-entropy case given by the uniform distribution $\frac{1}{|V|}$, and $p_{ML}$ is the maximum likelihood estimator given by the normalized word type frequency. Languages with a larger word entropy are considered morphologically rich, as they pack more information into their words. In Figure 6 we plot this morphological richness against the average number of leaves across all features and find the two to be highly correlated.
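For illustration, here is a small sketch of the word-entropy computation. A fixed shrinkage weight λ stands in for the data-driven estimate used by Bentz et al. (2016), so the numbers it produces are only indicative.

```python
from collections import Counter
from math import log2

def word_entropy(tokens, lam=0.5):
    """Word entropy with shrinkage toward the uniform distribution."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    p_target = 1.0 / v                    # uniform (maximum-entropy) case
    entropy = 0.0
    for c in counts.values():
        p_ml = c / n                      # maximum-likelihood estimate
        p = lam * p_target + (1 - lam) * p_ml
        entropy -= p * log2(p)
    return entropy

print(word_entropy("the dog saw the cat and the cat ran".split()))
```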

Manual Evaluation Results
We conduct an expert evaluation for Greek (el), Russian (ru) and Catalan (ca) as described in Section §3.2. For a strict setting, we consider both Sometimes agree and Need not agree as chance-agreement and report the human evaluation metric (HRM) in Figure 7. Overall, our method extracts first-pass grammar rules achieving 89% accuracy for Greek, 78% for Russian and 66% for Catalan.
In most error cases, like person in Russian, our model wrongly produces required-agreement labels, which we can attribute to skewed data statistics in the treebanks. In Russian and Greek, for instance, conjoined verbs only need to agree in person and number if they share the same subject (in which case they implicitly agree, because they both must agree with the same subject phrase). In the treebanks, though, only 15% of the agreeing verbs do indeed share the same subject: the rest agree by chance. In a reverse example from Catalan, the overwhelming majority (92%) of 8650 tokens are in the third person, causing our model to label all leaves as chance-agreement despite the fact that person/number agreement is required in such cases. Similarly for tense in Catalan, our framework predicts chance-agreement for auxiliary verbs with verbs as their dependent because of the overwhelming majority of disagreeing examples. We believe this is due both to annotation artifacts and to the way past tense is realized.
To demonstrate how well the automated evaluation correlates with the human evaluation protocol, we compute the Pearson correlation ($r$) between ARM and HRM for each language under four model settings: simulate-50, simulate-100, baseline, and gold. simulate-x is a simulated low-resource setting where the model is trained using x gold-standard syntactically analyzed sentences. The baseline setting is the one where all leaves predict chance-agreement, and under the gold setting we train on the entire gold-standard data. We compute the ARM and HRM scores for the rules learnt under each of the four settings and report the Pearson correlation, averaged across all features. Overall, we observe a moderate correlation for all three languages, with $r = 0.59$ for Greek, $r = 0.41$ for Russian, and $r = 0.38$ for Catalan. The correlations are very strong for some features, such as Gender ($r_{el} = 0.97$, $r_{ru} = 0.82$, $r_{ca} = 0.98$) and Number ($r_{el} = 0.97$, $r_{ru} = 0.69$, $r_{ca} = 0.96$), where we expect to see extensive agreement.
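Computing this correlation is straightforward; a sketch with illustrative (not actual) per-setting scores for one feature:

```python
from scipy.stats import pearsonr

# ARM/HRM for the four settings: simulate-50, simulate-100, baseline, gold.
arm = [0.55, 0.61, 0.50, 0.72]   # invented numbers for illustration
hrm = [0.60, 0.66, 0.48, 0.80]
r, p = pearsonr(arm, hrm)
print(f"r = {r:.2f}")
```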

Simulated Zero-/Few-Shot Experiments
It is not always possible to have access to gold-standard syntactic analyses. Therefore, in order to investigate how the quality of the rules is affected by the quality of the syntactic analysis, we conduct simulation experiments varying the amount of gold-standard syntactically analyzed training data. For each language, we sample x fully parsed sentences from its treebank out of the L training sentences available. For the remaining L - x sentences, we use silver syntactic analysis, i.e., we train a syntactic analysis model on the x sentences and use the model's predictions for the L - x sentences.

Data and Setup:
We experiment with Spanish, Greek, Belarusian, and Lithuanian. For transfer learning, we use the Portuguese, Ancient Greek, Ukrainian, and Latvian treebanks respectively. The data statistics and details are in Appendix A.2. We train Udify (Kondratyuk and Straka, 2019), a parser that jointly predicts POS tags, morphological features, and dependency trees, using the x gold-standard sentences as our training data. We generate model predictions on the remaining L - x sentences. Finally, we concatenate the x gold sentences with the L - x automatically parsed sentences, from which we extract the training data for learning the decision tree. We experiment with x = [50, 100, 500] gold-standard sentences. To account for sampling randomness, we repeat the process 5 times and report averages across runs.
To further improve the quality of the automatically obtained syntactic analysis, we use cross-lingual transfer learning, where we train the Udify model by concatenating the x sentences of the target language with the entire treebank of the related language. We also conduct zero-shot experiments under this setting, where we directly use the Udify model trained only on the related language and obtain its predictions on all L sentences. As before, we train five decision trees for each x setting and report the average ARM over the test data.

Results
We report the results for Number agreement in Figure 8. Similar plots for other languages and features can be found in Appendix A.5. We observe that using cross-lingual transfer learning (CLTL) already leads to high scores across all languages, even in zero-shot settings where we do not use any data from the gold-standard treebank. Taking Spanish gender as an example, 93% of the rule triples extracted from the gold-standard tree (which are overwhelmingly correct) are also extracted by the zero-shot tree. The zero-shot tree only makes a few mistakes (shown in Table 1 and reflected in its overall ARM score) on certain proper noun and auxiliary verb constructions. Interestingly, using CLTL, training with just 50 gold-standard target-language sentences is almost equivalent to training with 100 or 500 gold-standard sentences. This opens new avenues for language documentation: with as few as 50 expertly annotated syntactic analyses of a new language and CLTL, our framework can produce decent first-pass agreement rules. Needless to say, in most cases the extracted rules improve as we increase the number of gold-standard sentences, and CLTL further helps bridge the data availability gap in low-resource settings.

Real Zero-Shot Experiments
Some languages, like Breton, Buryat, Faroese, Tagalog, and Welsh, have test data only; there is no gold-standard training data available, which presents a true zero-shot setting. In such cases, we can still extract grammar rules with our framework using zero-shot dependency parsing.
Data and Setup: We collect raw text for the above languages from the Leipzig corpora (Goldhahn et al., 2012). Data statistics are listed in Appendix A.2. We parse these sentences using the "universal" Udify model pre-trained on all of the UD treebanks, as released by Kondratyuk and Straka (2019). As before, we use these automatically parsed syntactic analyses to extract the rules, which we evaluate with ARM over the gold-standard test data of the corresponding SUD treebanks.
Results: We report the ARM scores in Figure 9. Averaged over all rules, our approach obtains an ARM of 0.566, while the naive all-chance baseline only achieves 0.506. The difference appears small, but we still consider it significant, because these languages do not actually require agreement for many grammatical features. Tagalog and Buryat are the most distant languages that we test on (no Philippine or Mongolic language is present in our training data), and yet our method is on par with the baseline for Buryat and even outperforms it for Tagalog.

Related Work
Zamaraeva (2016) also infers morphotactics from IGT using k-means clustering. To the best of our knowledge, our work is the first to propose a framework to extract first-pass grammatical agreement rules directly from raw text in a statistically informed, objective way. A parallel line of work (Hellan, 2010) extracts a construction profile of a language using templates that define how sentences are constructed.

Future Work
While we have demonstrated that our approach is effective in extracting a first-pass set of agreement rules directly from raw text, it focuses only on agreement between a pair of words and hence might fail to capture more complex phenomena that require broader context or operate at the phrase level. Consider this simple English example: "John and Mary love their dog". Under both the UD and SUD formalisms, the coordinating conjunction "and" is a dependent, hence the verb will not agree with either of the (singular) nouns ("John" or "Mary"). Also, deciding agreement based only on POS tags is insufficient to capture all phenomena that may influence agreement: e.g., mass nouns such as 'rice' do not follow the standard number agreement rules in English. We leave a more expressive model and evaluation on more languages as future work. We also plan to expand our methodology for extracting grammar rules from raw text to other aspects of morphosyntax, such as argument structure and word order phenomena.

A.1 Decision Tree Hyperparameters
We perform a grid search over the following hyperparameters of the decision tree:
• criterion = [gini, entropy]
• max depth = [6, 15]
• min impurity decrease = 1e-3
The best parameters are selected based on validation set performance. For treebanks which have no validation set we use the default cross-validation provided by sklearn (Buitinck et al., 2013). Average model runtime for a treebank is 5-10 minutes, depending on the size of the treebank.
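A sketch of this search with sklearn; the grid mirrors the values listed above, and cv=5 stands in for the default cross-validation used when no validation split exists. X and y are assumed to come from the rule-extraction step.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(6, 16)),        # depths 6 through 15
    "min_impurity_decrease": [1e-3],
}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
# search.fit(X, y)                          # X, y from rule extraction
# best_tree = search.best_estimator_
```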

A.2 Dataset Statistics
For the true low-resource experiments, the dataset details are in Table 2.

A.4 Annotation Interface for Expert Evaluation
In Figure 10, we show the annotation interface used for verifying Gender agreement rules in Catalan.
For each triple, we display 10 randomly selected examples from the training portion of the treebank.

A.5 Low-resource Experiment Results
For the simulation experiments, the dataset details are in Table 3.

A.5.1 Udify (Kondratyuk and Straka, 2019) Model Details
We used the Udify model to automatically annotate the raw text with part-of-speech (POS) tags, dependency links, and morphological features. For each of the simulation experiments, we report the Udify parsing performance on the test data in Table 4. We used the same hyperparameters for training with a related language as specified by the authors. In the configuration file, we only change the parameters warmup_steps = 100 and start_step = 100, as recommended by the authors for low-resource languages.

A.5.2 Results and Discussion
For each language and feature, we plot the ARM score with and without transfer learning in Figures 12-14. Similar to our findings for Gender in Figure 5, we find that cross-lingual transfer leads to better scores across all languages in the zero-shot setting. As we increase the number of gold-standard sentences, the quality of the extracted rules improves. However, for Belarusian we observe the opposite trend for Person agreement. On closer inspection, we find that this is because person applies only to non-past finite verb forms (VERB and AUX) as an inflectional feature and to pronouns (PRON) as a lexical feature, which means that in many cases person is not explicitly marked even though it implicitly exists.

A.6 Experiments with Gold-Standard Data
We present the ARM scores for all treebanks and features in Tables 5-11. We also report the validation results in the same tables for our best setting, which uses the Statistical Threshold. In Section 2.2, we proposed two types of thresholds for retaining high-probability agreement rules. In order to determine which threshold works best across treebanks, we manually inspect some of the learnt decision trees. We find that the trees learnt with the Hard Threshold often overfit the training data, producing leaves with very few examples. In Figure 15 we compare the trees constructed for number agreement in Marathi under the two thresholds. One reason why the Statistical Threshold performs better for low-resource languages is that there are more leaves with fewer samples overall, causing the Hard Threshold to produce more false positives, whereas the Statistical Threshold combines the significance test with an effect size that takes the sample size within a leaf into account, leading to better-labeled leaves. Therefore, we use the Statistical Threshold for all our simulation experiments.
In Figure 11, we report the average number of leaves in the decision trees grouped by language family. Overall, Gender and Case tend to have more complex trees. For Case, this is probably because languages have more case values, making it harder for the decision tree to model them. Figure 16 presents a comparison of UD- and SUD-style trees for the German sentence "Ich werde lange Bücher lesen.". The SUD tree has the function word 'werde' as the syntactic head of the content word 'lesen'.

Figure 14: Comparing the (avg.) ARM score for Case agreement with and without cross-lingual transfer learning (transfer language in parentheses). Note: the higher the ARM, the better. For Spanish, there were fewer than 10 data points with Case annotated, hence we do not report results for it.