Parsing Universal Dependencies without training

We present UDP, the first training-free parser for Universal Dependencies (UD). Our algorithm is based on PageRank and a small set of head-dependent rules. UDP features two-step decoding to guarantee that function words are attached as leaf nodes. The parser requires no training, and it is competitive with a delexicalized transfer system. UDP offers a linguistically sound unsupervised alternative to cross-lingual parsing for UD. The parser has very few parameters and is distinctly robust to domain change across languages.


Introduction
Grammar induction and unsupervised dependency parsing are active fields of research in natural language processing (Klein and Manning, 2004; Gelling et al., 2012). However, many data-driven approaches struggle with learning relations that match the conventions of the test data; e.g., Klein and Manning reported the tendency of their DMV parser to make determiners the heads of German nouns, which would not be an error if the test data used a DP analysis (Abney, 1987). Even supervised transfer approaches (McDonald et al., 2011) suffer from target adaptation problems when facing word order differences.
The Universal Dependencies (UD) project (Nivre et al., 2015; Nivre et al., 2016) offers a dependency formalism that aims at providing a consistent representation across languages, while enforcing a few hard constraints. The arrival of such treebanks, expanded and improved on a regular basis, provides a new milestone for cross-lingual dependency parsing research (McDonald et al., 2013). Furthermore, given that UD rests on a series of simple principles like the primacy of lexical heads (cf. Johannsen et al. (2015) for more details), we expect that such a formalism lends itself more naturally to a simple and linguistically sound rule-based approach to cross-lingual parsing. In this paper we present such an approach.
Our system is a dependency parser that requires no training, and relies solely on the explicit part-of-speech (POS) constraints that UD imposes. In particular, UD prescribes that trees are single-rooted, and that function words like adpositions, auxiliaries, and determiners are always dependents of content words, while other formalisms might treat them as heads (De Marneffe et al., 2014). We ascribe our work to the viewpoints of Bender (2009) about the incorporation of linguistic knowledge in language-independent systems.

Contributions
We introduce, to the best of our knowledge, the first unsupervised rule-based dependency parser for Universal Dependencies.
Our method goes substantially beyond the existing work on rule-aided unsupervised dependency parsing, specifically by: i) adapting the dependency head rules to UD-compliant POS relations, ii) incorporating the UD restriction of function words being leaves, iii) applying personalized PageRank to improve main predicate identification, and iv) making the parsing entirely free of language-specific parameters by estimating adposition attachment direction at runtime. We evaluate our system on 32 languages in three setups, depending on the reliability of available POS tags, and compare to a multi-source delexicalized transfer system. In addition, we evaluate the systems' sensitivity to domain change for a subset of UD languages for which domain information was retrievable. The results expose a solid and competitive system for all UD languages. Our unsupervised parser compares favorably to delexicalized parsing, while being more robust to domain change.

Related work
Cross-lingual learning Recent years have seen exciting developments in cross-lingual linguistic structure prediction based on transfer or projection of POS and dependencies (Das and Petrov, 2011; McDonald et al., 2011). These works mainly use supervised learning and domain adaptation techniques for the target language.
The first group of approaches deals with annotation projection (Yarowsky et al., 2001), whereby parallel corpora are used to transfer annotations between resource-rich source languages and low-resource target languages. Projection relies on the availability and quality of parallel corpora, source-side taggers and parsers, but also tokenizers, sentence aligners, and word aligners for sources and targets. Hwa et al. (2005) were the first to project syntactic dependencies, and Tiedemann et al. (2014; 2016) improved on their projection algorithm. The current state of the art in cross-lingual dependency parsing involves leveraging parallel corpora for annotation projection (Ma and Xia, 2014; Rasooli and Collins, 2015).
The second group of approaches deals with transferring source parsing models to target languages. Zeman and Resnik (2008) were the first to introduce the idea of delexicalization: removing lexical features by training and cross-lingually applying parsers solely on POS sequences. Søgaard (2011) and McDonald et al. (2011) independently extended the approach by using multiple sources, requiring uniform POS and dependency representations (McDonald et al., 2013).
Both model transfer and annotation projection rely on a large number of presumptions to derive their competitive parsing models. By and large, these presumptions are unrealistic and exclusive to a group of very closely related, resource-rich Indo-European languages. Agić et al. (2015; 2016) exposed some of these biases in their proposal for realistic cross-lingual tagging and parsing, as they emphasized the lack of perfect sentence- and word-splitting for truly low-resource languages. Further, Johannsen et al. (2016) introduced joint projection of POS and dependencies from multiple sources while sharing the outlook on bias removal in real-world multilingual processing.
Rule-based parsing Cross-lingual methods, realistic or not, depend entirely on the availability of data: for the sources, for the targets, or most often for both sets of languages. Moreover, by design, they typically do not exploit the constraints placed on linguistic structures by a formalism.
With the emergence of UD as the practical standard for multilingual POS and syntactic dependency annotation, we argue for an approach that takes a fresh angle on both aspects.Specifically, we propose a parser that i) requires no training data, and in contrast ii) critically relies on exploiting the UD constraints.
These two characteristics make our parser unsupervised. Data-driven unsupervised dependency parsing is now a well-established discipline (Klein and Manning, 2004; Spitkovsky et al., 2010a; Spitkovsky et al., 2010b). Still, the performance of these parsers falls far behind the approaches involving any sort of supervision.
Our work builds on the line of research on rule-aided unsupervised dependency parsing by Gillenwater et al. (2010) and Naseem et al. (2010), and also relates to Søgaard's (2012a; 2012b) work. Our parser, however, features two key differences: i) the usage of PageRank personalization (Lofgren, 2015), and ii) two-step decoding to treat content and function words differently according to the UD formalism.
Through these differences, even without any training data, we parse nearly as well as a delexicalized transfer parser, and with increased stability to domain change.

Method
Our approach does not use any training or unlabeled data. We have used the English treebank during development to assess the contribution of individual head rules, and to tune PageRank parameters (Sec. 3.1) and function-word directionality (Sec. 3.2). Adposition direction is calculated on the fly at runtime. We refer henceforth to our UD parser as UDP.

PageRank setup
Our system uses the PageRank (PR) algorithm (Page et al., 1999) to estimate the relevance of the content words of a sentence. PR uses a random walk to estimate which nodes in the graph are likely to be visited often, and thus gives higher rank to nodes with more incoming edges, as well as to nodes connected to those. Using PR to score word relevance requires an effective graph-building strategy. We have experimented with the strategies of Søgaard (2012b), such as connecting words to their adjacent words, but our system fares best when strictly using the dependency rules in Table 1 to build the graph. UD trees are often very flat, and a highly connected graph yields a PR distribution that is closer to uniform, thereby flattening the differences in word relevance.
We build a multigraph of all words in the sentence covered by the head-dependent rules in Table 1, giving each word an incoming edge for each eligible dependent, i.e., ADV depends on ADJ and VERB.This strategy does not always yield connected graphs, and we use a teleport probability of 0.05 to ensure PR convergence.
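As an illustration, the multigraph construction described above can be sketched as follows. The head-rule set here is a small hypothetical subset standing in for the paper's Table 1 (only the ADV-under-ADJ/VERB rules are stated in the text), and the edge-multiplicity encoding is our assumption:

```python
from collections import defaultdict

# Hypothetical subset of the head rules (the paper's Table 1 lists the full
# set): each pair (dependent_POS, head_POS) licenses an edge dep -> head.
HEAD_RULES = {
    ("ADV", "ADJ"), ("ADV", "VERB"),
    ("NOUN", "VERB"), ("PROPN", "VERB"),
    ("ADJ", "NOUN"),
}

def build_multigraph(tagged):
    """tagged: list of (word_index, POS). Returns adjacency counts where
    edges[head][dep] is the number of parallel edges dep -> head, i.e., each
    word gains one incoming edge per eligible dependent in the sentence."""
    edges = defaultdict(lambda: defaultdict(int))
    for i, (wi, pi) in enumerate(tagged):
        for j, (wj, pj) in enumerate(tagged):
            if i != j and (pj, pi) in HEAD_RULES:
                edges[wi][wj] += 1
    return edges
```

On a toy sentence tagged NOUN ADV VERB, the verb collects incoming edges from both the noun and the adverb, matching the "ADV depends on ADJ and VERB" example above.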
Teleport probability is the probability that, in any iteration of the PR calculation, the next active node is chosen at random, instead of being one of the adjacent nodes of the current active node. See Brin and Page (1998) for more details on teleport probability; there, the authors refer to one minus the teleport probability as the damping factor.
We chose this value by incrementing it in steps of 0.01 during development until we found the smallest value that guaranteed PR convergence. A high teleport probability is undesirable, because the resulting stationary distribution can be almost uniform. We did not have to re-adjust this value when running on the actual test data.
The main idea behind our personalized PR approach is the observation that ranking is only relevant for content words (ADJ, NOUN, PROPN, and VERB mark content words). PR can incorporate a priori knowledge of the relevance of nodes by means of personalization, namely giving more weight to certain nodes.
Intuitively, the higher the rank of a word, the closer it should be to the root node, i.e., the main predicate of the sentence is the node that should have the highest PR, making it the dependent of the root node (Fig. 1, lines 4-5). We use PR personalization to give 5 times more weight (over an otherwise uniform distribution) to the node that is estimated to be the main predicate, i.e., the first verb, or the first content word if there are no verbs.
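A minimal power-iteration sketch of personalized PR with the 0.05 teleport probability is given below. It assumes the multigraph is encoded as nested edge-multiplicity counts (edges[head][dep]), as in our earlier sketch; the handling of dangling nodes via the personalization vector is a standard choice, not something the paper specifies:

```python
def pagerank(edges, nodes, personalization, teleport=0.05, iters=50):
    """Power iteration over a multigraph; edges[head][dep] = edge count.
    A walker at `dep` moves to `head` proportionally to edge multiplicity;
    with probability `teleport` it jumps according to the personalization."""
    total = float(sum(personalization.values()))
    p = {n: personalization[n] / total for n in nodes}
    rank = dict(p)
    out = {n: 0 for n in nodes}          # outgoing-edge count per node
    for head, deps in edges.items():
        for dep, c in deps.items():
            out[dep] += c
    for _ in range(iters):
        new = {n: teleport * p[n] for n in nodes}
        for head, deps in edges.items():
            for dep, c in deps.items():
                new[head] += (1 - teleport) * rank[dep] * c / out[dep]
        # mass sitting on dangling nodes is redistributed via personalization
        dangling = sum(rank[n] for n in nodes if out[n] == 0)
        for n in nodes:
            new[n] += (1 - teleport) * dangling * p[n]
        rank = new
    return rank
```

Giving the estimated main predicate a personalization weight of 5 (against 1 for every other node) biases the stationary distribution toward it, even when other nodes have as many incoming edges.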

Head direction
Head direction is an important trait in dependency syntax (Tesnière, 1959). Indeed, the UD feature inventory contains a trait to distinguish the general adposition tag ADP between prepositions and postpositions.
Instead of relying on this feature from the treebanks, which is not always provided, we estimate the frequency of ADP-NOMINAL vs. NOMINAL-ADP bigrams. We calculate this estimation directly on the input data at runtime to keep the system training-free. Moreover, it requires very few examples to converge (10-15 sentences). If a language has more ADP-NOMINAL bigrams, we consider all its adpositions to be prepositions (and thus dependents of elements to their right). Otherwise, we consider them postpositions.
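The runtime estimation can be sketched as follows; the set of tags counted as NOMINAL is our assumption, as the paper does not enumerate them:

```python
# Assumed nominal tags; the paper does not list what counts as NOMINAL.
NOMINAL = {"NOUN", "PROPN", "PRON"}

def adpositions_are_prepositions(tag_sequences):
    """tag_sequences: iterable of per-sentence POS-tag lists.
    Returns True if ADP-NOMINAL bigrams outnumber NOMINAL-ADP bigrams,
    i.e., the language's adpositions are treated as prepositions."""
    pre = post = 0
    for tags in tag_sequences:
        for a, b in zip(tags, tags[1:]):
            if a == "ADP" and b in NOMINAL:
                pre += 1
            elif a in NOMINAL and b == "ADP":
                post += 1
    return pre >= post
```

Since the counts only have to separate two regimes, a handful of sentences suffices, which is consistent with the 10-15-sentence convergence noted above.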
For the other function words, we have determined on the English dev data whether to make them strictly right- or left-attaching, or to allow either direction. There, AUX, DET, and SCONJ are right-attaching, while CONJ and PUNCT are left-attaching. There are no direction constraints for the rest. Punctuation is a common source of parsing errors that is of very little interest in this setup. While we do evaluate on all tokens including punctuation, we also apply a heuristic for the last token in a sentence: if it is a punctuation mark, we make it a dependent of the main predicate.
The head assignments in lines 7 and 13 read as follows: the head h of a word (either c or f) is the closest element of the current list of heads (H) that has the right direction and respects the POS-dependency rules. These assignments have a back-off option to ensure the final D is a tree. If the two conditions are too strict, i.e., if the set of possible heads is empty, we drop the head-rule constraint and recalculate the closest possible head that respects the directionality constraint. If the set is empty again, we drop both constraints and assign the closest head. Lines 4 and 5 enforce the single-root constraint. To enforce the leaf status of function nodes, the algorithm first attaches all content words (C), and then all function words (F) in the second block, where H is not updated, thereby ensuring leafness for all f ∈ F. The order of head attachment is not monotonic w.r.t. PR between the first and second block, and can yield non-projectivities. Nevertheless, it is still a one-pass algorithm. Decoding runs in less than O(n²), namely O(n × |C|). However, running PR incurs the main computation cost.
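The two-step decoding with its back-off chain can be sketched as follows; Fig. 1 itself is not reproduced here, so the interfaces (the rule test `allowed` and the `direction` map) are our assumptions:

```python
def decode(pos, ranked_content, function_idxs, allowed, direction):
    """Two-step decoding sketch. ranked_content: content-word indices in
    decreasing PageRank order; allowed(dep_pos, head_pos): head-rule test;
    direction[tag]: 'left', 'right', or absent (no constraint).
    Returns head[i] for every token; the main predicate gets head -1 (root)."""
    head, H = {}, []                       # H: indices currently eligible as heads

    def ok_dir(i, h):
        d = direction.get(pos[i])
        return d is None or (d == "right" and h > i) or (d == "left" and h < i)

    def attach(i):
        # back-off chain: rules + direction, then direction only, then any head
        for cands in ([h for h in H if ok_dir(i, h) and allowed(pos[i], pos[h])],
                      [h for h in H if ok_dir(i, h)],
                      H):
            if cands:
                return min(cands, key=lambda h: abs(h - i))

    for step, i in enumerate(ranked_content):
        head[i] = -1 if step == 0 else attach(i)   # top-ranked word hangs off root
        H.append(i)
    for i in function_idxs:                # H is frozen: function words stay leaves
        head[i] = attach(i)
    return head
```

Because H is only extended during the first block, function words can never acquire dependents, which is exactly the leafness guarantee described above.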

Parser run example
This section exemplifies a full run of UDP for the example sentence from the English test data: "They also had a special connection to some extremists".

PageRank
Given an input sentence and its POS tags, we obtain the rank of each word by building a graph using the head rules and running PR on it. Table 2 provides the sentence, the POS of each word, the number of incoming edges for each word after building the graph with the head rules from Sec. 3.1, and the personalization vector for PR on this sentence. Note that all nodes have the same personalization weight, except the estimated main predicate, the verb "had".

Table 2: Words, POS, personalization, and incoming edges for the example sentence.
Table 4 shows the directed multigraph used for PR in detail. We can see, e.g., that the verb "had" receives four incoming edges: from the two nouns, plus from the adverb "also" and the pronoun "They".
After running PR, we obtain the following ranking for content words: C = ⟨had, connection, extremists, special⟩. Even though the verb has four incoming edges and the nouns have five each, the personalization makes the verb the highest-ranked word.

Decoding
Once C is calculated, we can follow the algorithm in Fig. 1 to obtain a dependency parse. The first four iterations calculate the heads of content words following their PR, and the following iterations attach the function words in F. Finally, Fig. 2 shows the resulting dependency tree. Full lines are assigned in the first block (content dependents), dotted lines are assigned in the second block (function dependents). The edge labels indicate in which iteration the algorithm has assigned each dependency. Note that the algorithm is deterministic for a given input POS sequence: any 10-token sentence with the POS labels shown in Table 2 would yield the same dependency tree.

Experiments
This section describes the data, metrics, and comparison systems used to assess the performance of UDP. We evaluate on the test sections of the UD 1.2 treebanks (Nivre et al., 2015) that contain word forms. If there is more than one treebank per language, we use the treebank that has the canonical language name (e.g., Finnish instead of Finnish-FTB). The resulting trees always pass the validation script in github.com/UniversalDependencies/tools. We use the standard unlabeled attachment score (UAS) and evaluate on all sentences of the canonical UD test sets.

Baseline
We compare our UDP system with the performance of a rule-based baseline that uses the head rules in Table 1. The baseline identifies the first verb (or the first content word if there are no verbs) as the main predicate, and assigns heads to all words according to the rules in Table 1. We have selected the set of head rules to maximize precision on the development set, and they do not provide full coverage. The system makes any word not covered by the rules (e.g., a word with a POS such as X or SYM) a dependent of either its left or right neighbor, according to the estimated runtime parameter. We report the best head direction and its score for each language in Table 5. This baseline finds the head of each token based on its closest possible head, or on its immediate left or right neighbor if there is no head rule for the POS at hand, which means that this system does not necessarily yield well-formed trees. Each token receives a head, and while the structures are single-rooted, they are not necessarily connected. Note that we do not include results for the DMV model by Klein and Manning (2004), as it has been outperformed by a system similar to ours (Søgaard, 2012b). The usual adjacency baseline for unsupervised dependency parsing, where all words depend on their left or right neighbor, fares much worse than our baseline (20% UAS lower on average) even with an oracle pick for the best per-language direction, and we do not report those scores.

Evaluation setup
Our system relies solely on POS tags. To estimate the quality degradation of our system under non-gold POS scenarios, we evaluate UDP in two alternative scenarios. The first is predicted POS (UDP P), where we tag the respective test set with TnT (Brants, 2000) trained on each language's training set. The second is a naive type-constrained two-POS-tag scenario (UDP N), which approximates a lower bound. We give each word either the CONTENT or the FUNCTION tag, depending on the word's frequency: the 100 most frequent words of the input test section receive the FUNCTION tag.
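The naive two-POS tagging can be sketched as follows, under the assumption that every word form outside the 100 most frequent ones receives the CONTENT tag:

```python
from collections import Counter

def naive_two_pos(sentences, n_function=100):
    """sentences: list of token lists (the input test section).
    The n_function most frequent word forms get the FUNCTION tag;
    everything else is tagged CONTENT (our assumption for the remainder)."""
    freq = Counter(w for sent in sentences for w in sent)
    function_words = {w for w, _ in freq.most_common(n_function)}
    return [["FUNCTION" if w in function_words else "CONTENT" for w in sent]
            for sent in sentences]
```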
Finally, we compare our parser UDP to a supervised cross-lingual system (MSD). It is a multi-source delexicalized transfer parser, referred to as multi-dir in the original paper by McDonald et al. (2011). For this baseline we train TurboParser (Martins et al., 2013) on a delexicalized training set of 20k sentences, sampled uniformly from the UD training data excluding the target language. MSD is a competitive and realistic baseline in cross-lingual transfer parsing work. This gives us an indication of how our system compares to standard cross-lingual parsers.

Results
Table 5 shows that UDP is a competitive system: UDP G is remarkably close to the supervised MSD G system, with an average difference of 6.4%. Notably, UDP even outperforms MSD on one language (Hindi).
More interestingly, in the evaluation scenario with predicted POS, we observe that our system drops only marginally (2.2%) compared to MSD (2.7%). In the least robust rule-based setup, the error propagation rate from POS to dependencies would be doubled, as either a wrongly tagged head or a wrongly tagged dependent would break the dependency rules. However, with an average POS accuracy by TnT of 94.1%, the error propagation is 0.37, i.e., each POS error causes 0.37 additional dependency errors. In contrast, for MSD this error propagation is 0.46, thus higher. For the extreme POS scenario, content vs. function POS (CF), the drop in performance for UDP is very large, but this might be too crude an evaluation setup. Nevertheless, UDP, the simple unsupervised system with PageRank, outperforms the adjacency baseline (BL) by ~4% on average in the naive two-POS-tag scenario. This difference indicates that even with very deficient POS tags, UDP can provide better structures.
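One way to reconstruct the reported propagation rates, under our reading that they are the UAS drop divided by the TnT POS error rate, is the following check; the interpretation is our assumption, though it reproduces both reported figures:

```python
# TnT POS error rate: 1 - 94.1% accuracy = 5.9%
pos_err = 1 - 0.941

# UAS drop from gold to predicted POS: 2.2% for UDP, 2.7% for MSD
udp_propagation = 0.022 / pos_err   # additional dep errors per POS error
msd_propagation = 0.027 / pos_err

print(round(udp_propagation, 2))    # matches the reported 0.37
print(round(msd_propagation, 2))    # matches the reported 0.46
```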

Discussion
In this section we provide a further error analysis of the UDP parser. We examine the contribution of PageRank content-word scoring to the overall results, the behavior of the system across different parts of speech, and the robustness of UDP on text from different domains.

PageRank contribution
UDP depends on PageRank to score content words, and on two-step decoding to ensure the leaf status of function words. In this section we isolate the contribution of both parts. We do so by comparing the performance of BL, UDP, and UDP NoPR, a version of UDP in which we disable PR and rank content words according to their reading order, i.e., the first word in the ranking is the first word to be read, regardless of the specific language's script direction. The baseline BL described in Sec. 5.1 already ensures that function words are leaf nodes, because they have no listed dependent POS in the head rules. The task of the decoding steps is thus mainly to ensure that the resulting structures are well-formed dependency trees.
If we measure the difference between UDP NoPR and BL, we see that UDP NoPR contributes 4 UAS points on average over the baseline. Nevertheless, the baseline is oracle-informed about the language's best branching direction, a property that UDP does not have; instead, the decoding step determines head direction as described in Sec. 3.2. Complementarily, we can measure the contribution of PR by observing the difference between regular UDP and UDP NoPR. The latter scores on average 9 UAS points lower than UDP. These 9 points are caused by the different attachment of content words in the first decoding step.

Breakdown by POS
UD is a constantly improving effort, and not all v1.2 treebanks have the same level of formalism compliance.Thus, the interpretation of, e.g., the AUX-VERB or DET-PRON distinctions might differ across treebanks.However, we ignore these differences in our analysis and consider all treebanks to be equally compliant.
The root accuracy scores oscillate around an average of 69%, with Arabic and Tamil (26%) and Estonian (93%) as outliers. Given the PR personalization (Sec. 3.1), UDP has a strong bias toward choosing the first verb as the main predicate. Without personalization, performance drops by 2% on average. This difference is consistent even for verb-final languages like Hindi, given that the main verb of a simple clause will be its only verb, regardless of where it appears. Moreover, using PR personalization makes the ranking calculations converge a whole order of magnitude faster.
The heuristic to determine adposition direction succeeds at identifying the predominant preposition or postposition preference for all languages (average ADP UAS of 75%). The fixed direction for the other functional POS is largely effective, with few exceptions; e.g., DET is consistently right-attaching in all treebanks except Basque (average overall DET UAS of 84%, 32% for Basque). These alternations could also be estimated from the data in a manner similar to ADP. Our rules do not make nouns eligible heads for verbs. As a result, the system cannot infer relative clauses. We excluded the NOUN → VERB rule during development because it makes the hierarchy between verbs and nouns less conclusive.
We have not excluded punctuation from the evaluation. Indeed, the UAS for PUNCT is low (an average of 21%, standard deviation of 9.6), even lower than for the otherwise problematic CONJ. Even though conjunctions are pervasive and identifying their scope is one of the usual challenges for parsers, the average UAS for CONJ is much higher (an average of 38%, standard deviation of 13.5) than for PUNCT. Both POS show large standard deviations, which indicates great variability. This variability can be caused by linguistic properties of the languages or evaluation datasets, but also by differences in annotation conventions.

Cross-domain consistency
Models with fewer parameters are less likely to overfit to a certain dataset. In our case, a system with few, general rules is less likely to make attachment decisions that are particular to a certain language or dataset. Plank and van Noord (2010) have shown that rule-based parsers can be more stable under domain shift. We explore whether their finding holds for UDP as well, by testing on i) the UD development data as a readily available proxy for domain shift, and ii) manually curated domain splits of selected UD test sets.

Development sets
We have used the English development data to choose which relations to include as head rules in the final system (Table 1). It is possible that some of the rules are more befitting for the English data, or for that particular section.
However, if we regard the results for UDP G in Table 5, we can see that there are 24 languages (out of 32) for which the parser performs better than for English. This result indicates that the head rules are general enough to provide reasonable parses for languages other than the one chosen for development. If we run UDP G on the development sections for the other languages, we find the results are very consistent: any language scores on average ±1 UAS with regard to the test section. There is no clear tendency for either section being easier to parse with UDP.

Cross-domain test sets To further assess cross-domain robustness, we retrieved the domain (genre) splits from the test sections of the UD treebanks where the domain information is available as sentence metadata: Bulgarian, Croatian, and Italian. We also include a UD-compliant Serbian dataset which is not included in the UD release but which is based on the same parallel corpus as Croatian and has the same domain splits (Agić and Ljubešić, 2015). When averaging, we pool Croatian and Serbian together, as they come from the same dataset.
For English, we have obtained the test data splits matching the sentences from the original distribution of the English Web Treebank. In addition to these already available datasets, we have annotated three further datasets to assess domain variation more extensively, namely the first 50 verses of the King James Bible, 50 sentences from a magazine, and 75 sentences from the test split of QuestionBank (Judge et al., 2006). We include the third dataset to evaluate strictly on questions, which was already possible for Italian. While the answers domain in English is made up of text from the Yahoo! Answers forum, only one fourth of those sentences are questions. Note that these three small datasets are not included in the results on the canonical test sections in Table 5.

Table 7: Average language-wise domain evaluation. We report average UAS and standard deviation per language. The bottom row provides the average standard deviation for each system.
As Table 7 shows, UDP has a much lower standard deviation across domains than MSD, and this holds across languages. We attribute this higher stability to UDP being developed to satisfy a set of general properties of the UD syntactic formalism, instead of being a data-driven method more sensitive to sampling bias. This holds for both the gold-POS and the predicted-POS setup. The differences in standard deviation are unsurprisingly smaller in the predicted-POS setup. In general, the rule-based UDP is less sensitive to domain shifts than its data-driven MSD counterpart, confirming earlier findings (Plank and van Noord, 2010). Table 6 gives the detailed scores per language and domain. From the scores we can see that presidential bulletin, legal, and weblogs are amongst the hardest domains to parse. However, the systems often do not agree on which domain is hardest, with the exception of the Bulgarian bulletin. Interestingly, on the Italian data and some of the hardest domains, UDP outperforms MSD, confirming that it is a robust baseline.

Comparison to full supervision
In order to assess how much information the simple principles in UDP provide, we measure how many gold-annotated sentences are necessary to reach its performance, that is, at which size a treebank provides enough information for training to go beyond the simple linguistic principles outlined in Section 3.
For this comparison we use a first-order non-projective TurboParser (Martins et al., 2013), following the setup of Agić et al. (2016). The supervised parsers require a median of 100 gold-annotated sentences (and a mean of 300) to reach UDP-comparable performance, with Bulgarian (3k), Czech (1k), and German (1.5k) as outliers. The difference between mean and median shows there is great variance, while UDP provides very constant results, also in terms of POS and domain variation.

Conclusion
We have presented UDP, an unsupervised dependency parser for Universal Dependencies (UD) that makes use of personalized PageRank and a small set of head-dependent rules.The parser requires no training data and estimates adposition direction directly from the input.
We achieve competitive performance on all but two UD languages, and even beat a multi-source delexicalized parser (MSD) on Hindi. We evaluated the parser in three POS setups and across domains. Our results show that UDP is less affected by deteriorating POS tags than MSD, and is more resilient to domain changes. Given how much of the overall dependency structure can be explained by this fairly simple system, we propose UDP as an additional UD parsing baseline. The parser, the in-house annotated test sets, and the domain data splits are made freely available. UD is a running project, and the guidelines are bound to evolve over time; indeed, the UD 2.0 guidelines have recently been released. UDP can be augmented with edge labeling for some deterministic labels like case or det, and further constraints can be incorporated into UDP. Moreover, the parser makes no special treatment of multiword expressions (which would require a lexicon), coordinations, or proper names. All three kinds of structures receive a flat tree where all words depend on the leftmost one. While coordination attachment is a classical problem in parsing and out of the scope of our work, a proper-name sequence can be straightforwardly identified from the part-of-speech tags, and thus falls in the area of structures predictable using simple heuristics. Moreover, our use of PageRank could be expanded to directly score the potential dependency edges instead of words, e.g., by means of edge reification.

Figure 2 :
Figure 2: Example dependency tree predicted by the algorithm.
Table 3 shows a trace of the algorithm, with C = ⟨had, connection, extremists, special⟩ and F = {They, also, a, to, some}.

Table 3 :
Algorithm trace for the example sentence. it: iteration number; word: current word; H: set of possible heads.

Table 4 :
Matrix representation of the directed graph for the words in the sentence.

Table 7 summarizes the per-language average score and standard deviation, as well as the macro-averaged standard deviation across languages. UDP has a much lower standard deviation across domains compared to MSD; this holds across languages.