A rule-based system for cross-lingual parsing of Romance languages with Universal Dependencies

This article describes MetaRomance, a rule-based cross-lingual parser for Romance languages submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The system is an almost delexicalized parser which does not need training data to analyze Romance languages. It contains linguistically motivated rules based on PoS-tag patterns. The rules included in MetaRomance were developed in about 12 hours by one expert with no prior knowledge of Universal Dependencies, and can be easily extended using a transparent formalism. In this paper we compare the performance of MetaRomance with other supervised systems participating in the competition, paying special attention to the parsing of different treebanks of the same language. We also compare our system with a delexicalized parser for Romance languages, and take advantage of the harmonized annotation of Universal Dependencies to propose a language ranking based on the syntactic distance each variety has from Romance languages.


Introduction
This article describes the MetaRomance parser, which participated in the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. MetaRomance is a rule-based parser for Romance languages adapted to Universal Dependencies (UD). The system relies on a basic grammar consisting of simple cross-lingual and (almost) delexicalized rules likely to be shared by most Romance languages. The rules are almost delexicalized because they apply mainly to Universal PoS-tags and contain only a few grammatical words (some prepositions and conjunctions) together with a small list of verbs. The grammar was developed in about 12 hours by one expert with no prior knowledge of UD. As the Universal Dependencies initiative (Nivre et al., 2016) offers linguistic criteria providing a consistent representation across languages, it fits perfectly with our objective of defining cross-lingual rules. In fact, the availability of harmonized treebanks provides an interesting test bench for cross-lingual dependency parsing research (McDonald et al., 2011; McDonald et al., 2013; Vilares et al., 2016).
Our participation in this CoNLL 2017 shared task has several experimental objectives. First, we will compare our rule-based approach with the other participants, which are likely to be supervised systems, with regard to Romance languages. Namely, we will analyze the performance of several systems on different treebanks of the same language. Then, we will also evaluate the cross-lingual property of our system by comparing it with a supervised delexicalized parser. Last but not least, the analysis of the results in the shared task will allow us to check whether our method might be useful to measure the syntactic distance between Romance and non-Romance languages.
The results of the different experiments show that, in spite of its simplicity, MetaRomance achieves reasonable results on Romance languages with no training data, and that its performance is relatively uniform across different treebanks of the same language. The delexicalized rules of this system allowed us to present a classification of all the languages present in the shared task ranked by their syntactic distance from Romance languages.
The remainder of the paper is organized as follows. Section 2 presents some related work on dependency and cross-lingual parsing. Then, we present the architecture of MetaRomance in Section 3, and several experiments in Section 4. Finally, we briefly discuss the results and present the conclusions of our work in Sections 5 and 6, respectively.

Related work
In contrast to data-driven approaches, many grammar-driven (or rule-based) parsers use finite-state technology, which has attractive properties for syntactic parsing, such as conceptual simplicity, flexibility, and efficiency in terms of space and time. It makes it possible to build robust and deterministic parsers. Most finite-state parsing strategies use cascades of transducers (Ait-Mokhtar et al., 2002; Oflazer, 2003).
Concerning cross-lingual parsing, there are two main approaches for parsing one language (the target) with resources from one or more source languages: (a) data transfer, and (b) model transfer methods. On the one hand, data transfer approaches obtain annotated treebanks of a target language by projecting the syntactic information from the source data. Some methods use parallel corpora (Hwa et al., 2005;Ganchev et al., 2009; while others create artificial data taking advantage of machine translation (Tiedemann and . On the other hand, the strategies based on model transfer train systems on the source data that can be used to parse a target language (Zeman and Resnik, 2008). The emergence of different initiatives promoting harmonized annotations allowed researchers to explore this approach, using delexicalized models and multi-source strategies (McDonald et al., 2011; Täckström et al., 2012).
More recently, some works have addressed multilingual parsing using a single model (trained on a combination of various treebanks) to analyze different languages (Vilares et al., 2016; Ammar et al., 2016).
The growth in cross-lingual parsing research has given rise to a recent shared task at VarDial 2017 (Zampieri et al., 2017), Cross-lingual Dependency Parsing (CLP) (Tiedemann, 2017). CLP is a shared task whose aim is to develop models for parsing selected target languages without annotated training data, but having annotated data in one or two closely related languages (Rosa et al., 2017).
With the emergence of UD as the practical standard for multilingual PoS and syntactic dependency annotation, it is possible to develop universal rule-based strategies that require no training data and rely on basic rules exploiting the UD criteria. The Universal Dependency Parser, described by Martínez Alonso et al. (2017), is a good example of this unsupervised strategy. Our work goes in that direction, but with two differences: the grammar is focused on Romance languages and the parser relies on basic rules implemented as cascades of finite-state transducers.

The architecture
The core of MetaRomance, depicted in Figure 1, consists of the following modules:
• An adapter converting CoNLL-U into the format required by the rule-based parser.
• A MetaRomance grammar with 150 cross-lingual rules configured to work with the tags, labels and linguistic constraints of UD.
• A grammar compiler that takes the grammar as input and generates a dependency parser, which is based on finite-state transitions.
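As an illustration of the first module, the following minimal sketch reads one CoNLL-U sentence and keeps only the fields that PoS-based rules would need (form, lemma, UPOS and morphological features). The `Token` type and the function name are hypothetical; the actual input format expected by the rule-based parser is not described here, so this only shows the reading side of such an adapter.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """Hypothetical container for the fields used by PoS-based rules."""
    idx: int
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]


def read_conllu_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence block (10 tab-separated columns per token)."""
    tokens = []
    for line in block.strip().splitlines():
        # Skip empty lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if not cols[0].isdigit():
            continue
        feats: Dict[str, str] = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3], feats))
    return tokens


SAMPLE = (
    "# text = del gato\n"
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tde\tde\tADP\t_\t_\t_\t_\t_\t_\n"
    "2\tel\tel\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing\t_\t_\t_\t_\n"
    "3\tgato\tgato\tNOUN\t_\tGender=Masc|Number=Sing\t_\t_\t_\t_\n"
)
tokens = read_conllu_sentence(SAMPLE)
```

Here the contraction line `1-2` is dropped and the three syntactic words are kept, which is the level at which UD dependencies are defined.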
In order to allow MetaRomance to work on raw text, some scripts are provided in addition to the core architecture for converting the tags given by different PoS-taggers (namely, FreeLing (Padró and Stanilovsky, 2012; Garcia and Gamallo, 2010), TreeTagger (Schmid, 1994), and LinguaKit (Garcia and Gamallo, 2015)) into the CoNLL-U format. Thus, MetaRomance is able to parse raw text which has been tokenized, lemmatized and PoS-tagged with several systems that provide high-quality analyses for different languages.

The MetaRomance grammar
The cost of writing the grammar is not high, since its size is small and the rules are not language-specific. The strategy we followed to write the MetaRomance grammar is based on two methodological principles:
• Start with high-coverage rules.
• Otherwise, develop rules shared by as many Romance languages as possible.
The objective is to find a trade-off between high performance and low effort, i.e. we look for efficiency. Most rules satisfy these two principles, giving rise to a broad-coverage parser. We have not defined non-projective rules since, in general, they have low coverage and are language-dependent. Some rules contain information on specific lexical units, but only to identify grammatical words: some prepositions, conjunctions, determiners, and pronouns (and a small and automatically extracted list of verbs). Most phenomena not covered by the grammar are related to long-distance dependencies, including subordinate clauses in non-canonical positions, or to complex issues derived from coordination.
Cross-lingual rules were written with DepPattern (Gamallo and González, 2011), a high-level syntactic formalism designed for writing dependency-based grammars. This dependency formalism has been adapted so that it can interpret Universal Dependencies, more specifically UDv2. All rules were written in about 12 hours by an expert linguist skilled in the DepPattern formalism but with no prior knowledge of UD. He took into account the syntactic structure of all Romance languages of the UDv2 treebanks except Romanian. The following is an example of a DepPattern rule:
det: DET [ADJ]? NOUN
Agreement: gender, number
%
The first line contains, separated by the colon, the name of the dependency relation (det) together with the PoS context. Here, a determiner is linked as a dependent of a noun (the head), with an optional adjective between them. The second line states that the rule is applied only if the dependent and the head agree in gender and number.
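To make the reading of this rule concrete, the sketch below applies the same pattern procedurally to a list of PoS-tagged tokens. It is not the code generated by the DepPattern compiler: the dictionary keys, the function name, and the interpretation of agreement as strict equality of the gender and number values are assumptions made for illustration.

```python
def apply_det_rule(tokens):
    """Return (dependent, head, label) index triples for DET [ADJ]? NOUN.

    Each token is a dict with the (hypothetical) keys "upos", "gender"
    and "number". Agreement is assumed to mean equal values.
    """
    arcs = []
    for i, tok in enumerate(tokens):
        if tok["upos"] != "DET":
            continue
        j = i + 1
        # Optional adjective between the determiner and the noun.
        if j < len(tokens) and tokens[j]["upos"] == "ADJ":
            j += 1
        if j < len(tokens) and tokens[j]["upos"] == "NOUN":
            head = tokens[j]
            agrees = (tok["gender"] == head["gender"]
                      and tok["number"] == head["number"])
            if agrees:
                arcs.append((i, j, "det"))
    return arcs


# "la pequeña casa": the determiner attaches to the noun across the adjective.
phrase = [
    {"form": "la", "upos": "DET", "gender": "Fem", "number": "Sing"},
    {"form": "pequeña", "upos": "ADJ", "gender": "Fem", "number": "Sing"},
    {"form": "casa", "upos": "NOUN", "gender": "Fem", "number": "Sing"},
]
det_arcs = apply_det_rule(phrase)  # [(0, 2, "det")]

# A number mismatch blocks the rule.
mismatch = [
    {"form": "los", "upos": "DET", "gender": "Masc", "number": "Plur"},
    {"form": "gato", "upos": "NOUN", "gender": "Masc", "number": "Sing"},
]
no_arcs = apply_det_rule(mismatch)  # []
```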
As the grammar is not complete, giving rise to partial parses, we implemented a post-editor script linking all tokens without head information to the corresponding sentence root. Moreover, in order to assign a label to each unknown dependency, the script associates dependency names to PoS-tags: e.g., PUNCT is associated with the dependency name "punct", VERB with "xcomp", and so on.
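The behaviour of this post-editing step can be sketched as follows. Only the PUNCT and VERB associations come from the description above; the fallback label `dep`, the token representation, and the assumption that the partial parse already contains a token marked as root (head 0) are ours.

```python
# PoS-tag to default dependency label. Only these two pairs are given in the
# text; the rest of the real table is not reproduced here.
DEFAULT_LABELS = {"PUNCT": "punct", "VERB": "xcomp"}
FALLBACK_LABEL = "dep"  # assumption: label used when the PoS-tag is not listed


def post_edit(tokens):
    """Attach every headless token to the sentence root with a default label.

    Tokens are dicts with "id" (1-based), "upos", "head" and "deprel";
    a missing head is represented by None, and the root has head 0.
    """
    root = next(tok for tok in tokens if tok["head"] == 0)
    for tok in tokens:
        if tok["head"] is None:
            tok["head"] = root["id"]
            tok["deprel"] = DEFAULT_LABELS.get(tok["upos"], FALLBACK_LABEL)
    return tokens


partial = [
    {"id": 1, "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "upos": "ADV", "head": None, "deprel": None},
    {"id": 4, "upos": "PUNCT", "head": None, "deprel": None},
]
completed = post_edit(partial)
```

After the call, tokens 3 and 4 both depend on token 2, so the output is a complete tree even when the grammar covers only part of the sentence.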
It is worth noting that the rules implemented in MetaRomance only make use of 25 out of the 37 universal relations defined in the UDv2 guidelines.

A finite-state transition parser
The parser, automatically generated from the formal grammar, is based on a finite-state transition approach that follows a strategy similar to the shift-reduce algorithm. More precisely, it consists of a set of transducers/rules that compress the input sequence of tokens by progressively removing the dependent tokens as soon as dependencies are recognized (Gamallo, 2015). So, at each application of a rule, the system reduces the input and makes it easier to find new dependencies in further rule applications. In particular, short dependencies are recognized first and, as a consequence, the input is simplified, which eases the recognition of long-distance dependencies. This is inspired by the easy-first strategy.
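The reduction strategy can be illustrated with a small sketch in which each rule matches two adjacent PoS-tags in the current (already reduced) sequence and removes the dependent once the arc is recorded. The four rules and their ordering are toy examples chosen for this illustration, not rules taken from the MetaRomance grammar, and the real parser is compiled from DepPattern rather than written this way.

```python
# Toy cascade: (label, left PoS, right PoS, side of the dependent).
RULES = [
    ("det", "DET", "NOUN", "left"),
    ("amod", "NOUN", "ADJ", "right"),
    ("nsubj", "NOUN", "VERB", "left"),
    ("obj", "VERB", "NOUN", "right"),
]


def reduce_parse(upos, rules=RULES):
    """Apply the rules in cascade, removing dependents as arcs are found.

    Returns (arcs, remaining): arcs maps a dependent index to a
    (head index, label) pair; remaining lists the indices never reduced.
    """
    working = list(range(len(upos)))
    arcs = {}
    changed = True
    while changed:
        changed = False
        for label, left, right, dep_side in rules:
            k = 0
            while k < len(working) - 1:
                a, b = working[k], working[k + 1]
                if upos[a] == left and upos[b] == right:
                    dep, head = (a, b) if dep_side == "left" else (b, a)
                    arcs[dep] = (head, label)
                    working.remove(dep)  # the input sequence shrinks
                    changed = True
                    k = max(k - 1, 0)    # re-check the newly adjacent pair
                else:
                    k += 1
    return arcs, working


# "el gato negro come el pescado"
tags = ["DET", "NOUN", "ADJ", "VERB", "DET", "NOUN"]
arcs, remaining = reduce_parse(tags)
```

In this example the noun at index 1 and the verb at index 3 are not adjacent in the original sentence; the nsubj rule only matches after the adjective between them has been attached and removed, which is the effect described above.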

Experiments
This section presents several evaluations of MetaRomance using the data provided by the CoNLL 2017 shared task on UD parsing. We will show the results of the following experiments:
• Comparison of MetaRomance with other supervised approaches on all the test treebanks of Romance languages.
• Analysis of the performance of several parsers on different treebanks of the same language.
• Comparison of MetaRomance with a neural network delexicalized parser for Romance languages.
• Syntactic distance between Romance and non-Romance languages.
As we had several alignment issues concerning the evaluation of data pre-processed by LinguaKit and FreeLing, all the experiments presented in this paper (as well as the official MetaRomance results) used as input the tokenized, lemmatized and PoS-tagged data provided by the UDPipe baseline models.

Results at CoNLL-2017 shared task
In general, our system obtained low LAS and UAS results on the whole dataset of the shared task (34.05% LAS, 42.55% UAS). After correcting a small bug in a script, which produced invalid treebanks for three languages, we obtained 34.98% LAS and 43.81% UAS. These results were mostly expected given the characteristics of MetaRomance: an almost delexicalized parser which does not require training data, with simple rules based only on the syntactic structure of Romance languages.
MetaRomance needed 29 minutes and 155MB of memory to parse all the test sets on the TIRA virtual machine provided by the shared task (Potthast et al., 2014). Table 1 shows the official MetaRomance results on every treebank of a Romance language evaluated in the shared task. On average, our system achieved F1 results of 58.9 (LAS) and 66.1 (UAS). The worst results were obtained on Romanian; this was expected because (a) Romanian is linguistically more distant than the other Romance languages (Gamallo et al., 2017), and (b) we did not implement any dependency rule with this language in mind.
Even if the values in Table 1 are not comparable with those of most supervised systems in the competition, our simple parser obtained competitive results in some languages, such as es, it, and pt. Interestingly, MetaRomance performed better on the pud datasets than on the other treebanks of the same languages (with only one exception: UAS results in pt and pt_pud), while most systems in the shared task lost several points on the pud datasets. In this respect, MetaRomance outperformed some supervised approaches on treebanks such as pt_pud or gl_treegal (the latter with little training data).
Some of the results on different treebanks of the same language show noticeable LAS differences: more than 5 points between es and es_pud, and about 10% between pt_br and the two other treebanks of Portuguese. In this regard, our next experiment compares the cross-treebank performance of supervised models (i.e., parsing different treebanks of the same language with the same model). To carry out this experiment we trained a UDPipe model (Straka et al., 2016) on each training dataset of Spanish, Galician, and Portuguese. These models were trained using the default parameters of UDPipe 1.1, but removing the lemmas and the morphological features of the treebanks, with a view to building parsers with more robust performance across the different test sets.
Table 2 includes the LAS and UAS values of each model (in the columns) on the target treebanks (in the rows). These numbers clearly show that the results of supervised models vary widely when parsing a treebank different from the one used for training, even if both corpora belong to the same language. These differences are much higher than those reported for MetaRomance, exceeding 22% in gl parsing gl_treegal, more than 15% in the analysis of es by es_ancora, or more than 14% in pt_br parsing pt. Note, however, that most supervised parsers (except gl analyzing gl_treegal) achieved better results than those obtained by MetaRomance.
These results (both the UDPipe and the MetaRomance ones) suggest that careful analyses of the different treebanks are required to determine whether these large variations are due to different domains, annotation issues, or linguistic differences.
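For reference, the two metrics used throughout this section can be sketched as follows. This simplified version assumes that the gold and predicted analyses share the same tokenization, in which case the F1 measure of the shared task reduces to plain accuracy; it is not the official evaluation script, and the function name is hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are aligned lists of (head, deprel) pairs, one per token.
    UAS counts tokens with the correct head; LAS additionally requires the
    correct dependency label.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


gold = [(2, "det"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "det"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)  # 75.0, 50.0
```

The third token has the right head but the wrong label, so it counts for UAS only, while the fourth token is wrong for both metrics.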

Comparison with a cross-lingual delexicalized parser
In the next experiment we compare the performance of MetaRomance with a delexicalized parser trained with a combined corpus which includes sentences from every Romance treebank.
This is a competitive supervised baseline in cross-lingual transfer parsing work, which gives us an indication of how our system compares to standard cross-lingual parsers. We trained 50 UDPipe models by randomly selecting from 1 to 50 sentences of each Romance treebank in the training data. Then, we obtained the average results on all the Romance test treebanks, and plotted them together with the MetaRomance performance in Figure 2.
This figure shows that MetaRomance obtains similar results (≈ 59% LAS) to those achieved with about 2,000 tokens of all the Romance treebanks. The learning curve also suggests that it is difficult for cross-lingual models with no lexical features (like MetaRomance, which is also delexicalized) to keep increasing their cross-lingual performance on Romance languages. Thus, UDPipe achieves 64% with about 5,000 tokens, but it cannot surpass 65% even with a training corpus of 20,000 tokens.
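The construction of the training sets behind this learning curve can be sketched as below. The function name, the fixed random seed, and the choice of drawing an independent sample for each size are assumptions; the text only states that from 1 to 50 sentences were randomly selected from each Romance treebank. Training and evaluating the UDPipe models is left out.

```python
import random


def build_training_sets(treebanks, max_sentences=50, seed=0):
    """Return {n: combined sample} with n sentences drawn from each treebank.

    treebanks maps a treebank name to a list of sentences (any objects).
    Each treebank must contain at least max_sentences sentences.
    """
    rng = random.Random(seed)
    training_sets = {}
    for n in range(1, max_sentences + 1):
        combined = []
        for name in sorted(treebanks):
            # Sample without replacement inside each treebank.
            combined.extend(rng.sample(treebanks[name], n))
        training_sets[n] = combined
    return training_sets


# Toy data standing in for the Romance training treebanks.
toy_treebanks = {
    "es": [f"es-sent-{i}" for i in range(60)],
    "pt": [f"pt-sent-{i}" for i in range(60)],
    "gl": [f"gl-sent-{i}" for i in range(60)],
}
training_sets = build_training_sets(toy_treebanks)
```

Each of the 50 resulting sets would then be used to train one delexicalized model, whose scores averaged over the Romance test treebanks give one point of the curve.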

Syntactic distance from Romance languages
The last experiment is an attempt to rank all the languages in the shared task with respect to the Romance family, to find out whether these results can be used as a syntactic distance between Romance and non-Romance languages. For those languages with more than one treebank we show the average results. As expected, at the top of the ranking we find Romance languages, on which MetaRomance achieves the best results (except on Romanian, slightly surpassed by Bulgarian). With few exceptions, such as the Indian varieties, which obtained low values, Indo-European languages have the best results. In general, our system does not reach 40% UAS in non-Indo-European languages, except in Hungarian and in Indonesian. In this regard, it is worth mentioning that Indonesian (with 51% UAS) has a Subject-Verb-Object word order similar to most European languages (Sneddon, 1996).

Discussion
The experiments performed in this paper provided some interesting results that call for further research in cross-lingual parsing.
On the one hand, there are noticeable differences when parsing different treebanks of the same language, both using a rule-based system and harmonized supervised models. In this respect, it could be interesting to analyze the source of these variations, and MetaRomance could be useful for this purpose because it uses linguistically transparent rules based on PoS-tags.
On the other hand, the learning curve of a cross-lingual delexicalized model reinforces the idea that lexical features are required to obtain high-quality parsing results. In this respect, further experiments could compare this learning curve to lexicalized cross-lingual models, which seem to obtain good results in languages from the same linguistic family. Concerning MetaRomance, the addition of new rules (both lexicalized and without lexical information) could allow the parser to better analyze different languages.
Finally, and even if this is not a fair comparison, it is worth noting that MetaRomance obtained higher results in Romance languages than those achieved by UDP (Martínez Alonso et al., 2017). UDP is a training-free parser based on PageRank and a small set of head attachment rules, and is more generic than MetaRomance (it can be applied to any language with more homogeneous results than our system). The differences on Romance languages range from a few decimals to more than 6% UAS, but the experiments were performed using different versions of the UD treebanks.

Conclusions
This paper presented our submission to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The system, MetaRomance, is a fast rule-based parser suited to analyze Romance languages with no training data. It can be used on top of several PoS-taggers such as LinguaKit, FreeLing, and TreeTagger, or on a CoNLL-U file processed by tools such as UDPipe.
This cross-lingual parser contains 150 rules based on PoS-tag patterns, implemented by a linguist in about 12 hours. The MetaRomance grammar was written in DepPattern, a formalism that allows experts to easily modify and extend the rules to cover more syntactic phenomena.
Several experiments showed that a simple system such as the one proposed in this paper can analyze in a uniform way different treebanks of Romance languages (and also from other linguistic families). Furthermore, a preliminary experiment on cross-lingual delexicalized parsing of Romance languages suggested that lexical features are needed to increase the parsing performance. Lexical information can be added both to supervised systems and to our rule-based approach.
The grammar provided by MetaRomance was also used to present a classification of all the languages of the shared task datasets ranked by their syntactic distance with respect to Romance languages.