Delexicalized transfer parsing for low-resource languages using transformed and combined treebanks

This paper describes our dependency parsing system for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. We focus primarily on the low-resource (surprise) languages. We have developed a framework that combines multiple treebanks to train delexicalized parsers for low-resource languages. We also transform the source language treebanks based on syntactic features of the low-resource language to improve the performance of the parser. In the official evaluation, our system achieves macro-averaged LAS scores of 67.61 and 37.16 on the entire blind test data and the surprise language test data, respectively.


Introduction
A dependency parser analyzes the relations among the words in a sentence to determine the syntactic dependencies among them where the dependency relations are drawn from a fixed set of grammatical relations. Dependency parsing is a very important NLP task and has wide usage in different tasks such as question answering, semantic parsing, information extraction and machine translation.
There has been considerable recent focus on developing dependency parsers for low-resource languages, i.e., languages for which few or no treebanks are available, by cross-lingual transfer parsing methods that use knowledge derived from treebanks of other languages together with the resources available for the low-resource languages (McDonald et al., 2011; Tiedemann, 2015; Zeman and Resnik, 2008; Rasooli and Collins, 2015).

The Universal Dependencies (http://universaldependencies.org/) (Nivre et al., 2016) project has enabled the development of consistent treebanks for several languages using a uniform PoS, morphological feature and dependency relation tagging scheme. This has immensely helped research in multilingual parsing, cross-lingual transfer parsing and the comparison of language structures across several languages.
The CoNLL 2017 shared task focuses on learning syntactic parsers that start from raw text and work over several typologically different languages, including surprise languages for which no training data is available, using the common annotation scheme (UD v2). The details of the task are available in the overview paper (Zeman et al., 2017).
For parsing the surprise languages we trained delexicalized parser models. In order to improve performance on the surprise languages, we applied syntactic transformations to some source language treebanks, based on information obtained from "The World Atlas of Language Structures" (WALS) (Haspelmath, 2005) and from the sample data, and used the transformed treebanks to train the parsers for the surprise languages. The details of the treebanks are discussed in Section 3.1.
The rest of the paper is organized as follows. In Section 2 we describe the corpora and resources used to build our system. In Section 3 we describe in detail the methods used to train the parser models. In Section 4 we describe the experiments and report the results, and we conclude in Section 5.

Corpus and resources
We used the treebanks (UD v2.0) (Nivre et al., 2017b) that were officially released for the shared task to train our parser models. The dataset consists of 70 treebanks for 50 different languages. There are multiple treebanks for some languages, such as Arabic, English, French and Russian. For the shared task, only the training and development data were released. Small sample treebanks (approximately 20 sentences per language) for the surprise languages were made available separately one week before the test phase.
We used the pre-trained 50-dimensional word vectors provided by the organizers to train the parser models. For tokenization and tagging we used the baseline models provided by the organizers. Our parser models were trained using the Parsito parser (Straka et al., 2015) implemented in the UDPipe (Straka et al., 2016) text-processing pipeline.

System description
Our parser worked on the tokenized and tagged files (*-udpipe.conllu) provided by the organizers rather than on the raw text files. We first discuss the steps for training the models for the surprise languages in Section 3.1, followed by the methods used to train the models for the new parallel treebanks in Section 3.2 and for the known treebanks in Section 3.3.

Surprise language
The surprise languages are Buryat (bxr), Kurmanji (Kurdish) (kmr), North Sámi (sme) and Upper Sorbian (hsb) for which sample data of approximately 20 annotated sentences per language was made available. No training or test data is available for the surprise languages.
We used cross-lingual parser transfer to develop parsers for the surprise languages using the treebanks of resource-rich languages (McDonald et al., 2011). Annotation projection (Hwa et al., 2005) and delexicalized transfer (Zeman and Resnik, 2008) are the two major methods of cross-lingual parser transfer.
However, annotation projection requires parallel data, which is not available for the surprise languages. Hence, we used the delexicalized parser transfer method to train parser models for the surprise languages. Training a delexicalized parser involves supervised training of a parser model on a source language (SL) treebank without using any lexical features and then applying the model directly to parse sentences in the target language (TL). Zeman and Resnik (2008) and Søgaard (2011) have shown that cross-lingual transfer by delexicalization works best for syntactically related language pairs. The first step was therefore to identify the languages that are syntactically related to the surprise languages and whose treebanks are in the Universal Dependencies corpus. We observed that Upper Sorbian, being a Slavonic language, is typologically related to Czech, Polish and, to some extent, Slovak. North Sámi is spoken in the northern parts of Norway, Sweden and Finland. It belongs to the Finno-Ugric family and hence has typological similarities with Estonian, Finnish and Hungarian. Kurmanji has typological similarities with Persian and Turkish. Buryat is spoken in Mongolia. Although none of the languages whose treebanks are available in the Universal Dependencies corpus belongs to the family of Buryat, we guessed that Kazakh, Tamil, Hindi and Urdu might have some similarities with Buryat based on the syntactic features and phrasal structures of these languages.
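To make the delexicalization step concrete, the following is a minimal sketch of one way to strip lexical information from a CoNLL-U treebank before training. It assumes that delexicalization amounts to blanking the FORM and LEMMA columns; the function name `delexicalize_conllu` is hypothetical, and the paper does not state how this step was actually implemented.

```python
def delexicalize_conllu(text: str) -> str:
    """Blank the FORM and LEMMA columns of CoNLL-U text.

    Assumption: a delexicalized treebank keeps PoS tags, features, heads
    and relations, and only drops the word forms and lemmas.
    """
    out = []
    for line in text.splitlines():
        # Keep blank sentence separators and comment lines unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:      # a regular CoNLL-U token line
            cols[1] = "_"        # FORM
            cols[2] = "_"        # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


sample = (
    "# sent_id = 1\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)
delexicalized = delexicalize_conllu(sample)
```

Because only the PoS tags, morphological features and dependency columns remain, a model trained on such a file can be applied to any language that shares the UD tag set.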
In order to verify our guesses, we tested the delexicalized models trained on individual treebanks on the sample data for the surprise languages and ranked them based on LAS. We observed that our guesses were quite close to the actual results, except in a few cases. Table 3.1 lists the top-5 languages for each of the surprise languages based on LAS. Encouraged by these results, which support our guesses, we explored a transformation-based method to further reduce the syntactic differences between the surprise languages and the corresponding source languages. Besides attempting to reduce the syntactic differences between the languages, we also experimented with combining the treebanks for which the individual LAS scores were highest, to further boost the LAS on the surprise languages. Aufrant et al. (2016) have shown that local reordering of the words of the source sentence, using a PoS language model and linguistic knowledge derived from WALS, improves the performance of the delexicalized transfer parser even for syntactically different SL-TL pairs. The reordering features they use are the relative orderings of adjectives, adpositions, articles (definite and indefinite) and demonstratives with respect to the corresponding modified nouns in the TL.
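The ranking above relies on attachment scores. As a minimal sketch, assuming that the gold and predicted analyses share the same tokenization (as they do when parsing pre-tokenized sample data), UAS and LAS can be computed as follows. The function name `attachment_scores` is hypothetical, and the official shared task metric is an F1 score that also handles tokenization mismatches, which this sketch does not.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold, pred: lists of (head, deprel) pairs, one per token, aligned
    one-to-one. This alignment is an assumption of the sketch.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have equal length")
    if not gold:
        return 0.0, 0.0
    total = len(gold)
    # UAS: the head is correct. LAS: both head and relation are correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
```

In this toy example all three heads are correct but one relation label is wrong, so UAS is higher than LAS.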

Syntactic feature based transformation
However, for language pairs that differ in the arrangement of verb arguments, local rearrange-

Table 2: Ordering of the head-modifier pairs in the target language as derived from the universal dependency treebank statistics. "pre" indicates that the modifier precedes the head, "post" indicates that the modifier follows the head and "-" indicates that the ordering cannot be decided. "×" shows that the dependency does not apply to the language. Some of the feature identifiers are derived from WALS: 81A: order of subject, object and verb in a sentence; 84A: order of object, oblique and verb; 87A: ordering of ADJ and NOUN; 85A: ordering of ADP and NOUN; 37A/38A: ordering of definite/indefinite articles and NOUN; 88A: ordering of demonstrative and NOUN; 86A: ordering of genitive and NOUN; 90A: ordering of relative clause and VERB.

An example of syntactic transformation
In order to illustrate the syntactic difference between two languages, we put forth an example for the English-Kurmanji (Kurdish) language pair.
Figure 1: Transformation of English sentence to match the syntax of Kurdish sentence based on the syntactic features of Hindi
The example in Figure 1 shows the syntactic difference between the two languages, English and Kurmanji (Kurdish), and how the transformation of the English sentence makes it syntactically closer to Kurmanji (Kurdish). English sentences have an SVO structure while Kurmanji (Kurdish) has an SOV structure. Moreover, in English the oblique arguments tend to appear after the object, while in Kurmanji (Kurdish) the oblique arguments tend to appear before the object.
The English sentence "Me and my friend had fish last night" may be translated to Kurmanji (Kurdish) as "Minû hevalê min şeva din masî xwar" (Me and friend my night last fish had). In this sentence pair, "Me and my friend" (Minû hevalê min) and "fish" (masî) are the subject and the object of the main verb "had" (xwar), and "last night" (şeva din) is the non-core (oblique) argument indicating the time of occurrence of the verb. Also, in Kurmanji (Kurdish), the adjectival modifiers and genitives occur after the modified noun, e.g., hevalê min (friend my) and şeva din (night last), while in English these modifiers occur before the modified noun, e.g., "my friend" and "last night".

Transformation features
Apart from the features proposed by Aufrant et al. (2016), we obtained the order of subject, object and verb (SOV), the order of object, oblique and verb, and the relative order of relative clauses, auxiliaries, copula verbs, clausal complements and clausal modifiers (adjectival and adverbial) with respect to the modified verb from WALS and the statistics of the sample data. The relative ordering of head-modifier pairs based on the features derived from treebanks was determined using the following heuristic: if a particular order appears in at least 90% of the occurrences of the feature (dependency tag), we use that ordering for the feature; otherwise we do not apply any transformation based on that feature. We relied on the WALS features and the statistics of the sample data to derive the syntactic features and ignored the features that did not appear in either of these two sources. For example, although Buryat, Upper Sorbian and Kurmanji have relative clauses, we found neither a mention of the feature in WALS for these languages nor the relation in their sample data. Hence, we did not use that relation during transfer. In Table 3.1 we summarize the transformation features and the corresponding orderings used for each surprise language.
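The 90% heuristic can be sketched as below. This is an illustration under stated assumptions rather than the code used for the submission: the function name `decide_order`, its input format (pairs of head and modifier positions for a single dependency relation) and the `None` return value for unobserved relations are all hypothetical.

```python
from collections import Counter


def decide_order(pairs, threshold=0.9):
    """Decide the head-modifier order of one dependency relation.

    pairs: iterable of (head_index, modifier_index) observed in the data.
    Returns "pre" or "post" if that order covers at least `threshold` of
    the occurrences, "-" if neither does, and None if the relation was
    never observed (in which case no transformation is applied).
    """
    counts = Counter("pre" if mod < head else "post" for head, mod in pairs)
    total = sum(counts.values())
    if total == 0:
        return None
    for order in ("pre", "post"):
        if counts[order] / total >= threshold:
            return order
    return "-"


# 19 of 20 modifiers precede their head: the order is accepted as "pre".
mostly_pre = [(3, 1)] * 19 + [(1, 3)]
# 8 of 10 precede their head: below the threshold, so undecided.
mixed = [(3, 1)] * 8 + [(1, 3)] * 2
```

Here `decide_order(mostly_pre)` accepts the "pre" ordering, whereas `decide_order(mixed)` leaves the feature undecided so that no reordering is triggered.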
We categorized the dependency relations into six classes.
Among the determiners we considered only the articles and demonstratives. We further divided the members of the modifier class into pre-modifiers and post-modifiers, depending on the position they take in the sentence with respect to the parent word in the TL.

Tree-traversal based transformation algorithm
Given a source language sentence S = {w_1, ..., w_m}, where m is the length of S, let T_S be the parse tree of S. The transformation is carried out in two steps.

• Step 1: Remove from the SL parse tree the words corresponding to the dependency relations that do not hold in the TL, e.g., remove demonstratives when North Sámi is the target language.
• Step 2: Rewrite the sentence by a tree-traversal method, following the ordering of the head-modifier pairs given by the transformation features.
We have a separate transformation procedure for each target language. The procedure BuildTree is common to all the target languages. In this procedure we construct the tree data structure in which each node corresponds to a word in the sentence. Each node contains subject, object, clausal complement, pre-modifier, post-modifier and other-modifier lists. The lists of a node are filled only with the dependents of the corresponding word in the dependency tree. The subjects, objects and clausal complements of the word are added to the corresponding lists. While constructing the pre-modifier, post-modifier and other-modifier lists, the module refers to a look-up table to obtain the order of the modifiers in the TL and places the modifiers in the corresponding lists. Not all lists are necessarily filled, e.g., if none of the dependents holds a subject relation (nsubj or nsubj:pass) with the word, then the subject list of the corresponding node remains empty.
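As a rough illustration of the node structure just described and of the kind of traversal it is built for, the sketch below builds the per-node lists and then linearizes the tree for an SOV target language. Everything here is an assumption made for illustration: the names `Node`, `build_tree` and `traverse_sov`, the relation sets, the look-up table contents and the exact traversal order (subjects, then pre-modifiers and earlier other-modifiers, then objects and clausal complements, then the head word, then post-modifiers and later other-modifiers). Step 1, the removal of relations absent from the TL, is not covered.

```python
from dataclasses import dataclass, field

SUBJECT = {"nsubj", "nsubj:pass"}
OBJECT = {"obj", "iobj"}          # assumed membership of the object list
CCOMP = {"ccomp", "xcomp"}        # assumed membership of the complement list


@dataclass
class Node:
    idx: int
    form: str
    subjects: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    ccomps: list = field(default_factory=list)
    pre: list = field(default_factory=list)
    post: list = field(default_factory=list)
    other: list = field(default_factory=list)


def build_tree(tokens, order):
    """tokens: (idx, form, head, deprel) tuples, head 0 marks the root.
    order: look-up table mapping a deprel to "pre" or "post" in the TL."""
    nodes = {idx: Node(idx, form) for idx, form, _, _ in tokens}
    root = None
    for idx, _, head, rel in tokens:
        if head == 0:
            root = nodes[idx]
            continue
        parent, child = nodes[head], nodes[idx]
        if rel in SUBJECT:
            parent.subjects.append(child)
        elif rel in OBJECT:
            parent.objects.append(child)
        elif rel in CCOMP:
            parent.ccomps.append(child)
        elif order.get(rel) == "pre":
            parent.pre.append(child)
        elif order.get(rel) == "post":
            parent.post.append(child)
        else:
            parent.other.append(child)   # keeps its SL position
    return root


def traverse_sov(node):
    """Return the reordered word list for the subtree rooted at node."""
    before = [c for c in node.other if c.idx < node.idx]
    after = [c for c in node.other if c.idx > node.idx]
    front = (node.subjects
             + sorted(node.pre + before, key=lambda c: c.idx)
             + node.objects + node.ccomps)
    words = []
    for child in front:
        words.extend(traverse_sov(child))
    words.append(node.form)
    for child in node.post + after:
        words.extend(traverse_sov(child))
    return words


# "Me and my friend had fish last night" with an assumed UD analysis.
tokens = [
    (1, "Me", 5, "nsubj"), (2, "and", 4, "cc"), (3, "my", 4, "nmod:poss"),
    (4, "friend", 1, "conj"), (5, "had", 0, "root"), (6, "fish", 5, "obj"),
    (7, "last", 8, "amod"), (8, "night", 5, "obl:tmod"),
]
# Assumed Kurmanji-like orderings: genitives and adjectives follow the noun,
# obliques precede the verb.
order = {"nmod:poss": "post", "amod": "post", "obl:tmod": "pre"}
reordered = " ".join(traverse_sov(build_tree(tokens, order)))
# reordered == "Me and friend my night last fish had"
```

With these assumed settings the English example sentence is rewritten into the word order of the Kurmanji gloss given earlier in this section.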
We have separate procedures for transforming the SL trees for each TL. The sentences are transformed by traversing the trees according to the ordering of the dependencies in the TL. For example, the subtrees corresponding to the modifiers in the pre-modifier list, and to the modifiers in the other-modifier list that appear before the current word in the SL sentence, are traversed first; then the word of the current node is added to the transformed word list; finally, the subtrees corresponding to the modifiers in the post-modifier list, and to the modifiers in the other-modifier list that appear after the current word in the SL sentence, are traversed. Also, if the TL has SVO sentence structure, first the subtree corresponding to the subject is traversed, then the verb is added to the transformed list, and finally the subtree corresponding to the object is traversed. Procedure TraverseAndTransformTree illustrates the steps used for transforming the SL tree when the target language follows SOV ordering of verb arguments and the clausal complements occur before the verb.

Procedure BuildTree
input : Source language parse tree T_S
output: Tree data structure T
T = node n_root, containing the root word (w_root), POS (pos_wroot), empty children list (cl_root), parent link (p_root) = null
for each word w_i in S except the root word do
    Form a node n_i containing,
    Add n_i to the children list of p_i
    Add n_i to the pre-modifier, post-modifier or other-modifier list based on TL features
return T

Procedure TraverseAndTransformTree
input : Source language parse tree data structure T
output: Transformed source language parse tree T_R
Rearranged word sequence (S_R) = TraverseTree(n_root)
return S_R

1. We obtained the syntactic features proposed by Aufrant et al. (2016) from WALS and the sample data.
2. Besides the features obtained in step 1, we also derived some more syntactic features from WALS and sample data statistics, such as the ordering of subject (S), object (O) and verb (V) in a sentence and the relative ordering of auxiliaries, copula verbs, clausal complements, and adjectival and adverbial clausal modifiers.
3. We transformed all the available treebanks based on the syntactic features described in step 1, and on the combination of the features stated in steps 1 and 2, using the appropriate transformation procedures.
4. We trained separate delexicalized models for the untransformed treebanks and for both types of transformation, so that for each source treebank there are three models: one trained on the untransformed treebank and two on transformed treebanks. The Universal Dependencies v2.0 corpus consists of 70 treebanks. Hence, after transformation we have 70 × 3 = 210 treebanks.
5. We ranked the 210 models based on their LAS on the sample data provided for the surprise language, broke ties based on UAS, and chose the top 20 treebanks for our next step.
6. We trained 20 models by combining the treebanks in the top-k ordering (top-1, top-2, ..., top-20) and selected the model that gave the highest LAS on the sample data. The treebanks were combined by concatenating the treebank files to form a single treebank, e.g., for the top-2 model we concatenated the two treebanks ranked first and second with respect to LAS on the sample data and used the concatenated treebank to train the top-2 model.
In Table 3.1.4 we summarize the LAS of our combined treebanks on the sample data. We report only those top-k combinations that have been used in the submitted systems.
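Steps 5 and 6 above can be sketched as follows. This is a minimal illustration, not the scripts used for the submission: the names `rank_models`, `top_k_combinations` and `select_best_k` are hypothetical, treebanks are represented as in-memory CoNLL-U strings, and the parser training and evaluation are abstracted into a caller-supplied `train_and_score` function that returns the LAS on the sample data.

```python
def rank_models(scores):
    """scores: dict mapping a treebank name to (LAS, UAS) on the sample data.
    Sort by LAS in descending order, breaking ties by UAS."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def top_k_combinations(ranked, treebanks, max_k=20):
    """Yield (k, concatenated CoNLL-U text) for k = 1 .. max_k."""
    for k in range(1, min(max_k, len(ranked)) + 1):
        # Make sure each file ends with exactly one blank line so that
        # sentences from different treebanks stay separated.
        parts = [treebanks[name].rstrip("\n") + "\n\n" for name in ranked[:k]]
        yield k, "".join(parts)


def select_best_k(ranked, treebanks, train_and_score, max_k=20):
    """Return (k, LAS) of the top-k combination with the highest LAS."""
    best = None
    for k, data in top_k_combinations(ranked, treebanks, max_k):
        las = train_and_score(data)   # hypothetical: train a parser, score it
        if best is None or las > best[1]:
            best = (k, las)
    return best


# Toy scores only, to show the tie-breaking behaviour.
toy_scores = {"a": (50.0, 60.0), "b": (50.0, 65.0), "c": (40.0, 70.0)}
ranked = rank_models(toy_scores)   # "b" precedes "a" because of its higher UAS
```

Keeping only the first combination that reaches the highest LAS favours smaller training sets when two values of k tie, which is a design choice of this sketch and not something the procedure above specifies.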

Known language, new parallel treebank
New parallel treebanks were provided for 14 languages in the test data. Out of these 14 languages, we trained the models for German, Hindi, Japanese and Turkish on the single UD treebanks available for each of these languages. Multiple treebanks are available for each of the remaining 10 languages, viz., Arabic, Czech, English, Finnish, French, Italian, Portuguese, Russian, Spanish and Swedish.
For each language with multiple treebanks we followed these steps:
1. We combined all the treebanks in that language and trained a parser model on the combined treebank.
2. We tested the combined model and the models trained on the individual treebanks on the development sets of all individual treebanks.
3. We used the combined model for the parallel treebank if it gave uniform UAS and LAS scores across all the development sets and a significant improvement over the models trained on individual treebanks. Otherwise we used the model trained on the treebank that gave the best result across all the treebanks.
We used the combined models for Swedish, English, Finnish, French, Italian, Portuguese and Russian. For Arabic and Czech we used the models trained on the main treebanks (lcode: 'ar' and 'cs', tcode: '0') of the respective languages.

Known language, known treebank
We trained separate models for each of the 70 Universal Dependencies v2.0 treebanks. We used word, PoS and dependency relation embeddings of 50 dimensions. Apart from these parameters, we used the default parameter settings of the UDPipe parser to train our models. For the 'small' treebanks, for which no development data was available, we used the training data itself as development data.

Experiments and results
Our system comprises 88 models: 70 models were trained on the individual treebanks available from http://universaldependencies.org/, 14 models were trained for the new parallel treebanks and 4 models for the surprise languages. Given the language code (lcode) and treebank code (tcode), our system identifies the parser model corresponding to the input test treebank and parses the sentences in the treebank file.
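The model lookup can be pictured as a simple table keyed by the two codes. The sketch below is only illustrative: the registry contents, the model file names, the tcode value 'pud' for the parallel treebanks and the function name `select_model` are assumptions and do not reproduce the actual system configuration.

```python
# Hypothetical registry: (lcode, tcode) -> model file. File names are placeholders.
MODELS = {
    ("cs", "0"): "cs.model",
    ("cs", "pud"): "cs.model",            # parallel treebank parsed with the main model
    ("sv", "pud"): "sv_combined.model",   # parallel treebank parsed with a combined model
    ("bxr", "0"): "bxr_delex.model",      # surprise language, delexicalized model
}


def select_model(lcode, tcode, models=MODELS):
    """Return the model registered for the given language and treebank code."""
    try:
        return models[(lcode, tcode)]
    except KeyError:
        raise KeyError(
            f"no model registered for lcode={lcode!r}, tcode={tcode!r}"
        ) from None


chosen = select_model("cs", "pud")
```

Keying on the pair rather than on the language alone lets several test treebanks of one language share a model or use different ones.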
The systems were ranked based on macro-averaged LAS. The final evaluation of the parser was on blind test data sets (Nivre et al., 2017a) through the TIRA platform set up by Potthast et al. (2014). We submitted 9 systems (named softwarek, where k ∈ {2, ..., 10}). The systems differ in the models trained for the surprise languages. The models corresponding to the known language treebanks and the new parallel treebanks were the same in all the systems. Since the test set was blind, the first four systems (software2 to software5) consisted of combinations of models for the surprise languages that were expected to perform best based on their performance on the sample treebanks. The remaining 5 consisted of models corresponding to combinations of top-k (k = 1, 5, 10, 15, 20) models for each of the surprise languages. Table 3.1.3 lists the treebanks combined to train the models for our primary system. We summarize the macro-averaged LAS scores for the surprise languages for the 8 models in Table 3.3. The highest scoring system for the surprise languages (software2) consists of the top-2 model for Buryat, the top-10 model for Kurmanji (Kurdish) and the top-6 models for North Sámi and Upper Sorbian. The results of the primary system are summarized in Table 3.1, and the macro-averages over all submitted systems are listed in Table 3.3.
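For reference, a macro-average is the unweighted mean of the per-treebank scores, so each treebank counts equally regardless of its size. The sketch below uses a hypothetical function name `macro_average` and placeholder scores that are not results from the evaluation.

```python
def macro_average(las_by_treebank):
    """Unweighted mean of per-treebank LAS values."""
    if not las_by_treebank:
        raise ValueError("at least one score is required")
    return sum(las_by_treebank.values()) / len(las_by_treebank)


# Placeholder values for illustration only; these are NOT the reported scores.
placeholder_scores = {"bxr": 30.0, "kmr": 40.0, "sme": 35.0, "hsb": 55.0}
average = macro_average(placeholder_scores)
```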

Conclusion
In this work, we have implemented a system for parsing sentences in several typologically different languages, with a special focus on surprise languages, for the CoNLL 2017 Shared Task. We have developed a system for combining treebanks to train parsers for surprise languages and applied syntactic transformation to the source languages, based on the syntactic features of the target languages, to improve performance on the target languages. We derived the syntactic features from WALS and the sample data provided. On the surprise languages, the macro-averaged LAS F1-score of our primary system is 37.16 while that of the best performing system (Stanford) is 47.54. However, the macro-averaged LAS F1-score of our best performing system is 37.75. Our rank with respect to the surprise languages is 10. The overall macro-averaged LAS F1-score of our primary system is 67.61, compared with 76.30 for the best performing system. The overall macro-averaged LAS F1-score of our best-performing system is 67.75. Our overall rank is 19.
Table 6: Comparison of LAS F1 scores of the submitted systems and their macro-averages on the surprise language test data. The system with the highest macro-averaged LAS F1 score (software2) is composed of top-2, top-10, top-6 and top-6 models for bxr, kmr, sme and hsb respectively. The software4 system is composed of top-1, top-2, top-3 and top-15 models for bxr, kmr, sme and hsb respectively, and the software5 system is composed of top-2, top-10, top-15 and top-15 models for bxr, kmr, sme and hsb respectively. For the remaining systems (software6-10) we combined the top-1, top-5, top-10, top-15 and top-20 treebanks respectively.