Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format

The paper presents the largest Polish Dependency Bank in Universal Dependencies format – PDBUD – with 22K trees and 352K tokens. PDBUD builds on its previous version, the Polish UD treebank (PL-SZ), and contains all 8K PL-SZ trees, which are checked and, where necessary, corrected in the current edition. A further 14K trees are automatically converted from a new version of the Polish Dependency Bank. The PDBUD trees are expanded with enhanced edges encoding the shared dependents and the shared governors of coordinated conjuncts, and with the semantic roles of some dependents. The evaluation experiments show that PDBUD is large enough to train a high-quality graph-based dependency parser for Polish.


Introduction
Natural language processing (NLP) is nowadays dominated by machine learning methods, especially deep learning. Data-driven NLP tools not only perform more accurately than rule-based tools, but are also easier to develop. The shift towards machine learning is also visible in syntactic parsing, especially dependency parsing. The vast majority of contemporary dependency parsing systems (e.g. Nivre et al., 2006; Bohnet, 2010; Dozat et al., 2017; Straka and Straková, 2017) take advantage of machine learning methods. Based on training data, parsers learn to analyse sentences and to predict the most appropriate dependency structures for them. Although various learning methods have been applied to data-driven dependency parsing (e.g. Jiang et al., 2016), the best results so far have been obtained with supervised methods (cf. Zeman et al., 2017). Supervised dependency parsers trained on correctly annotated data achieve high parsing performance even for languages with rich morphology and relatively free word order, such as Polish.
Supervised learning methods require gold-standard training data, whose creation is a time-consuming and expensive process. Nevertheless, dependency treebanks have been created for many languages, in particular within the Universal Dependencies initiative (UD, Nivre et al., 2016). The UD leaders aim at developing a cross-linguistically consistent tree annotation schema and at building a large multilingual collection of dependency treebanks annotated according to this schema.
Polish is also represented in the Universal Dependencies collection. There are two Polish treebanks in UD: the Polish UD treebank (PL-SZ) converted from Składnica zależnościowa 1 and the LFG enhanced UD treebank (PL-LFG) converted from a corpus of Polish LFG structures. PL-SZ contains more than 8K sentences with 10.1 tokens per sentence on average. PL-LFG is larger and contains more than 17K sentences, but its average number of tokens per sentence is only 7.6. This paper presents the largest Polish Dependency Bank in Universal Dependencies format, PDBUD, with 22K trees and 352K tokens (hence 15.8 tokens per sentence on average). PDBUD builds on its previous version, the Polish UD treebank (PL-SZ), and contains all 8K PL-SZ trees. The PL-SZ trees are checked and, where necessary, corrected in the current edition of PDBUD. A further 14K trees are automatically converted from a new version of the Polish Dependency Bank (PDB, see Section 2). The Polish sentences underlying the additional PDB trees contain problematic linguistic phenomena whose conversion requires some modifications of the UD annotation schema (see Section 3). Furthermore, the PDBUD trees are expanded with enhanced edges encoding the shared dependents and the shared governors of coordinated conjuncts (see Section 4) and with the semantic roles of some dependents (see Section 5). Finally, we conduct some evaluation experiments. The evaluation results show that PDBUD is large enough to train a high-quality graph-based dependency parser for Polish (see Section 6).
1 Składnica zależnościowa was converted to the UD format by Zeman et al. (2014).

PDB
The first Polish dependency treebank, Składnica zależnościowa (Wróblewska, 2012), was a collection of about 8K trees which were automatically converted from the Polish constituent trees of Składnica frazowa (Woliński et al., 2011). All sentences of Składnica were derived from the Polish National Corpus (Przepiórkowski et al., 2012). The annotated sentences are rather short, with 10.2 tokens per sentence on average, and the corresponding trees are relatively simple (there are only 289 non-projective trees, 5 i.e. 3.5% of all trees).
5 Non-projective trees contain long-distance dependencies resulting in crossing edges. See the topicalisation example Czerwoną kupiłam sukienkę 'I bought a red dress' (lit. 'Red I bought a dress'), whose dependency tree is non-projective: kupiłam ('I bought') is the root, sukienkę ('dress') is its obj, and Czerwoną ('red') is the amod of sukienkę.
This first version of the Polish dependency treebank was enlarged with 4K trees (Wróblewska, 2014). The additional trees resulted from the projection of English dependency structures onto Polish parallel sentences from Europarl (Koehn, 2005), DGT-Translation Memory (Steinberger et al., 2012), OPUS (Tiedemann, 2012) and Pelcra Parallel Corpus (Pęzik et al., 2011). The additional sentences, with an average length of 15.9 tokens per sentence, were longer than the sentences from Składnica. The projection-based trees were also more complex and 235 of them are non-projective (i.e. 5.9% of all added trees). The entire set of Składnica trees and projection-based trees is called Polish Dependency Bank (PDB).
PDB is still being developed at the Institute of Computer Science PAS. The current version of PDB is enlarged with a suite of 10K sentences annotated with dependency trees. The additional sentences are relatively complex (20.5 tokens per sentence on average) and come from the Polish National Corpus (Przepiórkowski et al., 2012), Polish CDSCorpus (Wróblewska and Krasnowska-Kieraś, 2017), and literature. There are 1388 non-projective trees in this set (i.e. 13.9% of 10K trees). Besides enlarging PDB, the development consists in correcting the previous PDB trees. The Składnica trees and the projection-based trees are manually checked and corrected if necessary.
The current version of PDB consists of more than 22K trees with 15.8 tokens per sentence on average (see Table 1). There are 1912 non-projective trees in PDB (i.e. 8.61% of all trees).
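The notion of non-projectivity used for the counts above can be made concrete with a short sketch. It is only an illustration under our own assumptions (the function name `is_projective` and the list-of-heads encoding are not part of PDB): a tree is treated as non-projective if two of its edges cross, with the root edge drawn from an artificial position 0.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the 1-based head of token i+1; 0 marks the root."""
    # Represent every edge by its (left, right) positions, root edge included.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two edges cross when exactly one endpoint of one edge
            # lies strictly inside the span of the other edge.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True


# Czerwoną kupiłam sukienkę: Czerwoną <- sukienkę (amod),
# kupiłam <- root, sukienkę <- kupiłam (obj).
print(is_projective([3, 0, 2]))
# The canonical word order Kupiłam czerwoną sukienkę has no crossing edges.
print(is_projective([0, 3, 1]))
```

The quadratic pairwise check is sufficient for sentences of the lengths reported here; a linear-time test would be preferable only for very long inputs.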

PDBUD
The PDB trees are automatically converted to UD trees according to the guidelines of Universal Dependencies v2, and the resulting set is called PDBUD (i.e. Polish Dependency Bank in Universal Dependencies format). PDBUD contains all trees of the Polish UD treebank (PL-SZ), corrected where necessary. The size of PDBUD is exactly the same as the size of PDB, i.e. 22K trees and 351K tokens (see Table 1). 1783 of the PDBUD trees are non-projective, i.e. 8.03% of all trees. There are 17K enhanced edges (4.96% of all edges) in PDBUD and 41.6% of the PDBUD graphs have at least one enhanced edge. The converted PDBUD trees are largely consistent with the PL-SZ trees. While converting, we try to preserve the universality principle of UD, but some modifications are necessary. The PL-SZ trees are rather simple and the sentences underlying this data set lack some linguistic phenomena, e.g. ellipsis, comparative constructions, direct speech, interpolations and comments, nominative noun phrases used in the vocative function, and many others. Therefore, the repertoire of UD relation subtypes and language-specific features is slightly extended in PDBUD to cover these phenomena (see Section 3). Furthermore, in contrast to the PL-SZ trees, the PDBUD graphs contain enhanced edges encoding shared dependents or shared governors of coordinated elements (see Section 4). Finally, some semantic labels are added that go beyond the standard annotation scheme of Universal Dependencies (see Section 5).

Corrections and extensions
Numerous errors in the original Składnica trees (and in the projection-based trees) have been corrected, so they do not carry over to the PDBUD trees that correspond to the PL-SZ trees. The errors in the Składnica trees were predominantly caused by the inadequate automatic conversion of the phrase-structure trees into dependency trees, particularly by erroneous labelling. Defective part-of-speech tags, morphological features, lemmas, dependency relations and their labels are manually corrected by highly qualified linguists. The correction issues do not fall within the scope of this paper. The conversion issues and extension suggestions are described in the following sections.

Comparative constructions
Comparative constructions are distinguished in the PDB trees and thus they are also marked in PDBUD. According to Bondaruk (1998), there are two types of comparative constructions in Polish: comparatives of equality, marked with e.g. tak ... jak ('as ... as') or taki ... jaki ('just like'), and comparatives of inequality, marked with NIŻ ('than'). All markers introducing comparative constructions, e.g. JAK, NIŻ, JAKBY, NICZYM, are converted as subordinating conjunctions SCONJ with the feature ConjType=Cmpr. Comparative constructions are annotated with the following dependencies (see Figure 1): the comparative marker is labelled mark and depends on the main element of the comparative construction, which is labelled obl:cmpr (a new UD subtype).

Constructions with JAKO
The lexeme JAKO is an uninflected Polish lexeme. It causes considerable difficulties and is heterogeneously analysed in traditional Polish linguistics as a preposition, a coordinating conjunction, a subordinating conjunction, or an adverb. Following the concept of the bi-functional subordinating conjunction JAKO (Wróblewska and Wieczorek, 2018), we convert all occurrences of JAKO as SCONJ with the feature ConjType=Pred (i.e. a predicative conjunction, a new Polish-specific feature). The subordinating conjunction JAKO, which is labelled mark, can be governed by the head of any constituent phrase (e.g. a nominal, prepositional, or verbal phrase) which is, in turn, governed by the sentence predicate subcategorising another phrase of the same type (see Figure 2). There is an identification relation between the subcategorised argument and the phrase introduced by JAKO (hence the bi-functional subordinating conjunction), which could be marked with an enhanced edge.
Figure 2: The PDBUD tree of the sentence O zgodę taką wystąpił jako o środek zapobiegawczy ('He applied for such permission as a precautionary measure') with JAKO.

Mobile inflection
The mobile inflections (marked as aglt in the Polish tagset, e.g. -em in odwołałem 'I (masc.) recalled' or -ś in zrobiłabyś 'you (fem.) would do') are enclitics which substitute for auxiliary verbs in the past perfect constructions. We convert them as AUX with Aspect, Number, and Person features, as in PL-SZ. The repertoire of the morphological features of the mobile inflections is enriched with Clitic=Yes and its Variant: either Long (e.g. -em in odwołałem 'I (masc.) recalled') or Short (e.g. -m in odwołałam 'I (fem.) recalled'). In the PL-SZ trees, the mobile inflections are additionally marked with VerbForm=Fin and Mood=Ind, but as they are not proper finite verbs, these features seem to be incorrect and are not included in PDBUD. A mobile inflection is a special case of an auxiliary verb. Therefore, the relation between the mobile inflection and its governing participle is labelled with a special subtype aux:clitic (a new UD subtype).

Conditional particle
The conditional particle BY, e.g. -by- in zrobiłabyś ('you (fem.) would do'), is annotated in PL-SZ as an auxiliary AUX with the features Aspect=Imp, Mood=Cnd and VerbForm=Fin, and with the lemma BYĆ ('to be'). However, it is a particle which does not bear any grammatical features in Polish (cf. Przepiórkowski et al., 2012). Since it is not a verb form, it cannot be annotated with the Aspect, Mood and VerbForm features, which are reserved for verbs. Furthermore, its lemma is BY and not BYĆ. The conditional particle BY is therefore converted as PART in PDBUD. The relation between this particle and its governor is labelled aux:cnd (a new UD subtype).
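The treatment of the mobile inflection and the conditional particle can be illustrated with a minimal sketch that splits zrobiłabyś into three syntactic words. The `Word` class, the `format_feats` helper, the head indices, the concrete values of Aspect, Number and Person, and the choice of Variant=Short for -ś are our own illustrative assumptions and are not copied from the treebank; only the tags, the Clitic and Variant feature names, and the relation subtypes come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Word:
    form: str
    upos: str
    head: int
    deprel: str
    feats: Dict[str, str] = field(default_factory=dict)


def format_feats(feats: Dict[str, str]) -> str:
    """Serialise features CoNLL-U style: Name=Value pairs sorted by name."""
    if not feats:
        return "_"
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


# zrobiłabyś = zrobiła + by + ś (features of the participle omitted for brevity)
words = [
    Word("zrobiła", "VERB", 0, "root"),
    # The conditional particle: PART, no verbal features, relation aux:cnd.
    Word("by", "PART", 1, "aux:cnd"),
    # The mobile inflection: AUX marked as a clitic, relation aux:clitic.
    Word("ś", "AUX", 1, "aux:clitic", {
        "Aspect": "Imp",     # assumed value
        "Number": "Sing",    # assumed value
        "Person": "2",       # assumed value
        "Clitic": "Yes",
        "Variant": "Short",  # assumed value
    }),
]

for index, word in enumerate(words, start=1):
    print(index, word.form, word.upos, format_feats(word.feats),
          word.head, word.deprel, sep="\t")
```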

Other morphosyntactic extensions
We propose some morphosyntactic extensions of the schema which was used to annotate the PL-SZ trees. Some of these extensions are already defined in the UD guidelines, but they were not applied in PL-SZ. Other extensions are newly defined.
ADP There is only one postposition in Polish, TEMU ('ago'), which is converted in PDBUD as the adposition ADP with the feature AdpType=Post. In PL-SZ, the postposition TEMU was wrongly assigned the feature AdpType=Prep, which is reserved for prepositions.
CCONJ We convert the conjunctions PLUS and MINUS as the coordinating conjunction CCONJ with the feature ConjType=Oper (a mathematical operator). No conjunction of this kind occurred in PL-SZ.
Digits Digits (NumForm=Digit) and Roman numerals (NumForm=Roman), which are distinguished in PDB, are converted as follows:
• ordinal numbers: adjectives ADJ with the feature NumType=Ord and the other standard features of adjectives,
• cardinal numbers: numerals NUM with the feature NumType=Card and the other standard features of numerals,
• other numbers: the tag X.
Note that Elip, Slsh and Blsh are the newly defined PunctType values.
Emojis are always labelled with the function discourse:emo in PDBUD (a new UD subtype).
VERB The impersonal verb forms are converted as adjectives ADJ with the feature Case in PL-SZ. In Polish linguistics, however, the impersonals are considered verb forms, which cannot be inflected for grammatical case. Therefore, we convert them as verbs VERB with the following features: Aspect (Perfective or Imperfective), Mood=Ind, Person=0, Tense=Past, VerbForm=Fin, and Voice=Act.
X Foreign words are converted as X tags with the feature Foreign=Yes. Abbreviations are also annotated as X tags, with the feature Abbr=Yes and either Pun=Yes, if the abbreviation requires a full stop (e.g. art. 'article'), or Pun=No, if it does not (e.g. cm 'centimetre').

Additional relation subtypes
We also propose to extend the inventory of the UD relation subtypes with some additional subtypes listed in alphabetical order below.
acl:attrib [...] is just a sign of the law of sexual attraction: an insect infallibly goes to a plant that wants to be pollinated.'), the clause owad nieomylnie trafia [...] modifies the noun prawa ('of the law'). The relation subtype acl:attrib (adverbial clause modifier of a noun) is therefore introduced to cover constructions of this type.

advmod:neg
The relation between the negation particle NIE ('not') and its governor is labelled with advmod:neg.

aux:imp
The relation between the imperative particle NIECH ('let's') and its governor is labelled with aux:imp.
ccomp:obj In PDB, direct objects are those verb arguments which become grammatical subjects in passive sentences. Not only noun objects but also clausal objects undergo this shift, e.g. Przewidział, że inflacja będzie spadać ('He predicted that inflation would go down') and its passive version Że inflacja będzie spadać zostało przewidziane ('It was foreseen that inflation would go down', lit. 'That inflation would go down was foreseen'). In order to convert the clausal objects, the subtype ccomp:obj is proposed. It is worth considering whether it would be a better solution to introduce a new UD type cobj in analogy to csubj.
discourse:intj Interjections, e.g. cześć ('hello'), Och ('Oh'), Okay, are labelled with the function discourse:intj.
nmod:arg Noun complements of various parts of speech, except for verbs, are labelled with the function nmod:arg (noun complement), e.g. środowiska in ochrona_NOUN środowiska_NOUN ('environmental protection'), dzieci in korytarz pełen_ADJ dzieci_NOUN ('a corridor full of children').
nmod:subj Polish allows the grammatical subject to be realised as a prepositional phrase, e.g. do_ADP 2 lat więzienia in Grozi mu do 2 lat więzienia ('He faces up to two years in prison', lit. 'Up to two years in prison threatens him'), or as an adverbial phrase, e.g. Rzadko_ADV in Rzadko nie znaczy wcale ('It's rare, nevertheless still occurs', lit. 'Rarely does not mean at all'). The relation between a prepositional or adverbial subject and its governing verb is labelled with the subtype nmod:subj. We realise that this subtype is not the best solution. Alternatively, an adverbial subject could be labelled advmod:arg and a prepositional subject could be labelled obl:arg, but then we would lose information about their subject function. We also considered introducing two additional subtypes, advmod:subj and obl:subj, but they are extremely confusing. 14

Enhanced graphs
The PDBUD graphs contain enhanced edges encoding the dependents shared by the conjuncts in coordinate structures (see Figure 3) and the shared governors of the coordinated elements (see Figure 4). In the PDB trees, all coordinated elements depend on a conjunction and the relations between the conjunction and these elements are labelled with a technical dependency type, conjunct. A dependent shared by all conjuncts also depends on the conjunction, but this relation is labelled with the grammatical function of the shared dependent, e.g. subj, obj. The conversion of the PDB trees into the enhanced PDBUD graphs is thus a straightforward process. The only enhanced edges in PDBUD are those involved in coordination constructions, but they are numerous: more than 41% of all PDBUD trees contain at least one enhanced edge (see Table 1).
14 One of the reviewers of the paper suggests using the label subj. It would be an ideal solution. However, the function subj does not belong to the repertoire of the UD functions.
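The way coordination in the PDB trees licenses enhanced edges can be sketched as follows. This is a simplified illustration and not the actual conversion code: the function name `propagate_coordination`, the tuple encoding, the example sentence and its root label are our own assumptions, every non-conjunct child of a conjunction is assumed to be a shared dependent, and the PDB labels are kept as they are (the mapping to UD labels, e.g. subj to nsubj, is left out).

```python
from typing import List, Set, Tuple

# A token is (id, form, head, label); ids are 1-based and head 0 marks the root.
Token = Tuple[int, str, int, str]
# An edge is (governor id, dependent id, label).
Edge = Tuple[int, int, str]


def propagate_coordination(tokens: List[Token]) -> Set[Edge]:
    """Collect the edges implied by PDB-style coordination."""
    edges: Set[Edge] = set()
    for conj_id, _, conj_head, conj_label in tokens:
        children = [t for t in tokens if t[2] == conj_id]
        conjuncts = [t for t in children if t[3] == "conjunct"]
        if not conjuncts:
            continue  # this token does not head a coordination
        shared = [t for t in children if t[3] != "conjunct"]
        for conjunct_id, _, _, _ in conjuncts:
            # Shared governor: each conjunct takes over the relation
            # that the conjunction bears to its own governor.
            edges.add((conj_head, conjunct_id, conj_label))
            # Shared dependents: each conjunct governs every dependent
            # attached to the conjunction with a grammatical function.
            for dep_id, _, _, dep_label in shared:
                edges.add((conjunct_id, dep_id, dep_label))
    return edges


# Hypothetical PDB-style analysis of Maria czyta i pisze ('Maria reads and writes').
example: List[Token] = [
    (1, "Maria", 3, "subj"),
    (2, "czyta", 3, "conjunct"),
    (3, "i", 0, "root"),
    (4, "pisze", 3, "conjunct"),
]
for edge in sorted(propagate_coordination(example)):
    print(edge)
```

In a full conversion, the edges that coincide with the basic UD tree (here the subject edge of the first conjunct) would belong to the basic layer, and only the remaining ones would be added as enhanced edges.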

Semantic labels
The UD format is extended by adding some semantic labels in the 11th column. There are 28 semantic labels corresponding to selected frame elements of FrameNet (Fillmore and Baker, 2009; Ruppenhofer et al., 2010). In addition to the common semantic roles THEME, RECIPIENT/BENEFICIARY and RESULT, there are roles related to [...]. The additional semantic labels extend the semantic meaning of indirect objects (iobj), oblique nominals (obl), adverbial clause modifiers (advcl), some adverbial modifiers (advmod), some noun modifiers (nmod), etc.
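Reading the extra column requires only a small change to a standard CoNLL-U reader. The sketch below assumes tab-separated rows in which the ten standard CoNLL-U columns are followed by one additional field; the column name SEM, the example row and the spelling of the label in it are hypothetical.

```python
from typing import Dict, Optional

# The ten standard CoNLL-U columns plus the additional semantic column.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC", "SEM"]


def parse_row(line: str) -> Dict[str, Optional[str]]:
    """Split one token line into named fields; '_' in SEM means no label."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row: Dict[str, Optional[str]] = dict(zip(COLUMNS, fields))
    if row["SEM"] == "_":
        row["SEM"] = None
    return row


# Hypothetical row: an indirect object carrying a semantic label.
row = parse_row("2\tMarii\tMaria\tPROPN\t_\t_\t1\tiobj\t_\t_\tRecipient")
print(row["DEPREL"], row["SEM"])
```

Tools that expect exactly ten columns will reject such files, so the last field has to be stripped before the data is passed to a standard UD validator or parser.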

Dependency parsing systems
Various contemporary dependency parsing systems are tested in our evaluation experiments. All of the tested systems perform dependency parsing, but only some of them also perform part-of-speech tagging, morphological analysis and lemmatisation. We test transition-based parsers (i.e. MaltParser, UDPipe, and the transition-based version of BIST parser) as well as graph-based parsers (i.e. MATE parser, Stanford parser, and the graph-based version of BIST parser). The properties of the tested dependency parsing systems are summarised in Table 2.
system                                         architecture  classifier  parsing  tagging  lemmatisation
MaltParser (Nivre et al., 2006)                trans         LR          yes      no       no
MATE parser (Bohnet, 2010)                     graph         perceptron  yes      no       no
BIST parser (Kiperwasser and Goldberg, 2016)   trans/graph   biLSTM      yes      no       no
Stanford parser (Dozat et al., 2017)           graph         biLSTM      yes      yes      no
UDPipe (Straka and Straková, 2017)             trans         1-layer NN  yes      yes      yes
Table 2: Properties of the dependency parsing systems tested in our experiments. Explanation: trans: a transition-based parser; graph: a graph-based parser; LR: a linear classifier based on logistic regression; 1-layer NN: a non-linear classifier based on a 1-layer neural network; biLSTM: a bidirectional long short-term memory network.

Data split
PDBUD is divided into three parts: training, test and development data sets. The procedure of assigning dependency trees to particular data sets is generally random, with one constraint: the Składnica trees, and thus also the PL-SZ trees, are not included in the test set. 16 Since the sentences underlying the Składnica trees are generally shorter than the remaining sentences, the average number of tokens per sentence is significantly higher in the test set than in the two other sets. The statistics of the particular data sets are given in Table 3.
16 PDBUD is used in the shared task on dependency parsing of Polish, PolEval 2018 (http://poleval.pl). The organisers of this shared task decided not to use the PL-SZ trees, which have been publicly available for some time, for validation of the participating systems. Therefore, the PL-SZ trees are not part of the PDBUD test set.
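A random split with the constraint on the test set can be sketched as below. The function name `split_treebank`, the use of tree identifiers, the set sizes and the fixed seed are illustrative assumptions; the sketch only mirrors the stated constraint that Składnica trees never enter the test set, and it is not the procedure actually used to produce the released split.

```python
import random
from typing import Dict, List, Sequence, Set


def split_treebank(tree_ids: Sequence[str], skladnica_ids: Set[str],
                   n_test: int, n_dev: int, seed: int = 0) -> Dict[str, List[str]]:
    """Randomly split trees so that no Składnica tree lands in the test set."""
    rng = random.Random(seed)
    # Only trees outside Składnica are eligible for the test set.
    eligible = [t for t in tree_ids if t not in skladnica_ids]
    if n_test > len(eligible):
        raise ValueError("not enough non-Składnica trees for the test set")
    test = rng.sample(eligible, n_test)
    chosen = set(test)
    # All remaining trees are shuffled and divided into dev and train.
    rest = [t for t in tree_ids if t not in chosen]
    if n_dev > len(rest):
        raise ValueError("not enough remaining trees for the development set")
    rng.shuffle(rest)
    return {"test": test, "dev": rest[:n_dev], "train": rest[n_dev:]}


# Toy data: 100 trees, the first 40 of which stand in for Składnica trees.
ids = [f"tree-{i}" for i in range(100)]
skladnica = set(ids[:40])
parts = split_treebank(ids, skladnica, n_test=10, n_dev=10)
print({name: len(trees) for name, trees in parts.items()})
```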

Evaluation methodology
We apply the evaluation measures defined for the purpose of the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The proposed metrics, i.e. LAS, UAS, CLAS, MLAS and BLEX, evaluate different aspects of prediction.
Two evaluation scenarios are proposed: 1) testing the quality of dependency parsing of Polish, and 2) testing the quality of morphosyntactic prediction of dependency trees, i.e. part-of-speech tagging, lemmatisation, and dependency parsing of Polish. For the purpose of our evaluation, we use the evaluation script of the CoNLL 2018 shared task.
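For the first scenario, where tokenisation is gold-standard, UAS and LAS reduce to per-token accuracies, which the following sketch computes. It is not the shared-task script: the function name `attachment_scores` and the toy data are our own, and the `strip_subtypes` flag reflects our understanding that the shared-task evaluation compares only the universal part of a relation label (e.g. obl for obl:arg), which should be verified against the script itself.

```python
from typing import List, Tuple

# One arc per token: (head id, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc],
                      strip_subtypes: bool = True) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for token-aligned gold and predicted arcs."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must be non-empty and aligned")

    def norm(label: str) -> str:
        # Optionally drop the language-specific subtype after the colon.
        return label.split(":")[0] if strip_subtypes else label

    # UAS counts correct heads; LAS additionally requires a matching label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g[0] == p[0] and norm(g[1]) == norm(p[1])
                   for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy example: the third label differs only in its subtype, the fourth head is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obl:arg"), (3, "case")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "case")]
print(attachment_scores(gold, pred))
print(attachment_scores(gold, pred, strip_subtypes=False))
```

When tokenisation is predicted rather than given, precision and recall differ and the scores have to be computed as F1 over aligned tokens, which this sketch does not attempt.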

Evaluation of dependency parsing
Stanford parser is the best performing parser on Polish data (see Table 4). The second best parser, MATE parser, performs surprisingly well. Although it has no neural component, it outperforms not only the graph-based neural parser BIST (87.06 LAS vs. 84.88 LAS), but also all transition-based parsers. It is also worth mentioning that the worst graph-based parser, BIST parser, performs slightly better than its transition-based version, which achieves LAS of 84.79% and is the best of all transition-based parsers. It follows that the graph-based parsers are generally better suited for parsing Polish than the transition-based parsers.
Table 4: Parsers are tested on the sentences with the gold-standard tokens, lemmas, and part-of-speech tags.

Evaluation of morphosyntactic prediction of dependency trees
Two systems, Stanford system and UDPipe, are tested in the task of morphosyntactic prediction of dependency trees. These systems predict universal part-of-speech tags (UPOS) as well as language-specific tags (XPOS). Stanford system outperforms UDPipe in part-of-speech tagging (see Table 5). Only UDPipe predicts morphological features (UFEATS) and lemmas (LEMMA).
Stanford parser significantly outperforms UDPipe in predicting labelled dependency trees (LAS) and in predicting governors and dependency relation types of content words (CLAS), see Table 6. Since Stanford system does not predict morphological features and lemmas, we cannot compare MLAS and BLEX scores.

Summary
We carried out two evaluation experiments on PDBUD data. The results of these experiments show that the graph-based parsers, even those without any neural component, are better suited for parsing Polish than the transition-based parsing systems. The best results in parsing Polish data without preceding morphosyntactic analysis are achieved with Stanford parser, i.e. 88.04 LAS. These results are slightly lower than those reported in Dozat et al. (2017), i.e. 90.32 LAS. A possible reason is that our test data contains dependency trees of longer sentences, which leaves more room for mistakes. If Stanford parser operates on the PDBUD sentences with gold-standard part-of-speech tags, it performs better, i.e. 90.03 LAS.
Table 6: The quality (F1 scores) of predicting unlabelled dependency trees (UAS), labelled dependency trees (LAS), governors and dependency relation types of content words (CLAS), governors, dependency relation types, universal part-of-speech tags and morphological features of content words (MLAS), governors, dependency relation types and lemmas of content words (BLEX).

Conclusions and future work
We presented PDBUD, the largest Polish dependency bank, with 22K dependency trees in Universal Dependencies format. PDBUD contains the corrected trees of the Polish UD treebank (PL-SZ) and 14K dependency trees automatically converted from the Polish Dependency Bank. The PDBUD trees are expanded with enhanced edges encoding the shared dependents and the shared governors of coordinated conjuncts, and with the semantic roles of some dependents.
Our evaluation experiments showed that PDBUD is large enough to train a high-quality graph-based dependency parser for Polish.
We did our best to maintain consistency with the UD guidelines while building PDBUD. However, some of our annotation decisions are debatable and should be discussed again in the context of the universality assumptions of Universal Dependencies.
There are plenty of elliptical constructions in Polish. Some of them are labelled with the function orphan in PDBUD. In future work, we plan to add empty nodes representing the elided elements to the PDBUD trees. Furthermore, we are going to create a Polish version of the Parallel Universal Dependency treebank.
PDBUD data have already been used in the shared task on automatic identification of verbal multi-word expressions (LAW-MWE-CxG-2018) and are currently used in the shared task on dependency parsing of Polish (PolEval 2018). We take this as confirmation that PDBUD is of very high quality. Therefore, in the future we would like to replace the Polish UD treebank PL-SZ with its corrected, extended and enhanced version, PDBUD.