Survey of Uralic Universal Dependencies development

This paper attempts to evaluate some of the systematic differences in Uralic Universal Dependencies treebanks from a perspective that would help to introduce reasonable improvements in treebank annotation consistency within this language family. The study finds that the coverage of Uralic languages in the project is already relatively high, and the majority of typically Uralic features are already present and can be discussed on the basis of existing treebanks. Some of the idiosyncrasies found in individual treebanks stem from language-internal grammar traditions, and could be a target for harmonization in later phases.


Introduction
The Uralic languages constitute one of the major language families in the world. There are approximately 38 languages represented by seven major branches in the family tree. Only Finnish, Estonian and Hungarian are majority languages in their states, and other Uralic languages are minority languages, often endangered, in their respective regions, including Northern Scandinavia and Russia.
Uralic languages are agglutinative, morphologically rich and typically have large case systems. The majority of Uralic languages share many features, including the expression of negation with a negation verb, a preference for postpositions and complex object marking. They usually also have a complex system of non-finite forms. Constituent order is relatively flexible. However, since the different branches of the Uralic family are rather far removed from one another, there are also numerous independent reflexes of historically shared features on the individual language level.
Recent versions of Universal Dependencies treebanks (Nivre et al., 2019) include 11 treebanks in seven Uralic languages. All in all, they represent five of the aforementioned major branches in the family. In release 2.4 the Uralic languages include:
• Erzya, 1 treebank (Rueter and Tyers, 2018)
• Estonian, 2 treebanks (Muischnek et al., 2014), (Muischnek et al., 2016)
• Finnish, 3 treebanks (Haverinen et al., 2014), (Pyysalo et al., 2015)
• Hungarian, 1 treebank
• Karelian, 1 treebank (Pirinen, 2019)
• Komi-Zyrian, 2 treebanks (Partanen et al., 2018)
• North Sámi, 1 treebank (Sheyanova and Tyers, 2017)
In addition to the languages listed above, there are at least plans for a Northern Mansi treebank outlined in the literature (Horváth et al., 2017, 63), and work on a Livvi Karelian treebank is under way. At the moment, the branches of the Uralic language family that do not have a single treebank are Samoyed and Mari, while the Ugric branch is only represented by Hungarian, missing the Ob-Ugric languages. From this point of view, the Northern Mansi treebank mentioned would be a most welcome addition. It would also ease the geographical limitation of the current selection: all treebanks now available represent Uralic languages spoken in Europe, and none represents a language spoken primarily in Siberia.
Finite-State Transducers, especially in the Giellatekno infrastructure (Moshagen et al., 2014), have traditionally played an important role in open-source language technology for Uralic languages. Recent work in harmonizing NLP solutions (Hämäläinen, 2019) and lexical resources (Hämäläinen and Rueter, 2018) in Uralic languages may also prove beneficial for these efforts. Many of the Uralic treebanks have been created by automatic conversion of annotation schemes and tagsets from these analysers. Since several language documentation projects have also started to integrate these tools into their workflows (Gerstenberger et al., 2016), it would be expected that the pipelines used thus far could be reused when creating new treebanks for languages that have comparable resources.
Variation in annotation schemes has already been identified as one problem between analysers of different Uralic languages, and an earlier study by Tyers and Pirinen (2016) examined this in detail. To continue research along these lines, we reproduce their Table 2 with minor corrections, and compare how the same examples align with the solutions currently found in UD treebanks. This is one method to measure whether the common guidelines proposed in Tyers and Pirinen (2016) have taken root.

Selected Uralic features
In this section we go through selected features in the Universal Dependencies annotation scheme that can be commented on specifically from a Uralic point of view.

The Feature Evident
Komi-Zyrian has morphologically marked evidential verb forms, and the Komi treebanks use this feature with the value 'Evident=Nfh' to mark second past tense forms as non-firsthand information. Evidentiality is not, however, an obligatory category in Komi as it is in some other languages; these forms are used primarily in unwitnessed narrative or to express non-voluntary action (Leinonen, 2000). Various evidentiality-related phenomena occur in the morphology of other Permic languages, Mari and Ob-Ugric, and the Samoyedic languages, which, as mentioned, are still largely missing from the UD project. Evidentiality is thus one feature for which the Uralic family still has much to add to the project, although it does not play a central role in the currently included languages. The Erzya treebank, it will be noted, uses the 'Evident=Nfh' feature for some particles attached with the advmod relation. This Erzya usage may open the discussion of introducing the feature for other free word forms in the UD project. The Estonian treebank UD_Estonian-EDT, for example, does not have the feature Evident, although the concept is implicitly present in 'Mood=Qot'.
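A question such as which parts of speech carry 'Evident=Nfh' in a given treebank can be checked directly from the CoNLL-U files. The following is a minimal sketch using only the Python standard library; the function names and the three sample rows are our own illustrative assumptions (the word forms are placeholders, not attested data), and the code is not taken from any of the treebank projects.

```python
from collections import Counter


def parse_feats(feats: str) -> dict[str, str]:
    """Turn a CoNLL-U FEATS string such as 'Evident=Nfh|Tense=Past' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def count_feature(conllu_text: str, feature: str) -> Counter:
    """Count (UPOS, value) pairs for one feature over all ordinary word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Comment lines, blank lines, multiword-token ranges (1-2) and
        # empty nodes (1.1) are skipped by this check.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        value = parse_feats(cols[5]).get(feature)
        if value is not None:
            counts[(cols[3], value)] += 1
    return counts


# Toy rows with placeholder forms (w1, w2, w3); hypothetical, for illustration only.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "w1", "w1", "VERB", "_", "Evident=Nfh|Number=Sing|Tense=Past", "0", "root", "_", "_"],
    ["2", "w2", "w2", "PART", "_", "Evident=Nfh", "1", "advmod", "_", "_"],
    ["3", "w3", "w3", "NOUN", "_", "Case=Nom|Number=Sing", "1", "nsubj", "_", "_"],
])

evident_counts = count_feature(SAMPLE, "Evident")
```

Run over real treebank files, a tally of this kind would make it visible at a glance whether a treebank restricts the feature to verbs, as in Komi-Zyrian, or also applies it to particles, as in Erzya.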

The Feature Gender
The Uralic languages do not have grammatical gender permeating the pronoun, noun and verb systems. They do, however, have peripheral derivational elements, which are not regularly addressed, e.g. Finnish -tar/-tär: ruhtinas 'duke' vs. ruhtinatar 'duchess', näyttelijä 'actor' vs. näyttelijätär 'actress', etc. Erzya also has a peripheral derivational element, -низэ/-нызэ -ńize/-nïze, which was traditionally added to the husband's name to indicate 'wife of': Иван Ivan vs. Иваннызэ Ivannïze, Гава Gava vs. Гаванизэ Gavańize. A similar construction occurs in Hungarian with -né 'wife of': István vs. Istvánné, but it is not found in the Hungarian treebank. Under normal circumstances, there should be no reason to mark the Gender feature in Uralic treebanks. One exception could arise where a treebank contains code-switching to Russian, in which case gender marking may be present (Janurik, 2015).

The Features Animacy and Definite
Animacy is not a grammatical category in Uralic languages, but it does influence object marking in some languages within the family. In Komi, for example, animacy has been connected to the marking of definiteness and focus, as briefly described by Fediunova (2000, 69). At present, neither animacy nor definiteness is marked in the Komi-Zyrian treebanks, but definiteness can, in principle, be deduced from the possessive suffixes used to this end. In the Erzya treebank, definiteness is marked as an incremental feature of the np head morphology, similar to Scandinavian languages, yet it is distinct from the use of possessive marking. Hungarian also uses this feature, but it is collocated with the definite articles found in the language.

The Feature Aspect
At the moment, some Erzya verbs are marked with 'Aspect=Inch', and some Hungarian verbs are marked with 'Aspect=Iter'. Neither language has gone beyond expressing features for specific derivational morphology. In the North Sámi treebank, however, 'Aspect' is used as a means of encoding participles, i.e. the perfect participle is coded with 'Aspect=Perf', whereas the present participle is coded with 'Tense=Present'. Finnish, Estonian and Komi do not use the Aspect feature at this point.

The Feature Number
Among the current Uralic treebanks, North Sámi is the only one that has a dual number; the numbers otherwise used throughout are singular and plural. There are, however, several types of number to keep track of. Simple Number is used with nominals to indicate the number of entities, while Number[Psor] gives the number of the possessor flagged by possessive suffixes. On finite verbs, Erzya counts subject entities with Number[Subj] and object entities with Number[Obj]. Hungarian introduces counting of the possessed entity (possessee) with Number[Psee] (see also ). This is useful in Hungarian and could be feasible in any upcoming treebank for Moksha as well, e.g. Hungarian kutya 'dog' vs. kutyáé 'the one belonging to the dog', Moksha пине pińe 'dog' vs. пиненне pińeńńe 'one belonging to a dog'.
The 'Number' strategy sets a precedent for analogous regular inflectional features in Erzya and Komi-Zyrian. Where Erzya uses some of the oblique cases at both the np level and the vp level, Komi has an operating dichotomy that distinguishes the two levels. If, for instance, the Erzya inessive вирьсэ viŕ-se 'in the forest' (derived from вирь viŕ 'forest') is taken as a premodifier in a noun phrase, Erzya morphology allows for constructions where np head morphology is directly concatenated onto the premodifier, which might result in a form вирьсэтнесэ viŕ-se-t́-ńe-se 'in the ones that are in the forest' (a matter of ellipsis or 'secondary declension', as it is also referred to in the literature). Komi-Zyrian can derive a premodifier with the same semantics as its Erzya counterpart in вöрса vərsa 'in the forest' from вöр vər 'forest', which can in turn, as an np head, take on either copula plural morphology (вöрсаöсь vər-sa-əɕ '[are] ones in the forest') or noun plural morphology (вöрсаяс vər-sa-jas 'the ones in the forest'). Although this regular morphology for Komi premodifiers is not addressed as case morphology in the largest of the Komi grammars (Fediunova, 2000), it merits contemplation in any extensive and parallel treatment of the language family. The fact that a second plural form can be present introduces further problems.
The Komi np premodifier derivation strategies allow for plural stems. Hence, forms such as вöръяссаяс vər-jas-sa-jas 'the ones that are in the forests' and вöръяссаöсь vər-jas-sa-əɕ '[are] the ones that are in the forests' may require regular counting and therefore a further 'Number' feature, so that one expresses the number of the np head and the other indicates the number of the np premodifier. In Komi, the locative -са -sa has further siblings in a temporal -ся -ɕa, a privative -тöм -təm and a proprietive -а -a, which should not be confused with the inessive -ын -ɨn, caritive -тöг -təg, comitative -кöд -kəd or instrumental -öн -ən cases.
The situation seems to be similar to what is found in the Turkic languages, and the solution proposed in Çöltekin (2016), namely to split these words in Turkish into multiple tokens unless they are lexicalized, would also be possible with the Uralic languages. The fact that Uralic languages do not have separate morphology analogous to the ki element in Turkic, however, would seem to speak against following such a lead.
The system of number marking outlined above seems to be a good starting point for all Uralic languages. It also sets the scene for a new discussion, which might draw from other morphologically rich language families and their practices of grammar description.
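To make the layered number marking concrete, the sketch below separates a FEATS string into (feature, layer) pairs so that plain Number and its layered variants can be inspected side by side. It is only an illustration under our own assumptions: the function names are hypothetical, layer names are lower-cased because the UD guidelines write them that way (Number[psor]) while the discussion above capitalises them, and stray spaces as in 'Number [Subj]' are tolerated.

```python
import re

# A feature name with an optional bracketed layer, e.g. "Number" or "Number[psor]".
LAYERED = re.compile(r"^(?P<name>[A-Za-z]+)(?:\[(?P<layer>[A-Za-z]+)\])?$")


def split_layers(feats: str) -> dict[tuple[str, str], str]:
    """Map (feature, layer) to value; the layer is '' for unlayered features."""
    result: dict[tuple[str, str], str] = {}
    if feats in ("", "_"):
        return result
    for item in feats.split("|"):
        key, value = item.split("=", 1)
        match = LAYERED.match(key.replace(" ", ""))
        if match is None:
            continue  # not a well-formed feature name; ignored in this sketch
        layer = (match.group("layer") or "").lower()
        result[(match.group("name"), layer)] = value
    return result


def number_layers(feats: str) -> list[str]:
    """List the layers on which Number is marked ('' denotes plain Number)."""
    return sorted(layer for (name, layer) in split_layers(feats) if name == "Number")


# A possessed plural noun and a verb agreeing with subject and object.
noun_layers = number_layers("Case=Nom|Number=Plur|Number[psor]=Sing|Person[psor]=1")
verb_layers = number_layers("Number[Subj]=Sing|Number[Obj]=Plur|Tense=Pres")
```

A second number on a Komi premodifier stem, as discussed above, would simply appear as one more layer in such a representation; which name that layer should carry is left open here.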

The Feature Case
At the moment all Uralic treebanks use traditional terms from their own grammars. Some of the terms, for example 'superessive', 'sublative' and 'additive', are only used in individual languages. On the one hand, this is understandable, but it raises the question of how useful these terms actually are. Should these names indicate functions for a given language, or should they be generalized? At least with regard to spatial cases, cross-lingual comparability is now rather weak. Some of the cases, such as 'terminative', however, are already present in numerous treebanks and refer to a very similar concept. Similarly, cases like 'approximative', although currently present only in the Komi-Zyrian treebank, will eventually be more widely present once languages with this case, such as Olonets-Karelian (Livvi), Ludic or Veps, are included.
Are the names of cases with multiple divergent functions relevant? The case names in Uralic languages are often very language-specific and are not transparent. Finnish derivation practice includes a mere mnemonic letter string to indicate a derivation, e.g. 'Derivation=Sti', which is basically a morphological representation for deadjectival adverb derivation. Without language-specific documentation neither case names nor derivation letter strings are meaningful. How to construct documentation in a manner that allows cross-linguistic comparison is a forthcoming challenge for treebank developers. Discussions should also include analogous solutions used in other language families, such as the English preposition 'in' as an equivalent of the Finnish 'inessive' (Germanic prepositions are not generally given names for mapping them to equivalent Finnish case functions).
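The comparability problem can be made measurable by listing which Case values occur in only one treebank and which are shared by all. The sketch below assumes that the case inventories have already been collected into sets; the treebank names and inventories shown are toy placeholders of our own, not the actual inventories of any Uralic treebank.

```python
from collections import defaultdict


def case_distribution(inventories: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each Case value to the set of treebanks that use it."""
    where: dict[str, set[str]] = defaultdict(set)
    for treebank, cases in inventories.items():
        for case in cases:
            where[case].add(treebank)
    return dict(where)


def singleton_cases(inventories: dict[str, set[str]]) -> dict[str, str]:
    """Case values attested in exactly one treebank, with that treebank's name."""
    return {
        case: next(iter(treebanks))
        for case, treebanks in case_distribution(inventories).items()
        if len(treebanks) == 1
    }


def shared_cases(inventories: dict[str, set[str]]) -> set[str]:
    """Case values attested in every treebank (expects at least one inventory)."""
    return set.intersection(*inventories.values())


# Hypothetical inventories for three unnamed treebanks; illustration only.
TOY_INVENTORIES = {
    "treebank_A": {"Ine", "Ter", "Sup"},
    "treebank_B": {"Ine", "Ter", "Add"},
    "treebank_C": {"Ine", "Apr"},
}

only_once = singleton_cases(TOY_INVENTORIES)
everywhere = shared_cases(TOY_INVENTORIES)
```

The values that surface as singletons are the natural candidates for the documentation effort described above, since they are the ones a reader cannot interpret by analogy with another treebank.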

Selected syntactic questions
In this section we discuss some of the observations that can be made about the use of specific dependency relations, and questions that arise from morpho-syntactic particularities of Erzya and Komi.

The Dependency iobj
At the moment only the Hungarian treebank uses the dependency relation 'iobj'. The other Uralic treebanks use the relation 'obl' instead, in line with the examples shown in the UD documentation for nps in prepositional constructions closely related to 'iobj'. The languages currently represented do not have dative alternation, but any future work with Mansi or Khanty may introduce this variation. It would be reasonable to assume that any updating of the version 2.2 Hungarian treebank would involve the introduction of the 'obl' relation where 'iobj' is now used, perhaps with a special relation subtype of 'obl'.
In order to compare the question better and to illustrate the situation, we translated the English and French example sentences from the UD documentation into different Uralic languages.
• English: give the children the toys & give the toys to the children
• French: donner les jouets aux enfants
• Estonian: andma mänguasju lastele
• Hungarian: a játékokat a gyerekeknek adja
• Finnish: antaa lapsille leluja
• Erzya: максомс налкшкетнень эйкакштнэнень
• Komi: сетны ворсанторъяс челядьлы
All example sentences above, with the exception of the first English sentence, are or (in the case of Hungarian) can be coded with the 'obl' relation due to the explicit morphological encoding of the np head, which distinguishes, among others, the Hungarian dative case.
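Whether recipients marked with a given case are attached as 'iobj' or 'obl' in a treebank can be tallied mechanically. The following sketch counts the dependency relations of all words carrying a chosen Case value, ignoring relation subtypes; the function name and the toy rows are hypothetical, and the case value has to be chosen per language (dative in some of the languages above, allative in others).

```python
from collections import Counter


def relations_for_case(conllu_text: str, case_value: str) -> Counter:
    """Tally the universal relation of every word annotated with Case=<case_value>."""
    tally = Counter()
    wanted = f"Case={case_value}"
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # comments, blank lines, token ranges and empty nodes
        feats = cols[5].split("|") if cols[5] != "_" else []
        if wanted in feats:
            # "obl:arg" and similar subtypes are reduced to the universal part.
            tally[cols[7].split(":")[0]] += 1
    return tally


# Placeholder rows (forms w1 to w4 are not real words); illustration only.
TOY_ROWS = "\n".join("\t".join(row) for row in [
    ["1", "w1", "w1", "NOUN", "_", "Case=Dat|Number=Plur", "4", "iobj", "_", "_"],
    ["2", "w2", "w2", "NOUN", "_", "Case=Dat|Number=Plur", "4", "obl", "_", "_"],
    ["3", "w3", "w3", "NOUN", "_", "Case=Dat|Number=Sing", "4", "obl:arg", "_", "_"],
    ["4", "w4", "w4", "NOUN", "_", "Case=Acc|Number=Plur", "0", "obj", "_", "_"],
])

dative_relations = relations_for_case(TOY_ROWS, "Dat")
```

Comparing such tallies before and after a treebank revision would show whether a move from 'iobj' to 'obl' has been applied consistently.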

Copulas
The non-past identity clause involves copula morphology in all but the Komi language. While North Sámi, Estonian, Hungarian, Finnish and Karelian all have free copulas, Erzya provides examples of dependent morphology, which have elicited segmentation (locus + copula), on the one hand, and a discussion of word order issues, on the other.
The North Sámi in figure 1a is representative of the word order typifying Finnish, Estonian and Karelian alike: cs copula cc. Komi, however, does not use a copula; instead, it applies juxtaposition to achieve the same (see 1b).
Figure 1a (North Sámi): Mon lean Olga.
In figures (2a) and (2b), distinctive functions in Erzya morpho-syntax are presented, where first and second person personal pronouns can visibly serve as both copula subjects and copula complements. The dependent copula morphology has been split off from the root in the analysis (analogous to what has been applied in the UD_Turkish-IMST treebank), but when the analysis is non-past third person singular, i.e. zero, no extra token is introduced (cf. Tyers and Pirinen (2016) and ). The logic of the split solution in Erzya can be questioned. This question is underlined by the fact that Komi-Zyrian has the copula complement plural marker -öсь -əɕ, which is used for marking non-past non-verbal predications of attribution, location and even possession in much the same way as the more elaborate Erzya morphology, yet Komi does not split this affix off from the stem. If the copula morphology is segmented, the -an affix in figure (2a) is better illustrated by figure (3). The distinction between (3) and (2b) lies in which argument the agreement marking is attributed to. In (2b) it is the name that commands subject correlation, and prosodic stress falls on the personal pronoun root. In (3), however, the constant is actually the personal pronoun, and prosodic stress falls on the proper name root. These are matters, of course, for future work at discourse levels.
Unlike the Turkic languages, the Erzya language has no unquestionable, distinct morphological element representing the copula in dependent marking other than what actually expresses tense, person and number, although comparative linguistics does postulate the merging of a form of the copula into the copula complement. We are therefore presented with the choice of following the Turkic lead, i.e. separating copula morphology from nouns (in numerous cases), adjectives and numerals, for the sole purpose of reusing a ready solution. The non-past, third-person singular form of the copula complement, however, takes zero marking, which would point to a non-symmetric representation of the copula construction. This, in turn, indicates a need for a more elegant UD resolution of the issue.

Deverbal words and features
In Table 1 we utilize, correct and expand upon working knowledge in Uralic deverbal word constructions as represented in or analogous to UD work (cf. Tyers and Pirinen (2016)). Komi-Zyrian uses two finite clauses in sentence (i), where other languages all have a non-finite solution to the problem.
In sentence (ii), it appears that Hungarian and Estonian treat their deverbal nouns as nouns, while the other languages all encode their analogous forms with a spectrum of non-finite interpretations: Conv, Ger, Inf and Sup. North Sámi, Erzya and Finnish mark person on the converb, which might not be mandatory. The present participle in sentence (iii) is treated in all treebanks, except for Hungarian, as a deverbal form. Komi-Zyrian and Finnish, however, introduce a 'PartForm=Pres' value, deviating from the 'Tense=Pres' strategy of the other treebanks.
In sentence (iv) the subject in nearly all languages is regularly a deverbal noun, although derivation or inflection is not indicated in the Hungarian treebank. While North Sámi, Finnish and Estonian use an attributive copula construction, Erzya and Hungarian apply an equation construction with a noun head, and Komi-Zyrian uses a simple infinitive to mark the primary argument with an attributive copula construction.
Sentence (v) provides multiple solutions for the second argument of the matrix verb. While the secondary argument in North Sámi, Hungarian and Estonian is an infinitive, the other languages use a deverbal noun. It should be noted that the deverbal noun is an object in Erzya, a subject in Komi-Zyrian and an oblique in Finnish. The subject-object dichotomy may be observed in the use of infinitives, as well.

Summary
At the moment, Uralic treebanks are mainly representative of the largest languages in each branch, although with the recent addition of a Karelian treebank the coverage is already spreading in the direction of smaller Finnic languages with a high representation of Balto-Finnic languages. In the same vein, smaller Sámi languages would be very welcome to UD, and similarly Udmurt and Moksha would increase the diversity of these branches. As mentioned previously, Mari and Samoyed treebanks are still missing entirely, although some of these languages already have openly licensed annotated corpora that could easily be extended into treebanks. In principle the large coverage of Uralic languages at this point makes it realistic to expect that new treebanks would not introduce entirely different phenomena from what is already represented by the current Uralic languages. In the case of smaller and less studied Samoyedic languages, however, there may be questions that need specific attention, e.g. the expression of evidentiality and mood. Similarly Ob-Ugric languages may introduce dative alternation in Uralic languages.
The most common inconsistencies between languages in the Uralic treebanks seem to be related to the traditional terminology and concepts used in the description of individual languages. These are presumably the result of conversion schemes used when transforming different tagsets into UD. Especially with smaller treebanks improvements could be made relatively fast. However, since the inconsistency is large, it may not always be evident what the best shared solution is. The phenomena pointed out in this paper could be taken into account when systematizing the Uralic treebanks in future releases, although some of the work certainly falls beyond this language family into wider questions around cross-linguistic comparability in Universal Dependencies treebanks. One solution could be to create an explicit mapping between grammatically similar phenomena in the treebanks, and provide harmonization scripts that would adjust different phenomena into comparable representations. This could be connected to better documentation of the treebank conventions, ideally in a machine readable format, so that similar phenomena in different languages could be automatically linked to one another.
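As a minimal sketch of what such a machine-readable mapping and harmonization script could look like, the fragment below stores rewrite rules as JSON and applies them to a FEATS string. The rule format and function names are our own assumptions, and the single rule shown (rewriting 'PartForm=Pres' as 'Tense=Pres', the divergence noted for present participles) is only an illustration of the mechanism, not a recommendation for which representation should be preferred.

```python
import json

# Hypothetical rule file: each rule rewrites one Feature=Value pair into another.
RULES_JSON = '[{"from": "PartForm=Pres", "to": "Tense=Pres"}]'


def load_rules(text: str) -> dict[str, str]:
    """Read rewrite rules from JSON into a {source pair: target pair} mapping."""
    return {rule["from"]: rule["to"] for rule in json.loads(text)}


def harmonize_feats(feats: str, rules: dict[str, str]) -> str:
    """Apply the rules to a FEATS string and return it in sorted order."""
    if feats in ("", "_"):
        return "_"
    rewritten: dict[str, str] = {}
    for item in feats.split("|"):
        name, value = rules.get(item, item).split("=", 1)
        rewritten[name] = value
    # CoNLL-U expects features sorted alphabetically, ignoring case.
    ordered = sorted(rewritten.items(), key=lambda pair: pair[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


rules = load_rules(RULES_JSON)
harmonized = harmonize_feats("Case=Nom|Number=Sing|PartForm=Pres|VerbForm=Part", rules)
```

Keeping the rules in a data file rather than in code means the same file can double as documentation of how each treebank's conventions relate to the shared representation.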

Acknowledgements
We want to thank three anonymous reviewers for their useful comments. Niko Partanen's work has been carried out within the project Language Documentation meets Language Technology: the Next Step in the Description of Komi, funded by the Kone Foundation.