Towards an open-source universal-dependency treebank for Erzya

Abstract
This article describes the first steps towards an open-source dependency treebank for Erzya based on Universal Dependencies (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public-domain original Erzya sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically analyzed and disambiguated, after which they are annotated manually for dependency structure. In the article we present some issues in dependency syntax for Erzya and how they are analyzed in the Universal Dependencies framework. Preliminary statistics are given for dependency parsing of Erzya, along with points of interest for future research.

Tiivistelmä
Tässä artikkelissa kerrotaan ersän kielen avoimen puupankin ensimmäisistä askeleista, joissa sovelletaan universaaliriippuvuus-annotaatiota (UD). Puupankki sisältää 610 virkettä joissa on yhteensä 6661 tokenia ja se perustuu avoimeen ersänkieliseen originaalikirjoituksiin. Tällä tavalla varmistetaan puupankin saatavuutta ja laajennettavuutta. Puupankin tekstit on ensin analysoitu morfologisella jäsentimellä ja disambiguoitu, minkä jälkeen suoritetaan loppuyksiselitteistäminen käsin ja lisätään riippuvuussuhteet. Artikkelissa esitetään joitakin kysymyksiä, jotka esiintyvät ersän lauseoppia sovellettaessa universaaliriippuvuuskehyksiin. Annetaan alkutilastoja ersän jäsennyksestä sekä ajatuksia tulevan tutkimuksen näkemyksistä.

Те статиясонть сёрмадтано эрзянь келень од ресурсадо, конась весеменень панжадо, чувтокс валрисьмень пурнавксто, чувтонь банкто, ды юртонзо путомадо. Валрисьмень анализэнь теемстэ нолдави тевс масторлангонь вейсэнь аннотация, конаньсэ невтеви валрисьме пелькстнэнь вейкест-вейкест эйстэ чувтокс аштема лувост (Universal Dependency, UD). Статиянть сёрмадомсто чувтонь банкось ашти 610 валрисьмеде, косо весемезэ 6661 токент (валтлотксема тешкст), материалось ашти весеменень панжадо эрзякс сёрмадозь литературанть эйстэ. Истя чувтонь банкось саеви-келейгавтови кинень мелезэ – ресурсась ванстсы оляксчинзэ. Васня пурнавксонь валрисьметненень тееви морфологиянь анализ, конась мейле седе вадрялгавтови синтаксисэнь анализсэ. Мейле келень ванкшныцясь сонсь невти кона пелькстнэ конатнень эйстэ аштить. Статиясонть макстано зярыя кевкстемат, конат чачить эрзянь кель UD марто вастневемстэ. Макстано эрзянь келень анализдэ васнянь статистика ды арсемат-мельть келень ванкшномань сыця ёнкстнэде-тевтнеде.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/


Introduction
This article describes work towards the development of a Universal Dependencies-based dependency treebank for Erzya, a Uralic language traditionally spoken in the Volga Region. Little if any computational-linguistic research has been published on syntactic parsing for Erzya. A valuable resource in the study and development of syntactic parsing is a treebank, that is, a corpus of parsed texts containing gold-standard syntactic annotation.
Freely available treebanks exist for many languages. One particularly interesting set is the group of over 60 languages represented in Universal Dependencies (UD), where Erzya is now one of the smaller "upcoming languages"¹. This common representation makes it possible to understand and utilize language-independent dependency tagging by direct analogy with other Uralic languages, such as Finnish, Estonian, North Sami and Hungarian, as well as with languages sharing other morphosyntactic characteristics with Erzya. The UD environment also makes direct reference to terminology definition resources, such as those offered by SIL², and to research in the World Atlas of Language Structures (WALS)³.
To our knowledge, however, no previous treebank exists for either of the Mordvinic languages, although there are closed annotated corpora, such as MORMULA at the University of Turku⁴, quantlang-uhlcs⁵ in Helsinki, and the semi-limited ERME⁶.
In building our treebank we take advantage of previous work done by Rueter in Helsinki Finite-State Transducer Technology (HFST) morphological analysis and part-of-speech tagging for Erzya on the Giellatekno infrastructure, as well as ongoing disambiguation work with Constraint Grammar (VISLCG).
The remainder of the paper is organized as follows. Section 2 gives some background linguistic information on Erzya, and outlines some special challenges in parsing Erzya. In Section 3 we describe the corpus that we annotated and the methodology used in annotating it. Section 4 gives a sketch of some decisions we have made with respect to annotation guidelines, referring back to the discussion in Section 2. For reasons of space and time, these guidelines are by no means complete, but they do present a subset of guidelines which are of particular interest.

Erzya
Erzya is one of the two Mordvinic languages traditionally spoken in scattered villages throughout the Volga Region and the former Russian Empire. It had well over a million speakers at the beginning of the 20th century, a figure that had fallen to approximately half a million by the 2010 census⁷. For some, however, Erzya is only a part of the conglomerate Mordvin index, a population ranked as the most numerous among the speakers of Uralic languages in Russia.
Since there is no Mordvin language, as it were, but rather the closely related (adjacent yet not contiguous) Erzya and Moksha languages with their literary representation, research in syntax has often attempted to encompass the two. Erzya, like many Uralic languages, is agglutinative with extensive morphology, agreement and constituent ordering phenomena that present a challenge to any syntactic description of the language. The most prominent of these challenges, apparent from the start, are case marking, definiteness, ellipsis, numerals, and copula variation between dependent and independent morphology. An open-source finite-state morphological analyzer constructed for Erzya⁸ provides ample tagging for the annotation, but there is still plenty of work to be done with disambiguation. Erzya, much like other Ural-Altaic languages (Tyers and Washington, 2015), assigns more than one function to its cases. It also attests to intricate constituent ordering and minimal conjunction/subjunction marking, which will be one topic of future research.
As indicated in Table 1, the definite nominative singular might be attested with the dependency relations nsubj and root (the latter in certain equative predications⁹, example (1)), and it can also indicate a postposed topic (see example (5.b)).
(1) 'Every year, the meeting place is always the same.'

Corpus
To form a corpus, we were able to utilize materials by Erzya authors previously secured for language research purposes while in the Republic of Mordovia. The number of sources utilized is extremely limited, due to the elementary state of the developing treebank. The initial materials are representative of original Erzya-language materials from the late 1920s to the turn of the new millennium. The Separate Individual phrases file will serve for documenting cited materials from scientific publications, such as the most recent Erzya syntax (Агафонова et al., 2011).
The figures in Table 3 are incomplete, but do provide an initial ballpark figure. Subsequent work will need to be done with extended dependency relations, as a small number of the cases returned included the values appos and conj. Also, the high number of genitive occurrences with the obl relation would indicate the presence of adpositions. Although the inessive had been indicated earlier as an object case, there was not a single instance where it occurred as a dependent case. The distinction between the dependency relations nmod and nmod:poss in the statistics follows usage in the Finnish UD projects. On reflection, this distinction may in fact be unnecessary, since the genitive case already indicates a possessive relation, in contrast to the inessive case, which has a spatial meaning as a noun modifier.
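Counts of this kind, that is, how often each case co-occurs with each dependency relation, can be derived directly from CoNLL-U files. The following is a minimal sketch, assuming the standard ten-column CoNLL-U layout with a Case feature in the FEATS column. The function names and the sample rows are hypothetical placeholders and are not data from the Erzya treebank.

```python
from collections import Counter

# Hypothetical CoNLL-U fragment: forms, lemmas, features and relations are
# placeholders for illustration only, NOT taken from the Erzya treebank.
SAMPLE = """\
# sent_id = placeholder-1
1\tw1\tl1\tNOUN\t_\tCase=Gen|Number=Sing\t2\tnmod:poss\t_\t_
2\tw2\tl2\tNOUN\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_
3\tw3\tl3\tVERB\t_\tMood=Ind\t0\troot\t_\t_
4\tw4\tl4\tNOUN\t_\tCase=Gen|Number=Sing\t3\tobl\t_\t_
"""


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Gen|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def case_deprel_counts(conllu_text):
    """Count (Case, DEPREL) pairs over all ordinary token lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        case = parse_feats(cols[5]).get("Case")
        if case is not None:
            counts[(case, cols[7])] += 1
    return counts


if __name__ == "__main__":
    for (case, deprel), n in sorted(case_deprel_counts(SAMPLE).items()):
        print(case, deprel, n)
```

Keying the counter on the (case, relation) pair keeps nmod and nmod:poss apart, so the usefulness of that distinction for the genitive can be checked against the data before deciding whether to drop it.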

Number
The Erzya language, like its closely related sibling Moksha, has what is often called the object or definite conjugation. Unlike Hungarian, Nenets, Khanty and Mansi, however, the object conjugation of the Mordvinic languages morphologically indicates 1st, 2nd and 3rd person as well as singular and plural for some of the object and subject referents (Keresztes, 1999; Trosterud, 2006). In examples (2) and (3) there is an unambiguous morphological distinction between a third person singular subject in combination with a third person singular object in (2) and with a third person plural object in (3).
Although it is possible to use personal pronouns in subject and direct object position, their presence is not the norm. Person and number must therefore often be disambiguated from a context that transcends conventional sentence boundaries, since they are not always discernible from the morphology alone.

Copula and polarity
Erzya attests to varied non-verbal predication (Turunen, 2010) and negation (Hamari, 2007) strategies. The copula is divided into locative and non-locative usage, and this dichotomy can readily be observed in the morphology of the negative copulas: the negative locative copula is represented by /araś-/ (which is conjugated with the help of a morphologically dependent copula), whereas negation of equative or class membership predication is expressed with the non-flective word form /avoĺ/. Further negation is manifest in the converbal negation element /apak/ (which can also be conjugated), first preterite /eź-/, conjunctional /avoĺ-/, optative and prohibitive /iĺa-/, and the negative particle /a/. All of these can trigger the dependency relation aux:neg, which in the case of the polarity markers /a/ and /avoĺ/ extends to all parts of speech.
Whereas the negative particle /a/ can occur with many parts of speech, it is the non-flective word form /avoĺ/, or emphatic negative, that is used in clausal negation. Clausal negation in combination with the imperative mood evokes a contrast with the prohibitive strategy in /iĺa-/. The clausal negation particle /avoĺ/ in combination with the second person imperative produces a contrastive negative imperative, whereas the prohibitive /iĺa-/ (Mood=Proh) is combined in the modern literary norm with a connegative form. This has not been attested in WALS (van der Auwera et al., 2013).

Dependent copula morphology
Copula morphology is both dependent and independent in Erzya. While many grammarians of the past century have referred to dependent copula morphology as noun conjugation, earlier presentations, such as Wiedemann (1864, §77), refer to it as a suffixed copula. Judging from the fact that the dependent morphology can be attached to nouns in various declensional forms, as well as adjectives, numerals, adverbs, adpositions and non-finite verb forms, the phenomenon might more readily be referred to as a clitic. Thus we have the independent copula in /uĺń-/ (prt1) and /uĺ-/ (prs), on the one hand, and dependent copula morphology (-/Oĺ/-, etc.), on the other. Both the nonpast and prt2 conjugation are virtually identical to their verbal subject conjugation counterparts; essentially the prt2 in verbs is a combination of the short nomen agentis + the prt2 used for copula function beyond the scope of finite verbs.
The dependent and independent morphologies present a challenging problem for dependency analysis in Erzya, since both can occur in equative, class member, assertive, locative vs existential, and possessive vs belong-to predication. A dichotomy, however, is introduced where the Universal Dependencies guidelines stating that the copula should be the dependent of the lexical predicate are applied only to the independent copula.
The copula as dependent morphology attaches to what some scholars have considered the subject, but some scholars working with Moksha have approached this from a discourse point of view (Kholodilova, 2016). Instead of splitting the copula off as a separate leaf node, we have annotated the instances with dependent morphology as the head of the structure. Here constituent ordering might be an underlying factor to be considered in future work with the Erzya UD treebank.
In both sentences of example (5), with and without the second person singular personal pronoun, it is obvious that the word /lomań-eś/ can only be interpreted as an extra element, more than likely a topic marker. The interrogative pronoun /ki-jat/ in the predicative takes second person singular subject marking, whereas the third person singular topic marker would not have triggered second person singular agreement.

Further auxiliaries
In addition to the copula and negation, the definition of additional auxiliaries in Erzya takes us to necessitives. Necessitives in Erzya have parallels in Finnish, Komi-Zyrian and partially in Skolt Sami, in that a non-nominative case is used to indicate the actor, which might be construed as a subject when aligned with other European, accusative languages. In Erzya, it is the dative that is used, as in mońeń in example (3). Discussion with Erzya and Komi native scholars has also introduced the idea of adding verbs indicating future to the list of auxiliaries. This, however, would be problematic, due to the second meaning involved, namely the inchoative/inceptive, which is itself an aspect marker. In this initial Erzya treebank, auxiliaries have been limited to copulas, negation and necessitives.

Compound nouns
The most recent orthographic word list of the Erzya language (Бузакова et al., 2012) prescribes a mathematical strategy for compounding: if the first element is a nominative singular indefinite form (also called the absolute form) with no evident derivation (sometimes a rather gray definition), two nouns are written as a single unit. This solution is not entirely in line with the writing practices of the last century, and it does give rise to certain problems. The most evident problems are ensemble nouns containing mensural classifiers (for a definition see Lyons (1977)). In both instances we are talking about a measurement of water, whereas the idea of a bucket especially intended for water would be constructed in the telic noun /veď vedra/.

Noun head ellipsis
An analogy of symmetric negation as described in Miestamo (2013) can be applied to the description of the Erzya nominal phrase declension. In symmetric negation the structure of the negative is identical to the structure of the affirmative, except for the presence of the negative marker(s).
In a similar way, it is the final word of the Erzya noun phrase that is symmetrically declined, while modifiers (with the exception of some determinatives) appear in what is termed the absolute form. Determinatives such as /iśťamo/ 'like this/that' can agree in number with the head noun. (NB. some descriptions (Bartens, 1999) maintain that regular adjectives might agree for number as well, but this type of apparent agreement seems to be limited to parts of the northwestern dialect.) In an elliptical construction, where the noun head is recoverable or inferable from the context¹⁰, the case ending is merely joined to the final modifier of the noun phrase. In practice, the final modifier can be an adjective (such as color, size, shape, etc.), a participle or converb, a determinative (ordinal, demonstrative, collective numeral), or a non-core case-form noun (including Ine, Abe, Cmpr, Tra, Prl, etc.). If the modifiers are genitive in form, however, they generally require an additional /śe/ element, i.e. a demonstrative type element. Finally, it should be noted that this kind of construction usually takes either definite or possessive access marking, see Rueter (2010). Less frequent, perhaps due to contextual presuppositions, is the vowel-final modifier with the additional determinative. This more complex construction presupposes a contrastive context.
ašo-śe-ťńe
white-A.Sg.Nom.Indef-Det.Dem.Sg-Pl.Nom.Def
'those white ones'
In elliptical constructions of the nominal phrase, the part of speech has been retained in column 4, whereas column 6 bears witness to a deluge of ordered zero-derivation, and column 8 indicates the actual dependency relation of the noun phrase head.
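The column numbers mentioned here correspond to the UPOS (4), FEATS (6) and DEPREL (8) fields of the ten-column CoNLL-U format. As a minimal sketch of how those three fields can be read off a token line, the snippet below uses a hypothetical line for the elliptical form: the feature values, the head index and the relation nsubj are illustrative assumptions, not the annotation found in the treebank.

```python
# Field names of the ten CoNLL-U columns, in order (columns 1 to 10).
CONLLU_FIELDS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def read_token(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


def pos_feats_deprel(token):
    """Return columns 4, 6 and 8: part of speech, features, relation."""
    return token["UPOS"], token["FEATS"], token["DEPREL"]


# Hypothetical token line: the feature bundle, head index and relation
# are assumptions for illustration only, not treebank data.
EXAMPLE_LINE = (
    "1\tašośeťńe\tašo\tADJ\t_\t"
    "Case=Nom|Definite=Def|Number=Plur\t2\tnsubj\t_\t_"
)

if __name__ == "__main__":
    upos, feats, deprel = pos_feats_deprel(read_token(EXAMPLE_LINE))
    print(upos, feats, deprel)
```

In such a line the adjective keeps its own part of speech in column 4, while the relation in column 8 is the one the elided noun head would have carried.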

Numerals
The Erzya language has several types of regular counting: cardinal, collective, distributive and multiplicative. There is no problem with numerals that count nouns, such as singular concrete items, pairs or sets (such as socks, batches, broods); these numerals can be readily connected with the nummod dependency relation to the noun they modify. Problematic are those lacking a suitable dependency relation, which might better be characterized as iterative numerals; they also count entities, namely iterations of a predication (similar to once, twice, but not twofold, double or the second time).

In conclusion
This has been a description of the first steps towards building Erzya treebanks in accordance with Universal Dependencies. Much space has been dedicated to extensive morphological contemplation, although the matters requiring in-depth consideration are actually the minimalized set of dependency relations in tandem with morphological information, i.e., minimal use of language-dependent subfeatures. Hopefully, this work will provide a means for pivoting and sharing in what has been achieved for larger languages.