Applying Universal Dependency to the Arapaho Language

This paper discusses the use of Universal Dependency for annotating Arapaho (Algonquian), a Native North American language. While some relations of the universal dependency scheme correspond perfectly with those in Arapaho, language-specific annotations of verbal arguments expose the problems of assuming certain syntactic categories across languages. By critiquing the influence of grammatical structures of major European and Asian languages in establishing the UD framework, this paper develops guidelines for annotating a polysynthetic agglutinating language and sets a path to developing a more comprehensive cross-linguistic approach to syntactic annotations of language data.


Introduction
The recent initiatives to create a cross-linguistic scheme of annotation rely on Universal Dependency (UD) as a system for describing the syntactic connections between words (Nivre, 2015; de Marneffe et al., 2014). While research shows this annotation type is effective not only for monolingual parsers but also cross-linguistically across multiple platforms, the universality of this approach rests on the assumption of similar syntactic structures across major, often European, languages (McDonald et al., 2013). Without doubt, those are also the languages that receive predominant attention in the computational sphere, the languages whose technological presence requires a thorough analysis and annotation. However, if the goal of natural language processing is truly to develop a universal cross-linguistic strategy for annotating and analyzing linguistic data, it is important to attend to lesser described languages that may present strikingly different syntactic structures and dependencies.
When we applied the UD rules to data from the Arapaho (Algonquian) language, several specific features fell outside of the charted labels. Since the language does not have a fixed word order and allows discontinuous constituency, dependencies on the previous word were avoided and re-analyzed. The most problematic dependency distinction in this language is the variation in relations between a verb and its arguments. This paper examines the correlation of the dependency relations in the UD scheme and their practical application for the Arapaho data. Using the UD framework, we create guidelines for annotating this data. For reasons of space, this paper primarily focuses on the argument structures defined by the UD and their correspondences to the Arapaho syntactic patterns. An additional discussion of non-verbal roots and topicality problematizes some of the common assumptions in discounting pragmatic features while analyzing syntactic dependencies.
In the following pages, we first provide a short note on the Arapaho language and the procedures of annotations (2); discuss issues of mapping the labels for subject, objects, and noun modifiers of the UD onto the Arapaho dependencies (3); define the mechanism of analysis of non-verbal roots (4); and suggest further ways of developing these annotation guidelines (5).

Arapaho data and annotations
Arapaho is an Algonquian polysynthetic agglutinating language spoken by fewer than 200 people on the Wind River Indian Reservation in Wyoming. Because the language is critically endangered, there have been attempts at documenting and preserving it. A large transcribed and annotated spoken corpus has been created, and parts of it are now available in the Endangered Languages Archive 1. A total of around eighty thousand lines, transcribed, translated, and grammatically analyzed, is available for further processing. The current attempt at establishing a dependency scheme for this language initiates a new type of analysis of this data to allow machine processing.

Some features of the Arapaho language
The current paper largely relies on the previous description of the Arapaho grammar by Cowell and Moss (2008). There are several intriguing features of the grammar, but the ones most relevant to this study are its complex verbal morphology, split semantic and syntactic transitivity, and the system of obviation.

Verbal complexity
As is observed in many other poly-synthetic languages, Arapaho verbs are highly complex and mark multiple grammatical and semantic features. So, in example (1), a single verb demonstrates incorporation of not only the usual tense, aspect, mode, person, and number features, but also the manner of action and an incorporated object.
(1) he'ih'ii-xoo-xook-bix-ohoe-koohuut-oo-no'
    "Their hands would go right through them and appear on the other side."

A single verb can be a full clause conveying a full thought. Verbal prefixes code grammatical as well as many semantic features, inhibiting the dependency analysis since this framework only considers the relations between individual words.

Transitivity
The category of verbal transitivity is both syntactic and semantic (Cowell and Moss, 2008). To understand how many arguments are allowed in a verb's frame, one must examine both the morphological and the semantic structure of a verb. So, while semantically a verb to'oo3ei "to hit things" may appear transitive, grammatically it is intransitive, requiring only one argument, the subject, as in too'oo3einoo "I am hitting (unspecified) things." The transitivity of a verb is expressed in its inflection, which must agree in person and number with its arguments. Truly transitive verbs carry inflections agreeing with both of their arguments:

(2) Nih-to'ow-oo-t    nuhu'   hinen-ino
    PST-hit-3/4-3S    this    man-OBV.PL
    "He hit these men"

Even though only one of the two arguments appears in the sentence, the verb nihto'owoot "s/he hit him/her" is marked to agree both with the semantic agent and undergoer of the verb. This semantic distinction in the arguments is not observed in intransitive and semi-transitive verbs. Because such verbs demonstrate morphological agreement only with one nominal 2, other nominals are considered outside of the argument structure of a verb even if they specify the semantic patient or theme.

(3) nih'ii-koo-ko'uyei-3i'              biino
    PST.IMPF-REDUP-pick things-3PL      chokecherries
    "They were picking chokecherries."

So in example (3), the noun biino "chokecherries" is not reflected in verbal morphology, but corresponds with its semantics by specifying the object of picking. Being outside of the argument structure of this verb, syntactically the noun is better understood as a verbal adjunct specifying the manner of action, while semantically it is still the patient. So the designation of the relationship between such arguments and verbs as dobj of the universal dependencies is wrong because it does not consider verbal morphology, whereas the label of nmod would not account for its semantic role.

Obviation
Unlike many languages, Arapaho does not rely on word order or case markers to disambiguate between overt nominals; rather it uses a system of obviation that incorporates a distinction based on animacy along with the combination of verbal morphosyntax and pragmatics to mark particular grammatical roles. This system clearly distinguishes between two third person referents by marking one of them (a less salient one in the discourse) as obviative and leaving the other referent unmarked (proximate). In Algonquian languages, obviation is argued to be a pragmatic feature structuring discourse outside of a single clause (Goddard, 1984). Verbal morphology also shows agreement with these categories: the transitive verb inflection clearly marks which argument is acting on the other. So, instead of the usual three persons, Arapaho has four, with the fourth person being the obviative argument. In the example below, the obviative argument is the noun hiinoon "his mother," which corresponds with the verbal subjunctive inflection -eihok "4th person acting on 3rd singular."

(4) Hohou,       hee3eihok                 hiinoon           3eeyokooxuu.
    thank you    say to s.o.-4/3S.SUBJ     his/her mother    Tipi-pole Child
    "Thank you," his mother said to Under-the-Tipi-Pole Child.

As observed in this example, obviation does not correspond with the semantic or the syntactic role of an argument. Nor does it depend on the transitivity of a verb. Rather, obviative status lines up with the semantic role of an obviative coded in verbal morphology. Based on this feature of transitivity and obviation, the current paper suggests employing the semantic labels in marking the syntactic relations.

Annotation procedures
We are not aware of previous attempts at dependency annotations with other Algonquian languages; however, dependency grammar has been one of the theoretical approaches in Algonquian syntax. The guidelines discussed below were created based on the annotations of a small set of Arapaho narratives. In the first phase of the project, the dependency relations were outlined based on annotations of a sample of several traditional narratives, totaling about two thousand lines 3. The annotators, one fluent non-native speaker and three graduate students in Linguistics familiar with Arapaho language structure, were given a protocol established without consideration of the UD framework but based purely on Algonquian syntax patterns. Several problems in using these syntax patterns clarified and specified the dependency relations, leading to the creation of a new set of labels.
In the second phase of the project, these new labels were further standardized based on the principles of the Stanford Dependencies (de Marneffe and Manning, 2008). Using this new set of relations, the annotations of the previous phase were converted and a total of 3616 lines of elicited personal and traditional oral narratives as well as 593 lines of conversational data were newly annotated. The disfluency of the conversational data indicated major issues with this annotation scheme, which prompted us to turn to the UD-based system. The guidelines presented here have been used to re-mark the previous annotations of the data used in the second phase. No special software to perform

Mapping the UD scheme
Out of the forty dependencies proposed by the UD, thirty Arapaho dependencies have one-to-one correspondence. An additional seventeen specifications and relations have been added to describe language-specific instances. The final scheme of Arapaho's nominal argument dependencies is presented in Table 1. Some of the dependencies were not used in the Arapaho scheme because such dependencies simply do not exist in this language. So, for example, the language does not have a grammatical category of an adjective; therefore, the dependency amod has not been used; instead descriptive verbs are analyzed as relative clauses, acl. Example (5) demonstrates the relative clause dependency where a verb modifies the noun in the same manner that an adjective would.

siikoocei'ikuu3oo    nih-nohk-okoo'ohuni-'i.
rubber item          PST-INSTR-sealed with stiff object-0PL
"They would be sealed with the red rubber gasket."

In addition, there are no relative pronouns in the language, so the dependency relation marker is also obsolete in the current scheme. Similarly, there is no category of a number or numeral; instead the number can be expressed by a verb or a particle, in which case it is analyzed just like other particles with the dependency of advmod to the word that it modifies. In sum, omitted UD relations are the ones that are either expressed by some other dependency or non-existent in the Arapaho language.
Several UD dependencies line up perfectly with the Arapaho scheme. Such relations as noun modifiers, adverbial modifiers, adverbial clauses, determiners, appositives, relative clauses, case markers, and a few more have a direct correspondence. For example, an adverbial clause in Arapaho is very similar, if not identical, to the adverbial clauses described for other languages in the UD. Arapaho adverbial clauses, as seen in the example below, lack a distinct word introducing them; instead, the head of an adverbial clause exhibits particular morphological markers indicating its dependency. So in example (6) this distinction is made by the subjunctive mode, indicating that the verb bih'iyoohok "when it is dark" is a dependent of the main verb of the sentence.

"When it's dark, we'll come back."

In general, dependencies between function words and content words mirror the same dependencies in the UD framework, and most of these dependency labels are used. The most complicated dependency relations tend to be between the content words, and especially the relations between the verb and its arguments. From the UD scheme, only one of such relations matches the Arapaho scheme with some modifications: nsubj and csubj correspond to subjects of intransitive verbs and transitive inanimate verbs. Similarly, subjects of passive verbs also correspond to the nsubjpass and csubjpass dependencies. Additional provisions are made in the Arapaho scheme to account for the obviation status. In the following section, we discuss all of the provisions and additions made to the argument dependencies.

Subjects
While there is some correspondence between the UD's nsubj and subjects in Arapaho, it is, nonetheless, problematic to analyze subjects based purely on syntax since there are no syntactic features that would index the particular verbal arguments. Because nominals can take any position in the sentence and because they are not marked by a case corresponding with their syntactic role, the only certain way of finding a subject is in the person and number verbal agreement. The proximate and obviative distinction also does not clarify the syntactic role of the nominal, so with transitive verbs, the proximate form can be either agent or undergoer, and thus roughly correspond to either subject or object in English. In other words, the distinction of subject is not really important in the Arapaho language, especially with transitive verbs, and a relationship that is based on obviation would mark the dependencies more clearly. In response to this, the current dependency scheme adopted the UD dependency of nsubj and csubj with the additional marker :obv to index the obviative arguments of intransitive verbs expressed in the verbal morphology. The proximate counterparts are not marked. In example (7), the obviative noun agreeing with the verb is such a subject. Similarly, the nsubj and csubj dependency is also used for animate arguments of transitive inanimate verbs (VTI) and inanimate arguments of intransitive inanimate verbs (VII). However, transitive verbs exhibit a double marker indicating both the proximate and obviative participants, as well as the direction of action (agential relationship) between the two. The proximate participant can be either agent or patient, as can the obviative participant. So, an additional label employing the semantic distinction, nagent (nominal agent), is introduced. The following example further demonstrates the mismatch between subject and agent in Arapaho.
Here, the verb is in passive voice, and the "subject" of the verb is "my grandfathers." However, this "subject" is obviative, and it is the oblique agent ("my father") which is proximate.

Neisonoo          nihcihwonbiineihini3i                nebesiiwoho'.
NA                VAI.PASS                             NA
nagent:oblique                                         nsubjpass:obv
ne-isonoo         nih-cih-won-biin-eihi-ni3i           ne-besiiwoho'
1S-father         PST-to here-ALLAT-give-PASS-4PL      1S-grandfathers.obv
"My grandfathers were given (sth) by my father"

Since the verb is passivized and thus intransitive, only one argument is reflected in its morphology, the obviative subject nebesiiwoho' "my grandfathers." The label of nagent is kept with an additional marker :oblique to indicate that the argument neisonoo "my father" is not expressed in verbal morphology. Importantly, such oblique agents are different from noun modifiers, which are discussed further in the paper, because they specify the actor of the verb rather than its manner.
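The passive example above can be written down in a machine-readable form. The following is a minimal sketch in Python, not the annotation format used in the project: the Token class, its field names, and the tab-separated layout (loosely modelled on CoNLL-U) are illustrative assumptions, as is treating the verb as the root of this sentence. The word forms, tags, and dependency labels are taken from the example itself.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface word
    tag: str     # part-of-speech tag as given in the example
    head: int    # index of the head token; 0 marks the root
    deprel: str  # dependency label


# Assumption: the passive verb is the root; both nominals depend on it.
sentence = [
    Token(1, "Neisonoo", "NA", 2, "nagent:oblique"),
    Token(2, "nihcihwonbiineihini3i", "VAI.PASS", 0, "root"),
    Token(3, "nebesiiwoho'", "NA", 2, "nsubjpass:obv"),
]


def to_rows(tokens):
    """Render tokens as tab-separated rows (a simplified CoNLL-U-like layout)."""
    return [
        "\t".join([str(t.idx), t.form, t.tag, str(t.head), t.deprel])
        for t in tokens
    ]


def dependents(tokens, head_idx):
    """Return the forms of all tokens attached to the given head index."""
    return [t.form for t in tokens if t.head == head_idx]


for row in to_rows(sentence):
    print(row)
```

Note that the linear order of the two nominals plays no role in the encoding: the obviative subject and the oblique agent are distinguished only by their labels, which matches the observation that Arapaho word order does not identify grammatical roles.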
The subject relationship is not clearly defined in the Arapaho language. Instead, it is possible to talk about nominal expressions that are indexed by verbal morphology either as sole arguments of (syntactically) intransitive verbs or agential arguments (proximate or obviative) of the transitive animate verbs. We propose to account for this distinction as well as the distinction in obviation, which is clearly marked in nominal and verbal morphology.
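The way the subject labels in this section are composed can be sketched as a small helper. This is only an illustration under stated assumptions: the function name subject_label and its boolean parameters are hypothetical, and the sketch covers only the labels named in this section (nsubj, csubj, nsubjpass, csubjpass, and the :obv marker). It does not decide whether a nominal is a subject in the first place, which depends on verbal agreement.

```python
def subject_label(clausal=False, passive=False, obviative=False):
    """Compose a subject label from the distinctions discussed in this section.

    clausal   -- True for a clausal subject (csubj), False for a nominal one (nsubj)
    passive   -- True if the verb is passivized
    obviative -- True if the argument indexed on the verb is obviative;
                 proximate arguments receive no extra marker
    """
    label = "csubj" if clausal else "nsubj"
    if passive:
        label += "pass"
    if obviative:
        label += ":obv"
    return label


# The obviative subject of the passive verb in the example with "my grandfathers"
print(subject_label(passive=True, obviative=True))
```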

Objects
The prototypical objects of transitive verbs do not easily fit the dobj relation in Arapaho. This is primarily because Arapaho verbs commonly undergo complex secondary derivation to produce verb stems which allow one to promote an animate argument to a core argument, marked inflectionally on the verb disregarding its semantic role. Thus, benefactives, recipients, goals, and even themes are typically the "object" marked inflectionally on the verb. Conversely, other arguments that would be classic "direct" objects in English are demoted, and not marked inflectionally on the verb. On the other hand, because the promoted animate argument is marked inflectionally, it can also easily be dropped from overt mention in the sentence, while unmarked items are much more likely to be mentioned explicitly.
Thus, when the manual for universal dependencies notes that dobj is the most patient-like argument of a verb, this is in direct tension with the tendencies of Arapaho transitive verb dependencies.
Additionally, when it notes that "if there is just one object, it should be labeled dobj, regardless of the morphological case or semantic role that it bears" (UniversalDependencies.org, 2014), this raises additional problems, since the actual 'object' marked on the Arapaho verb is highly likely not to appear in the sentence. The only exception and full correspondence to the UD's definition of the direct object is the inanimate object of an inanimate transitive verb (VTI):

(10) niico'ontonounowoo nuhu' niinen.

Because the verb is transitive inanimate, it requires two arguments, only one of which (the animate agent) is marked inflectionally. The second argument can only be expressed by an inanimate noun and can either precede or follow the verb. So the overt nominal in the example above represents a prototypical direct object for transitive inanimate verbs. Meanwhile, transitive animate verbs can have up to three arguments (e.g., ditransitive verbs), with the two animate arguments being expressed inflectionally on the verb. So technically, ditransitive constructions may have only one overt nominal not corresponding to either of the person markers in verbal inflection. According to the UD definition cited above, such a nominal should be considered a direct object. In the following example, the true "object" of the Arapaho verb is "you" (since it is in imperative form), while "your eyes" is not marked on the verb, and is thus from the perspective of Arapaho grammar an oblique form. There is no direct agreement between the secondary object hesiiseii "your eyes" and the verb. Ideally, this should be represented by the iobj relation, which emphasizes the indirect syntactic relation between the verb and the nominal.
Furthermore, objects of a transitive animate verb (VTA) and transitive inanimate verb (VTI) are different from the point of view of the grammar 4 and their respective part of speech designation 5. Hence, a further specification of the dobj is necessary for transitive verbs. To stay consistent with the labels proposed for the nagent and cagent relations, the additional labels employed are :under and :obv. In example (12), the object clearly marked on the verb is the fourth person, or the obviative. Specifying this dependency relation disambiguates between the nominals and enables the correct translation of the sentence.
So, in the current scheme the distinction between different types of objects is further clarified. The iobj is reserved only for the secondary objects of the ditransitive verbs which show no verbal agreement. Meanwhile, the dobj is used to mark the dependency relation between the transitive inanimate verb and its object, which is also not specified in the verbal morphology. The label dobj:under, with the additional specification of obviation, indicates the dependency relation between transitive animate verbs and the undergoers specified in the verbal morphology.
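The three-way distinction summarized above can be sketched as a decision function. This is a hedged illustration rather than the project's procedure: the function name object_label and its parameters are hypothetical, and the combined spelling dobj:under:obv for an obviative undergoer is an assumption, since the text names the markers :under and :obv without showing how they are joined. The sketch also assumes the caller has already established that the nominal is an object rather than an adjunct (nmod), a distinction taken up in the next section.

```python
def object_label(verb_class, indexed_on_verb, obviative=False):
    """Pick an object label from the verb class and verbal agreement.

    verb_class      -- "VTI" (transitive inanimate) or "VTA" (transitive animate)
    indexed_on_verb -- True if the nominal is marked in the verb's inflection
    obviative       -- True if the indexed undergoer is obviative
    """
    if verb_class == "VTI":
        # Inanimate object of a VTI verb: never indexed on the verb.
        return "dobj"
    if verb_class == "VTA":
        if indexed_on_verb:
            # Undergoer marked in verbal morphology.
            # Assumption: markers are chained as dobj:under:obv.
            return "dobj:under:obv" if obviative else "dobj:under"
        # Secondary object of a ditransitive verb, no verbal agreement.
        return "iobj"
    raise ValueError(f"unsupported verb class: {verb_class!r}")


print(object_label("VTI", indexed_on_verb=False))
print(object_label("VTA", indexed_on_verb=True, obviative=True))
print(object_label("VTA", indexed_on_verb=False))
```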

Noun Modifiers
The dependency relation of noun modifier corresponds rather well to the noun modifiers in Arapaho. It is primarily used for the disambiguation between direct or indirect objects of transitive verbs and the implied, incorporated objects, or otherwise, adjuncts.
Having argued that some overt nominals of transitive animate verbs play the role of a secondary, or indirect, object, we now also argue that such a label in the same context can be inappropriate as well. Using the UD rules for distinguishing the dependency in the example below would lead to analyzing the nominal koxouhtiit "handgame" as a direct object of the main verb. But as one can see from the translation, it would also lead to a wrong analysis. Similarly, the indirect object analysis would also be incorrect. Indeed, annotating this noun as an oblique or an adjunct, nmod, is the only way of ensuring the correct analysis and translation. When adjuncts are used with semi-transitive verbs, the nmod relation is further suffixed with :objim to note that the noun modifier further specifies the under-specified objects of semi-transitive verbs. Essentially, while these nominals are analyzed and marked as noun modifiers, for a successful translation they need to be marked as direct objects, which we have argued against in the previous section. In order to avoid the incorrect translation as well as incorrect analysis, the label nmod:objim is used. In the following example, the noun bei'ci3ei'i "money" semantically is the object of the semi-transitive verb. However, as we argue, marking it as direct or indirect object would violate the principles of Arapaho syntax.

neeyeih'oonotooneenou'u bei'ci3ei'i.

In addition, some of these implied or incorporated objects with overt nominal expressions can be modified by an adverbial particle similar to a preposition in English. In the example above, particle hi3oobei'i' "under" is a dependent of the adjunct neci' "water-LOC." This relation is reflected in the locative case marker on the noun showing a direct dependency with the particle. The Arapaho dependency scheme additionally distinguishes the instrumental case since there are special case markers defined by an adverbial or an adverbial prefix. So where the prefixes hi'-, nohk-, and nii3- are present or where the corresponding adverbial particles appear, the nominal adjunct is considered to be instrumental (nmod:instr). So in example (5), the relation between the head of the relative clause siikoocei'iikuu3oo "rubber item" and the main verb is nmod:instr.
Finally, an additional dependency poss, possessor modifier, is used for possessive constructions with an overt possessor. Similar to Finnish (Tsarfaty, 2013), in Arapaho it is possible to distinguish between the subject and the object of a possession. However, unlike in other languages, no special genitive construction exists to mark this type of relation. Instead, the possessor and possessed appear side by side. The possessed in such constructions has a third (or fourth) person possessive marker. So in a phrase nii'ehihi' hi-siiseii "little bird his-eyes" the possessor is "little bird" since the possessive prefix hi- "his" directly references this third person. The dependency relation marked here is a possessor nominal modifying another nominal.
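The prefix-based part of the instrumental rule lends itself to a short sketch. The code below is an illustration under assumptions, not the annotation tooling: the helper names are hypothetical, the verb is assumed to be supplied already segmented with hyphens (as in the glossed examples), and any morpheme matching one of the three prefixes is treated as that prefix. The alternative trigger mentioned in the text, a corresponding adverbial particle elsewhere in the clause, is not modelled.

```python
# Instrumental prefixes named in the text (written without the trailing hyphen).
INSTRUMENTAL_PREFIXES = {"hi'", "nohk", "nii3"}


def has_instrumental_prefix(segmented_verb):
    """Return True if any hyphen-separated morpheme is an instrumental prefix."""
    return any(m in INSTRUMENTAL_PREFIXES for m in segmented_verb.split("-"))


def adjunct_label(segmented_verb):
    """Label a nominal adjunct of the given verb as nmod:instr or plain nmod."""
    return "nmod:instr" if has_instrumental_prefix(segmented_verb) else "nmod"


# Verb of example (5), glossed PST-INSTR-...: the prefix nohk- is present.
print(adjunct_label("nih-nohk-okoo'ohuni-'i"))
# Verb of example (2): none of the instrumental prefixes occurs.
print(adjunct_label("nih-to'ow-oo-t"))
```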
The examples above demonstrate that not all of the arguments that may semantically appear similar to the dependencies established in UD are the same in Arapaho. While under-specification of the semantic relationships can be beneficial in establishing some commonalities cross-linguistically, it can also result in misrepresentation of some of the relations and lowered efficiency in machine learning (Lipenkova and Souček, 2014). The major underspecification for the Arapaho language is the omission of proximate-obviative distinction: while we realize that it could potentially be problematic in cross-linguistic applicability, omitting this distinction disregards one of the main features of Algonquian syntax, and renders automatic translation of English transitive verbs into Arapaho effectively impossible.

Non-verbal roots
Adopting the relation of a root as the independent word in a clause or sentence allows us to avoid issues arising from securing the root node with verbs. So, like in the UD scheme, our annotations do not attach the node of a root to a particular part of speech even though they are usually represented by verbs. The main reason for doing this is avoiding the potential analysis of what is not there (Nivre, 2015; Hajicova et al., 2015; Osborne and Liang, 2015). In our annotations, the root often represents a pragmatically independent word, as for example in predicative type constructions (Cowell and Moss, 2008). Such constructions are used to topicalize one of the verbal arguments or the manner of action (i.e., verbal particles) similar to existential constructions in other languages. However, instead of marking the predicate as a root of the sentence as it is done in the Russian TreeBank (de Marneffe et al., 2014), the topicalized nominal or the particle is the root in Arapaho. The relation between the root and the predicate is backreference:

(16) Ni'ook        he'ne'nih'iisih'it.
     NAME          VAI.PASS
     root          backref
     Ni'ook        he'ne'-nih-'iisih'i-t
     Puffy Eyes    that-PST-how named-3S
     "Puffy Eyes, that is how he is named."

In example (16), the argument of the verb he'ne'nih'iisih'it "that is how he is named" is not realized overtly, and the verbal prefix ne'- "that is" references back to the topical argument, making the verb actually a dependent of it. Were we to analyze distinct morphological elements, this prefix would act as a copula between the two. Overall, the reasoning for treating such topicalized elements (which sometimes may take other than the clause-initial position) as roots comes from the combination of the pragmatics and morphology: nearly all of the verbal clauses with prefixes ne'- "that" and nee'ees- "that is how" are backreference dependents of such roots.
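The analysis of example (16) can be checked mechanically with a small sketch. The tuple layout and the function names below are illustrative assumptions; the forms, tags, and the labels root and backref come from the example. The point of the sketch is that nothing in the representation requires the root to be a verb.

```python
# Each token is (form, tag, head, deprel); head 0 marks the root.
example_16 = [
    ("Ni'ook", "NAME", 0, "root"),
    ("he'ne'nih'iisih'it", "VAI.PASS", 1, "backref"),
]


def root_of(tokens):
    """Return the single root token, whatever its part of speech."""
    roots = [t for t in tokens if t[2] == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def backref_dependents(tokens):
    """Return forms of tokens attached to the root with the backref relation."""
    root_position = tokens.index(root_of(tokens)) + 1  # heads are 1-based
    return [
        form
        for form, _tag, head, deprel in tokens
        if deprel == "backref" and head == root_position
    ]


print(root_of(example_16)[1])          # tag of the root: a name, not a verb
print(backref_dependents(example_16))  # the verb depends on the topicalized name
```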

Conclusion
In this paper we demonstrate the use of the Universal Dependency scheme with a language typologically different from the ones often included in machine-learning technologies. In using the UD framework, we noticed several unmentioned issues stemming from the reliance on word order. For example, in the current annotation scheme, we reanalyzed the relationship of parataxis to account for verbs of citation, so that the dependency would be traced from such a verb to the root of the whole clause. Similarly, the discourse marker dependencies were modified to include and analyze interjections. Unfortunately, it is outside of the scope of the paper to discuss these issues, but we hope that expanding this project to annotating conversational data and applying the annotated data to machine learning methods will reveal additional insights into the analysis of discontinuous constituency.
In critiquing the UD, we, nonetheless, want to stress the elegance of such an approach. Unlike the phrase structure annotations, UD allows us to account for the inconsistent phrase structures and dislocated tokens so often encountered in the Arapaho language. At the same time, however, we argue that to adequately account for the many linguistic nuances in annotations of a morphologically and syntactically complex language such as Arapaho, it is often necessary to include the semantic and pragmatic levels of analysis.