Universal Dependencies for Mbyá Guaraní

This paper presents the first treebank of Mbyá Guaraní, a Tupí-Guaraní language spoken in Argentina, Brazil and Paraguay. The Mbyá treebank is part of Universal Dependencies, a project that aims to create a set of guidelines for the consistent grammatical annotation of typologically different languages. We describe the composition of the treebank, and non-trivial choices that were made in the adaptation of Universal Dependencies guidelines to the annotation of Mbyá.


Introduction
Universal Dependencies (UD) is a cross-linguistic treebank annotation project, which aims to provide guidelines that are consistently applicable to typologically different languages (McDonald et al., 2013). Annotation guidelines are meant to be suitable for computer parsing, while enabling rigorous typological research and linguistic analysis of individual languages. They should also be easily understood by non-linguists. At the time of writing this paper, UD version 2.4 consists of 146 treebanks in 83 languages (Nivre et al., 2019).
This paper discusses the creation of a UD treebank for Mbyá Guaraní, a Tupí-Guaraní language (Tupian) spoken in Argentina, Brazil and Paraguay. Work on indigenous American languages in Universal Dependencies is still scarce. Previous research on the suitability of Universal Dependencies for the analysis of indigenous American languages includes work on Arapaho, an Algonquian language (Wagner et al., 2016), and Shipibo-Konibo, a Panoan language (Vasquez et al., 2018). Outside of UD, Mikkelsen et al. (2014) discuss the development of a dependency treebank of Karuk, an isolate within the Hokan group. Our goals in this paper are to motivate the choices that were made in adapting UD guidelines for the annotation of Mbyá, and to reflect on difficulties that were encountered in this process. In doing so, we hope to contribute to the ongoing debate on the typological foundations of the UD project.
The treebank consists of two parts, each of which has been included in Universal Dependencies v2.4 (Thomas, 2019a,b). The paper refers to the latest development version of the UD Mbyá Treebank at the time of writing. We assume familiarity with UD v2 guidelines, as described in UD Guidelines (n.d.).

General Information and Treebank Composition
Mbyá is a Guaraní language spoken by approximately 30,000 speakers in Argentina, Brazil and Paraguay (Ladeira, 2018). It belongs to the southern branch (group 1) of the Tupí-Guaraní family, together with Nhandeva, Kaiowá and Paraguayan Guaraní, among other languages (Rodrigues, 1986). The main references on the grammar of Mbyá are Robert Dooley's grammatical sketch and lexicon of the language (Dooley, 2015), and Martins's (2003) doctoral dissertation.
The UD Mbyá treebank consists of two corpora. The larger one is composed of narratives collected by Robert Dooley, written by two Mbyá Guaraní speakers, Nelson Florentino and Darci Pires de Lima, between 1976 and 1990 in Brazil. It contains 11,771 tokens (1,046 sentences). Interlinearized versions of these narratives are archived in the Archive of the Indigenous Languages of Latin America (Dooley, n.d.), and were used with Robert Dooley's authorization. The second corpus is composed of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu community, Caazapá Department, Paraguay, which were recorded by the author. It consists of 1,318 tokens (98 sentences).
There is no standard orthography of Mbyá. Dooley's corpus uses the orthography presented in Dooley (2015), which is popular among Mbyá communities in the south of Brazil. The texts collected in Paraguay use an adaptation of this orthography to the Spanish-based spelling conventions adopted in Mbyá communities in Argentina and Paraguay.
The texts were manually interlinearized in SIL Fieldworks Language Explorer (FLEx; Black and Simons, 2008). Robert Dooley's interlinearization of Florentino and Pires de Lima's narratives was imported into FLEx and revised to fit our annotation guidelines. Language-specific parts of speech were added manually at this stage of annotation. Interlinearized narratives were exported from FLEx in the XML FLEx-Text format and converted to the CoNLL-U format using a Python script. Morphosyntactic features were automatically created in the conversion stage. Universal POS tags were also automatically converted from language-specific tags at this stage, and were later corrected manually. Dependency annotation was semi-automatic. A first set of 500 sentences was annotated manually in Arborator (Gerdes, 2013), and was used to train a parser in UDPipe (Straka et al., 2016). This parser was used to annotate the rest of the corpus, which was manually corrected in Arborator. The annotation team consisted of the author and research assistants at the University of Toronto.
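The conversion step can be pictured with a minimal sketch. It is not the script used for the treebank: the (form, tag) input format, the function name to_conllu and the two-entry mapping table are assumptions made for illustration, and only the tags vi:a and vi:i come from the tagset introduced in the section on lexical categories. The sketch writes the ten CoNLL-U columns, fills UPOS from a default mapping, and leaves the syntactic columns empty for later annotation.

```python
# Hypothetical default mapping from language-specific tags (XPOS) to UPOS.
# Only the two verb classes named in the paper are listed; a real table
# would be larger, and its output would still be corrected by hand.
DEFAULT_UPOS = {"vi:a": "VERB", "vi:i": "VERB"}


def to_conllu(sent_id, text, tokens):
    """Render one sentence as CoNLL-U, leaving syntax columns unannotated.

    `tokens` is a list of (form, xpos) pairs, an assumed intermediate
    format standing in for whatever the FLEx export actually provides.
    """
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for index, (form, xpos) in enumerate(tokens, start=1):
        upos = DEFAULT_UPOS.get(xpos, "X")  # X is the UD tag for "other"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(index), form, "_", upos, xpos, "_", "_", "_", "_", "_"]
        lines.append("\t".join(columns))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"
```

Calling to_conllu("demo-1", "Oro-vy'a porã", [("Oro-vy'a", "vi:a"), ("porã", "vi:i")]) tags both words as VERB by default. A modifier use of porã would then be changed by hand, in line with the manual correction step described above.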

Annotation Guidelines
Our annotation guidelines are based on version 2 of Universal Dependencies (UD Guidelines, n.d.). In this section, we describe the adaptation of UD guidelines to Mbyá, focusing on phenomena that are specific to the annotation of Mbyá or that raise interesting issues for the UD annotation scheme.

Lexical Categories
Universal Parts of Speech (POS) in UD include 6 POS for open class words: Adjectives (ADJ), Adverbs (ADV), Interjections (INTJ), Nouns (NOUN), Proper Nouns (PROPN) and Verbs (VERB). Here, we will focus on the distinction between ADJ, ADV, NOUN and VERB. While most scholars of Guaraní languages recognize the existence of a noun/verb distinction, there is less agreement on the existence of a distinction between adjectives and adverbs in these languages (see Dietrich (2017) for a recent discussion). Consequently, we have not included these categories in the language-specific tagset for Mbyá. The following subsections describe the mapping from these language-specific categories to the universal POS of UD.
Verbs. The language-specific tagset of Mbyá includes subcategories of verbs that reflect their valency and agreement class. In order to understand this categorization, it is necessary to give some background on agreement in the language. Subjects and objects are cross-referenced on verbs by prefixes that encode person and number. There are two sets of cross-reference prefixes, which we refer to as 'set A' and 'set B' prefixes, following Tonhauser (2017). These two sets distinguish two classes of intransitive verbs. Set A prefixes are used to index the subject of active (dynamic) verbs, while set B prefixes are used with inactive (stative) verbs, as illustrated by examples (1) and (2). Accordingly, we distinguish active (vi:a) from inactive (vi:i) intransitive verbs in our language-specific tagset:
(1) A-vaẽ.
    [...]
(2) [...]
    B1.SG-tired
    VERB
    vi:i
    'I am tired.'
Words that were categorized as verbs in the language-specific tagset were also tagged as VERB when they are used as predicates. In addition, inactive verbs (vi:i) are attested as modifiers of nouns and of non-nominal heads, in which case they were tagged as ADJ or ADV, respectively:
(3) [...]  porã.
    [...]  good
    [...]  ADJ
    [...]  vi:i
    'He planted the corn in a good line.' (Dooley, 2015)
(4) Oro-vy'a          porã.
    A1.PL.EXCL-happy  good
    VERB              ADV
    vi:a              vi:i
    'We were very happy.' (Dooley, 2015)
Alternatively, verbs tagged as ADJ or ADV could have been tagged as VERB, and analyzed as reduced relative or adverbial clauses. However, when used as modifiers, these verbs are typically uninflected (i.e. they do not bear cross-reference prefixes), unlike verbs in fully fledged clausal modifiers. Consequently, we believe that verb roots used as modifiers do not head a clause, which means that they should be tagged as ADJ or ADV in UD v2.4 guidelines.
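The tagging rule for verb roots can be summarized in a short sketch. The function name and the three function labels ("predicate", "noun_modifier", "other_modifier") are assumptions introduced here for illustration. UPOS tags in the treebank were converted automatically and then corrected by hand, so this restates the guideline rather than the procedure that was used.

```python
def upos_for_verb_root(xpos, function):
    """Return the UPOS for a word whose language-specific tag is a verb class.

    `function` is an assumed label for how the word is used in the sentence:
    "predicate", "noun_modifier", or "other_modifier".
    """
    if function == "predicate":
        return "VERB"
    # Only inactive verbs (vi:i) are described as occurring as modifiers.
    if xpos == "vi:i" and function == "noun_modifier":
        return "ADJ"
    if xpos == "vi:i" and function == "other_modifier":
        return "ADV"
    raise ValueError(f"no rule for {xpos!r} used as {function!r}")
```

Under these assumptions, porã in example (4) would be passed as ("vi:i", "other_modifier") and receive ADV, while the same root used as a predicate would receive VERB.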
Nouns. Drawing a distinction between nouns and inactive verbs is not trivial in Mbyá, since the B set of cross-reference markers of inactive verbs is also used as possessive markers on nouns, and nouns are productively used as predicates without a copula (for a discussion of this issue in Tupí-Guaraní languages, see Queixalós (2001)). Following Dooley (2015), we categorize as nouns those words that can be used as arguments without additional marking (such as nominalizing morphology). In order to preserve the distinction between nominal and verbal predications, these words were tagged as nouns both in their argument uses and in their predicative uses.
Adjectives and Adverbs. Some roots can be used as modifiers but not as predicates. Because there is no evidence of a lexical distinction between adjectives and adverbs in the language, they were tagged as modifiers in the language-specific tagset, and as ADJ or ADV in the universal tagset. This is illustrated by the use of guaxu ('big', 'a lot') in the following examples:
(5) Ja-j-apo
Many of these particles can be used as dependents of nouns as well as of verbs. This raises an issue for the UD v2 annotation scheme, since the only functional dependencies of nominal heads admitted by the guidelines are determiners (det), classifiers (clf) and case (case). The solution we have adopted is to default to amod for particles used as modifiers of nouns, as illustrated in example (7), where the mirative particles ri ty and ra'e modify the noun kavaju ('horse'):
(7) 'When he looked, a horse had arrived.' (Dooley, n.d.)
This solution, however, is unsatisfying, since particles are not adjectives, but belong to a closed class of functional items. A more satisfying solution would be to introduce a dependency relation for modifiers that is neutral with respect to the syntactic category of the dependent, as advocated by Croft et al. (2017).
Particles that modify non-nominal heads were annotated with the advmod relation, as illustrated by the hearsay evidential je in example (7). Alternatively, particles expressing tense, aspect, mood and evidentiality (TAME) might have been tagged as AUX when they modify a verb, in which case the relation aux would have been used. However, since TAME particles have no morphological verbal features, and are so flexible in their distribution, we are reluctant to tag them as auxiliaries. That said, the semantic subcategorization of particles in the language-specific tagset should make it trivial to map TAME particles to auxiliaries when they are used as modifiers of verbs.
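The choice of relation for particles, together with the alternative treatment of TAME particles, can be sketched as follows. The function name, its parameters and the set of UPOS tags counted as nominal are assumptions made for illustration. The aux branch is switched off by default because it corresponds to the analysis that was not adopted in the treebank.

```python
# Assumed set of UPOS tags treated as nominal heads in this sketch.
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def particle_deprel(head_upos, is_tame=False, tame_as_aux=False):
    """Return the dependency relation for a particle, given its head's UPOS.

    `is_tame` marks particles expressing tense, aspect, mood or
    evidentiality; `tame_as_aux` enables the alternative AUX analysis.
    """
    if head_upos in NOMINAL_UPOS:
        return "amod"  # default adopted for particles modifying nouns
    if tame_as_aux and is_tame and head_upos == "VERB":
        return "aux"  # alternative analysis, not the one adopted
    return "advmod"  # particles modifying non-nominal heads
```

With the default settings, a particle attached to a noun such as kavaju receives amod, and a hearsay evidential attached to a verb receives advmod. Setting tame_as_aux=True shows how the same information could be remapped to aux.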

Complex Predicates
There are at least two types of complex predicates in Mbyá. The first one is formed by combining the main verb with a bare uninflected root, which we glossed vpos in the language-specific POS tagset. Postposed roots are used in the expression of agent-oriented modality (e.g. pota, 'try to'), or sensory evidentiality, like nhendu ('audibly') in the following example:
(8) 'As she was saying this, she heard something coming on the road.' (Dooley, n.d.)
A second type of complex predicate is formed by using one of a limited number of verbs in the so-called 'gerund' form common in Tupí-Guaraní languages (Rodrigues, 1953; Jensen, 1989), which we glossed vs in the language-specific tagset. The dependent verb can be interpreted literally as expressing the position in which the event described by the main verb is realized, but it can also have an aspectual value. This construction was described by Dooley (1991):
(9) 'He was walking down the road.' (Dooley, n.d.)
Both types of complex predicates were annotated with the relation compound:svc in the treebank.

Strategies of Subordination
The UD v2 annotation scheme recognizes three major types of subordinate clauses: core clausal arguments, adverbial clauses and relative clauses. We describe each of these in turn in this section.

Adverbial Clauses
Adverbial clauses are introduced by a variety of subordinating conjunctions (SCONJ). Among these, it is useful to distinguish plain subordinating conjunctions from switch reference markers. The former, such as jave ('when'), only express temporal or causal relations between clauses. The switch reference markers vy and ramo/rã also function as subordinating conjunctions, but do not encode a specific temporal or causal relation. Instead, they indicate whether the subject of the subordinated clause is the same as that of the superordinate clause. In example (10), the same subject marker vy indicates that the subject of the eating is the same as the subject of the sitting:
(10) 'After he had eaten, he was sitting on a bench.' (Dooley, n.d.)
Our annotation guidelines treat switch reference markers as subordinating conjunctions, which are introduced by the relation mark.

Nominalized Complement Clauses
There are no subject clauses in the treebank. Complement clauses are attested and are formed by nominalizing the dependent verb with the morpheme a. While Dooley (2015) analyzes this morpheme as a suffix, we might argue that it is a clitic, since its host is not necessarily the verb, but can be one of its adverbial modifiers. This makes it unsatisfying to analyze the nominalizer as a suffix and to represent its function as a verbal feature, since in some cases this feature would have to appear on a word other than that to which the nominalizer is affixed. Consequently, we decided to represent this nominalizer as a token in the dependency annotation, where it is tagged as SCONJ and related to its head by a mark dependency, as illustrated in example (11). This decision is consistent with some writing conventions for Mbyá, such as those adopted by Cadogan (1959).
Nominalized clauses have several morphosyntactic features that are characteristic of noun phrases. They are compatible with temporal suffixes, which in Guaraní languages are nominal markers, and they may be used as complements of postpositions. Nevertheless, they preserve their full clausal structure. In particular, the verbs that head clausal nominalizations project their regular argument structure, bear cross-reference markers, may be modified by adverbs and may be part of complex predicates.
The mixed categorial status of clausal nominalizations in Mbyá raises the question of which dependency relation should be used to relate them to their head. In particular, should nominalized complements of verbs be introduced by the ccomp relation, or by the obj relation? Using the ccomp relation allows us to capture the fact that these complements are propositional. In this case, we take ccomp to indicate the semantic status of its dependent, a proposition rather than an individual. In addition, the use of ccomp indicates the fact that the complement has a full clausal structure. On the other hand, using obj is consistent with the nominal category of the complement, in particular its compatibility with adpositions, which UD represents as case marking of nominal dependents. In devising annotation guidelines for the Mbyá treebank, we have decided to use the ccomp relation with nominalized clauses that denote propositions.
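The decision can be stated as a small lookup. This is only a sketch of the guideline, not tooling used for the treebank: the function name and the labels "proposition" and "individual" are assumptions, and the judgment about what a clause denotes still has to be made by the annotator. The obj entry reflects the treatment of nominalizations that denote individuals, which this paper applies to free relative clauses.

```python
def nominalization_deprel(denotation):
    """Pick the relation linking a nominalized clause to its verbal head.

    `denotation` is an assumed annotator-supplied label, either
    "proposition" or "individual".
    """
    relations = {"proposition": "ccomp", "individual": "obj"}
    try:
        return relations[denotation]
    except KeyError:
        raise ValueError(f"unknown denotation: {denotation!r}") from None
```

Keying the choice on denotation rather than on form keeps the two uses of clausal nominalizations distinct in the annotation, even though both have the same mixed categorial status.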

Relative Clauses
Relative clauses are formed with the enclitic va'e, as illustrated in example (12):
(12) 'I got scared, and I left a bag that I had brought.' (Dooley, n.d.)
Like the clitic a, va'e is a nominalizer, and relative clauses have a mixed syntactic category. This creates an issue for the annotation of free relative clauses used as arguments of verbs, like ou nhendu va'ekue (lit. 'that he had heard coming') in the following example:
(13) ... ovaexĩ ma ou nhendu va'ekue.
     'He met the person that he had heard coming.' (Dooley, n.d.)
In example (13), the complement of ovaexĩ ('meet') is syntactically clausal, yet it denotes an individual. The first property motivates the use of a ccomp relation, while the second supports the use of an obj relation. We have opted for the latter, in order to capture the contrast between clausal nominalizations that denote individuals, introduced by obj, and those that denote propositions, introduced by ccomp.

Discourse Connectives
Most sentences in narratives start with a sentence-initial discourse connective. These connectives are composed of the pronoun ha'e, which is generally followed by an adposition or a subordinating conjunction, as illustrated in examples (14) and (15). Following Dooley (2015), we assume that occurrences of ha'e in discourse connectives denote propositions or situations that are described or made salient by a preceding discourse unit, much like the demonstrative this in the English connective contrary to this.
Since their head is pronominal, we analyze sentence-initial discourse connectives as oblique modifiers of the root:
(14) 'Finally, he arrived at a place.' (Dooley, n.d.)
Note that these propositional pronouns can be modified by an adposition and a subordinating expression simultaneously, as illustrated in (16):
(16) 'Because of this, I climbed down the tree again.' (Dooley, n.d.)
Sentence-initial discourse connectives are another manifestation of the blurring of the distinction between clausal and nominal categories in Mbyá. Their heads are pronominal. As such, they are compatible with plural marking by the particle kuery, and they can be introduced by postpositions, which are characteristic features of nouns. On the other hand, their heads denote propositions, and are compatible with subordinating conjunctions like the same subject marker vy in example (15), which normally introduces adverbial clauses. In this example, vy indicates that the subject of the verb oo is the same as that of the previous sentence, which provides an antecedent to the pronoun ha'e.
We have decided to annotate sentence-initial discourse connectives as obliques (obl) rather than adverbial clauses (advcl), thereby giving more weight to the form of their heads (pronominal) than to their interpretation (propositional).

Conclusion
We presented the Mbyá treebank, a syntactically annotated corpus of Mbyá in Universal Dependencies. We discussed the adaptation of UD guidelines to the annotation of Mbyá, highlighting questions raised by mixed categories (nominalizations) and the use of functional particles as adnominal modifiers.