Apurinã Universal Dependencies Treebank

This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences and applies 14 parts-of-speech as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop a finite-state description of the language and to facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.


Introduction
Apurinã (ISO code apu) is an endangered language spoken in the Amazon Basin. The language has around 2,000 native speakers and is classified as definitely endangered by UNESCO (Moseley, 2010). This paper describes the first Universal Dependencies (UD) treebank for Apurinã (available at https://github.com/UniversalDependencies/UD_Apurina-UFPA). We describe how the treebank was created and which decisions were made in different parts of the process.
The UD project (Zeman et al., 2020) has the goal of collecting syntactically annotated corpora containing information about lemmas, parts-of-speech, morphology and dependencies in such a fashion that the annotation conventions are shared across languages, although there may be inconsistencies between languages (see Rueter and Partanen 2019). As the number of South American languages represented in the Universal Dependencies project has grown rapidly in recent years (see e.g. Vasquez et al., 2018; Thomas, 2019), the descriptions of individual treebanks are also a very valuable resource that helps to maintain consistency in the treebanks of this complex linguistic region.
The advantage of UD treebanks is that they can be used directly in many neural NLP applications such as parsers (Qi et al., 2020) and part-of-speech taggers (Kim et al., 2017). Although endangered languages have a very different starting point in comparison with large languages (Hämäläinen, 2021), there has been recent work (Lim et al., 2018; Ens et al., 2019; Hämäläinen and Wiechetek, 2020; Alnajjar, 2021) showcasing good results on a variety of tasks even for the few endangered languages that have a UD treebank.
The fact that UD treebanks can be used with neural models to build higher-level NLP tools is one of the key motivations for us to build this resource for Apurinã. In addition to NLP research, UD treebanks have been used in many purely linguistically motivated research papers (Croft et al., 2017; Levshina, 2017, 2019; Sinnemäki and Haakana, 2020). We believe such developments will only grow stronger, and that easily available treebanks in the UD project, covering the world's linguistic diversity ever more fully, will continue to widen their role as suitable and valuable tools for both descriptive linguistic research and computational linguistics. This goal will be achievable only through an open discussion about the conventions and choices made in different treebanks, which can be adjusted and refined at a later stage. This study aims to provide such a description of the Apurinã treebank. An example of a UD-annotated sentence in Apurinã can be seen in Figure 1.

Modelling the Apurinã Language in UD
The Apurinã language has a rich morphology with regular correlation between numerous formatives and semantic categories. One challenge in the conversion from fieldwork/typology style annotation to that used in the UD project is to choose which features should or can be highlighted with specific transferability to other UD projects and which ones should only be represented as language-specific morphology. The task has also been contemplated from a finite-state perspective, where regular inflection plays a decisive role in determining lemma and regular inflection strategies. Finite-state description also entails the use of the open-source GiellaLT infrastructure (Norwegian Arctic University, Tromsø) (Moshagen et al., 2014), which introduces a large number of mutual tag definitions and practices that can be applied to Apurinã with ample analogy from the morphologically challenging Uralic and other languages of the Circum-Polar region.
Figure 1: An example of a UD tree for an Apurinã sentence meaning 'They had it, had meat, manioc, fish, fruit'.
Solutions for dealing with the categories of case, number, person and gender are available in the GiellaLT infrastructure. Extensions, however, have been required for Apurinã in the categories of number, person and gender. Unlike some Indo-European and Uralic languages, the category of gender must also be applied to the subjects and objects of verbs; subject and object marking for number (see Facundes et al. 2021) and person categories could have been adapted directly from description work in the Erzya (Rueter and Tyers, 2018) and Moksha (Rueter, 2018) UD treebanks.

Case
The feature Case, for example, permeates many of the individual language projects, and some attempts are made to align case documentation with principles adopted in the UniMorph project (Kirov et al., 2018). In the instance of Apurinã, parallel case categories have been adapted with names familiar from work with languages of the Uralic language family. This was done principally because the team involved in the annotation was most familiar with this language family. At the same time, the Uralic UD annotations, especially for the minority languages, are already closely adapted to the UD project at large. Whether such generalizations work is also one test for the cross-linguistic suitability of the current annotation model.
The concept of case in Apurinã is most salient in oblique marking. While the subject, object and adposition complements show no special marking, there are at least six oblique markers to deal with (Facundes, 2000, 385-390). The labeling of these cases also underlines a problem not new to UD, namely, that every language research tradition tends to apply its own terms for similar functions. Apurinã, like the Uralic languages, shows evidence of case-like formatives associated not only with nominals but with verbs as well. In the first version of the Apurinã UD treebank, the pairs of formative and case name have been assigned as follows: munhi = Dat (dative, allative, goal), kata = Com (comitative, associative), ã = Loc (locative, instrumental), Ø = Nom (nominative). Subsequent work in the dataset will introduce the additional case formative sawaky = Temp (temporal), and show the extent of shared morphology across parts-of-speech.
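The formative-to-case assignment listed above can be written down as a small lookup table. The following is only a minimal sketch: the function and variable names are hypothetical, the handling of formatives is deliberately simplified, and it is not the conversion procedure actually used for the treebank.

```python
from typing import Optional

# Formative-to-Case pairs as listed for the first treebank version.
# sawaky = Temp is announced for later work and is therefore left out here.
CASE_BY_FORMATIVE = {
    "munhi": "Dat",  # dative, allative, goal
    "kata": "Com",   # comitative, associative
    "ã": "Loc",      # locative, instrumental
}


def case_value(formative: Optional[str]) -> str:
    """Return the Case value for an oblique formative; no formative means Nom."""
    if not formative:
        return "Nom"
    return CASE_BY_FORMATIVE[formative]


def case_feature(formative: Optional[str] = None) -> str:
    """Render the value as a feature=value pair for the FEATS column."""
    return f"Case={case_value(formative)}"


print(case_feature("munhi"))
print(case_feature())
```

An unknown formative raises a KeyError in this sketch, which makes unexpected oblique markers visible instead of silently labelling them.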

Possession
One complexity of Apurinã morphology is encountered in the expression of possession. While the possessor of a noun may be indicated morphologically on the possessum, this is not obligatory. A preceding personal pronoun, for example, also serves as a marker of possession, to which the morphology of the possessum reacts by indicating that it is possessed. Hence, there are four basic categories that can be expressed on the possessum: person, number and gender of the possessor, on the one hand, and indication of whether the entity is a possessum or not, on the other. These categories are expressed as feature and value pairs in the UD project. While matters of gender, number and person are directly attested in the morphology of the possessum, the feature Possessed identifies whether or not the individual noun carries marking indicating that it is possessed. This particular issue is dealt with extensively in Freitas (2017).
Apurinã nouns can be split into four groups on the basis of how their morphology is affected by possession. There are nouns that never take possession or possessive affixes. Such nouns include proper names (Freitas, 2017, 179-180). The remaining nouns, however, take possessive affixes, on the one hand, and additional marking to indicate whether the word is possessed or not, on the other. First, there are nouns, such as kinship terms, that virtually always appear with possessive affixes and no morphology to indicate that they are possessed. These nouns may only be construed as not possessed in some verbal incorporations where the noun is non-specific by nature. A formative -txi is present to indicate the noun is not possessed. Other words in this group, including terms for body parts and individual belongings, take the -txi formative more freely to indicate that the item is not possessed, e.g. kywy 'head (possessed)' vs kywĩtxi 'head (not possessed)' (Freitas, 2017, 163-171; Facundes, 2000, 199-204, 228-236).
Second, there are noun categories that take the formatives -ne, -te and -re1 to indicate the item is possessed, but they, in contrast, have no morphology to indicate that the item is not possessed. Third, there is a group of nouns which actually mark both the possessed with the formative -re2 and the non-possessed with the formative -ry2. This alternation is described in Facundes (2000), and explicitly in Freitas (2017, 112-123) (see Table 1).
The Apurinã treebank solution has been to introduce the Possessed feature with Yes and No values. Nouns that cannot be possessed are simply left without the feature Possessed.
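The three-way outcome described above (Yes, No, or no feature at all) can be illustrated with a short sketch. It assumes that the feature is spelled Possessed in the FEATS column and that FEATS follows the usual CoNLL-U conventions (pairs joined by "|", sorted alphabetically, "_" when empty). The function names and the example feature strings are hypothetical.

```python
from typing import Optional


def possessed_value(possessable: bool, possessed: bool) -> Optional[str]:
    """Return Yes/No for possessable nouns, None for nouns that cannot be possessed."""
    if not possessable:
        return None
    return "Yes" if possessed else "No"


def set_feature(feats: str, name: str, value: Optional[str]) -> str:
    """Add name=value to a CoNLL-U FEATS string, keeping the pairs sorted."""
    if value is None:
        return feats  # e.g. proper names: the feature is simply omitted
    pairs = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
    pairs[name] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


# Illustrative FEATS strings, not taken from the treebank.
print(set_feature("Case=Nom|Number=Sing", "Possessed", possessed_value(True, True)))
print(set_feature("_", "Possessed", possessed_value(True, False)))
print(set_feature("Case=Nom", "Possessed", possessed_value(False, False)))
```

Deciding whether a given noun is possessable, and whether a given form is marked as possessed, still depends on the noun class and formatives discussed above; the sketch only covers the final encoding step.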

Intransitive descriptive verbs
Apurinã verbs can bear morphology indicating subject and object, be that simultaneously or separately. What is interesting, however, is that a specific subclass of intransitive descriptive verbs attest to the use of object marking to indicate congruence with the subject (Facundes, 2000, 278-283). There are, in fact, certain verbs that distinguish object and subject marking strategies for the same intransitive verbs, such that subject marking indicates a short temporal frame, and object marking indicates permanency (cf. Chagas, 2007;Freitas, 2017, 70-71).
The solution here has been to refer to object-looking morphology with subject congruence as subject marking. To cope, an additional feature value set has been introduced to distinguish verbs of the intransitive descriptive (Vid) nature, and this subset is subsequently split on the basis of whether the formative entails object-identical (Vido) or subject-identical (Vids) marking.

Derivations
Fieldwork annotations of certain derivational morphology are minimalistic, and their conversion in the UD treebank calls for more specific representation. Whereas some formatives have been referred to using the same terms, e.g. nominalizer, gerund, we have been obliged to elaborate. Only one feature has been provided for Derivation, Proprietive (ka-). The proprietive construction is one of many annotated as atrib in the fieldwork materials.

Lemmatization
The Apurinã language is spoken in 18 indigenous communities of the Purus basin (Lima Padovani et al., 2019). Grammar descriptions from Facundes (2000) to Freitas (2017) demonstrate a change in orthographic development, on the one hand, and actual variation in forms of the same words in relation to geographic location, on the other. Materials in the treebank alone show some vacillation with regard to stem-initial h and word-internal e vs i. Since the orthographic standard is still in a developmental state, lemma forms have been chosen on the basis of whether they occur in the manuscript dictionary (Lima-Padovani and Facundes, 2016) or not, and with a preference for longer word forms, i.e., h-initial stems are forwarded, since it is easier to drop a letter in the description than to automatically insert one. Thus the form hãty 'one' is given as a lemma instead of its variant ãty (as given in the dictionary), and herãkatxi (given as a variant) is forwarded as a lemma over erãkatxi and erẽkatxi (given in the examples of the alphabet) as well as arẽkatxi. The high vowel i is preferred over the mid vowel e such that tiwitxi 'thing' is given as a lemma for the forms teetxi and tiitxi. Fortunately, work with Apurinã variation is continuing (Lima Padovani et al., 2019), and an updated version of the Apurinã-Portuguese dictionary is forthcoming.
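The stated preference for h-initial and longer forms can be approximated with a simple ranking over attested variants. This is a heuristic sketch only: the function name is hypothetical, dictionary membership and the preference for i over e are not modelled, and the actual lemma choices in the treebank were made by the annotators.

```python
import unicodedata


def choose_lemma(variants):
    """Pick a lemma candidate: h-initial forms first, then the longest form."""
    # Normalize to NFC so that nasal vowels such as ã count as one character.
    normalized = [unicodedata.normalize("NFC", v) for v in variants]
    return max(normalized, key=lambda form: (form.startswith("h"), len(form)))


# Variant sets mentioned in the text.
print(choose_lemma(["ãty", "hãty"]))
print(choose_lemma(["erãkatxi", "erẽkatxi", "arẽkatxi", "herãkatxi"]))
print(choose_lemma(["teetxi", "tiitxi", "tiwitxi"]))
```

For the three variant sets cited above, the ranking selects the h-initial form where one exists and otherwise the longest form, which matches the lemmas reported in the text.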

Treebanks in figures
There were 76 valid and dependency-annotated sentences in the first release. Broken into figures, these sentences contain 574 tokens and 454 words, which can be further broken down into features, parts-of-speech and dependency relations.
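Tallies of this kind can be reproduced from a CoNLL-U file along the following lines. The sketch uses only the Python standard library and relies on the standard ten-column CoNLL-U layout. The sample rows are hypothetical placeholders rather than treebank data, and the function name is an assumption made for illustration.

```python
from collections import Counter


def tally(conllu_text):
    """Count UPOS tags, feature names and dependency relations in CoNLL-U text."""
    upos_counts, feature_counts, deprel_counts = Counter(), Counter(), Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Keep plain word lines only; multiword ranges (1-2) and empty nodes (1.1) are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos_counts[cols[3]] += 1
        deprel_counts[cols[7]] += 1
        if cols[5] != "_":
            for pair in cols[5].split("|"):
                feature_counts[pair.split("=", 1)[0]] += 1
    return upos_counts, feature_counts, deprel_counts


# Hypothetical placeholder rows (w1/l1 etc. are not Apurinã words).
ROWS = [
    ["1", "w1", "l1", "NOUN", "_", "Case=Nom|Gender=Masc", "2", "nsubj", "_", "_"],
    ["2", "w2", "l2", "VERB", "_", "Gender[obj]=Fem", "0", "root", "_", "_"],
    ["3", "w3", "l3", "NOUN", "_", "Case=Loc", "2", "obl", "_", "_"],
]
SAMPLE = "# sent_id = hypothetical-1\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

upos, feats, deprels = tally(SAMPLE)
print(upos.most_common())
print(feats.most_common())
print(deprels.most_common())
```

Layered features such as Gender[obj] are counted under their full name, so they stay separate from plain Gender.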
The most salient features are Case (101), Gender (96) and Number (73), but the newly introduced Gender[obj] (47) is also well attested. The Case feature owes its prominence to the presence of all nouns not marked for oblique cases, i.e. Nom; this leaves a total of 25 obliques (see Table 2).
The most prominent parts-of-speech are the NOUN (170) and VERB (137) classes, followed by PRON (59) and ADV (39), whereas two instances of the same unknown word pekana outnumber the ADJ, CCONJ and PROPN, each at one (see Table 3).
[...] (83), which is made possible through the extensive use of the conj relation. Language-specific deprels have extensions such as: lmod = locative modifier, neg = negation, poss = possession, relcl = relative clause, tcl = temporal clause and tmod = temporal modifier (see Table 4).
Due to the size and orientation of the dataset, some features of the Apurinã language have been neglected. It will also be a challenge to apply recent studies in noun incorporation annotation for UD (Tyers and Mishchenkova, 2020) to what Facundes and Freitas (2015) describe for Apurinã noun and classifier incorporation. Another obvious goal for further work is to make the Apurinã treebank large enough to be split into train, test and dev portions. The goal of expanding the treebank is connected to the availability of resources. Currently the sentences used in the treebank come mainly from the grammatical descriptions. As a language documentation corpus exists, an important consideration is whether the treebank sentences could be more closely connected to audio and video recordings as well, and, of course, to the main corpora in Belém, as multimodal resources are valuable in language documentation.