Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.


Introduction
The Universal Dependencies (UD) project (Nivre et al., 2016, 2020) provides a cross-lingual syntactic dependency annotation scheme for many languages. The most recent release of the UD treebanks (version 2.7) contains 183 treebanks in 104 languages. However, polysynthetic languages, known for words that synthesize multiple morphemes, are still much under-represented in the UD treebanks. To our knowledge, Abaza 1 and Chukchi (Tyers and Mishchenkova, 2020) are the only polysynthetic languages included in UD version 2.7.
In this paper, we describe how we annotated a corpus of St. Lawrence Island Yupik (also known as Central Siberian Yupik), a polysynthetic language spoken in parts of Alaska and Chukotka, Russia, within the framework of the UD guidelines. While UD is a framework for word-level annotations, we argue that morpheme-level annotations are more meaningful for polysynthetic languages. We provide morpheme-level annotations for Yupik in addition to word-level annotations. 2 We believe that subword-level annotations can help better capture morphosyntactic relations for polysynthetic languages and assist further dependency annotations and morphosyntactic research for polysynthetic languages.
Previously, Tyers and Mishchenkova (2020) argued for the need to annotate parts of words with regard to noun incorporation in Chukchi. They proposed annotating a noun incorporated into a verb via morphology as a separate token available in the enhanced dependency structure. While our approach is motivated by a similar need to annotate subword units for another polysynthetic language, our paper focuses on morpheme-level annotations, which may be applied to types of multi-morphemic words beyond noun incorporation.
In what follows, we describe the characteristics of the Yupik language ( §2) and show how we annotated a corpus at the morpheme level as well as the word level ( §3 and §4). Then we present some language-specific decisions we made for morpheme-level annotations and illustrate Yupik constructs captured by the new annotation scheme ( §5 and §6). We also compare the performance of the two annotation schemes in automatic parsing experiments ( §7). Based on our findings, we conclude that morpheme-level annotation is essential and effective for polysynthetic languages and discuss implications of the study for other polysynthetic languages and the UD framework ( §8 and §9).

St. Lawrence Island Yupik
St. Lawrence Island Yupik (ISO 639-3 ess; Yupik hereafter) is a polysynthetic language in the Inuit-Yupik language family, spoken in parts of Alaska and Chukotka, Russia. Like other polysynthetic languages, Yupik is characterized by its rich morphology. Jacobson (2001) provides the most thorough description of the Yupik grammar, with an emphasis on the morphology. Yupik is strictly suffixing with the exception of one prefix. Yupik words typically have the following form:

root (+ derivational morphemes)* + inflectional morpheme(s) (+ enclitic)

That is, a typical Yupik word has a root, followed by zero or more derivational morphemes (thus forming a stem), followed by obligatory inflectional morpheme(s), finally followed by an optional enclitic. Most roots are nominal or verbal, such as mangteghagh- 'house' and negh- 'to eat' respectively. The language also includes a set of non-inflecting particles, such as quunpeng 'always' or unaami 'tomorrow'.

Mangteghaghllangllaghyugtukut.
Mangtegha-ghlla-ngllagh-yug-tu-kut
house-big-to.make-to.want.to-IND.INTR-1PL
'We want to make a big house.' (Jacobson, 2001, p.47)

In (2), the same nominal base takes multiple derivational morphemes, forming the sentence-length word Mangteghaghllangllaghyugtukut. To form this multi-morphemic word, the nominal base mangteghagh- first combines with the noun-elaborating derivational suffix -ghlla- (N→N), yielding an extended nominal base mangteghaghlla- 'big house'. This extended nominal base then combines with the verbalizing derivational suffix -ngllagh- (N→V) to create an extended verbal base mangteghaghllangllagh- 'to make a big house'. Next, this extended verbal base combines with the verb-elaborating suffix -yug- (V→V) to yield the extended verbal stem mangteghaghllangllaghyug- 'to want to make a big house'. Finally, the inflectional suffix -tu- attaches to the extended verbal stem to mark the verb's valency as intransitive and its mood as indicative, while the inflectional suffix -kut marks the person and number of the verb's subject as first person plural; the final result is the fully inflected word mangteghaghllangllaghyugtukut 'we want to make a big house'.
(3) Taghnughhaat aanut mangteghameng.

The UD annotation guidelines are lexicalist (Chomsky, 1970; Bresnan and Mchombo, 1995) in nature, specifying that syntactic dependencies should be annotated at the word level, such that both the head and the child of each dependency relation are words (Nivre et al., 2016).
In (3), we see the Yupik sentence from (1) with dependency relations annotated at the word level, following the UD guidelines. The resulting dependency tree successfully depicts the core syntactic information in the Yupik sentence: the intransitive verb aanut sits at the root of the dependency tree, with a nominal subject and an oblique argument as children. However, when we annotate the single-word Yupik sentence from (2) according to the UD annotation guidelines, the result is a degenerate tree that completely fails to capture any syntactic information about the Yupik sentence.
In order to adequately represent the syntactic relations in (2), it is necessary to discard the lexicalist hypothesis and annotate relations between morphemes rather than between words. When we contrast (4) with (5), we observe that annotating relations at the morpheme level results in a meaningful linguistic analysis for this Yupik sentence. It is clear from these two dependency trees that treating morphemes as the basic unit of syntactic dependency relations is necessary in order to adequately encode the syntax of the Yupik sentence in (2). By doing so, we move from a degenerate tree devoid of syntactic information to a tree that successfully encodes a main verb -yug- ('to want to') with a complement -ngllagh- ('to make'), and an object mangtegha- ('house') with a nominal modifier -ghlla- ('big'); the inflectional suffixes encode the number and person of the subject (1PL, 'we') and the main verb's mood and valency (IND.INTR).
In (6) we observe a more complex Yupik sentence; we see the sentence Taaghtam aghnamun qayunghitesqaa kufi ('The doctor prevented the woman from drinking the coffee') annotated in (7) with dependency relations between words. The resulting dependency tree fails to illustrate the complex verbal structure of the multi-morphemic third word qayunghitesqaa ('he told one not to drink it'); it is only in (8), when we annotate (6) with syntactic relations between morphemes, that we are able to observe that aghnamun ('the woman') is the subject of the embedded verb qayu- ('to drink') while Taaghtam ('the doctor') is the subject of the main verb -sq- ('to tell'). That is, parts of the Yupik word, the main verb -sq- ('to tell') and the embedded verb qayu- ('to drink'), participate in different syntactic relations, which cannot be annotated at the word level. The necessity for this type of sub-word annotation is not unique to Yupik; see Çöltekin (2016) for a discussion of subword syntactic units in Turkish.
If sentences that required morpheme-level dependency relations were rare, it might be reasonable to accept the inclusion of a few degenerate and under-annotated trees such as (4) and (7) in a Yupik dependency treebank. However, Yupik is polysynthetic, and multi-morphemic words involving complex derivation are very common; the same is true of all of the languages in the Inuit-Yupik language family. For the polysynthetic languages in this language family, there are simply too many sentences that require morpheme-level dependency annotations to annotate only dependency relations between words. In particular, essentially all words formed with derivational suffixes require morpheme-level dependency relations in order to satisfactorily encode the syntax of the sentence.
In annotating Yupik sentences with dependency relations, we therefore treat each Yupik morpheme as a token rather than treating each Yupik word as a token. This necessarily requires that Yupik words be analyzed and segmented into morphemes prior to dependency annotation; this task was performed using the existing Yupik finite-state morphological analyzer (Chen et al., 2020). In cases of ambiguity when the analyzer provided multiple possible analyses for a given word, we selected the gold analysis via manual disambiguation.
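As an illustration, this segmentation step can be sketched as follows. This is a minimal sketch, not the actual pipeline: `analyze()` is a toy stand-in for the Chen et al. (2020) finite-state analyzer, and the `choose` callback models the manual disambiguation step applied when the analyzer returns more than one candidate analysis.

```python
# Toy sketch of morphological segmentation with manual disambiguation.
# analyze() stands in for a real finite-state analyzer, which may return
# several candidate segmentations per surface word.

def analyze(word):
    toy = {
        "mangteghaghllangllaghyugtukut": [
            ["mangtegha", "ghlla", "ngllagh", "yug", "tu", "kut"],
        ],
    }
    # Unknown words (e.g. particles) are left unsegmented.
    return toy.get(word, [[word]])

def segment_sentence(words, choose=lambda candidates: candidates[0]):
    """Segment each word into morpheme tokens; `choose` models the
    manual selection of the gold analysis among candidates."""
    tokens = []
    for w in words:
        tokens.extend(choose(analyze(w)))
    return tokens

print(segment_sentence(["mangteghaghllangllaghyugtukut"]))
```

In the real pipeline, the disambiguation choice was made by a human annotator rather than by a default heuristic as in this sketch.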
We chose to represent all Yupik morphemes as independent syntactic tokens, including inflectional morphemes. An alternative approach would be to instead not tokenize inflectional morphemes, but rather annotate inflectional information using feature values. A major benefit of our choice is greater compatibility with the existing Yupik morphological analyzer (Chen et al., 2020), which treats inflectional morphemes as independent tokens in the underlying lexical form.
Because the UD annotation guidelines were not designed for morpheme-level annotation, some minor adaptations were required; we discuss these adaptations in §5 and §6 as we present the POS tags and dependency relations used in our corpus along with sample sentences.

In order to enable the use of morphemes as tokens, we adapted the existing "multiword expressions" annotation mechanism. The UD annotation guidelines recognize that syntactic words do not always align perfectly with orthographic word boundaries; this can occur even in analytic languages such as English, for example, in words involving a clitic or a contraction. For example, in Spanish, the word dámelo ('give it to me') may be broken down into dá me lo ('give me it') for the purpose of UD annotations; the annotation scheme records that the single orthographic token (dámelo) is annotated as multiple syntactic words, and that information can be used to collapse the annotations to the single orthographic token when needed. In our case, we treat each multi-morphemic Yupik word as a UD "multiword expression," with Yupik morphemes serving as the tokens within the "multiword expression."

Recognizing the UD project's lexicalist view of syntax, we provide a script to convert our morpheme-level annotations into word-level annotations. This script deterministically merges each multi-morphemic word into a single word token using Udapi (Popel et al., 2017). Because our morpheme-level annotation does not strictly follow the entirety of the UD guidelines, a small number of sentences had to be manually corrected after the conversion. We plan to release our morpheme-level annotation in UD version 2.8 along with descriptions of the conversion process from the morpheme-level annotations to the word-level annotations.
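The deterministic merge can be sketched as follows. This is an assumed illustration, not the authors' Udapi script: the token tuples, the internal dependency structure of example (2), and the rule for choosing the merged token's head (the one morpheme whose head lies outside the word) are our simplifications.

```python
# Sketch of collapsing morpheme-level tokens of one orthographic word
# into a single word-level token. Each token is (id, form, head, deprel);
# `span` holds the ids of the morphemes that make up the word.

def merge_word(tokens, span):
    inside = sorted(t for t in tokens if t[0] in span)
    outside = [t for t in tokens if t[0] not in span]
    # The one morpheme attached outside the word (here, the root verb)
    # determines the merged word's head and dependency relation.
    ext = [t for t in inside if t[2] not in span]
    head, deprel = ext[0][2], ext[0][3]
    form = "".join(t[1] for t in inside)
    return sorted(outside + [(min(span), form, head, deprel)])

# Single-word sentence (2); internal relations follow the analysis in
# the text (-yug- as main verb, -ngllagh- as its complement, etc.).
morphemes = [
    (1, "mangtegha", 3, "obj"),
    (2, "ghlla", 1, "nmod"),
    (3, "ngllagh", 4, "xcomp"),
    (4, "yug", 0, "root"),
    (5, "tu", 4, "dep:infl"),
    (6, "kut", 4, "dep:infl"),
]
print(merge_word(morphemes, set(range(1, 7))))
```

A full converter would also renumber ids and carry over lemmas and features; this sketch shows only the core merge step.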

Corpus
The annotated corpus comprises exercise sentences from the Yupik reference grammar (Jacobson, 2001), as released in Schwartz et al. (2021). The grammar book, designed to teach Yupik at the college level, provides end-of-chapter exercises with sample Yupik sentences. Morphological segmentation and analyses were performed using the Chen et al. (2020) Yupik morphological analyzer and manually verified when needed.
The number of annotations in the final version of the Yupik treebank is summarized in Table 1. For the morpheme-level annotation, about 63% of the words (773 words) were further analyzed into subword units, with a total of 2,568 segments (i.e., morphemes, particles, and punctuation marks) annotated.

POS Tags
We annotated our Yupik corpus using the tags shown in Table 2. 3 Our morpheme-level annotations make use of ten POS tags; when these annotations are converted into word-level annotations, only eight POS tags are utilized. We tagged nominals and nominal bases as NOUN and verbals and verbal bases as VERB. We tagged derivational suffixes that yield nominal stems (N→N, V→N) as NOUN and those that yield verbal stems (N→V, V→V) as VERB. For example, (9) shows the morpheme-level annotation for the word Qikmilguyugtunga 'I want to have a dog'. In the annotation, the nominal root Qikmi- 'dog' combines with a verbalizing derivational suffix (-lgu- 'to have', N→V) to yield a verbal base (Qikmilgu- 'to have a dog'). Then this extended base combines with the verb-elaborating suffix (-yug- 'to want to', V→V) to yield a complex verbal stem (Qikmilguyug- 'to want to have a dog'), which is followed by inflection. The two verb-yielding derivational suffixes are tagged as VERB.
Uninflected words or particles were given the particle tag (PART). Many Yupik particles are borrowed from Chukchi, a geographically neighboring language, and are mostly adverbial or connective in meaning (de Reuse, 1994, p.14). Examples include ighivgaq 'yesterday' and qayughllak 'because'.
The two additional POS tags available only at the morpheme level were X and CCONJ. The POS tag X is reserved for words that fall outside the POS tags defined within the UD framework. We used the X tag for inflectional suffixes such as -tu- and -nga as in (9). Coordinating conjunctions (CCONJ) were only found at the morpheme level because they are only expressed as an enclitic in the language: =llu 'and' as in (10).

Dependency Relations

Our morpheme annotation scheme makes use of 25 types of dependency relations while our word annotation scheme makes use of 14 dependency relations. In general, we followed the UD annotation guidelines, except in cases where the polysynthetic nature of Yupik made divergence from the guidelines necessary. The full documentation of POS tags, morphological features, and dependency relations used in the treebank is available at the language's UD documentation page. 4

The most notable difference between the two annotation schemes is the dep relation. Within the UD framework, the dep relation is reserved for unspecified relations. Because morpheme-level annotations require multiple dependency relations specified for subword units, we created a few dependency relations under the dep relation for the morpheme-level annotation only. Note that some relations that are commonly annotated at the word level for other languages (e.g. auxiliary, copula) are only available at the morpheme level in Yupik. Where possible, we expanded existing relations, defined at the word level, to morphemes (e.g. nmod for nominal modifier). Whenever that was not possible, we created a version of the corresponding dependency relation in our morpheme annotation scheme.
For example, we used dep:aux for verb-elaborating (V→V) derivational morphemes that modify the base verb's tense and aspect information. For example, the V→V derivational morpheme (manifested as -aq- in this context) adds the present tense and progressive aspect to the base gaagh- 'to cook' in (11). This relation would fit the description of the auxiliary (aux) relation if it were annotated at the word level. We created the new relation dep:aux to describe this dependency relation at the morpheme level because the UD guidelines impose restrictions on aux that prevent applying it to morphemes. First, the aux relation requires a short closed list of possible word forms, while morphemes with the dep:aux relation may take many different forms depending on the context as they undergo morphophonological processes. Second, a word with the aux relation cannot have any children, while the corresponding morphemes often have inflections as their children.
Similarly, we included the dep:mark relation to represent the marker (mark) relation at the morpheme level. In (12) we observe a word that acts as a subordinate clause in a sentence and is roughly translated as 'in order to see them'. The second morpheme of the word, -na-, marks the word as a subordinate clause to the main verb, a mark relation in word-level UD annotation. 5 Again, because of limitations on using this relation at the morpheme level, we created the dep:mark relation for morpheme-level annotations. On a similar note, the dep:cop relation was added to represent the copula (cop) relation at the morpheme level. In (13), the verbalizing (N→V) derivational suffix -ngu- acts as a copula, turning the nominal base into a verbal stem, which combines with the inflection to form a verbal word meaning 'it is a land' in the sentence meaning 'Chaplino is a land'. The dep:infl relation was used for the relation between a stem and its inflectional suffix, as shown in (13). Because all Yupik words other than particles require one or more inflectional morphemes, the dep:infl relation was the most frequently used relation in the morpheme-level annotation.
In general, morpheme-level annotation was needed to capture some of the important morphosyntactic relations present in Yupik words. The aux and cop relations are only available at the morpheme level in Yupik. While a small number of particles act as markers, the mark relation was also primarily attributed to derivational suffixes. When annotating Yupik sentences at the word level, such dependency relations are lost. Only when we annotate at the morpheme level can we recover such constructions, which may be invaluable in subsequent linguistic inquiries and computational applications alike.

Parsing experiments
In order to investigate the practical usage of the annotations, we conducted automatic parsing experiments using UDPipe 1.2 (Straka and Straková, 2017) and UDPipe 2.0 (Straka, 2018). The UDPipe project 6 provides a trainable pipeline for any UD treebank in the CoNLL-U format.

Data
We made use of two sets of data: the Jacobson corpus and a separate test corpus annotated using the same word-level and morpheme-level annotation schemes. A text extracted from Nagai (2001) was annotated to provide an out-of-domain test set. The Nagai corpus was smaller than the entire Jacobson corpus, with 360 word tokens, or 834 tokens when including morphemes. The Nagai corpus is quite distinct from the Jacobson corpus. The former is a collection of an elder Yupik speaker's speech while the latter is a college-level grammar book. Therefore, the former has more disfluencies, repetitions, and some code-switching with English words, while the latter contains sample sentences in the literary language without any foreign words. 7

[Table 3: Parsing results from UDPipe 2.0 (Straka, 2018) for the word-level and morpheme-level annotation schemes, on the Jacobson (2001) and Nagai (2001) corpora, under three conditions: Word-level (Automatic segmentation), Morph-level (Automatic segmentation), and Morph-level (Gold segmentation). A test set was either 1) automatically segmented or 2) manually verified to have gold segmentation. The annotations on Jacobson (2001) were trained and tested using ten-fold cross validation. A sample text from Nagai (2001) was annotated to provide an out-of-domain test set. The columns show F1 scores: Words = word tokenization; Segments = splitting words into morphemes when applicable; Lemmas = lemmatization; UPOS = universal part-of-speech tags; Feats = morphological features; UAS = unlabelled attachment score (dependency heads); LAS = labelled attachment score (dependency heads and relations).]

Tokenization
At annotation time, the process of tokenizing sentences into syntactic tokens is performed manually as part of the annotation process. When annotating relations between morphemes, each morpheme serves as a token. When annotating relations between words, each word (delimited by whitespace or punctuation) serves as a token. At test time, it is also necessary to tokenize each sentence. In our experiments, we consider three mechanisms for doing so.
In the first experimental condition, we follow standard dependency parsing practice and rely on the dependency parser to tokenize each sentence into word tokens. To do so, we used a UDPipe 1.2 (Straka and Straková, 2017) model to automatically tokenize each test sentence into word tokens. In Table 3, we refer to this tokenization method as Word-level (Automatic segmentation).
In the second experimental condition, we used a UDPipe 1.2 (Straka and Straková, 2017) model to automatically tokenize each test sentence into morpheme tokens. In Table 3, we refer to this tokenization method as Morpheme-level (Automatic segmentation).
In the third experimental condition, we assume that tokenization of words into morphemes is handled as a separate pre-process (for example, by a finite-state morphological analyzer). In this condition, we provide a test file in which words have already been correctly segmented into morpheme tokens. In Table 3, we refer to this tokenization method as Morpheme-level (Gold segmentation).
We observe the results of tokenization in the first two rows of Table 3. The first row shows that all methods were able to identify word boundaries without error. In the second row of Table 3, we observe that using a dependency parser to segment Yupik words into morphemes is only 72% effective. This is problematic, as this places an upper bound on the potential dependency parsing performance of this condition. By definition, the third condition results in perfect morpheme tokenization.
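The segmentation scores above can be illustrated with a small sketch of segment-level F1, in the spirit of the CoNLL 2018 evaluation (this is our simplified re-implementation, not the official script): segments are compared as character spans within the sentence, and F1 is computed over exact span matches.

```python
# Segment-level F1: a predicted segment counts as correct only if its
# character span exactly matches a gold segment's span.

def spans(segments):
    """Map a list of segment strings to their (start, end) char spans."""
    out, start = set(), 0
    for s in segments:
        out.add((start, start + len(s)))
        start += len(s)
    return out

def segmentation_f1(pred, gold):
    p, g = spans(pred), spans(gold)
    tp = len(p & g)                      # exact span matches
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# An under-segmentation error on one word lowers F1 well below 1.
pred = ["mangtegha", "ghllangllagh", "yug", "tu", "kut"]
gold = ["mangtegha", "ghlla", "ngllagh", "yug", "tu", "kut"]
print(segmentation_f1(pred, gold))
```

Because every downstream dependency arc is anchored to segment spans, any F1 loss here is an upper bound on morpheme-level parsing accuracy, as noted above.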

Methods
We trained separate UDPipe 2.0 (Straka, 2018) parsers for the word-level annotations and the morpheme-level annotations, using the default UDPipe settings. UDPipe 1.2 (Straka and Straková, 2017) models were trained for tokenizing the test sets only, also using the default settings. To test in-domain performance, we trained and tested a parser on the original Jacobson corpus using ten-fold cross validation for each annotation scheme. For out-of-domain performance, we trained a parser on the entire Jacobson corpus and tested it on the Nagai corpus for each annotation scheme. The evaluation was conducted using the official evaluation script from the CoNLL 2018 UD Shared Task (Zeman et al., 2018).
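The ten-fold setup can be sketched as follows. This is a generic illustration of the data splitting only; the UDPipe training and evaluation calls themselves are not shown, and the round-robin fold assignment is our assumption.

```python
# Sketch of ten-fold cross-validation splits over treebank sentences.
# Each sentence appears in exactly one test fold; the other nine folds
# form the training set for that round.

def ten_fold_splits(sentences, k=10):
    folds = [sentences[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Usage: train a parser per round and average the scores.
# for train, test in ten_fold_splits(all_sentences):
#     model = train_udpipe(train)   # hypothetical training call
#     score(model, test)
```

Averaging the per-fold scores yields the in-domain numbers reported in Table 3; the out-of-domain setting instead trains on all Jacobson sentences and tests on Nagai.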

Results
Parsing results (unlabelled and labelled attachment scores) are shown in the final two rows of Table 3. In all cases, we observe that parsing accuracy for the in-domain data from Jacobson is substantially higher than in the out-of-domain data from Nagai.
When we compare the word-level and morpheme-level parsing given automatically segmented test sets (left and middle columns), the word-level parsing outperforms the morpheme-level parsing due to the many segmentation errors present in the latter. Segmentation errors create an effective upper limit for any subsequent parsing efforts at the morpheme level, and all results in the second column are substantially worse than those in the first column.
In contrast, morpheme-level parsing outperforms word-level parsing across the board when correct morpheme tokenization is provided (rightmost column). This shows that morpheme-level parsing (the second column) performed poorly on the automatically segmented test set mostly because of the poor-quality morpheme segmentation. We observe that the morpheme-level dependency parser (the third column) outperforms the word-level parser (the first column) across the board, even on the more challenging out-of-domain test set.
The task of analyzing and segmenting a word into its underlying component morphemes is a well-studied task for which robust finite-state solutions are well known. For polysynthetic languages especially, the development of such a finite-state morphological analyzer is nearly always the very first element of language technology developed. It is therefore realistic to assume that tokenization of words into morphemes can be effectively handled in a pre-processing step prior to dependency parsing.

Discussion
The Universal Dependencies project is intended as a de-facto standard for consistent dependency syntax annotations across all of the world's languages (Nivre et al., 2016, 2020). Our attempt to construct a UD corpus of Yupik can be viewed as a kind of stress test for the UD annotation project. If the UD guidelines truly are universal in nature, then it should be possible to construct dependency trees for Yupik while fully following the UD guidelines; to the extent that this is not possible, any such disconnect may serve to illuminate ways in which the UD guidelines might be improved upon in order to be more language-universal.

One of the core assumptions of the UD guidelines is lexicalism, the assumption that the fundamental token of syntax should be the word. This assumption has been widely adopted in many syntactic formalisms, including the Lexical-Functional Grammar theory of syntax that UD in part draws upon. It has, however, been widely debated (for a thorough recent critique of lexicalism, see Bruening, 2018), and other theories such as Distributed Morphology (Halle and Marantz, 1993) explicitly reject the lexicalist hypothesis, asserting that large parts of morphology and syntax operate using a common hierarchical mechanism.
The UD guidelines already explicitly recognize that phonological and orthographic boundaries do not always coincide with syntactic words. Nivre et al. (2016) recognize that clitics act as words from the viewpoint of syntax, even though phonologically (and orthographically) they must attach to a host word; as such, in UD annotations clitics are treated as independent syntactic tokens. Similarly, the UD annotation guidelines recognize that contractions should be treated as the combination of two independent syntactic tokens. Finally, the UD guidelines recognize that some larger units, such as the English expression in spite of, act syntactically as a single token.
However, the existing UD guidelines indicate that derivational morphemes should not be treated as syntactic words for the purposes of dependency annotation. For example, in an English dependency tree, the word dancer would be treated as a single syntactic token, rather than as two (verbal root dance- + nominalizing suffix -er). In this paper, we have observed that this approach to derivational morphology fails when applied to Yupik.
The languages in the Inuit-Yupik language family are polysynthetic and rely heavily on productive derivational morphology. St. Lawrence Island Yupik has around 400 derivational suffixes, around half of which are verb-elaborating (V → V) derivational suffixes. It is essentially impossible to adequately annotate the syntax of Yupik sentences without recognizing that significant parts of Yupik grammar are handled by Yupik derivational morphology.
In this paper, we have chosen to treat every Yupik morpheme (both derivational and inflectional) as a syntactic token. In future work, it may be beneficial to build upon work by Çöltekin (2016) and treat only some derivational morphemes as syntactic tokens, while not tokenizing other derivational morphemes and perhaps all inflectional morphemes. At a minimum, this work shows that in order to be universal, the UD project must acknowledge that at least some derivational morphemes must be treated as syntactic tokens.

Conclusion
This paper presents the first UD treebank for St. Lawrence Island Yupik, which is, to our knowledge, also the first UD treebank annotated at the morpheme level as well as the word level. The polysynthetic language has rich morphology, characterized by a theoretically unlimited number of possible derivations and multi-morphemic words. In order to capture the morphosyntactic relations among morphemes, we annotated a corpus (Jacobson, 2001) at the morpheme level and converted the morpheme-level annotations into word-level annotations. While the morpheme-level annotation may require more linguistic resources (e.g. a morphological analyzer, morphological segmentation), it provides deeper insight into the language and better automatic parsing performance. Morpheme-level syntactic dependency annotation may be a better way to represent polysynthetic languages within the framework of UD.

A Overview of dependency relations used in the Jacobson treebank

Table 4 summarizes the dependency relations used in the word-level and morpheme-level annotations for the Jacobson corpus. In this section, we provide additional descriptions of the dependency relations that we added for Yupik but did not introduce in the main text due to limited space. We added a sub-relation (obl:mod) to the existing obl relation to specify a special usage of a noun in ablative-modalis case. The existing obl relation is used for an oblique nominal or a non-core argument of the corresponding verb. For example, a noun in ablative-modalis case is annotated as an oblique nominal (obl) when used to express motion away from somewhere, as in mangteghameng (house-ABL_MOD.SG, 'from the house') in (14).

(14)
Taghnughhaat aanut mangteghameng
children they.went.out from.house
(nsubj: Taghnughhaat; root: aanut; obl: mangteghameng)

In contrast, a noun in ablative-modalis case can also be used as the "indefinite object" of an intransitive verb (Jacobson, 2001, p.20). For example, pagunghaghmeng (crowberry-ABL_MOD.SG) in (15) is understood as the object of an intransitive verb, with an indefinite reading of the noun (e.g. "crowberries" instead of "the crowberries"). Because an indefinite object in ablative-modalis case is not encoded in the verb, we annotated such nouns as oblique nominals, but distinguished them from other oblique nominals with the obl:mod sub-relation. We also added the nmod:arg sub-relation to the existing nmod (nominal modifier) relation to specify when a nominal base is used as the argument of a noun-elaborating (N→N) derivational suffix. In (17), the nominal base (aqavzi- 'cloudberry') modifies the derivational suffix as its argument (aqavzileg- 'the one with cloudberry'). The extended base then combines with the inflection to yield the noun in ablative-modalis case (aqavzilegmeng 'from the one with cloudberry').
(17)
aqavzi- -leg- -meng
cloudberry one.with.N ABL_MOD.SG
(nmod:arg; dep:infl)

The dep:pos relation was used for the relation between a postural root and its postbase. A postural root takes a postbase to yield a verbal stem, as in (18). The postural root (ingagh- 'lying down') combines with the postbase (-nga-) to yield a stative form of the root (ingaghnga 'to be lying down'), which combines with the inflection to form the word (ingaghngaghpek 'you are lying down'). A postural root differs from nominal or verbal bases in that it can only take one of two postbases, which turn the root into a stative or active form to be followed by inflection. Similarly, the dep:emo relation was used for emotional roots. Emotional roots can take one of a select number of postbases to yield nominal or verbal stems. In (19), the emotional root (qugina- 'spooked') takes the postbase (-k-) to yield a verbal stem (quginak 'to be spooked'), which combines with the inflection to form a verbal (quginakanka 'I am spooked by them'). The dep:ana relation is used for the only prefix in Yupik, the anaphoric prefix. In general, the prefix is used for anaphora, emphasis, or specificity. The prefix is also used in demonstratives to provide reference to the person spoken to or the situation spoken about (Jacobson, 2001, p.109).

B Overview of the Nagai treebank
This section provides additional information about the Nagai annotations, used for the parsing experiments in §7. Table 5 summarizes the number of annotations for the new corpus. As introduced in the main text, this corpus was smaller than the Jacobson corpus, but was bigger than a single test set in the ten-fold cross-validation setting. In general, the new corpus provides a more realistic and challenging test set for an automatic parser. The Nagai corpus records a Yupik elder's speech and contains some code-switching with English words. For example, the Nagai corpus included the English phrase 'electric beater' inflected in Yupik: electric beater-meng. For this, we used an additional feature, 'Foreign=Yes', in annotating the corpus.
Because of such foreign words, the distribution of POS tags was slightly different from that of the Jacobson treebank. Table 6 summarizes the POS tags used to annotate the Nagai corpus and shows the presence of some tags used only for English words: for example, the Nagai annotations included an adposition (ADP), which was the English word 'on'.
Because the new corpus was smaller than the original treebank, there were some POS tags in the original Jacobson corpus that were missing in the new corpus. No DET or CCONJ tags were used in the new corpus. Similarly, some dependency relations that were present in the Jacobson corpus were not present in the new corpus: cc, dep:emo, and det.