Toward Universal Dependencies for Shipibo-Konibo

We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.


Introduction and Background
Shipibo-Konibo is a language of the Panoan family spoken by around 35,000 native speakers in the Amazon region of Peru. It shows agglutinative processes, mostly suffixation, along with some clitics (units that are neither words nor affixes). Additionally, its word orders differ from those of Spanish, the dominant language in Peru.
To the best of our knowledge, there are no other Universal Dependencies (UD) treebanks for an indigenous language of South America, as surveyed by Mager et al. (2018). The closest resource is a treebank developed for a Quechuan variant; however, it was not designed under the UD guidelines (Rios et al., 2008). Another related case is the application of UD for the annotation of the native North American language Arapaho (Algonquian) (Wagner et al., 2016). Thus, Shipibo-Konibo would be the first South American indigenous language with this kind of computational resource. (The treebank will be available for the next UD release.)
Natural Language Processing (NLP) efforts for Shipibo-Konibo have developed a POS-tagger, a lemmatizer, a spell-checker, and a machine translation prototype with Spanish as the paired language (Mager et al., 2018). Each functionality has been published alongside its annotated corpus. A UD treebank would enhance the NLP toolkit for the language, as it is the core element for being able to train a dependency parser.
This paper describes the steps and decisions made towards a UD treebank for Shipibo-Konibo. First, §2 presents the annotation process. Then, §3 details the information of the UD treebank itself, such as the POS tags, morphological features and dependency relations, including the specific ones for Shipibo-Konibo. Moreover, it describes relevant decisions regarding clitics and word segmentation, including an analysis of the generated multiword tokens. Finally, we take advantage of the built treebank, and perform a typological comparison against Spanish in §4, as well as dependency parsing tests for monolingual and crosslingual scenarios in §5.

Treebank Annotation
The annotation workflow of the Universal Dependencies (UD) treebank for Shipibo-Konibo is described in §2.1. Specific consideration has been given to word segmentation with respect to clitics, which is detailed in §2.2.

Annotation Workflow
Annotation followed a sequential flow:
1. Annotate the Shipibo-Konibo corpus in ChAnot (Mercado et al., 2018) and BRAT (Stenetorp et al., 2012). The former tool was used specifically for the morpheme segmentation of raw text into prefixes, root morphemes and suffixes in appropriate morphological detail. Its interface with BRAT allows the graphical annotation of syntactic information over the segmentation. We used part-of-speech and relation names determined prior to the decision to conform to UD v2.0.
2. Compile the segmented corpus into UD v2.0 format: gather all annotations from ChAnot and BRAT into a single file in UD v2.0 format; compress the detailed segmentation of prefixes and suffixes so that words are segmented only on clitic boundaries; add clitic features; and convert the non-standard POS and dependency relation names to the standard UD v2.0 notation.
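The compression step in item 2 can be pictured with a minimal Python sketch. It is not the conversion script used for the treebank: the function name, the (form, kind) input layout and the affix split of menike are hypothetical, and clitics are assumed to follow their host.

```python
from typing import List, Tuple

# (form, kind) pairs; kind is one of "prefix", "root", "suffix", "clitic".
# This input layout is an assumption, not the ChAnot export format.
Morpheme = Tuple[str, str]


def collapse_to_syntactic_words(morphemes: List[Morpheme]) -> List[str]:
    """Merge affixes into their host and keep every clitic as its own word."""
    words: List[str] = []
    current = ""
    for form, kind in morphemes:
        if kind == "clitic":
            # A clitic closes the word built so far and stands alone.
            if current:
                words.append(current)
                current = ""
            words.append(form)
        else:
            # Prefixes, roots and suffixes stay inside one syntactic word.
            current += form
    if current:
        words.append(current)
    return words


# Clitic boundaries as given in the paper for "Joninronki".
print(collapse_to_syntactic_words(
    [("joni", "root"), ("n", "clitic"), ("ronki", "clitic")]))
# Hypothetical root + suffix split, used only to show affix merging.
print(collapse_to_syntactic_words([("meni", "root"), ("ke", "suffix")]))
```

Under these assumptions the first call yields three syntactic words and the second a single one, which mirrors the decision to segment only on clitic boundaries.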

Clitics and Segmentation
In terms of its morphological profile, Shipibo-Konibo favors synthetic word formations. That is, in Shipibo-Konibo, words are often composed of a root and one or more bound morphemes. Some of these morphemes may be considered clitics, linguistic elements that do not fit either the prototype of word or that of affix. Similar elements are labelled particles in the Universal Dependencies tradition, but we prefer clitics, following the arguments presented in Zwicky (1977, 1985). In the Panoan literature, these intermediate linguistic units have also been called clitics (Fleck, 2013; Valenzuela, 2003; Zariquiey, 2015), so we consider it appropriate to follow this terminology in the development of our Shipibo-Konibo treebank. As clitics, these linguistic units exhibit some features that resemble those attested in words. This intermediate nature clashes with the dichotomic division between morphology and syntax, in which linguistic units belong to one of these domains (see Dixon and Aikhenvald (2002) and Haspelmath (2011) for discussion).
Taking all this into consideration, we have made the methodological decision of treating clitics as independent syntactic words. Therefore, the relationship between words and clitics is rendered as syntactic and is annotated by means of the appropriate dependency. All clitics in Shipibo-Konibo are phrasal in nature, and treating them as independent words captures this more precisely (although annotation may be more time-consuming). In §2.3 we present some examples.
Furthermore, following the principles for tokenizing a surface word into multiple inflectional groups (IGs) proposed by Çöltekin (2016, p. 2), we segment clitics as independent words because they and their host may participate in different syntactic relations. For instance, in the Shipibo-Konibo sentence ea=ra joke (I came), ea is the pronoun (I) in a dependency of nsubj from the verb joke (came), whereas =ra is an evidential clitic in the dependency of aux:valid.
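To make the segmentation concrete, the following Python sketch serializes ea=ra joke in the ten-column CoNLL-U layout, with a range line for the multiword token. The helper name is hypothetical, the surface form is assumed to be the plain concatenation of the syntactic words, and lemmas and features are left empty. It is an illustration, not an excerpt from the treebank.

```python
from typing import List, Tuple

# (form, UPOS, head id, deprel) for one syntactic word.
Word = Tuple[str, str, int, str]


def to_conllu(tokens: List[List[Word]]) -> str:
    """Serialize orthographic tokens, each a list of syntactic words."""
    lines = []
    next_id = 1
    for words in tokens:
        if len(words) > 1:
            # Multiword token: a range line precedes its syntactic words.
            # Assumption: the surface form is the concatenated word forms.
            surface = "".join(form for form, _, _, _ in words)
            span = f"{next_id}-{next_id + len(words) - 1}"
            lines.append("\t".join([span, surface] + ["_"] * 8))
        for form, upos, head, deprel in words:
            cols = [str(next_id), form, "_", upos, "_", "_",
                    str(head), deprel, "_", "_"]
            lines.append("\t".join(cols))
            next_id += 1
    return "\n".join(lines)


sentence = [
    [("ea", "PRON", 3, "nsubj"), ("ra", "PART", 3, "aux:valid")],
    [("joke", "VERB", 0, "root")],
]
print(to_conllu(sentence))
```

The pronoun and the evidential clitic share one orthographic token but attach to the verb through different relations, which is the motivation given above for splitting them.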
Languages with similar morphological profiles have treebanks in Universal Dependencies, such as Finnish (Pyysalo et al., 2015), Turkish (Sulubacak et al., 2016) or Kazakh (Tyers and Washington, 2015). Nevertheless, those treebanks do not tend to systematically label bound morphemes as independent words, as we aim to do in the development of our treebank because of the reasons mentioned above.

Language Examples
We present two Shipibo-Konibo sentences in anticipation of further discussion.
The sentence Jatianra en ja maxko bake panshin kírika menike (So, I give this little boy a yellow book) in Figure 1 presents a ditransitive verb with direct and indirect objects. The clitic =ra has an evidential function, hence it projects the dependency relation aux:valid to the main verb menike (gave). The clitic =n expresses nominal case and projects to the token's core word. In Shipibo-Konibo, adjectives tend to precede nominal heads, with determiners preceding both adjectives and nominal heads as shown in the phrase ja maxko bake.
The sentence Joninronki yoyo aká iki: "Jen, enra moa onanke" (They say the man said, "Ah, I already knew that") in Figure 2 presents a direct speech construction showing two main verbs, each one with an evidentiality clitic. There are two multiword tokens with three syntactic words each, joni =n =ronki and e =n =ra.

Shipibo-Konibo Treebank
Our current Shipibo-Konibo treebank is the result of the syntactic annotation of 407 sentences extracted from parallel Shipibo-Konibo and Spanish educational materials and storybooks, complemented with elicited sentences produced and translated by the Shipibo-Konibo members of our team. This is a small treebank with work still ongoing (Table 1).

Typological features of Shipibo-Konibo
Shipibo-Konibo presents a basic AOV/SV constituent order (Figures 1 & 2), along with pragmatically conditioned orders. NP-modifiers often precede their head (Figure 1) and verbs do not show either subject or object cross-reference. As this is the first treebank for any South American indigenous language, there could well be novel grammatical features of Shipibo-Konibo not included in any other treebank.

Universal Part of Speech (POS) Tags
Universal Dependencies (UD) introduces a tagset of 17 POS tags, mainly based on the Google universal part-of-speech tags (Petrov et al., 2012). All of them have been employed in the development of the Shipibo-Konibo treebank. The POS tags and their frequencies in the treebank are shown in Table 2.
The POS tag X is used for labelling onomatopoeia, which is a relevant POS in various Panoan languages, including Shipibo-Konibo (Valenzuela, 2003; Zariquiey, 2011, 2015). UD does not have an onomatopoeia POS tag. Hence, we opted to use X to label it. In other treebanks, onomatopoeias were ascribed to different POS tags. For example, Badmaeva (2016) in her "Universal Dependencies for Buryat" states that "the case of onomatopoeia is also an interjection" (2016, p. 40). However, onomatopoeias in Shipibo-Konibo are members of a special closed part of speech. They are used in combination with semantically generic verbs or auxiliaries as a productive strategy in order to form new words. Therefore, we considered it appropriate to label them as a different and independent POS.
As discussed in §2.2, Shipibo-Konibo clitics are a special type of linguistic unit that ought to be treated as an independent POS. Universal Dependencies does not provide a clitic POS tag, but it does provide a particle POS tag, PART. Since clitics are often called particles (§2.2), we opted to treat the Shipibo-Konibo clitics as particles. These linguistic units are divided into three categories: nominal clitics (expressing case and only used with nominal phrases), second position clitics (mainly expressing evidentiality and following the first constituent of a sentence), and less-fixed clitics (expressing adverbial value and used with any kind of POS). It is important to remark that we do not treat them as adpositions (ADP). Adpositions belong to a closed set of items that occur before (preposition) or after (postposition) a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase. Thus, they form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause.
The high PART frequency noted in Table 2 could affect performance in tasks such as part-of-speech tagging, or even dependency parsing when it requires prior POS tag information. This was discussed and analyzed by Endresen et al. (2016) for a Russian corpus. We believe it will be important to measure whether the impact is positive or negative in morphosyntactic tasks for Shipibo-Konibo as well, and thus we would like to extend the discussion to a multilingual approach as further work.

Universal Morphological Features
The universal morphological features of UD are based on "Reusable tagset conversion using tagset drivers" by Zeman (2008), with the concept of an expandable feature structure that could support any tagset. Tagset labels aim to "distinguish additional lexical and grammatical properties of words, not covered by the POS tags" (Nivre et al., 2017). A list of the morphological features and values used in the Shipibo-Konibo treebank annotation is given in Table 3; most are already defined in Universal Dependencies. The few morphological features of Shipibo-Konibo that require labels not currently in Universal Dependencies are underlined in Table 3.
The new morphological features are further defined below and in Table 4.

Tense=Fut1, Fut2
Shipibo-Konibo also has two different classes of future tense, expressed by bound morphemes. These features are also presented in Table 4.

Dependency Relations
UD defines a set of 37 dependency relations, mainly based on "Universal Stanford Dependencies: A cross-linguistic typology" by Marneffe et al. (2014). Thirty-one of these 37 relations were employed in our Shipibo-Konibo treebank. One of the main characteristics of UD is that relations link content words rather than abstract nodes, i.e., lexicalism (Nivre et al., 2017). Dependency relations and frequencies in the treebank are reported in Table 6. It is worth mentioning that the frequency of acl and ccomp relation labels is low due to the choice of annotated sentences rather than a specific property of the language.

Shipibo-Konibo specific relations
While UD aims to provide "a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages" (Nivre et al., 2017), it also allows language-specific subtype relation labels when necessary. For the Shipibo-Konibo treebank, we considered the inclusion of two new subtype relation labels: aux:valid and compound:onom.

Relation subtype - aux:valid
An auxiliary is an element that may express different grammatical categories such as tense, aspect, mood, voice and evidentiality. In Shipibo-Konibo, evidentiality and mood are expressed through a subset of clitics. These clitics are ascribed to the relation aux, but in order to distinguish them from verbal auxiliaries, they receive the subtype relation label valid. This subcategory refers to the notion of validator, as defined by Cerrón-Palomino (2008, p. 166) for Quechua. For example, the sentence Enra yapa yoá akai (I cook fish) uses the first-hand evidentiality clitic =ra (Valenzuela, 2003, p. 534) to express that the speaker witnessed the event. See Figures 1 & 2 for more examples.
Note the high frequency of aux:valid shown in Table 6. With 176 instances (5.6% of all syntactic words) in 407 sentences, almost half of the Shipibo-Konibo sentences include an expression of evidentiality, given that more than one aux:valid is seldom used per sentence. This frequent expression of evidentiality is an intriguing linguistic phenomenon and worth further study.
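The figures above can be checked with a few lines of Python. The counts are those reported in the text. The total number of syntactic words is only implied by the 5.6% share, so the value computed here is an estimate rather than a reported number, and the per-sentence ratio is an upper bound on the share of sentences with evidential marking if some sentences carry more than one aux:valid.

```python
aux_valid = 176        # aux:valid instances reported in Table 6
sentences = 407        # sentences in the treebank
share_of_words = 0.056  # aux:valid as a fraction of all syntactic words

# Average number of aux:valid relations per sentence.
per_sentence = aux_valid / sentences

# Total syntactic words implied by the reported share (an estimate).
implied_words = aux_valid / share_of_words

print(f"{per_sentence:.1%} of sentences at most, "
      f"about {round(implied_words)} syntactic words")
```

A ratio of roughly 0.43 is what the text summarizes as "almost half".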

Relation subtype - compound:onom
Similar to other languages of the Panoan language family, in Shipibo-Konibo, onomatopoeias are considered as a closed word class (Valenzuela, 2003). In this language there are constructions that include two semantically generic verbs: ati (do) or iti (be) (Valenzuela, 2003, p. 83). These elements may be combined with onomatopoeias in order to create a type of compound verb.
We decided to use the subtype relation label compound:onom for those specific types of compound verbs. For example, the verb yoyo iti (to speak) corresponds to a compound formed by the verb iti (be) and the onomatopoeia yoyo (speech noise). Although they are two differentiated entities, both elements constitute a unit at the semantic level and are therefore treated as compounds in Universal Dependencies. See Figure 2 for another example.
There is a significant use of compounds, 46 instances and 1.5% of syntactic words (Table 6), but only a few are due to onomatopoeia. While deemed important in the language, onomatopoeias have low frequency representation in the current instance of the treebank.

Segmentation and Multiword Tokens
Our decision to split orthographic tokens on clitic boundaries in §2.2 results in an abundance of tokens with multiple syntactic words (Table 7): 402 multiword tokens (MWTs) out of 2706 total tokens. The second position clitic, Spcl, typically holds the dependency relation aux:valid with the clausal head and not with the core word of the MWT. The nominal clitic, Nomcl, holds the dependency relation case with the core word of the MWT.
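Counts of this kind can be derived directly from a CoNLL-U file. The sketch below is one way to do it in Python, assuming standard CoNLL-U IDs (ranges such as 1-2 for MWTs, decimals for empty nodes). The function name, the row helper and the tiny sample are illustrative and are not taken from the treebank.

```python
def count_tokens(conllu_text):
    """Return (multiword tokens, orthographic tokens, syntactic words)."""
    mwts = words = covered = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        tok_id = line.split("\t")[0]
        if "-" in tok_id:
            start, end = map(int, tok_id.split("-"))
            mwts += 1
            covered += end - start + 1  # words inside this MWT
        elif "." not in tok_id:          # ignore empty nodes
            words += 1
    # Words outside MWTs are tokens themselves; each MWT is one token.
    tokens = words - covered + mwts
    return mwts, tokens, words


def row(tok_id, form, head="_", deprel="_"):
    """Build one ten-column CoNLL-U line with unused fields left empty."""
    return "\t".join([tok_id, form, "_", "_", "_", "_", head, deprel, "_", "_"])


sample = "\n".join([
    row("1-2", "eara"),
    row("1", "ea", "3", "nsubj"),
    row("2", "ra", "3", "aux:valid"),
    row("3", "joke", "0", "root"),
])
print(count_tokens(sample))
print(f"{402 / 2706:.1%}")  # MWT share from the counts reported above
```

The last line reproduces the roughly 15% MWT share from the reported counts.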
In cases where a token contains multiple clitics, the Spcl comes after the others. This has the effect of preserving projectivity. We continue to monitor this issue of multiple-clitic MWTs and projectivity.
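The projectivity argument can be illustrated with a small Python check. The head arrays below are a simplified, hypothetical configuration (host, Nomcl, Spcl, verb) and not an annotation copied from the treebank. The function assumes a well-formed tree without cycles.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the 1-based head of word i + 1; 0 marks the root."""

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up the tree from node until the root is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly inside the arc must descend from its head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# host, Nomcl (case -> host), Spcl (aux:valid -> verb), verb (root)
attested_order = [4, 1, 4, 0]
# Hypothetical reversed order: host, Spcl, Nomcl, verb
swapped_order = [4, 4, 1, 0]

print(is_projective(attested_order))  # True
print(is_projective(swapped_order))   # False
```

In the reversed order the Spcl would sit between the host and its case clitic while attaching outside that span, which is the crossing configuration that the attested order avoids.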

Word Order vs Spanish
We examined word order differences between the dominant Spanish and Shipibo-Konibo. Spanish results are from the training set of the Es-Ancora treebank (Martínez Alonso and Zeman, 2017), while Shipibo-Konibo results are from our treebank. Table 9 reports counts and relative frequencies of a constituent preceding its head. Constituents are reported either by their dependency relation with their head or POS in the case of single syntactic word constituents. Relative frequency of following the head is just the complement of that of preceding the head.
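A comparison of this kind can be computed from CoNLL-U data along the following lines. This Python sketch groups dependents by dependency relation only (grouping by POS would read column 4 instead). The function name and the sample sentence are illustrative assumptions, not the script behind Table 9.

```python
from collections import Counter


def precede_rates(conllu_text):
    """Relative frequency of a dependent preceding its head, per relation."""
    before = Counter()
    total = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Plain integer IDs only: skips comments, blank lines,
        # multiword token ranges and empty nodes.
        if not cols[0].isdigit():
            continue
        head = int(cols[6])
        if head == 0:
            continue  # the root has no head to precede or follow
        rel = cols[7]
        total[rel] += 1
        if int(cols[0]) < head:
            before[rel] += 1
    return {rel: before[rel] / total[rel] for rel in total}


def row(tok_id, form, head="_", deprel="_"):
    """Build one ten-column CoNLL-U line with unused fields left empty."""
    return "\t".join([tok_id, form, "_", "_", "_", "_", head, deprel, "_", "_"])


sample = "\n".join([
    row("1-2", "eara"),
    row("1", "ea", "3", "nsubj"),
    row("2", "ra", "3", "aux:valid"),
    row("3", "joke", "0", "root"),
])
print(precede_rates(sample))
```

The frequency of following the head is one minus each returned value, as noted above.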
Direct and oblique objects usually follow the head (typically a verb) in Spanish and precede the head in Shipibo-Konibo. Auxiliary verbs usually precede the head in Spanish and follow the head in Shipibo-Konibo. Spanish uses prepositions and Shipibo-Konibo postpositions, but determiners precede their heads in both languages. Similar differences and similarities hold for the less common constituents as well.
Full confirmation of Shipibo-Konibo features against the WALS database (Dryer and Haspelmath, 2013) awaits further progress. However, the word orders in Table 9 largely agree with the corresponding word order features in WALS. An exception is the order of adjective and noun head. In our corpus, ∼75% of adjectives precede their head (∼80% when the head is a noun). Adjective-noun order therefore dominates, in contrast with the earlier finding by Faust (1973), reported in WALS, of no dominant order.

Parsing for Shipibo-Konibo
Dependency syntax parsing is a complex task that usually requires a lot of annotated data; we therefore decided to perform experiments in two different scenarios. The first one treats the treebank as an isolated corpus using monolingual methods, whereas the second one presents a cross-lingual experiment to identify which other languages from the UD v2.0 collection can support the parsing task for Shipibo-Konibo.

Monolingual Parsing
A straightforward test was performed using a greedy transition-based parser (Parsito) (Straka et al., 2015) from UDPipe (Straka and Straková, 2017) and the Yara Parser (Rasooli and Tetreault, 2015), which is also a transition-based method but uses beam search. The results obtained with 10-fold cross-validation are presented in Table 10, where we parse both with gold POS annotations and from raw text.
With the gold annotations, UAS and LAS scores from Parsito are greater than the language averages of 78.59% and 72.81%, respectively, reported by Straka and Straková (2017), and the Yara Parser provides slightly better results in most cases. The small difference between the two parsers may be caused by their different search approaches (greedy versus global beam search). Meanwhile, parsing raw text scored much worse, which was expected given the corpus size. However, most of the cross-validation results showed high variance. These results should therefore be taken only as a reference and not as definitive, as there could be overfitting and data scarcity issues.
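For reference, the two metrics and a simple fold assignment can be sketched in Python as follows. This is a generic illustration and not the evaluation code of UDPipe or the Yara Parser. How the folds were actually drawn is not stated in the text, so the round-robin split is an assumption, and the predicted labels in the example are made up.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency relation) for one word


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over aligned gold and predicted arcs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    # UAS: the head is correct; LAS: head and relation label are correct.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las


def k_folds(n_items: int, k: int = 10) -> List[List[int]]:
    """Round-robin assignment of item indices to k folds (an assumption)."""
    return [list(range(i, n_items, k)) for i in range(k)]


gold = [(3, "nsubj"), (3, "aux:valid"), (0, "root")]
pred = [(3, "nsubj"), (3, "aux"), (0, "root")]  # made-up parser output
print(attachment_scores(gold, pred))

folds = k_folds(407)
print([len(fold) for fold in folds])
```

With 407 sentences, each held-out fold contains only 40 or 41 sentences, which is consistent with the high variance noted above.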

Cross-Lingual Parsing
We conducted an experiment with single-source cross-lingual delexicalized parser transfer from the UD v2.0 source languages into Shipibo-Konibo as the target language, in the vein of Zeman and Resnik (2008).
In the experiment, we used the mate-tools graph-based parser by Bohnet (2010) with default settings. The entire Shipibo-Konibo treebank was our test set. We tagged the treebank for POS using MarMoT (Mueller et al., 2013). To avoid any dependency label inconsistencies, since our treebank is small, we evaluated for UAS only. We excluded all multiword tokens from the experiment, while retaining their respective syntactic words. A single delexicalized parser was trained for each UD v2.0 source treebank and applied to the Shipibo-Konibo test data.
Table 11 presents the results of the transfer parsing experiment. We achieve by far the best parsing results with the Kazakh delexicalized parser (66% UAS), closely followed by Japanese (63%), Basque and Turkish (ca. 59%), and then Tamil, Persian, and Hindi (57%). Specifically, Kazakh presents morphosyntactic features similar to those of Shipibo-Konibo, such as SOV word order, a high presence of agglutinative suffixes and head-final directionality (Mukhamedova, 2015). Moreover, the results are interesting because the top-performing cluster of sources for Shipibo-Konibo comprises languages that mainly feature as outliers in most cross-lingual parsing research, owing to the strong mainstream bias towards experimenting with resource-rich languages, as argued by Agić et al. (2016).
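The preprocessing described here can be sketched in Python. This is a minimal illustration under two assumptions: the data is in CoNLL-U format, and delexicalization means blanking the FORM and LEMMA columns while keeping POS tags. The function name and sample are hypothetical, and the actual pipeline built around mate-tools may have differed.

```python
def delexicalize(conllu_text):
    """Blank word forms and lemmas and drop multiword token range lines."""
    out = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            out.append(line)  # keep comments and sentence breaks
            continue
        cols = line.split("\t")
        if "-" in cols[0]:
            continue          # drop the MWT line, keep its syntactic words
        cols[1] = "_"         # FORM
        cols[2] = "_"         # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


sample = "\n".join([
    "1-2\teara\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tea\t_\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tra\t_\tPART\t_\t_\t3\taux:valid\t_\t_",
    "3\tjoke\t_\tVERB\t_\t_\t0\troot\t_\t_",
])
print(delexicalize(sample))
```

What remains for the parser is the sequence of POS tags and the tree structure, which is what makes transfer between unrelated vocabularies possible.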
To further support our findings, we correlate the cross-lingual parsing UAS scores with the language similarity of the UD v2.0 source languages to Shipibo-Konibo. We express language similarity as the pairwise Hamming distance between WALS vectors (Dryer and Haspelmath, 2013) for Shipibo-Konibo and the respective UD v2.0 source languages, similarly to Agić (2017). We depict this set of results in Figure 3. The observed correlation is unlikely to be random at p < 0.05. In other words, the source languages that are more similar to Shipibo-Konibo in terms of WALS are more likely to provide Shipibo-Konibo with good delexicalized parsers. That said, some of the best source parsers are outliers in the figure: Kazakh and Basque yield good parsers for Shipibo-Konibo, but their WALS distance to it is large. This is due to the sparsity of WALS features for these languages: for example, 183 of 202 WALS features are null for Kazakh, and 188 for Basque, but only 41 for Japanese. Fixing these WALS feature deficiencies would arguably strengthen the correlations and further support our findings. In addition, this analysis could be complemented by using a subset of WALS features that are generally available, as well as by inferring empty Kazakh features from related languages in the Kypchak group.
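The effect of null features on the distance can be illustrated with a short Python sketch. It is one possible reading of the setup, not the exact computation behind Figure 3: the distance is normalized, null features count as mismatches by default, and an optional mode restricts the comparison to features defined for both languages. The feature vectors are toy placeholders and not real WALS values.

```python
from typing import List, Optional


def hamming(a: List[Optional[str]], b: List[Optional[str]],
            shared_only: bool = False) -> float:
    """Normalized Hamming distance between two feature vectors.

    None marks a null feature. By default a null counts as a mismatch;
    with shared_only=True, positions with a null on either side are skipped.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    mismatches = compared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            if shared_only:
                continue
            mismatches += 1
        elif x != y:
            mismatches += 1
        compared += 1
    return mismatches / compared if compared else float("nan")


# Toy vectors with placeholder feature values.
target = ["1", "2", "1", "3"]
sparse_source = ["1", None, None, None]  # agrees wherever it is defined
dense_source = ["1", "2", "2", "3"]      # fully defined, one mismatch

print(hamming(target, sparse_source))                    # 0.75
print(hamming(target, sparse_source, shared_only=True))  # 0.0
print(hamming(target, dense_source))                     # 0.25
```

Under this reading, a sparsely described source language looks distant even when every defined feature agrees, which would place it as an outlier in a plot of distance against UAS.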

Conclusion and Future Work
We have presented Shipibo-Konibo, a language from the Amazon region of Peru, and our progress in building a treebank conforming to Universal Dependencies v2.0. We argued for segmenting syntactic words (versus tokens) along phrasal clitic boundaries and provided parse examples of this. While our treebank is still a work in progress with 407 sentences, we have already learned much about what distinguishes Shipibo-Konibo from other languages and treebanks. Segmenting on phrasal clitics and tagging them as PART resulted in PART accounting for 14% of the POS tags in our treebank, behind only PUNCT, NOUN and VERB in frequency.
Several morphological features were added to account for past and future verb tenses, And and Ven aspects, Chez case, and Nomcl, Spcl, and Lfcl clitics. Each of these additions matters in the meaningful annotation of Shipibo-Konibo.
We considered two new dependency relation subtypes: aux:valid and compound:onom. The aux:valid relation occurred 176 times (5.6% of syntactic words, and in almost half of the sentences). This frequent use of evidentiality invites further linguistic study.
By segmenting on phrasal clitics, Shipibo-Konibo stands out in its use of multiword tokens (MWTs), including both two-word and three-word MWTs. The Spcl clitic usually projects to the verbal head, but since it follows the other clitics, projectivity is preserved. Shipibo-Konibo has about five times as many MWTs as another agglutinative language such as Turkish (∼15% versus ∼3%).
The comparison of word order in Shipibo-Konibo and Spanish reveals substantial differences, which inform our work on machine translation between them. We largely confirmed the WALS word order features for Shipibo-Konibo, except for our finding that adjective-noun order is dominant, as opposed to the absence of a dominant order reported in WALS.
Results with monolingual parsers show promise, with better than language-average performance for gold POS tags. Delexicalized cross-lingual parsing, using parsers trained on all UD v2.0 treebanks, showed a maximum unlabeled attachment score (UAS) of 66% for Kazakh, a language with similar morphosyntactic features, followed closely by Japanese at 63%. A plot of UAS versus Hamming distance between WALS vectors reveals the expected inverse correlation between WALS distance and UAS (a smaller WALS distance is associated with a higher UAS). Japanese showed a low WALS distance and a high UAS, but Kazakh showed both a high WALS distance and a high UAS (seemingly an outlier).
As future work, we will increase the size of the UD treebank, as well as annotate the morphological features in a semi-supervised way. An FSM-based morphological analyzer has been developed (Cardenas Acosta and Zeman, 2018) that could support the annotation for that purpose. Moreover, as Shipibo-Konibo is one of many languages in the Panoan linguistic family, the next step would be the definition of the UD tagsets and guidelines for closely related languages, such as Iskonawa or Amawaka. We hope these efforts can extend the development of language technologies for minority languages in Peru.