A Morphological Analyzer for Shipibo-Konibo

We present a fairly complete morphological analyzer for Shipibo-Konibo, a low-resourced native language spoken in the Amazonian region of Peru. We rely on the robustness of finite-state systems to model the complex morphosyntax of the language. Evaluation over raw corpora shows promising coverage of grammatical phenomena, limited only by the scarce lexicon. We make this tool freely available to aid the production of annotated corpora and to foster further research on the native languages of Peru.


Introduction
Linguistic and language technology research on Peruvian native languages has experienced a revival in the last few years. The academic effort was accompanied by an ambitious long-term initiative driven by the Peruvian government. This initiative has the objective of systematically documenting as many native languages as possible for preservation purposes (Acosta et al., 2013). So far, writing systems and standardization have been proposed for 19 language families and 47 languages.
In this paper, we focus on Shipibo-Konibo (henceforth, SK), also known in the literature as Shipibo or Shipibo-Conibo. SK is a member of the Panoan language family. This family is a well-established linguistic group of the South American Lowlands, alongside Arawak, Tupian, Cariban, and others. Currently, circa 28 Panoan languages are spoken in Western Amazonia in the regions between Peru, Bolivia, and Brazil. Nowadays, Shipibo is spoken by nearly 30,000 people, mainly located in Peruvian lands.
The morphosyntax of SK is extensively analyzed by Valenzuela (2003). However, several phenomena such as discourse coherence marking and ditransitive constructions still require deeper understanding, as pointed out by Biondi (2012).
We present the first finite-state morphological analyzer for SK, capable of performing POS tagging as well as morpheme segmentation and categorization. In order to foster the development of downstream applications and corpus annotation, the tool is freely available under the GPL license.

Related Work
The development of freely available basic language tools has proven to be of utmost importance for the development of downstream applications for native languages with low resources. Finite-state morphology systems constitute one type of such basic tools. Besides downstream applications, they are essential for the construction of annotated corpora and, consequently, for the development of other tools. Such is the case of Quechua, a native language spoken in South America, for which the robust system developed by Rios (2010) paved the way to the proposal of a standard written system for the language (Acosta et al., 2013) and spurred work in parsing, machine translation (Rios, 2016), and speech recognition (Zevallos and Camacho, 2018).
Initial research regarding SK has centered on the development of manual annotation tools (Mercado-Gonzales et al., 2018), lexical database creation (Valencia et al., 2018), and Spanish-SK parallel corpora creation and initial machine translation experiments (Galarreta et al., 2017). Related to our line of research, work by Pereira-Noriega et al. (2017) addresses lemmatization but not morphological categorization. Alva and Oncevay-Marcos (2017) present initial experiments on spell-checking using proximity of morphemes and syllable patterns extracted from annotated corpora.
In this work, we take into account the morphotactics of all word categories and the possible morpheme variations attested by Valenzuela (2003). We explored and included as many exceptions as we found in the limited annotated corpora to which we had access. Hence, the tool presented is robust enough to support current efforts in the creation of basic language technologies for SK.

Shipibo-Konibo Morphosyntax
In terms of a syntactic profile, SK is a (mainly) post-positional and agglutinating language with highly synthetic verbal morphology, and a basic but quite flexible agent-object-verb (AOV) word order in transitive constructions and subject-verb (SV) order in intransitive ones, as summarized by Fleck (2013).
SK usually exhibits a biunique relationship between form and function, and in most cases morpheme boundaries are easily identifiable. It is common to have unmarked nominal and adjectival roots, and few instances of stem changes and suppletion are documented by Valenzuela (2003). In addition, the verb may carry one or more deictic-directive, adverb-type suffixes, in what can be described as a polysynthetic tendency.
In addition, SK presents a rare instance of syntactic ergativity in an otherwise morphologically ergative but syntactically accusative language.
We proceed to comment on the most salient morpho-syntactic features relevant to the discussion of morphotactics in section 4.2. The examples presented in this section were taken from Valenzuela (2003).

Expression of Argument
Verb arguments are expressed through free lexical case-marked nominals, with no co-referential pronominal marking on the verb or auxiliary. That is, verbs and auxiliaries are not marked to agree with the 1st, 2nd, or 3rd person of the subject or agent. Instead, verbs are marked to indicate whether the action was carried out by the same participant as in the previous clause or by a different one. We explain this phenomenon in section 3.4.
The omission of a required subject or object is normally understood as a zero third person singular form. There are no systematic morpho-syntactic means of distinguishing direct from indirect objects, or primary from secondary objects.

Case Marking
Grammatical cases are marked as suffixes, with a couple of exceptions. SK exhibits a fairly rigid ergative-absolutive case-marking system. The ergative case is always marked, whereas the absolutive case is only marked on non-emphatic pronouns. All other grammatical cases are marked, except the vocative case. The vocative case is constructed by shifting the stress of a noun to the last syllable.

Participant Agreement
Certain adverbs, phrases, and clauses are semantically oriented towards one core participant or controller and receive a marking in accordance with the syntactic function this participant plays, namely subject (S) of an intransitive verb, agent (A) of a transitive verb, or object (O) of a transitive construction. This feature can be analyzed as a type of split-ergativity which might be exclusive to Panoan languages. The following example illustrates this phenomenon for the adjunct bochiki 'high up' in S, O, and A orientation (ONOM refers to onomatopoeic words).
(1) S orientation
Bochiki-ra e-a oxa-i
up:S-Ev 1-Abs sleep-Inc
"I sleep high up (e.g., in a higher area inside the house)."

(2) O orientation
E-n-ra yami kentí bochiki a-ke
1-Erg-Ev metal pot:Abs up:O do.T-Cmpl
"I placed the metal pot high up." (only the pot is high up)

(3) A orientation
E-n-ra yami kentí bochiki-xon tan tan a-ke.
1-Erg-Ev metal pot:Abs up-A ONOM ONOM do.T-Cmpl
"I hit the metal pot (being) high up." (I am high up with the pot)

Clause-Chaining and Switch-Reference System
Chained clauses present only one clause with fully finite verb inflection, while the rest of them carry same- or switch-reference marking. Reference-marked clauses are strictly verb-final, carry no obvious nominalizing morphology, and may precede, follow, or be embedded in their matrix clause.
Same-reference markers encode the transitivity status of the matrix verb, the co-referentiality or non-co-referentiality of participants, and the relative temporal or logical order of the two events. This is because most same-subject markers are identical to the participant agreement morphemes and hence correlate with the subject (S) or agent (A) function played by their controller in the matrix clause. The following example shows three chained clauses. Notice that the matrix verb is 'chew', and the verbs of the subordinate clauses carry the marker -xon to indicate that the action was performed by the same agent prior to the action described in the main clause (PSSA: previous event, same subject, A orientation).

Pronouns and Split-Ergativity
The personal pronoun system in SK is composed of 6 basic forms corresponding to the combinations of three person (1,2,3) and two number (singular and plural) distinctions. SK does not differentiate gender or inclusive vs exclusive first person plural. There are no honorific pronouns either.
The ergative-absolutive alignment is used in all types of constructions, except for reflexive pronoun constructions. Reflexive pronouns are marked with the suffix -n when referring to both A and S arguments, but remain unmarked when referring to an O argument. Hence, reflexive pronoun constructions clearly present a nominative-accusative alignment.

Clitics
All clitics in SK are enclitics, i.e. they always function as suffixes. Most of them encode clause-level features, in which case they attach to the last element of the phrase or clause they modify. SK clitics are categorized into case markers, less-fixed clitics, and second position clitics, as proposed by Valenzuela (2003).
Case markers attach to the last constituent word of a noun phrase, preceding mood and evidentiality markers.
Less-fixed clitics mark the specific element they are attached to, instead of the whole clause. These are endo-clitics, i.e. they can take any position other than the last morpheme slot in a construction. In this category we can find adverbial, adjectival, and dubitative suffixes.

Morphological Analyzer
The analyzer was implemented using the Foma toolkit (Hulden, 2009), following the extensive morphological description provided by Valenzuela (2003). Besides segmenting and tagging all morphemes in a word form, the analyzer also categorizes the root and the final token in order to account for any sequence of derivational processes. The analysis is of the form [POS] root[POS.root] morpheme[+Tag] ... and it is illustrated with an example in Table 1.
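To make the output format concrete, the following is a minimal Python sketch of how a downstream script might split one analysis line into its parts. It assumes, beyond what the schematic form above states, that fields are separated by whitespace and that each morpheme is written as form[tag]; the sample line and the root tag "V" are hypothetical and are not taken from the analyzer's actual output.

```python
import re
from typing import List, NamedTuple


class Morph(NamedTuple):
    form: str  # surface form of the root or affix
    tag: str   # tag found between square brackets


class Analysis(NamedTuple):
    pos: str              # POS of the whole token
    root: Morph           # root with its own POS tag
    affixes: List[Morph]  # remaining morphemes, in order


# One morpheme field: a bracket-free form followed by "[tag]".
_MORPH = re.compile(r"^(?P<form>[^\[\]\s]+)\[(?P<tag>[^\[\]\s]+)\]$")


def parse_analysis(line: str) -> Analysis:
    """Parse a line shaped like '[POS] root[POS.root] morpheme[+Tag] ...'.

    Assumption: whitespace-separated fields (not confirmed by the paper).
    """
    fields = line.split()
    if len(fields) < 2 or not re.fullmatch(r"\[[^\[\]\s]+\]", fields[0]):
        raise ValueError(f"not an analysis line: {line!r}")
    morphs = []
    for field in fields[1:]:
        match = _MORPH.match(field)
        if match is None:
            raise ValueError(f"malformed morpheme field: {field!r}")
        morphs.append(Morph(match["form"], match["tag"]))
    return Analysis(fields[0][1:-1], morphs[0], morphs[1:])


# Hypothetical analysis line for oxa-ai 'sleep-Inc'.
analysis = parse_analysis("[VERB] oxa[V] ai[+Inc]")
print(analysis.pos, analysis.root, analysis.affixes)
```

Keeping the token-level POS separate from the root POS mirrors the distinction the analyzer makes to account for derivation.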
The complete list of abbreviations and symbols used for morphological tagging can be found in Appendix A of Valenzuela (2003). The language-specific POS tagset was mapped to the Universal Dependencies (Nivre et al., 2016) v2 POS tagset. In the remainder of this section we provide a thorough explanation of the production rules for the main POS categories and then comment on the limitations of the analyzer.

The Lexicon
The lexicon was obtained from a manually annotated corpus and a digitized thesaurus kindly provided by the Artificial Intelligence Research Lab of the Pontifical Catholic University of Peru (GIPIAA-PUCP). The annotated corpus was built from folk tale documents and consists of 12,250 tokens and 2,915 types. The thesaurus provides dictionary entries for 6,750 types.
The extensive work of Valenzuela (2003) provides a systematic encoding of morpho-syntactic information for SK. Similar guidelines were followed to design the encoding for Quechua (Rios, 2016), another agglutinative, ergative-absolutive native language widely spoken in Peru and South America.
The annotated corpus, however, was not annotated following this encoding, and further manual annotation was required. With the help of a digitized dictionary and an affix thesaurus, we manually resolved the mappings and correspondences using the now widely accepted morphosyntactic encoding.
The following example illustrates the annotation. The first row shows the raw segmentation of the tokens; the second row, the original annotation (Clit stands for clitic, VS stands for verbal suffix); the third row, the new annotation following the morphosyntactic tagset proposed by Valenzuela (2003).
Table 2 presents the number of roots per UD POS category for each lexicon source, for a total of 8,658 roots.

Morphotactics
Although SK presents a predominantly suffixed morphology, there exists a closed list of prefixes, almost all being body part derivatives shortened from the original noun (e.g. 'head' mapo → ma). These prefixes can be added to nouns, verbs, and adjectives to provide a locative signal.

Nouns
Nominal roots can occur in a bare form without any additional morphology or carry the following morphemes.
• Body part prefix (+Pref), to indicate location in the body.
• Plural marker (+Pl:bo), meaning more than one. Dual number distinction is not made in nouns, but in verbs.
• Participant agreement marker (+S:x), to indicate the subject of an intransitive verb.
It is worth mentioning that only the first plural morpheme has precedence over the other suffixes, and clitics are required to be last. Plural, case, and adverbial markers can occur multiple times.
There is no gender marking in SK. Instead, the words for woman (ainbo) and man (benbo) are used as noun modifiers. Consider the following example.
Títa-shoko-bicho-ra oxa-ai
mom:Abs-Dim-Adv-Ev sleep-Inc
'Mommy sleeps alone.'
The diminutive shoko denotes affection rather than size. Notice that the adverbial suffix bicho, which would have to be rendered as a separate adjunct in English, is attached to the noun, not the verb.
Derived Nominals: Verbal roots can be nominalized by adding the suffix -ti or the past participle suffixes -a, -ai. Zero nominalization is only possible over a closed set of verbs, e.g. shinan- 'to think, to remember / mind, thinking'.
On the other hand, adverbial expressions and adjectives may function as nominals and take the corresponding morphology directly without requiring any overt derivation.

Adjectives and Quantifiers
Adjectival roots can optionally bear the following morphemes.
• Negative (+Neg:ma), to encode the opposite feature of an adjective.
With regard to verbs, participial tense-marked verbs can function as adjectives. Transitive verbs and a closed set of intransitive verbs can take an agentive suffix (+Agtz:mis,yosma,kas) to express one who always performs that action.
As with nominalization, adverbs take zero morphology to function as adjectives.

Verbs
Verbal morphology presents by far the most complex morphotactics in SK, allowing up to 3 prefixes and 18 suffixes following a relatively strict order of precedence, as follows.
• Prefixes related to body parts, providing locative information about the action.
• Up to 2 valency-changing suffixes, depending on whether transitivity is being increased or decreased, whether the root is transitive or intransitive, and whether the root is bisyllabic or not.
• Interrogative intensifier (+Intens:shaman), to bring focus on the action in a question.
• Deictic-directive markers, which are identical or similar to motion verbs and encode a movement-action sequence, e.g. V-ina → 'go up the river and V'.
• Adverbial suffixes, depending on whether the verb is marked as plural or not. This slot includes the suffix bekon, which indicates dual action.
• Habitual marker (+Hab:pao), to encode that the action is done as a habit.
• Preventive marker (+Prev:na), to express warning, a situation to be prevented.
• Final markers, including participial and reference markers, depending on whether the verb is finite or non-finite in the clause. Reference markers encode agreement with the agent or subject of the clause (S vs A agreement), whether the agent is the same as in the matrix clause, and the point in time at which the action was carried out.
• All second position clitics.
Verbal roots must always bear either a tense marker or at least one final marker. All other suffixes are optional. The following example illustrates how the deictic-directive marker can encode a whole subordinate clause.
Sani betan Tume bewa-kan-inat-pacho-ai
Sani and Tume sing-Pl-go.up.the.river-Adv-Inc
'Sani and Tume always sing while going up the river.'
Derived Verbs: Nominal roots are turned into transitive verbs by adding the causativizer +Caus:n. The auxiliary marker +Aux:ak can be added to nominal, adjectival, and adverbial roots to form transitive verbs.
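The relative order of the verbal suffix slots and the requirement of a tense or final marker can be illustrated with a small Python sketch. This is not how the analyzer is implemented (it is a finite-state transducer built with Foma); the slot labels below are hypothetical names for the slots listed in this section, and because the text does not state where the tense slot sits, tense is checked only for presence, not for position.

```python
from typing import List

# Suffix slots in the order they are listed in the text (labels are hypothetical).
ORDERED_SLOTS = (
    "valency",
    "intensifier",
    "deictic_directive",
    "adverbial",
    "habitual",
    "preventive",
    "final",
    "clitic",
)
RANK = {slot: index for index, slot in enumerate(ORDERED_SLOTS)}


def check_verb_suffixes(slots: List[str]) -> List[str]:
    """Return a list of problems found in a sequence of suffix slot labels."""
    problems = []
    # Only slots with a known rank are checked for ordering;
    # repeating a slot (e.g. two valency suffixes) is allowed.
    ranked = [slot for slot in slots if slot in RANK]
    for previous, current in zip(ranked, ranked[1:]):
        if RANK[current] < RANK[previous]:
            problems.append(f"{current} follows {previous}")
    # A verb must carry a tense marker or at least one final marker.
    if not any(slot in ("tense", "final") for slot in slots):
        problems.append("missing tense or final marker")
    return problems


print(check_verb_suffixes(["deictic_directive", "adverbial", "final"]))
print(check_verb_suffixes(["habitual", "adverbial", "final"]))
print(check_verb_suffixes(["adverbial"]))
```

In a finite-state implementation the same constraints would be expressed as continuation classes or concatenated sub-lexicons rather than as a post-hoc check.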

Pronouns
Personal pronouns can bear the following suffixes.
• Ergative (+Erg:n) and absolutive (+Abs:a) case markers. The latter is only used on singular forms and the first person plural.
The ergative case construction also renders possessive modifiers, with the exception of the first and third singular form, which have a different form with no marking. Possessive pronouns are formed by adding the nominalizer +Nmlz:a to possessive modifiers.
Emphatic pronouns present the marker +S:x when agreeing with the S argument and no marker when agreeing with the A argument. Special attention was paid to the third person singular pronoun ja-, which presents a tripartite distribution: ja-n-bi-x for S, ja-n-bi for A, ja-bi for O.
Interrogative pronouns (who, what, where) can be marked for ergative, absolutive, genitive, chezative, and comitative cases. The participant agreement suffix for these pronouns presents a tripartite distribution: +S:x, +O:o, +A:xon for S, O, A agreement, respectively. The following example illustrates the behavior of the pronoun jawerano 'where'.
Demonstrative roots can function both as pronouns and determiners. In the first case, they bear all proper pronoun morphology. In the second case, they can only bear the Plural nominal marker +Pl:bo.

Adverbs
Adverbs can be suffixed with evidential clitics. However, whenever an adverb is modifying an adjective, it takes participant agreement morphology (+S:x,ax,i; +A:xon) in order to agree with the syntactic function of the noun the adjective is modifying.
Adverbial roots can also function as suffixes and be attached to nouns, verbs, adjectives, and even other adverbial roots.
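The tag notation used throughout this section, such as +S:x,ax,i; +A:xon, pairs a morphological tag with one or more surface forms. As a reading aid, here is a minimal Python sketch that turns such a specification into a lookup table. The function name and the treatment of separators are assumptions made for illustration; the notation is a convention of this paper's prose, not a documented input format of the analyzer.

```python
import re
from typing import Dict, List


def parse_tag_spec(spec: str) -> Dict[str, List[str]]:
    """Map each tag in a '+Tag:form1,form2' specification to its forms.

    Tag groups may be separated by ';' or ',' as long as the next
    group starts with '+'. A tag without forms maps to an empty list.
    """
    table: Dict[str, List[str]] = {}
    # Split only on separators that are directly followed by a new '+Tag'.
    for group in re.split(r"[;,]\s*(?=\+)", spec.strip()):
        if not group:
            continue
        if not group.startswith("+"):
            raise ValueError(f"tag must start with '+': {group!r}")
        tag, _, forms = group[1:].partition(":")
        table[tag.strip()] = [f.strip() for f in forms.split(",") if f.strip()]
    return table


print(parse_tag_spec("+S:x,ax,i; +A:xon"))
print(parse_tag_spec("+S:x, +O:o, +A:xon"))
```

The first call covers the adverbial participant agreement markers and the second the tripartite marking on interrogative pronouns.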
Derived Adverbs: Adverbs can be derived from demonstrative roots by adding locative case markers depending on the proximity of the entity being referred to. Adjectival roots function as adverbs by receiving the +Advz:n morpheme. Nouns and quantifier roots take the locative case marker +Loc:ki in order to form adverbs.

Postpositions
There are only 20 postpositional roots in SK, all of which can take second position clitics. In the same fashion as adverbial roots, postpositional roots can also function as suffixes. Adverbial roots can function as postpositions by taking the locative marker sequence +Loc:ain-ko.

Conjunctions
All conjunction roots take participant agreement markers (+S:x, +A:xon), except the coordinating conjunctions betan (and) and itan (and, or). These markers encode inter- or intra-clausal participant agreement and are often used as discourse discontinuity flags.
Subordinating conjunctions can take the following morphemes.
• Completive aspect markers, also found as participials in verbs at the final slot.
• Reference agreement marker +P:ke, to encode discourse continuity.
In the following example, we analyze the behavior of the conjunction root ja. While the first instance of jatian in (9) coincides with the introduction in subject function of the male Inka and hence with a change of subject, the second instance in (11) does not. In fact, the subjects in (10) and (11) have the same referent, but jatian is used to indicate a switch from narrative to direct quote in the chain. Note that in (11) the subject 'her husband' is overtly stated so that the hearer does not misinterpret jatian as indicating a change in subject.

Limitations
The analyzer processes text token by token without considering context, which prevents it from discarding hypotheses based on fairly rigid constructions, e.g. future tense with auxiliary verbs, modal verbs, and nominal compounds, among others.
There exists a group of morphemes that present multiple possible functions in the same position of the construction template. Hence, they can be mapped to more than one morphological tag. Consider the case suffix -n in the following example. The square brackets indicate that even though -n is attached to nonti, it acts as a phrase suffix that modifies the whole phrase (your canoe).
E-n [ mi-n nonti]-n yomera-i ka-ai
1-Erg 2-Gen canoe-Ins get.fish-SSSS go-Inc
"I am going to fish with your canoe."
In this case, the analyzer outputs all possible tag combinations, such as +Erg:ergative, +Inst:instrumental, +Gen:genitive, +Intrss:interessive, and +All:allative. Other suffixes with this kind of behavior are completive aspect suffixes and past tense suffixes in verbs. Disambiguation of these morphemes requires knowledge of the syntactic function of the word in the clause. Such sentence-level disambiguation is out of the scope of the analyzer.
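The effect of this ambiguity on the output can be sketched in a few lines of Python. The snippet below is only an illustration of why a context-free, token-level analysis multiplies readings; the table and function names are hypothetical, the tag inventory for -n is the one listed above, and the output string format is a simplification rather than the analyzer's real format.

```python
from itertools import product
from typing import Dict, List

# Possible tags per surface suffix. The readings of -n are those named in the
# text; +Pl for -bo comes from the noun morphotactics. Illustrative only.
SUFFIX_TAGS: Dict[str, List[str]] = {
    "n": ["+Erg", "+Inst", "+Gen", "+Intrss", "+All"],
    "bo": ["+Pl"],
}


def enumerate_readings(root: str, suffixes: List[str]) -> List[str]:
    """List every tag combination for a segmented token, ignoring context."""
    choices = [SUFFIX_TAGS[suffix] for suffix in suffixes]
    readings = []
    for combination in product(*choices):
        parts = [root] + [f"{s}[{t}]" for s, t in zip(suffixes, combination)]
        readings.append(" ".join(parts))
    return readings


for reading in enumerate_readings("nonti", ["n"]):
    print(reading)
```

Since the number of readings is the product of the options per suffix, a sentence-level disambiguation step would be needed to select the instrumental reading intended in the example.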

Evaluation
We evaluate the robustness of our analyzer by testing the coverage of word forms. A coverage per type of 94.99% was achieved for the training data (annotated corpus + thesaurus). A closer look at the remaining non-recognized types revealed that in all cases they contain an already covered root or affix but with different diacritization. This is to be expected, since the only existing diacritization rules for SK were proposed recently by Valenzuela (2003), and the texts on which the annotated data is based were written well before the proposal of these rules. Table 3 shows type and token coverage over raw text not used during development. These corpora span several domains such as the Bible, educational material, the legal domain, and folk tales. This last domain, which is also the domain of the annotated corpus, has the highest coverage.
As expected, the lowest coverage is obtained over the legal domain, a specialized domain with complex grammatical constructions and specialized vocabulary. For example, legal documents must be precise about semantic roles of the participants, information partially encoded through morphology in SK. In contrast, educational material for kindergarten level presents the second highest coverage, quite possibly because only basic grammatical constructions are used at this level of education.
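For reference, the two coverage figures can be computed as in the following Python sketch. It assumes that type coverage is the fraction of distinct word forms that receive at least one analysis and that token coverage weights each form by its frequency; the recognizer is passed in as a function, and the toy lexicon below merely stands in for the analyzer.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def coverage(tokens: Iterable[str],
             is_recognized: Callable[[str], bool]) -> Tuple[float, float]:
    """Return (type coverage, token coverage) as fractions in [0, 1]."""
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        raise ValueError("empty corpus")
    recognized = [word for word in counts if is_recognized(word)]
    type_coverage = len(recognized) / len(counts)
    token_coverage = sum(counts[word] for word in recognized) / total_tokens
    return type_coverage, token_coverage


# Toy stand-in for the analyzer: a word is "recognized" if it is in this set.
toy_lexicon = {"yami", "nonti"}
corpus = ["yami", "yami", "nonti", "xyz"]
type_cov, token_cov = coverage(corpus, lambda word: word in toy_lexicon)
print(f"type coverage: {type_cov:.2%}, token coverage: {token_cov:.2%}")
```

Token coverage is usually the higher of the two, because frequent word forms are more likely to be in the lexicon.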
Error Analysis: We further analyze the unrecognized words in the raw corpora. We manually categorize the 100 most frequent unrecognized word types, as shown in Table 4. It can be noted that the most common error is due to alternative spelling of the final word form, mostly due to the absence or presence of diacritics, or due to the presence of an unknown allomorph. Most of the errors of this kind can be traced back to tokens in the Bible domain. The Bible was translated into SK in the 17th century and it has remained almost intact since then. Hence, some constructions are nowadays considered ungrammatical (e.g. verbs lacking both a participant agreement suffix and a tense suffix, one of which is now required) and some suffixes are obsolete (e.g. the n-form +Erg:sen; the infinitive form +Inf:ati).
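One simple way to spot the diacritic-related misses described above is to compare unrecognized words against the lexicon after removing diacritics. The Python sketch below shows this idea using Unicode decomposition; it is a diagnostic aid proposed here for illustration, not a component of the analyzer, and the function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. the acute accent in 'kentí'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_ignoring_diacritics(word: str, lexicon: Iterable[str]) -> List[str]:
    """Return lexicon entries that equal `word` once diacritics and case are ignored."""
    key = strip_diacritics(word).lower()
    return sorted(entry for entry in lexicon
                  if strip_diacritics(entry).lower() == key)


print(strip_diacritics("kentí"))
print(match_ignoring_diacritics("kenti", {"kentí", "yami"}))
```

Such folding should be used with care for anything beyond diagnosis, because stress placement is meaningful in SK: the vocative case, for instance, is formed by shifting stress to the last syllable.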
Furthermore, the high presence of OOV words other than nouns or proper nouns indicates that the root lexicon on which the analyzer is based is still limited and that far more entries are needed.

Conclusion and Future Work
We presented a robust and fairly complete (in morphotactics, not in lexicon) finite-state morphological analyzer for Shipibo-Konibo, a low-resourced native language from Peru. The analyzer is capable of performing morphological segmentation and categorization, as well as part-of-speech tagging of the root and the whole final token.
Experiments over corpora from different domains show promising coverage given the limited root lexicon available. We performed a thorough analysis of errors over unrecognized words, finding that our analyzer cannot recognize certain obsolete constructions and spellings found in Biblical text, which was written centuries ago. However, for modern-day Shipibo-Konibo in non-specialized domains (i.e. outside specialized ones such as the legal domain) the tool is quite robust and covers production rules for all word categories.
The work presented in this paper is part of a greater effort to provide the research community with basic language tools that would aid in the construction of treebanks. Future paths considered include the mapping of morphological tags into the morphological features defined in Universal Dependencies, sentence-level tag disambiguation, and parsing, among others.