A corpus of K’iche’ annotated for morphosyntactic structure

This article describes a collection of sentences in K’iche’ annotated for morphology and syntax. K’iche’ is a language in the Mayan language family, spoken in Guatemala. The annotation is done according to the guidelines of the Universal Dependencies project. The corpus consists of a total of 1,433 sentences containing approximately 10,000 tokens and is released under a free/open-source licence. We present a comparison of parsing systems for K’iche’ using this corpus and describe how it can be used for mining linguistic examples.


Introduction
For some time, one of the fundamental resources for language technology has been a part-of-speech tagged (or morphologically annotated and disambiguated) corpus. Creating these resources has traditionally been a lengthy process, from defining an annotation scheme to collecting texts, training annotators and performing the annotation. Recently, however, advances in annotation schemes and end-to-end linguistic processing pipelines mean that the development of a single resource, a treebank, can enable a whole pipeline of language analysis tools, from tokenisation to dependency parsing.
In this paper we describe the annotation of such a corpus for K'iche', a Mayan language of Guatemala, and outline how the corpus can be used to train systems for linguistic annotation.
The remainder of the paper is laid out as follows: Section 2 gives a brief grammatical overview of K'iche'; Section 3 gives an overview of related work on K'iche' syntax; Section 4 describes the corpus and preprocessing steps; Section 5 describes the annotation process; Section 6 describes a range of syntactic constructions in K'iche' and how they were annotated. We evaluate parsing performance using the corpus in Section 7 and show how models trained on the corpus can be used in finding linguistic examples. Finally, we describe some future work (Section 8) and present some concluding remarks (Section 9).

K'iche'
K'iche' (ISO-639-3: quc, also K'ichee', previously Quiché) is a language within the Quichean-Mamean branch of the Mayan language family. As of the 2018 Guatemalan census, it is documented to have over 1.5 million native speakers; however, the number is likely higher now, and the census figure does not account for speakers in the diaspora. There are roughly 23 variants of K'iche' spoken throughout southwestern Guatemala.
K'iche' is a language with ergative-absolutive alignment, basic verb-initial order of constituents, and prefixes for agreement. The language is both prefixing (for inflection) and suffixing (for derivation and some inflection). Neither subject nor object need be overtly expressed when recoverable from context. The sets of agreement markers are an important part of the K'iche' grammatical system. These are traditionally split into set A and set B. Set A, or the ergative (ERG) markers, are used on nouns to cross-reference, that is, agree with, their possessors and on verbs to indicate a transitive subject. Set B markers, or the absolutive (ABS) markers, are used to cross-reference the transitive object or intransitive subject. Table 1 shows the markers.
There are also formal forms which appear as a combination of one of the prefixes with a following particle, lal or alaj. The Set B first person plural morph may also be uj-.
K'iche' verbal morpho-syntax, like that of other Mayan languages, is organised around transitivity. Root verbs, i.e., verbs of the form CVC, and their derived non-CVC counterparts are classified as either transitive or intransitive, and this classification has implications for the kinds of morphology the verb can take. It controls the distribution of Set A and Set B morphology that we have already seen, but it also constrains what kinds of nominalisations a verb stem allows (Can Pixabaj, 2009), as well as which 'Status Suffixes' a verb stem takes (see Section 6.9 for more discussion of this unique aspect of Mayan morphology). While the basic word order of K'iche' is VOS, all possible word orders are attested, conditioned by discourse factors, the most important of which are topic and focus. Focus involves marking the focused expression with a focus particle, and then preposing it to a position before the verb. Topicalisation involves morphologically unmarked preposing of the topicalised expression before the verb. If a clause contains both topicalised and focused expressions, the topic comes before the focus.

Related work
Broadly, this work is a corpus of K'iche' sentences, morphosyntactically analysed and annotated in a way that supports downstream natural language processing tasks like machine translation, relation extraction, etc. While there are annotated corpora of K'iche', like the K'iche' segment of the Oxlajuuj Keej Maya' Ajtz'iib' Mayan Languages Collection (Oxlajuuj Keej Maya' Ajtz'iib', 2021) or Telma Can Pixabaj's 2018 annotated collection of ceremonial discourse in K'iche', these are not in easily parsable formats that can be fed directly into existing NLP pipelines. The nearest analogs to the work presented here are Sachse's 2016 XML standard for morphological annotations of Mayan languages, including K'iche', and Palmer's 2010 IGT-XML corpus of the related language Uspanteko. While parseable, and annotated with grammatical information like part-of-speech, these are not treebanks like the present work. In fact, ours is the first treebank of any Mayan language.

Corpus
The corpus is composed of sentences from a range of text types. Around two thirds are example sentences either from a published dictionary (Medrano Rojas, 2004) or from linguistic research (Can Pixabaj, 2015; Henderson, 2012). To this we added some language learning materials (Romero et al., 2018), and religious, medical and legal texts (Wycliffe Bible Translators, 2011; Wikimedia Incubator, 2017; Méndez López, 2020; Gobierno de Guatemala, 2009). The remainder was from a collection of folk tales (Ministerio de Educación, 2016a,b). The majority of the texts came with a translation either in Spanish or in English. Some texts, such as the linguistic examples, additionally came with interlinear glosses. For the texts that did not have translations, we performed a rough-and-ready glossing into Spanish with the aid of a prototype machine translation system. The texts were chosen for their availability and for the range of linguistic phenomena they exhibited. The latter was an important consideration, as one of the aims of the work was to create annotation guidelines that can be used in further annotation and adapted to other Mayan languages.

Preprocessing
The texts were preprocessed using a freely-available finite-state morphological analyser (Richardson and Tyers, 2021). The morphological analyser returned, for each token, the set of possible morphological analyses, including multiple output tokens in the case of contractions. These analyses were then disambiguated by hand, and missing analyses added.
This disambiguated output was then converted to the ten-column CoNLL-U format. Morphological tags were converted to Feature=Value pairs by using a deterministic maximum-set-overlap matching algorithm.
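The conversion step is only described in outline, so the following is a minimal sketch of what a deterministic maximum-set-overlap matcher might look like, not the implementation used for the treebank. The analyser tag names (`impf`, `sg`, `p1`, `iv`), the rule table `RULES` and the function `tags_to_feats` are all hypothetical and chosen purely for illustration. The idea is to repeatedly pick the rule whose tag set covers the largest part of the tags still unconverted, breaking ties in a fixed order so that the result is reproducible.

```python
# Hypothetical rule table: each rule maps a set of analyser tags to
# UD Feature=Value pairs. Tag names are illustrative assumptions only.
RULES = [
    (frozenset({"impf"}), {"Aspect": "Imp"}),
    (frozenset({"sg", "p1"}), {"Number": "Sing", "Person": "1"}),
    (frozenset({"sg"}), {"Number": "Sing"}),
    (frozenset({"p1"}), {"Person": "1"}),
    (frozenset({"iv"}), {"Subcat": "Intr"}),
]


def tags_to_feats(tags):
    """Convert analyser tags to a CoNLL-U FEATS string by greedy maximum overlap."""
    remaining = set(tags)
    feats = {}
    while remaining:
        # Rules whose tag set is fully contained in the tags not yet consumed.
        candidates = [(t, f) for t, f in RULES if t <= remaining]
        if not candidates:
            break  # leftover tags have no rule; they are simply dropped here
        # Largest rule first; ties are broken alphabetically for determinism.
        rule_tags, rule_feats = min(
            candidates, key=lambda r: (-len(r[0]), sorted(r[0]))
        )
        feats.update(rule_feats)
        remaining -= rule_tags
    # CoNLL-U expects features sorted by name; "_" marks an empty field.
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in pairs) or "_"


print(tags_to_feats(["impf", "sg", "p1", "iv"]))
```

Preferring the largest matching rule lets a combined tag set such as `{sg, p1}` be mapped as a unit before the single-tag fallbacks apply.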

Annotation process
The annotation guidelines are based on Universal Dependencies (Nivre et al., 2020), an international collaborative project to make cross-linguistically consistent treebanks available for a wide variety of languages. At the time of writing, data for over 111 languages is available through the project in a standardised format and with a standardised annotation scheme. We chose the UD scheme for the annotation as it provides pre-defined recommendations on which to base annotation guidelines. This reduces the amount of time needed to develop annotation guidelines for a given language, as where the existing universal guidelines are adequate, they can be imported wholesale into the language-specific guidelines.
The treebank was annotated by the first author, and difficult cases were resolved through discussion between the first and second authors.

Constructions
In the following subsections we describe some particular features of K'iche' that are interesting or novel with respect to the Universal Dependencies annotation scheme, and our approach to annotating them. Inline examples are given on three lines, with the original text, a segmentation showing the inflectional morphs, and an approximate translation in English. Glosses are provided when necessary for explaining some particular feature or construction. Where contractions are split, the split is indicated with a hyphen on both sides of the split, so for example ch- followed by -we should be read chwe.
The focus is primarily on the relation between syntactic words, so for example constructions such as the morphological expression and annotation of agreement, tense-aspect-mood prefixes, incorporated movement, and possessive prefixes are not outlined here. It suffices to say that these are encoded with Feature=Value pairs.

Relational nouns
K'iche' has two prepositions with locative meaning: chi 'in' and pa 'in, at, on, to, towards, from'. Following the guidelines, these are attached using the case relation to their complement, as in (1).
(1) Kinchʼaw pa le chʼawebʼal.
K-in-chʼaw pa le chʼawebʼal.
I speak on the telephone.
[Dependency relations in the tree: root, obl, det, case]
All other adpositional phrases are made using either relational nouns or combinations of relational nouns with these two prepositions. For readers familiar with Indo-European languages, these relational nouns are similar in function to nouns of the type front, top, side in English or frente 'front', cima 'top', lado 'side' in Spanish (e.g. al lado de la casa 'at the side of the house'). However, they are more extensive, used for encoding relations that in Indo-European languages are encoded with prepositions, such as with, by, of, etc. or even determiners or pronouns, e.g. -onojel 'all'.
Relational nouns agree with their complements using possessive markers (set A affixes) and may or may not have a complement. For example, in (2) the relational noun -ukʼ 'with' is used with a complement le nunan 'my mother'.
(2) Kinchʼaw rukʼ le nunan.
K-in-chʼaw r-ukʼ le nu-nan.
I speak with the my mother.
[Dependency relations in the tree: root, obl, det, nmod]
In (3) the same relational noun -ukʼ 'with' is used without a nominal complement.
(3) ¿La katpe quk' chwe'q?
La k-at-pe q-uk' chwe'q
QST will you come with us tomorrow
[Dependency relations in the tree: root, obl, advmod, discourse]
To maintain language-internal consistency, these are annotated with the relational noun as the head of the construction, attached to predicates with the obl oblique relation and to nominals with the nmod relation.
It is worth noting that relational nouns can also be used in conjunction with the true prepositions, as in (4), for example.

Nominal possession
In terms of nominal possession, Kʼicheʼ is a head-marking language. The schema for possession is a noun with a possessive prefix followed by the possessor, POS-N1 N2 = "N1 of N2". For example, utzij ri ajqʼij "the daykeeper's word" (lit. "his word the daykeeper"). Possession can also be expressed on multiple nouns in series, as in the sentence Kʼax ri ubʼaqil nuqʼabʼ. "The bones of my arms hurt" (5).

Relative clauses
Following Can Pixabaj (2021), relative clauses in K'iche' are post-nominal and come in two broad types, headed (6) and headless (7).
Kojtzalijoq jawi ri xojkanaj …
K-oj-tzalij-oq jawi ri x-oj-kanaj …
Let's return where that we stayed …
[Dependency relations in the tree: root, mark, obl, acl]
Relative clauses embedded under a head nominal, like (6), can be further split into those that contain an interrogative relative pronoun and those that contain a determiner acting as a subordinating conjunction. The reason for treating the latter as a subordinating conjunction and not a relative pronoun, pointed out by Bridges Velleman (2014), is that the two can co-occur, as in (8). In (8), the relative clause jas le kimbʼij, lit. "what that I'm saying", is introduced by the interrogative relative pronoun jas, which is given the relation of object. It is then followed by a relative clause complementiser, which we give the mark relation. The predicate in the relative clause is then attached to the nominal it modifies with the relation acl, adnominal clausal modifier.
… xuto ri xubʼij ri ratiʼt …
… x-ø-u-to ri x-ø-u-bʼij ri r-atiʼt …
… listen the say it the her mum
[Dependency relations in the tree: parataxis, obj, acl, nsubj, det]
In addition to headed and headless relatives, Can Pixabaj (2021) also discusses so-called light-headed relatives. In these, the head noun that the relative would usually modify is not expressed, leaving only a determiner. As shown in (9), in this case we promote the determiner to head of the construction, and treat the light-headed relative as adnominal clause modification (namely acl).

Non-verbal predicates
In non-verbal predication, for example with nouns or adjectives, the predicate is the root and the subject depends on it, as in the examples Bʼixonel ri a Luʼ "Luʼ is a singer" (10) and Kʼax le kibʼe ri winaq. "The road of the people is difficult" (11). For existential sentences in the affirmative and in the negative, two non-inflecting words are used: kʼo in the case of existence and maj in the case of nonexistence. In these constructions, the non-inflecting word is the head and the thing existing is the subject, as in Kʼo jun tzʼiʼ pa bʼe. "There is a dog in the street." Another set of non-verbal predicates involves forms such as rajawaxik 'necessary' and kʼax 'difficult' with verbal subjects. These are analysed as nominals (nouns or adjectives), and the complement is an embedded clausal subject.

Complement clauses
Our analysis of complement clauses is based on research done by Can Pixabaj (2015), whose thesis gives a thorough treatment of the topic. This section is based on Chapter 3 of (Can Pixabaj, 2015, p.85). In K'iche', complements can be split into three subcategories: finite with complementiser, finite without complementiser and non-finite.
In UD, the distinction in complements is between those with obligatory control, xcomp, and those without control, ccomp. Each of the three types defined in K'iche' may have control or not. In (14) the subordinate clause is introduced by a subordinator, while in (15) … In (17) and (18) …

Adverbial clauses
There are a number of types of adverbial clauses in K'iche', including those introduced using word order, by a subordinator (e.g. we 'if' or are taq 'when'), and using a relational noun (e.g. -umal 'because', -ech 'in order to').

(19) Kinbʼinik xinʼek.
K-in-bʼin-ik x-in-ʼe-k.
I was walking I left.
[Dependency relations in the tree: root, advcl]
In (19) a manner clause k-in-bʼin-ik 'IMPF-B1S-walk-SS' precedes its main clause. This ordering is mandatory for manner clauses, as is the lack of a subordinator.

(20) We keqbʼan ri qʼuch utz kujelik.
We k-e-q-bʼan ri qʼuch utz k-uj-el-ik.
If we practice the qʼuch well we come.
Other kinds of adverbial clauses may precede or follow the main clause. In We keqbʼan ri qʼuch utz kujelik. 'If we practice qʼuch it will be good for us.' (20) the conditional clause introduced by the subordinator we 'if' appears before the main clause. Adverbial clauses can also be introduced by relational nouns, as in (21), where the relational noun -umal 'by' has the function of obl, standing in for a manner oblique, and the clause is dependent on it as an adnominal clause.

Valency changing
Transitive verbs in K'iche' are subject to two main valency changing operations, the passive and the antipassive. These are morphological processes which involve suffixation. For the passive, either the final vowel is lengthened, or the suffix -x is added. For the antipassive the suffixed morpheme is -Vn or -n.
In the passive, the subject is omitted and the object promoted to subject position. This can be seen in the comparison between the sentence Xkikunaj le ali ri ixoqibʼ. "The women cured the girl." (22), where the verb x-ø-ki-kuna-j 'PERF-B3S-A3P-cure-ACT' has agreement for both subject and object, and the sentence Xkunax le ali kumal ri ixoqibʼ. "The girl was cured by the women." (23), where the verb x-ø-kuna-x 'PERF-B3S-cure-PASS' agrees only with the subject (previously the object) and the original subject is demoted to oblique using the relational noun -umal 'by'.
Word-for-word gloss of (22): Cured the girl the women.
In the antipassive, the subject is retained, but encoded with the absolutive, and the object is demoted to oblique status using the preposition chi 'to' and the relational noun -e(ch). Compare (24), where the verb k-in-u-loqʼo-j 'IMPF-B1S-A3S-love-ACT' has agreement for both subject and object, with the antipassive version in (25), which exhibits agreement only for the subject, ka-ø-loqʼo-n 'IMPF-B3S-love-AP'.

Directionals
In Mayan languages there is a category of words called directionals, which are grammaticalised forms of intransitive verbs of motion (Can Pixabaj, 2017). Some examples are b'i(k) < -b'e 'go', qaj(oj) < -qaj 'go down', and kan(oq) < -kan 'stay'. The part in parentheses after the directional is the status suffix (see §6.9). They usually follow verbs and other predicates to express movement, deictic or aspectual information and are related to the incorporated movement prefixes e'- < b'e 'go' and ul- < ul 'arrive'. Despite being derived from verbs, these are not full predicates, being either modifiers or copredicates. We analyse them as adverbial modifiers and provide a feature AdvType=Dir for linguists interested in querying the corpus for this phenomenon.
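As an illustration of how the AdvType=Dir feature could be queried, here is a small sketch that reads CoNLL-U text and lists each directional together with the word it modifies. The two-token sample is hand-built for this example and is not copied from the treebank; in particular, the ADV part-of-speech tag on the directional is an assumption. The helper names `read_conllu` and `find_directionals` are likewise hypothetical.

```python
# Hand-built illustrative sample (not taken from the treebank).
# The UPOS tag ADV for the directional is an assumption.
SAMPLE = """\
# text = Kekanaj kan
1\tKekanaj\t_\tVERB\t_\t_\t0\troot\t_\t_
2\tkan\t_\tADV\t_\tAdvType=Dir\t1\tadvmod\t_\t_

"""


def read_conllu(text):
    """Yield sentences as lists of token dicts (basic word lines only)."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            if tokens:
                yield tokens
                tokens = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # skip ranges (1-2) and empty nodes (1.1)
                feats = {}
                if cols[5] != "_":
                    feats = dict(f.split("=", 1) for f in cols[5].split("|"))
                tokens.append({
                    "id": int(cols[0]), "form": cols[1], "upos": cols[3],
                    "feats": feats, "head": int(cols[6]), "deprel": cols[7],
                })
    if tokens:
        yield tokens


def find_directionals(sentences):
    """Return (directional form, head form) pairs for AdvType=Dir tokens."""
    hits = []
    for sent in sentences:
        by_id = {t["id"]: t for t in sent}
        for tok in sent:
            if tok["feats"].get("AdvType") == "Dir":
                head = by_id.get(tok["head"])
                hits.append((tok["form"], head["form"] if head else None))
    return hits


print(find_directionals(read_conllu(SAMPLE)))
```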

Status suffixes
Status suffixes are a particular feature of the Mayan languages. These are suffixes that appear on verbs (and directionals which historically come from verbs). The particular status suffix a verb bears is conditioned by an amalgamation of morphosyntactic facts about the clause, including the transitivity of the verb, whether the verb is a root verb (i.e., CVC form) or has undergone derivation, the tense-aspect-mood of the clause, and whether the clause is an independent or dependent clause. In K'iche' there are four status suffixes, -ik, -oq, -u ∼ -o (with vowel harmony) and -u' ∼ -a' ∼ -o' (with vowel harmony).
Fall DIR SS in its inside the hole.
[Dependency relations in the tree: root, case, advmod, aux:ss, det, nmod, obl]
In this example, the directional, itself derived from a verb, bears the status suffix -ik, which indicates that the verb is intransitive and non-dependent. One might wonder why tzaq 'fall', the main verb, does not bear its own status suffix. This is because, in K'iche', these suffixes only appear at the edges of certain prosodic phrases (Henderson, 2012). There is no such phrase break between the verb and directional, and so only the latter bears the status suffix.
We have chosen to link status suffixes to their verbs with a flavour of the aux relation. The reason is that status suffixes are function words accompanying the verb that express aspect and mood information like verbal auxiliaries do in more familiar languages. For instance, swapping the -ik and -oq status suffixes on an intransitive verb (in certain aspects) is enough to change the interpretation from conditional mood to imperative mood.

Experiments
Here we present two experiments using the corpus. The first is an evaluation of three different parsing pipelines and the second is an experiment in using automatic parsing for mining linguistic examples.

Automatic parsing
In order to test the usage of the corpus for automatic parsing, we performed three experiments using three off-the-shelf natural-language processing pipelines: UDPipe 1.2 (Straka et al., 2016), UDPipe 2.0 (Straka, 2018) and UDify (Kondratyuk and Straka, 2019). Version 1.2 (Straka et al., 2016) of UDPipe is a pipeline-based model where tokenisation is performed by a BiLSTM, morphological analysis and part-of-speech tagging are performed using an averaged perceptron model and dependency parsing uses a transition-based non-projective parser, where transitions are predicted by a neural network. Version 2.0 (Straka, 2018) is a complete rewrite of the UDPipe parser. It implements a joint model for part-of-speech tagging, morphological analysis, lemmatisation and parsing. The parsing model is graph-based using the Chu-Liu/Edmonds algorithm for decoding. Finally, UDify (Kondratyuk and Straka, 2019) is a multilingual model that supports parsing 75 languages. This is also a joint model, with a shared BERT representation for all 75 languages. The pre-trained model can be fine-tuned on language data from a new language, and we provide the results for fine-tuning on K'iche'. All parsers were trained with default hyperparameters.
As there was not enough data to maintain a held-out test set of sufficient size, we performed ten-fold cross-validation. Table 3 presents the results of the comparison. The evaluation was carried out using the official evaluation script from the 2017 CoNLL Shared Task (Zeman et al., 2017).
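The fold construction is not spelled out in the text, so the sketch below shows one plausible way to set up ten-fold cross-validation at the sentence level. The shuffling, the fixed seed and the function name `ten_fold_splits` are assumptions made for illustration; integers stand in for the 1,433 annotated sentences.

```python
import random


def ten_fold_splits(sentences, k=10, seed=0):
    """Yield (train, test) pairs so that each sentence is tested exactly once."""
    order = list(range(len(sentences)))
    # Shuffling with a fixed seed is an assumption; it keeps folds reproducible.
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    for fold in folds:
        held_out = set(fold)
        train = [sentences[j] for j in order if j not in held_out]
        test = [sentences[j] for j in fold]
        yield train, test


# Placeholder data: integers standing in for the 1,433 sentences.
corpus = list(range(1433))
for train, test in ten_fold_splits(corpus):
    # A parser would be trained on `train` and scored on `test` here.
    pass
```

With 1,433 sentences each test fold holds 143 or 144 sentences, and scores would be averaged over the ten runs.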
As can be seen from the results in Table 3, UDPipe 2.0 performs significantly better than UDPipe 1.2 and UDify for all of the tasks. This comes at a substantial increase in model size and training time compared to UDPipe 1.2, but results in a model that is still tractable on a consumer-grade laptop.

Linguistic example mining
Using corpora of under-resourced languages to test predictions pertinent to linguistic theory is often difficult. The reason is that the predictions are usually highly structurally dependent, making it hard, or even impossible, to search for relevant examples via string matching. We show the utility of the present treebank through a case study probing the distribution of phrase-final status suffixes (see Section 6.9). Henderson (2012) proposes that the status suffixes that only appear phrase-finally are sensitive to intonational phrase boundaries, which roughly map onto clause boundaries. The generalisation is that a phrase-final status suffix should only appear if the verb or directional bearing it is (i) utterance-final, (ii) directly before an embedded clause, or (iii) directly before a functional head that itself embeds a clause. Notice that to find counterexamples to this generalisation, one must search for sentences that do not satisfy a structural description, e.g., "give me sentences containing a status suffix that is not directly followed by an embedded clause". This is impossible to do without a treebank. It is not even possible to do via string matching over a corpus with grammatical annotations like part-of-speech tags. We used the corpus to test the generalisation in Henderson (2012) against a larger set of K'iche' texts. In order to produce a larger corpus of examples, we took all of the texts we had available from the sources mentioned in Section 4, added the Crúbadán corpus of K'iche' (Scannell, 2007), and processed them with the UDPipe 2.0 model described in the previous section.
We used the Grew (Guillaume, 2019) corpus query language to extract all sentences where a verb had both a dependent that was an auxiliary with the relation of aux:ss and a noun with the relation obj. The query can be seen schematically in (27).
(27) VERB -[aux:ss]-> AUX; VERB -[obj]-> NOUN
This led to a total of 16,196 sentences containing 352,509 tokens. Note that these sentences were not hand-annotated; the annotation is simply the output of the data-driven parser. Although the output contained errors, the number of false positives due to errors in the parse tree was unexpectedly low.
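For readers without access to Grew, the structural condition in (27) can be approximated in a few lines of Python. This is a sketch of the same pattern, not the query that was actually run: the `Token` tuple, the function `matches_query_27` and the placeholder word forms are all hypothetical.

```python
from collections import namedtuple

# Minimal token record; field names loosely follow the CoNLL-U columns.
Token = namedtuple("Token", "id form upos head deprel")


def matches_query_27(sentence):
    """True if some VERB has both an AUX child (aux:ss) and a NOUN child (obj)."""
    for verb in sentence:
        if verb.upos != "VERB":
            continue
        children = [t for t in sentence if t.head == verb.id]
        has_ss = any(t.upos == "AUX" and t.deprel == "aux:ss" for t in children)
        has_obj = any(t.upos == "NOUN" and t.deprel == "obj" for t in children)
        if has_ss and has_obj:
            return True
    return False


# Placeholder sentences with made-up forms, used only to exercise the function.
hit = [
    Token(1, "verb", "VERB", 0, "root"),
    Token(2, "-ss", "AUX", 1, "aux:ss"),
    Token(3, "noun", "NOUN", 1, "obj"),
]
miss = [
    Token(1, "verb", "VERB", 0, "root"),
    Token(2, "-ss", "AUX", 1, "aux:ss"),
    Token(3, "noun", "NOUN", 1, "nsubj"),
]
print(matches_query_27(hit), matches_query_27(miss))
```

Note that the pattern says nothing about linear order, so a further check on token positions would be needed to isolate cases where the status suffix directly precedes the object.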
The result is that we discovered a series of examples with structures that have not yet been considered in the literature on status suffixes, including direct counterexamples to Henderson (2012). For instance, we see in the following example a directional bearing the phrase-final dependent status suffix -oq. Yet, the directional is not at a clause boundary or before a functional head that embeds a clause. Instead, it occurs before a reflexive pronoun, which in K'iche' is a relational noun construction.
… e kakimiq' ukoq kib'.
… e ka-ø-ki-miq' uk-oq k-ib'.
… B3PL they warm DIR-SS themselves.
[Dependency relations in the tree: acl, nsubj, advmod, obj]
An example like Kekanaj kan kuk' chila' [e kakimiq' ukoq kib']. "They remained over there with those [that were warming themselves]." (28) is intriguing because, while it is a counterexample, there are plausible stories one could tell. For instance, these reflexives are prosodic clitics. Perhaps the requirement that the status suffix be phrase-final ignores expressions that are prosodically deficient because they do not count as independent phonological words. While arguing for this account would take more work, the fact that we have very quickly found a theoretically interesting counterexample to a prominent generalisation in the literature shows the utility of the treebank for example mining.

Future work
We would like to investigate the use of enhanced dependencies to provide a more semantics-oriented encoding of relational nouns. For example, if we take example (23), we could envisage an enhanced obl link from the verb Xkunax 'was cured' to the semantic head of the agent phrase ixoqib' 'the women' (29), where we indicate the differences with respect to the basic tree in boldface. This would fall under Case information in the enhanced schema and would be an additional layer on top of the basic syntax. The process could be partially automated using the Grew tool.

(29) Xkunax le ali kumal ri ixoqibʼ.
X-ø-kuna-x le ali k-umal ri ixoq-ibʼ.
Was cured the girl by the women.
We also intend to expand the treebank and apply the lessons learnt and annotation solutions to other Mayan languages. This is a large group, and we would like to start with languages related to K'iche' such as Uspanteko and Kaqchikel.
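To make the proposed enhanced obl link for relational nouns more concrete, the sketch below derives it from a basic tree. It is a rough illustration under stated assumptions, not the planned Grew-based procedure: the tree for (23) is hand-built from the description in the text (the nsubj label and the part-of-speech tags are assumptions), and `Token` and `enhanced_obl_edges` are hypothetical names.

```python
from collections import namedtuple

Token = namedtuple("Token", "id form upos head deprel")

# Hand-built approximation of the basic tree for (23). The relational noun
# kumal is attached as obl and its complement as nmod, as described in the
# text; the remaining labels and UPOS tags are assumptions.
SENT_23 = [
    Token(1, "Xkunax", "VERB", 0, "root"),
    Token(2, "le", "DET", 3, "det"),
    Token(3, "ali", "NOUN", 1, "nsubj"),
    Token(4, "kumal", "NOUN", 1, "obl"),
    Token(5, "ri", "DET", 6, "det"),
    Token(6, "ixoqibʼ", "NOUN", 4, "nmod"),
]


def enhanced_obl_edges(sentence):
    """Return extra (head_id, dependent_id, label) edges for the enhanced layer.

    For every nmod complement of a noun attached as obl, add an obl edge
    from the governor of that noun directly to the complement.
    """
    by_id = {t.id: t for t in sentence}
    edges = []
    for comp in sentence:
        if comp.deprel != "nmod":
            continue
        rel_noun = by_id.get(comp.head)
        if rel_noun is None or rel_noun.deprel != "obl":
            continue
        edges.append((rel_noun.head, comp.id, "obl"))
    return edges


print(enhanced_obl_edges(SENT_23))
```

As written, the rule would also fire for ordinary possessed nouns used as obliques, so in practice it would need to be restricted to relational nouns, for instance with a lemma list.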

Concluding remarks
We have presented the first syntactically annotated corpus of sentences in K'iche'. Both the corpus and the documentation of the annotation scheme are freely available through the Universal Dependencies project. It is our hope that the work we describe here will facilitate the annotation of, and promote language technology for, other Mayan languages.