Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations

Interlinear Glossed Text (IGT) is a widely used format for encoding linguistic information in language documentation projects and scholarly papers. Manual production of IGT takes time and requires linguistic expertise. We attempt to address this issue by creating automatic glossing models, using modern multi-source neural models that additionally leverage easy-to-collect translations. We also explore cross-lingual transfer and a simple output length control mechanism to further refine our models. Evaluated on three challenging low-resource scenarios, our approach significantly outperforms a recent, state-of-the-art baseline, particularly improving on overall accuracy as well as lemma and tag recall.


Introduction
The under-documentation of endangered languages is a pressing problem for the linguistic community. Of the 7,000 languages in the world, an estimated 50% will face extinction in the coming decades, while around 35-42% still remain substantially undocumented (Austin and Sallabank, 2011; Seifart et al., 2018). Perhaps these numbers are no surprise when one acknowledges the difficulty of language documentation: fieldwork demands time, cultural understanding, linguistic expertise, and financial support, among a myriad of other factors. Consequently, an important task for the linguistic community is to facilitate the otherwise daunting documentation process as much as possible. Documentation is not a cure-all for language loss, but it is an important part of language preservation; even in an unfortunate worst-case scenario where a language does disappear, a permanent record of the language is saved for posterity, and could hopefully be useful for revitalization purposes.
Documentation written for a Language A in a Language B necessarily involves careful selection of example data from Language A that illustrate various aspects of Language A's grammar. These data are often provided in a semi-structured format known as interlinear glossed text (IGT). IGT consists of three lines: the first line is the original data from Language A, the third line is its translation into Language B, and the second line, in between, is a morpheme-by-morpheme gloss, i.e. an annotation of the Language A data. This second line illustrates the morphological structure of Language A to the reader, who is not necessarily familiar with the grammar of Language A. An example is outlined in Figure 1. IGT helps linguists better understand how other languages work without prior knowledge of those languages or their grammars.
However, manual segmentation and IGT annotation is still time- and labor-consuming work, as it requires linguistic training. Therefore, language documentation projects typically manifest a yawning gap between the amount of material recorded and archived and the amount of data that is thoroughly analyzed with morphological segmentation and gloss (Seifart et al., 2018). This gap can be filled using automatic approaches, which could at least accelerate the annotation process by providing high-quality first-pass annotations. Previous approaches to automatic gloss generation include manual rule crafting and deep rule-based analysis (Bender et al., 2014; Snoek et al., 2014), treating the glossing task as a classification problem focusing only on the morphological tags (Moeller and Hulden, 2018) and requiring a lexicon for stems (Samardžić et al., 2015), and using models based on Conditional Random Fields (CRF) integrated with translation and POS-tagging information (McMillan-Major, 2020). In contrast, our approach is, to our knowledge, the first to show that modern neural systems are a viable solution for the automatic glossing task, without requiring any additional components or making unrealistic assumptions regarding data or NLP tool availability for low-resource languages.
We rely on the observation that parallel corpora with transcription and translation are likely to be available for many low-resource languages, since knowledge of the two languages is sufficient for translating the corpus without the need for linguistic training. Documentation approaches relying on parallel audio collection (Bird et al., 2014) are in fact already underway in the Americas (Jimerson and Prud'hommeaux, 2018) and Africa (Rialland et al., 2018; Hamlaoui et al., 2018), among other places. An additional advantage of parallel corpora is that they contain rich information that can be beneficial for gloss generation. As Figure 1 outlines, the stems/lemmas in the analysis are often hiding in the translation, while the grammatical tags could be derived from the segments in the transcription. We hypothesize that the information from the translation can further ground the gloss generation, and especially allow a system that properly takes it into account to generalize and produce lemmas or stems unseen during training.
In this work we propose an automated system which creates the hard-to-obtain gloss from an easy-to-obtain parallel corpus. We use deep neural models, which have driven recent impressive advances in all facets of modern natural language processing (NLP). Our model for automatic gloss generation uses multi-source transformer models, combining information from the transcription and the translation, significantly outperforming previous state-of-the-art results on three challenging datasets in Lezgian, Tsez, and Arapaho (Arkhangelskiy, 2012; Abdulaev and Abdullaev, 2010; Kazeminejad et al., 2017). Importantly, our approach does not rely on any additional annotations other than plain transcription and translation, and makes no assumptions about the gloss tag space. We further extend our training recipes to include necessary improvements that deal with data paucity (utilizing cross-lingual transfer from similar languages) and with the specific characteristics of the glossing task (presenting solutions for output length control).
Our contributions are three-fold: the object language line and the translation line, hence the name "interlinear gloss". Ex. 1 shows an example of IGT for a sentence from our Tsez dataset.
(1) Tsez
Lexical morphemes such as open-class words and their stems are simply glossed as their translations. On the other hand, functional morphemes such as inflectional affixes and closed-class words are glossed with the grammatical categories or function that they encode. Glosses for lexical morphemes are referred to as STEMs and those for functional morphemes are referred to as GRAMs.
In Ex. 1, the verb bok'ek'asi consists of the class I plural prefix b-, the stem ok'ek' and the resultative participle suffix -asi. The stem, being a lexical morpheme, is glossed as its English translation steal. The prefix and the suffix, on the other hand, are functional morphemes, so they are glossed as the collections of grammatical categories they encode. The prefix b- is glossed as the combination of I (for class I; Tsez has four noun classes, I-IV) and PL (for plural), and the suffix is glossed as RES.PRT (for resultative participle). When a morpheme is glossed as a collection of multiple GRAMs, they are separated by periods or some other delimiting punctuation.
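The structure of such a word gloss can be made concrete with a small parser. The sketch below is only illustrative: the gloss string `I.PL-steal-RES.PRT` is assembled from the description of Ex. 1, the function name `parse_word_gloss` is hypothetical, and the rule "all-capital labels are GRAMs" is an assumed heuristic rather than anything the datasets guarantee (a stem that happens to be written in capitals would be mislabeled).

```python
import re


def parse_word_gloss(word_gloss: str) -> list[dict]:
    """Split one word's gloss into morpheme glosses and label each STEM or GRAM."""
    morphemes = []
    for piece in word_gloss.split("-"):      # hyphens separate morphemes
        parts = piece.split(".")             # periods separate stacked GRAMs
        # Assumed heuristic: GRAM labels consist only of capitals and digits.
        is_gram = all(re.fullmatch(r"[A-Z0-9]+", p) for p in parts)
        morphemes.append({
            "gloss": piece,
            "kind": "GRAM" if is_gram else "STEM",
            "parts": parts if is_gram else [piece],
        })
    return morphemes


# Gloss of bok'ek'asi as described in the text: b- / ok'ek' / -asi
parsed = parse_word_gloss("I.PL-steal-RES.PRT")
for m in parsed:
    print(m["kind"], m["parts"])
```

Keeping the GRAM components as a list makes it easy to score partially correct tags separately from stems later on.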
3 Multi-Source Transformer for Gloss Generation

Problem Formulation and Model
Our model is built upon the Transformer model (Vaswani et al., 2017), a self-attention-based sequence-to-sequence (seq2seq) neural model. Compared to the CRF model used in McMillan-Major (2020), which can only capture local dependencies, a self-attention model can produce context-sensitive hidden representations that take the whole input into account. Moreover, unlike recurrent seq2seq models such as bidirectional LSTMs, the Transformer shows more robust performance in morphology-related tasks under low-resource settings (Ryan and Hulden, 2020). Our architecture choice is also motivated by the promising performance of the model along with its computational efficiency.
The original Transformer is composed of a single encoder and a decoder, each with several layers. Each encoder layer consists of a multi-head self-attention layer and a fully connected feed-forward network, while decoder layers are additionally augmented with multi-head attention over the output of the encoder stack. Our model adds a second encoder to create a multi-source transformer similar to those of Zoph and Knight (2016) and Anastasopoulos and Chiang (2018), in order to incorporate the secondary information from the translation. A visual depiction of our model is outlined in Figure 2.
Let $X^1 = x^1_1 \ldots x^1_N$ be a sequence of transcription words, $X^2 = x^2_1 \ldots x^2_M$ a sequence of translation words, and $Y = y_1 \ldots y_K$ the target gloss sequence. A single-source gloss generation model attempts to model $P(Y \mid X^1)$.
A multi-source model can jointly model $P(Y \mid X^1, X^2)$, and thus we need two encoders (see Figure 2b). One encoder transforms the input transcription sequence $X^1$ into a sequence of input states, and the other does the same for the input translation sequence $X^2$. An attention mechanism transforms the two sequences of input states into a sequence of summed context vectors via two matrices of attention weights.
Figure 2: In the standard single-source model, the decoder attends to a single encoder's states. In our multi-source setup, we have two input sequences encoded by two different encoders, and the attention mechanisms provide two contexts to the decoder. Note that for clarity's sake there are dependencies not shown.
Finally, the decoder computes a sequence of output states from which a probability distribution over output stems/tags can be computed.
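The displayed equations for this formulation did not survive in the text above. One standard way to write a multi-source attention model that matches the prose is sketched below. Only the attention weights $\alpha^1_{kn}$ appear elsewhere in the text; the symbols $h$, $c_k$, $s_k$, $W$ and the function names $\mathrm{enc}$ and $\mathrm{dec}$ are assumptions chosen for illustration.

```latex
\begin{align}
H^1 &= h^1_1 \ldots h^1_N = \mathrm{enc}^1(X^1), &
H^2 &= h^2_1 \ldots h^2_M = \mathrm{enc}^2(X^2) \\
c_k &= \sum_{n=1}^{N} \alpha^1_{kn}\, h^1_n \;+\; \sum_{m=1}^{M} \alpha^2_{km}\, h^2_m \\
s_k &= \mathrm{dec}(s_{k-1}, c_k, y_{k-1}), &
P(y_k \mid y_{<k}, X^1, X^2) &= \mathrm{softmax}(W s_k)
\end{align}
```

Here $\alpha^1$ is the $K \times N$ matrix of attention weights over the transcription states and $\alpha^2$ the $K \times M$ matrix over the translation states, so each context vector $c_k$ is the sum of one weighted average per encoder.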

Inference
As standard in sequence-to-sequence tasks, we use beam search to search the output space for the most likely output gloss sequence. In addition, though, we incorporate enhancements to handle the particular nature of the gloss generation task.
Length control Unlike other text generation tasks where the generated text can be relatively free in word order, the gloss must map to the transcription morpheme-by-morpheme or word-by-word, depending on the intended granularity. One drawback of using a seq2seq model for the gloss generation task, compared to e.g. a CRF-based approach like that of McMillan-Major (2020), is that a hard constraint of "one output per input" is not enforced by or hard-coded in the model. Even though structural biases (Cohn et al., 2016) such as hard monotonic attention, or source-side coverage mechanisms (Tu et al., 2016; Mi et al., 2016), could remedy this potential issue, we found that there was little need for them, as a simple mechanism to control the final output length during inference was sufficient. The intuition lies in the observation that the length of the output gloss should match that of the input transcription exactly. Hence, during inference we set a minimum desired length of the output sequence, and disallow any candidates shorter than that.
Alignment between gloss and transcription To ensure fair evaluation against the baseline and other models, we need to be able to produce the exact mapping of the output gloss to the input transcription (we discuss the reasoning in Section 4 on evaluation). Luckily, this information lies in the cross-attention weights. Even though neural attention has been reported to behave differently than traditional statistical MT alignments (Ghader and Monz, 2017), we find that the cross-attention between the decoder and the transcription encoder is indeed monotonic and can function as such an alignment. For each output $y_k$, we align it to a single source word $x_i$ such that $i = \arg\max_n \alpha^1_{kn}$.
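Both inference-time mechanisms are simple enough to sketch in a few lines. The code below is a minimal illustration and not the toolkit's implementation: the function names, the toy vocabulary with end-of-sequence at index 0, and the choice to express the minimum length in decoder steps are all assumptions.

```python
import math


def mask_eos_until_min_length(log_probs: list[float], eos_id: int,
                              step: int, min_length: int) -> list[float]:
    """Forbid end-of-sequence while fewer than min_length tokens were emitted.

    `step` is the number of tokens already generated for this hypothesis.
    """
    if step < min_length:
        masked = list(log_probs)
        masked[eos_id] = -math.inf   # candidate "stop here" can never be picked
        return masked
    return log_probs


def align_to_source(attention: list[list[float]]) -> list[int]:
    """attention[k][n] is the weight output k puts on source position n.

    Returns, for every output position k, the argmax over n.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


# Hypothetical next-token scores over a 3-item vocabulary; index 0 is EOS.
scores = [-0.1, -2.3, -3.0]
masked = mask_eos_until_min_length(scores, eos_id=0, step=1, min_length=3)

# Hypothetical decoder-to-transcription attention for 3 outputs, 3 inputs.
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.05, 0.15, 0.8]]
alignment = align_to_source(attn)
```

In a beam search, the mask would be applied to every hypothesis before candidates are ranked, with `min_length` set from the input transcription, so a beam can only finish once it is at least that long.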

Evaluating Gloss Generation
The characteristics of gloss generation require special care rather than blindly using metrics established for other tasks like machine translation. Previous work uses:
• Accuracy: percentage of correct (full) analyses for each token. It is the main metric used in previous work (Samardžić et al., 2015; McMillan-Major, 2020).
• BLEU (Papineni et al., 2002): an average of n-gram precision combined with a brevity penalty, BLEU is perhaps the most popular reference-based machine translation metric. Since our models are inspired by MT, we use it as another indication of quality, as it captures accuracy/precision over n-grams, even though the rest of the metrics are more suitable to the automatic glossing task.
• Precision/Recall: We further break down the evaluation to focus separately on lemmas and tags. Several previous works prioritize precision over recall, especially by not outputting tags if items are not seen during training (e.g. Moeller and Hulden, 2018).
• Error Rate: A normalized edit distance between the output and the reference. We consider error rate a good indicator of the overall quality of the generated sequences, in contrast to more fine-grained metrics such as precision/recall or accuracy. It is also the most reasonable metric if the model is used for computer-assisted glossing where linguists do post-editing.
Comparing all these metrics on our results, we find that unsurprisingly they heavily correlate with each other. We measure correlation with Spearman's rank coefficient, and all correlations end up with coefficients higher than 0.85 (p < 0.0001).
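Two of the metrics above, accuracy and error rate, can be sketched directly. This is an illustrative version under stated assumptions: the edit distance is normalized by the reference length (as in word error rate), accuracy assumes the prediction is already aligned position-by-position with the reference, and the glosses in the example are invented.

```python
def token_accuracy(pred: list[str], gold: list[str]) -> float:
    """Fraction of reference positions whose full gloss is reproduced exactly."""
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(pred: list[str], gold: list[str]) -> float:
    """Edit distance normalized by reference length (assumed normalization)."""
    return edit_distance(pred, gold) / len(gold)


# Invented word-level glosses, for illustration only.
gold = ["I.PL-steal-RES.PRT", "man-PL", "come-PST"]
pred = ["I.PL-steal-RES.PRT", "man", "come-PST"]
acc = token_accuracy(pred, gold)
err = error_rate(pred, gold)
```

Error rate stays well defined when the output and reference differ in length, which is one reason it suits post-editing scenarios.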

Experimental Settings
Languages We use three languages as the testbed for evaluating our approach: Arapaho, Lezgian and Tsez. They provide challenging scenarios that differ in the amount of training data available (details are listed in Table 1) and in the complexity of the produced gloss, owing to the complex morphosyntax these languages exhibit. We randomly shuffle and split each dataset into train/validation/test sets with a ratio of 8:1:1.
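The 8:1:1 split can be reproduced along the following lines. This is a sketch only: the fixed seed and the function name are assumptions, and the exact handling of remainders in the original split is not stated.

```python
import random


def split_8_1_1(examples: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle a copy of the data and cut it into train/validation/test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # assumed seed, for repeatability
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # remainder goes to the test set
    return train, valid, test


train_set, valid_set, test_set = split_8_1_1(list(range(1000)))
```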
Arapaho is an Algonquian language spoken in and around Wyoming and western Oklahoma, U.S. With only around 250 fluent speakers, it is very much an endangered language that requires immediate attention and effort for documentation and preservation. Being highly polysynthetic, Arapaho makes heavy use of verb incorporation with morphemes for tense, aspect, modality, evidentiality, and adverbial information, whose (de)incorporation is determined by pragmatic factors such as saliency and emphasis (Cowell et al., 2008). We use the corpus compiled by Kazeminejad et al. (2017), which contains more than 20,000 glossed sentences with English translations.
Lezgian is a member of the Lezgic branch of the Nakho-Daghestanian language family, also known as the (North)east Caucasian family. It is spoken by about 400,000 speakers in southern Dagestan and northern Azerbaijan in the eastern Caucasus. It has a rich consonant inventory with 54 members and an agglutinative morphology. The language has 18 nominal cases, features case-stacking, head-final syntax, ergative-absolutive alignment and no noun-verb agreement (Haspelmath, 1993;Moeller and Hulden, 2018). We use the corpus compiled by Arkhangelskiy (2012), comprised of slightly over 1,000 IGT examples, with free translations in English.
Tsez (Dido) is a member of the Tsezic group of the Nakho-Daghestanian language family, and is thus a close relative of Lezgian. Spoken by 12,467 speakers in Dagestan according to a 2010 Russian census, Tsez also features head-final syntax, ergative-absolutive alignment, rich suffixing morphology, an impressive inventory of cases and a variety of strategies for converb formation (Comrie and Polinsky, forthcoming). We use the Tsez Annotated Corpus compiled by Abdulaev and Abdullaev (2010), which includes almost 1,800 glossed utterances with translations in both Russian and English. We report results using the English translations: the glosses themselves use English stems, and accordingly performance with the Russian translations was worse in preliminary experiments.
Word-level vs. Morpheme-level Settings For each of the language datasets, we have access to gold (hand-created) morpheme-level gloss annotations mapped to morphologically segmented transcriptions. This allows us to evaluate models under two distinct settings, for which we report results separately:
1. Word-level without gold segmentation: this setting (referred to as word-level hereinafter) closely matches the realistic scenario where we do not have access to gold segmentation of the utterance. This setting is, as a result, more challenging, as the model needs to additionally infer a segmentation. The evaluation in this setting is also performed at the word level (with regard to BLEU and Error Rate), such that the first target for the example in Figure 1 would be "you-GEN1" as a single unit.
2. Morpheme-level with gold segmentation: in this setting (which we will refer to as morpheme-level) we have access to the gold segmentation of the transcription, hence the glossing task consists of simply providing the correct stems or tags for each segment. Accordingly, we perform the evaluation at the morpheme level: the first two evaluation units from the Figure 1 example would be "you" and "-GEN1", separated. This setting provides somewhat of an oracle score, one that would be achievable if a linguist or the community provided correct segmentations for the transcriptions, or if a morphological segmentation tool were available for that language.
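The difference between the two settings comes down to which units are scored. A minimal sketch, using the "you-GEN1" example from the text; the function names are hypothetical, and attaching the hyphen to every non-initial morpheme is a simplification (prefixes would conventionally carry the hyphen on their right).

```python
def morpheme_units(word_gloss: str) -> list[str]:
    """Split one word gloss into morpheme units: 'you-GEN1' -> ['you', '-GEN1']."""
    first, *rest = word_gloss.split("-")
    return [first] + ["-" + m for m in rest]


def evaluation_units(gloss_line: str, level: str) -> list[str]:
    """Return the units that are compared against the reference."""
    words = gloss_line.split()
    if level == "word":
        return words                                   # whole word glosses
    if level == "morpheme":
        return [u for w in words for u in morpheme_units(w)]
    raise ValueError(f"unknown level: {level!r}")


word_units = evaluation_units("you-GEN1", "word")
morph_units = evaluation_units("you-GEN1", "morpheme")
```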
Transliteration Cross-lingual training between typologically related languages has shown promising results in several NLP tasks, especially in low-resource settings (McCarthy et al., 2019; Anastasopoulos and Neubig, 2019). Two of our evaluation languages, namely Lezgian and Tsez, are fairly similar, as they are both members of the Nakho-Daghestanian language family, and as such are ideal for cross-lingual transfer. However, Anastasopoulos and Neubig (2019) pointed out that cross-lingual learning can be impeded if the languages do not use the same script, even if they are closely related genealogically. Lezgian is written in Cyrillic script while Tsez is written in Latin script. To maximally exploit the power of cross-lingual training, we transliterated Lezgian from Cyrillic script to Latin script, and transliterated Tsez from Latin script to Cyrillic script. 5 With the original and the transliterated versions of the training data at hand, we combine them during training into a single training set for the LANGUAGE TRANSFER model. The evaluation is of course performed on the original test sets in their original scripts.
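The data combination step can be sketched as follows. The character table here is a toy stand-in written for this example; it is not the mapping of the `transliterate` package, and it does not cover the real Lezgian or Tsez orthographies. The function names and the placeholder strings in the usage lines are likewise hypothetical.

```python
# Toy one-to-one table for illustration only (assumption, see lead-in).
CYR_TO_LAT = {"а": "a", "б": "b", "д": "d", "и": "i", "к": "k",
              "м": "m", "н": "n", "с": "s", "у": "u"}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}

Example = tuple[str, str, str]   # (transcription, translation, gloss)


def map_script(text: str, table: dict[str, str]) -> str:
    """Replace every character found in the table; leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in text)


def build_transfer_training_set(lezgian: list[Example],
                                tsez: list[Example]) -> list[Example]:
    """Pool both languages, each in its original and its transliterated script."""
    combined = []
    for src, translation, gloss in lezgian:
        combined.append((src, translation, gloss))
        combined.append((map_script(src, CYR_TO_LAT), translation, gloss))
    for src, translation, gloss in tsez:
        combined.append((src, translation, gloss))
        combined.append((map_script(src, LAT_TO_CYR), translation, gloss))
    return combined


# Placeholder strings, not real corpus entries.
combined = build_transfer_training_set(
    lezgian=[("бака", "translation-1", "gloss-1")],
    tsez=[("musa", "translation-2", "gloss-2")],
)
```

Only the transcription side is mapped; translations and glosses are shared between the original and transliterated copies.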

Implementation
We base our implementation on the Joey-NMT toolkit 6 (Kreutzer et al., 2019), which we extended to support multi-source transformer models. 7 The transcription and translation input sentences can be represented at different granularities: either at the word level or at the more recently popular sub-word level. For simplicity we leave this detail out of the results tables, reporting results with the better-performing option in each case. It is worth noting, though, that for Tsez and Arapaho the sub-word representations (obtained using byte-pair encoding (BPE) 8 ) always lead to better results. For the much smaller Lezgian dataset, we saw no difference between sub-word and word-level models, but this lack of difference can be explained by the overall very small size of the vocabulary for the Lezgian dataset.
For training all Lezgian and Tsez models and the Arapaho model with the subsampled 2,000 training sentences, we use 2 layers for both encoders and the decoder and 2 attention heads. All the embedding and hidden state dimensions are set to 128. We use a batch size of 20. For training the Arapaho model on the original larger dataset, we use 4 layers for all encoders and the decoder, with 4 attention heads. The embedding and hidden state dimensions are 256, and the batch size is 50. For all models, the learning rate is initialized to 0.0005 and we optimize with Adam (Kingma and Ba, 2015). We also use dropout (Srivastava et al., 2014) with p = 0.3. We set the early stop criterion to a minimum learning rate of 1.0e-6. In the end, Lezgian models trained for 3000 epochs, and Tsez and Arapaho models trained for 1100 epochs without reaching that early stop criterion. For inference, we use beam search with a beam size of 5. We note that we did not perform any grid search over the hyperparameter space, which leaves room for further improvements in future work.
5 We use the transliterator provided by https://pypi.org/project/transliterate/.
6 https://github.com/joeynmt/joeynmt
7 Our code will be open-sourced at https://github.com/yukiyakiZ/Automatic_Glossing.
8 We use the sentencepiece implementation of the BPE method (Sennrich et al., 2016) with a vocabulary size of 2000 for Lezgian, 2500 for Tsez, and 10000 for Arapaho.
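For quick reference, the two hyperparameter presets described in this section can be collected in one place. The values are taken from the text; the class, field, and function names are hypothetical and do not correspond to Joey-NMT configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    num_layers: int            # per encoder and for the decoder
    num_heads: int
    hidden_dim: int            # embedding and hidden state size
    batch_size: int
    learning_rate: float = 0.0005
    dropout: float = 0.3
    min_learning_rate: float = 1.0e-6   # early stop criterion
    beam_size: int = 5


# Lezgian, Tsez, and Arapaho subsampled to 2,000 sentences
SMALL = TrainingConfig(num_layers=2, num_heads=2, hidden_dim=128, batch_size=20)
# Arapaho trained on the original larger dataset
LARGE = TrainingConfig(num_layers=4, num_heads=4, hidden_dim=256, batch_size=50)


def config_for(language: str, subsampled: bool = False) -> TrainingConfig:
    """Pick the preset the text describes for a given dataset."""
    if language.lower() == "arapaho" and not subsampled:
        return LARGE
    return SMALL
```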

Baseline
We use the model by McMillan-Major (2020) as our baseline, since it is the most recent statistical model for gloss generation. To our knowledge, it is also the only previously proposed approach that leverages both source and translation information to generate glosses. As required by the baseline model, we first pre-processed the data by converting them to Xigt format (Goodman et al., 2015) and then enriched the data using the INTENT system (Georgi, 2016). This enriching step creates a heuristic alignment between the transcription and its translation, which is then used to project Part-of-Speech and morpheme tag information from the translation to the corresponding morpheme. This additional information is used through heuristic post-editing: an out-of-vocabulary word, for instance, is assumed to be a stem and the aligned translation word is used as the output gloss stem.
For Lezgian and Tsez, we train and evaluate the baseline model using the exact same training and testing data as for our models. For the much larger Arapaho dataset, we were unable to train the baseline using all 25 thousand training examples; the released code does not support GPU processing, and after 7 days of training less than 50% of the training procedure had been completed. Instead, we subsampled the Arapaho training data to only 2,000 examples.
For a fair and complete comparison, we report our system's performance trained on both that subsampled training set and the complete one.

Results and Analysis
A summary of our main results on the three languages in the realistic word-level scenario (without gold segmentation) is outlined in Table 2, reporting all metrics discussed in Section 4. To determine the gloss corresponding to each morpheme when calculating accuracy, precision and recall, we use the attention-based alignment discussed in Section 3.2.
Our models significantly outperform the baseline across evaluation metrics and datasets, with one exception: the results on the subsampled Arapaho dataset are less conclusive (the baseline achieves slightly higher BLEU and lower WER), but both models perform poorly in this very challenging setting to begin with. With an order of magnitude more data, our model's outputs are exceptional, whereas the baseline could not be trained on the full dataset.
In both Tsez and Arapaho, the multi-source approach yields significantly better performance on both lemma and grammatical tag generation than a single-source model. In Lezgian, however, we do not observe any substantial differences between the two models, which we suspect is because the very limited training data may not be sufficient for the model to learn a good gloss-translation cross-attention. Nevertheless, the improvements in Tsez and Arapaho indicate that our model is indeed able to leverage the information provided by the translation to further aid in automatic glossing.
Table 2: Results on the realistic scenario where no source-side segmentation information is available (word-level transcription). Our approach outperforms the baselines in all datasets. We highlight the best results in each dataset and metric.

Results with Gold Segmentation In Table 3 we present the same set of results, but this time using the morpheme-level transcription with the gold segmentation. The first thing to note is that the outputs are now better across the board for all metrics and for all models. This indicates that proper segmentation remains a challenge and that the creation of segmentation tools is a valuable endeavor. Nevertheless, even in cases where the baseline is particularly strong when provided with the oracle segmentation, as in the Tsez dataset, its recall in particular lags behind the recall of our approach, for both lemmas and tags.
Cross-Lingual Transfer Cross-lingual transfer significantly improves performance, especially for Lezgian, which only has 951 training sentences before data augmentation using transliteration and cross-lingual training, leading to a 25% reduction in morpheme accuracy error and a 27% error reduction in WER (cf. Table 2). Upon manual inspection of the outputs, we find that cross-lingual transfer helps the model make fewer mistakes when predicting stems. As an example, one test sentence with the gold gloss kazakh nation was is glossed correctly with cross-lingual transfer, but without it, it is glossed as 1pl.abs is-PST was. There may be overlaps between the Tsez and Lezgian vocabularies, which may be contributing to the improved stem predictions. Cross-lingual transfer also helps in the higher-resource Tsez setting, with relatively smaller improvements. Unlike the case with Lezgian, we do not observe much improvement in stem prediction after employing cross-lingual transfer.
Overall, the improvements we obtain do show that cross-lingual transfer between related languages is quite a promising direction for future research.
Error Analysis For any model to produce correct IGT given input in some language, it is essential that the model know the morphology of that language. A data-driven machine learning model must rely on the segmentation in the training data as its only clue to understanding the morphological structure of the object language. This implies that the model is highly subject to any bias present in the training data annotation. There is considerable variation in glossing style depending on the language, the linguist, and what the gloss is intended to illustrate. It is not always possible to follow suggested standards for glossing such as the Leipzig Glossing Rules (Bickel et al., 2008).
Table 4: Four glosses for the same Tsez sentence: gold gloss, baseline prediction with morpheme-level (gold) segmentation, and outputs by our best model without and with gold segmentation.
One bias in the glosses for the so-called "non-distal local cases" in the Tsez dataset is of particular interest. Tsez features 28 non-distal local cases, which are combinatorially formed by selecting one out of the 7 locational series (in-, cont-, super-, sub-, ad-, apud- and poss-) and one out of the 4 directional series (-essive, -lative, -ablative and -versative). Each locational component is associated with a morpheme; so is each directional component. The local case marking is formed by concatenating the morphemes corresponding to each component. For example, the superablative case marking -ň'aaj is formed by the "super-" morpheme -ň' and the "-ablative" morpheme -aaj (Comrie and Polinsky, forthcoming). This means the superablative marking can be glossed with the two morphemes separate, as in -ň'-aaj '-SUPER-ABL', or concatenated, as in -ň'aaj '-SUPER.ABL'. Our Tsez dataset follows the latter convention. However, this makes the compositionality of the local case morphemes opaque to a naive language model: the model does not know that -ň'aaj can be further broken apart into -ň' and -aaj.
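The combinatorics of the local cases, and the two glossing conventions, can be spelled out in a few lines. The tags SUPER, CONT, ABL and ESS appear in the text; the remaining abbreviations (IN, SUB, AD, APUD, POSS, LAT, VERS) are assumed here by analogy and may differ from the labels used in the corpus.

```python
from itertools import product

# 7 locational series and 4 directional series, as listed in the text.
LOCATIONAL = ["IN", "CONT", "SUPER", "SUB", "AD", "APUD", "POSS"]
DIRECTIONAL = ["ESS", "LAT", "ABL", "VERS"]


def local_case_labels(fused: bool = True) -> list[str]:
    """Enumerate every locational x directional combination.

    fused=True  -> one opaque tag per case, e.g. 'SUPER.ABL' (dataset style)
    fused=False -> two separately glossed morphemes, e.g. 'SUPER-ABL'
    """
    sep = "." if fused else "-"
    return [f"{loc}{sep}{direc}" for loc, direc in product(LOCATIONAL, DIRECTIONAL)]


fused_labels = local_case_labels(fused=True)
split_labels = local_case_labels(fused=False)
print(len(fused_labels))
```

With fused tags a model has to learn 28 unrelated labels, whereas the split convention exposes only 7 + 4 = 11 reusable components.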
To examine this, we present a Tsez example containing a superessive ('SUPER.ESS') marking and compare four glosses: the gold gloss, the baseline prediction, and the outputs of our best model (multi-source with cross-lingual transfer and length control) with and without the gold segmentation. The glosses, along with the original Tsez sentence and its English translation, are provided in Table 4. The baseline model is provided with morphologically segmented input, yet it incorrectly glosses the superessive as a contessive ('CONT.ESS'). This is understandable, because morpheme-level segmentation in this dataset does not elucidate the compositionality of the local case morphemes. This problem can potentially be remedied by providing subword-level input, which can approximate finer morphological segmentation. As such, the superessive is correctly glossed by our model even when it is provided input without gold segmentation but, crucially, with BPE.
As BPE is only an approximation of morphological segmentation, it is not a perfect solution. The same model incorrectly attaches QUOT, i.e. the quotative particle, to the end of nesi-ň' 'DEM1.ISG.OBL-SUPER.ESS', one word before the correct destination Musa 'Musa'. Upon inspection, we found that the input word Musaňin is segmented into Mus-aňin after BPE, and the occurrences of aňin in the training set are overwhelmingly combinations of -a and ňin 'QUOT', where -a is a verbal suffix. In contrast, Musa 'Musa' is a proper name. The model cannot deduce that ňin is its own morpheme and may appear after nouns when it has only seen it in tandem with a verbal suffix which, obviously, appears only after verbs.

Related Work
Several works have studied the automated IGT generation task (Samardžić et al., 2015; Moeller and Hulden, 2018; McMillan-Major, 2020). They mainly used machine learning methods such as CRF and SVM to generate glosses and proposed a series of heuristic post-editing algorithms to improve performance. Some of this work combined machine labeling and active learning for creating IGT. Moeller and Hulden (2018) tested LSTMs for predicting the morphological labels within glosses, but these underperformed CRF models on that task. McMillan-Major (2020) exploited parallel information in gloss generation. These models either depended on the assumption of a finite set of possible morphological tags, or required additional linguistic information and feature engineering such as POS tags and word alignment.

Conclusion
In this work, we make an initial attempt to leverage neural models with dual sources (the source transcription and its translation) for the automatic gloss generation task. We further extend our model with cross-lingual transfer to overcome data paucity, and with simple techniques to adapt to the characteristics of the glossing task. Our multi-source transformer-based model significantly outperforms previous work on various evaluation metrics in three low-resource languages. We also conduct both qualitative and quantitative error analyses. Future research directions include (i) combining a multi-source model and a multi-task model by taking both word-level transcription and translation as input, to generate morphologically segmented transcription and morpheme-level gloss, (ii) adding additional structural biases for attention, and (iii) data augmentation algorithms and heuristic post-editing methods.