Cognate Projection for Low-Resource Inflection Generation

We propose cognate projection as a method of crosslingual transfer for inflection generation in the context of the SIGMORPHON 2019 Shared Task. The results on four language pairs show the method is effective when no low-resource training data is available.


Introduction
In this description of the University of Alberta systems, we discuss our approach to Crosslingual Transfer for Inflection Generation (Task 1) in the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology (McCarthy et al., 2019). The task of inflection generation is to produce an inflected word-form given a lemma and a sequence of abstract morphological tags. For example, the Latin citation form fucō with the tag V;IND;FUT;3;SG should yield the form fucābit. The goal is to examine how best to do this in a cross-lingual setting.
We focus on depth over breadth, performing experiments on only four language pairs which represent a range of diachronic relationships. Kashubian is so closely related to Polish that it is sometimes viewed as a dialect. Occitan and Spanish are less closely related, but share many morphological features. Romanian evolved from Latin over the course of 1500 years. Hindi and Bengali are also related, but written in distinct scripts.
In order to alleviate the training data sparsity in the low-resource setting, we attempt to leverage external text corpora, from which we extract target language word lists for both inflection generation and cognate projection. The results show that this strategy improves the overall results for some of the tested language pairs. As our principal contribution, we propose and test the idea of performing cognate projection to leverage high-resource training data for low-resource inflection generation. The results demonstrate that an implementation of this concept can perform better than the baselines in the scenario when no low-resource inflection data is available.

Prior Work
Our methods build upon the prior work of the University of Alberta teams for three previous SIGMORPHON shared tasks on type-level morphological generation (Cotterell et al., 2016, 2017, 2018). We view inflection as a string transduction task. Our discriminative transduction models stem from the DIRECTL+ transducer of Jiampojamarn et al. (2008), which was originally designed for grapheme-to-phoneme conversion. Nicolai et al. (2016) apply discriminative string transduction to morphological reinflection. They show that the approach of Nicolai et al. (2015) performs well on typologically diverse languages. They also discuss language-specific heuristics and errors. Nicolai et al. (2017) combine a discriminative transduction system with neural models. The results on five languages show that the approach works well in the low-resource setting. Additionally, they propose adaptations designed to handle small training sets, such as tag re-ordering and particle processing. Najafi et al. (2018a) make further progress on the combination of neural and non-neural models for low-resource reinflection. Their best system obtains the highest accuracy on 34 out of 103 languages. They achieve additional improvements in accuracy by leveraging unannotated text corpora using the non-standard approaches of Nicolai et al. (2018) and Najafi et al. (2019).

Tools
In this section, we describe our two principal tools: DTLM for cognate projection and low-resource inflection generation, and OpenNMT for high-resource inflection generation.

DTLM
DTLM (Nicolai et al., 2018) combines discriminative transduction with character and word language models derived from large unannotated corpora, with the language-model features integrated into the transducer. DTLM employs a many-to-many alignment method, which is referred to as precision alignment. Nicolai et al. (2018) demonstrate that DTLM achieves superior results in low-data scenarios on several transduction tasks, including inflection generation, transliteration, phoneme-to-grapheme conversion, and cognate projection.
In the CoNLL-SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection (Cotterell et al., 2018), DTLM was our best performing individual system. It was also successfully used in the NEWS 2018 shared task on transliteration (Najafi et al., 2018b).

OpenNMT
OpenNMT (Klein et al., 2017) is an open-source neural machine translation tool based on a sequence-to-sequence model with an attention mechanism. Klein et al. (2017) demonstrate that OpenNMT generally achieves better translation quality than other existing open-source machine translation systems, and is fairly efficient in terms of training and test speed.
Machine translation models have been successfully applied to other transduction tasks (Kann and Schütze, 2016). We employ OpenNMT as a vanilla HR morphological inflection tool, by simply concatenating the lemma and the tags to form the input sequence. Each individual tag is encoded as a single input token. No target wordlists are used.
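The input encoding described above can be illustrated with a minimal sketch. The exact token format of our data files is not specified here, so the space-separated layout, the function names encode_source and encode_target, and the NFC normalization step are assumptions made for illustration only.

```python
import unicodedata


def encode_source(lemma: str, tags: str) -> str:
    """Build one source line: one token per lemma character, then one token per tag.

    Assumption: tokens are separated by spaces and tags are split on ';'.
    """
    # Normalize so that a letter with a diacritic (e.g. a macron) is one character.
    chars = list(unicodedata.normalize("NFC", lemma))
    tag_tokens = tags.split(";")
    return " ".join(chars + tag_tokens)


def encode_target(form: str) -> str:
    """Build one target line: one token per character of the inflected form."""
    return " ".join(unicodedata.normalize("NFC", form))


if __name__ == "__main__":
    print(encode_source("fucō", "V;IND;FUT;3;SG"))
    print(encode_target("fucābit"))
```

In this sketch a multi-character tag such as FUT is never broken into letters, which is the property the concatenation scheme relies on.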

Cognate Projection Methods
Each dataset in this shared task pairs a low-resource (LR) language with a related high-resource (HR) language. Genetically related languages share cognates, words with a common linguistic origin (St Arnaud et al., 2017). For example, the Latin word oculus 'eye' is cognate with the Romanian word ochi. Cognate pairs exhibit phonetic and semantic similarity (Kondrak, 2013). The correspondences between substrings in cognates tend to follow regular patterns (Kondrak, 2009).
Cognate projection, also referred to as cognate production (Beinborn et al., 2013; Ciobanu, 2016), is the task of predicting the spelling of a hypothetical cognate in another language. For example, the projection of oculus from Latin to Romanian should generate ochi. Even if a cognate word does not exist, cognate projection should produce a target form that incorporates the interlingual sound correspondences and the phonotactic constraints of the target language. We hypothesize that the projected forms exhibit some of the morpho-phonetic properties of the actual words.
For example, the projection of the Spanish verbal form tomaré ('I will take') into a (non-existent) Latin word tomābō could provide useful information for inflecting actual Latin verbs.
We propose two projection-based approaches for inflection generation which are based on the above hypothesis (Figure 1). We refer to those approaches as Data Projection and Instance Projection respectively. Both approaches aim at taking advantage of the HR inflection training data to perform LR inflection. Morphological tags are left unchanged. For cognate projection, we train transduction models (Section 3.1) on lists of cognate pairs extracted from small bitexts. The projection models are strengthened by target wordlists extracted from freely-available monolingual corpora.
The Data Projection approach simply projects the entire HR training data, which consists of lemmas and the corresponding inflected forms, into the LR language. For example, the Romanian training pair "dormi+V;3;SG = doarme" projects into Latin "dormio+V;3;SG = dormit". This produces a relatively large, synthetic LR training set from which an LR inflection model can be derived (Section 3.2). The underlying idea is that the HR inflection patterns may be reflected in the corresponding LR inflection patterns, especially if the languages are closely related.
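As a rough illustration of Data Projection, the sketch below maps every HR training triple into the LR language while leaving the tags untouched. The projection model is stood in for by a toy lookup table (TOY_RON_TO_LAT), which is purely hypothetical; in our systems this role is played by a trained DTLM model.

```python
from typing import Callable, List, Tuple

# One training example: (lemma, tag sequence, inflected form).
Example = Tuple[str, str, str]


def project_training_data(
    hr_data: List[Example], project: Callable[[str], str]
) -> List[Example]:
    """Project the lemma and the inflected form of each HR example into the LR language.

    The morphological tags are copied unchanged.
    """
    return [(project(lemma), tags, project(form)) for lemma, tags, form in hr_data]


# Hypothetical stand-in for a trained HR-to-LR projection model.
TOY_RON_TO_LAT = {"dormi": "dormio", "doarme": "dormit"}


def toy_project(word: str) -> str:
    """Look the word up in the toy table; fall back to copying it."""
    return TOY_RON_TO_LAT.get(word, word)


if __name__ == "__main__":
    synthetic_lr = project_training_data([("dormi", "V;3;SG", "doarme")], toy_project)
    print(synthetic_lr)
```

The resulting list plays the role of the synthetic LR training set on which an LR inflection model would then be trained.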
The Instance Projection approach is more complex, consisting of three transduction steps: (1) project an individual LR test instance into the HR language; (2) inflect the resulting form using a model trained on the HR training data, and (3) project the result back into the LR language. For example, Latin "dormio+V;3;SG" would first be projected into Romanian "dormi+V;3;SG", then inflected using the Romanian model into doarme, and finally projected back into Latin as dormit. Unlike in Data Projection, inflection is performed entirely in the HR language. We aim to determine whether the higher HR inflection accuracy can offset the errors introduced at either of the projection steps.

Table 2 :
Pair      k     t    Train  Dev  Test
pol↔csb   7500  0.4  6500   500  500
spa↔oci   5300  0.4  4500   500  300
ron↔lat   4612  0.4  4000   300  312
hin↔ben   1816  0.5  1456   180  180
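The three steps of Instance Projection compose into a simple pipeline, sketched below. All three models are replaced by hypothetical toy lookup functions (toy_lat_to_ron, toy_ron_inflect, toy_ron_to_lat) so that the example is self-contained; they are not the trained DTLM and OpenNMT models.

```python
from typing import Callable


def instance_projection(
    lr_lemma: str,
    tags: str,
    lr_to_hr: Callable[[str], str],
    hr_inflect: Callable[[str, str], str],
    hr_to_lr: Callable[[str], str],
) -> str:
    """Inflect an LR lemma by a round trip through the HR language."""
    hr_lemma = lr_to_hr(lr_lemma)         # step 1: project the LR lemma into HR
    hr_form = hr_inflect(hr_lemma, tags)  # step 2: inflect with the HR model
    return hr_to_lr(hr_form)              # step 3: project the form back into LR


# Hypothetical toy stand-ins for the three trained models.
def toy_lat_to_ron(word: str) -> str:
    return {"dormio": "dormi"}.get(word, word)


def toy_ron_inflect(lemma: str, tags: str) -> str:
    return {("dormi", "V;3;SG"): "doarme"}.get((lemma, tags), lemma)


def toy_ron_to_lat(word: str) -> str:
    return {"doarme": "dormit"}.get(word, word)


if __name__ == "__main__":
    print(instance_projection("dormio", "V;3;SG",
                              toy_lat_to_ron, toy_ron_inflect, toy_ron_to_lat))
```

Because the output of each step is the input of the next, an error at any stage propagates to the final form, which is the trade-off examined in the experiments.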

Development
In this section, we describe our external resources and development results.

External Resources
For low-resource tasks, in both inflection generation and cognate projection, it makes obvious sense to leverage additional resources, which are freely available for many under-resourced languages. We extract the target word lists for DTLM from UniMorph (Kirov et al., 2018) and Wikipedia, as summarized in Table 1.
For cognate projection, we need training sets composed of cognate pairs. Finding good parallel bitexts for low-resource languages is quite challenging. Small bitexts exist in special domains, such as technical documentation or Bible translations. For Polish-Kashubian and Spanish-Occitan, we use software documentation from OPUS (Tiedemann, 2012). For Hindi-Bengali, we use the OpenSubtitles (v2018) data, also from OPUS. For Romanian-Latin, we use a parallel corpus which contains a verse-by-verse alignment of the Bible translations in 100 languages (Christodouloupoulos and Steedman, 2015).

Inflection Generation
We perform inflection generation with DTLM in the low-resource setting, and OpenNMT in the high-resource setting. For DTLM, we apply the tag splitting and particle handling techniques described in Nicolai et al. (2017). In particular, we split tag sequences into component tags, and append them at both the beginning and end of the lemma, treating each of them as an atomic symbol. We tune the hyper-parameters of both the aligner and transducer using grid search for each language. For OpenNMT, we split tag sequences, and append them to the lemma. All parameters are set to default values.
The task of leveraging HR training data for LR inflection generation is complicated by two types of inconsistencies. First, there are unavoidable typological differences, especially between less similar languages. For example, Latin nominal inflection paradigms include six cases, most of which do not exist in Romanian, which instead distinguishes between definite and indefinite forms. Second, the order of the tags in the data may differ. For example, the person tag follows the tense tag in the Spanish data, while the order is reversed in the Occitan data. We do not perform any tag re-ordering in the current shared task, but see Nicolai et al. (2017) for a principled solution to this problem.
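The tag splitting used for DTLM can be sketched as follows. The function name dtlm_source_symbols and the angle-bracket notation for tag symbols are assumptions chosen for illustration; the actual DTLM input format may differ, and particle handling is not shown.

```python
from typing import List


def dtlm_source_symbols(lemma: str, tags: str) -> List[str]:
    """Return the source side as a list of atomic symbols.

    The tag sequence is split into component tags, and the tags are placed
    both before and after the lemma characters. Wrapping each tag in angle
    brackets (an assumed convention) keeps a tag such as '3' distinct from
    a digit character in the lemma.
    """
    tag_symbols = [f"<{tag}>" for tag in tags.split(";")]
    return tag_symbols + list(lemma) + tag_symbols


if __name__ == "__main__":
    print(dtlm_source_symbols("dormio", "V;3;SG"))
```

Placing the tags at both edges of the lemma makes them locally visible to a transducer that operates on substrings, whether the relevant change occurs at the beginning or at the end of the word.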

Cognate Projection
We train our cognate models on lists of HR-LR word pairs acquired from the bitexts. The bitexts are aligned with fast_align (Dyer et al., 2013). We extract all aligned word pairs, and sort them by the alignment frequency. For Hindi and Bengali, which are written in different scripts, we compute the inter-lingual orthographic similarity after romanizing all words using uroman (Hermjakob et al., 2018). We discard all pairs with orthographic similarity below a threshold t, which is manually tuned for each language pair. The similarity is computed as 1 − D/L, where D is the Levenshtein distance, and L is the length of the longer of the two strings. Furthermore, we discard pairs which involve any words that are English, are shorter than 4 characters, or include digits. We take the top k HR-LR pairs, and randomly divide them into training, development, and test sets, as summarized in Table 2.
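The filtering criteria above can be made concrete with a short sketch. The helper names (levenshtein, similarity, keep_pair, select_pairs) and the english_words argument are hypothetical; the sketch assumes that the input pairs are already sorted by alignment frequency and, for Hindi-Bengali, already romanized.

```python
from typing import FrozenSet, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Orthographic similarity 1 - D/L, with L the length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longer


def keep_pair(hr: str, lr: str, t: float,
              english_words: FrozenSet[str] = frozenset()) -> bool:
    """Apply the word-level filters, then the similarity threshold t."""
    for word in (hr, lr):
        if len(word) < 4 or any(ch.isdigit() for ch in word):
            return False
        if word.lower() in english_words:
            return False
    return similarity(hr, lr) >= t


def select_pairs(sorted_pairs: List[Tuple[str, str]], t: float, k: int,
                 english_words: FrozenSet[str] = frozenset()) -> List[Tuple[str, str]]:
    """Keep the top k pairs (in the given frequency order) that pass the filters."""
    kept = [(hr, lr) for hr, lr in sorted_pairs if keep_pair(hr, lr, t, english_words)]
    return kept[:k]
```

For instance, similarity("dormi", "dormio") is 1 − 1/6, which is well above the thresholds of 0.4 and 0.5 listed in Table 2.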
For each language pair, we train a DTLM model in each direction on the training set, using the development set to prevent over-fitting, as well as a target-language word list (Section 5.1). The results of the intrinsic evaluation of the projection models on the in-domain test sets are shown in Table 4. The accuracy of the Romanian-Latin models is relatively low, which may be due to the Bible domain.

Results and Discussion
We test several systems, as listed in Table 3. (Submission IDs are given here in parentheses.) A naive copy baseline (5) simply outputs the unchanged input lemmas. DTLM models with and without target wordlists (2 and 1) make no use of HR data (the latter is our only standard submission, which uses no external resources). The next three systems make use of only the HR training sets provided as part of the shared task. This emulates a scenario where no LR inflection data is available. Data Projection (3) and Instance Projection (4) implement the two methods illustrated in Figure 1, while No Projection simply applies an inflection model trained on HR data to LR instances. The last system (6) combines the projected HR inflection data with LR data, which probably comes closest to the spirit of this shared task. 1-MONO is the first-order monotonic hard attention system of Wu and Cotterell (2019).
The test results are shown in Table 3. The best result on each language is shown in bold. When only LR data is used, the results confirm the finding of Nicolai et al. (2018) that leveraging target wordlists from monolingual corpora can improve inflection accuracy for less-closely related languages. With the exception of Polish-Kashubian, the standard DTLM model is better than the competitive baselines. However, the Polish-Kashubian results demonstrate that cognate projection can outperform the Copy and No Projection baselines when only HR data is used. Finally, augmenting the LR training data with the projected HR data does not improve the inflection accuracy in most cases.

Conclusion
We described the details of the systems that we tested on four language pairs in the SIGMORPHON 2019 Shared Task. In particular, we successfully experimented with leveraging cognate projection for inflection generation. We view our Polish-Kashubian results as a proof of concept that should motivate further research on this new idea.

Figure 1 :
Two approaches to applying cognate projection to inflection generation. DTLM and NMT denote projection and inflection models, respectively. Dashed arrows show transduction. Solid arrows indicate training data. The LR and HR components are shown in orange and blue.

Table 1 :
The size of the UniMorph datasets and our target word lists.

Table 3 :
Inflection results on test sets of the shared task.

Table 4 :
Intrinsic evaluation of cognate projection.