Morphological Analysis of the Dravidian Language Family

The Dravidian languages are one of the most widely spoken language families in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Additionally, we exploit novel features and higher-order models to set state-of-the-art results on this corpus for both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.


Introduction
The Dravidian languages comprise one of the world's major language families and are spoken by over 300 million people in southern India (see Figure 1). Despite their prevalence, they remain low resource with respect to language technology. We annotate new data and develop new models for the most commonly spoken Dravidian languages: Kannada, Malayalam, Tamil and Telugu.
We focus on the computational processing of Dravidian morphology, a critical issue since the family exhibits rich agglutinative inflectional morphology as well as highly productive compounding. For example, Dravidian nouns are typically inflected for gender, number and case, in addition to taking various postpositions. E.g., consider the Malayalam word agniparvvatattinṟeyoppam (അഗ്നിപർവ്വതത്തിന്റെയോപ്പം), which is composed of the compound noun stem agni+parvvatam (fire+mountain) and the following suffixes: tta (inflectional increment), inṟe (genitive case marker), ye (inflectional increment) and oppam (postposition). These combine to give the meaning of the English phrase "with a volcano." This complexity makes morphological analysis obligatory for the Dravidian languages.

We make three contributions. First, we show that a combination of higher-order models and linguistically-motivated features yields state-of-the-art accuracy on the task of morphological segmentation in the four major Dravidian languages. Second, we show that training POS taggers that use the output of our segmenters as features greatly improves tagging accuracy. This indicates that for languages with rich morphology, a more structured approach to character-level features than simple prefix and suffix features is necessary. Third, we release the annotated segmentation and POS-tagged corpora as open-source resources, encouraging future work on Dravidian languages.


DravMorph
A primary contribution of this work is the release of DravMorph, a corrected corpus for both morphological segmentation and POS in the four major Dravidian languages; corpus statistics are given in Table 2. To the best of our knowledge, this is the most comprehensive annotated corpus of the Dravidian languages. All the newly annotated corpora are based on Wikipedia text in the respective languages (see Table 1). To speed up annotation, we first ran closed-source rule-based morphological analyzers and POS taggers produced by the government of India and Indian universities. We remark that the existence of such rule-based tools does not diminish the utility of the annotated corpus---our ultimate goal is the adoption of modern statistical methods for Dravidian NLP, which requires annotated data. To ensure a gold-standard corpus, we then hand-corrected the resulting output. Additionally, we standardized the POS tagging schemes across languages, using the IIIT-H POS tagset (Bharati et al., 2006), which has 23 tags. Furthermore, we calculated inter-annotator agreement between two annotators for the morphological labels; all datasets have Cohen's κ (Cohen, 1968) > 0.80.

Morphological Segmentation
We first examine the task of morphological segmentation in the Dravidian languages. The task entails breaking a word up into its constituent morphs. For example, the English word joblessness can be segmented as job+less+ness. When processing morphologically-rich languages, this helps reduce the sparsity created by the higher OOV rate due to productive morphology and, empirically, has been shown to be beneficial in a diverse variety of down-stream tasks, e.g., machine translation (Clifton and Sarkar, 2011), speech recognition (Afify et al., 2006), keyword spotting (Narasimhan et al., 2014) and parsing (Seeker and Çetinoğlu, 2015). Unsupervised approaches to segmentation have been successful, but, when annotated data is available, supervised approaches typically greatly outperform them (Ruokolainen et al., 2013).
In light of this, we adopt a fully supervised model here.
We apply semi-Markov conditional random fields (S-CRFs) to the problem of morphological segmentation (Sarawagi and Cohen, 2004; Cotterell et al., 2015). An S-CRF models this transformation as

p_θ(s, ℓ | w) = (1 / Z_θ(w)) exp(θ · f(s, ℓ, w)),

where s is a segmentation, ℓ a labeling, θ ∈ R^d is the parameter vector, f is a feature function and the partition function Z_θ(w) ensures the distribution is normalized. Note that each ℓ_i is taken from a set of labels L. In this work, we take L = {prefix, stem, suffix}.
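To make the model concrete, here is a minimal brute-force sketch of the S-CRF model just described. The (segment, label) indicator scorer and the weight names are illustrative stand-ins for the real feature function, not the system's implementation.

```python
import math
from itertools import product

LABELS = ["prefix", "stem", "suffix"]

def segmentations(word):
    """Enumerate every (segmentation, labeling) pair for a word: a
    segmentation is a tuple of substrings that concatenates back to
    the word, and each segment gets one label from LABELS."""
    n = len(word)
    for mask in range(1 << (n - 1)):          # choose internal boundaries
        cuts = [i + 1 for i in range(n - 1) if mask >> i & 1]
        bounds = [0] + cuts + [n]
        segs = tuple(word[a:b] for a, b in zip(bounds, bounds[1:]))
        for labs in product(LABELS, repeat=len(segs)):
            yield segs, labs

def score(segs, labs, weights):
    """theta . f(s, l, w) with toy (segment, label) indicator features."""
    return sum(weights.get((s, l), 0.0) for s, l in zip(segs, labs))

def prob(word, segs, labs, weights):
    """p_theta(s, l | w) = exp(theta . f(s, l, w)) / Z_theta(w)."""
    Z = sum(math.exp(score(s, l, weights)) for s, l in segmentations(word))
    return math.exp(score(segs, labs, weights)) / Z
```

For a short word this enumeration is exact; a real system replaces it with the semi-Markov dynamic program, since the number of labeled analyses grows exponentially in word length.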
As an extension to the standard S-CRF model, we allow for higher-order segment interactions (Nguyen et al., 2011). This allows feature functions to look at multiple adjacent segments s_i, s_{i-1} and s_{i-2}, as well as multiple labels ℓ_i, ℓ_{i-1} and ℓ_{i-2}. While higher-order S-CRFs have shown performance improvements in various tasks, e.g., bibliography extraction and OCR (Nguyen et al., 2014), they have yet to be applied to morphology. We posit that the increased model expressiveness will help model more complex morphology.
We optimize the model parameters to maximize the L 2 regularized likelihood of the training data using L-BFGS (Liu and Nocedal, 1989). Computation of the likelihood and gradient can be performed efficiently through a generalization of the forward-backward algorithm that runs in O(|w| n+2 |L| m+1 ), where n is the number of adjacent segments to be scored (n = 0 in a standard S-CRF) and m is the number of adjacent labels to be scored (m = 1 in a standard S-CRF). In this work, we explore n ∈ {0, 1, 2} and m ∈ {1, 2, 3}, i.e., our features examine up to three adjacent segments and their labels.
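The underlying dynamic program can be sketched for the standard case (n = 0, m = 1). The potential function `log_phi` below is a hypothetical stand-in for θ · f on a single labeled segment:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m if m == float("-inf") else m + math.log(sum(math.exp(x - m) for x in xs))

def log_Z(word, labels, log_phi):
    """Partition function Z_theta(w) of a 1st-order S-CRF via the
    semi-Markov forward algorithm, running in O(|w|^2 |L|^2) -- the
    n = 0, m = 1 case of the O(|w|^{n+2} |L|^{m+1}) recursion.

    alpha[j][l] = log-sum of scores of all labeled segmentations of
    word[:j] whose final segment carries label l; log_phi(seg, l, prev)
    is the local log-potential (prev is None at the start of the word)."""
    n = len(word)
    alpha = [{} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for l in labels:
            terms = []
            for i in range(j):                  # candidate segment word[i:j]
                seg = word[i:j]
                if i == 0:
                    terms.append(log_phi(seg, l, None))
                else:
                    terms.extend(alpha[i][p] + log_phi(seg, l, p) for p in labels)
            alpha[j][l] = logsumexp(terms)
    return logsumexp([alpha[n][l] for l in labels])
```

With `log_phi` identically zero, exp(log_Z) simply counts labeled segmentations; e.g., a length-3 word with 2 labels has 18 of them, which is a convenient sanity check.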

Features
We apply a mixture of features standard for morphological segmentation and novel features based on linguistic properties of the Dravidian languages.

Language Independent Feature Templates.
We include the following atomic feature templates from Cotterell et al. (2015): (i) a binary indicator feature for each substring s_i seen in the training data, (ii) character n-gram context features on the left and right of each potential boundary and (iii) a binary feature that fires if the segment s_i appears in a spell-checker gazetteer, to determine whether it is itself a word. We also take conjunctions of all atomic features and the labels. Note that in higher-order models, we include the conjunction of all features that fire on a given segment s_i with those that fire on the adjacent segments.
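A sketch of how these templates might be instantiated; the feature-name strings and the `lexicon` argument are illustrative, not the system's actual template names:

```python
def segment_features(word, start, end, label, lexicon, n=2):
    """Atomic templates (i)-(iii) for the segment word[start:end],
    conjoined with its label: a substring indicator, character n-gram
    contexts around both boundaries, and lexicon membership."""
    seg = word[start:end]
    feats = {f"seg={seg}"}                       # (i) substring indicator
    for b in (start, end):                       # (ii) boundary contexts
        feats.add(f"ctx={word[max(0, b - n):b]}|{word[b:b + n]}")
    if seg in lexicon:                           # (iii) spell-checker gazetteer
        feats.add("in_lexicon")
    return {f"{f}&lab={label}" for f in feats}
```

For example, `segment_features("joblessness", 0, 3, "stem", {"job"})` fires the substring, context and lexicon features for the segment "job", each conjoined with the stem label.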
Inflectional Increments. All Dravidian languages discussed in this work have semantically vacuous segments known as inflectional increments that are inserted during word formation between the stem and an inflectional ending. Consider the Malayalam example marattinṟe (മരത്തിന്റെ) ("of the tree"), which consists of the stem mara (tree), the inflectional increment tt and the genitive case marker inṟe. Because they appear only between morphs, inflectional increments serve as a cue for morph boundaries. Luckily, each set of inflectional increments is closed-class, allowing us to create a gazetteer of all increments.
Orthographic Features. The orthography of the Dravidian languages is an important factor that interacts non-trivially with the morphology. Each language uses an alpha-syllabic writing system, where each symbol encodes a syllable rather than a single phoneme. Since boundaries typically occur between syllables, using a transliterated representation would throw away information. To remedy this, we include a binary feature that indicates whether a potential boundary corresponds to a syllable boundary in the original script. The orthographies also contain digraphs, which represent a single phoneme using a combination of two other graphs in the system. These characters are typically produced when two syllables are joined at morpheme or word boundaries. Since the set of digraph characters is fixed in the orthography, we create another gazetteer for them.
Sandhi. Dravidian languages exhibit rich phonological interactions known as sandhi that occur at morph boundaries, and at word boundaries in the case of compounding. We encode the major morphophonological processes as features to capture this. We include features for the assimilation, insertion and deletion of phonemes, as these changes are visible in the surface form and can easily be represented as features. Consider an example from Malayalam, kuṭṭiyuṁ (കുട്ടിയും) (child + also): here there are two morphemes, the first, kuṭṭi, which ends with the front vowel i, and the second, um, which starts with the back vowel u. Sandhi inserts a glide y between them, marking the morpheme boundary.
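The increment and sandhi cues can be sketched on romanized text as below. The toy gazetteers (`INCREMENTS`, `GLIDES`) are illustrative stand-ins for the real closed-class lists, and the script-level orthographic features are omitted for brevity:

```python
INCREMENTS = {"tt", "in"}   # toy gazetteer of inflectional increments
GLIDES = {"y", "v"}         # glides that sandhi inserts between vowels
VOWELS = set("aeiou")

def dravidian_features(word, start, end):
    """Increment and sandhi features for the segment word[start:end]."""
    seg = word[start:end]
    feats = set()
    if seg in INCREMENTS:
        feats.add("increment")          # increments occur only between morphs
    # an inserted glide flanked by vowels hints at a sandhi boundary,
    # as in kutti + um -> kutti-y-um
    if (seg in GLIDES and 0 < start and end < len(word)
            and word[start - 1] in VOWELS and word[end] in VOWELS):
        feats.add("sandhi_glide")
    return feats
```

On the romanized kuttiyum, the segment "y" between two vowels fires the sandhi feature, while "tt" fires the increment feature, mirroring the two examples above.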

Experiments and Results
Morphological Segmentation. On the task of morphological segmentation, we experimented with four languages from the Dravidian family in our corpus: Kannada, Malayalam, Tamil and Telugu. We first performed a full ablation study (see Table 3) on our model described in §3 to validate that both the higher-order models and the linguistic features have the desired effect. Indeed, both significantly improve performance. We evaluate using border F1 (Virpioja et al., 2011) against the gold segmentation.
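Border F1 scores the internal boundary positions of a predicted segmentation against the gold ones; a minimal sketch:

```python
def boundaries(segments):
    """Internal boundary offsets, e.g. ("job","less","ness") -> {3, 7}."""
    pos, cuts = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def border_f1(gold, pred):
    """F1 of predicted vs. gold internal morph boundaries."""
    g, p = boundaries(gold), boundaries(pred)
    if not g and not p:
        return 1.0          # both analyses leave the word whole
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

E.g., predicting job+lessness against gold job+less+ness recovers one of two gold boundaries with perfect precision, giving F1 = 2/3.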
On test data, we compare our best system from the ablation study against two models previously proposed in the literature. First, we compare against the CRF model of Ruokolainen et al. (2013) and, second, against the S-CRF model of Cotterell et al. (2015), which is a 1st-order S-CRF. We tune the regularization coefficient for the L2 regularizer on held-out data.

Table 3: Full ablation study on test data to test the effectiveness of our new features as well as the higher-order models. The metric used is border F1. We denote higher-order models as S-CRF(n, m), where the integers n and m indicate the order of the model; e.g., the S-CRF(1, 2) model scores pairs of segments and triples of labels. Note that +I marks inflectional-increment features, +O marks orthography features and +S marks sandhi features.
Segmentation in POS Tagging. Next, we show the efficacy of morphological segmentation as a preprocessing step for POS tagging (seen as a downstream task). For each type in the POS corpus, we take the MAP segmentation from the best S-CRF segmenter. We then train the Marmot tagger (Müller et al., 2013) using features derived from the segmentation. Specifically, we create a binary feature that fires on each segment in the training data. The other features in Marmot are the standard shape features for POS tagging described in the literature (Ratnaparkhi, 1996; Manning, 2011). Notably, the tagger's primary source of morphological information is character n-gram features on individual word forms, and some of its features are not useful for the Dravidian languages, e.g., the Dravidian scripts have no case distinction.
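As a sketch, the segment-derived features handed to the tagger might look as follows; the feature names are illustrative, and the suffix n-grams stand in for the tagger's standard character features:

```python
def tagger_features(word, morphs):
    """Token-level features: one binary indicator per predicted morph,
    plus the usual suffix n-grams for comparison."""
    feats = [f"morph={m}" for m in morphs]
    feats += [f"suf{k}={word[-k:]}" for k in (1, 2, 3) if len(word) >= k]
    return feats
```

The morph indicators expose whole affixes (e.g., a tense marker) that fixed-length suffix n-grams can straddle or miss, which is the intuition behind the gains reported below in Table 4.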
In the Dravidian languages (and agglutinative languages more generally), morphological segments mark case, tense, aspect, gender and number---categories indicative of POS. For instance, tense markers appear only with verbs. These features have the potential to be more useful than the tagger's transition dynamics, as Dravidian word order is relatively free.

We train the Marmot system with and without the morphological segmentation features. Following the procedure outlined in Müller et al. (2013), we train using stochastic gradient descent for 10 epochs with an L1 regularizer with coefficient 0.1. The results are reported in Table 4. We see clear gains of up to 1.69% with the systems that use the segments as features. This evinces that segmentation is a useful preprocessing step for POS tagging in the Dravidian languages---character n-grams alone do not pick up on the layers of affixes.

Related Work
Sequence models such as CRFs and S-CRFs have been used for segmentation tasks in NLP, e.g., Peng et al. (2004) applied a CRF model to Chinese word segmentation and Andrew (2006) applied a hybrid Markov/semi-Markov CRF to the same task. Cotterell et al. (2015) proposed a 1st-order S-CRF model for morphological segmentation, but did not explore higher-order models. Additionally, we are the first to explore rich phonological and orthographic features in supervised segmentation models.
There is a large body of research on POS taggers for the South Dravidian languages, most of it language-specific, e.g., Pandian and Geetha (2009). However, some methods have been applied to more than one language in the family: P.V.S. and Karthik (2007) apply linear-chain CRFs to POS tagging of Bengali, Hindi and Telugu. Another approach to POS tagging of a Dravidian language is to use the part-of-speech tagger of a similar language. More recently, Kumar et al. (2015) applied adaptor grammars to unsupervised morphological segmentation of Kannada, Malayalam and Tamil.

Table 4: Tagging results using the Marmot tagger on the four Dravidian languages studied in the paper. The results indicate strongly that morphological segmentation---rather than simple prefix and suffix n-gram features---is a useful step in handling the agglutinative Dravidian languages.

Conclusion
In this paper, we presented a higher-order semi-CRF model for morphological segmentation for the Dravidian languages of South India. Our results show that modeling higher-order dependencies between segments, together with linguistically-inspired features, can greatly improve system performance. We also showed that segmentation is beneficial to the down-stream task of POS tagging. To promote research on the Dravidian family, we release hand-corrected corpora for both morphological segmentation and POS tagging in four low-resource languages. Future work should concentrate on canonical segmentation (Cotterell et al., 2016a; Cotterell et al., 2016b; Cotterell and Schütze, 2017), which may be a better fit for the problem given the rich phonological changes in Dravidian morphology. We also plan to map the annotations to the universal POS set of Petrov et al. (2012) and the UniMorph schema of Sylak-Glassman et al. (2015).