Labeled Morphological Segmentation with Semi-Markov Models

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that CHIPMUNK yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2–6 points F1 over the baseline.

Introduction
Morphological processing is often an overlooked problem because many well-researched languages in NLP, e.g., Chinese and English, are morphologically impoverished. However, for languages with complex morphology, e.g., Finnish and Turkish, morphological processing is essential. A specific form of morphological processing, morphological segmentation, has shown its utility for machine translation (Dyer et al., 2008), sentiment analysis (Abdul-Mageed et al., 2012), bilingual word alignment (Eyigöz et al., 2013), speech processing (Creutz et al., 2007b) and keyword spotting (Narasimhan et al., 2014), inter alia. We advance the state of the art in supervised morphological segmentation by describing a high-performance, data-driven tool for handling complex morphology, even in low-resource settings.
In this work, we make the distinction between unlabeled morphological segmentation (UMS), which is often just called morphological segmentation, and labeled morphological segmentation (LMS). The labels in our supervised discriminative model for LMS capture the distinctions between different types of morphemes and directly model the morphotactics of the language. We further create a hierarchical universal tagset for labeling morphemes, with different levels appropriate for different tasks. Our hierarchical tagset was designed by creating a standard representation from heterogeneous resources for six languages. We give an overview of the tasks addressed in this paper in Figure 1, which shows the expected output for the Turkish word gençleşmelerin ('of the rejuvenatings'). In particular, it shows the full labeled morphological segmentation, from which three representations can be directly derived: the unlabeled morphological segmentation, the stem (or root)² and the morphological tag containing part-of-speech (POS) and inflectional features.
We model these tasks with CHIPMUNK, a semi-Markov conditional random field (semi-CRF; Sarawagi and Cohen, 2004), a model that is well-suited for morphological segmentation. We provide an evaluation and analysis on six languages; CHIPMUNK yields strong results on all three tasks, including state-of-the-art accuracy on morphological segmentation.
Paper Outline. Section 2 presents our LMS framework and the morphotactic tagsets we develop, i.e., the labels of the sequence prediction task CHIPMUNK solves. Section 3 introduces our semi-CRF model. Section 4 presents our novel features. Section 5 compares CHIPMUNK to previous work. Section 6 presents experiments on the three complementary tasks of segmentation (UMS), stemming, and morphological tag classification. Section 7 briefly discusses finite-state morphology.

Labeled Segmentation and Tagset
We define the framework of labeled morphological segmentation, an enhancement of morphological segmentation that, in addition to identifying the boundaries of segments, assigns a fine-grained morphotactic tag to each segment. LMS both leads to better modeling of segmentation and subsumes several other tasks, e.g., stemming.
² Terminological notes: We use root to refer to a morpheme with concrete meaning, stem to refer to the concatenation of all roots and derivational affixes, root detection to refer to stripping both derivational and inflectional affixes, and stemming to refer to stripping only inflectional affixes.
Most previous approaches to morphological segmentation are either unlabeled or use a small, coarse-grained set such as {PREFIX, ROOT, SUFFIX}. In contrast, our labels are fine-grained. This finer granularity has two advantages. (i) The labels are needed for many tasks; for instance, in sentiment analysis, detecting morphologically encoded negation, as in Turkish, is crucial. In other words, for many applications UMS is insufficient. (ii) The LMS framework allows us to learn a probabilistic model of morphotactics. Working with LMS results in higher UMS accuracy. Thus, even in applications that only need segments and no labels, LMS is beneficial. Note that the concatenation of labels across segments yields a bundle of morphological attributes similar to those found in the CoNLL datasets often used to train morphological taggers (Buchholz and Marsi, 2006); thus LMS helps to unify UMS and morphological tagging. We believe that LMS is a needed extension of current work in morphological segmentation. Our framework concisely allows the model to capture interdependencies among various morphemes and to model relations between entire morpheme classes, a neglected aspect of the problem.
We first create a hierarchical tagset with increasing granularity by analyzing heterogeneous resources for the six languages we work on. The optimal level of granularity is task- and language-dependent: the level is a trade-off between simplicity and expressivity. We illustrate our tagset with the decomposition of the German word Enteisungen 'defrostings' (Figure 2):
• Level 0: The level 0 tagset involves a single tag indicating a segment. It ignores morphotactics completely and is similar to previous work.
• Level 1: The level 1 tagset crudely approximates morphotactics: it consists of the tags {PREFIX, ROOT, SUFFIX}. This scheme has been successfully used by unsupervised segmenters, e.g., MORFESSOR CAT-MAP (Creutz et al., 2007a). It allows the model to learn simple morphotactics, for instance that a prefix cannot be followed by a suffix. This makes a decomposition like reed → re+ed unlikely. We also add an additional UNKNOWN tag for morphemes that do not fit into this scheme.
• Level 2: The level 2 tagset splits affixes into DERIVATIONAL and INFLECTIONAL, effectively increasing the maximal tagset size from 4 to 6. These tags can encode that many languages allow for transitions from derivational to inflectional endings, but rarely the opposite. This makes the incorrect decomposition of German Offenheit ('openness') into Off, inflectional en and derivational heit unlikely. This tagset is also useful for building statistical stemmers.
• Level 3: The level 3 tagset adds the part of speech, i.e., whether a root is VERBAL, NOMINAL or ADJECTIVAL, and the part of speech of the word that an affix derives.
• Level 4: The level 4 tagset includes the inflectional feature a suffix adds, e.g., CASE or NUMBER. This is helpful for certain agglutinative languages, in which, e.g., CASE must follow NUMBER.
• Level 5: The level 5 tagset adds the actual value of the inflectional feature, e.g., PLURAL, and corresponds to the annotation in the datasets.
In preliminary experiments we found that the level 5 tagset is too rich and does not yield consistent improvements; we thus do not report experimental results using it.
Table 1 shows tagset sizes for the six languages.
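One way to read the hierarchy is as successive coarsenings of a single fine-grained tag. The sketch below is illustrative only: the tuple encoding, the tag spellings, the `coarsen` helper and the tag values assigned to Enteisungen are all assumptions made for this example, not the representation used by CHIPMUNK or the contents of Figure 2.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding (an assumption of this sketch): a full tag is a tuple
# (position, affix kind, part of speech, inflectional feature, feature value),
# with None for fields that do not apply to a morpheme.
Tag = Tuple[str, Optional[str], Optional[str], Optional[str], Optional[str]]


def coarsen(tag: Tag, level: int) -> str:
    """Project a full tag onto the tagset of the given level (0 to 5)."""
    if not 0 <= level <= 5:
        raise ValueError("level must be between 0 and 5")
    if level == 0:
        return "SEGMENT"  # level 0: one tag for every segment
    # Keep the first `level` fields and skip those that do not apply.
    return ":".join(field for field in tag[:level] if field is not None)


# Enteisungen 'defrostings' as Ent + eis + ung + en; the tag values are
# illustrative guesses, not copied from the paper's figure.
ENTEISUNGEN: List[Tuple[str, Tag]] = [
    ("Ent", ("PREFIX", "DERIVATIONAL", "VERBAL", None, None)),
    ("eis", ("ROOT", None, "NOMINAL", None, None)),
    ("ung", ("SUFFIX", "DERIVATIONAL", "NOMINAL", None, None)),
    ("en", ("SUFFIX", "INFLECTIONAL", "NOMINAL", "NUMBER", "PLURAL")),
]

for level in range(6):
    print(level, [coarsen(tag, level) for _, tag in ENTEISUNGEN])
```

Under this encoding every coarser tagset is a deterministic function of the finest one, which is what makes it possible to pick the granularity per task and per language.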

Model
CHIPMUNK is a supervised model based on a semi-Markov conditional random field (semi-CRF) (Sarawagi and Cohen, 2004) that naturally fits the task of LMS. Semi-CRFs generalize linear-chain CRFs and model segmentation jointly with sequence labeling. Just as linear-chain CRFs are discriminative adaptations of hidden Markov models (Lafferty et al., 2001), semi-CRFs are an analogous adaptation of hidden semi-Markov models (Murphy, 2002). Semi-CRFs allow us to integrate new features that look at complete segments, which is not possible with CRFs, making semi-CRFs a natural choice for morphology.
A semi-CRF represents w (a word) as a sequence of segments $s = \langle s_1, \ldots, s_N \rangle$, each of which is assigned a label $\ell_n$. The concatenation of all segments equals w. We seek a log-linear distribution $p_\theta(s, \ell \mid w)$ over all possible segmentations and label sequences for w, where $\theta$ is the parameter vector. Note that we recover the standard CRF if we restrict the segment length to 1. Formally, we define the probability distribution $p_\theta$ as
$$p_\theta(s, \ell \mid w) = \frac{1}{Z_\theta(w)} \prod_{n=1}^{N} \exp\left(\theta^\top f(s_n, \ell_n, \ell_{n-1})\right),$$
where f is the feature function and $Z_\theta(w)$ is the partition function. We use a generalization of the forward-backward algorithm for efficient gradient computation (Sarawagi and Cohen, 2004). Inspection of the semi-Markov forward recursion shows that the algorithm runs in $O(N^2 L^2)$ time, where N here denotes the length of the word w and L is the number of labels (size of the tagset). The partition function is then $Z_\theta(w) = \sum_{\ell=1}^{L} \alpha(N, \ell)$, where $\alpha$ is the forward variable. A similar recursion, generalizing the Viterbi algorithm for hidden Markov models (Rabiner, 1989), allows us to find the one-best labeled segmentation in $O(N^2 L^2)$ as well.
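The forward computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CHIPMUNK's implementation: the `score` callback stands in for the inner product of the parameters with the features of one segment, and the names `log_partition`, `ScoreFn` and `START` are hypothetical. The two outer loops over positions and the two loops over labels give the quadratic dependence on word length and tagset size.

```python
import math
from typing import Callable, List

# score(word, start, end, prev_label, label) plays the role of theta^T f for the
# segment word[start:end]; prev_label is START for the first segment.
ScoreFn = Callable[[str, int, int, int, int], float]
START = -1


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(word: str, num_labels: int, score: ScoreFn) -> float:
    """Return log Z(w) via the semi-Markov forward recursion."""
    n = len(word)
    if n == 0 or num_labels < 1:
        raise ValueError("need a non-empty word and at least one label")
    # alpha[t][l]: log-weight of all labeled segmentations of word[:t]
    # whose last segment carries label l.
    alpha = [[float("-inf")] * num_labels for _ in range(n + 1)]
    for t in range(1, n + 1):
        for label in range(num_labels):
            terms = [score(word, 0, t, START, label)]  # word[:t] as one segment
            for i in range(1, t):  # the last segment is word[i:t]
                for prev in range(num_labels):
                    terms.append(alpha[i][prev] + score(word, i, t, prev, label))
            alpha[t][label] = logsumexp(terms)
    return logsumexp(alpha[n])


# With all scores zero, Z simply counts the labeled segmentations of the word.
print(math.exp(log_partition("sees", 2, lambda *args: 0.0)))
```

A convenient sanity check: with all scores set to zero, Z equals the number of labeled segmentations, L(1 + L)^(N-1), which is 54 for a four-letter word and two labels (up to floating-point error).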
We employ the maximum-likelihood criterion to estimate the parameters with L-BFGS (Liu and Nocedal, 1989), a gradient-based optimization algorithm. As in all exponential family models, the gradient of the log-likelihood takes the form of the difference between the observed and expected feature counts (Wainwright and Jordan, 2008) and can be computed efficiently with the semi-Markov extension of the forward-backward algorithm. We use L2 regularization with a regularization coefficient tuned during cross-validation.
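Written out, the regularized objective and its gradient take the standard exponential-family form below. The notation is introduced here for illustration and is an assumption: training examples are indexed by i, $\lambda$ is the regularization coefficient (the factor 1/2 is a common convention, not confirmed by the text), the argument list of f is assumed, and $F(s, \ell)$ is the sum of the feature function f over the segments of a labeled segmentation.

```latex
\begin{align*}
\mathcal{L}(\theta) &= \sum_{i} \log p_\theta\big(s^{(i)}, \ell^{(i)} \mid w^{(i)}\big)
  - \frac{\lambda}{2}\,\lVert\theta\rVert_2^2, \\
\nabla_\theta \mathcal{L}(\theta) &= \sum_{i} \Big( F\big(s^{(i)}, \ell^{(i)}\big)
  - \mathbb{E}_{p_\theta(s, \ell \mid w^{(i)})}\big[F(s, \ell)\big] \Big) - \lambda\,\theta,
\qquad F(s, \ell) = \sum_{n} f(s_n, \ell_n, \ell_{n-1}).
\end{align*}
```

The first term inside the sum is the observed feature count and the second is the expected count under the model, which is the quantity the forward-backward pass provides.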
We note that semi-Markov models have the potential to obviate typical errors made by standard Markovian sequence models with an IOB labeling scheme over characters. For instance, consider the incorrect segmentation of the English verb sees into se+es. These are reasonable split positions as many English stems end in se (e.g., consider abuse+s). Semi-CRFs have a major advantage because they admit segmental features that allow them to learn that se is not a good morph.
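The sees example can be made concrete with a toy semi-Markov Viterbi search in which each candidate segment is scored as a whole. This is a simplified sketch: labels are omitted for brevity (the full model also ranges over label pairs), and the weights in `TOY_WEIGHTS`, the default penalty and the function names are invented for illustration rather than learned parameters.

```python
from typing import Callable, List, Tuple


def best_segmentation(word: str, seg_score: Callable[[str], float]) -> Tuple[List[str], float]:
    """Unlabeled semi-Markov Viterbi: best[t] is the best score for word[:t]."""
    n = len(word)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for t in range(1, n + 1):
        for i in range(t):  # the last segment is word[i:t], scored as a unit
            candidate = best[i] + seg_score(word[i:t])
            if candidate > best[t]:
                best[t], back[t] = candidate, i
    # Follow the back-pointers from the end of the word to recover the segments.
    segments: List[str] = []
    t = n
    while t > 0:
        segments.append(word[back[t]:t])
        t = back[t]
    return segments[::-1], best[n]


# Hypothetical segment-level weights: 'se' is penalized as a whole morph,
# which a character-level tagger cannot express directly.
TOY_WEIGHTS = {"see": 2.0, "s": 0.5, "es": 0.5, "se": -1.0}


def toy_score(segment: str) -> float:
    return TOY_WEIGHTS.get(segment, -2.0)  # unseen segments get a flat penalty


print(best_segmentation("sees", toy_score))
```

With these toy weights the search prefers see+s over se+es, because the negative weight is attached to the complete segment se rather than to individual boundary decisions.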

Features
We introduce several novel features for LMS. We exploit existing resources, e.g., spell checkers
Affix Features and Gazetteers. In contrast to syntax and semantics, the morphology of a language is often simple to document and a list of the most common morphs can be found in any good grammar. Wiktionary, for example, contains affix lists for all six languages used in our experiments. Providing a supervised learner with such a list is a great boon, just as gazetteer features aid NER (Smith and Osborne, 2006). The benefit is perhaps even greater than in applications like NER because suffixes and prefixes are generally closed-class, and hence these lists are likely to be comprehensive. These features are binary and fire if a given substring occurs in the gazetteer list. In this paper, we simply use suffix lists from English Wiktionary, except for Zulu, for which we use a prefix list; see Table 2. We also include a feature that fires on the conjunction of tags and substrings observed in the training data. In the level 5 tagset, this allows us to link all allomorphs of a given morpheme. In the lower-level tagsets, this links related morphemes. Virpioja et al. (2010)
The n-gram features look at variable-length substrings of the word on both the right and left side of each boundary. We create conjunctive features from the cross-product between the morphotactic tagset (Section 2) and the features.
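The gazetteer and tag-substring features described in this section can be sketched as a small feature function over one labeled segment. The function name, the feature-name strings and the toy gazetteer are hypothetical; only the general idea (binary features that fire on list membership, conjoined with the morphotactic tag) is taken from the text.

```python
from typing import Dict, Set, Tuple


def segment_features(
    segment: str,
    tag: str,
    affix_gazetteer: Set[str],
    seen_pairs: Set[Tuple[str, str]],
) -> Dict[str, float]:
    """Binary features for a single (segment, tag) hypothesis."""
    features: Dict[str, float] = {}
    if segment in affix_gazetteer:
        # Fires whenever the substring appears in the affix list ...
        features["in_gazetteer"] = 1.0
        # ... and again conjoined with the morphotactic tag.
        features[f"in_gazetteer&tag={tag}"] = 1.0
    if (tag, segment) in seen_pairs:
        # Conjunction of tag and substring observed in the training data.
        features[f"seen_pair&tag={tag}&segment={segment}"] = 1.0
    return features


# Toy resources, invented for this example.
SUFFIXES = {"ing", "ed", "s"}
SEEN = {("SUFFIX", "ing"), ("ROOT", "talk")}

print(segment_features("ing", "SUFFIX", SUFFIXES, SEEN))
print(segment_features("ed", "ROOT", SUFFIXES, SEEN))
```

Because the features are defined on whole segments, they fit the semi-CRF directly: the model can learn, for instance, that a gazetteer hit under a SUFFIX tag is far more informative than the same hit under a ROOT tag.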

Related Work
Memory-based Learning. van den Bosch and Daelemans (1999) and Marsi et al. (2005) present memory-based approaches to discriminative learning of morphological segmentation and both address the problem of LMS. We distinguish our work from theirs in that we define a cross-lingual schema for defining a hierarchical tagset for LMS. Moreover, we tackle the problem with a feature-rich, log-linear model, allowing us to easily incorporate disparate sources of knowledge into a single framework.
Unsupervised UMS. UMS has been mainly addressed by unsupervised algorithms. LINGUISTICA (Goldsmith, 2001) and MORFESSOR (Creutz and Lagus, 2002) are built around the idea of optimally encoding the data, in the sense of minimal description length (MDL). MORFESSOR CAT-MAP (Creutz et al., 2007a) 2) for Arabic and Hebrew UMS. Their gradient and objective computation is based on an enumeration of a heuristically chosen subset of the exponentially many segmentations. This limits its applicability to languages with complex concatenative morphology, e.g., Turkish and Finnish.
Supervised UMS. Ruokolainen et al. (2013) present an averaged perceptron (Collins, 2002), a discriminative structured prediction method, for UMS. The model outperforms the semi-supervised model of Poon et al. (2009) on Arabic and Hebrew morpheme segmentation as well as the semi-supervised model of Kohonen et al. (2010a) on English, Finnish and Turkish. Ruokolainen et al. (2014) get further empirical improvements by using features extracted from large corpora, based on the letter successor variety (LSV) model (Harris, 1995) and on unsupervised segmentation models such as Morfessor CatMAP (Creutz et al., 2007a). The idea behind LSV is that, for example, talking should be split into talk and ing, because talk can also be followed by letters other than i, such as e (talked) and s (talks).
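The intuition behind letter successor variety can be illustrated with a few lines of code. This is a bare-bones sketch, assuming the simplest reading of the idea (count the distinct letters that follow each prefix in a word list); it is not the feature extraction used by Ruokolainen et al. (2014), and it ignores end-of-word as a possible successor.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def successor_variety(words: Iterable[str]) -> Dict[str, int]:
    """Map every proper prefix to the number of distinct letters that follow it."""
    successors: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
    return {prefix: len(letters) for prefix, letters in successors.items()}


variety = successor_variety(["talking", "talked", "talks"])
print(variety["tal"], variety["talk"], variety["talki"])
```

In this toy list the prefix talk is followed by three different letters (i, e and s) while tal and talki are each followed by one, so the peak after talk suggests a morph boundary there.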
Chinese Word Segmentation. Chinese word segmentation (CWS) is related to UMS. Andrew (2006) successfully applies semi-CRFs to CWS. Joint CWS and POS tagging (Ng and Low, 2004; Zhang and Clark, 2008) is related to LMS.

Experiments
We experiment on six languages from diverse language families. The segmentation data for English, Finnish and Turkish was taken from MorphoChallenge
2) on Turkish segmentation measured on our development set. As Turkish is an agglutinative language with hundreds of affixes, the efficacy of our approach is expected to be particularly salient here. Recall we optimized for the best tagset granularity for our experiments on Tune.
The German data was extracted from the CELEX2 collection (Baayen et al., 1993), which contains all the requisite information. The Zulu data was taken from the Ukwabelana corpus (Spiegler et al., 2010). Finally, the Indonesian portion was created by applying the rule-based analyzer MORPHIND (Larasati et al., 2011) to the Indonesian portion of an Indonesian-English bilingual corpus. We did not have access to the MorphoChallenge test set, and, thus, we used the original development set as our final evaluation set (Test). We developed CHIPMUNK using 10-fold cross-validation on the 1000-word training set and split every fold into training (Train), tuning (Tune) and development sets (Dev). For German, Indonesian and Zulu, we randomly selected 1000 word forms as the training set and used the rest as the evaluation set. For our final evaluation we trained CHIPMUNK on the concatenation of Train, Tune and Dev (the original 1000-word training set), using the optimal parameters from the cross-validation, and tested on Test. Table 4 shows the important statistics of our datasets. One of our baselines also uses unlabeled training data. MorphoChallenge provides word lists for English, Finnish, German and Turkish. We use the unannotated part of Ukwabelana for Zulu; and for Indonesian, data from Wikipedia and the corpus of Krisnawati and Schulz (2013).
In all evaluations, we use variants of the standard MorphoChallenge evaluation approach. Importantly, for word types with multiple correct segmentations, this involves finding the maximum score by comparing the one-best segmentation under CHIPMUNK, as computed by the Viterbi algorithm, with each correct segmentation, as is standardly done in MorphoChallenge.

UMS Experiments
We first evaluate CHIPMUNK on UMS, by predicting LMS and then discarding the labels. Our primary baseline is the state-of-the-art supervised system CRF-MORPH of Ruokolainen et al. (2013).
We ran the version of the system that the authors published on their website.⁹ We optimized the model's two hyperparameters on Tune: the number of epochs and the maximal length of n-gram character features. The system also supports Harris's (1995) letter successor variety (LSV) features, extracted from large unannotated corpora. For completeness, we also compare CHIPMUNK with a first-order CRF and a higher-order CRF (Müller et al., 2013), both of which use the same n-gram features as CRF-MORPH, but without the LSV features.¹⁰ We evaluate all models using the traditional macro F1 of the segmentation boundaries.
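A per-word version of the boundary-based score, including the maximum over alternative gold segmentations mentioned earlier, might look as follows. This is a sketch under assumptions: the conventions for words without boundaries (precision or recall set to 1.0) and the function names are choices made here, and the exact aggregation over the corpus used by MorphoChallenge is not reproduced.

```python
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Internal boundary offsets, e.g. ['see', 's'] gives {3}."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over segmentation boundaries for one word."""
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 1.0  # assumed convention: no predictions
    recall = hits / len(g) if g else 1.0     # assumed convention: no gold boundaries
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_f1(predicted: List[str], gold_alternatives: List[List[str]]) -> float:
    """Score against every correct segmentation and keep the maximum."""
    return max(boundary_prf(predicted, gold)[2] for gold in gold_alternatives)


print(boundary_prf(["un", "lock", "able"], ["un", "lockable"]))
print(best_f1(["un", "lock", "able"], [["un", "lockable"], ["un", "lock", "able"]]))
```

Scoring boundaries rather than whole segments gives partial credit to a prediction that finds some of the gold split points, which matters for long agglutinative words.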
General Discussion. The UMS results on held-out data are displayed in Table 6. Our most complex model beats the best baseline by between 1 (German) and 3 (Finnish) points F1 on all six languages. We additionally provide extensive ablation studies to highlight the contribution of our novel features. We find that the properties of each specific language highly influence which features are most effective. For the agglutinative languages, i.e., Finnish, Turkish and Zulu, the affix-based features (+Affix) and the morphotactic tagset (+Morph) yield consistent improvements over the semi-CRF models with a single state. Improvements for the affix features range from 0.2 for Turkish to 2.14 for Zulu. The morphological tagset yields improvements of 0.77 for Finnish, 1.89 for Turkish and 2.10 for Zulu. We optimized tagset granularity on Tune and found that level 4 and level 2 yielded the best results for the three agglutinative and the three other languages, respectively. The dictionary features (+Dict) help universally, but their effects are particularly salient in languages with productive compounding, i.e., English, Finnish and German, where we see improvements of > 1.7. In comparison with previous work (Ruokolainen et al., 2013), we find that our most complex model yields consistent improvements over CRF-MORPH +LSV for all languages: the improvements range from > 1 for German over > 1.5 for Zulu, English, and Indonesian to > 2 for Turkish and > 4 for Finnish.
⁹ http://users.ics.tkk.fi/tpruokol/software/crfs_morph.zip
¹⁰ Model order, maximal character n-gram length and regularization coefficients were optimized on Tune.
The Role of Morphotactics. To illustrate the effect of modeling morphotactics through the larger morphotactic tagset on performance, we provide a detailed analysis of Turkish; see Table 5. We consider three different feature sets and increase the size of the morphotactic tagsets depicted in Figure 2. The results evince the general trend that improved morphotactic modeling benefits segmentation. Additionally, we observe that the improvements are complementary to those from the other features.
Novel Roots and Affixes. As discussed earlier, a key problem in UMS, especially in low-resource settings, is the detection of novel roots and affixes. Since many of our features were designed to combat this problem specifically, we investigated this aspect independently. Table 7 shows the number of novel roots and affixes found by our best model and the baseline. In all languages, CHIPMUNK correctly identifies between 5% (English) and 22% (Finnish) more novel roots than the baseline. We do not see major improvements for affixes, but this is of less interest as there are far fewer novel affixes.
Boundaries. We further explore how CHIPMUNK and the baseline perform on different boundary types by looking at missing boundaries between different morphotactic types; this error type is also known as undersegmentation. Figure 3 shows a heatmap that gives an overview of errors broken down by morphotactic tag. We see that most errors occur between roots and suffixes across all languages. This is related to the problem of finding new roots, as a new root is often mistaken for a root-affix composition.
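The undersegmentation analysis can be reproduced in spirit by counting, for each gold boundary that a system misses, the pair of gold tags on either side of it. The sketch below is an illustration only: the function names and the English example are invented, and the abbreviations Px, Rt and Sx follow the labels used in Figure 3.

```python
from collections import Counter
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Internal boundary offsets of a segmentation."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def missed_boundary_types(gold: List[Tuple[str, str]], predicted: List[str]) -> Counter:
    """Count gold boundaries absent from the prediction, keyed by adjacent gold tags."""
    predicted_boundaries = boundaries(predicted)
    missed: Counter = Counter()
    offset = 0
    for (segment, tag), (_, next_tag) in zip(gold, gold[1:]):
        offset += len(segment)
        if offset not in predicted_boundaries:
            missed[f"{tag}-{next_tag}"] += 1
    return missed


# Invented example: the prediction merges the root and the suffix.
GOLD_UNLOCKS = [("un", "Px"), ("lock", "Rt"), ("s", "Sx")]
print(missed_boundary_types(GOLD_UNLOCKS, ["un", "locks"]))
```

Summing such counters over a test set for two systems gives the kind of per-type comparison that the heatmap summarizes.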

Root Detection and Stemming
Root detection and stemming are two important NLP problems that are closely related to morphological segmentation and used in applications such as MT, information retrieval, parsing and information extraction. Here we explore the utility of CHIPMUNK as a statistical stemmer and root detector. Stemming is closely related to the task of lemmatization, which involves the additional step of normalizing to the canonical form.¹¹ Consider the German particle verb participle auf-ge-schrieben 'written down'. The participle is built by applying an alternation to the verbal root schreib 'write', adding the participial circumfix ge-en and finally adding the verb particle auf. In our segmentation-based definition, we would consider schrieb 'write' as its root and auf-schrieb as its stem. In order to additionally restore the lemma, we would also have to reverse the stem alternation that replaced ei with ie and add the infinitival ending en, yielding the infinitive auf-schreib-en. Our baseline MORFETTE (Chrupała et al., 2008) is a statistical transducer that first extracts edit paths between input and output and then uses a perceptron classifier to decide which edit path to apply. In short, MORFETTE treats the task as a string-to-string transduction problem, whereas we view it as a labeled segmentation problem. Note that MORFETTE would in principle be able to handle stem alternations, although these usually lead to an increase in the number of edit paths. We use level 2 tagsets for all experiments, the smallest tagsets complex enough for stemming, and extract the relevant segments.
¹¹ In our experiments there are no stem alternations. The output is equivalent to that of the Porter stemmer (Porter, 1980).
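Extracting the relevant segments from a level 2 labeled segmentation is a simple filter. The sketch below assumes hypothetical tag spellings such as `PREFIX:INFLECTIONAL`; treating the particle auf as a derivational prefix is likewise an assumption made so that the output matches the stem and root given in the text. For compounds, this sketch simply concatenates all roots.

```python
from typing import List, Tuple


def stem_and_root(labeled: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Derive (stem, root) from a list of (segment, level-2 tag) pairs."""
    # Stemming strips only inflectional affixes.
    stem = "".join(seg for seg, tag in labeled if not tag.endswith("INFLECTIONAL"))
    # Root detection strips derivational and inflectional affixes alike.
    root = "".join(seg for seg, tag in labeled if tag == "ROOT")
    return stem, root


# auf-ge-schrieb-en 'written down'; the tag assignment is illustrative.
AUFGESCHRIEBEN = [
    ("auf", "PREFIX:DERIVATIONAL"),
    ("ge", "PREFIX:INFLECTIONAL"),
    ("schrieb", "ROOT"),
    ("en", "SUFFIX:INFLECTIONAL"),
]

print(stem_and_root(AUFGESCHRIEBEN))
```

Nothing beyond the labels is needed at this stage, which is why the level 2 tagset is the smallest one that supports stemming: it is the first level that separates derivational from inflectional affixes.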
Discussion. Our results are shown in Table 8. We see consistent improvements across all tasks. For the fusional languages (English, German and Indonesian) we see modest gains in performance on both root detection and stemming. However, for the agglutinative languages (Finnish, Turkish and Zulu) we see absolute gains as high as 50% (Turkish) in accuracy. This significant improvement is due to the complexity of the tasks in these languages: their productive morphology increases sparsity and makes the unstructured string-to-string transduction approach suboptimal. We view this as solid evidence that labeled segmentation has utility in many components of the NLP pipeline.

Morphological Tag Classification
The joint modeling of segmentation and morphotactic tags allows us to use CHIPMUNK for a crude form of morphological analysis: the task of morphological tag classification, which we define as annotation of a word with its most likely inflectional features.¹³ To be concrete, our task is to predict the inflectional features of a word type based only on its character sequence and not its sentential context.
To this end, we take Finnish and Turkish as two examples of languages that should suit our approach particularly well as both have highly complex inflectional morphologies. We use the level 4 tagset and replace all non-inflectional tags with a simple segment tag. The tagset sizes are listed in Table 10. We use the same experimental setup as in Section 6.2 and compare CHIPMUNK to a maximum entropy classifier (MaxEnt), whose features are character n-grams of up to a maximal length of k.¹⁴ The maximum entropy classifier is L1-regularized and its regularization coefficient as well as the value for k are optimized on Tune. As a second, stronger baseline we use a MaxEnt classifier that splits tags into their constituents and concatenates the features with every constituent as well as the complete tag (MaxEnt +Split). Both of the baselines in Table 9 are 0th-order versions of the state-of-the-art CRF-based morphological tagger MARMOT (Müller et al., 2013) (since our model is type-based), making this a strong baseline. We report full analysis accuracy and macro F1 on the set of individual inflectional features.
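The character n-gram features of the MaxEnt baseline can be sketched as follows. The text states only that n-grams up to length k are used and that prefixes and suffixes are explicitly marked; the particular marking scheme (`prefix=`, `suffix=`, `ngram=`) and the function name are assumptions of this example.

```python
from typing import Set


def char_ngram_features(word: str, k: int) -> Set[str]:
    """All character n-grams of length 1..k, with word-initial and word-final ones marked."""
    features: Set[str] = set()
    n = len(word)
    for length in range(1, k + 1):
        for start in range(n - length + 1):
            gram = word[start:start + length]
            features.add(f"ngram={gram}")
            if start == 0:
                features.add(f"prefix={gram}")  # n-gram anchored at the word start
            if start + length == n:
                features.add(f"suffix={gram}")  # n-gram anchored at the word end
    return features


print(sorted(char_ngram_features("ab", 2)))
```

Such features see the word only as a bag of substrings, whereas the labeled segmentation model commits to one consistent analysis of the whole word.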
Discussion. The results in Table 9 show that our proposed method outperforms both baselines on both performance metrics. We see gains of over 6% in accuracy in both languages. This is evidence that our proposed approach could be successfully integrated into a morphological tagger to give a stronger character-based signal.
¹³ We recognize that this task is best performed with sentential context (token-based). Integration with a POS tagger, however, is beyond the scope of this paper.
¹⁴ Prefixes and suffixes are explicitly marked.

Comparison to Finite-State Morphology
A morphological finite-state analyzer is customarily a hand-crafted tool that generates all the possible morphological readings with their associated features. We believe that, for many applications, high-quality finite-state morphological analysis is superior to CHIPMUNK. Finite-state morphological analyzers output a small set of linguistically valid analyses of a type, typically with only limited overgeneration. However, there are two limitations with finite-state morphological analyzers. The first is that significant effort is required to develop the transducers modeling the morphological grammar, and creating and updating the lexicon is laborious.
The second is that it is difficult to use a finite-state analyzer to guess analyses involving roots not covered in the lexicon. In fact, this is usually solved by viewing it as a different problem, morphological guessing, where linguistic knowledge similar to the features we have presented is used to try to guess POS and morphological analysis for types with no analysis in the finite-state analyzer.
In contrast, our training procedure learns a probabilistic transducer, which is a soft version of the type of hand-engineered grammar that is used in finite-state analyzers. The 1-best labeled morphological segmentation our model produces offers a simple and clean representation which could be of use in many downstream applications. Furthermore, our model unifies analysis and guessing into a single simple framework. Nevertheless, finite-state morphologies are still extremely useful, high-precision tools. A primary goal of future work will be to use CHIPMUNK to attempt to induce higher-quality morphological processing systems.

Conclusion and Future Work
We have presented labeled morphological segmentation in this paper, a new approach to morphological processing. LMS unifies three existing tasks in the literature: unlabeled morphological segmentation, stemming, and morphological tag classification. Our hierarchy of labeled morphological segmentation tagsets can be used to map the heterogeneous data in the six languages we work with to universal representations of different granularities. We plan to create gold-standard segmentations in more languages using our annotation scheme.
We further presented CHIPMUNK, a semi-CRF-based model for LMS that allows for the integration of various linguistic features and consistently outperforms previously presented approaches to unlabeled morphological segmentation. An important extension of CHIPMUNK is embedding it in a context-sensitive POS tagger. Current state-of-the-art models only employ character-level n-gram features to model word-internal structure (Müller et al., 2013). We have demonstrated that our structured approach outperforms this baseline. We leave this natural extension to future work.

Figure 1: Examples of the tasks addressed for the Turkish word gençleşmelerin ('of the rejuvenatings'): traditional unlabeled segmentation (UMS), labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.

Figure 3: This figure represents a comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-MORPH +LSV (top number in heatmap) and CHIPMUNK (bottom number in heatmap) select a segment that is two separate segments in the gold standard. E.g., Rt-Sx indicates how often a root and a suffix were treated as a single segment. The color depends on the difference of the two counts.

Table 1: Morphotactic tagset size at each level of granularity.

Table 2: Sizes of the various affix gazetteers.

Table 3: Number of words covered by the ASPELL dictionary
observed in training; this particularly affects roots, which are open-class. This makes it nearly impossible to correctly segment compounds that contain unseen roots, e.g., to correctly segment homework you need to know that home and work are independent English words. We solve this problem by incorporating spell-check features: binary features that fire if a segment is valid for a given spell checker. Spell-check features act as an effective proxy for a root detector. We use the open-source ASPELL dictionaries as they are freely available in 91 languages. Table 3 shows the coverage.
Integrating the Features. Our model uses the features discussed in this section and additionally the simple n-gram context features of Ruokolainen et al. (2013).
formulates the model as sequence prediction based on HMMs over a morph dictionary and MAP estimation. The

Table 5: Example of the effect of larger tagsets (Figure

Table 8: Accuracy on root detection and stemming on Test.

Table 9: Test results on morphological tag classification.

Table 10: Number of full word and morpheme tags.