Joint Lemmatization and Morphological Tagging with Lemming

We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech lemmatization, we reduce the error by 60%, from 4.05 to 1.58. We also give empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.


Introduction
Lemmatization is important for many NLP tasks, including parsing (Björkelund et al., 2010; Seddah et al., 2010) and machine translation (Fraser et al., 2012). Lemmata are required whenever we want to map words to lexical resources and establish the relation between inflected forms. This is particularly critical for morphologically rich languages, where unlemmatized forms are sparse. It strongly motivates work on language-independent token-based lemmatization, but until now there has been little such work (Chrupała et al., 2008).
Many regular transformations can be described by simple replacement rules, but lemmatization of unknown words requires more than this. For instance, the Spanish paradigms for verbs ending in ir and er share the same 3rd person plural ending en; this makes it hard to decide which paradigm a form belongs to (compare admiten "they admit" → admitir "to admit" with deben "they must" → deber "to must"). Solving these kinds of problems requires global features on the lemma. Global features of this kind were not supported by previous work (Dreyer et al., 2008; Chrupała, 2006; Toutanova and Cherry, 2009; Cotterell et al., 2014).
There is a strong mutual dependency between (i) lemmatization of a form in context and (ii) disambiguating its part-of-speech (POS) and morphological attributes. Attributes often disambiguate the lemma of a form, which explains why many NLP systems (Manning et al., 2014; Padró and Stanilovsky, 2012) apply a pipeline approach of tagging followed by lemmatization. Conversely, knowing the lemma of a form is often beneficial for tagging, for instance in the presence of syncretism; e.g., since German plural noun phrases do not mark gender, it is important to know the lemma (singular form) to correctly tag gender on the noun.
We make the following contributions. (i) We present the first joint log-linear model of morphological analysis and lemmatization that operates at the token level and is also able to lemmatize unknown forms; and release it as open-source (http://cistern.cis.lmu.de/lemming). It is trainable on corpora annotated with gold standard tags and lemmata. Unlike other work (e.g., Smith et al. (2005)) it does not rely on morphological dictionaries or analyzers. (ii) We describe a log-linear model for lemmatization that can easily be incorporated into other models and supports arbitrary global features on the lemma. (iii) We set the new state of the art in token-based statistical lemmatization on six languages (English, German, Czech, Hungarian, Latin and Spanish). (iv) We experimentally show that jointly modeling morphological tags and lemmata is mutually beneficial and yields significant improvements in joint (tag+lemma) accuracy for four out of six languages; e.g., Czech lemma errors are reduced by >37% and tag+lemma errors by >6%.

Log-Linear Lemmatization
Chrupała (2006) formalizes lemmatization as a classification task through the deterministic pre-extraction of edit operations transforming forms into lemmata. Our lemmatization model is in this vein, but allows the addition of external lexical information, e.g., whether the candidate lemma is in a dictionary. Formally, lemmatization is a string-to-string transduction task. Given an alphabet Σ, it maps an inflected form w ∈ Σ* to its lemma l ∈ Σ* given its morphological attributes m. We model this process by a log-linear model:
\[ p(l \mid w, m) = \frac{h_w(l) \cdot \exp\left( f(l, w, m)^\top \theta \right)}{\sum_{l' \in \Sigma^*} h_w(l') \cdot \exp\left( f(l', w, m)^\top \theta \right)} \]
where f represents hand-crafted feature functions, θ is a weight vector, and h_w : Σ* → {0, 1} determines the support of the distribution, i.e., the set of candidates with non-zero probability.
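Because the support function restricts the model to a finite candidate set, the distribution amounts to a softmax over the candidates. The following is a minimal sketch under that reading, not the released implementation: the function names, the feature names and all weights are invented for illustration, and features are represented as a sparse dictionary.

```python
import math

def lemma_distribution(candidates, features, theta):
    """p(l | w, m) over a finite candidate set (the lemmata with h_w(l) = 1).

    candidates: list of candidate lemmata
    features:   function mapping a lemma to a dict {feature name: value}
    theta:      dict {feature name: weight}; unknown features weigh 0
    """
    scores = [sum(theta.get(name, 0.0) * value
                  for name, value in features(lemma).items())
              for lemma in candidates]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exp_scores = [math.exp(s - top) for s in scores]
    z = sum(exp_scores)
    return {lemma: e / z for lemma, e in zip(candidates, exp_scores)}

# Toy example for the Spanish form "deben" (all weights are invented).
theta = {"suffix=er": 1.2, "suffix=ir": 0.9, "in_dictionary": 2.0}
dictionary = {"deber"}  # hypothetical lexicon

def features(lemma):
    feats = {"suffix=" + lemma[-2:]: 1.0}
    if lemma in dictionary:
        feats["in_dictionary"] = 1.0  # a global feature on the lemma
    return feats

probs = lemma_distribution(["deber", "debir"], features, theta)
```

In this toy setting the dictionary feature is what separates deber from the unattested debir, which is the kind of global lemma information the model is designed to accommodate.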

Candidate selection
A proper choice of the support function h(•) is crucial to the success of the model: if it is too permissive, the computational cost builds up; if it is too restrictive, the correct lemma may receive no probability mass. Following Chrupała (2008), we define h(•) through a deterministic pre-extraction of edit trees. To extract an edit tree e for a form-lemma pair ⟨w, l⟩, we first find the longest common substring (LCS) (Gusfield, 1997) between them and then recursively model the prefix and suffix pairs of the LCS. When no LCS can be found, the string pair is represented as a substitution operation transforming the first string to the second. The resulting edit tree does not encode the LCSs but only the length of their prefixes and suffixes and the substitution nodes (cf. Figure 3); e.g., the same tree transforms worked into work and touched into touch.
As a preprocessing step, we extract all edit trees that can be used for more than one pair ⟨w, l⟩. To generate the candidates of a word-form, we apply all edit trees and also add all lemmata this form was seen with in the training set (note that only a small subset of the edit trees is applicable for any given form because most require incompatible substitution operations).
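The extraction and filtering of edit trees can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: trees are encoded as nested tuples (an invented representation), the LCS is found with Python's difflib, whose tie-breaking between equally long substrings may differ from the original system, and "can be used for more than one pair" is approximated by "is extracted from more than one distinct pair".

```python
from collections import Counter
from difflib import SequenceMatcher

def build_tree(form, lemma):
    """Return a hashable edit tree for a form-lemma pair.

    LCS node:          ("lcs", prefix_tree, prefix_len, suffix_tree, suffix_len)
    substitution node: ("sub", form_part, lemma_part)
    None stands for the empty tree (nothing left to transform).
    """
    if not form and not lemma:
        return None
    match = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma))
    if match.size == 0:
        # No common substring: substitute the whole block.
        return ("sub", form, lemma)
    i_s, i_e = match.a, match.a + match.size
    j_s, j_e = match.b, match.b + match.size
    # Only the prefix and suffix lengths are stored, not the LCS itself.
    return ("lcs",
            build_tree(form[:i_s], lemma[:j_s]), i_s,
            build_tree(form[i_e:], lemma[j_e:]), len(form) - i_e)

def frequent_trees(pairs, min_pairs=2):
    """Keep the edit trees extracted from at least min_pairs distinct pairs."""
    counts = Counter(build_tree(w, l) for w, l in set(pairs))
    return {tree for tree, n in counts.items() if n >= min_pairs}

pairs = [("worked", "work"), ("touched", "touch"), ("was", "be")]
trees = frequent_trees(pairs)
```

With these three pairs, worked and touched share one tree (strip a two-character suffix ed), while the irregular pair was/be yields a pure substitution that is seen only once and is therefore filtered out.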

Features
Our novel formalization lets us combine a wide variety of features that have been used in different previous models. All features are extracted given a form-lemma pair ⟨w, l⟩ created with an edit tree e. We use the following three edit tree features of Chrupała (2008). (i) The edit tree e. (ii) The pair ⟨e, w⟩. This feature is crucial for the model to memorize irregular forms, e.g., the lemma of was is be. (iii) For each form affix (of maximum length 10): its conjunction with e. These features are useful in learning orthographic and phonological regularities, e.g., the lemma of signalling is signal, not signall.
Figure 2: Our model is a 2nd-order linear-chain CRF augmented to predict lemmata. We heavily prune our model and can easily exploit higher-order (>2) tag dependencies.
We define the following alignment features. Similar to Toutanova and Cherry (2009) (TC), we define an alignment between w and l. Our alignments can be read from an edit tree by aligning the characters in LCS nodes character by character and characters in substitution nodes block-wise. Thus the alignment of umgeschaut-umschauen is: u-u, m-m, ge-ϵ, s-s, c-c, h-h, a-a, u-u, t-en. Each alignment pair constitutes a feature in our model. These features allow the model to learn that the substitution t/en is likely in German. We also concatenate each alignment pair with its form and lemma character context (of up to length 6) to learn, e.g., that ge is often deleted after um.
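The alignment can be read off the same recursive LCS decomposition that defines the edit tree. The sketch below recomputes that decomposition directly from the two strings instead of walking a stored tree, which is a simplification made for brevity; the function name is invented, and difflib's choice among equally long common substrings is an assumption that may differ from the original system.

```python
from difflib import SequenceMatcher

def align(form, lemma):
    """Return alignment pairs (form part, lemma part) from the recursive LCS split."""
    if not form and not lemma:
        return []
    m = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma))
    if m.size == 0:
        # Substitution node: the two blocks are aligned as a whole.
        return [(form, lemma)]
    left = align(form[:m.a], lemma[:m.b])
    # LCS node: aligned character by character.
    middle = [(c, c) for c in form[m.a:m.a + m.size]]
    right = align(form[m.a + m.size:], lemma[m.b + m.size:])
    return left + middle + right

pairs = align("umgeschaut", "umschauen")
```

Here the empty string plays the role of ϵ, so the deletion of ge appears as the pair ("ge", "").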
We define two simple lemma features. (i) We use the lemma itself as a feature, allowing us to learn which lemmata are common in the language. (ii) Prefixes and suffixes of the lemma (of maximum length 10). This feature allows us to learn that the typical endings of Spanish verbs are ir, er, ar.
We also use two dictionary features (on lemmata): whether l occurs more than 5 times in Wikipedia and whether it occurs in the dictionary ASPELL. We use a similar feature for different capitalization variants of the lemma (lowercase, first letter uppercase, all uppercase, mixed). This differentiation is important for German, where nouns are capitalized and en is both a noun plural marker and a frequent verb ending. Ignoring capitalization would thus lead to confusion.
POS & morphological attributes. For each feature listed previously, we create a conjunction with the POS and each morphological attribute.
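As an illustration of how such templates might be instantiated, the sketch below generates the two lemma features and conjoins them with a POS tag and morphological attributes. The string encoding of features, the function names and the tag and attribute labels are all invented here, and keeping the unconjoined features alongside the conjunctions is an assumption.

```python
def lemma_features(lemma, max_affix=10):
    """Lemma identity plus its prefixes and suffixes of up to max_affix characters."""
    feats = ["lemma=" + lemma]
    for n in range(1, min(max_affix, len(lemma)) + 1):
        feats.append("prefix=" + lemma[:n])
        feats.append("suffix=" + lemma[-n:])
    return feats

def conjoin_with_tag(feats, pos, attributes):
    """Add a copy of every feature conjoined with the POS and with each attribute."""
    out = list(feats)
    for feat in feats:
        out.append(feat + "&pos=" + pos)
        for attr in attributes:
            out.append(feat + "&" + attr)
    return out

# Hypothetical tag and attribute labels for the candidate lemma "admitir".
feats = conjoin_with_tag(lemma_features("admitir"), "VERB", ["num=pl", "person=3"])
```

A feature such as suffix=ir&pos=VERB lets a model of this kind learn that ir is a typical ending of verb lemmata specifically, rather than of lemmata in general.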

Joint Tagging and Lemmatization
We model the sequence of morphological tags using MARMOT (Mueller et al., 2013), a pruned higher-order CRF. This model avoids the exponential runtime of higher-order models by employing a pruning strategy. Its feature set consists of standard tagging features: the current word, its affixes and shape (capitalization, digits, hyphens) and the immediate lexical context. We combine lemmatization and higher-order CRF components in a tree-structured CRF. Given a sequence of forms w with lemmata l and morphological+POS tags m, we define a globally normalized model:
\[ p(l, m \mid w) \propto \prod_i h_{w_i}(l_i) \exp\left( f(l_i, w_i, m_i)^\top \theta + g(m_i, m_{i-1}, m_{i-2}, w, i)^\top \lambda \right) \]
where f and g are the features associated with lemma and tag cliques respectively and θ and λ are weight vectors. The graphical model is shown in Figure 2. We perform inference with belief propagation (Pearl, 1988) and estimate the parameters with SGD (Tsuruoka et al., 2009). We greatly improved the results of the joint model by initializing it with the parameters of a pretrained tagging model.
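To make the global normalization concrete, the following sketch scores every joint (lemma, tag) sequence of a very short sentence and normalizes by brute force. It is not the inference procedure used in the paper, which relies on belief propagation and pruning; enumeration is exponential in sentence length and is used only because it is easy to verify. The scoring functions stand in for the dot products of features and weights, the tag scorer ignores the word context for brevity, and all names, tags and scores are invented.

```python
import itertools
import math

def joint_distribution(candidates, lemma_score, tag_score):
    """Brute-force p(l, m | w) for a short sentence.

    candidates:  per position, a list of (lemma, tag) pairs in the support
    lemma_score: function (position, lemma, tag) -> score of the lemma clique
    tag_score:   function (tag, previous tag, tag before that) -> score of the tag clique
    """
    scores = {}
    for seq in itertools.product(*candidates):
        # Pad with two None tags so every position has a 2nd-order history.
        tags = [None, None] + [tag for _, tag in seq]
        total = 0.0
        for i, (lemma, tag) in enumerate(seq):
            total += lemma_score(i, lemma, tag)
            total += tag_score(tag, tags[i + 1], tags[i])
        scores[seq] = total
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {seq: math.exp(s - top) / z for seq, s in scores.items()}

# Toy sentence "der Raps" with invented candidates and scores.
candidates = [[("der", "ART")], [("Raps", "N.sg"), ("Rap", "N.pl")]]

def lemma_score(i, lemma, tag):
    return 1.0 if lemma == "Raps" else 0.2

def tag_score(tag, prev, prev2):
    return 0.8 if (prev, tag) == ("ART", "N.sg") else 0.0

dist = joint_distribution(candidates, lemma_score, tag_score)
best = max(dist, key=dist.get)
```

Because lemma and tag are scored together, evidence for a singular reading in the tag chain raises the probability of the lemma Raps and vice versa, which is the coupling that a pipeline cannot express.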

Related Work
In functionality, our system resembles MORFETTE (Chrupała et al., 2008), which generates lemma candidates by extracting edit operation sequences between lemmata and surface forms (Chrupała, 2006), and then trains two maximum entropy Markov models (Ratnaparkhi, 1996) for morphological tagging and lemmatization, which are queried using a beam search decoder.
In our experiments we use the latest version of MORFETTE. This version is based on structured perceptron learning (Collins, 2002) and edit trees (Chrupała, 2008). Models similar to MORFETTE include those of Björkelund et al. (2010) and Gesmundo and Samardžić (2012) and have also been used for generation (Dušek and Jurčíček, 2013). Wicentowski (2002) similarly treats lemmatization as classification over a deterministically chosen candidate set, but uses distributional information extracted from large corpora as a key source of information.
Toutanova and Cherry (2009)'s joint morphological analyzer predicts the set of possible lemmata and coarse-grained POS for a word type. This is different from our problem of lemmatization and fine-grained morphological tagging of tokens in context. Despite the superficial similarity of the two problems, direct comparison is not possible. TC's model is best thought of as inducing a tagging dictionary for OOV types, mapping them to a set of tag and lemma pairs, whereas LEMMING is a token-level, context-based morphological tagger.
We do, however, use TC's model of lemmatization, a string-to-string transduction model based on Jiampojamarn et al. (2008) (JCK), as a stand-alone baseline. Our tagging-in-context model is faced with higher complexity of learning and inference since it addresses a more difficult task; thus, while we could in principle use JCK as a replacement for our candidate selection, the edit tree approach, which has high coverage at a low average number of lemma candidates (cf. Section 5), allows us to train and apply LEMMING efficiently.
Smith et al. (2005) proposed a log-linear model for the context-based disambiguation of a morphological dictionary. This has the effect of joint tagging, morphological segmentation and lemmatization, but, critically, is limited to the entries in the morphological dictionary (without which the approach cannot be used), causing problems of recall. In contrast, LEMMING can analyze any word, including OOVs, and only requires the same training corpus as a generic tagger (containing tags and lemmata), a resource that is available for many languages.

Experiments
Datasets. We present experiments on the joint task of lemmatization and tagging in six diverse languages: English, German, Czech, Hungarian, Latin and Spanish. We use the same data sets as in Müller and Schütze (2015) [...] out-of-domain test sets. The English data is from the Penn Treebank (Marcus et al., 1993) [...] (Hajič et al., 2009). For German, Hungarian, Spanish and Czech we use the splits from the shared tasks; for English the split from SANCL (Petrov and McDonald, 2012); and for Latin an 8/1/1 split into train/dev/test. For all languages we limit our training data to the first 100,000 tokens. Dataset statistics can be found in Table A4 of the appendix. The lemma of Spanish se is set to be consistent.
[...] In each cell, overall token accuracy is on the left (all) and accuracy on unknown forms is on the right (unk). Standalone MARMOT tagging accuracy (line 1) is not repeated for pipelines (lines 2-7). The best numbers are bold. LEMMING-J models that are significantly better than LEMMING-P (+), than LEMMING models not using morphology (+dict) (×), or than both (+ ×) are marked. More baseline numbers are in the appendix (Table A2).
Baselines. We compare our model to three baselines. (i) MORFETTE (see Section 4). (ii) SIMPLE, a system that, for each form-POS pair, returns the most frequent lemma in the training data, or the form itself if the pair is unknown. (iii) JCK, our reimplementation of Jiampojamarn et al. (2008). Recall that JCK is TC's lemmatization model and that the full TC model is a type-based model that cannot be applied to our task. As JCK struggles to memorize irregulars, we only use it for unknown form-POS pairs and use SIMPLE otherwise. For aligning the training data we use the edit-tree-based alignment described in the feature section. We only use output alphabet symbols that are used for ≥ 5 form-lemma pairs and also add a special output symbol that indicates that the aligned input should simply be copied. We train the model using a structured averaged perceptron and stop after 10 training iterations.
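The SIMPLE baseline is fully specified by its description and can be sketched in a few lines. The function names and the toy training triples below are invented for illustration; how ties between equally frequent lemmata are broken is not stated in the text and is left to Counter.most_common here.

```python
from collections import Counter, defaultdict

def train_simple(tokens):
    """tokens: iterable of (form, pos, lemma) triples from the training data."""
    counts = defaultdict(Counter)
    for form, pos, lemma in tokens:
        counts[(form, pos)][lemma] += 1
    # Memorize the most frequent lemma of every form-POS pair.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def lemmatize_simple(model, form, pos):
    """Return the memorized lemma, or the form itself for an unknown pair."""
    return model.get((form, pos), form)

train = [("was", "VBD", "be"), ("was", "VBD", "be"),
         ("saw", "VBD", "see"), ("saw", "NN", "saw")]
model = train_simple(train)
```

Keying on the form-POS pair rather than the form alone is what lets the baseline separate the verb saw (lemma see) from the noun saw.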
In preliminary experiments we found type-based training to outperform token-based training. This is understandable as we only apply our model to unseen form-POS pairs. The feature set is an exact reimplementation of Jiampojamarn et al. (2008); it consists of input-output pairs and their character context in a window of 6.
Results. Our candidate selection strategy results in an average number of lemma candidates between 7 (Hungarian) and 91 (Czech) and a coverage of the correct lemma on dev of >99.4 (except 98.4 for Latin). We first compare the baselines to LEMMING-P, a pipeline based on Section 2, that lemmatizes a word given a predicted tag and is trained using L-BFGS (Liu and Nocedal, 1989). We use the implementation of MALLET (McCallum, 2002). For these experiments we train all models on gold attributes and test on attributes predicted by MORFETTE. MORFETTE's lemmatizer can only be used with its own tags. We thus use MORFETTE tags to have a uniform setup, which isolates the effects of the different taggers. Numbers for MARMOT tags are in the appendix (Table A1). For the initial experiments, we only use POS and ignore additional morphological attributes. We use different feature sets to illustrate the utility of our templates. The first model uses the edit tree features (edittree). Table 1 shows that this version of LEMMING outperforms the baselines on half of the languages. In a second experiment we add the alignment (+align) and lemma features (+lemma) and show that this consistently outperforms all baselines and edittree. We then add the dictionary feature (+dict). The resulting model outperforms all previous models and is significantly better than the best baselines for all languages. These experiments show that LEMMING-P yields state-of-the-art results and that all our features are needed to obtain optimal performance. The improvements over the baselines are >1 for Czech and Latin and ≥.5 for German and Hungarian.
The last experiment also uses the additional morphological attributes predicted by MORFETTE (+mrph). This leads to a drop in lemmatization performance in all languages except Spanish (English has no additional attributes). However, preliminary experiments showed that correct morphological attributes would substantially improve lemmatization as they help in cases of ambiguity. As an example, number helps to lemmatize the singular German noun Raps "canola", which looks like the plural of Rap "rap". Numbers can be found in Table A3 of the appendix. This motivates the necessity of joint tagging and lemmatization.
For the final experiments, we run pipeline models on tags predicted by MARMOT (Mueller et al., 2013) and compare them to LEMMING-J, the joint model described in Section 3. All LEMMING versions use exactly the same features. Table 2 shows that LEMMING-J outperforms LEMMING-P in three measures (see bold tag, lemma & joint (tag+lemma) accuracies) except for English, where we observe a tie in lemma accuracy and a small drop in tag and tag+lemma accuracy. Coupling morphological attributes and lemmatization (lines 8-10 vs 11-13) improves tag+lemma prediction for five languages. Improvements in lemma accuracy of the joint over the best pipeline systems range from .1 (Spanish), over >.3 (German, Hungarian) to ≥.96 (Czech, Latin).

Conclusion
LEMMING is a modular lemmatization model that supports arbitrary global lemma features and joint modeling of lemmata and morphological tags. It is trainable on corpora annotated with gold standard tags and lemmata, and does not rely on morphological dictionaries or analyzers. We have shown that modeling lemmatization and tagging jointly benefits both tasks, and we set the new state of the art in token-based lemmatization on six languages.

The tree creation function creates a tree given a form-lemma pair ⟨x, y⟩. LCS returns the start and end indexes of the LCS in x and y. x_{i_s}^{i_e} denotes the substring of x starting at index i_s (inclusive) and ending at index i_e (exclusive); i_e − i_s thus equals the length of this substring. |x| denotes the length of x. Note that the tree does not store the LCS, but only the length of the prefix (i_s) and of the suffix (|x| − i_e). This way the tree for umgeschaut can also be applied to transform umgebaut "renovated" into umbauen "to renovate".
For the example umgeschaut-umschauen, the LCS is the stem schau. The function then recursively transforms umge into um and t into en. The prefix and suffix lengths of the form are 4 and 1 respectively. The left sub-node needs to transform umge into um. The new LCS is um. The new prefix and suffix lengths are 0 and 2 respectively. As the new prefix is empty there is nothing more to do. The suffix node needs to transform ge into the empty string ϵ. As the new LCS of the suffix is empty, because ge and ϵ have no character in common, the node is represented as a substitution node. The remaining transformation t into en is also represented as a substitution, resulting in the tree in Figure 3.

APPLY(tree, x)
    if tree is a LCS node then
        tree → tree_i, i_l, tree_j, j_l
        if |x| < i_l + j_l then              ▷ Prefix and suffix do not fit.
            [...]
        p = APPLY(tree_i, x_0^{i_l})         ▷ Create prefix.
        if p is ⊥ then                       ▷ Prefix tree cannot be applied.
            return ⊥                         ▷ tree cannot be applied.
        [...]
In the code + represents string concatenation and ⊥ a null string, meaning that the tree cannot be applied to the form. We first run the tree depicted in Figure 3 on the form angebaut "attached (to a building)". [...]
[...] and token-based unknown form (form unk) and lemma (lemma unk) rates.
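The tree application procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the original algorithm: the nested-tuple encoding of trees is invented, None plays the role of ⊥ as a return value, and an absent subtree (also None) is assumed to fit only the empty string. The tree for umgeschaut-umschauen is written out by hand from the prefix and suffix lengths given in the text.

```python
def apply_tree(tree, form):
    """Apply an edit tree to a form; return None if the tree does not fit.

    LCS node:          ("lcs", prefix_tree, prefix_len, suffix_tree, suffix_len)
    substitution node: ("sub", form_part, lemma_part)
    A subtree of None is the empty tree (assumed to fit only the empty string).
    """
    if tree is None:
        return "" if form == "" else None
    if tree[0] == "sub":
        _, source, target = tree
        return target if form == source else None
    _, prefix_tree, prefix_len, suffix_tree, suffix_len = tree
    if len(form) < prefix_len + suffix_len:   # prefix and suffix do not fit
        return None
    prefix = apply_tree(prefix_tree, form[:prefix_len])
    if prefix is None:                        # prefix tree cannot be applied
        return None
    suffix = apply_tree(suffix_tree, form[len(form) - suffix_len:])
    if suffix is None:                        # suffix tree cannot be applied
        return None
    # The middle part of the form is copied unchanged.
    return prefix + form[prefix_len:len(form) - suffix_len] + suffix

# Edit tree of umgeschaut -> umschauen: prefix length 4, suffix length 1;
# the left sub-node has prefix length 0 and suffix length 2.
TREE = ("lcs",
        ("lcs", None, 0, ("sub", "ge", ""), 2), 4,
        ("sub", "t", "en"), 1)

lemma_umgebaut = apply_tree(TREE, "umgebaut")
lemma_angebaut = apply_tree(TREE, "angebaut")
lemma_worked = apply_tree(TREE, "worked")
```

Under these assumptions the tree maps umgebaut to umbauen, as stated in the text, and angebaut to anbauen, while it cannot be applied to worked because the substitution node expects ge where worked has rk.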

Figure 1: Edit tree for the inflected form umgeschaut "looked around" and its lemma umschauen "to look around". The right tree is the actual edit tree we use in our model, the left tree visualizes what each node corresponds to. The root node stores the length of the prefix umge (4) and the suffix t (1).

Figure 3: Edit tree for the inflected form umgeschaut "looked around" and its lemma umschauen "to look around". The right tree is the actual edit tree we use in our model, the left tree visualizes what each node corresponds to. Note how the root node stores the length of the prefix umge and the suffix t.

Djamé Seddah, Grzegorz Chrupała, Özlem Çetinoğlu, Josef van Genabith, and Marie Candito. 2010. Lemmatization and lexicalized statistical parsing of morphologically-rich languages: the case of French. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 85-93, Los Angeles, CA, USA. Association for Computational Linguistics.