Robust Morphological Tagging with Word Representations

We present a comparative investigation of word representations for part-of-speech (POS) and morphological tagging, focusing on scenarios with considerable differences between training and test data where a robust approach is necessary. Instead of adapting the model towards a specific domain we aim to build a robust model across domains. We developed a test suite for robust tagging consisting of six languages and different domains. We find that representations similar to Brown clusters perform best for POS tagging and that word representations based on linguistic morphological analyzers perform best for morphological tagging.


Introduction
Most natural language processing (NLP) tasks can be better solved if a preprocessor tags each word in the natural language input with a label like "noun, singular" or "verb, past tense" that gives some indication of the syntactic role that the word plays in its context. The most common form of such preprocessing is POS tagging. However, for morphologically rich languages, a large subset of the languages of the world, POS tagging in its original form (where labels are syntactic categories with little or no morphological information) does not make much sense. The reason is that POS and morphological properties are mutually dependent, so solving only one task or solving the tasks sequentially is inadequate. The most important dependence of this type is that POS can be read off morphology in many cases; e.g., the morphological suffix "-iste" is a reliable indicator of the informal second person singular preterite indicative form of a verb in Spanish. In what follows, we use the term "morphological tagging" to refer to "morphological and POS tagging" since morphological tags generally include POS information.
The importance of morphological tagging as part of the computational linguistics processing pipeline motivated us to conduct the research reported in this paper. The specific setting that we address is increasingly recognized as the setting in which most practical NLP takes place: We look at scenarios with considerable differences between the training data and the application data, i.e., between the data that the tagger is trained on and the data that it is applied to. This type of scenario is frequent because of the great diversity and variability of natural language and because of the high cost of annotation, which makes it impossible to create large training sets for each new domain. For this reason, we address morphological tagging in a setting in which training and application data differ.
The most common approach to this setting is domain adaptation. Domain adaptation has been demonstrated to have good performance in scenarios with differently distributed training/test data. However, it has two disadvantages. First, it requires the availability of data from the target domain. Second, we need to do some extra work in domain adaptation (taking target domain data and using it to adapt our NLP system to the target domain) and we end up with a number of different versions of our NLP system. The extra work required and the proliferation of different versions increase the possibility of errors and increase the complexity of deploying NLP technology. Similar to other recent work (Zhang and Wang, 2009), we therefore take an approach that is different from domain adaptation. We build a system that is robust across domains without any modification. As a result, no extra work is required when the system is applied to a new domain: there is only one system and we can use it for all domains.
The key to making NLP components robust across domains is the use of powerful domain-independent representations for words. One of the main contributions of this paper is that we compare the performance of the most important representations that can be used for this purpose. We find that two of these are best suited for robust tagging. MarLiN (Martin et al., 1998) clusters, a derivative of Brown clusters, perform best for POS tagging. MarLiN clusters are also an order of magnitude more efficient to induce than the original Brown clusters. We provide an open-source implementation of MarLiN clustering as part of this publication (Section 8). We compare the word representations to Morphological Analyzers (MAs), which are finite-state transducers that find the stems of a form and use them to derive all its possible morphological readings. MAs produce the best results in our experiments on morphological tagging. Our initial expectation was that domain differences and lack of coverage would put manually created MAs at a disadvantage when compared to learning algorithms that are run on very large text corpora. However, our results clearly show that MA-based representations are the best representations to use for robust morphological tagging.
The motivation for our work is that both morphological tagging and the "robust" application setting are important areas of research in NLP. To support this research, we created an extensive evaluation set for six languages. This involved identifying morphologically rich languages in which usable data sets with different distributional properties were available, designing mappings between different tag sets, organizing a manual annotation effort for one of the six languages and preparing large "general" (not domain-specific) data sets for unsupervised learning of word representations. The preparation and publication (Section 8) of this test suite is in itself a significant contribution.
The remainder of this paper is structured as follows. Section 2 discusses related work. Section 3 presents the representations we tested. Section 4 describes the data sets and the annotation and conversion efforts required to create the in-domain (ID) and out-of-domain (OOD) data sets. In Section 5, we describe the experiments and discuss our findings. In Section 6, we provide an analysis of our results. Section 7 summarizes our findings and contributions.

Related Work
Morphological tagging (Oflazer and Kuruöz, 1994; Hajič and Hladká, 1998) is the task of assigning a morphological reading to a token in context. The morphological reading consists of features such as case, gender, person and tense and is represented as a single tag. This allows for the application of standard sequence labeling algorithms such as Conditional Random Fields (CRFs) (Lafferty et al., 2001), but also puts an upper bound on the accuracy as only readings occurring in the training set can be produced. It is still the standard approach to morphological disambiguation as the number of readings that cannot be produced is usually small.
The related work can be divided into systems that try to exploit certain properties of a language (Habash and Rambow, 2005; Yuret and Türe, 2006) and language-independent systems (Hajič, 2000; Smith et al., 2005). In this paper, we adopt a language-independent approach.
Semi-supervised learning attempts to increase the accuracy of a machine learning system by using additional unlabeled data. Word representations, especially Brown clusters, have been extensively used for named entity recognition (NER) (Miller et al., 2004), parsing (Koo et al., 2008) and POS tagging (Collobert and Weston, 2008; Huang et al., 2009). In these papers, word representations were shown to yield consistent improvements and to often outperform traditional semi-supervised methods such as self-training. Prior work on semi-supervised training for morphological tagging includes Spoustová et al. (2009) and Chrupala (2011). In contrast to this earlier work on morphological tagging, we study a number of morphologically more complex and diverse languages. We also compare learned representations to representations obtained from MAs.
Domain adaptation (DA) attempts to adapt a model trained on a source domain to a target domain. DA can be broadly divided into supervised and unsupervised approaches depending on whether labeled target domain data is available or not. Among unsupervised approaches to DA, representation learning (Ando and Zhang, 2005;Blitzer et al., 2006) uses the unlabeled target domain data to induce a structure that is suitable for transferring information from the labeled source domain to the target domain. Similar to representation learning for DA, we attempt to include word representations into the model. However, we induce the representation from a general domain in an attempt to obtain a model that has robust high accuracy across domains, for the source domain as well as for the target domains, for which neither labeled nor unlabeled training data is available.

Representations
We survey the following distributional representations: (i) count vectors reduced by a Singular Value Decomposition (SVD), (ii) word clusters induced using the likelihood of a class-based language model, (iii) distributed embeddings trained using a neural network and (iv) accumulated tag counts, a task-specific representation obtained from an automatically tagged corpus.
Singular value decomposition of word-feature cooccurrence matrices (Schütze, 1995) has been found to be a fast and efficient way to obtain distributed embeddings. The approach selects a subset of the vocabulary as so-called feature words, usually by including words up to a certain frequency rank. Every word form can then be represented by the accumulated counts of feature words occurring to its left and right. Then an SVD is applied to the cooccurrence matrix as a form of dimension reduction and to reduce sparsity.
We also experimented with unreduced count vectors, but they did not give better results than SVD-reduced count vectors. SVD-based representations have been used in English POS induction (Lamar et al., 2010) as well as as features in English POS tagging and syntactic chunking (Huang et al., 2009); in these studies they reach a level of accuracy similar to that of unsupervised Hidden Markov Models (HMMs).
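The count vectors described above can be illustrated with a small sketch. It only builds the left and right feature-word counts; the SVD step is omitted and would be applied to the resulting matrix with a linear algebra library. The function name, the parameter `num_feature_words` and the toy corpus are assumptions made for illustration, not part of our implementation.

```python
from collections import Counter

def count_vectors(sentences, num_feature_words):
    """Represent each word form by counts of feature words to its left and right.

    Feature words are the `num_feature_words` most frequent word forms
    (an illustrative frequency-rank cutoff).
    """
    freq = Counter(w for sent in sentences for w in sent)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    feature_index = {w: i for i, w in enumerate(ranked[:num_feature_words])}
    k = len(feature_index)

    # Each vector has 2k entries: [left-neighbor counts | right-neighbor counts].
    vectors = {w: [0] * (2 * k) for w in freq}
    for sent in sentences:
        for pos, word in enumerate(sent):
            if pos > 0 and sent[pos - 1] in feature_index:
                vectors[word][feature_index[sent[pos - 1]]] += 1
            if pos + 1 < len(sent) and sent[pos + 1] in feature_index:
                vectors[word][k + feature_index[sent[pos + 1]]] += 1
    return vectors, feature_index

# Toy corpus (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vectors, feature_index = count_vectors(corpus, num_feature_words=2)
```

In this toy corpus, "cat" and "dog" receive identical vectors because they occur between the same feature words, which is the kind of distributional similarity the representation is meant to capture.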
Language model-based (LM-based) word clusters were introduced by Brown et al. (1992) and later found to be helpful in a range of NLP tasks. The basic idea is to find the optimal clustering with respect to the likelihood of a class-based language model:
$$g^{*} = \arg\max_{g} \prod_{i=1}^{|D|} P(g(w_i) \mid g(w_{i-1})) \, P(w_i \mid g(w_i))$$
where w_i denotes the i-th token of the training set, g(x) is the cluster assignment function that maps a word form x to a cluster and |D| denotes the length of the training set. Brown et al. (1992) propose a greedy bottom-up algorithm for the optimization that merges the pair of clusters that yields the smallest loss in likelihood, as well as a more efficient approximation of that algorithm that limits the number of clusters under consideration and still works well in practice. It is used by most work in the literature (Liang, 2005; Turian et al., 2010; Koo et al., 2008).
We, however, found the algorithm proposed by Martin et al. (1998) to be faster and to give slightly better results. The algorithm is similar to K-means in that it starts with an initial clustering and greedily improves the objective function by moving single words to their optimal cluster. In contrast to K-means, it updates the objective function immediately. The algorithm has also been shown to work well in unsupervised POS induction (Clark, 2003;Blunsom and Cohn, 2011). Our implementation of this algorithm is called MarLiN and has been made available as open-source software (Section 8). Miller et al. (2004) use tags of different granularity induced from unlabeled text to improve the performance of an averaged perceptron tagger (Collins, 2002) on an English NER task.
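To make the exchange procedure concrete, the following is a deliberately naive sketch, not the MarLiN implementation: it recomputes the class-based bigram log-likelihood from scratch for every candidate move, which is far too slow for real corpora but keeps the logic visible. The function names, the frequency-rank initialization and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def log_likelihood(tokens, assign):
    """Log-likelihood of a class-based bigram LM under the assignment `assign`."""
    classes = [assign[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    history_counts = Counter(classes[:-1])          # classes that have a successor
    bigram_counts = Counter(zip(classes, classes[1:]))
    ll = 0.0
    for (c1, _c2), n in bigram_counts.items():
        ll += n * math.log(n / history_counts[c1])  # transition P(c2 | c1)
    for w, n in word_counts.items():
        ll += n * math.log(n / class_counts[assign[w]])  # emission P(w | g(w))
    return ll

def initial_assignment(tokens, k):
    """Assumed initialization: distribute words over clusters by frequency rank."""
    counts = Counter(tokens)
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i % k for i, w in enumerate(vocab)}

def exchange_clustering(tokens, k, max_iters=10):
    """Greedily move single words to the cluster that maximizes the likelihood."""
    assign = initial_assignment(tokens, k)
    vocab = list(assign)
    best = log_likelihood(tokens, assign)
    for _ in range(max_iters):
        moved = False
        for w in vocab:
            original = assign[w]
            best_cluster = original
            for c in range(k):
                if c == original:
                    continue
                assign[w] = c
                ll = log_likelihood(tokens, assign)
                if ll > best + 1e-12:
                    best, best_cluster = ll, c
            # The assignment is updated immediately, before the next word is visited.
            assign[w] = best_cluster
            moved = moved or best_cluster != original
        if not moved:
            break
    return assign, best

# Toy corpus (hypothetical data).
tokens = "the cat sat on the mat the dog sat on the rug".split()
clusters, final_ll = exchange_clustering(tokens, k=2)
```

Because a word is only moved when the likelihood strictly increases, the objective never decreases and the loop terminates once a full pass produces no move.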
The Brown algorithm induces a tree where each leaf represents a single word form and the root node represents the entire vocabulary. Intermediate nodes represent clusters of different sizes and can be addressed by a binary string specifying the path from the root node to the cluster. Brown clusters are also used by Koo et al. (2008) to improve dependency parsing for English and Czech. Chrupala (2011) compares Brown clusters to a Latent Dirichlet Allocation (LDA) model on Spanish and French morphological tagging and finds them to yield similar performance.
Neural networks have been used by Collobert and Weston (2008) to train embeddings for POS tagging as well as other NLP tasks. These embeddings (henceforth CW embeddings) are trained by building a neural network that, given the contexts of a word as input, is trained to discriminate between the correct center word and a random word. The proposed training algorithm is reported to need several days or even weeks, but has been reimplemented by Al-Rfou et al. (2013), who induced embeddings for the Wikipedias of more than 100 languages. Turian et al. (2010) find that the performance of Brown clusters is competitive with more training-intensive embeddings like CW. In our experiments, we find that MarLiN clusters slightly outperform CW. We do not evaluate bag-of-words models such as WORD2VEC (Mikolov et al., 2013), because the ordering of words is essential for finding morphological properties.
Accumulated tag counts (ACT) are a form of task-specific sparse representation. The unlabeled corpus is first annotated by a tagger; for each occurring word form, the number of times a specific tag was assigned can then be used as a representation. Goldberg and Elhadad (2013) and Szántó and Farkas (2014) show that using such information in the word-preterminal emission probabilities of PCFGs can improve parsing accuracy. Specifically, Szántó and Farkas (2014) show that this approach performs as well as an MA in some cases. We find MAs to be more effective than the accumulated count embeddings; this is not a contradiction as we try to improve the performance of the tagger itself.
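A minimal sketch of accumulated tag counts, assuming the automatically tagged corpus is available as a list of (word form, tag) pairs. The function names and the toy data are hypothetical; the second function normalizes the counts into word-tag probabilities that can serve as sparse real-valued features.

```python
from collections import Counter, defaultdict

def accumulated_tag_counts(tagged_tokens):
    """Count how often each tag was assigned to each word form."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return counts

def tag_probabilities(counts):
    """Normalize the accumulated counts of every word form into P(tag | word)."""
    probs = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        probs[word] = {tag: n / total for tag, n in tag_counts.items()}
    return probs

# Output of a (hypothetical) tagger run on unlabeled text.
tagged = [("run", "VERB"), ("run", "NOUN"), ("run", "VERB"), ("dog", "NOUN")]
act = accumulated_tag_counts(tagged)
probs = tag_probabilities(act)
```

Only tags that were actually assigned to a word form receive an entry, so the representation stays sparse even for large tag sets.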

Data Preparation
Our test suite consists of data sets for six different languages: Czech (cs), English (en), German (de), Hungarian (hu), Spanish (es) and Latin (la). Czech, German, Hungarian and Latin are morphologically rich. We chose these languages because they represent different families: Germanic (English, German), Romance (Latin, Spanish), Slavic (Czech) and Finno-Ugric (Hungarian) and different degrees of morphological complexity and syncretism. For example, English and Spanish rarely mark case while the other languages do; and as an agglutinative language, Hungarian features a low number of possible readings for a word form while languages like German can have more than 40 different readings for a word form. An additional criterion was to have a sufficient amount of labeled OOD data. The data sets also feature an interesting selection of domain differences. For example, for Latin we have texts from different epochs while the English data contains canonical and non-canonical text.
Labeled Data. This section describes the annotation and conversion we performed to create consistent ID and OOD data sets. No conversion was required for Hungarian, English and Latin as the data is already annotated in a consistent way.
For Hungarian we use the (multi-domain) Szeged Dependency Treebank (Vincze et al., 2010). We use the part that was used in the SPMRL 2013 shared task (Seddah et al., 2013) as ID data (newswire) and an excerpt from the novel 1984 and a Windows 2000 manual as OOD data.
For Latin we use the PROIEL treebank (Haug and Jøhndal, 2008). It consists of data from the Vulgate (bible text, ≈ 380 AD), Commentarii de Bello Gallico (≈ 50 BC), Letters from Cicero to his friend Atticus (≈ 50 BC) and The Pilgrimage of Aetheria (≈ 380 AD). We use the biggest text source (Vulgate) as ID data and the remainder as OOD data.
For English we use the SANCL shared task data (Petrov and McDonald, 2012), which consists of Ontonotes 4.0 as ID data and five OOD domains from the Google Web treebank: Yahoo! Answers, weblogs, news groups, business reviews and emails.
For Czech we use the part of the Prague Dependency Treebank (PDT) (Böhmová et al., 2003) that was used in the CoNLL 2009 shared tasks as ID data. We use the Czech part of the Multext East (MTE) corpus (Erjavec, 2010) as OOD data. MTE consists of translations of the novel 1984 that have been annotated morphologically. PDT and MTE have been annotated using two different guidelines that, without further annotation effort, could only be merged by reducing them to a common subset. Specifically, we removed features such as sub POS tags as well as markers for (in)animacy. The PDT features a number of tags that are ambiguous and could not always be resolved. The gender feature Q, for example, can mean feminine or neuter. If we could not disambiguate such a tag, we removed it; this results in morphological tags that are not present in the MTE corpus and a relatively high number of unseen tags. Instead of describing the conversion process in greater detail we refer to our conversion scripts (Section 8).
For Spanish we use the part of the AnCora corpus (Taulé et al., 2008) of CoNLL 2009 and the IULA treebank (Marimon et al., 2012), which consists of five domains: law, economics, medicine, computer science and environment. We use the AnCora corpus as ID data set and IULA as OOD data set. The two treebanks have been annotated using the same annotation scheme, but slightly different guidelines. Similar to Czech we merged the data sets by deleting features that could not be merged or were not present in one of the treebanks. Again we refer to the conversion script for further details (Section 8).
For German we use the Tiger treebank (Brants et al., 2002) in the same split as Müller et al. (2013) as ID data and the Smultron corpus (Volk et al., 2010) as OOD data. Smultron consists of four parts: a description of Alpine hiking routes, a DVD manual, an excerpt of Sophie's World and economics texts. It has been annotated with POS and syntax, but not with morphological features. We annotated Smultron following the Tiger guidelines. The annotation process was similar to Marimon et al. (2012) in that the data sets were automatically tagged with the MORPH tagger MarMoT (Müller et al., 2013) and then manually corrected by two annotators. This tagger is a strong baseline as we could include features based on gold lemma, POS and syntax (Seeker and Kuhn, 2013). The agreement of the annotators was .9628 and the κ agreement .64. (For calculating κ, we assume that random agreement occurs when both annotators agree with the reading proposed by the tagger. We then estimate the probability of random agreement by multiplying the individual estimated probabilities of changing the proposed tagging. This yields a random agreement probability of .8965.) As most of the differences between the annotators were cases where only one of the annotators had corrected an obvious error that the other had overlooked, the differences were resolved by the annotators themselves.
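The reported κ can be checked from the two agreement figures given in the text, the observed agreement (.9628) and the random agreement probability (.8965). The sketch below assumes the standard definition κ = (p_o − p_e) / (1 − p_e); the function name is chosen for illustration.

```python
def kappa(observed_agreement, chance_agreement):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)

# Values taken from the text: observed agreement and random agreement probability.
k = kappa(0.9628, 0.8965)
print(round(k, 2))  # 0.64
```

The high chance agreement explains why a raw agreement above .96 corresponds to a much lower κ: both annotators accept the tagger's proposal for most tokens.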
We used the provided segmentation if available and otherwise split ID data 8/1/1 into training, development and test sets and OOD data 1/1 into development and test sets if not mentioned otherwise. We thus have a classical setup of in-domain newspaper text vs. prose, medical, law, economic or technical texts for Czech, German, Spanish and Hungarian. For English we have canonical vs. non-canonical data and for Latin data of different epochs (ca. 400 AD vs. 50 BC). Additionally, for German one of the test domains is written in Swiss German.
Looking at some statistics of the labeled data sets (complete tables are in the appendix: Tables 1 and 2), we find that Hungarian and Latin are the languages with the highest OOV rates (27% and 37%, which for reasons of consistency we will henceforth write as follows: .27 and .37); Hungarian has a very productive agglutinative morphology while the high number of Latin OOVs can be explained by the small training set (<60,000); Czech features the highest unknown tag rate (.05) as well as the highest unseen word-tag rate (.16). This can be explained by the limits of the conversion procedure we discussed above, e.g., ambiguous features like Q.
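The three statistics mentioned above can be computed along the following lines. This is a sketch under assumptions: all rates are taken as token-level fractions of the evaluation set, and a word-tag pair counts as unseen whenever the exact pair does not occur in the training set. The function name and the toy data are hypothetical.

```python
def data_set_statistics(train, test):
    """Compute OOV, unknown tag and unseen word-tag rates.

    `train` and `test` are lists of (word form, tag) pairs.
    """
    train_words = {word for word, _ in train}
    train_tags = {tag for _, tag in train}
    train_pairs = set(train)
    n = len(test)
    return {
        "oov": sum(1 for word, _ in test if word not in train_words) / n,
        "unknown_tag": sum(1 for _, tag in test if tag not in train_tags) / n,
        "unseen_word_tag": sum(1 for pair in test if pair not in train_pairs) / n,
    }

# Toy data (hypothetical).
train = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]
test = [("the", "DET"), ("runs", "NOUN"), ("cat", "NOUN"), ("quickly", "ADV")]
stats = data_set_statistics(train, test)
```

Under this definition every OOV token also contributes to the unseen word-tag rate; restricting that rate to known word forms would be an equally plausible variant.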

Unlabeled Data.
As unlabeled data we use Wikipedia dumps from 2014 for all languages except for Latin for which we use the Patrologia Latina, a collection of clerical texts from ca. 100 AD to 1200 AD from Corpus Corporum (Roelli, 2014). We do not use the Latin version of Wikipedia because it is written by enthusiasts, not by native speakers, and contains many errors.
We preprocessed the Wikipedia dumps with WIKIPEDIAEXTRACTOR (Attardi and Fuschetto, 2013) and NLTK's (Bird et al., 2009) implementation of PUNKT (Kiss and Strunk, 2006) to detect sentence boundaries. Tokenization was performed using MAGYARLANC (Hungarian, Zsibrita et al. (2013)), STANFORD TOKENIZER (English, Manning et al. (2014)), FREELING (Spanish, Padró and Stanilovsky (2012)) and CZECHTOK (Czech, http://sourceforge.net/projects/czechtok/). For Latin, we removed punctuation because PROIEL does not contain punctuation. We also split off the clitics ne, que and ve if the resulting token was accepted by LATMOR (Springmann et al., 2014). Following common practice, we normalized the text by replacing digits with 0s.
In our experiments, we extract representations for the 250,000 most frequent word types. This vocabulary size is comparable to other work; e.g., Turian et al. (2010) use 269,000 types. This threshold yields low fractions of uncovered tokens for English and Latin (.009 and .02). For the other languages, this fraction rises to .04. We also extract the morphological readings of the words in this vocabulary using MAGYARLANC (Hungarian, Zsibrita et al. (2013)), FREELING (English and Spanish, Padró and Stanilovsky (2012)), SMOR (German, Schmid et al. (2004)), an MA from Charles University (Czech, Hajič (2001)) and LATMOR (Latin, Springmann et al. (2014)).
Throughout this paper we extract one feature for each cluster id or MA reading of the current word form. For example, SMOR produces two readings for the German word form erhielt 'received': <1><SG><PAST><IND> and <3><SG><PAST><IND>; we thus fire two features representing the respective tags whenever erhielt is seen in the data. We also experimented with cluster indexes of neighboring uni/bigrams, but obtained no consistent improvement.
For the dense embeddings we analogously extract the vector of the current word form.
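The feature extraction for cluster ids and MA readings can be sketched as follows. The dictionaries, the feature-name prefixes and the reading labels are hypothetical stand-ins for the induced cluster dictionary and the output of a morphological analyzer; digit normalization follows the preprocessing described above.

```python
import re

def normalize(word):
    """Replace every digit with 0, as in the preprocessing of the unlabeled data."""
    return re.sub(r"\d", "0", word)

def representation_features(word, cluster_ids, ma_readings):
    """Fire one feature per cluster id and one per MA reading of the word form."""
    form = normalize(word)
    features = []
    if form in cluster_ids:
        features.append("cluster=" + str(cluster_ids[form]))
    for reading in ma_readings.get(form, ()):
        features.append("ma=" + reading)
    return features

# Toy resources (hypothetical entries).
cluster_ids = {"walked": 17, "0000": 3}
ma_readings = {"walked": ["VERB+PAST", "VERB+PARTICIPLE"]}

print(representation_features("walked", cluster_ids, ma_readings))
print(representation_features("1984", cluster_ids, ma_readings))
```

Word forms outside the representation vocabulary simply fire no feature, so the tagger falls back on its standard feature set for them.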

Experiments
For all our experiments we use MarMoT (Müller et al., 2013), a joint POS and morphological tagger. The CRF tagger employs a pruning strategy on forward-backward lattices to efficiently handle big tag sets and higher orders. Its feature set is similar to Ratnaparkhi (1996) and Toutanova et al. (2003) and includes prefixes, suffixes, immediate lexical context and shape features based on capitalization, special characters and digits. MarMoT was shown to be a competitive POS and morphological tagger across six languages (Müller et al., 2013). In order to make sure that it is also robust in an OOD setup, we compare it to the two popular taggers SVMTool (Giménez and Marquez, 2004) and Morfette (Chrupała et al., 2008). The results are summarized in Table 1.
MarMoT uses stochastic gradient descent and produces different results in each training run. We therefore always report the average of five runs. The OOD numbers are macro-averages over the different OOD data sets of a language. The tables in this paper are based on the development sets; the only exception to this is Table 5, which is based on the test set. MarMoT outperforms SVMTool and Morfette on every language and setup (ID / OOD) except for the Spanish OOD data set. For Czech, German and Latin the improvements over the best baseline are >1. Different orders of MarMoT behave as expected: higher-order models (order>1) outperform first-order models. The only exception to this is Latin. This suggests a drastic difference of the tag transition probabilities between the Latin ID and OOD data sets. Given the results in Table 1 and for simplicity, we use a second-order MarMoT model in all subsequent experiments.
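The reporting scheme can be illustrated with a short sketch: accuracies are first averaged over the training runs of each OOD data set and the per-domain means are then macro-averaged, so that every domain contributes equally regardless of its size. The function name, the domain names and all numbers are hypothetical.

```python
from statistics import mean

def macro_average_ood(run_scores):
    """Average over runs per domain, then macro-average over domains.

    `run_scores` maps an OOD domain name to the accuracies of the training runs.
    """
    per_domain = {domain: mean(scores) for domain, scores in run_scores.items()}
    return mean(list(per_domain.values())), per_domain

# Hypothetical accuracies of two runs on two OOD domains.
scores = {"manual": [90.0, 92.0], "novel": [80.0, 80.0]}
macro, per_domain = macro_average_ood(scores)
```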
LM-based clustering. We first compare different implementations of LM-based clustering. The implementation of Brown clustering by Liang (2005) is most commonly used. Its hierarchical binary structure can be used to extract clusterings of varying granularity by selecting different prefixes of the path from the root to a specific word form. Following other work (Ratinov and Roth, 2009; Turian et al., 2010), we induce 1000 clusters and select path lengths 4, 6, 10 and 20. We call this representation Brown path. We compare Brown path to mkcls (Och, 1999) and MarLiN. These implementations just induce flat clusterings of a certain size; we thus run them for cluster sizes 100, 200, 500 and 1000 to also obtain cluster ids of different sizes. The cluster sizes roughly resemble the granularity obtained in Brown path. We call the corresponding models mkcls and MarLiN. Table 2 shows that the absolute differences between systems are small, but overall MarLiN and mkcls are better (Brown path reaches the same performance as MarLiN in one case: pos/es/OOD). We conclude that systems based on the algorithm of Martin et al. (1998) are slightly more accurate for tagging and are several times faster than the more frequently used version of Brown et al. (1992). We thus use MarLiN for the remainder of this paper.
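Extracting clusterings of varying granularity from a Brown hierarchy amounts to truncating the binary path of a word form. The sketch below uses the path lengths 4, 6, 10 and 20 from the text; the function name, the feature-name format and the example bit string are assumptions made for illustration.

```python
def path_prefix_features(bit_path, lengths=(4, 6, 10, 20)):
    """Return one feature per prefix length of a word form's binary path.

    Paths shorter than a requested length are used in full.
    """
    return ["path%d=%s" % (n, bit_path[:n]) for n in lengths]

# Hypothetical path of a word form in the induced hierarchy.
features = path_prefix_features("0110100111")
print(features)
# ['path4=0110', 'path6=011010', 'path10=0110100111', 'path20=0110100111']
```

Short prefixes group many word forms into coarse clusters, while long prefixes approach the leaf clusters of the hierarchy.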
Neural Network Representations. We compare MarLiN with the implementation of CW by Al-Rfou et al. (2013). They extracted 64-dimensional representations for only the most frequent 100,000 word forms. To make the comparison fair, we use the intersection of our and their representation vocabularies. (For Latin, we also use representations from Wikipedia instead of Corpus Corporum in this comparison.) The results in Table 3 show that MarLiN is best in 15 out of 22 cases and significantly better in eight. CW is best in 9 out of 22 cases and significantly better in two. We conclude that LM-based representations are more suited for tagging as they can be induced faster, are smaller and give better results.
SVD and ACT Representations. For the SVD-based representation we use feature ranks out of {500, 1000} and dimensions out of {50, 100, 200, 500}. We found that l1-normalizing the vectors before and after the SVD improved results slightly. For the accumulated tag counts (ACT) we annotate the data with our baseline model and extract word-tag probabilities. The probabilities are then used as sparse real-valued features.
Table 6: Improvement compared to the baseline for different frequency ranges of words on OOD

Analysis
We now analyze why MarLiN and MA perform better than the baseline. First we compare the improvements in absolute error rate over the baseline by grouping word forms by their training set frequency f. The numbers are shown in Table 6. We find that most of the improvement comes from OOV words. Rare words (frequency <10) show a smaller, but still important contribution while the contribution of frequent words can be almost neglected for four languages. The exception is German where frequent words contribute more to the error reduction than rare words. This could be caused by syncretisms such as in plural noun phrases where the gender is not marked in determiner and adjective and can only be derived from the head noun; e.g., the adjectives in schwere Schulfächer 'difficult school subjects' and verdächtige Personen 'suspect persons' are unmarked for gender and the correct genders (neuter vs. feminine) cannot be inferred from distributional information or suffixes for the nouns (although gender is easy to infer distributionally for singular forms of nouns).
Table 7: Improvement compared to the baseline for different features
Looking at the morphological features with the highest improvement in absolute error rate (Table 7), we find that the features with the highest improvement are POS, SUB-POS (a finer division of POS, e.g., nouns are split into proper / common nouns), gender, case and number. For all languages POS and, if part of the annotation, SUB-POS are among the three features with the highest improvements. Gender is also always among the three features with the highest improvements for the four languages that have gender (es, de, la, cs). We just discussed an example for German where gender could not be derived from context or inflectional suffixes. Other languages also have word forms that do not mark gender, e.g., Spanish masculine ave 'bird' vs. feminine llave 'key'. The gender can, however, easily be derived if the word representation encodes whether a word form has been seen with a specific determiner or adjective on its right or left.
Lastly, we use Jaccard similarity, Jaccard(U, V) = |U ∩ V| / |U ∪ V|, to compare the sets of gold and predicted morphological features. Jaccard can be interpreted as a soft variant of accuracy: If the two tags are identical it yields 1 and otherwise it corresponds to the number of correctly predicted features divided by the size of the union of gold and predicted features. This table demonstrates that the evaluation measure we have used throughout this paper (a tag counts as completely wrong if a single feature was misidentified even though all others are correct) is conservative. On a feature-by-feature basis accuracy would be much higher. The difference is largest for Czech and Latin.
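A minimal sketch of the Jaccard comparison between a gold and a predicted tag, assuming each tag has been split into a set of feature strings. The feature names and the handling of two empty sets (treated as identical) are assumptions made for illustration.

```python
def jaccard(gold, predicted):
    """Size of the intersection divided by the size of the union of two feature sets."""
    gold, predicted = set(gold), set(predicted)
    union = gold | predicted
    if not union:
        return 1.0  # assumption: two empty feature sets count as identical
    return len(gold & predicted) / len(union)

# Hypothetical tags that differ only in gender.
gold = {"case=nom", "num=sg", "gen=fem"}
predicted = {"case=nom", "num=sg", "gen=neut"}
print(jaccard(gold, predicted))  # 0.5
```

In this example exact-match accuracy counts the prediction as fully wrong, whereas Jaccard credits the two correctly predicted features.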

Conclusion
We have presented a test suite for morphological tagging consisting of in-domain (ID) and out-of-domain (OOD) data sets for six languages: Czech, English, German, Hungarian, Latin and Spanish. We converted some of the data sets to obtain a reasonably consistent annotation and manually annotated the German part of the Smultron treebank. We surveyed four different word representations: SVD-reduced count vectors, LM-based clusters, accumulated tag counts and CW embeddings. We found that the LM-based clusters outperformed the other representations for POS and MORPH tagging, ID and OOD data sets and all languages. We also showed that our implementation of MarLiN (Martin et al., 1998) is an order of magnitude more efficient and performs slightly better than the implementation by Liang (2005). We also compared the learned representations to manually created Morphological Analyzers (MAs). We found that MarLiN outperforms MAs in POS tagging, but that it is substantially worse in morphological tagging. In our analysis of the results, we showed that both MarLiN and MAs decrease the error most for out-of-vocabulary words and for the features POS and gender.

Resources
As part of this publication we also release the following resources at http://cistern.cis.lmu.de/marmot/: (i) our implementation of MarLiN as open-source software; (ii) the morphological layer of the German part of the SMULTRON corpus. For easier reproducibility, we also made (iii) the preprocessed Wikipedia dumps and the induced representation dictionaries available. (iv) Morphological dictionaries were released to the extent this was compatible with the usage agreement. (v) We also published the conversion code for unifying the Spanish and Czech annotations.