Modeling Composite Labels for Neural Morphological Tagging

Neural morphological tagging has been regarded as an extension of the POS tagging task, treating each morphological tag as a monolithic label and ignoring its internal structure. We propose to view morphological tags as composite labels and explicitly model their internal structure in a neural sequence tagger. For this, we explore three different neural architectures and compare their performance with both CRF and simple neural multiclass baselines. We evaluate our models on 49 languages and show that the neural architecture that models the morphological labels as sequences of morphological category values performs significantly better than both baselines, establishing state-of-the-art results in morphological tagging for most languages.


Introduction
The common approach to morphological tagging combines the set of a word's morphological features into a single monolithic tag and then, similar to POS tagging, employs multiclass sequence classification models such as CRFs (Müller et al., 2013) or recurrent neural networks (Labeau et al., 2015; Heigold et al., 2017). This approach, however, has a number of limitations. Firstly, it ignores the intrinsic compositional structure of the labels and treats two labels that differ only in the value of a single morphological category as completely independent; compare for instance labels [POS=NOUN,CASE=NOM,NUM=SG] and [POS=NOUN,CASE=NOM,NUM=PL] that only differ in the value of the NUM category. Secondly, it introduces a data sparsity issue, as the less frequent labels may have only a few occurrences in the training data. Thirdly, it excludes the ability to predict labels not present in the training set, which can be an issue for languages such as Turkish where the number of morphological tags is theoretically unlimited (Yuret and Türe, 2006).
The source code is available at https://github.com/AleksTk/seq-morph-tagger
To address these problems, we propose to treat morphological tags as composite labels and explicitly model their internal structure. We hypothesise that by doing so, we are able to alleviate the sparsity problems, especially for languages with very large tagsets such as Turkish, Czech or Finnish, and at the same time also improve the accuracy over a baseline using monolithic labels. We explore three different neural architectures to model the compositionality of morphological labels. In the first architecture, we model all morphological categories (including the POS tag) as independent multiclass classifiers conditioned on the same contextual word representation. The second architecture organises these multiclass classifiers into a hierarchy: the POS tag is predicted first, and the values of morphological categories are predicted conditioned on the value of the predicted POS. The third architecture models the label as a sequence of morphological category-value pairs. All our models share the same neural encoder architecture based on bidirectional LSTMs to construct contextual representations for words (Lample et al., 2016).
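To make the notion of a composite label concrete, the following minimal sketch splits a bracketed tag into category-value pairs and reports which categories differ between two labels. The function names and the assumed tag syntax (comma-separated `CATEGORY=VALUE` pairs in square brackets, as in the example labels above) are illustrative choices, not part of the described system.

```python
from __future__ import annotations


def parse_tag(tag: str) -> dict[str, str]:
    """Split a tag such as '[POS=NOUN,CASE=NOM,NUM=SG]' into a category -> value mapping."""
    body = tag.strip().strip("[]")
    return dict(pair.split("=", 1) for pair in body.split(",") if pair)


def differing_categories(tag_a: str, tag_b: str) -> set[str]:
    """Return the categories whose values differ (or are missing) between two tags."""
    a, b = parse_tag(tag_a), parse_tag(tag_b)
    return {cat for cat in a.keys() | b.keys() if a.get(cat) != b.get(cat)}


singular = "[POS=NOUN,CASE=NOM,NUM=SG]"
plural = "[POS=NOUN,CASE=NOM,NUM=PL]"
print(differing_categories(singular, plural))  # {'NUM'}
```

A monolithic classifier sees `singular` and `plural` as two unrelated classes, whereas the parsed form exposes that they share two of three category values.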
We evaluate all our models on 49 UD version 2.1 languages. Experimental results show that our sequential model outperforms the other neural models, establishing state-of-the-art results in morphological tagging for most languages. We also confirm that all neural models perform significantly better than a competitive CRF baseline. In short, our contributions can be summarised as follows: 1) We propose to model the compositional internal structure of complex morphological labels for morphological tagging in a neural sequence tagging framework; 2) We explore several neural architectures for modeling the composite morphological labels; 3) We find that the tag representation based on the sequence learning model achieves state-of-the-art performance on many languages; 4) We present state-of-the-art morphological tagging results on 49 languages on the UDv2.1 corpora.

Related Work
Most previous work on modeling the internal structure of complex morphological labels has occurred in the context of morphological disambiguation, a task where the goal is to select the correct analysis from a limited set of candidates provided by a morphological analyser. The most common strategy to cope with a large number of complex labels has been to predict all morphological features of a word using several independent classifiers whose predictions are later combined using some scoring mechanism (Hajič and Hladká, 1998; Hajič, 2000; Smith et al., 2005; Yuret and Türe, 2006; Zalmout and Habash, 2017; Kirov et al., 2017). Inoue et al. (2017) combined these classifiers into a multitask neural model sharing the same encoder, and predicted both the POS tag and morphological category values given the same contextual representation computed by a bidirectional LSTM. They showed that the multitask learning setting outperforms the combination of several independent classifiers on tagging Arabic. In this paper, we experiment with the same architecture, which we term the multiclass multilabel model, on many languages. Additionally, we extend this approach and explore a hierarchical architecture where morphological features directly depend on the POS tag. Another previously adopted approach involves modeling complex morphological labels as sequences of morphological feature values (Hakkani-Tur et al., 2000; Schmid and Laws, 2008). In neural networks, this idea can be implemented with recurrent sequence modeling. Indeed, one of our proposed models generates morphological tags with an LSTM network. A similar idea has been applied to the morphological reinflection task (Kann and Schütze, 2016; Faruqui et al., 2016), where the sequential model is used to generate the spellings of inflected forms given the lemma and the morphological label of the desired form. In morphological tagging, however, we generate the morphological labels themselves.
Another direction of research on modeling the structure of complex morphological labels involves structured prediction models (Müller et al., 2013; Müller and Schütze, 2015; Malaviya et al., 2018; Lee et al., 2011). Lee et al. (2011) introduced a factor graph model that jointly infers morphological features and syntactic structures. Müller et al. (2013) proposed a higher-order CRF model which handles large morphological tagsets by decomposing the full label into a POS tag and a morphology part. Malaviya et al. (2018) proposed a factorial CRF to model pairwise dependencies between individual features within morphological labels and also between labels over time steps for cross-lingual transfer. Recently, neural morphological taggers have been compared to the CRF-based approach (Heigold et al., 2017; Yu et al., 2017). While Heigold et al. (2017) found that their neural model with a bidirectional LSTM encoder surpasses the CRF baseline, the results of Yu et al. (2017) are mixed: the convolutional encoder is slightly better than or on par with the CRF, but the LSTM encoder is worse than the CRF baseline.
Most previous work on neural POS and morphological tagging has shared the general idea of using a bidirectional LSTM for computing contextual features for words (Ling et al., 2015; Huang et al., 2015; Labeau et al., 2015; Ma and Hovy, 2016; Heigold et al., 2017). The focus of the previous work has been mostly on modeling the inputs by exploring different character-level representations for words (Heigold et al., 2016; Santos and Zadrozny, 2014; Ma and Hovy, 2016; Inoue et al., 2017; Ling et al., 2015; Rei et al., 2016). We adopt the general encoder architecture from these works, constructing word representations from characters and using another bidirectional LSTM to encode the context vectors. In contrast to these previous works, our focus is on modeling the compositional structure of the complex morphological labels.
The morphologically annotated Universal Dependencies (UD) corpora (Nivre et al., 2017) offer a great opportunity for experimenting on many languages. Some previous works have reported results on several UD languages (Yu et al., 2017; Heigold et al., 2017). Morphological tagging results on many UD languages have also been reported for parsing systems that predict POS and morphological tags as preprocessing (Andor et al., 2016; Straka et al., 2016; Straka and Straková, 2017). Since UD treebanks have been in constant development, these results have been obtained on different UD versions and thus are not necessarily directly comparable. We conduct experiments on all UDv2.1 languages and we aim to provide a baseline for future work in neural morphological tagging.

Neural Models
We explore three different neural architectures for modeling morphological labels: a multiclass multilabel model that predicts each category value separately, a hierarchical multiclass multilabel model where the values of morphological features depend on the value of the POS, and a sequence model that generates morphological labels as sequences of feature-value pairs.

Notation
Given a sentence $w_1, \ldots, w_n$ consisting of $n$ words, we want to predict the sequence $t_1, \ldots, t_n$ of morphological labels for that sentence. Each label $t_i = \{f_{i0}, f_{i1}, \ldots, f_{im}\}$ consists of a POS tag ($f_{i0} \equiv \mathrm{POS}$) and a sequence of $m$ category values. For each word $w_i$, the encoder computes a contextual vector $h_i$, which captures information about the word and its left and right context.

Decoder Models
Multiclass Multilabel model (MCML) This model formulates morphological tagging as a multiclass multilabel classification problem. For each morphological category, a separate multiclass classifier is trained to predict the value of that category (Figure 1 (a)). Because not all categories are always present for each POS (e.g., a noun does not have a tense category), we extend the morphological label of each word by adding all features that are missing from the annotated label and assign them a special value that marks the category as "off". Formally, the model can be described as:
$$p(t \mid h)_{\mathrm{MCML}} = \prod_{j=0}^{M} p(f_j \mid h),$$
where $M$ is the total number of morphological categories (such as case, number, tense, etc.) observed in the training corpus. The probability of each feature value is computed with a softmax function:
$$p(f_j \mid h) = \mathrm{softmax}(W_j h + b_j),$$
where $W_j$ and $b_j$ are the parameter matrix and bias vector for the $j$th morphological feature ($j = 0, \ldots, M$). The final morphological label for a word is obtained by concatenating predictions for individual categories while filtering out off-valued categories.
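The MCML prediction step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `OFF` marker string, the toy weights, and the helper names (`softmax`, `affine`, `mcml_predict`) are hypothetical and do not come from the released implementation.

```python
import math

OFF = "_off_"  # hypothetical marker for a category that is absent from the label


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, h):
    """Compute W h + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def mcml_predict(h, classifiers):
    """Predict each category independently from the same context vector h.

    classifiers maps a category name to (values, W, b). Returns the composed
    label with off-valued categories filtered out, and the summed log-probability.
    """
    label, log_prob = {}, 0.0
    for category, (values, W, b) in classifiers.items():
        probs = softmax(affine(W, b, h))
        best = max(range(len(values)), key=probs.__getitem__)
        log_prob += math.log(probs[best])
        if values[best] != OFF:
            label[category] = values[best]
    return label, log_prob


# Toy context vector and toy parameters (assumed values, for illustration only).
h = [1.0, -1.0]
classifiers = {
    "POS": (["NOUN", "VERB"], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
    "CASE": ([OFF, "NOM"], [[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]),
    "TENSE": ([OFF, "PAST"], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
}
label, log_prob = mcml_predict(h, classifiers)
print(label)  # {'POS': 'NOUN', 'CASE': 'NOM'}
```

Because every category has its own classifier, the composed label can be a combination of values that never co-occurred in the training data.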
Hierarchical Multiclass Multilabel model (HMCML) This is a hierarchical version of the MCML architecture that models the values of morphological categories as directly dependent on the POS tag (Figure 1 (b)):
$$p(t \mid h)_{\mathrm{HMCML}} = p(\mathrm{POS} \mid h) \prod_{j=1}^{M} p(f_j \mid \mathrm{POS}, h).$$
The probability of the POS is computed from the context vector $h$ using the respective parameters:
$$p(\mathrm{POS} \mid h) = \mathrm{softmax}(W_0 h + b_0).$$
The POS-dependent context vector $l$ is obtained by concatenating the context vector $h$ with the unnormalised log probabilities of the POS:
$$l = [h;\, W_0 h + b_0].$$
The probabilities of the morphological features are computed using the POS-dependent context vector:
$$p(f_j \mid \mathrm{POS}, h) = \mathrm{softmax}(W_j l + b_j), \quad j = 1, \ldots, M.$$
Sequence model (SEQ) The SEQ model predicts complex morphological labels as sequences of category values. This approach is inspired by neural sequence-to-sequence models commonly used for machine translation (Cho et al., 2014; Sutskever et al., 2014). For each word in a sentence, the decoder uses a unidirectional LSTM network (Figure 1 (c)) to generate a sequence of morphological category-value pairs based on the context vector $h$ and the previous predictions. The probability of a morphological label $t$ under this model is:
$$p(t \mid h)_{\mathrm{SEQ}} = \prod_{j=0}^{m} p(f_j \mid f_0, \ldots, f_{j-1}, h).$$
Decoding starts by passing the start-of-sequence symbol as input. At each time step, the decoder computes the label context vector $g_j$ based on the previously predicted category value, the previous label context vector and the word's context vector. The probability of each morphological feature-value pair is then computed with a softmax.
At training time, we feed correct labels as inputs, while at inference time, we greedily emit the best prediction from the set of all possible feature-value pairs. The decoding terminates once the end-of-sequence symbol is produced.
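The greedy inference loop of the SEQ decoder can be sketched as follows. The sketch abstracts the LSTM cell behind a `step` function and replaces it with a small lookup table, so the symbols `SOS`/`EOS`, the `max_len` safety cap, and `toy_step` are all assumptions made for illustration rather than details of the actual model.

```python
SOS, EOS = "<s>", "</s>"  # assumed start- and end-of-sequence symbols


def greedy_decode(h, step, initial_state, max_len=20):
    """Emit feature-value pairs one at a time until EOS is produced.

    step(prev_symbol, state, h) must return (probabilities, new_state), where
    probabilities maps every candidate symbol to its probability. max_len is a
    safety cap added for this sketch, not something stated in the text.
    """
    prev, state, output = SOS, initial_state, []
    for _ in range(max_len):
        probs, state = step(prev, state, h)
        prev = max(probs, key=probs.get)  # greedy choice
        if prev == EOS:
            break
        output.append(prev)
    return output


def toy_step(prev, state, h):
    """Stand-in for one decoder LSTM step: a fixed table keyed by the previous symbol."""
    table = {
        SOS: {"POS=NOUN": 0.8, "POS=VERB": 0.2},
        "POS=NOUN": {"CASE=NOM": 0.7, EOS: 0.3},
        "CASE=NOM": {"NUM=SG": 0.6, "NUM=PL": 0.3, EOS: 0.1},
    }
    return table.get(prev, {EOS: 1.0}), state


pairs = greedy_decode(h=[0.0], step=toy_step, initial_state=None)
print(pairs)  # ['POS=NOUN', 'CASE=NOM', 'NUM=SG']
```

Since the output is assembled pair by pair, nothing restricts the decoder to full tags observed during training, which is what allows unseen labels to be produced.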

Encoder
We adopt a standard sequence tagging encoder architecture for all our models. It consists of a bidirectional LSTM network that maps words in a sentence into context vectors using character and word-level embeddings. Character-level word embeddings are constructed with a bidirectional LSTM network and they capture useful information about words' morphology and shape. Word-level embeddings are initialised with pre-trained embeddings and fine-tuned during training. The character and word-level embeddings are concatenated and passed as inputs to the bidirectional LSTM encoder. The resulting hidden states $h_i$ capture contextual information for each word in a sentence. Similar encoder architectures have been applied recently with notable success to morphological tagging (Heigold et al., 2017; Yu et al., 2017) as well as several other sequence tagging tasks (Lample et al., 2016; Chiu and Nichols, 2016; Ling et al., 2015).

Experimental Setup
This section details the experimental setup. We describe the data, then we introduce the baseline models and finally we report the hyperparameters of the models.

Data
We run experiments on the Universal Dependencies version 2.1 (Nivre et al., 2017). We excluded corpora that did not include a train/dev/test split, word form information, or morphological features. Additionally, we excluded corpora for which pretrained word embeddings were not available. The resulting dataset contains 69 corpora covering 49 different languages. Tagsets were constructed by concatenating the POS and morphological annotations of the treebanks. Table 1 gives corpus statistics. We present type and token counts for both training and test sets. For the training set, we also show the average and maximum number of tags per word type and the size of the morphological tagset. For the test set, we report the proportion of out-of-vocabulary (OOV) words as well as the number of OOV tag tokens and types.
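As an illustration of how such a tagset and the OOV tag statistics could be derived, the sketch below reads the UPOS and FEATS columns of CoNLL-U formatted text, concatenates them into one tag per token, and counts test tags unseen in training. The concrete tag string format (`POS=...|Feature=Value`) and all function names are assumptions of this sketch; the text only states that POS and morphological annotations were concatenated.

```python
from collections import Counter


def make_tag(upos, feats):
    """Concatenate the POS and the morphological features into a single tag string."""
    return f"POS={upos}" if feats in ("", "_") else f"POS={upos}|{feats}"


def read_tags(conllu_text):
    """Extract one composite tag per regular token from CoNLL-U text."""
    tags = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        tags.append(make_tag(cols[3], cols[5]))
    return tags


def oov_tag_stats(train_tags, test_tags):
    """Return (number of OOV tag tokens, number of OOV tag types) in the test set."""
    known = set(train_tags)
    oov = Counter(tag for tag in test_tags if tag not in known)
    return sum(oov.values()), len(oov)


def row(idx, form, upos, feats):
    """Build a 10-column CoNLL-U line with unused columns left as '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", feats, "_", "_", "_", "_"])


train = "\n".join([
    row(1, "dogs", "NOUN", "Number=Plur"),
    row(2, "bark", "VERB", "Number=Plur|Tense=Pres"),
])
test = "\n".join([
    row(1, "dog", "NOUN", "Number=Sing"),
    row(2, "barks", "VERB", "Number=Plur|Tense=Pres"),
    row(3, ".", "PUNCT", "_"),
])
stats = oov_tag_stats(read_tags(train), read_tags(test))
print(stats)  # (2, 2)
```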
In the encoder, we use fastText word embeddings (Bojanowski et al., 2017) [...] means of character-level embeddings. In Table 1, we also report for each language the proportion of word types for which the pre-trained embeddings are available.
Table 1: Corpus statistics. For training sets we report the number of word tokens and types, the average (Avg) and maximum (Max) tags per word type, the proportion of word types for which pre-trained embeddings were available (% Emb) and the size of the morphological tagset (# Tags). For the test sets, we also give the total number of tokens and types, the proportion of OOV words (% OOV) and the number of OOV tag tokens and types.

Baseline Models
We use two models as baselines: the CRF-based MARMOT (Müller et al., 2013) and the regular neural multiclass classifier.
MarMoT (MMT) MARMOT (footnote 6) is a CRF-based morphological tagger which has been shown to achieve competitive performance across several languages (Müller et al., 2013). MARMOT approximates the CRF objective using a pruning strategy which enables training higher-order models and handling large tagsets. In particular, the tagger first predicts the POS part of the label and, based on that, constrains the set of possible morphological labels. Following the results of Müller et al. (2013), we train second-order models. We tuned the regularization type and weight on the German development set and, based on that, we use L2 regularization with weight 0.01 in all our experiments.
Neural Multiclass classifier (MC) As the second baseline, we employ the standard multiclass classifier used by both Heigold et al. (2017) and Yu et al. (2017). The model consists of an LSTM-based encoder, identical to the one described above in section 3.3, and a softmax classifier over the full tagset. The tagset size for each corpus is shown in Table 1. During preliminary experiments, we also added a CRF layer on top of the softmax, but as this made the decoding process considerably slower without any visible improvement in accuracy, we did not adopt CRF decoding here. The multiclass model is shown in Figure 1 (d).
The inherent limitation of both baseline models is their inability to predict tags that are not present in the training corpus. Although the number of such tags in our data set is not large, it is nevertheless non-zero for most languages.

Training and Parametrisation
Since tuning model hyperparameters for each of the 69 datasets individually is computationally demanding, we optimise parameters on Finnish, a morphologically complex language with a reasonable dataset size, and apply the resulting values to other languages. We first tuned the character embedding size and character-LSTM hidden layer size of the encoder on the SEQ model and reused the obtained values with all other models. We tuned the batch size, the learning rate and the decay factor for the SEQ and MC models separately since these models are architecturally quite different. For the MCML and HMCML models we reuse the values obtained for the MC model. The remaining hyperparameter values are fixed. Table 2 lists the hyperparameters for all models. We train all neural models using stochastic gradient descent for up to 400 epochs and stop early if there has been no improvement on the development set within 50 epochs. For all models except SEQ, we decay the learning rate by a factor of 0.98 after every 2500 batch updates. We initialise biases with zeros and parameter matrices using the Xavier uniform initialiser (Glorot and Bengio, 2010).
6 http://cistern.cis.lmu.de/marmot/
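The learning rate schedule and the early stopping rule described above can be sketched as two small helpers. The decay factor (0.98), decay interval (2500 updates) and patience (50 epochs) come from the text; the staircase form of the decay, the initial learning rate of 1.0 in the example, and the function names are assumptions of this sketch.

```python
def decayed_learning_rate(initial_lr, batch_updates, decay_factor=0.98, decay_every=2500):
    """Learning rate after a number of batch updates, decayed in discrete steps.

    A staircase schedule is assumed here: the rate is multiplied by
    decay_factor once per completed block of decay_every updates.
    """
    return initial_lr * decay_factor ** (batch_updates // decay_every)


def should_stop(dev_scores, patience=50):
    """Return True if the best dev score is at least `patience` epochs old.

    dev_scores holds one development-set score per completed epoch. Ties do
    not count as improvement because max() keeps the earliest best epoch.
    """
    if not dev_scores:
        return False
    best_epoch = max(range(len(dev_scores)), key=dev_scores.__getitem__)
    return len(dev_scores) - 1 - best_epoch >= patience


# Example with a hypothetical initial learning rate of 1.0.
print(decayed_learning_rate(1.0, 2499))  # 1.0
print(decayed_learning_rate(1.0, 2500))  # 0.98
```

In a training loop, `should_stop` would be checked after each epoch, alongside the hard limit of 400 epochs.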
Words in training sets with no pre-trained embeddings are initialised with random embeddings. At test time, words with no pre-trained embedding are assigned a special UNK-embedding. We train the UNK-embedding by randomly substituting the singletons in a batch with the UNK-embedding with a probability of 0.5.

Results
Table 3 presents the experimental results. We report tagging accuracy for all word tokens and also for OOV tokens only. A full morphological tag is considered correct if both its POS and all morphological features are correctly predicted. First of all, we can confirm the results of Heigold et al. (2017) that the performance of neural morphological tagging indeed exceeds the results of a CRF-based model. In fact, all our neural models perform significantly better than MARMOT (p < 0.001; see footnote 7). The best neural model on average is the SEQ model, which is significantly better than both the MC baseline and the other two compositional models; the improvement is especially pronounced on smaller datasets. We do not observe any significant differences between the MCML and HMCML models in either the all-words or the OOV evaluation setting.
We also present POS tagging results in the rightmost section of Table 3. Here again, all neural models are better than the CRF, which is in line with the results presented by Plank et al. (2016). For POS tags, the HMCML is the best on average. It is also significantly better than the neural MC baseline; however, the differences with the MCML and SEQ models are insignificant.
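The accuracy measures reported in Table 3 (full-tag accuracy on all tokens, on OOV tokens only, and POS accuracy) can be sketched as below. Labels are represented as category-to-value dictionaries, so a full tag counts as correct only when the POS and every feature match; this representation, the toy data, and the function names are assumptions made for this sketch.

```python
def accuracy(gold, pred):
    """Fraction of positions where the predicted item equals the gold item."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold) if gold else 0.0


def oov_accuracy(words, gold, pred, train_vocab):
    """Full-tag accuracy restricted to tokens whose word form is not in the training vocabulary."""
    kept = [(g, p) for w, g, p in zip(words, gold, pred) if w not in train_vocab]
    return accuracy([g for g, _ in kept], [p for _, p in kept])


def pos_accuracy(gold, pred):
    """Accuracy of the POS category alone."""
    return accuracy([g.get("POS") for g in gold], [p.get("POS") for p in pred])


words = ["dogs", "barked", "loudly"]
gold = [{"POS": "NOUN", "Number": "Plur"}, {"POS": "VERB", "Tense": "Past"}, {"POS": "ADV"}]
pred = [{"POS": "NOUN", "Number": "Sing"}, {"POS": "VERB", "Tense": "Past"}, {"POS": "ADV"}]

full = accuracy(gold, pred)                      # 2 of 3 full tags are correct
oov = oov_accuracy(words, gold, pred, {"dogs"})  # both OOV tokens are correct
pos = pos_accuracy(gold, pred)                   # all POS tags are correct
print(full, oov, pos)
```

The first token shows why full-tag accuracy is stricter than POS accuracy: the POS is right, but the wrong Number value makes the whole tag incorrect.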
In addition to full-tag accuracies, we assess the performance on individual features. Table 4 reports macro-averaged F1-scores for the SEQ and the MC models on universal features. Results indicate that the SEQ model outperforms the MC model on most features.
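One plausible way to compute a macro-averaged F1-score for a single morphological feature is sketched below: F1 is computed separately for each value of the feature and the per-value scores are averaged with equal weight. Whether the reported numbers were averaged in exactly this way is an assumption of this sketch, as are the function name and the toy data.

```python
def macro_f1(gold, pred, category):
    """Macro-averaged F1 over the values of one category.

    gold and pred are lists of category -> value dictionaries. Tokens that
    lack the category contribute to neither true nor false positives for any value.
    """
    values = {t[category] for t in gold + pred if category in t}
    scores = []
    for value in values:
        tp = sum(g.get(category) == value and p.get(category) == value for g, p in zip(gold, pred))
        fp = sum(g.get(category) != value and p.get(category) == value for g, p in zip(gold, pred))
        fn = sum(g.get(category) == value and p.get(category) != value for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


gold_tags = [{"Number": "Plur"}, {"Number": "Sing"}, {"Number": "Sing"}, {"POS": "ADV"}]
pred_tags = [{"Number": "Plur"}, {"Number": "Plur"}, {"Number": "Sing"}, {"POS": "ADV"}]
score = macro_f1(gold_tags, pred_tags, "Number")
print(round(score, 4))  # 0.6667
```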

Analysis and Discussion
OOV label accuracy Our models are able to predict labels that were not seen in the training data. Figure 2 presents the accuracy of test tokens with OOV labels obtained with our best performing SEQ model plotted against the number of OOV label types. The datasets with zero accuracy are omitted. The main observation is that although the OOV label accuracy is zero for some languages, it is above zero on about half of the datasets, a result that would be impossible with the MARMOT or MC baselines.
Error Analysis Figure 3 shows the largest error rates for distinct morphological categories for both SEQ and MC models averaged over all languages. We observe that the error patterns are similar for both models but the error rates of the SEQ model are consistently lower, as expected.
7 As indicated by Wilcoxon signed-rank test.
Stability Analysis To assess the stability of our predictions, we picked five languages from different families and with different corpus sizes, and performed five independent train/test runs for each language.
Hyperparameter Tuning It is possible that the hyperparameters tuned on Finnish are not optimal for other languages and thus, tuning hyperparameters for each language individually would lead to different conclusions than currently drawn. To shed some light on this issue, we tuned hyperparameters for the SEQ and MC models on the same subset of five languages. We first independently optimised the dropout rates on word embeddings, the encoder's LSTM inputs and outputs, as well as the number of LSTM layers. We then performed a grid search to find the optimal initial learning rate, the learning rate decay factor and the decay step. Value ranges for the tuned parameters are given in Table 6. Table 7 reports accuracies for the tuned models compared to the mean accuracies reported in Table 5. As expected, both tuned models demonstrate superior performance on all languages, except for German with the SEQ model. Hyperparameter tuning has a greater overall effect on the MC model, which suggests that it is more sensitive to the choice of parameters than the SEQ model. Still, the tuned SEQ model performs better than or at least as well as the MC model on all languages.
Comparison with Previous Work Since UD datasets have been in rapid development and different UD versions do not match, direct comparison of our results to previously published results is difficult. Still, we show the results taken from Heigold et al. (2017), which were obtained on UDv1.3, to provide a very rough comparison. In addition, we compare our SEQ model with a neural tagger presented by Dozat et al. (2017), which is similar to our MC model but employs a more sophisticated encoder. We train this model on UDv2.1 on the same set of languages used by Heigold et al. (2017). Table 8 reports evaluation results for the three models. The SEQ model and Dozat's tagger demonstrate comparable performance. This suggests that the SEQ model can be further improved by adopting a more advanced encoder from Dozat et al. (2017).

Conclusion
We hypothesised that explicitly modeling the internal structure of complex labels for morphological tagging improves the overall tagging accuracy over the baseline with monolithic tags. To test this hypothesis, we experimented with three approaches to model composite morphological tags in a neural sequence tagging framework. Experimental results on 49 languages demonstrated the advantage of modeling morphological labels as sequences of category values, and the superiority of this model is especially pronounced on smaller datasets. Furthermore, we showed that, in contrast to baselines, our models are capable of predicting labels that were not seen during training.