LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, predicting tag subcategories, and using the tagger output as an input to the lemmatizer. We evaluate our model across several languages with complex morphology and show that it surpasses state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.


Introduction
Morphologically rich languages are often difficult to process in many NLP tasks (Tsarfaty et al., 2010). As opposed to analytical languages like English, morphologically rich languages encode diverse sets of grammatical information within each word using inflections, which convey characteristics such as case, gender, and tense. The addition of several inflectional variants across many words dramatically increases the vocabulary size, which results in data sparsity and out-of-vocabulary (OOV) issues.
Due to these issues, morphological part-of-speech (POS) tagging and lemmatization are heavily used in NLP tasks such as machine translation (Fraser et al., 2012) and sentiment analysis (Abdul-Mageed et al., 2014). In morphologically rich languages, the POS tags typically consist of multiple morpho-syntactic subcategories providing additional information (see Figure 1). Closely related to POS tagging is lemmatization, which involves transforming each word to its root or dictionary form. Both tasks require context-sensitive awareness to disambiguate words with the same form but different syntactic or semantic features and behavior. Furthermore, lemmatization of a word form can benefit substantially from the information present in morphological tags, as grammatical attributes often disambiguate word forms using context (Müller et al., 2015).
We address context-sensitive POS tagging and lemmatization using a neural network model that jointly performs both tasks on each input word in a given sentence. We train the model in a supervised fashion, requiring training data containing word forms, lemmas, and POS tags. In addition, we incorporate the ideas from Inoue et al. (2017) to optionally allow the network to predict the subcategories of each tag to improve accuracy. Our model is related to the work of Müller et al. (2015), who use conditional random fields (CRF) to jointly tag and lemmatize words for morphologically rich languages. The idea of jointly predicting several dimensions of categories has been explored prior to this work, for example, in joint morphological and syntactic analysis (Bohnet et al., 2013) or joint parsing and semantic role labeling (Gesmundo et al., 2009).
The code for this project is available at https://github.com/hyperparticle/LemmaTag.
Figure 1: An example tag from the Czech PDT tagset, VB-S---3P-AA---, with one character per subcategory.
Our model consists of three parts:
(1) The shared encoder, which creates an internal representation for every word based on its character sequence and the sentence context. We adopt the encoder architecture of Chakrabarty et al. (2017), utilizing character-level (Heigold et al., 2017) and word-level embeddings (Mikolov et al., 2013b; Santos and Zadrozny, 2014) processed through several layers of bidirectional recurrent neural networks (BRNN/BiRNN) (Schuster and Paliwal, 1997; Chakrabarty et al., 2017).
(2) The tagger decoder, which applies a fully-connected layer to the outputs of the shared encoder to predict the POS tags.
(3) The lemmatizer decoder, which applies an RNN sequence decoder to the combined outputs of the shared encoder and tagger decoder, producing a sequence of characters that predict each lemma (similar to Bergmanis and Goldwater (2018)).
The main advantages over other proposed models are:
(i) The model is featureless, requiring little to no text preprocessing or morphological analysis postprocessing.
(ii) The model shares the word embeddings, character embeddings, and RNN encoder weights in the tagger and lemmatizer, improving both tagging and lemmatization accuracy while reducing the number of parameters required for both tasks.
(iii) The model predicts tag subcategories and provides the output of the tagger as features for the input of the lemmatizer, further improving accuracy.
We evaluate the accuracy of our model in POS tagging and lemmatization across several languages: Czech, Arabic, German, and English. For each language, we also compare the performance of a fully separate tagger and lemmatizer to the proposed joint model. Our results show that our joint model is able to improve the accuracy for both tasks, and achieves state-of-the-art performance in both POS tagging and lemmatization in Czech, German, and Arabic, while closely matching state-of-the-art performance for English.

The Joint LemmaTag Model
Given a sequence of words in a sentence $w_1, \ldots, w_k$, the task of the model is to produce a sequence of associated tags $t_1, \ldots, t_k$ and lemmas $\ell_1, \ldots, \ell_k$. For a word $w_i$ at position $i$, we denote $c_{i,1}, c_{i,2}, \ldots, c_{i,m_i}$ to be the sequence of characters that make up $w_i$, where $m_i$ indicates the length of the word string at position $i$. Analogously, we define $l_{i,1}, \ldots, l_{i,\lambda_i}$ to be the sequence of characters that make up the lemma $\ell_i$.
Our proposed model (shown in Figures 2 and 3) is split into three parts: the shared encoder, the tagger, and the lemmatizer. The initial layers of the model are shared between the tagger and lemmatizer, encoding the words, characters, and context in a given sentence. The encoder then passes its outputs to two networks: the tagger, which performs a classification task to predict tags, and the lemmatizer, which performs a sequence prediction task to output lemmas character by character.

Shared Encoder
In the encoder shown in Figure 2, each character $c_{i,1}, c_{i,2}, \ldots, c_{i,m_i}$ of a word $w_i$ is indexed into an embedding layer to produce fixed-length embedded vectors representing each character. These vectors are further passed into a layer of BRNNs composed of gated recurrent units (GRU), producing outputs $e^c_{i,1}, \ldots, e^c_{i,m_i}$, and whose final states are concatenated to produce the character-level embedding $s^c_i$ of the word. Similarly, we index $w_i$ into a word-level embedding layer to compute the vector $e^b_i$. Then we sum these results to produce the final word embedding $e^w_i = s^c_i + e^b_i$. We repeat this process independently for all the words in the sentence and feed the resulting sequence $e^w_1, \ldots, e^w_k$ into another two BRNN layers composed of long short-term memory units (LSTM) with residual connections. This produces word-level outputs $o^w_1, \ldots, o^w_k$ that encode sentence-level context for each word (we ignore the final hidden states).
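The way the final word embedding is composed can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: a toy tanh recurrence stands in for the trained GRU, the sizes are tiny, and the names `make_table`, `run_rnn`, `embed_word`, and the `<unk>` fallback for unseen words are assumptions made for the example. It shows the forward and backward final states being concatenated into the character-level vector and then summed with the word-level vector, which requires both to have the same size.

```python
import math
import random
from typing import Dict, Iterable, List

Vector = List[float]

CHAR_STATE = 2             # toy size of one direction of the character RNN
WORD_DIM = 2 * CHAR_STATE  # forward and backward final states are concatenated


def make_table(symbols: Iterable[str], dim: int, rng: random.Random) -> Dict[str, Vector]:
    """Random embedding table; a trained model would learn these vectors."""
    return {s: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for s in symbols}


def run_rnn(inputs: List[Vector]) -> Vector:
    """Toy recurrence h = tanh(0.5 * h + x), a stand-in for a trained GRU."""
    state = [0.0] * CHAR_STATE
    for x in inputs:
        state = [math.tanh(0.5 * h + xi) for h, xi in zip(state, x)]
    return state


def embed_word(word: str, char_table: Dict[str, Vector], word_table: Dict[str, Vector]) -> Vector:
    """Compute e^w_i = s^c_i + e^b_i for a single word."""
    chars = [char_table[c] for c in word]
    forward_final = run_rnn(chars)
    backward_final = run_rnn(chars[::-1])
    s_c = forward_final + backward_final              # list concatenation, size WORD_DIM
    e_b = word_table.get(word, word_table["<unk>"])   # assumed fallback for unseen words
    return [a + b for a, b in zip(s_c, e_b)]
```

With the sizes reported in the hyperparameters, two concatenated 384-dimensional character states would give a 768-dimensional vector, which matches the word embedding size and makes the sum well defined. That reading of the dimensions is an inference from the text.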

Tagger
The task of the tagger is to predict a tag $t_i \in T$ given a word $w_i$ and its context, where $T$ is a set of possible tags. As explained in the introduction, morphologically rich languages typically subdivide tags further into several subcategories $t_i = (t_{i,1}, \ldots, t_{i,\tau})$, where $t_{i,j} \in T_j$, the set of values of the $j$-th subcategory. See Figure 1 for an illustration taken from the Czech PDT tagset, where $\tau = 15$.
Having the encoded words of a sentence available, the tagger consists of a fully-connected layer with $|T|$ neurons whose input is the output of the word feature RNN $o^w_i$ (see Figure 2). This layer produces the logits for the tag values, and the predicted tag $t_i$ is the most likely value under the softmax over these logits.
To capture the categorical structure of each tag, we also predict every subcategory $t_{i,j}$ of the tag independently (if subcategories exist in the dataset) with $\tau$ dense layers, similar to Inoue et al. (2017). The $j$-th layer has $|T_j|$ neurons and outputs the logits for the values of subcategory $t_{i,j}$. Although these subcategory outputs are trained, they are not used in tag prediction. All tag values $T_i = (t_i, t_{i,1}, \ldots, t_{i,\tau})$ are concatenated into a flat vector and fed into the lemmatizer as an additional set of potentially useful features.
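As a small illustration of the factored tags, the sketch below splits a positional tag into its subcategory values and collects the value set of each position. It assumes that a tag is a fixed-length string with one character per subcategory, as in the 15-position Czech PDT tagset; the example tag strings and the function names are chosen for illustration only.

```python
from typing import List

NUM_SUBCATEGORIES = 15  # tau = 15 for the Czech PDT tagset
EXAMPLE_TAG = "VB-S---3P-AA---"  # illustrative 15-position tag string


def split_tag(tag: str, num_subcategories: int = NUM_SUBCATEGORIES) -> List[str]:
    """Split a positional tag t_i into its subcategory values t_{i,1..tau}."""
    if len(tag) != num_subcategories:
        raise ValueError(f"expected {num_subcategories} positions, got {len(tag)}")
    return list(tag)


def build_vocabularies(tags: List[str]) -> List[List[str]]:
    """Collect the value set T_j of every position j over a list of tags."""
    vocabularies = [set() for _ in range(NUM_SUBCATEGORIES)]
    for tag in tags:
        for j, value in enumerate(split_tag(tag)):
            vocabularies[j].add(value)
    return [sorted(values) for values in vocabularies]
```

Each vocabulary returned by `build_vocabularies` would determine the output size of one of the per-subcategory dense layers.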

Lemmatizer
The task of the lemmatizer is to produce a sequence of characters $l_{i,1}, \ldots, l_{i,\lambda_i}$ and the lemma length $\lambda_i$ for each lemma $\ell_i$. We use a recurrent sequence decoder, a setup typical of many sequence-to-sequence (seq2seq) tasks such as neural machine translation (Sutskever et al., 2014).
The lemmatizer consists of a recurrent LSTM layer whose initial state is taken from the word-level output $o^w_i$ and whose inputs consist of three parts. The first part is the embedding of the previous output character (initially a beginning-of-word character BOW).
The second part is a character-level attention mechanism on the outputs of the character-level BRNN $e^c_{i,1}, \ldots, e^c_{i,m_i}$. We employ the multiplicative attention mechanism described in Luong et al. (2015), which allows the LSTM cell to compute an attention vector that selectively weights the character-level information in $e^c_{i,j}$ at each time step based on the input state of the LSTM cell.
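A minimal sketch of multiplicative attention in plain Python follows. It assumes the bilinear scoring form, score_j = state^T W e_j, followed by a softmax and a weighted sum; the text does not say which multiplicative variant is used, so the scoring form, the weight matrix `weight`, and the function names are assumptions for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_attention(
    state: Vector, char_outputs: List[Vector], weight: Matrix
) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoder time step."""
    # Project the decoder state once: q = state^T W, so that score_j = q . e_j.
    projected = [
        sum(state[r] * weight[r][c] for r in range(len(state)))
        for c in range(len(weight[0]))
    ]
    scores = [sum(p * e for p, e in zip(projected, e_j)) for e_j in char_outputs]
    weights = softmax(scores)
    dim = len(char_outputs[0])
    context = [
        sum(w * e_j[d] for w, e_j in zip(weights, char_outputs)) for d in range(dim)
    ]
    return weights, context
```

The context vector is recomputed at every decoding step, so different output characters can attend to different input characters of the word form.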
The third and final part of the RNN input allows the network to receive information about the embedding of the word, the surrounding context of the sentence, and the output of the tagger. This part is the same for all time steps of a lemma and is a concatenation of the following: the output of the encoder $o^w_i$, the embedded word $e^w_i$, and the processed tag features $T^f_i$. The tag features are obtained by projecting the concatenated outputs of the tagger $T_i$ through a fully connected layer with ReLU activation. During training, we do not pass the gradients back through $T_i$, to prevent distortion of the tagger output.
The decoder performs greedy decoding to predict the character outputs. It runs until it produces the end-of-word character EOW or reaches a character limit of $m_i + 10$.
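The decoding loop can be sketched as follows. The step function is a hypothetical stand-in for the LSTM decoder with attention, the marker strings for BOW and EOW are assumed names, and whether the end-of-word character counts toward the limit is not specified in the text; here it does not.

```python
from typing import Callable, Dict, List, Tuple

BOW = "<bow>"  # beginning-of-word marker (string value assumed)
EOW = "<eow>"  # end-of-word marker (string value assumed)

# A step function maps (previous character, decoder state) to
# (score per candidate character, next decoder state).
StepFn = Callable[[str, object], Tuple[Dict[str, float], object]]


def greedy_decode(step: StepFn, initial_state: object, word_length: int, extra: int = 10) -> str:
    """Emit the highest-scoring character per step until EOW or the length limit."""
    limit = word_length + extra  # character limit m_i + 10
    prev_char, state = BOW, initial_state
    lemma: List[str] = []
    while len(lemma) < limit:
        scores, state = step(prev_char, state)
        prev_char = max(scores, key=scores.get)  # greedy choice
        if prev_char == EOW:
            break
        lemma.append(prev_char)
    return "".join(lemma)


def make_toy_step(target: str) -> StepFn:
    """Hypothetical decoder that spells out `target` and then emits EOW."""
    def step(prev_char: str, position: int) -> Tuple[Dict[str, float], int]:
        best = target[position] if position < len(target) else EOW
        return {best: 1.0, "?": 0.0}, position + 1
    return step
```

Calling `greedy_decode(make_toy_step("moci"), 0, word_length=4)` walks the toy decoder through the loop; a decoder that never emits EOW is cut off after `word_length + 10` characters.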

Loss Function
We define the final loss function as the weighted sum of the losses of the tagger and the lemmatizer, where $y$ are the predicted outputs, $\hat{y}$ the expected outputs, $y_t$ the tag components, and $y_\ell$ the lemma characters. The tagger and lemmatizer losses are separately computed as the softmax cross entropy of the output logits. The weight hyperparameters $\alpha$ and $\beta$ scale the training losses so that the subtag and lemmatizer losses do not overpower the unfactored tag predictor gradients. The vector $\alpha$ contains $\tau + 1$ weights: one for the whole tag and one for every component.
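Under the definitions in this section, the weighted sum could be written out in the following form. This is a reconstruction from the surrounding description rather than the original equation: the symbol $L_{\mathrm{CE}}$ for softmax cross entropy, the indexing of the tag components and lemma characters, and the choice of summing (rather than averaging) over lemma characters are assumptions.

```latex
L(y, \hat{y}) =
    \alpha_0 \, L_{\mathrm{CE}}(y_t, \hat{y}_t)
  + \sum_{j=1}^{\tau} \alpha_j \, L_{\mathrm{CE}}(y_{t,j}, \hat{y}_{t,j})
  + \beta \sum_{j=1}^{\lambda} L_{\mathrm{CE}}(y_{\ell,j}, \hat{y}_{\ell,j})
```

The first term covers the whole (unfactored) tag, the second the $\tau$ tag components, and the third the characters of the lemma.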

Experiments
In this section, we report the evaluation results of our joint tagger and lemmatizer and compare them with the current state of the art on Czech, German, Arabic, and English datasets. Additionally, we evaluate the lemmatizer and tagger separately to compare the relative increase in tagging and lemmatization accuracy.

Datasets
Our datasets consist of the Czech Prague Dependency Treebank (PDT) (Hajič et al., 2006, 2018), the German TIGER corpus (Brants et al., 2004), the Universal Dependencies Prague Arabic Dependency Treebank (UD-PADT) (Hajic et al., 2004), the Universal Dependencies English Web Treebank (UD-EWT) (Silveira et al., 2014), and the WSJ portion of the English Penn Treebank (tags only) (Marcus et al., 1993). In all datasets, we use the tags specific to their respective language. Of these datasets, only Czech and Arabic provide subcategorical tags, and we use unfactored tags for the rest. See Table 1 for tagger and lemmatizer accuracies.
Note that the PDT dataset disambiguates lemmas with the same textual representation by appending a number as a lemma sense indicator. For example, the dataset contains the disambiguated lemmas moc-1 (as power) and moc-2 (as too much). About 17.5% of the PDT tokens have such sense-disambiguated lemmas. LemmaTag predicts the lemmas including the senses, and the accuracies in Table 1 take that into account. Ignoring the sense ambiguity, the lemmatization accuracy of the joint LemmaTag model is 98.94% for Czech-PDT.
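The difference between the two ways of scoring lemmas can be illustrated with a short sketch. It assumes that a sense indicator is always a trailing hyphen followed by digits, as in the moc-1 and moc-2 example; the actual lemma format in the dataset may carry further annotations, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: a sense indicator is a trailing "-<digits>", as in moc-1 and moc-2.
SENSE_SUFFIX = re.compile(r"-\d+$")


def strip_sense(lemma: str) -> str:
    """Remove a trailing sense indicator, if present."""
    return SENSE_SUFFIX.sub("", lemma)


def lemma_accuracy(pairs: List[Tuple[str, str]], ignore_sense: bool = False) -> float:
    """Fraction of (predicted, gold) lemma pairs that match exactly."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    correct = 0
    for predicted, gold in pairs:
        if ignore_sense:
            predicted, gold = strip_sense(predicted), strip_sense(gold)
        correct += int(predicted == gold)
    return correct / len(pairs)
```

A prediction of moc-1 for a gold lemma moc-2 counts as an error under the strict metric and as correct when senses are ignored.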

Hyperparameters
We use loss weights $\alpha_0 = 1.0$ for the whole tags, $\alpha_1, \ldots, \alpha_\tau = 0.1$ for the tag component losses, and $\beta = 0.5$ for the lemmatizer loss. The RNNs and word embedding tables have dimensionality 768, except for the character-level embeddings and the character-level RNN, which are of dimension 384. The fully-connected layer whose inputs are $T_i$ is of dimension 256.
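To make the effect of these weights concrete, the sketch below combines per-term softmax cross entropy losses with the reported values. The function names are hypothetical, and summing the lemma character losses (instead of averaging them) is an assumption.

```python
import math
from typing import Sequence

ALPHA_WHOLE_TAG = 1.0  # alpha_0, weight of the whole-tag loss
ALPHA_COMPONENT = 0.1  # alpha_1..alpha_tau, weight of each tag component loss
BETA_LEMMA = 0.5       # beta, weight of the lemmatizer loss


def softmax_cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Cross entropy of softmax(logits) against a one-hot target."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def joint_loss(
    tag_loss: float,
    component_losses: Sequence[float],
    lemma_char_losses: Sequence[float],
) -> float:
    """Weighted sum of the tagger and lemmatizer losses."""
    tagger = ALPHA_WHOLE_TAG * tag_loss + ALPHA_COMPONENT * sum(component_losses)
    lemmatizer = BETA_LEMMA * sum(lemma_char_losses)  # summation is an assumption
    return tagger + lemmatizer
```

With these weights, a component loss contributes a tenth as much to the gradient as an equally large whole-tag loss, which keeps the unfactored tag prediction dominant.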
We train the models for 40 epochs with random permutations of training sentences and batches of 16 sentences. The starting learning rate is $\eta = 0.001$, and we scale it by 0.25 at epochs 20 and 30 to increase accuracy. We train the network using the lazy variant of the Adam optimizer (Kingma and Ba, 2014), which only updates accumulators for variables that appear in the current batch (TensorFlow, 2018), with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.99$. We clip the global gradient norm to 3.0 to reduce the risk of exploding gradients.
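The learning rate schedule and the gradient clipping can be sketched in plain Python. The sketch assumes 0-indexed epochs and that the factor 0.25 is applied cumulatively at each boundary; the text does not state the indexing, and the function names are illustrative.

```python
import math
from typing import List

BASE_LR = 0.001          # starting learning rate
DECAY = 0.25             # multiplicative factor applied at each boundary
DECAY_EPOCHS = (20, 30)  # epochs at which the rate is scaled
CLIP_NORM = 3.0          # maximum global gradient norm


def learning_rate(epoch: int) -> float:
    """Learning rate for a 0-indexed epoch (the indexing is an assumption)."""
    drops = sum(1 for boundary in DECAY_EPOCHS if epoch >= boundary)
    return BASE_LR * DECAY ** drops


def clip_by_global_norm(grads: List[List[float]], max_norm: float = CLIP_NORM) -> List[List[float]]:
    """Rescale all gradients jointly so that their combined L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]
```

Clipping by the global norm rescales every gradient by the same factor, so the update direction is preserved and only its length changes.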
To prevent the tagger from overfitting, we use several regularization strategies. We apply dropout with rate 0.5 as indicated in Figures 2 and 3. Word dropout (WD) replaces 25% of words with the unknown token <unk> to force the network to rely more on context, combatting data sparsity issues. Lastly, we employ label smoothing (Pereyra et al., 2017), which prevents the network from being too confident in any one class. The label smoothing parameter is set to 0.1 for the tagger logits (both whole tags and the tag components).
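Two of these regularizers are simple enough to sketch directly. The sketch assumes that word dropout replaces each word independently with probability 0.25 and that label smoothing mixes the one-hot target with a uniform distribution over all classes; both are common formulations, and the exact variants used are not spelled out in the text.

```python
import random
from typing import List, Optional

UNK = "<unk>"


def word_dropout(words: List[str], rate: float = 0.25, rng: Optional[random.Random] = None) -> List[str]:
    """Replace each word with <unk> independently with probability `rate`."""
    rng = rng or random.Random()
    return [UNK if rng.random() < rate else word for word in words]


def smooth_labels(target_index: int, num_classes: int, smoothing: float = 0.1) -> List[float]:
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing if k == target_index else 0.0) + uniform
        for k in range(num_classes)
    ]
```

Word dropout is applied only to the training input, so the gold tags and lemmas of the replaced words stay unchanged and the network has to recover them from the characters and the sentence context.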
Note that we did not perform any complex hyperparameter search. For additional information on real-world performance and additional techniques which have not improved evaluation accuracy, see Appendix A.

Conclusion
Table 1: Tagger and lemmatizer accuracies for each approach on Czech-PDT*, German-TIGER, and the other datasets. Results marked with a plus (+) use additional resources apart from the dataset, and datasets marked with a star (*) indicate the availability of subcategorical tags.
The evaluation results show that performing lemmatization and tagging jointly by sharing encoder parameters and utilizing tag features is mutually beneficial in morphologically rich languages. We have shown that incorporating these ideas results in excellent performance, surpassing the state of the art in Czech, German, and Arabic POS tagging and lemmatization by a substantial margin, while closely matching state-of-the-art English POS tagging accuracy.
However, in languages with weak morphology such as English (and German to a lesser extent), sharing the encoder parameters may even hurt the performance of the tagger. We believe this is a consequence of tags correlating less with word-level morphology, and more with sentence-level syntax in morphologically poor languages. Lemma prediction could benefit from the syntactic information in the tags, but the tag predictions rely more on syntactic structure (i.e., word order) rather than on root forms of individual words which could be ambiguous.
There are some possible performance improvements and additional metrics which we leave for future work. For simplicity, one improvement we intentionally left out is the use of additional data. We can incorporate word2vec (Mikolov et al., 2013a) or ELMo (Peters et al., 2018) word representations, which have been shown to reduce out-of-domain issues and provide semantic information (Eger et al., 2016). A second improvement is to integrate information from a morphological dictionary to resolve certain ambiguities (Hajič et al., 2009; Inoue et al., 2017). A third improvement can be to replace the seq2seq lemmatizer decoder with a classifier that chooses a corresponding edit tree to modify (reduce) the word form to its lemma (Chakrabarty et al., 2017). A fourth possible improvement would be to experiment with the Transformer model (Vaswani et al., 2017), which utilizes non-recurrent multi-headed self-attention and has been shown to achieve state-of-the-art performance in several related sequence tasks (Dehghani et al., 2018). Lastly, we would like to evaluate LemmaTag on a wider range of languages, e.g., on the Universal Dependencies (Nivre et al., 2016) languages and treebanks which employ lemmatization, and to analyze the use of different types of POS tags in the model.