Morpheus: A Neural Network for Jointly Learning Contextual Lemmatization and Morphological Tagging

In this study, we present Morpheus, a joint contextual lemmatizer and morphological tagger. Morpheus is based on a sequential neural architecture whose inputs are the characters of the surface words in a sentence and whose outputs are the minimum edit operations between surface words and their lemmata as well as the morphological tags assigned to the words. Experiments on the datasets in nearly 100 languages provided by the SigMorphon 2019 Shared Task 2 organizers show that the performance of Morpheus is comparable to the state-of-the-art system in terms of lemmatization. In morphological tagging, on the other hand, Morpheus significantly outperforms the SigMorphon baseline. In our experiments, we also show that a neural encoder-decoder architecture trained to predict the minimum edit operations produces considerably better results than an architecture trained to predict the characters of the lemmata directly, as in previous studies. In the SigMorphon 2019 Shared Task 2, Morpheus placed 3rd in lemmatization and 9th in morphological tagging among all participating teams.


Introduction
Lemmatization is the process of reducing an inflected word to its dictionary form, known as the lemma. Morphological tagging, on the other hand, is the process of marking up words with their morphological information and part-of-speech (POS) tags. Lemmatization and morphological tagging are essential tasks in natural language processing since they usually constitute initial steps of subsequent tasks such as dependency parsing (Chen and Manning, 2014; McDonald and Pereira, 2006) and semantic role labeling (Haghighi et al., 2005). Morphological information of words is utilized to improve performance in various tasks including statistical machine translation (Huck et al., 2017), neural machine translation (Conforti et al., 2018) and named entity recognition (Güngör et al., 2019). Morphological tagging and lemmatization are especially crucial in morphologically rich languages such as Turkish and Finnish, since inflected and derived words carry a substantial amount of information such as number, person, case, tense and aspect. Moreover, lexical ambiguities can occur in highly inflectional and derivational languages such as Turkish: the correct lemma and morphological tags may differ according to the context in which a word appears. As shown in Table 1, the Turkish word "dolar" may have different lemmata and morphological properties depending on the context in which it is used.
To achieve lemmatization and morphological tagging in highly inflectional languages, traditional approaches employ finite state machines constructed to model the grammatical rules of a language (Oflazer, 1993; Karttunen et al., 1992). Building a state machine for morphological analysis is not a trivial task and requires considerable effort and linguistic knowledge. Furthermore, morphological analyzers frequently produce multiple analyses for each word, which introduces morphological ambiguities. Morphological disambiguation, the process of selecting the correct analyses of words according to the context (Yildiz et al., 2016; Shen et al., 2016), is therefore usually needed after the morphological analysis step. Morphological disambiguation is also a difficult problem since it requires the classification of both lemmata and the corresponding labels. Therefore, researchers have studied language-agnostic, data-driven solutions for both lemmatization and morphological tagging, in most studies applying machine learning or statistical methods over morphologically annotated data (Kirov et al., 2018). Several models have been proposed to perform joint morphological tagging and lemmatization. One of the early studies, Morfette (Chrupała et al., 2008), utilized a maximum entropy classifier to find the lemmata and morphological tags of each word in a sentence. Two separate classifiers are employed in their architecture: one for assigning morphological tags to the words and one for predicting the shortest edit script between the surface word and its lemma. A shortest edit script is the shortest sequence of instructions (insertions, deletions, and replacements) which transforms one string into another. In this way, the system is able to lemmatize out-of-vocabulary words by predicting the transformation which should be applied to the surface word to obtain its lemma.
More recent work, namely Lemming (Müller et al., 2015), outperformed Morfette by using a conditional random field classifier to score each candidate sequence of lemmata and morphological tags jointly. The feature space of Lemming differs from Morfette's, as Lemming also uses external lexical features such as the occurrence of a candidate lemma in a dictionary. As deep neural networks gained popularity and led to state-of-the-art results in various natural language processing tasks, sequential neural networks have been successfully employed for lemmatization and morphological tagging in recent studies (Bergmanis and Goldwater, 2018; Malaviya et al., 2019; Dayanık et al., 2018; Chakrabarty et al., 2017). Promising results are obtained through standard encoder-decoder neural architectures where the inputs are the character sequences of the words and the outputs are the character sequences of the lemmata and the morphological tags (Bergmanis and Goldwater, 2018; Dayanık et al., 2018). Neural architectures designed to predict the edit operations between surface words and lemmata have also been proposed (Chakrabarty et al., 2017). The current state of the art is held by Malaviya et al. (2019), who use a neural hard attention mechanism to align the characters of surface words and lemmata. Morphological tagging and lemmatization are jointly modeled in their architecture, and a dynamic programming approach is used to maximize both morphological tagging and lemmatization scores. In the SigMorphon 2019 workshop, a shared task on morphological tagging and contextual lemmatization in nearly 100 distinct languages was organized (McCarthy et al., 2019). In this study, we propose a neural network architecture, namely Morpheus, for SigMorphon 2019 Shared Task 2. Our architecture is inspired by MorphNet (Dayanık et al., 2018), which has produced promising results in Turkish using an encoder-decoder neural architecture.
In MorphNet, each character is represented by a vector, and word vectors are generated by applying a long short-term memory (LSTM) network over the character vectors. Another, bidirectional, LSTM is applied over the word vectors to obtain a context-aware representation of each word in a sentence. An LSTM-based decoder takes the context-aware word representations as input and produces lemmata and morphological tags, respectively. Our architecture differs from MorphNet in that we use two separate decoders for generating lemmata and morphological tags. Another difference is that we follow the minimum edit script prediction approach, considering the promising performance of prior work (Chrupała et al., 2008; Müller et al., 2015; Chakrabarty et al., 2017). The lemma decoder of our network is optimized to predict the minimum edit operations between surface words and lemmata instead of predicting the character sequences of the lemmata as in MorphNet and Lematus.
Our experiments show that predicting the minimum edit operations instead of characters improves the performance significantly on the UniMorph dataset, which is provided in SigMorphon 2019 Shared Task 2. The performance of the proposed architecture is comparable to the current state-of-the-art system (Malaviya et al., 2019), which is provided as a strong baseline by the SigMorphon 2019 organizers. All of the experiments in this paper are reproducible using the code we make publicly available.

Method
The input of our neural network based model is a sentence containing surface form words, and the outputs are the edit operations between surface words and their lemmata together with the morphological tags assigned to the words. The problem can be defined as searching for a function $f$ whose input is the sequence of surface words of a sentence, $[w_0, \ldots, w_n]$, and whose output is the set of $(o_i, m_i)$ tuples, where $o_i$ is the set of edit operations that generates the lemma of the surface form $w_i$ and $m_i$ is the set of morphological tags assigned to $w_i$. The overall architecture of the system is illustrated in Figure 1. The system comprises three neural components that run sequentially:

• The first component generates word vectors by applying LSTMs over the vector representations of their characters.

• The second component generates context-aware word vectors by applying bidirectional LSTMs over the word vectors.

• Two separate LSTM decoders accept the same context-aware word vectors. The first decoder generates edit operations between surface words and lemmata, while the second decoder generates morphological tags.

In the final step, lemmata are generated by applying the predicted edit operations to the surface words.
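The final step, applying predicted per-character edit labels to a surface word, can be sketched as follows. This is a minimal sketch: the label vocabulary (Same, Delete, Replace_*, Insert_*) is described in the following subsection, but the convention that an Insert label keeps its surface character and appends the inserted characters is our assumption, not necessarily the paper's exact scheme.

```python
def apply_edit_ops(surface: str, ops: list) -> str:
    """Apply one predicted edit label per surface character to build the lemma."""
    out = []
    for ch, op in zip(surface, ops):
        if op == "Same":
            out.append(ch)                       # keep the character
        elif op == "Delete":
            pass                                 # drop the character
        elif op.startswith("Replace_"):
            out.append(op[len("Replace_"):])     # substitute (possibly multi-char)
        elif op.startswith("Insert_"):
            # assumption: keep the character, then append the inserted chars
            out.append(ch + op[len("Insert_"):])
    return "".join(out)

# The Russian example discussed later in the paper: "видна" -> "видный"
lemma = apply_edit_ops("видна", ["Same"] * 4 + ["Replace_ый"])
```

With the example labels above, the function reproduces the lemma "видный".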

Generating minimum edit operations
The proposed model is designed to predict the minimum edit operations needed to obtain the lemma from a surface word. The fundamental edit operations are Same, Delete, Replace, and Insert. To find the minimum edit operations between a surface word and its lemma, we use a dynamic programming approach based on Levenshtein distance. Some sample edit operations between surface words and lemmata are given in Figure 2 for several languages. As seen in Figure 2, the Same and Delete operations have only one version, whereas the Replace and Insert operations have multiple versions combined with the character to be replaced or inserted. Therefore, the actual number of elements in the edit operation set is determined for each language separately by processing the training data. Generally, the length of a lemma is shorter than or equal to the length of the corresponding surface form, and consequently, the number of edit operations is usually the same as the length of the surface form. However, for some languages, lemmata longer than the corresponding surface forms are observed. Since our minimum edit prediction decoder predicts an edit operation label for each character in the surface form (see Section 2.4.1 for details), it fails to generate lemmata longer than the surface forms. Thus, we make some modifications to the base operations generated by the standard Levenshtein distance based algorithm. To ensure that the number of operation labels equals the length of the surface word, we merge consecutive Insert labels at the same position into one Insert label with multiple characters. We also combine a Replace label and the following Insert labels at the same position into one Replace label with multiple characters. For example, the minimum edit operations for the Russian surface-lemma pair "видна"-"видный" are four Same labels followed by one Replace_ый label.
Note that the last character of the surface word, "а", is replaced with the character "ы", and then the character "й" is inserted at the end. The base labels Replace_ы and Insert_й are merged into the single label Replace_ый to ensure that the edit operations and the surface word have the same length (Figure 2).
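The label extraction described above can be sketched with a standard Levenshtein dynamic program followed by the merging step. This is a sketch under our own assumptions: the backtrace tie-breaking order (match, then insert, then delete, then replace) and the handling of insertions after a Same or Delete label are ours, not necessarily the paper's.

```python
def edit_ops(surface: str, lemma: str) -> list:
    """Minimum edit labels, one per surface character after merging."""
    n, m = len(surface), len(lemma)
    # Standard Levenshtein distance table.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if surface[i - 1] == lemma[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # same / replace
    # Backtrace, preferring matches, then insertions.
    raw, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and surface[i - 1] == lemma[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            raw.append("Same"); i -= 1; j -= 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            raw.append("Insert_" + lemma[j - 1]); j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            raw.append("Delete"); i -= 1
        else:
            raw.append("Replace_" + lemma[j - 1]); i -= 1; j -= 1
    raw.reverse()
    # Merge Insert labels into the preceding label so that exactly one
    # label remains per surface character.
    merged = []
    for op in raw:
        if op.startswith("Insert_") and merged:
            payload = op[len("Insert_"):]
            prev = merged[-1]
            if prev.startswith(("Replace_", "Insert_")):
                merged[-1] = prev + payload        # e.g. Replace_ы + й -> Replace_ый
            elif prev == "Same":
                merged[-1] = "Insert_" + payload   # keep char, then insert after it
            else:
                merged.append(op)                  # after Delete: left as-is here
        else:
            merged.append(op)
    return merged
```

For the Russian pair from the text, `edit_ops("видна", "видный")` yields four Same labels followed by Replace_ый, matching the example.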

Word representations
The first component of our network takes the character sequence of each word in a sentence as input and generates vector representations of the words using an LSTM network. Let $w_i$ denote the $i$-th word, with $L_i$ characters, in a sentence, and let $w_{ij}$ be the $j$-th character of $w_i$. In our model, each character $w_{ij}$ is represented by a vector $a_{ij} \in \mathbb{R}^{d_a}$, and we calculate the vector representation of the $i$-th word, $e_i \in \mathbb{R}^{d_e}$, by applying an LSTM over the vector representations of its constituent characters from left to right, as shown in eq. (1). The last hidden state of the LSTM, $h_{iL_i} \in \mathbb{R}^{d_e}$, is taken as the vector representation of the word $w_i$.
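Using this notation, the recurrence referred to as eq. (1) can be written as follows (a plausible reconstruction; the paper's exact formulation may differ):

```latex
h_{ij} = \mathrm{LSTM}(a_{ij},\, h_{i(j-1)}), \quad j = 1, \ldots, L_i,
\qquad e_i = h_{iL_i} \tag{1}
```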

Context-aware word representations
The context of a word has a substantial impact on morphological tagging and lemmatization in most languages (Shen et al., 2016; Malaviya et al., 2019). In order to take the context of the words into account, we employ another, bidirectional, LSTM that takes the vector representations $e_i$ as input and outputs a context-aware representation $c_i \in \mathbb{R}^{d_c}$ for each surface word, as shown in eqs. (2) to (4). The final output is a context-aware vector representation $c_i$ for each word $w_i$ in the sentence.
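Under the same notation, eqs. (2) to (4) plausibly take the standard bidirectional form below (a reconstruction, not the paper's exact equations; the concatenation in the last line is consistent with $d_c = 2 d_e$ in the experimental setup):

```latex
\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(e_i,\, \overrightarrow{h}_{i-1}) \tag{2}
```
```latex
\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(e_i,\, \overleftarrow{h}_{i+1}) \tag{3}
```
```latex
c_i = [\overrightarrow{h}_i \,;\, \overleftarrow{h}_i] \tag{4}
```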

Decoding Components
An important difference between the proposed network and previous studies (Bergmanis and Goldwater, 2018; Dayanık et al., 2018) is that it has two separate decoders for lemmatization and morphological tagging. The parameters of the decoders are not shared; however, both are fed with the same word vectors $e_i$ and context-aware word vectors $c_i$ generated in the encoding step.

Minimum edit prediction decoder
The minimum edit prediction decoder consists of a two-layer bidirectional LSTM network and an embedding layer that maps each character $w_{ij}$ in the surface word to a vector $u_{ij} \in \mathbb{R}^{d_u}$. The forward LSTM takes the previous hidden states $\overrightarrow{g}^1_{j-1}, \overrightarrow{g}^2_{j-1} \in \mathbb{R}^{d_g}$ as input and outputs the current hidden states $\overrightarrow{g}^1_j, \overrightarrow{g}^2_j$ and an output vector $\overrightarrow{y}_j \in \mathbb{R}^{d_y}$. The backward LSTM applies the same operations in the opposite direction and outputs $\overleftarrow{g}^1_j, \overleftarrow{g}^2_j \in \mathbb{R}^{d_g}$ and $\overleftarrow{y}_j$. A softmax function is then applied to the product of a trainable matrix $W_o \in \mathbb{R}^{d_y \times |o|}$ and the concatenation of the output vectors $\overrightarrow{y}_j$ and $\overleftarrow{y}_j$, where $|o|$ is the number of distinct edit operations observed in the dataset. The output of the softmax is the probability $p(o_{ij})$ of each minimum edit operation corresponding to the character $w_{ij}$, as shown in eqs. (5) to (7).
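The output layer of eqs. (5) to (7) can be illustrated with toy dimensions. This is a sketch with hypothetical values; in particular, treating $d_y$ as the size of each directional output (so the concatenation has size $2 d_y$) is our assumption.

```python
import math
import random

random.seed(0)
d_y, n_ops = 8, 5                        # toy sizes; |o| = 5 edit labels

# Stand-ins for the forward/backward decoder outputs at character position j.
y_fwd = [random.gauss(0, 1) for _ in range(d_y)]
y_bwd = [random.gauss(0, 1) for _ in range(d_y)]
# W_o maps the concatenation [y_fwd; y_bwd] (size 2*d_y) to |o| logits.
W_o = [[random.gauss(0, 1) for _ in range(n_ops)] for _ in range(2 * d_y)]

def softmax(xs):
    mx = max(xs)                         # subtract max for numerical stability
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

concat = y_fwd + y_bwd
logits = [sum(c * W_o[k][t] for k, c in enumerate(concat)) for t in range(n_ops)]
p_ops = softmax(logits)                  # p(o_ij): distribution over edit labels
```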
The first hidden states of both the forward and backward LSTMs are initialized with the word vector $e_i$ (see Section 2.2) and with a linear transformation of the context-aware word vector, $W_c \times c_i$, where $W_c$ is a matrix in $\mathbb{R}^{d_c \times d_e}$ (see Section 2.3), respectively (see eq. (11)).

Morphological tagging decoder
The morphological tagging decoder is another LSTM decoder, which can generate morphological tags without a length restriction. Each word $w_i$ has $K_i$ morphological tags, and each morphological tag $m_{il}$ is represented by a vector $v_{il} \in \mathbb{R}^{d_v}$. A two-layer unidirectional LSTM network is initialized in the same way as in the minimum edit prediction component. At each step, an LSTM cell takes the vector representation $v_{i(l-1)}$ of the previously predicted tag $m_{i(l-1)}$ and the previous hidden states $q^1_{l-1}, q^2_{l-1} \in \mathbb{R}^{d_q}$ as input. The outputs of the LSTM cell are the current hidden states $q^1_l, q^2_l$ and an output vector $z_{il} \in \mathbb{R}^{d_z}$. A softmax function is then applied to the product of the output vector $z_{il}$ and a trainable matrix $W_m \in \mathbb{R}^{d_z \times |m|}$, where $|m|$ is the number of distinct morphological tags in the dataset. The first input to the decoder is the vector representation of a special start symbol, $v_{start} \in \mathbb{R}^{d_v}$. In this way, the probability $p(m_{il})$ of each morphological tag at position $(i, l)$ is calculated as shown in eqs. (5), (12) and (13).
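The autoregressive tag decoding described above can be sketched as a greedy loop. Here `step` stands in for one LSTM step followed by the softmax layer; the greedy argmax and the stopping condition on a special end symbol are our assumptions (the paper only states that decoding starts from a start symbol and is not length-restricted).

```python
def decode_tags(step, start_id, end_id, max_len=50):
    """Greedily decode morphological tag ids until an end symbol appears.

    step(prev_id, state) -> (probs, state) stands in for one LSTM step
    plus softmax; state carries the hidden states q^1, q^2.
    """
    tags, state, prev = [], None, start_id
    for _ in range(max_len):
        probs, state = step(prev, state)
        prev = max(range(len(probs)), key=probs.__getitem__)  # argmax tag id
        if prev == end_id:
            break
        tags.append(prev)
    return tags
```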

Character prediction decoder
The character prediction decoder, which sequentially predicts the characters occurring in lemmata, is not employed in the proposed architecture. However, we build an alternative model in which a character prediction decoder is used instead of the minimum edit prediction decoder. In this way, we aim to evaluate the impact of predicting minimum edit operations instead of the characters of lemmata. The character prediction decoder used in the experiments has the same architecture and parameter set as the morphological tagging decoder.

Training objective
All the parameters of the whole architecture, including all LSTM parameters and the trainable matrices $W_c$, $W_o$ and $W_m$, are optimized jointly in the training phase by minimizing the sum of two cross entropy losses, as follows. The loss for lemmatization is calculated by taking the cross entropy over the predicted minimum edit operations $p(o_{ij})$, as in eq. (15), where $N$ stands for the number of words in the sentence and $L_i$ stands for the number of characters in the word $w_i$. The loss for morphological tagging is the cross entropy over the predicted tag probabilities $p(m_{il})$, as in eq. (16), where $K_i$ stands for the number of morphological tags assigned to the word $w_i$. The total loss to be minimized is the sum of the lemmatization loss and the morphological tagging loss (see eq. (17)).

[Table 2: Turku NLP (Kanerva et al., 2018) 92.18 / 86.7; UPPSALA Uni. (Moor, 2018) 58.5 / 88.32; SigMorphon 2019 Baseline (Malaviya et al., 2019) …]
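The objective of eqs. (15) to (17) can be sketched in plain Python. The nested-list layout of the predicted distributions (per word, per position) is our assumption; in practice the same computation would be done with a framework's batched cross entropy.

```python
import math

def cross_entropy(prob_seqs, gold_seqs):
    """Sum of negative log-probabilities of the gold labels.

    prob_seqs: for each word, a list of distributions (one per position);
    gold_seqs: for each word, the gold label index at each position.
    This covers both eq. (15) (positions = characters, labels = edit ops)
    and eq. (16) (positions = tag slots, labels = morphological tags).
    """
    loss = 0.0
    for probs, gold in zip(prob_seqs, gold_seqs):
        for p, y in zip(probs, gold):
            loss -= math.log(p[y])
    return loss

def joint_loss(edit_probs, gold_ops, tag_probs, gold_tags):
    # eq. (17): total loss = lemmatization loss + tagging loss
    return cross_entropy(edit_probs, gold_ops) + cross_entropy(tag_probs, gold_tags)
```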

Experiments
In our experiments, we train and evaluate the proposed architecture on the UniMorph dataset collection (Kirov et al., 2018) for each language. The same architecture with the same hyper-parameters is used for all languages. To investigate the impact of the minimum edit prediction component on the performance, we also train a network in which a character prediction decoder is used instead of the minimum edit prediction component. The results of the experiments are provided in Section 3.3 and Table 2.

Dataset
The UniMorph dataset collection, which includes sentences consisting of surface words annotated with lemmata and morphological tags in 97 different languages, is provided in SigMorphon 2019 Shared Task 2. The dataset for each language is split into train and validation sets, and the size of the dataset differs across languages. In our experiments, we train our architecture on the train sets and evaluate performance on the validation sets. The experimental results presented in Section 3.3 are obtained on the validation sets. Note that the final results presented in the SigMorphon paper (McCarthy et al., 2019) are calculated over the test sets, which were not available to the systems before the final submission stage.

Experimental Setup
The same settings are used in training the architecture for each language. The input character embedding size $d_a$ is set to 128, the word vector size $d_e$ to 1024, and the context-aware word vector size $d_c$ to 2048. The size of the character vectors in the minimum edit prediction component, $d_u$, and the size of the morphological tag vectors, $d_v$, are set to 256, while the hidden unit sizes of the decoder LSTMs, $d_g$ and $d_q$, are set to 1024. We use the Adam optimization algorithm (Kingma and Ba, 2014) with a learning rate of 3e-4 to minimize the loss, and the parameters are initialized as in (Glorot and Bengio, 2010). We use an early stopping mechanism which stops training after four consecutive epochs without improvement on the validation set. Table 2 presents the lemmatization and morphological tagging performances of the proposed method on the UniMorph dataset collection. The lemmatization accuracy on a dataset is the proportion of correctly predicted lemmata over the total number of lemmata. The lemmatization accuracy given in Table 2 is the average of the accuracies obtained over the validation sets of all languages. The performance of morphological tagging is measured by the F1 score calculated over the predicted and actual individual morphological tags.
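The early stopping rule can be sketched as follows; the function names and the assumption that the validation metric is maximized are illustrative, not the paper's implementation.

```python
def train_with_early_stopping(train_epoch, validate, patience=4, max_epochs=100):
    """Stop after `patience` consecutive epochs without validation improvement.

    train_epoch() runs one pass over the training data; validate() returns
    a validation score where higher is better (stand-ins for the real loops).
    """
    best_score, best_epoch, bad_epochs = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_epoch()
        score = validate()
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                    # e.g. four epochs without improvement
    return best_score, best_epoch
```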

Results
In addition to the performance of the proposed architecture with the minimum edit prediction decoder, the performance of the architecture with the character prediction decoder is also given. The performances of the SigMorphon 2019 neural baseline, the Turku NLP system (Kanerva et al., 2018), which was the best lemmatization performer in the CoNLL 2018 Shared Task (Zeman and Hajič, 2018), and the UPPSALA Uni system, which was the best morphological tagging performer in CoNLL 2018, are also given. Although the dataset provided in CoNLL 2018 shares the same basis with the dataset provided in SigMorphon, important differences exist between them. Hence, the performances of Turku NLP and UPPSALA Uni are not directly comparable to our systems and the SigMorphon baselines. However, we present the performances of those systems averaged over the same languages in the SigMorphon dataset to provide an idea of how much improvement has been achieved over a year. According to the results, the proposed architecture, Morpheus, performs slightly better than the SigMorphon neural baseline in terms of average lemmatization accuracy. Similarly, for the morphological tagging task, Morpheus with a minimum edit prediction decoder significantly outperforms both the baseline and Morpheus with a character prediction decoder. The experiments show that the performance is improved considerably when the minimum edit prediction decoder is used instead of the character prediction decoder. An impressive result is that the performance of morphological tagging is also enhanced by employing the minimum edit prediction decoder. Table 3 shows the lemmatization and morphological tagging performances of both the character prediction and minimum edit prediction models for each language. The performance of the minimum edit prediction model is better than that of the character prediction model in almost all languages.
Figure 3 shows that there is a correlation between the size of the training data and the improvement in performance when the minimum edit prediction decoder is employed. For instance, the relative lemmatization improvement is extreme in languages with relatively small datasets such as Tagalog-TRG (400 tokens / 0.75 relative improvement), Komi-Zyrian (1.1K tokens / 0.78 relative improvement) and Akkadian (1.7K tokens / 0.44 relative improvement). On the other hand, in languages with large datasets such as Spanish-AnCora (496K tokens), Catalan-AnCora (480K tokens) and French-GSD (359K tokens), the improvement is relatively low (0.006, 0.007 and 0.01, respectively). Although the magnitude of the improvement is highly correlated with the training dataset size, there must be other factors specific to the properties of the language. For instance, the dataset of the language Marathi-UFAL is small (4.1K tokens); however, the improvement is nevertheless small (0.03 relative improvement). To investigate in which cases the edit prediction model performs better, we explore the outputs of the models for English and Turkish. A significant portion of the errors of the character prediction model is observed on unseen words and proper nouns. Some of the errors made by the character prediction based model and corrected by the edit prediction based model are shown in Table 4. A possible reason is that the lemmatization of a singular nominative noun which is rarely seen in the training data is easier for the edit prediction model, since all of the edit operations are Same and the model only has to produce a sequence of Same labels. The character prediction based model, on the other hand, has to learn to reproduce the word from scratch. Additionally, we observe a significant number of samples in which the edit prediction model produces morphological tags and lemmata more appropriate to the context than the outputs of the character prediction model.
As a result, further research is needed to understand in which cases the edit prediction decoder helps to better learning of morphological properties of a language.

Conclusion
In this study, we propose a neural architecture, namely Morpheus, which is based on sequential neural encoder-decoders. The input words are encoded into context-aware vector representations using a two-level LSTM network, and the decoders, initialized with the context-aware word vectors, generate both the morphological tags assigned to the words and the minimum edit operations between surface words and their lemmata. We perform experiments to evaluate the performance of Morpheus on the UniMorph dataset collection (Kirov et al., 2018), which comprises nearly 100 language datasets. The experiments show that the lemmatization performance of Morpheus is comparable to the SigMorphon neural baseline system (Malaviya et al., 2019), which has obtained the current state-of-the-art results on the UniMorph dataset collection. Regarding morphological tagging performance, Morpheus outperforms the SigMorphon morphological tagger baseline significantly (0.3 relative improvement). In lemmatization, Morpheus placed 3rd in the SigMorphon 2019 Shared Task 2, and it reached 9th place in morphological tagging. In our experiments, we also show that predicting the minimum edit operations between surface words and their lemmata instead of directly predicting the characters improves the performance of the system significantly, especially when the dataset is small.