CBNU System for SIGMORPHON 2019 Shared Task 2: a Pipeline Model

In this paper we describe our system for morphological analysis and lemmatization in context, using a transformer-based sequence-to-sequence model and a biaffine-attention-based BiLSTM model. First, a lemma is produced for a given word, and then both the lemma and the given word are used for morphological analysis. We also make use of character-level word encodings and trainable encodings to improve accuracy. Overall, our system ranked fifth in lemmatization and sixth in morphological accuracy among twelve systems, and demonstrated considerable improvements over the baseline in morphological analysis.


Introduction
In this paper we present the neural network architecture that we used for SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). We use two models, pipelined in sequence. Our approach is based on the idea that lemmatization is an m-to-n mapping task: given a word of m characters, we need to produce its lemma consisting of n characters. Morphological analysis, by contrast, calls for a different approach: given a sentence consisting of m words, we need to choose one label from a fixed set of labels for each word. Hence, morphological analysis/tagging is a classification task over an input sequence.

Task and Dataset
There are two tasks in SIGMORPHON 2019, and we chose task 2. The idea of the task is simple: the input is a sentence made of words, and the output is a lemma and a morphosyntactic description (MSD) for each word. Table 1 shows sample data for task 2: the first column is the input, the second is the lemma, and the last is the MSD for each word. Using the lemma as an additional input for MSD tagging may change the results; our experiments showed improved performance when the lemma was incorporated.
The dataset consists of an initial 98 datasets covering more than 60 distinct languages, plus nine additional surprise languages/datasets that were added later. Some of the datasets are for languages that are not widespread in terms of usage and amount of available training data. For example, Akkadian has only 80 sentences of training data, and other low-resource languages similarly have small numbers of sentences: Amharic has 859, Bambara 820, Buryat 741, Cantonese 520, etc. On the other hand, Russian SynTagRus and Czech PDT have 49,511 and 70,330 training sentences, respectively. In addition to having less training data, some of the low-resource languages also do not have pre-trained word vectors. In such cases, we use a related language's word vectors as a substitute, as will be discussed later.

Model
The baseline model provided by the task organizers approaches task 2 by first finding an MSD tag for a given word and incorporating that information in lemmatization. Given a sequence of words w, a sequence of morphological tags m, and a sequence of lemmas l, they define their model as:

p(l, m | w) = p(m | w) · p(l | m, w)    (1)

This illustrates the importance of MSD tags in the lemmatization process. However, lemmatization can be done effectively even without consideration of morphological tags. Therefore, our approach flips the order of operations: we first find the lemma for a given word and input the original sentence together with the generated lemma to the MSD tagger. Equation 2 summarizes this idea:

p(l, m | w) = p(l | w) · p(m | l, w)    (2)

Overall, given the nature of the required tasks, an m-to-n sequence-to-sequence model for lemmatization and a label classifier model for morphological analysis are used. The two models are trained separately and pipelined as shown in Figure 1. As an example, when given an initial sentence "these guys are fantastic!", we lemmatize each input word, yielding "these guy be fantastic!" We then input the derived lemmas and the original input to the MSD tagger. At the end, we obtain an MSD tag for each input word.
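The two-stage flow above can be sketched as follows. The function names and bodies are hypothetical toy stand-ins (a lookup table and placeholder tags); in the actual system they are the trained transformer lemmatizer and the BiLSTM MSD tagger described in the following sections.

```python
def lemmatize(words):
    # Stand-in for the character-level seq2seq lemmatizer (stage 1).
    toy_lemmas = {"these": "these", "guys": "guy",
                  "are": "be", "fantastic!": "fantastic!"}
    return [toy_lemmas.get(w, w) for w in words]

def tag_msd(words, lemmas):
    # Stand-in for the MSD tagger (stage 2): the real tagger is
    # conditioned on both the original words and their lemmas.
    return ["_MSD_" for _ in words]

def pipeline(sentence):
    words = sentence.split()
    lemmas = lemmatize(words)        # stage 1: lemmatization
    tags = tag_msd(words, lemmas)    # stage 2: tagging, using the lemmas
    return list(zip(words, lemmas, tags))
```

The point of the pipeline is that stage 2 receives stage 1's output as an extra input, which is the reverse of the baseline's ordering.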

Lemmatizer
Our lemmatizer is a sequence-to-sequence model based on an encoder-decoder architecture using Google's transformer (Vaswani et al., 2017). Lemmatization is similar to translation in that an input sequence is mapped to an output sequence of a different length. Therefore, our approach is justified by the model's robust performance in neural machine translation, particularly on the WMT 2014 English-to-German and WMT 2014 English-to-French datasets. An informal leaderboard at http://nlpprogress.com shows that the best performing teams use a transformer architecture for their encoder-decoder models (cf. Edunov et al., 2018).
A more formal leaderboard, the GLUE benchmark, consists of tasks that mainly use the encoder part of the encoder-decoder architecture. The tasks of the GLUE benchmark are therefore not directly comparable with lemmatization, but even in this case, at least the top 10 performers use BERT (Devlin et al., 2018), which is built on a transformer encoder architecture (cf. Liu et al., 2019; Keskar et al., 2019).
The specific code for lemmatization is taken from the tensor2tensor library (version 1.13.4), with some modifications added for our task. We chose the built-in hyperparameter configuration of transformer_tiny. The input and the output are sequences of characters, and no pre-trained embedding is used. One word is input at a time, so no context words are taken into consideration. For instance, in the example above, the encoder input is "t h e s e" as a sequence of characters and the decoder output is likewise "t h e s e". Similarly, "g u y s" and "g u y", "a r e" and "b e", etc. are input and output one by one. Overall, the number of attention layers or heads is 4, as opposed to 8 in the original paper, and hence the model requires less computational power without substantial loss in accuracy. The model performs quite well: with this basic setup it was ranked fifth among 12 participating systems.
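As a sketch, the character-level input preparation described above might look like the following, assuming each word is rendered as whitespace-joined character tokens (the helper name is ours, not from the tensor2tensor codebase):

```python
def to_char_sequence(word):
    # Each word is presented to the transformer as a sequence of
    # characters, with no surrounding sentence context.
    return " ".join(word)

# Encoder-input / decoder-output pairs from the running example:
pairs = [(to_char_sequence(w), to_char_sequence(l))
         for w, l in [("these", "these"), ("guys", "guy"), ("are", "be")]]
```

Each pair is then fed to the model independently, one word at a time.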

MSD tagger
The task of morphological analysis uses the output of lemmatization after pipelining it. Furthermore, MSD tagging is very similar to another well-researched NLP task: head-dependent relation labelling in dependency parsing. Like head-dependent relation labelling, the MSD tag of a word depends on the word itself and its position within the sentence. As an example, consider two sentences: "I live in an apartment" and "I like live music". Even though "live" occurs in both sentences, the label we attach depends on the context. In other words, context words and the word itself determine its MSD tag. Therefore, we use a modified version of the dependency parser reported by Dozat et al. (2017), which is based on Kiperwasser et al. (2016). The original model won the CoNLL 2017 shared task (Nivre et al., 2017a; Nivre et al., 2017b) and its subsequent modifications won the CoNLL 2018 shared task (Zeman et al., 2018; Che et al., 2018). Unlike dependency parsing, morphological analysis does not need to find the head of a word. Therefore, we amend the dependency parser by Dozat et al. (2017) and use only the model's head-dependent relation labeling functionality for MSD tagging.
The model's input is an elementwise addition of four embeddings for each input word. We then pass the vector representation of each input word through BiLSTM layers with subsequent multilayer perceptron (MLP) and biaffine attention layers. MSD tagging assigns a tag to each word, while dependency parsing assigns a tag to a relation between a pair of words. In the latter case, even though the tag nominally describes a pair of words, information from those two words alone is not enough: the parser actually has to attend to the whole context to assign the correct label. Therefore, attention over all input words is needed in dependency parsing, and we retain this feature in the MSD tagger too.
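A shape-level sketch of this forward pass is given below. All dimensions are illustrative rather than the paper's, the weights are random, and a plain tanh recurrence stands in for the LSTM cells; the point is only the data flow from summed embeddings through a bidirectional recurrence to per-word tag scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5, 16, 10          # words per sentence, embedding dim, tag count

x = rng.normal(size=(n, d))  # summed word embeddings (four sources, added)

def simple_rnn(seq, W, U):
    # Toy stand-in for one LSTM direction: a plain tanh recurrence.
    h, out = np.zeros(W.shape[1]), []
    for t in seq:
        h = np.tanh(t @ W + h @ U)
        out.append(h)
    return np.stack(out)

W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
fwd = simple_rnn(x, W, U)                      # left-to-right states
bwd = simple_rnn(x[::-1], W, U)[::-1]          # right-to-left states
states = np.concatenate([fwd, bwd], axis=1)    # "BiLSTM" states, (n, 2d)

M = rng.normal(size=(2 * d, k))
logits = states @ M                            # classifier layer over states
tags = logits.argmax(axis=1)                   # one MSD tag index per word
```

In the real model the classifier is the MLP-plus-biaffine labeling head of Dozat et al. (2017), not a single linear layer.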
The optimization is done by the Adam optimizer (Kingma and Ba, 2014). We trained the model until there was no improvement for 5,000 steps. The number of BiLSTM layers was three, and the dimension of each LSTM cell as well as the word vector was 100 (300 when fastText, https://fasttext.cc/, is used). We mainly used pre-trained word embeddings from the CoNLL 2017 shared task (Nivre et al., 2017a; Nivre et al., 2017b), trained with word2vec (Mikolov et al., 2013). For Akkadian, Amharic, and Japanese we used fastText (Bojanowski et al., 2017). Interestingly, for the Afrikaans-AfriBooms treebank, using the pre-trained Dutch word vectors from the CoNLL 2017 shared task demonstrated better performance than the Afrikaans pre-trained word vectors of fastText. Similar results were observed for some other datasets, and therefore we used fastText only for the languages mentioned above. At the same time, using the word vectors of a related language is also in the spirit of cross-lingual transfer learning from a resource-rich to a resource-lean language (Ruder et al., 2017).
For each word, there are four embeddings, which are summed elementwise: pre-trained, trainable, character-level, and lemma. Trainable embeddings are vectors that are initialized randomly and then updated as training proceeds. Likewise, lemma vectors are also initialized randomly. The process of character-level embedding generation is more involved and is based on the character-level word representation of Cao and Rei (2016). A word's characters are passed through unidirectional LSTM cells (Hochreiter and Schmidhuber, 1997), and the resulting states are then summed after a conventional attention layer (Bahdanau et al., 2015). Figure 2 summarizes this process.
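The attention-weighted sum over character states can be sketched as follows. Dimensions are illustrative, the character states are random stand-ins for LSTM outputs, and a dot product against a learned query vector stands in for the attention scoring function (Bahdanau-style attention uses a small feed-forward scorer; the weighted-sum step is the same).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                         # hidden size (illustrative)
h = rng.normal(size=(4, d))   # per-character states for a 4-character word
v = rng.normal(size=d)        # attention query vector (learned in practice)

scores = h @ v                            # one scalar score per character
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over characters
char_embedding = weights @ h              # attention-weighted sum, shape (d,)
```

The resulting vector is the character-level embedding that is added elementwise to the other three embeddings.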

Results
After experiments with different hyperparameter settings, we were able to choose optimal settings, as described earlier. When the lemma was not used as an additional input to the MSD tagger, morphological accuracy decreased for both English and Korean, with a larger decrease for Korean. We conjecture that the larger decrease in Korean is due to its higher morphological complexity compared to English; the lemma itself is more important for finding MSD tags in morphologically rich languages.
In general, as more training data were available, higher scores were obtained in absolute terms. As an example, for Russian, among four available datasets (Russian-GSD, Russian-PUD, Russian-SynTagRus, and Russian-Taiga) Russian-SynTagRus was the largest, and its accuracy was best by all four metrics used.
Some languages have more MSD tags than others, which presents another dimension of task complexity. For instance, the Czech-PDT treebank has 2,895 unique MSD tags while English-EWT has only 179, i.e., 16 times fewer. This partly explains the difference in accuracy of the MSD tagger: Czech-PDT's morphological accuracy is 89.88% while English-EWT's is 95.82%.
While there is a lot of variance in the number of MSD tags among languages, most of the languages have around twenty to sixty characters in their alphabet. Hence, the number of characters in the alphabet does not seem to affect lemmatization. At the same time, Chinese uses distinct characters for each word and does not have word inflections. Despite having 3536 unique characters, Chinese-GSD treebank's lemma accuracy is 99.98%. It also has only 40 MSD tags due to the absence of inflections.
Overall, lemmatization appears to be a slightly easier task than MSD tagging, and in our case, incorporating lemma information in MSD tagging yielded more accurate results for the latter.

Conclusion
Our pipeline model has shown favorable results in SIGMORPHON Shared Task 2 and scored fifth and sixth place, respectively, for lemmatization and MSD tagging. For future work, it would be interesting to assess how incorporating the output of MSD tagging into lemmatization would affect lemma accuracy.