UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging

We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and an overall winner of The 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features as regularization; and finally, we merge selected corpora of the same language. In the lemmatization task, our system exceeds all the other submitted systems by a wide margin, with lemmatization accuracy 95.78 (second best was 95.00, third 94.46). In morphological analysis, our system placed a close second: our morphological analysis accuracy was 93.19, the winning system's 93.23.


Introduction
This work describes our participating system in the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology. We contributed a system in Task 2: contextual morphological analysis and lemmatization.
Given a segmented and tokenized text in the CoNLL-U format with surface forms (column 2), for example the sentence "They buy and sell books.", the task is to infer lemmas (column 3) and the morphological analysis (column 6) in the form of concatenated morphological features. The SIGMORPHON 2019 data consists of 66 distinct languages in 107 corpora (McCarthy et al., 2018).
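To make the column layout concrete, a simplified fragment for this sentence might look as follows. Only the token ID, the form (column 2), the lemma (column 3), and the features (column 6) are shown; the feature strings are illustrative guesses in the spirit of the UniMorph schema, not taken from the shared task data:

```
# sent_id = 1
# text = They buy and sell books.
1   They    they    PRO;NOM;PL
2   buy     buy     V;PRS
3   and     and     CONJ
4   sell    sell    V;PRS
5   books   book    N;PL
6   .       .       PUNCT
```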
We submitted a modified UDPipe 2.0 (Straka, 2018), one of the three winning systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018) and an overall winner of The 2018 Shared Task on Extrinsic Parser Evaluation (Fares et al., 2018).
Our improvements to UDPipe 2.0 are threefold:
• We use pretrained contextualized embeddings (BERT) as additional inputs to the network (described in Section 3.3).
• Apart from predicting the whole POS tag, we regularize the model by also predicting individual morphological features (Section 3.4).
• For some languages, we merge all the corpora of the same language (Section 3.5).
Our system placed first in lemmatization and closely second in morphological analysis.
We give an overview of related work in Section 2, describe our methodology in Section 3, present the results with ablation experiments in Section 4, and conclude in Section 5.

Related Work
A new type of deep contextualized word representation was introduced by Peters et al. (2018).
The proposed embeddings, called ELMo, were obtained from the internal states of a deep bidirectional language model pretrained on a large text corpus. The idea of ELMo was extended by Devlin et al. (2018), who, instead of a bidirectional recurrent language model, employ a Transformer (Vaswani et al., 2017) architecture.
The Universal Dependencies project (Nivre et al., 2016) seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for many languages. In the 2017 and 2018 CoNLL Shared Tasks on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2017, 2018), the goal was to process raw texts into tokenized sentences with POS tags, lemmas, morphological features and dependency trees of Universal Dependencies. Straka (2018) was one of the winning systems of the 2018 shared task, performing POS tagging, lemmatization and dependency parsing jointly. Another winning system, that of Che et al. (2018), employed manually trained ELMo-like contextual word embeddings and ensembling, reporting a 7.9% error reduction in LAS parsing performance.
The Universal Morphology (UniMorph) project similarly seeks to provide an annotation schema for the morphosyntactic details of language (Sylak-Glassman, 2016). Each POS tag consists of a set of morphological features, each belonging to a morphological category (also called a dimension of meaning).

Architecture Overview
Our baseline is UDPipe 2.0 (Straka, 2018). The original UDPipe 2.0 is available at http://github.com/CoNLL-UD-2018/UDPipe-Future. Here, we describe the overall architecture, focusing on the modifications made for SIGMORPHON 2019. The resulting model is presented in Figure 1.
In short, UDPipe 2.0 is a multi-task model predicting POS tags, lemmas and dependency trees. For SIGMORPHON 2019, we naturally train and predict only the POS tags (morphosyntactic features) and lemmas. After embedding the input words, three shared bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layers are performed. Then, softmax classifiers process the output and generate the lemmas and POS tags (morphosyntactic features). The lemmas are generated by classifying into a set of edit scripts, which process the input word form and produce the lemma by performing character-level edits on the word prefix and suffix. The lemma classifier additionally takes the character-level word embeddings as input. The lemmatization is further described in Section 3.2.
The input word embeddings are the same as in UDPipe 2.0 (Straka, 2018):
• end-to-end word embeddings,
• word embeddings (WE): We use FastText word embeddings (Bojanowski et al., 2017) of dimension 300, which we pretrain for each language on plain texts provided by the CoNLL 2017 UD Shared Task, using segmentation and tokenization trained from the UD data.1 For languages not present in the CoNLL 2017 UD Shared Task, we use the pretrained embeddings of Grave et al. (2018), if available.
• character-level word embeddings (CLE): We employ bidirectional GRUs (Cho et al., 2014; Graves and Schmidhuber, 2005) of dimension 256 in line with Ling et al. (2015): we represent every Unicode character with a vector of dimension 256, and concatenate the GRU outputs for forward and reversed word characters. The character-level word embeddings are trained together with the UDPipe network.
We refer the readers to Straka (2018) for a detailed description of the architecture and the training procedure.

Table 1: Eleven most frequent lemma rules in the English EWT corpus, ordered from the most frequent one.
The main modifications to UDPipe 2.0 are the following:
• contextualized embeddings (BERT): We add pretrained contextual word embeddings as another input to the neural network. We describe this modification in Section 3.3.
• regularization with individual morphological features: We predict not only the full POS tag, but regularize the model by also predicting individual morphological features, which is described in Section 3.4.
• corpora merging: In some cases, we merge the corpora of the same language. We describe this step in Section 3.5.
Furthermore, we also employ model ensembling, which we describe in Section 3.6.

Lemmatization
The lemmatization is modeled as a multi-class classification, in which the classes are complete rules leading from the input form to the lemma. We call each such class, encoding a transition from the input form to the lemma, a lemma rule. We create a lemma rule by first encoding the correct casing as a casing script and then creating a sequence of character edits, an edit script.
First, we deal with the casing by creating a casing script. By default, word form and lemma characters are treated as lowercased. However, if the lemma contains upper-cased characters, a rule is added to the casing script to uppercase the corresponding characters in the resulting lemma. For example, the most frequent casing script is "keep the lemma lowercased (don't do anything)" and the second most frequent casing script is "uppercase the first character and keep the rest lowercased".
As a second step, an edit script is created to convert the lowercased form to the lowercased lemma. To ensure meaningful editing, the form is split into three parts, which are then processed separately: a prefix, a root (stem) and a suffix. The root is discovered by matching the longest substring shared between the form and the lemma; if no shared substring is found (e.g., form went and lemma go), we consider the word irregular, do not process it with any edits and directly replace the word form with the lemma. Otherwise, we proceed with the edit scripts, which process the prefix and the suffix separately and keep the root unchanged. The allowed character-wise operations are character copy, addition and deletion.
The resulting lemma rule is a concatenation of a casing script and an edit script. The most common lemma rules in English EWT corpus are presented in Table 1, and the number of lemma rules for every language is displayed in Tables 5 and 6.
Using the generated lemma rules, the task of lemmatization is then reduced to a multiclass classification task, in which the artificial neural network predicts the correct lemma rule.
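The construction of a lemma rule can be sketched as follows. This is a simplified illustration of the casing script plus edit script idea, not the actual UDPipe 2.0 code; the rule's string encoding is our own invention for the example:

```python
def lemma_rule(form: str, lemma: str) -> str:
    """Encode the form-to-lemma transformation as a single rule string (sketch)."""
    # Casing script: record the lemma positions that must be upper-cased.
    casing = ",".join(str(i) for i, c in enumerate(lemma) if c.isupper()) or "keep"
    form_l, lemma_l = form.lower(), lemma.lower()
    # Root discovery: longest substring shared between form and lemma.
    best = (0, 0, 0)  # (length, start in form, start in lemma)
    for i in range(len(form_l)):
        for j in range(len(lemma_l)):
            k = 0
            while (i + k < len(form_l) and j + k < len(lemma_l)
                   and form_l[i + k] == lemma_l[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    if best[0] == 0:
        # Irregular word (e.g., went -> go): replace the whole form.
        return f"{casing}|irregular:{lemma_l}"
    k, i, j = best
    # Edit script: rewrite the prefix and suffix, keep the root unchanged.
    prefix = f"{form_l[:i]}>{lemma_l[:j]}"
    suffix = f"{form_l[i + k:]}>{lemma_l[j + k:]}"
    return f"{casing}|{prefix}|{suffix}"
```

For instance, `lemma_rule("books", "book")` produces a rule that deletes the suffix "s", and the same rule is shared by every regular plural noun, which is what makes the classification over lemma rules tractable.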

Contextual Word Embeddings (BERT)
We add pretrained contextual word embeddings as another input to the neural network. We use the pretrained contextual word embeddings called BERT (Devlin et al., 2018). 2 For English, we use the native English model (BERT-Base English); for Chinese, we use the native Chinese model (BERT-Base Chinese); and for all other languages, we use the Multilingual model (BERT-Base Multilingual Uncased). All models provide contextualized embeddings of dimension 768.
We average the last four layers of the BERT model to produce the embeddings. Because BERT utilizes word pieces, we decompose words into appropriate subwords and then average the generated embeddings over subwords belonging to the same word.
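The layer averaging and subword pooling described above can be sketched as follows. The function name and the plain-list representation are our own for illustration; in practice the layer states would come from a BERT implementation:

```python
def pool_bert_embeddings(layer_states, subword_to_word):
    """Pool per-subword BERT states into one embedding per word (sketch).

    layer_states: list of layers, each a list of per-subword embedding vectors.
    subword_to_word: for each subword, the index of the word it belongs to.
    """
    # 1) Average the last four transformer layers for every subword.
    last_four = layer_states[-4:]
    num_subwords = len(last_four[0])
    dim = len(last_four[0][0])
    subword_emb = [
        [sum(layer[s][d] for layer in last_four) / len(last_four) for d in range(dim)]
        for s in range(num_subwords)
    ]
    # 2) Average the subword embeddings belonging to the same word.
    num_words = max(subword_to_word) + 1
    sums = [[0.0] * dim for _ in range(num_words)]
    counts = [0] * num_words
    for s, w in enumerate(subword_to_word):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += subword_emb[s][d]
    return [[v / counts[w] for v in row] for w, row in enumerate(sums)]
```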
Contrary to the fine-tuning approach used by the BERT authors (Devlin et al., 2018), we never fine-tune the embeddings.

Regularization with Individual Morphological Features
Our model predicts the POS tags as a unit, i.e., the whole set of morphological features at once. There are other possible alternatives: for example, we could predict the morphological features individually. However, such a prediction needs to decide which morphological categories to use and requires a classifier capable of handling dependencies between the predicted features, and all our attempts to design such a classifier resulted in systems with suboptimal performance. Using a whole-set classifier alleviates the need for finding the correct set of categories for a word and for handling the feature dependencies, but suffers from the curse of dimensionality, especially on smaller corpora with richer morphology.

Nevertheless, the performance of a whole-set classifier can be improved by regularizing with individual morphological feature prediction. Similarly to Kondratyuk et al. (2018), our model predicts not only the full set of morphological features at once, but also the individual features. Specifically, we employ as many additional softmax output layers as there are morphological categories used in the corpus, each predicting the corresponding feature or a special value None. The averaged cross-entropy loss of all predicted categories, multiplied by a weight w, is added to the training loss. The predicted individual features are not used in any way during inference and act only as model regularization.
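The resulting training loss can be sketched as the whole-tag cross-entropy plus w times the averaged per-category cross-entropies. The function names and list representation are hypothetical, used only to make the loss explicit; the actual system computes this inside a TensorFlow graph:

```python
import math

def cross_entropy(logits, gold):
    """Softmax cross-entropy of a single prediction (numerically stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[gold]

def tagger_loss(tag_logits, tag_gold, feature_logits, feature_golds, w=1.0):
    """Whole-tag loss plus the averaged per-category feature losses times w.

    feature_logits / feature_golds hold one entry per morphological category
    of the corpus; a gold index may correspond to the special value None.
    """
    main = cross_entropy(tag_logits, tag_gold)
    reg = sum(cross_entropy(l, g) for l, g in zip(feature_logits, feature_golds))
    reg /= len(feature_logits)
    return main + w * reg
```

Setting w = 0 recovers the plain whole-set classifier, since the feature heads then contribute nothing to the loss.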
The number of full POS tags (complete sets of morphological features), individual morphological features and number of used morphological categories for every corpus is provided in Tables 5 and 6.

Corpora Merging
Given that the shared task data consists of multiple corpora for some of the languages, it is natural to concatenate all corpora of the same language and use the resulting so-called merged model for prediction on the individual corpora.
In theory, concatenating all corpora of the same language should always be beneficial, considering the universal annotation scheme. Nonetheless, the merged model exhibits worse performance in many cases compared to a specialized model trained only on the corpus training data, presumably because of systematically different annotations. We therefore improve the merged model's performance during inference by allowing only those lemma rules and morphological feature sets that are present in the training data of the predicted corpus.
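This inference-time restriction amounts to an argmax over the allowed classes only. A minimal sketch, assuming the classifier scores are available as a plain list and the allowed lemma rules or feature sets as a set of class indices:

```python
def constrained_prediction(scores, allowed_classes):
    """Pick the best-scoring class among those observed in the training data
    of the predicted corpus; all other classes of the merged model are
    excluded from the prediction.
    """
    return max(allowed_classes, key=lambda c: scores[c])
```

For example, even if class 0 scores highest overall, it is never predicted when it only occurs in the other corpora of the merged training data.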

Model Ensembling
For every corpus, we consider three model configurations -the regular model with BERT embeddings trained only on the corpus data, the merged model with BERT embeddings trained on all corpora of the corresponding language, and the no-BERT model trained only on the corpus data.
To allow automatic model selection and to obtain the highest performance, we use ensembling. Namely, we train three models for every model configuration, obtaining nine models for every language. Then, we choose the model subset whose ensemble achieves the highest performance on the development data. The chosen subsets formed the competition entry of our system. However, post-competition examination using half of the development data for ensemble selection and the other half for evaluation revealed that this model selection can overfit, sometimes choosing one or two models whose high performance is caused by noise rather than by high-quality generalization. Therefore, we also consider another model selection method: we ensemble the three models of every configuration and choose the best of the three resulting ensembles on the development data. This second system has been submitted as a post-competition entry.

When not specified otherwise, all models utilize pretrained word embeddings, BERT, and feature regularization with weight w = 1.
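The subset selection used by the competition entry can be sketched as an exhaustive search over model subsets, scoring each by the accuracy of its averaged distributions on the development data. The function name and data layout are our own; this also illustrates why the selection can overfit, since with few development examples a noisy single model may win:

```python
import itertools

def best_dev_subset(model_probs, gold):
    """Choose the model subset whose averaged class distributions are the most
    accurate on the development data.

    model_probs: per model, a list of per-example class distributions.
    gold: the gold class index of every development example.
    """
    def accuracy(subset):
        correct = 0
        for i, g in enumerate(gold):
            # Average the distributions of the chosen models for example i.
            avg = [sum(model_probs[m][i][c] for m in subset) / len(subset)
                   for c in range(len(model_probs[0][i]))]
            correct += avg.index(max(avg)) == g
        return correct / len(gold)

    models = range(len(model_probs))
    candidates = [s for r in range(1, len(model_probs) + 1)
                  for s in itertools.combinations(models, r)]
    return max(candidates, key=accuracy)
```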

SIGMORPHON 2019 Test Results
Our participating system placed as one of the winning systems of the shared task. In the lemmatization task, our system exceeds all the other submitted systems by a wide margin, with lemmatization accuracy 95.78 (second best was 95.00, third 94.46). In morphological analysis, our system placed a close second: our morphological analysis accuracy was 93.19, the winning system's 93.23.

Ablation Experiments
The effect of pretrained word embeddings, BERT contextualized embeddings and regularization with morphological features is evaluated in Table 3. Even the baseline model without any of the mentioned enhancements achieves relatively high performance and would place third in both lemmatization and tagging accuracy (when not considering our competition entry).
Pretrained word embeddings improve the performance of both the lemmatizer and the tagger by a substantial margin. For comparison with the embeddings we trained on the CoNLL 2017 UD Shared Task plain texts, we also evaluate the embeddings provided by Grave et al. (2018), which achieve only slightly lower performance than our embeddings; we presume the difference is caused mostly by different tokenization, given that the training data comes from Wikipedia and CommonCrawl in both cases.
BERT contextualized embeddings further considerably improve POS tagging performance, and have only a minor impact on lemmatization.
When used in isolation, the regularization with morphological categories provides quite considerable gain for both lemmatization and tagging, nearly comparable to the effect of adding precomputed word embeddings. Combining all the enhancements together then produces a model with the highest performance.

Model Combinations
For every corpus, we consider three model configurations: a regular model, a model trained on the merged corpora of the corresponding language, and a model without BERT embeddings (which we consider because, even though BERT embeddings can be computed for any language, the results might be misleading if the language was not present in the BERT training data). For every model configuration, we train three models using different random initializations.
The test set results of choosing the best model configuration on the development set are provided in Table 4. Employing the merged model in addition to the regular model increases the performance slightly, and the introduction of the no-BERT model results in minimal gains. Finally, ensembling the models of the same configuration provides the highest performance.
As discussed in Section 3.6, our competition entry selected as the ensemble an arbitrary subset of all nine models that achieved the best performance on the development data. This choice resulted in overfitting on POS tag prediction, with results worse than no ensembling at all.

Detailed Results
Tables 5 and 6 present detailed results of our best system from Table 4. Note that while this system is not our competition entry, it utilizes the same models as the competition entry, only combined in a different way. Furthermore, because one model configuration was chosen for every language, we can examine which configuration performed best, and quantify the exact effect of corpora merging and BERT embeddings.

Conclusions
We described our system which participated in the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization, which placed first in lemmatization and closely second in morphological analysis. The contributed architecture is a modified UDPipe 2.0 with three improvements: addition of pretrained contextualized BERT embeddings, regularization with morphological categories and corpora merging in some languages. We described these improvements and published the related ablation experiment results.

Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16 013/0001781) and it has been supported by, and has been using language resources developed by, the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Table 5: For every corpus, its size, the number of unique lemma rules, the number of unique POS tags, and the number of morphological features and morphological categories are presented. Then the test set results follow: lemma accuracy, lemma Levenshtein distance, morphological analysis accuracy and morphological F1, using the model achieving the best score on the development set. We consider the regular model (R), the model trained on the merged corpus (M), and the model without BERT embeddings (N). Finally, we show the improvement of the merged model over the regular model, the improvement of the regular model over the no-BERT model, and indicate whether the language is present in the BERT training data (BT).