Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning

We present our CHARLES-SAARLAND system for the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology, in task 2, Morphological Analysis and Lemmatization in Context. We leverage the multilingual BERT model and apply several fine-tuning strategies introduced by UDify, which have demonstrated exceptional evaluation performance on morpho-syntactic tasks. Our results show that fine-tuning multilingual BERT on the concatenation of all available treebanks allows the model to learn cross-lingual information that boosts lemmatization and morphology tagging accuracy over fine-tuning it purely monolingually. Unlike UDify, however, we show that when paired with additional character-level and word-level LSTM layers, a second stage of fine-tuning on each treebank individually can improve evaluation even further. Out of all submissions for this shared task, our system achieves the highest average accuracy and F1 score in morphology tagging and places second in average lemmatization accuracy.


Introduction
We focus on track 2 of the SIGMORPHON 2019 Shared Task (McCarthy et al., 2019), which requires systems to predict lemmas and morphosyntactic descriptions (MSDs) of words given sentences of pre-tokenized words. The data relies on treebanks provided by the Universal Dependencies (UD) project (Nivre et al., 2018), where MSDs are converted from UD format to the UniMorph schema (McCarthy et al., 2018; Kirov et al., 2018). Systems must make predictions for the test sentences of 107 separate treebanks, each representing one of 66 different languages.
Recent advances in contextual word representations show that language models pretrained on a large corpus of unlabeled text can transfer their internal knowledge representations to other NLP tasks to boost evaluation scores significantly (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018). We utilize the BERT base multilingual cased model pretrained on raw sentences found in the top 104 most-resourced languages of Wikipedia (Devlin et al., 2018) for all of our experiments. In addition, we use methods introduced by UDify (Kondratyuk, 2019) to further fine-tune and regularize BERT, which has been shown to be especially helpful for morpho-syntactic prediction tasks.
[Figure: Wordpiece Tokenizer example. Input "The best optimizer is grad student descent" is split into "The best op ##timi ##zer is grad student descent".]
Our system defines a simple multi-task multilingual neural architecture for predicting lemmas and MSDs jointly. Our contributions to achieve high lemmatization and morphology tagging performance are as follows:
1. We leverage the pretrained multilingual BERT cased model to encode input sentences and apply additional word-level and character-level LSTM layers before jointly decoding lemmas and morphology tags using simple sequence tagging layers.
2. Instead of only training models for each treebank separately, we use a two-stage training process to incorporate cross-linguistic information present in other treebanks, training multilingually over all treebanks in the first stage and then monolingually using saved multilingual weights in the second stage.
Our results show that applying an intermediate multilingual fine-tuning stage on BERT is superior to just fine-tuning monolingually in nearly all cases. Code for our model is released along with UDify at https://github.com/hyperparticle/udify.

Model Architecture
We describe the architecture of our system as follows. See Figure 1 for an illustration of this description. Our network consists of a shared BERT encoder followed by joint lemma and morphology tag decoders.
Given an input sentence consisting of a sequence of word tokens, we apply BERT's multilingual cased tokenizer to each word, potentially splitting it into multiple subword tokens. We encode this token sequence with the pretrained multilingual cased BERT base model consisting of 12 layers with 12 attention heads per layer and hidden output dimensions of 768. Following this, we keep only the encoding of the first wordpiece of each word, which aligns the BERT encoding with the sequence of input words.
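The first-wordpiece alignment described above can be sketched as follows. This is a minimal illustration, assuming a toy greedy longest-match splitter and a tiny hand-made vocabulary in place of BERT's actual multilingual cased tokenizer; the names wordpiece_split, first_piece_indices and VOCAB are hypothetical, and special tokens such as [CLS] are omitted.

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word (toy stand-in for BERT's tokenizer)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def first_piece_indices(words, vocab):
    """Return all wordpieces plus, for each word, the index of its first wordpiece."""
    pieces, first = [], []
    for word in words:
        first.append(len(pieces))
        pieces.extend(wordpiece_split(word, vocab))
    return pieces, first


# Hypothetical vocabulary covering only the example sentence.
VOCAB = {"The", "best", "op", "##timi", "##zer", "is", "grad", "student", "descent"}
words = "The best optimizer is grad student descent".split()
pieces, first = first_piece_indices(words, VOCAB)
print(pieces)  # ['The', 'best', 'op', '##timi', '##zer', 'is', 'grad', 'student', 'descent']
print(first)   # [0, 1, 2, 5, 6, 7, 8]
```

Selecting the encoder outputs at the positions in first yields exactly one vector per input word, so the tagging layers operate on a sequence of the same length as the sentence.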
Once BERT encoding is complete, we apply two separate instances of the layer attention defined in UDify, which is similar to the layer weighting used in ELMo (Peters et al., 2018), i.e., a trainable weighted sum of all 12 layers of BERT. This has been shown to improve evaluation performance over just computing representations on the last layer. The layer attention instances generate embeddings specific to each task, one for lemmatization and the other for morphology tagging.
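A trainable weighted sum over layers can be sketched in plain Python as below. This is an illustrative sketch only: it assumes softmax-normalized scalar weights and an optional global scale in the style of ELMo's scalar mix, and the names layer_attention, weights and scale are our own; regularization such as layer dropout is not shown.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of scalars."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def layer_attention(layer_outputs, weights, scale=1.0):
    """Mix per-layer vectors for one token into a single vector.

    layer_outputs: list of L vectors (one per encoder layer), all of equal length.
    weights: L scalars, trainable in a real model (assumption: one scalar per layer).
    """
    probs = softmax(weights)
    dim = len(layer_outputs[0])
    return [
        scale * sum(p * layer[d] for p, layer in zip(probs, layer_outputs))
        for d in range(dim)
    ]


# With equal weights the mix reduces to a plain average of the layers.
mixed = layer_attention([[1.0, 2.0], [3.0, 4.0]], weights=[0.0, 0.0])
print(mixed)  # [2.0, 3.0]
```

Using one set of weights per task lets the lemmatizer and the tagger emphasize different layers of the same shared encoder.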
Before decoding, we also apply character-level embeddings (Santos and Zadrozny, 2014; Ling et al., 2015; Kim et al., 2016) to produce an enhanced morphological representation by encoding the sequence of character tokens for each word through a bidirectional LSTM with a residual connection (Schuster and Paliwal, 1997; Kim et al., 2017), keeping the hidden layers fixed to dimensions of 384. We concatenate the final hidden states of both LSTM directions, and then sum these character-level word representations with each of the two encoded representations produced by the task-specific layer attention.
Similar to Kondratyuk et al. (2018) and Straka (2018), both the lemmatizer and morphological tagger employ two successive layers of word-level bidirectional residual LSTMs computed over the entire task layer attention sequence with hidden dimensions of 768, summing both directions together along each output state.
For lemmatization, we precompute edit scripts representing a minimal sequence of character operations to transduce a word form to its lemma counterpart, as in Chrupała (2006) and Straka (2018). As is typical for neural sequence tagging, we apply a feedforward layer to the final layer of the lemmatizer LSTM, representing the activations of classes of all edit scripts found in the training data.
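To make the idea of lemmatization as classification over edit scripts concrete, the sketch below uses a deliberately simplified, suffix-only script of the form (number of trailing characters to strip, suffix to append). This is an assumption made for brevity; the edit scripts in the cited work are more general, and the function names are hypothetical.

```python
def make_edit_script(form, lemma):
    """Derive a suffix-only edit script that turns `form` into `lemma`."""
    shared = 0
    while shared < len(form) and shared < len(lemma) and form[shared] == lemma[shared]:
        shared += 1
    # Strip everything after the common prefix, then append the lemma's remainder.
    return (len(form) - shared, lemma[shared:])


def apply_edit_script(form, script):
    """Apply a (strip, suffix) script to a word form."""
    strip, suffix = script
    if strip > len(form):
        return form  # script not applicable to this form; fall back to the form itself
    return form[: len(form) - strip] + suffix


script = make_edit_script("studies", "study")
print(script)                                # (3, 'y')
print(apply_edit_script("carries", script))  # carry
```

Because the same script applies to many word forms, the set of distinct scripts seen in training is far smaller than the set of lemmas, which is what makes a softmax over script classes feasible.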
Similarly for morphology tagging, we apply a feedforward layer whose units correspond to the vocabulary over all unfactored MSD strings. We apply the method of Inoue et al. (2017) to jointly predict the classes of unfactored and factored morphology tags, i.e., we also predict each dimension of the morphology tag whose subcategories are defined by the UniMorph schema (e.g., case, mood, person, tense, etc.). We only use the factored tags to improve training, and for prediction we use the full unfactored tags.
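The relation between unfactored and factored targets can be illustrated as follows. This sketch assumes UniMorph-style MSD strings with features separated by semicolons; DIMENSION_OF is a tiny hand-picked subset of the schema used only for illustration, and factor_msd is a hypothetical helper.

```python
# Illustrative subset of UniMorph features mapped to their dimensions.
DIMENSION_OF = {
    "N": "POS", "V": "POS",
    "SG": "Number", "PL": "Number",
    "NOM": "Case", "ACC": "Case",
    "PST": "Tense", "PRS": "Tense",
    "1": "Person", "2": "Person", "3": "Person",
}


def factor_msd(msd):
    """Split an unfactored MSD string into one target per morphological dimension."""
    factored = {}
    for feature in msd.split(";"):
        dimension = DIMENSION_OF.get(feature, "Other")  # unknown features share a bucket
        factored[dimension] = feature
    return factored


unfactored = "V;PST;3;SG"       # single class used at prediction time
factored = factor_msd(unfactored)  # auxiliary per-dimension classes used only in training
print(factored)  # {'POS': 'V', 'Tense': 'PST', 'Person': '3', 'Number': 'SG'}
```

Each dimension then gets its own classification target during training, while the full string remains the only output used at test time.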

Experiments
We train our system on the provided treebank training data with three separate configurations.

Configurations
MONO We train the network (as seen in Figure 1) monolingually by simply fine-tuning it on each treebank separately.
MULTI We fine-tune the network as in MONO, except on a dataset consisting of all treebank training data concatenated together, as seen in UDify. All word, character, and tag vocabularies of each language are combined together.
MULTI+MONO We train the network monolingually as in MONO, but using the BERT weights saved from the model fine-tuned according to MULTI. This effectively defines a two-stage training process: the first stage involves multilingual fine-tuning of BERT, and the second stage re-trains the layer attention, LSTMs and feedforward taggers from scratch on each treebank with a reduced monolingual vocabulary (keeping the fine-tuned BERT intact).
For all MONO runs and the second stage of MULTI+MONO, we ensure that we do not combine multiple treebanks of the same language but always fine-tune on just the training data from each provided treebank.
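The weight hand-over between the two stages of MULTI+MONO can be sketched as a merge of two weight dictionaries. The parameter names and the "bert." prefix are hypothetical, the string values stand in for tensors, and the sketch only covers initialization; it makes no assumption about how BERT is treated during the second stage.

```python
def init_second_stage(stage_one_weights, fresh_weights, keep_prefix="bert."):
    """Build second-stage initial weights: reuse the saved encoder, reset the rest."""
    merged = {}
    for name, fresh_value in fresh_weights.items():
        if name.startswith(keep_prefix) and name in stage_one_weights:
            merged[name] = stage_one_weights[name]  # multilingually fine-tuned BERT
        else:
            merged[name] = fresh_value  # layer attention, LSTMs, taggers start fresh
    return merged


# Placeholder values mark where each weight comes from.
stage_one = {"bert.embeddings.weight": "multi", "lemma_tagger.weight": "multi"}
fresh = {"bert.embeddings.weight": "init", "lemma_tagger.weight": "init"}
second_stage = init_second_stage(stage_one, fresh)
print(second_stage)  # {'bert.embeddings.weight': 'multi', 'lemma_tagger.weight': 'init'}
```

Resetting the non-BERT layers is what allows the second stage to use a smaller, treebank-specific vocabulary for characters, edit scripts and tags.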

Hyperparameters
A summary of specific values for each of the hyperparameters discussed can be seen in Table 1.
We train each configuration using a batch size of 32 over 50 epochs. We employ the Adam optimizer, computing the loss as the softmax cross entropy between the predicted tags and the gold labels. We apply discriminative fine-tuning (Howard and Ruder, 2018) by defining four separate parameter groups each with their own base learning rate, decreasing as the layers get closer to the input: the first 6 layers of BERT, the last 6 layers of BERT, the layer attention and LSTM layers, and the final feedforward layers. We apply regularization as defined by UDify, with a few extra modifications. We slightly raise the layer dropout, BERT dropout, and input mask probability to prevent overfitting, especially for the MONO and MULTI+MONO configurations. We also apply dropout to all intermediate word embedding representations between each of the word-level LSTM layers.
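The four parameter groups used for discriminative fine-tuning can be sketched as a mapping from parameter names to groups. The learning rates below are placeholders chosen only to decrease toward the input, not the values from Table 1, and the parameter naming scheme (for example "bert.encoder.layer.N.") as well as the placement of the BERT embeddings in the lowest group are assumptions.

```python
import re

# Placeholder base learning rates (NOT the paper's values), smaller closer to the input.
BASE_LR = {
    "bert_lower": 1e-5,      # first 6 layers of BERT
    "bert_upper": 2e-5,      # last 6 layers of BERT
    "attention_lstm": 1e-3,  # layer attention and LSTM layers
    "feedforward": 4e-3,     # final feedforward layers
}


def group_for(param_name):
    """Assign a parameter to one of the four groups based on its (hypothetical) name."""
    match = re.search(r"bert\.encoder\.layer\.(\d+)\.", param_name)
    if match:
        return "bert_lower" if int(match.group(1)) < 6 else "bert_upper"
    if param_name.startswith("bert."):
        return "bert_lower"  # assumption: embeddings join the lowest group
    if "layer_attention" in param_name or "lstm" in param_name:
        return "attention_lstm"
    return "feedforward"


def learning_rate_for(param_name):
    """Look up the base learning rate of the group a parameter belongs to."""
    return BASE_LR[group_for(param_name)]


print(group_for("bert.encoder.layer.3.output.dense.weight"))   # bert_lower
print(group_for("bert.encoder.layer.11.output.dense.weight"))  # bert_upper
print(group_for("tagger.lstm.weight_ih_l0"))                   # attention_lstm
```

In a framework such as PyTorch, the resulting groups would typically be passed to the optimizer as separate parameter groups, each with its own learning rate.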

Results
We display comparisons between each of the three configurations. We compute lemma accuracy, lemma Levenshtein distance, morphology tag accuracy, and morphology F1 scores for each of the 107 treebanks. A summary of the averages of all scores for each configuration can be found in Table 2. The full results are shown in Tables 3, 4, 5, and 6.
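For reference, the two lemma metrics can be computed as in the sketch below. It assumes token-level exact-match accuracy and character-level Levenshtein distance averaged over tokens; the official evaluation script may differ in details, and the function names are our own.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lemma_scores(predicted, gold):
    """Return (exact-match accuracy, mean Levenshtein distance) over aligned tokens."""
    total = len(gold)
    accuracy = sum(p == g for p, g in zip(predicted, gold)) / total
    distance = sum(levenshtein(p, g) for p, g in zip(predicted, gold)) / total
    return accuracy, distance


print(levenshtein("kitten", "sitting"))                     # 3
print(lemma_scores(["walk", "studi"], ["walk", "study"]))   # (0.5, 0.5)
```

Accuracy treats every wrong lemma equally, whereas the Levenshtein distance gives partial credit to predictions that are only a character or two away from the gold lemma.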

Discussion
Our results show that not only does fine-tuning BERT provide excellent lemmatization and morphology tagging performance, but two-stage MULTI+MONO training can also provide significant improvements for practically every treebank when compared to MONO. While some of these improvements can be attributed to learning from data in multiple treebanks of the same language, we see improvements even for languages possessing just one treebank. This provides evidence that the MULTI and MULTI+MONO models are regularized well by multilingual training. This could be explained by a combination of: multilingual learning providing language-invariant generalizations, out-of-domain data providing noise to reduce overfitting, or warm restarts aiding in improved convergence of model parameters. More experimentation is necessary to quantify these possible contributors.
Unlike the results shown by UDify, we see that the MULTI configuration provides overall inferior predictions on almost every treebank when compared to both MONO and MULTI+MONO. This is likely due to the added LSTM layers and character-level embeddings, which provide additional information that improves monolingual training representations far more than it improves multilingual ones. Our intuition is that the LSTM layers pose an information bottleneck for massively multilingual data, unlike the BERT encoder, whose large capacity has been shown to be able to scale to more than 100 languages. Predictions using a smaller vocabulary subset could provide a much stronger signal to the LSTM layers to incorporate character-level morphology more accurately. We do see, however, that MULTI still learns useful cross-lingual information; it only requires the LSTMs and character embeddings to be reconfigured to the specific treebank at hand to gain the benefits of both types of training.
Note that we specifically do not perform any extensive hyperparameter search or use ensembling. As such, we predict that our evaluation results could still be raised much higher.

Conclusion
We have presented our system, which fine-tunes a multi-task enhanced BERT model for lemmatization and morphology tagging using a two-stage multilingual training scheme. We show that while pretrained BERT does provide word representations capable of surpassing the baseline, we are able to improve this significantly by also incorporating multilingual pretraining on all available treebanks, allowing the model to regularize and likely incorporate cross-lingual information useful for morphological parsing. We leave a more detailed analysis as to what extent multilingual fine-tuning and BERT pretraining contribute to model performance for future work.