Towards JointUD: Part-of-speech Tagging and Lemmatization using Recurrent Neural Networks

This paper describes our submission to the CoNLL 2018 UD Shared Task. We have extended an LSTM-based neural network designed for sequence tagging to additionally generate character-level sequences. The network was jointly trained to produce lemmas, part-of-speech tags and morphological features. Sentence segmentation, tokenization and dependency parsing were handled by the UDPipe 1.2 baseline. The results demonstrate the viability of the proposed multitask architecture, although its performance still remains far from state-of-the-art.


Introduction
The Universal Dependencies project (Nivre et al., 2016) aims to collect consistently annotated treebanks for many languages. Its current version (2.2) (Nivre et al., 2018) includes publicly available treebanks for 71 languages in CoNLL-U format. The treebanks contain lemmas, part-of-speech tags, morphological features and dependency relations for every word.
Neural networks have been successfully applied to most of these tasks and produced state-of-the-art results for part-of-speech tagging and dependency parsing. Part-of-speech tagging is usually defined as a sequence tagging problem and is solved with recurrent or convolutional neural networks using word-level softmax outputs or conditional random fields (Lample et al., 2016; Strubell et al., 2017; Chiu and Nichols, 2016). Reimers and Gurevych (2017) have studied these architectures in depth and demonstrated the effect of network hyperparameters and even random seeds on the performance of the networks.
Neural networks have been applied to dependency parsing since 2014 (Chen and Manning, 2014). The state-of-the-art in dependency parsing is a network with a deep biaffine attention module, which won the CoNLL 2017 UD Shared Task (Dozat et al., 2017).
Nguyen et al. (2017) used a neural network to jointly learn POS tagging and dependency parsing. To the best of our knowledge, lemma generation and POS tagging have never been trained jointly using a single multitask architecture.
This paper describes our submission to CoNLL 2018 UD Shared Task. We have designed a neural network that jointly learns to predict part-of-speech tags, morphological features and lemmas for the given sequence of words. This is the first step towards JointUD, a multitask neural network that will learn to output all labels included in UD treebanks given a tokenized text. Our system used UDPipe 1.2 (Straka et al., 2016) for sentence segmentation, tokenization and dependency parsing.
Our main contribution is the extension of a sequence tagging network by Reimers and Gurevych (2017) to support character-level sequence outputs for lemma generation. The proposed architecture was validated on nine UD v2.2 treebanks. The results are generally not better than the UDPipe baseline, but we did not extensively tune the network to squeeze the most out of it. Hyperparameter search and improved network design are left for future work.

System Architecture
Our system used in CoNLL 2018 UD Shared Task consists of two parts. First, it takes the raw input and produces a CoNLL-U file using UDPipe 1.2. Then, if the corresponding neural model exists, the columns corresponding to lemma, part-of-speech and morphological features are replaced by the predictions of the neural model. Note that UDPipe 1.2 did not use the POS tags and lemmas produced by our neural model. We did not train neural models for all treebanks, so most of our submissions are just the output of UDPipe.
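The column replacement step can be sketched as follows. This is a minimal illustration, not our actual implementation; `replace_columns` and the `predict` interface are hypothetical names.

```python
def replace_columns(conllu_text, predict):
    """Rewrite LEMMA, UPOS and FEATS columns of a CoNLL-U file with model
    predictions, keeping UDPipe's segmentation, tokenization and parse.
    predict(words) -> one (lemma, upos, feats) tuple per word."""
    out_sentences = []
    for sent in conllu_text.strip().split("\n\n"):   # sentences are blank-line separated
        lines = sent.split("\n")
        # token lines have a plain integer ID; skip comments and multiword ranges like "1-2"
        def is_token(l):
            return bool(l) and not l.startswith("#") and l.split("\t")[0].isdigit()
        forms = [l.split("\t")[1] for l in lines if is_token(l)]   # FORM column
        preds = iter(predict(forms))
        new_lines = []
        for l in lines:
            if is_token(l):
                cols = l.split("\t")
                cols[2], cols[3], cols[5] = next(preds)   # LEMMA, UPOS, FEATS
                l = "\t".join(cols)
            new_lines.append(l)
        out_sentences.append("\n".join(new_lines))
    return "\n\n".join(out_sentences) + "\n"
```

The other seven columns (including HEAD and DEPREL produced by UDPipe) pass through untouched, matching the pipeline described above.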
The codename of our system in the Shared Task was ArmParser. The code is available on GitHub.

Neural model
In this section we describe the neural architecture that takes a sequence of words and outputs lemmas, part-of-speech tags, and 21 morphological features. POS tag and morphological feature prediction is done using a sequence tagging network from (Reimers and Gurevych, 2017). To generate lemmas, we extend the network with multiple decoders similar to the ones used in sequence-to-sequence architectures.
Suppose the sentence is given as a sequence of words $w_1, \dots, w_n$. Each word $w_i$ consists of characters $c_i^1 \dots c_i^{n_i}$. For each $w_i$, we are given its lemma as a sequence of characters $l_i = l_i^1 \dots l_i^{m_i}$, its POS tag $p_i \in P$, and 21 features $f_i^1 \in F^1, \dots, f_i^{21} \in F^{21}$. The sets $P, F^1, \dots, F^{21}$ contain the possible values for POS tags and morphological features and are language-dependent: the sets are constructed based on the training data of each language. Table 1 shows the possible values of POS tags and morphological features for the English-EWT treebank.
The network consists of three parts: embedding layers, feature extraction layers and output layers.

Embedding layers
By $\mathrm{Emb}_d(a)$ we denote a $d$-dimensional embedding of the integer $a$. Usually, $a$ is an index of a word in a dictionary or an index of a character in an alphabet.
Each word $w_i$ is represented by a concatenation of three vectors: $e(w_i) = (e_{\mathrm{word}}(w_i), e_{\mathrm{casing}}(w_i), e_{\mathrm{char}}(w_i))$.
The first vector, $e_{\mathrm{word}}(w_i)$, is a 300-dimensional pretrained word vector. In our experiments we used FastText vectors (Bojanowski et al., 2017) released by Facebook. The second vector, $e_{\mathrm{casing}}(w_i)$, is a one-hot representation of eight casing features, described in Table 2.
The third vector, $e_{\mathrm{char}}(w_i)$, is a character-level representation of the word. We map each character $c_i^j$ to a randomly initialized 30-dimensional vector $\mathrm{Emb}_{30}(c_i^j)$ and apply a bi-directional LSTM on these embeddings. $e_{\mathrm{char}}(w_i)$ is the concatenation of the 25-dimensional final states of the two LSTMs.
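The concatenation above can be sketched as follows, using stand-in random vectors in place of the real FastText and BiLSTM outputs; the casing rule is a toy assumption, not the actual Table 2 features.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_word(word):
    e_word = rng.standard_normal(300)      # stand-in for the pretrained FastText vector
    e_casing = np.zeros(8)                 # one-hot over the eight casing features (Table 2)
    e_casing[0 if word.islower() else 1] = 1.0   # toy casing rule, not the paper's
    e_char = rng.standard_normal(50)       # stand-in for the two 25-dim BiLSTM final states
    return np.concatenate([e_word, e_casing, e_char])

e = embed_word("Cats")                     # 300 + 8 + 50 = 358 dimensions
```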
The resulting e(w i ) is a 358-dimensional vector.

Feature extraction layers
We denote a recurrent layer with inputs $x_1, \dots, x_n$ and hidden states $h_1, \dots, h_n$ by $h_1, \dots, h_n = \mathrm{RNN}(x_1, \dots, x_n)$. We use two types of recurrent cells: LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014).
We apply three layers of LSTM with 150-dimensional hidden states on the embedding vectors: $h_1^k, \dots, h_n^k = \mathrm{LSTM}^k(h_1^{k-1}, \dots, h_n^{k-1})$ for $k = 1, 2, 3$, where $h_i^0 = e(w_i)$. We also apply 50% dropout before each LSTM layer.
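The stacked feature extractor can be sketched in numpy as follows; this is a simplified forward pass with toy random weights (the real model is a Keras BiLSTM stack with trained parameters), shown only to make the layer shapes concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(xs, params):
    """One unidirectional LSTM layer over a sequence of input vectors."""
    W, U, b = params                        # the four gates stacked: i, f, g, o
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return states

def init(in_dim, hidden=150, seed=0):
    rng = np.random.default_rng(seed)
    return (0.1 * rng.standard_normal((4 * hidden, in_dim)),
            0.1 * rng.standard_normal((4 * hidden, hidden)),
            np.zeros(4 * hidden))

# h^0_i = e(w_i); h^k = LSTM^k(h^{k-1}) for k = 1, 2, 3
states = [np.zeros(358) for _ in range(5)]          # toy 5-word sentence
for in_dim in (358, 150, 150):
    # during training, 50% dropout would be applied to `states` here
    states = lstm_layer(states, init(in_dim))
```

After the third layer, `states[i]` plays the role of $h_i^3$, the 150-dimensional contextual word representation used by all output heads.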
The obtained 150-dimensional vectors represent the words with their contexts, and are expected to contain necessary information about the lemma, POS tag and morphological features.

POS tags and features
Part-of-speech tagging and morphological feature prediction are word-level classification tasks.For each of these tasks we apply a linear layer with softmax activation.
For each word we compute $p_i = \mathrm{softmax}(W_p h_i^3 + b_p)$ and $f_i^k = \mathrm{softmax}(W_{f^k} h_i^3 + b_{f^k})$ for $k = 1, \dots, 21$. The dimensions of the matrices $W_p, W_{f^k}$ and vectors $b_p, b_{f^k}$ depend on the training set for the given language: $W_p \in \mathbb{R}^{|P| \times 150}$, $W_{f^k} \in \mathbb{R}^{|F^k| \times 150}$, $b_p \in \mathbb{R}^{|P|}$, $b_{f^k} \in \mathbb{R}^{|F^k|}$. So we end up with 22 cross-entropy loss functions: $L_p$ for POS tags and $L_{f^1}, \dots, L_{f^{21}}$ for the morphological features.
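The 22 classification heads can be sketched as follows; the output sizes and the gold class are illustrative assumptions, since $|P|$ and $|F^k|$ depend on the treebank.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.standard_normal(150)                    # h_i^3 for one word

# one head for POS tags plus 21 heads for morphological features;
# the output sizes |P|, |F^1|, ..., |F^21| come from the training data
sizes = [17] + [4] * 21                         # illustrative sizes only
heads = [(0.1 * rng.standard_normal((m, 150)), np.zeros(m)) for m in sizes]

probs = [softmax(W @ h + b) for W, b in heads]  # 22 softmax distributions
losses = [-np.log(p[0]) for p in probs]         # cross-entropy vs. a toy gold class 0
```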

Lemma generation
This subsection describes our main contribution.
In order to generate the lemmas of all words, we add one GRU-based decoder per word. These decoders share their weights and work in parallel.
The $i$-th decoder outputs $l_i^1, \dots, l_i^{m_i}$, the predicted characters of the lemma of the $i$-th word. We denote the inputs to the $i$-th decoder by $x_i^1, \dots, x_i^{m_i}$. Each $x_i^j$ is a concatenation of four vectors:

1. $h_i^3$ is the representation of the $i$-th word after the feature extractor LSTMs. This is the only part of the $x_i^j$ vector that does not depend on $j$. This trick is important to make sure that word-level information is always available in the decoder.
2. $\mathrm{Emb}_{30}(c_i^j)$ is the same embedding of the $j$-th character of the word used in the character-level BiLSTM described in Section 3.1.

3. $\pi_i^j$ is some form of positional encoding. It indicates the number of characters remaining till the end of the input word: $\pi_i^j = \mathrm{Emb}_5(n_i - j + 1)$. Positional encodings were introduced in (Sukhbaatar et al., 2015) and were successfully applied in neural machine translation (Gehring et al., 2017; Vaswani et al., 2017).
4. The indicator of the previous character of the lemma. During training it is the one-hot vector of the ground-truth character $l_i^{j-1}$; during inference it is the output of the GRU at the previous timestep.

These inputs are passed to a single layer of a GRU network. The output of the decoder is formed by applying another dense layer on the GRU state: $l_i^j = \mathrm{softmax}(W_l s_i^j + b_l)$, where $W_l \in \mathbb{R}^{|C| \times 150}$ and $|C|$ is the number of characters in the alphabet. The initial state of the GRU is the output of the feature extractor LSTM: $s_i^0 = h_i^3$. All GRUs share the weights.
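The four-way concatenation that forms each decoder input can be sketched as follows; the alphabet size and the stand-in random vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_input(h3, char_emb, pos_emb, prev_char_onehot):
    """x_i^j = [h_i^3 ; Emb30(c_i^j) ; pi_i^j ; previous-lemma-character indicator]."""
    return np.concatenate([h3, char_emb, pos_emb, prev_char_onehot])

alphabet_size = 80                            # |C|, depends on the treebank
x = decoder_input(rng.standard_normal(150),   # word representation, same for every j
                  rng.standard_normal(30),    # character embedding Emb30(c_i^j)
                  rng.standard_normal(5),     # positional embedding Emb5(n_i - j + 1)
                  np.eye(alphabet_size)[3])   # one-hot of the previous lemma character
```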
The loss function for the lemma output is the sum of character-level cross-entropy losses over all words: $L_l = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \mathrm{CE}(l_i^j, \hat{l}_i^j)$, where $\hat{l}_i^j$ denotes the ground-truth character.

Multitask loss function
The combined loss function is a weighted average of the loss functions described above:

$L = \lambda_p L_p + \lambda_l L_l + \sum_{k=1}^{21} \lambda_{f^k} L_{f^k}$ (1)

The final version of our system used $\lambda_p = 0.2$ and $\lambda_l = \lambda_{f^k} = 1$ for every $k$.
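The combination in (1) is straightforward to express in code; the loss values below are toy numbers, only the weights match the paper.

```python
def joint_loss(L_p, L_l, L_f, lambda_p=0.2, lambda_l=1.0, lambda_f=None):
    """Weighted combination of the POS, lemma and 21 feature loss terms."""
    lambda_f = lambda_f if lambda_f is not None else [1.0] * len(L_f)
    return lambda_p * L_p + lambda_l * L_l + sum(w * l for w, l in zip(lambda_f, L_f))

total = joint_loss(L_p=2.0, L_l=1.0, L_f=[0.5] * 21)   # 0.2*2.0 + 1.0 + 10.5 = 11.9
```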

Experiments
We have implemented the architecture defined in the previous section using the Keras framework. Our implementation is based on the codebase of (Reimers and Gurevych, 2017). The new part of the architecture (lemma generation) is quite slow: enabling it slows down training by more than a factor of three. We have left speed improvements for future work.
To train the model we used the RMSProp optimizer with early stopping. The initial learning rate was 0.001, and it was decreased to 0.0005 from the seventh epoch onwards. Training was stopped when the loss on the development set did not improve for five consecutive epochs.
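The schedule and stopping rule can be sketched as plain Python; this mirrors the description above (the real implementation uses Keras callbacks), and assumes 1-indexed epochs.

```python
def learning_rate(epoch):
    """LR schedule from the paper: 0.001, then 0.0005 from the seventh epoch on."""
    return 0.001 if epoch < 7 else 0.0005

def train(dev_losses, patience=5):
    """Early stopping: stop once the dev loss has not improved for `patience`
    consecutive epochs. Returns (epochs run, best dev loss)."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(dev_losses, start=1):
        lr = learning_rate(epoch)   # would be fed to RMSProp in the real loop
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return epoch, best
```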
Due to time constraints, we have trained our neural architecture on just nine treebanks.These include three English and two French treebanks.
Our system was evaluated on Ubuntu virtual machines on the TIRA platform (Potthast et al., 2014) and on our local machines using the test sets available in the UD GitHub repository (Zeman et al., 2018a).
The version we ran on TIRA had a bug in the preprocessing pipeline that doubled newline symbols in the input text. Raw texts in UD v2.2 occasionally contain newline symbols inside sentences. These symbols were duplicated due to the bug, and the sentence segmentation part of UDPipe treated them as two different sentences. The evaluation scripts used in CoNLL 2018 UD Shared Task naturally penalized these errors. After the Shared Task deadline, we ran the same models (without retraining) on the test sets on our local machines without the newline issue.
Additionally, we locally trained models for two more non-Indo-European treebanks: Arabic PADT and Korean GSD.

Results
Table 3 shows the main metrics of CoNLL 2018 UD Shared Task on the nine treebanks we used for training our models. For each metric we report five scores: two obtained on our local machine (our model and UDPipe 1.2) and three from the official leaderboard (our model, the UDPipe baseline, and the best score for that particular treebank). The LAS metric evaluates sentence segmentation, tokenization and dependency parsing, so the numbers for our models should be identical to UDPipe 1.2. The MLAS metric additionally takes into account POS tags and morphological features, but not lemmas. The BLEX metric evaluates dependency parsing and lemmatization. The full description of these metrics is available in (Zeman et al., 2018b) and on the CoNLL 2018 UD Shared Task website.

Input vectors for lemma generation
The initial versions of the lemma decoder did not receive the state of the LSTM below ($h_i^3$) and the positional embedding ($\pi_i^j$) as inputs. The network learned to produce lemmas with some accuracy, but with many trivial errors. In particular, after training on the English-EWT treebank, the network learned to remove s from the end of plural nouns, but it also started to produce the <end-of-the-word> symbol even when s was in the middle of the word. We believe the reason was that almost no information was available that would allow the decoder to distinguish between a plural suffix and a simple s inside the word. One could argue that the initial state of the GRU ($h_i^3$) could contain such information, but it could have been lost in the GRU.
To remedy this, we decided to pass $h_i^3$ as an input at every step of the decoder. This idea is known to work well in image caption generation; the earliest usage of this trick that we are aware of is in (Donahue et al., 2015).
Additionally, we have added explicit information about the position in the word.Unlike (Vaswani et al., 2017), we encode the number of characters left before the end of the word.This choice might be biased towards languages where the ending of the word is the most critical in lemmatization.
By combining these two ideas we obtained a significant improvement in lemma generation for English. We did not run ablation experiments to determine the effect of each of these additions.
Additional experiments showed that this architecture of the lemmatizer does not generalize to Arabic and Korean. We will investigate this problem in future work.

Balancing different tasks
Multitask learning in neural networks is usually complicated by the varying difficulty of the individual tasks. The $\lambda$ coefficients in (1) can be used to find an optimal balance between the tasks. Our initial experiments with all $\lambda$ coefficients equal to 1 showed that the loss term for POS tagging ($L_p$) had much higher values than the rest. We decided to set $\lambda_p = 0.2$ to give more weight to the other tasks, and noticed some improvements in lemma generation.
We believe that more extensive search for better coefficients might help to significantly improve the overall performance of the system.

Fighting against overfitting
The main challenge in training these networks is overcoming overfitting. The only trick we used was applying dropout layers before the feature extractor LSTMs. We did not apply recurrent dropout (Gal and Ghahramani, 2016) or other noise injection techniques, although recent work in language modeling has demonstrated the importance of such tricks for obtaining high-performance models (Merity et al., 2018).

Conclusion
In this paper we have described our submission to CoNLL 2018 UD Shared Task. Our neural network was trained to jointly produce lemmas, part-of-speech tags and morphological features. It is the first step towards a fully multitask neural architecture that will also produce dependency relations. Future work will include more extensive hyperparameter tuning and experiments with more languages.

Table 1 :
The values of part-of-speech tags and morphological features for the English-EWT treebank.

Table 2 :
Casing features used in the embedding layer.

Table 3 :
Performance of our model compared to the UDPipe 1.2 baseline and the winner models of CoNLL 2018 UD Shared Task. Table 4 compares the same models using another set of metrics that measure the performance of POS tagging, morphological feature extraction and lemmatization.

Table 4 :
Additional metrics describing the performance of our model, UDPipe 1.2 baseline, and the winner models of CoNLL 2018 UD Shared Task.