Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation

This paper describes our system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. We base our submission on Stanford’s winning system for the CoNLL 2017 shared task and make two effective extensions: 1) incorporating deep contextualized word embeddings into both the part of speech tagger and parser; 2) ensembling parsers trained with different initialization. We also explore different ways of concatenating treebanks for further improvements. Experimental results on the development data show the effectiveness of our methods. In the final evaluation, our system was ranked first according to LAS (75.84%) and outperformed the other systems by a large margin.


Introduction
In this paper, we describe our system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018). We base our system on Stanford's winning system (Dozat et al., 2017, §2) for the CoNLL 2017 shared task (Zeman et al., 2017). The parser of Dozat and Manning (2016) and its extension (Dozat et al., 2017) have shown very competitive performance in both the shared task (Dozat et al., 2017) and previous parsing work (Ma and Hovy, 2017; Shi et al., 2017a; Liu et al., 2018b; Ma et al., 2018). A natural question is how to further improve their part-of-speech (POS) tagger and dependency parser with a simple yet effective technique. In our system, we make two noteworthy extensions to their tagger and parser:
• Incorporating deep contextualized word embeddings (Peters et al., 2018, ELMo: Embeddings from Language Models) into the word representation (§3);
• Ensembling parsers trained with different initializations (§4).
For some languages in the shared task, multiple treebanks from different domains are provided. Treebanks from the same language family are provided as well. Letting these treebanks help each other has been shown to be an effective way to improve parsing performance, both in the cross-lingual and cross-domain parsing community and in last year's shared task (Ammar et al., 2016; Guo et al., 2015; Che et al., 2017; Shi et al., 2017b; Björkelund et al., 2017). In our system, we apply simple concatenation to the treebanks that are potentially helpful to each other and explore different ways of concatenation to improve the parser's performance (§5).
In dealing with the small treebanks and treebanks from low-resource languages (§6), we adopt the word embedding transfer idea from cross-lingual dependency parsing (Guo et al., 2015) and use the bilingual word vector transformation technique (Smith et al., 2017) to map the fasttext word embeddings (Bojanowski et al., 2016) of the source rich-resource language and the target low-resource language into the same space. The transferred parser trained on the source language is then used for the target low-resource language.
We conduct experiments on the development data to study the effects of ELMo, parser ensemble, and treebank concatenation. Experimental results show that these techniques substantially improve the parsing performance. Using these techniques, our system achieved an averaged LAS of 75.84 on the official test set and was ranked first according to LAS (Zeman et al., 2018). This result outperforms the other systems by a large margin. We release our pre-trained ELMo for many languages at https://github.com/HIT-SCIR/ELMoForManyLangs.

Deep Biaffine Parser
We base our system on the tagger and parser of Dozat et al. (2017). The core idea of the tagger and parser is to use an LSTM network to produce a vector representation for each word and then predict POS tags and dependency relations from that representation. For the tagger, whose input is the word alone, this representation is calculated as

$$h_i = \mathrm{BiLSTM}\big(h_0, (v^{(word)}_1, \ldots, v^{(word)}_n)\big)_i,$$

where $v^{(word)}_i$ is the word embedding. After getting $h_i$, the scores of tags are calculated as

$$h^{(pos)}_i = \mathrm{MLP}^{(pos)}(h_i), \qquad s^{(pos)}_i = W \cdot h^{(pos)}_i + b^{(pos)}, \qquad y^{(pos)}_i = \arg\max_j s^{(pos)}_{i,j},$$

where each element in $s^{(pos)}_i$ represents how likely it is that the $i$-th word is assigned the corresponding tag.
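For the parser, whose inputs are the word and its POS tag, the representation is calculated as

$$x_i = v^{(word)}_i \oplus v^{(tag)}_i, \qquad h_i = \mathrm{BiLSTM}\big(h_0, (x_1, \ldots, x_n)\big)_i.$$

A pair of representations is then fed into a biaffine classifier to predict how likely it is that there is a dependency arc between the two words. The scores over all head words are calculated as

$$s^{(arc)}_i = H^{(arc\text{-}head)} W^{(arc)} h^{(arc\text{-}dep)}_i + H^{(arc\text{-}head)} b^{(arc)}, \qquad y^{(arc)}_i = \arg\max_j s^{(arc)}_{i,j},$$

where $h^{(arc\text{-}dep)}_i$ is computed by feeding $h_i$ into an MLP and $H^{(arc\text{-}head)}$ is the stack of $h^{(arc\text{-}head)}_i$, which is calculated in the same way as $h^{(arc\text{-}dep)}_i$ but using another MLP. After getting the head word $y^{(arc)}_i$, its relation with the $i$-th word is decided by calculating

$$s^{(rel)}_i = h^{\top(rel\text{-}head)}_{y^{(arc)}_i} U^{(rel)} h^{(rel\text{-}dep)}_i + W^{(rel)} \big(h^{(rel\text{-}dep)}_i \oplus h^{(rel\text{-}head)}_{y^{(arc)}_i}\big) + b^{(rel)}, \qquad y^{(rel)}_i = \arg\max_j s^{(rel)}_{i,j},$$

where $h^{(rel\text{-}head)}$ and $h^{(rel\text{-}dep)}$ are calculated in the same way as $h^{(arc\text{-}dep)}_i$ and $h^{(arc\text{-}head)}_i$. This decoding process can lead to cycles in the result. Dozat et al. (2017) employed an iterative fixing method on the cycles. We encourage the reader to refer to their paper for more details on training and decoding.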
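The head-selection step described above can be illustrated with a small sketch. It assumes the arc score takes the biaffine form $H W h + H b$ used by Dozat et al. (2017); the function names (`matvec`, `arc_scores`, `predict_head`) and the toy numbers are hypothetical and are not taken from the released parser. Cycle fixing and relation labeling are left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def arc_scores(H_head, W_arc, b_arc, h_dep):
    """Score every candidate head for one dependent word.

    Computes H_head (W_arc h_dep + b_arc), which equals
    H_head W_arc h_dep + H_head b_arc.
    """
    transformed = [t + b for t, b in zip(matvec(W_arc, h_dep), b_arc)]
    return matvec(H_head, transformed)


def predict_head(H_head, W_arc, b_arc, h_dep):
    """Greedy decoding: return the index of the highest-scoring head."""
    scores = arc_scores(H_head, W_arc, b_arc, h_dep)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: three candidate heads with 2-dimensional representations.
H_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_arc = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, chosen for readability
b_arc = [0.0, 0.0]
h_dep = [0.5, 2.0]

toy_scores = arc_scores(H_head, W_arc, b_arc, h_dep)
toy_head = predict_head(H_head, W_arc, b_arc, h_dep)
```

Because each word picks its head independently, nothing in this greedy step prevents cycles, which is why a separate fixing pass is needed.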
For both the biaffine tagger and parser, the word embedding $v^{(word)}_i$ is obtained by summing a fine-tuned token embedding $w_i$, a fixed word2vec embedding $p_i$, and an LSTM-encoded character representation $\tilde{v}_i$ as

$$v^{(word)}_i = w_i + p_i + \tilde{v}_i.$$

Deep Contextualized Word Embeddings
Deep contextualized word embeddings (Peters et al., 2018, ELMo) have been shown to be very effective on a range of syntactic and semantic tasks. It is straightforward to obtain ELMo by using an LSTM network to encode the words in a sentence and training the LSTM network with a language modeling objective on large-scale raw text. More specifically, $\mathrm{ELMo}_i$ is computed by first computing the hidden representation of each layer $j$,

$$h^{(LM)}_{i,j} = \mathrm{BiLSTM}^{(LM)}\big(h^{(LM)}_0, (\tilde{v}_1, \ldots, \tilde{v}_n)\big)_i,$$

where $\tilde{v}_i$ is the output of a CNN over characters, then attentively summing and scaling the different layers $h^{(LM)}_{i,j}$ with $s_j$ and $\gamma$ as

$$\mathrm{ELMo}_i = \gamma \sum_{j=0}^{L} s_j \, h^{(LM)}_{i,j},$$

where $L$ is the number of layers and $h^{(LM)}_{i,0}$ is identical to $\tilde{v}_i$. In our system, we follow Peters et al. (2018) and use a two-layer bidirectional LSTM as our $\mathrm{BiLSTM}^{(LM)}$.
In this paper, we study the usage of ELMo for improving both the tagger and parser and make several simplifications. Different from Peters et al. (2018), we treat the output of ELMo as a fixed representation and do not tune its parameters during tagger and parser training. Thus, we drop the layer-wise attention scores $s_j$ and the scaling factor $\gamma$, which means

$$\mathrm{ELMo}_i = \sum_{j=0}^{L} h^{(LM)}_{i,j}.$$

In our preliminary experiments, using $h^{(LM)}_{i,0}$ for $\mathrm{ELMo}_i$ yields better performance on some treebanks. In our final submission, we choose between $\sum_{j=0}^{2} h^{(LM)}_{i,j}$ and $h^{(LM)}_{i,0}$ for each treebank based on its development performance.
After getting $\mathrm{ELMo}_i$, we project it to the same dimension as $v^{(word)}_i$ and use it as an additional word embedding. The calculation of $v^{(word)}_i$ becomes

$$v^{(word)}_i = w_i + p_i + \tilde{v}_i + W^{(ELMo)} \cdot \mathrm{ELMo}_i$$

for both the tagger and parser. Note that $W^{(ELMo)}$ is trained together with the tagger and parser. To avoid overfitting, we apply dropout to the projected vector $W^{(ELMo)} \cdot \mathrm{ELMo}_i$ during training.
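As an illustration of this section, the following sketch combines fixed language-model layers into an ELMo vector, projects it, and adds it to the other embeddings. All names (`elmo_vector`, `project`, `dropout`, `word_embedding`) are hypothetical, the vectors are toy values, and the use of inverted dropout is an assumption; only the 33% rate comes from the implementation details reported later in the paper.

```python
import random


def elmo_vector(layers, mode="sum"):
    """Combine the per-layer LM states of one word.

    layers[j] is the layer-j hidden state; layer 0 is the character-CNN output.
    mode="sum" adds all layers without weights; mode="h0" keeps layer 0 only.
    """
    if mode == "h0":
        return list(layers[0])
    return [sum(values) for values in zip(*layers)]


def project(W_elmo, vector):
    """Linear projection to the word-embedding dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in W_elmo]


def dropout(vector, rate, rng):
    """Inverted dropout (an assumption): zero each unit with probability rate."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def word_embedding(w, p, v_char, elmo, W_elmo, rate=0.33, training=False, rng=None):
    """Sum token, pretrained, character, and projected ELMo embeddings."""
    projected = project(W_elmo, elmo)
    if training:
        projected = dropout(projected, rate, rng or random)
    return [a + b + c + d for a, b, c, d in zip(w, p, v_char, projected)]


layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # h_{i,0}, h_{i,1}, h_{i,2}
W_elmo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # projects 2 dims to 3 dims
summed = elmo_vector(layers, mode="sum")
lowest = elmo_vector(layers, mode="h0")
v_word = word_embedding([0.0] * 3, [1.0] * 3, [2.0] * 3, summed, W_elmo)
```

Keeping the layer combination free of learned weights means the ELMo vectors can be computed once and cached, since only the projection matrix is updated during tagger and parser training.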

Parser Ensemble
According to Reimers and Gurevych (2017), neural network training can be sensitive to initialization, and Liu et al. (2018a) show that ensembling neural networks trained with different initializations leads to performance improvements. We follow their work and train three parsers with different initializations, then ensemble these parsers by averaging their softmaxed output scores as

$$s_i = \frac{1}{3} \sum_{m=1}^{3} \mathrm{softmax}\big(s^{(m)}_i\big).$$
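The averaging step can be sketched as follows. This is a minimal illustration with hypothetical function names and toy scores; it only shows how softmaxed scores from several parsers might be averaged for one word, not how the submitted system is implemented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(score_lists):
    """Average the softmaxed scores of several models for one word."""
    distributions = [softmax(scores) for scores in score_lists]
    n_models = len(distributions)
    return [sum(column) / n_models for column in zip(*distributions)]


# Raw head scores for one word from three parsers (toy values).
parser_scores = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 2.0, 0.0],
]
averaged = ensemble(parser_scores)
best_head = max(range(len(averaged)), key=averaged.__getitem__)
```

Averaging probabilities rather than raw scores keeps one parser with unusually large logits from dominating the decision.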

Treebank Concatenation
For 15 out of the 58 languages in the shared task, multiple treebanks from different domains are provided. There are also treebanks that come from the same language family. Taking advantage of the relations between treebanks has been shown to be a promising direction both in the research community (Ammar et al., 2016; Guo et al., 2015, 2016a) and in the CoNLL 2017 shared task (Che et al., 2017; Björkelund et al., 2017; Shi et al., 2017b).
In our system, we adopt the treebank concatenation technique of Ammar et al. (2016) with one exception: we only concatenate a group of treebanks from the same language (cross-domain concatenation) or a pair of treebanks that are typologically or geographically correlated (cross-lingual concatenation).
In our system, we tried cross-domain concatenation on nl, sv, ko, it, en, fr, gl, la, ru, and sl. We also tried cross-lingual concatenation on ug-tr, uk-ru, ga-en, and sme-fi following Che et al. (2017). However, due to the variance in vocabulary, grammatical genre, and even annotation, treebank concatenation is not guaranteed to improve the model's performance. We decide whether to use concatenation by examining the development set performance. For small treebanks which do not have a development set, the decision is made through 5-fold cross validation. We show the experimental results of treebank concatenation in Section 9.3.

Low Resources Languages
In the shared task, 5 languages are presented with a training set of fewer than 50 sentences, and 4 languages do not have any training data at all. It is difficult to train a reasonable parser on these low-resource languages. We deal with these treebanks by adopting the word embedding transfer idea of Guo et al. (2015). We transfer the word embeddings of the rich-resource language to the space of the low-resource language using the bilingual word vector transformation technique (Smith et al., 2017) and train a parser on the source treebank using only the pretrained word embeddings in the transformed space, i.e., $v^{(word)}_i = p_i$. The transformation matrix is automatically learned on the fasttext word embeddings using the tokens shared by the two languages (like punctuation).
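The following sketch illustrates one way such a transformation could be learned from shared tokens. It assumes the orthogonal (Procrustes) solution obtained from a singular value decomposition, which is the core of the approach of Smith et al. (2017); vector normalization and other details of their method are omitted. The function names and the synthetic embeddings are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def shared_token_matrices(src_emb, tgt_emb):
    """Stack the embeddings of tokens present in both vocabularies.

    Both arguments are dicts mapping a token to a 1-D NumPy vector.
    """
    shared = sorted(set(src_emb) & set(tgt_emb))
    src = np.array([src_emb[token] for token in shared])
    tgt = np.array([tgt_emb[token] for token in shared])
    return src, tgt


def learn_orthogonal_map(src_vectors, tgt_vectors):
    """Return the orthogonal W minimizing ||src_vectors @ W - tgt_vectors||_F."""
    u, _, vt = np.linalg.svd(src_vectors.T @ tgt_vectors)
    return u @ vt


# Synthetic check: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
src_emb = {tok: rng.normal(size=4) for tok in [".", ",", "!", "?", ":", ";"]}
tgt_emb = {tok: vec @ rotation for tok, vec in src_emb.items()}

src, tgt = shared_token_matrices(src_emb, tgt_emb)
W = learn_orthogonal_map(src, tgt)
mapped = src @ W  # source vectors expressed in the target space
```

In practice the two embedding spaces are not exact rotations of each other, so the learned map only approximately aligns them, and the quality depends on how many tokens the two vocabularies share.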
Table 1 shows our source languages for the target low-resource languages. For a treebank with only a little training data, its source language is decided by testing the source parser's performance on that training data. For a treebank without any training data, we choose the source language according to its language family. Naija presents an exception for our method since it does not have fasttext word embeddings, which makes embedding transformation infeasible. Since it's a dialect of English, we use the full pipeline of en_ewt for pcm_nsc instead.

Preprocessing
Besides improving the tagger and parser, we also consider preprocessing an important factor in the final performance and improve it by using a state-of-the-art system for sentence segmentation and by developing our own word segmentor for languages whose tokenization is non-trivial.

Sentence Segmentation
For some treebanks, sentence segmentation can be problematic since there are no explicit sentence delimiters. de Lhoneux et al. (2017) and Shao (2017) presented a joint tokenization and sentence segmentation model (denoted as the Uppsala segmentor) that outperformed the baseline model in last year's shared task (Zeman et al., 2017). We select the set of treebanks whose udpipe sentence segmentation F-scores are lower than 95 on the development set and use the Uppsala segmentor instead. Using the Uppsala segmentor leads to a development improvement of 7.67 F-score on these treebanks over the udpipe baseline, and it was ranked first according to sentence segmentation in the final evaluation.

Tokenization for Chinese, Japanese, and Vietnamese
Tokenization is non-trivial for languages which do not have explicit word boundary markers, like Chinese, Japanese, and Vietnamese. We develop our own tokenizer (denoted as the SCIR tokenizer) for these three languages. Following Che et al. (2017) and Zheng et al. (2017), we model tokenization as labeling a word boundary tag on each character and use features derived from large-scale unlabeled data to further improve the performance. In addition to pointwise mutual information (PMI), we also incorporate character ELMo into our tokenizer. Embeddings of these features are concatenated with character bigram embeddings as input. These techniques lead to the best tokenization performance on all the related treebanks, and the average improvement over the udpipe baseline is 7.5 in tokenization F-score.
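To make the character-labeling formulation concrete, the sketch below converts a segmented sentence into per-character boundary tags and back. The B/I/E/S tag inventory is an assumption made for illustration, since the tag set is not spelled out in this text, and the function names are hypothetical. The neural tagger that predicts the tags is not shown.

```python
def words_to_tags(words):
    """Turn a segmented sentence into (characters, boundary tags).

    Assumed tag set: B = word begin, I = inside, E = word end, S = single.
    """
    chars, tags = [], []
    for word in words:
        if not word:
            continue  # ignore empty strings
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and predicted boundary tags."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


segmented = ["我们", "喜欢", "自然语言"]
chars, tags = words_to_tags(segmented)
restored = tags_to_words(chars, tags)
```

With this encoding, tokenization reduces to sequence labeling over characters, so features such as PMI or character ELMo can be attached to each character position.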

Preprocessing for Thai
The Thai language presents a unique challenge in preprocessing. Our survey of the Thai Wikipedia indicates that there is no explicit sentence delimiter and that obtaining Thai words requires tokenization. To remedy this, we use whitespace as the sentence delimiter and use a lexicon-based word segmentation method, the forward maximum matching algorithm, for Thai tokenization. Our lexicon is derived from the fasttext word embeddings by preserving the top 10% most frequent words.
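A minimal sketch of this preprocessing is given below: whitespace splits the raw text into sentences, and forward maximum matching greedily takes the longest lexicon entry at each position. The function names and the toy Latin-script lexicon are hypothetical, and falling back to a single character when nothing matches is an assumption.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment text by repeatedly taking the longest lexicon match from the left."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    max_len = max(1, max_len)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)  # single characters are the fallback
                i += size
                break
    return tokens


def preprocess(raw_text, lexicon):
    """Split on whitespace into sentences, then tokenize each sentence."""
    return [forward_max_match(chunk, lexicon) for chunk in raw_text.split()]


toy_lexicon = {"ab", "abc", "cd", "d"}
toy_sentences = preprocess("abcd abxd", toy_lexicon)
```

The greedy choice is visible in the first toy sentence: "abc" is preferred over "ab" followed by "cd", because the longest match at the current position always wins.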

Lemmatization and Morphology Tagging
We did not make an effort on lemmatization and morphology tagging and only use the baseline model. This hurts our performance in the MLAS and BLEX evaluations, in which we were ranked 6th and 2nd, respectively. However, since our method, especially the incorporation of ELMo, is not limited to a particular task, we expect it to improve both lemmatization and morphology tagging and to achieve better MLAS and BLEX scores.

Implementation Details
Pretrained Word Embeddings. We use the 100-dimensional pretrained word embeddings released by the shared task for the large languages.
For the small treebanks and treebanks for low-resource languages where cross-lingual transfer is required, we use the 300-dimensional fasttext word embeddings. The Old French treebank (fro_srcmf) is the only exception, and we use the French embeddings for it instead. For all the embeddings, we only use the 10% most frequent words.
ELMo. We use the same hyperparameter settings as Peters et al. (2018) for $\mathrm{BiLSTM}^{(LM)}$ and the character CNN. We train their parameters with a bidirectional language modeling objective on 20 million words randomly sampled from the raw text released by the shared task for each language. Similar to Peters et al. (2018), we use the sampled softmax technique to make training on a large vocabulary feasible (Jean et al., 2015). However, we use a window of 8192 words surrounding the target word as negative samples, which shows better performance in our preliminary experiments. Training ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
Biaffine Parser. We use the same hyperparameter settings as Dozat et al. (2017). When trained with ELMo, we use a dropout of 33% on the projected vectors.
SCIR Tokenizer. We use 50-dimensional character bigram embeddings. For the character ELMo, whose input is a character, the language model predicts the next character in the same way as the word ELMo predicts the next word. The final model is an ensemble of five single segmentors.
Uppsala Segmentor. We use the default settings for the Uppsala segmentor, and the final model is an ensemble of three single segmentors.

Effects of ELMo
We study the effect of ELMo on the large treebanks and report the results of a single tagger and parser with and without ELMo. Figure 1a shows the tagging results on the development set and Figure 1b shows the parsing results. Using ELMo in the tagger leads to a macro-averaged improvement of 0.56% in UPOS, and the macro-averaged error reduction is 17.83%. Using ELMo in the parser leads to a macro-averaged improvement of 0.84% in LAS, and the macro-averaged error reduction is 7.88%. ELMo improves the tagging performance on almost every treebank, except for zh_gsd and gl_ctg. Similar trends are witnessed in the parsing experiments, with ko_kaist and pl_lfg being the only treebanks where ELMo slightly worsens the performance.
We also study how the relative improvements depend on the size of the treebank. The lines in Figure 1a and Figure 1b show the error reduction from using ELMo on each treebank. However, no clear relation is revealed between the treebank size and the gains from using ELMo.
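The distinction between absolute improvement and error reduction can be made explicit with a short sketch. It assumes the usual definition of relative error reduction, (new - old) / (100 - old), with scores given in percent; the function names and the example scores are hypothetical and are not results from the paper.

```python
def error_reduction(baseline, improved):
    """Relative error reduction in percent, for scores given in percent."""
    return 100.0 * (improved - baseline) / (100.0 - baseline)


def macro_average(values):
    """Unweighted mean over treebanks."""
    return sum(values) / len(values)


# Hypothetical (baseline, with-ELMo) scores for two treebanks.
toy_scores = [(96.0, 96.6), (80.0, 81.0)]
gains = [new - old for old, new in toy_scores]
reductions = [error_reduction(old, new) for old, new in toy_scores]
avg_gain = macro_average(gains)
avg_reduction = macro_average(reductions)
```

In this toy case the smaller absolute gain on the high-accuracy treebank corresponds to the larger error reduction, which is why a modest average UPOS gain can coincide with a large average error reduction.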

Effects of Ensemble
We also test the effect of ensembling and show the results in Figure 2. Parser ensemble leads to an averaged improvement of 0.55% in LAS, and the averaged error reduction is 4.0%. These results indicate that ensembling is an effective way to improve the parsing performance. The relationship between the gains from ensembling and the treebank size is also studied in this figure, and the trend is that small treebanks benefit more from the ensemble. We attribute this to the fact that the ensemble improves the model's generalization ability, which is weak for a parser trained on a small treebank due to overfitting.

Effects of Treebank Concatenation
As mentioned in Section 5, we study the effects of both cross-domain concatenation and cross-lingual concatenation.
Cross-Domain Concatenation. For the treebanks which have a development set, the development performance is shown in Table 2. The numbers of sentences in the training sets are also shown in this table. The general trend is that for treebanks with a small training set, cross-domain concatenation achieves better performance, while for those with a large training set, concatenation does not improve the performance or even worsens the results.
For the small treebanks which do not have a development set, the 5-fold cross-validation results are shown in Table 3. [...] (Guo et al., 2015, 2016b) and treebank transfer (Guo et al., 2016a) are still necessary.

Effects of Better Preprocessing
We also study how preprocessing contributes to the final parsing performance. The experimental results on the development set are shown in Table 5. From this table, the performance of word segmentation is almost linearly correlated with the final performance. Similar trends are witnessed for sentence segmentation, but el_gdt and pt_bosque present exceptions where better preprocessing leads to a drop in the final parsing performance.

Parsing Strategies and Test Set Evaluation
Using the development set and cross validation, we choose the best model and data combination for each treebank; the choices are shown in Table 6 along with the test evaluation. From this table, we can see that our system gains more improvements when both ELMo and parser ensemble are used. For some treebanks, concatenation also contributes to the improvements. Parsing Japanese, Vietnamese, and Chinese clearly benefits from better word segmentation. Since most of the participating teams use a single parser in their systems, we also remove the parser ensemble and perform a post-contest evaluation. The results are also shown in this table.
Our system without ensemble achieves a macro-averaged LAS of 75.26, which unofficially ranks first according to LAS in the shared task. We also report the time and memory consumption. A full run over the 82 test sets on the TIRA virtual machine (Potthast et al., 2014) takes about 40 hours and consumes about 4 GB of RAM.

Conclusion
Our system submitted to the CoNLL 2018 shared task made several improvements on last year's winning system from Dozat et al. (2017), including incorporating deep contextualized word embeddings, parser ensemble, and treebank concatenation. Experimental results on the development set show the effectiveness of our methods. Using these techniques, our system achieved an averaged LAS of 75.84% and obtained the first place in LAS in the final evaluation.

Credits
There are a few references to which we would like to give proper credit, especially to data providers: the core Universal Dependencies paper from LREC 2016 (Nivre et al., 2016), the UD version 2.2 datasets (Nivre et al., 2018), the baseline udpipe model released by Straka et al. (2016), the deep contextualized word embeddings code released by Peters et al. (2018), the biaffine tagger and parser released by Dozat et al. (2017), the joint sentence segmentor and tokenizer released by de Lhoneux et al. (2017), and the evaluation platform TIRA (Potthast et al., 2014).

self denotes that the model is trained with the treebank itself. If the model field is not filled with self, the model is trained with treebank concatenation. The ref. column shows the top-performing system if we are not top, or the second-best performing system on LAS. We also show the results without parser ensemble and our unofficial ranks of this system.

Figure 1 :
The effects of ELMo. Treebanks are sorted from the smallest to the largest.

Table 1 :
Cross-lingual transfer settings for low-resource target languages.

Table 2 :
The development performance with cross-domain concatenation for languages which have multiple treebanks. single means training the parser on its own treebank without concatenation. # train shows the number of training sentences in the treebank, measured in thousands.
In Table 3, concatenation improves most of the treebanks except for gl_treegal.

Table 3 :
The 5-fold cross-validation results for the cross-domain concatenation of treebanks which do not have a development set.

Table 4 :
Cross-lingual concatenation results. The results for ug_udt and uk_iu are obtained on the development set. The results for ga_idt and sme_giella are obtained with udpipe by 5-fold cross validation.

Table 6 :
The strategies used in the final submission. The toolkit and model are separated by a colon (uppsala: the Uppsala segmentor; scir: our segmentor; biaffine: the biaffine tagger and parser; biaffine trans: our transfer parser for low-resource languages). $h_0$ and $h_{0,1,2}$ denote the ELMo representation used to train the model: $h_0$ means using $h^{(LM)}_{i,0}$ and $h_{0,1,2}$ means using $\sum_{j=0}^{2} h^{(LM)}_{i,j}$.