Multilingual Universal Dependency Parsing from Raw Text with Low-Resource Language Enhancement

This paper describes the system of our team Phoenix for the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Given the annotated gold standard data in CoNLL-U format, we train the tokenizer, tagger and parser separately for each treebank using the open source pipeline tool UDPipe. Our system reads plain text as input, performs the pre-processing steps (tokenization, lemmatization, morphological tagging) and finally outputs the syntactic dependencies. For the low-resource languages with no training data, we use cross-lingual techniques and build models from closely related languages instead. In the official evaluation, our system achieves macro-averaged scores of 65.61%, 52.26% and 55.71% for LAS, MLAS and BLEX, respectively.


Introduction
Universal Dependencies (UD) (Nivre et al., 2016) is a framework that provides cross-linguistically consistent grammatical annotations for various languages, which enables comparative evaluation of cross-lingual learning tasks. As a follow-up to the CoNLL 2017 UD Shared Task (Zeman et al., 2017), the goal of the CoNLL 2018 UD Shared Task (Zeman et al., 2018) is to develop multilingual dependency parsers from raw text for many typologically different languages, with training data from the UD project. The task comprises 82 test sets from 57 languages. However, there is a category of low-resource languages that have little or no training data, which requires cross-lingual techniques (Zeman and Resnik, 2008; Tiedemann, 2015) that draw on data from other languages.
In this paper, we present the system of our team Phoenix for multilingual universal dependency parsing from raw text. The targeted task is a challenging one for deep learning based natural language processing (Wu and Zhao, 2018; Wang et al., 2017; Qin et al., 2017). We adopt the trainable open source tool UDPipe 1.2 (Straka et al., 2016; Straka and Straková, 2017) to train the dependency parser for each test set, with the UD version 2.2 treebanks as training data. Our model has three main components: tokenization, Part-of-Speech (POS) tagging and dependency parsing. When evaluated on the web interface of the TIRA platform (Potthast et al., 2014), the system reads the raw text as input and uses a model selector to choose the corresponding model for each test set. After tokenizing and tagging the raw text, the system outputs the syntactic dependencies in the CoNLL-U format. To deal with the low-resource languages that have no training data, we apply cross-lingual techniques and train on related or close languages. Our official submission obtains macro-averaged scores of 65.61%, 52.26% and 55.71% for LAS, MLAS and BLEX on all treebanks.
The rest of this paper is organized as follows. Section 2 introduces the architecture overview of our system. Section 3 gives the implementation details and the specific strategies applied for the low-resource languages. Finally, we report and analyze the official results with the three main evaluation metrics in Section 4.
Figure 1 illustrates the overall architecture of our system for training and predicting. In the training procedure, the system takes as input the treebanks (training sets) in CoNLL-U format and trains a model for each of them. Every model has three components: a tokenizer, a tagger and a parser. When training the parser, we apply pretrained word embeddings for word forms in the word2vec format. In the predicting procedure, the system takes as input the raw text (test set) and selects a model according to the language code and treebank code. With the selected model, the system outputs the syntactic head and the type of the dependency relation for each word.

Training Data
Our models are trained using only the UD 2.2 treebanks provided by the CoNLL 2018 UD Shared Task, without any additional data. There are 82 test sets from 57 languages, and 61 of the 82 treebanks are large enough to provide training and development sets. However, the other 21 treebanks lack development data and some of them have no training data at all. Among these low-resource treebanks, 7 have training data of still reasonable size; 5 are extra test sets in languages where another large treebank exists; 9 are low-resource languages with no training data available (Breton, Faroese, Naija, Thai) or with a training set that is just a tiny sample (Armenian, Buryat, Kazakh, Kurmanji, Upper Sorbian). For the treebanks that have training data but no development data, we split off the last 10% of the training data as the development set to tune model hyperparameters, even if the training set is very small. For those with no training data, we train the parsers on treebanks of the same language or of related languages. The details are given in Section 3.4.
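The held-out split described above can be sketched as follows. This is a minimal illustration rather than the code used in the system: it assumes the split is made at sentence boundaries, and the function names are hypothetical.

```python
def split_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks separated by blank lines."""
    blocks = [b.strip("\n") for b in text.strip().split("\n\n")]
    return [b for b in blocks if b]


def train_dev_split(sentences, dev_fraction=0.1):
    """Hold out the last `dev_fraction` of the sentences as development data.

    Assumption: at least one sentence is held out whenever there are two or
    more sentences, so that even tiny treebanks get a development set.
    """
    if len(sentences) < 2:
        return list(sentences), []
    n_dev = max(1, int(len(sentences) * dev_fraction))
    cut = len(sentences) - n_dev
    return sentences[:cut], sentences[cut:]


# Toy treebank with 20 one-word sentences.
toy = "\n\n".join(
    f"# sent_id = {i}\n1\tword{i}\t_\t_\t_\t_\t0\troot\t_\t_" for i in range(20)
) + "\n\n"
sentences = split_conllu_sentences(toy)
train, dev = train_dev_split(sentences)
print(len(train), len(dev))
```

Taking the last sentences rather than a random sample keeps the split deterministic, which makes hyperparameter tuning runs comparable.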

Word Embedding
We pretrain embeddings for word forms on the provided training data with word2vec (Mikolov et al., 2013). The parameter settings of word2vec are shown in Table 1. We use the skip-gram model to train word vectors of dimension 50. The context window is set to 10 words, and words that occur fewer than two times are dropped. After converting CoNLL-U to the horizontal format (replacing spaces within word forms with a Unicode character), we train the word embeddings of each treebank on the UD data for 15 iterations.

Parameter     Value
algorithm     skip-gram
size          50
window        10
min-count     2
iterations    15

Table 1: Parameters for training word embeddings.
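The conversion from CoNLL-U to the horizontal format (one sentence per line, word forms separated by spaces) can be sketched as below. This is an illustrative sketch under stated assumptions: the replacement character U+00A0 (no-break space) and the decision to skip multiword token ranges and empty nodes are our assumptions, and the function name is hypothetical.

```python
def conllu_to_horizontal(conllu_text, space_replacement="\u00a0"):
    """Return one line per sentence, with word forms separated by spaces.

    Spaces inside a word form are replaced so that each form stays a single
    whitespace-delimited unit for the embedding trainer.
    """
    lines_out, words = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                lines_out.append(" ".join(words))
                words = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Assumption: skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        words.append(cols[1].replace(" ", space_replacement))
    if words:
        lines_out.append(" ".join(words))
    return lines_out


sample = (
    "# text = in New York\n"
    "1\tin\tin\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tNew York\tNew York\tPROPN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
horizontal = conllu_to_horizontal(sample)
print(horizontal)
```

The resulting lines can then be passed to a word2vec implementation configured with the parameters in Table 1.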

Model Selector
Since we train a separate model for each language and treebank, a model selector is needed to decide which model to use when predicting. The model selector in our system simply reads the JSON file in the test file folder and assigns the corresponding trained model to each test set according to the language code and treebank code.
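A model selector of this kind can be sketched in a few lines. The metadata field names (`lcode`, `tcode`, `rawfile`), the file names and the model paths below are assumptions made for illustration; they are not taken from the description above.

```python
import json


def select_models(metadata_text, model_table):
    """Map each test set listed in the metadata to a trained model file.

    `model_table` maps "<language code>_<treebank code>" to a model path.
    """
    assignments = {}
    for entry in json.loads(metadata_text):
        key = f"{entry['lcode']}_{entry['tcode']}"
        if key not in model_table:
            raise KeyError(f"no model registered for {key}")
        assignments[entry["rawfile"]] = model_table[key]
    return assignments


# Hypothetical metadata and model table.
metadata_text = json.dumps([
    {"lcode": "en", "tcode": "ewt", "rawfile": "en_ewt.txt"},
    {"lcode": "th", "tcode": "pud", "rawfile": "th_pud.txt"},
])
model_table = {
    "en_ewt": "models/en_ewt.udpipe",
    "th_pud": "models/th_mixture.udpipe",  # a model trained on other treebanks
}
assignments = select_models(metadata_text, model_table)
print(assignments)
```

Keeping the mapping in an explicit table makes it easy to point a treebank without training data at a model trained on other treebanks.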

Tokenizer
In our system pipeline, the first step is sentence segmentation and tokenization, which are performed jointly in UDPipe. The tokenizer is a single-layer bidirectional GRU network that predicts for each character whether it is the last one in a sentence or the last one in a token. In UD treebanks, the text is structured on several levels: document, paragraph, sentence and token. The MISC feature SpaceAfter=No denotes that a given token is not followed by a space. Thus, the tokenizer is trained according to the SpaceAfter=No features in the CoNLL-U files. The parameters used for training the tokenizer are listed in Table 2. The segmenter and tokenizer network employs character embeddings and is trained using dropout both before and after the recurrent units. The GRU dimension, dropout probability and learning rate are tuned on the development set. All the tokenizers are trained for 100 epochs. Other parameters such as tokenize_url and allow_spaces are left at their default values.
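To make the training signal concrete, the sketch below reconstructs the raw text from tokens and their SpaceAfter information and derives one label per character. It is a simplified illustration, not UDPipe's internal representation: the label names "O", "T" and "S" are hypothetical, and non-empty word forms are assumed.

```python
def char_labels(sentences):
    """Build (text, labels) from tokenized sentences.

    `sentences` is a list of sentences, each a list of (form, space_after)
    pairs, where space_after is False when the token carries SpaceAfter=No.
    Labels: "T" = last character of a token, "S" = last character of a
    sentence (which is also a token end), "O" = any other character.
    """
    chars, labels = [], []
    for sent in sentences:
        for j, (form, space_after) in enumerate(sent):
            for ch in form:
                chars.append(ch)
                labels.append("O")
            # Relabel the final character of the token just written.
            labels[-1] = "S" if j == len(sent) - 1 else "T"
            if space_after:
                chars.append(" ")
                labels.append("O")
    return "".join(chars), labels


text, labels = char_labels([
    [("Hello", False), (",", True), ("world", False), ("!", True)],
])
print(repr(text))
print("".join(labels))
```

A character-level network trained on such labels can recover both token and sentence boundaries in a single pass over the raw text.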

Tagger
The second step in our system pipeline is to generate POS tags and other morphological features for the tokenized data, which are used as input for the final dependency parser. We adopt the built-in tagger in UDPipe, which is based on the open source morphological analysis tool MorphoDiTa (Straková et al., 2014). In this step, the tagger produces the following outputs:

LEMMA: Lemma or stem of the word form.
UPOS: Universal part-of-speech tag.
XPOS: Language-specific part-of-speech tag.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension.

We use two MorphoDiTa models to produce different features, an approach whose effectiveness has been verified by Straka et al. (2016). The first model, called the tagger, generates the UPOS, XPOS and FEATS tags, while the second one, called the lemmatizer, performs lemmatization.
The tagger consists of a guesser and an averaged perceptron. The guesser generates several triplets (UPOS, XPOS, FEATS) for each word according to its last four characters. The averaged perceptron with a fixed set of features disambiguates the generated tags (Straka et al., 2016;Straková et al., 2014).
The structure of the lemmatizer is similar to that of the tagger. A guesser produces (lemma rule, UPOS) tuples and an averaged perceptron performs disambiguation. A lemma rule generates a lemma from a word by stripping an affix and adding a new one, and the guesser proposes rules according to the last four characters of the word and its prefix.
The training parameters of the two models for different treebanks are provided in Table 3.
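The idea of a lemma rule can be illustrated with a small sketch. It is deliberately simplified and is not MorphoDiTa's rule format: it handles suffixes only, whereas the lemmatizer described above also takes prefixes into account, and the function names are hypothetical.

```python
def derive_suffix_rule(form, lemma):
    """Return (strip, add): remove `strip` from the end of the form, append `add`."""
    common = 0
    for a, b in zip(form, lemma):
        if a != b:
            break
        common += 1
    return form[common:], lemma[common:]


def apply_suffix_rule(form, rule):
    """Apply a (strip, add) rule; leave the form unchanged if it does not match."""
    strip, add = rule
    if not form.endswith(strip):
        return form
    return form[: len(form) - len(strip)] + add


rule_ing = derive_suffix_rule("walking", "walk")   # rule learned from one pair
rule_ies = derive_suffix_rule("cities", "city")
print(rule_ing, apply_suffix_rule("talking", rule_ing))
print(rule_ies, apply_suffix_rule("parties", rule_ies))
```

Because a rule learned from one word pair transfers to unseen words with the same ending, a guesser can propose candidate rules from the last characters of a word and leave the final choice to the disambiguation step.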

Parser
The final parsing step is performed using Parsito (Straka et al., 2015), a transition-based parser with a neural-network classifier. The parser supports several transition systems, including a projective arc-standard system (Nivre, 2008), a partially non-projective link2 system (Gómez-Rodríguez et al., 2014) and a fully non-projective swap system (Nivre, 2009). The transition oracle can be configured as a static oracle, a dynamic oracle for the arc-standard system (Goldberg et al., 2014) or a search-based oracle (Straka et al., 2015).
We use the gold lemmas and the gold UPOS, XPOS and FEATS tags for both the training and development data when training the parser. The parser employs FORM embeddings of dimension 50, and UPOS, FEATS and DEPREL embeddings of dimension 20. The FORM embeddings are pretrained with word2vec on the training data, and the other embeddings are initialized randomly. All the embeddings are updated in each iteration during training. The hidden layer size is set to 200 and the batch size is limited to 30. All the parsing models are trained for 10 iterations. The training parameters for different datasets are reported in Table 4. The optimal parameters are chosen to maximize the accuracy on the development set.
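As an illustration of the arc-standard transition system, the sketch below applies a given transition sequence to a three-word sentence. It is not Parsito's implementation: in the real parser a neural-network classifier predicts each transition, whereas here the sequence is supplied by hand, and the action names are hypothetical.

```python
def parse_arc_standard(n_words, transitions):
    """Apply arc-standard transitions and return {dependent: (head, label)}.

    Words are numbered 1..n_words; 0 is the artificial root.
    SHIFT moves the next buffer word onto the stack.
    LEFT  makes the second-topmost stack item a dependent of the topmost one.
    RIGHT makes the topmost stack item a dependent of the second-topmost one.
    """
    stack = [0]
    buffer = list(range(1, n_words + 1))
    heads = {}
    for action, label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            dependent = stack.pop(-2)
            heads[dependent] = (stack[-1], label)
        elif action == "RIGHT":
            dependent = stack.pop()
            heads[dependent] = (stack[-1], label)
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


# "She reads books": 1 = She, 2 = reads, 3 = books
transitions = [
    ("SHIFT", None), ("SHIFT", None), ("LEFT", "nsubj"),
    ("SHIFT", None), ("RIGHT", "obj"), ("RIGHT", "root"),
]
heads = parse_arc_standard(3, transitions)
print(heads)
```

Each word is attached exactly once, so a sentence of n words is parsed in 2n transitions; this system can only produce projective trees.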

Low-resource Treebanks
In the UD 2.2 datasets provided by the shared task, there are low-resource treebanks with little or even no training data. In this section, we focus on our strategy for the treebanks without any training data.
For those with other treebanks in the same language, we trained models both on the mixture of all those treebanks and on the largest treebank only, as the official baseline did. For those without other treebanks in the same language, the direct solution is to use other related treebanks as training data. Hence, we take advantage of cross-lingual knowledge and train mixture models on the treebanks of similar or related languages. Specifically, we manually selected treebanks of similar languages as the ingredients of a mixture dataset according to factors such as grammar, morphology and vocabulary. The training and development sets of the selected treebanks are merged, and we train and evaluate on the merged sets. The treebanks without training data and their corresponding training sets are shown in Table 5.
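Merging the selected treebanks amounts to concatenating their sentence blocks. The sketch below shows one way to do this; it is an assumption about the mechanics rather than the procedure actually used, and the function name and toy data are hypothetical.

```python
def merge_treebanks(treebank_texts):
    """Concatenate the sentences of several CoNLL-U texts into one text."""
    blocks = []
    for text in treebank_texts:
        for block in text.strip().split("\n\n"):
            block = block.strip("\n")
            if block:
                blocks.append(block)
    # Every CoNLL-U sentence, including the last, ends with a blank line.
    return "\n\n".join(blocks) + "\n\n" if blocks else ""


treebank_a = "1\tA\t_\t_\t_\t_\t0\troot\t_\t_\n\n"
treebank_b = "1\tB\t_\t_\t_\t_\t0\troot\t_\t_\n\n1\tC\t_\t_\t_\t_\t0\troot\t_\t_\n"
merged = merge_treebanks([treebank_a, treebank_b])
print(merged.count("\n\n"))
```

The same function can be applied separately to the training sets and to the development sets of the selected treebanks.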

Results
The final results are evaluated blindly on the TIRA platform. There are three main scoring metrics: LAS, MLAS and BLEX. Our system ranks 19th in LAS, 17th in MLAS and 13th in BLEX on the main metric ranking board. The main evaluation scores of our system on all treebanks, big treebanks, PUD treebanks, small treebanks and low-resource languages are shown in Table 6. Overall, our system performs similarly to the BASELINE UDPipe 1.2 system, which is not surprising as we closely followed the hyperparameter settings and data splitting of the baseline system on big and small treebanks.
As described in Section 3.4, we select training sets differently for the low-resource treebanks, including the PUD treebanks and other treebanks without training data. Table 7 compares our results with the baseline system on those treebanks. Our system shows a consistent improvement over the baseline model for all three metrics on the PUD treebanks, which suggests that enlarging the training data with different treebanks of the same language indeed helps to build a better model. Our system shows a slight ad-
Table 8 shows the results on each PUD treebank and the model we selected for testing. The models with the suffix '_all' are those trained on all treebanks of the same language. During the test phase, we evaluated both the models trained on all treebanks of the same language and those trained on the largest treebank of that language, and compared the rounded results to decide which one to take for our final system. For English and Swedish, the models trained on the mixture of all treebanks of the same language perform better than those trained on the single largest treebank. However, for Czech and Finnish, the models trained on the largest treebank score higher. We conjecture that more training data provides more information for modeling, while training sets that are too large also introduce noise for a specific domain.
Table 9: LAS F1 scores and rankings of our system on low-resource languages.
Table 9 shows the LAS F1 scores and rankings for each low-resource language. Note that our system performs well on the pcm_nsc and th_pud treebanks, where most teams obtain unsatisfactory results. This suggests that our data selection method improves model performance to some extent.
Table 10: Other metrics and rankings of our system.
Table 10 shows the F1 scores and our system rankings on other metrics. In particular, our system ranks first in the Sentence Segmentation F1 score on the PUD treebanks.
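For readers unfamiliar with the main metric, the sketch below computes a simplified labeled attachment score (LAS) and a macro-average over treebanks. It assumes that system and gold output have identical tokenization, so that precision, recall and F1 coincide; the official evaluation instead aligns system words with gold words and reports F1. Ignoring relation subtypes when comparing labels is also an assumption of this sketch, and the per-treebank scores in the example are made up.

```python
def las(gold, system):
    """Percentage of words whose head and dependency relation are both correct.

    `gold` and `system` are equally long lists of (head, deprel) pairs.
    """
    if len(gold) != len(system) or not gold:
        raise ValueError("expected two non-empty, equally long word lists")
    correct = sum(
        1
        for (g_head, g_rel), (s_head, s_rel) in zip(gold, system)
        # Assumption: compare only the universal part of the relation label.
        if g_head == s_head and g_rel.split(":")[0] == s_rel.split(":")[0]
    )
    return 100.0 * correct / len(gold)


def macro_average(scores):
    """Unweighted mean of per-treebank scores."""
    return sum(scores) / len(scores)


gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
system = [(2, "nsubj"), (0, "root"), (2, "nmod")]
score = las(gold, system)
print(round(score, 2))
print(macro_average([80.0, 60.0, 40.0]))  # made-up per-treebank scores
```

Macro-averaging gives every treebank the same weight regardless of its size, so low-resource treebanks influence the overall score as much as large ones.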

Conclusion
In this paper, we describe our system for the CoNLL 2018 UD Shared Task. We focus on improving the accuracy on the low-resource treebanks over the baseline. The results of the official blind test show that our system achieves 65.61%, 52.26% and 55.71% in macro-averaged LAS, MLAS and BLEX, respectively.