UParse: the Edinburgh system for the CoNLL 2017 UD shared task

This paper presents our submissions for the CoNLL 2017 UD Shared Task. Our parser, called UParse, is based on a neural network graph-based dependency parser. The parser uses features from a bidirectional LSTM to produce a distribution over possible heads for each word in the sentence. To allow transfer learning for low-resource treebanks and surprise languages, we train several multilingual models for related languages, grouped by genus and language family. Out of 33 participants, our system ranks ninth in the main results, with 75.49 UAS and 68.87 LAS F1 scores (averaged across 81 treebanks).


Introduction
Dependency parsing aims to automatically extract dependencies between words in a sentence, in the form of a tree structure. These dependencies define the grammatical structure of the sentence, which makes them beneficial for many natural language applications, such as question answering (Cui et al., 2005), machine translation (Carreras and Collins, 2009), and information extraction (Angeli et al., 2015). The most common approaches to dependency parsing are transition-based (Nivre et al., 2006) or graph-based (McDonald et al., 2005). Recent work also applies neural network approaches to dependency parsing (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016; Zhang et al., 2017), particularly for learning rich feature representations that improve parser accuracy.
To train a high-quality parser, one typically needs a large treebank annotated with linguistic information such as part of speech (POS) tags, lemmas, and morphological features. However, human annotation is expensive. As a result, most work has focused on a few languages, such as English, Czech, or Chinese.
The Universal Dependencies (UD; Nivre et al. (2016)) is an initiative to develop consistent treebank annotations across many languages. It provides an opportunity to perform model transfer: using models trained on high-resource languages to parse low-resource languages, allowing the development of treebanks for many more languages. Several works (McDonald et al., 2011; Zhang and Barzilay, 2015; Duong et al., 2015a,b; Guo et al., 2015, 2016) have shown that this technique can help improve accuracy for low-resource languages, and in fact recent work of Ammar et al. (2016) demonstrated that it is possible to train a single multilingual model that works well both in low-resource and high-resource settings.
The CoNLL 2017 UD Shared Task (Zeman et al., 2017) uses Universal Dependencies version 2.0, whose training data consist of 64 treebanks from 45 languages. Some of the challenges are the truly low-resource treebanks (e.g., Kazakh and Uyghur, with only 30 and 100 training sentences, respectively), small treebanks without development data (e.g., Irish, French-ParTUT, Galician-TreeGal, Ukrainian), and the surprise languages and treebanks that must be parsed during the test phase.
To address these challenges, we designed our system for the shared task to use both monolingual and multilingual models. In particular:
• We train one monolingual model per high-resource treebank in the training set.
• For low-resource treebanks, we train several multilingual models, each for related languages grouped by their genus and language families.
• For surprise languages, we train several delexicalized parsers using treebanks that are closest to the surprise languages in terms of language family.
Our parsing model uses pre-trained word vectors, gold universal POS tags (UPOS), and gold morphological analyses (XFEATS, if available). For the multilingual models, we also use a language ID and replace the pre-trained word vectors with multilingual word vectors. For the delexicalized models, we remove the word vectors from our feature set, because we want to use the model for other languages with different vocabularies. We submitted three systems, which are described in Section 5. The final ranking of the shared task places our parser ninth, with average UAS and LAS of 75.49 and 68.87, respectively. On the surprise languages, our system reaches sixth place, with 39.17 LAS.

System Description
Our system, called UParse, is a combination of monolingual, multilingual, and delexicalized models. In this section, we describe our parsing model which extends DENSE, the neural network graph-based parser of Zhang et al. (2017).

DENSE Parser
DENSE (Dependency Neural Selection) is a neural graph-based parser which generates a dependency tree by predicting the head of each word in a sentence. Given an input sentence of length N, the parser first produces N head-dependent dependency arcs by greedily selecting the most likely head for each word. If the predicted dependency arcs do not form a (projective) tree structure, a maximum spanning tree algorithm is used to adjust the output to a (projective) tree. In the following, we describe the DENSE parser in detail.
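This two-step decoding (greedy head selection, with an MST fallback when the result is not a tree) can be sketched as follows. This is an illustrative reimplementation, not the DENSE code: the function names and the scoring-matrix convention are ours.

```python
import numpy as np

def greedy_heads(scores):
    """Pick the highest-scoring head for each word.
    scores[i, j] = score of word j being the head of word i;
    index 0 is the artificial ROOT and is never a dependent."""
    n = scores.shape[0]
    heads = np.zeros(n, dtype=int)
    for dep in range(1, n):
        s = scores[dep].copy()
        s[dep] = -np.inf            # a word cannot head itself
        heads[dep] = int(np.argmax(s))
    return heads

def is_tree(heads):
    """True if every non-root word reaches ROOT (index 0) without cycles;
    if False, a maximum spanning tree decoder adjusts the output."""
    for dep in range(1, len(heads)):
        seen, node = set(), dep
        while node != 0:
            if node in seen:
                return False        # cycle detected
            seen.add(node)
            node = heads[node]
    return True
```

In practice the `is_tree` check decides whether the greedy predictions are kept as-is or handed to the Chu-Liu-Edmonds or Eisner decoder.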
Token Representations. In the first step, the parser computes a representation of each word in the sentence. The objective is to encode both local information (lexical meaning and POS tag) and global information (word position and context). To do this, the parser uses a bidirectional LSTM (bi-LSTM), which has been shown to be effective in capturing long-term dependencies. More formally, let S = (w_0, w_1, . . . , w_N) be the input sentence of length N, where w_0 denotes the artificial ROOT token. Each input token w_i is represented by x_i, the concatenation of its word and POS tag embeddings, e(w_i) and e(t_i):

x_i = [e(w_i); e(t_i)]  (1)
These representations are the input to the bi-LSTM, which produces a sentence-specific representation of token w_i, computed by concatenating the hidden states of a forward and a backward LSTM:

a_i = [h_i^f; h_i^b]  (2)

where h_i^f and h_i^b denote the hidden states of the forward and backward LSTMs.
Head Predictions. For each token w_i, the parser computes the probability of w_j being its head as:

P_head(w_j | w_i, S) = exp(g(a_j, a_i)) / Σ_k exp(g(a_k, a_i))  (3)

where a_i and a_j are the representations of w_i and w_j, respectively. The function g is a single-layer neural network which computes an association score between the two words:

g(a_j, a_i) = v_a · tanh(U_a a_j + W_a a_i)  (4)

Note that this step is similar to the neural attention mechanism in sequence-to-sequence models (Bahdanau et al., 2015). The model is trained to minimize the negative log likelihood of the gold-standard head-dependent arcs over all training sentences. At test time, the parser greedily chooses the most probable head for each word in the sentence.
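The head-scoring computation can be sketched as below. The parameter names `U`, `W`, and `v` are illustrative placeholders for the learned weights; the real model computes this batched inside the network rather than in explicit loops.

```python
import numpy as np

def score_heads(A, U, W, v):
    """Compute a softmax-normalised head distribution per dependent.
    A: (n, d) matrix of bi-LSTM token representations a_0 .. a_{n-1};
    U, W: (d, d) weight matrices; v: (d,) output vector.
    Returns P where P[i, j] = probability that word j heads word i."""
    n = A.shape[0]
    g = np.empty((n, n))
    for i in range(n):              # dependent
        for j in range(n):          # candidate head
            g[i, j] = v @ np.tanh(U @ A[j] + W @ A[i])
    g = np.exp(g - g.max(axis=1, keepdims=True))   # stable softmax
    return g / g.sum(axis=1, keepdims=True)        # rows sum to 1
```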
Adjusting Tree Outputs. In many cases, the individual predictions already form a tree. If they do not, a maximum spanning tree (MST) algorithm is used to constrain the set of predictions to form a tree. DENSE can use two algorithms: the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to generate non-projective trees, and the Eisner algorithm (Eisner, 1996) to generate projective trees. The choice of MST algorithm depends on the language's treebank. For the shared task, we assume that each language can produce non-projective trees.
Label Predictions. After obtaining the unlabeled dependency tree, the parser predicts labels using a two-layer rectifier network (Glorot et al., 2011). More formally, to predict the arc label between w_i and w_j, the classifier takes as input the concatenation of the local (Eq. 1) and global (Eq. 2) vector representations of both words, [a_i; a_j; x_i; x_j], and predicts a valid dependency label. Zhang et al. (2017) present a more detailed account of the parsing model.
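The label classifier amounts to a plain feed-forward pass over the concatenated representations; a minimal sketch follows, with illustrative weight shapes (the actual dimensions are not specified here).

```python
import numpy as np

def predict_label(a_i, a_j, x_i, x_j, W1, W2):
    """Two-layer rectifier network for arc labelling.
    Input: the concatenation [a_i; a_j; x_i; x_j] of the global (bi-LSTM)
    and local (embedding) representations of head and dependent.
    Output: the index of the highest-scoring dependency label."""
    h = np.concatenate([a_i, a_j, x_i, x_j])
    h = np.maximum(0.0, W1 @ h)        # rectified (ReLU) hidden layer
    return int(np.argmax(W2 @ h))      # pick the best label
```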

UParse
Next, we describe UParse, the extended version of DENSE which we use for the UD shared task. As mentioned in Section 1, UParse is a combination of monolingual, multilingual, UDPipe baseline, and delexicalized models. The key difference between DENSE and UParse is in the type of features used for training. UParse uses richer linguistic features, namely word embeddings, universal POS tags (UPOS), morphological analyses (XFEATS), and language IDs (LID). This design is mostly inspired by the work of Ammar et al. (2016) and Straka et al. (2016) on monolingual and multilingual parsing models. Each feature is represented by a vector, and we concatenate these vectors to represent each input token fed into the bi-LSTM. Specifically, we modify Eq. 1 to

x_i = [e(w_i); e(t_i); e(m_i); e(lid_i)]  (5)

where e(m_i) and e(lid_i) denote the embeddings of XFEATS and language ID, respectively. During training, our system uses the gold annotations (tokenization, UPOS, and XFEATS) provided in the data. At test time, it uses predicted annotations produced by UDPipe (Straka et al., 2016).
Table 1 shows the feature set used by each type of model in UParse. We employ the original DENSE architecture for the monolingual models, with one additional feature (XFEATS, if available). For the multilingual models, we replace the standard word embeddings with multilingual word embeddings (Section 2.3). This is important since we need to project word vectors of different languages into the same vector space. We also use the language ID as a feature, to inform the parser of the language of the sentence it is currently parsing. This allows the model to learn not only dependency features that transfer across languages, but also language-specific features.

Multilingual Word Embeddings
Following Ammar et al. (2016), we adapt the robust projection approach of Guo et al. (2016) to build our multilingual word embeddings. The idea is to train word embeddings on a source language and project them to obtain word embeddings for the target languages. For the shared task, we use English pre-trained word vectors trained on Wikipedia data (Bojanowski et al., 2016) as our source embeddings. Next, we use OPUS data (Tiedemann, 2009, 2012) to build alignment dictionaries for languages that have parallel text with English. Specifically, we use the parallel corpora of Europarl, Global Voices, Wikipedia, and hrWaC (for Croatian).
To build the alignment dictionaries, we use the fast_align toolkit (Dyer et al., 2013). We then compute a vector for each target word as the weighted average of the embeddings of its aligned English words, weighted by the alignment probabilities. A limitation of this approach is that it only creates embeddings for target words that appear in the parallel data. Thus, as a final step, we also compute embeddings for target words not aligned with any source word, by averaging the embeddings of all aligned target words within an edit distance of 1. The token-level embeddings are shared across languages.
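The weighted-average projection step can be sketched as below. The data structures are illustrative: in the actual pipeline, the alignment dictionary would be read from fast_align output rather than built by hand.

```python
def project_embeddings(src_emb, alignments):
    """Robust-projection step (after Guo et al., 2016): give each target
    word the probability-weighted average of its aligned English vectors.
    src_emb: {english_word: vector (list of floats)}
    alignments: {target_word: [(english_word, alignment_prob), ...]}"""
    tgt_emb = {}
    for tgt, aligned in alignments.items():
        total = sum(p for _, p in aligned)      # normalise the weights
        dim = len(next(iter(src_emb.values())))
        vec = [0.0] * dim
        for src, p in aligned:
            for k, x in enumerate(src_emb[src]):
                vec[k] += (p / total) * x
        tgt_emb[tgt] = vec
    return tgt_emb
```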

Preliminary Experiments
Prior to our participation in the shared task, we ran a number of preliminary experiments that informed the design of the final system. Our shared task submission is based on these results.
In our preliminary experiments, our main goal is to evaluate the multilingual model of UParse. These experiments are mainly inspired by the multilingual setup of Ammar et al. (2016). In addition, we also compare the performance of our monolingual models with the UDPipe parser. Tables 2 and 3 present the performance of our parser compared to the UDPipe (monolingual) and MALOPA (multilingual) parsers. In terms of UAS, our multilingual model achieves the best scores, except for English, German, and French. The results for LAS are slightly different. We found that for languages with more than 10K training sentences, our monolingual model outperforms the other models, with the exception of Italian. For the smaller treebanks, although we see UAS improvements for Portuguese and Swedish with the multilingual model, we only obtain a LAS improvement on Portuguese. We believe these mixed results are due to the poor accuracy of our label classifier, since the UAS results demonstrate that the parser itself is quite effective in predicting dependency arcs.

Experiments
This section describes the experimental design, training, and our submissions to the shared task. After examining the results of our preliminary experiments, we decided to train both monolingual and multilingual parsers, evaluate them on the shared task development data, and choose the best settings for our submissions.

Language Groups
To build the multilingual models, we first group the treebanks such that treebanks of related languages are trained in a single model. We use genus and language family information taken from the World Atlas of Language Structures (WALS; Dryer and Haspelmath (2013)) to group the languages. Each treebank whose language is not related to any other treebank is placed in a singleton group, which is equivalent to a monolingual model. For classical languages like Ancient Greek, Latin, Gothic, and Old Church Slavonic, we place them in the same group instead of using the WALS information. Table 4 shows the language groups used in UParse.
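Conceptually, the grouping amounts to bucketing treebanks by the WALS genus of their language, with languages that share a genus with no other training language falling into singleton groups. The genus values below are illustrative examples, not the full grouping of Table 4.

```python
def group_treebanks(genus_of, treebanks):
    """Group treebanks by the WALS genus of their language.
    genus_of: {language: genus}; a language missing from the map
    falls back to its own name, i.e. a singleton (monolingual) group.
    treebanks: {treebank_id: language}."""
    groups = {}
    for tb, lang in treebanks.items():
        groups.setdefault(genus_of.get(lang, lang), []).append(tb)
    return groups
```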

Training
In the preprocessing step, following the common setup in parsing, we remove multiword tokens and language-specific dependency relations. For multilingual training, we also combine treebanks of the same language into the same training data. We use two additional datasets: pre-trained word embeddings from Bojanowski et al. (2016) and OPUS parallel data (Tiedemann, 2009, 2012). Unless explicitly mentioned, we follow the same training configurations as described in Zhang et al. (2017). We use two-layer bi-LSTMs with 150 hidden units, and set the embedding sizes for {words, UPOS, XFEATS, LID} to {300, 30, 40, 10}, respectively. The word embedding size matches that of the pre-trained embeddings. We did not use the Czech-CLTT or any ParTUT treebanks for training, since they contain many long sentences (the longest sentence in the Czech-CLTT treebank consists of 534 words). At test time, we parse these treebanks using models trained on the same language. We trained our models on an Nvidia GPU card; training a monolingual model takes 1-2 hours, while training a multilingual model takes 4-5 hours.
Word embeddings. For monolingual training, we initialize the embeddings with the pre-trained ones and keep them fixed during training. For the multilingual models, we first create multilingual word embeddings as described in Section 2.3, using OPUS parallel data and English as the source language. Unlike Ammar et al. (2016) and Guo et al. (2016), we also share representations for word forms used by more than one language. For example, if system appears in both the English and German data, we use a single vector to represent it. This allows parameter sharing across words with the same form but different meanings; on the other hand, it also gives named entities and loanwords the same representation across languages. We initialize the embeddings with the multilingual word embeddings and update them during training. For all models, embeddings for words with no pre-trained representation are initialized uniformly at random in the range [-0.1, 0.1].
Optimization for multilingual training. For multilingual training, we follow Ammar et al. (2016) when updating the parameters. Specifically, we use mini-batch updates in which we uniformly sample (without replacement) the same number of sentences from each treebank, until all sentences in the smallest treebank have been used. In other words, each epoch uses N × L sentences, where N is the number of sentences in the smallest treebank and L is the number of languages.
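The epoch construction can be sketched as follows; this is an illustrative reading of the sampling scheme, not the actual training code.

```python
import random

def multilingual_epoch(treebanks, seed=0):
    """Build one epoch of multilingual training data (after Ammar et al.,
    2016): sample, without replacement, the same number of sentences from
    every treebank, where that number is the size of the smallest treebank,
    then shuffle the mixture. Yields N x L sentences for L treebanks."""
    rng = random.Random(seed)
    n = min(len(sents) for sents in treebanks.values())
    epoch = []
    for lang, sents in treebanks.items():
        epoch.extend((lang, s) for s in rng.sample(sents, n))
    rng.shuffle(epoch)
    return epoch
```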

Truly Low-Resource Treebanks
There are several challenges in training the truly low-resource treebanks, i.e., treebanks with fewer than 2K sentences and no other treebank available for the same language. For example, the Vietnamese treebank only has 1,400 sentences and no related languages in terms of genus or language family. Ideally, we want to apply multilingual learning for these treebanks, since we do not have enough examples to train monolingual models. Moreover, languages like Kazakh and Uyghur have 100 training sentences or fewer and no development data, which makes it difficult to do multilingual training as described above. Our initial experiments show that multilingual learning helps improve accuracy for the truly low-resource treebanks (those with fewer than 1K training sentences), but degrades accuracy for the high-resource treebanks. This is because, under our training setup, each epoch will only consist of a small number of sentences per language. Irish is particularly challenging, with only 566 training sentences, no development data, and no related languages. Our training strategies for these particular cases are as follows:
Estonian and Hungarian. These languages belong to the Uralic language family. Since Finnish has two treebanks with large training data, we train two more multilingual models, one for each language, using the additional Finnish treebanks.
We do not use a single model for both Estonian and Hungarian, since Estonian has more training sentences than Hungarian.
Greek. We train a multilingual model for Greek, using training data from the Ancient Greek and Greek treebanks.
Irish. Since this language does not have any related languages, we use a delexicalized model trained on Czech. We chose Czech because it has the largest treebank.
Kazakh and Uyghur. Since the training data for these two languages are very small, we use a single delexicalized model trained on Turkish. We only use Turkish data during training, but include both the Kazakh and Uyghur training data in the development set.

Surprise Languages and Treebanks
For the surprise languages, since we do not have any training data, we train delexicalized models on related languages. In particular, we use delexicalized Russian for Buryat, Persian for Kurmanji, Finnish for North Sami, and Czech for Upper Sorbian. Note that the delexicalized models of Russian, Finnish, and Czech use all the treebanks of the respective language, thus allowing transfer learning between different treebanks of the same language. For example, to train a delexicalized model of Russian, we use both the UD Russian and UD Russian-SynTagRus treebanks. For the surprise treebanks of known languages, we simply use a parser trained on the other treebanks of that language.

Initial Results on Development Data
During the training phase, we evaluated the performance of our monolingual and multilingual systems on the official development data. Since we use gold annotations (tokenization, UPOS, and XFEATS) as features, we compare our performance with the UDPipe baseline, which also uses gold annotations. Table 5 shows the average UAS and LAS of the monolingual and multilingual systems. Similar to our preliminary results, we see improvements in UAS for the multilingual model, but with LAS lower than the monolingual or even the UDPipe system. Looking at the results for individual treebanks, we found that our models achieve lower LAS than the baseline system especially on the smaller treebanks.

Submission
The UD shared task employs TIRA (Potthast et al., 2014) to evaluate all systems. When we deployed our system on the TIRA virtual machine, we encountered two problems which broke the evaluation script. First, our system sometimes produces multiple roots in its predictions, which the script rejects. To address this, we post-process the predicted tree by taking the first predicted root as the root and connecting the other roots to it with a clausal complement label, ccomp. The second problem occurs when the test data contain sentences longer than the maximum sentence length in the training data. Because we had limited time to address this, we used the following algorithm. Let n be the maximum sentence length allowed by the parsing model. For each sentence of length k, where k > n:
1. Parse the first n words in the sentence.
2. For the remaining k − n words, connect each word to the previous word, and label the arc between them using either a heuristic label (DIST) or a random label (RAND). For DIST, we simply take the most frequent label in the training data for the given head POS and dependent POS pair.
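The two steps above can be sketched as follows. The DIST lookup table and the 1-based head indexing (as in CoNLL-U files) are our illustration of the heuristic, not the actual implementation.

```python
def attach_tail(parsed_heads, sent_pos, max_len, freq_label):
    """Fallback for sentences longer than the model's maximum length:
    keep the parse of the first max_len words, then chain every remaining
    word to the word before it (heads are 1-based, so word i+1 attaches
    to word i), labelling each new arc with the most frequent label for
    that (head POS, dependent POS) pair seen in training (DIST).
    freq_label: {(head_pos, dep_pos): label}; unseen pairs get 'dep'."""
    heads = list(parsed_heads)            # heads of the first max_len words
    labels = []
    for i in range(max_len, len(sent_pos)):
        heads.append(i)                   # 1-based index of previous word
        labels.append(freq_label.get((sent_pos[i - 1], sent_pos[i]), "dep"))
    return heads, labels
```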
We decided to use a combination of monolingual, multilingual, UDPipe (only for the primary system, UP-1), and delexicalized models. For each treebank, we pick the best model based on its performance on the development data. We use UDPipe models for the 24 treebanks on which we achieved lower performance than the baseline on the development data (denoted by (*) in Table 9); UP-2 and UP-3 do not use any UDPipe models. Table 7 lists all treebanks parsed with multilingual or delexicalized models. Our final submission consists of three systems:
1. UP-1: UParse + DIST + UDPipe
2. UP-2: UParse + DIST
3. UP-3: UParse + RAND


Results on Test Data
Table 6 shows the macro-averaged LAS F1 scores for all three systems. Table 8 shows the LAS, UAS, and CLAS (Nivre and Fang, 2017) results of our primary system, UP-1. Detailed results for each treebank and system are given in Table 9. Similar to the results on development data, UP-1 achieves the best macro-averaged F1 score of the three systems (Table 6). The results of UP-2 and UP-3 are quite similar, which is not surprising since there are only a few long sentences in the test data.
We further examine the performance of the UDPipe baseline model versus the UParse models by comparing UP-1 and UP-2 on the 24 treebanks (those marked with (*) in Table 9). Our system achieves lower LAS F1 scores on 16 of these treebanks, which are either treebanks with small training data or treebanks with long sentences, for which we did not train any model. For six other treebanks, our system achieves higher LAS F1 scores than the UDPipe baseline, four of them parsed using multilingual models.
Our system was deployed on a TIRA virtual machine with a quad-core CPU and 16GB RAM. Our primary system took 2 hours and 43 minutes to parse the official test data.

Conclusion and Future Work
We described UParse, our system for the CoNLL 2017 UD Shared Task. The overall results suggest that our parsing model outperforms the UDPipe baseline, except when little training data is available. Our approach of multilingual learning, transferring models from high-resource to low-resource treebanks, seems quite effective for predicting dependency arcs, but less so for label prediction. Nevertheless, we observed improvements for a number of treebanks when using a multilingual model trained on treebanks from related languages.
In light of these results, possible directions for future work include improving the label predictions of the parsing model and exploring character-level models, as they have been shown to be effective for parsing morphologically rich languages. Another interesting direction is to combine morphological analysis with sub-word representations (characters, character n-grams, or morphemes) and investigate whether these features are transferable across languages with similar typology.

Table 9: LAS F1 scores for each treebank in the test data. (*) denotes treebanks which are predicted using UDPipe baseline models in the UP-1 system; the best accuracies are shown in bold.