A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task on “Multilingual Parsing from Raw Text to Universal Dependencies”. Our parser extends the monolingual BIST-parser into a multi-source, multilingual trainable parser. Thanks to multilingual word embeddings and one-hot encodings for languages, our system supports both monolingual and multi-source training. We trained 69 monolingual models and 13 multilingual models for the shared task. Our multilingual approach, which makes use of different resources, yields better results than the monolingual approach for 11 languages. Our system ranked 5th and achieved an overall LAS score of 70.93 over the 81 test corpora (macro-averaged LAS F1 score).


Introduction
Many existing parsers are trainable on monolingual data only. Such systems typically take a monolingual corpus as input, along with monolingual word embeddings and possibly monolingual dictionaries or other knowledge sources. However, for resource-poor languages such as Kurmanji and Buryat, there are generally not enough resources to train an efficient parser. One reasonable approach is then to infer knowledge from similar languages (Tiedemann, 2015). Developing tools to process several languages, including resource-poor ones, has been attempted in many different ways in the past (Heid and Raab, 1989). Thanks to Universal Dependencies (Nivre et al., 2016), it is now possible to train a system for several languages from the same set of POS tags. It has also been demonstrated that, with current machine learning approaches, parsing accuracy improves when using multilingual word embeddings (i.e. word embeddings inferred from corpora in different languages), even for resource-rich languages (Ammar et al., 2016a; Guo et al., 2015).
In this paper, we describe the development of a system using either a monolingual or multilingual strategy (depending on the kind of resources available for each language considered) for the CoNLL 2017 shared task (Zeman et al., 2017). For the multilingual model, we assume that learning over words and POS sequences is a first step from which better parsers can then be derived. For this reason, we re-used most of the training algorithms implemented for the BIST-parser since these have proven to be effective when dealing with sequential information even for long sentences, thanks to bidirectional LSTM feature representations (Kiperwasser and Goldberg, 2016).
In addition, our parser can rely on multilingual word embeddings that merge different word vectors into a single vector space in order to obtain multi-source models. To build such embeddings, we extend the bilingual word mapping approach of (Artetxe et al., 2016) so that it can deal with multilingual data. In this experiment, we only used the multilingual-embedding approach for two groups of languages: (i) resource-poor languages for which fewer than 30 sentences were provided for training, namely the surprise languages and Kazakh, and (ii) a group of 7 resource-rich languages that are all Indo-European. The latter group shows that even the analysis of resource-rich languages can be improved by a multilingual approach.
Although our multilingual approach would theoretically allow us to train a single model for all the languages considered in the evaluation, relevant results can only be obtained by taking language similarities and typological information into account. Moreover, given the limited time and the specific resource environment of the shared task, it was hard to beat the monolingual approach with a multilingual one for resource-rich languages, since training new word embeddings is time-consuming. We therefore processed 69 corpora with monolingual models, and 13 corpora with our multilingual approach.
In what follows we describe the architecture of our system (section 2), our monolingual (section 3) as well as our multilingual approach (section 4). Finally, we compare the results with the baseline provided by UDPipe1.1 and with the results of other teams (section 5).

System Overview
Our system extends the graph-based variant (Taskar et al., 2005) of the BIST-parser, which by default works with monolingual data. The graph-based BIST-parser uses bidirectional Long Short-Term Memory (LSTM) feature representations computed by two neural network layers (Kiperwasser and Goldberg, 2016). In order to select the best relation and head for each token in a sentence, Kiperwasser and Goldberg feed the output of the bidirectional LSTM into a Multi-Layer Perceptron (MLP) through one additional neural layer. We adopt the same feature representation and MLP, but different training features and decision models.
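As a rough illustration of this scoring step, the following numpy sketch scores a candidate head-modifier arc by passing the concatenated BiLSTM vectors of the two tokens through a one-hidden-layer MLP. All names, dimensions and weights here are illustrative, not those of the actual system:

```python
import numpy as np

def score_arc(v_head, v_mod, W1, b1, w2):
    """Score a candidate head->modifier arc from the two tokens'
    BiLSTM feature vectors with a one-hidden-layer MLP."""
    h = np.tanh(W1 @ np.concatenate([v_head, v_mod]) + b1)  # hidden layer
    return float(w2 @ h)                                    # scalar arc score

# Toy setup: a 3-token sentence with 8-dimensional BiLSTM outputs.
rng = np.random.default_rng(0)
d, hidden = 8, 16
W1 = rng.standard_normal((hidden, 2 * d))
b1 = rng.standard_normal(hidden)
w2 = rng.standard_normal(hidden)
vecs = rng.standard_normal((3, d))

# Score every (head, modifier) pair; a decoder then extracts the best tree.
scores = [[score_arc(vecs[h], vecs[m], W1, b1, w2) for m in range(3)]
          for h in range(3)]
```

In the real parser these weights are trained jointly with the BiLSTM, and the score matrix is fed to a maximum-spanning-tree decoder.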
In order to adapt the parser to a multilingual approach, we add new parameters and new features to the training algorithm, notably the possibility to use multilingual word embeddings and one hot encoding to encode languages. Finally, the parser can be trained on monolingual and multilingual data depending on the parameters chosen for training. An overview of the overall architecture is given in Figure 1. Details on word embeddings along with the number of dimensions considered are given below.
• Word: randomly initialised word embedding (100)
• XPOS: language-specific POS (25)
• UPOS: universal POS (25) (see http://universaldependencies.org/format.html for the UD annotation format)
• External embedding 1: pretrained word embedding (100)
• External embedding 2: pretrained word embedding that replaces Word (100)
• One-hot encoding: one-hot encoding of the language ID (65)

Word refers to words taken from the training corpora and used as lexical features with vectorized embeddings in our parser. Both XPOS and UPOS are used as delexicalized features. The Word and POS vectors are initialised randomly at the start of the training phase. In addition, two external word embeddings can be added to the word representations: one is concatenated with the Word vector, while the other replaces it. For example, let Word be a randomly generated 100-dimensional vector, and let External1 and External2 be pretrained 100-dimensional word embeddings built from different resources. If we only add External1, the final word representation is the concatenation Word+External1 (200 dimensions). If we only add External2, Word is discarded because it is replaced, so the final representation is External2 (100 dimensions). If both are used, the final representation is External1+External2 (Word again being replaced by External2). In our experiments, when two external word embeddings were available, replacing the Word vector with one of them gave better results than concatenating both with Word.

Since our goal is to develop a multilingual parsing model, we took the idea of one-hot language encodings from (Ammar et al., 2016a): a language one-hot encoding is added as an additional feature when training multilingual models, which allows the model to directly capture language specificities. The encoding has 65 dimensions because there are 64 languages in UD 2.0 (Nivre et al., 2017), plus one for unknown languages.
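The composition described above can be sketched as follows. This is a minimal illustration: the function name, table arguments and lookup conventions are our own, not the actual implementation's.

```python
import numpy as np

def build_token_vector(word, upos_vec, xpos_vec, word_table, ext1, ext2,
                       lang_id, n_langs=65):
    """Compose one token representation following the feature list above:
    Word (replaced by External2 when available), the two POS vectors,
    External1 concatenated when available, and a language one-hot."""
    w = ext2.get(word, word_table[word])   # External2 replaces Word entirely
    parts = [w, upos_vec, xpos_vec]
    if word in ext1:                       # External1 is concatenated
        parts.append(ext1[word])
    one_hot = np.zeros(n_langs)            # language ID as a one-hot vector
    one_hot[lang_id] = 1.0
    parts.append(one_hot)
    return np.concatenate(parts)
```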

Monolingual Model
There were 81 different corpora to be parsed within the CoNLL 2017 shared task. We used a monolingual approach for 69 corpora, and our results are detailed in section 5. As mentioned above, training a monolingual model in our system is very similar to training a BIST-parser model. However, we made two modifications to the original approach.
Multiple roots: The BIST-parser can generate multiple roots for a given sentence. This is not a problem in general, but the shared task requires exactly one root per sentence. Failing to detect the right root leads to major errors, so the problem had to be addressed carefully. We chose a simple algorithm: when the parser returns multiple roots, our system revises the analysis so as to keep one single primary root and reattaches the other former roots as dependents of this new head node. The choice of the primary root is the result of an empirical process depending on the language considered (i.e. taking language-specific word order into account). For example, the primary root is the first root candidate in an English sentence and the last one in a Korean sentence. This very simple trick improved the overall LAS F1 by 0.43 on the development set.
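The revision step can be sketched as follows. This is a simplified reconstruction; the function name and the 1-indexed head convention are ours.

```python
def revise_multiple_roots(heads, root_first=True):
    """Keep a single primary root in a predicted parse.

    heads: list where heads[i] is the head of token i+1 (0 means ROOT),
           following the 1-indexed CoNLL-U convention.
    root_first: keep the first root (e.g. English) or the last (e.g. Korean);
           the other former roots are reattached to the primary root.
    """
    roots = [i for i, h in enumerate(heads) if h == 0]
    if len(roots) <= 1:
        return heads                      # already a single-rooted tree
    primary = roots[0] if root_first else roots[-1]
    return [0 if i == primary else (primary + 1 if h == 0 else h)
            for i, h in enumerate(heads)]
```

For example, for `heads = [0, 1, 0, 3]` (two roots, at tokens 1 and 3), the first-root strategy yields `[0, 1, 1, 3]` while the last-root strategy yields `[3, 1, 0, 3]`.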
Customizing for UD: The BIST-parser is not natively adapted to the Universal Dependencies format, so several changes had to be made. First, we added both XPOS and UPOS categories to the parser. Second, if a word in a training sentence was absent from the external word embeddings, we replaced the word with its lemma. Third, we used the external word embeddings provided by the shared task organizers (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989) and concatenated them with the original Word embedding.
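The lemma back-off for out-of-embedding words can be sketched like this (a hypothetical helper written for illustration, not the actual code):

```python
def lookup_embedding(word, lemma, embeddings, fallback):
    """Look up a pretrained vector for a word; if the surface form is
    missing from the embeddings, back off to its lemma, and otherwise
    return a shared fallback vector."""
    if word in embeddings:
        return embeddings[word]
    if lemma in embeddings:
        return embeddings[lemma]
    return fallback
```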

Multilingual Model
We processed 13 test corpora with our multilingual model. For these corpora, the F1-measure of our multilingual system is better than that of our monolingual system for resource-poor languages, and even for most of the resource-rich ones.

Surprise Languages and Kazakh
Four surprise languages were provided for evaluation within the CoNLL 2017 shared task: Buryat, Kurmanji, North Sámi and Upper Sorbian (all in the Universal Dependencies format). Fewer than 30 sentences were provided for training each of them, and Kazakh also came with only 30 training sentences. We divided each training corpus in half: half of the data were set apart for development and never used for training.
Word embeddings. The first step for training a multilingual model is finding typologically similar languages. We therefore selected three languages for each surprise language in order to derive multilingual word embeddings. The choice of languages was based on the World Atlas of Language Structures (WALS, http://wals.info), which provides information about the world's languages (family, latitude and longitude), and on advice from linguists.
Bilingual Dictionary. There have been many attempts to build multilingual embeddings (Ammar et al., 2016b; Smith et al., 2017). One simple but powerful method is to find a linear transformation matrix between two monolingual embeddings: (Artetxe et al., 2016) propose to do this with pretrained word embeddings and bilingual dictionaries. We tried to follow their approach using the monolingual embeddings released by Facebook research and building bilingual dictionaries. Unfortunately, for the surprise languages there were few resources (even of limited coverage) from which to build a bilingual dictionary.
For some languages we were able to find bilingual dictionaries from OPUS. When no corpus was available, we used Swadesh lists from Wikipedia dumps. Swadesh lists are composed of 207 bilingual word pairs that are meant to cover "basic concepts for the purposes of historical-comparative linguistics". Finally, we mapped both embeddings into a single vector space.
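The mapping step can be sketched with an orthogonal Procrustes solution, in the spirit of Artetxe et al. (2016). This is a minimal numpy sketch under our own naming; the actual method additionally normalises and mean-centres the embeddings:

```python
import numpy as np

def learn_mapping(X, Z):
    """Learn an orthogonal matrix W minimising ||X @ W - Z||_F, where the
    rows of X and Z hold source/target vectors for the dictionary (or
    Swadesh-list) word pairs. With W constrained to be orthogonal, the
    closed-form solution is U @ Vt from the SVD of X.T @ Z."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# Sanity check: recover a known orthogonal transform from 50 word pairs.
rng = np.random.default_rng(0)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal "true" map
X = rng.standard_normal((50, d))
Z = X @ Q
W = learn_mapping(X, Z)
```

Once W is learned from the dictionary pairs, every source-language vector can be projected into the target space, yielding a shared vector space for training.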
Table 1 shows details about the selected pairs of languages and the different sources used for our dictionaries. From these resources, we trained a multilingual model for each pair of candidate languages and, after testing on the development set we had set apart, selected the best candidate for each surprise language and for Kazakh.

Italian and Portuguese
There have been several attempts at training multilingual models for resource-rich languages (Guo et al., 2016; Ammar et al., 2016a). We tested our multilingual system on resource-rich languages in the same way as described in the previous section, except of course for the resources used. We used the multilingual word embeddings for 7 languages presented in Ammar et al.'s paper (average and Brown cluster model), and then trained a multilingual model on the training sets provided for these 7 languages. Although the word vectors of the multilingual embeddings are almost 10 times smaller than the monolingual embeddings released by Facebook research, the F1-measure is slightly better than with the monolingual model for Italian and Portuguese, by 0.39 and 0.41 points respectively on the development sets.

Czech-CLTT, Galician-TreeGal, Slovenian-SST
These three corpora have a small number of training sentences. We therefore chose to train them together, but with different language one-hot encoding values.

Experimental Results
Because we wanted to focus on the dependency parsing task, we used automatically annotated corpora for testing and also trained all models with the annotated corpora provided by UDPipe (Straka et al., 2016). As described in section 4, we used different word embeddings and training corpora for multilingual models. As for monolingual models, we simply trained the system with monolingual embeddings (see details in section 3).
Overall results. Tables 2, 3 and 4 show the official results (except for Italian ParTUT), using the F1-measure computed by the TIRA platform (Potthast et al., 2014) for the CoNLL 2017 shared task (http://universaldependencies.org/conll17/results.html). Our system achieved 70.93 F1 (LAS) over the 81 test sets and ranked 5th out of 33 teams. The average gap between the UDPipe 1.1 baseline (Straka et al., 2016) and our system is 2.58 LAS in our favor. Our system is also better at avoiding over-fitting: performance gaps narrow on the PUD test sets (for example, our system ranked second best on English PUD and Russian PUD), which is encouraging for practical applications.
Multilingual model. Table 4 shows the results obtained with the multilingual models on the small-treebank dataset (fr_partut, ga, gl_treegal, kk, la, sl_sst, ug, uk). We ranked 4th on this group of languages, with a 54.78 LAS score. On the extremely resource-poor languages (the surprise languages), however, we ranked only 12th, with a 36.93 LAS score. This is slightly lower than the UDPipe 1.1 baseline; we assume this is a consequence of using only half of each corpus for training the surprise languages (section 4). Comparing monolingual and multilingual models on the surprise languages, we observe an improvement of between 2.5 and 9.31 points. The same kind of improvement can be observed for the ParTUT group, where the multilingual approach improves performance by almost 3 points.

Conclusion
In this paper, we have described our system for multilingual dependency parsing, tested on the 81 Universal Dependencies corpora provided for the CoNLL 2017 shared task. Our parser mainly extends the monolingual BIST-parser into a multi-source trainable parser. We proposed three main contributions: (1) the integration of multilingual word embeddings and one-hot encodings for the different languages, which means our system can work with monolingual models as well as multilingual ones; (2) a simple but effective way to solve the multiple-roots problem of the original BIST-parser; and (3) an original approach to the elaboration of multilingual dictionaries for resource-poor languages and the projection of monolingual word embeddings into a single vector space. Our system ranked 5th and achieved an overall F1-measure of 70.93 over the 81 test corpora provided for evaluation. We are confident there is room for improvement, since this system is only preliminary and many components could be optimized. A better account of language typology could also help the process and show the benefit of linguistic knowledge in this kind of environment.