Multi-Model and Crosslingual Dependency Analysis

This paper describes the system of Team Orange-Deskiñ used for the CoNLL 2017 UD Shared Task in Multilingual Dependency Parsing. We based our approach on an existing open-source tool (BistParser), which we modified in order to produce the required output. Additionally, we added a form of pseudo-projectivisation, needed because some of the task's languages have a high percentage of non-projective dependency trees. In most cases we also employed word embeddings. For the 4 surprise languages, the data provided seemed too small to train on, so we decided to use the training data of typologically close languages instead. Our system achieved a macro-averaged LAS of 68.61% (10th in the overall ranking), which improved to 69.38% after bug fixes.


Introduction
For our work in our lab (Orange-Deskiñ) we needed a robust dependency analysis for written French with the highest Labeled Attachment Score (LAS) 1 possible, using a wide range of dependency relations. Having worked in the past on rule-based dependency analysis, we realised that we needed to adopt a more modern approach. Thus, during the last year we tried several freely available open-source tools (e.g. MaltParser 2 , Google's SyntaxNet 3 , Stanford Dependency Tools 4 , BistParser 5 and HTParser 6 ), trained on different treebanks (notably French Sequoia (Candito et al., 2014) and Universal Dependencies (McDonald et al., 2013)). All combinations of tools and treebanks had advantages and inconveniences. For instance, the underlying linguistic models of the treebanks are not the same, and some tools would not accept CoNLL-U input but only raw text, applying their own segmentation and POS tagging.
In a next step we enriched the French treebanks with additional information like lemmas, morphological features and more fine-grained XPOS in addition to the roughly 20 UPOS categories of the treebanks (UD-French v1.2 contains neither lemmas nor morphological features) and conducted a new training/test/evaluation cycle. Since the initial results for French were encouraging, we tried the same approach with other languages, such as the languages proposed for the CoNLL 2017 UD Shared Task (Zeman et al., 2017). However, for participation in the shared task, we relied exclusively on the data provided by Universal Dependencies (Nivre et al., 2016, 2017b), also for French, in spite of our previous work.
For the shared task we trained models separately for each language. So strictly speaking, this is not a multilingual but a monolingual multi-model approach.
System Description

Software

For the shared task, we used an (older) version of BistParser 5 for all treebanks (ud-treebanks-conll2017). BistParser (Kiperwasser and Goldberg, 2016) is a transition-based parser (Nivre, 2008) which uses the arc-hybrid transition system (Kuhlmann et al., 2011) with the three "basic" transitions LEFT ARC, RIGHT ARC and SHIFT. Since the shared task requires that output dependency trees have exactly one root, we modified BistParser accordingly by deleting the additional ROOT node added to each sentence in the original version of the parser. BistParser uses a bidirectional LSTM neural network. Currently BistParser uses forms and XPOS for both learning and predicting. We have started implementing the use of the feature column as well, but this was not used for the CoNLL 2017 UD Shared Task.

5 https://github.com/elikip/bist-parser
6 https://github.com/elikip/htparser
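As a rough illustration (not BistParser's actual implementation), the three arc-hybrid transitions can be sketched as follows; the oracle interface and index-based token representation are assumptions made for this example.

```python
# A minimal sketch of the arc-hybrid transition system (Kuhlmann et al., 2011).
# Tokens are represented by their positions; heads[i] stores the head of token i.

def shift(stack, buffer, heads):
    # SHIFT: move the first buffer token onto the stack
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, heads):
    # LEFT ARC: the stack top becomes a dependent of the first buffer token
    dep = stack.pop()
    heads[dep] = buffer[0]

def right_arc(stack, buffer, heads):
    # RIGHT ARC: the stack top becomes a dependent of the next stack item
    dep = stack.pop()
    heads[dep] = stack[-1]

def parse(n, oracle):
    """Parse a sentence of n tokens; `oracle` picks the next transition."""
    stack, buffer, heads = [], list(range(n)), {}
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        action(stack, buffer, heads)
    return heads  # the single remaining stack item is the root
```

The single-root requirement mentioned above corresponds to the termination condition: parsing ends when the buffer is empty and exactly one token remains on the stack.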
Some of the languages in the shared task have a large percentage of non-projective sentences. We thus decided to implement a pseudo-projectivisation (Kübler et al., 2009, p. 37) of the input sentences before training or predicting. The output sentences are then de-projectivised. Sometimes, of course, the de-projectivisation can fail, especially if there are other dependency relation errors. Our tests showed, however, that the overall result for most languages is still better than without any pseudo-projectivisation.
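A simplified sketch of the idea, in the spirit of the "head" scheme of pseudo-projectivisation (Nivre and Nilsson, 2005): a crossing arc is lifted to the grandparent and the original head's label is appended to the dependent's label, so that the lift can be undone after parsing. The data representation (heads[i] is the head of token i, -1 for the root) is an assumption; this is not our actual implementation.

```python
# Sketch of pseudo-projectivisation by arc lifting.

def is_projective(heads):
    arcs = [tuple(sorted((d, h))) for d, h in enumerate(heads) if h >= 0]
    # a tree is non-projective iff two of its arcs cross
    return not any(a < c < b < e for a, b in arcs for c, e in arcs)

def find_nonprojective(heads):
    arcs = [(min(d, h), max(d, h), d) for d, h in enumerate(heads) if h >= 0]
    for a, b, d in arcs:
        for c, e, _ in arcs:
            if a < c < b < e or c < a < e < b:
                return d  # dependent of a crossing arc
    return None

def pseudo_projectivise(heads, labels):
    heads, labels = list(heads), list(labels)
    d = find_nonprojective(heads)
    while d is not None:
        # lift: reattach d to its grandparent, recording the path in the label
        labels[d] = labels[d] + "|" + labels[heads[d]]
        heads[d] = heads[heads[d]]  # (a full implementation guards the root case)
        d = find_nonprojective(heads)
    return heads, labels
```

De-projectivisation then searches the tree for a suitable reattachment site guided by the encoded label, which is why it can fail when the parser has made other attachment errors nearby.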
In order to reduce memory usage during training and prediction, we modified BistParser and the underlying CNN library to load word embeddings only for the words present in the training or test data. For the same reason we modified BistParser to read sentences one by one, predict, and output the result, instead of reading the entire test file at once.
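The embedding-filtering part of this optimisation can be sketched as below. The text format assumed here (one "word v1 v2 ..." line per entry) and the function name are illustrative, not BistParser's actual code.

```python
# Keep only the embedding vectors for words that occur in the data,
# instead of loading the complete word2vec file into memory.

def load_filtered_embeddings(path, vocabulary):
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            if parts[0] in vocabulary:  # skip words absent from the data
                vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

Since the vocabulary of a treebank is tiny compared to the vocabulary of a billion-word embedding file, this reduces the memory footprint by orders of magnitude.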

Training Data
We trained our models using all treebanks provided by the CoNLL 2017 UD Shared Task. Since for some of the languages no development treebanks were available, we split the training treebank in order to get a small development corpus (10% of the training corpus was split off for testing during development). This posed a certain problem for treebanks like Kazakh and Uyghur, which are hopelessly small (31 and 100 sentences respectively). Even though both languages are genetically and typologically very close to Turkish (3685 sentences), we finally trained on those small treebanks due to time constraints (with more time available we would have experimented with various other parameters and a cross-lingual approach).
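The development split can be sketched as follows; sentences are assumed to be already read as a list (e.g. of CoNLL-U blocks), and the function name and tail-split strategy are illustrative.

```python
# Split off the last 10% of a training treebank as a development corpus
# when the language ships without one.

def split_train_dev(sentences, dev_fraction=0.1):
    cut = int(len(sentences) * (1 - dev_fraction))
    return sentences[:cut], sentences[cut:]
```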
In most cases, adding word embeddings improved the LAS considerably. We downloaded the language-specific corpora provided 9 by the task organisers and calculated our own word embeddings with Mikolov's word2vec (Mikolov et al., 2013) 10 , which gave better results than the 100-dimensional word embeddings provided. In order to get the best results, we cleaned the text corpora (e.g. deleting letter-digit combinations and separating punctuation symbols such as commas, question marks etc. from the preceding token by a white space). For those languages using an alphabet with case distinction (Latin, Cyrillic and Greek) we lowercased everything. Finally we trained word embeddings with 300- and 500-dimensional vectors respectively. For all other parameters of word2vec we used the default settings, apart from the lower frequency limit, which we increased to 15.
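The clean-up step can be sketched as below. The exact regexes and the function name are assumptions for illustration; our actual pipeline may differ in detail.

```python
import re

# Corpus clean-up before training word2vec: detach punctuation from the
# preceding token, drop letter-digit combinations, and lowercase for
# scripts with case distinction (Latin, Cyrillic, Greek).

def clean_line(line, lowercase=True):
    # separate punctuation symbols such as commas and question marks
    line = re.sub(r"([,;:!?.])", r" \1 ", line)
    # delete tokens that mix letters and digits (e.g. "abc123")
    tokens = [t for t in line.split()
              if not (re.search(r"[^\W\d_]", t) and re.search(r"\d", t))]
    if lowercase:  # only applied for scripts with case distinction
        tokens = [t.lower() for t in tokens]
    return " ".join(tokens)
```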
The word embeddings were calculated on a server with a 32-core CPU running Ubuntu 14.04 11 . For the biggest text corpora, like English (9 billion words), German (5.9 billion words), Indonesian (5 billion words) and French (4.8 billion words), training 500-dimensional word vectors took up to 6 hours (English).
A similar approach to word2vec is fastText. The fundamental difference is the adoption of the "subword model": each word is represented not only by the word itself but also by the combination of its subword components. Subword components can be n-grams with varying values for n, stems, root words, prefixes and suffixes, or any other possible formalism. As a matter of fact, word2vec can be seen as the minimum configuration of fastText, in which only the words themselves are considered. FastText has been demonstrated to perform rather well in two different tasks, i.e. sentiment analysis and tag prediction. For the CoNLL 2017 UD Shared Task we finally used word2vec, since the results were similar but fastText took significantly more time to train.

9 https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
10 https://code.google.com/archive/p/word2vec/
11 Intel Xeon CPU E5-2640 v3 at 2.60GHz.

Training and development
We trained all treebanks with no word embeddings, with 300-dimensional and with 500-dimensional word embeddings. For BistParser, the only other parameter we changed was the size of the first hidden layer (default 100), which we set to 50 (or lower, especially for languages whose treebanks are very small). Every sentence of the training treebanks was pseudo-projectivised before training. Using the weighted LAS, we then chose the best combination of parameters for each language. Since the python version of CNN (used by our adaptation of BistParser) does not support GPU, training was slow 12 . Thus we usually stopped training after 15 epochs unless the intermediary results were promising enough to continue. Figure 1 shows the system architecture: the upper part represents the data flow for training, the lower part the predicting phase. We did all training on two Ubuntu 16.04 servers 13 with 64 GB RAM. As said above, the version of the CNN library we used does not run on GPU, so all training was single-threaded. The training processes used up to 15 GB RAM and took between 1 minute (Kazakh) and 53 hours (Czech), depending on the size of the treebank. This corresponds to 0.5 to 3 seconds per sentence during training. Training for the surprise languages (using treebanks of typologically close languages, cf. section 4) took significantly longer (up to 90 hours for Czech).
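The per-language parameter selection can be sketched as below. `train_and_eval` stands in for a full BistParser training run followed by evaluation on the development set; the function name and the reduced hidden layer size for small treebanks are illustrative assumptions.

```python
# Choose the best configuration per language: train with no, 300- and
# 500-dimensional embeddings and keep the run with the highest weighted
# LAS on the development set.

def select_best(train_and_eval, small_treebank=False):
    # we reduced the default hidden layer size of 100 to 50 or lower
    # (the value 20 for small treebanks is illustrative)
    hidden = 20 if small_treebank else 50
    best = None
    for dims in (None, 300, 500):  # embedding dimensions tried
        las = train_and_eval(embedding_dims=dims, hidden_size=hidden)
        if best is None or las > best[0]:
            best = (las, dims, hidden)
    return best
```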
Training was done on the gold values (form, lemma, XPOS, UPOS, deprel, head) of the training treebanks 14 ; however, both the development set (on the Tira platform) and the final test set use the UDPipe output, e.g. lemma, XPOS or UPOS (Straka et al., 2016), which may be erroneous. So we expected a certain drop in LAS for the tests. In order to be prepared, we tried to add erroneous lemmas and UPOS to the training data. This, however, did not produce better results, so we abandoned the idea. Knowing that training takes a certain time, we did not POS-tag the training treebanks with UDPipe to obtain "noise" similar to that of the test treebanks. The final results obtained with the development corpora (or the splits from the train corpora where no development corpora existed) are shown in table 1. We did not (yet) use the morphological features (column 6). First tests on French showed that a slight increase in LAS is possible, so we will work on this in the future.

Figure 1: Schema of the system architecture

12 The successor of CNN, Dynet, supports GPU, but since BistParser learns on a phrase-by-phrase basis, no gain in time can be observed.
13 Intel Xeon CPU E5-1620 v4 at 3.50GHz and Intel Core i7-6900K CPU at 3.20GHz respectively.
14 Apart from numerous punctuation symbols with wrong heads, we found several bad annotations for words as well in different languages.
With 16 GB RAM on the virtual machine provided by Tira (Potthast et al., 2014) 15 , the 56 development corpora (on the Tira platform) were processed in about 130 minutes.

Surprise languages
The biggest challenge were the 4 surprise languages. Having only between 20 and 109 sentences to train on (even fewer if we wanted to split them into a train and development corpus) did not help (see table 2 for some details). Since the word embedding files were also rather small, we chose not to train on the languages themselves, but to keep all of the provided sentences for the development corpus. We first tried three similar approaches in order to be able to predict dependency relations for these languages. In all three cases we replaced the forms of all closed word classes (i.e. all but nouns, adjectives and verbs) with the corresponding UPOS in the training and in the test corpus (for the CoNLL 2017 UD Shared Task we inserted the original forms again after predicting the dependency relations). The "mix" was then trained with a hidden layer size of either 100 or 50, but without word embeddings. We initially tested these models using the test corpus of the Tamil treebank (UD v2.0), using the "mix" with 23 languages (table 3). By replacing words of the closed word classes by their UPOS we tried to get similar corpora for training (on languages other than the surprise language) and predicting (surprise languages), assuming that the syntactic structures are similar enough, especially if we use only typologically close languages (see below). This technique also avoids the problem of different alphabets for typologically close languages, since we use UPOS and not character strings.
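The replacement step can be sketched as below. The open-class tag set follows the paper's "nouns, adjectives and verbs"; the treatment of other tags such as PROPN, and the simplified line parsing, are our assumptions.

```python
# Overwrite the FORM (column 2) of a CoNLL-U token line with its UPOS
# (column 4), unless the word belongs to an open class.

OPEN_CLASSES = {"NOUN", "ADJ", "VERB"}

def delexicalise(conllu_line):
    cols = conllu_line.split("\t")
    if len(cols) < 4 or not cols[0].isdigit():
        return conllu_line  # comment, empty or multi-word token line
    if cols[3] not in OPEN_CLASSES:
        cols[1] = cols[3]  # replace the closed-class form by its UPOS
    return "\t".join(cols)
```

Applied to both the training corpus (of a substitute language) and the test corpus (of the surprise language), this makes the two superficially similar, since closed-class forms never coincide across languages anyway.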
Even though these results were encouraging, we hoped that a further increase in the weighted LAS was still feasible, especially since some of the surprise languages have typologically close relatives among the task's languages. Buryat, a Mongolian language, is not typologically close to any of the shared task's languages. Even though Turkish seems close enough, to our surprise Hindi finally turned out to be the best guess. With Urdu, which is very similar to Hindi apart from the fact that it uses the Arabic alphabet instead of Devanagari, the LAS was lower. As for the language mix, we replaced the forms of the closed word classes (except nouns, verbs and adjectives) in the training corpora by the corresponding UPOS and trained on the modified treebanks (cf. tables 4 and 5, best configuration in bold).

Table 5: Weighted LAS using typologically close languages (with hidden layer size)

The reason for not replacing nouns, adjectives and verbs is simple: leaving the original words in the training corpus language and in the test corpus language means that, while predicting, the parser comes across words it has never seen during training; but since it has the UPOS, it can still fall back on the part-of-speech information. As expected, Upper Sorbian and North Sami give quite acceptable results using models trained on Czech and Finnish respectively. Due to the fact that the treebanks provided for Kazakh and Uyghur are both very small, we tried to apply the same approach of using the training corpus of a typologically close language (here Turkish). However, the results were disappointing. Thus, we continued to use the models trained on the very small corpora for these two languages in the shared task. Possibly the fact that the raw text corpora used to calculate word embeddings for Kazakh and Uyghur are much bigger than those of the surprise languages allowed us to produce usable word embeddings. If so, this would mean that word embeddings play a very prominent role in data-driven dependency parsing.

Results of the Shared Task
Our final macro-averaged LAS F1 score on the CoNLL 2017 UD Shared Task test data (Nivre et al., 2017a) was 68.61% (10th out of 33) 17 . The details show that our approach worked well for the bigger treebanks and the surprise languages (where we ended up 8th). In general, the results per language are slightly lower than those we obtained during training on the development corpora (cf. table 1). This is due to the fact that we did our training on the forms, lemmas, UPOS and XPOS of the training corpus, which are gold. In the test data, however, lemmas, UPOS and XPOS (if present) are predicted by UDPipe and do contain some errors with respect to the gold standard.
After the end of the test phase, we discovered a bug in our chain which concerned languages that have only UPOS data. In this case the UPOS information was discarded entirely by error, so all training and testing for these languages were done on the forms alone 18 .

17 http://universaldependencies.org/conll17/results.html
Furthermore, we made an error uploading the models for the gl TreeGal, fr parTut and sl sst treebanks: during the tests, the models trained on the basic gl, fr and sl treebanks were used instead. After the test phase we corrected these errors. Fortunately, their impact was limited. Apart from the results for gl TreeGal and sl sst, which went up to 66.13% (from 22.46%) and 47.68% (from 40.25%) respectively once the correct model was used, the results for the other corpora changed only slightly; the global result would have been 69.38%. All results are shown in table 6. The column on the right shows the difference between the results on the development corpora and on the test data. For some languages, the test results are unexpectedly lower than the results on the development corpora. For gl TreeGal, fr parTut and sl sst, this is due to errors made when installing our system on the Tira platform. The lower performance on languages like Chinese, Ukrainian, Vietnamese or Latin (both ITTB and PROIEL) seems to be caused by the nature of the test corpora themselves; the systems of other participants drop in performance as well, and for all these languages our system remains around the 10th position of the global ranking. One possible cause is that the XPOS we use (predicted by UDPipe) contain more errors for the Chinese, Ukrainian or Vietnamese treebanks than for languages where our test score is closer to the development score.