A Fast and Lightweight System for Multilingual Dependency Parsing

We present a multilingual dependency parser with a bidirectional LSTM (BiLSTM) feature extractor and a multi-layer perceptron (MLP) classifier. We trained our transition-based projective parser on the UD version 2.0 datasets without any additional data. The parser is fast, lightweight and effective on big treebanks. In the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, the official results show that the macro-averaged LAS F1 score of our system, Mengest, is 61.33%.


Introduction
Developing tools that can process multiple languages has always been an important goal in NLP. Ten years ago, the CoNLL 2006 (Buchholz and Marsi, 2006) and CoNLL 2007 (Nivre et al., 2007) Shared Tasks were major milestones for multilingual dependency parsing. The CoNLL 2017 UD Shared Task (Zeman et al., 2017) is an extension of the tasks addressed in previous years. Unlike CoNLL 2006 and CoNLL 2007, the focus of the CoNLL 2017 UD Shared Task is learning syntactic dependency parsers on a universal syntactic annotation standard. This shared task requires participants to parse raw texts from different languages, which vary both in typology and in training set size.
In this paper, we present our multilingual dependency parsing system, Mengest, for the CoNLL 2017 UD Shared Task. The system contains a BiLSTM feature extractor for feature representation and an MLP classifier for the transition system. The inputs to our system are the word form (lemma or stem, depending on the particular treebank) and the part-of-speech (POS) tags (coarse-grained and fine-grained) of each token. Based on this input, the system finds a governor for each token and assigns a universal dependency relation label to each syntactic dependency. Our official submission obtains a macro-averaged LAS F1 score of 61.33% over all treebanks.
The rest of this paper is organized as follows. Section 2 discusses the transition-based model (Kiperwasser and Goldberg, 2016) and our implementation. Section 3 explains how our system deals with parallel sets and surprise languages. Finally, we present experimental and official results in Section 4.

System Description
We implement a transition-based projective parser following Kiperwasser and Goldberg (2016). The system consists of a BiLSTM feature extractor and an MLP classifier. We describe their model and our implementation in the following sections in detail.

Arc-Hybrid System
In this work, we use the arc-hybrid transition system (Kuhlmann et al., 2011). In the arc-hybrid system, a configuration $c = (\sigma, \beta, A)$ consists of a stack $\sigma$, a buffer $\beta$, and a set of dependency arcs $A$. Given an $n$-word sentence $s = w_1, \cdots, w_n$, the initial configuration is $c = (\emptyset, \{1, 2, \cdots, n, \mathit{root}\}, \emptyset)$, with an empty stack, an empty arc set, and a full buffer $\beta = 1, 2, \cdots, n, \mathit{root}$, where $\mathit{root}$ is the special root index. The set of terminal configurations contains every configuration with an empty stack, an arbitrary arc set, and a buffer containing only $\mathit{root}$.
For each configuration $c = (\sigma|s_1|s_0,\ b_0|\beta,\ A)$, the arc-hybrid system has 3 kinds of transitions, $T = \{\text{SHIFT}, \text{LEFT}_l, \text{RIGHT}_l\}$. The SHIFT transition moves the first item of the buffer ($b_0$) to the stack. The $\text{LEFT}_l$ transition removes the item on top of the stack ($s_0$) and attaches it as a modifier to $b_0$ with label $l$, adding the arc $(b_0, s_0, l)$ to the arc set $A$. The $\text{RIGHT}_l$ transition removes $s_0$ from the stack and attaches it as a modifier to the next item on the stack ($s_1$), adding the arc $(s_1, s_0, l)$ to the arc set $A$.
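The three transitions can be sketched in a few lines of Python. This is a minimal illustration, not our actual implementation: the `Configuration` class, the function names, and the choice of index 0 for the root token are assumptions made for the example.

```python
from dataclasses import dataclass, field

ROOT = 0  # assumption: index 0 stands for the special root token


@dataclass
class Configuration:
    stack: list = field(default_factory=list)
    buffer: list = field(default_factory=list)
    arcs: set = field(default_factory=set)  # triples (head, modifier, label)


def initial(n):
    """Empty stack, buffer 1..n followed by root, empty arc set."""
    return Configuration(buffer=list(range(1, n + 1)) + [ROOT])


def is_terminal(c):
    """Empty stack and a buffer that holds only root."""
    return not c.stack and c.buffer == [ROOT]


def shift(c):
    """Move b0 from the buffer onto the stack (root is never shifted)."""
    assert c.buffer and c.buffer[0] != ROOT
    c.stack.append(c.buffer.pop(0))


def left(c, label):
    """Pop s0 and attach it as a modifier of b0."""
    assert c.stack and c.buffer
    s0 = c.stack.pop()
    c.arcs.add((c.buffer[0], s0, label))


def right(c, label):
    """Pop s0 and attach it as a modifier of s1."""
    assert len(c.stack) >= 2
    s0 = c.stack.pop()
    c.arcs.add((c.stack[-1], s0, label))


# Hypothetical 3-word sentence: 1 "She", 2 "eats", 3 "fish".
c = initial(3)
shift(c)            # stack [1],    buffer [2, 3, ROOT]
left(c, "nsubj")    # adds (2, 1, "nsubj")
shift(c)            # stack [2],    buffer [3, ROOT]
shift(c)            # stack [2, 3], buffer [ROOT]
right(c, "obj")     # adds (2, 3, "obj")
left(c, "root")     # adds (ROOT, 2, "root")
```

After the last transition the stack is empty and the buffer holds only the root, so the configuration is terminal and the arc set encodes the full tree.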
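We apply a classifier to determine the best action for a configuration. Following Chen and Manning (2014), we use an MLP with one hidden layer. The score of the transition $t \in T$ is defined as:
$$\mathrm{Score}_\theta(\phi(c), t) = \mathrm{MLP}_\theta(\phi(c))[t]$$
where $\phi(c)$ is the feature representation of the configuration $c$ and $[t]$ denotes an indexing operation that takes the output element corresponding to the class of transition $t$.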
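A one-hidden-layer MLP scorer can be sketched as follows. The parameter names `W1`, `b1`, `W2`, `b2`, the `tanh` activation, and the restriction to a list of legal transitions are assumptions made for this illustration; the sketch uses plain Python lists rather than a neural network library.

```python
import math


def mlp_scores(x, W1, b1, W2, b2):
    """One hidden layer: W2 * tanh(W1 * x + b1) + b2, one score per transition."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def best_transition(x, legal, params):
    """Pick the highest-scoring transition index among the legal ones."""
    scores = mlp_scores(x, *params)
    return max(legal, key=lambda t: scores[t])


# Toy parameters: 2 input features, 2 hidden units, 3 transition classes.
params = (
    [[1.0, 0.0], [0.0, 1.0]],                # W1
    [0.0, 0.0],                              # b1
    [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]],   # W2
    [0.0, 0.0, 0.1],                         # b2
)
x = [1.0, 0.0]
```

Indexing the output of `mlp_scores` with a transition class corresponds to the `[t]` operation in the text.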

The Feature Representation
We consider two types of feature representations $\phi(c)$ of a configuration: simple and extended.
Simple: For an input sequence $s = w_1, \cdots, w_n$, we associate each word $w_i$ with a vector $x_i$:
$$x_i = e(w_i) \circ e(p_i) \circ e(q_i)$$
where $\circ$ denotes concatenation, $e(w_i)$ is the embedding vector of word $w_i$, $e(p_i)$ is the embedding vector of POS tag $p_i$, and $e(q_i)$ is the embedding vector of coarse-grained POS (CPOS) tag $q_i$. The embeddings $e(w_i)$, $e(p_i)$, $e(q_i)$ are randomly initialized (without pre-training) and jointly trained with the parsing model. Then, in order to encode context features, we use a 2-layer sentence-level BiLSTM on top of $x_{1:n}$:
$$\overrightarrow{h}_i = \mathrm{LSTM}(x_{1:i};\ \overrightarrow{\theta}), \qquad \overleftarrow{h}_i = \mathrm{LSTM}(x_{n:i};\ \overleftarrow{\theta}), \qquad v_i = \overrightarrow{h}_i \circ \overleftarrow{h}_i$$
where $\overrightarrow{\theta}$ are the model parameters of the forward hidden sequence $\overrightarrow{h}$ and $\overleftarrow{\theta}$ are the model parameters of the backward hidden sequence $\overleftarrow{h}$. The vector $v_i$ is our final vector representation of the $i$th token in $s$; it takes into account both the entire history $\overrightarrow{h}_i$ and the entire future $\overleftarrow{h}_i$ by concatenating the outputs of the matching forward and backward Long Short-Term Memory (LSTM) networks.
For $\phi(c)$, our simple feature function is the concatenation of the BiLSTM vectors of the top 3 items on the stack and the first item on the buffer. A configuration $c = (\sigma|s_2|s_1|s_0,\ b_0|\beta,\ A)$ is represented by:
$$\phi(c) = v_{s_2} \circ v_{s_1} \circ v_{s_0} \circ v_{b_0}$$
Extended: We add the feature vectors corresponding to the right-most and left-most modifiers of $s_0$, $s_1$ and $s_2$, as well as the left-most modifier of $b_0$, reaching a total of 11 BiLSTM vectors in the extended feature representation. As the experimental section shows, using the extended set does improve parsing accuracy.
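The sketch below shows one way to select the 4 (simple) or 11 (extended) token positions whose BiLSTM vectors are concatenated. The function names, the use of `None` for missing items, and the padding vector `pad` are assumptions made for the example; modifiers are looked up in the arcs built so far.

```python
def leftmost(head, arcs):
    """Smallest-index modifier already attached to head, or None."""
    mods = [m for (h, m, _label) in arcs if h == head]
    return min(mods) if mods else None


def rightmost(head, arcs):
    """Largest-index modifier already attached to head, or None."""
    mods = [m for (h, m, _label) in arcs if h == head]
    return max(mods) if mods else None


def feature_positions(stack, buffer, arcs, extended=False):
    """Token indices for s2, s1, s0, b0, plus modifiers if extended."""
    def from_top(k):
        return stack[-1 - k] if len(stack) > k else None

    s0, s1, s2 = from_top(0), from_top(1), from_top(2)
    b0 = buffer[0] if buffer else None
    positions = [s2, s1, s0, b0]
    if extended:
        for s in (s0, s1, s2):
            positions.append(None if s is None else rightmost(s, arcs))
            positions.append(None if s is None else leftmost(s, arcs))
        positions.append(None if b0 is None else leftmost(b0, arcs))
    return positions


def phi(positions, v, pad):
    """Concatenate the vectors v[p]; missing positions use the pad vector."""
    out = []
    for p in positions:
        out.extend(pad if p is None else v[p])
    return out


# Hypothetical parser state: tokens 2 and 5 on the stack, 6 and 7 in the buffer.
stack, buffer = [2, 5], [6, 7]
arcs = {(2, 1, "a"), (5, 3, "b"), (5, 4, "c")}
simple = feature_positions(stack, buffer, arcs)
extended = feature_positions(stack, buffer, arcs, extended=True)
```

Here `simple` has 4 entries and `extended` has 11, matching the counts given in the text.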

Training Details
The training objective is to make the score of correct transitions always higher than the scores of incorrect transitions. We use a margin-based criterion. Let $T_{gold}$ be the set of gold transitions at the current configuration $c$. At each time step, the objective function tries to maximize the margin between $T_{gold}$ and $T \setminus T_{gold}$. The hinge loss of a configuration $c$ is defined as:
$$\max\Bigl(0,\ 1 - \max_{t_o \in T_{gold}} \mathrm{Score}_\theta(\phi(c), t_o) + \max_{t_p \in T \setminus T_{gold}} \mathrm{Score}_\theta(\phi(c), t_p)\Bigr)$$
Our system uses the backpropagation algorithm to compute the gradients of the entire network (including the MLP and the BiLSTM).
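A margin-based hinge loss over transition scores can be sketched as below. The margin value of 1 and the function name are assumptions for the illustration; `scores` is a list with one score per transition class and `gold` is the set of gold transition indices.

```python
def hinge_loss(scores, gold, margin=1.0):
    """max(0, margin - best gold score + best non-gold score)."""
    best_gold = max(scores[t] for t in gold)
    wrong = [s for t, s in enumerate(scores) if t not in gold]
    if not wrong:  # every transition is gold, nothing to separate
        return 0.0
    return max(0.0, margin - best_gold + max(wrong))


# The gold transition wins, but by less than the margin: a positive loss.
loss_close = hinge_loss([2.0, 0.5, 1.5], gold={0})
# The gold transition wins by more than the margin: zero loss.
loss_clear = hinge_loss([3.0, 0.5, 1.0], gold={0})
```

The loss is zero only when the best gold transition outscores every incorrect transition by at least the margin, which is what pushes correct scores above incorrect ones during training.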
Since our parser can only handle projective dependency trees, we exclude all training examples with non-projective dependencies. This exclusion undoubtedly lowers the performance of our system; we plan to adopt a pseudo-projective approach to address it in future work.
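Filtering out non-projective training sentences can be sketched with a simple crossing-arc test. The function name and the input layout are assumptions for the example: `heads[i]` is the head of token `i + 1`, with 0 denoting the root, as in the HEAD column of CoNLL-U.

```python
def is_projective(heads):
    """True if no two dependency arcs cross; heads[i] is the head of token i + 1."""
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:  # arc (c, d) starts inside (a, b) and ends outside
                return False
    return True


# Hypothetical training set: one projective and one non-projective sentence.
sentences = [[2, 0, 2], [3, 4, 0, 3]]
projective_only = [heads for heads in sentences if is_projective(heads)]
```

The quadratic pairwise check is enough for a one-off preprocessing step over sentence-length inputs.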

Multilingual Dependency Parsing
There are 81 treebanks in the CoNLL 2017 UD Shared Task, including 55 big treebanks, 14 PUD treebanks (additional parallel test sets), 8 small treebanks and 4 surprise language treebanks. For each treebank in the UD version 2.0 training sets, we train a parser using only its monolingual training set (no cross-lingual features). In total, we trained 61 models: 55 on big treebanks and 6 on small treebanks. Our system reads the CoNLL-U files predicted by UDPipe and uses the morphology (lemmas, UPOS, XPOS) predicted by UDPipe.

Dealing with Parallel Test Sets
There are 14 additional parallel test sets. When a parallel test set belongs to a language with multiple training treebanks, our system simply selects one of the trained models. For example, there is no English-PUD training set, but there are English, English-LinES and English-ParTUT training sets, so we use only the model trained on the English training set to parse the English-PUD test set.

Dealing with Surprise Languages
There are 4 surprise languages in the CoNLL 2017 UD Shared Task. Our system simply uses the model trained on English to parse the 4 surprise languages, without looking at the input words.
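The model selection used for parallel test sets and surprise languages can be sketched as a small lookup. The function name, the treebank naming scheme, and the rule of falling back to the treebank named after the bare language are assumptions that generalize the English-PUD example; they are not a description of the exact selection logic in our system.

```python
def select_model(test_set, trained_models, fallback="English"):
    """Choose which trained model parses a given test set."""
    if test_set in trained_models:          # a model exists for this treebank
        return test_set
    language = test_set.split("-")[0]       # "English-PUD" -> "English"
    if language in trained_models:          # parallel set: use the main treebank
        return language
    return fallback                         # surprise language: use English


# Hypothetical subset of trained models.
trained = {"English", "English-LinES", "English-ParTUT", "Czech"}
```

With this lookup, `select_model("English-PUD", trained)` picks the English model, and a test set for a language without any trained model falls back to English.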

Results
We trained our system on a MacBook Air with an Intel Core i5 1.6 GHz CPU and 4 GB of memory. We used the official TIRA platform (Potthast et al., 2014) to evaluate the system. We built our system with the DyNet neural network library (Neubig et al., 2017).
The hyper-parameters of the final system used for all the reported experiments are detailed in Table 1.

Token Representation
We compare two constructions of $x_i$:
• lemma and POS tag ($w_i \circ p_i$).
• lemma, POS tag and CPOS tag ($w_i \circ p_i \circ q_i$).
The performance of the different token representations on 4 example languages is given in

BiLSTM Feature Representation
The performance of the simple and extended feature representations is given in Table 3. The results show that the extended feature representation slightly improves the performance of our system, while the simple feature representation significantly speeds it up.

Overall Performances
In the final system submitted to the shared task, we used lemmas, POS tags and CPOS tags in the token representation and selected the extended feature representation.
The results for each language on the big treebanks are shown in Table 4. The macro-average LAS of the 8 small treebanks is 33.88% and the results for each language are shown in Table 5. The macro-average LAS of the 14 PUD treebanks is 63.68% and the results for each language are shown in Table 6. The macro-average LAS of the 4 surprise language treebanks is 11.31% and the results for each language are shown in Table 7. The macro-averaged LAS F1 score of our system on all treebanks is 61.33%.

Computational Efficiencies
The parser is fast. Offline training runs at about 300 words/sec. Prediction on the official TIRA platform runs at about 400 words/sec without requesting additional resources. Memory requirements are below 512 MB for each language.

Conclusions
In this paper, we present a fast and lightweight multilingual dependency parsing system for the CoNLL 2017 UD Shared Task, which is composed of a BiLSTM feature extractor and an MLP classifier. Our system uses only the UD version 2.0 datasets (without any additional data). The parser ranks well on some of the big treebanks. The results suggest that the simple BiLSTM extractor is a reasonable baseline for multilingual dependency parsing. We will continue to improve our system and add cross-lingual techniques in future work.