A non-projective greedy dependency parser with bidirectional LSTMs

The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of the Covington (2001) algorithm for non-projective dependency parsing. The bidirectional LSTM approach by Kiperwasser and Goldberg (2016) is used to train a greedy parser with a dynamic oracle to mitigate error propagation. The model participated in the CoNLL 2017 UD Shared Task. In spite of not using any ensemble methods and using the baseline segmentation and PoS tagging, the parser obtained good results on both macro-average LAS and UAS in the big treebanks category (55 languages), ranking 7th out of 33 teams. In the all treebanks category we ranked 16th (LAS) and 12th (UAS). The gap between the all and big categories is mainly due to the poor performance on four parallel PUD treebanks, suggesting that some ‘suffixed’ treebanks (e.g. Spanish-AnCora) perform poorly in cross-treebank settings, which does not occur with the corresponding ‘unsuffixed’ treebank (e.g. Spanish). By parsing those PUD treebanks with the models trained on the unsuffixed treebanks instead, we obtain the 11th best LAS among all runs (official and unofficial). The code is made available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSE


Introduction
Dependency parsing is one of the core structured prediction tasks researched by computational linguists, due to the potential advantages that obtaining the syntactic structure of a text has in many natural language processing applications, such as machine translation (Miceli-Barone and Attardi, 2015; Xiao et al., 2016), sentiment analysis (Socher et al., 2013; Vilares et al., 2017) or information extraction (Yu et al., 2015).
The goal of a dependency parser is to analyze the syntactic structure of sentences in one or several human languages by obtaining their analyses in the form of dependency trees. Let $w = [w_1, w_2, \ldots, w_{|w|}]$ be an input sentence. A dependency tree for $w$ is an edge-labeled directed tree $T = (V, E)$, where $V = \{0, 1, 2, \ldots, |w|\}$ is the set of nodes and $E \subseteq V \times D \times V$ is the set of labeled arcs, with $D$ the set of dependency types. Each arc, of the form $(i, d, j)$, corresponds to a syntactic dependency between the words $w_i$ and $w_j$, where $i$ is the index of the head word, $j$ is the index of the child word and $d$ is the dependency type representing the kind of syntactic relation between them. We will write $i \xrightarrow{d} j$ as shorthand for $(i, d, j) \in E$ and we will omit the dependency types when they are not relevant.
A dependency tree is said to be non-projective if it contains two arcs $i \rightarrow j$ and $k \rightarrow l$ where $\min(i, j) < \min(k, l) < \max(i, j) < \max(k, l)$, i.e., if there is any pair of arcs that cross when they are drawn over the sentence, as shown in Figure 1. Unrestricted non-projective parsing allows more accurate syntactic representations than projective parsing, but it comes at a higher computational cost: there is more flexibility in how the tree can be arranged, so more operations are usually needed to explore the much larger search space.
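The crossing condition above can be checked mechanically. The following is a minimal sketch, not part of the parser itself: the function names are hypothetical, and the head indices given for the example sentence of Figure 1 are an assumed UD-style annotation used only for illustration.

```python
from itertools import combinations

def crossing_arcs(heads):
    """Return every pair of arcs that cross.

    heads[j] is the head of word j (words are 1-based); heads[0] is unused
    because node 0 is the dummy root.
    """
    arcs = [(head, dep) for dep, head in enumerate(heads) if dep > 0]
    crossings = []
    for (i, j), (k, l) in combinations(arcs, 2):
        lo1, hi1 = min(i, j), max(i, j)
        lo2, hi2 = min(k, l), max(k, l)
        # The condition from the text, tested in both orders of the two arcs.
        if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
            crossings.append(((i, j), (k, l)))
    return crossings

def is_non_projective(heads):
    return bool(crossing_arcs(heads))

# "He gave a talk yesterday about parsing" (assumed heads, for illustration only):
# the arc talk -> parsing crosses the arc gave -> yesterday.
example_heads = [None, 2, 0, 4, 2, 2, 7, 4]
print(crossing_arcs(example_heads))
```

The check is quadratic in the number of arcs, which is enough for inspecting single sentences.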
Non-projective transition-based parsing has been actively explored in the last decade (Nivre and Nilsson, 2005; Attardi, 2006; Nivre, 2008, 2009; Gómez-Rodríguez and Nivre, 2010; Gómez-Rodríguez et al., 2014). The success of neural networks and word embeddings for projective dependency parsing (Chen and Manning, 2014) also encouraged research on neural non-projective models (Straka et al., 2016). However, to the best of our knowledge, no neural implementation is available of unrestricted non-projective transition-based parsing with a dynamic oracle. Here, we present such an implementation for the Covington (2001) algorithm using bidirectional long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), which is the main contribution of this paper.
Figure 1: A non-projective dependency tree for the sentence "He gave a talk yesterday about parsing".
The system is evaluated at the CoNLL 2017 UD Shared Task: end-to-end multilingual parsing using Universal Dependencies (Zeman et al., 2017). The goal is to obtain a Universal Dependencies v2.0 representation (Nivre et al., 2016) of a collection of raw texts in different languages.

End-to-end multilingual parsing
Given a raw text, we: (1) segment and tokenize sentences and words, (2) apply part-of-speech (PoS) tagging over them and (3) obtain the dependency structure for each sentence.

Segmentation and PoS tagging
For these two steps we relied on the output of UDPipe v1.1 (Straka et al., 2016), which was provided as a baseline model for the shared task.

The BIST-COVINGTON parser
BIST-COVINGTON is built on top of three core ideas: a non-projective transition-based parsing algorithm (Covington, 2001; Nivre, 2008), a neural scoring model with bidirectional long short-term memory networks as feature extractors that feed a multilayer perceptron (Kiperwasser and Goldberg, 2016), and a dynamic oracle to mitigate error propagation (Gómez-Rodríguez and Fernández-González, 2015).

The Covington (2001) algorithm
The idea of Covington's algorithm is quite intuitive: any pair of words $w_i$, $w_j$ in $w$ have a chance to be connected, so we need to consider all such pairs to determine the type of relation that exists between them (i.e. $i \xrightarrow{d} j$, $j \xrightarrow{d} i$ or none). One pair $(i, j)$ is compared at a time. We will be referring to the indexes $i$ and $j$ as the focus words. It is straightforward to conclude that the theoretical complexity of the algorithm is $O(|w|^2)$.
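The quadratic bound follows from the number of focus pairs. As a minimal sketch (the function name is hypothetical, and the right-to-left visiting order for each new word is an assumption about the usual formulation of the algorithm), the worst-case enumeration of pairs, including the dummy root 0, looks like this:

```python
def covington_pairs(n):
    """Yield the focus pairs (i, j) for a sentence of n words.

    For each new word j, every earlier node i (including the dummy root 0)
    is visited from right to left. This is the worst case: a parser may
    stop early for a given j and move on to the next word.
    """
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            yield i, j

print(list(covington_pairs(3)))
```

For n words this produces n(n+1)/2 pairs, which is where the quadratic complexity comes from.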
Covington's algorithm can be easily implemented as a transition system (Nivre, 2008). The set of transitions used in BIST-COVINGTON and their preconditions is specified in Table 1. Each transition maps one parsing configuration to another, where a configuration is represented as a 4-tuple $c = (\lambda_1, \lambda_2, \beta, A)$, such that:
• $\lambda_1$, $\lambda_2$ are two lists storing the words that have been already processed in previous steps. $\lambda_1$ contains the already processed words for which the parser still has not decided, in the current state, the type of relation with respect to the focus word $j$, located at the top of $\beta$. $\lambda_2$ contains the already processed words for which the parser has already determined the type of relation with respect to $j$ in the current step.
• β contains the words to be processed.
• A contains the set of arcs already created.
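The configuration described above can be sketched as a small data structure. This is an illustrative sketch, not the BIST-COVINGTON code: the class and function names are hypothetical, and the four transitions (SHIFT, NO-ARC, LEFT-ARC, RIGHT-ARC) and their preconditions (single head, no cycles, the dummy root never receives a head) are assumed to follow the standard formulation in Nivre (2008), since Table 1 is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    lambda1: list                            # processed words still to be paired with j
    lambda2: list                            # processed words already paired with j
    beta: list                               # words still to be processed; beta[0] is j
    arcs: set = field(default_factory=set)   # arcs stored as (head, label, dependent)

def initial(n):
    """Initial configuration ([0], [], [1..n], {}) for a sentence of n words."""
    return Config([0], [], list(range(1, n + 1)))

def has_head(c, node):
    return any(dep == node for _, _, dep in c.arcs)

def has_path(c, a, b):
    """True if b can be reached from a by following head -> dependent arcs."""
    frontier, seen = [a], set()
    while frontier:
        cur = frontier.pop()
        if cur == b:
            return True
        if cur not in seen:
            seen.add(cur)
            frontier.extend(dep for head, _, dep in c.arcs if head == cur)
    return False

def valid_transitions(c):
    if not c.beta:                           # final configuration: nothing left to do
        return []
    moves = ["SHIFT"]
    if c.lambda1:
        i, j = c.lambda1[-1], c.beta[0]
        moves.append("NO-ARC")
        if i != 0 and not has_head(c, i) and not has_path(c, i, j):
            moves.append("LEFT-ARC")         # would create the arc j -> i
        if not has_head(c, j) and not has_path(c, j, i):
            moves.append("RIGHT-ARC")        # would create the arc i -> j
    return moves

def apply_transition(c, move, label=None):
    """Apply a transition in place and return the configuration."""
    if move == "SHIFT":
        c.lambda1, c.lambda2 = c.lambda1 + c.lambda2 + [c.beta.pop(0)], []
        return c
    i, j = c.lambda1.pop(), c.beta[0]
    c.lambda2.insert(0, i)                   # i has now been compared with j
    if move == "LEFT-ARC":
        c.arcs.add((j, label, i))
    elif move == "RIGHT-ARC":
        c.arcs.add((i, label, j))
    return c
```

A parser would repeatedly call `valid_transitions`, pick one of the returned moves and pass it to `apply_transition` until the list of pending words is empty.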
Given a sentence $w$, the parser starts at the initial configuration $c_s = ([0], [], [1, \ldots, |w|], \{\})$ and applies valid transitions until reaching a final configuration $c_f = (\lambda_1, \lambda_2, [], A)$.
Table 1: Transitions used in BIST-COVINGTON and their preconditions, following Nivre (2008). $a \rightarrow \ldots \rightarrow b$ indicates that there is a path in the dependency tree from $a$ to $b$.

A dynamic oracle for Covington's algorithm (Gómez-Rodríguez and Fernández-González, 2015)
Given a gold dependency tree $\tau_g$ and a parser configuration $c$, we can define a loss function $L(c, \tau_g)$ that determines the minimum number of missed arcs of $\tau_g$ across the possible outputs ($A$) of final configurations that can be reached from $c$, i.e., the least possible number of errors with respect to $\tau_g$ that we can obtain from $c$. A static (traditional) oracle is only defined on canonical transition sequences that lead to the gold tree, so that $L(c, \tau_g) = 0$ at every step during the training phase. However, a parser trained this way can suffer from serious error propagation at test time, as it is difficult for it to recover from wrong configurations that it has never seen, resulting from suboptimal transitions that increase loss. A dynamic oracle (Goldberg and Nivre, 2012) explores such wrong configurations during the training phase to overcome this issue. Instead of always picking the optimal transition during training, the parser moves with probability $x$ to an erroneous (loss-increasing) configuration, namely the one with the highest score among those that increase loss.
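The exploration step at the end of this description can be sketched as follows. This is a minimal illustration of the sentence above, not the training loop of BIST-COVINGTON: the function name, the dictionary-based interface and the use of the minimum reachable loss as the reference for "loss-increasing" are all assumptions.

```python
import random

def choose_training_transition(scores, losses, p_explore, rng=random):
    """Pick the transition to follow during training.

    scores: dict mapping each valid transition to its model score.
    losses: dict mapping each valid transition to the loss L of the
            configuration it leads to, as given by the dynamic oracle.
    p_explore: probability of deliberately following a wrong transition.
    """
    best_loss = min(losses.values())
    optimal = [t for t in scores if losses[t] == best_loss]
    wrong = [t for t in scores if losses[t] > best_loss]
    if wrong and rng.random() < p_explore:
        # Explore: the highest-scoring transition among those that increase loss.
        return max(wrong, key=scores.get)
    # Otherwise follow the highest-scoring optimal transition.
    return max(optimal, key=scores.get)

# Hypothetical scores and losses for a single configuration.
scores = {"SHIFT": 0.1, "NO-ARC": 0.7, "LEFT-ARC": 0.4, "RIGHT-ARC": 0.9}
losses = {"SHIFT": 1, "NO-ARC": 0, "LEFT-ARC": 0, "RIGHT-ARC": 2}
print(choose_training_transition(scores, losses, p_explore=0.1))
```

Following the erroneous transition exposes the model to configurations that a static oracle would never produce, which is the point of the dynamic oracle.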
To compute $L$ for non-projective trees we used the approach proposed by Gómez-Rodríguez and Fernández-González (2015, Algorithm 1). This dynamic oracle can be computed in $O(|w|)$, although the current implementation in BIST-COVINGTON is $O(|w|^3)$. To choose the dependency type corresponding to the selected transition (in case it is a LEFT or RIGHT ARC), we look at the gold treebank.

BIST-parsers (Kiperwasser and Goldberg, 2016)
The original set of BIST-parsers is composed of a projective transition-based model using the arc-hybrid algorithm (Kuhlmann et al., 2011) and a graph-based model inspired by Eisner (1996). They both rely on bidirectional LSTMs (BILSTMs). We kept the main architecture of the arc-hybrid BIST-parser and changed the parsing algorithm to that described in §2.2.1 and §2.2.2. We encourage the reader to consult Kiperwasser and Goldberg (2016) for a detailed explanation of their architecture, and give only a quick overview of it here.
In contrast to traditional parsers (Nivre et al., 2006; Martins et al., 2010; Rasooli and Tetreault, 2015), BIST-parsers rely on embeddings as inputs instead of on discrete events (co-occurrences of words, tags, features, etc.). Embeddings are low-dimensional vectors that provide a continuous representation of a linguistic unit (word, PoS tag, etc.) based on its context (Mikolov et al., 2013).

Let $w = [w_1, \ldots, w_{|w|}]$ be a list of word embeddings for a sentence, let $u = [u_1, \ldots, u_{|w|}]$ be the corresponding list of universal PoS tag embeddings, $t = [t_1, \ldots, t_{|w|}]$ the list of specific PoS tag embeddings, $f = [f_1, \ldots, f_{|w|}]$ the list of morphological feature embeddings ("feats" column in the Universal Dependencies data format) and $e = [e_1, \ldots, e_{|w|}]$ a list of external word embeddings; an input $x_i$ for a word $w_i$ to BIST-COVINGTON is defined as:
$$x_i = w_i \circ u_i \circ t_i \circ f_i \circ e_i$$
where $\circ$ is the concatenation operator. Let LSTM($x$) be an abstraction of a standard long short-term memory network that processes the sequence $x = [x_1, \ldots, x_{|x|}]$; then a BILSTM encoding of its $i$th element, BILSTM($x$, $i$), can be defined as:
$$\mathrm{BILSTM}(x, i) = \mathrm{LSTM}_F(x_{1:i}) \circ \mathrm{LSTM}_R(x_{|x|:i})$$
where $\mathrm{LSTM}_F$ reads the sequence from left to right up to position $i$ and $\mathrm{LSTM}_R$ reads it from right to left down to position $i$. In the case of multilayer BILSTMs (BIST-parsers allow it), given $n$ layers, the output of BILSTM$_m$ is fed as input to BILSTM$_{m+1}$.
From the BILSTM network we take a hidden vector $h$, which can contain the output hidden vectors for: the $x$ leftmost words in $\beta$, the rightmost $y$ of $\lambda_1$, and the $z$ leftmost and $v$ rightmost words in $\lambda_2$.
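To make the selection of positions concrete, the sketch below picks the word indices whose BILSTM vectors would be concatenated into the hidden vector. The function name is hypothetical, the default counts are the ones reported in the general setup of the experiments, and padding missing positions with a placeholder is an assumption of this sketch.

```python
def feature_positions(lambda1, lambda2, beta, x=1, y=3, z=1, v=1, pad=None):
    """Return the word indices whose BILSTM vectors form the hidden vector.

    Order: x leftmost words of beta, y rightmost words of lambda1,
    z leftmost and v rightmost words of lambda2. Missing positions are
    filled with `pad` (in a real model, a padding vector).
    """
    def leftmost(seq, k):
        items = list(seq[:k])
        return items + [pad] * (k - len(items))

    def rightmost(seq, k):
        items = list(seq[-k:]) if k else []
        return [pad] * (k - len(items)) + items

    return (leftmost(beta, x) + rightmost(lambda1, y)
            + leftmost(lambda2, z) + rightmost(lambda2, v))

# Words 0..3 wait in lambda1, words 4 and 5 are in lambda2, words 6 and 7 are in beta.
print(feature_positions([0, 1, 2, 3], [4, 5], [6, 7]))
```

The hidden vector is then the concatenation of the BILSTM outputs at the returned positions, so its size is fixed regardless of sentence length.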
The hidden vector $h$ is used to feed a multilayer perceptron with one hidden layer and four output neurons that predicts which transition to take. The output is computed as
$$\mathrm{MLP}(h) = W_2 \cdot \tanh(W \cdot h + b) + b_2$$
where $W$, $W_2$, $b$ and $b_2$ correspond to the weight matrices and bias vectors of the hidden and output layer of the perceptron. Similarly, BIST-parsers (including BIST-COVINGTON) use a second perceptron with one hidden layer to predict the dependency type. In this case the size of the output layer corresponds to the number of dependency types in the training set.
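A one-hidden-layer perceptron of this kind can be sketched in a few lines of plain Python. This is an illustration only: the tanh activation follows Kiperwasser and Goldberg (2016), while the function names, the ordering of the four output neurons and the restriction of the argmax to valid transitions are assumptions of the sketch.

```python
import math

TRANSITIONS = ["SHIFT", "NO-ARC", "LEFT-ARC", "RIGHT-ARC"]  # assumed output order

def mlp_scores(h, W, b, W2, b2):
    """Compute W2 * tanh(W * h + b) + b2 with matrices given as lists of rows."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
              for row, bias in zip(W, b)]
    return [sum(w * a for w, a in zip(row, hidden)) + bias
            for row, bias in zip(W2, b2)]

def best_valid_transition(scores, valid):
    """Return the highest-scoring transition among the valid ones."""
    scored = dict(zip(TRANSITIONS, scores))
    return max(valid, key=scored.get)

# Toy parameters: 2-dimensional input, 2 hidden units, 4 outputs.
h = [0.0, 0.0]
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]] * 4, [0.1, 0.2, 0.3, 0.4]
scores = mlp_scores(h, W, b, W2, b2)
print(best_valid_transition(scores, ["SHIFT", "NO-ARC"]))
```

The second perceptron for dependency types would have the same shape, with one output neuron per dependency type instead of four.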

Postprocessing
BIST-COVINGTON as it is allows parses with multiple roots, i.e., with several nodes assigned as children of the dummy root. This was not allowed however by the task organizers, as it is enforced by Universal Dependencies that only one word per sentence must depend on the dummy root. To overcome this, the output is postprocessed according to Algorithm 1. Basically, we look for the first verb rooted at 0, or for the first word whose head is 0 if there is no verb, and reassign all other words to the selected term:
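The procedure can be sketched in Python as follows. This is a minimal illustration of the postprocessing step rather than the released code: the function name and the list-based representation of heads and universal tags are assumptions.

```python
def to_single_root(heads, utags):
    """Return a copy of `heads` in which at most one word depends on the dummy root.

    heads[i] and utags[i] describe word i (words are 1-based); index 0 is the
    dummy root and its entries are ignored.
    """
    heads = list(heads)
    # Nodes currently attached to the dummy root.
    roots = [i for i in range(1, len(heads)) if heads[i] == 0]
    if len(roots) > 1:
        # Keep the first verb attached to the root, or the first root if there is no verb.
        chosen = next((r for r in roots if utags[r] == "VERB"), roots[0])
        # Reattach every other root to the chosen word.
        for r in roots:
            if r != chosen:
                heads[r] = chosen
    return heads

print(to_single_root([None, 0, 0, 2, 0], [None, "NOUN", "VERB", "DET", "VERB"]))
```

Parses that already have a single root are returned unchanged.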

Experiments
We here describe the official treebanks used in the shared task ( §3.1), the general setup used to train the models ( §3.2) and some exceptions to said general setup that were applied to special cases ( §3.3). We also discuss the experimental results obtained by our system in the shared task ( §3.4).

CoNLL 2017 treebanks

Training/development splits
Sixty treebanks from 45 languages were released to train the models, based on Universal Dependencies 2.0 (Nivre et al., 2017a). Most of them already contained official training and development splits. A few others lacked a development set. For these, we applied a random training/dev split (80/20) over the original training set. All development sets were only used to evaluate and tune the trained models. No development set was used to train any of the runs, as specified in the task guidelines.

Algorithm 1 Multiple to single node root
procedure TO_SINGLE(V, E)
    ▷ Get the nodes rooted at zero (those whose head has to be reassigned)
    RO ← []
    for i in V do
        if head(i) = 0 then
            append(RO, i)
    ▷ We select the first verb linked to the dummy root to remove multiple roots
    if len(RO) > 1 then
        closest_head ← RO[0]
        for r0 in RO do
            if utag(r0) = VERB then
                closest_head ← r0
                break
        ▷ Reassign the head of the invalid nodes (rooted to the dummy root) to closest_head
        for r0 in RO do
            if r0 ≠ closest_head then
                head(r0) ← closest_head
Additionally, four surprise languages (truly low-resource languages) were considered by the organization for evaluation: Buryat, Kurmanji, North Sami and Upper Sorbian. For these, the organizers only released a tiny sample set consisting of very few sentences annotated according to the UD guidelines.

Test splits
The organizers provided a test split for each of the treebanks released in the training phase, including the surprise languages. Additionally, they provided test sets corresponding to 14 parallel treebanks in different languages translated from a unique source. All of these test sets (Nivre et al., 2017b) were hidden from the participating teams until the shared task had ended. Using the TIRA environment (Potthast et al., 2014) provided for the shared task, participants could execute runs on them, but not see the outputs or the results.

General setup
We used the gold training treebanks to train the parsing models. We trained one model per treebank. No predicted training treebank (predicted universal and/or specific tags and morphological features) was used for training, except for the case of Portuguese (see §3.3.1).
Embeddings: Word embeddings are set to size 100 and universal tag embeddings to 25. Language-specific tag and morphological feature embeddings are used and set to size 25, if they are available for the treebank at hand. Using external word embeddings seems to be beneficial to improve parsing performance (Kiperwasser and Goldberg, 2016), but it also makes models take more time and especially much more memory to train. The external word embeddings used in this work (the ones pretrained by the CoNLL 2017 UD Shared Task organizers) are of size 100. Due to a lack of computational resources, we only had time to train 38 models (mainly corresponding to the smallest treebanks) including this information. Models trained with external word embeddings are marked as such in Table 3.
Parameters: Adam is used as optimizer (Kingma and Ba, 2014). Models were trained for up to 30 epochs, except for the two smallest training sets (Kazakh and Uyghur), where models were trained for up to 100 epochs. The size of the output of the stacked BILSTM was set to 512. For very large treebanks (e.g. Czech or Russian-SynTagRus) or treebanks where sentences are very long (e.g. Arabic), we set it to 256, to cope with our limited computational resources and finish the task on time. These models are marked in Table 3 with •. The number of BILSTM layers is set to 2. To choose a transition, BIST-COVINGTON looks at the embeddings of: the first word in $\beta$, the rightmost three words in $\lambda_1$, and the leftmost and rightmost word in $\lambda_2$ (i.e., following the notation in Section 2.2.3, we set $x = 1$, $y = 3$, $z = 1$ and $v = 1$).
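For reference, the settings listed above can be gathered in a single configuration object. The sketch below is only a convenient summary of the reported values: the class name, field names and helper methods are hypothetical and do not correspond to the command-line options of the released parser.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ParserSetup:
    word_dim: int = 100            # word embeddings
    utag_dim: int = 25             # universal PoS tag embeddings
    xtag_dim: int = 25             # language-specific PoS tag embeddings, if available
    feats_dim: int = 25            # morphological feature embeddings, if available
    external_dim: int = 100        # pretrained embeddings from the organizers
    use_external: bool = False
    bilstm_layers: int = 2
    bilstm_output_dim: int = 512
    max_epochs: int = 30
    optimizer: str = "adam"
    n_beta: int = 1                # x: leftmost words of beta
    n_lambda1: int = 3             # y: rightmost words of lambda1
    n_lambda2_left: int = 1        # z: leftmost words of lambda2
    n_lambda2_right: int = 1       # v: rightmost words of lambda2

    def input_dim(self, has_xtags=True, has_feats=True):
        """Size of the concatenated input vector for one word."""
        dim = self.word_dim + self.utag_dim
        dim += self.xtag_dim if has_xtags else 0
        dim += self.feats_dim if has_feats else 0
        dim += self.external_dim if self.use_external else 0
        return dim

    def num_feature_vectors(self):
        """Number of BILSTM vectors concatenated into the hidden vector."""
        return self.n_beta + self.n_lambda1 + self.n_lambda2_left + self.n_lambda2_right

DEFAULT = ParserSetup()
LARGE_OR_LONG = replace(DEFAULT, bilstm_output_dim=256)   # e.g. very large treebanks
SMALLEST = replace(DEFAULT, max_epochs=100)               # Kazakh and Uyghur
print(DEFAULT.input_dim(), replace(DEFAULT, use_external=True).input_dim())
```

Keeping the exceptions as small variations of one default makes it easy to see which treebanks deviate from the general setup.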
Other relevant features of the setup: Aggressive exploration is applied to the dynamic oracle, as in the original arc-hybrid BIST-PARSER.

Special cases
For some treebanks, we followed a different strategy due to various issues. We enumerate the changes below.

Portuguese
[...] file. We first hypothesized this was due to a low accuracy on predicting the "feats" column in comparison to other languages, as these features are rather sparse. To try to overcome this, we trained a model without considering them, but it did not solve the problem. Our second option was to train a Portuguese model on its predicted training treebank. Additionally, despite this being a relatively large treebank, we included external word embeddings to boost performance. This helped us to obtain a performance similar to that reported by UDPipe.

Surprise languages
As training an accurate parser with so little data might be a hard task, especially in the case of data-hungry deep learning models, we used other training treebanks for this purpose. We built a set of parsers inspired by the approach presented by Vilares et al. (2016), who find that training a multilingual model on merged harmonized treebanks might actually have a positive impact on parsing the corresponding monolingual treebank. In this particular case, we are assuming that a model trained over multilingual treebanks might be able to capture similar treebank structures for unseen languages.
In particular, we: (1) ran every trained monolingual model on the sample sets, (2) chose, for each surprise language, the top three languages whose models obtained the best performance and (3) trained a parser on the merged first 2,000 sentences of the training sets corresponding to such languages.
Thus, we did not use the provided sample data for training, but only as a development set to choose suitable source languages for our cross-lingual approach.
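The three steps above can be sketched as follows. The function names are hypothetical, and the treebank names and LAS values in the example are placeholders for illustration, not results from the shared task.

```python
def pick_source_treebanks(las_on_sample, k=3):
    """Step (2): return the k treebanks whose monolingual models scored best.

    las_on_sample maps a treebank name to the LAS its model obtained on the
    sample set of one surprise language (step (1)).
    """
    return sorted(las_on_sample, key=las_on_sample.get, reverse=True)[:k]

def merge_training_sets(training_sets, selected, max_sentences=2000):
    """Step (3): merge the first `max_sentences` sentences of each selected treebank."""
    merged = []
    for name in selected:
        merged.extend(training_sets[name][:max_sentences])
    return merged

# Placeholder scores and tiny placeholder training sets.
las = {"tb_a": 31.0, "tb_b": 12.5, "tb_c": 40.2, "tb_d": 35.7}
train = {name: [f"{name}-sent{i}" for i in range(5)] for name in las}
selected = pick_source_treebanks(las)
print(selected, len(merge_training_sets(train, selected, max_sentences=2)))
```

The merged list would then be used as the training set of one multilingual model per surprise language.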

Parallel (PUD) treebanks
The only information our models knew about the parallel treebanks during the testing phase was the language in which they were written. To parse these languages we followed a simplistic approach, using the models we had already trained on the provided training corpora: (1) if there is only one model trained on the same language, we take that model; (2) if there is more than one model trained on that language, we take the one trained over the largest treebank (in number of sentences); (3) otherwise, we parse the PUD treebank using the English model.
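This selection rule is simple enough to state as a short function. The sketch below is illustrative: the function name and the dictionary layout are hypothetical, and the treebank sizes in the example are placeholders rather than real sentence counts.

```python
def model_for_pud(language, trained, fallback="English"):
    """Choose which trained model parses a PUD treebank written in `language`.

    trained maps a treebank name to (language, number of training sentences).
    With one candidate, that model is used; with several, the one trained on
    the largest treebank; with none, the fallback model.
    """
    candidates = [(size, name) for name, (lang, size) in trained.items()
                  if lang == language]
    if not candidates:
        return fallback
    return max(candidates)[1]

# Placeholder sizes, for illustration only.
trained = {
    "English": ("English", 300),
    "Spanish": ("Spanish", 100),
    "Spanish-AnCora": ("Spanish", 200),
    "Turkish": ("Turkish", 50),
}
print(model_for_pud("Spanish", trained))
```

Under this rule a larger suffixed treebank always wins over the unsuffixed one for the same language.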

Results
Official and unofficial results for our model and for the rest of participants on the test set can be found at the task website: http://universaldependencies.org/conll17/results.html, but in this section we detail the results obtained by BIST-COVINGTON.

Results on small and big treebanks categories
Table 2 shows the performance on the test sets for the treebanks where an official training set was released.
In Table 3 we summarize our results on the development sets for those treebanks that provided an official one. Although not shown for brevity, the reader can easily check that BIST-COVINGTON outperformed the baseline UDPipe for all these treebanks on the gold configuration (gold segmentation, gold tags). The same is true, except for Chinese (−0.69 LAS) and Portuguese (−0.09), in the fully predicted configuration (end-to-end parsing). The table also shows that including external word embeddings has a positive effect in most of the treebanks we had time to try. This is especially true when performing end-to-end parsing, where a negative effect was observed for only three languages (English-LinES, Gothic and Old Church Slavonic).
Table 4 shows the top three selected languages for each surprise treebank, the performance of the monolingual and multilingual (merged) models on the sample set (used as dev set), and the performance of the multilingual models on the official test sets.
Table 3: BIST-COVINGTON results on the dev set, for those treebanks that have an official dev set (all treebanks except French-ParTUT, Irish, Galician-TreeGal, Kazakh, Slovenian-SST, Uyghur and Ukrainian). Models that were also trained with external word embeddings (E) are marked. • indicates the BILSTM output dimension was 256. The performance of some models could likely be improved, as their training stopped earlier than planned due to lack of time or memory issues (see also §5).
Table 4: LAS on the surprise languages sample sets for: (1) the top 3 best performing monolingual models for which there is an official training treebank and (2) a multilingual model trained on the first 2,000 sentences of each of such treebanks. For the multilingual models, the last column shows their performance on the test sets (subscripts indicate our ranking in that language).
[...] corresponding treebank was 32.47, which implied a LAS loss of up to 1.60 points in the official global ranking. We hypothesized that taking the model trained on the largest treebank of the same language was the safest option to parse PUD texts, but in retrospect this clearly was not the optimal choice. Those four PUD treebanks were parsed with models trained on Universal Dependencies (UD) treebanks whose official name has a suffix (i.e. Spanish-AnCora, Finnish-FTB, Portuguese-BR and Russian-SynTagRus), which were larger than the unsuffixed UD treebank. However, we think such a poor performance goes beyond what can be reasonably expected from a universal treebank written in the same language. From Table 5 it is reasonable to conclude that models trained on such suffixed treebanks perform very poorly in cross-treebank settings, in comparison to the model trained on the unsuffixed treebank (rightmost column). We wonder if this can be an indicator of those treebanks sharing universal dependency types, but diverging in terms of syntactic structures, which caused the low LAS scores in those cases.
A possible contributing factor to this could be that the annotators of the parallel treebanks used guidelines from the unsuffixed treebanks, or automatic output trained on them, as a starting point for the annotation process. At the time of writing we cannot confirm whether this is the case, as documentation for the PUD treebanks is not yet publicly available.
We failed on a subset of the PUD treebanks. As previously explained, the main gap came from the Spanish, Russian, Portuguese and Finnish PUD treebanks. We analyzed those treebanks based on existing UD CoNLL treebanks. We parsed them with the model trained on the largest treebank that shared the language. It turned out that those PUD treebanks that were parsed with models trained on suffixed treebanks (e.g. Spanish-AnCora or Russian-SynTagRus) obtained a very low performance, something that did not happen when parsing them with the model trained on the corresponding unsuffixed treebank (e.g. Spanish or Russian). In cases where there was only one UD treebank sharing the language, our approach worked reasonably well, in spite of the simplistic strategy followed (e.g. Turkish-PUD or Czech-PUD).
We did not perform too well either on the set of small treebanks (French-ParTUT, Irish, Galician-TreeGal, Kazakh, Slovenian-SST, Uyghur and Ukrainian). This was somewhat expected for two reasons: (1) neural models that are fed with continuous vector representations are usually data-hungry and (2) the submitted model was only trained on our training split; we did not include the ad-hoc dev sets for those languages as a part of the final training data.
We believe that the cases where the parser did not work well were due to external causes (e.g. the chosen cross-treebank strategy), as shown in the case of the PUD treebanks. Unofficial results such as the ones in Table 5 show that this can be easily addressed to push BIST-COVINGTON to obtain competitive results in those treebanks too.

Hardware requirements and issues
Our models required DyNet (Neubig et al., 2017), which allocates memory when it is launched. We ran them on CPU. To train the models we used two servers with 128 GB of RAM each. Estimating the memory required to train each model was a hard task for us. DyNet does not currently have a garbage collector, so many models ran out of memory before finishing their training, probably due to wrong memory estimates for this phase and our lack of resources to allocate memory for many treebanks at a time. We observed that models such as Arabic with external word embeddings could take up to 64 GB during the training phase.
The performance on the dev set of our trained models was close, but not equal, in our training machine and in TIRA. This might be caused by a serialization versioning issue: https://github.com/clab/dynet/issues/84.
To safely run a large trained model with external embeddings we recommend at least 32 GB of RAM. We think a safe estimate to run any model without external embeddings would be between 15 and 20 GB.
The current version of BIST-COVINGTON is not very fast. Average speed (tokens/second) over all test treebanks was 18.27. The fastest models were Kazakh (66.36), Uyghur (54.11) and Czech-PUD (45.79), and the slowest ones included Czech-CLTT (5.37) and Latin-PROIEL (7.69). To complete the testing phase of the shared task, BIST-COVINGTON took around 28 hours. These times correspond to those of the official evaluation on the TIRA virtual machine. Several factors influence these speeds. Firstly, RNN approaches tend to be slower than feedforward approaches (e.g., reported speeds for the original transition-based BIST-parser by Kiperwasser and Goldberg (2016) are an order of magnitude behind those of Chen and Manning (2014), although the latter is also much less accurate). Secondly, parsing UD data for different languages accurately requires using more linguistic information (e.g. feature embeddings), increasing the model size with respect to models evaluated on simpler settings like the English Penn Treebank. Finally, we are aware that Covington's algorithm may become slower when sentences are too long due to its quadratic worst-case complexity, an issue that is likely to happen due to the predicted segmentation (the organizers actually informed us that some treebanks contained sentences of about 300 words).

Conclusion
This paper presented BIST-COVINGTON, a bidirectional LSTM implementation of the Covington (2001) algorithm for non-projective transition-based dependency parsing. Our model was evaluated on the end-to-end multilingual parsing with Universal Dependencies shared task proposed at CoNLL 2017. For segmentation and part-of-speech tagging our model relied on the official UDPipe baseline. The official results placed us 7th out of 33 teams in the big treebanks category, in spite of not using any ensemble method.
There is room for improvement in future work. Due to lack of resources to train the models and complete the task on time, we could not train all models using external word embeddings, which has been shown to produce a significant overall improvement. Jackknifing (Agić and Schluter, 2017) might be a simple way to improve the LAS scores. Finally, it would be interesting to implement the non-monotonic version of the Covington transition system, together with approximate dynamic oracles (Fernández-González and Gómez-Rodríguez, 2017), shown to improve accuracy over the regular Covington parser.