The parse is darc and full of errors: Universal dependency parsing with transition-based and graph-based algorithms

We developed two simple systems for dependency parsing: darc, a transition-based parser, and mstnn, a graph-based parser. We tested our systems in the CoNLL 2017 UD Shared Task, with darc being the official system. Darc ranked 12th among the 33 systems, just above the baseline. Mstnn had no official ranking, but its main score would have placed it above the 27th system. In this paper, we describe our two systems, examine their strengths and weaknesses, and discuss the lessons we learned.


Introduction
Universal Dependencies (UD) (Nivre et al., 2016) is a cross-linguistically consistent annotation scheme for dependency-based treebanks. UD version 2.0 (UD2) (Nivre et al., 2017b,a) provided the datasets for the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2017). In the shared task, participating systems were evaluated through the TIRA platform (Potthast et al., 2014). The main evaluation metric was the labeled attachment F1-score (LAS). 33 systems completed the official evaluation, including the baseline UDPipe (Straka et al., 2016).
We submitted a primary system darc and a secondary system mstnn, with the primary system partaking in the official evaluation. Both are open sourced under the MIT license. 1 The two systems differ only in the parsing algorithm.
Darc is equipped with a transition-based non-projective/projective parser. Mstnn is equipped with a graph-based non-projective unlabeled parser and a standalone labeler. Both systems utilize a neural network classifier with similar input features.
In this paper, we start with a description of our treatments for the different datasets in the shared task, then describe our two parsers in turn, and conclude with an analysis of the results.

Treatments for datasets
We were tasked with producing parsed outputs for 81 test-sets, either from raw texts or from segmented and tagged inputs produced by the baseline system.
The outputs were required to conform to the CoNLL-U format. 2 In this format, each node in a dependency graph has ten fields named ID, FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, HEAD, DEPREL, DEPS, and MISC, where ID, HEAD, and DEPREL define an edge. Segmentation establishes the graph/sentence boundaries while filling in ID, FORM, and MISC. Tagging fills in LEMMA, UPOSTAG, XPOSTAG, and FEATS. 63 test-sets have corresponding treebanks in UD2. These treebanks were the only training resources we used.
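As a concrete illustration, one tab-separated token line in this format splits into its ten named fields as sketched below (the token values are invented for the example):

```python
# Split one CoNLL-U token line (tab-separated) into its ten named fields.
FIELDS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token_line(line):
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

token = parse_token_line("2\truns\trun\tVERB\t_\tNumber=Sing\t0\troot\t_\t_")
# token["ID"], token["HEAD"], token["DEPREL"] define the edge:
# node 2 is attached to node 0 with the label "root"
```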

Big treebanks
The majority of the treebanks in UD2 (55/63) consist of train-sets and dev-sets. These are the big treebanks.
For segmentation and tagging, we trained UDPipe models on the train-sets and used the dev-sets for parameter tuning. 3 The only hyperparameter tuned on the dev-sets was the number of training iterations. All the other hyperparameters were simply taken from the baseline models (Straka, 2017), because of our limited computing power.
The gold-standard train-sets were re-tagged by our UDPipe models to produce training data for our parsers.

Small treebanks
The remainder of the treebanks (8/63) consist only of train-sets. These are the small treebanks.
Here we consulted the approach of the baseline system, which split the train-sets into three parts: train, tune, and dev. For our UDPipe models, both the train-sets and tune-sets were used for training, and the dev-sets for tuning. For our parser models, the entirety of each treebank was used for training, and all hyperparameters took the default values.

Parallel test-sets
The 14 parallel test-sets have no corresponding treebanks, but the corresponding languages exist. For these we used the preprocessed inputs from the baseline system, and picked our parser models according to the languages. If multiple treebanks exist for the same language, we took the model trained on the first one.

Surprise languages
4 test-sets have no corresponding languages, though small samples of gold-standard data were released as part of the shared task. Again, we used the preprocessed inputs from the baseline system.
For each sample we applied our existing parser models to pick the best treebank for training a delexicalized model. These delexicalized models rely mostly on UPOSTAG, but may utilize FEATS as well. This setting, along with the other hyperparameters, was tuned on the sample data. 4

Primary system: darc
Our primary system employs a transition-based parser.
We adapted our parser from Chen and Manning (2014), who used a neural network classifier in a transition-based parsing algorithm known as the arc-standard system (Nivre, 2008).
The neural network classifier requires little feature engineering, and is therefore easily adaptable to different languages, making it ideal for UD parsing. However, the arc-standard system is only applicable to projective parsing, while over half of the treebanks in UD2 have more than 10% non-projective sentences in their train-sets. For this reason, we adopted a non-projective variant by adding a swap action to the transition system (Nivre, 2009). We chose either the projective or the non-projective algorithm based on how they performed for each treebank. In the end, we used the non-projective one for all but three treebanks. 5

The transition algorithm
The algorithm produces a directed acyclic graph from a sequence of nodes [w_0, w_1, ..., w_n], which are the syntactic tokens of a sentence, where w_0 is a pseudo root node. The transition system consists of several transition actions defined over configurations. Each configuration is a triple consisting of a stack σ, a buffer β, and a set of edges A. From the initial configuration c_0 : (σ : [w_0], β : [w_1, w_2, ..., w_n], A : {}), a series of transition actions are taken to produce a terminal configuration c_m : (σ : [w_0], β : [], A), where A contains the edges of the parse. The possible transition actions are shift, left-arc_l, right-arc_l, and, in the non-projective variant, swap, with a separate arc action defined for each l in the set of dependency labels.
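The arc-standard actions plus swap can be sketched as follows (an illustrative implementation, not darc's actual code; arcs are stored as (head, label, dependent) triples, and nodes are represented by their integer IDs):

```python
# Arc-standard transition actions (Nivre, 2008) with the swap action
# for non-projective parsing (Nivre, 2009).

def shift(stack, buffer, arcs):
    # move the next buffer node onto the stack
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    # attach the second-topmost stack node under the topmost one
    dep = stack.pop(-2)
    arcs.append((stack[-1], label, dep))

def right_arc(stack, buffer, arcs, label):
    # attach the topmost stack node under the one beneath it
    dep = stack.pop()
    arcs.append((stack[-1], label, dep))

def swap(stack, buffer, arcs):
    # move the second-topmost stack node back to the buffer,
    # allowing nodes to be reordered for non-projective trees
    buffer.insert(0, stack.pop(-2))
```

For a two-word sentence with nodes [1, 2] where node 2 heads node 1, the sequence shift, shift, left-arc, right-arc drives the parser from the initial to the terminal configuration.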

Input features
As input features, we use the set of 18 graph nodes from Chen and Manning (2014):
• The top three words on the stack and the buffer
• The first and second leftmost and rightmost children of the top two stack words
• The leftmost-of-leftmost and rightmost-of-rightmost children of the top two stack words
For each node we take its FORM, LEMMA, UPOSTAG, FEATS, and DEPREL fields. Each field is represented through an embedding into the real vector space. However, some treebanks have no informative LEMMA. For these treebanks we omit the LEMMA embedding, and double the dimension of the FORM embedding. 6 All embeddings are trainable, except for the FEATS embedding. Each FEATS is represented by a vector of binary values, indicating the presence or absence of each attribute-value pair in the morphological vocabulary of its affiliated treebank.
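The binary FEATS encoding just described can be sketched as follows (the vocabulary here is a toy example; in practice it is the set of attribute-value pairs observed in the treebank):

```python
# Encode a FEATS string as binary indicators over a fixed vocabulary
# of attribute=value pairs; "_" denotes an empty FEATS field.

def feats_vector(feats, vocabulary):
    pairs = set(feats.split("|")) if feats != "_" else set()
    return [1 if pair in pairs else 0 for pair in vocabulary]

vocab = ["Case=Nom", "Case=Acc", "Number=Sing", "Number=Plur"]
feats_vector("Case=Nom|Number=Plur", vocab)  # [1, 0, 0, 1]
```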
In any transition configuration, some nodes may be missing, for which special dummy members exist in all embeddings. Special members are also appointed for the root node. For FORM and LEMMA, all hapaxes are replaced with a single arbitrary symbol. The number of parameters in the embedding matrices for FORM and LEMMA is substantial. Initializing these parameters with pretrained embeddings has been shown to be beneficial (Chen and Manning, 2014). To produce embeddings more suitable for capturing syntactic information, we used the tool developed by Ling et al. (2015). 7 The embeddings for UPOSTAG and DEPREL are randomly initialized from the uniform(−0.5, 0.5) distribution.

6 Korean, Portuguese-BR, English-LinES, Indonesian, Swedish-LinES, and Uyghur.
7 https://github.com/wlin12/wang2vec/ with options {-type 3 -hs 0 -min-count 2 -window 7 -sample 0.1 -negative 7 -iter 20}; though in fact [-min-count 2] had no effect, as we had all hapaxes replaced by an obscure symbol.
The inputs are first transformed by a hidden layer with 256 rectified linear units (ReLU), then by a second, similar hidden layer, and finally by a softmax layer with as many units as the number of transition actions. The softmax output assigns a probability prediction for each action.
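The network described above can be sketched in plain Python as follows (toy layer sizes for readability; darc used 256 units per hidden layer, and the real input is the concatenation of the feature embeddings):

```python
import math
import random

# Forward pass: two ReLU hidden layers, then a softmax over actions.
random.seed(0)

def relu(v):
    return [max(x, 0.0) for x in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def dense(weights, biases, v):
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def init(rows, cols, fan_in):
    # He-style uniform initialization scaled by the fan-in
    bound = math.sqrt(6.0 / fan_in)
    return [[random.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]

n_in, n_hidden, n_actions = 8, 6, 4
W1, b1 = init(n_hidden, n_in, n_in), [1.0] * n_hidden   # ReLU biases at one
W2, b2 = init(n_hidden, n_hidden, n_hidden), [1.0] * n_hidden
W3, b3 = init(n_actions, n_hidden, n_hidden), [0.0] * n_actions

def predict(x):
    h1 = relu(dense(W1, b1, x))
    h2 = relu(dense(W2, b2, h1))
    return softmax(dense(W3, b3, h2))
```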
The weights for all layers are initialized in a random uniform distribution following He et al. (2015). The ReLU layers have their biases initialized to ones, in order to alleviate the dying ReLU problem. The network is trained through backpropagation by the AdaMax (Kingma and Ba, 2014) algorithm.
In our experiments, we found it helpful to apply dropout (Srivastava et al., 2014) to both the trainable embedding layers and the hidden layers. For our network, a 25% dropout rate seems to be optimal. The 50% dropout rate suggested by Srivastava et al. (2014) requires extending the sizes of these layers, which would result in a polynomial increase in the number of parameters. Even though we did find a slight improvement in accuracy with a larger network and a higher dropout rate, we rejected extending the network because of our limited computing power.
For regularization, we apply the unit-norm constraint to the trainable embeddings, which ensures that each column of the embedding matrices is a unit vector. We found this helpful for stabilizing the accuracy in later iterations and achieving higher scores. We also experimented with the max-norm constraint, which only forces the norms of the column vectors to be no greater than the max-norm; we found that it can be better than the unit-norm constraint, but only for certain optimal max-norm values, which differed for every dataset.
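The unit-norm constraint amounts to the following projection applied after each parameter update (toy matrix below; in darc this applies to the embedding matrices):

```python
import math

# Rescale every column of a matrix (list of rows) to unit L2 length.
def unit_norm_columns(matrix):
    for j in range(len(matrix[0])):
        norm = math.sqrt(sum(row[j] ** 2 for row in matrix))
        if norm > 0:
            for row in matrix:
                row[j] /= norm

E = [[3.0, 0.0],
     [4.0, 2.0]]
unit_norm_columns(E)  # columns become [0.6, 0.8] and [0.0, 1.0]
```

The max-norm variant would rescale a column only when its norm exceeds the chosen threshold, leaving shorter columns untouched.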

Training and parsing
These hyperparameters are tuned during training, with their default values marked in bold:
• Algorithm: projective, non-projective
• Batch-size: 16, 32, 64
• Iterations: maximally 16
Our parser is greedy during parsing. From any configuration, only the action with the highest probability prediction is taken to advance into the next configuration. In case the action suggested by the classifier is illegal, the next best action is taken.
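The greedy decoding step with the legality fallback can be sketched as follows (names and the legality check are illustrative):

```python
# Pick the highest-probability action; if it is illegal in the current
# configuration, fall back to the next best, and so on.

def choose_action(probabilities, is_legal):
    """probabilities: {action: prob}; is_legal: action -> bool."""
    for action, _ in sorted(probabilities.items(),
                            key=lambda kv: kv[1], reverse=True):
        if is_legal(action):
            return action
    raise ValueError("no legal action in this configuration")

probs = {"left_arc": 0.5, "shift": 0.3, "right_arc": 0.2}
# e.g. left_arc is illegal when the stack is too shallow
choose_action(probs, lambda a: a != "left_arc")  # -> "shift"
```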
The transition algorithm does not prevent multiple nodes from being attached to the pseudo root node. However, this is not allowed in the UD treebanks. When this occurs, we keep the first attachment, and attach the other nodes to that node with the parataxis label.
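This post-processing step might look like the following sketch (assuming heads and labels stored in dictionaries keyed by node ID, with head 0 denoting the pseudo root):

```python
# Keep the first node attached to the pseudo root; re-attach any other
# root-attached nodes to it with the parataxis label.

def fix_multiple_roots(heads, labels):
    """heads: {node_id: head_id}; labels: {node_id: deprel}."""
    roots = sorted(i for i, h in heads.items() if h == 0)
    for i in roots[1:]:
        heads[i] = roots[0]
        labels[i] = "parataxis"

heads = {1: 0, 2: 1, 3: 0}
labels = {1: "root", 2: "obj", 3: "root"}
fix_multiple_roots(heads, labels)
# node 3 is now attached to node 1 with the label parataxis
```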
Apart from the regular syntactic nodes, the CoNLL-U format allows for empty-words and multi-words. We completely ignore the empty-words. We keep track of the multi-words, but ignore them during parsing.
The evaluation is only concerned with the UD labels, and not the language-specific subtypes. For example, acl:relcl is considered to be the same as acl. We experimented with removing the language-specific information before parsing, and found it to be helpful in some cases, but harmful in others. Either way, the differences are negligible.

A comparison with Parsito
Our parser is very similar to the Parsito parser (Straka et al., 2015) incorporated in UDPipe, which is also a transition-based parser with a feedforward neural network classifier.
The primary difference is in the training. Our parser uses only a static oracle, while Parsito supports a dynamic oracle, and may additionally utilize a search-based oracle.
The static oracle produces transition sequences which must lead to the gold-standard parse trees. A classifier trained only on the gold-standard transition sequences is not robust against its own errors. When an error is made, the parser arrives in a configuration which it has never seen before. To help the classifier make the best decision possible in any configuration, the dynamic oracle (Goldberg and Nivre, 2012) explores erroneous transitions suggested by the classifier itself. Parsito's search-based oracle applies the SEARN algorithm (Daumé et al., 2009) to mitigate this problem.
Moreover, in addition to the projective and non-projective transition systems, Parsito supports link2 (Attardi, 2006), a partially non-projective transition algorithm, which was used for more than one-third of the baseline models.
Despite the limitations of our parser in comparison with UDPipe's Parsito, it achieved comparable results in the shared task.

Secondary system: mstnn
In our secondary system, a graph-based non-projective unlabeled parser and a labeler are used.

Unlabeled parsing
We adapted the MSTParser (McDonald et al., 2005) with a neural network classifier. Starting with a fully connected directed graph, the classifier scores the edges between every two nodes, and then the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) is applied to find the maximum spanning arborescence. The algorithm was implemented using NetworkX (Hagberg et al., 2008).
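To illustrate what the Chu-Liu-Edmonds step computes, here is a brute-force equivalent for toy inputs (mstnn used NetworkX's efficient implementation; this exhaustive search is only workable for very short sentences):

```python
import itertools

def is_tree(heads):
    # every dependent must reach the root (node 0) without cycles;
    # heads[d-1] is the head of dependent d
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def max_arborescence(score, n):
    """score[h][d]: edge score from head h to dependent d; node 0 is
    the pseudo root. Returns the head assignment with maximal total score."""
    best, best_heads = float("-inf"), None
    for heads in itertools.product(range(n + 1), repeat=n):
        if any(h == d + 1 for d, h in enumerate(heads)):
            continue  # skip self-loops
        if not is_tree(heads):
            continue
        total = sum(score[h][d + 1] for d, h in enumerate(heads))
        if total > best:
            best, best_heads = total, heads
    return best_heads

# toy scores for a two-word sentence: the best tree attaches
# word 1 under word 2, and word 2 under the root
score = [[0.0, 1.0, 5.0],
         [0.0, 0.0, 2.0],
         [0.0, 6.0, 0.0]]
max_arborescence(score, 2)  # -> (2, 0)
```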
The neural network classifier accepts the following inputs:
• The two nodes connected by the candidate edge
• The distance between the two nodes (the arithmetic difference of their ID)
All features are constructed the same as in the primary system, except for the added distance feature. 8 The structure of the neural network is also the same, except that the output layer consists of a single sigmoid unit. The probability prediction of the sigmoid unit is taken as the score associated with the dependency arc in consideration.

Labeling
For labeling the edges, we implemented a linear support-vector classifier (Cortes and Vapnik, 1995) using LIBLINEAR (Fan et al., 2008) through scikit-learn (Pedregosa et al., 2011). 9 Input features to the classifier are the FORM, LEMMA, UPOSTAG, and FEATS fields of the two nodes, plus the UPOSTAG of their left & right neighbors. The FEATS field is represented in the same way as in the primary system (Section 3.2).

Results
The official test-run took 1 hour 47 minutes on a single-core Intel Xeon CPU, which included segmentation, tagging, and parsing. The secondary system took 3 hours and 14 minutes.
In Table 2, we report our official LAS (labeled attachment F1-score) and our rankings among the 33 systems. 10 We compare our system against the baseline (UDPipe 1.1), ÚFAL (UDPipe 1.2), and the best systems. Included are the macro-averaged scores for all and some subsets of the treebanks, plus three of our best-ranking & worst-ranking per-treebank scores.
Our official system was far from the best, but it was comparable to the two UDPipe systems, despite having a much simpler parser. Parsing aside, it had the highest all-tags F1-score with 73.92%, thanks to the MorphoDiTa (Straková et al., 2014) tool incorporated in UDPipe. However, the baseline was very close with 73.74%.

10 http://universaldependencies.org/conll17/results.html
In general, our primary system (darc) outperformed our secondary system (mstnn), both in LAS and UAS (unlabeled attachment F1-score). However, mstnn occasionally achieved better UAS or LAS, as shown in Table 3.

Discussions
In Section 3.5 we made a comparison between our darc parser and UDPipe's Parsito parser. Specifically, Parsito supports better oracles and an additional transition algorithm. We attribute the comparable performance of our much simpler parser to the following factors: • We used more training data.
The baseline models were trained on subsets of the train-sets, while using the rest for parameter tuning, leaving the dev-sets untouched. Our models were trained on the entire train-sets, and tuned on the dev-sets.
We believe this was the reason that we ranked above the baseline but below ÚFAL.
• We used LEMMA when possible.
In our experiments, we fixed the dimensionality of the lexical space, namely the target real vector space where we embed the lexical representation (FORM and/or LEMMA) of the vocabulary. We found that splitting the dimensions between FORM and LEMMA, as opposed to dedicating exclusively to either one, consistently produced the best results.
Further evidence of this is that for four out of the six treebanks where LEMMA was not used, darc performed worse than the baseline.
Splitting the lexical space between FORM and LEMMA actually decreases the number of parameters in the embedding matrices, compared with using FORM alone, because LEMMA has fewer types.
• We had a better representation for FEATS.
Simply treating FEATS as atomic symbols is subject to data sparsity, as shown in Table 4. Our representation is explained in Section 3.2. We experimented with normalizing the FEATS vectors into unit vectors by their L2-norm, or into probability distributions by their L1-norm as in Alberti et al. (2015). But simple binary indicators seemed to work best.
Despite the generally similar performance of the original MSTParser in comparison with transition-based parsers with similar learning algorithms, our own mstnn did not meet the expectation, when compared against darc.
The graph-based approach and the transition-based approach face different challenges, and produce different types of errors (McDonald and Nivre, 2007). The former suffers less from the errors of local decisions, but the latter usually benefits from richer features. In our case, the neural network classifier in mstnn used much less information from neighboring nodes than the classifier in darc.
The separate labeler in mstnn was also suboptimal. From UAS to LAS, the absolute drop was 4% higher for mstnn than for darc, which amounts to a 15.6% higher increase in error rate. This exemplifies a general problem with the pipeline approach: errors made in each step of the pipeline stack up quickly, made even worse by the snowball effect, where errors made in one step bring about more errors in the following steps. Another problem is that in a pipeline, the information necessary for making the correct decisions in one step may not be available until later. We experimented with unlabeled parsing using darc, and despite facing a much simpler task than labeled parsing, it yielded lower UAS.
The pipeline approach is a common weakness of both our systems. We believe that for tasks such as this one, an integrated end-to-end system is more desirable.