Integrating Graph-Based and Transition-Based Dependency Parsers in the Deep Contextualized Era

Graph-based and transition-based dependency parsers used to have different strengths and weaknesses. Therefore, combining the outputs of parsers from both paradigms used to be the standard approach to improve or analyze their performance. However, with the recent adoption of deep contextualized word representations, the chief weakness of graph-based models, i.e., their limited scope of features, has been mitigated. Through two popular combination techniques – blending and stacking – we demonstrate that the remaining diversity in the parsing models is reduced below the level of models trained with different random seeds. Thus, an integration no longer leads to increased accuracy. When both parsers depend on BiLSTMs, the graph-based architecture has a consistent advantage. This advantage stems from globally-trained BiLSTM representations, which capture more distant look-ahead syntactic relations. Such representations can be exploited through multi-task learning, which improves the transition-based parser, especially on treebanks with a high ratio of right-headed dependencies.


Introduction
Dependency parsers can roughly be divided into two classes: graph-based (Eisner, 1996; McDonald et al., 2005) and transition-based (Yamada and Matsumoto, 2003; Nivre, 2003). The two paradigms differ in their approach to the trade-off between access to contextual features in the output dependency tree and exactness of search (McDonald and Nivre, 2007). The complementary strengths of these paradigms have given rise to numerous diversity-based methods for integrating parsing models (Nivre and McDonald, 2008; Sagae and Lavie, 2006, among others). To date, such methods are commonly used for improving the accuracy of single parsers, for obtaining robust predictions during silver-standard resource preparation (Schweitzer et al., 2018), or as analysis tools.
One of the most significant recent developments in dependency parsing is based on encoding rich sentential context into word representations, such as BiLSTM vectors (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) and deep contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019). Including these representations as features has set a new state of the art for both graph-based and transition-based parsers (Kiperwasser and Goldberg, 2016; Che et al., 2018). However, it has also brought the two architectures closer together. Kulmizev et al. (2019) showed that after including deep contextualized word embeddings, the average error profiles of graph- and transition-based parsers converge, potentially reducing the gains from combining them. On the other hand, the authors also noticed that the underlying trade-off between the parsing paradigms is still visible in their results. It is thus an open question to what extent the differences between the parsing paradigms can still be leveraged.
In this paper, we fill the gaps left in understanding the behavior of transition- and graph-based dependency parsers that employ today's state-of-the-art deep contextualized representations. We start from the setting of Kulmizev et al. (2019), i.e., Kiperwasser and Goldberg's (2016) seminal transition-based and graph-based parsers extended with deep contextualized word embeddings. We show that, on average, the differences between BiLSTM-based graph-based and transition-based models are reduced below the level of different random seeds. Interestingly, the diversity needed for a successful integration vanishes already with BiLSTM feature representations and does not change when deep contextualized embeddings are added. We further consider treebank-specific differences between graph- and transition-based models. Through a set of carefully designed experiments, we show that our graph-based parser has an advantage when parsing treebanks with a high ratio of right-headed dependencies. This advantage comes from globally-trained BiLSTMs and can be exploited by the locally-trained transition-based parser through multi-task learning (Caruana, 1993). This combination improves the performance of the two parsing architectures and narrows the gap between them without requiring additional computational effort at parsing time.

Parsing Architecture
We re-implement the basic transition- and graph-based architectures proposed by Kiperwasser and Goldberg (2016) (denoted K&G) with a few changes outlined below. We follow Kulmizev et al. (2019) and intentionally abstain from extensions such as Dozat and Manning's (2016) attention layer to keep our experimental setup simple. This enables us to control for all relevant methodological aspects of the architectures. Our hypothesis is that adding more advanced mechanisms would resemble adding contextualized word embeddings, i.e., improve the overall performance but not change the picture regarding parser combination. However, testing this hypothesis is orthogonal to this work.
All the described parsers are implemented with the DyNet library (Neubig et al., 2017). We provide details on the hyperparameters used in Appendix A.
Deep contextualized word representations. The two most popular models of deep contextualized representations are ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). Both models have been used with dependency parsers, either for multi-lingual applications (Kondratyuk and Straka, 2019; Schuster et al., 2019) or to improve parsing accuracy (Che et al., 2018; Jawahar et al., 2018; Lim et al., 2018). Recently, Kulmizev et al. (2019) analyzed the influence of both models on the K&G architecture and showed that they give similar results, with BERT slightly ahead. Since the scope of our experiments is to analyze the influence of contextualized embeddings on parser integration, and not to analyze differences between embedding models, we use ELMo, which is more accessible.
ELMo representations encode words within the context of the entire sentence. The representations are built as a linear combination of several layers of BiLSTMs pre-trained on a language modeling task: ELMo_i = γ Σ_j s_j h_{i,j}, where h_{i,j} is the representation of word w_i at layer j and the s_j are softmax-normalized layer weights. We use pre-trained ELMo models provided by Che et al. (2018) and train the task-specific parameters s_j and γ together with the parser. The final representations are combinations of L = 3 layers and have dimensionality 1024.
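The layer combination can be sketched as follows; this is a minimal NumPy illustration of the weighting scheme described above, with illustrative variable names and randomly generated layer outputs standing in for the actual pre-trained model:

```python
import numpy as np

def scalar_mix(layers, s, gamma):
    """ELMo_i = gamma * sum_j softmax(s)_j * h_{i,j}: combine the
    pre-trained layer outputs with task-specific weights s and scale gamma."""
    weights = np.exp(s) / np.exp(s).sum()          # softmax-normalized s_j
    return gamma * sum(w * h for w, h in zip(weights, layers))

# Example: L = 3 layers, a sentence of 7 tokens, 1024-dimensional vectors.
layers = [np.random.randn(7, 1024) for _ in range(3)]
s = np.zeros(3)     # task-specific layer weights, trained with the parser
gamma = 1.0         # task-specific scalar, trained with the parser
elmo = scalar_mix(layers, s, gamma)                # shape (7, 1024)
```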
Word representations. In both the transition- and graph-based architectures, input tokens are represented in the same way (see level [1] in Figure 1). For a given sentence with words [w_1, ..., w_n] and part-of-speech (POS) tags [t_1, ..., t_n], each word representation x_i is built by concatenating the embedding of the word, the embedding of its POS tag, a character-based BiLSTM embedding, and the word's ELMo representation: x_i = e(w_i) ∘ e(t_i) ∘ char(w_i) ∘ ELMo_i. Word embeddings are initialized with the pre-trained fastText vectors (Grave et al., 2018) and trained together with the model. The representations x_i are passed to the BiLSTM feature extractors (level [2]), so that each token is finally represented by its contextualized vector BiLSTM(x_{1:n}, i).
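A minimal sketch of the level [1] concatenation is given below; the individual embeddings are random placeholders with illustrative dimensions, not the trained DyNet components:

```python
import numpy as np

def word_representation(word_emb, pos_emb, char_emb, elmo_vec):
    """Level [1]: x_i = [e(w_i); e(t_i); char-BiLSTM(w_i); ELMo_i]."""
    return np.concatenate([word_emb, pos_emb, char_emb, elmo_vec])

# Toy dimensions: 100-dim word, 20-dim POS, 50-dim character, 1024-dim ELMo.
x_i = word_representation(np.random.randn(100), np.random.randn(20),
                          np.random.randn(50), np.random.randn(1024))
print(x_i.shape)   # (1194,) -- this vector is then fed to the BiLSTM (level [2])
```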
Transition-based parser. The part of the architecture that is specific to the transition-based K&G parser is colored red in Figure 1. For every configuration consisting of a stack, a buffer, and the current set of arcs, the parser builds a feature set of three items: the two top-most items of the stack and the first item on the buffer (denoted s_0, s_1, and b_0). Next, it concatenates their BiLSTM vectors and passes them to a multi-layer perceptron (MLP, level [3] in Figure 1). The MLP scores all possible transitions, and the highest-scoring one is applied to proceed to the next configuration. Our implementation (denoted TB) uses the arc-standard transition system extended with the SWAP transition (Nivre, 2009) and can thus handle non-projective trees. We use Nivre et al.'s (2009) lazy SWAP oracle for training. Labels are predicted together with the transitions.
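The scoring of one configuration can be sketched as follows; the MLP weights, output size, and zero-padding for missing items are illustrative simplifications rather than the exact implementation:

```python
import numpy as np

def score_transitions(bilstm_vecs, stack, buffer, W1, b1, W2, b2):
    """Concatenate the BiLSTM vectors of s_0, s_1, and b_0 and score all
    (transition, label) pairs with a one-hidden-layer MLP."""
    dim = bilstm_vecs.shape[1]
    def vec(index):
        # use a zero vector when the stack/buffer does not provide the item
        return bilstm_vecs[index] if index is not None else np.zeros(dim)
    s0 = stack[-1] if len(stack) > 0 else None
    s1 = stack[-2] if len(stack) > 1 else None
    b0 = buffer[0] if buffer else None
    features = np.concatenate([vec(s0), vec(s1), vec(b0)])
    hidden = np.tanh(W1 @ features + b1)     # MLP hidden layer
    return W2 @ hidden + b2                  # one score per scored transition

# Toy usage: 5 tokens with 250-dim BiLSTM vectors, 100 hidden units, 80 outputs.
vecs = np.random.randn(5, 250)
W1, b1 = np.random.randn(100, 750), np.zeros(100)
W2, b2 = np.random.randn(80, 100), np.zeros(80)
scores = score_transitions(vecs, stack=[0, 2], buffer=[3, 4], W1=W1, b1=b1, W2=W2, b2=b2)
best = int(scores.argmax())                  # the highest-scoring transition is applied
```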
For analysis, we also use variants of TB trained without BiLSTMs. In these cases, the vectors x_i are passed directly to the MLP layer (similarly to Chen and Manning (2014)), and the implicit context encoded by the BiLSTMs is lost. We compensate for this by using Kiperwasser and Goldberg's (2016) extended feature set, which adds the embedding information of eight additional tokens in the structural context of the parser state.
Graph-based parser. The parts specific to the graph-based K&G parser are highlighted in blue in Figure 1. At parsing time, every pair of tokens yields a feature set consisting of their BiLSTM vectors {x_i, x_j}. These representations are concatenated and passed to an MLP to compute the score of an arc x_i → x_j for every possible dependency label lbl (unlike the original K&G implementation, we predict labels together with the arcs). To find the highest-scoring tree, we apply the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). We denote this architecture GB. In experiments where GB is trained without BiLSTMs, we extend the feature set with surface features known from classic graph-based parsers, such as the distance between head and dependent, and the words at distance 1 and 2 from heads and dependents (McDonald et al., 2005).
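The arc scoring can be sketched as follows; this is a simplified NumPy illustration with illustrative MLP weights, and the subsequent maximum-spanning-tree decoding is only indicated in a comment:

```python
import numpy as np

def score_arcs(bilstm_vecs, W1, b1, W2, b2):
    """Fill a score tensor S[h, d, l] for every head h, dependent d, and
    label l by passing the concatenation [x_h; x_d] through an MLP."""
    n = bilstm_vecs.shape[0]
    num_labels = b2.shape[0]
    scores = np.full((n, n, num_labels), -np.inf)
    for h in range(n):
        for d in range(n):
            if h == d:
                continue
            pair = np.concatenate([bilstm_vecs[h], bilstm_vecs[d]])
            hidden = np.tanh(W1 @ pair + b1)
            scores[h, d] = W2 @ hidden + b2
    return scores

# Toy usage: 5 tokens, 250-dim vectors, 100 hidden units, 40 labels.
vecs = np.random.randn(5, 250)
S = score_arcs(vecs, np.random.randn(100, 500), np.zeros(100),
               np.random.randn(40, 100), np.zeros(40))
# The tree is then decoded by running Chu-Liu-Edmonds over S.max(axis=2).
```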

Integration Methods
Parser combination approaches can be divided into two categories: methods that integrate base parsers at prediction time and training time. We use one well-established representative from each of the categories, i.e., blending and feature-based stacking, respectively. Additionally, for analysis purposes, we combine the two parsers through multi-task learning.
Blending (see Figure 2a), also known as reparsing (Sagae and Lavie, 2006), is a parsing-time integration method. It consists of running the base models separately and combining their outputs into one graph. The weight of an arc in this graph depends on how many base models predicted it. Finally, a graph-based decoder is used to find the maximum spanning tree in the combined graph.
In our implementation, we use the Chu-Liu-Edmonds algorithm to find the final tree. For every resulting arc, we select the most frequent label among the labels previously assigned to it. Blending needs at least three base models to apply the voting scheme. Therefore, we follow Kuncoro et al. (2016) and train multiple instances of each model with different random seeds, denoting by BLEND n m a combination of n TB and m GB parsers. For analysis, we vary the ratio of TB and GB models while keeping the total number of models constant at 6 for a fair comparison.
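A minimal sketch of the voting scheme is given below; the maximum-spanning-tree decoder is passed in as an assumed helper function rather than implemented here, and the parse format is illustrative:

```python
from collections import Counter

def blend(predictions, chu_liu_edmonds):
    """Combine the outputs of several base parsers by arc voting.
    `predictions` is a list of parses, each mapping a dependent index to a
    (head index, label) pair.  `chu_liu_edmonds` is an assumed helper that
    takes arc weights {(head, dep): weight} and returns the arcs of the
    maximum spanning tree."""
    arc_votes = Counter()
    label_votes = {}
    for parse in predictions:
        for dep, (head, label) in parse.items():
            arc_votes[(head, dep)] += 1
            label_votes.setdefault((head, dep), Counter())[label] += 1
    tree = chu_liu_edmonds(dict(arc_votes))
    # for every arc kept in the tree, pick its most frequent label
    return {dep: (head, label_votes[(head, dep)].most_common(1)[0][0])
            for (head, dep) in tree}
```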
Feature-based stacking (see Figure 2b) was introduced by Nivre and McDonald (2008) and Martins et al. (2008). It involves running two parsers in sequence, so that the second (level-1) parser can use the output of the first (level-0) parser as features (denoted STACK level-1 level-0). To generate training data for the level-1 parser, we apply 10-fold cross-validation on the training sets with the level-0 parser. Then, we follow Ouchi et al. (2014) and extract stacking features from the level-0 parser's predictions in the form of supertags. More precisely, for every word w_i, we build its supertag by filling the template label/hdir+hasLdep hasRdep, where label is the dependency relation, hdir denotes the relative head direction, and hasLdep/hasRdep mark the presence of left/right dependents. Such supertags are then, similarly to POS tags, represented as embeddings and concatenated with the other representations to build x_i. The dimensionality and type of information encoded in the stacking representations were determined in exploratory experiments on the English development data and left unchanged for the other languages.
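A sketch of the supertag construction is shown below; the exact string layout of the template is an assumption made for illustration:

```python
def supertag(dep_label, head_index, word_index, left_deps, right_deps):
    """Build a stacking supertag of the form label/hdir+hasLdep_hasRdep
    from the level-0 prediction for one word (string layout illustrative)."""
    if head_index == 0:
        hdir = "ROOT"
    else:
        hdir = "L" if head_index < word_index else "R"
    has_l = "Ldep" if left_deps else "noLdep"
    has_r = "Rdep" if right_deps else "noRdep"
    return f"{dep_label}/{hdir}+{has_l}_{has_r}"

# e.g. supertag("nsubj", 3, 1, [], []) -> "nsubj/R+noLdep_noRdep"
```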
Multi-task learning (see Figure 2c) allows combining the transition- and graph-based K&G parsers by sharing their BiLSTM representations (level [2] in Figure 1). We keep the feature extraction and MLP layers separate and do not enforce any agreement between the two decoders. Effectively, this means that training yields two parsers that can be applied independently: one transition-based (denoted MTL TB GB) and one graph-based (MTL GB TB). We use a straightforward MTL training protocol: for every sentence, we calculate the BiLSTM representations x_i and collect all local losses from both tasks (TB and GB). Then, the losses are summed and the model parameters are updated through backpropagation. We note in passing that this training protocol leaves many options for improvements, such as adding weights to the losses of the different tasks (Shi et al., 2017b), sharing representations at different levels of the BiLSTMs (Søgaard and Goldberg, 2016), or employing stack-propagation (Zhang and Weiss, 2016). We abstain from such extensions as they are orthogonal to the central points of our analysis.
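The shape of one training step can be sketched as follows; this is a PyTorch-style toy illustration of the summed-loss protocol, not the DyNet implementation, and all modules and dimensions are placeholders:

```python
import torch
import torch.nn as nn

# Shared BiLSTM encoder and two separate heads standing in for the TB and
# GB decoders (all sizes here are illustrative).
shared_bilstm = nn.LSTM(input_size=100, hidden_size=125,
                        bidirectional=True, batch_first=True)
tb_head = nn.Linear(250, 80)   # stand-in for the transition-based MLP
gb_head = nn.Linear(250, 80)   # stand-in for the graph-based MLP
optimizer = torch.optim.Adam(list(shared_bilstm.parameters()) +
                             list(tb_head.parameters()) +
                             list(gb_head.parameters()))

def mtl_step(x, tb_target, gb_target):
    """Sum the local losses of both tasks so that the gradients of both
    decoders flow into the shared BiLSTM parameters."""
    optimizer.zero_grad()
    vectors, _ = shared_bilstm(x)                  # shared representations x_i
    loss_tb = nn.functional.cross_entropy(tb_head(vectors).transpose(1, 2), tb_target)
    loss_gb = nn.functional.cross_entropy(gb_head(vectors).transpose(1, 2), gb_target)
    loss = loss_tb + loss_gb                       # unweighted sum of losses
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call: one sentence of 7 tokens with 100-dimensional input vectors.
mtl_step(torch.randn(1, 7, 100),
         torch.randint(0, 80, (1, 7)), torch.randint(0, 80, (1, 7)))
```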
We use automatically predicted universal POS tags in all the experiments. The tags are assigned using a CRF tagger (Mueller et al., 2013). We annotate the training sets via 5-fold jackknifing.
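The jackknifing procedure can be sketched as follows; `train_tagger` is an assumed helper that trains a tagger on a list of sentences and returns a tagging function:

```python
def jackknife(sentences, train_tagger, k=5):
    """Annotate the training set with automatically predicted tags via
    k-fold jackknifing: each fold is tagged by a tagger trained on the
    remaining folds, so that training and test tagging quality match."""
    folds = [sentences[i::k] for i in range(k)]
    tagged = []
    for i, held_out in enumerate(folds):
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        tagger = train_tagger(rest)                # assumed helper
        tagged.extend(tagger(sentence) for sentence in held_out)
    return tagged
```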

Evaluation and Analysis
We evaluate the experiments using the Labeled Attachment Score (LAS). We train models for 30 epochs and select the best model based on development LAS. For the results on the test sets, we follow Reimers and Gurevych's (2018) recommendation and report averages and standard deviations over six models trained with different random seeds. We test for significance using the Wilcoxon rank-sum test with p-value < 0.05.
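For completeness, the metric itself is straightforward; a minimal sketch with an illustrative parse format:

```python
def las(gold, predicted):
    """Labeled Attachment Score: the percentage of tokens that receive both
    the correct head and the correct dependency label.  Each parse is a
    list of (head, label) pairs, one entry per token."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# e.g. las([(0, "root"), (1, "obj")], [(0, "root"), (2, "obj")]) == 50.0
```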
An analysis is carried out on the development sets in order not to compromise the test sets. We follow Kulmizev et al. (2019) and sample the same number of sentences from every development set (484 sentences since this is the size of the smallest one). We then aggregate results from three models trained with different random seeds and present the combined results.

Diversity-Based Integration
We start by evaluating the two integration methods (STACK and BLEND), applying them to our transition- and graph-based parsers (TB and GB).
Average results. The first column in Table 1 gives the average results. In the case of stacking, the performance of the combined models is almost the same as that of the baseline models. Small improvements are noticeable for STACK TB GB vs. TB, but they are statistically significant only for one treebank. Comparing STACK GB TB vs. GB, we even notice a small average drop of 0.08 LAS. In the case of blending, the method does provide big improvements over the single baselines (BLEND 6 0 vs. TB and BLEND 0 6 vs. GB). However, those improvements do not come from integrating different paradigms, since BLEND 3 3 achieves the same average performance as BLEND 0 6, which uses only GB. There are two possible explanations for the lack of gains from integrating parsing paradigms: either (1) the neural models are, in general, simply not capable of benefiting from such a combination, or (2) the feature representations based on BiLSTMs and deep contextualized representations bring the architectures too close to each other for the integration to be beneficial. In Section 4, we investigate which of these two situations takes place.
Since blending already involves multiple models, we run it only once and do not test the results for significance.
Treebank-specific results. Next, we take a closer look at the treebank-specific accuracy.
Comparing the single baselines (TB vs. GB), we note that GB has a clear advantage over TB. It surpasses TB on twelve out of thirteen treebanks (all improvements are significant). We reproduce the analysis from Kulmizev et al. (2019) and confirm that this advantage is consistent across arcs of different lengths, distances to the root, and sentences of different sizes (we provide the corresponding plots in Appendix A). Interestingly, the dominance of GB over TB differs significantly across treebanks and is especially prominent for the more challenging ones, e.g., those with small amounts of training data or a high level of non-projectivity. For instance, the difference is largest for Basque (2.25 LAS), the treebank with the largest number of non-projective arcs, and for Turkish, which has the smallest training set. Moreover, those are the treebanks where STACK TB GB offers small improvements (0.39 LAS and 0.24 LAS, respectively), but both STACK GB TB and BLEND 3 3 cannot make use of the diversity in the predictions of the two models and cause the accuracy to drop (comparing STACK GB TB vs. GB and BLEND 3 3 vs. BLEND 0 6). In the case of non-neural parsers, a big gap between the performance of a strong graph-based model and a greedy transition-based model does not prevent the former from learning from the latter (Faleńska et al., 2015). This raises the questions of where those treebank-specific differences come from and why the integration methods cannot benefit from them. We address these questions in Section 5.

Parsing Architectures and Diversity
In this section, we investigate which aspects of the K&G architecture are responsible for the lack of gains from the integration. For this purpose, we run ablation experiments and apply blending and stacking to models trained with and without BiLSTMs and with and without ELMo representations.

Feature-based Stacking
We perform stacking with different types of level-0 information. Apart from the standard setup, in which TB is stacked on top of GB or vice versa (denoted O, for other), we carry out two types of control experiments: S (for self), where we stack a model on itself, and G (for gold), where gold-standard trees are used as level-0 predictions.
Oracle experiments. Figure 3a displays the results for stacking with different level-0 information. We immediately see that scenario G, in which models are stacked on gold-standard trees, exhibits almost perfect performance. Regardless of the level-1 parser and of whether BiLSTMs and ELMo are employed, all models achieve an accuracy higher than 95 LAS, proving that they are capable of learning from the stacking representations.
Influence of representations. Next, we consider the models trained without BiLSTMs and ELMo (left, lightest bars). Surprisingly, for both TB (green) and GB (blue), small improvements can be noticed in the self-application scenario S, which was not the case for non-neural models (Martins et al., 2008; Faleńska et al., 2015). One explanation for this is the diversity between models trained with different random seeds, which was less prominent in their non-neural versions (Reimers and Gurevych, 2017). However, clearer improvements are visible in scenario O, which combines models of different types. Both STACK TB GB and STACK GB TB surpass both single baselines, showing that the integration is beneficial when BiLSTMs and ELMo are not used.
Considering the case where BiLSTMs are included (middle) changes the picture.
Self-application behaves almost on par with stacking the parsers on each other. The only modest improvement (amounting to 0.18 LAS on average) occurs for STACK TB GB, but it is not enough to surpass a single GB baseline.
As expected, adding ELMo (right, darkest bars) results in big improvements compared to the models without these representations. However, those improvements do not affect the stacking results, and the picture regarding the integration of the architectures stays the same.

Blending

Figure 3b presents the results for blending with different ratios of TB and GB. We start by analyzing the models trained without BiLSTMs and ELMo (left). We observe the pattern we would expect from diverse models: (1) blending always improves over the baselines (signified by red lines); (2) combining models of only one sort (BLEND 0 6 or BLEND 6 0) yields lower scores than when we introduce more diversity into the combination; (3) the best result is obtained by BLEND 3 3, where the same number of TB and GB models is used.
For the models that use BiLSTMs (middle), the gains coming from blending are smaller. For example, BLEND 6 0 improves TB by 1.84 LAS, whereas the corresponding improvement when no BiLSTMs are used is 3.67 LAS. Interestingly, the models show a different pattern when it comes to diversity within the combination. The accuracy of the blend increases with the number of GB models. Although BLEND 2 4 achieves the highest accuracy, it surpasses BLEND 0 6 by only 0.05 LAS. This suggests that TB models do not bring enough diversity into the combination, and the accuracy of BLEND is mostly influenced by GB models.
Finally, for the models that use ELMo (right), the improvements over the baselines are slightly smaller: BLEND 0 6 improves GB by 1.34 LAS, compared to 1.59 LAS when no ELMo is used. However, the picture regarding diversity is the same, and the overall performance depends on the number of GB models and not on the diversity among the combined paradigms.
To conclude, we showed that the performance of TB and GB models can be improved through the traditional diversity-based approaches as long as no BiLSTMs are used. Otherwise, the gains from combination methods decrease considerably. Adding ELMo representations improves the performance of both of the models but has almost no impact on the outcome of the integration.

Representations and Treebank-Specific Diversity
In the previous section, we saw that BiLSTMs remove most of the average benefits of the integration methods. One explanation might be that when both TB and GB use the same feature representations, the diversity between them is much smaller, thus reducing the gains the models could draw from each other. However, when comparing the treebank-specific results in Table 1, we noticed that in specific cases the two baselines differ considerably. We now investigate where those differences come from and whether they can be exploited.

Representation Analysis
First, we take a closer look at the information encoded in the representations learned by the transition- and graph-based models. To this end, we measure how sensitive a particular part of the architecture is with respect to changes in the input. Specifically, we use our metric IMPACT from Falenska and Kuhn (2019), which measures how every BiLSTM representation x_i is influenced by every word representation x_j from the sentence. Intuitively, IMPACT can be thought of as a percentage distributed over all words of the sentence: the higher the percentage of x_j, the more it influenced the representation of x_i. For every sentence from the development set and every vector x_i, we calculate the IMPACT values of all words x_j on x_i and bucket those values according to the distance between j and i. Figure 4 shows the average impact of tokens at particular positions. We see the same two general patterns as Gaddy et al. (2018): (1) closer words have larger effects on the representations, and (2) even words 15 or more positions away influence the vectors.
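The bucketing and averaging step can be sketched as follows; the IMPACT metric itself is defined in Falenska and Kuhn (2019) and is only assumed here as a helper `impact(sent, i, j)` returning the influence of x_j on x_i:

```python
from collections import defaultdict

def average_impact_by_distance(sentences, impact):
    """Bucket IMPACT values by the signed distance j - i between the
    influencing word j and the focus word i, and average every bucket.
    `impact(sent, i, j)` is an assumed helper, not re-implemented here."""
    buckets = defaultdict(list)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(n):
                if i != j:
                    buckets[j - i].append(impact(sent, i, j))
    return {dist: sum(vals) / len(vals)
            for dist, vals in sorted(buckets.items())}
```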
Transition-based parser. For representations trained with TB (Figure 4a), the difference in the signals coming from heads and other tokens is bigger on the left side than on the right side (see, e.g., positions −15 and 15). de Lhoneux et al. (2019) provided an explanation for this and showed that for greedy, locally-trained models, the forward LSTMs can be interpreted as rich history-based features, while the backward LSTMs can be thought of as look-ahead features. Since the information to the right mostly (i.e., except for the buffer front) comes from the backward LSTMs, it contains, as in the case of standard look-ahead features, less information about structural relations.
Graph-based parser. Representations trained together with GB (Figure 4b) show a slightly different pattern. Compared to TB, the impact of heads is smaller for tokens close by, but it deteriorates more slowly. Since this model is globally-trained, the influence of heads does not depend on the side: the plot is almost symmetrical, suggesting that the representations encode as much information about syntactic relations on the left as on the right.

BiLSTMs Integration
Next, we investigate whether the observed differences in the information encoded in the BiLSTM representations can explain the advantage of GB over TB. We train new models in which we share these intermediate representations between the two parsers through multi-task learning (MTL). We hypothesize that if the advantage of GB stems from global training and the influence it has on the representations, then MTL will re-balance the representations and, as a result, narrow the gap between the two models. We note in passing that MTL is typically carried out on different tasks, often with different training sets. However, it is perfectly possible to consider graph-based and transition-based dependency parsing as two separate tasks trained on the same training set.
IMPACT analysis. Figure 4c displays the IMPACT statistics for the MTL models.
The plot shows that the BiLSTM representations draw on the advantages from both locally trained TB and globally-trained GB -the distribution has a slightly stronger peak for closer words as in TB, but flattens out more slowly as in GB. This effect is particularly pronounced when comparing the far right (look-ahead) of TB with the MTL distribution, especially as heads become more influential.
Error analysis. To understand how the changes in representations influence the parsing performance, we break down the LAS by dependency length and head direction. Figure 5 shows the dependency recall and precision of the models with respect to the position of the head (dependency recall is defined as the percentage of correct predictions among gold-standard arcs with head position p, and precision as the percentage of correct predictions among all predicted arcs; these definitions differ slightly from McDonald and Nivre (2007), who looked at absolute arc lengths). First, we compare TB (blue) with GB (green) and observe that GB has a consistent advantage. However, when comparing recall and precision, we note an interesting difference. In terms of recall (Figure 5a), the plot is symmetrical and the advantage of GB is roughly the same for heads on the left and on the right side of the token. In terms of precision (Figure 5b), both TB and GB behave identically for heads on the left, but the performance of TB drops faster for heads at positions 3 and more to the right.
Second, we analyze MTL TB GB (red). We notice that sharing representations with GB does not influence TB's recall or precision on heads on the left, while for right-headed dependencies precision improves. The model catches up with GB's performance and starts deteriorating much later, for heads on positions 6 and more to the right.
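The breakdown can be computed as sketched below; a minimal Python sketch of per-position recall and precision under the definitions above, with an illustrative parse format:

```python
from collections import defaultdict

def recall_precision_by_head_position(gold, predicted):
    """Per signed head position p = head - dependent:
    recall    = correct arcs / gold arcs with position p,
    precision = correct arcs / predicted arcs with position p.
    Each parse maps a dependent index to a (head index, label) pair."""
    gold_cnt, pred_cnt, correct = defaultdict(int), defaultdict(int), defaultdict(int)
    for dep, (head, label) in gold.items():
        p = head - dep
        gold_cnt[p] += 1
        if predicted.get(dep) == (head, label):
            correct[p] += 1
    for dep, (head, _) in predicted.items():
        pred_cnt[head - dep] += 1
    recall = {p: correct[p] / gold_cnt[p] for p in gold_cnt}
    precision = {p: correct[p] / pred_cnt[p] for p in pred_cnt}
    return recall, precision
```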
Treebank-specific improvements. Finally, we look at the treebank-specific accuracy of the MTL models. The two bottom rows in Table 2 show the effects of MTL on GB. Sharing representations between the two architectures has a small influence on GB and, on average, improves its performance by 0.18 LAS. Although bigger improvements can be seen for a few treebanks, e.g., Chinese (0.39 LAS) or Swedish (0.35 LAS), none of them is statistically significant. Therefore, it is not clear whether those improvements come from the actual combination of different parsing paradigms or whether MTL in this case acts as additional regularization, ultimately reducing overfitting during training.
In the case of TB, the average performance is improved through MTL by 0.42 LAS, with statistically significant differences for four treebanks. The biggest gains are visible for the treebanks where the difference between TB and GB is greatest, such as Basque (0.94 LAS). Interestingly, among the treebanks with the biggest improvements are Turkish (1.28 LAS) and Chinese (0.52 LAS), the two treebanks with the greatest ratio of right-headed arcs (62.58% and 71.86%, respectively). This result is in line with the findings of de Lhoneux et al. (2019), who demonstrated that backward LSTMs are especially important for head-final languages.
To conclude, we saw that the advantage of GB over TB stems from global training. The training increases the impact of tokens (far) to the right as compared to a locally trained TB model and translates into an improved prediction of right-headed dependencies. Thus, the distance between the two models is treebank-related and can be reduced through integration methods such as MTL, especially when parsing more challenging treebanks.

Related Work
Traditional integration of dependency parsers. Classical integration methods were initially introduced to take advantage of differences in the strengths of the component parsers. Such differences were usually the result of different parsing paradigms, as in the case of feature-based stacking, blending, or beam-search transition-based parsers with features strongly inspired by graph-based models (Zhang and Clark, 2008; Bohnet and Kuhn, 2012). However, combining parsers that process the input left-to-right and right-to-left (Hall et al., 2007; Attardi and Dell'Orletta, 2009), or even parsers and sequence labelers (Faleńska et al., 2015), was also proposed. Blending was usually applied to a mixture of graph-based and transition-based left-to-right and right-to-left parsers (Sagae and Lavie, 2006; Surdeanu and Manning, 2010; Björkelund et al., 2017, among others). Moreover, in the case of stacking, integrating two parsers of the same type gives at most minor improvements (Martins et al., 2008).
Since neural network training can be sensitive to initialization (Reimers and Gurevych, 2017), recent ensemble dependency parsers combine models trained with different random seeds rather than models from different paradigms. For example, out of 24 teams participating in the CoNLL 2018 Shared Task on dependency parsing (Zeman et al., 2018), five employed ensemble techniques. However, all of them took advantage of the diversity coming either from random seeds or from different languages. Neural parsers of the same type can be combined by taking the sum of their MLP scores (Che et al., 2017), by averaging softmax scores (Che et al., 2018), or through re-parsing (Kuncoro et al., 2016). The last authors also showed that such an ensemble can be distilled into a single graph-based parser. Finally, Shi et al. (2017b) used MTL in a way similar to ours. They shared BiLSTMs between three parsers to speed up their training time. However, all the models were globally trained, and the authors did not evaluate whether the combination improved their performance.

Discussion and Conclusion
In this paper, we investigated the recent advances in dependency parsing from the perspective of the traditional integration methods. These methods are known for exploiting diversity in the strengths and weaknesses of the transition- and graph-based parsing paradigms. We found that when models use BiLSTMs, such diversity is on the level of different random seeds. Adding deep contextualized representations on top of BiLSTMs improves the performance of both parsers but does not change the picture regarding the integration.
Rich feature sets used to be the advantage of transition-based parsers. Now that the parsers do not need structural features (Falenska and Kuhn, 2019), the graph-based parsers have an advantage that the locally-trained transition-based parsers cannot make up for. Therefore, improving parsers through combination methods is not as straightforward as it used to be. Such a combination has to take into consideration the specificity of the treebank and depend on whether accuracy or parsing time is the priority. The greatest gains in accuracy can be obtained by blending multiple graph-based models. However, the method comes with the cumbersome overhead of running multiple predictors at application time. When speed is essential and some accuracy can be sacrificed (Gómez-Rodríguez et al., 2017), greedy transition-based parsers or even sequence labelers are the preferable choices (Strzyz et al., 2019). In such cases, alternative integration approaches such as multi-task learning can boost the performance of locally-trained models without requiring additional computational effort at parsing time.
The introduction of BiLSTMs into dependency parsers had another consequence: it enabled the use of exact search algorithms for transition-based parsers (Shi et al., 2017a; Gómez-Rodríguez et al., 2018). It is therefore an interesting question whether the error profiles of such parsers are even less distinguishable from the graph-based outputs. We leave this question for future work.