Deep Contextualized Word Embeddings in Transition-Based and Graph-Based Dependency Parsing - A Tale of Two Parsers Revisited

Transition-based and graph-based dependency parsers have previously been shown to have complementary strengths and weaknesses: transition-based parsers exploit rich structural features but suffer from error propagation, while graph-based parsers benefit from global optimization but have restricted feature scope. In this paper, we show that, even though some details of the picture have changed after the switch to neural networks and continuous representations, the basic trade-off between rich features and global optimization remains essentially the same. Moreover, we show that deep contextualized word embeddings, which allow parsers to pack information about global sentence structure into local feature representations, benefit transition-based parsers more than graph-based parsers, making the two approaches virtually equivalent in terms of both accuracy and error profile. We argue that the reason is that these representations help prevent search errors and thereby allow transition-based parsers to better exploit their inherent strength of making accurate local decisions. We support this explanation with an error analysis of parsing experiments on 13 languages.


Introduction
For more than a decade, research on data-driven dependency parsing has been dominated by two approaches: transition-based parsing and graph-based parsing (McDonald and Nivre, 2007, 2011). Transition-based parsing reduces the parsing task to scoring single parse actions and is often combined with local optimization and greedy search algorithms. Graph-based parsing decomposes parse trees into subgraphs and relies on global optimization and exhaustive (or at least non-greedy) search to find the best tree. These radically different approaches often lead to comparable parsing accuracy, but with distinct error profiles indicative of their respective strengths and weaknesses, as shown by McDonald and Nivre (2007, 2011).
In recent years, dependency parsing, like most of NLP, has shifted from linear models and discrete features to neural networks and continuous representations. This has led to substantial accuracy improvements for both transition-based and graph-based parsers and raises the question whether their complementary strengths and weaknesses are still relevant. In this paper, we replicate the analysis of McDonald and Nivre (2007, 2011) for neural parsers. In addition, we investigate the impact of deep contextualized word representations (Peters et al., 2018; Devlin et al., 2019) for both types of parsers.
Based on what we know about the strengths and weaknesses of the two approaches, we hypothesize that deep contextualized word representations will benefit transition-based parsing more than graph-based parsing. The reason is that these representations make information about global sentence structure available locally, thereby helping to prevent search errors in greedy transition-based parsing. The hypothesis is corroborated in experiments on 13 languages, and the error analysis supports our suggested explanation. We also find that deep contextualized word representations improve parsing accuracy for longer sentences, both for transition-based and graph-based parsers.

Two Models of Dependency Parsing
After playing a marginal role in NLP for many years, dependency-based approaches to syntactic parsing have become mainstream during the last fifteen years. This is especially true if we consider languages other than English, ever since the influential CoNLL shared tasks on dependency parsing in 2006 (Buchholz and Marsi, 2006) and 2007 (Nivre et al., 2007) with data from 19 languages.
The transition-based approach to dependency parsing was pioneered by Yamada and Matsumoto (2003) and Nivre (2003), with inspiration from history-based parsing (Black et al., 1992) and data-driven shift-reduce parsing (Veenstra and Daelemans, 2000). The idea is to reduce the complex parsing task to the simpler task of predicting the next parsing action and to implement parsing as greedy search for the optimal sequence of actions, guided by a simple classifier trained on local parser configurations. This produces parsers that are very efficient, often with linear time complexity, and which can benefit from rich non-local features defined over parser configurations but which may suffer from compounding search errors.
The graph-based approach to dependency parsing was developed by McDonald et al. (2005a,b), building on earlier work by Eisner (1996). The idea is to score dependency trees by a linear combination of scores of local subgraphs, often single arcs, and to implement parsing as exact search for the highest scoring tree under a globally optimized model. These parsers do not suffer from search errors, but parsing algorithms are more complex and restrict the scope of features to local subgraphs.
The terms transition-based and graph-based were coined by McDonald and Nivre (2007, 2011), who performed a contrastive error analysis of the two top-performing systems in the CoNLL 2006 shared task on multilingual dependency parsing: MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2006), which represented the state of the art in transition-based and graph-based parsing, respectively, at the time. Their analysis shows that, despite having almost exactly the same parsing accuracy when averaged over 13 languages, the two parsers have very distinctive error profiles. MaltParser is more accurate on short sentences, on short dependencies, on dependencies near the leaves of the tree, on nouns and pronouns, and on subject and object relations. MSTParser is more accurate on long sentences, on long dependencies, on dependencies near the root of the tree, on verbs, and on coordination relations and sentence roots.
McDonald and Nivre (2007, 2011) argue that these patterns can be explained by the complementary strengths and weaknesses of the systems: the transition-based parser makes accurate local decisions thanks to its rich structural features but suffers from error propagation in the long transition sequences needed for long dependencies and dependencies near the root, while the graph-based parser avoids search errors thanks to global optimization but is limited by its restricted feature scope when making local decisions.

[Figure 1: Dependency arc precision and recall by dependency length for MaltParser, MSTParser, and ZPar, from Zhang and Nivre (2012).]

Many of the developments in dependency parsing during the last decade can be understood in this light as attempts to mitigate the weaknesses of traditional transition-based and graph-based parsers without sacrificing their strengths. This may mean evolving the model structure through new transition systems (Nivre, 2008, 2009; Kuhlmann et al., 2011) or higher-order models for graph-based parsing (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010); it may mean exploring alternative learning strategies, in particular for transition-based parsing, where improvements have been achieved thanks to global structure learning (Zhang and Clark, 2008; Zhang and Nivre, 2011; Andor et al., 2016) and dynamic oracles (Goldberg and Nivre, 2012, 2013); it may mean using alternative search strategies, such as transition-based parsing with beam search (Johansson and Nugues, 2007; Titov and Henderson, 2007; Zhang and Clark, 2008) or exact search (Huang and Sagae, 2010; Kuhlmann et al., 2011), or graph-based parsing with heuristic search to cope with the complexity of higher-order models, especially for non-projective parsing (McDonald and Pereira, 2006; Koo et al., 2010; Zhang and McDonald, 2012); or it may mean hybrid or ensemble systems (Sagae and Lavie, 2006; Nivre and McDonald, 2008; Zhang and Clark, 2008; Bohnet and Kuhn, 2012). A nice illustration of the impact of new techniques can be found in Zhang and Nivre (2012), where an error analysis along the lines of McDonald and Nivre (2007, 2011) shows that a transition-based parser using global learning and beam search (instead of local learning and greedy search) performs on par with graph-based parsers for long dependencies, while retaining the advantage of the original transition-based parsers on short dependencies (see Figure 1).
Neural networks for dependency parsing, first explored by Titov and Henderson (2007) and Attardi et al. (2009), have come to dominate the field during the last five years. While this has dramatically changed learning architectures and feature representations, most parsing models are still either transition-based (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016) or graph-based (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017). However, more accurate feature learning using continuous representations and nonlinear models has allowed parsing architectures to be simplified. Thus, most recent transition-based parsers have moved back to local learning and greedy inference, seemingly without losing accuracy (Chen and Manning, 2014; Dyer et al., 2015; Kiperwasser and Goldberg, 2016). Similarly, graph-based parsers again rely on first-order models and obtain no improvements from using higher-order models (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017).
The increasing use of neural networks has also led to a convergence in feature representations and learning algorithms for transition-based and graph-based parsers. In particular, most recent systems rely on an encoder, typically in the form of a BiLSTM, that provides contextualized representations of the input words as input to the scoring of transitions (in transition-based parsers) or of dependency arcs (in graph-based parsers). By making information about the global sentence context available in local word representations, this encoder can be assumed to mitigate error propagation for transition-based parsers and to widen the feature scope beyond individual word pairs for graph-based parsers. For both types of parsers, this also obviates the need for complex structural feature templates, as recently shown by Falenska and Kuhn (2019). We should therefore expect neural transition-based and graph-based parsers to be not only more accurate than their non-neural counterparts but also more similar to each other in their error profiles.

Deep Contextualized Word Representations
Neural parsers rely on vector representations of words as their primary input, often in the form of pretrained word embeddings such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Bojanowski et al., 2016), which are sometimes extended with character-based representations produced by recurrent neural networks (Ballesteros et al., 2015). These techniques assign a single static representation to each word type and therefore cannot capture context-dependent variation in meaning and syntactic behavior.
By contrast, deep contextualized word representations encode words with respect to the sentential context in which they appear. Like word embeddings, such models are typically trained with a language-modeling objective, but yield sentence-level tensors as representations, instead of single vectors. These representations are typically produced by transferring a model's entire feature encoder, be it a BiLSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017), to a target task, where the tensor S typically has dimensionality S ∈ R^{N×L×D} for a sentence of length N, an encoder with L layers, and word-level vectors of dimensionality D. The advantage of such models, compared to the parser-internal encoders discussed in the previous section, is that they not only produce contextualized representations but do so over several layers of abstraction, as captured by the model's different layers, and are pretrained on corpora much larger than typical treebanks.
Deep contextualized embedding models have proven to be adept at a wide array of NLP tasks, achieving state-of-the-art performance in standard Natural Language Understanding (NLU) benchmarks, such as GLUE (Wang et al., 2019). Though many such models have been proposed, we adopt the two arguably most popular ones for our experiments: ELMo and BERT. Both models have previously been used for dependency parsing (Che et al., 2018; Jawahar et al., 2018; Lim et al., 2018; Kondratyuk, 2019; Schuster et al., 2019), but there has been no systematic analysis of their impact on transition-based and graph-based parsers.

ELMo
ELMo is a deep contextualized embedding model proposed by Peters et al. (2018), which produces sentence-level representations via a multi-layer BiLSTM language model. ELMo is trained with a standard language-modeling objective, in which a BiLSTM reads a sequence of N learned context-independent embeddings w_1, ..., w_N (obtained via a character-level CNN) and produces a context-dependent representation h_{j,k} = BiLSTM(w_{1:N}, k), where j (1 ≤ j ≤ L) is the BiLSTM layer and k is the index of the word in the sequence. The output of the last layer, h_{L,k}, is then employed in conjunction with a softmax layer to predict the next token at position k + 1.
The simplest way of transferring ELMo to a downstream task is to encode the input sentence S = w_1, ..., w_N by extracting the representations from the BiLSTM at layer L for each token w_k ∈ S: h_{L,1}, ..., h_{L,N}. However, Peters et al. (2018) posit that the best way to take advantage of ELMo's representational power is to compute a linear combination of the BiLSTM layers:

ELMo_k = γ Σ_{j=1}^{L} s_j h_{j,k}    (1)

where s_j is a softmax-normalized task-specific parameter and γ is a task-specific scalar. Peters et al. (2018) demonstrate that this scales the layers of linguistic abstraction encoded by the BiLSTM for the task at hand.
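ELMo's task-specific layer combination (Eq. 1) can be sketched as follows. This is a minimal NumPy illustration with our own variable names and array shapes, not code from the ELMo release:

```python
import numpy as np

def scalar_mix(layers, s_logits, gamma):
    """ELMo-style task-specific layer mix (Eq. 1):
    ELMo_k = gamma * sum_j softmax(s)_j * h_{j,k}.
    layers: array of shape (L, N, D) holding h_{j,k} for every layer j
    and word position k; s_logits: unnormalized layer weights, shape (L,).
    """
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                                      # softmax-normalized weights s_j
    return gamma * np.einsum('j,jnd->nd', s, layers)  # result has shape (N, D)
```

In the actual parser, s_logits and gamma would be trained jointly with the parsing objective; with equal logits the mix reduces to a scaled average of the layers.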

BERT
BERT (Devlin et al., 2019) is similar to ELMo in that it employs a language-modeling objective over unannotated text in order to produce deep contextualized embeddings. However, BERT differs from ELMo in that, in place of a BiLSTM, it employs a bidirectional Transformer (Vaswani et al., 2017), which, among other factors, carries the benefit of learning potential dependencies between words directly. This lies in contrast to recurrent models, which may struggle to learn correspondences between constituent signals when the time lag between them is long (Hochreiter et al., 2001). For a token w_k in sentence S, BERT's input representation is composed by summing a word embedding x_k, a position embedding i_k, and a WordPiece embedding s_k (Wu et al., 2016):

w_k = x_k + i_k + s_k

Each w_k ∈ S is passed to an L-layered bidirectional Transformer, which is trained with a masked language-modeling objective (i.e., randomly masking a percentage of input tokens and predicting only those tokens). For use in downstream tasks, Devlin et al. (2019) propose to extract the Transformer's encoding of each token w_k ∈ S at layer L, which effectively produces BERT_k.

Hypotheses
Based on our discussion in Section 2, we assume that transition-based and graph-based parsers still have distinctive error profiles due to the basic trade-off between rich structural features, which allow transition-based parsers to make accurate local decisions, and global learning and exact search, which give graph-based parsers an advantage with respect to global sentence structure. At the same time, we expect the differences to be less pronounced than they were ten years ago because of the convergence in neural architectures and feature representations. But how will the addition of deep contextualized word representations affect the behavior of the two parsers?
Given recent work showing that deep contextualized word representations incorporate rich information about syntactic structure (Goldberg, 2019; Liu et al., 2019; Tenney et al., 2019; Hewitt and Manning, 2019), we hypothesize that transition-based parsers have the most to gain from these representations, because they will improve their capacity to make decisions informed by global sentence structure and therefore reduce the number of search errors. Our main hypothesis can be stated as follows: Deep contextualized word representations are more effective at reducing errors in transition-based parsing than in graph-based parsing.
If this holds true, then the analysis of McDonald and Nivre (2007, 2011) suggests that the differential error reduction should be especially visible on phenomena such as: (1) longer dependencies, (2) dependencies closer to the root, (3) certain parts of speech, (4) certain dependency relations, and (5) longer sentences.
The error analysis will consider all these factors as well as non-projective dependencies.

Parsing Architecture
To be able to compare transition-based and graph-based parsers under equivalent conditions, we use and extend UUParser (de Lhoneux et al., 2017a; Smith et al., 2018a), an evolution of bist-parser (Kiperwasser and Goldberg, 2016), which supports transition-based and graph-based parsing with a common infrastructure but different scoring models and parsing algorithms.
For an input sentence S = w_1, ..., w_N, the parser creates a sequence of vectors w_{1:N}, where the vector w_k = x_k • BiLSTM(c_{1:M}) representing input word w_k is the concatenation of a pretrained word embedding x_k and a character-based embedding BiLSTM(c_{1:M}) obtained by running a BiLSTM over the character sequence c_{1:M} of w_k. Finally, each input element is represented by a BiLSTM vector, h_k = BiLSTM(w_{1:N}, k).
In transition-based parsing, the BiLSTM vectors are input to a multi-layer perceptron (MLP) for scoring transitions, using the arc-hybrid transition system from Kuhlmann et al. (2011), extended with a SWAP transition to allow the construction of non-projective dependency trees (Nivre, 2009; de Lhoneux et al., 2017b). The scoring is based on the top three words on the stack and the first word of the buffer, and the input to the MLP includes the BiLSTM vectors for these words as well as their leftmost and rightmost dependents (up to 12 words in total).
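For illustration, the three core arc-hybrid transitions can be sketched as follows. This is a simplified sketch with hypothetical names; the SWAP transition for non-projective trees and the scoring MLP are omitted:

```python
def step(action, stack, buffer, heads):
    """Apply one arc-hybrid transition (Kuhlmann et al., 2011).
    stack and buffer hold word indices; heads[d] = h records an arc h -> d.
    """
    if action == "SHIFT":          # move the first buffer word onto the stack
        stack.append(buffer.pop(0))
    elif action == "LEFT-ARC":     # first buffer word becomes head of stack top
        heads[stack.pop()] = buffer[0]
    elif action == "RIGHT-ARC":    # second stack item becomes head of stack top
        dep = stack.pop()
        heads[dep] = stack[-1]
    return stack, buffer, heads

# Parsing "She saw him" (words 1 2 3) with 0 as the artificial root:
stack, buffer, heads = [0], [1, 2, 3], {}
for a in ["SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]:
    step(a, stack, buffer, heads)
# heads is now {1: 2, 3: 2, 2: 0}: 'saw' heads both arguments, the root heads 'saw'
```

A greedy parser would choose each action with the MLP instead of following a fixed sequence; the sketch only shows how configurations evolve.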
In graph-based parsing, the BiLSTM vectors are input to an MLP for scoring all possible dependency relations under an arc-factored model, meaning that only the vectors corresponding to the head and the dependent are part of the input (2 words in total). The parser then extracts a maximum spanning tree over the score matrix using the Chu-Liu-Edmonds (CLE) algorithm (Edmonds, 1967), which allows us to construct non-projective trees.
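A compact reference implementation of the recursive Chu-Liu-Edmonds algorithm over a dense score matrix might look as follows. This is our own sketch, not the parser's actual code; node 0 is the artificial root and score[h][d] is the score of an arc h -> d:

```python
def find_cycle(head):
    """Return one cycle in the head assignment, or None (node 0 = root)."""
    color = [0] * len(head)  # 0 = unseen, 1 = on current path, 2 = done
    color[0] = 2
    for start in range(1, len(head)):
        v, path = start, []
        while color[v] == 0:
            color[v] = 1
            path.append(v)
            v = head[v]
        if color[v] == 1:                 # walked back into the current path
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return cycle
        for u in path:
            color[u] = 2
    return None

def chu_liu_edmonds(score):
    """Maximum spanning arborescence; returns head[d] for every node d > 0."""
    n = len(score)
    head = [0] * n
    for d in range(1, n):                 # greedy step: best head per word
        head[d] = max((h for h in range(n) if h != d), key=lambda h: score[h][d])
    cycle = find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    cyc_score = sum(score[head[v]][v] for v in cycle)
    outside = [v for v in range(n) if v not in cyc]
    idx = {v: i for i, v in enumerate(outside)}
    c = len(outside)                      # index of the contracted supernode
    m = c + 1
    new_score = [[float('-inf')] * m for _ in range(m)]
    arc = {}                              # contracted arc -> original arc
    for h in range(n):
        for d in range(1, n):
            if h == d or (h in cyc and d in cyc):
                continue
            if h in cyc:                  # arc leaving the cycle
                nh, nd, w = c, idx[d], score[h][d]
            elif d in cyc:                # arc entering the cycle replaces head[d] -> d
                nh, nd, w = idx[h], c, score[h][d] - score[head[d]][d] + cyc_score
            else:
                nh, nd, w = idx[h], idx[d], score[h][d]
            if w > new_score[nh][nd]:
                new_score[nh][nd] = w
                arc[(nh, nd)] = (h, d)
    new_head = chu_liu_edmonds(new_score) # recurse on the contracted graph
    result = [0] * n
    for nd in range(1, m):
        h, d = arc[(new_head[nd], nd)]
        result[d] = h
    broken = arc[(new_head[c], c)][1]     # cycle node whose head was replaced
    for v in cycle:
        if v != broken:
            result[v] = head[v]
    return result
```

In the parser, the score matrix would come from the MLP applied to all head-dependent pairs of BiLSTM vectors; arcs into the root can be forbidden with a large negative score.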
It is important to note that, while we acknowledge the existence of graph-based parsers that outperform the implementation of Kiperwasser and Goldberg (2016), such models do not meet our criteria for a systematic comparison.

Input Representations
In our experiments, we evaluate three pairs of systems, differing only in their input representations. The first is a baseline that represents tokens by w_k = x_k • BiLSTM(c_{1:M}), as described in Section 5.1. The word embeddings x_k are initialized via pretrained fastText vectors (x_k ∈ R^300) (Grave et al., 2018), which are updated for the parsing task. We term these transition-based and graph-based baselines TR and GR.
For the ELMo experiments, we make use of the pretrained models provided by Che et al. (2018), who train ELMo on 20 million words randomly sampled from raw WikiDump and Common Crawl datasets for 44 languages. We encode each gold-segmented sentence in our treebanks via the ELMo model for that language, which yields a tensor S_ELMo ∈ R^{N×L×D}, where N is the number of words in the sentence, L = 3 is the number of ELMo layers, and D = 1024 is the ELMo vector dimensionality. Following Peters et al. (2018) (see Eq. 1), we learn a linear combination and a task-specific γ of each token's ELMo representation, which yields a vector ELMo_k ∈ R^1024. We then concatenate this vector with w_k and pass it to the BiLSTM. We call the transition-based and graph-based systems enhanced with ELMo TR+E and GR+E.
For the BERT experiments, we employ the pretrained multilingual cased model provided by Google, which is trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias. The model features a 12-layer Transformer with 768 hidden units and 12 self-attention heads. In order to obtain a word-level vector for each token in a sentence, we experimented with a variety of representations: concatenating each Transformer layer's word representation into a single vector w_concat ∈ R^{768·12}, employing the last layer's representation, or learning a linear combination over a range of layers, as we do with ELMo (via Eq. 1). In a preliminary set of experiments, we found that the latter approach over layers 4-8 consistently yielded the best results, and thus chose to adopt this method going forward. Regarding tokenization, we select the vector for the first subword token, as produced by the native BERT tokenizer. Surprisingly, this gave us better results than averaging subword token vectors in a preliminary round of experiments. As with the ELMo representations, we concatenate each BERT vector BERT_k ∈ R^768 with w_k and pass it to the respective TR+B and GR+B parsers.
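The first-subword selection can be sketched as follows, using the "##" continuation convention of BERT's WordPiece tokenizer. The helper name, the toy pieces, and the small dimensionality are our own illustration:

```python
import numpy as np

def word_level_vectors(piece_vectors, wordpieces):
    """Select the vector of the first subword of each word.
    piece_vectors: (num_pieces, D) array of BERT outputs;
    wordpieces: the corresponding WordPiece strings, where a piece
    starting with '##' continues the previous word.
    """
    starts = [i for i, p in enumerate(wordpieces) if not p.startswith("##")]
    return piece_vectors[starts]

pieces = ["The", "un", "##touch", "##able", "case"]   # 3 words, 5 pieces
vecs = np.arange(5 * 4, dtype=float).reshape(5, 4)    # toy dimensionality 4
word_vecs = word_level_vectors(vecs, pieces)          # keeps rows 0, 1, 4
```

Averaging the subword vectors per word would be the obvious alternative; as noted above, it performed slightly worse in the preliminary experiments.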
It is important to note that, while the ELMo models we work with are monolingual, the BERT model is multilingual. In other words, while the standalone ELMo models were trained on the tokenized WikiDump and Common Crawl data for each language respectively, the BERT model was trained only on the former, albeit simultaneously for 104 languages. This means that the models are not strictly comparable, and it is an interesting question whether either of the models has an advantage in terms of training regime. However, since our purpose is not to compare the two models but to study their impact on parsing, we leave this question for future work.

Language and Treebank Selection
For treebank selection, we rely on the criteria proposed by de Lhoneux et al. (2017c) and adapted by Smith et al. (2018b): languages from different language families, with different morphological complexity, different scripts and character set sizes, different training sizes and domains, and with good annotation quality. This gives us 13 treebanks from UD v2.3 (Nivre et al., 2018), information about which is shown in Table 1.

Parser Training and Evaluation
In all experiments, we train parsers with default settings for 30 epochs and select the model with the best labeled attachment score on the dev set. All hyperparameters are specified in the supplementary material (Part A).
For each combination of model and training set, we repeat this procedure three times with different random seeds, apply the three selected models to the test set, and report the average result.

Error Analysis
In order to conduct an error analysis along the lines of McDonald and Nivre (2007, 2011), we extract all sentences from the smallest development set in our treebank sample (Hebrew HTB, 484 sentences) and sample the same number of sentences from each of the other development sets (6,292 sentences in total). For each system, we then extract parses of these sentences for the three training runs with different random seeds (18,876 predictions in total). Although it could be interesting to look at each language separately, we follow McDonald and Nivre (2007, 2011) and base our main analysis on all languages together to prevent data sparsity for longer dependencies, longer sentences, etc.
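The balanced sampling step can be sketched as follows (a hypothetical helper; the treebank keys are illustrative, and only the Hebrew figure comes from the text above):

```python
import random

def balanced_sample(dev_sets, seed=42):
    """Sample the same number of sentences from every dev set,
    capped by the size of the smallest one (here Hebrew HTB, 484)."""
    k = min(len(sents) for sents in dev_sets.values())
    rng = random.Random(seed)   # fixed seed for reproducibility
    return {tb: rng.sample(sents, k) for tb, sents in dev_sets.items()}
```

With 13 dev sets and k = 484 this yields the 6,292 sentences mentioned above.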

Results and Discussion
Table 2 shows labeled attachment scores for the six parsers on all languages, averaged over three training runs with different random seeds. The results clearly corroborate our main hypothesis. While ELMo and BERT provide significant improvements for both transition-based and graph-based parsers, the magnitude of the improvement is greater in the transition-based case: 3.99 vs. 2.85 LAS points for ELMo and 4.47 vs. 3.13 for BERT. In terms of error reduction, this corresponds to 21.1% vs. 16.5% for ELMo and 22.5% vs. 17.4% for BERT. The differences in error reduction are statistically significant at α = 0.01 (Wilcoxon).
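The error-reduction figures are relative to each baseline's own error rate, i.e. (LAS_new - LAS_base) / (100 - LAS_base). A one-line sketch, with illustrative numbers rather than values from Table 2:

```python
def error_reduction(base_las, new_las):
    """Relative reduction of the labeled attachment error, in percent."""
    return 100.0 * (new_las - base_las) / (100.0 - base_las)

# The same 4-point LAS gain is worth more on top of a stronger baseline:
error_reduction(80.0, 84.0)   # 20.0
error_reduction(90.0, 94.0)   # 40.0
```

This is why the comparison above is stated both in absolute LAS points and in relative error reduction.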
Although both parsing accuracy and absolute improvements vary across languages, the overall trend is remarkably consistent, and the transition-based parser improves more with both ELMo and BERT for every single language. Furthermore, a linear mixed-effects model analysis reveals that, when accounting for language as a random effect, there are no significant interactions between the improvement of each model (over its respective baseline) and factors such as language family (IE vs. non-IE), dominant word order, or number of training sentences. In other words, the improvements for both parsers seem to be largely independent of treebank-specific factors. Let us now see to what extent they can be explained by the error analysis.

Dependency Length
Figure 2 shows labeled F-score for dependencies of different lengths, where the length of a dependency between words w_i and w_j is equal to |i − j| (with root tokens in a special bin on the far left). For the baseline parsers, we see that the curves diverge with increasing length, clearly indicating that the transition-based parser still suffers from search errors on long dependencies, which require longer transition sequences for their construction. However, the differences are much smaller than in McDonald and Nivre (2007, 2011), and the transition-based parser no longer has an advantage for short dependencies, which is consistent with the BiLSTM architecture providing the parsers with more similar features that help the graph-based parser overcome the limited scope of the first-order model.
Adding deep contextualized word representations clearly helps the transition-based parser to perform better on longer dependencies. For ELMo there is still a discernible difference for dependencies longer than 5, but for BERT the two curves are almost indistinguishable throughout the whole range. This could be related to the aforementioned intuition that a Transformer captures long dependencies more effectively than a BiLSTM (see Tran et al. (2018) for contrary observations, albeit for different tasks). The overall trends for both baseline and enhanced models are quite consistent across languages, although with large variations in accuracy levels.

Distance to Root
Figure 3 reports labeled F-score for dependencies at different distances from the root of the tree, where distance is measured by the number of arcs in the path from the root. There is a fairly strong (inverse) correlation between dependency length and distance to the root, so it is not surprising that the plots in Figure 3 largely show the mirror image of the plots in Figure 2. For the baseline parsers, the graph-based parser has a clear advantage for dependencies near the root (including the root itself), but the transition-based parser closes the gap with increasing distance. (At the very end, the curves appear to diverge again, but the data is very sparse in this part of the plot.) For ELMo and BERT, the curves are much more similar, with only a slight advantage for the graph-based parser near the root, and with the transition-based BERT parser being superior from distance 5 upwards. The main trends are again similar across all languages.
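Distance to the root is simply the arc depth of the dependent, which can be computed by following head pointers. A short sketch, assuming a well-formed tree with index 0 as the root:

```python
def depth(heads, node):
    """Number of arcs on the path from the root (index 0) to node."""
    d = 0
    while node != 0:
        node = heads[node]   # climb one arc toward the root
        d += 1
    return d

heads = {2: 0, 1: 2, 3: 2, 4: 3}        # 0 -> 2, 2 -> {1, 3}, 3 -> 4
depths = [depth(heads, n) for n in (2, 1, 4)]   # [1, 2, 3]
```

Binning arcs by the depth of the dependent gives exactly the x-axis of Figure 3.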

Non-Projective Dependencies
Figure 4 shows precision and recall specifically for non-projective dependencies. We see that there is a clear tendency for the transition-based parser to have better precision and the graph-based parser better recall. (Incidentally, the same pattern is reported by McDonald and Nivre (2007, 2011), even though the techniques for processing non-projective dependencies are different in that study: pseudo-projective parsing (Nivre and Nilsson, 2005) for the transition-based parser and approximate second-order non-projective parsing (McDonald and Pereira, 2006) for the graph-based parser.) In other words, non-projective dependencies are more likely to be correct when they are predicted by the transition-based parser using the swap transition, but real non-projective dependencies are more likely to be found by the graph-based parser using a spanning tree algorithm. Interestingly, adding deep contextualized word representations has almost no effect on the graph-based parser (the breakdown per language shows marginal improvements for the enhanced graph-based models on a few languages, canceled out by equally marginal degradations on others), while especially the ELMo embeddings improve both precision and recall for the transition-based parser.

Parts of Speech and Dependency Types
Thanks to the cross-linguistically consistent UD annotations, we can relate errors to linguistic categories more systematically than in the old study. The main impression, however, is that there are very few clear differences, which is again indicative of the convergence between the two parsing approaches. We highlight the most notable differences and refer to the supplementary material (Part B) for the full results.
Looking first at parts of speech, the baseline graph-based parser is slightly more accurate on verbs and nouns than its transition-based counterpart, which is consistent with the old study for verbs but not for nouns. After adding the deep contextualized word representations, both differences are essentially eliminated.
With regard to dependency relations, the baseline graph-based parser has better precision and recall than the baseline transition-based parser for the relation of coordination (conj), which is consistent with the old study, as well as for clausal subjects (csubj) and clausal complements (ccomp), which are relations that involve verbs in clausal structures. Again, the differences are greatly reduced in the enhanced parsing models, especially for clausal complements, where the transition-based parser with ELMo representations is even slightly more accurate than the graph-based parser.

Sentence Length
Figure 5 plots labeled attachment score for sentences of different lengths, measured by number of words in bins of 1-10, 11-20, etc. Here we find the most unexpected results of the study. First of all, although the baseline parsers exhibit the familiar pattern of accuracy decreasing with sentence length, it is not the transition-based but the graph-based parser that is more accurate on short sentences and degrades faster. In other words, although the transition-based parser still seems to suffer from search errors, as shown by the results on dependency length and distance to the root, it no longer seems to suffer from error propagation in the sense that earlier errors make later errors more probable. The most likely explanation for this is the improved training of transition-based parsers using dynamic oracles and aggressive exploration to learn how to behave optimally also in non-optimal configurations (Goldberg and Nivre, 2012, 2013; Kiperwasser and Goldberg, 2016).
Turning to the models with deep contextualized word representations, we find that transition-based and graph-based parsers behave more similarly, which is in line with our hypotheses. However, the most noteworthy result is that accuracy improves with increasing sentence length. For ELMo this holds only from 1-10 to 11-20, but for BERT it holds up to 21-30, and even sentences of length 31-40 are parsed with higher accuracy than sentences of length 1-10. A closer look at the breakdown per language reveals that this picture is slightly distorted by different sentence length distributions across languages. More precisely, high-accuracy languages seem to have a higher proportion of sentences of mid-range length, causing a slight boost in the accuracy scores of these bins, and no single language exhibits exactly the patterns shown in Figure 5. Nevertheless, several languages exhibit an increase in accuracy from the first to the second bin or from the second to the third bin for one or more of the enhanced models (especially the BERT models). And almost all languages show a less steep degradation for the enhanced models, clearly indicating that deep contextualized word representations improve the capacity to parse longer sentences.

Conclusion
In this paper, we have essentially replicated the study of McDonald and Nivre (2007, 2011) for neural parsers. In the baseline setting, where parsers use pretrained word embeddings and character representations fed through a BiLSTM, we can still discern the basic trade-off identified in the old study, with the transition-based parser suffering from search errors that lead to lower accuracy on long dependencies and dependencies near the root of the tree. However, important details of the picture have changed. The graph-based parser is now as accurate as the transition-based parser on shorter dependencies and dependencies near the leaves of the tree, thanks to improved representation learning that overcomes the limited feature scope of the first-order model. And with respect to sentence length, the pattern has actually been reversed, with the graph-based parser being more accurate on short sentences and the transition-based parser gradually catching up thanks to new training methods that prevent error propagation.
When adding deep contextualized word representations, the behavior of the two parsers converges even more, and the transition-based parser in particular improves with respect to longer dependencies and dependencies near the root, as a result of fewer search errors thanks to enhanced information about the global sentence structure. One of the most striking results, however, is that both parsers improve their accuracy on longer sentences, with some models for some languages in fact being more accurate on medium-length sentences than on shorter sentences. This is a milestone in parsing research, and more research is needed to explain it.
In a broader perspective, we hope that future studies on dependency parsing will take the results obtained here into account and extend them by investigating other parsing approaches and neural network architectures. Indeed, given the rapid development of new representations and architectures, future work should include analyses of how all components in neural parsing architectures (embeddings, encoders, decoders) contribute to distinct error profiles (or lack thereof).

Figure 3: Labeled F-score by distance to root.

Figure 5: Labeled attachment score by sentence length.


Table 2: Labeled attachment score on 13 languages for parsing models with and without deep contextualized word representations.