Dynamic Oracles for Top-Down and In-Order Shift-Reduce Constituent Parsing

We introduce novel dynamic oracles for training two of the most accurate known shift-reduce algorithms for constituent parsing: the top-down and in-order transition-based parsers. In both cases, the dynamic oracles manage to notably increase their accuracy, in comparison to that obtained by performing classic static training. In addition, by improving the performance of the state-of-the-art in-order shift-reduce parser, we achieve the best accuracy to date (92.0 F1) obtained by a fully-supervised single-model greedy shift-reduce constituent parser on the WSJ benchmark.


Introduction
The shift-reduce transition-based framework was introduced into constituent parsing by Sagae and Lavie (2005), who successfully adapted it from the dependency formalism and significantly increased phrase-structure parsing performance.
A shift-reduce algorithm uses a sequence of transitions to modify the content of two main data structures (a buffer and a stack) and create partial phrase-structure trees (or constituents) in the stack, finally producing a complete syntactic analysis for an input sentence in linear time. Initially, Sagae and Lavie (2005) suggested that those partial phrase-structure trees be built in a bottom-up manner: two adjacent nodes already in the stack are combined under a non-terminal to become a new constituent. This strategy was followed by many researchers (Zhang and Clark, 2009; Zhu et al., 2013; Watanabe and Sumita, 2015; Mi and Huang, 2015; Crabbé, 2015; Cross and Huang, 2016b; Coavoux and Crabbé, 2016; Fernández-González and Gómez-Rodríguez, 2018), who managed to improve the accuracy and speed of the original bottom-up parser of Sagae and Lavie. As a result, shift-reduce algorithms have become competitive, and are the fastest alternative for phrase-structure parsing to date.
Some of these attempts (Cross and Huang, 2016b; Coavoux and Crabbé, 2016; Fernández-González and Gómez-Rodríguez, 2018) introduced dynamic oracles (Goldberg and Nivre, 2012), originally designed for transition-based dependency algorithms, to bottom-up constituent parsing. They propose to use these dynamic oracles to train shift-reduce parsers instead of a traditional static oracle. The latter follows the standard procedure that uses a gold sequence of transitions to train a model for parsing new sentences at test time. A shift-reduce parser trained with this approach is prone to error propagation (i.e. errors made in previous states are propagated to subsequent states, causing further mistakes in the transition sequence). Dynamic oracles (Goldberg and Nivre, 2012) were developed to minimize the effect of error propagation by training parsers under conditions closer to those found at test time, where mistakes are inevitably made. They are designed to guide the parser through any state it might reach during learning time. This makes it possible to introduce error exploration to force the parser to go through non-optimal states, teaching it how to recover from mistakes and lose the minimum number of gold constituents.
Alternatively, some researchers decided to follow a different direction and explore non-bottom-up strategies for producing phrase-structure syntactic analyses.
On the one hand, Kuncoro et al. (2017) proposed a top-down transition-based algorithm, which creates a phrase-structure tree in the stack by first choosing the non-terminal on the top of the tree, and then considering which nodes should be its children. In contrast to the bottom-up approach, this top-down strategy adds lookahead guidance to the parsing process, while it loses rich local features from partially-built trees.
On the other hand, Liu and Zhang (2017a) recently developed a novel strategy that finds a compromise between the strengths of top-down and bottom-up approaches, resulting in state-of-the-art accuracy. Concretely, this parser builds the tree following an in-order traversal: instead of starting the tree from the top, it chooses the non-terminal of the resulting subtree after having the first child node in the stack. In that way, each partial constituent tree is created in a bottom-up manner, but the non-terminal node is chosen right after the first child is considered, rather than when all child nodes are in the stack (as a purely bottom-up parser does). Liu and Zhang (2017a) report that the top-down approach is on par with the bottom-up strategy in terms of accuracy and the in-order parser yields the best accuracy to date on the WSJ. However, despite being two adequate alternatives to the traditional bottom-up strategy, no further work has been undertaken to improve their performance.
We propose what, to our knowledge, are the first optimal dynamic oracles for both the top-down and in-order shift-reduce parsers, allowing us to train these algorithms with exploration. The resulting parsers outperform the existing versions trained with static oracles on the WSJ Penn Treebank (Marcus et al., 1993) and Chinese Treebank (CTB) benchmarks (Xue et al., 2005). The version of the in-order parser trained with our dynamic oracle achieves the highest accuracy obtained so far by a single fully-supervised greedy shift-reduce system on the WSJ.

Preliminaries
The original transition system of Sagae and Lavie (2005) parses a sentence from left to right by reading (moving) words from a buffer to a stack, where partial subtrees are built. This process is performed by a sequence of Shift (for reading) and Reduce (for building) transitions that will lead the parser through different states or parser configurations until a terminal one is reached. While in the bottom-up strategy the Reduce transition is in charge of labeling the partial subtree with a non-terminal at the same time the tree is built, the top-down parser and Liu and Zhang (2017a) introduce a novel transition to choose the non-terminal on top, leaving the Reduce transition just to create the subtree under the previously decided non-terminal. We will now explain both the top-down and the in-order transition systems in more detail.
In both transition systems, parser configurations have the form c = ⟨Σ, i, f, γ, α⟩, where Σ is a stack of constituents, i is the position of the leftmost unprocessed word in the buffer (which is the next to be pushed onto the stack), f is a boolean variable used by the in-order transition system to mark whether a configuration is terminal (it has no value in top-down parser configurations), γ is the set of constituents that have already been built, and α is the set of non-terminal nodes that are currently in the stack.
Each constituent is represented as a tuple (X, l, r), where X is a non-terminal and l and r are integers defining its span. Constituents are composed of one or several words or constituents, and just one non-terminal node on top. Each word w_i is represented as (w, i, i + 1). To define our oracles, we will need to represent each non-terminal node of the tree as (X, j), where j has the value of i when X is included in the stack and is used to keep them in order. For instance, the phrase-structure tree in Figure 1 can be decomposed as the following set of gold constituents: {(S, 0, 6), (NP, 0, 2), (VP, 2, 5), (ADVP, 3, 4), (ADJP, 4, 5)}. In addition, the ordered set of gold non-terminal nodes added to the stack while following a top-down strategy will be {(S, 0), (NP, 0), (VP, 2), (ADVP, 3), (ADJP, 4)} and, according to an in-order approach, {(NP, 1), (S, 2), (VP, 3), (ADVP, 4), (ADJP, 5)}. It is worth mentioning that the index of non-terminal nodes in the top-down method is the same as the leftmost span index of the constituent that it will produce. However, this does not hold in the in-order approach, as the leftmost child is fully processed before the node is added to the stack, so the index for the node will point to the leftmost span index of the second leftmost child.
Note that the information about the span of a constituent, the set of predicted constituents γ and the set α of predicted non-terminal nodes in the stack is not used by the original top-down and in-order parsers. However, we need to include it in parser configurations at learning time to allow an efficient implementation of the proposed dynamic oracles.
Given an input string w_0 · · · w_{n−1}, the in-order parsing process starts at the initial configuration c_s(w_0 . . . w_{n−1}) = ⟨[ ], 0, false, {}, {}⟩ and, after applying a sequence of transitions, it ends in a terminal configuration ⟨(S, 0, n), n, true, γ, α⟩, where n is the number of words in the input sentence. The top-down transition system shares the same form for the initial and terminal configurations, except for the fact that variable f has no value in both cases.
Figure 2 shows the available transitions in the top-down algorithm. In particular, the Shift transition moves the first (leftmost) word in the buffer to the stack; the Non-Terminal-X transition pushes onto the stack the non-terminal node X that should be on top of a coming constituent; and the Reduce transition pops the topmost stack nodes until the first non-terminal node appears (which is also popped) and combines them into a constituent with this non-terminal node as their parent, pushing this new constituent onto the stack. Note that every Reduce transition will add a new constituent to γ and remove a non-terminal node from α, and every Non-Terminal transition will include a new non-terminal node in α. Figure 3 shows the top-down transition sequence that produces the phrase-structure tree in Figure 1.
In Figure 4 we describe the available transitions in the in-order algorithm. The Shift, Non-Terminal-X and Reduce transitions have the same behavior as defined for the top-down transition system, except that the Reduce transition not only pops stack nodes until finding a non-terminal node (also removed from the stack), but also the node below this non-terminal node, and combines them into a constituent spanning all the popped nodes with the non-terminal node on top. And, finally, a Finish transition is also available to end the parsing process. Figure 5 shows the in-order transition sequence that outputs the constituent tree in Figure 1.
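The behavior of both transition systems can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Span`, `Open`, `Config` and `run`, the action strings ("SH", "RE", "FI", "NT-X") and the placeholder tokens are assumptions made for illustration only. Following the Shift rule of Figure 4, shifted words are also added to γ.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Span(NamedTuple):
    """A word or an already-built constituent (X, l, r)."""
    label: str
    l: int
    r: int


class Open(NamedTuple):
    """A non-terminal node (X, j) waiting on the stack."""
    label: str
    j: int


@dataclass
class Config:
    stack: list = field(default_factory=list)
    i: int = 0                                  # next buffer position
    f: bool = False                             # used only by the in-order system
    gamma: set = field(default_factory=set)     # constituents built so far
    alpha: set = field(default_factory=set)     # non-terminal nodes on the stack


def shift(c, words):
    item = Span(words[c.i], c.i, c.i + 1)
    c.stack.append(item)
    c.gamma.add(item)
    c.i += 1


def non_terminal(c, label):
    node = Open(label, c.i)
    c.stack.append(node)
    c.alpha.add(node)


def reduce(c, in_order=False):
    # Pop nodes until the first non-terminal node, which is popped as well.
    children = []
    while isinstance(c.stack[-1], Span):
        children.append(c.stack.pop())
    node = c.stack.pop()
    if in_order:
        # In-order: the first child lies just below the non-terminal node.
        children.append(c.stack.pop())
    c.alpha.discard(node)
    # children holds the popped nodes from right to left.
    built = Span(node.label, children[-1].l, children[0].r)
    c.stack.append(built)
    c.gamma.add(built)


def run(words, actions, in_order=False):
    """Apply a sequence of actions starting from the initial configuration."""
    c = Config()
    for a in actions:
        if a == "SH":
            shift(c, words)
        elif a == "RE":
            reduce(c, in_order)
        elif a == "FI":
            c.f = True                          # Finish (in-order only)
        else:
            non_terminal(c, a[3:])              # "NT-X" pushes non-terminal X
    return c


# Placeholder tokens: the actual words of Figure 1 are not reproduced here.
words = ["w0", "w1", "w2", "w3", "w4", "w5"]
top_down = run(words, "NT-S NT-NP SH SH RE NT-VP SH NT-ADVP SH RE "
                      "NT-ADJP SH RE RE SH RE".split())
in_order = run(words, "SH NT-NP SH RE NT-S SH NT-VP SH NT-ADVP RE SH "
                      "NT-ADJP RE RE SH RE FI".split(), in_order=True)
```

Under these assumptions, both runs end with a single constituent (S, 0, 6) on the stack and build the five gold constituents listed earlier; the sketch performs no precondition checks, so it expects a valid action sequence.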
The standard procedure to train a greedy shift-reduce parser consists of training a classifier to approximate an oracle, which chooses optimal transitions with respect to gold parse trees. This classifier will greedily choose which transition sequence the parser should apply at test time.
Depending on the strategy used for training the parser, oracles can be static or dynamic. A static oracle trains the parser only on gold transition sequences, while a dynamic one can guide the parser through any possible transition path, allowing the exploration of non-optimal sequences.
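As an illustration of what a static oracle trains on, the gold transition sequence of each system can be read off the gold tree with a simple traversal. The sketch below is only indicative: the tree encoding (a word as a string, a constituent as a `(label, children)` pair), the function names and the transition strings are assumptions, and the tokens are placeholders for the words of the example tree.

```python
def top_down_sequence(tree):
    """Gold transitions of the top-down system (pre-order traversal)."""
    if isinstance(tree, str):
        return ["SHIFT"]
    label, children = tree
    seq = ["NT-" + label]
    for child in children:
        seq += top_down_sequence(child)
    return seq + ["REDUCE"]


def in_order_sequence(tree, root=True):
    """Gold transitions of the in-order system: first child, then the
    non-terminal, then the remaining children."""
    if isinstance(tree, str):
        return ["SHIFT"]
    label, children = tree
    seq = in_order_sequence(children[0], root=False) + ["NT-" + label]
    for child in children[1:]:
        seq += in_order_sequence(child, root=False)
    seq.append("REDUCE")
    if root:
        seq.append("FINISH")
    return seq


# Tree with the gold spans used as running example:
# S(0,6), NP(0,2), VP(2,5), ADVP(3,4), ADJP(4,5).
tree = ("S", [("NP", ["w0", "w1"]),
              ("VP", ["w2", ("ADVP", ["w3"]), ("ADJP", ["w4"])]),
              "w5"])
print(" ".join(top_down_sequence(tree)))
print(" ".join(in_order_sequence(tree)))
```

A static oracle only ever exposes the classifier to the configurations visited along these sequences, which is the limitation that dynamic oracles are meant to remove.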

Dynamic Oracles
Previous work such as (Cross and Huang, 2016b; Coavoux and Crabbé, 2016; Fernández-González and Gómez-Rodríguez, 2018) has introduced and successfully applied dynamic oracles for bottom-up phrase-structure parsing. We present dynamic oracles for training the top-down and in-order transition-based constituent parsers. Goldberg and Nivre (2012) show that implementing a dynamic oracle reduces to defining a loss function on configurations to measure the distance from the best tree they can produce to the gold parse. This allows us to compute which transitions will lead the parser to configurations where the minimum number of mistakes are made.

Loss function
Following Fernández-González and Gómez-Rodríguez (2018), we can define a loss function in constituent parsing as follows: given a parser configuration c and a gold tree t_G, the loss function ℓ(c) is the minimum Hamming loss L(t, t_G) between t and t_G, where t is the already-built tree of a configuration c′ reachable from c (written as c ⇝ t). This Hamming loss is computed as the size of the symmetric difference between the sets of constituents γ and γ_G in the trees t and t_G, respectively. Therefore, the loss function is defined as:
\[ \ell(c) = \min_{t \mid c \leadsto t} \mathcal{L}(t, t_G), \qquad \mathcal{L}(t, t_G) = |\gamma_G \setminus \gamma| + |\gamma \setminus \gamma_G| \]
and, according to the authors, it can be efficiently computed for a non-binary bottom-up transition system by counting the individually unreachable constituents from configuration c (|U(c, γ_G)|) plus the erroneous constituents created so far (|γ_c \ γ_G|):
\[ \ell(c) = |\mathcal{U}(c, \gamma_G)| + |\gamma_c \setminus \gamma_G| \]
We adapt the latter to efficiently implement a loss function for the top-down and in-order strategies. While in bottom-up parsing constituents are created at once by a Reduce transition, in the other two approaches a Non-Terminal transition begins the process by naming the future constituent and a Reduce transition builds it by setting its span and children. Therefore, a Non-Terminal transition that deviates from the non-terminals expected in the gold tree will eventually produce a wrong constituent in future configurations, so it should be penalized. Additionally, a sequence of gold Non-Terminal transitions may also lead to a wrong final parse if they are applied in an incorrect order.
Then, the computation of the Hamming loss in top-down and in-order phrase-structure parsing adds two more terms to the bottom-up loss expression: (1) the number of predicted non-terminal nodes that are currently in the stack (α_c) but not included in the set of gold non-terminal nodes (α_G), and (2) the number of gold non-terminal nodes in the stack that are out of order with respect to the order needed in the gold tree:
\[ \ell(c) = |\mathcal{U}(c, \gamma_G)| + |\gamma_c \setminus \gamma_G| + |\alpha_c \setminus \alpha_G| + \mathit{out\_of\_order}(\alpha_c, \alpha_G) \]
This loss function is used to implement a dynamic oracle that, when given any parser configuration, will return the set of transitions τ that do not increase the overall loss (i.e., ℓ(τ(c)) − ℓ(c) = 0), leading the system through optimal configurations that minimize Hamming loss with respect to t_G.
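To make the arithmetic concrete, the sketch below computes the Hamming loss between two constituent sets and the four-term loss for a configuration. It is a hedged illustration rather than the authors' code: the function names are hypothetical, the set of unreachable constituents and the out-of-order count are taken as inputs instead of being computed, and the sets are assumed to hold phrase constituents only (no words).

```python
def hamming_loss(gamma, gamma_gold):
    """Size of the symmetric difference between two constituent sets."""
    return len(gamma ^ gamma_gold)


def oracle_loss(unreachable, gamma_c, gamma_gold, alpha_c, alpha_gold,
                n_out_of_order):
    """Four-term loss for top-down and in-order configurations.

    unreachable    -- gold constituents that can no longer be built
    n_out_of_order -- gold non-terminal nodes on the stack in a wrong order
    (both are assumed to be computed elsewhere)
    """
    return (len(unreachable)
            + len(gamma_c - gamma_gold)     # erroneous constituents built
            + len(alpha_c - alpha_gold)     # non-gold non-terminals on stack
            + n_out_of_order)


gold = {("S", 0, 6), ("NP", 0, 2), ("VP", 2, 5), ("ADVP", 3, 4), ("ADJP", 4, 5)}
alpha_gold = {("S", 0), ("NP", 0), ("VP", 2), ("ADVP", 3), ("ADJP", 4)}

# A finished tree that merges ADVP and ADJP into one wrong ADJP(3, 5):
predicted = {("S", 0, 6), ("NP", 0, 2), ("VP", 2, 5), ("ADJP", 3, 5)}
print(hamming_loss(predicted, gold))        # two missed + one erroneous

# A top-down configuration after NT-S and a non-gold NT-PP, assuming every
# gold constituent is still reachable:
print(oracle_loss(set(), set(), gold, {("S", 0), ("PP", 0)}, alpha_gold, 0))
```

In the second call the only penalty comes from the non-gold node (PP, 0), which will eventually have to be reduced into an erroneous constituent.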
As suggested by Coavoux and Crabbé (2016) and Fernández-González and Gómez-Rodríguez (2018), constituent reachability can be used to efficiently compute the first term of the symmetric difference (|γ_G \ γ|), by simply counting the gold constituents that are individually unreachable from configuration c, as we describe in the next subsection.
The second and third terms of the loss (|γ_c \ γ_G| and |α_c \ α_G|) can be trivially computed and are used to penalize false positives (extra erroneous constituents) so that the final F-score is not harmed by a decrease in precision, as pointed out by Coavoux and Crabbé (2016) and Fernández-González and Gómez-Rodríguez (2018). Note that it is crucial to avoid non-gold Non-Terminal transitions: although these might not affect the creation of gold constituents, they will certainly lead the parser to create extra erroneous constituents in future steps.
Figure 4: Transitions of an in-order constituent parser. Shift: ⟨Σ, i, false, γ, α⟩ ⇒ ⟨Σ|(w_i, i, i + 1), i + 1, false, γ ∪ {(w_i, i, i + 1)}, α⟩; Finish: ⟨(S, 0, n), n, false, γ, α⟩ ⇒ ⟨(S, 0, n), n, true, γ, α⟩.
Finally, the function out_of_order of the last term can be implemented by computing the longest increasing subsequence of gold non-terminal nodes in the stack, where the order relation is given by the order of non-terminals (provided by their associated index) in the transition sequence that builds the gold tree (this order is unique, as neither of our two parsers of interest has spurious ambiguity). Obtaining the longest increasing subsequence is a well-known problem solvable in time O(n log n) (Fredman, 1975), where n denotes the length of the input sequence. Once we have the largest possible subsequence of gold non-terminal nodes in our configuration's stack that is compatible with the gold order, the remaining ones give us the number of erroneous constituents that we will unavoidably generate, even in the best case, due to building them in an incorrect order. We will prove below that this loss formulation returns the exact loss and the resulting dynamic oracle is correct.
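One possible way to compute this term is sketched below with the standard O(n log n) longest-increasing-subsequence technique based on binary search. The function name, its arguments and the representation of nodes as (X, j) tuples are assumptions for illustration; the sketch takes the gold order as an explicit list and ignores stack nodes that are not gold, since those are already counted by the |α_c \ α_G| term.

```python
from bisect import bisect_left


def out_of_order(stack_nodes, gold_order):
    """Number of gold non-terminal nodes on the stack (listed bottom to top)
    that fall outside the longest subsequence respecting the gold order."""
    rank = {node: k for k, node in enumerate(gold_order)}
    ranks = [rank[node] for node in stack_nodes if node in rank]

    # tails[k] is the smallest rank that can end a strictly increasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for r in ranks:
        pos = bisect_left(tails, r)
        if pos == len(tails):
            tails.append(r)
        else:
            tails[pos] = r
    return len(ranks) - len(tails)


# Gold order of non-terminal nodes for the top-down running example.
gold_top_down = [("S", 0), ("NP", 0), ("VP", 2), ("ADVP", 3), ("ADJP", 4)]

print(out_of_order([("S", 0), ("VP", 2)], gold_top_down))   # compatible order
print(out_of_order([("NP", 0), ("S", 0)], gold_top_down))   # NP pushed before S
```

In the second call both nodes are gold, but (NP, 0) was pushed before (S, 0), so only one of them can still yield a gold constituent and the other is counted as lost.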

Constituent reachability
We now show how the computation of the set of reachable constituents developed for bottom-up parsing in (Coavoux and Crabbé, 2016; Fernández-González and Gómez-Rodríguez, 2018) can be extended to deal with the top-down and in-order strategies.
Top-down transition system
Let γ_G and α_G be the set of gold constituents and the set of gold non-terminal nodes, respectively, for our current input. We say that a gold constituent (X, l, r) ∈ γ_G is reachable from a configuration c = ⟨Σ, j, f, γ_c, α_c⟩ (where f has no value) with Σ = [(Y_p, i_p, i_{p−1}) · · · (Y_2, i_2, i_1)|(Y_1, i_1, j)], and it is included in the set of individually reachable constituents R(c, γ_G), iff it satisfies one of the following conditions:
(i) (X, l, r) ∈ γ_c (i.e. it has already been created and, therefore, it is reachable by definition).
(ii) j ≤ l < r ∧ (X, l) ∉ α_c (i.e. the words dominated by the gold constituent are still in the buffer and the non-terminal node that begins its creation has not been added to the stack yet; therefore, it can still be created after pushing the correct non-terminal node and shifting the necessary words).
(iii) l ∈ {i_k | 1 ≤ k ≤ p} ∧ j ≤ r ∧ (X, l) ∈ α_c (i.e. its span is partially or completely in the stack and the corresponding non-terminal node was already added to the stack; then, by shifting more words and/or reducing, the constituent can still be created).
In-order transition system
Let γ_G and α_G be the set of gold constituents and the set of gold non-terminal nodes, respectively, for our current input. We say that a gold constituent (X, l, r) ∈ γ_G is reachable from a configuration c = ⟨Σ, j, false, γ_c, α_c⟩ with Σ = [(Y_p, i_p, i_{p−1}) · · · (Y_2, i_2, i_1)|(Y_1, i_1, j)], and it is included in the set of individually reachable constituents R(c, γ_G), iff it satisfies one of the following conditions:
(i) (X, l, r) ∈ γ_c (i.e. it has already been created).
(ii) j ≤ l < r (i.e. the constituent is entirely in the buffer, so it can still be built).
(iii) … ∉ α_c (i.e. its first child is still a totally- or partially-built constituent on top of the stack and the non-terminal node has not been created yet; therefore, it has to wait until the first child is completed (if it is still pending) and, then, it can still be created by pushing onto the stack the correct non-terminal node and shifting more words if necessary).
(iv) l ∈ {i_k | 1 ≤ k ≤ p} ∧ j ≤ r ∧ (X, m) ∈ α_c ∧ ∃(Y, l, m) ∈ Σ (i.e. its span is partially or completely in the stack, and its first child (which is an already-built constituent) and the non-terminal node assigned are adjacent; thus, by shifting more words and/or reducing, the constituent can still be built).
In both transition systems, the set of individually unreachable constituents U(c, γ_G) with respect to the set of gold constituents γ_G can be easily computed as γ_G \ R(c, γ_G) and will contain the gold constituents that can no longer be built.

Correctness
We will now prove that the above expression of ℓ(c) indeed provides the minimum possible Hamming loss to the gold tree among all the trees that are reachable from configuration c. This implies correctness (or optimality) of our oracle.
To do so, we first show that both algorithms are constituent-decomposable. This amounts to saying that if we take a set of m constituents that are tree-compatible (can appear together in a constituent tree, meaning that no pair of constituent spans overlap unless one is a subset of the other) and individually reachable from a configuration c, then the set is also reachable as a whole.
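The tree-compatibility condition used in this definition can be checked directly on spans. The following sketch is only an illustration, with hypothetical function names and constituents represented as (X, l, r) tuples: two spans are incompatible exactly when they overlap without one containing the other.

```python
from itertools import combinations


def crossing(a, b):
    """True if the spans of a and b overlap and neither contains the other."""
    (_, l1, r1), (_, l2, r2) = a, b
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def tree_compatible(constituents):
    """True if no pair of constituents has crossing spans."""
    return not any(crossing(a, b) for a, b in combinations(constituents, 2))


gold = {("S", 0, 6), ("NP", 0, 2), ("VP", 2, 5), ("ADVP", 3, 4), ("ADJP", 4, 5)}
print(tree_compatible(gold))                    # gold constituents form a tree
print(tree_compatible(gold | {("X", 1, 3)}))    # (X, 1, 3) crosses (NP, 0, 2)
```

Any subset of the gold constituents is tree-compatible by construction, which is why the decomposability property can be applied to sets of individually reachable gold constituents.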
We prove this by induction on m. The base case (m = 1) is trivial. Let us suppose that constituent-decomposability holds for any set of m tree-compatible constituents. We will show that it also holds for any set T of m+1 tree-compatible constituents.
Let (X, l, r) be one of the constituents in T such that r = min{r′ | (X′, l′, r′) ∈ T} and l = max{l′ | (X′, l′, r) ∈ T}. Let T′ = T \ {(X, l, r)}. Since T′ has m constituents, by induction hypothesis, T′ is a reachable set from configuration c.
Since (X, l, r) is individually reachable by hypothesis, it must satisfy at least one of the conditions for constituent reachability. As these conditions are different for each particular algorithm, we continue the proof separately for each.
Top-down constituent decomposability
In this case, we enumerated three constituent reachability conditions, so we divide the proof into three cases.
If the first condition holds, then the constituent (X, l, r) has already been created in c. Thus, it will still be present after applying any of the possible transition sequences that build T′ starting from c. Hence, T = T′ ∪ {(X, l, r)} is reachable from c.
If the second condition holds, then j ≤ l < r and the constituent (X, l, r) can be created by l − j Shift transitions, followed by one Non-Terminal transition, r − l Shift transitions and one Reduce transition. This will leave the parser in a configuration whose value of j is r, and where stack elements with left span index ≤ l (apart from those referencing the new non-terminal and its leftmost child) have not changed. Thus, constituents of T′ are still individually reachable in this configuration, as their left span index is either ≥ r (and then they meet the second reachability condition) or ≤ l (and then they meet the third), so T is reachable from c.
Finally, if the third condition holds, then we can create (X, l, r) by applying r − j Shift transitions followed by a sequence of Reduce transitions stopping when we obtain (X, l, r) on the stack (this will always happen after a finite number of such transitions, as the reachability condition guarantees that l is the left span index of some constituent already on the stack, and that (X, l) is on the stack). Following the same reasoning as in the previous case regarding the resulting parser configuration, we conclude that T is reachable from c.
With this we have shown the induction step, and thus constituent decomposability for the top-down parser.
In-order constituent decomposability
The in-order parser has four constituent reachability conditions. Analogously to the previous case, we prove the reachability of T by case analysis.
If the first condition holds, then we have a situation where the constituent (X, l, r) has already been created in c, so reachability of T follows from the same reasoning as for the first condition in the top-down case.
If the second condition holds, we have j ≤ l < r and the constituent (X, l, r) can be created by l − j + 1 Shift transitions (where the last one shifts a word that will be assigned as left child of the new constituent), followed by the relevant Non-Terminal-X transition, r − l − 1 more Shift transitions and one Reduce transition. After this, the parser will be in a configuration where j takes the value r, where we can use the same reasoning as in the second condition of the top-down parser to show that all constituents of T′ = T \ {(X, l, r)} are still reachable, proving reachability of T.
For the third condition, the proof is analogous but the combination of transitions that creates the non-terminal starts with a sequence composed of Reduce transitions (when there is a non-terminal at the top of the stack) or Non-Terminal-Y transitions for arbitrary Y (when the top of the stack is a constituent) until the top node on the stack is a constituent with left span index l (this ensures that the constituent at the top of the stack can serve as leftmost child for our desired constituent), followed by a Non-Terminal-X, r−j Shift transitions and one Reduce transition.
Finally, for the fourth condition, the reasoning is again analogous, but the computation leading to the non-terminal starts with as many Reduce transitions as non-terminal nodes located above (X, m) in the stack (if any). If we call j the index associated to the resulting transition, then it only remains to apply r − j Shift transitions followed by a Reduce transition.
Optimality With this, we have shown constituent decomposability for both parsing algorithms. This means that, for a configuration c, and a set of constituents that are individually reachable from c, there is always some computation that can build them all. This facilitates the proof that the loss function is correct.
To finish the proof, we observe the following:
• Let c′ be a final configuration reachable from c. The set (γ_{c′} \ γ_G), representing erroneous constituents that have been built, will always contain at least |γ_c \ γ_G| elements, as the algorithm never deletes constituents.
• In addition, c′ will contain one erroneous constituent for each element of (α_c \ α_G), as once a non-terminal node is on the stack, there is no way to reach a final configuration without using it to create an erroneous constituent. Note that these erroneous constituents do not overlap those arising from the previous item, as γ_c stores already-built constituents and α_c non-terminals that have still not been used to build a constituent.
• Given a subset S of R(c, γ_G), the previously shown constituent decomposability property implies that there exists at least one transition sequence starting from c that generates the tree S ∪ (γ_c \ γ_G) ∪ E, where E is a set of erroneous constituents containing one such constituent per element of (α_c \ α_G). This tree has loss |t_G| − |γ_c ∪ S| + |γ_c \ γ_G| + |α_c \ α_G|. The term |t_G| − |γ_c ∪ S| corresponds to missed constituents (gold constituents that have not been already created and are not created as part of S), the other two to erroneous constituents.
• As we have shown that the erroneous constituents arising from (γ_c \ γ_G) and (α_c \ α_G) are unavoidable, computations yielding a tree with minimum loss are those that maximize |γ_c ∪ S| in the previous term. In general, the largest possible |S| is for S = R(c, γ_G). In that case, we would correctly generate every reachable constituent and the loss would be
\[ |\mathcal{U}(c, \gamma_G)| + |\gamma_c \setminus \gamma_G| + |\alpha_c \setminus \alpha_G| \]
However, we additionally want to generate constituents in the correct order, and this may not be possible if we have already shifted some of them into the stack in a wrong order. The function out_of_order gives us the number of reachable constituents that are lost for this cause in the best case.
Thus, indeed, the expression provides the minimum loss from configuration c.

Data
We test the two proposed approaches on two widely-used benchmarks for constituent parsers: the Wall Street Journal (WSJ) sections of the English Penn Treebank (Marcus et al., 1993) and version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use the same predicted POS tags and pre-trained word embeddings as the original top-down parser and Liu and Zhang (2017a).

Neural Model
To perform a fair comparison, we define the novel dynamic oracles on the original implementations of the top-down parser and of the in-order parser by Liu and Zhang (2017a), where parsers are trained with a traditional static oracle. Both implementations follow a stack-LSTM approach to represent the stack and the buffer, as well as a vanilla LSTM to represent the action history.
In addition, they also use a bi-LSTM as a compositional function for representing constituents in the stack. Concretely, this consists in computing the composition representation s_comp as:
\[ s_{comp} = (\mathrm{LSTM}_{fwd}[e_{nt}, s_0, \ldots, s_m];\ \mathrm{LSTM}_{bwd}[e_{nt}, s_m, \ldots, s_0]) \]
where e_nt is the vector representation of a non-terminal, and s_i, i ∈ [0, m], is the ith child node. Finally, the exact same word representation strategy and hyper-parameter values as in the original top-down parser and in Liu and Zhang (2017a) are used to conduct the experiments.

Error exploration
In order to benefit from training a parser with a dynamic oracle, errors should be made during the training process so that the parser can learn to avoid and recover from them. Unlike more complex error-exploration strategies such as those studied in (Cross and Huang, 2016b; Fried and Klein, 2018), we decided to consider a simple one that follows a non-optimal transition when it is the highest-scoring one, but only with a certain probability. In that way, we easily simulate test-time conditions, when the parser greedily chooses the highest-scoring transition, even when it is not an optimal one, placing the parser in an incorrect state.
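A minimal sketch of this exploration policy is given below. The function name, the dictionary of classifier scores and the fallback to the highest-scoring optimal transition when no exploration takes place are assumptions made for illustration; the text above only specifies that a non-optimal highest-scoring transition is followed with a certain probability.

```python
import random


def choose_training_transition(scores, optimal, explore_prob, rng=random):
    """Pick the transition to follow during training.

    scores       -- dict mapping each legal transition to its model score
    optimal      -- zero-cost transitions returned by the dynamic oracle
    explore_prob -- probability of following a non-optimal best transition
    """
    best = max(scores, key=scores.get)
    if best in optimal:
        return best                 # the model already agrees with the oracle
    if rng.random() < explore_prob:
        return best                 # explore: follow the model's mistake
    # Otherwise stay on an optimal path (assumed: best-scoring optimal one).
    return max(optimal, key=lambda t: scores[t])


scores = {"SHIFT": 0.1, "NT-NP": 0.7, "REDUCE": 0.2}
optimal = {"SHIFT", "REDUCE"}
print(choose_training_transition(scores, optimal, explore_prob=0.0))
print(choose_training_transition(scores, optimal, explore_prob=1.0))
```

With probability 0 the parser never leaves the optimal paths, and with probability 1 it always follows its own prediction; in either case the classifier is still updated towards the oracle's zero-cost transitions for the current configuration.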
In particular, we run experiments on development sets for each benchmark/algorithm with three different error exploration probabilities and choose the one that achieves the best F-score. Table 1 reports all results, including those obtained by the top-down and in-order parsers trained by a dynamic oracle without error exploration (equivalent to a traditional static oracle).
The "Type" column shows the type of parser: gs is a greedy parser trained with a static oracle, gd a greedy parser trained with a dynamic oracle, b a beam search parser, bp a beam search parser trained with a policy gradient method, bd a beam search parser trained with a non-optimal dynamic oracle, bg a generative beam search parser, and ch a chart-based parser. Finally, the "Strat" column describes the strategy followed (bu=bottom-up, td=top-down and in=in-order).
Table 3: F-score on constituents with a number of children ranging from one to five on WSJ §23.

Results
We also include some recent state-of-the-art parsers with global chart decoding that achieve the highest accuracies to date on WSJ, but are much slower than shift-reduce algorithms.
Top-down and in-order parsers benefit from being trained by these new dynamic oracles in both datasets. The top-down strategy achieves a gain of 0.5 and 0.7 points in F-score on WSJ and CTB benchmarks, respectively. The in-order parser obtains similar improvements on the CTB (0.5 points), but a less notable accuracy gain on the WSJ (0.2 points). Although a case of diminishing returns might explain the latter, the in-order parser trained with the proposed dynamic oracle still achieves the highest accuracy to date in greedy transition-based constituent parsing on the WSJ. [7]
While this work was under review, Fried and Klein (2018) proposed to train the top-down and in-order parsers with a policy gradient method instead of custom-designed dynamic oracles. They also present a non-optimal dynamic oracle for the top-down parser that, combined with more complex error-exploration strategies and size-10 beam search, significantly outperforms the policy gradient-trained version, confirming that even non-optimal dynamic oracles are a good option. [8]

Analysis
Dan Bikel's randomized parsing evaluation comparator (Bikel, 2004) was used to perform significance tests on precision and recall metrics on WSJ §23 and CTB §271-300. The top-down parser trained with dynamic oracles achieves statistically significant improvements (p < 0.05) in precision both on the WSJ and CTB benchmarks, and in recall on WSJ. The in-order parser trained with the proposed technique obtains significant improvements (p < 0.05) in recall in both benchmarks, although not in precision.
We also undertake an analysis to check if dynamic oracles are able to mitigate error propagation. We report in Table 3 the F-score obtained in constituents with different numbers of children on WSJ §23 by the top-down and in-order algorithms trained with both static and dynamic oracles. Note that a constituent with a large number of children is more prone to error propagation, since a larger number of transitions is required to build it. The results seem to confirm that, indeed, dynamic oracles manage to alleviate error propagation, since improvements in F-score are more notable for larger constituents.
[7] Note that the proposed dynamic oracles are orthogonal to approaches like beam search, re-ranking or semi-supervision, which can boost accuracy but at a large cost to parsing speed.
[8] Unfortunately, we cannot directly compare our approach to theirs, since they use beam-search decoding with size 10 in all experiments, gaining up to 0.3 points in F-score, while penalizing speed with respect to greedy decoding. However, by extrapolating the results above, we hypothesize that our optimal dynamic oracles (especially the one designed for the in-order algorithm) with their same training and beam-search decoding setup might achieve the best scores to date in shift-reduce parsing.

Conclusion
We develop the first optimal dynamic oracles for training the top-down and the state-of-the-art in-order parsers. Apart from improving the systems' accuracies in both cases, we achieve the best result to date in greedy shift-reduce parsing on the WSJ. In addition, these promising techniques could easily benefit from recent studies in error-exploration strategies and yield state-of-the-art accuracies in transition-based parsing in the near future. The parser's source code is freely available at https://github.com/danifg/Dynamic-InOrderParser.