Don’t Stop Me Now! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.


Introduction
Transition-based parsers with beam search are among the most widely used models for dependency parsing: they achieve state-of-the-art performance while their training and inference, which rely on approximate search, are very efficient. Training a beam parser faces two difficulties: error propagation and search errors (Huang et al., 2012). Specific learning methods, early-update and max-violation (presented in §2), have been designed to address them. But they require updating the parameters on partial derivations only, which introduces a discrepancy between the feature distributions seen during training and testing. Notably, derivation endings are under-represented during training, which hurts parsing performance.
In this work, we propose an improved training strategy that corrects such sampling biases for beam parsers (§3). Experiments with the SPMRL treebanks (Seddah et al., 2013), reported in §4, show that the training configurations sampled by this new strategy are closer to the parser configurations seen at test time and result in increases of up to 0.7 UAS, with no computation time overhead. These improvements rely on a sound extension of dynamic oracles for global training, the lack of which has repeatedly been pointed out (Goldberg and Nivre, 2012; Sartorio, 2015). These global dynamic oracles have more general benefits than the training strategy proposed here; for instance, they make it possible to train beam parsers on partially annotated data in a context of active learning or multilingual transfer (Lacroix et al., 2016).

Training a Dependency Parser
In a transition-based parser (Nivre, 2008), a parse is computed by performing a sequence of transitions building the parse tree in an incremental fashion. In the following, c denotes a parser configuration representing a partially built dependency tree. Applying transition t to configuration c results in the parser moving to a successor of c, denoted c • t.
At each step of the parsing process, every possible transition is scored by a classifier, given a feature representation of c and model parameters θ; the score of a derivation (a sequence of transitions) generating a given parse tree is the sum of its transition scores. Parsing thus amounts to finding the derivation having the highest score, usually through greedy or beam search.
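The search described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names (`beam_search`, `next_transitions`, `apply`, `score`, `is_final`) and the two-step toy system are assumptions introduced for the example. The score of a derivation is accumulated as the sum of its transition scores, and only the `beam_size` best partial derivations are kept at each step.

```python
def beam_search(initial, next_transitions, apply, score, is_final, beam_size=8):
    """Approximate search for the highest-scoring derivation.

    Returns (total_score, transitions) for the best item left in the beam.
    """
    # Each beam item: (sum of transition scores, configuration, transitions so far).
    beam = [(0.0, initial, [])]
    while not all(is_final(c) for _, c, _ in beam):
        candidates = []
        for total, c, history in beam:
            if is_final(c):
                candidates.append((total, c, history))
                continue
            for t in next_transitions(c):
                candidates.append((total + score(c, t), apply(c, t), history + [t]))
        # Keep only the beam_size best-scoring partial derivations.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    best_total, _, best_history = beam[0]
    return best_total, best_history


# Hypothetical toy system: a configuration is the tuple of transitions taken
# so far, and every derivation has exactly two steps.
TOY_SCORES = {("L",): 1.0, ("R",): 0.5, ("R", "L"): 2.0}


def toy_next(c):
    return ["L", "R"]


def toy_apply(c, t):
    return c + (t,)


def toy_score(c, t):
    return TOY_SCORES.get(c + (t,), 0.0)


def toy_final(c):
    return len(c) == 2


greedy = beam_search((), toy_next, toy_apply, toy_score, toy_final, beam_size=1)
wider = beam_search((), toy_next, toy_apply, toy_score, toy_final, beam_size=2)
```

In this toy setting, a beam of size 1 (greedy search) commits to the locally best first transition and misses the derivation with the highest total score, which a beam of size 2 recovers.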
Parsers using beam search are typically trained with a global criterion that updates the parameters once for each training sentence. Algorithm 1 summarizes the training for each sentence x (with gold parse y): INITIAL(x) denotes the initial configuration for x and the procedure ORACLE performs decoding to find configurations that play the role of the 'positive' and 'negative' examples (resp. c+ and c−) required by the UPDATE operation (typically a perceptron update rule (Collins and Roark, 2004) or a gradient computation with the globally normalized loss of Andor et al. (2016)).
Algorithm 1: Global training on one sentence. θ: model parameters, initialized to θ_0 before training.
Several strategies, corresponding to various implementations of the ORACLE function, have been used to find these examples. In the early-update strategy (Collins and Roark, 2004; Zhang and Clark, 2008), a reference derivation is first computed, generally using handcrafted heuristics. The sentence is then parsed using conventional beam decoding and an update happens as soon as this pre-computed gold derivation falls off the beam, while the rest of the sequence is ignored. The top scoring configuration at this step is penalized and the reference that has just fallen off the beam is reinforced. Another strategy, max-violation (Huang et al., 2012), is to continue decoding even though the reference has fallen off the beam, in order to find the configuration having the largest gap between the scores of the (partial) hypothesis and the (partial) gold derivation. Compared to early-update, max-violation speeds up convergence by covering longer transition sequences and can yield slightly better parsers.
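The difference between the two strategies lies in the step at which the update is made. The sketch below illustrates this on per-step statistics collected during one decoding pass; the function names and the input format (one entry per decoding step) are assumptions made for the example, and the perceptron update is written over sparse feature lists for simplicity.

```python
from collections import defaultdict


def early_update_step(gold_in_beam):
    """Index of the first step at which the gold derivation is off the beam."""
    for step, inside in enumerate(gold_in_beam):
        if not inside:
            return step
    return None  # the gold derivation survived until the end


def max_violation_step(best_scores, gold_scores):
    """Index of the step with the largest gap between best hypothesis and gold prefix."""
    gaps = [best - gold for best, gold in zip(best_scores, gold_scores)]
    step = max(range(len(gaps)), key=gaps.__getitem__)
    return step if gaps[step] > 0 else None  # no violation: nothing to update


def perceptron_update(theta, positive_features, negative_features):
    """Reinforce the features of c+ and penalize those of c-."""
    for f in positive_features:
        theta[f] += 1.0
    for f in negative_features:
        theta[f] -= 1.0


# Hypothetical statistics for a four-step derivation.
gold_in_beam = [True, False, False, False]
best_scores = [1.0, 2.0, 3.5, 4.0]   # top-scoring hypothesis at each step
gold_scores = [1.0, 1.5, 1.0, 3.0]   # gold prefix at each step

early = early_update_step(gold_in_beam)
maxv = max_violation_step(best_scores, gold_scores)

theta = defaultdict(float)
perceptron_update(theta, ["f1", "f2"], ["f2", "f3"])
```

On these toy values, early-update stops at the second step, whereas max-violation decodes further and updates at the third step, where the score gap is largest; in both cases the transitions after the update point are never seen.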

Correction of Training Biases
Both standard learning strategies suffer from biases that introduce a discrepancy between the feature distributions seen during training and testing.
First, parameter updates reinforce only gold derivations; at test time, the model might find itself, after an error, in a part of the search space where it was not trained to make good decisions, thus propagating errors (Goldberg and Nivre, 2012). Second, both strategies use a static oracle that relies on the deterministic pre-computation of a canonical reference. An update occurs as soon as the parser strays from this particular gold derivation, even when the reference tree could still be obtained using an alternative derivation. Updating in such cases raises the risk of lowering parser performance. Indeed, we measured that a beam parser trained with early-update and a static oracle counter-intuitively predicts fewer correct heads for the current sentence just after an update than just before, for 15% of the updates (French SPMRL, during the 10th epoch).
Third, both the early-update and the max-violation strategies consider only partial derivations when updating the model parameters. For instance on the French SPMRL, when training with an early-update strategy, the end of the derivation is reached for only 41% of the examples at the 10th epoch and, on average, only 57% of a derivation is considered; the max-violation strategy, which computes longer partial derivations, partly alleviates this effect: these proportions rise, respectively, to 53% and 81%. While the choice of partial updates has been shown experimentally (Huang et al., 2012) to be critical in achieving good performance, it prevents parsers from visiting configurations corresponding to derivation endings. This explains why configurations and transitions involving final punctuation marks, verbs in SOV languages like Japanese or German subordinate clauses, the ROOT token when placed at the end (Ballesteros and Nivre, 2013), but also stack features involving long distance siblings, are too rarely seen in training, thereby hurting predictions in such configurations.
In the following, we describe improvements addressing those issues.
Dynamic oracles The limits of static oracles have already been highlighted for ARCEAGER greedy parsers: Goldberg and Nivre (2012) show how parsing performance can be significantly improved with a dynamic oracle that computes a reference tailored to the current parser state. Dynamic oracles are at the heart of most state-of-the-art parsers (Ballesteros et al., 2016; Coavoux and Crabbé, 2016; Cross and Huang, 2016; Kiperwasser and Goldberg, 2016). But, to the best of our knowledge, dynamic oracles have only been partially generalized to beam parsers: the oracles of Björkelund and Nivre (2015) address the second but not the first issue, while the dynamic oracle of the YaraParser (Rasooli and Tetreault, 2015) arbitrarily rules out some configurations that can generate the reference tree.
Algorithm 2 shows how a dynamic oracle can be integrated within the early-update learning strategy; this extension can be done in the same way for the max-violation strategy but is not detailed here, for space reasons. The specificity of this formalism is to consider that an error occurs only when none of the configurations in the beam can result in the dependency tree that was initially the best reachable one, i.e. when all hypotheses insert new erroneous dependencies.³ The Boolean function that tests this condition, denoted CORRECT_y(c′|c), can be efficiently computed using the COST_y(t) function, defined as the number of dependencies of a gold parse tree y that can no longer be predicted when transition t is applied: a configuration c′ is considered CORRECT in the context of a configuration c if there exists a sequence of transitions t_1, ..., t_n such that c′ = c • t_1 • ... • t_n and COST_y(t_1) = ... = COST_y(t_n) = 0.
Once an error is detected, the negative example c− is chosen, as in the 'standard' early-update strategy, as the top scoring configuration in the beam. The positive example c+ is computed in constant time, by choosing the top scoring configuration in the beam (just before k-best truncation) for which CORRECT is true.

Restart Strategy
To avoid over-representing the beginning of derivations during training, we propose a new learning strategy: contrary to the baseline training method (Algorithm 1), in which parsing stops as soon as an error is detected and the parameters are updated, in our strategy (Algorithm 3) decoding is restarted with a beam containing only the positive configuration c+ and parsing continues until a new error is detected, triggering new updates. The ORACLE function is then called from several successive configurations, as many times as needed to completely parse the sentence.
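The restart loop can be sketched as below. This is a minimal illustration under stated assumptions: `oracle`, `update` and `is_final` are hypothetical callables passed in by the caller, the oracle is assumed to return a pair (c+, c−) with c− set to None when no error was found, and it is assumed to always advance by at least one transition so that the loop terminates. The demonstration uses a scripted stand-in for the oracle rather than a real parser.

```python
def train_with_restart(initial, gold, theta, oracle, update, is_final):
    """Global training on one sentence, restarting from c+ after each error."""
    c = initial
    n_updates = 0
    while not is_final(c):
        c_plus, c_minus = oracle(c, gold, theta)
        if c_minus is not None:
            update(theta, c_plus, c_minus)
            n_updates += 1
        # Restart decoding with a beam that contains only the positive configuration.
        c = c_plus
    return c, n_updates


# Scripted stand-in oracle (hypothetical): configurations are tuples of
# transition names and a derivation is complete after three transitions.
SCRIPT = {
    (): (("a",), ("x",)),                 # error at the first step
    ("a",): (("a", "b"), ("a", "y")),     # error at the second step
    ("a", "b"): (("a", "b", "c"), None),  # no error until the end
}


def scripted_oracle(c, gold, theta):
    return SCRIPT[c]


def counting_update(theta, c_plus, c_minus):
    theta[c_plus] = theta.get(c_plus, 0) + 1
    theta[c_minus] = theta.get(c_minus, 0) - 1


theta = {}
final_config, n_updates = train_with_restart(
    (), None, theta, scripted_oracle, counting_update, lambda c: len(c) == 3)
```

In this scripted run the sentence triggers two updates instead of one, and decoding reaches the end of the derivation, whereas stopping at the first error would have left the last two steps unseen.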
This training method ensures that configurations that are close to derivation endings will be seen more often during training.⁴
Algorithm 2: Dynamic oracle for the early-update strategy. c_0: configuration to start decoding from; top_θ(·): best scoring element according to θ; NEXT(c): the set of all successors of c; function EARLYUPDATEORACLE(c_0, y, θ).
Algorithm 3: Global training with restart. FINAL(·): true iff the whole sentence is parsed.
Restarting with an oracle tailored to the restart configuration is made possible by our global dynamic oracle. In this frame, the strategy can be improved even further: like their greedy counterparts, global dynamic oracles make it possible to augment training with an error exploration component by restarting from c− instead of c+ after an error, thus addressing the first issue mentioned.
³ While fairly simple, this formalism is a major change from the traditional paradigm where references are explicitly computed for each action.
⁴ Standard training with full update also ensures this, but with the risk of divergence (Huang et al., 2012). Restarting in c+ with a new beam has the same convergence guarantee as standard early-update and max-violation.

Experiments
Experimental Setup The validity of our approach is evaluated on the SPMRL treebanks (Seddah et al., 2013). We consider, as baselines, a greedy parser trained with a dynamic oracle (GREEDY DYN) and beam parsers trained with the early-update and max-violation strategies and a static oracle (resp. EARLY and MAXV). The improvements of §3 are applied to these two strategies (resp. IMP-EARLY and IMP-MAXV). In all our experiments, we use our in-house, open source implementation of a beam ARCEAGER parser in the PanParser framework,⁵ with the averaged structured perceptron (Collins, 2002), a beam size of 8 and the ROOT placed at the end. We use coarse gold PoS tags and the extended feature set of Zhang and Nivre (2011), without label information. These features, designed for English, have not been adapted to the specificities of the languages. All models are trained up to convergence on a validation set. As a point of comparison, on average over the treebanks, our GREEDY DYN baseline is 2.7 UAS higher than a MaltParser trained with ARCEAGER and the same kind of information (coarse tags, no label).
Results Table 1 reports the performance of all training strategies evaluated by the traditional UAS on the projective test sets, ignoring punctuation tokens. All reported scores are averaged over 5 runs. Results show that our learning strategy consistently outperforms the corresponding baseline, with average increases of 0.2 UAS, up to 0.7 UAS.
Discussion Table 2 shows the performance imbalance between various positions in the sentence and confirms that our improvements partly alleviate this phenomenon: the scores on the first half of the sentence are mostly unchanged, while large gains are reported on the second half.
To verify that these UAS gains result from a better matching of training and test configurations, we compute the Kullback-Leibler divergence between the probability distribution (estimated with frequency counts and 0.1 Laplace smoothing) of the features of all configurations in beam scored during the 10th training epoch and the feature distribution seen at test time. Table 3 reports the Kullback-Leibler divergences induced by our refinements with respect to the corresponding baselines. It clearly shows that our 'improved' learning strategy considers training examples that are closer to test configurations. Similar experiments on greedy parsers show that their train-test divergence is reduced from 0.320 to 0.219 by the dynamic oracle and exploration strategy of Goldberg and Nivre (2012). In these two experiments, feature similarity correlates with UAS improvements and can therefore provide a new way to interpret oracle influence.
⁵ The oracle for beam parsers described in this work can be used with any scoring function and learning method, such as Andor et al. (2016). But its implementation may require changing the whole code architecture, as reference derivations must be computed on the fly.
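The divergence measure can be sketched as follows. This is an illustrative computation rather than the evaluation script used for Table 3: the function names are hypothetical, features are represented as plain strings, the two distributions are assumed to be smoothed over the union of the features observed in either sample, and the direction KL(train || test) is an assumption, as the text does not specify it.

```python
import math
from collections import Counter


def smoothed_distribution(counts, vocabulary, alpha=0.1):
    """Relative frequencies with additive (Laplace) smoothing of strength alpha."""
    total = sum(counts[f] for f in vocabulary) + alpha * len(vocabulary)
    return {f: (counts[f] + alpha) / total for f in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[f] * math.log(p[f] / q[f]) for f in p)


def train_test_divergence(train_features, test_features, alpha=0.1):
    train_counts, test_counts = Counter(train_features), Counter(test_features)
    # Smoothing over the joint vocabulary keeps every probability non-zero.
    vocabulary = set(train_counts) | set(test_counts)
    p = smoothed_distribution(train_counts, vocabulary, alpha)
    q = smoothed_distribution(test_counts, vocabulary, alpha)
    return kl_divergence(p, q)


# Hypothetical feature samples: "end" stands for a feature of derivation endings.
test_sample = ["start", "start", "end", "end"]
biased_train = ["start", "start", "start", "start"]
balanced_train = ["start", "start", "start", "end"]

far = train_test_divergence(biased_train, test_sample)
near = train_test_divergence(balanced_train, test_sample)
```

In this toy example, the training sample that contains some ending features obtains a lower divergence from the test sample than the one that contains none, which is the kind of effect the comparison above is meant to capture.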
Finally, regarding efficiency, we observe (Figure 1) that IMP-EARLY converges in a number of epochs similar to that of standard MAXV. Despite an increased number of updates, it is however slightly faster (in CPU time) because it avoids the extra reference pre-computation.

Conclusion
In this paper, we have extended the dynamic oracle framework to global training, for transition-based dependency parsers. This innovation lets us propose an alternative training strategy that reduces the discrepancy between the feature distributions seen at train and test time that exists in state-of-the-art methods. Experiments on the 9 SPMRL treebanks show that our restart strategy improves both parsing accuracy and model convergence. We intend for future work to investigate other ways to reduce the train-test distribution discrepancy in structured prediction, using the new possibilities offered by this extended framework.