Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and to discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference, we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select, among all alternatives, the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.


Introduction
Multiple annotation conventions have been proposed over the years for representing dependency structures (Hajič et al., 2001; De Marneffe et al., 2014). The divergence between annotation guidelines can result from the theoretical linguistic principles governing the choices of head status and dependency inventories, from the tree-to-dependency conversion scheme, or from arbitrary decisions regarding closed-class words, such as interjections or discursive markers, the syntactic role of which is debatable. Several works have shown that the choice of a dependency structure can have a large impact on parsing performance (Silveira and Manning, 2015; de Lhoneux and Nivre, 2016; Kohita et al., 2017) and on the performance of downstream applications (Elming et al., 2013).
A natural way to decide which syntactic representation is the best is to choose the one for which a standard parser will achieve the highest parsing performance (Schwartz et al., 2012; Husain and Agrawal, 2012; Noro et al., 2005). Implementing this general principle faces two challenges: i) defining a learning criterion that can predict which dependency structure will be the easiest to learn; and ii) finding a way to explore a potentially large number of annotation schemes that describe all combinations of several design decisions.
This work shows that the dynamic oracle of Goldberg and Nivre (2013) can straightforwardly uncover the most learnable dependency representation among a predefined set of possible references. Rather than associating each sentence in the training set with a single reference, we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select, among all alternatives, the reference that will be predicted with the highest accuracy.
This article is organized as follows: we first review standard structural transformations studied in the literature that will be used to build a treebank annotated with multiple references (§2). We then show how the dynamic oracle of Goldberg and Nivre (2013) can be used to train a parser when each sentence is associated with a set of references and explain how it can be used to define a learnability criterion (§3). An experimental evaluation of our approach is presented in §4.

Dependency Transformations
In this section, we explain how to automatically transform the reference UD treebanks (Nivre et al., 2016) to build corpora in which each sentence is annotated by a set of possible trees.
The UD project aims at developing cross-linguistically consistent treebank annotations for many languages by harmonizing annotation schemes between languages and converting existing treebanks to this new scheme. Several recent papers (Kohita et al., 2017; de Lhoneux and Nivre, 2016; Silveira and Manning, 2015; Popel et al., 2013) have investigated whether the choices made to increase the sharing of structures between languages hurt parsing performance and have identified a variety of choice points in which more than one design could be advocated. Most of these points are related to the issue of headness: contrary to most works in theoretical linguistics, UD assumes that function words should be categorically subordinated to content words to maximize the similarity of dependency trees across languages (Osborne and Maxwell, 2015).
The alternative representations we consider are summarized in Table 1. They mostly consist of demoting the lexical head and making it dependent on a functional head. We designed a set of handcrafted rules (see footnote 2) to convert dependencies between these two schemes. Each application of a rule creates a new tree in the set of references that is being built. As shown in Figure 1, the resulting set of references encodes all possible combinations of the considered transformations.
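The construction of the reference set can be illustrated with a minimal sketch. It assumes a simplified stand-in for the handcrafted rules (the actual rules are described in Wisniewski and Lacroix, 2017): a tree is a dictionary mapping each word index to its head (0 is the root), and the hypothetical `demote_head` rule only swaps a function word with its lexical head, leaving all other dependents where they are. The names `demote_head` and `all_references` are illustrative and do not come from the original implementation.

```python
from itertools import product

def demote_head(heads, func):
    """Return a copy of `heads` in which function word `func` governs its former head.

    `heads` maps each word index to the index of its head (0 is the root).
    This is a simplified, assumed rule: other dependents are left untouched.
    """
    new = dict(heads)
    lexical = heads[func]        # content word currently governing `func`
    new[func] = heads[lexical]   # the function word takes over the attachment
    new[lexical] = func          # the content word now depends on it
    return new

def all_references(heads, candidates):
    """Build every tree obtained by transforming a subset of `candidates`."""
    refs = []
    for choice in product([False, True], repeat=len(candidates)):
        tree = dict(heads)
        for func, transform in zip(candidates, choice):
            if transform:
                tree = demote_head(tree, func)
        refs.append(tree)
    return refs

# "went to Paris" in a UD-style analysis:
# went <- root, Paris <- went, to <- Paris
ud = {1: 0, 2: 3, 3: 1}
refs = all_references(ud, candidates=[2])
```

With a single candidate dependency the set contains two trees, the original one and the one in which "to" governs "Paris"; with n candidates the enumeration yields 2^n trees.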

Training a Dependency Parser with Multiple References
Dynamic Oracle In a transition-based parser (Nivre, 2008), a parse is computed by performing a sequence of transitions building the parse tree in an incremental fashion. A partially built dependency tree is represented by a configuration c; when in c, applying a transition t results in the parser moving to a new configuration denoted c•t. At each step of the parsing process, every possible transition is scored by a classifier (e.g. a linear model), given a feature representation of c and model parameters w; the score of a derivation (a sequence of transitions) generating a given parse tree is the sum of its transition scores. Parsing thus amounts to finding, starting from the initial configuration INITIAL(W), the derivation having the highest score, typically using greedy or beam search.
Algorithm 1: Training on one sentence with multiple references (see text for notations).
Input: W the input sentence, T the set of gold trees
Algorithm 1 formalizes the training procedure when the dynamic oracle of Goldberg and Nivre (2013) is used: for each sentence, a parse tree is built incrementally and at each step, if the predicted transition prevents the creation of a gold dependency, the parameters are updated, according, for instance, to the perceptron rule (l.7). Erroneous transitions can efficiently be found using the ORACLE(t, c, T) function, formally defined in (Goldberg and Nivre, 2013) as computing the number of dependencies of a gold parse tree T that can no longer be predicted when a transition t is applied in configuration c.
Footnote 2: A more detailed description of the transformations can be found in (Wisniewski and Lacroix, 2017). The source code is freely available on the first author's web site.
During training, it often happens that several transitions are equally good: in such situations, the training algorithm breaks ties among oracle transitions according to the model's current prediction (l.5). As suggested in the imitation learning literature (Daumé III and Marcu, 2005; Ross and Bagnell, 2010), this strategy makes it possible to sample those configurations that will be the most similar to the ones seen when predicting a new dependency tree: it is a way to let the parser explore more specifically the part of the search space it prefers and is more likely to see at test time (Aufrant et al., 2017). Using a dynamic oracle usually results in substantial improvements in accuracy compared to static oracles.
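For a linear model, the quantities used in this subsection can be written compactly. The feature map $\phi$ and the symbols $\hat{t}$ (predicted transition) and $t^{+}$ (transition followed during training) are notations assumed here for illustration; they do not appear in the original text.

```latex
\begin{align*}
\mathrm{score}(t_1, \dots, t_m) &= \sum_{i=1}^{m} \mathbf{w} \cdot \phi(c_{i-1}, t_i),
  \qquad c_0 = \mathrm{INITIAL}(W), \quad c_i = c_{i-1} \bullet t_i \\
\hat{t} &= \operatorname*{arg\,max}_{t} \; \mathbf{w} \cdot \phi(c, t) \\
t^{+} &= \operatorname*{arg\,max}_{t \,:\, \mathrm{ORACLE}(t, c, T) = 0} \; \mathbf{w} \cdot \phi(c, t) \\
\mathbf{w} &\leftarrow \mathbf{w} + \phi(c, t^{+}) - \phi(c, \hat{t})
  \qquad \text{if } \mathrm{ORACLE}(\hat{t}, c, T) > 0
\end{align*}
```

The third line corresponds to the tie-breaking step: among the transitions that lose no gold dependency, the one the model currently scores highest is followed. The last line is the standard perceptron update, given here as one possible instance of the update rule.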
Considering Multiple References Implementing the training algorithm described above only requires the ability to detect whether a transition will cause an erroneous dependency. It can naturally be extended to the case of multiple references: a transition is considered correct as long as at least one of the gold trees can still be generated after applying it; when moving to a new configuration, trees that can no longer be generated are removed from the set of references, in order to make sure the parser will not mix the dependencies of two gold trees (l.11).
Upon full completion of parsing, there will remain only one surviving reference, which has been selected according to the model's current predictions. This reference corresponds to the dependency structure that is the most similar to the hypothesis the parser would have predicted at test time and can therefore be described as the reference the parser prefers: intuitively, Algorithm 1 will thus identify the reference that will be predicted with the highest accuracy.
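The procedure described in this section can be sketched as follows for an unlabeled arc-eager system. This is not the authors' implementation: to stay short and self-evidently correct, the test "can this reference still be generated?" is done here by exhaustive search (`reachable`) instead of the closed-form ORACLE cost of Goldberg and Nivre (2013), which makes it usable only on very short sentences. The two-template feature set, the function names and the example sentence are all assumptions made for illustration, and every reference is assumed to be projective so that it is reachable from the initial configuration.

```python
from collections import defaultdict
from functools import lru_cache

SHIFT, LEFT, RIGHT, REDUCE = "SHIFT", "LEFT", "RIGHT", "REDUCE"

def legal(c):
    """Arc-eager transitions allowed in configuration c = (stack, buffer, arcs)."""
    stack, buf, arcs = c
    if not buf:
        return []                                    # final configuration
    top_has_head = any(d == stack[-1] for _, d in arcs)
    moves = [SHIFT, RIGHT]
    if stack[-1] != 0 and not top_has_head:
        moves.append(LEFT)                           # the root (0) never gets a head
    if top_has_head:
        moves.append(REDUCE)
    return moves

def apply(c, t):
    """Return the configuration reached by applying transition t in c."""
    stack, buf, arcs = c
    if t == SHIFT:
        return stack + buf[:1], buf[1:], arcs
    if t == RIGHT:
        return stack + buf[:1], buf[1:], arcs | {(stack[-1], buf[0])}
    if t == LEFT:
        return stack[:-1], buf, arcs | {(buf[0], stack[-1])}
    return stack[:-1], buf, arcs                     # REDUCE

@lru_cache(maxsize=None)
def reachable(c, gold):
    """Brute-force test: can some transition sequence from c build exactly `gold`?"""
    _, buf, arcs = c
    if not arcs <= gold:
        return False
    if not buf:
        return arcs == gold
    return any(reachable(apply(c, t), gold) for t in legal(c))

def features(c, tags):
    stack, buf, _ = c
    return [("s0", tags[stack[-1]]), ("b0", tags[buf[0]])]

def score(w, c, t, tags):
    return sum(w[f, t] for f in features(c, tags))

def train_on_sentence(tags, gold_trees, w):
    """One training pass over a sentence; returns the surviving references."""
    c = ((0,), tuple(range(1, len(tags))), frozenset())
    refs = set(gold_trees)
    while c[1]:
        moves = legal(c)
        # A transition is correct if at least one reference survives it.
        correct = [t for t in moves
                   if any(reachable(apply(c, t), g) for g in refs)]
        predicted = max(moves, key=lambda t: score(w, c, t, tags))
        chosen = max(correct, key=lambda t: score(w, c, t, tags))  # tie-breaking
        if predicted not in correct:                 # perceptron update
            for f in features(c, tags):
                w[f, chosen] += 1.0
                w[f, predicted] -= 1.0
        c = apply(c, chosen)
        # Drop the references that can no longer be generated.
        refs = {g for g in refs if reachable(c, g)}
    return refs

tags = ["ROOT", "VERB", "ADP", "PROPN"]              # "went to Paris"
ud = frozenset({(0, 1), (1, 3), (3, 2)})             # (head, dependent) pairs
transformed = frozenset({(0, 1), (1, 2), (2, 3)})
survivors = train_on_sentence(tags, {ud, transformed}, defaultdict(float))
```

Because only correct transitions are followed and distinct references cannot both match the final set of arcs, exactly one reference remains in `survivors` at the end of the sentence. Which one survives depends on the model scores at the point where the two analyses diverge.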

Experiments
Data We separately apply the transformations described in Section 2 to each of the 7 dependencies considered, on the 38 languages of the UD project (v1.3), resulting in 266 transformed corpora. To evaluate the ability of the proposed method to identify the 'best' dependency structure, we consider fully as well as partially transformed sentences: a sentence with n dependencies of interest will generate 2^n references.
For each condition (i.e. a language and a transformation), a dependency parser is trained using (a) the original data annotated with the UD convention, (b) 'transformed' data in which each sentence is associated with a reference in which all dependencies of interest have been transformed and (c) the data associated with a set of references containing all the partially transformed references (including the original and transformed references).
Parser We use our own implementation of an arc-eager unlabeled dependency parser with a dynamic oracle and an averaged perceptron, using the features described in (Zhang and Nivre, 2011), which have been designed for English and have not been adapted to the specificities of the other languages. Training stops when the UAS estimated on the validation set has converged. Figure 2 shows the distribution of differences in UAS between a parser trained on the original data (setting (a)) and a parser trained on the transformed data (setting (b)). To evaluate the proposed transformations, we follow the approach introduced in (Schwartz et al., 2012), which consists in comparing the original and the transformed data on their respective references.
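The evaluation protocol can be made concrete with a short sketch of the unlabeled attachment score (UAS). It assumes that every token is scored (no special treatment of punctuation, which the text does not specify) and that each sentence is given as a list of head indices; the function name and data layout are illustrative.

```python
def uas(predicted, gold):
    """Unlabeled attachment score in percent, micro-averaged over tokens.

    predicted, gold: lists of sentences, each sentence being a list that
    gives the head index of every word (0 stands for the root).
    """
    correct = total = 0
    for p_sent, g_sent in zip(predicted, gold):
        if len(p_sent) != len(g_sent):
            raise ValueError("predicted and gold sentences differ in length")
        correct += sum(p == g for p, g in zip(p_sent, g_sent))
        total += len(g_sent)
    return 100.0 * correct / total

# "went to Paris": gold heads under the UD-style and transformed analyses
gold_ud = [[0, 3, 1]]
gold_transformed = [[0, 1, 2]]
```

Under the protocol described above, the parser of setting (a) would be scored with `uas(pred_a, gold_ud)` and the parser of setting (b) with `uas(pred_b, gold_transformed)`, each against its own references; the difference between the two scores is the quantity whose distribution is reported.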

Impact of Transformations
As expected, the annotation scheme has a large impact on the quality of the prediction, with an average difference in scores of 0.66 UAS points and variations as large as 8.1 UAS points. These results show that, contrary to general belief (Schwartz et al., 2012; Kohita et al., 2017), the UD scheme is not sub-optimal for monolingual parsing: the difference in UAS is negative in 93 conditions and positive in 129. Table 2 details, for each dependency, the percentage of times the UD scheme results in better predictions.

Dependency   UD better
case         44.7%
mark         58.3%
det          80.5%
cc           89.4%
mwe          50.0%
name         45.8%
cop          25.0%
Table 2: Percentage of times a parser trained and evaluated on UD data (setting (a)) outperforms a parser trained and evaluated on transformed data (setting (b)).
Training with Multiple References To assess the impact of training with multiple references (setting (c)), we first evaluate the capacity of Algorithm 1 to consistently select a single annotation scheme during training. We count, in each condition, the number of times the reference that survived training followed the original scheme and the number of times it followed the transformed scheme. For 74.7% of the conditions, the reference that survived training followed the same annotation scheme for more than 70% of the training examples. This observation shows the ability of the parser to commit itself to a single annotation scheme.
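The consistency check described above can be sketched as follows. The representation is an assumption: each condition is reduced to a list holding, for every training sentence, a label naming the scheme of the surviving reference. The function names and the label strings are hypothetical.

```python
from collections import Counter

def commits_to_one_scheme(survivor_labels, threshold=0.7):
    """True if one scheme accounts for more than `threshold` of the sentences.

    survivor_labels: one label per training sentence, e.g. 'original' or
    'transformed', naming the scheme of the reference that survived training.
    """
    counts = Counter(survivor_labels)
    majority_share = max(counts.values()) / len(survivor_labels)
    return majority_share > threshold

def share_of_consistent_conditions(conditions, threshold=0.7):
    """Percentage of conditions in which the parser commits to one scheme.

    conditions: dict mapping a (language, transformation) pair to its labels.
    """
    consistent = sum(commits_to_one_scheme(labels, threshold)
                     for labels in conditions.values())
    return 100.0 * consistent / len(conditions)

toy_conditions = {
    ("fr", "case"): ["original"] * 8 + ["transformed"] * 2,
    ("fr", "cop"): ["original"] * 6 + ["transformed"] * 4,
}
```

On the toy data only the first condition passes the 70% threshold, so the function reports that half of the conditions are consistent.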
Learnability Criterion The training procedure proposed in this article was designed to uncover the dependency structure that will optimize parsing accuracy. In this section we evaluate whether this goal is achieved, by counting the number of conditions in which the annotation scheme that has survived training the most often (in setting (c)) is indeed the one that achieves the best performance on the test set, as evaluated by testing a parser in settings (a) and (b).
We will consider, as baselines, two measures of the 'learnability' of a treebank: the predictability of an annotation scheme (Schwartz et al., 2012) and the derivation perplexity (Søgaard and Haulrich, 2010). Contrary to our approach, these two measures aim at deciding which of two annotation schemes will achieve the best parsing accuracy without actually training and testing a parser. The predictability is defined as the entropy of the conditional distribution of the dependent PoS knowing the head PoS. The derivation perplexity is the perplexity of a 3-gram language model estimated on a corpus in which the words of a sentence appear in the order in which they are attached to their head. Table 3 reports the number of times, averaged over languages and transformations, that each measure of learnability is able to predict which of two competing annotation schemes will yield the best parsing performance. These results clearly show that the approach we propose to evaluate the 'learnability' of an annotation scheme outperforms existing criteria and is able to select the annotation convention that achieves the highest parsing performance.

Metric                  Learnability
predictability          64.8%
derivation perplexity   62.6%
multiple references     76.3%
Table 3: Percentage of times a given learnability measure is able to predict which annotation scheme will result in the best parsing performance. 'multiple references' corresponds to the approach proposed in this work.
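The predictability baseline, as defined in this section, amounts to a conditional entropy and can be sketched in a few lines. The sketch follows only the one-sentence definition given here; the exact formulation in Schwartz et al. (2012) may differ in its details (smoothing, logarithm base), and the function name and input format are assumptions.

```python
import math
from collections import Counter

def predictability(arcs):
    """Conditional entropy H(dependent PoS | head PoS), in bits.

    arcs: iterable of (head_pos, dependent_pos) pairs, one per dependency
    of the treebank.
    """
    arcs = list(arcs)
    pair_counts = Counter(arcs)
    head_counts = Counter(head for head, _ in arcs)
    total = len(arcs)
    entropy = 0.0
    for (head, _), n in pair_counts.items():
        # p(head, dep) * log2 p(dep | head), estimated by relative frequencies
        entropy -= (n / total) * math.log2(n / head_counts[head])
    return entropy

deterministic = [("NOUN", "DET"), ("NOUN", "DET")]
ambiguous = [("NOUN", "DET"), ("NOUN", "ADP")]
```

A scheme in which the head PoS fully determines the dependent PoS gets an entropy of 0 bits, while the second toy example, where a noun governs a determiner or an adposition equally often, gets 1 bit. Two annotation schemes would then be compared through the entropies computed on their respective treebanks.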

Conclusion
This work introduces a new strategy to compare the numerous representations that have been proposed over the years for expressing dependency structures and to discover the one that is easiest to learn. Experiments with a popular transition-based parser on the UD corpora show the validity of the proposed approach.
In future work, we would like to evaluate the impact of annotation conventions on other kinds of parsers and to find the properties of a dependency tree that facilitate its prediction. We also plan to find ways to easily annotate sentences with multiple references (e.g. by indicating that the head of a word can be chosen arbitrarily) and to eliminate the constraint that references should be trees.