Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing

Dynamic oracles provide strong supervision for training constituency parsers with exploration, but must be custom defined for a given parser’s transition system. We explore using a policy gradient method as a parser-agnostic alternative. In addition to directly optimizing for a tree-level metric such as F1, policy gradient has the potential to reduce exposure bias by allowing exploration during training; moreover, it does not require a dynamic oracle for supervision. On four constituency parsers in three languages, the method substantially outperforms static oracle likelihood training in almost all settings. For parsers where a dynamic oracle is available (including a novel oracle which we define for the transition system of Dyer et al., 2016), policy gradient typically recaptures a substantial fraction of the performance gain afforded by the dynamic oracle.


Introduction
Many recent state-of-the-art models for constituency parsing are transition based, decomposing production of each parse tree into a sequence of action decisions (Cross and Huang, 2016; Liu and Zhang, 2017), building on a long line of work in transition-based parsing (Nivre, 2003; Yamada and Matsumoto, 2003; Henderson, 2004; Zhang and Clark, 2011; Chen and Manning, 2014; Andor et al., 2016; Kiperwasser and Goldberg, 2016).
However, models of this type, which decompose structure prediction into sequential decisions, can be prone to two issues (Ranzato et al., 2016; Wiseman and Rush, 2016). The first is exposure bias: if, at training time, the model only observes states resulting from correct past decisions, it will not be prepared to recover from its own mistakes during prediction. The second is loss mismatch between the action-level loss used at training and any structure-level evaluation metric, for example F1.
A large family of techniques address the exposure bias problem by allowing the model to make mistakes and explore incorrect states during training, supervising actions at the resulting states using an expert policy (Daumé III et al., 2009; Ross et al., 2011; Choi and Palmer, 2011; Chang et al., 2015); these expert policies are typically referred to as dynamic oracles in parsing (Goldberg and Nivre, 2012). While dynamic oracles have produced substantial improvements in constituency parsing performance (Coavoux and Crabbé, 2016; Cross and Huang, 2016; González and Gómez-Rodríguez, 2018), they must be custom designed for each transition system.
To address the loss mismatch problem, another line of work has directly optimized for structure-level cost functions (Goodman, 1996; Och, 2003). Recent methods applied to models that produce output sequentially commonly use policy gradient (Auli and Gao, 2014; Ranzato et al., 2016; Shen et al., 2016) or beam search (Xu et al., 2016; Wiseman and Rush, 2016; Edunov et al., 2017) at training time to minimize a structured cost. These methods also reduce exposure bias through exploration but do not require an expert policy for supervision.
In this work, we apply a simple policy gradient method to train four different state-of-the-art transition-based constituency parsers to maximize expected F1. We compare against training with a dynamic oracle (both to supervise exploration and provide loss-augmentation) where one is available, including a novel dynamic oracle that we define for the top-down transition system of Dyer et al. (2016).
We find that while policy gradient usually outperforms standard likelihood training, it typically underperforms the dynamic oracle-based methods, which provide direct, model-aware supervision about which actions are best to take from arbitrary parser states. However, a substantial fraction of each dynamic oracle's performance gain is often recovered using the model-agnostic policy gradient method. In the process, we obtain new state-of-the-art results for single-model discriminative transition-based parsers trained on the English PTB (92.6 F1), French Treebank (83.5 F1), and Penn Chinese Treebank Version 5.1 (87.0 F1).

Models
The transition-based parsers we use all decompose production of a parse tree $y$ for a sentence $x$ into a sequence of actions $(a_1, \ldots, a_T)$ and resulting states $(s_1, \ldots, s_{T+1})$. Actions $a_t$ are predicted sequentially, conditioned on a representation of the parser's current state $s_t$ and parameters $\theta$:
$$p(y \mid x; \theta) = \prod_{t=1}^{T} p(a_t \mid s_t; \theta) \tag{1}$$
We investigate four parsers with varying transition systems and methods of encoding the current state and sentence: (1) the discriminative Recurrent Neural Network Grammars (RNNG) parser of Dyer et al. (2016), (2) the In-Order parser of Liu and Zhang (2017), (3) the Span-Based parser of Cross and Huang (2016), and (4) the Top-Down parser of . 1 We refer to the original papers for descriptions of the transition systems and model parameterizations.

Training Procedures
Likelihood training without exploration maximizes Eq. 1 for trees in the training corpus, but may be prone to exposure bias and loss mismatch (Section 1). Dynamic oracle methods are known to improve on this training procedure for a variety of parsers (Coavoux and Crabbé, 2016; Cross and Huang, 2016; González and Gómez-Rodríguez, 2018), supervising exploration during training by providing the parser with the best action to take at each explored state. We describe how policy gradient can be applied as an oracle-free alternative. We then compare to several variants of dynamic oracle training which focus on addressing exposure bias, loss mismatch, or both.

Policy Gradient
Given an arbitrary cost function $\Delta$ comparing structured outputs (e.g. negative labeled F1, for trees), we use the risk objective:
$$\mathcal{R}(\theta) = \sum_{i=1}^{N} \sum_{y} p(y \mid x^{(i)}; \theta)\, \Delta(y, y^{(i)})$$
which measures the model's expected cost over possible outputs $y$ for each of the training examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$. Minimizing a risk objective has a long history in structured prediction (Povey and Woodland, 2002; Smith and Eisner, 2006; Li and Eisner, 2009; Gimpel and Smith, 2010) but often relies on the cost function decomposing according to the output structure. However, we can avoid any restrictions on the cost using reinforcement learning-style approaches (Xu et al., 2016; Shen et al., 2016; Edunov et al., 2017) where cost is ascribed to the entire output structure, albeit at the expense of introducing a potentially difficult credit assignment problem.
The policy gradient method we apply is a simple variant of REINFORCE (Williams, 1992). We perform mini-batch gradient descent using a sample-based estimate of the gradient of the risk objective:
$$\nabla \mathcal{R}(\theta) \approx \sum_{i} \sum_{y \in \mathcal{Y}(x^{(i)})} \Delta(y, y^{(i)})\, \nabla \log p(y \mid x^{(i)}; \theta)$$
where $\mathcal{Y}(x^{(i)})$ is a set of $k$ candidate trees obtained by sampling from the model's distribution for sentence $x^{(i)}$. We use negative labeled F1 for $\Delta$.
To reduce the variance of the gradient estimates, we standardize $\Delta$ using its running mean and standard deviation across all candidates used so far throughout training. Following Shen et al. (2016), we also found better performance when including the gold tree $y^{(i)}$ in the set of $k$ candidates $\mathcal{Y}(x^{(i)})$, and do so for all experiments reported here. 2
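As an illustration of the cost standardization described above, the following is a minimal sketch, not the authors' implementation. It assumes trees are represented as sets of (label, start, end) constituents (which ignores duplicate constituents), that the running statistics are updated before the current candidates are standardized, and that the model's log-probabilities are supplied by the caller. The names `RunningStats`, `labeled_f1`, `standardized_costs`, and `surrogate_loss` are hypothetical.

```python
import math


class RunningStats:
    """Running mean and (population) standard deviation via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0


def labeled_f1(pred, gold):
    """Labeled F1 between two sets of (label, start, end) constituents."""
    matched = len(pred & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def standardized_costs(candidates, gold, stats, eps=1e-8):
    """Cost is negative F1; standardize with statistics over all candidates seen so far."""
    costs = [-labeled_f1(c, gold) for c in candidates]
    for c in costs:
        stats.update(c)  # assumption: statistics include the current candidates
    return [(c - stats.mean) / (stats.std + eps) for c in costs]


def surrogate_loss(log_probs, weights):
    """Sum of weight * log p(y | x). Treating the weights as constants, its gradient
    with respect to the model parameters is the sampled risk-gradient estimate."""
    return sum(w * lp for w, lp in zip(weights, log_probs))
```

In a real training loop, `log_probs` would come from an automatic differentiation framework so that minimizing `surrogate_loss` raises the probability of low-cost candidates (such as the gold tree) and lowers that of high-cost ones.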

Dynamic Oracle Supervision
For a given parser state $s_t$, a dynamic oracle defines an action $a^*(s_t)$ which should be taken to incrementally produce the best tree still reachable from that state. 3 Dynamic oracles provide strong supervision for training with exploration, but require custom design for a given transition system. Cross and Huang (2016) and  defined optimal (with respect to F1) dynamic oracles for their respective transition systems, and below we define a novel dynamic oracle for the top-down system of RNNG.
In RNNG, tree production occurs in a stack-based, top-down traversal which produces a left-to-right linearized representation of the tree using three actions: OPEN a labeled constituent (which fixes the constituent's span to begin at the next word in the sentence which has not been shifted), SHIFT the next word in the sentence to add it to the current constituent, or CLOSE the current constituent (which fixes its span to end after the last word that has been shifted). The parser stores opened constituents on the stack, and must therefore close them in the reverse of the order that they were opened.
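To make the effect of the three actions concrete, here is a small sketch that replays an action sequence and records the span fixed by each CLOSE. It is only an illustration of the description above, not code from any parser release. The encoding of actions (the strings "SHIFT" and "CLOSE", and tuples ("OPEN", label)) and the function name `actions_to_spans` are assumptions made for this example.

```python
def actions_to_spans(actions):
    """Replay OPEN/SHIFT/CLOSE actions and return (label, start, end) constituents
    in the order they are closed. Spans are half-open over word positions."""
    stack = []    # open constituents as (label, start), most recent last
    shifted = 0   # number of words shifted so far
    spans = []
    for act in actions:
        if act == "SHIFT":
            shifted += 1
        elif act == "CLOSE":
            # CLOSE always applies to the most recently opened constituent.
            label, start = stack.pop()
            spans.append((label, start, shifted))
        else:
            # ("OPEN", label): the span begins at the next unshifted word.
            stack.append((act[1], shifted))
    return spans
```

For a four-word sentence with the tree (S (NP w0 w1) (VP w2 (NP w3))), the sequence OPEN S, OPEN NP, SHIFT, SHIFT, CLOSE, OPEN VP, SHIFT, OPEN NP, SHIFT, CLOSE, CLOSE, CLOSE yields the inner constituents first and S last, reflecting the stack discipline.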
At a given parser state, our oracle does the following:
1. If there are any open constituents on the stack which can be closed (i.e. have had a word shifted since being opened), check the topmost of these (the one that has been opened most recently). If closing it would produce a constituent from the gold tree that has not yet been produced (which is determined by the constituent's label, span beginning position, and the number of words currently shifted), or if the constituent could not be closed at a later position in the sentence to produce a constituent in the gold tree, return CLOSE.
2. Otherwise, if there are constituents in the gold tree which have not yet been opened in the parser state, with span beginning at the next unshifted word, OPEN the outermost of these.
3. Otherwise, SHIFT the next word.
While we do not claim that this dynamic oracle is optimal with respect to F1, we find that it still helps substantially in supervising exploration (Section 5).
2 Including the gold tree in the candidate set biases the estimate of the risk objective's gradient; however, since in the parsing tasks we consider the gold tree has constant and minimal cost, augmenting with the gold tree is equivalent to jointly optimizing the standard likelihood and risk objectives, using an adaptive scaling factor for each objective that is dependent on the cost for the trees that have been sampled from the model. We found that including the gold candidate in this manner outperformed initial experiments that first trained a model using likelihood training and then fine-tuned using unbiased policy gradient.
3 More generally, an oracle can return a set of such actions that could be taken from the current state, but the oracles we use select a single canonical action.
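The three oracle rules can be sketched in code as follows. This is one possible reading of the prose description rather than the authors' implementation. It assumes the gold tree is given as a list of (label, start, end) constituents in top-down (pre-order) order, that only the constituent on top of the stack is considered for closing, and that "already opened" is decided by matching labels of constituents opened at the current position. The names `rnng_oracle_action` and `follow_oracle` and the action encoding are hypothetical.

```python
from collections import Counter


def rnng_oracle_action(stack, shifted, n_words, gold, produced):
    """Return "CLOSE", ("OPEN", label), "SHIFT", or None when parsing is finished.

    stack:    open constituents as (label, start), most recent last
    shifted:  number of words shifted so far
    gold:     list of (label, start, end) in pre-order (outermost first)
    produced: set of (label, start, end) constituents the parser has closed
    """
    # Rule 1: consider closing the top constituent if it has had a word shifted.
    if stack and stack[-1][1] < shifted:
        label, start = stack[-1]
        span = (label, start, shifted)
        new_gold = span in gold and span not in produced
        closable_later = any(
            l == label and s == start and e > shifted for l, s, e in gold
        )
        if new_gold or not closable_later:
            return "CLOSE"
    # Rule 2: open the outermost gold constituent starting here that is not yet open.
    opened_here = Counter(l for l, s in stack if s == shifted)
    for l, s, _ in gold:
        if s != shifted:
            continue
        if opened_here[l] > 0:
            opened_here[l] -= 1  # treat this gold constituent as already opened
        else:
            return ("OPEN", l)
    # Rule 3: otherwise shift, if any words remain.
    return "SHIFT" if shifted < n_words else None


def follow_oracle(gold, n_words):
    """Run the oracle from the initial state and collect its actions."""
    stack, shifted, produced, actions = [], 0, set(), []
    while True:
        act = rnng_oracle_action(stack, shifted, n_words, gold, produced)
        if act is None:
            return actions
        actions.append(act)
        if act == "SHIFT":
            shifted += 1
        elif act == "CLOSE":
            label, start = stack.pop()
            produced.add((label, start, shifted))
        else:
            stack.append((act[1], shifted))
```

Starting from the initial state, this sketch reproduces the gold action sequence; from a state reached by exploration (for example, with a wrongly opened constituent on top of the stack), it returns CLOSE as soon as that constituent can no longer match anything in the gold tree.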
Likelihood Training with Exploration. Past work has differed on how to use dynamic oracles to guide exploration during oracle training (Cross and Huang, 2016). We use the same sample-based method of generating candidate sets $\mathcal{Y}$ as for policy gradient, which ensures that the dynamic oracle and policy gradient methods perform an equal amount of exploration. Likelihood training with exploration then maximizes the sum of the log probabilities for the oracle actions for all states composing the candidate trees:
$$\mathcal{L}_E(\theta) = \sum_{i} \sum_{y \in \mathcal{Y}(x^{(i)})} \sum_{s \in y} \log p(a^*(s) \mid s; \theta)$$
where $a^*(s)$ is the dynamic oracle's action for state $s$.
Softmax Margin. Softmax margin loss (Gimpel and Smith, 2010; Auli and Lopez, 2011) addresses loss mismatch by incorporating task cost into the training loss. Since trees are decomposed into a sequence of local action predictions, we cannot use a global cost, such as F1, directly. As a proxy, we rely on the dynamic oracles' action-level supervision.
In all models we consider, action probabilities (Eq. 1) are parameterized by a softmax function
$$p(a \mid s_t; \theta) \propto \exp(z(a, s_t, \theta))$$
for some state-action scoring function $z$. The softmax-margin objective replaces this by
$$p_{SMM}(a \mid s_t; \theta) \propto \exp(z(a, s_t, \theta) + \Delta(a, a^*_t)) \tag{2}$$
We use $\Delta(a, a^*_t) = 0$ if $a = a^*_t$ and 1 otherwise. This can be viewed as a "soft" version of the max-margin objective used by  for training without exploration, but retains a locally-normalized model that we can use for sampling-based exploration.
Softmax Margin with Exploration. Finally, we train using a combination of softmax margin loss augmentation and exploration. We perform the same sample-based candidate generation as for policy gradient and likelihood training with exploration, but use Eq. 2 to compute the training loss for candidate states. For those parsers that have a dynamic oracle, this provides a means of training that more directly provides both exploration and cost-aware losses.
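The cost-augmented distribution of Eq. 2 can be illustrated for a single parser state with a short sketch. It assumes the scores $z$ for all valid actions are given as a plain list and that the oracle action is identified by its index; the function names `log_softmax` and `softmax_margin_loss` are hypothetical, and the code is not taken from any of the parsers' releases.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of action scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax_margin_loss(scores, oracle_idx):
    """Negative log-probability of the oracle action under Eq. 2, using a 0/1 cost:
    every non-oracle action has its score raised by 1 before normalization."""
    augmented = [
        s + (0.0 if a == oracle_idx else 1.0) for a, s in enumerate(scores)
    ]
    return -log_softmax(augmented)[oracle_idx]
```

With two actions that score equally, the plain negative log-likelihood is log 2, while the softmax-margin loss is log(1 + e): the oracle action must outscore the alternatives by a margin before the loss becomes small.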

Experiments
We compare the constituency parsers listed in Section 2 using the above training methods. Our experiments use the English PTB (Marcus et al., 1993), French Treebank (Abeillé et al., 2003), and Penn Chinese Treebank (CTB) Version 5.1 (Xue et al., 2005).
Training. To compare the training procedures as closely as possible, we train all models for a given parser in a given language from the same randomly-initialized parameter values.
We train two different versions of the RNNG model: one model using size 128 for the LSTMs and hidden states (following the original work), and a larger model with size 256. We perform evaluation using greedy search in the Span-Based and Top-Down parsers, and beam search with beam size 10 for the RNNG and In-Order parsers. We found that beam search improved performance for these two parsers by around 0.1-0.3 F1 on the development sets, and use it at inference time in every setting for these two parsers.
In our experiments, policy gradient typically requires more epochs of training to reach performance comparable to either of the dynamic oracle-based exploration methods. Figure 1 gives a typical learning curve, for the Top-Down parser on English. We found that policy gradient is also more sensitive to the number of candidates sampled per sentence than either of the other exploration methods, with best performance on the development set usually obtained with k = 10 among the values k ∈ {2, 5, 10} that we tried (where k also counts the sentence's gold tree, included in the candidate set). See Appendix A in the supplemental material for the values of k used.
Tags, Embeddings, and Morphology. We largely follow previous work for each parser in our use of predicted part-of-speech tags, pretrained word embeddings, and morphological features.
All parsers use predicted part-of-speech tags as part of their sentence representations. For English and Chinese, we follow the setup of Cross and Huang (2016): training the Stanford tagger (Toutanova et al., 2003) on the training set of each parsing corpus to predict development and test set tags, and using 10-way jackknifing to predict tags for the training set.
For French, we use the predicted tags and morphological features provided with the SPMRL dataset (Seddah et al., 2014). We modified the publicly released code for all parsers to use predicted morphological features for French. We follow the approach outlined by Cross and Huang (2016) and  for representing morphological features as learned embeddings, and use the same dimensions for these embeddings as in their papers. For RNNG and In-Order, we similarly use 10-dimensional learned embeddings for each morphological feature, feeding them as LSTM inputs for each word alongside the word and part-of-speech tag embeddings.
For RNNG and the In-Order parser, we use the same word embeddings as the original papers for English and Chinese, and train 100-dimensional word embeddings for French using the structured skip-gram method of Ling et al. (2015) on French Wikipedia.

Results and Discussion
Table 1 compares parser F1 by training procedure for each language. Policy gradient improves upon likelihood training in 14 out of 15 cases, with improvements of up to 1.5 F1. One of the three dynamic oracle-based training methods (either likelihood with exploration, softmax margin (SMM), or softmax margin with exploration) obtains better performance than policy gradient in 10 out of 12 cases. This is perhaps unsurprising given the strong supervision provided by the dynamic oracles and the credit assignment problem faced by policy gradient. However, a substantial fraction of this performance gain is recaptured by policy gradient in most cases.
While likelihood training with exploration using a dynamic oracle more directly addresses exposure bias, and softmax margin training more directly addresses loss mismatch, these two phenomena are still entangled, and the best dynamic oracle-based method to use varies. The effectiveness of the oracle method is also likely to be influenced by the nature of the dynamic oracle available for the parser. For example, the oracle for RNNG lacks F1 optimality guarantees, and softmax margin without exploration often underperforms likelihood for this parser. However, exploration improves softmax margin training across all parsers and conditions. Although results from likelihood training are mostly comparable between RNNG-128 and the larger model RNNG-256 across languages, policy gradient and likelihood training with exploration both typically yield larger improvements in the larger models, obtaining 92.6 F1 for English and 86.0 for Chinese (using likelihood training with exploration), although results are slightly higher for the policy gradient and dynamic oracle-based methods for the smaller model on French (including 83.5 with softmax margin with exploration). Finally, we observe that policy gradient also provides large improvements for the In-Order parser, where a dynamic oracle has not been defined.
We note that although some of these results (92.6 for English, 83.5 for French, 87.0 for Chinese) are state-of-the-art for single-model discriminative transition-based parsers, other work on constituency parsing achieves better performance through other methods. Techniques that combine multiple models or add semi-supervised data (Vinyals et al., 2015; Choe and Charniak, 2016; Kuncoro et al., 2017; Liu and Zhang, 2017; Fried et al., 2017) are orthogonal to, and could be combined with, the single-model, fixed training data methods we explore. Other recent work (Gaddy et al., 2018; Kitaev and Klein, 2018) obtains comparable or stronger performance with global chart decoders, where training uses loss augmentation provided by an oracle. By performing model-optimal global inference, these parsers likely avoid the exposure bias problem of the sequential transition-based parsers we investigate, at the cost of requiring a chart decoding procedure for inference.
Overall, we find that although optimizing for F1 in a model-agnostic fashion with policy gradient typically underperforms the model-aware expert supervision given by the dynamic oracle training methods, it provides a simple method for consistently improving upon static oracle likelihood training, at the expense of increased training costs.