Approximate Dynamic Oracle for Dependency Parsing with Reinforcement Learning

We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive. We treat oracle parsing as a reinforcement learning problem, design the reward function inspired by the classical dynamic oracle, and use Deep Q-Network (DQN) techniques to train the oracle with gold trees as features. The combination of a priori knowledge and data-driven methods enables an efficient dynamic oracle, which improves parser performance over static oracles in several transition systems.


Introduction
Greedy transition-based dependency parsers trained with static oracles are very efficient but suffer from the error propagation problem. Goldberg and Nivre (2012, 2013) laid the foundation for dynamic oracles, which train the parser with imitation learning methods to alleviate this problem. However, efficient dynamic oracles have mostly been designed for arc-decomposable transition systems, which are usually projective. Gómez-Rodríguez et al. (2014) designed a non-projective dynamic oracle, but it runs in O(n^8) time. Gómez-Rodríguez and Fernández-González (2015) proposed an efficient dynamic oracle for the non-projective Covington system (Covington, 2001; Nivre, 2008), but the system itself has quadratic worst-case complexity.
Instead of designing the oracles, Straka et al. (2015) applied the imitation learning approach (Daumé et al., 2009) by rolling out with the parser to estimate the cost of each action. Le and Fokkens (2017) took the reinforcement learning approach (Maes et al., 2009) by directly optimizing the parser towards the reward (i.e., the correct arcs) instead of the correct action, so that no oracle is required. Both approaches circumvent the difficulty of designing the oracle cost function by using the parser to (1) explore the cost of each action, and (2) explore erroneous states to alleviate error propagation.
However, letting the parser explore for both purposes is inefficient and makes training hard to converge. For this reason, we propose to separate the two types of exploration: (1) the oracle explores the action space to learn the action cost with reinforcement learning, and (2) the parser explores the state space to learn from the oracle with imitation learning.
The objective of the oracle is to prevent further structure errors given a potentially erroneous state. We design the reward function to approximately reflect the number of unreachable gold arcs caused by the action, and let the model learn the actual cost from data. We use DQN (Mnih et al., 2013) with several extensions to train an Approximate Dynamic Oracle (ADO), which uses the gold tree as features and estimates the cost of each action in terms of potential attachment errors. We then use the oracle to train a parser with imitation learning methods following Goldberg and Nivre (2013).
A major difference between our ADO and the search-based or RL-based parser is that our oracle uses the gold tree as features in contrast to the lexical features of the parser, which results in a much simpler model solving a much simpler task. Furthermore, we only need to train one oracle for all treebanks, which is much more efficient.
We experiment with several transition systems, and show that training the parser with ADO performs better than training with static oracles in most cases, and on a par with the exact dynamic oracle if available. We also conduct an analysis of the oracle's robustness against error propagation for further investigation and improvement.
Our work provides an initial attempt to combine the advantages of reinforcement learning and imitation learning for structured prediction in the case of dependency parsing.

Approximate Dynamic Oracle
We treat oracle parsing as a deterministic Markov Decision Process (Maes et al., 2009), where a state corresponds to a parsing configuration with the gold tree known. The tokens are represented only by their positions in the stack or buffer, i.e., without lexical information. Unlike normal parsing, the initial state for the oracle can be any possible state in the entire state space, and the objective of the oracle is to minimize further structure errors, which we incorporate into the reward function.

Transition Systems and Reward Function
We define a unified reward function for the four transition systems that we experiment with: Arc-Standard (Yamada and Matsumoto, 2003; Nivre, 2004), Attardi's system with gap-degree of 1 (Attardi, 2006; Kuhlmann and Nivre, 2006), Arc-Standard with the swap transition (Nivre, 2009), and Arc-Hybrid (Kuhlmann et al., 2011). We denote them as STANDARD, ATTARDI, SWAP, and HYBRID, respectively. We formalize the actions in these systems in Appendix A.
The reward function approximates the arc reachability property as in Goldberg and Nivre (2013).
Concretely, when an arc with head-dependent pair ⟨h, d⟩ is introduced, there are two cases of unreachable arcs: (1) if a pending token h′ (h′ ≠ h) is the gold head of d, then ⟨h′, d⟩ is unreachable; (2) if a pending token d′ has d as its gold head, then ⟨d, d′⟩ is unreachable. If an attachment action does not immediately introduce unreachable arcs, we consider it correct.
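The following is a minimal sketch of this counting rule; the state representation (a set of pending token indices plus a gold-head mapping) and the function name are our own illustration, not the implementation used in the experiments.

def count_unreachable(h, d, pending, gold_head):
    """Count the gold arcs made unreachable by introducing the arc <h, d>.

    pending   -- indices of tokens still in the stack or buffer (before d is reduced)
    gold_head -- mapping from token index to its gold head index
    """
    lost = 0
    # Case 1: the gold head of d is a different pending token h', so <h', d> is lost.
    if gold_head[d] != h and gold_head[d] in pending:
        lost += 1
    # Case 2: pending tokens whose gold head is d can no longer be attached to d.
    for t in pending:
        if t != d and gold_head[t] == d:
            lost += 1
    return lost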
The main reward for an attachment action is the negative count of immediate unreachable arcs it introduces, which sums up to the total number of attachment errors from a global view. We also incorporate some heuristics into the reward function, so that the swap action and non-projective (Attardi) attachments are slightly discouraged. Finally, we give a positive reward to a correct attachment to prevent the oracle from unnecessarily postponing attachment decisions. The exact reward values are modestly tuned in preliminary experiments, and the reward function is defined as follows:

r = \begin{cases} -n, & \text{if } n \text{ unreachable arcs are introduced} \\ 0.5, & \text{if the attachment is correct but non-projective} \\ 1, & \text{if the attachment is correct and projective} \end{cases}

Although we only define the reward function for the four transition systems here, it can easily be extended to other systems by following the general principles: (1) reflect the number of unreachable arcs; (2) identify the unreachable arcs as early as possible; (3) reward correct attachments; (4) add system-specific heuristics.
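A hedged sketch of this reward computation is given below, using a count of immediately lost arcs such as the one sketched above; the swap penalty and other system-specific heuristics are left as a comment, since their exact values are tuning choices.

def attachment_reward(n_unreachable, projective):
    # Per-action reward for an attachment (Section 2.1); a sketch, not the exact implementation.
    # System-specific heuristics, e.g. slightly discouraging swap, would be added separately.
    if n_unreachable > 0:
        return -float(n_unreachable)   # punish by the number of lost gold arcs
    if projective:
        return 1.0                     # correct and projective attachment
    return 0.5                         # correct but non-projective (Attardi) attachment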
Also note that the present reward function is not necessarily optimal. E.g., in the HYBRID system, a shift could also cause an unreachable arc, which is considered in the exact dynamic oracle by Goldberg and Nivre (2013), while the ADO can only observe the loss in later steps. We intentionally do not incorporate this knowledge into the reward function in order to demonstrate that the ADO is able to learn from delayed feedback, which is necessary in most systems other than HYBRID. We elaborate on the comparison to the exact dynamic oracle in Section 4.

Feature Extraction and DQN Model
In contrast to the rich lexicalized features for the parser, we use a very simple feature set for the oracle. We use binary features to indicate the position of the gold head of the first 10 tokens in the stack and in the buffer. We also encode whether the gold head is already lost and whether the token has collected all its pending gold dependents. Additionally, we encode the 5 previous actions leading to this state, as well as all valid actions in this state.
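A minimal sketch of such a feature extractor is shown below; the state layout, the helper fields, and the exact binary encoding (one-hot head positions over the first k stack and buffer slots) are illustrative assumptions rather than the feature templates actually used.

from dataclasses import dataclass
from typing import List, Set

@dataclass
class OracleState:
    stack: List[int]       # token indices, top of the stack last
    buffer: List[int]      # token indices, front of the buffer first
    lost_heads: Set[int]   # tokens whose gold head has already been reduced
    done_deps: Set[int]    # tokens that have collected all their gold dependents
    history: List[int]     # indices of the previous actions
    valid: List[bool]      # validity of each action in this state

def oracle_features(state: OracleState, gold_head: List[int],
                    k: int = 10, hist: int = 5, n_actions: int = 7) -> List[int]:
    """Unlexicalized binary features for the ADO (illustrative layout)."""
    pending = state.stack[::-1][:k] + state.buffer[:k]      # first k tokens of stack and buffer
    feats: List[int] = []
    for tok in pending:
        head_pos = [0] * (2 * k)                            # where the gold head sits, one-hot
        if gold_head[tok] in pending:
            head_pos[pending.index(gold_head[tok])] = 1
        feats += head_pos
        feats.append(int(tok in state.lost_heads))          # gold head already lost
        feats.append(int(tok in state.done_deps))           # all gold dependents collected
    feats += [0] * (2 * k * (2 * k + 2) - len(feats))       # pad to a fixed width
    for a in state.history[-hist:]:                         # the previous actions, one-hot
        feats += [int(a == i) for i in range(n_actions)]
    feats += [int(v) for v in state.valid]                  # valid actions in this state
    return feats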
We use the Deep Q-Network (DQN) to model the oracle, where the input is the aforementioned binary features of a state, and the output is the estimated value of each action in this state. The training objective of the basic DQN is to minimize the expected Temporal Difference (TD) loss:

\mathcal{L} = \mathbb{E}_{(s, a, r, s') \sim \pi} \left[ \left( r + \gamma\, Q(s', a') - Q(s, a) \right)^{2} \right]

where π is the policy given the value function Q, which assigns a score to each action, s is the current state, a is the performed action, r is the reward, γ is the discount factor, s′ is the next state, and a′ = arg max_a Q(s′, a) is the optimal action in state s′.
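A minimal sketch of this objective in PyTorch-style code follows; the batch layout and the absence of a separate target network are simplifying assumptions for illustration.

import torch
import torch.nn.functional as F

def td_loss(q_net, batch, gamma=0.99):
    """One-step TD loss for a batch of (s, a, r, s_next, done) transitions."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a) of the performed actions
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values            # Q(s', a') with a' the optimal action
        target = r + gamma * q_next * (1 - done)             # bootstrap unless the episode ended
    return F.mse_loss(q_sa, target)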
We apply several techniques to improve the stability of the DQN, including the averaged DQN (Anschel et al., 2016) to reduce variance, the dueling network (Wang et al., 2016) to decouple the estimation of state and action values, and prioritized experience replay (Schaul et al., 2015) to increase sample efficiency.
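As an illustration of one of these extensions, a dueling output head decomposes the action value into a state value and action advantages; the sketch below assumes a hidden representation of the binary features has already been computed, and is not the exact architecture used in the experiments.

import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), following Wang et al. (2016)."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)
        self.advantage = nn.Linear(hidden_dim, n_actions)

    def forward(self, h):
        v = self.value(h)                             # state value, shape (batch, 1)
        a = self.advantage(h)                         # action advantages, shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)    # combined action-value estimates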

Sampling Oracle Training Instances
Our goal is to learn to handle erroneous states, so we need to sample such states during training. Concretely, for every state in the sampling process, apart from following the ε-greedy policy (i.e., selecting a random action with probability ε), we fork the path with a certain probability by taking a valid random action to simulate a mistake by the parser. We treat each forked path as a new episode starting from the state after the forking action. Also, to increase sample efficiency, we only take the first N states in each episode, as illustrated in Figure 1.
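The sampling loop can be sketched roughly as follows; the environment interface (initial_state, step, valid_actions, is_terminal), the forking probability, and the value of N (max_states) are illustrative assumptions, not the settings used in the experiments.

import random

def sample_transitions(env, policy, epsilon=0.1, fork_prob=0.1, max_states=8):
    """Collect oracle training transitions with epsilon-greedy exploration and path forking."""
    queue = [env.initial_state()]
    transitions = []
    while queue:
        state, steps = queue.pop(), 0
        while not env.is_terminal(state) and steps < max_states:   # keep only the first N states
            if random.random() < fork_prob:                         # simulate a parser mistake
                wrong = random.choice(env.valid_actions(state))
                queue.append(env.step(state, wrong)[0])              # forked path becomes a new episode
            if random.random() < epsilon:                            # epsilon-greedy exploration
                action = random.choice(env.valid_actions(state))
            else:
                action = policy(state)
            next_state, reward = env.step(state, action)
            transitions.append((state, action, reward, next_state))
            state, steps = next_state, steps + 1
    return transitions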

Cost-Sensitive Parser Training
Since the DQN estimates the cost of each action, we can apply cost-sensitive training with a multi-margin loss (Edunov et al., 2017) for the parser instead of the negative log-likelihood or hinge loss. Concretely, we enforce the margin between the parser scores of the correct action and every other action to be larger than the difference between the oracle scores of these two actions:

\mathcal{L} = \sum_{a \in A} \max\left(0,\; P(a \mid s) - P(a^{*} \mid s) + Q(a^{*} \mid s) - Q(a \mid s)\right)

where A is the set of valid actions, P(·|s) is the parser score, Q(·|s) is the oracle score, and a* is the optimal action according to the oracle. Cost-sensitive training is similar to the non-deterministic oracle (Goldberg and Nivre, 2013), since actions with similar oracle scores only need to maintain a smaller margin, thus allowing spurious ambiguity. On the other hand, it also penalizes actions with larger cost more heavily, thus focusing the training on the more important actions.
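A minimal sketch of this loss for a single state, assuming the parser and oracle scores are given as 1-D tensors over the valid actions:

import torch

def multi_margin_loss(parser_scores, oracle_scores):
    """Cost-sensitive multi-margin loss for one state (a sketch)."""
    a_star = oracle_scores.argmax()                              # optimal action according to the oracle
    margin = oracle_scores[a_star] - oracle_scores               # required margin Q(a*|s) - Q(a|s)
    violation = parser_scores - parser_scores[a_star] + margin   # P(a|s) - P(a*|s) + margin
    return torch.clamp(violation, min=0).sum()                   # the a* term is zero by construction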

Data and Settings
We conduct our experiments on the 55 big treebanks from the CoNLL 2017 shared task (Zeman et al., 2017), referred to by their treebank codes, e.g., grc for Ancient Greek. For easier replicability, we use the predicted segmentation, part-of-speech tags and morphological features by UDPipe (Straka et al., 2016), provided in the shared task, and evaluate on Labeled Attachment Score (LAS) with the official evaluation script. We also provide the parsing results by UDPipe as a baseline, which incorporates the search-based oracle for non-projective parsing (Straka et al., 2015).
We implement the parser architecture with bidirectional LSTMs following Kiperwasser and Goldberg (2016) with the minimal feature set, namely three tokens in the stack and one token in the buffer. For each token, we compose character-based representations with convolutional neural networks following Yu and Vu (2017), and concatenate them with randomly initialized embeddings of the word form, universal POS tag, and morphological features. All hyperparameters of the parser and the oracle are listed in Appendix B. We compare training the parser with ADOs to training with static oracles in the four transition systems STANDARD, ATTARDI, SWAP, and HYBRID. Additionally, we implement the exact dynamic oracle (EDO) for HYBRID as an upper bound. For each system, we only use the portion of the training data that all oracles can parse, e.g., for STANDARD and HYBRID, we only train on projective trees.
We did preliminary experiments with the ADOs in three settings: (1) O_all is trained only on the non-projective trees from all training treebanks (ca. 133,000 sentences); (2) O_ind is trained on the individual treebank used for training the parser; and (3) O_tune is based on O_all, but fine-tuned interactively during parser training by letting the parser initiate the forked episodes. Results show that the three versions have very similar performance; we thus choose the simplest one, O_all, to report and analyze, since in this setting only one oracle is needed for training on all treebanks.

Oracle Recovery Test
Before using the oracle to train the parser, we first test the oracle's ability to control mistakes. In this test, we use a parser trained with the static oracle to parse the development set, and starting from parsing step 0, 10, 20, 30, 40, and 50, we let the ADO fork the path and parse until the end. We use the error rate of the oracle averaged over the aforementioned starting steps as a measure of the oracle's robustness against error propagation: the smaller the rate, the more robust the oracle. Note that we identify errors only when the incorrect arcs are produced, but they could already be inevitable due to previous actions, which means some of the parser's mistakes are attributed to the oracle, resulting in a more conservative estimate of the oracle's recovery ability.

Figures 2a and 2b show the average error rate for each treebank and its relation to the percentage of non-projective arcs in the projective STANDARD and the non-projective SWAP systems. Generally, the error rate correlates with the percentage of non-projective arcs. However, even in the most difficult case (i.e., grc, with over 17% non-projective arcs), the oracle only introduces 5% errors in the non-projective system, which is much lower than the parser's error rate of over 40%. The higher error rates in the projective system are due to the fact that the number of errors is at least the number of non-projective arcs. Figures 2c and 2d show the oracles' error recovery performance in the most difficult case, grc. The error curves of the oracles in the non-projective systems are very flat, while in the STANDARD system, the errors of the oracle starting from step 0 are only slightly higher than the number of non-projective arcs (the dotted line), which is the lower bound of errors. These results all confirm that the ADO is able to find actions that minimize further errors given any potentially erroneous state.

Parsing Results

In the final experiment, we compare the performance of the parser trained with the ADOs against the static oracle, and against the EDO where available. Table 1 shows the LAS of 12 representative treebanks, while the full results are shown in Appendix C. In the selection, we include treebanks with the highest percentage of non-projective arcs (grc, nl_lassysmall, grc_proiel), almost only projective trees (ja, gl, zh), the most training data (cs, ru_syntagrus, cs_cac), and the least training data (cs_cltt, hu, en_partut).
Out of the 55 treebanks, the ADO is beneficial for 41, 40, 41, and 35 treebanks in the four systems, and on average outperforms the static baseline by 0.33%, 0.33%, 0.51%, and 0.21%, respectively. Considering treebank characteristics, training with ADOs is beneficial in most cases irrespective of the projectivity of the treebank. It works especially well for small treebanks, but not as well for very large treebanks. The reason could be that the error propagation problem is not as severe when the parsing accuracy is high, which correlates with the training data size.
In HYBRID, the benefits of the ADO and the EDO are very close: they outperform the static baseline by 0.21% and 0.27%, respectively, which means that the ADO approximates the upper-bound EDO quite well.
Note that we train the parsers only on projective trees in projective systems to ensure a fair comparison. However, the ADO is able to guide the parser even on non-projective trees, and the resulting parsers in STANDARD outperform the baseline by 1.24% on average (see Appendix C), almost bridging the performance gap between projective and non-projective systems.

Comparing to Exact Dynamic Oracle
The purpose of the ADO is to approximate the dynamic oracle in transition systems where an exact dynamic oracle is unavailable or inefficient. Nevertheless, comparing the ADO to an EDO, where one exists, shows how good the approximation is, with the EDO serving as an upper bound. We therefore compare our ADO to the EDO (Goldberg and Nivre, 2013) in the HYBRID system.
First, we compare the reward function of the ADO (see Section 2.1) to the cost function of the EDO, which is: (1) for an attachment action that introduces an arc ⟨h, d⟩, the cost is the number of reachable dependents of d plus whether d is still reachable to its gold head h′ (h′ ≠ h); and (2) for shift, where the first token b of the buffer is shifted onto the stack, the cost is the number of reachable dependents of b in the stack plus whether the gold head of b is in the stack, excluding the top item.
The general ideas of both oracles are very similar, namely to punish an action by the number of unreachable arcs it introduces. However, the definitions of reachability are slightly different.
Reachable arcs in the ADO are defined more loosely: as long as the head and dependent of an arc are pending in the stack or buffer, the arc is considered reachable, so the reward (cost) of shift is always zero. However, in the HYBRID system, an arc between two tokens in the stack can be unreachable (e.g., ⟨s_0, s_1⟩), so the cost of shift in the EDO can be non-zero.
Note that both oracles punish each incorrect attachment exactly once, and the different definitions of reachability only affect the time when an incorrect attachment is punished, namely when the correct attachment is deemed unreachable. Generally, the ADO's reward function delays the punishment for many actions, and dealing with delayed reward signal is exactly the motivation of RL algorithms (Sutton and Barto, 1998).
The DQN model in the ADO compensates for the lack of prior knowledge in the definitions of reachability by estimating not only the immediate reward of an action but also the discounted future rewards. Take the HYBRID system as an example. Although the immediate reward of a shift is zero, the ADO can learn a more accurate cost in its value estimation if the action eventually causes an unreachable arc. Moreover, in a system where the exact reachability is difficult to determine, the ADO estimates the expected reachability based on the training data.
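As a small worked example (with illustrative numbers, not taken from the experiments): suppose a shift in HYBRID receives an immediate reward of 0 but makes a gold arc unreachable, and the corresponding reward of -1 only arrives two steps later; with a discount factor of γ = 0.9, the estimated value of the shift still absorbs the delayed penalty:

Q(s, \text{shift}) \approx 0 + \gamma \cdot 0 + \gamma^{2} \cdot (-1) = -0.81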
We then empirically compare the behavior of the ADO with that of the EDO: we use a parser trained with the static oracle to parse the development set of a treebank, and for each state along the transition sequence produced by the parser we consult both the ADO and the EDO. Since the EDO gives a set of optimal actions, we check whether the ADO's action is in the set.
On average, the ADO differs from the EDO (i.e., makes suboptimal actions) in only 0.53% of all cases. Among the states where the ADO makes suboptimal actions, more than 90% have the pattern shown in Figure 3, where the gold head of s_1 is s_0 but it is already impossible to make the correct attachment, therefore the correct action is a left-arc to ensure that s_0 is attached correctly. However, the ADO does not realize that s_1 is already lost and estimates that a left-arc attachment would incur a negative reward, and is thus inclined to make a "harmless" shift, which would actually cause another lost token, s_0, in the future. This type of mistake happens about 30% of the time when this pattern occurs, and further investigation is needed to eliminate it.

Conclusion
In this paper, we propose to train efficient approximate dynamic oracles with reinforcement learning methods. We tackle the problem of non-decomposable structure loss by letting the oracle learn the action loss from incremental immediate rewards, and the oracle acts as a proxy for the structure loss to train the parser. We demonstrate that training with a single treebank-universal ADO generally improves parsing performance over training with static oracles in several transition systems, and we also show the ADO's comparable performance to an exact dynamic oracle. Furthermore, the general idea in this work could be extended to other structured prediction tasks such as graph parsing, by training a better-informed oracle to transform structure costs into action costs, which gives the learning agent a more accurate objective while staying in the realm of imitation learning to ensure training efficiency.

Appendix A: Transition Systems

Table 2 provides a unified view of the actions in the four transition systems: shift and right are shared by all four systems; left is shared by all but the HYBRID system, which uses left-hybrid instead; left-2 and right-2 are defined only in the ATTARDI system; and swap is defined only in the SWAP system. For all systems, the initial states are identical: the stack contains only the root, the buffer contains all other tokens, and the set of arcs is empty. The terminal states are also identical: the stack contains only the root, the buffer is empty, and the set of arcs forms the created dependency tree.

Appendix C: Full Results

The column ADO* indicates the parsers trained on both projective and non-projective trees. The average is calculated over all 55 test sets.