Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set

We first present a minimal feature set for transition-based dependency parsing, continuing a recent trend started by Kiperwasser and Goldberg (2016a) and Cross and Huang (2016a) of using bi-directional LSTM features. We plug our minimal feature set into the dynamic-programming framework of Huang and Sagae (2010) and Kuhlmann et al. (2011) to produce the first implementation of worst-case O(n^3) exact decoders for arc-hybrid and arc-eager transition systems. With our minimal features, we also present O(n^3) global training methods. Finally, using ensembles including our new parsers, we achieve the best unlabeled attachment score reported (to our knowledge) on the Chinese Treebank and the “second-best-in-class” result on the English Penn Treebank.


Introduction
It used to be the case that the most accurate dependency parsers made global decisions and employed exact decoding. But transition-based dependency parsers (TBDPs) have recently achieved state-of-the-art performance, even though, for efficiency reasons, they are usually trained to make local rather than global decisions and are decoded approximately rather than exactly (Weiss et al., 2015; Dyer et al., 2015; Andor et al., 2016). The key efficiency issue for decoding is as follows. To make accurate (local) attachment decisions, TBDPs have historically required a large set of features that access rich information about particular positions in the stack and buffer of the current parser configuration. Polynomial-time exact-decoding algorithms do exist, having been introduced by Huang and Sagae (2010) and Kuhlmann et al. (2011), but consulting many positions makes them prohibitively costly in practice, since the number of positions considered can factor into the exponent of the running time. For instance, Huang and Sagae employ a fairly reduced set of nine positions, but the worst-case running time for the exact-decoding version of their algorithm is $O(n^6)$ (originally reported as $O(n^7)$) for a length-$n$ sentence. As an extreme case, Dyer et al. (2015) use an LSTM to summarize arbitrary information on the stack, which completely rules out dynamic programming.
Recently, Kiperwasser and Goldberg (2016a) and Cross and Huang (2016a) applied bi-directional long short-term memory networks (Graves and Schmidhuber, 2005, bi-LSTMs) to derive feature representations for parsing, because these networks capture wide-window contextual information well. Collectively, these two sets of authors demonstrated that with bi-LSTMs, four positional features suffice for the arc-hybrid parsing system (K&G), and three suffice for arc-standard (C&H). Inspired by their work, we arrive at a minimal feature set for arc-hybrid and arc-eager: it contains only two positional bi-LSTM vectors, suffers almost no loss in performance in comparison to larger sets, and outperforms a single position. (Details regarding the situation with arc-standard can be found in §2.) Our minimal feature set plugs into Huang and Sagae's and Kuhlmann et al.'s dynamic programming framework to produce the first implementation of $O(n^3)$ exact decoders for arc-hybrid and arc-eager parsers. We also enable and implement $O(n^3)$ global training methods. Empirically, ensembles containing our minimal-feature, globally-trained and exactly-decoded models produce the best unlabeled attachment score (UAS) reported (to our knowledge) on the Chinese Treebank and the "second-best-in-class" result on the English Penn Treebank. [2] Additionally, we provide a slight update to the theoretical connections previously drawn by Gómez-Rodríguez et al. (2008, 2011) between TBDPs and the graph-based dependency parsing algorithms of Eisner (1996) and Eisner and Satta (1999), including results regarding the arc-eager parsing system.

A Minimal Feature Set
TBDPs incrementally process a sentence by making transitions through search states representing parser configurations. Three of the main transition systems in use today (formal introduction in §3.1) all maintain the following two data structures in their configurations: (1) a stack of partially parsed subtrees and (2) a buffer (mostly) of unprocessed sentence tokens.
To featurize configurations for use in a scoring function, it is common to have features that extract information about the first several elements on the stack and the buffer, such as their word forms and part-of-speech (POS) tags. We refer to these as positional features, as each feature relates to a particular position in the stack or buffer. Typically, millions of sparse indicator features (often developed via manual engineering) are used.
In contrast, Chen and Manning (2014) introduce a feature set consisting of dense word-, POS-, and dependency-label embeddings. While dense, these features are for the same 18 positions that have typically been used in prior work. Recently, Kiperwasser and Goldberg (2016a) and Cross and Huang (2016a) adopted bi-directional LSTMs, which have nice expressiveness and context-sensitivity properties, to reduce the number of positions considered down to four and three, for different transition systems, respectively. This naturally raises the question: what is the lower limit on the number of positional features necessary for a parser to perform well? Kiperwasser and Goldberg (2016a) reason that for the arc-hybrid system, the first and second items on the stack and the first buffer item, denoted by $s_0$, $s_1$, and $b_0$, respectively, are required; they additionally include the third stack item, $s_2$, because it may not be adjacent to the others in the original sentence. For arc-standard, Cross and Huang (2016a) argue for the necessity of $s_0$, $s_1$, and $b_0$. We address the lower-limit question empirically, and find that, surprisingly, two positions suffice for the greedy arc-eager and arc-hybrid parsers. We also provide empirical support for Cross and Huang's argument for the necessity of three features for arc-standard. In the rest of this section, we explain our experiments, run only on an English development set, that support this conclusion; the results are depicted in Table 1. We later explore the implementation implications in §3-4 and then test-set parsing accuracy in §6.
[2] Our ideas were subsequently adapted to the labeled setting by Shi, Wu, Chen, and Cheng (2017) in their submission to the CoNLL 2017 shared task on Universal Dependencies parsing. Their team achieved the second-highest labeled attachment score in general and had the top average performance on the surprise languages.
We employ the same model architecture as Kiperwasser and Goldberg (2016a). Specifically, we first use a bi-LSTM to encode an $n$-token sentence, treated as a sequence of per-token concatenations of word- and POS-tag embeddings, into a sequence of vectors $[\overleftrightarrow{w}_1, \ldots, \overleftrightarrow{w}_n]$, where each $\overleftrightarrow{w}_i$ is the output of the bi-LSTM at time step $i$. (The double-arrow notation for these vectors emphasizes the bi-directionality of their origin.) Then, for a given parser configuration, stack positions are represented by $\overleftrightarrow{s}_j$, defined as $\overleftrightarrow{w}_{i(s_j)}$, where $i(s_j)$ gives the position in the sentence of the token that is the head of the tree in $s_j$. Similarly, buffer positions are represented by $\overleftrightarrow{b}_j$, defined as $\overleftrightarrow{w}_{i(b_j)}$ for the token at buffer position $j$. Finally, as in Chen and Manning (2014), we use a multi-layer perceptron to score possible transitions from the given configuration, where the input is the concatenation of some selection of the $\overleftrightarrow{s}_j$s and $\overleftrightarrow{b}_k$s. We use greedy decoders, and train the models with dynamic oracles (Goldberg and Nivre, 2013).
Table 1 reports the parsing accuracy that results for feature sets of size four, three, two, and one for three commonly-used transition systems. The data is the development section of the English Penn Treebank (PTB), and experimental settings are as described in our other experimental section, §6. We see that we can go down to three or, in the arc-hybrid and arc-eager transition systems, even two positions with very little loss in performance, but not further. We therefore call $\{\overleftrightarrow{s}_0, \overleftrightarrow{b}_0\}$ our minimal feature set with respect to arc-hybrid and arc-eager, and empirically confirm that Cross and Huang's $\{\overleftrightarrow{s}_0, \overleftrightarrow{s}_1, \overleftrightarrow{b}_0\}$ is minimal for arc-standard; see Table 1 for a summary. [3]
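As a rough illustration of the scoring step just described, the sketch below scores transitions from a two-position feature set with NumPy. It is not the authors' implementation: random vectors stand in for the bi-LSTM outputs, and the names `W`, `score_transitions`, and all sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden, n_transitions = 6, 8, 16, 3   # toy sizes, not the paper's settings

# Stand-in for the bi-LSTM outputs: row i is the vector for token i.
W = rng.normal(size=(n, d))

# One-hidden-layer MLP with tanh activation (random weights, for illustration).
W1 = rng.normal(size=(hidden, 2 * d))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_transitions, hidden))
b2 = np.zeros(n_transitions)

def score_transitions(s0_idx, b0_idx):
    """Score every transition from a configuration whose stack-top subtree is
    headed by token s0_idx and whose buffer front is token b0_idx."""
    x = np.concatenate([W[s0_idx], W[b0_idx]])   # the two positional features
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

scores = score_transitions(2, 3)   # one score per transition
```

Because the input depends only on the pair of token indices, there are at most $n^2$ distinct inputs per sentence, which is what later makes tabulation feasible.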

Dynamic Programming for TBDPs
As stated in the introduction, our minimal feature set from §2 plugs into Huang and Sagae and Kuhlmann et al.'s dynamic programming (DP) framework. To help explain the connection, this section provides an overview of the DP framework. We draw heavily from the presentation of Kuhlmann et al. (2011).

Three Transition Systems
Transition-based parsing (Nivre, 2008; Kübler et al., 2009) is an incremental parsing framework based on transitions between parser configurations. For a sentence to be parsed, the system starts from a corresponding initial configuration, and attempts to sequentially apply transitions until a configuration corresponding to a full parse is produced. Formally, a transition system is defined as $S = (C, T, c_s, C_\tau)$, where $C$ is a nonempty set of configurations, each $t \in T : C \rightharpoonup C$ is a transition function between configurations, $c_s$ is an initialization function that maps an input sentence to an initial configuration, and $C_\tau \subseteq C$ is a set of terminal configurations.
[3] We tentatively conjecture that the following might explain the observed phenomena, but stress that we don't currently see a concrete way to test the following hypothesis. With $\{\overleftrightarrow{s}_0, \overleftrightarrow{b}_0\}$, in the arc-standard case, situations can arise where there are multiple possible transitions with missing information. In contrast, in the arc-hybrid case, there is only one possible transition with missing information (namely, $\mathrm{re}_\curvearrowright$, introduced in §3.1); perhaps $\overleftrightarrow{s}_1$ is therefore not so crucial for arc-hybrid in practice?
All systems we consider share a common tripartite representation for configurations: when we write $c = (\sigma, \beta, A)$ for some $c \in C$, we are referring to a stack $\sigma$ of partially parsed subtrees; a buffer $\beta$ of unprocessed tokens and, optionally, at its beginning, a subtree with only left descendants; and a set $A$ of elements $(h, m)$, each of which is an attachment (dependency arc) with head $h$ and modifier $m$. We write $m \curvearrowleft h$ to indicate that $m$ left-modifies $h$, and $h \curvearrowright m$ to indicate that $m$ right-modifies $h$. For a sentence $w = w_1, \ldots, w_n$, the initial configuration is $(\sigma_0, \beta_0, A_0)$, where $\sigma_0$ and $A_0$ are empty and $\beta_0 = [\mathrm{ROOT} \mid w_1, \ldots, w_n]$; ROOT is a special node denoting the root of the parse tree (vertical bars are a notational convenience for indicating different parts of the buffer or stack; our convention is to depict the buffer first element leftmost, and to depict the stack first element rightmost). All terminal configurations have an empty buffer and a stack containing only ROOT.

Arc-Standard
The arc-standard system (Nivre, 2004) is motivated by bottom-up parsing: each dependent has to be complete before being attached. The three transitions, shift (sh, move a token from the buffer to the stack), right-reduce ($\mathrm{re}_\curvearrowright$, reduce and attach a right modifier), and left-reduce ($\mathrm{re}_\curvearrowleft$, reduce and attach a left modifier), are defined as:
$$
\begin{aligned}
\mathrm{sh}[(\sigma, b_0 | \beta, A)] &= (\sigma | b_0, \beta, A) \\
\mathrm{re}_\curvearrowright[(\sigma | s_1 | s_0, \beta, A)] &= (\sigma | s_1, \beta, A \cup \{(s_1, s_0)\}) \\
\mathrm{re}_\curvearrowleft[(\sigma | s_1 | s_0, \beta, A)] &= (\sigma | s_0, \beta, A \cup \{(s_0, s_1)\})
\end{aligned}
$$

Arc-Hybrid
The arc-hybrid system (Yamada and Matsumoto, 2003; Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011) has the same definitions of sh and $\mathrm{re}_\curvearrowright$ as arc-standard, but forces the collection of left modifiers before right modifiers via its $b_0$-modifier $\mathrm{re}_\curvearrowleft$ transition. This contrasts with arc-standard, where the attachment of left and right modifiers can be interleaved on the stack.
$$
\begin{aligned}
\mathrm{sh}[(\sigma, b_0 | \beta, A)] &= (\sigma | b_0, \beta, A) \\
\mathrm{re}_\curvearrowright[(\sigma | s_1 | s_0, \beta, A)] &= (\sigma | s_1, \beta, A \cup \{(s_1, s_0)\})
\end{aligned}
$$

Arc-Eager
In contrast to the former two systems, the arc-eager system (Nivre, 2003) makes attachments as early as possible, even if a modifier has not yet received all of its own modifiers. This behavior is accomplished by decomposing the right-reduce transition into two independent transitions, one making the attachment (ra) and one reducing the right-attached child (re).
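To make the stack and buffer bookkeeping concrete, here is a minimal Python sketch of arc-hybrid transitions acting on a configuration $(\sigma, \beta, A)$. The function names and the list-based representation are illustrative assumptions. The left-reduce shown is the commonly used arc-hybrid rule, in which $s_0$ is popped and becomes a left modifier of $b_0$; it is supplied here as an assumption consistent with the "$b_0$-modifier" description above.

```python
from typing import List, Set, Tuple

# A configuration is (stack, buffer, arcs); tokens are integers and 0 is ROOT.
# The stack top is the LAST list element; the buffer front is the FIRST.
Config = Tuple[List[int], List[int], Set[Tuple[int, int]]]

def initial(n: int) -> Config:
    """Empty stack, buffer [ROOT, w_1, ..., w_n], no arcs."""
    return [], list(range(n + 1)), set()

def shift(c: Config) -> Config:
    """Move the buffer front onto the stack."""
    stack, buf, arcs = c
    return stack + [buf[0]], buf[1:], arcs

def right_reduce(c: Config) -> Config:
    """Pop s_0 and attach it as a right modifier of s_1."""
    stack, buf, arcs = c
    s1, s0 = stack[-2], stack[-1]
    return stack[:-1], buf, arcs | {(s1, s0)}

def left_reduce_hybrid(c: Config) -> Config:
    """Pop s_0 and attach it as a left modifier of b_0 (assumed rule)."""
    stack, buf, arcs = c
    return stack[:-1], buf, arcs | {(buf[0], stack[-1])}

# Toy sentence "She eats fish": eats (2) heads She (1) and fish (3),
# and ROOT (0) heads eats. Arcs are stored as (head, modifier).
c = initial(3)
for step in (shift, shift, left_reduce_hybrid, shift, shift,
             right_reduce, right_reduce):
    c = step(c)
final_stack, final_buffer, final_arcs = c
```

The run ends in a terminal configuration: the buffer is empty and the stack holds only ROOT.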

Deduction and Dynamic Programming
Kuhlmann et al. (2011) reformulate the three transition systems just discussed as deduction systems (Pereira and Warren, 1983; Shieber et al., 1995), wherein transitions serve as inference rules; these are given as the left-hand sides of the first three subfigures in Figure 1. For a given $w = w_1, \ldots, w_n$, assertions take the form $[i, j, k]$ (or, when applicable, a two-index shorthand to be discussed soon), meaning that there exists a sequence of transitions that, starting from a configuration wherein $\mathrm{head}(s_0) = w_i$, results in an ending configuration wherein $\mathrm{head}(s_0) = w_j$ and $\mathrm{head}(b_0) = w_k$. If we define $w_0$ as ROOT and $w_{n+1}$ as an end-of-sentence marker, then the goal theorem can be stated as $[0, 0, n+1]$.
For arc-standard, we depict an assertion $[i, h, k]$ as a subtree whose root (head) is the token at $h$. Assertions of the form $[i, i, k]$ play an important role for arc-hybrid and arc-eager, and we employ the special shorthand $[i, k]$ for them in Figure 1. In that figure, we also graphically depict such situations as two consecutive half-trees with roots $w_i$ and $w_k$, where all tokens between $i$ and $k$ are already attached. The superscript $b$ in an arc-eager assertion $[i^b, j]$ is an indicator variable for whether $w_i$ has been attached to its head ($b = 1$) or not ($b = 0$) after the transition sequence is applied.
Kuhlmann et al. (2011) show that all three deduction systems can be directly "tabularized" and dynamic programming (DP) can be applied, such that, ignoring for the moment the issue of incorporating complex features (we return to this later), time and space needs are low-order polynomial. Specifically, as the two-index shorthand $[i, j]$ suggests, arc-eager and arc-hybrid systems can be implemented to take $O(n^2)$ space and $O(n^3)$ time; the arc-standard system requires $O(n^3)$ space and $O(n^4)$ time (if one applies the so-called hook trick (Eisner and Satta, 1999)).
Since an $O(n^4)$ running time is not sufficiently practical even in the simple-feature case, in the remainder of this paper we consider only the arc-hybrid and arc-eager systems, not arc-standard.

Practical Optimal Algorithms Enabled By Our Minimal Feature Set
Until now, no one had suggested a set of positional features that was both information-rich enough for accurate parsing and small enough to obtain the $O(n^3)$ running time promised above. Fortunately, our bi-LSTM-based $\{\overleftrightarrow{s}_0, \overleftrightarrow{b}_0\}$ feature set qualifies, and enables the fast optimal procedures described in this section.

Exact Decoding
Given an input sentence, a TBDP must choose among a potentially exponential number of corresponding transition sequences. We assume access to functions $f_t$ that score individual configurations, where these functions are indexed by the transition functions $t \in T$. For a fixed transition sequence $\mathbf{t} = t_1, t_2, \ldots$, we use $c_i$ to denote the configuration that results after applying $t_i$.
Typically, for efficiency reasons, greedy left-to-right decoding is employed: the next transition $t_i$ out of $c_{i-1}$ is $\arg\max_t f_t(c_{i-1})$, so that past and future decisions are not taken into account. The score $F(\mathbf{t})$ for the transition sequence is induced by summing the relevant $f_{t_i}(c_{i-1})$ values.
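A minimal sketch of greedy decoding follows, using the arc-hybrid system for concreteness. It assumes a precomputed score table `f[t, s0, b0]` indexed by transition and by the tokens at $s_0$ and $b_0$; the table layout, the legality checks, and the function name are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

SH, RE_LEFT, RE_RIGHT = 0, 1, 2   # row indices into the score table

def greedy_arc_hybrid(f):
    """Greedy left-to-right decoding for arc-hybrid.

    f[t, a, b]: score of transition t when s_0 = w_a and b_0 = w_b, where
    token 0 is ROOT, 1..n are words, and n+1 is an end-of-sentence marker.
    Returns (summed score F, set of (head, modifier) arcs).
    """
    n = f.shape[1] - 2
    stack, b, arcs, total = [0], 1, set(), 0.0   # ROOT already on the stack
    while not (b == n + 1 and stack == [0]):
        s0 = stack[-1]
        legal = []
        if b <= n:
            legal.append(SH)
            if s0 != 0:                 # ROOT is never given a head
                legal.append(RE_LEFT)
        if len(stack) >= 2:
            legal.append(RE_RIGHT)
        t = max(legal, key=lambda a: f[a, s0, b])   # local argmax only
        total += f[t, s0, b]
        if t == SH:
            stack.append(b)
            b += 1
        elif t == RE_LEFT:
            arcs.add((b, s0))           # b_0 heads s_0
            stack.pop()
        else:
            arcs.add((stack[-2], s0))   # s_1 heads s_0
            stack.pop()
    return total, arcs

f = np.random.default_rng(0).normal(size=(3, 7, 7))   # toy scores, n = 5
total, arcs = greedy_arc_hybrid(f)
```

Each step commits to the locally best transition, so the returned score is a lower bound on the best achievable $F(\mathbf{t})$ for the same table.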
However, our use of minimal feature sets enables direct computation of an argmax over the entire space of transition sequences, $\arg\max_{\mathbf{t}} F(\mathbf{t})$, via dynamic programming, because our positions don't rely on any information "outside" the deduction rule indices, thus eliminating the need for additional state-keeping. We show how to integrate the scoring functions for the arc-eager system; the arc-hybrid system is handled similarly. The score-annotated rules are as follows:
[Score-annotated deduction rules for the arc-eager system]
In these rules, we abuse notation by referring to configurations by their features. The left-reduce rule says that we can first take the sequence of transitions asserted by $[k^b, i]$, which has a score of $v_1$, and then a shift transition moving $w_i$ from $b_0$ to $s_0$. This means that the initial condition for $[i^0, j]$ is met, so we can take the sequence of transitions asserted by $[i^0, j]$ (say it has score $v_2$) and finally a left-reduce transition to finish composing the larger transition sequence. Notice that the scores for sh and ra are 0, as the scoring of these transitions is accounted for by reduce rules elsewhere in the sequence.
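The composition described above can be turned into a table-filling recurrence. The sketch below does this for the arc-hybrid system, which needs no attachment flag and is therefore shorter; it is an illustration under stated assumptions, not the authors' Cython decoder. It assumes a score table `f[t, s0, b0]`, and assumes, by analogy with the left-reduce description, that combining $[k, i]$ and $[i, j]$ adds the score of the shift that pushed $w_i$ (taken with $s_0 = w_k$, $b_0 = w_i$) plus the score of the reduce (taken with $s_0 = w_i$, $b_0 = w_j$).

```python
import numpy as np

SH, RE_LEFT, RE_RIGHT = 0, 1, 2   # row indices into the score table

def exact_decode_arc_hybrid(f):
    """Maximize the summed transition scores over all arc-hybrid sequences.

    f[t, a, b]: score of transition t when s_0 = w_a and b_0 = w_b, where
    token 0 is ROOT, 1..n are words, and n+1 is an end-of-sentence marker.
    Returns (best score, set of (head, modifier) arcs).
    """
    n = f.shape[1] - 2
    best = np.full((n + 2, n + 2), -np.inf)
    back = {}
    for i in range(n + 1):
        best[i, i + 1] = 0.0   # item [i, i+1]: w_i was just shifted, score 0
    for width in range(2, n + 2):           # three nested loops: O(n^3) time
        for k in range(n + 2 - width):
            j = k + width
            for i in range(k + 1, j):
                # Combine [k, i] and [i, j]; charge the shift that pushed w_i.
                base = best[k, i] + best[i, j] + f[SH, k, i]
                options = [(base + f[RE_RIGHT, i, j], (k, i))]  # s_1 = w_k heads w_i
                if j <= n:   # left-reduce needs a real token at b_0
                    options.append((base + f[RE_LEFT, i, j], (j, i)))
                for score, arc in options:
                    if score > best[k, j]:
                        best[k, j] = score
                        back[k, j] = (i, arc)
    arcs = set()

    def collect(k, j):
        """Follow back-pointers to recover the arcs of the best derivation."""
        if j > k + 1:
            i, arc = back[k, j]
            arcs.add(arc)
            collect(k, i)
            collect(i, j)

    collect(0, n + 1)   # goal item [0, n+1]
    return best[0, n + 1], arcs

f_demo = np.random.default_rng(0).normal(size=(3, 6, 6))   # toy scores, n = 4
best_score, best_arcs = exact_decode_arc_hybrid(f_demo)
```

The table has $O(n^2)$ entries and each entry considers $O(n)$ split points, matching the $O(n^2)$ space and $O(n^3)$ time stated for the two-index items.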

Global Training
We employ large-margin training that considers each transition sequence globally. Formally, for a training sentence $w = w_1, \ldots, w_n$ with gold transition sequence $\mathbf{t}^{\mathrm{gold}}$, our loss function is
$$
\max_{\mathbf{t}} \Big( F(\mathbf{t}) + \mathrm{cost}(\mathbf{t}^{\mathrm{gold}}, \mathbf{t}) - F(\mathbf{t}^{\mathrm{gold}}) \Big),
$$
where $\mathrm{cost}(\mathbf{t}^{\mathrm{gold}}, \mathbf{t})$ is a custom margin for taking $\mathbf{t}$ instead of $\mathbf{t}^{\mathrm{gold}}$: specifically, the number of mis-attached nodes. Computing this max can again be done efficiently with a slight modification to the scoring of reduce transitions: the score increment $\Delta$ that a reduce rule adds is replaced by $\Delta'$, where $\Delta' = \Delta + \mathbf{1}(\mathrm{head}(w_i) \neq w_j)$. This loss-augmented inference or cost-augmented decoding (Taskar et al., 2005; Smith, 2011) technique has previously been applied to graph-based parsing by Kiperwasser and Goldberg (2016a).
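As a small illustration of the indicator term, the sketch below folds $\mathbf{1}(\mathrm{head}(w_i) \neq w_j)$ into a table of left-reduce scores and computes the hinge loss from two already-computed scores. The table layout `f_left[i, j]` and both function names are assumptions for this example; the decoder that would produce the cost-augmented maximum is not shown.

```python
import numpy as np

def augment_left_reduce(f_left, gold_head):
    """Add the indicator 1(head(w_i) != w_j) to every left-reduce score.

    f_left[i, j]: score of a left-reduce taken with s_0 = w_i and b_0 = w_j,
    which attaches w_i to head w_j. gold_head[i]: index of w_i's gold head.
    """
    j = np.arange(f_left.shape[1])
    wrong = gold_head[:, None] != j[None, :]   # wrong[i, j]: gold head of i is not j
    return f_left + wrong

def hinge_loss(best_augmented_score, gold_score):
    """max_t (F(t) + cost(t_gold, t)) - F(t_gold), given the augmented max."""
    return best_augmented_score - gold_score

# Toy gold tree over ROOT (0) and three words: 2 heads 1 and 3, ROOT heads 2.
# Entry 0 is a placeholder because ROOT has no head.
gold_head = np.array([0, 2, 0, 2])
f_aug = augment_left_reduce(np.zeros((4, 4)), gold_head)
```

Correct attachments keep their original score, while every incorrect one gains a margin of 1, so a decoder run on the augmented table maximizes score plus cost in a single pass.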
Efficiency Note. The computation decomposes into two parts: scoring all feature combinations, and using DP to find a proof for the goal theorem in the deduction system. Time-complexity analysis is usually given in terms of the latter, but the former might have a large constant factor, such as $10^4$ or worse for neural-network-based scoring functions. As a result, in practice, with a small $n$, scoring with the feature set $\{\overleftrightarrow{s}_0, \overleftrightarrow{b}_0\}$ ($O(n^2)$) can be as time-consuming as the decoding steps ($O(n^3)$) for the arc-hybrid and arc-eager systems.
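One way to keep the $O(n^2)$ scoring part cheap is to score all $(s_0, b_0)$ pairs in a single batched computation. The sketch below is an implementation idea, not something described in the paper: it assumes a one-hidden-layer tanh MLP over the concatenated pair and splits the first-layer weights so that each token is projected only once. All names and sizes are hypothetical.

```python
import numpy as np

def score_all_pairs(W, W1, b1, W2, b2):
    """Return f[t, a, b], the score of transition t for s_0 = w_a, b_0 = w_b.

    W: (n, d) token vectors; W1: (hidden, 2d); W2: (n_transitions, hidden).
    Since W1 @ [s; b] = W1[:, :d] @ s + W1[:, d:] @ b, each token is
    projected once per role and the pairwise sum is formed by broadcasting.
    """
    d = W.shape[1]
    from_s0 = W @ W1[:, :d].T                  # (n, hidden)
    from_b0 = W @ W1[:, d:].T                  # (n, hidden)
    hidden = np.tanh(from_s0[:, None, :] + from_b0[None, :, :] + b1)  # (n, n, hidden)
    return np.moveaxis(hidden @ W2.T + b2, -1, 0)                     # (T, n, n)

rng = np.random.default_rng(0)
n, d, h, T = 5, 4, 6, 3                        # toy sizes
W = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(h, 2 * d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(T, h)), rng.normal(size=T)
f = score_all_pairs(W, W1, b1, W2, b2)
```

The resulting table can then be consulted in constant time inside the DP loops, so the neural network is evaluated $O(n^2)$ times rather than once per deduction step.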

Theoretical Connections
Our minimal feature set brings implementation of practical optimal algorithms to TBDPs, whereas previously only graph-based dependency parsers (GBDPs), a radically different, non-incremental paradigm, enjoyed the ability to deploy them. Interestingly, for both the transition- and graph-based paradigms, the optimal algorithms build dependency trees bottom-up from local structures. It is thus natural to wonder if there are deeper, more formal connections between the two.
In previous work, Kuhlmann et al. (2011) related the arc-standard system to the classic CKY algorithm (Cocke, 1969; Kasami, 1965; Younger, 1967) in a manner clearly suggested by Figure 1a; CKY can be viewed as a very simple graph-based approach. Gómez-Rodríguez et al. (2008, 2011) formally prove that sequences of steps in the edge-factored GBDP (Eisner, 1996) can be used to emulate any individual step in the arc-hybrid system (Yamada and Matsumoto, 2003) and the Eisner and Satta (1999, Figure 1d) version. However, they did not draw an explicitly direct connection between Eisner and Satta (1999) and TBDPs.
Here, we provide an update to these previous findings, stated in terms of the expressiveness of scoring functions, considered as parameterization.
For the edge-factored GBDP, we write the score for an edge as $f_G(\overleftrightarrow{h}, \overleftrightarrow{m})$, where $h$ is the head and $m$ the modifier. A tree's score is the sum of its edge scores. We say that a parameterized dependency parsing model A contains model B if for every instance of parameterization in model B, there exists an instance of model A such that the two models assign the same score to every parse tree. We claim:
Lemma 1. The arc-eager model presented in §4.1 contains the edge-factored model.
Proof Sketch. Consider a given edge-factored GBDP parameterized by $f_G$. For any parse tree, every edge $i \curvearrowleft j$ involves two deduction rules, and their contribution to the score of the final proof is [...]. The parameterization we arrive at emulates exactly the scoring model of $f_G$.
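For concreteness, one parameterization consistent with this argument can be written out as follows. It assumes that a left-reduce combining $[k^b, i]$ and $[i^0, j]$ contributes $f_{\mathrm{sh}}(w_k, w_i) + f_{\mathrm{re}_\curvearrowleft}(w_i, w_j)$, and that a reduce combining $[k^b, i]$ and $[i^1, j]$ contributes $f_{\mathrm{ra}}(w_k, w_i) + f_{\mathrm{re}}(w_i, w_j)$. The second contribution and the specific assignments below are illustrative assumptions, not statements recovered from the text.

```latex
\begin{aligned}
f_{\mathrm{sh}}(w_k, w_i) &= 0, &
f_{\mathrm{re}_\curvearrowleft}(w_i, w_j) &= f_G(\overleftrightarrow{w}_j, \overleftrightarrow{w}_i), \\
f_{\mathrm{ra}}(w_k, w_i) &= f_G(\overleftrightarrow{w}_k, \overleftrightarrow{w}_i), &
f_{\mathrm{re}}(w_i, w_j) &= 0.
\end{aligned}
```

Under these assumptions, each left-modifier edge would be paid for once by its left-reduce and each right-modifier edge once by its right-attach, so the score of a complete proof would equal the sum of $f_G$ over the tree's edges.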
We further claim that the arc-eager model is more expressive than not only the edge-factored GBDP, but also the arc-hybrid model in our paper.
Lemma 2. The arc-eager model contains the arc-hybrid model.

Experiments
Data and Evaluation. We experimented with English and Chinese. For English, we used the Stanford Dependencies (de Marneffe and Manning, 2008) conversion (via the Stanford parser 3.3.0) of the Penn Treebank (Marcus et al., 1993, PTB). As is standard, we used §2-21 of the Wall Street Journal for training, §22 for development, and §23 for testing; POS tags were predicted using 10-way jackknifing with the Stanford max entropy tagger (Toutanova et al., 2003). For Chinese, we used the Penn Chinese Treebank 5.1 (Xue et al., 2002, CTB), with the same splits and head-finding rules for conversion to dependencies as Zhang and Clark (2008). We adopted the CTB's gold-standard tokenization and POS tags. We report unlabeled attachment score (UAS) and sentence-level unlabeled exact match (UEM). Following prior work, all punctuation is excluded from evaluation. For each model, we initialized the network parameters with 5 different random seeds and report performance average and standard deviation.
Implementation Details. Our model structures reproduce those of Kiperwasser and Goldberg (2016a). We use 2-layer bi-directional LSTMs with 256 hidden cell units. Inputs are concatenations of 28-dimensional randomly-initialized part-of-speech embeddings and 100-dimensional word vectors initialized from GloVe vectors (Pennington et al., 2014) (English) and pre-trained skip-gram-model vectors (Mikolov et al., 2013) (Chinese). The concatenation of the bi-LSTM feature vectors is passed through a multi-layer perceptron (MLP) with 1 hidden layer which has 256 hidden units and activation function tanh. We set the dropout rate for the bi-LSTM (Gal and Ghahramani, 2016) and MLP (Srivastava et al., 2014) for each model according to development-set performance. All parameters except the word embeddings are initialized uniformly (Glorot and Bengio, 2010). Approximately 1,000 tokens form a mini-batch for sub-gradient computation.
We train each model for 20 epochs and perform model selection based on development UAS. The proposed structured loss function is optimized via Adam (Kingma and Ba, 2015). The neural network computation is based on the Python interface to DyNet (Neubig et al., 2017), and the exact decoding algorithms are implemented in Cython. [7]

Main Results
We implement exact decoders for the arc-hybrid and arc-eager systems, and present the test performance of different model configurations in Table 2, comparing global models with local models. All models use the same decoder for testing as during the training process. Though no global decoder for the arc-standard system has been explored in this paper, its local models are listed for comparison. We also include an edge-factored graph-based model, which is conventionally trained globally. The edge-factored model scores bi-LSTM features for each head-modifier pair; a maximum spanning tree algorithm is used to find the tree with the highest sum of edge scores. For this model, we use Dozat and Manning's (2017) biaffine scoring model, although in our case the model size is smaller.
Analogously to the dev-set results given in §2, on the test data, the minimal feature sets perform as well as larger ones in locally-trained models. And there exists a clear trend of global models outperforming local models for the two different transition systems on both datasets. This illustrates the effectiveness of exact decoding and global training. Of the three types of global models, the arc-eager arguably has the edge, an empirical finding resonating with our theoretical comparison of their model expressiveness.
[7] See https://github.com/tzshi/dp-parser-emnlp17
Comparison with State-of-the-Art Models. Figure 2 compares our algorithms' results with those of the state of the art. Our models are competitive, and an ensemble of 15 globally-trained models (5 models each for arc-eager DP, arc-hybrid DP and edge-factored) achieves 95.33 and 90.22 on PTB and CTB, respectively, reaching the highest reported UAS on the CTB dataset, and the second highest reported on the PTB dataset among dependency-based approaches.

Related Work Not Yet Mentioned
Approximate Optimal Decoding/Training. Besides dynamic programming (Huang and Sagae, 2010; Kuhlmann et al., 2011), various other methods have been proposed for approaching global training and exact decoding. Best-first and A* search (Sagae and Lavie, 2006; Sagae and Tsujii, 2007; Zhao et al., 2013; Thang et al., 2015; Lee et al., 2016) give optimality certificates when solutions are found, but have the same worst-case time complexity as the original search framework. Other common approaches to search a larger space at training or test time include beam search (Zhang and Clark, 2011), dynamic oracles (Goldberg and Nivre, 2012, 2013; Cross and Huang, 2016b) and error states (Vaswani and Sagae, 2016). Beam search records the $k$ best-scoring transition prefixes to delay local hard decisions, while the latter two leverage configurations deviating from the gold transition path during training to better simulate the test-time environment.
Recurrent and recursive neural networks can be used to build representations that encode complete configuration information or the entire parse tree (Le and Zuidema, 2014;Dyer et al., 2015;Kiperwasser and Goldberg, 2016b), but these models cannot be readily combined with DP approaches, because their state spaces cannot be merged into smaller sets and thus remain exponentially large.

Concluding Remarks
In this paper, we have shown the following.
• The bi-LSTM-powered feature set $\{\overleftrightarrow{s}_0, \overleftrightarrow{b}_0\}$ is minimal yet highly effective for arc-hybrid and arc-eager transition-based parsing.
• Since DP algorithms for exact decoding (Huang and Sagae, 2010; Kuhlmann et al., 2011) have a run-time dependence on the number of positional features, using our mere two effective positional features results in a running time of $O(n^3)$, feasible for practice.
• Combining exact decoding with global training (which is also enabled by our minimal feature set) with an ensemble of parsers achieves 90.22 UAS on the Chinese Treebank and 95.33 UAS on the Penn Treebank: these are, to our knowledge, the best and second-best results to date on these data sets among "purely" dependency-based approaches.
There are many directions for further exploration. Two possibilities are to create even better training methods, and to find some way to extend our run-time improvements to other transition systems. It would also be interesting to further investigate relationships between graph-based and transition-based parsing. In §5 we have mentioned important earlier work in this regard, and provided an update to those formal findings.
In our work, we have brought exact decoding, which was formerly the province solely of graph-based parsing, to the transition-based paradigm. We hope that the future will bring more inspiration from an integration of the two perspectives.