Linear-time Constituency Parsing with RNNs and Dynamic Programming

Recently, span-based constituency parsing has achieved competitive accuracies with extremely simple models by using bidirectional RNNs to model “spans”. However, the minimal span parser of Stern et al. (2017a), which holds the current state-of-the-art accuracy, is a chart parser running in cubic time, O(n^3), which is too slow for longer sentences and for applications beyond sentence boundaries such as end-to-end discourse parsing and joint sentence boundary detection and parsing. We propose a linear-time constituency parser based on RNNs and dynamic programming, using a graph-structured stack and beam search, which runs in O(n b^2) time where b is the beam size. We further speed this up to O(n b log b) by integrating cube pruning. Compared with chart parsing baselines, this linear-time parser is substantially faster for long sentences on the Penn Treebank and orders of magnitude faster for discourse parsing, and achieves the highest F1 accuracy on the Penn Treebank among single-model end-to-end systems.


Introduction
Span-based neural constituency parsing (Cross and Huang, 2016; Stern et al., 2017a) has attracted attention due to its high accuracy and extreme simplicity. Compared with other recent neural constituency parsers (Dyer et al., 2016; Liu and Zhang, 2016; Durrett and Klein, 2015) which use neural networks to model tree structures, the span-based framework is considerably simpler, using bidirectional RNNs to model only the input sequence and not the output tree. Because of this factorization, the output space is decomposable, which enables efficient dynamic programming algorithms such as CKY. But existing span-based parsers suffer from a crucial limitation in terms of search: on the one hand, a greedy span parser (Cross and Huang, 2016) is fast (linear-time) but explores only a single path in the exponentially large search space; on the other hand, a chart-based span parser (Stern et al., 2017a) performs exact search and achieves state-of-the-art accuracy, but in cubic time, which is too slow for longer sentences and for applications that go beyond sentence boundaries such as end-to-end discourse parsing (Hernault et al., 2010; Zhao and Huang, 2017) and integrated sentence boundary detection and parsing (Björkelund et al., 2016).
We propose to combine the merits of both greedy and chart-based approaches and design a linear-time span-based neural parser that searches over an exponentially large space. Following Huang and Sagae (2010), we perform left-to-right dynamic programming in an action-synchronous style, with (2n − 1) actions (i.e., steps) for a sentence of n words. While previous non-neural work in this area requires sophisticated features (Huang and Sagae, 2010; Mi and Huang, 2015) and thus incurs high time complexity such as O(n^11), our states are as simple as ℓ: (i, j), where ℓ is the step index and (i, j) is the span, modeled using bidirectional RNNs without any syntactic features. This gives a running time of O(n^4), with the extra O(n) factor due to the step index. We further employ beam search to obtain a practical runtime of O(nb^2), where b is the beam size, at the cost of exactness. However, on the Penn Treebank, most sentences are less than 40 words (n < 40), and even with a small beam size of b = 10, the observed complexity of an O(nb^2) parser is not exactly linear in n (see Experiments). To solve this problem, we apply cube pruning (Chiang, 2007; Huang and Chiang, 2007) to improve the runtime to O(nb log b), which renders an observed complexity that is linear in n (with minor extra inexactness).
We make the following contributions:
• We design the first neural parser that is both linear-time and capable of searching over an exponentially large space.
• We are the first to apply cube pruning to incremental parsing, achieving, for the first time, a complexity of O(nb log b), i.e., linear in sentence length and (almost) linear in beam size. This leads to an observed complexity strictly linear in sentence length n.
• We devise a novel loss function which penalizes wrong spans that cross gold-tree spans, and employ max-violation updates (Huang et al., 2012) to train this parser with a structured SVM and beam search.
• Compared with chart parsing baselines, our parser is substantially faster for long sentences on the Penn Treebank, and orders of magnitude faster for end-to-end discourse parsing. It also achieves the highest F1 score on the Penn Treebank among single-model end-to-end systems.
• We devise a new formulation of the graph-structured stack (Tomita, 1991) which requires no extra bookkeeping, proving a new theorem that gives deep insight into GSS.

Span-Based Shift-Reduce Parsing
A span-based shift-reduce constituency parser (Cross and Huang, 2016) maintains a stack of spans (i, j), and progressively adds a new span each time it takes a shift or reduce action. With (i, j) on top of the stack, the parser can either shift to push the next singleton span (j, j + 1) on the stack, or it can reduce to combine the top two spans, (k, i) and (i, j), forming the larger span (k, j). After each shift/reduce action, the top-most span is labeled as either a constituent or with a null label ∅, which means that the subsequence is not a subtree in the final decoded parse. Parsing initializes with an empty stack and continues until (0, n) is formed, representing the entire sentence.
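The stack discipline above can be sketched in a few lines (a toy simulation for illustration, not the authors' implementation; labeling is omitted):

```python
def parse_spans(n, actions):
    """Simulate span-based shift-reduce parsing for a sentence of n words.

    actions: a sequence of "shift"/"reduce" strings (2n - 1 of them for
    a complete parse). Returns the stack of spans after all actions.
    """
    stack = []          # each entry is a span (i, j)
    next_word = 0       # index of the next word to shift
    for action in actions:
        if action == "shift":
            # push the next singleton span (j, j+1)
            stack.append((next_word, next_word + 1))
            next_word += 1
        else:  # reduce
            # combine the top two spans (k, i) and (i, j) into (k, j)
            i, j = stack.pop()
            k, i2 = stack.pop()
            assert i2 == i, "adjacent spans must share a boundary"
            stack.append((k, j))
    return stack

# A complete parse of a 3-word sentence: 3 shifts and 2 reduces,
# ending with the full-sentence span (0, 3) on the stack.
final = parse_spans(3, ["shift", "shift", "reduce", "shift", "reduce"])
```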
Figure 1: Our shift-reduce deductive system. Here ℓ is the step index, and c and v are the prefix and inside scores. Unlike Huang and Sagae (2010) and Cross and Huang (2016), ξ and σ are not shift/reduce scores; instead, they are the (best) label scores of the resulting span: ξ = max_X s(j, j + 1, X) and σ = max_X s(k, j, X), where X is a nonterminal symbol (could be ∅). Here ℓ′ = ℓ − 2(j − i) + 1.

Bi-LSTM features
To get the feature representation of a span (i, j), we use the output sequence of a bidirectional LSTM (Cross and Huang, 2016; Stern et al., 2017a). The LSTM produces forward outputs f_0, ..., f_n and backward outputs b_n, ..., b_0, and we concatenate the differences (f_j − f_i) and (b_i − b_j) as the representation of span (i, j). This eliminates the need for complex feature engineering, and the LSTM outputs can be precomputed and stored for efficient querying during decoding.
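Concretely, the span representation is just two vector differences over the precomputed LSTM outputs (sketched here with random NumPy arrays standing in for the bi-LSTM outputs):

```python
import numpy as np

def span_features(f, b, i, j):
    """Span representation from bi-LSTM outputs (a sketch).

    f: forward outputs f_0 .. f_n, shape (n+1, d)
    b: backward outputs b_0 .. b_n, shape (n+1, d)
    Span (i, j) is represented as the concatenation
    [f_j - f_i ; b_i - b_j], so no feature engineering is needed.
    """
    return np.concatenate([f[j] - f[i], b[i] - b[j]])

# Stand-ins for the LSTM outputs of a 5-word sentence, dimension 4.
n, d = 5, 4
f = np.random.randn(n + 1, d)
b = np.random.randn(n + 1, d)
feat = span_features(f, b, 1, 4)
```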

Score Decomposition
Like Stern et al. (2017a), we decompose the score of a tree t into the sum of its span scores:

s(t) = Σ_{(i, j, X) ∈ t} s(i, j, X)

Note that X is a nonterminal label, a unary chain (e.g., S-VP), or the null label ∅. In a shift-reduce setting, there are 2n − 1 steps (n shifts and n − 1 reduces), and after each step we take the best label for the resulting span; therefore there are exactly 2n − 1 such labeled spans (i, j, X) in tree t. Also note that the choice of label for any span (i, j) depends only on (i, j) itself (and not on any subtree information); thus the max over labels X is independent of other spans, which is a nice property of span-based parsing (Cross and Huang, 2016; Stern et al., 2017a).
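The decomposition and the per-span label independence can be sketched as follows (the scoring function s is a toy stand-in for the learned span scores):

```python
def tree_score(tree_spans, s):
    """Score of a tree: the sum of its labeled span scores s(i, j, X)."""
    return sum(s(i, j, X) for (i, j, X) in tree_spans)

def best_label(i, j, s, labels):
    """Because the label choice for (i, j) is independent of the rest
    of the tree, decoding simply takes the best label per span."""
    return max(labels, key=lambda X: s(i, j, X))

# Toy score function: favor "NP" on span (0, 2), zero elsewhere.
def s(i, j, X):
    return 1.0 if (i, j, X) == (0, 2, "NP") else 0.0
```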

Graph-Struct. Stack w/o Bookkeeping
We now reformulate the DP parser from the above section as a shift-reduce parser, maintaining a step index ℓ in order to perform action-synchronous beam search (see below). Figure 1 shows how to represent a parsing stack using only the top span (i, j). If the top span (i, j) shifts, it produces (j, j + 1); but if it reduces, it needs to know the second-to-last span on the stack, (k, i), which is not represented in the current state. This problem can be solved by the graph-structured stack (Tomita, 1991; Huang and Sagae, 2010), which maintains, for each state p, a set of predecessor states π(p) that p can combine with on the left. This is the way our actual code works (π(p) is implemented as a list of pointers, or "left pointers"), but for simplicity of presentation we devise a novel but easier-to-understand formulation in Fig. 1, where we explicitly represent the set of predecessor states that state ℓ: (i, j) can combine with as ℓ′: (k, i), where ℓ′ = ℓ − 2(j − i) + 1; i.e., (i, j) at step ℓ can combine with any (k, i) at step ℓ′. The rationale behind this new formulation is the following theorem:

Theorem 1 The predecessor states π(ℓ: (i, j)) are all at the same step ℓ′ = ℓ − 2(j − i) + 1.

Proof. By induction.
This theorem brings new and deep insight and suggests an alternative implementation that does not require any extra bookkeeping. The time complexity of this algorithm is O(n^4), with the extra O(n) factor due to the step index.
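Theorem 1 can be sanity-checked by brute force for short sentences: enumerate all action sequences and verify that whenever a state ℓ: (i, j) reduces, its predecessor on the stack became top exactly at step ℓ − 2(j − i) + 1. This is a toy verification, not part of the parser:

```python
from itertools import product

def check_theorem(n):
    """Brute-force check of Theorem 1 over all action sequences of
    length 2n - 1 (invalid prefixes are simply skipped)."""
    for actions in product(["shift", "reduce"], repeat=2 * n - 1):
        stack, next_word = [], 0   # entries: ((i, j), step at which span became top)
        for step, action in enumerate(actions, start=1):
            if action == "shift":
                if next_word == n:
                    break          # invalid prefix: no word left to shift
                stack.append(((next_word, next_word + 1), step))
                next_word += 1
            else:
                if len(stack) < 2:
                    break          # invalid prefix: nothing to reduce
                (i, j), l = stack.pop()
                (k, _), l_pred = stack.pop()
                # Theorem 1: predecessor's step is l - 2*(j - i) + 1
                if l_pred != l - 2 * (j - i) + 1:
                    return False   # counterexample found
                stack.append(((k, j), step))
    return True
```

For example, after "shift shift reduce", the top span (1, 2) at step ℓ = 2 combines with (0, 1) from step ℓ′ = 2 − 2(1) + 1 = 1.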

Action-Synchronous Beam Search
The incremental nature of our parser allows us to further lower the runtime complexity at the cost of inexact search. At each time step, we maintain the top b parsing states, pruning off the rest. Thus, a candidate parse that made it to the end of decoding had to survive within the top b at every step.
With O(n) parsing actions our time complexity becomes linear in the length of the sentence.
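A minimal sketch of this action-synchronous beam search follows. For simplicity each state here naively carries its full stack, whereas the real parser shares stack tails through the graph-structured stack (and merges equivalent states); shift_score and reduce_score are hypothetical stand-ins for the best label scores ξ and σ of the resulting spans:

```python
def beam_search(n, b, shift_score, reduce_score):
    """Action-synchronous beam search over span states (a sketch)."""
    beam = [(0.0, (), 0)]            # (prefix score, stack of spans, next word)
    for _ in range(2 * n - 1):       # one beam per action step
        candidates = []
        for score, stack, nxt in beam:
            if nxt < n:              # shift: push singleton span (j, j+1)
                candidates.append((score + shift_score(nxt),
                                   stack + ((nxt, nxt + 1),), nxt + 1))
            if len(stack) >= 2:      # reduce: combine (k, i), (i, j) into (k, j)
                k, j = stack[-2][0], stack[-1][1]
                candidates.append((score + reduce_score(k, j),
                                   stack[:-2] + ((k, j),), nxt))
        beam = sorted(candidates, key=lambda c: -c[0])[:b]   # keep top b
    return beam[0]                   # best final state; its stack is ((0, n),)

# Toy scores: every reduce scores 1, every shift 0.
best = beam_search(3, 4, lambda j: 0.0, lambda k, j: 1.0)
```

Since each of the 2n − 1 steps touches at most b states, the runtime is linear in n for a fixed beam size.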

Cube Pruning
However, Theorem 1 implies that a parsing state p can have up to b predecessor states ("left pointers"), i.e., |π(p)| ≤ b, because π(p) are all at the same step; a reduce action can therefore produce up to b new reduced states. With b items on a beam and O(n) actions to take, this gives an overall complexity of O(nb^2). Even though b^2 is a constant, even modest values of b can make b^2 dominate the length of the sentence.[4] To improve this at the cost of additional inexactness, we introduce cube pruning into our beam search, where we put candidate actions into a heap and retrieve the top b states to be considered in the next time-step. We heapify the top b shift-merged states and the top b reduced states. To avoid inserting all b^2 reduced states from the previous beam, we only consider each state's highest-scoring left pointer,[5] and whenever we pop a reduced state from the heap, we iterate down its left pointers to insert the next non-duplicate reduced state back into the heap. This process finishes when we pop b items from the heap. The initialization of the heap takes O(b) and popping b items takes O(b log b), giving an overall improved runtime of O(nb log b).
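The lazy heap-based retrieval of the top b reduce results can be sketched as follows (scores only; in the real parser each candidate is a full state, and the label score σ is folded into each combination):

```python
import heapq

def top_b_reduces(states, b):
    """Cube-pruning sketch: retrieve the top-b reduce scores without
    enumerating all b^2 combinations.

    states: list of (state_score, left_pointer_scores), where each
    left-pointer score list is sorted in descending order (guaranteed
    if each previous beam was itself kept sorted). Combining a state
    with its r-th left pointer scores state_score + left[r] here.
    """
    heap = []
    # Seed the heap with each state's highest-scoring left pointer only.
    for idx, (s, left) in enumerate(states):
        if left:
            heapq.heappush(heap, (-(s + left[0]), idx, 0))
    out = []
    while heap and len(out) < b:
        neg, idx, r = heapq.heappop(heap)
        out.append(-neg)
        s, left = states[idx]
        if r + 1 < len(left):   # lazily insert the next-best left pointer
            heapq.heappush(heap, (-(s + left[r + 1]), idx, r + 1))
    return out
```

Seeding takes O(b) and each of the b pops costs O(log b), matching the O(b log b) per-step bound.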

Training
We use a structured SVM approach for training (Stern et al., 2017a; Shi et al., 2017). We want the model to score the gold tree t* higher than any other tree t by at least a margin Δ(t, t*):

s(t*) ≥ s(t) + Δ(t, t*) for all t

Note that Δ(t, t) = 0 for any t and Δ(t, t*) > 0 for any t ≠ t*. At training time we perform loss-augmented decoding:

t̂ = argmax_t s_Δ(t) = argmax_t (s(t) + Δ(t, t*))

where s_Δ(·) is the loss-augmented score. If t̂ = t*, then all constraints are satisfied (which implies argmax_t s(t) = t*); otherwise we perform an update by backpropagating from s_Δ(t̂) − s(t*).

[4] The average length of a sentence in the Penn Treebank training set is about 24. Even with a beam size of 10, we already have b^2 = 100, which would be a significant factor in our runtime. In practice, each parsing state will rarely have the maximum b left pointers, so this ends up being a loose upper bound. Nevertheless, the beam search should be performed with the input length in mind, or else as b increases we risk losing a linear runtime.

[5] If each previous beam is sorted, and if the beam search is conducted top-to-bottom, then each state's left pointers will implicitly be kept in sorted order.

Cross-Span Loss
The baseline loss function from Stern et al. (2017a) counts the incorrect labels (i, j, X) in the predicted tree:

Δ(t, t*) = Σ_{(i, j, X) ∈ t} 1[X ≠ t*(i, j)]

Note that X can be the null label ∅, and t*(i, j) denotes the gold label for span (i, j), which could also be ∅.[6] However, there are two cases where t*(i, j) = ∅: a subspan (i, j) due to binarization (e.g., a span combining the first two subtrees of a ternary-branching node), or an invalid span in t that crosses a gold span in t*. In the baseline function above, these two cases are treated equivalently; for example, a span (3, 5, ∅) ∈ t is not penalized even if there is a gold span (4, 6, VP) ∈ t*. So we revise our loss function as:

Δ(t, t*) = Σ_{(i, j, X) ∈ t} 1[X ≠ t*(i, j) ∨ cross(i, j, t*)]

where cross(i, j, t*) = ∃ (k, l) ∈ t* such that i < k < j < l or k < i < l < j.

[6] Note that the predicted tree t has exactly 2n − 1 spans, but t* has far fewer spans (only labeled spans without ∅).
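The cross predicate and the revised loss translate directly into code (a sketch; the gold tree is represented here as a span-to-label dict, with ∅ as None):

```python
def cross(i, j, gold_spans):
    """True if span (i, j) crosses any gold span (k, l)."""
    return any(i < k < j < l or k < i < l < j for (k, l) in gold_spans)

def cross_span_loss(pred, gold):
    """Revised loss: a predicted (i, j, X) is wrong if its label differs
    from the gold label of (i, j), or if (i, j) crosses a gold span.

    pred: list of (i, j, X) triples; gold: dict {(i, j): X}, where
    spans absent from the dict have the null label (None).
    """
    gold_spans = list(gold)
    return sum(1 for (i, j, X) in pred
               if X != gold.get((i, j)) or cross(i, j, gold_spans))
```

For the example above, the null-labeled span (3, 5) crosses the gold span (4, 6, VP) and is now penalized, unlike under the baseline loss.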

Max Violation Updates
Given that we maintain loss-augmented scores even for partial trees, we can perform a training update on a given example sentence by taking the loss where it is greatest along the parse trajectory. At each parsing time-step ℓ, the violation is the difference between the highest loss-augmented-scoring parse trajectory up to that point and the gold trajectory (Huang et al., 2012; Yu et al., 2013). Note that the violation at the final time-step equals the max-margin loss described above. Taking the largest violation over all time-steps gives us the max-violation loss.
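Given prefix scores for the best loss-augmented trajectory and the gold trajectory at each step, picking the max-violation step is straightforward (a sketch over precomputed prefix scores; the names are illustrative):

```python
def max_violation_step(best_scores, gold_scores):
    """Return (step, violation) for the time-step where the gap between
    the best loss-augmented partial trajectory and the gold partial
    trajectory is largest; the update backpropagates from that gap.
    """
    violations = [b - g for b, g in zip(best_scores, gold_scores)]
    step = max(range(len(violations)), key=violations.__getitem__)
    return step, violations[step]
```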

Experiments
We present experiments on the Penn Treebank (Marcus et al., 1993) and the PTB-RST discourse treebank (Zhao and Huang, 2017). In both cases, the training set is shuffled before each epoch, and dropout (Hinton et al., 2012) is applied with probability 0.4 to the recurrent outputs for regularization. Updates with minibatches of size 10 and 1 are used for PTB and PTB-RST, respectively. We use Adam (Kingma and Ba, 2014) with default settings to schedule learning rates for all the weights. To address unknown words during training, we adopt the strategy described by Kiperwasser and Goldberg (2016): words in the training set are replaced with the unknown-word symbol UNK with probability p_unk = 1 / (1 + f(w)), where f(w) is the number of occurrences of word w in the training corpus. Our system is implemented in Python using the DyNet neural network library (Neubig et al., 2017).
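The UNK-replacement rule can be sketched as follows (the rng parameter is only there to make the stochastic behavior testable):

```python
import random

def maybe_unk(word, counts, rng=random):
    """Replace a training word with UNK with probability 1 / (1 + f(w)),
    where f(w) is the word's training-set frequency (Kiperwasser and
    Goldberg, 2016). Rare words are thus replaced more often, so the
    model learns a useful UNK representation."""
    p_unk = 1.0 / (1.0 + counts.get(word, 0))
    return "UNK" if rng.random() < p_unk else word
```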

Penn Treebank
We use the Wall Street Journal portion of the Penn Treebank, with the standard split of sections 2-21 for training, 22 for development, and 23 for testing. Tags are provided using the Stanford tagger with 10-way jackknifing. Table 1 shows our development results and overall speeds, while Table 2 compares our test results. We show that a beam size of 20 is fast while still achieving state-of-the-art performance.

Discourse Parsing
To measure the tractability of parsing on longer sequences, we also consider experiments on the PTB-RST discourse treebank, a joint discourse and constituency dataset with a combined representation, allowing for parsing at either level (Zhao and Huang, 2017). We compare our out-of-the-box runtimes in Figure 3. Without any pre-processing, and by treating discourse examples as constituency trees with thousands of words, our trained models serve as end-to-end discourse parsing systems. For our overall constituency results in Table 3, and for the discourse results in Table 4 (Table 3 broken down to focus on the discourse labels), we adapt the split-point feature described by Zhao and Huang (2017) in addition to the base parser. We find that larger beam sizes are required to achieve good discourse scores.

Conclusions
We have developed a new neural parser that maintains linear runtime while still searching over an exponentially large space. We also use cube pruning to further improve the runtime to O(nb log b). For training, we introduce a new loss function and achieve state-of-the-art results among single-model end-to-end systems.