A Search-Based Dynamic Reranking Model for Dependency Parsing

We propose a novel reranking method to extend a deterministic neural dependency parser. Unlike conventional k-best reranking, the proposed model integrates search and learning through a dynamic action revising process, using the reranking model both to guide the modification of the base outputs and to rerank the resulting candidates. The dynamic reranking model achieves an absolute 1.78% accuracy improvement over the deterministic baseline parser on PTB, the largest improvement among neural rerankers in the literature.


Introduction
Neural network models have recently been exploited for dependency parsing. Chen and Manning (2014) built a seminal model by replacing the SVM classifier of the transition-based MaltParser (Nivre et al., 2007) with a feed-forward neural network, achieving significantly higher accuracy and faster speed. As a local and greedy neural baseline, it does not outperform the best discrete-feature parsers, but it nevertheless demonstrates the strong potential of neural network models for transition-based dependency parsing.
Subsequent work aimed to improve the model of Chen and Manning (2014) in two main directions. First, global optimization learning and beam search inference have been exploited to reduce error propagation (Weiss et al., 2015). Second, recurrent neural network models have been used to extend the range of neural features beyond a local window. These methods give accuracies that are competitive with the best results in the literature. Another direction for extending a baseline parser is reranking (Collins and Koo, 2000; Charniak and Johnson, 2005; Huang, 2008). Recently, neural network models have been used for reranking in constituent (Socher et al., 2013; Le et al., 2013) and dependency (Le and Zuidema, 2014) parsing. Compared with rerankers that rely on discrete manual features, neural network rerankers can potentially capture more global information over whole parse trees.

(Figure 1: base and revised parse trees and action sequences for the example sentence "John loves Mary".)
Traditional rerankers are based on chart parsers, which can yield exact k-best lists and forests. This is infeasible for the transition-based neural parser and neural reranker, which have rather weak feature locality. In addition, k-best lists from the baseline parser are not necessarily the best candidates for a reranker. Our preliminary results show that reranking candidates can instead be constructed by modifying unconfident actions in the baseline parser's output and letting the baseline parser re-decode the sentence from the modified action. In particular, revising two incorrect actions of the baseline parser yields an oracle of 97.79% UAS, which increases to 99.74% when revising five actions. Accordingly, we design a novel search-based dynamic reranking algorithm that revises baseline parser outputs.
For example, given the sentence "John loves Mary", the baseline parser generates a base tree (Figure 1a) using the five shift-reduce actions of Section 2 (Figure 1d). The gold parse tree can be obtained by a two-step action revising process: as shown in Figure 1d, we first revise the least confident action S of the base tree and run the baseline parser again from the revised action to obtain tree 1. This corrects the dependency arc between John and loves. We then obtain the gold parse tree (tree 2) by further revising the least confident action of tree 1 on the second action sequence.
Rather than relying on the baseline model scores alone to decide which action to revise (static search), we build a neural network model to guide which actions to revise, as well as to rerank the output trees (dynamic search). The resulting model integrates search and learning, yielding the minimum number of candidates for the best accuracy. Given the fast speed of the baseline parser, the reranker can be executed with high efficiency.
Our dynamic search reranker has two main advantages over the static one. The first is training diversity: the dynamic reranker searches over more structurally diverse candidate trees, which allows it to distinguish candidates more easily. The second is the reranking oracle: guided by the reranking model, the dynamic reranker reaches a better oracle than the static reranker.
On the WSJ, our dynamic reranker achieves 94.08% and 93.61% UAS on the development and test sets, respectively, at a speed of 16.1 sentences per second. From the same number of candidates, it yields a 0.44% higher accuracy improvement (+1.78%) than a static reranker (+1.34%), the largest accuracy improvement among related neural rerankers.

(Figure 2: hierarchical neural parsing model.)

Baseline Dependency Parser
Transition-based dependency parsers scan an input sentence from left to right, performing a sequence of transition actions to predict its parse tree (Nivre, 2008). We employ the arc-standard system (Nivre et al., 2007), which maintains partially constructed outputs on a stack and keeps the incoming words of the sentence in a queue. Parsing starts with an empty stack and a queue consisting of the whole input sentence. At each step, a transition action is taken to consume the input and construct the output. Formally, a parsing state is denoted ⟨j, S, L⟩, where S is a stack of subtrees [... s_2, s_1, s_0], j is the head of the queue (i.e. [q_0 = w_j, q_1 = w_{j+1}, ...]), and L is the set of dependency arcs built so far. At each step, the parser chooses one of the following actions:
• SHIFT (S): move the front word w_j from the queue onto the stack.
• LEFT-l (L): add an arc with label l between the top two trees on the stack (s_1 ← s_0), and remove s_1 from the stack.
• RIGHT-l (R): add an arc with label l between the top two trees on the stack (s_1 → s_0), and remove s_0 from the stack.
Given the sentence "John loves Mary", the gold-standard action sequence is S, S, L, S, R.
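The transition system above can be simulated in a few lines. The sketch below implements only the unlabeled S/L/R mechanics (the paper's parser additionally predicts labels with a neural classifier), and replays the gold sequence for "John loves Mary":

```python
# A minimal arc-standard simulator: a sketch of the S/L/R mechanics only,
# without dependency labels or a learned classifier.

def arc_standard(words, actions):
    """Run a sequence of S/L/R actions; return the set of (head, dep) arcs.

    The stack holds word indices; j is the index of the queue front.
    """
    stack, j, arcs = [], 0, set()
    for act in actions:
        if act == "S":                      # SHIFT: move w_j onto the stack
            stack.append(j)
            j += 1
        elif act == "L":                    # LEFT: arc s_1 <- s_0, pop s_1
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s0, s1))              # head s_0, dependent s_1
            del stack[-2]
        elif act == "R":                    # RIGHT: arc s_1 -> s_0, pop s_0
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s1, s0))              # head s_1, dependent s_0
            stack.pop()
    return arcs

# The gold sequence from the text: S, S, L, S, R.
arcs = arc_standard(["John", "loves", "Mary"], ["S", "S", "L", "S", "R"])
```

Running this yields the two arcs headed by "loves" (index 1), matching the gold tree of the running example.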

Model
Chen and Manning (2014) proposed a deterministic neural dependency parser, which relies on dense embeddings to predict the optimal action at each step. We propose a variation of Chen and Manning (2014) that splits the output layer into two hierarchical layers: an action-type layer and a dependency-label layer. The hierarchical parser determines an action in two steps, first deciding the action type and then the dependency label (Figure 2). At each step of deterministic parsing, the neural model extracts n atomic features from the parsing state. We adopt the feature templates of Chen and Manning (2014). Every atomic feature is represented by a feature embedding e_i ∈ R^d. An input layer concatenates the n feature embeddings into a vector x = [e_1; e_2; ...; e_n], where x ∈ R^{d·n}. Then x is mapped to a d_h-dimensional hidden layer h by a mapping matrix W_1 ∈ R^{d_h × d·n} and a cube activation function for feature combination:

h = (W_1 x + b_1)^3

Our method differs from Chen and Manning (2014) in the output layer. Given the hidden layer h, the action-type output layer o_act and the label output layer o_label(a_i) of action type a_i are computed as

o_act = W_2 h,    o_label(a_i) = W_3^i h,

where W_2 ∈ R^{d_a × d_h} is the mapping matrix from the hidden layer to the action-type layer, d_a is the number of action types, W_3^i ∈ R^{d_label × d_h} is the mapping matrix from the hidden layer to the corresponding label layer, and d_label is the number of dependency labels.
The probability of a labeled action y_{i,j} given the action history Acts and input x is computed as

p(y_{i,j} | Acts, x) = p(a_i | Acts, x) · p(l_j | a_i, Acts, x) = softmax(o_act)_i · softmax(o_label(a_i))_j,

where a_i is the i-th action type in the action layer and l_j is the j-th label in the label layer for a_i.
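The two-level factorization can be sketched numerically. In the toy code below, the dimensions and weight matrices are illustrative random stand-ins (not the trained model); the point is that the hidden layer first scores the action types, and a per-type matrix then scores the labels:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_h, n_act, n_label = 8, 3, 5                 # toy sizes, not the paper's
h = rng.standard_normal(d_h)                  # hidden layer (post-activation)
W2 = rng.standard_normal((n_act, d_h))        # hidden -> action-type layer
W3 = rng.standard_normal((n_act, n_label, d_h))  # one label matrix per type

p_act = softmax(W2 @ h)                       # p(a_i)
i = int(p_act.argmax())                       # choose the best action type
p_label = softmax(W3[i] @ h)                  # p(l_j | a_i): only the chosen
j = int(p_label.argmax())                     # type's label layer is computed

p_joint = p_act[i] * p_label[j]               # probability of labeled action
```

Because only the selected action type's label layer is evaluated, the output computation shrinks relative to a single flat softmax over all (type, label) pairs, which is the source of the speed gain reported later.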
In training, we use the cross-entropy loss to maximize the probability of the training data A. Experiments show that our hierarchical neural parser is both faster and slightly more accurate than the original neural parser.

Reranking Scorer
We adopt the recursive convolutional neural network (RCNN) for scoring full trees. Given a dependency subtree rooted at h, the representation of each arc (h, c_i) is computed as

z_i = tanh(W_{(h,c_i)} p_i),

where p_i ∈ R^n is the concatenation of the head word embedding w_h, the child phrase representation x_{c_i}, and the distance embedding d_{(h,c_i)}; W_{(h,c_i)} ∈ R^{m×n} is a linear composition matrix that depends on the POS tags of h and c_i. The subtree phrase representation x_h is computed by a row-wise max-pooling function over the matrix of arc representations Z_h.
The subtree with head h is scored as

s(h) = Σ_i v_{(h,c_i)} · z_i,

where v_{(h,c_i)} is the score vector, a vector of parameters to be trained. The score of a whole dependency tree y is the sum of its subtree scores:

s_t(x, y, Θ) = Σ_{w ∈ y} s(w),

where w ranges over the nodes of tree y and Θ denotes the set of parameters of the network.
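The composition and pooling for one subtree can be sketched with NumPy. The shapes and the single shared matrices below are our simplifying assumptions (the cited RCNN uses a separate matrix and score vector per POS-tag pair):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 6, 3          # arc-repr size, input size, number of child arcs

def subtree_score(P, W, v):
    """Score one subtree.

    P: n x k matrix, one column p_i per child arc of head h.
    Returns the pooled phrase representation x_h and the subtree score.
    """
    Z = np.tanh(W @ P)               # arc representations z_i (m x k)
    x_h = Z.max(axis=1)              # row-wise max-pooling over the arcs
    score = float(v @ Z.sum(axis=1)) # sum of per-arc scores v . z_i
    return x_h, score

P = rng.standard_normal((n, k))      # stand-ins for [w_h; x_ci; d_(h,ci)]
W = rng.standard_normal((m, n))      # composition matrix (shared here)
v = rng.standard_normal(m)           # score vector (shared here)
x_h, s = subtree_score(P, W, v)
```

The pooled vector x_h would then feed into the parent arc's p_i vectors, so scoring a whole tree is a single bottom-up pass summing the per-subtree scores.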

Search-based Dynamic Reranking for Dependency Parsing
Using the hierarchical parser of Section 2 as the baseline parser, we propose a search-based dynamic reranking model, which integrates search and learning by searching the reranking candidates dynamically, instead of limiting the scope to a fixed k-best list. The efficiency of the reranking model is guaranteed by three properties of the baseline parser: revising efficiency, probability diversity and search efficiency.

(Table 2: average action probabilities.)

Properties of the Baseline Parser
To demonstrate the three properties above, we give some preliminary results for the baseline. To parse the 1,695 sentences in Section 22 of the WSJ, our baseline parser needs to perform 78,227 shift-reduce actions. During this process, if we correct every incorrectly determined action as it is encountered and let the baseline parser re-decode the sentence from that point, we need to revise 2,052 actions, averaging 1.2 actions per sentence. In other words, the baseline parser can parse the 1,695 sentences correctly with 2,052 actions revised. Note that a revise operation is required to change the action type (e.g. S to L). After the action type is revised, the hierarchical baseline parser chooses the optimal dependency label; we only modify the action type in the revising process. Thus the modified trees are always structurally different from the original, rather than differing only in dependency labels, which guarantees structural diversity.
Revising Efficiency It can be seen from Table 1 that revising one incorrect action results in a 3.5% accuracy improvement, and we obtain a 99.74% UAS with a maximum revising depth of 5. Although we only revise the action type, the LAS goes up with the UAS. The revising-efficiency property suggests that high-quality tree candidates can be found with a small number of changes.
Probability Diversity Actions with lower probabilities are more likely to be incorrect. We compute the average probabilities of gold and incorrect actions when parsing Section 22 of the WSJ (Table 2), finding that most gold actions have very high probabilities. The average probability of the gold actions is much higher than that of the incorrectly predicted ones, indicating that revising actions with low probabilities can lead to better trees.

Search Efficiency
The fast speed of the baseline parser allows the reranker to search a large number of tree candidates efficiently. With the graph stack trick (Goldberg et al., 2013), the reranker only needs to perform partial parsing to obtain new trees. This enables a fast reranker in theory.
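The partial-parsing idea can be illustrated with a toy memoization sketch. This is our simplification, not the actual graph stack of Goldberg et al. (2013): the point is only that states reached through a shared action prefix are computed once and reused, so a revision at step t pays only for the suffix after t.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def state_after(prefix):
    """Build the parser state for an action prefix, reusing cached prefixes.

    The "state" here is just the action tuple itself; a real parser would
    store the stack, queue position and arc set.
    """
    if not prefix:
        return ()                        # the empty initial state
    head, last = prefix[:-1], prefix[-1]
    return state_after(head) + (last,)   # one action applied to cached state

s1 = state_after(("S", "S", "L", "S", "R"))   # full base derivation
s2 = state_after(("S", "S", "L", "S", "L"))   # revision: reuses 4 cached steps
```

Computing `s2` after `s1` performs only one new "transition", since the shared 4-action prefix is a cache hit.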

Search Strategy
Given an output action sequence from the baseline parser, we revise the action with the lowest probability margin and start a new branch by taking a different action at that point. The probability margin of an action a is computed as p(a_max) − p(a), where a_max is the action taken by the baseline, which has the highest model probability. For the new branch, a is taken instead of a_max, and the baseline parser then runs deterministically until parsing finishes, yielding a new dependency tree. We require that the revision change the action type; the most probable dependency label for the revised action type is then used.
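Selecting where to branch reduces to ranking (step, alternative-action) pairs by probability margin. The sketch below is a minimal stand-in for this step (the real distributions come from the baseline parser; names and data are illustrative):

```python
def revision_candidates(step_probs, taken, f):
    """Return the f (step, new_action) pairs with the lowest margins.

    step_probs: one {action: prob} dict per parsing step.
    taken: the action the baseline actually took at each step.
    """
    cands = []
    for t, probs in enumerate(step_probs):
        p_max = probs[taken[t]]          # baseline takes the argmax action
        for a, p in probs.items():
            if a != taken[t]:            # a revision must change the action
                cands.append((p_max - p, t, a))
    cands.sort()                         # smallest margin first
    return [(t, a) for _, t, a in cands[:f]]

# Toy distributions for a two-step derivation where the baseline took S, S.
steps = [{"S": 0.9, "L": 0.05, "R": 0.05},
         {"S": 0.5, "L": 0.45, "R": 0.05}]
best = revision_candidates(steps, ["S", "S"], 1)
```

Here the least confident choice is step 1, where L trails S by only 0.05, so that is the first branch point.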
Multiple strategies can be used to search through the revising process. One intuitive strategy is best-first, which modifies the action with the lowest probability margin among all action sequences constructed so far. Starting from the original output of the baseline parser, modifying the action with the lowest probability margin results in a new tree. Under the best-first strategy, the action with the lowest probability margin across the two outputs is revised next to yield a third output. The search repeats until k outputs are obtained, which are used as candidates for reranking.
The best-first strategy, however, is a greedy process that does not consider the quality of the output being revised. A better candidate (with a higher F1 score) is more likely to lead to the gold tree. With the best-first strategy, we revise one tree at a time; if the selected tree is not the best one available, the revised tree is less likely to be the gold one. Revising a worse output is less likely to generate the gold parse tree than revising a relatively better output, and our preliminary experiments confirm this intuition. As a result, we adopt a beam search strategy, which uses a beam to hold the b outputs to modify.
For each tree in the beam, the f actions with the lowest probability margins are modified, leading to b × f new trees, where b is the beam size and f is the revising factor. From these trees, the b best are put into the beam for the next step. Search starts with the beam containing only the original base parse and repeats for l steps, where l is called the revising depth. The best tree is selected from all the trees constructed. The search process for the example in Figure 1 is illustrated in Figure 3, with b = 1, f = 3 and l = 2.
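The revise-and-re-decode loop can be summarized as a generic beam search. In the sketch below, `revise` and `score` are hypothetical stand-ins for re-running the baseline parser from a modified action and for the (heuristic) scoring function; the toy check at the end uses integers as "trees" just to exercise the control flow:

```python
def beam_revise_search(base_tree, revise, score, b, f, l):
    """Beam search over revisions: b = beam size, f = revising factor,
    l = revising depth. Returns the best tree among all candidates."""
    beam, all_trees = [base_tree], [base_tree]
    for _ in range(l):
        new = []
        for tree in beam:
            new.extend(revise(tree, f))   # f lowest-margin revisions per tree
        all_trees.extend(new)
        beam = sorted(new, key=score, reverse=True)[:b]   # keep the b best
    return max(all_trees, key=score)      # select over everything searched

# Toy check: each "revision" of integer tree t yields t+1 .. t+f.
revise = lambda t, f: [t + i for i in range(1, f + 1)]
best = beam_revise_search(0, revise, score=lambda t: t, b=1, f=3, l=2)
```

With b = 1, f = 3, l = 2 this generates 1 + 3 + 3 candidates, mirroring the Figure 3 configuration.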
At each iteration, the b best candidates can be decided by the baseline parser score alone, which is the product of the probabilities of each action. We call this static search reranking. As mentioned in the introduction, the baseline model score may not be the optimal criterion for selecting reranking candidates, since it may not reflect the best oracle or diversity. We therefore introduce a dynamic search strategy, using the reranking model to compute heuristic scores that guide the search.

Search-Based Dynamic Reranking
Doppa et al. (2013) propose that structured prediction by learning to guide search should maintain two different scoring functions: a heuristic function for guiding search and a cost function for selecting the best output. Following Doppa et al. (2013), we use the RCNN of Section 3 to yield two different scores, namely a heuristic score s_t(x, y, Θ_h) to guide the search of revising, and a cost score s_t(x, y, Θ_c) to select the best output tree.
Denote by b(i) the beam at the i-th step of search. The k best candidates in the beam at step i + 1 are

b(i + 1) = arg K_{c ∈ c(i)} ( s_t(x, c, Θ_h) + s_b(x, c) ),

where c(i) denotes the set of trees newly constructed by revising the trees in b(i), s_b(x, c) is the baseline model score, and arg K passes the k best candidate trees to the next beam. Finally, the output tree of reranking is selected from the set C of all trees searched during the revising process:

y_i = arg max_{y ∈ C} s_t(x, y, Θ_c).

Interpolated Reranker In testing, we also adopt the popular mixture reranking strategy (Hayashi et al., 2013; Le and Mikolov, 2014), which obtains better reranking performance through a linear combination of the reranking score and the baseline model score:

y_i = arg max_{y ∈ τ(x_i)} ( β · s_t(x_i, y, Θ_c) + (1 − β) · s_b(x_i, y) ).
Here y_i is the final output tree for a sentence x_i; τ(x_i) returns all the candidate trees of the dynamic reranking; and β ∈ [0, 1] is a hyper-parameter.
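The interpolation itself is a one-liner; the sketch below uses hand-set toy scores (the function names and data are illustrative, not the paper's implementation):

```python
def interpolated_best(candidates, s_rerank, s_base, beta):
    """Return the candidate maximizing beta*s_rerank + (1-beta)*s_base."""
    return max(candidates,
               key=lambda y: beta * s_rerank(y) + (1 - beta) * s_base(y))

# Toy example: three candidate "trees" with hand-set scores.
rr = {"t1": 0.9, "t2": 0.2, "t3": 0.5}   # reranking (cost) scores
bb = {"t1": 0.1, "t2": 0.9, "t3": 0.5}   # baseline model scores
pick = interpolated_best(["t1", "t2", "t3"], rr.get, bb.get, beta=0.5)
```

At β = 0.5 the mixture prefers t2 (0.55) over t1 and t3 (0.5 each), even though neither individual score ranks t2 first on both criteria; tuning β trades off trust in the two models.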

Training
Like k-best neural rerankers (Socher et al., 2013), we use the max-margin criterion to train our model in a stage-wise manner (Doppa et al., 2013). Given training data D_c = {(x_i, y_i, ŷ_i)}_{i=1}^N, where x_i is a sentence, ŷ_i is the output tree with the highest cost score and y_i is the corresponding gold tree, the training objective is to minimize the loss function J(Θ_c) plus an l2-regularization term:

J(Θ_c) = (1/N) Σ_{i=1}^N max( 0, s_t(x_i, ŷ_i, Θ_c) + κ ∆(y_i, ŷ_i) − s_t(x_i, y_i, Θ_c) ) + (λ/2) ||Θ_c||².

Here, Θ_c is the cost model and s_t(x_i, y_i, Θ_c) is the cost reranking score of y_i.
∆(y_i, ŷ_i) is the structured margin loss between y_i and ŷ_i, measured by counting the number of incorrect dependency arcs in the tree (Goodman, 1998).
For the heuristic score model, the training objective is to minimize the same loss, computed between the tree with the best UAS, y_i, and the tree with the best heuristic reranking score, ŷ_i.
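The per-example hinge term of this objective can be sketched directly (the toy scores below are illustrative; κ is the margin-loss discount from the hyper-parameter section, and the structured margin counts incorrect arcs):

```python
def margin_loss(score_gold, score_pred, n_wrong_arcs, kappa):
    """Per-example hinge: the predicted tree must trail the gold tree
    by at least kappa times its number of incorrect arcs."""
    delta = kappa * n_wrong_arcs
    return max(0.0, score_pred + delta - score_gold)

# Gold already outscores the prediction by more than the margin: zero loss.
zero = margin_loss(2.0, 1.0, n_wrong_arcs=2, kappa=0.1)
# Otherwise the margin violation is penalized linearly.
loss = margin_loss(1.0, 0.95, n_wrong_arcs=2, kappa=0.1)
```

Because ∆ grows with the number of wrong arcs, worse candidate trees must be separated from the gold tree by a larger score gap before the loss vanishes.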
The detailed training algorithm is given in Algorithm 1. AdaGrad (Duchi et al., 2011) with subgradients (Ratliff et al., 2007) and mini-batches is adopted for optimization.

Set-up
Our experiments are performed on the English Penn Treebank (PTB; Marcus et al., 1993). We follow the standard splits of PTB3, using Sections 2-21 for training, Section 22 for development and Section 23 for final testing. Following prior work on reranking, we use Penn2Malt to convert constituent trees to dependency trees. Ten-fold POS jackknifing is used in training the baseline parser, and we use the POS tagger of Collins (2002) to assign POS tags automatically. Because our dynamic reranking model generates training instances during search, we train 10 baseline parsing models on the 10-fold jackknifing data and load the baseline parser models dynamically during reranking training.
Following Chen and Manning (2014), we use the set of pre-trained word embeddings with a dictionary size of 13,000 from Collobert et al. (2011). The word embeddings were trained on the entire English Wikipedia, which contains about 631 million words.

Hyper-parameters
There are two different networks in our system: a hierarchical feed-forward neural network for baseline parsing and a recursive convolutional network for dynamic reranking. The hyper-parameters of the hierarchical parser are set as described by Chen and Manning (2014), with embedding size d = 50, hidden layer size d_h = 300, regularization parameter λ = 10^-8, initial AdaGrad learning rate α = 0.01 and batch size b = 100,000. We set the hyper-parameters of the RCNN as follows: word embedding size d_w^rnn = 25, distance embedding size d_d^rnn = 25, initial AdaGrad learning rate α_rnn = 0.1, regularization parameter λ_rnn = 10^-4, margin loss discount κ = 0.1 and revising factor f = 8.

The Hierarchical Neural Parser
As shown in Table 3, the proposed hierarchical base parser is 1.3 times faster than the original parser and obtains a slight accuracy improvement. The reason for the speed gain is that a smaller output layer requires less computation in the mapping from the hidden layer to the output layer (Morin and Bengio, 2005; Mnih and Hinton, 2009).

Development Tests
For the beam search dynamic reranking model, the selection of beam size b and revising depth l affect the accuracy and efficiency of the reranker. We tune the values on the development set.
Beam Size A proper beam size balances efficiency and accuracy in the search process. The reranking accuracies with different beam sizes are listed in Table 4, where the oracle is the best UAS among the searched trees during reranking and K is the number of candidate trees searched in testing. The UAS and the parsing oracle both go up with increasing beam size. Reranking with a beam size of 4 gives the best development performance, so we fix the beam size to 4 in the following experiments.

Revising Depth As shown in Table 5, as the revising depth increases from 1 to 3, the reranker obtains a better parsing oracle. A depth of 3 gives the best UAS, 93.81%, on the development set. The parsing oracle stops improving with deeper revised search. This may be because at the fourth search step, high-quality trees begin to fall out of the beam, resulting in worse output candidates, which makes further revising steps yield smaller oracle gains. We set the search depth to 3 in the following experiments.

Integrating Search and Learning As shown in Table 6, the dynamic and static rerankers both achieve significant accuracy improvements over the baseline parser. The dynamic reranker gives a much larger improvement, although its oracle is only 0.2% higher than that of the static one. This demonstrates the benefit of diversity: the candidates are always the same for static search, whereas the dynamic reranker searches more diverse tree candidates in different iterations of training.
To further explore the impact of training diversity on dynamic reranking, we also compare dynamic search rerankers trained and tested with different revising depths. In Table 7, origin denotes training and testing with the same depth d, while ts denotes training with d = 3 and testing with a smaller d. For example, a reranker trained with d = 3 and tested with d = 2 achieves better performance than one trained and tested with d = 2. The testing oracle of the former reranker is lower than that of the latter, yet the former learns more from the training instances and obtains better parsing accuracy. This again indicates that training diversity matters beyond oracle accuracy alone.
Interpolated Reranker Finally, we mix the baseline model score and the reranking score following Hayashi et al. (2013), optimizing the mixture parameter β by grid search with a step size of 0.005. With the mixture reranking trick, the dynamic reranker obtains an accuracy of 94.08% (Table 8), a further improvement of 0.28% on the development set.

Final Results
Comparison with Dependency Rerankers In Table 9, we compare the search-based dynamic reranker with a range of dependency rerankers. The models of Hayashi et al. (2011) and Hayashi et al. (2013) are forest reranking models; Le and Zuidema (2014) present a neural k-best reranking model. Our dynamic reranking model achieves the highest accuracy improvement over its baseline parser on both the development and test sets, and we obtain the best performance on the development set. One prior neural reranker achieved higher accuracy on the test set, but it adopted a stronger baseline parser than ours, which could not be used in our dynamic reranker because it is not fast enough and would make our reranker slow in practice.

Comparing with Neural Dependency Parsers
We also compare parsing accuracies and speeds with a number of neural network dependency parsers, including a dependency parser with stack LSTMs and a structured parser using beam search, both of which achieved significant accuracy improvements over the deterministic neural parser of Chen and Manning (2014). Our dynamic search reranker obtains a 93.61% UAS on the test set, higher than most of the neural parsers except that of Weiss et al. (2015), who employ a structured prediction model on top of a neural greedy baseline, achieving very high parsing accuracy.

Results on Stanford dependencies
We also evaluate the proposed static and dynamic rerankers on the Stanford dependency treebank. The main results are consistent with those on the CoNLL dependencies, with the dynamic reranker achieving a 0.41% accuracy improvement over the static reranker on the test data. However, the parsing accuracy on Stanford dependencies is not state-of-the-art, for which we see two possible reasons. First, the baseline parsing accuracy on Stanford dependencies is lower than on the CoNLL dependencies. Second, all the hyper-parameters were tuned on the CoNLL data.

Related Work
Neural Network Reranking A line of work has explored reranking using neural networks. Socher et al. (2013) first proposed a neural reranker using a recursive neural network for constituent parsing. Le and Zuidema (2014) extended neural reranking to dependency parsing using an inside-outside recursive neural network (IORNN), which can process trees both bottom-up and top-down. The RCNN method we build on solved the problem of modeling k-ary parse trees for dependency parsing. Neural rerankers are capable of capturing global syntactic features across the whole tree; in contrast, even the most non-local LSTM-based neural parsers cannot exploit such global features. Unlike previous neural rerankers, our contribution is to integrate search and learning for reranking, rather than to propose a new neural model.
Forest Reranking Forest reranking (Huang, 2008; Hayashi et al., 2013) offers a different way to extend the coverage of reranking candidates, computing the reranking score over parse forests by decomposing non-local features with cube pruning (Huang and Chiang, 2005). In contrast, a neural reranking score encodes the whole dependency tree and cannot be decomposed for forest reranking efficiently and accurately. Doppa et al. (2013) proposed a structured prediction model with an HC-Search strategy and imitation learning, which is closely related to our work in spirit. They used complete-space search (Doppa et al., 2012) for sequence labeling tasks, with the search halting after a specific time bound. In contrast, we propose a dynamic parsing reranking model based on an action revising process, a multi-step process that revises the least confident actions of the base output and stops at a given revising depth. The dynamic reranking model concentrates on extending the training diversity and testing oracle for parsing reranking, and is built on the transition-based parsing framework.

Conclusion
In this paper, we proposed a search-based dynamic reranking model using a hierarchical neural base parser and a recursive convolutional neural scoring model. The dynamic model is the first reranker to integrate search and learning for dependency parsing. It achieves a significant accuracy improvement (+1.78%) over the baseline deterministic parser, and with the dynamic search process, our reranker obtains a 0.44% accuracy improvement over the static reranker. The code for this paper can be downloaded from http://github.com/zhouh/dynamic-reranker.