Neural Reranking for Dependency Parsing: An Evaluation

Recent work has shown that neural rerankers can improve results for dependency parsing over the top k trees produced by a base parser. However, all neural rerankers so far have been evaluated on English and Chinese only, both languages with a configurational word order and poor morphology. In this paper, we re-assess the potential of successful neural reranking models from the literature on English and on two morphologically rich(er) languages, German and Czech. In addition, we introduce a new variation of a discriminative reranker based on graph convolutional networks (GCNs). We show that the GCN not only outperforms previous models on English but is the only model that is able to improve results over the baselines on German and Czech. We explain the differences in reranking performance based on an analysis of a) the gold tree ratio and b) the variety in the k-best lists.


Introduction
Neural models for dependency parsing have been a tremendous success, pushing state-of-the-art results for English on the WSJ benchmarking dataset to over 94% LAS (Dozat and Manning, 2017). Most state-of-the-art parsers, however, are local and greedy and are thus expected to have problems finding the best global parse tree. This suggests that combining greedy, local parsing models with some mechanism that adds a global view on the data might increase parsing accuracies even further.
In this work, we look into incorporating global information for dependency parsing via reranking. Different model architectures have been proposed for neural reranking of dependency parse trees (Le and Zuidema, 2014; Zhu et al., 2015; Zhou et al., 2016). While achieving modest or even substantial improvements over the baseline parser, all the systems above report performance only on English and Chinese data, both morphologically poor languages with a configurational word order and mostly projective tree structures.
In this paper, we thus try to reproduce results for different reranking models from the literature on English data and compare them to results for German and Czech, two morphologically rich(er) languages (MRLs) with a high percentage of non-projective structures. In addition, we present a new discriminative reranking model based on graph convolutional networks (GCNs). Our GCN reranker outperforms the other rerankers on English and is the only model able to obtain small improvements over the baseline parser on German and Czech, while the other rerankers fail to beat the baselines. The improvements, however, are not significant, which raises the question of what makes neural reranking of MRLs more difficult than reranking English or Chinese.
We analyze the differences in performance across the three languages and show that this failure is due to the composition and quality of the k-best lists. In particular, we show that the gold tree ratio in the English k-best lists is much higher than for German and Czech, and that the trees in the English k-best lists show a higher variety, making it easier for the reranker to distinguish between high- and low-quality trees.
The paper is structured as follows. In §2, we review related work on reranking for neural dependency parsing. The different reranking models are described in detail in §3. In §4, we first reproduce reranking results for English and evaluate our new reranker on the English data. Then we test the different models on the two morphologically rich(er) languages and present the results of our evaluation and our analysis, before we conclude in §5.

Related Work
Reranking is a popular technique to improve parsing performance of the output of a base parser. First, the top k candidate trees are generated by the base parser, then these trees are reranked using additional features not accessible to the base parser. This adds a more global and complete view of the trees, in contrast to the local and incomplete features used by the parser.
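The two-step procedure can be sketched as follows; the candidate trees and the scoring function are toy stand-ins for a real base parser and a trained reranker:

```python
# A minimal sketch of k-best reranking: a base parser proposes candidate
# trees, an external scorer with a more global view picks the best one.
def rerank(kbest, rerank_score):
    """Return the tree from the k-best list with the highest reranker score."""
    return max(kbest, key=rerank_score)

# Toy candidates for "I saw the dog", each tree a tuple of (head, dependent)
# edges. These are illustrations, not parser output.
kbest = [
    (("ROOT", "saw"), ("saw", "I"), ("saw", "dog"), ("dog", "the")),
    (("ROOT", "saw"), ("saw", "I"), ("I", "dog"), ("dog", "the")),
]
# Hypothetical "global" scorer: count edges headed by ROOT or the main verb.
verb_attachment_score = lambda tree: sum(1 for h, d in tree if h in ("ROOT", "saw"))
best = rerank(kbest, verb_attachment_score)
```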
Discriminative rerankers have been a success story in constituency parsing (Collins and Koo, 2005; Charniak and Johnson, 2005). A disadvantage of traditional feature-rich rerankers is that the large number of potentially sparse features makes them prone to overfitting and also reduces the efficiency of the systems. Neural rerankers offer a solution to this problem by learning dense, low-dimensional feature representations that generalize better and thus reduce the risk of overfitting.
Neural reranking The first neural reranker was presented by Socher et al. (2013) for constituency parsing, based on a recursive neural network which processes the nodes in the parse tree bottom-up and learns dense feature representations for the whole tree. This approach was adapted to dependency parsing by Le and Zuidema (2014). Zhu et al. (2015) improve on previous work by proposing a recursive convolutional neural network (RCNN) architecture for reranking which can capture syntactic and semantic properties of words and phrases in the parse trees (see §3 for a more detailed description of the two models).
k-best vs. forest reranking There are two different approaches to reranking for parsing: k-best reranking and forest reranking. In k-best reranking, each complete parse tree is encoded and presented to the reranker. A disadvantage of k-best reranking is the limited scope of the k-best list, which provides an upper bound for reranking performance. In contrast, a packed parse forest is a compact representation of exponentially many trees in which each node represents a deductive step. Forest reranking (Huang, 2008; Hayashi et al., 2013) approximately decodes the highest-scoring tree in a parse forest with both local and non-local features, using cube pruning (Huang and Chiang, 2005).
In our work, we focus on neural reranking of a k-best list of parses generated by a base parsing system as we could not find any available parsers that are both non-projective and produce packed parse forests as output.

Neural Reranking Models for Dependency Parsing
In this section, we look into reranking for dependency parsing and compare two different types of models: the generative inside-outside recursive neural network (IORNN) reranker (Le and Zuidema, 2014) and the discriminative reranker based on recursive convolutional neural networks (RCNNs) (Zhu et al., 2015). In addition, we propose a new reranking model for dependency parsing that employs graph convolutional networks (GCNs) to encode the trees.

Generative models
A generative reranking model scores a dependency structure by estimating its generation probability. The probability of generating a fragment D of a dependency tree (e.g., a node) depends on its dependency context C_D. The amount of information used in C_D is called the order of the generative model. Ideally, we want to generate a dependency subtree D based on the ∞-order context C^∞_D, which includes all ancestors of D, their siblings, and all siblings of D. As an ∞-order counting model is impracticable due to data sparsity, Le and Zuidema (2014) propose the IORNN model, which encodes the context used to generate each node in a dense vector.
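The ∞-order context can be illustrated on a toy tree; the parent/children maps and the flat list output are our own illustration, not the IORNN's actual encoding:

```python
def context_infty(node, parent, children):
    """Collect the ∞-order context of a node: all of its ancestors, the
    siblings of those ancestors, and the node's own siblings (toy version)."""
    ctx = []
    cur = node
    while parent.get(cur) is not None:
        p = parent[cur]
        ctx.append(p)                                # ancestor
        ctx += [s for s in children[p] if s != cur]  # siblings at this level
        cur = p
    return ctx

# Toy tree for "I saw the dog": saw -> {I, dog}, dog -> {the}.
parent = {"the": "dog", "dog": "saw", "I": "saw"}
children = {"saw": ["I", "dog"], "dog": ["the"]}
ctx_the = context_infty("the", parent, children)  # context of "the"
```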
IORNN The IORNN extends the idea of recursive neural networks (Socher et al., 2010) for constituent parsing where the inner representation of a node is computed bottom up. It also adds a second vector to each node, an outer representation, which is computed top down. The inner representation represents the content of the subtree at the current node, while the outer representation represents the context used to generate that node. The model is further adapted to ∞-order dependency trees with partial outer representations that represent the partial context while generating dependents from left to right. For details on how to compute these representations, please refer to Le and Zuidema (2014).
Training The IORNN is trained to maximize the probability of generating each word w given its partial outer representation ō_w:

L(Θ) = (1/m) Σ_{T ∈ D} Σ_{w ∈ T} log P(w | ō_w)

where D is the set of dependency trees, and m is the total number of words.
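The objective can be illustrated with toy per-word probabilities standing in for the IORNN's softmax outputs P(w | ō_w):

```python
import math

def generative_objective(trees_word_probs):
    """Average log-probability of generating each word given its outer
    context. trees_word_probs: one list of toy P(w | outer context) values
    per tree; the values here stand in for a model's softmax outputs."""
    m = sum(len(tree) for tree in trees_word_probs)
    return sum(math.log(p) for tree in trees_word_probs for p in tree) / m
```

A model that assigns probability 1.0 to every word reaches the maximum objective of 0.0; lower probabilities push the average log-likelihood below zero.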

Discriminative models
In contrast to generative models, a discriminative reranker learns to distinguish the correct parse tree of a sentence from incorrect ones. Since the tree space is huge, one cannot generate all possible trees to train the model, but can only use a subset of the trees generated by the base parser. Therefore, a discriminative reranker is optimized for one specific parser and can easily overfit the error types of its k-best lists. The common idea of all models in this section is to encode the structure of a dependency tree via its node and/or edge representations. Node representations are computed either recursively bottom-up (RCNN) or layer-wise from a node's neighborhood (GCN).
RCNN An RCNN recursively encodes each subtree with regard to its children using a convolutional layer. At each dependency node h, an RCNN module computes a hidden representation and a plausibility score s(h) based on the representations of its children; for details, see Zhu et al. (2015). Given a sentence x and its dependency tree y, the score of y is computed by summing up the scores of all inner nodes h:

s(x, y, Θ) = Σ_{h ∈ y} s(h)

The network then outputs the predicted tree ŷ from the input list gen(x) as the candidate with the highest score:

ŷ = argmax_{t ∈ gen(x)} s(x, t, Θ)     (3)

The bottom-up fashion used in the RCNN can cause a disproportion between the tree structure and its representation due to the order of the recursive computation. Consider two trees that differ in only one edge. Their node representations will be more similar if the edge appears higher up in the tree, and less so if the edge is closer to the lower levels, since the difference spreads to the upper levels. We therefore believe that a discriminative reranker can benefit from a model that treats the nodes in a tree more equally, as done in our GCN model below.
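The bottom-up score aggregation can be sketched with a toy node scorer in place of the RCNN's convolutional module:

```python
def tree_score(node, children, node_score):
    """Sum node scores bottom-up over a dependency tree, in the spirit of
    the RCNN's recursive aggregation. node_score is a toy stand-in for the
    learned plausibility scorer s(h)."""
    total = node_score(node)
    for child in children.get(node, []):
        total += tree_score(child, children, node_score)
    return total

# Toy tree for "I saw the dog"; toy scorer: word length.
children = {"saw": ["I", "dog"], "dog": ["the"]}
s = tree_score("saw", children, node_score=len)
```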
GCN GCNs have been used to encode the nodes of a graph with (syntactic) information from their neighbors. By stacking several GCN layers, the learned representations can capture information about directly connected nodes (with one layer) or nodes that are K hops away (with K layers). We adapt the syntactic gated GCNs for semantic role labeling from Marcheggiani and Titov (2017) to encode the parse trees in our experiments. To the best of our knowledge, this is the first time GCNs are used for reranking in dependency parsing.
Let h_v^(K) be the hidden representation of node v after K GCN layers. The plausibility score of each tree is the sum of the scores of all nodes in the tree:

s(x, y, Θ) = Σ_{v ∈ y} s(h_v^(K))

Training Given an input sentence x, the input to the reranker is the corresponding correct parse tree y and the list of trees gen(x) generated by a base parser. As in conventional ranking systems, all discriminative rerankers can be trained with a margin-based hinge loss, so that the score of the correct tree is higher than the score of an incorrect tree t ∈ gen(x) by a margin of at least m:

J(Θ) = max(0, m + s(x, t, Θ) − s(x, y, Θ))

Zhu et al. (2015) use a structured margin m = κ∆(y, t), where ∆(y, t) is computed by counting the number of incorrect edges of t with respect to y, and κ is a discount hyperparameter indicating the importance of ∆ in the loss. In addition, the tree ŷ predicted by the model (i.e., the highest-scored tree, equation 3) is used as t when calculating the final loss. Alternatively, the loss of the predicted tree can be replaced by the average loss over all trees in the list.
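A minimal sketch of the two ingredients follows: a much simplified, weight-free and ungated graph-convolution step over a tree's neighborhood, and the margin-based hinge loss. Both are illustrations, not Marcheggiani and Titov's gated, edge-typed model:

```python
def gcn_layer(H, adj, relu=True):
    """One simplified, weight-free GCN layer: each node sums its own and its
    tree-neighbors' feature vectors, then applies a ReLU. H: node -> feature
    list; adj: node -> list of neighboring nodes (undirected tree edges)."""
    out = {}
    for v, h in H.items():
        agg = list(h)
        for u in adj.get(v, []):
            agg = [a + b for a, b in zip(agg, H[u])]
        out[v] = [max(a, 0.0) for a in agg] if relu else agg
    return out

def hinge_loss(score_gold, score_incorrect, margin):
    """Margin-based hinge loss used to train the discriminative rerankers:
    zero once the gold tree outscores the incorrect tree by the margin."""
    return max(0.0, margin + score_incorrect - score_gold)
```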

Mixture reranking models
None of the models above considers the scores from the base parser when ranking trees. It therefore seems plausible to combine the strengths of both the base parser and the reranker to produce a better final model. The most common way to do so is to treat the base parser and the reranker as a mixture model: the score s_r of any reranking model can be combined with the score s_b of the base parser using a linear combination

s(x, y) = α s_r(x, y, Θ) + (1 − α) s_b(x, y)     (6)

where α ∈ [0, 1] is a parameter.
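The linear combination can be sketched as follows (the function name is ours):

```python
def mixture_score(s_r, s_b, alpha):
    """Mixture-model score: linear combination of the reranker score s_r and
    the base parser score s_b, with alpha in [0, 1] tuned on development data."""
    return alpha * s_r + (1 - alpha) * s_b
```

With alpha = 1 the mixture reduces to the reranker alone, with alpha = 0 to the base parser alone.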

Evaluating Neural Rerankers for Dependency Parsing
We now provide a systematic evaluation of different neural reranking models used to rank the k-best lists generated by different parsers. In our first experiments, we try to reproduce the results for the available rerankers (IORNN, RCNN) on English. After that, we compare the performance of the rerankers on German and Czech data. Unless stated otherwise, results are compared based on UAS and LAS including punctuation.

Data
English Following Zhu et al. (2015), we use the Penn Treebank (PTB) with the standard splits: sections 2-21 for training, section 22 for development and section 23 for testing. Their reranking models are applied to unlabeled trees. The authors used the linear incremental parser from Huang and Sagae (2010) to produce the k-best lists and achieved slight improvements due to differences in optimization. In contrast, we obtained the data and the pretrained model from the public repository. Although not emphasized in their paper, Zhu et al. (2015) obtained the top k parses from the forests (a by-product of dynamic programming) rather than by using beam search. This is very important for reranking because the forest encodes exponentially many trees, so the k-best list extracted from the parse forest has a higher upper bound (Huang and Sagae, 2010).

Following previous work, we refer to the greedy, one-best result of the base parser as the baseline. Oracle worst and best are the lower and upper bound accuracies of the trees in the k-best list, respectively. Top tree results are calculated on the trees in the list scored highest by the base parser. Table 1 shows that both our baseline and upper bound results are lower than those of Zhu et al. (2015). Extracting the top trees from the parse forest results in a much higher upper bound (+3.97% on the development set) than using beam search (+1.46%, not shown here). The maximum gain of our k-best list at k = 64 using the forest is about 1% lower than in Zhu et al. (2015).
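The baseline/oracle terminology can be made concrete with a small sketch (toy head indices; a real evaluation would also handle punctuation conventions):

```python
def uas(pred_heads, gold_heads):
    """Unlabeled attachment score: fraction of tokens with the correct head."""
    return sum(p == g for p, g in zip(pred_heads, gold_heads)) / len(gold_heads)

def oracle_bounds(kbest_heads, gold_heads):
    """Oracle best and worst: upper and lower bound UAS over one k-best list
    (trees are represented here as plain lists of head indices, 0 = root)."""
    scores = [uas(heads, gold_heads) for heads in kbest_heads]
    return max(scores), min(scores)

# Toy 4-token sentence with two candidate trees.
gold = [0, 3, 3, 0]
kbest = [[0, 3, 3, 0], [0, 1, 3, 0]]
best, worst = oracle_bounds(kbest, gold)
```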
German We use the German dataset from the SPMRL 2014 Shared Task (Seddah et al., 2014), which contains 50,000 sentences of newspaper text. We follow the original train/dev/test splits and use the predicted POS and morphological tags provided by the shared task organizers. The top k parses are produced with the graph-based parser in the MATE tools (Bohnet, 2010), a non-neural model that employs second-order approximate non-projective parsing (McDonald and Pereira, 2006). The algorithm first finds the highest-scoring projective tree with exact inference, then rearranges the edges one at a time as long as the overall score improves and the parse tree does not violate the tree constraint. This search process also yields a list of k-best trees. We also tried to generate the k-best lists with a transition-based parser by adding a beam search decoder, but the beam failed to improve the parsing upper bound.
Czech We use the Czech Universal Dependencies (UD) Treebank, based on the Prague Dependency Treebank 3.0 (Bejček et al., 2013). We use the original train/dev/test split and use MarMoT (Mueller et al., 2013) to predict UD POS tags by 5-way jackknifing. The k-best lists are created using the same parser as for German. The properties of the k-best lists extracted from the German and Czech data are shown in table 2. Extracting the top k parses results in scores lower than the baseline when using the top trees as output, as the reranking scores do not always correlate with the quality of the trees.

Pre-trained word embeddings In all experiments on English, we use the 50-dimensional GloVe word embeddings (Pennington et al., 2014) trained on Wikipedia 2014 and Gigaword 5. For German, we train 100-dimensional dependency-based word embeddings (Levy and Goldberg, 2014) on the SdeWaC corpus (Faaß and Eckart, 2013) with a cutoff frequency of 20 for both words and contexts, and set the number of negative samples to 15. In the experiments on Czech, we reduce the number of dimensions of the word vectors from fastText (Bojanowski et al., 2017) to 100 using PCA (Raunak et al., 2019).

Reproducing reranking results for PTB
This section is dedicated to the reproduction of the published results for the IORNN and RCNN rerankers on the English PTB. All results are from one run, since we observe little variation between different runs (for instance, the standard deviations of 5 runs of the best GCN model setting on the English data are σ_dev = 0.05% and σ_test = 0.07%), and even between different settings the results hardly vary.

IORNN The results from Le and Zuidema (2014) can be reproduced with 93.01% UAS using the data and instructions from the public repository (https://github.com/lephong/iornn-depparse). We are able to replicate this trend on our unlabeled English data described in §4.1, i.e., the reranking results are better than the baseline. The IORNN mixture model achieves 92.06% UAS on the test set, which is lower than the reproduced results on the paper's original data. Our baseline, however, is also lower, due to the use of different rules for the conversion from constituency trees to dependencies and the use of different base parsers. Note that Le and Zuidema (2014) also optimize the results over k, while we keep k fixed in our experiments to make the results comparable between the different models. In addition, the authors apply logarithmic scaling to the score of the reranker in the mixture model combination (equation 6), and we use this function as is. Table 3 summarizes the results from our reproduction study.
RCNN Since the code is not publicly available, we re-implemented the RCNN model following the description in the paper (Zhu et al., 2015). However, we were not able to reproduce the results on the 10-best list extracted from the parse forest. The authors report 93.83% (+1.48) UAS without punctuation using the mixture reranker with k = 64, and the same trend holds for all k. All our attempts to get better results than the base parser fail. Even when combining the reranking score with the score from the base parser, results do not improve over the baseline.
We run an ablation study to investigate the effect of different hyperparameters on the model's performance. We achieve the best scores (UAS 90.65% and 90.29% on the development and test sets, respectively) when removing the L2 regularization and the structured margin, and replacing the largest margin with the average margin. However, one thing we noted during training is that the learning curves indicate severe overfitting. In conclusion, despite our efforts, we were not able to reproduce the RCNN results from Zhu et al. (2015).

Table 4: Accuracies of the RCNN-shared (+BiLSTMs) model on the PTB development set with regard to the size of the k-best list

RCNN-shared
As the learning curves for the RCNN models show severe overfitting, we propose to simplify the original model. The original RCNN has a large number of parameters due to its use of different weight matrices and vectors for the POS tags of the current head-child pair. In the simplified model, we replace those matrices W(h,c) and vectors v(h,c) with a shared matrix W and vector v. Word embeddings and (randomly initialized) POS embeddings are concatenated as input to the RCNN. Following common practice, we also test a model in which several BiLSTM layers are placed before the RCNNs to learn better representations from the input embeddings (+BiLSTMs). By switching from the RCNN to the RCNN-shared model, we are now able to beat the baseline, albeit by only a small margin (UAS 90.65% and 90.29% on the dev and test sets, respectively). We also study the effect of k on the model's performance (table 4). Training the reranker on a larger k-best list improves the UAS by 0.36% on the development set, which shows that the model learns better with more negative examples. Increasing k at test time, on the other hand, hurts performance because the longer list contains more low-quality trees. The drop caused by using a longer list at test time is also smaller (0.20% vs. 0.68%) when the model is trained with more trees.

Reranking with GCNs
We now present results for our new GCN reranking model on the English data (table 5; k = 10, 64). Reranker denotes the ranked list produced by the reranking model alone. Mixture is the result of combining the output score of the rerankers with the score of the base parser as described in §3.3. Following Zhu et al. (2015), we do not use the exact linear equation (6) but apply logarithmic scaling to the base parser's score. The parameter α is optimized on the development set, which uses the same k as the test set. Since the correct tree is not always in the k-best list, we also show an upper-bound performance for our rerankers where we manually add the gold trees to the input list (with oracle). Note that with oracle is the result of the reranker, not of the mixture reranking model, because the correct tree does not have a score from the base parser if it is not included in the k-best list.

Combining the scores from both the reranker and the base parser consistently improves over the reranking score alone (except for the GCN reranker with k_test = 10), which confirms our hypothesis that the parser and the reranker complement each other by looking at different scoring criteria. Although the accuracy drops when reranking longer lists, the mixture scores are always higher. Compared to the RCNN-shared models, the GCN models benefit less from the mixture models, maybe because the GCNs rank trees more similarly to the base parser. The upper-bound performance (with oracle) shows that we can still improve results with a better k-best list. Interestingly, although we achieve only modest improvements compared to Zhu et al. (2015), our upper bound is higher than theirs. A comparison of results with the original RCNN paper on their data is given in table 6.

Neural Reranking for MRLs
We now evaluate the reranking models that have proved effective for English (IORNN, RCNN-shared (+BiLSTMs) and GCNs) on the German and Czech data. Note that the RCNN model only ranks unlabeled trees, while the other two models also consider the dependency labels, which is particularly important for non-configurational languages. All models are trained with the same hyperparameter settings as for English. The mixture scores are combined using equation 6, except that we optimize the IORNN mixture model using the original tool provided by the authors.
The results for the different reranking models are presented in tables 7 and 8. We also analyze the per-label difference between the reranking results and the baseline (table 9). Although the overall accuracy is similar, our reranking results show a better performance for core arguments (nsubj: subject, obj: direct object, iobj: indirect object) and conjunctions (conj).

Analysis
Through our experiments, we have shown that neural reranking models which have demonstrated their effectiveness on English data fail to improve baseline parsing results when applied to German and Czech. This raises the question of whether this failure is due to differences between the languages or simply due to the lower quality of the German and Czech k-best lists that are input to the rerankers. It is conceivable that language-specific properties such as the freer word order and richer morphology of German and Czech make it harder for our models to learn a good representation capturing the quality of a specific parse tree. However, when we add the correct parse tree to the k-best list (with oracle results in tables 5, 7 and 8), the accuracy goes up to 94% for English, German and Czech, which effectively rules out the first explanation. This points to the method used to obtain the k-best list as the main factor responsible for the low results for German and Czech. Beam search, although straightforward to implement, fails to create high-quality k-best lists for the base parsers used for both languages (§4.1). While several projective parsers support k-best parsing (Huang and Sagae, 2010; McDonald and Pereira, 2006), there is, to the best of our knowledge, no out-of-the-box parsing system that implements an effective non-projective k-best parsing algorithm (such as, for example, Hall (2007)'s algorithm).
Gold tree ratio Clearly, the (upper bound) tree accuracy in the k-best list determines the reranking performance. In all datasets, we observe that this accuracy decreases as sentence length increases. Overall, the (unlabeled) tree accuracy in the English k-best lists is ∼5% higher than in the German data, but lower than in the Czech data. This, however, is not caused by a larger amount of long sentences in the German data: for sentences of the same length, the top k trees from the PTB contain more gold trees than those from the German SPMRL and Czech UD datasets. We further study the effect of the gold tree ratio on reranking by removing gold trees from the k-best lists to reduce the ratio to a certain level. Figure 1 shows that the gold tree ratio strongly correlates with the reranking results.
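The gold tree ratio used in this analysis can be computed as follows (a sketch over toy tree representations):

```python
def gold_tree_ratio(kbest_lists, gold_trees):
    """Fraction of sentences whose gold tree appears in their k-best list.
    Trees are any hashable/comparable representation, here tuples of edges."""
    hits = sum(gold in kbest for kbest, gold in zip(kbest_lists, gold_trees))
    return hits / len(gold_trees)

# Toy corpus of two sentences: the first k-best list contains its gold tree,
# the second does not.
kbest_lists = [[("a",), ("b",)], [("c",)]]
gold_trees = [("a",), ("d",)]
ratio = gold_tree_ratio(kbest_lists, gold_trees)
```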
k-best list variation We measure the variation between the trees in the k-best lists by calculating the standard deviation of their UAS. Figure 2 illustrates the distribution of the UAS standard deviation for the three languages at k = 10. The tree UAS variation in the English data is the highest, followed by German and then Czech, which shows that the re-arranging method used to generate the German and Czech k-best trees tends to return more similar trees. We hypothesize that reranking benefits from diversity, especially if the data contains hard negative examples (incorrect trees that are very similar to the correct one). The gap between the reranker performance and the with oracle results shows that the reranker is able to detect the correct tree among the incorrect ones when they are very different from each other.
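The variation measure is simply the standard deviation of the candidate UAS values per sentence; a sketch:

```python
import statistics

def kbest_variation(kbest_uas):
    """Population standard deviation of the UAS values in one k-best list:
    a proxy for how diverse the candidate trees are."""
    return statistics.pstdev(kbest_uas)
```

A list of near-identical trees yields a variation close to zero; a diverse list with both strong and weak candidates yields a larger value.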
Reranking models Among the neural rerankers, the RCNNs are prone to error propagation from the lower levels, and the IORNNs are sensitive to the order of the child nodes. Neither model works very well when moving to German and Czech, in contrast to the GCNs, which disregard top-down and left-to-right order.
In practice, parser output reranking is not a very cost-effective way to improve parsing performance unless we have a fast way to generate high-quality output trees. However, the small improvement on core arguments might be useful for downstream applications that require high-quality prediction of core arguments.

Conclusion
We have evaluated recent neural techniques for reranking dependency parser output for English, German and Czech and presented a novel reranking model, based on graph convolutional networks (GCNs). We were able to reproduce results for English, using existing rerankers, and showed that our novel GCN-based reranker even outperformed them. However, none of the rerankers works well on the two morphologically rich(er) languages.
Our analysis gave some insights into this issue. We showed that the failure of the rerankers to improve results for German and Czech over the baseline is due to the lower quality of the k-best lists. Here the gold tree ratio in the k-best list plays an important role, as the discriminative rerankers are very well able to distinguish the gold trees from other trees in the list, but their performance drops notably when we remove the gold trees from the list. In addition, we observe a higher diversity in the English k-best lists, as compared to German and Czech, which helps the rerankers to learn the differences between high- and low-quality trees.
We conclude that the prerequisite for improving dependency parsing with neural reranking is a diverse k-best list with a high gold-tree ratio. The latter is much harder to achieve for MRLs where the freer word order and high amount of non-projectivity result in a larger number of tree candidates, reflected by a lower gold tree ratio.

A.2 IORNN reranker
We use the code provided by Le and Zuidema (2014) to train all IORNN rerankers with default hyperparameters for English, German and Czech. The default number of training epochs is 50. Due to time limits, we could only train the model for Czech, the largest of our datasets, for up to 27 epochs, which took 15 days on a CPU (the program processes a single sentence at a time rather than batching or multithreading). For computing the mixture score, we use the tool provided in the repository instead of our own. The authors apply logarithmic scaling to the score of the reranker in the mixture model combination:

s(x, y) = α log s_r(x, y, Θ) + (1 − α) s_b(x, y)

A.3 RCNN, RCNN-shared and GCN rerankers
For all discriminative rerankers, in the experiments with the English data, we apply logarithmic scaling to the score of the base parser in the mixture model combination:

s(x, y) = α s_r(x, y, Θ) + (1 − α) log s_b(x, y)

In the experiments with the German and Czech data, we do not scale the score of the base parser and use equation 6 as is.
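The two scaling variants from A.2 and A.3 can be sketched side by side (function names are ours):

```python
import math

def mixture_iornn(s_r, s_b, alpha):
    """A.2 variant: log-scale the reranker score (IORNN mixture model)."""
    return alpha * math.log(s_r) + (1 - alpha) * s_b

def mixture_discriminative_en(s_r, s_b, alpha):
    """A.3 variant (English experiments): log-scale the base parser score."""
    return alpha * s_r + (1 - alpha) * math.log(s_b)
```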