Syntactic Parse Fusion

Model combination techniques have consistently shown state-of-the-art performance across multiple tasks, including syntactic parsing. However, they dramatically increase runtime and can be difficult to employ in practice. We demonstrate that applying constituency model combination techniques to n-best lists instead of n different parsers results in significant parsing accuracy improvements. Parses are weighted by their probabilities and combined using an adapted version of Sagae and Lavie (2006). These accuracy gains come with marginal computational costs and are obtained on top of existing parsing techniques such as discriminative reranking and self-training, resulting in state-of-the-art accuracy: 92.6% on WSJ section 23. On out-of-domain corpora, accuracy is improved by 0.4% on average. We empirically confirm that six well-known n-best parsers benefit from the proposed methods across six domains.


Introduction
Researchers have proposed many algorithms to combine parses from multiple parsers into one final parse (Henderson and Brill, 1999; Zeman and Žabokrtský, 2005; Sagae and Lavie, 2006; Nowson and Dale, 2007; Fossum and Knight, 2009; Petrov, 2010; Johnson and Ural, 2010; Huang et al., 2010; McDonald and Nivre, 2011; Shindo et al., 2012; Narayan and Cohen, 2015). These new parses are substantially better than the originals: Zhang et al. (2009) combine outputs from multiple n-best parsers and achieve an F1 of 92.6% on the WSJ test set, a 0.5% improvement over their best n-best parser. Model combination approaches tend to fall into the following categories: hybridization, where multiple parses are combined into a single parse; switching, which picks a single parse according to some criteria (usually a form of voting); grammar merging, where grammars are combined before or during parsing; and stacking, where one parser sends its prediction to another at runtime. All of these have at least one of the following caveats: (1) overall computation is increased and runtime is determined by the slowest parser, and (2) using multiple parsers increases system complexity, making it more difficult to deploy in practice. In this paper, we describe a simple hybridization extension ("fusion") which obtains much of hybridization's benefits while using only a single n-best parser and minimal extra computation. Our method treats each parse in a single parser's n-best list as a parse from n separate parsers. We then adapt the parse combination methods of Henderson and Brill (1999), Sagae and Lavie (2006), and Fossum and Knight (2009) to fuse the constituents from the n parses into a single tree. We empirically show that six n-best parsers benefit from parse fusion across six domains, obtaining state-of-the-art results. These improvements are complementary to other techniques such as reranking and self-training.
Our best system obtains an F1 of 92.6% on WSJ section 23, a score previously obtained only by combining the outputs from multiple parsers. A reference implementation is available as part of BLLIP Parser at http://github.com/BLLIP/bllip-parser/


Fusion

Henderson and Brill (1999) propose a method to combine trees from m parsers in three steps: populate a chart with constituents along with the number of times they appear in the trees; remove any constituent with count less than m/2 from the chart; and finally create a final tree with all the remaining constituents. Intuitively, their method constructs a tree with constituents from the majority of the trees, which boosts precision significantly. Henderson and Brill (1999) show that this process is guaranteed to produce a valid tree. Sagae and Lavie (2006) generalize this work by reparsing the chart populated with constituents whose counts are above a certain threshold. By adjusting the threshold on development data, their generalized method balances precision and recall. Fossum and Knight (2009) further extend this line of work by using n-best lists from multiple parsers and combining productions in addition to constituents. Their model assigns to each constituent a sum of joint probabilities over constituents and parsers. Surprisingly, exploiting n-best trees does not lead to large improvements over combining 1-best trees in their experiments.
Our extension takes the n-best trees from a parser as if they are 1-best parses from n parsers, then follows Sagae and Lavie (2006). Parses are weighted by the estimated probabilities from the parser. Given n trees and their weights, the model computes a constituent's weight by summing the weights of all trees containing that constituent. Concretely, the weight of a constituent with label ℓ spanning from the ith word to the jth word is

weight(i, j, ℓ) = Σ_{k=1}^{n} W(k) · C_k(i, j, ℓ)

where W(k) is the weight of the kth tree and C_k(i, j, ℓ) is one if a constituent with label ℓ spanning from i to j is in the kth tree, zero otherwise. After populating the chart with constituents and their weights, the model discards constituents with weights below a set threshold t. Using the threshold t = 0.5 emulates the method of Henderson and Brill (1999) in that it constructs the tree from the constituents appearing in the majority of the trees. The CYK parsing algorithm is applied to the chart to produce the final tree. Note that populating the chart is linear in the number of words, and the chart contains substantially fewer constituents than the charts in well-known parsers, making this a fast procedure.
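The chart-population and thresholding steps can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes constituents are represented as (i, j, label) tuples, that parse weights are already normalized, and it omits the final CYK pass over the surviving chart.

```python
from collections import defaultdict

def fuse_constituents(nbest, weights, t=0.5):
    """Populate a chart with weighted constituents and apply threshold t.

    nbest   : list of parses, each a set of (i, j, label) constituent spans
    weights : normalized parse weights (summing to 1)
    t       : minimum total weight for a constituent to stay in the chart
    Returns the chart of surviving constituents; a CYK pass over this
    chart (not shown) would assemble the final tree.
    """
    chart = defaultdict(float)
    for parse, w in zip(nbest, weights):
        for span in parse:            # span = (i, j, label)
            chart[span] += w          # sum weights of trees containing span
    return {span: w for span, w in chart.items() if w >= t}

# Toy 3-best list with equal weights (majority voting at t = 0.5):
nbest = [
    {(0, 3, "S"), (0, 1, "NP"), (1, 3, "VP")},
    {(0, 3, "S"), (0, 1, "NP"), (1, 3, "VP")},
    {(0, 3, "S"), (0, 2, "NP"), (2, 3, "VP")},
]
chart = fuse_constituents(nbest, [1/3, 1/3, 1/3])
# majority constituents survive; minority spans (0,2,NP) and (2,3,VP) drop
```

With equal weights and t = 0.5 this reduces to the Henderson and Brill (1999) majority scheme; unequal weights give the fusion variant described above.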

Score distribution over trees
We assume that n-best parsers provide trees along with some kind of scores (often probabilities or log probabilities). Given these scores, a natural way to obtain weights is to normalize the probabilities. However, parsers do not always provide accurate estimates of parse quality. We may obtain better performance from parse fusion by altering this distribution, passing scores through a nonlinear function f(·). The kth parse is weighted

W(k) = f(SCORE(k)) / Σ_{i=1}^{n} f(SCORE(i))

where SCORE(i) is the score of the ith tree. We explore the family of functions f(x) = x^β, which can smooth or sharpen the score distributions via a tunable parameter β ≥ 0. Employing β < 1 flattens the score distribution over n-best trees and helps over-confident parsers.
On the other hand, having β > 1 skews the distribution toward parses with higher scores and helps under-confident parsers. Note that setting β = 0 weights all parses equally and results in majority voting at the constituent level. We leave developing other nonlinear functions for fusion as future work.
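The weighting scheme above can be sketched in a few lines. This is a hypothetical helper (the function name is ours), assuming the raw scores are nonnegative probabilities:

```python
def parse_weights(scores, beta=1.0):
    """Normalize parser scores into parse weights,
    W(k) = f(SCORE(k)) / sum_i f(SCORE(i)) with f(x) = x**beta.

    beta < 1 flattens the distribution (helps over-confident parsers);
    beta > 1 sharpens it (helps under-confident parsers);
    beta = 0 gives uniform weights, i.e. constituent-level majority voting.
    """
    scaled = [s ** beta for s in scores]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]                  # hypothetical n-best probabilities
flat  = parse_weights(probs, beta=0.0)   # uniform weights
sharp = parse_weights(probs, beta=2.0)   # mass shifts toward the top parse
```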

Experiments
Corpora: Parse fusion is evaluated on the British National Corpus (BNC), Brown, GENIA, Question Bank (QB), Switchboard (SB) and Wall Street Journal (WSJ) corpora (Foster and van Genabith, 2008; Francis and Kučera, 1989; Kim et al., 2003; Judge et al., 2006; Godfrey et al., 1992; Marcus et al., 1993). WSJ is used to evaluate in-domain parsing; the remaining five are used for out-of-domain evaluation. For divisions, we use the tune and test splits from Bacchiani et al. (2006).

Table 1: Six parsers along with their 1-best F1 scores, unlabeled attachment scores (UAS) and labeled attachment scores (LAS) on WSJ section 23.
Supervised parsers are trained on the WSJ training set (sections 2-21) and use section 22 or 24 for development. Self-trained BLLIP is self-trained using two million sentences from Gigaword, and Stanford RNN uses word embeddings trained from larger corpora.

Parameter tuning: There are three parameters for our fusion process: the size of the n-best list (2 < n ≤ 50), the smoothing exponent from Section 2.1 (β ∈ [0.5, 1.5] with 0.1 increments), and the minimum threshold for constituents (t ∈ [0.2, 0.7] with 0.01 increments). We use grid search to tune these parameters for two separate scenarios. When parsing WSJ (in-domain), we tune parameters on WSJ section 24. For the remaining corpora (out-of-domain), we use the tuning section from Brown. Each parser is tuned separately, resulting in 12 different tuning scenarios. In practice, though, in-domain and out-of-domain tuning regimes tend to pick similar settings within a parser. Across parsers, settings are also fairly similar (n is usually 30 or 40, t is usually between 0.45 and 0.5). While the smoothing exponent varies from 0.5 to 1.3, setting β = 1 does not significantly hurt accuracy for most parsers.
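The grid search over (n, β, t) can be sketched as follows. This is a toy illustration: `dev_f1` is a hypothetical stand-in for the real procedure of fusing the development-set parses under a given setting and scoring the result with F1.

```python
import itertools

def tune_fusion(dev_f1):
    """Exhaustive grid search over the fusion parameters, mirroring the
    ranges described above: 2 < n <= 50, beta in [0.5, 1.5] by 0.1,
    t in [0.2, 0.7] by 0.01. Returns the (n, beta, t) triple that
    maximizes dev-set F1 as computed by the supplied dev_f1 callable."""
    ns    = range(3, 51)
    betas = [0.5 + 0.1 * i for i in range(11)]
    ts    = [0.2 + 0.01 * i for i in range(51)]
    return max(itertools.product(ns, betas, ts),
               key=lambda params: dev_f1(*params))

# Illustrative stand-in with its peak at the BLLIP settings reported below:
demo_f1 = lambda n, beta, t: -(abs(n - 30) + abs(beta - 1.1) + abs(t - 0.47))
best = tune_fusion(demo_f1)
```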
To study the effects of these parameters, Figure 1 shows three slices of the tuning surface for the BLLIP parser on WSJ section 24 around the optimal settings (n = 30, β = 1.1, t = 0.47). In each graph, one of the parameters is varied while the others are held constant. Increasing the n-best size improves accuracy until about n = 30, where there seems to be sufficient diversity. For BLLIP, the smoothing exponent (β) is best set around 1.0, with accuracy falling off if the value deviates too much. Finally, the threshold parameter is empirically optimized a little below t = 0.5 (the value suggested by Henderson and Brill (1999)). Since score values are normalized, this means that constituents need roughly half the "score mass" in order to be included in the chart. Varying the threshold changes the precision/recall balance, since a high threshold adds only the most confident constituents to the chart (Sagae and Lavie, 2006).

Baselines: Table 2 gives the accuracy of fusion and baselines for BLLIP on the development corpora. Majority voting sets n = 50, β = 0, t = 0.5, giving all parses equal weight, and results in constituent-level majority voting. We also explore a rank-based weighting which ignores parse probabilities and weights parses using only their rank: W_rank(k) = 1/2^k. These baselines show that accurate parse-level scores are critical for good performance.
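The rank-based baseline can be sketched directly (the function name is ours):

```python
def rank_weights(n):
    """Rank-based parse weights W_rank(k) = 1 / 2**k for k = 1..n,
    ignoring the parser's own probability estimates entirely."""
    return [1.0 / 2 ** k for k in range(1, n + 1)]

w = rank_weights(5)  # [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Note that these weights sum to 1 - 2^(-n) rather than exactly 1, which is harmless since the threshold t is tuned.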
Final evaluation: Table 3 gives our final results for all parsers across all domains. Results in blue are significant at p < 0.01 using a randomized permutation test. Fusion generally improves F1 for in-domain and out-of-domain parsing by a significant margin. For the self-trained BLLIP parser, in-domain F1 increases by 0.4% and out-of-domain F1 increases by 0.4% on average. The Berkeley parser obtains the smallest gains from fusion since Berkeley's n-best lists are ordered by factors other than probabilities. As a result, the probabilities from Berkeley can mislead the fusion process.
We also compare against model combination using our reimplementation of Sagae and Lavie (2006). For these results, all six parsers were given equal weight. The threshold was set to 0.42 to optimize model combination F1 on development data (similar to Setting 2 for constituency parsing in Sagae and Lavie (2006)). Model combination performs better than fusion on BNC and GENIA, but surprisingly fusion outperforms model combination on three of the six domains (though usually not by a significant margin). With further tuning (e.g., specific weights for each constituent-parser pair), the benefits from model combination should increase.

Figure 1: Tuning parameters independently for BLLIP and their impact on F1 for WSJ section 24 (solid purple line). For each graph, non-tuned parameters were set at the optimal configuration for BLLIP (n = 30, β = 1.1, t = 0.47).

Table 3: Evaluation of the constituency fusion method on six parsers across six domains. x/y indicates the F1 from the baseline parser (x) and the baseline parser with fusion (y) respectively. Blue indicates a statistically significant difference between fusion and its baseline parser (p < 0.01).
Multilingual evaluation: We evaluated fusion with the Berkeley parser on Arabic (Maamouri et al., 2004; Green and Manning, 2010), French (Abeillé et al., 2003), and German (Brants et al., 2002) from the SPMRL 2014 shared task (Seddah et al., 2014) but did not observe any improvement. We suspect this has to do with the same ranking issues seen in the Berkeley parser's English results. On the other hand, fusion helps the parser of Narayan and Cohen (2015) on the German NEGRA treebank (Skut et al., 1997) to improve from 80.9% to 82.4%.
Runtime: As discussed in Section 2, fusion's runtime overhead is minimal. Reranking parsers (e.g., BLLIP and Stanford RNN) already need to perform n-best decoding as input for the reranker. Using a somewhat optimized implementation of fusion in C++, the overhead over the BLLIP parser is less than 1%.
Discussion: Why does fusion help? It is possible that a parser's n-best list and its scores act as a weak approximation to the full parse forest. As a result, fusion seems to provide part of the benefits seen in forest reranking (Huang, 2008).
Results from Fossum and Knight (2009) imply that fusion and model combination might not be complementary. Both n-best lists and additional parsers provide syntactic diversity. While additional parsers provide greater diversity, n-best lists from common parsers are varied enough to provide improvements for parse hybridization.
We analyzed how often fusion produces completely novel trees. For BLLIP on WSJ section 24, this only happens about 11% of the time. Fusion picks the 1-best tree 72% of the time. This means that for the remaining 17%, fusion picks an existing parse from the rest of the n-best list, acting similarly to a reranker. When fusion creates novel trees, they are significantly better than the original 1-best trees (for the 11% subset of WSJ section 24, F1 scores are 85.5% with fusion and 84.1% without, p < 0.003). This contrasts with McClosky et al. (2012), where novel predictions from model combination (stacking) were worse than baseline performance. The difference is that novel predictions with fusion better incorporate model confidence, whereas when stacking, a novel prediction is less trusted than those produced by one or both of the base parsers.

Preliminary extensions: Here, we summarize two extensions to fusion which have yet to show benefits. The first extension explores applying fusion to dependency parsing. We explored two ways to apply fusion when starting from constituency parses: (1) fuse constituents and then convert them to dependencies, and (2) convert to dependencies and then fuse the dependencies as in Sagae and Lavie (2006). Approach (1) does not provide any benefit (LAS drops between 0.5% and 2.4%). This may result from fusion's artifacts, including unusual unary chains or nodes with a large number of children; it is possible that adjusting unary handling and the precision/recall tradeoff may reduce these issues. Approach (2) provided only modest benefits compared to those from constituency parsing fusion. The largest LAS increase for (2) is 0.6% for the Stanford Parser, though for Berkeley and Self-trained BLLIP, dependency fusion results in small losses (-0.1% LAS).
Two possible reasons are that the dependency baseline is higher than its constituency counterpart, and that some dependency graphs from the n-best list are duplicates, which lowers diversity and may need special handling; this remains an open question.
While fusion helps on top of a self-trained parser, we also explored whether a fused parser can self-train (McClosky et al., 2006). To test this, we (1) parsed two million sentences with BLLIP (trained on WSJ), (2) fused those parses, (3) added the fused parses to the gold training set, and (4) retrained the parser on the expanded training set. The resulting model did not perform better than a self-trained parsing model that did not use fusion.

Conclusions
We presented a simple extension to parse hybridization which adapts model combination techniques to operate over a single parser's n-best list instead of across multiple parsers. By weighting each parse by its probability from the n-best parser, we are able to better capture confidence at the constituent level. Our best configuration obtains state-of-the-art accuracy on WSJ with an F1 of 92.6%. This is similar to the accuracy obtained from actual model combination techniques but at a fraction of the computational cost. Additionally, improvements are not limited to a single parser or domain. Fusion improves parser accuracy for six n-best parsers both in-domain and out-of-domain.
Future work includes applying fusion to n-best dependency parsers and additional (parser, language) pairs. We also intend to explore how to better apply fusion to converted dependencies from constituency parsers. Lastly, it would be interesting to adapt fusion to other structured prediction tasks where n-best lists are available.