Cross-Domain Generalization of Neural Constituency Parsers

Neural parsers obtain state-of-the-art results on benchmark treebanks for constituency parsing—but to what degree do they generalize to other domains? We present three results about the generalization of neural parsers in a zero-shot setting: training on trees from one corpus and evaluating on out-of-domain corpora. First, neural and non-neural parsers generalize comparably to new domains. Second, incorporating pre-trained encoder representations into neural parsers substantially improves their performance across all domains, but does not give a larger relative improvement for out-of-domain treebanks. Finally, despite the rich input representations they learn, neural parsers still benefit from structured output prediction of output trees, yielding higher exact match accuracy and stronger generalization both to larger text spans and to out-of-domain corpora. We analyze generalization on English and Chinese corpora, and in the process obtain state-of-the-art parsing results for the Brown, Genia, and English Web treebanks.


Introduction
Neural constituency parsers have obtained increasingly high performance when measured by F1 scores on in-domain benchmarks, such as the Wall Street Journal (WSJ) (Marcus et al., 1993) and Penn Chinese Treebank (CTB) (Xue et al., 2005). However, in order to construct systems useful for cross-domain NLP, we seek parsers that generalize well to domains other than the ones they were trained on. While classical, non-neural parsers are known to perform better in their training domains than on out-of-domain corpora, their out-of-domain performance degrades in well-understood ways (Gildea, 2001; Petrov and Klein, 2007), and improvements in performance on in-domain treebanks generally transfer to out-of-domain improvements (McClosky et al., 2006).
* Equal contribution.
Is the success of neural constituency parsers (Henderson 2004; Vinyals et al. 2015; Dyer et al. 2016; Cross and Huang 2016; Choe and Charniak 2016; Stern et al. 2017; Liu and Zhang 2017; Kitaev and Klein 2018, inter alia) similarly transferable to out-of-domain treebanks? In this work, we focus on zero-shot generalization: training parsers on a single treebank (e.g. WSJ) and evaluating on a range of broad-coverage, out-of-domain treebanks (e.g. Brown (Francis and Kučera, 1979), Genia (Tateisi et al., 2005), the English Web Treebank (Petrov and McDonald, 2012)). We ask three questions about zero-shot generalization properties of state-of-the-art neural constituency parsers:
First, do non-neural parsers have better out-of-domain generalization than neural parsers? We might expect neural systems to generalize poorly because they are highly parameterized, and may overfit to their training domain. We find that neural and non-neural parsers generalize similarly, and, encouragingly, improvements on in-domain treebanks still transfer to out-of-domain treebanks.
Second, does pre-training particularly improve out-of-domain performance, or does it just generally improve test accuracies? Neural parsers incorporate rich representations of language that can easily be pre-trained on large unlabeled corpora (Ling et al., 2015; Peters et al., 2018; Devlin et al., 2019) and improve accuracies in new domains (Joshi et al., 2018). Past work has shown that lexical supervision on an out-of-domain treebank can substantially improve parser performance (Rimell and Clark, 2009). Similarly, we might expect pre-trained language representations to give the largest improvements on out-of-domain treebanks, by providing representations of language disparate from the training domains. Surprisingly, however, we find that pre-trained representations give similar error reductions across domains.

Table 1: Performance and relative increase in error (both given by F1) on English corpora as parsers are evaluated out-of-domain, relative to performance on the in-domain WSJ Test set. Improved performance on WSJ Test translates to improved performance out-of-domain. The two parsers with similar absolute performance on WSJ (BLLIP and In-Order) have comparable generalization out-of-domain, despite one being neural and one non-neural.

Finally, how much does structured prediction help neural parsers? While neural models with rich modeling of syntactic structure have obtained strong performance on parsing (Dyer et al., 2016;Liu and Zhang, 2017) and a range of related tasks (Kuncoro et al., 2018;Hale et al., 2018), recent neural parsers obtain state-of-the-art F1 on benchmark datasets using rich input encoders without any explicit modeling of correlations in output structure (Shen et al., 2018;Kitaev and Klein, 2018). Does structural modeling still improve parsing performance even with these strong encoder representations? We find that, yes, while structured and unstructured neural models (using the same encoder representations) obtain similar F1 on in-domain datasets, the structured model typically generalizes better to longer spans and out-of-domain treebanks, and has higher exact match accuracies in all domains.

Experimental setup
We compare the generalization of strong non-neural parsers against recent state-of-the-art neural parsers on English and Chinese corpora.
Non-neural models We use publicly released code and models for the Berkeley Parser (Petrov and Klein, 2007) and BLLIP Parser (Charniak, 2000; Charniak and Johnson, 2005) for English; and ZPar (Zhang and Clark, 2011) for Chinese.
Neural models We use two state-of-the-art neural models: the Chart model of Kitaev and Klein (2018), and the In-Order shift-reduce model of Liu and Zhang (2017). These parsers differ in their modeling both of input sentences and output structures. The Chart model uses a self-attentive encoder over the input sentence, and does not explicitly model output structure correlations, predicting tree span labels independently conditioned on the encoded input. The In-Order shift-reduce model of Liu and Zhang (2017) uses a simpler LSTM-based encoding of the input sentence but a decoder that explicitly conditions on previously constructed structure of the output tree, obtaining the best performance among similarly structured models (Dyer et al., 2016; Kuncoro et al., 2017).
The In-Order model conditions on predicted part-of-speech tags; we use tags predicted by the Stanford tagger (following the setup of Cross and Huang (2016)). At test time, we use Viterbi decoding for the Chart model and beam search with beam size 10 for the In-Order model.
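To make the contrast between the two decoders concrete, the following is a minimal sketch of Viterbi (CKY-style) decoding over independently scored spans, in the spirit of the Chart model. It is an illustration under simplifying assumptions, not the released implementation: the function name `cky_decode`, the dictionary input format, and the use of a single score per span (standing in for the score of the best label) are all hypothetical.

```python
from functools import lru_cache


def cky_decode(span_scores, n):
    """Return (best_score, spans) for the highest-scoring binary tree.

    span_scores maps (i, j), with 0 <= i < j <= n, to a score for the span
    covering words i..j-1. This input format is a hypothetical stand-in for
    the per-span label scores a neural encoder would produce; spans missing
    from the dictionary are scored 0.0.
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        score = span_scores.get((i, j), 0.0)
        if j - i == 1:
            # Single-word span: no split point to choose.
            return score, ((i, j),)
        # Each span is scored independently, so the best tree over (i, j)
        # is the span's own score plus the best left/right subtrees.
        candidates = []
        for k in range(i + 1, j):
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            candidates.append((left_score + right_score, left_spans + right_spans))
        child_score, child_spans = max(candidates, key=lambda c: c[0])
        return score + child_score, ((i, j),) + child_spans

    return best(0, n)


# Toy scores for a three-word sentence (made-up values).
toy_scores = {(0, 3): 1.0, (0, 2): 2.0, (1, 3): 0.5}
best_score, best_spans = cky_decode(toy_scores, 3)
```

Because the tree score is a sum of per-span scores, dynamic programming finds the exact maximum. A decoder such as the In-Order model's, which conditions each action on previously built structure, does not factor this way, which is why approximate search such as beam search is used there instead.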
To control for randomness in the training procedure of the neural parsers, all scores reported in the remainder of the paper for the Chart and In-Order parsers are averaged across five copies of each model trained from separate random initializations.
Corpora The English parsers are trained on the WSJ training section of the Penn Treebank. We perform in-domain evaluation of these parsers on the WSJ test section, and out-of-domain evaluation using the Brown, Genia, and English Web Treebank (EWT) corpora. For analysis and comparisons within parsers, we evaluate on the entirety of each out-of-domain treebank; for final results and comparison to past work we use the same testing splits as the past work.
The Chinese parsers are trained on the training section of the Penn Chinese Treebank (CTB) v5.1 (Xue et al., 2005), consisting primarily of newswire. For out-of-domain evaluation on Chinese, we use treebank domains introduced in CTB versions 7 and 8: broadcast conversations (B. Conv), broadcast news (B. News), web discussion forums (Forums) and weblogs (Blogs).

Table 2: Performance on Chinese corpora and increase in error (relative to the CTB test set) as parsers are evaluated out-of-domain. The non-neural (ZPar) and neural (In-Order) parsers generalize similarly.

How well do neural parsers generalize?
Table 1 compares the generalization performance of the English parsers, both non-neural (Berkeley, BLLIP) and neural (Chart, In-Order). None of these parsers use additional data beyond the WSJ training section of the PTB: we use the version of the BLLIP parser without self-training on unlabeled data, and use the In-Order parser without external pre-trained word embeddings. Across all parsers, higher performance on the WSJ Test set corresponds to higher performance on each out-of-domain corpus, showing that the findings of McClosky et al. (2006) extend to recent neural parsers. In particular, the Chart parser has highest performance in all four domains.
The ∆ Err. column shows the generalization gap for each parser on each corpus: the parser's relative increase in error (with error defined by 100 − F1) from the WSJ Test set (lower values are better). Improved performance on the WSJ Test set corresponds to increased generalization gaps, indicating that to some extent parser improvements on WSJ have come at the expense of out-of-domain generalization. However, the two parsers with similar absolute performance on WSJ (the BLLIP parser and the In-Order parser) have comparable generalization gaps, despite one being neural and one non-neural.
Table 2 shows results for ZPar and the In-Order parser on the Chinese treebanks, with ∆ Err. computed relative to the in-domain CTB Test set. As with the English parsers and treebanks, increased performance on the in-domain test set corresponds to improvements on the out-of-domain treebanks (although these differences are small enough that this result is less conclusive than for English). In addition, as with English, we observe similar generalization performance of the non-neural and neural parsers across the out-of-domain treebanks.

We evaluate non-contextual word embeddings produced by structured skip-gram (Ling et al., 2015), as well as the current state-of-the-art contextual representations from BERT (Devlin et al., 2019).
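As a concrete illustration of the ∆ Err. measure used in Tables 1 and 2, the sketch below computes the relative increase in error from an in-domain to an out-of-domain F1 score, with error defined as 100 − F1 as in the text. The function names are hypothetical, reporting the result as a percentage is an assumption about the tables' units, and the F1 values in the example are made up rather than taken from the paper.

```python
def error(f1):
    """Error as defined in the text: 100 - F1 (F1 on a 0-100 scale)."""
    return 100.0 - f1


def relative_error_increase(f1_in_domain, f1_out_of_domain):
    """Relative increase in error when moving out-of-domain, in percent.

    Expressing the result as a percentage is an assumption made for this
    sketch; lower values indicate a smaller generalization gap.
    """
    err_in = error(f1_in_domain)
    err_out = error(f1_out_of_domain)
    return 100.0 * (err_out - err_in) / err_in


# Made-up scores: in-domain F1 of 90.0 and out-of-domain F1 of 85.0 give
# errors of 10.0 and 15.0, so the error grows by half of its original size.
gap = relative_error_increase(90.0, 85.0)
```

Because the measure is relative to the in-domain error, a parser with a very low in-domain error can show a large gap even when its absolute out-of-domain F1 is the highest, which is consistent with the observation above that better WSJ performance goes with larger generalization gaps.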

Word embeddings
We use the same pre-trained word embeddings as the original In-Order English and Chinese parsers, trained on English and Chinese Gigaword (Parker et al., 2011) respectively. Table 3 compares models without (In-Order column) to models with embeddings (+Embeddings), showing that embeddings give comparable error reductions both in-domain (the WSJ Test and CTB Test rows) and out-of-domain (the other rows).

BERT
For the Chart parser, we compare the base neural model (Sec. 2 and 3) to a model that uses a pre-trained BERT encoder.
For the In-Order parser, we introduce a novel integration of a BERT encoder with the parser's structured tree decoder. These architectures represent the best-performing types of encoder and decoder, respectively, from past work on constituency parsing, but have not been previously combined. We replace the word embeddings and predicted part-of-speech tags in the In-Order parser's stack and buffer representations with BERT's contextual embeddings. See Appendix A.1 for details on the architecture. Code and trained models for this system are publicly available (footnote 4).
Both the Chart and In-Order parsers are trained in the same way: the parameters of the BERT encoder (BERT LARGE, Uncased English or BERT BASE Chinese) are fine-tuned during training on the treebank data, along with the parameters of the parser's decoder. See Appendix A.2 for details.
Results for the In-Order parser are shown in the +BERT section of Table 3, and results for the Chart parser are shown in Table 4. BERT is effective across domains, providing between 25% and 55% error reduction over the base neural parsers. However, as for word embeddings, the pre-trained BERT representations do not generally provide a larger error reduction out-of-domain than in-domain (although a possible confound is that the BERT model is fine-tuned on the relatively small amount of in-domain treebank data, along with the other parser parameters).
For English, error reduction from BERT is comparable between WSJ and EWT, largest on Brown, and smallest on Genia, which may indicate a dependence on the similarity between the out-of-domain treebank and the pre-training corpus. For Chinese, the relative error reduction from BERT is largest on the in-domain CTB Test corpus.
3 https://github.com/nikitakit/self-attentive-parser
4 https://github.com/dpfried/rnng-bert

Can structure improve performance?
When using BERT encoder representations, the Chart parser (with its unstructured decoder) and In-Order parser (with its conditioning on a representation of previously-constructed structure) obtain roughly comparable F1 (shown in the first two columns of Table 5), with In-Order better on seven out of nine corpora but often by slight margins. However, these aggregate F1 scores decompose along the structure of the tree, and are dominated by the short spans which make up the bulk of any treebank. Structured-conditional prediction may plausibly be most useful for predicting larger portions of the tree, measurable in exact match accuracies and in F1 on longer-length spans (containing more substructure).
First, we compare the tree-level exact match accuracies of the two parsers. In the last two columns of Table 5, we see that the In-Order parser consistently achieves higher exact match than the Chart parser across domains (including the in-domain WSJ and CTB Test sets), with improvements ranging from 0.5 to 2.8 percentage points absolute. In fact, for several corpora (Blogs and B. Conv) the In-Order parser outperforms the Chart parser on exact match despite having the same or lower F1. This suggests that conditioning on structure in the model induces a correlation between span-level decisions that becomes most apparent when using a metric defined on the entire structure.
Second, we compare the performance of the two parsers on longer spans of text. Figure 1 plots F1 by minimum span length for the In-Order and Chart parsers with BERT encoders on the English treebanks. Across datasets, the improvement of the In-Order parser is slight when computing F1 across all spans in the dataset (x = 0), but becomes pronounced when considering longer spans. This effect is not observed in the WSJ test set, which may be attributable to its lack of sufficiently many long spans for us to observe a similar effect there. The curves start to diverge at span lengths of around 30-40 words, longer than the median length of a sentence in the WSJ (23 words).
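The two finer-grained metrics discussed above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper: trees are represented as sets of `(label, start, end)` spans, a format assumed here for brevity, and the usual evaluation conventions (for example, punctuation handling and duplicate brackets) are ignored. All function names are hypothetical.

```python
def spans_at_least(spans, min_len):
    """Keep labeled spans (label, start, end) covering at least min_len words."""
    return {(label, i, j) for (label, i, j) in spans if j - i >= min_len}


def f1_by_min_span_length(gold_trees, pred_trees, min_len=0):
    """Corpus-level labeled span F1 (0-100), restricted to longer spans."""
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_trees, pred_trees):
        g = spans_at_least(gold, min_len)
        p = spans_at_least(pred, min_len)
        matched += len(g & p)
        gold_total += len(g)
        pred_total += len(p)
    if matched == 0:
        return 0.0
    precision = matched / pred_total
    recall = matched / gold_total
    return 100.0 * 2 * precision * recall / (precision + recall)


def exact_match(gold_trees, pred_trees):
    """Percentage of sentences whose predicted span set equals the gold one."""
    hits = sum(1 for g, p in zip(gold_trees, pred_trees) if set(g) == set(p))
    return 100.0 * hits / len(gold_trees)


# Two toy sentences (made-up trees): the first prediction gets one short
# span wrong, the second is fully correct.
gold = [{("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4)}, {("S", 0, 2)}]
pred = [{("S", 0, 4), ("NP", 0, 1), ("VP", 2, 4)}, {("S", 0, 2)}]
all_spans_f1 = f1_by_min_span_length(gold, pred, min_len=0)
long_spans_f1 = f1_by_min_span_length(gold, pred, min_len=3)
tree_accuracy = exact_match(gold, pred)
```

In the toy data a single short-span error lowers both the overall F1 and exact match, while F1 restricted to spans of three or more words is unaffected. This shows how the three numbers can move independently, which is why they are reported separately.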

Discussion
Neural parsers generalize surprisingly well, and are able to draw benefits both from pre-trained language representations and structured output prediction. These properties allow single-model parsers to surpass previous state-of-the-art systems on out-of-domain generalization (Table 6; see footnote 6). We note that these systems from prior work (Choe et al., 2015; Petrov and McDonald, 2012; Le Roux et al., 2012) use additional ensembling or self-training techniques, which have also been shown to be compatible with neural constituency parsers (Dyer et al., 2016; Choe and Charniak, 2016; Fried et al., 2017; Kitaev et al., 2019) and may provide benefits orthogonal to the pre-trained representations and structured models we analyze here. Encouragingly, parser improvements on the WSJ and CTB treebanks still transfer out-of-domain, indicating that improving results on these benchmarks may still continue to yield benefits in broader domains.
6 Although the F1 scores obtained here are higher than the zero-shot transfer results of Joshi et al. (2018) on the Brown and Genia corpora due to the use of improved encoder (BERT) and decoder (self-attentive Chart and In-Order) models, we note the results are not directly comparable due to the use of different sections of the corpora for evaluation.