An Empirical Comparison of Unsupervised Constituency Parsing Methods

Unsupervised constituency parsing aims to learn a constituency parser from a training corpus without parse tree annotations. While many methods have been proposed to tackle the problem, including statistical and neural methods, their experimental results are often not directly comparable due to discrepancies in datasets, data preprocessing, lexicalization, and evaluation metrics. In this paper, we first examine experimental settings used in previous work and propose to standardize the settings for better comparability between methods. We then empirically compare several existing methods, including decade-old and newly proposed ones, under the standardized settings on English and Japanese, two languages with different branching tendencies. We find that recent models do not show a clear advantage over decade-old models in our experiments. We hope our work can provide new insights into existing methods and facilitate future empirical evaluation of unsupervised constituency parsing.


Introduction
Unsupervised constituency parsing, a task in the area of grammar induction, aims to learn a constituency parser from a training corpus without parse tree annotations. While research on unsupervised constituency parsing has a long history (Carroll and Charniak, 1992; Pereira and Schabes, 1992; Stolcke and Omohundro, 1994), there has recently been a resurgence of interest in this task, and several approaches based on neural networks have been proposed that achieve impressive performance (Shen et al., 2018; Drozdov et al., 2019; Shen et al., 2019; Kim et al., 2019b,a; Jin et al., 2019).
With the recent growth of research on unsupervised constituency parsing, however, the lack of a unified experimental setting has become apparent, making empirical comparison between different approaches difficult. First, although almost all previous approaches are evaluated on the Penn Treebank (Marcus and Marcinkiewicz, 1993), they differ in how they preprocess the training data with respect to the sentence length limit, punctuation removal, vocabulary pruning, and so on. For example, non-neural methods such as the Constituent Context Model (CCM) (Klein and Manning, 2002) are trained on short sentences, while modern neural methods such as the Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018; Htut et al., 2018) do not impose any limit on sentence length.
Furthermore, existing approaches also differ in their evaluation metrics, with respect to how averages are computed, whether trivial spans are counted, and so on. The evaluation results of the same approach can differ significantly across metrics in some cases. Unfortunately, we have seen more than one paper that directly compares approaches evaluated with different metrics.
In this paper, we propose three standardized experimental settings with respect to data preprocessing, post-processing, evaluation metrics, and tuning. We then empirically compare five existing methods under the standardized settings, including two decade-old methods and three recently proposed neural methods. We run our experiments on English and Japanese, two languages with different branching tendencies. Interestingly, the overall experimental results show that the recent methods do not show a clear advantage over the decade-old methods.
We hope our empirical comparison can provide new insights into the relative strengths and weaknesses of existing methods and that our standardized experimental settings can facilitate future evaluation of unsupervised constituency parsing. Our pre/post-processing and evaluation source code can be found at https://github.com/i-lijun/UnsupConstParseEval.
Methods
PRPN is a neural model designed for language modeling by leveraging latent syntactic structures. It calculates syntactic distances between the words of a sentence, from which an unlabeled parse tree can be derived. Note that as a constituency parser, PRPN is incomplete (Dyer et al., 2019).
URNNG is an unsupervised version of the supervised neural parser RNNG (Dyer et al., 2016). It uses a chart parser to approximate the posterior of the original RNNG.
DIORA is a recursive autoencoder that uses the inside-outside algorithm to compute scores and representations of spans in the input sentence. It is the only model in our comparison that uses external word embeddings (in our experiments, we use ELMo (Peters et al., 2018) for English and fastText (Grave et al., 2018) for Japanese).
CCM is a generative distributional model whose parameters are updated with the EM algorithm. It is the only model in our comparison that uses gold part-of-speech tags as input.
CCL is an incremental parser, which uses a representation for syntactic structures similar to dependency links.
In addition to these models, we note that several other models achieve good results on unsupervised constituency parsing, such as UML-DOP (Bod, 2006), UPParse (Ponvert et al., 2011), feature CCM (Golland et al., 2012), Depth-Bounded PCFG (Jin et al., 2018), and Compound PCFG (Kim et al., 2019a). However, because of limited time and computational resources, as well as a lack of open-source implementations for some of the models, we do not evaluate them in our experiments.

Datasets and Preprocessing
We use two corpora in our evaluation: the English Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993) and the Japanese Keyaki Treebank (KTB) (Butler et al., 2012). We pick KTB in addition to PTB to check the generalizability of existing models on left-branching languages. For PTB, we follow the standard split, using sections 02-21 for training, section 22 for validation, and section 23 for testing. For KTB, we shuffle the corpus and use 80% of the sentences for training, 10% for validation, and 10% for testing.
Many previous approaches learn from training sentences of length ≤ 10, but recent models based on language modeling often use a length limit of 40 or set no length limit at all. We experiment with both length ≤ 10 and length ≤ 40. We do not impose any length limit on test sentences.
Previous models also differ in how they deal with punctuation. Although Jones (1994) and Spitkovsky et al. (2011) point out that careful treatment of punctuation may be helpful in unsupervised parsing, many previous models choose to remove punctuation, and some recent models treat punctuation as normal words. Only a few models, such as CCL (Seginer, 2007), treat punctuation specially. We experiment with two settings for length ≤ 40: one with punctuation and one without.
To reduce the vocabulary size, we replace all numerals with a <num> token and all words that appear only once with <unk>.
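The two replacement rules can be sketched as follows. This is a minimal illustration, not our exact implementation; in particular, the numeral test here is a simplifying assumption.

```python
from collections import Counter

def preprocess(corpus):
    """Replace numerals with <num>, then words that occur only once with <unk>.

    `corpus` is a list of tokenized sentences (lists of strings).
    """
    def normalize(tok):
        # Simplified numeral test: digits with optional commas / one decimal point.
        return "<num>" if tok.replace(".", "", 1).replace(",", "").isdigit() else tok

    corpus = [[normalize(t) for t in sent] for sent in corpus]
    counts = Counter(t for sent in corpus for t in sent)
    # Keep <num> even if rare; replace other singleton words with <unk>.
    return [[t if counts[t] > 1 or t == "<num>" else "<unk>" for t in sent]
            for sent in corpus]
```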

Post-processing
The parses output by CCL do not contain punctuation even when it is trained with punctuation, so it cannot be properly evaluated on a test set with punctuation. In addition, although the right-branching baseline is very strong when punctuation is removed, its evaluation score becomes very low when punctuation is included because of its treatment of trailing punctuation. We therefore extend the post-processing method of Drozdov et al. (2019) to either add back punctuation marks or modify their attachment in a parse tree: a trailing punctuation mark is attached to the root of the constituency parse tree, and a punctuation mark inside the sentence is attached to the lowest common ancestor of its two adjacent words. Note that this procedure produces non-binary parse trees.
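The reattachment procedure can be sketched as follows, with trees represented as nested Python lists of word strings. This is an illustrative sketch, not our exact implementation, and it assumes each word appears only once per sentence.

```python
def leaves(tree):
    """Collect the words of a (sub)tree in left-to-right order."""
    return [tree] if isinstance(tree, str) else [w for c in tree for w in leaves(c)]

def attach_trailing(tree, punct):
    """Attach a trailing punctuation mark as a child of the root."""
    return list(tree) + [punct]

def attach_inner(tree, punct, left, right):
    """Attach `punct` under the lowest common ancestor of its two
    adjacent words `left` and `right` (assumed unique in the sentence)."""
    # If a single child spans both neighbors, recurse into it;
    # otherwise `tree` itself is the lowest common ancestor.
    for i, child in enumerate(tree):
        if not isinstance(child, str):
            ws = leaves(child)
            if left in ws and right in ws:
                new = list(tree)
                new[i] = attach_inner(child, punct, left, right)
                return new
    # At the LCA: insert punct right after the child containing `left`.
    for i, child in enumerate(tree):
        if left in leaves(child):
            return tree[:i + 1] + [punct] + tree[i + 1:]
    return list(tree) + [punct]
```

For example, attaching "," between "dog" and "barked" in [["the", "dog"], ["barked", "loudly"]] yields the ternary tree [["the", "dog"], ",", ["barked", "loudly"]], illustrating why the output can be non-binary.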

Evaluation Metrics
The performance of a constituency parser is often evaluated with F1 scores. However, there are two ways of averaging F1 scores over multiple test sentences: micro average and macro average. In the micro average, all span predictions are aggregated and then compared with the gold spans to obtain precision and recall. In contrast, the macro average is obtained by calculating the F1 score of each individual sentence and then averaging over all sentences.
We use both metrics in our experiments. Note that when computing F1 scores, we remove trivial spans, i.e., single-word spans and whole-sentence spans, and we count duplicate constituents only once.
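The two averages, with trivial spans removed and duplicates collapsed, can be sketched as follows. This is an illustrative sketch of the metrics as described above, not our evaluation code; spans are half-open (i, j) word intervals.

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def spans_f1(pred_sents, gold_sents, n_words):
    """Micro- and macro-averaged unlabeled span F1.

    Each element of `pred_sents` / `gold_sents` is the set of (i, j) spans
    for one sentence; `n_words[k]` is that sentence's length.  Using sets
    collapses duplicate constituents; trivial spans are removed below.
    """
    def nontrivial(spans, n):
        # Drop single-word spans and the whole-sentence span.
        return {(i, j) for (i, j) in spans if j - i > 1 and not (i == 0 and j == n)}

    tp = pred_total = gold_total = 0
    macro = []
    for pred, gold, n in zip(pred_sents, gold_sents, n_words):
        pred, gold = nontrivial(pred, n), nontrivial(gold, n)
        overlap = len(pred & gold)
        tp += overlap
        pred_total += len(pred)
        gold_total += len(gold)
        p = overlap / len(pred) if pred else 0.0
        r = overlap / len(gold) if gold else 0.0
        macro.append(f1(p, r))          # per-sentence F1, averaged later
    micro = f1(tp / pred_total if pred_total else 0.0,
               tp / gold_total if gold_total else 0.0)
    return micro, sum(macro) / len(macro)
```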
We additionally use the standard PARSEVAL metric computed by the Evalb program. Although Evalb calculates a micro average F1 score, it differs from our micro average metric in that it counts whole-sentence spans and does not remove duplicate spans.

Tuning and Model Selection
To maintain the unsupervised nature of our experiments, we avoid the common practice of using gold parses of the validation set for hyperparameter tuning. CCM and CCL do not expose any hyperparameters for tuning. We tune PRPN and URNNG based on their perplexity on the validation set. DIORA does not provide a metric that can be used for tuning, so we do not tune it.
We tune PRPN and URNNG with the same time budget of 5 days on a GPU cluster with TITAN V GPUs. We use Bayesian optimization to automatically tune these models. We set the ranges of hyperparameter values around the default values provided in the original papers.

Experimental Results
We list the experimental results of all the models and the left/right-branching baselines for PTB and KTB in Table 1 and Table 2 respectively. Since all the models except CCL produce binary parse trees, we also show the score upper bound that a binary tree parser can achieve, which is computed by binarizing the gold trees and calculating their scores against the original gold trees. Note that our results can be very different from those reported in the original papers of these models because of different experimental setups. For example, the original CCM paper reports an F1 score of 71.9 on PTB, but we report 62.97. This is because the original CCM experiment uses the whole WSJ corpus (with length ≤ 10) for both training and test, which is very different from our setup.
Also note that for the left and right branching baselines and the binary upper bound, the scores for "length 10 no punct" and "length 40 no punct" are the same, because these baselines do not require training and are evaluated on the same test sets.
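The binary upper bound described above relies on binarizing the gold trees. One possible binarization (right-branching inside each n-ary node, collapsing unary chains) can be sketched as follows; this is an illustrative sketch of the idea, and other binarization directions are possible.

```python
def binarize(tree):
    """Right-binarize an n-ary tree (nested lists with string leaves),
    collapsing unary chains.  Scoring binarized gold trees against the
    original gold trees yields the upper bound a binary parser can reach."""
    if isinstance(tree, str):
        return tree
    children = [binarize(c) for c in tree]
    # Repeatedly merge the last two children until the node is binary.
    while len(children) > 2:
        children = children[:-2] + [[children[-2], children[-1]]]
    return children[0] if len(children) == 1 else children
```

Since binarization only adds intermediate spans while preserving the original constituent spans, recall against the gold trees is 1 and the upper bound is determined by the precision of the added spans.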
Overall Comparison There is no universal winner across all settings, but there are clear winners for specific settings. On PTB, it is surprising to see that each model wins in at least one setting. Right-branching is a very strong baseline, and with post-processing it outperforms all the models in some of the "ptb len40 punct" settings. On KTB, DIORA is the winner in most settings, while CCM performs strongly on "ktb len10 nopunct". Left-branching is a strong baseline, especially when evaluated on sentences of length ≤ 10.
Although CCM and DIORA achieve the best overall performance, we note that they both utilize additional resources. CCM uses gold POS tags and DIORA uses pretrained word embeddings. Our preliminary experiments on PTB show a significant drop in performance when we run CCM on words without gold POS tags, with the Evalb F1 score dropping from 70.14 to 57.29 when evaluated on length ≤ 10 under the "ptb len10 nopunct" setting. DIORA also performs worse when pretrained word embeddings are replaced by randomly initialized embeddings, with the average Evalb F1 score dropping from 49.39 to 42.63 when evaluated on all sentences under the "ptb len40 nopunct" setting.
Overall, we do not see a clear advantage of more recent neural models over traditional models. Two factors should be taken into account, though. First, neural models are significantly slower and therefore may not have been sufficiently tuned within the fixed tuning time budget. Second, the training data may still be too small for neural models.
Finally, we also note that our post-processing method for adding back punctuation almost always improves the score in PTB, sometimes by a large margin (e.g., for CCM and RBranch). On KTB, however, it sometimes decreases the score. This may be caused by different annotation standards for punctuation in the two treebanks.
Impact of Experimental Settings Different experimental settings lead to remarkable differences in the evaluation scores of the same model. Different evaluation metrics also produce very different scores: given the same output parses, they can sometimes differ by more than 20 F1 points.
Running Time Traditional models such as CCM and CCL are fast, taking only minutes to train. Neural models, on the other hand, take hours or even days to train. Beyond training, inference is also very fast for traditional models but slow for neural models. Considering their close F1 scores, we believe that, at least in scenarios with limited data and computational resources, traditional models are preferable to neural models.

Comments on Individual Models
We find that CCM when trained with length ≤ 10 sentences is very competitive. On PTB, it even outperforms all the other models that are trained on length 40 data with no punctuation. However, CCM cannot handle punctuation very well without post-processing.
URNNG seems to degrade to mostly right-branching in many settings (hence its very low standard deviations). This is possibly due to two reasons: 1) URNNG takes a long time to train and is therefore only lightly tuned within the tuning time budget; 2) in the original paper, URNNG is trained with punctuation but evaluated without punctuation, which is quite different from our settings.
PRPN performs strongly on PTB when trained on long sentences. However, we note that PRPN has a right-branching bias during inference (Dyer et al., 2019). If we switch its inference bias to left-branching, the performance drops significantly (by more than 10 points). Because of its right-branching bias, PRPN does not perform well on KTB.

Discussion
We make the following recommendations for future experiments on unsupervised constituency parsing.
For the sentence length limit, we think one can set any limit on the training data but should report evaluation results on both length ≤ 10 and all-length test data. For the evaluation metrics, since small details in implementing the micro and macro averages lead to nontrivial differences, we suggest using PARSEVAL, which has a publicly available implementation. For models sensitive to random seeds, we recommend reporting means and standard deviations over multiple runs. We also recommend evaluating on treebanks of both left-branching and right-branching languages, such as KTB and PTB.