On the Role of Supervision in Unsupervised Constituency Parsing

We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing $F_1$ score on the Wall Street Journal (WSJ) development set (1,700 sentences). We introduce strong baselines for them, by training an existing supervised parsing model (Kitaev and Klein, 2018) on the same labeled examples they access. When training on the 1,700 examples, or even when using only 50 examples for training and 5 for development, such a few-shot parsing approach can outperform all the unsupervised parsing methods by a significant margin. Few-shot parsing can be further improved by a simple data augmentation method and self-training. This suggests that, in order to arrive at fair conclusions, we should carefully consider the amount of labeled data used for model development. We propose two protocols for future work on unsupervised parsing: (i) use fully unsupervised criteria for hyperparameter tuning and model selection; (ii) use as few labeled examples as possible for model development, and compare to few-shot parsing trained on the same labeled examples.


Introduction
Recent work has considered neural unsupervised constituency parsing (Shen et al., 2018a; Drozdov et al., 2019; Kim et al., 2019b, inter alia), showing that it can achieve much better performance than trivial baselines. However, many of these approaches use the gold parse trees of all sentences in a development set for either early stopping (Shen et al., 2018a, 2019; Drozdov et al., 2019, inter alia) or hyperparameter tuning (Kim et al., 2019a). In contrast, models trained and tuned without any labeled data (Kim et al., 2019b; Peng et al., 2019) are much less competitive.
Are the labeled examples important in order to obtain decent unsupervised parsing performance? How well can we do if we train on these labeled examples rather than merely using them for tuning? In this work, we consider training a supervised constituency parsing model (Kitaev and Klein, 2018) with very few examples as a strong baseline for unsupervised parsing tuned on labeled examples.
We empirically characterize unsupervised and few-shot parsing across the spectrum of labeled data availability, finding that (i) tuning based on a few (as few as 15) labeled examples is sufficient to improve unsupervised parsers over fully unsupervised criteria by a significant margin; (ii) unsupervised parsing with supervised tuning does outperform few-shot parsing with fewer than 15 labeled examples, but few-shot parsing quickly dominates once there are more than 55 examples; and (iii) when few-shot parsing is combined with a simple data augmentation method and self-training (Steedman et al., 2003; Reichart and Rappoport, 2007; McClosky et al., 2006, inter alia), only 15 examples are needed for few-shot parsing to begin to dominate.
Based on these results, we propose the following two protocols for future work on unsupervised parsing:
1. Derive and use fully unsupervised criteria for hyperparameter tuning and model selection.
2. Use as few labeled examples as possible for model development and tuning, and compare to few-shot parsing models trained on the used examples as a strong baseline.
We suggest that future work tune and compare models under each protocol separately. In addition, we present two side findings on unsupervised parsing: (i) the vocabulary size in unsupervised parsing, which has not been widely considered as a hyperparameter and varies across prior work, greatly affects the performance of all unsupervised parsing models tested; and (ii) self-training can help improve all investigated unsupervised parsing models (Shen et al., 2018a, 2019; Drozdov et al., 2019; Kim et al., 2019a) and few-shot parsing models, and thus can be considered as a post-processing step in future work.

Related Work
Unsupervised parsing. During the past two decades, there has been a lot of work on both unsupervised constituency parsing (Klein and Manning, 2002, 2004; Bod, 2006a,b; Seginer, 2007; Snyder et al., 2009, inter alia) and unsupervised dependency parsing (Klein and Manning, 2004; Smith and Eisner, 2006; Spitkovsky et al., 2011, 2013). Recent work has proposed several effective models for unsupervised or distantly supervised constituency parsing, optimizing either a language modeling objective (Shen et al., 2018a, 2019; Kim et al., 2019b,a, inter alia) or other downstream semantic objectives (Li et al., 2019; Shi et al., 2019). Some of them are tuned with labeled examples in the WSJ development set (Shen et al., 2018a, 2019; Htut et al., 2018; Drozdov et al., 2019; Kim et al., 2019a; Wang et al., 2019) or other labeled examples (Jin et al., 2018, 2019).

Data augmentation. Data augmentation is a strategy for automatically increasing the amount and variety of data for training models, without actually collecting any new data. Such methods have been found helpful on many NLP tasks, including text classification (Kobayashi, 2018; Samanta et al., 2019), relation classification (Xu et al., 2016), and part-of-speech tagging (Şahin and Steedman, 2018). Part of our approach also falls into the category of data augmentation, applied specifically to constituency parsing from very few examples.

Few-Shot Constituency Parsing
We apply Benepar (§3.1; Kitaev and Klein, 2018) as the base model for few-shot parsing. We present a simple data augmentation method (§3.2) and an iterative self-training strategy (§3.3) to further improve the performance. We suggest that such an approach should serve as a strong baseline for unsupervised parsing with supervised tuning.

Parsing Model
The Benepar parsing model consists of (i) word embeddings, (ii) transformer-based (Vaswani et al., 2017) word-span embeddings, and (iii) a multilayer perceptron to compute a score for each labeled span. The score of an arbitrary tree is defined as the sum of all of its internal span scores. Given a sentence and its ground-truth parse tree T*, the model is trained to satisfy score(T*) ≥ score(T) + Δ(T*, T) for any tree T ≠ T*, where Δ denotes the Hamming loss on labeled spans. The label-aware CKY algorithm is used to obtain the tree with the highest score. More details can be found in Kitaev and Klein (2018).
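As a concrete illustration of the chart decoding step, the following is a minimal CKY sketch over unlabeled span scores. The function name and the score interface are our simplifications (labels and Benepar's vectorized implementation are omitted); it is not the paper's actual code.

```python
def cky_best_tree(n, span_score):
    """Find the binary tree over [0, n) maximizing the sum of span scores.

    span_score(i, j) -> float is the score of span [i, j) (illustrative
    stand-in for the model's span scorer). Returns (best_score, spans).
    """
    best = {}   # (i, j) -> best subtree score for span [i, j)
    back = {}   # (i, j) -> best split point k for span [i, j)
    for i in range(n):
        best[(i, i + 1)] = span_score(i, i + 1)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            k_best, s_best = None, float("-inf")
            for k in range(i + 1, j):
                s = best[(i, k)] + best[(k, j)]
                if s > s_best:
                    k_best, s_best = k, s
            best[(i, j)] = s_best + span_score(i, j)
            back[(i, j)] = k_best

    def collect(i, j, spans):
        # Follow backpointers to read off the constituents of the best tree.
        spans.add((i, j))
        if j - i > 1:
            k = back[(i, j)]
            collect(i, k, spans)
            collect(k, j, spans)
        return spans

    return best[(0, n)], collect(0, n, set())
```

With a scorer that rewards only the span (0, 2), the decoder prefers the left-branching analysis of a three-word sentence.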

Data Augmentation
We introduce a data augmentation method, subtree substitution (SUB; Figure 1), to automatically improve the diversity of data in the few-shot setting.
We start with a set S = {(s_i, T_i)}_{i=1}^{N} of N sentences with unlabeled parse trees, where T_i = {⟨b_ij, e_ij⟩}_{j=1}^{C_i} denotes the unlabeled parse tree of s_i with C_i nonterminal nodes; b_ij and e_ij denote the beginning and ending indices of a constituent.
The augmented dataset S' is initialized to S. At each step, we draw a sentence s_i and its parse tree T_i uniformly from S', and draw a constituent ⟨b_ij, e_ij⟩ uniformly from T_i. After that, we replace ⟨b_ij, e_ij⟩ with a random ⟨b_kh, e_kh⟩ ∈ T_k; that is, we replace a constituent with another one from the training set. We let s'_i and T'_i denote the modified sentence and its parse tree, assign S' ← S' ∪ {(s'_i, T'_i)}, and repeat the above procedure until S' reaches the desired size.

Self-Training
Prior work (Steedman et al., 2003; McClosky et al., 2006, inter alia) has shown that self-training (ST) on unseen sentences can improve a parsing model. Inspired by this, we apply an iterative self-training strategy after obtaining each supervised or unsupervised parsing model. Concretely, we start with an arbitrary parsing model M_0. At the i-th step of self-training, we (i) use the trained model from the previous step (i.e., M_{i-1}) to predict parse trees for sentences in the WSJ training set and those in the WSJ development set, and (ii) train a supervised parsing model M_i (Kitaev and Klein, 2018) to fit the predictions of M_{i-1}. No gold labels are used in self-training.
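The subtree substitution (SUB) procedure can be sketched as below. The data representation (a token list plus a list of (begin, end) constituent spans per sentence) and all names are our illustrative choices, not the paper's code; label handling is omitted.

```python
import random

def subtree_substitution(examples, target_size, seed=0):
    """Augment (tokens, spans) pairs by swapping constituents (SUB sketch).

    examples: list of (tokens, spans); spans are (begin, end) constituents.
    Repeatedly replaces a random constituent of a random example with a
    random constituent drawn from the (growing) set, until target_size.
    """
    rng = random.Random(seed)
    augmented = list(examples)
    while len(augmented) < target_size:
        tokens, spans = rng.choice(augmented)
        b, e = rng.choice(spans)            # constituent to replace
        src_tokens, src_spans = rng.choice(augmented)
        b2, e2 = rng.choice(src_spans)      # replacement constituent
        new_tokens = tokens[:b] + src_tokens[b2:e2] + tokens[e:]
        shift = (e2 - b2) - (e - b)
        new_spans = [(b, b + (e2 - b2))]    # the substituted constituent
        for (x, y) in spans:
            if y <= b:                      # entirely left: unchanged
                new_spans.append((x, y))
            elif x >= e:                    # entirely right: shifted
                new_spans.append((x + shift, y + shift))
            elif x <= b and y >= e and (x, y) != (b, e):
                new_spans.append((x, y + shift))   # ancestors grow by shift
            # spans strictly inside the removed subtree are dropped
        for (x, y) in src_spans:            # copy the donor subtree's spans
            if x >= b2 and y <= e2 and (x, y) != (b2, e2):
                new_spans.append((x - b2 + b, y - b2 + b))
        augmented.append((new_tokens, new_spans))
    return augmented
```

Because constituents in a tree nest rather than overlap, the case analysis above (left of, right of, containing, or inside the replaced span) is exhaustive.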

Dataset and Training Details
We use the WSJ portion of the Penn Treebank corpus (Marcus et al., 1993) to train and evaluate the models, replace all number tokens with a special token, and split standard train/dev/test sets following Kim et al. (2019b). For each criterion, we tune the hyperparameters of each model with respect to its performance on the development set. To address vocabulary sparsity in the few-shot parsing setting (§3), we initialize the word embeddings of Benepar (Kitaev and Klein, 2018) with the word embeddings from an LSTM-based (Hochreiter and Schmidhuber, 1997) language model trained on the WSJ training set. During training, models are able to access all sentences (without parse trees) in the WSJ training set; for few-shot parsing or unsupervised parsing with supervised tuning, some unlabeled parse trees in the WSJ development set are available as well. We augment the training set to 10,000 examples for few-shot parsing with SUB, and apply 5-step self-training when applicable.
We evaluate the unlabeled F_1 score of all models using evalb, discarding punctuation. More details can be found in the supplementary material.

Models and Tuning Criteria
We consider four unsupervised parsing models. PRPN and ON-LSTM are left-to-right neural language models, in which the syntactic distance (Shen et al., 2018b) between consecutive words is computed from the model output and used to infer the constituency parse tree. DIORA learns text-span representations and span-level scores by optimizing a masked language modeling objective. The Compound PCFG uses a neural parameterization of a PCFG, as well as a per-sentence latent vector that introduces context sensitivity. Both DIORA and the Compound PCFG use the CKY algorithm to infer the parse tree of a given sentence.

As fully unsupervised tuning criteria, we use perplexity on the development set for PRPN and ON-LSTM, and the upper bound of perplexity for the Compound PCFG, following Shen et al. (2018a, 2019) and Kim et al. (2019a), respectively. For DIORA, we use its reconstruction loss on the development set.
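For reference, the unlabeled F_1 metric used throughout can be sketched as follows. This is a simplified, sentence-level stand-in for evalb: punctuation removal and corpus-level micro-averaging conventions are omitted.

```python
def unlabeled_f1(gold_spans, pred_spans):
    """Unlabeled F_1 between gold and predicted (begin, end) span sets.

    A span matches if its boundaries agree with a gold constituent,
    ignoring labels (hence "unlabeled").
    """
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold or not pred:
        return 0.0
    match = len(gold & pred)
    if match == 0:
        return 0.0
    precision = match / len(pred)
    recall = match / len(gold)
    return 2 * precision * recall / (precision + recall)
```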

Comparison between Unsupervised Parsing and Few-Shot Parsing
We compare unsupervised parsing against few-shot parsing (Table 1). We find that a few labeled examples are consistently helpful for most models to achieve better results than fully unsupervised parsing. In addition, models tuned on a very small number (e.g., 15) of labeled examples can achieve similar performance to those tuned on 1,700 labeled examples; that is, we need far fewer labeled examples than existing unsupervised parsing approaches have used to obtain very similar results.
To test whether SUB can also help improve unsupervised parsing models, we generate 10K sentences from the 1,700 sentences in the WSJ development set with SUB (Figure 1), and add them to the 40K-sentence WSJ training set. We compare unsupervised parsing models trained on the original WSJ training set and the augmented one (Table 2). We find that SUB can sometimes help, but not by a large margin, and all numbers in Table 2 are far below the performance of few-shot parsing with the same data availability (82.6; Table 1). Few-shot parsing with data augmentation is a strong baseline for unsupervised parsing with data augmentation.

The Importance of Vocabulary Size
We notice that the result of the Compound PCFG in Table 1 is much worse than that reported by Kim et al. The only major difference between their approach and ours is the vocabulary size: instead of keeping all words, they keep the most frequent 10K words in the WSJ corpus and replace others with a special token. To analyze the importance of this choice, we compare the performance of the models with vocabulary size 35K vs. 10K (Figure 2), tuning models separately in the two settings. We find that the vocabulary size, which has not been widely considered a hyperparameter and varies across prior work, greatly affects the performance of all models tested. One possible reason is that a large portion (79.9%) of the low-frequency (i.e., outside the 10K vocabulary) word tokens are nouns or adjectives; some models (e.g., PRPN and the Compound PCFG) may benefit from collapsing these tokens to a single form, as it may be a beneficial kind of word clustering. This suggests that we should consider tuning the vocabulary size as a hyperparameter, or fix the vocabulary size for fair comparison in future work.

Figure 2: Performance of models with vocabulary size 35K (left) and 10K (right) on WSJ Section 24. C-PCFG denotes the Compound PCFG. The F_1 scores are averaged over 5 runs with the same hyperparameters, different random seeds, and different sets of labeled examples when applicable.

Table 3: F_1 score on WSJ Section 24 of different models, where the base models are those used to report results in Table 1 with |D_label| = 15.
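The 10K-vocabulary preprocessing discussed above amounts to the following sketch; the `<unk>` token name is an illustrative choice, not necessarily the one used by prior work.

```python
from collections import Counter

def truncate_vocab(corpus, vocab_size, unk="<unk>"):
    """Keep the vocab_size most frequent word types; map the rest to unk.

    corpus is a list of tokenized sentences (lists of strings).
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    keep = {w for w, _ in counts.most_common(vocab_size)}
    return [[tok if tok in keep else unk for tok in sent] for sent in corpus]
```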

Self-Training Improves all Models
Inspired by the fact that self-training boosts the performance of few-shot parsing (Table 1), we apply iterative self-training to the unsupervised parsing models as well, and find that it improves all of them (Table 3). It is worth noting that 5-step self-training is better than 1-step self-training for all base models we experimented with. Our results suggest that iterative (e.g., 5-step) self-training may be considered a standard post-hoc processing step for unsupervised parsing.
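The iterative self-training loop can be sketched as below. The `parse` and `train_fn` interfaces are placeholders for the supervised parser used in the paper; this is a minimal illustration of the control flow, not the actual training code.

```python
def iterative_self_training(base_model, train_fn, unlabeled_sents, steps=5):
    """Iteratively re-fit a supervised parser to the previous model's output.

    base_model: object with .parse(sentence) -> tree (any base parser).
    train_fn: takes a list of (sentence, tree) pairs, returns a new model.
    No gold labels are used anywhere in the loop.
    """
    model = base_model
    for _ in range(steps):
        # Pseudo-label all unlabeled sentences with the current model,
        # then train a fresh supervised parser on those pseudo-labels.
        pseudo = [(s, model.parse(s)) for s in unlabeled_sents]
        model = train_fn(pseudo)
    return model
```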

Discussion
While many state-of-the-art unsupervised parsing models are tuned on all labeled examples in a development set (Drozdov et al., 2019; Kim et al., 2019b; Wang et al., 2019, inter alia), we have demonstrated that, given the same data, few-shot parsing with simple data augmentation and self-training can consistently outperform all of these models by a large margin. We suggest that one possibility for future work is to focus on fully unsupervised criteria, such as language model perplexity (Shen et al., 2018a, 2019; Kim et al., 2019b; Peng et al., 2019; Li et al., 2020) and model stability across different random seeds (Shi et al., 2019), for model selection, as discussed in prior work on unsupervised learning (Smith and Eisner, 2005, 2006; Spitkovsky et al., 2010a,b, inter alia). An alternative is to use as few labeled examples in the development set as possible, and compare to few-shot parsing trained on the used examples as a strong baseline. In addition, we find that self-training is a useful post-processing step for unsupervised parsing. Our work does not necessarily imply that unsupervised parsers produce poor parses; they may be producing good parses that clash with the conventions of treebanks (Klein, 2005). If this is the case, then extrinsic evaluation of parsers on downstream tasks (Shi et al., 2018), e.g., machine translation (DeNero and Uszkoreit, 2011; Neubig et al., 2012; Gimpel and Smith, 2014), may better show the potential of unsupervised methods.