Self-Training for Unsupervised Parsing with PRPN

Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models: we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model’s predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a), such that it can be trained in a semi-supervised fashion. We then add examples with parses predicted by our model to our unlabeled UP training data. Our self-trained model outperforms the PRPN by 8.1% F1 and the previous state of the art by 1.6% F1. In addition, we show that our architecture can also be helpful for semi-supervised parsing in ultra-low-resource settings.


Introduction
Unsupervised parsing (UP) models learn to parse sentences into unlabeled constituency trees without the need for annotated treebanks. Self-training (Yarowsky, 1995; Riloff et al., 2003) consists of training a model, using it to label new examples, and, based on a confidence metric, adding a subset of those examples to the training set before repeating training. For supervised parsing, results with self-training have been mixed (Charniak, 1997; Steedman et al., 2003; McClosky et al., 2006). For unsupervised dependency parsing, Le and Zuidema (2015) obtain strong results by training a supervised parser on the outputs of unsupervised parsing. UP models show low self-agreement between training runs (Kim et al., 2019a), while obtaining parsing performances far above chance. Supervising one run with confident parses from the last could combine their individual strengths. Thus, we ask: Can UP benefit from self-training?
In order to answer this question, we propose SS-PRPN, a semi-supervised extension of the UP architecture PRPN (Shen et al., 2018a), which can be trained jointly on language modeling and supervised parsing. This enables our model to leverage silver-standard annotations obtained via self-training for supervision. Our approach draws on the idea of syntactic distances, which can be learned both as latent variables (Shen et al., 2018a) and as explicit supervision targets (Shen et al., 2018b). We use both of these, leveraging annotations obtained via UP to supervise the two different outputs of the parser, in addition to standard UP training.
SS-PRPN, in combination with self-training, improves over its original version by 8.1% F1 and over the previous state of the art (Kim et al., 2019a) by 1.6% F1, when trained and evaluated on the English PTB (Marcus et al., 1999): UP can indeed benefit from self-training. We further analyze our self-training procedure, finding that longer sentences benefit most from self-training.
Although our primary motivation for the design of SS-PRPN is UP, we also show that it can leverage limited amounts of parsing supervision in ultra-low-resource settings.


Related Work
Following the line of research on non-neural UP models (Clark, 2001; Klein and Manning, 2002; Bod, 2006), early approaches to neural UP (Yogatama et al., 2017; Choi et al., 2018) obtain improved performance on downstream tasks, yet show highly inconsistent behavior in parsing (Williams et al., 2018). Recently, Shen et al. (2018a) introduce the first high-performing neural UP model (Htut et al., 2018). Dyer et al. (2019) raise concerns that PRPN's parsing methodology is biased towards English trees. Though these concerns are serious, they are largely orthogonal to our research question regarding the helpfulness of self-training for UP.
Several models have been introduced since: Shen et al. (2019) propose an architecture consisting of an LSTM (Hochreiter and Schmidhuber, 1997) with a modified update function for the LSTM cell state.

Model
Syntactic Distances In order to parse a sentence, a computational model needs to output some kind of variables representing a unique tree structure. The variables we use are syntactic distances, as introduced by Shen et al. (2018a). They represent the syntactic relationships between all successive pairs of words in a sentence: if the distance between two neighboring words is large, they belong to different subtrees and, thus, their traversal distance in the tree is large. A parse tree can be created by finding the maximum syntactic distance, splitting the sentence into sub-trees there, and repeating this process recursively for each sub-tree until a single token is left. Two formulations of syntactic distance, a latent one (D_l) and a directly supervised one (D_g), are described with Algorithm 1 below. We design our parser in such a way that it can predict both, and treat the decision whether D_g or D_l is used at test time as a hyperparameter. The reasons why we employ both types of distances are twofold: D_g, unlike D_l, cannot be learned in an unsupervised fashion, which is critical for a semi-supervised architecture, and, empirically, supervising purely on D_l performs poorly.
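The recursive splitting procedure described above can be sketched in a few lines; this is a hypothetical, minimal implementation in which ties between equal distances are broken by taking the leftmost maximum:

```python
def distances_to_tree(words, dists):
    """Greedy top-down parse: split at the largest syntactic distance.

    words: list of n tokens; dists: list of n-1 distances, where
    dists[i] sits between words[i] and words[i+1].
    """
    if len(words) == 1:
        return words[0]
    k = max(range(len(dists)), key=dists.__getitem__)   # position of the max distance
    left = distances_to_tree(words[:k + 1], dists[:k])   # words left of the split
    right = distances_to_tree(words[k + 1:], dists[k + 1:])  # words right of the split
    return (left, right)
```

For example, `distances_to_tree(["the", "cat", "sat"], [1, 2])` splits first at the larger distance between "cat" and "sat", yielding the left-branching tree `(("the", "cat"), "sat")`.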
The Parser Our parser, cf. Figure 1, consists of an embedding layer and a convolutional layer, which are followed by two different components: a linear output layer that predicts D_g and a second convolutional layer that predicts D_l.
Formally, given an input sentence s = t_0, t_1, ..., t_{n-1} with token embeddings e_0, ..., e_{n-1}, our parser predicts the i-th entry of D_g as

    D_g,i = ReLU(W_d ReLU(W_c [e_{i-L_1+1}; ...; e_i] + b_c) + b_d),

where W_c are the weights of the first convolutional layer, W_d are the weights of the output layer corresponding to D_g, b_c and b_d are bias vectors, and L_1 is the filter size. D_l involves similar computations, but is the output of the second convolutional layer.
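As an illustration, the D_g computation might be sketched as follows in numpy; the dimensions, names, and zero-padding at the sentence start are our assumptions, and the random weights merely stand in for trained parameters:

```python
import numpy as np

def predict_dg(E, Wc, bc, Wd, bd, L1):
    """Sketch of the D_g branch: a convolution over windows of L1 token
    embeddings (first layer), then a linear output layer, both with ReLU."""
    n, d = E.shape
    padded = np.vstack([np.zeros((L1 - 1, d)), E])  # left-pad so token 0 has a full window
    dg = []
    for i in range(n):
        window = padded[i:i + L1].reshape(-1)            # [e_{i-L1+1}; ...; e_i]
        g = np.maximum(0.0, Wc @ window + bc)            # first convolutional layer
        dg.append(np.maximum(0.0, Wd @ g + bd).item())   # linear output layer for D_g
    return dg

# Random stand-ins for trained parameters (all dimensions are illustrative).
rng = np.random.default_rng(0)
n_tokens, d_emb, d_hid, L1 = 5, 4, 8, 2
E = rng.normal(size=(n_tokens, d_emb))
dg = predict_dg(E, rng.normal(size=(d_hid, L1 * d_emb)), np.zeros(d_hid),
                rng.normal(size=(1, d_hid)), np.zeros(1), L1)
```

This yields one non-negative distance per position; only the shape of the computation is shown, not the trained model.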
Distance Loss When silver-standard annotations from self-training are available, we compute the loss for both syntactic distances directly. Since the relative ranking between distances, rather than their absolute values, defines the tree structure, we train our parser with a hinge ranking loss following Shen et al. (2018b). Our distance loss L_r is the weighted sum of the distance losses corresponding to D_l (L_sl) and D_g (L_sg): L_r = α L_sl + (1 - α) L_sg.

Language Modeling Loss In order to optimize the parameters of our parser without direct supervision, we further feed its output, the predictions for D_l, into a language model, following Shen et al. (2018a).
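A pairwise hinge ranking loss of this flavor can be sketched as follows; the margin of 1 and the normalization over pairs are illustrative choices, not necessarily the paper's exact ones:

```python
def distance_ranking_loss(pred, gold, margin=1.0):
    """Pairwise hinge ranking loss: whenever the gold distances say
    d_i > d_j, the predicted d_i should exceed d_j by at least `margin`."""
    loss, pairs = 0.0, 0
    for i in range(len(gold)):
        for j in range(len(gold)):
            if gold[i] > gold[j]:
                loss += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

A correctly ordered prediction with a sufficient gap incurs zero loss (`distance_ranking_loss([3.0, 1.0], [2, 1])` is 0.0), while a reversed ranking is penalized.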

Multi-Task Training
Our parser is trained in a semi-supervised fashion with losses corresponding to (i) learning the distances in a latent manner through language modeling, and (ii) supervising directly on distances. We sample batches from both objectives at random.

Self-Training
For self-training, cf. Algorithm 2, we first train n_c models on the unlabeled PTB training set X_U. We then have them predict parse trees for all sentences in X_U. If more than µ · n_c models (with µ as a hyperparameter) agree on the same parse, we add it as a silver-standard labeled example to the parsing training set X_T. We use Algorithm 1 and the respective algorithm by Shen et al. (2018b) to convert consensus trees into distances D_l and D_g, and then train a new model on both X_U and X_T.
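The consensus-filtering step can be sketched as follows; the representation of parses as bracketed strings and the function name are our assumptions:

```python
from collections import Counter

def build_silver_set(predicted_parses, mu):
    """predicted_parses maps each sentence to the parses predicted by each
    of the n_c models (e.g. as bracketed strings). A parse is kept as a
    silver-standard example only if more than mu * n_c models agree on it."""
    silver = {}
    for sentence, parses in predicted_parses.items():
        parse, votes = Counter(parses).most_common(1)[0]  # majority parse and its count
        if votes > mu * len(parses):                      # strict threshold, as in the text
            silver[sentence] = parse
    return silver
```

With µ = 0.6 and 15 models, a parse needs strictly more than 9 votes: 10 agreeing models suffice, 9 do not.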

Experimental Design
Data and Metrics We experiment on the English Penn Treebank (PTB; Marcus et al., 1999). For evaluation, we compute the F1 score of the output parses against binarized gold parses, following Williams et al. (2018). The code for our model is published online.¹
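Unlabeled bracket F1 on binarized trees can be computed as sketched below; note that conventions differ on whether the trivial full-sentence span counts, and this hypothetical version includes it:

```python
def bracket_f1(pred, gold):
    """Unlabeled F1 between two trees represented as nested tuples of tokens."""
    def spans(tree, start=0):
        # Returns (set of (i, j) constituent spans, number of leaves covered).
        if isinstance(tree, str):
            return set(), 1
        result, length = set(), 0
        for child in tree:
            child_spans, child_len = spans(child, start + length)
            result |= child_spans
            length += child_len
        result.add((start, start + length))  # span of this constituent
        return result, length
    p, g = spans(pred)[0], spans(gold)[0]
    overlap = len(p & g)
    precision, recall = overlap / len(p), overlap / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, the left-branching `(("a", "b"), "c")` and right-branching `("a", ("b", "c"))` share only the full-sentence span, giving F1 = 0.5 under this convention.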
Baselines We compare against an unsupervised recurrent neural network grammar (URNNG; Kim et al., 2019b), a compound probabilistic context-free grammar (C-PCFG; Kim et al., 2019a), and Shen et al. (2018a)'s PRPN. We re-implement and tune PRPN in our code base.
Hyperparameters We tune our hyperparameters on the development set. Hidden states and word embeddings have 300 and 100 dimensions, respectively. We set the weight α = 0.5. For self-training, we obtain the best results with µ = 60% and n_c = 15. We further experiment with converting either D_l or D_g into final parse trees, and find that D_l works best.

Results and Analysis
Unsupervised Parsing Performance Table 1 shows our results. SS-PRPN outperforms all baselines: our model obtains a 1.6% higher F1 score than the strongest baseline. It further improves substantially over comparable non-self-trained baselines: by 14.6% over PRPN and by 8.1% over our re-implementation of it. SS-PRPN also shows much lower variance. This demonstrates that self-training is indeed a viable approach for UP.

Analysis of Self-Training
We interpret the agreement rate as our confidence value for self-training, with the hypothesis that, as agreement among models increases, there is a higher likelihood that the parse is correct. In Figure 2, we show that, as expected, the F1 score increases as more models agree, for the best self-training run (15 individual models, i.e., the second-to-last row in Table 1).
Additionally, Figure 2 and Table 2 show that self-training annotations consist of shorter sentences and shallower trees than our dataset's average, i.e., mostly of easier sentences.
Our final hypothesis is that self-training helps mostly for longer sentences, since models often agree on shorter ones anyway and, trivially, longer sentences leave more room for error. Table 3 shows the development set performance and the number of examples for varying sentence lengths. As expected, self-training yields the greatest gains for longer sentences.

Low-Resource Parsing Performance We further investigate how SS-PRPN performs when limited gold parses are available in addition to unlabeled data. To predict constituency labels, we add and train an additional linear output layer after the first convolutional layer. We find that, on the development set, converting D_g into parse trees works better for low-resource parsing than D_l. As supervised baselines, we employ Dyer et al. (2016)'s recurrent neural network grammar (RNNG) and a supervised parser (SP) based on syntactic distances (Shen et al., 2018b). Figure 3 shows results for 50 to 250 annotated examples. The upper part shows the unlabeled parsing performance in comparison to the UP baselines. We outperform all baselines for 50 to 150 examples, while SP performs slightly better with more annotations. Looking at labeled F1 in the lower part of Figure 3, SS-PRPN clearly outperforms SP, which indicates that unlabeled data can be leveraged in the low-resource setting.

Conclusion
We introduce a semi-supervised neural architecture, SS-PRPN, which is capable of UP via self-training. Our self-trained models strongly outperform comparable baselines, and advance the state of the art on the PTB by 1.6% F1. Analyses show that our approach yields most gains for longer sentences. Our architecture can also leverage limited amounts of parsing supervision when available. We conclude that it is beneficial to develop better UP models for semi-supervised settings.

Figure 1: Our parser, represented by the dotted box, outputs syntactic distances D_g and D_l. Both D_g and D_l can be supervised, but D_l can also be learned in a latent manner.
Kim et al. (2019a), the current state of the art, introduce a model based on a mixture of probabilistic context-free grammars; Kim et al. (2019b) present unsupervised learning of recurrent neural network grammars; Li et al. (2019) combine PRPN with imitation learning; and Drozdov et al. (2019) employ a recursive autoencoder. Kim et al. (2020) examine tree induction from pretrained models.

Algorithm 1: Tree to latent distances D_l
 1  D_l ← [1] * leaves(tree)        // leaves(tree): leaf count of tree
 2  b ← 0
 3  max ← 100                       // max: maximum possible depth of the tree
 4  Function DISTANCE(tree, b, max)
 5      DISTANCE(tree_l, b, max − 1)
 6      x ← tree_r                  // tree_r: right child of tree
 7      while True do
 8          if x_l is empty then    // x_l: left child of x
 9              D_l[b + leaves(tree_l)] ← max
13  end

Two different formulations of syntactic distance have been proposed to realize this basic intuition. The first, which we refer to as D_l, is introduced by Shen et al. (2018a) as a latent variable in their UP model. Since, for self-training, we supervise D_l with values predicted by our model, we introduce Algorithm 1, which converts a tree to distances D_l. The second kind of distance, denoted here as D_g, is introduced by Shen et al. (2018b) as labels for direct supervision. We use their algorithms to map trees to distances D_g and vice versa, and refer readers to Shen et al. (2018b) for details.
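For intuition, a tree-to-distances conversion in the same spirit can be sketched as follows; this hypothetical version assigns each word boundary the height of the subtree whose split creates it (subtree height is the scheme Shen et al. (2018b) use, while Algorithm 1 counts down from a maximum depth, so the two differ in details):

```python
def tree_to_distances(tree):
    """Convert a binarized tree (nested tuples of tokens) to per-boundary
    syntactic distances: the distance at a boundary is the height of the
    subtree whose split creates it, so deeper splits get smaller values.
    Returns (leaves, distances) with len(distances) == len(leaves) - 1."""
    def walk(node):
        if isinstance(node, str):
            return [node], [], 0                       # single leaf, no boundary
        left_leaves, left_d, left_h = walk(node[0])
        right_leaves, right_d, right_h = walk(node[1])
        height = max(left_h, right_h) + 1              # height of this subtree
        return left_leaves + right_leaves, left_d + [height] + right_d, height
    leaves, dists, _ = walk(tree)
    return leaves, dists
```

For example, `tree_to_distances((("the", "cat"), "sat"))` yields the leaves `["the", "cat", "sat"]` with distances `[1, 2]`: the root split, being higher in the tree, gets the larger distance.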

Figure 2: Statistics for self-training (n_c = 15): as agreement among UP models goes up, parsing F1 improves, and average depth and length go down.

Figure 3: Low-resource parsing on the PTB. The first and second plots show unlabeled and labeled F1, respectively, plotted against training data size.

Algorithm 2: Self-training for UP
1  Unlabeled data X_U
2  Training set X_T ← ∅
3  Train n_c UP models on X_U
4  for s_i ∈ X_U do
5      n_a ← number of models agreeing on parse p(s_i)

Table 1: Results on the English PTB test set, with the model tuned on the development set. LB, RB, and Random baselines are taken as-is from Htut et al. (2018). Since the evaluation of C-PCFG, PRPN, and URNNG is done against binary gold trees, results might differ from the original papers.

Table 2: Statistics of our best self-training annotations compared to the PTB.

Table 3: Percentage of development examples improved by SS-PRPN in comparison to PRPN, listed by sentence length.