Unsupervised Parsing with S-DIORA: Single Tree Encoding for Deep Inside-Outside Recursive Autoencoders

The deep inside-outside recursive autoencoder (DIORA; Drozdov et al. 2019a) is a self-supervised neural model that learns to induce syntactic tree structures for input sentences without access to labeled training data. In this paper, we discover that while DIORA exhaustively encodes all possible binary trees of a sentence with a soft dynamic program, its vector averaging approach is locally greedy and cannot recover from errors when computing the highest scoring parse tree in bottom-up chart parsing. To fix this issue, we introduce S-DIORA, an improved variant of DIORA that encodes a single tree rather than a softly-weighted mixture of trees by employing a hard argmax operation and a beam at each cell in the chart. Our experiments show that through fine-tuning a pre-trained DIORA with our new algorithm, we improve the state of the art in unsupervised constituency parsing on the English WSJ Penn Treebank by 2.2-6% F1, depending on the data used for fine-tuning.


Introduction
Syntactic parse trees are valuable intermediate features for many NLP pipelines (He et al., 2018; Strubell et al., 2018), as a soft constraint (Rush and Collins, 2012), a hard constraint (Lee et al., 2019b), or in multi-task learning with syntactic scaffolds (Swayamdipta et al., 2018). Syntactic inductive bias can also improve generalization of deep learning models (Kuncoro et al., 2020). These results have motivated researchers to pursue unsupervised parsing, with the hope of training syntax-dependent models on large amounts of data without annotation (Klein and Manning, 2002; Bod, 2006; Ponvert et al., 2011; Shen et al., 2019; Kim et al., 2019, inter alia).
Of these models, we focus on the deep inside-outside recursive autoencoder (DIORA; Drozdov et al. 2019a). DIORA encodes sentences in a procedure resembling the inside-outside algorithm (Baker, 1979), which allows it to induce syntactic tree structures for input sentences without access to labeled training data, and achieves near state-of-the-art results on unsupervised constituency parsing. DIORA resembles pre-trained language models, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), in that it is trained with a self-supervised blank-filling objective on large amounts of unlabeled data.
Figure 1: DIORA is sensitive to locally non-optimal decisions. By assigning a low weight to a potentially important subtree when recursively computing the vector for a target tree, it is difficult or impossible to recover and the important subtree is washed out (represented in light gray). Our method, S-DIORA (bottom row), can recover from errors, and the desired tree ends up at the top of the beam in the right-most column.
DIORA is a strong unsupervised parser in spite of its locally greedy nature. DIORA works by encoding all subtrees covering a particular span as separate vectors, and then computing a weighted average of these vectors. DIORA uses this averaged vector later in the dynamic program to represent the entire forest of trees covering a span. DIORA computes a score for each subtree; intuitively, a subtree's score affects how strongly it is represented in the averaged vector. The representations are computed recursively, and when a tree that looks locally unimportant is given a weak score, as shown in Figure 1, it will be washed out. This weakness in local decision making is similar to the label bias problem (Lafferty et al., 2001) in sequence prediction.
In this paper, we extend DIORA so that it can easily recover from local errors (§3). We replace the weight assignment used for vector averaging with a sparse operator equivalent to a one-hot argmax function, ensuring that each representation accurately encodes a single tree (hence, we call our method S-DIORA). In S-DIORA, it is not possible for a subtree to be washed out, although it is still possible to make an error by ignoring a potentially important subtree. Fortunately, this can be alleviated by adding a beam to each cell of the chart, allowing multiple subtrees over any span to be considered. The key benefit of our modification is that error recovery becomes easy, whereas previously the averaged vector served as a bottleneck that made error recovery difficult or impossible.
We initialize an instance of S-DIORA using the previously released DIORA model, then fine-tune before evaluating on the target domain, constituency parse trees from the English WSJ Treebank (PTB, Marcus et al. 1993). In one experimental setting, we assume no access to the evaluation domain and use a subset of DIORA's training data, a concatenation of the SNLI (Bowman et al., 2015) and Multi-NLI (Williams et al., 2018b) corpora (hereinafter NLI). In the other setting, we assume access to raw text in the target domain, parse tree labels excluded. In both cases, we see S-DIORA improves on the original DIORA performance by at least 4 F1, and training on the PTB raw text leads to an improvement of more than 3 F1 over the previous state of the art in unsupervised constituency parsing.
In summary, the main contributions in this paper are: (a) an extension to DIORA called S-DIORA that allows for easy recovery from local errors; (b) new results in unsupervised constituency parsing, improving over the previous state of the art by 2.2-6% F1 depending on the data used for fine-tuning; and (c) a thorough error analysis of the parse tree output revealing useful insights into why S-DIORA improves over baselines, for example, capturing marginally fewer prepositional phrases in the parse tree output yet making half the PP-attachment errors.

DIORA
Drozdov et al. (2019a) introduced DIORA, an unsupervised model that learns to 'reconstruct the input by discovering and exploiting syntactic regularities of the text.' It operates much like a masked language model or denoising autoencoder: first it encodes all-but-one of the words from the input sentence as a vector representation, then it decodes from this vector the missing word. DIORA encodes the sentence in the shape of a constituency tree, yet the model is trained using raw text only and without access to tree annotations. The 'ground truth' tree is unknown, so all valid trees are considered simultaneously using an efficient dynamic program with soft vector weighting.
Here is a sketch of how this approach works. Consider the hypothetical sentence with tokens $x_0\, x_1\, x_2\, x_3$. Although the 'ground truth' tree is unknown, one valid tree is $\tau = ((x_0 (x_1 x_2)) x_3)$. For each span of tokens $x_{i:j}$, DIORA computes an inside vector $h^{in}_{i,j}$, summarizing the information in that span. Additionally, DIORA computes an outside vector $h^{out}_{i,j}$ representing the tokens not in $x_{i:j}$. Assume that $x_2$ is the target token to predict; then for the parse tree $\tau$ the token $x_1$ is in the inside context for $x_2$ because $x_1$ is the immediate sibling of $x_2$ in the subtree capturing both tokens. The tokens $x_0$ and $x_3$ are not captured in this subtree and are considered to be in the outside context of $x_2$. DIORA represents the inside context as $h^{in}_{1,1}$ and the outside context as $h^{out}_{1,2}$ to compute $h^{out}_{2,2,k}$. The $k$ in the subscript indicates that this is only one of many possible valid trees for the hypothetical sentence. DIORA assigns a weight $s_{2,2,k}$ to each valid tree, where higher weight values indicate the tree is more helpful for predicting the target token. The vector used to predict the target token is a weighted summation of all the tree representations, $h^{out}_{2,2} = \sum_k q_{2,2,k}\, h^{out}_{2,2,k}$, where $q_{i,j,k}$ is a weight DIORA assigns to each subtree.
The rest of this section covers in more technical detail how to recursively compute the inside and outside vectors and weights for DIORA. The recursive computation is done efficiently using a chart data structure and dynamic program similar to the inside-outside algorithm (Baker, 1979). Part of this computation involves a softly weighted summation, which is an efficient way to encode all valid trees, yet has some downsides (§2.3).
Figure 2: In the inside pass (left) DIORA composes two neighboring vectors. In the outside pass (right) DIORA computes the values for a target span $(i, j)$ recursively from its sibling inside span $(j+1, k)$ and outside spans $(0, i-1)$ and $(k+1, n-1)$. The sibling span on the outside pass can appear to the left of the target span, in which case the indexing is adjusted.

Scoring and Composition
To fill the chart, DIORA learns to compose vectors using a multi-layer neural network (referred to as MLP), and to score vectors using a bilinear function. In this section, we describe the chart-filling procedure from Drozdov et al. (2019a). In the inside chart, the matrix $W$ is learned; when $i = j$, the scalars $s^{in}$ equal 0 and the vectors $h^{in}$ are equal to the embedding of the token at the $i$-th position in the sentence $x$.
The outside chart is filled analogously. The matrix $U$ is learned; when $i = 0$ and $j = |S| - 1$, the scalars $s^{out}$ equal 0 and the vector $h^{out}$ is learned independent of the sentence (analogous to the initial hidden state in a recurrent neural network).
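As a hedged reconstruction of the chart equations from the description above (a bilinear score with the learned matrices $W$ and $U$, an MLP composition, and accumulated child scores), the recursions would take roughly the following form. The names $\mathrm{MLP}_{in}$ and $\mathrm{MLP}_{out}$ and the exact argument order are assumptions for illustration, and the outside case is written for a sibling to the right of the target span, as in Figure 2.

```latex
% Inside pass: span (i, j) split after position k, with i <= k < j
s^{in}_{i,j,k} = (h^{in}_{i,k})^{\top} W \, h^{in}_{k+1,j} + s^{in}_{i,k} + s^{in}_{k+1,j}
h^{in}_{i,j,k} = \mathrm{MLP}_{in}(h^{in}_{i,k},\, h^{in}_{k+1,j})

% Outside pass: target span (i, j), sibling inside span (j+1, k), parent span (i, k)
s^{out}_{i,j,k} = (h^{in}_{j+1,k})^{\top} U \, h^{out}_{i,k} + s^{in}_{j+1,k} + s^{out}_{i,k}
h^{out}_{i,j,k} = \mathrm{MLP}_{out}(h^{in}_{j+1,k},\, h^{out}_{i,k})
```

With these forms, the base cases stated in the text ($s^{in}_{i,i} = 0$, $h^{in}_{i,i}$ equal to the token embedding, and a learned root $h^{out}$ with $s^{out} = 0$) start the two recursions.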
For a given span $(i, j)$ there may be multiple valid split points or parent-sibling contexts. If each were considered separately, this would lead to a combinatorial explosion of paths to explore. Instead, DIORA averages the scalars and vectors that share the same $(i, j)$ values. This is identical for the outside and inside passes: each candidate $k$ receives a weight $q_{i,j,k}$, and the cell stores the weighted sums $s_{i,j} = \sum_k q_{i,j,k}\, s_{i,j,k}$ and $h_{i,j} = \sum_k q_{i,j,k}\, h_{i,j,k}$.
Learning. DIORA is trained end to end via word prediction. The bottom-most vectors in the outside chart represent the entire sentence $x$ except for a single token. By predicting this missing token $x_i$ from the outside vector $h^{out}_{i,i}$, we may update the model's parameters without any parse tree labels. The training objective for a single sentence is this word-prediction (reconstruction) loss taken over every token position.
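The soft inside pass described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_compose` is a hypothetical stand-in for the learned MLP, the weights $q$ are assumed to be a softmax over the candidate scores of a cell, and none of the function names come from the released DIORA code.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def bilinear(u, W, v):
    # u^T W v for list-based vectors and a nested-list matrix.
    return sum(u[a] * W[a][b] * v[b] for a in range(len(u)) for b in range(len(v)))

def toy_compose(left, right):
    # Hypothetical stand-in for DIORA's learned MLP composition.
    return [math.tanh(a + b) for a, b in zip(left, right)]

def inside_pass(emb, W, compose=toy_compose):
    """Fill the inside chart with softly averaged vectors and scores."""
    n = len(emb)
    h = {(i, i): list(emb[i]) for i in range(n)}  # leaf vectors are embeddings
    s = {(i, i): 0.0 for i in range(n)}           # leaf scores are zero
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            scores, vecs = [], []
            for k in range(i, j):  # every split of span (i, j)
                left, right = h[(i, k)], h[(k + 1, j)]
                scores.append(bilinear(left, W, right) + s[(i, k)] + s[(k + 1, j)])
                vecs.append(compose(left, right))
            q = softmax(scores)  # assumed normalization of the weights
            s[(i, j)] = sum(qk * sk for qk, sk in zip(q, scores))
            h[(i, j)] = [sum(qk * v[d] for qk, v in zip(q, vecs))
                         for d in range(len(vecs[0]))]
    return h, s
```

Because every cell stores a single averaged vector, a split that receives a small weight contributes almost nothing to the cells built on top of it, which is the wash-out effect discussed in the text.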

Parse Tree Inference
Although DIORA is not trained with any parse tree annotations, its chart filling procedure can be used to extract binary unlabeled parse trees. First, fill the inside chart following §2.1. Afterwards, use the CKY algorithm to find $\hat{y}$, the maximal scoring tree, where the score for a tree $y$ is $S(y) = \sum_{(i,j,k) \in y} s^{in}_{i,j,k}$. This approach demonstrated impressive results for unsupervised constituency parsing (Drozdov et al., 2019a).
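The CKY decoding step can be illustrated with a minimal sketch. It assumes a hypothetical table `split_score[(i, j, k)]` holding one local score for splitting span $(i, j)$ after position $k$, and treats the tree score as the sum of those local scores; how the released implementation stores its scores may differ.

```python
def cky_best_tree(n, split_score):
    """Return (score, tree) for the best binary tree over n tokens.

    split_score[(i, j, k)] is the assumed local score of splitting
    span (i, j) into (i, k) and (k + 1, j).
    """
    best = {(i, i): 0.0 for i in range(n)}
    back = {}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Score each split as its local score plus the best child scores.
            cands = [(split_score[(i, j, k)] + best[(i, k)] + best[(k + 1, j)], k)
                     for k in range(i, j)]
            best[(i, j)], back[(i, j)] = max(cands)

    def build(i, j):
        # Follow back-pointers to recover the tree as nested tuples of indices.
        if i == j:
            return i
        k = back[(i, j)]
        return (build(i, k), build(k + 1, j))

    return best[(0, n - 1)], build(0, n - 1)
```

The search is exact for a fixed score table, which matches the observation in the text that any weakness lies in how the scores are assigned rather than in the decoding itself.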
To better understand the effectiveness of decoding parse trees with DIORA, we train DIORA for supervised parsing using a binarized version of the 'ground truth' parse trees from the English Penn Treebank (Marcus et al., 1993). Training optimizes the structured SVM loss $J^{sup}_{tree} = \max(0, S(\hat{y}) - S(y) + 1)$, where $S(\hat{y})$ is the score of the maximal tree and $S(y)$ is the score of the 'ground truth'.
We use the off-the-shelf parser from Kitaev and Klein (2018) as a baseline and the results are shown in Table 1. Although DIORA is strong in unsupervised parsing, the supervised parsing results are not as competitive with the baseline as we had expected, which led us to consider why this might be the case.
We posit the low performance in supervised parsing is due to DIORA's inability to effectively recover from local errors. Predicting trees in DIORA is exact: you are guaranteed to find the highest scoring tree given the scalar values associated with each span, but there is a weakness when assigning the scalar values. Specifically, the scalar values are assigned using local information, and may assign a low weight to a subtree which, when given more information, deserves to be given higher weight. Said plainly, this might occur when the sentence has structural ambiguity that requires context to resolve. For instance, the clause 'We saw the dog with my phone' has a more likely parse tree depending on the context. In the next section we present our extension to DIORA that addresses this downside.
Table 1: Supervised parsing results on the validation set of PTB using parsing F1 with binarized trees. DIORA does not do well because of its inherent weakness, and the best setting from S-DIORA (Table 2) is superior.

S-DIORA: Single Tree Encoding
We improve DIORA by making it more robust to local errors. DIORA is sensitive to errors because its vector averaging approach makes it difficult or impossible to recover when important subtrees have been washed out. The first modification we present prevents trees from being washed out by replacing the weights $q$ with a sparse operator $q'$ equivalent to a one-hot argmax. This effectively replaces vector averaging with selection of the highest scoring subtree for each span.
This change alone is not sufficient. If using only a single highest-scoring tree, S-DIORA would remain as vulnerable, or more so, to local errors that are inevitable when using the context-free approach of the inside-outside algorithm. Instead, at each cell in the chart we record up to $\beta$ values corresponding to the highest scoring subtrees. We refer to $\beta$ as the beam-size, and our experiments demonstrate that using a beam-size of 2 already gives a great improvement in results, although any size of $\beta$ can be used at test time regardless of what was used during training.
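A minimal sketch of single tree encoding with a beam at each cell is given below. It is an illustration, not the released implementation: `toy_compose` is a hypothetical stand-in for the learned MLP, each beam entry is assumed to carry its own score, vector, and tree, and `beam_size` plays the role of the beam-size described above.

```python
import heapq
import math

def bilinear(u, W, v):
    # u^T W v for list-based vectors and a nested-list matrix.
    return sum(u[a] * W[a][b] * v[b] for a in range(len(u)) for b in range(len(v)))

def toy_compose(left, right):
    # Hypothetical stand-in for the learned MLP composition.
    return [math.tanh(a + b) for a, b in zip(left, right)]

def beam_inside_pass(emb, W, beam_size, compose=toy_compose):
    """Each cell keeps up to beam_size (score, vector, tree) entries, best first."""
    n = len(emb)
    chart = {(i, i): [(0.0, list(emb[i]), i)] for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cands = []
            for k in range(i, j):
                # Pair every entry on the left beam with every entry on the right beam.
                for ls, lv, lt in chart[(i, k)]:
                    for rs, rv, rt in chart[(k + 1, j)]:
                        score = bilinear(lv, W, rv) + ls + rs
                        cands.append((score, compose(lv, rv), (lt, rt)))
            # Hard selection: keep whole candidates instead of averaging them.
            chart[(i, j)] = heapq.nlargest(beam_size, cands, key=lambda c: c[0])
    return chart
```

With `beam_size=1` this reduces to a one-hot argmax per cell. Larger beams keep lower-ranked subtrees available, so a parent cell can still prefer them once more context is seen.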
In the popular K-means algorithm, each point minimizes its distance to only one centroid. Using this as motivation, we train S-DIORA such that each sentence is only drawn towards one tree. We implement this change using a variant of the structured SVM loss: $J^{unsup}_{tree}(x) = \max(0, S(y_1) - S(y_0) + 1)$, where $S(y_i)$ is the score for the $i$-th tree represented on the beam and $S(y) = \sum_{(i,j,k) \in y} s^{in}_{i,j,k}$. S-DIORA trains with this loss in addition to the reconstruction loss (the original DIORA objective); the total objective $J_{\text{S-DIORA}}$ is the sum of the two.
A natural question to ask is whether S-DIORA is difficult to train given that arg max is relatively non-smooth. To help train S-DIORA, we tried several techniques. During our unsupervised parsing experiments, we used Gumbel-top-k (Kool et al., 2019) for $q'$ to ensure the model would sufficiently explore multiple parse trees. We also added regularization via mixout (Lee et al., 2019a) or L2 regularization towards the initial parameters so that the model would not diverge drastically from its initialization and suffer catastrophic forgetting. Empirically, we found that none of these additions were necessary, and that fine-tuning with the $J_{\text{S-DIORA}}$ objective was sufficient. One possible explanation is that training with $\beta > 1$ already lets S-DIORA explore multiple subtrees for each span during training.
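The unsupervised tree loss can be sketched as a small function over the scores of the trees on the root beam. The sketch assumes the hinge form with a margin of 1 used for the supervised loss in Appendix A.1.1; the function names are hypothetical.

```python
def unsup_tree_loss(beam_scores, margin=1.0):
    """Hinge loss separating the best tree on the beam from the runner-up."""
    if len(beam_scores) < 2:
        return 0.0  # a beam of one has no runner-up to separate from
    top, second = sorted(beam_scores, reverse=True)[:2]
    return max(0.0, second - top + margin)

def s_diora_loss(reconstruction_loss, beam_scores):
    # Total objective: reconstruction loss plus the tree loss (Appendix A.1.2).
    return reconstruction_loss + unsup_tree_loss(beam_scores)
```

The loss is zero once the best tree leads the runner-up by at least the margin, so it only acts on sentences where the two top trees are close in score.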

Experiments and Results
The model and approach in this paper are motivated mainly by wash out in DIORA with respect to its vector averaging. In this section we experimentally test the following hypotheses:
• There are often multiple valid parse trees for a clause (in other words, phrases can have structural ambiguity), therefore we expect that S-DIORA with a beam should be more effective at supervised parsing than DIORA.
• Word prediction benefits from parsing sentences with their most likely constituency tree, therefore S-DIORA, which is trained via word prediction, should be more effective at unsupervised parsing than DIORA because it can recover from local errors.
• Parsers are sensitive to their training domain. Although we expect training with S-DIORA to be helpful for unsupervised parsing, we expect an even bigger benefit when training on the same domain as used for evaluation.

Preliminaries: Constituency Parsing
We measure the performance of our changes via unsupervised and supervised parsing on the test set of the WSJ Penn Treebank (Marcus et al., 1993). All models (S-DIORA and baselines) output unlabeled binary trees and are evaluated via sentence-level F1 (S-F1).
• True Positives (TP) are the spans in both parse trees (inferred and ground truth).
• False Positives (FP) are spans predicted but not in the ground truth.
• False Negatives (FN) are spans in the ground truth but not the predicted tree.
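The span-based metric can be sketched as follows, counting only non-trivial spans (two or more words, excluding the whole sentence). Trees are represented here as nested tuples of token strings, which is an assumption made for illustration, and returning 1.0 when neither tree has a non-trivial span is one possible convention rather than something the text specifies.

```python
def spans(tree, start=0):
    """Return (set of (i, j) spans, number of tokens) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return set(), 1  # a single token contributes no span
    out, pos = set(), start
    for child in tree:
        child_spans, length = spans(child, pos)
        out |= child_spans
        pos += length
    out.add((start, pos - 1))
    return out, pos - start

def sentence_f1(pred, gold):
    """Unlabeled F1 between two trees over the same sentence."""
    p_all, n = spans(pred)
    g_all, _ = spans(gold)

    def keep(span):
        # Non-trivial: at least two words and not the entire sentence.
        return span[1] > span[0] and span != (0, n - 1)

    p = {sp for sp in p_all if keep(sp)}
    g = {sp for sp in g_all if keep(sp)}
    tp, fp, fn = len(p & g), len(p - g), len(g - p)
    if not p and not g:
        return 1.0  # assumed convention for sentences with no non-trivial spans
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

S-F1 over a corpus would then be the mean of `sentence_f1` across sentences.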
Following previous work, we consider only non-trivial spans (covering 2 or more words, and ignoring spans covering the entire sentence). For pre-processing we remove punctuation.
For supervised parsing, we train on the training split of PTB and evaluate using the validation data. We binarize the ground truth using the Stanford parser (Manning et al., 2014) and train for 10 epochs. Results against the binary trees and the original n-ary trees (ignoring labels in both cases) are shown in Table 1. Both DIORA and S-DIORA are trained from random initialization using the structured SVM loss from Kitaev and Klein (2018). We see that DIORA is not competitive with the Kitaev and Klein (2018) parser, and attribute this to wash out and its inability to recover from errors.
For S-DIORA we train and evaluate with $\beta \in \{1, 2, 3, 4\}$ and results are shown in Table 2. We see, unsurprisingly, that regardless of the beam-size at training, when $\beta = 1$ at test time the performance is worse than DIORA. This is because even though S-DIORA does not suffer from wash out, when the beam is too small it cannot recover from errors. As the beam-size increases, so does performance, surpassing DIORA by 3 F1 in the best case ($\beta = 3$ for training; $\beta = 4$ at test time).

Unsupervised Parsing
We explored two settings in unsupervised parsing. In the first, the zero-shot case, we assume no access to the evaluation domain. Instead, we sample a subset from the NLI data used to train DIORA and use this to fine-tune S-DIORA. The subset includes the same number of sentences as the training data from PTB and the same sentence length distribution. The results are shown in Table 3 with the model name S-DIORA$_{NLI}$. This model does substantially better than the original DIORA (an improvement of more than 5 F1) and is even competitive with the state-of-the-art model C-PCFG (Kim et al., 2019).
In the other experimental setting we assume raw text in the target domain is available but annotations are not. We fine-tune using the training data from PTB (about 40k sentences) and results are shown in Table 3 with the model name S-DIORA$_{PTB}$. This improves upon S-DIORA$_{NLI}$ by a full 2 F1 points and is also substantially better than the previous state of the art, by 3.5 F1.
S-DIORA sees a large improvement in WSJ-10. These sentences are of length 10 or less, and previously DIORA was on par with ON-LSTM (Shen et al., 2019). When we bucket F1 by sentence length, we see that S-DIORA improves not only on short sentences but across all sentence lengths.
To determine whether fine-tuning is necessary, we initialize S-DIORA from DIORA and evaluate it immediately. In this setting DIORA$_{None}$ scores 5 F1 lower than DIORA, confirming the importance of fine-tuning.
To further determine if the extra training data was the main factor in the improved performance, we train DIORA with an equivalent amount of data and see no improvement. This is not surprising given that the pre-trained DIORA was trained initially until convergence on relatively more data.
Table 3: [...] Kim et al. (2019). The average across random seeds is F1$_{\mu}$ and the best model's F1 is reported as F1$_{max}$. We take the best model and also evaluate it on sentences of length 10 or less and report the value in F1$_{n \leq 10}$. Values with a † are copied from Kim et al. (2019). We only had access to a single DIORA model so no F1$_{\mu}$ is reported.

Training and Implementation Details
When applicable, we use the MLP with 'softmax loss' model checkpoint provided by Drozdov et al. (2019a). S-DIORA makes an impactful change to DIORA, but its parameters are exactly the same, making it easy to load a pre-trained DIORA model for S-DIORA. Our implementation of S-DIORA, checkpoints of best models, training scripts, and all parsing output are available online. Additional training details are covered in Appendix A.1.

Discussion and Analysis
In this section we examine the parse tree output of the models in our experimental setup with more fine-grained detail than parsing F1. Given the prevalence of pre-trained language models in NLP tasks, we also include in our analysis recent results using transformers for unsupervised parsing. In addition, we present a new baseline demonstrating that pretrained language models are better at unsupervised parsing than previously known.

Linguistic Error Analysis
Parsing F1 is useful to quickly compare performance between parsers, and previous work in unsupervised parsing often also reports segment recall to give a sense of which phrases are most often captured in the output. To provide an even more thorough treatment of linguistic errors we add labels to the parse trees using the parser from Kitaev and Klein (2018) and then run the Berkeley parser analyzer (Kummerfeld et al., 2012). This latter tool classifies mistakes for each predicted tree by the type of phrases (or patterns like coordination) involved in the error, allowing analysis of the types of errors being made by a model. In Table 4 we show the parsing F1, segment recall, and error counts as determined by the analyzer.
We see that C-PCFG outperforms DIORA in segment recall for NP and PP, explaining its high S-F1. The linguistic analysis tells a slightly different story: C-PCFG makes fewer errors associated with NP internal structure and clause attachment, but substantially more errors associated with PP attachment.
Table 4: To better understand the difference between models, displayed above are the segment recall on the WSJ validation set separated by phrase type (the left columns). For a more informative look at linguistic phenomena, we use the Berkeley parser analyzer (Kummerfeld et al., 2012) and display error counts (the right columns). Since the unsupervised parsing models do not provide labels, we use a high-performing supervised constituency parser (Kitaev and Klein, 2018) to add them.

Unsupervised Parsing with Large Pre-trained LMs
We introduce a new unsupervised parsing baseline using ELMo (Peters et al., 2018), so that we may compare S-DIORA with large pre-trained LMs, a class of models that have recently proven very effective across NLP tasks. To extract a parse tree from ELMo, we first compute vector similarity between phrase embeddings in the output, then use these scalar values as input to the CKY algorithm. Compared to ELMo we see that S-DIORA captures fewer ADVP phrases yet also makes fewer NP-I errors. Although S-DIORA has a strong affinity for VP phrases, ELMo makes fewer VP-A errors.
For further comparison we include the best models from Kim et al. (2020). We see that XLNet =0 is the worst of all models in S-F1 and VP segment recall, but also has the fewest VP-A errors. This suggests that errors related to segment recall are likely folded into a different category such as PP attachment. The right-skewed model XLNet =1.5 substantially improves over XLNet =0 in SBAR recall and is comparable in this category with S-DIORA.
Interestingly, although increasing the beam-size $\beta$ in S-DIORA results in a near monotonic improvement in all categories (with some minor exceptions), S-DIORA shows a very different error profile when compared to pre-trained LMs, despite having a better S-F1. For instance, the pre-trained LMs make fewer coordination errors, and perform better with adverbial phrases (ADVP), than any version of S-DIORA. In future work, it may be useful to understand why parser performance does not increase monotonically. Perhaps this is an artifact of the current state of unsupervised parsing research and will change once parsers improve beyond some threshold.

DIORA versus S-DIORA
It is not sufficient to initialize S-DIORA from DIORA without fine-tuning. DIORA$_{None}$ does worse than DIORA in nearly every category. Furthermore, the biggest benefit is gained when using S-DIORA with $\beta > 1$, otherwise error recovery is not possible (see Figure 3). DIORA is trained on NLI and it is not surprising it incurs so many errors in coordination and clause attachment, which are frequently observed in domain mismatch (Kummerfeld et al., 2012). We used the same checkpoint for fine-tuning with the original formulation of DIORA; any improvements would be from exposure to more training data. When using NLI for fine-tuning, across 5 random seeds there was no improvement over the pre-trained model. This is not surprising given the original models were trained until convergence with relatively large amounts of training data.
Figure 3: In this example, a beam-size of 1 is not sufficient for S-DIORA to improve upon DIORA; error recovery is only achieved with larger $\beta$. The trees from top to bottom are from PTB, DIORA, then S-DIORA$_{PTB}$ with $\beta = 1, 3$. Although larger $\beta$ can lead to more errors in certain situations (specifically clausal attachment), here they decrease. Example sentence: "The vote was a test of the government 's resolve to proceed with a restructuring program".
Training on NLI provides S-DIORA with a substantial advantage in segment recall for VP and PP. S-DIORA does much worse in capturing the low frequency ADVP category. This does not incur much penalty in S-F1 but is reflected in NP-I (see footnote 9).

Effects of Beam Size
Performance improves across the board as we increase the beam size $\beta$, and S-DIORA$_{PTB}$ improves over DIORA, suggesting that single tree encoding already provides some benefit (recall that we fine-tuned DIORA on both NLI and PTB with no improvements in unsupervised parsing). Most benefit is achieved using $\beta = 3$, although in some cases it helps to increase it further (see Figure 5). Increasing the beam also helps with different classes of errors. In Figure 4 we see the benefit in sentences with tricky coordination.

Labeled Parsing
We evaluate the labeled trees from §5.1, and the best performing S-DIORA model achieves 80.7 labeled parsing F1 on the validation data (72.3 recall, 91.2 precision, and 11.7 complete match) when evaluated this way. This suggests that unsupervised parsers are closer to supervised parsers than previously realized, and although deciding which phrases are in the tree is the harder task (Klein and Manning, 2002), it may be worth pursuing unsupervised labeling (see footnote 10) for more informative error analysis (Bisk and Hockenmaier, 2015).
Footnote 9: The NP-I category covers missed gold phrases within large noun phrases. In general, much of NP structure in PTB is not annotated, and in future work it is worth using the data provided by Vadas and Curran (2011) to investigate NP structure, as determined by unsupervised parsers, more thoroughly.
Figure 4: Two sentences where beam search helps with ambiguous coordination structures, correctly nesting noun phrases (top) and getting better coordination of verb phrases (bottom). The displayed parse tree outputs, top to bottom, are from PTB, then S-DIORA$_{PTB}$ with $\beta = 1, 3$ respectively. Example sentences: "The exchanges and the Securities and Exchange Commission agree on conditions for halting or staying" (top) and "He was punched and kicked by one player and the other broke his jaw" (bottom).

Related Work
Avoiding errors by using rich feature models. The nature of unsupervised parsing is that good performance is a result of strong inductive bias, explaining why DIORA and S-DIORA are so effective, yet their context-free approach to chart parsing is also the cause of local errors. S-DIORA employs a beam at each cell to recover from local errors, but this would be less helpful if errors were less frequent. Top performing supervised parsers do not need error recovery because they use models with rich features and model each span score independently (Cross and Huang, 2016; Stern et al., 2017; Kitaev and Klein, 2018; Mrini et al., 2019). Previous research has attempted to achieve the "best of both worlds" by distilling a strong model for supervised parsing via an unsupervised model's output (Le and Zuidema, 2015).
Footnote 10: Typically unsupervised constituency parsing is evaluated purely by its structure, although recent work from Drozdov et al. (2019b) shows that labels can be induced with DIORA by clustering the inside and outside phrase vectors.
Figure 5: As the beam-size increases, S-DIORA's output tends to match the ground truth more closely. The displayed outputs, top to bottom, are from PTB, then S-DIORA$_{PTB}$ with $\beta = 1, 3, 5$ respectively. Example sentence: "From the outset the tobacco industry has been uncertain as to what strategy to follow".
These approaches are closely related to fast and accurate parsing. More accurate models tend to use richer features that are more expensive to compute, influencing researchers to find efficient techniques to offset the loss in speed (Vieira and Eisner, 2017). In this paper, we use the simplest approach to learn to parse with the capability to recover from local errors, maintaining a beam of size $\beta$ at each cell in the chart. S-DIORA is often faster and discovers better trees than DIORA, but there are other methods for extracting lists of best or plausible parses (Resnik, 1992; Roark and Johnson, 1999; Charniak and Johnson, 2005; Huang and Chiang, 2005; Bouchard-Côté et al., 2009) that might further improve performance.
Sparse structured inference. Various work has explored sparse alternatives to soft-weighting. Sparsemax (Martins and Astudillo, 2016) is a deterministic sparse alternative to the softmax, and Gumbel-Softmax (Jang et al., 2017) uses the categorical reparameterization trick to sample a discrete value during training. Both have attractive properties but alone would not be sufficient for overcoming local errors in S-DIORA. Nonetheless, these options would be worth exploring for unsupervised parsing when training with more data or when the ground truth parse trees are very different from the ones in S-DIORA's output frontier after initialization. Other work has explored methods for differentiable structured inference (Niculae et al., 2018; Mensch and Blondel, 2018; Corro and Titov, 2019a,b), which may also be suitable. It is worth noting that PCFGs are not graphical models (Liang et al., 2009), and marginal inference is often not tractable, which is why these approximate methods may be helpful.
Grammar induction. There is a rich research history in grammar induction and unsupervised parsing (Fu and Booth, 1975;Angluin, 1980;Carroll and Charniak, 1992). We cover notable work not already mentioned in Appendix A.2.

Conclusion
We introduce S-DIORA, an extension to DIORA that enables easy recovery from local errors and is not subject to wash out from vector averaging. Our experiments in supervised parsing verify that S-DIORA improves upon the representational power of DIORA. Unsupervised fine-tuning with S-DIORA leads to new results in unsupervised constituency parsing, improving upon the previous state of the art by 2.2-6% F1, depending on the data used.

A.1 Training Details
All key details for training and evaluating our method, S-DIORA, are described in the main text.
In this Appendix section we repeat those details and provide an organized reference to aid reproducibility.

A.1.1 Supervised Parsing Loss and Training
In supervised parsing, we assume access to binary non-projective constituency trees y for each sentence x. Predicting a treeŷ with DIORA can be done using the CKY method described in Drozdov et al. (2019a). Similarly, backtracking the various max operations from the inside-pass in S-DIORA can be used to decodeŷ. 12 The conditional probability of a tree given a sentence is proportional to the sum of scalar values for each span and split (i, j, k) in the tree, depicted in Eq. 2.
To train DIORA or S-DIORA to predict the most likely tree for an input sentence, we use the structured SVM loss employed in several other works on supervised parsing (Stern et al., 2017; Kitaev and Klein, 2018) with a margin of 1 and without loss-augmented inference, depicted in Eq. 3.

$$J^{\text{sup}}_{\text{tree}} = \max\big(0,\; S(\hat{y}) - S(y) + 1\big) \tag{3}$$

In our experiments, we train DIORA and S-DIORA on the training data from PTB (roughly 40k sentences). Both models are trained from random initialization using the same hyperparameters. Early stopping is done by evaluating against the validation data each epoch. S-DIORA is trained with beam sizes in {1, 2, 3, 4}.
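The margin loss in Eq. 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: a tree is represented as a collection of (i, j, k) triples, where we assume the span covers words i to j-1 and k is the split point, and `span_scores` is a hypothetical dictionary holding the scalar value for each triple.

```python
def tree_score(tree, span_scores):
    """S(y): sum the scalar values of every (i, j, k) span and split in the tree."""
    return sum(span_scores[node] for node in tree)

def structured_hinge(pred_tree, gold_tree, span_scores, margin=1.0):
    """Eq. 3: max(0, S(y_hat) - S(y) + margin), without loss-augmented inference."""
    s_pred = tree_score(pred_tree, span_scores)
    s_gold = tree_score(gold_tree, span_scores)
    return max(0.0, s_pred - s_gold + margin)

# Hypothetical scores for the two binary trees over a 3-word sentence.
span_scores = {
    (0, 3, 2): 0.2, (0, 2, 1): 0.5,  # left-branching: ((w0 w1) w2)
    (0, 3, 1): 0.4, (1, 3, 2): 0.9,  # right-branching: (w0 (w1 w2))
}
gold = [(0, 3, 2), (0, 2, 1)]
pred = [(0, 3, 1), (1, 3, 2)]
loss = structured_hinge(pred, gold, span_scores)
```

The loss is positive whenever the predicted tree does not score at least one margin below the ground truth, which pushes the score of the ground truth tree up relative to the prediction.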
This paper is primarily concerned with unsupervised parsing, and we only explored one hyperparameter setting, as supervised parsing is used primarily to verify the benefit of beam search in S-DIORA and its improvement over DIORA. Those hyperparameters are listed here:
For both supervised and unsupervised training, each batch is restricted to sentences of uniform length.
12 Each value saved on the beam in S-DIORA represents a unique tree; duplicate trees cannot appear on the beam.
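The restriction that each batch contains sentences of uniform length can be implemented by bucketing sentences by length before batching. The sketch below is one simple way to do this; the function name `length_batches` and the choice to emit buckets in increasing length order are assumptions made for illustration (in practice one would likely shuffle the batches).

```python
from collections import defaultdict

def length_batches(sentences, batch_size):
    """Group tokenized sentences into batches that share a single length."""
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    batches = []
    for length in sorted(buckets):
        bucket = buckets[length]
        # Slice each bucket into chunks of at most batch_size sentences.
        for start in range(0, len(bucket), batch_size):
            batches.append(bucket[start:start + batch_size])
    return batches

corpus = [
    ["the", "cat", "sat"],
    ["dogs", "bark"],
    ["a", "bird", "sings"],
    ["we", "left"],
    ["she", "reads", "books"],
]
batches = length_batches(corpus, batch_size=2)
```

With uniform-length batches, the chart for every sentence in a batch has the same shape, so no padding or masking is needed inside the dynamic program.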
A.1.2 Unsupervised Loss and Training
DIORA and S-DIORA are models especially effective for unsupervised parsing. In this setting we assume no access to parse tree labels, only raw text. The models are trained end-to-end by reconstructing the input sentence from the outside vectors. Reconstruction is defined as predicting a word $x_i$ given its context $\{x\}_{-i}$, which are the words in the rest of the sentence. Unlike Drozdov et al. (2019a), we use a fixed vocabulary instead of sampling, which includes the 10k most frequent words from the training data.¹³ The objective for a single sentence is depicted in Eq. 4.
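As an illustration of the fixed vocabulary described above, the following sketch keeps the most frequent words from a tokenized training set and maps every other word to a single placeholder id. The placeholder token `<unk>` and the helper names are assumptions for this example; the text does not specify how out-of-vocabulary words are handled.

```python
from collections import Counter

def build_vocab(sentences, max_size=10_000, unk="<unk>"):
    """Map the max_size most frequent tokens to integer ids; id 0 is the placeholder."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {unk: 0}
    for tok, _ in counts.most_common(max_size):
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sentence, vocab, unk="<unk>"):
    """Replace each token by its id, falling back to the placeholder id."""
    return [vocab.get(tok, vocab[unk]) for tok in sentence]

train = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird"],
]
vocab = build_vocab(train, max_size=2)
ids = encode(["the", "cat", "sat"], vocab)
```

Because the vocabulary is built from the training data, it differs between corpora, which is consistent with the note that the NLI and PTB vocabularies are not the same.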
As mentioned in §3, we also train S-DIORA to increase the confidence gap between its highest-scoring tree on the beam and other trees. To accomplish this we use the same structured SVM loss from supervised parsing, but instead of the ground truth y, we use the highest-scoring tree on the beam y_0 and the second highest y_1. This loss is depicted in Eq. 5, and the total loss for S-DIORA is simply the sum of the reconstruction and 'tree' losses (Eq. 6).
For S-DIORA_NLI we train using a subset of NLI.¹⁴ The subset is sampled once from NLI and used across all experiments, and consists of the same number of sentences as the training data from PTB, with the same distribution of sentence lengths. For S-DIORA_PTB we use the training data from PTB. Early stopping is done by evaluating against the validation data each epoch. We explore various hyperparameter settings, and for S-DIORA we also train with different beam sizes. S-DIORA is initialized from the MLP with 'softmax loss' DIORA checkpoint¹⁵ that was released by Drozdov et al. (2019a). We ran each setting for 5 random seeds. The best performing hyperparameter setting was chosen using validation performance, and the best performing settings (⌘, , n) for S-DIORA_NLI and S-DIORA_PTB were (1 3 , 2, 30) and (2 3 , 2, 30) respectively.
13 The vocabulary is different between NLI and PTB.
14 A concatenation of the training data from SNLI (Bowman et al., 2015) and Multi-NLI (Williams et al., 2018b).
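The unsupervised 'tree' loss and the total loss can be sketched as follows. This is a reading of the description above rather than a transcription of Eq. 5 and Eq. 6: we assume the top beam entry y_0 takes the place of the ground truth and the runner-up y_1 takes the place of the prediction, so the loss is max(0, S(y_1) - S(y_0) + 1). Returning zero when the beam holds fewer than two trees is also an assumption.

```python
def beam_margin_loss(beam_scores, margin=1.0):
    """Hinge loss that widens the gap between the two best trees on the beam."""
    if len(beam_scores) < 2:
        return 0.0  # assumption: no runner-up means no margin term
    top, runner_up = sorted(beam_scores, reverse=True)[:2]
    return max(0.0, runner_up - top + margin)

def total_loss(reconstruction_loss, beam_scores, margin=1.0):
    """Sum of the reconstruction loss and the beam margin ('tree') loss."""
    return reconstruction_loss + beam_margin_loss(beam_scores, margin)

# Hypothetical scores S(y) for the trees stored on the root cell's beam.
close_beam = [3.0, 2.5, 1.0]      # small gap: the margin term is active
separated_beam = [5.0, 1.0]       # gap exceeds the margin: no penalty
loss_close = total_loss(2.0, close_beam)
loss_separated = total_loss(2.0, separated_beam)
```

Once the best tree outscores the second best by at least the margin, only the reconstruction term contributes to the total loss.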

A.2 Other Work in Grammar Induction and Unsupervised Parsing
There is a rich research history in grammar induction and unsupervised parsing. In the main text, we cover the work most relevant to framing our scientific questions and experimental results. Here, we instead mention loosely related work that would be useful for further analysis and future research. Some of the mentioned work is in dependency parsing rather than constituency parsing, or concerns measuring syntactic information without parse trees.
15 DIORA and S-DIORA have exactly the same parameters, so one can be initialized easily from the other. The number of parameters is the same, but the runtime of S-DIORA is slower by an order of . Even so, a correctly implemented S-DIORA should be as fast as DIORA or faster since the sparse operator q 0 can be leveraged to avoid computation when there are many possible subtrees for a span.

A.2.1 Partial Semantic Information
We assume access to no text annotation, but often some might be available (Pereira and Schabes, 1992), and this can be leveraged to constrain induced syntax in a useful way. Naseem and Barzilay (2011) explore the syntactic structure of semantic relations, presenting an approach that encourages structural consistency for each occurrence of a specific semantic relation while also allowing for variation. DIORA and S-DIORA represent spans as vectors, and a simple extension would be to encourage span vectors associated with the same semantic relation to be similar through contrastive estimation (Smith and Eisner, 2005a,b; Gimpel and Bansal, 2014). Rather than encouraging similarity within a relation, Shi et al. (2019) have success encouraging similarity between an image and constituents in its caption.

A.2.2 Multilingual Alignment
Syntactic phrase types do not necessarily translate to the same type across languages (Koehn and Knight, 2003), but one can still leverage parallel text to improve unsupervised constituency parsing, as a phrase in one language may have less uncertain structure in another (Snyder et al., 2009).

A.2.3 Label Refinement
Similarities across languages can be used to create fine-grained grammar rules that are helpful when applied as soft constraints for grammar induction, since they serve as a prior to contradict patterns seen in the data (Naseem et al., 2010). These linguistic priors need not be derived from cross-lingual data (Druck et al., 2009): using a small set of simple rules (e.g. a determiner followed by a noun is a noun phrase) can be helpful for grammar induction, and such rules can be derived from a few positive examples of phrases (Haghighi and Klein, 2006).
Williams et al. (2018a) measure self F1 in addition to parsing F1 and find that the models that consistently converge to the same grammar were also the ones most different from ground truth, although this was an extreme case as the pertinent model made trivial predictions (nearly always left-branching). Follow-up work from Mohananey et al. (2020) shows that self-training is helpful for training PRPN (Shen et al., 2018) and that parsing F1 improves with self-agreement, with the biggest benefit for longer sentences.
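For readers unfamiliar with self F1, the sketch below shows one simple way to compute it for a single sentence: the average pairwise unlabeled bracket F1 between the parses produced by different training runs. Representing a parse as a set of (start, end) spans and averaging over all pairs of runs are assumptions for this illustration; the evaluation details in Williams et al. (2018a), such as the treatment of trivial spans, may differ.

```python
from itertools import combinations

def span_f1(pred_spans, ref_spans):
    """Unlabeled bracket F1 between two parses given as collections of spans."""
    pred, ref = set(pred_spans), set(ref_spans)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def self_f1(runs):
    """Average F1 over every pair of runs; needs at least two runs."""
    pairs = list(combinations(runs, 2))
    return sum(span_f1(a, b) for a, b in pairs) / len(pairs)

# Hypothetical parses of one 4-word sentence from three random seeds.
left_branching = {(0, 2), (0, 3), (0, 4)}
right_branching = {(2, 4), (1, 4), (0, 4)}
runs = [left_branching, right_branching, left_branching]
agreement = self_f1(runs)
```

A high value means the runs agree with each other, which, as noted above, does not by itself imply agreement with the ground truth.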