Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction

There have been several recent attempts to improve the accuracy of grammar induction systems by bounding the recursive complexity of the induction model. Modern depth-bounded grammar inducers have been shown to be more accurate than early unbounded PCFG inducers, but this technique has never been compared against unbounded induction within the same system, in part because most previous depth-bounding models are built around sequence models, the complexity of which grows exponentially with the maximum allowed depth. The present work instead applies depth bounds within a chart-based Bayesian PCFG inducer, where bounding can be switched on and off, and then samples trees with or without bounding. Results show that depth-bounding is indeed significantly effective in limiting the search space of the inducer and thereby increasing the accuracy of the resulting parsing model, independent of the contribution of modern Bayesian induction techniques. Moreover, parsing results on English, Chinese and German show that this bounded model is able to produce parse trees more accurately than or competitively with state-of-the-art constituency grammar induction models.


Introduction
Unsupervised grammar inducers hypothesize hierarchical structures for strings of words. Using context-free grammars (CFGs) to define these structures, previous attempts at either CFG parameter estimation (Carroll and Charniak, 1992; Schabes and Pereira, 1992; Johnson et al., 2007b) or directly inducing a CFG as well as its probabilities (Liang et al., 2009; Tu, 2012) have not achieved as much success as experiments with other kinds of formalisms (Klein and Manning, 2004; Seginer, 2007; Ponvert et al., 2011). The assumption has been made that the space of grammars is so big that constraints must be applied to the learning process to reduce the burden of the learner (Gold, 1967; Cramer, 2007; Liang et al., 2009).
One constraint that has been applied is recursion depth (Schuler et al., 2010; Ponvert et al., 2011; Shain et al., 2016; Noji and Johnson, 2016; Jin et al., 2018), motivated by human cognitive constraints on memory capacity (Chomsky and Miller, 1963). Recursion depth can be defined in a left-corner parsing paradigm (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983). Left-corner parsers require only minimal stack memory to process left-branching and right-branching structures, but require an extra stack element to process each center embedding in a structure. For example, a left-corner parser must add a stack element for each of the first three words in the sentence "For parts the plant built to fail was awful", shown in Figure 1. These kinds of depth bounds in sentence processing have been used to explain the relative difficulty of center-embedded sentences compared to more right-branching paraphrases like "It was awful for the plant's parts to fail".
However, depth-bounded grammar induction has never been compared against unbounded induction in the same system, in part because most previous depth-bounding models are built around sequence models, the complexity of which grows exponentially with the maximum allowed depth. In order to compare the effects of depth-bounding more directly, this work extends a chart-based Bayesian PCFG induction model (Johnson et al., 2007b) to include depth bounding, which allows both bounded and unbounded PCFGs to be induced from unannotated text.
Experiments reported in this paper confirm that depth-bounding does empirically have the effect of significantly limiting the search space of the inducer. Analyses of this model also show that the posterior samples are indicative of implicit depth limits in the data. This work also shows for the first time that it is possible to induce an accurate unbounded PCFG from raw text with no strong linguistic constraints. With a novel grammar-level marginalization in posterior inference, comparisons of the accuracy of bounded grammar induction using this model against other recent constituency grammar inducers show that this model is able to achieve state-of-the-art or competitive results on datasets in multiple languages.

Related work
Induction of PCFGs has long been considered a difficult problem (Carroll and Charniak, 1992; Johnson et al., 2007b; Liang et al., 2009; Tu, 2012). Lack of success for direct estimation was attributed either to a lack of correlation between the linguistic accuracy and the optimization objective (Johnson et al., 2007b), or to the likelihood function or the posterior being filled with weak local optima (Smith, 2006; Liang et al., 2009). Much of this grammar induction work used strong linguistically motivated constraints or direct linguistic annotation to help the inducer eliminate some local optima. Schabes and Pereira (1992) use bracketed corpora to provide extra structural information to the inducer. Use of part-of-speech (POS) sequences in place of word strings is popular in the dependency grammar induction literature (Klein and Manning, 2002, 2004; Berg-Kirkpatrick et al., 2010; Jiang et al., 2016; Noji and Johnson, 2016). Combinatory Categorial Grammar (CCG) induction also relies on POS tags to assign basic categories to words (Bisk and Hockenmaier, 2012, 2013), among other constraints such as CCG combinators. Other linguistic constraints such as constraints of root nodes (Noji and Johnson, 2016), attachment rules (Naseem et al., 2010) or acoustic cues (Pate, 2013) have also been used in induction.
Depth-like constraints have been applied in work by Seginer (2007) and Ponvert et al. (2011) to help with the search. Both of these systems are successful in inducing phrase structure trees from only words, but only generate unlabeled constituents.
Depth-bounds are directly used by induction models in work by Noji and Johnson (2016), Shain et al. (2016) and Jin et al. (2018), and are shown to be beneficial to induction. Noji and Johnson (2016) apply depth-bounding to dependency grammar induction with POS tags. However, the constituency parsing evaluation scores they report are low compared to other induction systems. The model in Shain et al. (2016) is a hierarchical sequence model instead of a PCFG. Although depth-bounding limits the search space, the sequence model has more parameters than a PCFG, therefore benefits brought by depth-bounding may be offset by this larger parameter space. Jin et al. (2018) also apply depth-bounding to a grammar inducer and induce depth-bounded PCFGs and show that the depth-bounded grammar inducer can learn labeled PCFGs competitive with state-of-the-art grammar inducers that only produce unlabeled trees. However, because of the cognitively motivated left-corner HMM sampler used in the model, its state space grows exponentially with the maximum depth and polynomially with the number of categories. This renders the transition matrix and the trellis of the inducer too big to be practical in exploring models with higher depth limits, let alone unbounded models. By using Gibbs sampling for PCFGs (Goodman, 1998; Johnson et al., 2007b), here described as the inside-sampling algorithm, the state space of the model proposed in this work grows only polynomially with both the maximum depth and the number of categories. This allows experiments with more complex models and also achieves a faster processing speed due to an overall smaller state space.

Proposed model
The model described in this paper follows Jin et al. (2018) to induce a depth-bounded PCFG by first inducing an unbounded PCFG and then deterministically deriving the parameters of a depth-bounded PCFG from it. The main difference between this model and the model in Jin et al. (2018) is that they use the bounded PCFG to derive parameters for a factored HMM sequence model, where a forward-filtering backward-sampling algorithm (Carter and Kohn, 1996) can be used in inference. In contrast, the model described in this paper transforms the unbounded PCFG into a bounded PCFG, and then uses the inside-sampling algorithm (Goodman, 1998) to sample from the posterior of the parse trees given the bounded PCFG in inference. This section first gives an overview of the model, then briefly reviews the depth-bounding algorithm for PCFGs (van Schijndel et al., 2013; Jin et al., 2018), and finally describes the inference.
As defined in Jin et al. (2018), a Chomsky normal form (CNF) unbounded PCFG is a matrix $G$ of binary rule probabilities with one row for each of $C$ parent symbols $c$ and one column for each of $C^2 + W$ combinations of left and right child symbols $a$ and $b$, which can be pairs of nonterminals or observed words from vocabulary $W$ followed by null symbols ⊥:

$$G = \sum_{a,b,c} P(c \rightarrow a\ b \mid c)\, \delta_c\, (\delta_a \otimes \delta_b)^\top$$

where $\delta_c$ is a Kronecker delta (a vector with value one at index $c$ and zeros elsewhere) and $\otimes$ is a Kronecker product (multiplying two matrices, or vectors in case $n$ and $p$ equal one, of dimension $m \times n$ and $o \times p$ into a matrix of dimension $mo \times np$ composed of products of all pairs of elements in the operands). A deterministic depth-bounding transform $\phi$ is then applied to $G$ to create a depth-bounded version $G_D = \phi(G)$. A depth-bounded grammar is composed of a set of side- and depth-specific distributions $G_{s,d}$, where side $s \in \{1, 2\}$ indicates left (1) or right (2) child. Categories in $G_D$ are made to be side- and depth-specific using transforms $D_{s,d}$ and $E_{s,d}$.

The generative story of this model is as follows. The model first generates an unbounded grammar $G$ from the Dirichlet prior. Distributions over expansions $P(c \rightarrow a\ b \mid c)$ of each category $c$ in this model are drawn from a Dirichlet with symmetric parameter $\beta$:

$$G \sim \mathrm{Dirichlet}(\beta)$$

Trees for sentences $1..N$ are each drawn from a PCFG given parameters $G_D$. Each tree $\tau$ is a set $\{\tau_\epsilon, \tau_1, \tau_2, \tau_{11}, \tau_{12}, \tau_{21}, ...\}$ of category labels $\tau_\eta$ where $\eta \in \{1, 2\}^*$ is a Gorn address specifying a path of left or right branches from the root. Categories of every pair of left and right children $\tau_{\eta 1}, \tau_{\eta 2}$ are drawn from a multinomial defined by the grammar $G_D$ and the category of the parent $\tau_\eta$:

$$\tau_{\eta 1}, \tau_{\eta 2} \sim \mathrm{Multinomial}(\delta_{\tau_\eta}^\top G_D)$$

In inference, a Gibbs sampler can be used to iteratively draw samples from the conditional posteriors of the unbounded grammar and the parse trees. For example, at iteration $t$:

$$G^{(t)} \sim P(G \mid \tau^{(t-1)}_{0..N}, \sigma_{\tau_{0..N}}, \beta)$$

$$\tau^{(t)}_{0..N} \sim P(\tau_{0..N} \mid G_D^{(t)}, \sigma_{\tau_{0..N}})$$

where $\sigma_\tau$ denotes the terminals in $\tau$. These distributions will be defined in Section 3.2.
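The matrix layout and the Dirichlet draw described above can be illustrated with a minimal sketch. It assumes a particular column layout that the text does not specify: nonterminal pairs (a, b) occupy column a·C + b, which is where the Kronecker product of the two delta vectors places its single one, and the W lexical expansions follow in the remaining columns. The function names `delta`, `kron`, `kron_index` and `sample_grammar` are hypothetical, and the Dirichlet draw is done by normalizing Gamma variates from the standard library rather than with any code from the original system.

```python
import random

def delta(i, n):
    """Kronecker delta vector of length n with a one at index i."""
    return [1.0 if k == i else 0.0 for k in range(n)]

def kron(u, v):
    """Kronecker product of two vectors."""
    return [x * y for x in u for y in v]

def kron_index(a, b, C):
    """Column of the nonterminal pair (a, b) under the assumed layout."""
    return a * C + b

def sample_grammar(C, W, beta, rng):
    """Draw each row of G (one per parent category) from a symmetric Dirichlet(beta).

    Assumed layout: C*C columns for nonterminal pairs, then W columns for
    lexical expansions (word followed by the null symbol).
    """
    n_cols = C * C + W
    G = []
    for _ in range(C):
        # A Dirichlet sample is a vector of Gamma(beta, 1) draws, normalized.
        draws = [rng.gammavariate(beta, 1.0) for _ in range(n_cols)]
        total = sum(draws)
        G.append([x / total for x in draws])
    return G

rng = random.Random(0)
G = sample_grammar(C=3, W=5, beta=0.2, rng=rng)
# P(c=0 -> a=1 b=2 | c=0) under the assumed layout:
p_rule = G[0][kron_index(1, 2, 3)]
```

Each row of `G` is a distribution over all expansions of one parent category, so selecting a row corresponds to multiplying by the transposed delta vector of that parent.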

Depth-bounding a PCFG
This section summarizes the depth-bounding function $\phi$ for PCFGs described in van Schijndel et al. (2013) and Jin et al. (2018). Depth-bounding essentially creates a set of PCFGs with depth- and side-specific categories where no tree that exceeds its depth bound can be generated by the bounded grammar. Because depth increases when a left child of a right child of some parent category performs non-terminal expansion, the probability of such expansions at the maximum depth limit as well as non-depth-increasing expansions beyond the maximum depth limit must be removed from the unbounded grammar. Following van Schijndel et al. (2013) and Jin et al. (2018), this can be done by iteratively defining a side- and depth-specific containment likelihood $h^{(i)}_{s,d}$ for left- or right-side siblings $s \in \{1, 2\}$ at depth $d \in \{1..D\}$ at each iteration $i \in \{1..I\}$, as a vector with one row for each nonterminal or terminal symbol (or null symbol ⊥) in $G$, containing the probability of each symbol generating a complete yield within depth $d$ as an $s$-side sibling. In this definition, 'T' is a top-level category label at depth zero. Following previous work, experiments described in this paper use $I = 20$. A depth-bounded grammar $G_{s,d}$ can then be defined to be the original grammar $G$ reweighted and renormalized by this containment likelihood.

To sample from the conditional posterior of $G$, it is necessary to first sum over all rule applications in all sampled trees, then remove side- and depth-specificity from category labels. A side- and depth-independent grammar is then sampled from these counts, plus the pseudocount $\beta$.

Inside-sampling (Goodman, 1998; Johnson et al., 2007b) is then used to sample from the posterior of trees $P(\tau_{0..N} \mid G_D, \sigma_{\tau_{0..N}})$. Given a depth-bounded grammar and a sentence, this algorithm first constructs the inside chart $V \in \mathbb{R}^{L \times L \times C}$, where $L$ is the length of the sentence. A chart vector $V[i, j, 1..C]$ for the span $i, j$ where $i < j \le L$ in some sentence $w_{1..L}$ is the likelihood $P_{G_D}(w_{i..j} \mid c)$ of the span for all side- and depth-specific categories $c$. Trees are sampled iteratively from the top down by first choosing a split point $k_{i,j}$ for the current span $i, j$ such that $i < k_{i,j} < j$. The algorithm then samples pairs of category labels $c_{i,k_{i,j}}$ and $c_{k_{i,j},j}$ adjacent at this split point $k_{i,j}$.

Empirically the sampler spends most of its time constructing the inside chart. The model described in this paper therefore efficiently computes the inside chart using matrix multiplication, which is able to exploit GPU optimization.
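The inside-sampling procedure described above can be sketched for a toy CNF grammar. This is an illustration under stated assumptions, not the system's implementation: the grammar is stored in plain dictionaries (`binary` and `lexical` are hypothetical names), categories are not side- or depth-specific, spans use fencepost indices from 0, and the split point and the child categories are drawn jointly in one step, which yields the same distribution as drawing the split point first and the category pair second.

```python
import random
from collections import defaultdict

def inside_chart(words, binary, lexical):
    """V[(i, j)][c] = P(words[i:j] | c) for fencepost span (i, j)."""
    n = len(words)
    V = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (c, w), p in lexical.items():
            if w == word:
                V[(i, i + 1)][c] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # explicit loop over split points
                for (c, a, b), p in binary.items():
                    V[(i, j)][c] += p * V[(i, k)][a] * V[(k, j)][b]
    return V

def sample_tree(V, binary, words, c, i, j, rng):
    """Sample a tree for span (i, j) rooted in c, top down.

    Assumes V[(i, j)][c] > 0, i.e. the span is derivable from c.
    """
    if j - i == 1:
        return (c, words[i])
    options, weights = [], []
    for k in range(i + 1, j):
        for (parent, a, b), p in binary.items():
            weight = p * V[(i, k)][a] * V[(k, j)][b]
            if parent == c and weight > 0.0:
                options.append((k, a, b))
                weights.append(weight)
    # Joint draw of split point and child categories, proportional to weight.
    k, a, b = rng.choices(options, weights=weights)[0]
    return (c,
            sample_tree(V, binary, words, a, i, k, rng),
            sample_tree(V, binary, words, b, k, j, rng))

def leaves(tree):
    """Yield of a sampled tree, left to right."""
    if isinstance(tree[1], str):
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])

# Toy grammar: S -> S S (0.3), S -> a (0.4), S -> b (0.3)
binary = {("S", "S", "S"): 0.3}
lexical = {("S", "a"): 0.4, ("S", "b"): 0.3}
words = ["a", "b", "a"]
V = inside_chart(words, binary, lexical)
tree = sample_tree(V, binary, words, "S", 0, len(words), random.Random(1))
```

For this toy grammar the sentence has two binary bracketings with equal probability, so the sampler returns either one with probability one half.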

Efficient inside score calculation
The complexity of the inside algorithm is cubic in the length of the sentence because it has to iterate over all start points i, all end points j and all split points k of a span. For a dense PCFG with a large number of states, the explicit looping is undesirable, especially when it can be formulated as matrix multiplication. The split point loop is therefore replaced with a matrix multiplication in order to take advantage of highly optimized GPU linear algebra packages like cuBLAS and cuSPARSE. Previous work has also explored how to parse efficiently on GPUs (Johnson, 2011; Canny et al., 2013; Hall et al., 2014).
Inside likelihoods are propagated using a copy $V'$ of the inside likelihood tensor $V$ with the first and second indices reversed, so that $V'[j, i, 1..C] = V[i, j, 1..C]$. This reversal allows the sum over split points $k \in \{i+1, ..., j-1\}$ to be calculated as a product of contiguous matrices, which can be efficiently implemented on a GPU. In this product, vec(M) flattens a matrix M into a vector.
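The idea of replacing the split-point loop with a matrix product can be checked on a small example. The sketch below is an assumption-laden illustration: it uses plain Python lists rather than a GPU library, it does not reproduce the memory layout of the reversed chart copy, and the helper names are hypothetical. `left` holds one inside vector per split point for the left child spans and `right` the matching vectors for the right child spans; the product of the transposed `left` with `right` sums the outer products over all split points, and a row-major flattening of that C × C matrix plays the role of vec(M) before multiplication with the binary-rule block of the grammar.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]

def parent_scores_matmul(G, left, right):
    """Inside scores of all parents for one span, via matrix products."""
    C = len(left[0])
    M = matmul(transpose(left), right)  # C x C: sum over split points of outer products
    vec = [M[a][b] for a in range(C) for b in range(C)]  # row-major vec(M)
    return [sum(g * v for g, v in zip(row, vec)) for row in G]

def parent_scores_loop(G, left, right):
    """Reference version with an explicit loop over split points."""
    C = len(left[0])
    out = [0.0] * len(G)
    for l_vec, r_vec in zip(left, right):
        for c, row in enumerate(G):
            for a in range(C):
                for b in range(C):
                    out[c] += row[a * C + b] * l_vec[a] * r_vec[b]
    return out

# Two categories, two split points; G holds only the binary-rule columns here.
G = [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]
left = [[0.5, 0.1], [0.2, 0.3]]    # V[i, k, :] for each split point k
right = [[0.4, 0.6], [0.7, 0.05]]  # V[k, j, :] for each split point k
fast = parent_scores_matmul(G, left, right)
slow = parent_scores_loop(G, left, right)
```

Both functions compute the same quantity; the first form groups the work into two dense products, which is the shape of computation that GPU linear algebra libraries are optimized for.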

Posterior inference on constituents
Prior work (Johnson et al., 2007a) shows that using EM-like algorithms, which seek to maximize the likelihood of data marginalizing out the latent trees, does not yield good performance. Because trees are the main target for evaluation, it may be preferable to find the most probable tree structures given the marginal posterior of tree structures compared to finding the most probable grammar. Some recent work (McClosky and Charniak, 2015; Keith et al., 2018) explores how to use marginal distributions of tree structures from supervised parsers to create more accurate parse trees. Based on these arguments, this model performs maximum a posteriori (MAP) inference on constituents (PIoC) using approximate conditional posteriors of spans to create final parses for evaluation.
Formally, let $\sigma_{i,j}$ be an MAP unlabeled span of words in a sentence from a corpus $\sigma$, with start point $i$ and end point $j$, and $\sigma_{i,k}$, $\sigma_{k,j}$ its possible children. This algorithm iteratively looks for the best pair of children $\sigma_{i,k}$, $\sigma_{k,j}$ according to the posterior of the children $P(\sigma_{i,k}, \sigma_{k,j} \mid \sigma_{i,j}, \sigma)$, using all posterior samples. The spans are sentence-specific, but the sentence index is omitted here for brevity, and $\sigma$ is the training corpus. Starting from the whole sentence $\sigma_{0,N}$, this algorithm finds the best children for a span from the Monte Carlo estimation of the marginal posterior distribution of children for the span, and then continues to split the found children spans. Because samples from different runs at different iterations can be used to approximate the span posteriors, the process marginalizes out sampled grammars, whole-sentence parse trees and constituent labels to only consider split points for spans. In terms of input and output, the PIoC algorithm takes in posterior samples of trees for a sentence, and outputs an unlabeled binary-branching tree.
There are a few benefits of doing posterior inference on constituents. First, the distribution $P(\sigma_{i,k}, \sigma_{k,j} \mid \sigma_{i,j}, \sigma)$ quantifies how much uncertainty there is in splitting a span $\sigma_{i,j}$ at all possible $k$'s. One way of using this uncertainty information is to merge spans where uncertainty is high, effectively weakening or removing the constraint of binary-branching from the grammar inducer. Second, this algorithm produces trees that may not be seen in the samples, potentially helping aggregate evidence across different iterations within a run and across runs. Third, the multimodal nature of the joint posterior of grammars and trees often makes the sampler get stuck at local modes, but doing MAP on constituents may allow information about trees from different modes to come together. If different grammars all consider certain children for a span to be highly likely, then these children should be in the final parse output. Finally, it is a nonparametric way of doing model selection. As will be shown, model selection relies on the log likelihood of the data, but the log likelihood of the data is only weakly correlated with parsing accuracy. Performing PIoC with multiple runs can increase accuracy without depending too heavily on log likelihood for model selection.
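The top-down splitting procedure described in this section can be sketched as follows. The sketch assumes that each posterior sample is available as an unlabeled binary tree written as nested 2-tuples with string leaves, estimates the posterior of children for a span by relative frequency over the samples that contain that span, and takes the most frequent split. The function names are hypothetical, and the merging of high-uncertainty spans is not included because its threshold is not given in the text.

```python
from collections import Counter, defaultdict

def collect_splits(tree, start, counts):
    """Record how often each span (i, j) is split at k; return the span's end index."""
    if isinstance(tree, str):
        return start + 1
    left, right = tree
    mid = collect_splits(left, start, counts)
    end = collect_splits(right, mid, counts)
    counts[(start, end)][mid] += 1
    return end

def pioc_spans(samples, n_words):
    """Return the spans of the tree built by repeatedly taking the most frequent split."""
    counts = defaultdict(Counter)
    for tree in samples:
        collect_splits(tree, 0, counts)
    spans = []

    def split(i, j):
        spans.append((i, j))
        if j - i > 1:
            # A chosen child span always occurs in at least one sample,
            # so its own split counts are never empty.
            k = counts[(i, j)].most_common(1)[0][0]
            split(i, k)
            split(k, j)

    split(0, n_words)
    return sorted(spans)

samples = [
    (("a", "b"), ("c", "d")),
    (("a", "b"), ("c", "d")),
    ("a", ("b", ("c", "d"))),
]
result = pioc_spans(samples, 4)
```

Because the counts are pooled per span rather than per tree, samples from different iterations or runs can simply be concatenated in `samples`, which mirrors the marginalization over grammars and whole-sentence trees described above.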

Model analysis and evaluation
The model described above has hyperparameters for maximum depth D, number of categories C and the symmetric Dirichlet prior β. Following Jin et al. (2018), this evaluation uses the first half of the WSJ20 corpus as the development set (WSJ20dev) for all experiments. However, instead of using the development set only to set the hyperparameters of the model, this evaluation also uses it to explore interactions among parsing accuracy, model fit, depth limit and category domain. The first set of experiments explores various settings of D in the hope of acquiring a better picture of how depth-bounding affects the inducer. The second set of experiments uses the value of D tuned in the first experiments, and does PIoC on different sets of samples to examine the effect it has on parse quality. Optimal parameter values from these first two experiments are then applied in experiments on English (The Penn Treebank; Marcus et al., 1993), Chinese (The Chinese Treebank 5.0; Xia et al., 2000) and German (NEGRA 2.0; Skut et al., 1998) data to show how the model performs compared with competing systems.
Each run in evaluation uses one sample of parse trees from the posterior samples after convergence. Preliminary experiments show that the samples after convergence are very similar within a run and their parsing accuracies differ very little. This evaluation follows Seginer (2007) by running unlabeled PARSEVAL on parse trees collected from each run. Punctuation is retained in the raw text in induction, and removed in evaluation, also following Seginer (2007).
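For readers unfamiliar with the metric, unlabeled PARSEVAL can be sketched as precision, recall and F1 over sets of bracket spans. The sketch below is a simplification and not a substitute for EVALB: it assumes spans are given as (start, end) pairs for a single sentence, treats brackets as a set, and leaves conventions such as the treatment of single-word spans and the root bracket to the caller. The function name is hypothetical.

```python
def unlabeled_prf(predicted, gold):
    """Unlabeled bracket precision, recall and F1 for one sentence."""
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A binary prediction against a flatter gold bracketing of a 4-word sentence.
predicted = [(0, 4), (0, 2), (2, 4)]
gold = [(0, 4), (2, 4)]
p, r, f = unlabeled_prf(predicted, gold)
```

In this example every gold bracket is found, but the extra predicted bracket lowers precision, which is the pattern that binary-branching output tends to show against flatter gold annotation.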

Analysis of model behavior
The first experiment explores the effects of depth-bounding on linguistic grammar quality. The hypothesis is that depth-bounding limits the search space of possible grammars, so the inducer will be less likely to find low-quality local optima where cognitively implausible parse trees are assigned non-zero probabilities, because such local optima would be removed from the posterior by limiting the maximum depth of parse trees to a small number d.

The effect of depth-bounding
Figure 3 shows the effect of depth bounding using 60 data points of unlabeled PARSEVAL scores from 20 different runs for each of three different depth bounds: 2, 3, and ∞ (unbounded). The range of possible parsing accuracy scores is very wide, as mapped out by the runs. Although the unbounded model is able to reach the performance upper bound seen from the figure, most of the time its results are in the middle of the range. By bounding the maximum depth to 2, the sampler is able to stay in the region of high parsing accuracy. This may be because the majority of the modes in the region of low parsing accuracy require higher depth limits, and humans who produce the sentences do not have access to those higher depth limits. The difference between depth ∞ and depth 2 is significant (p = 0.017, Student's t test), showing that depth-bounding does have a positive effect on the linguistic grammar quality of the induced grammars. Data from depth 3 also shows a positive trend of inducing better grammars than unbounded.

Figure 4: The correlation between data likelihood and parsing accuracy of all 60 runs (x-axis: data log-likelihood; y-axis: F1). Calculations show that there is a significant (p = 0.007) positive correlation (Pearson's r = 0.39) between data likelihood and parsing accuracy at convergence for our model.
A purely right-branching baseline achieves an F1 score of 48 on the WSJ20 development dataset. A majority of induction runs perform better than this baseline, which indicates that the PCFG induction model with the inside-sampling algorithm is able to find good solutions, most of the time much better than the right-branching baseline. This is especially interesting when the grammar is unbounded with almost no other constraint, which had previously been shown to converge to weak local optima.

Correlation of model fit and parsing accuracy
Model fit, or data likelihood, has been reported not to be correlated or to be correlated only weakly with parsing accuracy for some unsupervised grammar induction models (Smith, 2006; Johnson et al., 2007b; Liang et al., 2009) when the model has converged to a local maximum. Figure 4 shows the correlation between data likelihood and parsing accuracy at convergence for all the runs. There is a significant (p = 0.007) positive correlation (Pearson's r = 0.39) between data likelihood and parsing accuracy at convergence for our model. This indicates that although noisy and unreliable, the data likelihood can be used as a metric to do preliminary model selection.

The bounded unbounded PCFG
We also examine the distribution of tree depths in unbounded runs. For a run, we compute the percentage of parse trees with a certain depth, and then examine how these percentages vary across different runs. Theoretically the possible maximum depth of a parse for a sentence is the sentence length divided by 2. For example, a 20-word sentence can have a parse of depth 10 because at least two words are needed to create a new depth with a center-embedded phrase, but under most PCFGs this maximally center-embedded configuration is not very likely. Figure 5 shows the percentage of tree depths from samples in the beginning of each unbounded run and at convergence. It shows that at the beginning of the sampling process with a random model sampled from the prior, the distribution of parse tree depths seems to be centered around depth 2 and 3, with non-negligible probability mass at other depth levels. At convergence, the distribution of parse tree depths is very peaked with a large portion of the probability mass concentrated at depth 2. Given that an unbounded PCFG has no constraint on depth, this convergence of the marginal posterior distribution of parse tree depth shows that the depth limit seems to be a natural tendency in the data, rather than an arbitrary preference of corpus annotators.
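The depth statistics discussed above can be sketched with a small function that follows the prose definition used earlier in the paper: depth increases when a left child of a right child expands into a nonterminal. The sketch assumes trees are unlabeled nested 2-tuples with string leaves and that the root sits at depth 1; these conventions and the function names are assumptions for illustration and may differ in detail from the transforms used in the model.

```python
from collections import Counter

def tree_depth(tree, is_right=False, parent_is_right=False, depth=1):
    """Maximum center-embedding depth of a binary tree of nested 2-tuples."""
    if isinstance(tree, str):
        return depth  # words never open a new depth level
    if not is_right and parent_is_right:
        depth += 1  # nonterminal left child of a right child
    left, right = tree
    return max(tree_depth(left, False, is_right, depth),
               tree_depth(right, True, is_right, depth))

def depth_percentages(trees):
    """Percentage of trees at each depth, as plotted per run."""
    counts = Counter(tree_depth(t) for t in trees)
    return {d: 100.0 * n / len(trees) for d, n in sorted(counts.items())}

right_branching = ("a", ("b", ("c", "d")))
left_branching = ((("a", "b"), "c"), "d")
center_4 = ("a", (("b", "c"), "d"))                 # 4 words, depth 2
center_6 = ("a", (("b", (("c", "d"), "e")), "f"))   # 6 words, depth 3
usage = depth_percentages([right_branching, left_branching, center_4, center_4])
```

Under these conventions each additional depth level costs two more words, which agrees with the bound of sentence length divided by 2 stated above.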

Posterior uncertainty of constituents
Experiments were also conducted to determine whether posterior inference on constituents (PIoC) has any effect on parsing accuracy. These experiments use the 10 runs on WSJ20dev with depth 2 that have the highest log-likelihoods for exploration. In this data, some spans have a strikingly higher degree of uncertainty than other spans. For example, the posterior probability of splitting the phrase "the old story" into "the" and "old story" is 0.45, with the remaining probability on the split into "the old" and "story". Some other spans like "use old tools" have virtually no uncertainty in how the inducer evaluates the splits. Many such spans with high uncertainty are noun phrases, which are not annotated with subconstituents in the Penn Treebank annotation. The parser can therefore avoid precision losses by not splitting constituents with 3 or 4 words if there is large uncertainty in this posterior. This experiment only merges spans that would cover 3 or 4 words and leaves merging spans with larger coverage to future work.
Table 1 shows parsing results on the WSJ20dev dataset. The Best result is from an arbitrary sample at convergence of the oracle best run. The Best with PIoC is the same run, but with PIoC to aggregate 100 posterior samples at convergence. All with PIoC uses 100 posterior samples from all of the 10 chosen runs, and finally All with PIoC without best excludes the best run in PIoC calculation.
There is almost a point of gain in precision going from Best to Best with PIoC with virtually no recall loss, showing that the posterior uncertainty is helpful in flattening binary trees. As more samples from the posterior are collected, as shown in All with PIoC without best, the precision gain is even more substantial. This shows that with PIoC there is no need to know which sample from which run is the best. Model selection in this case is only needed to weed out the runs with very low likelihood.

Multilingual PARSEVAL
A final set of experiments compares the proposed model with several state-of-the-art constituency grammar induction systems on three different languages. The competing systems are CCL (Seginer, 2007) and UPPARSE (Ponvert et al., 2011). We also include the published results of DB-PCFG (Jin et al., 2018) on English for comparison. The corpora used are the WSJ20test dataset used in Jin et al. (2018), and the CTB20 (sentences with 20 words or fewer from the Chinese Treebank) and NEGRA20 (sentences with 20 words or fewer from the German NEGRA Treebank) datasets used in Seginer (2007). All systems are trained and evaluated on the same datasets to ensure fair and direct comparison. Five induction runs were performed on each dataset with the same hyperparameters D=2, C=15, β=0.2 as tuned on the development set, and the three runs with the highest likelihood at convergence were chosen for comparison with other models. Parse trees were then calculated using PIoC as previously described, removing punctuation to calculate the unlabeled PARSEVAL scores with EVALB. Multiple runs of CCL and UPPARSE on the same data yield the same results.
Table 2 shows the unlabeled PARSEVAL scores for the competing systems. The model described in this paper shows strong performance in all languages. On English and Chinese, this model achieves the new state-of-the-art recall and F1 numbers. On German, this model also achieves the best recall scores among all models, showing that more constituents found in the gold annotation are discovered. It is worth noting that the CCL and UPPARSE models do take advantage of additional linguistic constraints, e.g. using punctuation as delimiters of constituents. Experiments described in this paper show that this system can perform better than or competitively with these existing models without similar heuristics and constraints.
The model described in this paper performs relatively poorly on precision due to the fact that trees produced by this system are mostly binary-branching with some constituents flattened by PIoC. This issue is most evident on NEGRA, where fully binary-branching trees have nearly twice as many constituents as are annotated in gold. This puts any system that produces binary-branching trees under a precision ceiling of 0.51, and an F1 ceiling of 0.675.
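The relation between the two ceilings can be checked with a short calculation. The sketch assumes that the F1 ceiling corresponds to perfect recall combined with the stated precision ceiling of 0.51; the implied ratio of predicted to gold constituents is derived from that precision and is not a figure reported in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision_ceiling = 0.51
# Best case for a fully binary tree: every gold constituent is recovered.
f1_ceiling = f1_score(precision_ceiling, 1.0)
# Implied ratio of predicted to gold constituents when all gold ones are found.
implied_ratio = 1.0 / precision_ceiling
```

With recall at 1.0 the F1 value is 1.02 / 1.51, which rounds to the 0.675 given above, and the implied ratio of about 1.96 matches the description of nearly twice as many constituents.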

Conclusion
Experiments in this work confirm that depth-bounding does empirically have the effect of limiting the search space of an unsupervised PCFG in-

Figure 1: Stack elements after the word "the" in a left-corner parse of the sentence "For parts the plant built to fail was awful".
3.2 Gibbs sampling of unbounded grammars and bounded trees
As defined above, this model samples iteratively from the conditional posteriors $P(G \mid \tau_{0..N}, \sigma_{\tau_{0..N}}, \beta)$ and $P(\tau_{0..N} \mid G_D, \sigma_{\tau_{0..N}})$ in inference, extending the Gibbs sampling algorithm for PCFG induction introduced in Johnson et al. (2007b) to depth-bounded grammars. The equations below omit the superscript $t$ for the iteration number of inference for clarity.

Figure 2: Example of matrix multiplication in place of looping over break points for the span (0, 5). Each chart cell represents a likelihood vector for the span between i and j where i is the leftmost delimiting index of the span and j the rightmost. The arrows represent the order in which the cells are stored in the chart matrices $V$ and $V'$.

Figure 3: PARSEVAL scores for runs with different depth limits. The difference of all PARSEVAL scores between depth ∞ and depth 2 is significant (p = 0.017, Student's t test).

Figure 5: The usage of different depths for parse trees in the samples from 20 runs with the unbounded grammar.

Table 1: Development results for different systems using posterior inference on constituents (PIoC).