Compound Probabilistic Context-Free Grammars for Grammar Induction

We study a formalization of the grammar induction problem that models sentences as being generated by a compound probabilistic context free grammar. In contrast to traditional formulations which learn a single stochastic grammar, our context-free rule probabilities are modulated by a per-sentence continuous latent variable, which induces marginal dependencies beyond the traditional context-free assumptions. Inference in this context-dependent grammar is performed by collapsed variational inference, in which an amortized variational posterior is placed on the continuous variable, and the latent trees are marginalized with dynamic programming. Experiments on English and Chinese show the effectiveness of our approach compared to recent state-of-the-art methods for grammar induction from words with neural language models.


Introduction
Grammar induction is the task of inducing hierarchical syntactic structure from data. Statistical approaches to grammar induction require specifying a probabilistic grammar (e.g. formalism, number and shape of rules), and fitting its parameters through optimization. Early work found that it was difficult to induce probabilistic context-free grammars (PCFG) from natural language data through direct methods, such as optimizing the log likelihood with the EM algorithm (Lari and Young, 1990; Carroll and Charniak, 1992). While the reasons for the failure are manifold and not completely understood, two major potential causes are the ill-behaved optimization landscape and the overly strict independence assumptions of PCFGs. More successful approaches to grammar induction have thus resorted to carefully-crafted auxiliary objectives (Klein and Manning, 2002), priors or non-parametric models (Kurihara and Sato, 2006; Johnson et al., 2007; Liang et al., 2007; Wang and Blunsom, 2013), and manually-engineered features (Huang et al., 2012; Golland et al., 2012) to encourage the desired structures to emerge.
Code: https://github.com/harvardnlp/compound-pcfg
We revisit these issues in light of advances in model parameterization and inference. First, contrary to common wisdom, we find that parameterizing a PCFG's rule probabilities with neural networks over distributed representations makes it possible to induce linguistically meaningful grammars by simply optimizing log likelihood. While the optimization problem remains non-convex, recent work suggests that there are optimization benefits afforded by over-parameterized models (Arora et al., 2018; Xu et al., 2018; Du et al., 2019), and we indeed find that this neural PCFG is significantly easier to optimize than the traditional PCFG. Second, this factored parameterization makes it straightforward to incorporate side information into rule probabilities through a sentence-level continuous latent vector, which effectively allows different contexts in a derivation to coordinate. In this compound PCFG, a continuous mixture of PCFGs, the context-free assumptions hold conditioned on the latent vector but not unconditionally, thereby obtaining longer-range dependencies within a tree-based generative process.
To utilize this approach, we need to efficiently optimize the log marginal likelihood of observed sentences. While compound PCFGs break efficient inference, if the latent vector is known the distribution over trees reduces to a standard PCFG. This property allows us to perform grammar induction using a collapsed approach where the latent trees are marginalized out exactly with dynamic programming. To handle the latent vector, we employ standard amortized inference using reparameterized samples from a variational posterior approximated from an inference network (Kingma and Welling, 2014; Rezende et al., 2014).
arXiv:1906.10225v1 [cs.CL] 24 Jun 2019
On standard benchmarks for English and Chinese, the proposed approach is found to perform favorably against recent neural network-based approaches to grammar induction (Shen et al., 2018, 2019; Drozdov et al., 2019; Kim et al., 2019).

Probabilistic Context-Free Grammars
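We consider context-free grammars (CFG) consisting of a 5-tuple $G = (S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R})$, where $S$ is the distinguished start symbol, $\mathcal{N}$ is a finite set of nonterminals, $\mathcal{P}$ is a finite set of preterminals, $\Sigma$ is a finite set of terminal symbols, and $\mathcal{R}$ is a finite set of rules of the form

$$S \to A, \quad A \in \mathcal{N},$$
$$A \to B\,C, \quad A \in \mathcal{N},\ B, C \in \mathcal{N} \cup \mathcal{P},$$
$$T \to w, \quad T \in \mathcal{P},\ w \in \Sigma.$$

A probabilistic context-free grammar (PCFG) consists of a grammar $G$ and rule probabilities $\pi = \{\pi_r\}_{r \in \mathcal{R}}$ such that $\pi_r$ is the probability of the rule $r$. Letting $\mathcal{T}_G$ be the set of all parse trees of $G$, a PCFG defines a probability distribution over $t \in \mathcal{T}_G$ via $p_\pi(t) = \prod_{r \in t_{\mathcal{R}}} \pi_r$, where $t_{\mathcal{R}}$ is the set of rules used in the derivation of $t$. It also defines a distribution over strings of terminals $x \in \Sigma^*$ via

$$p_\pi(x) = \sum_{t \in \mathcal{T}_G(x)} p_\pi(t),$$

where $\mathcal{T}_G(x) = \{t \mid \mathrm{yield}(t) = x\}$, i.e. the set of trees $t$ such that $t$'s leaves are $x$. We will use $p_\pi(t \mid x) \triangleq p_\pi(t \mid \mathrm{yield}(t) = x)$ to denote the posterior distribution over latent trees given the observed sentence $x$.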
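The sum over trees in $p_\pi(x)$ does not have to be enumerated explicitly; it can be computed by dynamic programming with the inside algorithm (Baker, 1979). The following is a minimal sketch in plain Python, assuming a toy grammar in the restricted rule form above. The function name `inside`, the dictionary layout (`root`, `binary`, `emit`), and the toy probabilities are illustrative assumptions and are not taken from the released code; a practical implementation would work in log space and be vectorized.

```python
def inside(words, root, binary, emit):
    """Return p(x) under a PCFG with rules S -> A, A -> B C, T -> w.

    root:   {A: probability of S -> A}
    binary: {A: {(B, C): probability of A -> B C}}
    emit:   {T: {w: probability of T -> w}}
    """
    n = len(words)
    # beta[(i, j, X)] = total probability that symbol X yields words[i:j]
    beta = {}
    for i, w in enumerate(words):
        for T, dist in emit.items():
            beta[(i, i + 1, T)] = dist.get(w, 0.0)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for A, rules in binary.items():
                total = 0.0
                for (B, C), p in rules.items():
                    for k in range(i + 1, j):  # split point
                        total += p * beta.get((i, k, B), 0.0) * beta.get((k, j, C), 0.0)
                beta[(i, j, A)] = total
    return sum(p * beta.get((0, n, A), 0.0) for A, p in root.items())


# Toy grammar (hypothetical): one nonterminal A, two preterminals T1 and T2.
ROOT = {"A": 1.0}
BINARY = {"A": {("T1", "T2"): 0.5, ("T1", "A"): 0.3, ("A", "T2"): 0.2}}
EMIT = {"T1": {"a": 0.6, "b": 0.4}, "T2": {"a": 0.1, "b": 0.9}}

p_ab = inside(["a", "b"], ROOT, BINARY, EMIT)
p_aab = inside(["a", "a", "b"], ROOT, BINARY, EMIT)
```

For the sentence `a a b` the toy grammar admits exactly two trees, and the chart sums their probabilities without listing them.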
Parameterization The standard way to parameterize a PCFG is to simply associate a scalar to each rule $\pi_r$ with the constraint that they form valid probability distributions, i.e. each nonterminal is associated with a fully-parameterized categorical distribution over its rules. This direct parameterization is algorithmically convenient since the M-step in the EM algorithm (Dempster et al., 1977) has a closed form. However, there is a long history of work showing that it is difficult to learn meaningful grammars from natural language data with this parameterization (Carroll and Charniak, 1992). Successful approaches to unsupervised parsing have therefore modified the model/learning objective by guiding potentially unrelated rules to behave similarly.
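Recognizing that sharing among rule types is beneficial, we propose a neural parameterization where rule probabilities are based on distributed representations. We associate embeddings with each symbol, introducing input embeddings $w_N$ for each symbol $N$ on the left side of a rule (i.e. $N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}$). For each rule type $r$, $\pi_r$ is parameterized as follows,

$$\pi_{S \to A} = \frac{\exp(u_A^\top f_1(w_S))}{\sum_{A' \in \mathcal{N}} \exp(u_{A'}^\top f_1(w_S))},$$
$$\pi_{A \to BC} = \frac{\exp(u_{BC}^\top w_A)}{\sum_{B'C' \in \mathcal{M}} \exp(u_{B'C'}^\top w_A)},$$
$$\pi_{T \to w} = \frac{\exp(u_w^\top f_2(w_T))}{\sum_{w' \in \Sigma} \exp(u_{w'}^\top f_2(w_T))},$$

where $\mathcal{M}$ is the product space $(\mathcal{N} \cup \mathcal{P}) \times (\mathcal{N} \cup \mathcal{P})$, the vectors $u$ are output embeddings, and $f_1, f_2$ are MLPs with two residual layers (see appendix A.1 for the full parameterization). We will use $E_G = \{w_N \mid N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}\}$ to denote the set of input symbol embeddings for a grammar $G$, and $\lambda$ to refer to the parameters of the neural network used to obtain the rule probabilities. A graphical model-like illustration of the neural PCFG is shown in Figure 1 (left).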
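To make the neural parameterization concrete, the sketch below computes the distribution over binary rules $A \to B\,C$ for one nonterminal as a softmax over inner products between output embeddings $u_{BC}$ and the input embedding $w_A$. It is an illustration under stated assumptions: the embeddings are random rather than learned, the symbol inventory is a toy one, and the names `softmax` and `binary_rule_probs` are hypothetical. The MLPs $f_1, f_2$ used for the other two rule types are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def binary_rule_probs(w_A, U):
    """Return {(B, C): pi_{A -> B C}} as a softmax over u_{BC}^T w_A.

    w_A: input embedding of the left-hand-side nonterminal (list of floats)
    U:   {(B, C): output embedding u_{BC}} for every pair in the product space
    """
    pairs = list(U)
    scores = [sum(u * w for u, w in zip(U[bc], w_A)) for bc in pairs]
    return dict(zip(pairs, softmax(scores)))


rng = random.Random(0)
dim = 4
symbols = ["A1", "A2", "T1", "T2"]  # toy stand-in for the union of nonterminals and preterminals
U = {(B, C): [rng.gauss(0.0, 1.0) for _ in range(dim)] for B in symbols for C in symbols}
w_A1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

pi_A1 = binary_rule_probs(w_A1, U)  # one categorical distribution over 16 right-hand sides
```

Because every rule for `A1` is scored against the same input embedding, rules with similar output embeddings receive similar probabilities, which is the parameter sharing the text refers to.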
It is clear that the neural parameterization does not change the underlying probabilistic assumptions. The difference between the two is analogous to the difference between count-based and feed-forward neural language models, where feed-forward neural language models make the same Markov assumptions as the count-based models but are able to take advantage of shared, distributed representations.

Compound PCFGs
A compound probability distribution (Robbins, 1951) is a distribution whose parameters are themselves random variables. These distributions generalize mixture models to the continuous case, for example in factor analysis, which assumes the following generative process,

$$z \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad x \sim \mathcal{N}(\mathbf{W}z, \Sigma).$$
Compound distributions provide the ability to model rich generative processes, but marginalizing over the latent parameter can be computationally expensive unless conjugacy can be exploited.
Figure 1: A graphical model-like illustration for the neural PCFG (left) and the compound PCFG (right) for an example tree structure. In the above, $A_1, A_2 \in \mathcal{N}$ are nonterminals, $T_1, T_2, T_3 \in \mathcal{P}$ are preterminals, and $w_1, w_2, w_3 \in \Sigma$ are terminals. In the neural PCFG, the global rule probabilities $\pi = \pi_S \cup \pi_{\mathcal{N}} \cup \pi_{\mathcal{P}}$ are the output from a neural net run over the symbol embeddings $E_G$, where $\pi_{\mathcal{N}}$ is the set of rules with a nonterminal on the left hand side ($\pi_S$ and $\pi_{\mathcal{P}}$ are similarly defined). In the compound PCFG, we have per-sentence rule probabilities $\pi_z = \pi_{z,S} \cup \pi_{z,\mathcal{N}} \cup \pi_{z,\mathcal{P}}$ obtained from running a neural net over a random vector $z$ (which varies across sentences) and global symbol embeddings $E_G$. In this case, the context-free assumptions hold conditioned on $z$, but they do not hold unconditionally: e.g. when conditioned on $z$ and $A_2$, the variables $A_1$ and $T_1$ are independent; however when conditioned on just $A_2$, they are not independent due to the dependence path through $z$. Note that the rule probabilities are random variables in the compound PCFG but deterministic variables in the neural PCFG.
In this work, we study compound probabilistic context-free grammars whose distribution over trees arises from the following generative process: we first obtain rule probabilities via

$$z \sim p_\gamma(z), \qquad \pi_z = f_\lambda(z, E_G),$$

where $p_\gamma(z)$ is a prior with parameters $\gamma$ (spherical Gaussian in this paper), and $f_\lambda$ is a neural network that concatenates the input symbol embeddings with $z$ and outputs the sentence-level rule probabilities $\pi_z$,

$$\pi_{z, S \to A} \propto \exp(u_A^\top f_1([w_S; z])),$$
$$\pi_{z, A \to BC} \propto \exp(u_{BC}^\top [w_A; z]),$$
$$\pi_{z, T \to w} \propto \exp(u_w^\top f_2([w_T; z])),$$

where $[w; z]$ denotes vector concatenation. Then a tree/sentence is sampled from a PCFG with rule probabilities given by $\pi_z$,

$$t \sim \mathrm{PCFG}(\pi_z), \qquad x = \mathrm{yield}(t).$$

This can be viewed as a continuous mixture of PCFGs, or alternatively, a Bayesian PCFG with a prior on sentence-level rule probabilities parameterized by $z, \lambda, E_G$. [3] Importantly, under this generative model the context-free assumptions hold conditioned on $z$, but they do not hold unconditionally. This is shown in Figure 1 (right), where there is a dependence path through $z$ if it is not conditioned upon. Compound PCFGs give rise to a marginal distribution over parse trees $t$ via

$$p_\theta(t) = \int p_\theta(t \mid z)\, p_\gamma(z)\, dz,$$

where $p_\theta(t \mid z) = \prod_{r \in t_{\mathcal{R}}} \pi_{z,r}$. The subscript in $\pi_{z,r}$ denotes the fact that the rule probabilities depend on $z$. Compound PCFGs are clearly more expressive than PCFGs, as each sentence has its own set of rule probabilities. However, the model still assumes a tree-based generative process, making it possible to learn latent tree structures.
[3] Under the Bayesian PCFG view, $p_\gamma(z)$ is a distribution over $z$ (a subset of the prior), and is thus a hyperprior.
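The effect of the latent vector on the rule probabilities can be illustrated with a small sketch. Assuming random (untrained) embeddings and a toy symbol inventory, the code below draws two values of $z$ from a spherical Gaussian prior and computes $\pi_{z, A \to BC}$ by scoring output embeddings against the concatenation $[w_A; z]$. The function name `compound_binary_rule_probs` and all dimensions are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compound_binary_rule_probs(w_A, z, U):
    """Return {(B, C): pi_{z, A -> B C}}, proportional to exp(u_{BC}^T [w_A; z])."""
    h = list(w_A) + list(z)  # list concatenation plays the role of [w_A; z]
    pairs = list(U)
    scores = [sum(u * x for u, x in zip(U[bc], h)) for bc in pairs]
    return dict(zip(pairs, softmax(scores)))


rng = random.Random(0)
dim_w, dim_z = 4, 2
symbols = ["A1", "T1", "T2"]
U = {(B, C): [rng.gauss(0.0, 1.0) for _ in range(dim_w + dim_z)]
     for B in symbols for C in symbols}
w_A1 = [rng.gauss(0.0, 1.0) for _ in range(dim_w)]

# Two sentences correspond to two independent draws z ~ N(0, I).
z_first = [rng.gauss(0.0, 1.0) for _ in range(dim_z)]
z_second = [rng.gauss(0.0, 1.0) for _ in range(dim_z)]

pi_first = compound_binary_rule_probs(w_A1, z_first, U)
pi_second = compound_binary_rule_probs(w_A1, z_second, U)
largest_shift = max(abs(pi_first[bc] - pi_second[bc]) for bc in U)
```

The symbol embeddings are shared, yet each draw of $z$ yields its own categorical distribution over right-hand sides, so `largest_shift` is nonzero.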
Our motivation for the compound PCFG is based on the observation that for grammar induction, first-order context-free assumptions are generally made not because they represent an adequate model of natural language, but because they allow for tractable training. Higher-order PCFGs can introduce dependencies between children and ancestors/siblings through, for example, vertical/horizontal Markovization (Johnson, 1998; Klein and Manning, 2003). However, such dependencies complicate training due to the rapid increase in the number of rules. Under this view, we can interpret the compound PCFG as a restricted version of some higher-order PCFG where a child can depend on its ancestors and siblings through a shared latent vector. We hypothesize that this dependence among siblings is especially useful in grammar induction from words, where (for example) if we know that watched is used as a verb then the noun phrase is likely to be a movie.
In contrast to the usual Bayesian treatment of PCFGs which places priors on global rule probabilities (Kurihara and Sato, 2006; Johnson et al., 2007; Wang and Blunsom, 2013), the compound PCFG assumes a prior on local, sentence-level rule probabilities. It is therefore closely related to the Bayesian grammars studied by Cohen et al. (2009) and Cohen and Smith (2009), who also sample local rule probabilities from a logistic normal prior for training dependency models with valence (DMV) (Klein and Manning, 2004).

Inference in Compound PCFGs
The expressivity of compound PCFGs comes with a significant challenge in learning and inference. Letting $\theta = \{E_G, \lambda\}$ be the parameters of the generative model, we would like to maximize the log marginal likelihood of the observed sentence, $\log p_\theta(x)$. In the neural PCFG the log marginal likelihood $\log p_\theta(x) = \log \sum_{t \in \mathcal{T}_G(x)} p_\theta(t)$ can be obtained by summing out the latent tree structure using the inside algorithm (Baker, 1979), which is differentiable and thus amenable to gradient-based optimization. In the compound PCFG, the log marginal likelihood is given by

$$\log p_\theta(x) = \log \int \sum_{t \in \mathcal{T}_G(x)} p_\theta(t \mid z)\, p_\gamma(z)\, dz.$$

Notice that while the integral over $z$ makes this quantity intractable, when we condition on $z$ we can tractably perform the inner summation as before using the inside algorithm. We therefore resort to collapsed amortized variational inference. We first obtain a sample $z$ from a variational posterior distribution (given by an amortized inference network), then perform the inner marginalization conditioned on this sample. The evidence lower bound $\mathrm{ELBO}(\theta, \phi; x)$ is then given by

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}[q_\phi(z \mid x) \,\|\, p_\gamma(z)],$$

and we can calculate $p_\theta(x \mid z) = \sum_{t \in \mathcal{T}_G(x)} p_\theta(t \mid z)$ with the inside algorithm given a sample $z$ from a variational posterior $q_\phi(z \mid x)$. For the variational family we use a diagonal Gaussian where the mean/log-variance vectors are given by an affine layer over max-pooled hidden states from an LSTM over $x$. We can obtain low-variance estimators for the gradient $\nabla_{\theta,\phi} \mathrm{ELBO}(\theta, \phi; x)$ by using the reparameterization trick for the expected reconstruction likelihood and the analytical expression for the KL term (Kingma and Welling, 2014).
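A single-sample estimate of the ELBO can be sketched as follows, assuming a diagonal Gaussian posterior and a standard normal prior as described above. The callable `log_px_given_z` is a hypothetical stand-in for running the inside algorithm with rule probabilities $\pi_z$; the function names are illustrative. Gradients are not computed here; in practice the same computation would be written in an automatic differentiation framework so that the reparameterized sample carries gradients to $\phi$.

```python
import math
import random


def kl_diag_gaussian_to_standard(mu, logvar):
    """Closed-form KL[N(mu, diag(exp(logvar))) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def elbo_estimate(log_px_given_z, mu, logvar, rng):
    """One-sample ELBO: log p(x | z) at a reparameterized z, minus the analytic KL."""
    z = reparameterize(mu, logvar, rng)
    return log_px_given_z(z) - kl_diag_gaussian_to_standard(mu, logvar)


# Hypothetical usage with a constant likelihood, just to exercise the arithmetic.
rng = random.Random(0)
estimate = elbo_estimate(lambda z: -3.0, [1.0, 0.0], [0.0, 0.0], rng)
```

Only the reconstruction term is sampled; the KL term is exact, which is one reason the resulting gradient estimator has comparatively low variance.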
We remark that under the Bayesian PCFG view, since the parameters of the prior (i.e. $\theta$) are estimated from the data, our approach can be seen as an instance of empirical Bayes (Robbins, 1956).
MAP Inference After training, we are interested in comparing the learned trees against an annotated treebank. This requires inferring the most likely tree given a sentence, i.e. $\arg\max_t p_\theta(t \mid x)$. For the neural PCFG we can obtain the most likely tree by using the Viterbi version of the inside algorithm (CKY algorithm). For the compound PCFG, the argmax is intractable to obtain exactly, and hence we estimate it with the following approximation,

$$\arg\max_t \int p_\theta(t \mid x, z)\, p_\theta(z \mid x)\, dz \approx \arg\max_t p_\theta(t \mid x, \mu_\phi(x)),$$

where $\mu_\phi(x)$ is the mean vector from the inference network. The above approximates the true posterior $p_\theta(z \mid x)$ with $\delta(z - \mu_\phi(x))$, the Dirac delta function at the mode of the variational posterior. This quantity is tractable to estimate as in the PCFG case. Other approximations are possible: for example, we could use $q_\phi(z \mid x)$ as an importance sampling distribution to estimate the first integral. However, we found the above approximation to be efficient and effective in practice.
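The Viterbi variant replaces the sum in the inside recursion with a max and records backpointers. The sketch below is a plain-Python illustration on a toy grammar; the function name `viterbi_cky`, the dictionary layout, and the probabilities are assumptions made for this example. For the compound PCFG, the same routine would be called with rule probabilities computed at $z = \mu_\phi(x)$.

```python
def viterbi_cky(words, root, binary, emit):
    """Return (probability, tree) of the most likely parse, or (0.0, None).

    root:   {A: probability of S -> A}
    binary: {A: {(B, C): probability of A -> B C}}
    emit:   {T: {w: probability of T -> w}}
    Trees are nested tuples: (T, word) for leaves, (A, left, right) otherwise.
    """
    n = len(words)
    best = {}  # (i, j, X) -> (probability, backpointer); backpointer is None at leaves
    for i, w in enumerate(words):
        for T, dist in emit.items():
            if dist.get(w, 0.0) > 0.0:
                best[(i, i + 1, T)] = (dist[w], None)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for A, rules in binary.items():
                for (B, C), p in rules.items():
                    for k in range(i + 1, j):
                        left = best.get((i, k, B))
                        right = best.get((k, j, C))
                        if left is None or right is None:
                            continue
                        score = p * left[0] * right[0]
                        if score > best.get((i, j, A), (0.0, None))[0]:
                            best[(i, j, A)] = (score, (k, B, C))

    candidates = [(p * best[(0, n, A)][0], A) for A, p in root.items() if (0, n, A) in best]
    if not candidates:
        return 0.0, None
    prob, top = max(candidates)

    def build(i, j, X):
        backpointer = best[(i, j, X)][1]
        if backpointer is None:
            return (X, words[i])
        k, B, C = backpointer
        return (X, build(i, k, B), build(k, j, C))

    return prob, build(0, n, top)


# Toy grammar (hypothetical): one nonterminal A, two preterminals T1 and T2.
ROOT = {"A": 1.0}
BINARY = {"A": {("T1", "T2"): 0.5, ("T1", "A"): 0.3, ("A", "T2"): 0.2}}
EMIT = {"T1": {"a": 0.6, "b": 0.4}, "T2": {"a": 0.1, "b": 0.9}}

best_prob, best_tree = viterbi_cky(["a", "a", "b"], ROOT, BINARY, EMIT)
```

On this input the right-branching analysis wins, since the alternative requires the unlikely emission of `a` from `T2`.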

Experimental Setup
Data We test our approach on the Penn Treebank (PTB) (Marcus et al., 1993) with the standard splits (2-21 for training, 22 for validation, 23 for test) and the same preprocessing as in recent works (Shen et al., 2018, 2019), where we discard punctuation, lowercase all tokens, and take the top 10K most frequent words as the vocabulary. This setup is more challenging than traditional setups, which usually experiment on shorter sentences and use gold part-of-speech tags.
We further experiment on Chinese with version 5.1 of the Chinese Penn Treebank (CTB) (Xue et al., 2005), with the same splits as in Chen and Manning (2014). On CTB we also remove punctuation and keep the top 10K word types.
Hyperparameters Our PCFG uses 30 nonterminals and 60 preterminals, with 256-dimensional symbol embeddings. The compound PCFG uses 64-dimensional latent vectors. The bidirectional LSTM inference network has a single layer with 512 dimensions, and the mean and the log variance vector for $q_\phi(z \mid x)$ are given by max-pooling the hidden states of the LSTM and passing the result through an affine layer. Model parameters are initialized with Xavier uniform initialization. For training we use Adam (Kingma and Ba, 2015) with $\beta_1 = 0.75$, $\beta_2 = 0.999$ and a learning rate of 0.001, with a maximum gradient norm limit of 3. We train for 10 epochs with batch size equal to 4. We employ a curriculum learning strategy (Bengio et al., 2009) where we train only on sentences of length up to 30 in the first epoch, and increase this length limit by 1 each epoch. This slightly improved performance, and similar strategies have been used in the past for grammar induction (Spitkovsky et al., 2012). During training we perform early stopping based on validation perplexity. To mitigate overfitting to PTB, experiments on CTB use the same hyperparameters as on PTB.
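The length-based curriculum above amounts to a simple filter on the training set per epoch. A minimal sketch, assuming 0-indexed epochs and sentences represented as lists of tokens; the function names are hypothetical and the released training script may organize this differently.

```python
def max_length_for_epoch(epoch, start=30, step=1):
    """Length cap for a 0-indexed epoch: 30 in the first epoch, then +1 per epoch."""
    return start + step * epoch


def curriculum_subset(sentences, epoch):
    """Keep only the sentences whose length is within the cap for this epoch."""
    cap = max_length_for_epoch(epoch)
    return [s for s in sentences if len(s) <= cap]


# Toy corpus with sentence lengths 5, 30, 31 and 40.
corpus = [["w"] * n for n in (5, 30, 31, 40)]
kept_per_epoch = [len(curriculum_subset(corpus, e)) for e in range(10)]
```

With 10 epochs the cap runs from 30 to 39 under this reading, so the 40-token sentence in the toy corpus is never used.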

Baselines and Evaluation
We observe that even on PTB, there is enough variation in setups across prior work on grammar induction to render a meaningful comparison difficult. Some important dimensions along which prior works vary include: (1) lexicalization: earlier work on grammar induction generally assumed gold (or induced) part-of-speech tags (Klein and Manning, 2004; Smith and Eisner, 2004; Bod, 2006; Snyder et al., 2009), while more recent works induce grammar directly from words (Spitkovsky et al., 2013; Shen et al., 2018); (2) use of punctuation: even within papers that induce a grammar directly from words, some papers employ heuristics based on punctuation, as punctuation is usually a strong signal for start/end of constituents (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013), some train with punctuation (Jin et al., 2018; Drozdov et al., 2019; Kim et al., 2019), while others discard punctuation altogether for training (Shen et al., 2018, 2019); (3) train/test data: some works do not explicitly separate out train/test sets (Reichart and Rappoport, 2010; Golland et al., 2012) while some do (Huang et al., 2012; Parikh et al., 2014; Htut et al., 2018). Maintaining train/test splits is less of an issue for unsupervised structure learning; however, in this work we follow the latter and separate train/test data; (4) evaluation: for unlabeled F1, almost all works ignore punctuation (even approaches that use punctuation during training typically ignore it during evaluation), but there is some variance in discarding trivial spans (width-one and sentence-level spans) and using corpus-level versus sentence-level F1. In this paper we discard trivial spans and evaluate on sentence-level F1 per recent work (Shen et al., 2018, 2019).
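The evaluation convention adopted above (unlabeled, sentence-level F1 with trivial spans discarded) can be sketched as follows. Spans are assumed to be half-open word-index pairs `(i, j)`. The handling of sentences with no non-trivial spans (counting an empty prediction or gold set as a perfect score for that component) is an assumption of this sketch, and evaluation scripts differ on such edge cases.

```python
def nontrivial_spans(spans, n):
    """Drop width-one spans and the span covering the whole sentence of length n."""
    return {(i, j) for (i, j) in spans if j - i > 1 and not (i == 0 and j == n)}


def sentence_f1(pred_spans, gold_spans, n):
    """Unlabeled F1 between predicted and gold spans for one sentence."""
    pred = nontrivial_spans(pred_spans, n)
    gold = nontrivial_spans(gold_spans, n)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 1.0  # assumed edge-case convention
    recall = overlap / len(gold) if gold else 1.0     # assumed edge-case convention
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def mean_sentence_f1(examples):
    """Sentence-level F1: average the per-sentence scores over (pred, gold, n) triples."""
    scores = [sentence_f1(pred, gold, n) for pred, gold, n in examples]
    return sum(scores) / len(scores)


pred = {(0, 1), (0, 2), (3, 5), (2, 5), (0, 5)}
gold = {(0, 2), (2, 4), (2, 5), (0, 5)}
f1 = sentence_f1(pred, gold, 5)
```

Corpus-level F1 would instead pool the overlap and span counts over all sentences before computing precision and recall, which weights long sentences more heavily.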
Given the above, we mainly compare our approach against two recent, strong baselines with open source code: the Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018) and Ordered Neurons (ON) (Shen et al., 2019). These approaches train a neural language model with gated attention-like mechanisms to induce binary trees, and achieve strong unsupervised parsing performance even when trained on corpora where punctuation is removed. Since the original results were on both language modeling and grammar induction, their hyperparameters were presumably tuned to do well on both and thus may not be optimal for just unsupervised parsing. We therefore tune the hyperparameters of these baselines for unsupervised parsing only (i.e. on validation F1).

Results and Discussion
Table 1 shows the unlabeled F1 scores for our models and various baselines. All models soundly outperform right branching baselines, and we find that the neural PCFG/compound PCFG are strong models for grammar induction. In particular, the compound PCFG outperforms other models by an appreciable margin on both English and Chinese. We again note that we were unable to induce meaningful grammars through a traditional PCFG with the scalar parameterization. See appendix A.2 for the full results (including corpus-level F1) broken down by sentence length.
LSTM LM while maintaining good unsupervised parsing performance.
We thus experiment to see if it is possible to use the induced trees to supervise a more flexible generative model that can make use of tree structures, namely recurrent neural network grammars (RNNG) (Dyer et al., 2016). RNNGs are generative models of language that jointly model syntax and surface structure by incrementally generating a syntax tree and sentence. As with NLMs, RNNGs make no independence assumptions, and have been shown to outperform NLMs in terms of perplexity and grammaticality judgment when trained on gold trees (Kuncoro et al., 2018; Wilcox et al., 2019). We take the best run from each model and parse the training set, and use the induced trees to supervise an RNNG for each model using the parameterization from Kim et al. (2019). We are also interested in syntactic evaluation of our models, and for this we utilize the framework and dataset from Marvin and Linzen (2018), where a model is presented with two minimally different sentences such as:
the senators near the assistant are old
*the senators near the assistant is old
and must assign higher probability to the grammatical sentence. Additionally, Kim et al. (2019) report perplexity improvements by fine-tuning an RNNG trained on gold trees with the unsupervised RNNG (URNNG). Whereas the RNNG is trained to maximize the joint likelihood $\log p(x, t)$, the URNNG maximizes a lower bound on the log marginal likelihood $\log \sum_t p(x, t)$ with a structured inference network that approximates the true posterior. We experiment with a similar approach where we fine-tune RNNGs trained on induced trees with URNNGs. We perform early stopping for both RNNG and URNNG based on validation perplexity. See appendix A.3 for further details regarding the experimental setup.
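The grammaticality judgment protocol described above reduces to a comparison of sentence scores over minimal pairs. A minimal sketch, assuming a hypothetical `log_prob` callable that returns a model's log probability for a sentence; the scores in the example are made up purely to exercise the code and are not results from any model.

```python
def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical one scores higher."""
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Made-up scores standing in for a language model (illustration only).
toy_scores = {
    "the senators near the assistant are old": -21.5,
    "the senators near the assistant is old": -23.0,
    "the author laughs": -9.0,
    "the author laugh": -8.5,
}
pairs = [
    ("the senators near the assistant are old", "the senators near the assistant is old"),
    ("the author laughs", "the author laugh"),
]
accuracy = minimal_pair_accuracy(pairs, toy_scores.get)
```

Because the two sentences in a pair differ in a single word, the comparison isolates the model's preference on that syntactic contrast rather than its overall perplexity.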
The results are shown in Table 3. For perplexity, RNNGs trained on induced trees (Induced RNNG in Table 3) are unable to improve upon an LSTM LM, [16] in contrast to the supervised RNNG which does outperform the LSTM language model (Table 3, bottom). For grammaticality judgment however, the RNNG trained with compound PCFG trees outperforms the LSTM LM despite obtaining worse perplexity, [17] and performs on par with the RNNG trained on gold trees. Fine-tuning with the URNNG results in improvements in perplexity and grammaticality judgment across the board (Induced URNNG in Table 3). We also obtain large improvements on unsupervised parsing as measured by F1, with the fine-tuned URNNGs outperforming the respective original models. [18] This is potentially due to an ensembling effect between the original model and the URNNG's structured inference network, which is parameterized as a neural CRF constituency parser (Durrett and Klein, 2015; Liu et al., 2018). [19]
Model Analysis We analyze our best compound PCFG model in more detail. Since we induce a full set of nonterminals in our grammar, we can analyze the learned nonterminals to see if they can be aligned with linguistic constituent labels. Figure 2 visualizes the alignment between induced and gold labels, where for each nonterminal we show the empirical probability that a predicted constituent of this type will correspond to a particular linguistic constituent in the test set, conditioned on its being a correct constituent (for reference we also show the precision). We observe that some of the induced nonterminals clearly align to linguistic nonterminals. Further results, including preterminal alignments to part-of-speech tags, [20] are shown in appendix A.4.
[16] Under our RNNG parameterization, the LSTM LM is equivalent to an RNNG trained with right branching trees.
[17] Kuncoro et al. (2018, 2019) also observe that models that achieve lower perplexity do not necessarily perform better on syntactic evaluation tasks.
[18] Li et al. (2019) similarly obtain improvements by refining a model trained on induced trees on classification tasks.
[19] While left as future work, it is possible to use the compound PCFG itself as an inference network. Also note that the F1 scores for the URNNGs in Table 3 are optimistic since we selected the best-performing runs of the original models based on validation F1 to parse the training set.
[20] As a POS induction system, the many-to-one performance of the compound PCFG using the preterminals is 68.0. A similarly-parameterized compound HMM with 60 hidden states (an HMM is a particular type of PCFG) obtains 63.2. This is still quite a bit lower than the state-of-the-art (Tran et al., 2016; He et al., 2018; Stratos, 2019), though comparison is confounded by various factors such as preprocessing (e.g. we drop punctuation). A neural PCFG/HMM obtains 68.2 and 63.4 respectively.
Figure 2: For each nonterminal we visualize the proportion of correctly-predicted constituents that correspond to particular gold labels. For reference we also show the precision (i.e. probability of correctly predicting unlabeled constituents) in the rightmost column.

Query: he retired as senior vice president finance and administration and chief financial officer of the company oct. N
  kenneth j. unk who was named president of this thrift holding company in august resigned citing personal reasons
  the former president and chief executive eric w. unk resigned in june
  unk 's president and chief executive officer john unk said the loss stems from several factors
  mr. unk is executive vice president and chief financial officer of unk and will continue in those roles
  charles j. lawson jr. N who had been acting chief executive since june N will continue as chairman
Query: unk corp. received an N million army contract for helicopter engines
  boeing co. received a N million air force contract for developing cable systems for the unk missile
  general dynamics corp. received a N million air force contract for unk training sets
  grumman corp. received an N million navy contract to upgrade aircraft electronics
  thomson missile products with about half british aerospace 's annual revenue include the unk unk missile family
  already british aerospace and french unk unk unk on a british missile contract and on an air-traffic control radar system
Query: meanwhile during the the s&p trading halt s&p futures sell orders began unk up while stocks in new york kept falling sharply
  but the unk of s&p futures sell orders weighed on the market and the link with stocks began to fray again
  on friday some market makers were selling again traders said
  futures traders say the s&p was unk that the dow could fall as much as N points
  meanwhile two initial public offerings unk the unk market in their unk day of national over-the-counter trading friday
  traders said most of their major institutional investors on the other hand sat tight
Table 4: For each query sentence (labeled Query), we show the 5 nearest neighbors based on cosine similarity, where we take the representation for each sentence to be the mean of the variational posterior.
We next analyze the continuous latent space. Table 4 shows nearest neighbors of some sentences using the mean of the variational posterior as the continuous representation of each sentence. We qualitatively observe that the latent space seems to capture topical information. We are also interested in the variation in the leaves due to $z$ when the variation due to the tree structure is held constant. To investigate this, we use the parsed dataset to obtain pairs of the form $(\mu_\phi(x^{(n)}), t_j^{(n)})$, where $t_j^{(n)}$ is the $j$-th subtree of the (approximate) MAP tree $t^{(n)}$ for the $n$-th sentence. Therefore each mean vector $\mu_\phi(x^{(n)})$ is associated with $|x^{(n)}| - 1$ subtrees, where $|x^{(n)}|$ is the sentence length. Our definition of subtree here ignores terminals, and thus each subtree is associated with many mean vectors. For a frequently occurring subtree, we perform PCA on the set of mean vectors that are associated with the subtree to obtain the top principal component. We then show the constituents that had the 5 most positive/negative values for this top principal component in Table 5. For example, a particularly common subtree, associated with 180 unique constituents, is a six-leaf subtree rooted at NT-04 (tree diagram not reproduced here). The top 5 constituents with the most negative/positive values are shown in the top left part of Table 5. We find that the leaves $[w_1, \ldots, w_6]$, which form a 6-word constituent, vary in a regular manner as $z$ is varied. We also observe that the root of this subtree (NT-04) aligns to prepositional phrases (PP) in Figure 2, and the leaves in Table 5 (top left) are indeed mostly PP. However, the model fails to identify ((T-40 $w_5$) (T-22 $w_6$)) as a constituent in this case (as well as in the bottom right example). See appendix A.5 for more examples. It is possible that the model is utilizing the subtrees to capture broad template-like structures and then using $z$ to fill them in, similar to recent works that also train models to separate "what to say" from "how to say it" (Wiseman et al., 2018; Peng et al., 2019; Chen et al., 2019a,b).
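Both analyses of the latent space operate on the posterior mean vectors only. The sketch below illustrates them in plain Python on made-up two-dimensional vectors: nearest neighbors under cosine similarity, and the top principal component obtained by power iteration on the centered data. Power iteration is a choice made for this sketch (any PCA routine would do), the function names are hypothetical, and the neighbor search does not exclude the query itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(query, vectors, k=5):
    """Indices of the k vectors most similar to the query under cosine similarity."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]


def top_principal_component(vectors, iters=200):
    """Return (direction, scores): top PCA direction and each vector's projection on it."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Power iteration on X^T X; assumes the start vector is not orthogonal to the answer.
    direction = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(x[j] * direction[j] for j in range(d)) for x in centered]
        new = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(a * a for a in new))
        if norm == 0.0:
            break
        direction = [a / norm for a in new]
    scores = [sum(x[j] * direction[j] for j in range(d)) for x in centered]
    return direction, scores


neighbors = nearest_neighbors([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0], [1.0, 1.0]], k=2)
direction, scores = top_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
```

Sorting the constituents of a subtree by `scores` and taking the two extremes corresponds to the PC - and PC + lists reported in Table 5.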
Limitations We report on some negative results as well as important limitations of our work. While distributed representations promote parameter sharing, we were unable to obtain improvements through more factorized parameterizations that promote even greater parameter sharing. In particular, for rules of the type $A \to BC$, we tried having the output embeddings be a function of the input embeddings (e.g. $u_{BC} = g([w_B; w_C])$ where $g$ is an MLP), but obtained worse results. For rules of the type $T \to w$, we tried using a character-level CNN (dos Santos and Zadrozny, 2014; Kim et al., 2016) to obtain the output word embeddings $u_w$ (Jozefowicz et al., 2016; Tran et al., 2016), but found the performance to be similar to the word-level case. We were also unable to obtain improvements through normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016). However, given that we did not exhaustively explore the full space of possible parameterizations, the above modifications could eventually lead to improvements with the right setup.
Relatedly, the models were quite sensitive to parameterization (e.g. it was important to use residual layers for $f_1, f_2$), grammar size, and optimization method. Finally, despite vectorized GPU implementations, training was significantly more expensive (both in terms of time and memory) than NLM-based grammar induction systems due to the $O(|\mathcal{R}||x|^3)$ dynamic program, which makes our approach potentially difficult to scale.

PC -:
  purchased through the exercise of stock options
  circulated by a handful of major brokers
  higher as a percentage of total loans
  common with a lot of large companies
  surprised by the storm of sell orders
PC +:
  brought to the u.s. against her will
  laid for the arrest of opposition activists
  uncertain about the magnitude of structural damage
  held after the assassination of his mother
  hurt as a result of the violations

PC -:
  raise the minimum grant for smaller states
  veto a defense bill with inadequate funding
  avoid an imminent public or private injury
  field a competitive slate of congressional candidates
  alter a longstanding ban on such involvement
PC +:
  generate an offsetting profit by selling waves
  change an export loss to domestic plus
  expect any immediate problems with margin calls
  make a positive contribution to our earnings
  find a trading focus discouraging much participation
Table 5: For each subtree, we perform PCA on the variational posterior mean vectors that are associated with that particular subtree and take the top principal component. We then list the top 5 constituents that had the lowest (PC -) and highest (PC +) principal component values.
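To give a rough sense of the $|\mathcal{R}|$ factor in the cost noted in the limitations above, the number of binary rules can be worked out from the grammar size reported in the experimental setup (30 nonterminals, 60 preterminals). This is a back-of-the-envelope count that assumes every rule of the form $A \to B\,C$ is instantiated:

$$|\mathcal{M}| = |\mathcal{N} \cup \mathcal{P}|^2 = (30 + 60)^2 = 8100,$$
$$|\{A \to B\,C\}| = |\mathcal{N}| \cdot |\mathcal{M}| = 30 \times 8100 = 243{,}000.$$

Each of these rules is considered for every span and split point in the chart, which is the source of the cubic dependence on sentence length.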

Related Work
Grammar induction has a long and rich history in natural language processing. Early work on grammar induction with pure unsupervised learning was mostly negative (Lari and Young, 1990; Carroll and Charniak, 1992; Charniak, 1993), though Pereira and Schabes (1992) reported some success on partially bracketed data. Clark (2001) and Klein and Manning (2002) were some of the first successful statistical approaches to grammar induction. In particular, the constituent-context model (CCM) of Klein and Manning (2002), which explicitly models both constituents and distituents, was the basis for much subsequent work (Klein and Manning, 2004; Huang et al., 2012; Golland et al., 2012). Other works have explored imposing inductive biases through Bayesian priors (Johnson et al., 2007; Liang et al., 2007; Wang and Blunsom, 2013), modified objectives (Smith and Eisner, 2004), and additional constraints on recursion depth (Noji et al., 2016; Jin et al., 2018).
While the framework of specifying the structure of a grammar and learning the parameters is common, other methods exist. Bod (2006) considers a nonparametric-style approach to unsupervised parsing by using random subsets of training subtrees to parse new sentences. Seginer (2007) utilizes an incremental algorithm for unsupervised parsing which makes local decisions to create constituents based on a complex set of heuristics. Ponvert et al. (2011) induce parse trees through cascaded applications of finite state models.
More recently, neural network-based approaches to grammar induction have shown promising results on inducing parse trees directly from words. In particular, Shen et al. (2018, 2019) learn tree structures through gated mechanisms within hidden layers of neural language models, while Drozdov et al. (2019) combine recursive autoencoders with the inside-outside algorithm. Kim et al. (2019) train unsupervised recurrent neural network grammars with a structured inference network to induce latent trees.
Our work is also related to latent variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006; Cohen et al., 2012), which extend PCFGs to the latent variable setting by splitting nonterminal symbols into latent subsymbols. In particular, latent vector grammars (Zhao et al., 2018) and compositional vector grammars (Socher et al., 2013) also employ continuous vectors within their grammars. However, these extensions have been employed for learning supervised parsers on annotated treebanks, in contrast to the unsupervised setting of the current work.

Conclusion
This work explores grammar induction with compound PCFGs, which modulate rule probabilities with per-sentence continuous latent vectors. The latent vector induces marginal dependencies beyond the traditional first-order context-free assumptions within a tree-based generative process, leading to improved performance. The collapsed amortized variational inference approach is general and can be used for generative models which admit tractable inference through partial conditioning. Learning deep generative models which exhibit such conditional Markov properties is an interesting direction for future work.

A.1 Model Parameterization
Neural PCFG We associate an input embedding w_N for each symbol N on the left side of a rule (i.e. N ∈ {S} ∪ N ∪ P) and run a neural network over w_N to obtain the rule probabilities. Concretely, each rule type π_r is parameterized as follows, where M is the product space (N ∪ P) × (N ∪ P), and f_1, f_2 are MLPs with two residual layers. The bias terms for the above expressions (including for the rule probabilities) are omitted for notational brevity. In Figure 1 we use the following to refer to rule probabilities of different rule types, where L(A) denotes the set of rules with A on the left hand side.
Compound PCFG The compound PCFG rule probabilities π_z given a latent vector z are defined analogously, conditioned on z.
Again the bias terms are omitted for brevity, and f_1, f_2 are as before, except that the first layer's input dimensions are changed to account for concatenation with z.
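As a reading aid, one softmax parameterization consistent with the description in this subsection is sketched below. The output vectors u_A, u_{BC}, u_w, the concatenation notation [w; z], the terminal vocabulary symbol Σ, and the choice of which rule types pass through f_1 and f_2 are assumptions made for illustration, not definitions confirmed by the text above. Dropping z from each expression would give the corresponding neural PCFG form.

```latex
\begin{align*}
\pi_{z,\,S \to A} &=
  \frac{\exp\!\big(u_A^\top f_1([w_S; z])\big)}
       {\sum_{A' \in \mathcal{N}} \exp\!\big(u_{A'}^\top f_1([w_S; z])\big)}, \\
\pi_{z,\,A \to BC} &=
  \frac{\exp\!\big(u_{BC}^\top [w_A; z]\big)}
       {\sum_{B'C' \in \mathcal{M}} \exp\!\big(u_{B'C'}^\top [w_A; z]\big)}, \\
\pi_{z,\,T \to w} &=
  \frac{\exp\!\big(u_w^\top f_2([w_T; z])\big)}
       {\sum_{w' \in \Sigma} \exp\!\big(u_{w'}^\top f_2([w_T; z])\big)}.
\end{align*}
```

Each expression normalizes over all right-hand sides that share a left-hand symbol, so the probabilities of rules in L(A) sum to one for every A and every value of z.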
A.2 Corpus/Sentence F1 by Sentence Length
For completeness we show the corpus-level and sentence-level F1 broken down by sentence length in Table 6, averaged across 4 different runs of each model.
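The difference between the two F1 variants, and the length-based subsets, can be illustrated with a short sketch. It assumes that each tree is represented as a set of (start, end) spans and that trivial spans have already been removed. The function names and these conventions are hypothetical and may differ from the evaluation script used for the reported numbers.

```python
def f1(tp, n_pred, n_gold):
    """F1 from true positives and the numbers of predicted and gold spans."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_and_corpus_f1(pred_spans, gold_spans):
    """pred_spans, gold_spans: one set of (i, j) spans per sentence."""
    per_sentence = []
    tp_all = pred_all = gold_all = 0
    for pred, gold in zip(pred_spans, gold_spans):
        tp = len(pred & gold)
        per_sentence.append(f1(tp, len(pred), len(gold)))
        tp_all += tp
        pred_all += len(pred)
        gold_all += len(gold)
    # Sentence-level: average the per-sentence F1 scores.
    sentence_f1 = sum(per_sentence) / len(per_sentence)
    # Corpus-level: pool the counts over all sentences, then compute one F1.
    corpus_f1 = f1(tp_all, pred_all, gold_all)
    return sentence_f1, corpus_f1


def restrict_to_max_length(lengths, pred_spans, gold_spans, max_len):
    """Keep only sentences with at most max_len words (e.g. 10 for WSJ-10)."""
    keep = [k for k, n in enumerate(lengths) if n <= max_len]
    return [pred_spans[k] for k in keep], [gold_spans[k] for k in keep]
```

Sentence-level F1 weights every sentence equally, while corpus-level F1 gives more weight to long sentences because they contribute more spans.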

A.3 Experiments with RNNGs
For experiments on supervising RNNGs with induced trees, we use the parameterization and hyperparameters from Kim et al. (2019), which uses a 2-layer 650-dimensional stack LSTM (with dropout of 0.5) and a 650-dimensional tree LSTM (Tai et al., 2015;Zhu et al., 2015) as the composition function.
Concretely, the generative story is as follows: first, the stack representation is used to predict the next action (SHIFT or REDUCE) via an affine transformation followed by a sigmoid. If SHIFT is chosen, we obtain a distribution over the vocabulary via another affine transformation over the stack representation followed by a softmax. Then we sample the next word from this distribution and shift the generated word onto the stack using the stack LSTM. If REDUCE is chosen, we pop the last two elements off the stack and use the tree LSTM to obtain a new representation. This new representation is shifted onto the stack via the stack LSTM.
Note that this RNNG parameterization is slightly different from the original of Dyer et al. (2016), which does not ignore constituent labels and utilizes a bidirectional LSTM as the composition function instead of a tree LSTM. As our RNNG parameterization only works with binary trees, we binarize the gold trees with right binarization for the RNNG trained on gold trees (trees from the unsupervised methods explored in this paper are already binary).
The RNNG also trains a discriminative parser alongside the generative model for evaluation with importance sampling. We use a CRF parser whose span score parameterization is similar to recent works (Wang and Chang, 2016; Stern et al., 2017; Kitaev and Klein, 2018): position embeddings are added to word embeddings, and a bidirectional LSTM with 256 hidden dimensions is run over the input representations to obtain the forward and backward hidden states. The score s_ij ∈ R for a constituent spanning the i-th to j-th words is computed from these hidden states by an MLP that has a single hidden layer with ReLU nonlinearity followed by layer normalization (Ba et al., 2016).
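The right binarization of gold trees mentioned above can be sketched as follows. The sketch assumes unlabeled trees written as nested tuples with string leaves, which matches the label-free RNNG parameterization described here but is otherwise a hypothetical representation. The text does not specify the exact procedure beyond "right binarization", so the handling of unary nodes in particular is an assumption.

```python
def right_binarize(tree):
    """Binarize an unlabeled tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return tree
    children = [right_binarize(child) for child in tree]
    if len(children) == 1:
        # Collapse unary nodes (an assumption; labels are ignored anyway).
        return children[0]
    # Fold the children from the right: (c1, c2, c3) -> (c1, (c2, c3)).
    node = children[-1]
    for child in reversed(children[:-1]):
        node = (child, node)
    return node
```

Nodes that already have two children are left unchanged, so applying the function to the output of an unsupervised binary parser is a no-op.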
For experiments on fine-tuning the RNNG with the unsupervised RNNG, we take the discriminative parser (which is also pretrained alongside the RNNG on induced trees) to be the structured inference network for optimizing the evidence lower bound. We refer the reader to Kim et al. (2019) and their open source implementation for additional details. We also observe that, as noted by Kim et al. (2019), a URNNG trained from scratch on this version of PTB without punctuation failed to outperform a right-branching baseline.
The LSTM language model baseline is the same size as the stack LSTM (i.e. 2 layers, 650 hidden units, dropout of 0.5), and is therefore equivalent to an RNNG with completely right branching trees.For all models we share input/output word embeddings (Press and Wolf, 2016).Perplexity estimation for the RNNGs and the compound PCFG uses 1000 importance-weighted samples.
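An importance-weighted perplexity estimate of the kind described above can be sketched as follows. The sketch assumes that, for each sentence, K latent samples (trees for the RNNG, latent vectors for the compound PCFG) have been drawn from a proposal distribution and that the log joint and log proposal probabilities of each sample are available. The function names and the token-counting convention are hypothetical.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iw_log_likelihood(log_joint, log_proposal):
    """Importance-weighted estimate of log p(x) from K samples.

    log_joint[k]    = log p(x, s_k) for the k-th sampled latent s_k
    log_proposal[k] = log q(s_k | x) under the proposal distribution
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_proposal)]
    return log_mean_exp(log_weights)


def perplexity(sentence_log_likelihoods, num_tokens):
    """Corpus perplexity from per-sentence log-likelihood estimates."""
    return math.exp(-sum(sentence_log_likelihoods) / num_tokens)
```

With K = 1000, `log_joint` and `log_proposal` would each hold 1000 entries per sentence. Whether `num_tokens` counts end-of-sentence symbols is a convention that has to match across the models being compared.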
For grammaticality judgment, we modify the publicly available dataset from Marvin and Linzen (2018) to only keep sentence pairs that did not have any unknown words with respect to our PTB vocabulary of 10K words. This results in 33K sentence pairs for evaluation.
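The vocabulary filter can be sketched in a few lines. Whitespace tokenization and the representation of each pair as two strings are assumptions; the actual preprocessing may normalize text differently.

```python
def keep_known_pairs(pairs, vocab):
    """Keep (grammatical, ungrammatical) pairs whose tokens are all in vocab."""
    def known(sentence):
        return all(token in vocab for token in sentence.split())

    return [(good, bad) for good, bad in pairs if known(good) and known(bad)]
```

A pair is dropped if either of its two sentences contains an out-of-vocabulary word, so the model never has to score an unknown token during evaluation.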

A.4 Nonterminal/Preterminal Alignments
Figure 3 shows the part-of-speech alignments and Table 7 shows the nonterminal label alignments for the compound PCFG/neural PCFG.
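A label alignment of this kind can be computed as in the sketch below. It assumes that predicted and gold constituents are pooled over the corpus into dictionaries keyed by (sentence id, start, end) and that each gold span carries a single label. Both the data layout and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def label_alignment(pred_labels, gold_labels):
    """Align induced labels with gold labels over correctly predicted spans.

    pred_labels: dict mapping (sent_id, i, j) -> induced label
    gold_labels: dict mapping (sent_id, i, j) -> gold label
    """
    counts = defaultdict(Counter)
    predicted = Counter(pred_labels.values())
    for span, induced in pred_labels.items():
        if span in gold_labels:  # span is a correct unlabeled constituent
            counts[induced][gold_labels[span]] += 1
    table = {}
    for induced, n_pred in predicted.items():
        correct = sum(counts[induced].values())
        table[induced] = {
            # Proportion of correct constituents per gold label.
            "alignment": {g: c / correct for g, c in counts[induced].items()},
            # Fraction of this label's predictions that are gold constituents.
            "precision": correct / n_pred,
            # Share of all predicted constituents carrying this label.
            "frequency": n_pred / len(pred_labels),
        }
    return table
```

The same computation applies to preterminals if spans are replaced by word positions and gold labels by part-of-speech tags.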

A.5 Subtree Analysis
Table 8 lists more examples of constituents within each subtree as the top principal component is varied. Due to data sparsity, the subtree analysis is performed on the full dataset. See section 5 for more details.

would be irresponsible | has been growing
could be delayed | 've been neglected
can be held | had been made
can be proven | had been canceled
could be used | have been wary

the frankfurt market was mixed | the gramm-rudman targets are met
the u.s. unit edged lower | a private meeting is scheduled
a news release was prepared | the key assumption is valid
the stock market closed wednesday | the budget scorekeeping is completed
the stock market remains fragile | the tax bill is enacted

has been operating in paris | will be used for expansion
has been taken in colombia | might be room for flexibility
has been vacant since july | may be built in britain
have been dismal for years | will be supported by advertising
has been improving since then | could be used as weapons

to integrate the products into their operations | to defend the company in such proceedings
to offset the problems at radio shack | to dismiss an indictment against her claiming
to purchase one share of common stock | to death some N of his troops
to tighten their hold on their business | to drop their inquiry into his activities
to use the microprocessor in future products | to block the maneuver on procedural grounds

has been mentioned as a takeover candidate | would be run by the joint chiefs
has been stuck in a trading range | would be made into a separate bill
had left announced to the trading mob | would be included in the final bill
only become active during the closing minutes | would be costly given the financial arrangement
will get settled in the short term | would be restricted by a new bill

to supply that country with other defense systems | to enjoy a loyalty among junk bond investors
to transfer its skill at designing military equipment | to transfer their business to other clearing firms
to improve the availability of quality legal service | to soften the blow of declining stock prices
to unveil a family of high-end personal computers | to keep a lid on short-term interest rates
to arrange an acceleration of planned tariff cuts | to urge the fed toward lower interest rates

unconsolidated pretax profit increased N % to N billion | amex short interest climbed N % to N shares
its total revenue rose N % to N billion | its pretax profit rose N % to N million
total operating revenue grew N % to N billion | its pretax profit rose N % to N billion
its group sales rose N % to N billion | fiscal first-half sales slipped N % to N million
total operating expenses increased N % to N billion | total operating expenses increased N % to N billion

Table 8: For each subtree (shown at the top of each set of examples), we perform PCA on the variational posterior mean vectors that are associated with that particular subtree and take the top principal component. We then list the top 5 constituents that had the lowest (left) and highest (right) principal component values.
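The ranking behind this kind of analysis can be sketched without external libraries by using power iteration to find the top principal component. The sketch assumes one posterior mean vector per constituent occurrence, given as a list of floats. The function names, the fixed start vector, and the iteration count are illustrative choices, and a library PCA routine would serve equally well.

```python
import math


def top_pc_scores(vectors, iters=500):
    """Project vectors onto their top principal component (power iteration)."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Deterministic, non-symmetric start vector (an arbitrary choice).
    direction = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
        # Multiply by the unnormalized covariance matrix X^T X.
        updated = [sum(scores[i] * centered[i][j] for i in range(n))
                   for j in range(d)]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0.0:  # no variance, or start vector orthogonal to the data
            break
        direction = [u / norm for u in updated]
    return [sum(c[j] * direction[j] for j in range(d)) for c in centered]


def extremes(constituents, vectors, k=5):
    """Return the k constituents with the lowest and highest PC values."""
    scores = top_pc_scores(vectors)
    ranked = [c for _, c in sorted(zip(scores, constituents))]
    return ranked[:k], ranked[-k:]
```

The sign of a principal component is arbitrary, so which end counts as "lowest" and which as "highest" carries no meaning beyond separating the two groups.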

Figure 2: Alignment of induced nonterminals ordered from top based on predicted frequency (therefore NT-04 is the most frequently-predicted nonterminal). For each nonterminal we visualize the proportion of correctly-predicted constituents that correspond to particular gold labels. For reference we also show the precision (i.e. probability of correctly predicting unlabeled constituents) in the rightmost column.
PC -
  the company 's capital structure
  in the company 's divestiture program
  by the company 's new board
  in the company 's core businesses
  on the company 's strategic plan
PC +
  above the treasury 's N-year note
  above the treasury 's seven-year note
  above the treasury 's comparable note
  above the treasury 's five-year note
  measured the earth 's ozone layer

PC -
  contract with warner
  to support a coup in panama
  to suit the bureaucrats in brussels
  to thwart his bid for amr
  to prevent the pound from rising
PC +
  to change our strategy of investing
  to offset the growth of minimills
  to be a lot of art
  to change our way of life
  to increase the impact of advertising

Figure 3: Preterminal alignment to part-of-speech tags for the compound PCFG (top) and the neural PCFG (bottom).

(NT-04 (T-13 w1) (NT-12 (T-60 w2) (NT-18 (T-60 w3) (T-21 w4))))
of federally subsidized loans | in fairly thin trading
of criminal racketeering charges | in quiet expiration trading
for individual retirement accounts | in big technology stocks
without prior congressional approval | from small price discrepancies
between the two concerns | by futures-related program buying

(NT-04 (T-13 w1) (NT-12 (T-05 w2) (NT-01 (T-18 w3) (T-25 w4))))
by the supreme court | in a stock-index arbitrage
of the bankruptcy code | as a hedging tool
to the bankruptcy court | of the bond market
in a foreign court | leaving the stock market
for the supreme court | after the new york

(NT-12 (NT-20 (NT-20 (T-05 w1) (T-40 w2)) (T-40 w3)) (T-22 w4))
a syrian troop pullout | the frankfurt stock exchange
a conventional soviet attack | the late sell programs
the house-passed capital-gains provision | a great buying opportunity
the official creditors committee | the most active stocks
a syrian troop withdrawal | a major brokerage firm

(NT-21 (NT-22 (NT-20 (T-05 w1) (T-40 w2)) (T-22 w3)) (NT-13 (T-30 w4) (T-58 w5)))

Table 1: Unlabeled sentence-level F1 scores on PTB and CTB test sets. Top shows results from previous work while the rest of the results are from this paper. Mean/Max scores are obtained from 4 runs of each model with different random seeds. Oracle is the maximum score obtainable with binarized trees, since we compare against the non-binarized gold trees per convention. Results with † are trained on a version of PTB with punctuation, and hence not strictly comparable to the present work. For URNNG/DIORA, we take the parsed test set provided by the authors from their best runs and evaluate F1 with our evaluation setup, which ignores punctuation.
gold, left, right, and "self" trees (top), where self F1 score is calculated by averaging over all 6 pairs obtained from 4 different runs. We find that PRPN is particularly consistent across multiple runs. We also observe that different models are better at identifying different constituent labels, as measured by label recall (Table 2, bottom). While left as future work, this naturally suggests an ensemble approach wherein the empirical probabilities of constituents (obtained by averaging the predicted binary constituent labels from the different models) are used either to supervise another model or directly as potentials in a CRF constituency parser. Finally, all models seemed to have some difficulty in identifying SBAR/VP constituents, which typically span more words than NP constituents.
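Both the self F1 computation and the empirical constituent probabilities suggested for ensembling can be sketched briefly. The sketch assumes that each run's output is a list of span sets (one per sentence) and that F1 is averaged at the sentence level. These conventions and the function names are assumptions made for illustration.

```python
from itertools import combinations


def span_f1(pred, gold):
    """Unlabeled F1 between two sets of (i, j) spans."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def self_f1(runs):
    """Average pairwise F1 between runs; 4 runs give 6 unordered pairs."""
    pair_scores = []
    for a, b in combinations(runs, 2):
        per_sentence = [span_f1(x, y) for x, y in zip(a, b)]
        pair_scores.append(sum(per_sentence) / len(per_sentence))
    return sum(pair_scores) / len(pair_scores)


def constituent_probabilities(predictions):
    """Empirical probability of each span for one sentence across models."""
    counts = {}
    for spans in predictions:
        for span in spans:
            counts[span] = counts.get(span, 0) + 1
    return {span: c / len(predictions) for span, c in counts.items()}
```

The probabilities returned by `constituent_probabilities` are the averaged binary constituent labels described above, which could then serve as soft targets for another model or as span potentials.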

Table 2: (Top) Mean F1 similarity against Gold, Left, Right, and Self trees. Self F1 score is calculated by averaging over all 6 pairs obtained from 4 different runs. (Bottom) Fraction of ground truth constituents that were predicted as a constituent by the models broken down by label (i.e. label recall).

Table 3: Results from training RNNGs on induced trees from various models (Induced RNNG) on the PTB. Induced URNNG indicates fine-tuning with the URNNG objective. We show perplexity (PPL), grammaticality judgment performance (Syntactic Eval.), and unlabeled F1. PPL/F1 are calculated on the PTB test set and Syntactic Eval. is from Marvin and Linzen (2018)'s dataset. Results on top do not make any use of annotated trees, while the bottom two results are trained on binarized gold trees. Note that the perplexity numbers here are not comparable to standard results on the PTB since our models are generative models of sentences and hence we do not carry information across sentence boundaries.

Table 6: Average unlabeled F1 for the various models broken down by sentence length on the PTB test set. For example WSJ-10 refers to F1 calculated on the subset of the test set where the maximum sentence length is at most 10. Scores are averaged across 4 runs of the model with different random seeds. Oracle is the performance of binarized gold trees (with right branching binarization). Top shows sentence-level F1 and bottom shows corpus-level F1.

Table 7: Analysis of label alignment for nonterminals in the compound PCFG (top) and the neural PCFG (bottom). Label alignment is the proportion of correctly-predicted constituents that correspond to a particular gold label. We also show the predicted constituent frequency and accuracy (i.e. precision) on the right. Bottom line shows the frequency in the gold trees.