PCFGs Can Do Better: Inducing Probabilistic Context-Free Grammars with Many Symbols

Probabilistic context-free grammars (PCFGs) with neural parameterization have been shown to be effective in unsupervised phrase-structure grammar induction. However, due to the cubic computational complexity of PCFG representation and parsing, previous approaches cannot scale up to a relatively large number of (nonterminal and preterminal) symbols. In this work, we present a new parameterization form of PCFGs based on tensor decomposition, which has at most quadratic computational complexity in the symbol number and therefore allows us to use a much larger number of symbols. We further use neural parameterization for the new form to improve unsupervised parsing performance. We evaluate our model across ten languages and empirically demonstrate the effectiveness of using more symbols.


Introduction
Unsupervised constituency parsing is the task of inducing phrase-structure grammars from raw text without using parse tree annotations. Early work induces probabilistic context-free grammars (PCFGs) via the Expectation-Maximization algorithm and finds the results unsatisfactory (Lari and Young, 1990; Carroll and Charniak, 1992). Recently, PCFGs with neural parameterization (i.e., using neural networks to generate rule probabilities) have been shown to achieve good results in unsupervised constituency parsing (Kim et al., 2019a; Jin et al., 2019; Zhu et al., 2020). However, due to the cubic computational complexity of PCFG representation and parsing, these approaches learn PCFGs with relatively small numbers of nonterminals and preterminals. For example, Jin et al. (2019) use 30 nonterminals (with no distinction between preterminals and other nonterminals) and Kim et al. (2019a) use 30 nonterminals and 60 preterminals. Our code is available at https://github.com/sustcsonglin/TN-PCFG.
In this paper, we study PCFG induction with a much larger number of nonterminal and preterminal symbols. We are partly motivated by the classic work on latent variable grammars in supervised constituency parsing (Matsuzaki et al., 2005; Petrov et al., 2006; Liang et al., 2007; Cohen et al., 2012; Zhao et al., 2018). While the Penn treebank grammar contains only tens of nonterminals and preterminals, it has been found that dividing them into subtypes can significantly improve the parsing accuracy of the grammar. For example, the best model from Petrov et al. (2006) contains over 1000 nonterminal and preterminal symbols. We are also motivated by the recent work of Buhai et al. (2019), who show that when learning latent variable models, increasing the number of hidden states is often helpful, and by Chiu and Rush (2020), who show that a neural hidden Markov model with up to 2^16 hidden states can achieve surprisingly good performance in language modeling. A major challenge in employing a large number of nonterminal and preterminal symbols is that representing and parsing with a PCFG requires a computational complexity that is cubic in its symbol number. To resolve the issue, we rely on a new parameterization form of PCFGs based on tensor decomposition, which reduces the computational complexity from cubic to at most quadratic. Furthermore, we apply neural parameterization to the new form, which is crucial for boosting the unsupervised parsing performance of PCFGs, as shown by Kim et al. (2019a).
We empirically evaluate our approach across ten languages. On English WSJ, our best model with 500 preterminals and 250 nonterminals improves over the model with 60 preterminals and 30 nonterminals by 6.3% mean F1 score, and we also observe a consistent decrease in perplexity and an overall increase in F1 score with more symbols in our model, thus confirming the effectiveness of using more symbols. Our best model also surpasses the strong baseline Compound PCFG (Kim et al., 2019a) by 1.4% mean F1. We further conduct a multilingual evaluation on nine additional languages. The evaluation results suggest good generalizability of our approach on languages beyond English.
Our key contributions can be summarized as follows: (1) We propose a new parameterization form of PCFGs based on tensor decomposition, which enables us to use a large number of symbols in PCFGs.
(2) We further apply neural parameterization to improve unsupervised parsing performance.
(3) We evaluate our model across ten languages and empirically show the effectiveness of our approach.

Related work
Grammar induction using neural networks: There is a recent resurgence of interest in unsupervised constituency parsing, mostly driven by neural-network-based methods (Shen et al., 2018a, 2019; Drozdov et al., 2019, 2020; Kim et al., 2019a,b; Jin et al., 2019; Zhu et al., 2020). These methods can be categorized into two major groups: those built on top of a generative grammar and those without a grammar component. The approaches most related to ours belong to the first category, which use neural networks to produce grammar rule probabilities. Jin et al. (2019) use an invertible neural projection network (a.k.a. normalizing flow (Rezende and Mohamed, 2015)) to parameterize the preterminal rules of a PCFG. Kim et al. (2019a) use neural networks to parameterize all the PCFG rules. Zhu et al. (2020) extend this work to lexicalized PCFGs, which are more expressive than PCFGs and can model both dependency and constituency parse trees simultaneously.
In other unsupervised syntactic induction tasks, there is also a trend to use neural networks to produce grammar rule probabilities. In unsupervised dependency parsing, the Dependency Model with Valence (DMV) (Klein and Manning, 2004) has been parameterized neurally to achieve higher induction accuracy (Jiang et al., 2016;Yang et al., 2020). In part-of-speech (POS) induction, neurally parameterized Hidden Markov Models (HMM) also achieve state-of-the-art results (Tran et al., 2016;He et al., 2018).
Tensor decomposition on PCFGs: Our work is closely related to Cohen et al. (2013) in that both use tensor decomposition to parameterize the probabilities of binary rules in order to reduce the time complexity of the inside algorithm. However, Cohen et al. (2013) use this technique to speed up inference of an existing PCFG, so they must actually perform tensor decomposition on the rule probability tensor of that PCFG. In contrast, we draw inspiration from this technique to design a new parameterization form of PCFGs that can be learned directly from data. Since we do not have a probability tensor to start with, additional care is needed to ensure the validity of the parameterization, as will be discussed later.

Tensor form of PCFGs
PCFGs build upon context-free grammars (CFGs). We start by introducing CFGs and establishing notation. A CFG is defined as a 5-tuple G = (S, N, P, Σ, R), where S is the start symbol, N is a finite set of nonterminal symbols, P is a finite set of preterminal symbols, Σ is a finite set of terminal symbols, and R is a set of rules of the following forms:

S → A,  A ∈ N
A → B C,  A ∈ N, B, C ∈ N ∪ P
T → w,  T ∈ P, w ∈ Σ

PCFGs extend CFGs by associating each rule r ∈ R with a probability π_r. Let n, p, and q denote the numbers of symbols in N, P, and Σ, respectively. It is convenient to represent the probabilities of the binary rules in tensor form:

T_{h_A, h_B, h_C} = π_{A → B C} ,

where T ∈ R^{n×m×m} is an order-3 tensor, m = n + p, and h_A ∈ [0, n) and h_B, h_C ∈ [0, m) are symbol indices. For the convenience of computation, we assign indices [0, n) to nonterminals in N and [n, m) to preterminals in P. Similarly, for a preterminal rule T → w we define

Q_{h_T, h_w} = π_{T → w} ,

where Q ∈ R^{p×q}, and h_T and h_w are the preterminal index and the terminal index, respectively. Finally, for a start rule S → A we define

r_{h_A} = π_{S → A} ,

where r ∈ R^n. Generative learning of PCFGs involves maximizing the log-likelihood of every observed sentence w = w_1, …, w_l:

log p_θ(w) = log Σ_{t ∈ T_G(w)} p(t) ,

where T_G(w) contains all the parse trees of the sentence w under a PCFG G. The probability of a parse tree t ∈ T_G(w) is defined as p(t) = ∏_{r ∈ t_R} π_r, where t_R is the sequence of rules used in the derivation of t. The log-likelihood log p_θ(w) can be computed efficiently through the inside algorithm, which is fully differentiable and amenable to gradient-based optimization methods.
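As an illustration, the following minimal NumPy sketch (our own, not from the paper) builds the tensors T, Q, and r for a tiny random PCFG and checks that each rule distribution is properly normalized; all variable names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2, 3, 4            # nonterminals, preterminals, terminals
m = n + p                    # indices [0, n) are nonterminals, [n, m) preterminals

# Binary-rule tensor T[h_A, h_B, h_C] = pi(A -> B C): each A-slice is a
# distribution over (B, C) pairs.
T = rng.random((n, m, m))
T /= T.sum(axis=(1, 2), keepdims=True)

# Preterminal rules Q[h_T, h_w] = pi(T -> w) and start rules r[h_A] = pi(S -> A).
Q = rng.random((p, q)); Q /= Q.sum(axis=1, keepdims=True)
r = rng.random(n);      r /= r.sum()

assert np.allclose(T.sum(axis=(1, 2)), 1.0)
assert np.allclose(Q.sum(axis=1), 1.0)
```

Representing the rules this way makes the O(m^3) storage cost of T explicit, which is exactly what the decomposition in later sections avoids.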

Tensor form of the inside algorithm
We first pad T, Q, and r with zeros such that T ∈ R m×m×m , Q ∈ R m×q , r ∈ R m , and all of them can be indexed by both nonterminals and preterminals.
The inside algorithm computes the probability of a symbol A spanning a substring w_{i,j} = w_i, …, w_j in a recursive manner (0 ≤ i < j < l):

s^A_{i,j} = Σ_{B,C} Σ_{k=i}^{j−1} π_{A→BC} · s^B_{i,k} · s^C_{k+1,j} ,    (1)

with the base case s^T_{i,i} = Q_{h_T, h_{w_i}} for preterminals T. We use the tensor form of PCFGs to rewrite Equation 1 as:

(s_{i,j})_{h_A} = Σ_{k=i}^{j−1} Σ_{h_B, h_C} T_{h_A, h_B, h_C} (s_{i,k})_{h_B} (s_{k+1,j})_{h_C} ,    (3)

where s_{i,j}, s_{i,k}, and s_{k+1,j} are all m-dimensional vectors; the dimension h_A corresponds to the symbol A. Thus Equation 3 represents the core computation of the inside algorithm as tensor-vector dot products, which is amenable to acceleration on parallel computing devices such as GPUs. However, the time and space complexity is cubic in m, which makes it impractical to use a large number of nonterminals and preterminals.
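The cubic-time recursion can be sketched in NumPy as follows (a toy illustration of the tensor form, not the paper's implementation); it assumes T, Q, and r have already been zero-padded to m rows as described above.

```python
import numpy as np

def inside(T, Q, r, sent):
    """Cubic-time inside algorithm in tensor form.

    T: (m, m, m) padded binary-rule tensor; Q: (m, q) padded preterminal
    rules; r: (m,) padded start rules; sent: list of terminal indices.
    """
    m, l = T.shape[0], len(sent)
    s = np.zeros((l, l, m))          # s[i, j] = inside score vector of span w_{i..j}
    for i, w in enumerate(sent):     # base case: single-word spans
        s[i, i] = Q[:, w]
    for width in range(1, l):
        for i in range(l - width):
            j = i + width
            for k in range(i, j):    # O(m^3) tensor-vector contraction per split
                s[i, j] += np.einsum('abc,b,c->a', T, s[i, k], s[k + 1, j])
    return r @ s[0, l - 1]           # sentence probability
```

The einsum inside the split loop is exactly the contraction of Equation 3; its cubic dependence on m is the bottleneck the decomposed parameterization removes.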

Parameterizing PCFGs based on tensor decomposition
The tensor form of the inside algorithm has a high computational complexity of O(m^3 l^3), which hinders the algorithm from scaling to a large m. To resolve the issue, we resort to a new parameterization form of PCFGs based on tensor decomposition (TD-PCFGs) (Cohen et al., 2013). As discussed in Section 2, while Cohen et al. (2013) use a TD-PCFG to approximate an existing PCFG for speedup in parsing, we regard a TD-PCFG as a stand-alone model and learn it directly from data. The basic idea behind TD-PCFGs is to use a Kruskal decomposition of the order-3 tensor T. Specifically, we require T to be in the Kruskal form

T = Σ_{l=1}^{d} u_l ⊗ v_l ⊗ w_l ,

where u_l ∈ R^n is the l-th column of a matrix U ∈ R^{n×d}, and v_l, w_l ∈ R^m are the l-th columns of matrices V, W ∈ R^{m×d}. The Kruskal form of the tensor T is crucial for reducing the computation of Equation 3. To show this, we let x = s_{i,k}, y = s_{k+1,j}, and let z be any summand on the right-hand side of Equation 3, so we have:

z_{h_A} = Σ_{h_B, h_C} T_{h_A, h_B, h_C} x_{h_B} y_{h_C} .

Substituting the Kruskal form of T and considering the i-th dimension of z, we obtain

z_i = (e_i^⊤ U) ((V^⊤ x) ⊙ (W^⊤ y)) ,

where ⊙ indicates the Hadamard (element-wise) product; e_i ∈ R^m is a one-hot vector that selects the i-th row of U. We have padded U with zeros such that U ∈ R^{m×d} and the last m − n rows are all zeros. Thus

z = U ((V^⊤ x) ⊙ (W^⊤ y)) ,

and accordingly,

s_{i,j} = Σ_{k=i}^{j−1} U ((V^⊤ s_{i,k}) ⊙ (W^⊤ s_{k+1,j})) .

This reduces the cost of each step from cubic to at most quadratic in m, and the computation resembles a recursive neural network (Socher et al., 2013) if we treat inside score vectors as span embeddings.
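A quick NumPy check (our own sketch, not from the paper) that the decomposed computation z = U((V^T x) ⊙ (W^T y)) agrees with the naive O(m^3) contraction against a materialized tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 4
U, V, W = rng.random((m, d)), rng.random((m, d)), rng.random((m, d))
x, y = rng.random(m), rng.random(m)

# Naive: reconstruct the full tensor T = sum_l u_l (x) v_l (x) w_l, then contract.
T = np.einsum('al,bl,cl->abc', U, V, W)      # O(m^3) space
z_naive = np.einsum('abc,b,c->a', T, x, y)   # O(m^3) time

# Decomposed: never materialize T.
z_fast = U @ ((V.T @ x) * (W.T @ y))         # O(md) time

assert np.allclose(z_naive, z_fast)
```

The decomposed form touches only U, V, and W, which is why the full probability tensor never needs to exist in memory.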
One problem with TD-PCFGs is that, since we use three matrices U, V, and W to represent the tensor T of binary-rule probabilities, it is not obvious how to ensure that T is non-negative and properly normalized, i.e., that Σ_{j,k} T_{h_A,j,k} = 1 for a given left-hand-side symbol A. Simply reconstructing T from U, V, and W and then performing normalization would take O(m^3) time, thus defeating the purpose of TD-PCFGs. Our solution is to require that the three matrices be non-negative, with U row-normalized and V and W column-normalized (Shen et al., 2018b).
Theorem 1. Given non-negative matrices U ∈ R^{n×d} and V, W ∈ R^{m×d}, if U is row-normalized and V and W are column-normalized, then T = Σ_{l=1}^{d} u_l ⊗ v_l ⊗ w_l (where u_l, v_l, and w_l are the l-th columns of U, V, and W) is a non-negative tensor satisfying Σ_{h_B, h_C} T_{h_A, h_B, h_C} = 1 for every h_A; that is, T defines valid binary-rule probabilities.
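Theorem 1 can also be checked numerically; the following NumPy sketch (ours) constructs the three matrices as described and verifies non-negativity and normalization of the induced tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 3, 5, 4

# Non-negative matrices: row-normalize U; column-normalize V and W.
U = rng.random((n, d)); U /= U.sum(axis=1, keepdims=True)
V = rng.random((m, d)); V /= V.sum(axis=0, keepdims=True)
W = rng.random((m, d)); W /= W.sum(axis=0, keepdims=True)

T = np.einsum('al,bl,cl->abc', U, V, W)
assert np.all(T >= 0)
assert np.allclose(T.sum(axis=(1, 2)), 1.0)   # each A-slice is a distribution over (B, C)
```

The check mirrors the proof idea: summing T over its last two axes collapses each column of V and W to 1, leaving the row sums of U, which are 1 by construction.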

Neural parameterization of TD-PCFGs
We use neural parameterization for TD-PCFGs as it has demonstrated its effectiveness in inducing PCFGs (Kim et al., 2019a). In a neurally parameterized TD-PCFG, the original TD-PCFG parameters are generated by neural networks rather than being learned directly; the parameters of the neural networks are thus the parameters to be optimized. This modeling approach breaks the parameter-number limit of the original TD-PCFG, so we can control the total number of parameters flexibly. When the total number of symbols is small, we can over-parameterize the model, as over-parameterization has been shown to ease optimization (Arora et al., 2018; Xu et al., 2018; Du et al., 2019). On the other hand, when the total number of symbols is huge, we can decrease the number of parameters to save GPU memory and speed up training. Specifically, we use neural networks to generate the following parameters of a TD-PCFG: the matrices U, V, and W, the preterminal-rule probabilities Q, and the start-rule probabilities r. The resulting model is referred to as a neural PCFG based on tensor decomposition (TN-PCFG).
We start with the neural parameterization of U ∈ R^{n×d} and V, W ∈ R^{m×d}. We use shared symbol embeddings E_s ∈ R^{m×k} (k is the symbol embedding dimension), in which each row is the embedding of a nonterminal or preterminal. We first compute an unnormalized Ũ by applying a neural network f_u(⋅) to the symbol embeddings:

Ũ = f_u(E_s) ,

where the learnable parameters of f_u(⋅) include a weight matrix in R^{k×d}; for simplicity, we omit the learnable bias terms. We compute unnormalized Ṽ and W̃ in a similar way. Note that only E_s is shared in computing the three unnormalized matrices. Then we apply the Softmax activation function to each row of Ũ and to each column of Ṽ and W̃, and obtain the normalized U, V, and W. For preterminal-rule probabilities Q ∈ R^{p×q} and start-rule probabilities r ∈ R^n, we follow Kim et al. (2019a) and define them as

π_{T→w} ∝ exp(u_w^⊤ f_t(w_T)) ,   π_{S→A} ∝ exp(u_A^⊤ f_s(w_S)) ,

where the w's and u's are symbol embeddings; f_s(⋅) and f_t(⋅) are neural networks that encode the input into a vector (see details in Kim et al. (2019a)). Note that the symbol embeddings are not shared between preterminal rules and start rules.
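A minimal NumPy sketch of this parameterization (our own simplification: a single linear layer with ReLU stands in for each network f(·), and names such as Wu, Wv, Ww are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, d = 3, 6, 8, 5
m = n + p

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

E = rng.normal(size=(m, k))          # shared symbol embeddings E_s
Wu = rng.normal(size=(k, d))         # stand-in weights for f_u, f_v, f_w
Wv = rng.normal(size=(k, d))
Ww = rng.normal(size=(k, d))

# Only the first n rows of E (the nonterminals) feed U; softmax over rows.
U = softmax(np.maximum(E[:n] @ Wu, 0.0), axis=1)
# V and W use all m symbols; softmax over columns.
V = softmax(np.maximum(E @ Wv, 0.0), axis=0)
W = softmax(np.maximum(E @ Ww, 0.0), axis=0)

assert np.allclose(U.sum(axis=1), 1.0)
assert np.allclose(V.sum(axis=0), 1.0)
```

By construction the outputs satisfy the conditions of Theorem 1, so the generated matrices always define a valid TD-PCFG regardless of the network weights.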

Parsing with TD-PCFGs
Parsing seeks the most probable parse t⋆ among all the parses T_G(w) of a sentence w:

t⋆ = argmax_{t ∈ T_G(w)} p(t | w) .

The CYK algorithm can be used to solve this problem exactly: it first computes the score of the most likely parse, and then automatic differentiation is applied to recover the best tree structure t⋆ (Eisner, 2016; Rush, 2020). This, however, relies on the original probability tensor T and is incompatible with our decomposed representation. If we reconstruct T from U, V, and W and then perform CYK, the resulting time and space complexity degrades to O(m^3 l^3) and becomes unaffordable when m is large. We therefore resort to Minimum Bayes-Risk (MBR) style decoding, since we can compute the inside probabilities efficiently. Our decoding method consists of two stages. The first stage computes the conditional probability of a substring w_{i,j} being a constituent in a given sentence w (a.k.a. the posterior of each span being a constituent):

p(w_{i,j} | w) = Σ_{t ∈ T_G(w)} 1[w_{i,j} ∈ t] · p(t | w) .

We can estimate the posteriors efficiently by using automatic differentiation after obtaining all the inside probabilities. This has the same time complexity as our improved inside algorithm, which is O(dl^3 + mdl^2). The second stage uses the CYK algorithm to find the parse tree that has the highest expected number of constituents (Smith and Eisner, 2006):

t⋆ = argmax_{t ∈ T_G(w)} Σ_{w_{i,j} ∈ t} p(w_{i,j} | w) .

Data

We evaluate TN-PCFGs across ten languages. We use the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1994) for English, the Penn Chinese Treebank 5.1 (CTB) (Xue et al., 2005) for Chinese, and the SPMRL dataset (Seddah et al., 2014) for the other eight morphologically rich languages. We use a unified data preprocessing pipeline provided by Zhao and Titov (2021). The same pipeline has been used in several recent papers (Shen et al., 2018a, 2019; Kim et al., 2019a; Zhao and Titov, 2020). Specifically, for every treebank, punctuation is removed from all data splits and the top 10,000 most frequent words in the training data are used as the vocabulary.
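The second decoding stage, the MBR-style CYK over precomputed span posteriors, can be sketched as follows; this is our own illustrative implementation, and the names post and mbr_tree are hypothetical.

```python
import numpy as np

def mbr_tree(post):
    """CYK search for the binary tree with the highest expected number of
    constituents (Smith and Eisner, 2006).

    post[i, j]: posterior that span (i, j) (inclusive) is a constituent.
    Returns the sorted list of multi-word spans of the best tree.
    """
    l = len(post)
    best = np.zeros((l, l))                   # best expected-constituent score per span
    split = np.zeros((l, l), dtype=int)       # argmax split point per span
    for width in range(1, l):
        for i in range(l - width):
            j = i + width
            k = max(range(i, j), key=lambda s: best[i][s] + best[s + 1][j])
            split[i][j] = k
            best[i][j] = post[i][j] + best[i][k] + best[k + 1][j]
    spans, stack = [], [(0, l - 1)]           # backtrack the argmax splits
    while stack:
        i, j = stack.pop()
        if i < j:
            spans.append((i, j))
            k = split[i][j]
            stack += [(i, k), (k + 1, j)]
    return sorted(spans)
```

Note that this stage never touches the grammar: once the posteriors are in hand, the search is O(l^3) and independent of the symbol number m.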

Settings and hyperparameters
For baseline models we use the best configurations reported by the authors; for example, we use 30 nonterminals and 60 preterminals for N-PCFGs and C-PCFGs. We implement TN-PCFGs and reimplement N-PCFGs and C-PCFGs using automatic differentiation (Eisner, 2016), and we borrow the idea of Zhang et al. (2020) to batchify the inside algorithm. Inspired by Kim et al. (2019a), for TN-PCFGs we set n/p, the ratio of the number of nonterminals to the number of preterminals, to 1/2. For U ∈ R^{n×d} and V, W ∈ R^{m×d} we set d = p when there are more than 200 preterminals and d = 200 otherwise. The symbol embedding dimension k is set to 256. We optimize TN-PCFGs using the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.75, β2 = 0.999, and learning rate 0.001, with batch size 4. We use the unit Gaussian distribution to initialize embedding parameters. We do not use the curriculum learning strategy that is used by Kim et al. (2019a). Early stopping is performed based on the perplexity on the development data, and the best model in each run is selected according to the same criterion. We tune model hyperparameters only on the development data of WSJ and use the same model configurations on the other treebanks. We report average sentence-level F1 scores as well as their biased standard deviations.

Experimental results
We evaluate our models mainly on WSJ (Sections 8.1-8.3). We first give an overview of model performance in Section 8.1 and then conduct an ablation study of TN-PCFGs in Section 8.2. We quantitatively and qualitatively analyze the constituent labels induced by TN-PCFGs in Section 8.3. In Section 8.4, we conduct a multilingual evaluation over nine additional languages.

Two notes on evaluation. First, Shi et al. (2020) suggest not using the gold parses of the development data for hyperparameter tuning and model selection in unsupervised parsing; here we still use the gold parses of the WSJ development set for the English experiments in order to conduct a fair comparison with previous work, and no gold parse is used in the experiments of any other language. Second, following Kim et al. (2019a), we remove all trivial spans (single-word spans and sentence-level spans), and sentence-level F1 means that we compute F1 for each sentence and then average over all sentences.

Main results
Our best TN-PCFG model uses 500 preterminals (p = 500). We compare it with a wide range of recent unsupervised parsing models (see the top section of Table 1). Since we use MBR decoding for TN-PCFGs, which produces higher F1 measures than CYK decoding (Goodman, 1996), for a fair comparison we also use MBR decoding for our reimplemented N-PCFGs and C-PCFGs (see the middle section of Table 1). We draw the following key observations from Table 1: (1) TN-PCFG (p = 500) achieves the best mean and max F1 scores; notably, it outperforms the strong baseline model C-PCFG by 1.4% mean F1. (2) Compared with TN-PCFG (p = 60), TN-PCFG (p = 500) brings a 6.3% mean F1 improvement, demonstrating the effectiveness of using more symbols.
In Table 1 we also show the results of Constituent test (CT) (Cao et al., 2020) and DIORA (Drozdov et al., 2019, 2020), two recent state-of-the-art approaches. However, our work is not directly comparable to these approaches: CT relies on a pretrained language model (RoBERTa) and DIORA relies on pretrained word embeddings (context-insensitive ELMo), whereas our model and the other approaches do not use pretrained word embeddings and instead learn word embeddings from scratch. We are also aware of URNNG (Kim et al., 2019b), which has a max F1 score of 45.4%, but it uses punctuation and hence is not directly comparable to the models listed in the table. Recall that the nonterminal number n is set to half of p.
We report the average running time per epoch and the parameter numbers of different models in Table 2. We can see that TN-PCFG (p = 500), which uses a much larger number of symbols, has even fewer parameters and is not significantly slower than N-PCFG. Figure 1 illustrates the change of F1 scores and perplexities as the number of nonterminals and preterminals increases. We can see that, as the symbol number increases, the perplexities decrease while F1 scores tend to increase.

Analysis on constituent labels
We analyze model performance by breaking down recall numbers by constituent label (Table 3), using the six most frequent constituent labels in the WSJ test data (NP, VP, PP, SBAR, ADJP, and ADVP). We first observe that the right-branching baseline remains competitive: it achieves the highest recall on VPs and SBARs. TN-PCFG (p = 500) displays relatively even performance across the six labels. Specifically, it performs best on NPs and PPs among all the labels, and it beats all the other models on ADJPs. Compared with TN-PCFG (p = 60), TN-PCFG (p = 500) yields the largest improvement on VPs (+19.5% recall), which are usually long (with an average length of 11) in comparison with the other types of constituents. As NPs and VPs cover about 54% of the total constituents in the WSJ test data, it is not surprising that models which are accurate on these labels have high F1 scores (e.g., C-PCFGs and TN-PCFG (p = 500)).
We further analyze the correspondence between the nonterminals of trained models and gold constituent labels. For each model, we look at all the correctly predicted constituents in the test set and estimate the empirical posterior distribution of nonterminals assigned to a constituent given the gold label of the constituent (see Figure 2). Compared with the other three models, in TN-PCFG (p = 500) the most frequent nonterminals are more likely to correspond to a single gold label. One possible explanation is that it contains many more nonterminals, so constituents of different labels are less likely to compete for the same nonterminal. Figure 2d (TN-PCFG (p = 500)) also illustrates that a gold label may correspond to multiple nonterminals. A natural question that follows is: do these nonterminals capture different subtypes of the gold label? We find this is indeed the case for some nonterminals. Take the gold label NP (noun phrase) as an example: while not all the nonterminals have a clear interpretation, we find that NT-3 corresponds to constituents representing company names; NT-99 corresponds to constituents containing a possessive affix (e.g., "'s" in "the market 's decline"); and NT-94 represents constituents preceded by an indefinite article. We further look into the gold label PP (prepositional phrase). Interestingly, NT-108, NT-175, and NT-218 roughly divide prepositional phrases into three groups, starting with 'with, by, from, to', 'in, on, for', and 'of', respectively. See the Appendix for more examples.

Multilingual evaluation
In order to understand the generalizability of TN-PCFGs to languages beyond English, we conduct a multilingual evaluation on CTB and SPMRL. We use the best model configurations obtained on the English development data and do not perform any further tuning on CTB and SPMRL. We compare TN-PCFGs with N-PCFGs and C-PCFGs, using MBR decoding by default. The results are shown in Table 4. In terms of the average F1 over the nine languages, all three models beat the trivial left- and right-branching baselines by a large margin, which suggests that they generalize well to languages beyond English. Among the three models, TN-PCFG (p = 500) fares best: it achieves the highest F1 score on six out of nine treebanks. On Swedish, N-PCFG is worse than the right-branching baseline (-13.4% F1), while TN-PCFG (p = 500) surpasses the right-branching baseline by 9.6% F1.

Discussions
In our experiments, we do not find it beneficial to use the compound trick (Kim et al., 2019a) in TN-PCFGs, which is commonly used in previous work on PCFG induction (Kim et al., 2019a; Zhao and Titov, 2020; Zhu et al., 2020). We speculate that the additional expressiveness brought by compound parameterization may not be necessary for a TN-PCFG with many symbols, which is already sufficiently expressive; on the other hand, compound parameterization makes learning harder when we use more symbols. We also find that neural parameterization and the choice of nonlinear activation function greatly influence performance. Without neural parameterization, TD-PCFGs achieve only around 30% S-F1 on WSJ, which is even worse than the right-branching baseline; activation functions other than ReLU (such as tanh and sigmoid) result in much worse performance. Why ReLU and neural parameterization are crucial in PCFG induction remains an interesting open question. When evaluating our model with a large number of symbols, we find that only a small fraction of the symbols are predicted in the parse trees (for example, when our model uses 250 nonterminals, only tens of them appear in the predicted parse trees of the test corpus). We expect that our models could benefit from regularization techniques such as state dropout (Chiu and Rush, 2020).

Conclusion
We have presented TD-PCFGs, a new parameterization form of PCFGs based on tensor decomposition. TD-PCFGs rely on Kruskal decomposition of the binary-rule probability tensor to reduce the computational complexity of PCFG representation and parsing from cubic to at most quadratic in the symbol number, which allows us to scale up TD-PCFGs to a much larger number of (nonterminal and preterminal) symbols. We further propose neurally parameterized TD-PCFGs (TN-PCFGs) and learn neural networks to produce the parameters of TD-PCFGs. On WSJ test data, TN-PCFGs outperform strong baseline models; we empirically show that using more nonterminal and preterminal symbols contributes to the high unsupervised parsing performance of TN-PCFGs. Our multilingual evaluation on nine additional languages further reveals the capability of TN-PCFGs to generalize to languages beyond English.