Unsupervised Learning of PCFGs with Normalizing Flow

Unsupervised PCFG inducers hypothesize sets of compact context-free rules as explanations for sentences. PCFG induction not only provides tools for low-resource languages, but also plays an important role in modeling language acquisition (Bannard et al., 2009; Abend et al., 2017). However, current PCFG induction models, using word tokens as input, are unable to incorporate semantics and morphology into induction, and may encounter issues of sparse vocabulary when facing morphologically rich languages. This paper describes a neural PCFG inducer which employs context embeddings (Peters et al., 2018) in a normalizing flow model (Dinh et al., 2015) to extend PCFG induction to use semantic and morphological information. A linguistically motivated similarity penalty and categorical distance constraints are imposed on the inducer as regularization. Experiments show that the PCFG induction model with normalizing flow produces grammars with state-of-the-art accuracy on a variety of different languages. Ablation further shows a positive effect of normalizing flow, context embeddings and the proposed regularizers.


Introduction
Unsupervised PCFG inducers (Jin et al., 2018b) automatically bracket sentences into nested spans, and label these spans with consistent, linguistically relevant syntactic categories, which may be useful in downstream applications or linguistic research on under-resourced languages. Their success also provides evidence for learnability of grammar in the absence of strong linguistic universals (MacWhinney and Bates, 1993; Plunkett and Wood, 2004; Bannard et al., 2009). However, current PCFG induction models, using word tokens as input, are unable to incorporate semantics and morphology into induction, and may encounter issues of sparse vocabulary when facing morphologically rich languages.
This paper describes a PCFG induction model which exploits recent advances in deep generative models and context embeddings to generalize over rare, morphologically rich forms. We contextualize a PCFG's terminal emission rules with context embeddings (Peters et al., 2018) as observations, in order to bring context and subword information into the model. Probabilities for these contextualized terminal emission rules are modeled by transforming distributions with normalizing flow (Rezende and Mohamed, 2015; Dinh et al., 2015; He et al., 2018). Through invertible transformations, flow models transform simple distributions (e.g. Gaussian) into complex and potentially multi-modal distributions over observation vectors. These improvements increase the expressivity of the induction model and give it the ability to generalize over rare words, while preserving the tractability of marginal likelihood computation so that inference is possible with marginal likelihood maximization.
Experiments described in this paper show that the model is able to achieve state-of-the-art or competitive results on multiple languages compared with existing PCFG induction and unlabeled tree induction models, especially on languages where complex morphology may cause induction models with discrete observations to succumb to data sparsity. Further analyses show (1) that the flow-based inducer is able to use morphological and semantic information in embeddings for grammar induction, (2) that the model produces consistent and meaningful labels at phrasal and lexical levels, and (3) that both the normalizing flow and the linguistically-motivated regularization terms make substantial improvements to parsing accuracy.

PCFGs with vector terminals
We first consider factoring the Chomsky normal form PCFG with $C$ non-terminal categories into two separate parts: binary-branching non-terminal expansion rule probabilities (which include the expansion rules generating the top node in the tree), and unary-branching terminal emission rule probabilities. Given a tree as a set $\tau$ of nodes $\eta$ undergoing non-terminal expansions $c_\eta \to c_{\eta 1}\, c_{\eta 2}$ (where $\eta \in \{1, 2\}^*$ is a Gorn address specifying a path of left or right branches from the root), and a set $\tau'$ of nodes $\eta$ undergoing terminal emissions $c_\eta \to x_\eta$ (where $x_\eta$ is an embedding for the word at node $\eta$), the marginal probability of a sentence $\sigma_i$ can be computed as:

$$P(\sigma_i) = \sum_{\tau, \tau'} \prod_{\eta \in \tau} P(c_\eta \to c_{\eta 1}\, c_{\eta 2} \mid c_\eta) \cdot \prod_{\eta \in \tau'} P(c_\eta \to x_\eta \mid c_\eta)$$

We first define a set of Bernoulli distributions that distribute probability mass between these two sets of rules:

$$P(\mathrm{Term} = 1 \mid c_\eta) = \frac{1}{1 + \exp(-\delta_{c_\eta}^\top d)}$$

where $c_\eta$ is a non-terminal category, $\delta_{c_\eta}$ is a Kronecker delta function (a vector with value one at index $c_\eta$ and zeros everywhere else), and $\delta_{c_\eta}^\top d$ is a parameter for the Bernoulli distribution of $c_\eta$ with $d \in \mathbb{R}^{C}$. Binary-branching non-terminal expansion rule probabilities for a non-terminal category $c_\eta$ are defined as:

$$P(c_\eta \to c_{\eta 1}\, c_{\eta 2} \mid c_\eta) = P(\mathrm{Term} = 0 \mid c_\eta) \cdot \mathrm{Multinomial}(\delta_{c_{\eta 1}} \otimes \delta_{c_{\eta 2}};\ \delta_{c_\eta}^\top N)$$

where $\otimes$ is a Kronecker product, $c_{\eta 1}$ is the category of the left child, $c_{\eta 2}$ is the category of the right child, and $\delta_{c_\eta}^\top N$ is a parameter vector for the multinomial distribution of the category $c_\eta$ with $N \in \mathbb{R}^{C \times C^2}$. The contextualized unary-branching terminal emission rule probabilities for a preterminal category $c_\eta$ are defined as:

$$P(c_\eta \to x_\eta \mid c_\eta) = P(\mathrm{Term} = 1 \mid c_\eta) \cdot f_{c_\eta}(x_\eta;\ \delta_{c_\eta}^\top L)$$

where the terminal at node $\eta$ is an observed word token, $x_\eta \in \mathbb{R}^{D}$ is the vectorial representation of that token, $f_{c_\eta}$ is a probability density or mass function, and $\delta_{c_\eta}^\top L$ is a parameter vector for the probability function of the category $c_\eta$. We can recover the multinomial PCFG formulation by setting $x_\eta$ to be a one-hot word representation and the probability function $f_{c_\eta}$ to be a multinomial distribution parameterized by $\delta_{c_\eta}^\top L$. We can also set $x_\eta$ to be a word embedding and $f_{c_\eta}$ to be a Gaussian distribution parameterized by $\delta_{c_\eta}^\top L$, giving us a PCFG with Gaussian emission.
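As a rough illustration of how the marginal probability stays tractable with vector terminals, the following minimal sketch runs the inside algorithm in log space with diagonal Gaussian emissions. It is not the authors' implementation: the softmax parameterization of the expansion rules (`N_logits`), the separate root distribution `log_root`, and all function names are assumptions made for the example.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)

def log_gaussian(x, mean, var):
    """Diagonal-Gaussian log density of x (D,) under each category; mean, var: (C, D)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def rule_log_probs(d, N_logits):
    """Split each category's probability mass between emission and expansion rules."""
    C = d.shape[0]
    log_term = -np.logaddexp(0.0, -d)      # log sigmoid(d)  = log P(Term=1 | c)
    log_expand = -np.logaddexp(0.0, d)     # log sigmoid(-d) = log P(Term=0 | c)
    # Assumed parameterization: softmax over the C*C child pairs of each parent.
    log_pairs = N_logits - log_sum_exp(N_logits, axis=1)[:, None]
    # Column b*C + c corresponds to the Kronecker product of delta_b and delta_c.
    return log_term, (log_expand[:, None] + log_pairs).reshape(C, C, C)

def inside_log_likelihood(X, d, N_logits, M, S, log_root):
    """Log marginal probability of one sentence given as embeddings X: (T, D)."""
    T, C = X.shape[0], d.shape[0]
    log_term, log_rule = rule_log_probs(d, N_logits)
    chart = np.full((T, T, C), -np.inf)
    for i in range(T):                     # width-1 spans: terminal emissions
        chart[i, i] = log_term + log_gaussian(X[i], M, S)
    for width in range(1, T):              # wider spans: binary expansions
        for i in range(T - width):
            j = i + width
            splits = [
                (log_rule + chart[i, k][None, :, None]
                 + chart[k + 1, j][None, None, :]).reshape(C, -1)
                for k in range(i, j)
            ]
            chart[i, j] = log_sum_exp(np.concatenate(splits, axis=1), axis=1)
    return log_sum_exp(chart[0, T - 1] + log_root, axis=0)

rng = np.random.default_rng(0)
C, D, T = 3, 4, 5
params = dict(
    d=rng.normal(size=C),
    N_logits=rng.normal(size=(C, C * C)),
    M=rng.normal(size=(C, D)),
    S=np.ones((C, D)),
    log_root=np.log(np.full(C, 1.0 / C)),
)
print(inside_log_likelihood(rng.normal(size=(T, D)), **params))
```

Swapping `log_gaussian` for any other per-category log density (for example a flow-transformed one) leaves the dynamic program unchanged, which is why richer emission models keep the marginal likelihood tractable.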
In order to incorporate more information into the induction model, context embeddings (Peters et al., 2018) can be used here for x η . The ELMo model combines learned word embeddings with character embeddings through CNN encoders, and composes contextualized embeddings with bidirectional LSTMs over the combined representations. The output from the BiLSTM contains subword, word and context information, and is used as the contextualized embedding for each word. While simple D-dimensional multivariate Gaussians can be used as the emission density f , it is unrealistic to assume that such embeddings follow simple Gaussian distributions. This work explores more complex transformed distributions using normalizing flows.

Normalizing flows
Flow models (Dinh et al., 2015, 2017; Kingma and Dhariwal, 2018) are a class of deep generative models that model unknown yet complex distributions by transforming the observation through a series of invertible transformations to create latent representations to be used with known distributions like Gaussians. For PCFG induction with embeddings, we first consider the generative story for the observed embeddings. Let $c_\eta$ be a category label at the node $\eta$. $M \in \mathbb{R}^{C \times D}$ is the matrix of the means of the Gaussian distributions for the latent representations, and $S \in \mathbb{R}^{C \times D}$ holds the diagonal covariances, with $L = [M; S]$. A probability model over trees may be defined as follows:

1. Sample an expansion decision $\mathrm{Term} \sim \mathrm{Bernoulli}\left(\frac{1}{1 + \exp(-\delta_{c_\eta}^\top d)}\right)$ to expand node $\eta$ with category $c_\eta$ to a lexical item ($\mathrm{Term} = 1$), or to a binary branch ($\mathrm{Term} = 0$).
2. If $\mathrm{Term} = 0$, sample a pair of child categories $c_{\eta 1}, c_{\eta 2}$ from the multinomial distribution parameterized by $\delta_{c_\eta}^\top N$.
3. If $\mathrm{Term} = 1$, sample a latent representation $h_\eta$ from the Gaussian distribution with mean $\delta_{c_\eta}^\top M$ and diagonal covariance $\delta_{c_\eta}^\top S$.
4. Again, if $\mathrm{Term} = 1$, transform the latent representation deterministically to generate the observed embedding $x_\eta$ for the token at $\eta$: $x_\eta = g(h_\eta)$.

In order to compute the likelihood given the observation, we need to invert this process. If we integrate over $h_\eta$ subject to $x_\eta = g(h_\eta)$, with the change-of-variable formula, we have:

$$P(x_\eta \mid c_\eta) = \int P(h_\eta \mid c_\eta)\, \delta(x_\eta - g(h_\eta))\, dh_\eta = P(h_\eta = g^{-1}(x_\eta) \mid c_\eta) \left| \det \frac{\partial g^{-1}(x_\eta)}{\partial x_\eta} \right|$$

where $\delta$ here is the Dirac delta function. This can be used to compute the likelihood of the observed embedding exactly given a category. In order to make this calculation tractable, the requirements on $g^{-1}$ are usually (1) that it is invertible, and (2) that computing the log Jacobian determinant is possible without calculating the full Jacobian matrix or its full determinant. Note that $g$ need not be explicitly constructed, as it is usually only used in generation, not in inference.
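The change-of-variable computation can be checked on a toy one-dimensional case. The sketch below assumes a latent Gaussian and the invertible map g(h) = sinh(h), chosen purely for illustration (it is not the transformation used in this paper), and verifies numerically that the resulting density over observations still integrates to one.

```python
import numpy as np

def log_normal(h, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((h - mean) / std) ** 2

def log_density_x(x, mean, std):
    """log p(x) = log p_h(g^{-1}(x)) + log |d g^{-1}(x) / dx| for the toy g = sinh."""
    h = np.arcsinh(x)                       # g^{-1}(x)
    log_abs_det = -0.5 * np.log1p(x ** 2)   # d arcsinh(x)/dx = 1 / sqrt(1 + x^2)
    return log_normal(h, mean, std) + log_abs_det

# Riemann sum over a wide grid: the transformed density should have total mass ~1.
xs = np.linspace(-2000.0, 2000.0, 400001)
dx = xs[1] - xs[0]
total_mass = np.exp(log_density_x(xs, mean=0.5, std=1.0)).sum() * dx
print(total_mass)
```

Dropping the `log_abs_det` term would leave an unnormalized function of x, which is why the Jacobian determinant has to be cheap to compute for the likelihood to be usable in training.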
Many invertible functions have been proposed that can be used as g −1 . The volume-preserving invertible transformation was first proposed by Dinh et al. (2015) in the NICE model and later used in unsupervised learning (He et al., 2018). Because of the volume-preserving property, the log Jacobian determinant is always 0. This property may allow the structural features of the original embedding space to be better preserved than with other, less restrictive, invertible functions.
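The invertible transformation $g^{-1}$ consists of $I$ stacked-up coupling layers. The input $x$ to it is divided into two equal parts $h^{(0)}_1, h^{(0)}_2$:

$$x = [h^{(0)}_1; h^{(0)}_2]$$

and the coupling layers in $g^{-1}$ transform the two parts at alternating layers. In the additive coupling form of NICE, one part is copied and the other is shifted by a function of the copied part:

$$h^{(i)}_1 = h^{(i-1)}_1, \qquad h^{(i)}_2 = h^{(i-1)}_2 + q^{(i)}(h^{(i-1)}_1)$$

with the roles of the two parts swapped at the next layer. The volume-preserving restriction is removed in the coupling layer in the Real NVP model (Dinh et al., 2017), in which the coupling layers transform the inputs as follows, writing $q^{(i)}_s$ and $q^{(i)}_t$ for the scale and translation functions of layer $i$:

$$h^{(i)}_1 = h^{(i-1)}_1, \qquad h^{(i)}_2 = h^{(i-1)}_2 \odot \exp\left(q^{(i)}_s(h^{(i-1)}_1)\right) + q^{(i)}_t(h^{(i-1)}_1)$$

where $\odot$ is a Hadamard product. All $q : \mathbb{R}^{D/2} \to \mathbb{R}^{D/2}$ in both models can be arbitrary nonlinear transformations. For Real NVP, the log Jacobian determinant is the sum of the scale outputs over all layers and dimensions:

$$\log \left| \det \frac{\partial g^{-1}(x)}{\partial x} \right| = \sum_{i=1}^{I} \sum_{j=1}^{D/2} \left[ q^{(i)}_s(\cdot) \right]_j$$

where each $q^{(i)}_s$ is evaluated on the part that layer $i$ leaves unchanged.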
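The two kinds of coupling layers can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's code: the tanh activation, the weight scales, and the helper names (`make_net`, `nice_inverse`, `nice_forward`, `realnvp_inverse`) are assumptions; only the additive and affine coupling structure follows the NICE and Real NVP definitions.

```python
import numpy as np

def make_net(rng, dim, hidden):
    """One-hidden-layer feed-forward net R^dim -> R^dim (tanh is an assumed activation)."""
    W1, b1 = rng.normal(scale=dim ** -0.5, size=(dim, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(scale=hidden ** -0.5, size=(hidden, dim)), np.zeros(dim)
    return lambda h: np.tanh(h @ W1 + b1) @ W2 + b2

def nice_inverse(x, nets):
    """g^{-1} with additive coupling; volume preserving, so log|det J| = 0."""
    h1, h2 = np.split(x, 2, axis=-1)
    for i, q in enumerate(nets):
        if i % 2 == 0:
            h2 = h2 + q(h1)      # update the second half from the first
        else:
            h1 = h1 + q(h2)      # alternate: update the first half from the second
    return np.concatenate([h1, h2], axis=-1)

def nice_forward(h, nets):
    """g: undo the coupling layers in reverse order by subtracting the same shifts."""
    h1, h2 = np.split(h, 2, axis=-1)
    for i in reversed(range(len(nets))):
        if i % 2 == 0:
            h2 = h2 - nets[i](h1)
        else:
            h1 = h1 - nets[i](h2)
    return np.concatenate([h1, h2], axis=-1)

def realnvp_inverse(x, scale_nets, shift_nets):
    """g^{-1} with affine coupling; returns the latent vector and log|det J|."""
    h1, h2 = np.split(x, 2, axis=-1)
    log_det = 0.0
    for i, (s, t) in enumerate(zip(scale_nets, shift_nets)):
        if i % 2 == 0:
            a = s(h1)
            h2 = h2 * np.exp(a) + t(h1)
        else:
            a = s(h2)
            h1 = h1 * np.exp(a) + t(h2)
        log_det = log_det + a.sum(axis=-1)   # each layer's Jacobian is triangular
    return np.concatenate([h1, h2], axis=-1), log_det

rng = np.random.default_rng(0)
D = 4
nets = [make_net(rng, D // 2, 8) for _ in range(4)]
x = rng.normal(size=D)
print(np.allclose(nice_forward(nice_inverse(x, nets), nets), x))
```

Because each layer only shifts (or scales and shifts) one half using a function of the other half, inversion never requires inverting the networks themselves, and the Jacobian of each layer is triangular.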

Regularization
In order to avoid undesirable yet possible grammars, we impose two linguistically-motivated regularization terms onto the model. First, for the emission parameters, we want to discourage the model from finding a solution in which all words are equally likely to be generated by any category, so we impose a categorical distance term on the model to encourage the rows of M to be far apart.
Second, the flow models can learn arbitrary transformations over the pretrained context embeddings. Because each token in the corpus has an embedding, the flow models may learn transformations that cue off arbitrary information in those embeddings, effectively making changes to observations. A Euclidean distance penalty is therefore put between the output of the flow transformation g −1 (x η ) and the input embedding x η to penalize the output drifting too far from the input embedding.
The final objective to maximize combines the marginal log likelihood of a minibatch of sentences σ with these two regularization terms, weighted by λ 1 and λ 2 respectively; in it, a, b, c, d, e are all category labels and ‖·‖ n is the n-norm.
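How the two regularizers might enter the objective can be sketched as follows. The exact form of each term is not recoverable from the text above, so this is only one plausible reading: the pairwise Euclidean distance between rows of M, the per-token Euclidean drift, the way they are summed, and the function names are all assumptions.

```python
import numpy as np

def category_distance(M):
    """Sum of Euclidean distances between all unordered pairs of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).sum() / 2.0   # each pair counted once

def drift_penalty(X, H):
    """Sum over tokens of the Euclidean distance between x and g^{-1}(x)."""
    return np.linalg.norm(H - X, axis=-1).sum()

def regularized_objective(log_likelihood, M, X, H, lam1=1e-4, lam2=10.0):
    """Objective to maximize: reward spread-out means, penalize drift (assumed signs)."""
    return log_likelihood + lam1 * category_distance(M) - lam2 * drift_penalty(X, H)

M = np.array([[0.0, 0.0], [3.0, 4.0]])   # two category means, distance 5 apart
X = np.zeros((1, 2))                     # one observed embedding
H = np.array([[3.0, 4.0]])               # its flow output, drifted by distance 5
print(regularized_objective(-10.0, M, X, H))
```

The default weights mirror the values of λ 1 and λ 2 reported in the experimental setup; with a much larger weight on drift, the flow is kept close to the identity unless moving the embeddings buys a substantial gain in likelihood.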

Experiments
We report results of labeled parsing evaluation and unlabeled parsing evaluation against existing grammar induction and unsupervised parsing models. We evaluate our models on full English (The Penn Treebank; Marcus et al., 1993), Chinese (The Chinese Treebank 5.0; Xia et al., 2000) and German (NEGRA 2.0; Skut et al., 1998) constituency treebanks and the 20-or-fewer-word subsets for labeled parsing performance. 3 For unlabeled parsing evaluation, we first report results on a set of languages with complex morphology chosen prior to evaluation. This set includes Czech and Russian, which are fusional languages, Korean and Uyghur, which are agglutinative languages, and Finnish, which has elements of both types. Dependency trees from the Universal Dependency Treebank (Nivre et al., 2016) of these languages are converted into constituency trees (Collins et al., 1999) by keeping constituents that have a single incoming and no outgoing dependency arc. For example, constituents like noun phrases that are kept in conversion may only have one incoming arc from the main verb, and no outgoing arc to any modifier. Each dataset has 15,000 sentences randomly sampled from the dependency treebank (if the treebank has enough sentences), or is augmented with sentences randomly sampled from Wikipedia (if the treebank has fewer sentences). Finally, unlabeled parsing experiments on the three constituency treebanks are reported, one following Jin et al. (2018a) and one following Htut et al. (2018).
The hyperparameters of the model for all experiments are tuned on the Brown Corpus portion of the Penn Treebank. We set the number of categories C to 30, the categorical distance constraint strength λ 1 to 0.0001, and the drifting penalty λ 2 to 10. Function g −1 is set to have 4 coupling layers with q (i) being a feed-forward network with one hidden layer for both NICE and Real NVP, following He et al. (2018). We train the system until the marginal likelihood over the whole training set starts to oscillate, around 10,000 batches for smaller corpora and around 20,000 for larger corpora. Because the chart of the inside algorithm grows quadratically with the length of the sentences, the batch size for training gets quadratically smaller from 400 to 1 as sentences get longer. We use the Adam optimizer (Kingma and Ba, 2015), initialized with learning rates of 0.1 for d and N, and 0.001 for L and the parameters in g −1 . Means and standard deviations of evaluation metrics are reported in tables over 10 runs of the proposed system.
3 WSJ20test is the second half of WSJ20.
We use ELMo embeddings (Peters et al., 2018) with 1024 dimensions from averaging representations from two BiLSTM layers and the word encoder in ELMo for all languages (Che et al., 2018). These embeddings are each trained with 20 million words from Wikipedia and Common Crawl. We initialize d and N with multinomials drawn from a Dirichlet distribution with 0.2 as the concentration parameter, following PCFG induction work with Bayesian models (Jin et al., 2018b). We assign the same diagonal variance matrix to all latent Gaussian distributions, calculated empirically from embeddings from 5000 randomly sampled sentences. M is initialized with the empirical mean of the same sampled embeddings, but with random Gaussian noise added to each row. The parameters of the normalizing flow g −1 are initialized from a uniform distribution with 0 mean and a standard deviation of $\sqrt{1/D}$.
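The initialization described above could look roughly like the sketch below. Several details are assumptions not stated in the text: the scale of the Gaussian noise added to M (`noise_std`), the conversion of a two-outcome Dirichlet draw into a logit for d, the clipping constant, and the use of a single weight matrix `W` to stand in for the flow parameters.

```python
import numpy as np

def init_parameters(embeddings, C, hidden, rng, concentration=0.2, noise_std=0.1):
    """Initialize d, N, M, S and one flow weight matrix from sampled embeddings (n, D)."""
    D = embeddings.shape[1]
    # Bernoulli parameters drawn as 2-outcome multinomials, stored as logits (assumed).
    p_term = np.clip(rng.dirichlet([concentration] * 2, size=C)[:, 0], 1e-6, 1 - 1e-6)
    d = np.log(p_term) - np.log1p(-p_term)
    # One multinomial over the C*C child pairs per parent category.
    N = rng.dirichlet(np.full(C * C, concentration), size=C)
    # The same empirical diagonal variance is shared by every latent Gaussian.
    S = np.tile(embeddings.var(axis=0), (C, 1))
    # Empirical mean plus row-wise Gaussian noise so that categories start apart.
    M = embeddings.mean(axis=0) + rng.normal(scale=noise_std, size=(C, D))
    # Uniform on [-b, b] has standard deviation b / sqrt(3); b = sqrt(3 / D) gives sqrt(1 / D).
    bound = np.sqrt(3.0 / D)
    W = rng.uniform(-bound, bound, size=(D // 2, hidden))
    return d, N, M, S, W

rng = np.random.default_rng(0)
sampled = rng.normal(size=(500, 64))          # stand-in for sampled context embeddings
d, N, M, S, W = init_parameters(sampled, C=5, hidden=64, rng=rng)
print(d.shape, N.shape, M.shape, S.shape, W.std())
```

A small Dirichlet concentration such as 0.2 yields peaked initial rule distributions, so different categories start out preferring different expansions.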
For labeled constituency evaluation, we compare against the state-of-the-art PCFG induction system DIMI (D2K15: depth bounded at 2 and 15 categories; Jin et al., 2018a) which takes word tokens as input and produces labeled trees. For unlabeled constituency evaluation, results from other unsupervised systems are used for comparison, including CCL (Seginer, 2007), UPPARSE (Ponvert et al., 2011), PRPN (Shen et al., 2018), as well as systems which use gold part-of-speech tags: DMV+CCM (Klein and Manning, 2002) and UML-DOP (Bod, 2006).

Table 1: Recall-V-Measure scores for labeled grammar induction models trained on the listed treebanks with punctuation. For all tables, µ (σ) means the mean (standard deviation) of the reported scores.

Labeled parsing evaluation
Metric: Labeled trees induced by DIMI (Jin et al., 2018a) and the flow-based system are evaluated on six different datasets. In this evaluation, predicted labels of induced constituents that are in gold trees are compared against gold labels of these constituents 6 using V-Measure (Rosenberg and Hirschberg, 2007). Recall of the induced trees is used to weight these V-Measure scores. The final Recall-V-Measure (RVM) score is computed as the product of these two measures. RVM can be maximized when gold constituents are included in induced trees and their clustering is consistent with gold annotation. RVM is equal to unlabeled recall when the matching constituents have the same clustering of labels as the gold annotation.
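The metric can be made concrete with a small sketch. It assumes trees are given as dictionaries from spans to labels and that scores are computed over a single pooled set of spans; the function names and the data layout are illustrative, and V-Measure is implemented directly from its definition as the harmonic mean of homogeneity and completeness.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Entropy of a label sequence, in nats."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) for two aligned label sequences."""
    n = len(a)
    joint, marginal = Counter(zip(a, b)), Counter(b)
    return -sum(c / n * log(c / marginal[y]) for (_, y), c in joint.items())

def v_measure(gold, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_pred = entropy(gold), entropy(pred)
    hom = 1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, pred) / h_gold
    com = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, gold) / h_pred
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

def recall_v_measure(gold_spans, pred_spans):
    """RVM = unlabeled recall x V-Measure over the gold spans found in the prediction."""
    matched = [span for span in gold_spans if span in pred_spans]
    if not matched:
        return 0.0
    recall = len(matched) / len(gold_spans)
    gold = [gold_spans[s] for s in matched]
    pred = [pred_spans[s] for s in matched]
    return recall * v_measure(gold, pred)

gold_tree = {(0, 5): "S", (0, 2): "NP", (3, 5): "VP", (4, 5): "NP"}
pred_tree = {(0, 5): 1, (0, 2): 2, (3, 5): 3, (1, 2): 2}
print(recall_v_measure(gold_tree, pred_tree))
```

In this toy example three of the four gold spans are recovered and their induced labels cluster them exactly as the gold labels do, so the score reduces to the unlabeled recall, as the text describes.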
Results: Left- and right-branching baselines are constructed by assigning 21 random labels (there are 21 phrase-level tags in the Penn Treebank II tag set) to constituents in purely left- and right-branching trees. However, both branching baselines perform poorly in this evaluation, because there is no straightforward way to assign labels to constituent spans that corresponds to how gold labels are organized. VM scores for both baselines are close to 0, leading to RVM scores close to 0. Table 1 shows RVM scores for both the DIMI system and the flow-based system. For the labeled grammar induction systems, results show that the flow-based model outperforms DIMI on two of the three test datasets. Table 3 shows only the performance of the systems on bracketing. Although DIMI performs much better than the flow-based system in terms of bracketing F1 on WSJ20test, the flow-based system's average RVM is much closer to that of DIMI, which indicates that the flow-based system assigns more consistent labels to constituents than DIMI. On CTB20 and NEGRA20, where the bracketing performance of the flow-based system is better, this system outperforms DIMI by a large margin on RVM. Also, runs with the highest performance on bracketing are not the highest on RVM in general, showing that for labeled induction models, bracketing accuracy may be traded for labeling accuracy.
6 The maximal projection category is used when a span is labeled with several categories in the gold annotation. All functional tags are removed.
Confusion matrix: Figure 1 shows the gold constituent recall on NEGRA20 for the two labeled grammar induction systems. We show 5 main phrasal categories in gold annotation and in a run of predicted trees. Grammars from DIMI are prone to category collapse in which only a few categories are active as non-terminals. Figure 1a shows that categories 8 and 3 are the main active categories containing the majority of all constituents, with category 8 covering 78% of all S categories, 23% of NPs, and many others. In Figure 1b, the clear diagonal pattern for the flow-based model shows that the gold categories do have separate corresponding predicted categories. For example, VP is almost exclusively in category 1 if it appears in the predicted trees and PP is predominantly in category 27. NP has a wider spread across predicted categories, but category 8 is mostly used to represent it.

Unlabeled parsing evaluation
We additionally perform three unlabeled parsing evaluations against baseline systems. The first experiment uses a set of dependency-derived treebanks in morphologically rich languages to examine how morphology is used by the proposed system. The second experiment induces grammars on the datasets used in Jin et al. (2018a), and the final experiment uses the WSJ, CTB and NEGRA datasets without any punctuation for evaluation against published results by Htut et al. (2018).
Morphologically rich languages: Table 2 shows unlabeled parsing performance on the morphologically rich languages described at the beginning of this section, compared against branching baselines and DIMI. There is a substantial performance improvement observed across all languages when context embeddings are used as observations. Korean and Uyghur both have very sparse vocabulary, leading to poor performance of the DIMI system.
Constituency treebanks: We also compare the flow-based system to published unlabeled parsing results from previous work. Table 3 shows the unlabeled parsing F1 scores for several grammar induction systems on the WSJ20test, CTB20 and NEGRA20 datasets reported in Jin et al. (2018a). Posterior inference on constituents (PIoC) proposed in Jin et al. (2018a) is also used with parse trees from 10 runs of the flow-based system. The flow-based system is able to produce more accurate trees on the CTB20 and NEGRA20 datasets despite not being depth-bounded. However, its performance is subpar on the WSJ20test dataset.
Finally, the flow-based model is compared against other unsupervised parsing models on the three full constituency treebanks and their 10-or-fewer-word subsets, trained on sentences without punctuation, following Htut et al. (2018). The results are shown in Table 4. First, the flow-based system performs better than reported results from all systems, using raw text only, on both NEGRA and CTB, showing that the system is able to accurately generate structure. Second, there is a smaller performance gap between the flow-based system and the best-performing one on WSJ than on WSJ10. The fact that the flow-based model underperforms on English may be due to the fact that the English vocabulary contains a relatively large number of high frequency words, which makes contexts for words similar, showing up as similarities between the context embeddings for different words. This confuses the model because it relies on the observed embeddings being distinct and representative for induction. Figure 2 shows average Euclidean distances for 50,000 pairs of ELMo embeddings of different words randomly sampled from each dataset. The averaged distance between the embeddings is positively correlated with the gain of the flow-based system over DIMI, indicating the importance of varied contexts for grammar induction.

Figure 2: Correlation between recall difference of the flow-based system and DIMI and the average distance between ELMo embeddings.
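The distance statistic behind Figure 2 can be sketched as below. The sampling scheme (uniform over token pairs, rejecting pairs that share a word form) and the function name are assumptions; the text only states that 50,000 pairs of embeddings of different words are sampled from each dataset.

```python
import numpy as np

def mean_pair_distance(embeddings, words, n_pairs, rng):
    """Average Euclidean distance over random pairs of tokens with different word forms.

    Assumes the corpus contains at least two distinct word forms; otherwise no pair
    can be accepted and the loop would not terminate.
    """
    total, count = 0.0, 0
    while count < n_pairs:
        i, j = rng.integers(len(words), size=2)
        if words[i] == words[j]:
            continue                       # skip pairs of the same word (or same token)
        total += np.linalg.norm(embeddings[i] - embeddings[j])
        count += 1
    return total / count

rng = np.random.default_rng(0)
words = ["the", "cat", "the", "sat"]
embeddings = rng.normal(size=(len(words), 8))   # stand-in for context embeddings
print(mean_pair_distance(embeddings, words, n_pairs=1000, rng=rng))
```

A low average under this statistic means that tokens of different words sit close together in embedding space, which is the situation the paragraph above suggests may make induction harder on English.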

Induced interpretable categories
PCFG induction systems usually create syntactic categories that correspond to coarse-grained linguistic classes like nouns and verbs using co-occurrence statistics. However, the flow-based system also creates classes that are morphological or semantic in nature. The ability of the system to use morphological and semantic information to help grammar induction is shown in Table 5. Grammars induced on Korean from the flow-based system are greatly improved over baselines which use words only as input. Korean is an agglutinative language with many morphemes per token, so approaches that treat tokens as words must address severe sparsity issues. As ELMo embeddings include subword information from Korean characters, they may contain information useful for understanding morphology: the nominative clitics 이 or 가 and the accusative clitics 을 or 를, for example, may encode strong biases towards a word token being a noun along with its case. Categories like 11 and 12 in Table 5 reliably capture nouns in the nominative and accusative cases, respectively, even though in both cases the marking clitic differs depending on whether the noun preceding it ends in a vowel or consonant. Similarly, category 3 shows noun-preceding adjectives, which in Korean are formed by verb stems plus ㄴ or 은, and the inducer is again able to cluster words with both endings together.

Table 5 (columns: Cat., Interp., Most common words)

For German, the cased articles also have similar endings. The dative articles usually end with -en or -em, and the genitive articles usually end with -er or -es. Having access to the subword information, the flow-based system is able to come up with these distinctions with no supervision, because the cases may provide important clues to relative positions of the following nouns to verbs or prepositions. Contextual information also helps greatly, seen here when the system distinguishes the genitive der in category 8 and the nominative or accusative der in category 20 in phrases like der(20) Pächter der(8) Junkerstube (the lessee of the junkerstube).
Finally, for languages like Chinese where there are few morphological markings, semantic information may help the system induce syntactic categories. Category 28 is a category of verbs related to cognition and expression, which also characteristically accepts sentential complements (Vendler, 1972; Fisher et al., 1991). Syntactic categories like these are not seen in systems inducing with words only. This indicates that the semantics of these verbs may play a role here, especially since Chinese has no complementizer to signal an upcoming sentential complement.
Table 6 shows the ablation and comparison experiments on NEGRA20. ELMo embeddings provide a large performance boost with the Gaussian emission model over both the multinomial emission model, which has no access to contextual and subword information, and the Gaussian emission model with fastText embeddings based on character n-grams (Joulin et al., 2016), showing that both context and subword information help grammar induction. The two linguistically-motivated regularization terms help the flow-based model perform even better. Most notably, the similarity penalty helps the flow models greatly by restricting the freedom that the flow models have to change the context embeddings, indicating that the information in context embeddings is valuable for induction. The Real NVP model produces higher data likelihood but its performance is lower than that of the NICE-based models, indicating that the volume-preserving property of NICE is important for preventing overfitting.

Related work
Earlier work on PCFG induction (Carroll and Charniak, 1992; Johnson et al., 2007; Liang et al., 2009; Tu, 2012) shows that directly inducing PCFGs from raw text is difficult. Recent work (Shain et al., 2016; Jin et al., 2018b,a) shows that inducing PCFGs from raw text is possible, and that cognitive constraints are useful for helping the induction model to find good grammars. Closely related to PCFG induction is the task of unsupervised constituency parsing from raw text where trees are unlabeled. Earlier work by Seginer (2007) and Ponvert et al. (2011) induces unlabeled trees and achieves good results. More recent work (Shen et al., 2018) utilizes complex neural architectures for unsupervised parsing and language modeling and also shows good results on English. Although unlabeled parsing evaluation is common, other work (Bisk and Hockenmaier, 2015) has argued for labeled parsing evaluation for grammar induction.
Early unsupervised dependency grammars and part-of-speech induction models (Klein and Manning, 2004;Christodoulopoulos and Steedman, 2010) have been similarly augmented with neural networks and word embeddings (Tran et al., 2016;Jiang et al., 2016). Neural networks provide flexible ways to parameterize distributions, and word embeddings (Mikolov et al., 2013;Pennington et al., 2014) allow these models to use semantic information in these distributed representations. Results show that these improvements produce more accurate dependencies and POS assignments, but these improvements have not been applied to PCFG induction.
Normalizing flows have been shown to be powerful models for complex densities (Dinh et al., 2015, 2017; Rezende and Mohamed, 2015; Papamakarios et al., 2017). He et al. (2018) showed improved performance on POS induction and dependency induction by incorporating normalizing flows into baseline models (Klein and Manning, 2004; Lin et al., 2015).

Conclusion
This work proposes a neural PCFG inducer which employs context embeddings (Peters et al., 2018) in a normalizing flow model (Dinh et al., 2015) to extend PCFG induction to use semantic and morphological information. Linguistically motivated similarity penalty and categorical distance constraints are also imposed on the inducer as regularization. Labeled and unlabeled evaluation shows that the PCFG induction model with normalizing flow and context embeddings produces grammars with state-of-the-art accuracy on a variety of different languages. Results show consistent and meaningful use of labels at phrasal and lexical levels by the flow-based model. Ablation further shows a positive effect of normalizing flow, context embeddings and proposed regularizers.