Variance of Average Surprisal: A Better Predictor for Quality of Grammar from Unsupervised PCFG Induction

In unsupervised grammar induction, data likelihood is known to be only weakly correlated with parsing accuracy, especially at convergence after multiple runs. In order to find a better indicator for quality of induced grammars, this paper correlates several linguistically- and psycholinguistically-motivated predictors to parsing accuracy on a large multilingual grammar induction evaluation data set. Results show that variance of average surprisal (VAS) better correlates with parsing accuracy than data likelihood and that using VAS instead of data likelihood for model selection provides a significant accuracy boost. Further evidence shows VAS to be a better candidate than data likelihood for predicting word order typology classification. Analyses show that VAS seems to separate content words from function words in natural language grammars, and to better arrange words with different frequencies into separate classes that are more consistent with linguistic theory.


Introduction
Unsupervised grammar induction models learn to produce hierarchical structures for strings of words. Previous work (Seginer, 2007; Ponvert et al., 2011; Shain et al., 2016; Jin et al., 2018b) shows that using data likelihood as both the objective for optimization and the criterion for model selection, either implicitly (in the case of Bayesian models) or explicitly (in the case of EM), gives good results on grammar induction. However, it is also known that data likelihood is only weakly correlated with parsing accuracy, especially at convergence (Smith, 2006; Johnson et al., 2007; Jin et al., 2018a). This weak correlation suggests that maximizing data likelihood at convergence may be non-optimal for model selection, and this non-optimality indicates that other constraints on learning may be at work in human acquisition. In this work, several linguistically- and psycholinguistically-motivated constraints related to syntax are explored as predictors of parsing accuracy for grammars learned by unsupervised induction (Jin et al., 2018a). Results show that variance of average surprisal (VAS) is better correlated with parsing accuracy of induced grammars than data likelihood. Using VAS for model selection at convergence also produces significantly higher parsing accuracy. Further evidence shows VAS to be a better candidate than data likelihood for predicting word order typology classification. Analyses show that VAS seems to separate content words from function words in natural language grammars, and seems to better arrange words with different frequencies into separate classes that are more consistent with linguistic theory.

Related work
Induction of PCFGs has previously been considered a difficult problem (Carroll and Charniak, 1992; Johnson et al., 2007; Liang et al., 2009; Tu, 2012). Earlier work attributed the lack of success for induction to a lack of correlation between parsing accuracy and data likelihood (Johnson et al., 2007), or to the likelihood function or the posterior being filled with weak local optima (Liang et al., 2009; Gimpel and Smith, 2012). Later work has shown that it is possible to induce PCFGs with useful labels from words alone (Shain et al., 2016; Jin et al., 2018b,a). Induction models of constituency grammars or trees usually use data likelihood as both the objective and the model selection criterion (Seginer, 2007; Johnson et al., 2007; Ponvert et al., 2011; Shen et al., 2018), but the weak correlation between data likelihood and parsing accuracy hints at the non-optimality of this practice (Smith, 2006; Headden et al., 2009; Jin et al., 2018a).
On the other hand, many linguistic and psycholinguistic theories propose constraints either as properties of natural language grammar or as constraints on human processing and acquisition. Chomsky (1965) proposes that grammars should favor fewer rules, which may be trimmed by the generalizability of the rules (Yang, 2017). Dryer (1992) argues that grammars with certain constituent ordering should produce trees with consistent branching tendencies, which is in contrast to theories that attribute constituent ordering to processing (Hawkins, 1994; Gibson, 1998). Rajkumar et al. (2016) and Jin et al. (2018b) show that grammars should generally control the maximal allowed stack depth. Yang (2013) observes that rules in a natural language grammar follow Zipf's law, just like words. Grammars may also contribute to the observation that the likelihood of each sentence tends to decrease as a monologue goes on (Keller, 2004; Levy and Jaeger, 2007).

Predictors
Motivated by these constraints, six accuracy predictors (data likelihood, right-branching score, rule complexity, average stack depth, Zipf likelihood ratio and variance of average surprisal) are evaluated as predictors of parsing accuracy over grammars from multiple runs of a PCFG inducer (Jin et al., 2018a). Variance of average surprisal, Zipf likelihood ratio and data likelihood are defined on the PCFG itself, and the other three are defined on Viterbi parses produced by the PCFG on the corpus.

Data likelihood
One of the most common induction and model selection criteria is data likelihood. Data likelihood (LL) refers to the marginal likelihood of a corpus given a PCFG, marginalizing out all trees:
$$\mathrm{LL}(\sigma) = P(\sigma \mid G) = \sum_{\tau \in T} P(\sigma, \tau \mid G)$$
where σ is a corpus and T is the set of all possible parse trees generated by a grammar G for σ. As it is usually the optimization objective, likelihood is expected to be positively correlated with parsing accuracy at convergence.
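The marginalization over trees can be illustrated with the inside algorithm. The following is a minimal sketch, assuming a CNF grammar stored as plain dictionaries and a single root category named "S"; the function names, the dictionary layout and the toy grammar are all hypothetical and are not taken from the induction model used in this paper. It returns a log likelihood, which is the usual choice for numerical stability.

```python
import math
from collections import defaultdict

def sentence_log_likelihood(words, binary_rules, lexical_rules, root="S"):
    """Inside algorithm: log P(words | G), summed over all CNF parse trees.

    binary_rules:  {(parent, left, right): probability}   (assumed layout)
    lexical_rules: {(category, word): probability}        (assumed layout)
    """
    n = len(words)
    # inside[(i, j)][c] = probability that category c derives words[i:j]
    inside = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for (cat, word), p in lexical_rules.items():
            if word == w:
                inside[(i, i + 1)][cat] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (parent, left, right), p in binary_rules.items():
                    l = inside[(i, k)].get(left, 0.0)
                    r = inside[(k, j)].get(right, 0.0)
                    if l and r:
                        inside[(i, j)][parent] += p * l * r
    total = inside[(0, n)].get(root, 0.0)
    return math.log(total) if total > 0 else float("-inf")

def corpus_log_likelihood(corpus, binary_rules, lexical_rules, root="S"):
    """Log likelihood of a corpus, treating sentences as independent."""
    return sum(sentence_log_likelihood(s, binary_rules, lexical_rules, root)
               for s in corpus)

# Toy grammar (hypothetical): S -> A A | A S | S A, and A -> "a"
BINARY = {("S", "A", "A"): 0.6, ("S", "A", "S"): 0.2, ("S", "S", "A"): 0.2}
LEXICAL = {("A", "a"): 1.0}
# "a a a" has two parses (left- and right-branching), each with
# probability 0.2 * 0.6 = 0.12, so the marginal sums both.
print(math.exp(sentence_log_likelihood(["a", "a", "a"], BINARY, LEXICAL)))
```

The triple loop over spans, split points and rules is cubic in sentence length, which is acceptable for a sketch but would normally be vectorized in practice.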

Right-branching score
Branching Direction Theory (BDT; Dryer, 1992) explains different patterns of word order among languages. It distinguishes 'verb patterners,' which are non-phrasal lexical categories, from 'object patterners,' which are phrasal categories. It predicts that VO languages tend towards right-branching structures and OV languages tend towards left-branching structures. Let |c_right → a b| be the number of right children of a parent expanding into two non-terminal categories in all parse trees, and |c_* → a b| be the total number of nodes that expand into two non-terminal categories. Then
$$\mathrm{RBS} = \frac{|c_{\mathrm{right}} \to a\ b|}{|c_{*} \to a\ b|}$$
is the right-branching score of the parse trees. A purely right-branching set of binary-branching trees yields an RBS of 1.0, and a purely left-branching set of binary-branching trees yields an RBS of 0.0. Previous work shows that right-branching baselines are accurate for a few languages (Seginer, 2007). BDT predicts that different word orders favor different branching directions.
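A right-branching score of this kind can be computed by a single traversal of the parses. The sketch below assumes unlabeled binary trees written as nested 2-tuples with words as strings, and it reads the definition as counting non-root binary-branching nodes, so that a purely right-branching tree scores 1.0 and a purely left-branching tree scores 0.0 as stated in the text. The tree encoding and the exclusion of root nodes are assumptions made for illustration.

```python
def right_branching_score(trees):
    """Fraction of non-root binary-branching nodes that are right children.

    Each tree is a nested 2-tuple (left, right); a leaf is a word string.
    Returns None when no tree contains a non-root branching node.
    """
    right = 0
    total = 0

    def visit(node, side):
        nonlocal right, total
        if isinstance(node, str):      # a word: does not branch
            return
        left_child, right_child = node
        if side is not None:           # skip the root, which has no side
            total += 1
            if side == "R":
                right += 1
        visit(left_child, "L")
        visit(right_child, "R")

    for tree in trees:
        visit(tree, None)
    return right / total if total else None

# Hypothetical example parses over four words
RIGHT_TREE = ("a", ("b", ("c", "d")))
LEFT_TREE = ((("a", "b"), "c"), "d")
print(right_branching_score([RIGHT_TREE]))
print(right_branching_score([LEFT_TREE]))
print(right_branching_score([RIGHT_TREE, LEFT_TREE]))
```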

Rule complexity
One of the evaluation metrics used in the generative linguistics tradition is the complexity of a grammar (Chomsky, 1965). Often the number of rules is used as a proxy measurement of how complex a proposed grammatical analysis is against some other reference grammatical analysis. According to this theory, fewer unique rules present in the Viterbi parses would indicate higher grammar quality.
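Counting unique rules in a set of Viterbi parses is straightforward. The following sketch assumes labeled CNF trees encoded as tuples, with `(category, word)` for preterminals and `(category, left, right)` for binary nodes. Whether lexical rules should be included in the count is not specified here, so the `include_lexical` flag is a hypothetical option.

```python
def unique_rules(trees, include_lexical=False):
    """Collect the set of distinct rules used in a list of labeled CNF trees.

    A preterminal node is (category, word); a binary node is
    (category, left_subtree, right_subtree).
    """
    rules = set()

    def visit(node):
        if len(node) == 2:                      # preterminal: (category, word)
            if include_lexical:
                rules.add((node[0], node[1]))
            return
        label, left, right = node
        rules.add((label, left[0], right[0]))   # parent -> left right
        visit(left)
        visit(right)

    for tree in trees:
        visit(tree)
    return rules

def rule_complexity(trees, include_lexical=False):
    """Number of distinct rules, used here as a proxy for grammar complexity."""
    return len(unique_rules(trees, include_lexical))

# Hypothetical Viterbi parses
PARSES = [
    ("S", ("A", "the"), ("B", "dog")),
    ("S", ("A", "a"), ("S", ("A", "the"), ("B", "cat"))),
]
print(rule_complexity(PARSES))
```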

Average stack depth
Embedding depth is a known limiting factor to human sentence processing (Chomsky and Miller, 1963; Wu, 2010; Rajkumar et al., 2016), and is shown to benefit unsupervised grammar induction (Noji and Johnson, 2016; Jin et al., 2018b). It is also evaluated in this work as a predictor of parsing accuracy, defined as the expected number of stack elements per sentence in a left-corner parser for the Viterbi parses. Theories such as that of Chomsky and Miller (1963) predict it to correlate negatively with parsing accuracy.

Zipf likelihood ratio
The distribution of words in a corpus is known to follow Zipf's law (Zipf, 1935), in which the frequency of a word is inversely proportional to its frequency rank. Counts of syntactic rules in annotated corpora also follow this law (Yang, 2013).
Motivated by this observation, experiments in this work also evaluate expected counts of all possible rules, and compute the ratio (Zipf R) between the likelihood that the rules are generated by a power law model and the likelihood that they are generated by a lognormal model of which the mean µ must be positive (Clauset et al., 2009). The higher the ratio, the better the fit the power law model provides to the rule counts. Zipfian observations predict that this ratio should be positively correlated with parsing accuracy.

Variance of average surprisal
Finally, languages may have other interesting properties that are not identified by maximizing the likelihood of the corpus. For example, languages often distinguish function words from content words and assign them distinct categories. If grammars assign very small sets of high frequency words to a few function-word-like categories, this will increase the difference in likelihood between sentences consisting of mostly these function words and sentences with more modifiers and other content words. The magnitude of this difference can be measured using variance of average sentential surprisal (VAS):
$$\mathrm{VAS}(\sigma) = \mathrm{Var}_{i \in 1..N}\left[ -\frac{\log P(\sigma_i \mid G)}{|\sigma_i|} \right]$$
where N is the number of sentences in the corpus, σ_i is the i-th sentence, and |σ_i| is its length in words. Because sentences in larger corpora contain different numbers of function words, VAS is predicted to be high when the distinction between predicted function words and predicted content words in the induced grammar aligns with human judgments, indicating that VAS should be positively correlated with parsing accuracy.
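Given per-sentence log probabilities under an induced grammar, VAS reduces to a few lines. This is a minimal sketch, assuming natural-log sentence probabilities and sentence lengths in words are already available; the function names are hypothetical, and the use of the population variance (dividing by N) rather than the sample variance is an assumption.

```python
import statistics

def average_surprisal(log_prob, length):
    """Per-word surprisal of one sentence: -log P(sentence) / length."""
    return -log_prob / length

def variance_of_average_surprisal(sentence_log_probs, sentence_lengths):
    """Variance, across sentences, of the per-word surprisal.

    Population variance is an assumption; see the note below the block.
    """
    per_word = [average_surprisal(lp, n)
                for lp, n in zip(sentence_log_probs, sentence_lengths)]
    return statistics.pvariance(per_word)

# Hypothetical values: two 5-word sentences with log probabilities -10 and -20
# have per-word surprisals 2.0 and 4.0.
print(variance_of_average_surprisal([-10.0, -20.0], [5, 5]))
```

When VAS is only used to rank runs on the same corpus, the choice between dividing by N and by N − 1 rescales every run by the same constant and does not change the ranking.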

Dataset
The grammar accuracy predictors described above are evaluated on multiple languages using corpora annotated with constituents (Xia et al., 2000; Marcus et al., 1993; Alastair et al., 2018) and corpora annotated with dependencies (Nivre et al., 2016) which are converted to constituents (Collins et al., 1999). An example is shown in Figure 1. These evaluations use corpora with at least 2,000 annotated sentences, excluding all sentences with nonprojective dependency graphs.
Each induction run uses approximately 15,000 sentences randomly sampled from each language corpus. Languages with fewer than 15,000 annotated sentences are augmented with sentences sampled from Wikipedia (Zeman et al., 2017).
Evaluations initially screen predictors on a development partition consisting of 12 languages from 12 language subgroups covering language families including Indo-European, Uralic, Korean, Turkic, Sino-Tibetan and Afro-Asiatic. Significance tests use a separate test partition consisting of 25 languages which are different from those in the development partition, additionally covering the Japanese, Austronesian and Austro-Asiatic language families.

Model
These evaluations use the Bayesian PCFG induction model from Jin et al. (2018a), the objective function of which can be considered to be data likelihood. However, the results for model selection reported in this paper are specific neither to PCFG induction nor to the objective function used in induction. These experiments can be done with PCFGs randomly sampled from any distribution, but the fact that maximizing data likelihood as the objective can give better models than arbitrary random models ensures that evaluations are tractable and meaningful.
This model defines a Chomsky normal form (CNF) PCFG as a matrix G of binary rule probabilities, which is first drawn from a Dirichlet prior with a concentration parameter β:
$$G \sim \mathrm{Dirichlet}(\beta)$$
Trees for sentences 1..N are then generated by drawing from the PCFG:
$$\tau_{1..N} \sim \mathrm{PCFG}(G)$$
Specifically, each tree τ is a set {τ_ε, τ_1, τ_2, τ_11, τ_12, τ_21, ...} of category node labels τ_η, where η ∈ {1, 2}* defines a path of left or right branches from the root to that node. Category labels for every pair of left and right children τ_η1, τ_η2 are drawn from a multinomial distribution defined by the grammar G and the category of the parent τ_η:
$$\tau_{\eta 1}, \tau_{\eta 2} \sim \mathrm{Multinomial}(\delta_{\tau_\eta}^{\top} G)$$
where δ_x is a Kronecker delta function equal to 1 at value x and 0 elsewhere, and terminals have null expansions.
In inference, the conditional posteriors are calculated with a chart sampler (Johnson et al., 2007), and Gibbs sampling is used to draw samples of grammars and parse trees from the true posteriors. For example, at iteration t of Gibbs sampling:
$$G^{(t)} \sim P(G \mid \tau^{(t-1)}_{1..N}, \beta)$$
$$\tau^{(t)} \sim P(\tau \mid \sigma_\tau, G^{(t)})$$
where σ_τ denotes the terminals in τ.
The inference procedure naturally produces sampled parses of a sentence, and the Viterbi parse of a sentence given an induced PCFG can be obtained by running the Viterbi algorithm with the grammar on the sentence.
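The Viterbi step can be sketched as a CKY chart that keeps the best-scoring analysis per span and category. The code below is an illustrative sketch only: the dictionary encoding of the grammar, the root category "S", the function name and the toy grammar are assumptions, and it is not the implementation used by the induction model.

```python
import math

def viterbi_parse(words, binary_rules, lexical_rules, root="S"):
    """Return (log probability, tree) of the best CNF parse, or None.

    binary_rules:  {(parent, left, right): probability}   (assumed layout)
    lexical_rules: {(category, word): probability}        (assumed layout)
    """
    n = len(words)
    best = {}  # (i, j, category) -> (log probability, backpointer)
    for i, w in enumerate(words):
        for (cat, word), p in lexical_rules.items():
            if word == w and p > 0:
                best[(i, i + 1, cat)] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in binary_rules.items():
                    if p <= 0 or (i, k, left) not in best or (k, j, right) not in best:
                        continue
                    score = (math.log(p) + best[(i, k, left)][0]
                             + best[(k, j, right)][0])
                    if (i, j, parent) not in best or score > best[(i, j, parent)][0]:
                        best[(i, j, parent)] = (score, (k, left, right))
    if (0, n, root) not in best:
        return None

    def build(i, j, cat):
        _, back = best[(i, j, cat)]
        if j - i == 1:                 # preterminal: backpointer is the word
            return (cat, back)
        k, left, right = back
        return (cat, build(i, k, left), build(k, j, right))

    return best[(0, n, root)][0], build(0, n, root)

# Toy grammar (hypothetical) that prefers right-branching analyses
V_BINARY = {("S", "A", "A"): 0.6, ("S", "A", "S"): 0.3, ("S", "S", "A"): 0.1}
V_LEXICAL = {("A", "a"): 1.0}
print(viterbi_parse(["a", "a", "a"], V_BINARY, V_LEXICAL))
```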

Experiments
An exploratory evaluation on the 12-language development partition described in Section 4 measures the effectiveness of the proposed predictors in order to narrow the number of possible candidates prior to significance testing. A confirmatory evaluation on the 25-language test partition with significance testing is performed with the predictors that are found to be effective in the exploratory evaluation. Following Jin et al. (2018a), the concentration parameter of the Dirichlet priors is set to 0.2 for all languages. The number of syntactic categories C is set to 30 to allow the model to explore more complex syntactic structures. 30 random seeds are used for initialization of the model parameters, creating 30 runs for each language. The embedding depth of the induced grammars is not bounded in any run. All runs are stopped at iteration 700, by which point likelihood has been observed to be stable for at least 200 iterations (Jin et al., 2018a). A sampled grammar and Viterbi parse from the end of each run are used for predictor value calculation. Recall is used as the parsing accuracy metric for recovery of attested constituents.

Correlation study
Columns two through seven in Table 1 report correlation coefficients between each predictor and recall. Coefficients higher than 0.45 or lower than -0.45 are considered substantially predictive and reported in the table. Coefficients are averaged across reported languages.
Variance of average surprisal (VAS) has the highest correlation coefficients among all the predictors with the highest average correlation coefficient of 0.627. Data likelihood (LL), which is the most common metric for optimization and model selection in grammar induction, is the second best predictor. It also has a high average correlation coefficient of 0.588. Right-branching score also is substantially predictive of recall, but two of the languages have a negative coefficient, making it difficult to use as a model selection criterion without prior knowledge about the branching tendency of a language. Rule complexity, average stack depth as well as Zipf likelihood ratio all show up as predictive, but the signs of the coefficients are similarly inconsistent. Also, the signs of rule complexity are mostly positive, indicating that grammars should maintain a certain minimum level of complexity.
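The screening step in this section can be sketched as a per-language correlation between predictor values and recall over the 30 runs, followed by the ±0.45 threshold. The type of correlation coefficient is not stated in this section, so the use of Pearson's r below is an assumption, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # undefined if either sequence is constant

def substantially_predictive(r, threshold=0.45):
    """True when the coefficient is above +threshold or below -threshold."""
    return abs(r) > threshold

def average_reported(coefficients, threshold=0.45):
    """Mean over the languages whose coefficient passes the threshold."""
    kept = [r for r in coefficients if substantially_predictive(r, threshold)]
    return sum(kept) / len(kept) if kept else None

# Hypothetical predictor values and recall for four runs of one language
print(pearson([0.1, 0.2, 0.3, 0.4], [0.30, 0.35, 0.33, 0.45]))
```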

Parsing accuracy and model selection
The rightmost columns in Table 1 show parsing results on the development partition. The oracle recall is the highest recall obtained across the 30 runs, and the baseline reports whichever of the left-branching baseline or the right-branching baseline has the higher recall, marked by L or R.
The VAS and LL columns in Table 1 show the parsing accuracy of the runs chosen by VAS and likelihood and Figure 2 shows the difference in recall. Positive difference shows that the run chosen with VAS is more accurate, and negative difference shows that LL is more accurate. Using VAS as the model selection criterion provides on average 3.19 points of recall gain. Recall gain from Nynorsk seems to be a fairly large outlier, but the positive gains from other languages are also larger than the negative gains. Figure 2 also shows the difference of average recall between the runs with the top 5 highest VAS and likelihood. There are still larger positive differences than negative differences, suggesting that VAS more strongly correlates with recall.
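The comparison between the two model selection criteria can be written down compactly. The sketch below assumes each run is summarized as a dictionary with hypothetical keys `"vas"`, `"ll"` and `"recall"`; the numbers in the example are made up for illustration and are not results from this paper.

```python
def select_top_runs(runs, criterion, k=1):
    """Return the k runs with the highest value of the given criterion."""
    return sorted(runs, key=lambda run: run[criterion], reverse=True)[:k]

def mean_recall(runs):
    """Average recall over a list of runs."""
    return sum(run["recall"] for run in runs) / len(runs)

def recall_gain(runs, k=1):
    """Recall of the top-k VAS runs minus recall of the top-k LL runs.

    Positive values mean VAS selected more accurate runs than likelihood.
    """
    by_vas = select_top_runs(runs, "vas", k)
    by_ll = select_top_runs(runs, "ll", k)
    return mean_recall(by_vas) - mean_recall(by_ll)

# Made-up runs for one language (illustration only)
RUNS = [
    {"vas": 0.9, "ll": -100.0, "recall": 0.50},
    {"vas": 0.5, "ll": -90.0, "recall": 0.40},
    {"vas": 0.7, "ll": -95.0, "recall": 0.45},
]
print(recall_gain(RUNS, k=1))
print(recall_gain(RUNS, k=2))
```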

Parsing accuracy and model selection
In order to reduce the need for multiple trials correction, evaluations on the test partition only examine surprisal variance and data likelihood.
The VAS and LL columns in Table 2 show the parsing accuracy of the runs chosen by VAS and likelihood on the test partition, and Figure 3 shows the difference in recall for top 1 and top 5 runs.
The patterns are similar to the ones on the development set. Using VAS as the model selection criterion with the top 1 runs provides on average 4.03 points of recall gain. Table 3 shows correlation coefficients for LL and VAS on languages in the test partition. Again the observed pattern is similar to, if not more extreme than, what is seen on the development partition. The magnitude of the coefficients is consistent with findings in the development partition. Except for Basque, the sign for VAS-recall correlation is consistently positive, confirming that it is reliable to use VAS for model selection.
Confirmatory significance testing is performed on two sets of 25,000 randomly sampled parses from the runs with highest likelihood and highest VAS on all test languages. The parses are randomly permuted between the two sets, and the difference in recall between the two sets is measured. This permutation test shows that the average recall gain of 4.03 points in Table 2 is highly unlikely to be due to chance (p < 0.0001), showing that model selection with VAS produces significantly more accurate grammars than model selection with likelihood.
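A permutation test of this general shape can be sketched as follows. The sketch assumes each parse is summarized as a pair (matched constituents, gold constituents), pools the two sets and reshuffles them (an unpaired design), and uses a two-sided criterion with add-one smoothing of the p-value. All of these choices, along with the function names and example data, are assumptions for illustration rather than a description of the exact procedure used here.

```python
import random

def recall(parses):
    """Micro-averaged recall over (matched, gold) constituent counts."""
    matched = sum(m for m, _ in parses)
    gold = sum(g for _, g in parses)
    return matched / gold

def permutation_test(set_a, set_b, n_permutations=1000, seed=0):
    """Return (observed recall difference, two-sided p-value)."""
    rng = random.Random(seed)
    observed = recall(set_a) - recall(set_b)
    pooled = list(set_a) + list(set_b)
    n_a = len(set_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign parses to the two sets
        diff = recall(pooled[:n_a]) - recall(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)

# Made-up data: one system recovers 9 of 10 constituents per parse,
# the other recovers 1 of 10.
STRONG = [(9, 10)] * 50
WEAK = [(1, 10)] * 50
print(permutation_test(STRONG, WEAK))
```

The smallest p-value this estimator can report is 1 / (n_permutations + 1), so resolving p < 0.0001 requires at least 10,000 permutations.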

Word-order typology prediction
If VAS is more highly correlated with parsing accuracy than previous predictors, it can be used as an unsupervised proxy for parsing accuracy. Branching Direction Theory (Dryer, 1992) predicts that VO languages favor right-branching structures and OV languages favor left-branching structures. This prediction can be evaluated by correlating VAS and RBS, and using the sign of the correlation coefficient as the word-order prediction: a positive coefficient predicts VO and a negative coefficient predicts OV. This tests whether grammars following the branching tendency predicted by the theory have higher parsing accuracy. Table 4 shows results for the VAS-RBS correlation reported along with a few baselines, including a uniform baseline, a majority baseline (where there is oracle knowledge about the data set that the majority of languages are VO), the LL-RBS correlation baseline (where data likelihood is used as the proxy for recall), as well as the recall-RBS oracle performance.
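The sign-based prediction can be sketched per language as follows, taking the proxy values (VAS, likelihood, or oracle recall) and right-branching scores of that language's runs as input. Pearson's r is assumed as the correlation coefficient, ties at exactly zero are arbitrarily resolved to OV, and the function names and example values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def predict_word_order(proxy_values, rbs_values):
    """Predict 'VO' when the accuracy proxy rises with right-branching score.

    A positive correlation means more right-branching grammars look better,
    which Branching Direction Theory associates with VO languages.
    """
    return "VO" if pearson(proxy_values, rbs_values) > 0 else "OV"

# Made-up per-run values for two hypothetical languages
print(predict_word_order([0.2, 0.4, 0.6], [0.55, 0.70, 0.90]))
print(predict_word_order([0.2, 0.4, 0.6], [0.80, 0.50, 0.30]))
```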
There are 29 VO languages and 7 OV languages in the data set (Dryer, 2011); Dutch has no dominant VO-OV order. Macro F1 is reported for all systems here as the population distribution of OV and VO languages in the world is almost uniform (Dryer, 1992). First, as predicted by BDT, using signs of the correlation between recall and right-branching score yields the best macro F1 score. Second, using VAS as a proxy for recall yields a much higher F score than all the baselines, including likelihood. In fact, likelihood performs the worst of all the baselines. This result shows again that the correlation between VAS and parsing accuracy is stronger than that between likelihood and parsing accuracy at convergence, and this tighter correlation can be useful in other unsupervised tasks.

Discussion
Positive effects for predictors other than data likelihood suggest that natural language grammars are not optimally learned to explain sentence forms, but may additionally reflect biological constraints on grammar learning. In particular, the success of VAS may point to a bias toward a function/content distinction in natural language grammars, with common words more likely to form distinctive categories in human learners than co-occurrence statistics would suggest. This bias would produce the observed result that sentences containing more function words have higher per-word probabilities than sentences containing more content words, and the existence of such a distinction may give rise to higher surprisal variance. In contrast, a lack of such bias would allow common words to mix with rare words, yielding more uniform probabilities and low surprisal variance, contrary to observations of conditions under which recall is maximized. The fact that simple maximization of data likelihood appears to favor the more uniform response suggests it is not a sufficient model of grammar learning.
We first evaluate this hypothesis by examining the ratio between content and function words across sentences to determine whether this ratio is constant in a language. We use the Wall Street Journal portion of the Penn Treebank as the target corpus, calculate the ratio of function to content words in all sentences, and examine the density of the ratio in terms of sentence count and its relationship with sentence length. The left figure in Figure 4 shows the relation between the function-content word ratio and sentence count. The function-content word ratio has a mode at around 0.7, but the counts are also widely distributed, mostly within the range between 0 and 1. This shows that the ratio between content and function words in a language does not appear to be constant.
The right figure in Figure 4 shows the relationship between the function-content word ratio and sentence length. The ratio seems to converge to 0.7 as the sentence gets longer, but the majority of the sentences in the corpus are shorter than 50 words, and the spread of function-content word ratio for sentences with shorter lengths is also very wide.
In many languages, the words with highest frequencies are usually closed class words, such as prepositions and determiners, and these words typically split away from other major classes and form their own classes, raising their probabilities. Low frequency words, on the other hand, tend to move from smaller classes into larger classes, and thus lower their probabilities. It is known that low frequency words, especially hapax legomena, are usually open class words like nouns or adjectives. Reassigning these words into larger classes may help them find a natural home where the majority is of the same class as the rare words. This strategy helps better assign words to syntactic classes, which in turn helps create syntactic rules which better align with human annotations.
The claim that VAS promotes a distinction between function and content words can be evaluated by comparing joint probabilities of the most frequent words in each language and their most common class in grammars from runs with highest VAS, lowest VAS and highest likelihood. In each case, if the most frequent words have higher probabilities in the high VAS run, this may suggest VAS is correlated with function-content distinctions. Figure 5 shows the top 50 most frequent words in 6 different languages with substantial correlations between VAS and recall.
The left figure shows the fraction of words in the run with the highest VAS that have joint probabilities of words and their generating categories higher than in the run with the highest likelihood (i.e. words that have higher probabilities in VAS-selected grammars than likelihood-selected grammars). The right figure shows the fraction of words in the run with the highest VAS that have joint probabilities higher than in the run with the lowest VAS (i.e. words that have higher probabilities in VAS-selected grammars than in VAS-dispreferred grammars). For all six languages, the ratio of words with higher joint probability is larger than 1, meaning that frequent words in the run with the highest VAS are assigned to classes with higher joint probabilities than words in the run with the highest likelihood or the run with the lowest VAS, consistent with the hypothesis that VAS promotes a distinction between function and content words. Probabilities for some example words are shown in Figure 6.
An alternative explanation is that information content in a sentence is higher when the sentence is longer (Keller, 2004), and that when VAS is maximized, grammars that produce uniform information content across different sentence lengths are disfavored. For example, punctuation contributes more to the likelihood of short sentences than to that of long sentences. Assigning high probabilities to punctuation may therefore cause sentence likelihood to co-vary with sentence length. Conforming to this pattern may help a grammar produce structures more in line with human annotations in the data set. Figure 7 shows the distribution of VAS plotted against sentence length. The regression lines for both the highest VAS and lowest VAS cases show a flat slope, indicating that the correlation between VAS and sentence length is not substantial. This is supported by correlation testing with Kendall's τ test between sentence length and VAS in the high VAS run (τ = −0.01, p = 0.41) and in the low VAS run (τ = −0.02, p = 0.28). This shows that the effectiveness of VAS cannot be explained by the hypothesis that it guides the grammar to generate syntactic structures by shaping the sentential information content to co-vary with sentence length.
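A rank correlation of this kind can be computed directly from pairwise comparisons. The sketch below implements Kendall's τ-b, which corrects for ties (sentence lengths repeat often), in a simple quadratic-time form. Whether the τ-b variant was the one used here is an assumption, the function name is hypothetical, and the p-value computation is omitted.

```python
import math

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C and D count
    concordant and discordant pairs, and Tx (Ty) counts pairs tied only in
    x (only in y). Pairs tied in both variables are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Made-up sentence lengths and per-sentence average surprisals
LENGTHS = [5, 5, 12, 20, 31]
SURPRISALS = [6.1, 5.8, 6.0, 6.3, 5.9]
print(kendall_tau_b(LENGTHS, SURPRISALS))
```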

Conclusion
This work explores the non-optimality of data likelihood for model selection in unsupervised grammar induction. Experiments with several linguistically- and psycholinguistically-motivated predictors on a large multilingual data set show that variance of average surprisal (VAS) is highly predictive of parsing performance. Using it as the criterion for model selection outperforms data likelihood significantly. Further evidence shows VAS to be a better candidate than data likelihood for predicting word-order typology. Analyses show that VAS seems to separate content words from function words in natural language grammars and better arrange words with different frequencies into different classes that are more consistent with these linguistic distinctions.