Grammar Induction with Neural Language Models: An Unusual Replication

Grammar induction is the task of learning syntactic structure without expert-labeled treebanks (Charniak and Carroll, 1992; Klein and Manning, 2002). Recent work on latent tree learning offers a new family of approaches to this problem by inducing syntactic structure using supervision from a downstream NLP task (Yogatama et al., 2017; Maillard et al., 2017; Choi et al., 2018). In a recent paper published at ICLR, Shen et al. (2018) introduce such a model and report near state-of-the-art results on the target task of language modeling, and the first strong latent tree learning result on constituency parsing. In our analysis of this model, we discover issues that make the original results hard to trust, including tuning and even training on what is effectively the test set. Here, we analyze the model under different configurations to understand what it learns and to identify the conditions under which it succeeds. We find that this model represents the first empirical success for neural network latent tree learning, and that neural language modeling warrants further study as a setting for grammar induction.



Background and Experiments
We analyze the Parsing-Reading-Predict Network (PRPN; Shen et al., 2018), which uses convolutional networks with a form of structured attention (Kim et al., 2017), rather than recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), to learn trees while performing straightforward backpropagation training on a language modeling objective. The structure of the model seems rather suboptimal: since the parser is trained as part of a language model, it parses greedily, with no access to any words to the right of the point where each parsing decision must be made.
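To make the tree induction step concrete, the sketch below shows the recursive max-split scheme that Shen et al. (2018) describe for turning the parsing network's per-gap syntactic distances into an unlabeled binary tree. This is a minimal illustration rather than the authors' released code, and the distance values in the example are invented.

```python
def distances_to_tree(words, distances):
    """Recursively split a sentence at the largest syntactic distance,
    following the distance-to-tree conversion described by Shen et al. (2018).

    `words` is a list of n tokens; `distances` holds the n - 1 syntactic
    distances between adjacent tokens, as produced by the parsing network.
    """
    if len(words) == 1:
        return words[0]
    # Split where the model assigns the largest syntactic distance,
    # then recurse on the two halves.
    split = max(range(len(distances)), key=lambda i: distances[i])
    left = distances_to_tree(words[:split + 1], distances[:split])
    right = distances_to_tree(words[split + 1:], distances[split + 1:])
    return (left, right)


# Hypothetical example:
# distances_to_tree(["the", "cat", "sat"], [0.2, 0.9])
# returns (("the", "cat"), "sat").
```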
[Figure: Parses from PRPN-LM trained on AllNLI for the sentence "A crusade of NO to the consumption of drugs is imperative."]

The experiments on language modeling and parsing are carried out using different configurations of the model: PRPN-LM, tuned for language modeling, and PRPN-UP, tuned for (unsupervised) parsing. PRPN-LM is much larger than PRPN-UP, with an embedding layer that is four times larger and three times as many units per layer. In the PRPN-UP experiments, we observe that the WSJ data is not split, such that the test data is used (without parse information) during training. This implies that the parsing results of PRPN-UP may not generalize in the way usually expected of machine learning evaluation results.
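For reference, the sketch below shows the conventional WSJ split used in supervised constituency parsing, which is the split PRPN-UP does not apply; the section numbers reflect common practice and are not taken from the paper.

```python
# Conventional WSJ (Penn Treebank) sections for constituency parsing
# (an assumption based on common practice, not drawn from the paper):
# train on 02-21, tune on 22, test on 23. PRPN-UP instead trains on the
# full, unsplit WSJ, so its test sentences are also seen during training.
WSJ_SPLIT = {
    "train": [f"{s:02d}" for s in range(2, 22)],
    "dev": ["22"],
    "test": ["23"],
}
```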
We train PRPN on sentences from two datasets: the full WSJ and AllNLI, the concatenation of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018b). We then evaluate the constituency trees produced by these models on the full WSJ, WSJ10 (the subset of WSJ sentences with at most ten words), and the MultiNLI development set. Table 1 shows results for all the models under study, plus several baselines, on WSJ and WSJ10. Unexpectedly, the PRPN-LM models achieve higher parsing performance than PRPN-UP. This shows that any tuning done to separate PRPN-UP from PRPN-LM was not necessary, and that the results described in the paper can be largely reproduced by a unified model in a fair setting. Moreover, the PRPN models trained on the larger, out-of-domain AllNLI perform better than those trained on WSJ. Surprisingly, PRPN-LM trained on out-of-domain AllNLI achieves the best F1 score on the full WSJ among all the models (Tables 1 and 2). This suggests that PRPN is strikingly effective at latent tree learning. Additionally, Table 2 shows that the two PRPN-UP models achieve F1 scores of 46.3 and 48.6, respectively, on the MultiNLI development set, setting the state of the art in parsing on this dataset among latent tree models. We conclude that PRPN does acquire some substantial knowledge of syntax, and that this knowledge agrees with Penn Treebank (PTB) grammar significantly better than chance.
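The parsing numbers above are unlabeled bracketing F1 scores. As a rough illustration of the metric, trees can be compared as sets of constituent spans; this is a minimal sketch, not the EVALB-style scorer behind the reported numbers, which also handles details such as punctuation and trivial spans.

```python
def constituent_spans(tree, start=0):
    """Collect (start, end) spans for every constituent in a nested-tuple tree."""
    if isinstance(tree, str):  # a single token
        return set(), 1
    spans, length = set(), 0
    for child in tree:
        child_spans, child_len = constituent_spans(child, start + length)
        spans |= child_spans
        length += child_len
    spans.add((start, start + length))
    return spans, length


def unlabeled_f1(pred_tree, gold_tree):
    """Unlabeled bracketing F1: overlap of constituent spans in the two trees."""
    pred, _ = constituent_spans(pred_tree)
    gold, _ = constituent_spans(gold_tree)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if overlap else 0.0
```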

Results
In addition, we replicate the language modeling perplexity of 61.6 reported in the original paper using PRPN-LM trained on WSJ, which indicates that PRPN-LM is effective at both parsing and language modeling.
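For reference, perplexity is the exponential of the average per-token negative log-likelihood under the language model. The sketch below is only an illustration of the formula; the 4.12-nat figure is simply the value that corresponds to the reported perplexity, not a number from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), with natural-log probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# An average of roughly 4.12 nats per token corresponds to a perplexity of about 61.6.
```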

Conclusion
In our analysis of the PRPN model, we find several experimental problems that make the results difficult to interpret. However, in analyses going well beyond the scope of the original paper, we find that PRPN is nonetheless robust: it represents a viable method for grammar induction and the first empirical success for neural latent tree learning. We expect that it heralds further work on language modeling as a tool for grammar induction research.