Language Modeling with Shared Grammar

Sequential recurrent neural networks have achieved superior performance on language modeling, but overlook the structural information in natural language. Recent work on structure-aware models has shown promising results on language modeling. However, how to incorporate grammar knowledge on corpora without syntactic annotations remains an open problem. In this work, we propose the neural variational language model (NVLM), which enables the sharing of grammar knowledge among different corpora. Experimental results demonstrate the effectiveness of our framework on two popular benchmark datasets. With the help of shared grammar, our language model converges significantly faster to a lower perplexity on a new training corpus.


Introduction
Language modeling has been a long-standing fundamental task in natural language processing. In recent years, language models based on sequential recurrent neural networks (RNNs) have made astonishing progress, achieving remarkable results on various benchmark datasets (Mikolov et al., 2010a; Jozefowicz et al., 2016; Melis et al., 2017; Elbayad et al., 2018; Gong et al., 2018; Dai et al., 2019). Despite this huge success, the structural information in natural language is largely overlooked due to the architectural limitations of sequential RNN-based language models.
Recently, researchers have explored explicitly exploiting the latent structures in natural language, such as recurrent neural network grammars (RNNGs; Dyer et al., 2016; Kuncoro et al., 2017) and parsing-reading-predict networks (PRPNs; Shen et al., 2017). These structure-aware models have shown promising results on language modeling, demonstrating that the latent nested structure in language indeed helps improve sequential language models. Models like RNNG exploit treebank data with syntactic annotations to learn grammar, which is then used to improve language model performance by a significant margin. This is intriguing, but it comes at a cost: accurate syntactic annotation is very expensive, and treebank data such as the Penn Treebank (Marcus et al., 1993) is typically small-scale and not freely available to the public.
How can we improve language modeling with grammar knowledge on a new corpus that has no syntactic annotations? This is an important and challenging open problem. As a motivating example, we conduct a simple experiment by training an RNN language model on one corpus and testing it on another, and report the results in Table 1. The RNN language model performs terribly when trained and tested on different datasets, which is unsurprising since the data distribution may vary dramatically across corpora. Training from scratch on every new corpus is clearly not good enough: 1) it is computationally expensive and not data-efficient; 2) the target corpus may be too small to train a decent RNN-based language model; 3) the common grammar is not leveraged. Some recent works on transfer learning have attempted language model adaptation (Yoon et al., 2017; Ma et al., 2017; Chen et al., 2015); however, none of them explicitly exploits the common grammar knowledge shared between corpora.
To bridge the gap of language modeling across corpora, we believe that grammar is the key, since all corpora are in the same language and should share the same grammar. Motivated by this, we propose the neural variational language model (NVLM). Specifically, our framework consists of two probabilistic components: a constituency parser and a joint generative model of sentence and parse tree. When treebank data is available, we can train both components separately. On a new corpus without tree annotations, we fix the pre-trained parser and train the generative model either from scratch or with warm-up. The pre-trained parser is armed with grammar knowledge and thus helps our language model adapt smoothly to the new corpus. Our framework also supports end-to-end joint training of the two components, so that we can fine-tune the language model. Experimental results show that our proposed framework is effective in all learning schemes, achieving good performance on two popular benchmark datasets. With the help of shared grammar, our language model converges significantly faster to a lower perplexity on a new corpus. Our contributions in this paper are summarized as follows:

• Grammar-sharing framework: We propose a framework for grammar-sharing language modeling, which incorporates common grammar knowledge into language modeling. With the shared grammar, our framework helps the language model transfer to a new corpus efficiently, with better performance and in less time.
• End-to-end learning: Our framework can be trained end-to-end without syntactic annotations. To tackle the technical challenges in end-to-end learning, we use variational methods and exploit a policy gradient algorithm for joint training.
• Efficient software package: We provide a highly efficient implementation of our work on GPUs. Our parser is capable of parsing one million sentences per hour on a single GPU. See Appendix D for details.

Model
In this section, we first provide an overview of the proposed framework, then briefly introduce how components work together, and finally present the probabilistic formulation of each component.

Framework
As shown in Figure 1, the neural variational language model (NVLM) consists of two probabilistic components: 1) a constituency parser P_{θ1}(y|x), which models the conditional probability of the parse tree y (a syntax tree without terminal tokens) given the input sentence x (a sequence of terminal tokens); 2) a joint generative model P_{θ2}(x, y), which models the joint probability of the sentence and the parse tree.

Constituency Parsing. Our parser can work independently, taking as input a sentence x and parsing it according to

  ŷ = argmax_{y ∈ Y(x)} P_{θ1}(y|x),  (1)

where Y(x) denotes the collection of all possible parses of x. Our parser can also cooperate with the joint generative model as

  ŷ = argmax_{y′ ~ P_{θ1}(y|x)} P_{θ2}(x, y′),  (2)

where the parsing candidates y′ sampled from P_{θ1}(y|x) are fed into the generative model P_{θ2}(x, y) to be reranked.

Language Modeling. Statistical language models are typically formulated as

  P(x) = ∏_{t=1}^{L_x} P(x_t | x_{<t}),  (3)

where x_t denotes the t-th token in the sentence x, L_x denotes the length of x, and x_{<t} indicates all tokens before x_t. To evaluate NVLM as a language model, we need to marginalize the joint probability as P(x) = Σ_{y′ ∈ Y(x)} P(x, y′). This is extremely hard to compute due to the exponentially large space of Y(x). We use an importance sampling technique to overcome this computational intractability, which is detailed in Section 4.
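As a concrete illustration of the reranking inference above, the sketch below rescores candidate parses with the joint model and keeps the best one. The scoring function is a hypothetical toy stand-in for log P_{θ2}(x, y), not the actual neural model, and `n_leaves` assumes the space-separated bracket format used in this paper:

```python
def n_leaves(tree):
    # count leaf (preterminal) slots in a linearized bracket tree,
    # e.g. "(S (NP NN ) (VP VB ) )" has two leaf slots: NN and VB
    return sum(1 for tok in tree.split()
               if tok != ")" and not tok.startswith("("))

def toy_joint_score(sentence, tree):
    # hypothetical stand-in for log P_theta2(x, y): prefer candidate
    # parses whose number of leaf slots matches the sentence length
    return -abs(len(sentence.split()) - n_leaves(tree))

def rerank(sentence, candidates, score=toy_joint_score):
    # pick the candidate parse y' maximizing the joint score
    return max(candidates, key=lambda y: score(sentence, y))
```

In the real model, `candidates` would be trees sampled from P_{θ1}(y|x) and `score` would evaluate the joint RNN.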
Figure 1: Overall framework of the neural variational language model (NVLM). It consists of two probabilistic components: a constituency parser P_{θ1}(y|x) and a joint generative model P_{θ2}(x, y). The parser takes as input a sentence x and predicts the corresponding parse tree y. Specifically, we use an encoder-decoder structure to parameterize the parser. The joint generative model defines a joint distribution on parse trees (y) and sentences (x). When treebank data is available, we can learn the parameters θ1 and θ2 for each component respectively. To train a language model on a new corpus, we fix the pre-trained θ1 and only update θ2. Our framework can also be jointly trained end-to-end to fine-tune the language model, where θ1 and θ2 are co-updated together.

With treebank data such as the Penn Treebank (Marcus et al., 1993), we have pairs of (x, y) to train the two components respectively, and obtain a high-quality language model with the parser providing grammar knowledge. However, due to the expensive cost of accurate parsing annotation,
treebank data is typically scarce. For a new corpus without parsing annotations, our proposed framework can still leverage the parser to train a high-quality language model adapted to the new corpus. We can also co-train the two components to fine-tune the language model on the new corpus. In the rest of this section, we present our parameterization of the two probabilistic components P_{θ1}(y|x) and P_{θ2}(x, y). To avoid notational clutter, we use a standard RNN as the basic building block in the rest of this section.¹

Constituency Parser
To parameterize the constituency parser P_{θ1}(y|x), it is natural to first encode the input sentence x into an embedding vector, then pass the vector to a decoder to generate the parse tree y. There are quite a few choices for both the encoder and the decoder, among which recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are the most popular, since they are powerful at capturing the structural patterns in natural language (Sutskever et al., 2014; Zhang et al., 2015). Vinyals et al. (2015) found that an RNN-powered sequence-to-sequence (Seq2Seq) architecture with an attention mechanism achieves state-of-the-art parsing performance. This architecture is conceptually simple yet powerful, with large model capacity.

¹ The proposed neural variational language model is independent of any specific implementation of the recurrent unit, such as LSTM (Hochreiter and Schmidhuber, 1997), GRU, and SRU (Lei and Zhang, 2017), and can be directly applied to their deep and bi-directional variants.
In this paper, we adapt the sequence-to-sequence architecture for NVLM. We linearize a parse tree into a bracket representation (a sequence of nodes and brackets ordered by a pre-order traversal of the tree), which is a one-to-one mapping of the tree structure. For example, the parse tree shown in Figure 1 can be linearized as (S (NP NNP ) (VP VBZ (NP DT NN ) ) . ). Interestingly, the parser is now similar to a neural machine translation model, which translates a sentence into a linearized parse tree. Next we show how the parser computes P_{θ1}(y|x) in detail.
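The pre-order linearization can be sketched as follows; here a tree is represented as a nested tuple (label, children...) and a leaf is a bare preterminal tag, a simplified representation assumed for illustration:

```python
def linearize(node):
    """Pre-order linearization of a parse tree into bracket tokens."""
    if isinstance(node, str):          # leaf: a bare preterminal tag
        return [node]
    label, *children = node
    tokens = ["(" + label]             # open bracket carries the label
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")                 # close the constituent
    return tokens

# the example tree from Figure 1
tree = ("S", ("NP", "NNP"), ("VP", "VBZ", ("NP", "DT", "NN")), ".")
print(" ".join(linearize(tree)))
```

This reproduces the bracket string above, and the mapping is invertible, so the decoder's token sequence fully determines the tree.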
Formally, the input sentence x is fed into the encoder and encoded as a sequence of hidden states

  h_i = RNN_enc(x_i, h_{i−1}),

where x_i is first embedded into a vector (word embedding) and then fed into the recurrent unit, and h_0 is a learnable vector for the special start-of-sentence token <SOS>. The decoder uses a separate RNN to calculate the hidden states

  s_j = RNN_dec([y_{j−1}; c_j], s_{j−1}),

where y_0 is set to <SOS>, s_0 is set to h_{L_x} (the last hidden state of the encoder), and y_{j−1} is the decoder's previous output token sampled from the categorical distribution of the decoder's softmax layer (or specified in teacher-forcing training mode), embedded and then concatenated with the context vector c_j to serve as the RNN input. The context c_j is calculated as

  c_j = Σ_{i=1}^{L_x} α_{j,i} h_i,  α_{j,i} ∝ exp(s_{j−1}ᵀ h_i).

This is a simplified version of the conventional attention mechanism used in the Seq2Seq parser (Vinyals et al., 2015). Our parser is designed to be lighter and faster so that it can work efficiently together with the NVLM joint generative model. Finally, the likelihood P_{θ1}(y|x) is computed as

  P_{θ1}(y|x) = ∏_{t=1}^{L_y} [softmax(f(s_t))]_{y_t},  (4)

where f(·) refers to a fully-connected layer with tanh activation, and the subscript y_t selects its probability in the categorical distribution. L_y denotes the length of y, which is determined by the decoder itself: once the decoder emits the special end-of-sentence token <EOS>, the decoding phase is terminated. The trainable parameters θ1 in the parser component P_{θ1}(y|x) include the weights (and biases) in RNN_enc and RNN_dec, and all word embeddings. We use separate weights and word embeddings for the encoder and the decoder.
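A minimal numerical sketch of this simplified attention, assuming the dot-product form α_{j,i} ∝ exp(s_{j−1}·h_i) (the paper's exact parameterization may differ):

```python
import math

def attend(s_prev, enc_states):
    # simplified dot-product attention: score each encoder state h_i
    # against the previous decoder state s_{j-1}, softmax-normalize,
    # and return the context c_j as the weighted sum of the h_i
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in enc_states]
    m = max(scores)                        # max-shift for numerical stability
    exps = [math.exp(v - m) for v in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(enc_states[0])
    context = [sum(alphas[i] * enc_states[i][d] for i in range(len(enc_states)))
               for d in range(dim)]
    return context, alphas
```

If the previous decoder state is strongly aligned with one encoder state, its attention weight dominates and the context vector is close to that state.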

Joint Generative Model
Similar to the parser, the joint generative model P_{θ2}(x, y) can also be parameterized in various ways. For example, Choe and Charniak (2016) use an LSTM language model trained on the parser output (with terminal words), which is then used to rerank an existing parser's output and achieves state-of-the-art parsing performance. Inspired by that, we parameterize the joint generative model as

  P_{θ2}(x, y) = P(z) = ∏_{t=1}^{L_z} P(z_t | z_{<t}),  (5)

where z is the mixed parse tree of x and y, which is then mapped to a sequential representation following a pre-order traversal. Figure 1 illustrates how a sentence x and its parse tree y can be merged into a mixed tree. We use another RNN, RNN_gen, to compute the likelihood P(z_t | z_{<t}), and finally obtain P_{θ2}(x, y) using Eq. (5).
Algorithm 1: Sentence word attaching
  input:  sentence x; parser output tokens y
  output: mixed tree tokens z
  1: if y is not balanced then
  2:     y ← BalanceTree(y)

The trainable parameters θ2 in the joint generative model P_{θ2}(x, y) include the weights (and biases) in RNN_gen and all word embeddings.
In Choe and Charniak (2016), the parser is fixed and well-trained before training the generative model. In contrast, our parser can be jointly trained with the generative model, at which point the parser may not be fully trained yet. Therefore, the generated parse tree can be malformed, mismatching the sentence with an incorrect number of leaves. An even worse case is when the parser's output is unbalanced and cannot form a legitimate tree. To handle these cases, we propose a sentence word attaching algorithm, which is guaranteed to generate a well-formed mixed tree. We describe our algorithm for mixed tree generation in Algorithm 1. It takes as input a sentence and its parse tree, and generates a mixed tree by attaching the sentence words to the leaf nodes of the parse tree. To handle an unbalanced parse tree, which is a rare case but happens due to the nature of the sequential parser, we simply add brackets to either the head or the tail to make the parse tree balanced.
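The attaching step can be sketched as follows, under two assumptions Algorithm 1 leaves unspecified: unmatched brackets are repaired by adding dummy opens at the head and closes at the tail, and word-count mismatches are handled with <unk> padding and dummy (X ...) nodes:

```python
def balance(tokens):
    # repair an unbalanced bracket sequence: prepend a dummy open
    # bracket for each unmatched ")" and append ")" for each unmatched
    # open (one simple strategy; the paper's BalanceTree may differ)
    depth, need_open = 0, 0
    for tok in tokens:
        if tok == ")":
            if depth == 0:
                need_open += 1
            else:
                depth -= 1
        elif tok.startswith("("):
            depth += 1
    return ["(X"] * need_open + tokens + [")"] * depth

def attach(words, parse_tokens):
    # attach sentence words under the leaf (preterminal) slots of the
    # parse; missing words become <unk>, surplus words are wrapped in
    # dummy (X ...) nodes so the result is always a well-formed tree
    out, w = [], 0
    for tok in balance(parse_tokens):
        if tok == ")" or tok.startswith("("):
            out.append(tok)
        else:                             # a bare preterminal tag
            word = words[w] if w < len(words) else "<unk>"
            out.extend(["(" + tok, word, ")"])
            w += 1
    extra = [t for word in words[w:] for t in ("(X", word, ")")]
    if extra:
        out = out[:-1] + extra + [out[-1]]
    return out
```

For example, `attach(["John", "runs"], "(S (NP NNP ) (VP VBZ ) )".split())` yields the mixed sequence (S (NP (NNP John ) ) (VP (VBZ runs ) ) ).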

Learning
In this section, we describe our algorithms for learning the model parameters in the constituency parser P ✓ 1 (y|x) and the joint generative model P ✓ 2 (x, y).

Learning Schemes
NVLM can be trained in three different schemes: 1) fully supervised learning, where sentences (x) and their corresponding parse trees (y) are available; 2) distant-supervised learning, where we have a pre-trained parser and a new corpus without parsing annotations; 3) semi-supervised learning, where no parsing annotations are available. Let D_XY = {(x^(i), y^(i))}_{i=1}^n denote the annotated training data, where each sentence is paired with a parse tree. Let D_X = {x^(i)}_{i=1}^m denote the unannotated training data, where only sentences are available. Next, we show how to train NVLM under each setting.

Supervised: In the fully supervised setting, we use D_XY to separately train the parser and the generative model, by maximizing their respective data log likelihoods

  J1(D_XY) = Σ_{i=1}^n log P_{θ1}(y^(i) | x^(i)),  (6)
  J2(D_XY) = Σ_{i=1}^n log P_{θ2}(x^(i), y^(i)),  (7)

where P_{θ1}(·) and P_{θ2}(·) are defined in Eq. (4) and Eq. (5). We obtain the gradients ∇_{θ1}J1 and ∇_{θ2}J2 by the chain rule, and iteratively update θ1 and θ2 with standard optimizers.

Distant-supervised: In distant-supervised learning, we have pre-trained the parser on corpus D_XY, and fix the parser to train the joint generative model on a new corpus D_X. The generative model can be either trained from scratch on D_X or warmed up on D_XY. This setting is of practical importance, since we often need a language model on a new corpus without annotations, and the parser pre-trained on treebank data can help since it encodes common grammar knowledge of the language. Under this setting, the pre-trained parser generates parse trees using Eq. (4) for the unannotated sentences, forming (x, y) pairs to train the joint generative model through ∇_{θ2}J2. The parser's parameters θ1 remain fixed.

Semi-supervised: NVLM can be trained end-to-end with only unannotated data D_X. This is extremely hard if we train everything from scratch, but it is very useful for fine-tuning the language model on a new corpus.
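The distant-supervised scheme reduces to a simple loop. In this sketch, `parse` and `update_theta2` are placeholder functions standing in for decoding with the frozen parser and one gradient step on log P_{θ2}(x, y):

```python
def distant_supervised_epoch(corpus, parse, update_theta2):
    # theta_1 stays frozen: the pre-trained parser produces a
    # silver-standard tree y for each unannotated sentence x, and
    # only the joint generative model theta_2 is updated on (x, y)
    losses = []
    for x in corpus:
        y = parse(x)                  # silver parse from the fixed parser
        losses.append(update_theta2(x, y))
    return sum(losses) / len(losses)  # mean training loss for the epoch
```

The same loop with `parse` swapped for gold annotations recovers the supervised training of θ2.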
Unlike distant-supervised learning, we now train the parser and the joint generative model together, and co-update the parameters θ1 and θ2. Here we also maximize the data log likelihood

  J(D_X) = Σ_{i=1}^m log P(x^(i)) = Σ_{i=1}^m log Σ_{y ∈ Y(x^(i))} P_{θ2}(x^(i), y).  (8)

Unfortunately, the derivative of J(D_X) is computationally intractable due to the large space of Y. To tackle this challenge, we use variational methods to maximize a lower bound of J(D_X), and exploit a policy gradient algorithm to update the parser's parameters. Details are described in Section 3.2 and Section 3.3. Our algorithm for semi-supervised learning is summarized in Algorithm 2, where we assume a mini-batch size of 1 to avoid notational clutter.

Algorithm 2: Semi-supervised learning
  input: annotated training data D_XY; unannotated training data D_X; optimizer G(·)
  1: Initialize θ1 with D_XY using Eq. (6)
  2: Initialize θ2 with D_XY using Eq. (7)
  …
     Update the baseline function with b(x^(i))  (10)

Variational EM
As described above, to overcome the computational intractability of maximizing J(D_X) directly, we use the variational expectation-maximization (EM) algorithm to maximize the evidence lower bound (ELBO):

  J̃(D_X) = E_{y ~ P_{θ1}(y|x)} [ log P_{θ2}(x, y) − log P_{θ1}(y|x) ] ≤ log P(x),

where we use our parser as the variational posterior P_{θ1}(y|x). For readability, from now on we assume m = 1 and omit summing over training samples. With the Monte Carlo method, we obtain the unbiased gradient

  ∇_{θ2} J̃(D_X) ≈ (1/K) Σ_{k=1}^K ∇_{θ2} log P_{θ2}(x, y^(k)),  y^(k) ~ P_{θ1}(y|x).
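A toy sanity check of the bound: when the variational posterior matches the true posterior exactly, the Monte Carlo ELBO estimate equals log P(x). The two-tree distributions below are hypothetical stand-ins for the neural models:

```python
import math
import random

def elbo_estimate(x, sample_tree, logp_parser, logp_joint, k=200, seed=0):
    # Monte Carlo estimate of the ELBO
    #   E_{y ~ P_theta1(y|x)}[ log P_theta2(x, y) - log P_theta1(y|x) ]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        y = sample_tree(x, rng)           # y^(k) ~ P_theta1(y|x)
        total += logp_joint(x, y) - logp_parser(y, x)
    return total / k
```

For a toy sentence with two parses of joint probability 0.25 each and a proposal of 0.5 each, every sample contributes log 0.5, so the estimate equals log P(x) = log 0.5 with zero variance.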

Policy Gradient
To get the gradient of J̃(D_X) with respect to the parser parameters θ1, more work is needed, since y is sampled from a series of categorical distributions.
Here we use the policy gradient algorithm (Williams, 1992) to get an unbiased estimator of the gradient

  ∇_{θ1} J̃(D_X) = E_{y ~ P_{θ1}(y|x)} [ A(x, y) ∇_{θ1} log P_{θ1}(y|x) ],  (11)

where A(x, y) = log P_{θ2}(x, y) − log P_{θ1}(y|x) is used as the learning signal. Due to space limits, we provide a detailed derivation of Eq. (11) in Appendix E. To stabilize the learning process, we use standard variance reduction techniques to reduce the variance of the gradient (Greensmith et al., 2004; Mnih and Gregor, 2014). Specifically, we first standardize the signal (rescaling it to zero mean and unit variance) and then subtract a baseline function b(x). We use a separate GRU as the baseline function and fit the centered signal by minimizing the mean squared loss. Finally, the gradient can be approximated as

  ∇_{θ1} J̃(D_X) ≈ (1/K) Σ_{k=1}^K [ (A(x, y^(k)) − μ̃) / σ̃ − b(x) ] ∇_{θ1} log P_{θ1}(y^(k)|x),

where μ̃ is the sample mean and σ̃ is the sample standard deviation, which estimate the mean and standard deviation of the learning signal A(x, y).
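The variance-reduced weights that multiply ∇_{θ1} log P_{θ1}(y^(k)|x) can be sketched as follows, with the baseline value b(x) supplied externally (e.g. by the GRU baseline network):

```python
import math

def signal_weights(signals, baseline):
    # standardize the learning signals A(x, y^(k)) to zero mean and
    # unit variance, then subtract the baseline b(x); the k-th weight
    # multiplies grad log P_theta1(y^(k)|x) in the REINFORCE estimator
    k = len(signals)
    mu = sum(signals) / k
    sd = math.sqrt(sum((a - mu) ** 2 for a in signals) / k)
    sd = sd if sd > 0 else 1.0            # guard against zero variance
    return [(a - mu) / sd - baseline for a in signals]
```

Standardization keeps the scale of the update roughly constant across batches, while the learned baseline removes the input-dependent component of the signal.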

Inference
With the two components in NVLM, a parser P_{θ1}(y|x) and a joint generative model P_{θ2}(x, y), we can do three types of inference:

• Parsing, where we decode the parser greedily to generate the parse tree for the input sentence;
• Evaluating P_{θ2}(x, y), which is obtained from the joint generative model and can be used to help rerank parsing candidates;
• Estimating P(x) = Σ_{y′ ∈ Y(x)} P(x, y′) and evaluating the model perplexity, which is intractable due to the exponentially large space of Y(x). Similar to Dyer et al. (2016), we use an importance sampling technique to estimate P(x).
Specifically, we use our parser P_{θ1}(y|x) as the proposal distribution. The estimator of P(x) is derived as

  P(x) ≈ (1/K) Σ_{k=1}^K P_{θ2}(x, y^(k)) / P_{θ1}(y^(k)|x),  y^(k) ~ P_{θ1}(y|x).

The subsampling is for the efficiency of evaluating perplexity using importance sampling, which requires sampling multiple (we use 100) parse trees for each sentence. The training of our framework is actually much more scalable than the perplexity evaluation, and is not restricted to the size of the downsampled dataset. Note that our data preprocessing scheme follows the standard in parsing rather than language modeling, since parsing typically requires more information (such as capital letters) and a larger vocabulary. The vocabulary size for text is 26,620, and we have a separate vocabulary of size 74 for the nonterminal nodes of parsing, such as (NP and NNP. Refer to Appendix A for more data preprocessing details.

Tasks. We work on three different tasks: 1) supervised learning: we separately train both the parser and the joint generative model on PTB; 2) distant-supervised learning: we pre-train the parser on PTB, and then fix the parser to train the joint generative model on OBWB, either from scratch or with a PTB warmed-up model; 3) semi-supervised learning: we jointly train the parser and the generative model together on OBWB to fine-tune the language model.

Evaluation. We mainly focus on language modeling, and use per-word perplexity to evaluate our framework and competitor models. As for the parser, we compare with state-of-the-art parsers in terms of training and testing speed.
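The importance sampling estimator can be sketched directly; with the same hypothetical two-tree toy distributions as stand-ins for the neural models, the ratio is constant and the estimate is exact:

```python
import math
import random

def log_px_importance(x, sample_tree, logp_parser, logp_joint, k=100, seed=0):
    # importance sampling with the parser as the proposal:
    #   P(x) ~= (1/K) sum_k P_theta2(x, y^(k)) / P_theta1(y^(k)|x)
    rng = random.Random(seed)
    ratios = []
    for _ in range(k):
        y = sample_tree(x, rng)           # y^(k) ~ P_theta1(y|x)
        ratios.append(math.exp(logp_joint(x, y) - logp_parser(y, x)))
    return math.log(sum(ratios) / k)
```

Per-word perplexity then follows by aggregating −log P(x) over the corpus and dividing by the total token count before exponentiating.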
Baselines. On the PTB dataset, we compare our language model with the following baselines: 1) a Kneser-Ney 5-gram language model; 2) an LSTM language model; 3) a GRU language model implemented by ourselves; 4) recent state-of-the-art language models that also incorporate grammar to improve language modeling, including RNNG, SO-RNNG and GA-RNNG (Dyer et al., 2016; Kuncoro et al., 2017). On the OBWB dataset, since there are no parsing annotations available, we compare with the GRU language model as a strong baseline.
Optimization. All our models are trained on a single NVIDIA GTX 1080 GPU. For all NVLM models, we use Adam optimizer (Kingma and Ba, 2014) for the parser, and standard SGD for the joint generative model. Gradients are clipped at 0.25. See Appendix B for more details.

Supervised Learning
We first experiment on the PTB dataset in the supervised learning setting. We separately train our parser and joint generative model on the PTB training set, and then evaluate our language model on the PTB test set. Table 2 lists the performance of our framework and competitor models. GRU-256 LM is our implemented language model using a 2-layer GRU with hidden size 256, which is also used in other experiments. Parsing annotations are used by RNNG, SO-RNNG, GA-RNNG and NVLM. These grammar-aware models achieve significantly better performance than state-of-the-art sequential RNN-based language models, showing that grammar indeed helps language modeling. NVLM substantially improves over the current state of the art, with a 10% reduction in test perplexity.

Model                               Perplexity
KN-5-gram (Kneser and Ney, 1995)    169.3
LSTM-128 LM (Zaremba et al., 2014)  113.4
GRU-256 LM                          112.3
RNNG (Dyer et al., 2016)            102.4
SO-RNNG (Kuncoro et al., 2017)      101.2
GA-RNNG (Kuncoro et al., 2017)      100.9
NVLM                                 91.6

Table 2: Test perplexity on PTB §23. KN-5-gram refers to the Kneser-Ney 5-gram LM. Note that, since parsing typically requires more information (e.g., capital letters), we follow the standard data preprocessing of syntax-aware language modeling as in Dyer et al. (2016); thus the vocabulary size (~27K) is much larger than the capped vocabulary size (10K) in the standard language modeling setting. Therefore, the perplexity results reported in this paper are not directly comparable to those achieved by syntax-agnostic language models with a much smaller vocabulary, such as the perplexity of 57.3 reported in Merity et al. (2017) and 54.5 reported in Dai et al. (2019). This also applies to the perplexity results on the OBWB dataset.

With respect to our parser, instead of pursuing state-of-the-art parsing performance, it is designed to be light and fast so that it can work efficiently together with the NVLM joint generative model. Our parser achieves 90.7 F1 accuracy on the PTB test set, which is comparable to state-of-the-art parsers. Due to the page limit, we report detailed parsing performance in Appendix C.

Distant-supervised Learning
We then experiment with learning a language model on a new corpus without tree annotations, to verify whether the learned parser can help the language model land softly on the new corpus. We use the subsampled OBWB dataset for model training and evaluation. GRU-256 LM is used as a strong baseline. We have two different settings for both GRU LM and our framework: 1) from-scratch: for GRU LM, we randomly initialize it before training on OBWB; for NVLM, we train the parser on PTB and fix it, and randomly initialize the joint generative model; 2) warmed-up: for GRU LM, we pre-train it on PTB before training on OBWB; for NVLM, we train the parser on PTB and fix it, and pre-train the joint generative model on PTB as warm-up. In both cases, NVLM uses its parser (trained on PTB) to generate parse trees for the OBWB dataset, and trains the joint generative model with these silver-standard parse trees. Figure 2(a) shows the test perplexity curves over training epochs. In both the from-scratch and warmed-up settings, NVLM performs significantly better than GRU LM, with perplexity reductions of 22.7 and 7.8 points respectively. The warmed-up NVLM converges fast and achieves the lowest perplexity. Even when trained from scratch, NVLM achieves better performance than the warmed-up GRU LM, though it takes longer to converge. Note that the warm-up of GRU LM directly trains P(x) with more data, while the warm-up of NVLM is only for P(x, y). This explains why GRU LM seems to benefit more from warm-up at the beginning, and why NVLM from scratch takes longer to converge.
Unlike the supervised learning setting, NVLM can now be trained on a new corpus without parsing annotations while still leveraging the common grammar knowledge. To further study the adaptation speed of NVLM on a new corpus, we train NVLM with varying proportions of the training data in both the from-scratch and warmed-up settings. We also train GRU LM as a strong baseline. Results are reported in Table 3.
As shown in Table 3, with smaller amounts of data, NVLM outperforms GRU LM even more significantly. We find that with only 20% of the training data, the warmed-up NVLM achieves a test perplexity of 140.6, which is comparable to GRU LM trained with full data from scratch (139.3). This demonstrates that our framework is data-efficient and can quickly adapt to a new corpus without parsing annotations. Moreover, we notice that even without looking at the new corpus (0% training data), the warmed-up NVLM achieves a reasonable perplexity (151.8), substantially lower than that of the warmed-up GRU LM (242.2). This agrees with our conjecture that grammar knowledge is sharable among different corpora. Figure 2(b) plots the test perplexity curves using 20% of the OBWB training data. The warmed-up NVLM quickly converges and achieves much lower perplexity than the warmed-up GRU LM.

Semi-supervised Learning
In the distant-supervised setting, the parser is fixed when training NVLM on the new corpus. We can actually continue training the parser together with the generative model, so that the language model can be fine-tuned in an end-to-end fashion. This is essentially semi-supervised learning: the parser has to be updated without parsing annotations, and the generative model is updated together with it. Due to the exponentially large space of parse trees, such joint training is computationally intractable.

Figure 2: Test perplexity curves on the subsampled OBWB dataset. Models are trained on (a) 100% and (b) 20% of the training data, respectively. Randomly initialized models are marked as "scratch", while models pre-trained on the PTB dataset are marked as "warmed".
To tackle this challenge, we use variational EM (Section 3.2) and exploit the policy gradient algorithm (Section 3.3) to co-update both components of NVLM. Technically, when the parser is fixed, the model is also maximizing a lower bound of the data log likelihood; this empirically works well, as shown in the distant-supervised setting. With joint training, the model essentially tries to find a better posterior on the new corpus and maximize a tighter lower bound. Therefore, the data log likelihood in Eq. (8) can be better optimized. In the semi-supervised setting, we use the full OBWB training data (subsampled). As reported in Table 4, NVLM achieves the lowest perplexity (110.2) with joint training. We also evaluate the parser on PTB after joint training. It couldn't get improved since it has

Related Work and Discussion
Due to the remarkable success of RNN-based language models (Mikolov et al., 2010b,a; Jozefowicz et al., 2016; Merity et al., 2017; Elbayad et al., 2018; Gong et al., 2018; Dai et al., 2019), not much attention has been paid to incorporating syntactic knowledge into language models. Although RNN-based models achieve impressive results on language modeling and other NLP tasks such as machine translation (Xia et al., 2016) and parsing (Vinyals et al., 2015; Dyer et al., 2015; Choe and Charniak, 2016), they are far from perfect, since they overlook language structure and simply generate sentences from left to right. The words in natural language are largely organized in latent nested structures rather than a simple sequential order (Chomsky, 2002). Our work is related to syntactic language models, which have a long history. Traditional syntactic language models jointly generate syntactic structure with words using either a bottom-up (Jelinek and Lafferty, 1991; Jelinek, 2004, 2005; Henderson, 2004) or a top-down strategy (Charniak, 2000; Roark, 2001). Recently, some studies have shown the benefits of incorporating language structure into RNN-based language models, such as RNNG (Dyer et al., 2016; Kuncoro et al., 2017). Different from our work, these models mainly focus on parsing instead of language modeling, and cannot be trained without parsing annotations. Works on programming code generation (Rabinovich et al., 2017; Yin and Neubig, 2017) demonstrate that grammar is the key to effective code generation. Compared to programming code, which is more regulated and typically has a well-defined grammar, natural language makes it more challenging to exploit grammar knowledge.
Our work is also related to transfer learning of deep learning models (Bengio, 2012). There are some recent studies on neural language model adaptation (Yoon et al., 2017; Ma et al., 2017; Chen et al., 2015); however, none of them exploits grammar knowledge. Other lines of work address the broad field of general text generation (not language modeling), such as GAN-based methods (Guo et al., 2017; Li et al., 2017) and VAE-based ones (Hu et al., 2017). It is a promising direction to incorporate syntactic knowledge into these generative models. Our work is also inspired by works on syntactically structured RNNs, such as tree LSTMs (Tai et al., 2015), hierarchical RNNs (Chung et al., 2016) and doubly recurrent networks (Alvarez-Melis and Jaakkola, 2016).
Language models are widely used in a broad range of applications. We believe that a high-quality language model can benefit many downstream tasks, such as machine translation, dialogue systems, and speech recognition. We plan to explore whether our framework can be seamlessly used in those applications, and leave it as future work.

Conclusion
In this work, we aim to improve language modeling with shared grammar. Our framework contains two probabilistic components: a constituency parser and a joint generative model. The parser encodes the grammar knowledge in natural language, which helps the language model quickly adapt to a new corpus. We also propose algorithms for jointly training the two components to fine-tune the language model on a new corpus without parsing annotations. Experiments demonstrate that our method improves language modeling on a new corpus in terms of both convergence speed and perplexity.