Dependency Grammar Induction with Neural Lexicalization and Big Training Data

We study the impact of big models (in terms of the degree of lexicalization) and big data (in terms of the training corpus size) on dependency grammar induction. We experimented with L-DMV, a lexicalized version of Dependency Model with Valence (Klein and Manning, 2004) and L-NDMV, our lexicalized extension of the Neural Dependency Model with Valence (Jiang et al., 2016). We find that L-DMV only benefits from very small degrees of lexicalization and moderate sizes of training corpora. L-NDMV can benefit from big training data and lexicalization of greater degrees, especially when enhanced with good model initialization, and it achieves a result that is competitive with the current state-of-the-art.


Introduction
Grammar induction is the task of learning a grammar from a set of unannotated sentences. In the most common setting, the grammar is unlexicalized with POS tags being the tokens, and the training data is the WSJ10 corpus (the Wall Street Journal corpus with sentences no longer than 10 words) containing no more than 6,000 training sentences (Cohen et al., 2008; Berg-Kirkpatrick et al., 2010; Tu and Honavar, 2012).
Lexicalized grammar induction aims to incorporate lexical information into the learned grammar to increase its representational power and improve the learning accuracy. The most straightforward approach to encoding lexical information is full lexicalization (Pate and Johnson, 2016; Spitkovsky et al., 2013). A major problem with full lexicalization is that the grammar becomes much larger and thus learning is more data demanding. To mitigate this problem, Headden III et al. (2009) and Blunsom and Cohn (2010) used partial lexicalization in which infrequent words are replaced by special symbols or their POS tags. Another straightforward way to mitigate the data scarcity problem of lexicalization is to use training corpora larger than the standard WSJ corpus. For example, Pate and Johnson (2016) used two large corpora containing more than 700k sentences; Marecek and Straka (2013) utilized a very large corpus based on Wikipedia in learning an unlexicalized dependency grammar. Finally, smoothing techniques can be used to reduce the negative impact of data scarcity. One example is Neural DMV (NDMV) (Jiang et al., 2016), which incorporates neural networks into DMV and can automatically smooth correlated grammar rule probabilities.
Inspired by this background, we conduct a systematic study regarding the impact of the degree of lexicalization and the training data size on the accuracy of grammar induction approaches. We experimented with a lexicalized version of Dependency Model with Valence (L-DMV) (Klein and Manning, 2004) and our lexicalized extension of NDMV (L-NDMV). We find that L-DMV only benefits from very small degrees of lexicalization and moderate sizes of training corpora. In comparison, L-NDMV can benefit from big training data and lexicalization of greater degrees, especially when it is enhanced with good model initialization. The performance of L-NDMV is competitive with the current state-of-the-art.

Lexicalized DMV
We choose to lexicalize an extended version of DMV (Gillenwater et al., 2010). We adopt a similar approach to that of Spitkovsky et al. (2013) and Blunsom and Cohn (2010) and represent each token as a word/POS pair. If a pair appears infrequently in the corpus, we simply ignore the word and represent it only with the POS tag. We control the degree of lexicalization by replacing words that appear fewer than a cutoff number of times in the WSJ10 corpus with their POS tags. With a very large cutoff number, the grammar is virtually unlexicalized; as the cutoff number becomes smaller, the grammar becomes closer to being fully lexicalized. Note that our method is different from previous practice that simply replaces rare words with a special "unknown" symbol (Headden III et al., 2009). Using POS tags instead of the "unknown" symbol to represent rare words can be helpful in the neural approach introduced below in that the learned word vectors are more informative.
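The cutoff-based lexicalization described above can be illustrated with a minimal sketch. The function name `lexicalize`, the toy corpus, and the choice to count word/POS pairs (rather than bare word forms) are assumptions made for illustration only; the text does not specify these details.

```python
from collections import Counter

def lexicalize(corpus, cutoff):
    """Replace infrequent word/POS pairs with their POS tag.

    corpus: list of sentences, each a list of (word, tag) tuples.
    cutoff: pairs seen fewer than `cutoff` times are reduced to the tag.
    Counting pairs instead of bare words is an assumption of this sketch.
    """
    pair_counts = Counter(pair for sent in corpus for pair in sent)
    return [
        [f"{word}/{tag}" if pair_counts[(word, tag)] >= cutoff else tag
         for word, tag in sent]
        for sent in corpus
    ]

# Hypothetical toy corpus, not taken from WSJ10.
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

lexicalized = lexicalize(toy_corpus, cutoff=2)         # partially lexicalized
unlexicalized = lexicalize(toy_corpus, cutoff=100000)  # virtually unlexicalized
```

With a cutoff of 2, frequent pairs such as the/DT stay lexicalized while singletons such as barks/VBZ fall back to VBZ; with a very large cutoff every token is a bare POS tag.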

Lexicalized NDMV
With a larger degree of lexicalization, the grammar contains more tokens and hence more parameters (i.e., grammar rule probabilities), which require more data for accurate learning. Smoothing is a useful technique to reduce the demand for data in this case. Here we employ a neural approach to smoothing. Specifically, we propose a lexicalized extension of neural DMV (Jiang et al., 2016) and we call the resulting approach L-NDMV.
Extended Model: The model structure of L-NDMV is similar to that of NDMV except for the representations of the head and the child of the CHILD and DECISION rules. The network structure for predicting the probabilities of CHILD rules $[p_{c_1}, p_{c_2}, \ldots, p_{c_m}]$ ($m$ is the vocabulary size; $c_i$ is the $i$-th token) and DECISION rules $[p_{\text{stop}}, p_{\text{continue}}]$ given the head word, head POS tag, direction and valence is shown in Figure 1. We denote the input continuous representations of the head word, head POS tag and valence by $v_{\text{word}}$, $v_{\text{tag}}$ and $v_{\text{val}}$ respectively. By concatenating these vectors we get the input representation to the neural network, $[v_{\text{word}}; v_{\text{tag}}; v_{\text{val}}]$. We map the input representation to the hidden layer $f$ using the direction-specific weight matrix $W_{\text{dir}}$ and the ReLU activation function. We represent all the child tokens with matrix $W_{\text{chd}} = [W_{\text{word}}, W_{\text{tag}}]$, which contains two parts: child word matrix $W_{\text{word}} \in \mathbb{R}^{m \times k}$ and child POS tag matrix $W_{\text{tag}} \in \mathbb{R}^{m \times k'}$, where $k$ and $k'$ are the prespecified dimensions of output word vectors and tag vectors respectively. The $i$-th rows of $W_{\text{word}}$ and $W_{\text{tag}}$ represent the output continuous representations of the $i$-th word and its POS tag respectively. Note that for two words with the same POS tag, the corresponding POS tag representations are the same. We take the product of $f$ and the child matrix $W_{\text{chd}}$ and apply a softmax function to obtain the CHILD rule probabilities. For DECISION rules, we replace $W_{\text{chd}}$ with the decision weight matrix $W_{\text{dec}}$ and follow the same procedure.
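The forward computation described in this paragraph can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the random weights, the valence vector dimension, the concatenation order of the input vectors, and the hidden size being equal to k + k' (needed here so that the product with the child matrix is defined) are all assumptions.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = np.exp(scores - scores.max())
    return shifted / shifted.sum()

def rule_probs(v_word, v_tag, v_val, W_dir, W_out):
    """Concatenate the head inputs, apply a ReLU hidden layer, then softmax."""
    x = np.concatenate([v_word, v_tag, v_val])  # input order is an assumption
    f = np.maximum(0.0, W_dir @ x)              # direction-specific weights + ReLU
    return softmax(W_out @ f)                   # W_out is W_chd or W_dec

rng = np.random.default_rng(0)
m, num_tags = 203, 35                 # toy vocabulary size and tag count
d_word, d_tag, d_val = 100, 20, 20    # d_val is a hypothetical choice
k_word, k_tag = 100, 20               # output word / tag vector dimensions
hidden = k_word + k_tag               # assumed so that W_chd @ f is defined

# Tokens with the same POS tag share one tag row, as the text requires.
token_tag_ids = rng.integers(0, num_tags, size=m)
tag_vectors = rng.normal(scale=0.1, size=(num_tags, k_tag))
W_word = rng.normal(scale=0.1, size=(m, k_word))
W_tag = tag_vectors[token_tag_ids]
W_chd = np.hstack([W_word, W_tag])    # shape (m, k + k')
W_dec = rng.normal(scale=0.1, size=(2, hidden))
W_dir = {d: rng.normal(scale=0.1, size=(hidden, d_word + d_tag + d_val))
         for d in ("left", "right")}

v_word = rng.normal(size=d_word)
v_tag = rng.normal(size=d_tag)
v_val = rng.normal(size=d_val)
p_child = rule_probs(v_word, v_tag, v_val, W_dir["left"], W_chd)
p_decision = rule_probs(v_word, v_tag, v_val, W_dir["left"], W_dec)
```

Building `W_tag` by indexing a shared table of tag vectors is one simple way to guarantee that two words with the same POS tag receive identical tag representations.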
Extended Learning Algorithm: The original NDMV learning method is based on hard-EM and is very time-consuming when applied to L-NDMV with a large training corpus. We propose two improvements to achieve significant speedup. First, at each EM iteration we collect grammar rule counts from a different batch of sentences instead of from the whole training corpus and train the neural network using only these counts. Second, we train the same neural network across EM iterations without resetting. More details can be found in the supplementary material. Our algorithm can be seen as an extension of online EM (Liang and Klein, 2009) to accommodate neural network training.
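The schedule implied by these two improvements can be sketched as a short loop. Only the control flow is shown; `parse_batch` (the hard E-step that parses a batch and returns rule counts) and `update_model` (the M-step that continues training the network) are hypothetical placeholders passed in by the caller, and the sequential, wrap-around batching is an assumption.

```python
def online_hard_em(corpus, model, parse_batch, update_model,
                   batch_size, num_iterations):
    """Run hard-EM where each iteration sees one batch and the model persists."""
    for iteration in range(num_iterations):
        # A different batch of sentences at each EM iteration.
        start = (iteration * batch_size) % len(corpus)
        batch = corpus[start:start + batch_size]
        # Hard E-step: rule counts from the best parses of this batch only.
        counts = parse_batch(model, batch)
        # M-step: keep training the same model, with no reset between iterations.
        model = update_model(model, counts)
    return model

# Toy stand-ins so the schedule can be observed; they do no real parsing.
toy_corpus = list(range(10))
seen_batches = []

def toy_parse_batch(model, batch):
    seen_batches.append(list(batch))
    return len(batch)

def toy_update_model(model, counts):
    return model + counts

final_model = online_hard_em(toy_corpus, 0, toy_parse_batch, toy_update_model,
                             batch_size=4, num_iterations=3)
```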

Model Initialization
It was previously shown that the heuristic KM initialization method by Klein and Manning (2004) does not work well for lexicalized grammar induction (Headden III et al., 2009; Pate and Johnson, 2016) and it is very helpful to initialize learning with a model learned by a different grammar induction method (Le and Zuidema, 2015; Jiang et al., 2016). We tested both KM initialization and the following initialization method: we first learn an unlexicalized DMV using the grammar induction method of Naseem et al. (2010) and use it to parse the training corpus; then, from the parse trees we run maximum likelihood estimation to produce the initial lexicalized model.
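The maximum likelihood step of this initialization amounts to relative-frequency estimation over the parsed corpus. The sketch below covers CHILD rules only and conditions on head token and direction; valence conditioning and DECISION rules are omitted for brevity. The function name, the head-index encoding (-1 for the root), and the toy parses are assumptions for illustration.

```python
from collections import Counter, defaultdict

def estimate_child_probs(parsed_corpus):
    """Relative-frequency estimates of P(child | head, direction).

    parsed_corpus: list of (tokens, heads) pairs, where heads[i] is the index
    of the head of token i, or -1 if token i is the root.
    """
    counts = defaultdict(Counter)
    for tokens, heads in parsed_corpus:
        for child_idx, head_idx in enumerate(heads):
            if head_idx < 0:
                continue  # the root attachment is skipped in this sketch
            direction = "left" if child_idx < head_idx else "right"
            counts[(tokens[head_idx], direction)][tokens[child_idx]] += 1
    return {
        context: {child: n / sum(children.values())
                  for child, n in children.items()}
        for context, children in counts.items()
    }

# Hypothetical parses over partially lexicalized tokens.
toy_parses = [
    (["the/DT", "dog/NN", "barks/VBZ"], [1, 2, -1]),
    (["the/DT", "NN", "barks/VBZ"], [1, 2, -1]),
]
child_probs = estimate_child_probs(toy_parses)
```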

Experimental Setup
For English, we used the BLLIP corpus¹ in addition to the regular WSJ corpus in our experiments. Note that the BLLIP corpus is collected from the same news article source as the WSJ corpus, so it is in-domain and is ideal for training grammars to be evaluated on the WSJ test set. In order to solve the compatibility issue as well as improve the POS tagging accuracy, we used the Stanford tagger (Toutanova et al., 2003) to retag the BLLIP corpus and selected the sentences for which the new tags are consistent with the original tags, which resulted in 182244 sentences with length less than or equal to 10 after removing punctuation. We used this subset of BLLIP and sections 2-21 of WSJ10 for training, section 22 of WSJ for validation and section 23 of WSJ for testing. We used training sets of four different sizes: WSJ10 only (5779 sentences) and 20k, 50k, and all sentences from the BLLIP subset. For Chinese, we obtained 4762 sentences for training from Chinese Treebank 6.0 (CTB) after converting data to dependency structures via Penn2Malt (Nivre, 2006) and then stripping off punctuation. We used the recommended validation and test data split described in the documentation.
We trained the models with different degrees of lexicalization. We control the degree of lexicalization by replacing words that appear fewer than a cutoff number of times in the WSJ10 or CTB corpus with their POS tags. For each degree of lexicalization, we tuned the dimension of the hidden layer of the neural network on the validation dataset. For English, we tested nine word cutoff numbers: 100000, 500, 200, 100, 80, 70, 60, 50, and 40, which resulted in vocabulary sizes of 35, 63, 98, 166, 203, 226, 267, 306, and 390 respectively; for Chinese, the word cutoff numbers are 100000, 100, 70, 50, 40, 30, 20, 12, and 10. Ideally, with higher degrees of lexicalization, the hidden layer dimension should be larger in order to accommodate the increased number of tokens. For the neural network of L-NDMV, we initialized the word and tag vectors in the neural network by learning a CBOW model using the Gensim package (Řehůřek and Sojka, 2010). We set the dimension of input and output word vectors to 100 and the dimension of input and output tag vectors to 20. We trained the neural network with learning rate 0.03, mini-batch size 200 and momentum 0.9. Because some of the neural network weights are randomly initialized, the model converges to a different local minimum in each run of the learning algorithm. Therefore, for each setup we ran our learning algorithm three times and reported the average accuracy. More details of the experimental setup can be found in the supplementary material.
¹ Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1

Results on English
Figure 2(a) shows the directed dependency accuracy (DDA) of the learned lexicalized DMV with KM initialization. It can be seen that on the smallest WSJ10 training corpus, lexicalization improves learning only when the degree of lexicalization is small; with further lexicalization, the learning accuracy significantly degrades. On the three larger training corpora, the impact of lexicalization on the learning accuracy is still negative but is less severe. Overall, lexicalization seems to be very data demanding and even our largest training corpora could not bring about the benefit of lexicalization. Increasing the training corpus size is helpful regardless of the degree of lexicalization, but the learning accuracies with the 50k dataset are almost identical to those with the full dataset, suggesting diminishing returns from more data.
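For reference, directed dependency accuracy is the fraction of tokens whose predicted head matches the gold head. The sketch below assumes each parse is given as a list of head indices with -1 marking the root, and that root attachments count toward the score; neither convention is specified in the text, and the example parses are hypothetical.

```python
def directed_dependency_accuracy(predicted, gold):
    """Fraction of tokens whose predicted head index equals the gold head index."""
    correct = 0
    total = 0
    for pred_heads, gold_heads in zip(predicted, gold):
        if len(pred_heads) != len(gold_heads):
            raise ValueError("predicted and gold parses differ in length")
        correct += sum(p == g for p, g in zip(pred_heads, gold_heads))
        total += len(gold_heads)
    return correct / total

# Two toy sentences: the first is parsed perfectly, the second entirely wrong.
gold_parses = [[1, 2, -1], [1, -1]]
predicted_parses = [[1, 2, -1], [-1, 0]]
dda = directed_dependency_accuracy(predicted_parses, gold_parses)
```

Here three of the five tokens receive the correct head, so the score is 3/5.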
Figure 2(b) shows the results of L-NDMV with KM initialization. The parsing accuracy is improved under all the settings, showing the advantage of NDMV. The range of lexicalization degrees that improve learning becomes larger, and the degradation in accuracy with large degrees of lexicalization becomes much less severe. The diminishing returns from big data seen in Figure 2(a) can still be observed.
Figure 2(c) shows the results of L-NDMV with the initialization method described in section 2.3. It can be seen that lexicalization becomes less data demanding and the learning accuracy does not decrease until the highest degrees of lexicalization. Larger training corpora now lead to significantly better learning accuracy and support lexicalization of greater degrees than smaller corpora. Diminishing returns from big data are no longer observed, which suggests that accuracy would increase further with even more data. Table 1 compares the result of L-NDMV (with the largest corpus and the vocabulary size of 203, which was selected on the validation set) with previous approaches to dependency grammar induction. It can be seen that L-NDMV is competitive with previous state-of-the-art approaches. We provide further analysis of the learned word vectors in L-NDMV in the supplementary material.

Results on Chinese
Figure 2(d) shows the results of the three approaches on the Chinese treebank. Because the corpus is relatively small, we did not study the impact of the corpus size. Similar to the case of English, the accuracy of lexicalized DMV degrades with more lexicalization. However, the accuracy with L-NDMV increases significantly with more lexicalization even without good model initialization. Adding good initialization further boosts the performance of L-NDMV, but the benefit of lexicalization is less significant (from 0.55 to 0.58).

Effect of Grammar Rule Probability Initialization
In Figure 3, we compare four initialization methods for L-NDMV: uniform initialization, random initialization, KM initialization (Klein and Manning, 2004), and good initialization as described in section 2.3. Here we trained the L-NDMV model on the WSJ10 corpus with the same experimental setup as in section 3. Again, we find that good initialization leads to better performance than KM initialization, and both good initialization and KM initialization are significantly better than random and uniform initialization. Note that our results are different from those by Pate and Johnson (2016), who found that uniform initialization leads to similar performance to KM initialization. We speculate that this is because of the difference in the learning approaches (we use neural networks, which may be more sensitive to initialization) and in the training and test corpora (we use news articles while they use telephone scripts).

Conclusion and Future Work
We study the impact of the degree of lexicalization and the training data size on the accuracy of dependency grammar induction. We experimented with lexicalized DMV (L-DMV) and our lexicalized extension of Neural DMV (L-NDMV). We find that L-DMV only benefits from very small degrees of lexicalization and moderate sizes of training corpora. In contrast, L-NDMV can benefit from big training data and lexicalization of greater degrees, especially when enhanced with good model initialization, and it achieves a result that is competitive with the state-of-the-art.
In the future, we plan to study higher degrees of lexicalization or full lexicalization, as well as even larger training corpora (such as the Wikipedia corpus). We would also like to experiment with other grammar induction approaches with lexicalization and big training data.

Figure 1: The structure of the neural networks in the L-NDMV model. It predicts the probabilities of the CHILD rules and DECISION rules.

Figure 2: The impact of the training corpus size and the degree of lexicalization on L-DMV and L-NDMV with different initialization methods on English and Chinese.

Figure 3: Comparison of four initialization methods to L-NDMV: uniform initialization, random initialization, KM initialization and good initialization.

Table 1: Comparison of recent grammar induction systems.