Generative Incremental Dependency Parsing with Neural Networks

We propose a neural network model for scalable generative transition-based dependency parsing. A probability distribution over both sentences and transition sequences is parameterised by a feed-forward neural network. The model surpasses the accuracy and speed of previous generative dependency parsers, reaching 91.1% UAS. Perplexity results show a strong improvement over n-gram language models, opening the way to the efficient integration of syntax into neural models for language generation.


Introduction
Transition-based dependency parsers that perform incremental local inference with a discriminative classifier offer an appealing trade-off between speed and accuracy (Nivre, 2008; Zhang and Nivre, 2011; Choi and McCallum, 2013). Recently neural network transition-based dependency parsers have been shown to give state-of-the-art performance (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015). However, the downstream integration of syntactic structure in language understanding and generation tasks is often done heuristically.
Neural networks have also been shown to be powerful generative models for language modelling (Bengio et al., 2003; Mikolov et al., 2010) and machine translation (Kalchbrenner and Blunsom, 2013; Devlin et al., 2014; Sutskever et al., 2014). However, currently these models lack awareness of syntax, which limits their ability to include longer-distance dependencies even when potentially unbounded contexts are used.
In this paper we propose a generative model for incremental parsing that offers an efficient way to incorporate syntactic information into a generative model. It relies on the strength of neural networks to overcome sparsity in the long conditioning contexts required for an accurate model, while also offering a principled approach to learn dependency-based word representations (Levy and Goldberg, 2014; Bansal et al., 2014).
Generative models for graph-based dependency parsing (Eisner, 1996; Wallach et al., 2008) are much less accurate than their discriminative counterparts. Syntactic language models based on PCFGs (Roark, 2001; Charniak, 2001) and incremental parsing (Chelba and Jelinek, 2000; Emami and Jelinek, 2005) have been proposed for speech recognition and machine translation. However, these models are also limited in either scalability, expressiveness, or both. A generative transition-based dependency parser based on recurrent neural networks (Titov and Henderson, 2007) obtains high accuracy, but training and decoding are prohibitively expensive.
We perform efficient linear-time decoding with a particle filtering-based beam-search method where derivations are pruned after every word generation and the beam size depends on the uncertainty in the model (Buys and Blunsom, 2015).
The model obtains 91.1% UAS on the WSJ, which is 0.2% UAS better than the previous highest-accuracy generative dependency parser (Titov and Henderson, 2007), while also being much more efficient. As a language model its perplexity reaches 111.8, a 23% reduction over an n-gram baseline, when combining supervised training with unsupervised fine-tuning. Finally, we find that the model is able to generate sentences that display both local and syntactic coherence.

Generative Transition-based Parsing
Our parsing model is based on transition-based arc-standard projective dependency parsing (Nivre and Scholz, 2004). The generative formulation is similar to previous generative transition-based parsers (Titov and Henderson, 2007; Cohen et al., 2011; Buys and Blunsom, 2015), and also related to the joint tagging and parsing model of Bohnet and Nivre (2012).
The model predicts a sequence of parsing transitions: A shift transition generates a word (and its POS tag), while a reduce transition adds an arc (i, l, j), where i is the head node, j the dependent and l is the dependency label.
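The joint probability distribution over a sentence with words $w_{1:n}$, tags $t_{1:n}$ and a transition sequence $a_{1:2n}$ is defined as

$$p(w_{1:n}, t_{1:n}, a_{1:2n}) = \prod_{i=1}^{n} \Big( p(t_i \mid h_{m_i}) \, p(w_i \mid t_i, h_{m_i}) \prod_{j=m_i+1}^{m_{i+1}} p(a_j \mid h_j) \Big),$$

where $m_i$ is the number of transitions that have been performed when $(t_i, w_i)$ is shifted and $h_j$ is the conditioning context at the $j$th transition.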
A parser configuration $(\sigma, \beta, A)$ for sentence $s$ consists of a stack $\sigma$ of indices in $s$, an index $\beta$ to the next word to be generated, and a set of arcs $A$. The stack elements are referred to as $\sigma_1, \ldots, \sigma_{|\sigma|}$, where $\sigma_1$ is the top element. For any node $a$, $lc_1(a)$ refers to the leftmost child of $a$ in $A$, and $rc_1(a)$ to its rightmost child. A root node is added to the beginning of the sentence, and the head word of the sentence (we assume there is only one) is the dependent of the root.
The transition types are shift, left-arc and right-arc. Shift generates the next word of the sentence and pushes it on the stack. Left-arc adds an arc $(\sigma_1, l, \sigma_2)$ and removes $\sigma_2$ from the stack. Right-arc adds $(\sigma_2, l, \sigma_1)$ and pops $\sigma_1$.
The parsing strategy adds arcs bottom-up. In a valid transition sequence the last transition is a right-arc from the root to the head word, and the root node is not involved in any other dependencies. We use an oracle to extract transition sequences from the training data: The oracle prefers reduce over shift transitions when both may lead to a valid derivation.
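The three transitions can be illustrated with a minimal Python sketch. It assumes that the root is index 0 and sits on the stack in the initial configuration; the class and function names, the toy sentence and the labels are hypothetical, and validity checks (such as keeping the root out of left-arcs) are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Parser configuration (stack, next-word index, arcs); index 0 is the root."""
    stack: list = field(default_factory=lambda: [0])  # top of the stack is stack[-1]
    next_word: int = 1                                # index of the next word to generate
    arcs: set = field(default_factory=set)            # arcs stored as (head, label, dependent)

    def shift(self):
        # Generate the next word and push its index onto the stack.
        self.stack.append(self.next_word)
        self.next_word += 1

    def left_arc(self, label):
        # Add (sigma_1, label, sigma_2) and remove sigma_2, the second element from the top.
        head, dep = self.stack[-1], self.stack[-2]
        self.arcs.add((head, label, dep))
        del self.stack[-2]

    def right_arc(self, label):
        # Add (sigma_2, label, sigma_1) and pop sigma_1, the top element.
        head, dep = self.stack[-2], self.stack[-1]
        self.arcs.add((head, label, dep))
        self.stack.pop()


def run(transitions):
    """Apply a list of (name, label) transitions to the initial configuration."""
    config = Configuration()
    for name, label in transitions:
        if name == "shift":
            config.shift()
        elif name == "left-arc":
            config.left_arc(label)
        else:
            config.right_arc(label)
    return config


# Toy sentence "dogs bark" (indices 1 and 2), with "bark" as the head word.
example = run([("shift", None), ("shift", None),
               ("left-arc", "nsubj"), ("right-arc", "root")])
```

For this two-word sentence the derivation uses four transitions, and the final transition is the right-arc from the root to the head word, as required of a valid sequence.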

Neural Network Model
Our probability model is based on neural network language models with distributed representations (Bengio et al., 2003; Mnih and Hinton, 2007), as well as feed-forward neural network models for transition-based dependency parsing (Chen and Manning, 2014; Weiss et al., 2015). We estimate the distributions $p(t_i \mid h_i)$, $p(w_i \mid t_i, h_i)$ and $p(a_j \mid h_j)$ with neural networks with shared input and hidden layers but separate output layers.
The templates for the conditioning context used are defined in Table 1. In the templates we obtain sentence indexes, which are then mapped to the corresponding words, tags and labels (for the dependencies of 2nd and 3rd order elements). The neural network allows us to include a large number of elements without suffering from sparsity.
In the input layer we make use of additive representations (Botha and Blunsom, 2014) so that for each word input position $i$ we can include the word type, tag and other features, and learn input representations for each of these. Each context feature $f$ has an input representation $q_f \in \mathbb{R}^D$. The composite representation is computed as

$$q_i = \sum_{f \in \mu(i)} q_f,$$

where $\mu(i)$ denotes the set of features at position $i$. The hidden layer is then defined as

$$\phi(h) = g\Big( \sum_{j=1}^{L} C_j q_{h_j} \Big),$$

where $C_j \in \mathbb{R}^{D \times D}$ are transformation matrices defined for each position in sequence $h$, $L = |h|$ and $g$ is a (usually non-linear) activation function applied element-wise. The matrices $C_j$ can be restricted to be diagonal to reduce the number of model parameters and speed up the model by avoiding expensive matrix multiplications.
For the output layer predicting the next transition $a$, the hidden layer is mapped with a scoring function

$$\chi(a, h) = k_a^T \phi(h) + e_a,$$

where $k_a$ is the transition output representation and $e_a$ is the bias weight. The score is normalised with the soft-max function:

$$p(a \mid h) = \frac{\exp(\chi(a, h))}{\sum_{a'} \exp(\chi(a', h))}.$$
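The input, hidden and transition output layers can be sketched in plain Python as follows. The sketch assumes that the composite representation is the sum of the feature vectors, that the hidden layer applies $g$ to the sum of the transformed position representations, and that the matrices $C_j$ are diagonal (stored as vectors); the function names and toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def composite(feature_vectors):
    """Additive input representation: element-wise sum of the feature vectors q_f."""
    return [sum(values) for values in zip(*feature_vectors)]


def hidden_layer(inputs, diagonals, g=sigmoid):
    """Apply g element-wise to sum_j C_j q_j, with each diagonal C_j stored as a vector."""
    dim = len(inputs[0])
    total = [0.0] * dim
    for q, c in zip(inputs, diagonals):
        for d in range(dim):
            total[d] += c[d] * q[d]
    return [g(x) for x in total]


def softmax(scores):
    """Normalise scores into probabilities, subtracting the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def transition_distribution(phi, output_vectors, biases):
    """p(a | h) from the scores k_a . phi(h) + e_a of each transition a."""
    scores = [sum(k_d * p_d for k_d, p_d in zip(k, phi)) + e
              for k, e in zip(output_vectors, biases)]
    return softmax(scores)


# Toy example with D = 2 and two context positions (all values invented).
q1 = composite([[0.5, -0.1], [0.1, 0.3]])   # word and tag features at position 1
q2 = composite([[0.2, 0.4]])                # a single feature at position 2
phi = hidden_layer([q1, q2], diagonals=[[1.0, 1.0], [0.5, 0.5]])
probs = transition_distribution(phi, [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]], [0.0, 0.0, 0.0])
```

With diagonal matrices the hidden layer costs one multiplication per dimension and position instead of a full matrix-vector product, which is the source of the speed-up mentioned above.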
The output layer for predicting the next tag has a similar form, using the scoring function $\tau(t, h) = t_t^T \phi(h) + o_t$ for tag representation $t_t$ and bias $o_t$. The probability $p(w \mid t, h)$ can be estimated similarly. However, to reduce the computational cost of normalising over the entire vocabulary, we factorize the probability as $p(w \mid t, h) = p(c \mid t, h) \, p(w \mid c, t, h)$, where $c = c(w)$ is the unique class of word $w$. For each $c$, let $\Gamma(c)$ be the set of words in that class. The vocabulary is clustered into approximately $\sqrt{|V|}$ classes using Brown clustering (Brown et al., 1992), reducing the number of items to sum over in the normalisation factor from $O(|V|)$ to $O(\sqrt{|V|})$. Class-based factorization has been shown to be an effective strategy in normalizing neural language models (Baltescu and Blunsom, 2015). The class prediction score is defined as $\psi(c, h) = s_c^T \phi(h) + d_c$, where $s_c \in \mathbb{R}^D$ is the output weight vector for class $c$ and $d_c$ is the class bias weight. The output layer then consists of a softmax function for $p(c \mid h)$ and another softmax for the word prediction,

$$p(w \mid c, h) = \frac{\exp(\Phi(w, h))}{\sum_{w' \in \Gamma(c)} \exp(\Phi(w', h))},$$

where $\Phi(w, h) = r_w^T \phi(h) + b_w$ is the word scoring function with output word representation $r_w$ and bias weight $b_w$.
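The class-based factorization can be sketched as follows: one softmax over the classes, then one softmax over the words of the chosen class only. The data layout (dictionaries of weight vector and bias pairs), the function names and the toy vocabulary are assumptions made for illustration, and the tag is left out of the conditioning for brevity.

```python
import math


def softmax(scores):
    """Normalise scores into probabilities, subtracting the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def word_probability(word, phi, word_class, class_members, class_params, word_params):
    """Return p(c | h) * p(w | c, h), where c is the unique class of the word."""
    c = word_class[word]
    classes = sorted(class_params)
    # Softmax over all classes, using class weights s_c and biases d_c.
    class_scores = [dot(class_params[k][0], phi) + class_params[k][1] for k in classes]
    p_class = softmax(class_scores)[classes.index(c)]
    # Softmax over the words in class c only, using word weights r_w and biases b_w.
    members = class_members[c]
    word_scores = [dot(word_params[w][0], phi) + word_params[w][1] for w in members]
    p_word = softmax(word_scores)[members.index(word)]
    return p_class * p_word


# Toy vocabulary of four words in two classes (all parameters invented).
class_members = {0: ["the", "a"], 1: ["dog", "cat"]}
word_class = {w: c for c, ws in class_members.items() for w in ws}
class_params = {0: ([0.3, -0.2], 0.1), 1: ([-0.1, 0.4], 0.0)}
word_params = {"the": ([0.2, 0.1], 0.0), "a": ([0.0, 0.3], -0.1),
               "dog": ([0.5, -0.5], 0.2), "cat": ([0.1, 0.1], 0.0)}
phi = [0.7, 0.4]
total = sum(word_probability(w, phi, word_class, class_members, class_params, word_params)
            for w in word_class)
```

Each prediction normalises over the number of classes plus the size of one class, rather than over the whole vocabulary, and the factored probabilities still sum to one over all words.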
The model is trained with minibatch stochastic gradient descent (SGD) with Adagrad (Duchi et al., 2011) and L2 regularisation, to minimise the negative log likelihood of the joint distribution over parsed training sentences. For our experiments we train the model while the training objective improves, and choose the parameters of the iteration with the best development set accuracy (early stopping). The model obtains high accuracy with only a few training iterations.
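A single Adagrad update with an L2 penalty can be sketched as below. This is a generic form of the update (Duchi et al., 2011), not necessarily the exact variant in our implementation: how the regularisation parameter is scaled per minibatch is not specified here, so `l2` is a hypothetical per-parameter coefficient, and `eps` is an assumed stabilising constant.

```python
import math


def adagrad_step(params, grads, accum, learning_rate=0.05, l2=0.0, eps=1e-8):
    """One in-place Adagrad update; `accum` holds the running sum of squared gradients."""
    for i, (theta, grad) in enumerate(zip(params, grads)):
        grad = grad + l2 * theta            # gradient of the penalty (l2 / 2) * theta ** 2
        accum[i] += grad * grad
        params[i] = theta - learning_rate * grad / (math.sqrt(accum[i]) + eps)


# Toy objective f(x) = (x - 3) ** 2 with gradient 2 * (x - 3).
x, history = [0.0], [0.0]
for _ in range(10):
    adagrad_step(x, [2.0 * (x[0] - 3.0)], history)
```

Because each parameter's step is divided by the root of its accumulated squared gradients, rarely updated parameters (such as representations of infrequent words) keep a larger effective learning rate.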

Decoding
Beam-search decoders for transition-based parsing (Zhang and Clark, 2008) keep a beam of partial derivations, advancing each derivation by one transition at a time. When the size of the beam exceeds a set threshold, the lowest-scoring derivations are removed. However, in an incremental generative model we need to compare derivations with the same number of words shifted, rather than transitions performed. To let the decoding time remain linear, we also need to bound the total number of reduce transitions that can be performed over all derivations between two shift transitions.
To achieve this, we use a decoding method recently proposed for generative incremental parsing (Buys and Blunsom, 2015) based on particle filtering (Doucet et al., 2001), a sequential Monte Carlo sampling method.
In the algorithm, a fixed number of particles are divided among the partial derivations in the beam. Suppose i words have been shifted in all the derivations on the beam. To predict the next transition from derivation d j , its particles are divided according to p(a|h). In practice, adding only shift and the most likely reduce transition leads to almost no accuracy loss. After all the derivations have been advanced to shift word i + 1, a selection step is performed: The number of particles of each derivation is redistributed according to its probability, weighted by its current number of particles. Some derivations may be assigned 0 particles, in which case they are removed.
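The selection step can be sketched as follows. The sketch weights each derivation by its probability times its current particle count, as described above, but allocates particles deterministically with largest-remainder rounding; this rounding is an assumption made to keep the example reproducible and stands in for a sampling-based resampling step. The function name and toy beam are hypothetical.

```python
def redistribute(derivations, total_particles):
    """Reallocate particles in proportion to (derivation probability * current particles).

    `derivations` is a list of (name, probability, particles) tuples. Derivations that
    receive no particles are dropped from the returned beam.
    """
    weights = [prob * count for _, prob, count in derivations]
    norm = sum(weights)
    shares = [total_particles * w / norm for w in weights]
    counts = [int(s) for s in shares]                  # round down first
    # Hand the remaining particles to the derivations with the largest fractional parts.
    leftover = total_particles - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return [(name, prob, n) for (name, prob, _), n in zip(derivations, counts) if n > 0]


# Toy beam of three derivations sharing 100 particles (probabilities invented).
beam = redistribute([("d1", 0.6, 50), ("d2", 0.3, 30), ("d3", 0.001, 20)], 100)
```

In this toy beam the low-probability derivation `d3` ends up with no particles and is pruned, while the total number of particles stays fixed at 100.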
The particle filtering method lets the beam size depend on the uncertainty of the model, somewhat similar to Choi and McCallum (2013), while fixing the total number of particles constrains the decoding time to be linear. The particle filter also allows us to sample outputs, and to marginalise over the syntax when generating.

Experiments
We evaluate our model for parsing and language modelling on the English Penn Treebank (Marcus et al., 1993) WSJ parsing setup. Constituency trees are converted to projective CoNLL syntactic dependencies (Johansson and Nugues, 2007) with the LTH converter. For some experiments we also use the Stanford dependency representation (SD) (De Marneffe and Manning, 2008). Our neural network implementation is partly based on the OxLM neural language modelling framework (Baltescu et al., 2014). The model parameters are initialised randomly by drawing from a Gaussian distribution with mean 0 and variance 0.1, except for the bias weights, which are initialised by the unigram distributions of their output. We use minibatches of size 128, an L2 regularization parameter of 10, and word representations and a hidden layer of size 256. The Adagrad learning rate is initialised to 0.05.
POS tags for the development and test sets are obtained with the Stanford POS tagger (Toutanova et al., 2003), with 97.5% test set accuracy. Words that occur only once in the training data are treated as unknown words. Unknown words are replaced by tokens representing morphological surface features (based on capitalization, numbers, punctuation and common suffixes) similar to those used in the implementation of generative constituency parsers.

Parsing results
We report unlabelled attachment score (UAS) and labelled attachment score (LAS) in our results, excluding punctuation. On the development set, we consider the effect of the choice of activation function (Table 2), finding that a sigmoid activation (logistic function) performs best, followed by tanh. Under our training setup the model can obtain up to 91.0 UAS after only 1 training iteration, thereby performing pure online learning.
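The two metrics can be computed as in the sketch below. Identifying punctuation by a fixed set of POS tags is an assumption made for illustration (the exact criterion is not specified here), and the function name and toy sentence are hypothetical.

```python
def attachment_scores(gold, predicted,
                      punctuation_tags=frozenset({",", ".", ":", "``", "''"})):
    """Return (UAS, LAS) in percent over tokens whose gold tag is not punctuation.

    `gold` and `predicted` are parallel lists of (tag, head, label) triples, one per token.
    """
    total = correct_heads = correct_labelled = 0
    for (tag, g_head, g_label), (_, p_head, p_label) in zip(gold, predicted):
        if tag in punctuation_tags:
            continue
        total += 1
        if g_head == p_head:
            correct_heads += 1            # counts towards UAS
            if g_label == p_label:
                correct_labelled += 1     # head and label both correct: counts towards LAS
    return 100.0 * correct_heads / total, 100.0 * correct_labelled / total


# Toy five-token sentence; the final token is punctuation and is ignored.
gold = [("DT", 2, "det"), ("NN", 3, "nsubj"), ("VBZ", 0, "root"),
        ("RB", 3, "advmod"), (".", 3, "punct")]
pred = [("DT", 2, "det"), ("NN", 3, "dobj"), ("VBZ", 0, "root"),
        ("RB", 2, "advmod"), (".", 2, "punct")]
uas, las = attachment_scores(gold, pred)
```

Here three of the four scored tokens have the correct head and two also have the correct label, so the sketch yields 75.0 UAS and 50.0 LAS.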
We found that including third order dependencies in the conditioning context performs just 0.1% UAS better than including only first and second order dependencies. Including additional elements does not improve performance further. [...] trained only on words, not POS tags. Dependency parsers that do not use distributed representations tend to rely much more on the tags.
Test set results comparing generative dependency parsers are given in Table 3 (our model is referred to as NN-GenDP). The graph-based generative baseline (Wallach et al., 2008), parameterised by Pitman-Yor Processes, is quite weak. Our model outperforms the generative model of Titov and Henderson (2007), which we retrained on our dataset, by 0.2%, despite that model being able to condition on arbitrary-sized contexts. The decoding speed of our model is around 20 sentences per second, against less than 1 sentence per second for Titov and Henderson's model. Using diagonal transformation matrices further increases our model's speed, but reduces parsing accuracy.
On the Stanford dependency representation our model obtains 90.63% UAS and 88.27% LAS. Although this performance is promising, it is still below the discriminative neural network models of Dyer et al. (2015) and Weiss et al. (2015), which obtain 93.1% UAS and 94.0% UAS respectively.

Language modelling
We also evaluate our parser as a language model, on the same WSJ data used for the parsing evaluation. We perform unlabelled parsing, as experiments show that including labels in the conditioning context has a very small impact on performance. Neither do we use POS tags, as they are too expensive to predict in language generation applications.
Perplexity results on the WSJ are given in Table 4. As baselines we report results on modified Kneser-Ney (Kneser and Ney, 1995) and neural network 5-gram models. For our dependency-based language models we report perplexities based on the most likely parse found by the decoder, which gives an upper bound on the true value of the model perplexity.

Table 5: Sentences generated by the model.
the u.s. union board said revenue rose 11 % to $ NUM million , or $ NUM a share .
mr. bush has UNK-ed a plan to buy the company for $ NUM to NUM million , or $ NUM a share .
the plan was UNK-ed by the board 's decision to sell its $ NUM million UNK loan loan funds .
in stocks coming months , china 's NUM shares rose 10 cents to $ NUM million , or $ NUM a share .
in the case , mr. bush said it will sell the company business UNK concern to buy the company .
it was NUM common shares in addition , with $ NUM million , or $ NUM a share , according to mr. bush .
in the first quarter , 1989 shares closed yesterday at $ NUM , mr. bush has increased the plan .
last year 's retrenchment price index index rose 11 cents to $ NUM million , or $ NUM million is asked .
last year earlier , net income rose 11 million % to $ NUM million , or 91 cents a share .
the u.s. union has UNK-ed $ NUM million , or 22 cents a share , in 1990 , payable nov. 9 .
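The perplexity computed from the most likely parse is an upper bound because the sentence probability sums the joint probability over all derivations, and that sum is at least as large as its largest term. The sketch below illustrates this with invented joint probabilities for a single five-word sentence; the function name and the numbers are hypothetical.

```python
import math


def perplexity(log_probs, num_words):
    """Exponential of the negative average per-word log-probability (natural log)."""
    return math.exp(-sum(log_probs) / num_words)


# Toy joint probabilities p(sentence, derivation) for one 5-word sentence (invented numbers).
joint = [0.0006, 0.0003, 0.0001]

best_parse = perplexity([math.log(max(joint))], num_words=5)  # based on the single best derivation
marginal = perplexity([math.log(sum(joint))], num_words=5)    # based on the full sentence probability
```

Since `max(joint)` can never exceed `sum(joint)`, `best_parse` is never smaller than `marginal`. The gap shrinks as the probability mass concentrates on a single derivation.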
First we only perform standard supervised training with the model, which already leads to an improvement of 10 perplexity points over the neural n-gram model. Second we consider a training setup where we first perform 5 supervised iterations, and then perform unsupervised training, treating the transition sequence as latent. For each minibatch, parse trees are sampled with a particle filter. This approach further improves the perplexity to 111.8, a 23% reduction relative to the Kneser-Ney model.
The unsupervised training stage lets the parsing accuracy fall from 91.48 to 89.49 UAS. We postulate that the model is learning to make small adjustments in favour of parse structures that explain the data better than the annotated parse trees, leading to the improvement in perplexity.
To test the scalability of our model, we also trained it on a larger unannotated corpus: a subset (of around 7 million words) of the billion word language modeling benchmark dataset (Chelba et al., 2013). After training the model on the WSJ, we parsed the unannotated data with the model, and continued to train on the obtained parses.
We observed a small improvement in perplexity, from 203.5 for a neural n-gram model to 200.7 for the generative dependency model. We expect larger improvements when training on more data and with more sophisticated inference.
To evaluate our generative model qualitatively, we perform unconstrained generation of sentences (and parse trees) from the model, and find that the generated sentences display a higher degree of syntactic coherence than sentences generated by an n-gram model. See Table 5 for examples generated by the model. The highest-scoring sentences of length 20 or more are given, from 1000 samples generated. Note that the generation includes unknown word tokens (here NUM, UNK and UNK-ed are used).

Conclusion
We presented an incremental generative dependency parser that can obtain accuracies competitive with discriminative models. The same model can be applied as an efficient syntactic language model, and future work should integrate it into language generation tasks such as machine translation.