Improved Transition-Based Parsing and Tagging with Neural Networks

We extend and improve upon recent work in structured training for neural network transition-based dependency parsing. We do this by experimenting with novel features and additional transition systems, and by testing on a wider array of languages. In particular, we introduce set-valued features to encode the predicted morphological properties and part-of-speech confusion sets of the words being parsed. We also investigate the use of joint parsing and part-of-speech tagging in the neural paradigm. Finally, we conduct a multi-lingual evaluation that demonstrates the robustness of the overall structured neural approach, as well as the benefits of the extensions proposed in this work. Our research further demonstrates the breadth of the applicability of neural network methods to dependency parsing, as well as the ease with which new features can be added to neural parsing models.


Introduction
Transition-based parsers (Nivre, 2008) are extremely popular because of their high accuracy and speed. Inspired by the greedy neural network transition-based parser of Chen and Manning (2014), Weiss et al. (2015) and Zhou et al. (2015) concurrently developed structured neural network parsers that use beam search and achieve state-of-the-art accuracies for English dependency parsing. While very successful, these parsers have used only a small fraction of the rich options provided inside the transition-based framework: for example, all of these parsers use virtually identical atomic features and the arc-standard transition system.
In this paper we extend this line of work and introduce two new types of features that significantly improve parsing performance: (1) a set-valued (i.e., bag-of-words style) feature for each word's morphological attributes, and (2) a weighted set-valued feature for each word's k-best POS tags. These features can be integrated naturally as atomic inputs to the embedding layer of the network, and the model can learn arbitrary conjunctions with all other features through the hidden layers. In contrast, integrating such features into a model with discrete features requires nontrivial manual tweaking. For example, Bohnet and Nivre (2012) had to carefully discretize the real-valued POS tag score in order to combine it with the other discrete binary features in their system. We also experiment with different transition systems, most notably the integrated parsing and part-of-speech (POS) tagging system of Bohnet and Nivre (2012) and the swap system of Nivre (2009).
We evaluate our parser on the CoNLL '09 shared task dependency treebanks, as well as on two English setups, achieving the best published numbers in many cases.

Model
In this section, we review the baseline model, and then introduce the features (which are novel) and the transition systems (taken from existing work) that we propose as extensions. We measure the impact of each proposed change on the development sets of the multi-lingual CoNLL '09 shared task treebanks (Hajič et al., 2009). For details on our experimental setup, see Section 3.

Baseline Model
Our baseline model is the structured neural network transition-based parser with beam search of Weiss et al. (2015). We use a feed-forward network with embedding, hidden and softmax layers. The input consists of a sequence of matrices extracted deterministically from a transition-based parse configuration (consisting of a stack and a buffer). Each matrix $X^g$ corresponds to a feature group $g$ (one of words, tags, or labels) and has dimension $F_g \times V_g$. Here, $X^g_{fv}$ is 1 if the $f$'th feature takes on value $v$ for group $g$, i.e., each row of $X^g$ is a one-hot vector. These features are embedded and then concatenated to form the embedding layer, which in turn is input to the first hidden layer. The concatenated embedding layer can then be written as follows:
$$h_0 = [X^g E^g \mid g \in \{\text{words}, \text{tags}, \text{labels}\}]$$
where $E^g$ is a (learned) $V_g \times D_g$ embedding matrix for group $g$, and $D_g$ is the embedding dimension for group $g$. Beyond the embedding layer, there are two non-linear hidden layers (with nonlinearity introduced using a rectified linear activation function), and a softmax layer that outputs class probabilities for each possible decision.
Training proceeds in two stages. We first train the network as a classifier by extracting decisions from gold derivations of the training set, as in Chen and Manning (2014). We then train a structured perceptron using the output of all network activations as features, as in Weiss et al. (2015). We use structured training and beam search during inference in all experiments. We train our models only on the treebank training set and do not use tri-training or other semi-supervised learning approaches (aside from using pre-trained word embeddings).
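As a rough illustration of the embedding layer described above, the following Python sketch builds one-hot input matrices, multiplies each by its embedding matrix, and concatenates the results. The toy vocabulary sizes, embedding values and helper names (`one_hot_rows`, `embedding_layer`) are assumptions made for this example and are not taken from the original system.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def one_hot_rows(values: List[int], vocab_size: int) -> Matrix:
    """Build X^g: one row per feature, with a single 1 at that feature's value."""
    return [[1.0 if v == value else 0.0 for v in range(vocab_size)] for value in values]


def embedding_layer(inputs: Dict[str, Matrix],
                    embeddings: Dict[str, Matrix],
                    groups: List[str]) -> List[float]:
    """Concatenate the rows of X^g E^g over all groups, in a fixed group order."""
    h0: List[float] = []
    for g in groups:
        for row in matmul(inputs[g], embeddings[g]):
            h0.extend(row)
    return h0


# Toy sizes (hypothetical): 2 word features over 3 word types (D = 2),
# and 1 tag feature over 2 tags (D = 1).
E = {
    "words": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "tags": [[1.0], [2.0]],
}
X = {
    "words": one_hot_rows([2, 0], vocab_size=3),
    "tags": one_hot_rows([1], vocab_size=2),
}
h0 = embedding_layer(X, E, ["words", "tags"])
```

Because every row of a one-hot matrix selects exactly one row of the embedding matrix, the product amounts to a table lookup; the explicit matrix form is kept here only because it generalizes directly to rows that are not one-hot.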

New Features
Prior work using neural networks for dependency parsing has not ventured beyond the use of one-hot feature activations for each feature type-location pair. In this work, we experiment with set-valued features, in which a set (or bag) of features for a given location fire at once, and are embedded into the same embedding space. Note that for both of the features we introduce, we extract features from the same 20 tokens as used in the tags and words features from Weiss et al. (2015), i.e., various locations on the stack and input buffer.
Morphology. It is well known that morphological information is very important for parsing morphologically rich languages (see for example Bohnet et al. (2013)). We incorporate morphological information into our model using a set-valued feature function. We define the feature group morph as the matrix $X^{\text{morph}}$ such that, for each feature index $f$,
$$X^{\text{morph}}_{fv} = \begin{cases} 1/N_f & \text{if the token indexed by } f \text{ has attribute } v, \\ 0 & \text{otherwise,} \end{cases}$$
where $N_f$ is the number of morphological features active on the token indexed by $f$. In other words, we embed a bag of features into a shared embedding space by averaging the individual feature embeddings.
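The averaging described above can be sketched in a few lines of Python. The attribute names, the embedding values and the choice to map a token with no attributes to an all-zero row are assumptions made for this illustration.

```python
from typing import List, Set


def morph_row(active: Set[int], vocab_size: int) -> List[float]:
    """Row of X^morph for one token: 1/N_f on each active attribute, 0 elsewhere."""
    if not active:
        # Assumption: a token with no morphological attributes gets an all-zero row.
        return [0.0] * vocab_size
    weight = 1.0 / len(active)
    return [weight if v in active else 0.0 for v in range(vocab_size)]


def embed_row(row: List[float], embedding: List[List[float]]) -> List[float]:
    """Compute row . E, i.e. the weighted sum of the embedding vectors."""
    dim = len(embedding[0])
    return [sum(row[v] * embedding[v][d] for v in range(len(row))) for d in range(dim)]


# Hypothetical attribute vocabulary and 2-dimensional embeddings.
ATTRS = ["Case=Nom", "Case=Acc", "Num=Sing", "Num=Plur"]
E_morph = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

token_attrs = {ATTRS.index("Case=Nom"), ATTRS.index("Num=Sing")}
row = morph_row(token_attrs, len(ATTRS))
vec = embed_row(row, E_morph)  # average of the two active attribute embeddings
```

With two active attributes, each contributes a weight of 1/2, so the resulting vector is the mean of their embeddings regardless of how many attributes a token carries.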
k-best Tags. The non-linear network models of Weiss et al. (2015) and Chen and Manning (2014) embed the 1-best tag, according to a first-stage tagger, for a select set of tokens for any configuration. Inspired by the work of Bohnet and Nivre (2012), we embed the set of top tags according to a first-stage tagger. Specifically, we define the feature group ktags as the matrix $X^{\text{ktags}}$ such that, for each feature index $f$,
$$X^{\text{ktags}}_{fv} = P(\text{POS} = v \mid f),$$
where $P(\text{POS} = v \mid f)$ is the marginal probability that the token indexed by $f$ has the tag indexed by $v$, according to the first-stage tagger.
Results. The contributions of our new features for pipelined arc-standard parsing are shown in Table 1. Morphology features (+morph) contributed a labeled attachment score (LAS) gain of 2.9% in Czech, 1.5% in Spanish, and 0.9% in Catalan.
Adding the k-best tag feature (+morph +ktags) provides modest gains (and modest losses), peaking at 0.54% LAS for Spanish. This feature proves more beneficial in the integrated transition system, discussed in the next section. We note the ease with which we can obtain these gains in a multilayer embedding framework, without the need for any hand-tuning.
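The k-best tag feature defined earlier in this section differs from the morphology feature only in its weights: each tag is weighted by the tagger's marginal probability instead of a uniform 1/N. A minimal Python sketch follows. Truncating to the top k tags, the tag inventory and the embedding values are assumptions made for this example.

```python
from typing import Dict, List


def ktags_row(marginals: Dict[int, float], vocab_size: int, k: int) -> List[float]:
    """Row of X^ktags: the tagger marginal on each of the top-k tags, 0 elsewhere."""
    top = sorted(marginals.items(), key=lambda item: item[1], reverse=True)[:k]
    row = [0.0] * vocab_size
    for tag, prob in top:
        row[tag] = prob
    return row


def embed_row(row: List[float], embedding: List[List[float]]) -> List[float]:
    """Compute row . E, i.e. the probability-weighted sum of tag embeddings."""
    dim = len(embedding[0])
    return [sum(row[v] * embedding[v][d] for v in range(len(row))) for d in range(dim)]


# Hypothetical tag inventory, marginals and 2-dimensional tag embeddings.
TAGS = ["NOUN", "VERB", "ADJ"]
E_tags = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
marginals = {0: 0.75, 1: 0.25, 2: 0.0}

ktags = ktags_row(marginals, len(TAGS), k=2)
ktags_vec = embed_row(ktags, E_tags)
```

An ambiguous token thus receives an embedding that interpolates between its candidate tags, which lets the network see the tagger's uncertainty instead of a single hard decision.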

Integrating Parsing and Tagging
While past work on neural network transition-based parsing has focused exclusively on the arc-standard transition system, it is known that better results can often be obtained with more sophisticated transition systems that have a larger set of possible actions. The integrated arc-standard transition system of Bohnet and Nivre (2012) allows the parser to participate in tagging decisions, rather than being forced to treat the tagger's tags as given, as in the arc-standard system. It does this by replacing the shift action in the arc-standard system with an action $\text{shift}_p$, which, aside from shifting the top token on the buffer, also assigns it one of the k best POS tags from a first-stage tagger. We also experiment with the swap action of Nivre (2009), which allows reordering of the tokens in the input sequence. This transition system is able to produce non-projective parse trees, which is important for some languages.
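The actions discussed above can be sketched as operations on a small configuration object. This is a simplified illustration, not the authors' implementation: the `Config` class, the function names and the example sentence are hypothetical, and root handling, scoring and beam search are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Config:
    """Parser configuration: buffer, stack, assigned tags, and arcs built so far."""
    buffer: List[int]
    stack: List[int] = field(default_factory=list)
    tags: Dict[int, str] = field(default_factory=dict)
    arcs: List[Tuple[int, str, int]] = field(default_factory=list)  # (head, label, dep)


def shift_p(c: Config, tag: str, candidates: Dict[int, List[str]]) -> None:
    """Shift the front buffer token and assign it one of its candidate POS tags."""
    token = c.buffer[0]
    if tag not in candidates[token]:
        raise ValueError(f"{tag!r} is not a candidate tag for token {token}")
    c.tags[token] = tag
    c.stack.append(c.buffer.pop(0))


def left_arc(c: Config, label: str) -> None:
    """Attach the second-topmost stack token as a dependent of the topmost one."""
    dep = c.stack.pop(-2)
    c.arcs.append((c.stack[-1], label, dep))


def right_arc(c: Config, label: str) -> None:
    """Attach the topmost stack token as a dependent of the one below it."""
    dep = c.stack.pop()
    c.arcs.append((c.stack[-1], label, dep))


def swap(c: Config) -> None:
    """Move the second-topmost stack token back to the buffer (reordering)."""
    if c.stack[-2] >= c.stack[-1]:
        raise ValueError("swap requires the two top tokens to be in input order")
    c.buffer.insert(0, c.stack.pop(-2))


# Hypothetical two-token sentence "They fish" with tagger candidates per token.
candidates = {0: ["PRON"], 1: ["VERB", "NOUN"]}
c = Config(buffer=[0, 1])
shift_p(c, "PRON", candidates)
shift_p(c, "VERB", candidates)  # the parser, not the tagger, commits to VERB
left_arc(c, "nsubj")
```

The only difference from plain arc-standard is that the tag of each token is chosen at shift time, so the choice competes with parsing decisions in the same search.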
Results. The effect of using the integrated transition system is quantified in the bottom part of Table 1. The use of both (1) +morph +ktags features and (2) integrated parsing and tagging achieves the best score for 5 out of 7 languages tested. The use of integrated parsing and tagging provides, for example, a 0.8% LAS gain in German.

Experiments
In this section we provide final test set results for our baseline and full models on three standard setups from the literature: CoNLL '09, English WSJ and English Treebank Union.

General Setup
To train with predicted POS tags, we use a CRF-based POS tagger to generate 5-fold jack-knifed POS tags on the training set and predicted tags on the dev, test and tune sets; our tagger gets comparable accuracy to the Stanford POS tagger (Toutanova et al., 2003) with 97.44% on the WSJ test set. The candidate tags allowed by the integrated transition system on every $\text{shift}_p$ action are chosen by taking the top 4 tags for a token according to the CRF tagger, sorted by posterior probability, with no minimum posterior probability for a tag to be selected.
We report unlabeled attachment score (UAS) and labeled attachment score (LAS). Whether punctuation is included in the evaluation is specified in each subsection.
We use 1024 units in all hidden layers, a choice made based on the development set. We found network sizes to be of critical importance for the accuracy of our models. For example, LAS improvements can be as high as 0.98% in CoNLL '09 German when increasing the size of the two hidden layers from 200 to 1024. We use a beam size of B = 16 or B = 32, chosen per language based on development set performance. For ease of experimentation, we deviate from Bohnet and Nivre (2012) and use a single unstructured beam, rather than separate beams for POS tag and parse differences.
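The jack-knifing step mentioned in this setup can be sketched as follows: each training sentence is tagged by a model trained on the other folds, so that the parser sees tags of test-like quality during training. The sketch below is an illustration only. It substitutes a most-frequent-tag baseline for the CRF tagger, and the fold assignment by sentence index and all function names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, gold tag) pairs
Tagger = Callable[[Sentence], List[str]]


def train_most_frequent_tagger(training: List[Sentence]) -> Tagger:
    """Stand-in for the CRF tagger: predict each word's most frequent training tag."""
    counts = defaultdict(Counter)
    for sentence in training:
        for word, tag in sentence:
            counts[word][tag] += 1

    def tag(sentence: Sentence) -> List[str]:
        return [counts[word].most_common(1)[0][0] if word in counts else "UNK"
                for word, _ in sentence]

    return tag


def jackknife(sentences: List[Sentence],
              train_tagger: Callable[[List[Sentence]], Tagger],
              n_folds: int = 5) -> List[List[str]]:
    """Tag every sentence with a tagger trained on the other folds only."""
    predicted: List[List[str]] = [[] for _ in sentences]
    for fold in range(n_folds):
        # Assumption: sentences are assigned to folds by index modulo n_folds.
        training = [s for i, s in enumerate(sentences) if i % n_folds != fold]
        tagger = train_tagger(training)
        for i, sentence in enumerate(sentences):
            if i % n_folds == fold:
                predicted[i] = tagger(sentence)
    return predicted


corpus = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("cats", "NOUN"), ("run", "VERB")],
    [("cats", "NOUN"), ("bark", "VERB")],
    [("birds", "NOUN"), ("sing", "VERB")],
]
jackknifed = jackknife(corpus, train_most_frequent_tagger, n_folds=5)
```

In the toy corpus, the last sentence contains only words that never occur in the other folds, so the stand-in tagger cannot tag it; a tagger trained on the full set would have memorized it, which is exactly the optimism jack-knifing avoids.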
We train our neural networks on the standard training sets only, except for initializing with word embeddings generated by word2vec and using cluster features in our POS tagger. Unlike Weiss et al. (2015) we train our model only on the treebank training set and do not use tri-training, which can likely further improve the results.

CoNLL '09
Our multilingual evaluation follows the setup of the CoNLL '09 shared task (Hajič et al., 2009). As standard, we use the supplied predicted morphological features from the shared task data; however, we predict k-best tags with our own POS tagger since k-best tags are not part of the given data. We follow standard practice and include all punctuation in the evaluation. We used the (integrated) arc-standard transition system for all languages except for Czech, where we added a swap transition, obtaining a 0.4% absolute improvement in UAS and LAS over just using arc-standard.
Results. In Table 3, we compare our models to the winners of the CoNLL '09 shared task, Gesmundo et al. (2009), Bohnet (2009), Che et al. (2009), and Ren et al. (2009), as well as to more recent results on the same datasets. It is worth pointing out that Gesmundo et al. (2009) is itself a neural net parser. Our models achieve higher labeled accuracy than the winning systems in the shared task in all languages. Additionally, our pipelined neural network parser always outperforms its linear counterpart, an in-house reimplementation of the system of Zhang and Nivre (2011), as well as the more recent and highly accurate parsers of Zhang and McDonald (2014) and Lei et al. (2014). For the integrated models, our neural network parser again outperforms its linear counterpart (Bohnet and Nivre, 2012); however, in some cases the addition of graph-based and cluster features (Bohnet and Nivre, 2012)+G+C can lead to even better results. The improvements in POS tagging (Table 2) range from 0.3% for English to 1.4% absolute for Chinese and are always higher for the neural network models compared to the linear models.

English WSJ
We experiment on English using the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993), with standard train/test splits. We convert the constituency trees to Stanford style dependencies (De Marneffe et al., 2006) using version 3.3.0 of the converter. We use predicted POS tags and exclude punctuation from the evaluation, as is standard for English.
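The attachment scores used in this evaluation can be computed as sketched below, with an optional mask for dropping punctuation tokens from the counts. How punctuation tokens are identified is left to the caller here, which is an assumption of this sketch; the function name and the example sentence are also hypothetical.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc],
                      predicted: List[Arc],
                      punct_mask: Optional[List[bool]] = None) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent, skipping tokens flagged in punct_mask."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = uas_hits = las_hits = 0
    for i, (g, p) in enumerate(zip(gold, predicted)):
        if punct_mask is not None and punct_mask[i]:
            continue  # punctuation excluded from the evaluation
        total += 1
        if g[0] == p[0]:
            uas_hits += 1          # correct head
            if g[1] == p[1]:
                las_hits += 1      # correct head and label
    if total == 0:
        raise ValueError("no tokens left to score")
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Hypothetical sentence "She left ." with heads as 1-based indices (0 = root).
gold = [(2, "nsubj"), (0, "root"), (2, "punct")]
pred = [(2, "dobj"), (0, "root"), (1, "punct")]
is_punct = [False, False, True]

uas_no_punct, las_no_punct = attachment_scores(gold, pred, is_punct)
uas_all, las_all = attachment_scores(gold, pred)
```

The same predictions score differently under the two conventions, which is why each experimental subsection states whether punctuation is counted.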
Results. The results are shown in Table 4. Our full model surpasses, to our knowledge, all previously reported supervised parsing models for the Stanford dependency conversions. It surpasses its linear analog, the work of Bohnet and Nivre (2012) on Stanford Dependencies, by 0.9% UAS and by 1.14% LAS. It also outperforms the pipeline neural net model of Weiss et al. (2015) by a considerable margin and matches the semi-supervised variant of Weiss et al. (2015).

English Treebank Union
For cross-domain results on the "Treebank Union" datasets, we use an identical setup to the one described in Weiss et al. (2015). This setup includes the WSJ with Stanford Dependencies, the OntoNotes corpus version 5 (Hovy et al., 2006), the English Web Treebank (Petrov and McDonald, 2012), and the updated and corrected Question Treebank (Judge et al., 2006). We train on the union of the corpora's training sets and test on each domain separately.
Results. The results of this evaluation are shown in Table 5. As with the WSJ, we find that the integrated transition system combined with our novel features performs better than previous work and in particular the model of Weiss et al. (2015), which serves as the starting point for this work. The improvements on the out-of-domain Web and Question corpora are particularly promising.

Conclusions
Weiss et al. (2015) presented a parser that advanced the state of the art for English Stanford dependency parsing. In this paper we showed that this parser can be significantly improved by introducing novel set features for morphology and POS tag ambiguities, which are added with almost no feature engineering effort. The resulting parser is already competitive in the multi-lingual setting of the CoNLL '09 shared task, but can be further improved by utilizing an integrated POS tagging and parsing transition system. We find that for all settings the dense neural network model produces higher POS tagging and parsing accuracy gains than its sparse linear counterpart.