A Re-ranking Model for Dependency Parser with Recursive Convolutional Neural Network

In this work, we address the problem of modeling all the nodes (words or phrases) in a dependency tree with dense representations. We propose a recursive convolutional neural network (RCNN) architecture to capture syntactic and compositional-semantic representations of phrases and words in a dependency tree. Unlike the original recursive neural network, we introduce convolution and pooling layers, which can model a variety of compositions through the feature maps and choose the most informative compositions through the pooling layers. Based on RCNN, we use a discriminative model to re-rank a $k$-best list of candidate dependency parsing trees. The experiments show that RCNN is very effective in improving state-of-the-art dependency parsing on both English and Chinese datasets.


Introduction
Feature-based discriminative supervised models have achieved much progress in dependency parsing (Nivre, 2004; Yamada and Matsumoto, 2003; McDonald et al., 2005). These models typically use millions of discrete binary features generated from a limited amount of training data. However, their ability is restricted by the design of the features. The number of features can be so large that the resulting models are too complicated for practical use and prone to overfitting the training corpus due to data sparseness.
Recently, many methods have been proposed to learn various distributed representations at both the syntactic and semantic levels. These distributed representations have been extensively applied to many natural language processing (NLP) tasks, such as syntax (Turian et al., 2010; Mikolov et al., 2010; Collobert et al., 2011; Chen and Manning, 2014) and semantics (Huang et al., 2012; Mikolov et al., 2013). Distributed representations represent words (or phrases) by dense, low-dimensional, real-valued vectors, which help address the curse of dimensionality and generalize better than discrete representations. For dependency parsing,  and Bansal et al. (2014) used dense vectors (embeddings) to represent words or features and found that these representations are complementary to the traditional discrete feature representation. However, these two methods only focus on the dense representations (embeddings) of words or features. These embeddings are pre-trained and kept unchanged during the training phase of the parsing model, so they cannot be optimized for the specific task.
Besides, it is also important to represent (unseen) phrases with dense vectors in dependency parsing. Since a dependency tree also has a recursive structure, it is intuitive to use the recursive neural network (RNN), which has been used for constituent parsing (Socher et al., 2013a). However, the recursive neural network can only process binary combinations and is not suitable for dependency parsing, since a parent node may have more than two child nodes in a dependency tree.
In this work, we address the problem of representing nodes at all levels (words or phrases) with dense representations in a dependency tree. We propose a recursive convolutional neural network (RCNN) architecture to capture syntactic and compositional-semantic representations of phrases and words. RCNN is a general architecture and can deal with k-ary parsing trees, so it is very suitable for dependency parsing. For each node in a given dependency tree, we first use an RCNN unit to model the interactions between the node and each of its children, and choose the most informative features with a pooling layer. We then apply the RCNN unit recursively to get the vector representation of the whole dependency tree. The output of each RCNN unit is used as the input of the RCNN unit of its parent node, until a single fixed-length vector is output at the root node. Figure 1 illustrates an example of how an RCNN unit represents the phrase "a red bike" as continuous vectors. The contributions of this paper can be summarized as follows.
• RCNN is a general architecture to model the distributed representations of a phrase or sentence with its dependency tree. Although RCNN is only used for re-ranking the output of a dependency parser in this paper, it can be regarded as a semantic model of text sequences that maps input sequences of varying length to a fixed-length vector. The parameters in RCNN can be learned jointly with other NLP tasks, such as text classification.
• Each RCNN unit can model the complicated interactions of the head word and its children. Combined with a specific task, RCNN can capture the most useful semantic and structural information through the convolution and pooling layers.
• When applied to the re-ranking model for parsing, RCNN improves the accuracy of the base parser. The experiments on two benchmark datasets show that RCNN outperforms the state-of-the-art models.

Recursive Neural Network
In this section, we briefly describe the recursive neural network architecture of Socher et al. (2013a). The idea of recursive neural networks (RNN) for natural language processing (NLP) is to train a deep learning model that can be applied to phrases and sentences, which have a grammatical structure (Pollack, 1990; Socher et al., 2013c). RNN can also be regarded as a general structure for modeling sentences. At every node in the tree, the contexts at the left and right children of the node are combined by a classical layer. The weights of the layer are shared across all nodes in the tree. The layer computed at the top node gives a representation of the whole sentence.
Following the binary tree structure, RNN can assign a fixed-length vector to each word at the leaves of the tree, and combine word and phrase pairs recursively to create intermediate node vectors of the same length, eventually having one final vector representing the whole sentence. Multiple recursive combination functions have been explored, from linear transformation matrices to tensor products (Socher et al., 2013c). Figure 2 illustrates the architecture of RNN.
The binary tree can be represented in the form of branching triplets ($p \rightarrow c_1 c_2$). Each such triplet denotes that a parent node $p$ has two children, and each $c_k$ can be either a word or a non-terminal node in the tree.
Given a labeled binary parse tree, $((p_2 \rightarrow a\,p_1), (p_1 \rightarrow b\,c))$, the node representations are computed by

$$\mathbf{p}_1 = f\left(W \begin{bmatrix} \mathbf{b} \\ \mathbf{c} \end{bmatrix}\right), \qquad \mathbf{p}_2 = f\left(W \begin{bmatrix} \mathbf{a} \\ \mathbf{p}_1 \end{bmatrix}\right),$$

where $(\mathbf{p}_1, \mathbf{p}_2, \mathbf{a}, \mathbf{b}, \mathbf{c})$ are the vector representations of $(p_1, p_2, a, b, c)$ respectively, denoted by lowercase bold letters; $W$ is a matrix of parameters of the RNN and $f$ is a non-linear activation function.
Based on RNN, Socher et al. (2013a) introduced a compositional vector grammar, which uses syntactically untied weights $W$ to learn syntactic-semantic, compositional vector representations. In order to compute the score of how plausible a syntactic constituent a parent is, RNN uses a single-unit linear layer for all $\mathbf{p}_i$:

$$s(\mathbf{p}_i) = \mathbf{v}^{\top} \mathbf{p}_i,$$

where $\mathbf{v}$ is a vector of parameters that need to be trained. This score is used to find the highest scoring tree. For more details on how the standard RNN can be used for parsing, see (Socher et al., 2011).
Costa et al. (2003) applied recursive neural networks to re-rank possible phrase attachments in an incremental constituency parser. Their work is the first to show that RNNs can capture enough information to make correct parsing decisions. Menchetti et al. (2005) used RNNs to re-rank different constituency parses. For their results on full sentence parsing, they re-ranked candidate trees created by the Collins parser (Collins, 2003).
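As a rough illustration of the binary composition and scoring just described, the following minimal sketch computes the two parent vectors for the tree $((p_2 \rightarrow a\,p_1), (p_1 \rightarrow b\,c))$. It assumes tanh as the non-linearity; the helper names `compose` and `score` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def compose(W, left, right):
    """Parent vector p = tanh(W [left; right]) for one binary node."""
    child = left + right  # concatenation of the two child vectors
    return [math.tanh(sum(w * x for w, x in zip(row, child))) for row in W]

def score(v, p):
    """Plausibility score s(p) = v . p of a parent vector."""
    return sum(vi * pi for vi, pi in zip(v, p))

# Toy parameters with vector size 2 (illustrative values only).
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
v = [1.0, -1.0]
a, b, c = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]

p1 = compose(W, b, c)    # p1 -> b c
p2 = compose(W, a, p1)   # p2 -> a p1, reusing the same shared W
tree_score = score(v, p1) + score(v, p2)
```

Because the same `W` is applied at every node, the parent vectors keep the size of the word vectors, which is what allows the recursion to continue up to the root.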

Recursive Convolutional Neural Network
Dependency grammar is a widely used syntactic structure, which directly reflects relationships among the words in a sentence. In a dependency tree, all nodes are terminal (words) and each node may have more than two children. Therefore, the standard RNN architecture is not suitable for dependency grammar, since it is based on binary trees. In this section, we propose a more general architecture, called recursive convolutional neural network (RCNN), which borrows the idea of the convolutional neural network (CNN) and can deal with k-ary trees.

RCNN Unit
For ease of exposition, we first describe the basic unit of RCNN. An RCNN unit models a head word and its children. Unlike the constituent tree, the dependency tree does not have non-terminal nodes. Each node consists of a word and its POS tag. Each node should have a different interaction with its head node.
Word Embeddings Given a word dictionary W, each word w ∈ W is represented as a real-valued vector (word embedding) $\mathbf{w} \in \mathbb{R}^{m}$, where m is the dimensionality of the vector space. The word embeddings are then stacked into an embedding matrix $M \in \mathbb{R}^{m \times |W|}$. For a word w ∈ W, its corresponding word embedding $\mathrm{Embed}(w) \in \mathbb{R}^{m}$ is retrieved by the lookup table layer. The matrix M is a parameter of the model (the word embeddings $\Theta_w$).
Distance Embeddings Besides word embeddings, we also use a distributed vector to represent the relative distance between a head word h and one of its children c. For example, as shown in Figure 1, the relative distances of "bike" to "a" and "red" are -2 and -1, respectively. Each relative distance is mapped to a vector of dimension $m_d$ (a hyperparameter); this vector is randomly initialized. Distance embedding is a common way to encode distance information in neural models and has proven effective in several tasks. Our experimental results also show that the distance embedding gives more benefit than the traditional representation. The relative distance can encode the structural information of a subtree.
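The relative distances in the "a red bike" example can be reproduced with a few lines. This is only a sketch: it assumes that the distance is the child position minus the head position, and the lookup table builder `init_distance_table`, its `max_dist` cut-off and its initialization range are hypothetical choices, not details taken from the paper.

```python
import random

def relative_distance(head_index, child_index):
    """Signed distance from the head to the child, in token positions."""
    return child_index - head_index

def init_distance_table(max_dist, m_d, seed=0):
    """Randomly initialized m_d-dimensional vector for each non-zero distance.

    max_dist and the (-0.01, 0.01) range are assumptions for this sketch.
    """
    rng = random.Random(seed)
    return {d: [rng.uniform(-0.01, 0.01) for _ in range(m_d)]
            for d in range(-max_dist, max_dist + 1) if d != 0}

tokens = ["a", "red", "bike"]
head = tokens.index("bike")
distances = {tokens[c]: relative_distance(head, c) for c in (0, 1)}

table = init_distance_table(max_dist=5, m_d=3)
d_a = table[distances["a"]]      # distance embedding for the pair (bike, a)
```

Here `distances` maps "a" to -2 and "red" to -1, matching the values quoted from Figure 1.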
Convolution The word and distance embeddings are subsequently fed into the convolution component to model the interactions between two linked nodes.
Unlike in the standard RNN, there are no non-terminal nodes in a dependency tree. Each node h in a dependency tree has two associated distributed representations:
1. word embedding $\mathbf{w}_h \in \mathbb{R}^{m}$, which carries the node's own information according to its word form;
2. phrase representation $\mathbf{x}_h \in \mathbb{R}^{m}$, which is the joint representation of the whole subtree rooted at h. In particular, when h is a leaf node, $\mathbf{x}_h = \mathbf{w}_h$.
Given a subtree rooted at h in a dependency tree, we define $c_i$, $0 < i \le K$, as the i-th child node of h, where K represents the number of children.
For each pair $(h, c_i)$, we use a convolutional hidden layer to compute their combination representation $\mathbf{z}_i$:

$$\mathbf{z}_i = \tanh\left(W^{(h,c_i)} \mathbf{p}_i\right), \quad 0 < i \le K,$$

where $W^{(h,c_i)} \in \mathbb{R}^{m \times n}$ is the linear composition matrix, which depends on the POS tags of h and $c_i$; $\mathbf{p}_i \in \mathbb{R}^{n}$ is the concatenated representation of h and the i-th child, which consists of the head word embedding $\mathbf{w}_h$, the child phrase representation $\mathbf{x}_{c_i}$ and the distance embedding $\mathbf{d}^{(h,c_i)}$ of h and $c_i$:

$$\mathbf{p}_i = \mathbf{w}_h \oplus \mathbf{x}_{c_i} \oplus \mathbf{d}^{(h,c_i)},$$

where ⊕ represents the concatenation operation, so that $n = 2m + m_d$. The distance $d^{(h,c_i)}$ is the relative distance of h and $c_i$ in a given sentence, which is mapped to an $m_d$-dimensional vector. Unlike in a constituent tree, the combination should consider the order or position of each child in a dependency tree.
In our model, we do not use POS tag embeddings directly. Since the composition matrix varies with the pair of POS tags of h and $c_i$, it can capture the different syntactic combinations. For example, the combination of an adjective and a noun should be different from that of a verb and a noun.
After the composition operations, we use tanh as the non-linear activation function to get a hidden representation $\mathbf{z}$.
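The convolution step for a single head-child pair can be sketched as follows. The function name `combine`, the POS-pair dictionary `W_by_pos` and all numbers are hypothetical; the only structure taken from the text is the concatenation of the head word embedding, the child phrase representation and the distance embedding, followed by a POS-pair-specific linear map and tanh.

```python
import math

def combine(w_h, x_c, d_hc, W_hc):
    """Hidden vector z_i = tanh(W p_i), with p_i = w_h (+) x_c (+) d_hc."""
    p_i = w_h + x_c + d_hc  # list concatenation, length n = 2m + m_d
    if any(len(row) != len(p_i) for row in W_hc):
        raise ValueError("each row of W must have n = 2m + m_d entries")
    return [math.tanh(sum(w * x for w, x in zip(row, p_i))) for row in W_hc]

# Toy sizes: m = 2 and m_d = 1, so n = 5 (illustrative values only).
w_bike = [-0.5, 0.2]   # word embedding of the head "bike"
x_red = [0.4, 0.3]     # phrase representation of the child "red"
d = [-0.05]            # embedding of the relative distance -1

# One composition matrix per (head POS, child POS) pair.
W_by_pos = {
    ("NN", "JJ"): [[0.0, 0.0, 0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0, 0.0, 0.0]],
}
z = combine(w_bike, x_red, d, W_by_pos[("NN", "JJ")])
```

Keying the matrices by POS pair is what lets an adjective-noun combination use different weights from a verb-noun combination without a separate POS embedding.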

Max Pooling After convolution, we get

$$Z^{(h)} = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_K] \in \mathbb{R}^{m \times K},$$

where K is dynamic and depends on the number of children of h. To transform $Z^{(h)}$ to a fixed length and determine the most useful semantic and structural information, we perform a max pooling operation over the rows of $Z^{(h)}$:

$$x_{h,j} = \max_{0 < i \le K} Z^{(h)}_{j,i}, \quad 0 < j \le m.$$

Thus, we obtain the vector representation $\mathbf{x}_h \in \mathbb{R}^{m}$ of the whole subtree rooted at node h. Figure 3 shows the architecture of our proposed RCNN unit. Given a whole dependency tree, we can apply the RCNN unit recursively to get the vector representation of the whole sentence. The output of each RCNN unit is used as the input of the RCNN unit of its parent node.
Thus, RCNN can be used to model the distributed representations of a phrase or sentence with its dependency tree and can be applied to many NLP tasks. The parameters in RCNN can be learned jointly with the specific NLP task. Each RCNN unit can model the complicated interactions of the head word and its children. Combined with a specific task, RCNN can select the useful semantic and structural information through the convolution and max pooling layers. Figure 4 shows an example of RCNN modeling the sentence "I eat sashimi with chopsticks".
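Putting convolution, max pooling and recursion together, a complete RCNN pass over a small tree might look like the sketch below. The data layout (a `children` dictionary from head index to child indices, a `tags` list, dictionaries for embeddings and matrices) and every number are assumptions made for illustration, not the authors' implementation.

```python
import math

def represent(h, children, words, tags, emb, dist_emb, W):
    """Return x_h, the vector of the subtree rooted at token index h."""
    kids = children.get(h, [])
    if not kids:
        return list(emb[words[h]])                 # leaf node: x_h = w_h
    Z = []
    for c in kids:
        x_c = represent(c, children, words, tags, emb, dist_emb, W)
        p = emb[words[h]] + x_c + dist_emb[c - h]  # w_h (+) x_c (+) d
        W_hc = W[(tags[h], tags[c])]               # matrix chosen by POS pair
        Z.append([math.tanh(sum(w * x for w, x in zip(row, p)))
                  for row in W_hc])
    # Max pooling over the children, one value per row of Z.
    return [max(z[j] for z in Z) for j in range(len(Z[0]))]

words = ["a", "red", "bike"]
tags = ["DT", "JJ", "NN"]
children = {2: [0, 1]}                  # "bike" heads "a" and "red"
emb = {"a": [0.1, -0.2], "red": [0.4, 0.3], "bike": [-0.5, 0.2]}  # m = 2
dist_emb = {-2: [0.05], -1: [-0.05]}    # m_d = 1
W = {
    ("NN", "DT"): [[0.1, 0.2, 0.3, 0.4, 0.5], [0.5, 0.4, 0.3, 0.2, 0.1]],
    ("NN", "JJ"): [[-0.1, 0.2, -0.3, 0.4, -0.5], [0.2, 0.2, 0.2, 0.2, 0.2]],
}
x_bike = represent(2, children, words, tags, emb, dist_emb, W)
```

The pooled vector has length m regardless of how many children the head has, which is why the same unit can be reused at the parent of "bike" in a larger sentence.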

Parsing
In order to measure the plausibility of a subtree rooted at h in a dependency tree, we use a single-unit linear layer neural network to compute the score of its RCNN unit.
For constituent parsing, the representation of a non-terminal node only depends on its two children. The combination is relatively simple and its correctness can be measured with the final representation of the non-terminal node (Socher et al., 2013a).
However, for dependency parsing, all combinations of the head h and its children $c_i$ ($0 < i \le K$) are important for measuring the correctness of the subtree. Therefore, our score function s(h) is computed on all of the hidden layers $\mathbf{z}_i$ ($0 < i \le K$):

$$s(h) = \sum_{i=1}^{K} \mathbf{v}^{(h,c_i)} \cdot \mathbf{z}_i,$$

where $\mathbf{v}^{(h,c_i)} \in \mathbb{R}^{m \times 1}$ is the score vector, which also depends on the POS tags of h and $c_i$. Given a sentence x and its dependency tree y, the goodness of a complete tree is measured by summing the scores of all the RCNN units:

$$s(x, y, \Theta) = \sum_{h \in y} s(h),$$

where h ∈ y is a node in tree y; $\Theta = \{\Theta_W, \Theta_v, \Theta_w, \Theta_d\}$ includes the combination matrix set $\Theta_W$, the score vector set $\Theta_v$, the word embeddings $\Theta_w$ and the distance embeddings $\Theta_d$. Finally, we can predict the dependency tree $\hat{y}$ with the highest score for sentence x:

$$\hat{y} = \arg\max_{y \in gen(x)} s(x, y, \Theta), \tag{8}$$

where gen(x) is defined as the set of all possible trees for sentence x. When applied in re-ranking, gen(x) is the set of the k-best outputs of a base parser.
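The scoring and prediction steps can be sketched as follows, assuming the hidden vectors of every head-child pair have already been computed. The representation of a tree as a list of units, each unit being a list of (score vector, hidden vector) pairs, is a hypothetical layout used only for this example.

```python
def unit_score(pairs):
    """s(h): sum over the children of the dot product v . z."""
    return sum(sum(vi * zi for vi, zi in zip(v, z)) for v, z in pairs)

def tree_score(units):
    """Score of a whole tree: sum of the scores of all of its RCNN units."""
    return sum(unit_score(pairs) for pairs in units)

def predict(candidates):
    """Return the name of the highest scoring tree in a candidate set."""
    return max(candidates, key=lambda name: tree_score(candidates[name]))

# Two toy candidate trees for the same sentence (illustrative values only).
candidates = {
    "tree_a": [[([1.0, 0.0], [0.5, 0.2])]],
    "tree_b": [[([1.0, 1.0], [0.5, 0.2]), ([0.0, 2.0], [0.1, 0.3])]],
}
best = predict(candidates)
```

In the re-ranking setting, `candidates` would hold the k-best outputs of the base parser rather than all possible trees.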

Training
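For a given training instance $(x_i, y_i)$, we use the max-margin criterion to train our model. We first predict the dependency tree $\hat{y}_i$ with the highest score for each $x_i$ and define a structured margin loss $\Delta(y_i, \hat{y}_i)$ between the predicted tree $\hat{y}_i$ and the given correct tree $y_i$. $\Delta(y_i, \hat{y}_i)$ is measured by counting the number of nodes with an incorrect span (or label) in the proposed tree (Goodman, 1998):

$$\Delta(y_i, \hat{y}_i) = \sum_{d \in \hat{y}_i} \kappa \, \mathbf{1}\{d \notin y_i\},$$

where κ is a discount parameter and d represents the nodes in trees. Given a set of training dependency parses D, the final training objective is to minimize the loss function J(Θ), plus an $\ell_2$-regularization term:

$$J(\Theta) = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} r_i(\Theta) + \frac{\lambda}{2} \|\Theta\|_2^2,$$

where

$$r_i(\Theta) = \max_{\hat{y}_i \in gen(x_i)} \left(0,\; s(x_i, \hat{y}_i, \Theta) + \Delta(y_i, \hat{y}_i) - s(x_i, y_i, \Theta)\right)$$

and $s(x_i, y, \Theta)$ is the score that RCNN assigns to tree y. By minimizing this objective, the score of the correct tree $y_i$ is increased and the score of the highest scoring incorrect tree $\hat{y}_i$ is decreased.
We use a generalization of gradient descent called the subgradient method (Ratliff et al., 2007), which computes a gradient-like direction. The subgradient of the objective is:

$$\frac{\partial J}{\partial \Theta} = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \left( \frac{\partial s(x_i, \hat{y}_i, \Theta)}{\partial \Theta} - \frac{\partial s(x_i, y_i, \Theta)}{\partial \Theta} \right) + \lambda \Theta.$$

To minimize the objective, we use the diagonal variant of AdaGrad (Duchi et al., 2011). The parameter update for the i-th parameter $\Theta_{t,i}$ at time step t is as follows:

$$\Theta_{t,i} = \Theta_{t-1,i} - \frac{\rho}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i},$$

where ρ is the initial learning rate and $g_{\tau,i}$ is the subgradient at time step τ for the i-th parameter.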
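The pieces of this training procedure can be sketched in a few lines. The sketch makes two assumptions that are not stated in the text: a node counts as incorrect when its predicted head differs from the gold head, and a small constant `eps` is added to the AdaGrad denominator to avoid division by zero. The function names are hypothetical.

```python
import math

def margin_loss(gold_heads, pred_heads, kappa):
    """kappa times the number of tokens whose predicted head is wrong."""
    return kappa * sum(1 for g, p in zip(gold_heads, pred_heads) if g != p)

def hinge(score_gold, score_pred, delta):
    """Per-sentence loss: max(0, s(pred) + delta - s(gold))."""
    return max(0.0, score_pred + delta - score_gold)

def adagrad_step(theta, grad, hist, rho, eps=1e-8):
    """Diagonal AdaGrad update in place; eps is an assumption of this sketch."""
    for i, g in enumerate(grad):
        hist[i] += g * g                        # running sum of squared subgradients
        theta[i] -= rho * g / (math.sqrt(hist[i]) + eps)
    return theta

gold = [2, 2, -1]   # head index of each token, -1 marks the root
pred = [1, 2, -1]   # one wrong attachment
delta = margin_loss(gold, pred, kappa=2.0)
loss = hinge(score_gold=1.0, score_pred=0.5, delta=delta)

theta, hist = [1.0], [0.0]
adagrad_step(theta, [2.0], hist, rho=0.1)
```

With one wrong head and κ = 2.0 the margin is 2.0, so the gold tree must outscore the predicted tree by at least that much before the per-sentence loss reaches zero.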

Re-rankers
Re-ranking of k-best lists was introduced by Collins and Koo (2005) and Charniak and Johnson (2005), who used discriminative methods to re-rank constituent parses. In dependency parsing, Sangati et al. (2009) used a third-order generative model for re-ranking the k-best lists of a base parser. Hayashi et al. (2013) used a discriminative forest re-ranking algorithm for dependency parsing. These re-ranking models achieved substantial gains in parsing performance. Given T(x), the set of k-best trees of a sentence x from a base parser, we use the popular mixture re-ranking strategy (Hayashi et al., 2013; Le and Zuidema, 2014), which is a combination of our model and the base parser:

$$\hat{y}_i = \arg\max_{y \in T(x_i)} \; \alpha \, s_t(x_i, y, \Theta) + (1 - \alpha) \, s_b(x_i, y), \tag{14}$$

where α ∈ [0, 1] is a hyperparameter; $s_t(x_i, y, \Theta)$ and $s_b(x_i, y)$ are the scores given by RCNN and the base parser respectively.
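A minimal sketch of the mixture strategy is given below. Candidates are assumed to be (tree, RCNN score, base parser score) triples; this layout, the function name `rerank` and the numbers are hypothetical.

```python
def rerank(candidates, alpha):
    """Pick the tree maximizing alpha * s_t + (1 - alpha) * s_b.

    candidates: list of (tree, s_t, s_b) triples for one sentence.
    """
    best = max(candidates, key=lambda c: alpha * c[1] + (1 - alpha) * c[2])
    return best[0]

kbest = [("y1", 0.2, 0.9),   # preferred by the base parser
         ("y2", 0.8, 0.5)]   # preferred by the RCNN

only_base = rerank(kbest, alpha=0.0)
only_rcnn = rerank(kbest, alpha=1.0)
mixed = rerank(kbest, alpha=0.5)
```

Setting α to 0 reproduces the base parser's choice and setting it to 1 uses the RCNN score alone; intermediate values trade the two off.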
To apply RCNN to the re-ranking model, we first get the k-best outputs of all sentences in the training set with a base parser. We can then train the RCNN in a discriminative way and optimize the re-ranking strategy for a particular base parser.
Note that the potential of RCNN is not fully exploited when it is applied in the re-ranking model, since gen(x) in Eq. (8) is just the k-best outputs of a base parser, not the set of all possible trees for sentence x. The parameters of RCNN could overfit to the k-best outputs of the training set.

Datasets
To empirically demonstrate the effectiveness of our approach, we use two datasets in different languages (English and Chinese) in our experimental evaluation and compare our model against the other state-of-the-art methods using the unlabeled attachment score (UAS) metric ignoring punctuation.
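For reference, the unlabeled attachment score with punctuation ignored can be computed as in the sketch below. The set of POS tags treated as punctuation is an assumption of this example (a common choice for PTB-style tags), and the function name `uas` is hypothetical.

```python
def uas(gold_heads, pred_heads, tags,
        punct_tags=("``", "''", ",", ".", ":")):
    """Percentage of non-punctuation tokens whose predicted head is correct."""
    kept = [(g, p) for g, p, t in zip(gold_heads, pred_heads, tags)
            if t not in punct_tags]
    if not kept:
        return 0.0
    correct = sum(1 for g, p in kept if g == p)
    return 100.0 * correct / len(kept)

# Toy sentence of four tokens; the last one is a full stop.
gold = [2, 0, 2, 2]
pred = [2, 0, 1, 2]
tags = ["PRP", "VBD", "NN", "."]
score = uas(gold, pred, tags)
```

The final token is excluded from both the numerator and the denominator, so the score here is computed over three tokens.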
English For English dataset, we follow the standard splits of Penn Treebank (PTB), using sections 2-21 for training, section 22 as development set and section 23 as test set. We tag the development and test sets using an automatic POS tagger (at 97.2% accuracy), and tag the training set using four-way jackknifing similar to (Collins and Koo, 2005).
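Four-way jackknifing of the training set can be sketched as follows: the data is split into four folds, and each fold is tagged by a tagger trained on the other three, so that the training tags contain errors similar to those seen at test time. The toy most-frequent-tag tagger and the round-robin fold assignment are assumptions made only to keep the example self-contained; they are not the tagger or split used in the experiments.

```python
from collections import Counter, defaultdict

def train_most_frequent(train):
    """Toy tagger: most frequent tag per word, 'NN' for unseen words."""
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag in sent:
            counts[word][tag] += 1
    def tagger(words):
        return [counts[w].most_common(1)[0][0] if w in counts else "NN"
                for w in words]
    return tagger

def jackknife_tag(sentences, train_tagger, n_folds=4):
    """Tag every sentence with a tagger that never saw that sentence."""
    tagged = [None] * len(sentences)
    for fold in range(n_folds):
        train = [s for i, s in enumerate(sentences) if i % n_folds != fold]
        tagger = train_tagger(train)
        for i, s in enumerate(sentences):
            if i % n_folds == fold:
                tagged[i] = tagger([w for w, _ in s])
    return tagged

sentences = [[("dogs", "NNS")], [("dogs", "NNS")], [("bark", "VBP")],
             [("dogs", "NNS")], [("dogs", "NNS")]]
auto_tags = jackknife_tag(sentences, train_most_frequent)
```

In this toy run "bark" receives the fallback tag because it never appears in the folds used to train its tagger, which is the kind of realistic tagging error jackknifing is meant to expose.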
Chinese For the Chinese dataset, we follow the same split of the Penn Chinese Treebank (CTB5) as described in (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.
We use the linear-time incremental parser (Huang and Sagae, 2010) as our base parser and calculate the 64-best parses at the top cell of the chart. Note that we optimize the training settings for base parser and the results are slightly improved on (Huang and Sagae, 2010). Then we use max-margin criterion to train RCNN. Finally, we use the mixture strategy to re-rank the top 64-best parses.
For the initialization of parameters, we train word2vec embeddings (Mikolov et al., 2013) on Wikipedia corpora for English and Chinese respectively. For the combination matrices and score vectors, we use random initialization within (-0.01, 0.01). The parameters that achieve the best unlabeled attachment score on the development set are chosen for the final evaluation.

English Dataset
We first evaluate the performance of the RCNN and the re-ranker (Eq. (14)) on the development set. Figure 5 shows the UAS of different models with varying k. The base parser achieves 92.45%. When k = 64, the oracle best of the base parser achieves 97.34%, while the oracle worst achieves 73.30% (-19.15%). RCNN achieves its maximum of 93.00% (+0.55%) when k = 6. When k > 6, the performance of RCNN declines as k increases but is still higher than the baseline (92.45%). The reason is that RCNN may require more negative samples to avoid overfitting when k is large. Since the negative samples are limited to the k-best outputs of a base parser, the learnt parameters can easily overfit to the training set.
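The oracle best and oracle worst figures can be understood through the following sketch, which picks, for every sentence, the candidate with the most (or fewest) correct heads among the top k. The corpus-level aggregation, the omission of punctuation filtering and the function names are assumptions of this example.

```python
def correct_heads(gold, pred):
    """Number of tokens whose predicted head matches the gold head."""
    return sum(1 for g, p in zip(gold, pred) if g == p)

def oracle_uas(gold_trees, kbest_lists, k, pick=max):
    """UAS when an oracle picks the best (max) or worst (min) of the top k."""
    correct = total = 0
    for gold, kbest in zip(gold_trees, kbest_lists):
        correct += pick(correct_heads(gold, cand) for cand in kbest[:k])
        total += len(gold)
    return 100.0 * correct / total

gold_trees = [[2, 0, 2]]
kbest_lists = [[[2, 0, 2], [0, 0, 2], [1, 1, 1]]]

best_3 = oracle_uas(gold_trees, kbest_lists, k=3, pick=max)
worst_3 = oracle_uas(gold_trees, kbest_lists, k=3, pick=min)
worst_2 = oracle_uas(gold_trees, kbest_lists, k=2, pick=min)
```

As k grows, the oracle best can only stay the same or rise and the oracle worst can only stay the same or fall, which is the widening gap that the re-ranker has to cope with.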
In the mixture re-ranker, α is optimised by searching with a step size of 0.005. We use the mixture re-ranker in the following experiments, since it can take advantage of both the RCNN and the base models. Figure 6 shows the accuracies on the top ten POS tags of the modifier words with the largest improvements. We can see that our re-ranker improves the accuracies of CC and IN, and may therefore indirectly help with the well-known coordinating conjunction and PP-attachment problems.
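The search for α can be sketched as a simple grid search on the development set. The data layout, the tie-breaking behaviour and the use of the raw number of correct heads as the selection criterion are assumptions of this example; only the step size of 0.005 comes from the text.

```python
def tune_alpha(dev, step=0.005):
    """Grid search for alpha in [0, 1].

    dev: list of (gold_heads, candidates); each candidate is a
    (heads, s_t, s_b) triple. Returns (best_alpha, correct_heads).
    """
    n_steps = round(1 / step)
    best_alpha, best_correct = 0.0, -1
    for i in range(n_steps + 1):
        alpha = i * step          # multiply to avoid accumulating float error
        correct = 0
        for gold, cands in dev:
            heads = max(cands,
                        key=lambda c: alpha * c[1] + (1 - alpha) * c[2])[0]
            correct += sum(1 for g, h in zip(gold, heads) if g == h)
        if correct > best_correct:
            best_alpha, best_correct = alpha, correct
    return best_alpha, best_correct

# One toy sentence: the base parser prefers a wrong tree, the RCNN the gold one.
gold = [2, 2, -1]
dev = [(gold, [([1, 2, -1], 0.0, 1.0),    # wrong tree, high base score
               ([2, 2, -1], 1.0, 0.0)])]  # gold tree, high RCNN score
alpha, n_correct = tune_alpha(dev)
```

On this toy input the gold tree is only selected once α gives the RCNN score at least as much weight as the base parser score.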
The final experimental results on test set are shown in Table 1. The hyperparameters of our model are set as in Table 2. Our re-ranker achieves the maximum improvement of 93.83%(+1.48%) on test set. Our system performs slightly better than many state-of-the-art systems such as Zhang and Clark (2008) and Huang and Sagae (2010). It outperforms Hayashi et al. (2013) and Le and Zuidema (2014), which also use the mixture reranking strategy.
Since the result of the re-ranker is conditioned on the k-best results of the base parser, we also run an experiment that avoids this limitation by adding the oracle to the k-best candidates. With the oracle included, the re-ranker can achieve 94.16% UAS, which is shown in the last line ("our re-ranker (with oracle)") of Table 1.

Figure 6: Accuracies on the top ten POS tags of the modifier words with the largest improvements on the development set (legend: base parser, re-ranker).

Chinese Dataset
We also conduct experiments on the Penn Chinese Treebank (CTB5). The hyperparameters are the same as in the previous experiment on English, except that α is optimised by searching with a step size of 0.005. The final experimental results on the test set are shown in Table 3. Our re-ranker achieves 85.71% (+0.25%) on the test set, which outperforms most previous methods (Table 3). With the oracle added, the re-ranker can achieve 87.43% UAS, which is shown in the last line ("our re-ranker (with oracle)") of Table 3.
Compared with the re-ranking model of Hayashi et al. (2013), which uses a large number of handcrafted features, our model achieves competitive performance with minimal feature engineering.

Table 1: Accuracy on English test set. Our baseline is the result of the base parser; our re-ranker uses the mixture strategy on the 64-best outputs of the base parser; our re-ranker (with oracle) adds the oracle to the k-best outputs of the base parser.

System                          UAS
Traditional Methods
  Zhang and Clark (2008)        91.4
  Huang and Sagae (2010)        92.1
Distributed Representations
  Stenetorp (2013)              86.25
                                93.74
  Chen and Manning (2014)       92.0
Re-rankers
  Hayashi et al. (2013)         93.12
  Le and Zuidema (2014)         93.12
  Our baseline                  92.35
  Our re-ranker                 93.83 (+1.48)
  Our re-ranker (with oracle)   94.16

Discussions
The performance of the re-ranking model is affected by the base parser. The small divergence among the dependency trees in the output list also leads to overfitting in the training phase. Although our re-ranker outperforms the state-of-the-art methods, it can also benefit from improving the quality of the candidate results. It was also reported in other re-ranking works that a larger k (e.g. k > 64) results in worse performance. We think the reason is that the oracle best increases when k is larger, but the oracle worst decreases to a larger degree. The error types increase greatly. The re-ranking model requires more negative samples to avoid overfitting. When k is larger, the number of negative samples needed for training also increases multiplicatively. However, we can obtain at most k negative samples from the k-best outputs of the base parser.

Table 2: Hyperparameters of our model.

Word embedding size             m = 25
Distance embedding size         m_d = 25
Initial learning rate           ρ = 0.1
Margin loss discount            κ = 2.0
Regularization                  λ = $10^{-4}$
k-best                          k = 64

Table 3: Accuracy on Chinese test set.

System                          UAS
Traditional Methods
  Zhang and Clark (2008)        84.33
  Huang and Sagae (2010)        85.20
Distributed Representations
                                82.94
  Chen and Manning (2014)       83.9
Re-rankers
  Hayashi et al. (2013)         85.9
  Our baseline                  85.46
  Our re-ranker                 85.71 (+0.25)
  Our re-ranker (with oracle)   87.43
The experiments also show that our model can achieve significant improvements when the oracles are added to the output lists of the base parser. This indicates that our model can be boosted by a better set of candidate results, which can be obtained by integrating the RCNN into the decoding algorithm.

Related Work
There have been several works that use neural networks and distributed representations for dependency parsing.  Stenetorp (2013) attempted to build recursive neural networks for transition-based dependency parsing; however, the empirical performance of his model is still unsatisfactory. Chen and Manning (2014) improved transition-based dependency parsing by representing all words, POS tags and arc labels as dense vectors, and modeled their interactions with a neural network to predict actions. Their methods target transition-based parsing and cannot model the sentence in a semantic vector space for other NLP tasks. Socher et al. (2013b) proposed compositional vectors computed by a dependency tree RNN (DT-RNN) to map sentences and images into a common embedding space. However, there are two major differences between DT-RNN and our model. 1) They first sum up all child nodes into a dense vector $v_c$ and then compose the subtree representation from $v_c$ and the vector of the parent node. In contrast, our model first combines the parent with each child and then chooses the most informative features with a pooling layer. 2) We represent the relative position of each child and its parent with a distributed representation (position embeddings), which is very useful for the convolutional layer. Figure 7 shows an example of DT-RNN, to illustrate how it differs from the way RCNN represents phrases as continuous vectors.
Specific to the re-ranking model, Le and Zuidema (2014) proposed a generative re-ranking model with the Inside-Outside Recursive Neural Network (IORNN), which can process trees both bottom-up and top-down. However, IORNN works in a generative way and only estimates the probability of a given tree, so it cannot fully utilize the incorrect trees in the k-best candidate results. Besides, IORNN treats a dependency tree as a sequence, which can be regarded as a generalization of the simple recurrent neural network (SRNN) (Elman, 1990). Unlike IORNN, our proposed RCNN is a discriminative model and can optimize the re-ranking strategy for a particular base parser. Another difference is that RCNN computes the score of a tree in a recursive way, which is more natural for the hierarchical structure of natural language. Besides, RCNN can be used not only for re-ranking, but also as a general model to represent a sentence with its dependency tree.

Conclusion
In this work, we address the problem of representing nodes at all levels (words or phrases) with dense representations in a dependency tree. We propose a recursive convolutional neural network (RCNN) architecture to capture the syntactic and compositional-semantic representations of phrases and words. RCNN is a general architecture and can deal with k-ary parsing trees, so it is suitable for many NLP tasks and can reduce the effort in feature engineering when combined with an external dependency parser. Although RCNN is only used for re-ranking the output of a dependency parser in this paper, it can be regarded as a semantic model of text sequences that maps input sequences of varying length to a fixed-length vector. The parameters in RCNN can be learned jointly with other NLP tasks, such as text classification.
For future research, we will develop an integrated parser that combines RCNN with a decoding algorithm. We believe that the integrated parser can achieve better performance without the limitation of the base parser. Moreover, we also wish to investigate the ability of our model on other NLP tasks.