Transition-based Dependency Parsing Using Two Heterogeneous Gated Recursive Neural Networks

Recently, neural network based dependency parsing has attracted much interest, as it can effectively alleviate the problems of data sparsity and feature engineering by using dense features. However, it is still a challenging problem to sufficiently model the complicated syntactic and semantic compositions of the dense features in neural network based methods. In this paper, we propose two heterogeneous gated recursive neural networks:


Introduction
Transition-based dependency parsing is a core task in natural language processing (NLP) and has received considerable attention in the NLP community. Traditional discriminative dependency parsing methods have achieved great success (Koo et al., 2008; He et al., 2013; Bohnet, 2010; Huang and Sagae, 2010; Zhang and Nivre, 2011; Martins et al., 2009; McDonald et al., 2005; Nivre et al., 2006; Kübler et al., 2009; Goldberg and Nivre, 2013; Choi and McCallum, 2013; Ballesteros and Bohnet, 2014). However, these methods are based on discrete features and suffer from the problems of data sparsity and feature engineering (Chen and Manning, 2014).
Recently, distributed representations have been widely used in a variety of NLP tasks (Collobert et al., 2011; Devlin et al., 2014; Socher et al., 2013; Turian et al., 2010; Mikolov et al., 2013b; Bengio et al., 2003). For transition-based parsing in particular, neural network based methods have drawn increasing attention because they reduce the effort spent on feature engineering and improve performance (Le and Zuidema, 2014; Stenetorp, 2013; Bansal et al., 2014; Chen and Manning, 2014; Zhu et al., 2015).
However, most of the existing neural network based methods still need some feature engineering. For example, most methods select only the first and second leftmost/rightmost children of the top nodes in the stack, which can miss valuable information hidden in the unchosen nodes. Besides, the features of the selected nodes are simply concatenated and then fed into the neural network. Since the concatenation operation is relatively simple, it is difficult to model the complicated feature combinations that can be manually designed in the traditional discrete feature based methods.
To tackle these problems, we use two heterogeneous gated recursive neural networks, the tree structured gated recursive neural network (Tree-GRNN) and the directed acyclic graph structured gated recursive neural network (DAG-GRNN), to model each configuration during transition-based dependency parsing. The two proposed GRNNs introduce the gate mechanism (Chung et al., 2014) to improve the standard recursive neural network (RNN) (Socher et al., 2013; Socher et al., 2014), and can model the syntactic and semantic compositions of the nodes during parsing. Figure 1 gives a rough sketch of the standard RNN, Tree-GRNN and DAG-GRNN. Tree-GRNN is applied to the partially constructed trees in the stack, which have already been built by the previous transition actions. DAG-GRNN is applied to model the feature composition of the nodes in the stack and buffer whose dependency relations have not been labeled yet. Intuitively, Tree-GRNN selects and merges features recursively from children nodes into their parent according to their dependency structures, while DAG-GRNN further models the complicated combinations of the extracted features and explicitly exploits features at different levels of granularity.
To evaluate our approach, we experiment on two prevalent benchmark datasets: English Penn Treebank 3 (PTB3) and Chinese Penn Treebank 5 (CTB5). Experimental results show the effectiveness of our proposed method. Compared to the parser of Chen and Manning (2014), we achieve improvements of 0.6% (UAS) and 0.9% (LAS) on the PTB3 test set, and 0.8% (UAS) and 1.3% (LAS) on the CTB5 test set.

Neural Network Based Transition Dependency Parsing

Transition Dependency Parsing
In this paper, we employ the arc-standard transition system (Nivre, 2004) and examine only greedy parsing for its efficiency. Figure 2 gives an example of arc-standard transition dependency parsing.
In transition-based dependency parsing, the consecutive configurations of the parsing process can be defined as $c^{(i)} = (s^{(i)}, b^{(i)}, A^{(i)})$, which consists of a stack $s$, a buffer $b$, and a set of dependency arcs $A$. The greedy parsing process then consecutively predicts the actions based on the features extracted from the corresponding configurations. For a given sentence $w_1, \ldots, w_n$, the parsing process starts from an initial configuration $c^{(0)} = ([\mathrm{ROOT}], [w_1, \ldots, w_n], \emptyset)$ and terminates at some configuration $c^{(2n)} = ([\mathrm{ROOT}], \emptyset, A^{(2n)})$, where $n$ is the length of the given sentence $w_{1:n}$. As a result, we derive the parse tree of the sentence $w_{1:n}$ from the arc set $A^{(2n)}$.
In the arc-standard system, there are three types of actions: Left-Arc, Right-Arc and Shift. Denoting $s_j$ ($j = 1, 2, \ldots$) as the $j$-th top element of the stack and $b_j$ ($j = 1, 2, \ldots$) as the $j$-th front element of the buffer, we can formalize the three actions of the arc-standard system as:
• Left-Arc($l$) adds an arc $s_2 \leftarrow s_1$ with label $l$ and removes $s_2$ from the stack, resulting in a new arc $l(s_1, s_2)$. Precondition: $|s| \geq 3$ (the ROOT node cannot be a child node).
• Right-Arc($l$) adds an arc $s_2 \rightarrow s_1$ with label $l$ and removes $s_1$ from the stack, resulting in a new arc $l(s_2, s_1)$. Precondition: $|s| \geq 2$.
• Shift removes $b_1$ from the buffer and adds it to the stack. Precondition: $|b| \geq 1$.
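The three actions and their preconditions can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Configuration`, `initial`, `can_apply` and `apply` are hypothetical, words are represented by their indices 1..n, and index 0 is assumed to stand for ROOT.

```python
from dataclasses import dataclass, field

ROOT = 0  # assumed index for the artificial ROOT node


@dataclass
class Configuration:
    """A parser configuration c = (s, b, A) over word indices 1..n."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # (head, dependent, label) triples


def initial(n):
    """Initial configuration ([ROOT], [w_1, ..., w_n], empty arc set)."""
    return Configuration(stack=[ROOT], buffer=list(range(1, n + 1)))


def can_apply(c, action):
    """Check the precondition of 'LEFT', 'RIGHT' or 'SHIFT'."""
    if action == "LEFT":
        return len(c.stack) >= 3  # keeps ROOT from becoming a child
    if action == "RIGHT":
        return len(c.stack) >= 2
    return len(c.buffer) >= 1     # SHIFT


def apply(c, action, label=None):
    """Apply one transition in place and return the configuration."""
    if not can_apply(c, action):
        raise ValueError(f"precondition of {action} violated")
    if action == "LEFT":          # s2 <- s1, then remove s2
        s1 = c.stack.pop()
        s2 = c.stack.pop()
        c.arcs.add((s1, s2, label))
        c.stack.append(s1)
    elif action == "RIGHT":       # s2 -> s1, then remove s1
        s1 = c.stack.pop()
        c.arcs.add((c.stack[-1], s1, label))
    else:                         # SHIFT: move b1 onto the stack
        c.stack.append(c.buffer.pop(0))
    return c
```

For a three-word sentence, the sequence Shift, Shift, Left-Arc, Shift, Right-Arc, Right-Arc uses 2n = 6 transitions and ends with only ROOT on the stack and an empty buffer.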
The greedy parser aims to predict the correct transition action for a given configuration. There are two versions of parsing: unlabeled and labeled. The number of candidate actions is $|T| = 2n_l + 1$ in the labeled version and $|T| = 3$ in the unlabeled version, where $n_l$ is the number of different types of arc labels.
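As a quick check of the action counts, the candidate set can be enumerated directly. The function name `action_set` and the label strings in the usage note are illustrative only.

```python
def action_set(labels=None):
    """Enumerate candidate actions as (name, label) pairs.

    With labels=None this is the unlabeled set of size 3; otherwise each
    arc action is paired with every label, giving 2 * n_l + 1 actions.
    """
    if labels is None:
        return [("LEFT", None), ("RIGHT", None), ("SHIFT", None)]
    actions = [(name, l) for name in ("LEFT", "RIGHT") for l in labels]
    actions.append(("SHIFT", None))
    return actions
```

With three hypothetical labels such as `["nsubj", "dobj", "root"]`, the labeled set contains 2 * 3 + 1 = 7 actions.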

Neural Network Based Parser
In the neural network architecture, the words, POS tags and arc labels are mapped into distributed vectors (embeddings). Specifically, given the word embedding matrix $E^w \in \mathbb{R}^{d_e \times n_w}$, each word $w_i$ is mapped to its corresponding column $e^w_{w_i} \in \mathbb{R}^{d_e}$ of $E^w$ according to its index in the dictionary, where $d_e$ is the dimensionality of the embeddings and $n_w$ is the dictionary size. Likewise, the POS tags and arc labels are mapped into embeddings by the POS embedding matrix $E^t \in \mathbb{R}^{d_e \times n_t}$ and the arc label embedding matrix $E^l \in \mathbb{R}^{d_e \times n_l}$ respectively, where $n_t$ and $n_l$ are the numbers of distinct POS tags and arc labels. Correspondingly, the embeddings of each POS tag $t_i$ and each arc label $l_i$ are $e^t_{t_i} \in \mathbb{R}^{d_e}$ and $e^l_{l_i} \in \mathbb{R}^{d_e}$, extracted from $E^t$ and $E^l$ respectively.
Figure 3 gives the architecture of the neural network based parser. Following Chen and Manning (2014), a set of elements $S$ from the stack and buffer (e.g. $S = \{s_2.lc_2.rc_1, s_2.lc_1, s_1, b_2, s_2.rc_2.rc_1, \ldots\}$) is chosen as input. Specifically, the information (word, POS or label) of each element in the set $S$ (e.g. $\{s_2.lc_2.rc_1.t, s_2.lc_1.l, s_1.w, s_1.t, b_2.w, \ldots\}$) is extracted and mapped into the corresponding embeddings. These embeddings are then concatenated into the input vector $x \in \mathbb{R}^{d}$. A special token NULL is used to represent a non-existent element.
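The lookup-and-concatenate step can be illustrated as follows. This is a simplified sketch: it stores each embedding table as a dictionary rather than as the columns of a matrix, and the names `make_embeddings` and `build_input`, the toy vocabulary and the initialization range used here are assumptions for illustration.

```python
import random


def make_embeddings(vocab, d_e, rng):
    """Map each symbol, plus the special NULL token, to a d_e-dim vector."""
    return {sym: [rng.uniform(-0.01, 0.01) for _ in range(d_e)]
            for sym in list(vocab) + ["NULL"]}


def build_input(features, tables):
    """Concatenate the embeddings of (kind, symbol) features into x.

    kind selects the table ('w' word, 't' POS tag, 'l' arc label);
    a symbol of None denotes a non-existent element and maps to NULL.
    """
    x = []
    for kind, sym in features:
        x.extend(tables[kind]["NULL" if sym is None else sym])
    return x
```

For example, three features with `d_e = 4` yield an input vector of length 12, and a missing element contributes the NULL embedding of its table.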
We use a standard neural network with one hidden layer of $d_h$ hidden units followed by a softmax layer:

$$y = \mathrm{softmax}(W_2 h + b_2), \quad (1)$$

where

$$h = g(W_1 x + b_1). \quad (2)$$

Here, $g$ is a non-linear function, which can be the hyperbolic tangent, sigmoid, cube (Chen and Manning, 2014), etc.
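A minimal forward pass of such a network is sketched below in plain Python. It assumes the usual form of a one-hidden-layer classifier, h = g(W1 x + b1) followed by y = softmax(W2 h + b2); the parameter names `W1`, `b1`, `W2`, `b2` and the choice of the cube activation as default are assumptions made for this sketch.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cube(v):
    """Cube activation g(a) = a^3, applied element-wise."""
    return [a ** 3 for a in v]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return [a / total for a in e]


def forward(x, W1, b1, W2, b2, g=cube):
    """Return the probability vector over actions for input x."""
    h = g([a + b for a, b in zip(matvec(W1, x), b1)])
    return softmax([a + b for a, b in zip(matvec(W2, h), b2)])
```

The output has one entry per candidate action and sums to one, so the predicted action is simply the index of the largest entry among the feasible ones.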

Recursive Neural Network
The recursive neural network (RNN) is a classical neural network that applies the same set of parameters recursively over a given structure (e.g. a syntactic tree) in topological order (Pollack, 1990; Socher et al., 2013).
In the simplest case, children nodes are combined into their parent node using a weight matrix $W$ that is shared across the whole network, followed by a non-linear function $g(\cdot)$. Specifically, given the left child node vector $h_L \in \mathbb{R}^d$ and the right child node vector $h_R \in \mathbb{R}^d$, their parent node vector $h_P \in \mathbb{R}^d$ is formalized as:

$$h_P = g\left(W \begin{bmatrix} h_L \\ h_R \end{bmatrix}\right), \quad (3)$$

where $W \in \mathbb{R}^{d \times 2d}$ and $g$ is a non-linear function as mentioned above.
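The recursive composition can be sketched as a short recursion over a binary tree. The sketch assumes the parent is computed as g(W [h_L; h_R]) with g = tanh; the tree encoding (nested pairs with string leaf keys) and the names `compose` and `encode` are hypothetical.

```python
import math


def compose(W, h_left, h_right, g=math.tanh):
    """Combine two d-dim child vectors into a d-dim parent vector."""
    x = h_left + h_right  # list concatenation, i.e. [h_L; h_R] of length 2d
    return [g(sum(w * v for w, v in zip(row, x))) for row in W]


def encode(tree, W, leaf_vectors):
    """Encode a tree given as a leaf key or a (left, right) pair.

    The same matrix W is reused at every internal node.
    """
    if isinstance(tree, tuple):
        left, right = tree
        return compose(W,
                       encode(left, W, leaf_vectors),
                       encode(right, W, leaf_vectors))
    return leaf_vectors[tree]
```

Because `W` is shared, the number of parameters does not grow with the size of the tree; only the depth of the recursion does.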
Figure 4: Architecture of our proposed dependency parser using two heterogeneous gated recursive neural networks.
Our parser uses two heterogeneous gated recursive neural networks: the tree structured gated recursive neural network (Tree-GRNN) and the directed acyclic graph structured gated recursive neural network (DAG-GRNN). Tree-GRNN is applied to the subtrees with partial dependency relations in the stack, which have already been constructed by the previous transition actions. DAG-GRNN is employed to model the feature composition of the nodes in the stack and buffer whose dependency relations have not been labeled yet. Figure 4 shows the whole architecture of our model, which integrates the two different GRNNs to predict the action for each parsing configuration. The two GRNNs are described in detail in the following two subsections.

Tree Structured Gated Recursive Neural Network
It is natural to merge the information from children nodes into their parent node recursively according to the given tree structures in the stack. Although the dependency relations have been built, it is still hard to apply the recursive neural network (Eq. 3) directly, because the number of children of each node in the stack is not fixed. With an averaging operation over children nodes (Socher et al., 2014), the parent node cannot capture the crucial features well from the mixed information of its children. Here, we propose the tree structured gated recursive neural network (Tree-GRNN), which incorporates the gate mechanism (Chung et al., 2014; Chen et al., 2015a; Chen et al., 2015b) and can selectively choose the crucial features of the children nodes.
In Tree-GRNN, each node $p$ of the trees in the stack is composed of three components: the state vector of its left children nodes $v^p_l \in \mathbb{R}^{d_c}$, the state vector of the current node $v^p_n \in \mathbb{R}^{d_n}$, and the state vector of its right children nodes $v^p_r \in \mathbb{R}^{d_c}$, where $d_n$ and $d_c$ are the corresponding vector dimensionalities. We represent the information of node $p$ as the vector

$$v^p = \begin{bmatrix} v^p_l \\ v^p_n \\ v^p_r \end{bmatrix}, \quad (4)$$

where $v^p \in \mathbb{R}^q$ and $q = 2d_c + d_n$. Specifically, $v^p_n$ contains the information of the current node, including its word form $p.w$, POS tag $p.t$ and label type $p.l$, as shown in Eq. 5, while $v^p_l$ and $v^p_r$ are initialized to zero vectors $0 \in \mathbb{R}^{d_c}$ and then updated as in Eq. 6:

$$v^p_n = \tanh\left(\begin{bmatrix} e^w_{p.w} \\ e^t_{p.t} \\ e^l_{p.l} \end{bmatrix}\right), \quad (5)$$
where the word embedding $e^w_{p.w} \in \mathbb{R}^{d_e}$, POS embedding $e^t_{p.t} \in \mathbb{R}^{d_e}$ and label embedding $e^l_{p.l} \in \mathbb{R}^{d_e}$ are extracted from the embedding matrices $E^w$, $E^t$ and $E^l$ according to the indices of the corresponding word $p.w$, POS tag $p.t$ and label $p.l$ respectively. In the case of unlabeled attachment parsing, we ignore the last term $e^l_{p.l}$ in Eq. 5; thus, the dimensionality $d_n$ of $v^p_n$ varies. In the labeled attachment parsing case, we use a special token NULL to represent the label $p.l$ if it is not available (e.g. when $p$ is a node in the stack or buffer).
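Given a node $p$ with left children nodes $p.lc_i$ and right children nodes $p.rc_i$, we update the left children information $v^p_l$ and the right children information $v^p_r$ as

$$v^p_l = \tanh\left(W_l \frac{1}{N_L(p)} \sum_i o^{p.lc_i} \odot v^{p.lc_i} + b_l\right), \qquad v^p_r = \tanh\left(W_r \frac{1}{N_R(p)} \sum_i o^{p.rc_i} \odot v^{p.rc_i} + b_r\right), \quad (6)$$

where $o^{p.lc_i}$ and $o^{p.rc_i}$ are the reset gates of the nodes $p.lc_i$ and $p.rc_i$ respectively, as shown in Eq. 7. The functions $N_L(p)$ and $N_R(p)$ return the numbers of left and right children nodes of node $p$ respectively. The operator $\odot$ indicates element-wise multiplication. $W_l \in \mathbb{R}^{d_c \times q}$ and $W_r \in \mathbb{R}^{d_c \times q}$ are weight matrices, and $b_l \in \mathbb{R}^{d_c}$ and $b_r \in \mathbb{R}^{d_c}$ are bias terms. The reset gates $o^{p.lc_i}$ and $o^{p.rc_i}$ can be formalized as

$$o^{p.lc_i} = \sigma\left(W_o \begin{bmatrix} v^{p.lc_i} \\ v^p_n \end{bmatrix} + b_o\right), \qquad o^{p.rc_i} = \sigma\left(W_o \begin{bmatrix} v^{p.rc_i} \\ v^p_n \end{bmatrix} + b_o\right), \quad (7)$$

where $\sigma$ indicates the sigmoid function, $W_o \in \mathbb{R}^{q \times (q + d_n)}$ and $b_o \in \mathbb{R}^q$. With this mechanism, we can summarize the whole information in the stack recursively from children nodes to their parent using the partially built tree structure. Intuitively, the gate mechanism can selectively choose the crucial features of a child node according to the gate state, which is derived from the current child node and its parent.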
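The gated merge of children into a parent can be sketched as below. The sketch rests on two assumptions that go beyond the prose: the gated child vectors are averaged over the number of children before the linear map, and tanh is used as the squashing function. The function names and the `params` dictionary layout are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def reset_gate(v_child, v_n_parent, W_o, b_o):
    """Gate in R^q computed from the child vector and the parent's v_n."""
    z = matvec(W_o, v_child + v_n_parent)  # input has length q + d_n
    return [sigmoid(a + b) for a, b in zip(z, b_o)]


def merge_children(children, v_n_parent, W, b, W_o, b_o, d_c):
    """Gated average of child vectors, projected to R^{d_c} (assumed form)."""
    if not children:
        return [0.0] * d_c  # no children: keep the zero initialization
    q = len(children[0])
    acc = [0.0] * q
    for v_child in children:
        o = reset_gate(v_child, v_n_parent, W_o, b_o)
        for k in range(q):
            acc[k] += o[k] * v_child[k]
    avg = [a / len(children) for a in acc]
    return [math.tanh(a + c) for a, c in zip(matvec(W, avg), b)]


def node_vector(v_n, left, right, params):
    """Return v^p = [v_l; v_n; v_r] for a node with the given children."""
    v_l = merge_children(left, v_n, params["W_l"], params["b_l"],
                         params["W_o"], params["b_o"], params["d_c"])
    v_r = merge_children(right, v_n, params["W_r"], params["b_r"],
                         params["W_o"], params["b_o"], params["d_c"])
    return v_l + v_n + v_r
```

Calling `node_vector` bottom-up, leaves first, yields the representation of every subtree root in the stack; a leaf simply gets zero vectors for both children components.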

Directed Acyclic Graph Structured Gated Recursive Neural Network
Previous neural parsing work feeds the extracted features into a standard neural network with one hidden layer. The hidden units are then fed into a softmax layer, which outputs the probability vector over the available actions. However, such a network cannot model the complicated combinations of the extracted features well. For the nodes whose dependency relations are still unknown, we propose another recursive neural network, the directed acyclic graph structured gated recursive neural network (DAG-GRNN), to better model the interactions of features. Intuitively, the DAG structure models the combinations of features by recursively mixing the information from the bottom layer to the top layer, as shown in Figure 4. The concatenation operation can be regarded as a mix of features at different levels of granularity. Each node in the directed acyclic graph can be seen as a complicated feature composition of its governed nodes. Moreover, we also use the gate mechanism to better model the feature combinations by introducing two kinds of gates, namely the "reset gate" and the "update gate". Intuitively, without gates each node in the network would preserve all the information of its governed nodes; the gate mechanism plays the role of a filter that decides how to selectively exploit the information of the children nodes, discovering and preserving the crucial features.
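The DAG-GRNN structure consists of minimal structures as shown in Figure 6. The vectors $h_P$, $h_L$, $h_R$ and $h_{\hat{P}} \in \mathbb{R}^q$ denote the values of the parent node $P$, the left child node $L$, the right child node $R$ and the new activation node $\hat{P}$ respectively. The value of the parent node $h_P \in \mathbb{R}^q$ is computed as:

$$h_P = z_{\hat{P}} \odot h_{\hat{P}} + z_L \odot h_L + z_R \odot h_R, \quad (8)$$

where $z_{\hat{P}}$, $z_L$ and $z_R \in \mathbb{R}^q$ are the update gates for the new activation node $\hat{P}$, the left child node $L$ and the right child node $R$ respectively. The operator $\odot$ indicates element-wise multiplication.
The update gates $z$ can be formalized as:

$$\begin{bmatrix} z_{\hat{P}} \\ z_L \\ z_R \end{bmatrix} \propto \exp\left(W_z \begin{bmatrix} h_{\hat{P}} \\ h_L \\ h_R \end{bmatrix}\right), \quad (9)$$

which are constrained by:

$$[z_{\hat{P}}]_k + [z_L]_k + [z_R]_k = 1, \quad 1 \leq k \leq q, \quad (10)$$

where $W_z \in \mathbb{R}^{3q \times 3q}$ is the coefficient matrix of the update gates. The value of the new activation node $h_{\hat{P}}$ is computed as:

$$h_{\hat{P}} = \tanh\left(W_{\hat{P}} \begin{bmatrix} r_L \odot h_L \\ r_R \odot h_R \end{bmatrix}\right), \quad (11)$$

where $W_{\hat{P}} \in \mathbb{R}^{q \times 2q}$, $r_L \in \mathbb{R}^q$ and $r_R \in \mathbb{R}^q$. $r_L$ and $r_R$ are the reset gates for the left child node $L$ and the right child node $R$ respectively, which can be formalized as:

$$\begin{bmatrix} r_L \\ r_R \end{bmatrix} = \sigma\left(W_r \begin{bmatrix} h_L \\ h_R \end{bmatrix}\right), \quad (12)$$

where $W_r \in \mathbb{R}^{2q \times 2q}$ is the coefficient matrix of the two reset gates and $\sigma$ indicates the sigmoid function. Intuitively, the reset gates $r$ partially read the information from the left and right children and output a new activation node $h_{\hat{P}}$, while the update gates $z$ selectively choose the information among the new activation node $\hat{P}$, the left child node $L$ and the right child node $R$. This gate mechanism is effective for modeling the combinations of features.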
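One minimal gated unit, and a small layered structure built from it, can be sketched as follows. Several details are assumptions of this sketch rather than statements of the method: the update gates are normalized with a per-dimension softmax over the three candidates, the new activation uses tanh, and each layer combines adjacent pairs of nodes from the layer below. The names `gated_unit` and `dag_features` are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_unit(h_L, h_R, W_r, W_P, W_z):
    """Combine two q-dim children into a q-dim parent with reset/update gates."""
    q = len(h_L)
    # Reset gates [r_L; r_R] read part of each child.
    r = [sigmoid(a) for a in matvec(W_r, h_L + h_R)]
    gated = ([r[k] * h_L[k] for k in range(q)] +
             [r[q + k] * h_R[k] for k in range(q)])
    h_new = [math.tanh(a) for a in matvec(W_P, gated)]
    # Update gates: for every dimension k the three gates sum to one.
    s = matvec(W_z, h_new + h_L + h_R)
    h_P = []
    for k in range(q):
        scores = [s[k], s[q + k], s[2 * q + k]]
        m = max(scores)
        e = [math.exp(a - m) for a in scores]
        total = sum(e)
        z_new, z_L, z_R = (a / total for a in e)
        h_P.append(z_new * h_new[k] + z_L * h_L[k] + z_R * h_R[k])
    return h_P


def dag_features(leaves, W_r, W_P, W_z, n_layers=2):
    """Stack layers over adjacent pairs and concatenate all node values."""
    layers = [leaves]
    for _ in range(n_layers):
        prev = layers[-1]
        if len(prev) < 2:
            break
        layers.append([gated_unit(prev[i], prev[i + 1], W_r, W_P, W_z)
                       for i in range(len(prev) - 1)])
    return [v for layer in layers for node in layer for v in node]
```

With all weights set to zero, the new activation is zero and the three update gates are each 1/3, so the parent is one third of the sum of its two children; this makes a convenient sanity check.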
Finally, we concatenate all the nodes in the DAG-GRNN structure as the input $x$ of the architecture described in Section 2.2, which outputs the probability vector over all available actions.

Inference
We use greedy decoding in parsing. At each step, we apply our two GRNNs to the current configuration to extract the features. After the softmax operation, we choose the feasible transition with the highest probability and perform it on the current configuration to get the next configuration.
In practice, we do not need to recompute the Tree-GRNN over all the trees in the stack for each configuration. Instead, we preserve the representations of the trees in the stack. When a new transition is applied to the configuration, we only update the affected representations using Tree-GRNN.
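The greedy decoding loop can be sketched for the unlabeled case as follows. The scoring function `score` is a hypothetical stand-in for the two GRNNs plus the softmax layer, and the caching of subtree representations described above is not shown.

```python
def feasible_actions(stack, buffer):
    """Actions whose arc-standard preconditions hold."""
    actions = []
    if len(stack) >= 3:
        actions.append("LEFT")
    if len(stack) >= 2:
        actions.append("RIGHT")
    if buffer:
        actions.append("SHIFT")
    return actions


def greedy_parse(n, score):
    """Parse a sentence of n words and return (head, dependent) arcs.

    score(stack, buffer, arcs) must return a dict that maps each action
    name to its probability; it stands in for the trained model.
    """
    stack, buffer, arcs = [0], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        probs = score(stack, buffer, arcs)
        action = max(feasible_actions(stack, buffer), key=lambda a: probs[a])
        if action == "LEFT":      # s2 <- s1, remove s2
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT":   # s2 -> s1, remove s1
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:                     # SHIFT
            stack.append(buffer.pop(0))
    return arcs
```

Restricting the argmax to feasible actions guarantees termination: every step either shortens the buffer or shortens the stack, so a sentence of n words is parsed in exactly 2n transitions.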

Training
We use the maximum likelihood (ML) criterion to train our model. The training set $(x_i, y_i)$ is extracted from gold parse trees using a shortest stack oracle, which always prefers a Left-Arc($l$) or Right-Arc($l$) action over Shift. The goal of our model is to minimize the loss function with respect to the parameter set $\theta$:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i \mid x_i; \theta), \quad (13)$$

where $m$ is the number of extracted training examples, which is the same as the number of all configurations. Following Socher et al. (2013), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with a minibatch strategy to minimize the objective. We also employ dropout to avoid overfitting.
In practice, we use a DAG-GRNN with two hidden layers, which gives the best performance. We use an approximated gradient for Tree-GRNN, which only back-propagates the gradient through the first two layers.
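A shortest stack oracle for the unlabeled arc-standard system can be sketched as below. The sketch assumes a projective gold tree given as a list of head indices; the function name `oracle_actions` and the bookkeeping of unattached dependents are illustrative choices.

```python
def oracle_actions(heads):
    """Return the oracle action sequence for a projective gold tree.

    heads[i] is the gold head of word i (1-based, 0 = ROOT); heads[0]
    is a placeholder and is never read. Arc actions are tried before
    SHIFT, which keeps the stack as short as possible.
    """
    n = len(heads) - 1
    remaining = [0] * (n + 1)  # dependents of each node not yet attached
    for i in range(1, n + 1):
        remaining[heads[i]] += 1
    stack, buffer, actions = [0], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        if (len(stack) >= 3 and heads[stack[-2]] == stack[-1]
                and remaining[stack[-2]] == 0):
            actions.append("LEFT")
            remaining[stack[-1]] -= 1
            del stack[-2]
        elif (len(stack) >= 2 and heads[stack[-1]] == stack[-2]
                and remaining[stack[-1]] == 0):
            actions.append("RIGHT")
            remaining[stack[-2]] -= 1
            stack.pop()
        elif buffer:
            actions.append("SHIFT")
            stack.append(buffer.pop(0))
        else:
            raise ValueError("gold tree is not projective")
    return actions
```

A node is only attached by Right-Arc once all of its own dependents have been collected, since it leaves the stack at that point. Each (configuration, action) pair visited along the way is one training example.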
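The training objective and the diagonal AdaGrad update can be sketched as follows. The sketch assumes an averaged negative log-likelihood without any regularization term, and the learning rate `lr` and smoothing constant `eps` are placeholder values, not the settings used in the experiments.

```python
import math


def nll(probs, gold):
    """Average negative log-likelihood of the gold actions.

    probs[i] is the predicted distribution for example i and gold[i]
    is the index of its gold action.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, gold)) / len(gold)


class AdaGrad:
    """Diagonal AdaGrad: one accumulated squared gradient per parameter."""

    def __init__(self, size, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.hist = [0.0] * size

    def step(self, theta, grad):
        """Update the flat parameter list theta in place and return it."""
        for k, g in enumerate(grad):
            self.hist[k] += g * g
            theta[k] -= self.lr * g / (math.sqrt(self.hist[k]) + self.eps)
        return theta
```

In a minibatch setting, `grad` would be the gradient of the loss averaged over the examples of one minibatch; parameters that receive large or frequent gradients get progressively smaller steps.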

Datasets
To evaluate our proposed model, we experiment on two prevalent datasets: English Penn Treebank 3 (PTB3) and Chinese Penn Treebank 5 (CTB5) datasets.
• English: For the English Penn Treebank 3 (PTB3) dataset, we use sections 2-21 for training, and section 22 and section 23 as the development set and test set respectively. We adopt CoNLL Syntactic Dependencies (CD) (Johansson and Nugues, 2007) using the LTH Constituent-to-Dependency Conversion Tool.
• Chinese: For the Chinese Penn Treebank 5 (CTB5) dataset, we follow the same split as described in Zhang and Clark (2008). Dependencies are converted by the Penn2Malt tool with the head-finding rules of Zhang and Clark (2008).

Experimental Settings
For parameter initialization, we use random initialization within (-0.01, 0.01) for all parameters except the word embedding matrix $E^w$. Specifically, we adopt the pre-trained English word embeddings from Collobert et al. (2011), and we pre-train the Chinese word embeddings on a large unlabeled corpus, the Chinese Wikipedia, with the word2vec toolkit (Mikolov et al., 2013a). Table 1 gives the details of the hyper-parameter settings of our approach. In addition, we set the minibatch size to 20. In all experiments, we only take the $s_1$, $s_2$, $s_3$ nodes in the stack and the $b_1$, $b_2$, $b_3$ nodes in the buffer into account. We also apply dropout, but only at the nodes in the stack and buffer, with probability p = 20%.

Results
The experimental results on the PTB3 and CTB5 datasets are listed in Tables 2 and 3 respectively. On all datasets, we report unlabeled attachment scores (UAS) and labeled attachment scores (LAS). Following convention, punctuation is excluded from all evaluation metrics.
To evaluate the effectiveness of our approach, we compare our parsers with feature-based and neural parsers. For feature-based parsers, we compare our models with two prevalent parsers: MaltParser (Nivre et al., 2006) and MSTParser (McDonald and Pereira, 2006). For neural parsers, we compare our results with the parser of Chen and Manning (2014). Compared with the parser of Chen and Manning (2014), our parser with two heterogeneous gated recursive neural networks (Tree-GRNN+DAG-GRNN) achieves improvements of 0.6% (UAS) and 0.9% (LAS) on the PTB3 test set, and 0.8% (UAS) and 1.3% (LAS) on the CTB5 test set.
Since speed is not the focus of this paper, we have not optimized it heavily. On CTB (UAS), it takes about 2 days to train the Tree-GRNN+DAG-GRNN model on CPU only. The testing speed is about 2.7 sentences per second. The whole implementation is in Python.
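The two metrics can be computed as in the following sketch, where each token carries a (head, label) pair and a flag marking punctuation. The function name `attachment_scores` and the input layout are assumptions for illustration.

```python
def attachment_scores(gold, pred, is_punct):
    """Return (UAS, LAS) over non-punctuation tokens.

    gold and pred are lists of (head, label) pairs, one per token;
    is_punct is a parallel list of booleans.
    """
    total = uas = las = 0
    for (g_head, g_label), (p_head, p_label), punct in zip(gold, pred, is_punct):
        if punct:
            continue  # punctuation is excluded from both metrics
        total += 1
        if g_head == p_head:
            uas += 1
            if g_label == p_label:
                las += 1
    if total == 0:
        return 0.0, 0.0
    return uas / total, las / total
```

A token counts toward LAS only if both its head and its label are correct, so LAS can never exceed UAS.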

Effects of Gate Mechanisms
We compare five different models: the plain parser, the Tree-RNN parser, the Tree-GRNN parser, the Tree-RNN+DAG-GRNN parser, and the Tree-GRNN+DAG-GRNN parser. The experimental results show the effectiveness of our two proposed heterogeneous gated recursive neural networks.
Specifically, the plain parser is the same as the parser of Chen and Manning (2014), except that it only takes the nodes in the stack and buffer into account, i.e. it uses a simpler feature template. As the plain parser omits all children nodes of the trees in the stack, it performs poorly compared with the parser of Chen and Manning (2014).
In addition, we find that the plain parser outperforms MaltParser (standard) on the PTB3 dataset by about 1%, while it performs worse than MaltParser (standard) on the CTB5 dataset. This shows that the children nodes of the trees in the stack are of great importance, especially for Chinese. Moreover, it also shows the effectiveness of the neural network based model, which can represent complicated features as compact embeddings. The Tree-RNN parser additionally exploits all the children nodes of the trees in the stack; it is a simplified version of Tree-GRNN without the gate mechanism described in Section 4.1. In other words, Tree-RNN omits the gate terms $o^{p.lc_i}$ and $o^{p.rc_i}$ in Eq. 6. As we can see, the results are significantly boosted by utilizing all the information in the stack, which again shows the importance of the children nodes of the trees in the stack. Although the results of Tree-RNN are comparable to those of Chen and Manning (2014), it does not outperform their parser in all cases (e.g. UAS on CTB5), which implies that exploiting all the information without selection might incorporate noisy features. Moreover, the Tree-GRNN parser further boosts the performance by incorporating the gate mechanism. Intuitively, Tree-RNN, which exploits all the information in the stack without selection, cannot capture the crucial features well, while Tree-GRNN with the gate mechanism can selectively choose and preserve the effective features by adapting the current gate state.
We also experiment with parsers using two heterogeneous recursive neural networks: the Tree-RNN+DAG-GRNN parser and the Tree-GRNN+DAG-GRNN parser. The two parsers are similar in that both employ the DAG structured recursive neural network with the gate mechanism to model the combination of features extracted from the stack and buffer. The difference is that the former employs the Tree-RNN without the gate mechanism to model the features of the stack, while the latter employs the gated version (Tree-GRNN). Again, the performance of these two parsers is further boosted, which shows that DAG-GRNN can model well the combinations of the features summarized by the Tree-(G)RNN structure. In addition, we find that the performance does not drop much in most cases when the gate mechanism of Tree-GRNN is turned off, which implies that DAG-GRNN can help select the information from the trees in the stack even if it has not yet been selected by the gate mechanism of Tree-GRNN.

Convergence Speed
To further analyze the convergence speed of our approach, we compare the UAS results on the development sets of the two datasets for the first ten epochs, as shown in Figures 7 and 8. As the plain parser only takes the nodes in the stack and buffer into account, its performance is much poorer than that of the other parsers. Moreover, Tree-GRNN converges more slowly than Tree-RNN, which suggests that the gate mechanism might be more difficult to learn. With the introduction of DAG-GRNN, both the Tree-RNN and Tree-GRNN parsers converge faster, which shows that DAG-GRNN is of great help in speeding up convergence.

Related Work
Many neural network based methods have been used for transition based dependency parsing.  and Bansal et al. (2014) used dense vectors (embeddings) to represent words or features and found that these representations are complementary to the traditional discrete feature representation. However, these two methods only focus on the dense representations (embeddings) of words or features. Stenetorp (2013) first used RNN for transition based dependency parsing. He followed the standard RNN and used binary combination to model the representation of two linked words, but his model does not reach the performance of the traditional methods. Le and Zuidema (2014) proposed a generative re-ranking model with the Inside-Outside Recursive Neural Network (IORNN), which can process trees both bottom-up and top-down. However, IORNN works in a generative way and only estimates the probability of a given tree, so it cannot fully utilize the incorrect trees in the k-best candidate results. Besides, IORNN treats a dependency tree as a sequence, which can be regarded as a generalization of the simple recurrent neural network (SRNN) (Elman, 1990).
Although these two methods also used RNN, they only deal with binary combination, which is unnatural for dependency trees. Zhu et al. (2015) proposed a recursive convolutional neural network (RCNN) architecture to capture syntactic and compositional-semantic representations of phrases and words in a dependency tree. Different from the original recursive neural network, they introduced convolution and pooling layers, which can model a variety of compositions through the feature maps and choose the most informative compositions through the pooling layers. Chen and Manning (2014) improved transition-based dependency parsing by representing all words, POS tags and arc labels as dense vectors, and modeled their interactions with a neural network to predict the actions. Their method only relies on dense features and is not able to automatically learn the most useful feature conjunctions for predicting the transition action.
Compared with Chen and Manning (2014), our method can fully exploit the information of all the descendants of a node in the stack with Tree-GRNN. DAG-GRNN then automatically learns the complicated combinations of all the features, whereas the traditional discrete feature based methods need to design them manually. Dyer et al. (2015) improved transition-based dependency parsing using a stack long short-term memory neural network and achieved a significant improvement in performance. They focused on exploiting long distance dependencies and information, while we aim to automatically model the complicated feature combinations.

Conclusion
In this paper, we focus on the syntactic and semantic composition of the dense features for transition-based dependency parsing. We propose two heterogeneous gated recursive neural networks, Tree-GRNN and DAG-GRNN. Each hidden neuron in the two proposed GRNNs can be regarded as a different combination of the input features. Thus, the whole model is able to simulate the design of the sophisticated feature combinations used in the traditional discrete feature based methods.
Although the two proposed GRNNs are only used for greedy parsing based on the arc-standard transition system in this paper, it is easy to generalize them to other transition systems and to graph based parsing. In future work, we would also like to extend our GRNNs to other NLP tasks.