Graph-based Dependency Parsing with Bidirectional LSTM

In this paper, we propose a neural network model for graph-based dependency parsing which utilizes a Bidirectional LSTM (BLSTM) to capture richer contextual information instead of relying on high-order factorization, enabling our model to use far fewer features than previous work. In addition, we propose an effective way to learn sentence segment embeddings at the sentence level based on an extra forward LSTM network. Although our model uses only first-order factorization, experiments on the English Penn Treebank and the Chinese Penn Treebank show that it is competitive with previous higher-order graph-based dependency parsing models and state-of-the-art models.


Introduction
Dependency parsing is a fundamental task in language processing that has been investigated for decades. It has been applied in a wide range of applications such as information extraction and machine translation. Among the variety of dependency parsing models, graph-based models are attractive for their ability to score parsing decisions on a whole-tree basis. Typical graph-based models factor the dependency tree into subgraphs, including single arcs (McDonald et al., 2005), sibling or grandparent arcs (McDonald and Pereira, 2006; Carreras, 2007) or higher-order substructures (Koo and Collins, 2010; Ma and Zhao, 2012), and then score the whole tree by summing the scores of the subgraphs. In these models, subgraphs are usually represented as high-dimensional feature vectors which are then fed into a linear model to learn the feature weights.
However, conventional graph-based models rely heavily on feature engineering and their performance is restricted by the design of features. In addition, the standard decoding algorithm (Eisner, 2000) only works for first-order models, which limits the scope of feature selection. To incorporate high-order features, the Eisner algorithm must be extended or modified, which usually comes at a high cost in efficiency. The fourth-order graph-based model (Ma and Zhao, 2012), to our knowledge the highest-order model so far, requires O(n^5) time and O(n^4) space. Due to this high computational cost, high-order models are normally restricted to producing only unlabeled parses, avoiding the extra cost introduced by including arc labels in the parse trees.
To alleviate the burden of feature engineering, Pei et al. (2015) presented an effective neural network model for graph-based dependency parsing. They use only atomic features such as word unigrams and POS tag unigrams and leave the model to automatically learn feature combinations. However, their model requires many atomic features and still relies on a high-order factorization strategy to further improve accuracy.
Different from previous work, we propose an LSTM-based dependency parsing model in this paper, aiming to use an LSTM network to capture richer contextual information to support parsing decisions, instead of adopting a high-order factorization. The main advantages of our model are as follows:
• By introducing a Bidirectional LSTM, our model shows a strong ability to capture potentially long-range contextual information and exhibits improved accuracy in recovering long-distance dependencies. This differs from previous work, in which a similar effect is usually achieved by high-order factorization. Moreover, our model eliminates the need for feature selection windows and reduces the number of features to a minimum.
• We propose an LSTM-based sentence segment embedding method named LSTM-Minus, in which the distributed representation of a sentence segment is learned by subtraction between LSTM hidden vectors. Experiments show this further enhances our model's access to sentence-level information.
• Last but not least, our model is a first-order model using the standard Eisner algorithm for decoding, so its computational cost remains the lowest among graph-based models. Our model does not trade efficiency for accuracy.
We evaluate our model on the English Penn Treebank and the Chinese Penn Treebank. Experiments show that our model achieves parsing accuracy competitive with conventional high-order models, at a much lower computational cost.

Graph-based dependency parsing
In dependency parsing, syntactic relationships are represented as directed arcs between head words and their modifier words. Each word in a sentence modifies exactly one head, but can have any number of modifiers itself. The whole sentence is rooted at a designated special symbol ROOT, thus the dependency graph for a sentence is constrained to be a rooted, directed tree.
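The tree constraint above can be checked with a short sketch. The `heads` encoding is hypothetical, not from the paper: `heads[i]` is the head index of word i+1, with 0 denoting ROOT.

```python
# A minimal validity check for a dependency graph: every word has exactly one
# head (enforced by the list encoding itself), and following head links from
# any word must reach ROOT (index 0) without revisiting a word (no cycles).
def is_valid_tree(heads):
    n = len(heads)  # heads[i] is the head of word i+1; 0 denotes ROOT
    for i in range(1, n + 1):
        seen = set()
        j = i
        while j != 0:          # climb head links up to ROOT
            if j in seen:
                return False   # cycle detected
            seen.add(j)
            j = heads[j - 1]
    return True

ok = is_valid_tree([0, 1, 2])   # a chain rooted at ROOT: valid
bad = is_valid_tree([2, 1, 2])  # words 1 and 2 head each other: a cycle
```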
For a sentence x, a graph-based dependency parsing model searches for the highest-scoring tree of x:

y*(x) = argmax_{ŷ ∈ Y(x)} Score(x, ŷ; θ)   (1)

Here y*(x) is the tree with the highest score, Y(x) is the set of all valid dependency trees for x, and Score(x, ŷ; θ) measures how likely the tree ŷ is the correct analysis of the sentence x, where θ are the model parameters. However, the size of Y(x) grows exponentially with respect to the length of the sentence, so directly solving equation (1) is impractical.
The common strategy adopted in graph-based models is to factor the dependency tree ŷ into a set of subgraphs c which can be scored in isolation, and to score the whole tree ŷ by summing the score of each subgraph:

Score(x, ŷ; θ) = Σ_{c ∈ ŷ} ScoreC(x, c; θ)   (2)

Figure 1: First-order, second-order and third-order factorization strategies. Here h stands for the head word, m for the modifier word, s and t for siblings of m, and g for the grandparent of m.

Figure 1 shows several factorization strategies. The order of the factorization is defined by the number of dependencies the subgraph contains. The simplest first-order factorization (McDonald et al., 2005) decomposes a dependency tree into single dependency arcs. Building on first-order factorization, second-order factorization (McDonald and Pereira, 2006; Carreras, 2007) brings sibling and grandparent information into the model. Third-order factorization (Koo and Collins, 2010) further incorporates richer contextual information by utilizing grand-sibling and tri-sibling parts.
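Equations (1) and (2) under first-order factorization can be illustrated with a toy sketch. The score matrix and candidate set below are made up for illustration; a real parser scores arcs with the neural model and searches all of Y(x) with the Eisner algorithm instead of enumerating.

```python
import numpy as np

# A tree is a set of (head, modifier) arcs; its score is the sum of per-arc
# scores (equation 2). arc_scores[h, m] is a stand-in for the neural scorer;
# index 0 is ROOT.
arc_scores = np.array([
    [0.0, 2.0, 0.5, 0.1],
    [0.0, 0.0, 1.5, 0.3],
    [0.0, 0.4, 0.0, 2.5],
    [0.0, 0.1, 0.2, 0.0],
])

def score_tree(tree):
    return sum(arc_scores[h, m] for h, m in tree)

# A tiny hand-picked candidate subset of Y(x) for a 3-word sentence;
# equation (1) picks the highest-scoring member.
candidates = [
    [(0, 1), (1, 2), (2, 3)],
    [(0, 2), (2, 1), (2, 3)],
    [(0, 1), (1, 2), (1, 3)],
]
best = max(candidates, key=score_tree)
```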
Conventional graph-based models (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012) score subgraphs with a linear model, which depends heavily on feature engineering. The neural network model proposed by Pei et al. (2015) alleviates the dependence on feature engineering to a large extent, but not completely. We follow Pei et al. (2015) in scoring dependency arcs with a neural network model. However, different from their work, we introduce a Bidirectional LSTM to capture long-range contextual information and an extra forward LSTM to better represent the segments of the sentence separated by the head and modifier. These make our model more accurate in recovering long-distance dependencies and further decrease the number of atomic features.

Neural Network Model
In this section, we describe the architecture of our neural network model in detail, which is summarized in Figure 2.

Input layer
In our neural network model, words and POS tags are mapped to distributed embeddings. We represent each input token x_i of the Bidirectional LSTM by concatenating the POS tag embedding e^p_i ∈ R^{d_e} and the word embedding e^w_i ∈ R^{d_e}, where d_e is the dimensionality of the embeddings; a linear transformation w_e is then performed and passed through an element-wise activation function g:

x_i = g(w_e [e^w_i; e^p_i] + b_e)

where x_i ∈ R^{d_e}, w_e ∈ R^{d_e × 2d_e} is the weight matrix and b_e ∈ R^{d_e} is the bias term. The dimensionality of the input token x_i is equal to the dimensionality of the word and POS tag embeddings in our experiments, and ReLU is used as the activation function g.
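The input layer above can be sketched as follows. All dimensions and parameter values are toy placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 4  # embedding dimensionality (small for illustration)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical parameters: w_e projects the concatenated word and POS
# embeddings (dimension 2*d_e) back down to d_e, as in the input layer.
w_e = rng.standard_normal((d_e, 2 * d_e)) * 0.1
b_e = np.zeros(d_e)

e_w = rng.standard_normal(d_e)  # word embedding e^w_i
e_p = rng.standard_normal(d_e)  # POS tag embedding e^p_i

# x_i = g(w_e [e^w_i; e^p_i] + b_e) with g = ReLU
x_i = relu(w_e @ np.concatenate([e_w, e_p]) + b_e)
```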

Bidirectional LSTM
Given an input sequence x = (x_1, . . . , x_n), where n is the number of words in the sentence, a standard LSTM recurrent network computes the hidden vector sequence h = (h_1, . . . , h_n) in one direction.
A Bidirectional LSTM processes the data in both directions with two separate hidden layers, which are then fed to the same output layer. It computes the forward hidden sequence →h, the backward hidden sequence ←h and the output sequence v by iterating the forward layer from t = 1 to n, the backward layer from t = n to 1, and then updating the output layer:

v_t = →h_t + ←h_t

We simply add the forward hidden vector →h_t and the backward hidden vector ←h_t together, which gives experimental results similar to concatenating them, with faster speed.
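A minimal sketch of this bidirectional pass is below. For brevity the two directions share one set of toy weights here; in practice each direction has its own parameters, and the gate layout inside `W`, `U`, `b` is our own convention.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3  # hidden size (toy)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(xs, W, U, b):
    """Minimal single-layer LSTM over a sequence; the four gates are stacked
    row-wise in W, U, b in the order input, forget, output, cell-candidate."""
    h = np.zeros(d)
    c = np.zeros(d)
    hs = []
    for x in xs:
        z = W @ x + U @ h + b                                  # (4d,) pre-activations
        i, f, o = (sigmoid(z[k * d:(k + 1) * d]) for k in range(3))
        g = np.tanh(z[3 * d:])
        c = f * c + i * g                                      # memory cell update
        h = o * np.tanh(c)
        hs.append(h)
    return hs

W = rng.standard_normal((4 * d, d)) * 0.1
U = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)

xs = [rng.standard_normal(d) for _ in range(5)]
h_fwd = lstm_run(xs, W, U, b)              # forward pass, t = 1..n
h_bwd = lstm_run(xs[::-1], W, U, b)[::-1]  # backward pass, re-reversed to align
v = [hf + hb for hf, hb in zip(h_fwd, h_bwd)]  # element-wise sum, as in the paper
```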
The output vectors of Bidirectional LSTM are used as word feature embeddings. In addition, they are also fed into a forward LSTM network to learn segment embedding.

Segment Embedding
Contextual information of word pairs has been widely utilized in previous work (McDonald et al., 2005; McDonald and Pereira, 2006; Pei et al., 2015). For a dependency pair (h, m), previous work divides a sentence into three parts (prefix, infix and suffix) by the head word h and the modifier word m. These parts, which we call segments in our work, make up the context of the dependency pair (h, m).
Due to data sparseness, conventional graph-based models can only capture contextual information of word pairs by using bigram or trigram features. Unlike conventional models, Pei et al. (2015) use distributed representations obtained by averaging the word embeddings in segments to represent the contextual information of the word pair, which can capture richer syntactic and semantic information. However, their method is restricted to the segment level, since their segment embedding considers only the word information within the segment. Besides, the averaging operation treats all words in a segment equally, although some words carry more salient syntactic or semantic information and should be given more attention.
A useful property of a forward LSTM is that it can keep useful earlier information in its memory cell, exploiting input, output and forget gates to decide how to utilize and update the memory of previous information. Given an input sequence v = (v_1, . . . , v_n), previous work often uses the last hidden vector h_n of the forward LSTM to represent the whole sequence. Each hidden vector h_t (1 ≤ t ≤ n) can capture useful information up to and including v_t.
Inspired by this, we propose a method named LSTM-Minus to learn segment embeddings. We utilize subtraction between LSTM hidden vectors to represent a segment's information. As illustrated in Figure 3, the segment infix can be described as h_m − h_2, where h_m and h_2 are hidden vectors of the forward LSTM network. The segment embedding of suffix can likewise be obtained by subtracting the last LSTM hidden vector of infix (h_m) from the last LSTM hidden vector of the sequence (h_7). For prefix, we directly use the last LSTM hidden vector within prefix to represent it, which is equivalent to subtracting a zero embedding. When no prefix or suffix exists, the corresponding embedding is set to zero.
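LSTM-Minus can be sketched directly on precomputed hidden vectors. The positions below (a pair at positions 3 and 5 in a 7-word sentence) are illustrative; the hidden vectors are random stand-ins for the forward LSTM outputs.

```python
import numpy as np

# LSTM-Minus: given hidden vectors h_1..h_n of a forward LSTM over the
# sentence, the segment spanning 1-based positions [i, j] is represented
# as h_j - h_{i-1}, with h_0 taken to be the zero vector.
def segment_embedding(hs, i, j):
    left = hs[i - 2] if i >= 2 else np.zeros_like(hs[0])
    return hs[j - 1] - left

rng = np.random.default_rng(2)
n, d = 7, 4
hs = [rng.standard_normal(d) for _ in range(n)]

# For a dependency pair at positions (h=3, m=5) in a 7-word sentence:
prefix = segment_embedding(hs, 1, 2)  # words before the pair: h_2 - 0
infix  = segment_embedding(hs, 3, 5)  # words between, inclusive: h_5 - h_2
suffix = segment_embedding(hs, 6, 7)  # words after the pair: h_7 - h_5
```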
Specifically, we place an extra forward LSTM layer on top of the Bidirectional LSTM layer and learn segment embeddings using LSTM-Minus on this forward LSTM. LSTM-Minus enables our model to learn segment embeddings from information both outside and inside the segments and thus enhances our model's access to sentence-level information.

Hidden layer and output layer
As illustrated in Figure 2, we map all the feature embeddings to a hidden layer. Following Pei et al. (2015), we use direction-specific transformations to model edge direction and tanh-cube as the activation function:

h = g(W^d_h a + b^d_h),  g(z) = tanh(z^3 + z)

where a is the concatenation of all feature embeddings and d ∈ {0, 1} indicates the direction between head and modifier.
An output layer is finally added on top of the hidden layer to score dependency arcs:

s = W^d_o h

where s ∈ R^L is the output vector and L is the number of dependency types. Each dimension of the output vector is the score for the corresponding dependency type of the head-modifier pair.
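The hidden and output layers can be sketched together. All dimensions and the random parameters are placeholders; the tanh-cube activation g(z) = tanh(z^3 + z) follows Pei et al. (2015), and the direction-keyed parameter dictionaries are our own encoding of "direction-specific transformation".

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h, L = 6, 5, 4  # feature dim, hidden dim, number of dependency labels

def tanh_cube(z):
    # tanh-cube activation from Pei et al. (2015): tanh(z^3 + z)
    return np.tanh(z ** 3 + z)

# Hypothetical direction-specific parameters: one set per arc direction
# (here keyed 0 = head before modifier, 1 = head after modifier).
W_h = {dr: rng.standard_normal((d_h, d_in)) * 0.1 for dr in (0, 1)}
b_h = {dr: np.zeros(d_h) for dr in (0, 1)}
W_o = {dr: rng.standard_normal((L, d_h)) * 0.1 for dr in (0, 1)}

def score_arc(features, direction):
    h = tanh_cube(W_h[direction] @ features + b_h[direction])
    return W_o[direction] @ h  # one score per dependency label

scores = score_arc(rng.standard_normal(d_in), 1)
```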

Features in our model
Previous neural network models (Pei et al., 2015; Pei et al., 2014; Zheng et al., 2013) normally set a context window around a word and extract atomic features within the window to represent contextual information. However, the context window limits their ability to detect long-distance information. Simply increasing the context window size to obtain more contextual information puts these models at risk of overfitting and heavily slows down the speed.
Unlike previous work, we apply a Bidirectional LSTM to capture long-range contextual information and eliminate the need for context windows, avoiding the limits of window-based feature selection. Compared with Pei et al. (2015), removing the context window allows our model to use much fewer features. Moreover, by combining a word's atomic features (word form and POS tag) together, our model further decreases the number of features. Table 1 lists the atomic features used in the 1st-order atomic model of Pei et al. (2015) and the atomic features used in our basic model. Our basic model uses only the Bidirectional LSTM outputs for the head word and the modifier word, and the distance between them, as features. Distance features are encoded as randomly initialized embeddings. As we can see, our basic model reduces the number of atomic features to a minimum, making our model run faster. Based on our basic model, we incorporate additional segment information (prefix, infix and suffix), which further improves the performance of our model.

Neural Training
In this section, we provide details about training the neural network.

Max-Margin Training
We use the Max-Margin criterion to train our model. Given a training instance (x^(i), y^(i)), we use Y(x^(i)) to denote the set of all possible dependency trees, with y^(i) the correct dependency tree for sentence x^(i). The goal of Max-Margin training is to find parameters θ such that the score of the correct tree y^(i) exceeds that of any incorrect tree ŷ ∈ Y(x^(i)) by at least a margin Δ(y^(i), ŷ):
Score(x^(i), y^(i); θ) ≥ Score(x^(i), ŷ; θ) + Δ(y^(i), ŷ)

The structured margin loss Δ(y^(i), ŷ) is defined as:

Δ(y^(i), ŷ) = Σ_j κ 1{h(y^(i), x^(i)_j) ≠ h(ŷ, x^(i)_j)}

where h(y^(i), x^(i)_j) is the head (with type) for the j-th word of x^(i) in tree y^(i) and κ is a discount parameter. The loss is proportional to the number of words with an incorrect head and edge type in the proposed tree.
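A sketch of the structured margin loss, with made-up trees encoded as per-word (head, label) pairs:

```python
# Structured margin loss: the margin grows with the number of words whose
# (head, label) pair in the proposed tree differs from the gold tree;
# kappa is the discount parameter from the paper (value here is arbitrary).
def margin_loss(gold, pred, kappa=0.5):
    wrong = sum(1 for g, p in zip(gold, pred) if g != p)
    return kappa * wrong

# Gold and predicted analyses as (head index, label) per word (toy data).
gold = [(0, "root"), (1, "nsubj"), (1, "dobj")]
pred = [(0, "root"), (1, "nsubj"), (2, "dobj")]  # word 3 attached wrongly
loss = margin_loss(gold, pred)
```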
Given a training set of size m, the regularized objective function is the loss function J(θ) including an l_2-norm term:

J(θ) = (1/m) Σ_{i=1}^{m} l_i(θ) + (λ/2) ||θ||^2

l_i(θ) = max_{ŷ ∈ Y(x^(i))} (Score(x^(i), ŷ; θ) + Δ(y^(i), ŷ)) − Score(x^(i), y^(i); θ)

By minimizing this objective, the score of the correct tree is increased and the score of the highest-scoring incorrect tree is decreased.

Optimization Algorithm
Parameter optimization is performed with the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatches (batch size = 20). The update for the i-th parameter θ_{t,i} at time step t is:

θ_{t,i} = θ_{t−1,i} − (α / sqrt(Σ_{τ=1}^{t} g_{τ,i}^2)) g_{t,i}

where α is the initial learning rate (α = 0.2 in our experiments) and g_τ ∈ R^{|θ_i|} is the subgradient at time step τ for parameter θ_i. To mitigate overfitting, dropout (Hinton et al., 2012) is used to regularize our model; we apply dropout on the hidden layer with rate 0.2.
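The diagonal AdaGrad update above can be sketched as follows; the small epsilon term is a common numerical safeguard and an addition of ours, not stated in the paper.

```python
import numpy as np

# Diagonal AdaGrad: each parameter is scaled by the root of its own
# accumulated squared gradients; alpha is the initial learning rate.
def adagrad_step(theta, grad, accum, alpha=0.2, eps=1e-8):
    accum += grad ** 2                                  # per-parameter history
    theta -= alpha * grad / (np.sqrt(accum) + eps)      # per-parameter step
    return theta, accum

theta = np.zeros(3)
accum = np.zeros(3)
g1 = np.array([1.0, -2.0, 0.5])
theta, accum = adagrad_step(theta, g1, accum)
# On the first step the update reduces to -alpha * sign(grad) per coordinate,
# since grad / sqrt(grad^2) = sign(grad).
```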
We initialized the parameters using pretrained word embeddings. Following , we use a variant of the skip n-gram model introduced by  on the Gigaword corpus (Graff et al., 2003).

Experiments
In this section, we present our experimental setup and the main result of our work.

Experiments Setup
We conduct our experiments on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB) datasets. For English, we follow the standard splits of PTB3, using sections 2-21 for training, section 22 as the development set and section 23 as the test set. We conduct experiments on two different constituency-to-dependency-converted Penn Treebank data sets. The first, Penn-YM, was created by the Penn2Malt tool based on the head rules of Yamada and Matsumoto (2003). The second, Penn-SD, uses Stanford Basic Dependencies (Marneffe et al., 2006) and was converted by version 3.3.0 of the Stanford parser. The Stanford POS Tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data is used to assign POS tags (accuracy ≈ 97.2%).
For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Following (Zhang and Clark, 2008;Chen and Manning, 2014), we use gold segmentation and POS tags for the input.

Experiments Results
We first make comparisons with previous graph-based models of different orders, as shown in Table 2. We use MSTParser for the conventional first-order model (McDonald et al., 2005) and second-order model (McDonald and Pereira, 2006). We also include the results of conventional high-order models (Koo and Collins, 2010; Ma and Zhao, 2012; Zhang and McDonald, 2012; Zhang et al., 2013; Zhang and McDonald, 2014) and the neural network model of Pei et al. (2015). Different from typical high-order models (Koo and Collins, 2010; Ma and Zhao, 2012), which need to extend their decoding algorithm to score new types of higher-order dependencies, Zhang and McDonald (2012) generalized the Eisner algorithm to handle arbitrary features over higher-order dependencies and controlled complexity via approximate decoding with cube pruning. They further improved their work by using perceptron update strategies for inexact hypergraph search (Zhang et al., 2013) and by forcing inference to maintain both label and structural ambiguity through a secondary beam (Zhang and McDonald, 2014).
Following previous work, UAS (unlabeled attachment score) and LAS (labeled attachment score) are calculated excluding punctuation. Parsing speeds are measured on a workstation with an Intel Xeon 3.4GHz CPU and 32GB RAM, the same as Pei et al. (2015). We measure the parsing speed of Pei et al. (2015) using their code and parameters.
Table 3: Comparison with previous state-of-the-art models on Penn-YM, Penn-SD and CTB5.

On accuracy, as shown in Table 2, our basic model outperforms previous first-order graph-based models by a substantial margin, and even outperforms Zhang and McDonald (2012)'s unlimited-order model. Moreover, incorporating segment information further improves our model's accuracy, which shows that segment embeddings do capture richer contextual information. By using segment embeddings, our improved model is comparable to high-order graph-based models.
With regard to parsing speed, our model also shows an advantage in efficiency. Our model uses only first-order factorization and requires O(n^3) time to decode. Third-order models require O(n^4) time and the fourth-order model requires O(n^5) time. By using approximate decoding, the unlimited-order model of Zhang and McDonald (2012) requires O(k · log(k) · n^3) time, where k is the beam size. The computational cost of our model is thus the lowest among graph-based models. Moreover, although using an LSTM adds computational cost, compared with Pei's 1st-order model our model decreases the number of atomic features from 21 to 3, which allows a much smaller matrix computation in the scoring model and cancels out the extra cost introduced by the LSTM. Our basic model is the fastest among first-order and second-order models. Incorporating segment information slows down parsing, while it is still slightly faster than the conventional first-order model. To compare with conventional high-order models on practical parsing speed, we can make an indirect comparison according to Zhang and McDonald (2012): the conventional first-order model is about 10 times faster than . Note that our model is not strictly comparable with the third-order model (Koo and Collins, 2010) and the fourth-order model (Ma and Zhao, 2012), since they are unlabeled models. However, our model is comparable with all three unlimited-order models presented in Zhang and McDonald (2012), Zhang et al. (2013) and Zhang and McDonald (2014), since they are labeled models like ours.

We further compare our model with previous state-of-the-art systems for English and Chinese. Table 3 lists the performance of our model as well as previous state-of-the-art systems on Penn-YM, Penn-SD and CTB5. We compare with the conventional state-of-the-art graph-based model (Zhang and McDonald, 2014), the conventional state-of-the-art transition-based model using beam search (Zhang and Nivre, 2011), a transition-based model combining the graph-based approach (Bernd Bohnet, 2012), a transition-based neural network model using stack LSTMs and a transition-based neural network model using beam search (Weiss et al., 2015). Overall, our model achieves competitive accuracy on all three datasets. Although our model is slightly lower in accuracy than the unlimited-order double-beam model (Zhang and McDonald, 2014) on Penn-YM and CTB5, it outperforms their model on Penn-SD. It seems that our model performs better on data sets with larger label sets, given that the number of labels in the Penn-SD data set is almost four times that of the Penn-YM and CTB5 data sets.
To show the effectiveness of our segment embedding method LSTM-Minus, we compare it with the averaging method proposed by Pei et al. (2015), obtaining segment embeddings by averaging the output vectors of the Bidirectional LSTM within segments. To make the comparison as fair as possible, we let the two models have almost the same number of parameters. Table 4 lists the UAS of the two methods on the test set. As we can see, LSTM-Minus shows better performance because our method incorporates more sentence-level information into the model.

Impact of Network Structure
In this part, we investigate the impact of the components of our approach.

LSTM Recurrent Network
To evaluate the impact of the LSTM, we conduct error analysis on Penn-YM, comparing our model with Pei et al. (2015) on error rates for different distances between head and modifier.
As we can see, the five models do not show much difference for short dependencies whose distance is less than three. For long dependencies, both of our models show better performance than the 1st-order model of Pei et al. (2015), which demonstrates that the LSTM can effectively capture long-distance dependencies. Moreover, our models and Pei's 2nd-order phrase model both improve accuracy on long dependencies compared with Pei's 1st-order model, which is in line with our expectations: using an LSTM shows the same effect as a high-order factorization strategy. Compared with the 2nd-order phrase model of Pei et al. (2015), our basic model occasionally performs worse in recovering long-distance dependencies. However, this should not be a surprise, since higher-order models are also motivated by recovering long-distance dependencies. Nevertheless, with the introduction of LSTM-Minus segment embeddings, our model consistently outperforms the 2nd-order phrase model of Pei et al. (2015) on all long dependencies. We carried out significance tests on the differences between our models and Pei's. Our basic model performs significantly better than all 1st-order models of Pei et al. (2015) (t-test, p < 0.001), and our basic+segment model (still a 1st-order model) performs significantly better than their 2nd-order phrase model (t-test, p < 0.001) in recovering long-distance dependencies.
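The distance-bucketed error analysis above can be sketched as follows; the head arrays are invented toy data, not results from the paper.

```python
# Bucket attachment errors by the distance between the gold head and the
# modifier, then report an error rate per distance bucket.
def error_rate_by_distance(gold_heads, pred_heads):
    totals, errors = {}, {}
    for i, (g, p) in enumerate(zip(gold_heads, pred_heads)):
        dist = abs(g - (i + 1))  # words are 1-indexed; head 0 is ROOT
        totals[dist] = totals.get(dist, 0) + 1
        if g != p:
            errors[dist] = errors.get(dist, 0) + 1
    return {d: errors.get(d, 0) / totals[d] for d in totals}

gold = [2, 0, 2, 3]  # gold head for each of 4 words (toy)
pred = [2, 0, 4, 3]  # predicted heads; word 3 is attached wrongly
rates = error_rate_by_distance(gold, pred)
```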

Related work
Dependency parsing has gained widespread interest in the computational linguistics community, and many approaches have been proposed. Among them, we mainly focus on graph-based dependency parsing models here. Dependency tree factorization and a decoding algorithm are necessary for graph-based models. McDonald et al. (2005) proposed the first-order model, which decomposes a dependency tree into its individual edges and uses an effective dynamic programming algorithm (Eisner, 2000) to decode. Based on the first-order model, higher-order models (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012) factor a dependency tree into a set of high-order dependencies which bring interactions between head, modifier, siblings and (or) grandparent into the model. However, for the above models, scoring new types of higher-order dependencies requires extensions of the underlying decoding algorithm, which also incurs higher computational cost. Unlike these models, unlimited-order models (Zhang and McDonald, 2012; Zhang et al., 2013; Zhang and McDonald, 2014) can handle arbitrary features over higher-order dependencies by generalizing the Eisner algorithm.
In contrast to conventional methods, neural network models reduce the effort required in feature engineering. Pei et al. (2015) proposed a model that automatically learns high-order feature combinations via a novel activation function, allowing their model to use a set of atomic features instead of millions of hand-crafted features.
Different from previous work, which is sensitive to local state and accesses larger context through higher-order factorization, our model makes parsing decisions from a global perspective with first-order factorization, avoiding the expensive computational cost introduced by high-order factorization.
The LSTM network is heavily utilized in our model. LSTM networks have already been explored in transition-based dependency parsing.  presented stack LSTMs with push and pop operations and used them to implement a state-of-the-art transition-based dependency parser.  replaced lookup-based word representations with character-based representations obtained by a Bidirectional LSTM in the continuous-state parser of , which was experimentally shown to be useful for morphologically rich languages.

Conclusion
In this paper, we propose an LSTM-based neural network model for graph-based dependency parsing. Utilizing a Bidirectional LSTM and segment embeddings learned by LSTM-Minus gives our model access to sentence-level information, making it more accurate in recovering long-distance dependencies with only first-order factorization. Experiments on PTB and CTB show that our model is competitive with conventional high-order models at faster speed.