Incremental Parsing with Minimal Features Using Bi-Directional LSTM

Recently, neural network approaches for parsing have largely automated the combination of individual features, but they still rely on (often a large number of) atomic features created from human linguistic intuition, potentially omitting important global context. To further reduce feature engineering to the bare minimum, we use bi-directional LSTM sentence representations to model a parser state with only three sentence positions, which automatically identifies important aspects of the entire sentence. This model achieves state-of-the-art results among greedy dependency parsers for English. We also introduce a novel transition system for constituency parsing which does not require binarization, and together with the above architecture, achieves state-of-the-art results among greedy parsers for both English and Chinese.


Introduction
Recently, neural network-based parsers have become popular, with the promise of reducing the burden of manual feature engineering. For example, Chen and Manning (2014) and subsequent work replace the large number of manual feature combinations in non-neural-network efforts (Nivre et al., 2006; Zhang and Nivre, 2011) with vector embeddings of the atomic features. However, this approach has two related limitations. First, it still depends on a large number of carefully designed atomic features. For example, Chen and Manning (2014) and subsequent work such as Weiss et al. (2015) use 48 atomic features from Zhang and Nivre (2011), including select third-order dependencies. More importantly, this approach inevitably leaves out some nonlocal information which could be useful. In particular, though such a model can exploit similarities between words and other embedded categories, and learn interactions among those atomic features, it cannot exploit any other details of the text.
We aim to reduce the need for manual induction of atomic features to the bare minimum, by using bi-directional recurrent neural networks to automatically learn context-sensitive representations for each word in the sentence. This approach allows the model to learn arbitrary patterns from the entire sentence, effectively extending the generalization power of embedding individual words to longer sequences. Since such a feature representation is less dependent on earlier parser decisions, it is also more resilient to local mistakes.
With just three positional features we can build a greedy shift-reduce dependency parser that is on par with the most accurate parsers in the published literature for the English Treebank. This effort is similar in motivation to the stack-LSTM of Dyer et al. (2015), but uses a much simpler architecture.
We also extend this model to predict phrase-structure trees with a novel shift-promote-adjoin system tailored to greedy constituency parsing, and with just two more positional features (defining tree span) and nonterminal label embeddings we achieve the most accurate greedy constituency parser for both English and Chinese.

Figure 1: The sentence is modeled with an LSTM in each direction, whose input vectors at each time step are word and part-of-speech tag embeddings.

LSTM Position Features
The central idea behind this approach is exploiting the power of recurrent neural networks to let the model decide what aspects of sentence context are important to making parsing decisions, rather than relying on fallible linguistic intuition (which moreover requires leaving out information which could be useful). In particular, we model an input sentence using Long Short-Term Memory networks (LSTM), which have made a recent resurgence after being initially formulated by Hochreiter and Schmidhuber (1997).
The input at each time step is simply a vector representing the word, in this case an embedding for the word form and one for the part-of-speech tag. These embeddings are learned from random initialization together with other network parameters in this work. In our initial experiments, we used one LSTM layer in each direction (forward and backward), and then concatenated the output at each time step to represent that sentence position: that word in the entire context of the sentence. This network is illustrated in Figure 1. It is also common to stack multiple such LSTM layers, where the outputs of the forward and backward networks at one layer are concatenated to form the input to the next. We found that parsing performance could be improved by using two bi-directional LSTM layers in this manner, and concatenating the output of both layers as the positional feature representation, which becomes the input to the fully-connected layer. This architecture is shown in Figure 2. Intuitively, this represents the sentence position by the word in the context of the sentence up to that point and the sentence after that point in the first layer, as well as modeling the "higher-order" interactions between parts of the sentence in the second layer. In Section 5 we report results using only one LSTM layer ("Bi-LSTM") as well as with two layers where output from each layer is used as part of the positional feature ("2-Layer Bi-LSTM").

Figure 3: The arc-standard dependency parsing system (Nivre, 2008) (re omitted). Stack S is a list of heads, j is the start index of the queue, and s0 and s1 are the top two head indices on S.
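The positional feature extraction described above can be sketched in a few lines of numpy. This is a minimal illustration of one bi-directional layer (a single forward and backward LSTM whose per-step outputs are concatenated), not the paper's implementation; the weight layout, gate ordering, and toy dimensions here are our own assumptions.

```python
import numpy as np

def lstm_forward(inputs, W, b, hidden_size):
    """Run one LSTM over a sequence; return the hidden state at each step.

    W maps [x_t; h_{t-1}] to the four gate pre-activations (i, f, o, g);
    this gate ordering is an arbitrary choice for the sketch.
    """
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in inputs:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # cell state update
        h = o * np.tanh(c)               # hidden state (layer output)
        outputs.append(h)
    return outputs

def bilstm_positions(inputs, Wf, bf, Wb, bb, hidden_size):
    """Concatenate forward and backward LSTM outputs at each position."""
    fwd = lstm_forward(inputs, Wf, bf, hidden_size)
    bwd = lstm_forward(inputs[::-1], Wb, bb, hidden_size)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Toy sentence of 3 positions; each input vector stands in for the
# concatenated word + POS-tag embedding (dimension 6 here, arbitrarily).
rng = np.random.default_rng(0)
emb_dim, hid = 6, 4
sent = [rng.normal(size=emb_dim) for _ in range(3)]
Wf = rng.normal(scale=0.1, size=(4 * hid, emb_dim + hid)); bf = np.zeros(4 * hid)
Wb = rng.normal(scale=0.1, size=(4 * hid, emb_dim + hid)); bb = np.zeros(4 * hid)
feats = bilstm_positions(sent, Wf, bf, Wb, bb, hid)
assert len(feats) == 3 and feats[0].shape == (2 * hid,)
```

For the two-layer variant, the list `feats` would simply become the input sequence to a second bi-directional layer, with the outputs of both layers concatenated per position.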

Shift-Reduce Dependency Parsing
We use the arc-standard system for dependency parsing (see Figure 3). By exploiting the LSTM architecture to encode context, we found that we were able to achieve competitive results using only three sentence-position features to model the parser state: the head word of each of the top two trees on the stack (s0 and s1), and the next word on the queue (q0); see Table 1.
The usefulness of the head words on the stack is clear enough, since those are the two words that are linked by a dependency when taking a reduce action. The next incoming word on the queue is also important, because the top tree on the stack should not be reduced if it still has children which have not yet been shifted. That feature thus allows the model to learn to delay a right-reduce until the top tree on the stack is fully formed, shifting instead.
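The arc-standard system itself can be simulated in a few lines. This toy interpreter (action names and the example sentence are ours, not the paper's) tracks the stack of head indices and the queue start index j, matching the description of Figure 3:

```python
def arc_standard(actions, n):
    """Simulate arc-standard parsing over a sentence of n words.

    Stack holds head indices; j is the start index of the queue.
    actions: 'shift', 'left' (s1 becomes dependent of s0),
             or 'right' (s0 becomes dependent of s1).
    Returns the list of (head, dependent) arcs.
    """
    stack, j, arcs = [], 0, []
    for a in actions:
        if a == 'shift':
            stack.append(j); j += 1
        elif a == 'left':                 # left-reduce
            s0 = stack.pop(); s1 = stack.pop()
            arcs.append((s0, s1)); stack.append(s0)
        elif a == 'right':                # right-reduce
            s0 = stack.pop(); s1 = stack.pop()
            arcs.append((s1, s0)); stack.append(s1)
    assert len(stack) == 1 and j == n     # exactly one head (the root) remains
    return arcs

# "I like sports": 'like' (index 1) heads both 'I' (0) and 'sports' (2).
arcs = arc_standard(['shift', 'shift', 'left', 'shift', 'right'], 3)
assert arcs == [(1, 0), (1, 2)]
```

Note how the right-reduce on 'sports' must wait until after it is shifted, which is exactly the delay behavior the q0 feature lets the model learn.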

Hierarchical Classification
The structure of our network model after computing positional features is fairly straightforward and similar to previous neural-network parsing approaches such as Chen and Manning (2014) and Weiss et al. (2015). It consists of a multilayer perceptron using a single ReLU hidden layer followed by a linear classifier over the action space, with the training objective being negative log softmax.
We found that performance could be improved, however, by factoring out the decision over structural actions (i.e., shift, left-reduce, or right-reduce) and the decision of which arc label to assign upon a reduce. We therefore use separate classifiers for those decisions, each with its own fully-connected hidden and output layers but sharing the underlying recurrent architecture. This structure was used for the results reported in Section 5, and it is referred to as "Hierarchical Actions" when compared against a single action classifier in Table 3.
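The factored classifier can be sketched as two softmax heads over a shared representation. Reading the description as a product factorization P(labeled reduce) = P(structural action) x P(label | reduce) is our interpretation; the layer sizes below are illustrative, not the paper's.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def factored_action_scores(h, W_struct, W_label):
    """Score structural actions and arc labels with separate output heads.

    h: shared hidden representation built from the positional features.
    Under the factorization assumed here, the probability of e.g. a
    left-reduce with label l is p_struct[left] * p_label[l].
    """
    p_struct = softmax(W_struct @ h)   # shift, left-reduce, right-reduce
    p_label = softmax(W_label @ h)     # one entry per dependency label
    return p_struct, p_label

rng = np.random.default_rng(1)
h = rng.normal(size=8)                          # toy hidden size
W_struct = rng.normal(size=(3, 8))              # 3 structural actions
W_label = rng.normal(size=(40, 8))              # 40 labels, arbitrarily
p_struct, p_label = factored_action_scores(h, W_struct, W_label)
joint = p_struct[1] * p_label[5]                # one labeled left-reduce
assert abs(p_struct.sum() - 1) < 1e-9 and abs(p_label.sum() - 1) < 1e-9
```

The practical gain is that the label head only has to discriminate among labels, never between labeling and shifting, which shrinks each decision's output space.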

Shift-Promote-Adjoin Constituency Parsing
To further demonstrate the advantage of our idea of minimal features with bi-directional sentence representations, we extend our work from dependency parsing to constituency parsing. However, the latter is significantly more challenging than the former under the shift-reduce paradigm because:

• we also need to predict the nonterminal labels;
• the tree is not binarized (with many unary rules and more-than-binary branching rules).

While most previous work binarizes the constituency tree in a preprocessing step (Zhu et al., 2013; Wang and Xue, 2014; Mi and Huang, 2015), we propose a novel "Shift-Promote-Adjoin" paradigm which does not require any binarization or transformation of constituency trees (see Figure 5). Note in particular that, in our case, only the Promote action produces a new tree node (with a nonterminal label), while the Adjoin action is the linguistically-motivated "sister-adjunction" operation, i.e., attachment (Chiang, 2000; Henderson, 2003). By comparison, in previous work both Unary-X and Reduce-L/R-X actions produce new labeled nodes (some of which are auxiliary nodes due to binarization). Our paradigm thus has two advantages:

• it dramatically reduces the number of possible actions, from 3X + 1 or more in previous work to 3 + X, where X is the number of nonterminal labels, which we argue simplifies learning;
• it does not require binarization (Zhu et al., 2013; Wang and Xue, 2014) or compression of unary chains (Mi and Huang, 2015).

There is, however, a more closely related "shift-project-attach" paradigm due to Henderson (2003). For the example in Figure 5 he would use the following actions: shift(I), project(NP), project(S), shift(like), project(VP), shift(sports), project(NP), attach, attach.
The differences are twofold: first, our Promote action is head-driven, which means we only promote the head child (e.g., VP to S), whereas his Project action promotes the first child (e.g., NP to S); and secondly, as a result, his Attach action always attaches to the right, whereas our Adjoin action can attach on either side. The advantage of our method is its close resemblance to shift-reduce dependency parsing, which means that our constituency parser is jointly performing both tasks and can produce both kinds of trees. This also means that we use head rules to determine the correct order of gold actions.
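The shift-promote-adjoin system can be exercised with a small interpreter. This is a toy sketch: the action-string format and the gold action sequence for "I like sports" are our own reconstruction from the description above (head-driven promotion of VP to S, then left-adjunction of the subject NP), not output from the paper's parser.

```python
def spa_parse(words, actions):
    """Shift-Promote-Adjoin over a stack of (label, children) trees."""
    stack, j = [], 0
    for a in actions:
        if a == 'shift':
            stack.append(words[j]); j += 1
        elif a.startswith('promote('):       # promote(X): wrap head child in new X node
            label = a[len('promote('):-1]
            stack.append((label, [stack.pop()]))
        elif a == 'adjoin-left':             # s1 becomes the leftmost child of s0
            s0 = stack.pop(); s1 = stack.pop()
            s0[1].insert(0, s1); stack.append(s0)
        elif a == 'adjoin-right':            # s0 becomes the rightmost child of s1
            s0 = stack.pop(); s1 = stack.pop()
            s1[1].append(s0); stack.append(s1)
    return stack[0]

def bracket(t):
    """Render a tree in bracketed Treebank-style notation."""
    if isinstance(t, str):
        return t
    return '(%s %s)' % (t[0], ' '.join(bracket(c) for c in t[1]))

tree = spa_parse(
    ['I', 'like', 'sports'],
    ['shift', 'promote(NP)', 'shift', 'promote(VP)',
     'shift', 'promote(NP)', 'adjoin-right', 'promote(S)', 'adjoin-left'])
assert bracket(tree) == '(S (NP I) (VP like (NP sports)))'
```

Counting action types here gives exactly the 3 + X inventory claimed above: shift, adjoin-left, adjoin-right, plus one promote per nonterminal label.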
We found that in this setting, we did need slightly more input features. As mentioned, node labels are necessary to distinguish whether a tree has been sufficiently promoted, and are helpful in any case. We used 8 labels: the current and immediate-predecessor label of each of the top two trees on the stack, as well as the label of the left- and rightmost adjoined child for each tree. We also found it helped to add positional features for the leftmost word in the span for each of those trees, bringing the total number of positional features to five. See Table 1 for details.

Experimental Results
We report both dependency and constituency parsing results on both English and Chinese.
All experiments were conducted with minimal hyperparameter tuning. The settings used for the reported results are summarized in Table 6. Network parameters were updated using gradient backpropagation, including backpropagation through time for the recurrent components, using ADADELTA for learning rate scheduling (Zeiler, 2012). We also applied dropout (Hinton et al., 2012) (with p = 0.5) to the output of each LSTM layer (separately for each connection in the case of the two-layer network).
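Dropout on an LSTM layer's output amounts to the following. The paper does not specify the exact variant; this sketch assumes the common "inverted" formulation, where surviving units are rescaled at training time so that no rescaling is needed at test time.

```python
import numpy as np

def lstm_output_dropout(h, p=0.5, rng=None, train=True):
    """Inverted dropout applied to an LSTM layer's output vector.

    At training time, zero each unit with probability p and scale the
    survivors by 1/(1-p); at test time, pass the output through unchanged.
    """
    if not train:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(8)
out = lstm_output_dropout(h, p=0.5, rng=np.random.default_rng(0))
# With p = 0.5, surviving units are scaled to 2.0 and dropped units are 0.
assert set(np.unique(out)).issubset({0.0, 2.0})
assert np.array_equal(lstm_output_dropout(h, train=False), h)
```

"Separately for each connection" in the two-layer case would mean drawing an independent mask wherever a layer's output feeds another component, rather than reusing one mask.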
We tested both types of parser on the Penn Treebank (PTB) and Penn Chinese Treebank (CTB-5), with the standard splits for each of training, development, and test sets. Automatically predicted part-of-speech tags with 10-way jackknifing were used as inputs for all tasks except Chinese dependency parsing, where we used gold tags, following standard practice in the literature. Table 2 shows results for the English Penn Treebank using Stanford dependencies. Despite the minimally designed feature representation, relatively few training iterations, and lack of precomputed embeddings, the parser performed on par with state-of-the-art incremental dependency parsers, and slightly outperformed the state-of-the-art greedy parser.

Dependency Parsing: English & Chinese
The ablation experiments shown in Table 3 (on the PTB dev set, wsj 22) indicate that forward and backward context, as well as part-of-speech input, were all critical to strong performance. Figure 6 compares our parser with that of Chen and Manning (2014) in terms of arc recall for various arc lengths. While the two parsers perform similarly on short arcs, ours significantly outperforms theirs on longer arcs, and more interestingly our accuracy does not degrade much after length 6. This confirms the benefit of having a global sentence representation in our model. Table 4 summarizes the Chinese dependency parsing results: development and test set results for the shift-reduce dependency parser on the Penn Chinese Treebank (CTB-5), using only the (s1, s0, q0) position features (trained and tested with gold POS tags). Again, our work is competitive with the state-of-the-art greedy parsers.

Related Work
Because recurrent networks are such a natural fit for modeling languages (given the sequential nature of the latter), bi-directional LSTM networks are becoming increasingly common in all sorts of linguistic tasks, for example event detection in Ghaeini et al. (2016). In fact, we discovered after submission that Kiperwasser and Goldberg (2016) have concurrently developed an extremely similar approach to our dependency parser. Rather than extending it to constituency parsing, they instead apply the same idea to graph-based dependency parsing.

Conclusions
We have presented a simple bi-directional LSTM sentence representation model for minimal features in both incremental dependency and incremental constituency parsing, the latter using a novel shift-promote-adjoin algorithm. Experiments show that our method is competitive with the state-of-the-art greedy parsers on both parsing tasks and on both English and Chinese.