LSTM Shift-Reduce CCG Parsing

We describe a neural shift-reduce parsing model for CCG, factored into four unidirectional LSTMs and one bidirectional LSTM. This factorization allows the linearization of the complete parsing history, and results in a highly accurate greedy parser that outperforms all previous beam-search shift-reduce parsers for CCG. By further deriving a globally optimized model using a task-based loss, we improve over the state of the art by up to 2.67% labeled F1.


Introduction
Combinatory Categorial Grammar (CCG; Steedman, 2000) parsing is challenging due to its so-called "spurious" ambiguity that permits a large number of non-standard derivations (Vijay-Shanker and Weir, 1993; Kuhlmann and Satta, 2014). To address this, the de facto models resort to chart-based CKY (Hockenmaier, 2003; Clark and Curran, 2007), despite CCG being naturally compatible with shift-reduce parsing (Ades and Steedman, 1982). More recently, Zhang and Clark (2011) introduced the first shift-reduce model for CCG, which also showed substantial improvements over the long-established state of the art (Clark and Curran, 2007).
The success of the shift-reduce model (Zhang and Clark, 2011) can be tied to two main contributing factors. First, without any feature locality restrictions, it is able to use a much richer feature set; while intensive feature engineering is inevitable, it has nevertheless delivered an effective and conceptually simpler alternative for both parameter estimation and inference. Second, it couples beam search with global optimization (Collins, 2002; Collins and Roark, 2004; Zhang and Clark, 2008), which makes it less prone to search errors than fully greedy models (Huang et al., 2012).
In this paper, we present a neural architecture for shift-reduce CCG parsing based on long short-term memories (LSTMs; Hochreiter and Schmidhuber, 1997). Our model is inspired by Dyer et al. (2015): we explicitly linearize the complete history of parser states in an incremental fashion, requiring no feature engineering (Zhang and Clark, 2011; Xu et al., 2014) and no atomic feature sets (Chen and Manning, 2014). However, a key difference is that we achieve this linearization without relying on any additional control operations or compositional tree structures (Socher et al., 2010; Socher et al., 2011; Socher et al., 2013), both of which are vital in the architecture of Dyer et al. (2015). Crucially, unlike the sequence-to-sequence transduction model of , which primarily conditions on the input words, our model is sensitive to all aspects of the parsing history, including arbitrary positions in the input.
As another contribution, we present a global LSTM parsing model by adapting an expected F-measure loss (Xu et al., 2016). As well as naturally incorporating beam search during training, this loss optimizes the model towards the final evaluation metric (Goodman, 1996; Smith and Eisner, 2006; Auli and Lopez, 2011b), allowing it to learn shift-reduce action sequences that lead to parses with high expected F-scores. We further show the globally optimized model can be leveraged with greedy inference, resulting in a deterministic parser as accurate cal assignment accuracy than the C&C parser (Clark and Curran, 2007), even with the same supertagging model (Zhang and Clark, 2011; Xu et al., 2014).
In our parser, we follow this strategy and adopt the Zhang and Clark (2011) style shift-reduce transition system, which assumes a set of lexical categories has been assigned to each word using a supertagger (Bangalore and Joshi, 1999; Clark and Curran, 2004). Parsing then proceeds by applying a sequence of actions to transform the input, maintained on a queue, into partially constructed derivations, kept on a stack, until the queue and available actions on the stack are both exhausted. At each time step, the parser can choose to shift (sh) one of the lexical categories of the front word onto the stack, and remove that word from the queue; reduce (re) the top two subtrees on the stack using a CCG rule, replacing them with the resulting category; or take a unary (un) action to apply a CCG type-raising or type-changing rule to the stack-top element. For example, the deterministic sequence of shift-reduce actions that builds the derivation in Fig. 1 is: sh ⇒ NP, un ⇒ S/(S\NP), sh ⇒ (S\NP)/NP, re ⇒ S/NP, sh ⇒ NP and re ⇒ S, where we use ⇒ to indicate the CCG category produced by an action.
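The transition system above can be illustrated with a minimal sketch that replays the example action sequence on a stack and a queue. It does not implement CCG combinators: each action simply carries the category it produces, as in the ⇒ notation. The function name `replay` and the placeholder words `w0`, `w1`, `w2` are assumptions made for illustration, since the sentence of Fig. 1 is not reproduced here.

```python
from collections import deque


def replay(words, actions):
    """Replay (action, resulting category) pairs; return stack, queue, step count."""
    queue = deque(words)  # unread input words
    stack = []            # categories of the partially built derivations
    steps = 0
    for action, category in actions:
        if action == "sh":
            # shift: consume the front word and push its chosen lexical category
            queue.popleft()
            stack.append(category)
        elif action == "re":
            # reduce: replace the top two subtrees with the resulting category
            stack.pop()
            stack.pop()
            stack.append(category)
        elif action == "un":
            # unary: type-raise or type-change the stack-top element
            stack.pop()
            stack.append(category)
        else:
            raise ValueError(f"unknown action: {action}")
        steps += 1
    return stack, list(queue), steps


# Hypothetical three-word input; the action sequence is the one given in the text.
WORDS = ["w0", "w1", "w2"]
ACTIONS = [
    ("sh", "NP"),
    ("un", "S/(S\\NP)"),
    ("sh", "(S\\NP)/NP"),
    ("re", "S/NP"),
    ("sh", "NP"),
    ("re", "S"),
]

final_stack, final_queue, n_steps = replay(WORDS, ACTIONS)
print(final_stack, final_queue, n_steps)
```

With three words and one unary action, the derivation is complete after six actions, leaving a single category S on the stack and an empty queue.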

LSTM
Recurrent neural networks (RNNs; e.g., see Elman, 1990) are factored into an input layer x_t and a hidden state (layer) h_t with recurrent connections, and they can be represented by the following recurrence:
\[ h_t = \Phi_\theta(x_t, h_{t-1}) \tag{1} \]
where x_t is the current input, h_{t−1} is the previous hidden state and Φ is a set of affine transformations parametrized by θ. Here, we use a variant of RNN referred to as LSTMs, which augment Eq. 1 with a cell state, c_t, s.t.
\[ h_t, c_t = \Phi_\theta(x_t, h_{t-1}, c_{t-1}) \tag{2} \]
Compared with conventional RNNs, this extra facility gives LSTMs more persistent memories over longer time delays and makes them less susceptible to the vanishing gradient problem (Bengio et al., 1994). Hence, they are better at modeling temporal events that are arbitrarily far apart in a sequence. Several extensions to the vanilla LSTM have been proposed over time, each with a modified instantiation of Φ_θ that exerts refined control over, e.g., whether the cell state could be reset  or whether extra connections are added to the cell state. Our instantiation is as follows for all LSTMs: where σ is the sigmoid activation and ⊙ is the element-wise product.
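To make the role of the cell state concrete, the following is a minimal sketch of one step of a commonly used LSTM formulation (input, forget and output gates, no extra connections to the cell state). This is an assumption for illustration and may differ from the instantiation used in this paper; the names `lstm_step`, `init_params` and the parameter keys are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W x + b for a list-of-rows matrix W and list vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def lstm_step(params, x_t, h_prev, c_prev):
    """One step of a standard LSTM cell (assumed variant, not the paper's exact one)."""
    z = x_t + h_prev  # list concatenation: [x_t; h_{t-1}]
    i = [sigmoid(v) for v in affine(params["Wi"], z, params["bi"])]    # input gate
    f = [sigmoid(v) for v in affine(params["Wf"], z, params["bf"])]    # forget gate
    o = [sigmoid(v) for v in affine(params["Wo"], z, params["bo"])]    # output gate
    g = [math.tanh(v) for v in affine(params["Wg"], z, params["bg"])]  # candidate cell
    # element-wise products implement the gating of the cell state
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four transformations."""
    rng = random.Random(seed)
    params = {}
    for gate in "ifog":
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
            for _ in range(hidden_size)
        ]
        params["b" + gate] = [0.0] * hidden_size
    return params


params = init_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(params, x, h, c)
print(h, c)
```

The pair (h, c) returned at each step corresponds to the left-hand side of Eq. 2; only the body of Φ_θ changes between LSTM variants.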
In addition to unidirectional LSTMs that model an input sequence x_0, x_1, ..., x_{n−1} in a strict left-to-right order, we also use bidirectional LSTMs (BLSTMs; Graves and Schmidhuber, 2005), which read the input from both directions with two independent LSTMs. At each step, the forward hidden state h_t is computed using Eq. 2 for t = (0, 1, ..., n − 1); and the backward hidden state ĥ_t is computed similarly but from the reverse direction for t = (n − 1, n − 2, ..., 0). Together, the two hidden states at each step t capture both past and future contexts, and the representation for each x_t is obtained as the concatenation [h_t; ĥ_t].
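The bidirectional read can be sketched as two independent passes whose per-position states are concatenated. To keep the example short, a toy recurrent cell stands in for the LSTM of Eq. 2; `toy_step`, `run` and `bidirectional` are hypothetical names, and the toy cell is an assumption, not the model used in this paper.

```python
import math


def toy_step(x_t, h_prev):
    """Stand-in recurrent cell (NOT an LSTM): h_t = tanh(x_t + 0.5 * h_{t-1})."""
    return [math.tanh(x + 0.5 * h) for x, h in zip(x_t, h_prev)]


def run(step, xs, h0):
    """Apply `step` over xs in the given order and return every hidden state."""
    states, h = [], h0
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, xs, h0):
    """Return [h_t; h^_t] for every position t."""
    forward = run(step_fwd, xs, h0)               # t = 0, ..., n-1
    backward = run(step_bwd, xs[::-1], h0)[::-1]  # t = n-1, ..., 0, re-aligned to positions
    return [f + b for f, b in zip(forward, backward)]  # list concatenation per position


xs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
reps = bidirectional(toy_step, toy_step, xs, h0=[0.0, 0.0])
print(reps)
```

Each representation has twice the hidden size: the first half summarizes the prefix ending at position t, the second half the suffix starting at t.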

Embeddings
The neural network model employed by Chen and Manning (2014), and followed by a number of other parsers (Weiss et al., 2015; Zhou et al., 2015; Ambati et al., 2016; Andor et al., 2016; Xu et al., 2016), allows higher-order feature conjunctions to be automatically discovered from a set of dense feature embeddings. However, a set of atomic feature templates, which are only sensitive to contexts from the top few elements on the stack and queue, are still needed to dictate the choice of these embeddings. Instead, we dispense with such templates and seek to design a model that is sensitive to both local and non-local contexts, on both the stack and queue. Consequently, embeddings represent atomic input units that are added to our parser and are preserved throughout parsing. In total we use four types of embeddings, namely, word, CCG category, POS and action, where each has an associated look-up table that maps a string of that type to its embedding. The look-up table for words is L_w ∈ R^{k×|w|}, where k is the embedding dimension and |w| is the size of the vocabulary. Similarly, we have look-up tables for CCG categories, L_c ∈ R^{l×|c|}, for the three types of actions, L_a ∈ R^{m×3}, and for POS tags, L_p ∈ R^{n×|p|}.
input: w_0 ... w_{n−1}
axiom: 0 : (0, ε, β, φ)
goal: 2n − 1 + µ : (n, δ, ε, ∆)
sh: t : (j, δ, x_{w_j}|β, ∆) ⟹ t + 1 : (j + 1, δ|x_{w_j}, β, ∆)   (0 ≤ j < n)
Figure 2: The shift-reduce deduction system. For the sh deduction, x_{w_j} denotes an available lexical category for w_j; for re, x denotes the set of dependencies on x.
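The four look-up tables described above can be sketched as plain mappings from strings to vectors. The toy vocabularies, the dimensions, the random initialization range and the `<unk>` fallback entry are all assumptions made for illustration; `make_table` and `lookup` are hypothetical helper names.

```python
import random


def make_table(symbols, dim, rng):
    """Map each symbol to a `dim`-dimensional vector (one column of a look-up table)."""
    return {sym: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for sym in symbols}


def lookup(table, symbol, unk="<unk>"):
    """Return the embedding of `symbol`, falling back to the assumed <unk> entry."""
    return table[symbol] if symbol in table else table[unk]


rng = random.Random(0)
# Toy vocabularies; real tables would cover the full vocabulary of each type.
L_w = make_table(["<unk>", "the", "zoo"], dim=4, rng=rng)       # words
L_c = make_table(["<unk>", "NP", "S", "S/NP"], dim=3, rng=rng)  # CCG categories
L_a = make_table(["sh", "re", "un"], dim=2, rng=rng)            # the three action types
L_p = make_table(["<unk>", "DT", "NN"], dim=2, rng=rng)         # POS tags

print(lookup(L_w, "zoo"), lookup(L_c, "NP"), lookup(L_a, "sh"), lookup(L_p, "NN"))
```

Keeping one table per type lets each embedding dimension be chosen independently, which matches the separate dimensions k, l, m and n in the text.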

Model
Parser. Fig. 2 shows the deduction system of our parser. We denote each parse item as (j, δ, β, ∆), where j is the positional index of the word at the front of the queue, δ is the stack (with its top element s_0 to the right), β is the queue (with its top element w_j to the left) and ∆ is the set of CCG dependencies realized for the input consumed so far. Each item is also associated with a step indicator t, signifying the number of actions applied to it, and the goal is reached in 2n − 1 + µ steps, where µ is the total number of un actions. We also define each action in our parser as a 4-tuple (τ_t, c_t, w_{c_t}, p_{w_{c_t}}), where τ_t ∈ {sh, re, un} for t ≥ 1, c_t is the resulting category of τ_t, and w_{c_t} is the head word attached to c_t with p_{w_{c_t}} being its POS tag.
LSTM model. LSTMs are designed to handle time-series data in a purely sequential fashion, and we try to exploit this fact by completely linearizing all aspects of the parsing history. Concretely, we factor the model as five LSTMs, comprising four unidirectional ones, denoted as U, V, X and Y, and an additional BLSTM, denoted as W (Fig. 3). Before parsing each sentence, we feed W with the complete input (padded with a special embedding ⊥ as the end of sentence token); and we use w_j = [h^W_j; ĥ^W_j] to represent w_j in subsequent steps. We also add ⊥ to the other four unidirectional LSTMs as initialization.
Given this factorization, the stack representation for a parse item (j, δ, β, ∆) at step t, for t ≥ 1, is obtained as
\[ \delta_t = [h^U_t; h^V_t; h^X_t; h^Y_t] \tag{3} \]
and together with w_j, [δ_t; w_j] gives a representation for the parse item. For the axiom item, we represent it as [δ_0; w_0], where δ_0 is formed in the same way from the hidden states of U, V, X and Y after the ⊥ initialization. Each time the parser applies an action (τ_t, c_t, w_{c_t}, p_{w_{c_t}}), we update the model by adding the embedding of τ_t, denoted as L_a(τ_t), onto U, and adding the other three embeddings of the action 4-tuple, that is, L_c(c_t), L_w(w_{c_t}) and L_p(p_{w_{c_t}}), onto V, X and Y respectively.
To predict the next action, we first derive an action hidden layer b_t, by passing the parse item representation [δ_t; w_j] through an affine transformation, s.t.
\[ b_t = f(B[\delta_t; w_j] + r) \tag{4} \]
where B is a parameter matrix of the model, r is a bias vector and f is a ReLU non-linearity (Nair and Hinton, 2010). Then we apply another affine transformation (with A as the weights and s as the bias) to b_t:
\[ a_t = A b_t + s \]
and obtain the probability of the i-th action in a_t as
\[ p(\tau^i_t \mid b_t) = \frac{\exp(a^i_t)}{\sum_{\tau^k_t \in T(\delta_t, \beta_t)} \exp(a^k_t)} \]
where T(δ_t, β_t) is the set of feasible actions for the current parse item, and τ^i_t ∈ T(δ_t, β_t).
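The prediction step can be sketched as a ReLU hidden layer followed by an affine layer whose scores are normalized only over the feasible actions. The toy parameter values, the indexing of actions by integers and the function names are assumptions for illustration; in the parser the input would be the parse item representation [δ_t; w_j].

```python
import math


def affine(W, x, b):
    """Return W x + b for a list-of-rows matrix W and list vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def action_probs(item_repr, B, r, A, s, feasible):
    """Distribution over the feasible action indices for one parse item."""
    b_t = [max(0.0, v) for v in affine(B, item_repr, r)]  # ReLU action hidden layer
    a_t = affine(A, b_t, s)                               # one score per action
    top = max(a_t[i] for i in feasible)                   # subtract max for stability
    exps = {i: math.exp(a_t[i] - top) for i in feasible}  # infeasible actions get no mass
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Toy parameters (hypothetical values): 2-dim item, 3 hidden units, 4 actions.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r = [0.0, 0.0, 0.0]
A = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s = [0.0, 0.0, 0.0, 0.0]

probs = action_probs([1.0, -1.0], B, r, A, s, feasible=[0, 1])
print(probs)
```

Restricting the normalization to the feasible set means the model never wastes probability mass on actions that the transition system would reject at the current item.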

Derivations and Dependency Structures
Our model naturally linearizes CCG derivations "incrementally" following their post-order traversals. As such, the four unidirectional LSTMs always have the same number of steps; and at each step, the concatenation of their hidden states (Eq. 3) represents a point in a CCG derivation (i.e., an action 4-tuple). Due to the large amount of flexibility in how dependencies are realized in CCG (Hockenmaier, 2003; Clark and Curran, 2007), and in line with most existing CCG parsing models, including dependency models, we have chosen to model CCG derivations, rather than dependency structures. We also hypothesize that tree structures are not necessary for the current model, since they are already implicit in the linearized derivations; similarly, we have found the action embeddings to be nonessential (§5.2).

Training
As a baseline, we first train a greedy model, in which we maximize the log-likelihood of each target action in the training data. More specifically, let (τ^g_1, ..., τ^g_{T_n}) be the gold-standard action sequence for a training sentence n. A cross-entropy criterion is used to obtain the error gradients, and for each sentence, training involves minimizing
\[ L(\theta) = -\sum_{t=1}^{T_n} \log p(\tau^g_t \mid b_t) \]
where θ is the set of all parameters in the model.
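The per-sentence greedy objective can be sketched as the negative log-likelihood of the gold actions under the per-step action distributions. The dictionary representation of each distribution and the name `greedy_loss` are assumptions for illustration; any regularization term is left out.

```python
import math


def greedy_loss(step_probs, gold_actions):
    """Negative log-likelihood of one gold action sequence.

    step_probs[t] maps each feasible action at step t to its model probability;
    gold_actions[t] is the gold-standard action at that step.
    """
    return -sum(math.log(probs[gold]) for probs, gold in zip(step_probs, gold_actions))


# Toy distributions for a two-step sequence (hypothetical values).
step_probs = [{"sh": 0.5, "re": 0.5}, {"sh": 0.25, "un": 0.75}]
gold_actions = ["sh", "un"]

loss = greedy_loss(step_probs, gold_actions)
print(loss)
```

Because each term depends only on the distribution at its own step, the objective is local: it never compares complete action sequences against each other.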
Like other greedy models (e.g., see Chen and Manning (2014) and Dyer et al. (2015)), our greedy model is locally optimized, and suffers from the label bias problem (Andor et al., 2016). A partial solution to this is to use beam search at test time, thereby recovering higher scoring action sequences that would otherwise be unreachable with fully greedy inference. In practice, this has limited effect (Table 2), and a number of more principled solutions have been recently proposed to derive globally optimized models during training (Watanabe and Sumita, 2015; Weiss et al., 2015; Zhou et al., 2015; Andor et al., 2016). Here, we extend our greedy model into a global one by adapting the expected F-measure loss of Xu et al. (2016). To the best of our knowledge, this is the first attempt to train a globally optimized LSTM shift-reduce parser.
Let θ = {U, V, X, Y, W, B, A} be the weights of the baseline greedy model (footnote 6). We initialize the weights of the global model, which has the same architecture as the baseline, to θ, and we reoptimize θ in multiple training epochs as follows:
1. Pick a sentence x_n from the training set, decode it with beam search, and generate a k-best list of output parses with the current θ, denoted as Λ(x_n) (footnote 7).
2. For each parse y_i in Λ(x_n), compute its sentence-level F1 using the set of dependencies in the ∆ field of its parse item. In addition, let |y_i| be the total number of actions that derived y_i and s_θ(y^j_i) be the softmax action score of the j-th action, given by the LSTM model. Compute the log-linear score of its action sequence as \rho(y_i) = \sum_{j=1}^{|y_i|} \log s_\theta(y^j_i).
3. Compute the negative expected F1 objective (defined below) for x_n and minimize it using stochastic gradient descent (maximizing expected F1).
Repeat these three steps for the remaining sentences.
Footnote 6: We use boldface letters to designate the weights of the corresponding LSTMs, and omit bias terms for brevity.
Footnote 7: As in Xu et al. (2016), we did not preset k, and found k = 11.06 on average with a beam size of 8 that we used for this training.
More formally, the loss J(θ) is defined as
\[ J(\theta) = -\mathrm{XF1}(\theta) = -\sum_{y_i \in \Lambda(x_n)} p(y_i \mid \theta)\, F1(\Delta_{y_i}, \Delta^G_{x_n}) \]
where F1(∆_{y_i}, ∆^G_{x_n}) is the sentence level F1 of the parse derived by y_i, with respect to the gold-standard dependency structure ∆^G_{x_n} of x_n; p(y_i|θ) is the normalized probability score of the action sequence y_i, computed as
\[ p(y_i \mid \theta) = \frac{\exp\{\gamma \rho(y_i)\}}{\sum_{y \in \Lambda(x_n)} \exp\{\gamma \rho(y)\}} \]
where γ is a parameter that sharpens or flattens the distribution (Tromble et al., 2008). Different from the maximum-likelihood objective, XF1 optimizes the model on a sequence level and towards the final evaluation metric, by taking into account all action sequences in Λ(x_n).
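The expected F1 objective over a k-best list can be sketched as follows. The sketch only evaluates the loss value for one sentence; gradients, beam search and the parser itself are out of scope. The representation of dependencies as tuples, the default γ = 1.0 and all function names are assumptions made for illustration.

```python
import math


def f1(pred_deps, gold_deps):
    """Sentence-level F1 between predicted and gold dependency sets."""
    pred, gold = set(pred_deps), set(gold_deps)
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def sequence_score(action_scores):
    """Log-linear score of an action sequence: sum of log softmax action scores."""
    return sum(math.log(p) for p in action_scores)


def expected_f1_loss(kbest, gold_deps, gamma=1.0):
    """Negative expected F1 over a k-best list of (action scores, dependencies)."""
    scores = [gamma * sequence_score(action_scores) for action_scores, _ in kbest]
    top = max(scores)                               # subtract max for stability
    weights = [math.exp(sc - top) for sc in scores]
    total = sum(weights)                            # normalizer over the k-best list
    xf1 = sum(w / total * f1(deps, gold_deps) for w, (_, deps) in zip(weights, kbest))
    return -xf1


# Toy example: two equally scored parses, one fully correct and one half correct.
gold = {("a", "b"), ("b", "c")}
kbest = [
    ([0.5, 0.5], {("a", "b"), ("b", "c")}),
    ([0.5, 0.5], {("a", "b"), ("a", "c")}),
]
loss = expected_f1_loss(kbest, gold)
print(loss)
```

Raising γ concentrates the distribution on the highest-scoring sequences, while lowering it spreads the weight more evenly across the list.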

Attention-Based LSTM Supertagging
In addition to the size of the label space, supertagging is difficult because CCG categories can encode long-range dependencies and tagging decisions frequently depend on non-local contexts. For example, in He went to the zoo with a cat, a possible category for with, (S\NP)\(S\NP)/NP, depends on the word went further back in the sentence. Recently a number of RNN models have been proposed for CCG supertagging (Vaswani et al., 2016; Xu et al., 2016), and such models show dramatic improvements over non-recurrent models (Lewis and Steedman, 2014b). Although the underlying models differ in their exact architectures, all of them make each tagging decision using only the hidden states at the current input position, and this imposes a potential bottleneck in the model. To mitigate this, we generalize the attention mechanisms of Bahdanau et al. (2015) and Luong et al. (2015), and adapt them to supertagging, by allowing the model to explicitly use hidden states from more than one input position for tagging each word. Similar to Bahdanau et al. (2015) and Luong et al. (2015), a key feature in our model is a soft alignment vector that weights the relative importance of the considered hidden states.
For an input sentence w_0, w_1, ..., w_{n−1}, we consider w_t = [h_t; ĥ_t] (§3.1) to be the representation of the t-th word (0 ≤ t < n, w_t ∈ R^{2d×1}), given by a BLSTM with a hidden state size d for both its forward and backward layers (footnote 9). Let k be a context window size hyperparameter; we define H_t ∈ R^{2d×(k−1)} as H_t = [w_{t−⌊k/2⌋}, ..., w_{t−1}, w_{t+1}, ..., w_{t+⌊k/2⌋}], which contains representations for all words in the size k window except w_t. At each position t, the attention model derives a context vector c_t ∈ R^{2d×1} (defined below) from H_t, which is used in conjunction with w_t to produce an attentional hidden layer:
\[ x_t = f(M[c_t; w_t] + m) \tag{5} \]
where f is a ReLU non-linearity, M ∈ R^{g×4d} is a learned weight matrix, m is a bias term, and g is the size of x_t. Then x_t is used to produce another hidden layer (with N as the weights and n as the bias): z_t = N x_t + n, and the predictive distribution over categories is obtained by feeding z_t through a softmax activation. In order to derive the context vector c_t, we first compute b_t ∈ R^{(k−1)×1} from H_t and w_t using α ∈ R^{1×4d}, s.t. the i-th entry in b_t is
\[ b^i_t = \alpha [w_{T[i]}; w_t] \]
for i ∈ [0, k−1), T = [t−⌊k/2⌋, ..., t−1, t+1, ..., t+⌊k/2⌋]; and c_t is derived as follows:
\[ a_t = \mathrm{softmax}(b_t), \qquad c_t = H_t a_t \]
where a_t is the alignment vector. We also experiment with two types of attention reminiscent of the global and local models in Luong et al. (2015), where the first attends over all input words (k = n) and the second over a local window.
Footnote 9: Unlike in the parsing model, POS tags are excluded.
It is worth noting that two other works have concurrently tackled supertagging with BLSTM models. In Vaswani et al. (2016), a language model layer is added on top of a BLSTM, which allows embeddings of previously predicted tags to propagate through and influence the pending tagging decision. However, the language model layer is only effective when both scheduled sampling for training  and beam search for inference are used.
We show our attention-based models can match their performance, with only standard training and greedy decoding. Additionally,  presented a BLSTM model with two layers of stacking in each direction; and as an internal baseline, we show a non-stacking BLSTM without attention can achieve the same accuracy.
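The local attention step of the supertagger, described above, can be sketched as follows: score each neighbouring representation in the window against the current word, normalize the scores into an alignment vector, and take the weighted sum as the context vector. Clipping the window at sentence boundaries, the ordering of the concatenated pair, and the names `softmax` and `context_vector` are assumptions for illustration.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(reps, t, k, alpha):
    """Soft-attention context for position t over a size-k window that excludes t."""
    half = k // 2
    # Window positions, clipped at the sentence boundaries (an assumption).
    window = [i for i in range(t - half, t + half + 1) if i != t and 0 <= i < len(reps)]
    # One score per neighbour: alpha applied to the concatenation [w_i; w_t].
    scores = [sum(a * v for a, v in zip(alpha, reps[i] + reps[t])) for i in window]
    align = softmax(scores)  # alignment vector
    dim = len(reps[t])
    context = [sum(a * reps[i][j] for a, i in zip(align, window)) for j in range(dim)]
    return context, align, window


# Toy word representations (hypothetical values) and scoring vector.
reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
alpha = [0.5, -0.5, 0.1, 0.1]
context, align, window = context_vector(reps, t=2, k=5, alpha=alpha)
print(window, align, context)
```

Setting k to the sentence length recovers the global variant, in which every other word in the sentence contributes to the context vector.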

Experiments
Dataset and baselines. We conducted all experiments on CCGBank (Hockenmaier and Steedman, 2007) with the standard splits. We assigned POS tags with the C&C POS tagger, and used 10-fold jackknifing for both POS tagging and supertagging. All parsers were evaluated using F1 over labeled CCG dependencies.
For supertagging, the baseline models are the RNN model of , the bidirectional RNN (BRNN) model of Xu et al. (2016), and the BLSTM supertagging models in Vaswani et al. (2016) and . For parsing experiments, we compared with the global beam-search shift-reduce parsers of Zhang and Clark (2011) and Xu et al. (2014). One neural shift-reduce CCG parser baseline is Ambati et al. (2016), which is a beam-search shift-reduce parser based on Chen and Manning (2014) and Weiss et al. (2015); and the others are the RNN shift-reduce models in Xu et al. (2016). Additionally, the chart-based C&C parser was included by default.
Model and training parameters. All our LSTM models are non-stacking with a single layer. For the supertagging models, the LSTM hidden state size is 256, and the size of the attentional hidden layer (x_t, Eq. 5) is 200. All parsing model LSTMs have a hidden state size of 128, and the size of the action hidden layer (b_t, Eq. 4) is 80. Pretrained word embeddings for all models are 100-dimensional (Turian et al., 2010), and all other embeddings are 50-dimensional. We also pretrained CCG lexical category and POS embeddings on the concatenation of the training data and a Wikipedia dump parsed with C&C (footnote 13). All other parameters were uniformly initialized in ±√(6/(r + c)), where r and c are the number of rows and columns of a matrix (Glorot and Bengio, 2010).
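The uniform initialization of Glorot and Bengio (2010) draws each entry of an r × c matrix from the interval ±√(6/(r + c)). A minimal sketch, in which the function name `glorot_uniform` and the use of Python's `random.Random` are assumptions for illustration:

```python
import math
import random


def glorot_uniform(rows, cols, rng):
    """Sample a rows x cols matrix uniformly from [-limit, limit]."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
W = glorot_uniform(4, 6, rng)  # toy shape; limit = sqrt(6 / 10)
print(W[0])
```

The bound shrinks as the matrix grows, which keeps the scale of activations roughly stable across layers of different sizes.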
For training, we used plain non-minibatched stochastic gradient descent with an initial learning rate η_0 = 0.1 and we kept iterating in epochs until accuracy no longer increased on the dev set. For all models, a learning rate schedule η_e = η_0/(1 + λe) with λ = 0.08 was used for e ≥ 11. Gradients were clipped whenever their norm exceeded 5. Dropout training as suggested by Zaremba et al. (2014), with a dropout rate of 0.3, and an ℓ2 penalty of 1 × 10^{−5}, were applied to all models.
Footnote 13: We used the gensim word2vec toolkit: https://radimrehurek.com/gensim/.

Supertagging Results
Table 1 summarizes 1-best supertagging results. Our baseline BLSTM model without attention achieves the same level of accuracy as  and the baseline BLSTM model of Vaswani et al. (2016). Compared with the latter, our hidden state size is 50% smaller (256 vs. 512).
For training and testing the local attention model (BLSTM-local), we used an attention window size of 5 (tuned on the dev set), and it gives an improvement of 0.94% over the BRNN supertagger (Xu et al., 2016), achieving an accuracy on par with the beam-search (size 12) model of Vaswani et al. (2016) that is enhanced with a language model. Despite being able to consider wider contexts than the local model, the global attention model (BLSTM-global) did not show further gains, hence we used BLSTM-local for all parsing experiments below.

Parsing Results
All parsers we consider use a supertagger probability cutoff β to prune categories less likely than β times the probability of the best category in a distribution: the C&C parser uses an adaptive strategy to back off to smaller β values if no spanning analysis is found given an initial β setting; all the shift-reduce parsers use fixed β values without backing off. Since β determines the derivation space of a parser, it has a large impact on the final parsing accuracy.
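The fixed-β pruning used by the shift-reduce parsers can be sketched as a filter on each word's category distribution, together with the resulting average ambiguity (categories per word). The toy distributions and the function names are assumptions for illustration; the adaptive back-off of the C&C parser is not shown.

```python
def prune_categories(dist, beta):
    """Keep categories whose probability is at least beta times the best one."""
    best = max(dist.values())
    return {cat: p for cat, p in dist.items() if p >= beta * best}


def average_ambiguity(dists, beta):
    """Average number of categories per word that survive the cutoff."""
    return sum(len(prune_categories(d, beta)) for d in dists) / len(dists)


# Toy supertagger output for two words (hypothetical probabilities).
dists = [
    {"NP": 0.90, "N": 0.08, "S/NP": 0.02},
    {"(S\\NP)/NP": 0.97, "S\\NP": 0.03},
]

print(average_ambiguity(dists, beta=0.06))  # large cutoff: few categories per word
print(average_ambiguity(dists, beta=1e-5))  # small cutoff: more categories per word
```

A smaller β keeps more categories per word and so enlarges the derivation space, while a larger β makes the supertagging decisions close to deterministic.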
For the maximum-likelihood greedy model, we found using a small β value (bigger ambiguity) for training significantly improved accuracy, and we chose β = 1 × 10^{−5} (5.22 categories per word with jackknifing) via development experiments. This reinforces the findings in a number of other CCG parsers (Clark and Curran, 2007; Auli and Lopez, 2011a; Lewis and Steedman, 2014a): even though a smaller β increases ambiguity, it leads to more accurate models at test time. On the other hand, we found using larger β values at test time led to significantly better results (Table 2). This differs from the beam-search models that use the same β value for both training and testing (Zhang and Clark, 2011; Xu et al., 2014).
Table 4: Parsing results on the dev (Section 00) and test (Section 23) sets with 100% coverage, with all LSTM models using the BLSTM-local supertagging model. All experiments using auto POS. CAT (lexical category assignment accuracy). LSTM-greedy is the full greedy parser.
The greedy model. Table 3 shows the dev set results for all greedy models, where the four types of embeddings, that is, word (w), CCG category (c), action (a) and POS (p), are gradually introduced. The full model LSTM-w+c+a+p surpasses all previous shift-reduce models (Table 4), achieving a dev set accuracy of 86.56%. Category embeddings (LSTM-w+c) yielded a large gain over using word embeddings alone (LSTM-w); action embeddings (LSTM-w+c+a) provided little improvement, but further adding POS embeddings (LSTM-w+c+a+p) gave noticeable recall (+0.61%) and F1 improvements (+0.36%) over LSTM-w+c. Fig. 4a shows the learning curves, where all models converged in under 30 epochs.
The XF1 model. a β value of 0.06 for both training and testing (tuned on the dev set); and training took 12 epochs to converge (Fig. 4b)
Effect of the supertagger. To isolate the parsing model from the supertagging model, we first experimented with the BRNN supertagging model as in Xu et al. (2016) for both training and testing the full greedy LSTM parser. Using this supertagger, we still achieved the highest F1 (85.86%) on the dev set (LSTM-BRNN, Table 5) in comparison with all previous shift-reduce models; and an improvement of 1.42% F1 over the greedy model of Xu et al. (2016) was obtained on the test set (Table 4). We then experimented with using the baseline BLSTM supertagging model for parsing (LSTM-BLSTM), and observed the attention-based setup (LSTM-greedy) outperformed it, even though the attention-based supertagger (BLSTM-local) did not give better multi-tagging accuracy. We attribute this to the fact that large β cutoff values, which result in almost deterministic supertagging decisions on average, are required by the parser during inference; for instance, BLSTM-local has an average ambiguity of 1.09 on the dev set with β = 0.06 (footnote 14).
Table 5: Effect of different supertaggers on the full greedy parser. LSTM-greedy is the same parser as in Table 4, which uses the BLSTM-local supertagger.
Footnote 14: All β cutoffs were tuned on the dev set; for BRNN, we found the same β settings as in Xu et al. (2016) to be optimal; for BLSTM, β = 4 × 10^{−5} for training (with an ambiguity of 5.27) and β = 0.02 for testing (with an ambiguity of 1.17).
Comparison with chart-based models. For completeness and to put our results in perspective, we compare our XF1 models with other CCG parsers in the literature (Table 6):  is the log-linear C&C dependency hybrid model with an RNN supertagger front-end;  is an LSTM supertagger-factored parser using the A* CCG parsing algorithm of Lewis and Steedman (2014a); Vaswani et al. (2016) combine a BLSTM supertagger with a new version of the C&C parser  that uses a max-violation perceptron, which significantly improves over the original C&C models; and finally, a global recursive neural network model with A* decoding .
We note that all these alternative models, with the exception of  and , use structured training that accounts for violations of the gold-standard, and we conjecture further improvements for our model are possible by incorporating such mechanisms.

Conclusion
We have presented an LSTM parsing model for CCG, with a factorization allowing the linearization of the complete parsing history. We have shown that this simple model is highly effective, with results outperforming all previous shift-reduce CCG parsers. We have also shown global optimization benefits an LSTM shift-reduce model; and contrary to previous findings with the averaged perceptron (Zhang and Clark, 2008), we empirically demonstrated beam-search inference is not necessary for our globally optimized model. For future work, a natural direction is to explore integrated supertagging and parsing in a single neural model (Zhang and Weiss, 2016).