LSTM CCG Parsing

We demonstrate that a state-of-the-art parser can be built using only a lexical tagging model and a deterministic grammar, with no explicit model of bi-lexical dependencies. Instead, all dependencies are implicitly encoded in an LSTM supertagger that assigns CCG lexical categories. The parser significantly outperforms all previously published CCG results, supports efficient and optimal A* decoding, and benefits substantially from semi-supervised tri-training. We give a detailed analysis, demonstrating that the parser can recover long-range dependencies with high accuracy and that the semi-supervised learning enables significant accuracy gains. By running the LSTM on a GPU, we are able to parse over 2600 sentences per second while improving state-of-the-art accuracy by 1.1 F1 in domain and up to 4.5 F1 out of domain.


Introduction
Combinatory Categorial Grammar (CCG) is a strongly lexicalized formalism: the vast majority of attachment decisions during parsing are specified by the selection of lexical entries for words (see Figure 1 for examples). State-of-the-art parsers typically include a supertagging model, to select possible lexical categories, and a bi-lexical dependency model, to resolve the remaining parse attachment ambiguities. In this paper, we introduce a long short-term memory (LSTM) CCG parsing model that has no explicit model of bi-lexical dependencies, but instead relies on a bi-directional recurrent neural network (RNN) supertagger to capture all long distance dependencies. This approach has a number of advantages: it is conceptually simple, allows for the reuse of existing optimal and efficient parsing algorithms, benefits significantly from semi-supervised learning, and is highly accurate both in and out of domain. The parser is publicly released.
Neural networks have shown strong performance in a range of NLP tasks; however, they can break the dynamic programs for structured prediction problems, such as parsing, when vector embeddings are recursively computed for subparts of the output. Existing neural net parsers either (1) use greedy inference techniques including shift-reduce parsing (Henderson et al., 2013; Chen and Manning, 2014; Weiss et al., 2015), constituency parse re-ranking (Socher et al., 2013), and string-to-string transduction (Vinyals et al., 2015), or (2) avoid recursive computations entirely (Durrett and Klein, 2015). Our approach gives a simple alternative: we only train a model for tagging decisions, where we can easily use recurrent architectures such as LSTMs (Hochreiter and Schmidhuber, 1997), and rely on the highly lexicalized nature of the CCG grammar to allow this tagger to specify nearly every aspect of the complete parse.
Our LSTM supertagger is bi-directional and includes a softmax potential over tags for each word in the sentence. During training, we jointly optimize all LSTM parameters, including the word embeddings, to maximize the conditional likelihood of supertag sequences. For inference, we use a recently introduced A* CCG parsing algorithm (Lewis and Steedman, 2014a), which efficiently searches for the highest probability sequence of tags that combine to produce a complete parse tree. Whenever there is parsing ambiguity not specified by the supertags, the model attaches low (see Figure 1).
Figure 1: Four examples of prepositional phrase attachment in CCG. In the upper two parses, the attachment decision is determined by the choice of supertags. In the lower parses, the attachment is ambiguous given the supertags. In such cases, our parser deterministically attaches low (i.e. preferring the lower-right parse).
This approach is not only conceptually simple but also highly effective, as we demonstrate with extensive experiments. Because the A* algorithm is extremely efficient and the LSTMs can be run in parallel on GPUs, the end-to-end parser can process over 2600 sentences per second. This is more than three times the speed of any publicly available parser for any formalism. Apart from Hall et al. (2014), we are not aware of efficient algorithms for running other state-of-the-art parsers on GPUs. The LSTM parameters also benefit from semi-supervised training, which we demonstrate by employing a recently introduced tri-training scheme (Weiss et al., 2015). Finally, the recurrent nature of the LSTM allows for effective modelling of long distance dependencies, as we show empirically. Our approach significantly advances the state-of-the-art on benchmark datasets, improving accuracy by 1.1 F1 in domain and up to 4.5 F1 out of domain.

Background
Combinatory Categorial Grammar (CCG). Compared to a phrase-structure grammar, CCG contains a much smaller set of binary rules (we use 11), but a much larger set of lexical tags (we use 425). The binary rules are conjectured to be language-universal, and most language-specific information is lexicalized (Steedman, 2000). The large tag set means that most (but not all) attachment decisions are determined by tagging decisions. Figure 1 shows how a prepositional phrase attachment decision can be encoded in the choice of tags.
The process of assigning CCG categories to words is called supertagging. All supertaggers used in practice are probabilistic, providing a distribution over possible tags for each word. Parsing models either use these scores directly (Auli and Lopez, 2011b), or as a form of beam search (Clark and Curran, 2007), typically in conjunction with models of the dependencies or derivation.
Supertag-Factored A* CCG Parsing. Lewis and Steedman (2014a) introduced supertag-factored CCG parsers, in which the score for a parse is simply the sum of the scores of its supertags. The parser takes in a distribution over supertags for each word, and outputs the highest scoring parse, subject to the hard constraint that the parse only uses standard CCG combinators (resolving any remaining ambiguity by attaching low). One advantage of the supertag-factored model is that it allows a simple A* parsing algorithm, which provably finds the highest scoring supertag sequence that can be combined to construct a complete parse.
In A* parsing, partial parses y_{i,j} of span i...j are maintained in a sorted agenda and added to the chart in order of their cost, which is the sum of their Viterbi inside score g(y_{i,j}) and an upper bound on their Viterbi outside score h(y_{i,j}). When y_{i,j} is added to the chart, the agenda is updated with any new partial parses that can be created by combining y_{i,j} with existing chart items (Algorithm 1). If h is a monotonic upper bound on the outside score, the first chart entry for a span with a given category is guaranteed to be optimal: all other possible completions of the competing partial parses provably have lower scores, due to the outside score bounds. There is no guarantee this certificate of optimality is achieved efficiently for parses of the whole sentence, and in the worst case the algorithm could fill the entire parse chart. However, as we will see later, A* parsing is very efficient in practice for the models we present in this paper.
Figure 2: Visualization of our supertagging model, based on stacked bi-directional LSTMs. Each word is fed into stacked LSTMs reading the sentence in each direction, the outputs of the LSTMs are combined, and there is a final softmax over categories. Example labels in the figure: doctor NP; sent (S_pss\NP)/PP; for PP/NP.
In the supertag-factored model, g and h are computed as follows, where g(y_k) is the score for word k having tag y_k:
\[ g(y_{i,j}) = \sum_{k=i}^{j} g(y_k) \tag{1} \]
\[ h(y_{i,j}) = \sum_{k=1}^{i-1} \max_{y_k} g(y_k) + \sum_{k=j+1}^{N} \max_{y_k} g(y_k) \tag{2} \]
where Eq. 1 follows from the definition of the supertag-factored model and Eq. 2 combines this definition with the fact that the max score over all supertags for a word is an upper bound on the score for the actual supertag used in the best parse.
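As an illustration, the inside score and the outside bound can be computed as in the following Python sketch. It assumes that scores are log-probabilities, that words are indexed from 0, and that each word's tag distribution is a dictionary; the function names and the toy scores are hypothetical and are not taken from the released parser.

```python
import math

def inside_score(tag_scores, tags, i, j):
    """g(y_{i,j}): sum of the scores of the tags chosen for words i..j (0-based, inclusive)."""
    return sum(tag_scores[k][tags[k]] for k in range(i, j + 1))

def outside_bound(tag_scores, i, j):
    """h(y_{i,j}): for every word outside i..j, add the score of its best tag."""
    best = [max(dist.values()) for dist in tag_scores]
    return sum(best[:i]) + sum(best[j + 1:])

# Hypothetical log-probabilities for a three-word sentence.
tag_scores = [
    {"NP": math.log(0.9), "N": math.log(0.1)},
    {"(S\\NP)/NP": math.log(0.3), "S\\NP": math.log(0.7)},
    {"NP": math.log(0.8), "N": math.log(0.2)},
]
tags = ["NP", "(S\\NP)/NP", "NP"]

g = inside_score(tag_scores, tags, 1, 2)   # log 0.3 + log 0.8
h = outside_bound(tag_scores, 1, 2)        # log 0.9, the best tag for word 0
```

Because the bound for each outside word is the maximum over its tags, h can never underestimate the score of any completion, which is the property the optimality argument relies on.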

LSTM CCG Supertagging Model
Supertagging is almost parsing (Bangalore and Joshi, 1999); consequently the task is very challenging, with hundreds of tags, and the correct assignment often depending on long-range dependencies. For example, in The doctor sent for the patient arrived, the category for sent depends on the final word. Recent work has made dramatic progress, using feed-forward neural networks (Lewis and Steedman, 2014b) and RNNs (Xu et al., 2015). We make several extensions to previous work on supertagging. Firstly, we use bi-directional models, to bring both previous and subsequent sentence context into supertagging decisions. Secondly, we use LSTMs, rather than RNNs. Many tagging decisions rely on long-range context, and RNNs typically struggle to account for sequences longer than a few words (Hochreiter and Schmidhuber, 1997). Finally, we use a deep architecture, to allow the modelling of complex interactions in the context.
Algorithm 1: Agenda-based parsing algorithm.
Definitions: x_{1...N} are the input words, and y variables denote scored partial parses. TAG(x_{1...N}) returns a set of scored pre-terminals for every word. ADD(C, y) adds partial parse y to chart C. RULES(C, y) returns the set of scored partial parses that can be created by combining y with existing entries in C. The agenda A is ordered as described in Section 2.
 8: if y ∉ C then
 9:   ADD(C, y)
10:   for y′ ∈ RULES(C, y) do
11:     INSERT(A, y′)
12: return C_{1,N}
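The agenda-based search of Algorithm 1 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the released implementation: categories are atomic strings or (result, slash, argument) tuples, only forward and backward application are supported instead of the full rule set, scores are log-probabilities, and the first item covering the whole sentence is returned whatever its category. All names are hypothetical.

```python
import heapq
import math
from itertools import count

def combine(left, right):
    """Toy grammar: forward application (X/Y Y => X) and backward application (Y X\\Y => X)."""
    results = []
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        results.append(left[0])
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        results.append(right[0])
    return results

def astar_parse(tag_scores):
    """tag_scores[k] maps each candidate category of word k to its log-probability."""
    n = len(tag_scores)
    best = [max(dist.values()) for dist in tag_scores]

    def h(i, j):
        # Upper bound on the outside score: best tag score for every word outside i..j.
        return sum(best[:i]) + sum(best[j + 1:])

    tie = count()  # unique tie-breaker, so the heap never has to compare categories
    agenda = []
    for k, dist in enumerate(tag_scores):
        for cat, g in dist.items():
            heapq.heappush(agenda, (-(g + h(k, k)), next(tie), g, k, k, cat))

    chart = {}  # (i, j, category) -> inside score
    while agenda:
        _, _, g, i, j, cat = heapq.heappop(agenda)
        if (i, j, cat) in chart:
            continue  # the first time an item is popped it already has its best score
        chart[(i, j, cat)] = g
        if (i, j) == (0, n - 1):
            return cat, g
        for (a, b, other), g2 in list(chart.items()):
            if b + 1 == i:  # 'other' is adjacent on the left
                for res in combine(other, cat):
                    total = g2 + g
                    heapq.heappush(agenda, (-(total + h(a, j)), next(tie), total, a, j, res))
            if j + 1 == a:  # 'other' is adjacent on the right
                for res in combine(cat, other):
                    total = g + g2
                    heapq.heappush(agenda, (-(total + h(i, b)), next(tie), total, i, b, res))
    return None  # no sequence of candidate tags yields a complete parse

IV = ("S", "\\", "NP")   # intransitive verb: S\NP
TV = (IV, "/", "NP")     # transitive verb: (S\NP)/NP
sentence_scores = [
    {"NP": math.log(0.9), "N": math.log(0.1)},
    {IV: math.log(0.7), TV: math.log(0.3)},
    {"NP": math.log(0.8), "N": math.log(0.2)},
]
root, score = astar_parse(sentence_scores)
```

In this example the individually most probable tag for the verb (S\NP) cannot be combined into a complete parse, so the search returns the analysis that uses (S\NP)/NP instead. This illustrates how the grammar acts as a hard constraint on the tag sequence.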
Our supertagging model is summarized in Figure 2. Each word is mapped to an embedding vector. This vector is a concatenation of an embedding for the word (lower-cased), and embeddings for features of the word (we use 1 to 4 character prefixes and suffixes). The embedding vector is used as input to two stacked LSTMs (with depth 2), one processing the sentence left-to-right, and the other right-to-left.
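The construction of the input vector can be sketched as below. The sketch assumes that prefixes and suffixes are taken from the word as written, that a word shorter than the affix length contributes the whole word, and that each feature type has its own lookup table with an 'unknown' fallback; none of these details are specified in the text, and all names are hypothetical.

```python
def word_features(word, max_affix=4):
    """Lookup keys for one token: the lower-cased word, then 1-4 character prefixes and suffixes."""
    prefixes = [("prefix", n, word[:n]) for n in range(1, max_affix + 1)]
    suffixes = [("suffix", n, word[-n:]) for n in range(1, max_affix + 1)]
    return [("word", word.lower())] + prefixes + suffixes

def embed(word, tables, unknown):
    """Concatenate one embedding per feature, using the table's 'unknown' vector for unseen keys."""
    vector = []
    for key in word_features(word):
        kind = key[0]
        vector.extend(tables[kind].get(key, unknown[kind]))
    return vector
```

With the sizes reported for the model (a 50-dimensional word embedding and eight 32-dimensional affix embeddings), the concatenated input would have 50 + 8 × 32 = 306 dimensions.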
The outputs from the LSTMs are projected into a further hidden layer, a bias is added, and a RELU non-linearity is applied. This layer gives a context-dependent representation of the word that is fed into a softmax over supertags.
We use a variant on the standard LSTM with coupled 'input' and 'forget' gates, and peephole connections. Each LSTM cell at position t takes three inputs: a cell state vector c_{t−1} and hidden state vector h_{t−1} from the cell at position t − 1, and x_t from the layer below. It outputs h_t to the layer above, and c_t and h_t to the cell at t + 1. c_t and h_t are computed as follows, where σ is the component-wise logistic sigmoid, and ∘ is the component-wise product:
We train the model using stochastic gradient descent, with a minibatch size of 1, a learning rate of 0.01, and using momentum with µ = 0.7. We then fine-tune models using a larger minibatch size of 32. Gradients whose L2 norm exceeds 5 are clipped. Training was run for 30 epochs, shuffling the order of sentences after each epoch, and we used the model parameters with the highest development supertagging accuracy. The input layer uses dropout with a rate of 0.5. All trainable parameters have L2 regularization of Λ = 10^{−6}. Word embeddings are initialized using 50-dimensional pre-trained values from Turian et al. (2010). For prefix and suffix embeddings, we use randomly initialized 32-dimensional vectors; features occurring fewer than 3 times are replaced with an 'unknown' embedding. We add special start and end tokens to each sentence, with trainable parameters. The LSTM state size is 128 and the RELU layer has a size of 64.
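One common way to write an LSTM step with coupled input and forget gates and peephole connections is sketched below in plain Python. The exact gate wiring is an assumption: here the input gate sees c_{t−1}, the output gate sees the new c_t, peepholes use full weight matrices, and the forget gate is tied to the input gate as f_t = 1 − i_t. The parameter names (W_i, W_c, W_o and the biases) are hypothetical.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of an LSTM cell; vectors are Python lists and '+' on lists is concatenation."""
    # Input gate, with a peephole on the previous cell state (assumed wiring).
    i_t = sigmoid(affine(p["W_i"], p["b_i"], c_prev + h_prev + x_t))
    # Coupled gates: whatever is not written is kept.
    f_t = [1.0 - a for a in i_t]
    # Candidate cell state.
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], h_prev + x_t)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Output gate, with a peephole on the new cell state (assumed wiring).
    o_t = sigmoid(affine(p["W_o"], p["b_o"], c_t + h_prev + x_t))
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Coupling the gates removes one weight matrix per cell compared with a standard LSTM, since the forget gate has no parameters of its own.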

Parsing Models
Our experiments focus on two parsing models:

Supertag-Factored
We use the supertagging model described in Section 3 to build a supertag-factored parser, closely following the approach described in Section 2. We also add a penalty of 0.1 (tuned on development data) for every time a unary rule is applied in a parse. The attach-low heuristic is implemented by adding a small penalty of −εd at every binary rule instantiation, where d is the absolute distance between the heads of the left and right children, and ε is a small constant. We increase the penalty to 10 for clitics, to encourage these to attach locally. Because these penalties are ≤ 0, they do not affect the A* upper bound calculations.
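The effect of the two penalties on a parse score can be illustrated with the short sketch below. It assumes log-probability tag scores and an illustrative value of ε = 10^{−6}, which is not given in the text; the clitic case is omitted, and the function names are hypothetical.

```python
def attach_low_penalty(left_head, right_head, epsilon=1e-6):
    """Penalty of -epsilon * d, where d is the distance between the two children's heads."""
    return -epsilon * abs(left_head - right_head)

def parse_score(tag_log_probs, num_unary, head_pairs, unary_penalty=0.1, epsilon=1e-6):
    """Supertag-factored score plus the unary-rule and attach-low penalties.

    head_pairs holds one (left_head, right_head) word-index pair per binary rule.
    """
    score = sum(tag_log_probs)
    score -= unary_penalty * num_unary
    score += sum(attach_low_penalty(l, r, epsilon) for l, r in head_pairs)
    return score

# Two derivations over the same supertags that differ only in one attachment.
tags = [-0.1, -0.2, -0.3]
low = parse_score(tags, 1, [(2, 3), (1, 2)])    # modifier attaches to the nearer head
high = parse_score(tags, 1, [(2, 3), (1, 3)])   # modifier attaches to the farther head
```

Because ε is tiny relative to differences between tag scores, the distance penalty only breaks ties between derivations that share the same supertags, so the lower attachment wins.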
Dependencies. We also train a model with dependency features, to investigate how much they improve accuracy beyond the supertag-factored model. We adapt a joint CCG and SRL model (Lewis et al., 2015) to CCGbank parsing, by assigning every CCGbank dependency a role based on its argument number (i.e., the first argument of every category has role ARG0). A global log-linear model is trained to maximize the marginal likelihood of the gold dependencies. We use the same features and hyperparameters as Lewis et al. (2015), except that we do not use the supertagger score feature (to separate the effect of the dependency features from the supertagger). We choose this model because it has an A* parsing algorithm, meaning that we do not need to use aggressive beam search.

Semi-supervised Learning
A number of papers have shown that strong parsers can be improved by exploiting text without gold-standard annotations. Recent work suggests tri-training, in which the output of two parsers is intersected to create training data for a third parser, is highly effective (Weiss et al., 2015).
We perform the first application of tri-training to a lexicalized formalism. Following Weiss et al., we parse the corpus of Chelba et al. (2013) with a shift-reduce parser and a chart-based model. We use the shift-reduce parser from Ambati et al. (2016) and our dependency model (without using a supertagger feature, to limit the correlation with our tagging model). On development sentences where the parsers produce the same supertags (40%), supertagging accuracy is 98.0%. This subset is considerably easier than general text (our CCGbank-trained supertagger is 97.4% accurate on this data), but tri-training still provides useful additional training data.
In total, we include 43 million words of text that the parsers annotate with the same supertags and 15 copies of the gold CCGbank training data. Our experiments show that tri-training improves both supertagging and parsing accuracy.
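The data selection step can be sketched as follows. The sketch assumes each parser is available as a function from a list of words to a list of supertags, and that agreement is checked on the whole tag sequence; the function and parameter names are hypothetical.

```python
def tri_training_data(sentences, tagger_a, tagger_b, gold, gold_copies=15):
    """Keep sentences on which both parsers assign identical supertags, then add the gold data.

    sentences: iterable of word lists (unlabelled text).
    tagger_a, tagger_b: hypothetical callables mapping a word list to a supertag list.
    gold: list of (words, supertags) pairs from the annotated training set.
    """
    agreed = []
    for words in sentences:
        tags_a, tags_b = tagger_a(words), tagger_b(words)
        if tags_a == tags_b:
            agreed.append((words, tags_a))
    # Up-weight the gold data by repeating it, as in the 15 copies described above.
    return agreed + gold * gold_copies
```

Repeating the gold data keeps the small annotated set from being swamped by the much larger automatically labelled set.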

GPU Parsing
Our parser makes an unusual trade-off, by combining a complex tagging model with a deterministic parsing model. The A* parsing algorithm is extremely efficient, and the overall time required to process a sentence is dominated by the supertagger. GPUs can improve performance over CPUs by computing many vector operations in parallel. There are two major obstacles to using GPUs for parsing. First, most models use sparse rather than dense features, which are difficult to compute efficiently on GPUs. The most successful implementation we are aware of exploits the fact that the Berkeley parser is unlexicalized to run parsing operations in parallel (Hall et al., 2014). Second, most neural models have features that depend on the current parse or stack state (e.g. Chen and Manning (2014)). This makes it difficult to exploit the parallelism of GPUs, because these data structures are typically built incrementally on CPU. It may be possible to write GPU-specific code that maintains the entire parse state on GPU, but we are not aware of any such implementations.
In contrast, our supertagger only uses matrix operations, and does not take any parse state as input, meaning it is straightforward to run on a GPU. To exploit the parallelism of GPUs, we process thousands of sentences simultaneously, improving parsing efficiency by an order of magnitude over CPU. A major advantage of our model is that it allows all of the computationally intensive decisions to occur on GPUs. Unlike existing GPU parsers, the LSTM can be run with generic library code. 2

Experimental setup
We trained our parser on Sections 02-21 of CCGbank (Hockenmaier and Steedman, 2007), using Section 00 for development, and Section 23 for test. Our experiments use a supertagger beam of 10^{−4}, which does not affect the final scores, but reduces overheads such as building the initial agenda.
2 We use TensorFlow (Abadi et al., 2015).
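A supertagger beam is commonly implemented by discarding, for each word, every tag whose probability falls below a fixed fraction of the best tag's probability. The sketch below assumes that this is what the beam of 10^{−4} means here; the text does not spell out the definition, and the function name is hypothetical.

```python
def prune_tags(tag_probs, beam=1e-4):
    """Keep tags whose probability is at least beam times that of the most probable tag."""
    threshold = beam * max(tag_probs.values())
    return {tag: p for tag, p in tag_probs.items() if p >= threshold}

# Hypothetical distribution for one word.
kept = prune_tags({"NP": 0.9, "N": 0.09995, "PP": 0.00005})
```

Pruning of this kind shrinks the initial agenda, because very improbable tags are never inserted.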

Where results are available, we compare our work with the following models: EASYCCG, which has the same parsing model as our parser, but uses a feed-forward neural-network supertagger (NN); the C&C parser (Clark and Curran, 2007); and C&C+RNN (Xu et al., 2015), which is the C&C parser with an RNN supertagger. All results are for 100% coverage of the test data.
We refer to the models described in Section 4 as LSTM and DEPENDENCIES respectively. We also report the performance of LSTM+DEPENDENCIES, which combines the model scores (weighting the LSTM score by 1.8, tuned on development data).

Supertagging Results
The most direct measure of the effectiveness of our LSTM and tri-training is on the supertagging task. Results are shown in Table 1. The improvement of our deep LSTM over the RNN model is greater than the improvement of the RNN over the C&C model. Further gains follow from tri-training, improving the state-of-the-art by 1.7%.

English Parsing Results
Parsing results are shown in Table 2. Surprisingly, our CCGbank-trained LSTM outperforms any previous approach. 3

Out-of-domain Experiments
We also evaluate on two out-of-domain datasets used by Rimell and Clark (2008), but did no development on this data. In both cases, we use Rimell and Clark's scripts for converting CCG parses to the target dependency representations. The datasets are:
QUESTIONS: 500 questions from TREC (Rimell and Clark, 2008). Questions frequently contain very long range dependencies, providing an interesting test of the LSTM supertagger's ability to capture unbounded dependencies. We follow Rimell and Clark by re-training the supertagger on the concatenation of the CCGbank training data and 10 copies of the QUESTIONS training data.
BIOINFER: 500 sentences from biomedical abstracts. This dataset tests the parser's robustness to a large amount of unseen vocabulary.
Results are shown in Table 3. Our LSTM parser outperforms existing work on question parsing, showing that it can successfully model the long-range dependencies found in questions. Adding dependency features yields only a small improvement.
On the BIOINFER corpus, our tri-trained LSTM parser is 4.5 F1 better than the previous state-of-the-art. Dependency features appear to be much [...]
3 [...] (2011b)'s joint parsing and supertagging model, due to differences in the experimental setup. These models are 0.3 and 1.5 F1 more accurate than the C&C baseline respectively, which is well within the margin of improvement obtained by our model.

Efficiency Experiments
In contrast to standard parsing algorithms, the efficiency of our model depends directly on the accuracy of the supertagger in guiding the search. We therefore measure the efficiency empirically. Results are shown in Table 4. Our parser runs more slowly than EASYCCG on CPU, due to the more complex tagging model (but is 4.8 F1 more accurate). Adding dependencies substantially reduces efficiency, due to calculating sparse features. Without dependencies, the run time is dominated by the LSTM supertagger. Running the supertagger on a GPU reduces parsing times dramatically, outperforming SpaCy, the fastest publicly available parser (Choi et al., 2015). Roughly half the parsing time is spent on GPU supertagging, and half on CPU parsing. To better exploit batching in the GPU, our implementation dynamically buckets sentences by length (bins of width 10), and tags batches when the bucket size reaches 3072 (the number of threads on our GPU). We are not aware of any GPU implementations of shift-reduce parsers or lexicalized chart parsers, so it is unclear if most other state-of-the-art parsers can be adapted to exploit GPUs.
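The length-bucketing scheme can be sketched as a small generator. The exact bin boundaries (lengths 0-9, 10-19, and so on) and the flushing of partially filled buckets at the end of the input are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def bucketed_batches(sentences, bin_width=10, batch_size=3072):
    """Group sentences of similar length and emit a batch whenever a bucket fills up."""
    buckets = defaultdict(list)
    for sentence in sentences:
        key = len(sentence) // bin_width      # e.g. lengths 0-9 share bucket 0
        buckets[key].append(sentence)
        if len(buckets[key]) == batch_size:
            yield buckets.pop(key)            # a full bucket becomes one batch for the tagger
    for key in sorted(buckets):
        yield buckets[key]                    # assumed: flush whatever is left at the end

# Small demonstration with a batch size of 2 instead of 3072.
stream = [["w"] * 3] * 5 + [["w"] * 12] * 2
batches = list(bucketed_batches(stream, batch_size=2))
```

Grouping by length limits the padding needed inside a batch, so less of the parallel computation is spent on padding tokens.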

Ablations
We also measure performance while removing different aspects of the full parsing model.

Supertagger Model Architecture
Numerous variations are possible on our supertagging architecture. Apart from tri-training, the major differences from the previous state-of-the-art (Xu et al., 2015) are that we use LSTMs rather than RNNs, and that we use bidirectional networks rather than only a forward-directional RNN. These modifications lead to a 1.3% improvement in accuracy. Table 5 shows performance while ablating these changes; they all contribute substantially to tagging accuracy.
Table 6 shows several classes of words where the LSTM model outperforms the baseline neural network that uses only local context (NN). The performance increase on unseen words is likely due to the fact that the LSTM can model more context to determine the category for a word. Unsurprisingly, this leads to a large improvement in accuracy for words taking non-local arguments. Finally, we see a large improvement in prepositional phrase attachment. This improvement is likely to be due to the deep architecture, which can better take into account the interaction between the preposition, its argument noun phrase, and its nominal or verbal attachment.
Table 6 also shows cases where the semi-supervised models perform better. Accuracy improves on unseen words, showing that tri-training can be a more effective way of generalizing to unseen words than pre-trained word embeddings alone. We also see improvement in accuracy on wh-words, which we attribute to the training data containing more examples of rare categories used for wh-words in pied-piping and similar constructions. One case where performance remains weak for all models is on unseen usages, where words occur in the CCGbank training data, but not with the category required in the test data. The improvement from tri-training is limited, likely due to the weakness of the baseline parses, and new techniques will be required to correct such errors.
Table 7: Effect of simulating weaker grammars, by allowing the specified atomic categories to unify. * allows all atomic categories to unify, except conjunctions and punctuation. Results are on development sentences of length ≤40.

Effect of Grammar
A subtle but crucial point is that our method depends on the strictness of the CCGbank grammar to exclude ungrammatical derivations. Because there is no dependency model, we rely on the deterministic CCG grammar as a hard constraint. There is a trade-off between restrictive grammars which may be brittle on noisy text, and weaker grammars that may overgenerate ungrammatical sentences. We measure this trade-off by testing weaker grammars, which merge categories that are normally distinct. For example, if we merge PP and NP, then an S\NP can take either a PP or NP argument. Table 7 shows that relaxing the grammar significantly hurts performance; the deterministic constraints are crucial to training a high quality LSTM CCG parser. With a very relaxed grammar in which all atoms can unify, dependency features help compensate for the weakened grammar. Future work should explore further strengthening the grammar, e.g. marking plurality on NPs to enforce plural agreement, or using slash-modalities to prevent over-generation arising from composition (Baldridge and Kruijff, 2003).
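The merging of atomic categories can be illustrated with a toy unification check. The sketch handles only backward application with an atomic argument and represents a functor as a (result, slash, argument) tuple; these simplifications and all names are assumptions made for illustration.

```python
def unifies(a, b, merged=frozenset()):
    """Two atomic categories match if they are equal or both belong to the merged set."""
    return a == b or (a in merged and b in merged)

def backward_apply(argument, functor, merged=frozenset()):
    """Y  X\\Y => X, where functor is a (result, slash, argument) tuple with an atomic argument."""
    result, slash, wanted = functor
    if slash == "\\" and unifies(wanted, argument, merged):
        return result
    return None

verb_phrase = ("S", "\\", "NP")                                      # the category S\NP
strict = backward_apply("PP", verb_phrase)                           # rejected by the strict grammar
relaxed = backward_apply("PP", verb_phrase, merged={"NP", "PP"})     # accepted once PP and NP unify
```

Every derivation licensed by the strict grammar remains available under the relaxed one, so relaxing the grammar can only enlarge the search space the supertagger has to disambiguate.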

Effect of Dependency Features
Perhaps our most surprising result is that high accuracy can be achieved with a rule-based grammar and no dependency features. We performed several experiments to verify whether the model can capture long-range dependencies, and the extent to which dependency features are required to further improve parsing performance.
Supertagging accuracy is still the bottleneck. A natural question is whether further improvements to our model will require a more powerful parsing model (such as adding dependency or derivation features), or if future work should focus on the supertagger. We found that on sentences where all the supertags are correct in the final parse (51%), the F1 is very high: 97.7. On parses containing supertag errors, the F1 drops to just 80.3. This result suggests that parsing accuracy can be significantly increased by improving the supertagger, and that very high performance could be attained only using a supertagging model.
'Attach low' heuristic is surprisingly effective. Given a sequence of supertags, our grammar is still ambiguous. As explained in Section 2, we resolve these ambiguities by attaching low. To investigate the accuracy of this heuristic, we performed oracle decoding given the highest scoring supertags, and found that F1 improved by 1.3, showing that there are limits to what can be achieved with a rule-based grammar. In contrast, an 'attach high' heuristic scores 5.2 F1 less than attaching low, suggesting that these decisions are reasonably frequent, but that attaching low is much more common.
Would adding a dependency model help here? We consider several dependencies whose attachment is often ambiguous given the supertags. Results are shown in Table 8.
Supertag-factored model is accurate on long-range dependencies. One motivation for CCG parsing is to recover long-range dependencies. While we do not explicitly model these dependencies, they can still be extracted from the parse. Instead, we rely on the LSTM supertagger to implicitly model the dependencies, a task that becomes more challenging with longer dependencies. We investigate the accuracy of our parser for dependencies of different lengths. Figure 3 shows that adding dependency features does not improve the recovery of long-range dependencies over the LSTM alone; the LSTM accurately models long-range dependencies.

Related Work
Recent work has applied neural networks to parsing, mostly using neural classifiers in shift-reduce parsers (Henderson et al., 2013; Chen and Manning, 2014; Weiss et al., 2015). Unlike our approach, none of these report both state-of-the-art speed and accuracy. Vinyals et al. (2015) instead propose embedding entire sentences in a vector space, and then generating parse trees as strings. Our model achieves state-of-the-art accuracy with a non-ensemble model trained on the standard training data, whereas their model requires ensembles or extra supervision to match the state of the art.
Most work on CCG parsing has either used CKY chart parsing (Hockenmaier, 2003; Clark and Curran, 2007; Fowler and Penn, 2010; Auli and Lopez, 2011a) or shift-reduce algorithms (Zhang and Clark, 2011; Xu et al., 2014; Ambati et al., 2015). These methods rely on beam-search to cope with the huge space of possible CCG parses. Instead, we use Lewis and Steedman (2014a)'s A* algorithm. By using a semi-supervised LSTM supertagger, we improved over Lewis and Steedman's parser by 4.8 F1.
CCG supertagging was first attempted with maximum-entropy Markov models (Clark, 2002); in practice, the combination of sparse features and a large tag set makes such models brittle. Lewis and Steedman (2014b) applied feed-forward neural networks to supertagging, motivated by using pre-trained word embeddings to reduce sparsity. Xu et al. (2015) showed further improvements by using RNNs to condition on non-local context. Concurrently with this work, Xu et al. (2016) explored bidirectional RNN models, and Vaswani et al. (2016) use bidirectional LSTMs with a different training procedure.
Our tagging model is closely related to the bidirectional LSTM POS tagging model of . We see larger gains over the state-of-the-art, likely because supertagging involves more long-range dependencies than POS tagging.
Other work has successfully applied GPUs to parsing, but has required GPU-specific code and algorithms (Yi et al., 2011; Johnson, 2011; Canny et al., 2013; Hall et al., 2014). GPUs have also been used for machine translation.

Conclusions and Future Work
We have shown that a combination of deep learning, linguistics and classic AI search can be used to build a parser with both state-of-the-art speed and accuracy. Future work will explore using our parser to recover other representations from CCG, such as Universal Dependencies (McDonald et al., 2013) or semantic roles. The major obstacle is the mismatch between these representations and CCGbank; we will therefore investigate new techniques for obtaining other representations from CCG parses. We will also explore new A* parsing algorithms that explicitly model the global parse structure using neural networks, while maintaining optimality guarantees.