CCG Supertagging with a Recurrent Neural Network

Recent work on supertagging using a feed-forward neural network achieved significant improvements for CCG supertagging and parsing (Lewis and Steedman, 2014). However, their architecture is limited to considering local contexts and does not naturally model sequences of arbitrary length. In this paper, we show how directly capturing sequence information using a recurrent neural network leads to further accuracy improvements for both supertagging (up to 1.9%) and parsing (up to 1% F1), on CCGBank, Wikipedia and biomedical text.


Introduction
Combinatory Categorial Grammar (CCG; Steedman, 2000) is a highly lexicalized formalism; the standard C&C parsing model uses over 400 lexical categories (or supertags), compared to about 50 POS tags for typical CFG parsers. This makes accurate disambiguation of lexical types much more challenging. However, the assignment of lexical categories can still be solved reasonably well by treating it as a sequence tagging problem, often referred to as supertagging (Bangalore and Joshi, 1999). Clark and Curran (2004) show that high tagging accuracy can be achieved by leaving some ambiguity to the parser to resolve, but with enough of a reduction in the number of tags assigned to each word that parsing efficiency is greatly increased.
In addition to improving parsing efficiency, supertagging also has a large impact on parsing accuracy (Curran et al., 2006; Kummerfeld et al., 2010), since the derivation space of the parser is determined by the supertagger, at both training* and test time. Subsequent work enhanced supertagging with a so-called adaptive strategy, whereby additional categories are supplied to the parser only if a spanning analysis cannot be found. This strategy is used in the de facto standard C&C parser, and the two-stage CCG parsing pipeline (supertagging followed by parsing) remains the choice for most recent CCG parsers (Zhang and Clark, 2011; Auli and Lopez, 2011; Xu et al., 2014).

*All work was completed before the author joined Facebook.
Despite the effectiveness of supertagging, the most widely used model for this task, the MaxEnt C&C supertagger, has a number of drawbacks. First, it relies too heavily on POS tags, which leads to lower accuracy on out-of-domain data (Rimell and Clark, 2008). Second, due to its sparse indicator feature sets, based mainly on raw words and POS tags, it shows pronounced performance degradation in the presence of rare and unseen words (Rimell and Clark, 2008; Lewis and Steedman, 2014). Third, in order to reduce computational requirements and feature sparsity, each tagging decision is made without considering any potentially useful contextual information beyond a local context window. Lewis and Steedman (2014) introduced a feed-forward neural network for supertagging, addressing the first two problems. However, their attempt to tackle the third problem by pairing a conditional random field with their feed-forward tagger provided little accuracy improvement and vastly increased computational complexity, incurring a large efficiency penalty.
We introduce a recurrent neural network (RNN) supertagging model to tackle all of the above problems, with an emphasis on the third one. RNNs are powerful models for sequential data, which can potentially capture long-term dependencies based on an unbounded history of previous words (§2); like Lewis and Steedman (2014), we use only distributed word representations (§2.2). Our model is highly accurate, and by integrating it with the C&C parser as its adaptive supertagger, we obtain substantial accuracy improvements, outperforming the feed-forward setup on both supertagging and parsing.
Supertagging with an RNN

Model
We use an Elman recurrent neural network (Elman, 1990), which consists of an input layer x_t, a hidden state (layer) h_t with a recurrent connection to the previous hidden state h_{t−1}, and an output layer y_t. The input layer is a vector representing the surrounding context of the current word at position t, whose supertag is being predicted. The hidden state h_{t−1} keeps a representation of all context history up to the current word. The current hidden state h_t is computed from the current input x_t and the hidden state h_{t−1} from the previous position. The output layer represents probability scores over all possible supertags, with the size of the output layer equal to the size of the lexical category set.
The parameterization of the network consists of three matrices which are learned during supervised training. Matrix U contains weights between the input and hidden layers, V contains weights between the hidden and output layers, and W contains weights between the previous hidden state and the current hidden state. The following recurrence is used to compute the activations of the hidden state at word position t:

h_t = f(x_t U + h_{t−1} W),

where f is a non-linear activation function; here we use the sigmoid function f(z) = 1/(1 + e^{−z}). The output activations are calculated as:

y_t = g(h_t V),

where g is the softmax activation function g(z_i) = e^{z_i} / Σ_j e^{z_j}, which squeezes the raw output activations into a probability distribution.
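The recurrence above can be sketched as a forward pass in NumPy. This is a minimal illustration, not our implementation: the dimensions below are toy values (the actual input is k(n+2m)-dimensional, with a hidden state of size 200 and 425 output categories), and the weights are randomly initialized rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy sizes for illustration only
input_dim, hidden_dim, output_dim = 6, 4, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(input_dim, hidden_dim))   # input -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
V = rng.normal(size=(hidden_dim, output_dim))  # hidden -> output

def forward(xs):
    """Run the Elman recurrence over a sequence of input vectors."""
    h = np.zeros(hidden_dim)
    ys = []
    for x in xs:
        h = sigmoid(x @ U + h @ W)   # h_t = f(x_t U + h_{t-1} W)
        ys.append(softmax(h @ V))    # y_t = g(h_t V)
    return ys

ys = forward([rng.normal(size=input_dim) for _ in range(5)])
```

Each output vector is a probability distribution over the category set, and the hidden state carries information forward across arbitrarily many positions.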

Word Embeddings
Our RNN supertagger uses only continuous vector representations for features, and each feature type has an associated look-up table, which maps a feature to its distributed representation. In total, three feature types are used. The first type is word embeddings: given a sentence of N words, (w_1, w_2, . . . , w_N), the embedding feature of w_t (for 1 ≤ t ≤ N) is obtained by projecting it onto an n-dimensional vector space through the look-up table L_w ∈ R^{|w|×n}, where |w| is the size of the vocabulary. Algebraically, the projection is a simple vector-matrix product in which a one-hot vector b_j ∈ R^{1×|w|} (with zeros everywhere except at the jth position) is multiplied with L_w:

e_{w_t} = b_j L_w,

where j is the look-up index for w_t.
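The one-hot projection reduces to selecting a row of the look-up table, which is how embedding look-up is implemented in practice. A small sketch, with toy sizes (not the 50-dimensional embeddings used in the paper):

```python
import numpy as np

vocab_size, n = 8, 5  # toy sizes for illustration
rng = np.random.default_rng(1)
L_w = rng.normal(size=(vocab_size, n))  # word embedding look-up table

j = 3  # look-up index of word w_t in the vocabulary
b = np.zeros(vocab_size)
b[j] = 1.0  # one-hot vector b_j

e = b @ L_w  # projection e_{w_t} = b_j L_w

# Equivalent to simple row indexing:
assert np.allclose(e, L_w[j])
```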
In addition, as in Lewis and Steedman (2014), for every word we also include its 2-character suffix and capitalization as features. Two more look-up tables are used for these features: L_s ∈ R^{|s|×m} is the look-up table for suffix embeddings, where |s| is the suffix vocabulary size, and L_c ∈ R^{2×m} is the look-up table for capitalization embeddings. L_c contains only two embeddings, representing whether or not a given word is capitalized.
We extract features from a context window surrounding the current word to make a tagging decision. Concretely, with a context window of size k, k/2 words on either side of the target word are included. For a word w_t, its continuous feature representation is:

f_{w_t} = [e_{w_t}; s_{w_t}; c_{w_t}],

where e_{w_t} ∈ R^{1×n}, s_{w_t} ∈ R^{1×m} and c_{w_t} ∈ R^{1×m} are the output vectors from the three different look-up tables, [·; ·; ·] denotes vector concatenation, and hence f_{w_t} ∈ R^{1×(n+2m)}. At word position t, the input layer of the network x_t is:

x_t = [f_{w_{t−k/2}}; . . . ; f_{w_t}; . . . ; f_{w_{t+k/2}}],

where x_t ∈ R^{1×k(n+2m)} and the right-hand side is the concatenation of all feature representations in a size-k context window. We use pre-trained word embeddings from Turian et al. (2010) to initialize the look-up table L_w, and we apply a set of word pre-processing techniques at both training and test time to reduce sparsity. All words are first lower-cased, and all numbers are collapsed into a single digit '0'. If a lower-cased hyphenated word does not have an entry in the pre-trained word embeddings, we attempt to back off to the substring after the last hyphen. For compound words and numbers delimited by "\/", we attempt to back off to the substring after the delimiter. After pre-processing, the Turian embeddings have a coverage of 94.25% on the training data; for out-of-vocabulary words, three separate randomly initialized embeddings are used for lower-case alphanumeric words, upper-case alphanumeric words, and non-alphanumeric symbols.
For padding at the start and end of a sentence, the "unknown" entry from the pre-trained embeddings is used. Look-up tables L s and L c are also randomly initialized, and all look-up tables are modified during supervised training using backpropagation.
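The construction of x_t from the three look-up tables can be sketched as follows. The vocabularies, sizes, and the use of index 0 for padding in all three tables are simplifying assumptions for illustration; the paper pads with the "unknown" entry of the pre-trained word embeddings.

```python
import numpy as np

n, m, k = 4, 2, 3  # toy embedding sizes and window size (paper: n=50, m=5, k=7)
rng = np.random.default_rng(2)

# Hypothetical toy vocabularies for words, 2-character suffixes, capitalization
words = ["<unk>", "the", "cat", "sat"]
L_w = rng.normal(size=(len(words), n))
suffixes = ["<unk>", "he", "at"]
L_s = rng.normal(size=(len(suffixes), m))
L_c = rng.normal(size=(2, m))  # row 0: not capitalized, row 1: capitalized

def feat(word_id, suffix_id, cap):
    """f_{w_t} = [e; s; c], an (n + 2m)-dimensional vector."""
    return np.concatenate([L_w[word_id], L_s[suffix_id], L_c[cap]])

def input_vector(feats, t):
    """x_t: concatenation of k feature vectors centred on position t,
    padding sentence boundaries (simplified to index 0 in every table)."""
    pad = feat(0, 0, 0)
    window = []
    for i in range(t - k // 2, t + k // 2 + 1):
        window.append(feats[i] if 0 <= i < len(feats) else pad)
    return np.concatenate(window)

sent = [feat(1, 1, 0), feat(2, 2, 0), feat(3, 2, 0)]  # "the cat sat"
x0 = input_vector(sent, 0)  # window extends past the sentence start
```

The resulting vector has dimension k(n + 2m), matching the input layer size of the network.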

Experiments
Datasets and Baseline. We follow the standard splits of CCGBank (Hockenmaier and Steedman, 2007) for all experiments, using sections 02-21 for training, section 00 for development, and section 23 as the in-domain test set. The Wikipedia corpus from Honnibal et al. (2009) and the Bioinfer corpus (Pyysalo et al., 2007) are used as two out-of-domain test sets. We compare supertagging accuracy with the MaxEnt C&C supertagger and the neural network tagger of Lewis and Steedman (2014) (henceforth NN), and we also evaluate parsing accuracy using these three supertaggers as a front-end to the C&C parser. We use the same set of 425 supertags used by both C&C and NN.
Hyperparameters and Training. For L_w, we use the scaled 50-dimensional Turian embeddings (n = 50) as initialization. During development we experimented with 100-dimensional embeddings and found no improvement in the resulting model. Out-of-vocabulary embedding values in L_w and all embedding values in L_s and L_c are initialized from a uniform distribution over the interval [−2.0, 2.0]. The embedding dimension m of L_s and L_c is set to 5. The other parameters of the network, {U, V, W}, are initialized with values drawn uniformly from the interval [−2.0, 2.0] and then scaled by their corresponding input vector size. We experimented with context window sizes of 3, 5, 7, 9 and 11 during development and found that a window size of 7 gives the best-performing model on the dev set. We use a fixed learning rate of 0.0025 and a hidden state size of 200. To train the model, we optimize cross-entropy loss with stochastic gradient descent using minibatched backpropagation through time (BPTT; Rumelhart et al., 1988; Mikolov, 2012); the minibatch size for BPTT, again tuned on the dev set, is set to 9.
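The training objective can be sketched as follows. This is an illustrative single-sequence (non-minibatched) BPTT step with toy dimensions and a larger learning rate than the paper's 0.0025, to show the forward pass, the cross-entropy loss, and the backward pass through time; it is not our actual training code.

```python
import numpy as np

rng = np.random.default_rng(3)
D_in, D_h, D_out = 5, 8, 4  # toy sizes (paper: hidden state 200, 425 tags)
U = rng.normal(scale=0.5, size=(D_in, D_h))
W = rng.normal(scale=0.5, size=(D_h, D_h))
V = rng.normal(scale=0.5, size=(D_h, D_out))

def step(xs, ts, lr=0.1):
    """One SGD step over a sequence: forward pass, cross-entropy loss,
    and backpropagation through time."""
    hs, ys, h = [np.zeros(D_h)], [], np.zeros(D_h)
    for x in xs:  # forward pass
        h = 1.0 / (1.0 + np.exp(-(x @ U + h @ W)))
        z = h @ V
        e = np.exp(z - z.max())
        hs.append(h)
        ys.append(e / e.sum())
    loss = -sum(np.log(y[t]) for y, t in zip(ys, ts)) / len(xs)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(D_h)
    for t in reversed(range(len(xs))):  # backward through time
        dz = ys[t].copy()
        dz[ts[t]] -= 1.0                       # softmax + cross-entropy gradient
        dV += np.outer(hs[t + 1], dz)
        dh = dz @ V.T + dh_next
        da = dh * hs[t + 1] * (1 - hs[t + 1])  # through the sigmoid
        dU += np.outer(xs[t], da)
        dW += np.outer(hs[t], da)              # hs[t] is the previous hidden state
        dh_next = da @ W.T
    for P, dP in ((U, dU), (W, dW), (V, dV)):
        P -= lr * dP / len(xs)
    return loss

xs = [rng.normal(size=D_in) for _ in range(6)]
ts = [0, 1, 2, 3, 0, 1]  # arbitrary toy tag targets
losses = [step(xs, ts) for _ in range(50)]
```

Repeating the step on this toy sequence drives the loss down, confirming that the gradients flow back through the recurrent connection.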
Embedding Dropout Regularization. Without any regularization, we found that cross-entropy error on the dev set started to increase while the error on the training set was continuously driven to a very small value (Fig. 1a). Suspecting overfitting, we experimented with l1 and l2 regularization and learning rate decay, but none of these techniques gave any noticeable improvement for our model. Following Legrand and Collobert (2014), we instead implemented word embedding dropout as a regularizer for all the look-up tables, since the capacity of our tagging model mainly comes from the look-up tables, as in their system. We observed more stable learning and better generalization of the trained model with dropout. As with other forms of dropout (Srivastava et al., 2014), we randomly drop units and their connections to other units at training time. Concretely, we apply a binary dropout mask to x_t with a dropout rate of 0.25; at test time no mask is applied, but the input to the network, x_t, at each word position is scaled by 0.75. We experimented during development with different dropout rates, but found the above choice to be optimal in our setting.

Supertagging Results
We use the RNN model which gives the highest 1-best supertagging accuracy on the dev set as the final model for all experiments. Without any form of regularization, the best model was obtained at the 20th epoch, while it took 35 epochs for the dropout model to peak (Fig. 1b). We use the dropout model for all experiments and, unlike the C&C supertagger, no tag dictionaries are used. Table 1 shows 1-best supertagging accuracies on the dev set. The accuracy of the C&C supertagger drops by about 1% with automatically assigned POS tags, while our RNN model gives higher accuracy (+0.47%) than the C&C supertagger with gold POS tags. All timing values are obtained on a single Intel i7-4790k core. All implementations are in C++ except NN, which is implemented in Torch and Java; we therefore believe the efficiency of NN could be vastly improved with an implementation in a lower-level language. Table 2 compares the multi-tagging accuracy of the different supertagging models at the default β levels used by the C&C parser on the dev set. The β parameter determines the average number of supertags assigned to each word (the ambiguity level) by a supertagger when integrated with the parser; categories whose probabilities are not within β times the probability of the 1-best category are pruned. At the first β level (0.075), the three supertagging models give very close ambiguity levels, but our RNN model clearly outperforms NN and C&C (auto POS) in both word-level (WORD) and sentence-level (SENT) accuracy, giving similar word-level accuracy to C&C (gold POS). For the other β levels (except β = 0.001), the RNN model gives ambiguity levels comparable to the C&C model, which uses a tag dictionary, while being much more accurate than both of the other two models. Fig. 1c compares the multi-tagging accuracies of all models on the dev set. For all models, the same β levels are used (ranging from 0.075 to 10^−4; all C&C default values are included).
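The β pruning rule described above can be stated compactly. A minimal sketch, with a hypothetical probability distribution over three categories for a single word:

```python
def assign_supertags(probs, beta):
    """Adaptive multi-tagging: keep every category whose probability is
    within beta times the probability of the 1-best category."""
    p_max = max(probs.values())
    return {tag: p for tag, p in probs.items() if p >= beta * p_max}

# Hypothetical distribution over three categories for one word
probs = {"NP": 0.70, "N": 0.24, "S/NP": 0.06}

wide = assign_supertags(probs, 0.075)  # loose beta: keeps all three
tight = assign_supertags(probs, 0.5)   # tight beta: keeps only the 1-best
```

Lowering β increases the ambiguity level (more categories per word) and shifts more disambiguation work to the parser.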
The RNN model consistently outperforms the other models across different ambiguity levels. Table 3 shows the 1-best accuracies of all models on the test data sets (the Bio-GENIA gold-standard CCG lexical category data from Rimell and Clark (2008) are used, since no gold categories are available in the Bioinfer data). With gold-standard POS tags, the C&C model outperforms both the NN and RNN models on CCGBank and Bio-GENIA; with automatically assigned POS tags, the accuracy of the C&C model drops significantly, due to its heavy reliance on POS tags. Fig. 2 shows multi-tagging accuracies on all test data (using β levels ranging from 0.075 to 10^−6; all C&C default values are included). On CCGBank, the RNN model has a clear accuracy advantage, while on the other two data sets the accuracies of the NN model are closer to those of the RNN model at some ambiguity levels, suggesting that these data sets remain more challenging than CCGBank. However, both the NN and RNN models are more robust than the C&C model on the two out-of-domain data sets.

Parsing Results
We integrate our supertagging model into the C&C parser, at both training and test time, using all default parser settings; the C&C hybrid model is used for CCGBank and Wikipedia, and the normal-form model is used for the Bioinfer data, in line with Lewis and Steedman (2014) and Rimell and Clark (2008). Parsing development results are shown in Table 4; for the out-of-domain data sets, no separate development experiments were done. Final results are shown in Table 5: we substantially improve parsing accuracy on CCGBank and Wikipedia. The accuracy of our model on CCGBank represents an F1 improvement of 1.53%/1.85% over the C&C baseline, which is comparable to the best known accuracy reported by Auli and Lopez (2011). However, our RNN-supertagging-based model is conceptually much simpler, requiring no change to the parsing model at all.

Conclusion
We presented an RNN-based model for CCG supertagging, which brings significant accuracy improvements for supertagging and parsing on both in- and out-of-domain data sets. Our supertagger is fast and well suited for large-scale processing.