Supertagging With LSTMs

In this paper we present new state-of-the-art performance on CCG supertagging and parsing. Our model outperforms existing approaches by an absolute gain of 1.5%. We analyze the performance of several neural models and demonstrate that while feed-forward architectures can compete with bidirectional LSTMs on POS tagging, models that encode the complete sentence are necessary for the long range syntactic information encoded in supertags.


Introduction
Morphosyntactic labels for words are commonly used in a variety of NLP applications. For this reason, part-of-speech (POS) tagging and supertagging have drawn significant attention from the community. Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism that is widely used for syntactic and semantic parsing. Supertagging (Clark, 2002; Bangalore and Joshi, 2010) assigns complex syntactic labels to words to enable fast and accurate parsing. Correctly labeling a word with one of over 1,200 CCG labels is difficult compared to choosing one of the 45 POS labels in the Penn Treebank (Marcus et al., 1993). In addition to the large label space of CCG supertags, labeling a word correctly depends on knowledge of syntactic phenomena arbitrarily far away in the sentence (Hockenmaier and Steedman, 2007). This is because supertags encode highly specific syntactic information (e.g. types and locations of arguments) about a word's usage in a sentence.
In this paper, we show that bidirectional Long Short-Term Memory recurrent neural networks (bi-LSTMs) (Graves, 2013; Zaremba et al., 2014), which can use information from the entire sentence, are a natural and powerful architecture for CCG supertagging. In addition to the bi-LSTM, we create a simple yet novel model that outperforms the previous state-of-the-art RNN model, which uses handcrafted features, by 1.5%. Concurrently with this work, Lewis et al. (2016) introduced a different training methodology for bi-LSTMs for supertagging. We provide a detailed analysis of the quality of various LSTM architectures (forward, backward, and bidirectional), shedding light on the ability of the bi-LSTM to exploit the rich sentential context necessary for supertagging. We also show that a baseline feed-forward neural network (NN) architecture significantly outperforms previous feed-forward NN baselines, with slightly fewer features, achieving better accuracy than a previously published RNN model.
Recently, bi-LSTMs have achieved high accuracies in a simpler sequence labeling task, part-of-speech tagging on the Penn Treebank (Ling et al., 2015), with small improvements over local models. However, we achieve competitive accuracies using a feed-forward neural network model trained on local context, showing that this task does not require bi-LSTMs. Our strong feed-forward NN baselines show the power of feed-forward NNs for some tasks.
Our main contributions are twofold. First, we introduce a new bi-LSTM model for CCG supertagging that achieves state-of-the-art results on both CCG supertagging and parsing. Second, we provide a detailed analysis of our results, including a comparison of bi-LSTMs and simpler feed-forward NN models for supertagging and POS tagging. This comparison suggests that the added complexity of bi-LSTMs may not be necessary for POS tagging, where local contexts suffice to a much greater extent than in supertagging.

Models And Training
We use feed-forward neural network models and bidirectional LSTM (bi-LSTM) based models in this work.

Feed-Forward
For both POS tagging and our baseline supertagging model, we use feed-forward neural networks with two hidden layers of rectified linear units (Nair and Hinton, 2010). For supertagging, we use a slightly smaller feature set than Lewis and Steedman (2014a), using a left and right 3-word window with suffix and capitalization features for the center word. However, unlike them, we train on the full set of supertag categories observed during training.
In POS tagging, when tagging word $w_i$, we consider only features from a window of five words, with $w_i$ at the center. For each $w_j$ with $i - 2 \le j \le i + 2$, we add $w_j$ lowercased and a string that encodes the basic "word shape" of $w_j$. This is computed by replacing all sequences of uppercase letters with A, all sequences of lowercase letters with a, all sequences of digits with 9, and all sequences of other characters with *. Finally, we add two- and three-letter suffixes and a two-letter prefix for $w_i$ only.
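The word-shape and window features described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the character classes are taken to be ASCII, the padding token `<pad>` and the feature-string format are hypothetical, and prefixes and suffixes are left in their original case because the text does not say otherwise.

```python
import re

# Each alternative matches one maximal run of a single character class.
_SHAPE_RUNS = re.compile(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9]+")


def word_shape(word):
    """Collapse runs of uppercase, lowercase, digit and other characters
    to 'A', 'a', '9' and '*' respectively (ASCII classes assumed)."""
    def symbol(match):
        first = match.group(0)[0]
        if "A" <= first <= "Z":
            return "A"
        if "a" <= first <= "z":
            return "a"
        if "0" <= first <= "9":
            return "9"
        return "*"
    return _SHAPE_RUNS.sub(symbol, word)


PAD = "<pad>"  # hypothetical token for window positions outside the sentence


def pos_features(words, i):
    """Features for tagging words[i] from a five-word window centred on i."""
    feats = []
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < len(words):
            feats.append(f"word[{offset}]={words[j].lower()}")
            feats.append(f"shape[{offset}]={word_shape(words[j])}")
        else:
            feats.append(f"word[{offset}]={PAD}")
            feats.append(f"shape[{offset}]={PAD}")
    center = words[i]
    # Affix features are extracted for the centre word only.
    feats.append(f"suffix2={center[-2:]}")
    feats.append(f"suffix3={center[-3:]}")
    feats.append(f"prefix2={center[:2]}")
    return feats
```

For example, `word_shape("McDonald's")` collapses to `AaAa*a`, so words with the same capitalization and punctuation pattern share a shape feature even when the lowercased word itself is rare.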

LSTM models
We experiment with two kinds of bi-LSTM models. We train a basic bi-LSTM where the forward and backward LSTMs take input words $w_i$ and produce hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$. For each position, we produce $\tilde{h}_i$, where

$$\tilde{h}_i = \sigma\left(W_{\overrightarrow{h}}\,\overrightarrow{h}_i + W_{\overleftarrow{h}}\,\overleftarrow{h}_i\right) \qquad (1)$$

where $\sigma(x) = \max(0, x)$ is a rectifier nonlinearity, and where $W_{\overleftarrow{h}}$ and $W_{\overrightarrow{h}}$ are parameters to be learned. The unnormalized likelihood of an output supertag is computed using supertag embeddings. The final softmax layer computes normalized supertag probabilities.
Although bidirectional LSTMs can capture long-distance interactions between words, each output label is predicted independently. To explicitly model supertag interactions, our next model combines two models, the bi-LSTM and an LSTM language model (LM) over the supertags (Figure 1). At position i, the LM accepts an input supertag $t_{i-1}$, producing hidden state $h^{LM}_i$, and a second combiner layer, parametrized by matrices $W_{LM}$ and $W_{\tilde{h}}$, transforms $\tilde{h}_i$ and $h^{LM}_i$ to $h_i$, similar to the combiner for $\tilde{h}_i$ (Equation 1). Output supertag probabilities are computed just as before, replacing $\tilde{h}_i$ with $h_i$. We refer to this model as bi-LSTM-LM. For all our LSTM models, we only use words as input features.
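As a rough illustration of the two combiner layers, the sketch below applies a rectified sum of two linear maps twice: once to merge the forward and backward states, and once to merge the result with the LM state. Everything numeric here is an assumption made for the example: the toy dimensions, the random parameters, the absence of bias terms, and in particular the scoring step, which takes the dot product of the combined state with a hypothetical per-supertag embedding matrix `D` before the softmax.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def combine(W_a, a, W_b, b):
    """Rectified sum of two linear maps: max(0, W_a a + W_b b)."""
    summed = [p + q for p, q in zip(matvec(W_a, a), matvec(W_b, b))]
    return [max(0.0, s) for s in summed]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, T = 4, 3  # toy hidden size and supertag inventory size (assumptions)

W_fwd, W_bwd = rand_matrix(H, H, rng), rand_matrix(H, H, rng)
W_tilde, W_lm = rand_matrix(H, H, rng), rand_matrix(H, H, rng)
D = rand_matrix(T, H, rng)  # hypothetical supertag embeddings, one row per tag

# Stand-ins for the LSTM outputs at one position.
h_fwd = [rng.uniform(-1, 1) for _ in range(H)]
h_bwd = [rng.uniform(-1, 1) for _ in range(H)]
h_lm = [rng.uniform(-1, 1) for _ in range(H)]

h_tilde = combine(W_fwd, h_fwd, W_bwd, h_bwd)  # first combiner (Equation 1)
h = combine(W_tilde, h_tilde, W_lm, h_lm)      # second combiner, adds the LM state
probs = softmax(matvec(D, h))                  # normalized supertag probabilities
```

Dropping the second `combine` call and scoring `h_tilde` directly corresponds to the plain bi-LSTM, where each label is predicted without access to previous supertags.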

Training
We train our models to maximize the log-likelihood of the data with minibatch gradient ascent. Gradients of the models are computed with backpropagation (Chauvin and Rumelhart, 1995). Since gold supertags are available during training time and not while decoding, a bi-LSTM-LM trained on gold supertags might not recover from errors caused by using incorrectly predicted supertags. This results in the bi-LSTM-LM slightly underperforming the bi-LSTM (we refer to training with gold supertags as g-train in Table 1). To bridge this gap between training and testing, we also experiment with a scheduled sampling training regime in addition to training with gold supertags.
Scheduled sampling: Following Bengio et al. (2015) and Ranzato et al. (2015), for each output token, with some probability p, we use the most likely predicted supertag ($\arg\max_{t_i} P(t_i \mid h_i)$) from the model in position i − 1 as input to the supertag LSTM LM in position i, and use the gold supertag with probability 1 − p. We denote this training as ss-train-1. We also experiment with using the 5-best previous predicted supertags from the output distribution at position i − 1 and feed them to the LM as input in position i as a bit vector. Additionally, we use their probabilities (re-normalized over the 5-best tags) and scale the input supertag embeddings with their re-normalized probability during look-up. We refer to this setting as ss-train-5. In this work, we use an inverse sigmoid schedule to compute p from the epoch number s and a hyperparameter k that is tuned.¹ In Figure 2, we see that for the development set, training with scheduled sampling improves the perplexity of the gold supertag sequence when using predicted supertags, indicating better recovery from conditioning on erroneous supertags. For both ss-train and g-train, we use gold supertags for the output layer and train the model to maximize the log-likelihood of the data.²

Figure 2 (curves for g-train, ss-train-1, and ss-train-5): Scheduled sampling improves the perplexity of the gold sequence under predicted tags. We see that the perplexity of the gold supertag sequence when using predicted tags for the LM is lower for ss-train-1 and ss-train-5 than with g-train.
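The sampling decisions in this training regime can be sketched as follows. The decay formula is the inverse sigmoid schedule given by Bengio et al. (2015), which defines the probability of feeding the gold token; treating p as one minus that value is an assumption made here, as are all function names. The 5-best input is modelled as a probability-weighted sum of tag embeddings, which is one reading of "bit vector plus scaled embeddings" and may differ from the actual implementation.

```python
import math
import random


def gold_probability(epoch, k):
    """Inverse sigmoid decay of Bengio et al. (2015): k / (k + exp(epoch / k)).
    Decreases with the epoch number; k >= 1 controls how fast."""
    return k / (k + math.exp(epoch / k))


def choose_lm_input(gold_prev, predicted_prev, p, rng):
    """ss-train-1 style choice: feed the predicted previous supertag with
    probability p, otherwise the gold one."""
    return predicted_prev if rng.random() < p else gold_prev


def top_k_mixture(probs, embeddings, k=5):
    """ss-train-5 style input: sum the embeddings of the k most probable
    tags, each scaled by its probability re-normalized over those k tags."""
    best = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
    total = sum(probs[t] for t in best)
    mixed = [0.0] * len(embeddings[0])
    for t in best:
        weight = probs[t] / total
        for d, value in enumerate(embeddings[t]):
            mixed[d] += weight * value
    return mixed


# Assumed mapping from the schedule to p: sample predictions more often
# as training progresses.
rng = random.Random(0)
schedule = [1.0 - gold_probability(s, k=10.0) for s in range(25)]
lm_input = choose_lm_input("N", "NP", schedule[-1], rng)
```

Early in training the LM mostly sees gold supertags, and later it increasingly sees the model's own predictions, which is the mismatch between training and decoding that the regime is meant to reduce.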

Architectures
Our feed-forward models use 2048 rectifier units in the first hidden layer, 50 (POS tagging) or 128 (supertagging) rectifier units in the second hidden layer, and 64-dimensional input embeddings.
Our LSTM-based models use 512 hidden states. We pre-train our word embeddings with a 7-gram feed-forward neural language model using the NPLM toolkit³ on a concatenation of the BLLIP corpus (Charniak et al., 2000) and WSJ sections 02-21 of the Penn Treebank.

¹ The reader should refer to Bengio et al. (2015) for details.
² We use dropout for all our feed-forward (Srivastava, 2013) and bi-LSTM based models (Zaremba et al., 2014). We carry out a grid search over dropout probabilities and sampling schedules. We train the LSTMs for 25 epochs and the feed-forward models for 30 epochs, tuning on the development data.
³ http://nlg.isi.edu/software/nplm/

Table 1: Accuracies on the development section. The language model provides a boost in performance, and large gains on the parseability of the sequence (%P). The numbers for bi-LSTM-LM + ss-train-1 and + g-train are with beam decoding. All others use greedy decoding. Interestingly, greedy decoding with ss-train-5 works as well as beam decoding with ss-train-1.

Decoding
By default, we perform greedy decoding: for each position i, we select the most probable supertag from the output distribution. For the bi-LSTM-LM models trained with g-train and ss-train-1, we feed the most likely supertag from the output distribution as LM input in the next position. For these models we also decode with beam search (beam size 12). For the bi-LSTM-LMs trained with ss-train-5, we perform greedy decoding in the same way as in training, feeding the k-best supertags from the output supertag distribution in position i − 1 as input to the LM in position i, along with their renormalized probabilities. We do not perform beam decoding for ss-train-5, as the previous k-best inputs already capture different paths through the network.
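To make the difference between greedy and beam decoding concrete, here is a minimal fixed-width beam search. It is a simplified sketch rather than the decoder used in the experiments: `log_prob` is a hypothetical scoring function that conditions only on the previous tag, whereas an LSTM LM carries a hidden state that depends on the whole tag history, and the probability table is made up for the example.

```python
import math


def beam_search(length, tags, log_prob, beam_size=12):
    """Return (best tag sequence, its log-probability) under a fixed-width beam.

    log_prob(i, prev_tag, tag) is a hypothetical stand-in for the model: the
    log-probability of `tag` at position i when `prev_tag` (None at i = 0)
    was fed to the supertag LM.
    """
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(length):
        candidates = []
        for score, seq in beam:
            prev = seq[-1] if seq else None
            for tag in tags:
                candidates.append((score + log_prob(i, prev, tag), seq + [tag]))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    best_score, best_seq = beam[0]
    return best_seq, best_score


# Made-up probabilities in which the locally best first tag is a trap.
TABLE = {
    (0, None): {"A": 0.6, "B": 0.4},
    (1, "A"): {"A": 0.55, "B": 0.45},
    (1, "B"): {"A": 0.9, "B": 0.1},
}


def toy_log_prob(i, prev, tag):
    return math.log(TABLE[(i, prev)][tag])


greedy_seq, greedy_score = beam_search(2, ["A", "B"], toy_log_prob, beam_size=1)
beam_seq, beam_score = beam_search(2, ["A", "B"], toy_log_prob, beam_size=12)
```

With `beam_size=1` the procedure reduces to greedy decoding and commits to the first tag `A`, while the wider beam keeps `B` alive long enough to find the higher-probability sequence.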

Data
For supertagging, experiments were run with the standard splits of CCGbank. Unlike previous work, no features were extracted for the LSTM models and rare categories were not thresholded. Words were lowercased and digits replaced with @.
CCGbank's training section contains 1,284 lexical categories (394 in Dev). The distribution of categories has a long tail, with only a third of those categories having a frequency count ≥ 10 (the threshold used by existing literature). Following Lewis and Steedman (2014b), we allow the model to predict all categories for a word, not just those with which the word was observed to co-occur in the training data. Accuracies on these unseen (word, cat) pairs are presented in the third column of Table 1.

Table 3 presents our feed-forward POS tagging results. We achieve 97.28% on the development set and 97.4% on test. Although slightly below the state of the art, we approach existing work with bi-LSTMs, and our models are much simpler and faster to train.

Table 1 shows a steady increase in performance as the model is provided additional context. The forward and backward models are presented with information that may be arbitrarily far away in the sentence, but only in a specific direction. This yields weaker results than the feed-forward model, which can see in both directions within a small window. The real gains are achieved by the bidirectional LSTM, which incorporates knowledge from the entire sentence. Our addition of a language model and changes to training further improve the performance. Our final model (bi-LSTM-LM + ss-train-1 with beam decoding) has a test accuracy of 94.5%, 1.5% above the previous state of the art.

Parsing
Our primary goal in this paper was to demonstrate how a bi-LSTM captures new and different information from unidirectional or feed-forward approaches. This advantage also translates to gains in parsing. Table 4 presents new state-of-the-art parsing results, including those for our bi-LSTM-LM + ss-train-1. These results were attained using our part-of-speech tags (Table 3) and the Java implementation of the C&C parser (Clark and Curran, 2007).

Error Analysis
Our analysis indicates that the information following a word is more informative than what precedes it. Table 2 compares how well our models recover common and syntactically interesting supertags. In particular, the results of the forward and backward models motivate the need for a bidirectional approach. The first two rows show prepositional phrase attachment decisions (noun- and verb-attaching categories are in rows one and two, respectively). Here the forward model outperforms the backward model, presumably because knowing the word to be modified and the preposition is more important than observing the object of the prepositional phrase (the information available to the backward model).
If the information missing from either the forward or backward models were local, the bidirectional model should perform the same as the feed-forward model; instead, it surpasses it, often by a large margin. This implies that long-range information is necessary for choosing a supertag.
Embeddings: In addition, we can visualize the information captured by our models by investigating a category's nearest neighbors based on the learned embeddings. We see that the forward model learns internal structure with the query category, but the list of arguments is nearly random. In contrast, the backward model clusters categories primarily based on the final argument, perhaps sharing similarities in the subject argument only because of the predictable SVO nature of English text. However, due to its lack of forward context, the model incorrectly associates categories with less-common first arguments (e.g. S[qem]). Finally, the bidirectional embeddings appear to cleanly capture the strengths of both the forward and backward models.
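A nearest-neighbor lookup of this kind can be sketched as below. The similarity measure is not stated in the text, so cosine similarity is an assumption, and the two-dimensional vectors are made-up values chosen only to make the example readable; they are not learned embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, embeddings, n=3):
    """Return the n categories whose embeddings are most similar to the query's."""
    q = embeddings[query]
    scored = [(cosine(q, vec), cat) for cat, vec in embeddings.items() if cat != query]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [cat for _, cat in scored[:n]]


# Made-up toy vectors keyed by CCG category (illustrative only).
toy_embeddings = {
    r"(S\NP)/NP": [1.0, 0.1],
    r"(S\NP)/PP": [0.9, 0.2],
    "NP": [0.1, 0.9],
    "N": [0.0, 1.0],
}

neighbors = nearest_neighbors(r"(S\NP)/NP", toy_embeddings, n=2)
```

Running the same query against the supertag embeddings of the forward, backward, and bidirectional models would give the three neighbor lists that the comparison in this paragraph is based on.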
Consistency and Internal Structure: Because supertags are highly structured, their co-occurrence in a sentence must be permitted by the combinators of CCG. Although we do not encode this explicitly, the language model dramatically increases the percentage of predicted sequences that result in a valid parse, by up to 15% (last column of Table 2).
Sparsity: One consideration of our approach is that we do not threshold rare categories or use any tag dictionaries; our models are presented with the full space of CCG categories, despite the long tail. This did not hurt performance, and the models learned to successfully use several categories that were outside the set of traditionally thresholded frequent categories. Additionally, the total number of categories used correctly at least once by the bidirectional models was substantially higher than for the other models (∼270 vs. ∼220 of 394), though the large number of unused categories (≥120) indicates that there is still substantial room for improvement.

Conclusions and Future Work
We demonstrated large gains in supertagging and parsing, which we attribute to the fact that bi-LSTMs with a language model encode an entire sentence at decision time. Future work will investigate improving performance on rare categories.