TAG Parsing with Neural Networks and Vector Representations of Supertags

We present supertagging-based models for Tree Adjoining Grammar parsing that use neural network architectures and dense vector representation of supertags (elementary trees) to achieve state-of-the-art performance in unlabeled and labeled attachment scores. The shift-reduce parsing model eschews lexical information entirely, and uses only the 1-best supertags to parse a sentence, providing further support for the claim that supertagging is “almost parsing.” We demonstrate that the embedding vector representations the parser induces for supertags possess linguistically interpretable structure, supporting analogies between grammatical structures like those familiar from recent work in distributional semantics. This dense representation of supertags overcomes the drawbacks for statistical models of TAG as compared to CCG parsing, raising the possibility that TAG is a viable alternative for NLP tasks that require the assignment of richer structural descriptions to sentences.


Introduction
Recent work has applied Combinatory Categorial Grammar (CCG, Steedman and Baldridge (2011)) to the problem of broad-coverage parsing in order to derive grammatical representations that are sufficiently rich to support tasks requiring deeper representation of a sentence's meaning (Lewis et al., 2015;Reddy et al., 2016;Nadejde et al., 2017). Yet CCG is only one of a number of mildly context-sensitive grammar formalisms that can provide such rich representations, and each has distinct advantages. In this paper we explore the applicability of another formalism, Tree Adjoin-ing Grammar (TAG, Joshi and Schabes (1997)), to the task of broad-coverage parsing.
TAG and CCG share the property of lexicalization: words are associated with elementary units of grammatical structure which are composed during a derivation using one of a small set of operations to produce a parse tree. The task of parsing involves the construction of a derivation tree that encodes the application of this set of actions to a set of elementary lexically-associated objects. TAG differs from CCG in having an even richer set of lexical units, so that the identification of these units in a derivation could be even more informative for subsequent tasks involving semantic interpretation, translation and the like, which have been the focus of CCG-based work.
The elementary units of CCG and TAG (categories for CCG, and elementary trees for TAG) determine a word's combinatory potential, in a way that is not the case for the usual part-ofspeech tags used in parsing. Indeed, the assignment of elementary objects to the words in a sentence almost determines the possible parse for a sentence. The near uniqueness of a parse given a sequence of lexical units motivated Bangalore and Joshi (1999) to decompose the parsing problem into two phases: supertagging, where elementary objects, or supertags, are assigned to each word, and stapling, where these supertags are combined together. They claim that given a perfect supertagger, a parse of a sentence follows from syntactic features provided by the supertags, and therefore, supertagging is "almost parsing." This claim has been confirmed in subsequent work: it has been shown that the task of parsing given a gold sequence of supertags can achieve high accuracy (TAG: (Bangalore et al., 2009;Chung et al., 2016), CCG: (Lewis et al., 2016)). However, it has also been revealed that the difficulty of supertagging, because of the large set of possible supertags, re-sults in inaccuracies that prevent us from effectively utilizing syntactic information provided by the imperfect set of supertags that are assigned. This problem is even more severe for TAG parsing. TAG differs from CCG in having a smaller set of combinatory operations, but a more varied set of elementary objects: the TAG-annotated version of the Penn Treebank that we use (Chen, 2001) includes 4727 distinct supertags (2165 occur once) while the CCG-annotated version (Hockenmaier and Steedman, 2007) includes 1286 distinct supertags (439 occur once). As a result, building a robust, broad-coverage TAG parser has proven difficult.
In this work, we show that robust supertaggingbased parsing of TAG is indeed possible by using a dense representation of supertags that is induced using neural networks. In the first half of the paper, we present a neural network supertagger based on a bi-directional LSTM (BLSTM) architecture, inspired by the work of Xu (2015) and Lewis et al. (2016) in CCG, and we make crucial use of synchronized dropout (Gal and Ghahramani, 2016). This supertagger achieves the stateof-the-art accuracy on the WSJ Penn Treebank. When combined with an existing TAG chart parser (Bangalore et al., 2009), the LSTM-based supertagger already yields state-of-the-art unlabeled and labeled attachment scores.
In the second half of the work, we present a shift-reduce parsing model based on a feedforward neural network that makes use of dense supertag embeddings. Although this approach has much in common with the approach to shiftreduce CCG parsing taken by Zhang and Clark (2011), it differs in its additive structures in supertag embeddings. When a CCG operation combines two supertags (categories), it yields a resulting category that is typically distinct from the two that are combined, and CCG shift-reduce parsers (e.g. Xu (2015)) make use of this result to guide subsequent actions. When the resulting category is the same as some lexical category assignment (for example when function application over (S\N P )/N P N P yields S\N P , the same as an intransitive verb), the parser will benefit from sharing statistics across these contexts. For TAG however, substitution or adjoining of one elementary tree into another does not change the nature of the elementary tree into which the operation has taken place. Consequently, the results of partial derivations are not identified with atomic lexical entries, resulting in sparser data. We propose a solution to this problem for TAG by introducing vector representations that are added to the supertag embedding when an operation has been applied to an elementary tree. Not only does this result in a TAG-parser with the best known performance over the WSJ Penn Treebank, but the resulting supertag embeddings turn out to contain linguistically sensible linear structure that we illustrate.

TAG Parsing and Dependency Parsing
TAG is a tree-rewriting system, and typically the elementary structures of a TAG are phrase structure trees. Thus, TAG-derived structures are also phrase structure trees. In addition, a TAG derivation also yields a record of the derivational operations (substitutions, adjunctions) used to produce the derived tree. Since these operations are context-free, this record also forms a tree, called the derivation tree, whose nodes are the elementary objects and the edges are combinatory operations. If we assume the TAG is lexicalized (i.e., each elementary structure is anchored by at least one terminal symbol), then we can label the nodes of the derivation tree with the tree names and also the anchors of the elementary trees, and we obtain what is formally a dependency tree. 1 Since each node is also labeled with an elementary tree from the grammar, we can associate rich linguistic structure with that node, such as passive voice or empty subject.
In addition, it has long been observed that the derivation tree can also be interpreted linguistically as a dependency tree (Rambow and Joshi, 1997), if certain assumptions are made about the shape of the elementary trees in the grammar (Frank, 2001). The substitution operation corresponds to the obligatory addition of an argument, and adjunction is used to add adjuncts, as well as function words to a lexical head. The one exception is the treatment of long distance whmovement in TAG. Here, a matrix clause is represented by a predicative auxiliary tree which is adjoined into the embedded clause, so that the whelement moved from the embedded clause can still be substituted locally into the tree headed by its verb. As a result, the dependency between the matrix and embedded verbs is inverted relative to the  Figure 1: TAG derivation tree (left) and closely related dependency tree (right) for The bill, which they failed to pass, would regulate emissions. Substitution edges are labeled SUBJ or OBJ, predicative auxiliary edges are labeled PREDAUX, while all other adjoining edges are labeled ADJ. We use the same edge labels in the dependency tree. The derivation tree also carries the name of the elementary tree used during the derivation, which can be used to look up rich syntactic information about that word in context. normally assumed dependency. This can be seen in Figure 1, where in the linguistically motivated dependency tree (right) pass depends on failed as the latter's object, while in the TAG derivation tree (left), failed depends on pass, linked by an arc marked PREDAUX for predicative auxiliary. These cases can be detected automatically because of the trees used; as a result of this inversion, there is almost no non-projectivity in English. In summary, TAG parsing into derivation trees is very closely related to dependency parsing. In this paper, we are interested in extracting derivation trees, not the derived trees (which can be recovered from the derivation trees).
The corpus we use is obtained by extracting a TAG grammar from the WSJ part of the Penn Treebank corpus, resulting in a grammar and derivation trees labeled with the grammar Chen (2001). For example, in Figure 1, t27 is the basic tree for a transitive verb (regulate), while t722 is the tree for a transitive verb which forms an object relative clause with an overt relative pronoun but an empty subject ( Figure 2). 2 The corpus and grammar were iteratively refined to obtain linguistically plausible derivation trees which could serve as input for a generation task (Bangalore and Rambow, 2000). As a result, the dependency structure is similar to Universal Dependency (Nivre et al., 2016), apart from the different treatment of long-distance wh-movement noted above: the primary dependencies are between the core meaningbearing lexical words, while function words (auxiliaries, determiners, complementizers) depend on their lexical head and have no dependents. 3 We label verbal argument arcs with deep dependency labels: Subject, Object, and Indirect Object normalized for passive and dative shift. All other arcs are labeled as Adjuncts. This means that our label set is small, but determining the argument labels requires detection of voice alternations and dative shift.

Long-Distance Dependencies and LSTMs
In CCG, a transitive verb is uniformly associated with the category (S\N P )/N P , and variation in the word order of a clause is addressed through the use of different combinatory operations. This results in greater parsing ambiguity given a sequence of categories. In TAG, the set of operations is more restricted. While this has the positive effect of reducing parsing ambiguity given a sequence of elementary trees, it necessitates a proliferation in the number of elementary trees. For example, a TAG will associate different elementary trees for the same transitive verb in order to derive canonical clauses, subject and object relatives, and subject and object questions. Not only does this lead to a larger number of supertags, it also means that the determination of the correct supertag requires sensitivity to long-distance dependencies. For example, in the question Who does Bill think Harry likes?, the category of the verb like requires sensitivity to the first word of the sentence. To address this problem, we make use of a supertagging model that is based on a recurrent network architecture, the Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber (1997)), which is constructed so that its update rule avoids the vanishing/exploding gradient problem.

Supertagger Model
The model architecture we adopt is depicted in Figure 3, a BLSTM. The input for each word is represented via the concatenation of a 100-dimensional embedding of the word, a 5dimensional embedding of a predicted part of speech tag, and a 10-dimensional embedding of a suffix vector (which encodes the presence of 1 and 2 character suffixes of the word). We initialize the word embeddings to be the pre-trained GloVe vectors (Pennington et al., 2014); for words which do not have a corresponding GloVe vector, we initialize their embedding to a zero vector. The other embeddings are randomly initialized. Features for each word are fed into the BLSTMs. To produce an output for the network, we concatenate the output vectors from the two LSTM directions and apply an affine transformation before the softmax function to obtain a probability distribution over the 4727 supertags. We train this network, including the embeddings, by optimizing the negative log-likelihood of the observed sequences of supertags in a mini-batch stochastic fashion with the Adam optimization algorithm with l = 0.001 (Kingma and Ba, 2015).
Since neural networks have numerous parameters, regularization plays a key role in training. This is typically accomplished by using dropout (Srivastava et al., 2014). Although dropout training has been successful on feed-forward neural networks, performing dropout on recurrent neural networks has been problematic (Gal and Ghahramani, 2016). Armed with a novel interpretation of dropout based on variational inference on parameters, Gal and Ghahramani (2016) propose that dropout noise should be shared across the time steps. We apply this technique to the training of our LSTM network, and achieve an improvement of approximately 2% in accuracy. 4 Transition-based Parsing for TAG As discussed in Section 2, TAG parsing into derivation trees is closely related to dependency parsing; it is natural to make use of techniques from dependency parsing to reconstruct a TAG derivation tree. We make use of this approach here, eschewing complete chart-based parsing algorithms in favor of greedy or beam-search-based explorations of possible parses.

Shift-Reduce Parsing Algorithm
We employ the arc-eager system of shift-reduce parsing, familiar from the MALT parser (Nivre et al., 2006). In this system, an oracle is trained to predict a sequence of transition operations from an initial state to a terminal state for each sentence. Each state is represented by c = (s, b, A) where s, b, and A denote the stack, buffer and set of dependency relations derived so far. Therefore, our objective is to predict a transition operation given the configuration set c. The initial configuration is defined as s = [ROOT ], b = [w 1 , · · · w n ], and A = ∅ where n is the number of tokens in the sentence w 1 w 2 · · · w n . At a particular state, denote the top ith element of the stack and the buffer by s i and b i respectively. The arc-eager system defines four types of operations with corresponding preconditions: LEFT-ARC, RIGHT-ARC, SHIFT and REDUCE. For the present parser, the LEFT-ARC and RIGHT-ARC operations are each further divided into seven different types depending on the derivational operation involved and the location: Substitution 0-4, Adjoining, and Co-anchor attachment. Substitution n represents an instance of substitution into an argument slot of an elementary tree that is uniquely annotated with the num-ber n (we discuss the interpretation of such numbers below). Adjoining represents an application of the adjoining operation. It is not further subdivided, as the current parser does not distinguish among different loci of adjoining within an elementary tree. Co-anchor attachment represents the substitution into a node that is construed as a cohead of an elementary tree. An example of this is the insertion of a particle into a verbally headed tree associated with a verb-particle construction, such as the insertion of up into the pick-headed tree to generate I picked up the book. The transitions terminate when the buffer is empty.
This system will fail to capture non-projective TAG derivation structures. However, as noted in Section 2, there is almost no non-projectivity in TAG derivation structures of English. Concretely, we find that WSJ Sections 01-22 contain only 26 non-projective sentences (0.065%), and those sentences are discarded during training. WSJ Section 00 does not have any non-projective sentences.
On the other hand, WSJ Sections 01-22 contain 0.6% of non-projective sentences in dependency grammar (Chen and Manning, 2014), an order of magnitude more than non-projectivity for TAG. This suggests that the problem of TAG parsing is more compatible with standard shift-reduce parsing than dependency grammar parsing is. 4

Parser Model
In this work, we use a non-lexicalized parser, which does not have access to the identities of the words of the sentence to be parsed. 5 Instead, the parser's decisions will be guided by the supertags of the top k elements from the stack and the first k elements of the buffer. Using these features as input, we build a two-layer feed-forward network to predict the action for the parser to take. As noted above, the identity of the supertag does not allow 4 We recognize alternatives to shift-reduce parsing. For instance, Dozat and Manning (2017) propose a graph-based parser that accommodates non-projectivity. It remains open whether such alternatives will work for TAG parsing, and we leave this for the future. We emphasize, however, that because of the nature of TAG derivations, the issue of nonprojectivity is much less severe than dependency parsing. 5 We have tried adding word embeddings as inputs to the parser with different choices of hyperparameters (e.g., the number of embedding dimensions). Unfortunately, our experiments yielded degraded performance. It should be noted, however, that TAG supertags typically provide enough information for deriving correct parses; the only cases that supertags cannot disambiguate are ambiguous attachments to identical nonterminals (e.g. The picture of my friend with green eyes). the parser to encode whether a particular node in an elementary tree has already been targeted by a substitution operation. In order to overcome this deficiency, we augment the parser's state with substitution memory, which encodes for each possible substitution site (from 0 to 4) in a supertag T whether that substitution has already applied in T .
Each supertag is mapped to a d-dimensional vector by an embedding matrix E ∈ R d× (N +2) where N denotes the number of supertags; we also have additional vectors representing the empty state and ROOT . Substitution memory is similarly transformed, with a substitution memory embedding matrix M ∈ R d×5 , to a d-dimensional vector that encodes in a distributed manner where substitution has applied. Each column in M is the vector corresponding to a specific substitution type. Each element from the stack and buffer is then represented by adding the supertag T embedding to the embedding associated with each vector from M corresponding to the substitution operations already performed on T , if any. Mathematically, suppose that we are at the configuration c = (s, b, A), and p (i) ∈ R 5 denotes the substitution history of s i . p (i) is an indicator vector that p (i) j = 1 if and only if we have already performed substitution j into s i in the parser, and 0 otherwise. Define p (k+1) in the same way for b 1 . 6 Then, the input vector to the network can be expressed as [Es 1 + M p (1) ; · · · ; Es k + M p (k) ; Eb 1 + M p (k+1) ; · · · ; Eb k ] This model with the additive substitution memory has several conceptual advantages. First, the additive structure gives us an unbounded scope of the past transition, avoiding making decisions that lead to substitution collisions without a computationally expensive architecture such as an ensemble of LSTMs (Xu, 2015). Moreover, as TAG supertags encode rich syntactic features, the parsing data for some supertags tend to become scarce. The most common 300 supertags in the Penn Tree Bank WSJ Sections 01-22 cover 96.8% of the data. In a situation of such data sparsity, it becomes crucial to link, for example, the behaviors of intransitive verbs with those of transitive verbs. With substitution memory, the network can develop representations under which addition of appropriate substitution vectors serves to transform one supertag into another, allowing the generalization across these contexts. Indeed, as we will show in a later section, the substitution memory embeddings and supertag embeddings turn out to yield interpretable and linguistically sensible structures.
Finally, we concatenate the vectors associated with the relevant elements from the stack and buffer into a 2dk dimensional vector and feed it to the network to obtain a probability distribution over the possible transition actions. The architecture is visualized in Figure 4. Following Chen and Manning (2014), we use the cube activation function for the first layer, which could better capture interactions. We, again, optimize the negative loglikelihood in a mini-batch stochastic fashion with the Adam optimization algorithm with l = 0.001 (Kingma and Ba, 2015). With regards to decoding, we consider both greedy parsing as well as a beam search algorithm, where we keep transition action hypotheses at each time step, in the experiments we report below.

Supertag Input to the Parser
We consider three types of supertag inputs to the neural network parser: gold supertags, 1-best supertags from the BLSTM supertagger, and 1-best supertags from the MICA chart parser (Bangalore et al., 2009). MICA searches through n-best supertags with their corresponding probabilities and produces a full parse forest that abides by the TAG grammar. To generate the 1-best supertags from MICA, we first feed 10-best supertags from the BLSTM supertagger to the MICA chart parser, and retain only the supertags of the best parse. These supertags have the special property that there exists a feasible parse in the TAG grammar for every sentence, which does not necessarily hold for the 1-best supertags from the BLSTM su-pertagger.

Experimental Setups
In order to ensure comparability with past work on TAG parsing, we follow the protocol of Bangalore et al. (2009) andChung et al. (2016), and use the grammar and the TAG-annotated WSJ Penn Tree Bank described in Section 2. Following that work, we use Sections 01-22 as the training set, Section 00 as the development set, and Section 23 as the test set. The training, development, and test sets comprise 39832, 1921, and 2415 sentences, respectively. The development set contains 177 sentences with at least one supertag that was absent from the training set. We implement the networks in TensorFlow (Abadi et al., 2015). During training, we shuffle the order of the sentences in the training set to form mini-batches. Each mini-batch consists of 100 sentences, except the last which contains 32 sentences.
For supertagging, we first generate predicted POS tags for both the training set and the development set. The POS-tagger architecture is similar to that of the supertagger shown in Figure 3, except that, obviously, we do not feed it POS embeddings. The BLSTMs each contain 128 units, and we do not apply dropout at this stage. To derive predicted POS tags for the supertagger training set, we perform 10-fold jackknife training over the training set. For the supertagger, each direction of LSTM computation involves two layers, and each LSTM contains 512 units. The hidden units, layerto-layer, and input units dropout rates are 0.5, 0.5, and 0.2 respectively. After each training epoch, we test the parser on the development set. When classification accuracy does not improve on two consecutive epochs, we end the training.
For the parser, we initialize the supertag embedding matrix E and the substitution memory embedding matrix M according to For all of the experiments reported here, we fix the hyper-parameters as follows: the embedding dimensions d for the supertag and substitution memory embeddings are 50, the number of units is 200 on both of the two hidden layers, and the input dropout rate is 0.2 and the hidden dropout rate is 0.3. We choose k = 3 or 5 for the stack/buffer scope. After each training epoch, we test the parser on the development set, and when the greedy accuracy fails to improving on two consecutive epochs, we terminate the training.

Supertagging Results
We achieve on Section 00 supertagging accuracy of 89.32%, 90.67% if we disregard the 177 sentences that contain an unseen supertag. This performance surpasses previous results on this task: Bangalore et al. (2009) report 88.52% accuracy using a maxent supertagger combined with a chart parser (MICA), which is the best result over a tag set of this complexity, though better results are reported for considerably smaller tag sets (on the order of 300 supertags). The n-best and β pruning accuracy (Clark and Curran, 2007) are given in Figure 5. In the β pruning scheme, we pick supertags whose probabilities are greater than β times the probability of the most likely supertag. We show the results for β ∈ [0.075, 0.03, 0.01, 0.005, 0.001, 0.0001]. It is noteworthy that with β = 0.005, the average number of supertags picked for each token (ambiguity level) is about 2, but the accuracy surpasses 98%, suggesting that incorporating the β pruning method in the stapling phase of TAG parsing will enhance the parser. We also obtain comparable accuracy of 89.44% on Section 23.
As discussed above, TAG supertags alone provide rich syntactic information. In order to understand how much such information our supertagger sucessfully captures, we analyze the 1-best supertag results on the basis of the syntactic properties of the elementary trees defined in Chung et al. (2016). Extending the notion of binary precision and recall, we define the macro-averaging precision and recall as the simple average over precision or recall corresponding to each class (Sokolova and Lapalme, 2009). We also compute accuracy, which is simply the ratio of correctly classified examples to the entire number of examples. Table 1 shows the results along with those for the maxent supertagger (Bangalore et al., 2009). Recall tends to be lower than precision; we can attribute this pattern to the nature of the macroaveraging scheme that equally treats each class regardless of the size; poor recall performance on a small class, such as the class of dative shift verbs, influences the overall recall as much as performance on a large class. Observe, however, that the BLSTM supertagger yields significantly better performance on recall in general, and it outper- Figure 5: Section 00 n-best accuracy (left), and β pruning accuracy (right). Sentences with unseen supertags are disregarded.  forms the maxent supertagger by a large margin in handling long dependencies of wh-movement and relativization. Lastly, we interpret our supertagging performance in the context of prepositional phrase (PP) attachment ambiguity. Normally, in dependency parsing, PP attachment is resolved by the parser. However, in our case, it can be resolved before parsing, during the supertagging step. This is because the supertags for prepositions vary depending on the type of constituent modified by the PP containing the preposition; for example, t4 is the supertag for a preposition whose PP modifies an NP, while t13 is the supertag for a preposition whose PP modifies a VP.

MICA
To test how well our supertagger resolves PP attachment ambiguity, we used the dataset from Ratnaparkhi et al. (1994) (derived from the PTB WSJ) to extract a test set of sentences with PPs that are ambiguous between attaching to a VP or to an NP. 7 We then supertagged these sentences and checked whether the supertag for the preposition in the ambiguous PP is a VP modifier or an NP modifier. Of our test set of 1951 sentences, 1616 had supertags modifying the correct part of speech, to give an accuracy of 0.826. Table 2 compares this result to past work. The supertagger outperforms all other models besides the Word Vector model. Since this Word Vector model (like the MaxEnt model) is specifically trained for this task, and given that our supertagger is not trained for this particular task, the accuracy is reasonably encouraging. This result suggests that TAG supertagging is a reasonable intermediate level between only resolving PP attachment and conducting full parsing.

Parsing Results
Parsing results and comparison with prior models are summarized in Tables 3, 4 (Section 00), and 5 (Section 23). From Table 4, we see that the combination of the BLSTM supertagger, MICA chart parser, and the neural network parser achieves state-of-the-art performance, even compared to parsers that make use of lexical information, POS tags, and hand-engineered features. With gold supertags, the neural network parser with beam size 16 performs slightly better than the chart parser. As shown in Table 5, our supertag-based parser outperforms SyntaxNet (Andor et al., 2016) with the computationally expensive global normalization. This suggests that, besides providing the grammars and linguistic features that can be used in downstream tasks in addition to derivation trees (Semantic Role Labeling: (Chen and Rambow, 2003), Textual Entailments: (Xu et al., 2017)), supertagging also improves parsing performance.

Learned Vector Representation
We motivated the use of embeddings in the parser to encode properties of the supertags and the substitution operations performed on them. We can examine their structure in a way similar to what Mikolov et al. (2013) did for word embeddings by performing analogy tests on the learned supertag embeddings. Consider, for example, the analogy that an elementary tree representing a clause headed by a transitive verb (t27) is to a clause headed by an intransitive verb (t81) as a subject relative clause headed by a transitive verb (t99) is to a subject relative headed by an intransitive verb (t109). Following Mikolov et al. (2013), we can express this analogy with the equation t27 − t81 ≈ t99 − t109, which can be rearranged as t27 − t81 + t109 ≈ t99. By seeing if this approximate equality holds when the embeddings of the relevant supertags have been added and subtracted, we can test how well the embeddings capture syntactic properties of the supertags.
To create a set of such analogies, we extracted all pairs (stag1, stag2) such that stag2 is the result of excising exactly one substitution node from stag1. The idea here is that, once a substitution node is filled within a supertag, the result behaves like a supertag without that substitution node; for example, a transitive verb with its object filled behaves like an intransitive verb. We then create analogies by choosing two such pairs, (stag1, stag2) and (stag3, stag4), chosen so that stag1 and stag2 are related in the same way that stag3 and stag4 are related. From these two pairs we then form an equation of the form stag1 − stag2 + stag4 ≈ stag3.
We considered three different criteria for choosing which pairs of pairs can form analogies: A-1, where both pairs must have the same deep syntactic role (Drole) for the excised substitution node; A-2, where both pairs must have the same Drole and POS for the excised substitution node; and A-3, where both pairs must have the same Drole and same POS for the excised substitution node, and the heads of all supertags in the analogy must have the same POS. For each analogy generated, we computed the left hand side by adding and subtracting the relevant supertag embeddings and used cosine similarity to determine the most similar embeddings to the result and whether the intended right hand side was among the closest neighbors. We used four metrics for evaluation: Acc, the proportion of analogies for which the closest neighbor was the correct supertag; Acc-300, the proportion of analogies for which the closest neighbor amongst the 300 most common supertags was the correct supertag; Avg Rank, the average position of the correct choice in the ranked list of the closest neighbors; and Avg Rank-300,  Table 3: Parsing Results on Section 00. k is # of elements from stack and buffer used as input, B is the beam size. We show mean and standard deviation over 5 trials with different initialization for each configuration. BLSTM+Chart shows results obtained by feeding the 1-best supertag inputs from the MICA chart parser discussed in Section 4.3.

Gold Stags
Predicted   the average position of the correct choice in the ranked list of the closest neighbors amongst the 300 most common supertags. We expect that the embeddings for common supertags would be better representations than embeddings for rare supertags. Thus, we restricted our experiment to analogies between supertags among the 300 most common ones in the training set. (Indeed, experiments that included rare supertags in the analogies produced poor results.) Table 6 provides the results for the 3 types of analogies, which are very promising, particularly type A-3. We can visualize these results by performing PCA on the embedding vectors. Figure 6a shows the first 2 PCA components of A-3 analogies involving supertags containing transitive and intransitive predicates across a variety of structures. We see that virtually all pairs differ from one another by a similar vector, and in fact this difference is essentially the vector associated with substitution 1 in the substitution embedding memory (shown in blue). Figure 6b shows the case of pairs of canonical sentence elementary trees (read in I read the book) and their subject relative analogs (read in the guy who read the book). This again shows a systematic mapping between grammati-  cally related embeddings, suggesting that the embeddings encode relevant structural properties.

Conclusions and Future Work
We presented a state-of-the-art TAG supertagger and parser, the former based on a BLSTM architecture, and the latter on a non-lexicalized shiftreduce parser using a feed-forward network. The parser makes crucial use of supertag embeddings that provide linguistically interpretable vector representations of the supertags. These positive results suggest that TAG can provide the foundation of NLP systems for tasks requiring deeper analysis than current dependency parsers provide, and we will apply our parser to such tasks in the future. Nonetheless, a large discrepancy remains in parser performance with gold supertags and predicted supertags, indicating that supertagging is still a bottleneck. We will explore ways to leverage our supertagger's high β-pruning accuracy in parsing.