End-to-End Graph-Based TAG Parsing with Neural Networks

We present a graph-based Tree Adjoining Grammar (TAG) parser that uses BiLSTMs, highway connections, and character-level CNNs. Our best end-to-end parser, which jointly performs supertagging, POS tagging, and parsing, outperforms the previously reported best results by more than 2.2 LAS and UAS points. The graph-based parsing architecture allows for global inference and rich feature representations for TAG parsing, alleviating the fundamental trade-off between transition-based and graph-based parsing systems. We also demonstrate that the proposed parser achieves state-of-the-art performance on the downstream tasks of Parser Evaluation using Textual Entailments (PETE) and Unbounded Dependency Recovery. This provides further support for the claim that TAG is a viable formalism for problems that require rich structural analysis of sentences.


Introduction
Tree Adjoining Grammar (TAG, Joshi and Schabes (1997)) and Combinatory Categorial Grammar (CCG, Steedman and Baldridge (2011)) are both mildly context-sensitive grammar formalisms that are lexicalized: every elementary structure (elementary tree for TAG and category for CCG) is associated with exactly one lexical item, and every lexical item of the language is associated with a finite set of elementary structures in the grammar (Rambow and Joshi, 1994). In TAG and CCG, the task of parsing can be decomposed into two phases (e.g. TAG: Bangalore and Joshi (1999); CCG: Clark and Curran (2007)): supertagging, where elementary units or supertags are assigned to each lexical item, and parsing, where these supertags are combined. The first phase of supertagging can be considered as "almost parsing" because supertags for a sentence almost always determine a unique parse (Bangalore and Joshi, 1999). This near uniqueness of a parse given a gold sequence of supertags has been confirmed empirically (TAG: Bangalore et al. (2009); Chung et al. (2016); CCG: Lewis et al. (2016)).
We focus on TAG parsing in this work. TAG differs from CCG in having a more varied set of supertags. Concretely, the TAG-annotated version of the WSJ Penn Treebank (Marcus et al., 1993) that we use (Chen et al., 2005) includes 4727 distinct supertags (2165 occur once) while the CCG-annotated version (Hockenmaier and Steedman, 2007) only includes 1286 distinct supertags (439 occur once). This large set of supertags in TAG presents a severe challenge in supertagging and causes a large discrepancy between parsing performance with gold supertags and with predicted supertags (Bangalore et al., 2009; Chung et al., 2016). In this work, we present a supertagger and a parser that substantially improve upon previously reported results. We propose crucial modifications to a previously proposed bidirectional LSTM (BiLSTM) supertagger. First, we use character-level Convolutional Neural Networks (CNNs) for encoding morphological information instead of suffix embeddings. Secondly, we perform concatenation after each BiLSTM layer. Lastly, we explore the impact of adding additional BiLSTM layers and highway connections. These techniques yield an increase of 1.3% in accuracy. For parsing, since the derivation tree in a lexicalized TAG is a type of dependency tree (Rambow and Joshi, 1994), we can directly apply dependency parsing models. In particular, we use a previously proposed biaffine graph-based parser together with our novel techniques for supertagging.
In addition to these architectural extensions for supertagging and parsing, we also explore multi-task learning approaches for TAG parsing. Specifically, we perform POS tagging, supertagging, and parsing using the same feature representations from the BiLSTMs. This joint modeling has the benefit of avoiding a time-consuming and complicated pipeline process, and instead produces a full syntactic analysis, consisting of supertags and the derivation that combines them, simultaneously. Moreover, this multi-task learning framework further improves performance in all three tasks. We hypothesize that our multi-task learning yields feature representations in the LSTM layers that are more linguistically relevant and that generalize better (Caruana, 1997). We provide support for this hypothesis by analyzing syntactic analogies across induced vector representations of supertags (Friedman et al., 2017). The end-to-end TAG parser substantially outperforms the previously reported best results.
Finally, we apply our new parsers to the downstream tasks of Parser Evaluation using Textual Entailments (PETE, Yuret et al. (2010)) and Unbounded Dependency Recovery (Rimell et al., 2009). We demonstrate that our end-to-end parser outperforms the best previously reported results in both tasks. These results illustrate that TAG is a viable formalism for tasks that benefit from the assignment of rich structural descriptions to sentences.

Our Models
TAG parsing can be decomposed into supertagging and parsing. Supertagging assigns to words elementary trees (supertags) chosen from a finite set, and parsing determines how these elementary trees can be combined to form a derivation tree that yields the observed sentence. The combinatory operations consist of substitution, which inserts obligatory arguments, and adjunction, which is responsible for the introduction of modifiers and function words, as well as the derivation of sentences involving long-distance dependencies. In this section, we present our supertagging models, parsing models, and joint models.

Supertagging Model
Recent work has explored neural network models for supertagging in TAG and CCG (Xu et al., 2015; Lewis et al., 2016; Vaswani et al., 2016; Xu, 2016), and has shown that such models substantially improve performance beyond non-neural models. We extend previously proposed BiLSTM-based models (Lewis et al., 2016) in three ways: 1) we add character-level Convolutional Neural Networks (CNNs) to the input layer, 2) we perform concatenation of both directions of the LSTM not only after the final layer but also after each layer, and 3) we use a modified BiLSTM with highway connections.

Input Representations
The input for each word is represented via concatenation of a 100-dimensional embedding of the word, a 100-dimensional embedding of a predicted part of speech (POS) tag, and a 30-dimensional character-level representation from CNNs that have been found to capture morphological information (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016). The CNNs encode each character in a word by a 30-dimensional vector, and 30 filters produce a 30-dimensional vector for the word. We initialize the word embeddings to be the pre-trained GloVe vectors (Pennington et al., 2014); for words not in GloVe, we initialize their embedding to a zero vector. The other embeddings are randomly initialized. We obtain predicted POS tags from a BiLSTM POS tagger with the same configuration as in Ma and Hovy (2016).

Deep Highway BiLSTM
The core of the supertagging model is a deep bidirectional Long Short-Term Memory network (Graves and Schmidhuber, 2005). We compute the activation of a single LSTM cell at time step $t$ with the LSTM update equations (Eqs. 1-6), in which a semicolon ; denotes concatenation, $\odot$ is element-wise multiplication, and $\sigma$ is the sigmoid function. In the first BiLSTM layer, the input $x_t$ is the vector representation of word $t$. (The sequence is reversed for the backward pass.) In all subsequent layers, $x_t$ is the corresponding output from the previous BiLSTM; the output of a BiLSTM at time step $t$ is equal to $[h^f_t; h^b_t]$, the concatenation of the hidden states corresponding to input $t$ in the forward and backward passes. This concatenation after each layer differs from Lewis et al. (2016), where concatenation happens only after the final BiLSTM layer. We will show in a later section that concatenation after each layer contributes to improved performance.
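The cell equations themselves are not reproduced above. A standard LSTM formulation that is consistent with the notation just described (concatenation via a semicolon, element-wise multiplication, the sigmoid, and Eq. 6 as the hidden-state update) would read as follows. This is a sketch: the parameter names $W_\ast$ and $b_\ast$ are assumptions, and the exact variant used may differ.

```latex
\begin{align}
i_t &= \sigma\left(W_i [x_t; h_{t-1}] + b_i\right) \tag{1}\\
f_t &= \sigma\left(W_f [x_t; h_{t-1}] + b_f\right) \tag{2}\\
\tilde{c}_t &= \tanh\left(W_c [x_t; h_{t-1}] + b_c\right) \tag{3}\\
o_t &= \sigma\left(W_o [x_t; h_{t-1}] + b_o\right) \tag{4}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5}\\
h_t &= o_t \odot \tanh(c_t) \tag{6}
\end{align}
```

With concatenation after each layer, the input $x_t$ of layer $n+1$ is $[h^f_t; h^b_t]$ from layer $n$, so every layer above the first sees both left and right context at each position.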
We also extend the model of Lewis et al. (2016) by allowing highway connections between LSTM layers. A highway connection is a gating mechanism that combines the current and previous layer outputs, which can prevent the problem of vanishing/exploding gradients (Srivastava et al., 2015). Specifically, in networks with highway connections, we replace Eq. 6 with a gated combination of the current and previous layer outputs. Indeed, our experiments will show that highway connections play a crucial role as we add more BiLSTM layers.
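The replacement for Eq. 6 is not reproduced above. One common form of highway connection for stacked LSTMs, given here as an assumption rather than as the exact equation used, introduces a transform gate $r_t$ and mixes the usual LSTM output with a linear projection of the layer input $x_t$ (the names $W_r$, $b_r$, and $W_h$ are hypothetical):

```latex
\begin{align}
r_t &= \sigma\left(W_r [x_t; h_{t-1}] + b_r\right)\\
h_t &= r_t \odot o_t \odot \tanh(c_t) + (1 - r_t) \odot W_h x_t
\end{align}
```

Under this form, $r_t \to 1$ recovers the ordinary update $h_t = o_t \odot \tanh(c_t)$, while $r_t \to 0$ passes the (projected) output of the previous layer straight through, which gives gradients a short path across layers.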
We generally follow the hyperparameters chosen in Lewis et al. (2016). Specifically, we use BiLSTM layers with 512 units each. Input, layer-to-layer, and recurrent (Gal and Ghahramani, 2016) dropout rates are all 0.5. For the CNN character-level representation, we use the hyperparameters from Ma and Hovy (2016).
We train this network, including the embeddings, by minimizing the negative log-likelihood of the observed sequences of supertags in a mini-batch stochastic fashion with the Adam optimization algorithm with batch size 100 and $\ell = 0.01$ (Kingma and Ba, 2015). In order to obtain predicted POS tags and supertags of the training data for subsequent parser input, we also perform 10-fold jackknife training. After each training epoch, we test the supertagger on the dev set. When classification accuracy does not improve on five consecutive epochs, training ends.
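The 10-fold jackknife procedure mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the round-robin fold assignment and the `train_fn` / `predict_fn` callbacks are assumptions introduced for the example. The point is that every training sentence is tagged by a model that never saw it, so the parser is trained on tags of the same quality it will receive at test time.

```python
from typing import Any, Callable, List, Sequence, Tuple


def jackknife_folds(n_items: int, n_folds: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, held_out_indices) pairs, one per fold.

    Fold membership is assigned round-robin by index (an assumption).
    """
    if not 1 < n_folds <= n_items:
        raise ValueError("need 1 < n_folds <= n_items")
    folds = []
    for k in range(n_folds):
        held_out = [i for i in range(n_items) if i % n_folds == k]
        train = [i for i in range(n_items) if i % n_folds != k]
        folds.append((train, held_out))
    return folds


def jackknife_predict(
    sentences: Sequence[Any],
    train_fn: Callable[[List[Any]], Any],
    predict_fn: Callable[[Any, Any], Any],
    n_folds: int = 10,
) -> List[Any]:
    """Predict tags for every sentence with a model trained on the other folds."""
    predictions: List[Any] = [None] * len(sentences)
    for train_idx, held_idx in jackknife_folds(len(sentences), n_folds):
        model = train_fn([sentences[i] for i in train_idx])
        for i in held_idx:
            predictions[i] = predict_fn(model, sentences[i])
    return predictions
```

In use, `train_fn` would train a tagger on the nine training folds and `predict_fn` would tag one held-out sentence with it.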

Parsing Model
Until recently, TAG parsers have been grammar-based, requiring as input a set of elementary trees (supertags). For example, Bangalore et al. (2009) propose the MICA parser, an Earley parser that exploits a TAG grammar that has been transformed into a variant of a probabilistic CFG. One advantage of such a parser is that its parses are guaranteed to be well-formed according to the TAG grammar provided as input.
More recent work, however, has shown that data-driven transition-based parsing systems outperform such grammar-based parsers (Chung et al., 2016; Friedman et al., 2017). Friedman et al. (2017) achieved state-of-the-art TAG parsing performance using an unlexicalized shift-reduce parser with feed-forward neural networks that was trained on a version of the Penn Treebank that had been annotated with TAG derivations. Here, we pursue this data-driven approach, applying a graph-based parser with deep biaffine attention that allows for global training and inference.

Input Representations
The input for each word is the concatenation of a 100-dimensional embedding of the word and a 30-dimensional character-level representation obtained from CNNs in the same fashion as in the supertagger. We also consider adding 100-dimensional embeddings for a predicted POS tag and a predicted supertag (Friedman et al., 2017). The ablation experiments in Kiperwasser and Goldberg (2016) illustrated that adding predicted POS tags boosted performance in Stanford Dependencies. In Universal Dependencies, prior work showed empirically that a dependency parser gains significant improvements from POS tags predicted by a BiLSTM POS tagger. Indeed, Friedman et al. (2017) demonstrated that their unlexicalized neural network TAG parsers that only get as input predicted supertags can achieve state-of-the-art performance, with lexical inputs providing no improvement in performance. We initialize word embeddings to be the pre-trained GloVe vectors as in the supertagger. The other embeddings are randomly initialized.

Biaffine Parser
We train our parser to predict edges between lexical items in an LTAG derivation tree. Edges are labeled by the operations together with the deep syntactic roles of substitution sites (0=underlying subject, 1=underlying direct object, 2=underlying indirect object, 3,4=oblique arguments, CO=cohead for prepositional/particle verbs, and adj=all adjuncts). Figure 1 shows our biaffine parsing architecture. Following Kiperwasser and Goldberg (2016), we use BiLSTMs to obtain features for each word in a sentence. We add highway connections in the same fashion as our supertagging model.

Figure 1: Biaffine parsing architecture. For the dependency from John to sleeps in the sentence John sleeps, the parser first predicts the head of John and then predicts the dependency label by combining the dependent and head representations. In the joint setting, the parser also predicts POS tags and supertags.
We first perform unlabeled arc-factored scoring using the final output vectors from the BiLSTMs, and then label the resulting arcs. Specifically, suppose that we score edges coming into the $i$th word in a sentence, i.e., we assign scores to the potential parents of the $i$th word. Denote the final output vector from the BiLSTM for the $k$th word by $h_k$ and suppose that $h_k$ is $d$-dimensional. Then, we produce two vectors from two separate multilayer perceptrons (MLPs) with the ReLU activation:

$$h_k^{(\text{arc-dep})} = \mathrm{MLP}^{(\text{arc-dep})}(h_k), \qquad h_k^{(\text{arc-head})} = \mathrm{MLP}^{(\text{arc-head})}(h_k)$$

where $h_k^{(\text{arc-dep})}$ and $h_k^{(\text{arc-head})}$ are $d_{arc}$-dimensional vectors that represent the $k$th word as a dependent and a head respectively. Now, suppose the $k$th row of matrix $H^{(\text{arc-head})}$ is $h_k^{(\text{arc-head})}$. Then, the probability distribution $s_i$ over the potential heads of the $i$th word is computed by

$$s_i = \mathrm{softmax}\left(H^{(\text{arc-head})} W^{(arc)} h_i^{(\text{arc-dep})} + H^{(\text{arc-head})} b^{(arc)}\right) \tag{7}$$

where $W^{(arc)} \in \mathbb{R}^{d_{arc} \times d_{arc}}$ and $b^{(arc)} \in \mathbb{R}^{d_{arc}}$. In training, we simply take the greedy maximum probability to predict the parent of each word. In the testing phase, we use previously formulated heuristics to ensure that the resulting parse is single-rooted and acyclic.

Given the head prediction of each word in the sentence, we assign labeling scores using vectors obtained from two additional MLPs with ReLU. For the $k$th word, we obtain a dependent vector $h_k^{(\text{rel-dep})}$ and a head vector $h_k^{(\text{rel-head})}$, both $d_{rel}$-dimensional. Let $p_i$ be the index of the predicted head of the $i$th word, and $r$ be the number of dependency relations in the dataset. Then, the probability distribution $\ell_i$ over the possible dependency relations of the arc pointing from the $p_i$th word to the $i$th word is calculated by applying a softmax to $r$ label scores computed from $h_{p_i}^{(\text{rel-head})}$ and $h_i^{(\text{rel-dep})}$ (Eq. 8).

We generally follow previously published hyperparameters. Specifically, we use BiLSTM layers with 400 units each. Input, layer-to-layer, and recurrent dropout rates are all 0.33. All MLPs have depth 1, and the MLPs for unlabeled attachment and those for labeling contain 500 ($d_{arc}$) and 100 ($d_{rel}$) units respectively. For character-level CNNs, we use the hyperparameters from Ma and Hovy (2016).
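The arc-scoring step (Eq. 7) can be illustrated with a small pure-Python sketch. It assumes the biaffine form softmax(H W h + H b), which matches the stated shapes of W (d_arc by d_arc) and b (d_arc); the toy vectors, the identity weight matrix, and the inclusion of a ROOT row among the candidate heads are assumptions made for the example. Label scoring (Eq. 8) is not sketched here.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def arc_distribution(H_head: Matrix, h_dep: Vector, W: Matrix, b: Vector) -> Vector:
    """softmax(H_head W h_dep + H_head b): one probability per candidate head."""
    inner = [x + y for x, y in zip(matvec(W, h_dep), b)]  # W h_dep + b
    return softmax(matvec(H_head, inner))                 # H_head (W h_dep + b)


def greedy_heads(H_head: Matrix, H_dep: Matrix, W: Matrix, b: Vector) -> List[int]:
    """Pick the most probable head for each dependent independently."""
    heads = []
    for h_dep in H_dep:
        probs = arc_distribution(H_head, h_dep, W, b)
        heads.append(max(range(len(probs)), key=probs.__getitem__))
    return heads


# Toy example for "John sleeps" with d_arc = 2 (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
H_head = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # rows: ROOT, John, sleeps
H_dep = [[0.0, 2.0], [2.0, 0.0]]               # rows: John, sleeps
heads = greedy_heads(H_head, H_dep, W, b)      # John -> sleeps (2), sleeps -> ROOT (0)
```

Because each word's head is chosen independently, greedy selection can produce cycles or multiple roots, which is why a separate heuristic is needed at test time.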
We train this model with the Adam algorithm to minimize the sum of the cross-entropy losses from head predictions ($s_i$ from Eq. 7) and label predictions ($\ell_i$ from Eq. 8) with $\ell = 0.01$ and batch size 100 (Kingma and Ba, 2015). After each training epoch, we test the parser on the dev set. When labeled attachment score (LAS) does not improve on five consecutive epochs, training ends.
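The stopping rule used for both the supertagger and the parser can be sketched as a patience counter. This is a minimal illustration under two assumptions: "does not improve" is read as "does not exceed the best dev score so far", and a `max_epochs` cap (not mentioned in the text) guards against non-termination. The `run_epoch` callback is hypothetical; it is expected to train for one epoch and return the dev score.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    patience: int = 5,
    max_epochs: int = 1000,
) -> Tuple[int, float]:
    """Train until the dev score fails to improve for `patience` consecutive epochs.

    Returns (best_epoch, best_score), with epochs numbered from 1.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    for epoch in range(1, max_epochs + 1):
        score = run_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, best_score
```

In practice the model parameters from `best_epoch` would be checkpointed and restored for evaluation on the test set.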

Joint Modeling
The simple BiLSTM feature representations for parsing presented above are conducive to joint modeling of POS tagging and supertagging; rather than using POS tags and supertags to predict a derivation tree, we can instead use the BiLSTM hidden vectors derived from lexical inputs alone to predict POS tags and supertags along with the TAG derivation tree.
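Specifically, for the $k$th word we obtain two additional vectors from MLPs:

$$h_k^{(pos)} = \mathrm{MLP}^{(pos)}(h_k), \qquad h_k^{(stag)} = \mathrm{MLP}^{(stag)}(h_k)$$

where $h_k^{(pos)} \in \mathbb{R}^{d_{pos}}$ and $h_k^{(stag)} \in \mathbb{R}^{d_{stag}}$. We obtain probability distributions over the POS tags and supertags by:

$$\mathrm{softmax}\left(W^{(pos)} h_k^{(pos)} + b^{(pos)}\right) \tag{9}$$

$$\mathrm{softmax}\left(W^{(stag)} h_k^{(stag)} + b^{(stag)}\right) \tag{10}$$

where $W^{(pos)}$, $b^{(pos)}$, $W^{(stag)}$, and $b^{(stag)}$ are in $\mathbb{R}^{n_{pos} \times d_{pos}}$, $\mathbb{R}^{n_{pos}}$, $\mathbb{R}^{n_{stag} \times d_{stag}}$, and $\mathbb{R}^{n_{stag}}$ respectively, with $n_{pos}$ and $n_{stag}$ the numbers of possible POS tags and supertags respectively. We use the same hyperparameters as in the parser. The MLPs for POS tagging and supertagging both contain 500 units. We again train this model with the Adam algorithm to minimize the sum of the cross-entropy losses from head predictions ($s_i$ from Eq. 7), label predictions ($\ell_i$ from Eq. 8), POS predictions (Eq. 9), and supertag predictions (Eq. 10) with $\ell = 0.01$ and batch size 100. After each training epoch, we test the parser on the dev set and compute the percentage of tokens that are assigned the correct parent, relation, supertag, and POS tag. When the percentage does not improve on five consecutive epochs, training ends.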
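The joint objective described above is an unweighted sum of four cross-entropy terms. The following sketch computes it for a single token; the dictionary layout, the task names, and the clamping constant are assumptions made for the example, not part of the original implementation.

```python
import math
from typing import Dict, Sequence

# The four prediction tasks that share the BiLSTM representation.
TASKS = ("head", "label", "pos", "stag")


def cross_entropy(probs: Sequence[float], gold: int) -> float:
    """Negative log-probability of the gold class (clamped to avoid log(0))."""
    return -math.log(max(probs[gold], 1e-12))


def joint_token_loss(predicted: Dict[str, Sequence[float]], gold: Dict[str, int]) -> float:
    """Sum of the cross-entropy losses over the four tasks, equally weighted."""
    return sum(cross_entropy(predicted[task], gold[task]) for task in TASKS)
```

Since all four terms are back-propagated through the same BiLSTM layers, errors in supertag or POS prediction shape the representation that the arc and label classifiers also use.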
This joint modeling has several advantages. First, the joint model yields a full syntactic analysis simultaneously without the need for training separate models or performing jackknife training. Secondly, joint modeling introduces a bias on the hidden representations that could allow for better generalization in each task (Caruana, 1997). Indeed, in experiments described in a later section, we show empirically that predicting POS tags and supertags benefits performance on parsing (as well as on the tagging tasks).

Results and Discussion
We follow the protocol of Bangalore et al. (2009), Chung et al. (2016), and Friedman et al. (2017); we use the grammar and the TAG-annotated WSJ Penn Treebank extracted by Chen et al. (2005). Following that work, we use Sections 01-22 as the training set, Section 00 as the dev set, and Section 23 as the test set. The training, dev, and test sets comprise 39832, 1921, and 2415 sentences, respectively. We implement all of our models in TensorFlow (Abadi et al., 2016).

Supertaggers
Our BiLSTM POS tagger yielded 97.37% and 97.53% tagging accuracy on the dev and test sets, performance on par with the state-of-the-art (Ling et al., 2015; Ma and Hovy, 2016). The middle section of Table 1 shows supertagging performance obtained from various model configurations. "Final concat" in the model name indicates that vectors from the forward and backward passes are concatenated only after the final layer; otherwise, concatenation happens after each layer. Numbers immediately after BiLSTM indicate the number of layers. CNN, HW, and POS denote respectively character-level CNNs, highway connections, and pipeline POS input from our BiLSTM POS tagger. First, the differences in performance between BiLSTM2 (final concat) and BiLSTM2 and between BiLSTM2 and BiLSTM2-CNN suggest an advantage to performing concatenation after each layer and adding character-level CNNs. Adding predicted POS to the input helps supertagging somewhat, though the difference is small. Adding a third BiLSTM layer helps only if there are highway connections, presumably because deeper BiLSTMs are more vulnerable to the vanishing/exploding gradient problem. Our supertagging model (BiLSTM3-HW-CNN-POS) that performs best on the dev set achieves an accuracy of 90.81% on the test set, outperforming the previously best result by more than 1.3%.

Table 3 shows parsing results on the dev set. Abbreviations for models are as before with one addition: Stag denotes pipeline supertag input from our best supertagger (BiLSTM3-HW-CNN-POS in Table 1). As with supertagging, we observe a gain from adding character-level CNNs. Interestingly, adding predicted POS tags or supertags deteriorates performance with BiLSTM3. These results suggest that morphological information and word information from character-level CNNs and word embeddings overwhelm the information from predicted POS tags and supertags. Again, highway connections become crucial as the number of layers increases.
We finally evaluate the parsing model with the best dev performance (BiLSTM4-HW-CNN) on the test set (Table 3). It achieves 91.37 LAS points and 92.77 UAS points, improvements of 1.8 and 1.7 points respectively from the state-of-the-art.
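For reference, the two attachment scores reported here can be computed as in the sketch below. It uses the usual definitions (UAS: correct head; LAS: correct head and label) and assumes that every token is scored; whether punctuation or other tokens are excluded in the actual evaluation is not stated in the text.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    las_hits = sum(g == p for g, p in zip(gold, pred))        # head and label
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n
```

By construction LAS can never exceed UAS, since a labeled hit requires the head to be correct as well.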

Joint Models
We provide joint modeling results for supertagging and parsing in Tables 2 and 3. For these joint models, we employed the best parsing configuration (4 layers of BiLSTMs, character-level CNNs, and highway connections), with and without POS tagging added as an additional task. We can observe that our full joint model that performs POS tagging, supertagging, and parsing further improves performance in all three tasks.

Figures 2 and 3 illustrate the relative performance of the feed-forward neural network shift-reduce TAG parser and our joint graph-based parser with respect to two of the measures explored by McDonald and Nivre (2011), namely dependency length and distance between a dependency and the root of a parse. The graph-based parser outperforms the shift-reduce parser across all conditions. Most interesting is the fact that the graph-based parser shows less of an effect of dependency length. Since the shift-reduce parser builds a parse sequentially with one parsing action depending on those that come before it, we would expect to find a propagation of errors made in establishing shorter dependencies to the establishment of longer dependencies.
Lastly, it is worth noting that our joint parsing architecture has a substantial advantage regarding parsing speed. Since POS tagging, supertagging, and parsing decisions are made independently for each word in a sentence, our system can parallelize computation once the sentence is encoded in the BiLSTM layers. Our current implementation processes 225 sentences per second on a single Tesla K80 GPU, an order of magnitude faster than the MICA system (Bangalore et al., 2009). 5

Joint Modeling and Network Representations
Given the improvements we have derived from the joint models, we analyze the nature of inductive bias that results from multi-task training and attempt to provide an explanation as to why joint modeling improves performance.

Noise vs. Inductive Bias
One might argue that joint modeling improves performance merely because it adds noise to each task and prevents over-fitting. If the introduction of noise were the key, we would still expect to gain an improvement in parsing even if the target supertag were corrupted, say by shuffling the order of supertags for the entire training data (Caruana, 1997). We performed this experiment, and the result is shown as "Joint (Shuffled Stag)" in Table 3. Parsing performance falls behind the best non-joint parser by 0.7 LAS points. This suggests that inducing the parser to create representations to predict both supertags and a parse tree is beneficial for both tasks, beyond a mere introduction of noise.
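The control condition can be sketched as a corpus-level permutation of the supertag targets. This is an illustration only: it assumes that shuffling is done at the token level across the whole training set while sentence lengths are preserved, and the function name and fixed seed are hypothetical.

```python
import random
from typing import List


def shuffle_supertags(corpus: List[List[str]], seed: int = 0) -> List[List[str]]:
    """Permute supertags across the whole corpus, keeping each sentence's length."""
    flat = [tag for sentence in corpus for tag in sentence]
    random.Random(seed).shuffle(flat)
    shuffled, start = [], 0
    for sentence in corpus:
        shuffled.append(flat[start:start + len(sentence)])
        start += len(sentence)
    return shuffled
```

The shuffled targets keep the same label distribution as the real ones but carry no information about the words they are attached to, so any gain that survived this corruption could only be attributed to noise-like regularization.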

Syntactic Analogies
We next analyze the induced vector representations in the output projection matrices of our supertagger and joint parsers using the syntactic analogy framework. Consider, for instance, the analogy that an elementary tree representing a clause headed by a transitive verb (t27) is to a clause headed by an intransitive verb (t81) as a subject relative clause headed by a transitive verb (t99) is to a subject relative headed by an intransitive verb (t109). Following the ideas in Mikolov et al. (2013) for word analogies, we can express this structural analogy as t27 - t81 + t109 = t99 and test it by cosine similarity. Table 4 shows the results of the analogy test with 246 equations involving structural analogies with only the 300 most frequent supertags in the training data. While the embeddings (projection matrix) from the independently trained supertagger do not appear to reflect the syntax, those obtained from the joint models yield linguistic structure despite the fact that the supertag embeddings (projection matrix) are trained without any a priori syntactic knowledge about the elementary trees.

5 While such computational resources were not available in 2009, our parser differs from the MICA chart parser in being able to better exploit parallel computation enabled by modern GPUs.
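The analogy test can be sketched as follows: form the query vector a - b + c, rank the remaining supertags by cosine similarity to it, and record the rank of the expected answer. The three-dimensional toy embeddings are made up for the example, and excluding the three query supertags from the candidate list is an assumption borrowed from common practice in word-analogy evaluation.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(emb: Dict[str, Vector], a: str, b: str, c: str, answer: str) -> int:
    """Rank (1 = closest) of `answer` among candidates for the query a - b + c."""
    query = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [t for t in emb if t not in (a, b, c)]
    ranked = sorted(candidates, key=lambda t: cosine(emb[t], query), reverse=True)
    return ranked.index(answer) + 1


# Toy embeddings (made up): dimensions ~ (bias, transitive, relative clause).
toy = {
    "t27": [1.0, 1.0, 0.0],   # transitive clause
    "t81": [1.0, 0.0, 0.0],   # intransitive clause
    "t99": [1.0, 1.0, 1.0],   # transitive subject relative
    "t109": [1.0, 0.0, 1.0],  # intransitive subject relative
    "t3": [1.0, 0.0, 0.2],    # unrelated distractor
}
rank = analogy_rank(toy, "t27", "t81", "t109", "t99")  # t27 - t81 + t109 vs. t99
```

Averaging this rank over all analogy equations gives the "Avg. rank" figure of the kind reported in Table 4.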
The best performance is obtained by the supertag representations obtained from the training of the transition-based parser (Friedman et al., 2017). For the transition-based parser, it is beneficial to share statistics among the input supertags that differ only by a certain operation or property during the training phase, yielding the success in the analogy task. For example, a transitive verb supertag whose object has been filled by substitution should be treated by the parser in the same way as an intransitive verb supertag. In our graph-based parsing setting, we do not have a notion of parse history or partial derivations that directly connect intransitive and transitive verbs. However, syntactic analogies still hold to a considerable degree in the vector representations of supertags induced by our joint models, with average rank of the correct answer nearly the same as that obtained in the transition-based parser.
This analysis bolsters our hypothesis that joint training biases representation learning toward linguistically sensible structure. The supertagger is just trained to predict linear sequences of supertags. In this setting, many intervening supertags can occur, for instance, between a subject noun and its verb, and the supertagger might not be able to systematically link the presence of the two in the sequence. In the joint models, on the other hand, parsing actions will explicitly guide the network to associate the two supertags.

Downstream Tasks
Previous work has applied TAG parsing to the downstream tasks of syntactically-oriented textual entailment (Xu et al., 2017) and semantic role labeling (Chen and Rambow, 2003). In this work, we apply our parsers to the textual entailment and unbounded dependency recovery tasks and achieve state-of-the-art performance. These results bolster the significance of the improvements gained from our joint parser and the utility of TAG parsing for downstream tasks.

Table 4: Syntactic analogy test results on the 300 most frequent supertags. Avg. rank is the average position of the correct choice in the ranked list of the closest neighbors; the top line indicates the result of using supertag embeddings that are trained jointly with a transition-based parser (Friedman et al., 2017).

Parser Evaluation using Textual Entailments (PETE) is a shared task from the SemEval-2010 Exercises on Semantic Evaluation (Yuret et al., 2010). The task was intended to evaluate syntactic parsers across different formalisms, focusing on entailments that could be determined entirely on the basis of the syntactic representations of the sentences that are involved, without recourse to lexical semantics, logical reasoning, or world knowledge. For example, syntactic knowledge alone tells us that the sentence John, who loves Mary, saw a squirrel entails John saw a squirrel and John loves Mary but not, for instance, that John knows Mary or John saw an animal. Prior work found the best performance was achieved with parsers using grammatical frameworks that provided rich linguistic descriptions, including CCG (Rimell and Clark, 2010; Ng et al., 2010), Minimal Recursion Semantics (MRS) (Lien, 2014), and TAG (Xu et al., 2017). Xu et al. (2017) provided a set of linguistically motivated transformations to use TAG derivation trees to solve the PETE task. We follow their procedures and evaluation for our new parsers.
We present test results from two configurations in Table 5. One configuration is a pipeline approach that runs our BiLSTM POS tagger, supertagger, and parser. The other one is a joint approach that only uses our full joint parser. The joint method yields 78.1% in accuracy and 76.4% in F1, improvements of 2.4 and 2.7 points over the previously reported best results.

Unbounded Dependency Recovery
The unbounded dependency corpus (Rimell et al., 2009) specifically evaluates parsers on unbounded dependencies, which involve a constituent moved from its original position, where an unlimited number of clause boundaries can intervene. The corpus comprises 7 constructions: object extraction from a relative clause (ObRC), object extraction from a reduced relative clause (ObRed), subject extraction from a relative clause (SbRC), free relatives (Free), object wh-questions (ObQ), right node raising (RNR), and subject extraction from an embedded clause (SbEm). Because of variations across formalisms in their representational format for unbounded dependencies, past work has conducted manual evaluation on this corpus (Rimell et al., 2009; Nivre et al., 2010). We instead conduct an automatic evaluation using a procedure that converts TAG parses to structures directly comparable to those specified in the unbounded dependency corpus. To this end, we apply two types of structural transformation in addition to those used for the PETE task: 1) a more extensive analysis of coordination, 2) resolution of differences in dependency representations in cases involving copula verbs and co-anchors (e.g., verbal particles). See Appendix A for details. After the transformations, we simply check if the resulting dependency graphs contain target labeled arcs given in the dataset.

Table 6 shows the results. Our joint parser outperforms the other parsers, including the neural network shift-reduce TAG parser. Our data-driven parsers yield relatively low performance in the ObQ and RNR constructions. Performance on ObQ is low, we expect, because of the rarity of object wh-questions in the data on which the parser is trained. 7 For RNR, rarity may be an issue as well as the limits of the TAG analysis of this construction. Nonetheless, we see that the rich structural representations that a TAG parser provides enable substantial improvements in the extraction of unbounded dependencies.
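The final check in the automatic evaluation amounts to set membership over labeled arcs. The sketch below is an illustration under stated assumptions: arcs are written as (child, parent, label) triples using word forms as node identifiers, and an item counts as recovered only if all of its target arcs are present. Neither detail is specified in the text, and the example arcs are invented.

```python
from typing import Iterable, List, Set, Tuple

Arc = Tuple[str, str, str]  # (child, parent, label)


def contains_targets(parse_arcs: Set[Arc], target_arcs: Iterable[Arc]) -> bool:
    """True if every target labeled arc appears in the transformed parse."""
    return all(arc in parse_arcs for arc in target_arcs)


def recovery_accuracy(items: List[Tuple[Set[Arc], List[Arc]]]) -> float:
    """Percentage of items whose target arcs are all recovered."""
    if not items:
        raise ValueError("no items to evaluate")
    hits = sum(contains_targets(parse, targets) for parse, targets in items)
    return 100.0 * hits / len(items)
```

A real evaluation would use token positions rather than word forms, so that repeated words in a sentence are not confused.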
In the future, we hope to evaluate state-of-the-art Stanford dependency parsers automatically.

Related Work
The two major classes of data-driven methods for dependency parsing are often called transition-based and graph-based parsing (Kübler et al., 2009). Transition-based parsers (e.g. MALT (Nivre, 2003)) learn to predict the next transition given the input and the parse history. Graph-based parsers (e.g. MST (McDonald et al., 2005)) are trained to directly assign scores to dependency graphs. Empirical studies have shown that a transition-based parser and a graph-based parser yield similar overall performance across languages (McDonald and Nivre, 2011), but the two strands of data-driven parsing methods manifest the fundamental trade-off of parsing algorithms. The former prefers rich feature representations with parsing history over global training and exhaustive search, and the latter allows for global training and inference at the expense of limited feature representations (Kübler et al., 2009).
Recent neural network models for transition-based and graph-based parsing can be viewed as remedies for the aforementioned limitations. Andor et al. (2016) developed a transition-based parser using feed-forward neural networks that performs global training approximated by beam search. The globally normalized objective addresses the label bias problem and makes global training effective in the transition-based parsing setting. Kiperwasser and Goldberg (2016) incorporated a dynamic oracle (Goldberg and Nivre, 2013) in a BiLSTM transition-based parser that remedies global error propagation. Kiperwasser and Goldberg (2016) and others proposed graph-based parsers that have access to rich feature representations obtained from BiLSTMs.

7 The substantially better performance of the C&C parser is in fact the result of additions that were made to the training data.
Previous work integrated CCG supertagging and parsing using belief propagation and dual decomposition approaches (Auli and Lopez, 2011). Nguyen et al. (2017) incorporated a graph-based dependency parser (Kiperwasser and Goldberg, 2016) with POS tagging. Our work follows these lines of effort and improves TAG parsing performance.

Conclusion and Future Work
In this work, we presented a state-of-the-art TAG supertagger, a parser, and a joint parser that performs POS tagging, supertagging, and parsing. The joint parser has the benefit of giving a full syntactic analysis of a sentence simultaneously. Furthermore, the joint parser achieved the best performance, an improvement of over 2.2 LAS points over the previous state of the art. We have also seen that the joint parser yields state-of-the-art results in textual entailment and unbounded dependency recovery tasks, which raises the possibility that TAG can provide useful structural analysis of sentences for other NLP tasks. We will explore more applications of our TAG parsers in future work.

A Transformations for Unbounded Dependency Recovery Corpus
For automatic evaluation on the unbounded dependency recovery corpus (UDR, Rimell et al. (2009)), we run a simple conversion of dependency labels in UDR to those in our TAG grammar (see Table 7), with a couple of exceptions:
• Change arcs from verbs to wh-adverbs as in "where is the city located?" to adjunction.
• Reflect causative-inchoative alternation in the subject embedded construction. Concretely, change the role of "door" in "hold the door shut" from the subject to the object of "shut."
We then transform TAG dependency trees. Finally, we simply check if the resulting dependency graphs contain target labeled arcs given in the dataset.
Below is a full description of the transformations. This set of structural transformations is applied in the order in which it is presented, so that the output of previous transformations can feed subsequent ones. In the following, we denote an arc pointing from node B to node A with label C as (A, B, C), where A and B are called the child (dependent) and the parent (head) in the relation, respectively.
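The label conversion and the final check described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used for the experiments: the `Arc` tuple, the function names, and the use of word forms in place of token positions are assumptions made for the example, and the two exceptions to the label conversion are left out. The mapping itself follows Table 7.

```python
from typing import NamedTuple, Set


class Arc(NamedTuple):
    """An arc (A, B, C): child A, parent B, label C."""
    child: str
    parent: str
    label: str


# UDR-to-TAG label mapping taken from Table 7; every other label maps to ADJ.
UDR_TO_TAG = {
    "nsubj": "0", "cop": "0",
    "dobj": "1", "pobj": "1", "obj2": "1", "nsubjpass": "1",
}


def convert_label(udr_label: str) -> str:
    """Convert a UDR dependency label to the corresponding TAG label."""
    return UDR_TO_TAG.get(udr_label, "ADJ")


def recovers(target: Arc, parsed: Set[Arc]) -> bool:
    """Check whether the (transformed) parse contains the target UDR arc."""
    converted = Arc(target.child, target.parent, convert_label(target.label))
    return converted in parsed


# Hypothetical parser output for "we adopted what I would term pseudocapitalism."
parsed_arcs = {Arc("what", "adopted", "1")}
found = recovers(Arc("what", "adopted", "dobj"), parsed_arcs)
```

In practice, tokens would need to be identified by position rather than by word form, since a word can occur more than once in a sentence.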

A.1 Transformations from PETE
We apply three types of transformation from Xu et al. (2017) to interpret the TAG parses.
Relative Clauses When an elementary tree of a relative clause adjoins into a noun, we add a reverse arc with the label reflecting the type of the relative clause elementary tree. For a subject relative, we add a 0-labeled arc, for an object relative, we add a 1-labeled arc, and so forth.
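As a rough sketch of the relative clause transformation, assume that arcs are (child, parent, label) triples and that a hypothetical dictionary `relative_type` records, for each word anchoring a relative clause elementary tree, the label of the relativized argument ("0" for a subject relative, "1" for an object relative). Both the data layout and the function name are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Arc = Tuple[str, str, str]  # (child, parent, label)


def add_relative_clause_arcs(arcs: Set[Arc],
                             relative_type: Dict[str, str]) -> Set[Arc]:
    """Add a reverse arc for every relative clause adjoining into a noun."""
    result = set(arcs)
    for child, parent, label in arcs:
        # The relative clause (child) adjoins into the noun (parent).
        if label == "ADJ" and child in relative_type:
            # Reverse arc: the noun becomes an argument of the clause head.
            result.add((parent, child, relative_type[child]))
    return result


# Toy example (not from the corpus): "the man who slept", a subject relative.
arcs = {("slept", "man", "ADJ")}
transformed = add_relative_clause_arcs(arcs, {"slept": "0"})
```

The original adjunction arc is kept, so later transformations can still see it.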

UDR Labels                      TAG Labels
nsubj, cop                      0
dobj, pobj, obj2, nsubjpass     1
others (advmod, etc.)           ADJ

Sentential Complements Sentential complementation in TAG derivations can be analyzed via either adjoining the higher clause into the embedded clause (necessarily so in cases of long-distance extraction from the embedded clause) or substituting the embedded clause in the higher clause. In order to normalize this divergence, for an adjunction arc involving a predicative auxiliary elementary tree (supertag), we add a reverse arc involving the 1 relation (sentential complements).

A.2 Coordination
We roughly follow the method presented in Xu et al. (2017) with extensions. Under the TAG analysis, VP coordination involves a VP-recursive auxiliary tree headed by the coordinator that includes a VP substitution node (for the second conjunct) with label 1. In order to allow the first clause's subject argument (as well as modal verbs and negations) to be shared by the second verb, we add the relevant relations to the second verb. In addition, we analyze sentential coordination cases. Sentence coordination in our TAG grammar usually happens between two complete sentences and no modifiers or arguments are shared, and therefore it can be analyzed via substituting a sentence into the coordinator with label 1. However, when sentential coordination happens between two relative clause modifiers, our TAG grammar analyzes the second clause as a complete sentence, meaning that we need to recover the extracted argument by consulting the properties of the first clause. Furthermore, the deep syntactic role of the extracted argument can be different in the two relative clauses. For instance, in the sentence "... the same stump which had impaled the car of many a guest in the past thirty years and which he refused to have removed," we need to recover an arc from removed to stump with label 1, whereas the arc from impaled to stump has label 0. To resolve this issue, when there is coordination of two relative clause modifiers, we add an edge from the head of the second clause to the modified noun with the same label as that under which the relative pronoun is attached to the head.

A.3 Resolving Differences in Dependency Representations
Small Clauses The UDR corpus is inconsistent with regard to small clauses. In the subject embedded constructions, UDR analyzes a small clause as containing a subject and a complement, as in (guy, liar, nsubj) for "the guy who I call a liar." However, in the object question and object free relative constructions, a small clause is analyzed as two arguments of the verb. For instance, UDR specifies (what, adopted, dobj) in "we adopted what I would term pseudocapitalism." To solve this problem, we add an arc from the head of the matrix clause to the subject in a small clause with label 1.
Co-anchors In our TAG grammar, co-anchor attachment represents substitution into a node that is construed as a co-head of an elementary tree. For instance, "for" is deemed a co-anchor to "hope" in the sentence "that is exactly what I'm hoping for" (Figure 4). In this case, UDR would pick the relation (what, for, pobj). Therefore, when there is a co-anchor to a head tree, we add all arcs that involve the head tree to the co-anchor tree.
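The co-anchor rule can be sketched as below, under the assumptions that arcs are (child, parent, label) triples, that "arcs that involve the head tree" covers arcs with the head in either position, and that the co-anchor attachment arc itself is not copied. The label "CO" for the co-anchor attachment and the function name are placeholders chosen for the example.

```python
from typing import Set, Tuple

Arc = Tuple[str, str, str]  # (child, parent, label)


def copy_arcs_to_coanchor(arcs: Set[Arc], coanchor: str, head: str) -> Set[Arc]:
    """Duplicate every arc involving the head so that it also involves the co-anchor."""
    result = set(arcs)
    for child, parent, label in arcs:
        if coanchor in (child, parent):
            continue  # do not copy the co-anchor attachment itself
        if parent == head:
            result.add((child, coanchor, label))
        elif child == head:
            result.add((coanchor, parent, label))
    return result


# "that is exactly what I'm hoping for": "for" is a co-anchor of "hope".
arcs = {("what", "hope", "1"), ("I", "hope", "0"), ("for", "hope", "CO")}
transformed = copy_arcs_to_coanchor(arcs, coanchor="for", head="hope")
```

With these inputs, the arc (what, hope, 1) gives rise to (what, for, 1), which is the arc needed to match the UDR annotation in the Figure 4 example.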
Wh-determiners and Wh-adverbs Our TAG grammar analyzes a wh-determiner via adjoining the noun into the wh-determiner (Figure 5). This is also true for cases where a wh-adverb is followed by an adjective and a noun, as in "how many battles did she win?" In contrast, the UDR corpus analyzes the noun as the head of the constituent. In order to resolve this discrepancy, when a word adjoins into a wh-word, 8 we take all arcs that have the wh-word as the child and add the arcs obtained from them by replacing the wh-word child with the word adjoining into the wh-word.
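A minimal sketch of this rule, assuming arcs are stored as (child, parent, label) triples and that the set of wh-words is given (both assumptions of the example, as is the function name):

```python
from typing import Set, Tuple

Arc = Tuple[str, str, str]  # (child, parent, label)


def propagate_wh_arcs(arcs: Set[Arc], wh_words: Set[str]) -> Set[Arc]:
    """For each word adjoining into a wh-word, copy the wh-word's incoming arcs to it."""
    result = set(arcs)
    for adj_child, adj_parent, adj_label in arcs:
        if adj_label != "ADJ" or adj_parent not in wh_words:
            continue
        # adj_child adjoins into the wh-word adj_parent.
        for child, parent, label in arcs:
            if child == adj_parent:
                # Replace the wh-word child with the adjoining word.
                result.add((adj_child, parent, label))
    return result


# "What songs did he sing?": "songs" adjoins into the wh-determiner "what".
arcs = {("what", "sing", "1"), ("songs", "what", "ADJ")}
transformed = propagate_wh_arcs(arcs, wh_words={"what"})
```

Here (what, sing, 1) and (songs, what, ADJ) together produce (songs, sing, 1), which corresponds to the UDR arc (songs, sing, dobj) after label conversion.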
Copulas A copula is usually treated as a dependent of the predicate both in our TAG grammar (adjunction) and in UDR. However, we found two situations where they differ from each other. First, when wh-extraction happens on the complement, as in "obviously there has been no agreement on what American conservatism is, or rather, what it should be," the TAG grammar analyzes it via substituting the wh-word ("what") into the copula ("is"). To reconcile this disagreement between the TAG grammar and UDR, when substitution happens into a be verb, we add the substitution into the copula. 9 Second, UDR treats non-be copulas differently than be verbs. An example is the UDR relation (those, stayed, nsubj) in "in the other hemisphere it is growing colder and nymphs, those who stayed alive through the summer, are being brought into nests for quickening and more growing," where our parser yields (those, alive, 0). For this reason, when the lemma of a verb is a non-be copula, 10 we add the arcs involving the word that the copula adjoins into to the copula.
8 We considered imposing a stricter condition that the word adjoining into the wh-word be a noun, but we found cases that this method fails to cover; for example, UDR gives (much, get, dobj) for the sentence "opinion is mixed on how much of a boost the overall stock market would get even if dividend growth continues at double-digit levels."
Figure 4: Co-anchor case from the sentence "that is exactly what I'm hoping for." UDR gives the red arc (what, for, pobj). The blue arc (what, for, 1) is obtained from (what, hope, 1).
Figure 5: Wh-determiner case from the sentence "What songs did he sing?" UDR gives the red arc (songs, sing, dobj). The blue arc (songs, sing, 1) is obtained from (what, sing, 1) and (songs, what, ADJ).

PP attachment with multiple noun candidates
We observed that PP attachment with multiple noun candidates is often at stake in UDR. 11 For instance, UDR provides (part, had, nsubj) and (several, tried, nsubj) for the sentences "... there is no part of the earth that has not had them" and "there were several on the Council who tried to live like Christians," while the TAG parser outputs (earth, had, nsubj) and (Council, tried, nsubj), respectively. While we count these cases as "wrong" since they manifest certain disambiguation (though not purely unbounded dependency recovery), we ignore superficial (conventional) differences in head selection. In our TAG grammar "a lot of people" would be headed by "lot" whereas UDR would recognize "people" as the head. Hence, when "lot/lots/kind/kinds/none of" occurs, we add all arcs with "lot/lots/kind/kinds/none" to the head of the phrase that is the object of "of."
Modals In the UDR corpus, a modal depends on an auxiliary verb following the modal, if there is one. For example, "Rosie reinvented this man, who may or may not have known about his child" is given the relation (may, have, aux). In the TAG grammar, both "may" and "have" adjoin into "known." Therefore, when the head of a modal has another child with adjunction, we add an arc from the child to the modal.
Existential there UDR gives the "cop" relation between an existential there and the be verb. For example, it gives (be, legislation, cop) in "... on how much social legislation there should be." On the other hand, our TAG grammar analyzes that