Wide-Coverage Neural A* Parsing for Minimalist Grammars

Minimalist Grammars (Stabler, 1997) are a computationally oriented and rigorous formalisation of many aspects of Chomsky’s (1995) Minimalist Program. This paper presents the first ever application of this formalism to the task of realistic wide-coverage parsing. The parser uses a linguistically expressive yet highly constrained grammar, together with an adaptation of the A* search algorithm currently used in CCG parsing (Lewis and Steedman, 2014; Lewis et al., 2016), with supertag probabilities provided by a bi-LSTM neural network supertagger trained on MGbank, a corpus of MG derivation trees. We report on some promising initial experimental results for overall dependency recovery as well as on the recovery of certain unbounded long distance dependencies. Finally, although, like other MG parsers, ours has a high-order polynomial worst-case time complexity, we show that in practice its expected time complexity is cubic in the length of the sentence. The parser is publicly available.


Introduction
Parsers based on linguistically expressive formalisms, such as Head-Driven Phrase Structure Grammar (HPSG; Pollard and Sag 1994) and Combinatory Categorial Grammar (CCG; Steedman 1996), have been shown to be more effective than those using shallower finite state or context free grammars at recovering certain unbounded long-distance dependencies (Rimell et al., 2009; Nivre et al., 2010), which are often vital for tasks like open domain question answering. Furthermore, as proven independently by Huybregts (1984) and Shieber (1985), some languages exhibit constructions which are beyond even the weak generative capacity of any context free grammar. The investigation of parsing systems based on more powerful (mildly) context sensitive formalisms has therefore been a very active area of research within the field of computational psycholinguistics over the past 35 years or so (see, e.g., Joshi (1985, 1990); Steedman (2000); Hale (2011); Stabler (2013); Rambow and Joshi (2015); Stanojević and Stabler (2018)).
Another linguistically expressive grammatical framework is Transformational Grammar (TG; Chomsky, 1957, 1965, 1981), whose latest incarnation is the Minimalist Program (MP; Chomsky 1995). A defining property of MP is that constituents move. For example, in 1a below, what moves to the left periphery of the clause from a deep subject position and will therefore be interpreted as the semantic AGENT of eat; in 1b, meanwhile, it moves from the deep object position and so is interpreted instead as the semantic PATIENT.
(1) a. What_i do you think t_i eats mice?
    b. What_i do you think mice eat t_i?
MP continues to dominate much of theoretical syntax, and Stabler's (1997) rigorous formalisation has proven a popular choice for investigations into human sentence processing (Hale, 2003; Kobele et al., 2013; Stabler, 2013; Graf and Marcinek, 2014; Graf et al., 2015; Gerth, 2015; Stanojević and Stabler, 2018). On the other hand, TG has enjoyed far less popularity within computational linguistics more generally, which is unfortunate given that it is arguably the most extensively developed syntactic theory across the greatest number of languages, many of which are otherwise under-resourced. Conversely, the process of constructing large grammar fragments and subjecting these to computational testing can have a salutary impact on the theory itself, forcing choices between competing analyses of the same construction, and exposing incompatibilities between analyses of different constructions, along with areas of over- and undergeneration which may otherwise go unnoticed (Bierwisch, 1963; Abney, 1996).
The received wisdom within NLP is that TG/MP is too complex and insufficiently formalized to be applied to realistic parsing tasks. This assumption prompted Sproat and Lappin (2005) to issue a challenge to the Minimalist community which has hitherto gone unanswered: to construct a wide-coverage statistical parser trained in a supervised fashion and exhibiting comparable performance to other state-of-the-art parsers. This paper is the first to address this challenge, introducing the first ever wide-coverage parser in the Minimalist (and arguably the entire TG) tradition, along with some promising initial experimental results. The parser is equipped with a linguistically expressive, wide-coverage grammar based on an extended version of Stabler's (1997) Minimalist Grammars (MG) formalism, which is a rigorously formal, computationally oriented and polynomially parseable interpretation of mainstream MP that is weakly equivalent to Multiple Context Free Grammars (MCFG; Seki et al. 1991). The parser itself is an adaptation of a highly efficient A* CCG parsing algorithm (Lewis and Steedman, 2014; Lewis et al., 2016) with a bi-LSTM model trained on MGbank, an MG version of the Penn Treebank currently under development.

Background
Beginning in the 1960s, a number of parsers were developed which implemented aspects of the various iterations of Chomskyan syntactic theory (e.g. Petrick 1965; Zwicky et al. 1965; Woods 1970, 1973; Marcus 1980; Kuhns 1990; Fong 1991; Stabler 1992; Fong and Ginsburg 2012), but these systems either used toy grammars/lexicons or operated over relatively closed domains.
Principar (Lin, 1993), and its descendant Minipar (Lin, 2001, 2003), are the only truly wide-coverage parsers in the Chomskyan tradition of which we are aware. Minipar incorporates MP's bare phrase structure and some of its economy principles. It is also statistical, having been self-trained on a 1GB corpus. However, while these parsers model the phrase structure and locality constraints of TG, they are not transformational: movement is achieved by passing features up a precompiled network of nodes representing a tree from the site of the trace to the site of the antecedent, with the latter merged directly into its surface position, in the style of GPSG. Under this approach, antecedents necessarily c-command their traces (Lin, 1993, page 115), making these parsers unsuitable for implementing MP analyses involving remnant movement (see Stabler 1999).

MG parsers
A number of working parsers have been developed for Stablerian MGs, which do allow for actual movement, including remnant movement. What all working MG parsers (Harkema, 2001; Hale, 2003; Stabler, 2013; Stanojević and Stabler, 2018) have until now had in common is that they are small-scale theoretical implementations equipped only with toy lexicons/grammars. There has been a limited amount of research into probabilistic MG parsing, most notably in generative locally normalised models (Hale, 2003; Hunter and Dyer, 2013). However, these works remain so far untested owing to the unavailability, until very recently, of any MG treebank for training and evaluating models.

MGbank
MGbank (Torr, 2017, 2018) is a treebank of MG derivation trees constructed in part manually by hand-annotating a subset of PTB sentences and in part automatically using a parser equipped with the manually constructed grammar and guided by the corresponding PTB and CCGbank (Hockenmaier and Steedman, 2007) structures. The corpus was continuously machine tested for over- and undergeneration throughout its development. It currently covers over 463,000 words of the PTB, or nearly 56% of its trees, with a lexicon of over 47,100 entries; the average sentence length in MGbank is 16.9 (vs 21.7 in the original PTB) and the maximum sentence length is 50. The derivation trees produced by the parser were transduced into Xbar and MG derived phrase structure trees, which are also included in the treebank.
The MGbank grammar has been designed to capture many long distance dependencies not included in the original treebank, including the binding of reflexive/reciprocal anaphors and floating quantifiers by their antecedents, the dependency between the two subconstituents of a discontinuous quoted expression ("funny thing," says the kicker, "both these candidates are named Rudolph Giuliani."), the licensing of polarity items such as anymore, anyway and much by interrogative and negation heads, and the distributional dependency between expletive there and an indefinite DP associate. All of these long distance dependencies, along with those involved in control, raising, topicalization and wh movement, are integrated into the grammar itself, obviating the need for separate post-processing techniques to recover them (Johnson, 2002; Cahill et al., 2004). The MG lexical categories have also been annotated with over 100 fine-grained selectional and agreement restriction features (e.g. +3SG, -NOM, +INF, MASC, +INDEF, +FOR, MNR, +LOC, etc.) to avoid many instances of unwanted overgeneration.
Movement is clearly a very powerful operation. However, it is constrained in MGbank using many of the locality constraints proposed in the TG literature. These include not only Stabler's (1997) strict version of the Shortest Move Constraint, but also a partially derelativized version (DSMC) inspired by Rizzi (1990), along with versions of the specifier/adjunct island constraints, the right roof constraint, complex NP constraint, coordinate structure constraint, that-trace filter, Principle A of Chomsky's (1981) Binding Theory, and so on.

Minimalist Grammars
Our parser uses the MG formalism described in Torr and Stabler (2016; henceforth T&S) and Torr (2018, 2019). Here we give only a brief overview. MGs are strongly lexicalised, with ordered feature sequences on lexical categories determining both the subcategorization frames of words and the movement operations which must apply. There are four basic types of structure building features: =x/x= selectors and x selectees, and +f licensors and -f licensees. Selectors and selectees trigger Merge operations, with x= indicating rightward selection and =x leftward selection (similar to the forward and backward slash notation in CCG). Licensors and licensees trigger Move operations. Except for a single c selectee at the root of the tree, all features entering the derivation must be checked and deleted by applying one of a small set of (here, around 45) abstract binary Merge and unary Move rules; these rules concatenate and reorder expressions' string components.
Consider the following MG lexicon.
ε, they, ε :: d
ε, saw, ε :: d= =d v
ε, who, ε :: d -wh
ε, [int], ε :: v= +WH c
Each entry consists of a string component, followed by a type separator, followed by a sequence of syntactic features. The epsilons represent empty strings and are slots for left and right dependent strings to be merged into. Strings enclosed in square brackets are also empty, and are written in this form at the lexical level simply to make the trees easier to read. Figure 1 shows the MG derivation tree for the embedded question who they saw, along with its corresponding phrase structure tree, in which indicates a position from which a phrase has moved; the leaf nodes of the derivation tree are lexical items while the final surface string appears at the root node; binary nodes represent the result of a Merge operation while unary nodes represent the result of a Move operation. The interesting step occurs at the lowest binary node: because who has a -wh licensee still to check, its string is not merged into the right ε (complement) slot of saw when these two items are Merged; instead, it is kept in a separate moving chain until its -wh feature is checked by the +WH of [int] via an application of Move.
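The feature checking just described can be sketched in a few lines of Python. This is only an illustrative toy, not the parser's implementation: the names `Expr`, `merge`, `move` and `join` are hypothetical, the roughly 45 rules of the real grammar are collapsed into one Merge and one Move function, licensor and licensee names are matched case-insensitively (an assumption made to fit the +WH/-wh pair in the example), and [int] is given a final c feature so that the derivation ends in the root category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Head chain (left, head, right strings and remaining features) plus moving chains."""
    left: str
    head: str
    right: str
    feats: tuple
    movers: tuple = ()  # each mover is (string, remaining features)


def join(*parts):
    """Concatenate the non-empty strings with single spaces."""
    return " ".join(p for p in parts if p)


def merge(selector, selectee):
    """Check a selector (x= or =x) against a matching selectee x."""
    sel, cat = selector.feats[0], selectee.feats[0]
    if "=" not in sel or sel.strip("=") != cat:
        raise ValueError(f"cannot merge {sel} with {cat}")
    phrase = join(selectee.left, selectee.head, selectee.right)
    left, right = selector.left, selector.right
    movers = selector.movers + selectee.movers
    if selectee.feats[1:]:
        # Unchecked licensees remain: keep the string in a separate moving chain.
        movers += ((phrase, selectee.feats[1:]),)
    elif sel.endswith("="):
        right = join(right, phrase)  # x= selects rightward
    else:
        left = join(phrase, left)    # =x selects leftward
    return Expr(left, selector.head, right, selector.feats[1:], movers)


def move(expr):
    """Check a +f licensor against the unique mover whose next feature is -f."""
    lic = expr.feats[0]
    if not lic.startswith("+"):
        raise ValueError(f"{lic} is not a licensor")
    target = "-" + lic[1:].lower()  # assumption: case-insensitive match
    hits = [i for i, (_, fs) in enumerate(expr.movers) if fs[0].lower() == target]
    if len(hits) != 1:
        raise ValueError("expected exactly one matching mover")
    string, fs = expr.movers[hits[0]]
    movers = expr.movers[:hits[0]] + expr.movers[hits[0] + 1:]
    left = expr.left
    if fs[1:]:
        movers += ((string, fs[1:]),)  # still has licensees to check
    else:
        left = join(string, left)      # final landing site
    return Expr(left, expr.head, expr.right, expr.feats[1:], movers)


they = Expr("", "they", "", ("d",))
saw = Expr("", "saw", "", ("d=", "=d", "v"))
who = Expr("", "who", "", ("d", "-wh"))
int_head = Expr("", "", "", ("v=", "+WH", "c"))  # [int] is phonetically empty

vp = merge(merge(saw, who), they)      # who becomes a mover, they fills the left slot
result = move(merge(int_head, vp))     # Move lands who at the left edge
print(join(result.left, result.head, result.right), result.feats)
```

Each intermediate `Expr` in this toy corresponds to one node label of the derivation for who they saw: the string triple, the remaining features of the head, and any moving chains.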

The Parser
Our parser uses an adaptation of the A* search algorithm for CCG presented in Lewis and Steedman (2014) (henceforth, L&S). In this section we first review that algorithm, before going on to show how it was adapted to the MG formalism.

A* CCG parsing
Combinatory Categorial Grammar (CCG; Steedman 2000) is another linguistically expressive formalism capable of recovering unbounded long distance dependencies. Like MG, CCG is strongly lexicalised, with a large lexical category set and a small set of abstract combinatory rules, the most basic of which is forward/backward application (equivalent to MG's Merge). Categories are either basic (NP, S, etc.) or functional. The functional categories determine the subcategorization frame of the words they label. For example, the category for a transitive verb is (S\NP)/NP, which says that this word must combine with an (object) NP on its right (indicated by the forward slash), which will yield a category which must combine with a second (subject) NP on its left (indicated by the backward slash). In place of movement, CCG uses type raising and function composition rules to capture unbounded long distance dependencies.
Figure 1: MG derivation tree (left) and phrase structure tree (right) for the embedded question who they saw. The derivation has been simplified for ease of exposition by removing case and head movements, as well as the null tense and light verb heads. Node labels of the derivation tree, root first:
who, ε, they saw : c
ε, ε, they saw : +WH c, who : -wh
they, saw, ε : v, who : -wh
ε, saw, ε : =d v, who : -wh
ε, who, ε :: d -wh
ε, saw, ε :: d= =d v
ε, they, ε :: d
CCG already has a very well-established research tradition in wide-coverage parsing (see, e.g., Hockenmaier and Steedman (2002)). A key advancement in CCG parsing that enabled it to become efficient enough to support large-scale NLP tasks was the introduction of Markovian supertagging techniques in Clark and Curran (2007b) that were borrowed from Lexicalised Tree Adjoining Grammar (LTAG; Bangalore and Joshi, 1999). Because the supertags predetermine much of the combinatorics, supertagging is sometimes referred to as 'almost parsing'.
Inspired by the A* algorithm for PCFGs of Klein and Manning (2003), L&S present a simple yet highly effective CCG parsing model which is factored over the probabilities assigned by the lexical supertagger alone, with no explicit model of the derivation at all. This approach is highly efficient and avoids the need for aggressively pruning the search space, which degraded the performance of earlier CKY CCG parsers. Instead, the parser considers the complete distribution of the 425 most commonly occurring CCG lexical categories for each word. The supertagger was originally a unigram log-linear classifier, but Lewis et al. (2016) greatly enhanced its accuracy by exchanging this for a stacked bi-LSTM neural model.
The key difference between A* and CKY CCG parsing is that A* uses search heuristics that avoid building the whole chart without compromising the correctness guarantees. This is achieved using an agenda implemented as a priority queue of items ranked by their cost, which combines their inside cost with an admissible estimate of their outside cost (in probability terms, the product of the inside probability and an upper bound on the outside probability). The agenda is initialised with the full set of 425 supertags for each word. The parser pops the item with the lowest cost from the agenda, stores it in the chart if it is not already there, and attempts to combine it with other items already present in the chart. Newly created items have their costs calculated before being added to the priority queue agenda. The entire process is repeated until a complete parse for the sentence is returned. The algorithm guarantees that the first parse returned is the most probable (i.e. the Viterbi parse) according to the model.
L&S treat a CCG parse y as a list of lexical categories $c_0 \ldots c_{n-1}$ together with a derivation, and make the simplifying assumptions that all derivations licensed by the grammar are equally likely, and that the probability of a given lexical category assignment is conditionally independent of all the other assignments given the sentence. Let Y be the set of all derivations licensed by the grammar; then the optimal parse $\hat{y}$ for a given sentence S with words $w_0 \ldots w_{n-1}$ is given as:
$$\hat{y} = \arg\max_{y \in Y} \prod_{k=0}^{n-1} p(c_k \mid S) \qquad (1)$$
Let $\alpha$ be a set of indices $\{i, \ldots, j\}$ for words $w_i \ldots w_j$ labelled with category sequence $c_i \ldots c_j$ inside some expression. The inside probability of $\alpha$ is simply the product of the probabilities of the lexical category assignments given the sentence:
$$\prod_{k \in \alpha} p(c_k \mid S) \qquad (2)$$
The upper bound estimate for the outside probability of a span $\alpha$ is given by
$$\prod_{k \notin \alpha} \max_{c_k} p(c_k \mid S) \qquad (3)$$
where $\max_{c_k} p(c_k \mid S)$ is simply the probability of the most likely lexical category for word $w_k$ according to the supertagger; this can be precomputed for the sentence and cached. To avoid numerical errors caused by multiplying together extremely small numbers, we convert the probabilities to log space costs and use addition rather than multiplication.
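To make the log space formulation concrete, the following minimal sketch computes the agenda priority of an item as its inside cost plus the cached outside bound, and initialises a heap-based agenda with one item per word and supertag. The tag distributions are invented toy values, and the function names (`best_costs`, `inside_cost`, `outside_bound`, `priority`) are hypothetical; the combination step of the parser is deliberately left out.

```python
import heapq
import math

# Hypothetical supertagger output: one distribution p(c | S) per word.
tag_probs = [
    {"NP": 0.9, "N": 0.1},
    {"(S\\NP)/NP": 0.6, "S\\NP": 0.4},
    {"NP": 0.8, "N": 0.2},
]


def best_costs(tag_probs):
    """Cost of the most probable supertag of each word, cached once per sentence."""
    return [-math.log(max(dist.values())) for dist in tag_probs]


def inside_cost(tag_probs, assignment):
    """Sum of -log p(c_k | S) over the word indices covered by an item."""
    return sum(-math.log(tag_probs[k][c]) for k, c in assignment.items())


def outside_bound(best, covered):
    """Sum of the best per-word costs over the words not covered by the item."""
    return sum(cost for k, cost in enumerate(best) if k not in covered)


def priority(tag_probs, best, assignment):
    """Agenda priority: inside cost plus the bound on the outside cost."""
    return inside_cost(tag_probs, assignment) + outside_bound(best, assignment.keys())


best = best_costs(tag_probs)
agenda = []
for k, dist in enumerate(tag_probs):
    for tag in dist:
        heapq.heappush(agenda, (priority(tag_probs, best, {k: tag}), k, tag))

cost, index, tag = heapq.heappop(agenda)  # lowest-cost item is explored first
print(index, tag, round(cost, 4))
```

Because the bound uses the best tag for every uncovered word, the priority of an item never exceeds the cost of any complete parse that contains it, which is the property the search relies on.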

A* MG parsing
The simplicity, speed and performance of L&S's A* CCG parser made it attractive for a first implementation of a wide-coverage MG parser. However, while CCG and MG are similar in some respects (such as the fact that they are both strongly lexicalised), there are also some fundamental differences between the formalisms which mean that some adaptations are needed in order to port this A* algorithm to MGs. The first (trivial) issue is that MG derivations feature discontinuous spans in order to allow for movement, as we saw in Figure 1. Therefore, we must redefine α in Equations 2 and 3 to be the set of word indices covered by all the spans contained within an MG expression.
The second issue is that, following Kobele (2008) and T&S, the MGbank grammar allows for so-called Across-the-Board (ATB) phrasal and head movements in order to capture adjunct control, parasitic gaps, and certain coordination structures. ATB phrasal movement is illustrated in 2 below.
(2) Who i did Jack say Mary likes t i and Pete hates t i ?
In 2, who has moved from two separate base generated object positions in across-the-board fashion. T&S (following Kobele 2008) propose to account for this by initially generating two instances of who in the two object positions and then later unifying them into a single item when the second conjunct is merged into the main structure. For A*, when two expressions containing unifiable movers are merged together, only one of those movers must contribute to the cost of the resulting expression in order to avoid excessive penalisation for what is now just a single instance of the moving item. We can achieve this for both ATB head and phrasal movement by first calculating the sum of the costs of the two expressions that are Merged, and then subtracting from this the cost of one member of each pair of unified movers.
In the MGbank grammar (unlike in Kobele 2008), it can be the case that two unified (head) movers have different derivational histories, in which case they may well have different costs. In such cases, the parser uses the greater of these two costs when calculating the inside cost of the newly formed expression. If the lower of the two costs were used instead, some costs might no longer increase monotonically.
The final problem relates to the fact that, unlike CCG, MG allows for phonetically null heads (following mainstream MP), but supertaggers can only tag the overt words of a sentence. However, we would like our probability model to also be defined over the null heads. Addressing this problem, Torr (2018) proposes an algorithm for extracting a set of complex LTAG-like MG lexical supertag categories from a corpus of MG derivation trees, which we adopt here. Each supertag contains precisely one overt atomic MG lexical item and zero or more atomic null heads anchored to it. For example, in Figure 1, the [int] head would be included inside the supertag anchored by saw. The supertagging model can now be refactored over these complex, overt MG categories. The parser continues to manipulate the atomic categories, but now keeps track of the fact that the v= of [int] must obligatorily be checked by the v feature of (this specific instance of) saw, and vice versa. During parsing, the overt heads carry the entire cost of their supertag into the agenda; the null heads are simply assigned a zero cost.
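The cost adjustment for unified ATB movers described above can be illustrated with a small sketch. The function name `merged_cost` and the toy numbers are hypothetical; costs are assumed to be non-negative negative log probabilities, and each pair lists the costs that the two unified movers contributed to their respective expressions.

```python
def merged_cost(cost_a, cost_b, unified_pairs):
    """Inside cost of an expression formed by merging two expressions.

    The two costs are summed, and for every pair of unified movers the lower
    of the two mover costs is subtracted, so that the greater one is the one
    that still contributes to the result.
    """
    total = cost_a + cost_b
    for mover_cost_a, mover_cost_b in unified_pairs:
        total -= min(mover_cost_a, mover_cost_b)
    return total


# Toy example: each conjunct contains one instance of the same moving item.
first_conjunct = 1.5   # includes a mover that cost 1.2
second_conjunct = 0.8  # includes a mover that cost 0.7
cost = merged_cost(first_conjunct, second_conjunct, [(1.2, 0.7)])
print(round(cost, 2))
```

In this toy case the result stays at or above the cost of either input. Subtracting the greater mover cost (1.2) instead would give 1.1, which falls below the 1.5 of the first conjunct, illustrating how retaining the lower cost could break monotonicity.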
Pseudo-code for the A* MG parser is provided in Appendix A.

Model description
We used two types of MG grammars in our experiments: Abstract and Reified. The difference between them is that in the Abstract grammar, most of the 100 or so fine-grained selectional and agreement restriction features have been removed, with the exception of the following 5 features, which are necessary to the inner workings of the parser: ANA, EDGE, IT, +NONE, MAIN. The Reified grammar is clearly more constrained, which should make it more precise (at some expense to recall) but at the same time more difficult to supertag correctly due to the sparsity that comes with a higher number of supertags. Extracting the complex MG supertags from the entire MGbank corpus resulted in a Reified tagset of 3926 items and an Abstract tagset of 2644 items.⁷
For both Abstract and Reified we used the same supertagging neural architecture, which initially embeds the word tokens using the final layer of an ELMo embedder (Peters et al., 2018), followed by a single affine transformation to compress the embeddings into a vector of size 128 for each word. These embeddings are then fed into a two layer bi-LSTM (Hochreiter and Schmidhuber, 1997; Graves, 2013). Finally, the hidden states of the final layer of the bi-LSTM are passed through a two layer MLP to predict the distribution of the supertags for each word. The parameters are trained using an Adam optimizer with a learning rate of 0.0002.

Recovering MGBank dependencies
We first tested the parser on its ability to recover global syntactic and semantic (local and non-local) dependencies extracted from MGbank. We extracted labelled and unlabelled bi-lexical dependencies for each binary non-terminal in the Xbar phrase structure trees transduced from the derivation trees and included in MGbank.⁸ To make up for the shortfall in the number of trees in MGbank, we used both sections 00 and 01 for
Table 1 shows the results on the MGbank test set. On both dependency types, the Reified model has higher precision, F1-score and exact matching, but has a lower score on recall owing to the constraining impact of the selectional and agreement features: the Abstract model parsed 1924 sentences (96.5%) out of 1998 in the test set, while the Reified model parsed 1902 (95.4%).
These F1 scores are respectable for a first attempt at wide-coverage MG parsing, though it should be noted that the MGbank test set is somewhat easier than the PTB test set owing to the difference of 4.8 in average sentence length between the two corpora.
⁷ This number of tags is closer to the 4727 elementary trees of the TAG treebank of Chen (2001) than to CCGbank's (Hockenmaier and Steedman, 2007) 1286 lexical categories.
⁸ As in Collins (1999), the labels are triples of the parent, non-head child and head child categories. The dependencies include both local dependencies and those created by movement, hence this evaluation is more akin to the deep dependency evaluation discussed in Clark et al. (2002) for CCG than to the more standard practice of evaluating parsers in terms of just local dependencies (e.g. Collins, 1999). The semantic head of the clause is taken to be the main verb, while its syntactic head, if present, is the overt complementizer; similarly, nouns are taken to be semantic heads of PPs and DPs while their syntactic heads are the preposition and determiner respectively; the semantic heads of coordination structures are the conjuncts themselves, while the syntactic head is the coordinator. Unlabelled dependencies are also undirected, as is standard practice in CCG evaluation.
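The dependency scores discussed in this section can be computed along the following lines. This is a minimal sketch rather than the evaluation script used for Table 1: the function names and the toy dependencies are invented, labelled dependencies are represented as (head index, dependent index, label) triples, and unlabelled dependencies are made undirected by turning each pair of indices into a frozenset.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over two sets of dependencies."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def unlabelled_undirected(deps):
    """Drop the label and the direction of each (head, dependent, label) triple."""
    return {frozenset((head, dep)) for head, dep, _ in deps}


# Toy dependencies; the label strings are illustrative placeholders.
gold = {(1, 0, "A"), (1, 2, "B"), (2, 3, "C")}
predicted = {(1, 0, "A"), (2, 1, "B"), (2, 3, "X")}

labelled = prf(gold, predicted)
unlabelled = prf(unlabelled_undirected(gold), unlabelled_undirected(predicted))
print(labelled, unlabelled)
```

In the toy data the second predicted dependency has the wrong direction and the third the wrong label, so both count as errors in the labelled score but as correct in the unlabelled, undirected one.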

Comparison to CCG
Cross-formalism comparison is in general a difficult task (Clark and Curran, 2007a) because it is necessary to account both for (1) the differences in how the parsers work and (2) the differences in the kinds of structures they predict. To control for (1) we re-implemented a CCG parser similar to L&S's CCG A* algorithm but using our supertagger to make the comparison fair. We first trained our CCG supertagger on the CCG trees from CCGbank, but only on those sentences that are also present in MGbank. We then tested the CCG parser on the recovery of CCGbank dependencies for the test sentences also appearing in MGbank, and compared this to an off-the-shelf CCG parser, namely EasyCCG, that was trained over the whole of the CCGbank training set. The results are shown in Table 2. Our CCG parser shows much better performance in spite of being trained on much less data than EasyCCG, making it a tough point of comparison for our MG parser.
To account for (2) we compared the CCG and MG parsers on their ability to recall the dependencies on which both CCGbank and MGbank agree, taking as the test set the intersection of the gold unlabelled undirected CCGbank and syntactic MGbank dependencies for sentences appearing in the MGbank test set. Precision cannot be computed due to the difficulties in normalising predictions on the CCG and MG sides: one parser might predict more dependencies which may be correct but are not predicted by the syntactic theory used in the other parser, and which would therefore be penalised.
The results of this evaluation are shown in Table 4. The CCG parser clearly exhibits superior performance, although the MG parser performs respectably given that it is up against a near state-of-the-art parser for a formalism with a much longer history in wide-coverage parsing. The higher performance of the CCG parser is likely the result of a more complete search due to the lower complexity of the formalism (the CCG parser parsed all sentences) and of the much smaller supertag set that is easier to predict, as is evident in Table 3. This means that the MG parser requires a larger amount of training data than the CCG parser to achieve similar levels of accuracy and efficiency (because the speed of A* parsing depends on the quality of the probabilistic model). We tried replacing all MG supertags occurring less than twice in the training data with UNK tags to reduce the noise from unreliable tags, but this hurt performance. Once MGbank's coverage is increased, the difference between the formalisms may narrow.
Our MG parser is currently a prototype Python implementation, hence to keep parsing times practical it was necessary to prune the search space by retaining only the 40 most likely supertags per word. Even so, the parser still timed out on a few sentences in the test set. Once reimplemented in a faster language, its recall should increase as it will have more time to explore a less aggressively pruned search space.
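The per-word pruning mentioned above amounts to keeping the k highest-scoring supertags of each word before the agenda is initialised. A minimal sketch, assuming the supertagger output is available as one dictionary of probabilities per word (the name `prune_supertags`, the toy tags and their probabilities are hypothetical, and whether the retained probabilities are renormalised is not stated here, so the sketch leaves them unchanged):

```python
import heapq


def prune_supertags(tag_probs, k=40):
    """Keep only the k most probable supertags for each word."""
    return [
        dict(heapq.nlargest(k, dist.items(), key=lambda item: item[1]))
        for dist in tag_probs
    ]


# Toy distributions with invented tag names; k is set to 2 for illustration.
toy = [
    {"tag_a": 0.5, "tag_b": 0.3, "tag_c": 0.2},
    {"tag_d": 0.1, "tag_e": 0.9},
]
pruned = prune_supertags(toy, k=2)
print(pruned)
```

Any supertag that falls outside the retained set can never be considered by the search, which is why a correct but low-ranked supertag leads to a parsing failure rather than to a slower parse.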

Parsing speed
The CKY MG parser of Harkema (2001), when augmented with head movement, has a worst case time complexity of $O(n^{4k+12})$ where k is the maximum number of phrasal movers that can be contained in any single expression. In the MGbank formalism, owing to DSMC, k = 4 (see Torr 2019), meaning that the worst case complexity of parsing with this formalism using Harkema's algorithm would be $O(n^{28})$. Our A* parsing algorithm operates in a similar fashion, except that it takes an additional multiplicative cost of $O(\log n)$ due to the usage of a heap data structure for implementing the agenda. $O(n^{28} \log n)$ is, of course, a prohibitively high time complexity. However, although A* does not improve on the worst case theoretical complexity of CKY, it can dramatically improve its practical expected complexity. Figure 2 shows the scatter plot of parsing times for different sentence lengths and the average curve. The average curve is less informative in very long sentences due to the smaller number of parses, but in regions where there are more data points a clear pattern can be observed: a cubic polynomial curve approximates average time taken to parse sentences extremely well, which means that the expected time complexity of MG parsing with our grammar and statistical model is $O(n^3)$. This is much better than the worst case analysis, although the variance is high, with some sentences still requiring a very long time to parse.
Recently, Stanojević (2019) has shown that with relatively small adjustments to the parser's inference rules, MGs with head movement can be parsed in $O(n^{2k+5})$ time in the worst case,⁹ which for the MGbank grammar equates to $O(n^{13})$, a dramatic improvement over $O(n^{28})$. We hope to leverage these efficiency gains in the future to improve the expected time complexity of the parser.
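One simple way to check an empirical growth rate of this kind is to estimate the exponent from timing data. The sketch below uses a least-squares line through the points in log-log space, which is a swapped-in technique and not the cubic curve fitting behind Figure 2; the function name `fit_exponent` is hypothetical and the timings are synthetic values generated from an exact cubic purely to exercise the function, not measurements from the parser.

```python
import math


def fit_exponent(lengths, times):
    """Slope of the least-squares line through (log n, log t)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


lengths = [5, 10, 20, 40]
times = [2e-4 * n ** 3 for n in lengths]  # synthetic cubic timings, not real data
exponent = fit_exponent(lengths, times)
print(round(exponent, 3))
```

With real timings the estimated slope would be noisy, in line with the high variance noted above, so it is best read alongside the scatter plot rather than in place of it.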
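For reference, the two worst case exponents quoted in this section follow directly from substituting the MGbank value k = 4 into the respective bounds:

```latex
\begin{align*}
4k + 12 \,\big|_{k=4} &= 4 \cdot 4 + 12 = 28 && \text{(Harkema's CKY parser with head movement)}\\
2k + 5 \,\big|_{k=4}  &= 2 \cdot 4 + 5 = 13  && \text{(Stanojevi\'c, 2019)}\\
\frac{n^{28}}{n^{13}} &= n^{15}              && \text{(ratio of the two worst case bounds)}
\end{align*}
```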

Coverage
Section 00 of the PTB contains 1921 sentences with an average sentence length of 21.9 words; other than a 212 word outlier, the maximum sentence length is 96. When run over all of these sentences, the Reified parser returned parses for 1490 (77.6%) sentences with an average sentence length of 14 and a maximum sentence length of 53. The Abstract parser returned 1549 parses (80.6%) with an average sentence length of 15.3 and a maximum sentence length of 49. The CCG A* parser returned 1909 parses (99.4%).

Recovery of unbounded dependencies
As noted in Section 1, the recovery of unbounded dependencies, including wh-object questions, is a primary motivation for using linguistically expressive parsers. Wh-object questions themselves are extremely rare in the PTB, but object relative clauses, which also involve unbounded movement, are relatively frequent. Following Clark et al. (2004), we manually evaluated our parser on the free and non-free object (and embedded subject) relative clauses in section 00 of the Penn Treebank, as well as on the two examples of so-called tough movement. The MGbank analyses of these constructions are discussed in Appendix B.
⁹ Fowlie and Koller (2017) previously demonstrated that MGs without head movement could be parsed in $O(n^{2k+3})$ worst case time, which was already a dramatic improvement over Harkema's original result. However, Stanojević (2019) shows that adding head movement to Fowlie and Koller's system increases complexity to $O(n^{2k+9})$.
There are 24 examples of non-free object relative dependencies across 20 sentences in section 00, and 17 free object relative dependencies across 16 sentences. All of these sentences, along with indications of which dependencies our parser did and did not recover, are given in Appendix C, and are presented using the tokenization our MG parser used (the CCG A* parser used the original CCGbank tokenization). Also included are a phrase structure tree and its corresponding derivation tree illustrating the MGbank analysis of a restrictive object relative clause.
On the free object relatives, our Abstract parser performed best, recovering 13/17 dependencies. The parser only predicted 14 free object relatives, meaning that the precision was 13/14. Of the 4 free object relative dependencies in the data which it missed, 3 were in very long sentences on which the parser timed out (the time-out was set to 30 mins), suggesting that a faster re-implementation may achieve higher recall. In the one case which the parser actually got wrong, it correctly identified that there was a free object relative dependency, but extracted the wrong object from a double object verb. Clark et al. (2004) reported recall of 14/17 (with precision 14/15), while our A* CCG parser recovered 15.5/17 of the free object relative dependencies with precision also 15.5/17.
Non-free object relatives are harder than both wh-object questions and free object relatives because they require a head noun to be identified in addition to an extraction site. Our Abstract parser performed best here, retrieving 10/24; the CCG A* parser recovered 15/24, with precision of 15/21 (Clark et al. (2004) also reported recall of 15/24 and precision of 15/20). Our Reified parser retrieved 13/24 with precision 13/17 when allowed to reparse any sentences it initially failed to find any analyses for with increasingly relaxed tag-dictionary settings. In two of the errors, the parser correctly identified the extraction site, but attached the relative clause to the wrong NP. For example, in sentence 1, the parser attached whom Sony hosted for a year to complaint rather than to American. Appositive relative clauses such as this are treated as involving adjunction of the relative clause to the head noun in MGbank, and the choice of attachment to either American or complaint is underdetermined by the model (the same supertag containing the requisite [rel] and [adjunctizer] heads will be assigned to hosted in either case).¹⁰ For the restrictive relative clause in sentence 8, the parser incorrectly assigned the supertag containing the [relativizer] null head (which causes the noun to undergo promotion) to the noun esteem rather than to damage, hence the problem here originates with the scores assigned by the supertagger. In the other two errors, the parser incorrectly predicted an object extraction dependency, again owing to tagging mistakes.
We also evaluated on the 2 tough movement examples in section 00, one of which is shown below.
(3) [That
Tough movement is of linguistic interest because it arguably involves a DP licensed in two case positions as well as so-called improper movement, in which an A′-movement step feeds subsequent A-movement. In order to generate tough movements, MGbank uses a null [op] head which has the effect of a unary type-changing rule mapping an ordinary DP into a DP with additional A- and A′-movement licensees.
Our parser failed to correctly analyse either of the two examples in section 00 owing to supertagging errors. For example, in (3) there are three important tagging decisions to be made: hard must be assigned the supertag for a tough adjective, that the supertag for a pronoun which undergoes tough movement (see footnote 11), and take the supertag for a transitive verb. The highest scoring tag assigned to hard by the Abstract supertagger was the supertag for a regular adjective that takes a CP complement (eager to help). The correct tough adjective supertag, meanwhile, ranked only 14th, meaning that the A* search algorithm never got to consider it. Furthermore, the highest ranked tag for take was the supertag for an unergative intransitive verb; the correct transitive verb tag appeared in second place. Finally, the supertag for a pronoun undergoing tough movement was not included in the 40 tags assigned to that owing to the fact that this supertag did not appear in the training data at all. We tried increasing the 8 examples of tough movement in the training data to 18 examples (including one example with that as the tough mover) by performing some additional hand annotation of PTB sentences. This bolstered the tough adjective supertag to 10th position, while the tough movement supertag for that now appeared in 28th position, but this was not enough to enable the parser to correctly recover the tough movement analysis.
Footnote 10: One way to resolve such ties would be to augment our simple supertag-factored model with a secondary head-dependency model; an alternative would be to hard code the constraint in the grammar using fine-grained properties and requirements such as HUMAN and +HUMAN.
Footnote 11: This supertag contains both the overt pronoun category assigned to that and the [op] null head (see Figure 6).
Our A* CCG parser scored 1/2 (the same as Clark et al. 2004); its higher performance is no doubt due to the much smaller tag set and the fact that CCG does not require special supertags for tough-moved DPs.

Conclusion
We have presented the first wide-coverage parser based on Transformational Grammar theory. The results of this initial attempt are encouraging. First, the accuracy on recovering syntactic and semantic dependencies predicted by the Minimalist syntax is relatively high considering the higher complexity of the mechanisms behind Minimalism compared to other formalisms. In comparison to CCG, a formalism with a much longer history of wide-coverage parsing, performance currently lags behind. However, the gap will likely narrow as the size and quality of MGbank improve and as better probabilistic models are developed, enabling these systems to parse a higher number of sentences.
Another important and encouraging result of this investigation is that Minimalist Grammar parsing is not as slow as may have been expected given its worst case time complexity. Worst case complexity results have previously been raised as one of the prime criticisms of TG theories. Our results show that the combination of a good neural probabilistic model and A* search, together with a strong formal grammar, makes Minimalist parsing practical for the majority of sentences.

The A* search algorithm presented in Algorithm 1 is an adaptation of the weighted deductive parsing approach (Nederhof, 2003; Maier et al., 2012) to Minimalist Grammars. It uses two data structures, an agenda and a chart. The agenda is implemented as a priority queue with support for an increase-key operation. Concretely, we use a Fibonacci heap (Fredman and Tarjan, 1987), but many other types of heap could be used for the same purpose.
The chart is currently organised similarly to that of standard CKY in that it constitutes the upper-triangular portion of an (n + 1) × (n + 1) matrix, where n is the length of the string, and each cell [i, j] in this matrix references some span from position i to position j in the input string. However, whereas in standard CKY, these cell indices reference the span of the entire expression, in our MG parser they reference only the span of the narrow yield of the expression, where the narrow yield refers to all those indices which are not part of some span which is undergoing or may undergo movement. For example, the narrow yield of the TP expression in (4) below is the set of indices corresponding to the words Jack and gone there (shown in bold face). The moving chain why is excluded from the narrow yield, as is the head string has because, depending on the type of complementizer which selects for this TP, has may undergo head movement to yield why has Jack gone there, or not undergo head movement to yield, e.g., you know why Jack has gone there.
(4) Jack, has, gone there : t, why : -wh
At present, the chart is not organised according to the yields of any of the moving elements. However, Stanojević (2019) shows how this could be done using a trie-like data structure in order to significantly reduce computational complexity, and we intend to reimplement the parser in this way in the future.
Expressions within each cell are also currently placed into bins according to the first feature of their head chain, so that when the system encounters a t= feature, for example, it only needs to consider merging this expression with other expressions whose first feature is t.
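The chart organisation described above can be sketched roughly as follows. This is only an illustrative sketch, not the parser's actual code: the names Expression, Chart, add and candidates are hypothetical, the narrow yield is simplified to a single (i, j) span, and matching a selector such as t= is reduced to stripping the = sign.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Expression:
    """Hypothetical stand-in for a chart item (not the parser's real class)."""
    narrow_span: Tuple[int, int]    # assumed: narrow yield as one (i, j) span
    head_features: Tuple[str, ...]  # remaining features of the head chain
    label: str = ""


class Chart:
    """Upper-triangular (n + 1) x (n + 1) chart with per-cell feature bins."""

    def __init__(self, n: int) -> None:
        self.n = n
        # cells[(i, j)][first_feature] -> expressions stored in that bin
        self.cells: Dict[Tuple[int, int], Dict[str, List[Expression]]] = {
            (i, j): defaultdict(list)
            for i in range(n + 1)
            for j in range(i, n + 1)
        }

    def add(self, expr: Expression) -> None:
        """File the expression under the first feature of its head chain."""
        first_feature = expr.head_features[0]
        self.cells[expr.narrow_span][first_feature].append(expr)

    def candidates(self, span: Tuple[int, int], selector: str) -> List[Expression]:
        """Return only the expressions in `span` that a selector could merge with."""
        wanted = selector.strip("=")  # simplification: 't=' looks up bin 't'
        return list(self.cells[span].get(wanted, []))
```

With bins of this kind, a lookup for t= touches only the expressions filed under t in the relevant cell, rather than every expression stored there.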
The call updateWeight(agenda, item) finds the current (backpointer, weight) pair of item in the agenda and compares it to the newly constructed (backpointer, weight) pair. The weight includes both the inside and outside scores. Only the pair with a lower weight is kept in the agenda. This update is made efficient by using an additional hashtable and the increase-key heap operation.
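A minimal sketch of this update step is given below. It makes several assumptions that are not in the text: it substitutes a plain binary heap for the Fibonacci heap, treats a lower weight as better (so the key update is a sift-up in a min-heap), and uses the hypothetical names Agenda, update_weight and pop.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple


class Agenda:
    """Min-priority queue over weights with an item -> heap index hash table."""

    def __init__(self) -> None:
        # Each entry is (weight, item, backpointer); only weights are compared.
        self._heap: List[Tuple[float, Hashable, Any]] = []
        self._pos: Dict[Hashable, int] = {}

    def __len__(self) -> int:
        return len(self._heap)

    def _swap(self, a: int, b: int) -> None:
        self._heap[a], self._heap[b] = self._heap[b], self._heap[a]
        self._pos[self._heap[a][1]] = a
        self._pos[self._heap[b][1]] = b

    def _sift_up(self, k: int) -> None:
        while k > 0:
            parent = (k - 1) // 2
            if self._heap[k][0] >= self._heap[parent][0]:
                break
            self._swap(k, parent)
            k = parent

    def _sift_down(self, k: int) -> None:
        n = len(self._heap)
        while True:
            left, right, best = 2 * k + 1, 2 * k + 2, k
            if left < n and self._heap[left][0] < self._heap[best][0]:
                best = left
            if right < n and self._heap[right][0] < self._heap[best][0]:
                best = right
            if best == k:
                return
            self._swap(k, best)
            k = best

    def update_weight(self, item: Hashable, backpointer: Any, weight: float) -> None:
        """Insert the item, or keep whichever (backpointer, weight) pair is lower."""
        k: Optional[int] = self._pos.get(item)  # hash table lookup
        if k is None:
            self._heap.append((weight, item, backpointer))
            k = len(self._heap) - 1
            self._pos[item] = k
            self._sift_up(k)
        elif weight < self._heap[k][0]:
            self._heap[k] = (weight, item, backpointer)
            self._sift_up(k)  # restore heap order after the key change

    def pop(self) -> Tuple[Hashable, Any, float]:
        """Remove and return the (item, backpointer, weight) with the lowest weight."""
        top = self._heap[0]
        last = self._heap.pop()
        del self._pos[top[1]]
        if self._heap:
            self._heap[0] = last
            self._pos[last[1]] = 0
            self._sift_down(0)
        return top[1], top[2], top[0]
```

The hash table makes finding the existing entry constant time on average, so the cost of an update is dominated by the heap reordering: logarithmic in this binary heap, and amortised constant for the key update in a Fibonacci heap.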

B Appendix: MGbank analyses of relative clauses and tough movement
The MGbank analysis of restrictive relative clauses is illustrated in phrase structural terms for the phrase the book of ghost stories which Jack read in Figure 3; the derivation tree for the simpler phrase the book which Jack read is shown in Figure 4. This analysis is inspired by an analysis in Bhatt (2002) and departs from that of Kayne (1994), where the wh determiner and the NP form a constituent in both the deep and surface structure (with the NP moving to the specifier of the wh DP to derive the correct word ordering , not its selectee (n) category, hence the type of projecting movement Bhatt proposes must be precompiled into the lexicon. Note that relative that is often treated as a complementizer rather than as a relative pronoun in MP (Radford, 2004, pages 228-230 (Chomsky and Lasnik, 1977)). Free relatives, as in I like [what you're reading], have a very similar analysis, but project only as far as CP (as they lack any head noun) and are then selected for by a null determiner head. Appositive relatives, as in the book, which you've read, is on the table, receive a head external analysis, again projecting only as far as CP and then adjoining to their head noun.
Figures 5 and 6 show the phrase structure and derivation trees for the tough movement example that got hard to take, which is one of the two examples of tough movement found in section 00 of the PTB. It has generally been assumed since Chomsky (1977, 1981) that the infinitival clause is a type of relative clause with a null constituent in its left periphery that is co-indexed both with the object trace and the subject of the tough adjective. This null constituent is in fact included in the original PTB, although it is generally just ignored by treebank parsers. MGbank follows Brody (1993) and Hornstein (2001)  non-free (non-reduced) relative clauses in section 00 of the PTB, and indicate which ones our best models did and did not correctly analyse.
Figure 3: MGbank's phrase structural analysis of the phrase the book of ghost stories which Jack read, which contains a restrictive relative clause as the complement of the determiner the. The tree has been simplified in certain respects, for instance by removing the successive cyclic wh movement through spec-vP which is assumed in MP and included in the actual MGbank trees. ⇤ indicates a trace of head movement, indicates a trace of overt phrasal movement, and µ indicates the landing site of a covert movement.
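Figure 5: Derived Xbar tree showing MGbank's analysis for the phrase that got hard to take with tough movement. The tree has been simplified here by removing the successive cyclic wh movement through spec-vP that is standardly assumed in MP and is included in the actual MGbank trees. Note that µ indicates the landing site of a covert movement.
Figure 6: MG derivation tree for the sentence that got hard to take whose phrase structure tree is given in Figure 5. Note that the final CP layer is omitted here to save space.
1. The survey found that nearly half of Hong Kong consumers espouse what it identified as materialistic values compared with about one-third in Japan and the U.S.
2. What she did was like taking the law into your own hands
3. We work damn hard at what we do for damn little pay and what she did cast unfair aspersions on all of us
4. There may be others doing what she did
5. The U.S. wants the removal of what it perceives as barriers to investment ; Japan denies there are real barriers
6. But they have n't clarified what those might be
7. Deregulation has effectively removed all restrictions on what banks can pay for deposits as well as opened up the field for new products such as high -rate CDs
8. Mr. Martin said they have n't yet decided what their next move would be but he did n't rule out the possibility of a consent solicitation aimed at replacing Georgia Gulf 's board
9. What matters is what advertisers are paying per page and in that department we are doing fine this fall said Mr. Spoon w.o.
10. What this tells us is that U.S. trade law is working he said t.o.
11. The paper accused him of being a leading proponent of peaceful evolution a catch phrase to describe what China believes is the policy of Western countries to seduce socialist nations into the capitalist sphere t.o.
12. Despite the harsh exchanges the U.S. and China still seem to be looking for a way to mend relations which have deteriorated into what Mr. Nixon referred to as the greatest crisis in Chinese -American relations since his initial visit to China num years ago
13. Judge Ramirez num said it is unjust for judges to make what they do.
14. Judges are not getting what they deserve t.o.
15. Composer Marc Marder a college friend of Mr. Lane 's who earns his living playing the double bass in classical music ensembles has prepared prepared an exciting eclectic score that tells you what the characters are thinking and feeling far more precisely than intertitles or even words would
16. We have and I 'm sure others have considered what our options are and we 've had conversations with people who in the future might prove to be interesting partners
Figure 7: The 16 sentences with free object relative clause dependencies in section 00 of the PTB. Each tick indicates a point awarded for the correct identification of the extraction site of the wh word; t.o. indicates that the parser timed out before returning a parse, and w.o. indicates that the parser correctly identified an object relative dependency but extracted the wrong object of a double object verb. Our Abstract parser correctly identified 13/17 dependencies with a precision of 13/14. Our A* CCG parser correctly recovered 15.5/17 of these dependencies with precision 15.5/17 (we awarded the CCG parser half a point for sentence 15 because it related what to thinking but not feeling, which it analysed as intransitive). Note that sentence 3 contains two free object relative clauses.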
15. Composer Marc Marder a college friend of Mr. Lane 's who earns his living playing the double bass in classical music ensembles has prepared prepared an exciting eclectic score that tells you what the characters are thinking and feeling far more precisely than intertitles or even words would 16. We have and I 'm sure others have considered what our options are and we 've had conversations with people who in the future might prove to be interesting partners Figure 7: The 16 sentences with free object relative clause dependencies in section 00 of the PTB. Each tick indicates a point awarded for the correct identification of the extraction site of the wh word; t.o. indicates that the parser timed out before returning a parse, and w.o. indicates that the parser correctly identified an object relative dependency but extracted the wrong object of a double object verb. Our Abstract parser correctly identified 13/17 dependencies with a precision of 13/14. Our A* CCG parser correctly recovered 15.5/17 of these dependencies with precision 15.5/17 (we awarded the CCG parser half a point for sentence 15 because it related what to thinking but not feeling, which it analysed as intransitive). Note that sentence 3 contains two free object relative clauses.