Statistical Parsing of Tree Wrapping Grammars

We describe an approach to statistical parsing with Tree-Wrapping Grammars (TWG). TWG is a tree-rewriting formalism which includes the tree-combination operations of substitution, sister-adjunction and tree-wrapping substitution. TWGs can be extracted from constituency treebanks and aim at representing long distance dependencies (LDDs) in a linguistically adequate way. We present a parsing algorithm for TWGs based on neural supertagging and A* parsing. We extract a TWG for English from the treebanks for Role and Reference Grammar and discuss first parsing results with this grammar.


Introduction
We present a statistical parsing approach for Tree-Wrapping Grammar (TWG) (Kallmeyer et al., 2013). TWG is a grammar formalism closely related to Tree-Adjoining Grammar (TAG) (Joshi and Schabes, 1997) and was originally developed for the formalization of the typologically oriented Role and Reference Grammar (RRG) (Van Valin and LaPolla, 1997; Van Valin Jr., 2005). Among other things, TWG allows for a more linguistically adequate representation of long distance dependencies (LDDs) in sentences, such as topicalization or long distance wh-movement. In the present paper we describe a grammar extraction algorithm for TWG, propose a TWG parser, and discuss parsing results for the grammar extracted from the RRG treebanks RRGbank and RRGparbank (Bladier et al., 2018).
Similarly to TAG, TWG has the elementary tree combination operations of substitution and sister-adjunction. Additionally, TWG includes the operation of tree-wrapping substitution, which preserves the connection between the parts of discontinuous constituents. Operations similar to tree-wrapping substitution were proposed by Rambow et al. (1995) as subsertion in D-Tree Grammars (DTG) and by Rambow et al. (2001) as generalized substitution in D-Tree Substitution Grammar (DSG). To the best of our knowledge, no statistical parsing approach has been proposed for DTG or DSG. An approach to symbolic parsing for TWGs with edge features was proposed by Arps et al. (2019). In this work, we propose a statistical parsing approach for TWG, extending a pipeline based on supertagging and the A* algorithm (Waszczuk, 2017), originally developed for TAG, to apply to TWG. The contributions of the paper are the following: 1) We present the first approach to statistical parsing for Tree-Wrapping Grammars. 2) We propose an extraction algorithm for TWGs based on the algorithm developed for TAG by Xia (1999). 3) We extend and modify the neural A* TAG parser (Waszczuk, 2017; Kasai et al., 2018) to handle the operation of tree-wrapping substitution.

Tree-wrapping substitution is used to capture long distance dependencies (LDDs); see the wh-movement in Fig. 1. Here, the left tree with the d-edge (depicted as a dashed edge) gets split; the lower part fills a substitution slot while the upper part merges with the root of the target tree. TWG is more powerful than TAG (Kallmeyer, 2016). The reason is that a) TWG allows for more than one wrapping substitution stretching across specific nodes in the derived tree and b) the two target nodes of a wrapping substitution (the substitution node and the root node) need not come from the same elementary tree, which makes wrapping non-local compared to adjunction in TAG.
Figure 1: Tree-wrapping substitution for the sentence "What do you think you remember" with long-distance wh-movement.

Linguistic phenomena leading to LDDs differ across languages. Among the LDDs in English are cases of extraction of a phrase to a non-canonical position with respect to its head, typically fronting in English (Candito and Seddah, 2012). We identified the following LDD variants in our data which can be captured with tree-wrapping substitution: long-distance relativization, long-distance wh-movement, and long-distance topicalization, which we discuss in Section 6. LDD cases are rather rare in the data, which is partly due to the RRG analysis of operators such as modals, which do not embed CORE constituents (in contrast to, for example, the analyses in the Penn Treebank). Only 0.11% of the tokens in our experiment data (including punctuation) are dislocated from their canonical position in the sentence to form an LDD. This number is on a par with the 0.16% of tokens reported by Candito and Seddah (2012) for French data.

Statistical Parsing with TWGs
The proposed A* TWG parser is a direct extension of the simpler A* TAG parser described in (Waszczuk, 2017). The parser is specified in terms of weighted deduction rules (Shieber et al., 1995; Nederhof, 2003) and can also be seen as a weighted variant of the symbolic TWG parser (Arps et al., 2019). As in the TAG parser, both TWG elementary trees (supertags) and dependency links are weighted, a scheme also used in A* CCG parsing (Yoshikawa et al., 2017). These weights come directly from a neural supertagger and dependency parser, similar to the one proposed by Kasai et al. (2018). Parsing then consists in finding a best-weight derivation among the derivations that can be constructed for a given sentence based on the deduction rules. The supertagger takes as input a sequence of word embeddings (x_i)_{i=1}^n, to which a 2-layer BiLSTM transducer is applied to produce contextualized word representations (h_i)_{i=1}^n, common to all subsequent tasks: POS tagging, TWG supertagging, and dependency parsing. On top of that, we apply two additional 2-layer BiLSTM transducers in order to obtain the supertag- and dependency-specific word representations. The supertag-specific representations are used to predict both supertags and POS tags (POS tagging is a purely auxiliary task, since POS tags are fully determined by the supertags). Finally, the dependency parsing component is based on biaffine scoring (Dozat and Manning, 2017), in which the head and dependent representations are obtained by applying two feed-forward networks to the dependency-specific word representations, hd_i = FF_hd(h_i) and dp_i = FF_dp(h_i). The score of word j becoming the head of word i is then defined as score(i, j) = hd_j^T M dp_i + hd_j^T b, where M is a matrix and b is a bias vector. Extending the TAG parser to TWG involved adapting the weighted deduction rules to handle wrapping substitution, as well as updating the corresponding implementation with appropriate index structures to speed up querying the chart.
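To make the biaffine scoring step concrete, here is a minimal NumPy sketch. The dimensions and random values are ours for illustration only; in the actual model, hd and dp come from feed-forward networks applied to BiLSTM outputs.

```python
import numpy as np

# Hypothetical dimensions; the paper does not specify them.
n, d = 5, 8                      # sentence length, representation size
rng = np.random.default_rng(0)

hd = rng.normal(size=(n, d))     # head representations hd_j = FF_hd(h_j)
dp = rng.normal(size=(n, d))     # dependent representations dp_i = FF_dp(h_i)
M = rng.normal(size=(d, d))      # biaffine weight matrix
b = rng.normal(size=(d,))        # bias vector

# score[i, j]: score of word j becoming the head of word i,
# i.e. hd_j^T M dp_i + hd_j^T b  (biaffine scoring, Dozat and Manning, 2017)
score = dp @ M.T @ hd.T + (hd @ b)[None, :]
assert score.shape == (n, n)

# Head distribution for word i: softmax over candidate heads j.
probs = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
```

The single matrix-plus-bias form mirrors the formula above; the full parser of course trains these parameters rather than sampling them.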
The A* heuristic is practically unchanged; it is both admissible (by construction) and monotonic (checked at run time), which guarantees that the first derivation found by the parser is the one with the best weight. The scheme of our parsing architecture is shown in Fig. 2. In Appendix A we provide details on the modifications we applied to the A* parser to handle tree-wrapping substitution.
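To illustrate why admissibility and monotonicity give this guarantee, here is a toy A* search (plain graph search, not the chart-based parser) which, like the parser, checks monotonicity at run time; the first time the goal is popped, its weight is optimal.

```python
import heapq

def astar(start, goal, edges, h):
    """Minimal A* sketch (not the actual parser implementation).
    edges: dict node -> list of (neighbor, cost); h: heuristic estimate.
    Monotonicity (h(u) <= cost(u, v) + h(v)) is asserted at run time;
    together with admissibility it guarantees that the first goal
    popped from the agenda has the best total weight."""
    agenda = [(h(start), 0.0, start)]
    done = set()
    while agenda:
        f, g, node = heapq.heappop(agenda)
        if node == goal:
            return g                     # best-weight solution found first
        if node in done:
            continue
        done.add(node)
        for nxt, cost in edges.get(node, []):
            assert h(node) <= cost + h(nxt) + 1e-9, "heuristic not monotonic"
            heapq.heappush(agenda, (g + cost + h(nxt), g + cost, nxt))
    return None
```

For example, with `edges = {'a': [('b', 1.0), ('c', 4.0)], 'b': [('c', 1.0)]}` and the admissible, monotonic heuristic `{'a': 2.0, 'b': 1.0, 'c': 0.0}`, `astar('a', 'c', ...)` returns the optimal cost 2.0 without ever expanding the suboptimal direct edge.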

TWG extraction
To extract a TWG from RRGbank and RRGparbank, we adapt the top-down grammar extraction algorithm developed by Xia (1999) for TAG. While initial and sister-adjoining trees can be extracted following this algorithm, we added a new procedure to extract d-edge trees for the wrapping substitution operation. Extraction of initial and sister-adjoining elementary trees requires manually defined percolation tables for marking head and modifier nodes. In order to extract d-edge elementary trees for LDDs, dependent constituents need to be marked prior to TWG extraction. In RRGbank and RRGparbank, the constituents belonging to LDDs are indicated with the features PRED-ID and NUC-ID and an index. These marked parts, together with the mother node, are extracted to form a single tree with a dominance link (d-edge); see for instance the elementary tree for "What to say" in Fig. 3. The remaining nodes plus the duplicated mother node and a substitution slot form the target tree, for example the tree for "I'm trying" in Fig. 3. A more detailed formal description of our extraction algorithm, along with a link to the percolation tables, is given in Appendix B.

We took the gold and silver data from RRGbank and the English part of RRGparbank.6 The data is split into a train and a test set. We took 10% of the sentences from the train set to create a dev set. Thus, our train, dev, and test sets include 4960, 551, and 2145 trees, respectively. There are 46 constituents with LDDs in the train set, 5 in the dev set and 27 in the test set. We extracted a TWG from this data; Table 1 presents statistics on the elementary tree templates (supertags) in the TWG. We compare our parsing results with the parser DiscoDOP (van Cranenburgh and Bod, 2013), which is based on the discontinuous data-oriented parsing model, and with the state-of-the-art transition-based parser Discoparset (Coavoux and Cohen, 2019).
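The d-edge extraction step described earlier in this section can be sketched on a toy `(label, children)` tree format. This is a simplification under our own names and assumptions: it splits only the direct children of one mother node, whereas the real procedure walks the treebank tree guided by the PRED-ID/NUC-ID indices.

```python
# Illustrative sketch of d-edge tree extraction, not the actual algorithm.
# A tree is (label, children); is_dislocated marks LDD constituents.

def split_ldd(tree, is_dislocated):
    """Split `tree` at its dislocated children: return a d-edge tree
    (mother plus the dislocated parts, connected by a dominance link)
    and a target tree (remaining material plus the duplicated mother
    node and a substitution slot)."""
    label, children = tree
    moved = [c for c in children if is_dislocated(c)]
    rest = [c for c in children if not is_dislocated(c)]
    d_edge_tree = (label, [("<d-edge>", moved)])    # dashed dominance edge
    target_tree = (label, rest + [("<slot>", [])])  # substitution slot
    return d_edge_tree, target_tree
```

For a toy clause with a fronted wh-phrase, the function yields one tree holding the wh-element below a dominance link and one target tree whose slot the lower part of the d-edge tree will fill.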
We evaluated7 the overall performance of the parsers and also analyzed how well all three systems predict LDDs (see Tables 2 and 3). Unrelated to LDDs, the treebanks contain crossing branches (e.g., for operators and modifiers). Prior to TWG extraction, we decross these while keeping track of the transformation in order to be able to reverse it. For parsing with DiscoDOP and Discoparset, we added crossing branches for all LDDs. To evaluate LDD prediction with DiscoDOP and Discoparset, we counted how many crossing branches were established in the parsed trees. For ParTAGe, we counted an LDD prediction as correct whenever the predicted supertags and dependencies indicated that the long distance element would be substituted into the elementary tree of the corresponding predicate. In both parsing architectures, we counted partially correct LDDs as correctly predicted as long as the connection between the predicate and the fronted element was predicted. In the case of Discoparset (Coavoux and Cohen, 2019), the numbers in subscript in the tables represent the relative gain provided by BERT (Devlin et al., 2019), which was used in neither the DiscoDOP nor the ParTAGe experiments.

Error analysis for LDD prediction
We evaluated the performance of our parsing architecture with regard to the labeled F1-score, focusing also on the prediction of LDDs (see Tables 2 and 3). Our parser predicted the LDDs in the test data more accurately than the compared parsers. Note that LDDs are generally rare in the corpus data and that we had only about 5000 sentences of training data. Some mistakes resulted from the wrong prediction of a POS tag, which leads the parser to confuse an LDD constituent with a construction without an LDD. For example, in (1), the word "is" should have the POS tag V, but the parsing system erroneously labels it as AUX (= auxiliary) and thus interprets the wh-element as a predicate. In order to check our assumption about POS tags as a source of errors, we ran an experiment in which we presented the parser with gold POS tags. Although this additional information helped to rule out the LDD errors in (1), the restriction of the available supertags introduced new errors in LDD predictions (see Table 3) and was also the reason why some sentences could not be parsed (as shown in Table 2).

6 Gold data means that the data were annotated and approved by at least two annotators of RRGbank or RRGparbank; silver data means an annotation by one linguist.
7 We use the evaluation parameters distributed together with DiscoDOP for constituency parsing evaluation. Our evaluation file is available at https://github.com/TaniaBladier/Statistical_TWG_Parsing/blob/main/experiments/eval.prm.
(1) a. What is one to think of all this? ("is" tagged AUX instead of V)
    b. [...] which he told her to place on her tongue ("which" tagged CLM instead of PRO-REL)

In some cases where the relative or wh phrase of the LDD is an adjunct, as in (2), the parser incorrectly attaches it higher, taking it to be a modifier of the embedding verb.

(2) And why do you imagine that we bring people to this place?
Cases where the embedding verb has a strong tendency to take a wh-element as an argument sometimes get parsed incorrectly: in (3), "which" is analysed as an argument of "said".

(3) [...] slip of paper which they said was the bill

Conclusions and Outlook
We have presented a statistical parsing algorithm for Tree-Wrapping Grammar, a grammar formalism inspired by TAG which aims at linguistically more adequate representations of long distance dependencies. An LDD in TWG is represented in a single elementary tree, called a d-edge tree, which is combined with the target tree using tree-wrapping substitution. This operation places both parts of a discontinuous constituent simultaneously into the corresponding slots of the target tree. We have extracted a TWG for English from two RRG treebanks and have compared our parser with DiscoDOP, based on the DOP parsing model, and with the transition-based parser Discoparset. We evaluated our parser on the prediction of LDDs and achieved more accurate results than the compared parsers. In future work, we plan to explore TWG extraction and parsing for other languages, since the linguistic phenomena leading to LDDs vary across languages. In particular, we have already started to work on the extraction of TWGs for German and French. We plan to apply our TWG extraction and parsing algorithm to other constituency treebanks, for example the French Treebank (Abeillé et al., 2003). We also plan to implement a slightly extended version of tree-wrapping substitution which would allow the parts of discontinuous constituents to be placed in various slots between the nodes of the target tree.

Appendix A

Our TWG parser is specified in terms of weighted deduction rules (Shieber et al., 1995; Nederhof, 2003). Each deduction rule (see Table 4) takes the form of a set of antecedent items, presented above the horizontal line, from which the consequent item (below the horizontal line) can be deduced, provided that the corresponding conditions (on the right) are satisfied. The specification of the TWG parser consists of 8 deduction rules which constitute a blend of the TAG parser (Waszczuk, 2017) with the symbolic TWG parser (Arps et al., 2019).
Here, we assume familiarity with both these parsers and limit ourselves to explaining the features specific to the statistical TWG parser.
Weights. A pair (w, m) is assigned to each chart item via the deduction rules, where w is the inside weight, i.e., the weight of the inside derivation, and m is a map assigning weights to the individual gaps in the corresponding gap list Γ. Since each gap in Γ can be uniquely identified by its starting position, we use the starting positions as keys in m. The need to use a map (dictionary) data structure instead of a single scalar value, as in the TAG parser, stems from the CW rule (complete wrapping), in which the calculation of the resulting weight map requires removing the weight corresponding to the gap (f1, f2, y).
We use ∅ to denote an empty map, m[x ⇒ y] to denote m with y assigned to x, m[x ⇒ ⊥] to denote m with x removed from the set of keys (together with the corresponding value), and sum(m) to denote the sum of the values (weights) in the map m. We also re-use the concatenation operator ⊕ to represent map union. Whenever map union is used (m1 ⊕ m2), the key sets of the two arguments (m1 and m2) are guaranteed to be disjoint (an invariant which can be proved by induction over the deduction rules).
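The map operations above can be realized directly on Python dictionaries keyed by gap starting positions; this is an illustrative sketch, not the parser's actual data structure.

```python
# Toy realization of the weight-map notation over plain dicts.

def assign(m, x, y):
    """m[x => y]: m with weight y assigned to gap key x."""
    return {**m, x: y}

def remove(m, x):
    """m[x => bottom]: m with key x (and its weight) removed."""
    return {k: v for k, v in m.items() if k != x}

def union(m1, m2):
    """m1 (+) m2: map union; key sets must be disjoint."""
    assert not set(m1) & set(m2), "map union requires disjoint keys"
    return {**m1, **m2}

def weight_sum(m):
    """sum(m): total of the weights stored in the map."""
    return sum(m.values())
```

For instance, `remove` corresponds to discarding the weight of the gap consumed by the CW rule, while `union` combines the gap weights of two antecedent items.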
Heuristic. Given a chart item η = (x, i, j, Γ) with the corresponding weights (w, m), the TWG A* heuristic (which provides a lower-bound estimate of the cost of parsing the remaining part of the sentence) is a straightforward generalization of the TAG A* heuristic. In particular, it accounts for the total minimal cost of scanning each word outside the span (i, j), as well as the words remaining in the gaps in Γ. In addition, in contrast with the TAG heuristic, since there can be many gaps in Γ, the sum of the weights in the map m has to be accounted for.
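Under our reading of this description, the word-cost part of the estimate can be sketched as follows; `min_cost` (a precomputed lower bound on the cost of scanning each word) and the gap representation as (start, end) spans are our own assumptions.

```python
# Toy sketch of the lower-bound word-cost estimate for an item
# (x, i, j, gaps); not the parser's actual heuristic code.

def heuristic(min_cost, n, i, j, gaps):
    """Lower bound on the remaining scan cost: the minimal cost of
    every word outside the item's span (i, j), plus every word still
    to be covered inside the gaps (which lie within the span).
    The item's full priority would additionally include its inside
    weight w and, per the text above, the gap weights sum(m)."""
    in_gap = {k for (s, e) in gaps for k in range(s, e)}
    outside = sum(min_cost[k] for k in range(n) if k < i or k >= j)
    return outside + sum(min_cost[k] for k in in_gap)
```

Since every term is a lower bound on a真 cost the parser must still pay, the estimate never overestimates, which is the admissibility requirement.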