Supertagging-based Parsing with Linear Context-free Rewriting Systems

We present the first supertagging-based parser for linear context-free rewriting systems (LCFRS). It utilizes neural classifiers and outperforms previous LCFRS-based parsers in both accuracy and parsing speed by a wide margin. Our results keep up with the best (general) discontinuous parsers; in particular, the scores for discontinuous constituents establish a new state of the art. The heart of our approach is an efficient lexicalization procedure which induces a lexical LCFRS from any discontinuous treebank. We describe a modification to usual chart-based LCFRS parsing that accounts for supertagging and introduce a procedure that transforms lexical LCFRS derivations into equivalent parse trees of the original treebank. Our approach is evaluated on the English Discontinuous Penn Treebank and the German treebanks Negra and Tiger.


Introduction
In NLP, constituency parsing is a task that assigns (usually tree-shaped) syntactic structures to sentences. Formalisms such as context-free grammars (CFG) are used in this setting because they are conceptually simple, interpretable, and parsing is tractable (cubic in sentence length).
Discontinuous constituents span non-contiguous sets of positions in a sentence. The resulting phrase structures no longer take the shape of a tree, as they contain crossing branches (cf. the left of Fig. 1), and cannot be modeled by CFG. As a countermeasure, many treebanks, e.g. the Penn Treebank (PTB; Marcus et al., 1994), denote these phrase structures as trees nevertheless and introduce designated notations for discontinuity, which is then often ignored in parsing. However, discontinuity occurs in about 20 % of the sentences in the PTB and to an even larger extent in German treebanks such as Negra and Tiger. For parsing discontinuous constituents, so-called mildly context-sensitive grammar formalisms have been investigated, e.g. tree-adjoining grammars (TAG; Joshi et al., 1975) and linear context-free rewriting systems (LCFRS; Vijay-Shanker et al., 1987). An LCFRS derivation of a discontinuous phrase is shown in the right of Fig. 1. The increased expressiveness of these formalisms comes at the cost of a higher parsing complexity: given a sentence of length n, parsing is in O(n^6) for TAG and O(n^(3·fo(G))) for a binary LCFRS G. The grammar-specific fanout fo(G) indicates that G can parse constituents spanning up to fo(G) non-contiguous sets of positions. TAG have the same expressiveness as LCFRS with fanout 2 (Seki et al., 1991), which accounts for 96.67 % of the sentences in Negra and 96.83 % of the sentences in Tiger (Maier and Søgaard, 2008). Previous publications have established mildly context-sensitive formalisms in the field of statistical constituent parsing and found methods to tame the high parsing complexity (Evang and Kallmeyer, 2011; Kallmeyer and Maier, 2013; Angelov and Ljunglöf, 2014; van Cranenburgh, 2012).
One approach for making parsing with mildly context-sensitive grammars tractable is supertagging, which was originally introduced for lexical TAG (Bangalore and Joshi, 1999). A TAG is lexical if each rule contains exactly one lexical item, i.e. a word of the parsed language. The supertagger is an (often discriminative) classifier that selects, for each position of the input sentence, a subset of the rules of the TAG; these are the so-called supertags. Parsing is then performed with the much smaller grammar of supertags. Research on supertagging has also been conducted in the context of combinatory categorial grammars (CCG; Clark, 2002), but not yet for LCFRS. The use of recurrent neural networks (RNN) as classifiers in supertagging has considerably improved its accuracy (Vaswani et al., 2016; Kasai et al., 2017; Bladier et al., 2018; Kadari et al., 2018). Recently, Mörbitz and Ruprecht (2020) introduced a lexicalization procedure for probabilistic LCFRS¹, paving the way to employ supertagging for parsing with this formalism. Early experiments showed that the approach is infeasible in realistic settings: the set of rules explodes in a step of the construction where new rules are introduced for pairs of terminals in the grammar. To mitigate this problem, we conduct the procedure on single derivations. Consequently, we only have to construct rules for pairs of terminals that occur in sibling nodes of a derivation (cf. step (4) in Section 4). Moreover, we consider unweighted LCFRS, as weights of underlying grammar structures are usually not considered in supertagging-based approaches.
In this paper, we present the first supertagging-based parser for LCFRS. Section 3 extends the usual chart-based parsing approach for LCFRS to account for supertagging with lexical LCFRS. Section 4 adapts the lexicalization procedure by Mörbitz and Ruprecht (2020) to efficiently induce a lexical LCFRS from any given treebank. We implemented and evaluated the approach. Section 5 describes the experimental setups of our evaluation using three discontinuous treebanks (one English and two German). Section 6 compares our results to recent LCFRS-based parsers and other state-of-the-art parsers for discontinuous constituents. The implementation of our approach is published on GitHub.²

¹Their work is an instance of the lexicalization of (unweighted) multiple context-free tree grammars by Engelfriet et al. (2018).

Notation
We start by introducing some basic notation that will be used throughout Sections 3 and 4. The set of non-negative (resp. positive) integers is denoted by N (resp. N+). We abbreviate {1, ..., n} by [n]; in particular, [0] is the empty set. An alphabet Σ is a finite and non-empty set; the set of (finite) strings over Σ is denoted by Σ*. The symbol ε denotes an empty string or sequence.
Compositions. Linear context-free rewriting systems (LCFRS) extend the rule-based string rewriting mechanism of CFG to string tuples; we describe the generation process by compositions. Let k ∈ N and s_1, ..., s_k, s ∈ N+; one can think of k as the number of arguments of a function mapping string tuples of the sizes s_1, ..., s_k to a string tuple of size s. A Σ-composition is a tuple (u_1, ..., u_s) where each u_1, ..., u_s is a non-empty string over Σ and variables of the form x_i^j with i ∈ [k] and j ∈ [s_i]. Each of these variables must occur exactly once in u_1 ··· u_s, and they are ordered such that x_i^1 occurs before x_{i+1}^1 and x_i^j occurs before x_i^{j+1} for each i ∈ [k−1] and j ∈ [s_i − 1]. The set of all such compositions is denoted by C^Σ_{(s_1···s_k, s)}. As usual in the literature, we will only consider binary compositions (where k ≤ 2) in the following. Variables of the form x_1^i and x_2^j are abbreviated by x_i and y_j, respectively.
We associate with each composition (u_1, ..., u_s) ∈ C^Σ_{(s_1···s_k, s)} a function from k string tuples, where the i-th tuple is of arity s_i, to a string tuple of arity s. This function is denoted by ⟦(u_1, ..., u_s)⟧. Intuitively, it replaces each variable of the form x_i in u_1, ..., u_s by the i-th component of the first argument, and y_j by the j-th component of the second argument. The identity composition (x_1, ..., x_s) is denoted by id_s.

Figure 2: An overview of the supertagging-based parsing procedure. A sequence tagger predicts k (here, k = 1) supertags and one enriched preterminal (cf. step (4) in Section 4) for each sentence position. The supertags are rules of a uni-lexical LCFRS (the annotation of the nonterminals is explained in Section 4). The terminal of each rule is the sentence position it was predicted for (rather than the word at that position). The sequence of sentence positions is parsed and finally the resulting derivation is transformed into a parse tree. The transformation requires the predicted nonterminals, which are used as preterminals.
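The substitution semantics of compositions can be sketched in a few lines of Python. The encoding of a composition as a tuple of component lists, with variables x_i and y_j written as pairs ("x", i) and ("y", j), is our own illustration, not part of the formalism:

```python
def apply_composition(components, args):
    """Apply a composition to a list of argument string tuples.

    `components` is a tuple of lists; each list mixes terminal strings with
    variables ("x", i) (i-th component of the first argument) and
    ("y", j) (j-th component of the second argument)."""
    def resolve(symbol):
        if isinstance(symbol, tuple):
            kind, index = symbol
            arg = args[0] if kind == "x" else args[1]
            return arg[index - 1]
        return symbol
    return tuple(" ".join(resolve(s) for s in comp) for comp in components)

# The composition (x_1, y_1 x_2 y_2) applied to the string tuples
# ("A hearing", "on the issue") and ("scheduled", "today"):
c = ([("x", 1)], [("y", 1), ("x", 2), ("y", 2)])
apply_composition(c, [("A hearing", "on the issue"), ("scheduled", "today")])
# -> ("A hearing", "scheduled on the issue today")
```

The example reproduces the combination of the NP and VP tuples from Fig. 5.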

Let c ∈ C^Σ_{(s_1···s_k, s)} be a composition where k ∈ [2], let i ∈ [k] such that s_i = 1, and let w ∈ Σ*. We obtain the partial application of c to w as i-th argument, denoted by c_i(w), as follows:
• c_2(w) ∈ C^Σ_{(s_1, s)} is obtained from c by replacing y_1 by w, and
• c_1(w) is obtained from c by replacing x_1 by w and each variable y_j by x_j. If k = 1, then c_1(w) ∈ C^Σ_{(ε, s)}; otherwise c_1(w) ∈ C^Σ_{(s_2, s)}.
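A minimal sketch of partial application under a toy encoding (a tuple of component lists with variables as ("x", i)/("y", j) pairs; our illustration, not the paper's implementation):

```python
def partial_apply(components, i, w):
    """Partial application c_i(w): the i-th argument must have arity 1,
    i.e. only the variable ("x"/"y", 1) refers to it."""
    var = "x" if i == 1 else "y"
    result = []
    for comp in components:
        new = []
        for s in comp:
            if s == (var, 1):
                new.append(w)                    # substitute the string
            elif i == 1 and isinstance(s, tuple) and s[0] == "y":
                new.append(("x", s[1]))          # second argument becomes first
            else:
                new.append(s)
        result.append(new)
    return tuple(result)

# c = (x_1 y_1): absorbing "A" as first argument yields ("A" x_1):
partial_apply(([("x", 1), ("y", 1)],), 1, "A")
# -> (["A", ("x", 1)],)
```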

LCFRS. A (binary) LCFRS is a tuple G = (N, Σ, S, R) where
• N is a finite set (nonterminals), where each A ∈ N is associated with a fanout fo(A) ∈ N+,
• Σ is an alphabet (terminals),
• S ⊆ N (initial nonterminals) such that fo(A) = 1 for each A ∈ S, and
• R is a finite set (rules); each rule in R is of the form A → c(B_1, ..., B_k), where k ∈ {0, 1, 2}, A, B_1, ..., B_k ∈ N, and c ∈ C^Σ_{(fo(B_1)···fo(B_k), fo(A))}. The composition c induces a function mapping k string tuples (of sizes fo(B_1), ..., fo(B_k)) to a string tuple of size fo(A). We call A the left-hand side (lhs), B_1, ..., B_k the right-hand side (rhs), and c the rule's composition. We drop the parentheses around the rhs if k = 0.
In our examples, whenever the fanout of a nonterminal is greater than 1, we write the fanout as a subscript of the nonterminal. For instance, VP_2 denotes a verbal phrase with fanout 2. The fanout of G is fo(G) = max_{A∈N} fo(A).
Rules of the form A → c, A → c(B), and A → c(B_1, B_2) are called terminating, monic, and branching, respectively. A rule is called lexical if its composition contains at least one terminal, uni-lexical if it contains exactly one terminal, and double-lexical if it contains exactly two terminals. The LCFRS G is called (uni-)lexical if each of its rules is (uni-)lexical.
A derivation in G (starting with A ∈ N) is a tree over rules d = r(d_1, ..., d_k) such that r is of the form A → c(B_1, ..., B_k) and each d_i is a derivation in G starting with B_i. The set of derivations in G is denoted by D_G. The string tuple computed by d is defined recursively as w = ⟦c⟧(w_1, ..., w_k), where w_1, ..., w_k are the string tuples computed by d_1, ..., d_k; in the following we also call d a derivation for w.
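The recursive computation of the string tuple of a derivation can be sketched as follows, again under a toy encoding where a derivation node is a pair of a composition and a list of child derivations (our own illustration):

```python
def apply_composition(components, args):
    # Substitute variables ("x", i) / ("y", j) by argument components.
    def resolve(s):
        if isinstance(s, tuple):
            kind, i = s
            return (args[0] if kind == "x" else args[1])[i - 1]
        return s
    return tuple(" ".join(resolve(s) for s in comp) for comp in components)

def evaluate(derivation):
    """A derivation node is (composition, children); the string tuple it
    computes is obtained bottom-up, exactly as defined above."""
    composition, children = derivation
    return apply_composition(composition, [evaluate(c) for c in children])

# A two-leaf derivation mirroring Fig. 5a:
leaf_np = ((["A", "hearing"], ["on", "the", "issue"]), [])
leaf_vp = ((["scheduled"], ["today"]), [])
root = (([("x", 1)], [("y", 1), ("x", 2), ("y", 2)]), [leaf_np, leaf_vp])
evaluate(root)
# -> ("A hearing", "scheduled on the issue today")
```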

Supertagging-based parsing
Our parsing model consists of two components: a uni-lexical LCFRS and a discriminative sequence tagger, which we henceforth call supertagger. The LCFRS is induced from a treebank by an adaptation of the construction of Mörbitz and Ruprecht (2020); the interested reader finds a detailed description of this procedure in Section 4. After the induction, we replace every terminal of the LCFRS by a wildcard symbol, and we refer to the resulting rules as supertags.
Our parsing pipeline is depicted in Fig. 2.
(1) Given a sentence w, the supertagger predicts for each position of w the k best supertags, where k constitutes a hyperparameter of our approach.
(2) We combine the predicted supertags into a new LCFRS which we call G_w. In doing so, we replace the wildcard of each supertag by the sentence position it was predicted for.
(3) We employ a usual chart-based parsing algorithm to parse the sequence 1 2 ··· |w| of sentence positions with G_w.
(4) We transform the resulting derivation in G_w into a parse tree of the same form as those in the original treebank.
As G_w comprises only a small fraction of all supertags, this approach shifts a huge amount of work from parsing with grammars to predicting the rules. Thus its success is mainly determined by the quality of the supertagger.
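Step (2) of the pipeline above can be sketched as follows; the encoding of supertags as (lhs, rhs, composition) triples with a wildcard placeholder "_" is a hypothetical illustration, not the actual data structures of the implementation:

```python
def grammar_from_supertags(tagged):
    """Build the rule set of G_w from k-best supertag predictions.

    `tagged[p-1]` holds the supertags predicted for sentence position p
    (1-based); a supertag is a (lhs, rhs, composition) triple whose
    composition contains the wildcard "_" exactly once (uni-lexical)."""
    rules = []
    for position, supertags in enumerate(tagged, start=1):
        for lhs, rhs, components in supertags:
            # Replace the wildcard by the sentence position it was
            # predicted for; positions act as the terminals of G_w.
            filled = tuple(
                [position if s == "_" else s for s in comp]
                for comp in components)
            rules.append((lhs, rhs, filled))
    return rules

# One position, one predicted supertag; its wildcard becomes position 1:
tagged = [[("NP", ("NP|",), (["_", ("x", 1)],))]]
grammar_from_supertags(tagged)
# -> [("NP", ("NP|",), ([1, ("x", 1)],))]
```

Parsing then runs over the position sequence 1 2 ··· |w| with these rules, as in step (3).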

Inducing Lexical LCFRS
Our lexicalization scheme is based on Mörbitz and Ruprecht (2020). However, we ignore all weights and perform lexicalization on individual derivations rather than on a grammar induced from the entire treebank. More specifically, we directly read off a set of uni-lexical rules from each tree in the treebank; the union of these rules then forms our uni-lexical LCFRS G_lex. In contrast, Mörbitz and Ruprecht (2020) first induce an LCFRS G from the entire treebank and then lexicalize G. Thus G_lex may generate a different language than the lexicalization of G.
We obtain a set of uni-lexical rules from each tree t in the treebank by the following procedure.
(1) Binarize the tree. The symbol | is appended to constituents that result from binarization (this reflects Markovization with a vertical context of 1 and a horizontal context of 0).
(2) Transform the tree into an LCFRS derivation using the standard technique for induction of LCFRS (Maier and Søgaard, 2008).
(3) Collapse every chain of monic rules; the nonterminals of each chain are combined to a new nonterminal.
(4) Remove every terminating rule that has a parent and insert the terminal from its composition into the parent.
(5) Propagate terminals from double-lexical terminating rules into non-lexical branching rules. All rules in the resulting derivation are lexical.
(6) Split all remaining double-lexical terminating rules into two uni-lexical rules. All rules in the resulting derivation are uni-lexical. The resulting derivation is called d_lex(t).
(7) Read off the rules of d_lex(t); call them R(t).
These steps are defined such that, in the LCFRS formed by R(t), d_lex(t) is a derivation for the sentence of t. Moreover, we are able to reconstruct t from d_lex(t) by reverting steps (6) to (1) (we will give the details later).
Finally, we obtain the uni-lexical LCFRS G_lex by combining the rules R(t) of all trees t. The initial nonterminals of G_lex are all left-hand sides of roots of the derivations d_lex(t).
Let t be a tree in the treebank.
Steps (1) and (2) and their reversal are standard techniques for trees and LCFRS. After applying them to t, we obtain an LCFRS derivation in which each occurring rule is either terminating of the form A → (σ), monic with an identity composition, or branching of the form A → c(B_1, B_2), where c contains no terminals and none of B_1, B_2 is an initial nonterminal. Let us denote this derivation by d.
In the following, we describe steps (3) to (6) of the above procedure in more detail (showing examples in Figs. 3 to 6) and also glimpse at how the individual steps are reverted.
Step (3). We repeatedly replace each part of d consisting of a monic rule and the rule occurrence below it by a single occurrence of a rule; the lhs nonterminals A and B of the two rules are combined to a new nonterminal A+B. If the replaced part is not the root of d, then the corresponding nonterminal in the parent's rhs is replaced by A+B.³ After this step, there are only branching rules and terminating rules in d. Figure 3 shows an example for this step.
This step is easy to reverse, as the composition of every removed rule is id_{fo(B)}. We give the formal description in Appendix A.3.
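Step (3) can be sketched as follows, assuming a toy encoding of derivation nodes as (lhs, composition, children) triples where the collapsed monic rules carry identity compositions (our illustration, not the actual implementation):

```python
def is_identity(components):
    # id_s = (x_1, ..., x_s) in the toy encoding
    return all(comp == [("x", i + 1)] for i, comp in enumerate(components))

def collapse_chains(node):
    """Collapse chains of monic rules; the lhs nonterminals of a chain are
    joined with '+', and the composition of the bottommost rule is kept
    (the collapsed monic rules all carry identity compositions)."""
    lhs, components, children = node
    while len(children) == 1 and is_identity(components):
        child_lhs, components, children = children[0]
        lhs = lhs + "+" + child_lhs
    return (lhs, components, [collapse_chains(c) for c in children])

# A unary chain S -> id_1(VP), VP -> (hears) collapses to S+VP -> (hears):
collapse_chains(("S", [[("x", 1)]], [("VP", [["hears"]], [])]))
# -> ("S+VP", [["hears"]], [])
```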
Step (4). We remove every non-root occurrence r of a terminating rule A → (σ) in d. Let r be the i-th child of its parent (with i ∈ [2]); then we replace the parent's composition c by c_i(σ) and remove the i-th nonterminal in the parent's rhs. We note that the parent becomes lexical, and after this step, every rule in d is either branching or lexical. Moreover, every terminating rule in d is either double-lexical (if both children were removed) or the root of d (and thus its only node). Figure 4 shows an example for this step.
Figure 4a: A derivation for the string tuple (A hearing, on the issue). Gray arrows show the terminals that are put into binary non-lexical rules during step (4).
Figure 4b: The derivation resulting from applying step (4) to the derivation in Fig. 4a.

Clearly, this step loses information, namely the left-hand sides of the removed rules. These nonterminals are part-of-speech tags (that may be enriched with nonterminals of collapsed monic rules from the previous step). For the reversal of this step, we opted to predict them along with the supertags as part of the supertagger. The formal description of the reversal is given in Appendix A.2.
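Step (4) can be sketched under the same kind of toy encoding of derivation nodes as (lhs, composition, children) triples; the helper partial_apply mirrors the partial application c_i(w) from Section 2 (our own illustration, not the paper's implementation):

```python
def partial_apply(components, arg, w):
    """c_arg(w): substitute w for the single variable of argument `arg`
    and, if arg == 1, shift the remaining argument into first position."""
    var = "x" if arg == 1 else "y"
    out = []
    for comp in components:
        new = []
        for s in comp:
            if s == (var, 1):
                new.append(w)
            elif arg == 1 and isinstance(s, tuple) and s[0] == "y":
                new.append(("x", s[1]))
            else:
                new.append(s)
        out.append(new)
    return tuple(out)

def absorb_terminals(node, removed):
    """Step (4): splice the terminal of each terminating child into the
    parent's composition; the child's lhs (a preterminal) is recorded in
    `removed` so that it can be predicted at parse time."""
    lhs, components, children = node
    kept, arg = [], 1
    for child in children:
        c_lhs, c_comp, c_children = child
        if not c_children:                 # terminating child A -> (w)
            removed.append(c_lhs)
            components = partial_apply(components, arg, c_comp[0][0])
        else:
            kept.append(absorb_terminals(child, removed))
            arg += 1
    return (lhs, components, kept)

removed = []
d = ("NP", [[("x", 1), ("y", 1)]],
     [("DT", [["A"]], []), ("NN", [["hearing"]], [])])
absorb_terminals(d, removed)
# -> ("NP", (["A", "hearing"],), []) with removed == ["DT", "NN"]
```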
Step (5). For each occurrence r of a branching rule A → c(A_1, A_2) in d, let us consider the occurrence t of the leftmost terminating rule (i.e. t is a leaf) that is reachable via the second successor of r. For example, in Fig. 5a, the two binary rules (r) are end points of gray arrows; these arrows start at the mentioned leaves (t). Our goal is to remove one terminal from t and propagate it all the way up to r. For this, at each node s on the path from t to r (from bottom up):
• If s is t, we remove the leftmost terminal in the rule's composition at s.
• If s is neither t nor r, we insert the last removed terminal right before the variable x_1 and then remove the leftmost terminal in the rule's composition at s. We note that if the rule at s is monic and the variable x_1 occurs to the right of the terminal in its composition, then we propagate a different terminal than the one received from the child. In order to be able to reverse this step, we need to remember whether the terminal in the rule's composition stayed the same or was swapped with the terminal received from the child. In the following, we consider this information as part of the rule (cf. the gray annotation swapped in Fig. 5).
• If s is r, we insert the last removed terminal right before the variable y_1 in the rule's composition at s.
If s ≠ r, let s′ be the parent of s and s the i-th child of s′. If, after removal of a terminal at s, the first component in the composition is empty:
• we annotate the lhs nonterminal at s and the i-th rhs nonterminal at s′ with − and remove the empty component, and
• if i = 1 (resp. i = 2), we remove x_1 (resp. y_1) and replace every other occurrence of x_i by x_{i−1} (resp. y_j by y_{j−1}) at s′.
Otherwise, we annotate the nonterminals with +.
We note that the rule at r is now uni-lexical and branching, the rule at t is uni-lexical and terminating, and the number of terminals in each rule between them did not change. After this step, every rule in d is lexical. Figure 5 shows an example for this step.
There is a suitable leaf t for every branching rule r. Intuitively, this holds since (a) after step (4) every leaf of d is a double-lexical rule and (b) for each branching rule we first go to its second successor and then follow the path of first successors until we reach a leaf. Here, (a) guarantees that there exists a double-lexical rule for each branching rule and (b) guarantees that each double-lexical rule is "assigned" to at most one branching rule, thus at most one terminal is removed from it. We refer the interested reader to the proof of correctness by Engelfriet et al. (2018); this proof also applies to our method.
Figure 5a: A derivation for the string tuple (A hearing, scheduled on the issue today), containing e.g. the rules VP_2 → (x_1, y_1 x_2 y_2)(NP_2, VP|_2) and VP|_2 → (scheduled, today). Gray arrows show how terminals will be propagated through the derivation to lexicalize branching rules during step (5).
Figure 5b: The derivation resulting from applying step (5) to the derivation in Fig. 5a. A gray annotation swapped marks a monic rule whose terminal changed.

The reversal of this step removes all annotations (+, −, and swapped) and restores each composition in d to its original form. We note that the original composition can be obtained deterministically; the construction is given in Appendix A.1.
Step (6). We replace the rightmost terminal σ_2 in the composition of each double-lexical terminating rule by a variable and add a new nonterminal A_R to the rule's right-hand side (making it a uni-lexical monic rule). Then we insert the rule A_R → (σ_2) as a child. After this step, every rule in d is uni-lexical. Figures 5b and 6 show an example for this step. The reversal of this step is straightforward.
Figure 6: The derivation resulting from applying step (6) to the derivation in Fig. 5b; it contains e.g. the rule VP|_2^− → (today). Each rule in the derivation contains exactly one terminal.
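Step (6) can be sketched as follows, again on toy (lhs, composition, children) triples; the fresh nonterminal name lhs + "_R" is our own naming convention for A_R, and the sketch assumes the two terminals of a double-lexical rule are distinct:

```python
def split_double_lexical(node):
    """Step (6): replace the rightmost terminal sigma_2 of a double-lexical
    terminating rule by a variable and add a fresh child A_R -> (sigma_2)."""
    lhs, components, children = node
    if children:
        return (lhs, components, [split_double_lexical(c) for c in children])
    terminals = [s for comp in components for s in comp
                 if not isinstance(s, tuple)]
    if len(terminals) < 2:
        return node                        # already uni-lexical
    sigma2 = terminals[-1]                 # rightmost terminal
    new_comp = tuple([("x", 1) if s == sigma2 else s for s in comp]
                     for comp in components)
    return (lhs, new_comp, [(lhs + "_R", ([sigma2],), [])])

# The double-lexical rule VP|_2 -> (scheduled, today) from Fig. 5b:
split_double_lexical(("VP|2", (["scheduled"], ["today"]), []))
# -> ("VP|2", (["scheduled"], [("x", 1)]), [("VP|2_R", (["today"],), [])])
```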

Experiments
Implementation. The induction of uni-lexical LCFRS and the parsing were implemented by extending disco-dop (van Cranenburgh et al., 2016), from which we could borrow the generic LCFRS extraction and statistical parsing implementation. Moreover, we used the computation of evaluation scores in disco-dop.
The supertagger was implemented using the flair framework (Akbik et al., 2019). We report results for three types of architectures:
• bert - the output of the four topmost layers of a pretrained BERT⁴ model (Devlin et al., 2019), which is fine-tuned during training,
• flair - the concatenation of language-specific fasttext (Mikolov et al., 2018) and flair embeddings (Akbik et al., 2018), which is fed through a two-layered Bi-LSTM (Hochreiter and Schmidhuber, 1997), and
• supervised (small/large) - word embeddings (one-hot embeddings and character-based Bi-LSTM outputs) are trained with the model and fed through a two-layered Bi-LSTM. The small model adopts its size parameters from Stanojević and Steedman (2020) and Coavoux and Cohen (2019), the large model from Corro (2020).
On top of each of these, there are two linear layers in parallel: one for the supertags and one for the nonterminals that were removed in step (4) of our lexicalization scheme (i.e. part-of-speech tags plus nonterminals from collapsed monic rules). The sequence tagger is trained to predict the gold supertag and the removed nonterminal for each sentence position via the sum of cross-entropy losses. More details with respect to hyperparameters for all models are shown in Appendix B.
During parsing, the predicted supertags are interpreted as a probabilistic grammar. At each sentence position, the weight of a rule is the softmax of the supertag's score among the k best scores. The parsing implementation that we borrow from disco-dop supports heuristics and early stopping to speed up the parsing process. For each intermediate parse that does not span all sentence positions, we use the best supertag probability for each position outside the parse as a heuristic to estimate the weight of a complete parse.
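The softmax weighting of the k best supertag scores at one position can be sketched as follows (a minimal illustration; the actual scores come from the sequence tagger):

```python
import math

def supertag_weights(scores):
    """Softmax over the k best supertag scores at one sentence position;
    the results are used as rule probabilities in G_w."""
    m = max(scores)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = supertag_weights([2.0, 1.0, 0.5])
# weights sum to 1 and preserve the ranking of the scores
```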
We extended the parser with a fallback mechanism that deals with parse fails, i.e. cases where it is not able to find a parse tree for the whole sentence. It picks the largest partial derivations (for parts of the sentence) that it was able to find and combines them as children of artificial NOPARSE nodes. This is especially beneficial in settings with small k, as there are many parse fails (cf. column cov. in Table 1). For example, if we did not use this mechanism, we would obtain prec. = 95.53, rec. = 46.21 and F1 = 62.29 on the development set of Negra for k = 1 (cf. first row in Table 1).
We use only the highest-scoring nonterminal predicted for the reversal of step (4).

Data. Following Coavoux and Cohen (2019), we use three treebanks for discontinuous constituent parsing in our evaluations: Negra (Skut et al., 1998), Tiger (Brants et al., 2004), and a discontinuous version of the Penn Treebank (DPTB; Evang and Kallmeyer, 2011). The treebanks were split according to the usual standards into training, development and test sets.⁵ During development, the lexicalization, tagging and parsing were mostly tested and optimized using Negra. We binarized each training set before extracting the LCFRS and supertags. Markovization with horizontal context h = 0 and vertical context v = 1 yielded the best results; we thus extracted 3275 supertags from the training set of Negra, 4614 from Tiger and 4509 from DPTB. More context in Markovization led to a blow-up in the number of supertags, which proved to be disadvantageous.
Baselines. We report labeled F1-scores, obtained from predicted and gold parse trees using disco-dop (with the usual parameters in proper.prm), for all constituents (F1) and all discontinuous constituents (Dis-F1). In addition to the scores, parse speed⁶ is reported in sentences per second (sent/s).
Our scores are compared to recent state-of-the-art parsers for discontinuous constituent trees in four categories:
• grammar-based parsers (van Cranenburgh et al., 2016; Gebhardt, 2020; Versley, 2016) - parsers that directly rely on an underlying (probabilistic) grammar,
• chart-based parsers (Corro, 2020; Stanojević and Steedman, 2020) - parsers that share parsing algorithms with LCFRS, but lack an explicit set of rules,
• transition systems (Coavoux and Cohen, 2019; Coavoux et al., 2019), and
• neural systems (Fernández-González and Gómez-Rodríguez, 2020; Vilares and Gómez-Rodríguez, 2020) - all other recent parsing approaches using neural classifiers.
Our approach is in the first category, as the supertags are clearly constructed from a grammar that was extracted from the treebank. Therefore, the local relations in the predicted derivations are restricted to those occurring in the treebank. The approaches by Corro (2020) and Stanojević and Steedman (2020), on the other hand, rank spans in the sentence for occurrence in the predicted parse tree and predict their nonterminal, both independently of previous spans and nonterminals. Hence, they allow any combination of parent/child nonterminals in the resulting derivations.

Table 1 shows statistics of our parser on the development sets for different amounts (k) of supertags per sentence position. Specifically, we report the parsing speed (sent/s), the rate of sentence positions where the gold supertag was among the k predicted supertags (tag accuracy), the rate of sentences that were completely parsed (coverage) and parsing scores (labeled precision, recall and F1).

Results
We see that the parsing speed gradually drops with rising k, but for k > 10 there are barely any gains in terms of parsing scores. As expected, the recall increases drastically with rising k. The precision, on the other hand, only changes slightly. The drop in precision on Negra and Tiger may be explained by a significant decrease in parse fails from k = 1 to k = 2; thereafter, the effects of fewer parse fails and of considering lower-scored supertags seem to balance each other out. We found k = 10 to be a good parameter for the rest of our experiments.

Table 3 shows the parsing scores and speed of our trained models on the test set compared to the scores reported in other recent publications on discontinuous constituent parsing. The experiments suggest that parsing using LCFRS can greatly benefit from supertagging with respect to both speed and accuracy. This, however, requires a strong discriminative classifier as the sequence tagger to predict useful rules. Most notably, the prediction accuracy for discontinuous constituents seems to benefit strongly from pretrained word embeddings.
Compared to other parsing approaches, we obtain results that are on par with the state of the art, which has recently been rather unusual for grammar-based constituent parsing. We would like to especially highlight our results for discontinuous constituents, which surpass the previous state of the art by a wide margin.
Unfortunately, we can only compare our results to those of other supertagging-based parsers to a very limited extent, as authors seem to either report no parsing scores at all (Bladier et al., 2018) or give attachment scores for dependency relations (Kasai et al., 2017; Tian et al., 2020). However, Table 2 compares the accuracy of our supertagger to some recent publications. The CCG community is very active in the field of neural supertagging, achieving an improvement from 91.3% (Lewis and Steedman, 2014) to 96.4% accuracy (Tian et al., 2020) for predicted supertags over the last six years. We cannot compete with those numbers, but this may be due to the fact that far fewer supertags are trained in these approaches than in ours. In the case of TAG, the supertagger by Bladier et al. (2018) achieves a better accuracy than ours, but again, there are fewer tags to predict. Compared to Kasai et al. (2017), our models with pretrained embeddings seem to be on par in both the number of tags and the accuracy.

Conclusion
We described an approach to utilize supertagging for parsing discontinuous constituents with LCFRS and evaluated it. Compared to other LCFRS-based parsers, our approach improves both accuracy and parsing speed by a wide margin. Corro (2020) and Stanojević and Steedman (2020) address discontinuous constituent parsing using approaches that share an algorithmic foundation with LCFRS parsing, but do not use an underlying grammar. Both of them restrict constituents to two non-contiguous spans (equivalent to an LCFRS with fanout 2); we have no such limitation. Considering the margin between our discontinuous F1-score and theirs, we suppose that this restriction only benefits the complexity, not the accuracy.
Future Work. Compared to previous approaches for supertagging, we utilize large sets of supertags. We are confident that the accuracy of the supertagger can be improved by appropriately reducing these sets. The way terminals are transported in derivations during step (5) of the extraction is quite technical and chosen such that there is no impact on the fanout of the grammar (Mörbitz and Ruprecht, 2020). Alternative techniques could conceivably result in smaller sets of supertags and/or improve parsing results.
To validate the benefit of LCFRS (compared to using TAG or CCG) for supertagging-based approaches to constituent parsing, we aim for an in-depth comparison of our work to previous approaches. However, these approaches currently lack publicly available implementations for constituent parsing.

A Unlexicalizing Derivations
In this appendix, we formally describe the reversal of selected steps of our lexicalization scheme (cf. Section 4). In each instance, we assume a derivation as it would be obtained right after applying the corresponding step.
A.1 Reversing Step (5)

This step is applied to each occurrence r of a branching rule of the form A → c(A_1, A_2) from the bottom to the top (i.e. it has already been applied to the branching rules in the subtrees below a node before it is applied to the node itself). Let t be the leftmost occurrence of a terminating rule that is reachable from the second child of r. At each node s on the path from r to t (i.e. from top down) we perform three steps.
(1) We transform the composition back into the original composition, (2) we remove all annotations (+, −, and swapped), and (3) we pass a terminal to the child (if s is not t).
Obtaining the original composition. The composition at each node s is transformed back into its original form depending on the type of the rule. We note that if there is a branching rule at s and s ≠ r, the composition at s was already changed previously in this step and we leave it as it is.

Branching rule Let B → c(B_1, B_2) be the rule at s and σ be the terminal in c.
• If B_2 is annotated with − (i.e., its first component was removed during step (5)), we replace σ with y_0 and replace every occurrence of y_i by y_{i+1}.
Moreover, if s occurs as a successor of the right child of some other branching rule, then the nonterminals B and B_1 carry annotations as well.
• If B_1 is annotated with −, we replace every occurrence of x_i by x_{i+1} afterwards.
Monic rule Let B → c′(B_1) be the rule at s with c′ of the form (u_1, ..., u_s), σ_1 be the terminal received from the parent, and σ_2 be the terminal in c′.
1. If B is annotated with −, then c′ is replaced by (ε, u_1, ..., u_s).
2. If B_1 is annotated with −, then x_0 is inserted as the first symbol of the first component in c′. After that, every occurrence of x_i is replaced by x_{i+1}.
3. If the terminal was swapped during step (5), the terminal σ_2 is removed from c′ and σ_1 is added as the first symbol of the first component of c′.
We remark that if B is annotated with −, then it must be the case that B_1 is annotated with − as well or the terminal was swapped during step (5). Hence we do not add empty components here.
Terminating rule Let B → (σ_2) be the rule at s and σ_1 be the terminal received from the parent. We replace the rule by B → (σ_1, σ_2) if B is annotated with −, and by B → (σ_1 σ_2) otherwise.
Passing the terminal to the child.
• If s is r, let σ be the terminal in c. We pass σ to the next node on the path to t.
• If s is neither r nor t, and there is a branching rule at s, we pass the terminal received from the parent to the next node on the path to t.
• If there is a monic rule of the form B → c′(B_1) at s, we let σ_1 be the terminal received from the parent and σ_2 the terminal in c′. If the terminal in this rule was swapped during step (5), we pass σ_2 to the next node on the path to t; otherwise we pass σ_1.
A.2 Reversing Step (4)

We recall that during step (4), certain nonterminals that occurred as the lhs of terminating rules were removed. For reverting this step, we assume that these nonterminals are predicted by an oracle. We replace every occurrence of a terminating rule of the form
• A → (σ_1 σ_2) by the subderivation (A → (x_1 y_1)(A_1, A_2))(A_1 → (σ_1), A_2 → (σ_2)) and
• A → (σ_1, σ_2) by the subderivation (A → (x_1, y_1)(A_1, A_2))(A_1 → (σ_1), A_2 → (σ_2)),
where A_1 and A_2 are the predicted nonterminals for σ_1 and σ_2, respectively. We replace every subderivation d of the form (A → c(B))(d_1), where σ is the terminal in c and A_1 is the predicted nonterminal for σ, as follows: