Incremental Discontinuous Phrase Structure Parsing with the GAP Transition

This article introduces a novel transition system for discontinuous lexicalized constituent parsing called SR-GAP. It is an extension of the shift-reduce algorithm with an additional gap transition. Evaluation on two German treebanks shows that SR-GAP outperforms the previous best transition-based discontinuous parser (Maier, 2015) by a large margin (it is notably twice as accurate on the prediction of discontinuous constituents), and is competitive with the state of the art (Fernández-González and Martins, 2015). As a side contribution, we adapt span features (Hall et al., 2014) to discontinuous parsing.


Introduction
Discontinuous constituent trees can be used to model directly certain specific linguistic phenomena, such as extraposition, or more broadly to describe languages with some degree of word-order freedom. Although these phenomena are sometimes annotated with indexed traces in CFG treebanks, other constituent treebanks are natively annotated with discontinuous constituents, e.g. the Tiger corpus (Brants, 1998).
From a parsing point of view, discontinuities pose a challenge. Mildly context-sensitive formalisms that are expressive enough to model discontinuities have a high parsing complexity. For example, the CKY algorithm for a binary probabilistic LCFRS runs in O(n^(3k)), where k is the fan-out of the grammar (Kallmeyer, 2010).
Recently, there have been several proposals for direct discontinuous parsing. They correspond roughly to three different parsing paradigms. (i) Chart parsers are based on probabilistic LCFRS (Kallmeyer and Maier, 2013; Maier, 2010; Evang and Kallmeyer, 2011), or on the Data-Oriented Parsing (DOP) framework (van Cranenburgh, 2012; van Cranenburgh and Bod, 2013; van Cranenburgh et al., 2016). However, the complexity of inference in this paradigm requires designing elaborate search strategies and heuristics to make parsing run-times reasonable. (ii) Several approaches are based on modified non-projective dependency parsers, for example Hall and Nivre (2008), or more recently Fernández-González and Martins (2015), who provided a surprisingly accurate parsing method that can profit from efficient dependency parsers with rich features. (iii) Transition-based discontinuous parsers are based on the easy-first framework (Versley, 2014a) or on the shift-reduce algorithm augmented with a swap action (Maier, 2015). In the latter system, which we will refer to as SR-SWAP, a SWAP action pushes the second element of the stack back onto the buffer.
Although SR-SWAP is fast and obtained good results, it underperforms Fernández-González and Martins (2015)'s parser by a large margin. We believe this result does not indicate a fatal problem for the transition-based framework for discontinuous parsing, but emphasizes several limitations inherent to SR-SWAP, in particular the length of derivations (Section 3.5).
The shift-reduce system is based on two data structures. The stack (S) contains tree fragments representing partial hypotheses and the buffer (B) contains the remaining terminals. A parsing configuration is a couple ⟨S, B⟩. Initially, B contains the sentence as a list of terminals and S is empty.
The three types of actions are defined as follows.
• SHIFT(⟨S, w|B⟩) = ⟨S|w, B⟩
• REDUCE-X(⟨S|A₁|A₂, B⟩) = ⟨S|X, B⟩
• REDUCEUNARY-X(⟨S|A, B⟩) = ⟨S|X, B⟩
where X is any non-terminal in the grammar. The analysis terminates when the buffer is empty and the only symbol in the stack is an axiom of the grammar. This transition system can predict any labelled constituent tree over a set of non-terminal symbols N.
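The three actions above can be sketched directly in Python. This is a minimal illustration, not the parser's actual implementation: configurations are pairs (S, B), tree fragments are (label, children) tuples, and terminals are plain strings.

```python
# Minimal sketch of the plain shift-reduce transition system.

def shift(S, B):
    # SHIFT(<S, w|B>) = <S|w, B>
    return S + [B[0]], B[1:]

def reduce_x(S, B, X):
    # REDUCE-X(<S|A1|A2, B>) = <S|X, B>
    A1, A2 = S[-2], S[-1]
    return S[:-2] + [(X, [A1, A2])], B

def reduce_unary_x(S, B, X):
    # REDUCEUNARY-X(<S|A, B>) = <S|X, B>
    A = S[-1]
    return S[:-1] + [(X, [A])], B

# Deriving the tree S(NP(the, cat), slept):
S, B = [], ["the", "cat", "slept"]
S, B = shift(S, B)              # S = [the]
S, B = shift(S, B)              # S = [the, cat]
S, B = reduce_x(S, B, "NP")     # S = [NP]
S, B = shift(S, B)              # S = [NP, slept]
S, B = reduce_x(S, B, "S")      # S = [S]; B is empty: final configuration
```

The derivation ends with a single axiom symbol on the stack and an empty buffer, i.e. a final configuration.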
These three action types can only produce binary trees. In practice, shift-reduce parsers often assume that their data are binary. In this article, we assume that trees are binary, that each node X is annotated with a head h (notation: X[h]), and that the only unary nodes are parents to a terminal. Therefore, we only need unary reductions immediately after a SHIFT. We refer the reader to Section 3.1 for the description of the preprocessing operations.

SR-GAP Transition System
In order to handle discontinuous constituents, we need an algorithm expressive enough to predict non-projective trees.
Compared to the standard shift-reduce algorithm, the main intuition behind SR-GAP is that reductions do not always apply to the two top-most elements in the stack. Instead, the left element of a reduction can be any element in the stack and must be chosen dynamically.
To control the choice of the symbols to which a reduction applies, the usual stack is split into two data structures. A deque D represents its top and a stack S represents its bottom. Alternatively, we could see these two data structures as a single stack with two pointers indicating its top and a split point. The respective top-most elements of D and S are those available for a reduction.
The transition system is given as a deductive system in Figure 3. A REDUCE-X action pops the top element of S and the top element of D, flushes the content of D to S and finally pushes a new non-terminal X on D. As feature extraction (Section 3.1) relies on the lexical elements, we use two types of binary reductions, left and right, to assign heads to new constituents. Unary reductions replace the top of D by a new non-terminal.
The SHIFT action flushes D to S, pops the next token from B and pushes it onto D.
Finally, the GAP action pops the first element of S and appends it at the bottom of D. This action enables elements below the top of S to be also available for a reduction with the top of D.
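The three transition types can be sketched as follows. This is a hypothetical data layout, not the parser's actual code: S and D are Python lists, D[0] is the top of D and D[-1] its bottom, and flushing D onto S preserves the order of the underlying split stack.

```python
# Sketch of the SR-GAP configuration <S, D, B>.

def shift(S, D, B):
    # Flush D back onto S (preserving order), push the next token onto D.
    return S + list(reversed(D)), [B[0]], B[1:]

def gap(S, D, B):
    # Move the top of S to the bottom of D, making deeper elements
    # of S available for a reduction with the top of D.
    return S[:-1], D + [S[-1]], B

def reduce_x(S, D, B, X):
    # Combine the top of S with the top of D, flush the rest of D
    # back to S, and push the new non-terminal X on D.
    left, right = S[-1], D[0]
    return S[:-1] + list(reversed(D[1:])), [(X, [left, right])], B

# Deriving the discontinuous tree Y(b, X(a, c)) over the tokens a b c:
S, D, B = [], [], ["a", "b", "c"]
S, D, B = shift(S, D, B)
S, D, B = shift(S, D, B)
S, D, B = shift(S, D, B)            # S = [a, b], D = [c]
S, D, B = gap(S, D, B)              # S = [a],    D = [c, b]
S, D, B = reduce_x(S, D, B, "X")    # X covers a and c: a discontinuity
S, D, B = reduce_x(S, D, B, "Y")    # final: D = [Y]
```

The GAP action lets the reduction skip over b, so X groups the non-adjacent terminals a and c.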
Figure 3: SR-GAP transition system for discontinuous phrase structure parsing. X[h] denotes a non-terminal X and its head h. s₀ and d₀ denote the top-most elements of S and D respectively.
Constraints In principle, a tree can be derived by several distinct sequences of actions. If a SHIFT follows a sequence of GAPS, the GAPS have no effect, because SHIFT flushes D to S before pushing a new terminal onto D. In order to avoid useless GAPS, we do not allow a SHIFT to follow a GAP: a GAP must be followed by either another GAP or a binary reduction. Moreover, as we assume that preprocessed trees contain no unary nodes except possibly above the terminal level, unary reductions are only allowed immediately after a SHIFT. Other constraints on the transition system are straightforward; we refer the reader to Table 7 of Appendix A for the complete list.

Oracle and Properties
Preliminary Definitions Following Maier and Lichte (2016), we define a discontinuous tree as a rooted connected directed acyclic graph T = (V, E, r) where
• V is a set of nodes;
• r ∈ V is the root node;
• E ⊆ V × V is a set of (directed) edges, and E* is the reflexive transitive closure of E.
If (u, v) ∈ E, then u is the parent of v. Each node has a unique parent (except the root that has none). Nodes without children are terminals.
The right index (resp. left index) of a node is the index of the rightmost (resp. leftmost) terminal dominated by this node. For example, the left index of the node labelled S: in Figure 2 is 1 and its right index is 5.
Oracle We extract derivations from trees by following a simple tree traversal. We start with an initial configuration. While the configuration is not final, we derive a new configuration by performing the gold action, which is chosen as follows:
• if the nodes at the top of S and at the top of D have the same parent node in the gold tree, perform a reduction with the parent node label;
• if the node at the top of D and the i-th node in S have the same parent node, perform i − 1 GAPS;
• otherwise, perform a SHIFT, optionally followed by a unary reduction when the parent node of the top of D has only one child.
For instance, the gold sequence of actions for the tree in Figure 2 is the sequence SH, SH, SH, SH, SH, RR(NP), GAP, GAP, RR(NP), GAP, RL(S:), RR(S). Table 1 details the sequence of configurations obtained when deriving this tree.
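The gold-action choice can be sketched as a small decision function. This is a simplified illustration: the `parent` dictionary mapping each gold-tree node to its parent, and the string node representations, are assumptions for the example, not the parser's actual data structures.

```python
# Sketch of the gold-action choice (not the full oracle loop).

def oracle_action(S, D, parent):
    d0 = D[0]
    for i, s in enumerate(reversed(S), start=1):   # i = 1 is the top of S
        if parent.get(s) == parent.get(d0):
            if i == 1:
                return ("REDUCE", parent[d0])      # same parent: reduce now
            return ("GAP", i - 1)                  # i - 1 GAPs bring s in reach
    return ("SHIFT",)

# With S = [a, b], D = [c], and a gold tree where a and c share parent X,
# the gold move is one GAP (skipping over b) before the reduction:
action = oracle_action(["a", "b"], ["c"], {"a": "X", "c": "X", "b": "Y"})
```

Once the GAP is performed, a becomes the top of S and the next call returns the reduction itself.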
Given the constraints defined in Section 2.2, and if we ignore lexicalisation, there is a bijection between binary trees and derivations.¹ To see why, we define a total order < on the nodes of a tree. Let n and n′ be two nodes; n < n′ iff either rindex(n) < rindex(n′) or (n′, n) ∈ E*.
It is immediate that if (n′, n) ∈ E*, then n must be reduced before n′ in a derivation. An invariant of the GAP transition system is that the right index of the first element of D is equal to the index of the last shifted element. Therefore, after having shifted the terminal j, it is impossible to create nodes whose right index is strictly smaller than j. We conclude that during a derivation, the nodes must be created according to the strict total order < defined above. In other words, for a given tree, there is a unique possible derivation which enforces the constraints described above. Reciprocally, a well-formed derivation corresponds to a unique tree.

Table 1: Derivation for the tree in Figure 2; part-of-speech tags are omitted.

Completeness and Soundness
The GAP transition system is sound and complete for the set of discontinuous binary trees labelled with a set of non-terminal symbols. When augmented with certain constraints ensuring that predicted trees can be unbinarized (see Table 7 of Appendix A), this result also holds for the set of discontinuous n-ary trees (modulo binarization and unbinarization). Completeness follows immediately from the correctness of the oracle, which corresponds to a tree traversal in the order specified by <.
To prove soundness, we need to show that any valid derivation sequence produces a discontinuous binary tree. It follows from the transition system that no node can have several parents, as parent assignment via REDUCE actions pops the children nodes and makes them unavailable to other reductions. This implies that at any moment, the content of the stack is a forest of discontinuous trees. Moreover, at each step, at least one action is possible (thanks to the constraints on actions). As there can be no cycles, the number of actions in a derivation is upper-bounded by (n² + n)/2 for a sentence of length n (see Appendix A.1). Therefore, the algorithm can always reach a final configuration, where the forest contains a single discontinuous tree.
The correctness of the SR-GAP system holds only for the robust case, that is, for the full set of labelled discontinuous trees, and not, say, for the set of trees derived by a true LCFRS grammar that is also able to reject ungrammatical sentences. From an empirical point of view, a transition system that over-generates is necessary for robustness, and is desirable for fast approximate linear-time inference. However, from a formal point of view, the relationship of the SR-GAP transition system to automata explicitly designed for LCFRS parsing (Villemonte de La Clergerie, 2002; Kallmeyer and Maier, 2015) requires further investigation.

Length of Derivations
Any derivation produced by SR-GAP for a sentence of length n contains exactly n SHIFTS and n − 1 binary reductions. In contrast, the number of unary reductions and GAP actions can vary, so several possible derivations for the same sentence may have different lengths. This is a recurring problem for transition-based parsing because it undermines the comparability of derivation scores. In particular, Crabbé (2014) observed that the score of a parse item is approximately linear in the number of previous transitions, which creates a bias towards long derivations.
Different strategies have been proposed to ensure that all derivations have the same length (Zhu et al., 2013; Crabbé, 2014; Mi and Huang, 2015). Following Zhu et al. (2013), we use an additional IDLE action that can only be performed when a parsing item is final. Thus, short derivations are padded until the last parse item in the beam is final. IDLE actions are scored exactly like any other action.
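The padding scheme can be sketched in a few lines. This is an illustration of the idea, assuming derivations are simple lists of action names; in the parser, IDLE actions are predicted and scored incrementally like any other action rather than appended after the fact.

```python
# Sketch of derivation padding with IDLE (Zhu et al., 2013): once an item
# is final, it keeps receiving (scored) IDLE actions so that all candidate
# derivations compared by the beam have the same length.

def pad_with_idle(derivations):
    longest = max(len(d) for d in derivations)
    return [d + ["IDLE"] * (longest - len(d)) for d in derivations]

beam = [["SHIFT", "RU-NP", "SHIFT", "RR-S"], ["SHIFT", "SHIFT", "RR-S"]]
padded = pad_with_idle(beam)
```

After padding, the scores of the two hypotheses are sums over the same number of transitions and become directly comparable.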
SR-CGAP As an alternative strategy for the problem of comparability of hypotheses, we also present a variant of SR-GAP, called SR-CGAP, in which the length of any derivation depends only on the length of the sentence. In SR-CGAP, each SHIFT action must be followed by either a unary reduction or a ghost reduction (Crabbé, 2015), and each binary reduction must be preceded by exactly one compound GAPᵢ action (i ∈ {0, …, m}) specifying the number i of consecutive standard GAPS. For example, GAP₀ has no effect, and GAP₂ counts as a single action equivalent to two consecutive GAPS. We call these actions COMPOUNDGAP, following the COMPOUNDSWAP actions of Maier (2015).
With this set of actions, any derivation will have exactly 4n − 2 actions, consisting of n shifts, n unary reductions or ghost reductions, n − 1 compound gaps, and n − 1 reductions.
The parameter m (maximum index of a compound gap) is determined by the maximum number of consecutive gaps observed in the training set. Contrary to SR-GAP, SR-CGAP is not complete, as some discontinuous trees whose derivation should contain more than m consecutive GAPS cannot be predicted.

Beam Search with a Tree-structured Stack
A naive beam implementation of SR-GAP copies the whole parsing configuration at each step and for each item in the beam. This gives the parser a practical O(k · n²) complexity, where k is the size of the beam and n the length of a derivation. To overcome this, one can use a tree-structured stack (TSS) to factorize the representations of common stack prefixes, as has been done for projective dependency parsing. However, discontinuities entail that a limited amount of copying cannot be entirely avoided: when a reduction follows n GAP actions, we need to grow a new branch of size n + 1 in the tree-structured stack to account for reordering. The complexity of inference becomes O(k · (n + g)), where g is the number of gaps in the derivation. As gap actions are, proportionally, very rare in the dataset, the practical runtime is linear in the length of the derivation.
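The key mechanism behind a tree-structured stack is prefix sharing, which can be sketched with an immutable linked list. The names below (`Cell`, `push`, `pop`) are illustrative, not the parser's actual API.

```python
# Sketch: each beam item's stack as an immutable linked list, so common
# prefixes are shared between items instead of copied (the basis of a
# tree-structured stack).

class Cell:
    __slots__ = ("head", "tail")
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def push(stack, item):
    return Cell(item, stack)          # O(1); the old stack is untouched

def pop(stack):
    return stack.head, stack.tail     # O(1); no copying either

# Two beam items diverging after a common prefix share it physically:
prefix = push(push(None, "NP"), "VP")
item_a = push(prefix, "PP")
item_b = push(prefix, "SBAR")
```

Because cells are never mutated, pushing onto one beam item cannot corrupt another, and only the reordering caused by GAP sequences forces new cells to be allocated.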

Relationship to Dependency Parsing Algorithms
The transition system presented in this article uses two distinct data structures to represent the stack.
In this respect, it belongs to the family of algorithms presented by Covington (2001) for dependency parsing. Covington's algorithm iterates over every possible pair of words in a sentence and decides for each pair whether to attach them -with a left or right arc -or not. This algorithm can be formulated as a transition system with a split stack (Gómez-Rodríguez and Fernández-González, 2015).

Datasets
We evaluated our model on two corpora, namely the Negra corpus (Skut et al., 1997) and the Tiger corpus (Brants, 1998). To ensure comparability with previous work, we carried out experiments on several instantiations of these corpora. We present results on two instantiations of Negra. NEGRA-30 consists of sentences whose length is smaller than, or equal to, 30 words. We used the same split as Maier (2015). A second instantiation, NEGRA-ALL, contains all the sentences of the corpus, and uses the standard split (Dubey and Keller, 2003).
For the Tiger corpus, we also use two instantiations. TIGERHN08 is the split used by Hall and Nivre (2008). TIGERM15 is the split of Maier (2015), which corresponds to the SPMRL split (Seddah et al., 2013).² We refer the reader to Table 8 in Appendix A for further details on the splits used.
For both corpora, the first step of preprocessing consists in removing function labels and reattaching the nodes that are attached to the ROOT and cause artificial discontinuities (these are mainly punctuation terminals).³ Then, the corpora are head-annotated using the head rules included in the DISCO-DOP package, and binarized with an order-0 head-Markovization (Klein and Manning, 2003). There is a rich literature on binarizing LCFRS (Gómez-Rodríguez et al., 2009; Gildea, 2010), because both the gap degree and the rank of the resulting trees need to be minimized in order to achieve a reasonable complexity with chart-based parsers (Kallmeyer and Maier, 2013). However, this does not seem to be a problem for transition-based parsing, and the gains of using optimized binarization algorithms do not seem to be worth the complexity of these algorithms (van Cranenburgh et al., 2016). Unless otherwise indicated, we ran experiments with gold part-of-speech tags, following common practice in discontinuous parsing.

² As in previous work (Maier, 2015), two sentences (numbers 46234 and 50224) are excluded from the test set because they contain annotation errors.

³ We made extensive use of the publicly available TREETOOLS and DISCO-DOP software for these preprocessing steps. The preprocessing scripts will be released with the parser source code for full replicability.

Table 2: Due to discontinuities, it is possible that both the left and the right index of sᵢ are generated by the same child of sᵢ. We use c, w and t to denote a node's label, its head and the part-of-speech tag of its head. When used as a subscript, l (resp. r) refers to the left (resp. right) index of a node. Finally, lo (resp. ro) denotes the token immediately to the left of the left index (resp. to the right of the right index). See Figure 4 for a representation of a configuration with these notations.

Classifier
We used an averaged structured perceptron (Collins, 2002) with early-update training (Collins and Roark, 2004). We use the hashing trick (Shi et al., 2009) to speed up feature extraction. This has no noticeable effect on accuracy and improves training and parsing speed. The only hyperparameter of the perceptron is the number of training epochs. We fixed it at 30 for every experiment, and shuffled the training set before each epoch.
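The hashing trick can be sketched as follows. This is a generic illustration of the technique, not our implementation; the feature template names in the example (e.g. `d0.c=NP`) are made up for the demonstration.

```python
# Sketch of the hashing trick (Shi et al., 2009): feature strings are
# hashed directly into a fixed-size weight vector, avoiding a growing
# feature-to-index dictionary.

HASH_BITS = 20
SIZE = 1 << HASH_BITS          # 2**20 weight slots

def feature_index(feature):
    # Mask the hash to stay inside the weight vector.
    return hash(feature) & (SIZE - 1)

def score(weights, features):
    return sum(weights[feature_index(f)] for f in features)

weights = [0.0] * SIZE
weights[feature_index("d0.c=NP")] = 1.5     # as if set by perceptron updates
weights[feature_index("s0.w=cat")] = -0.5
total = score(weights, ["d0.c=NP", "s0.w=cat"])
```

Occasional hash collisions merge the weights of distinct features, but with a large enough table this has no noticeable effect on accuracy, while lookup becomes a constant-time operation.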

Features
We tested three feature sets described in Table 2 and Figure 4. The BASELINE feature set is the transposition of Maier (2015)'s baseline features to the GAP transition system. It is based on B, on S, and on the top element of D, but does not use information from the rest of D (i.e. the gapped elements). This feature set was designed in order to obtain an experimental setting as close as possible to that of Maier (2015).
In contrast, the EXTENDED feature set includes information from further in D, as well as additional context in S and n-grams of categories from both S and D.
The third feature set, SPANS, is based on the idea that constituent boundaries carry critical information for phrase structure parsing (Hall et al., 2014). This intuition is also confirmed in the context of lexicalized transition-based constituent parsing (Crabbé, 2015). To adapt this type of feature to discontinuous parsing, we only rely on the right and left indexes of nodes, and on the tokens preceding the left index or following the right index.⁴

Unknown Words In order to learn parameters for unknown words and limit feature sparsity, we replace hapaxes in the training set with an UNKNOWN pseudo-word. This accounts for an improvement of around 0.5 F1.
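The hapax replacement can be sketched in a few lines. The function name and the `UNKNOWN` token below match the description above; the toy corpus is invented for the example.

```python
from collections import Counter

# Sketch of the hapax replacement: tokens occurring exactly once in the
# training set are mapped to an UNKNOWN pseudo-word, so the model learns
# weights that transfer to unseen words at test time.

def replace_hapaxes(sentences, unk="UNKNOWN"):
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] > 1 else unk for w in sent] for sent in sentences]

corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
train = replace_hapaxes(corpus)   # "cat" and "dog" are hapaxes
```

At parsing time, any word unseen in training is mapped to the same pseudo-word, so the features learned for UNKNOWN apply to it directly.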

Results
We report results on test sets in Table 3. All the metrics were computed by DISCO-DOP with the parameters included in this package (proper.prm). The metric is labelled F1, ignoring roots and punctuation. We also report metrics computed only on discontinuous constituents (Disc. F1 in Tables 3 and 4), as these can give some qualitative insight into the strengths and weaknesses of our model.
For experiments on TIGERM15, we additionally report evaluation scores computed with the SPMRL shared task parameters 5 for comparability with previous work.
SR-GAP vs SR-CGAP In most experimental settings, SR-CGAP slightly underperformed SR-GAP. This result came as a surprise, as both compound actions for discontinuities (Maier, 2015) and ghost reductions (Crabbé, 2014) were reported to improve parsing.
We hypothesize that this result is due to the rarity of unary constituents in the datasets and to the difficulty of predicting COMPOUNDGAPS with a bounded look at D and S, caused by our practical definition of feature templates (Table 2). In contrast, predicting GAPS separately involves feature extraction at each step, which crucially helps.

Feature Sets
The EXTENDED feature set outperforms the baseline by up to one point of F1. This emphasizes that information about gapped non-terminals is important. The SPANS feature set yields another one-point improvement. This clearly demonstrates the usefulness of span features for discontinuous parsing. A direct extension of this feature set would include information about the boundaries of gaps in a discontinuous constituent. A difficulty of this extension is that the number of gaps in a constituent can vary.

Comparisons with Previous Works
There are three main approaches to direct discontinuous parsing.⁶ One such approach is based on non-projective or pseudo-projective dependency parsing (Hall and Nivre, 2008; Fernández-González and Martins, 2015), and aims at enriching dependency labels in such a way that constituents can be retrieved from the dependency tree. The advantage of such systems is that they can use off-the-shelf dependency parsers with rich features and efficient inference.
The last paradigm is transition-based parsing. Versley (2014a) and Versley (2014b) use an easy-first strategy with a swap transition. Maier (2015) and Maier and Lichte (2016) use a shift-reduce algorithm augmented with a swap transition. Table 3 includes recent results from these various parsers. The most successful approach so far is that of Fernández-González and Martins (2015), which outperforms transition-based parsers by a large margin (Maier, 2015; Maier and Lichte, 2016).
SR-GAP vs SR-SWAP In the same settings (baseline features and beam size of 4), SR-GAP outperforms SR-SWAP by a large margin on all datasets. It is also twice as accurate when we only consider discontinuous constituents.
In Section 3.5, we analyse the properties of both transition systems and give hypotheses for the performance difference.
Absolute Scores On all datasets, our model reaches or outperforms the state of the art (Fernández-González and Martins, 2015). This is still the case in a more realistic experimental setup with predicted tags, as reported in Table 5.⁷

As pointed out by Maier and Lichte (2016), a limitation of shift-reduce parsing is the locality of the feature scope during search. The parser may be in states where the information necessary to make the right parsing decision is not accessible to the current scoring model.
To gain more insight into this hypothesis, we tested large beam sizes. If the parser maintains a much larger number of hypotheses, it could compensate for the lack of information by delaying certain decisions. In Table 4, we present additional results on the development sets of both instantiations of the TIGER corpus, with different beam sizes. As expected, a larger beam size gives better results; the beam size controls the tradeoff between speed and accuracy.⁸ Interestingly, the improvement from a larger beam size is greater on discontinuous constituents than overall. For example, from 16 to 32, F1 improves by 0.5 on TIGERHN08 and F1 on discontinuous constituents improves by 1.4.
This suggests that further improvements could be obtained by augmenting the lookahead on the buffer and using features further on S and D. We plan in the future to switch to a neural model such as a bi-LSTM in order to obtain more global representations of the whole data structures (S, D, B).

Discussion: Comparing SR-SWAP and SR-GAP
This section investigates some differences between SR-SWAP and SR-GAP. We think that characterizing properties of transition systems helps to gain better intuitions into the problems inherent to discontinuous parsing.
Derivation Length Assuming that GAP or SWAP are the hardest actions to predict and are responsible for the variability of derivation lengths, we hypothesize that the number of these actions, hence the length of a derivation, is an important factor. Shorter derivations are less prone to error propagation.
In both cases, the shortest possible derivation for a sentence of length n corresponds to a projective tree, as the derivation will not contain any SWAP or GAP.
In the worst case, i.e. for the tree that requires the longest derivation, SR-GAP is asymptotically twice as economical as SR-SWAP (Table 6). In Figure 5 of Appendix A, we present the trees corresponding to the longest possible derivations in both cases.

Table 6: Statistics on the Tiger train corpus. n is the length of a sentence. SR-CSWAP is a variant of SR-SWAP proposed by Maier (2015).
These trees maximise the number of GAP and SWAP actions. The fact that derivations are shorter with SR-GAP is confirmed empirically. In Table 6, we present several metrics computed on the train section of TIGERM15. On average, SR-SWAP derivations are empirically 50% longer than SR-GAP derivations. Despite handling discontinuities, SR-GAP derivations are not noticeably longer than those we would get with a standard shift-reduce transition system (n shifts and n − 1 binary reductions).
Intuitively, the difference in length of derivations comes from two facts. First, swapped terminals are pushed on the buffer and must be shifted once more, whereas with SR-GAP, each token is shifted exactly once. Second, transition systems for discontinuous parsing implicitly predict an order on terminals (discontinuous trees can be transformed to continuous trees by changing the precedence order on terminals). With SR-SWAP, reordering is done by swapping two terminals. In contrast, SR-GAP can swap complex non-terminals (already ordered chunks of terminals), making the reordering more efficient in terms of number of operations.
It would be interesting to see if SR-SWAP is improved when swapping non-terminals is allowed. However, it would make feature extraction more complex, because it would no longer be assumed that the buffer contains only tokens.
The effect of the derivation length is confirmed by Maier and Lichte (2016) who explored different oracles for SR-SWAP and found that oracles producing shorter derivations gave better results.
Feature Locality A second property which explains the performance of SR-GAP is the access to three data structures (vs two for SR-SWAP) for extracting features; SR-GAP has access to an extended domain of locality. Moreover, with SR-SWAP, the semantics of features from the queue does not distinguish between swapped tokens and tokens that have not yet been shifted. When the parser needs to predict a long sequence of consecutive swaps, it hardly has access to the relevant information. The use of three data structures, along with shorter sequences of GAP actions, seems to alleviate this problem.

Conclusion
We have introduced a novel transition system for lexicalized discontinuous parsing. The SR-GAP transition system produces short derivations, compared to SR-SWAP, while being able to derive any discontinuous tree.
Our experiments show that it outperforms the best previous transition system (Maier, 2015) in similar settings and different datasets. Combined with a span-based feature set, we obtained a very efficient parser with state-of-the-art results.
We also provide an efficient C++ implementation of our parser, based on a tree-structured stack.
Direct follow-ups to this work consist in switching to a neural scoring model to improve the representations of D and S and alleviate the locality issues in feature extraction (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016).

SR-GAP There are n shifts and n − 1 binary reductions in a derivation. The longest derivation maximises the number of GAP actions by performing as many GAPS as possible before each binary reduction. When S contains k elements, there are k − 1 possible consecutive GAP actions. So the longest derivation starts with n shifts, followed by n − 2 GAP actions, one binary reduction, n − 3 GAP actions, one binary reduction, and so on:

L_gap(n) = n + ((n − 2) + 1) + ((n − 3) + 1) + · · · + 1 = 1 + 2 + · · · + n = n(n + 1)/2

This corresponds to the tree on the left-hand side of Figure 5.
SR-SWAP Using the oracle of Maier (2015), the longest derivation for a sentence of length n consists in maximising the number of swaps before each reduction.⁹ After the first shift, the derivation repeatedly performs n − i shifts, n − i − 1 swaps and one reduction, i being the number of shifted terminals before each iteration.
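As a sanity check, both worst-case counts can be tallied programmatically. This is a sketch that simply follows the two iteration patterns described above and compares them to closed forms; the function names are ours.

```python
# Numeric check of the worst-case derivation lengths.

def l_gap(n):
    # SR-GAP: n shifts, then a maximal GAP run before each binary reduction.
    shifts = n
    gaps = sum(range(n - 1))                   # (n - 2) + (n - 3) + ... + 0
    reductions = n - 1
    return shifts + gaps + reductions

def l_swap(n):
    # SR-SWAP, following the pattern above: after the first shift,
    # repeatedly n - i shifts, n - i - 1 swaps and one reduction.
    total = 1                                  # the first shift
    for i in range(1, n):                      # i terminals already consumed
        total += (n - i) + (n - i - 1) + 1
    return total

# l_gap(n) = n(n + 1)/2 while l_swap(n) = n^2 - n + 1: asymptotically,
# SR-GAP's longest derivation is about half as long as SR-SWAP's.
```

The closed form for L_gap agrees with the upper bound (n² + n)/2 given in the soundness argument.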