Discontinuous Incremental Shift-reduce Parsing

We present an extension to incremental shift-reduce parsing that handles discontinuous constituents, using a linear classifier and beam search. We achieve very high parsing speeds (up to 640 sent./sec.) and accurate results (up to 79.52 F1 on TiGer).


Introduction
Discontinuous constituents consist of more than one continuous block of tokens. They arise through phenomena which traditionally in linguistics would be analyzed as being the result of some kind of "movement", such as extraposition or topicalization. The occurrence of discontinuous constituents does not necessarily depend on the degree of freedom in word order that a language allows for. They can be found, e.g., in almost equal proportions in English and German treebank data (Evang and Kallmeyer, 2011).
Generally, discontinuous constituents are accounted for in treebank annotation. One annotation method consists of using trace nodes that denote the source of a movement and are co-indexed with the moved constituent. Another method is to annotate discontinuities directly by allowing for crossing branches. Fig. 1 shows an example for the latter approach with which we are concerned in this paper, namely, the annotation of (1). The tree contains a discontinuous VP due to the fact that the fronted pronoun is directly attached.
The recovery of trace-based annotation has been framed as a separate pre-, post- or in-processing task to PCFG parsing (Johnson, 2002; Dienes and Dubey, 2003; Jijkoun, 2003; Levy and Manning, 2004; Schmid, 2006; Cai et al., 2011, among others); see particularly Schmid (2006) for more details. Directly annotated discontinuous constituents can be parsed with a dependency parser, given a reversible transformation from discontinuous constituency trees to non-projective dependency structures. Transformations have been proposed by Hall and Nivre (2008), who use complex edge labels that encode paths between lexical heads, and recently by Fernández-González and Martins (2015), who use edge labels to encode the attachment order of modifiers to heads.
Direct parsing of discontinuous constituents can be done with Linear Context-Free Rewriting System (LCFRS), an extension of CFG which allows its non-terminals to cover more than one continuous block (Vijay-Shanker et al., 1987). LCFRS parsing is expensive: CYK chart parsing with a binarized grammar can be done in O(n^{3k}), where k is the block degree, the maximal number of continuous blocks a non-terminal can cover (Seki et al., 1991). For a typical treebank LCFRS (Maier and Søgaard, 2008), k ≈ 3, instead of k = 1 for PCFG. In order to improve on otherwise impractical parsing times, LCFRS chart parsers employ different strategies to speed up search: Kallmeyer and Maier (2013) use A* search; van Cranenburgh (2012) and van Cranenburgh and Bod (2013) use a coarse-to-fine strategy in combination with Data-Oriented Parsing; Angelov and Ljunglöf (2014) use a novel cost estimation to rank parser items. Maier et al. (2012) apply a treebank transformation which limits the block degree and therewith also the parsing complexity.
Recently, Versley (2014) achieved a breakthrough with EaFi, a classifier-based parser that uses an "easy-first" approach in the style of Goldberg and Elhadad (2010). In order to obtain discontinuous constituents, the parser uses a strategy known from non-projective dependency parsing (Nivre, 2009;): For every non-projective dependency tree, there is a projective dependency tree which can be obtained by reordering the input words. Non-projective dependency parsing can therefore be viewed as projective dependency parsing with an additional reordering of the input words. The reordering can be done online during parsing with a "swap" operation that allows input words to be processed out of order. This idea can be transferred to constituency parsing, because for every discontinuous constituency tree, one can likewise find a continuous tree by reordering the terminals. Versley (2014) uses an adaptive gradient method to train his parser. He reports a parsing speed of 40-55 sent./sec. and results that surpass those reported for the above-mentioned chart parsers.
In (continuous) constituency parsing, incremental shift-reduce parsing using the structured perceptron is an established technique. While the structured perceptron for parsing was first used by Collins and Roark (2004), classifier-based incremental shift-reduce parsing was taken up by Sagae and Lavie (2005). A general formulation for the application of the perceptron algorithm to various problems, including shift-reduce constituency parsing, was introduced by Zhang and Clark (2011b). Improvements have followed (Zhu et al., 2012; Zhu et al., 2013). A similar strategy has been shown to work well for CCG parsing, too (Zhang and Clark, 2011a).
In this paper, we contribute a perceptron-based shift-reduce parsing architecture with beam search (following Zhu et al. (2013) and Bauer (2014)) and extend it such that it can create trees with crossing branches (following Versley (2014)). We present strategies to improve performance on discontinuous structures, such as a new feature set.
Our parser is very fast (up to 640 sent./sec.) and produces accurate results. In our evaluation, where we pay particular attention to the parser performance on discontinuous structures, we show, among other things, that, surprisingly, a grammar-based parser has an edge over a shift-reduce approach concerning the reconstruction of discontinuous constituents.
The remainder of the paper is structured as follows. In subsection 2.1, we introduce the general parser architecture; subsections 2.2 and 2.3 introduce the features we use and our strategy for handling discontinuous structures. Section 3 presents and discusses the experimental results, and section 4 concludes the article.

Shift-reduce parsing with perceptron training
An item in our parser consists of a queue q of token/POS-pairs to be processed, and a stack s, which holds completed constituents. The parser uses different transitions: SHIFT shifts a terminal from the queue onto the stack. UNARY-X reduces the first element on the stack to a new constituent labeled X. BINARY-X-L and BINARY-X-R reduce the first two elements on the stack to a new X constituent, with the lexical head coming from the left or the right child, respectively. FINISH removes the last element from the stack. We additionally use an IDLE transition, which can be applied any number of times after FINISH, to improve the comparability of analyses of different lengths (Zhu et al., 2013). The application of a transition is subject to restrictions. UNARY-X, e.g., can only be applied when there is at least one item on the stack. We implement all restrictions listed in the appendix of Zhang and Clark (2009), and add restrictions that block transitions involving the root label before the end of a derivation has been reached. We do not use an underlying grammar to filter out transitions which have not been seen during training.
For decoding, we use beam search (Zhang and Clark, 2011b). Decoding is started by putting the start item (empty stack, full queue) on the beam. Then, repeatedly, a candidate list is filled with all items that result from applying legal transitions to the items on the beam, followed by putting the n highest-scoring of them back on the beam (given a beam size of n). Parsing is finished if the highest-scoring item on the beam is a final item (stack holds one item labeled with the root label, queue is empty), which can be popped. Item scores are computed as in Zhang and Clark (2011b): The score of the (i+1)-th item is computed as the sum of the score of the i-th item and the dot product of a global feature weight vector and the local feature vector resulting from the changes induced by the corresponding transition to the (i+1)-th item. The start item has score 0. We train the global weight vector with an averaged perceptron with early update (Collins and Roark, 2004).
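The decoding loop described above can be sketched as follows. This is a minimal Python illustration and not the Java implementation used in the paper: the names `beam_search`, `successors`, `is_final` and the toy transition function are assumptions made for the example, and features are treated as binary indicators, so that the dot product reduces to a sum of weights.

```python
def transition_score(weights, features):
    """Dot product of the global weight vector with a binary local feature vector."""
    return sum(weights.get(f, 0.0) for f in features)


def beam_search(start, successors, is_final, weights, beam_size):
    """Return (score, item) for the first final item that tops the beam.

    successors(item) must return (features, next_item) pairs, one per legal
    transition; both callbacks are hypothetical stand-ins for the parser.
    """
    beam = [(0.0, start)]  # the start item has score 0
    while beam:
        best_score, best_item = beam[0]
        if is_final(best_item):
            return best_score, best_item
        candidates = []
        for score, item in beam:
            for features, next_item in successors(item):
                # score(i+1) = score(i) + w . f(transition)
                step = transition_score(weights, features)
                candidates.append((score + step, next_item))
        # keep the beam_size highest-scoring candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return None  # no legal transition left before a final item was reached


# Toy usage: items are strings, each transition appends one letter.
def toy_successors(item):
    return [(["add=" + c], item + c) for c in "ab"]


toy_weights = {"add=a": 1.0, "add=b": -0.5}
result = beam_search("", toy_successors, lambda s: len(s) == 3,
                     toy_weights, beam_size=2)
```

In a parser, `successors` would also have to offer IDLE for items that have already been FINISHed, so that finished and unfinished items can compete on the same beam. Perceptron training with early update is not shown.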
Parsing relies on binary trees. As in previous work, we binarize the incoming trees head-outward with binary top and bottom productions. Given a constituent X which is to be binarized, all intermediate nodes which are introduced will be labeled @X. Lexical heads are marked with Collins-style head rules. As an example, Fig. 2 shows the binarized version of the tree of Fig. 1.
Finally, since we are learning a sparse model, we also exploit the work of Goldberg and Elhadad (2011) who propose to include a feature in the calculation of a score only if it has been observed ≥ MINUPDATE times.
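The MINUPDATE filter can be illustrated with a small sketch. It assumes that "observed" means the number of perceptron updates in which a feature took part, which is our reading of the strategy and not something the text states explicitly; the class name `SparseModel` and its methods are hypothetical.

```python
from collections import defaultdict


class SparseModel:
    """Weight vector in which rarely updated features are ignored when scoring."""

    def __init__(self, min_update):
        self.min_update = min_update
        self.weights = defaultdict(float)
        self.update_counts = defaultdict(int)

    def update(self, features, delta):
        """Apply one perceptron-style update and count it for every feature."""
        for f in features:
            self.weights[f] += delta
            self.update_counts[f] += 1

    def score(self, features):
        """Sum the weights of features updated at least min_update times."""
        return sum(self.weights.get(f, 0.0) for f in features
                   if self.update_counts.get(f, 0) >= self.min_update)


model = SparseModel(min_update=2)
model.update(["s0w=Darüber"], 1.0)
score_after_one = model.score(["s0w=Darüber"])   # feature is still ignored
model.update(["s0w=Darüber"], 1.0)
score_after_two = model.score(["s0w=Darüber"])   # feature now contributes
```

Features below the threshold still accumulate weight during training; they are only left out of the score until they have been updated often enough.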

Handling Discontinuities
In order to handle discontinuities, we use two variants of a swap transition which are similar to swap-eager and swap-lazy from Nivre (2009) and . The first variant, SINGLESWAP, swaps the second item of the stack back onto the queue. The second variant, COMPOUNDSWAP_i, bundles a maximal number of adjacent swaps. It swaps i items starting from the second item on the stack, with 1 ≤ i < |s|. Both swap operations can only be applied if
1. the item has not yet been FINISHed and the last transition has not been a transition with the root category,
2. the queue is not empty,
3. all elements to be swapped are pre-terminals, and
4. the first item of the stack has a lower index than the second (this inhibits swap loops).
SINGLESWAP can only be applied if there are at least two items on the stack. For COMPOUNDSWAP_i, there must be at least i + 1 items.
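The effect of the two swap variants on stack and queue can be sketched as follows. The sketch is only an illustration under simplifying assumptions: stack and queue are plain Python lists of token positions (top of stack last, front of queue first), the function names are hypothetical, and only the stack-size requirement is checked, not the four applicability conditions listed above.

```python
def single_swap(stack, queue):
    """Move the second item of the stack back to the front of the queue."""
    if len(stack) < 2:
        raise ValueError("SINGLESWAP needs at least two items on the stack")
    queue.insert(0, stack.pop(-2))


def compound_swap(stack, queue, i):
    """Move i items, starting from the second item of the stack, to the queue."""
    if not 1 <= i < len(stack):
        raise ValueError("COMPOUNDSWAP_i needs at least i + 1 items on the stack")
    moved = stack[-1 - i:-1]      # the i items directly below the top
    del stack[-1 - i:-1]
    queue[:0] = moved             # original left-to-right order is kept


# Four shifted tokens; the two in the middle have to be swapped out.
stack_a, queue_a = [0, 1, 2, 3], []
single_swap(stack_a, queue_a)
single_swap(stack_a, queue_a)

stack_b, queue_b = [0, 1, 2, 3], []
compound_swap(stack_b, queue_b, 2)
```

Both variants leave tokens 0 and 3 adjacent on the stack and put tokens 1 and 2 back on the queue in their original order; the compound variant does so with a single transition.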
Transition sequences are extracted from treebank trees with an algorithm that traverses the tree bottom-up and collects the transitions. For a given tree τ, intuitively, the algorithm works as follows. We start out with a queue t containing the pre-terminals of τ, a stack σ that receives finished constituents, a counter s that keeps track of the number of terminals to be swapped, and an empty sequence r that holds the result. First, the first element of t is pushed on σ and removed from t.

1. Repeat while transitions can be added:
(a) if the top two elements on σ, l and r, have the same parent p labeled X and l/r is the head of p, add BINARY-X-l/r to r, pop two elements from σ and push p;
(b) if the top element on σ is the only child of its parent p labeled X, add UNARY-X to r, pop an element of σ and push p.
2. If |t| > 0, while the first element of t is not equal to the leftmost pre-terminal dominated by the right child of the parent of the top element on σ (i.e., while there are terminals that must be swapped), add SHIFT to r, increment s, push the first element of t on σ and remove it from t. Finally, add another SHIFT to r, push the first element of t on σ and remove it from t (this will contribute to the next reduction). If s > 0, we must swap. Either we add s many SWAP transitions or one COMPOUNDSWAP_s to r. Then we move s many elements from σ to the front of t, starting with the second element of σ. Finally, we set s = 0.
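The extraction procedure can be sketched in Python as follows. This is an illustrative reading of the steps above and not the authors' implementation: the `Node` class, the function names and the four-token example tree are hypothetical, the tree is assumed to be binarized and head-marked already, and the children of each node are assumed to be ordered by their leftmost terminal.

```python
class Node:
    """Binarized tree node; pre-terminals carry a token index and no children."""

    def __init__(self, label, children=(), head=None, index=None):
        self.label, self.head, self.index = label, head, index
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def first_index(node):
    """Index of the leftmost pre-terminal dominated by node."""
    if not node.children:
        return node.index
    return min(first_index(c) for c in node.children)


def preterminals(node):
    if not node.children:
        return [node]
    found = [p for c in node.children for p in preterminals(c)]
    return sorted(found, key=lambda n: n.index)


def extract_transitions(root, compound=False):
    queue = preterminals(root)
    stack = [queue.pop(0)]
    result = ["SHIFT"]  # we record a SHIFT for the initial push
    while True:
        # Step 1: reduce as long as a unary or binary reduction is possible.
        while stack[-1].parent is not None:
            parent = stack[-1].parent
            if len(parent.children) == 1:
                result.append("UNARY-" + parent.label)
                stack[-1] = parent
            elif len(stack) >= 2 and stack[-2].parent is parent:
                result.append("BINARY-%s-%s" % (parent.label, parent.head))
                del stack[-2:]
                stack.append(parent)
            else:
                break
        if not queue:
            break
        # Step 2: shift up to the sibling of the stack top, then swap.
        target = first_index(stack[-1].parent.children[-1])
        swaps = 0
        while queue[0].index != target:
            stack.append(queue.pop(0))
            result.append("SHIFT")
            swaps += 1
        stack.append(queue.pop(0))
        result.append("SHIFT")
        if swaps:
            if compound:
                result.append("COMPOUNDSWAP_%d" % swaps)
            else:
                result.extend(["SINGLESWAP"] * swaps)
            queue[:0] = stack[-1 - swaps:-1]   # move the items below the top
            del stack[-1 - swaps:-1]
    result.append("FINISH")
    return result


def example_tree():
    # Hypothetical tree over four tokens with a discontinuous VP (tokens 0, 3).
    w = [Node("POS%d" % i, index=i) for i in range(4)]
    vp = Node("VP", [w[0], w[3]], head="R")
    at_s = Node("@S", [vp, w[1]], head="R")
    return Node("S", [at_s, w[2]], head="L")


single = extract_transitions(example_tree())
bundled = extract_transitions(example_tree(), compound=True)
```

For this tree, `single` is SHIFT, SHIFT, SHIFT, SHIFT, SINGLESWAP, SINGLESWAP, BINARY-VP-R, SHIFT, BINARY-@S-R, SHIFT, BINARY-S-L, FINISH, and `bundled` replaces the two SINGLESWAPs with one COMPOUNDSWAP_2. The sketch does not guard against ill-formed input trees.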
As an example, consider the transition sequence we would extract from the tree in Fig. 2. Using SINGLESWAP, we would obtain SHIFT, SHIFT, SHIFT, SHIFT, SINGLESWAP, SINGLESWAP, BINARY-VP-R, SHIFT, BINARY-@S-R, SHIFT, BINARY-S-L, FINISH. Using COMPOUNDSWAP_i, instead of two SINGLESWAPs, we would just obtain a single COMPOUNDSWAP_2.
[Fig. 4: feature templates. Unigrams: s0xwc, s1xwc, s2xwc, s3xwc, s0xtc, s1xwc, s2xtc, s3xwc, s0xy, s1xy, s2xy, s3xy. Bigrams: s0xs1c, s0xs1w, s0xs1x, s0ws1x, s0cs1x, s0xs2c, s0xs2w, s0xs2x, s0ws2x, s0cs2x, s0ys1y, s0ys2y, s0xq0t, s0xq0w.]
We explore two methods which improve the performance on discontinuous structures. Even though almost a third of all sentences in the German NeGra and TiGer treebanks contain at least one discontinuous constituent, among all constituents, the discontinuous ones are rare, making up only around 2%. The first, simple method addresses this sparseness by raising the importance of the features that model the actual discontinuities: all feature occurrences at a gold swap transition are counted twice (IMPORTANCE).
Secondly, we use a new feature set (DISCO) with bigram and unigram features that conveys information about discontinuities. The features condition the possible occurrence of a gap on previous gaps and their properties. The feature templates are shown in Fig. 4. x denotes the gap type of a tree on the stack. There are three possible values, either "none" (tree is fully continuous), "pass" (there is a gap at the root, i.e., this gap must be filled later further up in the tree), or "gap" (the root of this tree fills a gap, i.e., its children have gaps, but the root does not). Finally, y is the sum of all gap lengths.
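The gap type x and the gap length sum y can be computed from the terminal positions a tree covers. The following sketch is one possible reading of the definitions above, with two assumptions that the text does not settle: "its children have gaps" is checked on the immediate children only, and y is taken to be the number of positions missing from the yield of the root. The function names are hypothetical.

```python
def gap_length_sum(positions):
    """Number of positions missing between the first and last covered token."""
    positions = set(positions)
    return max(positions) - min(positions) + 1 - len(positions)


def gap_type(root_positions, child_position_sets):
    """Classify a tree on the stack as 'none', 'pass' or 'gap'."""
    if gap_length_sum(root_positions) > 0:
        return "pass"   # the root itself is discontinuous
    if any(gap_length_sum(c) > 0 for c in child_position_sets):
        return "gap"    # the root closes a gap left open by a child
    return "none"


# A VP covering tokens 0 and 3, and an S that fills the remaining gap.
vp_type = gap_type({0, 3}, [{0}, {3}])
vp_gap = gap_length_sum({0, 3})
s_type = gap_type({0, 1, 2, 3}, [{0, 1, 3}, {2}])
leaf_type = gap_type({1}, [])
```

Here the VP is of type "pass" with y = 2, the S node is of type "gap", and a single pre-terminal is of type "none".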

Data
We use the TiGer treebank release 2.2 (TIGER), and the NeGra treebank (NEGRA). For TIGER, we use the first half of the last 10,000 sentences for development and the second half for testing. We also recreate the split of Hall and Nivre (2008) (TIGERHN), for which we split TiGer in 10 parts, assigning sentence i to part i mod 10. The first of those parts is used for testing, the concatenation of the rest for training.
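The TIGERHN split can be written down in a few lines. The sketch assumes that sentences are numbered from 0 and that "the first of those parts" is the part with remainder 0; neither detail is stated above, so both are exposed as parameters. The function name is hypothetical.

```python
def hall_nivre_split(sentences, test_part=0, n_parts=10, first_number=0):
    """Assign sentence i to part i mod n_parts; one part is test, the rest train."""
    train, test = [], []
    for i, sentence in enumerate(sentences, start=first_number):
        if i % n_parts == test_part:
            test.append(sentence)
        else:
            train.append(sentence)
    return train, test


train_ids, test_ids = hall_nivre_split(list(range(25)))
```

With 25 toy sentences, the test part contains sentences 0, 10 and 20, and the remaining 22 sentences form the training set.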
From NeGra, we exclude all sentences longer than 30 words (in order to make a comparison with rparse possible, see below), and split off the last 10% of the treebank for testing, as well as the previous 10% for development. As a preprocessing step, in both treebanks we remove spurious discontinuities that are caused by material which is attached to the virtual root node (mainly punctuation). All such elements are attached to the least common ancestor node of their left and right terminal neighbors (as proposed by Levy (2005), p. 163). We furthermore create a continuous variant NEGRACF of NEGRA with the method usually used for PCFG parsing: For all maximal continuous parts of a discontinuous constituent, a separate node is introduced (Boyd, 2007). Subsequently, all nodes that do not cover the head child of the discontinuous constituent are removed.
No further preprocessing or cleanup is applied.

Experimental Setup
Our parser is implemented in Java. We run all our experiments with Java 8 on an Intel Core i5, allocating 15 GB per experiment. All experiments are carried out with gold POS tags, as in previous work on shift-reduce constituency parsing (Zhang and Clark, 2009). Grammatical function labels are discarded.
For the evaluation, we use the corresponding module of discodop. We report several metrics (as implemented in discodop):
• Extended labeled bracketing, in which a bracket for a single node consists of its label and a set of pairs of indices, delimiting the continuous blocks it covers. We do not include the root node in the evaluation and ignore punctuation. We report labeled precision, recall and F1, as well as exact match (all brackets correct).
• Leaf-ancestor (Sampson and Babarczy, 2003), for which we consider all paths from leaves to the root.
• Tree edit distance (Emms, 2008), which consists of the minimum edit distance between gold tree and parser output.
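To make the extended labeled bracketing metric concrete, the following sketch computes precision, recall and F1 over brackets of the form (label, blocks). It is not the discodop implementation; the function names are hypothetical, and the caller is assumed to have removed the root node and punctuation beforehand.

```python
from collections import Counter


def blocks(positions):
    """Turn a set of token positions into a tuple of (start, end) index pairs."""
    ordered = sorted(positions)
    result, start, previous = [], ordered[0], ordered[0]
    for position in ordered[1:]:
        if position != previous + 1:      # a gap closes the current block
            result.append((start, previous))
            start = position
        previous = position
    result.append((start, previous))
    return tuple(result)


def bracket_scores(gold, predicted, discontinuous_only=False):
    """gold/predicted: lists of (label, positions). Returns (P, R, F1)."""
    def to_brackets(constituents):
        found = Counter((label, blocks(pos)) for label, pos in constituents)
        if discontinuous_only:
            found = Counter({b: n for b, n in found.items() if len(b[1]) > 1})
        return found

    g, p = to_brackets(gold), to_brackets(predicted)
    matched = sum((g & p).values())       # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("VP", {0, 3}), ("NP", {1, 2})]
predicted = [("VP", {0, 3}), ("NP", {1})]
all_scores = bracket_scores(gold, predicted)
disco_scores = bracket_scores(gold, predicted, discontinuous_only=True)
```

Restricting the comparison to brackets with more than one block corresponds to the separate evaluation of discontinuous constituents mentioned in the following paragraph.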
Aside from a full evaluation, we also evaluate only the constituents that are discontinuous. We perform 20 training iterations unless indicated otherwise. When training stops, we average the model (as in Daumé III (2006)).
We run further experiments with rparse (Kallmeyer and Maier, 2013) to facilitate a comparison with a grammar-based parser.

Results
We start with discontinuous parsing experiments on NEGRA and TIGER, followed by continuous parsing experiments, and a comparison to grammar-based parsing.

Discontinuous Parsing
NeGra. The first goal is to determine the effect of different beam sizes with BASELINE features and the COMPOUNDSWAP_i operation. We run experiments with beam sizes 1, 2, 4 and 8; Fig. 5 shows the results obtained on the dev set after each iteration. Fig. 6 shows the average decoding speed during each iteration for each beam size (both smoothed).
Tracking two items instead of one results in a large improvement. Raising the beam size from 2 to 4 results in a smaller improvement. The improvement obtained by augmenting the beam size from 4 to 8 is even smaller. This behavior is mirrored by the parsing speeds during training: The differences in parsing speed roughly align with the result differences. Note that fast parsing during training means that the parser does not perform well (yet) and that therefore, early update is done more often. Note finally that the average parsing speeds on the test set after the last training iteration [...]
For further experiments on NeGra, we choose a beam size of 8. Tab. 1 shows the bracketing scores for various parser setups. In Tab. 2, the corresponding TED and Leaf-Ancestor scores are shown.
[Table 2: Results NEGRA TED and Leaf-Ancestor]
In the first block of the tables, we compare SWAP with COMPOUNDSWAP_i. On all constituents, the latter beats the former by 0.8 (F1). On discontinuous constituents, using COMPOUNDSWAP_i gives an improvement of more than four points in precision and of about 0.8 points in recall. A manual analysis confirms that, as expected, particularly discontinuous constituents with large gaps profit from bundling swap transitions.
In the second block, we run the BASELINE features with COMPOUNDSWAP_i combined with SEPARATOR, EXTENDED and DISCO. The SEPARATOR features were not as successful as they were for Zhang and Clark (2009). All scores for discontinuous constituents drop (compared to the baseline). The EXTENDED features are more effective and give an improvement of about half a point F1 on all constituents, as well as the highest exact match among all experiments. On discontinuous constituents, precision rises slightly but we lose about 1.4% in recall (compared to the baseline). The latter seems to be due to the fact that in comparison to the baseline, with EXTENDED, more sentences get erroneously analyzed as not containing any crossing branches. This effect can be explained with data sparseness and is less pronounced when more training data is available (see below). Similarly to EXTENDED, the new DISCO features lead to a slight gain over the baseline (on all constituents). As with EXTENDED, on discontinuous constituents, we again gain precision (3%) but lose recall (0.5%), because more sentences are wrongly analyzed as not having discontinuities than with the BASELINE. A category-based evaluation of discontinuous constituents reveals that EXTENDED has an advantage over DISCO when considering all constituents. However, we can also see that the DISCO features yield better results than EXTENDED particularly on the frequent discontinuous categories (NP, VP, AP, PP), which indicates that the information about gap type and gap length is useful for the recovery of discontinuities. IMPORTANCE (see Sec. 2.3) is not very successful, yielding results which lie in the vicinity of those of the BASELINE.
In the third block of the tables, we test the performance of the DISCO features in combination with other techniques, i.e., we use the BASELINE and DISCO features with COMPOUNDSWAP_i and combine them with EXTENDED and SEPARATOR features as well as with the IMPORTANCE strategy. All experiments beat the BASELINE/DISCO combination in terms of F1. EXTENDED and DISCO give a cumulative advantage, resulting in an increase in precision on discontinuous constituents of almost 4% compared to the use of DISCO alone, and of over 6% compared to the use of EXTENDED alone. Adding the SEPARATOR features to this combination does not bring an advantage. The IMPORTANCE strategy is the most successful one in combination with DISCO, causing a boost of almost 10% on precision of discontinuous constituents, leading to the highest overall discontinuous F1 of 29.41 (notably more than 12 points higher than the baseline); also on all constituents we obtain the third-highest F1. Combining DISCO with IMPORTANCE and EXTENDED leads to the highest overall F1 on all constituents of 76.95. However, the results on discontinuous constituents are slightly lower than for IMPORTANCE alone. This confirms the previously observed behavior: The EXTENDED features help when considering all constituents, but they do not seem to be effective for the recovery of discontinuities in particular.
In the TED and LA scores (Tab. 2), we see much less variation than in the bracketing scores. As reported in the literature (e.g., Rehbein and van Genabith (2007)), this is because of the fact that with bracketing evaluation, a single wrong attachment can "break" brackets which otherwise would be counted as correct. Nevertheless, the trends from bracketing evaluation repeat.
To sum up, the COMPOUNDSWAP_i operation works better than SWAP because the latter misses long gaps. The most useful feature sets were EXTENDED and DISCO, both when used independently and when used together. DISCO was particularly useful for discontinuous constituents. SEPARATOR yielded no usable improvements. IMPORTANCE has also proven to be effective, yielding the best results on discontinuous constituents (in combination with DISCO). Over almost all experiments, a common error is that on root level, CS and S get confused, indicating that the present features do not provide sufficient information for disambiguation of those categories. We can also confirm the tendency that discontinuous VPs in relatively short sentences are recognized correctly, as reported by Versley (2014).
TiGer. We now repeat the most successful experiments on TIGER. Tab. 3 shows the parsing results for the test set.
Some of the trends seen in the experiments with NEGRA are repeated. EXTENDED and DISCO yield an improvement on all constituents. However, now not only DISCO, but also EXTENDED leads to improved scores on discontinuous constituents. As mentioned above, this can be explained by the fact that the amount of training data available in NEGRA was not enough for the EXTENDED features to be effective. Unlike in NEGRA, the DISCO features are now more effective when used alone, leading to the highest overall F1 on discontinuous constituents of 19.45. They are, however, less effective in combination with EXTENDED. This is partially remedied by giving the swap transitions more IMPORTANCE, which leads to the highest overall F1 on all constituents of 74.71.
The models we learn are sparse; therefore, as mentioned above, we can exploit the work of Goldberg and Elhadad (2011). They propose to only include the weight of a feature in the computation of a score if it has been seen more than MINUPDATE times. We repeat the BASELINE experiment with two different MINUPDATE settings (see Tab. 3). As expected, the MINUPDATE models are much smaller. The final model with the baseline experiment uses 8.3m features (parsing speed on test set 73 sent./sec.), with MINUPDATE 5 3.3m features (121 sent./sec.) and with MINUPDATE 10 1.8m features (124 sent./sec.). With MINUPDATE 10, the results do degrade. However, with MINUPDATE 5, in addition to the faster parsing, we consistently improve over the baseline.
Finally, in order to check the convergence, we run a further experiment in which we limit training iterations to 40 instead of 20, together with beam size 4. We use the BASELINE features with COMPOUNDSWAP_i combined with DISCO, EX-

Continuous Parsing
We investigate the impact of the swap transitions on both speed and parsing results by running an experiment with NEGRACF using the BASELINE and EXTENDED features. The corresponding results are shown in Tab. 4.
Particularly high frequency categories (NP, VP, S) are much easier to find in the continuous case and show large improvements. This explains why without the swap transition, F1 with BASELINE features is 6.9 points higher than the F1 on discontinuous constituents (with COMPOUNDSWAP_i). With the EXTENDED features, we obtain a small improvement.
Note that with the shift-reduce approach, the difference between the computational cost of producing discontinuous constituents vs. the cost of producing continuous constituents is much lower than for a grammar-based approach. When producing continuous constituents, parsing is only 20% faster than with the swap transition, namely 97 instead of 81 sentences per second.
In order to give a different perspective on the role of discontinuous constituents, we perform two further evaluations. First, we remove the discontinuities from the output of the discontinuous baseline parser using the procedure described in Sec. 3.1 and evaluate the result against the continuous gold data. We obtain an F1 of 76.70, 5.5 points lower than the continuous baseline. Secondly, we evaluate the output of the continuous baseline parser against the discontinuous gold data. This leads to an F1 of 78.89, 2.9 points more than the discontinuous baseline. Both evaluations confirm the intuition that parsing is much easier when discontinuities (i.e., in our case the swap transition) do not have to be considered.

Comparison with other Parsers
rparse. In order to compare our parser with a grammar-based approach, we now parse NEGRA with rparse, with the same training and test sets as before (i.e., we do not use the development set). We employ markovization with v = 1, h = 2 and head-driven binarization with binary top and bottom productions. The first thing to notice is that rparse is much slower than our parser. The average parsing speed is about 0.3 sent./sec.; very long sentences require over a minute to be parsed. The parsing results are shown in Tab. 5. They are about 5 points worse than those reported by Kallmeyer and Maier (2013). This is due to the fact that they train on the first 90% of the treebank, and not on the first 80% as we do, which leads to an increased number of unparsed sentences. In comparison to the baseline setting of the shift-reduce parser with beam size 8, the results are around 10 points worse. However, rparse reaches an F1 of 26.61 on discontinuous constituents, which is 5.9 points more than we achieved with the best setting with our parser.
In order to investigate why the grammar-based approach outperforms our parser on discontinuous constituents, we count the frequency of LCFRS productions of a certain gap degree in the binarized grammar used in the rparse experiment. The average occurrence count of rules with gap degree 0 is 12.18. Discontinuous rules have a much lower frequency, the average count of productions with one, two and three gaps being 3.09, 2.09, and 1.06, respectively. In PCFG parsing, excluding low frequency productions does not have a large effect (Charniak, 1996); however, this does not hold for LCFRS parsing, where they have a major influence (cf. Maier (2013, p. 205)): This means that removing low frequency productions has a negative impact on the parser performance particularly concerning discontinuous structures; however, it also means that low frequency discontinuous productions get triggered reliably. This hypothesis is confirmed by the fact that our parser performs much worse on discontinuous constituents with a very low frequency (such as CS, making up only 0.62% of all discontinuous constituents) than it performs on those with a high frequency (such as VP, making up 60.65% of all discontinuous constituents), while rparse performs well on the low frequency constituents.
Our results exceed those of EaFi [6] and the exact match score of H&N. We are outperformed by the F&M parser. Note that particularly the comparison to EaFi must be handled with care, since Versley (2014) uses additional preprocessing: PP-internal NPs are annotated explicitly, and the parenthetical sentences are changed to be embedded by their enclosing sentence (instead of vice versa).
We postpone a thorough comparison with both EaFi and the dependency parsers to future work.
[6] Note that Versley (2014) reports a parsing speed of 40-55 sent./sec.; depending on the beam size and the training set size, our parser parses 39-640 sentences per second.

Discussion
To our knowledge, surprisingly, numerical scores for discontinuous constituents have not been reported anywhere in previous work. The relatively low overall performance with both grammar-based and shift-reduce based parsing, along with the fact that the grammar-based approach outperforms the shift-reduce approach, is striking. We have shown that it is possible to push the precision on discontinuous constituents, but not the recall, to the level of what can be achieved with a grammar-based approach.
Particularly the outcome of the experiments involving the EXTENDED features and IMPORTANCE drives us to the conclusion that the major problem when parsing discontinuous constituents is data sparseness. More features cannot be the only solution: A more reliable recognition of discontinuous constituents requires a more robust learning from larger amounts of data.

Conclusion
We have presented a shift-reduce parser for discontinuous constituents which combines previous work in shift-reduce parsing for continuous constituents with recent work in easy-first parsing of discontinuous constituents. Our experiments confirm that an incremental shift-reduce architecture with a swap transition can indeed be used to parse discontinuous constituents. The swap transition is associated with a low computational cost. We have obtained a speed-up of up to 2,000% in comparison to the grammar-based rparse, and we have shown that we obtain better results than with the grammar-based parser, even though the grammar-based strategy does better at the reconstruction of discontinuous constituents.
In future work, we will concentrate on methods that could remedy the data sparseness concerning discontinuous constituents, such as self-training. Furthermore, we will experiment with larger feature sets that add lexical information. A formal investigation of the expressivity of our parsing model is currently under way.