Empty element recovery by spinal parser operations

This paper presents a spinal parsing algorithm that jointly detects empty elements. The method achieves state-of-the-art performance on English and Japanese empty element recovery tasks.


Introduction
Empty categories, which are used in Penn Treebank-style annotations to represent complex syntactic phenomena such as constituent movement and discontinuous constituents, provide important information for understanding the semantic structure of sentences. Previous studies have attempted empty element recovery by casting it as linear tagging (Dienes and Dubey, 2003), PCFG parsing (Schmid, 2006; Cai et al., 2011), or post-processing of syntactic parsing (Johnson, 2002; Gabbard et al., 2006). To the best of our knowledge, the results of Cai et al. (2011) are the best reported so far, so we pursue a method that uses syntactic parsing to jointly solve the empty element recovery problem.
Our proposal uses the spinal Tree Adjoining Grammar (TAG) formalism of Carreras et al. (2008). A spinal TAG has a set of elementary trees, called spines, each consisting of a lexical anchor with a series of unary projections. Figure 1 displays (a) a head-annotated constituent tree and (b) the spines extracted from the tree. This paper presents a transition-based algorithm, together with several operations that combine spines, for constructing full parse trees with empty elements. Compared with PCFG parsing approaches, one advantage of our method is its flexible feature representation, which allows the incorporation of constituency-, dependency-, and spine-based features. In particular, the motivation for our spinal TAG-based approach comes from the intuition that features extracted from spines should be useful for empty element recovery in the same way that constituency-based vertical higher-order conjunctive features are used in recent post-processing methods (Xiang et al., 2013; Takeno et al., 2015). Experiments on English and Japanese datasets show empirically that our system outperforms existing alternatives.

Spinal Tree Adjoining Grammars
We define here the spinal TAG G = (N, PT, T, LS), where N is a set of nonterminal symbols, PT is a set of pre-terminal symbols (part-of-speech tags), T is a set of terminal symbols (words), and LS is a set of lexical spines. Each spine s has the form s = n_0 → n_1 → ··· → n_{k-1} → n_k (k ∈ ℕ) and satisfies the conditions:
• n_0 ∈ T and n_1 ∈ PT,
• ∀i ∈ [2, k], n_i ∈ N.
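As a concrete illustration, the spine definition above can be sketched in Python. The class and function names below are ours, not part of the paper's formalism; this is a minimal sketch, not an implementation of the parser.

```python
# Hypothetical sketch of a lexical spine n_0 -> n_1 -> ... -> n_k:
# a word anchor n_0, a POS tag n_1, and nonterminal projections n_2..n_k.

class Spine:
    def __init__(self, nodes):
        self.nodes = list(nodes)  # [n_0, n_1, ..., n_k]

    def ht(self):
        # height ht(s) = k + 1, i.e. the number of nodes on the spine
        return len(self.nodes)

    def label(self, i):
        # s(i) = n_i
        return self.nodes[i]

def is_valid_spine(spine, T, PT, N):
    """Check the conditions: n_0 in T, n_1 in PT, and n_i in N for i >= 2."""
    nodes = spine.nodes
    return (len(nodes) >= 2
            and nodes[0] in T
            and nodes[1] in PT
            and all(n in N for n in nodes[2:]))

s = Spine(["We", "PRP", "NP"])
print(s.ht(), s.label(1))  # 3 PRP
```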
The height of a spine s is ht(s) = k + 1, and for a position i ∈ [0, k], the label at i is s(i) = n_i. Taking the leftmost spine s = We → PRP → NP in Figure 1 (b) as an example, ht(s) = 3 and s(1) = PRP. The spinal TAG uses two operations, sister adjunction and regular adjunction, to combine spines. Both adjunctions have left and right types. We write @x to denote a node position on a spine explicitly. After a regular adjunction, the resulting tree has an additional node level containing a copy of the original node at position @x, while a sister adjunction simply inserts a spine at some node of another spine. If a left (or right) adjunction inserts spine s_1 at the node @x on spine s_2, we call s_2 the head spine of s_1 and s_1 the left (or right) child spine of s_2. This paper denotes sister adjunction left and right as s_1 ▷x s_2 and s_2 ◁x s_1, and regular adjunction left and right as s_1 ▶x s_2 and s_2 ◀x s_1, respectively.
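To make the two adjunction types concrete, here is a minimal sketch in our own toy representation (not the paper's implementation): a derived spine keeps, for each node level, its label plus lists of left and right children. Sister adjunction attaches a child at an existing level, while regular adjunction first adds a copied level.

```python
class DerivedSpine:
    """A spine with attached children: level i holds [label, left, right]."""

    def __init__(self, nodes):
        # nodes = [n_0, n_1, ..., n_k]; level i corresponds to n_i
        self.levels = [[label, [], []] for label in nodes]

    def sister_adjoin_left(self, child, x):
        # Insert child as the new leftmost daughter of the node at @x.
        self.levels[x][1].insert(0, child)

    def regular_adjoin_left(self, child, x):
        # Add a node level above @x carrying a copy of its label; the
        # child becomes the left daughter of the new level.
        self.levels.insert(x + 1, [self.levels[x][0], [child], []])

    def bracket(self):
        out = self.levels[0][0]  # the lexical anchor
        for label, left, right in self.levels[1:]:
            parts = ([c.bracket() for c in left] + [out]
                     + [c.bracket() for c in right])
            out = "(" + label + " " + " ".join(parts) + ")"
        return out

head = DerivedSpine(["saw", "VBD", "VP", "S"])
child = DerivedSpine(["We", "PRP", "NP"])
head.sister_adjoin_left(child, 3)  # attach the NP spine at the S node
print(head.bracket())  # (S (NP (PRP We)) (VP (VBD saw)))
```

Regular adjunction on the same head at the VP node would instead yield a doubled VP level, e.g. (S (VP child (VP (VBD saw)))).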
A transition system for spinal TAG parsing is a tuple S = (C, T, I, C_t), where C is a set of configurations, T is a set of transitions, which are partial functions t : C ⇀ C, I is a total initialization function mapping each input string to a unique configuration, and C_t ⊆ C is a set of terminal configurations. A configuration is a tuple (α, β, A), where α is a stack of stack elements, β is a buffer of elements from the input, and A is a set of parser operations. A stack element s is a pair (s, j), where s is a spine and j is a node index on s; we refer to these as s.s and s.j, respectively. Let x = ⟨w_1/t_1, ..., w_n/t_n⟩ (∀i ∈ [1, n], w_i ∈ T and t_i ∈ PT) be a POS-tagged input sentence. The arc-standard transition system of Hayashi et al. (2016) defines an initialization function over such inputs together with the shift, adjunction, and finish transitions. To reduce search errors, Hayashi et al. (2016) employed beam search with the dynamic programming of Huang and Sagae (2010). In our experiments, we also use this technique, together with the discriminative model of Hayashi et al. (2016).
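The configuration and the shift step can be sketched as follows. This is our minimal illustration, under the assumption that a spine is assigned to each word as it is shifted; it does not reproduce the full transition set of Hayashi et al. (2016).

```python
from collections import namedtuple

# (alpha, beta, A): stack, buffer, and the set of parser operations so far.
Config = namedtuple("Config", ["stack", "buffer", "ops"])

def initialize(tagged_sentence):
    # Map the POS-tagged input <w_1/t_1, ..., w_n/t_n> to the initial
    # configuration: empty stack, full buffer, no operations.
    return Config(stack=(), buffer=tuple(tagged_sentence), ops=frozenset())

def shift(c, spine_of):
    # Push a stack element (s, j) for the front buffer word; spine_of is
    # an assumed lookup from a word/tag pair to its (predicted) spine,
    # and we start the node index j at the pre-terminal level.
    w, t = c.buffer[0]
    elem = (spine_of(w, t), 1)
    return Config(c.stack + (elem,), c.buffer[1:], c.ops)

c = initialize([("We", "PRP"), ("saw", "VBD")])
c = shift(c, lambda w, t: (w, t, "NP") if t == "PRP" else (w, t, "VP"))
print(c.stack, len(c.buffer))
```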

Spinal TAG with Empty Elements
In this paper, we redefine the spinal TAG as G = (N, PT, T, LS, *e*, ET, ES), where *e* is a special word, ET is a set of empty categories, and ES is a set of empty spines. An empty spine s = n_0 → n_1 → ··· → n_{k-1} → n_k (k ∈ ℕ) has the same form as a lexical spine, but with n_0 = *e* and n_1 ∈ ET. The height and label definitions are also the same as those for lexical spines. For example, the rightmost spine s = *e* → *T* → ADVP in Figure 1 (b) is an empty spine with ht(s) = 3 and s(1) = *T*.
This paper extends empty spines so that a phrasal constituent consisting only of empty elements can be used as a single spine. A phrasal empty spine is a tuple (t, h), where t is a sequence of (phrasal) empty spines together with the sister adjunctions performed between them, and h is the head spine in t. The phrasal empty spine in Figure 3 consists of two empty spines, *e* → 0 and *e* → *T* → S → SBAR, where a sister adjunction left is performed at the SBAR node of the latter spine, which is the head spine of the phrase. To apply parser operations to a phrasal empty spine, we use its head spine rather than the phrase itself; the height and label of a phrasal empty spine are defined as those of its head spine.
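The empty-spine condition and the delegation of height and label queries to the head spine can be sketched like so; the names are illustrative, not the paper's.

```python
EMPTY_WORD = "*e*"

def is_empty_spine(nodes, ET):
    # Same shape as a lexical spine, but n_0 = *e* and n_1 in ET.
    return len(nodes) >= 2 and nodes[0] == EMPTY_WORD and nodes[1] in ET

class PhrasalEmptySpine:
    """(t, h): a sequence t of empty spines and the head spine h in t."""

    def __init__(self, spines, head_index):
        self.spines = spines
        self.head = spines[head_index]

    def ht(self):
        # Height of a phrasal empty spine = height of its head spine.
        return len(self.head)

    def label(self, i):
        return self.head[i]

# The example from Figure 3: *e* -> 0 plus *e* -> *T* -> S -> SBAR,
# with the latter as the head.
p = PhrasalEmptySpine([[EMPTY_WORD, "0"],
                       [EMPTY_WORD, "*T*", "S", "SBAR"]], head_index=1)
print(p.ht(), p.label(1))  # 4 *T*
```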
To recover empty elements, this paper introduces two additional operations, insert and combine, both of which have left and right types. Figures 2 (c) and (d) show the insert left and combine right operations. Both operations resemble sister adjunction: insert places a phrasal empty spine at some node of another spine, while combine inserts a spine at some node of a phrasal empty spine.

New Transitions
To handle empty spines during parsing, we add the following five transitions to the arc-standard transition system of Hayashi et al. (2016):
7-8. for each s ∈ ES and each j with s_1.j ≤ j < ht(s_1.s), an insert left transition and an insert right transition, where s′_1 = (s_1.s, j);
9-10. for each s ∈ ES and each j with 2 ≤ j < ht(s), a combine left transition and a combine right transition;
11. an idle transition of the form (σ|s_1, β, A) ⊢ (σ|s_1, β, A).
Like the unary and idle rules in shift-reduce CFG parsing (Zhu et al., 2013), our current system prohibits more than b consecutive actions consisting only of insert, combine, and idle operations. Given an input sentence of length n, after performing n shift actions, n − 1 adjunction actions, and b·(2n − 1) insert, combine, or idle actions, the system triggers the finish action and terminates. For training, we construct oracle derivations using the stack-shortest strategy.
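The action budget implied by this bound can be sketched as follows (a toy illustration; the function names are ours):

```python
QUIET = {"insert", "combine", "idle"}  # actions limited by the count b

def total_actions(n, b):
    # n shifts + (n - 1) adjunctions + b*(2n - 1) quiet actions + finish.
    return n + (n - 1) + b * (2 * n - 1) + 1

def allowed(history, action, b):
    # Prohibit more than b consecutive insert/combine/idle actions.
    if action not in QUIET:
        return True
    run = 0
    for a in reversed(history):
        if a not in QUIET:
            break
        run += 1
    return run < b

print(total_actions(10, 2))  # 58 actions for n = 10, b = 2
```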

Related Work
To realize empty element recovery, other lexicalized TAG formalisms (Chen and Shanker, 2004; Shen et al., 2008) attach some or all empty elements directly to surface word lexicons. Our framework instead uses spinal TAG parser operations, which provide more efficient parsing and more compact lexicons. To our knowledge, this paper is the first to present a shift-reduce spinal TAG parsing algorithm that recovers empty elements. Recent work has shown that empty element recovery can be effectively solved in conjunction with parsing (Schmid, 2006; Cai et al., 2011). Schmid (2006) annotated constituent trees with slash features to recover a direct path from a filler node to its trace. Cai et al. (2011) successfully integrated empty element recovery into lattice parsing for latent PCFGs. Compared with PCFG parsing, the spinal TAG parser provides a more flexible feature representation.

Experiments on the English Penn Treebank
We used the Wall Street Journal (WSJ) part of the English Penn Treebank: Sections 02-21 for training, Section 22 for development, and Section 23 for testing. We annotated trees with heads using treep (Chiang and Bikel, 2002) and trained the model with the perceptron algorithm of Huang et al. (2012). For training and testing, we set the beam size to 16 and the max count b, introduced in Section 4.2, to 2. For comparison with other systems in our environment, we also implemented two systems:
• Lattice is the method of Cai et al. (2011). We used blatt, an extension of the Berkeley parser, to parse word lattices in which the special word *e* is encoded as described in (Cai et al., 2011).
• Tagger decides whether an empty category is inserted in front of a word, using regularized logistic regression. To simplify point-wise linear tagging, we merged empty categories that appear at the same position in a sentence into a single category; the original 10 empty types thus increased to 63.
Table 1 shows the final results on Section 23. To evaluate the accuracy of empty element recovery, we calculated precision, recall, and F1 scores for (1) Labeled Empty Bracket (X/t, i, i), (2) Labeled Empty Element (t, i, i), and (3) All Brackets, where X ∈ N, t ∈ ET, and i is the position of the empty element, using eevalb. The results clearly show that our proposed method significantly outperforms the other systems. Table 2 shows the main reason for the improvement achieved by our method. The *ICH*, *RNR*, and *EXP* empty types mark relations between non-adjacent constituents caused by syntactic phenomena such as extraposition and conjunction. Our method captures such complex relations better thanks to its rich syntactic features.
Table 1 also reports the scores for non-empty brackets, to examine whether the joint method degrades standard PARSEVAL scores. While the Lattice method was less accurate than the vanilla Berkeley parser, our method maintains parsing accuracy with little loss. Figure 4 shows the parse time in seconds for each test sentence; our empty element recovery parser works in reasonable time.
Table 1: Results on the English Penn Treebank (Section 23). To calculate the scores for Tagger, we obtained a parse tree by supplying the 1-best Tagger output to the Berkeley parser trained on Sections 02-21 including empty elements (using the option "-useGoldPOS").
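For reference, the labeled empty element metric (t, i, i) amounts to multiset precision, recall, and F1 over (category, position) pairs. The sketch below is our own illustration of that computation, not the eevalb implementation.

```python
from collections import Counter

def prf(gold, pred):
    # gold/pred: lists of (empty_category, position) items.
    g, p = Counter(gold), Counter(pred)
    correct = sum((g & p).values())  # multiset intersection
    prec = correct / sum(p.values()) if p else 0.0
    rec = correct / sum(g.values()) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [("*T*", 3), ("0", 5)]
pred = [("*T*", 3), ("*T*", 5)]
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```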

Experiments on the Japanese Keyaki Treebank
Finally, to show that our method works well on other languages, we conducted experiments on the Japanese Keyaki Treebank (Butler et al., 2012). For this data, we modified blatt to keep function labels, and, to account for segmentation errors, we also modified eevalb to calculate spans over characters rather than words. We followed the experimental setup of Takeno et al. (2015); the results are shown in Table 3. Our method significantly outperforms the state-of-the-art post-processing method for Japanese.

Conclusion and Future Work
Using spinal parsing for the joint recovery of empty elements achieves state-of-the-art performance on standard English and Japanese datasets. We plan to extend our work to recover trace-filler relations and frame-semantic structures using the PropBank data.