Left-Corner Parsing for Identifying PTB-Style Nonlocal Dependencies

Nonlocal dependencies represent syntactic phenomena such as wh-movement, A-movement in passives, topicalization, raising, control, and right node raising. Nonlocal dependencies play an important role in semantic interpretation. This paper proposes a left-corner parser that identifies nonlocal dependencies. Our parser integrates nonlocal dependency identification into a transition-based system. We adopt a left-corner strategy in order to use the syntactic relation c-command, which plays an important role in nonlocal dependency identification. To utilize the global features captured by nonlocal dependencies, our parser uses a structured perceptron. In experimental evaluations, our parser achieved a good balance between constituent parsing and nonlocal dependency identification.


Introduction
Many constituent parsers based on the Penn Treebank (Marcus et al., 1993) are available, but most of them do not deal with nonlocal dependencies. Nonlocal dependencies represent syntactic phenomena such as wh-movement, A-movement in passives, topicalization, raising, control, and right node raising. Nonlocal dependencies play an important role in semantic interpretation. In the Penn Treebank, a nonlocal dependency is represented as a pair of an empty element and a filler.
Several methods of identifying nonlocal dependencies have been proposed so far. These methods can be divided into three approaches: the pre-processing approach (Dienes and Dubey, 2003b), the in-processing approach (Dienes and Dubey, 2003a; Schmid, 2006; Cai et al., 2011; Kato and Matsubara, 2015) and the post-processing approach (Johnson, 2002; Levy and Manning, 2004; Campbell, 2004; Xue and Yang, 2013; Xiang et al., 2013; Takeno et al., 2015). In the pre-processing approach, a tagger called a "trace tagger" detects empty elements. The trace tagger uses only surface word information. The in-processing approach integrates nonlocal dependency identification into a parser. The parser uses a probabilistic context-free grammar to rank candidate parse trees. The post-processing approach recovers nonlocal dependencies from parser output that does not include nonlocal dependencies.
The parsing models of the previous methods cannot use global features captured by nonlocal dependencies. The pre- and in-processing approaches use a probabilistic context-free grammar, which makes it difficult to use global features. The post-processing approach performs constituent parsing and nonlocal dependency identification separately. This means that the constituent parser cannot use any information about nonlocal dependencies.
This paper proposes a parser which integrates nonlocal dependency identification into constituent parsing. Our method adopts an in-processing approach, but does not use a probabilistic context-free grammar. Our parser is based on a transition system with a structured perceptron (Collins, 2002), which can easily introduce global features into its parsing model. We adopt a left-corner strategy in order to use the syntactic relation c-command, which plays an important role in nonlocal dependency identification. Previous work on transition-based constituent parsing adopts a shift-reduce strategy with tree binarization (Sagae and Lavie, 2005; Sagae and Lavie, 2006; Zhang and Clark, 2009; Zhu et al., 2013; Wang and Xue, 2014; Mi and Huang, 2015; Thang et al., 2015; Watanabe and Sumita, 2015), or converts constituent trees to "spinal trees", which are similar to dependency trees (Ballesteros and Carreras, 2015). These conversions make it difficult for the parsers to capture c-command relations during parsing. In contrast, our parser does not require any such conversion.
Our contributions can be summarized as follows:
1. We introduce empty element detection into transition-based left-corner constituent parsing.
2. We extend the c-command relation to deal with nodes in the parse tree stack of the transition system, and develop heuristic rules which coindex empty elements with their fillers on the basis of the extended version of c-command.
3. We introduce new features about nonlocal dependencies into our parsing model.
This paper is organized as follows: Section 2 explains how nonlocal dependencies are represented in the Penn Treebank. Section 3 describes our transition-based left-corner parser. Section 4 introduces nonlocal dependency identification into our parser. Section 5 describes the structured perceptron and the features. Section 6 reports experimental results, which demonstrate that our parser achieves a good balance between constituent parsing and nonlocal dependency identification. Section 7 concludes this paper.

Nonlocal Dependency
This section describes nonlocal dependencies in the Penn Treebank (Marcus et al., 1993). A nonlocal dependency is represented as a pair of an empty element and a filler. Figure 1 shows an example of a (partial) parse tree in the Penn Treebank. The parse tree includes several nonlocal dependencies. The nodes labeled with -NONE- are empty elements. Terminal symbols such as * and *T* represent the type of nonlocal dependency: * represents an unexpressed subject of a to-infinitive, and *T* represents a trace of wh-movement. When the terminal symbol of an empty element is indexed, its filler exists in the parse tree and carries the same index. For example, *T*-1 means that the node WHNP-1 is the corresponding filler. Table 1 gives a brief description of the empty elements, quoted from the annotation guidelines (Bies et al., 1995). For more details, see the guidelines.
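The coindexation convention described above can be illustrated with a small sketch that reads a PTB-style bracketed string and pairs each indexed empty element with the node carrying the same index. The example tree is a hypothetical fragment written for this illustration (it is not Figure 1), and the function names `tokenize`, `parse`, `collect` and `dependencies` are our own.

```python
import re

# Hypothetical PTB-style fragment, written only for illustration.
TREE = ("(SBAR (WHNP-1 (WDT which)) "
        "(S (NP-SBJ (PRP he)) (VP (VBD bought) (NP (-NONE- *T*-1)))))")

def tokenize(s):
    """Split a bracketed string into parentheses and symbols."""
    return re.findall(r"\(|\)|[^\s()]+", s)

def parse(tokens):
    """Parse one bracketed constituent into a (label, children) pair."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))  # terminal symbol
    tokens.pop(0)  # consume ")"
    return (label, children)

def collect(tree, fillers, empties):
    """Record indexed fillers (by label) and indexed empty elements."""
    label, children = tree
    if label == "-NONE-":
        m = re.fullmatch(r"(.+?)-(\d+)", children[0])
        if m:  # unindexed empty elements such as a bare * have no filler
            empties.append((m.group(1), int(m.group(2))))
        return
    m = re.fullmatch(r"(.+)-(\d+)", label)
    if m:
        fillers[int(m.group(2))] = m.group(1)
    for child in children:
        if isinstance(child, tuple):
            collect(child, fillers, empties)

def dependencies(s):
    """Return (empty element type, filler category) pairs."""
    fillers, empties = {}, []
    collect(parse(tokenize(s)), fillers, empties)
    return [(etype, fillers[i]) for etype, i in empties if i in fillers]

print(dependencies(TREE))
```

For the fragment above, the single pair links the *T* trace to the WHNP filler. The sketch ignores function tags other than the numeric index and does not handle gapping indices.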

Transition-Based Left-Corner Parsing
This section describes our transition-based left-corner parser. As with previous work (Sagae and Lavie, 2005; Sagae and Lavie, 2006; Zhang and Clark, 2009; Zhu et al., 2013; Wang and Xue, 2014; Mi and Huang, 2015; Thang et al., 2015; Watanabe and Sumita, 2015), our transition-based parsing system consists of a set of parser states and a finite set of transition actions, each of which maps a state into a new one. A parser state consists of a stack of parse tree nodes and a buffer of input words. A state is represented as a tuple (σ, i), where σ is the stack and i is the next input word position in the buffer. The initial state is (⟨⟩, 0). The final states have the form (⟨[···]_TOP⟩, n), where TOP is a special symbol for the root of the parse tree and n is the length of the input sentence. The transition actions for our parser are as follows:
• SHIFT(X): pop the first word from the buffer, assign a POS tag X to the word and push it onto the stack.
The SHIFT action assigns a POS tag to the shifted word in order to perform POS tagging and constituent parsing simultaneously, following Wang and Xue (2014).
• LEFTCORNER-{H/∅}(X): pop the top node from the stack, attach a new node labeled with X to the node as its parent and push the new node onto the stack. H and ∅ indicate whether or not the popped node is the head child of the new node.
• ATTACH-{H/∅}: pop the top two nodes from the stack, attach the first one to the second one as its rightmost child and push the second one back onto the stack. H and ∅ indicate whether or not the first node is the head child of the second one.
We introduce the new actions LEFTCORNER and ATTACH. The ATTACH action is similar to the REDUCE action commonly used in previous transition-based parsers. However, there is an important difference between ATTACH and REDUCE. The REDUCE action cannot deal with any node with more than two children. For this reason, previous work converts parse trees into binarized ones. The conversion makes it difficult to capture the hierarchical structure of the parse trees. In contrast, the ATTACH action can handle more than two children. Therefore, our parser does not require tree binarization. These transition actions are similar to the ones described by Henderson (2003), although his parser uses right-binarized trees and does not identify head children. Figure 2 summarizes the transition actions for our parser.

Table 1: Empty elements (types, descriptions and possible n-posi values).
type     description                                              n-posi
*        arbitrary PRO, controlled PRO and trace of A-movement    L, R, −
*EXP*    expletive (extraposition)                                R
*ICH*    interpret constituent here (discontinuous dependency)    L, R
*RNR*    right node raising                                       R
*T*      trace of A′-movement                                     A, L
0        null complementizer                                      −
*U*      unit                                                     −
*?*      placeholder for ellipsed material                        −
*NOT*    anti-placeholder in template gapping                     −
To guarantee that every non-terminal node has exactly one head child, our parser uses the following constraints:
• LEFTCORNER and ATTACH are not allowed when s_0 has no head child.
• ATTACH-H is not allowed when s_1 has a head child.
Table 2 shows the first several transition actions which derive the parse tree shown in Figure 1. Head children are indicated by the superscript *.
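The three actions and the two head-child constraints can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the class and function names (`Node`, `State`, `ready`, `shift`, `left_corner`, `attach`, `bracket`) are ours, the TOP node is omitted, and we assume that a POS-tagged word is exempt from the "no head child" constraint.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    head: Optional[int] = None  # index of the head child, None if none yet
    word: Optional[str] = None  # set only for POS-tagged words

@dataclass
class State:
    stack: List[Node]  # stack[-1] is s_0
    i: int             # next input word position

def ready(node):
    # Assumption: a POS-tagged word counts as complete; any other node
    # must already have a head child.
    return node.word is not None or node.head is not None

def shift(state, words, tag):
    """SHIFT(X): tag the next word and push it onto the stack."""
    return State(state.stack + [Node(tag, word=words[state.i])], state.i + 1)

def left_corner(state, label, is_head):
    """LEFTCORNER-{H/0}(X): put a new parent labeled X above s_0."""
    *rest, s0 = state.stack
    if not ready(s0):
        raise ValueError("LEFTCORNER not allowed: s_0 has no head child")
    parent = Node(label, [s0], head=0 if is_head else None)
    return State(rest + [parent], state.i)

def attach(state, is_head):
    """ATTACH-{H/0}: make s_0 the rightmost child of s_1."""
    *rest, s1, s0 = state.stack
    if not ready(s0):
        raise ValueError("ATTACH not allowed: s_0 has no head child")
    if is_head and s1.head is not None:
        raise ValueError("ATTACH-H not allowed: s_1 already has a head child")
    head = len(s1.children) if is_head else s1.head
    merged = Node(s1.label, s1.children + [s0], head=head)
    return State(rest + [merged], state.i)

def bracket(node):
    """Render a node as a bracketed string."""
    if node.word is not None:
        return f"({node.label} {node.word})"
    return "(" + node.label + " " + " ".join(bracket(c) for c in node.children) + ")"

# A toy derivation for a hypothetical two-word sentence.
words = ["dogs", "bark"]
s = State([], 0)
s = shift(s, words, "NNS")
s = left_corner(s, "NP", True)   # NP with head child NNS
s = left_corner(s, "S", False)   # S whose first child NP is not the head
s = shift(s, words, "VBP")
s = left_corner(s, "VP", True)
s = attach(s, True)              # VP becomes the head child of S
print(bracket(s.stack[0]))       # (S (NP (NNS dogs)) (VP (VBP bark)))
```

Because ATTACH appends to an existing child list, a node can receive any number of children without binarization, which is the property the text relies on.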
Previous transition-based constituent parsers do not handle nonlocal dependencies. One exception is the work of Maier (2015), who proposes shift-reduce constituent parsing with a swap action. That parser can handle nonlocal dependencies represented as discontinuous constituents. In this framework, discontinuities are directly annotated by allowing crossing branches. Since this annotation style is quite different from the PTB annotation, the parser is not suitable for identifying PTB-style nonlocal dependencies.

Nonlocal Dependency Identification
Nonlocal dependency identification consists of two subtasks:
• empty element detection.
• empty element resolution, which coindexes empty elements with their fillers.
To realize empty element detection, our parser can insert empty elements at arbitrary positions, in a manner similar to the in-processing approach. Our method coindexes empty elements with their fillers using simple heuristic rules, which are developed for our transition system.

Empty Element Detection
We introduce the E-SHIFT action to deal with empty elements. This action simply inserts an empty element at an arbitrary position and pops no element from the buffer (see the transition from #11 to #12 shown in Table 2 as an example).

Annotations
Table 2: An example of transition action sequence.
For empty element resolution, we augment the Penn Treebank. For nonlocal dependency types *EXP*, *ICH*, *RNR* and *T*, we assign the following information to each filler and each empty element:
• The nonlocal dependency type (only for filler).
• The nonlocal dependency category, which is defined as the category of the parent of the empty element.
• The relative position of the filler, which takes a value from {A, L, R}. "A" means that the filler is an ancestor of the empty element. "L" ("R") means that the filler occurs to the left (right) of the empty element. Table 1 summarizes which values each empty element can take.
The information is utilized for coindexing empty elements with fillers. Below, we write n-type(x), n-cat(x) and n-posi(x) for the information of a node x, respectively. If an empty element of type * is indexed, we annotate the empty element in the same way. 3 Furthermore, we assign a tag OBJCTRL to every empty element if its coindexed constituent does not have the function tag SBJ. 4 This enables our parser to distinguish between subject control and object control. Figure 3 shows the augmented version of the parse tree of Figure 1.

Empty Element Resolution
Nonlocal dependency annotation in the Penn Treebank is based on Chomsky's GB theory (Chomsky, 1981). This means that, in many cases, there exist c-command relations between empty elements and fillers. For example, all the empty elements in Figure 1 are c-commanded by their fillers. Our method coindexes empty elements with their fillers by simple heuristic rules based on the c-command relation.
Footnote 3: We omit its nonlocal dependency category, since it is always NP.
Footnote 4: In the Penn Treebank, every subject has the tag SBJ.

C-command Relation
Here, we define the c-command relation in a parse tree as follows:
• A node x c-commands a node y if and only if there exists some node z such that z is a sibling of x (x ≠ z) and y is a descendant of z.
It is difficult for previous transition-based shift-reduce constituent parsers to recognize c-command relations between nodes, since parse trees are binarized. In contrast, our left-corner parser does not need to binarize parse trees and can easily recognize c-command relations. Furthermore, we extend the c-command relation to handle nodes in the stack of our transition system. For two nodes x and y in a stack, the following statement necessarily holds:
• Let S = (⟨s_m, ..., s_0⟩, i) be a parser state. Let y be a descendant of s_j and x be a child of some node s_k (j < k ≤ m). Then, x c-commands y in any final state derived from the state S.
Below, we also say that x c-commands y when the nodes x and y satisfy the above statement.
As an example, let us consider the state shown in Figure 4. The subscripts of the nodes indicate the order in which the nodes are instantiated. The nodes in the dotted box c-command the shifted node -NONE-L_14 in terms of the above statement. In the parse tree shown in Figure 3, which is derived from this state, these nodes c-command -NONE-L_14 by the original definition.
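Both versions of c-command can be checked mechanically. The sketch below is an illustration under stated assumptions: the `Node` class and the function names are ours, the stack is stored bottom-to-top (so the last element is s_0), and "descendant" is treated reflexively, which is how the stack version is used when the freshly shifted empty element is itself the top of the stack.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def descendants(z):
    """Yield z and every node below it (reflexive; an assumption)."""
    yield z
    for child in z.children:
        yield from descendants(child)

def c_commands(x, y):
    """Tree version: some sibling z of x (z != x) dominates y."""
    if x.parent is None:
        return False
    return any(y is d
               for z in x.parent.children if z is not x
               for d in descendants(z))

def stack_c_commands(stack, x, y):
    """Stack version: x is a child of some s_k and y lies in some s_j
    that is closer to the top of the stack than s_k (j < k)."""
    for pos, s_k in enumerate(stack):
        if any(x is child for child in s_k.children):
            return any(y is d
                       for s_j in stack[pos + 1:]
                       for d in descendants(s_j))
    return False

# Hypothetical partial state: S over a subject, VP over a verb, and a
# freshly inserted empty element on top of the stack.
np = Node("NP-SBJ")
vbd = Node("VBD")
stack = [Node("S", [np]), Node("VP", [vbd]), Node("-NONE-")]
empty = stack[-1]
print(stack_c_commands(stack, np, empty))   # True
print(stack_c_commands(stack, empty, np))   # False
```

The stack version needs no information about how the partial trees will later be attached, which is what allows coindexation to be decided at the moment an empty element is shifted.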

Resolution Rules
Our parser coindexes an empty element with its filler when E-SHIFT or ATTACH is executed. The E-SHIFT action coindexes the shifted empty element e such that n-posi(e) = L with its filler. The ATTACH action coindexes the attached filler s_0 such that n-posi(s_0) = R with its corresponding empty element. Resolution rules consist of three parts: PRECONDITION, CONSTRAINT and SELECT. An empty element resolution rule is applied to a state when the state satisfies the PRECONDITION of the rule. CONSTRAINT represents the conditions which a coindexed element must satisfy. SELECT can take two values, ALL and RIGHTMOST. When there exist several elements satisfying the CONSTRAINT, SELECT determines how to select the coindexed elements. ALL means that all the elements satisfying the CONSTRAINT are coindexed. RIGHTMOST selects the rightmost element satisfying the CONSTRAINT.
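The three-part structure of a resolution rule can be sketched as a small data structure. This is only an illustration of the PRECONDITION / CONSTRAINT / SELECT mechanism: the `Rule` class, the `apply_rule` function and the toy rule at the end are hypothetical and do not reproduce the actual rules of the paper's figures.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Rule:
    precondition: Callable[[Any], bool]      # tested on the parser state
    constraint: Callable[[Any, Any], bool]   # tested on (state, candidate)
    select: str                              # "ALL" or "RIGHTMOST"

def apply_rule(rule: Rule, state: Any, candidates: List[Any]) -> List[Any]:
    """Return the candidates to coindex; candidates are in left-to-right order."""
    if not rule.precondition(state):
        return []
    satisfying = [c for c in candidates if rule.constraint(state, c)]
    if rule.select == "ALL":
        return satisfying
    if rule.select == "RIGHTMOST":
        return satisfying[-1:]
    raise ValueError(f"unknown SELECT value: {rule.select}")

# Toy rule (hypothetical): after E-SHIFT, pick the rightmost candidate
# whose label carries the function tag SBJ.
toy = Rule(
    precondition=lambda st: st["action"] == "E-SHIFT",
    constraint=lambda st, c: c[0].endswith("-SBJ"),
    select="RIGHTMOST",
)
state = {"action": "E-SHIFT"}
candidates = [("NP-SBJ", 3), ("VBD", 5), ("NP-SBJ", 10)]
print(apply_rule(toy, state, candidates))   # [('NP-SBJ', 10)]
```

Returning a list in both cases keeps the caller uniform: RIGHTMOST yields at most one element, ALL yields every element satisfying the constraint.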
The most frequent type of nonlocal dependency in the Penn Treebank is *. Figure 5 shows the resolution rules for type *. Here, ch(s) designates the set of the children of s. sbj(x) means that x has the function tag SBJ. par(x) designates the parent of x. cat(x) represents the constituent category of x. des(s) designates the set of the proper descendants of s. free(x, σ) means that x is not coindexed with any node included in σ.
The first rule *-L is applied to a state when the E-SHIFT action inserts an empty element e = [*]_{-NONE-L}. This rule seeks a subject which c-commands the shifted empty element. The first constraint means that the node x c-commands the empty element e, since the resulting state of the E-SHIFT action is (⟨s_m, ..., s_0, e⟩, i), and x and e satisfy the statement in section 4.3.1. For example, the node NP-SBJ_10 shown in Figure 4 satisfies these constraints (the dotted box represents the first constraint). Therefore, our parser coindexes NP-SBJ_10 with -NONE-L_14.
The second rule *-L-OBJCTRL seeks an object instead of a subject. The second and third constraints identify whether or not x is an argument. If x is a prepositional phrase, our parser coindexes e with x's child noun phrase instead of x itself, in order to keep the PTB-style annotation.
The third rule *-R is for the null subject of a participial clause. Figure 6 shows an example of applying the rule *-R to a state. This rule is applied to a state when the transition action is ATTACH and s_0 is a subject. By definition, the first constraint means that s_0 c-commands x.
The second most frequent type is *T*. Figure 7 shows the rule for *T*. This rule is applied to a state when the E-SHIFT action inserts an empty element of type *T*. Here, match(x, y) checks the consistency between x and y, that is, match(x, y) holds if and only if n-type(x) = n-type(y), n-cat(x) = n-cat(y), n-posi(x) = n-posi(y), cat(x) ≠ -NONE- and cat(y) = -NONE-. removeCRD(⟨s_m, ..., s_0⟩) is the stack obtained by removing every s_j (0 ≤ j ≤ m) which is annotated with the tag CRD. The tag CRD means that the node is a coordinate structure. In general, each filler of type *T* is coindexed with only one empty element. However, a filler of type *T* can be coindexed with several empty elements if the empty elements are included in a coordinate structure. This is the reason why our parser uses removeCRD. Figures 8 and 9 give examples of resolution for type *T*.
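The match predicate follows directly from its definition and can be written down as a short sketch. The `Annot` record and its field names are our own illustrative choices for holding cat, n-type, n-cat and n-posi; the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annot:
    cat: str     # constituent category, "-NONE-" for an empty element
    n_type: str  # nonlocal dependency type, e.g. "*T*"
    n_cat: str   # nonlocal dependency category
    n_posi: str  # relative position of the filler: "A", "L" or "R"

def match(x: Annot, y: Annot) -> bool:
    """x is a candidate filler, y a candidate empty element."""
    return (x.n_type == y.n_type
            and x.n_cat == y.n_cat
            and x.n_posi == y.n_posi
            and x.cat != "-NONE-"
            and y.cat == "-NONE-")

filler = Annot(cat="WHNP", n_type="*T*", n_cat="NP", n_posi="L")
trace = Annot(cat="-NONE-", n_type="*T*", n_cat="NP", n_posi="L")
other = Annot(cat="-NONE-", n_type="*T*", n_cat="ADVP", n_posi="L")

print(match(filler, trace))   # True
print(match(filler, other))   # False: the dependency categories differ
print(match(trace, filler))   # False: the argument order matters
```

The predicate is asymmetric by design: the first argument must be a non-empty node and the second an empty element.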
The empty elements [*T*]_{-NONE-A} are handled by an exceptional process. When the ATTACH action is applied to a state (⟨s_m, ..., s_0⟩, i) such that cat(s_0) = PRN, the parser coindexes the empty element x = [*T*]_{-NONE-A} included in s_0 with s_1. More precisely, the coindexation is executed if the following conditions hold:
• x ∈ des(s_0)
• match(s_1, x)
• free(x, ⟨s_m, ..., s_0⟩)
For the other types of nonlocal dependencies, that is, *EXP*, *ICH* and *RNR*, we use a similar idea to design the resolution rules. Figure 10 shows the resolution rules. These heuristic resolution rules are similar to those of previous work (Campbell, 2004; Kato and Matsubara, 2015), which also utilizes the c-command relation. An important difference is that we design heuristic rules not for a fully connected parse tree but for a stack of parse trees derived by left-corner parsing. That is, the extended version of the c-command relation plays an important role in our heuristic rules.

Parsing Strategy
We use beam-search decoding with the structured perceptron (Collins, 2002). A transition action a for a state S has a score defined as follows:
\[ \mathrm{score}(S, a) = \mathbf{w} \cdot \mathbf{f}(S, a) \]
where f(S, a) is the feature vector for the state-action pair (S, a), and w is a weight vector. The score of a state S′ which is obtained by applying an action a to a state S is defined as follows:
\[ \mathrm{score}(S') = \mathrm{score}(S) + \mathrm{score}(S, a) \]
For the initial state S_0, score(S_0) = 0.

input: sentence w_1 ··· w_n, beam size k
H ← {S_0}    # S_0 is the initial state for w_1 ··· w_n
repeat N times do
    C ← {}
    for each S ∈ H do
        for each possible action a do
            S′ ← apply a to S
            push S′ to C
    H ← k best states of C
return best final state in C
Figure 11: Beam-search parsing.
We learn the weight vector w by the max-violation method (Huang et al., 2012) and average the weight vector to avoid overfitting the training data (Collins, 2002).
In our method, action sequences for the same sentence can have different numbers of actions because of the E-SHIFT action. To absorb the difference, we use the IDLE action proposed by Zhu et al. (2013). Figure 11 shows the details of our beam-search parsing. The algorithm is the same as in previous transition-based parsing with the structured perceptron. One point to be noted here is how to determine the maximum length of an action sequence (= N) which the parser allows. Since it is impossible to know in advance how many empty elements a parse tree has, we need to set this parameter to a sufficiently large value.
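The procedure of Figure 11 can be sketched in a few lines. The code below is a generic illustration under stated assumptions, not the parser itself: the transition system is passed in as callables (`actions`, `apply`, `is_final`, `features`, all hypothetical names), the weight vector is a dictionary over feature strings, and the toy system at the end, including its IDLE behaviour of leaving a finished state unchanged, is invented for the example.

```python
def beam_search(initial, actions, apply, is_final, features, w, k, N):
    """Keep the k best states for N steps; return the best final state."""
    H = [(0.0, initial)]          # score(S_0) = 0
    C = list(H)
    for _ in range(N):
        C = []
        for score, S in H:
            for a in actions(S):
                # score(S') = score(S) + w . f(S, a), with sparse features
                step = sum(w.get(f, 0.0) for f in features(S, a))
                C.append((score + step, apply(S, a)))
        if not C:
            break
        C.sort(key=lambda item: item[0], reverse=True)
        H = C[:k]
    finals = [item for item in C if is_final(item[1])]
    return max(finals, key=lambda item: item[0])[1] if finals else None

# Toy transition system (hypothetical): build a two-letter string, then
# pad with IDLE so that all sequences have the same length N.
def toy_actions(S):
    return ["IDLE"] if len(S) == 2 else ["a", "b"]

def toy_apply(S, a):
    return S if a == "IDLE" else S + a

best = beam_search(
    initial="",
    actions=toy_actions,
    apply=toy_apply,
    is_final=lambda S: len(S) == 2,
    features=lambda S, a: [a],
    w={"b": 1.0},
    k=2,
    N=3,
)
print(best)   # bb
```

Padding with IDLE lets states that finish early stay in the beam and be compared with states that needed more actions, which is why N only has to be an upper bound on the sequence length.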
To extract the features, we need to identify head children in parse trees. We use the head rules described in (Surdeanu et al., 2008).
In addition to these features, we introduce a new feature which is related to empty element resolution. When a transition action invokes the empty element resolution described in section 4.3.2, we use as a feature whether or not the procedure coindexes empty elements with a filler. Such a feature is difficult for a PCFG to capture. This feature enables our parsing model to learn the resolution rule preferences implicitly, while the training process is performed only with oracle action sequences.
In addition, we use features about free empty elements and fillers. Table 4 summarizes such feature templates. Here, x.n_i stands for the i-th rightmost free element included in x, and rest_i stands for the stack ⟨s_m, ..., s_i⟩.

Experiment
We conducted an experiment to evaluate the performance of our parser using the Penn Treebank. We used the standard setting, that is, sections 02-21 for training, section 22 for development and section 23 for testing.
In training, we set the beam size k to 16 for efficiency. We determined the optimal number of perceptron training iterations and the beam size for decoding (k was set to 16, 32 and 64) on the development data. The maximum length of action sequences (= N) was set to 7n, where n is the length of the input sentence. This maximum length was chosen so as to deal with the sentences in the training data. Table 5 presents the constituent parsing performance of our system and previous systems. We used the labeled bracketing metric PARSEVAL (Black et al., 1991). Here, "CF" is the parser which was learned from the training data with nonlocal dependencies removed. This result demonstrates that our nonlocal dependency identification does not have a bad influence on constituent parsing. From the viewpoint of transition-based constituent parsing, our left-corner parser is somewhat inferior to other perceptron-based shift-reduce parsers. On the other hand, our parser outperforms the parsers which identify nonlocal dependencies based on the in-processing approach.
We use the metric proposed by Johnson (2002) to evaluate the accuracy of nonlocal dependency identification. Johnson's metric represents a nonlocal dependency as a tuple which consists of the type of the empty element, the category of the empty element, the position of the empty element, the category of the filler and the position of the filler. For example, the nonlocal dependency of type *T* in Figure 1 is represented as (*T*, NP, [4, 4], WHNP, [3, 4]). The precision and the recall are measured using these tuples. For more details, see Johnson (2002). Table 6 shows the nonlocal dependency identification performance of our method and previous methods. The previous in-processing approach achieved the state-of-the-art performance in nonlocal dependency identification, while it was inferior in terms of constituent parsing accuracy. Our nonlocal dependency identification is competitive with the previous in-processing approach, and our constituent parsing accuracy is higher. As a whole, our parser achieves a good balance between constituent parsing and nonlocal dependency identification. Table 7 summarizes the accuracy of nonlocal dependency identification for each type of nonlocal dependency.
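Precision and recall over such tuples can be computed as in the sketch below. It assumes that gold and predicted dependencies are compared as sets of tuples for one sentence (for a corpus, a sentence identifier would have to be added to each tuple); the function name `prf` is ours, and apart from the *T* tuple quoted above, the example tuples are hypothetical.

```python
def prf(gold, pred):
    """Precision, recall and F1 over sets of dependency tuples."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# (type, empty-element category, empty-element span, filler category, filler span)
gold = [
    ("*T*", "NP", (4, 4), "WHNP", (3, 4)),
    ("*", "NP", (6, 6), "NP", (1, 2)),       # hypothetical
]
pred = [
    ("*T*", "NP", (4, 4), "WHNP", (3, 4)),
    ("*", "NP", (6, 6), "NP", (2, 3)),       # hypothetical, wrong filler span
]
print(prf(gold, pred))   # (0.5, 0.5, 0.5)
```

A predicted dependency counts as correct only if all five components agree with the gold tuple, so an empty element attached to the wrong filler span is penalized in both precision and recall.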

Conclusion
This paper proposed a transition-based parser which identifies nonlocal dependencies. Our parser achieves a good balance between constituent parsing and nonlocal dependency identification. In the experiment reported in this paper, we used simple features which are captured by nonlocal dependencies. In future work, we will develop lexical features which are captured by nonlocal dependencies.