PTB Graph Parsing with Tree Approximation

The Penn Treebank (PTB) represents syntactic structures as graphs due to nonlocal dependencies. This paper proposes a method that approximates PTB graph-structured representations by trees. With our approximation method, we can reduce nonlocal dependency identification and constituency parsing to a single tree-based parsing task. An experimental result demonstrates that our approximation method, combined with an off-the-shelf tree-based constituency parser, significantly outperforms the previous methods in nonlocal dependency identification.


Introduction
In the Penn Treebank (PTB) (Marcus et al., 1993), syntactic structures are represented as graphs due to nonlocal dependencies, which capture syntactic discontinuities. This paper proposes a method that approximates PTB graph-structured representations by trees. With our approximation method, we can reduce nonlocal dependency identification and constituency parsing to a single tree-based parsing task. The information loss of our approximation method is slight, and we can easily recover the original PTB graphs from the output of a parser trained on the approximated trees. An experimental result demonstrates that our approximation method, combined with an off-the-shelf tree-based parser, significantly outperforms the previous nonlocal dependency identification methods.

Nonlocal Dependency Identification
This section explains nonlocal dependencies in the PTB, and summarizes previous work on nonlocal dependency identification.

Nonlocal dependency in PTB
In the PTB, a nonlocal dependency is represented as an edge. One node is called an empty element, which is a covert element in the syntactic representation. The other is called a filler. PTB's syntactic representations are graph-structured, while its constituency structures are represented by trees. Below, a syntactic representation in the PTB is called a PTB graph. The left graph in Figure 1 is an example of a PTB graph. The empty elements are labelled with -NONE-. Terminal symbols such as 0 and *T* designate the types of the empty elements: 0 and *T* represent a zero relative pronoun and a trace of wh-movement, respectively. If the terminal symbol of an empty element is indexed with a number, its corresponding filler exists in the PTB graph and is indexed with the same number. For example, the empty element of type *T* is indexed with 1 and has the corresponding filler WHNP-1. For more details about PTB nonlocal dependencies, we refer readers to (Bies et al., 1995).
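As a concrete illustration, co-indexed empty elements and their fillers can be paired mechanically by matching their shared numeric indices. The bracketing below is a hypothetical fragment in PTB style (not from the corpus), and the regular expressions are a simplified sketch rather than a full PTB reader:

```python
import re

# A hypothetical PTB-style bracketing: the wh-trace *T*-1 is co-indexed
# with its filler WHNP-1 ("things that we want").
ptb = ("(NP (NP (NNS things)) (SBAR (WHNP-1 (WDT that)) "
       "(S (NP (PRP we)) (VP (VBP want) (NP (-NONE- *T*-1))))))")

# Indexed empty elements: (type, index), e.g. ('*T*', '1').
empties = re.findall(r"\(-NONE- ([^\s)]+?)-(\d+)\)", ptb)
# Indexed fillers: index -> label, e.g. {'1': 'WHNP'}.
fillers = {idx: lab for lab, idx in re.findall(r"\(([A-Z]+)-(\d+) ", ptb)}

pairs = [(etype, fillers[idx]) for etype, idx in empties]
print(pairs)  # [('*T*', 'WHNP')]
```

Unindexed empty elements such as the zero relative pronoun 0 have no filler and would simply not participate in this matching.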

Previous Work
Most PTB-based parsers deal with the trees obtained by removing nonlocal dependencies and empty elements (we call such trees PTB trees). While such parsers are simple, efficient and accurate, they cannot handle nonlocal dependencies. To fill this gap, several methods have been proposed so far. They can be classified into two categories: methods that introduce special operations handling nonlocal dependencies or empty elements into the parsing algorithm (Dienes and Dubey, 2003; Schmid, 2006; Cai et al., 2011; Evang and Kallmeyer, 2011; Maier, 2015; Kato and Matsubara, 2016; Hayashi and Nagata, 2016; Kummerfeld and Klein, 2017), and methods that recover PTB graphs from PTB trees generated by a parser (Johnson, 2002; Campbell, 2004; Levy and Manning, 2004). The former approach requires designing a parsing model that is suitable for the algorithm. In the latter, post-processing approach, the parser used in pre-processing cannot exploit information about nonlocal dependencies.

Tree Approximation of PTB Graphs
This section proposes a new approach to nonlocal dependency identification. We reduce nonlocal dependency identification and constituency parsing to a single tree-based parsing task. In our approach, a PTB graph is converted to a tree that approximately represents the PTB graph. The conversion consists of the following two steps:

Removing nonlocal dependency removes the edges between the empty elements and their fillers, and augments their labels. The augmented labels are used to recover the removed edges.

Removing empty elements removes the empty elements and inserts new inner nodes that encode them.

We call the trees obtained by this conversion PTB augmented trees. Figure 1 shows an example of the conversion. Below, we explain each step in detail.

Removing nonlocal dependency
By removing the nonlocal dependency edges, a PTB graph becomes a tree. In order to approximately represent the removed edges in the resulting tree, we augment node labels using an annotation scheme identical to that proposed by Kato and Matsubara (2016). In this scheme, the labels of empty elements and their fillers are augmented with special tags. We first describe the annotation scheme, and then explain how to recover removed edges using the augmented labels.

Annotation approximately representing nonlocal dependency
Algorithm 1 is the annotation algorithm of Kato and Matsubara (2016). Here, posi(x, y) is the relative position of x with respect to y, defined as follows:

A: x is an ancestor of y
L: x occurs to the left of y
R: x occurs to the right of y

The tag OBJCTRL enables us to distinguish between subject and object control. For example, the left PTB graph in Figure 1 is converted to the middle tree. The boxes designate the augmented empty element and filler.
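The posi relation can be sketched as follows. The Node class and function names are illustrative, not the paper's implementation, and we assume that either x dominates y or the two nodes are disjoint:

```python
# Illustrative sketch of posi(x, y): "A" if x dominates y, else "L"/"R"
# by the order of the two paths below their lowest common ancestor.
class Node:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for c in self.children:
            c.parent = self

def ancestors(n):
    out = []
    while n.parent is not None:
        n = n.parent
        out.append(n)
    return out

def posi(x, y):
    if x in ancestors(y):
        return "A"  # x is an ancestor of y
    # Assume x and y are disjoint: find the lowest common ancestor, then
    # compare the order of the children leading down to x and y.
    path_x = [x] + ancestors(x)
    path_y = [y] + ancestors(y)
    lca = next(a for a in path_y if a in path_x)
    cx = path_x[path_x.index(lca) - 1]
    cy = path_y[path_y.index(lca) - 1]
    return "L" if lca.children.index(cx) < lca.children.index(cy) else "R"

# (S (NP) (VP (VBZ))): NP precedes VP, S dominates NP, VBZ follows NP.
np_, vbz = Node("NP"), Node("VBZ")
vp = Node("VP", [vbz])
s = Node("S", [np_, vp])
print(posi(np_, vp), posi(s, np_), posi(vbz, np_))  # L A R
```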

Nonlocal dependency recovery
This section proposes a method of recovering nonlocal dependencies using the annotation described in the previous section. The method is based on heuristic rules, which are similar to, but simpler than, those of Kato and Matsubara (2016). In the rules, match(f, e) means that the type, the category and the position tag of a filler f are identical to those of an empty element e. A rule consists of a node pattern and a constraint. When there is a node that matches the pattern, we select the nearest node satisfying the constraint as its co-indexed node. Table 1 summarizes the rules.2 Here, c-cmd(x, y) is the syntactic relation called c-command,3 which holds iff neither of x and y dominates the other and the parent of x dominates y. For example, the nonlocal dependency in Figure 1 can be recovered by the fourth rule in Table 1.
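The c-command test used in the constraints can be sketched as below. This is one standard formulation of c-command; the paper's exact condition may differ in detail, and the Node class is illustrative:

```python
# Illustrative c-command test: x c-commands y iff neither dominates the
# other and x's parent dominates y.
class Node:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for c in self.children:
            c.parent = self

def dominates(a, b):
    # True iff a is a proper ancestor of b.
    while b.parent is not None:
        b = b.parent
        if b is a:
            return True
    return False

def c_cmd(x, y):
    return (not dominates(x, y) and not dominates(y, x)
            and x.parent is not None and dominates(x.parent, y))

# In (S (NP) (VP (VBZ))), NP c-commands VBZ but VBZ does not c-command NP.
np_, vbz = Node("NP"), Node("VBZ")
vp = Node("VP", [vbz])
s = Node("S", [np_, vp])
print(c_cmd(np_, vbz), c_cmd(vbz, np_))  # True False
```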

Removing empty elements
While the first step in the conversion removes nonlocal dependency edges, the empty elements still remain. The second step removes the empty elements and encodes them as inner nodes. After this conversion, parsing algorithms require no special operations for handling empty elements.

Encoding empty elements
Algorithm 2 removes and encodes empty elements. For example, the middle tree in Figure 1 is converted to the right one. The dotted boxes designate the inner nodes encoding the empty elements. Here, note that [(NP (-NONE-L *T*))] is no more than a part of a label in the PTB augmented tree. Kummerfeld and Klein (2017) represent empty elements in a similar way, but an important difference exists. Our method keeps empty element positions (L and R) but not nonlocal dependencies, while their method preserves nonlocal dependencies but not empty element positions. Furthermore, while they require a specially-designed head rule to avoid constructing cyclic graphs in parsing, our method does not need head rules in the first place.

2 In the third rule, if cat(x) = PP, e is co-indexed with not x but x's child NP.
3 Kato and Matsubara (2016) follow Chomsky's GB theory (Chomsky, 1981) in using this relation, because it holds between co-indexed nodes in most cases. We also use this relation.

Algorithm 2 Encoding empty elements
null(x) means all the leaves of x are empty elements. node(l, C) creates a node with a label l and children C. encode(x) converts the subtree rooted at x to a string. label(x) is the label of x.
Input: a node x
c1, . . . , cn ← children(x)
i ← the leftmost position such that ¬null(ci)
C ← ci
for j from i + 1 to n do
    if ¬null(cj) then
        C ← C · cj
    else
        C ← node(cat(x) + "R" + encode(cj), C)
    end if
end for
for j from i − 1 down to 1 do
    C ← node(cat(x) + "L" + encode(cj), C)
end for
return node(label(x), C)
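As a runnable sketch, Algorithm 2 might be implemented on simple tuple trees as below. This is an illustrative reimplementation, not the authors' code: a tree is (label, children), a leaf is (pos_tag, word), and the algorithm is applied recursively:

```python
# null(x): all leaves under x are empty elements (-NONE-).
def null(x):
    label, kids = x
    if isinstance(kids, str):          # leaf: (pos_tag, word)
        return label == "-NONE-"
    return all(null(k) for k in kids)

# encode(x): render the subtree as a bracketed string for use inside a label.
def encode(x):
    label, kids = x
    if isinstance(kids, str):
        return "(%s %s)" % (label, kids)
    return "(%s %s)" % (label, " ".join(encode(k) for k in kids))

# convert(x): Algorithm 2 applied recursively; assumes every non-empty node
# has at least one non-empty child (true for trees that dominate words).
def convert(x):
    label, kids = x
    if isinstance(kids, str):
        return x
    kids = [k if null(k) else convert(k) for k in kids]
    i = next(j for j, k in enumerate(kids) if not null(k))
    C = [kids[i]]
    for k in kids[i + 1:]:
        if not null(k):
            C.append(k)
        else:
            # Wrap C in an inserted node encoding the empty element on the right.
            C = [(label + "R" + encode(k), C)]
    for k in reversed(kids[:i]):
        # Wrap C in an inserted node encoding the empty element on the left.
        C = [(label + "L" + encode(k), C)]
    return (label, C)

tree = ("S", [("NP", [("PRP", "we")]),
              ("VP", [("VBP", "want"), ("NP", [("-NONE-", "*T*")])])])
print(convert(tree))
```

Here the empty NP under the VP is folded into an inserted inner node labelled VPR(NP (-NONE- *T*)), whose single child subtree carries the overt material.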

Recovering empty elements
Algorithm 2 is lossless, and Algorithm 3 recovers the empty elements from the inner nodes inserted by Algorithm 2.
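As an illustrative sketch of this recovery (not the authors' code), the inserted nodes can be unwound with a work list. For simplicity, an inserted node here carries its empty element as a structured label (cat, tag, empty_subtree) rather than the paper's string encoding, so decoding is just a tuple lookup:

```python
# Illustrative recovery on tuple trees (label, children); an inserted node
# has a structured label (cat, tag, empty_subtree), all other labels are
# strings.
def recover(x):
    label, kids = x
    if isinstance(kids, str):          # leaf: (pos_tag, word)
        return x
    work, out = list(kids), []
    while work:
        c = work.pop(0)
        clab, ckids = c
        if isinstance(clab, tuple) and clab[1] == "L":
            work = [clab[2]] + list(ckids) + work   # empty element was on the left
        elif isinstance(clab, tuple) and clab[1] == "R":
            work = list(ckids) + [clab[2]] + work   # empty element was on the right
        else:
            out.append(recover(c))
    return (label, out)

# An encoded "(VP (VBP want) (NP (-NONE- *T*)))", then recovered:
converted = ("S", [("NP", [("PRP", "we")]),
                   ("VP", [(("VP", "R", ("NP", [("-NONE-", "*T*")])),
                           [("VBP", "want")])])])
print(recover(converted))
```

Because inserted nodes may be nested, the children of an unwrapped node are pushed back onto the work list rather than emitted directly.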

Algorithm 3 Recovering empty elements
decode(x) creates a tree by decoding a string assigned by encode and returns its root.
Input: a node x
C ← children(x)
C′ ← ⟨⟩
while C ≠ ⟨⟩ do
    pop the first element c from C
    if c is an inserted node and c has the tag L then
        C ← decode(c) · children(c) · C
    else if c is an inserted node and c has the tag R then
        C ← children(c) · decode(c) · C
    else
        C′ ← C′ · c
    end if
end while
return node(label(x), C′)

Experiment
To evaluate the performance of our proposed method, we conducted an experiment using the PTB. We used the Kitaev and Klein (henceforth K&K) parser (Kitaev and Klein, 2018a). The K&K parser is a state-of-the-art tree-based parser, which can use ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018) as external data. PTB graphs in the training (sections 02-21) and development (section 22) data were converted into PTB augmented trees by our tree approximation method, and a parsing model was trained on the PTB augmented trees. The hyperparameters for training were identical to those of Kitaev and Klein (2018a). We selected the model that maximizes the F1 score on the development data, where we treated the node labels of the PTB augmented trees as constituent labels. For the test data (section 23), PTB graphs were recovered from the PTB augmented trees generated by the parser. The accuracy of nonlocal dependency identification was evaluated by the metric proposed by Johnson (2002).
First, we evaluated the performance of our approximation method alone. We recovered PTB graphs not from the parser output but from the gold PTB augmented trees in the development data. We obtained a 99.5 F1 score in nonlocal dependency identification, where unindexed empty elements were excluded. This result means that the information loss of our approximation method is slight. Table 2 summarizes the performances of our system and previous ones. These results demonstrate that our system significantly outperforms the previous methods in nonlocal dependency identification. Although the main reason for this is the performance of the K&K parser, the important point is that our approximation method enables us to use the K&K parser for the nonlocal dependency identification task. The previous methods that introduce additional operations cannot adopt such a parser directly. On the other hand, although the post-processing approach can use any parser in pre-processing, our approach outperforms the post-processing approach even if the pre-processing parser is assumed to always generate gold PTB trees.
We converted PTB graphs into PTB trees to evaluate constituency parsing performance. Table 3 shows the F1 scores of our parser and the K&K parser. These results demonstrate that our tree approximation has little negative impact on constituency parsing performance.

Conclusion
This paper proposes a conversion of PTB graphs into PTB augmented trees, which enables us to reduce nonlocal dependency identification and constituency parsing into single parsing. Our proposed conversion method can be easily combined  with other tree-based parsers. We can expect that the evolution of tree-based parsing technology makes our approach improve the accuracy of nonlocal dependency identification.