A Transition-Based Directed Acyclic Graph Parser for UCCA

We present the first parser for UCCA, a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. To our knowledge, the conjunction of these formal properties is not supported by any existing parser. Our transition-based parser, which uses a novel transition set and features based on bidirectional LSTMs, has value not just for UCCA parsing: its ability to handle more general graph structures can inform the development of parsers for other semantic DAG structures, and in languages that frequently use discontinuous structures.


Introduction
Universal Conceptual Cognitive Annotation (UCCA, Abend and Rappoport, 2013) is a cross-linguistically applicable semantic representation scheme, building on the established Basic Linguistic Theory typological framework (Dixon, 2010a,b, 2012) and the Cognitive Linguistics literature (Croft and Cruse, 2004). It has demonstrated applicability to multiple languages, including English, French, German and Czech, support for rapid annotation by non-experts (assisted by an accessible annotation interface), and stability under translation (Sulem et al., 2015). It has also proven useful for machine translation evaluation (Birch et al., 2016). UCCA differs from syntactic schemes in terms of content and formal structure. It exhibits reentrancy, discontinuous nodes and non-terminals, which no single existing parser supports. Because no parser has been available, UCCA's applicability has so far been limited, a gap this work addresses.
We evaluate TUPA, our transition-based UCCA parser, on the English UCCA corpora, in both in-domain and out-of-domain settings. To assess the ability of existing parsers to tackle the task, we develop a conversion procedure from UCCA to bilexical graphs and trees. Results show superior performance for TUPA, demonstrating the effectiveness of the presented approach. The rest of the paper is structured as follows: Section 2 describes UCCA in more detail. Section 3 introduces TUPA. Section 4 discusses the data and experimental setup. Section 5 presents the experimental results. Section 6 summarizes related work, and Section 7 concludes the paper.

The UCCA Scheme
UCCA graphs are labeled, directed acyclic graphs (DAGs), whose leaves correspond to the tokens of the text. A node (or unit) corresponds to a terminal or to several terminals (not necessarily contiguous) viewed as a single entity according to semantic or cognitive considerations. Edges bear a category, indicating the role of the sub-unit in the parent relation. Figure 1 presents a few examples. UCCA is a multi-layered representation, where each layer corresponds to a "module" of semantic distinctions. UCCA's foundational layer, targeted in this paper, covers the predicate-argument structure evoked by predicates of all grammatical categories (verbal, nominal, adjectival and others), the inter-relations between them, and other major linguistic phenomena such as coordination and multi-word expressions. The layer's basic notion is the scene, describing a state, action, movement or some other relation that evolves in time. Each scene contains one main relation (marked as either a Process or a State), as well as one or more Participants. For example, the sentence "After graduation, John moved to Paris" (Figure 1a) contains two scenes, whose main relations are "graduation" and "moved". "John" is a Participant in both scenes, while "Paris" is a Participant only in the latter. Further categories account for inter-scene relations and the internal structure of complex arguments and relations (e.g. coordination, multi-word expressions and modification).
One incoming edge for each non-root node is marked as primary, and the rest (mostly used for implicit relations and arguments) as remote edges, a distinction made by the annotator. The primary edges thus form a tree structure, whereas the remote edges enable reentrancy, forming a DAG.
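The primary/remote distinction can be made concrete with a small sketch in Python. It encodes the example sentence "After graduation, John moved to Paris" as a list of edges. The node names (`scene1`, `scene2`, `to_Paris`) and the single-letter edge categories are illustrative assumptions, not taken from the annotation itself, and punctuation is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    parent: str
    child: str
    label: str            # edge category (illustrative values below)
    remote: bool = False  # True for remote edges, False for primary edges


# Hypothetical encoding of "After graduation, John moved to Paris".
edges = [
    Edge("root", "After", "L"),
    Edge("root", "scene1", "H"),
    Edge("root", "scene2", "H"),
    Edge("scene1", "graduation", "P"),
    Edge("scene1", "John", "A", remote=True),  # reentrancy: second parent of "John"
    Edge("scene2", "John", "A"),
    Edge("scene2", "moved", "P"),
    Edge("scene2", "to_Paris", "A"),
    Edge("to_Paris", "to", "R"),
    Edge("to_Paris", "Paris", "C"),
]


def parents(edge_list, primary_only=False):
    """Map each child node to the list of its parents."""
    result = defaultdict(list)
    for edge in edge_list:
        if primary_only and edge.remote:
            continue
        result[edge.child].append(edge.parent)
    return dict(result)


primary_parents = parents(edges, primary_only=True)
all_parents = parents(edges)
```

With primary edges only, every non-root node has exactly one parent (a tree); once the remote edge is included, "John" has two parents and the structure is a DAG.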
While parsing technology in general, and transition-based parsing in particular, is well-established for syntactic parsing, UCCA has several properties that distinguish it from syntactic representations, chiefly its tendency to abstract away from syntactic detail that does not affect argument structure. For instance, consider the following examples where the concept of a scene has a different rationale from the syntactic concept of a clause. First, non-verbal predicates in UCCA are represented like verbal ones, such as when they appear in copula clauses or noun phrases. Indeed, in Figure 1a, "graduation" and "moved" are considered separate events, despite appearing in the same clause. Second, in the same example, "John" is marked as a (remote) Participant in the graduation scene, despite not being overtly marked. Third, consider the possessive construction in Figure 1c. While in UCCA "trip" evokes a scene in which "John and Mary" is a Participant, a syntactic scheme would analyze this phrase similarly to "John and Mary's shoes".
These examples demonstrate that a UCCA parser, and more generally semantic parsers, face an additional level of ambiguity compared to their syntactic counterparts (e.g., "after graduation" is formally very similar to "after 2pm", which does not evoke a scene). Section 6 discusses UCCA in the context of other semantic schemes, such as AMR (Banarescu et al., 2013).
Alongside recent progress in dependency parsing into projective trees, there is increasing interest in parsing into representations with more general structural properties (see Section 6). One such property is reentrancy, namely the sharing of semantic units between predicates. For instance, in Figure 1a, "John" is an argument of both "graduation" and "moved", yielding a DAG rather than a tree. A second property is discontinuity, as in Figure 1b, where "gave up" forms a discontinuous semantic unit. Discontinuities are pervasive, e.g., with multi-word expressions. Finally, unlike most dependency schemes, UCCA uses non-terminal nodes to represent units comprising more than one word. The use of non-terminal nodes is motivated by constructions with no clear head, including coordination structures (e.g., "John and Mary" in Figure 1c), some multi-word expressions (e.g., "The Haves and the Have Nots"), and prepositional phrases (either the preposition or the head noun can serve as the constituent's head). To our knowledge, no existing parser supports all structural properties required for UCCA parsing.

Transition-based UCCA Parsing
We now turn to presenting TUPA. Building on previous work on parsing reentrancies, discontinuities and non-terminal nodes, we define an extended set of transitions and features that supports the conjunction of these properties.
Transition-based parsers (Nivre, 2003) scan the text from start to end, and create the parse incrementally by applying a transition at each step to the parser's state, defined using three data structures: a buffer $B$ of tokens and nodes to be processed, a stack $S$ of nodes currently being processed, and a graph $G = (V, E, \ell)$ of constructed nodes and edges, where $V$ is the set of nodes, $E$ is the set of edges, and $\ell : E \to L$ is the label function, $L$ being the set of possible labels. Some states are marked as terminal, meaning that $G$ is the final output. A classifier is used at each step to select the next transition based on features encoding the parser's current state. During training, an oracle creates training instances for the classifier, based on gold-standard annotations.
Transition Set. Given a sequence of tokens $w_1, \ldots, w_n$, we predict a UCCA graph $G$ over the sequence. Parsing starts with a single node on the stack (an artificial root node), and the input tokens in the buffer. Figure 2 shows the transition set.
In addition to the standard SHIFT and REDUCE operations, we follow previous work in transition-based constituency parsing (Sagae and Lavie, 2005), adding the NODE transition for creating new non-terminal nodes. For every $X \in L$, NODE$_X$ creates a new node on the buffer as a parent of the first element on the stack, with an $X$-labeled edge. LEFT-EDGE$_X$ and RIGHT-EDGE$_X$ create a new primary $X$-labeled edge between the first two elements on the stack, where the parent is the left or the right node, respectively. As a UCCA node may only have one incoming primary edge, EDGE transitions are disallowed if the child node already has an incoming primary edge. LEFT-REMOTE$_X$ and RIGHT-REMOTE$_X$ do not have this restriction, and the created edge is additionally marked as remote. We distinguish between these two pairs of transitions to allow the parser to create remote edges without the possibility of producing invalid graphs. To support the prediction of multiple parents, node and edge transitions leave the stack unchanged, as in other work on transition-based dependency graph parsing (Sagae and Tsujii, 2008; Ribeyre et al., 2014; Tokgöz and Eryigit, 2015). REDUCE pops the stack, to allow removing a node once all its edges have been created. To handle discontinuous nodes, SWAP pops the second node on the stack and adds it to the top of the buffer, as with the similarly named transition in previous work (Nivre, 2009; Maier, 2015). Finally, FINISH pops the root node and marks the state as terminal.
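The transitions described above can be sketched as methods on a small parser-state class. This is a minimal illustration in Python, not the TUPA implementation: the class and method names are hypothetical, the left/right naming of the edge transitions follows the prose read literally (which is an assumption; the opposite convention would only swap the two method names), and the full set of preconditions from Figure 2 is reduced to the single-primary-parent check.

```python
from dataclasses import dataclass, field

ROOT = "root"


@dataclass
class State:
    buffer: list                                          # tokens and nodes to process
    stack: list = field(default_factory=lambda: [ROOT])   # top of stack is the last item
    edges: list = field(default_factory=list)             # (parent, child, label, remote)
    created: int = 0                                      # running index for new nodes
    terminal: bool = False

    def _add_edge(self, parent, child, label, remote):
        has_primary = any(c == child and not r for _, c, _, r in self.edges)
        if not remote and has_primary:
            raise ValueError(f"{child} already has a primary parent")
        self.edges.append((parent, child, label, remote))

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def reduce(self):
        self.stack.pop()

    def node(self, label):
        # New non-terminal becomes a parent of the stack top and is put on the buffer.
        self.created += 1
        new = f"n{self.created}"
        self._add_edge(new, self.stack[-1], label, remote=False)
        self.buffer.insert(0, new)

    def left_edge(self, label, remote=False):
        # Assumed reading: parent is the left (second) node, child is the stack top.
        self._add_edge(self.stack[-2], self.stack[-1], label, remote)

    def right_edge(self, label, remote=False):
        # Assumed reading: parent is the right (top) node, child is the second node.
        self._add_edge(self.stack[-1], self.stack[-2], label, remote)

    def swap(self):
        # Move the second stack node back to the front of the buffer.
        self.buffer.insert(0, self.stack.pop(-2))

    def finish(self):
        if self.stack.pop() != ROOT:
            raise ValueError("FINISH expects the root on top of the stack")
        self.terminal = True


# A hand-written derivation for the two tokens "John moved" (labels are illustrative).
state = State(buffer=["John", "moved"])
state.shift()          # stack: root, John
state.node("A")        # n1 -A-> John; n1 goes to the front of the buffer
state.reduce()         # stack: root
state.shift()          # stack: root, n1
state.left_edge("H")   # root -H-> n1
state.shift()          # stack: root, n1, moved
state.left_edge("P")   # n1 -P-> moved
state.reduce()         # stack: root, n1
state.reduce()         # stack: root
state.finish()         # pops the root, state becomes terminal
```

Because node and edge transitions leave the stack unchanged, a node can receive several parents (via the `remote=True` variants) before it is finally removed with `reduce`.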
Classifier. The choice of classifier and feature representation has been shown to play an important role in transition-based parsing (Chen and Manning, 2014; Andor et al., 2016; Kiperwasser and Goldberg, 2016). To investigate the impact of the type of transition classifier in UCCA parsing, we experiment with three different models.
1. Starting with a simple and common choice (e.g., Maier and Lichte, 2016), TUPA Sparse uses a linear classifier with sparse features, trained with the averaged structured perceptron algorithm (Collins and Roark, 2004) and MIN-UPDATE (Goldberg and Elhadad, 2011): each feature requires a minimum number of updates in training to be included in the model.
2. Changing the model to a feedforward neural network with dense embedding features, TUPA MLP ("multi-layer perceptron") uses an architecture similar to that of Chen and Manning (2014), but with two rectified linear layers instead of one layer with cube activation. The embeddings and classifier are trained jointly.
3. Finally, TUPA BiLSTM uses a bidirectional LSTM for feature representation, on top of the dense embedding features, an architecture similar to Kiperwasser and Goldberg (2016). The BiLSTM runs on the input tokens in forward and backward directions, yielding a vector representation that is then concatenated with dense features representing the parser state (e.g., existing edge labels and previous parser actions; see below). This representation is then fed into a feedforward network similar to TUPA MLP. The feedforward layers, BiLSTM and embeddings are all trained jointly.

Figure 2: The transition set of TUPA. We write the stack with its top to the right and the buffer with its head to the left. $(\cdot, \cdot)_X$ denotes a primary $X$-labeled edge, and $(\cdot, \cdot)^*_X$ a remote $X$-labeled edge. $i(x)$ is a running index for the created nodes. In addition to the specified conditions, the prospective child in an EDGE transition must not already have a primary parent.
For all classifiers, inference is performed greedily, i.e., without beam search. Hyperparameters are tuned on the development set (see Section 4).
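Greedy inference can be sketched as a loop that, at each step, applies the highest-scoring legal transition. The Python sketch below is a toy under stated assumptions: it uses only SHIFT, REDUCE and FINISH, the legality conditions are simplified, and `toy_score` is a hypothetical stand-in for a trained classifier.

```python
ROOT = "root"


def legal_transitions(state):
    """Simplified legality conditions for a three-transition toy system."""
    stack, buffer = state
    legal = []
    if buffer:
        legal.append("SHIFT")
    if len(stack) > 1:
        legal.append("REDUCE")
    if not buffer and stack == (ROOT,):
        legal.append("FINISH")
    return legal


def apply(state, transition):
    stack, buffer = state
    if transition == "SHIFT":
        return stack + buffer[:1], buffer[1:]
    # REDUCE and FINISH both pop the top of the stack in this toy system.
    return stack[:-1], buffer


def toy_score(state, transition):
    # Hypothetical scores; a real parser would query its classifier here.
    return {"SHIFT": 2.0, "REDUCE": 1.0, "FINISH": 0.0}[transition]


def greedy_parse(tokens, score):
    state = ((ROOT,), tuple(tokens))
    history = []
    while state[0]:  # the state is terminal once the root has been popped
        best = max(legal_transitions(state), key=lambda t: score(state, t))
        state = apply(state, best)
        history.append(best)
    return history


history = greedy_parse(["a", "b"], toy_score)
```

Only one hypothesis is kept at any time, in contrast to beam search, which would keep several partial derivations and their cumulative scores.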
Features. TUPA Sparse uses binary indicator features representing the words, POS tags, syntactic dependency labels and existing edge labels related to the top four stack elements and the next three buffer elements, in addition to their children and grandchildren in the graph. We also use bi- and trigram features based on these values (Zhang and Clark, 2009; Zhu et al., 2013), features related to discontinuous nodes (Maier, 2015, including separating punctuation and gap type), features representing existing edges and the number of parents and children, as well as the past actions taken by the parser. In addition, we use a novel, UCCA-specific feature: number of remote children (see Appendix A for a full list of used feature templates). For TUPA MLP and TUPA BiLSTM, we replace all indicator features by a concatenation of the vector embeddings of all represented elements: words, POS tags, syntactic dependency labels, edge labels, punctuation, gap type and parser actions. These embeddings are initialized randomly. We additionally use external word embeddings initialized with pre-trained word2vec vectors (Mikolov et al., 2013), updated during training. In addition to dropout between NN layers, we apply word dropout (Kiperwasser and Goldberg, 2016): with a certain probability, the embedding for a word is replaced with a zero vector. We do not apply word dropout to the external word embeddings.
Finally, for all classifiers we add a novel real-valued feature to the input vector, ratio, corresponding to the ratio of the number of terminals to the number of nodes in the graph G. This feature serves as a regularizer for the creation of new nodes, and should be beneficial for other transition-based constituency parsers too.
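As a small worked illustration of the ratio feature, the following Python sketch computes it from two counts. The function name and the handling of an empty graph are assumptions, and the text does not specify whether terminals are counted among the nodes, so both counts are simply taken as inputs.

```python
def terminal_node_ratio(num_terminals, num_nodes):
    """Ratio of the number of terminals to the number of nodes in the graph."""
    if num_nodes == 0:
        return 0.0  # assumption: define the feature as 0 for an empty graph
    return num_terminals / num_nodes


early = terminal_node_ratio(5, 5)    # few nodes created so far
later = terminal_node_ratio(5, 10)   # more nodes created for the same terminals
```

For a fixed number of terminals the value shrinks as more nodes are created, which is how the feature can discourage the classifier from creating nodes without bound.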
Training. For training the transition classifiers, we use a dynamic oracle (Goldberg and Nivre, 2012), i.e., an oracle that outputs a set of optimal transitions: transitions that, when applied to the current parser state, leave the gold-standard graph reachable from the resulting state. For example, the oracle would predict a NODE transition if the stack has on its top a parent in the gold graph that has not been created, but would predict a RIGHT-EDGE transition if the second stack element is a parent of the first element according to the gold graph and the edge between them has not been created. If the transition predicted by the classifier is included in the set of optimal transitions, it is deemed correct and is applied to the parser state to reach the subsequent state. Otherwise, a random optimal transition is applied, and for the perceptron-based parser, the classifier's weights are updated according to the perceptron update rule.
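The decision made at each training step can be sketched as follows. This Python fragment is only an illustration of the rule stated above: the function name is hypothetical, transitions are represented as plain strings, and the oracle and the classifier are assumed to exist elsewhere.

```python
import random


def training_step(predicted, optimal, rng):
    """Return (transition to apply, whether the prediction counts as an error).

    `predicted` is the classifier's choice and `optimal` is the non-empty set
    of transitions returned by the oracle for the current state.
    """
    if predicted in optimal:
        return predicted, False
    # Wrong prediction: follow a random optimal transition instead.
    return rng.choice(sorted(optimal)), True


rng = random.Random(0)
correct = training_step("SHIFT", {"SHIFT", "NODE_A"}, rng)
wrong = training_step("REDUCE", {"SHIFT"}, rng)
```

The second return value marks the steps at which the perceptron-based parser would update its weights.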
POS tags and syntactic dependency labels are extracted using spaCy (Honnibal and Johnson, 2015). We use the categorical cross-entropy objective function and optimize the NN classifiers with the Adam optimizer (Kingma and Ba, 2014). While UCCA edges can cross sentence boundaries, we adhere to the common practice in semantic parsing and train our parsers on individual sentences, discarding inter-relations between them (0.18% of the edges). We also discard linkage nodes and edges (as they often express inter-sentence relations and are thus mostly redundant when applied at the sentence level) as well as implicit nodes. In the out-of-domain experiments, we apply the same parsers (trained on the Wiki training set) to the 20K Leagues corpus without parameter re-tuning.
Implementation. We use the DyNet package (Neubig et al., 2017) for implementing the NN classifiers. Unless otherwise noted, we use the default values provided by the package. See Appendix C for the hyperparameter values we found by tuning on the development set.
Evaluation. We define a simple measure for comparing UCCA structures $G_p = (V_p, E_p, \ell_p)$ and $G_g = (V_g, E_g, \ell_g)$, the predicted and gold-standard graphs, respectively, over the same sequence of terminals $W = \{w_1, \ldots, w_n\}$. For an edge $e = (u, v)$ in either graph, $u$ being the parent and $v$ the child, its yield $y(e) \subseteq W$ is the set of terminals in $W$ that are descendants of $v$. Define the set of mutual edges between $G_p$ and $G_g$:
$$M(G_p, G_g) = \{(e_1, e_2) \in E_p \times E_g \mid y(e_1) = y(e_2) \wedge \ell_p(e_1) = \ell_g(e_2)\}$$
Labeled precision and recall are defined by dividing $|M(G_p, G_g)|$ by $|E_p|$ and $|E_g|$, respectively, and F-score by taking their harmonic mean. We report two variants of this measure: one where we consider only primary edges, and another for remote edges (see Section 2). Performance on remote edges is of pivotal importance in this investigation, which focuses on extending the class of graphs supported by statistical parsers. We note that the measure collapses to the standard PARSEVAL constituency evaluation measure if $G_p$ and $G_g$ are trees. Punctuation is excluded from the evaluation, but not from the datasets.
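The measure can be sketched in Python as follows. This is a minimal illustration rather than the evaluation script used in the experiments: edges are plain `(parent, child, label)` tuples, a terminal is assumed to belong to its own yield, and mutual edges are counted by multiset intersection of (yield, label) pairs, which is one possible way of handling several edges with the same yield and label.

```python
from collections import Counter


def edge_signatures(edges, terminals):
    """Multiset of (yield, label) pairs, one per edge."""
    children = {}
    for parent, child, _ in edges:
        children.setdefault(parent, []).append(child)

    def yield_of(node):
        if node in terminals:
            return frozenset([node])  # assumption: a terminal yields itself
        result = frozenset()
        for child in children.get(node, []):
            result |= yield_of(child)
        return result

    return Counter((yield_of(child), label) for _, child, label in edges)


def labeled_scores(predicted, gold, terminals):
    """Labeled precision, recall and F-score over matched edges."""
    common = edge_signatures(predicted, terminals) & edge_signatures(gold, terminals)
    matched = sum(common.values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


terminals = {"John", "moved"}
gold = [("root", "s", "H"), ("s", "John", "A"), ("s", "moved", "P")]
predicted = [("root", "s", "H"), ("s", "John", "A"), ("s", "moved", "A")]
precision, recall, f1 = labeled_scores(predicted, gold, terminals)
```

In this toy example two of the three predicted edges match a gold edge in both yield and label, so precision, recall and F-score are all 2/3. The primary and remote variants would apply the same function to the corresponding subsets of edges.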
Comparison to bilexical graph parsers. As no direct comparison with existing parsers is possible, we compare TUPA to bilexical dependency graph parsers, which support reentrancy and discontinuity but not non-terminal nodes.
To facilitate the comparison, we convert our training set into bilexical graphs (see examples in Figure 4), train each of the parsers, and evaluate them by applying them to the test set and then reconstructing UCCA graphs, which are compared with the gold standard. The conversion to bilexical graphs is done by heuristically selecting a head terminal for each non-terminal node, and attaching all terminal descendants to the head terminal. In the inverse conversion, we traverse the bilexical graph in topological order, creating non-terminal parents for all terminals, and attaching them to the previously-created non-terminals corresponding to the bilexical heads (see Appendix D for a detailed description of the conversion procedures). In Section 5 we report the upper bounds on the achievable scores due to the error resulting from the removal of non-terminal nodes.
Comparison to tree parsers. For completeness, and as parsing technology is considerably more mature for tree (rather than graph) parsing, we also perform a tree approximation experiment, converting UCCA to (bilexical) trees and evaluating constituency and dependency tree parsers on them (see examples in Figure 5). Our approach is similar to the tree approximation approach used for dependency graph parsing (Agić et al., 2015; Fernández-González and Martins, 2015), where dependency graphs were converted into dependency trees and then parsed by dependency tree parsers. In our setting, the conversion to trees consists simply of removing remote edges from the graph, and then to bilexical trees by applying the same procedure as for bilexical graphs.
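The first step of the tree approximation, removing remote edges, is simple enough to sketch directly. The Python fragment below assumes a hypothetical edge representation as `(parent, child, label, remote)` tuples and uses illustrative labels; the subsequent conversion to bilexical trees is not shown.

```python
def tree_approximation(edges):
    """Keep only primary edges; each edge is (parent, child, label, remote)."""
    return [edge for edge in edges if not edge[3]]


def has_single_parents(edges):
    """True if no node appears as the child of more than one edge."""
    children = [child for _, child, _, _ in edges]
    return len(children) == len(set(children))


dag = [
    ("root", "scene1", "H", False),
    ("root", "scene2", "H", False),
    ("scene1", "graduation", "P", False),
    ("scene1", "John", "A", True),   # remote edge: the reentrancy
    ("scene2", "John", "A", False),
    ("scene2", "moved", "P", False),
]
tree = tree_approximation(dag)
```

After the conversion every node has a single parent, which is what makes tree parsers applicable, and also why such parsers cannot recover remote edges.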
Baseline parsers. We evaluate two bilexical graph semantic dependency parsers: DAGParser (Ribeyre et al., 2014), the leading transition-based parser in SemEval 2014 (Oepen et al., 2014), and TurboParser (Almeida and Martins, 2015), a graph-based parser from SemEval 2015; UPARSE (Maier and Lichte, 2016), a transition-based constituency parser supporting discontinuous constituents; and two bilexical tree parsers: MaltParser (Nivre et al., 2007) and the stack LSTM-based parser of Dyer et al. (2015, henceforth "LSTM Parser"). Default settings are used in all cases. DAGParser and UPARSE use beam search by default, with a beam size of 5 and 4 respectively. The other parsers are greedy.

Results
The upper bounds for the baseline parsers reflect the error resulting from the conversion. DAGParser and UPARSE are most directly comparable to TUPA Sparse, as they also use a perceptron classifier with sparse features. TUPA Sparse considerably outperforms both, and DAGParser does not predict any remote edges in the out-of-domain setting. TurboParser fares worse in this comparison, despite somewhat better results on remote edges. The LSTM parser of Dyer et al. (2015) obtains the highest primary F-score among the baseline parsers, with a considerable margin.
Using a feedforward NN and embedding features, TUPA MLP obtains higher scores than TUPA Sparse , but is outperformed by the LSTM parser on primary edges. However, using better input encoding allowing virtual look-ahead and look-behind in the token representation, TUPA BiLSTM obtains substantially higher scores than TUPA MLP and all other parsers, on both primary and remote edges, both in the in-domain and out-of-domain settings. Its performance in absolute terms, of 73.5% F-score on primary edges, is encouraging in light of UCCA's inter-annotator agreement of 80-85% F-score on them (Abend and Rappoport, 2013).
The parsers resulting from tree approximation are unable to recover any remote edges, as these are removed in the conversion. (We also experimented with a simpler version of TUPA lacking REMOTE transitions, obtaining an increase of up to 2 labeled F-score points on primary edges, at the cost of not being able to predict remote edges.) The bilexical DAG parsers are quite limited in this respect as well. While some of the DAG parsers' difficulty can be attributed to the conversion upper bound of 58.3%, this in itself cannot account for their poor performance on remote edges, which is an order of magnitude lower than that of TUPA BiLSTM. (The low upper bound for remote edges is partly due to the removal of implicit nodes, which are not supported in bilexical representations: the whole sub-graph headed by such nodes, often containing remote edges, must be discarded.)

Related Work
While earlier work on anchored semantic parsing has mostly concentrated on shallow semantic analysis, focusing on semantic role labeling of verbal argument structures, the focus has recently shifted to parsing of more elaborate representations that account for a wider range of phenomena. By anchored we mean that the semantic representation directly corresponds to the words and phrases of the text.
Grammar-Based Parsing. Linguistically expressive grammars such as HPSG (Pollard and Sag, 1994), CCG (Steedman, 2000) and TAG (Joshi and Schabes, 1997) provide a theory of the syntax-semantics interface, and have been used as a basis for semantic parsers by defining compositional semantics on top of them (Flickinger, 2000; Bos, 2005, among others). Depending on the grammar and the implementation, such semantic parsers can support some or all of the structural properties UCCA exhibits. Nevertheless, this line of work differs from our approach in two important ways. First, the representations are different. UCCA does not attempt to model the syntax-semantics interface and is thus less coupled with syntax. Second, while grammar-based parsers explicitly model syntax, our approach directly models the relation between tokens and semantic structures, without explicit composition rules.
Broad-Coverage Semantic Parsing. Most closely related to this work is Broad-Coverage Semantic Dependency Parsing (SDP), addressed in two SemEval tasks (Oepen et al., 2014). Like UCCA parsing, SDP addresses a wide range of semantic phenomena, and supports discontinuous units and reentrancy. In SDP, however, bilexical dependencies are used, and a head must be selected for every relation, even in constructions that have no clear head, such as coordination (Ivanova et al., 2012). The use of non-terminal nodes is a simple way to avoid this liability. SDP also differs from UCCA in the type of distinctions it makes, which are more tightly coupled with syntactic considerations, whereas UCCA aims to capture purely semantic cross-linguistically applicable notions. For instance, the "poss" label in the DM target representation is used to annotate syntactic possessive constructions, regardless of whether they correspond to semantic ownership (e.g., "John's dog") or other semantic relations, such as marking an argument of a nominal predicate (e.g., "John's kick"). UCCA reflects the difference between these constructions.
Unlike in UCCA, the alignment between AMR concepts and the text is not explicitly marked. While AMR shares much of this work's motivation, not anchoring the representation in the text complicates the parsing task, as it requires the alignment to be automatically (and imprecisely) detected. Indeed, despite considerable technical effort (Pourdamghani et al., 2014; Werling et al., 2015), concept identification is only about 80%-90% accurate. Furthermore, anchoring allows breaking down sentences into semantically meaningful sub-spans, which is useful for many applications (Fernández-González and Martins, 2015; Birch et al., 2016).
Several transition-based AMR parsers have been proposed: CAMR assumes syntactically parsed input, processing dependency trees into AMR (Wang et al., 2015a,b, 2016; Goodman et al., 2016). In contrast, the parsers of Damonte et al. (2017) and Zhou et al. (2016) do not require syntactic pre-processing. Damonte et al. (2017) perform concept identification using a simple heuristic selecting the most frequent graph for each token, and Zhou et al. (2016) perform concept identification and parsing jointly. UCCA parsing does not require separately aligning the input tokens to the graph. TUPA creates non-terminal units as part of the parsing process.
Furthermore, existing transition-based AMR parsers are not general DAG parsers. They are only able to predict a subset of reentrancies and discontinuities, as they may remove nodes before their parents have been predicted (Damonte et al., 2017). They are thus limited to a sub-class of AMRs in particular, and specifically cannot produce arbitrary DAG parses. TUPA's transition set, on the other hand, allows general DAG parsing.

Conclusion
We present TUPA, the first parser for UCCA. Evaluating it in in-domain and out-of-domain settings, we show that, coupled with a NN classifier and BiLSTM feature extractor, it accurately predicts UCCA graphs from text, outperforming a variety of strong baselines by a substantial margin.
Despite the recent diversity of semantic parsing work, the effectiveness of different approaches for structurally and semantically different schemes is not well-understood (Kuhlmann and Oepen, 2016). Our contribution to this literature is a general parser that supports multiple parents, discontinuous units and non-terminal nodes. Future work will evaluate TUPA in a multilingual setting, assessing UCCA's cross-linguistic applicability. We will also apply the TUPA transition scheme to different target representations, including AMR and SDP, exploring the limits of its generality. In addition, we will explore different conversion procedures (Kong et al., 2015) to compare different representations, suggesting ways for a data-driven design of semantic annotation.
A parser for UCCA will enable using the framework for new tasks, in addition to existing applications such as machine translation evaluation (Birch et al., 2016). We believe UCCA's merits in providing a cross-linguistically applicable, broad-coverage annotation will support ongoing efforts to incorporate deeper semantic structures into various applications, such as sentence simplification (Narayan and Gardent, 2014) and summarization (Liu et al., 2015).