Broad-Coverage Semantic Parsing as Transduction

We unify different broad-coverage semantic parsing tasks under a transduction paradigm, and propose an attention-based neural framework that incrementally builds a meaning representation via a sequence of semantic relations. By leveraging multiple attention mechanisms, the transducer can be effectively trained without relying on a pre-trained aligner. Experiments conducted on three separate broad-coverage semantic parsing tasks -- AMR, SDP and UCCA -- demonstrate that our attention-based neural transducer improves the state of the art on both AMR and UCCA, and is competitive with the state of the art on SDP.


Introduction
Broad-coverage semantic parsing aims at mapping any natural language text, regardless of its domain, genre, or even the language itself, into a general-purpose meaning representation. As a long-standing topic of interest in computational linguistics, broad-coverage semantic parsing has targeted a number of meaning representation frameworks, including CCG (Steedman, 1996, 2001), DRS (Kamp and Reyle, 1993; Bos, 2008), AMR (Banarescu et al., 2013), UCCA (Abend and Rappoport, 2013), SDP (Oepen et al., 2014, 2015), and UDS (White et al., 2016). Each of these frameworks comes with its own formal and linguistic assumptions. Such framework-specific "balkanization" results in a variety of framework-specific parsing approaches, and the state-of-the-art semantic parser for one framework is not always applicable to another. For instance, the state-of-the-art approaches to SDP parsing (Dozat and Manning, 2018; Peng et al., 2017a) are not directly transferable to AMR and UCCA because those frameworks lack explicit alignments between tokens in the sentence and nodes in the semantic graph.
While transition-based approaches are adaptable to different broad-coverage semantic parsing tasks (Wang et al., 2018; Hershcovich et al., 2018; Damonte et al., 2017), when it comes to representations such as AMR, whose nodes are unanchored to tokens in the sentence, a pre-trained aligner has to be used to produce the reference transition sequences (Damonte et al., 2017; Peng et al., 2017b). In contrast, there have been attempts to develop attention-based approaches in a graph-based parsing paradigm (Dozat and Manning, 2018), but they lack parsing incrementality, which has been advocated on grounds of both computational efficiency and cognitive modeling (Nivre, 2004; Huang and Sagae, 2010).
In this paper, we approach different broad-coverage semantic parsing tasks under a unified framework of transduction. We propose an attention-based neural transducer that extends a recent two-stage semantic parser to directly transduce input text into a meaning representation in one stage. This transducer has properties of both transition-based and graph-based approaches: on the one hand, it builds a meaning representation incrementally via a sequence of semantic relations, similar to a transition-based parser; on the other hand, it leverages multiple attention mechanisms used in recent graph-based parsers, thereby removing the need for pre-trained aligners.
Requiring only minor task-specific adaptations, we apply this framework to three separate broad-coverage semantic parsing tasks: AMR, SDP, and UCCA. Experimental results show that our neural transducer outperforms the state-of-the-art parsers on AMR (77.0% F1 on LDC2017T10 and 71.3% F1 on LDC2014T12) and UCCA (76.6% F1 on the English-Wiki dataset v1.2), and is competitive with the state of the art on SDP (92.2% F1 on the English DELPH-IN MRS dataset).

Background and Related Work
We provide summary background on the meaning representations we target, and review related work on parsing for each.

Abstract Meaning Representation (AMR; Banarescu et al., 2013) encodes sentence-level semantics, such as predicate-argument information, reentrancies, named entities, negation, and modality, in a rooted, directed, and usually acyclic graph with node and edge labels. AMR graphs abstract away from syntactic realizations, i.e., there is no explicit correspondence between elements of the graph and the surface utterance. Fig. 1(a) shows an example AMR graph.
Since its first general release in 2014, AMR has been a popular target of data-driven semantic parsing, notably in two SemEval shared tasks (May, 2016; May and Priyadarshi, 2017). Graph-based parsers build AMRs by identifying concepts and scoring edges between them, either in a pipeline (Flanigan et al., 2014) or jointly (Zhou et al., 2016; Lyu and Titov, 2018). This two-stage parsing process limits the parser's incrementality. Transition-based parsers either transform dependency trees into AMRs (Wang et al., 2015, 2016; Goodman et al., 2016), or employ transition systems specifically tailored to AMR parsing (Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017). Transition-based parsers rely on a pre-trained aligner to produce the reference transitions. Grammar-based parsers leverage external semantic resources to derive AMRs compositionally based on CCG rules (Artzi et al., 2015) or SHRG rules (Peng et al., 2015). Another line of work uses neural machine translation models to convert sentences into linearized AMRs (Barzdins and Gosko, 2016; Peng et al., 2017b), but has relied on data augmentation to produce effective parsers (van Noord and Bos, 2017; Konstas et al., 2017). Our parser differs from the previous ones in that it is incremental without relying on pre-trained aligners, and can be effectively trained without data augmentation.

Semantic Dependency Parsing (SDP) was introduced in the 2014 and 2015 SemEval shared tasks (Oepen et al., 2014, 2015). It is centered around three semantic formalisms -- DM (DELPH-IN MRS; Flickinger et al., 2012; Oepen and Lønning, 2006), PAS (Predicate-Argument Structures; Miyao and Tsujii, 2004), and PSD (Prague Semantic Dependencies; Hajič et al., 2012) -- representing predicate-argument relations between content words in a sentence. Their annotations have been converted into bi-lexical dependencies, forming directed graphs whose nodes injectively correspond to surface lexical units, and whose edges represent semantic relations between nodes.
In this work, we focus only on the DM formalism. Fig. 1(b) shows an example DM graph.
Most recent parsers for SDP are graph-based: Peng et al. (2017a, 2018) use a max-margin classifier on top of a BiLSTM, with the score for each graph factored over predicates, unlabeled arcs, and arc labels. Multi-task learning and disjoint data have been used to improve parser performance. Dozat and Manning (2018) extend an LSTM-based syntactic dependency parser to produce graph-structured dependencies, and carefully tune it to state-of-the-art performance. Wang et al. (2018) extend the transition system of Choi and McCallum (2013) to produce non-projective trees, and use improved versions of stack-LSTMs (Dyer et al., 2015) to learn representations for key components. All of these are specialized for bi-lexical dependency parsing, whereas our parser can effectively produce both bi-lexical semantic graphs and graphs that are less anchored to the surface utterance.
Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013) targets a level of semantic granularity that abstracts away from syntactic paraphrases in a typologically-motivated, cross-linguistic fashion. Sentence representations in UCCA are directed acyclic graphs (DAGs), where terminal nodes correspond to surface lexical tokens, and non-terminal nodes to semantic units that participate in super-ordinate relations. Edges are labeled, indicating the role of a child in the relation that the parent represents. Fig. 1(c) shows an example UCCA DAG.
The first UCCA parser was proposed by Hershcovich et al. (2017), who extend a transition system to produce DAGs. To leverage other semantic resources, Hershcovich et al. (2018) present (lossy) conversions from AMR, SDP, and Universal Dependencies (UD; Nivre et al., 2016) into a unified UCCA-based DAG format, and explore multi-task learning under the unified format. While multi-task learning improves UCCA parsing results, it shows poor performance on AMR, SDP, and UD parsing. In contrast, different semantic parsing tasks are formalized in our unified transduction paradigm with no loss, and our approach achieves state-of-the-art or competitive performance on each task, using only single-task data.

Unified Arborescence Format
We first introduce a unified target format for the different broad-coverage semantic parsing tasks. A meaning representation in the unified format is an arborescence (i.e., a directed rooted tree), which is converted from its corresponding task-specific semantic graph via the following reversible steps:

AMR Reentrancy is what can keep an AMR graph from being an arborescence: a reentrant node has more than one incoming edge. Following prior work, we convert an AMR graph into an arborescence by duplicating nodes that have reentrant relations; that is, whenever a node has a reentrant relation, we make a copy of that node and use the copy to participate in the relation, thereby resulting in an arborescence. Next, in order to preserve the reentrancy information, we assign a node index to each node. Duplicated nodes are assigned the same index as the original node. Fig. 1(d) shows an AMR arborescence converted from Fig. 1(a): the two "person" nodes share the node index 2. The original AMR graph can be recovered by merging identically indexed nodes.

DM We first break the DM graph into a set of weakly connected subgraphs. For each subgraph, if it contains the top node, we treat top as the root; otherwise, we treat the node with the maximum outdegree as the root. We then run a depth-first traversal over each subgraph from its root to yield an arborescence, and repeat the following three steps until no more edges can be added: (1) we run a breadth-first traversal over the arborescence from the root until we find a node that has an incoming edge not belonging to the arborescence; (2) we reverse that edge and add a -of suffix to its label; (3) we run a depth-first search from that node to add more edges to the arborescence. Throughout this process, we add node indices and duplicate reentrant nodes in the same way as in the AMR conversion. Finally, we connect the arborescences by adding a null edge from top to each of the other arborescence roots. Fig. 1(e) shows a DM arborescence converted from Fig. 1(b).
The original DM graph can be recovered by removing null edges, merging identically indexed nodes, and reversing edges with the -of suffix.

UCCA To date, official UCCA evaluation only considers UCCA's foundational layer, which is already an arborescence. We convert it to the unified arborescence format by first collapsing subgraphs of pre-terminal nodes: we replace each pre-terminal node with its first terminal node; if the pre-terminal node has other terminals, we add a special phrase edge from the first terminal node to the other terminal nodes. The collapsing step largely reduces the number of terminal nodes in UCCA. We then add labels to the remaining non-terminal nodes; each node label is simply the label of the node's incoming edge. We find that adding node labels improves the performance of our neural transducer (see Section 6.2 for the experimental results). Lastly, we add node indices in the same way as in the AMR conversion. Fig. 1(f) shows a UCCA arborescence converted from Fig. 1(c). The original UCCA DAG can be recovered by expanding pre-terminal subgraphs and removing non-terminal node labels.

Fig. 2: The encoder-decoder architecture of our attention-based neural transducer. An encoder encodes the input text into hidden states. A decoder is composed of three modules: a target node module, a relation type module, and a source node module. At each decoding time step, the decoder takes the previous semantic relation as input, and outputs a new semantic relation in a factorized way: first, the target node module produces a new target node; second, the source node module points to a preceding node as the new source node; finally, the relation type module predicts the relation type between the source and target nodes.
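The AMR duplication step described above can be sketched as follows. This is a minimal illustration under our own assumptions (a graph encoded as adjacency lists, and copies named with a "#copy" suffix); it is not the paper's implementation:

```python
def amr_to_arborescence(root, children):
    """Convert an AMR graph to an arborescence by duplicating reentrant nodes.

    children: dict mapping a node to a list of (relation, child) pairs.
    Returns (tree, index): the arborescence as adjacency lists, and a map
    from each arborescence node to its node index (copies share the index
    of the original, so the graph can be recovered by merging).
    """
    index, tree, first_visit = {}, {}, {}
    counter = [0]

    def visit(node, node_id):
        # Assign an index on first visit; copies will reuse it.
        index[node_id] = first_visit.setdefault(node, len(first_visit) + 1)
        tree[node_id] = []
        for rel, child in children.get(node, []):
            if child in first_visit:
                # Reentrant relation: duplicate the node instead of re-entering.
                copy_id = f"{child}#copy{counter[0]}"
                counter[0] += 1
                index[copy_id] = first_visit[child]
                tree[copy_id] = []
                tree[node_id].append((rel, copy_id))
            else:
                tree[node_id].append((rel, child))
                visit(child, child)

    visit(root, root)
    return tree, index
```

Merging identically indexed nodes inverts this step, recovering the original graph.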

Problem Formalization
For any broad-coverage semantic parsing task, we denote the input text by X and the output meaning representation in the unified arborescence format by Y, where X is a sequence of tokens x_1, x_2, ..., x_n and Y can be decomposed into a sequence of semantic relations y_1, y_2, ..., y_m. A relation y is a tuple ⟨u, d_u, r, v, d_v⟩, consisting of a source node label u, a source node index d_u, a relation type r, a target node label v, and a target node index d_v.
Let 𝒴 be the output space. The unified transduction problem is to seek the most likely sequence of semantic relations Ŷ given X:

Ŷ = argmax_{Y ∈ 𝒴} ∏_{i=1}^{m} P(y_i | y_{<i}, X)
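Concretely, a semantic relation tuple can be represented as follows. The field names are our own illustrative choices, not the paper's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One semantic relation y = <u, d_u, r, v, d_v>."""
    source_label: str   # u
    source_index: int   # d_u
    rel_type: str       # r
    target_label: str   # v
    target_index: int   # d_v

# A meaning representation Y is then a sequence [y_1, ..., y_m], e.g.:
y = Relation("expressed", 1, "ARG0", "person", 2)
```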

Transducer
To tackle the unified transduction problem, we introduce an attention-based neural transducer that extends a recent two-stage attention-based parser. That parser addresses semantic parsing in two stages: it first employs an extended variant of the pointer-generator network (See et al., 2017) to convert the input text into a list of nodes, and then uses a deep biaffine graph-based parser (Dozat and Manning, 2016) with a maximum spanning tree (MST) algorithm to create edges. In contrast, our attention-based neural transducer directly transduces the input text into a meaning representation in one stage via a sequence of semantic relations. A high-level architecture of our transducer is depicted in Fig. 2: an encoder first encodes the input text into hidden states; then, conditioned on the hidden states, at each decoding time step, a decoder takes the previous semantic relation as input and outputs a new semantic relation, which includes a target node, a relation type, and a source node.
Specifically, there is a significant difference between the two-stage parser and our model: the two-stage parser first predicts nodes, and then edges, and these two stages are carried out separately (except that a shared encoder is used). At the node prediction stage, the model has no knowledge of edges, so node prediction is performed purely based on previous nodes. At the edge prediction stage, the model predicts the head of each node in parallel; head prediction for one node places no constraint on, and has no impact on, another. As a result, an MST algorithm has to be used to search for a valid prediction. In comparison, our model does not have two separate stages for node and edge prediction. At each decoding step, it predicts not only a node, but also the incoming edge to that node, which consists of a source node and a relation type (see Fig. 2 for an example). The predicted node and incoming edge, together with previous predictions, form a partial semantic graph, which is used as input to the next decoding step for the next node and incoming edge prediction. Our model therefore makes predictions based on the partial semantic graph, which helps prune the output space for both nodes and edges. Since at each decoding step we assume the incoming edge always comes from a preceding node (see Section 4.3 for details), the predicted semantic graph is guaranteed to be a valid arborescence, and an MST algorithm is no longer needed.
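This structural guarantee can be illustrated with a toy decoding loop (our own sketch, not the paper's code): because every predicted edge points from a preceding node to the newly produced node, every prefix of the relation sequence is a valid arborescence, so no MST search is required.

```python
def build_arborescence(relations):
    """relations: list of (source_position, rel_type, target_label), where
    source_position indexes a previously produced node (0 is the root).
    Returns the node list and a parent map; validity holds by construction."""
    nodes, parent = ["<root>"], {0: None}
    for src, rel, label in relations:
        # The decoder can only point backwards, so this always succeeds.
        assert 0 <= src < len(nodes), "source must be a preceding node"
        nodes.append(label)
        parent[len(nodes) - 1] = (src, rel)
    return nodes, parent
```

Every node except the root has exactly one parent at a strictly smaller position, so the structure is acyclic and singly rooted at every step.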

Encoder
At the encoding stage, we employ an encoder embedding module to convert the input text into vector representations, and a BiLSTM to encode the vector representations into hidden states.

Encoder Embedding Module concatenates word-level embeddings from GloVe (Pennington et al., 2014) and BERT (Devlin et al., 2018), char-level embeddings from CharCNN (Kim et al., 2016), and randomly initialized embeddings for POS tags.
For AMR, it includes extra randomly initialized embeddings for anonymization indicators that tell the encoder whether a token is an anonymized token from preprocessing.
For UCCA, it includes extra randomly initialized embeddings for NER tags, syntactic dependency labels, punctuation indicators, and word shapes that are provided in the official UCCA dataset.

Multi-layer BiLSTM (Hochreiter and Schmidhuber, 1997) is defined as:

s_t^l = [LSTM_f(s_t^{l-1}, s_{t-1}^l); LSTM_b(s_t^{l-1}, s_{t+1}^l)]

where s_t^l is the l-th layer hidden state at time step t, LSTM_f and LSTM_b are the forward and backward LSTMs, and s_t^0 is the embedding module output for token x_t.

Decoder
Decoder Embedding Module at decoding time step i converts the elements of the input semantic relation y_i = ⟨u_i, d_{u_i}, r_i, v_i, d_{v_i}⟩ into vector representations: u_i and v_i are concatenations of word-level embeddings from GloVe, char-level embeddings from CharCNN, and randomly initialized embeddings for POS tags. POS tags for source and target nodes are inferred at runtime: if a node is copied from the input text, the POS tag of the corresponding token is used; if it is copied from a preceding node, the POS tag of that preceding node is used; otherwise, an UNK tag is used. d_{u_i}, d_{v_i} and r_i are randomly initialized embeddings for the source node index, target node index, and relation type.
Next, the decoder outputs a new semantic relation in the factorized way depicted in Fig. 2: first, a target node module takes the vector representations of the previous semantic relation and predicts a target node label as well as its index; then, a source node module predicts a source node by pointing to a preceding node; lastly, a relation type module takes the predicted source and target nodes and predicts the relation type between them.

Target Node Module converts the vector representations of the input semantic relation into a hidden state z_i in the following way:

h_i^l = LSTM(h_{i-1}^l, v_i)
z_i = FFN^(relation)([h_i^l; c_i; r_i; u_i; d_{u_i}])

where an l-layer LSTM generates the contextual representation h_i^l for target node v_i (for initialization, the LSTM state is taken from the final encoder states), and a feed-forward network FFN^(relation) generates the hidden state z_i of the input semantic relation by combining the contextual representation h_i^l with the encoder context vector c_i and the vector representations r_i, u_i, d_{u_i} of the relation type, source node label, and source node index. The encoder context vector c_i is a weighted sum of the encoder hidden states s_{1:n}^l, where the weights are the attention a_i^(enc) from the decoder at step i to the encoder hidden states:

c_i = Σ_t a_i^(enc)(t) · s_t^l

Given the hidden state z_i for the input semantic relation, we use an extended variant of the pointer-generator network to compute the probability distribution of the next target node label v_{i+1}:

P(v_{i+1}) = p_gen · p_i^(vocab) + p_enc · a_i^(enc) + p_dec · a_i^(dec)

P(v_{i+1}) is a hybrid of three parts: (1) emitting a new node label from a pre-defined vocabulary via the probability distribution p_i^(vocab); (2) copying a token from the encoder input text as the node label via the encoder-side attention a_i^(enc); and (3) copying a node label from the preceding target nodes via the decoder-side attention a_i^(dec). The scalars p_gen, p_enc and p_dec act as a soft switch to control the production of the target node label from the different sources.
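The hybrid distribution can be sketched in numpy as follows. This is a minimal illustration under our own naming (the real model computes the switches and attention weights from z_i); the key point is that the three weighted parts still form a single normalized distribution:

```python
import numpy as np

def hybrid_node_distribution(p_vocab, a_enc, a_dec, switches,
                             enc_token_ids, dec_node_ids):
    """Mix the three sources into one distribution over node labels.

    switches = (p_gen, p_enc, p_dec) with p_gen + p_enc + p_dec = 1;
    enc_token_ids / dec_node_ids map attention positions to label ids.
    """
    p_gen, p_enc, p_dec = switches
    p = p_gen * np.asarray(p_vocab, dtype=float)   # emit from the vocabulary
    for pos, tok in enumerate(enc_token_ids):      # copy from the input text
        p[tok] += p_enc * a_enc[pos]
    for pos, node in enumerate(dec_node_ids):      # copy from preceding nodes
        p[node] += p_dec * a_dec[pos]
    return p                                       # still sums to one
```

Since each component distribution sums to one and the switch weights sum to one, the mixture needs no renormalization.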
The next target node index d_{v_{i+1}} is assigned based on the following rule: if v_{i+1} is copied from a preceding target node, it is assigned that node's index; otherwise, it is assigned a new, unused index.

Source Node Module produces the next source node label u_{i+1} by pointing to a node label among the preceding target node labels (the dotted arrows shown in Fig. 2). The probability distribution of the next source node label u_{i+1} is defined as:

P(u_{i+1}) = softmax(BIAFFINE(h_{i+1}^(start), h_{1:i+1}^(end)))

where BIAFFINE is a biaffine function (Dozat and Manning, 2016), h_{i+1}^(start) is the vector representation for the start of the pointer, and h_{1:i+1}^(end) are the vector representations of its candidate ends. They are computed by two multi-layer perceptrons:

h_{i+1}^(start) = MLP^(start)(h_{i+1}^l)
h_j^(end) = MLP^(end)(h_j^l)

Note that h_{i+1}^l is the LSTM hidden state for target node v_{i+1}, generated by the target node module. We reuse LSTM hidden states from the target node module so that the decoder modules can be trained jointly.
The next source node index d_{u_{i+1}} is then the same as the index of the target node the module points to.

Relation Type Module also reuses LSTM hidden states from the target node module to compute the probability distribution of the next relation type r_{i+1}. Assuming that the source node module points to target node label v_j as the next source node label, the next relation type probability distribution is computed by:

P(r_{i+1}) = softmax(BIAFFINE^(type)(h_j^l, h_{i+1}^l))
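The pointing computation can be sketched with one common biaffine parameterization, as in Dozat and Manning's arc scorer. The shapes and names here are our own illustrative assumptions:

```python
import numpy as np

def point_to_source(h_start, H_end, W, b):
    """P(u_{i+1}) over preceding nodes.

    h_start: (d,) start-of-pointer representation for step i+1.
    H_end:   (j, d) candidate end representations h_{1:i+1}^(end).
    Biaffine score for candidate j: h_end_j^T W h_start + h_end_j^T b.
    """
    scores = H_end @ (W @ h_start) + H_end @ b
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```

A softmax over the biaffine scores yields a proper distribution over however many preceding nodes exist at the current step, so the pointer needs no fixed output size.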

Training
To ensure that at each decoding step the source node can be found among the preceding nodes, we create the reference sequence of semantic relations by running a pre-order traversal over the reference arborescence. The pre-order traversal only determines the order between a node and its children.
As for the order among children, we sort them alphanumerically in the case of AMR, following prior work; in the case of SDP, we sort children by their order in the input text; in the case of UCCA, we sort children by their UCCA node ID. Given a training pair ⟨X, Y⟩, the optimization objective is to maximize the decomposed conditional log-likelihood Σ_i log P(y_i | y_{<i}, X), which is approximated by:

Σ_i [log P(v_i) + log P(u_i) + log P(r_i)]

We also employ label smoothing (Szegedy et al., 2016) to prevent overfitting, and include a coverage loss (See et al., 2017) to penalize repetitive nodes:

covloss_i = Σ_t min(a_i^(enc)(t), cov_i(t))

where cov_i = Σ_{j<i} a_j^(enc) is the coverage vector, i.e., the sum of encoder-side attention distributions over all previous decoding steps.
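The construction of the reference sequence can be sketched as follows. This is our own minimal version; the tree encoding and sort keys are illustrative:

```python
def reference_relations(tree, labels, root, sort_key=None):
    """Pre-order traversal of a reference arborescence, yielding
    (source_label, rel_type, target_label) triples.

    sort_key orders siblings: alphanumeric for AMR, surface order for SDP,
    node ID for UCCA.
    """
    out = []

    def visit(node):
        children = tree.get(node, [])
        if sort_key is not None:
            children = sorted(children, key=sort_key)
        for rel, child in children:
            out.append((labels[node], rel, labels[child]))
            visit(child)   # a parent is emitted before all its descendants
    visit(root)
    return out
```

Because a parent is always emitted before its children, the source node of every relation is guaranteed to appear earlier in the sequence, which is exactly the property the decoder's backward pointer requires.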

Prediction
Our transducer at each decoding time step looks for the source node among the preceding nodes, which ensures that the output of a greedy search is already a valid arborescence Ŷ. Therefore, an MST algorithm such as the O(EV) Chu-Liu-Edmonds algorithm used by the two-stage parser is no longer needed, and the decoding complexity of our transducer is O(V). Moreover, since our transducer builds the meaning representation via a sequence of semantic relations, we implement a beam search over relations (Algo. 1). Compared to a beam search that only keeps the top-k nodes, our beam search finds the top-k relation scores, covering source nodes, relation types, and target nodes.

Data Pre- and Post-processing

AMR Pre- and post-processing steps are similar to those in prior work: in preprocessing, we anonymize subgraphs of entities, remove senses, and convert the resulting AMR graphs into the unified format; in post-processing, we assign the most frequent sense to nodes, restore Wikipedia links using the DBpedia Spotlight API (Daiber et al., 2013), add polarity attributes based on rules observed in the training data, and recover the original AMR format from the unified format.

DM No pre- or post-processing is done for DM other than converting the graphs into the unified format and recovering them from predictions.

UCCA During training, multi-sentence input text and its corresponding DAG are split into single-sentence training pairs based on rules observed in the training data. At test time, we split multi-sentence input text, and join the predicted graphs into one. We also convert the original format into the unified format in preprocessing, and recover the original DAG format in post-processing.

A related approach (2019) converts UCCA graphs to constituency trees, and trains a framework for constituency parsing and remote edge recovery. Hershcovich et al. (2018) explore multi-task learning (MTL) to improve UCCA parsing, using AMR, DM and UD parsing as auxiliaries.
While improvement is achieved on UCCA parsing, their MTL model shows poor results on the auxiliary tasks: 64.7% unlabeled F1 on AMR, 27.2% unlabeled F1 on DM, and 4.9% UAS on UD. In comparison, our transducer improves the state of the art on AMR and shows competitive results on DM, while also showing reasonable results on UCCA. When converting UCCA DAGs to the unified format, we adopt a simple rule (Section 3.1) to add node labels to non-terminals. Table 5 shows that these node labels do improve parsing performance, from 75.7% to 76.6%.

Analysis
Validity Graph-based parsers like Dozat and Manning (2018) make independent decisions on edge types. As a result, the same outgoing edge type can appear multiple times on a node; for instance, a node can have more than one outgoing ARG1 edge. Although F1 scores can still be computed for graphs containing such nodes, these graphs are in fact invalid meaning representations. Our neural transducer incrementally builds meaning representations: at each decoding step, it takes a semantic relation as input and retains memory of the preceding edge types, which implicitly places constraints on edge type prediction. We count the invalid graphs predicted by the two-stage parser and by our neural transducer on the AMR 2.0 test set, and find that our neural transducer reduces the number of invalid graphs by 8%.

Speed Besides the improvement in parsing accuracy, we also significantly speed up parsing. Table 6 compares the parsing speed of our transducer and the two-stage parser on the AMR 2.0 test set, under the same environment setup. Without relying on MST algorithms to produce a valid arborescence, our transducer parses at 1.7x speed.

Conclusion
We cast three broad-coverage semantic parsing tasks into a unified transduction framework, and propose a neural transducer to tackle the problem. Given the input text, the transducer incrementally builds a meaning representation via a sequence of semantic relations. Experiments on the three tasks show that our approach improves the state of the art on both AMR and UCCA, and is competitive with the best parser on SDP. This work can be viewed as a starting point for cross-framework semantic parsing. Moreover, compared with transition-based parsers (e.g., Damonte et al., 2017) and graph-based parsers (e.g., Dozat and Manning, 2018), our transductive framework does not require a pre-trained aligner, and it is capable of building meaning representations that are less anchored to the input text. These advantages make it well suited to semantic parsing in cross-lingual settings (Zhang et al., 2018). In the future, we hope to explore its potential in cross-framework and cross-lingual semantic parsing.