Semantic graph parsing with recurrent neural network DAG grammars

Semantic parses are directed acyclic graphs (DAGs), so semantic parsing should be modeled as graph prediction. But predicting graphs presents difficult technical challenges, so it is simpler and more common to predict the *linearized* graphs found in semantic parsing datasets using well-understood sequence models. The cost of this simplicity is that the predicted strings may not be well-formed graphs. We present recurrent neural network DAG grammars, a graph-aware sequence model that generates only well-formed graphs while sidestepping many difficulties in graph prediction. We test our model on the Parallel Meaning Bank—a multilingual semantic graphbank. Our approach yields competitive results in English and establishes the first results for German, Italian and Dutch.

Explicitly or implicitly, a representation in any of these formalisms can be expressed as a directed acyclic graph (DAG). Consider the sentence "Every ship in the dock needs a big anchor". Its meaning representation, expressed as a Discourse Representation Structure (DRS, Kamp 1981), is shown in Figure 1.
Figure 1: The discourse representation structure for "Every ship in the dock needs a big anchor". For ease of reference in later figures, each box includes a variable corresponding to the box itself, at top right in gray.
Each box has two parts: the top part lists variables for discourse referents (e.g. x1, e1) and the bottom part can contain unary predicates expressing the type of a variable (e.g. ship, need), binary predicates specifying relationships between variables (e.g. PARTOF, TOPIC), logical operators expressing relationships between nested boxes (e.g. ⇒, ¬), or binary discourse relations (e.g., RESULT, CONTRAST). To express a DRS as a graph, we represent each box as a node labeled □; each variable as a node labeled by its associated unary predicate; and each binary predicate, logical operator, or discourse relation as an edge from the first argument to the second (Figure 2). To fully realize the representation as a DAG, additional transformations are sometimes necessary: in DRS, when a box represents a presupposition, as box b4 does, the label of the node corresponding to the presupposed variable is marked (e.g. x2/dock^p); and edges can be reversed (e.g. TOPIC(s1, x3) becomes TOPICOF(x3, s1)).
Since meaning representations are graphs, semantic parsing should be modeled as graph prediction. But how do we predict graphs? A popular approach is to predict the linearized graph, that is, the string representation of the graph found in most semantic graphbanks. Figure 3 illustrates one style of linearization using PENMAN notation, in which graphs are written as well-bracketed strings which can also be interpreted as trees; note the correspondence between the tree-like structure of Figure 2 and the string in Figure 3. Each subtree is a bracketed string starting with a node variable and its label (e.g. b2 / □), followed by a list of relations corresponding to the outgoing edges of the node. A relation consists of the edge label prefixed with a colon (:), followed by either the subtree rooted at the target node (e.g. :DRS (x1 / ship :PARTOF (x2 / dock^p))), or a reference to the target node (e.g. :PIVOT x1). By convention, if a node is the target of multiple edges, then the leftmost one is written as a subtree, and the remainder are written as references. Hence, every node is written as a subtree exactly once.
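To make the linearization convention concrete, the following is a minimal sketch of a reader for PENMAN-style strings. It is an illustration only, not the tooling used in this work: the function name `read_penman`, the tokenizer, and the use of the word `box` as a stand-in for the box label are all assumptions. It recovers node labels and edge triples, treating a bracketed target as a subtree and a bare symbol as a reference to a node written elsewhere.

```python
import re

# Tokens: brackets, the "/" separator, ":EDGE" labels, and bare symbols.
TOKEN = re.compile(r"[()/]|:[^\s()]+|[^\s()/:]+")


def read_penman(text):
    """Return (labels, edges) read from a well-formed PENMAN-style string."""
    tokens = TOKEN.findall(text)
    labels, edges = {}, []
    pos = 0

    def subtree():
        nonlocal pos
        # A subtree opens with "( variable / label".
        assert tokens[pos] == "(" and tokens[pos + 2] == "/"
        var = tokens[pos + 1]
        labels[var] = tokens[pos + 3]
        pos += 4
        while tokens[pos] != ")":
            edge = tokens[pos][1:]       # drop the leading colon
            pos += 1
            if tokens[pos] == "(":       # target written out as a subtree
                target = subtree()
            else:                        # target written as a reference
                target = tokens[pos]
                pos += 1
            edges.append((var, edge, target))
        pos += 1                         # consume the closing bracket
        return var

    subtree()
    return labels, edges


# The example graph, with "box" standing in for the box label (assumption).
GRAPH = (
    "(b1 / box :IMP1 (b2 / box :DRS (x1 / ship :PARTOF (x2 / dock^p))) "
    ":IMP2 (b3 / box :DRS (e1 / need :PIVOT x1 :THEME "
    "(x3 / anchor :TOPICOF (s1 / big)))))"
)
labels, edges = read_penman(GRAPH)
incoming_x1 = [source for source, _, target in edges if target == "x1"]
print(incoming_x1)  # ['b2', 'e1']
```

The node x1 is the target of two edges: the leftmost (from b2) is written as a subtree and the second (from e1) as a reference, which is what makes the string a DAG rather than a tree.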
The advantage of predicting linearized graphs is twofold. The first advantage is that graphbank datasets usually already contain linearizations, which can be used without additional work. These linearizations are provided by annotators or algorithms and are thus likely to be very consistent in ways that are beneficial to a learning algorithm. The second advantage is that we can use simple, well-understood sequence models (Gu et al., 2016; Jia and Liang, 2016; van Noord et al., 2018) to model them. But this simplicity comes with a cost: sequence models can predict strings that don't correspond to graphs, for example, strings with ill-formed bracketings or unbound variable names. While it is often possible to fix these strings with pre- or post-processing, we would prefer to model the problem in a way that does not require this.
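The two failure modes named above can be checked mechanically. The sketch below is a hypothetical post-hoc checker, not part of the model described here; the function name `check_linearization` and the regular expressions are assumptions that rely on the PENMAN conventions of Figure 3 (a variable is declared where it opens a subtree, and a reference is a bare symbol directly after an edge label).

```python
import re


def check_linearization(text):
    """Return a list of problems found in a PENMAN-style string."""
    problems = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing bracket without an opening one")
                depth = 0
    if depth > 0:
        problems.append("%d unclosed bracket(s)" % depth)

    # A variable is declared where it opens a subtree: "(x1 / ship".
    declared = set(re.findall(r"\(\s*([^\s()/:]+)\s*/", text))
    # A reference is a bare symbol directly after an edge label: ":PIVOT x1".
    referenced = re.findall(r":[^\s()]+\s+([^\s()/:]+)", text)
    for var in referenced:
        if var not in declared:
            problems.append("unbound variable %s" % var)
    return problems


good = "(b1 / box :DRS (x1 / ship :PARTOF (x2 / dock)))"
bad = "(b1 / box :DRS (e1 / need :PIVOT x1)"
print(check_linearization(good))  # []
print(check_linearization(bad))   # ['1 unclosed bracket(s)', 'unbound variable x1']
```

A sequence model offers no guarantee that its output passes such a check, which is the motivation for generating strings through a grammar instead.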
Figure 3: The graph of Figure 2 linearized in PENMAN notation: (b1 / □ :IMP1 (b2 / □ :DRS (x1 / ship :PARTOF (x2 / dock^p))) :IMP2 (b3 / □ :DRS (e1 / need :PIVOT x1 :THEME (x3 / anchor :TOPICOF (s1 / big)))))
Models that predict graphs are complex and far less well-understood than models that predict sequences. Fundamentally, this is because predicting graphs is difficult: every graph has many possible linearizations, so from a probabilistic perspective, the linearization is a latent variable that must be marginalized out (Li et al., 2018). Groschwitz et al. (2018) model graphs as trees, interpreted as the (latent) derivation trees of a graph grammar; Lyu and Titov (2018) model graphs with a conditional variant of the classic Erdős and Rényi (1959) model, first predicting an alignment for each node of the output graph, and then predicting, for each pair of nodes, whether there is an edge between them. Buys and Blunsom (2017), Chen et al. (2018), and Damonte et al. (2017) all model graph generation as a sequence of actions, each aligned to a word in the conditioning sentence. Each of these models has a latent variable, a derivation tree or alignment, which must be accounted for via preprocessing or complex inference techniques.
Can we combine the simplicity of sequence prediction with the fidelity of graph prediction? We show that this is possible by developing a new model that predicts sequences through a simple string rewriting process, in which each rewrite corresponds to a well-defined graph fragment. Importantly, any well-formed string produced by our model has exactly one derivation, and thus no latent variables. We evaluate our model on the Parallel Meaning Bank (PMB), a multilingual corpus of sentences paired with DRS representations. Our model performs competitively on English, and better than sequence models in German, Italian, and Dutch.
[...] et al., 2016), a type of context-free graph grammar designed to model linearized DAGs.
Since linearized DAGs are strings, we present it as a string-rewriting system, which can be described more compactly than a graph grammar while making the connection to sequences more explicit. The correspondence between string rewriting and graph grammars is given in the Appendix.
A grammar in our model is defined by a set Σ of terminal symbols consisting of all symbols that can appear in the final string (brackets, variable types, node labels, and edge labels); a set N of n + 2 nonterminal symbols denoted by {L, T0, ..., Tn}, for some maximum value n; an unbounded set V of variable references {$1, $2, ...}; and a set of productions, which are defined below.
We say that T0 is the start symbol, and for each symbol Ti ∈ N, we say that i is its rank; L has rank 0. A nonterminal of rank i can be written as a function of i variable references; for example, we can write T2($1, $2). By convention, we write the rank-0 nonterminals L and T0 without brackets. Productions in our grammar take the form α → β, where α is a function of rank i over $1, ..., $i; and β is a linearized graph in PENMAN format, with each of its subtrees replaced by either a function or a variable reference. Optionally, the variable name in β may replace one of the variable references in α. All variable references in a production must appear at least twice. Hence every variable reference in α must appear at least once in β, and variables that do not appear in α must appear at least twice in β.
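The constraint on variable references can be stated as a small check. The sketch below is illustrative only: the function names and the example productions are hypothetical, written in the notation of this section, and are not the grammar r1 ... r7 itself. It does not cover productions in which the node variable of β replaces a reference in α.

```python
import re
from collections import Counter

REF = re.compile(r"\$\d+")  # variable references such as $1, $2


def reference_counts(lhs, rhs):
    """Count every variable reference across both sides of a production."""
    return Counter(REF.findall(lhs) + REF.findall(rhs))


def is_valid_production(lhs, rhs):
    """Every reference must occur at least twice across the production."""
    return all(n >= 2 for n in reference_counts(lhs, rhs).values())


# Hypothetical productions in the notation of this section.
print(is_valid_production("T0", "(b / L :IMP1 T1($1) :IMP2 T1($1))"))  # True
print(is_valid_production("T1($1)", "(e / L :PIVOT $1 :THEME T0)"))    # True
print(is_valid_production("T0", "(b / L :DRS T1($1))"))                # False
```

The last production is rejected because $1 is introduced in β but used only once, so it could never be shared between two subgraphs.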
To illustrate, we will use the following grammar, which can generate the string in Figure 3, assuming L can also rewrite as any node label.
Our grammar derives strings by first rewriting the start symbol T0, and at each subsequent step rewriting the leftmost function in the partially derived string, with special handling for variable references described below. A derivation is complete when no functions remain.
We illustrate the rewriting process in Figure 4. The start symbol T0 at step 1 is rewritten by production r1 in step 2, and the new b variable introduced at this step is deterministically renamed to the unique name b1. In step 3, the leftmost T1($1) is rewritten by production r5, and the new b variable is likewise renamed to the unique b2. All productions apply in this way, simply replacing a left-hand-side function with a right-hand-side expression. These rewrites are coupled with a mechanism to correctly handle multiple references to shared variables, as illustrated in step 4 when production r7 is applied. In this production, the left-hand-side function applies to the x variable naming the right-hand-side node. When this production applies, x is renamed to the unique x1 as in previous steps, but because it appears in the left-hand side, the reference $1 is bound to this new variable name throughout the partially derived string. In this way, the reference to x1 is passed through the subsequent rewrites, becoming the target of a PIVOT edge at step 10. Derivations of a DAG grammar are context-free (Figure 5).
Our model requires an explicit grammar like the one in r1 ... r7, which we obtain by converting each DAG in the training data into a sequence of productions. The conversion yields a single, unique sequence of productions via a simple linear-time algorithm that recursively decomposes a DAG into subgraphs (Björklund et al., 2016). Each subgraph consists of a single node and its outgoing edges, as exemplified by the PENMAN-formatted right-hand sides of r1 through r7. Each outgoing edge points to a nonterminal symbol representing a subgraph. If a subgraph does not share any nodes with its siblings, it is represented by T0. But if any subgraphs share a node, then a variable reference must refer to this node in the production associated with the lowest common ancestor of all its incoming edges. For example, in Figure 3, the common ancestor of the two edges targeting x1 is the node b1, so production r1 must contain two copies of variable reference $1 to account for this. A more mathematical account can be found along with a proof of correctness in Björklund et al. (2016). Our implementation follows their description unchanged.
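The rewriting process described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the productions below are hypothetical stand-ins modeled on the description of r1, r5 and r7, node labels are folded directly into the productions instead of being rewritten from L, `box` stands in for the box label, and the string-based matching assumes that no label looks like a function call.

```python
import re

CALL = re.compile(r"T\d+(?:\(([^()]*)\))?")   # T0, T1($1), T1(x1), ...
REF = re.compile(r"\$\d+")                     # variable references


def rewrite(string, production, counters):
    """Apply one production to the leftmost function in the string."""
    params, rhs, bound = production
    call = CALL.search(string)
    args = [a.strip() for a in call.group(1).split(",")] if call.group(1) else []
    assert len(args) == len(params), "production rank does not match the call"

    # Rename the new node variable, e.g. "b" becomes the unique "b1".
    kind = rhs[1]
    counters[kind] = counters.get(kind, 0) + 1
    fresh = "%s%d" % (kind, counters[kind])
    body = "(" + fresh + rhs[2:]

    # Pass arguments into the right-hand side; references that are new in
    # this production receive globally fresh names.
    mapping = dict(zip(params, args))
    for ref in REF.findall(rhs):
        if ref not in mapping:
            counters["$"] = counters.get("$", 0) + 1
            mapping[ref] = "$%d" % counters["$"]
    body = REF.sub(lambda m: mapping[m.group()], body)

    result = string[:call.start()] + body + string[call.end():]
    if bound is not None:
        # The left-hand-side reference names the new node: bind it everywhere.
        result = re.sub(re.escape(args[bound]) + r"(?!\d)", fresh, result)
    return result


def derive(productions):
    """Rewrite the start symbol T0 with the given productions, in order."""
    string, counters = "T0", {}
    steps = [string]
    for production in productions:
        string = rewrite(string, production, counters)
        steps.append(string)
    assert CALL.search(string) is None, "derivation is incomplete"
    return steps


# Hypothetical productions: (parameters, right-hand side, index of bound parameter).
TOP = ([], "(b / box :IMP1 T1($1) :IMP2 T1($1))", None)
BOX = (["$1"], "(b / box :DRS T1($1))", None)
SHIP = (["$1"], "(x / ship :PARTOF T0)", 0)   # binds $1 to the new x node
DOCK = ([], "(x / dock^p)", None)
NEED = (["$1"], "(e / need :PIVOT $1 :THEME T0)", None)
ANCHOR = ([], "(x / anchor :TOPICOF T0)", None)
BIG = ([], "(s / big)", None)

steps = derive([TOP, BOX, SHIP, DOCK, BOX, NEED, ANCHOR, BIG])
print(steps[-1])
```

After the third rewrite the second function has become T1(x1): the reference has been bound to the newly named node, and it is later emitted as the target of the PIVOT edge.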
Figure 4: A partial derivation of the string in Figure 3 (columns: Step, Action, Production, Result). The stack operations follow closely each step in the derivation, where GEN-FRAG and GEN-LABEL are invoked when rewriting a nonterminal T and the label nonterminal L respectively. In the result of each step, the leftmost function is underlined, and is rewritten in the fragment in blue in the next step. A REDUCE operation, on the other hand, is invoked when a generated fragment does not contain nonterminals T to expand further (in this partial derivation, this is the case for the result of production r2).

Neural Network Realizer
We model graph parsing with an encoder-decoder architecture that takes as input a sentence w and outputs a directed acyclic graph G derived using the rewriting system of Section 2. Specifically, we model its derivation tree in top-down, left-to-right order as a sequence of actions a = a_1 ... a_|a|, inspired by Recurrent Neural Network Grammars (RNNG, Dyer et al., 2016). As in RNNG, we use a stack to store partial derivations. We model two types of actions: GEN-FRAG rewrites Ti nonterminals, while GEN-LABEL rewrites L nonterminals, always resulting in a leaf of the derivation tree. A third REDUCE action is applied whenever a subtree of the derivation tree is complete, and since the number of subtrees is known in advance, it is applied deterministically. For example, when we predict r1, this determines that we must rewrite an L and then recursively rewrite two copies of T1($1) and then apply REDUCE. Hence graph generation reduces to predicting rewrites only.
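Because each fragment fixes how many nonterminals remain to be rewritten, the REDUCE actions can be inserted without any prediction. The sketch below shows one reading of this bookkeeping; the function name, the tuple format, and the fragment names `frag_a` and `frag_b` are hypothetical, and each GEN-FRAG is assumed to carry the number of its children (one L plus its T nonterminals).

```python
def add_reduce(predictions):
    """Insert deterministic REDUCE actions into a sequence of rewrites.

    Each prediction is (kind, name, n_children); n_children is the number of
    nonterminals in a fragment and is ignored for GEN-LABEL.
    """
    actions, pending = [], []
    for kind, name, n_children in predictions:
        actions.append((kind, name))
        if kind == "GEN-FRAG":
            pending.append(n_children)   # children still to be rewritten
        else:                            # GEN-LABEL yields a leaf
            pending[-1] -= 1
        # Close every fragment whose children are all complete.
        while pending and pending[-1] == 0:
            pending.pop()
            actions.append(("REDUCE", None))
            if pending:
                pending[-1] -= 1
    assert not pending, "derivation is incomplete"
    return actions


predictions = [
    ("GEN-FRAG", "frag_a", 3),   # one L and two T nonterminals
    ("GEN-LABEL", "box", 0),
    ("GEN-FRAG", "frag_b", 1),   # only an L nonterminal
    ("GEN-LABEL", "ship", 0),
    ("GEN-FRAG", "frag_b", 1),
    ("GEN-LABEL", "need", 0),
]
actions = add_reduce(predictions)
print([kind for kind, _ in actions])
```

In this example three REDUCE actions are added: one after each leaf fragment and a final one that closes the root fragment.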
We define the probability of generating graph G conditioned on input sentence w as follows:
Input encoder. We represent the ith word w_i of input sentence w = w_1 ... w_|w| using both learned and pre-trained word embeddings (w_i and w^p_i respectively), a lemma embedding (l_i), a part-of-speech embedding (p_i), a universal semantic tag (Abzianidze and Bos, 2017) embedding (u_i), and a dependency label embedding (d_i). An input x_i is computed as the weighted concatenation of these features followed by a non-linear projection (with vectors and matrices in bold):
Input x_i is then encoded with a bidirectional LSTM, yielding contextual representation h^e_i.
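One plausible reading of the two definitions above, written out as a sketch: the probability of a graph factorizes over the actions of its unique derivation, and the input vector is a projection of the concatenated feature embeddings. The symbols f (an unspecified non-linearity) and W^{(1)} (the projection matrix) are assumed names and are not taken from the text.

```latex
% Assumed factorization over the action sequence a = a_1 ... a_{|a|}
p(G \mid w) = \prod_{t=1}^{|a|} p(a_t \mid a_1, \dots, a_{t-1}, w)

% Assumed form of the input projection; f and W^{(1)} are hypothetical names
\mathbf{x}_i = f\!\left(\mathbf{W}^{(1)}
  \left[\mathbf{w}_i ; \mathbf{w}^{p}_i ; \mathbf{l}_i ; \mathbf{p}_i ; \mathbf{u}_i ; \mathbf{d}_i\right]\right)
```

The factorization carries no sum over derivations because every well-formed string has exactly one derivation.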
Graph decoder. Since we know in advance whether the next action is GEN-FRAG or GEN-LABEL, we use different models for them.
GEN-FRAG. If step t rewrites a T nonterminal, we predict the production y_t that rewrites it using context vector c_t and incoming edge embedding e_t. To obtain c_t we use soft attention (Luong et al., 2015), weighting each input hidden representation h^e_i with respect to the decoding hidden state h^d_t:
The contribution of c and e is weighted by matrices W^(3) and W^(4), respectively. We then update the stackLSTM representation using the embedding of the non-terminal fragment y_t (denoted as y^e_t), as follows:
GEN-LABEL. Labels L can be rewritten to either semantic constants (e.g., 'speaker', 'now', 'hearer') or unary predicates that often correspond to the lemmas of the input words (e.g., 'love'). We predict the former using a model identical to the one for GEN-FRAG. For the latter, we use a selection mechanism to choose an input lemma to copy to output. We model selection following Liu et al. (2018), assigning each input lemma a score o_ji that we then pass through a softmax layer to obtain a distribution:
where h_i is the encoder hidden state for word w_i. We allow the model to learn whether to use soft attention or the selection mechanism through a binary classifier, conditioned on the decoder hidden state at time t, h^d_t. Similar to Equation (4), we update the stackLSTM with the embedding of the predicted terminal.
In the PMB, each terminal is also annotated for sense (e.g. 'n.01', 's.01') and presupposition (e.g. for 'dock^p' in Figure 3). We predict both the sense tag and whether a terminal is presupposed independently, conditioned on the current stackLSTM state and the embedding of the main terminal label; these predictions are not used to update the state of the stackLSTM.
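As an illustration of the soft attention step, the sketch below computes attention weights and a context vector in plain Python. The dot-product score is an assumption: it is one of the scoring functions described by Luong et al. (2015), and the text here does not say which variant is used. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_attention(encoder_states, decoder_state):
    """Return (weights, context) for one decoding step.

    Scores are dot products between each encoder state and the decoder
    state (an assumed scoring function); the context vector is the
    weighted sum of the encoder states.
    """
    weights = softmax([dot(h, decoder_state) for h in encoder_states])
    size = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(size)
    ]
    return weights, context


weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(weights, context)
```

The selection mechanism for copying lemmas has the same shape: a score per input position followed by a softmax, with the highest-scoring lemma copied to the output.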

REDUCE.
When a reduce action is applied, we use an LSTM to compose the fragments on top of the stack. Using the derivation tree in Figure 5 as reference, let [c_1, ..., c_n] denote the embeddings of one or more sister nodes r_i and p_u the embedding of their parent node, which we refer to as children and parent fragments respectively. A reduce operation runs an LSTM over the children fragments and the parent fragment in order and then uses the final state u to update the stackLSTM as follows:
The models are trained to minimize a cross-entropy loss objective J over the sequence of gold actions a_i in the derivation:
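A cross-entropy objective over gold actions is conventionally the negative log-likelihood of the derivation. The following is a sketch of that standard form under the assumption that each action is conditioned on the preceding actions and the input sentence; it is not a reproduction of the original equation.

```latex
% Assumed standard form of the cross-entropy objective over gold actions
J = - \sum_{i=1}^{|a|} \log p(a_i \mid a_1, \dots, a_{i-1}, w)
```

Since REDUCE actions are applied deterministically, only the GEN-FRAG and GEN-LABEL terms would contribute non-trivially to such a sum.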

Experimental Setup
We evaluated our model on the Parallel Meaning Bank (PMB), a semantic bank where sentences in English, Italian, German, and Dutch have been annotated following Discourse Representation Theory (Kamp and Reyle, 2013). Lexical predicates in PMB are in English, even for non-English languages. Since this is not compatible with our copy mechanism, we revert predicates to their original language by substituting them with the lemmas of the tokens they are aligned to. In our experiments we used both PMB v.2.1.0 and v.2.2.0; we included the former release in order to compare against the state-of-the-art seq2seq system of van Noord et al. (2018). Statistics on the data and the grammar extracted from v.2.2.0 are reported in Table 1.

Converting DRSs to Graphs
In this section we discuss how DRSs are converted to acyclic, single-rooted, and fully-instantiated graphs (i.e., how to translate Figure 1 to Figure 2). Boxes and variables are represented as nodes, while conditions, operators, and discourse relations become edge labels between these nodes. We consider main boxes (see b2, b3, and b4 in Figure 1) separately from presuppositional boxes (see b1), which represent instances that are presupposed in the wider discourse context (e.g., definite expressions). Using Figure 2 as an example, b2, b3 and b4 become nodes in the graph and material implication (IMP stands for ⇒) becomes an edge label. If an operator or a relation is binary, as in this case, we number the edge label so as to preserve the order of the operands.
For each node in a main box, we expand the graph by adding all relations and variables belonging to it. We identify the head of the first relation or the first referent mentioned as the head variable. These are 'ship(x1)' for b2, and 'need(e1)' for b3. We attach the head variable as a child of the box node and follow the relations recursively to expand the subgraph. If, while expanding a graph, a variable in a condition is part of a presuppositional box, we introduce it as a new node and add the superscript p to its label. When expanding the DAG along the edge PARTOF, since x2 is also in the presuppositional box in Figure 1, we attach the node 'dock^p'. Graphs extracted this way are mostly acyclic except for adjectival phrases and relative clauses, where state variables can themselves be roots (e.g., big "a.01" s1). We get rid of these extra roots by reversing the direction of the edge involved and adding an '-of' to the edge label to flag this change (see TOPIC-OF).

System Comparison
We compared the performance of our graph parser (seq2graph below) with a sequence-to-sequence model (enhanced with a copy mechanism) which decodes to a string linearization of the graph similar to the one shown in Figure 3 (seq2seq + copy below). We also compare against the recently proposed model of van Noord et al. (2018); they introduce a seq2seq model that generates a DRS as a concatenation of clauses, essentially a flat version of the standard box notation. The decoded string is made aware of the overall graph structure during preprocessing, where variables are replaced by indices indicating when they were first introduced and their recency. In contrast, we model the graph structure explicitly. van Noord et al. (2018) experimented with both word and character-based models, as well as with an ensemble of both, using word embedding features. Since all our models are word-based, we compare our results with their best word model, using word embedding features only (trained using 10-fold cross validation).

Model Configurations
In addition to word embeddings, we also report on experiments which make use of additional features. Specifically, for each word we add information about its universal PoS tag, lemma, universal semantic tag, and dependency label.
In Section 2, we mentioned that given a production α → β, variable references in α should appear at least once in β (i.e., they should have the same rank). In all experiments so far, we did not model this constraint explicitly, in order to investigate whether the model is able by default to predict rank correctly. However, in exploring model configurations we also report on whether adding this constraint leads to better performance.

Cross-lingual Experiments
We conducted two sets of experiments: one monolingual (mono below), where we train and test on the same language, and one cross-lingual (cross below), where we train a model on English and test it on one of the other three languages. The goal of our cross-lingual experiments is to examine whether we need data in a target language at all, since the semantic representation itself is language agnostic and lexical predicates are dealt with via the copy mechanism. Most of the features mentioned above are cross-linguistic and therefore fit both mono- and cross-lingual settings, with the exception of lemma and word embeddings: we exclude the former and replace the latter with multilingual word embeddings.

System settings
For training, we used the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 and a decay rate of 0.1 every 10 epochs. Randomly initialized and pre-trained word embeddings have a dimensionality of 128 and 100 respectively, and all other features a dimensionality of 50. In all cross-lingual experiments, the pre-trained word embeddings have a dimensionality of 300. The LSTMs in the encoder and the decoder have a dimensionality of 150 and non-terminal and terminal embeddings during decoding have a dimensionality of 50. The system is trained for 30 epochs, with the best system chosen based on dev set performance.
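The learning-rate schedule can be written down in one line. The sketch below assumes a staircase reading of "a decay rate of 0.1 every 10 epochs", namely that the rate is multiplied by 0.1 after every 10 completed epochs, with epochs counted from zero; the text does not state this explicitly, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.001, decay=0.1, every=10):
    """Staircase decay: multiply the rate by `decay` every `every` epochs.

    Assumes zero-based epoch indices and a multiplicative decay, which is
    one reading of the schedule described in the text.
    """
    return initial * decay ** (epoch // every)


# Rates over the 30 training epochs under this reading.
for epoch in (0, 9, 10, 20, 29):
    print(epoch, learning_rate(epoch))
```

Under this reading the 30 training epochs use three rates, roughly 1e-3, 1e-4 and 1e-5.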

Evaluation Metric
We evaluated our system by scoring the similarity between predicted and gold graphs. We used Counter (van Noord et al., 2018), an adaptation of Smatch to Discourse Representation Structures where graphs are first transformed into a set of 'source node - edge label - target node' triples and the best mapping between the variables is found through an iterative hill-climbing strategy. Furthermore, Counter checks whether DRSs are well-formed in that all boxes should be connected, acyclic, with fully instantiated variables, and correctly assigned sense tags.
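The triple-matching idea behind this kind of metric can be illustrated with a simplified scorer. The sketch below is not Counter: it takes the variable mapping as given instead of searching for it by hill-climbing, and the function name `triple_f1` and the example triples are hypothetical.

```python
def triple_f1(predicted, gold, mapping):
    """F1 over (source, edge label, target) triples for a fixed mapping.

    `mapping` renames predicted variables to gold variables; a full scorer
    would search for the mapping that maximizes the score.
    """
    renamed = {
        (mapping.get(s, s), label, mapping.get(t, t))
        for s, label, t in predicted
    }
    gold = set(gold)
    matched = len(renamed & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(renamed)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [("b1", "DRS", "x1"), ("x1", "PARTOF", "x2")]
predicted = [("b9", "DRS", "x7"), ("x7", "THEME", "x8")]
mapping = {"b9": "b1", "x7": "x1", "x8": "x2"}
print(triple_f1(predicted, gold, mapping))  # 0.5
```

Here one of two predicted triples matches one of two gold triples, so precision, recall and F1 are all 0.5.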
It is worth mentioning that there can be cases where our parser generates ill-formed graphs according to Counter; this is however not due to the model itself but to the way the graph is converted back into a format accepted by Counter.
All results shown are averages over 5 runs.
For the multilingual word embeddings, we experimented with embeddings obtained with iterative Procrustes (available at https://github.com/facebookresearch/MUSE) and with the 'robust projection' method of Guo et al. (2016), where the embedding of non-English words is computed as the weighted average of English ones. We found the first method to perform better on cross-lingual word similarity tasks and used it in our experiments.

Results
System comparison. Table 2 summarizes our results on the PMB gold data (v.2.1.0, test set). We compare our graph decoder against the system of van Noord et al. (2018) and our implementation of a seq2seq model, enhanced with a copy mechanism. Overall, we see that our graph decoder outperforms both models. Moreover, it reduces the number of ill-formed representations without any specific constraints or post-processing to ensure the well-formedness of the semantics of the output.
The PMB (v.2.1.0) contains a large number of silver standard annotations which have been only partially manually corrected (see Table 1). Following van Noord et al. (2018), we also trained our parser on both silver and gold standard data combined. As shown in Table 3, increasing the training data improves performance but the difference is not as dramatic as in van Noord et al. (2018). We found that this is because our parser requires graphs that are fully instantiated: all unary predicates (e.g. ship(x)) need to be present for the graph to be fully connected, which is often not the case for silver graphs. Our model is at a disadvantage since it could exploit less training data; during grammar extraction we could not process around 20K sentences and in some cases could not reconstruct the whole graph, as shown by the conversion score.
Model configurations. Table 4 reports on various ablation experiments investigating which features and combinations thereof perform best. The experiments were conducted on the development set (PMB v2.2.0). We show a basic version of our seq2graph model with word embeddings, to which we add information about rank (+restrict). We also experimented with the full gamut of additional features (+feats) as well as with ablations of individual feature classes. For comparison, we also show the performance of a graph-to-string model (seq2seq+copy). As can be seen, all linguistic features seem to improve performance. Restricting fragment selection by rank does not seem to improve the overall result, showing that our baseline model is already able to predict fragments with the correct rank throughout the derivation. Subsequent experiments report results with this model using all linguistic features, unless otherwise specified.
Cross-lingual experiments. Results on DRT parsing for languages other than English are reported in Table 5. There is no gold standard training data for non-English languages in the PMB (v.2.2.0). We therefore trained our parser on silver standard data but did use the provided gold standard data for development and testing (see Table 1). We present two versions of our parser, one where we train and test on the same language (s2g mono-silver) and another one where a model is trained on English but tested on the other languages (s2g cross-silver). We also show the results of a sequence-to-sequence model enhanced with a copy mechanism.
In the monolingual setting, our graph parser outperforms the seq2seq baseline by a large margin; we hypothesize this is due to the large percentage of ill-formed semantics, mostly due to training on silver data. The difference in performance between our cross-lingual parser and the monolingual parser for all languages is small, and in Dutch the two parsers perform on par, suggesting that English data and language-independent features can be leveraged to build parsers in other languages when data is scarce or even absent. We also conducted various ablation studies to examine the contribution of individual features to cross-linguistic semantic parsing. Our experiments revealed that universal semantic tags are most useful, while the multilingual word embeddings that we have tested with are not. We refer the interested reader to the supplementary material for more detail on these experiments.

Error Analysis
We further analyzed the output of our parser to gain insight into which parts of the meaning representation are still challenging. Table 6 shows a more detailed breakdown of system output as computed by Counter, where operators (e.g., negation, implication), roles (i.e., binary relations, such as 'Theme'), concepts (i.e., unary predicates like 'ship'), and synsets (i.e., sense tags like 'n.01') are scored separately. Synsets are further broken down into 'Nouns', 'Verbs', 'Adverbs', and 'Adjectives'. We compare our best seq2graph models (+feats) trained on the PMB v.2.2.0, gold and gold+silver data respectively. Adding silver data helps with semantic elements (operators, roles and concepts), but not with sense prediction, where the only category that benefits from additional data is nouns. We also found that ignoring the prediction of sense tags altogether helps with the performance of both models.
Table 6: F1-scores of fine-grained evaluation on the PMB (v.2.2.0) development set; the seq2graph models trained on gold (left) and with gold and silver data combined (right) are compared.

Conclusions
In this paper we have introduced a novel graph parser that can leverage the power and flexibility of sequential neural models while still operating on graph structures. Heavy preprocessing tailored to a specific formalism is replaced by a flexible grammar extraction method that relies solely on the graph while yielding performance that is on par with or better than string-based approaches. Future work should focus on extending our parser to other formalisms (AMR, MRS, etc.). We also plan to explore modelling alternatives, such as taking different graph generation orders into account (bottom-up vs. top-down) as well as predicting the components of a fragment (type, number of edges, edge labels) separately.