Discourse Representation Structure Parsing

We introduce an open-domain neural semantic parser which generates formal meaning representations in the style of Discourse Representation Theory (DRT; Kamp and Reyle 1993). We propose a method which transforms Discourse Representation Structures (DRSs) to trees and develop a structure-aware model which decomposes the decoding process into three stages: basic DRS structure prediction, condition prediction (i.e., predicates and relations), and referent prediction (i.e., variables). Experimental results on the Groningen Meaning Bank (GMB) show that our model outperforms competitive baselines by a wide margin.


Introduction
Semantic parsing is the task of mapping natural language to machine interpretable meaning representations. A variety of meaning representations have been adopted over the years ranging from functional query language (FunQL; Kate et al. 2005) to dependency-based compositional semantics (λ-DCS; Liang et al. 2011), lambda calculus (Zettlemoyer and Collins, 2005), abstract meaning representations (Banarescu et al., 2013), and minimal recursion semantics (Copestake et al., 2005).
Existing semantic parsers are for the most part data-driven, trained on annotated examples consisting of utterances and their meaning representations (Zelle and Mooney, 1996; Wong and Mooney, 2006; Zettlemoyer and Collins, 2005). The successful application of encoder-decoder models (Sutskever et al., 2014; Bahdanau et al., 2015) to a variety of NLP tasks has provided strong impetus to treat semantic parsing as a sequence transduction problem where an utterance is mapped to a target meaning representation in string format (Dong and Lapata, 2016; Jia and Liang, 2016; Kočiskỳ et al., 2016). The fact that meaning representations do not naturally conform to a linear ordering has also prompted efforts to develop recurrent neural network architectures tailored to tree- or graph-structured decoding (Dong and Lapata, 2016; Cheng et al., 2017; Yin and Neubig, 2017; Alvarez-Melis and Jaakkola, 2017; Rabinovich et al., 2017; Buys and Blunsom, 2017). Most previous work focuses on building semantic parsers for question answering tasks, such as querying a database to retrieve an answer (Zelle and Mooney, 1996; Cheng et al., 2017), or conversing with a flight booking system (Dahl et al., 1994). As a result, parsers trained on query-based datasets work on restricted domains (e.g., restaurants, meetings; Wang et al. 2015), with limited vocabularies, limited compositionality, and a small range of syntactic and semantic constructions. In this work, we focus on open-domain semantic parsing and develop a general-purpose system which generates formal meaning representations in the style of Discourse Representation Theory (DRT; Kamp and Reyle 1993).
DRT is a popular theory of meaning representation designed to account for a variety of linguistic phenomena, including the interpretation of pronouns and temporal expressions within and across sentences. Advantageously, it supports meaning representations for entire texts rather than isolated sentences, which in turn can be translated into first-order logic. The Groningen Meaning Bank (GMB) provides a large collection of English texts annotated with Discourse Representation Structures (see Figure 1 for an example). GMB integrates various levels of semantic annotation (e.g., anaphora, named entities, thematic roles, rhetorical relations) into a unified formalism, providing expressive meaning representations for open-domain texts.
We propose a model in which tree-structured logical forms are generated and show that a structure-aware decoder is paramount to open-domain semantic parsing. Our proposed model decomposes the decoding process into three stages. The first stage predicts the structure of the meaning representation, omitting details such as predicates or variable names. The second stage fills in missing predicates and relations (e.g., thing, Agent), conditioning on the natural language input and the previously predicted structure. Finally, the third stage predicts variable names based on the input and the information generated so far. Decomposing decoding into these three steps reduces the complexity of generating logical forms, since the model does not have to predict deeply nested structures, their variables, and predicates all at once. Moreover, the model is able to take advantage of the GMB annotations more efficiently: for instance, examples with similar structures can be effectively used in the first stage despite being very different in their lexical make-up. Finally, a piecemeal mode of generation yields more accurate predictions; since the output of every decoding step serves as input to the next one, the model is able to refine its predictions taking progressively more global context into account. Experimental results on the GMB show that our three-stage decoder outperforms a vanilla encoder-decoder model, and a related variant which takes shallow structure into account, by a wide margin.
Our contributions in this work are three-fold: an open-domain semantic parser which yields discourse representation structures; a novel end-to-end neural model equipped with a structured decoder which decomposes the parsing process into three stages; and a DRS-to-tree conversion method which transforms DRSs to tree-based representations, allowing for the application of structured decoders as well as sequential modeling. We release our code 1 and the tree-formatted version of the GMB in the hope of driving further research in open-domain semantic parsing.

Discourse Representation Theory
In this section we provide a brief overview of the representational semantic formalism used in the GMB. We refer the reader to Kamp and Reyle (1993) for more details.
Discourse Representation Theory (DRT; Kamp and Reyle 1993) is a general framework for representing the meaning of sentences and discourse which can handle multiple linguistic phenomena including anaphora, presuppositions, and temporal expressions. The basic meaning-carrying units in DRT are Discourse Representation Structures (DRSs), which are recursive formal meaning structures that have a model-theoretic interpretation and can be translated into first-order logic (Kamp and Reyle, 1993). Basic DRSs consist of discourse referents (e.g., x, y) representing entities in the discourse and discourse conditions (e.g., man(x), magazine(y)) representing information about discourse referents. Following conventions in the DRT literature, we visualize DRSs in a box-like format (see Figure 1).
GMB adopts a variant of DRT that uses a neo-Davidsonian analysis of events (Kipper et al., 2008), i.e., events are first-order entities characterized by one-place predicate symbols (e.g., say(e1) in Figure 1). In addition, it follows Projective Discourse Representation Theory (PDRT; Venhuizen et al. 2013), an extension of DRT specifically developed to account for the interpretation of presuppositions and related projection phenomena (e.g., conventional implicatures). In PDRT, each basic DRS introduces a label, which can be bound by a pointer indicating the interpretation site of semantic content. To account for the rhetorical structure of texts, GMB adopts Segmented Discourse Representation Theory (SDRT; Asher and Lascarides 2003). In SDRT, discourse segments are linked with rhetorical relations reflecting different characteristics of textual coherence, such as temporal order and communicative intentions (see continuation(k1, k2) in Figure 1).
More formally, DRSs are expressions of type exp_e (denoting individuals or discourse referents) and exp_t (denoting truth values):

exp_e ::= ref
exp_t ::= drs | sdrs

Discourse referents ref are in turn classified into six categories, namely common referents (x_n), event referents (e_n), state referents (s_n), segment referents (k_n), proposition referents (π_n), and time referents (t_n). drs and sdrs denote basic and segmented DRSs, respectively:

sdrs ::= k1: exp_t, k2: exp_t, coo(k1, k2) | k1: exp_t, k2: exp_t, sub(k1, k2)

Basic DRSs consist of a set of referents (ref) and conditions (condition), whereas segmented DRSs are recursive structures that combine two exp_t by means of coordinating (coo) or subordinating (sub) relations. DRS conditions can be basic or complex. Basic conditions express properties of discourse referents or relations between them:

basic ::= sym1(exp_e) | sym2(exp_e, exp_e) | exp_e = exp_e | |exp_e| = num | timex(exp_e, sym0) | named(exp_e, sym0, class)

where sym_n denotes n-place predicates, num denotes cardinal numbers, timex expresses temporal information (e.g., timex(x7, 2005) denotes the year 2005), and class refers to named entity classes (e.g., location). Complex conditions are unary or binary. Unary conditions have one DRS as argument and represent negation (¬) and the modal operators expressing necessity (□) and possibility (◇); ref: exp_t represents verbs with propositional content (e.g., factive verbs). Binary conditions are conditional statements (→) and questions.
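The recursive typing above can be made concrete with a small data-structure sketch. This is our own illustration, not the paper's implementation: the class and field names (Ref, Condition, DRS, SDRS) are hypothetical, and only basic predicate conditions are modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Ref:
    """A discourse referent: kind is one of x (common), e (event),
    s (state), k (segment), p (proposition, i.e., pi), t (time)."""
    kind: str
    index: int

    def __str__(self):
        return f"{self.kind}{self.index}"

@dataclass
class Condition:
    """A basic condition: an n-place predicate applied to arguments,
    e.g., say(e1) or Agent(e1, x1); complex conditions would take a
    DRS/SDRS as argument (e.g., negation)."""
    symbol: str
    args: List[Union[Ref, "DRS", "SDRS"]]

@dataclass
class DRS:
    """A basic DRS: a set of referents plus conditions over them."""
    refs: List[Ref]
    conditions: List[Condition]

@dataclass
class SDRS:
    """A segmented DRS: labelled segments k1: exp_t, k2: exp_t linked
    by a coordinating or subordinating relation, e.g., continuation(k1, k2)."""
    segments: List[Tuple[Ref, Union[DRS, "SDRS"]]]
    relation: Condition

# A toy DRS for "a man said ...": refs {x1, e1},
# conditions man(x1), say(e1), Agent(e1, x1)
x1, e1 = Ref("x", 1), Ref("e", 1)
drs = DRS([x1, e1], [Condition("man", [x1]),
                     Condition("say", [e1]),
                     Condition("Agent", [e1, x1])])
```

The recursion bottoms out at referents, mirroring exp_e ::= ref, while DRS and SDRS jointly realize exp_t.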

The Groningen Meaning Bank Corpus
Corpus Creation DRSs in GMB were obtained from Boxer (Bos, 2008, 2015) and then refined by expert linguists and crowdsourcing methods. Boxer constructs DRSs based on a pipeline of tools involving POS-tagging, named entity recognition, and parsing. Specifically, it relies on the syntactic analysis of the C&C parser (Clark and Curran, 2007), a general-purpose parser using the framework of Combinatory Categorial Grammar (CCG; Steedman 2001). DRSs are obtained from CCG parses, with semantic composition being guided by the CCG syntactic derivation. Documents in the GMB were collected from a variety of sources including Voice of America (a newspaper published by the US Federal Government), the Open American National Corpus, Aesop's fables, humorous stories and jokes, and country descriptions from the CIA World Factbook. The dataset consists of 10,000 documents, each annotated with a DRS. Various statistics on the GMB are shown in Table 1. Its creators recommend sections 20-99 for training, 10-19 for tuning, and 00-09 for testing.
DRS-to-Tree Conversion As mentioned earlier, DRSs in the GMB are displayed in a box-like format which is intuitive and easy to read but not particularly amenable to structure modeling. In this section we discuss how DRSs were post-processed and simplified into a tree-based format, which served as input to our models.
The GMB provides DRS annotations per document. Our initial efforts have focused on sentence-level DRS parsing, which is undoubtedly a necessary first step towards more global semantic representations. It is relatively straightforward to obtain sentence-level DRSs from document-level annotations, since referents and conditions are indexed to tokens. We match each sentence in a document with the DRS whose content bears the same indices as the tokens occurring in the sentence. This matching process yields 52,268 sentences for training (sections 20-99), 5,172 sentences for development (sections 10-19), and 5,440 sentences for testing (sections 00-09).
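The index-based matching step can be sketched as follows. This is a hypothetical simplification of the actual alignment procedure: we assume each sentence is given as a set of document-level token indices and each condition carries the indices of the tokens it was derived from.

```python
def match_conditions_to_sentences(sentences, conditions):
    """Assign each document-level DRS condition to the sentence whose
    token span covers the condition's token indices.

    sentences:  dict mapping sentence id -> set of token indices
    conditions: list of (condition_string, token_index_list) pairs
    Returns a dict: sentence id -> list of conditions.
    """
    per_sentence = {sid: [] for sid in sentences}
    for cond, indices in conditions:
        for sid, span in sentences.items():
            # a condition belongs to a sentence if all its token
            # indices fall inside that sentence's span
            if indices and set(indices) <= span:
                per_sentence[sid].append(cond)
                break
    return per_sentence

# Toy document: sentence 0 covers tokens 0-2, sentence 1 covers tokens 3-4
sents = {0: {0, 1, 2}, 1: {3, 4}}
conds = [("man(x1)", [1]), ("say(e1)", [3, 4])]
aligned = match_conditions_to_sentences(sents, conds)
```

Conditions whose indices straddle a sentence boundary (e.g., cross-sentential anaphora) would fail the subset test and remain unassigned, which is consistent with restricting the task to sentence-level DRSs.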
In order to simplify the representation, we omit referents in the top part of the DRS (e.g., x1, e1, and π1 in Figure 1) but preserve them in conditions, without any information loss. We also ignore pointers to DRSs, since this information is implicitly captured through the typing and co-indexing of referents. Definition (1) is thus simplified to drs ::= DRS( condition ... condition ), where DRS() denotes a basic DRS. We also modify discourse referents to SDRSs (e.g., k1, k2 in Figure 1), which we regard as elements bearing scope over expressions exp_t, and add a 2-place predicate sym2 to describe the discourse relation between them; definition (3) accordingly becomes sdrs ::= SDRS( ref ... ref exp_t ... exp_t sym2(ref, ref) ), where SDRS() denotes a segmented DRS and the ref are segment referents. We treat cardinal numbers num and sym0 in relation timex as constants. We introduce the binary predicate "card" to represent cardinality (e.g., |x8| = 2 becomes card(x8, NUM)). We also simplify exp_e = exp_e to eq(exp_e, exp_e) using the binary relation "eq" (e.g., x1 = x2 becomes eq(x1, x2)). Moreover, we ignore class in named and transform named(exp_e, sym0, class) into sym1(exp_e) (e.g., named(x2, mongolia, geo) becomes mongolia(x2)). Basic conditions (see definition (5)) are simplified accordingly, and, analogously, unary and binary conditions (definition (6)) are treated as scoped functions applied to their DRS arguments. Following the transformations described above, the DRS in Figure 1 is converted into the tree in Figure 2.

Semantic Parsing Models
We present below three encoder-decoder models which are increasingly aware of the structure of the DRT meaning representations. The models take as input a natural language sentence X represented as w_1, w_2, ..., w_n, and generate a sequence Y = (y_1, y_2, ..., y_m), which is a linearized tree (see Figure 2, bottom), where n is the length of the sentence and m the length of the generated DRS sequence. We aim to estimate p(Y|X), the conditional probability of the semantic parse tree Y given natural language input X, which factorizes over decoding steps as p(Y|X) = ∏_{j=1}^{m} p(y_j | y_{<j}, X).
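The chain-rule factorization means the score of a whole parse is just the sum of per-step log-probabilities. A minimal sketch, with hypothetical toy distributions standing in for the decoder's softmax outputs:

```python
import math

def sequence_log_prob(step_distributions, target_tokens):
    """log p(Y|X) = sum_j log p(y_j | y_<j, X).

    step_distributions: one dict {token: prob} per decoding step,
    each conditioned (implicitly) on the input and previous outputs.
    target_tokens: the gold output sequence.
    """
    assert len(step_distributions) == len(target_tokens)
    return sum(math.log(dist[y])
               for dist, y in zip(step_distributions, target_tokens))

# Two toy steps: first choose the root, then one condition symbol.
dists = [{"DRS(": 0.7, "SDRS(": 0.3},
         {"say": 0.5, "man": 0.5}]
lp = sequence_log_prob(dists, ["DRS(", "say"])
# exp(lp) recovers the product 0.7 * 0.5 = 0.35
```

Training minimizes the negative of exactly this quantity over the corpus, which is why the models below only ever need to define the per-step distribution.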

Encoder
An encoder is used to map the natural language input X into vector representations. Each token in a sentence is represented by a vector x_i which is the concatenation of randomly initialized embeddings e_{w_i}, pre-trained word embeddings ē_{w_i}, and lemma embeddings e_{l_i}:

x_i = W_1 [e_{w_i}; ē_{w_i}; e_{l_i}] + b_1

where W_1 ∈ R^D, D being shorthand for (d_w + d_p + d_l) × d_input (subscripts w, p, and l denote the dimensions of word embeddings, pre-trained embeddings, and lemma embeddings, respectively); b_1 ∈ R^{d_input}, and the symbol ; denotes concatenation. Embeddings e_{w_i} and e_{l_i} are randomly initialized and tuned during training, while ē_{w_i} are fixed. We use a bidirectional recurrent neural network with long short-term memory units (bi-LSTM; Hochreiter and Schmidhuber 1997) to encode natural language sentences:

[h^e_1 : h^e_n] = bi-LSTM(x_1 : x_n)

where h^e_i denotes the hidden representation of the encoder, and x_i refers to the input representation of the ith token in the sentence. Table 2 summarizes the notation used throughout this paper.
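The input projection x_i = W_1 [e_{w_i}; ē_{w_i}; e_{l_i}] + b_1 can be sketched with numpy. The embedding tables, vocabulary sizes, and random initialization below are hypothetical; the dimensions follow the experimental setup (d_w = 64, d_l = 32, d_p = 100).

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_p, d_l, d_input = 64, 100, 32, 128

# Hypothetical embedding tables: e_w and e_l are tuned during
# training, the pre-trained table E_p is kept fixed.
E_w = rng.normal(size=(1000, d_w))
E_p = rng.normal(size=(1000, d_p))
E_l = rng.normal(size=(500, d_l))

W1 = rng.normal(size=(d_w + d_p + d_l, d_input)) * 0.01
b1 = np.zeros(d_input)

def token_representation(word_id, lemma_id):
    # x_i = W_1 [e_{w_i}; ē_{w_i}; e_{l_i}] + b_1, ";" = concatenation
    concat = np.concatenate([E_w[word_id], E_p[word_id], E_l[lemma_id]])
    return concat @ W1 + b1

x = token_representation(42, 7)
```

The resulting x vectors are what the bi-LSTM consumes; the recurrent part itself is omitted here since it adds nothing to the projection being illustrated.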

Sequence Decoder
We employ a sequential decoder (Bahdanau et al., 2015) as our baseline model, with the architecture shown in Figure 3(a). Our decoder is a (forward) LSTM which is conditionally initialized with the hidden state of the encoder, i.e., we set h^d_0 = h^e_n and c^d_0 = c^e_n, where c is a memory cell:

h^d_j = f_LSTM(e_{y_{j-1}}, h^d_{j-1})

where h^d_j denotes the hidden representation of y_j, e_{y_j} are randomly initialized embeddings tuned during training, and y_0 denotes the start of the sequence.
The decoder uses the contextual representation of the encoder together with the embedding of the previously predicted token to output the next token from the vocabulary V:

s_j = W_2 [h^{ct}_j ; e_{y_{j-1}}] + b_2

where W_2 ∈ R^{(d_enc + d_y) × |V|}, b_2 ∈ R^{|V|}, d_enc and d_y are the dimensions of the encoder hidden unit and output representation, respectively, and h^{ct}_j is obtained using an attention mechanism:

h^{ct}_j = Σ_i β_{ji} h^e_i

where the weight β_{ji} is computed by:

β_{ji} = exp(f(h^d_j, h^e_i)) / Σ_{i'} exp(f(h^d_j, h^e_{i'}))

and f is the dot-product function. We obtain the probability distribution over output tokens as:

p(y_j | Y_{<j}, X) = softmax(s_j)

Table 2: Notation used throughout this paper.
X; Y              input word sequence; output sequence
w_i; y_i          the ith word; output
X_i^j; Y_i^j      word; output sequence from position i to j
e_{w_i}; e_{y_i}  random embedding of word w_i; of output y_i
ē_{w_i}           fixed pre-trained embedding of word w_i
e_{l_i}           random embedding of lemma l_i
d_w               dimension of random word embedding
d_p               dimension of pre-trained word embedding
d_l               dimension of random lemma embedding
d_input           input dimension of encoder
d_enc; d_dec      hidden dimension of encoder; decoder
W_i               matrix of model parameters
b_i               vector of model parameters
x_i               representation of ith token
h^e_i; c^e_i      hidden representation; memory cell of ith token in encoder
h^d_i; c^d_i      hidden representation; memory cell of ith token in decoder
s_j               score vector of jth output in decoder
h^{ct}_j          context representation of jth output
β_{ji}            alignment from jth output to ith token
o_{ji}            copy score of jth output from ith token
ˆ (hat)           indicates tree structure (Ŷ, ŷ_i, ŝ_j)
¯ (bar)           indicates DRS conditions (Ȳ, ȳ_i, s̄_j)
˙ (dot)           indicates referents (Ẏ, ẏ_i, ṡ_j)
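The attention step above, with dot-product scoring f, reduces to a softmax over encoder states followed by a weighted sum. A minimal numpy sketch (toy 2-dimensional states, not the model's actual dimensions):

```python
import numpy as np

def softmax(v):
    # numerically stable softmax
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_dec, H_enc):
    """Dot-product attention:
    beta_ji = softmax_i( h^d_j . h^e_i ),  h^ct_j = sum_i beta_ji h^e_i.

    h_dec: decoder state h^d_j, shape (d,)
    H_enc: encoder states stacked row-wise, shape (n, d)
    """
    scores = H_enc @ h_dec          # f(h^d_j, h^e_i) = dot product
    beta = softmax(scores)
    return beta, beta @ H_enc       # alignment weights, context h^ct_j

# Three toy encoder states; the decoder state aligns with states 1 and 3.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta, ctx = attention_context(np.array([10.0, 0.0]), H)
```

The context vector h^ct_j is then concatenated with the previous output embedding and projected to vocabulary scores s_j.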

Shallow Structure Decoder
The baseline decoder treats all conditions in a DRS uniformly and has no means of distinguishing between conditions corresponding to tokens in a sentence (e.g., the predicate say(e1) refers to the verb said) and semantic relations (e.g., Cause(e1, x1)). Our second decoder takes this into account by distinguishing conditions which are local and correspond to words in a sentence from conditions which are more global and express semantic content (see Figure 3(b)). Specifically, we model sentence-specific conditions with a copying mechanism, and all other conditions G which do not correspond to sentential tokens (e.g., thematic roles, rhetorical relations) with an insertion mechanism.
Each token in a sentence is assigned a copying score o_{ji}:

o_{ji} = h^d_j W_3 h^e_i

where subscript ji denotes the ith token at the jth time step, and W_3 ∈ R^{d_dec × d_enc}. All other conditions G are assigned an insertion score:

s_j = W_4 [h^{ct}_j ; e_{y_{j-1}}] + b_4

where W_4 ∈ R^{(d_enc + d_y) × |G|}, b_4 ∈ R^{|G|}, and h^{ct}_j is computed as in the baseline decoder. We obtain the probability distribution over output tokens with a softmax over the concatenated copy and insertion scores:

p(y_j | Y_{<j}, X) = softmax([o_j ; s_j])
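Normalizing copy and insertion scores jointly means the model chooses, in a single softmax, between pointing at an input token and emitting a global symbol. A sketch of that joint normalization (the score values are toy numbers):

```python
import numpy as np

def copy_insert_distribution(o_scores, s_scores):
    """Joint softmax over copy scores (one per input token) and
    insertion scores (one per symbol in the global vocabulary G).

    Returns (copy_probs, insert_probs); together they sum to 1.
    """
    all_scores = np.concatenate([o_scores, s_scores])
    e = np.exp(all_scores - all_scores.max())
    p = e / e.sum()
    return p[:len(o_scores)], p[len(o_scores):]

# Two input tokens vs. one global symbol: copying token 1 dominates.
p_copy, p_insert = copy_insert_distribution(np.array([2.0, 0.0]),
                                            np.array([1.0]))
```

Because both mechanisms compete in one distribution, no separate gating variable is needed to decide between copying and inserting.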
Our deep structure decoder decomposes the prediction of a DRS into three stages.2 Ŷ_1^j denotes the structure predicted before conditions ȳ_j; Ŷ_1^j and Ȳ_1^j are the structures and conditions predicted before referents ẏ_j. We next discuss how each decoder is modeled.

2 Each condition has one and only one right bracket.

Structure Prediction
To model basic DRS structure we apply the shallow decoder discussed in Section 4.3, shown in Figure 3(c.1). Tokens in such structures correspond to parent nodes in a tree; in other words, they are all inserted from G, and subsequently predicted tokens are only scored with the insertion score, i.e., ŝ_j = s_j. The hidden units of the structure decoder are:

ĥ^d_j = f_LSTM(e_{ŷ_{j-1}}, ĥ^d_{j-1})

and the probability distribution over structure-denoting tokens is:

p(ŷ_j | Ŷ_{<j}, X) = softmax(ŝ_j)

Condition Prediction
DRS conditions are generated by taking previously predicted structures into account; e.g., when "DRS(" or "SDRS(" is predicted, its conditions are generated next. By mapping j to (k, m_k), the sequence of conditions can be rewritten as ȳ_1, ..., ȳ_j, ..., ȳ_r = ȳ_(1,1), ȳ_(1,2), ..., ȳ_(k,m_k), ..., where ȳ_(k,m_k) is the m_k-th condition of structure token ŷ_k. The corresponding hidden units ĥ^d_k act as conditional input to the decoder: structure-denoting tokens (e.g., "DRS(" or "SDRS(") are fed into the decoder one by one to generate the corresponding conditions, via

e_{ȳ_(k,0)} = ĥ^d_k W_5 + b_5

where W_5 ∈ R^{d_dec × d_y} and b_5 ∈ R^{d_y}. The hidden unit of the condition decoder is computed as:

h̄^d_j = f_LSTM(e_{ȳ_{j-1}}, h̄^d_{j-1})

Given hidden unit h̄^d_j, we obtain the copy score ō_j and insertion score s̄_j. The probability distribution over conditions is:

p(ȳ_j | Ŷ, Ȳ_{<j}, X) = softmax([ō_j ; s̄_j])

Referent Prediction
Referents are generated based on the structure and conditions of the DRS; each condition has at least one referent. As with condition prediction, the sequence of referents can be rewritten as ẏ_1, ..., ẏ_j, ..., ẏ_v = ẏ_(1,1), ẏ_(1,2), ..., ẏ_(k,m_k), .... The hidden units of the condition decoder are fed into the referent decoder:

e_{ẏ_(k,0)} = h̄^d_k W_6 + b_6

where W_6 ∈ R^{d_dec × d_y} and b_6 ∈ R^{d_y}. The hidden unit of the referent decoder is computed as:

ḣ^d_j = f_LSTM(e_{ẏ_{j-1}}, ḣ^d_{j-1})

All referents are inserted from G, so given hidden unit ḣ^d_j we only obtain the insertion score ṡ_j. The probability distribution over referents is:

p(ẏ_j | Ŷ, Ȳ, Ẏ_{<j}, X) = softmax(ṡ_j)

Note that a single LSTM is adopted for structure, condition, and referent prediction.
The mathematical symbols are summarized in Table 2.
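The re-indexing from a flat condition position j to a pair (k, m_k) — "the m_k-th condition of the k-th structure token" — is easy to get wrong off by one, so here is a small sketch of it. The function name and input encoding are ours: we assume we know how many conditions each structure token governs.

```python
def flat_to_nested(structure_tokens, conditions_per_token):
    """Enumerate the mapping j -> (k, m_k): position j in the flat
    condition sequence corresponds to the m_k-th condition of the
    k-th structure token (both 1-indexed, as in the text)."""
    mapping = []
    for k, n_conditions in enumerate(conditions_per_token, start=1):
        for m in range(1, n_conditions + 1):
            mapping.append((k, m))
    return mapping

# Two structure tokens: a "DRS(" with 3 conditions, an "SDRS(" with 1.
index_map = flat_to_nested(["DRS(", "SDRS("], [3, 1])
```

For this toy tree the flat positions j = 1..4 map to (1,1), (1,2), (1,3), (2,1), which is exactly the grouping the condition and referent decoders rely on when they condition on ĥ^d_k.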

Training
The models are trained to minimize a cross-entropy loss objective with ℓ2 regularization:

L(θ) = − Σ_{(X,Y)} log p(Y|X) + λ ||θ||²_2

where θ is the set of parameters and λ is a regularization hyper-parameter (λ = 10^-6). We used stochastic gradient descent with Adam (Kingma and Ba, 2014) to adjust the learning rate.
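The objective is the negative log-likelihood from Section 4 plus a squared ℓ2 penalty on all parameters. A minimal sketch, assuming the per-example negative log-likelihoods have already been computed:

```python
import numpy as np

def training_loss(neg_log_likelihoods, params, lam=1e-6):
    """L(theta) = sum of -log p(Y|X) over examples + lam * ||theta||_2^2.

    neg_log_likelihoods: array of -log p(Y|X), one per training pair
    params: list of parameter arrays (the W_i and b_i)
    lam: regularization hyper-parameter (1e-6 in the experiments)
    """
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return float(np.sum(neg_log_likelihoods)) + lam * l2

# Toy check with lam = 0.5 so the penalty is visible:
# nll sum = 2.0, ||theta||^2 = 4 (four ones), loss = 2.0 + 0.5 * 4 = 4.0
L = training_loss(np.array([1.2, 0.8]), [np.ones((2, 2))], lam=0.5)
```

In practice λ = 10^-6 makes the penalty a small nudge rather than a dominant term; Adam then adapts the step size per parameter.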

Experimental Setup
Settings Our experiments were carried out on the GMB, following the tree conversion process discussed in Section 3. We adopted the recommended training, development, and testing partitions. We compared the three models introduced in Section 4, namely the baseline sequence decoder, the shallow structure decoder, and the deep structure decoder. We used the same empirical hyper-parameters for all three models. The dimensions of word and lemma embeddings were 64 and 32, respectively. The dimensions of hidden vectors were 256 for the encoder and 128 for the decoder. The encoder used two hidden layers, whereas the decoder used one. The dropout rate was 0.1. Pre-trained word embeddings (100 dimensions) were generated with Word2Vec trained on the AFP portion of the English Gigaword corpus.3

Evaluation Due to the complex nature of our structured prediction task, we cannot expect model output to exactly match the gold standard. For instance, the numbering of the referents may differ but nevertheless be valid, and the order of the children of a tree node may vary (e.g., "DRS(india(x1) say(e1))" and "DRS(say(e1) india(x1))" are the same). We thus use F1 instead of exact-match accuracy. Specifically, we report D-match,4 a metric designed to evaluate scoped meaning representations, released as part of the distribution of the Parallel Meaning Bank corpus (Abzianidze et al., 2017). D-match is based on Smatch,5 a metric used to evaluate AMR graphs; it calculates F1 on discourse representation graphs (DRGs), i.e., triples of nodes, arcs, and their referents, applying multiple restarts to obtain a good referent (node) mapping between graphs. We converted DRSs (predicted and gold-standard) into DRGs following the top-down procedure described in Algorithm 1.6 ISCONDITION returns true if the child is a condition (e.g., india(x1)), in which case three arcs are created: one connected to the parent node and the other two connected to arg1 and arg2, respectively (lines 7-12).
ISQUANTIFIER returns true if the child is a quantifier (e.g., π1, ¬, and □), and three arcs are created; one is connected to the parent node, one to the referent that is created if and only
When comparing two DRGs, we calculate F1 over their arcs. For example, consider the two DRGs (a) and (b) shown in Figure 4, and let {b0:b0, x1:x2, x2:x3, c0:c0, c1:c2, c2:c3} denote the node alignment between them. The number of matching arcs is eight, the number of arcs in the gold DRG is nine, and the number of arcs in the predicted DRG is 12. So recall is 8/9, precision is 8/12, and F1 is 76.19.

Table 3 compares our three models on the development set. As can be seen, the shallow structure decoder performs better than the baseline decoder, and the proposed deep structure decoder outperforms both. Ablation experiments show that without pre-trained word embeddings or lemma embeddings the model generally performs worse; pre-trained word embeddings contribute more than lemma embeddings.

Table 4 shows our results on the test set. To assess the degree to which the various decoders contribute to DRS parsing, we report results when predicting the full DRS structure (second block), when ignoring referents (third block), and when ignoring both referents and conditions (fourth block). Overall, we observe that the shallow structure model improves precision over the baseline with a slight loss in recall, while the deep structure model performs best by a large margin. When referents are not taken into account (compare the second and third blocks in Table 4), performance improves across the board. When conditions are additionally omitted, we observe further performance gains. This is hardly surprising, since errors propagate from one stage to the next when predicting full DRS structures. Further analysis revealed that the parser performs slightly better on (copy) conditions which correspond to natural language tokens than on (insert) conditions (e.g., Topic, Agent) which are generated from global semantic content (83.22 vs 80.63 F1).
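The arc-level score used in the worked example above is ordinary precision/recall/F1 computed after node alignment. A minimal sketch of that final step (the alignment search itself, with its multiple restarts, is not shown):

```python
def drg_f1(matching_arcs, gold_arcs, predicted_arcs):
    """D-match-style score: F1 over aligned DRG arcs, as a percentage."""
    recall = matching_arcs / gold_arcs
    precision = matching_arcs / predicted_arcs
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)

# The worked example: 8 matching arcs, 9 gold arcs, 12 predicted arcs
# -> recall 8/9, precision 8/12, F1 76.19
score = round(drg_f1(8, 9, 12), 2)
```

The harmonic mean penalizes the over-generation visible in the example: precision (8/12) drags the score well below recall (8/9).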
The parser also performs better on sentences which do not involve SDRSs (79.12 vs 68.36 F1), which is expected given that SDRSs usually correspond to more elaborate structures. We also found that rhetorical relations (linking segments) are predicted fairly accurately, especially when they are frequently attested (e.g., Continuation, Parallel), while the parser has difficulty with relations denoting contrast.

Figure 5: F1 score as a function of sentence length.

Figure 5 shows F1 performance for the three parsers on sentences of different length. We observe a similar trend for all models: as sentence length increases, model performance decreases. The baseline and shallow models do not perform well on short sentences, which, despite containing fewer words, can still represent complex meaning that is challenging to capture sequentially. The performance of the deep model, on the other hand, is relatively stable; its LSTMs function comparatively well because they face the easier task of predicting meaning in stages (starting with a tree skeleton which is progressively refined). We provide examples of model output in the supplementary material.

Related Work
Tree-structured Decoding A few recent approaches develop structured decoders which make use of the syntax of meaning representations. Dong and Lapata (2016) and Alvarez-Melis and Jaakkola (2017) generate trees in a top-down fashion, while in other work (Xiao et al., 2016; Krishnamurthy et al., 2017) the decoder generates from a grammar that guarantees that predicted logical forms are well-typed. In a similar vein, Yin and Neubig (2017) generate abstract syntax trees (ASTs) based on the application of production rules defined by the grammar. Rabinovich et al. (2017) introduce a modular decoder whose various components are dynamically composed according to the generated tree structure. In comparison, our model does not use grammar information explicitly. We first decode the structure of the DRS, and then fill in details pertaining to its semantic content. Our model is not, strictly speaking, top-down: we generate partial trees sequentially and then expand non-terminal nodes, ensuring that when we generate the children of a node, we have already obtained the structure of the entire tree.
Wide-coverage Semantic Parsing Our model is trained on the GMB, a richly annotated resource in the style of DRT which provides a unique opportunity for bootstrapping wide-coverage semantic parsers. Boxer (Bos, 2008), a precursor to the GMB, was the first semantic parser of this kind, deterministically mapping CCG derivations onto formal meaning representations. Le and Zuidema (2012) were the first to train a semantic parser on an early release of the GMB (2,000 documents; Basile et al. 2012); however, they abandon lambda calculus in favor of a graph-based representation. The latter is closely related to AMR, a general-purpose meaning representation language for broad-coverage text. In AMR the meaning of a sentence is represented as a rooted, directed, edge-labeled and leaf-labeled graph. AMRs do not resemble classical meaning representations and do not have a model-theoretic interpretation; however, see Bos (2016) and Artzi et al. (2015) for translations to first-order logic.

Conclusions
We introduced a new end-to-end model for open-domain semantic parsing. Experimental results on the GMB show that our decoder is able to recover discourse representation structures to a good degree (77.54 F1), albeit with some simplifications. In the future, we plan to model document-level representations which are more in line with DRT and the GMB annotations.