The First Shared Task on Discourse Representation Structure Parsing

The paper presents the IWCS 2019 shared task on semantic parsing where the goal is to produce Discourse Representation Structures (DRSs) for English sentences. DRSs originate from Discourse Representation Theory and are scoped meaning representations that capture the semantics of negation, modals, quantification, and presupposition triggers. Additionally, concepts and event-participants in DRSs are described with WordNet synsets and the thematic roles from VerbNet. To measure similarity between two DRSs, they are represented in a clausal form, i.e. as a set of tuples. Participant systems were expected to produce DRSs in this clausal form. The rich lexical information, explicit scope marking, high number of shared variables among clauses, and highly constrained format of valid DRSs together make DRS parsing a challenging NLP task. The results of the shared task displayed improvements over the existing state-of-the-art parser.


Introduction
Semantic parsing has been gaining in popularity in the last few years. A series of shared tasks in semantic parsing has been organized, each requiring systems to generate meaning representations of a specific type: Broad-Coverage Semantic Dependencies (Oepen et al., 2014, 2015), Abstract Meaning Representation (May, 2016; May and Priyadarshi, 2017), or Universal Conceptual Cognitive Annotation (Hershcovich et al., 2019).
The Discourse Representation Structure (DRS) parsing task extends this development by aiming at producing meaning representations that (i) come with more expressive power than existing ones and (ii) are easily translatable into formal logic, thereby opening the door to applications that require automated forms of inference (Blackburn and Bos, 2005; Dagan et al., 2013). DRSs are meaning representations employed by Discourse Representation Theory (DRT, Kamp and Reyle, 1993). They have been successfully applied for wide-coverage semantic representations (Bos et al., 2004; Bos, 2008), Natural Language Inference (Bos and Markert, 2005; Bjerva et al., 2014), and Natural Language Generation (Basile and Bos, 2013). To the best of our knowledge, there has never been a shared task on scoped meaning representations.
The aim of the task is to compare semantic parsing methods and the performance of systems that take as input an English text and provide as output the scoped meaning representation of that text. Since a DRS combines logical (negation, quantification and modals), pragmatic (presuppositions) and lexical (word senses and thematic roles) components of semantics in a single meaning representation, the DRS parsing task shares parts of the following NLP tasks: semantic role labeling, reference resolution, scope detection, named entity tagging, word sense disambiguation, predicate-argument structure prediction, and presupposition projection.
There are only a few previous approaches to DRS parsing. Traditionally, due to the complexity of the task, it has been the domain of symbolic and statistical approaches (Bos, 2008; Le and Zuidema, 2012; Bos, 2015). Recently, however, neural sequence-to-sequence systems achieved impressive performance on the task (Liu et al., 2018), without relying on any external linguistic resources.
Figure 1: The DRS parsing task: the system input is a short text (the PMB document 99/2308), and the expected output is a DRS in clausal form. Its standard visualisation in box-notation, following DRT, is presented below.
Tom isn't afraid of anything.
SYSTEM OUTPUT:
b1 REF x1
b1 male "n.02" x1
b1 Name x1 "tom"
b2 REF t1
b2 EQU t1 "now"
b2 time "n.08" t1
b2 NOT b3
b3 REF s1
b3 Time s1 t1
b3 Experiencer s1 x1
b3 afraid "a.01" s1
b3 Stimulus s1 x2
b3 REF x2
b3 entity "n.01" x2
Figure 2: The DRS of the PMB document 00/3008, He played the piano and she sang. The segmented box b6 consists of a set of labelled boxes, i.e. the discourse segments b1 and b4, and a single discourse condition. In the condition, a discourse relation holds between two discourse segments and is formatted in uppercase. The definite noun phrase and the pronouns are presupposition (b2, b3, and b5) triggers.
In the first shared task on DRS parsing, taking into account the information-rich and complex structure of the target meaning representation, we tested participant systems mainly on short, open-domain English texts. In this way, we lowered the threshold for participation, encouraging higher results in the shared task and mitigating the challenges associated with semantic parsing of long texts. In total, five systems participated in the shared task. The top-ranked systems outperformed the existing state-of-the-art system in DRS parsing. The shared task was hosted on CodaLab.

Task Description
Figure 1 presents DRS parsing in a nutshell. Here, the input, a short English sentence, needs to be mapped to the output, a scoped meaning representation in clausal form. Concepts, states and events are represented by the word senses (male.n.02, entity.n.01, afraid.a.01) from WordNet 3.0 (Fellbaum, 1998) and relations are modeled with thematic roles (Name, Experiencer, Stimulus) drawn from an extended version of VerbNet (Bonial et al., 2011).
Each entity needs to introduce a discourse referent, i.e. a variable, in the right scope, form an instance of the right concepts, and be connected to other entities via thematic roles or comparison operators. For example, in Figure 1, anything introduces a discourse referent x2 in the scope b3 with the help of the clause b3 REF x2. The clause b3 entity "n.01" x2 makes x2 an instance of entity.n.01. Finally, the clause b3 Stimulus s1 x2 connects x2 to the event entity s1 of afraid via the Stimulus thematic role.
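As a minimal illustration of the clausal form (a sketch, not the format reader distributed with the shared task), the clauses of Figure 1 can be read into tuples and grouped by the box that labels them. The helper names parse_clauses and group_by_box are hypothetical, and the sketch assumes that quoted constants contain no spaces.

```python
from collections import defaultdict

# Clausal form of Figure 1 ("Tom isn't afraid of anything."), one clause per line.
FIGURE_1 = '''
b1 REF x1
b1 male "n.02" x1
b1 Name x1 "tom"
b2 REF t1
b2 EQU t1 "now"
b2 time "n.08" t1
b2 NOT b3
b3 REF s1
b3 Time s1 t1
b3 Experiencer s1 x1
b3 afraid "a.01" s1
b3 Stimulus s1 x2
b3 REF x2
b3 entity "n.01" x2
'''


def parse_clauses(text):
    """Turn every non-empty line into a tuple of whitespace-separated tokens."""
    return [tuple(line.split()) for line in text.strip().splitlines() if line.strip()]


def group_by_box(clauses):
    """Map each box label (first element of a clause) to the clauses it contains."""
    boxes = defaultdict(list)
    for clause in clauses:
        boxes[clause[0]].append(clause[1:])
    return dict(boxes)


clauses = parse_clauses(FIGURE_1)
boxes = group_by_box(clauses)
for label, content in boxes.items():
    print(label, content)
```

Grouping by the first element makes the scope of each clause explicit: the clauses about anything (x2) all sit in box b3, the scope introduced by the negation.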
The scopes of negation, implication, modal operators or propositional arguments need to be correctly identified. Proper names, pronouns, definite descriptions and possessives are treated as presuppositions and get their own box if they cannot be resolved by the local context. Tense is locally accommodated. For example, Figure 1 shows how the negation operator introduces the scope (b3) and how the named entity Tom gives rise to the presupposition (b1). Figure 2 demonstrates how discourse segments get their own scope (b1 and b4) and how definite noun phrases and pronouns trigger presuppositions (b2, b3, and b5). Finally, Figure 3 depicts an implication with two scopes (b3 and b5), modeling semantics of a universal quantifier, and nested presuppositions (b1 and b4) due to a possessive pronoun.
Given the aforementioned nuances of the fine-grained scoped meaning representations, the DRS parsing task represents a challenge for machine learning methods.

Discourse Representation Structure
The meaning representations used in this shared task are based on the DRSs put forward in DRT (Kamp and Reyle, 1993) and derived from the Parallel Meaning Bank. There are some important extensions to the theory, though. First, the DRSs are language-neutral, and all nonlogical symbols are disambiguated to WordNet synsets or VerbNet roles. Furthermore, presuppositions are explicitly represented following Van der Sandt (1992) and Projective DRT (Venhuizen et al., 2018). Discourse structure is analysed following Segmented DRT (Asher and Lascarides, 2003). As in the original DRT, DRSs are displayed in box format for reading convenience (presuppositional DRSs are displayed with outgoing arrows of the boxes that triggered them). DRSs are recursive structures, and for the purpose of evaluation, they are translated into clauses, flattening down the recursion by reification.
A DRS always contains a main labelled box along with an optional set of presupposition DRSs (see Definition 1). For example, the main labelled box in Figure 1 is b2 while b1 is a presupposition. A box can be simple (e.g., the box labelled with b2 in Figure 1) or segmented (e.g., the box labelled with b6 in Figure 2). A simple box consists of a set of discourse referents and a set of conditions. Conditions can be basic or complex. Basic conditions are concept predicates or relations over discourse referents and constants. Indexicals are treated as constants, not as discourse referents (Bos, 2017); for example, now is one such indexical (see Figure 1). Complex conditions are those involving labelled boxes. Examples of complex conditions are ¬b3 in Figure 1 and b3 ⇒ b5 in Figure 3. Finally, a segmented box contains a set of labelled boxes (b1 and b4 in Figure 2) and discourse conditions. A discourse condition is a discourse relation over box labels, e.g., CONTINUATION(b1, b4) in Figure 2.
Definition 1: A BNF of DRSs: (possibly empty) sets are denoted with curly brackets as { element }. The string elements for operators and punctuation are in red.

DRS ::= { DRS } labelled BOX
labelled BOX ::= label BOX
Figure 3: The DRS contains an example of nested presuppositions triggered by the possessive pronoun his. The main box b2 of the DRS presupposes a set of two DRSs. At the same time, one of the presupposed DRSs, namely {b1}, b4, itself carries the presupposition b1. Note that the presuppositions about a male discourse referent, triggered by he and his separately, are merged into a single presupposition box b1. The clauses are accompanied with aligned tokens.
The clausal form and the box-notation are two different forms of displaying scoped meaning representations (van Noord, Abzianidze, Haagsma, and Bos, 2018). We consider the clausal form a machine-readable format that is suitable for the evaluation with a continuous score between 0 and 1 (see Section 5). On the other hand, the box-notation is a human-readable format and originates from Discourse Representation Theory. Conversion from the box-notation to the clausal form and vice versa is transparent: each box gets a label, and discourse referents and conditions in the clausal form are preceded by the label of the box they occur in.
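The conversion from boxes to clauses can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the converter used in the PMB: the nested-dictionary encoding of a box (keys "label", "referents", "conditions") and the function name flatten are hypothetical. The example encodes the main box b2 of Figure 1 with its negated box b3.

```python
def flatten(box):
    """Flatten a (possibly nested) box into clauses prefixed by box labels."""
    label = box["label"]
    # Every discourse referent becomes a REF clause of its box.
    clauses = [(label, "REF", ref) for ref in box["referents"]]
    for operator, *args in box["conditions"]:
        # A nested box is replaced by its label inside the condition ...
        flat_args = [a["label"] if isinstance(a, dict) else a for a in args]
        clauses.append((label, operator, *flat_args))
        # ... and its own content is flattened recursively afterwards.
        for arg in args:
            if isinstance(arg, dict):
                clauses.extend(flatten(arg))
    return clauses


negated_box = {
    "label": "b3",
    "referents": ["s1", "x2"],
    "conditions": [
        ("Time", "s1", "t1"),
        ("Experiencer", "s1", "x1"),
        ("afraid", '"a.01"', "s1"),
        ("Stimulus", "s1", "x2"),
        ("entity", '"n.01"', "x2"),
    ],
}
main_box = {
    "label": "b2",
    "referents": ["t1"],
    "conditions": [
        ("EQU", "t1", '"now"'),
        ("time", '"n.08"', "t1"),
        ("NOT", negated_box),
    ],
}

flat = flatten(main_box)
for clause in flat:
    print(" ".join(clause))
```

The recursion disappears because the complex condition only mentions the label b3, while the content of b3 is listed as ordinary clauses carrying that label.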

Released Data
For the shared task we released the training, development, and test data, taken from the Parallel Meaning Bank (PMB). The PMB is a parallel corpus annotated with formal meaning representations. These representations capture the most probable interpretation of a sentence; no ambiguities or under-specification techniques are employed. The formal meaning representations are automatically constructed and manually corrected. Completely correct representations are flagged as gold. Representations that are partly manually corrected are marked as silver, while the rest is marked bronze.
The PMB release number used for the shared task is 2.2.0, of which some statistics are shown in Table 1. Note that MWE tokens and types are underrepresented in the silver and bronze data compared to the gold data. This is because the gold data contains more manual corrections on the token level than the silver and bronze data. For an example of multi-word expressions see Figure 4. In the shared task, participants were allowed to use the silver and bronze data. This would especially make sense in the case of data-hungry neural models, though there is no guarantee that those representations resemble the gold standard.
The data provided to the shared task participants consists of pairs of a raw natural language text and its corresponding scoped meaning representation in clausal form. Whether the meaning representation is of gold, silver or bronze standard is explicitly indicated. To facilitate automatic learning of scoped meaning representations, we also provided automatically induced alignments between clauses and tokens, where token positions are provided with character offsets. Examples of clause-token alignments are given in Figure 2 and Figure 4. The latter represents the exact formatting of the text and clausal form pair provided in the shared task.

Evaluation set
The official evaluation set contains 600 instances that were not released previously. They will not be released publicly, but are still available for (blind) scoring via the shared task website. However, during the evaluation phase, we asked the participants to provide DRSs for a set of 12,606 short texts. In addition to the raw texts (600) from the evaluation split, this set contained the train (4,597), development (682), and test (650) data from the PMB-2.2.0 release and the sentences (6,077) from the SICK dataset (Marelli et al., 2014). The reason for providing the inflated set of raw texts was three-fold: (i) to disguise the raw texts of the evaluation set and make it hard to tune models on them; (ii) to obtain complete information about the performance of the systems on the provided training, development and test sets; (iii) to carry out extrinsic evaluation of the participant systems on the natural language inference task.

Evaluation Metrics and Baselines
Before comparing a system-produced clausal form to the gold one, the produced form is checked for validity, i.e., whether it represents a DRS. If the clausal form is invalid, it is replaced by a single non-matching clause. In the shared task, we include three baseline systems. The evaluation and validation scripts and the baselines are publicly available.

Validation
Not all sets of clauses correspond to a well-formed DRS: for example, discourse referents found in the conditions should be explicitly introduced in the boxes, and there should exist labelled boxes for the labels used in the discourse conditions. We employ the validator REFEREE to automatically check a set of clauses for well-formedness. REFEREE performs several validity checks. First, it scans each clause of a clausal form separately and identifies the types of the variables based on the operators. For each discourse referent variable, it checks the existence of the binding discourse referent. During this procedure, REFEREE also detects the positions of the boxes in the DRS (i.e., the so-called subordinate relation). Based on this information, it checks that nested boxes do not create loops and that there is a unique main box in the DRS. All the released clausal forms of the DRSs are valid. We provided the participants with REFEREE in order to help them identify the ill-formed clausal forms produced by their systems.
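To make the flavour of these checks concrete, the following is a heavily simplified sketch and not REFEREE itself. It assumes, unlike REFEREE, that variable types can be read off the names (box labels look like b1, other referents like x1, t1, s1), it accepts a referent introduced in any box rather than checking scope, and it does not test for a unique main box. All function names are hypothetical.

```python
import re

BOX = re.compile(r"^b\d+$")          # assumption: box labels are b1, b2, ...
REFERENT = re.compile(r"^[a-z]\d+$")  # assumption: referents are x1, t1, s1, ...


def has_loop(nested):
    """Depth-first search for a cycle in the box-nesting graph."""
    state = {}

    def visit(node):
        if state.get(node) == "active":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "active"
        found = any(visit(child) for child in nested.get(node, ()))
        state[node] = "done"
        return found

    return any(visit(node) for node in list(nested))


def find_problems(clauses):
    """Return a list of human-readable well-formedness problems (empty if none)."""
    problems = []
    boxes = {clause[0] for clause in clauses}
    introduced = {clause[2] for clause in clauses if clause[1] == "REF"}
    nested = {}  # box label -> labels of boxes mentioned in its conditions
    for box, operator, *args in clauses:
        if operator == "REF":
            continue
        for arg in args:
            if arg.startswith('"'):  # quoted constants and word senses
                continue
            if BOX.match(arg):
                nested.setdefault(box, set()).add(arg)
                if arg not in boxes:
                    problems.append(f"box {arg} is used but never defined")
            elif REFERENT.match(arg) and arg not in introduced:
                problems.append(f"referent {arg} is used but never introduced")
    if has_loop(nested):
        problems.append("nested boxes form a loop")
    return problems


VALID = [tuple(line.split()) for line in '''
b1 REF x1
b1 male "n.02" x1
b1 Name x1 "tom"
b2 REF t1
b2 EQU t1 "now"
b2 time "n.08" t1
b2 NOT b3
b3 REF s1
b3 Time s1 t1
b3 Experiencer s1 x1
b3 afraid "a.01" s1
b3 Stimulus s1 x2
b3 REF x2
b3 entity "n.01" x2
'''.strip().splitlines()]

print(find_problems(VALID))
```

Removing the clause b3 REF x2 from the example, or adding a clause such as b3 NOT b2, makes the sketch report an unbound referent or a loop, respectively.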

Evaluation
The evaluation defines to what degree a system output clausal form is similar to the corresponding gold one. To compare the system output and gold representations, we compute the F1-score over the clauses, following Allen et al. (2008). We use the tool COUNTER (van Noord, Abzianidze, Haagsma, and Bos, 2018), which is specifically designed to evaluate DRSs. It is based on the SMATCH tool (Cai and Knight, 2013) that is used to evaluate AMR parsers. It is essentially a hill-climbing algorithm that finds the best variable mapping between the produced DRS and the gold standard. To avoid local optima, we restart the procedure 10 times. In order to prevent an inflated F-score, before searching the maximal matching, COUNTER discards those REF-clauses which are deemed redundant. A REF-clause b REF x is redundant if and only if its discourse referent x occurs with a concept predicate in a basic condition of the same box b; in other words, there exists a clause of the form b concept "pos.nn" x. An example of comparing the clausal forms of two scoped meaning representations is shown in Figure 5. With respect to the optimal mapping, both the sample system output and the gold clauses include four clauses that could not be matched with each other, while three clauses are matched. The optimal mapping gives us a precision and recall of 3/7, resulting in an F-score of 42.9. Similarly to AMR, we use micro-averaged F-score when evaluating a set of DRSs.
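The scoring arithmetic can be sketched as follows. This is not COUNTER: the hill-climbing search for the variable mapping is left out, and the number of matched clauses is taken as given. The function names and the regular expression for word senses are assumptions made for illustration.

```python
import re

SENSE = re.compile(r'^"[a-z]\.\d\d"$')  # assumption: senses look like "n.01"


def drop_redundant_refs(clauses):
    """Discard 'b REF x' if box b also holds a concept clause 'b concept "pos.nn" x'."""
    concept_instances = {
        (c[0], c[3]) for c in clauses if len(c) == 4 and SENSE.match(c[2])
    }
    return [
        c for c in clauses
        if not (c[1] == "REF" and (c[0], c[2]) in concept_instances)
    ]


def f_score(matched, produced, gold):
    """F1 from the number of matched, produced, and gold clauses."""
    if matched == 0:
        return 0.0
    precision = matched / produced
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)


def micro_f_score(counts):
    """Micro-average: sum the counts over all DRSs before computing F1."""
    matched = sum(c[0] for c in counts)
    produced = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    return f_score(matched, produced, gold)


example = [
    ("b1", "REF", "x1"),
    ("b1", "male", '"n.02"', "x1"),
    ("b2", "REF", "t1"),
    ("b2", "EQU", "t1", '"now"'),
]
print(drop_redundant_refs(example))      # b1 REF x1 is redundant, b2 REF t1 is kept
print(round(100 * f_score(3, 7, 7), 1))  # precision and recall of 3/7
```

With three matched clauses out of seven on both sides, precision, recall and F1 all equal 3/7, which rounds to the 42.9 reported for Figure 5.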
An aspect that is different from the AMR evaluation system is that we generalize over synonyms. In a preprocessing step of the evaluation, all word senses are converted to their WordNet 3.0 synset ID. For example, fox.n.02 and dodger.n.01 both get normalized to dodger.n.01 and are thus able to match.
Figure 5: An optimal mapping of variables which maximizes overlap between the system output and gold clausal forms for the sentence (PMB document 00/2302) Everything is new. The maximal overlap yields an F-score of 42.9. Matching, non-matching and redundant clauses are in green, red, and stricken through, respectively. The box-notation of scoped meaning representations is not available during the comparison of clausal forms.
To calculate whether two systems differ significantly, we perform approximate randomization (Noreen, 1989), with α = 0.05, R = 1000 and F(model1) > F(model2) as test statistic for each individual DRS pair.
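A generic approximate randomization test can be sketched as below. It is an illustration under simplifying assumptions rather than the exact procedure used for the shared task: the statistic here is the absolute difference of the mean per-DRS F-scores (not the micro-averaged F-score), and the function name and the fixed seed are hypothetical choices.

```python
import random


def approximate_randomization(scores_a, scores_b, rounds=1000, seed=0):
    """Estimate a p-value for the difference between two paired lists of scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(rounds):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs for a DRS are
            # exchangeable, so swap them with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (rounds + 1)


same = approximate_randomization([0.8, 0.6, 0.9], [0.8, 0.6, 0.9])
different = approximate_randomization([0.9] * 50, [0.5] * 50)
print(same, different < 0.05)
```

Two systems would be called significantly different when the returned value falls below α = 0.05.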

Baselines
We provide three baseline parsers: SPAR, SIM-SPAR and AMR2DRS. SPAR simply outputs a default DRS, which is the DRS that is the most similar to the DRSs in our training set. SIM-SPAR outputs the DRS of the most similar sentence in the training set, based on the cosine distance of the average word-embedding vector, calculated using GloVe (Pennington et al., 2014). AMR2DRS is a script that converts the output of an AMR parser to a valid DRS by applying a set of rules, described in Bos (2016) and van Noord, Abzianidze, Haagsma, and Bos (2018). We provide scores on the development, test and evaluation sets by using the AMR parser of van Noord and Bos (2017).
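The idea behind SIM-SPAR can be sketched in a few lines. This is not the released baseline: the two-dimensional vectors below are toy values standing in for GloVe embeddings, the DRSs are placeholder strings, and the function names are hypothetical. Maximizing cosine similarity is equivalent to minimizing cosine distance.

```python
import math


def average_vector(tokens, embeddings):
    """Average the embeddings of the known tokens (None if no token is known)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_drs(sentence, training_pairs, embeddings):
    """Return the DRS of the training sentence closest to the input sentence."""
    query = average_vector(sentence.lower().split(), embeddings)
    best_drs, best_similarity = None, -2.0
    for train_sentence, drs in training_pairs:
        candidate = average_vector(train_sentence.lower().split(), embeddings)
        if query is None or candidate is None:
            continue
        similarity = cosine(query, candidate)
        if similarity > best_similarity:
            best_drs, best_similarity = drs, similarity
    return best_drs


# Toy embeddings and training data, for illustration only.
TOY_EMBEDDINGS = {
    "he": [1.0, 0.0], "she": [0.9, 0.1],
    "played": [0.1, 0.9], "sang": [0.0, 1.0],
    "it": [-0.5, -0.5], "rained": [-1.0, 0.2],
}
TRAINING_PAIRS = [("She sang", "DRS of 'She sang'"), ("It rained", "DRS of 'It rained'")]

print(most_similar_drs("He played", TRAINING_PAIRS, TOY_EMBEDDINGS))
```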

Participating Systems
We received a total of five submissions in the shared task out of 32 registered participants. Three out of five submitted a system paper. The general characteristics of the participating systems are given in Table 2. Following Nissim et al. (2017), we explicitly encouraged the participants to include ablation experiments and negative results (if any). Note that the authors of the systems NOORD ET AL.18 and NOORD ET AL.19 are among the organizers. Below, we provide a short description of each system.

Van Noord et al. (2018)
NOORD ET AL.18 uses a character-level neural sequence-to-sequence model to produce DRSs. They apply a number of methods to improve performance, such as rewriting the variables to a more general format, introducing a feature for uppercase letters and not using character-level representation for DRS roles and operators. Moreover, they show that performance can be substantially improved by first pre-training on gold and silver data, after which the parser is fine-tuned on only the gold standard data.

Van Noord (2019)
The system of NOORD ET AL.19 is the parser described in Van Noord et al. (2019) and Van Noord (2019), which follows up on their earlier work. They improve on this work in two ways: (i) by switching their sequence-to-sequence framework from OpenNMT (Klein et al., 2017) to Marian (Junczys-Dowmunt et al., 2018) and (ii) by providing the encoder with linguistic information (lemmas, semantic tags, POS-tags, dependency parses and CCG supertags) that is encoded in a separate encoder.

Liu et al. (2019)
LIU ET AL. also follow the preceding approach in terms of pre- and postprocessing the data, but they improve on it by using the Transformer model (Vaswani et al., 2017) instead of a sequence-to-sequence RNN. Also, they show that employing the bronze standard in addition to the gold and silver standard leads to improved performance.

Evang (2019)
EVANG aims to find a middle ground between traditional symbolic approaches and the recent neural (sequence-to-sequence) models. They employ a transition-based parser that relies on explicit word-meaning pairs that are found in the training set. Parsing decisions are made based on vector representations of parser states, which are encoded using stack-LSTMs.

FANCELLU ET AL.
FANCELLU ET AL. propose a graph decoder that, given an input sentence encoded via a bidirectional LSTM, generates a DAG (Directed Acyclic Graph) as a sequence of fragments from a graph grammar. These fragments are delexicalized; predicate names, synsets and information on whether the predicate is presupposed or not are predicted in a second step, conditioned on the fragment and the decoding history. The graph parser has two main features: (1) it is agnostic to the underlying semantic formalism and does not need any preprocessing step to deal with variable binding; (2) fragments are aware of the overall graph structure and the graph is built incrementally via a process of non-terminal rewriting.
(1) sets this method apart from the graph parser of Groschwitz et al. (2018), where a grammar is extracted via an elaborate pre-processing step tailored to a specific formalism, whereas (2) makes it possible to leverage neural sequential decoding (stackLSTM, Dyer et al., 2016). The only preprocessing step required is to convert DRSs in clause format into single-rooted, fully instantiated DAGs; we do so by treating both variables and boxes as nodes and semantic roles, operators and discourse relations as edges between those (where each binary operator or relation gives rise to two edges). Similarly, the only postprocessing step lies in converting the graph back to clause format. This last step can inject errors in the parse and it is the reason why some of the output graphs can be ill-formed.

Results
Table 3 shows the official results of the shared task. The system of LIU ET AL. achieved the best performance, though there is no significant difference with the work of NOORD ET AL.19 (p = 0.23). The systems of EVANG and FANCELLU ET AL., though clearly outperforming the baselines, are a bit behind the best three systems. However, they can likely improve performance by incorporating silver and bronze standard data. No systems seem to have overfit on the provided dev and test sets. The work of FANCELLU ET AL. is perhaps overfit on the training set, given their high score on train compared to the test sets.
Table 4 shows a more detailed overview of the results. All teams produced a substantial number of perfect DRSs, but only 31 of them were produced perfectly by every system. EVANG is the only system with a substantial number of ill-formed DRSs. This hurts their performance, since each ill-formed DRS gets an F-score of 0.0 in evaluation. If we ignore REFEREE and score their ill-formed DRSs as if they were valid, their score increases to 72.1. On the other hand, calculating an F-score for only the ill-formed DRSs (without REFEREE) gives us an F-score of 50.6, suggesting that the model would not have scored very well on them either way.
Similar to what was observed in Van Noord et al. (2018), word sense disambiguation is problematic for the DRS parsers. When assuming oracle sense numbers, all systems obtain a substantially higher F-score (increases of 1.8 to 3.6). NOORD ET AL.19 propose a simple method to improve on this sub-problem by taking the most frequent sense in the training set for a concept, though this only increased their F-scores by 0.2 to 0.4 on the dev and test sets. Nouns are the easiest for all models to correctly produce (possibly due to the frequent time.n.08), while adverbs are the hardest, though there are only 12 such clauses in the evaluation set.

Analysis
Longer sentences are probably harder to parse, but which systems behave well on longer sentences? Figure 6 shows the performance of the systems plotted over sentence length. As expected, all systems show a clear drop in performance for longer sentences. The work of LIU ET AL. is based on the Transformer model (Vaswani et al., 2017), for which it is claimed that performance should not degrade for longer sentences. However, for DRS parsing this does not seem to be the case, as LIU ET AL. shows a similar decrease in performance as the neural models of NOORD ET AL.18 and NOORD ET AL.19.
Table 4: F-scores of fine-grained evaluation of the participating systems on the evaluation set. "Winner out of 5" counts only those instances for which the parser obtained a higher score than all the rest, while "Highest out of 5" allows ties.
How similar were the outputs of the participating systems to each other? Table 5 shows pairwise comparison of the outputs of the systems on the evaluation set. The only system that has a substantially higher similarity to one of the systems than their official F-score is NOORD ET AL.18 compared to NOORD ET AL.19, which tells us the models make similar mistakes. This makes sense given that they are both character-level sequence-to-sequence models trained on the same data. Additionally, the output of NOORD ET AL.19 comes closest to the output of FANCELLU ET AL. when compared to other systems' outputs. Similarly, EVANG is more similar to NOORD ET AL.18 than to any other system.
How complementary were the participating systems to each other? If we had an ensemble system, with an oracle component that selected the best DRS for each sentence out of the participants' submissions, it would obtain an F-score of 90.5. When only combining the submissions of LIU ET AL. and NOORD ET AL.19, it would already result in an F-score of 89.1. This suggests that the neural models in fact do learn different things, though there is still a significant portion that both methods could not learn.
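The oracle ensemble can be illustrated with the sketch below, under stated assumptions: each system is represented by per-sentence counts of matched, produced and gold clauses, the oracle keeps the output with the highest per-sentence F-score, and the final score is micro-averaged. The system names A and B and all counts are made-up toy values, not results from the shared task.

```python
def f_score(matched, produced, gold):
    """F1 from the number of matched, produced, and gold clauses."""
    if matched == 0:
        return 0.0
    precision = matched / produced
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)


def micro_f(counts):
    """Micro-averaged F1 over a list of (matched, produced, gold) triples."""
    return f_score(*(sum(c[i] for c in counts) for i in range(3)))


def oracle_counts(per_system_counts):
    """For every sentence keep the counts of the system with the best F-score."""
    systems = list(per_system_counts)
    n_sentences = len(per_system_counts[systems[0]])
    return [
        max((per_system_counts[s][i] for s in systems), key=lambda c: f_score(*c))
        for i in range(n_sentences)
    ]


# Toy counts (matched, produced, gold) for two sentences and two hypothetical systems.
TOY_COUNTS = {
    "A": [(7, 7, 7), (3, 7, 7)],
    "B": [(4, 8, 7), (6, 7, 7)],
}

for name, counts in TOY_COUNTS.items():
    print(name, round(100 * micro_f(counts), 1))
print("oracle", round(100 * micro_f(oracle_counts(TOY_COUNTS)), 1))
```

In the toy data each system is better on one of the two sentences, so the oracle score exceeds both individual scores, which is the sense in which the systems are complementary.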
Finally, are there phenomena that are especially hard for all participating systems? This is not an easy question to answer, and here we show just a first step to such an analysis. Table 6 shows sentences for which systems, on average, performed badly. Some of them show non-standard use of English, others are phenomena that are relatively rare, such as generics, multi-word expressions, and coordination.

Conclusion
The first shared task on DRS parsing was successful. It improved the state of the art in DRS parsing, and the variety in methods used (models based on recurrent neural networks, transformer models, models based on transition-based parsing, graph decoders) gives inspiration for future research. In the future, DRS parsing will be made more challenging by moving to longer sentences and texts (where we expect simple seq2seq models to have a harder time), to more complex phenomena (ellipsis, comparatives, multiword expressions), and to languages other than English.