MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

Background and Motivation
The 2020 Conference on Computational Language Learning (CoNLL) hosts a shared task (or 'system bake-off') on Cross-Framework Meaning Representation Parsing (MRP 2020), which is a revised and extended re-run of a similar CoNLL shared task in the preceding year. The goal of these tasks is to advance data-driven parsing into graph-structured representations of sentence meaning. For the first time, the MRP task series combines formally and linguistically different approaches to meaning representation in graph form in a uniform training and evaluation setup.
Key differences in the 2020 edition of the task include the addition of a graph-based encoding of Discourse Representation Structures (dubbed DRG); a generalization of Prague Tectogrammatical Graphs (to include more information from the original annotations); and a separate cross-lingual track, introducing one extra language (beyond English) for four of the frameworks involved. Participants were invited to develop parsing systems that support, in the same implementation, five distinct semantic graph frameworks in four languages (see §3 below), all of which encode core predicate-argument structure, among other things. Ideally, these parsers predict sentence-level meaning representations in all frameworks in parallel. Architectures utilizing complementary knowledge sources (e.g. via parameter sharing) were encouraged, though not required. Learning from multiple flavors of meaning representation in tandem has hardly been explored (with notable exceptions, e.g. the parsers of Peng et al., 2017; Hershcovich et al., 2018; Stanovsky and Dagan, 2018). The task design aims to reduce framework-specific 'balkanization' in the field of meaning representation parsing. Its contributions include (a) a unifying formal model over different semantic graph banks (§2), (b) uniform representations and scoring (§4 and §6), (c) contrastive evaluation across frameworks (§5), and (d) increased cross-fertilization of parsing approaches (§7).

Definitions: Graphs and Flavors
Reflecting different traditions and communities, there is wide variation in how individual meaning representation frameworks think (and talk) about semantic graphs, down to the level of visual conventions used in rendering graph structures. Increased terminological uniformity and guidance in how to navigate this rich and diverse landscape are among the desirable side-effects of the MRP task series. The following paragraphs provide semi-formal definitions of core graph-theoretic concepts that can be meaningfully applied across the range of frameworks represented in the shared task.

Basic Terminology
Semantic graphs (across different frameworks) can be viewed as directed graphs or digraphs. A semantic digraph is a triple (T, N, E) where N is a set of nodes and E ⊆ N × N is a set of edges. The in- and out-degree of a node count the number of edges arriving at or leaving from the node, respectively. In contrast to the unique root node in trees, graphs can have multiple (structural) roots, which we define as nodes with in-degree zero. The majority of semantic graphs are structurally multi-rooted. Thus, we distinguish one or several nodes in each graph as top nodes, T ⊆ N; the top(s) correspond(s) to the most central semantic entities in the graph, usually the main predication(s).
In a tree, every node except the root has in-degree one. In semantic graphs, nodes can have in-degree two or higher (indicating shared arguments), which constitutes a reentrancy in the graph. In contrast to trees, general digraphs may contain cycles, i.e. a directed path leading from a node to itself. Another central property of trees is that they are connected, meaning that there exists an undirected path between any pair of nodes. In contrast, semantic graphs need not generally be connected.
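The definitions above can be made concrete with a minimal Python sketch. The function names and the toy graph are hypothetical illustrations (the graph is only loosely inspired by the running example of §3) and are not part of the task software; edges are assumed to be (source, target) pairs.

```python
from collections import defaultdict


def in_degrees(nodes, edges):
    """Count incoming edges per node; edges are (source, target) pairs."""
    counts = {n: 0 for n in nodes}
    for _, target in edges:
        counts[target] += 1
    return counts


def roots(nodes, edges):
    """Structural roots: nodes with in-degree zero."""
    counts = in_degrees(nodes, edges)
    return {n for n in nodes if counts[n] == 0}


def reentrant(nodes, edges):
    """Reentrant nodes: nodes with in-degree two or higher."""
    counts = in_degrees(nodes, edges)
    return {n for n in nodes if counts[n] >= 2}


def is_cyclic(nodes, edges):
    """True if some directed path leads from a node back to itself."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY
        for m in successors[n]:
            if color[m] == GREY:
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


def is_connected(nodes, edges):
    """True if an undirected path exists between any pair of nodes."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in neighbours[n] - seen:
            seen.add(m)
            stack.append(m)
    return seen == nodes


# Hypothetical toy graph (T, N, E); not an actual gold-standard analysis.
N = {"almost", "impossible", "apply", "technique"}
E = {("almost", "impossible"), ("impossible", "apply"),
     ("impossible", "technique"), ("apply", "technique")}
T = {"impossible"}

print(roots(N, E), reentrant(N, E), is_cyclic(N, E), is_connected(N, E))
```

In this toy graph the only structural root (almost) differs from the designated top (impossible), which illustrates why tops are recorded separately from roots.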
Finally, in some semantic graph frameworks there is a (total) linear order on the nodes, typically (though not necessarily) induced by the surface order of corresponding tokens. Such graphs are conventionally called bi-lexical dependencies (probably deriving from a notion of lexicalization articulated by Eisner, 1997) and formally constitute ordered graphs. A natural way to visualize a bi-lexical dependency graph is to draw its edges as semicircles in the half-plane above the sentence. An ordered graph is called noncrossing if in such a drawing, the semicircles intersect only at their endpoints (this property is a natural generalization of projectivity as it is known from dependency trees).
A natural generalization of the noncrossing property, where one is allowed to also use the half-plane below the sentence for drawing edges, is a property called pagenumber two. Kuhlmann and Oepen (2016) provide additional definitions and a quantitative summary of various formal graph properties across frameworks.
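As an illustration of the noncrossing property, the following Python sketch tests whether any two edges of an ordered graph cross when drawn as semicircles above the sentence. It assumes that nodes are identified by their integer positions in the linear order and that edge direction is irrelevant for crossing; the function names are hypothetical.

```python
from itertools import combinations


def crosses(edge1, edge2):
    """Two arcs cross if exactly one endpoint of one lies strictly inside the other."""
    (a, b), (c, d) = sorted(edge1), sorted(edge2)
    return a < c < b < d or c < a < d < b


def is_noncrossing(edges):
    """True if no pair of edges crosses; arcs may still share endpoints or nest."""
    return not any(crosses(e1, e2) for e1, e2 in combinations(edges, 2))


print(is_noncrossing([(0, 3), (1, 2)]))  # nested arcs
print(is_noncrossing([(0, 2), (1, 3)]))  # interleaved arcs
```

The pairwise test is quadratic in the number of edges, which is unproblematic for sentence-sized graphs.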

Hierarchy of Formal Flavors
In the context of the MRP shared task series, we have previously defined different flavors of semantic graphs based on the nature of the relationship they assume between the linguistic surface signal (typically a written sentence, i.e. a string) and the nodes of the graph. We refer to this relation as anchoring (of nodes onto sub-strings); other commonly used terms include alignment, correspondence, or lexicalization.
Flavor (0) is characterized by the strongest form of anchoring, obtained in bi-lexical dependency graphs, where graph nodes injectively correspond to surface lexical units (i.e. tokens or 'words'). In such graphs, each node is directly linked to one specific token (conversely, there may be semantically empty tokens), and the nodes inherit the linear order of their corresponding tokens.
Flavor (1) includes a more general form of anchored semantic graphs, characterized by relaxing the correspondence between nodes and tokens, allowing arbitrary parts of the sentence (e.g. subtoken or multi-token sequences) as node anchors, as well as unanchored nodes, or multiple nodes anchored to overlapping sub-strings. These graphs afford greater flexibility in the representation of meaning contributed by, for example, (derivational) affixes or phrasal constructions and facilitate lexical decomposition (e.g. of causatives or comparatives).
Finally, Flavor (2) semantic graphs do not consider the correspondence between nodes and the surface string as part of the representation of meaning (thus backgrounding notions of derivation and compositionality). Such semantic graphs are simply unanchored.
While different flavors refer to formally defined sub-classes of semantic graphs, we reserve the term framework for specific linguistic approaches to graph-based meaning representation (typically encoded in a particular graph flavor, of course). However, the coarse classification into three distinct flavors does not fully account for the variability of anchoring relations observed across frameworks. For example, graphs can be partially anchored, meaning that only a subset of nodes are explicitly linked to the surface string; the anchoring relations that are present can in turn stand in one-to-one correspondence to surface tokens, or allow overlapping and sub-token or phrasal relationships. At the same time, a framework may impose a total ordering of nodes independent of (or possibly only partly dependent on) anchoring. We will interpret Flavors (0) and (2) strictly, as fully lexically anchored and wholly unanchored, respectively, leading to the categorization of mixed forms of anchoring as Flavor (1), and allow for the presence of ordered graphs, in principle at least, at all levels of the hierarchy. 2
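The strict reading of Flavors (0) and (2) can be sketched as a small Python classifier. This is an illustrative approximation only: it assumes that anchors are given as (start, end) character spans, that token boundaries are known, and that the hypothetical function flavor sees nothing beyond anchoring (node order, for instance, is ignored).

```python
def flavor(node_anchors, token_spans):
    """Classify a graph by its anchoring.

    node_anchors maps each node id to a list of (start, end) spans, which
    may be empty; token_spans lists the (start, end) span of every token.
    """
    # Flavor (2): wholly unanchored.
    if all(not spans for spans in node_anchors.values()):
        return 2
    tokens = set(token_spans)
    used = []
    for spans in node_anchors.values():
        # Unanchored, multi-span, sub-token, or multi-token anchors are mixed forms.
        if len(spans) != 1 or spans[0] not in tokens:
            return 1
        used.append(spans[0])
    # Flavor (0) requires an injective node-to-token correspondence.
    return 0 if len(used) == len(set(used)) else 1


# Token spans for the prefix "A similar technique" of the running example.
tokens = [(0, 1), (2, 9), (10, 19)]
print(flavor({"n0": [(2, 9)], "n1": [(10, 19)]}, tokens))
print(flavor({"n0": [(2, 9)], "n1": []}, tokens))
print(flavor({"n0": [], "n1": []}, tokens))
```

Tokens without a node are permitted under Flavor (0), matching the observation that some tokens may be semantically empty.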

Meaning Representation Frameworks
The shared task combines five distinct frameworks for graph-based meaning representation, each with its specific formal and linguistic assumptions. This section reviews the frameworks and presents English example graphs for sentence #20209013 from the venerable Wall Street Journal (WSJ) Corpus from the Penn Treebank (PTB; Marcus et al., 1993):
(1) A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice.
The example exhibits some interesting linguistic complexity, including what is called a tough adjective (impossible), a scopal adverb (almost), a tripartite coordinate structure, and apposition. The example graphs in Figures 1 through 4 are presented in order of (arguably) increasing 'abstraction' from the surface string, i.e. ranging from fully anchored Flavor (1) to unanchored Flavor (2).
2 [...] where unanchored nodes for unexpressed material beyond the surface string can be postulated (Schuster and Manning, 2016). Whether or not these nodes occupy a well-defined position in the otherwise total order of basic UD nodes remains an open question, but either way the presence of unanchored nodes will take enhanced UD graphs beyond the bi-lexical Flavor (0) graphs in our terminology.

Elementary Dependency Structures
The EDS graphs (Oepen and Lønning, 2006) originally derive from the underspecified logical forms computed by the English Resource Grammar (Flickinger et al., 2017; Copestake et al., 2005). These logical forms are not in and of themselves semantic graphs (in the sense of §2 above) and are often referred to as English Resource Semantics (ERS; Bender et al., 2015). 3 Elementary Dependency Structures (EDS; Oepen and Lønning, 2006) encode English Resource Semantics in a variable-free semantic dependency graph (not limited to bi-lexical dependencies), where graph nodes correspond to logical predications and edges to labeled argument positions. The EDS conversion from underspecified logical forms to directed graphs discards partial information on semantic scope from the full ERS, which makes these graphs abstractly (if not linguistically) similar to Abstract Meaning Representation (see below).
Nodes in EDS are in principle independent of surface lexical units, but for each node there is an explicit, many-to-many anchoring onto sub-strings of the underlying sentence. Thus, EDS instantiates Flavor (1) in our hierarchy of different formal types of semantic graphs; more specifically, EDS graphs are fully anchored but unordered. Avoiding a one-to-one correspondence between graph nodes and surface lexical units enables EDS to adequately represent, among other things, lexical decomposition (e.g. of comparatives), sub-lexical or construction semantics (e.g. corresponding to morphological derivation or syntactic compounding, respectively), and covert (e.g. elided) meaning contributions. All nodes in the example EDS in Figure 1 make explicit their anchoring onto sub-strings of the underlying input, for example span 2:9 for similar.
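To make the notion of an anchor concrete, the following Python sketch resolves spans against the running example. It assumes that anchors are character offsets with an exclusive end position, which is consistent with span 2:9 selecting similar; the helper name anchored_text is hypothetical.

```python
SENTENCE = ("A similar technique is almost impossible to apply to other crops, "
            "such as cotton, soybeans and rice.")


def anchored_text(sentence, anchors):
    """Return the sub-string selected by each (start, end) anchor of a node."""
    return [sentence[start:end] for start, end in anchors]


print(anchored_text(SENTENCE, [(2, 9)]))
```

Because a node carries a list of spans, the same representation extends to nodes with several or no anchors.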
In the EDS analysis for the running example, nodes representing covert quantifiers (e.g. on bare nominals, labeled udef_q 4), the two-place such+as_p relation, as well as the implicit_conj(unction) relation (which reflects recursive decomposition of the coordinate structure into binary predications) do not correspond to individual surface tokens (but are anchored on larger spans, overlapping with anchors from other nodes). Conversely, the two nodes associated with similar indicate lexical decomposition as a comparative predicate, where the second argument of the comp relation (the 'point of reference') remains unexpressed in Example (1).
3 The underlying grammar is rooted in the general linguistic theory of Head-Driven Phrase Structure Grammar (HPSG; Pollard and Sag, 1994).
4 In the EDS example in Figure 1, all nodes corresponding to instances of bare 'nominal' meanings are bound by a covert quantificational predicate, including the group-forming implicit_conj and _and_c nodes that represent the nested, binary-branching coordinate structure. This practice of uniform quantifier introduction in ERS is acknowledged as "particularly exuberant" by Steedman (2011, p. 21).

Prague Tectogrammatical Graphs
These graphs present a conversion from the multi-layered (and somewhat richer) annotations in the tradition of Prague Functional Generative Description (FGD; Sgall et al., 1986), as adopted (among others) in the Prague Czech-English Dependency Treebank (PCEDT; Hajič et al., 2012) and Prague Dependency Treebank (PDT; Böhmová et al., 2003). For more details on how the graphs are obtained from the original annotations, see .
The PTG structures essentially recast core predicate-argument structure in the form of mostly anchored dependency graphs, albeit introducing 'empty' (or generated, in FGD terminology) nodes, for which there is no corresponding surface token. Thus, these partially anchored representations instantiate Flavor (1) in our hierarchy of different formal types of semantic graphs, where anchoring relations can be discontinuous: For example, the technique node in Figure 2 is anchored to both the noun and its indefinite determiner a. PTG structures assume a total order of nodes, which provides the foundation for an underlying theory of topic-focus articulation, as proposed by Hajičová et al. (1998).
The PTG structure for our running example has many of the same dependency edges as the EDS graph (albeit using a different labeling scheme and inverse directionality in a few cases), but it analyzes the predicative copula as semantically contentful and does not treat almost as 'scoping' over the entire graph. In the example graph, there are two generated nodes to represent the unexpressed BEN(efactive) of the impossible relation as well as the unexpressed ACT(or) argument of the three-place apply relation, respectively; these nodes are related by an edge indicating grammatical coreference. In this graph, the indefinite determiner, infinitival to, and the vacuous preposition marking the deep object of apply can be argued to not have a semantic contribution of their own.
Figure 2: Semantic dependency graphs for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice: Prague Tectogrammatical Graphs (PTG). In addition to node properties, visualized similarly to the EDS in Figure 1, boolean edge attributes are abbreviated below edge labels, for true values.
The ADDR argument relation to the apply predicate has been recursively propagated to both elements of the apposition and to all members of the coordinate structure. Accordingly, edge labels in PTG are not always functional, in the sense of allowing multiple outgoing edges from one node with the same label.
In FGD, role labels (called functors) ACT(or), PAT(ient), ADDR(essee), ORIG(in), and EFF(ect) indicate 'participant' positions in an underlying valency frame and, thus, correspond more closely to the numbered argument positions in other frameworks than their names might suggest. 5 The PTG annotations are grounded in a machine-readable valency lexicon (Urešová et al., 2016), and the frame values on verbal nodes in Figure 2 indicate specific verbal senses in the lexicon.
5 Accordingly, multiple instances of the same core participant role (as ADDR:member in Figure 2) will only occur with propagation of dependencies into paratactic constructions.

Universal Conceptual Cognitive Annotation
Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013) is based on cognitive linguistic and typological theories, primarily Basic Linguistic Theory (Dixon, 2010/2012). The shared task targets the UCCA foundational layer, which focuses on argument structure phenomena (where predicates may be verbal, nominal, adjectival, or otherwise). This coarse-grained level of semantics has been shown to be preserved well across translations (Sulem et al., 2015). It has also been successfully used for improving text simplification (Sulem et al., 2018c), as well as for the evaluation of a number of text-to-text generation tasks (Birch et al., 2016; Sulem et al., 2018a; Choshen and Abend, 2018).
The basic unit of annotation is the scene, denoting a situation mentioned in the sentence, typically involving a predicate, participants, and potentially modifiers. Linguistically, UCCA adopts a notion of semantic constituency that transcends pure dependency graphs, in the sense of introducing separate, unlabeled nodes, called units. One or more labels are assigned to each edge. Formally, UCCA instantiates Flavor (1), where leaf (or terminal) nodes of the graph are anchored to possibly discontinuous sequences of surface sub-strings, while interior (or 'phrasal') graph nodes are formally unanchored.
The UCCA graph for the running example (see Figure 3) includes a single scene, whose main relation is the Process (P) evoked by apply. It also contains a secondary relation labeled Adverbial (D), almost impossible, which is broken down into its Center (C) and Elaborator (E); as well as two complex arguments, labeled as Participants (A). Unlike the other frameworks in the task, the UCCA foundational layer integrates all surface tokens into the graph, possibly as the targets of semantically bleached Function (F) and Punctuation (U) edges. UCCA graphs need not be rooted trees: Argument sharing across units will give rise to reentrant nodes much like in the other frameworks. For example, technique in Figure 3 is both a Participant in the scene evoked by similar and a Center in the parent unit. UCCA in principle also supports implicit (unexpressed) units which do not correspond to any tokens, but these are currently excluded from parsing evaluation and, thus, suppressed in the UCCA graphs distributed in the context of the shared task.
Figure 3: Universal Conceptual Cognitive Annotation (UCCA), foundational layer, for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice. The dashed edge whose target is the node anchored to technique abbreviates a boolean remote edge attribute.

Abstract Meaning Representation
The shared task includes Abstract Meaning Representation (AMR; Banarescu et al., 2013), which in the MRP hierarchy of different formal types of semantic graphs (see §2 above) is simply unanchored, i.e. represents Flavor (2). The AMR framework is independent of particular approaches to derivation and compositionality and, accordingly, does not make explicit how elements of the graph correspond to the surface utterance. Although most AMR parsing research presupposes a pre-processing step that 'aligns' graph nodes with (possibly discontinuous) sets of tokens in the underlying input, this anchoring is not part of the meaning representation proper.
At the same time, AMR frequently invokes lexical decomposition and normalization towards verbal senses, such that AMR graphs often appear to 'abstract' furthest from the surface signal. Since the first general release of an AMR graph bank in 2014, the framework has provided a popular target for data-driven meaning representation parsing and has been the subject of two consecutive tasks at SemEval 2016 and 2017 (May, 2016;May and Priyadarshi, 2017).
The AMR example graph in Figure 4 has a topology broadly comparable to EDS, with some notable differences. Similar to the UCCA example graph (and unlike EDS), the AMR representation of the coordinate structure is flat. Although most lemmas are linked to derivationally related forms in the sense lexicon, this is not universal, as seen by the nodes corresponding to similar and such as, which are labeled as resemble-01 and exemplify-01, respectively. These sense distinctions (primarily for verbal predicates) are grounded in the inventory of predicates from the PropBank lexicon (Kingsbury and Palmer, 2002; Hovy et al., 2006).
Figure 5: Discourse Representation Graph (DRG) for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice. Different node shapes are not formally part of the graph but serve as a visual aid to distinguish different types of the underlying DRS elements.
Role labels in AMR encode semantic argument positions, with the particular roles defined according to each PropBank sense, though the counting in AMR is zero-based such that the ARG1 and ARG2 roles in Figure 4 often correspond to ARG2 and ARG3, respectively, in the EDS of Figure 1. PropBank distinguishes such numbered arguments from non-core roles labeled from a general semantic inventory, such as frequency, duration, or domain. Figure 4 also shows the use of inverted edges in AMR, for example ARG1-of and mod. These serve to allow annotators (and in principle also parsing systems) to view the graph as a tree-like structure (with occasional reentrancies) but are formally merely considered notational variants. Therefore, the MRP rendering of the AMR example graph also provides an unambiguous indication of the underlying, normalized graph: Edges with a label component shown in parentheses are to be reversed in normalization, e.g. representing an actual ARG0 edge from resemble-01 to technique or a domain edge from other to crop.
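The normalization of inverted edges can be sketched in Python as follows. This is an illustrative approximation and not the procedure used to prepare the task data: the function normalize and both lookup tables are hypothetical, only the mod/domain pair is taken from the discussion above, and the list of role names that end in -of without being inverted is an assumption that may be incomplete.

```python
# Assumption: inverse pairs that are not marked by an "-of" suffix.
INVERSE_PAIRS = {"mod": "domain"}
# Assumption: role names ending in "-of" that are not inverted edges.
NOT_INVERTED = {"consist-of", "prep-on-behalf-of", "prep-out-of"}


def normalize(edge):
    """Return the edge in its underlying direction as (source, target, label)."""
    source, target, label = edge
    if label in INVERSE_PAIRS:
        return (target, source, INVERSE_PAIRS[label])
    if label.endswith("-of") and label not in NOT_INVERTED:
        return (target, source, label[: -len("-of")])
    return edge


# A mod edge from crop to other corresponds to a domain edge from other to crop.
print(normalize(("crop", "other", "mod")))
# Generic node identifiers n1 and n2 are placeholders.
print(normalize(("n1", "n2", "ARG1-of")))
```

Applying normalize to every edge yields the graph on which in-degrees, roots, and reentrancies would be computed for the normalized form.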
Given the non-compositionality of AMR annotation, AMR allows the introduction of semantic concepts which have no explicit lexicalization in the text, for example the et-cetera element in the coordinate structure in Figure 4. Conversely, as in the other frameworks (except UCCA), some surface tokens are analyzed as semantically vacuous. For example, parallel to the PTG graph in Figure 2, there is no meaning contribution annotated for the determiner a (let alone for covert determiners in bare nominals, as made explicit in EDS).

Discourse Representation Graphs
Finally, Discourse Representation Graphs (DRG) provide a graph encoding of Discourse Representation Structures (DRS), the meaning representations at the core of Discourse Representation Theory (DRT; Kamp and Reyle, 1993; Van der Sandt, 1992; Asher, 1993). DRSs can model many challenging semantic phenomena including quantifiers, negation, scope, pronoun resolution, presupposition accommodation, and discourse structure. Moreover, they are directly translatable into first-order logic formulas to account for logical inference.
The DRGs used in the shared task represent a type of graph encoding of DRS that makes the graphs structurally as close as possible to the structures found in other frameworks; Abzianidze et al. (2020) provide more details on the design choices in the DRG encoding. The source DRS annotations are taken from data release 3.0.0 of the Parallel Meaning Bank (PMB; Abzianidze et al., 2017). Although the annotations in the PMB are compositionally derived from lexical semantics, anchoring information is not explicit in its DRSs; thus (like AMR), the DRG framework formally instantiates Flavor (2) of meaning representations.
The DRG of the running example is given in   of VerbNet roles. Nodes with quoted labels represent entities which semantically behave as constants. Such a node is used for the indexical "now", modelling the time of speech, which is part of the semantics of the present-tense copula is.
Explicit encoding of scope is one of the main differences between DRG and the other frameworks. Scopes can be triggered by discourse segments, negation, universal quantification, clause embedding (e.g. to apply . . . ), and presuppositions (e.g. other crops). The scopes are represented as unlabeled (square-shaped) nodes in DRG (UCCA also has unlabeled nodes, albeit for a different reason). The node for the first discourse segment is treated as a root, which is connected to the scope of the embedded clause by the ATTRIBUTION discourse relation. The latter scope presupposes the scope containing a crop which is different (with NEQ inequality) from the group of crops consisting of (with the Sub semantic role) rice, soybeans, and cotton. Each concept, represented by a WordNet synset, is explicitly assigned its scope via in edges. 7 Compared to the other frameworks, DRG structures are larger in size due to the number of semantic relations, explicit nodes for scope, scope membership edges, role reification, and information about the time (which usually introduces at least four additional nodes).
7 Since in principle the scope of a semantic role cannot be uniquely determined by the scopes of its arguments, semantic roles are reified as nodes and can have incoming in edges. But whenever the scopes of a role and its arguments coincide, the scope membership edge for the role is omitted and hence recoverable. This decision decreases the number of edges in DRG.

Task Setup
The following paragraphs summarize the 'logistics' of the MRP 2020 shared task. Except for the addition of the new cross-lingual track, the overall task setup mirrored that of the 2019 predecessor; please see  for additional background.

Cross-Framework Track
The English training, validation, and evaluation data are summarized in Table 1. For EDS, PTG, UCCA, and AMR the provenance of these gold-standard annotations is the same as in the MRP 2019 setup. The DRG target structures have been converted using the procedure sketched in §3 above. Unlike in the 2019 edition of the task, designated validation segments have been provided for all five frameworks in the cross-framework track; this data could be used during system development, e.g. for parameter tuning, but not for training the final system submission. For EDS, UCCA, and AMR, the 2020 validation data corresponds to the 2019 evaluation segments, thus allowing some comparability across the two editions of the MRP shared task.
As a common point of reference, the training data includes a sample of 89 WSJ sentences annotated in all five frameworks (twenty for DRG); for all frameworks but DRG, the evaluation data further includes parallel annotations over the same random selection of 100 sentences from the novel The Little Prince (by Antoine de Saint-Exupéry) as used in MRP 2019, dubbed LPPS. These parallel subsets of the gold-standard data are available for public download from the task site (see §9 below).
Table 2 provides a quantitative side-by-side comparison of the training data, using some of the graph-theoretic properties discussed by Kuhlmann and Oepen (2016); see §2 for semi-formal definitions. The table indicates clear differences among the frameworks. The underlying input strings for AMR (where text selection is more varied), for example, are shorter, and much shorter in turn for DRG. EDS, UCCA, and DRG have many more nodes per token, on average, than the other frameworks, reflecting lexical decomposition, 'phrasal' grouping, and role reification, respectively, as evident in Figures 1, 3, and 5. In some respects, the PTG and UCCA graphs are more tree-like than graphs in the other frameworks, for example in their proportions of actual rooted trees, the frequencies of reentrant nodes, and the lack of multi-rooted structures. At the same time, PTG exhibits comparatively high average and maximal treewidth and is the only framework with a non-trivial percentage of cyclic graphs.
Table 2: Contrastive graph statistics for the MRP 2020 English training data using a subset of the properties defined by Kuhlmann and Oepen (2016). Here, %g and %n indicate percentages of all graphs and nodes, respectively, in each framework; AMR⁻¹ refers to the normalized form of the graphs, with inverted edges reversed, as discussed in §3. The second block of statistics indicates the proportional distribution of different formal types of information in the graphs, according to the categorization used in the MRP cross-framework evaluation metric (see §5).

Cross-Lingual Track
For four of the frameworks (excluding EDS), gold-standard training and evaluation data has been compiled in languages other than English: Mandarin Chinese for AMR, Czech for PTG, and German for UCCA and DRG. For UCCA and in particular DRG, however, available data is comparatively limited, as summarized in Table 3. These target representations constitute a separate cross-lingual track, which transcends the MRP 2019 task setup.

Additional Resources
For reasons of comparability and fairness, the shared task constrained which additional data or pre-trained models (e.g. corpora, word embeddings, language models, lexica, or other annotations) can be legitimately used besides the resources distributed by the task organizers, such that all participants should in principle have access to the same range of data. However, to keep such constraints to the minimum required, a 'white-list' of legitimate resources was compiled from nominations by participants (with a cut-off date eight weeks before the end of the evaluation period). 9 Thus, the task design reflects what is at times called a closed track, where participants are constrained in which additional data and pre-trained models can be used in system development.

Companion Syntactic Parses
At a technical level, training (and evaluation) data were distributed in two formats, (a) as sequences of 'raw' sentence strings and (b) in pre-tokenized, part-of-speech-tagged, lemmatized, and syntactically parsed form. For the latter, premium-quality morpho-syntactic dependency analyses were provided to participants, called the MRP 2020 companion parses. These parses were obtained using a pre-release of the 'future' UDPipe architecture (Straka, 2018; Straka and Straková, 2020), trained on available gold-standard UD 2.x treebanks, for English augmented with conversions from PTB-style annotations in the WSJ and OntoNotes corpora (Hovy et al., 2006), using the UD-style CoreNLP 4.0 tokenizer (Manning et al., 2014) and jack-knifing where appropriate (to avoid overlap with the texts underlying the MRP semantic graphs).

Rules of Participation
While the various meaning representation frameworks and graph banks represented in the shared task inevitably present considerable linguistic variation, all MRP 2020 data was repackaged in a uniform and normalized abstract representation with a common serialization, the same JSON Lines format as used in the previous year. Because some of the semantic graph banks involved in the shared task had originally been released by the Linguistic Data Consortium (LDC), the training data was made available to task participants by the LDC under no-cost evaluation licenses. All task data (including system submissions and evaluation results) is being prepared for general release through the LDC, while subsets that are copyright-free will also become available for direct, open-source download. The shared task was first announced in March 2020, the initial release of the cross-framework training data became available in late April, and the evaluation period ran between July 27 and August 10, 2020; during this period, teams obtained the unannotated input strings for the evaluation data and had available a little more than two weeks to prepare and submit parser outputs. Submission of semantic graphs for evaluation was through the on-line CodaLab infrastructure. Teams were allowed to make repeated submissions, but only the most recent successful upload to CodaLab within the evaluation period was considered for the official, primary ranking of submissions. Task participants were encouraged to process all inputs using the same general parsing system, but (owing to inevitable fuzziness about what constitutes 'one' parser) this constraint was not formally enforced.
9 See http://svn.nlpl.eu/mrp/2020/public/resources.txt for the list of legitimate extra resources.
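Because JSON Lines stores one JSON object per line, graphs in such a serialization can be read with a few lines of Python. The sketch below is illustrative only: the field names in the sample records (id, framework, tops, nodes, edges, anchors, from, to, source, target, label) and the labels themselves are assumptions made for this example, and the task site remains the authoritative description of the actual format.

```python
import io
import json


def read_graphs(stream):
    """Yield one dictionary per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two hypothetical records; all field names and labels are illustrative assumptions.
SAMPLE = "\n".join(json.dumps(graph) for graph in [
    {"id": "toy-1", "framework": "eds", "tops": [0],
     "nodes": [{"id": 0, "label": "similar",
                "anchors": [{"from": 2, "to": 9}]}],
     "edges": []},
    {"id": "toy-2", "framework": "amr", "tops": [0],
     "nodes": [{"id": 0, "label": "resemble-01"},
               {"id": 1, "label": "technique"}],
     "edges": [{"source": 0, "target": 1, "label": "ARG1"}]},
]) + "\n"

for graph in read_graphs(io.StringIO(SAMPLE)):
    print(graph["id"], len(graph["nodes"]), len(graph["edges"]))
```

Reading line by line keeps memory use independent of the size of the graph bank, since each graph is parsed on its own.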

Evaluation
Following the previous edition of the shared task, the official MRP metric for the task is the micro-average F1 score across frameworks over all tuple types that encode 'atoms' of information in MRP graphs. The cross-framework metric uniformly evaluates graphs of different flavors, regardless of a specific framework exhibiting (a) labeled or unlabeled nodes or edges, (b) nodes with or without anchors, and (c) nodes and edges with optional properties and attributes, respectively (see Table 4).
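To illustrate what micro-averaging means here, the following minimal sketch pools tuple counts across frameworks before computing precision, recall, and F1, so that frameworks with more tuples weigh more heavily. The `Counts` class, the function names, and the numbers are hypothetical and serve only as an illustration; they are not taken from the official scorer.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Tuple counts for one framework (hypothetical container)."""
    gold: int     # tuples in the gold-standard graphs
    system: int   # tuples in the system graphs
    correct: int  # tuples shared by gold and system graphs


def f1(counts: Counts) -> float:
    """Harmonic mean of precision and recall for one set of counts."""
    p = counts.correct / counts.system if counts.system else 0.0
    r = counts.correct / counts.gold if counts.gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_f1(per_framework: dict) -> float:
    """Pool counts over all frameworks, then compute a single F1."""
    pooled = Counts(
        gold=sum(c.gold for c in per_framework.values()),
        system=sum(c.system for c in per_framework.values()),
        correct=sum(c.correct for c in per_framework.values()),
    )
    return f1(pooled)


def macro_f1(per_framework: dict) -> float:
    """Unweighted mean of per-framework F1, shown for contrast only."""
    return sum(f1(c) for c in per_framework.values()) / len(per_framework)


# Invented counts: one large and one small framework.
example = {"eds": Counts(100, 100, 90), "amr": Counts(10, 10, 5)}
```

With these invented counts, the micro-average is dominated by the larger framework, whereas the macro-average would weigh both frameworks equally.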
The MRP metric generalizes earlier framework-specific metrics (Dridan and Oepen, 2011; Hershcovich et al., 2019a) in terms of decomposing each graph into sets of typed tuples, as indicated in Figure 6. To quantify graph similarity in terms of tuple overlap, a correspondence relation between the nodes of the gold-standard and system graphs must be determined. Adapting a search procedure for the NP-hard maximum common edge subgraph (MCES) isomorphism problem, the MRP scorer will search for the node-to-node correspondence that maximizes the intersection of tuples between two graphs, where node identifiers (m and n in Figure 6) act like variables that can be equated across the gold-standard and system graphs. This means that during evaluation all information in the MRP graphs is considered with equal weight, i.e. tops, node and edge labels, properties and attributes, and anchors. MRP scoring is carried out using the open-source mtool software (https://github.com/cfmrp/mtool), the Swiss Army Knife of Meaning Representation, which implements a refinement of the MCES algorithm by McGregor (1982). Based on pre-computed per-node rewards and upper bounds on adjacent edge correspondences, candidate node-to-node mappings are initialized and scheduled in decreasing order of expected similarity. For increased efficiency (in principle tractability, in fact), mtool will return the best available solution when it exhausts its preset search space limits. This anytime behavior of the scores provides a distinction between exact vs. approximate solutions (which contrasts with
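The idea of scoring by tuple overlap under the best node-to-node correspondence can be sketched as follows. This is a deliberately naive, exhaustive search over all correspondences, feasible only for very small graphs; it is not the McGregor-style search with rewards, bounds, and search limits that mtool implements. The dictionary layout of the graphs and all function names are assumptions made for this illustration, and edge attributes are omitted for brevity.

```python
from itertools import permutations


def tuples(graph, mapping=None):
    """Decompose a graph into a set of typed tuples.

    If `mapping` is given, node identifiers are renamed through it;
    nodes mapped to None stay unmatched because gold graphs never use None.
    """
    rename = (lambda n: mapping[n]) if mapping is not None else (lambda n: n)
    out = set()
    for n in graph["tops"]:
        out.add(("top", rename(n)))
    for n, node in graph["nodes"].items():
        m = rename(n)
        if node.get("label") is not None:
            out.add(("label", m, node["label"]))
        for key, value in node.get("properties", {}).items():
            out.add(("property", m, key, value))
        if node.get("anchors"):
            out.add(("anchor", m, frozenset(node["anchors"])))
    for source, target, label in graph["edges"]:
        out.add(("edge", rename(source), rename(target), label))
    return out


def best_overlap(gold, system):
    """Try every injective (partial) mapping of system nodes to gold nodes."""
    gold_tuples = tuples(gold)
    system_ids = list(system["nodes"])
    # Padding with None allows any system node to remain unmapped.
    targets = list(gold["nodes"]) + [None] * len(system_ids)
    best = 0
    for image in permutations(targets, len(system_ids)):
        mapping = dict(zip(system_ids, image))
        best = max(best, len(gold_tuples & tuples(system, mapping)))
    return best


def score(gold, system):
    """Precision, recall, and F1 over tuples under the best correspondence."""
    correct = best_overlap(gold, system)
    n_gold, n_system = len(tuples(gold)), len(tuples(system))
    p = correct / n_system if n_system else 0.0
    r = correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy graphs: same structure, one differing node label.
gold = {"tops": [0],
        "nodes": {0: {"label": "bark"}, 1: {"label": "dog"}},
        "edges": [(0, 1, "ARG1")]}
system = {"tops": ["a"],
          "nodes": {"a": {"label": "bark"}, "b": {"label": "cat"}},
          "edges": [("a", "b", "ARG1")]}
```

In this toy pair, each graph decomposes into four tuples (one top, two labels, one edge), and the best correspondence equates `a` with `0` and `b` with `1`, so that three of the four tuples match.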

Submissions and Results
Six teams submitted parser outputs to the shared task within the official evaluation period. In addition, we received two submissions after the submission deadline, which we mark as 'unofficial'. We further include results from an additional 'reference' system by one of the task co-organizers, namely EDS outputs from the grammar-based ERG parser (Oepen and Flickinger, 2019). Table 5 presents an overview of the participating systems and the tracks and frameworks they submitted results for. All official systems submitted results for the cross-framework track (across all frameworks), and additionally five of them submitted results to the cross-lingual track as well (where TJU-BLCU did not submit UCCA parser outputs in the cross-lingual track). We note that the shared task explicitly allowed partial submissions, in order to lower the bar for participation (which is no doubt substantial). Two of the teams-ISCAS and TJU-BLCU-declined the invitation to submit a system description paper to the shared task proceedings.

[Table residue: columns Team; Cross-Framework (All, EDS, PTG, UCCA, AMR, DRG); Cross-Lingual (All, PTG, UCCA, AMR, DRG). Row contents are not recoverable.]
Table 7: Per-framework results for the cross-framework track, using the same groupings as in Table 6.
Table 6 presents the official rankings for the official submissions (top), including an overall score for each track and per-framework rankings. Rankings are given over the LPPS dataset, a sample from The Little Prince annotated in all frameworks save for DRG, and over the entire test set. Results are consequently more readily comparable for the LPPS sub-corpus, but should be more robust on the entire test corpus, due to its larger size (see §4). That said, LPPS and overall test results are very similar, both in terms of ranking and in terms of bottom-line scores.
The main task results are summarized in Table 6 for both the cross-framework (middle) and cross-lingual (bottom) tracks. Results are broken down into component pieces. Edge attributes are only present in PTG and UCCA. While they are still predicted with fairly low results, this constitutes a notable improvement over the findings of MRP 2019 (the best score on the official track on UCCA edge attributes was 0.12 F1 then, as opposed to 0.36 now). Anchors are predicted with substantially lower scores compared to MRP 2019, probably since we did not include in MRP 2020 the bi-lexical Flavor (0) frameworks. Edges and tops are slightly more accurate, while labels and properties are slightly less so, but these are not directly comparable since the frameworks and data are different. See §8 for an overall discussion of the state of the art, considering MRP 2019 and MRP 2020.
Results show that the Hitachi and ÚFAL submissions share the first place for both tracks, and together rank first or second for almost all the individual frameworks (save for UCCA parsing in the cross-lingual track, where Hitachi ranks third). HIT-SCIR further ranks second for UCCA parsing in both tracks. Interestingly, rankings in the per-framework track are similar across frameworks, which may indicate some similarity in the parsing problem exhibited by different linguistic schemes, despite differences in structure and content.
Per-framework scores using the official MRP metric are given in Table 7 for the cross-framework track and Table 8 for the cross-lingual track. Examining these results, we note that cross-framework and cross-lingual scores are quite similar, an encouraging sign of cross-linguistic applicability. Another trend to note is that precision and recall are surprisingly close to each other for many systems, often identical.

Overview of Approaches
Compared with systems from MRP 2019, there has been a fairly clear shift in approaches for participating systems this year, resulting in significant improvements in performance. The improvements for some of the frameworks are fairly substantial. For example, the Hitachi system, one of the two winning systems, achieves a score of 0.82 F1 in AMR parsing, in comparison to 0.73 F1 achieved by the top AMR parser in MRP 2019. This improvement of over eight points reflects a number of innovations from the participants this year, as well as contemporaneous developments outside the shared task (see §8). Broadly speaking, top performers at MRP 2020 have all adopted a system architecture that is based on an encoder-decoder framework in which the input sentence is encoded into contextualized token embeddings that are used as input to the decoder. The systems vary in their decoding strategies.
The Hitachi system adopts a transformer-based encoder-decoder architecture. The system uses the standard transformer encoder in which self-attention and position embeddings are used to compute the contextualized token embeddings. In its decoder, however, this system has a number of innovations. First of all, the system rewrites the meaning representation graphs into a reversible Plain Graph Notation (PGN), and enhances PGN with a number of pseudo-nodes that indicate the end of node prediction, the end of label prediction, etc. These correspond well with parsing actions commonly found in transition-based systems. In this sense, the system combines the strengths of graph-based parsing on the encoder side, resulting from self-attention, with the efficiency of transition-based parsing on the decoder side. Another innovation is the use of a 'hierarchical' decoding process in which the model first predicts a mode, and then predicts the next action conditioned on the mode. For example, if the mode is G(raph), the decoder predicts a meta node, and if the mode is S(urface), the decoder predicts the node label of a specific concept. This allows a fair competition among actions that are similar in nature.
The PERIN system computes contextualized token embeddings with XLM-R (Conneau et al., 2019) on the encoder side, and then on the decoder side, uses separate attention heads to predict the node labels, identify anchors for nodes, and predict edges between nodes, as well as edge labels. Because the label set for nodes is typically very large, rather than predicting the node labels directly, the PERIN system reduces the search space by predicting 'relative rules' that can be used to map surface token strings to node labels in meaning representation graphs, an idea that is similar to the use of Factored Concept Labels in Wang and Xue (2017). Another innovation of the PERIN system is that it is trained with a permutation-invariant loss function that returns the same value independently of how the nodes in the graph are ordered. This captures the unordered nature of nodes in (most of the MRP 2020) meaning representation graphs and prevents situations in which the model is penalized for generating the correct nodes in an order that is different from that in the training data.
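The notion of a permutation-invariant loss can be illustrated with a small sketch. The version below takes the minimum total negative log-likelihood over all one-to-one assignments of predicted node slots to gold nodes, which by construction does not depend on the order in which the gold nodes are listed. The brute-force enumeration, the function names, and the toy distributions are assumptions made for this illustration; the source does not describe how PERIN formulates or computes its loss, and a practical system would use an efficient assignment algorithm rather than enumerating permutations.

```python
import math
from itertools import permutations


def nll(distribution, label):
    """Negative log-likelihood of `label` under a predicted distribution."""
    # A small floor avoids log(0) for labels the model did not propose.
    return -math.log(distribution.get(label, 1e-9))


def permutation_invariant_loss(predicted, gold_labels):
    """Minimum summed NLL over all assignments of predictions to gold nodes.

    `predicted` is a list of label distributions (dicts), one per node slot;
    `gold_labels` is a list of gold node labels in arbitrary order.
    """
    n = len(gold_labels)
    if len(predicted) != n:
        raise ValueError("this sketch assumes equally many slots and nodes")
    return min(
        sum(nll(predicted[i], gold_labels[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )


# Toy predictions for two node slots.
predicted = [{"dog": 0.9, "bark": 0.1}, {"dog": 0.2, "bark": 0.8}]
loss_a = permutation_invariant_loss(predicted, ["dog", "bark"])
loss_b = permutation_invariant_loss(predicted, ["bark", "dog"])
```

Both calls yield the same value, since reordering the gold nodes only changes which permutation attains the minimum, not the minimum itself.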
The HIT-SCIR and JBNU systems adopt the iterative inference framework first proposed by Cai and Lam (2020) for Flavor (2) meaning representation graphs that do not enforce strict correspondences between tokens in the input sentence and the concepts in meaning representation graphs. The iterative inference framework is also based on an encoder-decoder architecture. The encoder takes the sentence as input and computes contextualized token embeddings that are used as text memory by a decoder that iteratively predicts the next node given the text memory and a predicted parent node in the partially constructed graph memory at the previous time step, and then identifies the parent node for the newly predicted node from the partially constructed graph. While the HIT-SCIR system essentially uses the Cai and Lam (2020) architecture with little modification, the JBNU system attempts to extend the work of Cai and Lam (2020) by using a shared state to make both predictions but did not observe substantial improvements.
Transition-based systems, which had achieved strong performance in the 2019 shared task, are also represented in the competition this year. The HIT-SCIR team uses a transition-based system to parse Flavor (1) meaning representations where there is a stricter correspondence between tokens in the input sentence and concepts in the meaning representation graph. The HIT-SCIR transition-based system is essentially the overall top performing system they developed for MRP 2019. It uses Stack LSTM to compute transition states in the parsing process, and the parsing actions are tailored to specific meaning representation frameworks. In the training process, the system fine-tunes BERT contextualized encodings.
The HUJI-KU system also extends an entry in the 2019 MRP shared task (originally called TUPA) to parse additional frameworks and handle meaning representation parsing in a multilingual setting. TUPA is a transition-based system that supports general DAG parsing. TUPA applies separate constraints tailored to each meaning representation framework. When parsing cross-framework meaning representations for English, the system is trained with a BERT-large-cased pretrained encoder, and when parsing cross-lingual meaning representations, it is trained with multilingual BERT.

On the State of the Art
MRP 2019 yielded parsers for five frameworks in a uniform format, of which EDS, UCCA, and AMR are represented in MRP 2020 again. Submissions included transition-, factorization-, and composition-based systems, and gold-standard target structures in 2019 were solely for English. Comparability is limited by the fact that two of the 2020 frameworks (PTG and DRG) are new, training and (in particular) evaluation sets for the others have been updated since MRP 2019, and additional validation sets were introduced. However, the LPPS evaluation sub-corpus (Le Petit Prince) is identical between the two years for EDS, UCCA, and AMR. This allows a comparison on nearly equal grounds: as Table 9 shows, in terms of LPPS F1, the state of the art has substantially improved for EDS and AMR parsing, but stayed the same for UCCA. However, as mentioned in §6, remote edge detection for UCCA improved substantially, though it carries only a small weight in terms of overall scores due to the scarcity of remote edges.

        EDS              UCCA             AMR
        P    R    F      P    R    F      P    R    F
2019    .92  .93  .93    .84  .82  .83    .74  .72  .73
2020    .97  .97  .97    .86  .80  .83    .78  .79  .79

Table 9: Per-framework cross-task comparison of top MRP metric scores on LPPS between the 2019 and 2020 editions of the MRP task, on the three frameworks represented in both years, for English. The top systems in MRP 2019 for EDS, UCCA, and AMR were Peking (Chen et al., 2019), HIT-SCIR (Che et al., 2019), and Saarland (Donatelli et al., 2019), respectively; in MRP 2020 the Hitachi system (Ozaki et al., 2020) was at the top for all three frameworks, sharing the UCCA first rank with ÚFAL (Samuel and Straka, 2020).
For EDS, the strongest results were obtained in the MRP 2019 official competition by SUDA-Alibaba (Zhang et al., 2019c). However, in the post-evaluation stage, they were outperformed by the Peking system (Chen et al., 2019). Both used factorization-based parsing with pre-trained contextualized language model embeddings (which has consistently proved to be very effective for other frameworks too). These parsers even approached the performance of the carefully designed grammar-based ERG parser (Oepen and Flickinger, 2019).
English PTG has not been comprehensively addressed by parsers prior to MRP 2020, but a bi-lexical framework called PSD is a subset of PTG. It was included in the SDP shared tasks (Oepen et al., 2014) as well as in MRP 2019, and has been addressed by numerous parsers since (Kurita and Søgaard, 2019; Kurtz et al., 2019; Jia et al., 2020, among others). The state of the art in supervised PSD was established using a second-order factorization-based parser, and Fernández-González and Gómez-Rodríguez (2020) matched it using a stack-pointer parser.
Czech PTG, in its original form as published in the Prague Dependency Treebank (Hajič et al., 2018), has been used in several versions of the TectoMT machine translation system (Rosa et al., 2016); however, parsing results have not been published separately. A (lossy) conversion has been included in the CoNLL 2009 Shared Task on Semantic Role Labeling (Hajič et al., 2009), but the differences in task design and conversion make empirical comparison impossible.
UCCA parsing has been dominated by transition-based methods (Hershcovich et al., 2017, 2018; Che et al., 2019). However, both English and German UCCA parsing featured in a SemEval shared task (Hershcovich et al., 2019b), where the best system, a composition-based parser (Jiang et al., 2019), treated the task as constituency tree parsing with the recovery of remote edges as a postprocessing task.
Prior to MRP 2019, Lyu and Titov (2018) parsed AMR using a joint probabilistic model with latent alignments, avoiding cascading errors due to alignment inaccuracies and outperforming previous approaches. Lyu et al. (2020) recently improved the latent alignment parser using stochastic softmax. A composition-based parser using the Apply-Modify algebra, on which the third-ranked Saarland submission to MRP 2019 was based (Donatelli et al., 2019), was trained on five frameworks including AMR and EDS. It employed multi-task training with all tackled semantic frameworks and UD, establishing the state of the art on all graph banks but AMR 2017. Since then, a new state of the art has been established for English AMR, using sequence-to-sequence transduction (Zhang et al., 2019a,b) and iterative inference with graph encoding (Cai and Lam, 2019, 2020). Xu et al. (2020a) improved sequence-to-sequence parsing for AMR by using pre-trained encoders, reaching similar performance to Cai and Lam (2020). A stack-transformer was introduced to enhance transition-based AMR parsing (Ballesteros and Al-Onaizan, 2017), and Lee et al. (2020) improved it further, using a trained parser for mining oracle actions and combining it with AMR-to-text generation to outperform the state of the art. Wang et al. (2018) parsed Chinese AMR with a transition-based system. For cross-lingual AMR parsing, Blloshmi et al. (2020) trained an AMR parser similar to the approach of Zhang et al. (2019b), using cross-lingual transfer learning, outperforming the transition-based cross-lingual AMR parser of Damonte and Cohen (2018) on German, Spanish, Italian, and Chinese.
DRG is a novel graph representation format for DRS that was specially designed for MRP 2020 to make it structurally as close as possible to other frameworks (Abzianidze et al., 2020). However, several semantic parsers exist for DRS, which employ different encodings. Liu et al. (2018) used a DRG format that predominantly labels edges rather than nodes. van Noord et al. (2018) process DRSs in a clausal form, i.e. as sets of triples and quadruples. The latter format is more common among DRS parsers, as it was officially used by the shared task on DRS parsing (Abzianidze et al., 2019). The shared task gave rise to several DRS parsers: Evang (2019); van Noord (2019); Fancellu et al. (2019), among which the best results (F1 = 0.85) were achieved by the word-level sequence-to-sequence model with Transformer. Note that the DRS shared task used F1 calculated based on the DRS clausal forms, which is not comparable to MRP F1 over DRGs.
Similarly to English DRG, German DRG has not been used for semantic parsing prior to the shared task due to the new DRG format. Moreover, semantic parsing with German DRG is novel in the sense that its DRS counterpart is also new. In German DRG, concepts are grounded in English WordNet 3.0 (Fellbaum, 2012) senses assuming that synsets are language-neutral. The mismatch between German tokens and English lemmas of senses must be expected to add additional complexity to German DRG parsing.
Direct comparison to non-MRP results is impossible: we are using a new version of AMRbank, gold-standard tokenization is not provided for any of the frameworks, and we use the MRP scorer. However, general trends appear consistent with recent developments: pretrained embeddings and cross-lingual transfer help, but multi-task learning less so.
There is yet progress to be made in sharing information between parsers for different frameworks and making better use of their overlap.

Reflections and Outlook
The MRP series of shared tasks has contributed to general availability of accurate data-driven parsers for a broad range of different frameworks, with performance levels ranging between 0.76 MRP F1 (English UCCA) and 0.94 F1 (English EDS). Parsing accuracies in the cross-lingual track present comparable levels of performance, despite limited training data in the case of UCCA and DRG. Furthermore, the evaluation sets for most of the frameworks comprise different text types and subject matters, offering some hope of robustness to domain variation. We expect that these parsers will enable follow-up experimentation on the utility of explicit meaning representation in downstream tasks like, for example, relation extraction, argumentation mining, summarization, or text generation.
Maybe equally importantly, the MRP task design capitalizes on uniformity of representations and evaluation, enabling resource creators and parser developers to more closely (inter)relate representations and parsing approaches across a diverse range of semantic graph frameworks. This facilitates both quantitative contrastive studies (e.g. the 'postmortem' analysis by Buljan et al. (2020), which observes that top-performing MRP 2019 parsers have complementary strengths and weaknesses) and more linguistic, qualitative comparison. General availability of parallel gold-standard annotations over the same text samples, drawing from the WSJ and LPPS corpora, enables side-by-side comparison of linguistic design choices in the different frameworks. This is an area of investigation that we hope will see increased interest in the aftermath of the MRP task series, to go well beyond the impressionistic observations from §3 and ideally lead to contrastive refinement across linguistic schools and traditions.
Despite uniformity in packaging and evaluation, cumulative overall complexity and inherent diversity of the frameworks made participation in the shared task a formidable challenge. Of the sixteen teams who participated in MRP 2019, only four teams (predominantly strong performers from before) decided to submit parser outputs in 2020. The two 'newcomer' teams, by comparison, only made partial submissions in the cross-lingual track and ended up not competing for top ranks overall. Similar trends of 'competitive self-selection' and declining participant groups for consecutive instances have been observed with earlier CoNLL shared tasks and similar benchmarking series. On the upside, with the possible exception of English AMR (where there has been much contemporaneous progress recently), the MRP 2020 empirical results present a strong state-of-the-art benchmark for meaning representation parsing.
On the more foundational question of the relevance of explicit, discrete representations of sentence meaning, the past several years of breakthrough neural advances have been comparatively insensitive to syntactico-semantic structure. In our view, these developments have at least in part been reflective of the stark lack of general techniques for the encoding of hierarchical structure in end-to-end neural architectures. Increased adoption of Graph Convolutional Networks (Kipf and Welling, 2017) and other hierarchical modeling techniques suggests new opportunities for the exploration of both structurally informed end-to-end architectures and, e.g., multi-task learning setups. Beyond such ultimately performance-driven research, explicit encoding of syntactico-semantic structure in our view further bears promise in terms of model interpretability and safeguarding against 'neural meltdown' (e.g. discarding something as foundational as negation or inadvertently altering a date expression in summarization or translation). In a similar vein, meaning representations are being successfully applied in evaluation, e.g. to quantify system output vs. gold standard similarity beyond surface n-grams (Sulem et al., 2018b; Xu et al., 2020b, inter alios).
All technical information regarding the MRP 2020 shared task, including system submissions, detailed official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

Acknowledgments
Several colleagues have assisted in designing the task and preparing its data and software resources. We thank Dotan Dvir (Hebrew University of Jerusalem) for leading the annotation efforts on UCCA. Dan Flickinger (Stanford University) created fresh gold-standard annotations of some 1,000 WSJ strings, which form part of the EDS evaluation graphs in 2020. Sebastian Schuster (Stanford University) advised on how to convert the gold-standard syntactic annotations from the venerable PTB and OntoNotes treebanks to Universal Dependencies, version 2.x, using 'modern' tokenization. Anna Nedoluzhko and Jiří Mírovský (Charles University in Prague) enhanced the PTG annotation of LPPS data with previously missing items, most notably coreference. Milan Straka (Charles University in Prague) made available an enhanced version of his UDPipe parser and assisted in training Czech, English, and German morpho-syntactic parsing models (for the MRP companion trees). Jayeol Chun (Brandeis University) provided invaluable assistance in conversion of the Chinese AMR annotations, preparation of the Chinese morpho-syntactic companion trees, and provisioning of companion alignments for the English AMR graphs.
We are grateful to the Nordic Language Processing Laboratory (NLPL) and Uninett Sigma2, which provided technical infrastructure for the MRP 2020 task. Also, we warmly acknowledge the assistance of the Linguistic Data Consortium (LDC) in distributing the training data for the task to participants at no cost to anyone.
The work on UCCA and the HUJI-KU submission was partially supported by the Israel Science Foundation (grant No. 929/17). The work on PTG has been partially supported by the Ministry of Education, Youth and Sports of the Czech Republic (project LINDAT/CLARIAH-CZ, grant No. LM2018101) and partially by the Grant Agency of the Czech Republic (project LUSyD, grant No. GX20-16819X). The work on DRG was supported by the NWO-VICI grant (288-89-003) and the European Union Horizon 2020 research and innovation programme (under grant agreement No. 742204). The work on Chinese AMR data is partially supported by project 18BYY127 under the National Social Science Foundation of China and project 61772278 under the National Science Foundation of China.