Penman: An Open-Source Library and Tool for AMR Graphs

Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a framework for semantic dependencies that encodes its rooted and directed acyclic graphs in a format called PENMAN notation. The format is simple enough that users of AMR data often write small scripts or libraries for parsing it into an internal graph representation, but there is enough complexity that these users could benefit from a more sophisticated and well-tested solution. The open-source Python library Penman provides a robust parser, functions for graph inspection and manipulation, and functions for formatting graphs into PENMAN notation. Many functions are also available in a command-line tool, thus extending its utility to non-Python setups.


Introduction
Abstract Meaning Representation (AMR; Banarescu et al., 2013) is a framework for encoding English-language[1] meaning as structural-semantic graphs, using a fork of PropBank (Kingsbury and Palmer, 2002; O'Gorman et al., 2018) for its semantic frames along with additional AMR-specific roles. The graphs are connected and directed, with node and edge labels; they may have multiple roots but always have exactly one distinguished top node. AMR corpora, such as the recent AMR Annotation Release 3.0 (LDC2020T02),[2] encode the graphs in a format called PENMAN notation (Matthiessen and Bateman, 1991). PENMAN notation is a text stream and is thus linear, but it first uses bracketing to capture a spanning tree over the graph, then inverted edge labels and references to node IDs to capture re-entrancies. Proper interpretation of the "pure" graph therefore requires the deinversion of inverted edges and the resolution of node IDs. Some tools that work with AMR use the interpreted pure graph (Song and Gildea, 2019; Chiang et al., 2013), but many others work at the tree level for surface alignment (Flanigan et al., 2014), for transformations from syntax trees (Wang et al., 2015), or to make use of tree-based algorithms (Pust et al., 2015; Takase et al., 2016). Others, particularly sequential neural systems (Konstas et al., 2017; van Noord and Bos, 2017), use the linear form directly.

[1] Variations exist for other languages (e.g., Li et al., 2016), but AMR is primarily English and is not an interlingua (Xue et al., 2014).
[2] https://catalog.ldc.upenn.edu/LDC2020T02
Furthermore, while AMRs ostensibly describe semantic graphs abstracted away from any particular sentence's surface form, human annotators tend to "leak information" (Konstas et al., 2017) about the source sentence. This means that an annotator might be expected to produce the AMR in Fig. 1 for sentence (1), but then swap the relative order of the adjunct relations :location and :time for (2).[3] Van Noord and Bos (2017) embraced these biases and intentionally reordered relations, even frame arguments such as :ARG0 and :ARG1, by their surface alignments, leading to a boost in their evaluation scores.
(1) I swam in the pool today.
(2) Today I swam in the pool.

As illustrated above, work involving AMR may use the PENMAN string, the tree structure, or the pure graph, and possibly more than one of these representations. This paper therefore describes and demonstrates Penman, a Python library and command-line utility for working with AMR data at both the tree and graph levels and for encoding and decoding these structures using PENMAN notation. Converting a tree into a graph loses information that the tree implicitly encodes, so Penman introduces the epigraph:[4] optional information that exists on top of the graph and controls how the pure graph is expressed as a tree. Penman is freely available under a permissive open-source license at https://github.com/goodmami/penman/.

Decoding and Encoding Graphs
Penman uses three-stage processes to decode PENMAN notation to a graph and to encode a graph to PENMAN, as illustrated in Fig. 2. Parsing is the process of getting a tree from a PENMAN string and interpretation is getting a graph from a tree, while decoding is the whole string-to-graph process. Going the other way, configuration is the process of getting a tree from a graph and formatting is getting a string from a tree, while encoding is the whole graph-to-string process. Splitting the decoding and encoding processes into two steps each allows one to work with AMR data at any stage. The variant of PENMAN notation used by Penman is described in §2.1. The tree, graph, and epigraph data structures are described in §2.2. Getting a tree from a string (and vice versa) depends only on understanding PENMAN notation, but getting a graph from a tree (and vice versa) requires an understanding of the semantic model. Semantic models are described in §2.4.

PENMAN Notation
The Penman project uses a less strict variant of PENMAN notation than the one used by AMR, in order to robustly handle some kinds of erroneous output from AMR parsers. The syntactic and lexical rules for PENMAN notation, in PEG syntax,[5] are shown in Fig. 3. Optional whitespace (not shown) is allowed around expressions in the syntactic rules.
In AMR, the Concept expression on Node, the Atom expression on Concept, and the (Node / Atom) expression on Reln are obligatory, but they are optional for Penman and will get a null value when missing. Also in AMR, the initial Symbol on Node may be further constrained with a specific Variable pattern for node identifiers and the Symbol in Atom would become a choice: Variable / Symbol. How Penman handles variables is discussed in §2.2.
AMR corpora conventionally use blank lines to delineate multiple graphs, but Penman relies on bracketing instead and whitespace is not significant. Penman also parses comments (not described in Fig. 3), which are lines prefixed with # characters, and extracts metadata where keys are tokens prefixed with two colons (e.g., ::id) and values are anything after the key until the next key or a newline.

Decoding: Trees, Graphs, and Epigraphs
In Penman, a tree data structure is an ⟨n, B⟩ tuple where n is the node's identifier (variable) and B is a list of branches. Each branch is an ⟨l, b⟩ tuple where l is a branch label (a possibly inverted role) and b is a (sub)tree or an atom. The first branch in B is the node's concept, so a tree is a near-direct conversion of the Node rule in Fig. 3, where B is the concatenation of Concept and Reln. The tree corresponding to the AMR in Fig. 2 is shown in Fig. 4.
A graph is a tuple ⟨v, T⟩ where v is the top variable and T is a flat list of triples. For each triple ⟨s, r, t⟩, the source s is always the head variable of a dependency, r is the normalized role, and t is the dependent. When interpreting a triple from a tree branch, n becomes s and b becomes t, unless the branch label l must be deinverted according to the semantic model (described in §2.4) to produce r, in which case s and t are swapped. In the graph, t is designated a variable if it appears as the source of any other triple; otherwise it is an atom. Triples where t is a variable are called edge relations. If t is an atom and r is the special role :instance, then t is the node's concept and the triple is an instance relation. All other triples are attribute relations. Fig. 5 shows the graph corresponding to the AMR in Fig. 2.

Conversion from a PENMAN string to a tree is straightforward: the only information lost in parsing is formatting details such as the amount of whitespace. The interpretation of a graph from a tree, however, loses information about the specific tree configuration for the graph, as there are often many possible configurations for the same graph. Therefore, upon interpretation, Penman stores the information that would be lost in two places: in the order of triples (meaning the graph's triples are a sequence, not an unordered bag or set), and in the epigraph, which is a mapping of triples to lists of epigraphical markers. The choice of the term epigraph is by analogy to the epigenome: just as epigenetic markers control how genes are expressed in an organism, epigraphical markers control how triples are expressed in a tree. When interpreting a graph from a tree, if a branch's target is another subtree (e.g., when ( is encountered in the string), a Push marker is assigned to the triple resulting from the branch, indicating that the triple pushed a new node context onto a stack representing the tree structure.
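Written out as plain Python tuples and lists, the two structures for a small AMR look like the following; the '/' concept label and the :instance role follow the conventions described above:

```python
# Structures for "(s / swim-01 :ARG0 (i / i))".

# Tree: an (n, B) pair of a node variable and its branches;
# the first branch holds the concept, labeled with '/'.
tree = ('s', [('/', 'swim-01'),
              (':ARG0', ('i', [('/', 'i')]))])

# Graph: a (v, T) pair of the top variable and a flat, ordered
# list of triples; concept triples use the special :instance role.
graph = ('s', [('s', ':instance', 'swim-01'),
               ('s', ':ARG0', 'i'),
               ('i', ':instance', 'i')])
```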
The final triple resulting from the branches of the subtree, even considering further nested subtrees (e.g., at the point where ) is encountered in the string), gets a Pop marker indicating the end of the node context. In addition to tree-layout markers, the epigraph is also where surface alignment information is stored, as these alignments are not part of the pure graph. Fig. 6 shows the epigraph for the AMR in Fig. 2.

[Figure 6: Epigraph structure for the AMR in Fig. 2]

Encoding: No Surprises
When configuring a tree from a graph, the epigraph is used to control where triples occur in the tree. If at each step the layout markers in the epigraph allow the configuration process to navigate a tree with no surprises (that is, when the source or target of each triple is the current node on a node-context stack), then it will produce the same tree that was decoded to get the graph.[6] Otherwise, such as when a graph is modified or constructed without an epigraph, the algorithm switches to another procedure that repeatedly passes over the list of remaining triples and configures those whose source or target is already in the tree under construction. If no triples are inserted in a pass, the remaining triples are discarded and a warning is logged that the graph is disconnected. The semantic model is used to properly configure inverted branches as necessary.
Once a tree is configured, formatting it to a string is simple, and users may customize the formatter to adjust the amount of whitespace used. The default indentation width is an adaptive mode that indents based on the initial column of the current node context; otherwise an explicit width is multiplied by the nesting level, or a user may select to print the whole AMR on one line. Another customization option is a "compact" mode which joins any attribute relations, but not edges, that immediately follow the concept onto the same line as the concept.

Semantic Models
In order to interpret a tree into a graph, a semantic model is used to produce normalized, or deinverted, triples. Penman provides a default model which only checks whether the role ends in -of (the conventional indicator of role inversion in PENMAN notation). Ideally this would be all that is needed, but AMR defines several primary (non-inverted) roles ending in -of, such as :consist-of and :prep-on-behalf-of, whose inverted forms are :consist-of-of and :prep-on-behalf-of-of, respectively. The model therefore first checks whether a role is listed as a primary role; if it is not and it ends in -of, it is inverted; otherwise it is not. When the role of a triple is deinverted, Penman also swaps its source and target so the dependency relation remains intact.

[6] There is currently one known situation where this is not the case: when a graph has duplicate triples with the same source, role, and target, as the epigraph cannot uniquely map the triple to its epigraphical markers. These, however, are likely bad graphs in AMR.
The model has other uses, such as inverting triples (useful when encoding), defining transformations as described in §3, and checking graphs for compliance with the model. In addition to the default model, Penman includes an AMR model with the roles and transformations defined in the AMR documentation.[7]

Graph and Tree Transformations
Goodman (2019a) described four transformations of AMR graphs and trees, namely role canonicalization, edge reification (including dereification), attribute reification, and tree structure indication,[8] and how they could be used to improve the comparability of parser-produced AMR corpora by normalizing differences that are meaning-equivalent in AMR and by allowing for partial credit when, for example, a relation has a correct role but an incorrect target value. Penman incorporates all of those transformations, but it (a) depends on the semantic model to define canonical roles and reifications, whereas Goodman (2019a) used hardcoded transformations; and (b) inserts layout markers for a "no-surprises" configuration that results in the expected tree. Because models are defined separately, Penman can use the same transformation methods with different versions of AMR, for different tasks, or even with non-AMR representations, simply by swapping in a different model. For the implementation details of these transformations, refer to Goodman (2019a).
In addition to those four transformations, Penman adds a few more methods. The rearrange method operates on a tree and sorts its branches by their branch labels; only the order of the branches changes, and their structure is otherwise untouched. Van Noord and Bos (2017) similarly rearranged tree branches based on surface alignments. The reconfigure method configures a tree from a graph after discarding the layout markers in the epigraph and sorting the triples based on their roles. Unlike rearrange, reconfigure affects the entire structure of the tree except for which node is the graph's top. For both of these methods, the sorting functions are defined by the model, and Penman includes three of them: original order, random order, and canonical order. For rearrange there are additional sorting functions applicable to trees: alphanumeric order, attributes-first order, and inverted-last order. Since node variables in AMR are conventionally assigned in order of their appearance, and the above methods can change this order, the reset-variables method reassigns the variables based on the new tree.

Use Cases
Here I describe a handful of use cases that motivate the use of Penman.

Graph Construction
Users of the Penman library can programmatically construct graphs and then encode them to PENMAN notation. Penman allows users to directly append to the list of triples and assign epigraphical markers, or to assemble small graphs and use set-union operations to combine them. Another option is to assemble the tree directly, which may make more sense for some systems. Once the tree is configured or constructed, users can apply transformations such as rearrange and reset-variables to make the PENMAN string more canonical in form. Fig. 7 illustrates using the Python API to construct and encode a graph. Another possibility is graph augmentation, where users rely on Penman to parse a string into a graph which they then modify, e.g., to add surface alignments or wiki links, before serializing it to a string again. This allows them to focus on their primary task without worrying about the details of parsing and formatting.

Graph Validation
Whether one is generating AMR graphs by hand annotation or by automatic means, the end result is not guaranteed to be valid with respect to the model, so Penman offers a function to check for compliance. Currently, this check evaluates three criteria:

1. Is each role defined by the model?
2. Is the top set to a node in the graph?
3. Is the graph fully connected?
To facilitate both library and tool usage, the library function returns a dictionary mapping triples (for context) to error messages, as shown in Fig. 8, while the tool encodes the errors as metadata comments and has a nonzero exit-code on errors.

Formatting for a Consistent Style
The official AMR corpora, such as the AMR Annotation Release 3.0, are distributed with the graphs serialized in a human-readable style that uses increasing levels of indentation to show the nesting of subgraphs. Furthermore, relations on a node appear in a canonical order depending on their roles (e.g., ARG1 appears before ARG2) or their surface alignments, where the appearance of a node roughly follows the order of corresponding words in a sentence. The rearrange and reconfigure transformations can change the order of relations in the graph to be more canonical, the reset-variables method can ensure variable forms are as expected, and the whitespace options of tree formatting can emulate the same indentation style as the official corpora. These features may be useful for users distributing new AMR corpora.

Normalization for Fairer Evaluation
The normalization options in §3 can be useful when evaluating the results of AMR parsing, as described in Goodman (2019a). Penman is thus well-situated as a preprocessor to an evaluation step using, e.g., smatch, SemBLEU (Song and Gildea, 2019), or SEMA (Anchiêta et al., 2019).

Preprocessing for Machine Learning
Sequential neural models which use linearized AMR graphs have been popular for both parsing and generation (Barzdins and Gosko, 2016;Peng et al., 2017;Konstas et al., 2017;van Noord and Bos, 2017;Song et al., 2018;Damonte and Cohen, 2019;Zhang et al., 2019), but data sparsity is a significant issue (Peng et al., 2017). One way to address data sparsity is to remove senses on concepts (Lyu and Titov, 2018). Fig. 10 shows how the Python API can remove these senses in the tree.

Applicability beyond AMR
This paper has described PENMAN as a notation for encoding AMR graphs, but it is also applicable to other dependency graphs that share the same constraints (e.g., connected, directed). PENMAN notation can encode Dependency Minimal Recursion Semantics (DMRS; Copestake, 2009; Copestake et al., 2016), such as for learning graph-to-graph machine translation rules (Goodman, 2018) and neural generation (Hajdik et al., 2019), and it can encode Elementary Dependency Structures (EDS; Oepen et al., 2004; Oepen and Lønning, 2006), as shown in Fig. 11 using PyDelphin (Goodman, 2019b) for conversion. It is also useful for extensions of AMR, such as Uniform Meaning Representation (UMR; Pustejovsky et al., 2019).

[Figure 11: Example of EDS in PENMAN notation]

Conclusion
In this paper I have presented Penman, a Python library and command-line tool for working with AMR and other graphs serialized in the PENMAN format. Existing work on AMR has targeted the PENMAN string, the parsed tree, or the interpreted graph, and Penman accommodates all of these use cases by allowing users to work with the tree or graph data structures or to encode them back to strings. Transformations defined at both the graph and tree levels make it applicable for pre- and post-processing steps in corpus creation, evaluation, machine learning projects, and more. Penman is available under the MIT open-source license at https://github.com/goodmami/penman. Interactive notebook demonstrations and informational videos are available at https://github.com/goodmami/penman#demo.