Hitachi at MRP 2020: Text-to-Graph-Notation Transducer

This paper presents our proposed parser for the shared task on Meaning Representation Parsing (MRP 2020) at CoNLL, where participating systems were required to parse five types of graphs in different languages. We propose to unify these tasks as text-to-graph-notation transduction, in which an input text is converted into a graph notation. To this end, we designed a novel Plain Graph Notation (PGN) that handles various graphs universally. Our parser then predicts a PGN-based sequence by leveraging Transformers and biaffine attention. Notably, the parser can handle any PGN-formatted graph with minimal framework-specific modification. As a result, ensemble versions of the parser tied for 1st place in both the cross-framework and cross-lingual tracks.


Introduction
This paper introduces the proposed parser of the Hitachi team for the CoNLL 2020 Cross-Framework Meaning Representation Parsing (MRP 2020) shared task (Oepen et al., 2020). Unlike the previous MRP 2019 shared task (Oepen et al., 2019), MRP 2020 consists of two tracks. The first is a cross-framework track that aims at parsing English sentences into five different meaning representation graphs, i.e., EDS (Oepen and Lønning, 2006), PTG (Hajič et al., 2012), UCCA (Abend and Rappoport, 2013; Hershcovich et al., 2017), AMR (Banarescu et al., 2013), and DRG (Van Der Sandt, 1992; Bos et al., 2017). The other is a cross-lingual track that targets four different frameworks in three languages, i.e., German UCCA and DRG, Chinese AMR (Li et al., 2016), and Czech PTG. In both tracks, the goal is to design a unified parser for all graphs.

* Contributed equally. Ozaki mainly developed the PGN parser; Morio mainly developed the neural models.

Figure 1: Given an input text, we tokenize and encode the text with a pre-trained Transformer encoder (e.g., BERT). A Transformer decoder is then applied to produce a Plain Graph Notation (PGN) that is convertible into a general graph.
In this paper, we propose a novel parser that unifies graph prediction across all frameworks and languages. To this end, we introduce a text-to-graph-notation transduction. The overview of our parser is shown in Figure 1. Our parser utilizes a sequence-to-sequence Transformer architecture (Vaswani et al., 2017) to generate a plain graph notation (PGN), which we newly designed as a context-free language. PGN is a simplified notation based on the PENMAN notation (Matthiessen and Bateman, 1991) that is generally used for AMR graphs; however, PGN is tailored for direct generation by a sequence-to-sequence architecture.
Our parser is expected to combine the strengths of both neural graph-based and transition-based parsers (McDonald et al., 2005; Yamada and Matsumoto, 2003; Kulmizev et al., 2019; Ma et al., 2018). This is because the Transformer decoder directly draws attention like a graph-based parser, while also handling higher-level effects of graph structure through sequence prediction, like a transition-based parser. Moreover, our parser can parse most graph variants in a unified manner. For example, it can predict directed acyclic graphs, disconnected graphs, directed multigraphs, reentrancy edges (Vilares and Gómez-Rodríguez, 2018), and source-side anchors without complicated language-dependent architectures.
Consequently, ensemble versions of our parser officially tied for 1st place in both the cross-framework and cross-lingual tracks, achieving the top performance for English EDS, PTG, and AMR graphs. We summarize our other contributions as follows:

Alignment Free: PGN generation allows us to achieve completely alignment-free parsing.

Action Design Free: Compared with a transition-based parser, there is no need to design a complex transition strategy.

Fast Training: Since we leverage attention, training is faster than for a transition-based parser.

Among systems from the previous MRP 2019 task, SUDA-Alibaba (Zhang et al., 2019c) proposed a graph-based approach with BERT (Devlin et al., 2019), using biaffine attention (Dozat and Manning, 2018) for edge prediction. Donatelli et al. (2019) employed a compositional approach that represents each graph with its compositional tree structure.

Comparison with Other Systems
Like Zhang et al. (2019a), we model a context-free language instead of a sequence of transition actions, and parser states can be regarded as being implicitly materialized inside BERT's memory. However, our parser jointly generates nodes and edges based on PGN, making the system consider higher-level effects of graph structures.
The work most closely related to our study is Zhang et al. (2019b), where the authors provided an encoder-decoder architecture to predict a sequence of semantic relations, employing target node-, relation type-, and source node-modules in the decoder. Similar to our study, Zhang et al. (2019b) encoded node and edge representations with these modules. While they jointly predicted node and edge labels, our parser outputs node and edge labels separately. In addition, they used LSTMs, whereas we use Transformers that can draw attention from both past node and edge representations in the decoder. Also, while Zhang et al. (2019b) solved reentrancies by producing the same node ID, we solve them with a biaffine classifier, i.e., with attention.

The biggest difference between the transition-based architectures for MRP (Che et al., 2019) and our work is that we designed PGN to unify all graph generation processes and eliminate the need for framework-specific actions. In addition, Che et al. (2019) relied on explicit alignment between input tokens and nodes, whereas our model uses biaffine attention for anchoring only when it is necessary, allowing our model to be alignment-free.

Figure 2: PGN grammar described in EBNF. Essentially, a graph can be represented by a set of edges; however, to support floating nodes, we define a graph as a set of edges and floating nodes.

Table 1: PGN processors.
Name         Function
attr2name    Append a -{attr name} suffix to the edge label instead of keeping attributes.
prop2node    Make node properties independent nodes linked by edges named after the properties.
embed label  Replace each node id in PGN with {node id}/{node label}.

Format Design
To represent a graph as a text sequence, we newly designed a notation, called Plain Graph Notation (PGN), with the following key principles.

Simpler Format: Similar to the PENMAN notation (Matthiessen and Bateman, 1991; Goodman, 2020), which represents AMR graphs as text sequences, PGN is based on a context-free grammar. However, for simplicity, PGN represents only the graph structure (namely, all edges in the graph). All node properties are omitted from PGN, and we preserve the properties separately. In addition, we reduce redundant meta-tokens in the notation as much as possible. Figure 2 shows the Extended Backus-Naur Form (EBNF; https://www.iso.org/obp/ui/#iso:std:iso-iec:14977:ed-1:v1:en) of the PGN grammar.

Tree-Like Structure: We employ an essentially tree-like structure (we define a "tree-like" graph as one whose root node is always an ancestor of all nodes in the graph) because any spanning graph with a root can be converted into a tree-like structure by flipping the directions of appropriate edges. This structure is useful when converting graphs to PGN.

Left-to-Right Decodable: To make our parser robust, we design PGN so that a notation can be converted into a graph in a left-to-right manner. This lets us decode an ill-formatted sequence with a best-effort strategy. We briefly explain this algorithm in a later subsection.

Graph to PGN Conversion
MRP graph to PGN conversion starts from the top nodes. We recursively apply the grammar shown in Figure 2 from parents to children by depth-first search. During the search, when choosing the next path, we visit child nodes in increasing order of the number of outgoing edges among all their descendants (i.e., we visit the shallowest branch first). Since we assume the input graph has a tree-like structure, finding children simply amounts to extracting outgoing edges. However, several frameworks such as EDS, DRG, and AMR may not form a tree-like structure. Thus, we provide an option that uses all edges instead of only outgoing edges, flipping the direction of incoming edges (and appending an "-of" suffix to the labels of flipped edges). To deal with reentrancy, the recursive search expands a node only when it first appears (a minimal code sketch of this traversal is given at the end of this subsection). Here we describe further framework-specific modifications.

Floating Nodes: We found that some EDS and AMR graphs have floating nodes for which no incoming or outgoing edges are annotated. Thus, the PGN grammar supports floating nodes.

Floating Sub-Graphs: We found that some EDS graphs have floating sub-graphs with no connection to the top. Therefore, we add a temporary top node for each floating sub-graph, chosen on the basis of the following criteria.
1. The first predicate node that has a frame property (first priority).
2. The node with the smallest ID in the sub-graph (second priority).
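To make the traversal concrete, the following is a minimal Python sketch of the conversion. The exact PGN token order is defined by the grammar in Figure 2; here we simply assume an order (parent id, child id, edge label, [EOD], ..., [EOG]) that is consistent with the stack actions described in the next subsection. Function and variable names are illustrative, not those of our implementation.

```python
def graph_to_pgn(top, edges):
    """Convert a rooted MRP-style graph into a PGN-like token sequence (sketch).

    edges: list of (source, target, label) triples; top: id of the top node.
    Incoming edges are flipped (their label gets an "-of" suffix) so the
    traversal can treat the graph as tree-like; a node is expanded only the
    first time it appears, which handles reentrancy.
    """
    adj = {}  # node -> list of (edge index, neighbour, label as seen from node)
    for idx, (src, tgt, label) in enumerate(edges):
        adj.setdefault(src, []).append((idx, tgt, label))
        adj.setdefault(tgt, []).append((idx, src, label + "-of"))

    def branch_size(node, seen):
        # rough "shallowness" estimate used to visit shallow branches first
        if node in seen:
            return 0
        seen.add(node)
        return 1 + sum(branch_size(nb, seen) for _, nb, _ in adj.get(node, []))

    tokens, expanded, used = [], set(), set()

    def visit(node):
        tokens.append(str(node))            # node id (ADD at decoding time)
        if node in expanded:                # reentrant mention: do not expand again
            return
        expanded.add(node)
        branches = sorted(adj.get(node, []), key=lambda e: branch_size(e[1], set()))
        for idx, neighbour, label in branches:
            if idx in used:                 # each original edge is emitted once
                continue
            used.add(idx)
            visit(neighbour)                # child node id
            tokens.append(label)            # edge label (ARC between top two stack items)
            tokens.append("[EOD]")          # pop the child
    visit(top)
    tokens.append("[EOG]")                  # clear the stack, ending the graph
    return tokens

# e.g. graph_to_pgn(0, [(0, 1, "ARG0"), (0, 2, "ARG1"), (1, 2, "mod")])
```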

Left-to-Right Decoding
The left-to-right decoding system consists of a stack and an input stream. Every time a token is read from the input stream, we take one of four actions: ADD (push the token onto the stack), ARC (create an edge between the top two tokens on the stack), POP (pop the top token off the stack), or CLEAR (pop all tokens from the stack and add them to the node list). These actions correspond to the node id, edge label, [EOD], and [EOG] tokens in Figure 2, respectively.
Since neural networks may produce ill-formatted PGN, the left-to-right decoding recovers as many edges as possible. If the sequence is ill-formed, for example if it terminates with a non-empty stack, we generate additional edges according to the remaining stack state. A minimal sketch of this decoding procedure follows.
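This Python sketch uses the same illustrative token layout as the conversion sketch above; the placeholder edge label used in the best-effort recovery is hypothetical.

```python
def pgn_to_graph(tokens):
    """Decode a PGN-like token sequence into (nodes, edges) left to right (sketch).

    ADD   = push a node id onto the stack
    ARC   = create an edge between the top two stack items (edge-label token)
    POP   = [EOD], pop the top node
    CLEAR = [EOG], pop all remaining nodes into the node list
    """
    stack, nodes, edges = [], set(), []
    for tok in tokens:
        if tok == "[EOD]":                        # POP
            if stack:
                nodes.add(stack.pop())
        elif tok == "[EOG]":                      # CLEAR
            nodes.update(stack)
            stack.clear()
        elif tok.isdigit():                       # ADD (node id)
            stack.append(tok)
        else:                                     # ARC (edge label)
            if len(stack) >= 2:
                parent, child = stack[-2], stack[-1]
                edges.append((parent, child, tok))
    # Best-effort recovery for ill-formed sequences: if decoding ended with a
    # non-empty stack, connect the remaining nodes so no material is lost.
    while len(stack) >= 2:
        child = stack.pop()
        edges.append((stack[-1], child, "UNK"))   # "UNK" is a hypothetical placeholder label
        nodes.add(child)
    nodes.update(stack)
    return nodes, edges
```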

PGN Processors
We define PGN processors, a set of invertible functions that apply small modifications to PGN-formatted sequences. Table 1 lists all PGN processors and their descriptions. To better understand these processors, Figure 3 (a2 and b2) shows example PGN-formatted graphs for EDS and AMR. Actual PGN expressions are lists of serialized tokens, but here we add indentation for readability. According to the PGN grammar, a node id should consist of digits representing a node ID, but we insert node labels with the embed label processor. In Figure 3 (a2), two edges in the graph are flipped to form a tree-like structure, i.e., (_next_a_1, _sell_v_1, ARG1) and (udef_q, _jacket_n_1, BV). Also, Figure 3 (a2 and b2) depicts a larger number of nodes than the original MRP graphs because node properties are converted into additional nodes by the prop2node processor, e.g., 6/pres and 7/pl in the EDS graph.
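As an illustration, the following sketch shows how one processor pair (embed label and its inverse) might be realized. The helper names and the assumption that bare node ids consist only of digits follow the grammar description above, but the actual implementation may differ.

```python
def embed_label(tokens, labels):
    """Rewrite bare node id tokens into "{node id}/{node label}" (sketch).

    labels: dict mapping node id (str) -> node label (str).
    """
    return [f"{t}/{labels[t]}" if t in labels else t for t in tokens]

def unembed_label(tokens):
    """Inverse of embed_label: recover bare node ids and a node-label table."""
    labels, plain = {}, []
    for t in tokens:
        if "/" in t and t.split("/", 1)[0].isdigit():
            node_id, label = t.split("/", 1)
            labels[node_id] = label
            plain.append(node_id)
        else:
            plain.append(t)
    return plain, labels

# e.g. embed_label(["0", "1", "ARG0", "[EOD]", "[EOG]"], {"0": "want-01", "1": "boy"})
#      -> ["0/want-01", "1/boy", "ARG0", "[EOD]", "[EOG]"]
```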

PGN to Prediction Sequence
Although existing text generation techniques could generate PGN as is, we further modify PGN expressions to obtain prediction sequences better suited to a neural decoder. Figure 3 (a3 and b3) shows example prediction sequences derived from PGN. As can be seen, we split each node label into multiple tokens (e.g., subword tokens) and add several special tokens. We add an end-of-node token ([EON]) just after the last subword element so that the parser knows where node label generation terminates. Since [EON] marks the end of a node, we treat it as the node's representative token, which is later used by the reentrancy and property classifiers (described below). To handle anchors, we add placeholder tokens for the anchor start and end ([AS] and [AE]) before the node label tokens. When our parser predicts [AS] and [AE] tokens, we resolve the anchors with a biaffine classifier described below. We also add a placeholder token ([RNT]) from which a reentrancy edge is generated after all decoding steps have been completed.
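The sketch below illustrates how a single node mention could be turned into prediction tokens under these rules. The subword tokenizer and the treatment of reentrant mentions (emitting only [RNT]) are assumptions; the special tokens follow the description above.

```python
def node_to_prediction_tokens(label, tokenize, anchored=True, reentrant=False):
    """Turn one PGN node mention into prediction-sequence tokens (sketch).

    tokenize: any subword tokenizer, e.g. a HuggingFace tokenizer's .tokenize.
    anchored: whether the framework uses source-side anchors (e.g., EDS, UCCA).
    """
    if reentrant:
        return ["[RNT]"]              # placeholder resolved later by the biaffine classifier
    tokens = []
    if anchored:
        tokens += ["[AS]", "[AE]"]    # anchor start / end placeholders
    tokens += tokenize(label)         # node label split into subword tokens
    tokens.append("[EON]")            # end-of-node: the node's representative token
    return tokens

# e.g. node_to_prediction_tokens("want-01", lambda s: s.split("-"), anchored=False)
#      -> ["want", "01", "[EON]"]
```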

Problem Formalization
We describe the conceptual formalization similarly to Zhang et al. (2019b). Given an input sequence X (i.e., the tokens of the text), we predict an output sequence Ŷ = y_1, y_2, . . . , y_n, where each y is a tuple (y^mode, y^G, y^E, y^L, y^AS, y^AE) consisting of a mode label (y^mode), mode-wise labels (y^G, y^E, and y^L), and indices of source-side tokens for anchoring (y^AS and y^AE).

Overview
To generate the prediction sequence, we provide a sequence-to-sequence model. Figure 4 illustrates an example of AMR parsing (i.e., the graph of Figure 3 (b3)). Our parser is based on a typical encoder-decoder architecture but introduces several components on the decoder side. Given an input text, our parser encodes the tokens with a pre-trained language model (PLM) such as BERT (Devlin et al., 2019). At decoding time, a Transformer decoder produces the prediction sequence. To effectively control the decoder, we propose a mode switching mechanism: at the i-th decoding step, the decoder first selects a mode, and a classification layer of the selected mode is then applied to predict the i-th output. For example, if mode E is selected, an edge classifier on the decoder produces an edge label. If mode AS is selected, an anchoring classifier on the decoder produces an anchor starting index over the encoder's subword tokens. If mode R is selected, we do nothing but generate a placeholder token; instead, after all decoding steps, we apply biaffine attention to resolve the reentrancy edges for the placeholder tokens.
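The following Python sketch outlines the greedy mode-switching loop under stated assumptions (e.g., that graph meta tokens such as [EOG] are produced under mode G). The step and embed callables stand in for the Transformer decoder with its output heads and for the input-embedding layer; they are not part of the actual implementation.

```python
from typing import Callable, List, Tuple

def greedy_decode(step: Callable[[list], dict],
                  embed: Callable[[str, object], object],
                  bos,                      # initial decoder input embedding
                  max_steps: int = 512) -> List[Tuple[str, object]]:
    """Greedy mode-switching decoding loop (illustrative sketch).

    step(inputs) -> {"mode": str, "G": ..., "E": ..., "L": ..., "AS": int, "AE": int}
        runs the decoder on the inputs so far and returns, for the current step,
        the argmax mode and the argmax label/index of every mode-wise head.
    embed(mode, value) -> embedding used as the next decoder input.
    """
    inputs, outputs = [bos], []
    for _ in range(max_steps):
        pred = step(inputs)
        mode = pred["mode"]                      # one of G, E, L, AS, AE, R
        value = "[RNT]" if mode == "R" else pred[mode]
        outputs.append((mode, value))            # mode R emits only a placeholder
        inputs.append(embed(mode, value))        # feed the chosen output back in
        if mode == "G" and value == "[EOG]":     # assumed end-of-graph condition
            break
    # Reentrancy edges for the [RNT] placeholders are added afterwards by a
    # biaffine classifier over decoder states (see the Reentrancy Classifier).
    return outputs
```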

Encoder
Given an input text, a PLM-specific tokenizer tokenizes the text into the token sequence X. Note that we insert special tokens such as [CLS] and [SEP] according to the PLM type. To obtain PLM representations, layer-wise attention is applied (Kondratyuk and Straka, 2019; Peters et al., 2018):

h_PLM,i = c · Σ_j softmax(s)_j · PLM_ij,

where s and c are trainable parameters. Note that h_PLM,i ∈ R^d(PLM), where d(PLM) denotes the dimensionality of the PLM layers. PLM_ij is the embedding of the i-th token in the j-th PLM layer, and 1 ≤ i ≤ N, where N is the number of tokens.
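A minimal PyTorch sketch of this layer-wise attention (a scalar mix in the style of Kondratyuk and Straka, 2019); initialization and other details of the submitted system may differ.

```python
import torch
import torch.nn as nn

class LayerWiseAttention(nn.Module):
    """Softmax-weighted mixture of PLM layers, scaled by a scalar c (sketch)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # per-layer weights
        self.c = nn.Parameter(torch.ones(1))              # global scale

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, seq_len, hidden) stacked PLM hidden states
        weights = torch.softmax(self.s, dim=0)             # (num_layers,)
        mixed = (weights[:, None, None] * layers).sum(0)   # (seq_len, hidden)
        return self.c * mixed                               # h_PLM for each token
```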

Decoder
We employ a Transformer decoder to fully utilize the self-attention mechanism. The decoder includes a mode-switching architecture that makes it explicitly learn structural representations.

Decoder Input Representation: For each decoding step i, we compute an input representation for the Transformer decoder by concatenation:

e_i = [e^mode_i ; e^G_i ; e^E_i ; e^L_i ; e^AS_i ; e^AE_i ; e^depth_i],

where ; denotes a concatenation operation, and each part is obtained as follows.

• e^mode_i, e^G_i, e^E_i, e^L_i: These are embeddings of the previously predicted mode and mode-wise labels, e.g., e^G_i = EMB^G(y^G_i), where EMB is a layer that maps a label into a fixed-size vector, and W and b below are parameters.

• e^AS_i, e^AE_i: These are input embeddings of source-side anchors:

e^AS_i = W^(AS) h_PLM,k + b^(AS),   e^AE_i = W^(AE) h_PLM,k + b^(AE),

where W^(AS), W^(AE) ∈ R^{d(PLM)×d(PLM)} and b^(AS), b^(AE) ∈ R^{d(PLM)} are trainable parameters, and k is the index of the anchor starting token for AS or of the anchor ending token for AE. The Transformer decoder therefore draws attention from the encoder representations of source-side anchored tokens. Note that AS and AE are also exclusive.

• e^depth_i: This is a feature embedding that makes the network consider the current depth from the top of the graph. The depth y^depth_i starts from zero, increases by one when [EON] appears, and decreases by one when [EOD] appears (see Figure 4).

Transformer Decoder: To leverage self-attention throughout parsing, a multi-layered Transformer decoder (Vaswani et al., 2017) is applied to obtain an output sequence. Let d_i be the decoder representation at the i-th step, obtained by feeding the previous decoder inputs (e_1 . . . e_i) and the encoder representations into the multi-layered Transformer decoder.

Mode Output Layers: Given the decoder representation d_i, we produce a probability distribution over the next mode label with a feed-forward network and a softmax classifier:

P(y^mode_{i+1}) = softmax(FFN^mode(d_i)),

where y^mode_{i+1} denotes a mode label, i.e., G, E, L, AS, AE, or R. We choose the mode y^mode_{i+1} with the maximum probability. Similarly, we obtain a probability distribution for each mode-wise label:

P(y^G_{i+1}) = softmax(FFN^G(d_i)),  P(y^E_{i+1}) = softmax(FFN^E(d_i)),  P(y^L_{i+1}) = softmax(FFN^L(d_i)).

After applying these layers, we obtain the output labels y^mode_{i+1}, y^G_{i+1}, y^E_{i+1}, and y^L_{i+1} and their corresponding embeddings e^mode_{i+1}, e^G_{i+1}, e^E_{i+1}, and e^L_{i+1}, which are used for the next decoder input.
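A minimal PyTorch sketch of how the decoder input e_i could be assembled by concatenation; the embedding dimensions and the handling of mutually exclusive modes (here simply concatenating all parts) are assumptions beyond what the text states.

```python
import torch
import torch.nn as nn

class DecoderInputEmbedding(nn.Module):
    """Concatenate mode-wise label embeddings, projected encoder states for
    anchors, and a depth embedding into the decoder input e_i (sketch)."""

    def __init__(self, n_modes, n_graph, n_edge, n_label, d_plm,
                 d_emb=128, d_depth=100, max_depth=64):
        super().__init__()
        self.mode = nn.Embedding(n_modes, d_emb)
        self.graph = nn.Embedding(n_graph, d_emb)
        self.edge = nn.Embedding(n_edge, d_emb)
        self.label = nn.Embedding(n_label, d_emb)
        self.proj_as = nn.Linear(d_plm, d_plm)   # W^(AS), b^(AS)
        self.proj_ae = nn.Linear(d_plm, d_plm)   # W^(AE), b^(AE)
        self.depth = nn.Embedding(max_depth, d_depth)

    def forward(self, y_mode, y_g, y_e, y_l, h_as, h_ae, depth):
        # h_as / h_ae: encoder representations of the anchor start / end tokens
        parts = [self.mode(y_mode), self.graph(y_g), self.edge(y_e),
                 self.label(y_l), self.proj_as(h_as), self.proj_ae(h_ae),
                 self.depth(depth)]
        return torch.cat(parts, dim=-1)          # e_i
```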
Anchoring Classifier: To compute source-side anchors (i.e., AS and AE), we employ biaffine attention (Dozat and Manning, 2017, 2018). The biaffine operation computes a relation score for a vector pair as

Biaff^(t)(x_1, x_2) = x_1^T U^(t) x_2 + W^(t) [x_1; x_2] + b^(t),

where U^(t), W^(t), and b^(t) are trainable parameters. We apply the biaffine operation between the decoder representation d_i and the encoder representation h_PLM,j to point to a range of anchoring. If y^mode_{i+1} = AS, the anchor starting probability is

P(y^AS_{i→j}) = σ(Biaff^(AS)(d_i, h_PLM,j)),

where σ is the sigmoid function. P(y^AS_{i→j}) represents the probability that the j-th token in the encoder is the anchor starting token. After this output layer, we select arg max_{j∈1,...,N} P(y^AS_{i→j}) and take the corresponding h_PLM,j, which is used as the next decoder input e^AS_{i+1}. e^AE_{i+1} is computed in the same manner. Figure 5 shows an example of EDS parsing (the graph of Figure 3 (a3)): AS is raised at the first decoding step, and biaffine scoring selects the encoder representation of the token "may". Similarly, AE is raised at the second step.

Reentrancy Classifier: To resolve reentrancy edges, we provide another biaffine layer. Since this is a target-side (i.e., decoder-side) operation, we apply the classifier after all decoding steps have finished to keep training fast. If y^mode_{i+1} = R, the probability that a reentrancy edge exists between the i-th and j-th decoding steps is

P(y^R_{i→j}) = σ(Biaff^(R)(d_i, d_j)).

To restrict the search space, we only consider end-of-node tokens (i.e., [EON]) for j (see Figure 4), since we assume [EON] is the representative token of a node.
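Both the anchoring and reentrancy classifiers rely on the biaffine operation above. The following PyTorch sketch assumes the standard formulation x1ᵀ U x2 + W [x1; x2] + b; the actual scorers may add MLP projections or batching.

```python
import torch
import torch.nn as nn

class Biaffine(nn.Module):
    """Biaffine scorer Biaff(x1, x2) = x1^T U x2 + W [x1; x2] + b (sketch)."""

    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(d1, d2))   # bilinear term
        self.W = nn.Linear(d1 + d2, 1, bias=True)    # linear term W and bias b

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # x1: (d1,) decoder state d_i; x2: (n, d2) candidate representations
        bilinear = x2 @ (self.U.t() @ x1)                                  # (n,)
        linear = self.W(torch.cat([x1.expand(x2.size(0), -1), x2], -1)).squeeze(-1)
        return bilinear + linear       # raw scores; apply sigmoid for probabilities

# e.g. probs = torch.sigmoid(Biaffine(768, 768)(d_i, h_plm))  # d_i: (768,), h_plm: (N, 768)
```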
Property Classifier: To predict node properties, we apply a softmax classifier for each property type P on top of the decoder representation of the node's [EON] token:

P(y^P_{i+1}) = softmax(FFN^P(d_i)),

where P(y^P_{i+1}) represents a probability distribution over the labels of property type P at the [EON] (i.e., the node's representative token). Note that P contains a "no label" class for nodes that do not have the property.

Loss and Decoding
Loss: We compute a mode cross-entropy loss L_mode based on P(y^mode). For each mode, we compute the mode-specific cross-entropy losses L_G, L_L, and L_E based on P(y^G), P(y^L), and P(y^E). Note that a mode-specific loss is not computed if a different mode is selected at decoding step i; for example, if y^mode_i = G, only L_G is computed and the other mode-specific losses are ignored, while the mode loss is always computed. The anchoring losses L_AS and L_AE are binary cross-entropy losses based on P(y^AS) and P(y^AE). Similarly, the reentrancy loss is denoted L_R. If the framework has property tasks (e.g., PTG graphs), we compute a cross-entropy property loss L_P for each property type P. The combined loss to be optimized is

L = λ_mode L_mode + λ_G L_G + λ_L L_L + λ_E L_E + λ_AS L_AS + λ_AE L_AE + λ_R L_R + Σ_P λ_P L_P,

where the λ are hyperparameters that adjust the loss scales.

Decoding: For simplicity, we only consider greedy decoding. We also apply explicit restrictions; for example, mode AE always comes after AS.

Ensemble
To further boost the performance of our parser, we provide an average ensemble. We apply mode-wise averaging over output probabilities: we average the probabilities of the mode layer, the mode-specific layers, the anchoring classifiers, the reentrancy classifier, and the property classifiers, respectively.
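A minimal sketch of this mode-wise average ensemble; it simply averages per-model output distributions of identical shape, applied head by head.

```python
import torch

def ensemble_probs(prob_list):
    """Average per-model probability distributions (sketch).

    prob_list: list of tensors of identical shape, one per trained model,
    each already passed through softmax (or sigmoid for the biaffine scorers).
    """
    return torch.stack(prob_list, dim=0).mean(dim=0)

# e.g. mode_probs = ensemble_probs([model_a_mode_probs, model_b_mode_probs])
```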

Post-Processing
We apply framework-specific post-processing after graph reconstruction for all frameworks except DRG.
For EDS, to handle unknown words appearing as named entities, we replace the CARG property with the node label expression extracted from the anchors when the edit distance between the node label expression and CARG is larger than 70% of the node label characters. Since the EDS frame dictionary is available, we correct frames by checking their arguments. When several candidates are available, we select the most frequent frame name.
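A minimal sketch of the CARG correction rule described above; apart from the 70% threshold stated in the text, the normalization details are assumptions.

```python
def fix_carg(carg: str, anchored_text: str, threshold: float = 0.7) -> str:
    """Replace CARG with the anchored surface string when the two differ too much
    (edit distance above threshold * length of the node label expression). Sketch."""

    def edit_distance(a: str, b: str) -> int:
        # standard Levenshtein distance via dynamic programming
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,            # deletion
                               cur[j - 1] + 1,         # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    if edit_distance(anchored_text, carg) > threshold * max(len(anchored_text), 1):
        return anchored_text
    return carg
```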
For PTG, frame dictionaries for both English and Czech are also available, so we correct frames in the same manner.
For UCCA, we apply post-processing to satisfy UCCA-specific restrictions. We remove non-anchored nodes appearing as terminal nodes and remove self-loop edges. We also add the remote attribute to all edges except primary edges.
Tokenization: Input texts are tokenized by the PLM-specific tokenizer. We also adopt the tokenizer's scheme for our decoder: the same vocabulary is used for the node label tokens y^L_i during decoding. The only exception is Czech PTG, because node label tokens in Czech PTG graphs include accented characters that are missing from the vocabulary of multilingual BERT. Thus, we employ character-level decoding for Czech PTG, with a vocabulary constructed to contain Czech characters.

Hyperparameters: The hyperparameters of the submitted models are shown in Table 2 (bottom). Adam (Kingma and Ba, 2015) was used as the optimizer with linear warmup scheduling. We preliminarily tuned the learning rates, the number of decoder layers, the number of decoder heads, and the λ values.
The search ranges of the hyperparameters were [1e-6, 1e-3] with log-uniform sampling for the decoder learning rate, [1e-6, 1e-4] with log-uniform sampling for the encoder learning rate, [4, 8] for the numbers of Transformer decoder layers and heads, and [0, 2] with uniform sampling for the λ values. We fixed λ_mode = 1. We did not aggressively tune the hidden dimensions because we preliminarily found their impact on the final performance to be minuscule; in this work, we fixed the biaffine dimensions to 400 and the depth embedding dimension to 100.

Validation: We used cross-validation (CV) to validate our models and ensure robustness. We picked four folds from the training data in Table 2 for each framework/language. For example, although we split the EDS training data into 24 folds, we used only four of them to validate the EDS model. We mixed the official validation data into each fold to evaluate the model's performance. Validation performance was then evaluated every 20 epochs, and the best model was selected.
Through this validation, we obtained four trained models (one per CV fold) for each framework/language with the same hyperparameters. The obtained models were then used for the average ensemble.

Setup for Cross-Lingual Pre-Training: Given the lower-resource nature of the cross-lingual track, especially for the German UCCA and DRG graphs, we adopted two-staged cross-lingual training. First, we concatenated the cross-framework (CF) (e.g., English DRG) and cross-lingual (CL) (e.g., German DRG) training data. Then, we pre-trained on the concatenated data for each framework with multilingual BERT (Devlin et al., 2019). After that, we fine-tuned on only the monolingual training data.

Results and Discussion
Overall Result: Table 3 shows the official cross-framework evaluation results in the MRP metrics. As can be seen in the table, in terms of average MRP scores, our parser tied for 1st place: the results were very close to those of the ÚFAL system (Samuel and Straka, 2020) (given the randomness of the official evaluation tool and statistical significance concerns, the system ranking was determined with rounded scores). We achieved the top performance on EDS, PTG, and AMR, demonstrating the effectiveness of our parser for these frameworks. Table 4, the official cross-lingual evaluation results, shows a similar tendency. In the cross-lingual track, we also tied for 1st place, obtaining the best performance for Chinese AMR and German DRG. Notably, our parser performed well on flavor 2 graphs (Oepen et al., 2019) such as AMR and DRG, where no anchors exist in the graphs. This is because we generate node labels directly with the Transformer decoder, thus avoiding alignment errors. However, anchor-based graphs such as UCCA seem less suited to our parser when compared with the ÚFAL system. We presume that improving the biaffine scoring in the anchoring classifiers would remedy this problem.

Comparing Pre-Trained Models: To better understand how we benefit from PLMs, we compare the bert-base-cased and roberta-large models. Table 5 shows the MRP all-F scores of the cross-framework results. Note that the hyperparameters were slightly different for each model. The RoBERTa large models were better than the BERT base models, showing improvements ranging from one to four points.

Effectiveness of Depth Embeddings: We conducted an ablation study to examine the role of the depth embedding. Table 6 shows a CV-averaged result on English DRG graphs. Note that the hyperparameters differ from those in Table 5. The result shows that our depth embedding is effective in boosting performance. We presume this is because the depth makes the decoder consider a kind of stack state in PGN, which helps the parser produce valid graphs more easily.

Effectiveness of Cross-Lingual Pre-Training: Table 7 compares the F scores of CL and monolingual training. We used bert-base-german instead of bert-base-ml-cased for both monolingual trainings. CL training outperformed monolingual training. This indicates that both the UCCA and DRG annotations are cross-lingually consistent and that our model can capture this consistency through CL training. We estimate that our parser thus has better transfer ability on cross-lingual graphs.

Conclusion
This paper described a novel parser for the shared task on Meaning Representation Parsing 2020. We proposed a text-to-graph-notation transduction built on a novel graph notation, and our model effectively parses this notation. Experimental results showed that our parser achieved the top performance on many frameworks. Since our parser is not limited to the five frameworks, in future work we will extend our technique to other tasks.