HIT-SCIR at MRP 2019: A Unified Pipeline for Meaning Representation Parsing via Efficient Training and Effective Encoding

This paper describes our system (HIT-SCIR) for the CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing. We extend the basic transition-based parser with two improvements: a) efficient training, by realizing parallel training for the stack LSTM; b) effective encoding, by adopting the deep contextualized word embeddings of BERT. Overall, we propose a unified pipeline for meaning representation parsing, consisting of framework-specific transition-based parsers, BERT-enhanced word representations, and post-processing. In the final evaluation, our system ranked first according to ALL-F1 (86.2%), and in particular ranked first in the UCCA framework (81.67%).


Introduction
The goal of the CoNLL 2019 shared task is to develop a unified parsing system that can process all five semantic graphbanks. For the first time, this task combines formally and linguistically different approaches to meaning representation in graph form in a uniform training and evaluation setup.
Recently, many semantic graphbanks have arisen, which differ in the design of their graphs (Kuhlmann and Oepen, 2016) or in their semantic scheme. More specifically, SDP (Oepen et al., 2015), including DM, PSD, and PAS, treats tokens as nodes and connects them with semantic relations; EDS (Flickinger et al., 2017) encodes MRS representations (Copestake et al., 1999) as graphs with many-to-many relations between tokens and nodes; UCCA (Abend and Rappoport, 2013) represents semantic structures with a multi-layer framework; AMR (Banarescu et al., 2013) represents the meaning of each word with a concept graph. These frameworks can be classified into three flavors of semantic graphs, based on the degree of alignment between tokens and graph nodes: in DM and PSD, nodes are a subset of the surface tokens; in EDS and UCCA, graph nodes are explicitly aligned with tokens; in AMR, the alignments are implicit.
Most semantic parsers are designed for only one or a few specific graphbanks, due to the differences in annotation schemes. For example, the currently best parser for SDP is graph-based (Dozat and Manning, 2018); it assumes dependency graphs and cannot be directly applied to UCCA, EDS, or AMR, due to the existence of concept nodes. Hershcovich et al. (2018) parse across different semantic graphbanks (UCCA, DM, AMR), but only work well on UCCA. The system of Buys and Blunsom (2017) is a good data-driven EDS parser, but does poorly on AMR. Lindemann et al. (2019) set a new state of the art in DM, PAS, PSD, and AMR, and come close to the state of the art in EDS, by representing each graph with a compositional tree structure (Groschwitz et al., 2017), but they do not extend this method to UCCA. Learning from multiple flavors of meaning representation in parallel has hardly been explored; notable exceptions include the parsers of Peng et al. (2017, 2018) and Hershcovich et al. (2018).
Therefore, the main challenge in the cross-framework semantic parsing task is that the frameworks differ in how surface strings map to graph nodes, which makes framework-specific parsers incompatible with one another. To address this, we use a transition-based parser as our basic parser, since it is more flexible than a graph-based parser in realizing this mapping (node generation and alignment), and we improve it in two aspects: 1) Efficient Training, by aligning the homogeneous operations of the stack LSTM within a batch and computing them simultaneously; 2) Effective Encoding, by fine-tuning the parser with pretrained BERT (Devlin et al., 2019) embeddings, which enrich the contextual information needed to make accurate local decisions. Together with post-processing, we develop a unified pipeline for meaning representation parsing.
Our contributions can be summarized as follows:
• We proposed a unified parsing framework for cross-framework semantic parsing.
• We designed a simple but efficient method to realize stack LSTM parallel training.
• We showed that semantic parsing benefits substantially from adopting BERT.
• Our system ranked first among 16 teams in the CoNLL 2019 shared task in terms of ALL-F1.

System Architecture
Our system architecture is shown in Figure 1. In this section, we first introduce the transition-based parser (Section 2.1), which is the central part of our system. To speed up the training of its stack LSTMs, we then propose a simple batch-training method (Section 2.2), and we adopt BERT to extract contextualized word representations (Section 2.3). Finally, to label nodes with pos, frame, and lemma, we use additional tagger models (Section 2.4). The framework-specific transition systems are presented in Section 3, and the post-processing for each framework is discussed in Section 4.

Transition-based Parser
In order to design a unified transition-based parser, we refer to the following framework-specific parsers: Wang et al. (2018b) for DM and PSD, Hershcovich et al. (2017) for UCCA, Buys and Blunsom (2017) for EDS, and prior transition-based work for AMR. These parsers differ in the design of the transition system used to generate the oracle action sequence, but are similar in how they model the parsing state. A tuple (S, L, B, E, V) represents the parsing state, where S is a stack holding processed words, L is a list holding words popped out of S that will be pushed back in the future, and B is a buffer holding unprocessed words. E is a set of labeled dependency arcs, and V is a set of graph nodes, including both concept nodes and surface tokens. The initial state is ([0], [ ], [1, · · · , n], [ ], V), where V contains only surface tokens, since concept nodes are generated during parsing. The terminal state is ([0], [ ], [ ], E, V). We model S, L, B, and the action history with stack LSTMs, which support PUSH and POP operations. The transition classifier takes the parsing state from the multiple stack LSTM models as input and outputs the action that maximizes the score. The score of a transition action a on state s is calculated as

score(a, s) = g_a · STACK_LSTM(s) + b_a,

where STACK_LSTM(s) encodes the state s into a vector, and g_a and b_a are the embedding vector and bias of action a, respectively. The oracle transition action sequence is obtained through the transition systems described in Section 3.

Figure 2: When new INSERT operations come, the data to be inserted are pushed into the corresponding buffers. They are merged into a batch once batch-processing is triggered. After that, the new LSTM states are pushed onto the corresponding stacks.
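As a toy illustration (not the released implementation), the linear action scoring described above can be sketched with NumPy; the random state vector stands in for the stack LSTM encoder, and the dimensions are hypothetical:

```python
import numpy as np

def score_actions(state_vec, action_embeds, action_bias):
    """score(a, s) = g_a . STACK_LSTM(s) + b_a, computed for every action a."""
    return action_embeds @ state_vec + action_bias

# hypothetical toy dimensions: 4 candidate actions, state encoded in R^3
rng = np.random.default_rng(0)
s = rng.normal(size=3)            # stands in for STACK_LSTM(s)
G = rng.normal(size=(4, 3))       # one embedding vector g_a per action
b = rng.normal(size=4)            # one bias b_a per action
best = int(np.argmax(score_actions(s, G, b)))  # action with the maximal score
```

In the parser itself, only actions whose preconditions hold in the current state would be considered for the argmax.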

Batch Training
Kiperwasser and Goldberg (2016) show that batch training increases gradient stability and speeds up training. Delaying the backward pass to simulate mini-batch updates is a simple way to realize batch training, but it fails to compute over the data in parallel. To solve this, we propose a method that maintains the stack LSTM structure and uses an operation buffer.
Stack LSTM The stack LSTM augments a conventional LSTM with a 'stack pointer'. It supports three operations: a) INSERT, which adds an element to the end of the sequence; b) POP, which moves the stack pointer to the previous element; c) QUERY, which returns the output vector at the position of the stack pointer. Among these, POP and QUERY only manipulate the stack pointer without heavy computation, whereas INSERT performs most of the computation.
Batch Data at the Operation Level Just as a conventional LSTM cannot batch computation within a single sequence because of its sequential processing, neither can a stack LSTM. We therefore collect pending operations across different data instances to form a batch. In other words, we construct batches at the operation level rather than at the traditional data level. After collecting a batch of operations, we compute them simultaneously.
Operation Buffer To be more efficient, we adopt a buffer that collects operations and automatically triggers their computation (batch-processing), as shown in Figure 2.
To ensure correctness, batch-processing is triggered only when certain conditions are satisfied, namely when a) an INSERT operation arrives and there is already an INSERT for the same instance in the buffer; or b) a POP or QUERY operation arrives. To clarify, the buffer depth per instance is 1.
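A minimal sketch of the operation buffer (the class name and the batched-step interface are assumptions for illustration, not the released code):

```python
class OpBuffer:
    """Operation-level batching for stack LSTMs (a sketch of the assumed
    design): INSERT inputs from different instances are buffered and then
    flushed through one batched computation; a POP/QUERY, or a second
    INSERT for the same instance, forces a flush first."""

    def __init__(self, batched_step):
        self.batched_step = batched_step  # stand-in for a batched LSTM cell
        self.pending = {}                 # instance id -> buffered INSERT input
        self.stacks = {}                  # instance id -> list of LSTM states

    def flush(self):
        """Batch-process all buffered INSERT operations at once."""
        if not self.pending:
            return
        ids, xs = zip(*self.pending.items())
        for i, state in zip(ids, self.batched_step(list(xs))):
            self.stacks.setdefault(i, []).append(state)
        self.pending.clear()

    def insert(self, inst, x):
        """INSERT: buffer the input; buffer depth per instance is 1."""
        if inst in self.pending:
            self.flush()
        self.pending[inst] = x

    def query(self, inst):
        """QUERY (POP is analogous): flush first, then read the stack top."""
        self.flush()
        return self.stacks[inst][-1]
```

Here `batched_step` stands in for one batched LSTM step; in the real parser it would run the stack LSTM cell over the whole batch on the GPU, which is where the speedup comes from.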

Deep Contextualized Word Representations
Neural parsers often use pretrained word embeddings as their primary input, e.g. word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which assign a single static representation to each word and therefore cannot capture context-dependent meaning. By contrast, deep contextualized word representations, e.g. ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), encode words with respect to their context. They have proven useful for many NLP tasks, achieving state-of-the-art performance on standard Natural Language Understanding (NLU) benchmarks such as GLUE (Wang et al., 2018a). In the CoNLL 2018 shared task (Zeman et al., 2018), ELMo was adopted by the system that achieved first place in terms of the LAS metric, and Kondratyuk and Straka (2019) exceeded the state of the art in UD parsing by fine-tuning with BERT.

BERT
We adopt BERT in our model; it is trained on unannotated text with a language-modeling objective to obtain deep contextualized embeddings. BERT differs from ELMo in that it employs a bidirectional Transformer (Vaswani et al., 2017), benefiting from directly learning potential dependencies between words. For a token w_k in sentence S, BERT splits it into several pieces and uses a sequence of WordPiece embeddings (Wu et al., 2016) s_k,1, s_k,2, ..., s_k,piece_num_k instead of a single token embedding. Each s_k,i is passed to an L-layer bidirectional Transformer, which is trained with a masked language-modeling objective (i.e. randomly masking a percentage of input tokens and predicting only those masked tokens).
To encode the whole sentence, we extract the first piece s_k,1 of each token w_k, applying a scalar mix over all L transformer layers, to represent the corresponding token w_k.
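A small NumPy sketch of the two steps, scalar mix and first-piece selection; the softmax-weighted mix follows the usual ELMo-style formulation, and the function names are illustrative:

```python
import numpy as np

def scalar_mix(layers, weights, gamma=1.0):
    """ELMo-style scalar mix: gamma * sum_l softmax(w)_l * h_l, per token."""
    w = np.exp(weights - np.max(weights))   # numerically stable softmax
    w = w / w.sum()
    return gamma * sum(wl * h for wl, h in zip(w, layers))

def first_piece(wordpiece_vecs, piece_counts):
    """Keep only the first wordpiece vector of each original token w_k."""
    out, i = [], 0
    for n in piece_counts:                  # n = piece_num_k for token k
        out.append(wordpiece_vecs[i])
        i += n
    return out
```

In practice the mixing weights and gamma are learned jointly with the parser.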

Tagger
Semantic graphs in all frameworks can be broken down into 'atomic' component pieces, i.e. tuples capturing (a) top nodes, (b) node labels, (c) node properties, (d) node anchoring, (e) unlabeled edges, (f) edge labels, and (g) edge attributes. Not all tuple types apply to all frameworks, however. The released dataset and evaluation use the MRP format, whose tuples consist of the graph components mentioned above.
Our transition-based parser provides the edge information, while the node information, such as pos, frame, and lemma, requires additional tagger models to label the token sequence. The tagger we adopt is imported directly from the AllenNLP library; it models only the dependency between node and label (emission score), not the dependency between labels (transition score). The details of integrating and converting the system output into the MRP format are introduced in Section 4.
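As a sketch of what emission-only tagging means (hypothetical, not the AllenNLP implementation), each token is labeled independently from its encoded vector, with no CRF-style transition scores between adjacent labels:

```python
import numpy as np

def tag(token_vecs, W, b):
    """Emission-only tagging: argmax over per-token emission scores W h + b;
    neighboring label decisions do not influence each other."""
    return [int(np.argmax(W @ h + b)) for h in token_vecs]
```

A CRF layer would add a learned transition matrix over label pairs and decode with Viterbi instead of a per-token argmax.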

Transition Systems
Building on previous work on parsing reentrancies, discontinuities, and non-terminal nodes, we define an extended set of transitions and features that supports the conjunction of these properties. To handle crossing arcs, we use the list-based arc-eager algorithm of Choi and McCallum (2013) and Nivre (2003, 2008) for the DM, PSD, and EDS frameworks; for the UCCA framework, we employ the SWAP operation to generate crossing arcs, as in Hershcovich et al. (2017).

DM and PSD
We follow Wang et al. (2018b) in designing the transition system for DM and PSD.
• LEFT-EDGE X and RIGHT-EDGE X add an arc with label X between w_j and w_i, where w_i is the top element of the stack and w_j is the top element of the buffer. They are performed only when one of w_i and w_j is the head of the other.
• SHIFT is performed when no dependency exists between w_j and any word in S other than w_i; it pushes all words in the list and w_j onto the stack S.
• REDUCE is performed only when w_i has a head and is neither the head nor a child of any word in the buffer; it pops w_i out of the stack.
• PASS is performed when neither SHIFT nor REDUCE can be performed; it moves w_i to the front of the list.
• FINISH pops the root node and marks the state as terminal.
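The transitions above can be illustrated with a toy state manipulation; this is a hypothetical sketch that stores edges as (head, label, dependent) triples and omits both the preconditions and the node set V:

```python
def step(state, action, label=None):
    """Toy walkthrough of the list-based transitions on a state (S, L, B, E);
    preconditions are not checked and V is omitted for brevity."""
    S, L, B, E = state
    if action == "LEFT-EDGE":
        E = E | {(B[0], label, S[-1])}       # buffer front heads stack top
    elif action == "RIGHT-EDGE":
        E = E | {(S[-1], label, B[0])}       # stack top heads buffer front
    elif action == "SHIFT":
        S, L, B = S + L + [B[0]], [], B[1:]  # push list contents and w_j
    elif action == "REDUCE":
        S = S[:-1]                           # pop w_i
    elif action == "PASS":
        S, L = S[:-1], [S[-1]] + L           # move w_i to the front of list
    return (S, L, B, E)
```

Because PASS returns stack words to the list and SHIFT pushes them back, arcs can later be built between non-adjacent words, which is how the list-based system produces crossing arcs.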

UCCA
We follow Hershcovich et al. (2017) in designing the transition system for UCCA.
• SHIFT and REDUCE operations are the same as in DM and PSD. REDUCE pops the stack, allowing a node to be removed once all its edges have been created.
• NODE transition creates new non-terminal nodes. For every X ∈ L, NODE X creates a new node on the buffer as a parent of the first element on the stack, with an X-labeled edge.
• LEFT-EDGE X and RIGHT-EDGE X create a new primary X-labeled edge between the first two elements on the stack, where the parent is the left or the right node, respectively.
• LEFT-REMOTE X and RIGHT-REMOTE X create X-labeled edges in the same way, but without the single-incoming-primary-edge restriction; the created edge is additionally marked as remote.
• SWAP pops the second node on the stack and adds it to the top of the buffer, as with the similarly named transition in previous work (Maier, 2015; Nivre, 2009).
• FINISH pops the root node and marks the state as terminal.
As a UCCA node may only have one incoming primary edge, EDGE transitions are disallowed if the child node already has an incoming primary edge. To support the prediction of multiple parents, node and edge transitions leave the stack unchanged, as in other work on transition-based dependency graph parsing (Sagae and Tsujii, 2008).

EDS
Based on the work of Buys and Blunsom (2017), we extend the transition set with NODE-START X and NODE-END actions for generating concept nodes and realizing node alignment.
To clarify, w_i is the top element of the stack and w_j is the top element of the buffer. Moreover, w_i can only be a concept node (the stack and list contain only concept nodes), while w_j can be either a concept node or a surface token.
• SHIFT and REDUCE operations are the same as in DM and PSD.
• LEFT-EDGE X and RIGHT-EDGE X add an arc with label X between w_j and w_i (w_j is a concept node).
• DROP pops w_j and then pushes all elements in the list back onto the stack (w_j is a surface token).
• REDUCE is performed only when w_i has a head and is neither the head nor a child of any node in the buffer B; it pops w_i out of the stack S.
• NODE-START X generates a new concept node with label X and sets its alignment to start at w_j (w_j is a surface token).
• NODE-END sets the alignment of w_i to end at w_j (w_j is a surface token).
• PASS is performed when neither SHIFT nor REDUCE can be performed; it moves w_i to the front of the list.
• FINISH pops the root node and marks the state as terminal.

AMR
We extend the basic transition set with the ability to generate graph nodes from the surface string, following previous work. In general, there are three steps to derive graph nodes from the surface string. (a) Many concepts appear as phrases rather than single words, so when needed we connect token spans at the top of the buffer into special single tokens with the MERGE operation. (b) We then use the CONFIRM operation to convert a single token on the buffer into a graph node (concept). To better process entity concepts such as date-entity, the ENTITY operation is a special form of CONFIRM that also generates the property nodes of the entity concept. (c) The remaining concepts are derived not from the surface string but from previous concepts: when there is a concept node at the top of the buffer, the NEW operation can be performed to generate them.
After solving the problem of parsing concept nodes from surface string, the basic transition set used in DM and PSD is able to predict edges between concept nodes.
• REDUCE and PASS operations are the same as DM and PSD.
• SHIFT, LEFT-EDGE X, and RIGHT-EDGE X are similar to the operations in DM and PSD, but can be performed only when the top of the buffer is a concept node.
• DROP pops the top of the buffer when it is a token.
• MERGE connects the top two tokens in the buffer into a single token, which waits to be converted into a concept node.
• CONFIRM X converts the top of the buffer into a concept node labeled X if it is a token.
• ENTITY X does the same as CONFIRM X and then adds the internal attributes of entity X, such as the year, month, and day of a date-entity.
• NEW X creates a concept node labeled X and pushes it onto the buffer.
• FINISH pops the root node and marks the state as terminal.
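The node-generating operations can be sketched on a toy buffer of tagged items; this is a hypothetical illustration, and the real parser additionally tracks anchors and properties:

```python
def merge(buffer):
    """MERGE: join the top two tokens of the buffer into one span token."""
    (_, a), (_, b), *rest = buffer
    return [("token", a + " " + b)] + rest

def confirm(buffer, concept):
    """CONFIRM X: replace the token at the top of the buffer with concept X."""
    assert buffer[0][0] == "token"
    return [("concept", concept)] + buffer[1:]

def new(buffer, concept):
    """NEW X: push a fresh concept node onto the buffer (performed when the
    top of the buffer is already a concept)."""
    return [("concept", concept)] + buffer
```

ENTITY X would behave like `confirm` but also attach property nodes (e.g. the year/month/day of a date-entity) to the new concept.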
Post-Processing

All frameworks construct the basic triples in the same way, using directed edges, edge labels, and node ids. For node anchoring, we derive the anchors directly from the tokenization in the companion data aligned with each sentence. The remaining elements, such as top nodes, differ somewhat among the frameworks; we introduce the framework-specific processing below.

DM and PSD
Node Properties Nodes in DM and PSD are labeled with lemmas and carry two additional properties that jointly determine the predicate sense, viz. pos and frame. We use two taggers to handle this problem.
Top Nodes First, we construct an artificial node called ROOT. Then we add an edge (node, ROOT, ROOT) for each node enumerated from the top nodes.
Node Labels We copy the lemmas from the additional companion data and set them as node labels.

UCCA
Top Nodes There is only one top node in UCCA, which is used to initialize the stack. Meanwhile, the top node acts as the guard symbol of the stack (it is never popped out).
Edge Properties UCCA is the only framework with edge properties, which mark remote edges. We treat remote edges the same as primary edges, except that the edge label is suffixed with a special symbol, i.e. a star (*).
Node Anchoring Following the original UCCA framework design, we link the nodes in layer 0 to the surface tokens with the edge label 'Terminal'. In post-processing, we combine surface tokens and layer-0 nodes by collapsing 'Terminal' edges to extract the alignment (anchor) information.

EDS
Top Nodes The TOP operation sets the first concept node in the buffer as the top node.
Node Labels We train a tagger to handle this. Although many distinct node labels exist, the results show that our system performs well on this subtask.
Node Properties The only framework-specific property used on EDS nodes is carg (constant argument), a string-valued parameter used with predicates (node labels) such as named or dofw, for proper names and days of the week, respectively.
We write rules to convert surface tokens into property values, such as converting million (token) to 1000000 (value) when the node label is card.
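A hedged sketch of such a rule (the scale table, the normalization, and the function name are illustrative assumptions, not the full rule set):

```python
def carg_value(token, node_label):
    """Map a surface token to a carg property value, depending on the
    node label; the rules below are illustrative examples only."""
    scale = {"million": "1000000", "billion": "1000000000"}  # assumed table
    if node_label == "card" and token.lower() in scale:
        return scale[token.lower()]
    if node_label in ("named", "dofw"):
        return token.strip().strip(".,")   # keep the (lightly cleaned) string
    return token
```

The real rule set would also cover ordinals, seasons, and similar carg-bearing predicates.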
Node Anchoring We obtain the alignment information through the NODE-START and NODE-END operations.

AMR
Alignment There is no anchoring between tokens of the surface string and nodes of the AMR graph, yet we must know which token aligns to which node in order to train our model. In practice, finding alignments is a hard problem for which only approximate solutions via heuristic search are available. Although basic alignments are contained in the companion data, we decided to use the enhanced rule-based aligner TAMR.
TAMR recalls more alignments by matching words and concepts from both semantic and morphological perspectives. (a) Semantic match: GloVe embeddings represent words in a vector space; considering a word and a concept with its trailing sense number stripped off, we regard them as a match if their cosine similarity is high enough. (b) Morphological match: the Morphosemantic database of the WordNet project provides links connecting noun and verb senses, which helps match words with concepts.
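A sketch of the semantic-match rule (the 0.8 threshold and the helper names are assumptions for illustration):

```python
import numpy as np

def strip_sense(concept):
    """Strip a trailing sense number, e.g. 'want-01' -> 'want'."""
    head, _, tail = concept.rpartition("-")
    return head if head and tail.isdigit() else concept

def semantic_match(word_vec, concept_vec, threshold=0.8):
    """A word and a (sense-stripped) concept match if the cosine similarity
    of their GloVe vectors is high enough; the threshold is assumed."""
    cos = np.dot(word_vec, concept_vec) / (
        np.linalg.norm(word_vec) * np.linalg.norm(concept_vec))
    return cos >= threshold
```

The morphological rule would instead consult the WordNet Morphosemantic links to connect, e.g., a noun token with a verb-derived concept.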
Top Nodes There is exactly one top node in AMR. For convenience of processing, we add a guard element to the stack and use the operation LEFT-EDGE ROOT between the guard element and concept nodes to predict the top node.
Node Labels The node label appears as the name of each concept, which is the parameter of the operations ENTITY, CONFIRM, and NEW.
Node Properties This is the main part of post-processing. Since our model predicts everything as nodes and edges, we need an extra procedure to recognize which nodes should become properties in the final result. Once recognized, a node and its corresponding edge are converted into a property of the parent node, with the edge label as the key and the node label as the value. We write rules to perform this recognition, based on two observations. (a) Attribute nodes: numbers, URLs, and other special tokens such as '-' (the value of 'polarity') should be property values. (b) Constant relations: an edge with a label such as 'value', 'quant', or 'op_x' is usually a property key. We treat a node as a property if an edge with a constant relation connects to an attribute node.

Table 1: Evaluation metrics per framework, i.e. labeled F1 (Oepen et al., 2014) for DM/PSD, labeled dependency F1 for UCCA, EDM (Dridan and Oepen, 2011) for EDS, and SMATCH (Cai and Knight, 2013) for AMR.
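The two recognition rules can be sketched as follows (the relation list and regular expressions are illustrative assumptions, not the full rule set):

```python
import re

CONST_RELS = {"value", "quant", "polarity"}   # plus 'op1', 'op2', ... below

def is_attribute_node(label):
    """(a) Attribute node: numbers, URLs, and special tokens like '-'."""
    return (label == "-" or label.startswith("http")
            or re.fullmatch(r"-?\d+(\.\d+)?", label) is not None)

def is_constant_relation(edge_label):
    """(b) Constant relation: labels like 'value', 'quant', 'op_x'."""
    return edge_label in CONST_RELS or bool(re.fullmatch(r"op\d+", edge_label))

def to_property(edge_label, node_label):
    """Collapse (edge, node) into a key/value property when both rules hold."""
    if is_constant_relation(edge_label) and is_attribute_node(node_label):
        return {edge_label: node_label}
    return None
```

When `to_property` fires, the child node is removed from the graph and its label is stored on the parent under the edge-label key.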

Experiments
In this section, we present the basic model setup, including BERT fine-tuning, and the results, including the overall evaluation and training speed. More details about training, including model selection, hyperparameters, and so on, are given in the supplementary material.

Model Setup
Our work uses the AllenNLP library built on the PyTorch framework. We split parameters into two groups, i.e., BERT parameters and the other (base) parameters; the two groups differ in learning rate. For training we use Adam (Kingma and Ba, 2015). Code for our parser and model weights are available at https://github.com/DreamerDeo/HIT-SCIR-CoNLL2019.
Fine-Tuning BERT with Parser Following Devlin et al. (2019), fine-tuning BERT on the supervised downstream task yields the most benefit, so we fine-tune the BERT model together with the parser. In our preliminary study, gradual unfreezing and a slanted triangular learning-rate scheduler were essential for fine-tuning BERT. More details are discussed in the supplementary material.
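A sketch of the slanted triangular schedule of Howard and Ruder (2018); the hyperparameter values here are illustrative assumptions, not the ones used in our experiments:

```python
def slanted_triangular_lr(step, total_steps, lr_max=2e-5,
                          cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: a short linear warm-up over the
    first cut_frac of training, then a long linear decay; the rate spans
    [lr_max / ratio, lr_max]."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut                                       # warm-up
    else:
        p = 1 - (step - cut) / max(1, total_steps - cut)     # decay
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

Gradual unfreezing complements this by enabling gradients for the top BERT layers first and for lower layers only in later epochs.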

Results
Overall Evaluation We list the evaluation results in Table 1, ranked by the cross-framework metric ALL-F1, together with the per-framework results. In the final submission, we used a single model for prediction; in follow-up experiments, we obtained further improvement with an ensemble model, whose results are listed in the supplementary material.
Training Speed To explore the effect of the batch-training method proposed in Section 2.2 on the training process, we conduct several experiments by adjusting the batch size. Since we adopt two different ways to handle crossing arcs, list-based (DM, PSD, EDS, AMR) and the SWAP operation (UCCA), we run batch-training experiments on DM and UCCA respectively. The result is shown in Figure 3: approximately a 5.3x speedup on DM and a 2.7x speedup on UCCA can be reached by increasing the batch size. In the speed test, we use GloVe pretrained embeddings instead of BERT to reduce memory cost and support larger batch sizes.

Improvement through BERT Our parser benefits substantially from BERT compared with GloVe, as shown in Table 2. The improvement is more pronounced in the out-of-domain evaluations, illustrating BERT's ability to transfer across domains.

Discussion
In recent years, graph-based parsers have held the state of the art in dependency parsing due to their capacity for global decisions, compared with transition-based parsers. However, when we combine both kinds of models with BERT, we observe similar performance, which suggests that powerful representations can close the gap between parsing strategies. Kulmizev et al. (2019) propose that deep contextualized word representations are more effective at reducing errors in transition-based parsing than in graph-based parsing. Their experiments were all on dependency parsing (tree structures); we find similar results in meaning representation parsing (graph structures), as shown in Table 3. It remains future work to study this phenomenon with a theoretical analysis.

Conclusion and Future Work
Our system extends the basic transition-based parser with the following improvements: 1) adopting BERT for better word representations; 2) realizing batch training for the stack LSTM to speed up training. We further proposed a unified pipeline for meaning representation parsing, suitable for the mainstream graphbanks. In the final evaluation, our system ranked first in the CoNLL 2019 shared task according to ALL-F1 (86.2%), and in particular ranked first in the UCCA framework (81.67%).