SJTU-NICT at MRP 2019: Multi-Task Learning for End-to-End Uniform Semantic Graph Parsing

This paper describes the SJTU-NICT system for the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference on Computational Natural Language Learning (CoNLL). Our system uses a graph-based approach to model a variety of semantic graph parsing tasks. Our main contributions in the submitted system are summarized as follows: 1. Our model is fully end-to-end and can be trained on the given training set alone, without relying on any extra training source, including the companion data provided by the organizer; 2. We extend our graph pruning algorithm to a variety of semantic graphs, solving the problem of an excessively large semantic graph search space; 3. We introduce multi-task learning for multiple objectives within the same framework. The evaluation results show that our system achieved second place in the overall F_1 score and achieved the best F_1 score on the DM framework.


Introduction
In recent years, semantic graph parsing has received a lot of attention from researchers. However, due to the variety of semantic graph flavors, the framework-specific "balkanization" of semantic parsing is worth noting.
* Corresponding authors. † This work was conducted when Zuchao Li and Zhuosheng Zhang visited NICT as internship students. Email: charlee@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn, zhangzs@sjtu.edu.cn, {wangrui, mutiyama, eiichiro.sumita}@nict.go.jp. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011). This work was partially conducted under the program "Research and Development of Enhanced Multilingual and Multipurpose Speech Translation Systems" of the Ministry of Internal Affairs and Communications (MIC), Japan. Masao Utiyama is partly supported by JSPS KAKENHI Grant Number 19H05660. Rui Wang was partially supported by JSPS grant-in-aid for early-career scientists (19K20354): "Unsupervised Neural Machine Translation in Universal Scenarios" and NICT tenure-track researcher startup fund "Toward Intelligent Machine Translation".
The 2019 Conference on Computational Natural Language Learning (CoNLL) hosts a shared task on Cross-Framework Meaning Representation Parsing (MRP 2019) (Oepen et al., 2019). From the perspective of the formal representation of semantic graphs, MRP 2019 uses directed graphs to unify five different semantic representation frameworks: DELPH-IN MRS Bi-Lexical Dependencies (DM), Prague Semantic Dependencies (PSD), Elementary Dependency Structures (EDS), Universal Conceptual Cognitive Annotation (UCCA), and Abstract Meaning Representation (AMR). Here, the directed graph is represented by a triplet $\langle T, N, E \rangle$: $N$ is the set of nodes that constitutes the semantic graph, $E \subseteq N \times N$ is the set of edges that express specific semantic relationships ($N$ and $E$ carry attributes specific to the semantic framework), and $T$ represents nodes with a degree of zero in $N$, usually corresponding to the most central semantic entity.
Although the semantic graph parsing task is uniformly modeled as a directed graph generation task, the five semantic graph frameworks can be divided into three categories according to the degree of alignment between graph nodes and the surface lexical units in the sentence: (1) graph nodes are strictly anchored to surface lexical units (i.e., DM, PSD, EDS), (2) only part of the graph nodes are strictly anchored to surface lexical units (i.e., UCCA), and (3) graph nodes have no anchoring to surface lexical units (i.e., AMR). Because multiple nodes of an EDS graph can be anchored to the same span, our system further treats EDS as one type, and DM and PSD as another type.
Based on the experiences of Jiang et al. (2019) and Zhang et al. (2019a) and our previous work on Dependency Parsing (Li et al., 2018a,b,d), Semantic Role Labeling (He et al., 2018b; Cai et al., 2018; Li et al., 2018c; He et al., 2019), Universal Conceptual Cognitive Annotation (Jiang et al., 2019), Abstract Meaning Representation (Zhang et al., 2019a), Machine Translation (Xiao et al., 2019; Sun et al., 2019), and Language Modeling (Li et al., 2019a; Zhang et al., 2019c,b), we create three graph parsing models based on the semantic graph flavors: (1) Strictly anchored (DM, PSD, EDS): scores the surface lexical units as nodes of the graph, and performs edge training based on the representations of the candidate graph nodes, (2) Non-strictly anchored (UCCA): treats parsing as a special constituent tree parsing task and uses an additional component to recover the remote edges, and (3) Completely unanchored (AMR): uses a Seq2seq model to generate the nodes and then performs edge scoring on the generated graph nodes. To maintain the end-to-end style of our system, we use multi-task learning to jointly train and predict the nodes and edges together with their attributes. We use the pre-trained language model BERT as the encoder. During training, to prevent node prediction from falling into a local optimum and leaving the edges without enough training, we randomly sample from the gold graph nodes so that as many correct nodes as possible take part in edge training.
According to the official results of the evaluation, our system ranked second in the overall F_1 metric among the 16 participating systems. On the DM framework, our system achieved the best results. On the other four frameworks (PSD, EDS, UCCA, and AMR), our system ranked third.

Tasks and Modeling
In this section, we will introduce this shared task and our modeling approach. Our key idea is to use a graph-based approach rather than a transition-based one; therefore, all the modeling and optimization methods we have on these frameworks are graph-based. The CoNLL shared task combines the following five frameworks for graph-based meaning representation: DM, PSD, EDS, UCCA, and AMR.

DM and PSD
DM (Ivanova et al., 2012) and PSD (Hajic et al., 2012; Miyao et al., 2014) are two independently developed syntactic-semantic annotations which project semantic forms onto bilexical dependencies in a lossy manner.
In the representation of the DM and PSD frameworks, the graph nodes and surface lexical units are strictly anchored. There is an explicit, one-to-many anchoring onto sub-strings of the underlying sentence. These graphs are neither fully connected nor rooted. The graphs of DM and PSD have the following features:
• There is only a one-to-one correspondence between the graph node and the span in the sentence.
• Graph nodes can have multiple in-edges or out-edges.
• Graph nodes can be completely isolated, with no in-edges or out-edges.
• There is at most one edge between any two graph nodes.
According to the above properties, the task is modeled as follows: given a sentence $S = \{w_1, w_2, ..., w_n\}$, we enumerate all the spans in the sentence, $span_{i,j} = \{w_i, w_{i+1}, ..., w_j\}$ with $i \le j$. Each span is used as a candidate graph node and is fed into the node classifier $classifier_n$ to filter out the true graph nodes: $node_k = classifier_n(span_{i,j})$. The edge classifier $classifier_e$ is then used to obtain the semantic relationship between two graph nodes: $edge_{k_1,k_2} = classifier_e(node_{k_1}, node_{k_2})$.
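The two-stage formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `enumerate_spans`, `parse_anchored`, and the two toy classifiers are hypothetical names standing in for the learned $classifier_n$ and $classifier_e$, and `None` plays the role of the "no edge" outcome.

```python
def enumerate_spans(words):
    """Return every span (i, j), i <= j, as inclusive 0-based word indices."""
    n = len(words)
    return [(i, j) for i in range(n) for j in range(i, n)]


def parse_anchored(words, node_classifier, edge_classifier):
    """Two-stage parse: filter candidate spans into nodes, then label node pairs."""
    nodes = [span for span in enumerate_spans(words) if node_classifier(words, span)]
    edges = {}  # a dict keyed by (source, target) keeps at most one edge per ordered pair
    for src in nodes:
        for tgt in nodes:
            if src == tgt:
                continue
            label = edge_classifier(words, src, tgt)
            if label is not None:  # None stands for "no edge"
                edges[(src, tgt)] = label
    return nodes, edges


# Toy stand-ins for the learned classifiers (purely illustrative).
def toy_node_classifier(words, span):
    return span[0] == span[1]  # keep single-word spans only


def toy_edge_classifier(words, src, tgt):
    pair = (words[src[0]], words[tgt[0]])
    return "ARG1" if pair == ("sleeps", "cat") else None


sentence = ["The", "cat", "sleeps"]
nodes, edges = parse_anchored(sentence, toy_node_classifier, toy_edge_classifier)
print(nodes)
print(edges)
```

Isolated nodes fall out naturally here: a span accepted by the node classifier stays in `nodes` even if no edge touches it.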

EDS
EDS is a variable-free semantic dependency graph representation proposed by Oepen and Lønning (2006) that encodes the English Resource Semantics (ERS) (Flickinger et al., 2014). The EDS conversion from under-specified logical forms of the full ERS to variable-free graphs discards part of the semantic information, which makes the graph more abstract.
In the representation of the EDS framework, the graph nodes are independent of surface lexical units. For each graph node, there is an explicit, many-to-many anchoring onto sub-strings of the underlying sentence. The EDS graph has the following features:
• There is a many-to-one correspondence between the graph node and the span in the sentence.
• Graph nodes do not correspond to individual surface tokens in the sentence.
• Graph nodes cannot be completely isolated; each has at least one in- or out-edge.
Given these features, and in particular the many-to-one correspondence between graph nodes and spans in the sentence, the modeling method of DM and PSD cannot be applied directly. Therefore, we adopt a pseudo node method to solve the problem. The transformation works as follows: each pseudo node has a one-to-one relationship with a span in the sentence. Each edge between nodes in the graph is transformed into an edge between pseudo nodes, and two attributes are added to the edge: the source node label and the target node label, which record the node labels in the original EDS graph. In this way, the many-to-one relationship is converted into a one-to-one relationship. After conversion, we can model the problem in the same way as DM and PSD, as described in Subsection 2.1.
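A minimal sketch of the pseudo node transformation, assuming a simple in-memory graph format (a dict of `node_id -> (label, span)` and a list of `(source_id, target_id, edge_label)` triples). The function name and the tiny example graph are illustrative, not taken from the system.

```python
def to_pseudo_graph(nodes, edges):
    """Collapse nodes that share a span into one pseudo node per span.

    nodes: {node_id: (node_label, (start, end))}
    edges: [(source_id, target_id, edge_label)]
    """
    # One pseudo node per distinct span, so the node-to-span relation becomes one-to-one.
    pseudo_nodes = sorted({span for _, span in nodes.values()})
    pseudo_edges = []
    for src, tgt, label in edges:
        src_label, src_span = nodes[src]
        tgt_label, tgt_span = nodes[tgt]
        pseudo_edges.append({
            "source": src_span,
            "target": tgt_span,
            "label": label,
            # The original node labels move onto the edge as two extra attributes.
            "source_label": src_label,
            "target_label": tgt_label,
        })
    return pseudo_nodes, pseudo_edges


# Illustrative graph for "Kim sleeps": two nodes are anchored to the same span (0, 0).
example_nodes = {
    0: ("proper_q", (0, 0)),
    1: ("named", (0, 0)),
    2: ("_sleep_v_1", (1, 1)),
}
example_edges = [(0, 1, "BV"), (2, 1, "ARG1")]

pseudo_nodes, pseudo_edges = to_pseudo_graph(example_nodes, example_edges)
print(pseudo_nodes)
for edge in pseudo_edges:
    print(edge)
```

Note that an edge between two original nodes on the same span becomes a self-loop on the pseudo node, and several original edges can map onto the same pseudo node pair.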

UCCA
UCCA is a multi-layer linguistic framework for semantic annotation proposed by Abend and Rappoport (2013). UCCA aims to recognize the level of semantic granularity which abstracts away from syntactic paraphrases in a typologically-motivated, cross-linguistic fashion and does not need to rely on language-specific resources.
In the representation of the UCCA framework, some nodes have a one-to-one correspondence with spans in the sentence; these are called terminal nodes. Other nodes do not correspond to any span; they introduce a notion of semantic constituency that goes beyond pure dependency graphs in order to represent semantic granularity. The UCCA graph has the following features:
• There is a one-to-one correspondence between the terminal nodes and the spans in the sentence.
• Graph nodes may have multiple parents, among which one is annotated as the primary parent and others as remote parents.
• The primary edges between nodes and their primary parents form a tree structure, whereas the remote edges between nodes and their remote parents enable the reentrancy, forming directed acyclic graphs (DAGs).
• Non-terminal nodes may have discontinuous leaves, i.e., some terminal nodes within the range of a non-terminal node are not its descendants.
Based on the above features and inspired by Nivre and Nilsson (2005), we transform the tree composed of primary edges (and nodes) into a constituent syntax tree structure, which is modeled using the constituent syntax tree parsing schema. We use an additional classifier to predict and recover the remote edges.

AMR
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) parsing is the task of transducing natural language text into AMR, which is a graph-based formalism used for capturing sentence-level semantics. The AMR framework backgrounds notions of compositionality and derivation and therefore has no explicit correspondence between graph nodes and lexical units.
In the representation of the AMR framework, the graph nodes are obtained by composition, derivation, lexical decomposition, normalization towards verb senses, and so on. The features of the AMR graphs built on these graph nodes are similar to those of a dependency syntax tree, except for reentrancy. Therefore, once the nodes are determined, modeling can be performed using the method of dependency syntax parsing. However, we cannot get the nodes of the graph directly from the sentence due to the nature of the AMR framework. Therefore, inspired by Zhang et al. (2019a), we model node determination as a sequence generation task using the Seq2seq model and then parse the tree structure on the generated nodes.

Data
The CoNLL shared task provides a training dataset of 5 subtasks, of which DM, PSD, and EDS are from Wall Street Journal (WSJ) text of Penn Treebank (Marcus et al., 1993) and contain 35,656 sentences. The UCCA training data comes from the English Web Treebank's reviews text (Bies et al., 2012) and the English Wikipedia celebrity articles, with a total data volume of 5,672 sentences. AMR annotation data are drawn from a variety of texts, including online discussion forums, newswires, folktales, novels, and Wikipedia articles, which contain a total of 56,240 sentences.

Tokenization, Lemmatization, and Anchor conversion
Since the sentences in the training dataset are raw, untokenized text and the subsequent processing requires word lemmas, we use the Stanford NLP toolkit (Manning et al., 2014) to tokenize and lemmatize the original text. As the graph node anchors in the original data are defined at the character level, we need to convert the anchors to the word level. In this process, due to differences in tokenization criteria and the existence of tokenization errors, some distinct graph nodes would be mapped to the same word-level anchor. Therefore, we performed some post-processing modifications on the tokenization results of the Stanford NLP toolkit to ensure that the graph nodes after the conversion to word-level anchors correspond to the original character-level ones, without adding or losing nodes.
Figure 1: Examples of the most frequent frame-to-frameset mapping extracted from "rng pb links.txt".
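The character-to-word anchor conversion described in this subsection can be illustrated with a short sketch. It assumes that each token comes with `(start, end)` character offsets (end exclusive) and that a node's anchor is mapped to the tokens it overlaps; the function names are hypothetical, and the collision check only shows the kind of tokenization mismatch that the post-processing has to repair.

```python
def char_to_word_anchor(token_offsets, char_start, char_end):
    """Map a character-level anchor [char_start, char_end) to inclusive token indices.

    token_offsets: list of (start, end) character offsets per token, end exclusive.
    Returns (first_token, last_token), or None if no token overlaps the anchor.
    """
    covered = [k for k, (s, e) in enumerate(token_offsets)
               if s < char_end and e > char_start]
    if not covered:
        return None
    return covered[0], covered[-1]


def find_collisions(token_offsets, char_anchors):
    """Return word-level anchors shared by more than one character-level anchor."""
    seen = {}
    for anchor in char_anchors:
        word_anchor = char_to_word_anchor(token_offsets, *anchor)
        seen.setdefault(word_anchor, []).append(anchor)
    return {w: a for w, a in seen.items() if len(a) > 1}


# "Don't stop": two graph nodes anchored to "Do" (0-2) and "n't" (2-5).
anchors = [(0, 2), (2, 5)]
split_tokens = [(0, 2), (2, 5), (6, 10)]   # tokenizer splits "Do" / "n't"
merged_tokens = [(0, 5), (6, 10)]          # tokenizer keeps "Don't" as one token

print(find_collisions(split_tokens, anchors))   # no collision
print(find_collisions(merged_tokens, anchors))  # both nodes land on token 0
```

With the merged tokenization, two nodes collapse onto one word-level anchor, which is the situation the post-processing of the tokenizer output is meant to avoid.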

Frame Label Projection in PSD Framework
The node label in the PSD framework is a special item id from the Engvallex-to-PropBank mapping dictionary. The node label contains the item id of the entry in the dictionary and the format id of the entry. For example, in [access: ev-w22f1 ACT PAT], 22 is the item id (word id) and 1 is the format id. It is therefore not convenient to predict the raw node label directly with a classifier. Because a word has a one-to-many relationship with item ids, we cannot obtain the item id from the word alone. By observing "rng pb links.txt" as shown in Figure 1, we find that, once the word is fixed, the item id has a one-to-one correspondence with its usage pattern string (like "ACT PAT"). Since usage patterns are shared among different words, their number is much smaller than the number of item ids, which makes the usage pattern a more suitable learning target. In the subsequent recovery process, we use the lemma and the predicted usage pattern to restore the item id.
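The projection and recovery steps can be sketched as a dictionary lookup. Only the `access` entry below comes from the text; the `accept` entry, the tuple layout, and the function names are made-up placeholders, and the real "rng pb links.txt" file format is not reproduced here.

```python
def build_frame_lookup(entries):
    """Index dictionary entries by (lemma, usage pattern).

    entries: iterable of (lemma, frame_id, usage_pattern) tuples.
    """
    lookup = {}
    for lemma, frame_id, pattern in entries:
        # Assumed one-to-one once the lemma is fixed, so the first entry wins.
        lookup.setdefault((lemma, pattern), frame_id)
    return lookup


def restore_frame_id(lookup, lemma, predicted_pattern):
    """Recover the frame id from the lemma and the predicted usage pattern."""
    return lookup.get((lemma, predicted_pattern))


entries = [
    ("access", "ev-w22f1", "ACT PAT"),  # example given in the text
    ("accept", "ev-w9f1", "ACT PAT"),   # hypothetical entry sharing the same pattern
]
lookup = build_frame_lookup(entries)

# The classifier only has to choose among usage patterns, a much smaller label set.
label_space = sorted({pattern for _, _, pattern in entries})
print(label_space)
print(restore_frame_id(lookup, "access", "ACT PAT"))
```

The design point is that the classifier's label set is the set of usage patterns, while the lemma disambiguates which dictionary entry the pattern refers to.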

Graph to Constituent Tree Conversion in UCCA Framework
Given the features and modeling approach described in Subsection 2.3, we need to preprocess the UCCA graphs in the training set, transforming each graph into a constituent tree by removing the remote edges and the edges that cause discontinuous leaves. We adopt the same transformation method as Jiang et al. (2019). The steps are as follows:
1) Remote edges removal. In the UCCA MRP representation graph, we remove all edges with the remote = True attribute. A "remote" suffix is added to the label of the primary edge corresponding to each remote edge, in order to distinguish nodes with a remote relationship from ordinary nodes and to support the subsequent recovery of the remote edges.
2) Constituency continuity formation. Since current mainstream constituent parsing methods require continuous constituents, we need to process the discontinuous nodes of the tree obtained in the previous step. For detailed conversion steps, see Algorithm 1.

Algorithm 1 Constituency continuity formation
Input: A tree with discontinuous leaves, T_d;
Output: A constituent tree, T_c;
1: set T(t) = T_d
2: repeat
3:   set n(t) is a non-descendant node with discontinuous leaves;
4:   find the discontinuous spans S_d in the range of n(t);
5:   for each span s ∈ S_d do
6:     for each word w ∈ s do
7:       find a maximum range parent node n_p of word w whose range size is less than s;
8:       move node n_p to be the child of n(t), and concatenate the original edge label with "ancestor-d" where d represents the original number of edges between the ancestor of n_p and n(t);
9:       remove all the children words of n_p from s;
10:     end for
11:   end for
12: until T(t) is a constituent tree
13: set T_c = T(t)

3) Edge labels move down. Constituent syntax parsing generally uses parenthetical notation to represent the constituent syntax tree structure, so in order to keep the model consistent, we also move the edge label down to the child node. Since the UCCA graphs need not be rooted trees, we add a "ROOT" dummy node to ensure that the transformed tree is a rooted tree.
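Step 1 of the conversion (remote edge removal) can be sketched as follows. The sketch assumes that "the primary edge corresponding to the remote edge" means the primary edge entering the same child node, and that the suffix is attached as `-remote`; both are assumptions made for illustration, as is the edge dictionary layout.

```python
def remove_remote_edges(edges):
    """Drop remote edges and mark the primary edge of each remote child.

    edges: list of dicts with keys "source", "target", "label", "remote" (bool).
    Returns the primary edges only, with "-remote" appended where needed.
    """
    # Children that are the target of at least one remote edge.
    remote_targets = {e["target"] for e in edges if e["remote"]}
    primary = []
    for e in edges:
        if e["remote"]:
            continue  # remote edges are removed from the tree
        label = e["label"]
        if e["target"] in remote_targets:
            label += "-remote"  # remember that this node also has a remote parent
        primary.append({"source": e["source"], "target": e["target"], "label": label})
    return primary


example = [
    {"source": 1, "target": 2, "label": "A", "remote": False},
    {"source": 3, "target": 2, "label": "A", "remote": True},
    {"source": 1, "target": 3, "label": "P", "remote": False},
]
tree_edges = remove_remote_edges(example)
for edge in tree_edges:
    print(edge)
```

After this step every node has a single parent, and the marked labels tell the remote edge classifier which nodes are candidates for recovery.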

Graph to Tree Conversion in AMR Framework
An AMR graph is rooted, directed, and mostly acyclic. However, AMR is a graph rather than a tree because it allows re-entrant semantic relations. In order to adopt the tree model for AMR parsing, we need to convert the AMR graph to a tree in the preprocessing step. Following the practice of Zhang et al. (2019a), we duplicate the nodes that have a re-entrant relation. In order to recover the original graph, we assign an index to each node, named the reentrancy index. Duplicate nodes are assigned the same index.
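The duplication scheme can be sketched with a depth-first traversal. This is a simplified illustration, not the preprocessing code of Zhang et al. (2019a): the graph format, the function names, and the choice to expand a node's children only on its first visit are assumptions.

```python
def graph_to_tree(root, labels, edges):
    """Unfold a rooted graph into a tree by duplicating re-entrant nodes.

    labels: {node_id: label}; edges: [(source_id, target_id, relation)].
    Returns (tree_nodes, tree_edges), where tree_nodes[k] = (label, reentrancy_index).
    """
    children = {}
    for src, tgt, rel in edges:
        children.setdefault(src, []).append((tgt, rel))

    tree_nodes, tree_edges = [], []
    index_of = {}  # graph node -> reentrancy index

    def visit(node, parent_tree_id, rel):
        first_visit = node not in index_of
        if first_visit:
            index_of[node] = len(index_of)
        tree_id = len(tree_nodes)
        tree_nodes.append((labels[node], index_of[node]))  # duplicates share the index
        if parent_tree_id is not None:
            tree_edges.append((parent_tree_id, tree_id, rel))
        if first_visit:  # a duplicate stays a leaf, so cycles cannot loop forever
            for child, child_rel in children.get(node, []):
                visit(child, tree_id, child_rel)

    visit(root, None, None)
    return tree_nodes, tree_edges


def tree_to_graph(tree_nodes, tree_edges):
    """Merge tree nodes with the same reentrancy index back into graph edges."""
    return sorted({(tree_nodes[s][1], tree_nodes[t][1], rel) for s, t, rel in tree_edges})


# "The boy wants to go": "boy" is the ARG0 of both "want" and "go".
labels = {"w": "want", "b": "boy", "g": "go"}
edges = [("w", "b", "ARG0"), ("w", "g", "ARG1"), ("g", "b", "ARG0")]
tree_nodes, tree_edges = graph_to_tree("w", labels, edges)
print(tree_nodes)
print(tree_edges)
print(tree_to_graph(tree_nodes, tree_edges))
```

The second copy of "boy" carries the same reentrancy index as the first, so merging by index restores the re-entrant edge.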

Anonymization in AMR Framework
Anonymization is an important AMR preprocessing method to reduce the data sparsity issue (Werling et al., 2015; Peng et al., 2017; Guo and Lu, 2018). Following the practice of Zhang et al. (2019a), we first remove senses, wiki links, and polarity attributes in the training dataset. Secondly, we anonymize sub-graphs of named entities that are labeled with one of AMR's fine-grained entity types containing a name role, as well as other entities whose labels end with -entity.

Models
To handle the different flavors of representation, our system has three types of models: the Anchoring-based Pruning Parsing Model, the Constituent Parsing Model, and the Seq2seq-based Parsing Model.

Anchoring-based Pruning Parsing Model
The anchoring-based pruning parsing model is suitable for frameworks where the graph nodes are strictly one-to-one with the sentence spans, such as DM, PSD, and the transformed EDS framework. The key idea of the anchoring-based pruning parsing model is to obtain the candidate graph nodes by enumerating the sentence spans, then use a scorer to prune the candidate graph nodes, and finally perform parsing on the remaining graph nodes.
Formally, for the major structural parsing goals, given a sentence $S = \{w_1, w_2, ..., w_n\}$, where $n$ is the sequence length, we aim to predict a set of labeled graph node-pair (sentence span-pair) relations $Y \subseteq N \times N \times L$, where $N = \{(w_i, ..., w_j) \mid 1 \le i \le j \le n\}$ contains all the spans (graph nodes), and $L$ is the space of the edge labels (semantic roles), including a null label indicating no edge between a node pair. As our model deals with $O(n^2)$ possible sentence spans (graph nodes), it needs to consider $O(n^4 |L|)$ possible relations, which is computationally impractical. To overcome this issue, motivated by our previous work and the work of He et al. (2018a), we limit the maximum width of the candidate spans to a fixed number $W$, which reduces the overall number of relational factors that need to be considered by the model to $O(n^2 |L|)$. In order to make the training goal denser, we also introduce a unary scorer $\phi_{node}(\cdot)$, and the candidate nodes are ranked and pruned by their unary score in descending order. The number of candidates retained after the pruning operation is limited to $\lambda n$. Candidates that are pruned do not take part in the edge relation prediction, which further reduces the computational complexity and memory requirements. The parameters $W$ and $\lambda$ are determined based on the statistics of the training dataset of each framework.
Neural Architecture Our model builds the candidate graph node representations based on the BERT (Devlin et al., 2019) encoder outputs, i.e., for each token $w_i$, the contextualized vector from the BERT encoder is denoted as $x_i$. The representation $h$ of the candidate span $(i, j)$ consists of the two endpoint contextualized vectors $(x_i, x_j)$, where $i$ and $j$ are the start and end positions of the span in the sentence. The node unary scorer $\phi_{node}(\cdot)$ is implemented with feed-forward networks based on the candidate graph node representation $h$: $\phi_{node}(\cdot) = \mathrm{sigmoid}(\mathrm{MLP}_{node}(h))$.
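The width limit and the score-based pruning can be sketched as below. The function names are hypothetical, the scores are made-up numbers standing in for $\phi_{node}$, and rounding $\lambda n$ down with a minimum of one candidate is an assumption.

```python
def candidate_spans(n, max_width):
    """Enumerate inclusive spans (i, j) over n tokens with at most max_width tokens."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_width, n))]


def prune_candidates(spans, scores, n, ratio):
    """Keep the top ratio * n spans by unary score (ties keep enumeration order)."""
    keep = max(1, int(ratio * n))
    ranked = sorted(range(len(spans)), key=lambda k: scores[k], reverse=True)
    return sorted(spans[k] for k in ranked[:keep])


n, W, lam = 4, 2, 0.5
spans = candidate_spans(n, W)
scores = [0.9, 0.1, 0.2, 0.05, 0.8, 0.1, 0.3]  # one hypothetical unary score per span
kept = prune_candidates(spans, scores, n, lam)

print(len(candidate_spans(n, n)), "spans without a width limit")
print(len(spans), "spans with W =", W)
print(kept, "-> only", len(kept) ** 2, "ordered node pairs reach the edge classifier")
```

In this toy setting the number of node pairs passed to the edge classifier drops from 100 (all 10 spans paired with each other) to 4, which is the effect the $O(n^4 |L|)$ to $O(n^2 |L|)$ reduction describes.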
The edge relation classifier $\phi_{rel}(\cdot)$ is implemented with a biaffine attention mechanism. Following Dozat and Manning (2017), we apply two separate MLPs to the source and target nodes respectively, producing identity-specific representations: $r_{src} = \mathrm{MLP}_{src}(h)$ and $r_{tgt} = \mathrm{MLP}_{tgt}(h)$.
We perform a biaffine operation to compute the relation labeling score:
$\phi_{rel}(\cdot) = r_{src}^{\top} W_{rel} r_{tgt} + U_{rel}^{\top} r_{src} + V_{rel}^{\top} r_{tgt} + b_{rel}$, (3)
where $W_{rel}$, $U_{rel}$, $V_{rel}$, and $b_{rel}$ are the weight matrix of the bi-linear term, the two weight vectors of the linear terms, and the bias vector, respectively.
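As a worked instance of the biaffine score, the sketch below evaluates the formula for a single label with plain Python lists. It is not the system's PyTorch implementation; the helper names and the small numbers are illustrative, and a full classifier would hold one parameter set (W, U, V, b) per edge label.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def biaffine_score(r_src, r_tgt, W, U, V, b):
    """Score for one label: r_src^T W r_tgt + U^T r_src + V^T r_tgt + b."""
    bilinear = sum(r_src[i] * dot(W[i], r_tgt) for i in range(len(r_src)))
    return bilinear + dot(U, r_src) + dot(V, r_tgt) + b


def label_scores(r_src, r_tgt, params_per_label):
    """One biaffine score per label; params_per_label is a list of (W, U, V, b)."""
    return [biaffine_score(r_src, r_tgt, *p) for p in params_per_label]


r_src, r_tgt = [1, 2], [3, 4]
W = [[1, 0], [0, 1]]   # identity: the bilinear term reduces to r_src . r_tgt = 11
U, V, b = [1, 1], [0, 1], 0.5

score = biaffine_score(r_src, r_tgt, W, U, V, b)
print(score)  # 11 + 3 + 4 + 0.5
```

Taking the argmax over `label_scores`, with one of the labels reserved as the null label, gives the predicted relation for a node pair.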
Training Loss For the node scoring goal, we use the binary cross-entropy loss between the target and the output. For the edge classification, we implement it with the standard cross-entropy loss.
Multi-Task Learning Each framework still has some other goals to learn. The DM framework includes top node, node pos tag, and node frame label. The PSD includes top node, node pos tag, and node frame label. The EDS includes edge source label and edge target label.
Overall, we use a multi-task learning strategy with shared hidden representations. The top node uses the same mechanism as node scoring, with binary cross-entropy as the loss. The node pos tag and node frame label use independent feed-forward classifiers, with cross-entropy as the loss. The edge source label and edge target label use a biaffine scorer consistent with the edge label, also with cross-entropy loss. We accumulate the losses of all goals together.
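The loss accumulation can be sketched as a plain sum of per-objective losses. The function names are hypothetical, the losses are written out with the standard library rather than a deep learning framework, and the unweighted sum is an assumption based on the statement that all losses are accumulated together.

```python
import math


def binary_cross_entropy(prob, target):
    """BCE for one sigmoid output (used for node and top-node scoring)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def cross_entropy(logits, gold):
    """Softmax cross-entropy for one classification decision (edges, tags, frames)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]


def multitask_loss(node_probs, node_gold, edge_logits, edge_gold, extra_losses=()):
    """Accumulate node, edge, and any attribute losses into one training objective."""
    loss = sum(binary_cross_entropy(p, t) for p, t in zip(node_probs, node_gold))
    loss += sum(cross_entropy(l, g) for l, g in zip(edge_logits, edge_gold))
    return loss + sum(extra_losses)  # e.g. top node, pos tag, frame label losses


# One candidate node and one candidate edge, both maximally uncertain.
total = multitask_loss([0.5], [1], [[0.0, 0.0]], [0])
print(total, 2 * math.log(2))
```

Because every objective reads the same encoder output, a single backward pass through the summed loss updates the shared representation for all goals at once.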

Constituent Parsing Model
For the UCCA framework, we directly adopt the minimal span-based parser of Stern et al. (2017) on the converted constituent trees. A constituency tree can be regarded as a collection of labeled spans over a sentence. There are two components in the constituent parsing model: one is to assign the scores directly to span existence which determines the tree structure, and the other one assigns scores to span labels which provides the labeled outputs.
Neural Architecture In this model, we also build the candidate span representation $h$ based on the BERT encoder outputs, because a span's correct label and its quality as a constituent depend heavily on the context in which it appears. Unlike the previous span representation, in this model the representation $h$ of span $(i, j)$ is the concatenation of the differences of the two endpoint contextualized vectors.
The span splitting unary scorer $\phi_{split}(\cdot)$ and the span label scorer $\phi_{label}(\cdot)$ are both implemented as feed-forward networks which take the span representation as input and output either a single span score or a vector of labeling scores. For the tree inference, we adopt the greedy top-down searching strategy introduced in Stern et al. (2017).
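The greedy top-down inference can be sketched as a short recursion. This follows the general idea of Stern et al. (2017), choosing the best label for a span and then the split point whose two sub-spans score highest, but it is a simplified illustration: the scoring functions are toy lookups instead of $\phi_{split}$ and $\phi_{label}$, spans use fencepost indices, and the empty label for non-constituents is omitted.

```python
def greedy_top_down(i, j, span_score, label_score, labels):
    """Parse fencepost span (i, j) greedily; returns (label, i, j, children)."""
    label = max(labels, key=lambda l: label_score(i, j, l))
    if j - i == 1:  # a single word cannot be split further
        return (label, i, j, [])
    # Pick the split whose left and right sub-spans have the highest combined score.
    split = max(range(i + 1, j), key=lambda k: span_score(i, k) + span_score(k, j))
    left = greedy_top_down(i, split, span_score, label_score, labels)
    right = greedy_top_down(split, j, span_score, label_score, labels)
    return (label, i, j, [left, right])


# Toy scorers that simply look up a hand-written tree over three words.
GOLD = {(0, 3): "H", (0, 1): "A", (1, 3): "P", (1, 2): "F", (2, 3): "C"}
LABELS = ["H", "A", "P", "F", "C"]


def toy_span_score(i, j):
    return 1.0 if (i, j) in GOLD else 0.0


def toy_label_score(i, j, label):
    return 1.0 if GOLD.get((i, j)) == label else 0.0


tree = greedy_top_down(0, 3, toy_span_score, toy_label_score, LABELS)
print(tree)
```

Each decision is made once and never revisited, so inference needs a number of scorer calls that is at most quadratic in the sentence length rather than the cubic cost of chart parsing.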
In order to recover the full UCCA graph, the model needs to learn the remote edge target. The remote edge target is similar to the previous edge target, namely predicting the relationship between node pairs, so we also use the relation classifier $\phi_{rel}(\cdot)$. As in the previous model, there is also a null label indicating no edge between a node pair.

Seq2seq-based Parsing Model
The AMR framework backgrounds notions of compositionality and derivation and, accordingly, declines to make explicit how elements of the graph correspond to the surface utterance. Although most AMR parsing research presupposes a preprocessing step that aligns graph nodes with (possibly discontinuous) sets of tokens in the underlying input, these correspondences need extra annotation and training. This does not match our requirement that the model be end-to-end. Therefore, we take the AMR tree with indexed nodes as the prediction target (as proposed by Zhang et al. (2019a)). AMR parsing is formalized as a two-stage process: node prediction (concept identification) and edge prediction (relation identification).
Formally, given a sentence $S = \{w_1, w_2, ..., w_n\}$, the model sequentially decodes a list of nodes $N = \{u_1, u_2, ..., u_m\}$ and their reentrancy indices $D = \{d_1, d_2, ..., d_m\}$. Then, the model is required to search for the highest scoring parsing tree, similar to dependency parsing.
Neural Architecture For node prediction, we adopt the widely-used Seq2seq model Seq2seq(·) with a pointer-generator network (Vinyals et al., 2015). The pointer network has the advantage of copying words from the source text while still retaining the ability to produce novel words. For the edge prediction, we also adopt the biaffine attention mechanism to score all possible head-dependent pairs, as in dependency parsing. The relation classifier $\phi_{rel}(\cdot)$ has the same form as the previous one:
$\phi_{rel}(\cdot) = r_{src}^{\top} W_{arc} r_{tgt} + U_{arc}^{\top} r_{src} + V_{arc}^{\top} r_{tgt} + b_{arc}$,
where $W_{arc}$, $U_{arc}$, $V_{arc}$, and $b_{arc}$ are the weight matrix of the bi-linear term, the two weight vectors of the linear terms, and the bias vector, respectively.
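The copy behavior of the node decoder can be illustrated with the basic two-way pointer-generator mixture, in which a generation probability interpolates between a vocabulary distribution and the attention over source tokens. This is a generic sketch of that mixture and may differ from the decoder actually used (for example, in how target-side copies are handled); all names and numbers are illustrative.

```python
def pointer_generator_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy distributions for one decoding step.

    p_gen: probability of generating from the vocabulary (0..1).
    vocab_probs: {word: probability} from the decoder softmax.
    attention: attention weights over source positions (sums to 1).
    source_tokens: the source words aligned with the attention weights.
    """
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    for weight, token in zip(attention, source_tokens):
        # Copy mass is added to the token even if it is outside the vocabulary.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


dist = pointer_generator_distribution(
    p_gen=0.5,
    vocab_probs={"want": 0.5, "go": 0.5},
    attention=[0.5, 0.5],
    source_tokens=["boy", "go"],
)
print(dist)
```

Here "boy" is not in the toy vocabulary but still receives probability through copying, while "go" collects mass from both the generator and the copy path.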

Setup
We first describe the setup used for our final submission. We use pytorch-transformers 7 as our codebase to develop the downstream parsing models. The weights of the pre-trained language model BERT with whole word masking 8 are used to initialize the encoder of our models. Because no development dataset is provided, we split the training dataset into 10 sections and use sections 0-8 for training and section 9 for development. Our model is trained using Adam (Kingma and Ba, 2014) for up to 30 epochs for DM, PSD, and EDS, 20 epochs for UCCA, and 120 epochs for AMR, with an early stopping strategy based on the MRP F_1 score 9 on the development dataset, computed with the mtool 10 toolkit. Table 1 lists the hyperparameters used in our full model. We apply hidden dropout (dropout rate = 0.1) to the outputs of each module in our model.
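The train/development split can be sketched as follows. The text does not say how the 10 sections are formed, so contiguous, equally sized sections in corpus order are an assumption here, as is the function name.

```python
def split_sections(items, num_sections=10, dev_section=9):
    """Split items into contiguous sections; one section is held out for development."""
    size = -(-len(items) // num_sections)  # ceiling division
    sections = [items[k * size:(k + 1) * size] for k in range(num_sections)]
    train = [x for k, sec in enumerate(sections) if k != dev_section for x in sec]
    return train, sections[dev_section]


sentences = list(range(100))  # stand-in for sentence ids
train, dev = split_sections(sentences)
print(len(train), len(dev))
```

Early stopping then only needs the development portion: after each epoch the model is scored on `dev`, and the checkpoint with the best score is kept.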

Main Results
We list our official evaluation scores 11 on all the test datasets in Tables 2 and 3. Table 2 summarizes the MRP F_1 scores of the 6 graph components. The results listed in Table 2 show that we obtained the state-of-the-art MRP F_1 score on the top nodes component. In Table 3, we assess the quality on each framework. Our model also achieved the best results on the DM framework. We observed a notable phenomenon: the weaker the anchoring relationship between the graph nodes and the surface lexical units, the more difficult the parsing. Across the different frameworks, our results on the EDS framework show the biggest gap to the other top-ranked teams, probably because of the existence of multiple edges between the same pair of pseudo nodes in the EDS framework after our modeling transformation. Therefore, our subsequent experiments modeled the edges of EDS as multi-classification problems, and our results on the development dataset improved.
7 https://github.com/huggingface/pytorch-transformers.
8 In our experiments, we use BERT-Large, uncased (Whole Word Masking), with 24 layers, 1024 hidden units, 16 heads, and 340M parameters, released by Google, https://github.com/google-research/bert.
9 http://mrp.nlpl.eu/index.php.
10 https://github.com/cfmrp/mtool.
11 The official evaluation results are at http://bit.ly/cfmrp19.

Conclusion and Future Work
In this paper, we present our end-to-end graph-based system that participated in the CoNLL 2019 shared task on Cross-Framework Meaning Representation Parsing (MRP 2019). We extend existing models so that our model is end-to-end and does not depend on any other information (including the companion data provided by the organizer). We extend our previous graph pruning algorithm to a variety of semantic graphs, solving the problem of an excessively large semantic graph search space, and adopt multi-task learning for multiple objectives within the same framework. Specifically, we model the semantic graph task as a multi-objective learning task over nodes, edges, node attributes, and edge attributes. The node candidates are scored and then pruned within the model, thus controlling the overall graph search space and forming an end-to-end parsing system. We achieve state-of-the-art results on the top nodes component and the DM framework.
For future work, we plan to integrate all the different frameworks into one single model, rather than only sharing the same modeling approach, so that a single model based on the MRP representation method generates the various semantic graphs. Furthermore, we would like to extend our model to more semantic parsing tasks.