ShanghaiTech at MRP 2019: Sequence-to-Graph Transduction with Second-Order Edge Inference for Cross-Framework Meaning Representation Parsing

This paper presents the system used in our submission to the CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing. Our system is a graph-based parser which combines an extended pointer-generator network that generates nodes and a second-order mean field variational inference module that predicts edges. Our system achieved 1st and 2nd place for the DM and PSD frameworks respectively on the in-framework ranks and achieved 3rd place for the DM framework on the cross-framework ranks.


Introduction
The goal of the Cross-Framework Meaning Representation Parsing shared task (MRP 2019; Oepen et al., 2019) is learning to parse text into multiple formats of meaning representation with a uniform parsing system. The task combines five different frameworks of graph-based meaning representation. DELPH-IN MRS Bi-Lexical Dependencies (DM) (Ivanova et al., 2012) and Prague Semantic Dependencies (PSD) (Hajič et al., 2012) first appeared in the SemEval 2014 and 2015 shared tasks on Semantic Dependency Parsing (SDP) (Oepen et al., 2015). Elementary Dependency Structures (EDS) (Oepen and Lønning, 2006) is the origin of DM Bi-Lexical Dependencies and encodes English Resource Semantics (Flickinger et al., 2016) in a variable-free semantic dependency graph. Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013) targets a level of semantic granularity that abstracts away from syntactic paraphrases. Abstract Meaning Representation (AMR) (Banarescu et al., 2013) aims to abstract away from syntactic representations, so that sentences with similar meaning are assigned the same AMR graph. One of the main differences between these frameworks is their level of abstraction from the sentence. SDP graphs are bi-lexical dependency graphs, where graph nodes correspond to tokens in the sentence. EDS and UCCA are general forms of anchored semantic graphs, in which the nodes are anchored to arbitrary spans of the sentence and the spans can overlap. AMR is an unanchored graph, which does not consider the correspondence between nodes and the sentence tokens. The shared task also provides a cross-framework metric which evaluates the similarity of graph components in all frameworks.
Previous work mostly focused on developing parsers that support only one or two frameworks, while little work has explored cross-framework semantic parsing. Peng et al. (2017), Stanovsky and Dagan (2018) and Kurita and Søgaard (2019) proposed methods that learn jointly on the three SDP frameworks, and Peng et al. (2018) further proposed to learn from different corpora. Hershcovich et al. (2018) converted UCCA, AMR, DM and UD (Universal Dependencies) into a unified DAG format and proposed a transition-based method for UCCA parsing.
In this paper, we present our system for MRP 2019. Our system is a graph-based method which combines an extended pointer-generator network introduced by Zhang et al. (2019) to generate nodes for EDS, UCCA and AMR graphs and a second-order mean field variational inference module introduced by Wang et al. (2019) to predict edges for all the frameworks. According to the official results, our system achieves a 94.88 F1 score on the cross-framework metric for DM, which ranks 3rd. On the in-framework metrics, our system achieves 92.98 and 81.61 labeled F1 for DM and PSD, which rank 1st and 2nd respectively.
Figure 1: An example of converting AMR graphs into tree structures. This is a sub-graph of sentence #20003002.

Data Processing
In this section, we introduce the data pre-processing and post-processing in our system for all the frameworks. We use sentence tokenizations, POS tags and lemmas from the official companion data, and named entity tags extracted by the Illinois Named Entity Tagger (Ratinov and Roth, 2009), which is in the official 'white-list'. We follow Zhang et al. (2019) to convert each EDS, UCCA and AMR graph to a tree by duplicating the nodes that have multiple incoming edges. An example is shown in Fig. 1. The node sequences for EDS, UCCA and AMR are decided by a depth-first search that starts from the root node and sorts neighbouring nodes in alphanumerical order.
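The conversion described above can be illustrated with a minimal sketch. It assumes that node labels are unique and serve as node identifiers, that a duplicated node reuses the 1-based position of its first occurrence as its index, and that duplicates are not expanded again; the function name `linearize_graph` and the toy graph are hypothetical and are not taken from the system itself.

```python
from typing import Dict, List, Tuple


def linearize_graph(children: Dict[str, List[str]], root: str) -> List[Tuple[str, int]]:
    """Return the depth-first node sequence as (node, index) pairs.

    A node reached a second time is emitted as a duplicate that reuses the
    index of its first occurrence and is not expanded again.
    """
    sequence: List[Tuple[str, int]] = []
    first_index: Dict[str, int] = {}

    def visit(node: str) -> None:
        if node in first_index:
            # Node with multiple incoming edges: emit a copy pointing back
            # to the original occurrence.
            sequence.append((node, first_index[node]))
            return
        first_index[node] = len(sequence) + 1  # 1-based position of first visit
        sequence.append((node, first_index[node]))
        for child in sorted(children.get(node, [])):  # alphanumerical order
            visit(child)

    visit(root)
    return sequence


# Toy graph (hypothetical): "boy" is a child of both "want" and "go".
toy = {"want": ["go", "boy"], "go": ["boy"]}
print(linearize_graph(toy, "want"))
# [('want', 1), ('boy', 2), ('go', 3), ('boy', 2)]
```

The repeated index on the second `boy` entry is what would allow a post-processing step to merge the duplicates back into a single graph node.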

AMR Data Processing
Our data processing follows Zhang et al. (2019).
In pre-processing, we remove the senses, wiki links and polarity attributes in AMR nodes, and replace the sub-graphs of special named entities, such as names, places and times, with anonymized words. The corresponding phrases in the sentences are also anonymized. A mapping from NER tags to these entities is built to process the test data.
In post-processing, we generate the AMR subgraphs from the anonymized words. Then we assign the senses, wiki links and polarity attributes with the method in Zhang et al. (2019).

EDS and UCCA Data Processing
In pre-processing, we first clean the companion data to make sure that its tokens are consistent with those in the MRP input. We assume that the anchors of each node are contiguous, so we replace the anchors with the corresponding start and end token indices.
In EDS graphs, there are a lot of nodes without a direct mapping to individual surface tokens, which we call type 1 nodes. We call nodes with corresponding surface tokens type 2 nodes. We reduce type 1 nodes in two ways:
• If a node a of type 1 is connected to only one node b, which is of type 2 and has the same anchor as a, we reduce node a into node b as a special attribute of b.
• If a node a of type 1 is connected to exactly two nodes b and c, which are of type 2 and have a combined anchor range that matches the anchor of a, we reduce node a to an edge connecting b and c that carries the label of a. The edge direction is decided by the labels of the edges connecting a to b and c. For example, if node a has two child nodes b and c, edge (a, c) has label ARG2 and edge (a, b) has label ARG1, then node a is reduced to the directed edge (b, c) with the label of node a.
An example of the reduction is shown in Fig. 2. This method removes 4 nodes on average from each graph. We also look at nodes whose label corresponds to a multi-word expression in the sentence. For example, '_such+as' in an EDS graph corresponds to 'such as' in the sentence. In such cases, if the phrase maps to a single node with a probability over 0.5, then all words in the phrase are combined into a single token in the sentence.
In post-processing, we recover the reduced nodes by reversing the reduction procedure according to the node attributes and edge labels.
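As an illustration of the second reduction rule, the following sketch handles only the ARG1/ARG2 case given in the example above. The function name, the data layout and the reading of "combined anchor range" as the span from the smallest start to the largest end are assumptions, and the labels in the usage example are made up.

```python
from typing import Dict, List, Optional, Tuple

Anchor = Tuple[int, int]  # (start token index, end token index)


def reduce_node_to_edge(
    node_label: str,
    node_anchor: Anchor,
    out_edges: List[Tuple[str, str]],  # (edge label, child id) pairs of node a
    child_anchors: Dict[str, Anchor],  # anchors of the type 2 children
) -> Optional[Tuple[str, str, str]]:
    """Return (source, target, label) if node a can be replaced by an edge."""
    if len(out_edges) != 2:
        return None
    by_label = dict(out_edges)
    if set(by_label) != {"ARG1", "ARG2"}:
        return None  # only the ARG1/ARG2 pattern is covered in this sketch
    b, c = by_label["ARG1"], by_label["ARG2"]
    combined = (
        min(child_anchors[b][0], child_anchors[c][0]),
        max(child_anchors[b][1], child_anchors[c][1]),
    )
    if combined != node_anchor:
        return None
    # The new edge goes from the ARG1 child to the ARG2 child and carries
    # the label of the removed node.
    return (b, c, node_label)


edge = reduce_node_to_edge(
    "compound", (0, 2), [("ARG2", "c"), ("ARG1", "b")], {"b": (0, 1), "c": (1, 2)}
)
print(edge)  # ('b', 'c', 'compound')
```

Post-processing would invert this mapping: an edge whose label is a known node label is expanded back into a node with ARG1 and ARG2 children.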
For UCCA, we label implicit nodes with special labels $n_i$, where $i$ is the index at which the implicit node appears in the node sequence.

System Description
In this section, we describe our model for the task. We first predict the nodes of the parse graph. For DM and PSD, there is a one-to-one mapping between sentence tokens and graph nodes. For EDS, UCCA and AMR, we apply an extended pointer-generator network (Zhang et al., 2019) for node prediction. Given the predicted nodes, we then adopt the method of second-order mean field variational inference (Wang et al., 2019) for edge prediction. Figure 3 illustrates our system architecture.

Word Representation
Previous work found that various word representations could help improve parser performance. Many state-of-the-art parsers use POS tags and pre-trained GloVe (Pennington et al., 2014) embeddings as a part of the word representation. Dozat and Manning (2018) find that character-based LSTM and lemma embeddings can further improve the performance of semantic dependency parsers. Zhang et al. (2019) use BERT (Devlin et al., 2019) embeddings for each token to improve the performance of AMR parsing. In our system, we find that predicted named entity tags are helpful as well. The word representation $o_i$ in our system is a concatenation of several embeddings of the word $w_i$, including XPOS, lemma, character and NER embeddings. XPOS tags and lemmas are extracted from the official companion data.

Node Prediction
We use the extended pointer-generator network (Zhang et al., 2019) for node prediction. Given a sentence with $n$ words $w = [w_1, w_2, \ldots, w_n]$, we predict a list of nodes $u = [u_1, u_2, \ldots, u_m]$ sequentially and assign their corresponding indices $idx = [idx_1, idx_2, \ldots, idx_m]$. The indices track whether a node is a copy of a previously generated node or a newly generated node.
To encode the input sentence, we use a multi-layer BiLSTM fed with the embeddings of the words (Eq. 1), where $o_i$ is the concatenation of the different types of embeddings for $w_i$, and $R = [r_1, \ldots, r_n]$ represents the output of the BiLSTM.
For the decoder, at each time step $t$, we use an $l$-layer LSTM to generate the hidden states $z^l_t$ sequentially, whose input at step $t$ is the concatenation of the label embedding of node $u_{t-1}$ and the attentional vector $z_{t-1}$. The attentional vector $z_t$ (Eq. 4) is computed from the decoder hidden state and the source attention (Eqs. 2 and 3), where $a^{src}_t$ is the source attention distribution and $c_t$ is the context vector of the encoder hidden states. The vocabulary distribution is computed with the learnable parameters $W_{vocab}$ and $b_{vocab}$. The target attention distribution is defined similarly to Eqs. 2 and 3, with learnable parameters $W_{tatt}$, $W_{tgt}$, $U_{tgt}$ and $b_{tgt}$.
Finally, at each time step, we need to decide which action to take. Possible actions are copying an existing node from the previous nodes and generating a new node whose label comes either from the vocabulary or from a word in the source sentence. The corresponding probabilities of these three actions are $p_{tgt}$, $p_{gen}$ and $p_{src}$, where $p_{tgt} + p_{gen} + p_{src} = 1$. At time step $t$, if $u_t$ is a copy of an existing node, then the probability $P^{(node)}(u_t)$ is given by the copy action and the target attention, and the index $idx_t$ is set to $idx_j$, the index of the copied node. If $u_t$ is a new node, its probability is given by the two generation actions and it receives a new index.
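The three-way action mixture can be sketched as follows. This follows the usual pointer-generator formulation rather than the exact equations of the system, so the way the switch probabilities are combined with the attention and vocabulary distributions is an assumption, and all function and parameter names are hypothetical.

```python
from typing import Dict, List


def copy_node_probability(p_tgt: float, tgt_attention: List[float], j: int) -> float:
    """Probability that the node at step t is a copy of the earlier node j.

    Assumption: the copy probability is the switch probability p_tgt times
    the target attention mass on position j.
    """
    return p_tgt * tgt_attention[j]


def new_node_probability(
    label: str,
    p_gen: float,
    p_src: float,
    vocab_dist: Dict[str, float],
    src_attention: List[float],
    src_tokens: List[str],
) -> float:
    """Probability of emitting `label` as a new node.

    Assumption: mass comes from the vocabulary distribution (weighted by
    p_gen) plus the source attention on tokens equal to `label` (weighted
    by p_src).
    """
    from_vocab = p_gen * vocab_dist.get(label, 0.0)
    from_source = p_src * sum(
        a for a, w in zip(src_attention, src_tokens) if w == label
    )
    return from_vocab + from_source


# Toy numbers (hypothetical): the switch probabilities sum to one.
p_new = new_node_probability("boy", 0.5, 0.3, {"boy": 0.4}, [0.1, 0.9], ["the", "boy"])
p_copy = copy_node_probability(0.2, [0.25, 0.75], 0)
print(round(p_new, 2), round(p_copy, 2))  # 0.47 0.05
```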

Edge Prediction
We adopt the method presented in Wang et al. (2019) for edge prediction, which is based on second-order scoring and inference. Suppose that we have a sequence of vector representations of the predicted nodes $[r_1, \ldots, r_m]$, which can be the BiLSTM output $r_i$ in Eq. 1 in the cases of DM and PSD, or the extended pointer-generator network output $z_i$ in Eq. 4 in the cases of EDS, UCCA and AMR. The edge prediction module is shown in Fig. 4. To score first-order and second-order parts (i.e., edges and edge pairs) in both edge prediction and label prediction, we apply the Biaffine function (Dozat and Manning, 2017, 2018) and the Trilinear function (Wang et al., 2019) fed with node representations. In these functions, $U_i$ is a $(d \times d)$-dimensional tensor, $d$ is the hidden size and $\circ$ represents the element-wise product. We consider three types of second-order parts: siblings (sib), co-parents (cop) and grandparents (gp) (Martins and Almeida, 2014). For each type of first-order and second-order part, with part $\in$ {edge, label, sib, cop, gp}, we use single-layer FNNs to compute a head representation and a dependent representation for each word, as well as a head_dep representation which is used for grandparent parts. We then compute the part scores (Eqs. 7-11). In Eqs. 7 and 8, the tensor $U$ in the biaffine function is sized to produce a single edge score and $c$ label scores respectively, where $c$ is the number of labels. We require $j < k$ in Eq. 9 and $i < k$ in Eq. 10.
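For readers unfamiliar with the two scoring functions, the sketch below shows one common form of each in plain Python. It assumes a biaffine score $v_1^\top U v_2 + b$ and a trilinear score that projects each vector with its own $(d \times d)$ matrix and sums the element-wise product of the three projections; the exact parameterization used in the system may differ, and the helper names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def biaffine(v1: Vector, u: Matrix, v2: Vector, bias: float = 0.0) -> float:
    """First-order score of a pair of representations: v1^T U v2 + b."""
    return sum(a * b for a, b in zip(v1, matvec(u, v2))) + bias


def trilinear(v1: Vector, v2: Vector, v3: Vector,
              u1: Matrix, u2: Matrix, u3: Matrix) -> float:
    """Second-order score of three representations.

    Each vector is projected by its own (d x d) matrix, and the score is
    the sum over dimensions of the element-wise product of the projections.
    """
    g1, g2, g3 = matvec(u1, v1), matvec(u2, v2), matvec(u3, v3)
    return sum(a * b * c for a, b, c in zip(g1, g2, g3))


identity = [[1.0, 0.0], [0.0, 1.0]]
print(biaffine([1.0, 2.0], identity, [3.0, 4.0]))  # 11.0
print(trilinear([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], identity, identity, identity))  # 63.0
```

Factoring the trilinear score through three $(d \times d)$ projections keeps the parameter count quadratic in $d$, where a full third-order tensor would be cubic.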
In the label-prediction module, $s^{(label)}_{i,j}$ is fed into a softmax layer that outputs the probability of each label for edge $(i, j)$. In the edge-prediction module, we can view computing the edge probabilities as doing posterior inference on a Conditional Random Field (CRF). Each Boolean variable $X_{ij}$ in the CRF indicates whether the directed edge $(i, j)$ exists. We use Eq. 7, which scores an edge, to define a unary potential $\phi_u(X_{ij})$ for each variable $X_{ij}$, and Eqs. 9-11 to define the binary potentials.
For each pair of edges $(i, j)$ and $(k, l)$ that form a second-order part of a specific type, we define a binary potential $\phi_p(X_{ij}, X_{kl})$.
Exact inference on this CRF is intractable. We use mean field variational inference, which approximates the true posterior distribution with a factorized variational distribution and iteratively minimizes their KL divergence. We can derive iterative update equations for the distribution $Q_{ij}(X_{ij})$ of each edge $(i, j)$.
The initial distribution $Q^{(0)}_{ij}(X_{ij})$ is set by normalizing the unary potential $\phi_u(X_{ij})$. We iteratively update the distributions for $T$ steps and then output $Q^{(T)}_{ij}(X_{ij})$, where $T$ is a hyperparameter. We can then predict the parse graph by including every edge $y^{(edge)}_{ij}$ such that $Q^{(T)}_{ij}(1) > 0.5$. The edge labels $y^{(label)}_{ij}$ are predicted by maximizing the label probabilities computed by the label-prediction module.
Note that the iterative updates in mean-field variational inference can be seen as a recurrent neural network that is parameterized by the potential functions. Therefore, the whole edge prediction module can be seen as an end-to-end neural network.
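A minimal sketch of the mean field procedure is given below. It assumes that each potential equals the exponentiated part score when all variables involved are 1 and equals 1 otherwise, so that each update reduces to a sigmoid of the edge score plus the expected second-order scores. The tensor layout (`sib[i][j][k]` for edges (i, j) and (i, k), `cop[i][j][k]` for (i, j) and (k, j), `gp[i][j][k]` for (i, j) and (j, k)), the symmetry of the sibling and co-parent tensors and the exclusion of self-loops are choices made for this illustration only.

```python
import math
from typing import List

Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mean_field_edges(unary: Matrix, sib: Tensor3, cop: Tensor3, gp: Tensor3,
                     steps: int = 3) -> Matrix:
    """Return q[i][j], the approximate posterior probability of edge (i, j)."""
    n = len(unary)
    # Q^(0): normalize the unary potential (self-loops are excluded here).
    q = [[sigmoid(unary[i][j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for _ in range(steps):
        new_q = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                message = 0.0
                for k in range(n):
                    if k == i or k == j:
                        continue
                    message += q[i][k] * sib[i][j][k]  # sibling edge (i, k)
                    message += q[k][j] * cop[i][j][k]  # co-parent edge (k, j)
                    message += q[j][k] * gp[i][j][k]   # grandchild edge (j, k)
                    message += q[k][i] * gp[k][i][j]   # grandparent edge (k, i)
                new_q[i][j] = sigmoid(unary[i][j] + message)
        q = new_q  # all distributions are updated in parallel
    return q


def decode_edges(q: Matrix, threshold: float = 0.5) -> List[tuple]:
    """Keep every edge whose posterior probability exceeds the threshold."""
    return [(i, j) for i, row in enumerate(q) for j, p in enumerate(row) if p > threshold]
```

Because every step is a fixed sequence of differentiable operations, unrolling the loop for $T$ steps gives the recurrent-network view mentioned above.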

Other Predictions
The shared task also requires prediction of component pieces such as top nodes, node properties, node anchoring and edge attributes. In this section, we present our approaches to predicting these components.

Top Nodes
We add an extra ROOT node for each sentence to determine the top node through edge prediction for DM and PSD. For the other frameworks, we use the first predicted node as the top node.

Node Properties
Node properties vary among different frameworks. For DM and PSD, we need to predict the POS and frame for each node. As DM and PSD are bi-lexical semantic graphs, we directly use the XPOS predictions from the official companion data. We use a single-layer MLP fed with the word features obtained in Eq. 1 for frame prediction. For EDS, the properties only contain 'carg' and the corresponding values are related to the surface string.
For example, the EDS sub-graph in Fig. 2 contains a node with label 'named' which has property 'carg' with a corresponding value 'Pierre'. The anchor of this node matches the token 'Pierre' in the sentence. We found that nodes with properties have limited types of node labels. Therefore, we exchange node labels and values for EDS nodes containing properties during training. We combine the node attributes and value predictions described in Section 2.2 together as a multi-label prediction task. We use a single-layer MLP to predict node labels specifically for nodes with properties. For each property value, we regard it as a node label and use the extended pointer-generator network described in Section 3.2 to predict it. The probability of a node property prediction is then defined from these two predictions.

Node Anchoring
As DM and PSD contain only token-level dependencies, we can decide a node anchor by the corresponding token. For the other frameworks, we use two biaffine functions to predict the 'start token' and the 'end token' for each node, and the final anchor range is decided by the start position of the 'start token' and the end position of the 'end token'. The biaffine functions are fed with word features from the encoder RNN and node features from the decoder RNN (Eq. 14), where $i$ ranges from 1 to $n$ over the words and $j$ ranges from 1 to $m$ over the nodes.

Edge Attributes
Only UCCA requires prediction of edge attributes, which are the 'remote' attributes of edges. We create new edge labels by combining the original edge labels and edge attributes. In this way, edge attribute prediction is done by edge label prediction.
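The label-merging trick can be sketched in a few lines. The separator string and the function names are hypothetical; any encoding works as long as it can be inverted in post-processing and the separator does not occur in the original labels.

```python
from typing import Tuple

REMOTE_SUFFIX = "|remote"  # hypothetical marker, assumed absent from UCCA labels


def combine_label(label: str, remote: bool) -> str:
    """Fold the 'remote' attribute into the edge label."""
    return label + REMOTE_SUFFIX if remote else label


def split_label(combined: str) -> Tuple[str, bool]:
    """Recover the original label and the 'remote' attribute."""
    if combined.endswith(REMOTE_SUFFIX):
        return combined[: -len(REMOTE_SUFFIX)], True
    return combined, False


print(split_label(combine_label("A", True)))   # ('A', True)
print(split_label(combine_label("P", False)))  # ('P', False)
```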

Learning
Given a gold graph $y$, we use the cross-entropy loss as the learning objective for each predicted component, where $\theta$ denotes all the parameters of the model, $\mathbb{1}(X)$ is an indicator function of whether $X$ exists in the graph, $i, j$ range over all the nodes and $k$ ranges over all possible attributes in the graph. The total loss is a weighted sum of the component losses, where the weights $\lambda_1, \ldots, \lambda_4$ are hyperparameters. For DM and PSD, we tuned $\lambda_1$, $\lambda_2$ and $\lambda_3$. For the other frameworks, we set all of them to 1.

Training
For DM, PSD and EDS, we used the same dataset split as previous approaches (Martins and Almeida, 2014; Du et al., 2015), with 33,964 sentences in the training set and 1,692 sentences in the development set. For each of the other frameworks, we randomly chose 5% to 10% of the training set as the development set. We additionally removed graphs with more than 60 nodes (or with input sentences longer than 60 words for DM and PSD). We trained our model for each framework separately and used Adam (Kingma and Ba, 2015) to optimize our system, annealing the learning rate by 0.5 every 10,000 steps. We trained the model for up to 100,000 iterations with a batch size of 6,000 tokens and stopped early after 10,000 iterations without improvement on the development set.
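The schedule can be sketched as below, reading the annealing as a multiplication of the learning rate by 0.5 after every 10,000 steps and the termination rule as patience-based early stopping. Both readings are assumptions, the base learning rate in the example is a placeholder that the text does not specify, and the function names are hypothetical.

```python
def learning_rate(base_lr: float, step: int,
                  decay: float = 0.5, interval: int = 10_000) -> float:
    """Step-wise annealing: multiply by `decay` after every `interval` steps."""
    return base_lr * decay ** (step // interval)


def should_stop(best_step: int, current_step: int, patience: int = 10_000) -> bool:
    """Stop once `patience` iterations have passed without a new best dev score."""
    return current_step - best_step >= patience


# With a placeholder base rate of 1e-3, the rate has been halved twice by step 25,000.
print(learning_rate(1e-3, 25_000))
print(should_stop(best_step=30_000, current_step=40_000))  # True
```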

Main Results
Due to an unexpected bug in UCCA anchor prediction, we failed to submit our UCCA prediction.
Our results are still competitive with those of the other teams, and we rank 3rd for the DM framework in the official metrics. The main result is shown in Table 1. Our system performs well on the DM framework, with an F1 score only 0.4 points below the best score on DM. Note that our system does not learn to predict node labels for DM and PSD and simply uses lemmas from the companion data as node labels. We find that, compared to gold lemmas from the original SDP dataset, lemmas from the companion data have only 71.4% accuracy. We believe that this is the main reason for the F1 score gap between our system and the best one on DM and PSD. A detailed comparison of each component is given in Section 4.3. For PSD, EDS and AMR, our system ranks 6th, 5th and 7th among 13 teams.
Tables 2 and 3 show a detailed comparison of each evaluation component for DM and PSD. For DM, our system outperforms the systems of the other teams on tops, properties and edges prediction and is competitive on anchors. For PSD, our system is also competitive on all the components except labels. There is a large gap in the performance of node label prediction between our system and the best one on both DM and PSD. We believe that adding an MLP layer for label prediction would diminish this gap. Table 4 shows the performance comparison on in-framework metrics for DM and PSD. For DM, our system outperforms the best of the other systems by 0.5 and 0.8 F1 points on the all and lpps test sets. For PSD, our system outperforms the best of the other systems by 0.4 F1 points on lpps and is only 0.05 F1 points below the best score on all.

AMR
For AMR graph prediction, our node prediction module is based on Zhang et al. (2019), but our edge prediction module is based on the second-order method of Wang et al. (2019). To verify the effectiveness of second-order edge prediction, we compare the performance of our model with that of Zhang et al. (2019) on the development set. The result is shown in Table 5. It shows that our second-order edge prediction is useful not only on the SDP frameworks but also on the AMR framework.
Table 4: Comparison of in-framework labeled F1 scores by our system and best scores over the other teams. Note that the best scores do not all come from a single system.
Model                  Smatch
Zhang et al. (2019)    69.1
Ours                   69.3
Table 5: Smatch scores on the development set.
From the official results on the test sets, we find it surprising that there is a huge gap between the test and development results on both the MRP and the Smatch scores, as shown in Table 6. In future work, we will investigate the reason behind this gap.

EDS
For EDS, our parser ranks 5th. Several details of our parser can be improved. For example, our anchor prediction module described in Eq. 14 (ranking 4th in the task) may occasionally predict an end anchor positioned before a start anchor, which is rejected by the evaluation system. This can be fixed by adding constraints to the anchor prediction.
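One possible form of such a constraint is to decode the start and end tokens jointly and only consider pairs in which the start does not follow the end. The sketch below is not the fix used in the system; it assumes that per-token start and end scores for a node are already available, and the function name is hypothetical.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float]) -> Tuple[int, int]:
    """Return the (start, end) pair with start <= end that maximizes the summed score.

    A single left-to-right pass keeps the best start seen so far, so the
    search is linear in the sentence length.
    """
    best_start = 0
    best = (start_scores[0] + end_scores[0], 0, 0)
    for end in range(len(end_scores)):
        if start_scores[end] > start_scores[best_start]:
            best_start = end  # best start among positions 0..end
        total = start_scores[best_start] + end_scores[end]
        if total > best[0]:
            best = (total, best_start, end)
    return best[1], best[2]


# Independent argmax would give start=2 and end=0, a reversed anchor.
print(best_span([0.1, 0.2, 0.9], [0.8, 0.3, 0.1]))  # (2, 2)
```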

UCCA
For UCCA, we failed to submit the result because of the same reversed start-end anchor predictions, which prevents us from obtaining an MRP score.

BERT with Other Embeddings
We use BERT (Devlin et al., 2019) embeddings in our model. On DM in the original SDP dataset, we compared different subtoken pooling methods, and we also explored whether combining other embeddings, such as the pre-trained word embedding GloVe (Pennington et al., 2014) and the contextual embedding ELMo (Peters et al., 2018), further improves the performance. The detailed results are shown in Table 7. We found that GloVe, lemma and character embeddings are helpful for DM and that fine-tuning on the training set slightly improves the performance. ELMo embeddings are also helpful but cannot outperform BERT embeddings. However, the performance dropped when ELMo and BERT embeddings were combined. We speculate that the drop is caused by a conflict between the two types of contextual information. For subtoken pooling, we compared using the first subtoken and using the average of all subtokens as the token embedding. We found that average pooling is slightly better than first pooling. For syntactic information, we encode each head word and dependency label as embeddings and concatenate them with the other embeddings. The result shows that syntactic information as embeddings is not very helpful for the task. We will try other methods of utilizing syntactic information in future work.
Table 8: F1 score averaged over the labeled F1 score and the frame F1 score on the development sets of DM and PSD. basic represents our model with embeddings described in 3.1 except lemma and named entity embeddings.
Dozat and Manning (2018) found that gold lemma embeddings are helpful for semantic dependency parsing. However, in Section 4.2, we note that the lemmas from the official companion data have only 71.4% accuracy compared to the lemmas in the gold SDP data, which makes lemma embeddings less helpful for parsing. We found that one of the differences concerns the lemma annotations of entities. For example, the lemmas of "Pierre Vinken" are "Pierre" and "Vinken" in the companion data, while the lemmas are the named-entity-like tags "Pierre" and "_generic_proper_ne" in the original SDP dataset. Based on this discovery, we experimented on the influence of named entity tags on parsing performance. We used the Illinois Named Entity Tagger (Ratinov and Roth, 2009) from the white list to predict named entity tags and compared the performance on the development sets of DM and PSD. The result is shown in Table 8. We tuned the hyperparameters for all the embedding conditions in the table, and we found that adding lemma or named entity embeddings results in a slight improvement on DM but does not help on PSD. With both lemma and named entity embeddings, there is a further improvement on both DM and PSD, which shows that named entity tags are helpful for semantic dependency parsing. As a result, we apply named entity information in parsing the other frameworks.
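The two subtoken pooling strategies compared in this section can be sketched as follows. The sketch assumes that the subtoken vectors of a sentence are given as plain lists and that each token is described by a half-open span of subtoken positions; the function name and the toy vectors are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def pool_subtokens(vectors: List[Vector], spans: List[Tuple[int, int]],
                   mode: str = "average") -> List[Vector]:
    """Build one embedding per token from its subtoken embeddings.

    `spans[i]` is the half-open range of subtoken positions of token i.
    """
    pooled = []
    for start, end in spans:
        pieces = vectors[start:end]
        if mode == "first":
            pooled.append(list(pieces[0]))  # keep only the first subtoken
        elif mode == "average":
            dim = len(pieces[0])
            pooled.append([sum(p[d] for p in pieces) / len(pieces) for d in range(dim)])
        else:
            raise ValueError("mode must be 'first' or 'average'")
    return pooled


# Three subtokens forming two tokens: the first token has two subtokens.
subtokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spans = [(0, 2), (2, 3)]
print(pool_subtokens(subtokens, spans, "first"))    # [[1.0, 2.0], [5.0, 6.0]]
print(pool_subtokens(subtokens, spans, "average"))  # [[2.0, 3.0], [5.0, 6.0]]
```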

Conclusion
In this paper, we present our graph-based parsing system for MRP 2019, which combines two state-of-the-art methods for sequence-to-graph node generation and second-order edge inference. The result shows that our system performs well on the DM and PSD frameworks, ranking 1st and 2nd respectively on the in-framework metrics. For future work, we will improve our system to achieve better performance on all these frameworks and explore cross-framework multi-task learning.