SJTU at MRP 2019: A Transition-Based Multi-Task Parser for Cross-Framework Meaning Representation Parsing

This paper describes the system submitted by our team, SJTU, to the CoNLL 2019 Shared Task: Cross-Framework Meaning Representation Parsing. The goal of the task is to advance data-driven parsing into graph-structured representations of sentence meaning. The task covers five meaning representation frameworks: DM, PSD, EDS, UCCA, and AMR, which differ in their properties and structures. To handle all the frameworks in one model, we need to identify what they have in common. In our work, we define a single set of transition actions that covers all the frameworks and train a transition-based model to parse the meaning representations. The multi-task setup also allows learning for one framework to benefit the others. In the final official evaluation of the shared task, our system achieves a unified MRP metric F1 score of 42%.


Introduction
Semantic understanding of text is central to Natural Language Processing (NLP), and Meaning Representation Parsing (MRP) has attracted the attention of many researchers. The task is to encode a sentence into a semantic graph, which is usually directed. Compared with dependency parsing (Ma and Zhao, 2012; Li et al., 2018a; Zhou and Zhao, 2019) or semantic role labeling (Zhao et al., 2009a,b; Guan et al., 2019), this task is much harder because its representation is a graph that may incorporate both syntactic and semantic information. These general graphs are more expressive and arguably more adequate target structures for sentence-level analysis beyond shallow syntax, and in particular for representations of semantic structure. Many works have shown that these meaning representations benefit other tasks such as machine translation and abstractive summarization. However, there are several types of meaning representations with different definitions, structures, and levels of abstraction, which hinders their application.
The CoNLL 2019 Shared Task (Oepen et al., 2019) combines formally and linguistically different meaning representations in graph form in a uniform training and evaluation setup for the first time. The task includes five MRP frameworks: DM, PSD, EDS, UCCA, and AMR. These frameworks differ in their anchoring type, i.e., how tightly graph nodes correspond to sentence tokens, and in their level of abstraction. The nodes in DM and PSD all correspond to surface tokens of the sentence. In EDS and UCCA, anchoring is flexible: arbitrary parts of the sentence (e.g., sub-token or multi-token sequences) may serve as node anchors, and multiple nodes may be anchored to overlapping sub-strings. AMR has no anchoring at all but has the strongest expressive power.
For each of these frameworks, the common parsing methods are transition-based and graph-based. The former parses a sentence by making a sequence of transition actions according to the current state, which usually consists of a stack, a buffer, and a set of processed edges, while the latter predicts the nodes first and then the edges between them.
In our system, we use a transition-based model for cross-framework meaning representation parsing, since we can define one set of transition actions that incorporates all the frameworks, and the shared part of the model can learn from the data of all frameworks. Our model is modified from TUPA (Transition-based UCCA Parser) (Hershcovich et al., 2017, 2018) and is built on neural networks, which have proved powerful in many NLP tasks (Cai and Zhao, 2016; Vaswani et al., 2017; Cai et al., 2017; Wang et al., 2017; Qin et al., 2017; Cai et al., 2018; Zhang et al., 2018a,b; Zhu et al., 2018; Li et al., 2018c; Wu et al., 2018; Zhang et al., 2019; Xiao et al., 2019). Neural networks can encode text into dense representations. We put the parsing of all the frameworks into one model and use a multi-task setting to train the system jointly. In the final official evaluation of the shared task, our system achieves a unified MRP metric F1 score of 42%.
The rest of this paper is organized as follows. Section 2 introduces these frameworks. Section 3 shows our model. Section 4 gives the settings of our model and test results.

Framework Schemes
This shared task considers five meaning representation frameworks. In this section, we briefly introduce these frameworks and identify their traits.

DM and PSD
DELPH-IN MRS Bi-Lexical Dependencies (DM) (Ivanova et al., 2012) and Prague Semantic Dependencies (PSD) (Hajič et al., 2012; Miyao et al., 2014) use bi-lexical semantic dependencies to represent meaning, with different annotations. Graph nodes in DM and PSD correspond to surface tokens, and the graphs are neither fully connected nor rooted trees; that is, some tokens of the underlying sentence remain structurally isolated, and some nodes have multiple incoming edges.

EDS
Elementary Dependency Structures (EDS) (Oepen and Lønning, 2006) are variable-free semantic dependency graphs, where graph nodes correspond to logical predications and edges to labeled argument positions. Being variable-free makes these graphs quite similar to Abstract Meaning Representation (AMR). Nodes in EDS are in principle independent of surface lexical units, but each node has an explicit, many-to-many anchoring onto sub-strings of the underlying sentence.

UCCA
Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013) aims at a more semantic rather than purely syntactic representation and can be extended to cross-linguistic settings. UCCA representations are directed acyclic graphs (DAGs), where terminal nodes correspond to the text tokens and non-terminal nodes to semantic units with more abstract meanings. Edges are labeled, indicating the role of a child in the relation. UCCA enables reentrancy, allowing a node to participate in several semantic relations.

AMR
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) tries to abstract all the semantic information away from the sentence. AMR graphs are rooted directed graphs in which both nodes and edges are labeled, and reentrancy is also allowed. AMR declines to make explicit how elements of the graph correspond to the surface utterance, and its nodes are abstract. As with EDS, nodes must therefore be generated from semantic information, but AMR is harder since no anchoring is available at all. AMR graphs generally appear to be more abstract than those of the other frameworks.

Framework Summary
These frameworks have different structures and different complexity. The graphs of all these frameworks have a top node or root node, and their edges are all directed and labeled. Other properties are summarized in Table 1. By analyzing these properties, we can design a transition set that accommodates all these frameworks.
Table 2 (partial): transition actions.
Swap: swap the top two nodes in the stack and then put the top one in the buffer.
Shift (action 7): shift the first node of the buffer to the stack.
Reduce: if the top node of the stack is processed, pop it from the stack.

Model Description
For the joint learning task, we use a multi-task transition-based model. Below we describe the transition set, the model, and the training and inference procedures.

Transition Set
A transition-based system needs a set of transition actions and an oracle that generates gold-standard actions during training. We define the transition set so that it covers all the meaning representation frameworks, so that these tasks can be learned consistently. Our transition system has a stack, a buffer, and a set of processed edges. Given a sentence consisting of a sequence of tokens $t_0, t_1, \dots, t_n$, we initialize the buffer with all these tokens. During training, an oracle generates a gold-standard action sequence; during inference, the model predicts the action sequence, from which the graph is recovered. Table 2 summarizes all the actions. Among these, actions 4, 5, 6, 7, and 8 are used by all the frameworks, actions 1 and 2 are used by EDS and AMR, and action 3 is used by UCCA. If an action is not used by a framework, the oracle does not generate it for that framework, and during inference each task-specific classifier selects only from the actions that are legal for its framework.
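The state and the shared actions can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the names `ParserState`, `apply_action`, `parse`, and `LEGAL_ACTIONS` are hypothetical, the edge direction of Left and Right is inferred from the oracle description, and New, Add, and Drop are listed as legal actions but not implemented. The edge labels `L1` and `L2` are placeholders.

```python
from dataclasses import dataclass, field

# Actions shared by all frameworks, plus the framework-specific ones.
COMMON = frozenset({"Left", "Right", "Swap", "Shift", "Reduce"})
LEGAL_ACTIONS = {
    "DM": COMMON,
    "PSD": COMMON,
    "EDS": COMMON | {"New", "Drop"},
    "AMR": COMMON | {"New", "Drop"},
    "UCCA": COMMON | {"Add"},
}


@dataclass
class ParserState:
    buffer: list                               # front of the buffer is index 0
    stack: list = field(default_factory=list)  # top of the stack is the last item
    edges: set = field(default_factory=set)    # (head, label, child) triples


def apply_action(state, action, label=None):
    """Apply one of the shared actions in place and return the state."""
    if action == "Shift":
        state.stack.append(state.buffer.pop(0))
    elif action == "Reduce":
        state.stack.pop()
    elif action == "Swap":
        # Swap the top two stack nodes, then move the new top back to the buffer.
        state.stack[-1], state.stack[-2] = state.stack[-2], state.stack[-1]
        state.buffer.insert(0, state.stack.pop())
    elif action == "Right":
        # Assumption: stack top is the head, buffer front is the child.
        state.edges.add((state.stack[-1], label, state.buffer[0]))
    elif action == "Left":
        # Assumption: buffer front is the head, stack top is the child.
        state.edges.add((state.buffer[0], label, state.stack[-1]))
    else:
        raise ValueError(f"action not covered by this sketch: {action}")
    return state


def parse(tokens, actions):
    """Replay a sequence of (action, label) pairs from the initial state."""
    state = ParserState(buffer=list(tokens))
    for action, label in actions:
        apply_action(state, action, label)
    return state


state = parse(
    ["the", "cat", "sleeps"],
    [("Shift", None), ("Left", "L1"), ("Reduce", None),
     ("Shift", None), ("Left", "L2"), ("Reduce", None), ("Shift", None)],
)
```

Restricting a classifier to `LEGAL_ACTIONS[framework]` corresponds to selecting only from the legal actions of that framework at inference time.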

Model
Figure 1 depicts our model, where $x_1, x_2, \dots, x_i$ denote the input tokens. Our model architecture follows TUPA. The model uses a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) to encode the sentence and a multi-layer perceptron (MLP) with a softmax layer for classification. Following Hershcovich et al. (2018), the model has shared embedding components and a shared LSTM module, and for each framework a task-specific LSTM module and a corresponding classifier. For each framework, the outputs of the shared LSTM and the task-specific LSTM are concatenated and fed into the task-specific classifier for action prediction.
For the word embeddings, we use the pre-trained GloVe embeddings (Pennington et al., 2014) and the pre-trained BERT (Devlin et al., 2019). For each token, there are also embeddings for the lemma, the POS tag, and the syntactic dependency label. These embeddings, together with the token embeddings and the BERT outputs, are concatenated and sent to the BiLSTMs as input. These embeddings and pre-trained models are tuned during training.
Besides the neural encoder, we also add hand-crafted features to the classifier. We use features representing the existing node labels related to the top four stack elements and the first three buffer elements. We also use the last three actions taken by the parser; if fewer than three actions have been taken, zero embeddings are used instead. All these features are represented by vector embeddings, that is, node labels and transition actions are embedded as vectors. All these embeddings are initialized randomly. The feature embeddings are concatenated into a feature vector for the state.
The final hidden state vectors of the shared and specific BiLSTMs and the feature vector of the state are concatenated and fed as input to the action classifiers. Training is done with an oracle that yields the set of all optimal transitions at a given state. The transition actually performed in training is the one with the highest score given by the classifier, which is trained to maximize the sum of log-likelihoods of all optimal transitions at each step.
In addition to the main model, we also apply two classifiers for property prediction in DM and PSD. Each classifier is an MLP whose input is the concatenation of the output vectors of each token from the shared and specific BiLSTMs, since the nodes correspond one-to-one to the tokens of the sentence.
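The state feature vector described above can be illustrated with a small sketch. It assumes that each stack or buffer element contributes a single node-label id, that index 0 is reserved for missing positions and maps to a zero vector, and that embeddings are drawn uniformly from a small range; the names `make_table`, `pad_take`, and `state_features` and the dimensions used are hypothetical.

```python
import random

PAD = 0  # reserved id for a missing position; its embedding is all zeros


def make_table(vocab_size, dim, rng):
    """Randomly initialized embedding table with a zero vector at PAD."""
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(vocab_size)]
    table[PAD] = [0.0] * dim
    return table


def pad_take(ids, n, from_end):
    """Take n ids from the end (most recent first) or the front, padding with PAD."""
    picked = ids[-n:][::-1] if from_end else ids[:n]
    return picked + [PAD] * (n - len(picked))


def state_features(stack, buffer, history, label_table, action_table):
    """Concatenate label embeddings (4 stack + 3 buffer) and 3 action embeddings."""
    label_ids = pad_take(stack, 4, True) + pad_take(buffer, 3, False)
    action_ids = pad_take(history, 3, True)
    vector = []
    for i in label_ids:
        vector.extend(label_table[i])
    for i in action_ids:
        vector.extend(action_table[i])
    return vector


rng = random.Random(0)
label_table = make_table(vocab_size=5, dim=2, rng=rng)
action_table = make_table(vocab_size=4, dim=3, rng=rng)
# Two nodes on the stack, one in the buffer, one action taken so far.
features = state_features([1, 2], [3], [2], label_table, action_table)
```

With these toy dimensions the vector has 7 x 2 + 3 x 3 = 23 entries, and the slots for the two missing previous actions are zero.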
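The training objective at a single step can be written out in a few lines. This is a sketch of the objective as described, not our training code: `scores` stands for the classifier outputs before the softmax, actions are identified by integer index, and restricting the choice of the performed action to a set of legal actions is an assumption. The function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of action scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def step_loss(scores, optimal):
    """Negative sum of log-likelihoods of all optimal transitions at this step."""
    log_probs = log_softmax(scores)
    return -sum(log_probs[a] for a in optimal)


def choose_action(scores, legal):
    """Highest-scoring action among the legal ones."""
    return max(legal, key=lambda a: scores[a])


scores = [1.0, 3.0, 2.0]
loss = step_loss(scores, optimal=[1, 2])
performed = choose_action(scores, legal=[0, 2])
```

Minimizing `step_loss` over all steps is equivalent to maximizing the sum of log-likelihoods of the optimal transitions.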

Training and Inference Procedures
The training and test data come with companion data processed by UDPipe (Straka, 2018). For all input sentences, we use the tokenization, lemmas, POS tags, dependency parses, and anchor information from the UDPipe data. The anchors of the output graphs are then obtained directly from the UDPipe data. For EDS and AMR, the anchor of a node is derived from the first token in the buffer.
Since AMR has no anchoring between nodes and text, we use the alignments generated by JAMR (Flanigan et al., 2014); tokens and nodes without alignments are discarded. The oracle can then generate an action sequence for AMR during training. We also pre-process AMR and EDS graphs by expanding the node properties in these two frameworks, that is, each property key is treated as an edge label and each property value as a node label. We collect these edge labels so that the corresponding nodes and edges can be converted back to properties. For DM and PSD, the pos node property is taken from XPOS in the UDPipe data, and the frame property is predicted by additional classifiers. UCCA uses the edge attribute remote to reflect reentrancy; for convenience, our transition system ignores this attribute, and we instead add remote to the later-predicted edges that link to already used nodes. For node labels, we use the lemma of the corresponding token as the node label for DM and PSD, and we generate the node label for EDS and AMR in the New action.
During training, an oracle is used to generate action sequences. We use a dynamic oracle, which outputs the set of optimal transitions at a given state, i.e., those from whose resulting state the gold-standard graph is still reachable. For example, for EDS and AMR, if the first element of the buffer is a token that has aligned unprocessed nodes, then the node with the smallest id among them is generated by the New action and put into the buffer; if this token has no aligned nodes remaining, then the Drop action is applied. For UCCA, if the top node of the stack connects to a non-terminal node that has not been generated yet, then the Add action generates this non-terminal node. If the top element of the stack and the first element of the buffer are nodes and the node in the buffer is the child in an unprocessed edge between them, then the Right action is applied; the Left action is applied in the symmetric case. If the top element of the stack has no unprocessed edges, then the Reduce action is applied. If the stack is empty and the buffer has elements, then the Shift action is applied. If no other action applies, we perform the Swap action.
For inference, once the action sequence is predicted, we generate a graph from it. However, this graph may not conform to the well-formedness rules of the respective framework, so we prune it. Pruning consists of deleting repeated nodes and edges, deleting nodes with empty labels in EDS and AMR, and deleting edges attached to deleted nodes.
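The property expansion and its inverse can be sketched on a toy graph. The data layout (a dict from node id to label and properties, and a list of edge triples) and the function names are assumptions made for illustration; the sketch also assumes that property keys do not collide with ordinary edge labels.

```python
def expand_properties(nodes, edges):
    """Turn each node property into an extra node (value) and edge (key)."""
    new_nodes = {i: {"label": n["label"], "properties": {}}
                 for i, n in nodes.items()}
    new_edges = list(edges)
    prop_keys = set()
    next_id = max(nodes, default=-1) + 1
    for i in sorted(nodes):
        for key, value in nodes[i].get("properties", {}).items():
            new_nodes[next_id] = {"label": value, "properties": {}}
            new_edges.append((i, key, next_id))
            prop_keys.add(key)  # remembered so the expansion can be undone
            next_id += 1
    return new_nodes, new_edges, prop_keys


def collapse_properties(nodes, edges, prop_keys):
    """Convert edges whose label is a collected property key back to properties."""
    out_nodes = {i: {"label": n["label"], "properties": dict(n["properties"])}
                 for i, n in nodes.items()}
    out_edges = []
    for src, label, tgt in edges:
        if label in prop_keys:
            out_nodes[src]["properties"][label] = nodes[tgt]["label"]
            out_nodes.pop(tgt, None)  # the value node is no longer needed
        else:
            out_edges.append((src, label, tgt))
    return out_nodes, out_edges


# Toy graph: node 0 carries one property, node 1 none.
nodes = {
    0: {"label": "name", "properties": {"op1": "Pierre"}},
    1: {"label": "person", "properties": {}},
}
edges = [(1, "name", 0)]
exp_nodes, exp_edges, keys = expand_properties(nodes, edges)
back_nodes, back_edges = collapse_properties(exp_nodes, exp_edges, keys)
```

After expansion the parser only has to predict nodes and labeled edges; the collected keys tell the post-processing step which edges to fold back into properties.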
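The rule cascade for the actions shared by all frameworks can be sketched as a single decision function. This is a deliberately simplified, static illustration: it returns one action rather than a set of optimal transitions, it omits New, Add, and Drop, and it represents unprocessed gold edges as a hypothetical set `pending` of (head, child) pairs without labels.

```python
def oracle_step(stack, buffer, pending):
    """Pick the next shared action from the current state.

    stack: list of nodes, top is the last item.
    buffer: list of nodes, front is index 0.
    pending: set of (head, child) gold edges that are not yet created.
    """
    if not stack and not buffer:
        return None  # nothing left to process
    if stack and buffer:
        top, first = stack[-1], buffer[0]
        if (top, first) in pending:
            return "Right"  # buffer node is the child
        if (first, top) in pending:
            return "Left"   # stack node is the child
    if stack and not any(stack[-1] in edge for edge in pending):
        return "Reduce"     # stack top has no unprocessed edges
    if not stack and buffer:
        return "Shift"
    return "Swap"           # fallback when no other rule applies


examples = [
    oracle_step(["a"], ["b"], {("a", "b")}),
    oracle_step(["a"], ["b"], {("b", "a")}),
    oracle_step(["a"], ["b"], set()),
    oracle_step([], ["b"], {("b", "c")}),
    oracle_step(["a"], ["b"], {("a", "c")}),
]
```

The order of the checks mirrors the order in which the rules are listed above; a dynamic oracle would instead collect every action that keeps the gold graph reachable.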
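The three pruning steps can be sketched as follows. The sketch assumes that a repeated node is one whose id has already been seen and that a repeated edge is an identical (source, label, target) triple; the function name `prune_graph` and the data layout are hypothetical.

```python
def prune_graph(nodes, edges, framework):
    """Prune a predicted graph.

    nodes: list of (node_id, label); edges: list of (source, label, target).
    """
    drop_empty = framework in {"EDS", "AMR"}
    kept_nodes, kept_ids = [], set()
    for node_id, label in nodes:
        if drop_empty and not label:
            continue  # node with an empty label (EDS and AMR only)
        if node_id in kept_ids:
            continue  # repeated node
        kept_ids.add(node_id)
        kept_nodes.append((node_id, label))

    kept_edges, seen_edges = [], set()
    for edge in edges:
        source, _, target = edge
        if edge in seen_edges:
            continue  # repeated edge
        if source not in kept_ids or target not in kept_ids:
            continue  # edge attached to a deleted node
        seen_edges.add(edge)
        kept_edges.append(edge)
    return kept_nodes, kept_edges


raw_nodes = [(0, "a"), (0, "a"), (1, ""), (2, "b")]
raw_edges = [(0, "x", 2), (0, "x", 2), (0, "y", 1)]
amr_nodes, amr_edges = prune_graph(raw_nodes, raw_edges, "AMR")
dm_nodes, dm_edges = prune_graph(raw_nodes, raw_edges, "DM")
```

For AMR the empty-label node 1 and its edge are removed, whereas for DM only the duplicates are removed.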

Data settings
Our system is trained and evaluated on the data provided by the shared task. The data sizes are shown in Table 5. We randomly sample 3% of the training data of each framework as the development set. After the hyperparameters are determined, we train our system on all the training data. The shared task also evaluates the systems on 100 annotated sentences from The Little Prince, denoted as "lpps" in the Results section.

Model Settings
We implement our model with PyTorch (https://pytorch.org/) and tune it on the development set. During inference, we use greedy decoding to get the action sequence. Table 6 shows the hyperparameter settings. The optimizer is Adam (Kingma and Ba, 2015). Dropout is applied to the embeddings, the outputs of the BiLSTMs, and the outputs of the first MLP layers. If a sentence is longer than the maximum length, the excess tokens are discarded. "Other features" refers to the node labels in the stack and buffer and the previous actions introduced in Section 3.2.

Results
The evaluation is conducted blindly. The MRP scores are shown in Table 3. For the framework-specific metrics, the SDP results for DM and PSD are reported in Table 4, the EDM results for EDS in Table 7, and the SMATCH results for AMR in Table 9. Table 3 also contains a comparison with the TUPA baseline (Hershcovich and Arviv, 2019). For some of the frameworks, our model is better than the TUPA baseline.

Analysis
Although we follow the model architecture and dynamic oracle of TUPA, we adopt a different transition set with a different feature set and settings. For example, TUPA only generates a node when an unprocessed edge is met and the node is on it, and TUPA has separate actions to predict the node label, edge label, node property, and edge attribute. In contrast, we have actions to generate nodes (New and Add), and the node or edge label is predicted when the node or the edge is generated. However, our set does not have actions for node properties and edge attributes, which are handled as described in Section 3.3. The motivation for designing our transition set is to use fewer actions to parse a sentence.
For the results, we find that the MRP metric may not be ideal for every framework. For example, our MRP results for UCCA are comparable to those of the baseline, whereas on the UCCA task-specific metric, ours (Table 8) are much lower than TUPA's. That is, a better MRP result may not reflect a better task-specific result. This is because MRP scores some items that are not present in UCCA graphs, such as labels and properties, and because the edge overlap search methods differ. The gap comes from our transition set, which is not well suited to UCCA, mainly because we generate non-terminal nodes separately, whereas TUPA directly generates edges attached to the non-terminal nodes, and our method may even illegally connect two terminal nodes. Other frameworks, such as DM, have the same issue. These task-specific metrics pay more attention to edges and use a different overlap search method from the MRP metric, which is more similar to the AMR-specific metric SMATCH.
Table 8: UCCA task-specific results.
                     all           lpps
labeled primary      4.7           5.6
labeled remote       0.6           1.6
labeled all          4.5 (22.4)    5.5 (28.4)
labeled rank         9             9
unlabeled primary    6.5           7.7
unlabeled remote     1.1           3.3
unlabeled all        6.3 (27.1)    7.5 (33.1)
unlabeled rank       9             9
Only for AMR are both our MRP results and our SMATCH results better, which may be due to the separate New action and to expanding the properties as nodes.
In Table 10, we compare the precision, recall, and F1 results for the MRP metric. Although the F1 scores are comparable, our precision scores are much higher than TUPA's, whereas our recall scores are much lower. That is, we predict the elements of the graph more accurately, but our model misses too many nodes and edges. The higher precision is due to the New action and the separate property classifiers, which bring better element prediction; using fewer actions also makes the prediction more accurate. However, the design of the oracle and the training may have flaws, so some tokens are dropped and some edges are not predicted, which leads to the low recall. Our parser tends to predict smaller graphs, so for frameworks that tend to have bigger graphs, such as PSD and EDS, the MRP results of our parser are worse.
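The interaction between precision, recall, and F1 can be made concrete with a short calculation. The counts below are invented purely for illustration and are not taken from the shared task results; the function name `prf` is hypothetical.

```python
def prf(n_correct, n_predicted, n_gold):
    """Precision, recall, and F1 from counts of graph elements."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical parser A: predicts a small graph, most of it correct.
small_graph = prf(n_correct=40, n_predicted=50, n_gold=100)
# Hypothetical parser B: predicts a full-size graph, about half of it correct.
full_graph = prf(n_correct=53, n_predicted=100, n_gold=100)
```

In this toy setting both parsers reach an F1 of roughly 0.53, but the first does so with precision 0.8 and recall 0.4, which is the pattern a parser that predicts smaller graphs would show.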

Conclusion
In this paper, we describe our transition-based multi-task parsing system for the CoNLL 2019 Shared Task: Cross-Framework Meaning Representation Parsing. In our system, we integrate all the frameworks into one transition-based neural model using shared features, and we focus on the unified overall MRP metric. The results of the blind test show that our system achieves a unified MRP metric F1 score of 42%. Compared with the TUPA baseline, our parser has higher precision but lower recall. In future work, we will optimize our transition set and oracle for better performance.