Amazon at MRP 2019: Parsing Meaning Representations with Lexical and Phrasal Anchoring

This paper describes the system submission of our team Amazon to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference on Computational Natural Language Learning (CoNLL). Via extensive analysis of implicit alignments in AMR, we recategorize five meaning representations (MRs) into two classes: Lexical-Anchoring and Phrasal-Anchoring. We then propose a unified graph-based parsing framework for the lexical-anchoring MRs, and a phrase-structure parsing approach for one of the phrasal-anchoring MRs, UCCA. Our system submission ranked 1st in the AMR subtask, and later improvements show promising results on other frameworks as well.


Introduction
The design and implementation of broad-coverage and linguistically motivated meaning representation frameworks for natural language has attracted growing attention in recent years. With the advent of deep neural network-based machine learning techniques, significant progress has been made in automatically parsing sentences into structured meaning representations (Oepen et al., 2015; May, 2016). Moreover, the differences between representation frameworks have a significant impact on the design and performance of parsing systems. Due to the abstract nature of semantics, there is a diverse set of meaning representation frameworks in the literature (Abend and Rappoport, 2017). In some application scenarios, task-specific formal representations such as database queries and arithmetic formulas have also been proposed. However, the study of computational semantics primarily focuses on frameworks that are theoretically grounded in formal semantic theories, sometimes also with assumptions about underlying syntactic structures.
* Work done when Jie Cao was an intern at AWS AI.
Anchoring is crucial in graph-based meaning representation parsing. Training a statistical parser typically starts with a conjectured alignment between tokens/spans and the semantic graph nodes, which helps to factorize the supervision of the graph structure into nodes and edges. In this paper, with evidence from previous research on AMR alignments (Pourdamghani et al., 2014; Flanigan et al., 2014; Wang and Xue, 2017; Chen and Palmer, 2017; Szubert et al., 2018; Lyu and Titov, 2018), we propose a uniform handling of three meaning representations, from Flavor-0 (DM, PSD) and Flavor-2 (AMR), as a new group referred to as the lexical-anchoring MRs. This group supports both explicit and implicit anchoring of semantic concepts to tokens. The other two meaning representations, from Flavor-1 (EDS, UCCA), are referred to as the phrasal-anchoring MRs, where semantic concepts are anchored to phrases as well.
To reflect this simplified taxonomy, we named our parser LAPA (Lexical-Anchoring and Phrasal-Anchoring). For lexical anchoring, we proposed a graph-based parsing framework with a latent-alignment mechanism to support both explicit and implicit lexical anchoring. According to the official evaluation results, our submission for this group ranked 1st in the AMR subtask, 6th on PSD, and 7th on DM, among 16 participating teams. For phrasal anchoring, we proposed a CKY-based constituent tree parsing algorithm to resolve the anchors in UCCA, and our post-evaluation submission ranked 5th on the UCCA subtask.

Anchoring in Meaning Representation
The 2019 Conference on Computational Natural Language Learning (CoNLL) hosted a shared task on Cross-Framework Meaning Representation Parsing (MRP 2019), which encouraged participants to build a parser for five different meaning representations in three distinct flavors. Flavor-0 includes the DELPH-IN MRS Bi-lexical Dependencies (DM, Ivanova et al., 2012) and Prague Semantic Dependencies (PSD, Hajic et al., 2012). Both frameworks in this flavor have a syntactic backbone that is (either natively or by proxy) based on bi-lexical dependency structures. As a result, the semantic concepts in these meaning representations can be anchored to the individual lexical units of the sentence. Flavor-1 includes Elementary Dependency Structures (EDS, Oepen and Lønning, 2006) and the Universal Conceptual Cognitive Annotation framework (UCCA, Abend and Rappoport, 2013), which show an explicit, many-to-many anchoring of semantic concepts onto sub-strings of the underlying sentence. Finally, Flavor-2 includes Abstract Meaning Representation (AMR, Banarescu et al., 2013), which is designed to abstract the meaning representation away from its surface tokens, but leaves open the question of how its nodes are derived from them. Previous studies have shown that the nodes in AMR graphs are predominantly aligned with surface lexical units, although explicit anchoring is absent from the AMR representation. In this section, we review related work supporting the claim that the implicit anchoring in AMR is actually lexical anchoring, so that AMR can be grouped with Flavor-0 when we consider parsing methods.

Implicit Anchoring in AMR
AMR tries to abstract the meaning representation away from the surface tokens. The absence of explicit anchoring can present difficulties for parsing. In this section, through an extensive analysis of previous work on AMR alignments, we show that AMR nodes can be implicitly aligned to the lexical tokens in a sentence.
AMR-to-String Alignments A straightforward solution for finding the missing anchoring in an AMR graph is to align the graph with the sentence. We refer to this as AMR-to-String alignment.
ISI alignments (Pourdamghani et al., 2014) first linearize the AMR graph into a sequence, and then use the IBM word alignment model (Brown et al., 1993) to align the linearized sequence of concepts and relations with the tokens in the sentence. According to the AMR annotation guidelines and an error analysis of the ISI aligner, some nodes or relations are evoked by subwords. For example, the whole graph fragment (p/possible-01 :polarity -) is evoked by the word "impossible", where the subword "im-" evokes the relation polarity and the concept "-". On the other hand, concepts are sometimes evoked by multiple words, e.g., named entities such as (c/city :name (n/name :op1 "New" :op2 "York")), which also happens in the explicit anchoring of DM and PSD. Hence, aligning and parsing with recategorized graph fragments is a natural solution in aligners and parsers. The JAMR aligner (Flanigan et al., 2014) uses a set of rules to greedily align single tokens, special entities, and a set of multiword expressions to AMR graph fragments, and is widely used in previous AMR parsers (e.g., Flanigan et al., 2014; Wang et al., 2015; Artzi et al., 2015; Pust et al., 2015; Peng et al., 2015; Konstas et al., 2017; Wang and Xue, 2017).
Other AMR-to-String alignments exist, such as the extended HMM-based aligner. To take more structural information in the linearized AMR concepts into account, Wang and Xue (2017) proposed a Hidden Markov Model (HMM)-based alignment method with a novel graph distance. All of these aligners report over 90% F-score on their own hand-aligned datasets, which shows that AMR-to-String alignments amount to almost token-level anchoring.
AMR-to-Dependency Alignments Chen and Palmer (2017) first attempted to align an AMR graph with a syntactic dependency tree. Szubert et al. (2018) conducted further analysis of the interface between dependency trees and AMR, showing that 97% of AMR edges can be evoked by words or by the syntactic dependency edges between words. The nodes in the dependency graph are each anchored to a lexical token in the original sentence. Hence, this observation indirectly shows that AMR nodes can be aligned to the lexical tokens in the sentence.
Both AMR-to-String and AMR-to-Dependency alignments show that AMR nodes, including recategorized AMR graph fragments, do have implicit lexical anchoring. Based on this, Lyu and Titov (2018) propose to treat token-node alignments as a discrete and exclusive alignment matrix and to learn the latent alignment jointly with parsing. Recently, an attention-based seq2graph model also achieved state-of-the-art accuracy on AMR parsing. However, whether the attention weights can be interpreted as AMR alignments needs more investigation in the future.

Taxonomy of Anchoring
Given the above analysis on implicit alignments in AMR, in this section, we further discuss the taxonomy of anchoring of the five meaning representations in this shared task.
Lexical-Anchoring Given the bi-lexical dependency structures of DM and PSD, and the implicit lexical token anchoring of AMR, the nodes or recategorized graph fragments of DM, PSD, and AMR are anchored to surface lexical units in an explicit or implicit way. In particular, those lexical units do not overlap with each other, and most of them are just single tokens, multiword expressions, or named entities. In other words, when parsing a sentence into DM, PSD, or AMR graphs, tokens in the original sentence can be merged by looking up a lexicon during preprocessing and then treated as a single token for aligning or parsing.
Phrasal-Anchoring In contrast to lexical anchoring, where anchors do not overlap, nodes in EDS and UCCA may align to larger, overlapping word spans that involve syntactic or semantic phrasal structure. Nodes in UCCA do not have node labels or node properties, but all the nodes are anchored to spans of the underlying sentence. Furthermore, the nodes in UCCA are linked into a hierarchical structure, with edges going between parent and child nodes. With certain exceptions (e.g., remote edges), the majority of UCCA graphs are tree-like structures. According to their position as well as their anchoring style, nodes in UCCA can be classified into the following two types:
1. Terminal nodes are the leaf semantic concepts anchored to individual lexical units in the sentence.
2. Non-terminal nodes are usually anchored to a span with more than one lexical unit, and thus usually overlap with the anchoring of terminal nodes.
A similar classification of anchoring nodes also applies to the nodes in EDS, although they do not regularly form a recursive tree as in UCCA. In the running example in Figure 1, most of the nodes are terminal nodes, which can be explicitly anchored to a single token in the original sentence. However, the bold non-terminal nodes are anchored to a large span of words. For example, the node "undef q" with span <53:100> is aligned to the whole substring starting from "other crops" to the end; the abstract node with label imp conj corresponds to the whole coordinate structure between soybeans and rice.
In summary, by treating AMR as an implicitly lexically anchored MR, we propose a simplified taxonomy for parsing the five meaning representations in this shared task.

Model
For the two groups of meaning representations defined in Section 2, we propose two parsing frameworks in this section: a graph-based parsing framework with latent alignment for lexically anchored MRs, and a minimal span-based CKY parser for one of the phrasally anchored MRs, UCCA. 2

Graph-based Parsing Framework with Latent Alignment
Before formulating the graph-based model as a probabilistic model in Equation 1, we introduce some notation: $C$ and $R$ are the sets of concepts (nodes) and relations (edges) in the graph, and $w$ is a sequence of tokens. $a \in \mathbb{Z}^m$ encodes the alignment, where each $a_i$ is the index of the token to which the $i$-th node is aligned. When modeling the negative log-likelihood (NLL) loss, with an independence assumption between each node and edge, we decompose it into node- and edge-identification pipelines.
2 After the CKY parser obtains the relevant phrasal spans, a graph-based parser can also be used to predict the relations between nodes.
In DM, PSD, and AMR, every token is aligned only once. Hence, we train a joint model to maximize the above probability for both node identification $P(c_i \mid h_{a_i})$ and edge identification $P(r_{ij} \mid h_{a_i}, c_i, h_{a_j}, c_j)$, and we need to marginalize out the discrete alignment variable $a$.
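Equation 1 itself is not reproduced in this text. Under the notation and independence assumptions stated above, a factorization consistent with the description would read as follows. This is a hedged reconstruction: the index ranges and the explicit prior $P(a)$ are assumptions, not necessarily the exact form used in the system.

```latex
P(C, R \mid w)
  = \sum_{a} P(a)
    \prod_{i=1}^{m} P(c_i \mid h_{a_i})
    \prod_{i=1}^{m} \prod_{j=1}^{m} P(r_{ij} \mid h_{a_i}, c_i, h_{a_j}, c_j),
\qquad
\mathrm{NLL} = -\log P(C, R \mid w)
```

Here $m$ is the number of nodes, and the sum over $a$ is the marginalization over alignments mentioned above.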

Alignment Model
The above model can support both explicit alignments for DM, PSD, and implicit alignments for AMR.
Explicit Alignments For DM and PSD, with explicit alignments $a^*$, we can use $P(a^*) = 1.0$ and $P(a) = 0.0$ for every other alignment $a \neq a^*$.
Implicit Alignments For AMR, without gold alignments, one needs to compute all the valid alignments and then condition the node- and edge-identification methods on them.
However, it is computationally intractable to enumerate all alignments. We instead estimate a posterior alignment model $Q$ as in Equation 3; please refer to Lyu and Titov (2018) for more details.
• Variational inference is applied to reduce the objective to the Evidence Lower Bound (ELBO, Kingma and Welling, 2013).
• The denominator $Z_\Psi$ in $Q$ can be estimated by Perturb-and-MAP (Papandreou and Yuille, 2011).
Here, $\Psi$ scores each alignment link between node $i$ and the corresponding word, $g_i$ is the node encoding, and $h_{a_i}$ is the encoding of the aligned token.

Node Identification
Node identification predicts a concept c given a word. A concept can be either NULL (when there is no semantic node anchored to that word, e.g., the word is dropped), or a node label (e.g., lemma, sense, POS, name value in AMR, frame value in PSD), or other node properties. One challenge in node identification is data sparsity. Many of the labels come from open sets derived from the input token, e.g., its lemma. Moreover, some labels are constrained to a deterministic label set given the word. Hence, we designed a copy mechanism (Luong et al., 2014) in our neural network architecture to decide whether to copy a deterministic label given a word or to estimate a classification probability over a fixed label set.
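The copy decision described above can be illustrated with a generic gated mixture. The following is a minimal sketch only, assuming a sigmoid copy gate and a softmax over a fixed label set; the function name `node_label_distribution`, its arguments, and the gating form are hypothetical and are not taken from the system itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def node_label_distribution(vocab, vocab_scores, copy_candidate, copy_logit):
    """Mix a fixed-label-set distribution with a single copy option.

    vocab / vocab_scores: fixed label set and its unnormalized scores.
    copy_candidate: label derived deterministically from the word (e.g. a lemma).
    copy_logit: unnormalized score of the (assumed) copy gate.
    """
    p_copy = 1.0 / (1.0 + math.exp(-copy_logit))  # probability of copying
    generated = softmax(vocab_scores)
    dist = {label: (1.0 - p_copy) * p for label, p in zip(vocab, generated)}
    # The copied label may also be in the fixed set, so accumulate its mass.
    dist[copy_candidate] = dist.get(copy_candidate, 0.0) + p_copy
    return dist


if __name__ == "__main__":
    dist = node_label_distribution(
        ["NULL", "want-01", "person"], [2.0, 0.5, 0.1], "soybean", 3.0
    )
    print(max(dist, key=dist.get))
```

With a high copy logit the copied label dominates; with a low one the fixed label set (including NULL) takes over, which is the behaviour needed for open-set labels such as lemmas.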

Edge Identification
Assuming that the edges are independent of each other, we model the edge probabilities independently. Given two nodes and their underlying tokens, we predict the edge label as the semantic relation between the two concepts with a bi-affine classifier (Dozat and Manning, 2016).
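To make the bi-affine scoring concrete, the sketch below computes one score per relation label from a head vector and a dependent vector, using a bilinear term, a linear term over the concatenation, and a bias. It is a plain-Python illustration under assumed shapes; the function name `biaffine_scores` and the parameter layout (`U`, `W`, `b`) are hypothetical and do not describe the actual network.

```python
def biaffine_scores(head, dep, U, W, b):
    """Return one score per relation label.

    head, dep: lists of floats (head and dependent node representations).
    U: per-label matrices, U[k][p][q], for the bilinear term head^T U[k] dep.
    W: per-label weight vectors over the concatenation [head; dep].
    b: per-label biases.
    """
    concat = head + dep
    scores = []
    for U_k, w_k, b_k in zip(U, W, b):
        bilinear = sum(
            head[p] * U_k[p][q] * dep[q]
            for p in range(len(head))
            for q in range(len(dep))
        )
        linear = sum(w * x for w, x in zip(w_k, concat))
        scores.append(bilinear + linear + b_k)
    return scores


if __name__ == "__main__":
    head, dep = [1.0, 0.0], [0.0, 1.0]
    U = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    W = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
    b = [0.0, 0.5]
    print(biaffine_scores(head, dep, U, W, b))
```

A softmax over the returned scores would give the per-edge label distribution used by the later inference step.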

Inference
In our two-stage graph-based parsing, after nodes are identified, edge identification only outputs a probability distribution over all the relations between identified nodes. We therefore need an inference algorithm to search for the maximum spanning connected graph among all the relations. We use MSCG (Flanigan et al., 2014) to greedily select the most valuable edges among the identified nodes and the relations connecting them. As shown in Figure 2, an input sentence goes through preprocessing, node identification, edge identification, root identification, and MSCG to generate a final connected graph as structured output.
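The greedy selection step can be sketched as follows. This is a simplified illustration in the spirit of MSCG, assuming real-valued edge scores and treating edges as undirected for the connectivity check: it keeps every positively scored edge, then adds the best remaining edges that join two disconnected components. It omits further constraints of the published algorithm, and the function name and tuple layout are hypothetical.

```python
def greedy_connected_graph(num_nodes, scored_edges):
    """Select edges so that the graph over num_nodes nodes becomes connected.

    scored_edges: list of (score, u, v) tuples with 0 <= u, v < num_nodes.
    Returns the list of selected (score, u, v) tuples.
    """
    parent = list(range(num_nodes))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x == root_y:
            return False
        parent[root_x] = root_y
        return True

    selected, remaining = [], []
    # Step 1: keep every edge with a positive score.
    for edge in scored_edges:
        score, u, v = edge
        if score > 0:
            selected.append(edge)
            union(u, v)
        else:
            remaining.append(edge)
    # Step 2: add the best remaining edges that connect two components.
    for edge in sorted(remaining, key=lambda e: e[0], reverse=True):
        _, u, v = edge
        if union(u, v):
            selected.append(edge)
    return selected


if __name__ == "__main__":
    edges = [(2.0, 0, 1), (-0.5, 1, 2), (-1.0, 0, 2), (-0.2, 2, 3)]
    print(greedy_connected_graph(4, edges))
```

The second step is essentially Kruskal's algorithm run on the edges that were not already selected, which is why the result is connected whenever the candidate edges allow it.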

Minimal Span-based CKY Parsing Framework
We now turn to our phrasal-anchoring parser for UCCA. We first introduce the transformation we use to reduce UCCA parsing to a constituent parsing task, and then describe the detailed CKY model for constituent parsing.

Graph-to-CT Transformation
We propose to transform a graph into a constituent tree structure for parsing, an approach also used in recent work (Jiang et al., 2019). Figure 3 shows an example of transforming a UCCA graph into a constituent tree. The primary transformation assigns the original label of an edge to its child node. Then, to make the tree compatible with parsers for the standard Penn Treebank format, we add some auxiliary nodes: the special non-terminal nodes TOP and HEAD, and the special terminal nodes TOKEN and MWE. We remove all the "remote" annotations in UCCA since the constituent tree structure does not support reentrancy. A fully compatible transformation should support both graph-to-tree and tree-to-graph transformation. In our case, due to time constraints, we remove those remote edges and reentrant edges during training. Besides that, we also noticed that for multiword expressions, the children of a parent node might not form a continuous span (i.e., a discontinuous constituent), which is also not supported by our constituent tree parser. Hence, when training the tree parser, we reattach the discontinuous tokens to their nearest continuous parent nodes, forcing every subspan in the transformed trees to be continuous. We leave the post-processing needed to recover those discontinuous spans as future work.
For inference, given an input sentence, we first use the trained constituent tree parsing model to parse it into a tree. We then transform the tree back into a directed graph by assigning each child's node label to its incoming edge, deleting the auxiliary labels, and adding anchors to every remaining node.
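The core of the transformation, moving each edge label onto its child node and back, can be sketched as below. This is a minimal illustration under assumptions: edges are given as hypothetical `(parent, child, label, is_remote)` tuples, only the auxiliary TOP label is modelled, and the HEAD, TOKEN, and MWE nodes as well as the handling of discontinuous spans are left out.

```python
def graph_to_labeled_tree(edges, root):
    """Push each primary edge label down to its child node.

    edges: list of (parent, child, label, is_remote) tuples (assumed layout).
    Returns (node_label, children); remote edges are dropped, as in training.
    """
    node_label = {root: "TOP"}  # auxiliary label for the root
    children = {}
    for parent, child, label, is_remote in edges:
        if is_remote:
            continue  # a tree cannot represent reentrancy
        node_label[child] = label
        children.setdefault(parent, []).append(child)
    return node_label, children


def labeled_tree_to_edges(node_label, children):
    """Inverse step: each child's node label becomes its incoming edge label."""
    return [
        (parent, child, node_label[child])
        for parent, kids in children.items()
        for child in kids
    ]


if __name__ == "__main__":
    edges = [
        ("r", "a", "H", False),
        ("a", "b", "P", False),
        ("a", "c", "A", False),
        ("r", "c", "A", True),  # remote edge, removed by the transformation
    ]
    labels, kids = graph_to_labeled_tree(edges, "r")
    print(labeled_tree_to_edges(labels, kids))
```

The round trip recovers every primary edge but not the remote one, which mirrors the information loss discussed in the text.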

CKY Parsing and Span Encoding
After transforming the UCCA graph into a constituent tree, we reduce UCCA parsing to a constituent tree parsing problem. Similar to previous work on UCCA constituent tree parsing (Jiang et al., 2019), we use a minimal span-based CKY parser for constituent tree parsing. The intuition is to use dynamic programming to recursively split the span of a sentence, as shown in Figure 3. The entire sentence is split from top to bottom until each span is a single, unsplittable token. For each node, we also need to assign a label. Two simplifying assumptions are made when predicting the whole tree given a sentence. Unlike previous work, however, we use an 8-layer transformer encoder with 8 heads, which shows better performance than the LSTM in Kitaev and Klein (2018).
Tree Factorization In the graph-to-tree transformation, we move the edge label to its child node. By assuming that the labels of the nodes are independent, we factorize the tree structure prediction into independent span-label predictions, as in Equation 4. However, this assumption does not hold for UCCA. Please see the error analysis in §4.4 for more details.
CKY Parsing By assuming that label prediction is independent of the splitting point, we can further factorize the whole tree into the dynamic program in Equation 5.
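Under the two independence assumptions above, the best tree can be found with a standard span-based dynamic program: the score of a span is its best label score plus the best sum over split points of the two sub-span scores. The sketch below is a generic illustration of that recursion, not a transcription of Equations 4 and 5; the function name `cky_best_tree` and the scoring callback are hypothetical, and real scores would come from the span encoder.

```python
from functools import lru_cache


def cky_best_tree(n, labels, span_label_score):
    """Return (score, tree) for the best binary tree over tokens [0, n).

    labels: list of candidate labels.
    span_label_score(i, j, label): score of assigning label to span (i, j).
    A tree node is a tuple (label, i, j, children).
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        # Label choice is independent of the split point.
        label = max(labels, key=lambda l: span_label_score(i, j, l))
        label_score = span_label_score(i, j, label)
        if j - i == 1:
            return label_score, (label, i, j, ())
        # Try every split point and keep the best-scoring pair of sub-trees.
        split_score, k = max(
            (best(i, k)[0] + best(k, j)[0], k) for k in range(i + 1, j)
        )
        children = (best(i, k)[1], best(k, j)[1])
        return label_score + split_score, (label, i, j, children)

    return best(0, n)


if __name__ == "__main__":
    toy_scores = {(0, 2, "A"): 5.0, (0, 3, "B"): 1.0}
    score, tree = cky_best_tree(
        3, ["A", "B"], lambda i, j, l: toy_scores.get((i, j, l), 0.0)
    )
    print(score, tree)
```

Memoizing `best` gives the usual cubic-time behaviour in sentence length, with an extra factor for the number of labels.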
Span Encoding For each span $(i, j)$, we represent the span encoding vector as $v_{(i,j)} = [\overrightarrow{y}_j - \overrightarrow{y}_i] \oplus [\overleftarrow{y}_{j+1} - \overleftarrow{y}_{i+1}]$, where $\oplus$ denotes vector concatenation. Assuming a bidirectional sentence encoder, $\overrightarrow{y}_i$ and $\overleftarrow{y}_i$ are the forward and backward encodings of the $i$-th word. Following previous work, we also use loss-augmented inference during training. More details about the network architecture are given in Section 4.2.
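The span encoding can be computed directly from the per-position encoder outputs. The sketch below assumes that the forward and backward encodings are available as lists of vectors indexed by fencepost position, with one extra boundary position at each end so that index j + 1 is always valid; the function name `span_encoding` and this padding convention are assumptions made for illustration.

```python
def span_encoding(forward, backward, i, j):
    """Concatenate forward and backward difference vectors for span (i, j).

    forward, backward: lists of equal-length float vectors, one per position,
    padded so that indices 0 .. j + 1 exist (assumed convention).
    """
    forward_part = [a - b for a, b in zip(forward[j], forward[i])]
    backward_part = [a - b for a, b in zip(backward[j + 1], backward[i + 1])]
    return forward_part + backward_part  # list concatenation


if __name__ == "__main__":
    forward = [[0.0], [1.0], [3.0], [6.0], [10.0]]
    backward = [[10.0], [9.0], [7.0], [4.0], [0.0]]
    print(span_encoding(forward, backward, 0, 2))
```

Using differences rather than raw states means the vector depends on what the encoder accumulated inside the span, which is what the label and split scorers consume.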

Summary of Implementation
We summarize our implementation for the five meaning representations in Table 1. As mentioned in the previous sections, we use latent-alignment graph-based parsing for the lexical-anchoring MRs (DM, PSD, AMR), and CKY-based constituent parsing for the phrasal-anchoring MRs (UCCA, EDS). This section describes the main design decisions for our models.
Top The first row, "Top", shows the number of root nodes in the graph. We can see that for PSD, 11.56% of graphs have more than one top node. In our system, we predict only one top node with an N-way classifier (where N is the number of identified nodes), and then fix this with a post-processing strategy: when our model predicts one node as the top node and we find additional nodes coordinated with it, we also add those coordination nodes as top nodes.
Table 1: Detailed classifiers in our model. Round brackets give the number of output classes of our classifier; * means that a copy mechanism is used in the classifier. At the end of the shared task, EDS was not fully supported for obtaining official results; we leave it as future work.
Node Except for UCCA, all other four MRs have labeled nodes; the row "Node Label" shows the templates of a node label. For DM and PSD, the node label is usually the lemma of its underlying token. However, this lemma is neither the same as the one in the given companion data nor the one predicted by the Stanford lemma annotator. One common challenge in predicting node labels is the open label set problem. Usually, the lemma is one of the morphological derivations of the original word, but the derivation rules are not easy to create manually. In our experiments, we found that handcrafted rules for lemma prediction work worse than classification with a copy mechanism, except for DM. For AMR and EDS, there are other components in the node labels beyond the lemma. In particular, the node label for AMR also contains more than 143 fine-grained named entity types; EDS uses the full SEM-I entry as its node label, which requires extra classifiers for predicting the corresponding sense. In addition to the node label, the properties of the node also need to be predicted. Among them, the node properties of DM come from the SEM-I sense and arguments handler, while for PSD, senses are constrained to those in the predefined Vallex lexicon.
Edge Edge prediction is another challenge in our task because of its large label set (from 45 to 94 labels), as shown in the row "Edge Label", where the round brackets give the number of output classes of our classifiers. For lexical-anchoring MRs, edges usually connect two tokens, while phrasal anchoring requires extra effort to determine the span corresponding to each node. For example, in UCCA parsing, to predict edge labels, we first predict the node spans, then the node labels based on those spans, and finally transform the node labels back into edge labels.
Connectivity Besides the local label classification for nodes and edges, there is a global structural constraint for all five MRs: all the nodes and edges should eventually form a connected graph. For lexical anchoring, we use the MSCG algorithm to greedily find the maximum connected graph; for phrasal anchoring, we use dynamic programming to decode the constituent tree and then deterministically transform it back into a connected UCCA graph.

Dataset and Evaluation
For DM, PSD, and EDS, we split the training set by taking WSJ sections 00-19 for training and section 20 as the development set. For the other datasets, during development and parameter tuning, we use splits with a ratio of 25:1:1. In our submitted model, we did not use multitask learning for training. Following the unified MRP metrics of the shared task, we select our models on the development set and finally evaluate on the private test set. For more details of the metrics, please refer to the overview of the MRP 2019 task.

Model Setup
For the lexical-anchoring model setup, our network mainly consists of node and edge prediction models. AMR, DM, and PSD all use a one-layer bidirectional LSTM as the input sentence encoder, and a two-layer bidirectional LSTM as the head or dependent node encoder in the bi-affine classifier. Each sentence encoder takes a sequence of word embeddings as input (we use 300-dimensional GloVe embeddings here), and its output is passed through a softmax layer to predict the output distribution. For the latent AMR model, to model the posterior alignment, we use another Bi-LSTM for node sequence encoding. For the phrasal-anchoring model setup, we follow the original model setup in Kitaev and Klein (2018), and we use an 8-layer, 8-head transformer with position encoding to encode the input sentence.
For all sentence encoders, we also use a character-level CNN to produce character-level embeddings, without any pre-trained deep contextualized embedding model. Equipping our model with BERT or multi-task learning is a promising way to obtain further improvements. We leave this as future work.
Our models are trained with Adam (Kingma and Ba, 2014), using a batch size of 64 for the graph-based models and 250 for the CKY-based model. Hyperparameters were tuned on the development set, based on labeled F1 between two graphs. We use early stopping to avoid over-fitting.

Results
At the time of the official evaluation, we submitted three lexical-anchoring parsers; we then submitted a phrasal-anchoring model for UCCA parsing during the post-evaluation stage, and we leave EDS parsing as future work. The following sections present the official results and error breakdowns for lexical anchoring and phrasal anchoring, respectively. Table 2 shows the official results for our lexical-anchoring models on AMR, DM, and PSD. Using our latent-alignment-based AMR parser, our system ranked 1st in the AMR subtask and outperformed the top 5 models by a large margin. Our PSD parser ranked 6th, only 0.02% behind the 5th-ranked model. However, the official results on DM and PSD show that there is still a performance gap of around 2.5 points between our model and the top-ranked model. Table 3 shows that our span-based CKY model for UCCA

Error Analysis on Lexical-Anchoring
As shown in Table 4, our AMR parser is good at predicting node properties and consistently performs better than other models on all subcomponents, except for top prediction. Node properties in AMR are usually named entities, negation, and some other quantity entities. In our system, we recategorize such graph fragments into a single node, which helps both alignment and structured inference for those special graph fragments. We see that all three of our models perform almost as well as the top-ranked model of each subtask on node label prediction, but they perform worse on top and edge prediction. This indicates that our bi-affine relation classifier is the main bottleneck to improve. Moreover, we found that the performance gap between node labels and node anchors is almost constant, which indicates that improving our model's prediction of NULL nodes may further improve node label prediction as well. Finally, we believe that multitask learning and pre-trained deep models such as BERT (Devlin et al., 2018) may also boost the performance of our parser in the future.

Error Analysis on Phrasal-Anchoring
According to Table 7, our model with ELMo works slightly better than the top-ranked model on anchor prediction. This means that our model is good at predicting the nodes in UCCA, and we believe that it would also be helpful for predicting phrasal-anchoring nodes in EDS. However, when predicting edges and edge attributes, our model performs 7-8 points worse than the top-ranked model. In UCCA, an edge label denotes the relation between a parent node and its children. In our UCCA transformation, we assign the edge label as the node label of its child and then predict it from the child span encoding only, which misses important information from the parent node. Hence, future improvements could use both child and parent span encodings for label prediction, use another span-based bi-affine classifier for edge prediction, or recover remote edges.

Conclusion
In summary, by analyzing AMR alignments, we show that implicit AMR anchoring is actually based on lexical anchoring. We therefore propose to regroup the five meaning representations into two groups: lexical-anchoring and phrasal-anchoring. For lexical anchoring, we propose parsing DM, PSD, and AMR in a unified latent-alignment-based parsing framework. Our submission ranked 1st in the AMR subtask, and 6th and 7th in the PSD and DM subtasks. For phrasal anchoring, by reducing a UCCA graph to a constituent tree-like structure and then parsing that tree structure with span-based CKY parsing, our method would rank 5th in the original official evaluation results.