AMR Parsing with Latent Structural Information

Abstract Meaning Representations (AMRs) capture sentence-level semantics as structural representations of broad-coverage natural language sentences. We investigate parsing AMR with explicit dependency structures and interpretable latent structures. We generate the latent soft structure without additional annotations, and fuse both dependency and latent structures via an extended graph neural network. The fused structural information helps our model achieve the best reported results on both AMR 2.0 (77.5% Smatch F1 on LDC2017T10) and AMR 1.0 (71.8% Smatch F1 on LDC2014T12).


Introduction
Abstract Meaning Representations (AMRs) (Banarescu et al., 2013) model sentence-level semantics as rooted, directed, acyclic graphs. Nodes in the graph are concepts, which represent the events, objects and features of the input sentence, and edges between nodes represent semantic relations. AMR introduces re-entrance relations to depict node reuse in the graphs. It has been adopted in downstream NLP tasks, including text summarization (Liu et al., 2015; Dohare and Karnick, 2017), question answering (Mitra and Baral, 2016) and machine translation (Jones et al., 2012; Song et al., 2019).
AMR parsing aims to transform natural language sentences into AMR semantic graphs. Similar to constituent parsing and dependency parsing (Nivre, 2008; Dozat and Manning, 2017), AMR parsers mainly employ two parsing techniques: transition-based parsers (Wang et al., 2016; Damonte et al., 2017; Wang and Xue, 2017; Liu et al., 2018; Guo and Lu, 2018) use a sequence of transition actions to incrementally construct the graph, while graph-based parsers (Flanigan et al., 2014; Lyu and Titov, 2018; Zhang et al., 2019a; Cai and Lam, 2019) divide the task into concept identification and relation extraction stages and then generate a full AMR graph with decoding algorithms such as greedy search and maximum spanning tree (MST). Additionally, reinforcement learning (Naseem et al., 2019) and sequence-to-sequence models (Konstas et al., 2017) have been exploited for AMR parsing as well. Previous work (Wang et al., 2016; Artzi et al., 2015) shows that structural information can benefit AMR parsing. As illustrated by Figure 1, for example, syntactic dependencies can convey the main predicate-argument structure. However, dependency structural information may be noisy due to error propagation from external parsers. Moreover, AMR concentrates on semantic relations, which can differ from syntactic dependencies. For instance, in Figure 1, AMR selects the coordination (i.e. "and") as the root, which differs from the syntactic dependency root (i.e. "came").
Given the above observations, we investigate the effectiveness of latent syntactic dependencies for AMR parsing. Different from existing work (Wang et al., 2016), which uses a dependency parser to provide explicit syntactic structures, we make use of a two-parameter distribution (Bastings et al., 2019) to induce latent graphs, which is differentiable under reparameterization (Kingma and Welling, 2014). We thus build an end-to-end model for AMR parsing with induced latent dependency structures as a middle layer; the layer is tuned during AMR training and can therefore be better aligned with the needs of AMR structure.
To better investigate the correlation between induced and gold syntax, and to combine their strengths, we additionally fuse gold and induced structural dependencies in an align-free AMR parser (Zhang et al., 2019a). Specifically, we first obtain the input sentence's syntactic dependencies, and treat the input sentence as the prior of a probabilistic graph generator that infers the latent graph. Second, we propose an extended graph neural network (GNN) for encoding the above structural information. Subsequently, we feed the encoded structural information into a two-stage align-free AMR parser (Zhang et al., 2019a) to promote AMR parsing.
To our knowledge, we are the first to incorporate latent syntactic structure in AMR parsing. Experimental results show that our model achieves 77.5% and 71.8% Smatch F1 on the standard AMR benchmarks LDC2017T10 and LDC2014T12, respectively, outperforming all previous best reported results. Beyond that, by generating the latent graph, our model can to some extent interpret the probabilistic relations between the input words in AMR parsing.
Baseline: Align-Free AMR Parsing

We adopt the parser of Zhang et al. (2019a) as our baseline, which treats AMR parsing as sequence-to-graph transduction.

Task Formalization
Our baseline splits AMR parsing into a two-stage procedure: concept identification and edge prediction. The first stage identifies the concepts (nodes) of the AMR graph from the input tokens, and the second stage predicts the semantic relations between the identified concepts.
Formally, for a given input sequence of words $w = w_1, \ldots, w_n$, the goal of concept identification in our baseline is to sequentially predict the concept nodes $u = u_1, \ldots, u_m$ of the output AMR graph and to deterministically assign the corresponding indices $d = d_1, \ldots, d_m$. A node generated more than once (via target-side copy) receives the index of its first occurrence; these shared indices later restore re-entrancies.
After identifying the concept nodes $u$ and their corresponding indices $d$, we predict the semantic relations within the search space $R(u) = \{(u_i, r, u_j) \mid 1 \le i, j \le m\}$, the set of directed relations between concept nodes.

Align-Free Concept Identification
Our baseline extends the pointer-generator network with a self-copy mechanism for concept identification (See et al., 2017; Zhang et al., 2018a). The extended model can copy nodes not only from the source text, but also from the previously generated list of nodes on the target side.
The concept identifier first encodes the input sentence into concatenated vector embeddings built from GloVe (Pennington et al., 2014), BERT (Devlin et al., 2019), POS (part-of-speech) and character-level (Kim et al., 2016) embeddings. Subsequently, we encode the embedded sentence with a two-layer bidirectional LSTM (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997):

$$h_i^l = \mathrm{BiLSTM}(h_i^{l-1}, h_{i-1}^l),$$

where $h_i^l$ is the encoder hidden state of the $l$-th layer at time step $i$, and $h_i^0$ is the embedded token $w_i$. Different from the encoding stage, the decoder does not use pre-trained BERT embeddings, but employs a two-layer LSTM to generate the decoder hidden state $s_t^l$ at each time step:

$$s_t^l = \mathrm{LSTM}(s_t^{l-1}, s_{t-1}^l),$$

where $s_t^{l-1}$ and $s_{t-1}^l$ are the hidden states from the last layer and the previous time step respectively, and $s_0^l$ is the concatenation of the final bidirectional encoder hidden states. In addition, $s_t^0$ is generated from the concatenation of the embedding of the previous node $u_{t-1}$ and the attention vector $\tilde{s}_{t-1}$, which combines both source and target information:

$$\tilde{s}_t = \tanh\big(W_c [c_t; s_t^L] + b_c\big),$$

where $W_c$ and $b_c$ are trainable parameters, and $c_t$ is the context vector computed from the attention-weighted encoder hidden states and the source attention distribution $a_t^{\mathrm{src}}$, following Bahdanau et al. (2015). The attention vector $\tilde{s}_t$ is used to generate the vocabulary distribution,

$$P_{\mathrm{vocab}}(u_t) = \mathrm{softmax}\big(W_{\mathrm{vocab}}\, \tilde{s}_t + b_{\mathrm{vocab}}\big),$$

as well as the target attention distribution $a_t^{\mathrm{tgt}}$, computed over the previously generated attention vectors $\tilde{s}_{<t}$. The source-side copy probability $p_{\mathrm{src}}$, target-side copy probability $p_{\mathrm{tgt}}$ and generation probability $p_{\mathrm{gen}}$ are calculated from $\tilde{s}_t$ and can be treated as generation switches:

$$[p_{\mathrm{gen}}, p_{\mathrm{src}}, p_{\mathrm{tgt}}] = \mathrm{softmax}\big(W_{\mathrm{switch}}\, \tilde{s}_t + b_{\mathrm{switch}}\big).$$

The final distribution is defined below. If $u_t$ is copied from the existing nodes,

$$P^{(\mathrm{node})}(u_t) = p_{\mathrm{tgt}} \sum_{j:\, u_j = u_t} a_t^{\mathrm{tgt}}[j];$$

otherwise,

$$P^{(\mathrm{node})}(u_t) = p_{\mathrm{src}} \sum_{i:\, w_i = u_t} a_t^{\mathrm{src}}[i] + p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(u_t),$$

where $a_t[i]$ is the $i$-th element of $a_t$. The existing indices are then deterministically assigned to the identified nodes according to whether each node was generated from the target-side distribution.
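To make the switch mechanism concrete, the sketch below combines the three distributions into the final node distribution. This is a minimal PyTorch sketch, not the authors' code: tensor names, the copy-map construction, and parameter shapes are our own assumptions.

```python
import torch
import torch.nn.functional as F

def node_distribution(s_tilde, a_src, a_tgt, W_switch, b_switch,
                      W_vocab, b_vocab, src_copy_map, tgt_copy_map):
    """Combine generation, source-copy and target-copy distributions.

    s_tilde:      (batch, hidden)      attention vector at step t
    a_src:        (batch, src_len)     source attention distribution
    a_tgt:        (batch, tgt_len)     target attention distribution
    src_copy_map: (batch, src_len, V)  one-hot map from source tokens to vocab ids
    tgt_copy_map: (batch, tgt_len, V)  one-hot map from generated nodes to vocab ids
    """
    # The three switch probabilities act as soft gates over the distributions.
    switches = F.softmax(s_tilde @ W_switch + b_switch, dim=-1)  # (batch, 3)
    p_gen, p_src, p_tgt = switches.unbind(dim=-1)

    p_vocab = F.softmax(s_tilde @ W_vocab + b_vocab, dim=-1)     # (batch, V)

    # Scatter the attention mass onto the shared vocabulary space.
    copy_src = torch.bmm(a_src.unsqueeze(1), src_copy_map).squeeze(1)
    copy_tgt = torch.bmm(a_tgt.unsqueeze(1), tgt_copy_map).squeeze(1)

    return (p_gen.unsqueeze(-1) * p_vocab
            + p_src.unsqueeze(-1) * copy_src
            + p_tgt.unsqueeze(-1) * copy_tgt)
```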

Edge Prediction
Our baseline employs a deep biaffine attention classifier for semantic edge prediction (Dozat and Manning, 2017), which has been widely used in graph-based structure parsing (Peng et al., 2017; Lyu and Titov, 2018; Zhang et al., 2019a). For a node $u_t$, the probability of $u_k$ being the head node of $u_t$ and the probability of the label of edge $(u_k, u_t)$ are defined as

$$P^{(\mathrm{head})}(u_k \mid u_t) = \mathrm{softmax}_k\big(\mathrm{score}(u_k, u_t)\big), \qquad P^{(\mathrm{label})}(l \mid u_k, u_t) = \mathrm{softmax}_l\big(\mathrm{label}(u_k, u_t)\big),$$

where $\mathrm{score}(\cdot)$ and $\mathrm{label}(\cdot)$ are calculated via biaffine attention.
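For reference, here is a minimal sketch of the arc part of deep biaffine attention, assuming ELU-activated MLPs and a single head-bias term; dimensions and names are illustrative, not the authors' implementation. The label classifier (not shown) would use a second biaffine with one output channel per relation.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Deep biaffine attention for head selection (Dozat and Manning, 2017)."""

    def __init__(self, node_dim, arc_dim):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(node_dim, arc_dim), nn.ELU())
        self.dep_mlp = nn.Sequential(nn.Linear(node_dim, arc_dim), nn.ELU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))
        self.u = nn.Parameter(torch.zeros(arc_dim))  # head bias term

    def forward(self, nodes):
        # nodes: (batch, m, node_dim) decoder states of identified concepts
        head = self.head_mlp(nodes)          # (batch, m, arc_dim)
        dep = self.dep_mlp(nodes)            # (batch, m, arc_dim)
        # scores[b, k, t]: score of node k being the head of node t
        scores = torch.einsum('bka,ac,btc->bkt', head, self.U, dep)
        scores = scores + (head @ self.u).unsqueeze(-1)
        # P(head of u_t = u_k): softmax over candidate heads k
        return scores.softmax(dim=1)
```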

Model
The overall structure of our model is shown in Figure 2. First, we use an external dependency parser to obtain the explicit structural information, and obtain the latent structural information via a probabilistic latent graph generator. We then combine the explicit and latent structural information by encoding the input sentence through an extended graph neural network. Finally, we couple the structural encoder with an align-free AMR parser so that AMR graphs are parsed with the benefit of the structural information.

Latent Graph Generator
We generate the latent graph of the input sentence via the HardKuma distribution (Bastings et al., 2019), which has both continuous and discrete behaviours.
HardKuma can generate samples from the closed interval [0, 1] probabilistically. This feature allows us to predict soft connection probabilities between input words, which can be seen as a latent graph. Specifically, we treat the embedded input words as the prior of a two-parameter distribution, and then sample a soft adjacency matrix between the input words to represent a dependency structure.

HardKuma Distribution
The HardKuma distribution is derived from the Kumaraswamy distribution (Kuma) (Kumaraswamy, 1980), a two-parameter distribution over the open interval (0, 1), i.e., $K \sim \mathrm{Kuma}(a, b)$ with $a \in \mathbb{R}_{>0}$ and $b \in \mathbb{R}_{>0}$. The Kuma distribution is similar to the Beta distribution, but its CDF has a simple analytical form with a closed-form inverse:

$$F_K(k; a, b) = 1 - (1 - k^a)^b, \qquad F_K^{-1}(u; a, b) = \big(1 - (1 - u)^{1/b}\big)^{1/a}.$$

We can generate samples by

$$k = F_K^{-1}(u; a, b), \quad u \sim \mathcal{U}(0, 1),$$

where $\mathcal{U}(0, 1)$ is the uniform distribution, and this inverse-CDF construction can be trained in the reparameterization fashion (Kingma and Welling, 2014; Nalisnick and Smyth, 2017). In order to include the two discrete points 0 and 1, HardKuma employs a stretch-and-rectify method (Louizos et al., 2017): the variable $T \sim \mathrm{Kuma}(a, b, l, r)$ is sampled from a Kuma distribution stretched to the open interval $(l, r)$, where $l < 0$ and $r > 1$. The new CDF is

$$F_T(t; a, b, l, r) = F_K\!\left(\frac{t - l}{r - l}; a, b\right).$$

We pass the stretched variable $T$ through a hard-sigmoid function, $h = \min(1, \max(0, t))$, to obtain the rectified variable $H \sim \mathrm{HardKuma}(a, b, l, r)$, which covers the closed interval [0, 1]. Note that all negative values of $t$ are deterministically mapped to 0, while all samples $t > 1$ are mapped to 1 (details of the derivations can be found in Bastings et al., 2019). Since the rectified variable is built on the Kuma distribution, HardKuma sampling first draws a uniform variable $u \sim \mathcal{U}(0, 1)$ over the open interval (0, 1) and generates a Kuma variable through the inverse CDF:

$$k = F_K^{-1}(u; a, b).$$

Second, we transform the Kuma variable to cover the stretched support:

$$t = l + (r - l)\,k.$$
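Concretely, the reparameterized sampling procedure can be written as follows. This is a minimal sketch; the stretch bounds l = -0.1 and r = 1.1 are taken from Bastings et al. (2019) as an assumption, since the section does not fix them.

```python
import torch

def kuma_icdf(u, a, b):
    """Inverse CDF of Kuma(a, b): k = (1 - (1 - u)^(1/b))^(1/a)."""
    return (1.0 - (1.0 - u).pow(1.0 / b)).pow(1.0 / a)

def hardkuma_sample(a, b, l=-0.1, r=1.1, eps=1e-6):
    """Reparameterized sample from HardKuma(a, b, l, r).

    a, b: positive tensors parameterizing each cell of the latent graph.
    l, r: stretch bounds with l < 0 and r > 1, so that the points 0 and 1
          receive non-zero probability mass after rectification.
    """
    # 1) uniform noise on the open interval (0, 1)
    u = torch.empty_like(a).uniform_(eps, 1.0 - eps)
    # 2) Kuma sample via the inverse CDF (differentiable in a and b)
    k = kuma_icdf(u, a, b)
    # 3) stretch to (l, r), then rectify with a hard sigmoid to [0, 1]
    t = l + (r - l) * k
    return t.clamp(min=0.0, max=1.0)
```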

Latent Graph
We generate the latent graph of the input words $w$ by sampling from the HardKuma distribution with trained parameters $a$ and $b$. We first calculate the priors $c_a$ and $c_b$ of $(a, b)$ by employing multi-head self-attention (Vaswani et al., 2017):

$$c_a = \mathrm{SelfAtt}_a(v), \qquad c_b = \mathrm{SelfAtt}_b(v),$$

where $v = v_1, \ldots, v_n$ are the embedded input words and $c_a, c_b \in \mathbb{R}^{n \times n}$. Subsequently, we compute $a$ and $b$ as

$$a = \mathrm{Norm}(c_a), \qquad b = \mathrm{Norm}(c_b),$$

where $a_i = a_{i1}, \ldots, a_{in}$, $b_i = b_{i1}, \ldots, b_{in}$, and $\mathrm{Norm}(x)$ is the normalization function. Hence, the latent graph $L$ is sampled via the learned parameters $a$ and $b$:

$$L \sim \mathrm{HardKuma}(a, b, l, r).$$
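A sketch of the generator follows, reusing hardkuma_sample from the previous sketch. Since the section does not fully specify Norm(·), the positive rescaling of the attention weights below is a stand-in assumption, as are the two separate attention modules for c_a and c_b.

```python
import torch
import torch.nn as nn

class LatentGraphGenerator(nn.Module):
    """Sample a soft adjacency matrix over input words with HardKuma."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.att_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.att_b = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v):
        # v: (batch, n, dim) embedded input words
        # Attention weights serve as the pairwise priors c_a and c_b.
        _, c_a = self.att_a(v, v, v, average_attn_weights=True)
        _, c_b = self.att_b(v, v, v, average_attn_weights=True)
        # Stand-in Norm(.): rescale the weights into positive parameters.
        a = c_a.clamp(min=1e-4) * c_a.size(-1)
        b = c_b.clamp(min=1e-4) * c_b.size(-1)
        return hardkuma_sample(a, b)  # (batch, n, n) soft latent graph
```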

Graph Encoder
For a syntactic graph with $n$ nodes, the cell $A_{ij} = 1$ in the corresponding adjacency matrix represents an edge connecting word $w_i$ to word $w_j$. An $L$-layer syntactic GCN can be used to encode $A$, where the hidden vector of each word $w_i$ at the $l$-th layer is

$$h_i^{(l+1)} = \sigma\Big(\sum_{j=1}^{n} \frac{A_{ij}}{d_i} \, W^{(l)} h_j^{(l)}\Big),$$

where $d_i = \sum_{j=1}^{n} A_{ij}$ is the degree of word $w_i$ in the graph, which normalizes the activations so that word representations do not have significantly different magnitudes (Kipf and Welling, 2017), and $\sigma$ is a nonlinear activation function.
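Concretely, one such layer takes only a few lines of PyTorch (a minimal sketch; edge labels and padding masks are omitted, and all names are illustrative):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One degree-normalized GCN layer over an adjacency matrix A."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim)

    def forward(self, h, A):
        # h: (batch, n, dim) word states; A: (batch, n, n) adjacency matrix
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1.0)  # degrees d_i
        # Each word aggregates its neighbors' transformed states, scaled by 1/d_i.
        return torch.relu(torch.bmm(A / deg, self.W(h)))
```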
To benefit from both explicit and latent structural information in AMR parsing, we extend the syntactic GCN (Zhang et al., 2018b) with a graph fusion layer and omit labels in the graph (i.e. we only consider connectivity in the GCN). Specifically, we propose to merge the parsed syntactic dependencies and the sampled latent graph through a graph fusion layer:

$$F = \pi \odot D + (1 - \pi) \odot L,$$

where the trainable gate variables $\pi$ are calculated via the sigmoid function, $D$ and $L$ are the parsed syntactic dependencies and the generated latent graph respectively, and $F$ represents the fused soft graph. Furthermore, $F$ is an $n \times n$ adjacency matrix over the input words $w$; different from the sparse adjacency matrix $A$, $F_{ij}$ denotes a soft connection degree from word $w_i$ to word $w_j$. We adapt the syntactic GCN to the fused adjacency matrix $F$ and employ a gate mechanism:

$$h_i^{(l+1)} = \mathrm{GELU}\Big(\mathrm{LayerNorm}\Big(\sum_{j=1}^{n} G_j^{(l)}\, \frac{F_{ij}}{d_i}\, W^{(l)} h_j^{(l)}\Big)\Big).$$

We use GELU (Hendrycks and Gimpel, 2016) as the activation function, and apply layer normalization (Ba et al., 2016) before passing the results into GELU. The scalar gate $G_j^{(l)}$ is calculated for each edge-node pair:

$$G_j^{(l)} = \mu\big(h_j^{(l)} \hat{v} + \hat{b}\big),$$

where $\mu$ is the logistic sigmoid function, and $\hat{v}$ and $\hat{b}$ are trainable parameters.
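The fused, gated layer can be sketched as below. The scalar fusion gate π and the per-node edge gate follow our reading of the equations above; the released implementation may differ, so treat this as an assumption-laden sketch rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedGCNLayer(nn.Module):
    """Gated GCN layer over the fusion of explicit (D) and latent (L) graphs."""

    def __init__(self, dim):
        super().__init__()
        self.pi = nn.Parameter(torch.zeros(1))  # fusion gate logit
        self.W = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)           # scalar edge-node gate G_j
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, D, L):
        # h: (batch, n, dim); D, L: (batch, n, n) adjacency matrices
        pi = torch.sigmoid(self.pi)
        fused = pi * D + (1.0 - pi) * L         # soft fused graph F
        deg = fused.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        G = torch.sigmoid(self.gate(h)).transpose(1, 2)  # (batch, 1, n)
        # Gated, degree-normalized message passing, then LayerNorm + GELU.
        msg = torch.bmm((fused / deg) * G, self.W(h))
        return F.gelu(self.norm(msg))
```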

Training
Similar to our baseline (Zhang et al., 2019a), we linearize the AMR concept nodes by a pre-order traversal over the training dataset. We obtain gradient estimates of the objective $E(\phi, \theta)$ through Monte Carlo sampling:

$$E(\phi, \theta) = -\mathbb{E}_{g_\phi(u, w)}\Big[\sum_{t=1}^{m} \log P^{(\mathrm{node})}(u_t) + \log P^{(\mathrm{head})}(u_k \mid u_t) + \log P^{(\mathrm{label})}(l \mid u_k, u_t)\Big],$$

where $u_t$ is the reference node at time step $t$ with reference head $u_k$, and $l$ is the reference edge label between $u_k$ and $u_t$. The term $g_\phi(u, w)$ is short for the latent graph samples drawn from the uniform distribution and transformed through the HardKuma distribution (§3.1).
Different from Bastings et al. (2019), we do not limit the sparsity of the sampled latent graphs, i.e. we do not control the proportion of zeros in the latent graph, because we prefer to retain the probabilistic connection information of each word in $w$. Finally, we add a coverage loss to our estimate in order to reduce duplication in node generation (See et al., 2017).
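Putting the pieces together, the per-sentence objective on a single latent-graph sample might look as follows. This is a sketch: the coverage weight, the reduction, and the interface are assumptions, with only the coverage term following See et al. (2017).

```python
import torch

def training_loss(p_node, p_head, p_label, a_src_steps, coverage_weight=1.0):
    """One-sample Monte Carlo estimate of the negative log-likelihood,
    plus the coverage penalty of See et al. (2017).

    p_node, p_head, p_label: (m,) reference probabilities at each step,
        computed on a single HardKuma sample of the latent graph.
    a_src_steps: (m, src_len) source attention distributions per step.
    """
    nll = -(p_node.log() + p_head.log() + p_label.log()).sum()

    # Coverage: penalize re-attending to already-covered source tokens.
    coverage = torch.zeros_like(a_src_steps[0])
    cov_loss = 0.0
    for a_t in a_src_steps:
        cov_loss = cov_loss + torch.minimum(a_t, coverage).sum()
        coverage = coverage + a_t
    return nll + coverage_weight * cov_loss
```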

Parsing
We generate the latent graph directly from the PDF of the HardKuma distribution with the trained parameters $a$ and $b$. In the concept identification stage, we decode a node from the final probability distribution $P^{(\mathrm{node})}(u_t)$ at each time step, and apply beam search to sequentially generate the concept nodes $u$ and deterministically assign the corresponding indices $d$. For edge prediction, we use the biaffine classifier (§2.3) to calculate the edge scores over the generated nodes $u$ and indices $d$. Similar to Zhang et al. (2019a), we apply a maximum spanning tree (MST) algorithm (Chu, 1965; Edmonds, 1967) to generate the complete AMR graph, and restore the re-entrance relations by merging the duplicate nodes via their indices.

Setup
We use two standard AMR corpora: AMR 1.0 (LDC2014T12) and AMR 2.0 (LDC2017T10). AMR 2.0 is larger, split into 36,521, 1,368 and 1,371 sentences for the training, development and test sets respectively. We treat AMR 2.0 as the main dataset in our experiments since it is larger. We tune hyper-parameters on the development set, and store the checkpoints with the best development results for evaluation. We employ the pre-processing and post-processing methods of Zhang et al. (2019a), and obtain the syntactic dependencies via Stanford CoreNLP. We train our model jointly with the Adam optimizer (Kingma and Ba, 2015). The learning rate is decayed based on development-set results during training. Training takes approximately 22 hours on two Nvidia GeForce RTX 2080 Ti GPUs.

Results
Main Results. We compare Smatch F1 scores against the previous best reported models and other recent AMR parsers. Table 1 summarizes the results on both the AMR 1.0 and AMR 2.0 data sets. For AMR 2.0, with the benefit of the fused structural information, we improve our baseline (Zhang et al., 2019a) by 1.2% F1; part of this improvement is gained without pre-trained BERT embeddings.
In addition, our model outperforms the best reported model (Zhang et al., 2019b) by 0.5% F1. On AMR 1.0, there are only about 10k sentences for training. We outperform the best result by 0.5% Smatch F1. We observe that our model obtains a larger improvement over our baseline on the smaller data set (1.6% F1) than on the larger data set (1.2% F1). Table 2 shows fine-grained results for each sub-task on AMR 2.0, evaluated with the enhanced AMR evaluation tool (Damonte et al., 2017). Our model brings more than 1% average improvement over our baseline (Zhang et al., 2019a) on most sub-tasks; in particular, the Unlabeled score increases by 1.4% F1 with the structural information, and the no-WSD, Reentrancies, Negation and SRL sub-tasks all improve by more than 1.0% under our graph encoder. In addition, our model achieves results comparable to the best reported method (Zhang et al., 2019b) on each sub-task.

Ablation Study
We investigate the impact of the different kinds of structural information in our model on AMR 2.0 over the main sub-tasks. Table 3 shows that the fused structure performs better than the explicit and latent structures on most sub-tasks. In particular, the models with explicit structure (i.e. both explicit and fused) outperform the model with only latent structure by 0.5% F1 on the Reentrancies sub-task, which demonstrates that explicit dependency information helps this sub-task. The latent structure performs better on the Concepts sub-task, while the fused structure brings more information to the Negation sub-task, with improvements of 0.5% and 1.0% over the explicit and latent structures respectively. Additionally, both the latent and explicit models outperform the previous best reported Smatch F1 score, and the fused model reaches the best result. This shows that different types of structural information can help AMR parsing; we discuss the connection tendencies of each structure in §4.3.

Discussion
Experimental results show that both the explicit structure and the latent structure can improve the performance of AMR parsing, and that latent structural information reduces errors in sub-tasks such as Concepts and SRL. Different from the discrete relations of explicit structures, the latent structure holds soft connection probabilities between the words of the input sentence, so that each fully connected word receives information from all the other words. Figure 3 depicts the latent and fused soft adjacency matrices of the input sentence "The boy came and left". It can be seen that the latent matrix (Figure 3a) tries to retain information from most word pairs, and the AMR root "and" holds high connection probabilities to each word in the sentence. In addition, the main predicates and arguments in the sentence tend to be connected with high probabilities. The fused matrix (Figure 3b) holds similar connection probabilities for the predicates and arguments, and it reduces the connection degrees of the determiner "The", which does not appear in the corresponding AMR graph. Moreover, the syntactic root "came" and the semantic root "and" retain most of the connection probability mass to the other words.
We compare the connections of the different structures in Figure 4. The latent graph (Figure 4a) prefers to connect most words, and the main predicates and arguments in the graph have higher connection probabilities. The fused graph (Figure 4c) shows that our model captures core structural information with interpretable relations. Specifically, it holds potential relations similar to the annotated AMR graph, and attenuates the connections to words that are not aligned to AMR concept nodes.
Beyond that, we calculate the Unlabeled Attachment Score (UAS) of the fused and latent graphs in Table 4. The unsupervised latent graph captures fewer explicit edges than the fused graph, and both the fused and latent graphs ignore some arcs of the explicit graph. This shows that a lower UAS does not imply a lower AMR parsing score, and that some arcs are useful to AMR parsing despite not appearing in the explicit gold trees. Consequently, we preserve the explicit and latent structural information simultaneously. The latent structure can not only improve AMR parsing, but also interpret the latent connections between input words. (We construct the latent and fused graphs by selecting the top 2 soft connections of each word, and we ignore edges whose connection probabilities are less than 0.5.)
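The graph-construction rule in the parenthesis above amounts to a top-k filter with a probability threshold. A minimal sketch follows; the function name and interface are ours, not part of the paper.

```python
import torch

def visualize_edges(adj, threshold=0.5, top_k=2):
    """Select edges for plotting: the top-k soft connections per word,
    dropping those below the probability threshold.

    adj: (n, n) soft adjacency matrix (latent or fused graph).
    Returns a list of (source, target, probability) tuples.
    """
    probs, idx = adj.topk(top_k, dim=-1)  # (n, top_k) per-row maxima
    return [(i, j.item(), p.item())
            for i, (ps, js) in enumerate(zip(probs, idx))
            for p, j in zip(ps, js) if p >= threshold]
```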

Related Work
Transition-based AMR parsers (Wang et al., 2016; Damonte et al., 2017; Wang and Xue, 2017; Liu et al., 2018; Guo and Lu, 2018; Naseem et al., 2019) suffer from the lack of annotated alignments between words and concept nodes, which are crucial in these models. Lyu and Titov (2018) treat the alignments as a latent variable in their probabilistic model, which jointly obtains the concept, relation and alignment variables. Sequence-to-sequence AMR parsers transform AMR graphs into serialized sequences by external traversal rules, and then restore the generated AMR sequence, avoiding the alignment issue (Konstas et al., 2017; van Noord and Bos, 2017). Moreover, Zhang et al. (2019a) extend a pointer-generator network (See et al., 2017), which can generate a node multiple times without alignment through the copy mechanism. With regard to latent structure, Naradowsky et al. show that latent constituent trees are shallower than human-annotated ones, yet can boost the performance of downstream NLP tasks (e.g., text classification). Guo et al. (2019) and Ji et al. (2019) employ self-attention and biaffine attention mechanisms respectively to generate softly connected graphs, and then adopt GNNs to encode the soft structures so that their tasks benefit from the structural information.
GCNs and their variants are increasingly applied to embed syntactic and semantic structures in NLP tasks (Kipf and Welling, 2017; Damonte and Cohen, 2019). The syntactic GCN (Bastings et al., 2017) tries to alleviate the error propagation from external parsers with a gate mechanism: it encodes both relations and labels with the gates, and filters the output of each GCN layer over the dependencies. Damonte and Cohen (2019) encode AMR graphs via GCNs to improve the AMR-to-text generation task.

Conclusion
We investigate latent structure for AMR parsing, and we show that the inferred latent graph can interpret the connection probabilities between input words. Experimental results show that the latent structural information improves on the best reported parsing performance on both AMR 2.0 (LDC2017T10) and AMR 1.0 (LDC2014T12). In future work, we plan to incorporate the latent graph into other multi-task learning problems (Chen et al.).

A.1 Hyper-parameters
We select the best hyper-parameters based on the results on the development set, and we fix the hyper-parameters at the test stage. We use a two-layer highway LSTM as the encoder and a two-layer LSTM as the decoder for the align-free node generator. Table 5 shows the details.

A.2 More Examples
To discuss the generated latent graph in different situations, we provide two examples from the test set on the next page. Figure 5 analyzes an interrogative sentence: "What advice could you give me?". It shows that the latent graph of the sentence tends to hold most of the information between predicates and arguments. Both the AMR root "advice" and the dependency root "give" receive more attention from the other words, and the fused graph likewise retains more information about the predicates and arguments of the original sentence.
For a longer sentence with multiple predicate-argument structures, Figure 6 depicts the latent and fused graphs of the sentence "You could go to the library on saturdays and do a good 8 hours of studying there.". In this case, the corresponding latent graph becomes shallower, and the AMR root "and" holds most of the information from the other words. Besides, the fused graph indicates that predicates receive more information from the other words, and, to some extent, phrases tend to be connected by the fused graph generator.