AMR Parsing as Graph Prediction with Latent Alignment

Abstract meaning representations (AMRs) are broad-coverage sentence-level semantic representations. AMRs represent sentences as rooted labeled directed acyclic graphs. AMR parsing is challenging partly due to the lack of annotated alignments between nodes in the graphs and words in the corresponding sentences. We introduce a neural parser which treats alignments as latent variables within a joint probabilistic model of concepts, relations and alignments. As exact inference requires marginalizing over alignments and is infeasible, we use the variational autoencoding framework and a continuous relaxation of the discrete alignments. We show that joint modeling is preferable to a pipeline that first aligns and then parses. The parser achieves the best reported results on the standard benchmark (74.4% on LDC2016E25).


Introduction
Abstract meaning representations (AMRs) (Banarescu et al., 2013) are broad-coverage sentence-level semantic representations. AMR encodes, among other things, information about semantic relations, named entities, co-reference, negation and modality. An AMR can be represented as a rooted labeled directed acyclic graph (see Figure 1). As AMR abstracts away from details of surface realization, it is potentially beneficial in many semantics-related NLP tasks, including text summarization (Liu et al., 2015; Dohare and Karnick, 2017), machine translation (Jones et al., 2012) and question answering (Mitra and Baral, 2016). AMR parsing has recently received a lot of attention (e.g., Flanigan et al., 2014; Artzi et al., 2015; Konstas et al., 2017). One distinctive aspect of AMR annotation is the lack of explicit alignments between nodes in the graph (concepts) and words in the sentences. Though this arguably simplified the annotation process (Banarescu et al., 2013), it is not straightforward to produce an effective parser without relying on an alignment. Most AMR parsers (Damonte et al., 2017; Flanigan et al., 2016; Werling et al., 2015; Foland and Martin, 2017) use a pipeline where the aligner training stage precedes training a parser. The aligners are not directly informed by the AMR parsing objective and may produce alignments that are suboptimal for this task.
In this work, we demonstrate that the alignments can be treated as latent variables in a joint probabilistic model and induced in such a way as to be beneficial for AMR parsing. Intuitively, in our probabilistic model, every node in a graph is assumed to be aligned to a word in a sentence: the corresponding concept is predicted based on the corresponding RNN state. Similarly, graph edges (i.e. relations) are predicted based on representations of concepts and aligned words (see Figure 2). As alignments are latent, exact inference requires marginalizing over latent alignments, which is infeasible. Instead we use variational inference, specifically the variational autoencoding framework of Kingma and Welling (2014). Using discrete latent variables in deep learning has proven to be challenging (Mnih and Gregor, 2014; Bornschein and Bengio, 2015). We use a continuous relaxation of the alignment problem, relying on the recently introduced Gumbel-Sinkhorn construction (Mena et al., 2018). This yields a computationally-efficient approximate method for estimating our joint probabilistic model of concepts, relations and alignments.
We assume injective alignments from concepts to words: every node in the graph is aligned to a single word in the sentence, and every word is aligned to at most one node in the graph. This is necessary for two reasons. First, it lets us treat concept identification as sequence tagging at test time: for every word we simply predict the corresponding concept or predict NULL to signify that no concept should be generated at this position. Second, the Gumbel-Sinkhorn construction can only work under this assumption. This constraint, though often appropriate, is problematic for certain AMR constructions (e.g., named entities). In order to deal with these cases, we re-categorize AMR concepts. Similar strategies have been adopted in previous work (Foland and Martin, 2017; Peng et al., 2017).
The resulting parser achieves 73.6% ± 0.3% Smatch score on the standard test set when using the LDC2016E25 training set, an improvement of 2.6% over the previous best result (van Noord and Bos, 2017). We also demonstrate that inducing alignments within the joint model is indeed beneficial: when, instead of inducing alignments, we rely on predictions of the JAMR aligner (Flanigan et al., 2016), performance drops by 1.6% Smatch. Our main contributions can be summarized as follows:

• we introduce a joint probabilistic model for alignment, concept and relation identification;
• we demonstrate how a continuous relaxation can be used to effectively estimate the model;
• the model achieves the best reported results.1

Probabilistic Model
In this section we describe our probabilistic model and the estimation technique. In Section 3, we describe pre-processing and post-processing (including concept re-categorization, sense disambiguation, wikification and root selection).

Notation and setting
We will use the following notation throughout the paper. We refer to words in a sentence as w = (w_1, ..., w_n), where n is the sentence length and w_k ∈ V for k ∈ {1, ..., n}. The concepts (i.e. labeled nodes) are c = (c_1, ..., c_m), where m is the number of concepts and c_i ∈ C for i ∈ {1, ..., m}. For example, in Figure 1, c = (obligate, go, boy, -).2 Note that senses are predicted at post-processing, as discussed in Section 3.2 (i.e. go is labeled as go-02).
A relation between 'predicate concept' i and 'argument concept' j is denoted by r_ij ∈ R; it is set to NULL if j is not an argument of i. In our example, r_{2,3} = ARG0 and r_{1,3} = NULL. We use R to denote all relations in the graph.
To represent alignments, we use a = (a_1, ..., a_m), where a_i ∈ {1, ..., n} is the index of the word aligned to concept i. In our example, a_1 = 3.
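The notation above can be made concrete with a minimal Python encoding of the running example; the relation labels and the alignment indices beyond those stated in the text are illustrative assumptions:

```python
# Running example "The boys must not go" in the paper's notation.
w = ("The", "boys", "must", "not", "go")   # words w_1..w_n, n = 5
c = ("obligate", "go", "boy", "-")         # concepts c_1..c_m, m = 4

# Relations r_ij; NULL relations are simply omitted from the dict.
# (1, 2): "ARG2" and (2, 4): "polarity" are hypothetical labels for
# illustration; (2, 3): "ARG0" is the one given in the text.
R = {(2, 3): "ARG0", (1, 2): "ARG2", (2, 4): "polarity"}

# Injective alignment a_i: concept index -> word index (1-based).
# a_1 = 3 (obligate -> must) is from the text; the rest are illustrative.
a = {1: 3, 2: 5, 3: 2, 4: 4}

assert len(set(a.values())) == len(a)   # injectivity: all words distinct
```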
All three model components rely on bidirectional LSTM encoders (Schuster and Paliwal, 1997). We denote the states of a BiLSTM (i.e. the concatenation of the forward and backward LSTM states) as h_k ∈ R^d (k ∈ {1, ..., n}). The sentence encoder takes pre-trained fixed word embeddings, together with randomly initialized lemma, suffix (last two characters) and named-entity tag embeddings.

Method overview
We believe that using discrete alignments, rather than attention-based models (Bahdanau et al., 2014), is crucial for AMR parsing. AMR banks are much smaller than the parallel corpora used in machine translation (MT), and hence it is important to inject a useful inductive bias. We constrain our alignments from concepts to words to be injective. First, this encodes the observation that concepts are mostly triggered by single words (especially after re-categorization, Section 3.1). Second, it implies that each word corresponds to at most one concept (if any). This encourages competition: alignments are mutually repulsive. In our example, obligate is not lexically similar to the word must and may be hard to align. However, given that the other concepts are easy to predict, alignment candidates other than must and The are immediately ruled out. We believe that these are the key reasons why attention-based neural models do not achieve competitive results on AMR (Konstas et al., 2017) and why state-of-the-art models rely on aligners. Our goal is to combine the best of both worlds: to use alignments (as in state-of-the-art AMR methods) and to induce them while optimizing for the end goal (similarly to the attention component of encoder-decoder models).
Our model consists of three parts: (1) the concept identification model P_θ(c|a, w); (2) the relation identification model P_φ(R|a, w, c); and (3) the alignment model Q_ψ(a|c, R, w).3 Formally, (1) and (2), together with the uniform prior over alignments P(a), form the generative model of AMR graphs. In contrast, the alignment model Q_ψ(a|c, R, w), as will be explained below, approximates the intractable posterior P_θ,φ(a|c, R, w) within that probabilistic model.
In other words, we assume the following process for generating an AMR graph: concepts are generated conditionally independently, relying on the BiLSTM states and the surface forms of the aligned words; relations are then predicted based only on the AMR concept embeddings and the LSTM states corresponding to the words aligned to the involved concepts, with their combined representations fed into a bi-affine classifier (Dozat and Manning, 2017). The resulting likelihood,

P(c, R|w) = Σ_a P(a) P_θ(c|a, w) P_φ(R|a, w, c),

involves intractable marginalization over all valid alignments.

3 θ, φ and ψ denote all parameters of the models.
As is standard in variational autoencoders (VAEs; Kingma and Welling, 2014), we lower-bound the log-likelihood as

log P(c, R|w) ≥ E_Q[log P_θ(c|a, w) + log P_φ(R|a, w, c)] − D_KL(Q_ψ(a|c, R, w) || P(a)),   (1)

where Q_ψ(a|c, R, w) is the variational posterior (aka the inference network), E_Q[...] denotes expectation under Q_ψ(a|c, R, w) and D_KL is the Kullback-Leibler divergence. In VAEs, the lower bound is maximized both with respect to the model parameters (θ and φ in our case) and the parameters of the inference network (ψ). Unfortunately, gradient-based optimization with discrete latent variables is challenging. We use a continuous relaxation of our optimization problem, where real-valued vectors â_i ∈ R^n (for every concept i) approximate the discrete alignment variables a_i. This relaxation results in low-variance estimates of the gradient using the reparameterization trick (Kingma and Welling, 2014), and ensures fast and stable training. We describe the model components and the relaxed inference procedure in detail in Sections 2.6 and 2.7.

Though the estimation procedure requires the use of the relaxation, the learned parser is straightforward to use. Given our assumptions about the alignments, we can independently choose for each word w_k (k = 1, ..., n) the most probable concept according to P_θ(c|h_k). If the highest scoring option is NULL, no concept is introduced at this position. The relations could then be predicted relying on P_φ(R|a, w, c). However, this could yield inconsistent AMR graphs, so instead we search for the highest scoring valid graph (see Section 3.2). Note that the alignment model Q_ψ is not used at test time; it is only necessary to train accurate concept and relation identification models.

Concept identification model
The concept identification model chooses a concept c (i.e. a labeled node) conditioned on the aligned word k, or decides that no concept should be introduced (i.e. returns NULL). Though this could be modeled with a single softmax classifier, such a classifier would not be effective in handling rare or unseen words. Instead, we first split the decision into estimating the probability of the concept category τ(c) ∈ T (e.g. 'number', 'frame') and estimating the probability of the specific concept within the chosen category. Second, based on a lemmatizer and the training data,4 we prepare one candidate concept e_k for each word k in the vocabulary (e.g., it would propose want if the word is wants). Similarly to Luong et al. (2015), our model can then either copy the candidate e_k or rely on a softmax over the potential concepts of category τ. Formally, the concept prediction model is defined as

P_θ(c | h_k) = P_θ(τ(c) | h_k) × (1 / Z(h_k, θ)) ( exp(u^T h_k) [c = e_k] + exp(w_c^T h_k) ),

where the first multiplicative term is a softmax classifier over categories; [·] denotes the indicator function, which equals 1 if its argument is true and 0 otherwise; u and w_c are parameter vectors; and Z(h_k, θ) is the partition function ensuring that the scores sum to 1.
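A sketch of the within-category part of this factorization, in which the copy candidate e_k shares a normalizer with the ordinary concept softmax; the scores and names here are hypothetical simplifications, not the parser's actual parameterization:

```python
import math

def concept_distribution(copy_score, vocab_scores, e_k):
    """Within-category concept distribution: the copy candidate e_k
    competes with ordinary concepts under one partition function.
    vocab_scores: dict concept -> unnormalized score (hypothetical values);
    copy_score: unnormalized score for copying the candidate e_k."""
    z = math.exp(copy_score) + sum(math.exp(s) for s in vocab_scores.values())
    p = {conc: math.exp(s) / z for conc, s in vocab_scores.items()}
    # Add the copy probability mass to the candidate concept.
    p[e_k] = p.get(e_k, 0.0) + math.exp(copy_score) / z
    return p

# For the word "wants", the candidate e_k would be "want".
p = concept_distribution(2.0, {"want": 0.5, "need": 0.1}, e_k="want")
```

A high copy score concentrates probability on the lemma-derived candidate, which is how the model handles rare words without a huge output softmax.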

Relation identification model
Most predicates have at most one argument for each relation type (e.g., there is typically at most one agent / ARG0), hence we would like to encourage competition for each role among arguments. For non-NULL relations, we factorize the probability into two terms: the first corresponds to picking a candidate argument (i.e. predicting that concept c_j is a candidate for being an argument of type r of the predicate c_i), whereas the second corresponds to deciding that it is indeed an argument. The remaining probability mass is assigned to the NULL relation. Each term is modeled in exactly the same way: (1) for both endpoints, the embedding of the concept c is concatenated with the RNN state h; (2) they are linearly projected to a lower dimension and fed into the bi-affine classifier.

4 See supplementary materials.
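As a rough sketch of bi-affine edge scoring in the style of Dozat and Manning (2017); the dimensions, the interaction matrix U and the bias vector b are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8      # projected endpoint dimension (illustrative)
m = 4      # number of concepts

# In the paper, each endpoint is the concept embedding concatenated with
# the aligned RNN state, then linearly projected; here we start directly
# from the (randomly filled) projected vectors.
head = rng.normal(size=(m, d))   # predicate-side representations
dep = rng.normal(size=(m, d))    # argument-side representations

U = rng.normal(size=(d, d))      # bi-affine interaction matrix (per relation)
b = rng.normal(size=(d,))        # linear term for the head

# scores[i, j] is the unnormalized score for relation r_ij.
scores = head @ U @ dep.T + (head @ b)[:, None]
```

Normalizing scores[i, :] over j (a softmax per role) then realizes the "competition among arguments" described above.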

Alignment model
Recall that the alignment model is only used at training time, and hence it can rely both on the input (states h_1, ..., h_n) and on the list of concepts c_1, ..., c_m.
Formally, we add (n − m) NULL concepts to the list.5 Aligning a word to any NULL concept corresponds to saying that the word is not aligned to any 'real' concept. Note that each one-to-one alignment (i.e. permutation) between n such concepts and n words implies a valid injective alignment of the m 'real' concepts to the n words. This reduction to permutations will come in handy when we turn to the Gumbel-Sinkhorn relaxation in the next section. From now on, we assume that m = n.
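The padding step can be sketched as follows (the NULL token name is an assumption); after padding, any permutation of the n items is a valid injective alignment:

```python
def pad_with_null(concepts, n):
    """Pad the m concepts with NULLs up to sentence length n, so that
    injective concept-to-word alignments become n-by-n permutations."""
    assert len(concepts) <= n, "need at least as many words as concepts"
    return list(concepts) + ["NULL"] * (n - len(concepts))
```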
As with sentences, we use a BiLSTM to encode the concept sequence c, yielding states g_i ∈ R^{d_g}, i ∈ {1, ..., n}. We use a globally-normalized alignment model:

Q_ψ(a | c, R, w) = (1 / Z_ψ(c, w)) exp( Σ_{i=1}^n ϕ(g_i, h_{a_i}) ),

where Z_ψ(c, w) is the intractable partition function and the terms ϕ(g_i, h_{a_i}) score each alignment link according to a bilinear form

ϕ(g_i, h_k) = g_i^T B h_k,

where B ∈ R^{d_g × d} is a parameter matrix.
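Computing the full score matrix Φ with this bilinear form amounts to two matrix products; the shapes and random contents here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_g, d = 5, 6, 6
G = rng.normal(size=(n, d_g))   # concept BiLSTM states g_i (NULLs included)
H = rng.normal(size=(n, d))     # word BiLSTM states h_k
B = rng.normal(size=(d_g, d))   # bilinear parameter matrix

# Phi[i, k] = phi(g_i, h_k) = g_i^T B h_k for all concept-word pairs.
Phi = G @ B @ H.T
```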

Estimating model with Gumbel-Sinkhorn
Recall that our learning objective (1) involves an expectation under the alignment model. The partition function of the alignment model is intractable, and it is tricky even to draw samples from it. Luckily, the recently proposed Gumbel-Sinkhorn relaxation (Mena et al., 2018) lets us circumvent this issue. First, note that exact samples from a categorical distribution can be obtained using the perturb-and-max technique (Papandreou and Yuille, 2011). For our alignment model, this corresponds to adding independent noise to the score of every possible alignment and choosing the highest scoring one:

a = argmax_{a ∈ P} ( Σ_{i=1}^n ϕ(g_i, h_{a_i}) + ε_a ),

where P is the set of all permutations of n elements and ε_a is noise drawn independently for each a ∈ P from the fixed Gumbel distribution G(0, 1). Unfortunately, this is also intractable, as there are n! permutations. Instead, in perturb-and-max an approximate schema is used in which the noise is assumed to factorize: noisy scores are first computed as ϕ̃(g_i, h_k) = ϕ(g_i, h_k) + ε_{ik}, with ε_{ik} ~ G(0, 1), and an approximate sample is obtained as a = argmax_{a ∈ P} Σ_{i=1}^n ϕ̃(g_i, h_{a_i}). Such a sampling procedure is still intractable in our case and, moreover, non-differentiable. The main contribution of Mena et al. (2018) is approximating this argmax with a simple differentiable computation Â = S_t(Φ, Σ), which yields an approximate (i.e. relaxed) permutation. We use Φ and Σ to denote the n × n matrices of alignment scores ϕ(g_i, h_k) and noise variables ε_{ik}, respectively. Instead of returning an index a_i for every concept i, the relaxation returns a (peaky) distribution over words â_i. The peakiness is controlled by the temperature parameter t of Gumbel-Sinkhorn, which trades off smoothness ('differentiability') against the bias of the estimator. For further details and the derivation, we refer the reader to the original paper (Mena et al., 2018).
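A minimal re-implementation sketch of the Gumbel-Sinkhorn operator S_t: temperature-scaled noisy scores followed by iterated row and column normalization in log space. The iteration count and this exact formulation are our simplification of Mena et al. (2018), not their reference code:

```python
import numpy as np

def gumbel_sinkhorn(phi, t=1.0, n_iters=50, rng=None):
    """Relaxed permutation A ~ S_t(phi + Gumbel noise).
    phi: (n, n) alignment score matrix; t: temperature; returns an
    approximately doubly-stochastic (n, n) matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    eps = rng.gumbel(size=phi.shape)       # i.i.d. G(0, 1) noise
    log_a = (phi + eps) / t
    for _ in range(n_iters):
        # Alternate row and column normalization (Sinkhorn iterations).
        log_a -= np.logaddexp.reduce(log_a, axis=1, keepdims=True)
        log_a -= np.logaddexp.reduce(log_a, axis=0, keepdims=True)
    return np.exp(log_a)
```

As t shrinks, the output approaches a hard permutation matrix; larger t gives smoother, lower-variance but more biased gradients.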
Note that Φ is a function of the alignment model Q_ψ, so we will write Φ_ψ in what follows. The variational bound (1) can now be approximated as

E_Σ[ log P_θ(c | S_t(Φ_ψ, Σ), w) + log P_φ(R | S_t(Φ_ψ, Σ), w, c) ] − KL_{t_0, t}(Φ_ψ),   (4)

where, following Mena et al. (2018), the original KL term from equation (1) is approximated by KL_{t_0, t}(Φ_ψ), the KL divergence between two n × n matrices of i.i.d. Gumbel distributions with different temperatures and means; the parameter t_0 is the 'prior temperature'.

At test time we use the model in a pipeline fashion, first predicting concepts and then relations between these concepts. Consequently, being accurate at predicting concepts is more important. However, the concept prediction term in expression (4) has only a minor influence on the loss, as there are n concepts to predict but n × n potential relations. We therefore down-weight the relation identification term by a factor of 1/n.
Our new objective is fully differentiable with respect to all parameters (i.e. θ, φ and ψ) and has low variance, as sampling is performed from a fixed non-parameterized distribution, as in standard VAEs.

Relaxing concept and relation identification
One remaining question is how to use the soft input Â = S_t(Φ_ψ, Σ) in the concept and relation identification models in equation (4). In other words, we need to define how we compute P_θ(c|S_t(Φ_ψ, Σ), w) and P_φ(R|S_t(Φ_ψ, Σ), w, c).
The standard technique is to feed the models expectations under the relaxed variables, Σ_{k=1}^n â_ik h_k, instead of the vectors h_{a_i} (Maddison et al., 2017; Jang et al., 2017). This is exactly what we do for the relation identification model.
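Computing these expectations for all concepts at once is a single matrix product; here Â is any row-stochastic relaxed alignment matrix (filled with Dirichlet samples purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 5
A_hat = rng.dirichlet(np.ones(n), size=n)   # rows: relaxed alignments a_hat_i
H = rng.normal(size=(n, d))                 # word BiLSTM states h_k

# Row i of H_exp is sum_k a_hat_ik * h_k, the expected aligned state
# that replaces h_{a_i} in the relation identification model.
H_exp = A_hat @ H
```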
However, the concept prediction model log P_θ(c|S_t(Φ_ψ, Σ), w) relies on the pointing mechanism, i.e. it directly exploits the words w rather than relying only on the BiLSTM states h_k. So instead we treat â_i as a prior over the aligned word in a hierarchical model:

log P_θ(c | Â, w) = Σ_{i=1}^n log Σ_{k=1}^n â_ik P_θ(c_i | h_k).   (5)

As we will show in our experiments, a softer version of this loss is even more effective:

Σ_{i=1}^n log Σ_{k=1}^n â_ik P_θ(c_i | h_k)^α,

where we set the parameter α = 0.5. We believe that using this loss encourages the model to explore the alignment space more actively.
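One plausible reading of the hierarchical relaxation and its softer variant can be sketched as follows; this is our interpretation, and the exact normalization used in the paper may differ:

```python
import numpy as np

def concept_loglik(A_hat, P, alpha=1.0):
    """Relaxed concept log-likelihood with a_hat_i as a prior over words.
    A_hat: (m, n) relaxed alignments; P[i, k] = P(c_i | h_k).
    alpha=1 recovers the plain hierarchical relaxation; alpha<1 (e.g. 0.5)
    flattens the per-word probabilities, softening the loss."""
    weighted = (A_hat * P ** alpha).sum(axis=1)   # mixture over words
    return np.log(weighted).sum()
```

With alpha < 1, low-probability words are penalized less sharply inside the mixture, which is consistent with the exploration intuition above.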

Re-Categorization
AMR parsers often rely on a pre-processing stage where specific subgraphs of AMR are grouped together and assigned to a single node with a new compound category (e.g., Werling et al. (2015); Foland and Martin (2017); Peng et al. (2017)); this transformation is reversed at the post-processing stage. Our approach is very similar to the previously proposed Factored Concept Label system, with one important difference: we unpack our concepts before the relation identification stage, so relations are predicted between the original concepts (all nodes in each group share the same alignment distribution over the RNN states). Intuitively, the goal is to ensure that concepts rarely lexically triggered (e.g., thing in Figure 3) get grouped together with lexically triggered nodes. Such 'primary' concepts get encoded in the category of the concept (the set of categories is T, see also Section 2.3). In Figure 3, the re-categorized concept thing(opinion) is produced from thing and opine-01. We use category as the dummy category type. There are 8 templates in our system which extract re-categorizations for fixed phrases (e.g. thing(opinion)), and a deterministic system for grouping lexically flexible but structurally stable sub-graphs (e.g., named entities, have-rel-role-91 and have-org-role-91 concepts).
Details of the re-categorization procedure and other pre-processing steps are provided in the appendix.

Post-processing
For post-processing, we handle sense disambiguation and wikification, and ensure the legitimacy of the produced AMR graph. For sense disambiguation we pick the most frequent sense for that particular concept ('-01' if unseen). For wikification we again look up the concept in the training set and default to "-".
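A minimal sketch of the frequency-based sense look-up; the table name and its structure are assumptions:

```python
from collections import Counter

def most_frequent_sense(concept, train_senses):
    """Pick the most frequent training-set sense for a concept,
    defaulting to '-01' for unseen concepts.
    train_senses: dict concept -> Counter of sense labels (assumed format)."""
    senses = train_senses.get(concept)
    if not senses:
        return concept + "-01"
    return senses.most_common(1)[0][0]
```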
Our probabilistic model predicts edges conditionally independently and thus cannot guarantee the connectivity of the AMR graph; there are also additional constraints which are useful to impose. We enforce three constraints: (1) specific concepts can have only one neighbor (e.g., 'number' and 'string'; see appendix for details); (2) each predicate concept can have at most one argument for each relation r ∈ R; (3) the graph should be connected. Constraint (1) is addressed simply by keeping only the highest scoring neighbor. In order to satisfy the last two constraints we use a simple greedy procedure. First, for each pair of nodes, we pick the highest scoring relation (possibly NULL). If constraint (2) is violated, we keep the highest scoring edge among the duplicates and drop the rest. If the graph is not connected (i.e. constraint (3) is violated), we greedily add edges linking the connected components until the graph is connected (as in MSCG of Flanigan et al. (2014)).
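The greedy connectivity repair for constraint (3) can be sketched with a union-find structure, in the spirit of MSCG (Flanigan et al., 2014); this simplified version starts from disconnected nodes and ignores relation labels:

```python
def connect(n_nodes, scored_edges):
    """Greedily add the highest-scoring edges that link distinct
    connected components until one component remains.
    scored_edges: list of (score, i, j) tuples (illustrative format)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    added = []
    for score, i, j in sorted(scored_edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:              # edge links two different components
            parent[ri] = rj
            added.append((i, j))
        if len(added) == n_nodes - 1:
            break                 # fully connected
    return added
```

In the parser, the initial components would come from the already-predicted non-NULL edges rather than from isolated nodes.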
Finally, we need to select a root node. Similarly to relation identification, for each candidate concept c_i, we concatenate its embedding with the corresponding LSTM state (h_{a_i}), score the combination, and use these scores in a softmax classifier over all the concepts.

Data and setting
We primarily focus on the most recent LDC2016E25 (R2) dataset, which consists of 36521, 1368 and 1371 sentences in the training, development and test sets, respectively. The earlier LDC2015E86 (R1) dataset has been used by much of the previous work. It contains 16833 training sentences, and the same development and test sentences as R2.6 We used the development set to perform model selection and hyperparameter tuning. The hyperparameters, as well as information about embeddings and pre-processing, are presented in the supplementary materials.
We used Adam (Kingma and Ba, 2014) to optimize the loss (4) and to train the root classifier.
In order to make sure the model pays more attention to concept prediction, we split training into two stages. We start by jointly training the alignment model and the concept identification model. We then continue optimizing the entire objective, but with the concept identification model fixed. We perform early stopping based on development set scores at both stages.
In order to disentangle individual phenomena, we use the AMR-evaluation tools (Damonte et al., 2017) and compare to systems which reported these scores (Table 2). We obtain the highest scores on most subtasks. The exceptions are named entity recognition (NER) and negation detection. For NER, the best parser is AMREager (Damonte et al., 2017), which relies on an external NER system and a gazetteer. This may suggest that the AMR bank on its own is not large enough for a system to learn an accurate NER model. The result for negation detection is also not surprising, as many negations are encoded morphologically, and character-level models, unlike our word-level model, can capture predictive morphological features (e.g., detect prefixes such as "un-" or "im-").

We now turn to ablation experiments (Table 3). First, we would like to see if our latent alignment framework is beneficial. To test this, we create a baseline version of our system ('fixed-align') which relies on the JAMR aligner (Flanigan et al., 2014) rather than inducing alignments as latent variables. Recall that in our model we used training data and a lemmatizer to produce candidates for the concept prediction model (see Section 2.3, the copy function). In order to have a fair comparison, if a concept is left unaligned by JAMR, we try to use our copy function to align it. If an alignment is still not found, we make the alignment uniform across the unaligned words. In preliminary experiments, we considered alternative versions (e.g., dropping concepts unaligned by JAMR, or dropping concepts unaligned after both JAMR and the matching heuristic), but the chosen strategy was the most effective. We observe that using fixed alignments gives 70.9% Smatch score on R1, a substantial drop in performance (1.8%). Interestingly, these scores of fixed-align are on par with Foland and Martin (2017). This is natural, as the fixed-align version is similar to Foland and Martin (2017): both rely on fixed JAMR alignments and use BiLSTM encoders.
These results confirm that we used a strong baseline, and that the gains in performance are primarily due to using our variational alignment framework.
We present further ablations in Table 4. We would like to confirm that our two-stage training procedure (described in Section 4.1) is necessary. When compared to one-stage training ('full joint'), we observe that using the schedule is beneficial (+1.5%). A natural question is whether alignments need to be updated at all based on relations (i.e. on the second stage). This 'no fine-tune' version also appears weaker overall than the best approach (-0.3%). The drop is more substantial for relations ('SRL'): -0.6%.

Figure 4: When modeling concepts alone, the posterior probability of the correct (green) and wrong (red) alignment links will be the same.

In order to see why relations are potentially useful in learning alignments, consider Figure 4. The example contains duplicate concepts long. The concept prediction model factorizes over concepts and does not care which way these duplicates are aligned: correctly (green edges) or not (red edges). Formally, the true posterior under the concept-only model used in 'no fine-tune' assigns exactly the same probability to both configurations, and the alignment model Q_ψ will be forced to mimic it (even though it relies on an LSTM model of the graph).
This spurious ambiguity has a detrimental effect on the relation identification stage. These observations and the ablations suggest that our training regime provides a good middle ground between fully joint optimization (which fails to emphasize the importance of the concept prediction stage) and 'no fine-tune' (which makes the alignments sub-optimal for relation identification). We perform two additional ablation tests. First, we compare our relation identification component to simply using a bilinear scoring function (i.e. not encouraging competition, Section 2.4) and observe a drop in performance (-0.2% overall Smatch, -0.9% on SRL). Second, we show that using the simple hierarchical relaxation ('hrelax', see equation (5)) results in a large drop in performance compared to our softer relaxation (-1.5% Smatch). We hypothesize that the softer relaxation favors exploration of alignments and helps to discover better configurations.

Additional Related Work
Alignment performance has previously been identified as a potential bottleneck affecting AMR parsing (Damonte et al., 2017; Foland and Martin, 2017). Some recent work has focused on building aligners specifically for training their parsers (e.g., Werling et al., 2015). However, those aligners are trained independently of concept and relation identification and are only used at pre-processing.
Treating alignments as discrete variables has been successful in some sequence transduction tasks (Yu et al., 2016, 2017). Our work is similar in that we also induce discrete alignments jointly with the model, but the tasks, the inference framework and the decoders are very different. For AMR parsing, another way to avoid using pre-trained aligners is to use seq2seq models (Konstas et al., 2017; van Noord and Bos, 2017). In particular, van Noord and Bos (2017) used a character-level seq2seq model and achieved the previous state-of-the-art result. However, their model is very data-demanding, as they needed to train it on an additional 100K sentences parsed by other parsers. This may be due to two reasons. First, seq2seq models are often not as strong on smaller datasets. Second, recurrent decoders may struggle with predicting linearized AMRs, as many statistical dependencies are highly non-local.