Unsupervised AMR-Dependency Parse Alignment

In this paper, we introduce an Abstract Meaning Representation (AMR) to Dependency Parse aligner. Alignment is a preliminary step for AMR parsing, and our aligner improves current AMR parser performance. Our aligner incorporates several features, including named entity tags and semantic role labels, and is trained with Expectation-Maximization. Results show that our aligner reaches an 87.1% F1 score on the experimental data and enhances AMR parsing.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a semantic representation that expresses the logical meaning of English sentences with rooted, directed, acyclic graphs. AMR associates semantic concepts with the nodes of a graph, while relations are the labeled edges between concept nodes. AMR relies heavily on predicate-argument relations from PropBank (Palmer et al., 2005), with which it shares several edge labels. The representation also encodes rich information, such as semantic roles (all the "ARGN" tags from PropBank), named entities (NE) (concepts like "person" and "location"), wiki links (":wiki" tags), and co-reference (reuse of variables, e.g., p). An example AMR in PENMAN format (Matthiessen and Bateman, 1991) is shown in Figure 1.
The design of an AMR-to-English-sentence aligner is the first step in implementing an AMR parser, since AMR annotation does not contain links between each AMR concept and the original span of words. The basic alignment strategy is to link the AMR tokens (either concepts or edge labels) with their corresponding span of words.

(j / join-01
   :ARG0 (p / person :wiki -
      :name (p2 / name :op1 "Pierre" :op2 "Vinken")
      :age (t / temporal-quantity :quant 61
         :unit (y / year)))
   :ARG1 (b / board
      :ARG1-of (h / have-org-role-91
         :ARG0 p
         :ARG2 (d2 / director
            :mod (e / executive :polarity -))))
   :time (d / date-entity :month 11 :day 29))

Figure 1: The AMR annotation of the sentence "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29." in PENMAN format.

Another strategy is to find the alignment from an AMR concept to a word node in a dependency parse tree, which is the goal of this paper. A dependency parse tree is a good structure for attaching additional information, e.g., named entity tags, lemmas, and semantic role labels, and provides richer syntactic information than a span of words. An alignment between an AMR concept and a dependency node represents a correspondence between the meaning of the concept (together with its child concepts) and the phrase governed by the dependency node (i.e., its head word). An example alignment is shown in Figure 2: the word node "Vinken" on the dependency parse side links to the lexical concept "Vinken" and, furthermore, to the "p2/name" and "p/person" concepts, since "Vinken" is the head of the named entity "Pierre Vinken" and of the whole noun phrase "Pierre Vinken, 61 years old."

Figure 2: The alignment between a subgraph of an AMR (top) and a dependency parse (bottom) for the "Pierre Vinken" sentence. Dashed lines link dependency parse nodes to corresponding concepts.

In our work, we use Expectation-Maximization (EM) (Dempster et al., 1977) to estimate the probabilities of individual features, such as surface forms, relation labels, named entity labels, and global features. EM processing then incorporates all the individual probabilities and estimates the final alignments. We describe AMR-English sentence alignment in general, and review related work, in Section 2. Descriptions of our AMR-dependency parse features and alignment model follow in Section 3. Our beam-search decoder is described in Section 4. Our experimental results are presented in Section 5, followed by our conclusion and discussion of future work (Section 6).

AMR-English Sentence Aligner
A preliminary step for an AMR parser is aligning AMR concepts to the original spans of words. JAMR (Flanigan et al., 2014) includes a heuristic alignment algorithm between AMR concepts and words or phrases from the original sentence. It uses a set of alignment rules, e.g., named entity, fuzzy named entity, and date entity, with a greedy strategy to match the alignments. This aligner achieves a 90% F1 score on hand-aligned AMR-sentence pairs. The ISI Aligner (Pourdamghani et al., 2014), on the other hand, presents a generative model to align AMR graphs to sentence strings. They propose a string-to-string alignment model that transfers the AMR expression into a linearized string representation as an initial step. Their training method is based on the IBM word alignment models (Brown et al., 1993), but they modify the objective function of the alignment model. IBM Model-4 with a symmetrization method reaches the highest F1 score, 83.1%. When the alignments are separated into roles (edge labels) and non-roles (concepts), the F1 scores are 49.3% and 89.8%, respectively. Werling's AMR parser (Werling et al., 2015) treats the alignment task as a linear programming relaxation of a boolean problem, with an objective function that maximizes the sum of action reliability. Each concept is constrained to align to exactly one token in a sentence, which ensures that only adjacent nodes, or nodes that share the same title, refer to the same token. They hand-annotate 100 AMR parses, and their aligner achieves an accuracy of 83.2%. By providing alternative alignments to their graph-based AMR parser, their aligner achieves a better Smatch score than JAMR's aligner.
However, two transition-based parsers that parse dependency tree structures into AMRs, the CAMR system (Wang et al., 2015; Wang et al., 2016) and the RIGA system (Barzdins and Gosko, 2016), tied for the best results in SemEval-2016 Task 8 (May, 2016). It is important to note that the JAMR aligner was not designed to align a dependency word node to an AMR concept, where its alignment F1 score is only 69.8% (see Section 5.2). To deal with this problem, Chen (2015) proposed a preliminary aligner that estimates alignments by jointly learning the feature probabilities of lexical (surface) forms, relations, named entities, and semantic roles. Besides the objective of obtaining alignments between AMR concepts and original word spans, the estimation of these feature probabilities is also useful for further development of an AMR parser built on these initial models. In this paper, we extend that previous work by adding rule-based and global features, and by adding a beam-search algorithm at decoding time.
We denote an AMR as a list of concepts C = c_1, c_2, ..., c_|C|, and the corresponding dependency parse as a list of dependency word nodes D = d_1, d_2, ..., d_|D|. An alignment function a produces exactly one alignment to a dependency node d_{c_j} for each concept c_j within a single sentence. Alternatively, we can view a as a mapping function that takes one input concept c_j and outputs the dependency node d_{c_j} with which c_j is aligned. A is the alignment set containing all the different a_l that cover the possible alignments within C and D. Our model adopts an asymmetric alignment direction: each concept maps to exactly one dependency parse node, while each dependency parse node can be aligned to by zero or more concepts. We denote the dependency node linked to concept c as d_c = a(c). c_p is the parent concept of concept c, while c_s1, c_s2, ..., c_sk are the k child concepts of concept c.
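The asymmetric mapping described above can be sketched in a few lines of code. This is a minimal illustration with our own naming (not the paper's implementation): each concept maps to exactly one dependency node, while a node may receive several concepts.

```python
# Minimal sketch of the asymmetric alignment function a: every concept
# aligns to exactly one dependency node, while a dependency node may
# receive zero or more concepts.
def make_alignment(concepts, dep_choice):
    """dep_choice[j] = dependency node chosen for concept c_j."""
    assert len(dep_choice) == len(concepts), "exactly one node per concept"
    return dict(zip(concepts, dep_choice))

# "Pierre Vinken" fragment: "person" and "name" both align to the head
# word node "Vinken" (many-to-one is allowed in this direction).
a = make_alignment(["join-01", "person", "name"], ["join", "Vinken", "Vinken"])
assert a["person"] == a["name"] == "Vinken"
```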

Basic Features
Several AMR concepts use the word form directly. For example, the concept "join-01" in Figure 1 naturally aligns to the dependency node "join". Similarly, leaf concepts usually align to identical terms in the dependency parse; in Figure 1, the names "Pierre" and "Vinken" are aligned to their word forms on the dependency parse leaves. Therefore, we design a straightforward rule-based probability, P_rule, which captures the appearance of the surface form. P_rule(c, d_c) is defined as the probability that the matching type links a given concept c and dependency node d_c. The different rule types, e.g., word, lemma, numbers, and date, and their applicability to AMR concepts and leaves are listed in Table 1. For example, the "Date" rule type aligns the concept "11" with the word node "November" in Figure 1, while "Numbers" aligns the concept "5" with the word node "five". P_rule decides which match type to apply by following a greedy matching strategy.
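The greedy matching strategy can be illustrated with a small sketch. The rule ordering and the tiny lookup tables below are our own assumptions for demonstration, not the paper's code: we simply try the most specific rule types first and stop at the first match.

```python
# Hedged sketch of a greedy P_rule matcher; rule names, ordering, and the
# toy normalization tables are illustrative assumptions only.
NUM_WORDS = {"5": "5", "five": "5"}                  # toy number normalization
MONTHS = {"11": "november", "november": "november"}  # toy date normalization

def match_type(concept, word, lemma):
    """Return the first (most specific) rule type that links concept and word."""
    c, w = concept.lower(), word.lower()
    if c == w:
        return "word"
    if c == lemma.lower():
        return "lemma"
    n_c, n_w = NUM_WORDS.get(c), NUM_WORDS.get(w)
    if n_c is not None and n_c == n_w:               # "5" ~ "five"
        return "number"
    m_c, m_w = MONTHS.get(c), MONTHS.get(w)
    if m_c is not None and m_c == m_w:               # "11" ~ "November"
        return "date"
    return None                                      # fall back to external features

assert match_type("join", "join", "join") == "word"
assert match_type("11", "November", "november") == "date"
assert match_type("5", "five", "five") == "number"
```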

External Features
To capture alignments for concepts that do not match any of the above basic rules, we design the following four external feature probabilities.

Lemma Probability is the conditional probability of concept c given the lemma of its aligned dependency node d_c. For example, in Figure 3a, the concept c = "temporal-quantity" is highly likely to align to the word node d_c = "old", since "old" is usually the head word of a phrase expressing age ("61 years old" here). Likewise, "have-org-role-91" can align to the word node "director", since "director" appears quite often with "have-org-role-91" (defined as roles in organizations). In addition, some special leaf concepts, like ":polarity -" (negation) and ":mode expressive" (which marks exclamatory words), also rely on this feature rather than on the basic rules.
Relation Probability is the conditional probability of the AMR relation label of c, given the parse tree path between d_c and d_{c_p}, where d_c and d_{c_p} are the dependency nodes aligned by c and c_p, respectively. A parse tree path is the concatenation of all dependency labels and direction markers along the tree path between d_c and d_{c_p}. For example, the relation probability for c = 61, d_c = 61, and d_{c_p} = old in Figure 3b is P(quant | advmod↓ num↓). The parse tree path is a useful feature for extracting relations between any two tree nodes, e.g., in Semantic Role Labeling (SRL) (Gildea and Jurafsky, 2002) and relation extraction (Bunescu and Mooney, 2005; Kambhatla, 2004; Xu et al., 2015), so we add the relation probability to our model.

Named Entity Probability is the conditional probability of the named entity concept of c, given the named entity label of d_c (PERSON, LOCATION, etc.). NamedEntity(d) indicates the named entity type of the phrase with d as the head word. For example, after named entity recognition (NER) tagging, the label "PERSON" is assigned to the dependency parse tree node "Vinken". So the named entity probability P_NE(c = person, d_c = Vinken) in Figure 3c is P(person | PERSON). Since AMR contains a large amount of named entity information, we assume that a feature based on an external named entity module should improve alignment accuracy.
Semantic Role Probability is the conditional probability of the AMR relation label of c, given the semantic role of d_c, when d_{c_p} is a predicate and d_c is one of its arguments. If no predicate-argument structure exists between d_{c_p} and d_c, the semantic role probability is omitted. For example, in Figure 3d, the semantic role probability P_SR(c = person, d_c = Vinken, d_{c_p} = join) equals P(ARG0 | Arg0). Since AMR depends heavily on predicate-argument relations, predicate-argument information from an external SRL system should enhance overall alignment accuracy.
The above four feature probabilities are learned by the EM algorithm (Section 3.2).
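As an illustration of one of these features, the parse-tree path used for the relation probability can be extracted as follows. This is our reconstruction under a simple parent-pointer tree assumption (the function and variable names are ours), reproducing the "advmod↓ num↓" path from the example above.

```python
# Reconstruction of the parse-tree-path feature used by P_rel:
# concatenate dependency labels with direction markers along the path
# between two nodes of a parent-pointer tree (labels[n] = n's incoming label).
def tree_path(node, target, parents, labels):
    up, seen = [], {}
    cur = node
    while cur is not None:                 # collect ancestors of `node`
        seen[cur] = len(up)
        up.append(cur)
        cur = parents[cur]
    down, cur = [], target
    while cur not in seen:                 # climb from `target` to the join point
        down.append(labels[cur] + "↓")
        cur = parents[cur]
    ups = [labels[n] + "↑" for n in up[:seen[cur]]]
    return "".join(ups + list(reversed(down)))

# "61 years old": old -(advmod)-> years -(num)-> 61
parents = {"61": "years", "years": "old", "old": None}
labels = {"61": "num", "years": "advmod"}
assert tree_path("old", "61", parents, labels) == "advmod↓num↓"
```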

Global Feature
The above basic and external features capture local alignment information. However, to make sure that a concept is aligned to the correct phrase head word, the one representing the same sub-meaning, we need a global feature that measures coverage. Our concept coverage feature, R_CC(c), is the overlapping ratio of the child concepts' aligned phrases to their parent concept's aligned phrase, combined with a non-covered penalty:

R_CC(c) = |W(d_c) ∩ (W(d_{c_s1}) ∪ ... ∪ W(d_{c_sk}))| / |W(d_c)| - pen

where W refers to the set of words that the aligned dependency word node's phrase contains. The first term of R_CC ensures that the child concepts cover the largest possible subspans of the parent concept's span. The non-covered penalty term (pen) prevents a child concept from aligning to a word node that covers a larger word span than the child's parent concept; pen increases exponentially with the number of child-span words outside the parent span, |(W(d_{c_s1}) ∪ ... ∪ W(d_{c_sk})) \ W(d_c)|, where the backslash "\" denotes set subtraction.

Figure 4: A sample of an incorrect alignment, used to calculate the overlapping ratio R_CC for c = temporal-quantity.

Take Figure 4 as an example of an incorrect alignment, in which the concept "temporal-quantity" aligns to "year" and the concept "year" aligns to "old"; the overlapping ratio of this alignment is 0.37, since it suffers the penalty. By comparison, the correct alignment in Figure 3b has an overlapping ratio of 0.67, much higher than the incorrect one.
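The coverage mechanics can be sketched as follows. This is an illustrative simplification: the linear penalty weight is our own choice (the paper's penalty grows exponentially), and it serves only to show how overlap is rewarded and uncovered child words are penalized.

```python
# Illustrative sketch of the coverage idea behind R_CC; the linear penalty
# is a simplification of the paper's exponential penalty.
def coverage_ratio(parent_span, child_spans, penalty_weight=0.5):
    parent = set(parent_span)
    children = set().union(*child_spans) if child_spans else set()
    overlap = len(parent & children) / len(parent)   # first term of R_CC
    pen = penalty_weight * len(children - parent)    # "\" = set subtraction
    return overlap - pen

good = coverage_ratio(["61", "years", "old"], [["61", "years"]])  # children inside parent
bad = coverage_ratio(["year"], [["61", "years", "old"]])          # child span too wide
assert good > bad   # the incorrect alignment suffers the non-covered penalty
```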

Training with EM Algorithm
Since our long-term goal is to design a dependency-parse-to-AMR parser, we define the objective function L_θ of our AMR-to-Dependency Parse aligner as the probability that dependency parses transfer to AMR graphs:

L_θ = Π_{(C,D)∈S} P(C|D; θ) = Π_{(C,D)∈S} Σ_{a∈A} Π_{j=1}^{|C|} P(c_j | d_{c_j} = a(c_j), d_{c_pj} = a(c_pj))   (3)

where θ = (P_lemma, P_rel, P_NE, P_SR) is the set of feature probabilities (parameters) we want to estimate, the alignment set A is the latent variable we want to observe, and S is the training sample containing tuples (C, D, A), where C and D are an AMR-dependency parse pair and A is their alignment combination set. In Equation (3), the probability that dependency tree D translates to AMR C with an alignment combination a equals the product, over all concepts c_j in C, of the probability that c_j aligns to dependency node d_{c_j} while its parent c_pj aligns to dependency node d_{c_pj}.

Expectation-Step
The E-Step estimates the probabilities of all the different alignments of an input AMR-dependency parse pair as products of feature probabilities. The alignment probability is calculated as:

P(a|C,D) = Π_{j=1}^{|C|} P(c_j | d_{c_j} = a(c_j), d_{c_pj} = a(c_pj))   (4)

P(c | d_c, d_{c_p}) = P_rule(c, d_c) · P_lemma(c | d_c) · P_rel(c | d_c, d_{c_p}) · P_NE(c | d_c) · P_SR(c | d_c, d_{c_p})   (5)

That is, the alignment probability equals the product of the aligning probabilities of all tuples (c, d_c, d_{c_p}). P_rule is obtained by a simple calculation on the development set, while P_lemma, P_rel, P_NE, and P_SR are initialized uniformly before the first round of the E-step; these feature probabilities are then updated during the M-step.

Maximization-Step
In the M-Step, feature probabilities are re-estimated by collecting counts over all AMR-dependency parse pairs. The counts of the lemma (cnt_lemma), relation (cnt_rel), named entity (cnt_NE), and semantic role (cnt_SR) features are normalized counts accumulated from the probabilities of all possible alignments in the E-step. We take the derivation of cnt_lemma as an example; cnt_rel, cnt_NE, and cnt_SR are obtained with similar equations:

cnt_lemma(c_l, d_l) = Σ_{(C,D)∈S} Σ_{a∈A} P(a|C,D) · |{j : lemma(c_j) = c_l, lemma(a(c_j)) = d_l}|   (6)

After we collect all the counts for the different features, the four feature probabilities, P_lemma, P_rel, P_NE, and P_SR, are updated from their feature counts. We show the update of P_lemma as an example; the remaining feature probability updates can be derived in the same way:

P_lemma(c_l | d_l) = cnt_lemma(c_l, d_l) / Σ_{c'} cnt_lemma(c', d_l)   (7)

We then apply the updated feature probabilities to recalculate the alignment probabilities in the E-step. EM iterates the E- and M-steps until convergence or other stopping criteria are met.
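The E/M loop above can be sketched compactly. This is our own toy formulation for a single lemma-like feature on tiny data (exhaustive alignment enumeration is feasible only here); the function names and the rule-like boost are illustrative assumptions, not the paper's implementation.

```python
# Toy EM sketch: E-step scores every candidate alignment with the current
# feature table; M-step renormalizes the accumulated soft counts.
from collections import defaultdict
from itertools import product

def prod_score(alignment, concepts, p_lemma):
    """Product of per-concept scores (lemma table times a rule-like boost)."""
    s = 1.0
    for c, d in zip(concepts, alignment):
        s *= p_lemma[(c, d)] * (2.0 if c == d else 1.0)
    return s

def em(pairs, n_iters=10):
    p_lemma = defaultdict(lambda: 1.0)            # uniform initialization
    for _ in range(n_iters):
        cnt = defaultdict(float)                  # E-step: soft counts
        for concepts, nodes in pairs:
            alignments = list(product(nodes, repeat=len(concepts)))
            scores = [prod_score(a, concepts, p_lemma) for a in alignments]
            z = sum(scores)
            for a, s in zip(alignments, scores):
                for c, d in zip(concepts, a):
                    cnt[(c, d)] += s / z
        totals = defaultdict(float)               # M-step: renormalize per node
        for (c, d), v in cnt.items():
            totals[d] += v
        p_lemma = defaultdict(lambda: 1e-9,
                              {cd: v / totals[cd[1]] for cd, v in cnt.items()})
    return p_lemma

p = em([(["join", "board"], ["join", "board", "the"])], n_iters=5)
assert p[("join", "join")] > p[("join", "board")]   # mass concentrates on matches
```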

Decoding
At decoding time, we want to find the most likely alignment a for a given pair (C, D). Applying Equations (4) and (5), we define the search for alignments as:

argmax_a P(a|C,D) = argmax_a Π_{j=1}^{|C|} R_CC(c_j) · P(c_j | d_i = a(c_j), d_l = a(c_pj))

This decoding problem finds the alignment a that maximizes the likelihood defined in Equation (5). The overlapping ratio R_CC is introduced into the likelihood function to ensure that a parent concept covers a wider word span than its child concepts. A beam search algorithm extracts the target alignment without exhaustively searching all candidate alignments (which would have complexity O(|D|^|C|)). The beam search starts from the leaf concepts and walks through parent concepts only after their child concepts have been traversed. When we visit concept c_j, we consider the following likelihoods: 1) the accumulated likelihood of aligning to any dependency word node d_{c_j} from all the child concepts of c_j, and 2) the product of P_lemma, P_NE, P_rel, P_SR, and R_CC for c_j. Note that R_CC is not used during training; it is applied only at decoding time. The probability of each hypothesis is simply the product of all the above likelihoods. We keep the top-|b| alignment probabilities and their aligned dependency nodes d_{c_j} for each c_j until we reach the root concept, where |b| is the beam size. Finally, we trace back to find the most likely alignment. The running time of the beam search algorithm is O(|b| · |C| · |D|²).
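The children-first traversal can be sketched as follows. This is a simplified reconstruction (our naming, not the paper's code): for brevity each parent hypothesis takes only the single best child hypothesis rather than combining full child beams, and the trace-back is greedy.

```python
# Simplified bottom-up beam sketch: process concepts children-first, keep
# the top-|b| scored node choices per concept, then trace back from the root.
def beam_align(root, children, dep_nodes, score, beam=3):
    """children: concept -> child concepts; score(c, d): local likelihood."""
    beams = {}                                     # concept -> top-|b| (prob, node)
    def visit(c):
        for ch in children.get(c, []):
            visit(ch)                              # child concepts traversed first
        cands = []
        for d in dep_nodes:
            p = score(c, d)
            for ch in children.get(c, []):
                p *= max(b[0] for b in beams[ch])  # simplification: best child only
            cands.append((p, d))
        beams[c] = sorted(cands, reverse=True)[:beam]
    visit(root)
    result = {}                                    # greedy trace-back from the root
    def collect(c):
        result[c] = beams[c][0][1]
        for ch in children.get(c, []):
            collect(ch)
    collect(root)
    return result

# "61 years old": temporal-quantity should land on the head word "old".
children = {"temporal-quantity": ["61", "year"]}
score = lambda c, d: 2.0 if c.rstrip("s") == d.rstrip("s") \
    or (c, d) == ("temporal-quantity", "old") else 0.5
res = beam_align("temporal-quantity", children, ["61", "years", "old"], score, beam=2)
assert res == {"temporal-quantity": "old", "61": "61", "year": "years"}
```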

Experimental Data
The LDC DEFT Phase 2 AMR Annotation Release 2.0 [1] consists of AMRs paired with English sentences. Annotated selections from various genres (including newswire, discussion forums, other web logs, and television transcripts) are available, for a total of 39,260 sentences. This release uses the PropBank Unification frame files (Bonial et al., 2014; Bonial et al., 2016). To generate automatic dependency parses for all DEFT AMR Release data, we use ClearNLP (Choi and McCallum, 2013); ClearNLP also automatically labels semantic roles and named entity tags on the generated dependency parses. This data set is named "All". To compare the effect of applying automatic rather than gold dependency parses to our aligner, we also select the sentences that appear in the OntoNotes 5.0 [2] release. The OntoNotes data contains TreeBank, PropBank, and named entity annotations; OntoNotes 5.0 also uses the PropBank Unification frame files for PropBanking. This data set, containing a total of 8,276 selected AMRs and their dependency parses from OntoNotes, is named "Gold". To generate the development and test sets, we manually align the AMR concepts and dependency word nodes. Since manual alignment is time-consuming, the "Gold" and "All" data share the same development/test set. Table 2 presents the statistics for the experimental data. [3]

Experiment Results
We run EM for 50 iterations and ensure the EM model converges. Afterwards, we use our decoding algorithm to find the alignments that maximize the likelihood. The test set data is used to evaluate performance.
We first evaluate the performance of our system with the external features added incrementally; Table 3 shows the results. When running with the "Gold" data, the only feature that improves significantly over the baseline (rule-based and lexical features only) is the semantic role feature; the named entity feature actually hurts performance. On the other hand, all the features contribute incrementally to the F1 score for "All". Again, the semantic role feature has the most positive impact of all the features, and a significant improvement over the baseline.
As we compare the F1 scores when training with the "All" and "Gold" data sets, training with "All" outperforms training with "Gold" across all feature combinations. We believe there are two reasons for this. First, the "All" data contains richer information than the "Gold" data: "All" has double the sentence count of "Gold", and proportionally more named entity labels. Second, the automatic dependency parses do not hurt the performance of our aligner very much. We believe that our unsupervised alignment model works better with more data, even without access to gold-standard dependency parses. We then compare our aligner with three other aligners: JAMR, another version of unsupervised alignment (Chen, 2015), and ISI. To fit them to our test data, we design a heuristic method that forces every unaligned concept (e.g., named entity and ":polarity -" concepts) to align to a dependency word node according to the rule-based and global features (see Section 3.1). An alignment is counted as a correct match when the concept aligns to either the head word or a partial word span of a phrase. Alignments from concept relations to word spans (which apply in ISI) are discarded in our task. The results of the experiment are shown in Table 4. Our aligner achieves the best F1 score on both the "All" and "Gold" data sets, as it should, since it is designed to align AMRs to dependency parses, as was the Chen aligner. Our aligner outperforms the Chen aligner by around 28% F1. We conclude that the addition of rule-based features, global features, and beam search at decoding time helps the alignment task substantially.

[2] LDC OntoNotes Release 5.0, release date: October 16, 2013. https://catalog.ldc.upenn.edu/LDC2013T19
[3] The manually aligned data and our aligner will be made available after this paper is accepted.

Apply to AMR Parsing
Table 4: Results of the different alignment models

To evaluate how alignment can enhance AMR parsing, we compare the parsing performance of the CAMR parser with the different alignments produced by JAMR, ISI, and our aligner. To make the alignments fit the CAMR parser, we convert both ISI's and our alignments to the original JAMR alignment format (word span to AMR concept). We remove the ":wiki" tag, which links a named entity to its Wikipedia page, to simplify the parsing task, since we regard the Wikify task (Mihalcea and Csomai, 2007) as fundamentally different from AMR parsing. Smatch v2.0.2 is used to evaluate AMR parsing performance (Cai and Knight, 2012); the evaluation script is obtained from the SemEval 2016 Task 8 website [4]. A comparison of parsing results is given in Table 5. We first train the parser with gold-standard dependency parses and alignments from the different aligners. Results show that our aligner improves the F1 score by 2% over the two other aligners. We then train the AMR parser with the "All" data set; the dependency parses with attached semantic roles and named entities generated by ClearNLP are also provided to CAMR as training data. CAMR uses dependency parsing results from the Stanford parser (Klein and Manning, 2003) by default. Our aligner still achieves slightly better performance than the other two. Modifying the AMR parser to take advantage of parse-node-to-concept alignments could potentially yield greater improvement, since CAMR takes its input alignments as word span to AMR concept.

Error Analysis
To further understand the advantages and the disadvantages of our model, we go through all incorrect alignments and manually categorize 40% of them into different error types, with their proportions.

[4] http://alt.qcri.org/semeval2016/task8/index.php?id=dataand-tools

Automatic Parse Errors: ClearNLP is evaluated on the Penn Treebank (Marcus et al., 1993), Section 23, for dependency parsing. Therefore, when training our aligner on the "All" data set with dependency parses, named entities, and semantic roles generated by ClearNLP, incorrect parses occasionally show up. Since NE and semantic role labels are attached to the dependency parses, incorrect dependency parses cause additional NE and semantic role alignment errors, on top of the dependency parse alignment errors.
Long Distance Dependencies - 14.2%: Long sentences with long-distance dependencies consistently pose difficulties for NLP parsing tasks. Experimental results show that our model runs into trouble when nearby concepts align to dependency nodes that are far from each other. Co-reference, for example, is highly likely to involve long-distance dependencies, and our model cannot handle it well.
Duplicate Words - 17.4%: When two identical concepts align to different word nodes, our model is confused by the duplicate words. In Figure 5, there are two instances of "first" in the sentence: one refers to "first 6 rounds", and the other refers to "first position". However, our model incorrectly aligns both ordinal-entity concepts to the same "first" word node. Our model cannot distinguish the two ordinal-entities, since the lexical and named entity tags of the two "first"s are identical.
Meaning Coverage Errors - 40.4%: We define a good alignment as one where a concept aligns to the correct phrase head word, which represents the same sub-meaning. So instead of aligning to its own lexical word, a concept should sometimes align to that word's parent node (head word). However, the lexical features dominate the alignment probability in our EM calculation, which causes our model to prefer aligning a concept to its word form rather than to its head word. For example, English light verb constructions (LVCs), e.g., "take a bath", consist of a semantically general verb and a noun denoting an event or state. AMR drops the light verb and uses the eventive noun as the concept. Our model sometimes aligns this eventive noun concept to its nominal word node, which is incorrect, since the light verb in the dependency parse covers the same sub-meaning and should be aligned instead.

Conclusion and Future Work
In this paper, we present an AMR-to-Dependency Parse aligner that estimates feature probabilities with the EM algorithm and can be used directly by an AMR parser. Results show that our aligner performs better than other aligners and improves AMR parser performance. The latent probabilities obtained during training, i.e., all the external feature sets, could also benefit a parser. We plan to develop our own AMR parser, which will use these external feature sets as its basic model. We also plan to continue improving our aligner by tuning the feature weights and learning techniques, and by adding new features, such as word embeddings and WordNet features.