AMR-to-text Generation with Synchronous Node Replacement Grammar

This paper addresses the task of AMR-to-text generation by leveraging synchronous node replacement grammar. During training, graph-to-string rules are learned using a heuristic extraction algorithm. At test time, a graph transducer is applied to collapse input AMRs and generate output sentences. Evaluated on a standard benchmark, our method gives the state-of-the-art result.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a semantic formalism encoding the meaning of a sentence as a rooted, directed graph. AMR uses a graph to represent meaning, where nodes (such as "boy", "want-01") represent concepts, and edges (such as "ARG0", "ARG1") represent relations between concepts. Encoding many semantic phenomena into a graph structure, AMR is useful for NLP tasks such as machine translation (Jones et al., 2012; Tamchyna et al., 2015), question answering (Mitra and Baral, 2015), summarization (Takase et al., 2016) and event detection (Li et al., 2015).
Flanigan et al. (2016) transform an input AMR graph into a spanning tree, before translating it to a sentence using a tree-to-string transducer. Their method leverages existing machine translation techniques, capturing hierarchical correspondences between the spanning tree and the surface string. On the other hand, it suffers from error propagation, since the output is constrained by the spanning tree due to the projective correspondence between them: information lost in the graph-to-tree transformation step cannot be recovered. Song et al. (2016) directly generate sentences using graph-fragment-to-string rules. They cast the task of finding a sequence of disjoint rules that transduces an AMR graph into a sentence as a traveling salesman problem, using local features and a language model to rank candidate sentences. However, their method does not learn hierarchical structural correspondences between AMR graphs and strings.
We propose to leverage the advantages of hierarchical rules without suffering from graph-to-tree errors, by directly learning graph-to-string rules. As shown in Figure 1, we learn a synchronous node replacement grammar (NRG) from a corpus of ⟨AMR, sentence⟩ pairs. At test time, we apply a graph transducer to collapse input AMR graphs and generate output strings according to the learned grammar. The system uses a log-linear model with real-valued features, tuned using MERT, and beam search decoding. It gives a BLEU score of 25.62 on the SemEval-2016 Task 8 benchmark, which is the best result reported so far.

Grammar Definition
A synchronous node replacement grammar (NRG) is a rewriting formalism G = ⟨N, Σ, Δ, P, S⟩, where N is a finite set of nonterminals, Σ and Δ are finite sets of terminal symbols for the source and target sides, respectively, S ∈ N is the start symbol, and P is a finite set of productions. Each instance of P takes the form Xi → ⟨F, E, ~⟩, where Xi ∈ N is a nonterminal node, F is a rooted, connected AMR fragment with edge labels over Σ and node labels over N ∪ Σ, E is a corresponding target string over N ∪ Δ, and ~ denotes the alignment of nonterminal symbols between F and E. A classic NRG (Engelfriet and Rozenberg, 1997, Chapter 1) also defines C, an embedding mechanism specifying how F is connected to the rest of the graph when replacing Xi with F. Here we omit defining C and allow arbitrary connections: when a fragment F is replaced with Xi, nodes previously connected to F are reconnected to Xi. Following Chiang (2005), we use only one nonterminal X in addition to S, and use subscripts to distinguish different nonterminal instances. Figure 2 shows an example derivation process for the sentence "the boy wants to go" given the rule set in Table 1. Starting from the start symbol S, which is first replaced with X1, rule (c) is applied to generate "X2 to go" and its AMR counterpart. Then rule (b) is used to generate "X3 wants" and its AMR counterpart from X2. Finally, rule (a) is used to generate "the boy" and its AMR counterpart from X3. Our graph-to-string rules are inspired by synchronous grammars for machine translation (Wu, 1997; Yamada and Knight, 2002; Gildea, 2003; Chiang, 2005; Huang et al., 2006; Liu et al., 2006).
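The string side of the derivation in Figure 2 can be sketched as a top-down rewriting process. The encoding below is purely illustrative, not the paper's data structures: nonterminals are written as #Xi#, and only the target side of rules (a)-(c) from Table 1 is shown.

```python
# Target-string side of rules (a)-(c) in Table 1, plus the start symbol.
# The dict encoding and the #Xi# notation are illustrative assumptions.
rules = {
    "S":  "#X1#",
    "X1": "#X2# to go",   # rule (c)
    "X2": "#X3# wants",   # rule (b)
    "X3": "the boy",      # rule (a)
}

def derive(symbol: str) -> str:
    """Expand nonterminals (written #Xi#) until only terminals remain."""
    out = rules.get(symbol, symbol)
    while "#" in out:
        pre, _, rest = out.partition("#")
        nt, _, post = rest.partition("#")
        out = pre + derive(nt) + post
    return out

print(derive("S"))  # -> the boy wants to go
```

The actual grammar rewrites the AMR fragment and the string synchronously; this sketch shows only how the nested nonterminals yield the surface string.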

Induced Rules
There are three types of rules in our system, namely induced rules, concept rules and graph glue rules. Here we first introduce induced rules, which are obtained by a two-step procedure on a training corpus. As shown in Algorithm 1, the first step is to extract a set of initial rules from training ⟨sentence, AMR⟩ pairs (Line 2) using the phrase-to-graph-fragment extraction algorithm of Peng et al. (2015) (Line 3). An initial rule contains only terminal symbols in both F and E. As a next step, we match pairs of initial rules ri and rj according to Cai and Knight (2013), and generate rij by collapsing ri with rj, if ri contains rj (Lines 6-8). When collapsing ri with rj, we replace the corresponding subgraph in ri.F with a new nonterminal node, and the sub-phrase in ri.E with the same nonterminal. For example, we obtain rule (b) from rules (d) and (a) in Table 1. All initial and generated rules are stored in a rule list R (Lines 5 and 9), which is further normalized to obtain the final induced rule set.
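The collapsing step can be illustrated with a minimal sketch. Rules are simplified here to (fragment, phrase) string pairs, and containment is checked by plain substring matching rather than the actual subgraph matching:

```python
def collapse(ri, rj, nt="#X1#"):
    """If rule rj is contained in rule ri, replace rj's subgraph in ri.F
    and its sub-phrase in ri.E with a fresh nonterminal.
    Rules are (fragment, phrase) string pairs in this simplified sketch."""
    fi, ei = ri
    fj, ej = rj
    if fj not in fi or ej not in ei:
        return None  # rj is not contained in ri
    return (fi.replace(fj, nt, 1), ei.replace(ej, nt, 1))

# Collapsing rule (d) with rule (a) from Table 1 yields rule (b):
rd = ("(w / want-01 :ARG0 (b / boy))", "the boy wants")
ra = ("(b / boy)", "the boy")
print(collapse(rd, ra))  # -> ('(w / want-01 :ARG0 #X1#)', '#X1# wants')
```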

Concept Rules and Glue Rules
In addition to induced rules, we adopt concept rules (Song et al., 2016) and graph glue rules to ensure existence of derivations. For a concept rule, F is a single node in the input AMR graph, and E is a morphological string of the node concept. A concept rule is used in case no induced rule can cover the node. We refer to the verbalization list 3 and AMR guidelines 4 for creating more complex concept rules. For example, one concept rule created from the verbalization list is "(k / keep-01 :ARG1 (p / peace)) ||| peacekeeping".
Inspired by Chiang (2005), we define graph glue rules to concatenate nonterminal nodes connected with an edge, when no induced rules can be applied. Three glue rules are defined for each type of edge label. Taking the edge label "ARG0" as an example, we create the following glue rules:

r1: (X1 / #X1# :ARG0 (X2 / #X2#)) ||| #X1# #X2#
r2: (X1 / #X1# :ARG0 (X2 / #X2#)) ||| #X2# #X1#
r3: (X1 / #X1# :ARG0 X1) ||| #X1#

For both r1 and r2, F contains two nonterminal nodes with a directed edge connecting them, and E is the concatenation of the two nonterminals in either the monotonic or the inverse order. For r3, F contains one nonterminal node with a self-pointing edge, and E is the nonterminal itself. With concept rules and glue rules in our final rule set, legal derivations are guaranteed to exist for any input AMR graph.

Model
We adopt a log-linear model for scoring search hypotheses. Given an input AMR graph, we find the highest scoring derivation t* from all possible derivations t:

t* = argmax_t Σ_i wi · fi(g, t)    (1)

where g denotes the input AMR, and fi(·, ·) and wi represent a feature and its corresponding weight, respectively. The feature set that we adopt includes phrase-to-graph and graph-to-phrase translation probabilities and their corresponding lexicalized translation probabilities (Section 3.1), language model score, word count, rule count, reordering model score (Section 3.2) and moving distance (Section 3.3). The language model score, word count and rule count features are adopted from SMT (Koehn et al., 2003; Chiang, 2005). We perform bottom-up search to transduce input AMRs to surface strings. Each hypothesis contains the current AMR graph, translations of collapsed subgraphs, the feature vector and the current model score. Beam search is adopted, where hypotheses with the same number of collapsed edges and nodes are put into the same beam.
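The weighted-sum scoring and the argmax over derivations described above can be sketched as follows; the feature names, values and weights are illustrative toy numbers, not tuned parameters from the paper:

```python
def score(features: dict, weights: dict) -> float:
    """Log-linear model score: the weighted sum sum_i w_i * f_i(g, t)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_derivation(derivations, weights):
    """Pick t*, the highest scoring candidate derivation."""
    return max(derivations, key=lambda t: score(t, weights))

# Toy feature vectors for two candidate derivations (names are assumptions).
weights = {"lm": 0.5, "p_f_given_e": 0.3, "word_count": -0.1}
candidates = [
    {"lm": -12.0, "p_f_given_e": -3.0, "word_count": 5},  # score -7.4
    {"lm": -10.0, "p_f_given_e": -4.0, "word_count": 6},  # score -6.8
]
print(best_derivation(candidates, weights))
```

In the actual decoder each feature is computed from the hypothesis (collapsed subgraphs, LM state, etc.); the sketch only shows the scoring arithmetic.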

Translation Probabilities
Production rules serve as a basis for scoring hypotheses. We associate each synchronous NRG rule n → ⟨F, E, ~⟩ with a set of probabilities. First, phrase-to-fragment translation probabilities are defined based on maximum likelihood estimation (MLE), as shown in Equation 2, where c⟨F,E⟩ is the fractional count of ⟨F, E⟩:

p(F | E) = c⟨F,E⟩ / Σ_F' c⟨F',E⟩    (2)
In addition, lexicalized translation probabilities are defined as:

p_w(F | E) = Π_{l ∈ F} (1/|E|) Σ_{w ∈ E} p(l | w)    (3)

Here l is a label (including both edge labels such as "ARG0" and concept labels such as "want-01") in the AMR fragment F, and w is a word in the phrase E. Equation 3 can be regarded as a "soft" version of the lexicalized translation probabilities adopted by SMT, which pick the alignment yielding the maximum lexicalized probability for each translation rule. In addition to p(F|E) and p_w(F|E), we use features in the reverse direction, namely p(E|F) and p_w(E|F), whose definitions are omitted as they are consistent with Equations 2 and 3, respectively. The probabilities associated with concept rules and glue rules are manually set to 0.0001.
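Equations 2 and 3 can be sketched as below. The counts and label-word probabilities are toy numbers, and the per-word averaging inside the soft lexicalized probability is an assumption about the exact form:

```python
def p_f_given_e(counts: dict, F: str, E: str) -> float:
    """MLE phrase-to-fragment probability (Equation 2):
    the count of (F, E) normalized over all fragments F' paired with E."""
    total = sum(c for (f, e), c in counts.items() if e == E)
    return counts.get((F, E), 0) / total

def p_w_f_given_e(p_l_given_w: dict, labels, words) -> float:
    """'Soft' lexicalized probability (Equation 3): for each label l in F,
    average p(l|w) over all words w in E, then take the product.
    The averaging is an assumed instantiation of the 'soft' formulation."""
    prob = 1.0
    for l in labels:
        prob *= sum(p_l_given_w.get((l, w), 0.0) for w in words) / len(words)
    return prob

# Toy fractional counts and label-word probabilities (assumptions).
counts = {("(b / boy)", "boy"): 3, ("(b / boy :quant 2)", "boy"): 1}
print(p_f_given_e(counts, "(b / boy)", "boy"))                        # -> 0.75
print(p_w_f_given_e({("boy", "boy"): 0.9}, ["boy"], ["the", "boy"]))  # -> 0.45
```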

Reordering Model
Although the word order is defined for induced rules, it is not for glue rules. We therefore learn a reordering model that helps to decide whether the translations of two nodes should be in monotonic or inverse order, given the label of the directed edge connecting them. The probabilistic model using smoothed counts is defined as:

p(M | h, l, t) = (1 + c(h, l, t, M)) / (2 + c(h, l, t, M) + c(h, l, t, I))    (4)

where c(h, l, t, M) is the count of monotonic translations of head h and tail t connected by edge l, and c(h, l, t, I) is the corresponding count of inverse translations.
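A minimal sketch of the smoothed reordering probability; the add-one smoothing constants are an assumption, since only the counts c(h, l, t, ·) are specified in the text:

```python
def p_monotonic(c_mono: int, c_inv: int) -> float:
    """Smoothed probability that a head-tail pair is translated monotonically,
    given counts of monotonic (c_mono) and inverse (c_inv) translations for
    its (head, label, tail) triple. Add-one smoothing is an assumption."""
    return (1.0 + c_mono) / (2.0 + c_mono + c_inv)

print(p_monotonic(3, 1))  # seen mostly monotonic
print(p_monotonic(0, 0))  # no evidence: falls back to 0.5
```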

Moving Distance
The moving distance feature captures the distances between the subgraph roots of two consecutive rule matches in the decoding process, which controls a bias towards collapsing nearby subgraphs consecutively.

Setup
We use the dataset of SemEval-2016 Task 8, which contains 16833 training, 1368 dev and 1371 test instances. Rules are extracted from the training data, and model parameters are tuned on the dev set. For tuning and testing, we filter out sentences with more than 30 words, resulting in 1103 dev instances and 1055 test instances. We train a 4-gram language model (LM) on Gigaword (LDC2011T07), and use BLEU (Papineni et al., 2002) as the evaluation metric. MERT is used to tune model parameters on k-best outputs on the dev set, where k is set to 20. We investigate the effectiveness of rules and features by ablation tests: "NoInducedRule" does not adopt induced rules, "NoConceptRule" does not adopt concept rules, "NoMovingDistance" does not adopt the moving distance feature, and "NoReorderModel" disables the reordering model. If NoConceptRule cannot produce a legal derivation for a given AMR graph, we concatenate existing translation fragments into a final translation, and if a subgraph cannot be translated, the empty string is used as its output. We also compare our method with previous work, in particular JAMR-gen (Flanigan et al., 2016) and TSP-gen (Song et al., 2016), on the same dataset.

Results
The results are shown in Table 2. First, All outperforms all baselines. NoInducedRule leads to the greatest performance drop compared with All, demonstrating that induced rules play a very important role in our system. On the other hand, NoConceptRule does not lead to much performance drop. This is consistent with the observation of Song et al. (2016) for their TSP-based system. NoMovingDistance leads to a significant performance drop, empirically verifying that translations of nearby subgraphs are also close to each other. Finally, NoReorderModel does not affect the performance significantly, which may be because the most important reordering patterns are already covered by the hierarchical induced rules. Compared with TSP-gen and JAMR-gen, our final model All improves the BLEU score from 22.44 and 23.00, respectively, to 25.62, showing the advantage of our model. To our knowledge, this is the best result reported so far on this task.

Conclusion
We showed that synchronous node replacement grammar is useful for AMR-to-text generation: our system learns a synchronous NRG at training time, and at test time applies a graph transducer to collapse input AMR graphs and generate output strings according to the learned grammar. Our method outperforms the previous state-of-the-art, demonstrating the advantage of our graph-to-string rules.