Online Back-Parsing for AMR-to-Text Generation

AMR-to-text generation aims to recover a text containing the same meaning as an input AMR graph. Current research develops increasingly powerful graph encoders to better represent AMR graphs, with decoders based on standard language modeling being used to generate outputs. We propose a decoder that back predicts projected AMR graphs on the target sentence during text generation. As the result, our outputs can better preserve the input meaning than standard decoders. Experiments on two AMR benchmarks show the superiority of our model over the previous state-of-the-art system based on graph Transformer.


Introduction
Abstract meaning representation (AMR) (Banarescu et al., 2013) is a semantic graph representation that abstracts meaning away from a sentence. Figure 1 shows an AMR graph, where the nodes, such as "possible-01" and "police", represent concepts, and the edges, such as "ARG0" and "ARG1", indicate relations between the concepts they connect. The task of AMR-to-text generation (Konstas et al., 2017) aims to produce fluent sentences that convey consistent meaning with input AMR graphs. For example, taking the AMR in Figure 1 as input, a model can produce the sentence "The police could help the victim". AMR-to-text generation has been shown useful for many applications such as machine translation (Song et al., 2019) and summarization (Liu et al., 2015;Yasunaga et al., 2017;Liao et al., 2018;Hardy and Vlachos, 2018). In addition, AMR-totext generation can be a good test bed for general graph-to-sequence problems (Belz et al., 2011;Gardent et al., 2017).
AMR-to-text generation has attracted increasing research attention recently. Previous work has focused on developing effective encoders for representing graphs. In particular, graph neural networks (Beck et al., 2018;Song et al., 2018;Guo et al., 2019) and richer graph representations (Damonte and Cohen, 2019;Hajdik et al., 2019;Ribeiro et al., 2019) have been shown to give better performances than RNN-based models (Konstas et al., 2017) on linearized graphs. Subsequent work exploited graph Transformer (Zhu et al., 2019;Cai and Lam, 2020;Wang et al., 2020), achieving better performances by directly modeling the intercorrelations between distant node pairs with relation-aware global communication. Despite the progress on the encoder side, the current state-of-the-art models use a rather standard decoder: it functions as a language model, where each word is generated given only the previous words. As a result, one limitation of such decoders is that they tend to produce fluent sentences that may not retain the meaning of input AMRs.
We investigate enhancing AMR-to-text decoding by integrating online back-parsing, simultaneously predicting a projected AMR graph on the target sentence while it is being constructed. This is largely inspired by work on backtranslation (Sennrich et al., 2016;Tu et al., 2017), which shows that back predicting the source sentence given a target translation output can be useful for strengthening neural machine translation. We perform online back parsing, where the AMR graph structure is constructed through the autoregressive sentence construction process, thereby saving the need for training a separate AMR parser. By adding online back parsing to the decoder, structural information of the source graph can intuitively be better preserved in the decoder network. Figure 2 visualizes our structure-integrated decoding model when taking the AMR in Figure  1 as input. In particular, at each decoding step, the model predicts the current word together with its corresponding AMR node and outgoing edges to the previously generated words. The predicted word, AMR node and edges are then integrated as the input for the next decoding step. In this way, the decoder can benefit from both more informative loss via multi-task training and richer features taken as decoding inputs.
Experiments on two AMR benchmark datasets (LDC2015E86 and LDC2017T10 1 ) show that our model significantly outperforms a state-ofthe-art graph Transformer baseline by 1.8 and 2.5 BLEU points, respectively, demonstrating the advantage of structure-integrated decoding for AMR-to-text generation. Deep analysis and human evaluation also confirms the superiority of our model. Our code is available at https: //github.com/muyeby/AMR-Backparsing.

Baseline: Graph Transformer
Formally, the AMR-to-text generation task takes an AMR graph as input, which can be denoted as a directed acyclic graph G = (V, E), where V denotes the set of nodes and E refers to the set of labeled edges. An edge can further be represented by a triple v i , r k , v j , showing that node v i and v j are connected by relation type r k . Here k ∈ [1, ..., R], and R is the total number of relation types. The goal of AMR-to-text generation is to generate a word sequence y = [y 1 , y 2 , . . . , y M ], which conveys the same meaning as G.
We take a graph Transformer model (Koncel-Kedziorski et al., 2019;Zhu et al., 2019;Cai and Lam, 2020;Wang et al., 2020) as our baseline. Previous work has proposed several variations of graph-Transformer. We take the model of Zhu et al. (2019), which gives the state-of-the-art performance. This approach exploits a graph Transformer encoder for AMR encoding and a standard Transformer decoder for text generation.

Graph Transformer Encoder
The Graph Transformer Encoder is an extension of the standard Transformer encoder (Vaswani et al., 2017), which stacks L encoder layers, each having two sublayers: a self-attention layer and a position-wise feed forward layer. Given a set of AMR nodes [v 1 , v 2 , . . . , v N ], the l-th encoder layer takes the node features [h l−1 1 , h l−1 2 , . . . , h l−1 N ] from its preceding layer as input and produces a new set of features 1, . . . , L], and h 0 i represents the embedding of AMR node v i , which is randomly initialized.
The graph Transformer encoder extends the vanilla self-attention (SAN) mechanism by explicitly encoding the relation r k 2 between each AMR node pair (v i , v j ) in the graph. In particular, the relation-aware self-attention weights are obtained by: where W Q , W K , W R are model parameters, and γ k ∈ R dr is the embedding of relation r k , which is randomly initialized and optimized during training, d r is the dimension of relation embeddings.
With α ij , the output features are: where W V is a parameter matrix. Similar to the vanilla Transformer, a graph Transformer also uses multi-head self-attention, residual connection and layer normalization.

Standard Transformer Decoder
The graph Transformer decoder is identical to the vanilla Transformer (Vaswani et al., 2017). It consists of an embedding layer, multiple 2 Since the adjacency matrix is sparse, the graph Transformer encoder uses the shortest label path between two nodes to represent the relation (e.g. path (victim, police) = "↑ARG1 ↓ARG0", path (police, victim) = "↑ARG0 ↓ARG1").
In Eq 3, AN is a standard attention layer, which computes a set of attention scores β ti (i ∈ [1, . . . , N ]) and a context vector c t : where f is a scaled dot-product attention function. Denoting the output hidden state of the L-th decoder layer at time t as s L t , the generator layer predicted the probability of a target word y t as: where y <t = [y 1 , y 2 , . . . , y t−1 ], and W g is a model parameter.

Training Objective
The training objective of the baseline model is to minimize the negative log-likelihood of conditional word probabilities: where Θ denotes the full set of parameters.
3 Model with Back-Parsing Figure 2 illustrates the proposed model. We adopt the baseline graph encoder described in Section 2.1 for AMR encoding, while enhancing the baseline decoder (Section 2.2) with AMR graph prediction for better structure preservation.
In particular, we train the decoder to reconstruct the AMR graph (so called "back-parsing") by jointly predicting the corresponding AMR nodes and projected relations when generating a new word. In this way, we expect that the model can better memorize the AMR graph and generate more faithful outputs. In addition, our decoder is trained in an online manner, which uses the last node and edge predictions to better inform the generation of the next word. Specifically, the encoder hidden states are first calculated given an AMR graph. At each decoding time step, the proposed decoder takes the encoder states as inputs and generates a new word (as in Section 2.2), together with its corresponding AMR node (Section 3.1) and its outgoing edges (Section 3.2), These predictions are then used inputs to calculate the next state (Section 3.3).

Node Prediction
We first equip a standard decoder with the ability to make word-to-node alignments while generating target words. Making alignments can be formalized as a matching problem, which aims to find the most relevant AMR graph node for each target word. Inspired by previous work (Liu et al., 2016;Mi et al., 2016), we solve the matching problem by supervising the word-tonode attention scores given by the Transformer decoder. In order to deal with words without alignments, we introduce a NULL node v ∅ into the input AMR graph (as shown in Figure 2) and align such words to it. 3 More specifically, at each decoding step t, our Transformer decoder first calculates the top decoder layer word-to-node attention distribution β t = [β t0 , β t1 , ..., β tN ] (Eq 3 and Eq 4) after taking the encoder states together with the previously generated sequence y <t = [y 1 , y 2 , . . . , y t−1 ] (β t0 and h L 0 are the probability and encoder state for the NULL node v ∅ ). Then the probability of aligning the current decoder state to node v i is defined as: where ALI is the sub-network for finding the best aligned AMR node for a given decoder state.
Training. Supposing that the gold alignment (refer to Section 4.1) at time t isβ t , the training objective for node prediction is to minimize the loss defined as the distance between β t andβ t : where ∆ denotes a discrepancy criterion that can quantify the distance between β t andβ t . We take two common alternatives: (1) Mean Squared Error (MSE), and (2) Cross Entropy Loss (CE).

Edge Prediction
The edge prediction sub-task aims to preserve the node-to-node relations in an AMR graph during text generation. To this end, we project the edges of each input AMR graph onto the corresponding sentence according to their nodeto-word alignments, before training the decoder to generate the projected edges along with target words. For words without outgoing edges, we add a "self-loop" edge for consistency. Formally, at decoding step t, each relevant directed edge (or arc) with relation label r k starting from y t can be represented as y j , r k , y t , where j ≤ t, y j , y t and r k are called "arc to", "arc from", and "label" respectively. We modify the deep biaffine attention classifier (Dozat and Manning, 2016) to model these edges.
In particular, we factorize the probability for each labeled edge into the "arc" and "label" parts, computing both based on the current decoder hidden state and the states of all previous words. The "arc" score ψ arc tj ∈ R 1 , which measures whether or not a directed edge from y t to y j exists, is calculated as: Similarly, the "label" score ψ label tj ∈ R R , which is used to predict a label for potential word pair (y j , y t ), is given by: In Eq 9 and Eq 10, FF arc to , FF arc from , FF label to and FF label from are linear transformations. Biaff arc and Biaff label are biaffine transformations: where ⊕ denotes vector concatenation, U, W and b are model parameters. U is a (d × 1 × d) tensor for unlabeled classification (Eq 9) and a (d × R × d) tensor for labeled classification (Eq 10), where d is the hidden size. Defining p(y j |y t ) as ψ arc tj and p(r k |y j , y t ) as ψ label tj [k], the probability of a labeled edge y j , r k , y t is calculated by the chain rule: Training. The training objective for the edge prediction task is the negative log-likelihood over all projected edges E :

Next State Calculation
In addition to simple "one-way" AMR backparsing (as shown in Section 3.1 and 3.2), we also study integrating the previously predicted AMR nodes and outgoing edges as additional decoder inputs to help generate the next word.
In particular, for calculating the decoder hidden states [s 1 t+1 , s 2 t+1 , ..., s L t+1 ] at step t + 1, the input feature to our decoder is a triple y t , v t , e t instead of a single value y t , which the baseline has.
Here y t , v t and e t are vector representations of the predicted word, AMR node and edges at step t, respectively. More specifically, v t is a weighted sum of the top-layer encoder hidden , and coefficients are from the distribution of β t in Eq 7: where is the operation for scalar-tensor product.
Similarly, e t is calculated as: where ⊕ concatenates two tensors, p(r k , y j |y t ) and p(y j |y t ) are probabilities given in Eq 12, γ k is a relation embedding, and s L j is the decoder hidden state at step j. e t−1 is a vector concatenation of r t and s t , which are weighted sum of relation embeddings and weighted sum of previous decoder hidden states, respectively.
In contrast to the baseline in Eq 3, at time t+1, the hidden state of the first decoder layer is calculated as: where the definition of H L , SAN, AN, FF and [s 0 1 , . . . , s 0 t ] are the same as Eq 3. v 0 and e 0 (as shown in Figure 2) are defined as zero vectors. The hidden states of upper decoder layers ([s 2 t+1 , ..., s L t+1 ]) are updated in the same way as Eq 3.
Following previous work on syntactic text generation (Wu et al., 2017;, we use gold AMR nodes and outgoing edges as inputs for training, while we take automatic predictions for decoding.

Training Objective
The overall training objective is: where λ 1 and λ 2 are weighting hyper-parameters for node and label , respectively.

Experiments
We conduct experiments on two benchmark AMR-to-text generation datasets, including LDC2015E86 and LDC2017T10. These two datasets contain 16,833 and 36,521 training examples, respectively, and share a common set of 1,368 development and 1,371 test instances.

Experimental Settings
Data preprocessing. Following previous work (Song et al., 2018;Zhu et al., 2019), we take a standard simplifier (Konstas et al., 2017) to preprocess AMR graphs, adopting the Stanford tokenizer 4 and Subword Tool 5 to segment text into subword units. The node-to-word alignments are generated by ISI aligner (Pourdamghani et al., 2014). We then project the source AMR graph onto the target sentence according to such alignments.
For node prediction, the attention distributions are normalized, but the alignment scores generated by the ISI aligner are unnormalized hard 0/1 values. To enable cross entropy loss, we follow previous work (Mi et al., 2016) to normalize the gold-standard alignment scores. Hyperparameters. We choose the feature-based model 6 of Zhu et al. (2019) as our baseline (G-Trans-F-Ours). Also following their settings, both the encoder and decoder have 6 layers, with each layer having 8 attention heads. The sizes of hidden layers and word embeddings are 512, and the size of relation embedding is 64. The hidden size of the biaffine attention module is 512. We use Adam (Kingma and Ba, 2015) with a learning rate of 0.5 for optimization. Our models are trained for 500K steps on a single 2080Ti GPU. We tune these hyperparameters on the LDC2015E86 development set and use the selected values for testing 7 .

Model Evaluation.
We set the decoding beam size as 5 and take BLEU (Papineni et al., 2002) and Meteor (Banerjee and Lavie, 2005;Denkowski and Lavie, 2014) as automatic evaluation metrics. We also employ human evaluation to assess the semantic faithfulness and generation fluency of compared methods by randomly selecting 50 AMR graphs for comparison. Three people familiar with AMR are asked to score the generation quality with regard to three aspects -concept preservation rate, relation preservation rate and fluency (on a scale of [0,5]). Details about the criteria are: • Concept preservation rate assesses to what extent the concepts in input AMR graphs are involved in generated sentences.
• Relation preservation rate measures to what extent the relations in input AMR graphs exist in produced utterances. • Fluency evaluates whether the generated sentence is fluent and grammatically correct.
Recently, significant progress (Ribeiro et al., 2019;Zhang et al., 2020;Ç elikyilmaz et al., 2020) in developing new metrics for NLG evaluation has made. We leave evaluation on these metrics for future work. Table 1 shows the performances on the devset of LDC2015E86 under different model settings. For the node prediction task, it can be observed that both cross entropy loss (CE) and mean squared error loss (MSE) give significantly better results than the baseline, with 0.46 and 0.65 improvement in terms of BLEU, respectively. In addition, CE gives a better result than MSE.

Development Experiments
Regarding edge prediction, we investigate two settings, with relation embeddings being shared by the encoder and decoder, or being separately constructed, respectively. Both settings give large improvements over the baseline. Compared with the model using independent relation embeddings, the model with shared relation embeddings gives slightly better results with less parameters, indicating that the relations in an AMR graph and the relations between words are consistent. We therefore adopt the CE loss and shared relation embeddings for the remaining experiments. Figure 3 presents the BLEU scores of integrating standard AMR-to-text generation with node prediction or edge prediction under different λ 1 and λ 2 values, respectively.
There are improvements when increasing the coefficient from 0, demonstrating that both node prediction and edge prediction have positive influence on AMR-to-text generation. The BLEU of the two models reach peaks at λ 1 = 0.01 and λ 2 = 0.1, respectively. When further increasing the coefficients, the BLEU scores start to decrease. We thus set λ 1 = 0.01, λ 2 = 0.1 for the rest of our experiments. Table 2 shows the automatic evaluation results, where "G-Trans-F-Ours" and "Ours Back-Parsing" represent the baseline and our full model, respectively.

Automatic Evaluation
The top group of the table shows the previous state-of-the-art results on the LDC2015E86 and LDC2017T10 testsets. Our systems give significantly better results than the previous systems using different encoders, including LSTM (Konstas et al., 2017), graph gated neural network (GGNN; Beck et al., 2018), graph recurrent network (GRN; Song et al., 2018), densely connected graph convolutional network (DCGCN; Guo et al., 2019) and various graph transformers (G-Trans-F, G-Trans-SA, G-Trans-C, G-Trans-W). Our baseline also achieves better Model LDC15 LDC17 LSTM (Konstas et al., 2017) 22.00 -GGNN (Beck et al., 2018) -23.30 GRN (Song et al., 2018) 23.30 -DCGCN (Guo et al., 2019) 25.9 27.9 G-Trans-F (Zhu et al., 2019) 27.23 30.18 G-Trans-SA (Zhu et al., 2019) 29.66 31.54 G-Trans-C (Cai and Lam, 2020) 27.4 29.8 G-Trans-W (Wang et al., 2020) 25 with external data LSTM (20M) (Konstas et al., 2017) 33.8 -GRN (2M) (Song et al., 2018) 33.6 -G-Trans-W (2M) (Wang et al., 2020) 36.4 - BLEU scores than the corresponding models of Zhu et al. (2019). The main reason is that we train with more steps (500K vs 300K) and we do not prune low-frequency vocabulary items after applying BPE. Note that we do not compare our model with methods by using external data. Compared with our baseline (G-Trans-F-Ours), the proposed approach achieves significant (p < 0.01) improvements, giving BLEU scores of 31.48 and 34.19 on LDC2015E86 and LDC2017T10, respectively, which are to our knowledge the best reported results in the literature. In addition, the outputs of our model have 0.8 more words than the baseline on average. Since the BLEU metric tend to prefer shorter results, this confirm that our model indeed recovers more information.

Human Evaluation
As shown in Table 3, our model gives higher scores of concept preservation rate than the baseline on both datasets, with improvements of 3.6 and 3.3, respectively. In addition, the relation preservation rate of our model is also better than the baseline. This indicating that our model can preserve more concepts and relations than the baseline method, thanks to the back-parsing mechanism. With regard to the generation fluency, our model also gives better results than baseline. The main reason is that the relations between concepts such as subject-predicate relation and modified relation are helpful for generating fluency sentences.
Apart from that, we study discourse (Prasad et al., 2008) relations, which are essential for generating a good sentence with correct meaning. Specifically, we consider 4   common discourse relations ("Cause", "Contrast", "Condition", "Coordinating"). For each type of discourse, we randomly select 50 examples from the test set and ask 3 linguistic experts to calculate the discourse preservation accuracy by checking if the generated sentence preserves such information. Table 4 gives discourse preservation accuracy results of the baseline and our model, respectively. The baseline already performs well, which is likely because discourse information can somehow be captured through co-occurrence in each (AMR, sentence) pair. Nevertheless, our approach achieves better results, showing that our back-parsing mechanism is helpful for preserving discourse relations.

Analysis
Ablation We conduct ablation tests to study the contribution of each component to the proposed model. In particular, we evaluate models with only the node prediction loss (Node Prediction, Section 3.1) and the edge prediction loss (Edge Prediction, Section 3.2), respectively, and further investigate the effect of integrating node and edge information into the next state computation (Section 3.3) by comparing models without and with (Int.) such integration. Table 5 shows the BLEU and Meteor scores on the LDC2015E86 testset. Compared with the baseline, we observe a performance improvement of 0.34 BLEU by adding the node prediction loss only. When using the predicted AMR graph nodes as additional input for next state computation (i.e.,   Node Prediction (Int.)), the BLEU score increases from 30.49 to 30.72, and the Meteor score reaches 35.94, showing that the previously predicted nodes are beneficial for text generation. Such results are consistent with our expectation that predicting the corresponding AMR node can help the generation of correct content words (a.k.a. concepts).
Similarly, edge prediction also leads to performance boosts. In particular, integrating the predicted relations for next state computation (Edge Prediction (Int.)) gives an improvement of 0.92 BLEU over the baseline. Edge prediction results in larger improvements than node prediction, indicating that relation knowledge is more informative than word-to-node alignment.
In addition, combining the node prediction and edge prediction losses (Both Prediction) leads to better model performance, which indicates that node prediction and edge prediction have mutual benefit.
Integrating both node and edge predictions (Both Prediction (Int.)) further improves the system to 31.48 BLEU and 36.15 Meteor, respectively. Correlation between Prediction Accuracy and Model Performance We further investigate the influence of AMR-structure preservation on the performance of the main text generation task. Specifically, we first force our model to generate a gold sentence in order to calculate the accuracies  for node prediction and edge prediction. We then calculate the corresponding BLEU score for the sentence generated by our model on the same input AMR graph without forced decoding, before drawing correlation between the accuracies and the BLEU score. As shown in Figure 4(a) and 4(b) 8 , both node accuracy and edge accuracy have a strong positive correlation with the BLEU score, indicating that the more structural information is retained, the better the generated text is. We also evaluate the pearson (ρ) correlation coefficients between BLEU scores and node (edge) prediction accuracies. Results are given in Table 6. Both types of prediction accuracies have strong positive correlations with the final BLEU scores, and their combination yields further boost on the correlation coefficient, indicating the necessity of jointly predicting the nodes and edges. Performances VS AMR Graphs Sizes Figure 5 compares the BLEU scores of the baseline and our model on different AMR sizes. Our model is consistently better than the baseline for most length brackets, and the advantage is more obvious for large AMRs (size 51+).

Case Study
We provide two examples in Table 7 to help better understand the proposed model. Each example consists of an AMR graph, a reference sentence Figure 6: Visualization of word-to-node attention obtained from the baseline graph Transformer (left) and our model with node prediction loss (right).
(REF), the output of baseline model (Baseline) and the sentence generated by our method (Ours).
As shown in the first example, although the baseline model maintains the main idea of the original text, it fails to recognize the AMR graph nodes "local" and "problem". In contrast, our model successfully recovers these two nodes and generates a sentence which is more faithful to the reference. We attribute this improvement to node prediction. To verify this, we visualize the word-to-node attention scores of both approaches in Figure 6. As shown in the figure, the baseline model gives little attention to the AMR node "local" and "problem" during text generation. In contrast, our system gives a more accurate alignment to the relevant AMR nodes in decoding.
In the second example, the baseline model incorrectly positions the terms "doctor", "see" and "worse cases" while our approach generates a more natural sentence. This can be attributed to the edge prediction task, which can inform the decoder to preserve the relation that "doctor" is the subject of "see" and "worse cases" is the object.

Related Work
Early studies on AMR-to-text generation rely on statistical methods. Flanigan et al. (2016) convert input AMR graphs to trees by splitting re-entrances, before translating these trees into target sentences with a tree-to-string transducer; Pourdamghani et al. (2016) apply a phrase-based MT system on linearized AMRs; Song et al. (2017) design a synchronous node replacement grammar to parse input AMRs while generating target sentences. These approaches show comparable or better results than early neural (1) (o / obvious-01 :ARG1 (p / problem :ARG1-of (l / local-02)) :ARG1-of (c / cause-01 :ARG0 (l2 / lumpy :domain (d / dough :mod (c2 / cookie) :mod (t / this))))) REF: Obviously there are local problems because this cookie dough is lumpy . Baseline: It is obvious that these cookie dough were a lumpy . Ours: Obviously there is a local problem as this cookie dough is a lumpy .
Related work on NMT studies back-translation loss (Sennrich et al., 2016;Tu et al., 2017) by translating the target reference back into the source text (reconstruction), which can help retain more comprehensive input information. This is similar to our goal. Wiseman et al. (2017) extended the reconstruction loss of Tu et al. (2017) for table-to-text generation. We study a more challenging topic on how to retain the meaning of a complex graph structure rather than a sentence or a table. In addition, rather than reconstructing the input after the output is produced, we predict the input while the output is constructed, thereby allowing stronger information sharing.
Our work is also remotely related to previous work on string-to-tree neural machine translation (NMT) (Aharoni and Goldberg, 2017;Wu et al., 2017;, which aims at generating target sentences together with their syntactic trees.
One major difference is that their goal is producing grammatical outputs, while ours is preserving input structural information.

Conclusion
We investigated back-parsing for AMR-to-text generation by integrating the prediction of projected AMRs into sentence decoding. The resulting model benefits from both richer loss and more structual features during decoding. Experiments on two benchmarks show advantage of our model over a state-of-the-art baseline.   Our re-implementation and the proposed model are released at https://github.com/muyeby/ AMR-Backparsing.
The results are shown in Table 9. It can be observed that our approach achieves the best performance on both datasets regardless of the evaluation metrics. This observation is consistent with Table 2.