Line Graph Enhanced AMR-to-Text Generation with Mix-Order Graph Attention Networks

Efficient structure encoding for graphs with labeled edges is an important yet challenging point in many graph-based models. This work focuses on AMR-to-text generation – a graph-to-sequence task aiming to recover natural language from Abstract Meaning Representations (AMR). Existing graph-to-sequence approaches generally utilize graph neural networks as their encoders, which have two limitations: 1) the message propagation process in AMR graphs is only guided by the first-order adjacency information; 2) the relationships between labeled edges are not fully considered. In this work, we propose a novel graph encoding framework which can effectively explore the edge relations. We also adopt graph attention networks with higher-order neighborhood information to encode the rich structure in AMR graphs. Experimental results show that our approach obtains new state-of-the-art performance on English AMR benchmark datasets. The ablation analyses also demonstrate that both edge relations and higher-order information are beneficial to graph-to-sequence modeling.


Introduction
Abstract Meaning Representation (Banarescu et al., 2013) is a sentence-level semantic representation formalized as a rooted directed graph, where nodes are concepts and edges are semantic relations. Since AMR is a highly structured meaning representation, it can promote many semantics-related tasks such as machine translation (Song et al., 2019) and summarization (Liao et al., 2018). However, using AMR graphs can be challenging, since it is non-trivial to completely capture the rich structural information in graph-based data, especially when the graph has labeled edges. Generation from AMR aims to translate the AMR semantics into the surface form (natural language). It is a basic graph-to-sequence task that directly takes AMR as input. Figure 1 (left) gives a standard AMR graph and its corresponding surface form. Early works utilize the sequence-to-sequence framework by linearizing the entire graph (Konstas et al., 2017; Cao and Clark, 2019). Such a representation may lose useful structural information. In recent studies, graph neural networks (GNNs) have been in a dominant position on this task and achieved state-of-the-art performance (Beck et al., 2018; Song et al., 2018; Guo et al., 2019; Damonte and Cohen, 2019). However, in these GNN-based models, the representation of each concept node is only updated by the aggregated information from its neighbors, which leads to two limitations: 1) The interaction between indirectly connected nodes heavily relies on the number of stacked layers. When the graph size becomes larger, the dependencies between distant AMR concepts cannot be fully explored. 2) They only focus on modeling the relations between concepts while ignoring edge relations and their structures. Zhu et al. (2019) and Cai and Lam (2019) use the Transformer to model arbitrary concept pairs whether or not they are directly connected, but they still ignore the topological structure of the edges in the entire AMR graph.
* Kai Yu is the corresponding author.
To address the above limitations, we propose a novel graph-to-sequence model based on graph attention networks (Velickovic et al., 2018). We transform the edge labels into relation nodes and construct a new graph that directly reflects the edge relations. In graph theory, such a graph is called a Line Graph (Harary and Norman, 1960). As illustrated in Figure 1, we thus separate the original AMR graph into two sub-graphs without labeled edges -concept graph and relation graph. The two graphs describe the dependencies of AMR concepts and edges respectively, which is helpful in modeling these relationships (especially for edges). Our model takes these sub-graphs as inputs, and the communications between the two graphs are based on the attention mechanism. Furthermore, for both graphs, we mix the higher-order neighborhood information into the corresponding graph encoders in order to model the relationships between indirectly connected nodes.
Empirical study on two English benchmark datasets shows that our model reaches state-of-the-art performance with 30.58 and 32.46 BLEU scores on LDC2015E86 and LDC2017T10, respectively. In summary, our contributions include: • We propose a novel graph-to-sequence model, which is the first to use the line graph to model the relationships between AMR edges.
• We integrate higher-order neighborhood information into graph encoders to model the relationships between indirectly connected nodes.
• We demonstrate that both higher-order neighborhood information and edge relations are important to graph-to-sequence modeling.

Mix-Order Graph Attention Networks
In this section, we first introduce graph attention networks (GATs) and their mix-order extensions, which are the basis of our proposed model.

Graph Attention Networks
GAT is a special type of network that operates on graph-structured data with attention mechanisms. Given a graph G = (V, E), V is the set of nodes x_i and E is the set of edges, where e_{ij} denotes the edge from x_i to x_j. N(x_i) denotes the set of nodes directly connected to x_i, and N^+(x_i) is the set including x_i and all its direct neighbors. Each node x_i in the graph has an initial feature h_i^0 ∈ R^d, where d is the feature dimension. The representation of each node is iteratively updated by the graph attention operation. At the l-th step, each node x_i aggregates context information by attending over its neighbors and itself. The updated representation h_i^l is calculated as the weighted average of the connected nodes:

$$h_i^l = \sum_{j \in N^+(x_i)} \alpha_{ij} W^l h_j^{l-1}$$

where the attention coefficient α_{ij} is calculated as:

$$\alpha_{ij} = \frac{\exp\left(\sigma\left((W_{t1}^l h_i^{l-1})^\top W_{t2}^l h_j^{l-1}\right)\right)}{\sum_{k \in N^+(x_i)} \exp\left(\sigma\left((W_{t1}^l h_i^{l-1})^\top W_{t2}^l h_k^{l-1}\right)\right)}$$

where σ is a nonlinear activation function, e.g., ReLU, and W^l, W_{t1}^l, W_{t2}^l ∈ R^{d×d} are learnable parameters for projections. After L steps, each node finally has a context-aware representation h_i^L. In order to achieve a stable training process, we also employ a residual connection followed by layer normalization between two graph attention layers.
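As a concrete illustration, the update above can be sketched in NumPy. This is a minimal sketch of a single dot-product-style attention step restricted to N^+(x_i), not the authors' implementation; all function and variable names here are our own.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gat_layer(h, adj, W, Wt1, Wt2, act=lambda z: np.maximum(z, 0.0)):
    """One graph attention step.

    h:   (n, d) node features h^{l-1}
    adj: (n, n) 0/1 adjacency encoding N^+(x_i) (self-loops included)
    W, Wt1, Wt2: (d, d) learnable projection matrices
    """
    scores = act((h @ Wt1) @ (h @ Wt2).T)     # sigma of pairwise projected dot products
    scores = np.where(adj > 0, scores, -1e9)  # attend only within N^+(x_i)
    alpha = softmax(scores)                   # attention coefficients alpha_ij
    return alpha @ (h @ W)                    # weighted average of projected neighbors
```

A node with no neighbors other than itself simply reproduces its own projected feature, which matches the role of the self-loop in N^+(x_i).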

Mixing Higher Order Information
The relations between indirectly connected nodes are ignored in a traditional graph attention layer. Mix-Order GAT, however, can explore these relationships in a single-step operation by mixing higher-order neighborhood information. We first give some notations before describing the details of the Mix-Order GAT. We use R^K = {R^1, ..., R^K} to represent the neighborhood information from order 1 to order K. As illustrated in Figure 2, R^k(x_i) is the set of nodes reachable from x_i within k hops:

$$R^1(x_i) = N^+(x_i), \quad R^k(x_i) = \bigcup_{j \in R^{k-1}(x_i)} N^+(x_j)$$

The K-Mix GAT integrates the neighborhood information R^K. At the l-th update step, each x_i interacts with its reachable neighbors of different orders and calculates the attentive features independently. The representation h_i^l is updated by concatenating the features from different orders, i.e.,

$$h_i^l = \Big\Vert_{k=1}^{K} \sum_{j \in R^k(x_i)} \alpha_{ij}^k W_k^l h_j^{l-1}$$
where ∥ represents concatenation, α_{ij}^k are the attention weights in the k-th order, and W_k^l ∈ R^{d×d/K} are learnable weights for projections. We will use MixGAT(·) to denote the Mix-Order GAT layer in the following sections.
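A NumPy sketch of the K-Mix step under our reading: k-order reachability masks R^1..R^K are derived from powers of the adjacency matrix, one attention head runs per order with a (d, d/K) projection, and the per-order features are concatenated. Names and the shared-logit simplification are our own assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def reach_masks(adj, K):
    """Boolean masks for R^1..R^K: nodes reachable within k hops (self included)."""
    n = len(adj)
    A = (adj > 0).astype(int)
    hop = np.eye(n, dtype=int)        # walks of exactly k hops
    reach = np.eye(n, dtype=bool)     # reachable within k hops
    masks = []
    for _ in range(K):
        hop = hop @ A                 # extend all walks by one hop
        reach = reach | (hop > 0)
        masks.append(reach.copy())
    return masks

def mix_gat_layer(h, adj, Ws, Wt1, Wt2):
    """h: (n, d); Ws: list of K projections, each (d, d // K)."""
    masks = reach_masks(adj, len(Ws))
    raw = (h @ Wt1) @ (h @ Wt2).T                   # shared attention logits
    parts = []
    for mask, Wk in zip(masks, Ws):
        alpha = softmax(np.where(mask, raw, -1e9))  # order-k attention weights
        parts.append(alpha @ (h @ Wk))              # (n, d // K) order-k features
    return np.concatenate(parts, axis=-1)           # concatenation over orders: (n, d)
```

On a directed chain 0→1→2→3, node 2 is absent from R^1(x_0) but present in R^2(x_0), so a single K=2 layer already lets x_0 attend to it.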

Method
The architecture of our method is illustrated in Figure 3. As mentioned above, we separate the AMR graph into two sub-graphs without labeled edges. Our model follows the Encoder-Decoder architecture, where the encoder takes the two sub-graphs as inputs, and the decoder generates corresponding text from the encoded information. We first give some detailed explanations about the line graph and input representation.

Line Graph & Input Representation
The line graph of a graph G is another graph L(G) that represents the adjacencies between the edges of G. L(G) is defined as follows:
• Each node of L(G) represents an edge of G.
• Two nodes of L(G) are adjacent if and only if their corresponding edges share a common node in G.
For directed graphs, the directions are maintained in the corresponding line graphs. Redundant edges between two relation nodes are removed in the line graphs. Figure 4 provides several examples. In our model, we use the line graph to organize labeled edges and transform the original AMR graph into two sub-graphs. Given an AMR graph G_a, we construct the relation graph as G_e = L(G_a). As for the concept graph G_c, its topological structure is the same as that of G_a, but the edge labels are eliminated, i.e., G_c = (V_a, Ê_a), where Ê_a is the edge set without label information.
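To make the construction concrete, here is a small sketch of one directed reading of the line graph: relation node e1 is linked to e2 when e1's head coincides with e2's tail, so traversing e1 then e2 is a directed path in the original graph. The paper's full construction (Figure 4) also assigns directions for edges meeting in other configurations, which this simplified sketch does not cover; all names here are our own.

```python
def line_graph(edges):
    """edges: list of (tail, head, label) triples of a directed graph G.

    Returns the relation-node labels of (a simplified) L(G) and its arcs:
    an arc i -> j is added when edge i's head equals edge j's tail.
    """
    arcs = set()
    for i, (_, head_i, _) in enumerate(edges):
        for j, (tail_j, _, _) in enumerate(edges):
            if i != j and head_i == tail_j:
                arcs.add((i, j))
    return [lab for (_, _, lab) in edges], sorted(arcs)
```

For the chain a –ARG0→ b –ARG1→ c, the line graph contains the single arc ARG0 → ARG1.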
Both G_c and G_e have no labeled edges, so they can be efficiently encoded by Mix-Order GAT. We use R_c^K and R_e^K to denote the 1 ∼ K order neighborhood information of G_c and G_e. We represent each concept node x_i ∈ V_c with an initial embedding c_i^0 ∈ R^d, and each relation node y_i ∈ V_e with an initial embedding e_i^0 ∈ R^d. The sets of node embeddings are denoted as C^0 = {c_1^0, ..., c_m^0} and E^0 = {e_1^0, ..., e_n^0}, where m = |V_c| and n = |V_e| denote the numbers of concept nodes and relation nodes, respectively. Thus, the inputs of our system can be formulated as (G_c, R_c^K, C^0) and (G_e, R_e^K, E^0).

Figure 4: Examples of finding line graphs. In the left part, e1 and e2 have opposite directions, so each direction is maintained in the line graph. In the right part, e1 and e2 follow the same direction, so there is only one direction in the corresponding line graph.

Self Updating
The encoder of our system consists of N stacked graph encoding layers. As illustrated in Figure 3, each graph encoding layer has two parts: self-updating for each graph and masked cross attention.
We use C^{l-1} and E^{l-1} to denote the input node embeddings of the l-th encoding layer. The representations of the two graphs are updated independently by mix-order graph attention networks (MixGAT). At the l-th step (layer), we have:

$$C_{self}^l = \mathrm{MixGAT}(C^{l-1}, R_c^K), \quad E_{self}^l = \mathrm{MixGAT}(E^{l-1}, R_e^K)$$

where C_{self}^l and E_{self}^l are the updated representations according to the mix-order neighborhood information R_c^K and R_e^K. One thing to notice is that both G_c and G_e are directed graphs. This implies that information propagation in the graph proceeds in a top-down manner, following the pre-specified directions. However, unidirectional propagation loses the structural information of the reversed direction. To build communication in both directions, we employ the Dual Graph (Ribeiro et al., 2019). A dual graph has the same node representations but reversed edge directions compared to the original graph. For example, if edge A→B is in the original graph, it turns into B→A in the corresponding dual graph. Since dual graphs have the same node representations, we only need to change the neighborhood information. Denote Ḡ_c and Ḡ_e as the dual graphs of G_c and G_e, with R̄_c^K and R̄_e^K the corresponding neighborhood information. We have:

$$\bar{C}_{self}^l = \mathrm{MixGAT}(C^{l-1}, \bar{R}_c^K), \quad \bar{E}_{self}^l = \mathrm{MixGAT}(E^{l-1}, \bar{R}_e^K)$$

Since we have updated the node embeddings in two directions, the final representations of the independent graph updating process are the combination of the bi-directional embeddings, i.e.,

$$C_{self}^l = [C_{self}^l ; \bar{C}_{self}^l] W_{c1}^l, \quad E_{self}^l = [E_{self}^l ; \bar{E}_{self}^l] W_{e1}^l$$
where W_{c1}^l and W_{e1}^l ∈ R^{2d×d} are trainable matrices for projections, and C_{self}^l ∈ R^{m×d} and E_{self}^l ∈ R^{n×d} are the results of the self-updating process.
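The bidirectional self-updating step can be sketched generically: the same encoder step runs on the original adjacency and on its transpose (the dual graph), and the two results are concatenated and projected back to d dimensions. Here `update_fn` stands in for MixGAT; the names are our own and the sketch is not the authors' implementation.

```python
import numpy as np

def dual_update(h, adj, update_fn, W_comb):
    """h: (n, d) node features; adj: (n, n) directed adjacency;
    update_fn: any (h, adj) -> (n, d) graph encoder step (e.g., MixGAT);
    W_comb: (2d, d), playing the role of W_c1 / W_e1."""
    fwd = update_fn(h, adj)     # top-down propagation on the original graph
    bwd = update_fn(h, adj.T)   # bottom-up propagation on the dual graph
    return np.concatenate([fwd, bwd], axis=-1) @ W_comb
```

Because the dual graph only reverses edge directions, transposing the adjacency matrix is all that is needed; node features are shared between the two passes.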

Masked Cross Attention
Self-updating for G_c and G_e can model the relationships among AMR concepts and edges, respectively. However, it is also necessary to explore the dependencies between concept nodes and relation nodes. As a result, cross-graph communication between G_c and G_e is very important. From the structure of the original AMR graph, we can easily build alignments between G_c and G_e. A relation node y_i is directly aligned to a concept node x_j if x_j is the start-point/end-point of the edge corresponding to y_i. As illustrated in Figure 1, ARG0 is the edge between run-02 and he. As a result, node ARG0 in G_e is directly connected to run-02 and he in G_c.
We apply the attention mechanism to complete the interaction between the two graphs, and use M ∈ R^{n×m} to mask the attention weights of unaligned pairs between G_c and G_e. For each element m_{ij} in M, we let m_{ij} = 0 if y_i ∈ V_e is aligned to x_j ∈ V_c, and m_{ij} = −∞ otherwise. The masked cross attention is employed between the representation sets E_{self}^l and C_{self}^l, and the matrix of attention weights A^l is calculated as:

$$A^l = \mathrm{softmax}\left((E_{self}^l W_{a1}^l)(C_{self}^l W_{a2}^l)^\top + M\right)$$

where W_{a1}^l and W_{a2}^l ∈ R^{d×d} are learnable projection matrices. The weight scores of unaligned pairs are set to −∞ according to M. For nodes in E_{self}^l, the relevant representations from C_{self}^l are identified using A^l as:

$$E_{cross}^l = A^l C_{self}^l$$

where E_{cross}^l ∈ R^{n×d} is the masked weighted summation of C_{self}^l. The same calculation is performed for nodes in C_{self}^l to obtain C_{cross}^l. The final outputs of a graph encoding layer are the combination of the original embeddings and the context representations from the other graph. We also employ the outputs of the previous layer as residual inputs, i.e.,

$$C^l = \mathrm{FFN}([C_{self}^l ; C_{cross}^l]) + C^{l-1}, \quad E^l = \mathrm{FFN}([E_{self}^l ; E_{cross}^l]) + E^{l-1}$$
where FFN is a feed-forward network consisting of two linear transformations. After N stacked graph encoding layers, the two graphs G_c and G_e are finally encoded as C^N and E^N.
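A NumPy sketch of the masked cross attention under our reading of the equations. The mask uses a large negative constant in place of −∞, and the concept-side weights are re-normalized with their own softmax, which is one plausible reading of "the same calculation"; all names are our own.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_attention(E, C, Wa1, Wa2, aligned):
    """E: (n, d) relation representations; C: (m, d) concept representations;
    aligned: (n, m) boolean, True where relation y_i touches concept x_j."""
    M = np.where(aligned, 0.0, -1e9)          # mask m_ij: 0 for aligned, "-inf" otherwise
    scores = (E @ Wa1) @ (C @ Wa2).T + M      # (n, m) attention logits
    E_cross = softmax(scores) @ C             # relations gather aligned concept info
    C_cross = softmax(scores.T) @ E           # concepts gather aligned relation info
    return E_cross, C_cross
```

A relation node aligned to exactly one concept simply copies that concept's representation, since its masked softmax row is one-hot.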

Decoder
The decoder of our system is similar to the Transformer decoder. At each generation step, the representation of the output token is updated by multiple rounds of attention over the previously generated tokens and the encoder outputs. Note that the outputs of our graph encoder have two parts: the concept representations C^N and the relation representations E^N. For generation, the concept information is more important, since the concept graph directly contains the natural words. With the multi-step cross attention, C^N also carries abundant relation information. For simplicity, we only use C^N as the encoder output on the decoder side 2 .
To address the data sparsity issue in sequence generation, we employ the Byte Pair Encoding (BPE) (Sennrich et al., 2016) following the settings of Zhu et al. (2019). We split the word nodes in AMR graphs and reference sentences into subwords, and the decoder vocabulary is shared with the encoder for concept graphs.

Settings
Data and preprocessing We conduct our experiments with two benchmark datasets: LDC2015E86 and LDC2017T10. The two datasets contain 16833 and 36521 training samples, and they share a common development set with 1368 samples and a common test set with 1371 samples. We segment natural words in both AMR graphs and references into sub-words. As a result, a word node in AMR graphs may be divided into several sub-word nodes. We use a special edge subword to link the corresponding sub-word nodes. Then, for each AMR graph, we find its corresponding line graph and generate G_c and G_e respectively.

Training details For model parameters, the number of graph encoding layers is fixed to 6, and the representation dimension d is set to 512. We set the graph neighborhood order K = 1, 2, and 4 for both G_c and G_e. The Transformer decoder is based on OpenNMT (Klein et al., 2018), with 6 layers, 512 dimensions and 8 heads. We use Adam (Kingma and Ba, 2015) as our optimizer with β = (0.9, 0.98). The learning rate is varied over the course of training, similar to Vaswani et al. (2017):

$$lr = \gamma \cdot d^{-0.5} \cdot \min(t^{-0.5},\; t \cdot w^{-1.5})$$

where t denotes the accumulated training steps, and w indicates the number of warmup steps. We use w = 16000, and the coefficient γ is set to 0.75. As for batch size, we use 80 for LDC2015E86 and 120 for LDC2017T10.
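The schedule's exact form is not fully recoverable from the text, so the following sketch assumes the standard Transformer rule of Vaswani et al. (2017) scaled by the coefficient γ: linear warmup for w steps, then inverse-square-root decay. The function name is our own.

```python
def learning_rate(t, d=512, w=16000, gamma=0.75):
    """Learning rate at step t >= 1.

    d: model/representation dimension; w: warmup steps; gamma: scaling
    coefficient from the paper's settings. During warmup (t < w) the
    second term is smaller, giving linear growth; afterwards the rate
    decays as t^{-0.5}.
    """
    return gamma * d ** -0.5 * min(t ** -0.5, t * w ** -1.5)
```

The rate peaks exactly at t = w, where both arguments of min coincide at w^{-0.5}.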

Results
We compare our system with several baselines, including traditional sequence-to-sequence models, graph-to-sequence models with various graph encoders, and Transformer-based models. All models are trained on a single dataset without ensembling or additional unlabeled data. For performance evaluation, we use BLEU (Papineni et al., 2002) as our major metric. We also use Meteor (Banerjee and Lavie, 2005), which considers synonymy between predicted sentences and references.
The experimental results on the test sets of LDC2015E86 and LDC2017T10 are reported in Table 1. As we can see, sequence-based models perform the worst, since they lose useful structural information in graphs. Graph-based models get better results with varied graph encoders that capture the structural information in graphs. Transformer-based models reach the previous state of the art with a structure-aware self-attention approach that better models the relations between indirectly connected concepts. Compared with previous studies, our approach with K = 4 order neighborhood information reaches the best BLEU scores, improving over the state-of-the-art model (Zhu et al., 2019) by 0.92 on both datasets. A similar trend can be observed on the Meteor metric.

Analysis
As mentioned above, our system has two critical points: higher-order graph neighborhood information and relationships between AMR edges. To verify the effectiveness of these two settings, we conduct a series of ablation tests based on different characteristics of graphs.

Ablation Study on Neighborhood information
Higher-order neighborhood information includes the relationships between indirectly connected nodes. Table 2 shows the connectivity of the concept graphs under different orders. When K = 1, each node can directly reach 24.91% of the other nodes in the graph (LDC2015E86), and this grows to 41.67% when K = 4. As suggested in Table 1, if graph nodes only interact with their direct neighbors (K = 1), our model performs worse than previous Transformer-based models. However, significant improvement can be observed when we integrate higher-order neighborhood information. As K grows from 1 to 4, the BLEU score increases by 1.94 and 2.50 on LDC2015E86 and LDC2017T10, respectively.

Figure 5: BLEU variation between models with different orders K with respect to AMR graph size.
As mentioned above, if we only consider the first-order neighborhood, the dependencies between distant AMR concepts cannot be fully explored when the graph size becomes larger. To verify this hypothesis, we split the test set into different parts according to AMR graph size (i.e., the number of concepts). We evaluate our models with orders K = 4 and K = 1 on the different partitions. All models are trained on the LDC2015E86 set. Figure 5 shows the results. The model with K = 4 significantly outperforms the one with K = 1. Furthermore, we find that the performance gap between the two models increases as the graph gets bigger. As a result, higher-order neighborhood information does play an important role in graph-to-sequence generation, especially for larger AMR graphs.

Figure 6: BLEU variation between models with different K_e with respect to AMR graph size (left) and reentrancy numbers (right).

Ablation Study on Relationships of Labeled Edges
We are the first to consider the relationships between labeled edges in AMR graphs, by integrating the line graph (relation graph) G_e into our system. This section analyzes the effectiveness of this contribution in depth. In the previous settings, the graph neighborhood order K is the same for both G_c and G_e. To conduct the ablation test, we fix the neighborhood order K_c for G_c and vary the order K_e for the relation graph G_e. We set K_e = 0, 1, and 4, where K_e = 0 indicates that each relation node in G_e can only interact with itself. This means the dependencies between AMR edges are completely ignored, and the edge information is simply combined with the corresponding concepts. We report the results on both test sets in Table 3.

Table 3: Results of models with varied neighborhood orders of the relation graph G_e. BLEU scores significantly different from the best model are marked with * (p < 0.01), tested by bootstrap resampling (Koehn, 2004).
If we ignore the dependencies between AMR edges (K_e = 0), there is a significant performance degradation: declines of 1.69 and 1.38 BLEU on LDC2015E86 and LDC2017T10, respectively. The performance gets better when K_e > 0, which means the edge relations do bring benefits to graph encoding and sequence generation. When K_e = 4, the edge relations are fully explored across varied neighborhood orders, and the model reaches the best performance on both datasets. The performance test on different partitions of AMR graph size (Figure 6, left) also suggests that the relationships between edges are helpful when the graph becomes larger. We also study the effectiveness of edge relations when handling reentrancies. Reentrancies are nodes with multiple parents. Such structures are identified as very difficult aspects of AMR graphs (Damonte and Cohen, 2019). We believe the relation graph G_e is helpful in exploring different dependencies involving the same concept, which can benefit graphs containing more reentrancies. To test this hypothesis, we split the test set into different parts according to the number of reentrancies and evaluate our models with K_e = 4 and K_e = 0 on the different partitions. As shown in Figure 6 (right), the gap widens when the number of reentrancies grows to 5. Also, compared with graph size, edge relations are more important in handling graphs with reentrancies.

Case Study
To gain insight into the model performance, Table 4 provides a few examples. The reentrancies in the AMR graphs are marked in bold type.
In Example (a), two different nodes have the same concept compete-01, but they have different forms in the corresponding natural language. According to the reference, one is realized as "competitors" and the other as "competition". Our model with K_e = 0 fails to distinguish the difference and generates two occurrences of "competition" in the output. However, the model with K_e = 4 successfully recovers the word "competitors" from the context of the AMR graph.

AMR: (f / feel-02 :ARG0 (h / he) :ARG1 (p / person :quant (m / more) :ARG0-of (c / compete-01) :ARG1-of (n / new-01) :source (c2 / country :poss (w / we))) :ARG0-of (p2 / participate-01 :ARG1 (c3 / compete-01 :mod (t / this))))
Reference: he felt that , there were more new competitors from our country participating in this competition .
K_e = 0: he feels more competition from our country who participate in this competition .
K_e = 4: he feels that more new competitors from our country who participate in this competition .
In Example (b), the concept they has two parents with the same concept want. Though our model with K_e = 0 successfully finds that they is the subject of both want nodes, it fails to recognize the parallel relationship between the objects money and face, and regards face as a verb. In contrast, our model with K_e = 4 correctly identifies the parallel structure in the AMR graph and reconstructs the correct sentence.
In Example (c), we compare our best model with two baselines: GCNSEQ (Damonte and Cohen, 2019) and the Structural Transformer (denoted as ST-Transformer) of Zhu et al. (2019). The AMR graph in Example (c) has two reentrancies, which makes it more difficult to recover the corresponding sentence. As we can see, the traditional graph-based model GCNSEQ cannot predict the correct subject of the predicate can. The ST-Transformer uses the correct subject, but the recovered sentence is quite disfluent because of the redundant people. This over-generation problem is mainly caused by reentrancies (Beck et al., 2018). However, our model can effectively handle this problem and generates a proper sentence with the correct semantics.

Related Work
AMR-to-text generation is a typical graph-to-sequence task. Early research employed rule-based methods to deal with this problem. Flanigan et al. (2016) use a two-stage method, first splitting the graphs into spanning trees and then using multiple tree transducers to generate natural language. Song et al. (2017) use a heuristic extraction algorithm to learn graph-to-string rules. More works frame graph-to-sequence as a translation task and use either phrase-based (Ferreira et al., 2017; Pourdamghani et al., 2016) or neural-based (Konstas et al., 2017) models. These methods usually need to linearize the input graphs by means of a depth-first traversal. Cao and Clark (2019) obtain a better sequence-based model by leveraging additional syntactic information.
Moving to graph-to-sequence approaches, Marcheggiani and Perez-Beltrachini (2018) first show that graph neural networks can significantly improve generation performance by explicitly encoding the structure of the graph. Since then, models with various graph encoders have been proposed, such as graph LSTMs (Song et al., 2018), gated graph neural networks (GGNN) (Beck et al., 2018) and graph convolutional neural networks (Damonte and Cohen, 2019). Guo et al. (2019) introduce dense connectivity to allow information exchange across different layers. Ribeiro et al. (2019) learn dual representations capturing complementary top-down and bottom-up views of the graph, and reach the best performance among graph-based models.
Despite the great success of graph neural networks, they all restrict the update of node representations to the first-order neighborhood and rely on stacked layers to model the relationships between indirectly connected nodes. To solve this problem, recent studies extend the Transformer (Vaswani et al., 2017) to encode the graph structure. Zhu et al. (2019) and Cai and Lam (2019) use relation-aware self-attention to encode structural label sequences of concept pairs, which can model arbitrary concept pairs whether or not they are directly connected. With several mechanisms such as sub-words (Sennrich et al., 2016) and a shared vocabulary, Zhu et al. (2019) achieved state-of-the-art performance on this task. Our model follows the same spirit of exploring the relations between indirectly connected nodes, but our method is substantially different: (1) we use a graph-based method integrated with higher-order neighborhood information while keeping the explicit structure of graphs;
(2) we are the first to consider the relations between labeled edges by introducing line graphs.

Conclusion and Future Work
In this work, we presented a novel graph-to-sequence approach which uses the line graph to model the relationships between the labeled edges of the original AMR graph. The mix-order graph attention networks are found to be effective when handling indirectly connected nodes. The ablation studies also demonstrate that exploring edge relations brings benefits to graph-to-sequence modeling. Furthermore, our framework can be efficiently applied to other graph-to-sequence tasks such as WebNLG (Gardent et al., 2017) and syntax-based neural machine translation (Bastings et al., 2017). In future work, we would like to conduct experiments on other related tasks to test the versatility of our framework. We also plan to use large-scale unlabeled data to further improve performance.