Lightweight, Dynamic Graph Convolutional Networks for AMR-to-Text Generation

AMR-to-text generation aims to transduce Abstract Meaning Representation (AMR) structures into text. A key challenge in this task is to efficiently learn effective graph representations. Previously, Graph Convolutional Networks (GCNs) were used to encode input AMRs; however, vanilla GCNs follow a local (first-order) information aggregation scheme and are therefore unable to capture non-local information. To account for this, larger and deeper GCN models are required to capture more complex interactions. In this paper, we introduce a dynamic fusion mechanism, proposing Lightweight Dynamic Graph Convolutional Networks (LDGCNs) that capture richer non-local interactions by synthesizing higher-order information from the input graphs. We further develop two novel parameter saving strategies, based on group graph convolutions and weight tied convolutions, to reduce memory usage and model complexity. With the help of these strategies, we are able to train a model with fewer parameters while maintaining the model capacity. Experiments demonstrate that LDGCNs outperform state-of-the-art models on two benchmark datasets for AMR-to-text generation with significantly fewer parameters.


Introduction
Graph structures play a pivotal role in NLP because they are able to capture particularly rich structural information. For example, Figure 1 shows a directed, labeled Abstract Meaning Representation (AMR; Banarescu et al. 2013) graph, where each node denotes a semantic concept and each edge denotes a relation between such concepts. Within the realm of work on AMR, we focus in this paper on the problem of AMR-to-text generation, i.e., transducing AMR graphs into text that conveys the information in the AMR structure. A key challenge in this task is to efficiently learn useful representations of the AMR graphs. Early efforts (Pourdamghani et al., 2016; Konstas et al., 2017) neglect a significant part of the structural information in the input graph by linearizing it. Recently, Graph Neural Networks (GNNs) have been explored to better encode structural information for this task (Beck et al., 2018; Song et al., 2018; Damonte and Cohen, 2019; Ribeiro et al., 2019).

* Equally Contributed. Work done while Yan Zhang was an intern at DAMO Academy, Alibaba Group and Zhijiang Guo was at the University of Edinburgh. † Corresponding author.

[Figure 1: The concept (join-01) in vanilla GCNs only captures information from its immediate neighbors (first-order), while in LDGCNs it can integrate information from neighbors of different orders (e.g., second-order and third-order). In SANs, the node collects information from all other nodes, while in structured SANs it is aware of its connected nodes in the original graph.]
One type of such GNNs is Graph Convolutional Networks (GCNs; Kipf and Welling 2017). GCNs follow a local information aggregation scheme, iteratively updating the representations of nodes based on their immediate (first-order) neighbors. Intuitively, stacking more convolutional layers in GCNs helps capture more complex interactions (Xu et al., 2018; Guo et al., 2019b). However, prior efforts (Zhu et al., 2019; Cai and Lam, 2020; Wang et al., 2020) have shown that the locality property of existing GCNs precludes efficient non-local information propagation. Abu-El-Haija et al. (2019) further proved that vanilla GCNs are unable to capture feature differences among neighbors of different orders, no matter how many layers are stacked. Therefore, Self-Attention Networks (SANs; Vaswani et al. 2017) have been explored as an alternative to capture global dependencies. As shown in Figure 1 (c), SANs associate each node with all other nodes, modeling interactions between any two nodes in the graph. Still, this approach ignores the structure of the original graph. Zhu et al. (2019) and Cai and Lam (2020) therefore propose structured SANs that incorporate additional neural components to encode the structural information of the input graph.
Convolutional operations, however, are more computationally efficient than self-attention operations: the computation of attention weights scales quadratically with the input length, while convolutions scale linearly (Wu et al., 2019). It is therefore worthwhile to explore models based on graph convolutions. One potential approach is to incorporate information from higher-order neighbors, which facilitates non-local information aggregation for node classification (Abu-El-Haija et al., 2018; Morris et al., 2019). However, simple concatenation of different-order representations may not be able to model the complex semantic interactions required for text generation (Luan et al., 2019).
To better integrate higher-order information, we introduce a novel dynamic fusion mechanism and propose Lightweight, Dynamic Graph Convolutional Networks (LDGCNs). As shown in Figure 1 (b), nodes in the LDGCN model are able to integrate information from first- to third-order neighbors. With the help of the dynamic mechanism, LDGCNs can effectively synthesize information from different orders to model complex interactions in the AMR graph for text generation. Moreover, LDGCNs require no additional computational overhead compared with vanilla GCN models. We further develop two novel weight sharing strategies based on group graph convolutions and weight tied convolutions. These strategies allow the LDGCN model to reduce memory usage and model complexity.
Experiments on AMR-to-text generation show that LDGCNs outperform the best reported GCNs and SANs trained on LDC2015E86 and LDC2017T10 with significantly fewer parameters. In the large-scale semi-supervised setting, our model is also consistently better than others, showing its effectiveness on a large training set. We release our code and pretrained models at https://github.com/yanzhang92/LDGCNs.

Background
Graph Convolutional Networks Our LDGCN model is closely related to GCNs (Kipf and Welling, 2017), which restrict filters to operate on a first-order neighborhood. Given an AMR graph G with n concepts (nodes), GCNs associate each concept v with a feature vector $h_v \in \mathbb{R}^d$, where d is the feature dimension. G can be represented by concatenating the features of all concepts, i.e., $H=[h_{v_1}, \ldots, h_{v_n}]$. Graph convolutions at the l-th layer are defined as:

$$H^{l+1} = \phi\left(A H^{l} W^{l} + b^{l}\right),$$

where $H^l$ denotes the hidden representations of the l-th layer, $W^l$ and $b^l$ are trainable model parameters for the l-th layer, and $\phi$ is an activation function. $A$ is the adjacency matrix, with $A_{uv}=1$ if there exists a relation (edge) that goes from concept u to concept v.
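As a concrete illustration, a single GCN layer of this form can be sketched in a few lines of numpy. The toy graph, dimensions, and self-loop convention below are illustrative assumptions of the sketch, not the paper's exact setup:

```python
import numpy as np

def gcn_layer(A, H, W, b):
    """One vanilla GCN layer: H' = phi(A H W + b), with phi = ReLU.

    A: (n, n) adjacency matrix of the AMR graph
    H: (n, d) node (concept) features
    W: (d, d_out) trainable weights, b: (d_out,) bias
    """
    return np.maximum(0, A @ H @ W + b)

# Toy 3-node graph 0 -> 1 -> 2, with self-loops so each node keeps its own features.
A = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
H = np.eye(3)             # one-hot initial features, d = 3
W = np.full((3, 2), 0.5)  # d_out = 2
b = np.zeros(2)
print(gcn_layer(A, H, W, b).shape)  # (3, 2)
```

Because `A` only connects first-order neighbors, node 2 receives no information from node 0 after one layer, which is exactly the locality limitation discussed above.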
Self-Attention Networks Unlike GCNs, SANs (Vaswani et al., 2017) capture global interactions by connecting each concept to all other concepts. Intuitively, the attention matrix can be treated as the adjacency matrix of a fully-connected graph. Formally, SANs take a sequence of representations of n nodes $H=[h_{v_1}, \ldots, h_{v_n}]$ as the input. The attention score $A_{uv}$ between the concept pair (u, v) is:

$$A_{uv} = \mathrm{softmax}\left(\frac{(h_u W^Q)(h_v W^K)^{\top}}{\sqrt{d}}\right),$$

where $W^Q$ and $W^K$ are projection parameters. The adjacency matrix A in GCNs is given by the input AMR graph, while in SANs A is computed based on H, which neglects the structural information of the input AMR. Moreover, the number of operations required by graph convolutions scales linearly in the input length, whereas it is quadratic for SANs.

[Figure 2: Comparison between vanilla GCNs and LDGCNs. $H^l$ denotes the representation of the l-th layer, $W^l$ denotes the trainable weights, and × denotes matrix multiplication. Vanilla GCNs take the first-order adjacency matrix $A^1$ as the input, which only captures information from one-hop neighbors. LDGCNs take K k-order adjacency matrices $A^k$ as inputs, and $W^l$ is shared for all $A^k$ (k is set to 2 here for simplification). A dynamic fusion mechanism is applied to integrate the information from 1- to k-hop neighbors.]

Structured SANs

Zhu et al. (2019) and Cai and Lam (2020) extend SANs by incorporating the relation $r_{uv}$ between node u and node v in the original graph, such that the model is aware of the input structure when computing attention scores:

$$A_{uv} = \mathrm{softmax}\left(\frac{(h_u W^Q)(h_v W^K + r_{uv})^{\top}}{\sqrt{d}}\right),$$

where $r_{uv}$ is obtained from the shortest relation path between the concept pair (u, v) in the graph. For example, the shortest relation path between (join-01, this) in Figure 1 (d) is [ARG1, mod]. Formally, the path between concepts u and v is represented as $s_{uv}=[e(u, k_1), e(k_1, k_2), \ldots, e(k_m, v)]$, where e indicates the relation label between two concepts and $k_{1:m}$ are the relay nodes. We have $r_{uv} = f(s_{uv})$, where f is a sequence encoder, which can be implemented with gated recurrent units (GRUs) or convolutional neural networks (CNNs).
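The structured score computation can be sketched with plain numpy, assuming the additive relation-bias form $(h_u W^Q)(h_v W^K + r_{uv})^{\top}/\sqrt{d}$; the exact parameterization in Zhu et al. (2019) and Cai and Lam (2020) differs in details, and the relation encodings `R` are assumed to be precomputed by the path encoder f:

```python
import numpy as np

def structured_attention(H, Wq, Wk, R):
    """Relation-aware attention scores for one head (a sketch of structured SANs).

    H: (n, d) node states; Wq, Wk: (d, d) projection parameters
    R: (n, n, d) relation encodings r_uv, assumed precomputed by a
       GRU/CNN encoder over the shortest relation path between (u, v).
    Score(u, v) = (h_u Wq) . (h_v Wk + r_uv) / sqrt(d), softmax over v.
    """
    n, d = H.shape
    Q, K = H @ Wq, H @ Wk
    # Add the relation bias to the key of every (u, v) pair, then take dot products.
    scores = np.einsum('ud,uvd->uv', Q, K[None, :, :] + R) / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)
```

With `R` set to zeros this reduces to vanilla self-attention, which makes the structural bias easy to isolate in experiments.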

Dynamic Fusion Mechanism
As discussed in Section 2, GCNs are generally more computationally efficient than structured SANs, as their computation cost scales linearly with the input length and no additional relation encoders are required. However, the local nature of GCNs precludes efficient non-local information propagation. To address this issue, we propose the dynamic fusion mechanism (DFM), which integrates higher-order information for better non-local information aggregation. With the help of this mechanism, our model, based solely on graph convolutions, is able to outperform competitive structured SANs. Inspired by Gated Linear Units (GLUs; Dauphin et al. 2016), which leverage gating mechanisms (Hochreiter and Schmidhuber, 1997) to dynamically control information flows in convolutional neural networks, DFM integrates information from different orders, allowing the model to automatically synthesize information from neighbors at varying numbers of hops away. Similar to GLUs, DFM retains the non-linear capabilities of the layer while allowing the gradient to propagate through the linear unit without scaling. Based on this non-linear mixture procedure, DFM is able to control the information flows from a range of orders to specific nodes in the AMR graph. Formally, graph convolutions based on DFM are defined as:

$$H^{l+1} = \sum_{k=1}^{K} G^{k} \odot \left(A^{k} H^{l} W^{l}\right),$$

where $G^{k}$ is a gating matrix conditioned on the k-th order adjacency matrix $A^k$, namely:

$$G^{k} = \lambda\, \sigma\!\left(A^{k} H^{l} W^{l}\right) + (1-\lambda),$$

where $\odot$ denotes the elementwise product, $\sigma$ denotes the sigmoid function, $\lambda \in (0, 1)$ is a scalar, $K \geq 2$ is the highest order used for information aggregation, and $W^l$ denotes trainable weights shared by all $A^k$. Both λ and K are hyperparameters.
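A minimal numpy sketch of the dynamic fusion idea: each order k contributes a term $A^k H W$ gated elementwise, with a single shared weight matrix. The specific gate form $\lambda\sigma(\cdot)+(1-\lambda)$ used below is an assumption of this sketch, not necessarily the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfm_layer(A, H, W, K=3, lam=0.7):
    """Dynamic fusion over orders 1..K with a single shared weight matrix W.

    Each order contributes A^k H W, gated elementwise. A^k H is computed
    right-to-left (A @ (A @ ... @ H)), so A^k is never materialized.
    The gate form lam * sigmoid(Z) + (1 - lam) is an assumption of this sketch.
    """
    out = np.zeros((H.shape[0], W.shape[1]))
    P = H                                # will hold A^k H after k updates
    for _ in range(K):
        P = A @ P                        # propagate one more hop: A^k H
        Z = P @ W                        # linear term for order k (shared W)
        G = lam * sigmoid(Z) + (1 - lam) # gate conditioned on the k-th order term
        out += G * Z
    return out
```

Note that setting `lam=0` opens every gate fully, so the layer degenerates to an ungated sum of all k-order terms; the gate is what lets each node weight orders differently.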
Computational Overhead In practice, there is no need to calculate or store $A^k$: $A^k H^l$ is computed with right-to-left multiplication. Specifically, if k=3, we calculate $A^3 H^l$ as $A(A(AH^l))$. Since we store A as a sparse matrix with m nonzero entries, as in vanilla GCNs, an efficient implementation of our layer takes $O(k_{max} \times m \times d)$ computational time, where $k_{max}$ is the highest order used and d is the feature dimension of $H^l$. Under the realistic assumptions $k_{max} \ll m$ and $d \ll m$, running an l-layer model takes O(lm) computational time, which matches the computational complexity of vanilla GCNs. On the other hand, DFM does not require additional parameters, as the weight matrix is shared over the various orders.
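The right-to-left evaluation order relies only on the associativity of matrix multiplication, so it gives exactly the same result as materializing $A^3$. A quick numpy check (dense arrays here for brevity; in the actual implementation A would be stored as a sparse matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 16
A = (rng.random((n, n)) < 0.1).astype(float)  # a sparse-ish random adjacency
H = rng.standard_normal((n, d))

# Materializing A^3 first costs O(n^3) time and O(n^2) memory...
left = np.linalg.matrix_power(A, 3) @ H
# ...while right-to-left evaluation needs only three (sparse) matrix products.
right = A @ (A @ (A @ H))
print(np.allclose(left, right))  # True
```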
Deeper LDGCNs To further facilitate non-local information aggregation, we stack several LDGCN layers. In order to stabilize training, we introduce dense connections (Huang et al., 2017; Guo et al., 2019b) into the LDGCN model. Mathematically, we define the input $\hat{H}^{l}$ of the l-th layer as the concatenation of all node representations produced in layers $1, \cdots, l-1$:

$$\hat{H}^{l} = [H^{1}; \ldots; H^{l-1}],$$

where $W^l \in \mathbb{R}^{d_l \times d}$ and $d_l = d \times (l-1)$. Accordingly, $H^l$ in the DFM convolution above is replaced by $\hat{H}^l$. The model size scales linearly as we increase the depth of the network.
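Dense connectivity can be sketched as follows. The ReLU activation and toy dimensions are assumptions of this sketch; in the full model each layer would be a DFM convolution rather than a plain one:

```python
import numpy as np

def dense_stack(A, H0, Ws):
    """Densely connected graph convolutions: layer l consumes the
    concatenation of all previously produced representations.

    H0: (n, d) initial features; Ws[i]: (d * (i + 1), d), so every layer
    maps its growing concatenated input back to dimension d.
    """
    outs = [H0]
    for W in Ws:
        H_hat = np.concatenate(outs, axis=1)       # (n, d * len(outs))
        outs.append(np.maximum(0, A @ H_hat @ W))  # this layer's output
    return outs[-1]
```

Since `Ws[i]` has `d * (i + 1)` input rows, the parameter count grows linearly with depth, matching the observation that the model size scales linearly.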

Parameter Saving Strategies
Although we are able to train a very deep LDGCN model, its size increases sharply as we stack more layers, resulting in high model complexity. To maintain a better balance between parameter efficiency and model capacity, we develop two novel parameter saving strategies. We first reduce part of the parameters in each layer based on group graph convolutions, and then share parameters across all layers based on weight tied convolutions. These strategies allow the LDGCN model to reduce memory usage and model complexity.

Group Graph Convolutions
Group convolutions have been used to build efficient networks for various computer vision tasks, as they can better integrate feature maps (Xie et al., 2017; Li et al., 2019b) and have lower computational costs (Howard et al., 2017) compared to vanilla convolutions. In order to reduce the complexity of the deep LDGCN model, we extend group convolutions to GCNs along two directions: depthwise and layerwise.
Depthwise Graph Convolutions: As discussed in Section 2, graph convolutions operate on the features of n nodes $H \in \mathbb{R}^{n \times d}$. For simplicity, assume n=1, so that the input and output representations of the l-th layer are $h^l \in \mathbb{R}^{d_l}$ and $h^{l+1} \in \mathbb{R}^{d_{l+1}}$, respectively. As shown in Figure 3, the weight matrix $W^l$ in a vanilla graph convolution has size $d_l \times d_{l+1}$. In depthwise graph convolutions, we instead split the input $h^l$ into N groups $\{g^l_1, \ldots, g^l_N\}$, each of dimension $\frac{d_l}{N}$, and convolve each group with its own weight matrix of size $\frac{d_l}{N} \times \frac{d_{l+1}}{N}$. Finally, we obtain the output representation $h^{l+1}$ by concatenating the N groups of outputs $[g^{l+1}_1; \ldots; g^{l+1}_N]$. The parameters of each layer are thus reduced by a factor of N, to $\frac{d_l \times d_{l+1}}{N}$.

Layerwise Graph Convolutions: These group convolutions are built on densely connected graph convolutions (Guo et al., 2019b). As shown in Figure 4, each layer takes the concatenation of the outputs of all preceding layers as its input. For example, layer $L_2$ takes the concatenation $[h_0; h_1]$ as its input. Guo et al. (2019b) further adopt a dimension shrinkage strategy: assume $h_0 \in \mathbb{R}^d$ and that the network has L layers; the output dimension of each layer is then set to $\frac{d}{L}$. Finally, the outputs of the L layers $[h_1; \ldots; h_L]$ are concatenated to form the final representation $h_{final} \in \mathbb{R}^d$. The size of the weight matrix for the l-th layer is therefore $\left(d + \frac{(l-1) \times d}{L}\right) \times \frac{d}{L}$. Notice that the main computation cost originates in the processing of $h_0$, as it has a large dimension and is concatenated to the input of each layer. In layerwise graph convolutions, we improve parameter efficiency by dividing the input representation $h_0$ into M groups $\{g^0_1, \ldots, g^0_M\}$, where M equals the total number of layers L. The first group $g^0_1$ is fed to all L layers, the second group $g^0_2$ to (L-1) layers, and so on. Accordingly, the size of the weight matrix for the l-th layer becomes $\frac{d \times (2l-1)}{L} \times \frac{d}{L}$. Formally, we partition the input representations of the n concepts $H^0 \in \mathbb{R}^{n \times d}$ to the first layer into M groups $\{G^0_1, \ldots, G^0_M\}$, where each group has size $n \times \frac{d}{M}$, and modify the input $\hat{H}^l$ of the l-th layer accordingly.

In practice, we combine these two convolutions to further reduce the model size. For example, assume the input size is d=360 and the number of layers is L=6, so that the weight matrix for the first layer (l=1) of the densely connected baseline has size $360 \times 60$. Assume we set N=3 for depthwise graph convolutions and M=6 for layerwise graph convolutions. We first apply layerwise graph convolutions by dividing the input into 6 groups, each of size $\frac{d}{M}=60$, and feed the first group to the first layer. Next, we use depthwise graph convolutions to further split the input into 3 groups. We now have 3 weight matrices for the first layer, each of size $\frac{d \times (2l-1)}{M \times N} \times \frac{d}{M \times N} = 20 \times 20$. As the feature dimension d and the number of layers L increase, the parameter savings become more prominent.
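The worked example can be checked with a few lines of arithmetic. Treating the densely connected first layer (input d, output d/L) as the baseline is our reading of the shrinkage strategy of Guo et al. (2019b):

```python
# Parameter counts for the first layer (l = 1) in the worked example.
d, L, M, N = 360, 6, 6, 3
l = 1

# Densely connected baseline: for l = 1 the input is d and the output is d / L.
vanilla = d * (d // L)                                          # 360 * 60

# Layerwise grouping: input d * (2l - 1) / M, output d / M.
layerwise = (d * (2 * l - 1) // M) * (d // M)                   # 60 * 60

# Adding depthwise grouping: N matrices of size
# (d * (2l - 1) / (M * N)) x (d / (M * N)) = 20 x 20.
depthwise = N * (d * (2 * l - 1) // (M * N)) * (d // (M * N))   # 3 * 20 * 20
print(vanilla, layerwise, depthwise)  # 21600 3600 1200
```

So the combined strategies cut the first layer from 21,600 parameters to 1,200, an 18x reduction, and the gap widens with larger d and L.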

Weight Tied Convolutions
We further adopt a more aggressive strategy in which parameters are shared across all layers, which reduces the model size still further. Theoretically, weight tied networks can be unrolled to any depth, typically with improved feature abstractions as depth increases (Bai et al., 2019a). Recently, weight tied SANs were explored to regularize training and help generalization (Dehghani et al., 2019; Lan et al., 2020). Mathematically, Eq. 1 can be rewritten as:

$$H^{l+1} = \phi\left(A H^{l} W + b\right),$$

where W and b are shared parameters for all convolutional layers. To stabilize training, gating mechanisms have previously been introduced into graph neural networks to build graph recurrent networks (Li et al., 2016; Song et al., 2018), where parameters are shared across states (time steps). However, our graph convolutional structure is very deep (e.g., 36 layers). Instead, we adopt a jumping connection (Xu et al., 2018), which forms the final representation $H_{final}$ based on the outputs of all layers. This connection mechanism can be considered a form of deep supervision (Lee et al., 2015; Bai et al., 2019b). We use byte pair encoding (Sennrich et al., 2016) to deal with rare words.
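A weight tied stack with a jumping connection can be sketched as follows. Elementwise max is one of the aggregators proposed by Xu et al. (2018); which aggregator LDGCN uses is not stated here, so treat that choice, and the plain-GCN layer body, as assumptions of the sketch:

```python
import numpy as np

def weight_tied_gcn(A, H, W, b, L=36):
    """Unroll L graph-convolution layers that all share one (W, b).

    A jumping connection aggregates the outputs of every layer (here by
    elementwise max) into the final representation H_final.
    """
    outs = []
    for _ in range(L):
        H = np.maximum(0, A @ H @ W + b)  # same parameters at every depth
        outs.append(H)
    return np.max(np.stack(outs), axis=0)  # H_final: (n, d)
```

Because `W` and `b` are reused at every depth, the parameter count is independent of L, which is what makes 36-layer unrolling affordable.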
Following Guo et al. (2019b), we stack 4 LDGCN blocks as the encoder of our model. Each block consists of two sub-blocks, where the bottom one contains 6 layers and the top one contains 3 layers. The hidden dimension of the LDGCN model is 480. Other model hyperparameters are set as λ=0.7 and K=2 for the dynamic fusion mechanism, N=2 for depthwise graph convolutions, and M=6 and 3 for layerwise graph convolutions in the bottom and top sub-blocks, respectively. For the decoder, we employ the same attention-based LSTM as in previous work (Beck et al., 2018; Guo et al., 2019b; Damonte and Cohen, 2019). Following Wang et al. (2020), we use a Transformer decoder for the large-scale evaluation. For fair comparison, we use the same optimization and regularization strategies as Guo et al. (2019b). All hyperparameters are tuned on the development set.
For evaluation, we report BLEU (Papineni et al., 2002), CHRF++ (Popović, 2017), and METEOR scores.

[Table 1: Main results on AMR-to-text generation. B, C, M and #P denote BLEU, CHRF++, METEOR and the model size in terms of parameters, respectively. Results with ‡ are obtained from the authors. We also conduct statistical significance tests following Zhu et al. (2019). All our proposed systems are significant over the baseline at p < 0.01, tested by bootstrap resampling (Koehn, 2004).]

Main Results
We consider two kinds of baseline models: 1) models based on Recurrent Neural Networks (Konstas et al., 2017; Cao and Clark, 2019) and Graph Neural Networks (GNNs) (Song et al., 2018; Beck et al., 2018; Damonte and Cohen, 2019; Guo et al., 2019b; Ribeiro et al., 2019), which use an attention-based LSTM decoder; and 2) models based on SANs (Zhu et al., 2019) and structured SANs (Cai and Lam, 2020; Zhu et al., 2019; Wang et al., 2020).

We also evaluate our model on the latest AMR3.0 dataset. Results are shown in Table 3. LDGCN WT and LDGCN GC consistently outperform GNN-based models, including DCGCN and GGNNs, on this larger dataset. These results suggest that LDGCNs can learn better representations more efficiently.

Large-scale Evaluation. We further evaluate LDGCNs on a large-scale dataset. Following Wang et al. (2020), we first use the additional data to pretrain the model, then finetune it on the gold data. Evaluation results are reported in Table 2. Using 0.5M additional data points, LDGCN WT outperforms all models, including structured SANs with 2M additional data points. These results show that our model makes more effective use of a larger dataset. Interestingly, LDGCN WT consistently outperforms LDGCN GC under this setting. Unlike on AMR1.0, training LDGCN WT on the large-scale dataset exhibits fewer oscillations, which confirms our hypothesis that sufficient data acts as a regularizer that stabilizes the training of weight tied models.

Development Experiments
We conduct an ablation study to demonstrate how the dynamic fusion mechanism and parameter saving strategies contribute to a lightweight model with better performance, based on development experiments on AMR1.0. Results are shown in Table 4. DeepGCN is the model with dense connections (Huang et al., 2017; Guo et al., 2019b). DeepGCN+GC+DF and DeepGCN+WT+DF are essentially the LDGCN GC and LDGCN WT models of Section 5.2, respectively.
Dynamic Fusion Mechanism. The performance of DeepGCN+DF is 1.1 BLEU points higher than DeepGCN, which demonstrates that our dynamic fusion mechanism is beneficial for graph encoding when applied alone. Adding the group graph convolution strategies gives a BLEU score of 30.3, which is only 0.1 points lower than DeepGCN+DF. This result shows that the representation learning ability of the dynamic fusion mechanism is robust to parameter sharing and reduction.

[Table 5: Speed comparison. Implementations are based on the Sockeye neural machine translation toolkit (Hieber et al., 2017). Results on speed are based on beam size 10 and batch size 30 on an NVIDIA GTX 1080 GPU.]

We also
observe that the mechanism helps to alleviate oscillation when training the weight tied model: DeepGCN+WT+DF achieves better results than DeepGCN+WT, which is hard to converge when trained on the small AMR1.0 dataset.
Parameter Saving Strategy. Table 4 demonstrates that although the performance of DeepGCN+GC is only 0.3 BLEU points lower than that of DeepGCN, DeepGCN+GC requires only 65% of DeepGCN's parameters. Furthermore, by introducing the dynamic fusion mechanism, the performance of DeepGCN+GC improves greatly and is in fact on par with DeepGCN. Also, DeepGCN+GC+DF does not rely on any kind of self-attention layer; hence, its number of parameters is much smaller than that of graph transformers, i.e., DeepGCN+GC+DF needs only 1/4 to 1/3 of the parameters of graph transformers, as shown in Table 1. On the other hand, DeepGCN+WT is more parameter-efficient than DeepGCN+GC. As shown in Table 2, with an increase in training data, the parameter savings become even more prominent.

[Table 6: Human evaluation. We also perform significance tests following Ribeiro et al. (2019). Results are statistically significant with p < 0.05.]

Time Cost Analysis. As shown in Table 5, LDGCN GC is slower than the other two models, since it requires additional tensor split operations. We believe that state-of-the-art structured SANs are also strictly slower than vanilla SANs, as they require additional neural components, such as GRUs, to encode structural information in the AMR graph. In summary, our model not only has better parameter efficiency, but also lower time costs.

Human Evaluation
We further assess the quality of the generated sentences with a human evaluation. Following Ribeiro et al. (2019), two evaluation criteria are used: (i) meaning similarity: how close in meaning the generated text is to the gold sentence; and (ii) readability: how well the generated sentence reads. We randomly select 100 sentences generated by 4 models. 30 human subjects rate the sentences on a 0-100 scale. The two criteria are evaluated separately, and subjects were first given brief instructions explaining the assessment criteria. For each sentence, we collect scores from 5 subjects and average them. Models are ranked according to the mean of sentence-level scores. We also apply a quality control step, filtering out subjects who do not score fake, known-quality sentences properly. As shown in Table 6, LDGCN GC obtains better human rankings in terms of both meaning similarity and readability than the state-of-the-art GNN-based (DualGraph) and SAN-based (GT SAN) models. DeepGCN without the dynamic fusion mechanism obtains lower scores than GT SAN, which further confirms that synthesizing higher-order information helps in learning better graph representations.

Additional Analysis
To further reveal the source of the performance gains, we perform additional analysis based on characteristics of the AMR graphs, i.e., graph size and graph re-entrancy (Damonte and Cohen, 2019; Damonte et al., 2020). All experiments are conducted on the AMR2.0 test set, and CHRF++ scores are reported.

Graph Size. As shown in Figure 5, the sizes of the AMR graphs are partitioned into four categories ((0, 20], (20, 30], (30, 40], >40). Overall, LDGCN GC outperforms the best-reported GT SAN model across all graph sizes, and the performance gap becomes more pronounced as graph size increases. Although both models show sharp performance degradation on extremely large graphs (>40), the performance of LDGCN GC is more stable. This suggests that our model can better deal with large graphs with more complicated structures.
Graph Re-entrancies. Re-entrancies describe the co-references and control structures in AMR graphs; a graph is considered more complex if it contains more re-entrancies. In Figure 6, we show how LDGCN GC and GT SAN generalize to different numbers of re-entrancies. Again, LDGCN GC consistently outperforms GT SAN, and the performance gap becomes noticeably wider as the number of re-entrancies increases. These results suggest that our model can better capture the complex dependencies in AMR graphs.

[Table 7: Example outputs.
Reference: trust me , it 's better to get these things as early as possible rather than let them get even worse .
DualGraph: so to me , this is the best thing to get these things as they can , instead of letting it even worse .
DeepGCN: i trust me , it 's better that these things get in the early than letting them even get worse .
GT SAN: trust me , this is better to get these things , rather than let it even get worse .
LDGCN GC: trust me . better to get these things as early as possible , rather than letting them even make worse .]

Case Study. Table 7 shows the sentences generated from an AMR graph by the four models, together with the gold reference. The phrase "trust me" is the beginning of the sentence; DualGraph fails to decode it. GT SAN successfully generates the second half of the sentence, i.e., "rather than let them get even worse", but fails to capture the meaning of the word "early" in its output, which is a critical part. DeepGCN produces both "early" and "get even worse" in its output, but the readability of the generated sentence is not satisfactory. Compared to the baselines, LDGCN produces the best result, with a correct starting phrase, the semantic content of critical words such as "early" and "get even worse", and good readability.

Related Work
Graph convolutional networks (Kipf and Welling, 2017) have been widely used as structural encoders in various NLP applications, including question answering (De Cao et al., 2019; Lin et al., 2019), semantic parsing (Bogin et al., 2019a,b) and relation extraction (Guo et al., 2019a, 2020). Early efforts on AMR-to-text generation mainly include grammar-based models (Flanigan et al., 2016; Song et al., 2017) and sequence-based models (Pourdamghani et al., 2016; Konstas et al., 2017; Cao and Clark, 2019), which discard crucial structural information when linearizing the input AMR graph. To address this, various GNNs, including graph recurrent networks (Song et al., 2018; Ribeiro et al., 2019) and graph convolutional networks (Damonte and Cohen, 2019; Guo et al., 2019b), have been used to encode the AMR structure. Though GNNs are able to operate directly on graphs, their local nature precludes efficient information propagation (Abu-El-Haija et al., 2018; Luan et al., 2019), so larger and deeper models are required to model complex non-local interactions (Xu et al., 2018; Li et al., 2019a). More recently, SAN-based models (Zhu et al., 2019; Cai and Lam, 2020; Wang et al., 2020) have outperformed GNN-based models, as they are able to capture global dependencies. Unlike previous models, our local yet efficient model, based solely on graph convolutions, outperforms competitive structured SANs while using a significantly smaller model.

Conclusion
In this paper, we propose LDGCNs for AMR-to-text generation. Compared with existing GCNs and SANs, LDGCNs maintain a better balance between parameter efficiency and model capacity, and outperform state-of-the-art models on AMR-to-text generation. In future work, we would like to investigate methods to stabilize the training of weight tied models and to apply our model to other tasks in Natural Language Generation.