Heterogeneous Graph Transformer for Graph-to-Sequence Learning

Graph-to-sequence (Graph2Seq) learning aims to transduce graph-structured representations into word sequences for text generation. Recent studies propose various models to encode graph structure. However, most previous works either ignore the indirect relations between distant nodes or treat indirect and direct relations in the same way. In this paper, we propose the Heterogeneous Graph Transformer to independently model the different relations in individual subgraphs of the original graph, including direct relations, indirect relations and multiple possible relations between nodes. Experimental results show that our model strongly outperforms the state of the art on all four standard benchmarks of AMR-to-text generation and syntax-based neural machine translation.


Introduction
Graph-to-sequence (Graph2Seq) learning has attracted much attention in recent years. Many Natural Language Processing (NLP) problems involve learning from not only sequential data but also more complex structured data, such as semantic graphs. For example, AMR-to-text generation is the task of generating text from Abstract Meaning Representation (AMR) graphs, where nodes denote semantic concepts and edges refer to relations between concepts (see Figure 1 (a)). In addition, it has been shown that sequential inputs can be augmented with additional structural information, bringing benefits for tasks such as semantic parsing (Pust et al., 2015; Guo and Lu, 2018) and machine translation (Bastings et al., 2017). Therefore, Xu et al. (2018b) introduced the Graph2Seq problem, which aims to generate a target sequence from graph-structured data.
The main challenge for Graph2Seq learning is to build a powerful encoder which is able to capture the inherent structure of the given graph and learn good representations for generating the target text. Early work relies on statistical methods or sequence-to-sequence (Seq2Seq) models where input graphs are linearized (Lu et al., 2009; Song et al., 2017; Konstas et al., 2017). Recent studies propose various models based on graph neural networks (GNNs) to encode graphs (Xu et al., 2018b; Beck et al., 2018; Guo et al., 2019; Damonte and Cohen, 2019; Ribeiro et al., 2019). However, these approaches only consider the relations between directly connected nodes and ignore the indirect relations between distant nodes. Inspired by the success of the Transformer (Vaswani et al., 2017), which can learn dependencies between all tokens regardless of their distance, the current state-of-the-art Graph2Seq models (Zhu et al., 2019; Cai and Lam, 2020) are based on the Transformer and learn the relations between all nodes whether they are connected or not. These approaches use the shortest relation path between nodes to encode semantic relationships. However, they ignore the information of the nodes on the relation path and encode direct and indirect relations without distinction, which may disturb the information propagation process when aggregating information from direct neighbors.
To solve the issues above, we propose the Heterogeneous Graph Transformer (HetGT) to encode the graph, which independently models the different relations in individual subgraphs of the original graph. HetGT is adapted from the Transformer and also employs an encoder-decoder architecture. Following Beck et al. (2018), we first transform the input into its corresponding Levi graph, which is a heterogeneous graph (it contains different types of edges). Then we split the transformed graph into multiple subgraphs according to its heterogeneity, which correspond to different representation subspaces of the graph. For updating the node representations, attention mechanisms are used for independently aggregating information in different subgraphs. Finally, the representations of each node obtained in different subgraphs are concatenated together and a parameterized linear transformation is applied. In this way, HetGT can adaptively model the various relations in the graph independently, avoiding the information loss caused by mixing all of them. Moreover, we introduce the jump connection in our model, which significantly improves model performance.
We evaluate our model on four benchmark datasets of two Graph2Seq tasks: AMR-to-text generation and syntax-based Neural Machine Translation (NMT). In terms of various evaluation metrics, our model strongly outperforms the state-of-the-art (SOTA) results on both tasks. In particular, in AMR-to-text generation, our model improves the BLEU scores of the SOTA by about 2.2 points and 2.3 points on two benchmark datasets (LDC2015E86 and LDC2017T10). In syntax-based NMT, our model surpasses the SOTA by about 4.1 and 2.2 BLEU points for English-German and English-Czech on the News Commentary v11 datasets from the WMT16 translation task. Our contributions can be summarized as follows:
• We propose the Heterogeneous Graph Transformer (HetGT), which adaptively models the various relations in different representation subgraphs.
• We analyze the shortcomings of the residual connection and introduce a better connectivity method around encoder layers.
• Experimental results show that our model achieves new state-of-the-art performance on four benchmark datasets of two Graph2Seq tasks.

Neural Graph-to-Sequence Model
In this section, we first give a brief review of the Transformer, which is the basis of our model. Then we introduce the graph transformation process. Finally, we detail the whole architecture of HetGT.

Transformer
The Transformer employs an encoder-decoder architecture, consisting of stacked encoder and decoder layers. Encoder layers consist of two sublayers: a self-attention mechanism and a position-wise feed-forward network. The self-attention mechanism employs h attention heads. Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements with x_i ∈ R^{d_x}, and computes a new sequence z = (z_1, ..., z_n) of the same length with z_i ∈ R^{d_z}. Finally, the results from all attention heads are concatenated together and a parameterized linear transformation is applied to get the output of the self-attention sublayer. Each output element z_i is computed as a weighted sum of linearly transformed input elements:

z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)    (1)

where the weight coefficient α_ij is computed by a softmax function:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}    (2)

and e_ij is computed using a compatibility function that compares two input elements, for which the scaled dot product is chosen:

e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}}    (3)

W^Q, W^K, W^V ∈ R^{d_x × d_z} are layer-specific trainable parameter matrices, and these parameter matrices are unique per attention head.
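The multi-head computation above can be sketched in a few lines of NumPy (a minimal illustration with toy dimensions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    """One self-attention head: z_i = sum_j alpha_ij (x_j W^V), Eqs. (1)-(3)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # (n, d_z) each
    e = q @ k.T / np.sqrt(k.shape[-1])        # compatibility scores e_ij
    alpha = softmax(e, axis=-1)               # attention weights alpha_ij
    return alpha @ v                          # weighted sum of values

# Toy dimensions: n=4 tokens, d_x=8, d_z=4, h=2 heads
rng = np.random.default_rng(0)
n, dx, dz, h = 4, 8, 4, 2
x = rng.normal(size=(n, dx))
heads = [attention_head(x, rng.normal(size=(dx, dz)),
                           rng.normal(size=(dx, dz)),
                           rng.normal(size=(dx, dz))) for _ in range(h)]
Wo = rng.normal(size=(h * dz, dx))            # output projection
z = np.concatenate(heads, axis=-1) @ Wo       # (n, d_x)
print(z.shape)
```

Each head has its own W^Q, W^K, W^V, and the concatenated head outputs are mixed by a final linear projection, exactly as in the sublayer description above.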

Input Graph Transformation
Following Beck et al. (2018), we transform the original graph into a Levi graph. The transformation turns edges into additional nodes, so we can encode the original edge labels in the same way as nodes. We also add a reverse edge between each pair of connected nodes, as well as a self-loop edge for each node. These strategies let the model benefit from information propagation in different directions (see Figure 1 (b)). In order to alleviate the data sparsity problem in the corpus, we further introduce Byte Pair Encoding (BPE) (Sennrich et al., 2016) into the Levi graph: we split each original node into multiple subword nodes. Besides adding the default connections, we also add reverse and self-loop edges among the subwords. For example, the word country in Figure 2 is segmented into co@@, un@@, try with three types of edges between them. Finally, we transform the AMR graph into the extended Levi graph, which can be seen as a heterogeneous graph since it has different types of edges.
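The transformation can be sketched as follows (an illustrative Python sketch of the idea; the function name and the edge-node naming scheme are our own, not from the paper):

```python
def to_extended_levi_graph(nodes, edges):
    """Turn each labeled edge (u, label, v) into an extra node, then add
    default, reverse, and self-loop typed edges, as described above."""
    levi_nodes = list(nodes)
    typed_edges = []                      # (src, tgt, type) triples
    for u, label, v in edges:
        e = f"{label}#{u}->{v}"           # one new node per labeled edge
        levi_nodes.append(e)
        typed_edges += [(u, e, "default"), (e, v, "default"),
                        (e, u, "reverse"), (v, e, "reverse")]
    for node in levi_nodes:               # self-loop for every node
        typed_edges.append((node, node, "self"))
    return levi_nodes, typed_edges

# Toy AMR-like fragment: (possible :domain go), (go :ARG0 boy)
nodes = ["possible", "go", "boy"]
edges = [("possible", "domain", "go"), ("go", "ARG0", "boy")]
ln, te = to_extended_levi_graph(nodes, edges)
print(len(ln), len(te))
```

Each original edge label becomes a node, so the encoder can treat relation labels and concepts uniformly; BPE subword splitting would add further nodes in the same fashion.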

Heterogeneous Graph Transformer
Our model is an encoder-decoder architecture consisting of stacked encoder and decoder layers. Given a preprocessed extended Levi graph, we split it into multiple subgraphs according to its heterogeneity. In each graph encoder block, the node representation in each subgraph is updated based on the node's neighbors in the current subgraph. Then all representations of this node in different subgraphs are combined to get its final representation. In this way, the model can attend to information from different representation subgraphs and adaptively model the various relations. The learned representations of all nodes at the last block are fed to the sequence decoder for sequence generation. The architecture of the HetGT encoder is shown in Figure 1 (c). Due to space limitations, the decoder is omitted in the figure; we describe it in Section 2.3.2.

Graph Encoder
Unlike previous Transformer-based Graph2Seq models, which use relative position encoding to incorporate structural information, we use a simpler way to encode the graph structure. As the Transformer treats the sentence as a fully-connected graph, we directly mask the attention of non-neighbor nodes when updating each node's representation. Specifically, we mask the attention α_ij for node j ∉ N_i, where N_i is the set of neighbors of node i in the graph. So given the input sequence x = (x_1, ..., x_n), the output representation of node i, denoted z_i, in each attention head is computed as follows:

z_i = \sum_{j \in N_i} \alpha_{ij} (x_j W^V)    (4)

where α_ij represents the attention score of node j to i, computed with the scaled dot-product function as in Equation 2. We also investigate another way to compute the attention scores: the additive form of attention instead of scaled dot-product attention, similar to the graph attention network (Veličković et al., 2018). The additive form of attention shows better performance and trainability in some tasks (Chen et al., 2019). In this case, the attention coefficient α_ij is obtained by applying the softmax of Equation 2 (restricted to j ∈ N_i) to

e_{ij} = \mathrm{LeakyReLU}\left(a^T [x_i W^Q ; x_j W^K]\right)    (5)

where a ∈ R^{2d_z} is a weight vector and LeakyReLU (Girshick et al., 2014) is used as the activation function.
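Both scoring variants can be sketched in one function (a NumPy sketch under our own conventions: `adj` is a boolean adjacency mask with self-loops, and the additive score follows the GAT-style form above):

```python
import numpy as np

def graph_attention(x, adj, Wq, Wk, Wv, a=None):
    """Attention restricted to graph neighbors. With a=None, uses the
    scaled dot-product score (Eq. 3); otherwise the additive score
    e_ij = LeakyReLU(a^T [x_i W^Q ; x_j W^K]) (Eq. 5)."""
    n = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    if a is None:
        e = q @ k.T / np.sqrt(k.shape[-1])
    else:
        pair = np.concatenate([np.repeat(q[:, None, :], n, axis=1),
                               np.repeat(k[None, :, :], n, axis=0)], axis=-1)
        s = pair @ a                                # (n, n) raw scores
        e = np.where(s > 0, s, 0.2 * s)             # LeakyReLU
    e = np.where(adj, e, -np.inf)                   # mask non-neighbors
    w = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha = w / w.sum(axis=-1, keepdims=True)       # softmax over N_i
    return alpha @ v

rng = np.random.default_rng(0)
n, dx, dz = 4, 6, 3
x = rng.normal(size=(n, dx))
adj = (rng.random((n, n)) < 0.5) | np.eye(n, dtype=bool)  # self-loops kept
Wq, Wk, Wv = (rng.normal(size=(dx, dz)) for _ in range(3))
z_dot = graph_attention(x, adj, Wq, Wk, Wv)
z_add = graph_attention(x, adj, Wq, Wk, Wv, a=rng.normal(size=2 * dz))
print(z_dot.shape, z_add.shape)
```

The self-loops in `adj` guarantee every row of the masked score matrix has at least one finite entry, so the softmax is always well defined.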

Heterogeneous Mechanism
Motivated by the success of the multi-head mechanism, we propose the heterogeneous mechanism. Considering a sentence, multi-head attention allows the model to implicitly attend to information from different representation subspaces at different positions. Correspondingly, our heterogeneous mechanism makes the model explicitly attend to information in different subgraphs, corresponding to different representation subspaces of the graph, which enhances the models' encoding capability.
As stated above, the extended Levi graph is a heterogeneous graph that contains different types of edges. For example, in Figure 1 (b), the edge type vocabulary for the Levi graph of the AMR graph is T = {default, reverse, self}. Specifically, we first group all edge types into a single one to get a homogeneous subgraph referred to as the connected subgraph. The connected subgraph is an undirected graph which contains the complete connectivity information of the original graph. Then we split the input graph into multiple subgraphs according to the edge types. Besides learning the directly connected relations, we introduce a fully-connected subgraph to learn the implicit relationships between indirectly connected nodes. Finally, we get a set of M subgraphs G_sub = {fully-connected, connected, default, reverse}. For the AMR graph M = 4 (for NMT M = 6; we detail this in Section 3.1). Note that we do not have a subgraph containing only self edges. Instead, we add the self-loop edges to all subgraphs, which we believe is more helpful for information propagation than constructing an independent self-connected subgraph. Now the output z in each encoder layer is computed as follows:

z_i^m = \sum_{j \in N_i^m} \alpha_{ij}^m (x_j W^{V,m})    (6)

z_i = \mathrm{FFN}\left([z_i^1 ; z_i^2 ; \dots ; z_i^M] W^O\right)    (7)

where N_i^m is the set of neighbors of node i in the m-th subgraph, and α_ij^m is computed as in Equation 2 or Equation 5. FFN is a feed-forward network which consists of two linear transformations with a ReLU activation in between. We also employ residual connections between sublayers as well as layer normalization. Note that the heterogeneous mechanism is independent of the model architecture, so it can be applied to other graph models, which may bring benefits.
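The per-subgraph aggregation and concatenation can be sketched as follows (a self-contained NumPy sketch of Equations 6-7; the FFN, residual connections, and layer normalization are omitted for brevity, and all names are illustrative):

```python
import numpy as np

def masked_attention(x, adj, Wq, Wk, Wv):
    """Scaled dot-product attention over the neighbors in one subgraph."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = np.where(adj, q @ k.T / np.sqrt(k.shape[-1]), -np.inf)
    w = np.exp(e - e.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def heterogeneous_layer(x, subgraphs, params, Wo):
    """Aggregate independently in each subgraph, concatenate the
    per-subgraph node representations, then project (Eqs. 6-7)."""
    z = np.concatenate([masked_attention(x, adj, *p)
                        for adj, p in zip(subgraphs, params)], axis=-1)
    return z @ Wo

rng = np.random.default_rng(1)
n, dx, dz, M = 5, 8, 4, 3        # M subgraphs, e.g. connected/default/reverse
x = rng.normal(size=(n, dx))
eye = np.eye(n, dtype=bool)
subgraphs = [(rng.random((n, n)) < 0.4) | eye for _ in range(M)]  # self-loops
params = [tuple(rng.normal(size=(dx, dz)) for _ in range(3)) for _ in range(M)]
Wo = rng.normal(size=(M * dz, dx))
z = heterogeneous_layer(x, subgraphs, params, Wo)
print(z.shape)
```

Each subgraph has its own projection matrices, mirroring how multi-head attention gives each head its own parameters; here the "heads" are structurally distinct views of the graph rather than arbitrary subspaces.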
For the decoder, we follow the standard implementation of the sequential Transformer decoder to generate the text sequence. Decoder layers consist of three sublayers: self-attention, followed by encoder-decoder attention, followed by a position-wise feed-forward layer.

Layer Aggregation
As stated above, our model consists of stacked encoder layers, and better information propagation between encoder layers may bring better performance. We therefore investigate three different layer aggregation methods, illustrated in Figure 3. When updating the representation of each node at the l-th layer, recent approaches aggregate the neighbors first and then combine the aggregated result with the node's representation from the (l-1)-th layer. This strategy can be viewed as a form of skip connection between different layers (Xu et al., 2018a):

z_i^{(l)} = \mathrm{Combine}\left(\mathrm{Aggregate}\left(\{z_j^{(l-1)} : j \in N_i\}\right), z_i^{(l-1)}\right)

The residual connection is another well-known skip connection, which uses the identity mapping as the combine function to help signals propagate (He et al., 2016). However, these skip connections cannot adaptively adjust the neighborhood size of the final-layer representation independently: if we "skip" a layer for z_i^{(l)}, all subsequent units such as z_i^{(l+j)} that use this representation will use this skip implicitly. Thus, to selectively aggregate the outputs of previous layers at the end, we introduce the Jumping Knowledge architecture (Xu et al., 2018a) into our model. At the last layer L of the encoder, we combine the outputs of all previous encoder layers by concatenation, which helps the model selectively aggregate those intermediate representations:

z_i = [x_i ; z_i^{(1)} ; \dots ; z_i^{(L)}] W_{jump}    (8)

where W_jump ∈ R^{(L d_z + d_x) × d_z}. Furthermore, to further improve information propagation, dense connectivity can be introduced as well. With dense connectivity, the nodes in the l-th layer not only take input from the (l-1)-th layer but also draw information from all preceding layers:

x_i^{(l)} = [x_i ; z_i^{(1)} ; \dots ; z_i^{(l-1)}]    (9)

Dense connectivity has also been used in previous research (Huang et al., 2017; Guo et al., 2019).
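The jump connection can be sketched as follows (a NumPy sketch of Equation 8; the tanh projection is only a stand-in for a full encoder block, used to produce intermediate representations):

```python
import numpy as np

rng = np.random.default_rng(2)
n, dx, dz, L = 5, 8, 8, 6
x = rng.normal(size=(n, dx))

outs, h = [], x
for _ in range(L):
    # stand-in for one encoder layer producing a (n, d_z) output
    W = rng.normal(size=(h.shape[-1], dz)) / np.sqrt(h.shape[-1])
    h = np.tanh(h @ W)
    outs.append(h)

# Jumping Knowledge: concatenate the input embedding and every layer's
# output at the last layer, then project with W_jump (Eq. 8)
W_jump = rng.normal(size=(L * dz + dx, dz))
z = np.concatenate([x] + outs, axis=-1) @ W_jump
print(z.shape)
```

Because the final projection sees all intermediate representations at once, a learned W_jump can weight shallow and deep neighborhoods per dimension, which a fixed residual path cannot.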

Data and preprocessing
We build and test our model on two typical Graph2Seq learning tasks. One is AMR-to-text generation and the other is syntax-based NMT. Table 1 presents the statistics of four datasets of the two tasks. For AMR-to-text generation, we use two standard benchmarks LDC2015E86 (AMR15) and LDC2017T10 (AMR17). These two datasets contain 16K and 36K training instances, respectively, and share the development and test set. Each instance contains a sentence and an AMR graph. In the preprocessing steps, we apply entity simplification and anonymization in the same way as Konstas et al. (2017). Then we transform each preprocessed AMR graph into its extended Levi graph as described in Section 2.2.
For syntax-based NMT, we take syntactic trees of source texts as inputs. We evaluate our model on both the English-German (En-De) and English-Czech (En-Cs) News Commentary v11 datasets from the WMT16 translation task. Both sides are tokenized and split into subwords using BPE with 8000 merge operations. English text is parsed using SyntaxNet (Alberti et al., 2017). We transform the labeled dependency tree into the extended Levi graph as described in Section 2.2. Unlike AMR-to-text generation, in the NMT task the input sentence contains significant sequential information, which is lost when treating the sentence as a graph. Guo et al. (2019) preserve this information by adding sequential connections between word nodes. In our model, we also add forward and backward edges to the extended Levi graph. Thus, the edge type vocabulary for the extended Levi graph of the dependency tree is T = {default, reverse, self, forward, backward}, and the set of subgraphs for NMT is G_sub = {fully-connected, connected, default, reverse, forward, backward}. Note that we do not change the model architecture for the NMT tasks, yet we still get good results, which indicates the effectiveness of our model on Graph2Seq tasks. Except for introducing BPE into the Levi graph, the above preprocessing steps follow Bastings et al. (2017); we refer readers to that work for further details on preprocessing.
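Building the subgraph set from the typed edge list can be sketched as follows (an illustrative Python sketch; the mask convention — row i attends to column j — and all names are our own):

```python
import numpy as np

def subgraph_masks(n, typed_edges, edge_types):
    """Build one boolean adjacency mask per subgraph from (src, tgt, type)
    triples: a fully-connected mask, a 'connected' mask grouping every
    edge type into one undirected graph, and one mask per edge type.
    Self-loops are added to every subgraph, as described above."""
    eye = np.eye(n, dtype=bool)
    masks = {"fully-connected": np.ones((n, n), dtype=bool),
             "connected": eye.copy()}
    for t in edge_types:
        masks[t] = eye.copy()
    for s, t, ty in typed_edges:
        masks[ty][t, s] = True            # target node t attends to source s
        masks["connected"][t, s] = True
        masks["connected"][s, t] = True   # connected subgraph is undirected
    return masks

# Toy dependency fragment with sequential forward/backward edges
edges = [(0, 1, "default"), (1, 0, "reverse"),
         (0, 1, "forward"), (1, 0, "backward")]
m = subgraph_masks(3, edges, ["default", "reverse", "forward", "backward"])
print(sorted(m))
```

For the NMT setting this yields the six masks in G_sub above; each mask then drives one masked-attention branch of the heterogeneous layer.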

Parameter Settings
Both our encoder and decoder have 6 layers, with 512-dimensional word embeddings and hidden states. We employ 8 heads and dropout with a rate of 0.3. For optimization, we use the Adam optimizer with β_2 = 0.998 and set the batch size to 4096 tokens. We increase the learning rate linearly for the first warmup steps and decrease it thereafter proportionally to the inverse square root of the step number, with warmup steps set to 8000; a similar learning rate schedule is adopted in Vaswani et al. (2017). Our implementation uses the OpenNMT library (Klein et al., 2017). We train the models for 250K steps on a single GeForce GTX 1080 Ti GPU. Our code is available at https://github.com/QAQ-v/HetGT.
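The warmup schedule described above is the one from Vaswani et al. (2017) and can be written in a few lines (the d_model scale factor follows that paper; whether our exact implementation matches is an assumption):

```python
def learning_rate(step, d_model=512, warmup_steps=8000):
    """Linear warmup for the first warmup_steps, then decay
    proportionally to the inverse square root of the step number."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two terms cross exactly at `warmup_steps`, so the rate rises linearly, peaks at step 8000, and then decays as 1/sqrt(step).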

Metrics and Baselines
For performance evaluation, we use BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and sentence-level CHRF++ (Popović, 2015) with default hyperparameter settings as evaluation metrics. Meanwhile, we use the tools in Neubig et al. (2019) for the statistical significance tests. Our baseline is the original Transformer, with parameters chosen following the OpenNMT FAQ (http://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model). For AMR-to-text generation, the Transformer takes linearized graphs as inputs. For syntax-based NMT, the Transformer is trained on the preprocessed translation dataset without syntactic information. We also compare the performance of HetGT with previous single/ensemble approaches, which can be grouped into three categories: (1) recurrent neural network (RNN) based methods (GGNN2Seq, GraphLSTM); (2) graph neural network (GNN) based methods (GCNSEQ, DGCN, G2S-GGNN); (3) Transformer based methods (Structural Transformer, GTransformer). The ensemble models are denoted by subscripts in Table 2 and Table 3.

Results on AMR-to-text Generation

Table 2 presents the results of our single model and previous single/ensemble models on the test sets of AMR15 and AMR17. Our Transformer baseline outperforms most previous single models, and our best single model HetGT_additive outperforms the Transformer baseline by a large margin (6.15 BLEU and 6.44 BLEU) on the two benchmarks, which demonstrates the importance of incorporating structural information. Meanwhile, HetGT_additive gets an improvement of 2.18 and 2.28 BLEU points over the latest SOTA results (Zhu et al., 2019) on AMR15 and AMR17, respectively. Previous models can capture structural information, but most of them ignore heterogeneous information. These results indicate that the heterogeneity in the graph carries much useful information for downstream tasks, and that our model can make good use of it.

Furthermore, our best single model still outperforms previous ensemble models on both datasets. Note that the additive attention based model HetGT_additive is significantly better than the dot-product attention based model HetGT_dot-product in AMR-to-text generation. This may be because additive attention has fewer parameters and is easier to train on small datasets.

Results on Syntax-based NMT

Table 3 presents the results of our single model and previous single/ensemble models on the test sets for the En-De and En-Cs language pairs. Our Transformer baseline already outperforms all previous results, even though some of them are Transformer based, which shows the effectiveness of the Transformer for NMT tasks. Meanwhile, even without changing the model architecture for the NMT tasks, our single model surpasses the Transformer baseline by 2.26 and 1.46 BLEU points on the En-De and En-Cs tasks, respectively, and surpasses the previous best models by 4.14 and 2.19 BLEU points. In syntax-based NMT, where the datasets are larger than in AMR-to-text generation, HetGT_dot-product gets results comparable to HetGT_additive, and even outperforms HetGT_additive in terms of METEOR and CHRF++ on the En-De language pair. We expect that on still larger datasets HetGT_dot-product would obtain better results than HetGT_additive.

Effect of Layer Aggregation Method
First, we compare the performance of the three layer aggregation methods discussed in Section 2.3.3.

The results are shown in Table 4. We can see that the jump connection is the most effective method, while the dense connection performs the worst. We think the reason is that dense connections introduce many extra parameters, which are harder to learn.

Effect of Subgraphs
In this section, we also use AMR15 as our benchmark, to investigate how each subgraph influences the final results of our best model HetGT_additive. Table 5 shows the results of removing, or keeping only, a specific subgraph. Keeping only the fully-connected subgraph is essentially what the Transformer baseline does: the model does not consider the inherent structural information of the input, and obviously it cannot get good results. In addition, keeping only the connected subgraph does not perform well either, even though it considers the structural information, which demonstrates that the heterogeneous information in the graph is helpful for learning the graph representation. Removing any subgraph decreases the model's performance, which demonstrates that each subgraph contributes to the final results. Finally, when we remove BPE we get a BLEU score of 29.84, which is still better than the previous SOTA that also uses BPE. Note that when we remove the connected subgraph, the results do not change in a statistically significant way (p = 0.293). We think the reason is that the remaining subgraphs already contain the full information of the original graph, because the connected subgraph is obtained by grouping all edge types into a single one. Apart from that, all the other results show statistically significant changes (p ≤ 0.05).

Case Study
We perform case studies for a better understanding of model performance, comparing the outputs of the Transformer baseline and our HetGT_additive. The results are presented in Table 6. In the first, simple example, both the Transformer baseline and HetGT_additive generate the target sequence without mistakes. In the second, more complicated example, the Transformer baseline fails to identify the possessor of "opinion" and the subject of "agreed", while our model successfully recognizes them. However, we find that there is a common problem: the generated sentences all contain some duplication. We will explore this issue further in future work.

Related Work
Early research on Graph2Seq learning tasks is based on statistical methods and neural Seq2Seq models. Lu et al. (2009) propose an NLG approach built on top of tree conditional random fields to use tree-structured meaning representations. Song et al. (2017) use a synchronous node replacement grammar to generate text. Konstas et al. (2017) linearize the input graph and feed it to a Seq2Seq model for text-to-AMR parsing and AMR-to-text generation. However, linearizing AMR graphs into sequences may incur a loss of information. Recent efforts consider capturing the structural information in the encoder. Beck et al. (2018) employ Gated Graph Neural Networks (GGNN) as the encoder, and Song et al. (2018) propose the graph-state LSTM to incorporate the graph structure; these works belong to the family of recurrent neural networks (RNNs). In addition, some works build upon GNNs: Damonte and Cohen (2019) propose stacking encoders including LSTM and GCN, and Guo et al. (2019) introduce the densely connected GCN to encode richer local and non-local information for better graph representations.
Recent studies also extend the Transformer to encode structural information. Shaw et al. (2018) propose relation-aware self-attention, which learns explicit embeddings for pairwise relationships between input elements. Zhu et al. (2019) and Cai and Lam (2020) both extend relation-aware self-attention to generate text from AMR graphs. Our model is also based on the Transformer. However, we do not employ relative position encoding to incorporate structural information; instead, we directly mask the attention of non-neighbor nodes when updating each node's representation. Moreover, we introduce heterogeneous information and the jump connection to help the model learn a better graph representation, bringing substantial gains in model performance.

Conclusion
In this paper, we propose the Heterogeneous Graph Transformer (HetGT) for Graph2Seq learning. Our heterogeneous mechanism can adaptively model the different relations in the different representation subgraphs. Experimental results show that HetGT strongly outperforms state-of-the-art performance on four benchmark datasets of AMR-to-text generation and syntax-based neural machine translation.
There are two directions for future work. One is to investigate how other graph models can benefit from our proposed heterogeneous mechanism. The other is to investigate how to apply our proposed model to sequence-to-sequence tasks.