Graph-to-Sequence Learning using Gated Graph Neural Networks

Many NLP applications can be framed as a graph-to-sequence learning problem. Previous work proposing neural architectures on graph-to-sequence obtained promising results compared to grammar-based approaches but still rely on linearisation heuristics and/or standard recurrent networks to achieve the best performance. In this work propose a new model that encodes the full structural information contained in the graph. Our architecture couples the recently proposed Gated Graph Neural Networks with an input transformation that allows nodes and edges to have their own hidden representations, while tackling the parameter explosion problem present in previous work. Experimental results shows that our model outperforms strong baselines in generation from AMR graphs and syntax-based neural machine translation.


Introduction
Graph structures are ubiquitous in representations of natural language. In particular, many wholesentence semantic frameworks employ directed acyclic graphs as the underlying formalism, while most tree-based syntactic representations can also be seen as graphs. A range of NLP applications can be framed as the process of transducing a graph structure into a sequence. For instance, language generation may involve realising a semantic graph into a surface form and syntactic machine translation involves transforming a tree-annotated source sentence to its translation.
Previous work in this setting rely on grammarbased approaches such as tree transducers (Flanigan et al., 2016) and hyperedge replacement gram-mars (Jones et al., 2012). A key limitation of these approaches is that alignments between graph nodes and surface tokens are required. These alignments are usually automatically generated so they can propagate errors when building the grammar. More recent approaches transform the graph into a linearised form and use off-the-shelf methods such as phrase-based machine translation (Pourdamghani et al., 2016) or neural sequenceto-sequence (henceforth, s2s) models (Konstas et al., 2017). Such approaches ignore the full graph structure, discarding key information.
In this work we propose a model for graph-tosequence (henceforth, g2s) learning that leverages recent advances in neural encoder-decoder architectures. Specifically, we employ an encoder based on Gated Graph Neural Networks (Li et al., 2016, GGNNs), which can incorporate the full graph structure without loss of information. Such networks represent edge information as label-wise parameters, which can be problematic even for small sized label vocabularies (in the order of hundreds). To address this limitation, we also introduce a graph transformation that changes edges to additional nodes, solving the parameter explosion problem. This also ensures that edges have graphspecific hidden vectors, which gives more information to the attention and decoding modules in the network. We benchmark our model in two graph-tosequence problems, generation from Abstract Meaning Representations (AMRs) and Neural Machine Translation (NMT) with source dependency information. Our approach outperforms strong s2s baselines in both tasks without relying on standard RNN encoders, in contrast with previous work. In particular, for NMT we show that we avoid the need for RNNs by adding sequential edges between contiguous words in the dependency tree. This illustrates the generality of our  Figure 1: Left: the AMR graph representing the sentence "The boy wants the girl to believe him.". Right: Our proposed architecture using the same AMR graph as input and the surface form as output. The first layer is a concatenation of node and positional embeddings, using distance from the root node as the position. The GGNN encoder updates the embeddings using edge-wise parameters, represented by different colors (in this example, ARG0 and ARG1). The encoder also add corresponding reverse edges (dotted arrows) and self edges for each node (dashed arrows). All parameters are shared between layers. Attention and decoder components are similar to standard s2s models. This is a pictorial representation: in our experiments the graphs are transformed before being used as inputs (see §3).
approach: linguistic biases can be added to the inputs by simple graph transformations, without the need for changes to the model architecture.

Neural Graph-to-Sequence Model
Our proposed architecture is shown in Figure 1, with an example AMR graph and its transformation into its surface form. Compared to standard s2s models, the main difference is in the encoder, where we employ a GGNN to build a graph representation. In the following we explain the components of this architecture in detail. 1

Gated Graph Neural Networks
Early approaches for recurrent networks on graphs (Gori et al., 2005;Scarselli et al., 2009) assume a fixed point representation of the parameters and learn using contraction maps. Li et al. (2016) argues that this restricts the capacity of the model and makes it harder to learn long distance relations between nodes. To tackle these issues, they propose Gated Graph Neural Networks, which extend these architectures with gating mechanisms 1 Our implementation uses MXNet (Chen et al., 2015) and is based on the Sockeye toolkit (Hieber et al., 2017). Code is available at github.com/beckdaniel/acl2018_ graph2seq.
in a similar fashion to Gated Recurrent Units (Cho et al., 2014). This allows the network to be learnt via modern backpropagation procedures.
In following, we formally define the version of GGNNs we employ in this study. Assume a di- and L V and L E are respectively vocabularies for nodes and edges, from which node and edge labels ( v and e ) are defined. Given an input graph with nodes mapped to embeddings X, a GGNN is defined as where e = (u, v, e ) is the edge between nodes u and v, N (v) is the set of neighbour nodes for v, ρ is a non-linear function, σ is the sigmoid function Our formulation differs from the original GGNNs from Li et al. (2016) in some aspects: 1) we add bias vectors for the hidden state, reset gate and update gate computations; 2) labelspecific matrices do not share any components; 3) reset gates are applied to all hidden states before any computation and 4) we add normalisation constants. These modifications were applied based on preliminary experiments and ease of implementation.
An alternative to GGNNs is the model from Marcheggiani and Titov (2017), which add edge label information to Graph Convolutional Networks (GCNs). According to Li et al. (2016), the main difference between GCNs and GGNNs is analogous to the difference between convolutional and recurrent networks. More specifically, GGNNs can be seen as multi-layered GCNs where layer-wise parameters are tied and gating mechanisms are added. A large number of layers can propagate node information between longer distances in the graph and, unlike GCNs, GGNNs can have an arbitrary number of layers without increasing the number of parameters. Nevertheless, our architecture borrows ideas from GCNs as well, such as normalising factors.

Using GGNNs in attentional encoder-decoder models
In s2s models, inputs are sequences of tokens where each token is represented by an embedding vector. The encoder then transforms these vectors into hidden states by incorporating context, usually through a recurrent or a convolutional network. These are fed into an attention mechanism, generating a single context vector that informs decisions in the decoder. Our model follows a similar structure, where the encoder is a GGNN that receives node embeddings as inputs and generates node hidden states as outputs, using the graph structure as context. This is shown in the example of Figure 1, where we have 4 hidden vectors, one per node in the AMR graph. The attention and decoder components follow similar standard s2s models, where we use a bilinear attention mechanism (Luong et al., 2015) and a 2-layered LSTM (Hochreiter and Schmidhuber, 1997) as the decoder. Note, however, that other decoders and attention mechanisms can be easily employed instead. Bastings et al. (2017) employs a similar idea for syntax-based NMT, but using GCNs instead.

Bidirectionality and positional embeddings
While our architecture can in theory be used with general graphs, rooted directed acyclic graphs (DAGs) are arguably the most common kind in the problems we are addressing. This means that node embedding information is propagated in a top down manner. However, it is desirable to have information flow from the reverse direction as well, in the same way RNN-based encoders benefit from right-to-left propagation (as in bidirectional RNNs). Marcheggiani and Titov (2017) and Bastings et al. (2017) achieve this by adding reverse edges to the graph, as well as self-loops edges for each node. These extra edges have specific labels, hence their own parameters in the network.
In this work, we also follow this procedure to ensure information is evenly propagated in the graph. However, this raises another limitation: because the graph becomes essentially undirected, the encoder is now unaware of any intrinsic hierarchy present in the input. Inspired by Gehring et al. (2017) and Vaswani et al. (2017), we tackle this problem by adding positional embeddings to every node. These embeddings are indexed by integer values representing the minimum distance from the root node and are learned as model parameters. 2 This kind of positional embedding is restricted to rooted DAGs: for general graphs, different notions of distance could be employed.

Levi Graph Transformation
The g2s model proposed in §2 has two key deficiencies. First, GGNNs have three linear transformations per edge type. This means that the number of parameters can explode: AMR, for instance, has around 100 different predicates, which correspond to edge labels. Previous work deal with this problem by explicitly grouping edge labels into a single one (Marcheggiani and Titov, 2017;Bastings et al., 2017) but this is not an ideal solution since it incurs in loss of information. The second deficiency is that edge label information is encoded in the form of GGNN parameters in the network. This means that each label will have the same "representation" across all graphs. However, the latent information in edges can depend on the content in which they appear in a graph. Ideally, edges should have instance-specific hidden states, in the same way as nodes, and these should also inform decisions made in the decoder through the attention module. For instance, in the AMR graph shown in Figure 1, the ARG1 predicate between want-01 and believe-01 can be interpreted as the preposition "to" in the surface form, while the ARG1 predicate connecting believe-01 and boy is realised as a pronoun. Notice that edge hidden vectors are already present in s2s networks that use linearised graphs: we would like our architecture to also have this benefit.
Instead of modifying the architecture, we propose to transform the input graph into its equivalent Levi graph (Levi, 1942;Gross and Yellen, 2004, p. 765).
The new edge set E contains a edge for every (node, edge) pair that is present in the original graph. By definition, the Levi graph is bipartite.
Intuitively, transforming a graph into its Levi graph equivalent turns edges into additional nodes. While simple in theory, this transformation addresses both modelling deficiencies mentioned above in an elegant way. Since the Levi graph has no labelled edges there is no risk of parameter explosion: original edge labels are represented as embeddings, in the same way as nodes. Furthermore, the encoder now naturally generates hidden states for original edges as well.
In practice, we follow the procedure in §2.3 and add reverse and self-loop edges to the Levi graph, so the practical edge label vocabulary is L E = {default, reverse, self}. This still keeps the parameter space modest since we have only three labels. Figure 2 shows the transformation steps in detail, applied to the AMR graph shown in Figure 1. Notice that the transformed graphs are the ones fed into our architecture: we show the original graph in Figure 1 for simplicity.
It is important to note that this transformation can be applied to any graph and therefore is independent of the model architecture. We speculate this can be beneficial in other kinds of graph-based encoder such as GCNs and leave further investigation to future work. development and test sets. Each graph is preprocessed using a procedure similar to what is performed by Konstas et al. (2017), which includes entity simplification and anonymisation. This preprocessing is done before transforming the graph into its Levi graph equivalent. For the s2s baselines, we also add scope markers as in Konstas et al. (2017). We detail these procedures in the Appendix.
Models Our baselines are attentional s2s models which take linearised graphs as inputs. The architecture is similar to the one used in Konstas et al. (2017) for AMR generation, where the encoder is a BiLSTM followed by a unidirectional LSTM. All dimensionalities are fixed to 512.
For the g2s models, we fix the number of layers in the GGNN encoder to 8, as this gave the best results on the development set. Dimensionalities are also fixed at 512 except for the GGNN encoder which uses 576. This is to ensure all models have a comparable number of parameters and therefore similar capacity.
Training for all models uses Adam (Kingma and Ba, 2015) with 0.0003 initial learning rate and 16 as the batch size. 4 To regularise our models we perform early stopping on the dev set based on perplexity and apply 0.5 dropout (Srivastava et al., 2014) on the source embeddings. We detail additional model and training hyperparameters in the Appendix.
Evaluation Following previous work, we evaluate our models using BLEU (Papineni et al., 2001) and perform bootstrap resampling to check statistical significance.
However, since recent work has questioned the effectiveness of BLEU with bootstrap resampling (Graham et al., 2014), we also report results using sentence-level CHRF++ (Popović, 2017), using the Wilcoxon signed-rank test to check significance. Evaluation is case-insensitive for both metrics.
Recent work has shown that evaluation in neural models can lead to wrong conclusions by just changing the random seed (Reimers and Gurevych, 2017). In an effort to make our conclusions more robust, we run each model 5 times using different seeds. From each pool, we report BLEU CHRF++ #params results using the median model according to performance on the dev set (simulating what is expected from a single run) and using an ensemble of the 5 models. Finally, we also report the number of parameters used in each model. Since our encoder architectures are quite different, we try to match the number of parameters between them by changing the dimensionality of the hidden layers (as explained above). We do this to minimise the effects of model capacity as a confounder. Table 1 shows the results on the test set. For the s2s models, we also report results without the scope marking procedure of Konstas et al. (2017). Our approach significantly outperforms the s2s baselines both with individual models and ensembles, while using a comparable number of parameters. In particular, we obtain these results without relying on scoping heuristics.

Reference surface form
We can see that the s2s prediction overgenerates the "India and China" phrase. The g2s prediction avoids overgeneration, and almost perfectly matches the reference. While this is only a single example, it provides evidence that retaining the full graphical structure is beneficial for this task, which is corroborated by our quantitative results. Table 1 also show BLEU scores reported in previous work. These results are not strictly comparable because they used different training set versions and/or employ additional unlabelled corpora; nonetheless some insights can be made. In particular, our g2s ensemble performs better than many previous models that combine a smaller training set with a large unlabelled corpus. It is also most informative to compare our s2s model with Konstas et al. (2017), since this baseline is very similar to theirs. We expected our single model baseline to outperform theirs since we use a larger training set but we obtained similar performance. We speculate that better results could be obtained by more careful tuning, but nevertheless we believe such tuning would also benefit our proposed g2s architecture.
The best results with unlabelled data are obtained by Konstas et al. (2017) using Gigaword sentences as additional data and a paired trained procedure with an AMR parser. It is important to note that this procedure is orthogonal to the individual models used for generation and parsing. Therefore, we hypothesise that our model can also benefit from such techniques, an avenue that we leave for future work.

Syntax-based Neural Machine Translation
Our second evaluation is NMT, using as graphs source language dependency syntax trees. We focus on a medium resource scenario where additional linguistic information tends to be more beneficial. Our experiments comprise two language pairs: English-German and English-Czech.

Experimental setup
Data and preprocessing We employ the same data and settings from Bastings et al. (2017), 5 which use the News Commentary V11 corpora from the WMT16 translation task. 6 English text is tokenised and parsed using SyntaxNet 7 while German and Czech texts are tokenised and split into subwords using byte-pair encodings (Sennrich et al., 2016, BPE) (8000 merge operations).
We refer to Bastings et al. (2017) for further information on the preprocessing steps. Labelled dependency trees in the source side are transformed into Levi graphs as a preprocessing step. However, unlike AMR generation, in NMT the inputs are originally surface forms that contain important sequential information. This information is lost when treating the input as dependency trees, which might explain why Bastings et al.
(2017) obtain the best performance when using an initial RNN layer in their encoder. To investigate this phenomenon, we also perform experiments adding sequential connections to each word in the dependency tree, corresponding to their order in the original surface form (henceforth, g2s+). These connections are represented as edges with specific left and right labels, which are added after the Levi graph transformation. Figure 4 shows an example of an input graph for g2s+, with the additional sequential edges connecting the words (reverse and self edges are omitted for simplicity).
Models Our s2s and g2s models are almost the same as in the AMR generation experiments ( §4.1). The only exception is the GGNN encoder dimensionality, where we use 512 for the experiments with dependency trees only and 448 when the inputs have additional sequential connections. As in the AMR generation setting, we do this to ensure model capacity are comparable in the number of parameters. Another key difference is that the s2s baselines do not use dependency trees: they are trained on the sentences only.
In addition to neural models, we also report results for Phrase-Based Statistical MT (PB-SMT), using Moses (Koehn et al., 2007). The PB-SMT models are trained using the same data conditions as s2s (no dependency trees) and use the standard setup in Moses, except for the language model, where we use a 5-gram LM trained on the target side of the respective parallel corpus. 8 Evaluation We report results in terms of BLEU and CHRF++, using case-sensitive versions of both metrics. Other settings are kept the same as in the AMR generation experiments ( §4.1). For PB-SMT, we also report the median result of 5 runs, obtained by tuning the model using MERT (Och and Ney, 2002) 5 times. 8 Note that target data is segmented using BPE, which is not the usual setting for PB-SMT. We decided to keep the segmentation to ensure data conditions are the same.
There is a deeper issue at stake .  Figure 4: Top: a sentence with its corresponding dependency tree. Bottom: the transformed tree into a Levi graph with additional sequential connections between words (dashed lines). The full graph also contains reverse and self edges, which are omitted in the figure. Table 2 shows the results on the respective test set for both language pairs. The g2s models, which do not account for sequential information, lag behind our baselines. This is in line with the findings of Bastings et al. (2017), who found that having a BiRNN layer was key to obtain the best results. However, the g2s+ models outperform the baselines in terms of BLEU scores under the same parameter budget, in both single model and ensemble scenarios. This result show that it is possible to incorporate sequential biases in our model without relying on RNNs or any other modification in the architecture.  Table 2: Results for syntax-based NMT on the test sets. All score differences between our models and the corresponding baselines are significantly different (p<0.05), including the negative CHRF++ result for En-Cs.

Results and analysis
Interestingly, we found different trends when analysing the CHRF++ numbers. In particular, this metric favours the PB-SMT models for both language pairs, while also showing improved performance for s2s in En-Cs. CHRF++ has been shown to better correlate with human judgments compared to BLEU, both at system and sentence level for both language pairs (Bojar et al., 2017), which motivated our choice as an additional metric. We leave further investigation of this phenomena for future work.
We also show some of the results reported by Bastings et al. (2017) in Table 2. Note that their results were based on a different implementation, which may explain some variation in performance. Their BoW+GCN model is the most similar to ours, as it uses only an embedding layer and a GCN encoder. We can see that even our simpler g2s model outperforms their results. A key difference between their approach and ours is the Levi graph transformation and the resulting hidden vectors for edges. We believe their architecture would also benefit from our proposed transformation. In terms of baselines, s2s performs better than their BiRNN model for En-De and comparably for En-Cs, which corroborates that our baselines are strong ones. Finally, our g2s+ single models outperform their BiRNN+GCN results, in particular for En-De, which is further evidence that RNNs are not necessary for obtaining the best performance in this setting.
An important point about these experiments is that we did not tune the architecture: we simply employed the same model we used in the AMR generation experiments, only adjusting the dimensionality of the encoder to match the capacity of the baselines. We speculate that even better results would be obtained by tuning the architecture to this task. Nevertheless, we still obtained improved performance over our baselines and previous work, underlining the generality of our architecture.

Related work
Graph-to-sequence modelling Early NLP approaches for this problem were based on Hyperedge Replacement Grammars (Drewes et al., 1997, HRGs). These grammars assume the transduction problem can be split into rules that map portions of a graph to a set of tokens in the output sequence. In particular, Chiang et al. (2013) defines a parsing algorithm, followed by a complexity analysis, while Jones et al. (2012) report experiments on semantic-based machine translation using HRGs. HRGs were also used in previous work on AMR parsing (Peng et al., 2015). The main drawback of these grammar-based approaches though is the need for alignments between graph nodes and surface tokens, which are usually not available in gold-standard form.
Neural networks for graphs Recurrent networks on general graphs were first proposed un-der the name Graph Neural Networks (Gori et al., 2005;Scarselli et al., 2009). Our work is based on the architecture proposed by Li et al. (2016), which add gating mechanisms. The main difference between their work and ours is that they focus on problems that concern the input graph itself such as node classification or path finding while we focus on generating strings. The main alternative for neural-based graph representations is Graph Convolutional Networks (Bruna et al., 2014;Duvenaud et al., 2015;, which have been applied in a range of problems. In NLP, Marcheggiani and Titov (2017) use a similar architecture for Semantic Role Labelling. They use heuristics to mitigate the parameter explosion by grouping edge labels, while we keep the original labels through our Levi graph transformation. An interesting alternative is proposed by Schlichtkrull et al. (2017), which uses tensor factorisation to reduce the number of parameters.
Applications Early work on AMR generation employs grammars and transducers (Flanigan et al., 2016;Song et al., 2017). Linearisation approaches include (Pourdamghani et al., 2016) and (Konstas et al., 2017), which showed that graph simplification and anonymisation are key to good performance, a procedure we also employ in our work. However, compared to our approach, linearisation incurs in loss of information. MT has a long history of previous work that aims at incorporating syntax (Wu, 1997;Yamada and Knight, 2001;Galley et al., 2004;Liu et al., 2006, inter alia). This idea has also been investigated in the context of NMT. Bastings et al. (2017) is the most similar work to ours, and we benchmark against their approach in our NMT experiments. Eriguchi et al. (2016) also employs source syntax, but using constituency trees instead. Other approaches have investigated the use of syntax in the target language (Aharoni and Goldberg, 2017;Eriguchi et al., 2017). Finally, Hashimoto and Tsuruoka (2017) treats source syntax as a latent variable, which can be pretrained using annotated data.

Discussion and Conclusion
We proposed a novel encoder-decoder architecture for graph-to-sequence learning, outperforming baselines in two NLP tasks: generation from AMR graphs and syntax-based NMT. Our approach addresses shortcomings from previous work, including loss of information from lineari-sation and parameter explosion. In particular, we showed how graph transformations can solve issues with graph-based networks without changing the underlying architecture. This is the case of the proposed Levi graph transformation, which ensures the decoder can attend to edges as well as nodes, but also to the sequential connections added to the dependency trees in the case of NMT. Overall, because our architecture can work with general graphs, it is straightforward to add linguistic biases in the form of extra node and/or edge information. We believe this is an interesting research direction in terms of applications.
Our architecture nevertheless has two major limitations. The first one is that GGNNs have a fixed number of layers, even though graphs can vary in size in terms of number of nodes and edges. A better approach would be to allow the encoder to have a dynamic number of layers, possibly based on the diameter (longest path) in the input graph. The second limitation comes from the Levi graph transformation: because edge labels are represented as nodes they end up sharing the vocabulary and therefore, the same semantic space. This is not ideal, as nodes and edges are different entities. An interesting alternative is Weave Module Networks (Kearnes et al., 2016), which explicitly decouples node and edge representations without incurring in parameter explosion. Incorporating both ideas to our architecture is an research direction we plan for future work.

A Simplification and Anonymisation for AMR graphs
The procedure for graph simplification and anonymisation is similar to what is done by (Konstas et al., 2017). The main difference is that we use the alignments provided by the original LDC version of the AMR corpus, while they use a combination of the JAMR aligner (Flanigan et al., 2014) and the unsupervised aligner of (Pourdamghani et al., 2014). This preprocessing is done before transforming the graph into its bipartite equivalent.
Simplification: we remove sense information from the concepts. For instance, believe-01 becomes believe. We also remove any subgraphs related to wikification (the ones starting the predicate :wiki).
Entity anonymisation: all subgraphs starting with the predicate :name are anonymised. The predicate contains one of the AMR entity names as the source node, such as country or * -quantity. These are replaced by a single anonymised node, containing the original entity concept plus an index. At training time, we use the alignment information to find the corresponding entity in the surface form and replace all aligned tokens with the same concept name. At test time, we extract a map that maps the anonymised entity to all concept names in the subgraph, in depth-first order. After prediction, if there is an anonymised token in the surface form it is replaced using that map (as long as the predicted token is present in the map).
Enitity clustering: entities are also clustered into four coarse-grained types in both graph and surface form. For instance country 0 becomes loc 0. We use the list obtained from the open source implementation of (Konstas et al., 2017) for that.
Date anonymisation: if there is a date-entity concept, all underlying concepts are also anonymised, using separate tokens for day, month and year. At training time, the surface form is also anonymised following the alignment, but we additionally split days and months into day name, day number, month name and month number. At test time, we render the day/month according to the predicted anonymised token in the surface form and the recorded map.

B Model Hyperparameters
Our implementation is based on the Sockeye toolkit for Neural Machine Translation (Hieber et al., 2017). Besides the specific hyperparameter values mentioned in the paper, all other hyperparameters are set to the default values in Sockeye. We detail them here for completeness:

B.1 Vocabulary
• For AMR, we set the minimum frequency to 2 in both source nodes and target surface tokens. For NMT, we also use 2 as the minimum frequency in the source but use 1 in the target since we use BPE tokens.

B.2 Model
• The baseline encoder use a BiLSTM followed by a unidirectional LSTM. The decoder in all models use a 2-layer LSTM.
• The attention module uses a bilinear scoring function (general as in (Luong et al., 2015)).
• The max sequence length during training is 200 for AMR. In NMT, we use 100 for the s2s baselines and 200 for the g2s models. This is because we do not use dependency trees in the baseline so it naturally has half of the tokens.
• All dimensionalities are fixed to 512, which is similar to what is used by (Konstas et al., 2017). The only exceptions are the GGNN hidden state dimensionality in the g2s models. We use 576 for the g2s models used in the AMR experiments and 448 for the g2s+ models used in the NMT experiments. As pointed out in the main paper, we change GGNN dimensionalities in order to have a similar parameter budget compared to the s2s baselines.

B.3 Training
These options apply for both s2s baselines and g2s g2s+ models.
• We use 16 as the batch size. This is a lower number than most previous work we compare with: we choose this because we obtained better results in the AMR dev set. This is in line with recent evidence showing that smaller batch sizes lead to better generalisation performance (Keskar et al., 2017). The drawback is that smaller batches makes training time slower. However, this was not a problem in our experiments due to the medium size of the datasets.
• Bucketing is used to speed up training: we use 10 as the bucket size.
• Models are trained using cross-entropy as the loss.
• We save parameter checkpoints at every full epoch on the training set.
• We use early stopping by perplexity on the dev set with patience 8 (training stops if dev perplexity do not improve for 8 checkpoints).
• A maximum of 30 epochs/checkpoints is used. All our models stopped training before reaching this limit.
• We use 0.5 dropout on the input embeddings, before they are fed to the encoder.
• Weigths are initalised using Xavier initialisation (Glorot and Bengio, 2010), with except of forget biases in the LSTMs which are initialised by 0.
• We use Adam (Kingma and Ba, 2015) as the optimiser with 0.0003 as the initial learning rate.
• Learning rate is halved every time dev perplexity does not improve for 3 epochs/checkpoints.
• Gradient clipping is set to 1.0.

B.4 Decoding
• We use beam search to decode, using 5 as the beam size.
• Ensembles are created by averaging log probabilities at every step (linear in Sockeye). Attention scores are averaged over the 5 models at every step.
• For AMR only, we replace <unk> tokens in the prediction with the node with the highest attention score at that step, in the same way described by (Konstas et al., 2017). This is done before deanonymisation.