Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders

The Multilingual Surface Realization Shared Task 2019 focuses on generating sentences from lemmatized sets of universal dependency parses with rich features. This paper describes the results of our participation in the deep track. The core innovation in our approach is to use a graph convolutional network to encode the dependency trees given as input. Upon adding morphological features, our system achieves the third rank without using data augmentation techniques or additional components (such as a re-ranker).


Introduction
The goal in the Multilingual Surface Realization Shared Task 2019 (MSR'19) is to generate fluent text from Universal Dependencies (UD) structures. The task makes available UD-annotated resources in 11 languages for the shallow task, and three languages (English, Spanish, French) for the deep track. Developing surface generation systems that are largely language-independent is a central objective of the shared task (Mille et al., 2018). To generate sentences based on the UD structure and morphological features, recent neural approaches mainly adopt sequence-to-sequence architectures (Cabezudo and Pardo, 2018; Madsack et al., 2018; Elder and Hokamp, 2018). While representing the feature-rich data in a linearized manner proved to be a viable option, we argue that these linear sequences do not optimally exploit the input information. We therefore propose to encode the dependency trees using a graph convolutional network (GCN) and find that this GCN encoder leads to a substantial boost in performance compared to a sequential encoder.
The datasets in the deep track consist of semantic representations induced from syntactic dependency parses; see Figure 1 for an example. This task reflects the information that is realistically available in real-world natural language generation settings.

Figure 1: An example of a UD structure with concatenated feature embeddings from the MSR'19 deep task.
Our method works as follows: We first apply delexicalization to the datasets, replacing rare tokens with placeholders. Next, we encode the dependency trees using graph representation learning techniques (Li et al., 2015; Xu et al., 2018a) in order to improve the encoding of structured data within the encoder-decoder architecture. Our model hence learns a mapping from graph inputs to sequence outputs. The ablation study in our evaluation demonstrates that encoding the UD structure in this manner embeds additional semantic information and improves performance across the three languages available for the deep track (i.e., English, French, and Spanish). Finally, we use an LSTM decoder with copy mechanism and attention to generate the surface text.
Our contributions are as follows:
1. We show that a GCN encoder for UD input structures outperforms sequential encoders.
2. We propose to use a variant of relational GCN (R-GCN) to better represent edge labels in the graph, and show that this boosts overall performance.
3. We show that structural encoding with the GCN benefits all three languages in the task.

Related Work

Neural NLG
Systems proposed as part of the Surface Realization Shared Task 2018 are largely sequence-to-sequence models targeting the shallow task. Most past systems contain two separate components: 1) preprocessing of the UD dataset, and 2) a neural generator with an encoder-decoder architecture.
Most neural generators combine features by concatenating the aligned feature sequences and feeding them as a single sequence into the neural generator (Elder and Hokamp, 2018; Madsack et al., 2018). In these systems, a pre-trained embedding is typically used to represent each lemma before it is concatenated with embeddings of surface-level morphological categories and dependency relations. A form of Recurrent Neural Network (RNN) is used to map the input to a latent space, and another RNN then decodes it into the target output. Common choices are the Long Short-Term Memory (LSTM) or the bidirectional LSTM, as used in Elder and Hokamp (2018) and Madsack et al. (2018).

Graph-to-text Generation
Since a dependency tree is a special case of a directed acyclic graph, surface realization is a graph-to-text generation task. Graph neural networks have been successfully applied to various graph-to-text generation tasks such as SQL-to-text generation (Xu et al., 2018b), AMR-to-text generation (Beck et al., 2018) and semantic machine translation (Song et al., 2019). LSTMs can also be modified to model graph-level information (Song et al., 2018). Graph Convolutional Networks (GCNs), originally designed for semi-supervised learning of node representations in graphs (Kipf and Welling, 2017), explicitly exploit tree-structured data and outperform LSTMs and TreeLSTMs on AMR-to-text generation (Damonte and Cohen, 2019). To also model different types of edges in graphs, Relational Graph Convolutional Networks (R-GCNs) represent each edge type with a corresponding parameter matrix (Schlichtkrull et al., 2018). We leverage the R-GCN by grouping in-edges and out-edges together and apply it to a graph-to-text generation task.

Feature Representations
The input format of the MSR'19 deep track is multi-source in the sense that each feature type corresponds to its own sequence, i.e., part-of-speech (POS) tags, morphological features, etc. As shown in Figure 1, we transform the tree-structured data into a graph. We construct node representations by simply concatenating a token and its features, and then use an embedding matrix to map these representations into a low-dimensional vector space.
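The concatenation of per-feature embeddings into a single node vector can be sketched as follows. This is a minimal numpy illustration: the vocabularies, feature types, and embedding dimensions are our own toy assumptions, not the actual shared-task inventories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies for three feature types (illustrative only).
lemma_vocab = {"<unk>": 0, "nominate": 1, "individual": 2}
pos_vocab = {"<unk>": 0, "VERB": 1, "NOUN": 2}
dep_vocab = {"<unk>": 0, "root": 1, "obj": 2}

# One embedding matrix per feature type; the node representation is the
# concatenation of the per-feature embeddings.
d_lemma, d_pos, d_dep = 8, 4, 4
E_lemma = rng.normal(size=(len(lemma_vocab), d_lemma))
E_pos = rng.normal(size=(len(pos_vocab), d_pos))
E_dep = rng.normal(size=(len(dep_vocab), d_dep))

def node_embedding(lemma, pos, dep):
    """Concatenate the embeddings of a token's lemma and its features."""
    return np.concatenate([
        E_lemma[lemma_vocab.get(lemma, 0)],
        E_pos[pos_vocab.get(pos, 0)],
        E_dep[dep_vocab.get(dep, 0)],
    ])

x = node_embedding("nominate", "VERB", "root")
print(x.shape)  # (16,)
```

In practice, further feature types from the UD annotation would simply extend the concatenation in the same way.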
To handle rare words among the input tokens, we first perform delexicalization on all datasets as follows: 1. Replace tokens whose part-of-speech tag is NAME, PROPN, NUM or X with placeholders jointly indexed by the number of the head and the number of entities.
2. Build a dictionary from placeholders to original tokens for each input-output pair.
After obtaining the model output, we lexicalize the text by looking up each generated placeholder in the corresponding dictionary and inserting the original token. For our official submission to the shared task, we did not make use of features, in order to see whether the dependency tree alone is informative enough for surface realization. However, we performed additional experiments to show the effectiveness of the GCN encoder with selected concatenated features, see Table 1.
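The delexicalization and lexicalization steps above can be sketched as follows. Note that the placeholder naming and per-tag indexing scheme here are our own simplification for illustration, not the exact joint indexing used in the system.

```python
# POS tags whose tokens are replaced by placeholders.
RARE_POS = {"NAME", "PROPN", "NUM", "X"}

def delexicalize(tokens, pos_tags):
    """Replace rare-POS tokens with indexed placeholders and record a
    placeholder -> original-token dictionary for later lexicalization."""
    out, mapping, counter = [], {}, {}
    for tok, pos in zip(tokens, pos_tags):
        if pos in RARE_POS:
            counter[pos] = counter.get(pos, 0) + 1
            ph = f"<{pos}_{counter[pos]}>"
            mapping[ph] = tok
            out.append(ph)
        else:
            out.append(tok)
    return out, mapping

def lexicalize(tokens, mapping):
    """Look up each generated placeholder and insert the original token."""
    return [mapping.get(tok, tok) for tok in tokens]

toks, mapping = delexicalize(["Bush", "nominate", "two", "individual"],
                             ["PROPN", "VERB", "NUM", "NOUN"])
print(toks)  # ['<PROPN_1>', 'nominate', '<NUM_1>', 'individual']
print(lexicalize(toks, mapping))  # ['Bush', 'nominate', 'two', 'individual']
```

One dictionary is built per input-output pair, so placeholders generated by the decoder can be resolved unambiguously within each sentence.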

Model
The graph-to-text generation task takes a directed acyclic graph G = {V, E} as input, where V is a set of nodes and E is a set of directed edges e between nodes. In this paper, a node is an embedding vector containing a token and its features, and an edge is the dependency relation between two nodes. The output Y is a sequence of tokens which form a sentence expressing the input. We extend the architecture by Marcheggiani and Perez-Beltrachini (2018), which combines a graph convolutional encoder and an attentional LSTM decoder, as described in Figure 2.

R-GCN Encoder
The GCN encoder exploits the graph-structured input explicitly. Given a directed graph G, we represent each node with an embedding vector x_v ∈ R^d. The l-th R-GCN layer then computes the hidden representation of node v in the (l+1)-th layer as follows:

h_v^(l+1) = f( Σ_{u ∈ N(v)} W_{e_{u,v}} h_u^(l) + W h_v^(l) ),

where W, W_e ∈ R^{d×h} and e ∈ E, f is the rectified linear unit (ReLU), a non-linear activation function, and N(v) is the set of all neighbours of node v. This design is over-parameterized, and there is no parameter sharing between similar edge labels. We therefore redesign the update rule as:

h_v^(l+1) = f( Σ_{u ∈ N(v)} r_{e_{u,v}} • ( W_{dir(e_{u,v})} h_u^(l) ) ),

where "•" is the Hadamard product, W_{dir(e)} ∈ R^{d×h} with dir(e) ∈ {in, out} representing the direction of the edge e_{u,v}, and r_e ∈ R^h is an embedding vector of the label of e_{u,v}. Each layer aggregates the direct neighbours of each node. To model neighbours of neighbours, we stack L GCN layers, where L is set to the average radius of all graphs (here, the average depth of all trees). Stacking GCN layers into deep neural networks can lead to the vanishing gradient problem, so we add residual connections (He et al., 2016) or dense connections (Huang et al., 2017) to each layer.
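A minimal numpy sketch of the parameter-sharing R-GCN update, assuming two direction-specific weight matrices (in/out) shared across all edge labels plus one small embedding vector per edge label; the graph, dimensions, and initialization are toy values of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_labels = 6, 5, 3

# Direction-specific weight matrices, shared across all edge labels,
# plus one embedding vector per edge label (r_e in the text).
W = {"in": rng.normal(size=(d, h)), "out": rng.normal(size=(d, h))}
R = rng.normal(size=(n_labels, h))  # edge-label embeddings

def relu(x):
    return np.maximum(x, 0.0)

def rgcn_layer(X, edges):
    """One layer of the parameter-sharing R-GCN update: for each node v,
    sum r_e * (W_dir(e) @ h_u) over its neighbours u, then apply ReLU."""
    H = np.zeros((X.shape[0], h))
    for u, v, label in edges:
        # The edge u -> v is an in-edge of v and an out-edge of u.
        H[v] += R[label] * (X[u] @ W["in"])
        H[u] += R[label] * (X[v] @ W["out"])
    return relu(H)

X = rng.normal(size=(4, d))               # 4 nodes with d-dim embeddings
edges = [(0, 1, 0), (1, 2, 1), (1, 3, 2)]  # (head, dependent, label id)
H1 = rgcn_layer(X, edges)
print(H1.shape)  # (4, 5)
```

Stacking L such layers (with residual or dense connections between them) lets information propagate L hops along the tree.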

LSTM Decoder
We apply stacked LSTM layers (Hochreiter and Schmidhuber, 1997) as the decoder on top of the GCN. The first layer is an input-feed LSTM (Luong et al., 2015) that aggregates the hidden representations of the nodes into one hidden vector h_C for the whole graph. The second LSTM layer decodes the hidden vector and generates the representation of the output token at each time step. We use global attention (Luong et al., 2015) to re-weight the hidden representations from the first layer and merge them into a global hidden vector h_G. In order to generate placeholders directly from the input, we apply the copy mechanism (Gu et al., 2016), which is effective when using lexicalization. The probability of token y_t conditioned on input G and previous tokens y_{1:t-1} is obtained by applying a softmax layer on the decoder output: P(y_t | y_{1:t-1}, G) = softmax(g(h_G, h_C)), where g is a perceptron. The model is trained to maximize the likelihood L = Π_{t=1}^{|Y|} P(y_t | y_{1:t-1}, G).

Table 1: (*) denotes the system without morphological features, which is also our official submission to the shared task.
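The global attention step can be sketched as follows. This is a minimal dot-product (Luong-style) attention in numpy; the dimensions and the scoring function are illustrative assumptions, not the exact configuration of our decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(node_states, query):
    """Luong-style global attention: score each encoder node state against
    the decoder query, then take the attention-weighted sum as h_G."""
    scores = node_states @ query       # dot-product scores, one per node
    alpha = softmax(scores)            # attention weights over nodes
    return alpha @ node_states, alpha  # global context vector h_G

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 8))   # hidden representations of 5 nodes
query = rng.normal(size=8)        # decoder state at the current step
h_G, alpha = global_attention(nodes, query)
print(h_G.shape, round(alpha.sum(), 6))  # (8,) 1.0
```

The resulting h_G, together with the graph summary h_C, feeds the perceptron g that produces the output distribution.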

Experiments
We built our system on a variant of OpenNMT-py (Klein et al., 2017) from Marcheggiani and Perez-Beltrachini (2018) with customized encoders. We construct the training and validation datasets by concatenating the corresponding splits of all available corpora for each language. We stack 4 R-GCN layers with dense connections as the encoder and train the model with a dropout rate of 0.5. We perform early stopping when the training accuracy exceeds the validation accuracy and choose the checkpoint before over-fitting for evaluation.

Table 2: Example outputs of the different encoders.
Encoder  Output
BiLSTM   President Bush threw two members to replace manufacturers in the Washington area to replace manufacturers in federal nations.
GCN      In Tuesday, President Bush commissioned two connections to replace the federal individual of federal statements in the Washington area.
R-GCN    In Tuesday, President Bush nominated two individuals to replace jurist trials to the Washington area.
Gold     President Bush on Tuesday nominated two individuals to replace retiring jurist on federal courts in the Washington area.

Encoder Model Selection
As indicated in Table 1, we compare the R-GCN with different encoders. Systems are evaluated on the validation set of the UD English EWT (en_ewt-ud-dev) corpus. With the same linearized inputs, we began with an LSTM encoder before moving on to a bidirectional LSTM (BiLSTM). With an improvement of 2.4 BLEU points, the BiLSTM appeared to be the better sequential encoder. Next, we employed the GCN variant of Marcheggiani and Perez-Beltrachini (2018) with four fully-connected layers. This change gave an additional boost of 4.7 BLEU points, significantly outperforming the sequential encoders. We then compared our R-GCN model to the GCN, obtaining a further 3.91 BLEU points. Finally, adding dense connections to the R-GCN, termed R-GCN(dense), results in 41.01 BLEU points on the validation set, an overall improvement of 12 BLEU points over the initial LSTM encoder.

Ablation Study: Decoder
We investigate whether the LSTM decoder can be further modified for improvement. Two such changes are the copy mechanism and coverage attention. The copy mechanism has been shown to be beneficial in numerous similar tasks such as data-to-text generation (Li and Wan, 2018). With the addition of the copy mechanism, while keeping the encoder unchanged, we observe an average improvement of 1 BLEU point; reusing the global attention for the copy mechanism gives the system another 5 BLEU point boost.
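The copy mechanism's output distribution can be sketched as a mixture of the decoder's vocabulary distribution and a copy distribution induced by the attention weights over input nodes. This is a generic sketch of such mixing (in the style of pointer-generator networks), with toy numbers; the exact gating used in our system may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_generate(vocab_logits, attn_weights, src_token_ids, p_gen):
    """Mix the vocabulary distribution with a copy distribution: with
    probability p_gen generate from the vocabulary, otherwise copy an
    input token proportionally to its attention weight."""
    p_vocab = softmax(vocab_logits)
    p = p_gen * p_vocab
    for attn, tok in zip(attn_weights, src_token_ids):
        p[tok] += (1.0 - p_gen) * attn  # copy probability mass
    return p

vocab_logits = np.array([0.1, 0.5, -0.2, 1.0])
attn = np.array([0.7, 0.2, 0.1])  # attention over 3 input nodes
src_ids = [2, 0, 3]               # vocabulary ids of those input tokens
p = copy_generate(vocab_logits, attn, src_ids, p_gen=0.8)
print(round(p.sum(), 6))  # 1.0
```

Because placeholders appear verbatim in the input, copying them directly is what makes the mechanism interact well with delexicalization.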

Discussion
Our analysis shows that structural encoding of the UD trees leads to substantial improvements in performance. It also shows that including morphological features is crucial to the performance of the surface realizer: without these features, we observe many errors in tense and agreement.
We also analyzed the system outputs to look for evidence to substantiate the intuition that a structural encoder can better represent the a priori linguistic information. One such example is shown in Table 2, where fluency improves going from the BiLSTM encoder to the GCN, and finally to the R-GCN.
We report the results of our submissions in Table 3. Compared to the validation results, GCN(*), trained without morphological features, performs similarly on the validation and test sets of each corpus; R-GCN(dense), however, shows a significant drop from validation to test, i.e., it over-fits. Importantly, we notice substantial BLEU rises and drops going from GCN(*) to R-GCN(dense) on the test sets. We postulate that adding relational modeling of edges (R-GCN) on top of rich features constrains the model to learn a specific subset of a priori linguistic structures, thereby hurting overall performance.

Conclusion
We have shown that, without additional modules such as a re-ranker or data augmentation, the traditional encoder-decoder architecture can still be competitive by exploiting the existing structural input information. For future work, we intend to see whether the performance can be further improved with pre-trained language models such as GPT-2 (Radford et al.).