Generating High-Quality Surface Realizations Using Data Augmentation and Factored Sequence Models

This work presents state-of-the-art results in reconstructing surface realizations from obfuscated text. We identify the lack of sufficient training data as the major obstacle to training high-performing models, and address this issue by generating large amounts of synthetic training data. We also propose preprocessing techniques which make the structure contained in the input features more accessible to sequence models. Our models were ranked first on all evaluation metrics in the English portion of the 2018 Surface Realization shared task.


Introduction
Contextualized Natural Language Generation (NLG) is a long-standing goal of Natural Language Processing (NLP) research. The task of generating text, conditioned on knowledge about the world, is applicable to almost any domain. However, despite recent advances in specific domains, NLG models still produce relatively low quality outputs in many settings. Representing the context in a consistent manner is still a challenge: how can we condition output on a stateful structure such as a graph or a tree?
Several shared tasks have recently explored NLG from inputs with graph-like structures: RDF triples (Colin et al., 2016), dialogue act-based meaning representations (Novikova et al., 2017) and abstract meaning representations (May and Priyadarshi, 2017). In each of these challenges, the input has structure beyond simple linear sequences; however, to date, the top results in these tasks have consistently been achieved using relatively standard sequence-to-sequence models.
The surface realization task is a conceptually simple challenge: given shuffled input, where tokens are represented by their lemmas, parts of speech, and dependency features, can we train a model to reconstruct the original text? A model that performs well at this task is likely to be a good starting point for solving more complex tasks, such as NLG from Resource Description Framework (RDF) graphs or Abstract Meaning Representation (AMR) structures. In addition, training data for the surface realization task can be generated in a fully automated manner.
In this work, we show that training dataset size may be the major obstacle preventing current sequence-to-sequence models from doing well at NLG from structured inputs. Although inputting the structures themselves is theoretically appealing (Tai et al., 2015), in many domains it may be enough to use sequential inputs by flattening structures and providing structural information via input factors, as long as the training dataset is sufficiently large. By augmenting training data using a large corpus of unannotated data, we obtain a new state of the art in the surface realization task using off-the-shelf sequence-to-sequence models.
In addition, we show that the parse features implicitly carry essential information about the word order of correct output sequences, confirming that structural information cannot be discarded without a large drop in performance.
The main contributions of this work are:
1. We show how training datasets can be augmented with synthetic data.
2. We apply preprocessing steps to simplify the universal dependency structures, making the structure more explicit.
3. We evaluate pointer models for the surface realization task.

The Surface Realization Shared Task
In the shallow track of the 2018 surface realization (SR) shared task, inputs consist of tokens from a universal dependency (UD) tree provided in the form of lemmas. The original order of the sequence is obfuscated by random shuffling. Models are evaluated on their ability to reconstruct the original, unshuffled input which generated the features. To do this, models must use structural information to reorder the tokens correctly, and part-of-speech and/or dependency parse labels to restore the correct surface realization of lemmas. Note that we focus on the English sub-task, where word order is critical because of the typologically analytic nature of English; for other languages, restoring word order may be less important, while deriving surface realizations from lemmas may be much more challenging.

Augmenting Training with Synthetic Datasets
To augment the SR training data, we used sentences from the WikiText corpus (Merity et al., 2016). Each of these sentences was parsed using UDPipe (Straka and Straková, 2017) to obtain the same features provided by the SR organizers. We then filtered this data, keeping only sentences with at least 95% vocabulary overlap with the in-domain SR training data. Note that the input vocabulary for this task is word lemmas, so at least 95% of the tokens in each instance in our additional training data are lemmas which are also found in the in-domain data. The order of tokens in each instance of this additional dataset is then randomly shuffled to simulate the random input order in the SR data.
We thus obtain 642,960 additional training instances, which are added to the 12,375 instances supplied by the SR shared task organizers.
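The filtering and shuffling steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the token representation (dicts with "form" and "lemma" keys) and the function names are assumptions, and the parsing step with UDPipe is taken as already done.

```python
import random

def lemma_overlap(lemmas, in_domain_vocab):
    """Fraction of tokens whose lemma also occurs in the in-domain vocabulary."""
    if not lemmas:
        return 0.0
    return sum(1 for lemma in lemmas if lemma in in_domain_vocab) / len(lemmas)

def build_synthetic_instances(parsed_sentences, in_domain_vocab,
                              threshold=0.95, seed=0):
    """Keep sentences with enough lemma overlap, then shuffle their tokens.

    parsed_sentences: list of sentences, each a list of token dicts with the
    (assumed) keys "form" and "lemma", in original word order.
    Returns a list of (shuffled_tokens, target_forms) pairs.
    """
    rng = random.Random(seed)
    instances = []
    for sentence in parsed_sentences:
        lemmas = [tok["lemma"] for tok in sentence]
        if lemma_overlap(lemmas, in_domain_vocab) < threshold:
            continue  # too many lemmas unseen in the in-domain data
        target = [tok["form"] for tok in sentence]  # original order is the target
        shuffled = list(sentence)
        rng.shuffle(shuffled)  # simulate the random input order of the SR data
        instances.append((shuffled, target))
    return instances
```

Computing the overlap per token rather than per type matches the description that at least 95% of the tokens in each instance are in-domain lemmas.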

Leveraging Structured Features
Because we have the dependency parse features for each input, some information about word order is implicitly available from the parse information, but discovering the structural relationship between the dependency parse features and the order of words in the output sequence is likely to be challenging for our sequence-to-sequence model. Therefore, we reconstruct the original parse tree from the dependency features, and perform a depth-first search to sort and reorder the lemmas. This is similar to the linearization step performed by Konstas et al. (2017), the main difference being that we choose randomly between child nodes instead of using a predetermined order based on edge types.
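A depth-first linearization with random child order might look like the sketch below. The token format (dicts with "id", "head", and "lemma", with head 0 marking the root as in CoNLL-U) and the choice of a pre-order traversal, where each head is emitted before its dependents, are assumptions made for illustration; the text above does not specify them.

```python
import random

def linearize(tokens, seed=None):
    """Reorder shuffled tokens by a depth-first traversal of the dependency tree.

    tokens: list of dicts with the (assumed) keys "id", "head", "lemma".
    Children of each node are visited in random order rather than in an
    order determined by edge type.
    """
    rng = random.Random(seed)
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok)

    ordered = []
    stack = list(children.get(0, []))  # start from the root(s)
    while stack:
        node = stack.pop()
        ordered.append(node)  # pre-order: emit a head before its dependents
        kids = list(children.get(node["id"], []))
        rng.shuffle(kids)  # random choice between child nodes
        stack.extend(kids)
    return ordered
```

After this step, tokens belonging to the same subtree are contiguous in the input, which is the kind of locality a sequence model can exploit.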
In order to further augment the available context, we experiment with adding potential delemmatized forms for each input lemma. The possible forms for each lemma were found by creating a map from (lemma, xpos) → form, using the WikiText dataset. For each input lemma and xpos, we then check for the pair in the map; if it exists, the corresponding form is appended to the sequence. This makes forms available to the pointer model for copying.
For some (lemma, xpos) pairs there are multiple potential forms. When this occurs, we add all potential forms to the input sequence. The mapping was found to cover 98.9% of cases in the development set.
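The delemmatization map can be sketched as a dictionary from (lemma, xpos) pairs to sets of observed forms. The function names are hypothetical, and placing the candidate forms at the end of the input sequence (rather than next to each lemma) is an assumption, since the text only says that forms are appended.

```python
from collections import defaultdict

def build_form_map(tagged_corpus):
    """Map (lemma, xpos) -> sorted list of surface forms observed in a corpus.

    tagged_corpus: iterable of (form, lemma, xpos) triples.
    """
    forms = defaultdict(set)
    for form, lemma, xpos in tagged_corpus:
        forms[(lemma, xpos)].add(form)
    return {key: sorted(values) for key, values in forms.items()}

def append_candidate_forms(tokens, form_map):
    """Return the lemma sequence followed by every candidate form.

    tokens: list of (lemma, xpos) pairs. When a pair maps to several forms,
    all of them are added so that a pointer model can copy any of them.
    """
    source = [lemma for lemma, _ in tokens]
    for lemma, xpos in tokens:
        source.extend(form_map.get((lemma, xpos), []))
    return source

def coverage(tokens, form_map):
    """Fraction of (lemma, xpos) pairs that have at least one form in the map."""
    if not tokens:
        return 0.0
    return sum(1 for pair in tokens if pair in form_map) / len(tokens)
```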

Factored Inputs
Factored models were introduced by Alexandrescu et al. (2006) as a means of including additional features beyond word tokens in neural language models. The key idea is to create a separate embedding representation for each feature type, and to concatenate the embeddings for each input token to create its dense representation. Sennrich et al. (2016) showed that this technique is quite effective for neural machine translation, and some recent work, such as Hokamp (2017), has also used factored inputs. The embedding $e_j$ for the $j$-th input token with factors $F$ is created as in Eq. 1:

$$e_j = \big\Vert_{k=1}^{|F|} E_k x_{jk} \qquad (1)$$

where $\Vert$ indicates vector concatenation, $E_k$ is the embedding matrix of factor $k$, and $x_{jk}$ is a one-hot vector for the $k$-th input factor. Table 1 lists each of the factors used in our models, along with its corresponding embedding size. The embedding size of 300 for the lemma is set in configuration, while the embedding sizes of the other features are set by OpenNMT-py using the heuristic $|\mathrm{embedding}_k| = |V_k|^{0.7}$, where $|V_k|$ is the vocabulary size of feature $k$. Table 2 gives an example from the training data with actual instantiations of each of the features.
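As a rough illustration of the concatenation and the size heuristic, the sketch below builds one embedding table per factor and joins the per-factor vectors for a token. It uses plain Python lists instead of a neural network library; the factor names in FACTOR_ORDER are a hypothetical subset (Table 1 is not reproduced here), and truncating the heuristic to an integer is an assumption.

```python
import random

# Hypothetical subset of input factors, in a fixed concatenation order.
FACTOR_ORDER = ("lemma", "upos", "xpos", "deprel")

def feature_embedding_size(vocab_size, exponent=0.7):
    """Heuristic size |V_k| ** 0.7, truncated to an integer (assumption)."""
    return int(vocab_size ** exponent)

def make_tables(vocabs, lemma_dim=300, seed=0):
    """Create one randomly initialised embedding table per factor."""
    rng = random.Random(seed)
    tables = {}
    for name in FACTOR_ORDER:
        vocab = vocabs[name]
        # The lemma size is configured; other sizes follow the heuristic.
        dim = lemma_dim if name == "lemma" else feature_embedding_size(len(vocab))
        tables[name] = {
            symbol: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for symbol in vocab
        }
    return tables

def embed_token(token, tables):
    """Concatenate the embeddings of all factors of one token."""
    vector = []
    for name in FACTOR_ORDER:
        # A dictionary lookup plays the role of E_k x_jk for a one-hot x_jk.
        vector.extend(tables[name][token[name]])
    return vector
```

The length of the resulting vector is the sum of the per-factor embedding sizes, so small feature vocabularies add only a few dimensions to the 300-dimensional lemma embedding.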

Model
Models were trained using the OpenNMT-py toolkit (Klein et al., 2017). The model architecture is a 1-layer bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and attention (Luong et al., 2015). The model has 450 hidden units in the encoder and decoder layers, and 300-dimensional word embeddings, which are learned jointly across the whole model. Dropout of 0.3 is applied between the LSTM stacks. We use a coverage attention layer (Tu et al., 2016) with a lambda value of 1.
The models are trained using stochastic gradient descent with learning rate 1. A learning rate decay of 0.5 is applied at each epoch once perplexity does not decrease on the validation set. Models were trained for 20 epochs. Output was decoded using beam search with beam size 5. Unknown tokens were replaced with the input token that had the highest attention value at that time step (Vinyals et al., 2015). Output from the epoch checkpoint which performed best on the development set was chosen for test set submission.
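The unknown-token replacement step can be sketched as a simple post-processing function. This is an illustration under assumed data shapes (a list of attention rows, one per decoding step) and an assumed "<unk>" symbol, not the toolkit's own implementation.

```python
def replace_unknowns(output_tokens, source_tokens, attention, unk="<unk>"):
    """Replace each unknown output token with the most-attended source token.

    attention[t][i] is the attention weight on source position i at
    decoding step t (assumed layout).
    """
    result = []
    for step, token in enumerate(output_tokens):
        if token == unk:
            weights = attention[step]
            # Index of the source position with the highest attention weight.
            best = max(range(len(weights)), key=weights.__getitem__)
            token = source_tokens[best]
        result.append(token)
    return result
```

Because candidate surface forms are part of the input sequence, the copied token can be an inflected form rather than a lemma.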
The exploration and choice of hyperparameters were aided by the Bayesian hyperparameter optimization platform SigOpt (see footnote 2).

Experiments
We experiment with many different combinations of input features and training data, in order to understand which elements of the representation have the largest impact upon performance.
We limit vocabulary size during training to enable the pointer network to generalize to unknown tokens at test time. When using just the SR training data we train word embeddings for the 15,000 most frequent tokens from a possible 23,650 unique tokens. When using the combined SR training data and filtered WikiText dataset we use the 30,000 most frequent tokens from a possible 106,367 unique tokens.
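Limiting the vocabulary to the most frequent tokens can be sketched as below. The function names and the "<unk>" symbol are assumptions for illustration; in practice the toolkit's own preprocessing performs this step.

```python
from collections import Counter

def build_vocab(token_sequences, max_size, unk="<unk>"):
    """Keep the max_size most frequent tokens; index 0 is reserved for unk."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    vocab = {unk: 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(sequence, vocab, unk="<unk>"):
    """Map tokens to ids, sending out-of-vocabulary tokens to the unk id."""
    return [vocab.get(token, vocab[unk]) for token in sequence]
```

Since rare training tokens are mapped to the unknown symbol, the model sees unknown tokens during training and can learn to copy from the input in those positions.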
We trained on a single Tesla K40 GPU. Training time was approximately 1 minute per epoch for the SR data and 1 hour per epoch for the combined SR data and filtered WikiText.

Results
We report results using the automated evaluation metric BLEU (Papineni et al., 2002). On the test set we additionally report NIST (Przybocki et al., 2009).

Table 3: Ablation study with BLEU scores for different configurations on the shallow task development set

Table 3 presents the results of the surface realization experiments. We observe three main components that drastically improve performance over the baseline model:
1. augmenting the training set with more data
2. reordering the input using the dependency parse features
3. providing potential forms via the delemmatization map

Table 4 gives the official SR 2018 results from task organizers. Our system, which corresponds to the best configuration from Table 3

Footnote 2: https://sigopt.com/

Related Work
The surface realization task bears the closest resemblance to the SemEval 2017 shared task AMR-to-text (May and Priyadarshi, 2017). Our approach to data augmentation and preprocessing uses many insights from Neural AMR (Konstas et al., 2017). Traditional data-to-text systems use a rule-based approach (Reiter and Dale, 2000).

Conclusion
The main takeaway from this work is that data augmentation improves performance on the surface realization task. Although unsurprising, this result confirms that sufficient data is needed to achieve reasonable performance, and that flattened structural information such as dependency parse features is insufficient without additional preprocessing to reduce the complexity of the input. The surface realization task is ostensibly quite simple, so it is surprising that baseline sequence-to-sequence models, which perform well in other tasks such as machine translation, cannot solve it. We hypothesize that lemmatized, shuffled input alone does not provide sufficient information to reconstruct the original sequence. In sequences longer than a few words, there is likely to be significant ambiguity without additional structural information such as parse features. However, reconstructing the original sequence from unprocessed, flattened parse information alone is unrealistic using standard encoder-decoder models.
In future work, we plan to explore more challenging variants of this task, while also experimenting with models that do not require feature-specific preprocessing to make use of rich structural information in the input.