Shape of Synth to Come: Why We Should Use Synthetic Data for English Surface Realization

The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language Generation shared tasks with the goal of exploring approaches to surface realization from Universal-Dependency-like trees to surface strings for several languages. In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data, and a new rule prohibiting the use of synthetic data was introduced for the 2019 shared task. Contrary to the findings of the 2018 shared task, we show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect – an improvement of almost 8 BLEU points for a previously state-of-the-art system. We analyse the effects of synthetic data, and we argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.


Introduction
The shallow task of the recent surface realization (SR) shared tasks (Belz et al., 2011; Mille et al., 2018, 2019) appears to be a relatively straightforward problem. Given a tree of lemmas, a system has to restore the original word order of the sentence and inflect its lemmas (see Figure 1). Yet SR systems often struggle, even for a language with relatively fixed word order such as English. Improved performance would facilitate investigation of more complex versions of the shallow task, such as the deep task in which function words are pruned from the tree, which may be of more practical use in pipeline natural language generation (NLG) systems (Moryossef et al., 2019; Elder et al., 2019; Castro Ferreira et al., 2019).
In this paper we explore the use of synthetic data for the English shallow task. Synthetic data is created by taking an unlabelled sentence, parsing it with an open source universal dependency parser, and transforming the result into the input representation.
Unlike in the 2018 shared task, where a system trained with synthetic data performed roughly the same as a system trained on the original dataset (Elder and Hokamp, 2018; King and White, 2018), we find its use leads to a large improvement in performance. The state of the art on the dataset is a BLEU-4 score of 72.7 (Yu et al., 2019b); our system achieves a similar result of 72.3, which improves to 80.1 with the use of synthetic data. We analyse the ways in which synthetic data helps to improve performance, finding that longer sentences are particularly improved and more exactly correct linearizations are generated overall.
Although it is common knowledge that machine learning systems typically benefit from more data, this 7.4 point jump in BLEU over the previous state of the art is important and worth emphasizing. The 2019 shared task introduced a new rule which prohibited the use of synthetic data. This was done in order to make the results of different systems more comparable. However, systems designed with smaller datasets in mind might not scale to the use of synthetic data, and an inadvertent consequence of such a rule is that it may produce results which could be misleading for future research directions.
For instance, the system which was the clear winner of the 2019 shared task (Yu et al., 2019a) used tree-structured long short-term memory (LSTM) networks (Tai et al., 2015). In general, tree LSTMs can be slow and difficult to train.
Critics of neural NLG approaches emphasise that quality and reliability are at the core of production-ready NLG systems. What we are essentially arguing is that if using synthetic data contributes to producing higher quality outputs, then we ought to ensure we are designing systems that can take advantage of synthetic data.

Data
We evaluate on the Surface Realization Shared Task (SRST) 2018 dataset (Mille et al., 2018) for English, which was derived from the Universal Dependency English Web Treebank 2.0.
The training set consists of 12,375 sentences, dev 1,978, test 2,062.

Baseline system
The system we use is an improved version of a previous shared task participant's system (Elder and Hokamp, 2018). This baseline system is a bidirectional LSTM encoder-decoder model. The model is trained with copy attention (Vinyals et al., 2015; See et al., 2017), which allows it to copy unknown tokens from the input sequence to the output. The system performs both linearization and inflection in a single decoding step. To aid inflection, a list is appended to the input sequence containing possible forms for each relevant lemma.
Depth-first linearization (Konstas et al., 2017) is used to convert the tree structure into a linear format, which is required for the encoder. This linearization begins at the root node and adds each subsequent child to the sequence, before returning to the highest node not yet added. Where there are multiple child nodes, one is selected at random. Decoding is done using beam search, and the output sequence length is artificially constrained to contain the same number of tokens as the input.
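The traversal described above can be illustrated with a minimal sketch. The dictionary-based tree format, the function name `linearize` and the use of lemma strings as node identifiers are assumptions made for illustration; they are not taken from the system's code.

```python
import random

def linearize(children, root, rng=None):
    """Return the nodes in depth-first order, visiting siblings in random order.

    `children` maps a node to the list of its child nodes (hypothetical format).
    """
    rng = rng or random.Random()
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        kids = list(children.get(node, []))
        rng.shuffle(kids)      # pick among multiple children at random
        stack.extend(kids)     # a child's whole subtree is emitted before its siblings
    return order

# Toy tree for "I ran across this item", with lemmas as node names.
tree = {"run": ["I", "item"], "item": ["across", "this"]}
print(linearize(tree, "run", random.Random(0)))
```

Calling the function again with a different random generator can yield a different ordering of the same tree, which is the property exploited by the random linearizations discussed in the next section.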

Improvements to baseline
Random linearizations In the baseline system, a single random depth-first linearization of the training data is obtained and used repeatedly to train the model. Instead, we obtain multiple linearizations, so that each epoch of training data potentially contains a different linearization of the same dependency tree. This makes the model more robust to different linearizations, which is helpful as neural networks don't generally deal well with randomness (Juraska et al., 2018).
Scoping brackets Similar to Konstas et al. (2017), we apply scoping brackets around child nodes. This provides further indication of the tree structure to the model, despite using a linear sequence as input.
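As a rough illustration of scoping brackets, the sketch below wraps each child subtree in a pair of bracket tokens during a randomised depth-first traversal. The exact bracket tokens and their placement in the actual system are not specified here, so the choice of "(" and ")" and the tree format are assumptions.

```python
import random

def linearize_with_brackets(children, node, rng):
    """Depth-first linearization that wraps every child subtree in ( ... )."""
    tokens = [node]
    kids = list(children.get(node, []))
    rng.shuffle(kids)  # random sibling order, as in the plain linearization
    for kid in kids:
        tokens.append("(")
        tokens.extend(linearize_with_brackets(children, kid, rng))
        tokens.append(")")
    return tokens

# Hypothetical toy tree with lemmas as node names.
tree = {"run": ["I", "item"], "item": ["this"]}
print(" ".join(linearize_with_brackets(tree, "run", random.Random(0))))
```

Each non-root node contributes exactly one opening and one closing bracket, so the nesting depth in the sequence mirrors the depth in the tree.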
Restricted beam search In an attempt to reduce unnecessary errors during decoding, our beam search looks at the input sequence and restricts the available vocabulary to tokens that appear in the input and have not yet appeared in the output sequence. This is similar to the approach used by King and White (2018).
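The restriction can be sketched independently of any decoding framework. The version below treats the input as a multiset, so a token that occurs twice in the input may be generated twice; this multiset handling, the function names, and the dictionary of scores are illustrative assumptions rather than a description of the modified OpenNMT-py code.

```python
from collections import Counter

def allowed_tokens(input_tokens, output_so_far):
    """Tokens from the input that have not yet been used up by the output."""
    remaining = Counter(input_tokens) - Counter(output_so_far)  # keeps positive counts only
    return set(remaining)

def mask_scores(scores, allowed):
    """Set the score of every disallowed candidate to -inf before beam expansion."""
    return {tok: (s if tok in allowed else float("-inf")) for tok, s in scores.items()}

inp = ["the", "cat", "saw", "the", "dog"]
print(allowed_tokens(inp, ["the", "cat"]))          # "the" is still available once
print(allowed_tokens(inp, ["the", "cat", "the"]))   # both occurrences of "the" used
```

In a real decoder the same mask would be applied to the model's output distribution at every step for every hypothesis on the beam.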

Synthetic Data
To augment the existing training data we create synthetic data by parsing sentences from publicly available corpora. The two corpora we investigated are Wikitext 103 (Merity et al., 2017) and the CNN stories portion of the DeepMind Q&A dataset (Hermann et al., 2015).
Each corpus requires some cleaning and formatting, after which it can be sentence-tokenized using CoreNLP (Manning et al., 2014). Sentences are filtered by length (minimum 5 tokens, maximum 50) and for vocabulary overlap with the original training data, with 80% of the tokens in a sentence required to appear in the original vocabulary. These sentences are then parsed using the Stanford NLP UD parser (Qi et al., 2018). This leaves us with 2.4 million parsed sentences from the CNN stories corpus and 2.1 million from Wikitext.
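A minimal sketch of the sentence filter follows. The thresholds come from the text; whether the length bounds are inclusive and whether overlap is counted over tokens with repetition are assumptions, as is the function name `keep_sentence`.

```python
def keep_sentence(tokens, vocab, min_len=5, max_len=50, min_overlap=0.8):
    """Keep a sentence if its length is in range and enough tokens are in-vocabulary."""
    if not (min_len <= len(tokens) <= max_len):
        return False
    in_vocab = sum(1 for t in tokens if t in vocab)
    return in_vocab / len(tokens) >= min_overlap

# Toy vocabulary standing in for the vocabulary of the original training data.
vocab = {"the", "cat", "sat", "on", "mat"}
print(keep_sentence("the cat sat on the mat".split(), vocab))    # all tokens known
print(keep_sentence("the zebra ate on the mat".split(), vocab))  # 4 of 6 tokens known
```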
It is a straightforward process to convert a parse tree into synthetic data. First, word order information is removed by shuffling the IDs of the parse tree, then the tokens are lemmatised by removing the form column. This is the same process used by the shared task organizers to create datasets from the UD treebanks.
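The two conversion steps can be sketched on a simplified parse representation. The row format (a list of dictionaries with a subset of CoNLL-U fields) and the function name `to_srst_input` are assumptions for illustration; the organizers' actual conversion script may differ in its details.

```python
import random

def to_srst_input(rows, rng):
    """Shuffle token IDs (remapping heads to match) and blank out the form column.

    `rows` is a list of dicts with keys id, form, lemma, upos, head, deprel,
    using 1-based ids and head 0 for the root (a simplified CoNLL-U row).
    """
    old_ids = [r["id"] for r in rows]
    new_ids = old_ids[:]
    rng.shuffle(new_ids)
    remap = dict(zip(old_ids, new_ids))
    remap[0] = 0  # the artificial root keeps id 0
    out = [
        {**r, "id": remap[r["id"]], "head": remap[r["head"]], "form": "_"}
        for r in rows
    ]
    return sorted(out, key=lambda r: r["id"])  # row order no longer reflects word order

rows = [
    {"id": 1, "form": "I", "lemma": "I", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "ran", "lemma": "run", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": ".", "lemma": ".", "upos": "PUNCT", "head": 2, "deprel": "punct"},
]
for row in to_srst_input(rows, random.Random(0)):
    print(row)
```

The original sentence is kept separately as the training target, so each parsed sentence yields one input-output pair.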
While it has been noted that the use of synthetic data is problematic in NLG tasks (WeatherGov (Liang et al., 2009) being the notable example), our data is created differently. The WeatherGov dataset is constructed by pairing a table with the output of a rule-based NLG system. This means any system trained on WeatherGov only re-learns the rules used to generate the text. Our approach is the reverse: we parse an existing, naturally occurring sentence, and thus the model must learn to reverse the parsing algorithm.

Training
The system is trained using a custom fork of the OpenNMT-py framework (Klein et al., 2017); the only change made was to the beam search decoding code. Hyperparameter details and replication instructions are provided in our project's repository, in particular in the config directory.
Vocabulary size varies based on the datasets in use. It is determined by including every token which appears 10 times or more. When using the original shared task dataset, the vocabulary size is

Evaluation
The evaluation is performed on detokenized sentences using the official evaluation script from the 2018 shared task. We focus on BLEU-4 score (Papineni et al., 2002), which was shown in both shared tasks to be highly correlated with human evaluation scores.

Results
In Table 1, we compare our results on the test set with those reported in Yu et al. (2019b), which include the Yu et al. system (Yu19), the best 2018 shared task result for English (Elder and Hokamp, 2018) 2016) (P16). Ignoring for now the result with synthetic data, we can see that our system is competitive with that of Yu et al. (72.3 vs 72.7).
In Section 2.3, we described three improvements to our baseline system: random linearization, scoping and restricted beam search. An ablation analysis of these improvements on the dev set is shown in Table 2. The biggest improvement comes from the introduction of random linearizations. However, all three make a meaningful, positive contribution.

The Effect of Synthetic Data
The last row of Table 1 shows the effect of adding synthetic data. BLEU score on the test set jumps from 72.3 to 80.1. To help understand why additional data makes such a substantial difference, we perform various analyses on the dev set, including examining the effect of the choice of unlabeled corpus and highlighting interesting differences between the systems trained with and without the synthetic data.
The role of corpus Table 3 compares the Wikitext corpus as a source of additional training data to the CNN corpus. Both the individual results and the result obtained by combining the two corpora show that there is little difference between the two.
Sentence length and BLEU score Using compare-mt (Neubig et al., 2019) we noticed a striking difference between the systems with regard to performance on sentences of different length. This is shown in Figure 2. Even though the synthetic data sentences were limited to 50 tokens in length, the system trained with synthetic data performed equally well for sentence length buckets 50-60 and 60+, while the system trained on the baseline data performed relatively worse. It is possible this is due to the synthetic data system containing a larger vocabulary and being exposed to a wider range of commonly occurring phrases, which make up parts of longer sentences.

Table 4: Error analysis breakdown for the 1,978 dev sentences. SRST is our system without synthetic data and Synth is our system with synthetic data.
Error Analysis We perform some preliminary analysis that could serve as a precursor to more detailed human evaluation. Table 4 lists the number of exact matches, in which the tokenized reference sentence and the generated sentence exactly match. We also detect relatively minor errors, namely punctuation and inflection, in which these are the only differences between the reference and generated sentences. Punctuation errors are typically minor and there is usually ambiguity about their placement. Inflection errors occur when a different inflected form has been chosen by the model than in the reference sentence. These tend to be small differences and are often valid alternatives, e.g. choosing 'm over am.
The remaining uncategorized sentences mostly contain linearization errors. Linearization errors come in two main categories: non-breaking, in which the linearization is different from the reference sentence but is still valid and communicates the same meaning as the reference (see Example 1 below); and breaking, where the linearization has clear errors and doesn't contain the same meaning as the reference sentence (see Example 2 below). This kind of breakdown in an error analysis may help understand the quality of these systems in more absolute terms, since it's the overall number of accurate sentences which matters. This could be more intuitive than comparing BLEU scores relative to prior models when deciding whether to apply a system in a business setting.
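The automatically detectable categories from this analysis (exact match, punctuation-only and inflection-only differences) could be assigned along the following lines. This is a sketch under stated assumptions: the function names, the `lemma_of` lookup table and the rule that inflection errors require equal-length sequences are hypothetical, and the paper does not describe its own detection procedure at this level of detail.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in PUNCT_CHARS for ch in token)

def categorize(ref, hyp, lemma_of):
    """Classify a generated sentence against its reference (both token lists)."""
    if ref == hyp:
        return "exact"
    strip = lambda toks: [t for t in toks if not is_punct(t)]
    if strip(ref) == strip(hyp):
        return "punctuation"
    same_lemmas = len(ref) == len(hyp) and all(
        lemma_of.get(r, r) == lemma_of.get(h, h) for r, h in zip(ref, hyp)
    )
    if same_lemmas:
        return "inflection"
    return "other"  # mostly linearization errors, which need manual inspection

lemma_of = {"am": "be", "'m": "be"}  # hypothetical form-to-lemma table
ref = ["I", "am", "here", "."]
print(categorize(ref, ["I", "'m", "here", "."], lemma_of))
print(categorize(ref, ["here", "I", "am", "."], lemma_of))
```

Separating non-breaking from breaking linearization errors within the "other" category requires a judgement about meaning, so it is left to human annotators in this sketch.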

Conclusion
We have argued for the use of synthetic data in English surface realization, justified by the fact that its use gives a significant performance boost on the shallow task, from 72.7 BLEU up to 80.1.While this is not yet at the level of reliability needed for neural NLG systems to be used commercially, it is a step in the right direction.
Assuming the use of synthetic data, more needs to be investigated in order to fully maximize its benefit on performance.Future work will look more closely at the choice of corpus, construction details of the synthetic dataset, as well as the tradeoff between training time and accuracy that comes with larger vocabularies.
The work described in this paper has focused on English. Another avenue of research would be to investigate the role of synthetic data in surface realization in other languages.

Figure 2: BLEU score breakdown by sentence length buckets, comparing our best model trained on the original dataset with one trained with synthetic data

1. Non-breaking
(a) Ref: From the AP comes this story:
(b) Synth: This story comes from the AP:
2. Breaking
(a) Ref: I ran across this item on the Internet.
(b) Synth: I ran on the internet across this item.

Table 1: Test set results for baselines trained on the original dataset and the final model which uses synthetic data

2,193 tokens, training is done for 33 epochs and takes 40 minutes on two Nvidia 1080 Ti GPUs. All hyperparameters stay the same when training with the synthetic data, except for vocabulary size and training time. For the combined shared task, Wikitext and CNN datasets, the vocabulary size is 89,233 and training time increases to around 2 days, using 60 random linearizations of the shared task dataset and 8 of the Wikitext and CNN datasets.

Table 2 .
The biggest improvement comes from the introduction of random lineariza-

Table 2: Dev set results for ablation of the baseline system plus improvements, trained only on the original dataset

Table 3: Dev set results for the SR shared task data with additional synthetic data: the role of the corpus