E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language

In natural language generation (NLG), the task is to generate utterances from a more abstract input, such as structured data. An added challenge is to generate utterances that contain an accurate representation of the input, while reflecting the fluency and variety of human-generated text. In this paper, we report experiments with NLG models that can be used in task oriented dialogue systems. We explore the use of additional input to the model to encourage diversity and control of outputs. While our submission does not rank highly using automated metrics, qualitative investigation of generated utterances suggests the use of additional information in neural network NLG systems to be a promising research direction.


Introduction
Natural Language Generation (NLG) is a broad field, ranging from text-to-text translation to experiments in computational poetry (Gatt and Krahmer, 2018). Whether the task is to summarize, translate, or entertain, a core challenge is doing so in a manner that is compatible with human needs and preferences.
Formally, NLG systems aim to create utterances from a set of abstract inputs. These inputs can be closely aligned with the output, as in machine translation (Sutskever et al., 2014), or require significant abstractive reasoning, as in summarization or data-to-text tasks (See and Manning, 2017; Wiseman et al., 2017). Traditionally, NLG systems have followed a rule-based approach (Reiter and Dale, 2000). While robust, these systems are noted to generate repetitive and stilted output, which can make interacting with rule-based systems a tedious experience (Wen et al., 2015).

Table 1: Utterance generated with a novel dialogue act containing additional words.
  Meaning Representation: name[The Wrestlers], eatType[restaurant], food[Japanese], priceRange[more than £30], area[riverside], familyFriendly[no], near[Raja Indian Cuisine], additionalWords[looking adults offerings really try good prices situated]
  Generated utterance: If you're looking for an adults only Japanese restaurant, try The Wrestlers. It is really good and situated near Raja Indian Cuisine. The prices are more than £30.
Data-driven models using deep neural networks have achieved state-of-the-art results on many NLG tasks and datasets, such as RoboCup, Weathergov, SF Hotels/Restaurants, and AMR-to-text (Mei et al., 2016; Wen et al., 2016; Konstas et al., 2017). However, Sharma et al. (2017) note that high performance on datasets such as Wen et al. (2015)'s SF Restaurant indicates they no longer pose a sufficient challenge, and that the community ought to progress to larger and more complex datasets.
Two new crowd-sourced datasets, each containing tens of thousands of examples and focusing on complex sentence structures, have recently been released: WebNLG and E2E (Colin et al., 2016; Novikova et al., 2017). This paper focuses on the E2E dataset, which was created using a new methodology to maximize both the quality of collected utterances and their naturalness and variety (Novikova et al., 2016).

Wei et al. (2017) note that neural networks learning from highly unaligned datasets have trouble choosing between equally plausible outputs and tend towards short and less meaningful outputs. They suggest that the number of plausible outputs can be decreased by providing additional information to the model. In Table 1 we augment the meaning representation (MR) with a novel dialogue act (DA) containing additional words to be included in the generated utterance. By conditioning the output on these words, the model has managed to generate an utterance with a complex sentence structure and a wide vocabulary.
Our contribution is to propose a pipeline system. Additional words are sampled from a secondary model which uses the DAs of a given MR as inputs. These additional words are put into a new DA and added to the existing MR, as shown in Table 1. The augmented MR is then used as input to a model which generates the final utterance.
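The two-stage pipeline can be sketched as follows. The dict-based MR, the callable model interfaces, and passing the new DA under the key additionalWords are illustrative assumptions for exposition, not the authors' exact implementation:

```python
# Two-stage pipeline sketch: a secondary word model proposes additional words
# from the MR's dialogue acts, then an utterance model generates the final
# text from the augmented MR. All interfaces here are illustrative assumptions.
def generate_utterance(mr, word_model, utterance_model):
    """mr: dict mapping dialogue-act names to their values."""
    # Stage 1: sample additional words conditioned on the existing DAs
    additional_words = word_model(mr)
    # Stage 2: add a new DA holding those words, then generate the utterance
    augmented_mr = dict(mr, additionalWords=" ".join(additional_words))
    return utterance_model(augmented_mr)
```

Any sequence-to-sequence models exposing this interface could be dropped into the two stages.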
The approach of augmenting the source sequence takes inspiration from recent work in paraphrase generation (Guu et al., 2017) and in generating structured queries from natural language (Zhong et al., 2017). As noted by Sharma et al. (2016), delexicalization can often lead to grammatically incorrect sentences. We opt instead to use a pointer network (Vinyals et al., 2015), which allows the model to copy tokens directly from the source sequence into the generated utterance. The model does not perform well relative to the baseline, possibly due to the failure of the secondary model to generate appropriate additional words. Improving upon the pipeline system remains an area of active research for us.
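One common way to realise such copying is to mix the decoder's vocabulary distribution with its attention weights over source positions, in the spirit of pointer-generator networks (See and Manning, 2017). The following is a minimal numerical sketch of that mixture, not the authors' exact formulation:

```python
import numpy as np

def copy_mixture(p_gen, vocab_probs, attn_weights, src_token_ids):
    """Mix a vocabulary distribution with attention-based copy probabilities.

    p_gen:         probability of generating from the vocabulary (vs. copying)
    vocab_probs:   distribution over the output vocabulary
    attn_weights:  attention over source positions (sums to 1)
    src_token_ids: vocabulary id of the token at each source position
    """
    final = p_gen * np.asarray(vocab_probs, dtype=np.float64)
    for weight, tok_id in zip(attn_weights, src_token_ids):
        # route copy mass onto the vocabulary ids of the source tokens
        final[tok_id] += (1.0 - p_gen) * weight
    return final
```

Because both components are distributions, the mixture remains a valid distribution over the vocabulary.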

System Description
Here we present details of the pipeline system. First we describe how the training data for the pointer network with additional words is constructed. This is followed by an explanation of the additional word generator, which uses the DAs of a given MR as input.
Typical approaches to generating diverse outputs focus on objective functions that affect the decoding step (Li et al., 2015). Our approach of augmenting the input sequence is similar to previous work on common sense dialogue models (Young et al., 2017) and content-introducing text generation (Mou et al., 2016). Other approaches to controllable text generation have focused on more abstract inputs, such as language models which generate text about a specific topic, product, person, or sentiment (Li et al., 2016; Tang et al., 2016; Fan et al., 2017; Dong et al., 2017).

Additional words model
We augment the MR with an extra DA containing additional words to be included in the generated sentence. To obtain the data for this, we looked at each target sentence and, using a set of rules, determined which words the model would learn to include. These selected words were added to the source sequence inside a custom DA. The model's ability to accept additional words ensures both diversity of outputs and fine-grained control over those outputs at test time.
For our additional words model we extracted tokens from the target sequence that adhered to the following set of rules:
• Not part of a list of stopwords
• Does not appear in the source sequence or meaning representation
• Does not contain punctuation or numbers
After the original list was compiled, we removed the most frequently appearing token and any tokens which occurred fewer than 6 times. The unique contents of each DA in the MR are treated as a single token. We omit the name and near DAs, as they were observed to have little correlation with the semantics of the additional words chosen. The model attempts to correlate specific DAs with the additional words that appear in target sentences. An example of the source and target sequences used for training is shown in Table 3. We use a sequence-to-sequence network with attention as the model.
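The extraction rules above can be sketched as follows. The stopword list, the helper name, and the frequency cutoff default are illustrative assumptions; the paper does not publish its exact stopword list:

```python
# Sketch of the rule-based extraction of additional words from target
# sentences. STOPWORDS is a small placeholder list, not the authors' list.
import string
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "for", "in", "to"}

def extract_additional_words(pairs, min_count=6):
    """pairs: list of (source_tokens, target_tokens) training examples."""
    candidates = []
    for source, target in pairs:
        source_set = set(source)
        kept = [
            tok for tok in target
            if tok.lower() not in STOPWORDS              # rule 1: not a stopword
            and tok not in source_set                    # rule 2: not in the MR
            and not any(c in string.punctuation or c.isdigit() for c in tok)
        ]                                                # rule 3: no punct/digits
        candidates.append(kept)
    # global filtering: drop the single most frequent token and rare tokens
    counts = Counter(tok for kept in candidates for tok in kept)
    most_common = counts.most_common(1)[0][0] if counts else None
    return [
        [t for t in kept if t != most_common and counts[t] >= min_count]
        for kept in candidates
    ]
```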
Additional words are sampled from the model. We scale the final output layer of the model before applying the softmax and sampling tokens for the generated utterance. The value used for scaling is known as the temperature. Higher temperatures lead to more diverse outputs, while temperatures close to 0 lead the model to choose more conservative outputs. We use values of 0.9 to 1.1 to encourage the generation of a more diverse set of additional words.
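Temperature-scaled sampling amounts to dividing the logits by the temperature before the softmax. A minimal sketch (the function name is ours; this is the standard technique, not code from the submission):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits scaled by a temperature.

    temperature -> 0 approaches argmax (conservative);
    temperature > 1 flattens the distribution (more diverse).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```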
Table 3: Example pair used for training the additional word generator.
  Source sequence: pub more than £30 5 out of 5
  Target sequence: star Prices start

Experiments
The dataset was tokenized using the NLTK port of the Moses tokenizer with aggressive hyphen splitting. For each DA, custom start and stop tokens were added to the source sequence, e.g. name start The Vaults name end. The models used were from the OpenNMT-py library (Klein et al., 2017). Our model architecture contains 2 layers of bidirectional recurrent neural networks (RNNs) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). We use 500 hidden units for the encoder and decoder layers, and 500 units for the word vectors, which are learned jointly across the whole model. We apply dropout of 0.3 between the LSTM stacks.
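The source-sequence construction with per-DA boundary tokens might look like the following sketch. The underscore token form and the helper name are assumptions; the paper prints the boundary markers as "name start ... name end" without specifying their exact surface form:

```python
# Sketch: flatten (DA name, value) pairs into a token sequence with custom
# start/stop tokens around each dialogue act. Token naming is an assumption.
def linearize_mr(mr_pairs):
    tokens = []
    for name, value in mr_pairs:
        tokens.append(f"{name}_start")   # custom DA start token
        tokens.extend(value.split())     # the DA's value, whitespace-split
        tokens.append(f"{name}_end")     # custom DA stop token
    return tokens
```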
The models are trained using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and learning rate decay of 0.5 applied after 8 epochs. The models were trained for 10 epochs and the best performing checkpoint on the development set was chosen.
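The learning-rate schedule amounts to the following sketch, assuming OpenNMT-style decay (halving every epoch once the decay point is reached; the helper name is ours):

```python
# Sketch of the stated schedule: lr 0.001, decay 0.5 applied after epoch 8.
def learning_rate(epoch, base_lr=0.001, decay=0.5, start_decay_at=8):
    """Learning rate for a 1-indexed epoch: constant until start_decay_at,
    then multiplied by `decay` each subsequent epoch."""
    if epoch <= start_decay_at:
        return base_lr
    return base_lr * decay ** (epoch - start_decay_at)
```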
The exploration and choice of hyperparameters was aided by the Bayesian hyperparameter optimization platform SigOpt (2014).

Results & Discussion
We report results using the automated evaluation metrics BLEU (Papineni et al., 2002), NIST (Przybocki et al., 2009), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004). Table 4 shows the performance of the baseline relative to our models using both sampled additional words and those extracted from the target sentences; the latter are the gold standard additional words. The baseline model is TGen, a sequence-to-sequence model with attention (Dušek and Jurčíček, 2016).
The model using extracted additional words performs better on almost all metrics. The poor performance of models using sampled words versus gold standard words highlights an issue with the generation of additional words. These results maintain their relative ranking on the test set, as shown in Table 5.
Human evaluation was carried out on the primary systems. The two metrics used were Quality, which measures grammatical correctness and overall adequacy in the context of the MR, and Naturalness, which measures whether the utterance could have been produced by a native speaker. Crowd workers were used to collect pairwise comparisons for each system, and systems were ranked using the TrueSkill algorithm (Sakaguchi et al., 2014). Our model ranked 4th, below the baseline which came in 2nd, as shown in Table 4 (Dušek et al., 2018).

Automated evaluation and the subsequent human evaluation results show our additional words model performs poorly relative to the baseline. A manual inspection of the model's outputs reveals many errors, such as repeated phrases and occasionally absent or incorrect information. We include a collection of generated utterances from the test set in Table 7 to highlight areas where the model performs both well and poorly relative to the baseline.
Utterances from the baseline model tend to be more consistent, but when viewed over many hundreds of samples they can be dry and repetitive.

Future Work
Many verbalization issues in the additional words model arise from a conflict between an additional word and the existing DAs in the MR. This can be seen in some of the examples in Table 7. The model used for generating additional words could be improved substantially. Increasing the minimum frequency of occurrence for additional words in the training data may give the model more examples from which to learn correct syntax. The pointer network with additional words also suffers from an issue, common with pointer networks, in which source tokens are incorrectly repeated in the generated utterance. One way to handle this would be a second stage of training with a coverage loss, as in See and Manning (2017).
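A minimal sketch of such a coverage penalty, following the formulation of See and Manning (2017) but simplified to plain arrays rather than the network's own tensors:

```python
import numpy as np

def coverage_loss(attn_steps):
    """attn_steps: array of shape (T, src_len), one attention distribution
    per decoder step. Penalizes re-attending to already-covered positions."""
    coverage = np.zeros(attn_steps.shape[1])
    loss = 0.0
    for attn in attn_steps:
        # overlap between this step's attention and accumulated past attention
        loss += np.minimum(attn, coverage).sum()
        coverage += attn
    return loss
```

The loss is zero when each source position is attended to at most once across decoding, and grows as attention repeatedly returns to the same positions.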

Conclusion
We proposed the use of an additional DA to improve the diversity of, and level of control over, generated utterances. Results show both the underlying network and the method used for generating additional words could be improved. Observation of generated samples shows this approach has the potential to yield high quality and varied responses.
Table 2 contains an example of an augmented MR and utterance pair used for training.
