Natural Language Generation at Scale: A Case Study for Open Domain Question Answering

Current approaches to Natural Language Generation (NLG) for dialog mainly focus on domain-specific, task-oriented applications (e.g. restaurant booking) using limited ontologies (up to 20 slot types), usually without considering the previous conversation context. Furthermore, these approaches require large amounts of data for each domain, and do not benefit from examples that may be available for other domains. This work explores the feasibility of applying statistical NLG to scenarios requiring larger ontologies, such as multi-domain dialog applications or open-domain question answering (QA) based on knowledge graphs. We model NLG through an Encoder-Decoder framework using a large dataset of interactions between real-world users and a conversational agent for open-domain QA. First, we investigate the impact of increasing the number of slot types on the generation quality and experiment with different partitions of the QA data with progressively larger ontologies (up to 369 slot types). Second, we perform multi-task learning experiments between open-domain QA and task-oriented dialog, and benchmark our model on a popular NLG dataset. Moreover, we experiment with using the conversational context as an additional input to improve response generation quality. Our experiments show the feasibility of learning statistical NLG models for open-domain QA with larger ontologies.


Introduction
In dialog literature, Natural Language Generation (NLG) is framed as the task of generating natural language responses that faithfully convey the semantic information given by a Meaning Representation (MR). An MR is typically a structure consisting of a Dialog Act (DA) (e.g. "inform" in Table 1) and a list of associated slots organized as slot type-slot value pairs (e.g. food:'french' in Table 1) representing the information which has to be conveyed in the generated text.
Table 1: Examples of input-output pairs from a task-oriented (Task) NLG dataset (SFX (Wen et al., 2015)) and a Question-Answering (QA) dataset. In NLG the input is typically a Meaning Representation (MR) and the output is its textual realization (Text). Each MR is composed of a Dialog Act (bold) and a list of slot type (italic)-value pairs. Compared to most NLG datasets, our QA corpus also has the previous question (context) as input. While in the task-oriented setting we observe a one-to-one relation between slots in the input and the ones realized in the text, the same is not true for QA.
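As a rough illustration of the MR structure described above, the following Python sketch represents an MR as a dialog act plus a list of slot type-value pairs. The class name, its methods, the `act(type='value', ...)` rendering, and the restaurant name are assumptions made for this example; only the "inform" act and the food:'french' slot come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MeaningRepresentation:
    """A dialog act plus an ordered list of (slot type, slot value) pairs."""
    dialog_act: str
    slots: List[Tuple[str, str]] = field(default_factory=list)

    def slot_types(self) -> List[str]:
        """Return only the slot types, in input order."""
        return [slot_type for slot_type, _ in self.slots]

    def linearize(self) -> str:
        """Render the MR as a flat string, e.g. for feeding an encoder."""
        args = ", ".join(f"{t}='{v}'" for t, v in self.slots)
        return f"{self.dialog_act}({args})"


# Hypothetical MR: the restaurant name is invented for illustration.
mr = MeaningRepresentation("inform", [("name", "chez example"), ("food", "french")])
print(mr.linearize())  # inform(name='chez example', food='french')
```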
So far, statistical NLG for dialog has mainly been investigated for task-oriented applications (e.g., restaurant reservation, bus information) in narrow, controlled environments with limited ontologies, i.e. considering a small set of DAs and slot types (respectively 8 and 12 in the popular San Francisco restaurant dataset (SFX) (Wen et al., 2015), 1 and 8 in the recent E2E NLG challenge (Novikova et al., 2017)). Furthermore, most datasets consider MRs in isolation (Novikova et al., 2017), i.e., they lack conversational context, even though the previous utterances in a dialog have been shown to improve the performance of task-oriented NLG (Dušek and Jurcicek, 2016). These characteristics of current approaches to NLG can be linked to the fact that the vast majority of NLG research is tested on a single domain where the dialog agent performs simple tasks such as giving information about a restaurant, with few exceptions (Wen et al., 2016).
However, with the rise of conversational agents such as Amazon Alexa and Google Assistant, there is an increasing interest in complex multi-domain tasks. These systems typically rely on hand-crafted NLG, but this approach cannot scale to the complex ontologies which may be required in real-world applications (e.g., booking a trip).
To study the scalability of current statistical NLG approaches, we apply state-of-the-art NLG models for dialog to open-domain Question Answering (QA). This allows us to investigate the performance of current NLG research in an environment with (1) a much larger number of slot types, and (2) a different application compared to traditional task-oriented dialog. We envision our approach as a first step towards an integrated statistical NLG module for a dialog system. Such a system should work with applications with a variety of ontology sizes and be flexible across multiple domains. Hence, we structure our NLG models for QA following the MR-to-text paradigm typical of task-oriented dialog, using a neural Encoder-Decoder architecture. We use a large dataset of open-domain QA pairs between real-world users and a conversational agent and assess our models using both objective metrics and human evaluation. We observe that NLG for open-domain QA poses its own challenges compared to task-oriented dialog, since correct answers to the same question do not necessarily convey all slot types in the MR (see Table 1).
In our first set of experiments, we investigate the effect of using increasingly larger ontologies, in terms of slot types, on the performance of NLG models. We find that, notwithstanding the larger ontologies and the noisiness of our dataset, NLG performance does not degrade significantly in terms of the naturalness of the generated text and the efficiency in encoding the MR information. On the contrary, it improves for some of the human evaluation metrics. We also observe that using conversational context improves the quality of generated responses.
In our second set of experiments, we investigate whether joint training of NLG models for task-oriented dialog and QA improves performance. To this end, we experiment with learning NLG models in a multi-task setting between QA and SFX. Our experiments show that learning models in a multi-task setting leads to better naturalness for both tasks.

Related work
While classical approaches to NLG involve a pipeline of modules such as content selection, planning, and surface realization (Gatt and Krahmer, 2018), a large part of the recent literature has investigated end-to-end neural approaches to NLG. The tasks tackled include dialog, text, and QA. While these tasks share some similarities, each comes with its own set of challenges and requires specific solutions.
NLG for dialog: State-of-the-art NLG models for dialog (Dušek and Jurcicek, 2016; Juraska et al., 2018) mostly use end-to-end neural Encoder-Decoder approaches with attention (Bahdanau et al., 2014) and re-ranking (Dušek et al., 2018). Ensembling is another technique employed to boost model performance (Juraska et al., 2018). Delexicalization (Henderson et al., 2014), i.e., the process of substituting slot values with slot types in the generated text, has shown improvements in many settings. However, recent work has also highlighted the disadvantages of delexicalization (Nayak et al., 2017). In our work, we compare and combine both delexicalized and lexicalized inputs for the NLG system.
NLG for dialog has mostly been tested in controlled environments using task-oriented, single-domain datasets with limited ontologies. Although Wen et al. (2016) perform multi-domain task-oriented NLG, the ontologies used are still limited for such settings. Finally, while research has shown that encoding the previous utterance leads to better performance (Dušek and Jurcicek, 2016), most settings consider the turns in isolation. In our work, we perform open-domain NLG with significantly larger ontologies and also evaluate the impact of adding the context to the input.
NLG for text and QA: Recent work on NLG for text involves generating text from structured data using encoder-decoder networks (Mei et al., 2016). Similarly to dialog, NLG for text has also been addressed in controlled environments such as weather forecasts (Liang et al., 2009), with few exceptions (Lebret et al., 2016).
In the literature for QA, most approaches retrieve answers directly or generate answers jointly with the retrieval, and answers are usually entities or lists of entities (Dodge et al., 2015). In NLG, on the contrary, we assume the answer has already been retrieved, and the goal is to generate text matching it. The field of QA which most closely relates to our work is answer generation, where current approaches are also based on encoder-decoder networks encoding information directly from a knowledge base (Yin et al., 2016; He et al., 2017). An additional challenge for answer generation is that there are no publicly available datasets for this task (Fu and Feng, 2018). Our approach differs from answer generation in that we structure the task as in the NLG dialog literature, with an MR-to-text approach.

Question Answering
Source data: Our source for generating the MR-text pairs is a set of thousands of open-domain factual question-answer pairs from commercial data, grouped according to the type of question asked. Each group consists of a list of specific questions (e.g. "who is the wife of barack obama", "tell me the wives of henry the viii") of the same type (e.g. "who is the wife of") asked by real users to a conversational agent. Each specific question additionally has: (1) the answer to the question (e.g. "michelle obama is obama's wife") generated by the NLG of the conversational system, either using information retrieval or a knowledge base search coupled with templates; (2) relevant noun and verb phrases (NPs, VPs) (e.g. "michelle obama", "barack obama", "wife") used by the system to generate the answer, including the ones from the question. NPs are tagged according to their semantic type (examples of semantic types are timepoint and human being), while VPs are tagged as "relation" types (see "founded" tagged as relStr in Table 1).
The answers in the source data are varied, and range from a simple entity to a fully formed answer, as in the Table 1 example, where valid answers to the question "when was kentucky founded" can be "1792" or "kentucky formed in 1792". This shows an interesting difference between our QA data and task-oriented NLG datasets. While for task-oriented NLG all valid responses for a single MR have the same slot types (i.e., the ones in the input MR), in our dataset this is not always true.
QA NLG datasets: We generate the NLG input-output pairs for QA from our source data. In order to perform cross-application experiments, we maintain the same MR-text format as task-oriented dialog NLG. The target output is the text of the answers in the source data. To generate the input MRs we assumed only one DA across all answers, i.e. "inform"; for the slots, we used the semantic types and relations for NPs and VPs in the source data as slot types, while the actual entity or verb was used as the corresponding slot value. On top of the generated MR we use, as additional input, the previously asked question as context.
Answers are delexicalized (Henderson et al., 2014) to improve generalization. Since we do not have an alignment between entities in the input and the generated text, we use a heuristic-based aligner, which we also use to filter out data that could not be appropriately aligned. All NPs are delexicalized while VPs are not. Furthermore, similarly to Juraska et al. (2018), we use delexicalization for data augmentation. We generate additional references for each MR, besides the original one, by considering all delexicalized answers in the question group as candidate template answers for each specific question in the group and then substituting (where possible) the slot values which are already available in the input. The text of the previous question is also delexicalized.
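The delexicalization and template-based augmentation steps can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the heuristic aligner is not described in detail, so exact string matching stands in for it here, and the `__TYPE__` placeholder format, the function names, the "state" slot type, and the second example question are all hypothetical.

```python
import re
from typing import Dict, Optional


def delexicalize(text: str, slots: Dict[str, str]) -> Optional[str]:
    """Replace each slot value found in `text` with a `__TYPE__` placeholder.

    Returns None when no slot value is found, loosely mimicking the idea of
    filtering out pairs that cannot be aligned (exact match is an assumption).
    """
    out = text
    replaced = 0
    # Longer values first, so a short value cannot split a longer one.
    for slot_type, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        pattern = r"\b" + re.escape(value) + r"\b"
        out, n = re.subn(pattern, f"__{slot_type.upper()}__", out)
        replaced += n
    return out if replaced else None


def relexicalize(template: str, slots: Dict[str, str]) -> str:
    """Fill a delexicalized template with the slot values of another question."""
    for slot_type, value in slots.items():
        template = template.replace(f"__{slot_type.upper()}__", value)
    return template


# Hypothetical slot types and values for two questions of the same type.
kentucky = {"state": "kentucky", "timepoint": "1792"}
ohio = {"state": "ohio", "timepoint": "1803"}

template = delexicalize("kentucky formed in 1792", kentucky)
print(template)                      # __STATE__ formed in __TIMEPOINT__
print(relexicalize(template, ohio))  # ohio formed in 1803
```

Reusing a template from one question to produce an extra reference for another question in the same group is the augmentation idea; a real aligner would need to handle paraphrased or inflected entity mentions, which this sketch ignores.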
Finally, to investigate performance across different ontology sizes, we generate 3 different partitions of the data (QA.1, QA.2, and QA.3 in Table 2) with a progressively larger number of slot types. Each QA partition was split into train, test, and development sets (using an 80-10-10 split) according to the type of question asked. We ensured there was no overlap between the different sets, in order to test whether the model generalizes to previously unseen questions.
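A split by question type, as opposed to a split over individual examples, might look like the Python sketch below. It assumes the 80-10-10 ratio is applied to the question types themselves, which the text does not state explicitly; the function name, the seed, and the toy groups are hypothetical.

```python
import random
from typing import Dict, List


def split_by_question_type(
    groups: Dict[str, List[str]],
    train: float = 0.8,
    test: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Assign whole question types to train/test/dev so that no type is shared."""
    types = sorted(groups)
    random.Random(seed).shuffle(types)
    n_train = int(train * len(types))
    n_test = int(test * len(types))
    return {
        "train": types[:n_train],
        "test": types[n_train:n_train + n_test],
        "dev": types[n_train + n_test:],
    }


# Toy data: ten hypothetical question types with two questions each.
groups = {f"type_{i}": [f"question {i}a", f"question {i}b"] for i in range(10)}
splits = split_by_question_type(groups)
print({name: len(types) for name, types in splits.items()})
```

Because every question of a given type lands in exactly one set, test questions are of types never seen during training.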

Task-oriented
As the task-oriented NLG corpus for our multi-task learning experiments we use the popular SFX (Wen et al., 2015) dataset. Statistics about the dataset are shown in Table 2. Although SFX is not large (6k examples), compared to the E2E NLG corpus it presents more variation in DAs (although less in style). We use the TGEN library (Dušek and Jurcicek, 2016) to delexicalize all slot types except binary values.

Model and Architectures
Encoder-Decoder with Attention: Juraska et al. (2018) recently achieved state-of-the-art results on a variety of task-oriented datasets using an encoder-decoder model with attention. We use a similar approach, with a bidirectional GRU and Luong general attention (Luong et al., 2015), as our baseline. Depending on the input given to the encoder, either slot types or slot values, we refer to this model as 1-Enc MR (slot types) or 1-Enc MR (slot values).
Multi-Encoder, Single Decoder: We expand the baseline (1-Enc MR) models using multiple inputs from the MR (slot types, values, DAs), each encoded by a different encoder. The attention is performed on their concatenated output to produce the MR context vector $c_{MR}$. Figure 1A shows an example of such an architecture using two encoders, one for slot types and one for slot values. Furthermore, we experimented with adding the previous utterance (Utt) as input with an additional encoder (1-Enc Utt). In this case, the context vector for the Utt, $c_{Utt}$, is produced by an independent attention mechanism, and the outputs of both attentions ($c_{MR}$ and $c_{Utt}$) are concatenated (see Figure 1B).
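For reference, the attention step described above can be written out as follows. This is a sketch under assumptions not stated in the text: the encoder outputs for slot types and slot values ($H_{types}$, $H_{values}$) are taken to be concatenated along the sequence axis, $h_t$ denotes the decoder state at step $t$, $\bar{h}_s$ an encoder output, and $W_a$ the learned matrix of the general score of Luong et al. (2015). All symbols other than $c_{MR}$ and $c_{Utt}$ are introduced here for illustration.

```latex
\begin{align}
\mathrm{score}(h_t, \bar{h}_s) &= h_t^{\top} W_a \bar{h}_s \\
\alpha_{t,s} &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
                     {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \\
c_{MR,t} &= \sum_{s} \alpha_{t,s}\, \bar{h}_s,
            \qquad \bar{h}_s \in [H_{types}; H_{values}] \\
c_t &= [\, c_{MR,t} \,;\, c_{Utt,t} \,]
\end{align}
```

The last line corresponds to the variant with the previous utterance, where $c_{Utt,t}$ is computed by a second attention of the same form, with its own parameters, over the utterance encoder outputs.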
Multi-Encoder, Multi-Decoder: We also performed multi-task learning, jointly training the models for both QA and task-oriented NLG. As shown in Figure 1C, we shared the encoders and corresponding input layers across multiple tasks while we maintained multiple decoders for individual tasks.

Evaluation
As word overlap metrics may not correlate well with human judgment for NLG output evaluation (Stent et al., 2005), we use both objective metrics and human evaluation.
Objective metrics: Besides the standard BLEU score (obtained using the official E2E NLG challenge evaluation script), we report different types of Slot Error Rate (SER). In dialog NLG approaches, SER measures how well the slots realized in the output match those in the input MR. We refer to this metric as $\mathrm{SER}_{mr}$ to differentiate it from the modified versions we introduce next. The formula (Wen et al., 2015) is $\mathrm{SER}_{mr} = \frac{p_{mr} + q_{mr}}{N_{mr}}$, where $N_{mr}$ is the total number of slots in the input MR and $p_{mr}$, $q_{mr}$ are respectively the number of missing and redundant slots in the output. This formula works well for task-oriented NLG approaches, but it assumes a one-to-one relationship between the slots in the input MR and the output text. We found this assumption might not hold for our QA datasets, where not all slots in the input MR need to be realized for the output to be correct, as shown in Table 1, where the first QA reference text would be penalized with 3 missing slots.
For this reason we designed additional metrics tailored for QA. Slot Error Rate Target ($\mathrm{SER}_{trg}$) is a modification of $\mathrm{SER}_{mr}$ where we simply substitute the MR with the main reference text output: $\mathrm{SER}_{trg} = \frac{p_{trg} + q_{trg}}{N_{trg}}$. $\mathrm{SER}_{trg}$ is designed to penalize both missing and redundant slots compared to the target sentence. Hence, using $\mathrm{SER}_{trg}$ the first QA reference text in Table 1 would not be penalized. Slot Error Rate MultiTarget ($\mathrm{SER}_{mtrg}$), on the other hand, penalizes only redundant slots that did not appear in any of the references: $\mathrm{SER}_{mtrg} = \frac{p_{mtrg}}{N_{mtrg}}$, where $N_{mtrg}$ is the number of slots appearing in any reference and $p_{mtrg}$ is the number of slots in the output that did not appear in any reference sentence. To compute $\mathrm{SER}_{mtrg}$ for the model output "kentucky formed in 1792" given the QA MR in Table 1, assume we have two references, "1792" and "kentucky formed in 1792". In this case, $\mathrm{SER}_{mtrg}$ would consider the output correct, as all of its slots appear in at least one of the references.
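The three slot error rates can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the evaluation code used in the experiments: slots are compared by type only, the function names are hypothetical, and the three-slot MR is invented for the example (it is not the MR of Table 1).

```python
from collections import Counter
from typing import List


def slot_error_rate(output_slots: List[str], expected_slots: List[str]) -> float:
    """(missing + redundant) / expected, where the expected slots come from the
    MR (SER_mr) or from the main reference text (SER_trg)."""
    out, exp = Counter(output_slots), Counter(expected_slots)
    missing = sum((exp - out).values())
    redundant = sum((out - exp).values())
    return (missing + redundant) / sum(exp.values())


def ser_multi_target(output_slots: List[str], references: List[List[str]]) -> float:
    """Output slots found in no reference, over slots found in any reference."""
    allowed = set().union(*references)  # assumes at least one non-empty reference
    redundant = sum(1 for slot in output_slots if slot not in allowed)
    return redundant / len(allowed)


# Hypothetical slot types for an MR and its two reference answers.
mr_slots = ["entity", "relStr", "timepoint"]
references = [["timepoint"], ["entity", "timepoint"]]

short_answer = ["timepoint"]           # an answer such as "1792"
full_answer = ["entity", "timepoint"]  # an answer such as "kentucky formed in 1792"

ser_mr = slot_error_rate(short_answer, mr_slots)        # penalized: slots missing
ser_trg = slot_error_rate(short_answer, references[0])  # matches main reference
ser_mtrg = ser_multi_target(full_answer, references)    # covered by a reference
print(ser_mr, ser_trg, ser_mtrg)
```

In this toy case the short answer is penalized under the MR-based metric but not under the target-based one, and the full answer is accepted by the multi-target metric because each of its slots occurs in some reference.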
Human evaluation: In all experiments, for each dataset, we selected a sample of 100 MR-text pairs from the test set. Pairs were randomly selected among those for which all models under comparison in the experiment had generated different output texts. Data for all reported experiments were annotated by 2 human annotators, and final ratings were averaged between the two. In all experiments, annotators were presented with the MR and all outputs of the systems under comparison, and were asked to rate the naturalness and informativeness of the generated output on a 1-6 Likert scale, as in previous NLG dialog evaluations (Gatt and Krahmer, 2018). For the QA datasets, annotators were additionally given the previous question as context. Moreover, for the QA datasets annotators were asked to rate how conversational the output was, on the same Likert scale, and whether or not the output could ultimately be considered an answer to the question (answer), as a binary choice.

Experimental setup
The hyperparameters chosen for our models were determined empirically. Both the encoder and the decoder in all our models had only one layer, as we noticed additional layers did not give improvements. All embeddings were trained from scratch with a fixed dimension of 50. Models were trained using a cross-entropy loss function and the Adam optimizer with a learning rate of 0.001, for 1000 epochs, with early stopping on the validation set. We used mini-batches of size 32. For the NLG models for QA, experiments on QA.1 (not reported due to space limitations) with different encoder combinations showed that the best performance was achieved using all input types (slot type, value, and previous context) with lexicalized (+ 1-Enc Utt lex) or delexicalized (+ 1-Enc Utt delex) previous context. As a baseline we decided to report the architecture with slot types and values, but without the previous context (2-Enc MR (slot types, values)).

Results
Open-domain QA: In our first batch of experiments we test various Encoder-Decoder architectures on our 3 different partitions of QA NLG data.
As we can see from Table 3, in general, the best performance across all QA datasets for both BLEU and $\mathrm{SER}_{trg}$ is achieved by the model using the lexicalized previous question as additional input, followed by the model with the delexicalized one. However, $\mathrm{SER}_{mr}$ results show the opposite picture, where the baseline with only slot types and values performs better (except for QA.2, where the score is close to the model with the delexicalized input) and the model with the lexicalized previous utterance is the worst. $\mathrm{SER}_{mtrg}$ shows, on the other hand, that the context might slightly degrade performance with bigger ontologies in terms of all text references.
Human evaluation, on the other hand, seems in line with the picture depicted by BLEU and $\mathrm{SER}_{trg}$. Table 4 shows the model with the lexicalized context is regarded as the best, closely followed by the model with the delexicalized one, in every metric except for conversational, where the delexicalized model is better. This supports our hypothesis that $\mathrm{SER}_{mr}$ might be a less reliable metric to evaluate NLG QA output. Moreover, although we notice a consistent but not drastic degradation in terms of BLEU and $\mathrm{SER}_{trg}$ as ontologies grow, human evaluation shows an even gentler degradation between QA.1 and QA.3 for many metrics. Interestingly, it seems the ability of all models to give a proper answer to the question (answer) increases from QA.1 to QA.3.
A qualitative analysis showed the baseline model was the one with the most grammatical errors (e.g. "will ferrell 's 's wife is viveca paulin", "no , canada is not the bigger than united states ."). Lexicalizing the previous question improved the grammar and produced short correct answers.
Our multi-task experiments show that the NLG QA task improves fluency on SFX both in terms of objective metrics (in Table 5) and human evaluation (in Table 6). However, training with QA seems to slightly degrade the model's efficiency in generating the correct slots. This is to be expected given the difference in the relation between slots in the MR and the output (one-to-one in SFX, variable in QA.3). As for QA.3 results, it seems the task-oriented NLG task improves QA NLG performance in terms of fluency (BLEU and Naturalness) and slot errors ($\mathrm{SER}_{trg}$ and Informativeness). $\mathrm{SER}_{mr}$ and $\mathrm{SER}_{mtrg}$, however, show a slight degradation. We observe that task-oriented NLG also makes QA NLG output more conversational, while slightly reducing its probability of being an answer to the posed question. A qualitative analysis confirmed that the responses of both QA and task-oriented models trained in the multi-task setting were more grammatical.
Finally, comparing all experiments on QA.3, we notice that although multi-task learning helps, the previous context (either lexicalized or delexicalized) plays a critical role in improving the overall performance.

Conclusions
In this work, we apply the traditional dialog MR-to-text approach to NLG to an open-domain QA setting with much larger ontologies. Our goal was to test the reliability of current approaches to NLG in an environment where the number of slots could be substantial, a requirement that is critical to meet if we want to move towards an integrated NLG module across different domains.
The experiments presented show the feasibility of learning an NLG module for QA using an MR-to-text approach. The performance of NLG models on datasets with progressively bigger ontologies showed a continuous but not drastic decline for most metrics. Moreover, our multi-task learning experiments showed that learning NLG models jointly for QA and task-oriented dialog improves performance on the individual tasks in terms of fluency. Results across different experimental settings also point towards the vital role played by the previous utterance context (delexicalized and especially lexicalized) in improving NLG models for open-domain QA.

Figure 1: Our baseline model (A) and the models with the previous utterance (B) and for multi-task learning (C).

Table 2: Our QA NLG datasets compared to popular (task-oriented) NLG datasets: San Francisco restaurant (SFX) and the NLG E2E challenge (E2E). We report the full size of datasets in terms of MR-text pairs, the number of slot types, DAs, words (computed after delexicalization), domain and whether the dataset comprises the previous utterance or not.

Table 3: Objective metrics results on three QA NLG datasets with increasingly larger ontologies.

Table 4: Human evaluation results on three QA NLG datasets with increasingly larger ontologies.

Table 5: Objective metrics of multi-task learning experiments combining QA and task-oriented dialog NLG.

Table 6: Results of multi-task learning according to human evaluation.