Learning to Plan and Realize Separately for Open-Ended Dialogue Systems

Achieving true human-like ability to conduct a conversation remains an elusive goal for open-ended dialogue systems. We posit this is because extant approaches towards natural language generation (NLG) are typically construed as end-to-end architectures that do not adequately model human generation processes. To investigate, we decouple generation into two separate phases: planning and realization. In the planning phase, we train two planners to generate plans for response utterances. The realization phase uses response plans to produce an appropriate response. Through rigorous evaluations, both automated and human, we demonstrate that decoupling the process into planning and realization performs better than an end-to-end approach.


Introduction
Recent advancements in generative modeling have increased the fluency of generative models. However, several issues persist: lack of coherence in the output and the mere repetition or hallucination of tokens from the training data (Moryossef et al., 2019; Wiseman et al., 2017). One reason could be that the generation task is typically construed as an end-to-end system. This is in contrast to traditional approaches, which incorporate a sequence of steps in the NLG system, including content determination, sentence planning, and surface realization (Reiter, 1994; Reiter and Dale, 2000). A review of literature from psycholinguistics and cognitive science also provides strong empirical evidence that the human language production process is not a monolith (Dell, 1985; Bock, 1996; Bock et al., 2007; Kennison, 2018).
Prior approaches have indeed incorporated content planning into NLG systems, for example in data-to-text generation (Puduppully et al., 2019; Moryossef et al., 2019), as well as in classic work on planning based on speech acts (Cohen and Perrault, 1979) (for an in-depth review, c.f. Garoufi, 2014). Our work closely follows these prior approaches, with one crucial difference: our planners are not based on dialogue acts or speech acts.
Consider the example in Fig. 1. An input utterance by Person B, a statement (Unfortunately no.) followed by a question (What do they do?), can be effectively responded to using plans that are learned and generated prior to the realization phase. The realization output can then include the mention of provides relief, consistent with the generated plan (PERFORM [provides [relief]]).
Dialogue acts (Stolcke et al., 2000) (e.g., statements, questions), by their nature, encompass a wide variety of realized output, and hence cannot sufficiently constrain the language model during the generation process. Research has addressed this issue by adapting existing taxonomies (Stolcke et al., 2000) towards their own goals (Wu et al., 2018; Oraby et al., 2017). We instead use an adapted and extended form of lexical-conceptual structures (LCSs) to help constrain the realization output more effectively (Dorr, 1994).
Our work makes the following contributions:
• We investigate the impact of separating planning and realization in open-domain dialogue and find that the approach produces better responses per automated metrics and detailed human evaluations.
• We propose the use of LCS-inspired representations based on asks and framings, which in turn are grounded in conversation analysis literature, to generate plans, instead of using dialogue acts.
• We release corpora annotated with plans for all utterances, using three planners, including symbolic planners and attention-based planners.

Related Work
Open-Ended Dialogue Systems: Transformer models (Vaswani et al., 2017) and large transformer-based language models such as GPT, GPT-2, XLNet, and BERT (Radford et al., 2018, 2019; Devlin et al., 2019) have helped achieve SOTA performance across several natural language tasks. However, these models do not achieve the same level of consistent performance on generative modeling tasks as they do on language understanding tasks (Ziegler et al., 2019; Edunov et al., 2019). Golovanov et al. (2019) propose a transfer learning approach that fine-tunes large pretrained language models and achieves SOTA scores on the PERSONA-chat dataset and in the CONVAI2 competition (Dinan et al., 2019; Yusupov and Kuratov, 2018). Keskar et al. (2019) introduce a large-scale conditional transformer model that improves generation based on control codes.
Our training paradigm is consistent with existing research that constrains large-scale language models across generation tasks (Rashkin et al., 2019; Urbanek et al., 2019) and yields controllable text generation (Shen et al., 2019; Zhou et al., 2017), with one key difference: we learn to plan and realize separately. Accordingly, we overview planning-based approaches next.
Planning-Based Approaches: A standard component of traditional NLG systems is a planner (Reiter and Dale, 2000). Prior work leverages intent and meaning representations (MRs) to understand the content of the message (Young et al., 2013), but largely in task-oriented as opposed to open-ended dialogue systems (He et al., 2018). Novikova et al. (2017) propose the E2E challenge and use MRs to show lexical richness and syntactic variation. Similarly, Gardent et al. (2017) focus on structured data (e.g., DBpedia) to generate text in the WebNLG framework. Moryossef et al. (2019) use an explicit symbolic planning component in a neural data-to-text generation system that allows controllable generation. Along with conversational intents, dialogue acts are also used for natural language understanding (NLU) in task-oriented systems (Peskov et al., 2019).
In contrast to these prior approaches, our work uses more in-depth meaning representations for open-domain dialogue systems based on lexical conceptual structures (explained in Section 3.1).

NLU using Asks and Framing
The representation we use to generate plans leverages asks and framings based on conversation analysis literature (Pomerantz and Fehr, 2011; Sacks, 1992; Schegloff, 2007). An ask is closely related to the notion of a request (Zemel, 2017). Perhaps most importantly, an ask elicits relevant responses from the recipient. Framing refers to linguistic and social resources used to persuade the recipient of an ask to comply and perform the requested social action. Put another way, an ask creates a social obligation to respond, while framing provides an adequate basis for compliance with the ask. In Fig. 2, we show the ask/framing representational formalism that serves as the basis of our response plans. Here the ask is a request to PERFORM the action of check out the website. The perceived risk or reward (or framing) for this request is that, upon performing the action, one may GAIN something, i.e., gather a lot more information. We use two types of asks: GIVE (provide something or information) and PERFORM (perform an action), and two types of framings: GAIN (gain some benefit) and LOSE (lose benefit or resource). This preliminary ontology was motivated by conversation analysis literature (Sacks et al., 1978; Curl and Drew, 2008; Epperson and Zemel, 2008): by treating utterances as actions, we are able to establish what each utterance seeks to accomplish and how a sender motivates the recipient in terms of the benefits and costs of compliant responses.

Figure 3: Architecture diagram of our system consisting of two phases: Planning and Realization. The Planning phase (Context and Pseudo Self Attention) encodes the input sequence and symbolic planner input to produce the response plans. The Realization phase uses the response plan and input utterance to generate the response.
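The ask/framing constituents described above (type, action, target) can be captured in a small data structure. The following Python sketch is purely illustrative; the class and field names are our assumptions rather than the authors' actual schema, although the serialized form mirrors the plan notation used in the paper (e.g., PERFORM [provides [relief]]):

```python
from dataclasses import dataclass

# Illustrative ontology from the paper: two ask types and two framing types.
ASK_TYPES = {"GIVE", "PERFORM"}
FRAMING_TYPES = {"GAIN", "LOSE"}

@dataclass
class PlanConstituent:
    type: str    # ask or framing type, e.g. "PERFORM" or "GAIN"
    action: str  # the main action, e.g. "check out"
    target: str  # the argument of the action, e.g. "the website"

def render(plan: PlanConstituent) -> str:
    """Serialize a constituent in the bracketed plan notation,
    e.g. PERFORM [check out [the website]]."""
    return f"{plan.type} [{plan.action} [{plan.target}]]"
```

A plan for a response utterance would then be a sequence of such constituents, one per ask or framing detected in the utterance.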

Method
Our goal is to generate an informative response to the input utterance by first generating an appropriate Response Plan. We train two components separately (c.f. Fig. 3). In the Planning Phase, we experiment with generating plans in three ways: 1. Symbolic Planner: Foremost, we need to extract plans automatically from utterances. To accomplish this goal, our symbolic planner adapts lexical representations previously used for language analysis (Dorr et al., 2020) to the problem of constructing Response Plans. We use lexical conceptual structures and basic language processing tools (Gardner et al., 2017;Manning et al., 2014) for parsing the input, identifying the main action, identifying the arguments (or targets), and applying semantic-role labeling. Fig. 2 presents ask/framing examples (type, action and target).
Once response plans are identified for all utterances in a given corpus using the symbolic planner, we need to address automated generation of such plans. Using the asks and framings as annotated data for a "silver" standard, 1 we train models to learn to generate "Response Plans" that are encoded with the same representation format used for asks/framings. We use the language modeling paradigm with a large pre-trained model (GPT-2) (Radford et al., 2019) based on the transformer architecture and the self-attention mechanism (Vaswani et al., 2017). We fine-tune this language model with the constraint of the input utterance and the plan for this input utterance, and train it to produce the plan for the response utterance. We adopt the fine-tuning approach specified by Ziegler et al. (2019) and train two specific models (CTX and PSA), described below.
2. Context Attention Planner (CTX): based on the encoder/decoder architecture. In this model, the decoder weights are initialized with the pre-trained weights of the language model. However, a new context attention layer is added in the decoder that concatenates the conditioning information to the pre-trained weight. The conditioning information, in our case, is the plan for the input utterance.
3. Pseudo Self Attention (PSA): Proposed by Ziegler et al. (2019), PSA injects conditioning information from the encoder directly into the pretrained self-attention layers (similar to the "zero-shot" model proposed by Radford et al. (2019)).
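The core idea of pseudo self-attention can be sketched in a few lines. This is a simplified single-head NumPy illustration under our own assumptions (no masking, no multi-head splitting, illustrative weight names), not the authors' implementation: the conditioning sequence is projected with newly initialized weights and prepended to the pretrained key/value streams, while queries still come only from the decoder input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_self_attention(X, C, Wq, Wk, Wv, Wck, Wcv):
    """Single-head sketch of pseudo self-attention (Ziegler et al., 2019).
    X: decoder input tokens, shape (n, d); C: conditioning tokens, shape (m, d).
    Wq/Wk/Wv play the role of pretrained projections; Wck/Wcv are the new
    projections for the conditioning sequence."""
    Q = X @ Wq                                      # queries from input only
    K = np.concatenate([C @ Wck, X @ Wk], axis=0)   # keys: conditioning + input
    V = np.concatenate([C @ Wcv, X @ Wv], axis=0)   # values: conditioning + input
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # (n, m+n) attention weights
    return A @ V                                    # contextualized outputs (n, d)
```

Because the conditioning enters through the existing attention mechanism, the pretrained weights need not be modified structurally, which is what makes this approach attractive for fine-tuning.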
In the Realization Phase, we generate responses by utilizing the response plan generated in the planning phase as well as the input utterance. We expect a more guided generation of responses constrained by the response plan. In this phase, we only experiment with the Pseudo Self Attention (PSA) model, based on Ziegler et al. (2019), who demonstrate that PSA outperforms other approaches on text generation tasks. We use nucleus sampling to overcome some of the drawbacks of beam search (Holtzman et al., 2020).

Corpora
Our choice of corpora is driven by the presence of information elicitation and persuasive strategies in the utterances (i.e., asks and framings).
Accordingly, we experiment with the AntiScam and Persuasion for Social Good (Wang et al., 2019) corpora. AntiScam contains dialogues about a customer service scenario and is specifically crowdsourced to understand human elicitation strategies. The Persuasion for Social Good corpus contains interactions between workers who are assigned the roles of persuader and persuadee, where the persuader attempts to convince the persuadee to donate to a charity. All utterances in these corpora are first annotated through the Symbolic Planner (c.f. Section 3.2) to gauge suitability based on the presence of asks and framings. In Table 1, we provide descriptive statistics of the corpora; we find an adequate number of ask/framing types (GIVE, PERFORM, GAIN, LOSE). In cases where there are no asks/framings or the symbolic planner fails to detect them, we use the default action RESPOND.

Implementation
We implement the models using OpenNMT (Klein et al., 2017) and the PyTorch framework. 2 We use the publicly available GPT-2 model (Radford et al., 2019) with 117M parameters, 12 layers, and 12 heads. The input utterances and the plans are tokenized using byte-pair encoding to reduce vocabulary size (Sennrich et al., 2015). Both phases are trained separately. In the Planning Phase, the input utterance along with its plan is used to generate the response plan for the response utterance; in the Realization Phase, the response plan and input utterance are input to the model to generate the response. In both phases, separation tokens are added (e.g., <plan>), as is common practice for transformer inputs (Devlin et al., 2019). We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005, β1 = 0.9, and β2 = 0.98. During decoding, we use nucleus sampling in both the planning and realization phases. All models are trained on two TitanV GPUs and take roughly 15 hours each to train the planner and realization components. The trained models and the codebase are available at https://github.com/sashank06/planning_generation
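The nucleus sampling used during decoding can be sketched as follows. This is a minimal re-implementation of the idea from Holtzman et al. (2020), not the paper's code, and the default `p=0.9` is illustrative rather than the authors' reported setting:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and sample from it.
    `probs` is a 1-D probability distribution over the vocabulary."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token ids by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # smallest prefix covering mass p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renorm))
```

Unlike beam search, this samples from the truncated distribution at each step, which avoids the bland, repetitive completions that maximization-based decoding tends to produce.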

Evaluation of Approach
The results reported in these subsections were obtained by combining both corpora and dividing the data randomly in an 80/10/10 ratio into training, testing, and validation sets.
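The split described above can be sketched as follows; the function name and seed are our own choices, since the paper does not specify its splitting code:

```python
import random

def split_dialogues(examples, seed=13):
    """Randomly divide the combined corpora 80/10/10 into
    training, testing, and validation sets (seed is illustrative)."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid
```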

Planning Phase Evaluation
This evaluation focuses on investigating the efficacy of the two automated planners (Context Attention (CTX) and Pseudo-Self Attention (PSA)) in learning to generate response plans.

Automated Metrics
Are the automated planners able to faithfully learn how to generate the response utterance plans? To investigate, we compare the performance of the CTX and PSA planners with the symbolic planner output (our silver standard reference) using common automated metrics, computed with the library by Sharma et al. (2017). We find that PSA achieves higher word-overlap scores with respect to the silver standard. We conducted an in-depth analysis of the CTX and PSA planner output on the entire testing set and found that the PSA model was more likely to produce ask actions that matched the ground truth, resulting in higher scores on the automated metrics.

Table 2: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), and CIDEr (Vedantam et al., 2015) scores on the test set.

Human Evaluation
Evaluation using automated metrics provides limited evidence for the ability to automatically generate plans; we do not know if these plans are actually useful in a realization task. The question then is: How well-suited are the automatically learned plans for the task of generating responses?
Study 1: We asked two experts in linguistics to independently rate 40 randomly sampled plans from the test set. For context, we provided the input utterance and its plan produced by the symbolic planner. Their task was to choose which of the learned response plans was better suited to the realization task (CTX, PSA, Both, or Neither). They also evaluated the plan constituents (type, action, and target). We randomized the presentation order of the planner outputs across questions to avoid ordering/learning effects (Medin and Bettger, 1994). We find an inter-rater agreement (Shrout and Fleiss, 1979) of 0.5 (p < 0.001) between the linguists. Table 3 shows the results from Study 1. From Q1, we find that the CTX planner is better suited to generate an appropriate response than the PSA planner. Similarly, through Q2, Q3, and Q4, we find that the CTX planner is better able to generate the appropriate ask/framing types, actions, and targets. We also find that the linguists rated Neither plan as suited to generate a response 10% of the time. Put differently, the automatically generated plans would work 90% of the time to generate an appropriate utterance in the realization phase. The learned plans have trouble associating an appropriate ask/framing type and target (28.75% and 26.75% Neither ratings) but perform better with the ask/framing action (18.75% Neither rating).
This evaluation compares the automatic planners against one another, but how well do the planners compare to the silver standard (symbolic planner)?
Study 2: We asked the same linguistic experts to independently determine which of two plans (symbolic vs. each automated planner) would be more appropriate to generate a response. This study design is consistent with prior studies in dialogue evaluation (Mei et al., 2017). Table 4 presents the results from Study 2. We find that experts prefer the plans produced by the symbolic planner over the CTX output but not over the PSA planner output. Inter-annotator agreement (Shrout and Fleiss, 1979) between the experts for this study was 0.54. While Study 1 compared CTX and PSA planner outputs against one another, Study 2 compared CTX and PSA outputs against the silver standard. As we observe from the automated metrics (Table 2), PSA model plans are more faithful to the ground truth, e.g., higher BLEU 1-4 scores than CTX model plans. Since PSA planner outputs are more faithful to the ground truth, this may be why human judges rate them as preferable more often when compared against the ground truth.
Planning Phase Evaluation Findings: To summarize this evaluation section, we find: PSA outperforms the CTX planner on automated metrics. This finding is consistent with the results from Ziegler et al. (2019). From Study 1, we find that both the planners are able to generate appropriate plans, with the appropriate ask/framing type, action, and target for the realization phase, a large proportion of the time. From Study 2, we find that when compared to the silver standard plans, PSA planner output is preferred over the CTX planner.

Realization Phase Evaluation
While the previous section focuses on evaluating the ability to generate plans automatically, we do not yet know whether separating the generation process into planning and realization produces better responses than an end-to-end system.
Thus, we compare four approaches to realizing a response given an input utterance (through the Pseudo Self Attention fine-tuned realization algorithm): (1) a No Planner model, which receives only the input utterance, and three models that additionally receive the response plan generated by (2) the Symbolic Planner, (3) the CTX planner, or (4) the PSA planner.

Automated Metrics
Prior research has shown that most automated metrics have little to no correlation with human ratings on NLG tasks (Liu et al., 2016; Santhanam and Shaikh, 2019); however, they may provide some standard of reference to evaluate performance. We report the following metrics: (i) BLEU (Papineni et al., 2002); (ii) length of responses, with the understanding that models able to generate longer responses are better; (iii) following Mei et al. (2017), the diversity metric (Li et al., 2016a), calculated as the number of distinct unigrams in the generation scaled by the total number of generated tokens (Mei et al., 2017; Li et al., 2016b); and (iv) BERT-Score (Zhang et al., 2020), an embedding-based metric which has shown greater correlation with human ratings. Table 5 reports the automated evaluation against the ground truth utterances. We find that on both corpora and across all metrics except Diversity, incorporating plans as an additional input to the realization phase helps achieve a higher score than having No Planner. From Table 5, we also find that the realizer without any plans achieves higher diversity, but the difference is not statistically significant.
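The diversity metric follows directly from its definition above; this one-function sketch is ours, not the authors' implementation:

```python
def diversity(tokens):
    """Diversity (Li et al., 2016): the number of distinct unigrams
    scaled by the total number of generated tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

A corpus-level score would average this quantity over all generated responses.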

Human Evaluation
Since automated metrics are not the most informative indicators of the quality of generated responses, thorough human evaluation is necessary. We investigate whether humans prefer the responses generated by the planner-based models over those generated without a plan (No Planner). We conducted two human evaluation studies by recruiting workers from the Amazon Mechanical Turk service with strict quality control criteria: workers should have at least a 90% HIT approval rate and at least 1000 approved HITs. In each survey, workers are asked to evaluate responses on these metrics, following Novikova et al. (2018): (i) Appropriateness: determines whether the response aligns with the topic of the conversation and the input utterance; (ii) Quality: determines the overall quality in terms of grammatical correctness, fluency, and adequacy; (iii) Usefulness: determines whether the response is highly informative for generating a further response.

Table 6: Average ranking of realized output from four different planners; lower score is better.

Study 1: We tasked 30 crowd-sourced workers to rank order the four model responses from best to worst. We randomly sampled 60 examples from the test set with an even 50% split (30 examples each) between the Persuasion for Social Good and Anti-Scam corpora. We chose the best-to-worst ranking mechanism since it has shown greater consistency and agreement amongst workers on tasks related to dialogue evaluation than other evaluation designs (e.g., Likert scales) (Santhanam et al., 2020; Kiritchenko and Mohammad, 2017). The presentation order of model outputs for each question was again randomized to avoid learning effects (Medin and Bettger, 1994). Table 6 shows the average rank position (1=Best, 4=Worst) obtained by each model. We find that using the plans generated by the CTX planner helps generate better responses. On the metrics of quality and usefulness, we find that incorporating planning as additional input performs better than no plan (i.e., an end-to-end system).
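Average rank positions of the kind reported in Table 6 can be computed from best-to-worst rankings as follows; the model names in the test below are shorthand labels of our own, and lower averages are better:

```python
from collections import defaultdict

def average_ranks(rankings):
    """Mean rank position per model. `rankings` is a list of lists,
    each an ordering of model names from best (rank 1) to worst."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ranking in rankings:
        for position, model in enumerate(ranking, start=1):
            totals[model] += position
            counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}
```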
Study 2: In this study, we evaluate how well the generated responses compare to the ground truth. The ground truth references are those produced by humans in the PSG and Anti-Scam corpora. We recruited 11 MTurk workers with the same crowdsourcing quality controls as Study 1. For the same randomly sampled 60 examples from Study 1, workers were asked whether they prefer the ground-truth response, the response generated from the three planners, or both, on the three chosen metrics. This study design is also consistent with prior work (Mei et al., 2017). Workers were blinded to the source of the response (ground truth or generated) and were presented the responses in a randomized order across all questions to avoid ordering effects. Fig. 4 shows the results (higher value/darker color is better): we find that responses generated from the symbolic planner as input do not perform well when compared to the ground truth. In other words, the proportion of time that the ground truth response is preferred over that generated by the symbolic planner is significant (e.g., 53% vs. 26% on the Appropriateness metric overall).
We find that on all three metrics, the responses generated using CTX and PSA plans are comparable to the responses produced by humans (ground truth). We also find that the PSA planner-based responses perform better overall and on the Persuasion for Social Good corpus. Surprisingly, the CTX planner-based responses perform better than the ground truth utterances for the Anti-Scam corpus (preferred 45%, 48%, and 48% of the time vs. 35%, 37%, and 37% for the ground truth on the three metrics Appropriateness, Quality, and Usefulness, respectively). We explain this unexpected finding in the next subsection (Section 4.3).

Figure 5: Sample outputs from the realization phase with all variations of planner input, as well as the ground truth response from the corpus.
Realization Phase Evaluation Findings: To summarize this evaluation subsection, we find that the Symbolic Planner-realized output outperforms the CTX, PSA, and No Planner output on the automated metrics of BLEU and BERT-score. Importantly, the CTX planner-realized output ranks higher in terms of overall preference in human evaluation than the other models (c.f. Table 6). We also find that human-generated utterances (ground truth) are preferred overall to the model outputs (c.f. Fig. 4). We found inter-rater consistency and agreement scores to be >0.6 on average across the metrics (full tables are reported in the Appendix).

Table 7: Issues found, with examples (columns: Input Utterance and Context; Generated Plan for Response / Generated Response). Row 1 illustrates a non-informative ask/framing target for the input The money goes directly to the organization in order to help.

Qualitative Analysis
We conduct a qualitative evaluation of the outputs and present several cherry- and lemon-picked examples here. Additional examples of success and failure cases are provided in the Appendix. In the sample conversation shown in Figure 5, we find that realized outputs using CTX and PSA plans are more consistent with the context of the conversation than the symbolic planner approach. Additionally, the No Planner output (an end-to-end system which does not get a plan as an additional input) produces an utterance that may not necessarily continue the conversation further. This example is also illustrative of the finding in Study 2 of the Planning Phase evaluation, where raters preferred the automated planner-based outputs over the symbolic planner-based outputs (c.f. Fig. 7). This might seem contradictory, as the CTX and PSA planners are trained on the silver standard data from the symbolic planner. We contend that this is due to the ability of the automated planners (CTX and PSA) to generalize, an ability the symbolic planner lacks. In such cases, as shown in Fig. 5, the symbolic planner defaults to the RESPOND message plan, and this leads to the generated output That is not an exact word, which is generic and off-topic. The symbolic planner could be improved to cover more cases; however, the effort would not be scalable.
While we find promising results for the automatically generated planners in Sections 4.1 and 4.2, areas of improvement do exist (Table 7): Non-Informative Ask/Framing Targets: We find several examples where the ask/framing targets are non-informative words (e.g., this, that). Non-informative targets can cause the downstream realization process to generate an utterance that is, in turn, also non-informative. One example of such cases is shown in Row 1 of Table 7.
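A simple heuristic could flag such plans before realization by checking whether the target contains any contentful word; the stoplist and function name below are illustrative, not from the paper:

```python
# Hypothetical filter for the non-informative target issue described above:
# a target counts as informative only if it contains at least one word
# outside this (illustrative) stoplist of pronouns and demonstratives.
NON_INFORMATIVE = {"this", "that", "it", "something", "thing", "one"}

def is_informative_target(target: str) -> bool:
    words = target.lower().split()
    return any(w not in NON_INFORMATIVE for w in words)
```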
Wrong Type and Action: Another planning phase issue category is that the constituents of the plan representation (e.g., the ask/framing type and action) can be incorrect. As illustrated by the example in Table 7, an ask target of why got is incorrect. Typically, we would expect to find a noun or a noun phrase as the ask/framing target (e.g., your billing date and names, as shown in the plan in Row 3).
Ignored Plan: In the Realization phase, a typical issue is that the realizer may ignore the generated plan. As can be seen in Row 3 of Table 7, the plan should constrain the response, which should thus contain phrases such as finding your billing date and names. However, the generated response is instead a generic phrase, Okay, thanks!
Grammatical inconsistencies: We also note cases where the grammar, e.g., pronoun usage, is inconsistent. For the example shown in Row 4 of Table 7, the generated response is They help with that., whereas the conversation is between two persons; a generated response of I can help with that would be more consistent with the context of the conversation.

Conclusion and Future Work
We address the task of natural language generation in open-ended dialogue systems. We test our hypothesis that decoupling the generation process into planning and realization can achieve better performance than an end-to-end approach.
In the planning phase, we explore three methods to generate response plans, including a Symbolic Planner and two learned planners, the Context Attention and Pseudo Self Attention models. Through linguist expert evaluation, we are able to determine the efficacy of the response plans towards realization. In the realization phase, we use the Pseudo Self Attention model to make use of the learned response plans to generate responses.
Our key finding, through two separate human crowdsourced studies, is that decoupling the planning and realization phases outperforms an end-to-end No Planner system across three metrics (Appropriateness, Quality, and Usefulness).
In this work, we have taken an initial step towards the goal of replicating human language generation processes. Thorough and rigorous evaluations are required to fully support our claims, e.g., by including additional metrics and more diverse corpora. In this work, we limit the types to GIVE, GAIN, LOSE, and PERFORM. However, we do not restrict the ask action and target at all. Also, since our symbolic planner can be used to obtain silver standard training data, straightforward changes like adding additional lexicons would enable us to generalize to other corpora as well as include additional ask types in our pipeline. Another natural extension would be to explore training the planning and realization phases together in a hierarchical process (Fan et al., 2018). This would, in principle, further validate the efficacy of our approach.

Acknowledgments
This work was supported by DARPA through AFRL Contract FA8650-18-C-7881 and through Army Contract W31P4Q-17-C-0066. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of DARPA, AFRL, Army, or the U.S. Government.

A.3 Additional Output Examples
In this section, we give additional examples of conversations from our test set. Realization output based on each planner configuration of the system is included. In Table 11, we provide additional examples of the issues we found through manual inspection of the outputs.

Figure 6: Example conversation between two speakers A & B from the test set. In the case of the realizer output from the Symbolic Planner and the PSA Planner, the responses also include an ask (e.g., what is that for? and Are you involved with them?), which may serve to carry the conversation further, as compared to other responses.

Figure 7: Another conversation between two speakers A & B from our test set. The ground truth response in this case was lengthier than the typical response (consisting of 73 words) and has been shortened here for ease of presentation. In this example, we noticed that the PSA planner-based model generates an incoherent response.