Neural Conversation Model Controllable by Given Dialogue Act Based on Adversarial Learning and Label-aware Objective

Building a controllable neural conversation model (NCM) is an important task. In this paper, we focus on controlling the responses of NCMs by using dialogue act labels of responses as conditions. We introduce an adversarial learning framework for the task of generating conditional responses, with a new objective for the discriminator that explicitly distinguishes sentences by their labels. This change strongly encourages the generation of label-conditioned sentences. We compared the proposed method with existing methods for generating conditional responses. The experimental results show that our proposed method has higher controllability for dialogue acts while achieving naturalness that is higher than or comparable to that of existing methods.


Introduction
A dialogue act is defined as the intention or the function of an utterance in dialogues. Dialogue act labels are defined as unique classes to distinguish between kinds of dialogue acts (Boyer et al., 2010;Bunt et al., 2012). Some existing studies have exploited the dialogue act as a component in modeling the dialogue strategy of dialogue systems (Meguro et al., 2010;Yoshino and Kawahara, 2015;Shibata et al., 2016;Keizer and Rieser, 2017).
Neural conversation models (NCMs), which learn a direct mapping between a dialogue history and a response utterance, are widely researched as a scalable approach to building non-task-oriented dialogue systems (Vinyals and Le, 2015). However, it is difficult to control their responses on the basis of actual constraints such as dialogue act classes. Some existing studies have tackled this problem of controlling responses from NCMs by using actual labels; however, these models still have some limitations (Wen et al., 2015; Li et al., 2016; Sun et al., 2017). One crucial issue is that they do not have any explicit training objective to guarantee that a generated response is discriminable with respect to a given condition.
We extend the generative adversarial network framework for sequential generation to improve the controllability of NCMs under the constraint of a given dialogue act condition. We propose an adversarial learning framework that alternately trains a conditioned generator and a conditioned discriminator. The discriminator has a multi-class objective that explicitly classifies a generated response into the appropriate dialogue act class. This improves the discriminability of generation.
In this paper, we first describe the task of conditional response generation given a dialogue act label and its existing approaches (Section 3). Second, we introduce an adversarial learning framework and extend its architecture and objective to fit the problem of conditional generation (Section 4). In experiments, we use metrics to evaluate the controllability and naturalness of responses (Section 5). The experimental results show that our proposed model achieved the best controllability score in both automatic and human subjective evaluations, while achieving naturalness better than or comparable to that of existing methods (Section 6).

Related Work
Dialogue systems that have dialogue management modules determine a dialogue act or dialogue state of a system response by using statistical methods such as reinforcement learning (Meguro et al., 2010; Yoshino and Kawahara, 2015; Keizer and Rieser, 2017). Response generation modules generate responses according to these dialogue acts or dialogue states on the basis of rules, templates, agendas, or other statistical models (Oh and Rudnicky, 2000; Xu and Rudnicky, 2000). Recently, neural network based generation modules have been widely used. Wen et al. (2015) proposed a conditional language model (Semantically Conditioned Long Short-Term Memory; SC-LSTM) for task-oriented systems, which generates utterances on the basis of given dialogue acts and frames in the domain of restaurant navigation dialogue by using a gating mechanism. However, the training framework of SC-LSTM requires state frames that fully express the function and contents of target utterances. Thus, it is not realistic to apply this method to building an open-domain dialogue system. Another line of work proposed an NCM based on a variant of the conditional variational autoencoder (CVAE), which generates responses with high discourse-level diversity by using latent variables as dialogue acts. However, this model has no mechanism to guarantee that generated responses are discriminable for given dialogue acts.
There is another research trend in controlling NCMs with a given condition, such as speaker or emotion labels (Li et al., 2016; Sun et al., 2017). These NCMs are optimized with softmax cross-entropy loss (SCE-loss), which calculates a loss word-by-word. However, such training objectives do not necessarily guarantee that a generated response has high discriminability for a given class label. In other words, SCE-loss is not an appropriate objective for explicitly evaluating whether a generated response reflects the property of the given class label. Therefore, the generated responses will be biased toward majority class labels.
To prevent these problems, we introduce the framework of the generative adversarial network (Li et al., 2017a; Tuan and Lee, 2019). This framework makes it possible to consider the total quality of a generated sequence, unlike SCE-loss, which is optimized for each token. We extend adversarial networks to generate qualified and controlled sentences given a condition, especially dialogue act labels.

Response Generation Conditioned by Dialogue Act Label

Task Settings
The task we focus on is building a controllable NCM given a condition, typically a dialogue act label. The problem is defined as generating the ith response word sequence R_i = {w_1, w_2, ..., w_T} given a dialogue history M = {M_{i-1}, M_{i-2}, ..., M_{i-n}} and a dialogue act label d_i. Here, n is the length of the dialogue history, and T is the number of words in an utterance. As shown in Figure 1, a response R_i is required to satisfy not only the behavioral characteristics of the given dialogue act but also appropriateness in the dialogue context (i.e., the history). One of the simplest approaches to building such a conditional generation system in an NCM is adding the class label to the input of the decoder (Li et al., 2016; Sun et al., 2017). We describe this baseline in the following section.

Conditional NCM with Dialogue Acts
We introduce a general conditional NCM conditioned by dialogue act labels as a baseline. We built an NCM based on a hierarchical encoder-decoder model that explicitly gives the label to the decoder at every decoding step, as shown in Figure 2.
Recurrent neural networks (RNNs) such as long short-term memory (LSTM) are generally used to model the sequential generation of responses in NCMs (Hochreiter and Schmidhuber, 1997; Vinyals and Le, 2015). The encoder receives a word at each time step and uses a forward RNN to encode an utterance into a fixed-length vector (utterance encoder). The utterance vectors of a dialogue are then fed, in temporal order, into another RNN that encodes the dialogue context into a fixed-length vector (dialogue encoder). The resultant vector is fed into the decoder to generate a response sentence (word sequence). We used the same encoder architecture as Tian et al. (2017).
In the decoding steps of the NCM, the decoder receives the previous hidden vector h_{t-1}, memory cell c_{t-1}, and generated word w_{t-1} to generate a word w_t. Here, t is the time step of generation. In conditional generation, the model is changed to receive not only the previous word w_{t-1} but also the dialogue act label d. The vector representations of d and w_{t-1} are concatenated and used as the input of the decoder at time step t. The decoder then predicts the next word at each step. This architecture is the same as in the conditional NCMs cited above.
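The concatenation of the label and word embeddings can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the embedding sizes (256 for words, 100 for dialogue acts, matching the training settings reported later) are only used as an example.

```python
import numpy as np

def decoder_input(prev_word_emb: np.ndarray, act_emb: np.ndarray) -> np.ndarray:
    """Concatenate the embedding of the previous word w_{t-1} with the
    dialogue act embedding d to form the decoder input at step t."""
    return np.concatenate([prev_word_emb, act_emb], axis=-1)

# A 256-dim word embedding and a 100-dim act embedding
# yield a 356-dim decoder input vector.
x = decoder_input(np.zeros(256), np.zeros(100))
```

Because the same act embedding is fed at every step, the label influences every word decision rather than only the first one.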
Softmax cross-entropy loss (SCE-loss) is widely used to train the model:

L_SCE = -log( exp(x_c) / Σ_{k=1}^{|V|} exp(x_k) ).

Here, |V| indicates the vocabulary size, x ∈ R^{|V|} indicates the output of the projection layer in the decoding steps, x_k indicates the kth element of x, and x_c is the element corresponding to the target word c. SCE-loss optimizes the prediction of words at each decoding step. However, it does not use the information of the given dialogue act label in the loss calculation during training. Thus, the resultant model often generates a response that ignores the given dialogue act label, or a response biased toward the majority dialogue act labels in the training data. We tackle this problem by introducing an explicit training objective for generating a conditioned word sequence in adversarial learning.
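As a concrete illustration, SCE-loss for a single decoding step can be computed as below (a minimal NumPy sketch under the definitions above, not the paper's code):

```python
import numpy as np

def sce_loss(logits: np.ndarray, target: int) -> float:
    """Negative log-probability of the target word under
    softmax(logits); logits has one entry per vocabulary word."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-log_probs[target])
```

Note that the loss depends only on the target word index, never on the dialogue act label, which is exactly the limitation discussed above.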

Conditional Response Generation Based on Adversarial Learning
We introduce sequential generative adversarial networks (SeqGANs) (Li et al., 2017a; Tuan and Lee, 2019) to improve the controllability and quality of conditional response generation. SeqGAN is a promising approach to preventing the problems caused by SCE-loss based training because it can evaluate not only the word prediction at each decoding step but also the whole quality of a generated sequence. In this section, we first describe the architecture of SeqGAN (Section 4.1) and then propose our extension of SeqGAN to realize conditional response generation with given dialogue act labels (Section 4.2).

SeqGAN for Response Generation
The generation process in SeqGAN is formalized as a Markov decision process (MDP) and optimized with reinforcement learning (RL) (Li et al., 2017a; Tuan and Lee, 2019). The problem of response generation in NCMs is to generate a response word sequence R = {w_1, ..., w_T} given a dialogue context M. The word selection process in generation is defined as an action sequence generated by the current policy in the MDP. In SeqGAN, the generator generates a sentence according to the current policy, and the discriminator gives an evaluation score to the generated sentence after generation. The evaluation score is fed back as a reward to update the policy of the generator with RL. We use the policy gradient (Williams, 1992) to train the policy. The objective function and its gradient are defined as follows:

J(θ) = E[ Q_{D_φ}^{G_θ}((w_{1:T-1}, M), w_T) ],

∇_θ J(θ) ≃ Σ_{t=1}^{T} Σ_{w_t ∈ V} Q_{D_φ}^{G_θ}((w_{1:t-1}, M), w_t) ∇_θ p(w_t | w_{1:t-1}, M).

Here, θ is the parameter of the policy, w_{1:t-1} indicates the already generated word sequence, V is the vocabulary, and p is the generative probability of word w_t ∈ V. Q_{D_φ}^{G_θ}((w_{1:t-1}, M), w_t) is an action-value function that gives the expected future reward when the system takes the action of generating word w_t given the state, i.e., the already generated word sequence w_{1:t-1} and the dialogue context M. φ is the parameter of the discriminator. The discriminator outputs the reward only after the whole sequence has been generated. Thus, the value of the action-value function Q_{D_φ}^{G_θ}((w_{1:t-1}, M), w_t) at each intermediate step is estimated by a Monte Carlo tree search (MCTS) under the current policy with parameter θ (Browne et al., 2012).
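The Monte Carlo estimate of the action value can be sketched as follows. Here `rollout_fn` and `reward_fn` are hypothetical stand-ins for sampling a completion from the current policy and scoring a finished sequence with the discriminator:

```python
import numpy as np

def q_value(prefix, context, rollout_fn, reward_fn, n_rollouts=5):
    """Estimate Q((w_{1:t-1}, M), w_t) by completing the partial
    sequence n_rollouts times with the current policy and averaging
    the discriminator's reward over the finished sequences."""
    rewards = [reward_fn(rollout_fn(prefix, context), context)
               for _ in range(n_rollouts)]
    return float(np.mean(rewards))
```

This also makes the cost structure visible: every intermediate step triggers several full rollouts plus discriminator calls, which motivates the speed-up discussion in Section 4.3.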
The discriminator is trained to classify between generated sentences (fake) and sentences from the training data (real). Its training objective is defined as

min_φ − E_{R∼p_data}[ log D_φ(R, M) ] − E_{R̂∼G_θ}[ log(1 − D_φ(R̂, M)) ].

The generator and the discriminator are trained alternately so that the two networks improve adversarially.
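The binary objective can be written as a small function over batches of discriminator scores in (0, 1); this is a sketch for illustration, not the paper's implementation:

```python
import numpy as np

def discriminator_loss(real_scores, fake_scores):
    """-E[log D(real, M)] - E[log(1 - D(fake, M))] averaged
    over batches of discriminator outputs."""
    real = np.asarray(real_scores, dtype=float)
    fake = np.asarray(fake_scores, dtype=float)
    return float(-(np.log(real).mean() + np.log(1.0 - fake).mean()))
```

The loss shrinks as the discriminator pushes real scores toward 1 and fake scores toward 0, which in turn sharpens the reward signal fed back to the generator.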

SeqGAN for Conditional Response Generation with Dialogue Acts
The generator and the discriminator in SeqGAN are extended to produce responses according to given dialogue acts. The adversarial framework jointly optimizes both networks: a generator network that produces response utterances under specified dialogue acts, and a discriminator network that distinguishes between generation (fake) and training data (real) reflecting their conditions. As the generator network, we applied the conditional NCM described in Section 3.2. The policy, its gradient, and the action-value function are additionally conditioned on the dialogue act label d; for example, the generative probability becomes p(w_t | w_{1:t-1}, M, d) and the action-value function becomes Q_{D_φ}^{G_θ}((w_{1:t-1}, M, d), w_t).
As the discriminator network, we incorporated dialogue act labels into the classification model (Figure 3). In the discriminator, the utterance encoder converts dialogue contexts into fixed-length vectors and uses them as features for discrimination. We propose two discriminator variants for incorporating dialogue act label information: implicit and explicit. Each is described in the following sections.

Binary Objective; Implicit-Discriminator
We built a simple extension of the discriminator that incorporates dialogue acts into its feature vectors. We call this architecture "implicit." This discriminator is a binary real/fake classifier D_φ(R, M, d) whose input features include the dialogue act label, trained with the objective

min_φ − E_{(R,d)∼p_data}[ log D_φ(R, M, d) ] − E_{R̂∼G_θ}[ log(1 − D_φ(R̂, M, d)) ].

We expect that the implicit discriminator can use the dialogue act information as a feature and classify generated results as fake if they do not follow the given dialogue act (Figure 3, lower right). Some works take similar approaches in emotional response generation (Sun et al., 2018; Kong et al., 2019). However, this discriminator is still a simple extension of the standard discriminator, classifying responses into only two classes. In other words, the objective itself is unchanged; thus, it probably has difficulty distinguishing the class (dialogue act label) of responses. We propose another discriminator to solve this problem in the next section.
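A minimal sketch of the implicit variant: the dialogue act embedding is just one more input feature to a standard binary real/fake scorer. The feature sizes and the single linear scoring layer here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def implicit_score(resp_vec, ctx_vec, act_vec, w, b=0.0):
    """Implicit discriminator: concatenate response, context, and
    dialogue act features, then output one real/fake probability."""
    feats = np.concatenate([resp_vec, ctx_vec, act_vec])
    return float(sigmoid(feats @ w + b))
```

Nothing in the objective forces the model to use `act_vec`: it may learn to ignore the label entirely, which is the weakness the explicit variant addresses.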

Multi-class Objective; Explicit-Discriminator
We propose extending the classification problem of the discriminator from the binary classification of fake/real to a multi-class classification over the target dialogue act classes (Figure 3, upper right). This discriminator has a multi-class objective for (N+1)-class classification, where N is the number of unique dialogue act classes and the additional class marks generated (fake) responses. We call this architecture "explicit." Its objective function is defined as

min_φ − E_{(R,d)∼p_data}[ log D_φ(d | R, M) ] − E_{R̂∼G_θ}[ log D_φ(fake | R̂, M) ].

We used the posterior probability D_φ(d | R, M) estimated by the discriminator as the reward for the generator. We expect this discriminator to encourage the generator to produce sentences that are discriminative with respect to dialogue acts, because generations that follow the manner of a different dialogue act are penalized even if they are natural. Odena (2016) proposed a similar idea of using a multi-class objective in GANs for image generation.
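The generator's reward under the explicit discriminator can be sketched as the target class's posterior under an (N+1)-way softmax, with the extra class standing for "fake" (a minimal NumPy illustration):

```python
import numpy as np

def explicit_reward(class_logits, target_act):
    """Reward for the generator: the posterior probability
    D(d | R, M) of the target dialogue act, taken from an
    (N+1)-way softmax whose extra class marks fake responses."""
    z = np.asarray(class_logits, dtype=float)
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs[target_act])
```

A fluent response that the discriminator assigns to the wrong dialogue act class, or to the fake class, earns a low reward, which is what drives the generator toward label-discriminative outputs.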

Speeding Up Adversarial Learning Using Simple Recurrent Unit
Using LSTM or GRU as the encoder and decoder is a common way to build NCMs (Vinyals and Le, 2015). LSTM is also often used in classification problems to encode hierarchical structure, such as in the discriminators of GANs (Tran et al., 2017). However, although LSTM offers strong performance, its training is much slower than that of other types of networks (Lei et al., 2017). This is critical for adversarial learning, which requires a large number of iterations.
We use the policy gradient in this research to update the parameters of the generator, which is based on expected-reward estimation by MCTS. However, MCTS incurs an enormous computational cost because it requires running the discriminator and generator r × w times per generator update, where r is the number of rollouts and w is the number of words in a response. Thus, we propose using the simple recurrent unit (SRU) (Lei et al., 2018) in our generator and discriminator. SRU is an extension of the RNN that achieves performance comparable to LSTM while running at a significantly higher speed. SRU is defined as follows:

f_t = σ(W_f v_t + b_f),
r_t = σ(W_r v_t + b_r),
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W v_t),
h_t = r_t ⊙ g(c_t) + (1 − r_t) ⊙ v_t.

Here, v_t is the input vector at time step t, f_t is the forgetting gate, r_t is the input gate, c_t is the memory cell, h_t is the hidden vector, σ is the sigmoid function, and g is an activation function such as tanh. The key idea of SRU is to minimize the number of vectors and gates affected by previous states: under this definition, only c_t depends on the previous state c_{t-1}. Furthermore, c_t and h_t are calculated using only element-wise products and sums of vectors, which makes the computation easy to speed up. Forward and backward propagation in SRU has been reported to be 10-16 times faster than in LSTM (Lei et al., 2018), and SRU also has a computational advantage over other RNN variants, including GRU. We expect a significant speed-up of adversarial learning by using SRU instead of LSTM. However, SRU has not previously been applied to NCMs; we therefore use SRU in both the existing methods and our proposed adversarial network, and we also compare against a baseline implemented with LSTM.
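A single SRU step under the equations above might look like this in NumPy. The weight shapes and the choice of tanh for g are assumptions of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_step(v_t, c_prev, W, W_f, b_f, W_r, b_r):
    """One SRU step. All matrix multiplications depend only on the
    input v_t, so they can be precomputed for the whole sequence;
    the recurrence on c_t uses element-wise operations only."""
    x = W @ v_t                        # candidate (no recurrent matmul)
    f = sigmoid(W_f @ v_t + b_f)       # forgetting gate f_t
    r = sigmoid(W_r @ v_t + b_r)       # gate r_t
    c = f * c_prev + (1.0 - f) * x     # only c_t sees the past state
    h = r * np.tanh(c) + (1.0 - r) * v_t
    return h, c
```

Because `x`, `f`, and `r` are functions of the input alone, all the heavy matrix products for a sequence can be batched in one pass, leaving only a cheap element-wise scan over time.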

Dataset
We used the DailyDialog corpus, which covers ten categories from a wide variety of topics (Li et al., 2017b). The corpus contains 13,118 dialogues, with a total of 102,979 utterances annotated with dialogue act labels: inform (46,532 utterances), questions (29,428), directives (17,295), and commissive (9,724). We divided the corpus into training/validation/test sets with 11,118/1,000/1,000 dialogues following Li et al. (2017b). In all experiments, the vocabulary size was set to 25,000, and all OOV words were replaced with the "UNK" symbol.

Training Settings
We used embedding sizes of 256 for words and 100 for dialogue acts. The mini-batch size was 32. In the training of the conditional NCM, we used two-layer RNNs in both the encoder and the decoder, and the Adam optimizer with a learning rate of 1e-5.
In the proposed adversarial learning, we followed the training procedure proposed by Li et al. (2017a). The training algorithm that we used is shown as follows.
Algorithm 1 Training procedure
1: for number of iterations do
2:   for number of G-steps do
3:     sample M_G and d_G; generate R̂_G with G; compute reward r_{R̂_G} with D; update G
4:   for number of D-steps do
5:     sample a real response R_D given M_D and d_D; generate a fake response R̂_D with G; update D

In the training of SeqGAN for conditional response generation based on dialogue acts, we prepare a well pre-trained conditional generator and discriminator in advance. After initializing the parameters with the pre-trained models, G-steps for the generator G and D-steps for the discriminator D are applied alternately. In a G-step, a generated response R̂_G is sampled by using a dialogue history M_G and dialogue act d_G, and the reward r_{R̂_G} for the generation is calculated by the discriminator D. The parameters of G are then updated by using the calculated reward. In a D-step, a real response R_D, given a dialogue history M_D and a dialogue act label d_D, is sampled from the training data, and a fake response R̂_D is generated by the generator G using the same dialogue history M_D and dialogue act label d_D. The parameters of the discriminator D are updated by using the real and fake samples.
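The alternating procedure described above can be sketched as a Python skeleton. `G`, `D`, and `data` are hypothetical objects standing in for the generator, discriminator, and corpus sampler; the method names are assumptions of this sketch:

```python
def adversarial_train(G, D, data, n_iters, g_steps=4, d_steps=20):
    """Alternate G-steps (policy-gradient updates rewarded by the
    discriminator) and D-steps (real-vs-fake updates), assuming both
    networks have already been pre-trained."""
    for _ in range(n_iters):
        for _ in range(g_steps):
            M_g, d_g = data.sample_condition()      # history + act label
            r_hat = G.generate(M_g, d_g)            # sample a response
            reward = D.score(r_hat, M_g, d_g)       # reward from D
            G.policy_gradient_update(r_hat, reward)
        for _ in range(d_steps):
            M_d, d_d, r_real = data.sample_pair()   # real sample
            r_fake = G.generate(M_d, d_d)           # fake sample
            D.update((r_real, M_d, d_d), (r_fake, M_d, d_d))
    return G, D
```

Note that the same (M_d, d_d) pair conditions both the real and fake samples within a D-step, so the discriminator's decision depends on whether the response matches the condition, not just on fluency.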
We set the number of G-steps to 4 and D-steps to 20. In the generator, we used two-layer SRUs with 1024 hidden units in both the encoder and the decoder, and the Adam optimizer with a learning rate of 1e-5. For the discriminator, we used a one-layer SRU with 1024 hidden units and the SGD optimizer with a learning rate of 1e-3. We set the number of rollouts in MCTS to 5.

Automatic Evaluation Metrics
We automatically evaluated generation results by comparing with references in the test-set. As the automatic evaluation, we used three different types of metrics: perplexity, relevance scores, and controllability.
Perplexity is a metric for evaluating language model performance, computed from the model's likelihood of the reference responses. Note that the perplexity score does not directly reflect the quality of generation; dull responses can also achieve good perplexity scores.
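Concretely, perplexity is the exponential of the average per-token negative log-likelihood; a short sketch (not the paper's implementation):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the model's probability of each reference
    token; lower is better, but a model that always emits safe,
    dull tokens can also score well on this metric."""
    nll = -np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(nll.mean()))
```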
Relevance scores measure similarities between references and generated results. We used NIST, a variation of BLEU that focuses on content words more than BLEU does (Doddington, 2002). However, count-based metrics such as BLEU and NIST alone are not sufficient, because they correlate weakly with human judgment in response generation tasks (Liu et al., 2016). Thus, we also used three relevance scores proposed by Liu et al. (2016): embedding average ("Average"), greedy matching ("Greedy"), and vector extrema ("Extrema"). "Average" calculates the cosine similarity between the reference sentence vector and the generated sentence vector, where each sentence vector is the average of the word embedding vectors in the sentence. "Extrema" also calculates a cosine similarity between sentence vectors, but each sentence vector is constructed differently: each dimension takes the value of the word embedding in the sentence with the highest absolute value in that dimension. "Greedy" aligns words between the reference and the generated sentence, calculates the cosine similarity of each aligned pair, and averages these similarities.
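The three embedding-based scores can be sketched as follows, given word vectors for the reference and generated sentences. This is a minimal NumPy illustration of the metric definitions, assuming pre-trained word embeddings are supplied by the caller:

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_score(ref_vecs, hyp_vecs):
    """"Average": cosine similarity of the mean word vectors."""
    return _cos(np.mean(ref_vecs, axis=0), np.mean(hyp_vecs, axis=0))

def extrema_score(ref_vecs, hyp_vecs):
    """"Extrema": per dimension, keep the value with the largest
    absolute magnitude across the sentence, then compare."""
    def extrema(vecs):
        vecs = np.asarray(vecs, dtype=float)
        idx = np.abs(vecs).argmax(axis=0)
        return vecs[idx, np.arange(vecs.shape[1])]
    return _cos(extrema(ref_vecs), extrema(hyp_vecs))

def greedy_score(ref_vecs, hyp_vecs):
    """"Greedy": match each reference word to its most similar
    generated word and average (one direction shown)."""
    sims = [max(_cos(r, h) for h in hyp_vecs) for r in ref_vecs]
    return float(np.mean(sims))
```

Identical sentences score 1.0 on all three, while paraphrases with similar embeddings score high even when n-gram overlap (and thus BLEU/NIST) is low.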
The last automatic metric is "controllability", which is given by the classification result of a dialogue act classifier pre-trained on the training set. We connected our encoder from the conditional NCM (Figure 2, left side) to a multi-class softmax layer to build the classifier. Each generated sentence is labeled by the classifier and then compared with the given condition label to calculate label accuracy.

Human Subjective Evaluation Metrics
Automatic evaluation scores still have the problem that they do not correlate strongly with human subjective evaluation results (Liu et al., 2016). Thus, we also conducted a human subjective evaluation to confirm the naturalness and controllability of responses.
In the evaluation of naturalness, we used a 3-point scale in accordance with existing work. Thirty generated responses were randomly selected for each dialogue act (120 in total), and human annotators rated each sample according to the following instructions.
• 2: The response can be used as a reply and it is informative and interesting; the response is natural and can make the conversation continue.
• 1: The response can be used as a reply, but it is too generic, like "I don't know."
• 0: The response cannot be used as a reply to the given dialogue history. It is either semantically irrelevant or disfluent.

Each sample was evaluated by three annotators, and the final score was decided by majority voting. If the three scores were completely split (0, 1, and 2), the sample was scored as 1.
In the controllability evaluation, we asked one annotator, who had two years of experience in dialogue act annotation, to annotate dialogue acts for the generated responses. The annotator was trained on the training data of the DailyDialog corpus before the evaluation.
Experimental Results

Table 1 shows the results of the automatic objective evaluation. We compared our proposed SeqGAN based on the explicit discriminator ("Adversarial-Explicit") with the following baselines. "Vanilla-NCM" shows the performance of a vanilla LSTM model, which has no mechanism to receive condition labels; these scores indicate the general performance of systems on the DailyDialog corpus. "Conditional-NCM" shows the performance of NCMs that receive condition labels at the decoder, as described in Section 3.2. We compared two variations of "Conditional-NCM", LSTM and SRU, to check the performance of SRU against LSTM. "Adversarial-Implicit" shows the performance of a SeqGAN with the implicit discriminator, as proposed by Sun et al. (2018) and Kong et al. (2019). "Adversarial-Explicit" is the proposed model with a multi-class discriminator in its SeqGAN. The table shows results for both beam search (width = 5) and random sampling in the decoding process.

Speeding up Using SRU
The comparison between "Conditional-NCM w/ LSTM" and "Conditional-NCM w/ SRU" indicates that the speed-up with SRU works well; SRU achieves higher relevance scores than LSTM. SRU used 53,539K parameters, whereas LSTM required 82,924K parameters. These results support our use of SRU instead of LSTM. We mainly focus on comparisons of SRU models in the following sections.

Qualities of Generated Responses
The "Vanilla-NCM w/ LSTM" scores show the difficulty of conversation modeling on the DailyDialog corpus. Compared with this model, the conditional generation models ("Conditional-NCM" and "Adversarial") improved relevance scores while also providing controllability. This is probably because the dialogue act condition acts as a training constraint that prevents dull responses. The generation methods using adversarial learning achieved higher relevance scores than "Conditional-NCM w/ SRU".

Controllability of each Dialogue Act
Comparing the "Controllability" scores, the proposed "Adversarial-Explicit" achieved the best scores in both beam-search and sampling decoding. For a detailed analysis, we show the controllability score of "Adversarial-Explicit" for each dialogue act label in Table 2 (beam search) and Table 3 (random sampling). The tables show precision, recall, and their harmonic mean (F1) for each dialogue act, as well as the improvement (improv.) over "Conditional-NCM w/ SRU", which achieved the best controllability among the baselines. The proposed "Adversarial-Explicit" achieved improvements for all classes; in particular, it achieved large improvements on the "Directives" and "Commissive" tags. Our model generated more discriminative sentences even though these classes have similar attributes.

Results of Human-Subjective Evaluation

Table 4 and Table 5 show the human evaluation results for naturalness and controllability, respectively. We used beam search (beam width of 5) for generating the examples to be evaluated. Regarding the naturalness of responses (Table 4), the models using adversarial learning produced more acceptable responses to the dialogue context. Regarding the controllability of response generation (Table 5), the adversarial-explicit model achieved the best performance among the compared models.
In summary, the proposed model based on adversarial learning with a multi-class objective achieved the best controllability, the main focus of this paper, while realizing naturalness comparable to that of existing methods.

Conclusion
In this paper, we introduced an extended generative adversarial network framework that is optimized for both conditioned generation and discrimination of dialogue act classes. Experimental results showed that our conditional response generation model improved both response quality and controllability in neural conversation generation. In future work, we will examine the possibility of incorporating our adversarial framework into various generation approaches (Serban et al., 2017; Shen et al., 2017; Zhou and Wang, 2018) to build a more generalized conditional response generation model. We will also investigate different types of labels to be used as conditions.