Responsive and Self-Expressive Dialogue Generation

A neural conversation model is a promising approach to developing dialogue systems capable of chit-chat. It allows a model to be trained end-to-end without complex rule design or feature engineering. As a side effect, however, the neural model tends to generate safe but uninformative and insensitive responses like “OK” and “I don’t know.” Such replies, called generic responses, are regarded as a critical problem for user engagement with dialogue systems. For a more engaging chit-chat experience, we propose a neural conversation model that generates responsive and self-expressive replies. Specifically, our model generates domain-aware and sentiment-rich responses. Experiments empirically confirmed that our model outperformed the sequence-to-sequence model: 68.1% of our responses were domain-aware with sentiment polarities, compared to only 2.7% of the responses generated by the sequence-to-sequence model.


Introduction
Dialogue systems that conduct non-goal-oriented chat, i.e., chit-chat, are an active research area. The sequence-to-sequence model (SEQ2SEQ) (Vinyals and Le, 2015; Shang et al., 2015) is commonly used for implementation; however, recent studies, e.g., Li et al. (2016a), point out that SEQ2SEQ frequently generates overly generic responses. Among the different approaches to this problem, previous studies propose to generate more engaging responses by reacting to topics in users' utterances (Xing et al., 2017) or by embodying emotions (Zhou et al., 2018). Herein we take a step further to generate responsive and self-expressive replies simultaneously.¹ The interpersonal process model of intimacy (Reis and Shaver, 1988) indicates that conversational responsiveness (Miller and Berg, 1984), i.e., showing concern for what was said, and self-expression, i.e., sharing thoughts and feelings, are primary factors in creating intimacy. Motivated by this theory, we believe that conversational responsiveness and self-expression are also valid for a dialogue system to generate engaging responses. We implement conversational responsiveness as domain-awareness because it effectively conveys the impression that the dialogue agent is listening to the user by responding about mentioned topics. We implement self-expression as sentiment-richness, representing sentiment polarity to generate subjective responses with feelings. Specifically, the encoder predicts the domain of a user's utterance and integrates the domain and utterance representations to tell the decoder the target domain explicitly. The decoder then embodies sentiment polarity in its generation process. Fig. 1 shows real responses generated by our model. Our responses react to the domains of input utterances while showing salient sentiments.

¹ In this study, we focus on single-turn conversations, i.e., generating a response to a single utterance from the user.
On the other hand, SEQ2SEQ ends up generating generic responses.
To the best of our knowledge, this is the first study to simultaneously achieve both domain-aware and sentiment-rich response generation. Our contributions are twofold. First, we achieve these features with a simple architecture that integrates existing methods on top of SEQ2SEQ, making it easily reproducible in existing dialogue systems. Second, our model utilizes fine-tuning to compensate for training data scarcity, which is essential because there is only a limited amount of domain-dependent and sentiment-rich dialogue data.
Our code and scripts are publicly available.² Evaluation results empirically confirmed that our model significantly outperformed SEQ2SEQ from the human perspective. Annotators judged that responses generated by our model are consistent with the utterances' domains and show salient sentiments in 89% and 72% of cases, respectively, while preserving fluency and consistency. Furthermore, they judged 68.1% of the responses by our model as both domain-aware and sentiment-rich, compared to only 2.7% of the responses by SEQ2SEQ.

Related Work
The generic response problem in SEQ2SEQ is a central concern in recent studies. Different approaches have been proposed to generate diversified responses: by an objective function (Li et al., 2016a; Zhang et al., 2018b), by segment-level reranking via a stochastic beam search in the decoder (Shao et al., 2017), or by incorporating auto-encoders so that latent vectors are expressive enough for the utterance and response (Zou et al., 2018). In these approaches, balancing diversity and coherency in a response is not trivial. Zou et al. (2018) show that metrics measuring diversity are not proportional to human evaluation.
Another group of studies tackles the generic response problem by improving coherence in the response, which is relevant to conversational responsiveness. Approaches include reinforcement learning (Zhang et al., 2018a) and prediction of a keyword that will be the gist of a response given an input utterance, with its generation in the decoder (Mou et al., 2016; Yao et al., 2017; Wang et al., 2018). In our study, we consider domain-level coherency to achieve conversational responsiveness, similar to Xing et al. (2017).
Several studies focus on self-expression in responses. Some add a persona to dialogue agents to generate consistent responses to paraphrased input utterances (Li et al., 2016b; Zhang et al., 2018c; Qian et al., 2018). Zhou et al. (2018) conducted the first study that controls emotions in dialogue agents using two factors. The first is the embedding of a desired emotion label as in Li et al. (2016b). The second is internal and external memories, which control the emotional state and the output of the decoder, respectively. These previous studies propose methods to achieve either conversational responsiveness or self-expression. Herein we aim to achieve both features simultaneously.

Proposed Architecture
To be easily implemented on existing dialogue systems, our model design aims to be simple. We integrate TWEET2VEC (Dhingra et al., 2016) and the external memory (Zhou et al., 2018) with SEQ2SEQ (Fig. 2). While sentiments in texts are well understood in natural language processing, emotions need more study before they can be handled in practical applications. Besides, determining the appropriate emotion for a specific utterance remains problematic (Hasegawa et al., 2013). We therefore focus on sentiments and input the embedding of a sentiment label s to the decoder, which specifies the desired sentiment to represent in a response.

Encoder

Input Utterance Encoding The input utterance is represented as a vector. Bi-directional recurrent neural networks empirically show superior performance in generation tasks (Bahdanau et al., 2015) because they refer to both the preceding and subsequent sequences. We apply bi-directional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks to encode an input utterance into a vector. Given the input utterance X = {x_1, x_2, ..., x_M} of length M, the forward LSTM network encodes the input at time step t as

[h_t^{fw}; c_t^{fw}] = LSTM(e_{x_t}^{enc}, [h_{t-1}^{fw}; c_{t-1}^{fw}]).

h_t^{fw} ∈ R^λ is the representation output, which is computed based on the embedding of x_t (denoted as e_{x_t}^{enc} ∈ R^ω) and the previous representation output h_{t-1}^{fw}. c_{t-1}^{fw} ∈ R^λ is a cell-state vector that works as a memory in the LSTM. The backward LSTM works in the same fashion, reading the input in reverse order. The final vector representation h_txt ∈ R^λ is computed by averaging the concatenated forward and backward outputs:

h_txt = σ(W_txt (1/M) Σ_{t=1}^{M} [h_t^{fw}; h_t^{bw}]),

where [·; ·] concatenates two vectors, σ(·) is the sigmoid function, and W_txt ∈ R^{λ×2λ}. In this way, h_txt encodes summaries of both the preceding and subsequent words.
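To make the encoding concrete, the following is a minimal NumPy sketch of the bi-directional LSTM encoder described above. The fused gate matrices `W`, `U`, the bias `b`, and the initialization are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(e_x, h_prev, c_prev, W, U, b, lam):
    """One LSTM step; the four gates are slices of a single fused projection."""
    z = W @ e_x + U @ h_prev + b           # gate pre-activations, shape (4*lam,)
    i = sigmoid(z[0*lam:1*lam])            # input gate
    f = sigmoid(z[1*lam:2*lam])            # forget gate
    o = sigmoid(z[2*lam:3*lam])            # output gate
    g = np.tanh(z[3*lam:4*lam])            # candidate cell state
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new representation output
    return h, c

def encode_utterance(E, lam, params_fw, params_bw, W_txt):
    """Encode word embeddings E (M x omega) into h_txt by averaging the
    concatenated forward/backward outputs and applying sigmoid(W_txt ...)."""
    M = E.shape[0]
    h, c = np.zeros(lam), np.zeros(lam)
    H_fw = []
    for t in range(M):                     # forward pass
        h, c = lstm_step(E[t], h, c, *params_fw, lam)
        H_fw.append(h)
    h, c = np.zeros(lam), np.zeros(lam)
    H_bw = [None] * M
    for t in reversed(range(M)):           # backward pass
        h, c = lstm_step(E[t], h, c, *params_bw, lam)
        H_bw[t] = h
    concat = np.stack([np.concatenate([H_fw[t], H_bw[t]]) for t in range(M)])
    return sigmoid(W_txt @ concat.mean(axis=0))   # h_txt in R^lam
```

The sketch returns a λ-dimensional vector whose entries lie in (0, 1), matching the sigmoid in the formula above.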

Domain Estimation & Representation
Another task of the encoder is predicting the domain of the input utterance and integrating the domain label with the utterance. For domain estimation, we apply TWEET2VEC due to its superior ability to predict a label for short and colloquial text, which should be the case for input utterances to dialogue agents. Although the original paper predicted hashtags of tweets, we predict domains of utterances. Another advantage of TWEET2VEC is that it is language-independent and easily adapted to different languages. Specifically, TWEET2VEC encodes the input utterance using bi-directional recurrent neural networks with gated recurrent units (GRUs) (Cho et al., 2014). The final vector representation of the input, ĥ_txt, is computed by integrating the forward and backward outputs using a fully-connected layer. Then ĥ_txt is passed through a linear layer, and the posterior probabilities of the domains are computed in a softmax layer.
Domain d with the highest posterior probability is converted into a dense vector representation h_dom ∈ R^δ. Specifically, a two-layer multilayer perceptron (MLP) with a rectifier as the activation function is employed:

h_dom = MLP(e_d^{dom}),

where e_d^{dom} is the embedding of domain label d.

Utterance & Domain-Label Integration Finally, the utterance and domain representations pass through another fully-connected layer and are integrated into a vector:

h_0^{dec} = σ(W_enc [h_txt; h_dom]),   (1)

where W_enc ∈ R^{λ×(λ+δ)}. h_0^{dec} is then passed to the decoder for response generation.
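The domain head and the integration step above can be sketched in a few lines of NumPy. All parameter names and shapes here are illustrative assumptions consistent with the dimensions stated in the text, not the authors' code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_domain(h_hat_txt, W_lin, b_lin):
    """Linear layer + softmax over domain labels (the TWEET2VEC head)."""
    return softmax(W_lin @ h_hat_txt + b_lin)

def domain_representation(e_dom, W1, b1, W2, b2):
    """Two-layer MLP with a rectifier, mapping a domain-label embedding to h_dom."""
    return W2 @ np.maximum(0.0, W1 @ e_dom + b1) + b2

def integrate(h_txt, h_dom, W_enc):
    """Eq. (1)-style fusion: h_dec0 = sigmoid(W_enc [h_txt; h_dom])."""
    return sigmoid(W_enc @ np.concatenate([h_txt, h_dom]))
```

At run-time, `np.argmax(predict_domain(...))` would pick the domain whose embedding is fed to `domain_representation`.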

Decoder
Given that h_0^{dec} encodes the input utterance and the predicted domain, the decoder generates a response embodying the desired sentiment. The input utterance X is paired with an output sequence to predict, Y = {y_1, y_2, ..., y_N} of length N.
We apply the external memory to SEQ2SEQ in order to proactively control the sentiments in the outputs. Fig. 4 shows the detailed design of the decoder. First, we concatenate the output sequence with the embedding of the desired sentiment label as a soft constraint to instruct the decoder of the desired sentiment for response generation (Li et al., 2016b). The external memory then directly controls response generation by switching outputs between words with sentiment polarities (hereafter referred to as sentiment words) and generic ones. Specifically, in the external memory, the vocabulary V is divided into two subsets: V = V_s ∪ V_g. V_s contains only sentiment words, such as cool and terrible, while V_g contains other generic words, such as day and me. The weight of a switcher, which determines the priority between the two vocabulary subsets, is computed based on the representation output from an LSTM network.
The embedding of s (denoted as e_s ∈ R^δ) is concatenated with the output y_{t−1} at the previous time step and then input into the LSTM network as

[h_t^{dec}; c_t^{dec}] = LSTM([e_{y_{t-1}}^{dec}; e_s], [h_{t-1}^{dec}; c_{t-1}^{dec}]),

where c_t^{dec} ∈ R^λ is the cell-state vector in the LSTM, h_t^{dec} ∈ R^λ is the representation output from the LSTM, and e_{y_{t-1}}^{dec} is the embedding of y_{t−1}. Recall that the initial input to the decoder, h_0^{dec}, is computed in Eq. (1). Then h_t^{dec} is passed to the external memory to sequentially predict the output:

a_t = σ(W_a h_t^{dec}),
o_g = softmax(W_g^⊤ h_t^{dec}),
o_s = softmax(W_s^⊤ h_t^{dec}),
o_t = [(1 − a_t) o_g; a_t o_s],

where W_a ∈ R^{1×λ}, W_g ∈ R^{λ×|V_g|}, and W_s ∈ R^{λ×|V_s|}. a_t ∈ [0, 1] weighs either the probabilities of generic words or sentiment words based on the context represented in h_t^{dec}. o_g ∈ R^{|V_g|} and o_s ∈ R^{|V_s|} are the posterior probabilities of outputting each word in the respective vocabulary. o_t ∈ R^{|V|} is the final probability of each word, adjusted by a_t. At run-time, a beam search with a beam size of 5 is conducted to avoid outputting an unknown tag.
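One step of the external-memory switch can be sketched as follows. The exact gating form (which of the two distributions gets a_t versus 1 − a_t, and the transposed weight matrices) is our reconstruction from the stated shapes, in the spirit of Zhou et al. (2018), not a verbatim reimplementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def external_memory_step(h_dec, W_a, W_g, W_s):
    """One decoding step of the vocabulary switch: a scalar gate a_t mixes a
    distribution over generic words with one over sentiment words."""
    a = 1.0 / (1.0 + np.exp(-(W_a @ h_dec)[0]))        # a_t in (0, 1)
    o_g = softmax(W_g.T @ h_dec)                       # over V_g
    o_s = softmax(W_s.T @ h_dec)                       # over V_s
    o_t = np.concatenate([(1.0 - a) * o_g, a * o_s])   # over V = V_g ∪ V_s
    return a, o_t
```

Because o_g and o_s each sum to one, the mixture o_t is itself a valid distribution over the full vocabulary.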
Our model optimizes the cross-entropy loss between the predicted word distribution o_t and the gold distribution p_t. In addition, a regularizer constrains the selection of a sentiment or generic word:

L = − Σ_{t=1}^{N} p_t^⊤ log o_t − Σ_{t=1}^{N} (q_t log a_t + (1 − q_t) log(1 − a_t)),   (2)

where q_t ∈ {0, 1} is the gold choice of a sentiment word or a generic word.

Training Framework

We designed a training framework that pre-trains sub-models independently and then conducts fine-tuning on the connected model, where the model is trained using the pre-trained parameters as initial weights. The training process uses not only a small-scale conversational (in-domain) corpus of specific domains but also a large-scale conversational corpus of the general domain.
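The loss above can be computed directly from the predicted distributions and switch weights. The binary cross-entropy form of the regularizer is our reconstruction of the constraint on q_t described in the text.

```python
import numpy as np

def decoder_loss(O, P, A, Q, eps=1e-12):
    """Word-level cross-entropy between predicted (O) and gold (P) distributions,
    plus a binary cross-entropy regularizer tying the switch weights A to the
    gold word-type choices Q (1 = sentiment word, 0 = generic word).
    O, P: (N, |V|) arrays of per-step distributions; A, Q: length-N arrays."""
    ce = -np.sum(P * np.log(O + eps))
    reg = -np.sum(Q * np.log(A + eps) + (1 - Q) * np.log(1 - A + eps))
    return ce + reg
```

A perfect prediction (O = P and A = Q) drives the loss to essentially zero, while any mismatch contributes a positive penalty.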

Sentiment Annotation
Training requires sentiment annotations on the general and in-domain corpora. Because it is cost-prohibitive to annotate sentiments on these corpora manually, we rely on automatic sentiment analysis. Given that input utterances to dialogue agents are short, incomplete, extremely casual, and potentially noisy, we need a robust method to predict sentiments with guaranteed accuracy. Although we tried several state-of-the-art methods for sentiment analysis (Severyn and Moschitti, 2015; Zhu et al., 2015), our preliminary evaluation showed that they were easily confused by colloquial styles in conversational texts. Hence, we used a simple heuristic based on a sentiment lexicon to prioritize robustness. Specifically, a sentence is annotated as positive (negative) if it contains more positive (negative) words. If the numbers of positive and negative words are equal, the sentence is annotated as neutral.
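The heuristic above amounts to a majority vote over lexicon hits; a minimal sketch follows (the English example words are illustrative stand-ins for the Japanese lexicon entries).

```python
def annotate_sentiment(tokens, positive_words, negative_words):
    """Majority-vote heuristic: positive if positive lexicon hits outnumber
    negative ones, negative in the opposite case, and neutral on a tie."""
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

For example, a sentence containing one positive word and no negative words is labeled positive, while one positive and one negative word yield neutral.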
We extracted words with strong polarities from existing sentiment lexicons (Kobayashi et al., 2005; Takamura et al., 2005). Besides, we collected casual and recent sentiment words by crawling Twitter.³ This sentiment lexicon is used for the above sentiment analysis and, after the filtering described in Sec. 5.2, as V_s in the external memory. More details of lexicon construction are given in Sec. A.

Pre-Training on Sub-Models
After annotating sentiments on the general and in-domain corpora, we conducted pre-training. In the pre-training step, the sub-models are trained independently (Fig. 5).
SEQ2SEQ requires large-scale training data for fluent response generation; thus, we used the general corpus here. We directly connected the bi-directional LSTM in the encoder and the LSTM in the decoder to train this sub-model. The loss function (Eq. (2)) is computed by referring to the gold responses in the corpus. The embeddings that represent sentiments are trained at this stage.
TWEET2VEC is independently trained using the in-domain corpus for domain prediction. The model optimizes the categorical cross-entropy loss between the predicted and gold domain labels.

Fine-Tuning on the Entire Model
After pre-training, fine-tuning is conducted using the in-domain corpus to train the MLPs that integrate the domain label and input utterance (Fig. 6). Additionally, the embeddings of domain labels e_d^{dom} are trained at this stage. To avoid error propagation from the pre-trained TWEET2VEC, gold domain labels are input into the MLP to learn correct representations of domain labels.
Once fine-tuned, these sub-models are connected to generate domain-aware responses with sentiments (Fig. 2).

Evaluation Design
Because the effectiveness of each component for embodying emotions has been evaluated in Zhou et al. (2018), we focus on evaluating whether both domain-awareness and sentiment-richness are achieved simultaneously by our model compared to SEQ2SEQ.

Data Collection
To train our model, we collected both general and in-domain conversational texts in Japanese. The general corpus was constructed by crawling conversational tweets using the Twitter API.⁴ We also crawled conversational tweets used in the NTCIR Short Text Conversation Task (Shang et al., 2016). In total, the general corpus contains about 1.6M utterance-response pairs.
The in-domain corpus was constructed by crawling conversations in public Facebook Groups using the Facebook Graph API.⁵ Because members are fans of specific products, organizations, and people, we expect their conversations to be domain-dependent.⁶ Specifically, we used two domains, Japanese professional baseball leagues and Pokémon Go⁷, anticipating that salient sentiments are easily manifested in the sports and game domains. Experiments using a wider range of domains are future work. We crawled conversations from each group's inception to November 2017. In total, the in-domain corpora contain about 29k baseball-related conversations and 28k game-related conversations. We assume that sentiments can also be embodied in domains with weaker sentiment tendencies thanks to pre-training on the general-domain corpus; verification of this assumption is a future task. After crawling, we preprocessed the corpora to remove noise and standardize texts (details are described in Sec. B). Table 1 shows the amount of training data after the preprocessing step.
As a validation set for pre-training, 1k conversation pairs were sampled from the general corpus. Similarly, 1k pairs for validation and another 1k as a test set were sampled from the in-domain corpus for the automatic evaluation. The training set excluded these validation and test sets. Table 2 summarizes the hyper-parameters in our model and their settings. The vocabulary size was 45k, consisting of frequent words in the general and in-domain corpora. The general and in-domain corpora contained 1,387 sentiment words, which were used as V_s in the external memory.

Model Setting
In both pre-training and fine-tuning, the sub-models, except for TWEET2VEC, were trained for at most 100 epochs with early stopping using the validation set. The batch size was set to 200, dropout was used with a rate of 0.2, and Adam (Kingma and Ba, 2015) with a learning rate of 0.01 was applied as the optimizer.
During pre-training and fine-tuning, an out-of-vocabulary (OOV) word in input utterances was replaced with a similar word in the vocabulary to reduce the effects of data sparsity (Li et al., 2016c). We generated word embeddings using fastText (Bojanowski et al., 2017) with the default settings, feeding Wikipedia dumps⁸ as training data. When a word is OOV, the top-50 similar words are detected using cosine similarities between their embeddings. If one of these similar words is in the vocabulary, it replaces the original OOV word. Otherwise, the original word is replaced with an unknown-word tag.
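The OOV-replacement step above can be sketched as follows; the toy embedding dictionary and the `<unk>` tag are illustrative assumptions (the paper uses fastText embeddings trained on Wikipedia).

```python
import numpy as np

def replace_oov(tokens, vocab, embeddings, k=50, unk="<unk>"):
    """Replace each OOV token with its nearest in-vocabulary neighbor among
    the top-k most similar words; fall back to the unknown tag."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)                 # in-vocabulary: keep as-is
            continue
        if tok not in embeddings:
            out.append(unk)                 # no embedding: cannot recover
            continue
        sims = sorted(
            ((cos(embeddings[tok], v), w)
             for w, v in embeddings.items() if w != tok),
            reverse=True,
        )[:k]
        repl = next((w for _, w in sims if w in vocab), unk)
        out.append(repl)
    return out
```

For example, a misspelled token whose embedding is close to an in-vocabulary word is mapped to that word, while a token with no usable neighbor becomes `<unk>`.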
TWEET2VEC was trained on the in-domain corpus using the official implementation⁹ with the default settings. We crawled 200 new domain-dependent conversational pairs as a validation set. The prediction accuracy was 89.0%, which is reasonable considering that our texts are colloquial.

We compare our model to a SEQ2SEQ baseline implemented with bi-directional LSTM networks as the encoder and an LSTM network as the decoder. The baseline uses the same hyper-parameters and training procedures as our model, except that SEQ2SEQ was trained on both the general and in-domain corpora. For SEQ2SEQ, a validation set of 1k pairs was randomly sampled from the combined corpus, excluding the training and test sets described in Sec. 5.1.

Human Evaluation
Because each utterance has many appropriate responses, an automatic evaluation scheme has yet to be established. To assess the quality of the generated responses from the human perspective, we designed two evaluation tasks. Task 1 evaluates the overall quality of our model compared to SEQ2SEQ from the perspectives of domain-awareness and sentiment-richness. Task 2 evaluates whether an intended sentiment is embodied as desired without being affected by domain-awareness.
We recruited five graduate students majoring in computer science who are native Japanese speakers (hereafter called annotators). After an instruction session to explain the judgment standards, they annotated Task 1 and Task 2. As a token of appreciation, each annotator received a small stipend.
Test Set Creation To exclude external factors, e.g., word segmentation failures, that may affect the evaluation results, we manually created a test set consisting of 300 utterances in the baseball domain and another 300 utterances in the Pokémon Go domain.
First, we crawled new conversational pairs from the same Facebook Groups from November to December 2017. Next, we manually excluded conversations in the general domain (e.g., greetings). We then cleaned the sentences in the same manner as the general and in-domain corpora. Besides, we manually replaced OOV words with in-vocabulary words that preserve the original meanings of the sentences. Slang and uncommon expressions were also manually converted to standard expressions to avoid impacting the accuracy of word segmentation. Half of the test set (150 conversations for each domain) was used for Task 1 and the other half for Task 2. Note that all annotators annotated the same conversations, in total 600 pairs of utterances and responses.
Task 1: Overall Evaluation Annotators judged triples of an input utterance and the responses by our model and by SEQ2SEQ. The order of responses was randomly shuffled to ensure a fair evaluation. Annotators assessed the following aspects:

• Fluency: Annotators judged whether a response is fluent and understandable at an acceptable level (1 = fluent, 0 = not fluent).
• Consistency: Annotators evaluated whether a response is semantically consistent with the utterance (1 = consistent, 0 = inconsistent). Generic responses can be regarded as consistent if they are acceptable for the given utterances. Responses judged as not fluent are automatically annotated as inconsistent.
• Domain-awareness: Annotators compared the two responses and determined which one better matched the domain of the input utterance (1 = model that generated the better response, 0 = the other model).
• Sentiment-richness: Annotators compared the two responses and determined which one showed more salient sentiments, in the same manner as the Domain-awareness annotation. Only positive or negative responses were considered for our model.
For Domain-awareness and Sentiment-richness, we conduct a pairwise comparison of our model and SEQ2SEQ rather than judging the models independently, which enables more reliable judgments for subjective annotations (Ghazvininejad et al., 2018; Wang et al., 2018).
Task 2: Evaluation of Sentiment Control Our model takes as input a sentiment label that is desired to be expressed in a generated response, which we refer to as the intended sentiment. This task evaluates whether such an intended sentiment is embodied in a response by comparing the intended sentiment with the sentiment that annotators actually perceive.
Annotators were shown a pair of an input utterance and the response generated by our model, and then asked to judge whether the response was positive, negative, or neutral. We evaluated the agreement between the intended and perceived sentiments.

Evaluation Results
As an automatic evaluation measure, we computed the BLEU score (Papineni et al., 2002), following the evaluations in Li et al. (2016a) and Ghazvininejad et al. (2018). Our model achieved a higher BLEU score (1.54) than SEQ2SEQ (1.39). However, as discussed in Liu et al. (2016) and Lowe et al. (2017), current automatic evaluation measures show either weak or no correlation with human judgments, or worse, they tend to favor generic responses. Hence, we focus on human evaluation in the following.
First, the agreement level of the annotations is examined based on Fleiss' κ. All annotations have reasonable agreement (κ ≥ 0.37) except the fluency annotation for SEQ2SEQ, whose κ value is as low as 0.21 (all κ values are shown in Sec. C). This may be because SEQ2SEQ tends to output generic responses that are less dependent on the utterances, making judgments difficult due to the limited clues for evaluating fluency.

Table 3 shows the macro-averages and the 95% confidence intervals of the scores obtained by the annotators in Task 1. Our model achieved significant improvements over SEQ2SEQ: 89% and 72% of the responses generated by our model were deemed consistent with the utterance domain and showing salient sentiments, respectively. Furthermore, 68.1% of the responses by our model were judged as both domain-aware and sentiment-rich, compared to only 2.7% of the responses by SEQ2SEQ.
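For reference, Fleiss' κ over the five annotators' labels can be computed as follows; the toy counts in the usage are illustrative, not the paper's data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i, j] = number of annotators (out of n per item)
    who assigned item i to category j; every row sums to n."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                                  # number of items
    n = counts[0].sum()                                  # annotators per item
    p_j = counts.sum(axis=0) / (N * n)                   # category prevalence
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()            # observed vs. chance
    return (P_bar - P_e) / (1.0 - P_e)
```

Perfect agreement on every item yields κ = 1, while agreement at chance level yields κ = 0.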
As for fluency and consistency, SEQ2SEQ yields slightly more fluent (99.5%) and consistent (77.3%) responses than our model (95.5% and 75.3%, respectively). SEQ2SEQ benefits from generic responses because such responses apply to various inputs, making a high consistency easier to achieve than for our model, which generates domain-dependent responses. Additionally, generic responses are easier to generate because they are typically short: the average numbers of characters in responses to the test set were 19 for SEQ2SEQ and 32 for our model. This result shows that our model achieves reasonably high fluency even when generating significantly longer responses. Another reason is a side effect of the external memory, which influences the internal state of the decoder, as reported in Zhou et al. (2018).
As a result of Task 2, the macro-average of the agreement between the intended and perceived sentiments is 64.5 ± 2.3%, with a Fleiss' κ of 0.52. Fig. 7 is a confusion matrix showing the distribution of the obtained 1,500 annotations. Neutral responses tend to be judged as either positive (28.5%) or negative (15.6%). One reason is our simple sentiment annotation, which assigns a neutral label when the numbers of positive and negative words in a sentence are equal. Improving the handling of polarity strength is a future task.
The annotators perceived 17.6% of the intended negative responses as positive. Detailed analyses of the generated responses revealed that this category contained sentiment words whose polarities depend on the context, e.g., envy, great, and surprising. These words are considered negative in our sentiment lexicon because they tend to be used with negative emoticons to show humor on Twitter. In the future, we will develop post-processing to clean our lexicon and consider self-attention (Vaswani et al., 2017) to resolve such context-dependent cases.

Figure 7: Confusion matrix of intended (true) sentiments and the sentiments that annotators perceived.

Fig. 1 shows real examples of generated responses.
While SEQ2SEQ produces generic responses like "Really?", our model generates domain-aware responses with sentiments like "Sugano is cool!" (positive response) and "No way? There is no hope for Sugano!" (negative response) for the baseball domain. Sec. D provides more examples that show how our model achieved domain-awareness and sentiment-richness.

Conclusion
As a solution to the generic response problem in SEQ2SEQ, we implemented conversational responsiveness and self-expression in a neural dialogue model. Different from previous studies, our model achieves these features simultaneously in the forms of domain-awareness and sentiment-richness, respectively. Evaluation results empirically demonstrated that our model significantly outperformed SEQ2SEQ. In the future, we will improve the accuracy of embodying sentiments and extend our dataset to cover more diverse domains.

A Construction of the Sentiment Lexicon
We used two sentiment lexicons created by Kobayashi et al. (2005) and Takamura et al. (2005). The former is manually created, while the latter is automatically created by estimating the strengths of semantic orientations of words in the range of [−1.0, 1.0]. We only used words with a strong polarity, specifically words with scores in [−1.0, −0.9] or [0.9, 1.0]. These lexicons contain only formal words, like headwords in dictionaries. Therefore, we extended our sentiment lexicon by collecting casual and recent sentiment words.
We searched for tweets that are expected to contain sentiments by querying Twitter with positive and negative emoticons. In total, we crawled 400k potentially positive and negative tweets and generated word embeddings from these tweets using fastText (Bojanowski et al., 2017) with the default settings. We then manually selected 57 sentiment words from the vocabulary as seeds. The top-15 most similar words per seed, ranked by the cosine similarity between the embeddings of the seed and each candidate, were extracted as sentiment words. In total, we collected 1,621 negative and 2,666 positive words as our sentiment lexicon.
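The seed-expansion step above can be sketched as follows; the toy slang embeddings are illustrative stand-ins for the fastText vectors trained on the crawled tweets.

```python
import numpy as np

def expand_seeds(seeds, embeddings, top_k=15):
    """For each seed word, add its top_k nearest neighbors (by cosine
    similarity between embeddings) to the lexicon."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    lexicon = set()
    for s in seeds:
        if s not in embeddings:
            continue                                   # skip unknown seeds
        ranked = sorted(((cos(embeddings[s], v), w)
                         for w, v in embeddings.items() if w != s),
                        reverse=True)
        lexicon.update(w for _, w in ranked[:top_k])
    return lexicon
```

Running this over the 57 manually selected seeds with top_k=15 yields the pool of casual sentiment words described above.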

B Preprocessing
We employed conversational text crawled from Twitter and Facebook, which is inherently noisy, so we conducted data cleaning before training our model.
First, line breaks, emoticons, Japanese emoticons (kaomoji), URLs, and consecutive duplicate symbols were removed. Then, texts of at most 25 words after word segmentation with MeCab (Kudo et al., 2004) were retained. Table 4 shows detailed statistics of our training data after this preprocessing.

C Agreement of Annotations

Table 5 shows the Fleiss' κ for each annotation result in our human evaluation. It confirms that reasonably high agreements were achieved.