Modeling Situations in Neural Chat Bots

Social media accumulates vast amounts of online conversations that enable data-driven modeling of chat dialogues. It is, however, still hard to utilize the neural network-based SEQ2SEQ model for dialogue modeling, in spite of its acknowledged success in machine translation. The main challenge comes from the high degree of freedom of the outputs (responses). This paper presents neural conversational models that have general mechanisms for handling a variety of situations that affect our responses. Response selection tests on massive dialogue data we collected from Twitter confirmed the effectiveness of the proposed models with situations derived from utterances, users, or time.


Introduction
The increasing amount of dialogue data in social media has opened the door to data-driven modeling of non-task-oriented, or chat, dialogues (Ritter et al., 2011). These data-driven models treat response generation as a sequence-to-sequence mapping task, and recent ones are based on neural SEQ2SEQ models (Vinyals and Le, 2015; Shang et al., 2015; Li et al., 2016a,b; Xing et al., 2017). However, the adequacy of the responses generated by these neural models is somewhat insufficient, in contrast to the acknowledged success of neural SEQ2SEQ models in machine translation (Johnson et al., 2016).
The contrasting outcomes in machine translation and chat dialogue modeling can be explained by the difference in the degrees of freedom of the output for a given input. An appropriate response to a given utterance is not monolithic in chat dialogue. Nevertheless, since only one ground-truth response is provided in the actual dialogue data, supervised systems struggle to choose from the vast range of possible responses.
So, how do humans decide how to respond? We converse with others while (implicitly) considering not only the utterance but also various other conversational situations ( § 2), such as the time, the place, the current context of the conversation, and even our relationship with the addressee. For example, when a friend says "I feel so sleepy." in the morning, a probable response could be "Were you up all night?" (Figure 1). If the friend says the same thing at midnight, you might say "It's time to go to bed." Or if the friend is driving a car with you, you might answer "If you fall asleep, we'll die." Modeling the situations behind conversations has been an open problem in chat dialogue modeling, and this difficulty has partly forced researchers to focus on task-oriented dialogue systems (Williams and Young, 2007), whose responses have a low degree of freedom thanks to domain and goal specificity. Although a few studies have tried to exploit conversational situations such as speakers' emotions (Hasegawa et al., 2013), personal characteristics (Li et al., 2016b), and topics (Xing et al., 2017), their methods are specially designed for and evaluated on specific types of situations.
In this study, we explore neural conversational models that have general mechanisms to incorporate various types of situations behind chat conversations ( § 3.2). These models take into account situations on the speaker's side and the addressee's side (i.e., the side that responds) when encoding utterances and decoding responses, respectively. To capture the conversational situations, we design two mechanisms that differ in how strong an effect a given situation has on generating responses.
In experiments, we examined the proposed conversational models by incorporating three types of concrete conversational situations ( § 2): utterance, speaker/addressee (profiles), and time (season). Although the models are capable of generating responses, we evaluate them with a response selection test to avoid known issues in the automatic evaluation metrics for generated responses (Liu et al., 2016a). Experimental results obtained using massive dialogue data from Twitter showed that modeling conversational situations improves the relevance of responses ( § 4).

Conversational situations
Various types of conversational situations could affect our response (or initial utterance) to the addressee. Since neural conversational models need massive data to train a reliable model, our study investigates conversational situations that are naturally given or can be identified in an unsupervised manner to make the experimental settings feasible.
In this study, we represent conversational situations as discrete variables. This allows the models to handle unseen situations in testing by classifying them into appropriate situation types via distributed representations or the like, as described below, and helps us analyze the outputs. We consider the following conversational situations for each utterance and response in our dialogue dataset ( § 4), and cluster the situations to assign specific situation types to the utterances and responses in the training data of our conversational models.
Utterance  The input utterance (to be responded to by the system) is a primary conversational situation and is already modeled by the encoder in the neural SEQ2SEQ model. However, we may be able to induce a different aspect of situations that are represented in the utterance but are not captured by the SEQ2SEQ sequential encoder (Sato et al., 2016). We first represent each utterance of the utterance-response pairs in our dialogue dataset by a distributed representation obtained by averaging word2vec vectors (pre-trained on our dialogue datasets ( § 4.1)) for the words in the utterance. The utterances are then classified by k-means clustering to identify utterance types.

User (profiles)  User characteristics should affect his/her responses, as Li et al. (2016b) have already discussed. We classify the profiles provided by each user in our dialogue dataset ( § 4.1) to acquire conversational situations specific to the speakers and addressees. As with the input utterance, we first construct a distributed representation of each user's profile by averaging the pre-trained word2vec vectors for the verbs, nouns, and adjectives in the user profile. The users are then classified by k-means clustering to identify user types.

Time (season)  Our utterances can be affected by when we speak, as illustrated in § 1, so we adopted time as one conversational situation. On the basis of the timestamps of the utterances and responses in our dataset, we split the conversation data into four season types: spring (Mar.-May), summer (Jun.-Aug.), autumn (Sep.-Nov.), and winter (Dec.-Feb.). This split reflects the climate in Japan, since our data are in Japanese, whose speakers mostly live in Japan.
In training our neural conversational models, we use each of the above conversational situation types for the speaker side and the addressee (who responds) side, respectively. Note that the utterance situation is considered only for the speaker side, since the response is unseen during response generation. In testing, the conversational situation types for input utterances (or speaker and addressee profiles) are identified by finding the closest centroid obtained by the k-means clustering of the utterances (profiles) in the training data.
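As a concrete sketch of this pipeline (averaged word vectors, k-means typing, nearest-centroid assignment at test time), the following uses random vectors as a stand-in for the pre-trained word2vec vectors; the vocabulary, utterances, and cluster count are toy values, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained word2vec vectors (random 50-d vectors here;
# in the paper they are trained on the dialogue data).
vocab = {w: rng.normal(size=50) for w in
         ["sleepy", "night", "bed", "drive", "car", "morning"]}

def utterance_vector(tokens):
    """Average the word vectors of known tokens (zeros if none known)."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def kmeans(X, k, iters=20):
    """Plain k-means; returns the cluster centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

# Training: cluster utterance vectors into k situation (utterance) types.
train_utts = [["sleepy", "morning"], ["sleepy", "night"],
              ["drive", "car"], ["bed", "night"]]
X = np.stack([utterance_vector(u) for u in train_utts])
centroids = kmeans(X, k=2)

# Testing: an unseen utterance gets the type of the closest centroid.
def situation_type(tokens):
    v = utterance_vector(tokens)
    return int(np.argmin(((centroids - v) ** 2).sum(-1)))
```

The same routine applies to user profiles, with the averaging restricted to verbs, nouns, and adjectives.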

Method
Our neural conversational models are based on the SEQ2SEQ model and integrate mechanisms to incorporate various conversational situations ( § 2) at the speaker side and the addressee side. In the following, we briefly introduce the SEQ2SEQ conversational model (Vinyals and Le, 2015) and then describe two mechanisms for incorporating conversational situations.

SEQ2SEQ conversational model
The SEQ2SEQ conversational model (Vinyals and Le, 2015) consists of two recurrent neural networks (RNNs) called an encoder and a decoder. The encoder takes each word of an utterance as input and encodes the input sequence into a real-valued vector representing the utterance. The decoder then takes the encoded vector as its initial state and repeatedly generates the most probable next word, feeding that word back to itself, until it finally outputs EOS.
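A minimal sketch of this encode-then-greedily-decode loop, using an untrained vanilla RNN with random weights in place of the trained LSTMs (all sizes and token IDs are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 8                        # toy vocabulary size and hidden size
EOS = 0
E = rng.normal(0, 0.1, (V, H))     # word embeddings
Wh = rng.normal(0, 0.1, (H, H))    # recurrent weights (one toy RNN for both sides)
Wo = rng.normal(0, 0.1, (H, V))    # output projection (softmax layer)

def step(h, word_id):
    """One vanilla-RNN step: fold the next word into the hidden state."""
    return np.tanh(h @ Wh + E[word_id])

def encode(utterance):
    """Run the encoder over the utterance; the final state summarizes it."""
    h = np.zeros(H)
    for w in utterance:
        h = step(h, w)
    return h

def decode(h, max_len=10):
    """Greedy decoding: emit the most probable word, feed it back, stop at EOS."""
    out = []
    for _ in range(max_len):
        w = int(np.argmax(h @ Wo))   # most probable next word
        if w == EOS:
            break
        out.append(w)
        h = step(h, w)
    return out

response = decode(encode([3, 1, 4]))
```

The decoder's initial state being the encoder's final state is the only coupling between the two networks; the situation-aware models below modify exactly this interface.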

Situation-aware conversational models
The challenge in designing situation-aware neural conversational models is how to inject given conversational situations into the RNN encoders or decoders. In this paper, we present two situation-aware neural conversational models that differ in how strong an effect a given situation has.

Local-global SEQ2SEQ
Motivated by recent successes in multi-task learning for deep neural networks (Liu et al., 2016c,b; Gupta et al., 2016; Luong et al., 2016), our local-global SEQ2SEQ jointly trains two types of RNN encoder and decoder for modeling situation-specific dialogues and universal dialogues (Figure 2).
Local-RNNs are meant to model dialogues in individual conversational situations at both the speaker and addressee sides. Each local-RNN is trained (i.e., its parameters are updated) only on dialogues under the corresponding situation. A salient disadvantage of this modeling is that the size of training data given to each local-RNN decreases as the number of situation types increases.
To address this problem, we combine another global-RNN encoder and decoder trained on all the dialogue data and take the weighted sum of the hidden states of the two RNNs for both the encoder and decoder to obtain the output as:

h_t = W_G h_t^G + W_L h_t^L,

where h_t^G = G(x_t, h_{t-1}^G) and h_t^L = L(x_t, h_{t-1}^L) are the hidden states of the two RNNs, G(·) and L(·) denote the global-RNN and local-RNN, respectively, and W_G and W_L are trainable matrices for the weighted sum. The embedding and softmax layers of the RNNs are shared.
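The weighted combination can be illustrated as follows, with random vectors standing in for the two RNNs' hidden states (the matrices W_G and W_L and the hidden size are toy stand-ins for the trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                # hidden size (toy value)
W_G = rng.normal(0, 0.1, (H, H))     # trainable weight for the global-RNN state
W_L = rng.normal(0, 0.1, (H, H))     # trainable weight for the local-RNN state

h_global = rng.normal(size=H)        # stand-in for the global-RNN hidden state h^G
h_local = rng.normal(size=H)         # stand-in for the situation-specific local-RNN state h^L

# Combined hidden state: h_t = W_G h_t^G + W_L h_t^L
h = h_global @ W_G + h_local @ W_L
```

Because W_G and W_L are learned jointly with both RNNs, the model can decide per dimension how much to trust the situation-specific local-RNN versus the data-rich global-RNN.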

SEQ2SEQ with situation embeddings
The local-global SEQ2SEQ ( § 3.2.1) assumes that dialogues with different situations involve different domains (or tasks) that are independent of each other. However, this assumption could be too strong in some cases and thus we devise another weakly situation-aware conversational model.
We represent the given situations at the speaker and addressee sides, s_spk and s_adr, as situation embeddings and then feed them to the encoder and decoder, respectively, prior to processing the sequences (Figure 3) as:

h_0 = RNN(h_init, e(s)),

where h_init is a vector filled with zeros and e(s) is the embedding of situation s. This encoding was inspired by a neural machine translation system (Johnson et al., 2016) that enables multilingual translation with a single model. Whereas it inputs the target language embedding only to the encoder to control the target language, we input the speaker-side situation to the encoder and the addressee-side one to the decoder.
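A sketch of feeding a situation embedding once, before any word of the sequence, assuming a toy vanilla-RNN encoder with random weights (the embedding tables and sizes are illustrative, not the paper's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
H, N_SIT, V = 8, 4, 10                # hidden size, situation types, vocab (toy values)
S = rng.normal(0, 0.1, (N_SIT, H))    # trainable situation embeddings e(s)
E = rng.normal(0, 0.1, (V, H))        # word embeddings
Wh = rng.normal(0, 0.1, (H, H))       # recurrent weights

def step(h, x):
    """One vanilla-RNN step on an input vector x."""
    return np.tanh(h @ Wh + x)

def encode_with_situation(utterance, situation_id):
    """h_0 = RNN(h_init, e(s)): inject the situation before the words."""
    h = step(np.zeros(H), S[situation_id])   # h_init is all zeros
    for w in utterance:
        h = step(h, E[w])
    return h

h = encode_with_situation([3, 1, 4], situation_id=2)
```

The decoder side is symmetric: the addressee-side situation embedding is fed to the decoder before it starts emitting words.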

Evaluation
In this section, we evaluate our situation-aware neural conversational models on massive dialogue data obtained from Twitter. We compare our models ( § 3.2) with the SEQ2SEQ baseline ( § 3.1) using a response selection test instead of evaluating generated responses, since Liu et al. (2016a) recently pointed out several problems with existing metrics such as BLEU (Papineni et al., 2002) for evaluating generated responses.

Settings
Data  We built massive dialogue datasets from our Twitter archive, which has been compiled since March 2011. In this archive, the timelines of about 1.5 million users 4 have been continuously collected with the official API. It is therefore suitable for extracting users' conversations from timelines.
On Twitter, a post (tweet) and a mention to it can be considered an utterance-response pair. We randomly extracted 23,563,865 and 1,200,000 pairs from dialogues in 2014 as training and validation datasets, and extracted 6,000 pairs in 2015 as a test dataset in accordance with the following procedure. Because we want to exclude from our evaluation dataset utterances that need context from past dialogue exchanges to respond to, we restrict ourselves to tweets that are not mentions to other tweets (in other words, utterances without past dialogue exchanges are chosen for evaluation). For each utterance-response pair in the test dataset, we randomly chose four (in total, 24,000) responses in 2015 as false response candidates, which together constitute five response candidates for the response selection test. Each utterance and response (candidate) is tokenized by MeCab with the NEologd dictionary to feed the sequence to the word-based encoder-decoder. Table 1 shows statistics on our dialogue datasets.

4 Our collection started from 26 popular Japanese users in March 2011, and the user set has iteratively expanded to those who are mentioned or retweeted by already targeted users.
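The extraction constraint (keep only pairs whose utterance is not itself a reply) and the false-candidate sampling can be sketched on toy records as follows; the record layout is hypothetical, not the actual archive format:

```python
import random

# Toy records: (tweet_id, in_reply_to_id, text); in_reply_to_id is None for
# tweets that are not mentions to other tweets.
tweets = [
    (1, None, "I feel so sleepy."),
    (2, 1, "Were you up all night?"),
    (3, None, "Good morning!"),
    (4, 3, "Morning! Sleep well?"),
    (5, 2, "Kind of."),
]
by_id = {tid: (reply_to, text) for tid, reply_to, text in tweets}

# An utterance-response pair is a tweet plus a mention to it; for evaluation
# we keep only pairs whose utterance is itself not a reply (no past context).
pairs = [
    (by_id[reply_to][1], text)
    for tid, reply_to, text in tweets
    if reply_to is not None and by_id[reply_to][0] is None
]

# For each test pair, sample false response candidates from other responses,
# so the ground truth plus the samples form the candidate set.
random.seed(0)
all_responses = [r for _, r in pairs]

def candidates(true_response, n_false=1):
    false = random.sample([r for r in all_responses if r != true_response], n_false)
    return [true_response] + false
```

In the toy data above, tweet 5 is excluded because its target (tweet 2) is itself a reply, mirroring the no-past-context restriction on the test set.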

Models
In our experiments, we compare our situation-aware neural conversational models (we refer to the model in § 3.2.1 as L/G SEQ2SEQ and the model in § 3.2.2 as SEQ2SEQ emb) with the situation-unaware baseline ( § 3.1), taking each type of conversational situation ( § 2) into consideration. We also evaluate the model in § 3.2.1 without global-RNNs (referred to as L SEQ2SEQ) to observe the impact of the global-RNNs. We used long short-term memory (LSTM) (Zaremba et al., 2014) as the RNN encoder and decoder, sampled softmax (Jean et al., 2015) to accelerate training, and TensorFlow to implement the models. Our LSTMs have three layers and are optimized with Adam (Kingma and Ba, 2015). The hyperparameters are fixed as in Table 2.
Evaluation procedure  We use the above models to rank the response candidates for a given utterance in the test set. We compute the average cross-entropy loss over the words in each response candidate (namely, the log of its perplexity) by feeding the candidate, following the input utterance, to each conversational model, and use the resulting values to rank the candidates and choose the top-k plausible ones. We adopt 1-in-t P@k as the evaluation metric, which indicates the ratio of utterances for which the single ground-truth response is included in the top k responses chosen from t candidates. Here we use 1-in-2 P@1, 1-in-5 P@1, and 1-in-5 P@2.

Results

Table 3 lists the results of the response selection test. The proposed conversational models successfully improved the relevance of selected responses by incorporating conversational situations. The best-performing proposed model differs depending on the situation type. We found from the dataset that many of the conversations did not seem to be affected by the seasons; that is, the time (season) situation is less influential than the other situations. This explains the poor performance of L SEQ2SEQ with time (season) situations, due to data sparseness in training the local-RNNs, although the sparseness is mostly addressed by the global-RNNs in L/G SEQ2SEQ.
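The ranking by average cross-entropy and the 1-in-t P@k metric described in the evaluation procedure can be sketched as follows, with made-up per-word probabilities standing in for model outputs:

```python
import math

def avg_cross_entropy(probs):
    """Average negative log-probability of the words in a candidate response
    (the log of its per-word perplexity), as assigned by a model."""
    return -sum(math.log(p) for p in probs) / len(probs)

def p_at_k(ranked_lists, k):
    """1-in-t P@k: fraction of test utterances whose single ground-truth
    response (index 0 by convention here) appears in the top-k candidates."""
    hits = sum(1 for ranking in ranked_lists if 0 in ranking[:k])
    return hits / len(ranked_lists)

# Toy example: per-word probabilities a model assigns to each of t=5
# candidates for one utterance; candidate 0 is the ground truth.
candidate_probs = [
    [0.4, 0.5, 0.3],    # ground truth: relatively high probabilities
    [0.1, 0.2, 0.1],
    [0.3, 0.1, 0.2],
    [0.05, 0.1, 0.1],
    [0.2, 0.2, 0.1],
]
losses = [avg_cross_entropy(p) for p in candidate_probs]
ranking = sorted(range(len(losses)), key=lambda i: losses[i])  # lowest loss first
score = p_at_k([ranking], k=1)
```

Lower average cross-entropy means the model finds the candidate more plausible, so the candidate with the lowest loss is ranked first.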
As stated in § 3.2.2, L/G SEQ2SEQ is expected to capture situations more strongly than SEQ2SEQ emb. To confirm this, we plotted scattergrams of the utterance vectors (Figure 4) and the user profile vectors (Figure 5) in the training data by using t-SNE (Maaten and Hinton, 2008). We provide cluster descriptions by manually looking into the content of the utterances and user profiles in each cluster. The descriptions are followed by ր if L/G SEQ2SEQ performed better than SEQ2SEQ emb in terms of 1-in-5 P@1 for test utterances with the corresponding situation type, by ց if the opposite, and by → if comparable (differences are within ±1.0%). Elements of the clusters were randomly sampled.
L/G SEQ2SEQ tends to perform better for utterances from speakers with densely concentrated (or coherent) profile clusters (Figure 5). This is because the utterances given by the speakers in these coherent clusters (and the associated responses) form similar conversations, whose situations are captured by the local-RNNs in the local-global SEQ2SEQ.

(Figure 4 cluster descriptions: diverse topics ր; opinions, questions ր; food, sightseeing ր; emotional ց; self-exposure ր; good morning ր; desires ր; coming home ր; good night ր; reporting return ր.)

This also explains why L/G SEQ2SEQ outperformed the other situation-aware conversational models when utterance situations are considered (Figure 4). Conversations in the same cluster are naturally consistent, and conversations assigned to the same clusters form typical activities or specific tasks (e.g., greetings, following other users, and questions (and answering)) in Twitter conversation. L/G SEQ2SEQ, designed as a kind of multi-task SEQ2SEQ, literally captures these task-specific behaviors in the conversations.
Although some utterance clusters contain general conversations (e.g., diverse topics), the response performance in those clusters still improved. This is because these general clusters are free from harmful common responses, which are quarantined into situation-specific clusters (e.g., greetings), so the corresponding local-RNNs avoid generating those common responses. Note that this problem has been pointed out and addressed by Li et al. (2016a) in a totally different way.

Examples  Table 4 lists the response candidates selected by the baseline and our models. As we had expected, the situation-aware conversational models are better at selecting ground-truth responses for situation-specific conversations.

Related Work
Conversational situations have been implicitly addressed by preparing datasets specific to the target situations and by solving the problem as a task-oriented conversation task (Williams and Young, 2007); examples include troubleshooting (Vinyals and Le, 2015), navigation (Wen et al., 2015), interviewing (Kobori et al., 2016), and restaurant search (Wen et al., 2017). In what follows, we introduce non-task-oriented conversational models that explicitly consider conversational situations.

Hasegawa et al. (2013) presented a conversational model that generates a response so that it elicits a certain emotion (e.g., joy) in the addressee's mind. Their model is based on statistical machine translation and linearly interpolates two conversational models that are trained from a small emotion-labeled dialogue corpus and a large non-labeled dialogue corpus, respectively. This model is similar to our local-global SEQ2SEQ but differs in that it has hyperparameters for the interpolation, whereas our local-global SEQ2SEQ automatically learns W_G and W_L from the training data.

Li et al. (2016b) proposed a neural conversational model that generates responses taking into consideration speakers' personalities such as gender or living place. Because they feed a specific speaker ID to their model and represent individual (known) speakers with embeddings, their model cannot handle unknown speakers. In contrast, our model can consider any speaker with a profile, because we represent each cluster of profiles with an embedding and find an appropriate profile type for a given profile by nearest-neighbor search.

Sordoni et al. (2015) encoded a given utterance and the past dialogue exchanges, and combined the resulting representations for an RNN to decode a response. Zhao et al. (2017) used a conditional variational autoencoder and automatically induced dialogue acts to handle discourse-level diversity in the encoder.
While these sophisticated architectures are designed to take dialogue histories into consideration, our simple models can easily exploit various situations.
Recently, Xing et al. (2017) proposed explicitly considering the topics of utterances to generate topic-coherent responses. Although they used latent Dirichlet allocation while we use k-means clustering, both methods confirm the importance of utterance situations. The way to obtain specific situations is still an open research problem; as demonstrated in this study, our primary contribution is the invention of neural mechanisms that can consider various conversational situations.
Our local-global SEQ2SEQ model is closely related to the many-to-many multi-task SEQ2SEQ proposed by Luong et al. (2016). The critical difference is that their model assumes only local tasks, while our model assumes many local tasks (situation-specific dialogue modeling) and one global task (general dialogue modeling).

Conclusion
We proposed two situation-aware neural conversational models that have general mechanisms for handling various conversational situations represented by discrete variables: (1) local-global SEQ2SEQ, which combines two SEQ2SEQ models ( § 3.2.1) to handle situation-specific dialogues and universal dialogues jointly, and (2) SEQ2SEQ with situation embeddings ( § 3.2.2), which feeds the situations directly to a SEQ2SEQ model. The response selection tests on massive Twitter datasets confirmed the effectiveness of using situations such as utterances, users (profiles), or time.