Estimating User Interest from Open-Domain Dialogue

Dialogue personalization is an important issue in the field of open-domain chat-oriented dialogue systems. If these systems could consider their users’ interests, user engagement and satisfaction would be greatly improved. This paper proposes a neural network-based method for estimating users’ interests from their utterances in chat dialogues in order to personalize dialogue systems’ responses. We introduce a method for effectively extracting topics and user interests from utterances, and also propose a pre-training approach that increases learning efficiency. Our experimental results indicate that the proposed model can estimate users’ interests more accurately than baseline approaches.


Introduction
Chat is a very important part of human communication; in fact, it has been reported to make up about 62% of all conversations (Koiso et al., 2016). Since chat is also important for human-to-machine communication, studies of dialogue systems that aim to enable open-domain chat have received much attention in recent years (Ritter et al., 2011; Higashinaka et al., 2014; Sordoni et al., 2015; Vinyals and Le, 2015; Zhao et al., 2017). In these studies, dialogue personalization is an important issue: if such systems could consider users' experiences and interests when engaging them in a conversation, it would greatly improve user satisfaction. To this end, Hirano et al. (2015) extracted predicate-argument structures, Zhang and Chai (2009, 2010) focused on conversational entailment, and Bang et al. (2015) extracted entity relationships. These studies aimed to employ users' utterance histories to generate personalized responses.
In contrast, this study aims to estimate the user's interest in particular topics (e.g., music, fashion, or health) to personalize the dialogue system's responses based on these interests. This would allow it to focus on topics the user is interested in and avoid topics they dislike, enhancing user engagement and satisfaction.
This paper therefore proposes a neural network-based method for estimating users' interests from their utterances in chat dialogues. Our method estimates their levels of interest not only in topics that appear in the dialogues, but also in topics that have not appeared. Even if a user enjoys talking about the current topic, they will get bored if the system talks about it endlessly. By gauging the user's potential interest in topics that have not directly appeared in the dialogue, the system can expand the discussion to other topics before the user gets bored.
In this study, we use data from human-to-human dialogues because the current performance of chat-oriented dialogue systems is not sufficient for them to talk with humans naturally. We also use textual dialogue data to avoid speech recognition issues. In addition, to estimate the target user's interests independently of the dialogue system's utterances, we only consider their own utterances and ignore those of their dialogue partner. This paper makes three main contributions:

1. We propose a topic-specific sentence attention approach that enables topics and user interests to be efficiently extracted from utterances.

2. We develop a method for pre-training our model's utterance encoder, so that it learns which topics are related to each of the target user's utterances.

3. We show experimentally that the proposed sentence attention and pre-training methods provide high performance when used together.

Related Work
Many studies related to estimating user interest from text data have targeted social network services (SNS), especially Twitter. For example, Chen et al. proposed a method of modeling interest using the frequencies of words in tweets by the target user and their followers. Some methods have also been proposed that consider superordinate concepts acquired from knowledge bases. For example, Abel et al. (2011) modeled Twitter users using the appearance frequencies of certain named entities (e.g., people, events, or music groups), acquired using OpenCalais (http://www.opencalais.com/). In addition, some methods have used categories from Wikipedia (Michelson and Macskassy, 2010; Kapanipathi et al., 2014; Zarrinkalam et al., 2015) or DBPedia (Kapanipathi et al., 2011). Several methods have also been proposed that use topic models, such as latent Dirichlet allocation (LDA) (Weng et al., 2010; Bhattacharya et al., 2014; Han and Lee, 2016). However, it is difficult to apply such methods directly to dialogue because they assume that users post about subjects they are interested in. This is a reasonable assumption for SNS data, but in conversations, people do not always limit themselves to topics they are interested in. For instance, people will play along and discuss subjects the other person is interested in, even when they themselves are not.
Other studies have attempted to estimate users' levels of interest (LOI) from dialogues. Schuller et al. (2006) tackled the task of estimating listeners' interest in a product from dialogues between them and someone introducing that product, proposing a support vector machine (SVM)-based method incorporating acoustic and linguistic features. In 2010, LOI estimation was selected as a sub-challenge of the INTERSPEECH Paralinguistic Challenge, but there the focus was on single-topic (product) interest estimation from spoken dialogue, not open-domain estimation. In addition, that task considered business dialogues, not chats.

Model Architecture
The task considered in this paper is as follows. Given an utterance set U_s = (u_1, u_2, ..., u_n) uttered by a speaker s during dialogues with other speakers, we estimate their degrees of interest Y_s = (y_1, y_2, ..., y_m) in the topics in a given topic set T = (t_1, t_2, ..., t_m). Here, each t_i corresponds to a concrete topic, such as movies or travel, while y_i indicates the speaker's level of interest in t_i, on the three-point scale used for the LOI estimation task described in the previous section. Using this scale, each y_i can take the values 0 (disinterest, indifference, or neutrality), 1 (light interest), or 2 (strong interest).
To accurately gauge the speaker's interest from their utterances, we believe it is important to extract the following two types of information efficiently.
• The topic of each utterance

• How interested the speaker is in the topic
Our proposed interest estimation model extracts this information efficiently and uses a pre-training method to improve learning. Figure 1 presents an overview of our neural network model, which first encodes each word sequence, applies word attention and topic-specific sentence attention, and finally estimates the degrees of interest D_s = (d_{t_1}, d_{t_2}, ..., d_{t_m}).
The proposed pre-training method is used for the word sequence encoder. The model is described in detail below.

Word Sequence Encoder
The word sequence encoder converts utterances into fixed-length vectors using a recurrent neural network (RNN). First, the words in each utterance are converted into word vectors using Word2vec (Mikolov et al., 2013), giving word vector sequences x = (x_1, x_2, ..., x_l). The RNN encoder uses a hidden bidirectional GRU (BiGRU) layer, which consists of a forward GRU that reads from x_1 to x_l in order and a backward GRU that reads from x_l to x_1 in reverse order. The forward GRU computes the forward hidden states →h_i as follows:

→h_i = GRU(x_i, →h_{i-1})
The backward GRU calculates the backward hidden states ←h_i in a similar way:

←h_i = GRU(x_i, ←h_{i+1})

By combining the outputs of both GRUs, we obtain the hidden state h_i:

h_i = [→h_i : ←h_i]

where [:] represents vector concatenation.
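As a concrete illustration, the bidirectional pass can be sketched in plain NumPy. This is a minimal sketch, not the paper's implementation: the function names (gru_step, bigru_encode, make_params), the random initialization, and the omission of bias terms are our own simplifications standing in for a trained deep learning framework model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    # Standard GRU cell: update gate z, reset gate r, candidate state.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

def bigru_encode(xs, p_fwd, p_bwd, hidden):
    # Forward GRU reads x_1..x_l; backward GRU reads x_l..x_1.
    # Each position's two states are concatenated into h_i.
    l = len(xs)
    hf, hb = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], [None] * l
    for i in range(l):
        hf = gru_step(xs[i], hf, p_fwd)
        fwd.append(hf)
    for i in reversed(range(l)):
        hb = gru_step(xs[i], hb, p_bwd)
        bwd[i] = hb
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

def make_params(rng, in_dim, hidden):
    # Random weights standing in for trained ones.
    return {k: 0.1 * rng.standard_normal((hidden, in_dim if k.startswith("W") else hidden))
            for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
```

With the settings reported later in the paper, the inputs would be 200-dimensional word vectors; small dimensions suffice for illustration.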

Topic Classification Pre-Training
Estimating the user's level of interest in each topic requires first assessing the topic of each utterance. Since this is not given explicitly, the model must infer it from the utterance set and the degrees of interest in each topic, so the learning difficulty is high. In this study, based on the idea of pre-training (Erhan et al., 2010), we introduce a new pre-training method for the word sequence encoder based on a sentence topic classification task. The important point about this task is that its topic classes are identical to those in the topic set T. This reduces the difficulty of learning the relationships between utterances and topics and allows the model to focus on interest estimation during the main training phase. During pre-training, the classification probability p for each topic is calculated from the output h_l of the BiGRU after inputting the last word vector x_l (word attention, as described in the next section, is not used in pre-training):

p = softmax(W_c h_l + b_c)

where W_c and b_c are parameters for topic classification. Cross-entropy is used as the loss function during pre-training.
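The pre-training head described above can be sketched in NumPy as follows; the shapes and parameter names are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def topic_probs(h_l, W_c, b_c):
    # Classification probabilities over the topic set T, computed from
    # the final BiGRU state h_l: p = softmax(W_c h_l + b_c).
    return softmax(W_c @ h_l + b_c)

def cross_entropy(p, gold_topic):
    # Pre-training loss: negative log-probability of the gold topic class.
    return -float(np.log(p[gold_topic]))
```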

Word Attention
Based on an idea from Yang et al. (2016), we also included word attention in our model. Word attention is based on the idea that not all words contribute equally to the desired result, and uses an attention mechanism to weight each word differently. The resulting utterance vector z is obtained as follows:

e_i = tanh(W_ω h_i + b_ω)

α_i = exp(e_i⊤ v_ω) / Σ_j exp(e_j⊤ v_ω)

z = Σ_i α_i h_i
Here, W_ω and b_ω are parameters. Unlike the original attention mechanisms used in neural translation (Bahdanau et al., 2015) and neural dialogue (Shang et al., 2015) models, the word attention mechanism uses a common parameter, called the context vector v_ω, to calculate the weight α_i for each hidden state. v_ω is a high-level representation for calculating word importance and, like the model's other parameters, is randomly initialized and then optimized.
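The word attention computation can be sketched in NumPy as below; this is an illustrative sketch with assumed shapes, not the authors' code.

```python
import numpy as np

def word_attention(H, W_w, b_w, v_w):
    # H: (l, d) word-level hidden states. A one-layer projection with tanh
    # gives e_i; dotting with the shared context vector v_w and applying a
    # softmax gives the per-word weights alpha_i; z is the weighted sum.
    E = np.tanh(H @ W_w.T + b_w)      # (l, d_a)
    scores = E @ v_w                  # (l,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    z = alpha @ H                     # utterance vector
    return z, alpha
```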

Topic-Specific Sentence Attention
Our model uses the word sequence encoder with word attention to convert the utterance set U_s = (u_1, u_2, ..., u_n) into the utterance vector set Z_s = (z_1, z_2, ..., z_n). It then extracts information for estimating the level of interest in each topic from Z_s, but, as with word attention, not all utterances contribute equally. Yang et al. proposed a sentence attention mechanism that takes the same approach as word attention, but, since it uses only one parameter to calculate sentence importance (analogous to the context vector v_ω for word attention), it is not capable of topic-specific estimation. This matters because the important utterances in a given utterance set differ from topic to topic. For example, "I jog every morning" is probably useful for estimating interest in topics such as sports or health, but not in, say, computers or vehicles.
In this study, we therefore propose a new topic-specific sentence attention approach. The topic vector v_{t_i} represents the importance of each sentence for topic t_i, and the associated content vector c_{t_i} is calculated as follows:

e_j = tanh(W_r z_j + b_r)

α_{j,t_i} = exp(e_j⊤ v_{t_i}) / Σ_k exp(e_k⊤ v_{t_i})

c_{t_i} = Σ_j α_{j,t_i} z_j
Here, W r and b r are shared, topic-independent parameters. The topic vector v t i is randomly initialized and then optimized during training.
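A vectorized sketch of this topic-specific attention in NumPy follows. The matrix layout (one row per utterance, one topic vector per row of V_t) is our own illustrative choice.

```python
import numpy as np

def topic_sentence_attention(Z, W_r, b_r, V_t):
    # Z: (n, d) utterance vectors; V_t: (m, d_a) with one topic vector
    # v_{t_i} per topic. W_r and b_r are shared across topics; only the
    # topic vectors differ, giving topic-specific weights.
    E = np.tanh(Z @ W_r.T + b_r)           # (n, d_a)
    scores = E @ V_t.T                     # (n, m): utterance-by-topic scores
    scores -= scores.max(axis=0, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)      # alpha_{j,t_i}: columns sum to 1
    C = A.T @ Z                            # (m, d): one content vector per topic
    return C, A
```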

Interest Estimation
We then use the content vector c_{t_i} to compute the degree of interest d_{t_i} in topic t_i as follows:

d_{t_i} = tanh(W_{t_i} c_{t_i} + b_{t_i}) + 1

Here, the parameters W_{t_i} and b_{t_i} estimate the degree of interest in topic t_i, and are optimized separately for each topic. Adding one ensures that d_{t_i} uses the same 0-to-2 range as the correct values y_i. During training, we use the mean squared error (MSE) between the correct answers y_i and the estimates d_{t_i} as the loss function:

L = (1/m) Σ_{i=1}^{m} (y_i − d_{t_i})²
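A minimal NumPy sketch of this estimation layer and loss follows; shapes and names are illustrative assumptions.

```python
import numpy as np

def estimate_interest(C, W_t, b_t):
    # C: (m, d) content vectors, one per topic; W_t: (m, d) per-topic weight
    # vectors; b_t: (m,) per-topic biases. tanh squashes each score into
    # (-1, 1), and adding one shifts it onto the 0-2 interest scale.
    raw = np.einsum("md,md->m", W_t, C) + b_t  # per-topic dot products
    return np.tanh(raw) + 1.0

def mse_loss(y, d):
    # Training loss: mean squared error against the questionnaire scores y.
    return float(np.mean((np.asarray(y) - np.asarray(d)) ** 2))
```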

Experiments
We conducted a series of experiments to evaluate the proposed method's performance. For these, we created a dataset based on logs of one-to-one text chats between human subjects and the results of questionnaires answered by each subject. We also tested several baseline methods for comparison purposes.

Datasets
We asked each subject to first fill out a questionnaire about their interests and then engage in text chats in Japanese with partners they had not previously been acquainted with. We recruited 163 subjects via the CrowdWorks (https://crowdworks.jp/) crowd-sourcing site. The subjects were asked to rate their levels of interest in the 24 topic categories shown in Table 1 using the three-point scale discussed in Section 3. These topics were selected based on the categories used by Yahoo! Chiebukuro (https://chiebukuro.yahoo.co.jp/), a Japanese question-and-answer site, focusing on topics that are likely to appear in one-to-one dialogues between strangers.
Each dialogue lasted for one hour and was conducted via Skype instant messaging. We only instructed the subjects to "Please try to find things you and your partner are both interested in and then try to broaden your conversation about these subjects." We gave the subjects no specific instructions as to the intended content or topics of their conversations. Table 2 shows an example dialogue between subjects A and B.
All the utterances in the chat data were then grouped by subject. Each data point consisted of all the data about one subject, namely their chat utterances and questionnaire responses (corresponding to U_s and Y_s as defined in Section 3). The data was evaluated using 10-fold cross-validation, and its statistics are shown in Table 3.

Settings
Word2Vec (Mikolov et al., 2013) was trained using 100 GB of Twitter data with 200 embedding cells, a minimum word frequency of 10, and a skip-gram window size of 5. The word sequence encoder was a single-layer BiGRU RNN with 200 input cells and 400 output cells. The word and sentence attention layers had 400 input and output cells while the estimation layer had 400 input cells and 1 output cell. The model was trained using Adam (Kingma and Ba, 2015).
During pre-training, we used questions and answers from the Yahoo! Chiebukuro Data (2nd edition) for each topic. All topics were equally covered: a total of 770k sentences were used for training, while 2,400 sentences (100 for each topic) were used for testing. After pre-training, the topic classification accuracy on the test data was 0.755.

Evaluation
When using the proposed method as part of a dialogue system, it is useful to select the best topic from those available so that the system can generate an appropriate response. Therefore, in this experiment, the topics were ranked based on the estimated degrees of interest d_{t_i}, and the methods were evaluated on whether they ranked the topics the user was interested in higher and the other topics lower. The rankings were evaluated using the normalized discounted cumulative gain (NDCG), a widely used metric in the field of information retrieval. It takes values between 0 and 1, with higher values indicating more accurate ranking predictions, and is calculated as follows:

DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log_2(i + 1)

NDCG@k = DCG@k / IDCG@k
Here, k is the number of top-ranked topics used for the NDCG calculation, and rel_i is the graded relevance of the result at position i, which in this experiment was given by the degrees of interest Y_s. The ideal DCG (IDCG) is the DCG that would be obtained if the ranking were correctly ordered by relevance.
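The metric can be computed directly; in the sketch below, the relevance values are the 0-2 interest scores, and the 2^rel − 1 gain is the common graded-relevance form (the paper does not spell out its exact gain function, so this is an assumption).

```python
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i+1) for i=1..k
    return float(((2.0 ** rel - 1.0) / discounts).sum())

def ndcg_at_k(predicted_order, true_relevance, k):
    # predicted_order: topic indices sorted by estimated interest d_{t_i};
    # true_relevance: questionnaire scores y_i (0, 1, or 2) per topic.
    rel = [true_relevance[i] for i in predicted_order]
    idcg = dcg_at_k(sorted(true_relevance, reverse=True), k)
    return dcg_at_k(rel, k) / idcg if idcg > 0 else 0.0
```

A perfect ranking yields an NDCG of 1, and any misordering of unequal relevances yields a strictly lower value.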
In addition, to evaluate the accuracy of the estimated degrees of interest in each topic, we also calculated the MSEs between the results of each method and the correct answers.

Baseline Methods
To evaluate the proposed model, we also conducted experiments using the following three modified models.

Without Pre-Training
To evaluate the effectiveness of topic classification pre-training, we tested our model without this step. Instead, the word sequence encoder was randomly initialized and then trained. This model was otherwise identical to the proposed method.

Without Sentence Attention
To evaluate the effectiveness of topic-specific sentence attention, we tried instead using max-pooling to obtain the content vector. Again, this model was otherwise identical to the proposed method.

Without Pre-Training or Sentence Attention
This model combined the two modifications mentioned above: it did not use topic classification pre-training and used max-pooling to obtain the content vectors, but was otherwise identical to the proposed method.
We also compared our model's performance to those of the following two baseline methods.

Topic Frequency
The first baseline was based on a method proposed by Abel et al. (2011) that identifies the named entities (such as people, events, or music groups) associated with words in the user's tweets using OpenCalais and models the user's interests as a named entity frequency vector. However, as we used Japanese dialogues, we could not use OpenCalais, so we instead used the topic classifier described in Section 3.2. Since this classifier is trained to classify sentences rather than words, we employed sentence-level topic frequencies. These frequencies were used to gauge the user's interest, and the topics were ranked in frequency order.

SVR
The second baseline method used support vector regression (SVR) to estimate the degrees of interest. We conducted experiments using only unigrams, and using both unigrams and bigrams, with the RBF kernel function. The SVR models were trained for each topic individually and then used to estimate the degrees of interest.

Figure 2 shows the NDCG results for the topics ranked in the top k. These indicate that the proposed method performed better than the other methods for all values of k. Comparing the performances of the methods that used pre-training ("Proposed" and "Without Sentence Attention") with those that did not ("Without Pre-Training" and "Without Pre-Training or Sentence Attention") indicates that the proposed pre-training step was effective. On the other hand, the method that used sentence attention alone ("Without Pre-Training") showed nearly the same results as the one that used neither ("Without Pre-Training or Sentence Attention"), although the latter did achieve higher NDCGs for k ≥ 5. This indicates that using sentence attention alone does not improve performance. However, the proposed method performed better than the method without sentence attention, confirming that sentence attention is useful, but only when used in conjunction with pre-training.
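The per-topic SVR baseline described above could be implemented along the following lines with scikit-learn; the feature extraction details here (joining a speaker's utterances and using CountVectorizer) are our assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVR

def train_svr_per_topic(utterance_sets, interest_matrix, ngram_range=(1, 1)):
    # One bag-of-ngrams feature vector per speaker (their utterances joined),
    # and one RBF-kernel SVR per topic, as in the baseline.
    docs = [" ".join(us) for us in utterance_sets]
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    X = vectorizer.fit_transform(docs)
    models = [SVR(kernel="rbf").fit(X, interest_matrix[:, t])
              for t in range(interest_matrix.shape[1])]
    return vectorizer, models

def predict_interest(vectorizer, models, utterances):
    # Estimate the speaker's degree of interest in every topic.
    x = vectorizer.transform([" ".join(utterances)])
    return np.array([m.predict(x)[0] for m in models])
```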

Results
Turning now to the SVR-based methods, we observe that using only unigram features worked better than using both unigrams and bigrams, although both methods were still inferior to the neural network-based methods, including the proposed method.
When k = 1, the topic frequency baseline achieved higher NDCGs than the SVR-based methods, because it correctly captured that users were strongly interested in the topics they spoke about most frequently. However, these results were still inferior to those of the neural network-based methods. Furthermore, it gave the worst NDCG results among all the methods for k ≥ 4, due to speakers sometimes talking about subjects they were not interested in, as discussed in Section 2. Table 4 shows the MSEs between the degree-of-interest estimates of each method and the correct answers (excluding Topic Frequency, which cannot output degrees of interest). The proposed method gave a significantly smaller MSE, indicating that its estimates were the most accurate. In addition, the "Without Pre-Training" method showed the lowest performance of all the neural network-based methods, again indicating that the proposed sentence attention approach is not effective without pre-training.

Discussion
The experimental results discussed in the previous section indicate that it is important to use the proposed pre-training and sentence attention steps together. To analyze the sentence attention mechanism further, we visualized the sentence weights α_{j,t_i} produced by the topic-specific sentence attention for selected topics and utterances. Figures 3 and 4 show the sentence weights with and without pre-training, respectively. Here, darker cells indicate higher α_{j,t_i} values. Figure 3 shows that the sentence weights for the topics corresponding to the actual meaning of each sentence are high. Utterances (1), (2), and (3) are easy-to-understand examples: the topics related to each utterance receive the highest weights. In addition, utterance (4) includes sports-related words such as "baseball" and "rule", but the weight of the "Sports/Exercise" topic is not high because the utterance does not indicate such an interest on the part of the speaker. Thus, the sentence weights do not simply reflect the topics of the words, but also the user's level of interest in the topic. Interestingly, although utterance (6) refers to the smartphone game "Pokemon GO", the weight of the "Game" topic is not very high, while those of the "Sports/Exercise" and "Health" topics are both high. Pokemon GO appeals to people who do not usually play games, and this appears to be reflected in the results. On the other hand, utterance (7) shows high weights for several topics that intuitively appear to be unrelated to the utterance itself.
The sentence weights shown in Figure 4 often do not correspond to the topics or meanings of the utterances. For example, utterance (5) is not important for interest estimation and its weights in Figure 3 are small. However, in Figure 4, all weights are relatively high. Similarly, utterances (7) and (8) show high weights for unrelated topics.
The above results confirm that the pre-training step is important for learning the topic-specific sentence attention correctly. Without pre-training, the model must learn the relationships between utterances and topics from a clean slate, and the difficulty of this task makes it harder to determine the appropriate weights. The experimental results in the previous section show that pre-training makes this task easier and improves performance. With proper pre-training, topic-specific sentence attention then enabled the proposed method to achieve the best performance.

Conclusion
In this paper, we have presented a neural network-based method for estimating users' levels of interest in a pre-determined list of topics based on their utterances in chat dialogues. The proposed method first encodes utterances using a BiGRU with word attention to obtain a set of utterance vectors. It then uses these to generate content vectors corresponding to each topic via topic-specific sentence attention. Finally, it uses the content vectors to estimate the user's degree of interest in each topic. The utterance encoder is pre-trained to classify sentences by topic before the whole model is trained. Our experimental results showed that the proposed method can estimate degrees of interest in topics more accurately than baseline methods. In addition, we found that it was most effective to use topic-specific sentence attention and topic classification pre-training in combination.
In future work, we plan to apply the proposed method to a dialogue system and conduct dialogue experiments with human users. Even if we can estimate which topics a user is interested in, generating and selecting concrete utterances remains a challenging problem. For example, users who are interested in sports are not equally interested in all of them: someone may be interested in football but not in golf, for instance. We therefore plan to develop an appropriate way of incorporating the proposed method into such a system.