Chat Detection in an Intelligent Assistant: Combining Task-oriented and Non-task-oriented Spoken Dialogue Systems

Recently emerged intelligent assistants on smartphones and home electronics (e.g., Siri and Alexa) can be seen as novel hybrids of domain-specific task-oriented spoken dialogue systems and open-domain non-task-oriented ones. To realize such hybrid dialogue systems, this paper investigates determining whether or not a user is going to have a chat with the system. To address the lack of benchmark datasets for this task, we construct a new dataset consisting of 15,160 utterances collected from the real log data of a commercial intelligent assistant (and will release the dataset to facilitate future research activity). In addition, we investigate using tweets and Web search queries for handling open-domain user utterances, which characterize the task of chat detection. Experiments demonstrated that, while simple supervised methods are effective, the use of the tweets and search queries further improves the F_1-score from 86.21 to 87.53.


Introduction
Chat detection
Conventional studies on spoken dialogue systems (SDS) have investigated either domain-specific task-oriented SDS (Williams and Young, 2007) or open-domain non-task-oriented SDS (a.k.a. chatbots or chat-oriented SDS) (Wallace, 2009). The former offers convenience by helping users complete tasks in specific domains, while the latter offers entertainment through open-ended chatting (or small talk) with users. Although the functionalities offered by the two types of SDS are complementary, little practical effort has been made to combine them. This has unfortunately limited the potential of SDS.
This situation is now being changed by the emergence of voice-activated intelligent assistants on smartphones and home electronics (e.g., Siri and Alexa). These intelligent assistants typically perform various tasks (e.g., Web search, weather checking, and alarm setting) while being able to have chats with users. They can be seen as a novel hybrid of multi-domain task-oriented SDS and open-domain non-task-oriented SDS.
To realize such hybrid SDS, we have to determine whether or not a user is going to have a chat with the system. For example, if a user says "What is your hobby?" she is considered to be starting a chat with the system. On the other hand, if she says "Set an alarm at 8 o'clock," she is probably trying to operate her smartphone. We refer to this task as chat detection and treat it as a binary classification problem.
Chat detection has not been explored enough in past studies. This is primarily because few attempts have been made to develop hybrids of task-oriented and non-task-oriented SDS (see Section 2 for related work). Although task-oriented and non-task-oriented SDS have long research histories, neither of them requires chat detection. Typically, users of task-oriented SDS do not have chats with the systems, and users of non-task-oriented SDS always have chats with the systems.

Summary of this paper
In this work, we construct a new dataset for chat detection. As discussed above, chat detection has not been explored enough, and thus no benchmark datasets are available. To address this situation, we collected 15,160 user utterances from real log data of a commercial intelligent assistant, and recruited crowd workers to annotate those utterances with whether or not the users are going to have chats with the intelligent assistant. The resulting dataset will be released to facilitate future studies.
The technical challenge in chat detection is that we have to handle open-ended utterances of intelligent assistant users. Commercial intelligent assistants have a vast number of users, and they talk about a wide variety of topics, especially when chatting with the assistants. It consequently becomes labor-intensive to collect a sufficiently large amount of annotated data for training accurate chat detectors.
We develop supervised binary classifiers to perform chat detection. We address the open-ended user utterances, which characterize chat detection, by using unlabeled external resources. We specifically utilize tweets (i.e., Twitter posts) and Web search queries to enhance the supervised classifiers.
Experimental results demonstrated that, while simple supervised methods are effective, the external resources are able to further improve them. The use of the external resources improves the F_1-score by over 1 point (from 86.21 to 87.53).

Related Work
Previous studies on combining task-oriented and non-task-oriented SDS
Task-oriented and non-task-oriented SDS have long been investigated independently, and few attempts have been made to develop hybrids of the two types of SDS. As a consequence, previous studies have not investigated chat detection, with only a few exceptions. Niculescu and Banchs (2015) explored using non-task-oriented SDS as a back-off mechanism for task-oriented SDS. They, however, did not propose any concrete methods of automatically determining when to switch to non-task-oriented SDS. Lee et al. (2007) proposed an example-based dialogue manager to combine task-oriented and non-task-oriented SDS. In such a framework, however, it is difficult to flexibly utilize state-of-the-art supervised classifiers as a component.
Other studies proposed machine-learning-based frameworks for combining multi-domain task-oriented SDS and non-task-oriented SDS (Wang et al., 2014; Sarikaya, 2017). These assume that several components, including a chat detector, are already available, and explore integrating those components. They say little about how to develop each of the components. On the other hand, the focus of this work is to develop one of those components, a chat detector. Although it lies outside the scope of this paper to explore how to exploit the chat detection method in a full dialogue system, the method could serve, for example, as one component within those frameworks.

Intent and domain determination
Chat detection is related to, but different from, intent and domain determination, which have been studied in the field of SDS (Guo et al., 2014; Xu and Sarikaya, 2014; Ravuri and Stolcke, 2015; Kim et al., 2016; Zhang and Wang, 2016).
Both intent and domain determination have been investigated in domain-specific task-oriented SDS. Intent determination aims to determine the type of information a user is seeking in single-domain task-oriented SDS. For example, in the ATIS dataset, which is collected from an airline travel information service system, the information type includes flight, city, and so on (Tur et al., 2010). On the other hand, domain determination aims to determine which domain is relevant to a given user utterance in multi-domain task-oriented SDS (Xu and Sarikaya, 2014). Note that domain determination may be followed by intent determination.
Unlike intent and domain determination, chat detection targets hybrid systems of multi-domain task-oriented SDS and open-domain non-task-oriented SDS, and aims to determine whether or not the non-task-oriented component is responsible for a given user utterance (i.e., whether or not the user is going to have a chat). Therefore, the objective of chat detection is different from that of intent and domain determination.
It may be possible to see chat detection as a specific problem of domain determination (Sarikaya, 2017). We nevertheless discuss it as a different problem because of the uniqueness of the "chat domain." It greatly differs from ordinary domains in that it plays the role of combining the two different types of SDS that have long been studied independently, rather than combining multiple SDS of the same type. In addition, we discuss the use of external resources, especially tweets, for chat detection. This approach is unique to chat detection and is not considered effective for ordinary domain determination.
It is interesting to note that, unlike intent and domain determination, chat detection is not followed by slot-filling, as long as we use a popular response generator such as a seq2seq model (Sutskever et al., 2014) or an information-retrieval-based approach (Yan et al., 2016). Although joint intent (or domain) determination and slot-filling has been widely studied to improve accuracy (Guo et al., 2014; Zhang and Wang, 2016), the same approach is not feasible in chat detection.

Intelligent assistant
Previous studies on intelligent assistants have not investigated chat detection. Their research topics are centered around user behaviors, including the prediction of user satisfaction and engagement (Jiang et al., 2015; Kobayashi et al., 2015; Sano et al., 2016; Kiseleva et al., 2016a,b) and gamification (Otani et al., 2016). For example, Jiang et al. (2015) investigated predicting whether users are satisfied with the responses of intelligent assistants by combining diverse features including clicks and utterances. Sano et al. (2016) explored predicting whether users will keep using the intelligent assistants in the future by using long-term usage histories.
Some earlier works used the Cortana dataset as a benchmark of domain determination (Guo et al., 2014; Xu and Sarikaya, 2014; Kim et al., 2016) or proposed a development framework for Cortana (Crook et al., 2016). Those studies, however, regarded the intelligent assistant as merely one example of multi-domain task-oriented SDS and did not explore chat detection.

Non-task-oriented SDS
Non-task-oriented SDS have long been studied in the research community. While early studies adopted rule-based methods (Weizenbaum, 1966; Wallace, 2009), statistical approaches have recently gained much popularity (Ritter et al., 2011; Vinyals and Le, 2015). This research direction was pioneered by Ritter et al. (2011), who applied a phrase-based SMT model to response generation. Later, Vinyals and Le (2015) used the seq2seq model (Sutskever et al., 2014). To date, a number of follow-up studies have been made to improve the response quality (Hasegawa et al., 2013; Shang et al., 2015; Sordoni et al., 2015; Li et al., 2016a,b; Gu et al., 2016; Yan et al., 2016). Those studies assume that users always want to have chats with systems and investigate only methods of generating appropriate responses to given utterances. Chat detection is required for integrating those response generators into intelligent assistants.

Use of conversational data
The recent explosion of conversational data on the Web, especially tweets, has triggered a variety of dialogue studies. Those typically used tweets either for training response generators (cf. Section 2.4) or for discovering dialogue acts in an unsupervised fashion (Ritter et al., 2010; Higashinaka et al., 2011). This treatment of tweets differs from that in our work.

Chat Detection Dataset
In this section, we explain how we constructed the new benchmark dataset for chat detection. We then analyze the data to provide insights into the actual user behavior.

Construction procedure
We sampled 15,160 unique utterances (i.e., automatic speech recognition results) from the real log data of a commercial intelligent assistant, Yahoo! Voice Assist. The log data were collected between Jan. and Aug. 2016. In the log data, some utterances such as "Hello" appear frequently. To construct a dataset containing both high and low frequency utterances, we set frequency thresholds to divide the utterances into three groups (high, middle, and low frequency) and then randomly sampled the same number of utterances from each of the three groups. During the data collection, we ensured privacy by manually removing utterances that included the full name of a person or detailed address information.
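The frequency-stratified sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_by_frequency`, the threshold parameters `low_max` and `high_min`, and the toy log are hypothetical, since the actual threshold values are not given in this text.

```python
import random
from collections import Counter

def sample_by_frequency(log, low_max, high_min, per_group, seed=0):
    """Group unique utterances by frequency and draw the same number from each group.

    low_max and high_min are hypothetical thresholds: utterances seen at most
    low_max times are "low", those seen at least high_min times are "high",
    and everything in between is "middle".
    """
    counts = Counter(log)
    groups = {"low": [], "middle": [], "high": []}
    for utterance, freq in sorted(counts.items()):
        if freq <= low_max:
            groups["low"].append(utterance)
        elif freq >= high_min:
            groups["high"].append(utterance)
        else:
            groups["middle"].append(utterance)
    rng = random.Random(seed)
    sample = []
    for name in ("high", "middle", "low"):
        # random.sample raises ValueError if a group has fewer than per_group items.
        sample.extend(rng.sample(groups[name], per_group))
    return sample

# Toy log: each entry is one logged utterance (duplicates encode frequency).
toy_log = (["hello"] * 50 + ["good morning"] * 40 + ["weather"] * 10
           + ["set alarm"] * 8 + ["are you angry"] + ["tokyo tower"])
picked = sample_by_frequency(toy_log, low_max=1, high_min=20, per_group=2)
```

Sampling over unique utterances rather than raw log entries keeps very frequent utterances such as "Hello" from dominating the dataset.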
Next, we recruited crowd workers to annotate the 15,160 utterances with two labels, CHAT and NONCHAT. The workers assigned the CHAT label when users were going to have chats with the intelligent assistant and assigned the NONCHAT label when users were seeking some information (e.g., searching the Web or checking the weather) or were trying to operate the smartphones (e.g., setting alarms or controlling volume). Note that our intelligent assistant works primarily on smartphones, and thus the NONCHAT utterances include many operational instructions such as alarm setting. Example utterances are given in Table 1.
Seven workers were assigned to each utterance, and the final labels were obtained by majority vote to address the quality issue inherent in crowdsourcing. The last column in Table 1 shows the number of votes that the majority label obtained. For example, five workers provided the CHAT label (and the other two provided the NONCHAT label) to the first utterance "Let's talk about something."
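The majority-vote aggregation can be illustrated with a few lines of Python. The function name `majority_label` is hypothetical; with seven workers and two labels, a tie cannot occur, so no tie-breaking rule is needed.

```python
from collections import Counter

def majority_label(votes):
    """Return the majority label and the number of votes it obtained."""
    label, n_votes = Counter(votes).most_common(1)[0]
    return label, n_votes

# The example from the text: five CHAT votes and two NONCHAT votes.
votes = ["CHAT"] * 5 + ["NONCHAT"] * 2
label, n_votes = majority_label(votes)
```

The second return value corresponds to the vote count reported in the last column of Table 1.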

Data analysis
The construction process described above yielded a dataset made up of 4,833 CHAT and 10,327 NONCHAT utterances.
We investigated the annotation agreement among the crowd workers. Table 2 shows the distribution of the numbers of votes that the majority labels obtained. All seven workers agreed on 5,811 of the 15,160 utterances (38%). Also, at least six workers agreed in the majority of cases, namely 10,789 (= 4,978 + 5,811) utterances (71%). This indicates high agreement among the workers and the reliability of the annotation results.
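As a quick check of the arithmetic behind the reported agreement rates, using only the counts given above:

```latex
\[
\frac{5{,}811}{15{,}160} \approx 0.383,
\qquad
\frac{4{,}978 + 5{,}811}{15{,}160} = \frac{10{,}789}{15{,}160} \approx 0.712 .
\]
```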
During the data construction, we found that a typical confusing case arises when the utterance can be interpreted as an implicit information request. For example, the utterance "I am hungry" can be seen as the user trying to have a chat with the assistant, but it might be the case that she is looking for a local restaurant. Similar examples include "I have a backache." One solution in this case might be to ask the user a clarification question (Schlöder and Fernandez, 2015). Such an exploration is left for our future research.
Additionally, we manually classified the CHAT utterances according to their dialogue acts to figure out how real users have chats with the intelligent assistant (Table 3). The set of dialogue acts was designed by referring to Meguro et al. (2010). As shown in Table 3, while some of the utterances are boilerplate (e.g., those in the GREETING act) and thus have limited variety, the majority of the utterances exhibit tremendous diversity. We see a wide variety of topics including private issues (e.g., "I am free today") and questions to the assistant (e.g., "Are you angry?"). We even see a movie quote ("May the force be with you") and a rooster crow ("Cock-a-doodle-doo") in the MISC act. These clearly represent the open-domain nature of the user utterances in intelligent assistants.
Interestingly, some users curse at the intelligent assistant, probably because it failed to make appropriate responses (see the CURSE act). Although such user behavior would not be observed from paid research participants, we observe a certain amount of curse utterances in the real data.

Detection Method
We formulate chat detection as a binary classification problem to train supervised classifiers. In this section, we first explain the two types of classifiers explored in this paper, and then investigate the use of external resources for enhancing those classifiers.

Base classifiers
The first classifier utilizes SVM for its popularity and efficiency. It uses character and word n-gram (n = 1 and 2) features. It also uses word embedding features (Turian et al., 2010). A skip-gram model (Mikolov et al., 2013) is trained on the entire intelligent assistant log to learn word embeddings. The embeddings of the words in the utterance are then averaged to produce additional features.
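A minimal sketch of this feature extraction is given below. It is not the authors' implementation: the function names, the feature-name prefixes, the use of binary indicator values for n-grams, and the toy two-dimensional embeddings are all assumptions made for illustration. The utterance is assumed to be already tokenized into words.

```python
def char_ngrams(text, n):
    """All character n-grams of the given text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(words, n):
    """All word n-grams, joined with a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def extract_features(words, embeddings, dim):
    """Sparse feature dict: char/word 1- and 2-grams plus averaged word embeddings."""
    text = " ".join(words)
    features = {}
    for n in (1, 2):
        for gram in char_ngrams(text, n):
            features["c%d=%s" % (n, gram)] = 1.0  # assumption: binary indicators
        for gram in word_ngrams(words, n):
            features["w%d=%s" % (n, gram)] = 1.0
    # Average the embeddings of the words that have one; stays all-zero otherwise.
    known = [embeddings[w] for w in words if w in embeddings]
    average = [0.0] * dim
    for vector in known:
        for i, value in enumerate(vector):
            average[i] += value / len(known)
    for i, value in enumerate(average):
        features["emb%d" % i] = value
    return features

# Toy embeddings (hypothetical values); the paper uses 300-dimensional vectors.
toy_embeddings = {"set": [1.0, 0.0], "alarm": [0.0, 1.0]}
features = extract_features(["set", "an", "alarm"], toy_embeddings, dim=2)
```

The resulting dictionary can be converted to a sparse vector and passed to any linear SVM implementation.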
The second classifier uses a convolutional neural network (CNN) because it has recently proven to perform well on text classification problems (Kim, 2014; Johnson and Zhang, 2015a,b). We follow Kim (2014) to develop a simple CNN that has a single convolution and max-pooling layer followed by the soft-max layer. We use a rectified linear unit (ReLU) as the non-linear activation function. The same word embeddings as those used for SVM are used for pre-training.

Using external resources
We next investigate using external resources for enhancing the base classifiers. Thanks to the rapid evolution of the Web in the past decade, a variety of textual data, including not only conversational (i.e., chat-like) but also non-conversational ones, are abundantly available nowadays. These data offer an effective way of enhancing the base classifiers. We specifically use tweets and Web search queries as conversational and non-conversational text data, respectively.
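We train character-based language models on tweets and Web search queries, and use their scores (i.e., the normalized log probabilities of the utterance) as two additional features. Let u = c_1, c_2, ..., c_m be an utterance made up of m characters. Then, the score score_r(u) of the language model trained on the external resource r ∈ {tweet, query} is defined as

$$\mathrm{score}_r(u) = \frac{1}{m} \sum_{t=1}^{m} \log p_r(c_t \mid c_1, \ldots, c_{t-1}),$$

where p_r is the probability given by the language model trained on r. The GRU language model is adopted for its superior performance (Cho et al., 2014; Chung et al., 2014). Let x_t be the embedding of the t-th character and h_t be the t-th hidden state. Following the standard formulation of Cho et al. (2014), GRU computes the hidden state as

$$
\begin{aligned}
z_t &= \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right), \\
r_t &= \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(W x_t + U (r_t \odot h_{t-1})\right), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,
\end{aligned}
$$

where ⊙ is the element-wise multiplication, σ is the sigmoid, and tanh is the hyperbolic tangent. W, U, W^{(z)}, U^{(z)}, W^{(r)}, and U^{(r)} are weight matrices; here the superscript (r) refers to the reset gate r_t, not to the external resource r. The hidden states are fed to the soft-max to predict the next character.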
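To make the score concrete, the sketch below computes the length-normalized log probability of an utterance under a character language model. As a loudly stated assumption, a tiny add-one-smoothed character bigram model stands in for the GRU language model, purely so that the example is self-contained; the class `CharBigramLM`, the function `lm_score`, and the toy corpora are hypothetical.

```python
import math
from collections import Counter

class CharBigramLM:
    """Stand-in language model (NOT a GRU): add-one-smoothed character bigrams."""

    def __init__(self, corpus):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for line in corpus:
            prev = "<s>"  # start-of-utterance symbol
            for ch in line:
                self.bigrams[(prev, ch)] += 1
                self.contexts[prev] += 1
                self.vocab.add(ch)
                prev = ch

    def log_prob(self, prev, ch):
        # The extra +1 in the vocabulary size reserves mass for unseen characters.
        vocab_size = len(self.vocab) + 1
        return math.log((self.bigrams[(prev, ch)] + 1)
                        / (self.contexts[prev] + vocab_size))

def lm_score(lm, utterance):
    """Average per-character log probability, i.e., the length-normalized score."""
    if not utterance:
        raise ValueError("utterance must contain at least one character")
    total, prev = 0.0, "<s>"
    for ch in utterance:
        total += lm.log_prob(prev, ch)
        prev = ch
    return total / len(utterance)

# Toy stand-ins for the two external resources.
tweet_lm = CharBigramLM(["how are you", "are you angry", "i am hungry"])
query_lm = CharBigramLM(["tokyo weather", "weather tomorrow", "alarm clock"])
tweet_score = lm_score(tweet_lm, "are you")
query_score = lm_score(query_lm, "are you")
```

Dividing by the number of characters keeps scores comparable across utterances of different lengths; without it, longer utterances would always receive lower scores.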
We also use a binary feature indicating whether or not the utterance appears in the Web search query log. We observe that some NONCHAT utterances are made up of single entities such as location and product names. Such utterances are considered to be seeking information on those entities. We therefore use the query log as an entity dictionary to derive a feature indicating whether the utterance is likely to be a single entity.
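The binary feature amounts to a set-membership test against the query log, as in the following sketch. The function name and the normalization step (trimming whitespace and lowercasing) are assumptions; the text does not say whether any normalization is applied.

```python
def query_binary_feature(utterance, query_log):
    """Return 1.0 if the utterance appears verbatim in the query log, else 0.0."""
    # Assumption: light normalization before lookup.
    return 1.0 if utterance.strip().lower() in query_log else 0.0

# Toy query log stored as a set for constant-time lookup.
toy_query_log = {"tokyo tower", "iphone 7", "weather"}
is_entity_like = query_binary_feature("Tokyo Tower", toy_query_log)
is_chat_like = query_binary_feature("are you angry", toy_query_log)
```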
The resulting three features are incorporated into the SVM-based classifier straightforwardly (Figure 1). For the CNN-based classifier, they are provided as additional inputs to the soft-max layer (Figure 2).
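The way the three additional features enter the CNN can be sketched as a forward pass in plain Python: a single convolution with ReLU, max-over-time pooling, concatenation with the extra features, and a soft-max output. All weights, dimensions, and names below are hypothetical toy values, and training is omitted entirely.

```python
import math

def conv_max_pool(embedded, filters, width):
    """One convolution layer with ReLU followed by max-over-time pooling."""
    pooled = []
    for weights, bias in filters:
        best = 0.0  # ReLU outputs are non-negative; also covers too-short inputs
        for start in range(len(embedded) - width + 1):
            window = [v for vec in embedded[start:start + width] for v in vec]
            activation = sum(w * x for w, x in zip(weights, window)) + bias
            best = max(best, max(0.0, activation))
        pooled.append(best)
    return pooled

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedded, filters, width, extra, out_weights, out_bias):
    """Concatenate pooled CNN features with the extra features, then soft-max."""
    hidden = conv_max_pool(embedded, filters, width) + list(extra)
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(out_weights, out_bias)]
    return softmax(scores)

# Toy input: three words with 2-dimensional embeddings, two filters of width 2.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [([1.0, 0.0, 0.0, 1.0], 0.0), ([-1.0, 0.0, 0.0, -1.0], 0.0)]
extra = [-1.2, -3.4, 0.0]  # tweet LM score, query LM score, query binary feature
out_weights = [[0.5, 0.1, 1.0, -1.0, -2.0], [-0.5, -0.1, -1.0, 1.0, 2.0]]
out_bias = [0.0, 0.0]
pooled = conv_max_pool(embedded, filters, 2)
probs = classify(embedded, filters, 2, extra, out_weights, out_bias)
```

Feeding the extra features directly to the output layer leaves the convolutional part unchanged, so the same network can be trained with or without them.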

Experimental Results
We empirically evaluate the proposed methods on the chat detection dataset.

Experimental settings
We performed 10-fold cross validation on the chat detection dataset to train and evaluate the proposed classifiers. In each fold, we used 80%, 10%, and 10% of the data for training, development, and evaluation, respectively.
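One way to realize this 80/10/10 protocol is to hold out one of the ten folds for evaluation, a second one for development, and train on the remaining eight. The sketch below makes that assumption explicit; in particular, using the next fold as the development set is a hypothetical choice, and the data are assumed to be shuffled beforehand.

```python
def cross_validation_splits(n_items, n_folds=10):
    """Yield (train, dev, test) index lists: 8 folds, 1 fold, and 1 fold."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds  # assumption: the next fold serves as dev data
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for i in fold]
        yield train, folds[dev_k], folds[k]

splits = list(cross_validation_splits(15160))
```

With 15,160 utterances, each evaluation and development fold contains 1,516 utterances and each training portion contains 12,128.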
We used word2vec to learn 300-dimensional word embeddings. They were used to induce the additional 300 features for SVM. They were also used as the pre-trained word embeddings for CNN.
We used the faster-rnn toolkit to train the GRU language models. The sizes of the embedding and hidden layers were set to 256. Noise contrastive estimation (Gutmann and Hyvärinen, 2010) was used to train the soft-max function, and the number of noise samples was set to 50. Maximum entropy 4-gram models were also trained to yield a combined model (Mikolov et al., 2011).
The language models were trained on 100 million tweets collected between Apr. and July 2016 and 100 million Web search queries issued between Mar. and Jun. 2016. The tweets were sampled from those that received replies to collect only conversational tweets (Ritter et al., 2011). The same Web search queries were used to derive the binary feature. Although it is difficult to release those data, we plan to make the feature values available together with the benchmark dataset.

Baselines
The following baseline methods were implemented for comparison:
Majority: Utterances are always classified as the majority class, NONCHAT.
Tweet GRU: Utterances are classified as CHAT if the score of the GRU language model trained on the tweets exceeds a threshold. We used exactly the same GRU language model as the one that was used for deriving the feature. The threshold was calibrated on the development data by maximizing the F_1-score of the CHAT class.
In-house IA: Our in-house intelligent assistant system, which adopts a hybrid of rule-based and example-based approaches. Since we cannot disclose its technical details, the result is presented just for reference.
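The threshold calibration used by the Tweet GRU baseline can be sketched as an exhaustive search over candidate thresholds on the development data. The function names and the toy scores are hypothetical, and using the observed scores themselves as candidates is an assumption; the text only states that the F_1-score of the CHAT class is maximized.

```python
def chat_f1(gold, predicted):
    """F1-score of the CHAT class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == "CHAT" and p == "CHAT")
    fp = sum(1 for g, p in zip(gold, predicted) if g != "CHAT" and p == "CHAT")
    fn = sum(1 for g, p in zip(gold, predicted) if g == "CHAT" and p != "CHAT")
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(scores, gold):
    """Pick the threshold that maximizes CHAT F1 on the development data."""
    # Candidates: every observed score, plus one value below all of them
    # so that "classify everything as CHAT" is also considered.
    candidates = [min(scores) - 1.0] + sorted(set(scores))
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        predicted = ["CHAT" if s > threshold else "NONCHAT" for s in scores]
        f1 = chat_f1(gold, predicted)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Toy development data: language model scores and gold labels.
dev_scores = [-1.0, -1.5, -2.0, -3.0, -3.5]
dev_gold = ["CHAT", "CHAT", "NONCHAT", "CHAT", "NONCHAT"]
threshold, f1 = calibrate_threshold(dev_scores, dev_gold)
```

The calibrated threshold is then applied unchanged to the evaluation portion of the same fold.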

Result
Table 4 gives the precision, recall, F_1-score (for the CHAT class), and overall classification accuracy results. We report only accuracy for the Majority baseline. +embed. and +pre-train. represent using the word embedding features for SVM and the pre-trained word embeddings for CNN, respectively. +tweet-query represents using the three features derived from the tweets and Web search queries. Table 4 shows that both of the classifiers, SVM and CNN, perform accurately. We see that both +embed. and +pre-train. improve the results. The best performing method, SVM+embed.+tweet-query, achieves 92% accuracy and 87% F_1-score, outperforming all of the baselines. CNN performed worse than SVM, contrary to results reported by recent studies (Kim, 2014). We think this is because the architecture of our CNN is rather simple. It might be possible to improve the CNN-based classifier by adopting a more complex network, although this is likely to come at the cost of extra training time. Another reason would be that our SVM classifier uses carefully designed features beyond word 1-grams.
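For reference, the reported metrics are assumed here to follow their standard definitions, with CHAT as the positive class. Writing TP, FP, FN, and TN for the numbers of true positives, false positives, false negatives, and true negatives:

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}.
\]
```

Because F_1 is computed for the CHAT class only, it can differ noticeably from accuracy on a dataset in which NONCHAT utterances are roughly twice as frequent as CHAT ones.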
Table 4 also shows that the external resources are effective, improving F_1-scores by almost 1 point in both SVM and CNN. Table 5 illustrates example utterances and their language model scores. We see that the language models trained on the tweets and queries successfully provide the CHAT utterances with high and low scores, respectively. Table 6 shows chat detection results when each of the three features derived from the external resources is added to SVM+embed. The results show that they are all worse than SVM+embed.+tweet-query, and thus it is crucial to combine all of them to achieve the best performance.
Table 7 shows examples of feature weights of SVM+embed.+tweet-query. Tweet GRU and query GRU denote the language model score features. The others are word n-gram features. We see that the two language model scores have large positive and negative weights, respectively. This indicates the effectiveness of the language models. We also see that the first person has a large positive weight, while terms related to device control ("call to" and "volume") have large negative weights.
Table 8 shows chat detection results of SVM+embed.+tweet-query across the numbers of votes that the majority label obtained. As expected, we see that all metrics get higher as the level of agreement among the crowd workers becomes higher. In fact, we see as much as 98% accuracy when all seven workers agree. This implies that utterances easy for humans to classify are also easy for the classifiers.

Training data size
We next investigate the effect of the training data size on the classification accuracy.
Figure 3 shows the learning curves of the proposed methods. SVM+embed.+tweet-query trained on 25% of the training data is able to achieve accuracy comparable to that of SVM+embed. trained on the entire training data. This result suggests that the external resources are able to compensate for the scarcity of annotated data.

Utterance length
We finally investigate how the utterance length correlates with the classification accuracy. Figure 4 reveals that the difference between the two proposed methods is evident in short utterances (i.e., at most 5 characters). This is because those utterances are too short to contain sufficient information required for classification, and the additional features are helpful. We note that the Japanese writing system uses ideograms, and thus even five characters are enough to represent a simple sentence.
We also see a clear difference in longer utterances (i.e., at least 15 characters). We consider those long utterances difficult to classify because some words in the utterances are irrelevant to the classification, and the n-gram and embedding features include those irrelevant ones. On the other hand, we consider that the language model scores are good at capturing stylistic information irrespective of the utterance length.
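A per-length breakdown of this kind can be computed by bucketing utterances by their length in characters, as sketched below. The bucket width of five characters and the function name are assumptions; the exact bucketing used for Figure 4 is not stated in the text.

```python
from collections import defaultdict

def accuracy_by_length(utterances, gold, predicted, bucket_size=5):
    """Map each length bucket (keyed by its upper bound) to classification accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for utterance, g, p in zip(utterances, gold, predicted):
        # Lengths 1-5 fall into bucket 5, lengths 6-10 into bucket 10, and so on.
        bucket = ((len(utterance) - 1) // bucket_size + 1) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(g == p)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}

# Toy example with hypothetical predictions.
toy_utterances = ["hi", "hello", "set alarm", "what is your hobby"]
toy_gold = ["CHAT", "CHAT", "NONCHAT", "CHAT"]
toy_predicted = ["CHAT", "NONCHAT", "NONCHAT", "CHAT"]
by_length = accuracy_by_length(toy_utterances, toy_gold, toy_predicted)
```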

Future Work
As discussed in Section 3.2, some user utterances such as "I am hungry" are ambiguous in nature and thus are difficult to handle in the current framework. An important direction for future work is to develop a sophisticated dialogue manager to handle such utterances, for example, by asking clarification questions (Schlöder and Fernandez, 2015).
We manually investigated the dialogue acts in the chat detection dataset (cf. Section 3.2). It would be interesting to automatically determine the dialogue acts to help produce appropriate system responses. Some related studies exist in this research direction (Meguro et al., 2010).
Although we used only text data to perform chat detection, we can also utilize contextual information such as the previous utterances (Xu and Sarikaya, 2014), the acoustic information (Jiang et al., 2015), and the user profile (Sano et al., 2016). Using such contextual information beyond text is an interesting research topic. A neural network is a promising means of integrating such heterogeneous information.
Automatic speech recognition (ASR) errors are a common problem in SDS, and previous studies have proposed sophisticated techniques, including re-ranking (Morbini et al., 2012) and POMDP (Williams and Young, 2007), for addressing them. Incorporating these techniques into our methods is also important future work.
Although the studies on non-task-oriented SDS have made substantial progress in the past few years, it unfortunately remains difficult for the systems to chat fluently with users (Higashinaka et al., 2015). Further efforts to improve non-task-oriented dialogue systems are important future work.

Conclusion
This paper investigated chat detection for combining domain-specific task-oriented SDS and open-domain non-task-oriented SDS. To address the scarcity of benchmark datasets for this task, we constructed a new benchmark dataset from the real log data of a commercial intelligent assistant. In addition, we investigated using external resources, namely tweets and Web search queries, to handle open-domain user utterances, which characterize the task of chat detection. The experiments demonstrated that off-the-shelf supervised methods augmented with the external resources perform accurately, outperforming the baseline approaches. We hope that this study contributes to removing the long-standing boundary between task-oriented and non-task-oriented SDS.
To facilitate future research, we are going to release the dataset together with the feature values derived from the tweets and Web search queries.

Figure 1: Feature vector representation of the example utterance "Today's weather." The upper three parts of the vector represent the features described in Section 4.1 (character n-gram, word n-gram, and average of the word embeddings). The three additional features explained in Section 4.2 are added as two real-valued features (Tweet GRU and Query GRU) and one binary feature (Query binary).

Figure 2: Architecture of our CNN-based classifier when the input utterance is "Today's weather." The output layer of CNN and the three additional features explained in Section 4.2 are concatenated. The resulting vector is fed to the soft-max function.

Figure 3: Learning curve of the proposed methods. The horizontal axis represents what percentage of the training portion is used in each fold of the cross validation. The vertical axis represents the classification accuracy.
Figure 4 illustrates the classification accuracies of SVM+embed. and SVM+embed.+tweet-query for each utterance length in the number of characters.

Figure 4: Classification accuracy across utterance lengths in the number of characters.

Table 1: Example utterances and the numbers of votes. NONCHAT utterances are further divided into information seeking (top) and device control (bottom) to facilitate readers' understanding.

Table 2: Distribution of the numbers of votes.

Table 3: Distribution over dialogue acts and example utterances.

Table 4: Chat detection results.

Table 5: Examples of the language model scores. The first two columns represent the scores provided by the GRU language models trained on the tweets and Web search queries, respectively. The third and fourth columns represent the label and utterance.

Table 6: Effect of the three features derived from the tweets and Web search queries.

Table 8: Chat detection results across the numbers of votes that the majority label obtained.