When and Who? Conversation Transition Based on Bot-Agent Symbiosis Learning Network

In online customer service applications, multiple chatbots specialized in various topics are typically developed separately and then merged with human agents onto a single platform, presenting a unified interface to the users. Ideally, the conversation can be transparently transferred between different sources of customer support so that domain-specific questions are answered in a timely manner; this is what we coin a Bot-Agent symbiosis. Conversation transition is a major challenge in such online customer service. Our work formalises the challenge as two core problems, namely when to transfer and which bot or agent to transfer to, and introduces a deep-neural-network-based approach that addresses them. Inspired by the net promoter score (NPS), our research reveals how the problems can be effectively solved by collecting user feedback and developing deep neural networks that predict the conversation category distribution and the NPS of the dialogues. Experiments on realistic data generated from an online service support platform demonstrate that the proposed approach outperforms state-of-the-art methods and shows promising prospects for transparent conversation transition.


Introduction
Recent years have witnessed a plethora of chatbot-based online customer services (Oraby et al., 2017; Xu et al., 2017; Jiang et al., 2019). Although current chatbots are far from perfect, they are playing increasingly large roles thanks to more advanced natural language processing (NLP) capabilities, and can now respond properly to certain domain-specific problems and handle some basic housekeeping tasks before a human agent gets involved. It is also common that customer support is sourced to multiple agents and bots whose skill sets may vary greatly. Conventionally, users are first greeted by a porter bot and are then transferred to a human agent if they are not happy with the bot's service. The human agent is usually assigned randomly based on availability rather than skill set, and may refer the conversation to another, more specialized agent if necessary. Therefore the users may have to navigate among different sources of customer support explicitly, or be guided by a customer support, either a bot or a human agent, whose job is quite similar to that of a switchboard operator, before they can receive the corresponding customer service.
However, the aforementioned conversation transition could be largely automated so that the users may not even be aware that the underlying customer support has been switched, i.e., the conversation transition is transparent to the users. The conversation can also be directed to the customer support with the correct domain knowledge, given that the questions raised by the users can be understood automatically. This hybrid system is referred to as a Bot-Agent Symbiosis online customer service, or a Symbiosis for short. In the Symbiosis, the users only need to interact with the system as a whole, instead of finding their own way by speaking to various sources. See Figure 1 for the comparison between a traditional bot-agent hybrid system and a symbiosis.
In our own practice of building towards the symbiosis, we identify two critical questions that are not well answered by current NLP techniques, namely when the transition should occur and who the conversation session should be transferred to. The who problem is considered more difficult than the when problem, as the former requires a broader understanding of agent profiles while the latter involves only the current dialogue. In this work, we propose a framework that utilizes historical user feedback to handle these two problems in one go. The essential idea is to develop a result-driven symbiosis that attempts to maximize user satisfaction: when the user is predicted to be upset by the current customer support, the transition is triggered and another (human or bot) agent who is expected to be good at handling the current dialogue is automatically assigned to take over the session.

Related Works
Different from call center scheduling, which aims to make the calling operation efficient and productive on the basis of knowledge of agents and their schedules (Aksin et al., 2007; Fukunaga et al., 2002; Hashemi et al., 2018; Kiseleva et al., 2016; Sano et al., 2016), the bot-agent symbiosis is a relatively new concept, and conversation transition is a new problem owing to the emergence of chatbot-based customer services. As briefly touched upon in the last section, conversation transition contains the when question, which can be treated as predicting the satisfaction level in real time, and the who question, which can be treated as understanding what kind of skills are most needed from the agent given the current session.
Predicting the satisfaction level of a dialogue can be considered a scalar regression problem, while understanding the required skills is a typical text classification problem. However, because current NLP datasets are seldom annotated with satisfaction levels, the regression problem is little discussed. The traditional method of assessing the satisfaction level is to gather feedback from the customer via a survey after the dialogue ends. Some works have tried to obtain real-time feedback through sentiment analysis of the utterances (Bertero et al., 2016; Acosta and Ward, 2011). However, our work shows that sentiment information is not enough to convey the satisfaction level.
Most existing text classification research focuses on monologue text (Zhou et al., 2015; Zhou et al., 2016; Liu et al., 2019) rather than multi-turn conversations between multiple speakers. Compared to text classification, multi-turn conversation classification is more complicated due to its interactive nature. Related works include dialogue act detection in meetings, customer service, online forums, etc. Previous works (Khanpour et al., 2016; Oraby et al., 2017) commonly aim to understand how utterances from multiple speakers relate to the roles of the speakers. To the best of our knowledge, our work is the first to formalize the conversation transition problem and the first to attempt it with a satisfaction-driven approach.

Bot-Agent Symbiosis Conversation Transition
User satisfaction feedback is crucial to most symbiosis designs, as the ultimate goal of the symbiosis parallels maximizing user satisfaction. Many existing systems do employ similar feedback mechanisms, yet how exactly the feedback can be harnessed to improve the customer service is not well understood. Supposing that the users seek various kinds of support requiring different skill sets, and that there are ways for the users to submit their feedback in the form of a score or indicator, we propose the following framework for building a Bot-Agent symbiosis.
• Build skill set profiles Z = {z_1, z_2, ..., z_q} for every bot agent and human agent to evaluate whether a specific customer support is good at answering certain questions;
• Evaluate how satisfied the users are with the current support provided;
• Evaluate the type of customer support that is requested by the users;
• Monitor the predicted satisfaction, trigger a conversation transition when it drops below a certain threshold, and match the current session to the best support available based on the profiles and the predicted type of the user question.
Here we formalize the type classification and satisfaction level prediction problems, starting with notation. A conversation, or dialogue, is composed of multiple utterances from two speakers, namely the user and the customer support, denoted by A and B respectively. A conversation is denoted by the set D_L = {A_{u_1}, B_{u_1}, ..., A_{u_m}, B_{u_n}}, where m and n are the numbers of utterances owned by the user and the customer support respectively, and L = m + n is the total number of utterances. Without loss of generality, the conversation is tagged with a type label c_i; there are k types in total, so the set of all types is C = {c_1, c_2, ..., c_k}. It may also be annotated with a scalar satisfaction indicator p, where a larger p implies better service. Our model aims to predict the type c_i and the satisfaction indicator p when fresh, unlabeled utterances of a conversation are provided. The predictions of the type and the satisfaction indicator can be produced by training two separate yet quite similar deep networks on a corpus of previously labeled dialogues. The when question is then decided by a function H(p, l), and the who question is answered by another function E(ĉ, Z, l). We call these the when and who transition functions; they can be decided separately, and we show our specific implementation in the following sections.

Figure 2 shows the overall architecture of our proposed model. The entire network stacks up three different layers: first a convolutional neural network, second a bidirectional recurrent neural network, and finally an attention layer. Each layer has a unique role in transforming the plain text into either the type label or the regressed satisfaction level. We detail how each layer works in the following subsections.
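For concreteness, the notation above can be mirrored as a small data structure; the following is a minimal sketch (the class and field names are ours, not from the system):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Utterance:
    speaker: str   # "A" (user) or "B" (customer support)
    text: str

@dataclass
class Dialogue:
    # D_L = {A_u1, B_u1, ..., A_um, B_un}; L = m + n utterances in total
    utterances: List[Utterance] = field(default_factory=list)
    type_label: Optional[int] = None   # index into C = {c_1, ..., c_k}
    nps: Optional[float] = None        # satisfaction indicator p (0-10 scale)

    @property
    def length(self) -> int:
        """Number of utterances seen so far, i.e., the l fed to H(p, l)."""
        return len(self.utterances)

d = Dialogue()
d.utterances.append(Utterance("A", "I cannot mark my course as complete."))
d.utterances.append(Utterance("B", "Let me check that for you."))
```

Both networks consume the same `Dialogue` object; only the label they are trained against (type versus NPS) differs.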

CNN
We first pad each utterance to length l, the maximum utterance length supported by the system, and then pad each dialogue to length g, the maximum dialogue length supported by the system, so that all dialogues can be represented in a fixed-size form {x_1, x_2, ..., x_n}, where n = l × g. Let e_i be the d-dimensional word vector of the i-th word in a dialogue. Via word embedding, a dialogue is in turn represented as {e_1, e_2, ..., e_n}. A convolution operation involves a filter ω ∈ R^{r×d} and a bias term b ∈ R, which are applied to r (1 ≤ r ≤ n) consecutive words to produce a new value o_j ∈ R (Kim, 2014): o_j = ω · e_{j:j+r−1} + b. The filter convolves along the height of the dialogue with stride one to produce an output sequence, the feature map f = [o_1, o_2, ..., o_{n−r+1}]. We then apply a max-over-time pooling operation over the feature map and take the maximum value f̂ = max{f} as the feature corresponding to this particular filter (Collobert et al., 2011). A dropout layer is added at the end to prevent the network from over-fitting and produces the CNN output s = [s_1, s_2, ..., s_T]. The essential idea of this layer is to capture the most dominant features in a single utterance; note that multiple filters with different window sizes help to learn across diverse phrase structures.
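The convolution and max-over-time pooling described above can be sketched for a single filter in plain NumPy (an illustration, not the actual implementation; the function name, ReLU activation, and toy sizes are ours):

```python
import numpy as np

def conv_max_pool(E, W, b):
    """One CNN filter over a padded dialogue, as described above.

    E: (n, d) word-embedding matrix of the dialogue.
    W: (r, d) filter applied to r consecutive words.
    b: scalar bias term.
    Returns the max-over-time pooled feature f_hat = max{f}.
    """
    n, _ = E.shape
    r = W.shape[0]
    # Convolve with stride 1: o_j = relu(W . E[j:j+r] + b), j = 0..n-r
    f = np.array([np.maximum(np.sum(W * E[j:j + r]) + b, 0.0)
                  for j in range(n - r + 1)])
    return f.max()

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))     # toy dialogue: n = 10 words, d = 4
W = rng.normal(size=(3, 4))      # filter window r = 3
feat = conv_max_pool(E, W, 0.1)  # one scalar per filter
```

In the full model this is repeated for every filter (windows 3, 4 and 5 with 32 feature maps each, per the training configuration), and the pooled features are concatenated into s.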

BiRNN
Given that the input sequence s runs from the first symbol s_1 to the last one s_T, we use a bidirectional RNN to abstract the representations of the symbols by summarizing information from both directions. The BiRNN contains a forward RNN, which reads the symbols from s_1 to s_T, and a backward RNN, which reads from s_T to s_1.
By concatenating the forward hidden state →h_t and the backward one ←h_t, we obtain the annotation h_t = [→h_t; ←h_t] for each symbol, and the output of the BiRNN layer is h = [h_1, h_2, ..., h_T]. This layer attempts to correlate a single utterance with its context, which is especially helpful for multi-turn dialogues.
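The bidirectional pass and the concatenation can be sketched in NumPy as follows (a toy tanh RNN stands in for the actual recurrent cell; all names and sizes are ours):

```python
import numpy as np

def rnn(seq, Wx, Wh, bh):
    """Simple tanh RNN over seq (T, d_in); returns hidden states (T, d_h)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for t in range(seq.shape[0]):
        h = np.tanh(seq[t] @ Wx + h @ Wh + bh)
        out.append(h)
    return np.stack(out)

def birnn(seq, params_fwd, params_bwd):
    """h_t = [fwd_t ; bwd_t]: run forward and backward, realign, concatenate."""
    h_fwd = rnn(seq, *params_fwd)
    h_bwd = rnn(seq[::-1], *params_bwd)[::-1]   # read s_T..s_1, realign to t
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(1)
T, d_in, d_h = 5, 8, 6
s = rng.normal(size=(T, d_in))
pf = (rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
pb = (rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
h = birnn(s, pf, pb)   # (T, 2*d_h): each row is [→h_t ; ←h_t]
```

The key point is the `[::-1]` realignment: the backward states are produced in reverse reading order and must be flipped back so that each h_t pairs states for the same symbol.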

Attention
We adopt the attention mechanism of (Yang et al., 2016) to allocate attention weights α_t to the outputs h_t of the forward layer and the backward layer of the BiRNN respectively: u_t = tanh(W_a h_t + b_a) and α_t = exp(u_t⊤ u_s) / Σ_{t'} exp(u_{t'}⊤ u_s). Here u_s is a context vector with a predefined length and random initial values that measures the importance of each h_t.
According to the above equations, the new representations of the forward layer (→S_α) and the backward layer (←S_α) are computed as the attention-weighted sums of the corresponding hidden states, e.g., →S_α = Σ_t α_t →h_t. These two vectors are concatenated to produce the attentive representation S_α of the dialogue. This layer further rewards utterance representations that are critical to correctly classifying a dialogue or predicting the satisfaction level.
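The attentive pooling can be sketched in NumPy as follows (applied here to a single direction's outputs; the weight names and sizes are ours):

```python
import numpy as np

def attention(H, u_s, Wa, ba):
    """Attentive pooling over RNN outputs H (T, d_h).

    u_t = tanh(Wa h_t + ba);  alpha_t = softmax(u_t . u_s);
    S   = sum_t alpha_t h_t  (the attention-weighted representation).
    """
    U = np.tanh(H @ Wa + ba)        # (T, d_a) projected hidden states
    scores = U @ u_s                # (T,) importance w.r.t. context vector
    scores -= scores.max()          # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha         # pooled (d_h,), weights (T,)

rng = np.random.default_rng(2)
T, d_h, d_a = 6, 4, 3
H = rng.normal(size=(T, d_h))
S, alpha = attention(H, rng.normal(size=d_a),
                     rng.normal(size=(d_h, d_a)), np.zeros(d_a))
```

Running this once per direction and concatenating the two pooled vectors yields the attentive dialogue representation S_α described above.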
The final fully connected layer is where we differentiate between the regression and classification problems. With the inputs from the attention layer, we devise a fully connected layer with k outputs for classification or a single output for regression. Specifically, an additional softmax layer is attached for the classification problem. We then train the network weights by optimizing either the cross-entropy loss or the mean squared error accordingly.

Dataset
To evaluate our methods, we use a dataset generated by an actual online support service, involving only human agents, that solves online learning and training related problems; it will be referred to as the CSD (customer service dialog) dataset hereafter. The customer service is dedicated to a corporation's online learning portal that offers personalized training resources to its employees. The CSD dataset contains in total 15,356 sessions of dialogues; the number of utterances per dialogue varies from 1 to 210, and the number of words per utterance varies from 1 to 1,006. All dialogues took place between April 2017 and January 2018. The anonymized dataset is available on GitHub.
The CSD dataset has several unique features not shared by other existing public datasets. Firstly, around half of the dialogues (6,997) are labeled with a Net Promoter Score (NPS) by the user. NPS is a 0-to-10 indicator that serves as user satisfaction feedback; albeit somewhat subjective, it can in general reflect whether the customer service is sufficiently helpful. Secondly, the dataset is also task-oriented labeled by the human agents. There are 8 types (k = 8) in total, and the type distribution is: course completion (3,734), finding contents (2,435), general help (1,778), technical issue (1,812), account issue (644), following up (232), undirected (3,915), and other (806). Unlike most other public datasets, which are classified based on a specific domain or speech act, these classes are designed to facilitate further customer support. Thirdly, the dataset involves 84 agents, so we can easily build a classification-based mean NPS profile for each agent. We observe a huge discrepancy among the skill sets and are convinced that this is common and can be exploited by our symbiosis methods.

Baselines
Our approach is compared with several baseline methods, including convolutional neural networks (CNN) (Kim, 2014), long short-term memory networks (LSTM), bidirectional long short-term memory networks (BiLSTM), a cascade of convolutional neural networks and long short-term memory networks (CNN-LSTM) (Zhou et al., 2015), a cascade of convolutional neural networks and bidirectional long short-term memory networks (CNN-BiLSTM), two bidirectional neural networks both with an attention mechanism (HAN) (Yang et al., 2016), and the bidirectional encoder representations from transformers model (BERT) (Devlin et al., 2019).

Model Configurations
Several word vectors and embedding methods are also compared. The word vectors include fastText (Joulin et al., 2016), Word2Vec with the Skip-gram and CBOW architectures (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and predictive text embedding (PTE), which utilizes both labeled and unlabeled information in the training corpus to learn word embeddings (Tang et al., 2015). The embedding methods include Rand-Static (all words are randomly initialized and then kept static), Rand-Dynamic (all words are randomly initialized and then fine-tuned), Embedding-Static (all words are initialized with pre-trained word vectors and then kept static), Embedding-Dynamic (all words are initialized with pre-trained word vectors and then fine-tuned), and Embedding-2C (a model with two channels of word vectors, in which one channel is fine-tuned while the other is kept static).

Hyperparameters and Training
The dataset is split into training, validation and testing sets with ratios of 0.8, 0.1 and 0.1, five times, and each set preserves the percentage of samples of each conversation type. For the BERT model, the best epoch number determined by experiments is 5, and we use the "BERT-Base, Uncased" pre-trained model for fine-tuning. For the other experiments we use the following configuration: 20 epochs, mini-batch size of 32, rectified linear units as the CNN activation function, filter windows of 3, 4 and 5 with 32 feature maps each, dropout rate of 0.5, embedding size of 128, 300 hidden units, max pooling size of 4, and attention size of 100 for the HAN model and our model. The hyperparameters of the models are tuned on the validation set, with early stopping when Micro F1 does not increase over 1000 training batches. We use the RMSprop algorithm (Tieleman and Hinton, 2012) to train all models, with a learning rate of 0.001 and a discounting factor of 0.9. The mean result over the five splits is used as the final result. The source code of our experiments will be made publicly available upon acceptance.
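The configuration above can be collected into a single dictionary for reference (the dictionary keys are our naming; the values are exactly those stated):

```python
# Training configuration for the HAN model and our model, as listed above.
CONFIG = {
    "epochs": 20,
    "batch_size": 32,
    "cnn_activation": "relu",            # rectified linear units
    "filter_windows": [3, 4, 5],
    "feature_maps_per_window": 32,
    "dropout_rate": 0.5,
    "embedding_size": 128,
    "hidden_units": 300,
    "max_pooling_size": 4,
    "attention_size": 100,
    "optimizer": "rmsprop",
    "learning_rate": 0.001,
    "discounting_factor": 0.9,
    "early_stopping_batches": 1000,      # stop if Micro F1 stalls this long
    "split_ratio": (0.8, 0.1, 0.1),      # train / validation / test
    "num_splits": 5,                     # stratified by conversation type
}
```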

Results and Discussion
In this section, we first compare our model with the baseline models on the task of dialogue classification. We then initialize our model with different word vectors and embedding methods, and discuss the NPS estimated by our model. To examine how well text features can be learned by our model, we plot and analyze the features for a few example cases. Finally, we present a straightforward Bot-Agent symbiosis conversation transition policy based on how the metrics change with the conversation length.

Model Comparison
Results of our model against the other methods are listed in Table 1. Except for BERT, the words fed to the models are all represented by randomly initialized vectors, and the hyperparameters used to train the models are all the same. As can be seen from the table, our model has a better Micro F1 and a slightly worse Macro F1 than BERT. Because Micro F1 is commonly considered a better metric for imbalanced datasets, and training the BERT model pre-trained on large public corpora with GPUs is fairly expensive (Devlin et al., 2019), our model is a better choice for practical, resource-constrained or time-constrained applications. Moreover, our model significantly outperforms the other homogeneous models in terms of both the Micro F1 and Macro F1 metrics, and is better than the other mixed models as well.
We attribute the performance of our model to its carefully designed three-layer structure, where each layer has a unique role. The CNN layer is responsible for capturing both the local and global context of the multi-turn conversation, while the BiRNN layer is responsible for relating utterances from both the users and the agents. CNN or BiRNN alone is clearly not enough, as indicated by the outcomes, and we believe this is because the satisfaction level changes dynamically as the conversation unfolds. The attention layer prevents the previous two layers from overfitting by directing the learning toward the more important parts of the utterances. The order of these three layers is important as well: each layer concentrates lower-level features into higher abstractions, and reversing the order leads to a significant loss of context. The results of reordered variants are not comparable to those presented and are therefore omitted.

Configuration Comparison
Initializing word vectors with those obtained from an unsupervised neural language model, and fine-tuning them during the training phase, are popular ways to improve performance when a large supervised training set is inaccessible (Kim, 2014). We first initialize our model with word vectors trained by several classical methods and compare their performance on dialogue classification; we then investigate the best way to feed the word vectors to our model. Table 2 shows that initializing the model with pre-trained word vectors increases the metrics and decreases the average number of batches needed for training. The results show that the unlabeled conversation corpus generated by a customer service system contains rich contextual information and should be exploited in a Bot-Agent symbiosis transition system. Also, from Table 3 we can see that the best embedding methods in our case are dynamic embedding and two-channel embedding.

NPS
Estimating the NPS at the dialogue level is a challenging task, as NPS labels are generated in a rather subjective, if not completely arbitrary, fashion, and the conversation itself does not necessarily capture all the information needed to predict the score. For example, we found that the NPS is only weakly correlated (Pearson correlation coefficient r = 0.117) with the sentiment score given by the IBM Watson NLP API, which is counter-intuitive. However, we managed to achieve an overall root-mean-square error (RMSE) of around 3.0 using other deep neural networks, and our proposed model with the optimal configuration used for classification again stands out as the best, with an RMSE of 2.13, a salient improvement over the baseline RMSE of 5.3 from random guessing. The exact threshold for deciding when to transfer the conversation may be application specific, but according to our observations an RMSE of around 2 filters out most of the unsatisfactory chats.

Visualization
To examine how good the features learned by our model are, we analyzed the features for many cases; here we showcase two intuitive examples in the fashion of visualized feature vectors (Mishra et al., 2017; Li et al., 2016). Figure 3a presents a test case for the task of dialogue classification as the dialogue progresses. Our model successfully recognizes the correct type from the second utterance onwards, as the user explicitly asks about a course-related problem; as outlined by the red box, the second utterance exhibits a unique learned feature that corresponds to the correct category label. Figure 3b presents another example, for NPS prediction. The two red boxes again demonstrate the ability of our model to capture the unique features most relevant to deciding the NPS. Notice that between the fourth and fifth utterances the agent made the user wait for some time, and the ensuing apology from the agent directly results in a drop in the predicted NPS. Similarly, in the eleventh utterance, where the user replies and acknowledges the help from the agent, the predicted NPS is dramatically raised to another level. In the end the NPS saturates around 9 as the user shows gratitude to the agent. These results show that our model correctly predicts the type and the satisfaction score dynamically as the conversation progresses.

A Straightforward Transition Policy
Premature transition may cause inadequate understanding of the question, and late transition may wear out the user's patience. We therefore analyzed how our models perform with reference to the conversation length, i.e., the number of utterances. As can be seen from Figure 4, the dialogue type and NPS predictions are not very stable for the first few utterances, and the performance saturates at around 10 and 55 utterances for dialogue type and NPS respectively. The dialogue type stabilizes faster than the NPS, as the type normally remains unchanged throughout the dialogue while the NPS is highly susceptible to single-utterance disturbances. Therefore the conversation session should be protected for the first few turns; empirically we set the protection window to 8 turns, since we want the NPS to remain sensitive. Furthermore, the profile of an agent can be represented as the mean NPS for each dialogue type, i.e., z_i = {g_{c_1}, g_{c_2}, ..., g_{c_8}}.
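Building the per-type mean-NPS profile z_i from historically labeled dialogues can be sketched as follows (function and variable names are ours):

```python
from collections import defaultdict

def build_profiles(history):
    """Agent profile z_i = mean NPS per dialogue type, from labeled history.

    history: iterable of (agent_id, dialogue_type, nps) triples.
    Returns {agent_id: {dialogue_type: mean_nps}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for agent, dtype, nps in history:
        sums[agent][dtype] += nps
        counts[agent][dtype] += 1
    return {a: {t: sums[a][t] / counts[a][t] for t in sums[a]} for a in sums}

# Toy history: two agents, two of the eight dialogue types.
Z = build_profiles([
    ("alice", "technical issue", 9), ("alice", "technical issue", 7),
    ("alice", "account issue", 4),   ("bob",   "account issue", 10),
])
```

In practice each z_i would cover all eight types (with some smoothing or fallback for types an agent has never handled), but the averaging itself is exactly this simple.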
The when transition function is then given by Equation 7:

H(p, l) = 1 if p < 4.0 and l > 8, and H(p, l) = 0 otherwise. (7)

When H = 1 the transition takes place; otherwise the session remains with the current support. Here 4.0 is the minimum NPS our implementation can tolerate before transition, and the condition l > 8 enforces the protection window discussed above.

The who transition function, effectively the function E mentioned before, is given by Equation 8:

E(ĉ, Z, l) = argmax_i z_i · softmax(μ), (8)

where μ is the input vector to the final softmax layer of the dialogue type network, so that ĉ = softmax(μ) is the predicted type distribution and the selected agent maximizes the expected NPS under that distribution. Meanwhile, we preserve the common design that a user can request a human agent for help when necessary.
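Under one plausible reading of these transition functions, with the 4.0 NPS threshold and 8-turn protection window stated above, the policy can be sketched as follows (the exact functional forms, thresholds as constants, and all names are our reconstruction):

```python
import numpy as np

NPS_THRESHOLD = 4.0   # minimum tolerable predicted NPS (from the text)
PROTECTED_TURNS = 8   # the session is protected for the first turns

def when_transfer(p_hat, length):
    """H(p, l): trigger a transition when the predicted NPS drops below
    the threshold, but never within the protected opening turns."""
    return int(p_hat < NPS_THRESHOLD and length > PROTECTED_TURNS)

def who_transfer(mu, Z):
    """E(c_hat, Z, l): pick the agent whose per-type mean-NPS profile
    best matches the predicted type distribution softmax(mu)."""
    c_hat = np.exp(mu - mu.max())
    c_hat /= c_hat.sum()                    # predicted type distribution
    agents, profiles = zip(*Z.items())
    scores = np.array(profiles) @ c_hat     # expected NPS per candidate
    return agents[int(np.argmax(scores))]

# Toy profiles over two dialogue types; mu favors the second type.
Z = {"bot_courses": [9.0, 3.0], "agent_tech": [5.0, 8.5]}
choice = who_transfer(np.array([0.2, 2.0]), Z)
```

The two functions are deliberately decoupled: `when_transfer` only watches the predicted satisfaction, while `who_transfer` only consults the profiles and the predicted type distribution, so either can be replaced independently.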

Conclusion and Future Work
In this paper, we formalize a new research direction of conversation transition and put forward an elementary framework for building towards the Bot-Agent symbiosis. Our study based on the CSD dataset shows that the CNN-BiRNN-Attention neural network delivers promising results on the when and who problems. Furthermore, we argue that the incorporation of user satisfaction feedback is vital to current dialogue system designs, since it offers a very strong metric to learn from. In the future, we will mainly focus on actually implementing a hybrid system that allows complicated interactions between the users, agents and bots, in the hope of collecting realistic data from the hybrid platform. We will also test how the transparent transition affects the satisfaction level, and ultimately bring the symbiosis to reality.