Conversational Document Prediction to Assist Customer Care Agents

A frequent pattern in customer care conversations is agents responding with appropriate webpage URLs that address users' needs. We study the task of predicting the documents that customer care agents can use to facilitate users' needs. We also introduce a new public dataset which supports the aforementioned problem. Using this dataset and two others, we investigate state-of-the-art deep learning (DL) and information retrieval (IR) models for the task. Additionally, we analyze the practicality of such systems in terms of inference time complexity. Our results show that a hybrid IR+DL approach provides the best of both worlds.


Introduction
Customer care (CC) agents play a crucial role as an organization's main representatives to the public. Our work is motivated by the observation that, in many conversations between CC agents and users, the former tend to provide links to documents that may help resolve user issues. This is a prevalent pattern, found in around 5-9% of all customer care conversations across the multiple domains we have reviewed. To identify such documents, agents manually extract keywords from the conversation and search over their customer service knowledge base (Habibi and Popescu-Belis, 2015; Ferreira et al., 2019). Table 1 shows a conversation where the agent provides a URL to the user.
Table 1: An example conversation where the agent provides a URL.
U: My virtual keyboard seems to float in the screen. Not sure how to undo what I just did. Can you help me please?
A: We're happy to help. To start, let us know which device you're working with, and the OS version installed on it.
U: It is an iPad.
A: Ok, to check the version, tap Settings > General > About.
U: It's iPad 4, 11 inch - model A1934.
A: Thank you. This article can help with how to merge a split keyboard and move the keyboard for an iPad: https://support.apple.com/en-us/HT207521. Let me know if this helps.

Although responding with URLs is a common pattern, automating this process to aid the agents remains underexplored in the literature. This task of Conversational Document Prediction (CDP) can be viewed as a conversational search problem, where the entire conversation context or a subset of it could be used as the query for retrieving matching documents. Compared to ad hoc retrieval settings, with a conversational interface the agent/system can ask clarification questions and interactively modify the search results as the conversation progresses (Zhang et al., 2018; Aliannejadi et al., 2019).
The CDP task has been primarily addressed so far using "traditional" information retrieval (IR) techniques. Habibi and Popescu-Belis (2015) proposed a document recommender system by extracting keywords from a conversation using topic modeling techniques. Ferreira et al. (2019) have used a similar keyword extraction framework and reported their results on a proprietary dataset.
Many aspects of IR systems have undergone a revolution with the advent of powerful Deep Learning (DL) techniques in recent years (Mitra et al., 2018; Yang et al., 2019). Yet this superior performance comes with a high demand for computational resources as well as longer inference times, which hinders the application of DL models in real-world IR systems. Thus, attention has focused on techniques that reduce computational complexity at run-time without hurting performance (Reimers and Gurevych, 2019; Lu et al., 2020).
In this work, we formulate the CDP task to support CC agents. We further release a new public dataset that enables research on this task, and investigate the performance of state-of-the-art DL and IR models side-by-side on a number of datasets. We also analyze the runtime complexity of such systems, and propose a hybrid solution that is applicable in real-life systems.

Data
We explore the CDP task using three datasets which contain human-to-human conversations between users and CC agents. Two of these datasets are internal: one from an internal customer support service on Mac devices (Mac-Support) and another from an external client in the telecommunication domain (Telco-Support). We also release a new Twitter dataset, containing conversations between users and CC agents from 25 organizations on the Twitter platform. We summarize the statistics of the three datasets in Table 2. For our internal datasets, we filter out dialogs where: a) the agent doesn't provide a URL to the user, b) the URL is not in-domain (e.g., Google searches, Microsoft forums, etc.), thereby focusing on URLs from the internal customer service knowledge base, or c) the URL is either no longer valid or has no content (e.g., a login page). For the Twitter dataset, we used the user timeline API to collect the tweets from agents containing in-domain URLs. The dialogs were constructed by starting from these tweets and tracing back the preceding user and agent tweets. If a dialog contains multiple URLs, we only use the dialog up to the first agent utterance containing a URL. The details of document content extraction are in Appendix C.
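To make the dialog construction concrete, below is a minimal sketch of the truncation step, assuming dialogs are lists of (speaker, utterance) pairs; the helper name and URL pattern are ours:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def truncate_at_first_url(dialog):
    """Keep turns up to and including the first agent ("A") turn that
    contains a URL; dialogs with no agent-provided URL are filtered out.
    (Hypothetical helper mirroring the preprocessing described above.)"""
    kept = []
    for speaker, utterance in dialog:
        kept.append((speaker, utterance))
        if speaker == "A" and URL_RE.search(utterance):
            return kept
    return None  # no agent-provided URL: drop this dialog
```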
From Table 2, we observe that around 5-9% of dialogs include a URL document provided by the agent. We also note that organizations' website content gets updated frequently, as evidenced by the many URLs that return 404 errors. The average number of turns in a dialog and the dialog length (in tokens) are much smaller for Twitter in comparison to the Mac-Support and Telco-Support datasets. Our experimental results in Section 4, particularly BM25 and IRC in Table 4, demonstrate the importance of dialog context for the CDP task, even when that context is not very rich, as is the case for the short dialogs of Twitter.

Approaches
We now formally introduce the CDP task and its notation. We then describe the two alternative approaches (IR and DL), and their hybrid, that we evaluate for this task.

Task Definition
We regard the CDP task as a dialogue-based document classification task, similar to next utterance classification (Lowe et al., 2015). This is achieved by processing the data as described in Section 3.3, without requiring any human labels.
Formally, let $d = \{s_1{:}t_1, s_2{:}t_2, \ldots, s_n{:}t_n\}$ denote an $n$-turn dialog, where $s_i$ represents the speaker (user $U$ or agent $A$), and $t_i$ represents the $i$-th utterance. The dialog history is concatenated to form a dialog context of length $m$, represented as $d = (d_1, d_2, \ldots, d_i, \ldots, d_m)$, where $d_i$ is the $i$-th word in the context. Let $Y$ denote the set of all documents which can be recommended to the user. Similar to dialogs, each document $y \in Y$ is represented as $y = (y_1, y_2, \ldots, y_j, \ldots, y_n)$, where $y_j$ is the $j$-th word in the document. Given a dialog query $d$, the goal of the CDP task is to recommend $k$ documents in $Y$ to the agent. For evaluation, we use Recall@k and Mean Reciprocal Rank (MRR), where the model is asked to select the $k$ most likely documents, and it is correct if the correct URL document is among these $k$ documents.
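For concreteness, here is a minimal sketch of these two metrics, assuming each model produces a full ranking of document ids per dialog query (helper names are ours):

```python
import numpy as np

def recall_at_k(ranked_doc_ids, gold_doc_id, k):
    """1.0 if the gold document appears among the top-k ranked documents."""
    return float(gold_doc_id in ranked_doc_ids[:k])

def mean_reciprocal_rank(all_rankings, gold_ids):
    """MRR over queries: mean of 1/rank of the gold document,
    assuming the gold document appears in every ranking."""
    reciprocal_ranks = []
    for ranked, gold in zip(all_rankings, gold_ids):
        rank = ranked.index(gold) + 1  # 1-based rank
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```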

Information Retrieval approaches
Following previous work, the first approach we evaluate for this task is based on IR models. We use an Apache Lucene index with an English language analyzer and the default BM25 similarity (Robertson and Zaragoza, 2009). Documents in the index are represented using two fields. The first field contains the actual document content. The second field augments the document's representation with the text of all dialogs that link to it in the train set (Amitay et al., 2005).
For a given (dialog) query d, matching documents are retrieved using four different ranking steps, which are combined using a cascade approach (Wang et al., 2011).
Following Van Gysel et al. (2016), we obtain an initial pool of candidate documents using a lexical query aggregation approach. To this end, each utterance $t_i \in d$ is represented as a separate weighted query clause, with its weight assigned relative to its position in the dialog (Van Gysel et al., 2016). The various sub-queries are then combined into a single disjunctive query. The second ranker evaluates each document $y$ obtained by the first ranker against an expanded query (applying the relevance model of Lavrenko and Croft (2001)). The third ranker applies a manifold-ranking approach (Xu et al., 2011), aiming to assign content-similar documents (measured by Bhattacharyya language-model based similarity) similar scores.
The last ranker in the cascade treats the dialog query $d$ as a verbose query and applies the Fixed-Point (FP) method (Paik and Oard, 2014) for weighting its words. Yet, compared to "traditional" verbose queries, dialogs are further segmented into distinct utterances. Building on this observation, we implement an utterance-biased extension for enhanced word weighting. To this end, we first score the various utterances based on the initial FP weights of the words they contain and their relative position. We then propagate the utterance scores back to their associated words. The IR model is denoted IRC (short for IR-Cascade) in Table 4.
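As an illustration of the first ranker's lexical query aggregation, here is a minimal sketch using the rank_bm25 package as a stand-in for the Lucene index; the linear position-based weighting and the helper names are our assumptions, not the exact scheme used in the cascade:

```python
from rank_bm25 import BM25Okapi

def aggregate_rank(bm25, doc_ids, utterances):
    """Score documents as a weighted sum of per-utterance BM25 scores;
    later utterances receive higher weight (assumed linear scheme)."""
    n = len(utterances)
    totals = [0.0] * len(doc_ids)
    for i, utt in enumerate(utterances):
        weight = (i + 1) / n  # position-based weight (our assumption)
        scores = bm25.get_scores(utt.lower().split())
        totals = [t + weight * s for t, s in zip(totals, scores)]
    return sorted(zip(doc_ids, totals), key=lambda pair: -pair[1])

# Usage, assuming `documents` is a list of document strings and
# `dialog_utterances` the utterances of one dialog query:
corpus_tokens = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(corpus_tokens)
ranking = aggregate_rank(bm25, list(range(len(documents))), dialog_utterances)
```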

Neural approaches
The second type of approaches we evaluate are neural models. We process the datasets to construct triples of <dialog context (d), URL document content (y), label (1/0)> from each dialog. For each d, we create a set of k+1 triples: one triple containing the correct URL provided by the agent (label 1), and k triples containing incorrect URLs randomly sampled from Y (label 0). We explore different values for k and share additional results in Appendix A.1. During evaluation, we evaluate a given dialog context against the set of all documents (Y).
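A minimal sketch of the triple construction, assuming documents are identified by their URLs (the helper name is ours; k = 4 is the value selected in Appendix A.1):

```python
import random

def build_triples(dialog_context, gold_url, all_urls, k=4, seed=0):
    """One positive triple plus k randomly sampled negative triples."""
    rng = random.Random(seed)
    negatives = rng.sample([u for u in all_urls if u != gold_url], k)
    triples = [(dialog_context, gold_url, 1)]
    triples += [(dialog_context, neg, 0) for neg in negatives]
    return triples
```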
We evaluate the CDP task using three state-of-the-art neural models: the Enhanced Sequential Inference Model (ESIM) proposed by Chen et al. (2017), which performs well on Natural Language Inference (NLI) and next utterance prediction tasks (Dong and Huang, 2018); the BertForSequenceClassification model (Wolf et al., 2019); and SBERT. We next briefly describe these models.

ESIM
The ESIM model takes two input sequences, the dialog context ($d$) and the document content ($y$), and feeds them through a BiLSTM to generate local context-aware word representations denoted by $\bar{d}$ and $\bar{y}$. A co-attention matrix $E$, where $E_{ij} = \bar{d}_i^{\top}\bar{y}_j$, computes the similarity between $d$ and $y$. The attended dialog context and document content vectors, denoted by $\tilde{d}$ and $\tilde{y}$, are computed using $E$; they represent the most relevant words in $y$'s content for each word in $d$'s context, and vice versa.

This local inference information is enhanced by computing the difference and the element-wise product for the tuple $\langle \bar{d}, \tilde{d} \rangle$ as well as for $\langle \bar{y}, \tilde{y} \rangle$. The difference and element-wise product are then concatenated with the original vectors $\bar{d}$ and $\tilde{d}$, or $\bar{y}$ and $\tilde{y}$, respectively. The concatenated vectors are then fed to another set of BiLSTMs to compose the overall inference between the two sequences. Finally, the result vectors are converted to a fixed-length vector by max pooling and fed to a final classifier.
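The co-attention and enhancement steps can be sketched as follows (a simplified, unbatched PyTorch version under our own naming; the full ESIM model adds the BiLSTM encoders, pooling, and final classifier):

```python
import torch
import torch.nn.functional as F

def co_attend(d_bar, y_bar):
    """Co-attention and local inference enhancement, as described above.
    d_bar: [m, h] and y_bar: [n, h] are BiLSTM outputs for context/document."""
    E = d_bar @ y_bar.T                     # E_ij = <d_bar_i, y_bar_j>
    d_tilde = F.softmax(E, dim=1) @ y_bar   # attend over y for each word in d
    y_tilde = F.softmax(E, dim=0).T @ d_bar # attend over d for each word in y
    # Enhance with difference and element-wise product, then concatenate.
    d_enh = torch.cat([d_bar, d_tilde, d_bar - d_tilde, d_bar * d_tilde], dim=-1)
    y_enh = torch.cat([y_bar, y_tilde, y_bar - y_tilde, y_bar * y_tilde], dim=-1)
    return d_enh, y_enh
```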

BERT
We use pre-trained BERT (Devlin et al., 2019) in two settings: a) fine-tuned on the training set, and b) an additional pre-training step on unlabeled data (dialogs in the training set and all documents) followed by fine-tuning on the training set (denoted BERT*). Re-ranking of candidate documents $y \in Y$ for a given context $d$ is done through the confidence score of each pair $(d, y)$ belonging to the positive class.
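A minimal sketch of this re-ranking with the Hugging Face transformers library; the fine-tuned checkpoint path is hypothetical, and for brevity we use the library's joint pair truncation rather than the 256+256 token split described in Appendix A.2:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Hypothetical path to a binary classifier fine-tuned on (d, y, label) triples.
model = BertForSequenceClassification.from_pretrained("path/to/finetuned-cdp-bert")
model.eval()

def score_pair(dialog_context, doc_content):
    """Positive-class probability for a (d, y) pair, assuming label 1 = match."""
    inputs = tokenizer(dialog_context, doc_content,
                       truncation="longest_first", max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Rank all candidates for one dialog context d:
# ranked = sorted(docs, key=lambda y: score_pair(d, y), reverse=True)
```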

Sentence-BERT (SBERT)
We also explore SBERT (Reimers and Gurevych, 2019), which uses a Siamese network structure to fine-tune the pre-trained BERT network and derive semantically meaningful sentence embeddings. The sentence embeddings for $d$ and $y$ are derived by adding a pooling operation (default: mean) on the BERT outputs, and can then be compared using cosine similarity to achieve low inference time. We fine-tune SBERT in the same two settings as BERT above. The input handling and evaluation are the same as for BERT. The fine-tuning and hyperparameter details are available in Appendix B.4.
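A minimal retrieval sketch with the sentence-transformers library; the model name is a stand-in for our fine-tuned checkpoint, and `documents` (a list of document strings) is assumed:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")  # stand-in checkpoint

# Document embeddings can be precomputed once, which is what makes
# SBERT inference fast at query time.
doc_embs = model.encode(documents, convert_to_tensor=True)

def rank_documents(dialog_context, top_k=10):
    """Rank documents by cosine similarity to the dialog context embedding."""
    query_emb = model.encode(dialog_context, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, doc_embs)[0]
    return sims.topk(top_k).indices.tolist()
```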

A hybrid approach
To investigate the real-world use of our approaches, we compare (in Table 3) the number of parameters of each model and the inference time for a single query from the Twitter test set. The IRC model is much faster in comparison to the neural models. To incorporate the additional performance gain from the neural models (in Table 4), we introduce a hybrid approach: a two-stage pipeline where we utilize the IRC model to generate a pool of candidate documents, which the ESIM model then re-ranks.
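A minimal sketch of the two-stage pipeline; `irc_retrieve` and `esim_score` are hypothetical wrappers around the two models, and the candidate pool size is our assumption:

```python
def hybrid_rank(dialog_context, pool_size=100, top_k=10):
    """IRC+ESIM pipeline: fast lexical retrieval narrows the candidate
    set, then the neural model re-ranks only that small pool."""
    candidates = irc_retrieve(dialog_context, n=pool_size)   # stage 1: IRC
    rescored = [(doc, esim_score(dialog_context, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)    # stage 2: ESIM
    return [doc for doc, _ in rescored[:top_k]]
```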

Results and Analysis
Results are presented in Table 4. We provide the training settings and hyperparameter details for all neural models in the Appendix. We observe that the ESIM model performs best across all datasets, and the IRC model performs comparably to the ESIM model except on the Telco-Support dataset. We observe a significant performance reduction with the BERT models in comparison to both the IRC and ESIM models. The BERT* model (additional pre-training) improves performance on the Telco-Support dataset, but is still inferior to the ESIM model. The SBERT models provide the benefit of low inference time, but reduce performance further. We conclude that for the CDP task, the explicit cross-attention between dialog context $d$ and document $y$ present in ESIM is crucial. The BERT models try to incorporate cross-attention through self-attention on the concatenated $\langle d, y \rangle$ pair sequence, but still lag behind.
Finally, the hybrid approach (IRC+ESIM) provides a significant boost in performance (e.g., between +7% and +20% in R@1), and reduces the inference time of ESIM. This demonstrates the benefit and importance, for real-world applications, of combining IR models, which are based on exact matching, with neural models, which further allow semantic inference in the domain.

Conclusion and Future Work
We introduced the Conversational Document Prediction (CDP) task and investigated the performance of state-of-the-art DL and IR models. We also released a new public Twitter dataset for the CDP task. In this work, we considered only URL documents with content. Other potential document types that could be considered are PDFs, Word documents, etc., as well as URLs without content (e.g., login or tracking pages). We plan to address these challenges in future work.

A Appendix: Additional results
The corresponding validation performance for the Information Retrieval and neural approaches, as well as the Mean Reciprocal Rank (MRR) metric for both validation and test sets on all datasets, are available in Table 7.

A.1 Negative samples for neural approaches
To create training data for our neural approaches, we create k triples containing incorrect URLs sampled randomly from the set of all documents. We experimented with different values of the hyperparameter num_negative_samples used for generating the training data. The results for the ESIM model on the Mac-Support dataset are presented in Table 5. We observe that increasing the number of negative samples doesn't improve the ESIM model's performance significantly, and num_negative_samples = 4 provides the best of both worlds, i.e., good performance and lower training time, compared to using a higher negative sample ratio. We use the same value for all neural models on all datasets.

A.2 Input handling for BERT models
To handle the BERT model's input limitation of a 512-token max sequence length, we feed BERT with 256 tokens each from the dialog context d and the document content y. We observe that the initial sentences of a URL document always capture its core gist, so we always use the first 256 tokens of the document content. For the dialog context, we observe that in many dialogs, as the conversation progresses over multiple turns and the user query gets more complex, the conversation shifts from the original query to another problem. We explore two approaches for deciding which tokens to keep when the dialog context sequence length |d| > 256 (a sketch of both follows the list below):

1. Input-A: Truncate the dialog context d to consider only the first 256 tokens.
2. Input-B: Ignore tokens in the middle of the dialog context sequence to reduce |d| to 256.
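A minimal sketch of the two heuristics over a token list (the exact head/tail split in Input-B is our assumption):

```python
def input_a(tokens, max_len=256):
    """Input-A: keep only the first max_len tokens."""
    return tokens[:max_len]

def input_b(tokens, max_len=256):
    """Input-B: drop tokens from the middle, keeping the start and
    end of the dialog context (even head/tail split is assumed)."""
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[-tail:]
```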
The results for both approaches for the BERT model on the Telco-Support dataset are in Table 6. We use the same heuristic for all neural models on all datasets.

B.2 Pre-training BERT model

The hyperparameters used to run LM pre-training are:
train_batch_size=16
max_seq_length=512
max_predictions_per_seq=20
num_train_steps=100000
num_warmup_steps=10000
save_checkpoints_steps=20000
learning_rate=5e-5

B.3 Fine-tuning BERT model
The hyperparameters used for further fine-tuning the BERT model are: The model is periodically evaluated on the validation set after every n steps, where n is decided based on the training dataset size.

B.4 Fine-tuning SBERT model
The hyperparameters used for fine-tuning the SBERT model are: We use a linear learning rate warm-up over 10% of the training data. We fine-tune SBERT with a 3-way softmax-classifier objective function, and the default pooling strategy is MEAN. The max sequence length is 256 each for the dialog context d and the document content y. For SBERT*, we use the same additionally pre-trained BERT* model from before. The model is periodically evaluated on the validation set after every n steps, where n is decided based on the training dataset size.

C Appendix: Extracting content from URL documents
For the internal Mac-Support dataset, the document content for each URL was obtained via API calls to the customer service knowledge base. For the Telco-Support and Twitter datasets, we capture the HTML content using a Selenium Chrome webdriver, which renders the URL document by loading all CSS styling and JavaScript. The extracted HTML was cleaned through a Markdown generation pipeline, where we manually identify and filter the DOM tags (using CSS id and/or class) which correspond to headers, footers, navigation bars, etc. This process is repeated for each URL domain in both datasets. The tools for data preprocessing are available at: https://github.com/IBM/MDfromHTML.
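A minimal sketch of the rendering step with Selenium (headless Chrome); the downstream Markdown cleaning from the MDfromHTML repository linked above is not reproduced here:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

def fetch_rendered_html(url):
    """Load a URL in headless Chrome so CSS and JavaScript execute,
    then return the fully rendered HTML for downstream cleaning."""
    driver.get(url)
    return driver.page_source
```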