A Deep Learning System for Sentiment Analysis of Service Calls

Sentiment analysis is crucial for the advancement of artificial intelligence (AI). Sentiment understanding can help AI replicate human language and discourse. Studying the formation of, and response to, sentiment states by well-trained Customer Service Representatives (CSRs) can help make the interaction between humans and AI more intelligent. In this paper, a sentiment analysis pipeline is first carried out on real-world multi-party conversations, namely service calls. Based on the acoustic and linguistic features extracted from the source information, a novel aggregated framework for voice sentiment recognition is built. Each party's sentiment pattern during the communication is investigated, along with the interaction sentiment pattern between all parties.


Introduction
The natural reference for AI systems is human behavior. In human social life, emotional intelligence is important for successful and effective communication. Humans have the natural ability to comprehend and react to the emotions of their communication partners through vocal and facial expressions [14,30]. A long-standing goal of AI has been to create affective agents that can recognize, interpret, and express emotions. Early-stage research in affective computing and sentiment analysis has mainly focused on understanding affect towards entities such as movies, products, services, candidates, organizations, actions, and so on in monologues, which involve only one person's opinion. However, with the advent of Human-Robot Interaction (HRI) applications such as voice assistants and customer service chatbots, researchers have started to build empathetic dialogue systems that improve the overall HRI experience by adapting to customers' sentiment.
Sentiment study of Human-Human Interactions (HHI) can help machines identify and react to human non-verbal communication, which makes the HRI experience more natural. The call center is a rich resource of communication data: a large number of calls are recorded daily in order to assess the quality of interactions between CSRs and customers. Learning the sentiment expressions of well-trained CSRs during communication can help AI understand not only what the user says, but also how he/she says it, so that the interaction feels more human. In this paper, we target and use real-world data, namely service calls, which pose additional challenges, such as variability and noise, compared with the artificial datasets that have typically been used in past multimodal sentiment research. The basic 'sentiment' can be described on a scale of approval or disapproval, good or bad, positive or negative, and is termed polarity [31]. In the service industry, the key task is to enhance the quality of services by identifying issues that may be caused by systems, rules, or service quality. These issues are usually expressed by a caller's anger or disappointment on a call. In addition, service chatbots are widely used to answer customer calls; if customers get angry during HRI, the system should be able to transfer them to a live agent. In this study, we mainly focus on identifying 'negative' sentiment, especially 'angry' customers. Given the non-homogeneous nature of full call recordings, which typically include a mixture of negative and nonnegative statements, sentiment analysis is addressed at the sentence level. Call segments are explored in both the acoustic and linguistic modalities, and the temporal sentiment patterns between customers and CSRs appearing in calls are described.

The paper is organized as follows: Section 2 covers a brief literature review on sentiment recognition from different modalities; Section 3 proposes a pipeline that features our novelties in training data creation using real-world multi-party conversations, including a description of the data acquisition, speaker diarization, transcription, and semi-supervised learning annotation; the methodology for acoustic and linguistic sentiment analysis is presented in Section 4; Section 5 illustrates the methodology adopted for fusing different modalities; Section 6 presents experimental results, including the evaluation measures and temporal sentiment patterns; finally, Section 7 concludes the paper and outlines future work.

Related Work
In this section, we provide a brief overview of related work on text-based and audio-based sentiment analysis.

Text-based Sentiment Analysis
Sentiment analysis has focused primarily on the processing of text and mainly consists of either rule-based classifiers that make use of large sentiment lexicons, or data-driven methods that assume the availability of large annotated corpora. A sentiment lexicon is a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as either positive or negative [15]. Widely used lexicons include binary polarity-based lexicons such as the Harvard General Inquirer [36], Linguistic Inquiry and Word Count (LIWC, pronounced 'Luke') [27,26], and Bing [16], and valence-based lexicons such as AFINN [22], SentiWordNet [2], and SenticNet [6]. Employing these lexicons, researchers can apply their own rules or use existing rule-based models, such as VADER [12], to do sentiment analysis. One big advantage of rule-based models is that they require no training data and generalize to multiple domains. However, since words are annotated based on their context-free semantic orientation, word-sense disambiguation issues [12] may occur when a word has multiple meanings. For example, words like 'defeated', 'envious', and 'stunned' are classified as 'positive' in Bing, but '-2' (negative) in AFINN. Although rule-based algorithms are known to be noisy and limited, a sentiment lexicon is a useful component for any sophisticated sentiment detection algorithm and is one of the main resources to start from [31]. Another major line of work in sentiment analysis consists of data-driven methods based on large datasets annotated for polarity. The most widely used datasets include the MPQA corpus, a collection of manually annotated news articles [39,40], movie reviews with binary polarity labels [25], and a collection of newspaper headlines annotated for polarity [37]. With large annotated datasets, supervised classifiers have been applied [10,35]. Such approaches step away from blind use of keywords and word co-occurrence counts, relying instead on the implicit features associated with large semantic knowledge bases [4].

Audio-based Sentiment Analysis
Vocal expression is a primary carrier of affective signals in human communication. Speech signals contain features that convey linguistic, speaker-specific, and emotional information. Related work on audio-based sentiment analysis, along with multimodal fusion, is reviewed in this section.

Studies on speech-based sentiment analysis have focused on identifying relevant acoustic features. Open-source software such as OpenEAR [9], openSMILE [8], the jAudio toolkit [18], or library packages [19,38] can be used to extract features. Those features, along with some of their statistical derivatives, are closely related to vocal prosodic characteristics such as tone, volume, pitch, intonation, inflection, duration, etc. Supervised or unsupervised classifiers can be fitted based on the statistical derivatives of those features [13,24]. Sequence models can be fitted based on filter banks, Mel-frequency cepstral coefficients (MFCCs), or other low-level descriptors extracted from raw speech without feature engineering [1]. However, this approach usually requires highly efficient computation and large annotated audio collections.

Multimodal sentiment analysis has started to draw attention recently because of the unlimited multimodal sources of information online, such as videos and audio [28,29,5]. Most multimodal sentiment analysis today focuses on monologue videos. In the last few years, sentiment recognition in conversation has started to gain research interest, since reproducing human interaction requires a deep understanding of the conversation, and sentiment plays a pivotal role in conversations. The existing conversation datasets are usually recorded in a controlled environment, such as a lab, segmented into utterances, transcribed to text, and annotated with emotion or sentiment labels manually. Widely used datasets include the AMI Meeting Corpus [7], IEMOCAP [3], SEMAINE [21], and AVEC [33]. Recently, a few recurrent neural network (RNN) models have been developed for emotion detection in conversations, e.g., DialogueRNN [17] and ICON [11]. However, they are less accurate at detecting emotion for utterances with an emotional shift [32], and their training data require speaker information. The conversation models are not employed in our polarity sentiment analysis because of the quality of the data and the approach used to obtain the training data; more detailed explanations can be found in Section 3.4.

At the heart of any multimodal sentiment analysis engine is multimodal fusion, which integrates the single modalities into a combined representation. Features are extracted from the data of each modality independently. Decision-level fusion feeds the features of each modality into separate classifiers and then combines their decisions. Feature-level fusion concatenates the feature vectors obtained from all modalities and feeds the resulting long vector into a supervised classifier. Recent research on multimodal fusion for sentiment recognition has been conducted at either the feature level or the decision level.

Dataset and Pipeline
The data resources used for our experiments are described in Section 3.1. Data preparation, including speech transcription and speaker diarization, is discussed in Section 3.2. The sentiment annotation guideline is introduced in Section 3.3. Section 3.4 presents a semi-supervised learning annotation pipeline that chains data preparation, model training, model deployment, and data monitoring.

BSCD: Benefits Service Call Dataset
The main dataset we created in this paper consists of service calls collected from a health care benefits call center (named BSCD). Calls focus on customers looking for help or support with company-provided benefits such as health insurance. 500 calls are collected from the call center database, covering diverse topics such as insurance plan information, insurance ID cards, dependent coverage, etc. The calls include randomly selected female and male speakers with ages ranging approximately from 16 to 80. Calls involving translators are eliminated to keep only speakers expressing themselves in English. All the calls are stored in WAV format with a sample rate of 8,000 Hz and durations varying from 4 minutes to 26 minutes. All calls are pre-processed to eliminate repetitive introductions: the beginning of each call contains an introduction of the user's company name by a robot, so the segment before the first pause (silence duration > 1 second) is removed from each call. A robust computational model of sentiment analysis needs to be able to handle real-world variability and noise. While previous research on multimodal sentiment or emotion analysis used audio and video recorded in laboratory settings [3,20,21], BSCD gathers real-world calls that contain the ambient noise present in most audio recordings, as well as diversity in person-to-person communication patterns. Both of these conditions result in difficulties that need to be addressed in order to effectively extract useful data from these sources.
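The intro-trimming step above lends itself to a short sketch. The following is a minimal illustration, assuming pydub as the audio library and a -40 dBFS silence threshold; neither the tooling nor the threshold value is specified here, only the 1-second pause rule.

```python
# Hypothetical sketch of removing the robot introduction: drop everything
# before the first pause longer than one second (pydub and the -40 dBFS
# silence threshold are assumptions, not the actual preprocessing tooling).
from pydub import AudioSegment
from pydub.silence import detect_silence

def trim_intro(path: str) -> AudioSegment:
    call = AudioSegment.from_wav(path)  # 8 kHz service-call recording
    pauses = detect_silence(call, min_silence_len=1000, silence_thresh=-40)
    if not pauses:
        return call                     # no qualifying pause: keep the call as-is
    _, first_pause_end_ms = pauses[0]
    return call[first_pause_end_ms:]    # keep only the audio after the first pause
```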

Data Preparation
To discard noise and long pauses (silence duration > 5 seconds) in calls, Voice Activity Detection (VAD) is applied, followed by Automatic Speech Recognition (ASR) and Automatic Speaker Diarization (ASD) to transcribe the verbal statements, extract the start and end time of each utterance, and identify the speaker of each utterance. Each call is segmented into an average of 69 utterances. The duration of the utterances is right-skewed, with a median of 2.9 seconds and first and third quartiles of 1.6 and 5.1 seconds. By searching for keywords such as 'How can I help' in the content of each utterance, speakers are labeled as CSR or customer. Each utterance is linked to the corresponding audio stream, auto transcription, and speaker label. The workflow and corresponding results for the first 23 seconds of one selected call are shown in Figure 1, where the original input is a call audio sample. After data preparation, segments of noise and silence are discarded. This call sample is segmented into 4 utterances. The audio streams are taken from the original audio and split based on the start and end time of each utterance. Auto transcriptions are more likely to be ungrammatical if the recording quality is bad, the conversation contains words that the ASR cannot identify, or the speakers do not express themselves clearly.

Figure 1: Data preparation workflow

The ungrammatical transcriptions usually occur in the customer portions of the calls, and the frequency of ungrammaticality varies from case to case. Although sentiment recognition for a whole call tends to be robust to speech recognition errors, errors in individual utterances cannot be repaired in our setting. The speaker labels come from the ASD output, which can be misclassified because of overlapping speakers or speakers with similar acoustic features. Although such misclassification is rare, it can make the study of conversation sentiment patterns misleading. This process allows us to study features from both modalities: transcribed words and acoustics. Distinguishing the different parties gives us the ability to study the temporal sentiment transitions of individual speakers and the interactions among speakers in a conversation. However, since the data preparation is part of the pipeline described in Section 3.4, which runs in real time, sentiment analysis must rely on error-prone ASR and ASD outputs.
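As a concrete illustration of the keyword-based role assignment described above, the following sketch labels each diarized speaker as CSR or customer; the keyword list and the utterance fields are illustrative assumptions.

```python
# Minimal sketch: any diarized speaker who utters a CSR-style greeting is
# labeled CSR, every other speaker is labeled customer. The keyword list and
# the dictionary fields ('speaker', 'text', 'role') are assumptions.
CSR_KEYWORDS = ("how can i help",)

def label_roles(utterances):
    csr_ids = {
        u["speaker"]
        for u in utterances
        if any(k in u["text"].lower() for k in CSR_KEYWORDS)
    }
    for u in utterances:
        u["role"] = "CSR" if u["speaker"] in csr_ids else "customer"
    return utterances
```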

Sentiment Annotation
Sentiment annotation is a challenging task, as the label depends on the annotators' perspective and on the differences inherent in the way people express emotions. Sentiment is opinion-based, not fact-based. This study aims at identifying negative expressions in calls, especially angry customers who are not satisfied with the services or with the business or system rules. By identifying and studying those types of cases, the business can improve call center services and fix possible business or system issues.
Guidelines are set up for the annotation.The customer negative tag is for negative emotions (e.g."I hate the system"), attitudes (e.g."I am not following you"), evaluations (e.g."your service is the worst"), and negative facts caused by other parties (e.g."I never received my card").Other negative facts are not considered as negative (e.g."My wife died, I need to remove her from my dependents").
The guidelines for CSRs are different. Well-trained CSRs usually do not respond negatively, but there are cases in which they cannot help the customers. We identify those cases as negative; they usually involve business process or system issues. The sentiment is not always explicit in the text. Borderline linguistic utterances stated loudly and quickly are usually identified as 'negative'. For example, the utterance "Trust me, it could be done" is classified as negative, since it occurs in a context where the representative fails to help the customer enroll in the health plan, and in the audio the customer is irritated. In multimodal sentiment analysis, the labels of all modalities are kept consistent for the same utterance. In our data annotation process, we likewise keep text and audio labels that agree with each other, and the annotation is based on the audio segments.

Semi-supervised Learning Annotation Pipeline
To successfully run and train analytical models, massive quantities of annotated data are needed. Creating large annotated datasets can be a very time-consuming and labor-intensive process. To keep the human annotation effort to a minimum, a semi-supervised learning annotation scheme is applied to tag the polarity of utterances as negative or nonnegative; the pipeline is illustrated in Figure 2.
In the early iterations, the committee C_T consists of general approaches, VADER and cloud sentiment APIs (AWS Comprehend Sentiment Analysis, AWS Custom Classification, and Google Language Sentiment Analysis), which require only a small amount of training data or no extra training data at all. If the fraction of negative utterances in a call exceeds 40% for the customer or 20% for the CSR, the call is flagged as potentially negative and informative. We then ask an annotator to manually correct the annotated tags D_U for those calls by listening to the call, and move the corrected results D_U(I) to the labeled set D_L. For all the other calls, we only keep the utterances on which the classifiers all agree as D_U(M). As the size of D_LT increases, we form a new committee C_T that includes, among others, an SVM and a Long Short-Term Memory (LSTM) classifier. D_LT still includes transcription errors, even though the threshold described above is set to keep such utterances out of the training dataset. In addition, 18,705 cleaned text chat utterances collected from chat windows are added to D_LT via the annotation pipeline to improve the C_T accuracy; details are shown in Section 6.1. Because of the quality of the calls, the poor ASR performance in some cases, and the threshold used to annotate the utterances, more than half of the original call segments are discarded, and the 18,705 chat texts are added to D_LT = {transcription data, chat data} without corresponding audio files in D_LA. Since the retained segments are not contiguous, it is hard to take the context of the conversation into account in the training dataset; therefore, conversation models are not included in our committee classifiers C.
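The committee-agreement rule and the 40%/20% flagging threshold can be sketched as follows; the classifier interface and label strings are assumptions, only the thresholds come from the text.

```python
# Sketch of the committee labeling step: keep an utterance only when all
# committee classifiers agree, and flag a call for manual correction when the
# predicted negative rate exceeds 40% (customer) or 20% (CSR). The .predict()
# interface and the 'role' field are hypothetical.
def committee_label(utterance_text, committee):
    votes = {clf.predict(utterance_text) for clf in committee}
    return votes.pop() if len(votes) == 1 else None   # None = disagreement, discard

def flag_potentially_negative(utterances, labels):
    def negative_rate(role):
        role_labels = [l for u, l in zip(utterances, labels) if u["role"] == role]
        return role_labels.count("negative") / max(len(role_labels), 1)
    return negative_rate("customer") > 0.40 or negative_rate("CSR") > 0.20
```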

Bimodal Sentiment Analysis
To model information for sentiment analysis from calls, we first obtain the streams corresponding to each modality via the methods described in Section 3.2, followed by the extraction of a representative set of features for each modality. These features are then used as cues to build a binary sentiment classifier.

Sentiment Analysis of Textual Data
General approaches such as sentiment lexicons and sentiment APIs are easy to apply. Both are employed in C_T to monitor the utterance prediction labels in the early stage of the semi-supervised learning annotation used to extend the training data. VADER [12] is a simple rule-based model for general sentiment analysis. Its output has four categories: compound, negative, neutral, and positive. We classify utterances with negative output as negative, and utterances with neutral or positive output as nonnegative, so that the labels are consistent with the BSCD annotation. This model has many advantages, such as being computationally inexpensive and easily interpretable. However, one of the main issues with using only lexicons is that most utterances do not contain polarized words.

Sentiment Analysis of Acoustic Data
Motivated by the human hearing process, we study the acoustic features based on human perception. Three perceptual categories are described in this section. Their corresponding features are usually short-term features extracted from every short-term window (or frame). Long-term features can be generated by aggregating the short-term features extracted from several consecutive frames within a time window. For each short-term acoustic feature, we calculate nine statistical aggregations: mean, standard deviation, quantiles (5%, 25%, 50%, 75%, 95%), range (95% minus 5% quantile), and interquartile range (75% minus 25% quantile) to obtain the long-term features of each segment.
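A minimal sketch of the nine-way aggregation, assuming each short-term feature is available as a per-frame NumPy array for the segment:

```python
# Turn one short-term feature track (one value per frame) into the nine
# long-term statistics used for each segment.
import numpy as np

def long_term_features(frames: np.ndarray) -> np.ndarray:
    q05, q25, q50, q75, q95 = np.percentile(frames, [5, 25, 50, 75, 95])
    return np.array([
        frames.mean(), frames.std(),   # mean and standard deviation
        q05, q25, q50, q75, q95,       # the five quantiles
        q95 - q05,                     # range (95% - 5% quantile)
        q75 - q25,                     # interquartile range (75% - 25% quantile)
    ])
```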
• Loudness is the subjective perception of sound pressure which is related to sound intensity.Amplitude and mean frequency spectrum features are extracted to measure loudness.The greater the amplitude of the vibrations, the greater the amount of energy carried by the wave, and the more intense the sound will be.
• Sharpness is a measure of the high-frequency content of a sound: the greater the proportion of high frequencies, the sharper the sound. Fundamental frequency (pitch) and dominant frequency are extracted.
• Speaking rate is normally defined as the number of words spoken per minute. In general, the speaking rate is characterized by different parameters of speech such as pause and vowel durations. In our study, speaking rate is measured by pause duration, characters per second (CPS), and words per second (WPS), which are calculated for the i-th segment as

CPS_i = N_i^chars / T_i,    WPS_i = N_i^words / T_i,

where, for segment i, T_i denotes the duration and N_i^chars and N_i^words denote the number of characters and words in the corresponding transcription. Pause duration can be interpreted as the percentage of time within the segment where the speaker was silent. The three variables are aggregated statistics, i.e., long-term features.
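A small sketch of the speaking-rate features under the definitions above; whether characters are counted with or without whitespace, and how the silence within a segment is obtained, are assumptions.

```python
# Compute pause ratio, CPS, and WPS for one segment from its transcription,
# its duration, and the amount of detected silence inside it (all assumed given).
def speaking_rate_features(transcript: str, duration_s: float, silence_s: float):
    n_chars = len(transcript.replace(" ", ""))   # characters excluding spaces
    n_words = len(transcript.split())
    pause_ratio = silence_s / duration_s         # fraction of the segment that is silent
    return pause_ratio, n_chars / duration_s, n_words / duration_s
```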
In nonnegative cases, speakers are in a relaxed, normal emotional state, whereas a speaker in an agitated or angry emotional state is typically characterized by increased vocal loudness, sharpness, and speaking rate. The acoustic classifiers C_A = {Elastic-Net, KNN, RF, GMM} are built on the 39 selected features. Hand-crafted features are generally very successful for specific sound analysis tasks. One of the main drawbacks of feature engineering is that it relies on transformations defined beforehand and ignores some particularities of the signals observed at runtime, such as recording conditions and recording devices. A more common approach is to select and adapt features initially introduced for other tasks; a now well-established example of this trend is the popularity of MFCC features [34]. In our experiments, MFCCs are extracted from each segment and fed to RNN models in later iterations, once |D_LA| > 10,000.
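For the MFCC input to the sequence models, a hedged extraction sketch is shown below; librosa and the 13-coefficient setting are assumptions, only the 8 kHz sample rate comes from the dataset description.

```python
# Extract a time-major MFCC matrix for one utterance segment, suitable as
# input to an RNN/BLSTM sequence model (librosa and n_mfcc=13 are assumptions).
import librosa

def segment_mfcc(wav_path: str, n_mfcc: int = 13):
    y, sr = librosa.load(wav_path, sr=8000)             # calls are sampled at 8 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                        # shape: (n_frames, n_mfcc)
```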

Fusion
There are two main fusion techniques: feature-level fusion and decision-level fusion. In our experiments, we employ decision-level fusion, which has many advantages [29]. One benefit is that we can use separate classifiers for text and audio features; a unimodal classifier can then use data from another communication channel of the same type to improve its accuracy, e.g., text data from the chat window is borrowed to improve the C_T accuracy in our study. Separating the modalities also permits us to use any learner suitable for the particular problem at hand. Another benefit of decision-level fusion is its processing speed, since fewer features are used for each classifier and the separate classifiers can be run in parallel. Decision-level fusion usually adds probabilities or summarized predictions from each unimodal classifier with weights, or takes a majority vote among the class labels predicted by the unimodal classifiers. In this paper, various fusion methods are evaluated, including a novel approach that uses the linguistic ensemble results as the baseline and then checks the acoustic results to modify classification decisions. In Fus1, if the audio ensemble classifies an utterance as negative and one or more text models also classify it as negative, we reclassify the result as negative. In Fus2, if the audio ensemble classifies an utterance as negative, we reclassify the result as negative directly, without checking the linguistic modality.
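The two fusion rules can be written compactly as below; the label strings and the exact ensemble interfaces are assumptions, while the override logic follows the description above.

```python
# Decision-level fusion sketch: start from the linguistic ensemble label and
# let the acoustic ensemble override it toward 'negative' (Fus1 additionally
# requires at least one text model to agree).
def fuse(text_ensemble_label, text_model_labels, audio_ensemble_label, rule="Fus2"):
    label = text_ensemble_label
    if audio_ensemble_label == "negative":
        if rule == "Fus2":
            label = "negative"
        elif rule == "Fus1" and "negative" in text_model_labels:
            label = "negative"
    return label
```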

Experiment Results
The test dataset consists of 21 calls with 1,890 utterances, which are manually annotated as negative (848) or nonnegative (1,042).

Evaluation Measures
As evaluation measures, we rely on accuracy and the weighted F1-score, where the F1-score is the harmonic mean of precision and recall and the per-class scores are weighted by class frequency. Precision is the fraction of returned predictions that are correct; recall, also known as sensitivity, is the fraction of relevant instances that the algorithm returns. As shown in Table 1, the general approaches in C_T, VADER and the APIs, tend to have low negative recall. The semantic knowledge based classifiers have a more than 20% higher F1-score than the general approaches. These classifiers are trained on D_LT = {transcription data, chat data}; their overall F1-score is more than 10% higher than that of classifiers trained on call transcription data only. BLSTM on MFCCs performs better than C_A = {Elastic-Net (penalty 0.2||β||_1 + 0.4||β||_2^2), KNN (k = 3), RF, GMM} on the acoustic features. Using audio features alone, an F1-score of 0.584 can be reached, which is acceptable considering that a real-world audio-only system exclusively analyzes the tone of the speaker's voice and does not consider any language information. The acoustic modality is significantly weaker than the linguistic modality; in most cases, the text already includes enough information to judge the sentiment. A few typical situations leading to linguistic modality misclassification are the presence of misleading linguistic cues, ambiguous linguistic utterances whose sentiment polarity depends strongly on context described earlier or later in the call, and nonnegative linguistic utterances stated angrily. In order to achieve better accuracy, we combine the two modalities to exploit complementary information.
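A minimal sketch of the evaluation, assuming scikit-learn and per-utterance string labels:

```python
# Accuracy and weighted F1 over the manually annotated test utterances.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }
```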
We simply combine the results of the three semantic knowledge based classifiers and all five audio classifiers by taking a weighted majority vote. The T+A ensemble results are shown in Table 2; they do not improve on the unimodal text ensemble results.
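The weighted majority vote over the eight unimodal classifiers can be sketched as follows; the weight values themselves are assumptions and are not listed here.

```python
# Weighted majority vote: classify as negative when the weighted share of
# 'negative' votes exceeds half of the total weight.
def weighted_majority(labels, weights):
    negative_weight = sum(w for l, w in zip(labels, weights) if l == "negative")
    return "negative" if negative_weight > sum(weights) / 2 else "nonnegative"
```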
Since the unimodal performance of the linguistic modality is notably better than that of the acoustic modality, our decision-level fusion methods use the linguistic ensemble results as the baseline, while the acoustic results are used as supplemental information to calibrate each classification. The Fus2 bimodal system discussed in Section 5 yields a 2% improvement in F1-score over the text unimodal system.
The acoustic modality provides important cues to identify negative emotions.It can help correct misclassified nonnegative/ambiguous linguistic utterances.Our results show that relying on the joint use of linguistic and acoustic modalities allows us to better sense the sentiment being expressed as compared to the use of only one modality at a time.The acoustic feature analysis helps us to better understand the spoken intention of the speaker, which is not normally expressed through text.

Temporal Sentiment Pattern
Sentiment is not only regarded as an internal psychological phenomenon but is also interpreted and processed communicatively through social interactions. Conversations exemplify such a scenario, where inter-personal sentiment influences persist. The left panel in Figure 3 shows the negative scores of CSRs and customers in the 21 test calls. The negative score, a weighted negative segment percentage, is calculated to analyze the overall sentiment; weights 0.8, 1, and 1.2 are assigned to the first third, second third, and last third of each call. The negative scores of CSRs are usually lower than the customers', and high negative scores for customers usually correspond to high negative scores for CSRs. We can conclude from the figure that sentiment can be affected by the other parties during a conversation.
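The negative score can be sketched as follows; how the weighted count is normalized (here, by the total weight) is an assumption, while the per-third weights 0.8, 1, and 1.2 come from the text.

```python
# Weighted negative segment percentage for one speaker in one call: later
# thirds of the call count more (weights 0.8 / 1.0 / 1.2).
def negative_score(labels):
    n = len(labels)
    weights = [0.8 if i < n / 3 else (1.0 if i < 2 * n / 3 else 1.2)
               for i in range(n)]
    negative_weight = sum(w for l, w in zip(labels, weights) if l == "negative")
    return negative_weight / sum(weights)
```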
To further analyze the interactions between customers and CSRs, the cumulative negative scores for calls 6, 15, and 16 are drawn in the right panel of Figure 3. Call 6 shows the sentiment pattern of a typical bad call, characterized by a long duration and long holds: the customer has a high negative score from beginning to end, and the CSR fails to help the customer during the call. Call 15 is a typical good call: the overall negative score is low and the negative score pattern goes down for both the customer and the CSR, which means the problem is resolved by the end of the call. Call 16 is another type of call, in which the customer does not get angry even though the CSR is unable to solve his/her issues.

Discussion and Future Work
A new dataset, BSCD, consisting of real-world conversations (service calls) is introduced. Human communication is a dynamic process, and our eventual goal is to develop a bimodal sentiment analysis engine with the ability to learn the temporal interaction sentiment patterns among conversation parties. In the process of fusion, we have approached audio sentiment analysis from an angle that differs from most prior work. Future research will concentrate on evaluations using larger datasets, exploring further acoustic feature relevance analysis, and improving the decision-level fusion process. A call consists of a group of utterances that have contextual dependencies among them. However, in our semi-supervised learning annotation pipeline, about half of the segments in calls are discarded. Therefore, interdependent modeling is out of the scope of this paper, and we include it as future work.

Figure 3: The (cumulative) negative score pattern between customers and CSRs

Table 2: Binary classification of sentiment polarity on both linguistic and acoustic modalities.