Sentiment Analysis for Emotional Speech Synthesis in a News Dialogue System

As smart speakers and conversational robots become ubiquitous, the demand for expressive speech synthesis has increased. In this paper, to control the emotional parameters of the speech synthesis according to certain dialogue contents, we construct a news dataset with emotion labels (“positive,” “negative,” or “neutral”) annotated for each sentence. We then propose a method to identify emotion labels using a model combining BERT and BiLSTM-CRF, and evaluate its effectiveness using the constructed dataset. The results showed that the classification model performance can be efficiently improved by preferentially annotating news articles with low confidence in the human-in-the-loop machine learning framework.


Introduction
As smart speakers and conversational robots become ubiquitous, the demand for expressive speech synthesis has increased. Speech synthesis is a technology that converts input text into a speech signal that represents its contents. In recent years, research has not only improved the quality so that a synthesized voice is similar to a real one, but also has diversified expressions (Govind and Prasanna, 2013;Kaur and Singh, 2015;Skerry-Ryan et al., 2018). Emotional speech synthesis is a technique for diversifying the expression of speech synthesis (Qin et al., 2006;Schröder, 2001;Yang et al., 2018). It specifies both emotional parameters and the text input so that the speech reflects the designated emotion (Charfuelan and Steiner, 2013;Inoue et al., 2017;Iwata and Kobayashi, 2011;Nose and Kobayashi, 2013).
In conversational robots and non-task-oriented dialogue systems, the emotions of speech synthesis are controlled not only by the contents of the text spoken by the system but also by the personality of the robot or environmental information (Bennewitz et al., 2007;Chiba et al., 2018;Robbel et al., 2009). On the other hand, in audiobooks and news reading systems, the emotions when the system speaks are often determined by the content of the text. In such applications, it is desirable that the emotional parameters used in the speech synthesis system be automatically estimated from the input text (Bellegarda, 2011;Jauk et al., 2018;Shaikh et al., 2009;Sudhakar and Bensraj, 2014;Trilla and Alías, 2013;Vanmassenhove et al., 2016).
When news is transmitted by a synthesized voice, it is beneficial for listeners that news with positive content is transmitted with voices that are synthesized with positive emotion, whereas news with negative content is transmitted with voices that are synthesized with negative emotion (Pitrelli et al., 2006). In our spoken dialogue system that delivers news (Takatsu et al., 2018), it is important to speak clearly with emotion according to the content of the news to improve the users' understanding. Table 1 shows an example of a news conversation. S2 is "positive" because it reports that the subject, Nishikori, won the game, while S4 is "negative" because it reports that the subject, Andy Murray, lost the game. S1 and S3 are "neutral" because they contain neither positive nor negative content.
To realize emotional speech synthesis according to such news content, herein we construct a dataset for machine learning by annotating emotion labels ("positive," "negative," or "neutral") for each sentence in news articles and propose a method to identify them using a model that combines BERT (Devlin et al., 2019) and BiLSTM-CRF (Lample et al., 2016). Furthermore, we show that the model performance can be efficiently improved by repeating active learning in the human-in-the-loop machine learning framework (Munro, 2019).
The structure of this paper is as follows. Section 2 discusses related work. Section 3 overviews the annotation method of emotion labels and the statistics of the constructed dataset. Section 4 describes the proposed model. Section 5 shows the performance results of the proposed model. Section 6 reports the effects of active learning. Section 7 provides conclusions and future work.

Related work
Sentiment analysis is a technique to analyze emotions contained in texts (Hu et al., 2018). Most sentiment analysis studies target texts containing subjective opinions such as Twitter tweets and review documents (Islam et al., 2019;Yin et al., 2019;Zhang and Zhang, 2019). Many of these studies classify emotions from the perspective of writers. On the other hand, studies on news articles that mainly refer to objective events classify emotions from the readers' perspective. For example, Lin et al. classified readers' emotions about news into happy, angry, sad, surprised, heartwarming, awesome, bored, and useful (Lin et al., 2007;Lin et al., 2008). They took the emotion most commonly chosen by users who read a news article as the correct answer and identified the emotions of news articles using SVM (Cortes and Vapnik, 1995). Li et al. classified readers' emotions into touching, empathy, boredom, anger, amusement, sadness, surprise, and warmness (Li et al., 2016). They formulated the sentiment analysis problem as a multi-label classification problem and proposed a topic model to estimate the weight of different documents for each emotion (Li et al., 2016). Ciptadi and Girsang classified readers' emotions into proud, angry, sad, happy, afraid, amused, inspired, and surprised (Ciptadi and Girsang, 2019). They showed that the classification performance of a Naive Bayes classifier and logistic regression was improved by applying the SMOTE (Chawla et al., 2002) oversampling method to alleviate the problem of imbalanced data.
In addition to studies classifying entire news articles (Ciptadi and Girsang, 2019;Ling et al., 2017;Li et al., 2016;Lin et al., 2007;Lin et al., 2008;Liu et al., 2013;Wang and Liu, 2017), some studies classified headlines of news articles (Kirange and Deshmukh, 2012;Strapparava and Mihalcea, 2007), while others classified sentences of news articles (Bhowmick et al., 2009;Bhowmick et al., 2010;Das and Bandyopadhyay, 2009;Li et al., 2015;Patil and Chaudhari, 2012). Among the studies on sentence classification, Bhowmick et al. classified readers' emotions into anger, disgust, fear, happiness, sadness, and surprise (Bhowmick et al., 2010). They asked multiple people to annotate the sentences of news articles and confirmed that the agreement rate could be improved by eliminating surprise and merging anger and disgust. In addition, they evaluated the multi-label classification performance of ADTboost.MH (Comité et al., 2003). Here too, eliminating surprise and merging anger and disgust improved the model performance. Li et al. used a factor graph to model two dependencies: the label dependency, assuming that a single sentence is likely to have similar labels such as hate and anger, and the context dependency, assuming that sentences in the same context are likely to have the same label (Li et al., 2015). The model using two neighboring sentences rather than the document or paragraph as the context showed the best performance.
Although their study targeted review documents, Zhang et al. formulated the sentiment analysis problem as a sentence sequence labeling problem (Zhang et al., 2014). They proposed a method to identify emotion labels (positive, negative, and neutral) using CRF (Lafferty et al., 2001). In their active learning experiments, model performance improved most effectively when the document selected for labeling was the one whose first-half sentences had the smallest average probability.
In recent years, in the field of natural language processing, approaches that fine-tune a language model pre-trained on huge amounts of unlabeled text for downstream tasks have been attracting attention (Qiu et al., 2020). BERT (Devlin et al., 2019) is a typical method of pre-training language representations. The effectiveness of methods using BERT has also been confirmed in the sentiment analysis task (Hoang et al., 2019;Sun et al., 2019;Xu et al., 2019). However, these models identify the emotion of an entire review or a single sentence. When classifying the emotion of each sentence of a news article, it is necessary to consider not only the sentence itself but also the contextual information around it.
Most conventional studies on sentiment analysis for news articles classify emotions at a finer granularity from the readers' perspective. In this study, we adopt three classes of "positive," "negative," and "neutral" for the granularity of the classification that can be agreed upon by many listeners. Similar to Zhang et al., we formulate the problem of identifying emotion labels for each sentence of news articles as a sentence sequence labeling problem.
News article corpus with emotion labels annotated for the sentences
We classified emotions in news content as "positive," "negative," or "neutral," and constructed a dataset for machine learning by annotating these emotion labels for each sentence in a news article. A positive label is annotated when the subject of the sentence considering the context is good or indicates that the subject is heading in a good direction. Examples include social contribution, market expansion, and acquisition of interests. A negative label is annotated when the subject of the sentence considering the context is bad or indicates that the subject is heading in a bad direction. Examples include a decline in business, acts of dishonesty, incidents, and accidents. A neutral label is annotated when neither positive nor negative content is included. Articles containing sentences with both positive and negative content were excluded in this study.
Annotation was performed by a web news clipping expert on news articles of 5 to 20 sentences from the Nihon Keizai Shimbun. The annotator was presented with lists of news articles ranked within each category using a rule-based approach (see Section 5.1). The annotator annotated high-ranked articles (news expected to be positive), middle-ranked articles (news expected to be neutral), and low-ranked articles (news expected to be negative) so that the three groups were evenly represented. To cover various topics, we instructed the annotator to avoid annotating similar topics as much as possible. In the annotation work, the annotator first read the title of the news article and checked the summary of the article. Next, the annotator assigned an emotion label to the title to understand the emotional tendency of the whole article. After reading all sentences of the article, the annotator assigned an emotion label to each sentence, beginning from the first sentence. Table 2 shows the total number of each emotion label annotated for titles and sentences by news category. The numbers of positive and negative labels were almost equal, whereas neutral labels were fewer.

Table 2: Number of emotion labels annotated for titles and sentences by news category
Category | Title positive | Title neutral | Title negative | Sentence positive | Sentence neutral | Sentence negative
Sports | 65 | 10 | 71 | 555 | 399 | 564
Technology | 91 | 8 | 93 | 776 | 519 | 908
Business | 83 | 3 | 87 | 685 | 381 | 942
Markets | 42 | 5 | 58 | 326 | 165 | 514
Economy | 55 | 6 | 62 | 492 | 383 | 455
International | 48 | 3 | 57 | 411 | 216 | 457
Society | 114 | 4 | 98 | 860 | 559 | 699
Local | 68 | 1 | 66 | 744 | 207 | 553

Figure 1: BERT+ SA BiLSTM-CRF. Self-attention is calculated for the representation that combines the embedding of the top layer of BERT corresponding to each word and the embedding of the auxiliary features of each word. The obtained sentence vectors are given to BiLSTM-CRF, and the emotion labels of each sentence are estimated.

Proposed model
We formulated the emotion label identification problem as a sequence labeling problem. Figure 1 shows a schematic diagram of the proposed model. BERT (Devlin et al., 2019), which is a model based on the Transformer (Vaswani et al., 2017) encoder, is used as the word encoder. Self-attention (Lin et al., 2017) is calculated for the representation that combines the embedding of the top layer of BERT corresponding to each word and the embedding of the auxiliary features of each word. The obtained sentence vectors are given to BiLSTM-CRF (Lample et al., 2016), and the emotion labels of each sentence are estimated. At the time of decoding, the labels are estimated by the Viterbi algorithm.
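As a rough illustration of the decoding step, the following sketch runs Viterbi decoding over per-sentence label scores and a label-transition matrix. The function name viterbi_decode and all numeric values are hypothetical; in the proposed model the emission scores would come from the BiLSTM and the transition scores from the trained CRF layer.

```python
LABELS = ["positive", "neutral", "negative"]


def viterbi_decode(emissions, transitions):
    """Return (best_score, best_label_path) for one article.

    emissions[t][k]   : score of label k for sentence t (hypothetical values)
    transitions[j][k] : score of moving from label j to label k
    """
    n_labels = len(transitions)
    # best[k] = best score of any label path ending in label k so far
    best = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_best, pointers = [], []
        for k in range(n_labels):
            # Previous label that maximizes the path score into label k
            prev = max(range(n_labels), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + emit[k])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label
    last = max(range(n_labels), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], [LABELS[k] for k in path]


# Toy article with three sentences (values are made up for illustration)
emissions = [[2.0, 0.5, 0.1], [0.9, 1.0, 0.2], [0.1, 0.3, 2.5]]
transitions = [[0.5, 0.0, -1.0], [0.0, 0.5, 0.0], [-1.0, 0.0, 0.5]]
score, path = viterbi_decode(emissions, transitions)
print(score, path)
```

In this toy example the transition matrix penalizes a direct jump between positive and negative, so the second sentence is decoded as neutral even though its positive and neutral emission scores are close.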
The following information is used as auxiliary features: morphological information (part of speech, inflectional form, inflected type, category, domain) of JUMAN++ (Ver.1.02) (Morita et al., 2015;Tolmachev et al., 2018), named entity classes and types of dependencies obtained by applying KNP (4.19) (Kawahara and Kurohashi, 2006), distance from the top node of the dependency tree to the clause that contains the target word, the number of clauses whose destination is the clause that contains the target word, TF, IDF, TF-IDF, whether the word is included in the range of the corner bracket, clause position from the beginning of the sentence, position of the sentence in the article, position of the paragraph in the article, news category of the article, emotion polarity value of the word (using the "Japanese Sentiment Polarity Dictionary" (Kobayashi et al., 2004;Higashiyama et al., 2008), the "Semantic Orientations of Words" (Takamura et al., 2005), and the polarity dictionary included in "Models for Opinion Extraction Tool"), and whether the word is a polarity inversion expression (using the reverse expression dictionary included in "Models for Opinion Extraction Tool").

Experimental setup
We evaluated the proposed model using the constructed dataset. We used the pre-trained BERT model published by Kyoto University. This model is a BERT BASE model (Devlin et al., 2019) pre-trained on all Japanese Wikipedia articles, with the text segmented by morphological analysis using JUMAN++ (Morita et al., 2015;Tolmachev et al., 2018) and then by BPE (Byte Pair Encoding) (Sennrich et al., 2016). The dimensions of the hidden layer of the BiLSTM and of the linear layer were set to 128, and Adam was used as the optimizer. The macro F1-measure (Chinchor, 1992) and overall accuracy were adopted as evaluation metrics. The evaluation was performed by ten-fold cross validation, in which the dataset was divided into a training set (90%) and a test set (10%) for each news category. We compared the proposed model with the following two types of methods.
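The two evaluation metrics can be sketched as follows. This is a minimal illustration on made-up labels, assuming the macro F1-measure is the unweighted mean of the per-label F1 values; the function name macro_f1_and_accuracy is hypothetical.

```python
def macro_f1_and_accuracy(gold, pred, labels=("positive", "neutral", "negative")):
    """Compute the macro-averaged F1 and the overall accuracy."""
    pairs = list(zip(gold, pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Macro averaging gives every label the same weight regardless of frequency
    return sum(f1_scores) / len(f1_scores), accuracy


# Made-up gold and predicted labels for six sentences
gold = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
macro_f1, accuracy = macro_f1_and_accuracy(gold, pred)
print(round(macro_f1, 3), round(accuracy, 3))
```

Because the neutral label is less frequent than the other two in the dataset, the macro average is a useful complement to overall accuracy: a model that neglects neutral sentences is penalized more clearly.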
Baselines 1: Sentence classification methods
Random: A model that randomly selects a label.
Mode: A model that always selects the most frequent label in the dataset (i.e., negative).
Rule-best: A model in which the positive, neutral, and negative thresholds of the rule-based method are adjusted to achieve the highest performance. In the rule-based method, the emotion polarity value of a sentence is calculated from the occurrence frequencies of positive words and negative words, using a word emotion polarity dictionary and taking polarity inversion into account.
SVM: An SVM model with a linear kernel trained using the bag-of-words of the words in a sentence as features.
BERT: A model that adds a linear layer on the top layer of BERT corresponding to [CLS] and applies Softmax.
BERT+ SA: A model that applies Softmax to the vector obtained by calculating Self-attention for the combination of the embedding of the top layer of BERT corresponding to each word and the embedding of the auxiliary features of each word.

Baselines 2: Sequence labeling methods
BiLSTM-CRF: A model that inputs the bag-of-words of the words in a sentence into BiLSTM-CRF.
BERT BiLSTM-CRF: A model that inputs the embedded representations of the top layer of BERT corresponding to [CLS] into BiLSTM-CRF.
BiLSTM+ SA BiLSTM-CRF: A model that inputs into BiLSTM-CRF the vectors obtained by calculating Self-attention for a combination of the output vector of the hidden layer of a BiLSTM that takes the words of a sentence as input and the embedding of the auxiliary features of each word.
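The rule-based baseline can be illustrated with a small sketch. The dictionaries, the English tokens, the inversion rule (flip a word's polarity when the preceding token is an inverter), and the threshold values are all hypothetical simplifications; the actual method uses Japanese polarity and reverse expression dictionaries, and its thresholds are tuned on the data.

```python
# Hypothetical toy dictionaries standing in for the Japanese resources
POLARITY = {"won": 1.0, "growth": 1.0, "lost": -1.0, "accident": -1.0}
INVERTERS = {"not", "no"}


def sentence_polarity(tokens):
    """Sum word polarities, flipping a word that follows an inverter."""
    score = 0.0
    for i, token in enumerate(tokens):
        value = POLARITY.get(token, 0.0)
        if i > 0 and tokens[i - 1] in INVERTERS:
            value = -value
        score += value
    return score


def rule_label(tokens, pos_threshold=0.5, neg_threshold=-0.5):
    """Map the polarity score to a label using two (assumed) thresholds."""
    score = sentence_polarity(tokens)
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


print(rule_label("nishikori won the game".split()))
print(rule_label("the team has not lost".split()))
print(rule_label("the meeting was held today".split()))
```

Such a rule operates on each sentence in isolation, which is one reason the sequence labeling models that use surrounding context are a natural point of comparison.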

Experimental results
The proposed model had the best performance (Table 3). Models using the embedded representations of BERT outperformed the model using the bag-of-words as word features. Additionally, the sequence labeling models outperformed the sentence classification models. Furthermore, the model performance could be improved by considering the auxiliary features. Table 4 shows the values of each evaluation metric calculated by news category for BERT+ SA BiLSTM-CRF. The "local" category had the best results. This is attributed to the prevalence of news with easy-to-understand tones in the "local" category such as incidents, accidents, and efforts to revitalize the region. In addition, neutral labels had a lower estimation performance than other labels regardless of news category. This is because even if a sentence contains positive or negative expressions, it may be neutral depending on the context.

Active learning
In the human-in-the-loop machine learning framework (Munro, 2019), the model performance can be efficiently improved by preferentially annotating articles with low confidence.

Confidence and accuracy
The confidence of labeling the sentences in an article was defined as the score from Viterbi decoding divided by the number of sentences. The dataset was divided into a training set (75%) and a test set (25%), and BERT+ SA BiLSTM-CRF was trained. Figure 2 shows a scatter plot of the accuracy calculated for each article in the test set against the confidence, normalized so that the maximum value is 1. Articles with higher confidence tended to have higher accuracy. Pearson's product-moment correlation coefficient was 0.576, indicating a moderate positive correlation.
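The confidence definition and the correlation analysis can be sketched as below. All numbers are made up for illustration, the function names are hypothetical, and the normalization assumes that the largest confidence is positive.

```python
import math


def article_confidence(viterbi_score, n_sentences):
    """Viterbi path score divided by the number of sentences."""
    return viterbi_score / n_sentences


def normalize_by_max(values):
    """Scale values so that the maximum becomes 1 (assumes max > 0)."""
    top = max(values)
    return [v / top for v in values]


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# (Viterbi score, number of sentences) per test article; values are made up
decoded = [(12.0, 6), (9.0, 6), (20.0, 8), (5.0, 5)]
accuracies = [0.83, 0.67, 1.0, 0.6]  # made-up per-article accuracies

confidences = normalize_by_max([article_confidence(s, n) for s, n in decoded])
r = pearson(confidences, accuracies)
print(confidences, round(r, 3))
```

Dividing by the number of sentences keeps long articles from receiving a large score merely because more sentence scores are summed along the path.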

Human-in-the-loop machine learning
We expected that the model performance could be improved efficiently by preferentially annotating articles with low confidence. We evaluated the change in accuracy by repeating the following three steps:
(1) Deploying model: Apply the trained model to news articles with unknown labels that are not included in the training set and rank them in ascending order of confidence for each news category.
(2) Annotation: Select, one by one for each news category, the least-confident articles that match the annotation condition and annotate their emotion labels.
(3) Training model: Add the annotated articles to the training set, retrain the model, and evaluate the performance using the test set.
Figure 3 shows an image of this annotation cycle. As a comparison, we also trained a model by adding data annotated for articles selected at random for each news category, regardless of the confidence.
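The selection step of this cycle can be sketched as follows, together with the random sampling used for comparison. The article schema (dicts with id, category, and confidence), the function name, and the toy values are hypothetical; the annotation condition, the annotation itself, and retraining are outside the sketch.

```python
import random


def select_for_annotation(unlabeled, per_category=1,
                          strategy="least_confidence", seed=0):
    """Pick articles to annotate next, separately for each news category."""
    by_category = {}
    for article in unlabeled:
        by_category.setdefault(article["category"], []).append(article)
    rng = random.Random(seed)
    selected = []
    for category in sorted(by_category):
        pool = by_category[category]
        if strategy == "least_confidence":
            # Ascending confidence: the least confident article comes first
            ranked = sorted(pool, key=lambda a: a["confidence"])
        else:
            # Random sampling baseline, ignoring the confidence
            ranked = rng.sample(pool, len(pool))
        selected.extend(ranked[:per_category])
    return selected


# Toy pool of unlabeled articles with model confidences (made-up values)
pool = [
    {"id": "a1", "category": "sports", "confidence": 0.9},
    {"id": "a2", "category": "sports", "confidence": 0.3},
    {"id": "b1", "category": "business", "confidence": 0.5},
    {"id": "b2", "category": "business", "confidence": 0.7},
]
chosen = select_for_annotation(pool)
print([a["id"] for a in chosen])
```

After the chosen articles are annotated, they would be added to the training set, the model retrained, and the confidences of the remaining pool recomputed before the next loop.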

Experimental results
Figure 4 plots the change in accuracy against the number of active learning loops for the least confidence method and the random sampling method. The model trained with additional annotations of low-confidence articles improved more efficiently than the model trained with additional annotations of randomly selected articles. Figure 5 shows the error tendency of the before-the-loop model, which had an accuracy of 0.794. Figure 6 shows the error tendency of the human-in-the-loop model after five loops, which had an accuracy of 0.809. Human-in-the-loop learning reduced the errors in which positive sentences were estimated as neutral, but it increased the errors in which neutral sentences were estimated as positive, indicating that it is more difficult to distinguish between positive and neutral. In emotional speech synthesis, errors are more critical when positive sentences are mistakenly identified as negative and vice versa. Such critical errors did not exceed 15% in either model. Table 5 shows an example of a news article where all the emotion labels were correctly estimated by the human-in-the-loop model. Table 6 shows an example of a news article where the emotion labels of some sentences were incorrectly estimated by the human-in-the-loop model. The model judged sentences with the content "the registered trademark is canceled" as negative. However, a careful reading of the sentences reveals that "a trademark registered without permission is canceled" can be considered positive content. One method to correct such errors is to increase the amount of training data so that the model learns the interpretation according to the context. Another method is to introduce a mechanism into the model that can learn polarity inversion, such that content that is negative in one context becomes positive when the context changes.
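One possible way to quantify such error tendencies is a confusion matrix normalized per correct label, from which the two critical cells (positive estimated as negative, and negative estimated as positive) can be read off. This is a sketch on made-up labels; whether the figures in the paper use this exact normalization is an assumption, and the function name is hypothetical.

```python
LABELS = ["positive", "neutral", "negative"]


def row_normalized_confusion(gold, pred):
    """Return rates[g][p]: share of sentences with correct label g predicted as p."""
    counts = {g: {p: 0 for p in LABELS} for g in LABELS}
    for g, p in zip(gold, pred):
        counts[g][p] += 1
    rates = {}
    for g in LABELS:
        total = sum(counts[g].values())
        rates[g] = {p: (counts[g][p] / total if total else 0.0) for p in LABELS}
    return rates


# Made-up labels: four positive, two neutral, and four negative sentences
gold = ["positive"] * 4 + ["neutral"] * 2 + ["negative"] * 4
pred = ["positive", "positive", "neutral", "negative",
        "neutral", "positive",
        "negative", "negative", "negative", "positive"]

rates = row_normalized_confusion(gold, pred)
# Critical errors for speech synthesis: polarity flipped outright
critical = (rates["positive"]["negative"], rates["negative"]["positive"])
print(critical)
```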

Conclusion
To enable emotional speech synthesis based on news content in a spoken dialogue system, we constructed a dataset for machine learning by annotating the emotion labels for each sentence in a news article. In addition, we proposed a model that identifies the emotion label of each sentence, and evaluated its effectiveness using the constructed dataset. The model performance can be efficiently improved by preferentially annotating articles with low confidence in the human-in-the-loop machine learning framework.
In the future, we will develop a speech synthesis system that can control the emotional parameters using the emotion label estimated by the proposed model. We will also confirm whether speaking with emotion promotes users' understanding in news delivery tasks.

Table 6: Example of a news article where the emotion labels of some sentences were incorrectly estimated
Sentence | Correct label | Predicted label
It was revealed on the 2nd that the Chinese trademark office revoked the registration of two brands, "Mori Izo" and "Isami" due to the problem that potato shochus in Kagoshima prefecture, which are very popular, were registered without a trademark in China. | Positive | Negative
A company in Fukuoka prefecture, which is unrelated to the two trademarks, registered the trademark. However, Mori Izo Shuzo and Kai Shoten, which manufacture their respective shochus, were seeking cancellation. | Neutral | Negative
Following this decision, both companies applied for trademark registration with the Chinese Trademark Office. | Neutral | Neutral
The two companies challenged the Chinese authorities for unapproved applications. However, they have not been admitted, hampered by the first-to-file barrier that gives rights to earlier applicants. | Negative | Negative
This time, they used the system to cancel the trademark because there is a rule about non-use for a certain period, and the request was granted as "not used for 3 years." | Positive | Negative