WRIME: A New Dataset for Emotional Intensity Estimation with Subjective and Objective Annotations

We annotate 17,000 SNS posts with both the writer's subjective emotional intensity and the readers' objective one to construct a Japanese emotion analysis dataset. With this dataset, we explore the difference between the emotional intensity of the writer and that of the readers. We find that readers cannot fully infer the emotions of the writer, especially anger and trust. In addition, experimental results on emotional intensity estimation show that it is more difficult to estimate the writer's subjective labels than the readers'. The large gap between the subjective and objective emotions implies the complexity of the mapping from a post to subjective emotional intensities, which also leads to lower performance with machine learning models.


Introduction
Emotion analysis is one of the major NLP tasks, with a wide range of applications such as dialogue systems (Tokuhisa et al., 2008) and social media mining (Stieglitz and Dang-Xuan, 2013). Beyond classifying the sentiment polarity (positive or negative) of a text (Socher et al., 2013), recent work has attempted more detailed emotion detection and emotional intensity estimation (Bostan and Klinger, 2018). Previous studies on emotion analysis use the six emotions (anger, disgust, fear, joy, sadness, and surprise) of Ekman (1992), the eight emotions (anger, disgust, fear, joy, sadness, surprise, trust, and anticipation) of Plutchik (1980), or the VAD model (Valence, Arousal, and Dominance) of Russell (1980). Table 1 lists datasets with emotional intensity. In this paper, the emotions of the text writers themselves are called subjective emotions, and the emotions that readers perceive from the text are called objective emotions.
Whether to estimate the writer's emotions or the reader's depends on the application of NLP-based emotion analysis. For example, in a dialogue system, it is important to estimate the reader's emotion, because we want to know how the user feels in response to the system's utterance. On the other hand, in applications such as social media mining, we want to estimate the writer's emotion. In yet other applications, such as story generation, it is worth considering the difference between the emotions the writer wants to express and the emotions the reader perceives. As shown in Table 1, most existing datasets contain only objective emotions. Therefore, previous studies on emotion analysis have focused on estimating objective emotional intensity.
In this study, we introduce a new dataset, WRIME, for emotional intensity estimation. We collect both the subjective emotional intensity of the writers themselves and the objective one annotated by readers, and explore the differences between them. (EmoBank (Buechel and Hahn, 2017) also aims to collect the emotional intensity of both writers and readers; however, its crowdsourced annotators, who are not the text writers, merely infer the writers' emotions, so it does not capture the writers' truly subjective emotions.)

Table 1: List of datasets with emotional intensity. In the "Emotion" column, E6 denotes the six emotions of Ekman (1992): anger, disgust, fear, joy, sadness, surprise; P8 denotes the eight emotions of Plutchik (1980): anger, disgust, fear, joy, sadness, surprise, trust, anticipation; and M4 denotes the four emotions of Mohammad et al.: joy, sadness, anger, fear.

In our data collection, we hired 50 participants via a crowdsourcing service. They annotated their own past posts on a social networking service (SNS) with the subjective emotional intensity. We also hired 3 annotators, who annotated all posts with the objective emotional intensity. Consequently, our Japanese emotion analysis dataset consists of 17,000 posts with both subjective and objective emotional intensities for Plutchik's eight emotions (Plutchik, 1980), each given on a four-point scale (no, weak, medium, and strong). Our comparative study of subjective and objective labels demonstrates that readers may not infer the emotions of the writers well, especially anger and trust. For example, for more than half of the posts that writers labeled with strong anger, our readers (i.e., the objective annotators) did not assign the anger label at all. Overall, readers tend to underestimate the writers' emotional intensities.
In addition, experimental results on emotional intensity estimation with BERT (Devlin et al., 2019) show that predicting the subjective labels is more difficult than predicting the objective ones. This large gap between the subjective and objective annotations implies the challenge of predicting subjective emotional intensity for a machine learning model, which can be viewed as a "reader" of the posts.

Related Work
To estimate the emotional intensity of text, datasets labeled with Ekman's six emotions (Ekman, 1992) and Plutchik's eight emotions (Plutchik, 1980) have been constructed for languages such as English, as shown in Table 1. EmoBank (Buechel and Hahn, 2017; https://github.com/JULIELab/EmoBank), the dataset most relevant to ours, labels the emotional intensity of both the writers and the readers of the text. However, the annotators for EmoBank are not the writers; readers are asked to guess the writer's emotion, so, strictly speaking, the dataset contains only objective labels. Our dataset is the first to collect the subjective emotional intensity of the writers themselves.
ISEAR (Scherer and Wallbott, 1994) is a dataset with subjective emotional labels, in which annotators describe their own past events associated with each emotion. Its label set adds shame and guilt to Ekman's six emotions. Although ISEAR is the only existing dataset with subjective emotional labels, it does not record their intensity.
Early datasets of objective emotional labels were annotated by experts. Aman and Szpakowicz (2007) labeled each sentence of English blog posts with Ekman's six emotions and their intensity on a three-point scale. Strapparava and Mihalcea (2007) labeled English news headlines with Ekman's six emotional intensities and organized the SemEval-2007 Task 14 competition. In recent years, many studies have collected objective emotional labels via crowdsourcing. Mohammad and Bravo-Marquez (2017a) and Mohammad and Kiritchenko (2018) labeled tweets in English, Arabic, and Spanish with the intensity of four emotions (joy, sadness, anger, and fear). Using these datasets, they held a series of competitions on emotional intensity estimation: the WASSA-2017 Shared Task on Emotion Intensity (Mohammad and Bravo-Marquez, 2017b) and SemEval-2018 Task 1. Some datasets (Kaji and Kitsuregawa, 2006; Suzuki, 2019) are available in Japanese. However, they only label sentences with sentiment polarity and do not cover the variety of emotions dealt with in this study. Our study is the first to label Japanese texts with intensities for a variety of emotions.

Annotating Subjective Labels
We hired 50 participants via the crowdsourcing service Lancers (https://www.lancers.jp/). The participants comprise 22 men and 28 women: 2 are teens, 26 are in their 20s, 18 are in their 30s, and 4 are 40 or older. They copied and pasted their own past SNS posts and labeled each post with the subjective emotional intensity for Plutchik's eight emotions (Plutchik, 1980) on a four-point scale (0: no, 1: weak, 2: medium, and 3: strong). They did not provide us with all of their posts, but chose only those they agreed to publish. Because our goal is emotion analysis from text, posts with images or URLs were excluded. Each participant labeled 100 to 500 posts, resulting in 17,000 posts in total. We did not restrict when the posts were originally written; as a result, our dataset spans nine years, from June 2011 to May 2020. We assumed that each post would require 50 seconds to annotate and paid 21.5 JPY per post, which roughly corresponds to 15 USD per hour, a reasonable reward for crowdsourcing. (For reference, Prolific, a popular crowdsourcing service, sets a minimum payment of 6.5 USD per hour: https://www.prolific.co/pricing)

To assess the quality of the annotations, we randomly sampled 30 posts for each participant. One of our graduate students evaluated the posts and the corresponding eight emotional intensity labels on a four-point scale based on the following criteria.

• 3: I fully agree with the label given.
• 2: I can find the relevance between the post and label.
• 1: I hardly find the relevance between the post and label.
• 0: I do not think the annotator seriously engaged for this post.
The average score per participant was 2.1, with a minimum of 1.8 and a maximum of 2.5. No posts were rated 0. Five annotators had an average score below 2, but reviewing their posts and labels did not reveal obvious signs of improper annotation.

Annotating Objective Labels
We hired three objective annotators via the same crowdsourcing service as in Section 3.1: two women in their 30s and one woman in her 40s. They labeled all 17,000 posts with Plutchik's eight emotional intensities (Plutchik, 1980) in the same way as in the subjective annotation. Note that while the subjective annotators labeled their own emotions as the writer of each post, the objective annotators labeled each post with the emotions they perceived from it. Since the objective annotators do not have to provide the text themselves, their task is simply to label emotional intensity. We assumed that each post would take 10 seconds to annotate and paid 3.8 JPY per post, which corresponds to roughly 13 USD per hour.
To assess the quality of the annotations, we calculated the quadratic weighted kappa (Cohen, 1968) as a metric of inter-annotator agreement. The upper part of Table 2 shows the agreement between the objective annotators. The best case, joy, shows substantial agreement (κ > 0.6), while trust shows only fair agreement (κ < 0.4). Overall, we confirmed moderate agreement (0.5 < κ < 0.6) among the objective annotators.
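The quadratic weighted kappa can be computed as follows. This is a minimal NumPy sketch of the standard formulation, not the tooling actually used in the study; scikit-learn's cohen_kappa_score with weights='quadratic' computes the same statistic.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes=4):
    """Quadratic weighted kappa (Cohen, 1968) between two raters
    whose labels are integers in {0, ..., n_classes - 1}."""
    a, b = np.asarray(a), np.asarray(b)
    # Observed agreement matrix.
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under chance: outer product of the marginals.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: w_ij = (i - j)^2 / (n - 1)^2.
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields κ = 1, and the quadratic weights penalize a 0-vs-3 disagreement nine times more heavily than a 0-vs-1 disagreement, which suits ordinal intensity labels.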
The lower part of Table 2 shows the agreement between the subjective and the objective annotators. These are discussed in Section 4.2.

Writers' Personality Assessment
We also performed personality assessments of our writers (i.e., the subjective annotators) in order to explore the relationship between personality and emotion. Through 60 questions (Saito et al., 2001) based on the Big Five personality traits (Goldberg, 1992), the following five factors were assessed: agreeableness, extraversion, neuroticism, openness, and conscientiousness. In this assessment, each writer reports, on a 7-point scale, how well each of 60 adjectives, such as "cheerful" and "honest", applies to them, and the five personality factors are derived from these reports. Figure 1 shows the results of the personality assessment over all 50 writers, revealing a variety of personalities. For example, well-balanced writers appear near the center of the figure, and writers with low neuroticism appear in the lower right. In Section 5, we show how personality helps to improve emotional intensity estimation.

Table 3 shows some examples of labeled posts in our dataset. The first post was written with strong emotions of both joy and anticipation; the readers felt emotions similar to the writer's. The second post was written with strong emotions of both sadness and anger; the readers shared the sadness, but felt surprise rather than anger.

Table 4 shows the distribution of emotional intensity labels. For all emotions, intensity 0 is assigned most frequently. This is not surprising, as it is rare for a single post to carry many emotions, which may even contradict each other, at the same time (90% of posts have fewer than four emotions simultaneously). However, for anger and trust, about 95% of the labels by the objective annotators have intensity 0, which is particularly high. In other words, with regard to anger and trust, readers may tend to underestimate the emotions of the writers. We can also see characteristics of each objective annotator; e.g., Reader 1 rarely assigns intensity 1.

Difference between Writers and Readers
The lower part of Table 2 shows the agreement between the subjective and the objective annotators, again calculated as the quadratic weighted kappa (Cohen, 1968), as for the agreement between the objective annotators in Section 3.2. Agreement between the subjective and objective annotators is lower than agreement between the objective annotators (the upper part of Table 2). Especially for anger, there is a large gap between the reader-reader agreements and the writer-reader agreements. For trust, the writer-reader agreement is even lower, although the reader-reader agreements are also low. These results imply a large difference between subjective and objective emotions.

Table 5 shows the confusion matrix between the subjective and objective emotional intensity labels for each emotion; it is the sum, in percent, of three sub-matrices, one per reader. For example, among the posts where the writer labeled intensity 0 for joy, the percentages where a reader labeled intensities 0, 1, 2, and 3 were 91.7%, 3.1%, 4.0%, and 1.2%, respectively. This confusion matrix shows the fine-grained differences in emotional intensity between writers and readers, reinforcing our discussion in Section 3 that readers hardly detect the emotions associated with a post. Focusing on anger, in 58.6% of the posts where the writer labeled intensity 3 (strong anger), the reader labeled intensity 0 (no anger). This is even more prominent for trust: for 81.5% of the posts that the writer labeled intensity 3, the reader labeled intensity 0. This clearly demonstrates that readers cannot infer the writer's trust. As for the other emotions, readers are most likely to label intensity 0 for posts labeled with intensity 2 or less by the writer. Overall, the readers tend to underestimate the writer's emotions, and they rarely label intensity 1 or more when the writer labels intensity 0.
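A row-normalized confusion matrix of this kind can be computed as follows. This is a minimal NumPy sketch with hypothetical label arrays; each row (a writer intensity) sums to 100%, matching how the percentages above are read.

```python
import numpy as np

def writer_reader_confusion(writer, reader, n_classes=4):
    """Confusion matrix where entry [w, r] is the percentage of posts
    with writer intensity w that received reader intensity r."""
    M = np.zeros((n_classes, n_classes))
    for w, r in zip(writer, reader):
        M[w, r] += 1
    rows = M.sum(axis=1, keepdims=True)
    # Normalize each row to percentages; rows with no posts stay zero.
    return np.divide(100.0 * M, rows, out=np.zeros_like(M), where=rows > 0)
```

Summing such matrices over the three readers, as in Table 5, simply means concatenating the three readers' label arrays before computing the counts.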

Emotional Intensity Estimation
We conduct experiments on four-class ordinal classification, estimating the emotional intensity {0, 1, 2, 3} with the dataset constructed in Section 3.

Experimental Settings
In this experiment, we divided the dataset into a training set of 15,000 posts from 30 writers, a validation set of 1,000 posts from 10 writers, and an evaluation set of 1,000 posts from 10 writers; each training writer provided 500 posts, and each validation and test writer provided 100 posts. That is, no writer appears in more than one split. We used MeCab (IPADIC-2.7.0; Kudo et al., 2004; https://taku910.github.io/mecab/) to tokenize the Japanese text. The performance of the emotional intensity estimation models is evaluated by the mean absolute error (MAE) and the quadratic weighted kappa (QWK). We evaluate each model against both the emotional intensity labels given by the subjective annotators (subjective labels) and the average of the labels given by the three objective annotators (objective labels).

Following standard emotional intensity estimation models (Acheampong et al., 2020), we train the following three types of four-class classification models for each emotion.
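The writer-disjoint split described above can be sketched as follows. The posts data layout and the split_by_writer helper are hypothetical illustrations, not the released code; the point is that the split is made over writers, never over individual posts.

```python
import random

def split_by_writer(posts, n_train=30, n_val=10, seed=0):
    """Split posts so that every post by a given writer lands in exactly
    one of train/val/test (writer-disjoint evaluation)."""
    writers = sorted({p["writer"] for p in posts})
    random.Random(seed).shuffle(writers)
    train = set(writers[:n_train])
    val = set(writers[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for p in posts:
        key = "train" if p["writer"] in train else "val" if p["writer"] in val else "test"
        splits[key].append(p)
    return splits
```

A writer-disjoint split prevents a model from exploiting writer-specific style cues memorized from training posts when it is evaluated.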
• BoW+LogReg extracts Bag-of-Words features and estimates the emotional intensity with Logistic Regression.
• fastText+SVM vectorizes each word with fastText (Bojanowski et al., 2017) and estimates the emotional intensity with a Support Vector Machine over the average word vector.
• BERT fine-tunes a pretrained BERT model (Devlin et al., 2019) and estimates the emotional intensity as y = softmax(hW), where h is the feature vector of BERT's [CLS] token. We investigate the performance of BERT trained with subjective labels (Subj. BERT) and BERT trained with objective labels (Obj. BERT), each evaluated on both subjective and objective labels.
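The classification head y = softmax(hW) can be sketched in NumPy as follows; h and W here are random placeholders standing in for the [CLS] feature vector and the learned weight matrix.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def intensity_head(h, W):
    """Four-class head over the [CLS] feature vector h: y = softmax(hW).
    Returns a probability distribution over intensities {0, 1, 2, 3}."""
    return softmax(h @ W)
```

During fine-tuning, W is trained jointly with the BERT encoder under a cross-entropy loss; the predicted intensity is the argmax of y.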
We also evaluate the following two baselines.
• Random outputs one of the four emotional intensity labels {0, 1, 2, 3} randomly with the uniform distribution.
• Modal Class always outputs the most frequent intensity label for each emotion. As shown in Table 4, in this dataset, intensity 0 has the highest frequency for all emotions, so in practice, this baseline always gives label 0.
For the BERT-based models, we used the implementation in Transformers (Wolf et al., 2020). We used the whole-word-masking model with a batch size of 32, a dropout rate of 0.1, a learning rate of 2e-5, and Adam (Kingma and Ba, 2015) for optimization. Training was stopped after 3 epochs without improvement in validation loss.
In the evaluation of subjective labels, the personality of the writer is incorporated into Subj. BERT in the following two ways.

• w/ Pc: The personality representation is simply concatenated with the text representation h, and the concatenation is used for emotional intensity estimation.

• w/ Pa: Feature extraction is performed with h_a = attention(uW_Q, vW_K, vW_V) in consideration of personality. That is, in the attention mechanism, the personality representation u is used as the query, and the text representation v is used as both the key and the value. h_a is used instead of h for emotional intensity estimation.

Results
The performance of each model on subjective and objective labels is shown in Tables 6 and 7, respectively. Regardless of the method, subjective label estimation yields a larger mean absolute error than objective label estimation. We argued above that it is difficult for readers to estimate the emotions of writers; the same holds for machine learning models.

Evaluation with Subjective Labels
In the evaluation of subjective labels, the traditional models BoW+LogReg and fastText+SVM achieved lower mean absolute errors than the Random baseline, but were inferior to the Modal Class baseline. The BERT-based methods achieved mean absolute errors lower than the Modal Class baseline. Surprisingly, Obj. BERT, trained with objective labels, rather than Subj. BERT, trained with subjective labels, achieved the highest performance. Since subjective labels, which reflect the writer's own emotions, are difficult to estimate, a simple model may not provide sufficient performance.
Therefore, we examined Subj. BERT w/ Pc and Subj. BERT w/ Pa, which use the writer's personality information to assist training. Subj. BERT w/ Pc, which simply concatenates the personality representation and the text representation, was not effective, but Subj. BERT w/ Pa, which weights the text representation via attention with the personality representation, achieved higher performance than plain Subj. BERT. The QWK evaluation also shows the usefulness of the writer's personality information. However, even with personality information, the performance is not comparable with that of Obj. BERT. Improving methods for accurate estimation of subjective emotions is left as future work.
Below the dotted line in Table 6, the performance of the human readers is shown for comparison. Estimating the emotional intensity of writers is difficult for both human readers and machine learning models.

Evaluation with Objective Labels
In the evaluation of objective labels (Table 7), the traditional models BoW+LogReg and fastText+SVM were comparable to the Modal Class baseline. As in the evaluation with subjective labels, the BERT-based models achieved mean absolute errors lower than the Modal Class baseline, and Obj. BERT achieved the highest performance.
Below the dotted line in Table 7, the performance of the human readers is shown for comparison. Note that the objective labels are the average of each of these readers. Compared to each reader, Obj. BERT does not reach human performance.

Conclusion
We introduced a new dataset, WRIME, for Japanese emotional intensity estimation. Based on Plutchik's eight emotions (Plutchik, 1980), our dataset labels SNS posts with both the writer's subjective emotional intensity and the readers' objective one.
Overall, the readers tend to underestimate the writer's emotions: even the writer's strong emotions often go undetected, especially anger and trust.
Experimental results on emotional intensity estimation show that it is more difficult to estimate the writer's subjective labels than the readers' objective ones. The large gap between subjective and objective emotions implies the complexity of the mapping from a text to subjective emotional intensities, which also leads to lower performance with machine learning models.
Estimating the writer's subjective emotions with higher accuracy remains future work. We have shown that considering the writer's personality can improve subjective emotional intensity estimation. It may also be worth considering other meta-information about the writer, such as their past posting history.

Ethical Considerations
We ensure that our work is conformant to the ACM Code of Ethics.