CancerEmo: A Dataset for Fine-Grained Emotion Detection

Emotions are an important element of human nature, often affecting the overall wellbeing of a person. Therefore, it is no surprise that the health domain is a valuable area of interest for emotion detection, as it can provide medical staff or caregivers with essential information about patients. However, progress on this task has been hampered by the absence of large labeled datasets. To this end, we introduce CANCEREMO, an emotion dataset created from an online health community and annotated with eight fine-grained emotions. We perform a comprehensive analysis of these emotions and develop deep learning models on the newly created dataset. Our best BERT model achieves an average F1 of 71%, which we improve further using domain-specific pre-training.


Introduction
Life-threatening diseases such as cancer and AIDS make people extremely vulnerable and stir a diverse range of feelings and emotions in them, e.g., from fear to trust or joy and from anger to surprise or sadness. These feelings and emotions shape a person's behavior, beliefs, and actions, and many turn to online health communities to share their health concerns and emotions. Recent research shows that this form of sharing is very beneficial to a patient's progress and well-being. For example, Qiu et al. (2011) show that cancer patients feel better and change to positive attitudes when they interact with others during or after the disease. Pollak et al. (2007) show that less anxiety and depression lead to better adherence to cancer care therapies.
The sharing of emotions in online health communities on topics such as treatment, medication, side effects, moods, and the disease itself has resulted in a large amount of user-generated content in the form of discussions. This, together with the fact that people find it easier to express themselves and reveal personal details in health forums than in a face-to-face context (Kummervold et al., 2002), makes online health communities a great place to examine and study patients' emotions at a large scale using computational models.
However, although emotion detection has started to emerge in the health domain, the lack of large annotated datasets in the field greatly hinders the capabilities of supervised techniques and limits an understanding of fine-grained expressions of emotions at a large scale. For example, available datasets contain only about 1,000 sentences annotated with Ekman's six basic emotions. Since some emotions appear very rarely in the annotated set, only the most frequent ones, joy and sadness, are analyzed (Khanpour and Caragea, 2018).
In this paper, we explore fine-grained emotion detection in online health communities and present a large dataset for this task. Specifically, we introduce CANCEREMO, a health-related dataset composed of 8,500 sentences annotated with emotions, taken out of 25,000 sentences sampled from an online cancer survivors network. This network, which is designed for patients suffering from cancer and for their caregivers, friends, and families, contains several discussion boards grouped by cancer type, where users can start a discussion thread or comment on messages in an existing thread. We construct our dataset from the breast, lung, and prostate cancer discussion boards, since there are higher stakes involved for patients with these types of disease. For example, breast cancer is the most common cancer among women, accounting for about 18% of all women's cancers (McPherson et al., 2000); lung cancer is the leading cause of death among men and second among women (Torre et al., 2016), while prostate cancer is the third leading cause of cancer deaths in the United States (Haas et al., 2008). Our dataset is fine-grained, being annotated with the Plutchik-8 basic emotions (Plutchik, 1980): anger, fear, disgust, sadness, surprise, anticipation, trust, and joy. We use crowd-sourcing and apply quality control measures to exclude spurious annotations.
Detecting emotions is inherently challenging, requiring a deep understanding of the writer's beliefs and reasoning, especially when dealing with health-related data. To illustrate some of these challenges, we present examples from our dataset in Table 1, and discuss a few patterns. For example, in the sentence I just cant stand seeing her like this, we can easily notice the writer's discontent, despite the absence of emotion-rich words. Our data also includes a great deal of medical terminology, which adds another layer of complexity to the language used across the discussion boards. For example, in My cancer was very rare, non invasive Mucusom Cancer, in order to predict the perceived emotions (fear and sadness), computational models must distinguish whether Mucusom Cancer is a dangerous or harmless disease. In addition, a sentence may express a mixture of emotions, not just one. We further speculate that distantly supervised techniques that rely on lexical information to collect emotion-rich data (Abdul-Mageed and Ungar, 2017) are unable to capture these subtleties in a health domain, and we reinforce this idea in §3.
Our contributions in this paper are as follows: (1) We create CANCEREMO, a novel health-related dataset for fine-grained emotion detection composed of 8,500 sentences. We study how emotions are distributed in our dataset and how they co-occur with each other. We further analyze the associations of emotions with topics such as medical procedures, side effects, and drugs, and with events or activities that happen in the past, present, and future; (2) We experiment on the fine-grained emotion detection task and establish strong baselines based on BERT and variants; (3) We study different supervised and unsupervised pre-training techniques and reveal the importance of choosing the right pre-training domain.
Interestingly, despite the importance of emotion detection in the health domain, computational studies for this task are limited. Specifically, most of these studies focus mainly on identifying two types of social support from online health communities (OHCs): emotional (Eysenbach et al., 2004) or informational (Boon et al., 2007). Along the same lines, Wang et al. (2012b) used Linear Regression to predict the degree of emotional or informational support from an OHC related to breast cancer, while Biyani et al. (2014) studied the presence of such support in breast and lung cancer data using models such as Naïve Bayes, Support Vector Machines, and Logistic Regression with part-of-speech tags and bag-of-words. Wang et al. (2014) studied social support using lexical and sentiment features, and analyzed user engagement in OHCs. Yang et al. (2019a), on the other hand, modeled social roles in OHCs. They used a Gaussian mixture model to identify coherent roles such as emotional support provider, informational support provider, newcomer, or all-round expert. The features they used range from linguistic behaviors and network features (i.e., relationship with other users) to features regarding the context of communication (i.e., public or private). Khanpour and Caragea (2018) highlighted the need to examine emotions from health-related posts at a finer granularity and used annotators to label two datasets with Ekman's six basic emotions (Ekman, 1992). The authors trained a hybrid neural model composed of a word-level Convolutional Neural Network followed by a Long Short Term Memory network. However, given the limited size of the annotated datasets (about 1,000 sentences each) and the fact that most emotions were extremely infrequent, the analysis could only be performed on the most frequent emotions: joy and sadness.
In contrast to the above works, we study the Plutchik-8 basic emotions and present CANCEREMO, which, to our knowledge, is the first large health dataset for the fine-grained emotion detection task, being more than eight times larger than the currently available datasets of Khanpour and Caragea (2018).
CANCEREMO enables complex explorations of deep learning models including pre-trained language models, such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019b) and RoBERTa (Liu et al., 2019b), which achieve state-of-the-art performance on several NLP tasks. We use the aforementioned pre-trained language models, fine-tune the models on our dataset, then compare these approaches with baselines from Traditional and Deep Natural Language Processing.

Task Structure
Corpus We choose an online cancer network as the basis of our data, which we will call CancerNet throughout the paper. CancerNet was founded in 2002 and represents a platform for people suffering from cancer as well as for their caregivers, friends, and families to socialize, share experiences and emotions, and feel supported. We collected the data from the network's beginning through 2018. The network consists of multiple discussion boards, corresponding to different types of cancer. To create our dataset, we randomly sampled sentences from the discussion boards corresponding to three frequent types of cancer: breast, lung and prostate (BLP). We model the emotion detection task at the sentence level since longer messages usually contain multiple topics and can switch between many emotions from one sentence to another (Biyani et al., 2014).
Objective Given a predefined set of emotions (the Plutchik-8 basic emotions), the goal is to label a sentence with all emotions contained in it, i.e., to identify all emotions conveyed in a piece of text.
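As an illustration of this multi-label objective, the sketch below encodes the set of emotions conveyed by a sentence as an 8-dimensional binary vector. The names `EMOTIONS`, `encode_labels`, and `decode_labels` are hypothetical, and the ordering of the emotions is an arbitrary choice made for this example rather than something fixed by the dataset.

```python
# Plutchik-8 basic emotions; the order is an arbitrary choice for this sketch.
EMOTIONS = ["anger", "fear", "disgust", "sadness",
            "surprise", "anticipation", "trust", "joy"]


def encode_labels(emotions):
    """Map a set of emotion names to an 8-dimensional 0/1 vector."""
    unknown = set(emotions) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [1 if name in emotions else 0 for name in EMOTIONS]


def decode_labels(vector):
    """Inverse of encode_labels: return the emotion names whose flag is set."""
    return [name for name, flag in zip(EMOTIONS, vector) if flag]


# A sentence conveying both joy and fear, and a sentence conveying no emotion.
mixed = encode_labels({"joy", "fear"})
neutral = encode_labels(set())
```

A sentence with no emotion maps to the all-zero vector, so the same representation covers both the emotional and the non-emotional sentences.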

Task Construction
Sampling Strategy Current datasets for emotion detection usually rely on some form of biased sampling, e.g., using emotion words as a proxy for sampling. For example, Abdul-Mageed and Ungar (2017) used cues in the data (i.e., emotion hashtags) to collect and further annotate a large Twitter dataset with emotions, while making the strong assumption that a sentence can only express one emotion. We argue that a sentence can not only express emotions even in the absence of emotion words but also convey multiple emotions, as shown in Table 1 in §1. Thus, we sample at random 25,000 sentences from the BLP boards and annotate them using crowd-sourcing. This sampling strategy also helps us analyze how many of the sampled sentences convey emotions and how many sentences that do not contain emotion words (i.e., do not have surface lexical patterns) in fact appear to convey emotions.
Annotation To annotate our data, we use the Amazon Mechanical Turk (AMT) crowd-sourcing platform. The emotion definitions provided to the annotators are shown in Appendix A. We ran the annotation task in several iterations in order to develop our quality control steps. Initially, we internally annotated a batch of 100 sentences using all emotions that apply from all 8 Plutchik's emotions, in a multi-class setting. Then, we explored two settings with the AMT annotators. First, we designed a form that asked annotators to select all emotions that apply for a sentence and used the same batch of 100 sentences for analysis. We noticed that the task was very difficult and resulted in low inter-annotator agreement. Second, we created a separate annotation form for each emotion: for an emotion x, a form asks the annotators to annotate a sentence with true or false, i.e., if a sentence contains x, the label is true, otherwise it is false. We again used the same batch of 100 sentences for analysis. We noticed that this task was much easier and resulted in higher inter-annotator agreement among the AMT annotators, as well as a much higher agreement with our internal annotations. Thus, for our final annotation, we chose the latter approach over creating a single annotation form for all eight emotions, in order to ease annotation and prevent any implicit associations annotators might make: one might refrain from assigning both fear and joy to the same sentence, although they can in fact appear together; such an example is shown in Table 1.
We use three annotators for each sentence, and the final label for a specific emotion is computed through majority vote. We guard against spam by excluding annotators who are inconsistent with the majority vote in more than 25% of the cases. We compute the inter-annotator agreement using Krippendorff's Alpha, and obtain an average value of α = 0.69 on all emotions. We also studied the per-emotion agreement, and observed lower inter-annotator agreement on the emotion anticipation, which, in line with our beliefs, was the hardest emotion to distinguish, with α = 0.5. Emotions such as joy, sadness, and fear produced a higher agreement, with α = 0.75.

Analysis
Emotion Distribution Table 2 shows the number of sentences annotated with no emotions and with 1-4 emotions. Interestingly, out of the 25,000 sampled sentences, 16,500 sentences (66%) do not contain any emotions at all, and only 8,500 contain at least one emotion, out of which 16% contain two or more emotions. Figure 1 shows the distribution of our 8 emotions in the 8,500 sentences. We can notice that the distribution is very unbalanced: joy, fear and sadness appear most frequently, accounting for about 75% of the data, while anticipation, anger, surprise, disgust, and trust appear rarely, a few orders of magnitude less than the frequent ones. It is interesting to see that joy is the most prevalent, despite dealing with a cancer forum. Table 3 shows the number of sentences annotated with no emotions (EMOSENT−) and with one or more emotions (EMOSENT+) and, for each category, the number of sentences that contain at least one emotion word from EmoLex (Mohammad and Turney, 2013).
Table 2: Number of sentences (#SENT) annotated with 0 to 4 emotions.
#EMOTIONS   0        1       2       3    4
#SENT       16,500   7,098   1,292   96   14
EmoLex is a word-emotion lexicon composed of a list of English emotion-rich words and their associations with Plutchik's eight basic emotions. As an example, the sentence "He is always in pain .. (chest and back pain) and has trouble swallowing pills." contains an emotion word, pain, which is associated with sadness in EmoLex. The sentence is annotated with sadness by our annotators as well. In contrast, the sentence "I just miss him so much.....we would hold hands every night" does not contain any emotion word from EmoLex and is annotated with sadness by our annotators. Moreover, the sentence "So get a second opinion and don't be afraid to change doctors." contains the emotion-rich word afraid, which is associated with fear in EmoLex, whereas the sentence conveys no emotion at all (and is annotated with no emotion by our annotators). Notably, 10% of the sentences annotated with emotions do not contain EmoLex words, while 23% of the sentences with EmoLex words do not convey any emotion. We further use EmoLex to compare sentences with and without EmoLex emotion words with respect to how difficult it is to distinguish the emotions present in them. For each of the eight emotions, we separate sentences with EmoLex emotion words from those without EmoLex emotion words and calculate the AMT inter-annotator agreement. Interestingly, we find that the agreement is higher for sentences with EmoLex words only for anger, anticipation, fear, joy, and trust, and is lower on sadness, surprise, and disgust.
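The two mismatch rates discussed here (emotional sentences without lexicon words, and lexicon-word sentences without emotions) can be computed with a few lines of code. The following is a minimal sketch under stated assumptions: `TOY_LEXICON` is a hypothetical two-entry stand-in for EmoLex, the tokenization is a simple regular expression rather than whatever preprocessing was actually used, and the three example sentences are the ones quoted in the text.

```python
import re

# Hypothetical two-entry stand-in for EmoLex (word -> associated emotions).
TOY_LEXICON = {"pain": {"sadness"}, "afraid": {"fear"}}


def has_lexicon_word(sentence, lexicon):
    """True if any lowercase token of the sentence appears in the lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(token in lexicon for token in tokens)


def lexicon_mismatch_rates(examples, lexicon):
    """examples: list of (sentence, set of gold emotions).

    Returns a pair:
      - share of emotional sentences that contain no lexicon word
      - share of lexicon-word sentences that convey no emotion
    """
    emotional = [s for s, gold in examples if gold]
    with_word = [gold for s, gold in examples if has_lexicon_word(s, lexicon)]
    missing = sum(1 for s in emotional if not has_lexicon_word(s, lexicon))
    neutral = sum(1 for gold in with_word if not gold)
    rate_missing = missing / len(emotional) if emotional else 0.0
    rate_neutral = neutral / len(with_word) if with_word else 0.0
    return rate_missing, rate_neutral


examples = [
    ("He is always in pain .. (chest and back pain) and has trouble swallowing pills.", {"sadness"}),
    ("I just miss him so much.....we would hold hands every night", {"sadness"}),
    ("So get a second opinion and don't be afraid to change doctors.", set()),
]
rates = lexicon_mismatch_rates(examples, TOY_LEXICON)
```

On these three sentences both rates are one half; the 10% and 23% figures reported above come from the full annotated data, not from this toy input.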
Emotion co-occurrence Since each sentence can be annotated with multiple emotions, we study which emotions tend to appear in the same context with others through a co-occurrence heatmap, shown in Figure 2. We use a logarithmic scale for a better visualization of the less frequent emotions. As expected, emotion pairs like fear-sadness or trust-joy commonly occur together. However, we observe quite a few unusual co-occurrences (of even opposing emotions) such as fear-joy or joy-sadness. For example, in the sentence "Yesterday they told me they didnt see anything which brought tears of joy, but also a wave of fear.", we speculate that the writer is expressing joy because of recent good medical analysis results, but at the same time fear, facing the possibility of the disease reappearing. When humans become emotional, they may indeed experience a mixture of emotions (not just one). We allow multi-labels for the same text to capture this mixture of emotions.
Emotion Associations with Past, Present, or Future Events or Activities We investigate whether user posts are more emotional about events or activities that happen in the past, present, or future, and how these emotions are distributed along these three dimensions. For example, in the sentence "I just cant stand seeing her like this", the writer's discontent is expressed towards an event in the present, while in "I have been through the worst fear when I started to have the pain.", the expressed emotions are relative to an event in the past. We study this using the Stanford CoreNLP Natural Language Software (Manning et al., 2014) in three steps: first, we perform dependency parsing to extract the verb phrase in a sentence; then, we take the POS tag of the verb in the verb phrase to get the sentence tense; finally, we examine how the emotion conveyed in the sentence relates to the identified verb tense. Figure 3 shows the results obtained.
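The tense step of this pipeline can be sketched as a small set of rules over POS tags. The sketch below is only an assumption about how such a mapping might look: it takes an already extracted and tagged verb phrase (the parsing itself is not reproduced here), assumes Penn Treebank tags, and uses a deliberately crude rule set; `tense_from_tags` and `FUTURE_MODALS` are hypothetical names, and the exact rules used in the analysis are not stated in the text.

```python
# Modal tokens treated as future markers in this sketch.
# "wo" covers the Penn Treebank-style tokenization of "won't" as "wo" + "n't".
FUTURE_MODALS = {"will", "shall", "'ll", "wo"}


def tense_from_tags(tagged_verb_phrase):
    """Crude tense heuristic over a verb phrase.

    tagged_verb_phrase: list of (token, Penn Treebank POS tag) pairs.
    Returns "future", "past", "present", or None if no verb tag is found.
    """
    tags = [tag for _, tag in tagged_verb_phrase]
    # A future modal anywhere in the phrase wins.
    if any(tag == "MD" and token.lower() in FUTURE_MODALS
           for token, tag in tagged_verb_phrase):
        return "future"
    # Past tense or past participle forms are treated as past.
    if "VBD" in tags or "VBN" in tags:
        return "past"
    if any(tag in {"VB", "VBP", "VBZ", "VBG"} for tag in tags):
        return "present"
    return None


present_example = tense_from_tags([("cant", "MD"), ("stand", "VB"), ("seeing", "VBG")])
past_example = tense_from_tags([("have", "VBP"), ("been", "VBN")])
future_example = tense_from_tags([("will", "MD"), ("start", "VB")])
```

Treating every past participle as past matches the "I have been through" example, but it would also mislabel present passives such as "is treated", so a real implementation would need finer rules.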
We observe that events or activities in the present are frequently discussed across all emotions. Anticipation is, as expected, rarely discussed in the past, as well as anger and trust. Surprise, sadness, and fear on the other hand are conveyed more frequently towards past events or activities. We can also notice that emotions are associated most often with events or activities in the present.
Topics Recognizing how patients feel about different medical topics can provide information about potential causes for the conveyed emotions. These topics are frequently discussed in OHCs and range from prescribed drugs to side effects of medication and medical procedures. We study how these medical topics relate to patients' emotions by using three medical lexicons specifically created for our cancer domain, which contain words and phrases associated with medical procedures, side effects of medication, and drugs. We collected these lexicons from online resources such as Wikipedia and WebMD. These medical topics are extremely important from a practical point of view, as they can provide insight into how patients react to their medication, or what side effects they may be experiencing. We match words from the three lexicons to our dataset, then study how emotions correlate with these topics. We report our findings in Figure 4. As we can see from the figure, interestingly, the topic on Drugs is discussed most frequently (across all emotions), while the topics on Side Effects and Medical Procedures appear more often in sentences conveying fear or sadness as compared to joy.
Benchmark Dataset To enable development on the fine-grained emotion detection task in health-related posts, we construct a benchmark dataset. We group the positive examples (sentences conveying one or more emotions) into eight pools, one for each emotion; a sentence is part of a pool if the sentence is annotated with the respective emotion. Recall that a sentence can convey more than one emotion, so it can be part of two different pools at the same time. Next, we sample an equal number of negative examples for each pool using the following strategy: one third (1/3) are sampled from the sentences that convey no emotions, while the other two thirds (2/3) are sampled from the positive examples of the other pools. We followed this strategy in order to create a challenging negative set for each emotion. We sample an equal number of positives and negatives because the imbalanced emotion distribution would otherwise lead to an extremely skewed ratio of positive to negative samples. Next, we randomly create an 80/10/10 train, validation and test split. We present specific details about each split in Appendix B.
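The pool construction described above can be sketched as follows. This is an illustrative sketch, not the released code: `build_pool` and `split_80_10_10` are hypothetical names, the seed handling is an assumption, and the sketch assumes that negatives drawn from other pools exclude any sentence that also carries the target emotion.

```python
import random


def build_pool(target, examples, seed=0):
    """Build the balanced pool for one emotion.

    examples: list of (sentence_id, set of emotions).
    Returns (positives, negatives) with equally many ids in each.
    """
    rng = random.Random(seed)
    positives = [sid for sid, emos in examples if target in emos]
    no_emotion = [sid for sid, emos in examples if not emos]
    # Assumption: positives of other pools that do not carry the target emotion.
    other_emotion = [sid for sid, emos in examples if emos and target not in emos]

    n_neutral = len(positives) // 3          # one third: sentences with no emotion
    n_other = len(positives) - n_neutral     # remaining two thirds: other pools
    if n_neutral > len(no_emotion) or n_other > len(other_emotion):
        raise ValueError("not enough negative candidates for this pool")
    negatives = rng.sample(no_emotion, n_neutral) + rng.sample(other_emotion, n_other)
    return positives, negatives


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/validation/test with an 80/10/10 ratio."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Toy data: 30 joy sentences, 30 fear sentences, 40 sentences with no emotion.
examples = ([(i, {"joy"}) for i in range(30)]
            + [(i, {"fear"}) for i in range(30, 60)]
            + [(i, set()) for i in range(60, 100)])
positives, negatives = build_pool("joy", examples)
train, val, test = split_80_10_10(positives + negatives)
```

Drawing most negatives from sentences that convey a different emotion is what makes each pool hard: the classifier cannot succeed by merely separating emotional from non-emotional text.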
To facilitate future research, we make our code available along with all other resources of this project (for research purposes).

Baseline Modeling
We model the Plutchik-8 basic set of emotions in CANCEREMO using the following methods:

Statistical and Machine Learning Methods
We experiment with (1) EmoLex, a simple annotation scheme based on EmoLex words' emotions: we label a sentence with the union of the emotion labels of the EmoLex words (Mohammad and Turney, 2013) contained in the sentence, or no emotion if no EmoLex words appear in the sentence.
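The EmoLex labeling rule amounts to a union over lexicon lookups. A minimal sketch follows, assuming a plain dictionary from words to emotion sets; `TOY_LEXICON` is a hypothetical two-entry stand-in for the real lexicon, and the regular-expression tokenizer is an assumption of this example.

```python
import re

# Hypothetical two-entry stand-in for EmoLex (word -> associated emotions).
TOY_LEXICON = {"pain": {"sadness"}, "afraid": {"fear"}}


def emolex_predict(sentence, lexicon):
    """Label a sentence with the union of the emotions of its lexicon words.

    Returns an empty set (no emotion) if no lexicon word appears.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    predicted = set()
    for token in tokens:
        predicted |= lexicon.get(token, set())
    return predicted


pred_pain = emolex_predict("He is always in pain and has trouble swallowing pills.", TOY_LEXICON)
pred_afraid = emolex_predict("So get a second opinion and don't be afraid to change doctors.", TOY_LEXICON)
pred_none = emolex_predict("I just miss him so much", TOY_LEXICON)
```

The second call shows the kind of false positive this scheme produces: the word afraid triggers fear even though the sentence was annotated with no emotion.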

Pre-Trained Language Models
Recently, pre-trained language models have risen in popularity because they use transfer learning, the process of storing information learned from one task and applying it to another. The process usually involves unsupervised pre-training on a large corpus, followed by a less computationally expensive fine-tuning, performed on the task at hand. We experiment with three models: (1) BERT (Devlin et al., 2018), (2) RoBERTa (Liu et al., 2019b), a variant of BERT which underwent significantly more pre-training, and (3) XLNet (Yang et al., 2019b), which has a different language modeling objective than BERT called Permutation Language Modeling.

Experiments and Results
In this section, we present the set of experiments performed on the fine-grained emotion detection task on CANCEREMO, and show the results obtained using the aforementioned baselines.
Experimental Setting All the traditional neural network models were tested with pre-trained FastText (Bojanowski et al., 2017) word embeddings. The LSTM-based models have 300 hidden units and a dropout rate of 0.5. For the CNN, we follow the best hyper-parameters presented by Kim (2014). For the pre-trained language models, we start from the best reported hyper-parameters and perform a bi-directional linear sweep. More details on the fine-tuning techniques and the hyper-parameter values used for the best models can be found in Appendix C. The reported results represent the average of five independent runs. All experiments were carried out on an NVIDIA V100 GPU.
Results Table 4 shows the results in terms of F1-score, obtained using BERT-like models compared with the other weaker baselines. We can observe that EmoLex performs very poorly, reinforcing our premise that lexical level information in the form of emotion words does not necessarily reveal the emotion conveyed. Interestingly, the Conv-Bi-LSTM model manages to improve upon the other statistical and standard neural network methods by as much as 5%. The BERT base model is extremely successful across all emotions, greatly outperforming all the other baselines by 4% F1 on average.
Next, we explore intermediate task pre-training to understand if this improves the performance of our BERT models further (Pruksachatkun et al., 2020; Han and Eisenstein, 2019).

CANCEREMO is created from a health forum, i.e., a network of cancer survivors that we call CancerNet (or CNet for short). Thus, our data differs substantially from the pre-training domain of BERT (Devlin et al., 2018) (Wikipedia and Bookcorpus). As Xia and Ding (2019) noted, domain-adaptive fine-tuning (i.e., adapting the contextualized embeddings to the target domain) might implicitly incorporate inductive biases and improve the performance of the models. To investigate this, we perform an additional set of comprehensive experiments with the best performing model from the previous experiment: BERT. The experimental pipeline consists of two steps: starting from a pre-trained BERT model, we (1) perform an unsupervised or supervised pre-training on an intermediate pre-training task, followed by (2) fine-tuning on the target task, which is always the fine-grained emotion detection on CANCEREMO.
Intermediate Tasks The unsupervised pretraining is performed using the Masked Language Modeling objective, while the supervised pretraining is carried out by adding a linear layer, followed by fine-tuning on the emotion detection task. The intermediate tasks are as follows: (1) Unsupervised EmoNet EmoNet (Abdul-Mageed and Ungar, 2017) is a Twitter dataset composed of tweets automatically annotated using distant supervision with Plutchik-24 emotion set. We obtained a smaller version of the dataset from the authors which contains the Plutchik-8 basic emotions. We pre-train the BERT model on all EmoNet sentences.
(2) Unsupervised CNet We pre-train the BERT model on all CancerNet sentences, hoping to implicitly learn information specific to the health domain.
(3) Unsupervised Filtered CNet We use lexical features to filter CancerNet. To this end, we implicitly induce both health-specific and emotion-specific biases, by only pre-training on CancerNet sentences [...] (Abdul-Mageed and Ungar, 2017). We use a linear layer to perform the fine-grained emotion classification task on EmoNet, and after achieving an F1 of 0.83, we drop this layer. Next, the target fine-tuning on CANCEREMO is performed using a freshly initialized linear layer.

Results
The results in terms of F1-score obtained are compared with the BERT models in Table 5. In the unsupervised setting, we observe a few patterns. First, unsupervised pre-training on EmoNet (Abdul-Mageed and Ungar, 2017) largely hurts downstream performance. Second, approaches inducing health specific biases from CNet and Clinical perform better than BERT on sadness, joy and anticipation. Third, Clinical Filtered CNet consistently outperforms all the other models by as much as 5% on sadness, joy, fear and anticipation, while keeping the same overall F1-score on the other 4 emotions. We speculate that this happens because the pre-training corpus used is very close to the task domain, and we manage to implicitly induce both emotion-specific and health-specific biases. Last, interestingly, the supervised intermediate task pre-training on EmoNet improves the performance on emotions like sadness, joy, and anticipation, but performs similarly or degrades the performance on the other emotions. Still, the Supervised EmoNet performs much better compared with the Unsupervised EmoNet.
Takeaways One should pay close attention when dealing with very narrow domains like emotion or health, where the pre-training corpus greatly influences the performance of the models, and the right pre-training can improve the performance.

Emotion Word Testing
A substantial number of sentences annotated with emotions by our annotators in CANCEREMO do not contain any emotion words from EmoLex (§3.3). Thus, we now investigate if the absence of emotion words affects the model performance. To this end, to depict a real scenario, we keep the train set unchanged and divide the test set in two: one set contains only sentences that have at least one emotion word, while the other contains only sentences without emotion words. As Table 6 shows, testing on sentences with emotion words provides a considerable 8% average F1 increase over sentences with no emotion words. Next, we perform the same experiment using the Unsupervised Filtered CNet method. Surprisingly, the performance improves on both test sets (with and without emotion words) on several emotions, e.g., sadness, joy and fear.
[...] all emotions. Second, the improvement of our best performing model (Clinical Filtered CNet) over the BERT model with no additional pre-training is statistically significant on sadness, joy and anticipation, but not on fear.
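The emotion-word test split described above can be sketched as follows. This is a minimal illustration under stated assumptions: it scores a single emotion as a binary task with positive-class F1 (the exact averaging behind the reported numbers is not specified in the text), `f1_score` and `f1_by_emotion_word` are hypothetical helper names, and the eight rows are made-up toy predictions, not results from the paper.

```python
def f1_score(gold, pred):
    """Positive-class F1 for binary labels; 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_emotion_word(rows):
    """rows: list of (has_emotion_word, gold_label, predicted_label).

    Splits the test rows by whether the sentence contains a lexicon emotion
    word and scores each half separately; the training data is untouched.
    """
    scores = {}
    for flag, name in ((True, "with_words"), (False, "without_words")):
        subset = [(g, p) for has_word, g, p in rows if has_word == flag]
        scores[name] = f1_score([g for g, _ in subset], [p for _, p in subset])
    return scores


# Toy predictions for one emotion (illustrative only).
rows = [
    (True, 1, 1), (True, 1, 1), (True, 0, 0), (True, 0, 1),
    (False, 1, 1), (False, 1, 0), (False, 0, 0), (False, 0, 0),
]
scores = f1_by_emotion_word(rows)
```

Keeping the training set fixed and splitting only the test set isolates the effect of emotion words at prediction time, which is the comparison Table 6 reports.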
Next, using our best Clinical Filtered CNet BERT model, we manually investigate test errors to understand potential drawbacks of the model. We observe the following: First, the model often performs poorly on sentences with abbreviations or writing errors. For example, in the sentence "As i will have alot of time, cuz i cant really sleep any significant amount of sleep.", although the expressed emotion is sadness, the model assigns no emotion to it. Next, some errors arise from antithetic emotions in the same sentence. For example, the model assigns sadness to the following sentence: "Still get tired but it's better every day." Although the first part of the sentence could convey sadness, the overall emotion expressed is joy.
Finally, we construct confusion matrices to visualize commonly mislabeled classes, shown in Figure 5. We use a logarithmic scale to be able to better picture less frequent classes such as surprise, disgust, trust and anticipation. The EmoLex (Mohammad and Turney, 2013) visualization shows the poor performance of the lexicon approach, and reflects the results reported in Table 4. Next, we investigate commonly mislabeled classes by BERT and Clinical BERT, and observe a few patterns. For example, the most common mislabeling for the fear emotion is sadness and vice-versa, while quite a few sentences conveying disgust are annotated with sadness and fear.

Conclusion and Future Work
We introduced CANCEREMO, a cancer-related health dataset for perceived emotion detection, which is an order of magnitude larger and more fine-grained compared with previous datasets for health-related emotion detection. Composed of 8,500 sentences that convey at least one emotion, and 16,500 sentences that convey no emotion at all, CANCEREMO is a challenging benchmark for fine-grained emotion detection, as shown by our results. We believe that CANCEREMO is novel and has unique characteristics: 1) it covers a large spectrum of emotions, being annotated with the Plutchik-8 fine-grained emotions; 2) it is large enough for exploring deep learning models; and 3) it provides an invaluable context (cancer) for dealing with emotions. The value of our dataset also arises from the expressions of emotions even in the absence of emotion words and the expressions of mixtures of (sometimes opposing) emotions in the same text. We believe that these characteristics make our dataset both interesting and challenging, and we hope that our work will spur future research in emotion detection from health data, especially in the context of life-threatening diseases such as cancer. Our dataset, which is anonymized and follows ethical considerations, can be used as a benchmark for both multi-class and multi-label emotion detection.
In the future, we plan to study how contextual information (i.e., different aspects of people's interactions captured through contiguous posts in a discussion thread) affects the perceived emotions. We also plan to perform a cross-corpus analysis to investigate if emotions are expressed differently in the health domain compared to other domains. Finally, we will carry out a thorough investigation into emotion-cause pairs (Xia and Ding, 2019). Specifically, in the health domain, the cause that leads to an emotion expressed in text can be just as important as the emotion itself. A deeper understanding of emotion causes can potentially help make people feel better.

Table 7: Emotion definitions provided in the task instructions.
SADNESS: The condition or quality of being sad.
JOY: A feeling of great pleasure and happiness.
FEAR: An unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat.
ANGER: A strong feeling of annoyance, displeasure, or hostility.
SURPRISE: An unexpected or astonishing event, fact, or thing.
DISGUST: A feeling of revulsion or strong disapproval aroused by something unpleasant or offensive.
TRUST: Firm belief in the reliability, truth, ability, or strength of someone or something.
ANTICIPATION: The action of anticipating something; expectation or prediction. Similarly, anticipation is a feeling of excitement about something pleasant or exciting that you know is going to happen.
Table 7 shows the emotion definitions provided in the task instructions, which annotators have to read before starting to label the data.

B Split Details
We present the emotion counts in every train/val/test split in Table 8. We color the emotion counts of the split in question. For instance, the first train/val/test line corresponds to the sadness split, as the column corresponding to sadness is colored.

C Hyperparameters
We present the hyperparameters obtained by tuning in Tables 9 and 10. The highest variance in the results is obtained by varying the learning rate, which we tune the most. For each emotion, we start from an initial value of 5e-05, then search for 5 iterations forward and backward in steps of 1e-05. This type of tuning is performed for each emotion, and took 2 days in total on our V100 GPU. We use a batch size of 64 for the traditional baselines, but only 16 for BERT and RoBERTa and 8 for XLNet due to GPU RAM restrictions.
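The learning rate sweep can be written down as a small candidate generator. The sketch below is an assumption about the search grid rather than the actual tuning script: `learning_rate_candidates` is a hypothetical helper, it expresses the grid in integer multiples of the step to avoid floating-point drift, and it drops the fifth backward step because it would give a learning rate of zero.

```python
def learning_rate_candidates(center_steps=5, n_steps=5, step=1e-05):
    """Candidate learning rates around center_steps * step.

    Searches n_steps forward and backward in increments of `step`.
    Non-positive values are dropped (an assumption of this sketch):
    with the defaults, the fifth backward step from 5e-05 would be 0.
    """
    lowest = center_steps - n_steps
    highest = center_steps + n_steps
    return [k * step for k in range(lowest, highest + 1) if k > 0]


candidates = learning_rate_candidates()
```

With the default arguments this yields ten candidates from 1e-05 to 1e-04, one fine-tuning run per candidate and per emotion, which is consistent with the tuning being the most time-consuming part of the setup.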