GoEmotions: A Dataset of Fine-Grained Emotions

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Progress in this area can be accelerated by large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.


Introduction
Emotion expression and detection are central to the human experience and social interaction. With as few as a handful of words we are able to express a wide variety of subtle and complex emotions, and it has thus been a long-term goal to enable machines to understand affect and emotion (Picard, 1997).
In the past decade, NLP researchers made available several datasets for language-based emotion classification for a variety of domains and applications, including for news headlines (Strapparava and Mihalcea, 2007), tweets (CrowdFlower, 2016; Mohammad et al., 2018), and narrative sequences (Liu et al., 2019), to name just a few. However, existing datasets (1) are mostly small, containing up to several thousand instances, and (2) cover a limited emotion taxonomy, with coarse classification into Ekman (Ekman, 1992b) or Plutchik (Plutchik, 1980) emotions.
Recently, Bostan and Klinger (2018) have aggregated 14 popular emotion classification corpora under a unified framework that allows direct comparison of the existing resources. Importantly, their analysis suggests annotation quality gaps in the largest manually annotated emotion classification dataset, CrowdFlower (2016), containing 40K tweets labeled for one of 13 emotions. While their work enables such comparative evaluations, it highlights the need for a large-scale, consistently labeled emotion dataset over a fine-grained taxonomy, with demonstrated high-quality annotations.
To this end, we compiled GoEmotions, the largest human annotated dataset of 58k carefully selected Reddit comments, labeled for 27 emotion categories or Neutral, with comments extracted from popular English subreddits. Table 1 shows an illustrative sample of our collected data. We design our emotion taxonomy considering related work in psychology and coverage in our data. In contrast to Ekman's taxonomy, which includes only one positive emotion (joy), our taxonomy includes a large number of positive, negative, and ambiguous emotion categories, making it suitable for downstream conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.
We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.
We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.
We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories and .69 over a sentiment grouping. These results leave much room for improvement, showing that this task is not yet fully addressed by current state-of-the-art NLU models.
We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

Emotion Datasets
Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition, was introduced, the field has seen several emotion datasets that vary in size, domain and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly-labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human annotated dataset, with multiple annotations per example for quality assurance.
Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairytales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994), among other domains. We are the first to build on Reddit comments for emotion prediction.

Emotion Taxonomy
One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the 6 basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a) and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).
Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex "semantic space" of emotion (Cowen et al., 2019a) by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition and study the dimensionality of language-based emotion space.

Emotion Classification Models
Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). Using representations from BERT (Devlin et al., 2019), a transformer-based model with language model pretraining, has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments and we find that it outperforms our biLSTM model.

GoEmotions
Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotion(s) or Neutral.

Selecting & Curating Reddit comments
We use a Reddit data dump originating in the reddit-data-tools project, which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.
Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Reddit content has thus been used to study depression (Pirina and Çöltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns, and enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure our data does not reinforce general, nor emotion-specific, language biases. We identify harmful comments using pre-defined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These are used for data filtering and masking, as described below. Lists were internally compiled and we believe they are comprehensive and widely useful for dataset curation; however, they may not be complete.
Reducing profanity. We remove subreddits that are not safe for work and where 10%+ of comments include offensive/adult and vulgar tokens. We remove remaining comments that include offensive/adult tokens. Vulgar comments are preserved as we believe they are central to learning about negative emotions. The dataset includes the list of filtered tokens.
Manual review. We manually review identity comments and remove those offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.
Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment length, we perform downsampling, capping by the number of comments with the median token count (12).
Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model, trained on a pilot batch of 2.2k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments or less than 20% of negative, positive, or ambiguous comments.
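The subreddit-level thresholds can be sketched as a simple filter. This is a minimal illustration rather than the pipeline used for the dataset: the function name `keep_subreddit` and the count dictionary format are hypothetical, and reading the 20% threshold as applying to each of the three non-neutral categories separately is an assumption.

```python
def keep_subreddit(sentiment_counts, max_neutral=0.30, min_other=0.20):
    """Return True if a subreddit passes the sentiment-balance thresholds.

    sentiment_counts maps "positive", "negative", "ambiguous" and "neutral"
    to the number of comments predicted with that sentiment (assumed format).
    """
    total = sum(sentiment_counts.values())
    if total == 0:
        return False
    share = {k: v / total for k, v in sentiment_counts.items()}
    # Too many neutral comments: exclude the subreddit.
    if share.get("neutral", 0.0) > max_neutral:
        return False
    # Assumption: each non-neutral category must reach the minimum share.
    return all(share.get(k, 0.0) >= min_other
               for k in ("positive", "negative", "ambiguous"))


balanced = {"positive": 25, "negative": 25, "ambiguous": 25, "neutral": 25}
mostly_neutral = {"positive": 20, "negative": 20, "ambiguous": 20, "neutral": 40}
print(keep_subreddit(balanced), keep_subreddit(mostly_neutral))
```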
Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labelled data, capping by the number of comments belonging to the median emotion count.
Subreddit balancing. To avoid overrepresentation of popular subreddits, we perform downsampling, capping by the median subreddit count.
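The emotion and subreddit balancing steps both cap each group at the size of the median group. A minimal sketch of that idea follows; the helper name `cap_at_median`, the random seed, and the tuple layout of the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import median


def cap_at_median(items, key, seed=0):
    """Downsample so that no group exceeds the median group size.

    `key` maps an item to its group (e.g. predicted emotion or subreddit).
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    cap = int(median(len(members) for members in groups.values()))
    rng = random.Random(seed)
    kept = []
    for members in groups.values():
        if len(members) > cap:
            members = rng.sample(members, cap)  # random subset of size cap
        kept.extend(members)
    return kept


# Toy data: (subreddit, comment id) pairs with group sizes 10, 4 and 2.
comments = ([("a", i) for i in range(10)]
            + [("b", i) for i in range(4)]
            + [("c", i) for i in range(2)])
balanced = cap_at_median(comments, key=lambda c: c[0])
print(len(balanced))  # the largest group is cut down to the median size of 4
```

Groups smaller than the cap are left untouched, so the procedure flattens the head of the distribution without inflating rare groups.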
From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.
Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.

Taxonomy of Emotions
When creating the taxonomy, we seek to jointly maximize the following objectives.
1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data, and ran a pilot task where raters could suggest emotion labels on top of the pre-defined set.
2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.
3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion in annotated labels.
The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

Annotation
We assigned three raters to each example. For those examples where no raters agree on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.
Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to select only those that they were reasonably confident were expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.
The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotion in our taxonomy, they were presented a table containing all emotion categories aggregated by sentiment (by the mapping in Figure 2) and whether that emotion is generally expressed towards something (e.g. disapproval) or is more of an intrinsic feeling (e.g. joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI, to further ease their interpretation.

Data Analysis
Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label and have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels; we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.
Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g. admiration is 30 times more frequent than grief), despite our emotion and sentiment balancing steps taken during data selection. This is expected given the disparate frequencies of emotions in natural human expression.

Interrater Correlation
We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019). For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean of other raters' judgments, for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation, after controlling for several potential confounds.
Figure 1 shows that gratitude, admiration and amusement have the highest, and grief and nervousness the lowest, interrater correlation. Emotion frequency correlates with interrater agreement but the two are not equivalent. Infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).
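The interrater correlation for a single emotion can be sketched in pure Python as follows. This is an illustrative reading of the description above, not the authors' code: the data layout (one dictionary of binary judgments per example), the function names, and the choice to skip raters whose judgments are constant (for whom the correlation is undefined) are assumptions.

```python
from statistics import mean


def _ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # undefined for a constant vector
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return _pearson(_ranks(x), _ranks(y))


def interrater_correlation(ratings):
    """ratings: one dict {rater_id: 0 or 1} per example, for one emotion."""
    raters = {r for example in ratings for r in example}
    scores = []
    for r in raters:
        own, others = [], []
        for example in ratings:
            if r in example and len(example) > 1:
                own.append(example[r])
                others.append(mean(v for k, v in example.items() if k != r))
        rho = spearman(own, others) if len(own) >= 2 else None
        if rho is not None:
            scores.append(rho)
    return mean(scores) if scores else None
```

For example, if every rater gives the same judgment on every example (and the judgments vary across examples), each rater-level correlation is 1 and so is the average.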

Correlation Among Emotions
To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain N-dimensional vectors for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g. annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated. We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and ward as a linkage method, applied to the averaged ratings. The dendrogram on the top of Figure 2 shows that emotions that are related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g. surprise) are closer to the positive than to the negative category. This suggests that in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
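The pairwise correlation step can be sketched with NumPy. The toy matrix and the function name `emotion_correlations` are made up for illustration; only the correlation heatmap is covered here, and the hierarchical clustering step is left out.

```python
import numpy as np


def emotion_correlations(avg_ratings):
    """Pearson correlation between emotions.

    avg_ratings has shape (N, E): entry [n, e] is the mean rating of
    emotion e on example n. Returns an (E, E) correlation matrix.
    """
    return np.corrcoef(np.asarray(avg_ratings, dtype=float), rowvar=False)


# Toy averaged ratings; columns are anger, annoyance, joy (illustrative values).
avg = [
    [1.00, 1.00, 0.00],
    [0.67, 1.00, 0.00],
    [0.00, 0.00, 1.00],
    [0.00, 0.33, 1.00],
    [0.00, 0.00, 0.67],
]
corr = emotion_correlations(avg)
print(corr.round(2))
```

In this toy matrix, anger and annoyance co-occur and therefore correlate positively, while both correlate negatively with joy, mirroring the pattern described for Figure 2.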

Principal Preserved Component Analysis
To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of emotion that have high agreement across raters.
Algorithm 1 (leave-one-rater-out analysis), surviving steps:
[...]
J ∈ R^{n×|R|×|E|} ← all ratings for the examples annotated by r
X, Y ∈ R^{n×|E|} ← randomly split J^{−r} and average ratings across raters for both sets
[...]
for all components† w_i, i ∈ {1, ..., |E|}, in W do
[...]
end for
end for
C ← Wilcoxon signed rank test on C
C ← Bonferroni correction on C (α = 0.05)
† in descending order of eigenvalue; ‡ we demean vectors before projection
Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^{N×|E|}, where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.
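The eigendecomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `ppca` is hypothetical, and demeaning each column before forming the matrix is an assumption.

```python
import numpy as np


def ppca(X, Y):
    """Principal preserved components of two (N, E) rating matrices.

    Returns eigenvalues and eigenvectors of X^T Y + Y^T X, sorted by
    descending eigenvalue. Columns of the second output are the components.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)  # assumption: demean each emotion column
    Yc = Y - Y.mean(axis=0)
    S = Xc.T @ Yc + Yc.T @ Xc  # symmetrized cross-covariance, shape (E, E)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Toy example: Y is a noisy copy of X, standing in for two rater splits.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
Y = X + 0.1 * rng.standard_normal((200, 4))
vals, vecs = ppca(X, Y)
print(vals)
```

Because the matrix is symmetric by construction, `numpy.linalg.eigh` applies and returns real eigenvalues with orthonormal eigenvectors. A large positive eigenvalue indicates a direction along which the two splits covary strongly.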
Extracting significant dimensions. We remove examples labeled as Neutral, and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.
We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial. For example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.
t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot, where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

Linguistic Correlates of Emotions
We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category contrasting to all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g. gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g. grief and nervousness) tend to have low interrater correlation. These results suggest certain emotions are more verbally implicit and may require more context to be interpreted.
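The z-scored log odds ratio with an informative Dirichlet prior can be sketched as below, following the formulation in Monroe et al. (2008). The function name `log_odds_z`, the toy token counts, and the use of the combined counts of both groups as the prior are assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_z(counts_a, counts_b, prior):
    """Z-scored log odds of each token in group A versus group B.

    counts_a, counts_b: token counts for the two groups being contrasted.
    prior: pseudo-counts per token (positive; at least two token types).
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        delta = (math.log((ya + a_w) / (n_a + a0 - ya - a_w))
                 - math.log((yb + a_w) / (n_b + a0 - yb - a_w)))
        variance = 1.0 / (ya + a_w) + 1.0 / (yb + a_w)  # approximate variance
        z[w] = delta / math.sqrt(variance)
    return z


# Toy corpora: tokens from one emotion versus tokens from all other emotions.
gratitude = Counter("thanks thanks thanks for the help".split())
others = Counter("the game was the worst for sure".split())
prior = gratitude + others  # assumption: prior taken from the pooled counts
scores = log_odds_z(gratitude, others, prior)
print(max(scores, key=scores.get))
```

Tokens with a large positive score are over-represented in the first group. On a corpus of realistic size, scores above 3 would correspond to the threshold used above; the toy counts here are far too small to reach it.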

Modeling
We present a strong baseline emotion prediction model for GoEmotions.

Data Preparation
To minimize the noise in our data, we filter out emotion labels selected by only a single annotator.
We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%) and test (10%) sets. We only evaluate on the test set once the model is finalized.
We release all 58K examples with all annotators' ratings.
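The label filtering and splitting steps can be sketched as follows. The data layout (a text paired with one label list per rater), the function names, and the random seed are assumptions for illustration.

```python
import random
from collections import Counter


def filter_labels(rater_labels, min_raters=2):
    """Keep only the labels selected by at least `min_raters` raters."""
    counts = Counter(label for labels in rater_labels for label in set(labels))
    return sorted(label for label, c in counts.items() if c >= min_raters)


def prepare(examples, seed=0):
    """examples: list of (text, [labels of rater 1, labels of rater 2, ...])."""
    kept = []
    for text, rater_labels in examples:
        labels = filter_labels(rater_labels)
        if labels:  # drop examples left without any agreed label
            kept.append((text, labels))
    random.Random(seed).shuffle(kept)
    n_train = len(kept) * 8 // 10
    n_dev = len(kept) // 10
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    test = kept[n_train + n_dev:]
    return train, dev, test


print(filter_labels([["joy"], ["joy", "love"], ["anger"]]))  # only "joy" has agreement
```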
Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous and Neutral), with the Neutral category intact, and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse) and surprise (all ambiguous emotions).
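The Ekman-level mapping can be written as a small lookup. Only the groups spelled out in the text are hard-coded; the full lists of positive and ambiguous emotions come from Figure 2, so in this sketch they are passed in by the caller. The function name `to_ekman`, the lowercase label strings, and the one-element sets in the usage line are illustrative assumptions.

```python
# Groups whose members are listed explicitly in the text.
EKMAN_EXPLICIT = {
    "anger": {"anger", "annoyance", "disapproval"},
    "disgust": {"disgust"},
    "fear": {"fear", "nervousness"},
    "sadness": {"sadness", "disappointment", "embarrassment", "grief", "remorse"},
}


def to_ekman(label, positive, ambiguous):
    """Map a fine-grained label to its Ekman-level group.

    `positive` and `ambiguous` are the sets of emotions with that sentiment
    (supplied by the caller, following the mapping in Figure 2).
    """
    if label == "neutral":
        return "neutral"
    for group, members in EKMAN_EXPLICIT.items():
        if label in members:
            return group
    if label in positive:
        return "joy"  # all positive emotions map to joy
    if label in ambiguous:
        return "surprise"  # all ambiguous emotions map to surprise
    raise KeyError(f"no Ekman group known for {label!r}")


# Illustrative, incomplete sentiment sets.
print(to_ekman("annoyance", positive={"joy"}, ambiguous={"surprise"}))
```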

Model Architecture
We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
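To make the multi-label loss concrete, the sketch below computes a sigmoid cross entropy for one example in pure Python. It treats each label as an independent binary decision, which is what allows several emotions to be active at once. The function name and the averaging over labels are assumptions; a deep learning framework's built-in loss would be used in practice.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Mean binary cross entropy between sigmoid(logit) and 0/1 targets."""
    total = 0.0
    for x, z in zip(logits, targets):
        # Numerically stable form of -z*log(s) - (1-z)*log(1-s), s = sigmoid(x).
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Two labels, the first is present and the second is absent.
confident_right = sigmoid_cross_entropy([5.0, -5.0], [1, 0])
confident_wrong = sigmoid_cross_entropy([-5.0, 5.0], [1, 0])
print(confident_right < confident_wrong)
```

Unlike a softmax loss, this formulation does not force the label probabilities to sum to one.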

Parameter Settings
When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and learning rate of 5e-5 yield the best performance.
For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95.We apply a dropout of 0.7.

Results
Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std=.19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.8) and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15) and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.
Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from full to Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.
The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model and .6 for a sentiment-grouped model.
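An average F1-score across emotion categories can be computed as sketched below, assuming it is the unweighted mean of per-label F1 (a macro average). The function names and toy labels are illustrative. Note that a rare label that is never predicted correctly contributes an F1 of 0, which pulls the average down regardless of its frequency.

```python
def f1_per_label(y_true, y_pred, labels):
    """Per-label F1 for multi-label data; y_true, y_pred are lists of sets."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


y_true = [{"joy"}, {"grief"}, {"joy", "love"}]
y_pred = [{"joy"}, {"sadness"}, {"joy"}]
print(f1_per_label(y_true, y_pred, ["joy", "grief", "love"]))
```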

Transfer Learning Experiments
We conduct transfer learning experiments on existing emotion benchmarks, in order to show our data generalizes across domains and taxonomies. The goal is to demonstrate that given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.
[...]ity and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, and all are included in Appendix H.
The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness and shame.
EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets.The labels are intensity annotations for anger, joy, sadness, and fear.We obtain binary annotations for these emotions by using .5 as the cutoff.

Experimental Setup
Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.
We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.
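The split generation can be sketched as follows. The function names, the seed, and the dictionary of size names are assumptions for illustration; the method used to compute the confidence intervals is not specified above and is not covered here.

```python
import random


def train_sizes(n_examples):
    """Training set sizes used per target dataset; "max" is 80% of the data."""
    return {"100": 100, "200": 200, "500": 500, "1000": 1000,
            "max": n_examples * 8 // 10}


def make_splits(n_examples, train_size, n_splits=10, seed=0):
    """Return n_splits random (train_indices, test_indices) pairs.

    All examples not drawn for training are held out as the test set.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        splits.append((indices[:train_size], indices[train_size:]))
    return splits


splits = make_splits(n_examples=8000, train_size=train_sizes(8000)["200"])
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```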
Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, learning rate at 2e-5 and number of epochs at 3 for all experiments.

Results
The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model first finetuned on GoEmotions can help in cases when there is limited data from the target domain, or limited resources for labeling.
Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE, for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

Conclusion
We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings, and extend the taxonomy to other languages and domains.
Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and its user base, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in the assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

E BERT's Most Activated Layers
To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
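The center of gravity is the expected layer index under the softmax-normalized mixing weights. A minimal sketch follows, assuming layers are indexed from 0 (the embedding layer) to 12 for BERT-base; the function name and the uniform example weights are illustrative.

```python
import math


def center_of_gravity(mixing_weights):
    """Expected layer index under softmax(mixing_weights).

    mixing_weights[l] is the unnormalized scalar weight of layer l.
    """
    m = max(mixing_weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in mixing_weights]
    total = sum(exps)
    return sum(layer * e / total for layer, e in enumerate(exps))


# Uniform weights over 13 layers (indices 0..12) put the center at the middle.
print(center_of_gravity([0.0] * 13))
```

Under this indexing assumption, perfectly uniform weights give a center of 6, so a value of 6.19 indicates weights that are spread almost evenly, with a slight lean towards the upper layers.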

F Number of Emotion Labels Per Example
Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

G Confusion Matrix
Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly as we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_{i,j} denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy} and M_{admiration,pride}.
In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.
Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
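The counting and row normalization described above can be sketched with nested dictionaries. The function names are illustrative; the usage lines reproduce the joy/admiration example from the text.

```python
from collections import defaultdict


def confusion_counts(y_true, y_pred):
    """Multi-label confusion counts: M[i][j] += 1 for every true label i
    and every predicted label j of the same example."""
    M = defaultdict(lambda: defaultdict(int))
    for true_labels, pred_labels in zip(y_true, y_pred):
        for i in true_labels:
            for j in pred_labels:
                M[i][j] += 1
    return M


def normalize_rows(M):
    """Divide each row (one true label) by its row sum."""
    normalized = {}
    for i, row in M.items():
        total = sum(row.values())
        normalized[i] = {j: count / total for j, count in row.items()}
    return normalized


M = confusion_counts([{"joy", "admiration"}], [{"joy", "pride"}])
print(normalize_rows(M))
```

With a single example whose true labels are joy and admiration and whose predicted labels are joy and pride, four cells are incremented, and each row normalizes to 0.5 per predicted label.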
We also perform hierarchical clustering over the normalized confusion matrix using correlation as a distance metric and ward as a linkage method. We find that the model learns relatively similar clusters as the ones in Figure 2, even though the training data only includes a subset of the labels that have agreement (see Figure 5).
We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ∼4k labeled examples and a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".
For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.

Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori and it shows that the clusters closely map onto sentiment groups.

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator.

Table 1: Example annotations from our dataset.

Table 2: Summary statistics of our labeled data.

Table 4: Results based on GoEmotions taxonomy.

Table 5: Results based on sentiment-grouped data.