XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection

We introduce XED, a multilingual fine-grained emotion dataset. The dataset consists of human-annotated Finnish (25k) and English sentences (30k), as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages. We use Plutchik’s core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset. The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets and is therefore a useful tool for sentiment analysis and emotion detection.


Introduction
There is an ever increasing need for labeled datasets for machine learning. This is true for English as well as other, often under-resourced, languages. We provide a cross-lingual fine-grained sentence-level emotion and sentiment dataset. The dataset consists of parallel manually annotated data for English and Finnish, with additional parallel datasets of varying sizes for a total of 32 languages created by annotation projection. We use Plutchik's Wheel of Emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) (Plutchik, 1980) as our annotation scheme with the addition of neutral on movie subtitle data from OPUS (Lison and Tiedemann, 2016).
We perform evaluations with fine-tuned cased multilingual and language-specific BERT (Bidirectional Encoder Representations from Transformers) models (Devlin et al., 2019), as well as Support Vector Machines (SVMs). Our evaluations show that the human-annotated datasets behave on par with comparable state-of-the-art datasets such as the GoEmotions dataset (Demszky et al., 2020). Furthermore, the projected datasets have accuracies that closely resemble those of the human-annotated data, with macro f1 scores of 0.51 for the human-annotated Finnish data and 0.45 for the projected Finnish data when evaluating with FinBERT (Virtanen et al., 2019).
The XED dataset can be used in emotion classification tasks and other applications that can benefit from sentiment analysis and emotion detection, such as offensive language identification. The data is open source, licensed under a Creative Commons Attribution 4.0 International License (CC-BY).
In the following sections we discuss related work, describe our datasets, and present their evaluation, followed by a discussion of the results.

Background & Previous Work
Datasets created for sentiment analysis have been available for researchers since at least the early 2000s (Mäntylä et al., 2018). Such datasets generally use a binary or ternary annotation scheme (positive, negative, and optionally neutral) (e.g. Blitzer et al. (2007)) and have traditionally been based on review data, such as Amazon product reviews or movie reviews (Blitzer et al., 2007; Maas et al., 2011; Turney, 2002). Many, if not most, emotion datasets, on the other hand, use Twitter as a source and individual tweets as the level of granularity (Schuff et al., 2017; Abdul-Mageed and Ungar, 2017; Mohammad et al., 2018). In the case of emotion datasets, the emotion taxonomies used are often based on Ekman (1971) and Plutchik (1980) (which is partially based on Ekman).
A majority of recent papers on multilabel emotion classification focus on the SemEval 2018 dataset, which is based on tweets. Similarly, many of the non-multilabel classification papers use Twitter data. Twitter is a good base for emotion classification as tweets are limited in length and generally stand-alone, i.e. the reader or annotator does not need to guess the context in the majority of cases. Furthermore, hashtags and emojis are common, which makes emotion recognition easier for both human annotators and emotion detection and sentiment analysis models. Reddit data, as used by Demszky et al. (2020), and the movie subtitles used in this paper are slightly more problematic as they are not self-contained. Reddit comments are typically longer than one line and therefore provide some context for annotators to go by, but they often lack the hashtags and emojis of Twitter and can be quite context-dependent, as Reddit comments are by definition reactions to a post or another comment. Movie subtitles annotated out of sequence have virtually no context to aid the annotator and are meant to be accompanied by visual cues as well. However, annotating with context can reduce the accuracy of one's model by doubly weighting surrounding units of granularity (roughly 'sentences' in our case) (Boland et al., 2013). On the other hand, contextual annotations are less frustrating for the annotator and would therefore likely yield more annotations in the same amount of time (Öhman, 2020).
In table 1 we have gathered some of the most significant emotion datasets in relation to this study. The table lists the paper in which the dataset was released (study), the source data that was used (source), the model used to obtain the best evaluation scores (model), the number of categories used for annotation (cat), whether the system was multilabel or not (multi), and the macro f1 and accuracy scores as reported by the paper (macro f1 and accuracy respectively). Some papers only reported a micro f1 and no macro f1 score; these scores have been marked with a µ. CrowdFlower was created in 2016 but has since been acquired by different companies at least twice and is now hard to find; it is currently owned by Appen.
The datasets in table 1 differ from each other so much in content, structure, and manner of annotation that direct comparisons are hard to make. Typically, the fewer the categories, the easier the classification task and the higher the evaluation scores. It stands to reason that the easier it is to detect emotions in the source data, the easier it is for annotators to identify and agree upon annotation labels, and the easier it therefore becomes for the system or model to correctly classify the test data as well. The outlier among these datasets is EmoNet (Abdul-Mageed and Ungar, 2017), which achieved astonishing accuracies by using 665 different hashtags to automatically categorize 1.6 million tweets into 24 categories (Plutchik's 8 at 3 different intensities). Unfortunately, neither the dataset nor the model has been made available for closer inspection.
The downside of models trained on Twitter datasets is that they are likely not very good at classifying anything other than tweets. It is plausible that models trained on less specific data, such as XED and the datasets created by Tokuhisa et al. (2008) and Demszky et al. (2020), are better at crossing domains at the cost of evaluation metrics.

Annotation Projection
Research shows that affect categories are quite universal (Cowen et al., 2019; Scherer and Wallbott, 1994). Therefore, they should theoretically also retain their emotion categories to a large degree when translated. Annotation projection has been shown to offer reliable results in different NLP and NLU tasks (Yarowsky et al., 2001; Agić et al., 2016; Rasooli and Tetreault, 2015). Projection is sometimes the only feasible way to produce resources for under-resourced languages. By taking datasets created for high-resource languages and projecting their annotations onto the corresponding items in the under-resourced language using parallel corpora, we can create datasets in as many languages as exist in the parallel corpus. A parallel corpus for multiple languages enables the simultaneous creation of resources for multiple languages at a low cost.
Previous annotation tasks have shown that even with binary or ternary classification schemes, human annotators agree only about 70-80% of the time, and the more categories there are, the harder it becomes for annotators to agree (Boland et al., 2013; Mozetič et al., 2016). For example, when the DENS dataset was created, only 21% of the annotations had consensus between all annotators, 73.5% had to be resolved by majority agreement, and a further 5.5% could not be agreed upon and were left to expert annotators to resolve.
Some emotions are also harder to detect, even for humans. Demszky et al. (2020) show that the emotions of admiration, approval, annoyance, and gratitude had the highest interrater correlations at around 0.6, while grief, relief, pride, nervousness, and embarrassment had the lowest interrater correlations, between 0 and 0.2, with the vast majority of emotions falling in the range of 0.3-0.5. Emotions are also expressed differently in text, with anger and disgust expressed explicitly, and surprise expressed through context (Alm et al., 2005).
Some emotions are also more closely correlated. In Plutchik's wheel (Plutchik, 1980), related emotions are placed on the same dyad so that, for example, for anger as a core emotion there is also rage, which is more intense but highly correlated with anger, and annoyance, which is less intense but equally correlated. In this way it is also possible to map more distinct categories of emotions onto larger wholes; in this case rage and annoyance could be mapped to anger, or even more coarsely to negative. This approach has been employed by, for example, Abdul-Mageed and Ungar (2017).
We used Plutchik's core emotions as our annotation scheme, resulting in 8 distinct emotion categories plus neutral. The Sentimentator platform (Öhman and Kajava, 2018; Öhman et al., 2018) allows for the annotation of intensities, resulting in what is essentially 30 emotions and sentiments; however, as the intensity score is not available for all annotations, the intensity scores were discarded. The granularity of our annotations roughly corresponds to sentence-level annotations, although as our source data is movie subtitles, our shortest subtitle is "!" and the longest subtitle consists of three separate sentences. A majority of the subtitles for English were assigned one emotion label (78%), 17% were assigned two, and roughly 5% had three or more categories (see also Table 3).

Movie Subtitles as Multilingual Multi-Domain Proxy
We use the OPUS (Lison and Tiedemann, 2016) parallel movie subtitle corpus of subtitles collected from opensubtitles.org as a multi-domain proxy. The movies we use as source data cover several different genres and, although scripted, represent real human language used in a multitude of situations, similar to many social media platforms.
Because OPUS OpenSubtitles is a parallel corpus, we are able to evaluate our annotated datasets across languages and at identical levels of granularity. Although the subtitles might be translated using different translation philosophies (favoring e.g. meaning, mood, or idiomatic language as the prime objective) (Carl et al., 2011), we expect the translations to have aimed at capturing the sentiments and emotions originally expressed in the film, based on previous studies (e.g. Cowen et al. (2019), Scherer and Wallbott (1994), Creutz (2018), and Scherrer (2020)).

Data Annotation
The vast majority of the dataset was annotated by university students learning about sentiment analysis, with some annotations provided by expert annotators for reliability measurements (Öhman et al., 2018). The students' annotation process was monitored and evaluated. They received only minimal instructions, which included that they were to focus on the quality of annotations rather than quantity, and to annotate from the point of view of the speaker. We also asked for feedback on the annotation process to improve the user-friendliness of the platform for future use. In tables 2 and 5 the number of active annotators has been included. All in all, over 100 students annotated at least some sentences, with around 60 active annotators, meaning students who annotated more than 300 sentences (Öhman, 2020).
It should be noted that the annotators were instructed to annotate the subtitles without context, a task made harder by the fact that we chose subtitles that were available for all languages, which likely meant that some of the most famous movies were included, thus creating recognizable context for the annotators.
The data for annotation was chosen randomly from the OPUS subtitle corpus (Lison and Tiedemann, 2016), from subtitles that were available for the maximum number of languages. We chose 30,000 individual lines to be annotated by 3 annotators each. In the final dataset, some subtitles were not annotated by all 3 annotators, as it was possible to skip difficult-to-annotate instances, but a subtitle was included if at least 2 annotators agreed on the emotion score. Subtitles annotated by a single annotator were also included in some cases, if the expert annotators who checked them during the pre-processing phase agreed that the annotation was feasible.

Pre-processing
After the annotations were extracted from the database, the data needed to be cleaned up. The different evaluations required different pre-processing steps. Most commonly, this included the removal of superfluous characters containing no information. We tried to keep as much of the original information as possible, including keeping offensive, racist, and sexist language as is. If such information is removed, the usefulness of the data is at risk of being reduced, particularly when used for e.g. offensive language detection (Pàmies et al., 2020).
For the English data we used Stanford NER (named entity recognition) (Finkel et al., 2005) to replace names and locations with the tags [PERSON] and [LOCATION], respectively. We kept organization names as is because we felt that the emotions and sentiments towards some large, well-known organizations differ too much (cf. IRS, FBI, WHO, EU, and MIT). For the Finnish data, we replaced names and locations using the Turku NER corpus (Luoma et al., 2020).
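A minimal sketch of this kind of entity masking is shown below. It uses spaCy and its en_core_web_sm model purely as an illustrative stand-in for the Stanford NER pipeline described above; the tag names follow the paper, but the library choice and the function name are assumptions.

```python
# Sketch: replace person and location mentions with [PERSON] / [LOCATION] tags,
# while leaving organization names (e.g. FBI) untouched, as described above.
# spaCy and "en_core_web_sm" are illustrative stand-ins for Stanford NER.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text: str) -> str:
    doc = nlp(text)
    masked = text
    # Replace from the end of the string so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            masked = masked[:ent.start_char] + "[PERSON]" + masked[ent.end_char:]
        elif ent.label_ in ("GPE", "LOC"):
            masked = masked[:ent.start_char] + "[LOCATION]" + masked[ent.end_char:]
    return masked

print(mask_entities("Sarah flew to Paris to meet the FBI contact."))
# -> "[PERSON] flew to [LOCATION] to meet the FBI contact."
```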
Some minor text cleanup was also conducted, removing hyphens and quotations marks, and correcting erroneous renderings of characters (usually encoding issues) where possible.

English Dataset Description
The final dataset contained 17,520 unique emotion-annotated subtitles, as shown in table 3. In addition, there are some 6.5k subtitles annotated as neutral. The emotion labels are surprisingly balanced, with the exception of anger and anticipation, which are more common than the other labels. In comparison with one of the most well-known emotion datasets using the same annotation scheme, the NRC Emotion Lexicon (EmoLex) (Mohammad and Turney, 2013), the distribution differs somewhat. Although anger is a large category in both datasets, fear is average-sized in our dataset but the largest category in EmoLex. It is hard to speculate why this is, but one possible reason is the different source data.
The number of unique label combinations is 147, including single labels. The most common label combinations beyond single labels are anger with disgust (2.4%) and joy with trust (2.1%), followed by different combinations of the positive emotions of anticipation, joy, and trust. These findings are in line with previous findings discussing overlapping categories (Banea et al., 2011; Demszky et al., 2020). However, these are followed by anger combined with anticipation, and sadness with surprise. The first combination is possibly a reflection of the genre, as a common theme for anger with anticipation is threats. The combination of surprise with negative emotions (anger, disgust, fear, sadness) is much more common than its combination with positive emotions.
Note that the total number of annotations excluding neutral (24,164) and the combined number of annotations (22,424) differ because, once the dataset was saved as a Python dictionary, identical lines were merged into one (i.e. some common movie lines like "All right then!" and "I love you" appeared multiple times from different sources). Additionally, lines annotated as both neutral and an emotion were removed from the neutral set.
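As a rough illustration of the merging step described above, the following sketch collapses identical subtitle lines into one entry whose label set is the union of their annotations, and drops lines from the neutral set when they also carry an emotion label. The function and variable names are illustrative, not the authors' actual code.

```python
# Sketch of the dictionary-based merge: identical lines collapse into one key,
# and the neutral set is cleaned of lines that also received an emotion label.
from collections import defaultdict

def merge_annotations(annotated_lines):
    """annotated_lines: iterable of (subtitle_text, set_of_labels) pairs."""
    merged = defaultdict(set)
    for text, labels in annotated_lines:
        merged[text] |= labels  # union of labels for duplicate lines
    return dict(merged)

def clean_neutral(emotion_data, neutral_lines):
    """Remove from the neutral set any line that also has an emotion label."""
    return [line for line in neutral_lines if line not in emotion_data]
```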

Crosslingual Data & Annotation Projection
From our source data we can extract parallel sentences for 43 languages. For 12 of these languages we have over 10,000 sentences available for projection, as per table 4. We removed some of these languages for having fewer than 950 lines, resulting in a total of 32 languages including the annotated English and Finnish data. We have made all 32 datasets available on GitHub, plus the raw data for all 43 languages including the 11 datasets that had fewer than 950 lines.

Table 4: Number of parallel sentences available for projection for the 12 languages with the most data.
IT 10,582   FI 11,128   FR 11,503   CS 11,885   PT 12,559   PL 12,836
SR 14,831   TR 15,712   EL 15,713   RO 16,217   ES 16,608   PT-BR 22,194

To test how well our data is suited for emotion projection, we projected the English annotations onto our unannotated Finnish data using OPUS tools (Aulamo et al., 2020). We chose Finnish as our main test language as we also have some annotated data for it to use as a test set. The manually annotated Finnish data consists of nearly 20k individual annotations and almost 15k unique annotated sentences, plus an additional 7,536 sentences annotated as neutral (the counts are computed as for English: annotations are labels, of which there can be more than one per line, while unique data points are lines with at least one annotation). The criteria for the inclusion of an annotation were the same as for English. The distribution of the number of labels and the labels themselves are quite similar to those of the English data. Relatively speaking, there is a little less anticipation in the Finnish data, but anger is the biggest category in both languages.

[Table 6: Emotion label distribution in the XED Finnish dataset.]
We used the 11,128 Finnish sentences for which directly parallel sentences existed and projected the English annotations onto them, using the unique alignment IDs of both languages as a guide. Some of these parallel sentences were part of our already annotated data and were discarded as training data; this served as a useful point of comparison. The average annotation correlation using Cohen's kappa is 0.44 (although accuracy by percentage is over 90%), and is highest for joy at 0.65, showing that annotation projection differs from human annotation to a similar degree as human annotations differ from each other.
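The projection itself amounts to copying an English label set onto the Finnish sentence that shares the same alignment ID, while holding out sentences that already have manual Finnish annotations. A minimal sketch, with illustrative dictionary and function names:

```python
# Sketch of annotation projection through parallel alignment IDs.
def project_annotations(english_labels, finnish_sentences, already_annotated):
    """
    english_labels:    {alignment_id: set_of_emotion_labels} from the annotated English data
    finnish_sentences: {alignment_id: finnish_sentence} from the parallel corpus
    already_annotated: alignment IDs with manual Finnish annotations (held out as a test set)
    """
    projected = {}
    for aid, labels in english_labels.items():
        if aid in finnish_sentences and aid not in already_annotated:
            projected[finnish_sentences[aid]] = labels
    return projected
```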

Evaluation
A dataset for classification tasks is useful only if the accuracy of its annotations can be confirmed. To this end we use BERT to evaluate our annotations, as it has consistently outperformed other models in recent classification tasks (see e.g. Zampieri et al. (2020)), as well as Support Vector Machines for their simplicity and effectiveness. We use a stratified split of 70:20:10 for training, dev, and test data.
We use a fine-tuned English uncased BERT with a batch size of 96. The learning rate of the Adam optimizer was set to 2e-5 and the model was trained for 3 epochs. The sequence length was set to 48. We perform 5-fold cross-validation.
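A sketch of this fine-tuning setup using the Hugging Face transformers Trainer API is given below. It mirrors the hyperparameters stated above (uncased English BERT, batch size 96, learning rate 2e-5, 3 epochs, sequence length 48), but the Trainer-based implementation and the multi-hot label encoding are assumptions rather than the authors' exact training code.

```python
# Sketch: multilabel fine-tuning of uncased English BERT with the stated hyperparameters.
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

NUM_LABELS = 8  # Plutchik's core emotions; neutral handled separately

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

args = TrainingArguments(
    output_dir="xed-bert",            # illustrative output path
    per_device_train_batch_size=96,
    learning_rate=2e-5,
    num_train_epochs=3,
)

# train_dataset / eval_dataset are assumed to yield examples tokenized with
# max_length=48 (truncation and padding) and multi-hot float label vectors.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```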
We also use an SVM classifier with a linear kernel and a regularization parameter of 1. Word unigrams, bigrams, and trigrams were used as features in this case. Implementation was done using the LinearSVC class from the scikit-learn library (Pedregosa et al., 2011).
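The baseline can be reproduced roughly as follows; the word 1-3-gram features, one-vs-rest scheme, LinearSVC, and C=1 follow the description above, while the TF-IDF weighting and the toy examples are assumptions.

```python
# Sketch of the linear SVM baseline for multilabel emotion classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["I can't believe you did that!", "Thank you, that was very kind."]
labels = [{"anger", "surprise"}, {"joy", "trust"}]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # multi-hot label matrix

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),    # word uni-, bi-, and trigram features
    OneVsRestClassifier(LinearSVC(C=1.0)),  # one binary SVM per emotion
)
clf.fit(texts, y)
print(mlb.inverse_transform(clf.predict(texts)))
```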
Binary refers to positive and negative, and ternary refers to positive, negative, and neutral. For the binary evaluations we categorized anger, disgust, fear, and sadness as negative, and anticipation, joy, and trust as positive. Surprise was either discarded or included as a separate category (see table 7). For this classification task BERT achieved a macro f1 score of 0.536 and an accuracy of 0.544. This is comparable to other similar datasets when classes are merged (e.g. Demszky et al. (2020)).
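The following sketch expresses this coarse mapping in code; the polarity groups follow the description above, while the function name and the keep_surprise flag are illustrative.

```python
# Sketch: map a set of Plutchik labels to coarse polarity labels.
NEGATIVE = {"anger", "disgust", "fear", "sadness"}
POSITIVE = {"anticipation", "joy", "trust"}

def to_polarity(emotions, keep_surprise=False):
    """Return the coarse labels for a set of emotion labels; surprise is
    either dropped (default) or kept as its own category."""
    mapped = set()
    for e in emotions:
        if e in NEGATIVE:
            mapped.add("negative")
        elif e in POSITIVE:
            mapped.add("positive")
        elif e == "surprise" and keep_surprise:
            mapped.add("surprise")
    return mapped

print(to_polarity({"anger", "anticipation"}))  # -> {'negative', 'positive'}
```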

Evaluation Metrics
We achieve a macro f1 score of 0.54 for our multilabel classification with a fine-tuned BERT model. Using named-entity recognition increases the accuracy slightly. For binary data mapped from the emotion classifications onto positive and negative (non-multilabel classification), our model achieves a macro f1 score of 0.838 and an accuracy of 0.840. Our linear SVM classifier using one-vs-rest achieves an f1 score of 0.502, with per-class f1 scores between 0.8073 (anger) and 0.8832 (fear & trust) (see tables 7 and 8). The confusion matrix (see Figure 1) reveals that disgust is often confused with anger, and to some extent this is true in the other direction as well. This relation between labels can also be observed in the correlation matrix (see Figure 2), where anger and disgust appear as one of the most highly correlated pairs of categories, second only to joy and trust. On the other hand, the least correlated pair is joy and anger, closely followed by trust and anger. Disgust is also the hardest emotion to categorize correctly; in fact, it is more often classified as anger than as disgust. Joy, anger, and anticipation are the categories that are categorized correctly most often.
The correlation matrix for the multilabel English evaluation (Figure 2) shows how closely correlated the emotions of anger and disgust, and joy and trust in particular, are.

Evaluating Annotation Projection
With the same parameters as for English, we used language-specific BERT models from Huggingface Transformers (Wolf et al., 2019) for the Arabic, Chinese, Dutch, Finnish, German, and Turkish datasets with 5-fold cross-validation. The annotated Finnish dataset achieves an f1 score of 0.51. The projected annotations achieve slightly lower f1 scores than the annotated dataset, at 0.45 for Finnish (see table 9). The other datasets achieve similar f1 scores, with the Germanic languages German and Dutch achieving almost as high scores as the original English dataset. This is likely a reflection of typological, cultural, and linguistic similarities between the languages, making the translation more similar to the original to begin with and therefore minimizing information loss.
We also evaluated all the projected datasets using a linear SVC classifier. In most cases the linear SVC classifier performs better than language-specific BERT. We speculate this is related to the size of

Discussion
The results from the dataset evaluations show that XED is on par with other similar datasets, but they also stress that reliable emotion detection is still a very challenging task. It is not necessarily an issue with natural language processing and understanding, as these types of tasks are challenging for human annotators alike. If human annotators cannot agree on labels, it is not reasonable to expect computers to do any better, regardless of annotation scheme or model used, since these models are restricted by human performance. The best accuracies are those that are in line with annotator agreement.
XED is a novel state-of-the-art dataset that provides a new challenge in fine-grained emotion detection with previously unavailable language coverage. What makes the XED dataset particularly valuable is the large number of annotations at high granularity, as most other similar datasets are annotated at a much coarser granularity. The use of movie subtitles as source data means that it is possible to use the XED dataset across multiple domains (e.g. social media), as the source data is representative of other domains and not as restricted to its own domain (movies) as many other datasets. Perhaps the greatest contribution of all is that, for the first time, many under-resourced languages have emotion datasets that can be used in other possible downstream applications as well.