SemEval-2018 Task 3: Irony Detection in English Tweets

This paper presents the first shared task on irony detection: given a tweet, automatic natural language processing systems should determine whether the tweet is ironic (Task A) and which type of irony (if any) is expressed (Task B). The ironic tweets were collected using irony-related hashtags (i.e. #irony, #sarcasm, #not) and were subsequently manually annotated to minimise the amount of noise in the corpus. Prior to distributing the data, the hashtags that were used to collect the tweets were removed from the corpus. For both tasks, a training corpus of 3,834 tweets was provided, as well as a test set containing 784 tweets. The shared task received submissions from 43 teams for the binary classification Task A and from 31 teams for the multiclass Task B. The highest classification scores obtained for the two subtasks are F1 = 0.71 and F1 = 0.51, respectively, and demonstrate that fine-grained irony classification is much more challenging than binary irony detection.


Introduction
The development of the social web has stimulated the use of figurative and creative language, including irony, in public (Ghosh et al., 2015). From a philosophical/psychological perspective, discerning the mechanisms that underlie ironic speech improves our understanding of human reasoning and communication, and this interest in understanding irony is increasingly shared by the machine learning community (Wallace, 2015). Although a unanimous definition of irony is still lacking in the literature, it is often identified as a trope whose actual meaning differs from what is literally enunciated. Due to its nature, irony has important implications for natural language processing (NLP) tasks, which aim to understand and produce human language. In fact, automatic irony detection has large potential for various applications in the domain of text mining, especially those that require semantic analysis, such as author profiling, detecting online harassment, and, perhaps the best-known example, sentiment analysis.
Due to its importance in industry, sentiment analysis research is abundant and significant progress has been made in the field (e.g. in the context of SemEval (Rosenthal et al., 2017)). However, the SemEval-2014 shared task Sentiment Analysis in Twitter (Rosenthal et al., 2014) demonstrated the impact of irony on automatic sentiment classification by including a test set of ironic tweets. The results revealed that, while sentiment classification performance on regular tweets reached up to F1 = 0.71, scores on the ironic tweets varied between F1 = 0.29 and F1 = 0.57. In fact, it has been demonstrated that several applications struggle to maintain high performance when applied to ironic text (e.g. Liu, 2012; Maynard and Greenwood, 2014; Ghosh and Veale, 2016). Like other types of figurative language, ironic text should not be interpreted in its literal sense; it requires a more complex understanding based on associations with the context or world knowledge. Examples 1 and 2 are sentences that regular sentiment analysis systems would probably classify as positive, whereas the intended sentiment is undeniably negative.
(1) I feel so blessed to get ocular migraines.
(2) Go ahead drop me hate, I'm looking forward to it.
For human readers, it is clear that the author of example 1 does not feel blessed at all, which can be inferred from the contrast between the positive sentiment expression "I feel so blessed" and the negative connotation associated with getting ocular migraines. Although such connotative information is easily understood by most people, it is difficult to access by machines. Example 2 illustrates implicit cyberbullying: instances that typically lack explicit profane words and where the offense is often made through irony. Similarly to example 1, a contrast can be perceived between a positive statement ("I'm looking forward to") and a negative situation (i.e. experiencing hate). To be able to interpret the above examples correctly, machines need, similarly to humans, to be aware that irony is used, and that the intended sentiment is the opposite of what is literally enunciated.
The irony detection task we propose is formulated as follows: given a single post (i.e. a tweet), participants are challenged to automatically determine whether irony is used and which type of irony is expressed. We thus defined two subtasks:

• Task A describes a binary irony classification task to define, for a given tweet, whether irony is expressed.
• Task B describes a multiclass irony classification task to define whether a tweet contains a specific type of irony (verbal irony by means of a polarity clash, situational irony, or another type of verbal irony, see further) or is not ironic. Concretely, participants should define which one of four categories a tweet belongs to: ironic by clash, situational irony, other verbal irony or not ironic.
It is important to note that by a tweet, we understand the actual text it contains, without metadata (e.g. user id, time stamp, location). Although such metadata could help to recognise irony, the objective of this task is to learn, at message level, how irony is linguistically realised.

Automatic Irony Detection
As described by Joshi et al. (2017), recent approaches to irony can roughly be classified as either rule-based or (supervised and unsupervised) machine learning-based. While rule-based approaches mostly rely upon lexical information and require no training, machine learning invariably makes use of training data and exploits different types of information sources (or features), such as bags of words, syntactic patterns, sentiment information or semantic relatedness.
Previous work on irony detection has mostly applied supervised machine learning, mainly exploiting lexical features. Other features often include punctuation mark/interjection counts (e.g. Davidov et al., 2010), sentiment lexicon scores (e.g. Bouazizi and Ohtsuki, 2016; Farías et al., 2016), emoji (e.g. González-Ibáñez et al., 2011), writing style, emotional scenarios, and part-of-speech patterns (e.g. Reyes et al., 2013). Also beneficial for this task are combinations of different feature types (e.g. Van Hee et al., 2016b), author information (e.g. Bamman and Smith, 2015), features based on (semantic or factual) oppositions (e.g. Karoui et al., 2015; Gupta and Yang, 2017; Van Hee, 2017) and even eye-movement patterns of human readers (Mishra et al., 2016). While a wide range of handcrafted features has been used extensively over the past years, deep learning techniques have recently gained increasing popularity for this task. Such systems often rely on semantic relatedness (i.e. through word and character embeddings (e.g. Amir et al., 2016; Ghosh and Veale, 2016)) deduced by the network and reduce feature engineering efforts.
Regardless of the methodology and algorithm used, irony detection often involves binary classification where irony is defined as instances that express the opposite of what is meant (e.g. Riloff et al., 2013; Joshi et al., 2017). Twitter has been a popular data genre for this task, as it is easily accessible and provides a rapid and convenient method to find (potentially) ironic messages by looking for hashtags like #irony, #not and #sarcasm. As a consequence, irony detection research often relies on automatically annotated (i.e. based on irony-related hashtags) corpora, which contain noise (Kunneman et al., 2015; Van Hee, 2017).

Task Description
We propose two subtasks A and B for the automatic detection of irony on Twitter, for which we provide more details below.

Task A: Binary Irony Classification
The first subtask is a two-class (or binary) classification task where submitted systems have to predict whether a tweet is ironic or not. The following examples respectively present an ironic and a non-ironic tweet.
(3) I just love when you test my patience!! #not.

(4) Had no sleep and have got school now #not happy

Note that the examples contain irony-related hashtags (e.g. #irony) that were removed from the corpus prior to distributing the data for the task.
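This pre-distribution cleaning step can be sketched as follows. This is an illustrative reimplementation, not the organisers' actual script, and the regular expression covers only the three collection hashtags named in the paper:

```python
import re

# Hashtags used to collect the corpus; removed before distribution.
# \b keeps unrelated tags such as #nothing or #ironman intact.
IRONY_TAGS = re.compile(r"#(?:irony|sarcasm|not)\b", re.IGNORECASE)

def strip_irony_hashtags(tweet):
    """Remove collection hashtags and collapse the whitespace they leave behind."""
    without_tags = IRONY_TAGS.sub("", tweet)
    return re.sub(r"\s{2,}", " ", without_tags).strip()
```

For instance, example 3 above would be distributed as "I just love when you test my patience!!", while non-collection hashtags (e.g. #HSC2024 in example 10) are kept.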

Task B: Multiclass Irony Classification
The second subtask is a multiclass classification task where submitted systems have to predict one out of four labels describing i) verbal irony realised through a polarity contrast, ii) verbal irony without such a polarity contrast (i.e. other verbal irony), iii) descriptions of situational irony, and iv) non-irony. The following paragraphs present a description and a number of examples for each label.
Verbal irony by means of a polarity contrast
This category applies to instances containing an evaluative expression whose polarity (positive, negative) is inverted between the literal and the intended evaluation, as shown in examples 5 and 6:

(5) I love waking up with migraines #not
(6) I really love this year's summer; weeks and weeks of awful weather

In the above examples, the irony results from a polarity inversion between two evaluations. For instance, in example 6, the literal evaluation ("I really love this year's summer") is positive, while the intended one, which is implied by the context ("weeks and weeks of awful weather"), is negative.
Other verbal irony
This category contains instances that show no polarity contrast between the literal and the intended evaluation, but are nevertheless ironic.
(7) @someuser Yeah keeping cricket clean, that's what he wants #Sarcasm
(8) Human brains disappear every day. Some of them have never even appeared. http://t.co/Fb0Aq5Frqs #brain #humanbrain #Sarcasm

Situational irony
This class label is reserved for instances describing situational irony, or situations that fail to meet some expectations. As explained by Shelley (2001), firefighters who have a fire in their kitchen while they are out to answer a fire alarm would be a typically ironic situation. Some other examples of situational irony are the following:

(9) Most of us didn't focus in the #ADHD lecture. #irony
(10) Event technology session is having Internet problems. #irony #HSC2024

Non-ironic
This class contains instances that are clearly not ironic, or which lack context to be sure that they are ironic, as shown in the following examples:

Corpus Construction and Annotation
A data set of 3,000 English tweets was constructed by searching Twitter for the hashtags #irony, #sarcasm and #not (hereafter referred to as the 'hashtag corpus'); the hashtags could occur anywhere in the tweets included in the corpus. All tweets were collected between 01/12/2014 and 04/01/2015 and represent 2,676 unique users.
To minimise the noise introduced by groundless irony hashtags, all tweets were manually labelled using a fine-grained annotation scheme for irony (Van Hee et al., 2016a). Prior to data annotation, the entire corpus was cleaned by removing retweets, duplicates and non-English tweets and replacing XML-escaped characters (e.g. &amp;).
The corpus was entirely annotated by three students in linguistics, all second-language speakers of English, with each student annotating one third of the whole corpus. All annotations were done using the brat rapid annotation tool (Stenetorp et al., 2012).

To assess the reliability of the annotations, and whether the guidelines allowed annotators to carry out the task consistently, an inter-annotator agreement study was set up in two rounds. Firstly, inter-rater agreement was calculated between the authors of the guidelines to test the guidelines for usability and to assess whether changes or additional clarifications were recommended prior to annotating the entire corpus. For this purpose, a subset of 100 instances from the SemEval-2015 Task Sentiment Analysis of Figurative Language in Twitter (Ghosh et al., 2015) dataset was annotated. Based on the results, some clarifications and refinements were added to the annotation scheme, which are thoroughly described in Van Hee (2017). Next, a second agreement study was carried out on a subset (i.e. 100 randomly chosen instances) of the corpus. As metric, we used Fleiss' Kappa (Fleiss, 1971), a widespread statistical measure in the field of computational linguistics for assessing annotator agreement on categorical ratings (Carletta, 1996). The measure calculates the degree of agreement in classification over the agreement that would be expected by chance, i.e. when annotators would randomly assign class labels.

annotation                               κ (round 1)   κ (round 2)
ironic / not ironic                      0.65          0.72
ironic by clash / other / not ironic     0.55          0.72

Table 1: Inter-annotator agreement scores (Kappa) in two annotation rounds.

Table 1 presents the inter-rater scores for the binary irony distinction and for three-way irony classification ('other' includes both situational irony and other forms of verbal irony). We see that better inter-annotator agreement is obtained after the refinement of the annotation scheme, especially for the binary irony distinction.
Given the difficulty of the task, a Kappa score of 0.72 for recognising irony can be interpreted as good reliability according to the magnitude guidelines of Landis and Koch (1977).
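For reference, Fleiss' Kappa can be computed from per-item category counts as sketched below. This is an illustrative plain-Python implementation, not the tooling used in the annotation study; it assumes every item was rated by the same number of annotators:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items; ratings[i][j] is the number of
    annotators who assigned item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # assumed constant across items
    # Observed agreement: per-item pairwise agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from the marginal category proportions.
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, while systematic disagreement pushes κ below 0, the chance-level baseline.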
The distribution of the different irony types in the experimental corpus is presented in Table 2. Based on the annotations, 2,396 instances out of the 3,000 are ironic, while 604 are not. To balance the class distribution in our experimental corpus, 1,792 non-ironic tweets were added from a background corpus. The tweets in this corpus were collected from the same set of Twitter users as in the hashtag corpus, and within the same time span. It is important to note that these tweets do not contain irony-related hashtags (as opposed to the non-ironic tweets in the hashtag corpus), and were manually checked to filter out ironic tweets. Adding these non-ironic tweets to the experimental corpus brought the total amount of data to 4,792 tweets (2,396 ironic + 2,396 non-ironic). For this shared task, the corpus was randomly split into a class-balanced training set (80%, or 3,833 instances) and test set (20%, or 958 instances). In an additional cleaning step, we removed ambiguous tweets (i.e. tweets where additional context was required to understand their ironic nature) from the test corpus, resulting in a test set containing 784 tweets (40% ironic and 60% non-ironic).
To train their systems, participants were not restricted to the provided training corpus. They were allowed to use additional training data that was collected and annotated at their own initiative. In the latter case, the submitted system was considered unconstrained, as opposed to constrained if only the distributed training data were used for training.
It is important to note that participating teams were allowed ten submissions on CodaLab, and that they could submit a constrained and an unconstrained system for each subtask. However, only their last submission was considered for the official ranking (see Table 3).

Evaluation
For both subtasks, participating systems were evaluated using standard evaluation metrics, including accuracy, precision, recall and F1 score, calculated as follows:

accuracy = (true positives + true negatives) / (total number of instances)   (1)

precision = true positives / (true positives + false positives)   (2)

recall = true positives / (true positives + false negatives)   (3)

F1 = 2 · precision · recall / (precision + recall)   (4)

While accuracy provides insights into the system performance for all classes, the other measures were calculated for the positive class only (Task A) or were macro-averaged over the four class labels (Task B). Macro-averaging of the F1 score implies that all class labels have equal weight in the final score.
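The per-class metrics and the macro-averaged F1 can be sketched in plain Python. This is an illustrative implementation; the official evaluation script may differ in details:

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, label):
    """Precision, recall and F1 for one class, treated as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, pred, labels):
    """Macro-average: every label contributes equally, regardless of frequency."""
    return sum(precision_recall_f1(gold, pred, lab)[2] for lab in labels) / len(labels)
```

Because of the equal weighting, a class that is rare in the test set (such as *other verbal irony* in Task B) can pull the macro-averaged score down substantially.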
For both subtasks, two baselines were provided against which to compare the systems' performance. The first baseline randomly assigns irony labels and the second one is a linear SVM classifier with standard hyperparameter settings exploiting tf-idf word unigram features (implemented with scikit-learn (Pedregosa et al., 2011)). The second baseline system is made available to the task participants via GitHub (https://github.com/Cyvhee/SemEval2018-Task3/).
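A minimal sketch of such an SVM unigram baseline with scikit-learn is shown below. The exact hyperparameters and preprocessing of the official baseline may differ, and the toy tweets and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: label 1 = ironic, label 0 = non-ironic (illustrative only).
train_texts = [
    "I just love when you test my patience",
    "great, another Monday of endless meetings",
    "wonderful, my train is delayed again",
    "the weather forecast looks sunny today",
    "off to the gym before work",
    "just finished reading a good book",
]
train_labels = [1, 1, 1, 0, 0, 0]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),  # tf-idf word unigram features
    LinearSVC(),                          # linear SVM, default hyperparameters
)
baseline.fit(train_texts, train_labels)
predictions = baseline.predict(["just love waiting in line for hours"])
```

Despite its simplicity, a pipeline of this shape proved a competitive reference point, as the results below show.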

Systems and results for Task A
In total, 43 teams competed in Task A on binary irony classification. Table 3 presents each team's performance in terms of accuracy, precision, recall and F1 score. In all tables, the systems are ranked by the official F1 score (shown in the fifth column). Scores from teams that are marked with an asterisk should be interpreted carefully, as the number of predictions they submitted does not correspond to the number of test instances.
As can be observed from the table, the SVM unigram baseline clearly outperforms the random class baseline and generally performs well for the task. Below we discuss the top five best-performing teams for Task A, all of which built a constrained (i.e. only the provided training data were used) system. The best system yielded an F1 score of 0.705 and was developed by THU NGN (Wu et al., 2018). Their architecture consists of densely connected LSTMs based on (pre-trained) word embeddings, sentiment features using the AffectiveTweets package (Mohammad and Bravo-Marquez, 2017) and syntactic features (e.g. PoS-tag features and sentence embedding features). Hypothesising that the presence of a certain irony hashtag correlates with the type of irony that is used, they constructed a multi-task model able to simultaneously predict 1) the missing irony hashtag, 2) whether a tweet is ironic or not and 3) which fine-grained type of irony is used in a tweet.
Also in the top five are the teams NTUA-SLP (F1 = 0.672), WLV (F1 = 0.650), NLPRL-IITBHU (F1 = 0.648) and NIHRIO (F1 = 0.648). NTUA-SLP (Baziotis et al., 2018) built an ensemble classifier of two deep learning models: a word-based and a character-based (bidirectional) LSTM, capturing semantic and syntactic information in tweets, respectively. As features, the team used character and word embeddings pre-trained on a corpus of 550 million tweets. Their ensemble classifier applied majority voting to combine the outcomes of the two models. WLV (Rohanian et al., 2018) developed an ensemble voting classifier with logistic regression (LR) and a support vector machine (SVM) as component models. They combined (through averaging) pre-trained word and emoji embeddings with hand-crafted features, including sentiment contrasts between elements in a tweet (i.e. left vs. right sections, hashtags vs. text, emoji vs. text), sentiment intensity, and word-based features like flooding and capitalisation. For Task B, they used a slightly altered model (i.e. ensemble LR models and concatenated instead of averaged word embeddings). NLPRL-IITBHU (Rangwani et al., 2018) ranked fourth and used an XGBoost classifier to tackle Task A. They combined pre-trained CNN activations using DeepMoji (Felbo et al., 2017) with ten types of handcrafted features. These were based on polarity contrast information, readability metrics, context incongruity, character flooding, punctuation counts, counts of discourse markers, intensifiers, interjections and swear words, general token counts, WordNet similarity, polarity scores and URL counts. The fifth best system for Task A was built by NIHRIO (Vu et al., 2018) and consists of a neural-network-based architecture (i.e. a Multilayer Perceptron).
The system exploited lexical features (word- and character-level unigrams, bigrams and trigrams), syntactic features (PoS-tags with tf-idf values), semantic features (word embeddings using GloVe (Pennington et al., 2014), LSI features and Brown cluster features (Brown et al., 1992)) and polarity features derived from the Hu and Liu Opinion Lexicon (Hu and Liu, 2004).
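The majority-voting step that NTUA-SLP used to combine its two models can be sketched model-agnostically as follows (an illustrative sketch, not the team's code):

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine per-model prediction lists, aligned by instance.

    model_predictions: one list of labels per model.
    Ties are broken in favour of the label encountered first."""
    n_instances = len(model_predictions[0])
    return [
        Counter(preds[i] for preds in model_predictions).most_common(1)[0][0]
        for i in range(n_instances)
    ]
```

With only two component models, as in NTUA-SLP's ensemble, tie-breaking matters whenever the models disagree; ensembles with an odd number of voters avoid this issue.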
As such, all teams in the top five approached the task differently, by exploiting various algorithms and features, but all of them clearly outperformed the baselines. Like most other teams, they also showed a better performance in terms of recall compared to precision.

Table 3 displays the results of each team's official submission for Task A, i.e. no distinction is made between constrained and unconstrained systems. By contrast, Tables 4 and 5 present the rankings of the best (i.e. not necessarily the last, and hence official) constrained and unconstrained submissions for Task A.
As can be deduced from Table 4, the UCDCC team ranks first among the constrained systems (F1 = 0.724), followed by THU NGN, NTUA-SLP, WLV and NLPRL-IITBHU, whose approaches were discussed earlier in this paper. The UCDCC system is an LSTM model exploiting GloVe word embedding features.

In the top five unconstrained (i.e. using additional training data) systems for Task A are #NonDicevoSulSerio, INAOE-UPV, RM@IT, ValenTO and UTMN, with F1 scores ranging between 0.622 and 0.527. #NonDicevoSulSerio extended the training corpus with 3,500 tweets from existing irony corpora (e.g. Riloff et al. (2013); Barbieri and Saggion (2014); Ptáček et al. (2014)) and built an SVM classifier exploiting structural features (e.g. hashtag count, text length), sentiment-based features (e.g. contrast between text and emoji sentiment), and emotion-based features (i.e. emotion lexicon scores). INAOE-UPV combined pre-trained word embeddings from the Google News corpus with word-based features (e.g. n-grams). They also extended the official training data with benchmark corpora previously used in irony research and trained their system with a total of 165,000 instances. RM@IT approached the task using an ensemble classifier based on attention-based recurrent neural networks and the FastText (Joulin et al., 2017) library for learning word representations. They enriched the provided training corpus with, on the one hand, the data sets provided for SemEval-2015 Task 11 (Ghosh et al., 2015) and, on the other hand, the sarcasm corpus composed by Ptáček et al. (2014). Altogether, this generated a training corpus of approximately 110,000 tweets. ValenTO took advantage of irony corpora previously used in irony detection that were annotated manually or through crowdsourcing (e.g. Riloff et al., 2013; Ptáček et al., 2014). In addition, they extended their corpus with an unspecified number of self-collected irony tweets using the hashtags #irony and #sarcasm. Finally, UTMN developed an SVM classifier exploiting binary bag-of-words features.
They enriched the training set with 1,000 humorous tweets from SemEval-2017 Task 6 (Potash et al., 2017) and another 1,000 tweets with positive polarity from SemEval-2016 Task 4 (Nakov et al., 2016), resulting in a training corpus of 5,834 tweets.
Interestingly, when comparing the best constrained with the best unconstrained system for Task A, we see a difference of 10 points in favour of the constrained system, which indicates that adding more training data does not necessarily improve the classification performance.

Systems and Results for Task B
While 43 teams competed in Task A, 31 teams submitted a system for Task B on multiclass irony classification. Table 6 presents the official ranking with each team's performance in terms of accuracy, precision, recall and F1 score. Similar to Task A, we discuss the top five systems in the overall ranking (Table 6) and then zoom in on the best performing constrained and unconstrained systems (Tables 7 and 8).
For Task B, the top five is nearly identical to the top five for Task A and includes the following teams: UCDCC (Ghosh, 2018), NTUA-SLP (Baziotis et al., 2018), THU NGN (Wu et al., 2018), NLPRL-IITBHU (Rangwani et al., 2018) and NIHRIO (Vu et al., 2018). All of the teams tackled multiclass irony classification by applying (mostly) the same architecture as for Task A (see earlier). Inspired by siamese networks (Bromley et al., 1993) used in image classification, the UCDCC team developed a siamese architecture for irony detection in both subtasks. The neural network architecture makes use of GloVe word embeddings as features and creates two identical subnetworks that are each fed with different parts of a tweet. Under the premise that ironic statements are often characterised by a form of opposition or contrast, the architecture captures this incongruity between two parts of an ironic tweet. NTUA-SLP, THU NGN and NIHRIO used the same system for both subtasks. NLPRL-IITBHU also used the same architecture, but given the data skew for Task B, they used SMOTE (Chawla et al., 2002) as an oversampling technique to make sure each irony class was equally represented in the training corpus, which led to an F1 score increase of 5 points.
NLPRL-IITBHU built a Random Forest classifier making use of pre-trained DeepMoji embeddings, character embeddings (using Tweet2Vec) and sentiment lexicon features.

As can be deduced from Table 7, the top five constrained systems correspond to the five best-performing systems overall (Table 6). Only four unconstrained systems were submitted for Task B. Differently from their Task A submission, #NonDicevoSulSerio applied a cascaded approach for this task: a first algorithm performed an ironic/non-ironic classification, followed by a system distinguishing between ironic by clash and other forms of irony. Lastly, a third classifier distinguished between situational and other verbal irony. To account for class imbalance in step two, the team added 869 tweets of the situational and other verbal irony categories. INAOE-UPV, INGEOTEC-IIMAS and IITG also added tweets to the original training corpus, but it is not entirely clear how many were added and how these extra tweets were annotated.
Similar to Task A, the unconstrained systems do not seem to benefit from additional data, as they do not outperform the constrained submissions for the task.

A closer look at the best- and worst-performing systems for each subtask reveals that Task A benefits from systems that exploit a variety of handcrafted features, especially sentiment-based ones (e.g. sentiment lexicon values, polarity contrast), but also bags of words, semantic cluster features and PoS-based features. Other promising features for the task are word embeddings trained on large Twitter corpora (e.g. 5M tweets). The classifiers and algorithms used are (bidirectional) LSTMs, Random Forest, Multilayer Perceptron, and an optimised (i.e. using feature selection) voting classifier combining Support Vector Machines with Logistic Regression. Neural-network-based systems exploiting word embeddings derived from the training dataset or generated from Wikipedia corpora perform less well for the task.
Similarly, Task B seems to benefit from (ensemble) neural-network architectures exploiting large corpus-based word embeddings and sentiment features. Oversampling and adjusting class weights are used to overcome the class imbalance of labels 2 and 3 versus 1 and 0 and tend to improve the classification performance. Ensemble classifiers outperform multi-step approaches and combined binary classifiers for this task.
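Plain random oversampling, the simplest variant of the class-balancing strategies mentioned above (NLPRL-IITBHU used the more elaborate SMOTE, which synthesises new examples rather than duplicating existing ones), can be sketched as follows; the function name and seed are illustrative:

```python
import random

def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen minority-class items until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        resampled = items + [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(resampled)
        out_labels.extend([label] * target)
    return out_texts, out_labels
```

Adjusting class weights in the loss function achieves a similar effect without enlarging the training set.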
Task B challenged the participants to distinguish between different types of irony. The class distributions in the training and test corpus are natural (i.e. no additional data were added after the annotation process) and imbalanced. For the evaluation of the task, F1 scores were macro-averaged; on the one hand, this gives each label equal weight in the evaluation, but on the other hand, it does not show each class's contribution to the average score. Table 9 therefore presents the participating teams' performance on each of the subtypes of irony in Task B. As can be deduced from Table 9, all teams performed best on the non-ironic and ironic by clash classes, while identifying situational irony and other irony seems to be much more challenging. Although the scores for these two classes are the lowest, we observe an important difference between situational and other verbal irony. This can probably be explained by the heterogeneous nature of the other category, which collects diverse realisations of verbal irony. A careful and manual annotation of this class, which is currently being conducted, should provide more detailed insights into this category of ironic tweets.

Conclusions
The systems that were submitted for both subtasks represent a variety of neural-network-based approaches (i.e. CNNs, RNNs and (bi-)LSTMs) exploiting word and character embeddings as well as handcrafted features. Other popular classification algorithms include Support Vector Machines, Maximum Entropy, Random Forest, and Naïve Bayes. While most approaches were based on one algorithm, some participants experimented with ensemble learners (e.g. SVM + LR, CNN + bi-LSTM, stacked LSTMs), implemented a voting system or built a cascaded architecture (for Task B) that first distinguished ironic from non-ironic tweets and subsequently differentiated between the fine-grained irony categories.
Among the most frequently used features are lexical features (e.g. n-grams, punctuation and hashtag counts, emoji presence) and sentiment- or emotion-lexicon features (e.g. based on SenticNet (Cambria et al., 2016), VADER (Hutto and Gilbert, 2014), AFINN (Nielsen, 2011)). Also important, but to a lesser extent, were syntactic (e.g. PoS-patterns) and semantic features, based on word, character and emoji embeddings or semantic clusters. The best systems for Task A and Task B obtained F1 scores of 0.705 and 0.507, respectively, and clearly outperformed the baselines provided for this task. When looking at the scores per class label in Task B, we observe that high scores were obtained for the non-ironic and ironic by clash classes, and that other irony appears to be the most challenging irony type. Among all submissions, a wide variety of preprocessing tools, machine learning libraries and lexicons were explored.
As the provided datasets were relatively small, participants were allowed to include additional training data for both subtasks. Nevertheless, most submissions were constrained (i.e. only the provided training data were used): only nine unconstrained submissions were made for Task A, and four for Task B. When comparing constrained to unconstrained systems, it can be observed that adding more training data does not necessarily benefit the classification results. A possible explanation for this is that most unconstrained systems added training data from related irony research that were annotated differently (e.g. automatically) than the distributed corpus, which presumably limited the beneficial effect of increasing the training corpus size.
This paper provides some general insights into the main methodologies and bottlenecks for binary and multiclass irony classification. We observed that, overall, systems performed much better on Task A than on Task B, and the classification results for the subtypes of irony indicate that ironic by clash is most easily recognised (top F1 = 0.697), while other types of verbal irony and situational irony are much harder (top F1 scores of 0.114 and 0.376, respectively).