Emotion Classification in a Resource Constrained Language Using Transformer-based Approach

Although research on emotion classification has progressed significantly in high-resource languages, it is still in its infancy for resource-constrained languages like Bengali. The unavailability of necessary language processing tools and the deficiency of benchmark corpora make emotion classification in Bengali more challenging and complicated. This work proposes a transformer-based technique to classify Bengali text into one of six basic emotions: anger, fear, disgust, sadness, joy, and surprise. A Bengali emotion corpus consisting of 6243 texts is developed for the classification task. Experiments are carried out using various machine learning (LR, RF, MNB, SVM), deep neural network (CNN, BiLSTM, CNN+BiLSTM) and transformer-based (Bangla-BERT, m-BERT, XLM-R) approaches. Experimental outcomes indicate that XLM-R outdoes all other techniques, achieving the highest weighted f_1-score of 69.73% on the test data.


Introduction
Emotion classification in text signifies the task of automatically attributing an emotion category, selected from a set of predetermined categories, to a textual document. With the number of users on virtual platforms growing steadily and generating online content at a fast pace, interpreting emotion or sentiment in online content is vital for consumers, enterprises, business leaders, and other concerned parties. Ekman (Ekman, 1993) defined six basic emotions based on facial features: happiness, fear, anger, sadness, surprise, and disgust. These primary types of emotions can also be extracted from text expressions (Alswaidan and Menai, 2020).
The availability of vast amounts of online data and the advancement of computational processes have accelerated the development of emotion classification research in high-resource languages such as English, Arabic, Chinese, and French (Plaza del Arco et al., 2020). However, there has been no notable progress in low-resource languages such as Bengali, Tamil and Turkish. The proliferation of Internet and digital technology usage produces enormous textual data in the Bengali language. Analyzing these massive amounts of data to extract underlying emotions is a challenging research issue in the realm of Bengali language processing (BLP). The complexity arises from various limitations, such as the lack of BLP tools, scarcity of benchmark corpora, complicated language structure, and limited resources. Considering the constraints of emotion classification in the Bengali language, this work aims to contribute the following:
• Develop a Bengali emotion corpus consisting of 6243 text documents with manual annotation to classify each text into one of six emotion classes: anger, disgust, fear, joy, sadness, surprise.
• Investigate the performance of various ML, DNN and transformer-based approaches on the corpus.
• Propose a benchmark system to classify emotion in Bengali text, with experimental validation on the corpus.

Related Work
Substantial research activities have been carried out on emotion analysis in high-resource languages like English, Arabic, and Chinese (Alswaidan and Menai, 2020). Multi-label, multi-target emotion detection of Arabic tweets was accomplished using decision trees, random forest, and KNN, where random forest provided the highest f1-score of 82.6% (Alzu'bi et al., 2019). Lai et al. (2020) proposed a graph convolution network architecture for emotion classification from Chinese microblogs, and their system achieved an F-measure of 82.32%. Recently, a few works have employed transformer-based models (i.e., BERT) to analyse emotion in texts. Huang et al. (2019) and Al-Omari et al. (2020) used a pre-trained BERT for embedding on top of an LSTM/BiLSTM to obtain improved f1-scores of 76.66% and 74.78%, respectively.
Although emotion analysis in limited-resource languages like Bengali is at a preliminary stage, a few studies have already been conducted using ML and DNN methods. Irtiza Tripto and Eunus Ali (2018) proposed an LSTM-based approach to classify multi-label emotions from Bengali and English sentences. This system considered only YouTube comments and achieved 59.23% accuracy. Another work on emotion classification in Bengali text was carried out by Azmin and Dhar (2019) concerning three emotional labels (i.e., happiness, sadness and anger). They used Multinomial Naive Bayes, which outperformed other algorithms with an accuracy of 78.6%. Pal and Karn (2020) developed a logistic regression-based technique to classify four emotions (joy, anger, sorrow, suspense) in Bengali text and achieved 73% accuracy. Das and Bandyopadhyay (2009) conducted a study to identify emotions in Bengali blog texts; their scheme attained 56.45% accuracy using a conditional random field. A recent work used SVM to classify six raw emotions on 1200 Bengali texts and obtained 73% accuracy (Ruposh and Hoque, 2019).

BEmoC: Bengali Emotion Corpus
Due to the unavailability of a standard corpus, we developed a corpus (hereafter called 'BEmoC') for emotion classification in Bengali text. The development procedure follows the guidelines stated in Dash and Ramamoorthy (2019).

Data Collection and Preprocessing
Five human crawlers were assigned to accumulate data from various online/offline sources. They manually collected 6700 text documents over three months (September 10, 2020 to December 11, 2020). The crawlers accumulated data selectively, i.e., when a crawler found a text that supported the definition of any of the six emotion classes according to Ekman (1993), the content was collected and otherwise ignored. Raw accumulated data requires the following pre-processing before annotation:
• Removal of non-Bengali words, punctuation, emoticons and duplicate data.
• Discarding data less than three words to get an unerring emotional context.
After pre-processing, the corpus holds 6523 text documents. The processed texts are eligible for manual annotation. Details of the preprocessing modules can be found in the link 1.
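The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the paper's actual module: the Unicode range, helper names, and the single-regex treatment of punctuation/emoticon removal are assumptions.

```python
import re

# Bengali script occupies the Unicode block U+0980–U+09FF; keeping only
# tokens from this range drops non-Bengali words, punctuation, and
# emoticons in one pass (an assumed simplification).
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")

def clean_text(text: str) -> str:
    return " ".join(BENGALI_TOKEN.findall(text))

def preprocess_corpus(texts):
    seen, cleaned = set(), []
    for raw in texts:
        t = clean_text(raw)
        if len(t.split()) < 3:   # discard texts shorter than three words
            continue
        if t in seen:            # remove duplicate documents
            continue
        seen.add(t)
        cleaned.append(t)
    return cleaned
```

Deduplication after cleaning (rather than before) also catches texts that differ only in punctuation or emoticons.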

Data Annotation and Quality
Five postgraduate students working on BLP were assigned for initial annotation. To choose the initial label, a majority voting technique was applied (Magatti et al., 2009). The initial labels were scrutinized by an expert with several years of research expertise in BLP, who corrected the labelling wherever the initial annotation was done inappropriately. The expert discarded 163 texts with neutral emotion and 117 texts with mixed emotions for the intelligibility of this research. To minimize bias during annotation, the expert finalized the labels through discussions and deliberations with the annotators (Sharif and Hoque, 2021). We evaluated inter-annotator agreement to ensure the quality of the annotation using coding reliability (Krippendorff, 2011) and Cohen's kappa (Cohen, 1960) scores. An inter-coder reliability of 93.1% with a Cohen's kappa score of 0.91 reflects the quality of the corpus.
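Cohen's kappa for a pair of annotators can be computed with scikit-learn; the labels below are toy stand-ins, not the real BEmoC annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative labels from two hypothetical annotators.
annotator_a = ["joy", "anger", "sadness", "joy", "fear", "joy"]
annotator_b = ["joy", "anger", "sadness", "joy", "sadness", "joy"]

# Kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.75
```

Values above 0.8 are conventionally read as near-perfect agreement, so the reported 0.91 indicates a reliably annotated corpus.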

Data Statistics
The BEmoC contains a total of 6243 text documents after the preprocessing and annotation process. Since the classifier models learn from the training set instances to obtain more insights, we further analyzed this set. Table 2 shows several statistics of the training set.

The sadness class contains the most unique words (7398), whereas the fear class contains the fewest (5072). On average, all classes have more than 20 words per text document. However, a text document in the sadness class contained the maximum number of words (107), whereas a document in the fear class contained the minimum (4). Figure 1 represents the number-of-texts versus length-of-texts distribution for each class of the corpus. Investigating this figure reveals that most of the data have a length between 15 and 35 words. Interestingly, most texts of the disgust class have a length of less than 30. The joy and sadness classes have an almost similar number of texts across all length distributions.

For quantitative analysis, the Jaccard similarity among the classes has been computed. We used the 200 most frequent words from each emotion class, and the similarity values are reported in Table 3. The anger-disgust and joy-surprise pairs hold the highest similarities of 0.58 and 0.51, respectively. These scores indicate that more than 50% of the frequent words are common in these pairs of classes. On the other hand, the joy-fear pair has the least similarity index, which clarifies that this pair's frequent words are more distinct than those of other classes. These similarity issues can substantially affect the emotion classification task. Some sample instances of BEmoC are also shown.

Table 3: Jaccard similarity between the emotion class pairs: anger (c1), disgust (c2), fear (c3), joy (c4), sadness (c5), surprise (c6).

Methodology

Figure 2 shows an abstract view of the strategies used. Various feature extraction techniques such as TF-IDF, Word2Vec, and FastText are used to train the ML and DNN models. Moreover, we also investigate the emotion classification performance on Bengali text using transformer-based models. All the models are trained and tuned on the identical dataset.
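The Jaccard similarity over each class's most frequent words can be computed as below; the helper name and toy inputs are illustrative:

```python
from collections import Counter

def jaccard_top_words(texts_a, texts_b, k=200):
    """Jaccard similarity between the k most frequent words of two classes."""
    top_a = {w for w, _ in Counter(" ".join(texts_a).split()).most_common(k)}
    top_b = {w for w, _ in Counter(" ".join(texts_b).split()).most_common(k)}
    return len(top_a & top_b) / len(top_a | top_b)
```

Applying this to every pair of the six classes (with k=200) yields the values reported in Table 3.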

Feature Extraction
ML and DNN algorithms are unable to learn from raw texts. Therefore, feature extraction is required to train the classifier models.
TF-IDF: Term frequency-inverse document frequency (TF-IDF) is a statistical measure that determines the importance of a word to a document in a collection of documents. Uni-gram and bi-gram features are extracted from the most frequent 20000 words of the corpus.
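A minimal sketch of this TF-IDF setup with scikit-learn (the toy English documents are placeholders for the Bengali corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Uni-gram and bi-gram features, with the vocabulary capped at the
# 20000 most frequent terms as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)

docs = ["good morning", "good night", "bad morning"]  # toy documents
X = vectorizer.fit_transform(docs)
print(X.shape)  # (n_documents, vocabulary size)
```

The resulting sparse matrix is what the ML classifiers consume directly.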
Word2Vec: It utilizes neural networks to capture the semantic similarity of words from their contexts in a corpus (Mikolov et al., 2013). We trained Word2Vec with Skip-Gram using a window size of 7, a minimum word count of 4, and an embedding dimension of 100.
FastText: This technique uses subword information to capture semantic relationships (Bojanowski et al., 2017). We trained FastText with Skip-Gram using character n-grams of length 5, a window size of 5, and an embedding dimension of 100.
Pre-trained vectors for Bengali, trained on generalized Bengali Wikipedia dump data, are available for both Word2Vec and FastText (Sarker, 2021). We observed that the deep learning models perform better with vectors trained on our developed BEmoC than with the pre-trained vectors.

ML Approaches
We began our investigation of the emotion detection system with ML models. Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF) and Multinomial Naive Bayes (MNB) techniques are employed using the TF-IDF text vectorizer. For LR, the 'lbfgs' solver and 'l1' penalty are chosen, and the C value is set to 1. The same C value with a 'linear' kernel is used for SVM. Meanwhile, 'n_estimators' is set to 100 for RF and 'alpha=1.0' is chosen for MNB. A summary of the parameters chosen for the ML models is provided in Table 6 (Appendix A).
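A sketch of the LR setup as a scikit-learn pipeline. The toy texts, labels, and the default solver/penalty are illustrative assumptions (only C=1 is taken from the paper); the real pipeline would be fitted on the TF-IDF features of BEmoC.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features (uni- and bi-grams) feeding a logistic regression
# classifier with C=1, mirroring the configuration described above.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),
])

texts = ["so happy today", "very sad news", "happy happy joy", "sad and gloomy"]
labels = ["joy", "sadness", "joy", "sadness"]
clf.fit(texts, labels)
print(clf.predict(["happy news"]))
```

Swapping the final step for `LinearSVC`, `RandomForestClassifier`, or `MultinomialNB` reproduces the other three ML baselines.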

DNN Approaches
Variations of deep neural networks (DNNs) such as CNN, BiLSTM and a combination of CNN and BiLSTM (CNN+BiLSTM) are investigated for the emotion classification task in Bengali. To train all the DNN models, the 'adam' optimizer with a learning rate of 0.001 and a batch size of 16 is used for 35 epochs. 'sparse_categorical_crossentropy' is selected as the loss function.
CNN: Convolutional Neural Network (CNN) (LeCun et al., 2015) is tuned over the emotion corpus. The trained weights from the Word2Vec/FastText embeddings are fed to the embedding layer to generate a sequence matrix. The sequence matrix is then passed to the convolution layer having 64 filters of size 7. The convolution layer's output is max-pooled over time and then transferred to a fully connected layer with 64 neurons. 'ReLU' activation is used in the corresponding layers. Finally, an output layer with softmax activation is used to compute the probability distribution of the classes.
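The CNN architecture above can be sketched in Keras as follows. The vocabulary size and sequence length are assumed values, and in the actual system the embedding layer would be initialized with the trained Word2Vec/FastText weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 20000, 100, 100, 6  # assumed sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),           # sequence matrix
    layers.Conv1D(64, 7, activation="relu"),         # 64 filters of size 7
    layers.GlobalMaxPooling1D(),                     # max-pooling over time
    layers.Dense(64, activation="relu"),             # fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"), # class probabilities
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The compile settings mirror the shared DNN training configuration ('adam', learning rate 0.001, sparse categorical cross-entropy).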
BiLSTM: Bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) is a variation of the recurrent neural network (RNN). The developed BiLSTM network consists of an embedding layer similar to the CNN's, a BiLSTM layer with 32 hidden units, and a fully connected layer of 16 neurons with 'ReLU' activation. An output layer with 'softmax' activation is used.
CNN+BiLSTM: An embedding layer is followed by a 1D convolutional layer with 64 filters of size three and a 1D max-pool layer, on top of which sit two BiLSTM layers with 64 and 32 units. Outputs of the final BiLSTM layer are fed to an output layer with 'softmax' activation.
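A Keras sketch of the combined CNN+BiLSTM model, under the same assumed vocabulary size, sequence length, and pool size (the pool size is not stated in the text):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 20000, 100, 100, 6  # assumed sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),
    layers.Conv1D(64, 3, activation="relu"),            # 64 filters of size 3
    layers.MaxPooling1D(pool_size=2),                   # 1D max-pooling
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),              # second BiLSTM layer
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```

Note that the first BiLSTM must return full sequences (`return_sequences=True`) so the second BiLSTM layer receives a time-distributed input.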

Transformer Models
We used three transformer models on BEmoC: m-BERT, Bangla-BERT, and XLM-R. In recent years, transformers have been used extensively in classification tasks to achieve state-of-the-art results (Chen et al., 2021). The models are culled from the Huggingface 2 transformers library and fine-tuned on the emotion corpus using the Ktrain (Maiya, 2020) package.
m-BERT: m-BERT (Devlin et al., 2019) is a transformer model pre-trained over 104 languages with more than 110M parameters. We employed the 'bert-base-multilingual-cased' model and fine-tuned it on BEmoC with a batch size of 12.
Bangla BERT: Bangla BERT (Sarker, 2020) is a pre-trained BERT masked language model trained on a sizeable Bengali corpus. We used the 'sagorsarker/bangla-bert-base' model and fine-tuned it to fit the pre-trained model to BEmoC. A batch size of 16 is used, as it provided better results.
XLM-R: XLM-R (Liu et al., 2019) is a sizeable multilingual language model trained on 100 different languages. We implemented the 'xlm-roberta-base' model on BEmoC with a batch size of 12.
All the transformer models are trained for 20 epochs with a learning rate of 2e-5. Using checkpoints, the best intermediate model is stored to predict on the test data.
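The fine-tuning recipe can be sketched with Ktrain as a configuration fragment. This is a hedged sketch, not the paper's exact script: `X_train`/`y_train`/`X_test`/`y_test` are assumed to hold raw Bengali strings and label names, and `maxlen=70` is an assumed sequence length (the paper does not state one). Only the model name, batch size, learning rate, epoch count, and checkpointing come from the text.

```python
import ktrain
from ktrain import text

CLASSES = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

# Wrap the Huggingface checkpoint (here XLM-R; swap in
# 'bert-base-multilingual-cased' or 'sagorsarker/bangla-bert-base'
# with the batch sizes given above for the other two models).
t = text.Transformer("xlm-roberta-base", maxlen=70, class_names=CLASSES)
trn = t.preprocess_train(X_train, y_train)
val = t.preprocess_test(X_test, y_test)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=12)
# 20 epochs at lr 2e-5; checkpointing keeps the best intermediate model.
learner.fit_onecycle(2e-5, 20, checkpoint_folder="checkpoints")
```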

Results and Analysis
This section presents a comprehensive performance analysis of the various ML, DNN, and transformer-based models for classifying emotion in Bengali texts. The superiority of the models is determined based on the weighted f1-score; however, the precision (Pr), recall (Re) and accuracy (Acc) metrics are also considered. Table 4 reports the evaluation results of all models.
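The weighted metrics can be computed with scikit-learn as below; the toy labels stand in for the real test-set predictions:

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Toy ground truth and predictions illustrating the weighted averaging
# used for the scores in Table 4 (weights are the per-class supports).
y_true = ["joy", "joy", "sadness", "anger", "fear"]
y_pred = ["joy", "sadness", "sadness", "anger", "fear"]

pr, re, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
acc = accuracy_score(y_true, y_pred)
print(f"Pr={pr:.2f} Re={re:.2f} F1={f1:.2f} Acc={acc:.2f}")
```

Weighted averaging is a reasonable choice here because, unlike macro averaging, it accounts for the class imbalance noted in the corpus statistics.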
Among the ML approaches, LR achieved the highest f1-score (60.75%), outperforming RF (52.78%), MNB (48.67%) and SVM (59.54%). LR also performed better in Pr, Re and Acc than the other ML models. Among the DNNs, BiLSTM with FastText outperformed the other approaches on all evaluation parameters, achieving an f1-score of 56.94%. However, BiLSTM (FastText) achieved about a 4% lower f1-score than the best ML method (i.e., LR).
After employing the transformer-based models, a significant increase is observed in all scores.
Among the transformer-based models, Bangla-BERT achieved the lowest f1-score, 61.91%. Even so, this model outperformed the best ML and DNN approaches (60.75% for LR and 56.94% for BiLSTM (FastText)).
Meanwhile, m-BERT shows an almost 3% higher f1-score (64.39%) than Bangla-BERT (61.91%). The XLM-R model shows an immense improvement of about 6% over Bangla-BERT and 5% over m-BERT. It achieved an f1-score of 69.73%, the highest among all models.

Error Analysis
It is evident from Table 4 that XLM-R is the best-performing model for classifying emotion in Bengali texts. A detailed error analysis is performed using the confusion matrix. Figure 3 illustrates the class-wise proportion of predicted labels. In the fear class, 6 data points out of 83 are mistakenly classified as sadness. The sadness class has the highest misclassification ratio (15.13%): 18 data points out of 119 in the sadness class are misclassified as disgust.
Moreover, among the 73 data points in the surprise class, 9 are predicted as sadness. The error analysis reveals that the fear class achieved the highest rate of correct classification (77.15%) while surprise gained the lowest (61.64%). A possible reason for the incorrect predictions might be the class-imbalanced nature of the corpus. However, the high Jaccard similarity values (Table 3) also reveal some interesting points. Some words are used for multiple purposes across multiple classes; for instance, hate words can be used to express both anger and disgust. Moreover, emotion classification is highly subjective and depends on the individual's perception; people may contemplate a sentence in many ways (LeDoux and Hofmann, 2018). Thus, by developing a balanced dataset with diverse data, incorrect predictions might be reduced to some extent.
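The class-wise error counts behind this analysis come from a standard confusion matrix; the toy labels below are illustrative, not the real test predictions:

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predictions; fixing `labels` keeps
# all six BEmoC classes in a stable order even if some are absent.
LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
y_true = ["fear", "fear", "sadness", "sadness", "surprise"]
y_pred = ["fear", "sadness", "disgust", "sadness", "sadness"]

cm = confusion_matrix(y_true, y_pred, labels=LABELS)
print(cm)
```

Dividing each row by its sum gives the per-class proportions shown in Figure 3.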

Comparison with Recent Works
The analysis of results revealed that XLM-R is the best model to classify emotion in Bengali texts. Thus, we compare the performance of XLM-R with existing techniques to assess its effectiveness. We implemented previous methods (Irtiza Tripto and Eunus Ali, 2018; Azmin and Dhar, 2019; Pal and Karn, 2020; Ruposh and Hoque, 2019) on BEmoC and report the outcomes in f1-score in Table 5.

Sample instances of BEmoC (English translations):
• Anger: "There will be a procession of corpses this time. The dead bodies will be lying in front of the eyes; no one will come to bury them. That day is coming soon, brother."
• Fear: "It is a shame to say, but the truth is that as a nation we are rapist and barbaric in nature. Otherwise, how is it possible to post someone else's mobile number on social media!"
• Disgust: "I was smoking and got shocked! A short message: 'I'm coming!' I couldn't believe my eyes that he was really coming."