Cross-lingual sentiment classification in low-resource Bengali language

Sentiment analysis research in low-resource languages such as Bengali remains largely unexplored due to the scarcity of annotated data and the lack of text processing tools. Therefore, in this work, we focus on generating resources and showing the applicability of the cross-lingual sentiment analysis approach in Bengali. For benchmarking, we created and annotated a comprehensive corpus of around 12000 Bengali reviews. To address the lack of standard text-processing tools in Bengali, we leverage resources from English through machine translation. We determine the performance of supervised machine learning (ML) classifiers on the machine-translated English corpus and compare it with their performance on the original Bengali corpus. In addition, we examine sentiment preservation in the machine-translated corpus using Cohen’s Kappa and Gwet’s AC1. To circumvent the laborious data labeling process, we explore lexicon-based methods and study the applicability of cross-domain labeled data from the resource-rich language. We find that supervised ML classifiers show comparable performance on the Bengali and machine-translated English corpora. By utilizing labeled data, they achieve 15%-20% higher F1 scores than both lexicon-based and transfer learning-based methods. We also observe that machine translation does not alter the sentiment polarity of the review in most cases. Our experimental results demonstrate that the machine translation based cross-lingual approach can be an effective way to perform sentiment classification in Bengali.


Introduction
Sentiment analysis classifies the semantic orientation of a text. With the rapid growth of user-generated content, it has become essential to determine user opinions, attitudes, and feelings from textual data. In the literature, researchers have identified the sentiment orientation of text at various levels, such as document, sentence, or aspect, and have employed both machine learning-based and lexicon-based approaches. Utilizing labeled data, supervised ML classifiers such as Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machines (SVM) (Pang et al., 2002; Gamon, 2004) and deep learning-based classifiers (Abdi et al., 2019; Araque et al., 2017) have been employed for sentiment classification. Though the lexicon-based methods (Turney, 2002) do not require labeled data, they suffer from the lexicon coverage problem and are not robust to the ambiguity and linguistic variations of natural languages.
Though English and a few other languages enjoy ample resources for sentiment analysis, such resources are not available in many other languages. Cross-lingual sentiment classification aims to leverage resources such as labeled data, polarity lexicons, contextual valence shifters, and modifiers from resource-rich languages (typically English) to classify the sentiment polarity of text written in a low-resource language (such as Bengali). For language mapping, researchers have used several approaches, such as machine translation (Banea et al., 2008a; Wan, 2009; Demirtas and Pechenizkiy, 2013; Zhou et al., 2016a,b; Abdalla and Hirst, 2017; Balahur and Turchi, 2014) and cross-lingual word embedding (Barnes et al., 2018; Xu and Yang, 2017; AP et al., 2014).

Motivation
A limited amount of research in sentiment analysis has been conducted in Bengali in the last few decades; however, there is still no benchmark dataset. Researchers have used their own curated datasets, which are not publicly available, and this absence of public datasets makes the research findings non-reproducible. Moreover, without a benchmark dataset, it is challenging to compare the performance of various approaches.
Though cross-lingual approaches have been successfully applied to several low-resource languages (Meng et al., 2012; Banea et al., 2008b), in Bengali only a few works have utilized them, for tasks such as sentiment lexicon creation (Das and Bandyopadhyay, 2010a; Sazzed, 2020) and sentiment classification (Sazzed and Jayarathna, 2019). However, until now, no comprehensive study has explored the applicability of the cross-lingual sentiment classification approach in Bengali. Therefore, in this work, we created and annotated a large Bengali review dataset for binary-level sentiment analysis. This corpus consists of around 12000 Bengali reviews collected from YouTube. We present a comprehensive study of the machine-translation based cross-lingual approach to sentiment analysis in Bengali.
Using a large and well-annotated dataset, we compare and provide detailed analysis regarding the performance of ML classifiers in the Bengali and machine-translated datasets. Besides, using Cohen's kappa and ML classifiers, we examine sentiment preservation in the machine-translated corpus.
As annotated data are not always obtainable, especially in low-resource languages, we investigate the performance of unsupervised lexicon-based methods on the machine-translated corpus. Three popular lexicon-based sentiment analysis methods, VADER (Hutto and Gilbert, 2014), TextBlob 1, and SentiStrength (Thelwall et al., 2010), are applied and their relative performances are compared.
We investigate the applicability of a simple transfer learning-based approach to the machine-translated corpus. A resource-rich language such as English has copious labeled data, which are not available in Bengali. Utilizing machine translation and cross-domain labeled data, we show the performance of supervised ML classifiers on the translated corpus.

Contribution
Our major contributions can be summarized as follows:
• We introduce a large well-annotated benchmark dataset for sentiment analysis in Bengali.
• We perform a comparative evaluation of supervised ML classifiers on the Bengali and machine-translated English corpora and provide a rigorous analysis of the results.
• We investigate cross-lingual lexicon-based methods, as well as a transfer learning-based approach, to deal with the lack of labeled data in Bengali.
1 https://textblob.readthedocs.io/

Literature Review

Sentiment Analysis in Bengali
English is the dominant language for sentiment analysis research due to commercial interest and a large research community. In recent years, with the popularity of e-commerce and social networking sites, review data is becoming available in other languages.
In Bengali, limited research has been performed using corpora collected from various sources such as microblogs, Facebook, and other social media (Patra et al., 2015; Das and Bandyopadhyay, 2010b). Various supervised classifiers have been employed for Bengali sentiment analysis, such as SVM with maximum entropy (Chowdhury and Chowdhury, 2014), Naive Bayes (NB) (Islam et al., 2016b), Deep Neural Network (Tripto and Eunus Ali, 2018), and Convolutional Neural Network (CNN) (Sarkar, 2019). Al-Amin et al. (2017) utilized word2vec and polarity scores for the binary sentiment analysis problem. A word-embedding based approach was proposed by Islam et al. (2016a). Hassan et al. (2016) predicted the sentiment orientation of Bengali and Romanized Bengali text using Long Short-Term Memory (LSTM).

Cross-lingual Sentiment Analysis
Cross-lingual sentiment analysis approaches have been studied in many languages. Mihalcea et al. (2007) leveraged the tools and resources available in English to generate subjectivity analysis resources in Romanian. They created a Romanian subjectivity lexicon translated from the English lexicon and utilized a corpus-based approach. Balamurali et al. (2012) presented an alternative approach to cross-lingual sentiment analysis (CLSA) using WordNet senses as features for supervised sentiment classification. They used the linked WordNets of two languages to bridge the language gap and reported their results on two Indian languages, Hindi and Marathi. Balahur and Turchi (2014) investigated the performance and effectiveness of machine translation systems and supervised methods for multilingual sentiment analysis. In their experiment, they used four languages (English, German, Spanish, and French), three machine translation systems (Google, Bing, and Moses), several supervised algorithms, and various types of features. Yan et al. (2014) proposed a bilingual approach based on the SVM algorithm for sentiment analysis in a Chinese social media dataset. Meng et al. (2012) proposed a cross-lingual mixture model (CLMM) to exploit an unlabeled bilingual parallel corpus. Banea et al. (2008b) utilized a machine translation system for projecting resources from English to Romanian and Spanish and provided a comparative performance analysis. Chen et al. (2015) proposed a semi-supervised learning model, CredBoost, to address cross-lingual sentiment analysis in English and Chinese. They introduced a knowledge validation step during transfer learning to reduce the noisy data caused by machine translation errors.
Feng and Wan (2019) proposed a cross-lingual sentiment analysis (CLSA) model by leveraging unlabeled data in multiple languages and domains. Without using any supervised cross-lingual word embedding (CLWE), their model outperformed baseline methods on multilingual Amazon review datasets. Xu et al. (2018) proposed a learning approach that does not require any cross-lingual labeled data. Their algorithm optimizes the transformation functions of monolingual word-embedding spaces and uses a neural network. They evaluated their approach on benchmark datasets for cross-lingual word similarity prediction and found its performance competitive with other methods. Chen et al. (2018) introduced an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data in source languages to the target language. Their experiments on Chinese and Arabic sentiment classification demonstrated the superior performance of ADAN. Rasooli et al. (2018) used multiple source languages to learn a robust sentiment transfer model. They explored the potential of both an annotation projection approach and a direct transfer approach using cross-lingual word representations and neural networks.
The cross-lingual approach to sentiment analysis in Bengali is still largely unexplored; only a few works have investigated it (Das and Bandyopadhyay, 2010a; Sazzed and Jayarathna, 2019; Sazzed, 2020). Das and Bandyopadhyay (2010a) translated an English polarity lexicon into Bengali to create a Bengali sentiment dictionary. Sazzed and Jayarathna (2019) utilized two small datasets and n-gram (i.e., unigram and bigram) feature vectors to compare the performance of supervised ML algorithms on Bengali and machine-translated English corpora. They found that supervised ML algorithms performed better when trained on the translated corpus; however, they did not provide a thorough analysis of the results they reported.
In contrast to previous studies, we perform a comprehensive analysis of various cross-lingual sentiment analysis approaches in Bengali. We create a benchmark dataset, explore several classification approaches utilizing labeled and unlabeled data, examine the applicability of transfer learning, investigate sentiment preservation in the translated corpus, and finally provide directions for future research. To the best of our knowledge, this is the first extensive attempt to investigate the applicability of the cross-lingual approach in Bengali sentiment analysis.

Dataset
One of the barriers to sentiment analysis research in Bengali is the lack of publicly available review datasets. In the literature, researchers have reported results using their own curated datasets, which are not publicly available. The few publicly available datasets are either small or not well annotated. Therefore, we have prepared a well-annotated Bengali review dataset and made it publicly available. 2

Data Collection
We collected and manually labeled a large review dataset for sentiment analysis in Bengali. This dataset contains viewer opinions towards several Bengali dramas. Using a web scraping tool, we first downloaded the raw JSON data from YouTube, which contains information such as user name, id, timestamp, comments, and likes/dislikes. We then used a parsing script to extract the viewers' comments from the JSON data. The comments are written in Bengali, English, or Romanized Bengali, or use code-mixing. As we are only interested in reviews written in Bengali text, we excluded non-Bengali comments, using a language detection library 3 to identify the Bengali ones. After removing the non-Bengali comments, the corpus contains around 15000 reviews, which are labeled using the procedure described in the next section.
Figure 1: Example of Bengali and machine translated reviews

Data Annotation
Two native Bengali speakers classified these 15000 reviews into three categories: positive, negative, and non-subjective. From the annotator ratings, we observe an inter-rater agreement of around 0.83 using Cohen's kappa. We exclude all reviews that are marked as non-subjective by either of the annotators.
We include a subjective review in the corpus only if both annotators assign it to the same category (i.e., positive or negative). Therefore, our dataset contains only highly polarized reviews. Reviews that are ambiguous or contain mixed sentiment are not included in the dataset.
The final labeled corpus consists of 11807 annotated reviews, each containing roughly 2 to 300 Bengali words. This corpus is class-imbalanced, comprising 3307 negative and 8500 positive reviews. Figure 1 shows some examples of negative and positive reviews. We made this corpus publicly available to researchers.
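The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_by_agreement`, the label strings, and the toy input are hypothetical and do not come from the released corpus or its tooling.

```python
from collections import Counter

# Hypothetical label names used only for this illustration.
POSITIVE, NEGATIVE, NON_SUBJECTIVE = "positive", "negative", "non-subjective"


def filter_by_agreement(annotations):
    """Keep reviews to which both annotators assigned the same polar label.

    `annotations` is an iterable of (review_text, label_a, label_b) tuples.
    A review is dropped when either label is non-subjective or when the two
    polarity labels disagree.
    """
    kept = []
    for text, label_a, label_b in annotations:
        if NON_SUBJECTIVE in (label_a, label_b):
            continue  # marked non-subjective by at least one annotator
        if label_a != label_b:
            continue  # annotators disagree on polarity
        kept.append((text, label_a))
    return kept


# Toy input invented for the example, not taken from the corpus.
toy = [
    ("review 1", POSITIVE, POSITIVE),
    ("review 2", POSITIVE, NEGATIVE),
    ("review 3", NEGATIVE, NON_SUBJECTIVE),
    ("review 4", NEGATIVE, NEGATIVE),
]
corpus = filter_by_agreement(toy)
label_counts = Counter(label for _, label in corpus)
print(corpus)
print(label_counts)
```

Only reviews 1 and 4 survive in this toy input, which mirrors how the rule leaves a corpus of unambiguous, highly polarized reviews.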

Cross-lingual Sentiment Analysis in Bengali
As Bengali is a resource-poor language, we leverage sentiment lexicons and labeled data from English for sentiment analysis in Bengali. We investigate the performance of various sentiment analysis approaches (i.e., supervised, unsupervised, and transfer learning-based) that utilize resources from English. Figure 2 gives an overview of the approaches we studied.

Language Mapping
The machine translation (MT) service is one of the most common ways to build the language connection (Wan, 2008a, 2009; Wei and Pal, 2010

Supervised Classification Approach
Supervised ML-based approaches have been successfully applied in English and other languages for sentiment classification. Since supervised ML classifiers do not rely on language resources such as sentiment lexicons or part-of-speech (POS) taggers, they can be applied to any language. In contrast to rigid rule-based methods, supervised ML algorithms learn hidden patterns from the training data; therefore, they can be more robust to the noisy machine-translated English corpus.
Figure 2: Various approaches of cross-lingual sentiment analysis in Bengali
Utilizing the annotated data, we employ four supervised ML classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Extremely Randomized Trees (ET) on the Bengali corpus and its machine-translated English version. We use the scikit-learn (Pedregosa et al., 2011) implementation of these classifiers with the default parameter settings. To deal with the class imbalance problem, we set the weight of a class inversely proportional to the number of instances it contains. Both unigram and bigram features are used as input for the ML classifiers. We perform 10-fold cross-validation on both the Bengali and translated English corpora.
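Two ingredients of this setup, the unigram and bigram features and the inverse-frequency class weights, can be illustrated with a small standard-library sketch. It is not the experimental pipeline itself: the helper names are hypothetical, tokenization is a plain whitespace split, and the weight formula n_samples / (n_classes * class_count) is the one scikit-learn applies when `class_weight="balanced"` is set. Whether that exact option was used here is an assumption.

```python
from collections import Counter


def ngram_features(text, n_values=(1, 2)):
    """Return a bag of unigram and bigram counts for a whitespace-tokenized text."""
    tokens = text.lower().split()
    bag = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented example sentence; the class sizes are those reported for the corpus.
features = ngram_features("the drama was not good")
labels = ["neg"] * 3307 + ["pos"] * 8500
weights = inverse_frequency_weights(labels)
print(sorted(features))
print({cls: round(w, 3) for cls, w in weights.items()})
```

With the reported class sizes, the minority negative class receives a weight above 1 and the majority positive class a weight below 1, so errors on negative reviews cost more during training. The bigram "not good" also shows why bigrams help: a unigram model sees only "not" and "good" separately.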

Lexicon-based Approach
To deal with the scenario in which annotated data are not available, we study the performance of lexicon-based methods on the machine-translated English corpus. In Bengali, no standard lexicon-based tool is publicly available for sentiment analysis; therefore, we could not compare the performance with a Bengali counterpart.
Three popular lexicon-based methods from English, VADER, TextBlob, and SentiStrength, are employed to assess the effectiveness of the cross-lingual unsupervised approach.
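To make the general idea of lexicon-based classification concrete, the following toy scorer sums word polarities from a tiny dictionary and flips the sign of the word immediately following a negator. It is a deliberately simplified sketch: the lexicon, the negator list, and the tie-breaking rule are all invented for illustration, and it does not reproduce the rules of VADER, TextBlob, or SentiStrength.

```python
# Hypothetical mini-lexicon; real tools ship far larger, curated lexicons.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
               "bad": -1.0, "boring": -1.5, "worst": -2.0}
NEGATORS = {"not", "never", "no"}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum word polarities, flipping the sign of the word right after a negator."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        value = lexicon.get(word, 0.0)  # out-of-lexicon words contribute nothing
        score += -value if negate else value
        negate = False
    return score


def classify(text):
    """Map the aggregate score to a binary label; ties default to positive."""
    return "positive" if lexicon_polarity(text) >= 0 else "negative"


print(classify("The drama was great!"))
print(classify("not good, very boring"))
print(classify("an utterly tedious episode"))
```

The last call illustrates the coverage problem: none of its words are in the toy lexicon, so the score is zero and the label falls back to the default regardless of the actual sentiment.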

Transfer Learning-based Approach
Annotated data are hard to obtain in low-resource languages such as Bengali, whereas resource-rich languages like English have a vast amount of labeled data. Hence, we explore the applicability of a transfer learning-based approach to the machine-translated corpus. In this work, we do not introduce any new transfer learning method. Instead, we examine whether existing cross-domain labeled data help achieve acceptable sentiment classification performance in Bengali when in-domain labeled data are not available.
In the transfer setting, a classifier is trained on one distribution and applied to a different distribution. The idea is to leverage labeled data from distinct domains for a similar task, as annotated in-domain data are not always available.
We employ multiple cross-domain datasets from the English language: IMDB (Maas et al., 2011), Yelp 5, TripAdvisor (Thelwall, 2018), Clothing 6, UCI Drug 7, and WebMD 8, as shown in Table 1. We train the Logistic Regression (LR) classifier on the cross-domain datasets and use the trained model to predict the semantic orientations of reviews in our machine-translated corpus. The default parameter settings of the LR classifier of the scikit-learn (Pedregosa et al., 2011) library are used with class-balanced weights.

Evaluation Criteria
To compare the performances of various classifiers, we compute precision, recall, macro F1 score, and accuracy. As our dataset is class-imbalanced, the macro F1 score is a better evaluation metric than accuracy.
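A short worked example shows why accuracy can mislead on this class distribution. The sketch below scores a hypothetical majority-class baseline that labels every review positive, using the reported class sizes of 3307 negative and 8500 positive reviews. The function names are illustrative, and the baseline is not one of the evaluated systems.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating that class as the positive outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["neg"] * 3307 + ["pos"] * 8500
y_pred = ["pos"] * len(y_true)  # hypothetical majority-class baseline

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

The baseline reaches roughly 72% accuracy without identifying a single negative review, while its macro F1 stays near 0.42 because the negative class contributes an F1 of zero.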
In addition, we assess the agreement between the predictions of each supervised ML classifier on the Bengali corpus and on the machine-translated English corpus using Cohen's kappa and Gwet's AC1 statistics. Cohen's kappa and Gwet's AC1 are statistical measures used to gauge inter-rater reliability, where a score of 1 refers to perfect agreement. The purpose of evaluating the agreement is to determine the degree of sentiment preservation in the machine-translated English corpus.
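Both statistics correct the observed agreement for agreement expected by chance and differ only in how that chance term is estimated. The sketch below follows the standard two-rater definitions: Cohen's kappa uses the product of each rater's marginal proportions, and Gwet's AC1 uses the averaged marginals. The function name and the toy ratings are illustrative assumptions, and the code expects at least two observed categories.

```python
from collections import Counter


def agreement_scores(ratings_a, ratings_b):
    """Return (Cohen's kappa, Gwet's AC1) for two equally long rating lists."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)

    # Cohen: chance agreement from the product of the two raters' marginals.
    p_e_kappa = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    # Gwet: chance agreement from the average marginal of each category.
    pi = {c: (freq_a[c] + freq_b[c]) / (2 * n) for c in categories}
    p_e_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (len(categories) - 1)

    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return kappa, ac1


# Toy predictions invented for the example (9 of 10 items agree).
preds_bengali = ["pos"] * 4 + ["neg", "neg"] + ["pos"] * 4
preds_english = ["pos"] * 4 + ["neg", "pos"] + ["pos"] * 4
kappa, ac1 = agreement_scores(preds_bengali, preds_english)
print(f"kappa = {kappa:.3f}, AC1 = {ac1:.3f}")
```

In this skewed toy example the two measures diverge noticeably even though 90% of the labels match, which is one reason to report both on a class-imbalanced corpus.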

Supervised Approach
In this section, we provide the comparative performances of ML classifiers on the Bengali and machine-translated English corpora and the agreement of their predictions.

Performance Comparison
Supervised ML classifiers show similar performance in both Bengali and translated English corpus, as shown in

Agreement of Predictions
We compute the agreement of the predictions of ML classifiers on the Bengali and machine-translated English corpora. The purpose is to examine whether the noise induced by machine translation changes the sentiment orientations of the translated reviews. When the sentiment orientation is maintained in the translated corpus, we can expect a high agreement between the predictions of an ML classifier on a Bengali review and on its machine-translated version.

Lexicon and Transfer learning-based Approaches
Table 4 shows the results of the lexicon-based methods on the translated corpus. VADER and TextBlob exhibit similar F1 scores and accuracies, while SentiStrength performs relatively worse. Using VADER, we achieve an F1 score of 0.771 and an accuracy of 82.56%, while TextBlob obtains 0.776 and 82.79%, respectively. Table 5 provides the results of the LR classifier utilizing cross-domain data. The best performance is obtained by combining all cross-domain datasets, which is 0.78 for the F1 score and 82% for the accuracy.

Discussion

Supervised Approach
Table 2 shows that supervised ML classifiers provide similar performance on the translated corpus and the original Bengali corpus. We found that several factors contribute to the comparable performance on the machine-translated corpus.

Error Correction
Misspelling is common in online Bengali content due to the complexity of the Bengali writing system and the education level of most internet users. Modern machine translation tools are trained on a huge amount of data and are capable of correcting misspellings. Although the Bengali-English machine translation system is not as sophisticated as those for some major language pairs, it can occasionally identify misspelled words in Bengali text and translate them to the correct English words. In those cases, machine translation improves the quality of the data, which in turn improves classifier performance.

Word Mapping
The current Bengali-English machine translation system still lacks sufficient coverage. We observed that, in some cases, synonymous Bengali words are mapped to the same English word. This word mapping helps supervised ML classifiers perform well on the machine-translated corpus.

Regional Variety of Bengali
The Bengali language has a large variety of dialects that are widely used on the web, especially in social media. A machine translation service trained on thousands of corpora can identify dialectal forms as variants of the same word and translate them to the same English word, which positively impacts the performance of ML classifiers on the translated corpus.

Feature Importance and Sentiment Preservation
Supervised ML algorithms utilize the bag-of-words model to train the classifiers. The term frequency-inverse document frequency (tf-idf) score is calculated and used as the input feature vector. tf-idf is a numerical statistic that reflects the importance of a word with respect to a collection of documents; it captures the fact that not all the words in a document are equally important for classification. Abdalla and Hirst (2017) showed that sentiment is highly preserved even in the face of poor translation accuracy. Therefore, low-quality translation does not always affect classifier accuracy.
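The tf-idf weighting mentioned above can be illustrated with a minimal sketch that uses the textbook form tf * log(N / df). This is an assumption made for clarity: library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed idf and vector normalization by default, so their absolute values differ from the ones computed here. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents invented for the example.
docs = ["the drama was great", "the drama was boring", "the acting was great"]
scores = tfidf(docs)
for doc, weight in zip(docs, scores):
    print(doc, {term: round(w, 3) for term, w in weight.items()})
```

Words that occur in every document, such as "the" and "was", receive a weight of zero, while a rarer, sentiment-bearing word such as "boring" receives the largest weight. This is consistent with the observation that a translation can be imperfect overall and still keep the few words that carry the sentiment.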
The Cohen's kappa and AC1 scores reveal the sentiment consistency between the original Bengali reviews and their machine-translated versions, as shown in Table 3. The Cohen's kappa and AC1 scores for SVM and LR show nearly perfect agreement between the predictions on the Bengali and translated English corpora. For RF and ET, the Cohen's kappa and AC1 statistics are somewhat lower than for SVM and LR, which could be due to the inferior performance of those classifiers; however, the agreement is still substantial.

Lexicon-based Approach
TextBlob and VADER exhibit similar accuracy, precision, recall, and F1 scores, while SentiStrength performs worse. The results demonstrate that lexical rule-based methods are not as robust as supervised ML approaches, as shown by their lower scores in all categories. In particular, the recall scores are quite low due to the non-comprehensive coverage of the lexicons. The poor performance of the rule-based approach mainly stems from the intrinsic nature (e.g., lexicon/rule coverage) of lexicon-based methods.

Transfer Learning with Cross-domain Datasets
The results obtained using the LR classifier and the cross-domain datasets indicate that the classifier's performance depends on both:
• Data distribution and
• Size of the training dataset
The IMDB movie review dataset is the most similar to our translated drama review dataset considering the essence of the reviews. However, they still differ in aspects of the data, the language used in the reviews, and the presence of noise due to machine translation. The translated drama reviews are much shorter and contain simpler English words than the IMDB reviews, which are written mostly by native English speakers. Utilizing 25000 reviews from IMDB, we achieve the best performance among all the cross-domain datasets used. Leveraging data from different domains, such as clothing or drug reviews, yields worse performance despite using a training dataset of similar or larger size, which demonstrates the domain specificity of sentiment analysis datasets.
We consolidate all six cross-domain datasets to create a large corpus of over 130k reviews. The supervised LR classifier exhibits a performance improvement with this aggregated dataset. The results indicate that although datasets from different domains show poor performance in isolation, they can enhance the classifier's performance when aggregated.
With over 130k consolidated cross-domain reviews, the transfer learning-based approach shows noticeably worse performance than training on in-domain data, with an F1 score of 0.773 compared to 0.910 using the LR classifier. It performs similarly to the lexicon-based method VADER, which yields an F1 score of 0.771. Word-level polarity is heavily influenced by context and domain, which is reflected in the classifier's performance when cross-domain data are used.

Findings and Implications
• We find that online content in Bengali contains many misspelled and regional words, which affect the performance of sentiment classifiers. Therefore, it is necessary to build sophisticated tools that can fix misspellings and recognize regional variants of Bengali words.
• Although the existing Bengali-to-English machine translation system is still far from perfect, it is capable of preserving sentiment information; hence, it can be utilized for cross-lingual sentiment analysis.
• We find that the lexicon-based methods perform poorly compared to the supervised ML methods on the machine-translated corpus. Therefore, it is imperative to develop an automatic or semi-automatic data annotation method.
• We find that a large amount of cross-domain labeled data provides performance similar to that of the lexicon-based approach. Therefore, transfer learning can help when in-domain labeled data are unavailable.
• Our study reveals that the cross-lingual approach can be effective in Bengali sentiment analysis. Therefore, future research should focus on exploring and developing new methods for cross-lingual sentiment analysis in Bengali.

Conclusion
To facilitate sentiment analysis research in Bengali, in this work, we introduce a benchmark dataset and explore the adaptation of resources and tools from English. We observe that, owing to misspellings in the Bengali text, the usage of regional varieties of Bengali, and advances in machine translation, supervised ML algorithms perform comparably on the Bengali and machine-translated corpora. The agreement of the predictions suggests that Bengali-English machine translation can preserve the sentiment information.
The mediocre performance of the lexicon-based methods implies that annotated data are essential for achieving better classification accuracy. We present the performance of simple transfer learning utilizing cross-domain data. We note that with enough cross-domain training data, supervised ML classifiers perform comparably to the lexicon-based methods, though they lag behind the performance achieved with in-domain data. We report our findings regarding cross-lingual sentiment classification approaches in Bengali, which provide directions for future research.