Tw-StAR at SemEval-2017 Task 4: Sentiment Classification of Arabic Tweets

In this paper, we present our contribution to the SemEval 2017 international workshop. We tackled Task 4, entitled “Sentiment analysis in Twitter”, specifically subtask 4A-Arabic. We propose two Arabic sentiment classification models implemented using supervised and unsupervised learning strategies. In both models, Arabic tweets were first preprocessed, then various schemes of bag-of-N-grams were extracted to be used as features. The final submission was selected based on the best performance, which was achieved by the supervised learning-based model. However, the results obtained by the unsupervised learning-based model are considered promising and could be improved if richer lexica are adopted in further work.


Introduction
Social media is shaping decision-making processes in many aspects of our daily lives. Exploring online opinions is therefore becoming the focus of many analytical studies. Twitter is one of the most popular microblogging systems and enables real-time tracking of opinions towards ongoing events. Hence, it provides the feedback information needed for analytical studies in several domains such as politics and targeted advertising (El-Makky et al., 2014). Sentiment analysis plays an essential role in such studies as it can extract the sentiments out of the opinions and classify them into polarities (Tang et al., 2015). Arabic has recently been considered one of the fastest-growing languages on Twitter, with more than 10.8 million tweets per day (Alhumoud et al., 2015). Yet, Arabic is remarkably less tackled in Sentiment Analysis research (Nabil et al., 2015; ElSahar and El-Beltagy, 2015). With more resources and tools for Arabic Natural Language Processing (NLP) becoming available, and with the recently developed sentiment lexica for Modern Standard Arabic (MSA) and dialectal Arabic, this year's SemEval contest offers the opportunity to apply sentiment classification to Arabic tweets through subtask 4A-Arabic (Rosenthal et al., 2017). Analyzing Arabic tweets is significantly challenging due to the complex nature and morphology of the Arabic language. Furthermore, Arabic tweets are mostly informal and written in different dialects in which the same words or expressions may have drastically different sentiments. For example, is a compliment with a positive sentiment that means "May GOD grant you health" in the Levantine dialect, while it has an aggressive meaning of "burn in fire" in the Moroccan and Tunisian dialects (El-Makky et al., 2014).
Additionally, tweeters frequently use abbreviations, neologisms, emoji and sarcasm (Maas et al., 2011; Rajadesingan et al., 2015), and sometimes in the same 140-character tweet (Maas et al., 2011).
Here, we describe our participation in Task 4, subtask 4A-Arabic of SemEval 2017 under the team name "Tw-StAR" (Twitter-Sentiment analysis team for ARabic). The task requires classifying the sentiment of single Arabic tweets into one of the classes: positive, negative or neutral (Rosenthal et al., 2017). To accomplish this, we have used two classification models:
• Supervised learning-based model: bag-of-N-grams features of different schemes have been adopted to train the model. Support Vector Machines (SVM) and naïve Bayes (NB) have been used as classification algorithms.
• Unsupervised learning-based (lexicon-based) model: merged MSA/multidialectal sentiment lexica along with the constant weighting strategy have been employed to classify the tweets' sentiment.
The remainder of the paper is organized as follows: in Section 2, we describe the preprocessing step. In Section 3, we identify the extracted feature sets. Section 4 introduces the learning strategies used in the presented models. Results are reviewed and discussed in Section 5, while Section 6 concludes the study and outlines future work.

Data Preprocessing
In this step, we first cleaned the tweets of unsentimental content such as URLs, usernames, dates, hashtags, retweet symbols, punctuation, emoticons and non-Arabic characters to keep the Arabic text only, as in (Shoukry and Rafea, 2012; Al-Osaimi and Badruddin, 2014). Secondly, the input data was filtered of words that do not affect the text meaning, the so-called stopwords (El-Makky et al., 2014). Since our data contains several dialects, we used an already built list of 244 stopwords for MSA and the Egyptian dialect from (Shoukry and Rafea, 2012), merged with a manually built list of 12 words from the Levantine and Gulf dialects such as , which mean "where" and "what" in the Gulf and Levantine dialects respectively. Furthermore, MSA/dialectal negation words such as , which mean "not" in the Levantine and Egyptian dialects respectively, have been excluded from the stopword lists, as they may reverse the polarity of a tweet (Duwairi et al., 2014). Thus, a tweet such as "https://t.co/wPg3KEz4bW-", which means "what is going on in Trump's mind", becomes " " after preprocessing. Lastly, for the lexicon-based model, we subjected each tweet to tokenization and then to stemming to facilitate the word lookup process in the lexica. Stemming has been carried out using the Information Science Research Institute's (ISRI) Arabic stemmer provided by the NLTK library (Bird, 2006). ISRI is a root-extraction stemmer that can provide a normalized form of unstemmed words rather than leaving them unchanged. Moreover, being a context-sensitive stemmer prevents ISRI from producing meaningless and invalid roots (Dahab et al., 2015).
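The cleaning and stopword-filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions, the decision to drop whole hashtags and mentions, the Arabic letter range, and the tiny stopword and negation sets are all assumptions made for the example.

```python
import re

# Assumed patterns; the paper does not list its exact regular expressions.
URL_RE = re.compile(r"https?://\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\S+")
# Keep only Arabic letters (U+0621..U+064A) and whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")


def clean_tweet(text):
    """Strip URLs, mentions, hashtags and non-Arabic characters, then tokenize."""
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    return text.split()


def remove_stopwords(tokens, stopwords, negations):
    """Drop stopwords, but always keep negation words since they can flip polarity."""
    return [t for t in tokens if t not in stopwords or t in negations]


# Illustrative sets only; the real lists hold 244 + 12 stopwords.
STOPWORDS = {"في", "من", "مش"}
NEGATIONS = {"مش"}

tokens = clean_tweet("RT @user: الخدمة في المطعم مش ممتازة https://t.co/abc #tag 2017!")
filtered = remove_stopwords(tokens, STOPWORDS, NEGATIONS)
```

Here the negation word survives filtering even though it appears in the stopword set, which mirrors the exclusion of negation words from the stopword lists.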

Feature Extraction
Bag-of-N-grams features have been adopted in both of the presented models (Shoukry and Rafea, 2012; Abdulla et al., 2013). N-grams are sequences of N adjoining items collected from a given corpus. Extracting N-grams can be thought of as exploring a large piece of text through a window of a fixed size (Pagolu et al., 2016). Feature selection has been performed using the NLTK module FreqDist, which gives a list of the distinct words ordered by their frequency of appearance in the corpus (Bird, 2006). A specific number of features (40100 for the combination of unigrams+bigrams+trigrams) was defined to be selected from the FreqDist list. The feature extraction pipeline is illustrated in Figure 1.
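The sliding-window view of N-gram extraction and the frequency-based selection can be sketched in a few lines. This is an illustration under assumptions: it uses `collections.Counter` from the standard library as a stand-in for NLTK's FreqDist, the function names are hypothetical, and the toy corpus uses English placeholder tokens.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return the n-grams (as tuples) seen through a sliding window of size n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=3):
    """Collect unigrams up to max_n-grams for one tokenized tweet."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(extract_ngrams(tokens, n))
    return grams


def select_features(tokenized_tweets, max_features, max_n=3):
    """Keep the max_features most frequent n-grams across the corpus."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(bag_of_ngrams(tokens, max_n))
    return [gram for gram, _ in counts.most_common(max_features)]


# Toy corpus with placeholder tokens (not real data).
corpus = [["good", "service"], ["good", "service", "today"], ["bad", "service"]]
features = select_features(corpus, max_features=3)
```

With the paper's setting, `max_features` would be 40100 for the unigrams+bigrams+trigrams scheme.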

Learning Strategies
In this section, we describe the learning strategies adopted by the presented models. The mechanism of each strategy is briefly reviewed, and the Python tools used to build the classification models are introduced.

Supervised learning
Supervised learning requires a labeled corpus to train the classifier to predict text polarity (Biltawi et al., 2016). In our case, a polarity-labeled dataset of 3355 Arabic tweets provided by SemEval 2017 has been used, of which 2684 tweets were dedicated to training the model while 671 tweets were used to tune it. The learning process has been carried out by inferring that a combination of specific features of a tweet yields a specific class (Shoukry and Rafea, 2012). We have used Naïve Bayes (NB) from Scikit-Learn (Pedregosa et al., 2011) since it is as powerful as Logistic Regression (Räbigera et al., 2016) and has proved its efficiency in classifying the sentiment of multidialectal datasets (Itani et al., 2012). Additionally, linear SVM from LIBSVM was employed for its robustness and ease of implementation (Chang and Lin, 2011). Regarding the features used, and as higher-order N-grams have performed better than unigrams (Rushdi-Saleh et al., 2011), we have adopted N-gram schemes ranging from unigrams up to trigrams.
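To make the idea of inferring a class from a combination of features concrete, the following is a small pure-Python multinomial Naïve Bayes with add-one smoothing. It is only a sketch: the paper uses the Scikit-Learn implementation and does not state which NB variant or smoothing was chosen, so the variant, the `alpha` value, the function names and the toy samples are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples, alpha=1.0):
    """Fit class priors and per-class feature counts from (features, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in samples:
        class_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return {"class_counts": class_counts, "feature_counts": feature_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, features):
    """Return the label with the highest log-posterior for the given features."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["feature_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n_docs / total_docs)
        for feat in features:
            if feat in model["vocab"]:  # features unseen in training are skipped
                score += math.log((counts[feat] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data with placeholder unigram features (not real tweets).
train_samples = [
    (["good", "great"], "positive"),
    (["bad", "awful"], "negative"),
    (["today", "news"], "neutral"),
]
model = train_nb(train_samples)
prediction = predict_nb(model, ["good"])
```

In the actual system the feature lists would be the selected unigrams, bigrams and trigrams of each tweet.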

Unsupervised learning (lexicon-based)
In this strategy, neither labeled data nor a training step is required. The polarity of a word or a sentence is determined using a sentiment lexicon or lexica that can be either pre-built or manually built (Abdulla et al., 2013). Sentiment lexica usually contain subjective words along with their polarities (positive, negative). For each polarity, a sentiment weight is assigned using one of these weighting algorithms:
• Sum method: adopts the constant weight strategy to assign weights to the lexicon's entries, where negative words have a weight of -1 while positive ones have a weight of 1. The polarity of a given text is thus calculated by accumulating the weights of its negative and positive terms. Then, the total polarity is determined by the sign of the resulting value (Abdulla et al., 2016).
• Double Polarity (DP) method: assigns both a positive and a negative weight to each term in the lexicon. For example, if a positive term in the lexicon has a weight of 0.8, then its negative weight will be -1+0.8 = -0.2. Similarly, a negative term with a weight of -0.6 has a positive weight of 0.4. Polarity is calculated by summing all the positive weights and all the negative weights in the input text. The final polarity is then determined by whichever of the two sums has the greater absolute value (El-Makky et al., 2014).
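The two weighting methods can be compared with a short sketch. The function names are hypothetical, and mapping a zero total (Sum) or equal magnitudes (DP) to "neutral" is an assumption, since the descriptions above only cover the positive and negative cases.

```python
def sum_method(weights):
    """Sum method: weights are +1, -1 (or 0 for unmatched terms); the sign decides."""
    total = sum(weights)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # assumption: a zero total is treated as neutral


def double_polarity_weights(weight):
    """Return (positive_weight, negative_weight) for a signed lexicon weight."""
    if weight >= 0:
        return weight, weight - 1.0  # e.g. 0.8 gives (0.8, -0.2)
    return 1.0 + weight, weight      # e.g. -0.6 gives (0.4, -0.6)


def dp_method(weights):
    """DP method: compare the magnitudes of the summed positive and negative weights."""
    pairs = [double_polarity_weights(w) for w in weights]
    pos_total = sum(p for p, _ in pairs)
    neg_total = sum(n for _, n in pairs)
    if abs(pos_total) > abs(neg_total):
        return "positive"
    if abs(pos_total) < abs(neg_total):
        return "negative"
    return "neutral"  # assumption: equal magnitudes are treated as neutral


sum_label = sum_method([0, 1, 1])
dp_label = dp_method([0.8, -0.6])
```

For the two example terms above, the positive weights sum to 1.2 and the negative weights to -0.8, so the DP method labels the text positive.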
Having the MSA/dialectal combination of our training dataset defined by manual annotation (see Table 1), we have adopted a merged set of pre-built and manually built sentiment lexica with 6587 total entries of single and compound terms.  (Salameh et al., 2015) for emojis and Arabic Hashtag Lexicon (Salameh et al., 2015; Mohammad et al., 2016) for MSA/multiple dialects. Levantine and Gulf dialects were targeted through two manually built lexica. Table 2 lists the used lexica and their sizes.
As in (Abdulla et al., 2013; El-Beltagy and Ali, 2013), we have used the Sum method to determine the tweets' polarity. The polarity calculation procedure involved looking for entries in the lexica that match the tweet's unigrams or bigrams. In addition, the stemmed word is looked up if the unstemmed one cannot be found (Al-Horaibi and Khan, 2016). Stopwords and negation words were kept to increase the chance of matching a tweet's tokens with the compound terms of the merged lexica. Thus, for a tweet such as , which means "Google is incredibly creative", the polarity is calculated by summing the polarity values of its tokens: "google + incredibly + creative = 0 + 1 + 1 = +2 > 0"; hence, it is positive.
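The lookup procedure can be sketched as below. Several details are assumptions, not taken from the paper: the lexicon is modeled as a dict from term to +1/-1, compound terms are stored as space-joined bigrams, a matched bigram consumes both tokens, and the toy `strip_plural` function stands in for a real stemmer. English placeholders replace the Arabic tokens.

```python
def tweet_polarity(tokens, lexicon, stem=None):
    """Sum the lexicon weights of matched bigrams and unigrams in a tokenized tweet."""
    score = 0
    i = 0
    while i < len(tokens):
        # Assumption: try a compound (bigram) entry first and consume both tokens.
        if i + 1 < len(tokens):
            bigram = tokens[i] + " " + tokens[i + 1]
            if bigram in lexicon:
                score += lexicon[bigram]
                i += 2
                continue
        token = tokens[i]
        if token in lexicon:
            score += lexicon[token]
        elif stem is not None and stem(token) in lexicon:
            # Fall back to the stemmed form only when the surface form is missing.
            score += lexicon[stem(token)]
        i += 1
    return score


def strip_plural(token):
    """Toy stand-in for a stemmer (hypothetical; not the ISRI algorithm)."""
    return token[:-1] if token.endswith("s") else token


lexicon = {"incredibly": 1, "creative": 1, "failure": -1, "not good": -1}
score = tweet_polarity(["google", "incredibly", "creative"], lexicon)
```

The example reproduces the calculation in the text: the unmatched token contributes 0 and the two matched tokens contribute 1 each, giving +2. With NLTK installed, the `stem` method of `nltk.stem.isri.ISRIStemmer` could be passed as the `stem` argument.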

Results and Discussion
The provided dataset consists of three parts: TRAIN (2684 tweets) for training models, DEV (671 tweets) for tuning models, and TEST (6100 tweets) for the official evaluation. Data preprocessing involved regular expression matching and substitution provided by the Python re module. N-gram feature schemes (unigrams+bigrams+trigrams) have been generated via NLTK. Having preprocessed the data and extracted the features, we trained the supervised learning-based model and then classified the sentiment of the DEV set. The classification algorithms used were SVM from LIBSVM and NB from Scikit-Learn. Table 4 lists the results of these two classification algorithms. Considering the baseline results reviewed in Table 3, it can be observed that NB achieved a slight improvement over the baseline, while SVM outperformed both the baseline and NB by achieving an average F-score (AVG F1) of 0.384 and an average Recall (AVG R) of 0.459.
          AVG F1  AVG R
Baseline  0.249   0.333

For the lexicon-based model, the tweet's tokens (unigrams+bigrams) have been looked up in the lexica to calculate the tweet's polarity using the Sum method. The lookup process involved looking for the stemmed token if the unstemmed one was not found in the lexica. In Table 5, we notice that when stemming assists the lookup process, the performance degrades from 0.342 to 0.309 in terms of F-score. This is because dialectal words may not be stemmed correctly by the ISRI stemmer (Dahab et al., 2015). For example, the term , which means "I want" in the Gulf dialect, has a neutral sentiment, while its stem produced by ISRI, , means "injustice", which has a negative polarity. However, the experiment in which the stemmer was not used achieved performance quite close to that of the supervised model, as it yielded 0.448 and 0.342 for average recall and average F-score respectively. This is due to the fact that the MSA/Egyptian and Gulf/Levantine dialects were efficiently supported by the lexica used.

Stemming        AVG F1  AVG R
Available       0.309   0.367
Not available   0.342   0.448
Table 5: Lexicon-based model performance on DEV dataset.
Considering the results in Table 4 and Table 5, the supervised learning-based model with the SVM algorithm achieved better average F-score and Recall values than the lexicon-based model. We therefore selected it to provide the TEST set classification results for the final submission. Table 6 reviews the scores and the ranking of our system in the official evaluation.
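The averaged scores used in this comparison can be computed along the following lines. This sketch assumes that AVG R is the recall averaged over the three classes and that AVG F1 is averaged over whichever labels are passed in; the text does not state which classes enter the F-score average, and the function names and toy labels are hypothetical.

```python
def per_class_scores(gold, predicted, label):
    """Return (precision, recall, f1) for one class label."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(gold, predicted, labels):
    """Return (average F1, average recall) over the given labels."""
    scores = [per_class_scores(gold, predicted, label) for label in labels]
    avg_f1 = sum(s[2] for s in scores) / len(labels)
    avg_recall = sum(s[1] for s in scores) / len(labels)
    return avg_f1, avg_recall


# Toy example: a classifier that predicts "negative" for every tweet.
labels = ["positive", "negative", "neutral"]
gold = ["positive", "negative", "neutral", "negative"]
predicted = ["negative"] * len(gold)
avg_f1, avg_recall = macro_average(gold, predicted, labels)
```

A classifier that always predicts a single class reaches an average recall of 1/3 under this definition, which is consistent with the baseline AVG R value of 0.333.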

Conclusion and Future work
We have investigated sentiment classification of Arabic tweets via two classification models with various features and two learning strategies. Relatively satisfying results were obtained by the supervised and lexicon-based models. For the final submission, we selected the supervised learning-based model, as it achieved the best average F-score and Recall values. However, the lexicon-based model has also yielded good results when the lookup process was not assisted by stemming. We noticed that MSA/multi-dialectal content has been efficiently handled by the merged lexica. Further improvement can be obtained in the future if the Levantine/Gulf dialects are more efficiently supported by using their current lexicon entries as seeds to produce a richer lexicon.