A Light Lexicon-based Mobile Application for Sentiment Mining of Arabic Tweets

Most advanced mobile applications rely on server-based processing and communication. This often causes additional energy consumption on already energy-limited mobile devices. In this work, we provide a method that addresses these limitations for opinion mining in Arabic on mobile devices. Instead of relying on compute-intensive NLP processing, the method uses an Arabic lexical resource stored on the device. Text is stemmed, and the words are then matched to our own lexicon, ArSenL. ArSenL is the first publicly available large-scale Standard Arabic sentiment lexicon, developed using a combination of English SentiWordNet (ESWN), Arabic WordNet, and the Arabic Morphological Analyzer (AraMorph). The scores of the matched stems are then passed to a classifier that determines the polarity. The method was tested on a published set of Arabic tweets, and an average accuracy of 67% was achieved. The developed mobile application is also made publicly available. The application takes a topic of interest as input and retrieves the latest Arabic tweets related to this topic. It then displays the tweets color-coded by sentiment label: positive, negative, or neutral. The application also provides visual summaries of searched topics and a history showing how the sentiments for a given topic have evolved.


Introduction
With the growth of social media and online blogs, people express their opinions and sentiments freely by providing product reviews, as well as comments about celebrities and about political and global events. These opinionated texts are of great interest to companies and individuals who base their decisions and actions upon them (Feldman, 2013; Taboada et al., 2011). In particular, there is an increased interest in easy access to Arabic opinion from mobile devices. In fact, around "10.8 million tweets come from the Arab region every day. 73.6% of all the tweets from the region are now in Arabic" (Radcliffe, 2013).
There have been many attempts to build sentiment analysis engines, and several applications perform opinion mining for English texts. Most opinion mining approaches in English rely on SentiWordNet (ESWN) (Esuli and Sebastiani, 2006; Baccianella et al., 2010) for extracting word-level sentiment polarity. Some researchers used the stored positive or negative connotations of words and combined them to derive the polarity of the text (Esuli and Sebastiani, 2005). Recently, special interest has been given to opinion mining from Arabic texts, and as a result, there has also been interest in developing Arabic lexicons for word-level sentiment evaluation. The availability of a large-scale Arabic counterpart of SentiWordNet is still limited (AlHazmi et al., 2013; Abdul-Mageed and Diab, 2012; Elarnaoty et al., 2012). In fact, there is no publicly available large-scale Arabic sentiment lexicon similar to ESWN. Additionally, existing Arabic lexicons have limitations, including deficiencies in covering the correct number and type of lemmas. Moreover, few applications exist for performing opinion mining in Arabic.
Sophisticated opinion mining models require computationally intensive natural language processing tools. As an example for the Arabic language, MADAMIRA (Pasha et al., 2014) is a tool that performs tokenization, POS tagging, and sense disambiguation by lemmatizing a given sentence in Arabic. However, the tool cannot be integrated into a mobile application without using a server.
Hence, in this work, we propose a method for opinion mining of Arabic tweets on mobile devices that does not rely on compute-intensive NLP tools. We propose a computationally light method that uses a lexicon-based approach for Arabic tweets. The method leverages our newly developed large-scale sentiment lexicon, ArSenL. ArSenL was created by matching Arabic WordNet (AWN) (Black et al., 2006) and the lemmas in the AraMorph lexicon to ESWN. Each lemma entry in the lexicon has three scores derived from the mapping with ESWN: positive, negative, and objective. The sum of the three scores is 1. Ideally, one should use NLP tools to process text and produce lemmas that can be matched to ArSenL. However, to keep processing light on the mobile, we produce a stemmed version of ArSenL and then use word stems for matching. This design reduces the energy and performance costs caused by input/output and transmission operations on the mobile. A mobile application is designed and implemented to automatically analyze Arabic tweets, extract the sentiments related to the tweets, and provide a visual summary of the results. The user enters a keyword of interest, and the output displays a summary of the tweets' sentiments. The method is designed to use limited computational and storage resources while achieving acceptable accuracy.
The rest of the paper is organized as follows. Section 2 presents a literature review covering work on opinion mining methods based on lexical resources. In Section 3, we detail the method, with descriptions of ArSenL and the developed application. Section 4 includes an evaluation of the sentiment model and of the developed mobile application. In Section 5, we conclude our work and outline possible extensions.

Literature Review
There have been numerous efforts to create sentiment lexicons in English and Arabic for sentiment mining. The primary goal of these resources is to aid the automated analysis of sentiment content in text.
The Arabic language in social media presents several challenges for sentiment mining, as detailed by El-Beltagy & Ali (2013). First, the unavailability of colloquial Arabic parsers makes morphological analysis harder. Moreover, there is no publicly available sentiment lexicon for Arabic. Named entity recognition and the handling of idiomatic Arabic expressions in different dialects are additional challenges. For more information on Arabic morphological complexity and dialectal variations, see Habash (2010). Denecke (2008) and Ohana and Tierney (2009) developed lexicon-based sentiment models building on the work of Esuli and Sebastiani (2006), who introduced ESWN as a resource that assigns scores for objectivity, positivity, and negativity to each synset in the English WordNet (EWN). The model of Denecke (2008) is proposed for multilingual applications: the document is first translated from a foreign language into English, and the three sentiment scores are then extracted based on ESWN. The scores are then used as features for the sentiment model. The processing of the document includes stemming and part-of-speech tagging.
While Denecke (2008) and Ohana and Tierney (2009) relied on ESWN to develop their sentiment mining models, Abdul-Mageed et al. (2011) used a manually annotated adjective lexicon, SIFAAT (Abdul-Mageed and Diab, 2012), to develop an opinion mining model for Arabic. The model uses morphological features and the polarity labels of the adjectives matched to SIFAAT. As an extension of their work on lexicon-based opinion mining models, Abdul-Mageed and Diab (2014) extended the lexicon to create SANA, a subjectivity and sentiment lexicon for Arabic. SANA has a mix of lemmas and inflected forms, many of which are not diacritized. However, SANA was not tested in the context of an opinion mining model.
As another attempt to create a lexicon-based approach for sentiment mining, Alhazmi et al. (2013) linked the Arabic WordNet to ESWN through the provided synset offset information. The efficiency of the lexicon for sentiment mining was not evaluated.
While the previous approaches were mainly based on the availability of sentiment lexical resources, El-Halees (2011) developed a three-step opinion mining model for Arabic documents. First, the documents are passed through a classification model that is based on lexical resources. This step classifies the majority of the documents. The resulting classified documents are used as a training set for a maximum entropy method, which then classifies some of the other documents. Finally, a k-nearest neighbor method is employed to classify the remaining documents using the output of the previous two classification models.
Using ESWN, Mukherjee et al. (2012) developed TwiSent, which collects tweets and classifies them as positive, negative, or objective. Besides detecting the sentiment of the tweet, TwiSent addresses four known problems for tweets: spam, structural anomalies, entity specifications, and pragmatics. Addressing these problems improved sentiment classification by 10% compared to other sentiment mining applications trained on the same set of tweets. However, this work is limited to the English language. Davidov et al. (2010) describe a technique that transforms hashtags and smileys in tweets into sentiments. The described process is divided into two parts: identifying sentiment expressions, and determining the polarity of the identified expressions. Each tweet is divided into four different groups: words, punctuation, n-grams, and patterns. Then, for each group, a separate technique is applied to detect a positive or a negative sentiment. Although this approach analyzes hashtags and smileys, which are multilingual, it is still mainly designed for the English language.
Last but not least, Aly & Atia (2013) present LABR (Large Arabic Book Reviews), a dataset consisting of 63K book reviews with ratings from 1 to 5. The authors present baseline approaches for performing sentiment mining and set benchmarks for future research in sentiment mining.
In summary, while previous methods exist for English sentiment mining, none exist for real-time sentiment mining on mobile for Arabic. Additionally, the methods that do exist, and only for English, often rely on extensive computations, making them infeasible for extensive use on a mobile. In this work, we provide a method that uses a recently developed Arabic sentiment lexicon and requires minimal computations on the mobile.

Proposed Approach
In this section we describe the method, the Arabic sentiment lexicon, and the developed mobile application.

Method Overview
The processing steps of the model are shown in Figure 1. The steps are tweet tokenization, hashtag removal, stemming, inference of sentiment scores for the stemmed words, and finally sentiment classification. The scores are used to derive three aggregate features: the sum of positive scores, the sum of negative scores, and the sum of objective (neutral) scores. In this paper, we use objective and neutral interchangeably. The pre-processing steps are detailed below.
Removing Hashtags: This step strips the hashtag markers from the data while keeping the corresponding words, since these words are important to the sentiment of the tweet.
Stemming: Each tokenized tweet is stemmed so that it can be matched to a stemmed version of ArSenL. Lemmatization would have produced higher accuracy, but it would have required more computation. As a result, we used stemming to keep the processing light. Khoja's stemmer (1999) was used in the implementation.
Getting the Score of Tweets: Each stemmed word is matched to the stemmed version of ArSenL in order to retrieve the corresponding sentiment scores. If a word in the tweet has no match in ArSenL, a zero score is assigned to each of its positive, negative, and objective scores. The sentiment scores are then summed for each tweet. It is worth noting that we tried using an average score per tweet instead of the sum, but the sum gave better accuracy.
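The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the lexicon entries and their scores are invented placeholders rather than ArSenL data, `toy_stem` is a hypothetical stand-in that merely strips a leading definite article (it is not Khoja's stemmer), and all function names are ours.

```python
import re

# Hypothetical stemmed lexicon: stem -> (positive, negative, objective).
# Entries and scores are illustrative placeholders, NOT taken from ArSenL.
TOY_LEXICON = {
    "جمل": (0.625, 0.0, 0.375),
    "حزن": (0.0, 0.75, 0.25),
    "كتب": (0.0, 0.0, 1.0),
}


def strip_hashtags(text):
    """Drop the '#' marker but keep the hashtag words ('_' becomes a space)."""
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)


def tokenize(text):
    """Split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def toy_stem(token):
    """Stand-in for a real Arabic stemmer: strips a leading definite article only."""
    if token.startswith("ال") and len(token) > 4:
        return token[2:]
    return token


def tweet_features(text, lexicon=TOY_LEXICON, stem=toy_stem):
    """Return the summed (positive, negative, objective) scores of a tweet."""
    pos = neg = obj = 0.0
    for token in tokenize(strip_hashtags(text)):
        # Stems without a lexicon match contribute zero to all three sums.
        p, n, o = lexicon.get(stem(token), (0.0, 0.0, 0.0))
        pos += p
        neg += n
        obj += o
    return pos, neg, obj
```

Calling `tweet_features` on a tweet returns the three sums that serve as the classifier's input features. Swapping `toy_stem` for a proper stemmer and `TOY_LEXICON` for the stemmed lexicon would not change the rest of the sketch.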

Arabic Sentiment Lexicon
For the Arabic sentiment lexicon, we generate a stemmed version of ArSenL (Badaro et al., 2014). ArSenL was developed by extending existing resources in Arabic and English: English WordNet (EWN) (Miller et al., 1990), Arabic WordNet (AWN), English SentiWordNet (ESWN), and AraMorph. The lemma entries in the Arabic resources were linked to the English synsets. The validated version was demonstrated to outperform the other version as well as state-of-the-art lexicons for Arabic sentiment.
A public interface for browsing ArSenL is available at http://www.oma-project.com. ArSenL can also be downloaded for research use. The interface allows the user to search for an Arabic word. The output shows the entries found for the Arabic word along with the corresponding sentiment scores. A snapshot of the homepage is shown in Figure 2. The scores are the sentiment scores that were extracted from ESWN after establishing the links across the different resources, as detailed in Badaro et al. (2014).

Training Data
A corpus of 2300 manually annotated Arabic tweets (~30k words) is used (Mourad and Darwish, 2013). The dataset was randomly sampled from Twitter out of 65 million unique tweets in Arabic. It was annotated by two native Arabic speakers. In case of disagreement, the two annotators discussed the tweet to resolve it. If the disagreement remained, the tweet was dropped.

Features
The features used to build the classification model were restricted to the sums of the sentiment scores per tweet as retrieved from ArSenL. We kept the features simple in order to reduce processing and computation, given that our aim is to design an energy-efficient sentiment model for mobile.

Classification Model
To predict the sentiment of a tweet, we use decision trees as the classification model because their results are easy to interpret. The design is an ensemble classifier consisting of three binary classifiers: positive/not positive, negative/not negative, and objective/not objective, as shown in Figure 3. To train each classifier, an equal number of tweets is used for each class. The outputs of the three classifiers are then passed to custom rules that combine them in order to assign the final sentiment label to a given tweet: positive, negative, or neutral. For example, a tweet classified as positive, not negative, and objective by the three binary classifiers, respectively, is labeled as a positive tweet. These rules were chosen to achieve higher accuracy and are also shown in Figure 3.
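The rule-based combination of the three binary outputs can be pictured as a lookup table. The sketch below is hypothetical: only the (positive, not negative, objective) case is stated in the text, and every other table entry is an assumption made for illustration rather than the rule set of Figure 3. The names `ASSUMED_RULES` and `combine_votes` are ours.

```python
# Keys are (is_positive, is_negative, is_objective) from the three binary
# classifiers. Only the first entry is stated in the text; all other entries
# are ASSUMPTIONS for illustration and may differ from the rules in Figure 3.
ASSUMED_RULES = {
    (True, False, True): "positive",    # example given in the text
    (True, False, False): "positive",   # assumed
    (False, True, True): "negative",    # assumed
    (False, True, False): "negative",   # assumed
    (True, True, True): "neutral",      # assumed: conflicting polarity votes
    (True, True, False): "neutral",     # assumed: conflicting polarity votes
    (False, False, True): "neutral",    # assumed
    (False, False, False): "neutral",   # assumed
}


def combine_votes(is_positive, is_negative, is_objective, rules=ASSUMED_RULES):
    """Map the three binary decisions to one final label via a rule table."""
    key = (bool(is_positive), bool(is_negative), bool(is_objective))
    return rules[key]
```

Writing the rules as an explicit table keeps all eight input combinations visible, so a different rule set can be substituted by passing another dictionary as `rules`.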

Application Architecture
A 3-tier architecture, shown in Figure 4, is used for the design of the application. The design is divided into three main components. The user interface is the component where the model takes the topic of interest as input and where the tweets are displayed after being classified as positive, negative, or objective. The logic component consists of the processing performed to match the stemmed tweets to the stemmed lemmas in ArSenL and extract sentiment scores. The sentiment scores are fed to the classification model described above. The data component represents all the sources of data that the application makes use of: the tweets accessed through an API, the tweets filtered by the input topic, ArSenL, and the classification model. No additional servers are required to perform sentiment classification. Thus, energy consumption is reduced, since there is no need for I/O communication with a remote server or for server-level computations. The mobile application was developed for Android OS mobiles and was titled "شو رأين؟", meaning "What is their opinion?". It is available for download through http://www.oma-project.com. An example is reported in Table 1 to illustrate the different steps of the architecture in Figure 4. Below, we describe the steps involved in retrieving the sentiment of a tweet.

Fetching Tweets
There is a search box at the top of the main page in which the user enters the keyword of interest. Based on the keyword entered, recent tweets are fetched using the Twitter API with Arabic filtering, so that all fetched tweets are in Arabic. The user has the option to fetch more tweets by clicking on the "Show More" button. The fetched tweets are then stored in an array list for further processing and for deriving the related sentiment.

Detailed Tweets View
The fetched tweets are processed and labeled as positive, negative, or objective as described above. The tweets are displayed to the user and colored according to their sentiment label: green for positive tweets, red for negative tweets, and gray for objective tweets. A snapshot of the interface is shown in Figure 5, showing classified tweets for the topic "لبنان" (Lebanon). These tweets reflect the latest tweets available on Twitter.

Summary Charts
Instead of looking at each tweet separately, the user can access a summary of the sentiments towards a specific topic through the visual summaries available in the application. A pie chart visualizes the summary of the recently analyzed tweets, showing the distribution of the sentiment labels with the three colors green, red, and gray. A sample snapshot of the visualization is shown in Figure 6.
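The data behind such a pie chart is simply the share of each label among the analyzed tweets. A minimal sketch, assuming the labels arrive as plain strings (the names `LABELS`, `COLORS`, and `sentiment_distribution` are illustrative and not taken from the application):

```python
from collections import Counter

LABELS = ("positive", "negative", "objective")
# Color coding as described for the tweet view and the pie chart.
COLORS = {"positive": "green", "negative": "red", "objective": "gray"}


def sentiment_distribution(labels):
    """Return the fraction of tweets carrying each sentiment label."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        # No analyzed tweets yet: report an empty distribution.
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}
```

Each returned fraction corresponds to one slice of the chart, drawn in the color given by `COLORS`.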

Most Used Hashtags
Since hashtags are essential features in tweets and are usually highly correlated with the topic of the tweet, the design of the application allows the user to see the most used hashtags corresponding to the searched topic. A snapshot of this view in the application is shown in Figure 7.
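Counting the most used hashtags for a topic amounts to a frequency count over the fetched tweets. The following sketch assumes the tweets are available as plain strings; the function name and the default of five hashtags are our choices, not details of the application.

```python
import re
from collections import Counter


def top_hashtags(tweets, n=5):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    counts = Counter(
        tag for tweet in tweets for tag in re.findall(r"#\w+", tweet)
    )
    return counts.most_common(n)
```

Python's `\w` matches Unicode word characters by default for string patterns, so Arabic hashtags such as `#لبنان` are captured without extra configuration.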

History Fragments
Another important feature in the application is the availability of the history track. This option allows the user to keep track of the evolution of sentiment distributions regarding a specific topic through time. A snapshot reflecting the history fragment is shown in Figure 8.

Evaluation
In this section, we evaluate two items: the sentiment model developed for identifying the sentiments of Arabic tweets, and the performance of the mobile application.

Accuracy of Sentiment Model
As described in Section 3, an ensemble model of three decision trees is used to assess the sentiment of a tweet. The model was developed using the WEKA data mining tool. The features of the model are the sums of the three sentiment scores per tweet. The dataset of 2300 manually annotated Arabic tweets (~30k words) (Mourad and Darwish, 2013) is used to train the model and construct the trees. The model was tuned with custom rules to achieve high prediction accuracy. A 5-fold cross-validation was used to evaluate the developed sentiment model, with accuracy as the evaluation measure. Each classifier is evaluated separately and trained using the same number of tweets per class to avoid bias or overfitting. The results are shown in Table 2.

Table 2 (columns: Model, Accuracy).
An average accuracy of 67.33% was achieved for the full system.
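The evaluation protocol described in this subsection (equal class sizes for each binary classifier, 5-fold cross-validation) can be sketched as follows. This is not the WEKA procedure itself: random undersampling is our assumption about how equal class sizes could be obtained, the folds here are plain contiguous splits, and both function names are hypothetical.

```python
import random


def balanced_sample(items, labels, seed=0):
    """Undersample so that every class contributes the same number of items."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    # The smallest class determines how many items are kept per class.
    size = min(len(group) for group in by_class.values())
    sample = [
        (item, label)
        for label, group in by_class.items()
        for item in rng.sample(group, size)
    ]
    rng.shuffle(sample)
    return sample


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

In such a setup, each binary classifier would be trained on the `train` indices of every fold and scored on the `test` indices, and the reported accuracy would be the mean over the five folds.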

Mobile Application Performance
The performance of the application was also evaluated. Initially, 20 tweets were retrieved and processed at a time, but the response time was relatively long. Hence, we made the application fetch 10 tweets at a time. More tweets can be retrieved by pressing the "Show More" button, as seen in Figure 5. All other processing and computations are done using mobile resources. In this way, we achieved our target of creating a sentiment mining application that runs fully on mobile. Moreover, the user interface of the application has been updated several times to optimize performance and user-friendliness based on users' feedback.
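The batching behavior behind the "Show More" button can be illustrated with a small pager. This is a sketch under assumptions: `source` stands for any iterable of already fetched tweets rather than a call to the Twitter API, and the class and method names are hypothetical.

```python
from itertools import islice


class TweetPager:
    """Hands out tweets in fixed-size pages, one page per 'Show More' press."""

    def __init__(self, source, page_size=10):
        self._iterator = iter(source)
        self.page_size = page_size
        self.shown = []

    def show_more(self):
        """Return the next page (possibly shorter or empty) and remember it."""
        page = list(islice(self._iterator, self.page_size))
        self.shown.extend(page)
        return page
```

Because only one page is processed per request, the per-request cost stays bounded by the page size, which is the trade-off described above when moving from 20 to 10 tweets.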

Conclusion
In this paper, we presented a light lexicon-based mobile application for sentiment mining of Arabic tweets. A 3-tier architecture was designed to classify tweets as positive, negative, or objective. The mobile application was designed to minimize the energy consumption of the mobile by using an algorithm with minimal computational needs and no remote communication for computation. As an essential resource for the development of the mobile application, a stemmed version of ArSenL was generated. Different visualization options are presented to the user. An ensemble classifier was developed based on a manually annotated corpus of Arabic tweets, and an average accuracy of 67.3% was achieved for sentiment classification through the mobile application. As future work, we consider enhancing the processing by integrating further intelligence in the classification model to detect negations. We are also considering developing the application for other mobile platforms.

Acknowledgments
This work was made possible by NPRP 6-716-1-138 grant from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.