OMAM at SemEval-2017 Task 4: Evaluation of English State-of-the-Art Sentiment Analysis Models for Arabic and a New Topic-based Model

While sentiment analysis in English has achieved significant progress, it remains a challenging task in Arabic given the rich morphology of the language. It becomes more challenging when applied to Twitter data that comes with additional sources of noise including dialects, misspellings, grammatical mistakes, code switching and the use of non-textual objects to express sentiments. This paper describes the “OMAM” systems that we developed as part of SemEval-2017 task 4. We evaluate English state-of-the-art methods on Arabic tweets for subtask A. As for the remaining subtasks, we introduce a topic-based approach that accounts for topic specificities by predicting topics or domains of upcoming tweets, and then using this information to predict their sentiment. Results indicate that applying the English state-of-the-art method to Arabic has achieved solid results without significant enhancements. Furthermore, the topic-based method ranked 1st in subtasks C and E, and 2nd in subtask D.


Introduction
Sentiment Analysis (SA) is a fundamental problem aiming to allow machines to automatically extract subjectivity information from text (Turney, 2002), whether at the sentence or the document level (Farra et al., 2010). This field has been attracting attention in the research and business communities due to the complexity of human language, and given the range of applications that are interested in harvesting public opinion in different domains such as politics, stocks and marketing.
The interest in SA from Arabic tweets has increased since Arabic has become a key source of Internet content (Miniwatts, 2016), with Twitter being one of the most expressive social media platforms. While models for SA from English tweets have achieved significant success, Arabic methods continue to lag. Opinion mining in Arabic (OMA) is a challenging task given: (1) the morphological complexity of Arabic (Habash, 2010), (2) the excessive use of dialects that vary significantly across the Arab world, (3) the significant amounts of misspellings and grammatical errors due to the length restriction in Twitter, (4) the variations in writing styles, topics and expressions used across the Arab world due to cultural diversity (Baly et al., 2017), and (5) the existence of Twitter-specific tokens (hashtags, mentions, multimedia objects) that may have subjective information embedded in them. Further details on challenging issues in Arabic SA are discussed in Hamdi et al. (2016).
In this paper, we present the different systems we developed as part of our participation in SemEval-2017 Task 4 on Sentiment Analysis in Twitter (Rosenthal et al., 2017). This task covers both English and Arabic. Our systems work on Arabic, but are submitted as part of the OMAM (Opinion Mining for Arabic and More) team, which also submitted a system that analyzes sentiment in English (Onyibe and Habash, 2017).
The first system extends English state-of-the-art feature engineering methods, and is based on training sentiment classifiers with different choices of surface, syntactic and semantic features. The second is based on clustering the data into groups of semantically-related tweets and developing a sentiment classifier for each cluster. The third extends recent advances in deep learning methods. The fourth is a topic-based approach for Twitter SA that introduces a mechanism to predict the topics of tweets, and then uses this information to predict their sentiment polarity. It further allows operating at the domain level as a form of generalization from topics. We evaluate these models for message polarity classification (subtask A), topic-based polarity classification (subtasks B-C) and tweet quantification (subtasks D-E). Experimental results show that English state-of-the-art methods achieved reasonable results in Arabic without any customization, with results being in the middle of the group in subtask A. For the remaining subtasks, the topic-based approach ranked 2nd in subtask D and 1st in subtasks C and E.
The rest of this paper is organized as follows. Section 2 describes previous efforts on the given task. Section 3 presents the details of the Arabic OMAM systems. Section 4 illustrates the performances achieved for each subtask. We conclude in Section 5 with remarks on future work.

Related Work
SA models for Arabic are generally developed by training machine learning classifiers using different choices of features. The most common features are word n-grams, which were used to train Support Vector Machines (SVM) (Rushdi-Saleh et al., 2011; Aly and Atiya, 2013; Shoukry and Rafea, 2012), Naïve Bayes (Mountassir et al., 2012; Elawady et al., 2014) and ensemble classifiers (Omar et al., 2013). Word n-grams were also used with syntactic features (root and part-of-speech n-grams) and stylistic features (digit and letter n-grams, word length, etc.) and achieved good performance after applying the Entropy-Weighted Genetic Algorithm for feature reduction (Abbasi et al., 2008). Sentiment lexicons also provided an additional source of features that proved useful for the task (Abdul-Mageed et al., 2011; Badaro et al., 2014, 2015). A framework was developed for tweets written in Modern Standard Arabic (MSA) and containing Jordanian dialects, Arabizi (Arabic words written using Latin characters) and emoticons. This framework was realized by training different classifiers using features that capture the different linguistic phenomena (Duwairi et al., 2014). A distant supervision approach showed improvement over existing fully-supervised models for subjectivity classification (Refaee and Rieser, 2014a). A subjectivity and sentiment analysis system for Arabic tweets used a feature set that includes different forms of the word (lexemes and lemmas), POS tags, presence of polar adjectives, writing style (MSA or dialectal Arabic, DA), and genre-specific features including the user's gender and ID (Abdul-Mageed et al., 2014). Machine translation was used to apply existing state-of-the-art models for English to translations of Arabic tweets. Despite a slight drop in accuracy caused by translation errors, these models are still considered efficient and effective, especially for low-resource languages (Refaee and Rieser, 2014b; Mohammad et al., 2016).
We briefly mention the state-of-the-art performances achieved in English SA. A new class of machine learning models based on deep learning has recently emerged. These models achieved high performance in both Arabic and English, such as the Recursive Auto Encoders (RAE) (Socher et al., 2011; Al Sallab et al., 2015), the Recursive Neural Tensor Networks (Socher et al., 2013), the Gated Recurrent Neural Networks (Tang et al., 2015) and the Dynamic Memory Networks (Kumar et al., 2015). These models were only evaluated on review documents, and were never tested against the irregularities and noise that exist in Twitter data. A framework to automate the human reading process improved the performance of several state-of-the-art models (Baly et al., 2016; Hobeica et al., 2011).

OMAM Systems
In this section, we present the four OMAM systems that we investigated to perform the different subtasks of SemEval-2017 Task 4. These systems were explored during the development phase, and those that achieved best performances for each subtask were then used to submit the test results.

System 1: English State-of-the-Art SA
The state-of-the-art system selected from English was the winner of SemEval-2016 Subtask C "Five-point scale Tweet classification" in English (Balikas and Amini, 2016). To apply it to Arabic, we derived an equivalent set of features to train a similar model for sentiment classification in Arabic tweets. The derived features are listed here:
• Word n-grams, where n ∈ [1, 4]. To account for the morphological complexity and sparsity of the Arabic language, lemma n-grams are extracted since they have better generalization capabilities than words (Habash, 2010).
• Character n-grams, where n ∈ [3, 5].
• Counts of exclamation marks, question marks, and both marks.
• Count of elongated words.
• Count of negated contexts; a negated context is any phrase that occurs between a negation particle and the next punctuation mark.
We also added two binary features that indicate the presence of (1) user mentions and (2) URLs or any other media content.
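The feature list above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the system's implementation: the negation particle set, the tokenizer, the rule that an elongated word contains a character repeated three or more times, and the use of a URL pattern as the only signal for media content are all assumptions made for this example. The sketch also operates on raw tokens, whereas the system extracts lemma n-grams.

```python
import re
from collections import Counter

# Hypothetical particle list for illustration; the paper does not give its set.
NEGATION_PARTICLES = {"لا", "لم", "لن", "ما", "ليس"}
PUNCTUATION = set(".,!?;:،؛؟")


def word_ngrams(tokens, n_min=1, n_max=4):
    """Count word n-grams for n in [n_min, n_max]."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams for n in [n_min, n_max]."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def count_negated_contexts(tokens, negations):
    """A negated context starts at a negation particle and ends at the next punctuation."""
    count, inside = 0, False
    for tok in tokens:
        if tok in PUNCTUATION:
            inside = False
        elif tok in negations and not inside:
            inside = True
            count += 1
    return count


def surface_features(text, negations=NEGATION_PARTICLES):
    """Punctuation, elongation, negation and presence features for one tweet."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    runs = re.findall(r"[!?؟]+", text)  # runs of exclamation/question marks
    return {
        "n_exclamation": text.count("!"),
        "n_question": text.count("?") + text.count("؟"),
        # Assumption: "both marks" means a run mixing "!" with a question mark.
        "n_both": sum(1 for r in runs if "!" in r and ("?" in r or "؟" in r)),
        # Assumption: elongated = some character repeated 3+ times in a row.
        "n_elongated": sum(1 for t in tokens if re.search(r"(\w)\1{2,}", t)),
        "n_negated_contexts": count_negated_contexts(tokens, negations),
        "has_mention": int(bool(re.search(r"@\w+", text))),
        "has_url_or_media": int(bool(re.search(r"https?://\S+", text))),
    }
```

The n-gram counters and the surface feature dictionary would then be merged into one sparse feature vector per tweet before training a classifier.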

System 2: Cluster-based SA
This system is based on grouping semantically-related tweets, then training different sentiment classifiers for each group independently. At test time, each incoming tweet is assigned to one of the pre-defined clusters, and the corresponding sentiment classifier is used to predict its polarity. Clusters are identified by applying the k-means algorithm to cluster the word embedding space that is generated using the skip-gram embedding model (Mikolov et al., 2013). Consequently, each cluster corresponds to a collection of semantically-related word vectors, and each tweet is assigned to the cluster whose word vectors are most similar (closest) to the vectors of the tweet's words. Tweets that are assigned to the same cluster are used together to train a sentiment classifier using n-gram features. We trained several classifiers, including logistic regression, linear and nonlinear SVM, Bernoulli Naive Bayes and Multinomial Naive Bayes. During model development, we only tuned the number of clusters k, whereas we used the default parameters of the different classifiers as implemented in scikit-learn (Pedregosa et al., 2011).
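The clustering and assignment steps can be sketched as follows. This is a minimal sketch under stated assumptions: the toy two-dimensional embeddings stand in for skip-gram vectors, k-means is a bare-bones Lloyd iteration seeded with the first k points, and a tweet is represented by the mean of its word vectors, which is one plausible reading of "closest" that the paper does not spell out.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_vector(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def assign_tweet(tokens, embeddings, centroids):
    """Return the index of the cluster closest to the tweet's mean word vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known words: the caller must choose a fallback
    tweet_vec = mean_vector(vectors)
    return min(range(len(centroids)),
               key=lambda c: sq_dist(tweet_vec, centroids[c]))


# Toy embeddings (hypothetical values) standing in for skip-gram vectors.
embeddings = {
    "goal": [1.0, 0.0], "match": [0.9, 0.1],
    "vote": [0.0, 1.0], "election": [0.1, 0.9],
}
centroids = kmeans(list(embeddings.values()), k=2)
```

At training time, each tweet would be routed with `assign_tweet` and one sentiment classifier fitted per cluster index; at test time the same routing selects which classifier to apply.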

System 3: Recursive Auto Encoders
We trained the RAE deep learning model that achieved high performance in both English (Socher et al., 2011) and Arabic (Al Sallab et al., 2015). Briefly, the RAE model derives a sentence representation by combining word embeddings, two at a time, following the structure of a syntactic parse tree. The sentence representation is then used to train a softmax sentiment classifier. We followed the setup proposed by Al Sallab et al. (in press 2017) by applying RAE to morphologically tokenized text, which proved to improve the performance by reducing the lexical sparsity of the language. We also use a broader semantic representation of words by concatenating word embeddings trained using the skip-gram model (Mikolov et al., 2013) with sentiment embeddings trained using the ArSenL sentiment lexicon (Badaro et al., 2014).
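The pairwise composition step can be illustrated with a small forward pass. This is only a sketch of how two child vectors are merged into a parent vector along a binary tree, using the common form p = tanh(W[c1; c2] + b). The weights below are hypothetical fixed values, and the reconstruction objective, training procedure and softmax classifier of the actual RAE are omitted.

```python
import math


def compose(left, right, W, b):
    """Parent vector p = tanh(W [left; right] + b) for d-dimensional children."""
    children = left + right  # list concatenation gives a 2d-dimensional input
    return [math.tanh(sum(w * x for w, x in zip(row, children)) + bias)
            for row, bias in zip(W, b)]


def encode(tree, embeddings, W, b):
    """Encode a binary tree given as a token (str) or a (left, right) pair."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(encode(left, embeddings, W, b),
                   encode(right, embeddings, W, b), W, b)


# Hypothetical 2-dimensional embeddings and a 2 x 4 composition matrix.
embeddings = {"x": [0.5, 0.0], "y": [0.0, 0.5]}
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]

sentence_vec = encode((("x", "y"), "x"), embeddings, W, b)
```

The vector returned for the root node plays the role of the sentence representation that is passed to the sentiment classifier.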

System 4: Topic-based SA
This system is based on the assumption that tweets discussing a particular topic are likely to share some unique semantic features. Figure 1 shows the architecture of this system. It is composed of four modules: (A) an unsupervised topic classifier, (B) a supervised topic classifier, (C) a supervised domain classifier, and (D) a generic sentiment classifier. The idea behind this system is that, since the test tweets may belong to topics that are not present in the training set, the different modules attempt to predict the topic and then classify the tweet's sentiment given the predicted topic. Before running the system in Figure 1, topic-specific and domain-specific sentiment classifiers are trained offline. Tweets belonging to each topic or domain in the training set are used, along with their sentiment labels, to train sentiment classifiers that are specific to the corresponding topic or domain. These classifiers are used with the above-mentioned modules as follows.
(M1) Unsupervised Topic Classification. Since the topic of each new tweet is unknown and can be different from those in the training set, we aim to discover which of the training topics is closest (or most related) to that of the tweet. This is achieved by training an embedding model, similar to that in System 2. Then, for each new tweet, the system checks the similarity between the vector of each of the training topics and those of the tweet's words. The tweet is then assigned to the topic with the highest similarity, and its sentiment polarity is predicted using the sentiment classifier that is trained using instances of that particular topic.
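A minimal sketch of this similarity check is given below. It assumes cosine similarity and averages the similarity over the tweet's known words; the paper does not state which similarity measure or aggregation it uses, and the topic names and vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_topic(tokens, topic_vectors, embeddings):
    """Return (topic, score) for the training topic most similar to the tweet.

    The score is the cosine similarity between the topic vector and each
    known word vector of the tweet, averaged over those words (an assumption).
    """
    word_vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not word_vecs or not topic_vectors:
        return None, 0.0
    best_topic, best_score = None, float("-inf")
    for topic, topic_vec in topic_vectors.items():
        score = sum(cosine(topic_vec, w) for w in word_vecs) / len(word_vecs)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic, best_score


# Hypothetical toy vectors for two training topics and two words.
embeddings = {"goal": [1.0, 0.0], "vote": [0.0, 1.0]}
topic_vectors = {"football": [1.0, 0.1], "elections": [0.1, 1.0]}
topic, score = closest_topic(["goal"], topic_vectors, embeddings)
```

Returning the score alongside the topic lets a caller detect the low-similarity case and hand the tweet to a different module.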
(M2) Supervised Topic Classification. In many cases, all similarity values turn out to be small and close to 0. This is possible if the test tweet's topic is totally different from those in the training set, or if the tweet's words are only implicitly related to the discussed topic. In such cases, we resort to a supervised topic classifier: a multi-class classifier, where the number of classes is equal to the number of topics in the training set. The topic classifier is trained using n-gram features extracted from all training tweets. Once the topic of the test tweet is predicted, its sentiment polarity is predicted using the sentiment classifier that is trained using instances of that particular topic.
(M3) Supervised Domain Classification. Some topics may not have sufficient instances to train an accurate sentiment classifier; therefore, we introduce the concept of "domain", a generalized form of the topic. A supervised domain classifier is a multi-class classifier, where the number of classes is equal to the number of domains in the training set. The domain classifier is trained using n-gram features extracted from all training tweets. Once the domain of the test tweet is predicted, its sentiment polarity is predicted using the sentiment classifier that is trained using instances of that particular domain.
(M4) Direct Sentiment Classification. In addition to the topic-specific and domain-specific classifiers, we also experiment with the direct sentiment classifier that ignores the topic information and is trained using all tweets in the training set.
We evaluated different sequences of these modules. For instance, in the first sequence (M1→M2→M3), the tweet's topic is predicted using the unsupervised module (M1), and then its polarity is predicted using the sentiment classifier for that topic. If no similarity is detected, we proceed to module (M2) to predict the tweet's topic using the topic classifier, and then predict its sentiment using the sentiment classifier for that topic. If the topic is rare and no sentiment classifier exists for that topic, we proceed to module (M3) to predict the tweet's domain using the domain classifier, and then predict its sentiment using the sentiment classifier for that domain.
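The fallback logic of such a sequence can be sketched as a small cascade. All names here are hypothetical: the three module arguments are plain callables standing in for trained models, and the `min_similarity` threshold is an assumed way of deciding that "no similarity was detected", since the paper does not give a cutoff.

```python
def cascade_predict(tweet, m1_topic, m2_topic, m3_domain,
                    topic_clfs, domain_clfs, min_similarity=0.2):
    """Route a tweet through M1, then M2, then M3 and return (label, route).

    m1_topic(tweet)  -> (topic or None, similarity score)
    m2_topic(tweet)  -> topic predicted by the supervised topic classifier
    m3_domain(tweet) -> domain predicted by the supervised domain classifier
    topic_clfs / domain_clfs map a topic or domain name to a sentiment classifier.
    """
    topic, score = m1_topic(tweet)
    if topic is None or score < min_similarity:
        # M1 found no sufficiently similar training topic: fall back to M2.
        topic = m2_topic(tweet)
    if topic in topic_clfs:
        return topic_clfs[topic](tweet), "topic:" + topic
    # No sentiment classifier exists for this (rare) topic: fall back to M3.
    domain = m3_domain(tweet)
    return domain_clfs[domain](tweet), "domain:" + domain


# Toy stand-ins for the trained modules and classifiers.
m1 = lambda t: ("football", 0.9) if "goal" in t else (None, 0.0)
m2 = lambda t: "rare_topic"
m3 = lambda t: "sports"
topic_clfs = {"football": lambda t: "positive"}
domain_clfs = {"sports": lambda t: "negative"}

result = cascade_predict("what a goal", m1, m2, m3, topic_clfs, domain_clfs)
```

Shorter sequences follow from the same function by passing a first module that always returns `(None, 0.0)`, which skips the unsupervised step.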

Experiments and Results
In this section, we describe the experiments and results we achieved as part of our participation in SemEval-2017 Task 4. We describe the datasets we used, the preprocessing steps we applied and the performance of the different systems for each subtask. Table 1 illustrates the design of the evaluation experiments, highlighting the systems that were evaluated for each subtask. The system that achieved the best evaluation results, for each subtask, was then used to submit the test results.

Datasets and Preprocessing
To run our experiments, we used datasets provided by the task organizers (Rosenthal et al., 2017) as follows. During evaluation, we trained our models on the TRAIN set, and evaluated our different systems on the DEV set. During testing, the system that achieved the best development results is trained using the combination of the TRAIN and DEV sets, and tested on the TEST set.
For the English state-of-the-art approach (System 1), tweets are preprocessed by (1) replacing mentions and URLs with special tokens, (2) extracting emoticons and emojis and replacing them with special tokens using the emoji sentiment lexicon (Novak et al., 2015) and an in-house emoticon lexicon, (3) normalizing hashtags by removing the # symbol and the underscores that connect words in composite hashtags, and (4) normalizing letter repetitions (elongations). Then features are extracted by performing lemmatization and POS tagging using MADAMIRA v2.1, the state-of-the-art morphological analyzer and disambiguator in Arabic (Pasha et al., 2014), which uses the Standard Arabic Morphological Analyzer (SAMA) (Maamouri et al., 2010). We only included n-grams that occurred more than t times, where the threshold t ∈ [3, 5] is tuned on the DEV set.
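The four preprocessing steps for System 1 can be sketched with regular expressions. The special token names, the tiny emoticon mapping, and the choice to collapse a run of three or more identical characters to a single character are assumptions made for this illustration; the lexicons used by the system are not reproduced here.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#(\w+)")
ELONGATION = re.compile(r"(\w)\1{2,}")  # same character 3+ times in a row


def preprocess(tweet, emoticon_lexicon=None):
    """Apply steps (1)-(4): tokens for URLs/mentions/emoticons, hashtags, elongations."""
    # (1) Replace URLs first so emoticon patterns cannot match inside them.
    text = URL.sub(" <URL> ", tweet)
    text = MENTION.sub(" <MENTION> ", text)
    # (2) Replace emoticons/emojis with special tokens from a lexicon.
    for symbol, token in (emoticon_lexicon or {}).items():
        text = text.replace(symbol, " " + token + " ")
    # (3) Drop "#" and split composite hashtags on underscores.
    text = HASHTAG.sub(lambda m: m.group(1).replace("_", " "), text)
    # (4) Collapse elongations to one character (an assumed normalization).
    text = ELONGATION.sub(r"\1", text)
    return " ".join(text.split())


# Hypothetical one-entry emoticon lexicon for the example.
cleaned = preprocess("@user loooove #world_cup http://t.co/x :)",
                     {":)": "<POS_EMO>"})
```

Lemmatization and POS tagging would be applied to the cleaned text afterwards; that step depends on an external analyzer and is not sketched here.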
For the cluster-based SA approach (System 2), we trained the skip-gram word embedding model using a collection of datasets including the TRAIN and the DEV tweets provided by the organizers, the Qatar Arabic Language Bank (QALB) (Zaghouani et al., 2014) and several Arabic Twitter corpora from Nabil et al. (2015) and Refaee and Rieser (2014b). We also used the k-means algorithm to cluster the embedding space into k clusters, with k ranging between 1 (no clustering) and 12. The best results during development were obtained using k = 4 and k = 5.
For the RAE approach (System 3), tweets are preprocessed in the same way as for System 1. We used MADAMIRA v2.1 to perform morphological tokenization following the ATB scheme (Habash and Sadat, 2006). We also used the Stanford parser (Green and Manning, 2010) to generate the syntactic parse trees. Since the resulting trees are not necessarily binary, and hence cannot be used to train recursive models, we used left-factoring to transform the trees to the Chomsky Normal Form (CNF) grammar that only contains unary and binary production rules.
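Left-factoring can be illustrated on a toy tree representation. In this sketch a tree is either a string leaf or a `(label, children)` pair, and nodes with more than two children are binarized by repeatedly merging the two leftmost children under an artificial node; the `*` suffix for artificial labels is a naming choice made for this example, not the parser's convention.

```python
def binarize_left(tree):
    """Left-factor a tree so that no node has more than two children.

    A tree is a str leaf or a (label, [children]) pair. Unary and binary
    nodes are kept as they are.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize_left(c) for c in children]
    while len(children) > 2:
        # Merge the two leftmost children under an artificial node.
        merged = (label + "*", children[:2])
        children = [merged] + children[2:]
    return (label, children)


def max_arity(tree):
    """Largest number of children at any node (0 for a leaf)."""
    if isinstance(tree, str):
        return 0
    _, children = tree
    return max([len(children)] + [max_arity(c) for c in children])


flat = ("NP", ["a", "b", "c", "d"])
binary = binarize_left(flat)
```

For trees held as NLTK `Tree` objects, the `chomsky_normal_form` method with `factor="left"` performs a comparable transformation in place.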
For the topic-based approach (System 4), tweets are preprocessed by applying normalization, stemming using the NLTK ISRI stemmer (Taghva et al., 2005), and stopword removal. Then, n-grams are extracted using the scikit-learn TfidfVectorizer (Pedregosa et al., 2011), with a variance threshold for feature reduction. The tweets in the training set provided by the task organizers pertain to 34 topics. We came up with a list of 8 generic domains that correspond to these topics, as shown in Table 2.

Message Polarity Classification (A)
For this subtask, we evaluated the English state-of-the-art approach (System 1), the cluster-based SA approach (System 2) and RAE (System 3). The development and test results are illustrated in Table 3. It can be observed that System 1 achieved the best development results, and hence was used at the test phase. System 2 achieved slightly lower recall and higher accuracy, which indicates the potential benefits of training different sentiment classifiers for different clusters. Also, the inferior performance produced by System 3 can be due to its reliance on Arabic NLP tools, such as the syntactic parser, that are trained on MSA data, whereas the evaluation data are tweets that are likely to be noisy in terms of containing significant amounts of misspellings and grammatical errors.

Topic-based Polarity Classification (B-C)
For these subtasks, we evaluated the English state-of-the-art approach (System 1) and the different configurations of the topic-based SA approach (System 4) as discussed in subsection 3.2. The development and testing results for the 2-point and the 5-point scale predictions are illustrated in Tables 4 and 5, respectively. For Subtask B, it can be observed that ignoring the topic and domain information achieves the highest performance. It can also be observed that generalizing from topics to domains in System 4 achieves better results than working at the topic level only. As for Subtask C, results indicate that using topic-specific sentiment classifiers, and backing them with domain-specific sentiment classifiers, achieves the best performance in the competition on that subtask.

Tweet Quantification (D-E)
For these subtasks, we evaluated the English state-of-the-art approach (System 1) and the different configurations of the topic-based SA approach (System 4). The development and testing results for the 2-point and the 5-point scale quantifications are illustrated in Tables 6 and 7, respectively. For both subtasks, it can be observed that ignoring the topic and domain information achieves the best performance. For subtask D, using the features from System 1 achieved the best development results, and ranked 2nd in the competition. On the other hand, for subtask E, it turns out that using the simple n-gram features for direct sentiment classification ranked 1st in the competition.

Conclusion
In this paper, we evaluated the application of recent state-of-the-art English models for sentiment analysis in Arabic tweets. These systems were used to perform all Arabic-related subtasks in SemEval-2017 Task 4.
In some cases, such as for message polarity classification (subtask A), the feature-based approach outperformed an RAE deep learning approach and another system that is based on creating semantic clusters for the tweets and training a sentiment classifier for each cluster.
For topic-based polarity classification (subtasks B and C) and topic-based tweet quantification (subtasks D and E), we evaluated a system that predicts the topic of incoming tweets, and then predicts their sentiment using topic-specific sentiment classifiers. We allow this system to generalize from topics to domains. Results indicate that ignoring the topic and the domain information achieves better performance, with the exception of subtask C, where using topic-specific classifiers and backing them with domain-specific classifiers performs better.
As part of our future work, we will focus on developing SA models for different Arabic dialects, and on performing cross-regional evaluations to confirm whether different models are needed for different regions and dialects, or whether a general model can work for any tweet regardless of its origin.

Figure 1: Architecture of the topic-based sentiment analysis system.

Table 2: The list of 8 generalized domains corresponding to the 34 topics in the training dataset.

Table 6: Results for subtask D (rank: #2/3).
Sys 4 [M1→M2→M3] 0.473
Sys 4 [M2→M3]