SiTAKA at SemEval-2017 Task 4: Sentiment Analysis in Twitter Based on a Rich Set of Features

This paper describes SiTAKA, our system for Subtask A (English and Arabic) of SemEval-2017 Task 4, Sentiment Analysis in Twitter. The system represents tweets using a novel set of features, which include a bag of negated words and information provided by several lexicons. The polarity of tweets is determined by a classifier based on a Support Vector Machine. Our system ranked 2nd among 8 systems on the Arabic-language tweets and 8th among 38 systems on the English-language tweets.


Introduction
Sentiment analysis in Twitter is the problem of identifying people's opinions expressed in tweets. It normally involves the classification of tweets into categories such as positive, negative and, in some cases, neutral. The main challenges in designing a sentiment analysis system for Twitter are the following:
• Twitter limits the length of a message to 140 characters, which leads users to use novel abbreviations and often to disregard standard sentence structure.
• The language is informal and spelling errors are numerous.
Most of the existing systems are inspired by the work presented in (Pang et al., 2002). Machine Learning techniques have been used to build a classifier from a set of tweets with manually annotated sentiment polarity. The success of Machine Learning models rests on two main factors: a large amount of labeled data and the intelligent design of a set of features that can distinguish between positive, negative and neutral samples.
With this approach, most studies have focused on designing a set of efficient features to obtain a good classification performance (Feldman, 2013; Liu, 2012; Pang and Lee, 2008). For instance, the authors in (Mohammad et al., 2013) used diverse sentiment lexicons and a variety of hand-crafted features.
This paper proposes the representation of tweets using a novel set of features, which include the information provided by seven lexicons and a bag of negated words (BonW). The concatenation of these features with a set of basic features improves the classification performance. The polarity of tweets is determined by a classifier based on a Support Vector Machine.
The system has been evaluated on the Arabic and English test sets of the Twitter Sentiment Analysis track of SemEval 2017, Subtask A (Message Polarity Classification). Our system (SiTAKA) ranked 8th out of 38 teams on the English test set and 2nd out of 8 teams on the Arabic test set.
The rest of the paper is structured as follows. Section 2 presents the tools and the resources that have been used. In Section 3 we describe the system. The experiments and results are presented and discussed in Section 4. Finally, in the last section the conclusions as well as further work are presented.

Resources
This section explains the tools and resources used in the SiTAKA system. Let us denote its Arabic-language and English-language versions by Ar-SiTAKA and En-SiTAKA, respectively.

En-SiTAKA Lexicons
We used five lexicons for En-SiTAKA, namely: General Inquirer (Stone et al., 1968), the Hu-Liu opinion lexicon (HL) (Hu and Liu, 2004), the NRC hashtags lexicon (Mohammad et al., 2013), SenticNet (Cambria et al., 2014) and TS-Lex (Tang et al., 2014b). More details about each lexicon, such as how it was created, the polarity score of each term and its statistical distribution, can be found in (Jabreel and Moreno, 2016).

Ar-SiTAKA Lexicons
In this version of the SiTAKA system, we used four lexicons created by (Mohammad and Kiritchenko, 2016): the Arabic Hashtag Lexicon, the Dialectal Arabic Hashtag Lexicon, the Arabic Bing Liu Lexicon and the Arabic Sentiment140 Lexicon. The first two were created manually, whereas the last two were translated into Arabic from their English versions using Google Translate.

Embeddings
We used two pre-trained embedding models in En-SiTAKA. The first one is the word2vec model provided by Google, trained on part of the Google News dataset (about 100 billion words); it contains 300-dimensional vectors for 3M words and phrases (Mikolov et al., 2013b). The second one is SSWEu, which was trained to capture the sentiment information of sentences as well as the syntactic contexts of words (Tang et al., 2014c); it contains 50-dimensional vectors for 100K words.
In Ar-SiTAKA we used the Arabic-SKIP-G300 model provided by (Zahran et al., 2015), which was trained on a large corpus of Arabic text collected from different sources, such as the Arabic Wikipedia, the Arabic Gigaword corpus, the King Saud University corpus (Ksucorpus) and the Microsoft crawled Arabic corpus. It contains 300-dimensional vectors for 6M words and phrases.
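As an illustration, a pre-trained model in word2vec binary format can be loaded with gensim; the library choice and the file name below are our own placeholders, not something stated in the paper.

```python
from gensim.models import KeyedVectors

# Load a model in word2vec binary format; the file name is a placeholder.
embeddings = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(embeddings["coffee"].shape)              # (300,)
print(embeddings.similarity("coffee", "tea"))  # related words score high
```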

System Description
This section explains the main steps of the SiTAKA system, the features used to describe a tweet and the classification method.

Preprocessing and Normalization
Some standard pre-processing methods are applied to the tweets:
• Normalization: Each English-language tweet is converted to lowercase. URLs and usernames are omitted. Non-Arabic letters are removed from each tweet in the Arabic-language sets. Words with repeated letters (i.e. elongated words) are corrected.
• Tokenization and POS tagging: All English-language tweets are tokenized and tagged using Ark Tweet NLP (Gimpel et al., 2011), while all Arabic-language tweets are tokenized and tagged using the Stanford Tagger (Green and Manning, 2010).
• Negation: A negated context can be defined as a segment of a tweet that starts with a negation word (e.g. no or don't in English, and the corresponding negation particles in Arabic) and ends with a punctuation mark (Pang et al., 2002). Each tweet is processed by appending a suffix ("_NEG" in English and an analogous suffix in Arabic) to every word in a negated context; a sketch of this step is given at the end of this subsection.
It is necessary to mention that in Ar-SiTAKA we did not use all the Arabic negation words, due to the ambiguity of some of them. For example, one such word acts as an interrogative particle in one sentence ("What do you think about what happened?") and means "which/that" in another ("The matter that happened today was very bad").
As shown in (Saif et al., 2014), stopwords tend to carry sentiment information; therefore, they were not removed from the tweets.
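To make the negation step concrete, here is a minimal sketch for English, assuming whitespace-tokenized input and the "_NEG" suffix convention of Pang et al. (2002); the negation word list is an illustrative subset.

```python
# Illustrative subset of English negation cues; the real list is larger.
NEGATION_WORDS = {"no", "not", "never", "don't", "didn't", "doesn't", "cannot"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negated_context(tokens):
    """Append _NEG to every token between a negation word and the
    next punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok.lower() in NEGATION_WORDS:
                negated = True
    return out

print(mark_negated_context("i don't like this movie at all .".split()))
# ['i', "don't", 'like_NEG', 'this_NEG', 'movie_NEG', 'at_NEG', 'all_NEG', '.']
```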

Features Extraction
SiTAKA uses five types of features: basic text, syntactic, lexicon, cluster and word-embedding features. They are described in the following subsections.

Basic Features
These basic features are extracted from the text. They are the following:
Bag of Words (BoW): Bag-of-words or n-gram features introduce some contextual information. The presence or absence of contiguous sequences of 1, 2, 3 and 4 tokens is used to represent the tweets.
Bag of Negated Words (BonW): Negated contexts are an important cue in the sentiment analysis problem. Thus, the presence or absence of contiguous sequences of 1, 2, 3 and 4 tokens inside the negated contexts is used as an additional set of features to represent the tweets.
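A minimal sketch of these two feature blocks, assuming pre-tokenized text with negated contexts already identified in the preprocessing step; everything except the n-gram range and the binary (presence/absence) indicator is an illustrative choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Presence/absence of 1- to 4-grams over all tokens (BoW) and over
# the tokens of the negated contexts only (BonW).
bow_vectorizer = CountVectorizer(ngram_range=(1, 4), binary=True)
bonw_vectorizer = CountVectorizer(ngram_range=(1, 4), binary=True)

tweets = ["i love this phone", "i don't like this phone at all"]
negated_parts = ["", "like this phone at all"]  # words inside negated contexts

X_bow = bow_vectorizer.fit_transform(tweets)    # sparse, one row per tweet
X_bonw = bonw_vectorizer.fit_transform(negated_parts)
```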

Syntactic Features
Syntactic features are useful to discriminate between neutral and non-neutral texts.
Part of Speech (POS): Subjective and objective texts have different POS tag distributions (Pak and Paroubek, 2010). Non-neutral terms are more likely to exhibit the following POS tags in Twitter: nouns, adjectives, adverbs, abbreviations and interjections. The number of occurrences of each POS tag is used to represent each tweet.
Bi-tagged: Bi-tagged features are extracted by combining the tokens of the bi-grams with their POS tags, e.g. "feel_VBP good_JJ". It has been shown in the literature that adjectives and adverbs are subjective in nature and help to increase the degree of expressiveness (Agarwal et al., 2013; Pang et al., 2002).
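A sketch of the two syntactic feature blocks; the (token, tag) pairs would come from Ark Tweet NLP or the Stanford Tagger, and the example tags below are assumptions.

```python
from collections import Counter

def syntactic_features(tagged):
    """tagged: list of (token, POS tag) pairs produced by the tagger."""
    # POS counts: number of occurrences of each tag in the tweet
    pos_counts = Counter(tag for _, tag in tagged)
    # Bi-tagged features: consecutive token_TAG pairs
    bitagged = [f"{w1}_{t1} {w2}_{t2}"
                for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])]
    return pos_counts, bitagged

pairs = [("i", "PRP"), ("feel", "VBP"), ("good", "JJ")]
print(syntactic_features(pairs)[1])  # ['i_PRP feel_VBP', 'feel_VBP good_JJ']
```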

Lexicon Features
Opinion lexicons play an important role in sentiment analysis systems, and the majority of the existing systems rely heavily on them (Rosenthal et al., 2014). For each of the chosen lexicons, a tweet is represented by calculating the following features: (1) tweet polarity, (2) the average polarity of the positive terms, (3) the average polarity of the negative terms, (4) the score of the last positive term, (5) the score of the last negative term, (6) the maximum positive score and (7) the minimum negative score.
The polarity of a tweet T given a lexicon L is calculated using equation (1): first, the tweet is tokenized; then, the number of positive (P) and negative (N) tokens found in the lexicon is counted; finally, the polarity measure is computed from these counts.
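A minimal sketch of the seven lexicon features described above. The lexicon is modeled as a dict mapping tokens to signed scores; since equation (1) is not reproduced here, the tweet-polarity feature uses the common count-based ratio (P − N)/(P + N) as an assumed stand-in.

```python
def lexicon_features(tokens, lexicon):
    """lexicon: dict token -> score (positive > 0, negative < 0)."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    pos = [s for s in scores if s > 0]
    neg = [s for s in scores if s < 0]
    P, N = len(pos), len(neg)
    return {
        "polarity": (P - N) / (P + N) if P + N else 0.0,  # assumed form
        "avg_pos": sum(pos) / P if P else 0.0,
        "avg_neg": sum(neg) / N if N else 0.0,
        "last_pos": pos[-1] if pos else 0.0,
        "last_neg": neg[-1] if neg else 0.0,
        "max_pos": max(pos, default=0.0),
        "min_neg": min(neg, default=0.0),
    }

hl = {"good": 1.0, "bad": -1.0, "awful": -2.5}
print(lexicon_features("not so bad , quite good".split(), hl))
```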

Cluster Features
We used two sets of clusters in En-SiTAKA to represent the English-language tweets, mapping each word of a tweet to the cluster it belongs to. The first one is the well-known set of clusters provided with the Ark Tweet NLP tool, which contains 1000 clusters produced with the Brown clustering algorithm from 56M English-language tweets. The second one is the word2vec cluster n-grams resource provided by (Dong et al., 2015); they used the word2vec tool to learn 40-dimensional word embeddings of 255,657 words from a Twitter dataset and the K-means algorithm to cluster them into 4960 clusters. We were not able to find publicly available semantic clusters for Ar-SiTAKA.
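A sketch of the cluster features as a binary bag-of-clusters vector; the word2cluster mapping stands in for either resource (the 1000 Brown clusters or the 4960 word2vec clusters), and the toy mapping below is hypothetical.

```python
def cluster_features(tokens, word2cluster, n_clusters):
    """Binary vector with a 1 for every cluster that occurs in the tweet."""
    vec = [0] * n_clusters
    for tok in tokens:
        cid = word2cluster.get(tok)
        if cid is not None:
            vec[cid] = 1
    return vec

word2cluster = {"coffee": 7, "tea": 7, "awful": 42}
print(cluster_features("coffee is awful".split(), word2cluster, 50)[7])  # 1
```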

Embedding Features
Word embeddings are an approach to distributional semantics that represents words as vectors of real numbers. Such a representation has useful clustering properties, since words that are semantically and syntactically related are represented by similar vectors (Mikolov et al., 2013a). For example, the words "coffee" and "tea" will be very close in the created space. We used the sum, standard-deviation, min and max pooling functions (Collobert et al., 2011) to obtain the tweet representation in the embedding space; the result is the concatenation of the vectors derived from the different pooling functions. More formally, let us consider an embedding matrix E ∈ R^{d×|V|} and a tweet T = (w_1, w_2, ..., w_n), where d is the dimension size, |V| is the size of the vocabulary (i.e. the number of words in the embedding model), w_i is the i-th word in the tweet and n is the number of words. First, each word w_i is substituted by the corresponding vector v_i^j in the matrix E, where j is the index of the word w_i in the vocabulary. This step yields the matrix W ∈ R^{d×n}. The vector V_{T,E} is then computed as

V_{T,E} = f_sum(W) ⊕ f_std(W) ⊕ f_min(W) ⊕ f_max(W),

where ⊕ denotes the concatenation operation. Each pooling function is applied element-wise, per dimension, over the n columns of W; it converts texts of various lengths into a fixed-length vector, allowing the model to capture information throughout the entire text.
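A sketch of this pooling step in NumPy (here the word vectors are stacked as rows rather than columns, which does not change the result); out-of-vocabulary words are simply skipped.

```python
import numpy as np

def tweet_embedding(tokens, embeddings, dim):
    """Concatenate sum, std, min and max pooling of the word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(4 * dim)
    W = np.stack(vecs)                      # shape: (n_words, dim)
    return np.concatenate([W.sum(axis=0), W.std(axis=0),
                           W.min(axis=0), W.max(axis=0)])  # (4 * dim,)

toy = {"good": np.array([0.9, 0.1]), "movie": np.array([0.2, 0.4])}
print(tweet_embedding("good movie !".split(), toy, dim=2))  # length-8 vector
```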

Classifier
Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) have so far been widely used and often reported as the best-performing classifier for the sentiment analysis problem. Thus, we trained an SVM classifier on the training sets provided by the organizers. For English, we combined the SemEval 2013-2016 training sets and the SemEval 2013-2015 test sets and used them as the training set. Table 1 shows a numerical description of the datasets used in this work. We used the linear kernel with the cost parameter C set to 0.5. All the parameters and the set of features were chosen experimentally based on the development sets.
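A toy sketch of this setup with scikit-learn's linear SVM and C = 0.5; in the full system the input would be the concatenation of all the feature blocks of Section 3, not just a bag of words, and the tiny training set below is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_tweets = ["i love it", "i hate it", "it is a phone"]
train_labels = ["positive", "negative", "neutral"]

# Stand-in feature extraction: binary 1- to 4-gram presence
vectorizer = CountVectorizer(ngram_range=(1, 4), binary=True)
X_train = vectorizer.fit_transform(train_tweets)

clf = LinearSVC(C=0.5)          # linear kernel, cost parameter C = 0.5
clf.fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(["i love this phone"])))
```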

Results
The evaluation metrics used by the task organizers were the macro-averaged recall (ρ), the F1 score averaged across the positive and negative classes (F1^PN) and the accuracy (Acc) (Rosenthal et al., 2017). The system was tested on 12,284 English-language tweets and 6,100 Arabic-language tweets provided by the organizers. The gold labels of the test tweets were withheld by the organizers during the evaluation. The official evaluation results of our system are reported along with the top-10 systems and the baseline results in Tables 2 and 3. Our system ranks 8th among 38 systems on the English-language tweets and 2nd among 8 systems on the Arabic-language tweets. Baselines 1, 2 and 3 stand for the cases in which the system classifies all the tweets as positive, negative and neutral, respectively.
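The three official metrics can be computed as sketched below; y_true and y_pred are placeholder label lists.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]

rho = recall_score(y_true, y_pred, average="macro")   # macro-averaged recall
f1_pn = f1_score(y_true, y_pred, average="macro",
                 labels=["positive", "negative"])     # F1 over P and N only
acc = accuracy_score(y_true, y_pred)
print(rho, f1_pn, acc)
```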

Conclusion
We have presented a new set of rich sentiment features for the analysis of messages posted on Twitter. A Support Vector Machine classifier was trained using a set of basic features, information extracted from a set of useful and publicly available opinion lexicons, and syntactic, cluster and embedding features. Deep learning approaches have recently been used to build supervised, unsupervised or even semi-supervised methods to analyze the sentiment of texts and to build efficient opinion lexicons (Severyn and Moschitti, 2015; Tang et al., 2014a,c); thus, the authors are considering the possibility of also using these techniques to build a sentiment analysis system.