OMAM at SemEval-2017 Task 4: English Sentiment Analysis with Conditional Random Fields

We describe a supervised system that uses optimized Conditional Random Fields and lexical features to predict the sentiment of a tweet. The system was submitted to the English version of all subtasks in SemEval-2017 Task 4.


Introduction
Sentiment analysis, sometimes known as opinion mining, is the process of detecting the contextual polarity of text. That is, given a text (of any length), subjective information pertaining to the sentiment attached to the text is derived using natural language processing tools (Pang et al., 2008; Cambria et al., 2013). Sentiment analysis can be approached in two ways. General sentiment analysis, often termed sentence-level sentiment analysis, extracts the general sentiment of the text based solely on its contents; the sentiment is not related to any external entity. On the other hand, topic-level sentiment analysis infers the sentiment of the given text towards a specific topic. This branch of sentiment analysis has been further explored under the term stance detection (Faulkner, 2014; Anand et al., 2011). With the rapid increase in different forms of online expression such as reviews, political criticism, ratings and punditry, social media has become an invaluable source of data for research in sentiment analysis. With data from social media, sentiment analysis can reveal the public sentiment towards current topics of public discourse. Twitter is one of the largest such social media platforms and a prominent source of data for sentiment research (Pak and Paroubek, 2010; Wang et al., 2011; Rajadesingan and Liu, 2014). In this paper, we describe the components and results of a system for English sentiment analysis with which we participated in an international shared task on sentiment analysis for Twitter data.

Shared Task Description
The SemEval-2017 Task 4 (Rosenthal et al., 2017) (henceforth SemEval) is aimed at categorizing tweets from Twitter. This task is composed of five subtasks. Subtask A is a message polarity classification task where tweets are classified on general sentiment (not directed at any topic) on a three-way scale: Negative, Neutral and Positive (henceforth, −1, 0, +1). Subtask B is a topic-based message polarity classification where tweets are classified on sentiment towards a given topic on a two-way scale: Negative and Positive (henceforth, −1, +1). Subtask C is the same topic-based task as B, except that it uses a five-point sentiment scale (−2, −1, 0, +1, +2), where −2 is very negative and +2 is very positive. Both subtasks D and E are tweet quantification tasks based on subtasks B and C, respectively. In both D and E, given the same datasets as B and C, the system must estimate, for each topic, the distribution of the tweets across the labels of the given scale.
This task is a rerun of SemEval-2016 Task 4 (Nakov et al., 2016), with some changes. For this task, user profile information of the author of each tweet was made available. Also, this task included an Arabic language version. Our system works on English but was submitted as part of the OMAM (Opinion Mining for Arabic and More) team, which also submitted a system that analyzes sentiment in Arabic (Baly et al., 2017).

Approach
For all subtasks, we used the same setup (process and system). We used CRF++ (Kudo, 2005), an implementation of Conditional Random Fields (CRF), as the underlying machine learning component. We were inspired by the work of Yang et al. (2007), who used CRFs to determine the sentiment of web blogs, training at the sentence level and classifying at the document level, where the sentence sequence was taken into consideration. For this shared task, however, the tweets are not ordered, so there is no sequence information to be exploited. Nevertheless, we were interested in benchmarking how CRFs would fare in this scenario. We optimized the lexical features as well as the CRF++ parameters for each subtask independently against the specific subtask metrics. Although some subtasks involved topic-level sentiment analysis (i.e., sentiment towards a target), we ignored the topics for all subtasks. This idea is taken from TwiSE (Balikas and Amini, 2016), the top scoring submission to SemEval-2016 Task 4C, which used a single-label multiclass classifier and ignored topics altogether. For the tweet quantification tasks, we used a simple aggregation script (supplied by SemEval for a previous iteration of this task).
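The quantification step can be illustrated with a minimal sketch (this is not the actual SemEval aggregation script, only an illustration of the idea): predicted labels are counted per topic and normalized into a distribution.

```python
from collections import Counter

def aggregate(predictions):
    """Turn per-tweet (topic, label) predictions into a per-topic
    label distribution -- a simple classify-and-count quantification."""
    by_topic = {}
    for topic, label in predictions:
        by_topic.setdefault(topic, []).append(label)
    return {
        topic: {lab: n / len(labels) for lab, n in Counter(labels).items()}
        for topic, labels in by_topic.items()
    }

# Hypothetical predictions for one topic:
preds = [("iphone", 1), ("iphone", 1), ("iphone", -1), ("iphone", 1)]
print(aggregate(preds))  # {'iphone': {1: 0.75, -1: 0.25}}
```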

Data Preprocessing
We make use of all the data provided by SemEval for training and testing for all five subtasks. Additionally, we use a data set of 4,000 tweets available from SentiStrength (http://sentistrength.wlv.ac.uk/). In the SemEval data, each tweet is paired with a TweetID, Topic and Label, except for the subtask A data, which has no Topic. For the SentiStrength data, each tweet is assigned two values representing positive and negative sentiment.
In order to use the training data from other subtasks and from SentiStrength in a subtask, we convert across the different label sets. For subtask A (three-point scale), we mapped the labels −2 and +2 in subtask C's five-point data to −1 and +1, respectively, but used subtask B's data as is. We also summed the SentiStrength data's two values and mapped the result to (−1, 0, +1). For subtask B (and D) (two-point scale), we folded the five-point labels as above and had two options regarding neutral values: remove the neutral tweets, or duplicate them and relabel them as positive once and negative once. We also explored classifying with higher point scales and mapping down; details are discussed in Section 4.2. For subtask C (and E) (five-point scale), we converted data labels from other subtasks using duplication and relabeling: tweets with positive labels were duplicated and labeled with +1 and +2; tweets with negative labels were duplicated and labeled with −1 and −2; and the neutral tweets (0) were simply duplicated to maintain balance. When converting subtask A data for use in other subtasks, a placeholder topic column was added. This did not influence the system, as topics are not considered in any of the subtasks. The SentiStrength data was first mapped to subtask A format, then duplicated and relabeled.
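The label conversions described above can be sketched as follows. The helper names are our own, and the exact sign rule for collapsing SentiStrength's two values into a single label is an illustrative assumption.

```python
def to_three_point(label):
    """Fold a five-point label (-2..+2) onto the three-point scale."""
    return max(-1, min(1, label))

def sentistrength_to_three_point(pos, neg):
    """Map a SentiStrength (positive, negative) pair to -1/0/+1 by
    summing the two values; the sign thresholds are our assumption."""
    total = pos + neg
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0

def to_five_point(label):
    """Duplicate-and-relabel a three-point tweet into five-point labels."""
    if label == 1:
        return [1, 2]
    if label == -1:
        return [-1, -2]
    return [0, 0]  # neutral tweets duplicated to maintain balance
```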

Lexical Features
We considered the following lexical features with the CRF++ system. The unigram feature was always used, but feature combinations were explored for the other lexical features.
• Unigrams The unique words in each tweet consisting of alphanumeric characters and punctuation.
• Tweet length (twtlen) The number of words in the tweet. We also used a binned version of this feature (twtlenbin), which groups tweet lengths into ranges.
• Bigrams The unique bigrams in the tweet.
• SentiStrength (senti) The SentiStrength tool estimates the strength of positive and negative sentiment in short texts (Thelwall et al., 2010). The tool returns two values representing negative sentiment (range −1 to −5) and positive sentiment (range +1 to +5). Both values are used, as well as their sum, and a mapped value (onto the range of −2 to +2).
• Removed URL (rurl) All URLs are replaced with the string 'EXTERNALURL'. If the removed URL feature is true, that string is removed.
• Stopwords (stpwrd) This feature removes all stopwords in the tweet.
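As an illustration, the four SentiStrength-derived feature values can be computed as follows. The exact mapping onto the −2 to +2 range is not spelled out above, so the clamping of the sum used here is only one plausible scheme.

```python
def senti_features(pos, neg):
    """Derive the SentiStrength-based feature values: the raw positive
    and negative scores, their sum, and a value mapped onto -2..+2
    (the clamping used for the mapping is our own assumption)."""
    total = pos + neg
    mapped = max(-2, min(2, total))
    return {"senti_pos": pos, "senti_neg": neg,
            "senti_sum": total, "senti_mapped": mapped}

print(senti_features(4, -1))
# {'senti_pos': 4, 'senti_neg': -1, 'senti_sum': 3, 'senti_mapped': 2}
```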

Model Optimization
We optimize the CRF++ model on the training data in two phases. First, the seven lexical features discussed above are exhaustively combined to identify the best feature combination for each subtask separately. Additionally, we explored combinations of different data sets, e.g., using SentiStrength data and/or subtask A data for subtask C. Using all of the available data for each subtask produced the best results. During this phase, CRF++ is run with default parameter values. Next, the model is further optimized by tuning the CRF++ parameters c and f. The c value sets the regularization hyper-parameter of the CRF, balancing between overfitting and underfitting. The f parameter sets the cut-off threshold for the features. We explored all combinations of c ranging from 0.5 to 10.0 (in increments of 0.5) and f ranging from 1 to 4 (in increments of 1).
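The parameter tuning phase amounts to an exhaustive grid search, which can be sketched as follows. Here `train_and_score` is a hypothetical callback that would wrap training (e.g. `crf_learn -c C -f F ...`) and evaluation against the subtask metric.

```python
import itertools

def grid_search(train_and_score, c_values=None, f_values=None):
    """Exhaustively search CRF++'s c (regularization) and f (feature
    cut-off) parameters, returning the best (score, c, f) triple."""
    c_values = c_values or [0.5 * i for i in range(1, 21)]  # 0.5 .. 10.0
    f_values = f_values or [1, 2, 3, 4]
    best = None
    for c, f in itertools.product(c_values, f_values):
        score = train_and_score(c, f)  # higher is assumed better
        if best is None or score > best[0]:
            best = (score, c, f)
    return best
```

For metrics where lower is better (e.g. MAE^M or KL), the score would simply be negated before comparison.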

Evaluation Metrics
Each subtask had its own target metric (Rosenthal et al., 2017). Subtasks A and B use AvgR, macro-averaged recall (recall averaged across the targeted labels). Subtask C uses MAE^M, macro-averaged mean absolute error. Subtask D uses KL, Kullback-Leibler Divergence. Subtask E uses EMD, Earth Mover's Distance.
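For reference, the two classification metrics can be computed as in the following minimal sketch, which assumes every targeted label occurs at least once in the gold standard:

```python
def avg_recall(gold, pred, labels=(-1, 0, 1)):
    """Macro-averaged recall: recall per label, then averaged,
    so all classes count equally regardless of frequency."""
    recalls = []
    for lab in labels:
        hits = [g == p for g, p in zip(gold, pred) if g == lab]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / len(recalls)

def mae_macro(gold, pred, labels=(-2, -1, 0, 1, 2)):
    """Macro-averaged mean absolute error (MAE^M): the MAE is computed
    per gold label and then averaged across labels."""
    errors = []
    for lab in labels:
        diffs = [abs(g - p) for g, p in zip(gold, pred) if g == lab]
        errors.append(sum(diffs) / len(diffs))
    return sum(errors) / len(errors)
```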

Subtask A
For subtask A, we used the following data sets for training: SemEval 2016 task 4A data (train, dev and devtest), SemEval 2016 task 4C data (train, test, dev and devtest) and SentiStrength Twitter data. Table 1 shows the five top performing combinations from the lexical feature optimization phase. The combination of the senti feature with unigrams performed best. Table 2 shows the five top performing combinations for the CRF parameter optimization phase. The best setup, with c value 8.5, f value 1 and features unigram and senti, was used to produce the prediction file submitted for subtask A for SemEval 2017. Our submission received the following scores and ranks (in parentheses) out of 38 systems: average recall AvgR = 0.590 (24), AvgF1 = 0.542 (26), Acc = 0.615 (19).

Subtasks B and D
For subtasks B and D, we explored the possibility of training and predicting in five-point scale space and then mapping to two-point scale space. In the mapping, −2 and +2 map to −1 and +1, respectively. When the system predicts the neutral label (0), we select the next most probable label, determined using CRF++'s verbose mode. For subtasks B and D, we used the following training data: SemEval task 4B data, SemEval task 4A data, SemEval task 4C data and SentiStrength Twitter data. Table 3(a) shows the five top performing lexical feature combinations for subtask B. All feature combinations in Table 3(a) were from the setup where classification is done in two-point format and the neutral tweets from subtask A and C data were removed.
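The mapping from five-point predictions to the two-point scale, including the backoff from a neutral prediction to the next most probable non-neutral label, can be sketched as follows. Here `ranked_alternatives` stands in for the ranked label list parsed from CRF++'s verbose output.

```python
def five_to_two(label, ranked_alternatives):
    """Map a five-point prediction onto the two-point scale: -2/+2 fold
    to -1/+1; a neutral (0) prediction backs off to the next most
    probable non-neutral label from the ranked alternatives."""
    if label != 0:
        return 1 if label > 0 else -1
    for alt in ranked_alternatives:
        if alt != 0:
            return 1 if alt > 0 else -1
    return 1  # arbitrary fallback; should not occur in practice
```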
Table 3(b) shows the five top performing c and f value combinations for subtask B. The highest performing setup, with c value 0.5, f value 1 and features twtlenbin, rurl, bigram and senti, was used to produce the prediction file submitted for subtask B for SemEval 2017. Our submission received the following scores and ranks (in parentheses) out of 23 systems: average recall AvgR = 0.779 (15), AvgF1 = 0.762 (17), Acc = 0.764 (17).
For subtask D, the five top performing feature combinations are shown in Table 5. All combinations in Table 5, except the second, are from the setup that predicts on five-point labeled data where the neutral tweets are preserved and duplicated. The second combination (unigram, stpwrd, twtlen, senti) is from the setup that predicts on five-point labeled data where the neutral tweets are removed.
The combination with the best score used the features unigram, twtlenbin, rurl and senti. It was used for prediction on the test file, which was then aggregated and submitted for subtask D. Our submission received the following scores and ranks (in parentheses) out of 15 systems: KL = 0.164 (12), AE = 0.204 (12), RAE = 2.790 (12).

Subtasks C and E
For subtasks C and E, we used the following data for training: SemEval 2016 task 4A data, SemEval 2016 task 4C data and SentiStrength Twitter data. Table 4(a) shows the five top performing lexical feature combinations for subtask C. Table 4(b) shows the five top performing c and f value combinations for subtask C. The highest performing setup, with c value 10.0, f value 2 and features unigram, twtlen, twtlenbin, rurl and senti, was used to produce the prediction file submitted for subtask C for SemEval 2017. Our submission received the following scores and ranks (in parentheses) out of 15 systems: MAE^M = 0.895 (10), MAE^μ = 0.475 (1).
For subtask E, Table 6 shows the five top performing lexical combinations.
The combination with the best score used the features unigram, twtlenbin, rurl and senti. It was used to create the prediction file, which was then aggregated and submitted for subtask E. Our submission received the following

Conclusion and Future Work
In this paper, we presented a system for English sentence-level sentiment analysis of Twitter using CRF++ and optimized lexical features. We explored feature combinations and tuned CRF++ parameter values to find the best setup for each subtask. Overall, the unigram and SentiStrength (senti) features were always present in the best performing setups for all subtasks. In all subtasks other than A, binned tweet length (twtlenbin) and removing URLs (rurl) consistently helped. We used this system to participate in SemEval-2017 Task 4. The system's performance was middle of the pack, which was accomplished while ignoring topics in the topic-level tasks.
In the future, we will explore more lexical features and other CRF and SVM implementations. We also look forward to applying the same setup to other languages.