UDLAP: Sentiment Analysis Using a Graph-Based Representation

We present an approach for tackling the Sentiment Analysis problem in SemEval 2015. The approach is based on the use of a co-occurrence graph to represent existing relationships among terms in a document with the aim of using centrality measures to extract the most representative words that express the sentiment. These words are then used in a supervised learning algorithm as features to obtain the polarity of unknown documents. The best results obtained for the different datasets are: 77.76% for positive, 100% for negative and 68.04% for neutral, showing that the proposed graph-based representation could be a way of extracting terms that are relevant to detect a sentiment.


Introduction
In the past decade, new forms of communication, such as microblogging and text messaging have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, often these short messages are used to share opinions and sentiments that people have about what is going on in the world around them. Working with these informal text genres presents challenges for natural language processing (NLP) beyond those encountered when working with more traditional text genres. Typically this kind of texts are short and the language used is very informal. We can find creative spelling and punctuation, slang, new words, URLs, and genre-specific terminology and abbreviations that make their manipulation more challenging.
Representing that kind of text for automatically mining and understanding the opinions and sentiments that people communicate inside them has very recently become an attractive research topic (Pang, 2008). In this sense, the experiments reported in this paper were carried out in the framework of the SemEval 2015 1 (Semantic Evaluation) which has created a series of tasks for Sentiment Analysis on Twitter (Rosenthal, 2015). Among the proposed tasks we find Task 10, subtask B which was named Message Polarity Classification and was defined as follows: "Given a message, classify whether the message is of positive, negative, or neutral sentiment. For messages conveying both a positive and a negative sentiment, whichever is the stronger sentiment should be chosen". In order to solve this task we create an approach that uses a graph based representation to extract relevant words that are used in a supervised learning method to classify a set of unknown documents in different topics and genres provided by the SemEval team. The methodology for our approach is discussed in detail in the next sections.
The rest of the paper is structured as follows. In Section 2 we present some related work found in the literature with respect to the identification of sentiments in text documents. In Section 3 a graph based representation is proposed. In Section 4 the methodology and the tools used to detect the sentiments of a set of unknown documents are explained. In Section 5, the experimental results are presented and discussed. Finally, in Section 6 the conclusions as well as further work are described.

Related Work
There exist a number of works in literature associated to the automatic identification of sentiments in documents. Some of these works have focused on the contribution of particular features, such as the use of the vocabulary to extract lexical elements associated to the documents (Kim, 2006), the use of bigrams and trigrams (Dave, 2008) to capture syntactic features of texts associated with a sentiment, the use of dictionaries and emoticons of positive and negative words (Agarwal, 2011) as well as lexicalsyntactic features or the use of Part of Speech tags (PoS) (Wilks, 1999;Whitelaw, 2005) as syntactic features that can help to disambiguate the polarity of the words in a context.
In the other hand, many contributions focused on the use of structures to represent the features associated to a document like the frequency of occurrence vector (Wrobel, 2002;Aizawa, 2003;Serrano, 2006). Finally, linear representation of documents features combined with the use of a Support Vector Machine (SVM) has shown great performance in the tasks associated with the classification of texts (Vapnik, 1995;Joachims, 1998).
Research works that use graph representations for texts in the context of Sentiment Analysis barely appear in the literature (Pinto, 2014;Poria, 2014). It usually has been proposed the concept of n-grams with a frequency of occurrence vector to solved it (Pang, 2008). However, there is still an enormous gap between this approach and the use of more detailed graph structures that represent in a natural way the lexical, semantic and stylistic features.

Graph-Based Representation
Among different proposals for mapping texts to graphs, the co-occurrence of words (Sonawane, 2014) has become a simple but effective way to represent the relationship of one term over another one in texts where there is no syntactic order (usually social media texts like Twitter or SMS). Formally the proposed co-ocurrence graph is represented by G = (V, E, L, α), where: • V = {v i |i = 1, ..., n} is a finite set of vertices that consists of the words contained in one or several texts.
• E ⊆ V × V is the finite set of edges which represents that two vertices are connected by means of the co-occurrence, where: -Two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set to any value (typically between two and ten words).
• L is the edges tag set which consists of the number of times that two vertices co-occur in a text window.
• α : E → L is a function that assigns a tag to a pair of associated vertices.
As an example, consider the following sentence ζ extracted from a text T in the dataset: "They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner, they are really bad.", which after the preprocessing stage (see Section 4) would be as follows: "may have SuperBowl Dallas Dallas ain't winning SuperBowl quarterback owner are bad". Based on the proposed representation, preprocessed sentence ζ can be mapped to the proposed co-ocurrence graph shown in Figure 1.

Figure 2: Sentiment Analysis Process
The co-occurrence graph shown in figure 1 has the following features: • Terms co-occur within a window of 3 words.
• The set of vertices consists of the preprocessed words in sentence ζ.
• An edge between two vertices represent that both words appear in the same co-occurrence window (at least once).
• The label edge between two vertices represents the number of times that two words appear in a co-occurrence window in sentence ζ. Figure 2 shows the methodology used to detect the sentiments associated to a set of unknown documents, considering the use of graphs to extract the most relevant words associated to the documents. The methodology consists of five steps:

Sentiment Analysis Process Using A Graph Representation
1. Preprocess all documents associated with the SemEval 2015 dataset. This task includes elimination of punctuation symbols and all the elements that are not part of the ASCII encoding.
Then, each preprocessed sentence in a text is tagged with its corresponding PoS tags, for this step, the TreeTagger tool 2 was used.
2. Map only the nouns, verbs and adjectives of all documents in the training set to a graph representation (see section 3).
3. Apply the Degree and Closeness centrality measures (Freeman, 1979) which are indicators that identify the most important vertices within a graph, where: • The Degree centrality is defined as the number of links incident upon a vertex in the graph and is used to find the topologically representative words. • The Closeness centrality is defined as the average sum of the shortest paths from one vertex to the others in the graph and is used to find the most accessible words in the graph which consequently are syntactically relevant.
4. For each document in the training and test collection extract the top 100 ranked vertices (the most important words in the graph) from both centrality measures in the graph without repetition and use them to build a frequency of occurrence vector (Manning, 2008).
5. Apply a SVM classifier (Harrington, 2012) with a polynomial kernel implemented in the scikit-learn 3 platform (Pedregosa, 2011), in order to construct a classification model which is used for determining the sentiment of a given anonymous document.

Experimental results
The results obtained with the proposed approach are discussed in this section. First, we describe the dataset used in the experiments and, thereafter, the results obtained.

Dataset
The description of the three text collections used in the experiments for the SemEval 2015 is shown in the next table: The test corpus was made up of short texts (messages) categorized as: "Progress Test" and Official 2015 Test. The Progress Test includes the following datasets: LiveJournal2014, SMS2013, Twit-ter2013, Twitter2014 and Twitter2014Sarcasm. A 3 http://scikit-learn.org/stable/ complete description of the training and test datasets can be found at the task description paper (Rosenthal, 2015).

Obtained results
In Table 2 we present results obtained with each dataset considered in the SemEval 2015 competition. The results were evaluated according to the (F 1 pos + F 1 neg )/2 measure (Rosenthal, 2014) for the overall score and the precision measure (Manning, 2008) for each one of the sentiments. Our approach performed in all cases above the baseline. We consider that these results were obtained even though the training corpus was very unbalanced (there were more positive texts than others) and there was a high difference between the vocabulary of the training and test datasets. Further analysis on the use of centrality measures and on the methodology for constructing the graph will allow us to find more accurate features that can be used in a supervised learning method for the Sentiment Analysis problem.

Conclusions
We have presented an approach that uses a supervised learning method with a graph based representation. The results obtained show a competitive performance that is above the baseline score. The model presents a good performance on the Twitter dataset. However, there is still a great deal to improve on the LiveJournal and SMS datasets where the text could be smaller and the use of slang and genre-specific terminology is usual. One of the contributions of this paper is that we use a graph based representation (with an excellent runtime) with centrality measures to discover words related to each sentiment instead of using traditional features like ngrams and vocabulary. As further work we propose the following: • Experiment with other graph representations for texts that include alternative levels of language descriptions such as the use of sentence chunks, pragmatic sentences, etc.
• Apply the graph representation described in this paper to the Authorship Attribution problem (Holmes, 1994), where training and test data sets are balanced and belong to the same linguistic domain.