UDLAP at SemEval-2016 Task 4: Sentiment Quantification Using a Graph Based Representation



Introduction
In the past decade, new forms of communication such as microblogging and text messaging have emerged and become ubiquitous. There is no limit to the range of information conveyed by tweets and texts, and these short messages are extensively used to share the opinions and sentiments that people have about their topics of interest. Working with these informal text genres presents challenges for Natural Language Processing (NLP) beyond those encountered with more traditional text genres. Typically, such texts are short and the language used is very informal: they contain creative spelling and punctuation, slang, new words, URLs, and genre-specific terminology and abbreviations that make them harder to process.
Representing this kind of text in order to automatically mine and understand the opinions and sentiments that people communicate in it has recently become an attractive research topic (Pang and Lee, 2008). The experiments reported in this paper were carried out in the framework of SemEval 2016 (Semantic Evaluation), which has created a series of tasks for sentiment analysis on Twitter (Nakov et al., 2016b). Among the proposed tasks we chose Task 4, subtask D, which was named tweet quantification according to a two-point scale and was defined as follows: "Given a set of tweets known to be about a given topic, estimate the distribution of the tweets across the Positive and Negative classes". To solve this task we created an algorithm that builds graphs to compare each topic against all possible sentiments and obtain the polarity percentage of each one. The steps involved in our sentiment quantification process are discussed in detail in the following sections.
The rest of the paper is structured as follows. Section 2 presents related work on the quantification of sentiments in text documents. Sections 3 to 5 explain the algorithm and the graph representation used to detect the percentage of texts for each sentiment. Section 6 presents and discusses the experimental results. Finally, Section 7 describes the conclusions and further work.

Algorithm 1
Output: /* Positive (p) and negative (n) polarity percentage for each topic */
    P_T = {(p_1, n_1), ..., (p_s, n_s)}
Procedure:
    /* Let G_Positive and G_Negative denote the graphs of the positive and negative documents created from X and Y */
    G_Positive, G_Negative
    for each z_i in Z do
        /* Let G_Topic denote a topic graph created from DT[z_i] */
        G_Topic
        /* Similarity between topic and sentiments, see Algorithm 2 */
        Sim_1 = Similarity(G_Topic, G_Positive)
        Sim_2 = Similarity(G_Topic, G_Negative)
        /* Apply a heuristic */
        if Sim_1 > Sim_2 then

Related Work
A number of works in the literature address the automatic quantification of sentiments in documents. Some of them have focused on the contribution of particular features, such as the use of the vocabulary to extract lexical elements associated with the documents (Kim and Hovy, 2006), the use of part-of-speech tag n-grams and syntactic phrase patterns (Esuli et al., 2010) to capture syntactic features of texts associated with a sentiment, the use of dictionaries of positive and negative words and emoticons (Go et al., 2009), as well as manually and semi-automatically constructed syntactic and semantic phrases and lexicons (Gao and Sebastiani, 2015; Whitelaw et al., 2005).
On the other hand, many contributions have focused on the structures used to represent the features associated with a document, such as the frequency-of-occurrence vector (Manning et al., 2008; Balinsky et al., 2011) or vectors that represent the presence or absence of features (Kiritchenko et al., 2014). Research that uses graph representations of texts in the context of sentiment quantification, however, barely appears in the literature (Pinto et al., 2014; Poria et al., 2014). The usual proposal has been to combine n-grams with a frequency-of-occurrence vector (Pang and Lee, 2008). There is still an enormous gap between this approach and the use of more detailed graph structures that represent lexical, semantic and stylistic features in a natural way.

Sentiment Quantification
Algorithm 1 shows the steps involved in computing the percentage of positive and negative tweets for each topic in the test dataset (see Section 6.1). Graphs are used to represent the word interaction for each sentiment in the training dataset and for each topic in the test dataset. The algorithm consists of five relevant stages:
1. Preprocess all documents in the dataset. This task removes punctuation symbols and all elements that are not part of the ASCII encoding. Then, all the remaining words are changed to lowercase.
2. Create a graph for each sentiment using the training dataset documents (see Section 4).
3. Create a graph for each topic using the test dataset documents (see Section 4).
4. Compare each topic graph against the sentiment graphs and calculate the similarity score between them (see Section 5).
5. Compare those similarities and take the highest as the base for calculating the quantification score of each sentiment in a topic, considering that the sum of all percentages related to a topic must be equal to one (a SemEval 2016 Task 4, subtask D requirement).
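The preprocessing stage (stage 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `preprocess` is hypothetical, punctuation is taken to mean the ASCII punctuation characters in `string.punctuation`, and collapsing repeated whitespace is an extra convenience step that the paper does not mention.

```python
import string


def preprocess(text: str) -> str:
    """Drop non-ASCII characters and punctuation, then lowercase the text."""
    # Keep only characters that are part of the ASCII encoding.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Delete ASCII punctuation symbols (assumed definition of "punctuation").
    no_punct = ascii_only.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and collapse runs of whitespace (assumed convenience step).
    return " ".join(no_punct.lower().split())


sentence = "Axel Rose needs to just give up. Now. Not later, not soon, not tomorrow."
print(preprocess(sentence))
```

Applied to the example sentence that the paper uses in Section 4, this sketch yields the same preprocessed string reported there. Note that punctuation is deleted rather than replaced by a space, so a contraction such as "don't" would become "dont"; whether the original system behaves this way is not stated.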

Graph Based Representation
Among the different proposals for mapping texts to graphs, the co-occurrence of words (Sonawane and Kulkarni, 2014; Balinsky et al., 2011) has become a simple but effective way to represent the relationship between terms in texts where there is no syntactic order (usually social media texts such as Twitter or SMS). Formally, the proposed co-occurrence graph used in the experiments is represented by G = (V, E), where:
• V = {v_1, ..., v_n} is a finite set of vertices that consists of the words contained in one or several texts.
• E ⊆ V × V is the finite set of edges, where two vertices are connected if their corresponding lexical units co-occur within a window of at most 2 words in the text (at least once). We consider this type of window because it represents the natural relationship of words.
As an example, consider the following sentence ζ extracted from a text T in the dataset: "Axel Rose needs to just give up. Now. Not later, not soon, not tomorrow.", which after the preprocessing stage (see Section 3) becomes: "axel rose needs to just give up now not later not soon not tomorrow". Based on the proposed representation, the preprocessed sentence ζ can be mapped to the co-occurrence graph shown in Figure 1.
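The construction of the co-occurrence graph for the preprocessed sentence ζ can be sketched as follows. The sketch assumes that a window of at most 2 words means that each word is linked to the word immediately following it, that edges are undirected, and that self-loops are skipped; the function name `build_cooccurrence_graph` and the adjacency-set representation are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def build_cooccurrence_graph(texts, window=2):
    """Return an undirected graph as {word: set of neighbouring words}."""
    graph = defaultdict(set)
    for text in texts:
        words = text.split()
        for i, word in enumerate(words):
            graph[word]  # register the vertex even if it ends up isolated
            # Words at positions i+1 .. i+window-1 share a window with `word`.
            for other in words[i + 1 : i + window]:
                if other != word:  # assumption: no self-loops
                    graph[word].add(other)
                    graph[other].add(word)
    return dict(graph)


zeta = "axel rose needs to just give up now not later not soon not tomorrow"
g = build_cooccurrence_graph([zeta])
print(len(g), "vertices;", "neighbours of 'not':", sorted(g["not"]))
```

Because an edge only records that two words co-occur at least once, the repeated word "not" becomes a single vertex connected to "now", "later", "soon" and "tomorrow". Passing several texts to the same call merges them into one graph, which matches the idea of one graph per sentiment or per topic.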

Graph similarity
After having created the graph representation for each topic and sentiment in the dataset, the steps involved in computing the similarity score (Castillo et al., 2015) are shown in Algorithm 2. The algorithm consists of four relevant stages:
1. Obtain all vertices (words) that are shared by the topic graph and the sentiment graph.
2. Apply the Dice similarity measure (Montes et al., 2000; Adamic and Adar, 2003) for each graph, taking as input the shared vertices of the previous step and the graph to be analyzed. The result is a matrix that represents the similarity scores for each pair of input vertices, based on their connection patterns. Formally, Dice similarity calculates the similarity of two vertices (x, y) as twice the number of common neighbors (ngb) divided by the sum of the neighbors of the vertices (see Equation 1).
$$\mathrm{Dice}(x, y) = \frac{2\,|\mathrm{ngb}(x) \cap \mathrm{ngb}(y)|}{|\mathrm{ngb}(x)| + |\mathrm{ngb}(y)|} \qquad (1)$$
3. Obtain the upper triangular values for each matrix and use them to build a vector representation (Manning et al., 2008). The rest of the matrix values are not useful, because the main diagonal represents the similarity of an input vertex with itself and the lower triangle is the same as the upper one.
4. Apply the normalized Euclidean distance (Cancho, 2004) between the vector representing the topic and the vector representing a sentiment. The result is a value in the range of 0 to 1 that indicates how similar the two graphs are. The Euclidean distance of vectors A and B is calculated using Equation 2.
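The four stages of the graph comparison can be sketched as below. Several details are assumptions made only for illustration, since the text does not specify them: graphs are stored as `{word: set of neighbours}`, two isolated vertices get a Dice value of 0, the Euclidean distance is normalized by dividing by the square root of the vector length (each Dice difference is at most 1), and the score is reported as one minus that distance so that larger values mean more similar graphs. All function names are hypothetical.

```python
import math
from itertools import combinations


def dice(graph, x, y):
    """Dice similarity of vertices x and y from their neighbour sets (Equation 1)."""
    nx, ny = graph[x], graph[y]
    total = len(nx) + len(ny)
    # Assumption: two isolated vertices get 0 instead of a division by zero.
    return 2 * len(nx & ny) / total if total else 0.0


def dice_vector(graph, shared):
    """Upper-triangular Dice values (diagonal excluded) for the shared vertices."""
    return [dice(graph, x, y) for x, y in combinations(shared, 2)]


def graph_similarity(topic_graph, sentiment_graph):
    """Score in [0, 1] for two graphs given as {word: set of neighbours}."""
    # Stage 1: vertices present in both graphs, in a fixed order.
    shared = sorted(set(topic_graph) & set(sentiment_graph))
    # Stages 2 and 3: one Dice vector per graph over the same vertex pairs.
    a = dice_vector(topic_graph, shared)
    b = dice_vector(sentiment_graph, shared)
    if not a:
        return 0.0  # assumption: fewer than two shared words gives no evidence
    # Stage 4: Euclidean distance, scaled by its maximum possible value.
    distance = math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    normalized = distance / math.sqrt(len(a))  # assumed normalization
    return 1.0 - normalized  # assumption: larger means more similar


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(graph_similarity(triangle, triangle), graph_similarity(triangle, path))
```

Sorting the shared vertices keeps the two vectors aligned, so that position i in both vectors refers to the same pair of words. Using `combinations` enumerates exactly the entries above the main diagonal of the Dice matrix without building the full matrix.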
[Algorithm 2: Similarity between graphs]

Experimental results
The results obtained with the proposed approach are discussed in this section. First, we describe the dataset used in the experiments and, thereafter, the results obtained.

Dataset
The document collection used in the experiments is a subset of the SemEval 2016 Task 4 corpus (Nakov et al., 2016b), which includes several text documents in English on different topics and genres. The dataset is divided into two groups:
• Training documents: a set of topics, each with a set of known documents. Each document is assigned a label that indicates the polarity of the text (positive or negative).
• Test documents: a set of topics, each with a set of known documents. In this case there is no label that indicates the polarity of the text. These documents are used to test our algorithm, taking into account the writing style samples of the training documents.
Table 1 shows the main dataset features, including the number of documents per topic for the training and test datasets.

Obtained results
Table 2 presents the results obtained with the test dataset of SemEval 2016 Task 4, subtask D. The results were evaluated according to the Kullback-Leibler Divergence (KLD), which measures the error made in estimating a true distribution p over a set C of classes by means of a predicted distribution p̂. Since KLD (Nakov et al., 2016a) is a measure of error, lower values are better (see Equation 3).
Taking into account the obtained results, our approach performed above baseline 1 and slightly below baseline 2. We note that these results were obtained even though the training corpus was very unbalanced (there were more positive texts than others) and there was a large difference between the vocabulary of the topics in the training and test datasets. The proposed algorithm proved to be an effective and relatively fast way (00:02:48 minutes) to obtain the percentage of positive and negative documents, although further experiments with the proposed approach on a test dataset with more topics are necessary. Further analysis of the co-occurrence graph and the similarity measure will allow us to find more accurate features for the sentiment quantification problem.
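The evaluation measure can be illustrated with a short Python sketch of the standard Kullback-Leibler Divergence, KLD(p, p̂) = Σ_{c ∈ C} p(c) · log(p(c) / p̂(c)). The additive smoothing and its parameter `eps` are assumptions added here to avoid zero probabilities; the value used by the task organizers may differ. The function names and the example distributions are made up for illustration and are not results from the paper.

```python
import math


def smooth(dist, eps):
    """Additive smoothing so that no class has probability zero."""
    denom = eps * len(dist) + sum(dist.values())
    return {c: (p + eps) / denom for c, p in dist.items()}


def kld(true_dist, pred_dist, eps=1e-6):
    """KLD of a predicted class distribution from the true one (lower is better)."""
    p = smooth(true_dist, eps)   # eps is an assumed smoothing constant
    q = smooth(pred_dist, eps)
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


# Hypothetical class distributions for one topic.
true_dist = {"positive": 0.7, "negative": 0.3}
close_guess = {"positive": 0.65, "negative": 0.35}
poor_guess = {"positive": 0.2, "negative": 0.8}
print(kld(true_dist, close_guess), kld(true_dist, poor_guess))
```

A perfect prediction gives a divergence of zero, and the divergence grows as the predicted percentages move away from the true ones, which is why lower scores rank higher in this subtask.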

Conclusions
We have presented an approach that uses a graph representation to solve the sentiment quantification problem (Task 4, subtask D). The results obtained show a competitive performance that is above one of the baseline scores. However, improving the techniques for dealing with the quantification problem remains a great challenge, since the texts may be shorter and there are different topics, each one with its own vocabulary. One of the contributions of this paper is that we proposed a graph based representation and a similarity measure for the quantification problem instead of using traditional classification techniques, such as a supervised learning method based on the extraction of stylistic features (Kharde and Sonawane, 2016). As further work we propose the following:
• Use different co-occurrence windows for modeling the text using a graph based representation.
• Experiment with other graph representations for texts that include alternative levels of language description, such as sentence chunks, pragmatic sentences, etc. (Mihalcea and Radev, 2011).
• Propose a similarity measure that uses the semantic information of a graph (Alvarez and Yan, 2011).