“Haters gonna hate”: challenges for sentiment analysis of Facebook comments in Brazilian Portuguese

The aim of this paper is to present reflections from Discourse Analysis and from Construction Grammar on the creation of a dictionary for sentiment analysis of Facebook comments. The reflections from Discourse Analysis address problems such as the identification of the semantic orientation of words that present opposite polarities depending on the ideological formation of the speaker. Another reflection from Discourse Analysis regards the fact that the writers of the comments use nouns and noun phrases not only to name some entity, but also to build discourse objects in a way that the label they give to the discourse objects reveals an evaluation. In order to analyze constructions larger than words, such as idioms, we draw on Construction Grammar principles. The investigation of constructions and idioms can provide a better understanding of sentiment in text. The corpus consists of comments extracted manually from Facebook public discussion pages related to diverse themes, such as politics, education, religion, music, lifestyle etc.


Introduction
Facebook is one of the websites with the highest data traffic on the internet. In December 2016 it registered 1.86 billion monthly active users (nearly 1 in 4 people worldwide). In Brazil, the ratio is even higher: nearly 55.5% of the population were active Facebook users in November 2016 (Facebook, 2017). Thus, linguists who are interested in investigating language in use have in Facebook a vast research field. Wilson et al.'s (2012) comprehensive review of social science research on Facebook points to the need for social engagement as the main motivation for people to use it. Seidman (2014) investigated the expression of the "true self" on Facebook, which "consists of qualities that an individual currently possesses but does not normally express to others in everyday life". The need for social engagement and the expression of the true self are encompassed within the metafunctions proposed by Halliday (1985). The ideational metafunction regards self-expression, as it concerns the grammatical features used to construct both the speaker's inner experiences and his / her experiences with the surrounding world. The interpersonal metafunction, in turn, is related to the grammatical resources used by the speaker to interact with his / her interlocutors, assuming social roles and roles concerning the communicative situation (social engagement). Whilst the ideational and the interpersonal are extralinguistic metafunctions, the textual metafunction deals with the presentation of interpersonal and ideational content in the form of information that can be shared by the speaker and his / her interlocutors by means of texts. In other words, the texts produced by Facebook users, and their linguistic behavior, deserve to be studied by linguists.
According to Iruskieta et al. (2013), Computational Linguistics depends on discourse-annotated corpora for the creation of automatic applications. The research that resulted in this paper aims to create a dictionary for sentiment analysis by extracting comments from Facebook public pages related to diverse themes, such as politics, education, religion, music, lifestyle etc.
However, as we started analyzing the semantic orientation of the comments, we noticed that the same words used by different people had opposite semantic orientations, as in examples (1) and (2).
(1) Only functionally and politically illiterate people believe in the lies of the right wing.
(2) We, who are right wing, do not have pet politicians.
In (1) "right wing" is considered negative by the writer of the comment, as the comment presupposes that the claims endorsed by the right wing are lies. On the other hand, in (2) the writer of the comment, who declares a right-wing political orientation, suggests that left-wing people have pet politicians, whilst right-wing people do not. Therefore, in (2), "right wing" is evaluated positively. This happens because the comments collected were produced by people with different views on the issues discussed in the pages. We decided to draw attention to this problem, as it certainly creates difficulties for sentiment analysis.
Thus, this paper presents some challenges for the creation of a dictionary for sentiment analysis of Facebook comments in Brazilian Portuguese caused by the different positions assumed by the producers of the comments. Furthermore, it is also a goal of the paper to analyze forms other than nouns, adjectives, verbs and NPs. The investigation of constructions and idioms can provide a better understanding of sentiment in text.
In terms of structure, besides the introduction, this paper is divided into four more sections. In Section 2 we present a short overview of what has been done on sentiment analysis in Linguistics and in NLP. We also introduce in Section 2 some theoretical assumptions from Discourse Analysis in order to face the challenges addressed in the paper. A brief review of Construction Grammar is also presented in Section 2. In Section 3 we present the methodology used in the research, and the discussion of the data is provided in Section 4. The last section of the paper is the Conclusion, followed by the references.

Theoretical background
In this section we provide a general background on what has been done regarding sentiment analysis, together with some contributions from Discourse Analysis which are essential for the challenges discussed in the paper. A brief review of the basic assumptions of Construction Grammar is presented in order to provide a better understanding of the concepts of "construction" and "idiom".

Sentiment analysis
According to Taboada (2016), "sentiment analysis is a growing field at the intersection of linguistics and computer science that attempts to automatically determine the sentiment contained in text". Sentiment is conceived as positive or negative evaluation conveyed by linguistic expression, both lexical and grammatical. Beyond defining sentiment analysis, Taboada (2016) also presents a broad view of the contributions of Linguistics to automatic sentiment analysis.
Two main approaches are used for automatic extraction of sentiments: machine learning and lexicon-based (Taboada, 2016). We will focus on the latter, as it is the type of method we intend to implement in the subsequent stages of the project we are developing.
Among the lexicon-based approaches, Taboada (2016) mentions some dictionaries for sentiment analysis: SentiWordNet (Baccianella et al., 2010) catalogues about 38,000 words regarding their polarity; also based on polarity, Macquarie Semantic Orientation Lexicon (Mohammad et al., 2009) classifies almost 76,000 words; Subjectivity dictionary (Wilson et al., 2009) not only presents the polarity of the words, but also groups them according to their strength (strong positive, weak positive, neutral, weak negative, strong negative); Semantic Orientation Calculator (SO-CAL) (Taboada et al., 2011) stratifies about 5,000 words on a 10-point scale which ranges from -5 to +5.
In Brazilian Portuguese, among many works that deal with sentiment analysis, Sentimeter-Br (Rosa, 2015) is a mechanism for calculating semantic orientation. It is based on a dictionary of words divided according to the area they belong to, e.g. music, technology, beauty, business. The system also implements a mechanism (Enhanced Sentimeter) which uses the user's social media profile to calculate sentiment.
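The lexicon-based approach described above can be illustrated with a minimal sketch. It is not the implementation of SO-CAL, Sentimeter-Br or any other system cited here: the dictionary `TOY_LEXICON`, the scores assigned to its entries and the function names are hypothetical, and averaging the scores of the matched words is only one simple way of aggregating them.

```python
import re

# Hypothetical toy lexicon: the words appear in the examples of this paper,
# but the scores (on a scale from -5 to +5) are invented for illustration.
TOY_LEXICON = {"maravilhoso": 4, "feliz": 3, "odeio": -4, "feio": -3}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def semantic_orientation(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon.

    Returns 0.0 when no word of the text is in the lexicon.
    """
    scores = [lexicon[token] for token in tokenize(text) if token in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

A call such as `semantic_orientation("Odeio esse filme feio")` averages the scores of "odeio" and "feio". Words absent from the dictionary are simply ignored, which is exactly where the problems discussed in the following sections arise.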

Contributions from Discourse Analysis
Regarding the problem presented in the introduction, namely the different evaluations of "right wing" by writers who hold opposite political views, Pêcheux (1975) states that words, expressions, propositions etc. do not have a self-contained meaning. On the contrary, their meanings change according to the positions supported by the speakers who use them, i.e., their ideological formations. In (1) "right wing" has a negative evaluation because the comment was written by a person of left-wing ideological formation. On the other hand, in (2) "right wing" has a positive evaluation because the writer of the comment belongs to a right-wing ideological formation. Thus, ideological formation is an important feature to be taken into account in order to identify sentiment towards propositions.
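One possible way of representing this dependence in a dictionary is to index each entry by both the expression and the ideological formation of the writer. The sketch below is only an assumption about how such a resource could be organized: the labels "left" and "right", the name `STANCE_LEXICON` and the values -1 and +1 are hypothetical, and the two entries simply encode the evaluations observed in examples (1) and (2).

```python
# Hypothetical stance-conditioned lexicon: the key is the pair
# (expression, ideological formation of the writer).
STANCE_LEXICON = {
    ("right wing", "left"): -1,   # as in example (1)
    ("right wing", "right"): 1,   # as in example (2)
}


def polarity(expression, formation, lexicon=STANCE_LEXICON):
    """Return the polarity of an expression for a given ideological formation.

    Returns None when the pair is not in the lexicon, so that a missing
    entry is not confused with a neutral evaluation.
    """
    return lexicon.get((expression.lower(), formation))
```

Such a design presupposes that the ideological formation of the writer is known, for instance from the page where the comment was posted; identifying it automatically is a separate problem.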
Another important reflection from Discourse Analysis regards the difference between reference and référenciation proposed by Mondada and Dubois (1995). (We use the French word "référenciation", as Mondada and Dubois [1995] do, because there is no such word in English.) When one makes reference, he / she names in an objective way anything that is in the world (designatum, according to Lyons [1977]). On the other hand, in the référenciation process, the speaker builds discourse objects in a way that they can be categorized and recategorized. In (3) the writer of the comment, which was taken from a left-wing Facebook page, uses the NP "communist doctrinators" in order to show how a group of right-wing people refers to teachers. Obviously, this is not mere naming. If it were, the group which is criticized in the comment would use the noun "teachers". Actually, the NP "communist doctrinators" reveals the attitude of the right-wing group towards the discourse object "teachers". The external world class "teachers" is not affected by the way the right-wing group refers to it, as "communist doctrinators" is a discourse object.
(3) For this reason, for this to work, it is necessary to demonize the class of the teachers as "communist doctrinators", i.e., two words that right wing loves to use.
In other words, in the view of référenciation, nouns and noun phrases do not only name a designatum, they also present an evaluation of the discourse object.

Constructions and idioms
Beyond nouns, adjectives, verbs and NPs, other forms should be investigated for a better understanding of sentiment in text. Constructions and idioms are widely used by speakers not only in face-to-face interactions, but also on social media.
The term "construction grammar" (CG henceforth) refers to a group of distinct frameworks which share some tenets, summarized by Goldberg (2013) as follows:
i. Constructions are the basic units of grammar;
ii. Semantic structure is associated directly with syntactic structure without transformations or derivations;
iii. Constructions form a network in which nodes are related by inheritance links;
iv. Cross-linguistic variation can be explained in terms of domain-general cognitive processes or by the functions of the constructions involved;
v. Items and generalizations are part of the knowledge of language (this last tenet is shared by most, but not all approaches).
Unlike Generative Grammar, CG considers grammar in a holistic way, i.e., no grammatical level is considered core or autonomous; a construction is formed by simultaneous work of phonology, morphosyntax, semantics and pragmatics (Traugott and Trousdale, 2013).

Methodology
The first step of the methodology was to collect comments from public Facebook pages which discuss issues such as politics, education, religion, music, lifestyle etc. Nearly 1,000 comments were collected and segmented into EDUs (elementary discourse units), which were classified as either subjective (presenting an evaluation) or objective (not presenting an evaluation). The latter were eliminated from the corpus.
The remaining 649 EDUs were classified manually by two annotators as positive, negative or neutral, regarding the evaluation of a discourse object. The words and expressions responsible for the evaluation were extracted manually and divided according to their formal classes.
In Table 1 we present the number of words per class. Other features such as intensifiers, adverbs, interjections, signals of irony, laughing etc. were also annotated but will not be discussed here, as they do not bear directly on the main issue discussed in this paper. It is important to remark that words were counted only once, even if they were used more than once in the corpus. Thus, the quantity in Table 1 refers to the number of distinct words found and not to the number of times they were used.
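The counting criterion described above (each word counted once per class, regardless of how often it occurs) can be sketched as follows. The extraction in this research was done manually; the function below is only an illustration, and the input format (pairs of word and class label) and the class names are assumptions.

```python
from collections import defaultdict


def count_types_per_class(annotations):
    """Count distinct words per formal class.

    `annotations` is an iterable of (word, formal_class) pairs, one pair for
    each occurrence extracted from the corpus. Repeated occurrences of the
    same word in the same class are counted only once.
    """
    words_by_class = defaultdict(set)
    for word, formal_class in annotations:
        words_by_class[formal_class].add(word.lower())
    return {cls: len(words) for cls, words in words_by_class.items()}
```

For instance, two occurrences of "bando" annotated as a noun contribute a single unit to the noun count.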

Discussion and analysis
In order to stress the importance of taking constructions into account in sentiment analysis, this section is divided into two subsections: one for commonly investigated classes and forms such as nouns, adjectives, verbs and NPs, and one for constructions and idioms.
(4) School is a children warehouse, ONLY, actually nobody cares about what is taught, they are only interested in having a place to leave "God's gifts" while they are working. That's why parents get so mad when there is a teacher's strike.
In example (4) the writer of the comment quotes the opinion of people in general about Brazilian public schools, which are considered "depósito de criança" ("children warehouse"), i.e., a place where people leave their kids when they go to work. NP "depósito de criança" creates a discourse object which reveals a general conception about the designatum school.
Although NP "presentes de Deus" ("God's gifts") carries nouns of positive semantic orientation, it is written in quotation marks, in a sarcastic way, which results in a negative evaluation of that discourse object. In other words, the writer of the comment means that there are moments when children are a nuisance to parents, who are not interested in their education, but only in having a place to leave them while they are at work.
Depreciatory collective nouns reveal the speaker's negative evaluation of a discourse object, especially when the referent is human (Neves, 2000). This is the case of the noun "bando" (which could be roughly translated into English as "gang"). In example (5), with scope over the adjective "demente" ("demented" in English), it negatively evaluates a group of people. In example (6), it is used to convey a negative evaluation of a group of "machos" ("chauvinists" in English), who are also strongly qualified in a negative manner as "asquerosos" ("loathful").
(5) It is only the reflex of how we became a gang of demented.
(6) Gang of loathful chauvinists!
However, negative characteristics may be embraced by a group and transformed into a positive evaluation. That is what happens with the expression "gang of crazy people" in example (7), used by the supporters of the Brazilian football team Sport Club Corinthians as a motivational chant.
(7) Here there is a gang of crazy people, crazy for you, Corinthians.
The comparison between examples (5) and (7) shows similarities (both NPs have the noun "gang" with scope over an adjective from the semantic field of insanity) and also differences (in example (5) the evaluation is negative, while in example (7) it is positive). To explain the differences, we return to Pêcheux (1975), for whom the meanings of words change according to the positions supported by the speakers who use them.
Nouns may lose their referential function in order to express quality (Neves, 2000) and, in such use, they can convey negative or positive evaluation. In example (8), noun "massa" ("mass") is not used to refer to "matter with no definite shape", but to qualify noun "página" ("page") as "cool".
(8) A page that seems to be cool shows up.
The same happens to noun "show" in example (9). Instead of naming a spectacle, it classifies the "debate" as "amazing, spectacular".
(9) The debate was amazing.
In example (10) noun "shit" qualifies noun "time" ("team") in an extremely pejorative way. In examples (11) and (12), despite the change of position in the NP, the negative evaluation assigned to noun "filme" ("film") remains the same. The possibility of such positional change is a particular characteristic of the grammar of Brazilian Portuguese. As can be noticed, the translation into English is not even possible.
*What a shit team.
Adjectives have been widely investigated in sentiment analysis due to their nature. According to Taboada (2016), "adjectives convey much of the subjective content in a text". However, not all adjectives can be used to evaluate. Classifier adjectives only subcategorize the nouns that they modify (Neves, 2000). As a result, they are not suitable for subjective evaluation, as in example (13), in which adjective "sexual" only specifies the type of "option" ("opção") mentioned by the writer of the comment.
(13) O que que a gente tem a ver com a opção sexual do outro?
What do we have to do with other people's sexual options?
On the other hand, qualifier adjectives, in a predication process, attribute properties to the nouns they modify. In example (14), the adjective assigns the quality "feliz" ("happy") to anyone who fulfills the conditions presented in the subject clause.
(14) Happy is the one who finds happiness in the small gestures.
(15) … se vier me perguntar se tu é feio…
… if you ask me if you are ugly…
(16) … gente muito louca, desequilibrada e insuportável…
… very crazy, unbalanced and unbearable people…
The semantic orientation of the NP is usually given by the adjective. In example (17), adjective "maravilhoso" ("wonderful") is responsible for the positive semantic orientation of the NP, whilst in example (18) adjective "fake" is responsible for the negative semantic orientation of the NP.
Verbs occupy the central role of a predication (Ilari and Basso, 2008). Thus, the correct identification of the semantic orientation of a clause depends to a great extent on the verb. Some verbs convey negative ("odeio", "hate") or positive ("adooooro", "loooove"; "prefiro", "prefer") evaluation by their basic meaning, as in examples (19) and (20). On the other hand, the evaluation conveyed by other verbs can only be identified by the analysis of their arguments. In examples (21) and (22) verb "merecer" ("to deserve") can point to a negative or to a positive evaluation, depending on the semantic orientation of its second argument (A2). In (21) the semantic orientation is positive ("compliments"), while in (22) it is negative ("jail").
(21) I follow and I will keep on following your work, which deserves, yes, compliments in many aspects.
(22) She deserves jail.
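The contrast between verbs that evaluate by their basic meaning and verbs such as "merecer", whose evaluation depends on the second argument, can be sketched as below. This is a hedged illustration rather than a proposed algorithm: the three dictionaries, their names and the values -1, 0 and +1 are hypothetical, and the second argument is assumed to have been identified beforehand (the English glosses "compliments" and "jail" stand in for the Portuguese arguments).

```python
# Hypothetical lexicons built from the forms discussed in the text.
VERB_LEXICON = {"odeio": -1, "prefiro": 1}          # evaluation in the verb itself
INHERITING_VERBS = {"merecer"}                       # evaluation comes from A2
ARGUMENT_LEXICON = {"compliments": 1, "jail": -1}    # glosses of the A2 in (21) and (22)


def clause_polarity(verb, second_argument=None):
    """Return the polarity of a clause from its verb and, if needed, its A2.

    For verbs listed in INHERITING_VERBS the polarity of the second argument
    is returned; for other verbs the polarity stored for the verb is used.
    Unknown verbs or arguments yield 0 (no evaluation identified).
    """
    if verb in INHERITING_VERBS:
        return ARGUMENT_LEXICON.get(second_argument, 0)
    return VERB_LEXICON.get(verb, 0)
```

With these entries, `clause_polarity("merecer", "compliments")` and `clause_polarity("merecer", "jail")` return opposite values for the same verb, mirroring examples (21) and (22).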
However, there are constructions in which objects cannot be considered arguments of the verb. Neves (2002), following Ashby and Bentivoglio (1993), calls them "constructions with support verbs". In such constructions, the object NP forms a predicate with the verb and thus the construction must be analyzed as a whole. In Brazilian Portuguese, the most productive verbs in such constructions are "dar" ("to give") and "fazer" ("to do", "to make"). In example (23), the constructions with support verb "dar" have opposite polarities, and the semantic orientation of the comment must be determined by discourse structure (Taboada, 2016). Conjunction "e" ("and") is usually associated with the idea of additive parataxis. However, it has other uses in Brazilian Portuguese, such as signaling contrast (Camacho, 1999), which is the case in example (23).
(23) It had everything to go wrong and it went right.
Neves (2002) presents some reasons why speakers use support verbs, such as achieving greater communicative adequacy, greater semantic precision and greater syntactic versatility. In example (24), the construction with support verb provides more syntactic versatility by the determination of reflexive possession.
(24) Os pais nunca se dão conta disso não, eles acham q o professor não faz mais do que a obrigação deles... e se quiser ganhar mais trabalhe mais
Parents do not realize this, they think that teachers don't do more than their obligation… and if teachers want to earn more, they have to work more.
(25) Por mais tatuadores como esse…
For more tattoo artists like this…
In example (26), the construction conveys negative evaluation. In the first six clauses, the writer of the comment criticizes characteristics and actions of a politician. In the seventh clause, he / she presents the action which he / she considers to be the worst of all. In terms of argumentation, this construction saves the strongest argument for last and can be represented as Não basta PREDICATION
In idioms, words do not have independent meanings and the idiosyncrasy (meaning cannot be predicted from form) must be stored in the speakers' long-term memory (Jackendoff, 2013). In example (27), idiom "chutar cachorro morto" (literal translation: "to kick a dead dog") in Brazilian Portuguese means being aggressive or even doing harm to somebody who is not able to defend him / herself and is insignificant to society. The idiom is used by the writer of the comment to convey a view of society towards teachers which is becoming common: teachers are not important.
(27) Messing with teachers is like kicking a dead dog, nobody cares.
In example (28), idiom "beijinho no ombro", literally translated as "little kiss over the shoulder", has become one of the most widely used idioms in Brazil after it was used as the chorus and the title of a song by a popular Brazilian funk singer. A quick search on Google, for instance, leads to more than two million results. As happens with idiom "chutar cachorro morto", its meaning cannot be predicted from the form, as it expresses a gesture of superiority over envious people, negative people who only criticize, haters etc. In the context of the example, the writer of the comment uses the idiom "beijinho no ombro" in order to show that he / she does not care about people's criticism of the Facebook page he / she follows. He / she sends those people a "beijinho no ombro" to show that he / she is superior to their negative and critical behaviour.
(28) I don't follow pages of schools, the only one I identified myself with was this one exactly because it shows my fears, my yearnings as a teacher. If they are not happy with it, I'm sorry. Little kiss over the shoulder.
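Since the meaning of an idiom cannot be predicted from its parts, a dictionary-based system would need to recognize idioms as single units before looking up individual words. The sketch below illustrates one way of doing this, by greedy longest-match segmentation over a list of tokens. The set `IDIOMS` and the function name are hypothetical, and the matching is on exact word forms, so inflected variants of an idiom would not be recognized without lemmatization.

```python
# Hypothetical inventory of idioms, each stored as a tuple of tokens.
IDIOMS = {
    ("chutar", "cachorro", "morto"),
    ("beijinho", "no", "ombro"),
}


def segment(tokens, idioms=IDIOMS):
    """Group tokens into units, merging any listed idiom into a single unit.

    At each position the longest matching idiom is preferred; tokens that do
    not start an idiom are kept as one-word units.
    """
    max_len = max((len(idiom) for idiom in idioms), default=0)
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word sequences.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in idioms:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            units.append(tokens[i])
            i += 1
    return units
```

After segmentation, the unit "beijinho no ombro" can receive its own dictionary entry instead of being scored through the unrelated entries for "beijinho" and "ombro".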

Conclusion and future work
This paper aimed at discussing some challenges in the creation of a sentiment analysis dictionary for Facebook comments in Brazilian Portuguese. The analysis of the corpus showed that the same words used by different people may have opposite semantic orientations. We also noticed that the writers of the comments use nouns and noun phrases not only to name some entity, but also to build discourse objects in a way that the label they give to the discourse objects reveals an evaluation. We proposed reflections on such problems within the Discourse Analysis framework, mainly Pêcheux (1975) and Mondada and Dubois (1995).
Besides taking into account reflections from Discourse Analysis, another suggestion of the paper is to use assumptions from Construction Grammar to analyze constructions and idioms rather than only nouns, adjectives, NPs, verbs etc. The investigation of constructions and idioms can provide a better understanding of sentiment in text.
In future work, we intend to expand the dictionary and create a test corpus in order to develop algorithms for the automatic evaluation of sentiment in Facebook comments in Brazilian Portuguese.