A Practical Guide to Sentiment Annotation: Challenges and Solutions

Sentences and tweets are often annotated for sentiment simply by asking respondents to label them as positive, negative, or neutral. This works well for simple expressions of sentiment; however, for many other types of sentences, respondents are unsure of how to annotate, and produce inconsistent labels. In this paper, we outline several types of sentences that are particularly challenging for manual sentiment annotation. Next we propose two annotation schemes that address these challenges, and list benefits and limitations for both.


Introduction
Clear and simple instructions are crucial for obtaining high-quality annotations. This is true even for seemingly simple annotation tasks, such as sentiment annotation, where one is to label instances as positive, negative, or neutral. For word annotations, researchers have often framed the task as 'is this word positive, negative, or neutral?' (Hu and Liu, 2004), 'does this word have associations with positive, negative, or neutral sentiment?' (Mohammad and Turney, 2013), or 'which word is more positive?'/'which word has a greater association with positive sentiment' (Kiritchenko et al., 2016;Kiritchenko and Mohammad, 2016b). Similar instructions are also widely used for sentence-level sentiment annotations-'is this sentence positive, negative, or neutral?' (Rosenthal et al., 2015;Rosenthal et al., 2014;. We will refer to such annotation schemes as the simple sentiment questionnaires. On the one hand, this characterization of the task is simple, terse, and reliant on the intuitions of native speakers of a language (rather than biasing the annotators by providing definitions of what it means to be positive, negative, and neutral). On the other hand, the lack of specification leaves the annotator in doubt over how to label certain kinds of instancesfor example, sentences where one side wins against another, sarcastic sentences, or retweets.
A different approach to sentiment annotation is to ask respondents to identify the target of opinion, and the sentiment towards this target of opinion (Pontiki et al., 2014;Deng and Wiebe, 2014). We will refer to such annotation schemes as the semantic-role based sentiment questionnaires. This approach of sentiment annotation is more specific, and more involved, than the simple sentiment questionnaire approach; however, it too is insufficient for handling several scenarios. Most notably, the emotional state of the speaker is not under the purview of this scheme. Many applications require that statements expressing positive or negative emotional state of the speaker should be marked as 'positive' or 'negative', respectively. Similarly, many applications require statements that describe positive or negative events or situations to be marked as 'positive' or 'negative', respectively. Instructions for annotating opinion towards targets do not specify how such instances are to be annotated, and worse still, possibly imply that such instances are to be labeled as neutral.
In this paper, we present a list of sentence types that are especially challenging for sentiment annotation. Next, we propose two annotation schemes that address these challenges: (1) a simple sentiment annotation questionnaire with more precise annotation directions and some additional label categories; and (2) a semantic-role based questionnaire with additional questions to account for the speaker's emotional state and descriptions of valenced events.
Aspects of annotation that are not specific to sentiment, such as good practices in crowdsourcing, how to aggregate information from multiple annotators, and how to automatically detect and discard poor annotations are beyond the scope of this paper; we refer the readers to Lease (2011), Hsueh et al. (2009, and Mohammad and Turney (2013) for that. Methods for obtaining real-valued sentiment scores are also not covered in this paper; we refer the reader to  and Kiritchenko et al. (2014) for the use of best-worst scaling to obtain reliable real-valued sentiment associations. See Mohammad (2016) for a survey on sentiment and emotion datasets.

Types of Instances that are Difficult to Annotate for Sentiment
There exist several types of sentences that are particularly challenging to annotate for sentiment. Some of the more notable ones are listed below: • Speaker's emotional state: The speaker's emotional state may or may not have the same polarity as the opinion expressed by the speaker. For example, a politician's tweet can imply both a negative opinion about a rival's past indiscretion, and a joyous mental state as the news will impact the rival adversely.
• Success or failure of one side w.r.t. another: Often sentences describe the success or failure of one side w.r.t. another side-for example, 'Yay! France beat Germany 3-1', 'Supreme court judges in favor of gay marriage', and 'the coalition captured the rebels'. If one supports France, gay marriage, and the coalition, then these events are positive, but if one supports Germany, marriage as a union only between man and woman, and the rebels, then these events can be seen as negative. Also note that the framing of an event as the success of one party (or as the failure of another party) does not automatically imply that the speaker is expressing positive (or negative) opinion towards the mentioned party. For example, when Finland beat Russia in ice hockey in the 2014 Sochi Winter Olympics, the event was tweeted around the world predominantly as "Russia lost to Finland" as opposed to "Finland beat Russia". This is not because the speakers were expressing negative opinion towards the Russian team, but rather simply because Russia, being the host nation, was the focus of attention and traditionally Russian hockey teams have been strong.
• Neutral reporting of valenced information: If the speaker does not give any indication of her own emotional state but describes valenced events or situations, then it is unclear whether to consider these statements as neutral unemotional reporting of developments or whether to assume that the speaker is in a negative emotional state (sad, angry, etc.). Example: The war has created millions of refugees.
• Sarcasm and ridicule: Sarcasm and ridicule are tricky from the perspective of assigning a single label of sentiment because they can often indicate positive emotional state of the speaker (pleasure from mocking someone or something) even though they have a negative attitude towards someone or something.
• Different sentiment towards different targets of opinion: The speaker may express opinion about multiple targets, and sentiment towards the different targets might be different. The targets may be different people or objects (for example, an iPhone vs. an android phone), or they may be different aspects of the same entity (for example, quality of service vs. quality of food at a restaurant).
• Precisely determining the target of opinion: Sometimes it is difficult to precisely identify the target of opinion. For example, consider: Glad to see Hillary's lies being exposed.
It is unclear whether the target of opinion is 'Hillary', 'Hillary's lies', or 'Hillary's lies being exposed'. One reasonable interpretation is that positive sentiment is expressed about 'Hillary's lies being exposed'. However, one can also infer that the speaker has a negative attitude towards 'Hillary's lies' and probably 'Hillary' in general.
It is unclear whether annotators should be asked to provide all three opinion-target pairs or only one (in which case, which one?).
• Supplications and requests: Many tweets convey positive supplications to God or positive requests to people in the context of a (usually) negative situation. Examples include: May god help those displaced by war. Let us all come together and say no to fear mongering and divisive politics.
• Rhetorical questions: Rhetorical questions can be treated simply as queries (and thus neutral) or as utterances that give away the emotional state of the speaker. For example, consider: Why do we have to quibble every time?
On the one hand, this tweet can be treated as a neutral question, but on the other hand, it can be seen as negative because the utterance betrays a sense of frustration on the part of the speaker.
• Quoting somebody else or re-tweeting: Quotes and retweets are difficult to annotate for sentiment because it is often unclear and not explicitly evident whether the one who quotes (or retweets) holds the same opinions as that expressed by the quotee.
The challenges listed above can be addressed to varying degrees by providing instructions to the annotators on how such instances are to be labeled. However, detailed and complicated instructions can be counter-productive as the annotators may not understand or may not have the inclination to understand the subtleties involved.

Proposed Annotation Schemes
Two annotation schemes that address many of the challenges laid out above are presented below. The benefits and limitations of both are outlined. The goal here is not to suggest that these are the only ways to annotate for sentiment, but rather to encourage further thought and improved proposals for sentiment annotation (that may or may not be inspired by the questionnaires shown below). Note also that the precise formulation of the sentiment questionnaire should be guided by the specific needs of the application at hand.

A Simple Sentiment Questionnaire
A simple sentiment questionnaire that addresses many of the challenges listed in Section 2 is presented below: PROPOSED SIMPLE SENTIMENT QUESTIONNAIRE What kind of language is the speaker using?
1. the speaker is using positive language, for example, expressions of support, admiration, positive attitude, forgiveness, fostering, success, positive emotional state 2. the speaker is using negative language, for example, expressions of criticism, judgment, negative attitude, questioning validity/competence, failure, negative emotion 3. the speaker is using expressions of sarcasm, ridicule, or mockery 4. the speaker is using positive language in part and negative language in part 5. the speaker is neither using positive language nor using negative language Notes: • A good response to this question is one that most people will agree with. For example, even if you think that sometimes the language can be considered negative, if you think most people will consider the language to be positive, then select the positive language option.
• Agreeing or disagreeing with the speaker's views should not have a bearing on your response. You are to assess the language being used (not the views). For example, given the tweet, 'Evolution makes no sense', the correct answer is 'the speaker is using negative language' since the speaker's words are criticizing or judging negatively something (in this case the theory of evolution). Note that the answer is not contingent on whether you believe in evolution or not.
Benefits and Limitations. This questionnaire groups the speaker's emotional state, speaker's opinion, and description of valenced events all into one category and aims simply to determine the dominant sentiment inferable from the sentence. The phrases 'positive language' and 'negative language' encourage respondents to focus on the language itself as opposed to assigning sentiment based on event outcomes that are beneficial to them. For example, 'Yay! France beat Germany 3-1' will be marked as positive because the speaker is using the positive expression 'Yay!'. The 'Russia lost to Finland' example (described earlier in Section 2), may be difficult to annotate with respect to the opinion of the speaker towards the Russian team, but the framing of the event as a loss is easily identified as negative language. Other instances where one side benefits over another, but the text itself does not use positive or negative language can be labeled as option 5. Sarcasm, ridicule, and mockery are included as a separate option (in addition to option 2) so that respondents do not have to struggle with the decision of whether to mark such instances as positive or negative. Downstream applications can make use of all of these annotations as is appropriate for them: for example, by treating tweets labeled sarcasm and ridicule differently from the other positive and negative sentences, or by marking them as negative.
Instances with different sentiment towards different targets of opinion can be marked with option 4. Supplications and requests that convey a sense of fostering and support can be marked as positive. On the other hand, rhetorical questions that betray a sense of frustration and disappointment can be marked as negative. Thus this simple questionnaire addresses many of the issues raised in the earlier section. However, a limitation of this approach is that it does not produce as nuanced a set of annotations as those that can be obtained from the questionnaire shown ahead. Note, however, that the simplcity of the questionnaire entails low annotation costs and no special educational requirements. Mohammad et al. (2016b) used this questionnaire to annotate a set of tweets for sentiment via crowdsourcing. The dataset, which is also annotated for stance, is made freely available. 1 1 www.saifmohammad.com/WebPages/StanceDataset.htm

A Semantic-Role Based Sentiment Questionnaire
A sentiment questionnaire that includes questions about the target of opinion, as well as additional questions such as those that address the speaker's emotional state, is presented below: PROPOSED SEMANTIC-ROLE BASED SENTIMENT QUESTIONNAIRE Q1. From reading the text, the speaker's emotional state can best be described as: • positive state: there is an explicit or implicit clue in the text suggesting that the speaker is in a positive state, i.e., happy, admiring, relaxed, forgiving, etc.
• negative state: there is an explicit or implicit clue in the text suggesting that the speaker is in a negative state, i.e., sad, angry, anxious, violent, etc.
• both positive and negative, or mixed, feelings: there is an explicit or implicit clue in the text suggesting that the speaker is experiencing both positive and negative feelings • unknown state: there is no explicit or implicit indicator of the speaker's emotional state Q2. From reading the text, identify the entity towards which opinion is being expressed or the entity towards which the speaker's attitude can be determined.
This entity is usually a person, object, company, group of people, or some such entity. We will call this the PRIMARY TARGET OF OPINION (PTO). For example, if the text criticizes certain actions or beliefs of a person (or group of persons), then that person or group is the PTO. If the text mocks people who do not believe in evolution, then the PTO is 'people who do not believe in evolution'. If the text questions or mocks evolution, then the PTO is 'evolution'. If you cannot determine sentiment/attitude of the speaker towards a person, group, or object, but you can identify sentiment/attitude towards an action or event, then consider that action or event as the PTO. If there are more than one targets of opinion, then select that target towards which sentiment is stronger. 177 NOTE: Where possible, copy and paste the primary target of opinion from the text. If the target can be referred to in different ways, for example, Barack Obama, Obama, Obamaaa, #obama, @obama, President, he, etc., copy and paste the snippet from the text showing how the speaker has referred to the target.
Q3. What best describes the speaker's attitude, evaluation, or judgment towards the primary target of opinion (PTO)? If the whole text is a quote from somebody else (original author) and there is no indication of speaker's attitude, then answer below considering the original author as the speaker.
• positive: there is an explicit or implicit clue in the text suggesting that the speaker's attitude or judgment of the PTO is positive (speaker is appreciative, thankful, excited, optimistic, or inspired by the primary entity) • negative: there is an explicit or implicit clue in the text suggesting that the speaker's attitude or judgment of the PTO is negative (speaker is critical, angry, disappointed in, pessimistic, expressing sarcasm about, or mocking the primary entity) • mixed: there is an explicit or implicit clue in the text suggesting that the speaker's attitude or judgment of the PTO is both positive and negative • unknown: there is no explicit or implicit clue indicating that the speaker feels positively or negatively Q4. What best describes the sentimental impact of the primary target of opinion (PTO) on most people?
• positive: the PTO is considered predominantly positive Benefits and Limitations: This detailed questionnaire with questions for the speaker's emotional state, the target of opinion, opinion towards the target, and general opinion (not the speaker's opinion) towards the target provides a rich cross-section of information that can be used by many downstream applications. However, a limitation of this questionnaire is its complexity-annotators (especially on crowdsourcing platforms) might find it difficult to distinguish the subtle difference between Q1, Q3, and Q4. Additionally, even though Q2 is framed with an eye on challenges regarding the identification of the target of opinion, it is difficult to completely address the associated issues. The target may not be explicitly mentioned in the text, it may be mentioned multiple times and in different ways, or it may be referred to via hypernyms, hyponyms, meronyms, and holonyms. It is advisable to first train the annotators on a small set of instances before proceeding to annotate large amounts of data. 178 We outlined several types of sentences that are particularly challenging for manual sentiment annotation. They include sentences describing success (or failure) of one side over another, sentences expressing sarcasm or ridicule, sentences expressing differing sentiment towards multiple entities, supplications, requests, and rhetorical questions. We then presented two different questionnaires that provide clear instructions on how such instances are to be annotated. The first questionnaire, is simple and terse, and thus it is easy and inexpensive to answer. The second questionnaire is markedly more involved, and thus requires more training; however, it provides a plethora of sentiment-related information that can be used in many downstream applications. Apsects of the proposed questionnaires may not be appropriate for all applications. Practitioners are encouraged to tweak the questionnaires as per the needs of the application at hand.