The Effect of Negators, Modals, and Degree Adverbs on Sentiment Composition

Negators, modals, and degree adverbs can significantly affect the sentiment of the words they modify. Often, their impact is modeled with simple heuristics, although recent work has shown that such heuristics do not capture the true sentiment of multi-word phrases. We created a dataset of phrases that include various negators, modals, and degree adverbs, as well as their combinations. Both the phrases and their constituent content words were annotated with real-valued scores of sentiment association. Using the phrasal terms in the created dataset, we analyze the impact of individual modifiers and the average effect of the groups of modifiers on overall sentiment. We find that the effect of modifiers varies substantially among the members of the same group. Furthermore, each individual modifier can affect sentiment words in different ways. Therefore, solutions based on statistical learning seem more promising than fixed hand-crafted rules for the task of automatic sentiment prediction.


Introduction
Sentiment associations are commonly captured in sentiment lexicons: lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). They are mostly used in sentiment analysis (Pontiki et al., 2014; Rosenthal et al., 2015), but are also beneficial in stance detection (Mohammad et al., 2016a), literary analysis (Hartner, 2013; Kleres, 2011; Mohammad, 2012), detecting personality traits (Grijalva et al., 2015), and other applications. Manually created sentiment lexicons are especially useful because they tend to be more accurate than automatically generated ones; they can be used to automatically generate large high-coverage lexicons (Tang et al., 2014; Esuli and Sebastiani, 2006); they can be used to evaluate different methods of automatically creating sentiment lexicons; and they can be used for linguistic analysis such as examining how modifiers (negators, modals, degree adverbs, etc.) impact overall sentiment. However, most existing manually created sentiment lexicons provide only lists of positive and negative words with very coarse levels of sentiment (Stone et al., 1966; Wilson et al., 2005; Mohammad and Turney, 2013). These coarse-grained distinctions may be less useful in downstream applications than fine-grained (real-valued) sentiment association scores.
Manually created sentiment lexicons usually include only single words. Yet, the sentiment of a phrase can differ markedly from the sentiment of its constituent words. Sentiment composition is the determination of the sentiment of a multi-word linguistic unit, such as a phrase or a sentence, from its constituents. Lexicons that include sentiment associations for phrases as well as for their constituents are useful in studying sentiment composition. We refer to them as sentiment composition lexicons (SCLs).
We created a sentiment composition lexicon for phrases formed with negators (such as no and cannot), modals (such as would have been and could), degree adverbs (such as quite and less), and their combinations. Both the phrases and their constituent content words were manually annotated with real-valued scores of sentiment association using a technique known as Best-Worst Scaling, which provides reliable annotations. We refer to the resulting lexicon as Sentiment Composition Lexicon for Negators, Modals, and Degree Adverbs (SCL-NMA). The lexicon is also known as SemEval-2016 General English Sentiment Modifiers Lexicon.
We calculate the minimum difference in sentiment scores of two terms that is perceptible to native speakers of a language. For sentiment scores between -1 and 1, we show that the perceptible difference is about 0.07 for English speakers. Knowing the least perceptible difference helps interpret the impact of sentiment composition. For example, we can determine whether a modifier significantly impacts the sentiment of the word it composes with by calculating the difference in sentiment scores between the combined phrase and the constituent, and checking whether this difference is greater than the least perceptible difference.
We use the phrasal terms in the created lexicon to analyze the impact of common modifiers on the sentiment of the terms they modify. We measure the effect of individual modifiers as well as the average effect of the groups of modifiers on overall sentiment. We show that the sentiment of a negated expression (such as not w) on the [-1,1] scale is on average 0.926 points less than the sentiment of the modified term w, if w is positive. However, the sentiment of the negated expression is on average 0.791 points higher than that of w, if w is negative. Similar analysis for modals and degree adverbs shows that they impact sentiment less dramatically than negators. Furthermore, the impact of modifiers varies substantially even within a group, e.g., the average change in sentiment score brought by the negator 'will not be' is 0.41 larger than the change introduced by the negator 'never'. Likewise, each individual modifier can affect sentiment words in different ways. As a result, in automatic sentiment prediction, solutions based on statistical learning seem more promising than fixed hand-crafted rules.
In related work (not described here), we also created a sentiment composition lexicon for another challenging category of phrases: phrases that include at least one positive word and at least one negative word. We call such phrases opposing polarity phrases. Both lexicons have been used as official test sets in SemEval-2016 Task 7 'Determining Sentiment Intensity of English and Arabic Phrases' (Kiritchenko et al., 2016). The lexicons are made freely available to the research community.

Related Work
Sentiment Lexicons: There exist a number of manually created lexicons that provide lists of positive and negative words, for example, General Inquirer (Stone et al., 1966), Hu and Liu Lexicon (Hu and Liu, 2004), and NRC Emotion Lexicon (Mohammad and Turney, 2013). Only a few manually created lexicons provide real-valued scores of sentiment association (Bradley and Lang, 1999; Warriner et al., 2013; Dodds et al., 2011). None of these lexicons, however, contain multi-word phrases. Manually created sentiment lexicons can be used to automatically generate larger sentiment lexicons using semi-supervised techniques (Esuli and Sebastiani, 2006; Turney and Littman, 2003; De Melo and Bansal, 2013; Tang et al., 2014). (See Mohammad (2016) for a survey on manually created and automatically generated affect resources.) Automatically generated lexicons often have real-valued sentiment association scores, are larger in scale, and can easily be collected for a specific domain; therefore, they were found to be more beneficial in downstream applications, such as sentence-level sentiment prediction. However, any analysis of the relationship between the sentiment of a phrase and its constituents is less reliable when made from an automatically generated resource than when made from a manually created resource (as automatically generated resources are less accurate). In this work, we provide an extensive analysis of the impact of different modifiers on sentiment based on reliable fine-grained manual annotations.
Contextual Valence Shifters: Negators, modals, and degree adverbs impact the sentiment of the word or phrase they modify and are commonly referred to as contextual valence shifters (Polanyi and Zaenen, 2004; Kennedy and Inkpen, 2005; Jia et al., 2009; Wiegand et al., 2010; Lapponi et al., 2012). Conventionally, the impact of contextual valence shifters is captured by simple heuristics. For example, negation is often handled by reversing the polarities of the sentiment words in the scope of negation (Polanyi and Zaenen, 2004; Kennedy and Inkpen, 2005; Choi and Cardie, 2008) or by shifting the sentiment score of a term in a negated context towards the opposite polarity by a fixed amount (Taboada et al., 2011). However, such heuristics do not adequately capture the true sentiment of multi-word expressions. Liu and Seneff (2009) relax the assumption of a fixed shifting margin and estimate these margins for each modifier separately from data. Other work, on the other hand, estimates the impact of negation on each individual sentiment word through a corpus-based statistical method. Ruppenhofer et al. (2015) automatically rank English adverbs by their intensifying or diminishing effect on adjectives using ratings metadata from product reviews.
Annotation Techniques: A widely used method of annotation for obtaining numerical scores is the rating scale method, where one is asked to rate an item on a five-, ten-, or hundred-point scale. While easy to understand, rating items on a scale is not natural for people. It is hard for annotators to remain consistent when annotating a large number of items. Also, respondents often use just a limited part of the scale, reducing the discrimination among items (Cohen, 2003). To obtain reliable annotations, the rating scale methods require a high number of responses, typically 15 to 20 (Warriner et al., 2013; Graham et al., 2015). A more natural annotation task for humans is to compare items (e.g., whether one word is more positive than the other). Most commonly, the items are compared in pairs (Thurstone, 1927; David, 1963). In this work, we use Best-Worst Scaling, a technique that exploits the comparative approach to annotation while keeping the number of required annotations small (Section 3.2). It has been shown to produce reliable annotations of terms by sentiment (Kiritchenko and Mohammad, 2016a).

Creating SCL-NMA
We now describe the term selection process and the Best-Worst Scaling annotation technique used to create the Sentiment Composition Lexicon for Negators, Modals, and Degree Adverbs. Table 1 shows a few example entries from the lexicon. We also describe how we calculated the minimum difference in sentiment scores of two terms that is perceptible to native speakers of a language.

Term Selection
General Inquirer (Stone et al., 1966) provides a list of 1,621 positive and negative words from Osgood's seminal study on word meaning (Osgood et al., 1957). These are words commonly used in everyday English. We include all of these words. In addition, we include 1,586 high-frequency phrases formed by the Osgood words in combination with simple negators such as no, don't, and never, modals such as can, might, and should, or degree adverbs such as very and fairly. The eligible adverbs are chosen manually from adverbs frequently occurring in the British National Corpus (BNC). Each phrase includes at least one modal, one negator, or one adverb; a phrase can include several modifiers (e.g., would be very happy). The modifiers and the phrases are chosen in such a way that the full set includes several phrases for each Osgood sentiment word and includes several phrases for each modifier. In total, sixty-four different (single or multi-word) modifiers are selected. The final list contains 3,207 terms.

Best-Worst Scaling
Best-Worst Scaling (BWS), also sometimes referred to as Maximum Difference Scaling (MaxDiff), is an annotation scheme that exploits the comparative approach to annotation (Louviere and Woodworth, 1990; Cohen, 2003; Louviere et al., 2015). Annotators are given four items (a 4-tuple) and asked which item is the Best (highest in terms of the property of interest) and which is the Worst (lowest in terms of the property of interest). These annotations can then be converted into real-valued scores, and also into a ranking of items by their association with the property of interest, through a simple counting procedure: for each item, its score is calculated as the proportion of times the item was chosen as the Best minus the proportion of times the item was chosen as the Worst (Orme, 2009; Flynn and Marley, 2014). The scores range from -1 to 1. Further details on Best-Worst Scaling and its application to the task of sentiment annotation can be found in Kiritchenko and Mohammad (2016a).
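The counting procedure can be illustrated with a minimal Python sketch. The function name `bws_scores`, the input layout (one `(items, best, worst)` triple per annotation), and the toy annotations are assumptions made for illustration; they are not taken from the lexicon or from any released tooling.

```python
from collections import Counter

def bws_scores(annotations):
    """Convert Best-Worst annotations into scores in [-1, 1].

    `annotations` is an iterable of (items, best, worst) triples, where
    `items` is the 4-tuple shown to the annotator (assumed layout).
    """
    appearances = Counter()
    best_counts = Counter()
    worst_counts = Counter()
    for items, best, worst in annotations:
        appearances.update(items)      # how often each term was shown
        best_counts[best] += 1         # times chosen as most positive
        worst_counts[worst] += 1       # times chosen as most negative
    # score = share of "Best" choices minus share of "Worst" choices
    return {
        term: (best_counts[term] - worst_counts[term]) / appearances[term]
        for term in appearances
    }

# Toy annotations, made up for illustration only.
annotations = [
    (("happy", "not happy", "sad", "very sad"), "happy", "very sad"),
    (("happy", "not happy", "sad", "okay"), "happy", "sad"),
    (("okay", "not happy", "sad", "very sad"), "okay", "very sad"),
]
scores = bws_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

A term that is always chosen as Best receives a score of 1, a term that is always chosen as Worst receives -1, and a term that is never selected either way receives 0.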

Annotation process
The complete list of 3,207 terms was randomly sampled (with replacement) to create 6,414 (2 x 3,207) 4-tuples that satisfy the following criteria: 1. no two 4-tuples have the same four terms; 2. no two terms within a 4-tuple are identical; 3. each term in the term list appears approximately in the same number of 4-tuples; 4. each pair of terms appears approximately in the same number of 4-tuples.
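One way to obtain tuples with these properties is sketched below. This is only an illustration under stated assumptions: the helper `make_tuples` and its parameters are hypothetical, and repeated shuffling with rejection is swapped in for the actual sampling procedure, whose details are not given here. Criteria 1 to 3 are enforced directly; criterion 4 is only approximated through randomness.

```python
import random
from collections import Counter

def make_tuples(terms, per_term=8, size=4, seed=0, max_tries=1000):
    """Build random tuples in which every term appears `per_term` times.

    Hypothetical helper. Assumes `terms` holds unique strings and that
    len(terms) * per_term is divisible by `size` (true for 8 and 4).
    """
    rng = random.Random(seed)
    for _attempt in range(max_tries):
        pool = []
        for _round in range(per_term):
            # One shuffle contributes each term exactly once (criterion 3).
            pool.extend(rng.sample(terms, len(terms)))
        usable = len(pool) - len(pool) % size  # drop a trailing partial tuple
        tuples = [tuple(pool[i:i + size]) for i in range(0, usable, size)]
        # Criterion 2: no term occurs twice within a tuple.
        no_repeats = all(len(set(t)) == size for t in tuples)
        # Criterion 1: no two tuples contain the same four terms.
        all_distinct = len({frozenset(t) for t in tuples}) == len(tuples)
        if no_repeats and all_distinct:
            return tuples
    raise RuntimeError("no valid tuple set found; try another seed")

terms = [f"term{i}" for i in range(12)]  # toy term list
tuples = make_tuples(terms)              # 2 x 12 = 24 tuples
appearances = Counter(term for t in tuples for term in t)
```

With eight appearances per term and four terms per tuple, the number of tuples is twice the number of terms, which matches the 2 x 3,207 = 6,414 figure above.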
Next, the set of 4-tuples was annotated through a crowdsourcing platform, CrowdFlower. The annotators were presented with four terms (single words and multi-word phrases) at a time, and asked which term is the most positive (or least negative) and which is the most negative (or least positive). (The full set of instructions to annotators is available at http://www.saifmohammad.com/WebPages/SCL.html#NMA.) Each 4-tuple was annotated by ten respondents. We determined the accuracy of every annotator on a small set of check questions labeled by the authors of this paper. We discarded all annotations provided by an annotator if their accuracy on these check questions was less than 70%.

Quality of Annotations
Let majority answer refer to the option chosen most often for a question. 80% of the responses to the Best-Worst questions matched the majority answer.
We also tested the reliability of the aggregated scores by randomly dividing the sets of ten responses to each question into two halves and comparing the rankings obtained from these two groups of responses. The Spearman rank correlation coefficient between the two sets of rankings was found to be 0.98. (The Pearson correlation coefficient between the two sets of sentiment scores was also 0.98.) Thus, even though annotators might disagree about answers to individual questions, the aggregated scores produced by applying the counting procedure on the Best-Worst annotations are remarkably reliable at ranking terms by sentiment.
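The split-half check can be sketched as follows, assuming the responses to each question are held in a list and that each half has already been converted to scores with the counting procedure. The helper names and the toy score lists are illustrative assumptions; the Spearman coefficient is computed here as the Pearson correlation of average ranks.

```python
import random

def split_responses(responses, rng):
    """Randomly divide the responses to one question into two halves."""
    shuffled = responses[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy scores for the same four terms from the two halves (made up).
half_a = [0.90, 0.10, -0.50, -0.80]
half_b = [0.80, 0.20, -0.40, -0.90]
rho = spearman(half_a, half_b)   # both halves rank the terms identically
r = pearson(half_a, half_b)
```

In the toy example the two halves order the terms identically, so the rank correlation is 1 up to floating-point rounding even though the scores themselves differ.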

Least Perceptible Difference in Sentiment
In psychophysics, there is a notion of least perceptible difference (also known as just-noticeable difference): the amount by which something that can be measured (e.g., weight or sound intensity) needs to be changed in order for the difference to be noticeable by a human (Fechner, 1966). Analogously, we can measure the least perceptible difference in sentiment. If two words have close to identical sentiment associations, then it is expected that native speakers will choose each of the words about the same number of times when forced to pick a word that is more positive. However, as the difference in sentiment starts getting larger, the frequency with which the two terms are chosen as most positive begins to diverge. At one point, the frequencies diverge so much that we can say with high confidence that the two terms do not have the same sentiment associations. The average of this minimum difference in sentiment score is the least perceptible difference for sentiment.
To calculate the least perceptible difference, we first build a plot of the relationship between 'difference in the sentiment scores between two terms' and 'agreement among annotators' when asked which term is more positive. For each term pair w1 and w2 such that d = score(w1) − score(w2) ≥ 0, we count the number of Best-Worst annotations from which we can infer that w1 is more positive than w2 and divide this number by the total number of annotations from which we can infer either that w1 is more positive than w2 or that w2 is more positive than w1. (We can infer that w1 is more positive than w2 if in a 4-tuple that has both w1 and w2 the annotator selected w1 as the most positive or w2 as the least positive. The case for w2 being more positive than w1 is similar.) This ratio is the human agreement for w1 being more positive than w2. To get more reliable estimates, we average the human agreement for all pairs of terms whose sentiment differs by d ± 0.01. Figure 1 shows the resulting average human agreement. The thin blue line in the figure depicts the 99.9%-confidence lower bounds on the agreement.
Figure 1: Human agreement on annotating term w1 as more positive than term w2 for pairs with difference in scores d = score(w1) − score(w2). The x-axis represents d. The y-axis plots the average percentage of human annotations that judge term w1 as more positive than term w2 (thick line) and the corresponding 99.9%-confidence lower bound (thin blue line).
The least perceptible difference is the point starting at which the lower bound consistently exceeds the 50% threshold (i.e., the point starting at which we observe with 99.9% confidence that the human agreement is higher than chance). The least perceptible difference when calculated from SCL-NMA is 0.069. In the next section, we use the least perceptible difference to determine whether a modifier significantly impacts the sentiment of the word it composes with.
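The inference of pairwise preferences and the final thresholding step can be sketched as below. The function names, the annotation layout, and all numbers in the example are assumptions for illustration; in particular, the sketch takes the confidence lower bounds as given, since the way they are computed is not specified here.

```python
from collections import Counter
from itertools import permutations

def pairwise_wins(annotations):
    """Count how often a is inferred to be more positive than b.

    Each annotation is an (items, best, worst) triple (assumed layout).
    a > b is inferred when a was picked as Best or b was picked as Worst.
    """
    wins = Counter()
    for items, best, worst in annotations:
        for a, b in permutations(items, 2):
            if a == best or b == worst:
                wins[(a, b)] += 1
    return wins

def agreement(wins, w1, w2):
    """Share of informative annotations that judge w1 more positive than w2."""
    total = wins[(w1, w2)] + wins[(w2, w1)]
    return wins[(w1, w2)] / total if total else None

def least_perceptible_difference(lower_bounds, threshold=0.5):
    """Smallest d from which the lower bound stays above `threshold`.

    `lower_bounds` maps a score difference d to the confidence lower
    bound on the averaged agreement at that d.
    """
    lpd = None
    for d in sorted(lower_bounds, reverse=True):
        if lower_bounds[d] <= threshold:
            break          # the bound is not consistently above chance here
        lpd = d
    return lpd

# Toy annotations and toy lower bounds, made up for illustration only.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "b", "d"),
]
wins = pairwise_wins(annotations)
bounds = {0.01: 0.42, 0.03: 0.47, 0.05: 0.52, 0.07: 0.49, 0.09: 0.55, 0.11: 0.61}
lpd = least_perceptible_difference(bounds)
```

In the toy bounds, the value at d = 0.05 is above 0.5 but the value at d = 0.07 falls below it again, so the sketch reports 0.09 as the point from which the bound consistently exceeds chance.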

Impact of Negators, Modals, and Degree Adverbs on Sentiment
SCL-NMA contains many phrases formed by different types of modifiers: negators, modals, and degree adverbs. Thus, this lexicon is a good resource for studying the impact of these types of modifiers on sentiment. In the following, we compare the sentiment score of a single-word term w with the sentiment score for the phrase mod w, where mod is a modifier from a particular group (negator, modal, or degree adverb). Table 2 shows the average effect of different modifier groups on sentiment. The columns show the average change in sentiment score between w and mod w, the number of pairs (of w and mod w) used to determine the average, and the number of phrases mod w whose sentiment score is greater (↑) or less (↓) than the score of w by at least 0.069 (the least perceptible difference). Since the impact of modifiers can be different depending on the sentiment of the modified word w, we present separate analyses for when w is positive and when w is negative. For the analysis in this section only, a word is considered positive if it has a sentiment score greater than or equal to 0.3, and considered negative if its sentiment score is less than or equal to -0.3.
Table 2: The impact of different modifier groups on sentiment. 'Avg. diff.' is the average difference between the score of mod w and w. '# pairs' is the number of pairs (of w and mod w) used to determine the average. '# score ↑ (↓)' indicates the number of phrases for which score(mod w) is greater (less) than score(w) by at least 0.069 (the perceptible difference). Columns: Modifier Group; then, separately for positive words and for negative words: Avg. diff., # pairs, # score ↑, # score ↓.
Figure 2: The impact of negators on sentiment. The x-axis is score(w), the sentiment score of a term w; the y-axis is score(mod w), the sentiment score of a term w preceded by a negator. Each dot corresponds to one phrase mod w. The black line shows an average effect of the negators group. The dashed red line shows the reversing polarity hypothesis score(mod w) = −score(w).
Observe that the largest change in sentiment is caused by negation; it consistently decreases the scores of positive words, and increases the scores of negative words. The average score difference is substantial for both positive words (0.926 points) and negative words (0.791 points). Modals also tend to decrease the scores of positive words, and increase the scores of negative words, though to a much smaller extent than negators. As with negators, modals affect positive words more strongly than they do negative words. Degree adverbs show less consistency than negators and modals; they can both heighten or lower the sentiment of a word. Moreover, the same adverb can behave differently with different words from the same sentiment group (positive or negative). Therefore, we report the average absolute differences in scores for this modifier group. These average differences are substantially smaller than the ones reported for modals and negators; the effect of degree adverbs is minor. Besides, in contrast to modals and negators, for a large percentage of degree adverb phrases (for 35% of the positive-word phrases and for 37% of the negative-word phrases), the sentiment scores do not differ from the scores for the corresponding single words by the least perceptible difference (0.069 points). In the subsections below, we further examine the impact of each modifier category on the sentiment of its scope. Also, we provide rankings of different negators, modals, and degree adverbs as per the average change in sentiment score between w and mod w. This would allow linguists and other researchers to better understand the behavior of different modifiers.
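The per-group statistics described for Table 2 can be computed along the following lines. This is a sketch under assumptions: the function names and the input layout (a list of (score(w), score(mod w)) pairs) are hypothetical, and the three example pairs are the scores for good, easy, and better with the negator never that are quoted in this paper, not a full modifier group.

```python
LPD = 0.069      # least perceptible difference reported for SCL-NMA
CUTOFF = 0.3     # |score| threshold used in this section for polarity

def polarity(score, cutoff=CUTOFF):
    """Return 'positive', 'negative', or None for near-neutral words."""
    if score >= cutoff:
        return "positive"
    if score <= -cutoff:
        return "negative"
    return None

def group_summary(pairs, lpd=LPD):
    """Summarise (score(w), score(mod w)) pairs in the style of Table 2."""
    diffs = [phrase - word for word, phrase in pairs]
    return {
        "avg_diff": sum(diffs) / len(diffs),
        "n_pairs": len(diffs),
        "n_up": sum(d >= lpd for d in diffs),     # perceptibly higher
        "n_down": sum(d <= -lpd for d in diffs),  # perceptibly lower
    }

# (score(w), score(never w)) for good, easy, better, as quoted in the text.
pairs = [(0.556, -0.542), (0.598, -0.112), (0.486, 0.666)]
positive_pairs = [p for p in pairs if polarity(p[0]) == "positive"]
summary = group_summary(positive_pairs)
```

For degree adverbs, where increases and decreases would cancel out, the same loop could average `abs(phrase - word)` instead, in line with the average absolute differences reported for that group.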

Negation
There exist two common approaches to incorporating the impact of negation in automatic systems: (1) the reversing polarity hypothesis, where the sentiment score of a word, score(w), is replaced with −score(w); and (2) the shifting hypothesis, where the sentiment score of a word is moved towards the opposite polarity by a fixed amount b: score(w) − sign(score(w)) × b. We will show that neither hypothesis accurately captures the impact of negation. We will also present an analysis of the overall impact of negation and the impact of individual negators (also known as negation triggers).
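The two hypotheses can be written down directly, as in the sketch below. The function names are illustrative, and the shift amount `b = 0.8` is an arbitrary value chosen for the example, not one taken from the literature.

```python
def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)

def reversing(score):
    """Reversing polarity hypothesis: score(w) becomes -score(w)."""
    return -score

def shifting(score, b):
    """Shifting hypothesis: move score(w) towards the opposite polarity by b."""
    return score - sign(score) * b

b = 0.8  # arbitrary illustrative shift amount
reversed_positive = reversing(0.6)     # -0.6
shifted_positive = shifting(0.6, b)    # about -0.2
shifted_negative = shifting(-0.3, b)   # about 0.5
```

Under reversal, a strongly positive word becomes equally strongly negative; under shifting, the size of the change is the same for every word regardless of its original score.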
In our dataset, the negators are formed by negation words such as no, not, never, and nothing in combination with auxiliary and modal verbs. Figure 2 shows the overall impact of negation on the sentiment of single words. Each dot in this figure corresponds to one negated phrase 'negator w'. The x-axis corresponds to score(w) (the sentiment score of a word w); the y-axis is score(mod w) (the sentiment score of a word w preceded by a negator). The black line shows an average effect of negation. The dashed red line shows the reversing polarity hypothesis: score(mod w) = −score(w). Observe that on average negators tend to substantially downshift the sentiment of positive words, turning them into negative expressions. On the other hand, the scores of negative terms increase, but to a smaller extent than the scores of positive words decrease. Words with high absolute sentiment values tend to experience the greatest shift. This is true for both positive and negative words. However, this shift is substantially smaller than is proposed by the reversing polarity hypothesis. Overall, the reversing polarity hypothesis fit is rather poor. The shifting hypothesis does not explain the data either. Another observation is that words with similar sentiment scores can form negated phrases with very different sentiment scores (appearing as columns of dots in the graph). This is mostly due to the effect of different negators. However, the same negator can sometimes have different effects on words with similar sentiment. For example, the three words easy, good, and better all have similar sentiment scores: score(easy) = 0.598, score(good) = 0.556, score(better) = 0.486. Yet, the corresponding negated phrases formed with the same negator never range from negative (score(never good) = −0.542), to slightly negative (score(never easy) = −0.112), to positive (score(never better) = 0.666). Next, we investigate the effect of individual negators.
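As a small worked check, the sketch below applies both hypotheses to the three never phrases quoted above. The variable names are illustrative; the scores are the ones given in the text.

```python
# word: (score(w), score(never w)), as quoted in the text
examples = {
    "good": (0.556, -0.542),
    "easy": (0.598, -0.112),
    "better": (0.486, 0.666),
}

# Error of the reversing polarity prediction -score(w) for each phrase.
reversal_error = {
    word: phrase - (-w) for word, (w, phrase) in examples.items()
}

# Shift b that the shifting hypothesis would need for each phrase
# (all three words are positive, so b = score(w) - score(never w)).
implied_shift = {
    word: w - phrase for word, (w, phrase) in examples.items()
}

for word in examples:
    print(f"{word}: reversal error {reversal_error[word]:+.3f}, "
          f"implied shift {implied_shift[word]:+.3f}")
```

Reversal is nearly exact for never good (error 0.014) but is off by 0.486 for never easy and by 1.152 for never better, and the implied shifts (1.098, 0.710, and -0.180) are far from a single fixed amount, even though all three phrases share the same negator.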
Table 3 shows the impact of negation triggered by different negators. The majority of the negation triggers have a large effect on both positive and negative words; the absolute difference in scores between a negated phrase and the corresponding sentiment word is 0.8-1.0 points on positive words and 0.7-0.9 points on negative words. The greatest shift in sentiment on positive words was observed for the modifier will not be, and on negative words for the modifier will not. The weakest effect is caused by may not, nothing, and never. Verb tenses seem not to affect the behavior of negators significantly. For example, the average changes in sentiment caused by not and by was not differ only by 0.03-0.05 points. The modal verbs will and can form the strong negation phrases will not, will not be, and cannot, which showed the most change in sentiment. Other modal verbs, such as could, would, and may, form negation phrases with a smaller effect on sentiment.

Modals
In our dataset, the modal modifiers are formed by modal verbs can, could, should, would, may, might, and must in combination with auxiliary verbs. Figure 3 demonstrates the overall impact of modals on sentiment. One can observe that on average modals have a smoothing effect on sentiment: they make negative words less negative and positive words less positive. Words with high absolute sentiment values tend to experience the greatest shift; though, this shift is still quite small (around 0.4 points).
The effect of individual modal modifiers on positive and negative words is shown in Table 4. The most influential modal modifier is would have been. It consistently downshifts sentiment by a significant margin (about 0.5 points). Modifiers involving the modals could and might also affect sentiment in a consistent and noticeable way for both positive and negative words. The modals can and would form modifiers that have the smallest effect on the sentiment of positive and negative words (with the exception of the modifier would have been).

Degree Adverbs
As mentioned earlier, the average differences in sentiment caused by degree adverbs are quite small; many differences are negligible. Furthermore, these modifiers are less consistent than negators and modals; there are many degree adverbs that increase the sentiment intensity of some words from one class (positive or negative) and decrease the sentiment intensity of other words from the same class. For example, certainly heightens the sentiment intensity of the positive word important (by about 0.21 points), but lowers the sentiment intensity of another positive word, hope (by about 0.31 points). We found that the only degree adverb in our set that affects sentiment to a large extent (0.835 points) is less; it consistently and significantly decreases the sentiment intensity of positive words. In fact, it acts as a negator and reduces the sentiment intensity of positive words to a degree similar to that of negators. There are a few other modifiers that consistently reduce the sentiment intensity of positive words by a significant amount: was too, too, probably, fairly, and relatively. Only one intensifier, highly, consistently and significantly increases the sentiment of positive words. The sentiment of negative words is significantly lowered by the intensifiers extremely and very very.
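Whether a modifier behaves consistently within one sentiment class can be checked with a simple tally of perceptible increases and decreases, as in the sketch below. The function name and the labels are assumptions made for illustration; the example uses the approximate changes quoted above for certainly with important and hope.

```python
LPD = 0.069  # least perceptible difference reported for SCL-NMA

def direction_profile(diffs, lpd=LPD):
    """Tally perceptible increases, decreases, and negligible changes.

    `diffs` holds score(mod w) - score(w) for one modifier applied to
    words of a single sentiment class.
    """
    up = sum(d >= lpd for d in diffs)
    down = sum(d <= -lpd for d in diffs)
    flat = len(diffs) - up - down
    if up and down:
        label = "mixed"        # raises some words, lowers others
    elif up:
        label = "raises"
    elif down:
        label = "lowers"
    else:
        label = "negligible"
    return up, down, flat, label

# 'certainly' on two positive words: important (+0.21), hope (-0.31).
certainly_profile = direction_profile([0.21, -0.31])
```

A modifier labelled "mixed" is one for which no single fixed rule, whether intensifying or diminishing, can describe its effect on all words of that class.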

Interactive Visualization
As part of this project, we created an interactive visualization for SCL-NMA. The visualization has several components that allow users to investigate the effect of sentiment modifiers on individual words as well as to inspect the complete set in one scatter plot. The groups of modifiers are color-coded for ease of exploration. The full information for a phrase, including the sentiment scores of the phrase and its constituent content word, can be viewed by hovering over the point in the graph with the mouse. The scatter plot can be filtered to show phrases that include only a particular type of modifier (negators, modals, or degree adverbs). All the components are linked together so that by clicking on a point in one component one can highlight or filter the corresponding points shown in the other components. We hope that users will find this visualization helpful in exploring aspects of the data they are interested in.

Conclusions
We created a real-valued sentiment lexicon of phrases that include a variety of common sentiment modifiers such as negators, modals, and degree adverbs. Both phrases and their constituent content words are annotated manually using the Best-Worst Scaling technique. We showed that the obtained annotations are reliable: redoing the annotation with different sets of annotators produces a very similar ranking of terms by sentiment. We use the annotations for the phrases to present an extensive analysis of how negators, modals, and degree adverbs impact the sentiment of other words in their scope. We demonstrate that these modifiers affect sentiment in complex ways, so that their effect cannot be easily modeled with simple heuristics. In particular, we observe that the effect of a modifier is often determined not only by the type of the modifier (whether it is a negator, modal, or degree adverb) but also by the modifier word and the content word themselves. The created lexicon is made freely available to the research community to foster further research, especially towards automatic methods for sentiment composition and towards a better understanding of how sentiment is composed in the human brain.