Automatic Generation and Scoring of Positive Interpretations from Negated Statements

This paper presents a methodology to extract positive interpretations from negated statements. First, we automatically generate plausible interpretations using well-known grammar rules and manipulating semantic roles. Second, we score plausible alternatives according to their likelihood. Manual annotations show that the positive interpretations are intuitive to humans, and experimental results show that the scoring task can be automated.


Introduction
Negation is an intricate phenomenon present in all human languages (Hoeksema, 2000), and studied from a theoretical perspective since Aristotle. Acquiring and understanding negation is more challenging than language in general: children acquire negation after learning to communicate (Nordmeyer and Frank, 2013), and adults take longer to process negative sentences than positive ones (Clark and Chase, 1972). In any given language, humans communicate in positive terms most of the time, and use negation to express something unusual or an exception (Horn, 1989).
In classical logic, negation is a simple unary operator that reverses the truth value of a proposition. In natural language, negation is always marked (Horn and Wansing, 2015) and it is used to reverse polarity, i.e., turning something affirmative into negative, or something negative into affirmative. Although most sentences are affirmative, negation is rather ubiquitous (Morante and Sporleder, 2012): in scientific papers, 13.76% of statements contain a negation (Szarvas et al., 2008); in product reviews, 19% (Councill et al., 2010); and in a selection of Conan Doyle stories, 22.23% (Morante and Daelemans, 2012). In health records, 12.3% of concepts are tagged as negated (Elkin et al., 2005); and in OntoNotes (Hovy et al., 2006), 10.15% of statements contain a verb negated with not, n't or never.
From a theoretical perspective, it is accepted that negation conveys positive meaning (Rooth, 1992; Huddleston and Pullum, 2002). For example, when reading (1) John doesn't eat meat, humans intuitively understand that (1a) John eats something other than meat, and (1b) Some people eat meat, but not John. Extracting positive interpretations from negated statements automatically is not straightforward: a negated statement may convey one or more positive interpretations, and not all positive interpretations are equally likely. For example, from (2) They didn't order the right parts, it is very likely that (2a) They ordered the wrong parts, but (2b) Somebody ordered the right parts, but not they is unlikely. This paper presents a methodology to automatically extract and score positive interpretations from negated statements, as intuitively done by humans when reading text. A key feature of the work presented here is that it is not tied to any existing approach to extract meaning from text: we generate positive interpretations in plain text, and these positive interpretations can be semantically represented with any existing approach. The main contributions are: (1) a procedure to automatically generate plausible positive interpretations from negated statements, (2) annotations scoring plausible positive interpretations, and (3) experiments detailing results with several combinations of features, as well as gold-standard and predicted linguistic information.

Background and Definitions
Negation is well understood in grammars, and the valid ways to form a negation are well documented (Quirk et al., 2000; van der Wouden, 1997). Negation can be expressed by verbs (e.g., avoid doing any look-up), nouns (e.g., the absence of any phonic sequence), adjectives (e.g., it is pointless to argue with a fool), adverbs (e.g., I never tried Persian food before), prepositions (e.g., you can always exchange it without a problem), determiners (e.g., the new law has no direct implications to international shipping), and pronouns (e.g., nobody will keep election promises).
Huddleston and Pullum (2002) distinguish four negation types:
• Verbal if the marker of negation is grammatically associated with a verb, e.g., I did not see anything; non-verbal if it is associated with a dependent of the verb, e.g., I saw nothing.
• Synthetic if the negation mark has a function besides marking a negation, e.g., [Nobody] AGENT liked it; analytic otherwise, e.g., Not many people liked it.
• Clausal if the negation yields a negative clause, e.g., The terms aren't negotiable; subclausal otherwise, e.g., The terms are non-negotiable.
• Ordinary if the negation indicates that something is not the case, e.g., That car does not drive smooth; metalinguistic if it does not dispute the truth but rather reformulates a statement, e.g., That TV is not small, it is tiny.
In this paper, we target verbal, analytic, clausal and both ordinary and metalinguistic negation.

Positive Interpretations.
In philosophy and linguistics, it is generally accepted that negation conveys positive meanings (Horn, 1989). These positive meanings range from implicatures, i.e., what is suggested in an utterance even though neither expressed nor strictly implied (Blackburn, 2008), to entailments. Other terms used in the literature include implied meanings (Mitkov, 2005), implied alternatives (Rooth, 1985) and semantically similars (Agirre et al., 2013). Our work does not strictly fit any of these terms: we reveal positive interpretations as intuitively done by humans when reading text.

Scope and Focus.
From a theoretical perspective, it is accepted that negation has scope and focus, and that the focus, not just the scope, yields positive interpretations (Horn, 1989; Rooth, 1992; Taglicht, 1984). Scope is "the part of the meaning that is negated" and focus "the part of the scope that is most prominently or explicitly negated" (Huddleston and Pullum, 2002).
Consider the following statement in the context of the recent refugee crisis: (3) Mr. Haile was not looking for heaven in Europe. By definition, scope refers to "all elements whose individual falsity would make the negated statement strictly true", and focus is "the element of the scope that is intended to be interpreted as false to make the overall negative true" (Huddleston and Pullum, 2002). The falsity of any of the truth conditions below makes statement (3) true, thus the scope of the negation is (3a-3d). Determining the focus is almost always more challenging than determining the scope. The challenge lies in determining which of the truth conditions (3a-3d) is intended to be interpreted as false to make the negated statement true: all of them qualify, but some are more likely. A natural reading of statement (3) suggests that Mr. Haile was looking for something (a regular life, a job, etc.) in Europe, but not heaven. Determining that the focus is heaven, i.e., that everything in statement (3) is actually positive except the THEME of looking, is the key to revealing the intended positive interpretation. It is worth noting that other foci yield unlikely interpretations, e.g., Somebody was looking for heaven in Europe, but not Mr. Haile (3b, AGENT), Mr. Haile was looking for heaven somewhere, but not in Europe (3d, LOCATION). Note that (1) scope on its own does not yield positive interpretations, and (2) some negated statements convey several likely positive interpretations, e.g., statement (1) in Section 5, Table 3.

Previous Work
Within computational linguistics, approaches to process negation are shallow, or target scope and focus detection. Popular semantic representations such as semantic roles (Palmer et al., 2005; Baker et al., 1998) or AMR (Banarescu et al., 2013) do not reveal the positive interpretations we target in this paper. Shallow approaches are usually application-specific. In sentiment and opinion analysis, negation has been reduced to marking as negated all words between a negation cue and the first punctuation mark (Pang et al., 2002), or within a five-word window of a negation cue (Hu and Liu, 2004). The examples throughout this paper show that these techniques are insufficient to reveal implicit positive interpretations.

Scope Annotation and Detection
Scope of negation detection has received a lot of attention, mostly using two corpora: BioScope in the medical domain (Szarvas et al., 2008) and CD-SCO (Morante and Daelemans, 2012). BioScope annotates negation cues and linguistic scopes exclusively in biomedical texts. CD-SCO annotates negation cues, scopes, and negated events or properties in selected Conan Doyle stories.
There have been several supervised proposals to detect the scope of negation using BioScope and CD-SCO (Özgür and Radev, 2009; Øvrelid et al., 2010). Automatic approaches are mature (Abu-Jbara and Radev, 2012): F-scores are 0.96 for negation cue detection, and 0.89 for negation cue and scope detection (Velldal et al., 2012; Li et al., 2010). Outside BioScope and CD-SCO, Reitan et al. (2015) present a negation scope detector for tweets, and show that it improves sentiment analysis. As shown in Section 2, scope detection is insufficient to reveal positive interpretations from negated statements.

Focus Annotation and Detection
While focus of negation has been studied for decades in philosophy and linguistics (Section 2), corpora and automated tools are scarce.  annotate focus of negation in the 3,993 negations marked with ARGM-NEG semantic role in PropBank (Palmer et al., 2005). Their annotations, PB-FOC, were used in the *SEM-2012 Shared Task (Morante and Blanco, 2012). Their guidelines require annotators to choose as focus the semantic role that "is most prominently negated" or the verb. If several roles may be the focus, they prioritize "the one that yields the most meaningful implicit [positive] information", but do not specify what most meaningful means. Consider again statement (1) John doesn't eat meat. Their approach would determine that the focus is the THEME of eat, meat, because it arguably yields the "most meaningful implicit [positive] information" (using our terminology, positive interpretation): John eats something other than meat. By design, they ignore other valid positive interpretations, e.g., Some people eat meat, but not John. In this paper, we improve upon their work: instead of extracting the "most meaningful" positive interpretation from a negated statement, we generate several positive interpretations and score them according to their likelihood.
Anand and Martell (2012) present a complementary approach to annotate focus of negation. They refine PB-FOC and argue that positive interpretations arising from scalar implicatures and neg-raising predicates should be separated from those arising from focus detection. According to their annotations, 27.4% of negations with a focus annotated in PB-FOC do not actually have a focus. Blanco and Moldovan (2012) introduce the concept of fine-grained foci and refine the annotations in PB-FOC by annotating foci at the token level, and Matsuyoshi et al. (2014) annotate focus of negation in Japanese. In this paper, we are not concerned with annotating focus of negation per se, but with extracting positive interpretations from negated statements as intuitively understood by humans.
Automatic systems to detect the focus of negation (and reveal up to one positive interpretation) in English texts are trained using PB-FOC.  obtain an accuracy of 65.5 using supervised machine learning and features derived from gold-standard linguistic information, and Blanco and Moldovan (2014) report an F-measure of 64.1. Rosenberg and Bergler (2012) report an F-measure of 58.4 using 4 linguistically sound heuristics and predicted linguistic information, and Zou et al. (2014) an F-measure of 65.62 using contextual discourse information. Unlike the work presented here, none of these systems attempts to extract and rank several positive interpretations from one negated statement.

Corpus Creation
Our goal is to create a corpus of negated statements and their positive interpretations as intuitively understood by humans. We put a strong emphasis on automation. First, given a negated statement, we automatically generate plausible positive interpretations following a battery of linguistically motivated deterministic rules (Section 4.1). Second, we collect manual annotations to score the plausible positive interpretations according to their likelihood (Section 4.2). We then use these manually obtained scores to learn models that automatically score positive interpretations (Section 6). We decided to work on top of OntoNotes (Hovy et al., 2006) instead of plain text or other corpora for several reasons. First, OntoNotes includes gold linguistic annotations such as part-of-speech tags, parse trees and semantic roles. Second, state-of-the-art role labelers trained with PropBank achieve F-measures of 0.835 (Lewis et al., 2015), and we use semantic roles to generate positive interpretations. Third, unlike BioScope, CD-SCO and PB-FOC (Section 2), OntoNotes includes sentences from several genres, e.g., newswire, broadcast news and conversations, magazines, the web.

Generating Positive Interpretations
OntoNotes is a large corpus containing 63,918 sentences. Annotating all positive interpretations from all negations is outside of the scope of this paper. Instead, we target selected representative negations.
Selecting Negated Statements. We first selected all verbs negated with ARGM-NEG semantic role and obtained 6,617 verbal negations. After examining the negated verbs, it became clear that negation is not uniformly distributed across verbs in OntoNotes; it roughly follows Zipf's law. In order to alleviate the annotation effort while accounting for all negated verbs in OntoNotes, we randomly selected up to 5 negations for each verb. The number of negated statements selected is 600.
Converting Negated Statements into Their Positive Counterparts. We apply 3 steps inspired by the grammatical rules to form negation detailed by Huddleston and Pullum (2002, Ch. 9):
1. Remove the negation mark by removing the tokens within ARGM-NEG semantic role.
2. Remove auxiliaries, expand contractions, and fix third-person singular and past tense. For example (before: after), doesn't go: goes, didn't go: went, won't go: will go. We use a standard list of irregular verbs, and grammar rules to convert to third-person singular and past tense based on orthographic patterns.
3. Rewrite negatively-oriented polarity-sensitive items. For example (before: after), anyone: someone, any longer: still, yet: already, at all: somewhat. We use the correspondences between negatively-oriented and positively-oriented polarity-sensitive items by Huddleston and Pullum (2002, p. 831).
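The three conversion steps can be illustrated with a minimal sketch. It assumes Penn Treebank style tokens (e.g., does / n't, wo / n't), that any auxiliary sits immediately before the negation mark, and tiny illustrative lexicons; the names positive_counterpart, IRREGULAR_PAST and POLARITY_ITEMS are hypothetical, and the orthographic rules are deliberately simplified rather than a reproduction of the rules used in this work.

```python
import re

# Illustrative lexicons (assumptions): a real system would use a full
# irregular-verb list and the complete polarity-item correspondences.
IRREGULAR_PAST = {"go": "went", "eat": "ate", "see": "saw", "know": "knew"}
POLARITY_ITEMS = [("any longer", "still"), ("at all", "somewhat"),
                  ("anyone", "someone"), ("yet", "already")]
CONTRACTED_AUX = {"wo": "will", "ca": "can", "sha": "shall"}
DO_SUPPORT = {"do", "does", "did"}


def third_person(verb):
    """Third-person singular from orthographic patterns (simplified)."""
    if re.search(r"(s|x|z|ch|sh|o)$", verb):
        return verb + "es"
    if re.search(r"[^aeiou]y$", verb):
        return verb[:-1] + "ies"
    return verb + "s"


def past_tense(verb):
    """Past tense from the irregular list, else orthographic patterns."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    if re.search(r"[^aeiou]y$", verb):
        return verb[:-1] + "ied"
    return verb + "ed"


def positive_counterpart(tokens, neg_idx, verb_idx):
    """Apply steps 1-3 to a tokenized negated statement."""
    tokens = list(tokens)
    aux_idx = neg_idx - 1
    aux = tokens[aux_idx].lower() if aux_idx >= 0 else ""
    # Step 2: move tense/agreement from do-support onto the main verb,
    # and expand contracted auxiliaries such as "wo" (from "won't").
    if aux == "does":
        tokens[verb_idx] = third_person(tokens[verb_idx])
    elif aux == "did":
        tokens[verb_idx] = past_tense(tokens[verb_idx])
    if aux in CONTRACTED_AUX:
        tokens[aux_idx] = CONTRACTED_AUX[aux]
    # Step 1 (and the rest of step 2): drop the negation mark and do-support.
    drop = {neg_idx}
    if aux in DO_SUPPORT:
        drop.add(aux_idx)
    text = " ".join(t for i, t in enumerate(tokens) if i not in drop)
    # Step 3: rewrite negatively-oriented polarity-sensitive items.
    for negative_item, positive_item in POLARITY_ITEMS:
        text = re.sub(r"\b%s\b" % re.escape(negative_item), positive_item, text)
    return text
```

For instance, calling positive_counterpart(["John", "does", "n't", "eat", "meat"], 2, 3) removes the negation mark and the auxiliary and inflects the verb. Working on token indices rather than raw strings keeps the sketch aligned with the ARGM-NEG span that step 1 removes.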
Generating Positive Interpretations. Once the positive counterpart is obtained, we generate positive interpretations by rewriting each semantic role or the (originally negated) verb. Among others, we use the following rewriting rules: ARG0-ARG4: someone / some people / something, ARGM-TMP: at some point of time, ARGM-LOC: somewhere, ARGM-MNR: in some manner, ARGM-CAU: because of something and ARGM-PRP: to do something. Additionally, if the semantic role starts with a preposition, we also include it, e.g., gave [to John] ARG2: gave to someone, but not John. This methodology generated 1,888 positive interpretations from the 600 selected negations (average: 3.15). Table 1 exemplifies the 3 steps to transform a negated statement into its positive counterpart, and the positive interpretations generated. Additional examples are provided in Table 3.
We acknowledge that some of the positive interpretations we generate automatically are not as specific or intuitive as carefully crafted, manually generated interpretations could be. For example, from [John] ARG0 does[n't] ARGM-NEG [know] verb [the details about how they met] ARG1, the proposed methodology would generate, among others, John knows about something, but not about the details of how they met. A better interpretation that we do not generate is John knows something about how they met, but not the details. We argue that generating interpretations automatically is the only option in order to incorporate this work into an NLP pipeline, and reserve for future work generating positive interpretations beyond rewriting semantic roles.
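As a rough illustration of the role-rewriting rules above, the sketch below replaces one semantic role at a time with an indefinite placeholder and appends a "but not" clause. The function names, the small preposition list, and the numbered_fillers argument (which picks among someone / some people / something for numbered roles) are assumptions made for this example; rewriting the verb itself is not covered.

```python
# Placeholders for modifier roles, taken from the rewriting rules in the text.
MODIFIER_PLACEHOLDERS = {
    "ARGM-TMP": "at some point of time",
    "ARGM-LOC": "somewhere",
    "ARGM-MNR": "in some manner",
    "ARGM-CAU": "because of something",
    "ARGM-PRP": "to do something",
}
NUMBERED_ROLES = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4"}
# Illustrative preposition list (assumption, not exhaustive).
PREPOSITIONS = {"to", "for", "about", "with", "from", "in", "on", "at"}


def placeholder_for(label, role_text, filler):
    """Return (placeholder, excluded_text) for a role, or None if no rule applies."""
    if label in NUMBERED_ROLES:
        first, _, rest = role_text.partition(" ")
        # Keep a leading preposition in front of the placeholder,
        # e.g., "to John" -> "to someone" (excluded text: "John").
        if first.lower() in PREPOSITIONS and rest:
            return "%s %s" % (first, filler), rest
        return filler, role_text
    if label in MODIFIER_PLACEHOLDERS:
        return MODIFIER_PLACEHOLDERS[label], role_text
    return None


def generate_interpretations(counterpart, roles, numbered_fillers=None):
    """roles: list of (label, text) pairs whose text occurs in counterpart."""
    numbered_fillers = numbered_fillers or {}
    interpretations = []
    for label, text in roles:
        rule = placeholder_for(label, text, numbered_fillers.get(label, "something"))
        if rule is None or text not in counterpart:
            continue
        placeholder, excluded = rule
        rewritten = counterpart.replace(text, placeholder, 1)
        rewritten = rewritten[0].upper() + rewritten[1:]
        interpretations.append((label, "%s, but not %s" % (rewritten, excluded)))
    return interpretations
```

With the counterpart John eats meat and the roles ARG0 (John) and ARG1 (meat), this yields one interpretation per role, which is the kind of candidate set that annotators later score.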

Ranking Positive Interpretations
Once positive interpretations were automatically generated, we asked annotators to rank them. Annotators were presented with one negated statement and one positive interpretation at a time, and were asked: "Given the negated statement above, do you think the statement [positive interpretation] below is true?" They only had access to the text in the original negated statement, the positive interpretation, and the previous and next sentences as context. We did not display semantic role information for the original negated statement, its positive counterpart or the semantic role from which the positive interpretation was generated. As we shall see (Section 5.1), context often helps scoring interpretations.
Annotators were required to answer with a score from 0 to 5, where 0 means absolutely disagree and 5 means absolutely agree. We did not provide descriptions for intermediate scores or use additional categorical labels. These simple guidelines were sufficient to reliably score the automatically generated plausible positive interpretations. Table 2 shows the number of positive interpretations generated (#), the percentage of sentences for which a positive interpretation is generated (% sent), and the mean and standard deviation (SD) of the annotated scores.

Corpus Analysis
On average, we generated 3.15 positive interpretations per negation (standard deviation: 0.82), and 74% of negations have at least one interpretation scored 4 or higher. Basic counts and statistics for the annotations are provided in Table 2. Overall, we annotated 1,888 positive interpretations generated from 600 sentences, or equivalently, from 600 verbs negated with ARGM-NEG semantic role in OntoNotes. Overall mean score is 3.52 (out of 5) and overall standard deviation, 1.63. The 25th percentile is 2.0, the 50th percentile is 4.0 and the 75th percentile is 5.0. These numbers show that most positive interpretations automatically generated are deemed likely by annotators, and over 25% are scored with a 5 (out of 5). In general, positive interpretations generated from numbered roles (ARG0-ARG4) are scored higher than the ones generated from modifiers (ARGM-ADV, ARGM-CAU, ARGM-DIR, etc.). Also, positive interpretations generated from infrequent roles are generally ranked higher, e.g., ARG4 and ARG3 vs. ARG0, ARGM-PRP and ARGM-MNR vs. ARGM-ADV.
Annotation Quality. In order to ensure annotation quality, we calculated inter-annotator Pearson correlation. Kappa and other agreement measures designed for categorical labels are not well-suited for our annotation task, since not all disagreements between numeric scores are the same, e.g., 4 vs. 5 should be counted as relatively high agreement, and 1 vs. 5 should be counted as high disagreement. Overall Pearson correlation is 0.761.
Table 3 (fragment recovered from extraction). Columns: "Negated statement, context if relevant to determining scores, and all positive interpretations" and "Score". Example 1, context (previous statement): That change will obviously impact third and fourth quarter earnings for the industry in general, he added.
Table 3 presents annotation examples. We show the original negated statement including semantic role annotations from OntoNotes (square brackets), all positive interpretations automatically generated, and their scores.
We also include context (previous and next sentence) if it helps to determine scores. Example (1) shows that context sometimes is vital to scoring plausible positive interpretations. Given He didn't forecast Phillips' results in isolation, it is uncertain if he forecasted anything at all, or whether somebody forecasted Phillips' results. However, the previous statement makes certain (5/5) the interpretation generated from ARG1: he forecasted earnings for the industry in general. Similarly, the next statement makes very likely (5/5) the interpretation generated from ARG0: other people (security analysts) made forecasts about Phillips. In this example, 2 positive interpretations are generated from one negation, and they are assigned the highest score (5/5).

Annotation Examples
The positive interpretations generated from example (2) can be annotated without context. Somebody other than Murdoch had to own the rest of Star TV that he bought in 1995 (score 5/5), and people cannot buy what they already own (score 0/5).
Example (3) presents a positive interpretation generated from ARG0 that is scored low (1/5); recall that the mean score for interpretations generated from ARG0 is 4.07 and the standard deviation 1.03 (Table 2). In this negated statement, the indefinite you refers to an unspecified person, thus it is not the case that somebody can run a country with 23 million people with revenues of 16 billion to 20 billion dollars (Interpretation 3.1, score 1/5). Interpretations 3.2 and 3.3, however, are scored high: annotators correctly understood that given the negated statement, it is the case that You can run something with revenues of 16 billion to 20 billion dollars, but not a country with 23 million people (you can run a country with fewer people with that revenue), and You can run a country with 23 million people in some manner, but not with revenues of 16 billion to 20 billion dollars (you could run it with more revenue).
Finally, example (4) presents a short statement from which only one positive interpretation is generated. Annotators were asked whether given Do not utter a word, they think Utter something, but not a word is true. They correctly annotated that this positive interpretation is invalid (score 0/5). Indeed, the negated statement in example (4) can only be interpreted as an order to not utter anything.

Learning to Score Positive Interpretations
We follow a standard supervised machine learning approach. The 1,888 positive interpretations along with their scores become instances, and we divide them into training (80%) and test (20%) splits, making sure that all interpretations generated from a sentence are assigned to either the training or the test split. Note that splitting instances randomly would not be sound: training with some interpretations generated from a negated statement, and testing with the rest of the interpretations generated from the same statement would be an unfair evaluation. We trained a Support Vector Machine (SVM) for regression with RBF kernel using scikit-learn (Pedregosa et al., 2011), which in turn uses LIBSVM (Chang and Lin, 2011). The feature set and SVM parameters (C and γ) were tuned using 10-fold cross-validation with the training set, and results were calculated using the test set.
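The setup described above can be sketched with scikit-learn as follows. The data here are synthetic stand-ins (random features, 200 hypothetical sentences with 3 interpretations each), the parameter grid is arbitrary, and keeping folds sentence-disjoint during cross-validation via GroupKFold is an assumption of this sketch; the text only states that 10-fold cross-validation was used on the training set.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): each row is one positive
# interpretation, and `groups` holds the id of its source sentence.
rng = np.random.default_rng(0)
n_sentences, per_sentence = 200, 3
groups = np.repeat(np.arange(n_sentences), per_sentence)
X = rng.normal(size=(groups.size, 8))
y = np.clip(2.5 + X[:, 0] + rng.normal(scale=0.5, size=groups.size), 0, 5)

# 80/20 split that keeps all interpretations of a sentence on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

# Tune C and gamma of an RBF-kernel SVR with 10-fold cross-validation
# on the training split only.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=GroupKFold(n_splits=10),
)
search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

# Evaluate on the held-out sentences with Pearson correlation.
predictions = search.predict(X[test_idx])
r, _ = pearsonr(y[test_idx], predictions)
print("best parameters:", search.best_params_)
print("test Pearson correlation: %.3f" % r)
```

Splitting by sentence id rather than by row is what prevents interpretations of the same negated statement from appearing in both training and test data.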

Feature Selection
We tried features derived exclusively from the negated statement from which the positive interpretation was generated. More specifically, we extract features from the negated verb or semantic role (sem role) used to generate the positive interpretation, from both of them (verb-sem role) and from the verb-argument structure of verb (verbarg-struct), i.e., all semantic roles of verb.
Verb features are straightforward and account for the verb word form and part-of-speech tag.
Sem role features include the label of the semantic role from which the positive interpretation was generated (sem role label), its length (number of tokens), and the word form and part-of-speech tag of its head. Additionally, we add standard features in semantic role labeling (Gildea and Jurafsky, 2002): the syntactic nodes (NP, PP, etc.) of the semantic role and its parent in the parse tree, as well as the left and right siblings, if any.
Verb-sem role features are also standard in role labeling. We include a flag indicating whether the verb occurs before or after the semantic role in the negated statement (not in the positive counterpart), syntactic node of the lowest common ancestor between verb and sem role, and the syntactic path.
Finally, verbarg-struct features encode characteristics of the verb-argument structure to which verb and sem role belong. Namely, we added flags indicating semantic role presence, and features indicating the first and last semantic roles in order of appearance in the negated statement. We also included the syntactic nodes of semantic roles and their heads.
We exemplify all features with Interpretation 3.1 generated from Statement 3 in Table 3:
• Verb features: verb wf=run, verb pos=VBP.

Experimental Results
We report results obtained with several combinations of features in Table 5. We detail results obtained with features extracted from gold-standard and predicted linguistic annotations (part-of-speech tags, parse trees, semantic roles, etc.) as annotated in the gold and auto files from the CoNLL-2011 Shared Task release of OntoNotes (Pradhan et al., 2011). All models are trained with gold-standard linguistic annotations, and tested with either gold-standard or predicted linguistic annotations.
Testing with gold-standard linguistic annotations. Using only the label of the semantic role from which the positive interpretation was generated (sem role label), yields a Pearson correlation of 0.603. Using Verb features is virtually useless (Pearson correlation: -0.025); this is due to the fact that our corpus includes at most 5 instances for each negated verb in OntoNotes (Section 4). Adding sem role features yields a correlation of 0.630, and incorporating verb-sem role features is useless (Pearson: 0.627). Considering all features, however, yields the highest correlation, 0.642.
Feature descriptions (table fragment recovered from extraction):
verb-sem role: Whether verb occurs before or after sem role in the negated statement; Syntactic node of lowest common ancestor of verb and sem role; Syntactic paths from verb to sem role.
verbarg-struct: Flags indicating whether verb has each possible semantic role; Semantic role labels of the first and last roles of verb; Syntactic nodes and heads of each semantic role attaching to verb.
Table 5: Pearson correlations in the test set using gold-standard and predicted linguistic annotations (part-of-speech tags, parse trees and semantic roles). Results are provided using sem role label as only feature, and using several features incrementally. Number of test instances with gold-standard linguistic annotations is 378, and with predicted annotations, 268.
Testing with predicted linguistic annotations. We assigned 20% of instances to the test split, totalling 378 instances (Section 6). However, some of these instances cannot be obtained using predicted role labels: a missing or incorrect semantic role will unequivocally lead to positive interpretations that are not in our corpus and thus evaluation is not straightforward. Results presented with predicted linguistic annotations are calculated using only the 268 (out of 378) positive interpretations that are generated from predicted semantic roles and are also generated (and thus annotated in our corpus) from gold-standard linguistic annotations.
Results using only the sem role label feature (0.642) are better than or very similar to those using any combination of features (0.638-0.650) except Verb features alone, which perform poorly as explained earlier.
These results should be taken with a grain of salt: the number of test instances is much lower (268 vs. 378). Additionally, these 268 test instances correspond to positive interpretations generated from semantic roles that were predicted correctly automatically. Role labels are predicted better for shorter sentences without complicated syntactic structure; positive interpretations for this kind of sentence are also easier to score, e.g., Statement 4 in Table 3.
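The restriction of the test set to interpretations that are generated from both gold-standard and predicted roles can be pictured as a simple intersection followed by Pearson correlation. In the sketch below, instances are keyed by a hypothetical (sentence id, interpretation text) pair, and all scores are made-up toy values rather than corpus data.

```python
from scipy.stats import pearsonr


def restrict_to_shared(gold_scores, predicted_keys):
    """Keep annotated instances that are also generated from predicted roles."""
    predicted_keys = set(predicted_keys)
    return {key: score for key, score in gold_scores.items()
            if key in predicted_keys}


# Toy annotated scores for interpretations generated from gold roles.
gold_scores = {
    ("s1", "Someone eats meat, but not John"): 5.0,
    ("s1", "John eats something, but not meat"): 4.0,
    ("s2", "Someone went, but not they"): 1.0,
    ("s3", "Utter something, but not a word"): 0.0,
}
# Interpretations generated from predicted roles: one gold interpretation
# is missing, and one extra interpretation has no annotation.
predicted_keys = [
    ("s1", "Someone eats meat, but not John"),
    ("s2", "Someone went, but not they"),
    ("s3", "Utter something, but not a word"),
    ("s3", "Utter a word somewhere, but not here"),
]
# Toy model outputs for the interpretations generated from predicted roles.
model_scores = {
    ("s1", "Someone eats meat, but not John"): 4.2,
    ("s2", "Someone went, but not they"): 2.0,
    ("s3", "Utter something, but not a word"): 1.0,
    ("s3", "Utter a word somewhere, but not here"): 3.0,
}

shared = restrict_to_shared(gold_scores, predicted_keys)
keys = sorted(shared)
r, _ = pearsonr([shared[k] for k in keys], [model_scores[k] for k in keys])
print(len(shared), "shared instances")
```

Interpretations that arise only from incorrect predicted roles have no annotated score, which is why they drop out of the evaluation.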

Conclusions
Humans intuitively understand negated statements in positive terms when reading text. This paper presents an automated methodology to generate plausible positive interpretations from verbal negation, and score them based on their likelihood. We use simple grammar rules and manipulate semantic roles to generate positive interpretations. Experimental results show that these interpretations can be scored automatically using standard supervised machine learning techniques.
An annotation effort shows that most positive interpretations automatically generated are likely (scores ≥4), thus the amount of positive meaning revealed by the methodology presented here is substantial (on average, 3.15 interpretations are generated per negation). We believe that more annotations and a learning algorithm that scores jointly all positive interpretations generated from each negation (as opposed to individually) would yield better results.