Automatic Extraction of Implicit Interpretations from Modal Constructions

This paper presents an approach to extract implicit interpretations from modal constructions. Importantly, our approach uses a deterministic procedure to normalize eventualities and generate potential interpretations. An annotation effort demonstrates that these interpretations are intuitive to humans and that most modal constructions convey at least one interpretation. Experimental results show that the task is challenging but can be automated.


Introduction
People use language to communicate not only facts, but also intentions, uncertain information and points of view. Modality can be broadly defined as a grammatical phenomenon used to express the speaker's opinion or attitude towards a proposition (Lyons, 1977). Modality has also been defined as "the category of meaning used to talk about possibilities and necessities, essentially, states of affairs beyond the actual" (Hacquard, 2011). Within computational linguistics, processing modality has proven useful for, among others, recognizing textual entailment (Snow et al., 2006; MacCartney et al., 2006), machine translation (Murata et al., 2005; Baker et al., 2012), and sentiment analysis (Wiebe et al., 2005).
In the absence of modality markers, it is understood that the author of a proposition agrees with it (Hengeveld and Mackenzie, 2008). Adding a modality marker, also referred to as a cue, casts doubt on the truth of the proposition, e.g., Mary got a new job last week vs. Mary likely got a new job last week. Modality is surprisingly common (Morante and Sporleder, 2012), and notoriously difficult to annotate and process automatically (Rubinstein et al., 2013; Vincze et al., 2011). In MEDLINE, 11% of sentences contain speculative language (Light et al., 2004), and in biomedical abstracts, 18% do (Vincze et al., 2008). Rubin (2006) reports that 59% of statements in 80 New York Times articles include epistemic modality. Despite modality being ubiquitous, there is no agreed-upon annotation schema.
In this paper, we extract implicit interpretations intuitively understood by humans when reading modal constructions. We do not follow any specific theory of modality. Instead, we manipulate modal constructions to automatically generate potential interpretations, and then assign factuality scores to them. Consider statement (1) below:

(1) John likely contracted the disease when a mouse bit him in the Adirondacks.

Even though likely syntactically attaches to contracted, a natural reading suggests that John contracted the disease is factual; the only bit of uncertain information is how (or when) he contracted the disease. In other words, assuming that the author of statement (1) is truthful, event contracted occurred with AGENT John and THEME the disease, but the MANNER (or TIME) may not have been when a mouse bit him in the Adirondacks.
A key feature of the work presented in this paper is that the interpretations extracted from modal constructions are not tied to any syntactic or semantic representation. Given modal constructions in plain text, we extract implicit interpretations in plain text, and these interpretations can be processed with any existing NLP pipeline. The main contributions of this paper are: (1) a procedure to automatically generate potential interpretations from modal constructions; (2) annotations assessing the factuality of potential interpretations generated from OntoNotes, available at www.sanders.tech; and (3) experimental results using several features.
Previous Work

Beyond theoretical works, there are many proposals to annotate modality. Doing so has proven challenging: following different annotation schemas on the same source text yields little overlap (Vincze et al., 2011), and Carretero and Zamorano-Mansilla (2013) present an analysis of disagreements when targeting modal adverbs. Annotation schemas typically include three tasks: identifying modality triggers, their scopes, and their sources (Quaresma et al., 2014; Sánchez and Vogel, 2015). Many also classify the modality into several types (epistemic, circumstantial, ability, deontic, etc.) or a fine-grained taxonomy (Rubinstein et al., 2013; Nissim et al., 2013). In this paper, we are not concerned with modeling modality per se, or with classifying instances of modality into predefined classes or hierarchies. Instead, we extract implicit interpretations from modal constructions in order to mirror intuitive readings.
FactBank is probably the best-known corpus for event factuality (Saurí and Pustejovsky, 2009). It was created following carefully crafted annotation guidelines and examples comprising 34 pages. The guidelines detail a manual normalization step to "identify the full event that needs to be assessed in terms of its factuality" (p. 12), and the annotation process includes identifying the sources that are assessing factuality (p. 15). de Marneffe et al. (2012) reannotate a subset of FactBank with factuality values from the reader's perspective, which they call veridicality, using crowdsourcing. Both FactBank and de Marneffe et al. (2012) rely on manual normalization to identify the eventuality whose factuality is being annotated. Instead, we present an automated approach: we manipulate semantic roles and syntactic dependencies deterministically to generate several potential interpretations per modal construction, and then assess their factuality.
Many other efforts expand on FactBank using crowdsourced annotations, different (usually simpler) annotation schemas, or other domains. Prabhakaran et al. (2012) use crowdsourcing to classify propositions into five modalities: ability, effort, intention, success, and want. Soni et al. (2014) target the factuality of quotes (direct and indirect) on Twitter. Lee et al. (2015) detect events and assess factuality using easy-to-understand short instructions to crowdsource annotations. Unlike us, they annotate factuality at the individual token level, where annotated tokens are deemed events by annotators. Prabhakaran et al. (2015) define and annotate propositional heads with four categories: non-belief propositions, and committed, non-committed, or reported beliefs. Instead of assessing factuality only for propositional heads (usually verbs, one assessment per proposition), we do so for potential interpretations automatically generated by manipulating verbs and their arguments deterministically.
All works cited in the previous two paragraphs either manually normalize text prior to assessing factuality-making automation from plain text impossible-or assess factuality for tokens deemed events (ordered, delay, agreed, etc.) or full propositions (a verb and all its arguments). Unlike them, we automatically generate potential interpretations from a single modal construction-or, equivalently, automatically generate several normalizations-and then assess their factuality.

Terminology and Background
We use the term modal construction to refer to verb-argument structures modified by a modal adverb (possibly, probably, etc.). We use the term implicit interpretation, or simply interpretation, to refer to the meaning intuitively understood by humans when reading a modal construction. Potential interpretations are automatically generated interpretations whose factuality has yet to be determined. The factuality of an interpretation is a score indicating its likelihood, i.e., whether it is true, false, or unknown given the modal construction.
We work on top of OntoNotes (Hovy et al., 2006) because it includes text from several genres (news, broadcast and telephone conversations, weblogs, etc.) and provides part-of-speech tags, parse trees, PropBank-style semantic roles, and other linguistic information. Very briefly, PropBank (Palmer et al., 2005) has two kinds of semantic roles: numbered roles (ARG0, ARG1, etc.), which are defined in verb-specific framesets, and argument modifiers (ARGM-TMP, ARGM-LOC, etc.); we refer the reader to the aforementioned reference and to the PropBank guidelines and framesets for more details. We transformed the parse trees in OntoNotes into syntactic dependencies using Stanford CoreNLP (Manning et al., 2014).
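For illustration, a verb-argument structure with its PropBank-style roles can be represented as a small record. The class and field names below are our own simplification, not OntoNotes' release format:

```python
from dataclasses import dataclass, field

@dataclass
class VerbArgStructure:
    """A verb plus its PropBank-style semantic roles (illustrative sketch)."""
    verb: str                                   # main verb token
    roles: dict = field(default_factory=dict)   # role label -> argument text

# Statement (1) from the introduction, annotated with PropBank-style roles.
s = VerbArgStructure(
    verb="contracted",
    roles={
        "ARG0": "John",
        "ARG1": "the disease",
        "ARGM-ADV": "likely",                   # the modal adverb
        "ARGM-TMP": "when a mouse bit him in the Adirondacks",
    },
)
```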

Corpus Creation
We define a two-step procedure to create a corpus of modal constructions and the implicit interpretations intuitively understood by humans when reading them. First, we automatically generate potential interpretations from modal constructions by manipulating syntactic dependencies and semantic roles. Second, we manually score potential interpretations according to their likelihood. These interpretations and scores are later used to learn how to score potential interpretations automatically (Section 6).

Generating Potential Interpretations
Selecting Modal Constructions. OntoNotes is a large corpus containing 63,918 sentences. Creating a corpus of interpretations for all modal constructions is outside the scope of this paper. In order to alleviate the annotation effort, we focus on selected modal constructions. Specifically, we select verb-argument structures that have one ARGM-ADV or ARGM-MNR role, and that role is one of the following modal adverbs: certainly, clearly, definitely, likely, obviously, possibly, probably, surely, or unlikely. These adverbs are the most frequent that satisfy the above filter. Additionally, we discard verb-argument structures with to be as the main verb. These rules retrieve 324 modal constructions.

Automatic Normalization. Modal constructions often occur in long multi-clause sentences. In order to identify the eventuality from which potential interpretations should be generated, we automatically normalize the original sentence. Normalizing consists of a battery of deterministic steps implemented using syntactic dependencies and semantic roles. In contrast with previous work (Section 2), our normalization is fully automated. Hereafter, we use verb to refer to the main verb in the modal construction, adverb to refer to the modal adverb, and sem roles to refer to all semantic roles in the modal construction.

2. Convert negated verb-argument structures into their positive counterparts. We follow three steps inspired by the rules to form negation proposed by Huddleston and Pullum (2002): (a) Remove the negation mark by deleting the token whose syntactic dependency is neg. (b) Remove auxiliaries, expand contractions, and fix third-person singular and past tense. For example (before: after), doesn't go: goes; didn't go: went; won't go: will go. To implement this step, we loop through tokens whose head is the negated verb with dependency aux, and use a list of irregular verbs and grammar rules to convert to third-person singular and past tense based on orthographic patterns. (c) Rewrite negatively-oriented polarity-sensitive items. For example (before: after), anyone: someone; any longer: still; yet: already; at all: somewhat. We use the correspondences between negatively-oriented and positively-oriented polarity-sensitive items given by Huddleston and Pullum (2002, p. 831).
3. Fix modal verbs and tense. If a modal verb (can, could, may, would, should, must, etc.) has verb as its syntactic head, we transform the modal construction into past or future depending on the modal and the tense of verb. For example: could go: went; can go: will go; should have gone: went. We use the same grammar rules and list of irregular verbs as in Step (2b).
4. Select relevant tokens. We remove all tokens in the original sentence except verb and the tokens belonging to the roles in sem roles. Additionally, we fix phrasal verbs by adding tokens with the part-of-speech tag RP whose syntactic head is verb and dependency type prt (semantic roles in OntoNotes are annotated for verb tokens; missing the preposition when verb is a phrasal verb would inadvertently change meaning). We also add all tokens to the left of verb until we find the first token whose part-of-speech tag does not start with VB, MD, RB or EX (verbs, modals, adverbs and existential there).
5. Generate additional normalizations. If verb is followed by TO + verb2 (e.g., want to go, like to play, intend to pass), we generate an additional normalization for verb2 after merging the semantic roles of verb and verb2.

Table 1 exemplifies the automatic normalization step by step with two modal constructions:

Sentence 1
  Step 1: The danger is he cannot deliver the promises that he made during the campaign.
  Step 2: The danger is he can deliver the promises that he made during the campaign.
  Step 3: The danger is he will deliver the promises that he made during the campaign.
  Step 4: He will deliver the promises that he made during the campaign.
  Step 5: He will deliver the promises that he made during the campaign.

Sentence 2
  Step 1: I wouldn't define [...] although I would like to defer raising taxes as long as prudently possible.
  Steps 2, 3: I would define [...] although I will like to defer raising taxes as long as prudently possible.
  Step 4: I will like to defer raising taxes as long as prudently possible.
  Step 5: Normalization 1: I will like to defer raising taxes as long as prudently possible.
          Normalization 2: I will defer raising taxes as long as prudently possible.

Potential interpretations from Sentence 2
  From norm. 1: {ARG0} will like to defer raising taxes as long as prudently possible. | I will like {to ARG1}.
  From norm. 2: {ARG0} will defer raising taxes as long as prudently possible. | I will defer {ARG1} as long as prudently possible. | I will defer raising taxes {ARG2}.

Table 1: Step-by-step execution of the procedure to automatically normalize modal constructions (Sentences 1 and 2) and generate potential interpretations (Sentence 2).

Generating Potential Interpretations in Plain Text. Inspired by the rules Blanco and Sarabi (2016) used to generate interpretations from negation, we generate potential interpretations from modal constructions by toggling off combinations of roles in sem roles. We consider numbered roles (ARG0-ARG5) and argument modifiers (ARGM-) ending in LOC, TMP, MNR, PRP, CAU, EXT, PRD or DIR. Table 1 lists some potential interpretations generated from a sample modal construction. The total number of potential interpretations for the 324 selected modal constructions is 1,756 (average: 5.4).
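For concreteness, the role-toggling operation can be sketched as follows. Representing the normalized construction as an ordered list of labeled chunks, with a non-togglable "V" label for the verb, is our own simplification of the procedure:

```python
from itertools import combinations

# Roles whose combinations may be toggled off: numbered roles and the
# argument modifiers considered in the paper.
TOGGLABLE = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5",
             "ARGM-LOC", "ARGM-TMP", "ARGM-MNR", "ARGM-PRP",
             "ARGM-CAU", "ARGM-EXT", "ARGM-PRD", "ARGM-DIR"}

def potential_interpretations(chunks):
    """Generate potential interpretations from a normalized construction.

    `chunks` is the normalization as (label, text) pairs in sentence order;
    non-togglable chunks (e.g., the verb, labeled "V") are always kept.
    Toggled-off roles are rendered as {LABEL} placeholders. This is a
    sketch; the paper's exact surface realization may differ.
    """
    togglable = [lab for lab, _ in chunks if lab in TOGGLABLE]
    for k in range(len(togglable) + 1):          # k roles toggled off
        for off in combinations(togglable, k):
            yield " ".join("{%s}" % lab if lab in off else text
                           for lab, text in chunks)

# Normalization 2 from Table 1, as labeled chunks.
chunks = [("ARG0", "I"), ("V", "will defer"), ("ARG1", "raising taxes"),
          ("ARG2", "as long as prudently possible")]
interps = list(potential_interpretations(chunks))
```

With three togglable roles this yields 2^3 = 8 strings, including the three single-toggle interpretations shown in Table 1.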
We recognize that our procedure to generate implicit interpretations is unable to generate some useful interpretations. For example, from This is [a person who]ARG1 [likely]ARGM-ADV [died]verb [on impact versus perhaps freezing to death]ARGM-MNR, we generate This is a person who died {ARGM-MNR}, which is factual: the only uncertain information is the manner in which the person died. Since we toggle off semantic roles of verb, our procedure is unable to generate A person died on impact and A person died freezing to death; the former interpretation would receive a higher factuality score than the latter. We argue that automation is preferable, and reserve for future work generating interpretations that require splitting semantic roles.

Scoring Potential Interpretations
After automatically generating potential interpretations, we collected manual annotations to determine their factuality. The annotation interface showed the original sentence containing the modal construction, the previous and next sentences as context, and no additional information. Following previous work (Saurí and Pustejovsky, 2009; de Marneffe et al., 2012), we found it useful not to restrict answers to yes or no, but to allow for degrees of certainty. Specifically, we asked "Given the 3 sentences above, do you believe that the statement [potential interpretation] below is true?". Answers are scores ranging from −5 to 5, where −5 indicates Certainly no, 5 indicates Certainly yes, and the scores in between indicate a continuum of certainty (0 indicates unknown).
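The scale can be glossed as follows. Only the endpoints (Certainly no / Certainly yes) and 0 (unknown) are fixed by the annotation scheme, so the intermediate labels below are illustrative assumptions:

```python
def score_label(score):
    """Map an annotation score in [-5, 5] to a rough textual description.

    The scale is a continuum of certainty; only the endpoints and 0 are
    fixed by the annotation scheme, the intermediate labels are our own.
    """
    assert -5 <= score <= 5
    if score == 0:
        return "unknown"
    polarity = "yes" if score > 0 else "no"
    strength = {1: "slightly leaning", 2: "somewhat likely", 3: "probably",
                4: "almost certainly", 5: "certainly"}[abs(score)]
    return "%s %s" % (strength, polarity)
```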
After pilot annotations, we examined disagreements and defined the following simple guidelines:
1. Context (previous sentence, target sentence, and next sentence) is taken into account.
2. World knowledge available at the time the original sentence was authored, not new knowledge available afterwards, is taken into account.
3. Semantic roles toggled off are replaced with a semantically related substitute (Turney and Pantel, 2010) for the original role, e.g., give: take; customer: sales associate.

Corpus Analysis
The total number of modal constructions selected is 324, and the number of potential interpretations automatically generated is 1,756 (average: 5.4 interpretations per modal construction). 39.4% of interpretations are scored with a high degree of certainty. We define high certainty as a score below −3 (the interpretation is false) or above 3 (the interpretation is true). Importantly, on average, modal constructions have 2.13 interpretations scored with high certainty, and 1.23 scored 3 or higher. In other words, on average, our procedure generates over 2 interpretations that are either true or false, and over 1 interpretation that is true, per modal construction. Tables 2 and 3 present basic corpus statistics. The percentage of interpretations annotated with a score different than 0 depends greatly on the number of roles toggled off (Table 2): 0 roles: 87.25%; 1: 48.50%; 2: 20.46%; 3: 5.83%. Note that the number of roles toggled off does not significantly affect the mean score of interpretations not scored 0 (Table 2, last 2 columns). Most interpretations have either ARG0 or ARG1 toggled off (Table 3), and the percentage of interpretations not scored zero ranges from 20% to 32.84% depending on the semantic role. Note that the average score of interpretations scored positively and negatively, however, does not depend on which semantic role is toggled off.
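These aggregates can be recomputed from the raw (construction, score) annotations with a few lines. The function below is a sketch, and the scores fed to it are toy stand-ins, not the actual corpus:

```python
def corpus_stats(scored):
    """Aggregate factuality annotations.

    `scored` is a list of (construction_id, score) pairs with integer
    scores in [-5, 5]. Returns the fraction of high-certainty
    interpretations (score below -3 or above 3), the mean number of
    high-certainty interpretations per construction, and the mean number
    of true (score above 3) interpretations per construction.
    """
    constructions = {cid for cid, _ in scored}
    high = [s for _, s in scored if s < -3 or s > 3]
    true_interps = [s for s in high if s > 3]
    return (len(high) / len(scored),
            len(high) / len(constructions),
            len(true_interps) / len(constructions))

# Toy data: two constructions with five interpretations in total.
frac_high, high_per_c, true_per_c = corpus_stats(
    [("c1", 5), ("c1", -4), ("c1", 0), ("c2", 4), ("c2", 2)])
```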
Table 4 (excerpt): Original sentence and a sample of automatically generated potential interpretations with their annotated scores.
  Example 1, context (previous sentence): The last thing we want to do is react to every wild statement that they make.
  Interpretation 4.1: I will like to defer raising taxes as long as prudently possible. (score: 5)
  Interpretation 4.2: I will defer raising taxes as long as prudently possible. (score: 1)

Annotation Quality
The annotation guidelines (Section 4.2) to score potential interpretations were defined after examining disagreements in pilot annotations. After defining the guidelines, inter-annotator agreement was 0.92 on 18% of randomly selected interpretations. Agreement measures designed for categorical labels are unsuitable, as not all disagreements are equal, e.g., 4 vs. 5 is a much smaller disagreement than −2 vs. 5. Because of the high agreement, and following previous work (Agirre et al., 2012), the rest of the interpretations were annotated once.
Table 4 presents annotation examples. For each example, we include the original sentence containing a selected modal construction, its context (previous and next sentence) if helpful for scoring, and 2 automatically generated potential interpretations with their annotated scores.
Example (1) shows that context helps in determining the factuality of potential interpretations (item (1) in the guidelines). After reading the three sentences, it is clear that they are making wild statements and are hoping to get attention for it. Interpretation 1.1 removes the adverb certainly and receives the highest score, 5. Interpretation 1.2 is obtained after toggling off ARG1, and receives the lowest score, −5. This low score is justified by item (3) in our annotation guidelines: replacing wild statements with a semantically (different but) related substitute, e.g., But they chose reasonable statements / good manners to get our attention and that of the international community, yields an unlikely interpretation.
The interpretations in Example (2) show again the importance of context, and also exemplify item (2) in the annotation guidelines. Interpretation 2.1, We will find them one day, receives a high score (4 out of 5): given the context (and assuming that Rumsfeld is truthful), it is very likely that they will find the weapons of mass destruction, but it is not guaranteed. Note that annotators are not allowed to use the fact that the weapons were never found (item (2) in the guidelines).
In Interpretation 2.2, one day could be replaced with never / at no time or similar constructions, and doing so yields the opposite of the intended meaning (score: −3). A textual description of these scores could be "almost certainly true" (4 out of 5) and "most probably false" (−3 out of −5). We see scores as a continuum of certainty, but textual descriptions may help understand the examples.

Example (3) demonstrates the usefulness of the normalization process, specifically Step 4 (selecting relevant tokens), and the importance of replacing roles with semantically related substitutes (item (3) in the guidelines). In Interpretation 3.1, {ARG0} will act in the interests of the minority holders, ARG0 can be replaced with a company with several minority holders, yielding a valid interpretation scored 4 (out of 5). Similarly, in Interpretation 3.2, A company with a big majority holder will act {ARG1}, ARG1 can be replaced with in the interests of the big majority holder, yielding another valid interpretation also scored 4 (out of 5).
Finally, Example (4) shows Step 5 in the automatic normalization procedure (Section 4). By creating an additional verb-argument structure, we are able to differentiate between liking to do something (Interpretation 4.1, score 5/5) and actually doing that something (Interpretation 4.2, score 1/5).

Learning to Score Potential Interpretations
In order to automatically score potential interpretations, we follow a standard supervised machine learning approach. Each potential interpretation becomes an instance, and we split modal constructions (and their potential interpretations) into training (80%) and test (20%). When splitting, we make sure that the number of modal constructions for each adverb in each split is proportional, i.e., 80% of the modal constructions with each adverb are in the train split and the rest in the test split. Splitting instances randomly would assign interpretations generated from the same modal construction to both the train and test splits, and bias the results.
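This adverb-proportional split can be sketched as follows; the (adverb, construction) pair representation and the function name are our own:

```python
import random

def stratified_split(constructions, train_frac=0.8, seed=0):
    """Split modal constructions so each adverb's constructions are split
    proportionally, and all interpretations generated from the same
    construction stay in the same split (they travel with it).

    `constructions` is a list of (adverb, construction_id) pairs. Sketch;
    field names and representation are assumptions.
    """
    rng = random.Random(seed)
    by_adverb = {}
    for adverb, cid in constructions:
        by_adverb.setdefault(adverb, []).append(cid)
    train, test = [], []
    for group in by_adverb.values():
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy demonstration: 10 constructions with 'likely', 5 with 'surely'.
cs = [("likely", i) for i in range(10)] + [("surely", i) for i in range(10, 15)]
train_ids, test_ids = stratified_split(cs)
```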
We trained a Support Vector Machine (SVM) for regression with RBF kernel using scikit-learn (Pedregosa et al., 2011), which uses LIBSVM (Chang and Lin, 2011). The SVM parameters (C and γ) were tuned using 10-fold cross-validation with the training set, and we report results using the test split.
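A minimal scikit-learn sketch of this learning setup, using random stand-in features and scores (the real instances would use the features of Section 6.1 and the annotated scores in [−5, 5]):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Stand-in data: 100 instances with 12 features and scores in [-5, 5].
rng = np.random.RandomState(0)
X_train = rng.rand(100, 12)
y_train = rng.uniform(-5, 5, 100)

# Epsilon-SVR with an RBF kernel; C and gamma tuned by 10-fold
# cross-validation on the training split (the grid values are assumptions).
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]},
    cv=10,
)
grid.fit(X_train, y_train)
model = grid.best_estimator_
```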

Feature Selection
The full set of features is detailed in Table 5. Baseline features are simple features characterizing adverb and verb; we do not elaborate on them.
Adverb and verb features are extracted from the modal construction (constituent tree and semantic roles) and provide additional information about it. Interpretation features characterize the potential interpretation whose factuality is being scored, and are also derived from the constituent tree and semantic roles. Most adverb and verb features are standard in semantic role labeling (Gildea and Jurafsky, 2002). We include the part-of-speech tags of the parent and of the left and right siblings of adverb and verb, as well as their subcategorization, i.e., the concatenation of their siblings' part-of-speech tags. We also include the syntactic path between adverb and verb, and its length. Additionally, we include the common ancestor, i.e., the label of the lowest syntactic node that is an ancestor of both adverb and verb, and use binary features to indicate whether each semantic role is present in the modal construction.
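Path features of this kind can be illustrated with a small helper. The paper computes paths over constituent trees; the version below works over the syntactic dependencies of Section 3 with an ad-hoc arrow notation, and is only a sketch:

```python
def dep_path(heads, deprels, i, j):
    """Syntactic path between tokens i and j in a dependency tree.

    `heads[k]` is the head index of token k (-1 for the root) and
    `deprels[k]` its dependency label. The path is the labels traversed
    going up from i to the lowest common ancestor ('^') and down to j
    ('v'). Illustrative only; the paper uses constituent-tree paths.
    """
    def ancestors(k):
        chain = [k]
        while heads[k] != -1:
            k = heads[k]
            chain.append(k)
        return chain
    anc_i, anc_j = ancestors(i), ancestors(j)
    common = next(a for a in anc_i if a in anc_j)   # lowest common ancestor
    up = [deprels[a] + "^" for a in anc_i[:anc_i.index(common)]]
    down = [deprels[a] + "v" for a in reversed(anc_j[:anc_j.index(common)])]
    return " ".join(up + down)

# "John likely contracted the disease": heads and labels (0-based tokens).
heads = [2, 2, -1, 4, 2]
deprels = ["nsubj", "advmod", "root", "det", "dobj"]
```

The path length feature is simply the number of arrows, e.g., `len(dep_path(...).split())`.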
Finally, interpretation features characterize the semantic roles toggled off to generate the potential interpretation. We include the number of roles toggled off and binary flags indicating which roles. Additionally, for each role toggled off, we include its distance from the verb (number of tokens), whether it occurs before or after the verb, the syntactic path to the verb, and the length of that path.

Experimental Results

Table 6 details results obtained with test instances using several feature combinations derived from gold linguistic information (POS tags, parse trees, semantic roles, etc.). Baseline and adverb and verb features, which characterize the modal construction from which potential interpretations are extracted, are virtually useless. They yield Pearson correlations of −0.029 and 0.025 individually, and −0.013 combined. These results suggest that the verb and adverb in the modal construction (word forms, syntactic paths, etc.) are insufficient to rank potential interpretations generated from the modal construction.

Interpretation features, which capture differences between the potential interpretations being scored (number of roles toggled off, which roles, etc.), obtain a modest Pearson correlation of 0.494. Combining interpretation features with other features proved detrimental; Pearson correlations drop to between 0.463 and 0.468.
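Pearson correlation, the evaluation measure reported here, is computed between predicted and gold factuality scores; a plain-Python sketch (scipy.stats.pearsonr gives the same value):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy gold scores vs. hypothetical system predictions.
gold = [5, -3, 0, 4]
preds = [4.2, -2.5, 0.5, 3.1]
r = pearson(gold, preds)
```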

Conclusions
Modality is a pervasive phenomenon used to talk about what is not factual. In this paper, we have presented a methodology to extract implicit interpretations from modal constructions. First, we automatically generate potential interpretations using syntactic dependencies and semantic roles, and then assign to them a factuality score.
The most important conclusion of the work presented here is that several interpretations automatically generated from a single modal construction often receive scores indicating high certainty. Indeed, on average, modal constructions have 2.13 interpretations scored −3 or lower, or 3 or higher. This contrasts with previous work, which only assesses the factuality of one normalization per proposition.
Experimental results using supervised machine learning and relatively simple features show that the task is challenging but can be automated. We believe better results could be obtained by incorporating features capturing knowledge in the context of the modal construction, including other clauses in the same sentence and the previous and next sentences. Another extension of the current work is to investigate a similar approach for other modality markers such as nouns (e.g., possibility, chance), adjectives (e.g., necessary, probable), and certain verbs (e.g., claim, suggest).