A Compositional Interpretation of Biomedical Event Factuality

We propose a compositional method to assess the factuality of biomedical events extracted from the literature. The composition procedure relies on the notion of semantic embedding and a fine-grained classification of extra-propositional phenomena, including modality and valence shifting, and a dictionary based on this classification. The event factuality is computed as a product of the extra-propositional operators that have scope over the event. We evaluate our approach on the GENIA event corpus enriched with certainty level and polarity annotations. The results indicate that our approach is effective in identifying the certainty level component of factuality and is less successful in recognizing the other element, negative polarity.


Introduction
The scientific literature is rich in extra-propositional phenomena, such as speculations, opinions, and beliefs, because the scientific method involves hypothesis generation, experimentation, and reasoning to reach, often tentative, conclusions (Hyland, 1998). Biomedical literature is a case in point: Light et al. (2004) estimate that 11% of sentences in MEDLINE abstracts contain speculations and argue that speculations are more important than established facts for researchers interested in current trends and future directions. Such statements may also have an effect on the reliability of the underlying scientific claim. Despite the prevalence and importance of such statements, natural language processing systems in the biomedical domain have largely focused on more foundational tasks, including named entity recognition (e.g., disorders, drugs) and relation extraction (e.g., biological events, gene-disease associations), the former task addressing the conceptual level of meaning and the latter addressing the propositional level.
The last decade has seen significant research activity focusing on some extra-propositional aspects of meaning. The main concern of the studies that focused on the biomedical literature has been to distinguish facts from speculative, tentative knowledge (Light et al., 2004). The studies focusing on the clinical domain, on the other hand, have mainly aimed to identify whether findings, diseases, symptoms, or other concepts mentioned in clinical reports are present, absent, or uncertain (Uzuner et al., 2010). Various corpora have been annotated for relevant phenomena, including hedges (Medlock and Briscoe, 2007) and speculation/negation (Vincze et al., 2008; Kim et al., 2008). Several shared task challenges with subtasks focusing on these phenomena have been organized (Kim et al., 2009; Kim et al., 2012). Supervised machine learning and rule-based approaches have been proposed for these tasks. In general, these studies have been presented as extensions to named entity recognition or relation extraction systems, and they often settle for assigning discrete values to propositional meaning elements (e.g., assessing the certainty of an event).
oration, generally ignored in the studies of extra-propositional meaning (Morante and Sporleder, 2012). The framework uses semantic embedding as the core notion, predication as the representational means, and semantic composition as the methodology. It relies on a fine-grained linguistic characterization of extra-propositional meaning, including modality, valence shifters, and discourse connectives. In the current work, we present a case study of applying this framework to the task of assessing biomedical event factuality (whether an event is characterized as a fact, a counter-fact, or merely a possibility), an important step in determining current trends and future directions in scientific research. For evaluation, we rely on the meta-knowledge corpus (Thompson et al., 2011), in which biological events from the GENIA event corpus (Kim et al., 2008) have been annotated with several extra-propositional phenomena, including certainty level, polarity, and source. We discuss in this paper how two of these phenomena relevant to factuality (certainty and polarity) can be inferred from the semantic representations extracted by the framework. Our results demonstrate that certainty levels can be captured correctly to a large extent with our method and indicate that more research is needed for correct polarity assessment.

Related Work
Modality and negation are the two linguistic phenomena that are often considered in computational treatments of extra-propositional meaning. Morante and Sporleder (2012) provide a comprehensive overview of these phenomena from both theoretical and computational linguistics perspectives. In the FactBank corpus (Saurí and Pustejovsky, 2009), events from news articles are annotated with their factuality values, which are modeled as the interaction of epistemic modality and polarity and consist of eight values: FACT, PROBABLE, POSSIBLE, COUNTER-FACT, NOT PROBABLE, NOT CERTAIN, CERTAIN BUT UNKNOWN, and UNKNOWN. Saurí and Pustejovsky (2012) propose a factuality profiler that computes these values in a top-down manner using lexical and syntactic information. They capture the interaction between different factuality markers scoping over the same event. de Marneffe et al. (2012) investigate veridicality as the pragmatic component of factuality. Based on an annotation study that uses FactBank and Mechanical Turk subjects, they argue that veridicality judgments should be modeled as probability distributions. They show that context and world knowledge play an important role in assessing veridicality, in addition to lexical and semantic properties of individual markers, and use supervised machine learning to model veridicality. Szarvas et al. (2012) draw from previous categorizations and annotation studies to introduce a unified subcategorization of semantic uncertainty, with EPISTEMIC and HYPOTHETICAL as the top level categories. Re-annotating three corpora with this subcategorization and analyzing type distributions, they show that out-of-domain data can be gainfully exploited in assessing certainty using domain adaptation techniques, despite the domain- and genre-dependent nature of the problem.
In the biomedical domain, several corpora have been annotated for extra-propositional phenomena, in particular, negation and speculation. The GENIA event corpus (Kim et al., 2008) contains biological events from MEDLINE abstracts annotated with their certainty level (CERTAIN, PROBABLE, DOUBTFUL) and assertion status (EXIST, NON-EXIST). The BioScope corpus (Vincze et al., 2008) consists of abstracts and full-text articles as well as clinical text annotated with negation and speculation markers and their scopes. While they clearly address similar linguistic phenomena, the representations used in these corpora are significantly different (cue-scope representation vs. tagged events), and there have been attempts at reconciling these representations (Kilicoglu and Bergler, 2010; Stenetorp et al., 2012). BioNLP shared tasks on event extraction (Kim et al., 2009; Kim et al., 2012) and CoNLL 2010 shared task on hedge detection (Farkas et al., 2010) have focused on GENIA and BioScope negation/speculation annotations, respectively. Supervised machine learning techniques (Morante et al., 2010; Björne et al., 2012) as well as rule-based methods (Kilicoglu and Bergler, 2011) have been attempted in extracting these phenomena and their scopes. Wilbur et al. (2006) propose a more fine-grained annotation scheme with multi-valued qualitative dimensions to characterize scientific sentence fragments: certainty (complete uncertainty to complete certainty), evidence (from no evidence to explicit evidence), polarity (positive or negative), and trend/direction (increase/decrease, high/low). In a similar vein, Thompson et al. (2011) annotate each event in the GENIA event corpus with five meta-knowledge elements: Knowledge Type (Investigation, Observation, Analysis, Method, Fact, Other), Certainty Level (considerable speculation, some speculation, and certainty), Polarity (negative and positive), Manner (high, low, neutral), and Source (Current, Other).
Their annotations are more semantically precise as they are applied to events, rather than somewhat arbitrary sentence fragments used by Wilbur et al. (2006). Miwa et al. (2012) use a machine learning-based approach to assign meta-knowledge categories to events. They cast the task as a classification problem and use syntactic (dependency paths), semantic (event structure), and discourse features (location of the sentence within the abstract). They also apply their system to BioNLP shared task data, overall slightly outperforming the state-of-the-art systems.

Methods
We provide a brief summary of the framework here, mainly focusing on predication representation, embedding predicate categorization, and the compositional algorithm.

Predications
The framework uses the predication construct to represent all levels of relational meaning. A predication consists of a predicate P and n logical arguments (logical subject, logical object, adjuncts). Predications can be nested; in other words, they can take other predications as arguments. We call such constructs embedding predications to distinguish them from atomic predications that can only take atomic terms as arguments. While some embedding predications operate at the basic propositional level, extra-propositional meaning is exclusively captured by embedding predications. We use the notion of semantic scope to characterize the structural relationships between predications. A predication Pr_1 is said to embed a predication Pr_2 if Pr_2 is an argument of Pr_1. Similarly, a predication Pr_2 is said to be within the semantic scope of a predication Pr_1 if a) Pr_1 embeds Pr_2, or b) there is a predication Pr_3 such that Pr_1 embeds Pr_3 and Pr_2 is within the semantic scope of, or shares an argument with, Pr_3. Scope relations play an important role in the composition procedure.
A predication also encodes the source (S) and the scalar modality value of the predication (MV_Sc). Formally, then, a predication comprises its predicate P, its source S, its scalar modality value MV_Sc, and its n logical arguments. By default, the source of a predication is the writer of the text (WR). The source may also indicate a term or predication that refers to the source (i.e., who said what is described by the predication? what is the evidence for the predication?). The scalar modality value of the predication is a value in the [0,1] range on a relevant modality scale (Sc), which is assigned according to lexical properties of the predicate P and modified by its discourse context. By default, an unmarked, declarative statement has the scalar modality value of 1 on the EPISTEMIC scale (denoted as 1_epistemic), corresponding to a fact.
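The embedding and semantic scope definitions above can be sketched in code. The following is a minimal illustration, not the framework's implementation: the class name `Predication`, its field names, and the helper functions are hypothetical, and atomic terms are represented simply as strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Union


@dataclass(eq=False)  # identity-based equality: two predications are the same only if they are the same object
class Predication:
    predicate: str                                   # P
    args: List[Union["Predication", str]] = field(default_factory=list)
    source: str = "WR"                               # S, the writer by default
    scale: str = "EPISTEMIC"                         # Sc
    mv: float = 1.0                                  # MV_Sc, 1 on the EPISTEMIC scale = fact

    def embeds(self, other: "Predication") -> bool:
        """Pr_1 embeds Pr_2 if Pr_2 is an argument of Pr_1."""
        return any(a is other for a in self.args)


def shares_argument(p: Predication, q: Predication) -> bool:
    """True if p and q have a common argument (same object, or equal atomic term)."""
    return any(a is b or (isinstance(a, str) and a == b)
               for a in p.args for b in q.args)


def in_scope(outer: Predication, inner: Predication,
             _seen: Optional[Set[int]] = None) -> bool:
    """Pr_2 (inner) is within the semantic scope of Pr_1 (outer) if
    a) outer embeds inner, or
    b) outer embeds some Pr_3 and inner is in the scope of, or shares an argument with, Pr_3."""
    seen = _seen if _seen is not None else set()
    if id(outer) in seen:          # guard against revisiting shared nodes
        return False
    seen.add(id(outer))
    if outer.embeds(inner):
        return True
    for mid in outer.args:
        if isinstance(mid, Predication) and mid is not inner:
            if in_scope(mid, inner, seen) or shares_argument(mid, inner):
                return True
    return False


# Toy structure loosely following "may not involve": may > not > involve
involve = Predication("involve", ["IL-10 up-regulation", "NF-kappaB activation"])
neg = Predication("not", [involve])
may = Predication("may", [neg])
```

Here `may` embeds only `neg`, but `involve` is still within the semantic scope of `may` through clause (b) of the definition.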

Categorization
With the embedding categorization, we aim to provide a fine-grained characterization of the kinds of extra-propositional meanings contributed by predicates that indicate embedding. A synthesis of various linguistic typologies and classifications, the categorization is similar to the certainty subcategorization proposed by Szarvas et al. (2012); however, it not only targets certainty-related phenomena, but is rather a more general categorization of embedding predicates that indicate extra-propositional meaning. We distinguish four main classes of embedding predicates: MODAL, RELATIONAL, VALENCE SHIFTER and PROPOSITIONAL; each class is further divided into subcategories. For the purposes of this paper, MODAL and VALENCE SHIFTER categories are most relevant (illustrated in Figure 1).
A MODAL predicate associates its embedded predication with a modality value on a scale determined by the semantic category of the modal predicate (e.g., EPISTEMIC scale, DEONTIC scale). The scalar modality value (MV_Sc) indicates how strongly the embedded predication is associated with the scale Sc, 1 indicating strongest positive association and 0 negative association. VALENCE SHIFTER predicates do not introduce new scales but trigger a scalar shift of the embedded predication on the associated scale. The MODAL subcategories relevant for factuality computation and examples of predicates belonging to these categories are as follows:
• EPISTEMIC predicates indicate a judgement about the factual status of the embedded predication (e.g., may, possible).
• DYNAMIC predicates indicate ability or willingness towards an event (e.g., able, want).
• INTENTIONAL predicates indicate effort of an agent to perform an event (e.g., aim).
• INTERROGATIVE predicates indicate questioning or inquiry towards the embedded event (e.g., investigate).
• SUCCESS predicates indicate degree of success associated with the embedded predication (e.g., manage, fail).
Each subcategory is associated with its own modality scale, except the EVIDENTIAL category, which is associated with the EPISTEMIC scale. The categories listed above also have secondary epistemic readings, in addition to their primary scale; for example, INTERROGATIVE predicates can indicate uncertainty. The EPISTEMIC scale is the most relevant scale to investigate factuality. Our model of this scale and how modal auxiliaries correspond to it is illustrated in Figure 2. It is similar to the characterization of factuality values by Saurí and Pustejovsky (2012), although numerical epistemic values are assigned to predications (MV_epistemic), rather than discrete values like Probable or Fact. In this, the characterization follows that of Nirenburg and Raskin (2004), which lends itself more readily to the type of operations proposed for scalar modality values. The SCALE SHIFTER subcategory of valence shifters also plays a role in factuality assessment. Predicates belonging to this category change the scalar modality value of the predications in their scope. The subtypes of this category are NEGATOR, INTENSIFIER, DIMINISHER, and HEDGE. A DIMINISHER predicate (e.g., hardly) lowers the modality value, while an INTENSIFIER increases it (e.g., strongly). On the other hand, a negation marker belonging to the NEGATOR category (e.g., no in no indication) inverts the modality value of the embedded predication. The HEDGE category contains attribute hedges (e.g., mostly, in general) (Hyland, 1998), whose effect is to make the embedded predication more vague. We model this by decreasing, increasing or leaving unchanged the modality value depending on the position of the embedded predication on the scale.
Lexical and semantic knowledge about predicates belonging to embedding categories is encoded in a dictionary, which currently consists of 987 predicates, 544 of them belonging to MODAL and 95 to SCALE SHIFTER categories. A very preliminary version of this dictionary was introduced in Kilicoglu and Bergler (2008). It was later extended and refined using several corpora and linguistic classifications (including Saurí (2008) and Nirenburg and Raskin (2004)). Since predicates collected from external resources do not neatly fit into embedding categories and we target deeper levels of meaning distinctions, the dictionary construction involved a fair amount of manual refinement. The dictionary encodes the lemma and part-of-speech of the predicate as well as its extra-propositional meaning senses. Each sense consists of five elements:
1. Embedding category, such as ASSUMPTIVE.
3. Embedding relation classes indicate the semantic dependencies used to identify the logical object argument of the predicate.
4. Scope type indicates whether the predicate allows a wide or narrow scope reading (for example, in I don't think that P, because think allows narrow scope reading, the negation is transferred to its complement (I think that not P)).

5. Argument inversion (true/false) determines whether the object and subject arguments should be switched in semantic interpretation.
The entry in Table 1 indicates that the modal auxiliary may is associated with two modal senses (i.e., it is ambiguous) with differing scalar modality values. It also indicates that a predication embedded by SPECULATIVE may will be assigned the epistemic value of 0.5 initially. Scope type and argument inversion attributes are not explicitly given, indicating default values for each.

Composition
Semantic composition is the procedure of bottom-up predication construction using the knowledge encoded in the dictionary and syntactic information in the form of dependency relations. Dependency relations are extracted using the Stanford CoreNLP toolkit (Manning et al., 2014). We use the Stanford collapsed dependency format (de Marneffe et al., 2006) for dependency relations. We illustrate the salient steps of this procedure on a sentence from the GENIA event corpus (sentence 9 from PMID 10089566 shown in row (1) in Table 2). For brevity, the simplified version of the sentence is given in row (2), in which textual spans are substituted with the corresponding event annotations.
As the first step in the procedure, the syntactic dependency graphs of sentences of a document are combined and transformed into a semantically enriched, directed, acyclic semantic document graph through a series of dependency transformations. The nodes of the semantic graph correspond to textual units of the document and the direction of the arcs reflects the direction of the semantic dependency between its endpoints. The transformation is guided by a set of rules, illustrated on row (3). For example, the first three transformations are due to the Verb Complex Transformation rule, which reorders the dependencies that a verb is involved in such that semantic scope relations with the auxiliaries and other verbal modifiers are made explicit. The resulting semantic dependencies on the right indicate that involve is within the scope of not, which in turn is in the scope of may, and the entire verb complex may not involve is within the scope of thus, which indicates a discourse relation.
Table 2, row (1): Thus HIV-1 gp41-induced IL-10 up-regulation in monocytes may not involve NF-kappaB, MAPK, or PI3-kinase activation, but rather may operate through activation of adenylate cyclase and pertussis-toxin-sensitive Gi/Go protein to effect p70(S6)-kinase activation.

The next steps of the compositional algorithm, argument identification and scalar modality value composition, play a role in factuality assessment. (The compositional steps that we do not discuss here are source propagation and argument propagation.) Argument identification is the process of determining the logical arguments of a predication, based on the bottom-up traversal of the semantic graph. It is guided by argument identification rules, each of which defines a mapping from a lexical category and an embedding class to a logical argument type. Such a rule applies to a predicate specified in the dictionary that belongs to the lexical category and serves as the head of a semantic dependency labeled with the embedding relation class. With argument identification rules, we determine, for example, that the second instance of may in the example has as its logical object the predication indicated by operate, since there is an AUX embedding relation between may and operate, which satisfies the constraint defined in the embedding dictionary (Table 1). Scalar modality value composition is the procedure of determining the relevant scale for a predication and its modality value on this scale. The following principles are applied:
1. Initially, every predication is assigned to the EPISTEMIC scale with the value of 1 (i.e., a fact).
2. A MODAL predicate places its logical object on the relevant MODAL scale and assigns to it its prior scalar modality value, specified in the dictionary.
3. A SCALE SHIFTER predicate does not introduce a new scale but changes the existing scalar modality value of its logical object.
4. The scalar influence of an embedding predicate (P) extends beyond the predications it embeds to another predication in its scope (Pr_e), if one of the following constraints is met:

Assuming that we have a predicate P which indicates an embedding predication Pr and a predication (Pr_e) under its scalar influence, the scalar modality value of Pr_e is updated differently, based on whether the predicate P is a MODAL or a SCALE SHIFTER predicate. All update operations used for MODAL predicates are given in Table 3 and those for SCALE SHIFTER predicates in Table 4. For MODAL predicates, the composition is modeled as the interaction of the prior scalar modality value of the embedding predicate (MV_Sc(P_modal)) in the first column and the current scalar modality value associated with the embedded predication (MV_Sc(Pr_e)) in the second column, resulting in the updated MV_Sc(Pr_e) shown in the third column. When P is a scale-shifting predicate, the update procedure is guided by its type, as illustrated in Table 4. X and Y represent arbitrary values in the range of [0,1]. For the example shown in Table 2, the computation in row (1) of Table 3 applies when we encounter the SPECULATIVE may node dominating the operate node in the semantic graph: since MV_epistemic(may)=0.5 and operate at the time of composition has epistemic value of 1, its scalar modality value gets updated to 0.5.

Table 4, row (1): Type NEGATOR; MV_Sc(Pr_e) = 0.0; updated MV_Sc(Pr_e) = 0.5.

When not, a NEGATOR, is encountered in composition, the scalar modality value of its embedded predication (involve) is updated to 0, due to row (2) in Table 4 (1-1=0). In the next step of composition, when the first instance of SPECULATIVE may is encountered, the nodes in its scope, not and involve, have epistemic values of 1 and 0, respectively.
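The two worked updates described in the text (SPECULATIVE may over operate, and not over involve) can be traced with a small sketch. It covers only the cases the text states explicitly: a MODAL predicate with a prior value applied to an embedded value of 1, and a NEGATOR applied to an embedded value of 0.0 or inverting it otherwise. The function names are hypothetical, treating the inversion as 1 - X for every non-zero X is an assumption generalized from the single worked case (1-1=0), and the remaining rows of Tables 3 and 4 are deliberately not guessed.

```python
def apply_modal(prior: float, current: float) -> float:
    """Update for a MODAL predicate, restricted to the case described in the text:
    an embedded predication that is still a fact (value 1) takes the modal's prior value."""
    if current == 1.0:
        return prior
    # Other combinations are defined in Table 3 and are not reproduced in this sketch.
    raise NotImplementedError("only the (prior, 1) case is covered in this sketch")


def apply_negator(current: float) -> float:
    """Update for a NEGATOR predicate.
    A value of exactly 0.0 moves to 0.5 (the NEGATOR row with condition '= 0.0').
    Otherwise the value is inverted; this generalizes the worked case 1 - 1 = 0
    and is an assumption for values strictly between 0 and 1."""
    if current == 0.0:
        return 0.5
    return 1.0 - current


# SPECULATIVE "may" (prior 0.5) dominating "operate", which starts as a fact (1.0)
operate_mv = apply_modal(prior=0.5, current=1.0)

# "not" (NEGATOR) embedding "involve", which starts as a fact (1.0)
involve_mv = apply_negator(1.0)

print(operate_mv, involve_mv)
```

Keeping the MODAL and SCALE SHIFTER updates as separate functions mirrors the distinction drawn in the text: the former places a predication on a scale with a prior value, the latter only shifts an existing value.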
Row (4) in Table 2 shows the annotations generated by the system. The system takes as input GENIA event annotations (e.g., CORRELATION and REGULATION), which we expand with scalar modality values and sources. For example, E32..E34, three events triggered by involve and annotated as CORRELATION events in GENIA, have epistemic value of 0.5 and WR as the source (only one of the events, E32, is shown for brevity). The system also generates other embedding predications (indicated with EM) corresponding to fine-grained extra-propositional meaning. To clarify, the content of the first three predications (first an event and the latter two extra-propositional) is expressed in natural language below:
• E32: Correlation between gp41-induced IL-10 upregulation and NF-kappaB activation is POSSIBLE according to the author.
• EM53: That there is no correlation between gp41-induced IL-10 upregulation and NF-kappaB activation is POSSIBLE according to the author.
• EM57: That it is possible there is no correlation between gp41-induced IL-10 upregulation and NF-kappaB activation is a FACT according to the author.

Data and Evaluation
We assessed our methodology on the meta-knowledge corpus (Thompson et al., 2011), in which GENIA events are annotated with certainty levels (CL) and polarity. This corpus consists of 1000 MEDLINE abstracts and contains 34,368 event annotations. Uncertainty is only annotated in this dataset for events with Analysis knowledge type. Such events correspond to 17.6% of the entire corpus. Of all Analysis events, 33.6% are annotated with L2 (high confidence), 11.4% with L1 (low confidence), and 55% with L3 (certain) CL values. Polarity, on the other hand, is annotated for all events (6.1% negative).
Factuality values are often modeled as discrete categories (e.g., PROBABLE, FACT). Thus, to evaluate our approach, we converted the scalar modality values associated with predications (MV Sc ) to discrete CL and polarity values using mapping rules, shown in Table 5. The rules were based on the analysis of 100 abstracts that we used for training.

Table 5. Mapping scalar modality values to event certainty and polarity (columns: Condition, Annotation).
We evaluated CL mappings in two ways: a) we restricted it only to Analysis type events, the only ones annotated with L1 and L2 values, and b) we evaluated them on the entire corpus. For polarity, we only considered the entire corpus. As evaluation metrics, we calculated precision, recall, and F 1 score as well as accuracy on the discrete values we obtained by the mapping.
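As a reminder of how these metrics are computed on discrete labels, the following is a minimal sketch using the standard per-label definitions of precision, recall, and F1, plus overall accuracy. The function names are hypothetical and the gold and predicted label lists are made-up toy data, not results from the corpus.

```python
from typing import List, Tuple


def prf(gold: List[str], pred: List[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single label (e.g., 'L2')."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of events whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Toy data (illustrative only): four events with gold and mapped CL values
gold_cl = ["L3", "L3", "L2", "L1"]
pred_cl = ["L3", "L2", "L2", "L1"]

print(prf(gold_cl, pred_cl, "L2"))
print(accuracy(gold_cl, pred_cl))
```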
Another evaluation focused more directly on factuality. We represented the gold CL-polarity pairs as numerical values and calculated the average distance between these values and those generated by the system. The lower the distance, the better the system can be considered. In this evaluation scheme, annotating a considerably speculative (L1) event as somewhat speculative (L2) is penalized less than annotating it as certain (L3). We mapped the gold annotations to the numerical values as follows: L3-

Results and Discussion
The results of mapping the system annotations to discrete values annotated in the meta-knowledge corpus are provided in Table 6.
When the CL evaluation is limited to Analysis events, we obtain an accuracy of approximately 82%. The baseline considered by Miwa et al. (2012) Table 5. Secondly, knowing whether an event is an Analysis event or not is a significant factor in determining the CL value and their machine learning features are likely to have exploited this fact, whereas we did not attempt to identify the knowledge type of the event. Thirdly, L1 and L2 values appear only for Analysis events, therefore the evaluation scenario that only considers Analysis events is likely to overestimate the performance of our system on L1 and L2 and underestimate it on L3.
While our system performed similarly to Miwa et al.'s with regard to positive polarity, our mappings for negative polarity were less successful, which suggests that modeling negative polarity as the lower end of several modal scales (the last row of Table 5) may not be sufficient for correctly capturing the polarity values. Our preliminary analysis of the results indicates that scope relationships between predications could play a more significant role. In other words, whether an event is in the scope of a predication triggered by a NEGATOR predicate may be a better predictor of negative polarity.
With the evaluation scheme that is based on average distance, we obtained a distance score of 0.12. For the majority class baseline, this score would be 0.21. Our score shows clear improvement over the baseline; however, it is not directly comparable to Miwa et al.'s results. This evaluation scheme, to our knowledge, has not previously been used to evaluate factuality and we believe it is better suited to the gradable nature of factuality.
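The distance-based score can be illustrated with a short sketch. It assumes that "average distance" means the mean absolute difference between the gold numerical value and the system value for each event, which the text does not state explicitly. The function name and the numbers below are made up for illustration and are not the gold value mapping used in the evaluation.

```python
from typing import Sequence


def average_distance(gold_values: Sequence[float],
                     system_values: Sequence[float]) -> float:
    """Mean absolute difference between gold and system factuality values
    (assumed interpretation of 'average distance'; lower is better)."""
    if len(gold_values) != len(system_values):
        raise ValueError("gold and system value lists must have the same length")
    return sum(abs(g - s) for g, s in zip(gold_values, system_values)) / len(gold_values)


# Toy values (illustrative only): three events on a [0, 1] scale
toy_gold = [1.0, 0.5, 0.0]
toy_system = [1.0, 0.75, 0.5]

print(average_distance(toy_gold, toy_system))
```

Under this reading, a system value that lands near the gold value contributes a small penalty, while one at the opposite end of the scale contributes a large one, which is what makes the score sensitive to the gradable nature of factuality.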
Analyzing the results, we note that many errors are due to problems in dependency relations and transformations that rely on them. Errors in dependency relations are common due to the complexity of the language under consideration, and these errors are further compounded by hand-crafted transformation rules that can at times be inadequate in capturing semantic dependencies correctly. In the following example, the prepositional phrase attachment error caused by syntactic parsing (to suppress. . . is attached to the main verb result, instead of to ability) prevents the system from identifying the semantic dependency between ability and suppress, causing an L2 recall error. While the system uses a transformation rule to correct some prepositional phrase attachment problems, this particular case was missed.
• The reduction in gene expression resulted from the ability of IL-10 to suppress IFN-induced assembly of signal transducer . . .
• prep_to(result,suppress) vs. prep_to(ability,suppress)
Prior scalar modality values in the dictionary have been manually determined and are fixed. They are able to capture the meaning subtleties to a large extent and the composition procedure attempts to capture the meaning changes due to markers in context. However, some uncertainty markers are clearly more ambiguous than others, leading to different certainty level annotations in similar contexts and our method may miss these differences due to the fixed value in the dictionary. For example, the adjective potential has been almost equally annotated as an L1 and L2 cue in the meta-knowledge corpus. This also seems to confirm the finding of de Marneffe et al. (2012) that world knowledge and context have an effect on the interpretation of factuality.
We also noted what seem like annotation errors in the corpus. For example, in the sentence "IL-1beta stimulation of epithelial cells did not generate any ROIs", the event expressed with generation of ROIs seems to have negative polarity, even though it is not annotated as such in the corpus.

Conclusion
We presented a rule-based compositional method for assessing factuality of biological events. The method is linguistically motivated and emphasizes generality over corpus-specific optimizations, and without making much use of the corpus for training, we were able to obtain results that are comparable to the performance of the state-of-the-art systems for certainty level assignments. The method was less successful with respect to polarity assessment, suggesting that the hypothesis that negative polarity can be modeled as corresponding to the lower end of the modal scales may be inadequate. In future work, we plan to develop a more nuanced approach to negative polarity.