Paths for uncertainty: Exploring the intricacies of uncertainty identification for news

Currently, news articles are produced, shared and consumed at an extremely rapid rate. Although their quantity is increasing, their quality and trustworthiness are becoming harder to assess. Hence, it is important not only to automate information extraction but also to quantify the certainty of the extracted information. Automated identification of certainty has been studied in both the scientific and newswire domains, but performance is considerably higher in tasks focusing on scientific text. We compare the differences in the definition and expression of uncertainty between a scientific domain, i.e., biomedicine, and newswire. We delve into the different aspects that affect the certainty of an extracted event in a news article and examine whether they can be easily identified by techniques already validated in the biomedical domain. Finally, we present a comparison of the syntactic and lexical differences between the expression of certainty in the biomedical and newswire domains, using two annotated corpora.


Introduction
The increasing amount of data readily available in digital form across various domains presents challenges for both researchers and the general public. Although this has greatly improved access to data and dissemination of knowledge, it is becoming increasingly difficult to quickly identify a piece of information that is pertinent to our needs among the vast amounts of data, as well as to assess its certainty and credibility. Advances in information extraction methods, and in particular event extraction (McClosky et al., 2011;Nguyen et al., 2016;Cao et al., 2016), have produced complex information structures that can capture n-ary relations between entities and better represent facts and statements made by authors.
While being able to extract rich information in a structured manner is important, not all extracted information is equally trustworthy. It is thus necessary to apply measures of confidence that will allow us to assess the credibility of events mined from different documents. Such measures may take into account different factors affecting our confidence in a specific event, such as the reliability of the source (Lucassen and Schraagen, 2010), the timeliness of the event (Pustejovsky, 2017), the performance of the event extraction tool etc. Along with such "external" factors affecting our trust in the event, another important aspect is how certainty is expressed in the context of the event by the author, since not all information mentioned in text is expressed with equal certainty. Some events are explicitly identified as speculations, as hypothetical situations, as disputed allegations, as conditional facts, and so on. Thus, it is important to complement event extraction methods with identification of such textual phenomena, in order to enrich extracted events with an attribute of certainty.
Identification of textual uncertainty and hedging is a mature research topic, with an emphasis on the scientific domain (Hyland, 1998). Methods to detect certainty and related types of information are widely applied in the field of biomedical text mining to assess the veracity of information, and the problem is approached both in terms of framing certainty and annotating corpora accordingly, and by applying machine learning techniques for the automated identification of uncertain statements and events (Kilicoglu et al., 2017;Malhotra et al., 2013). In the news domain, while machine learning techniques have been used to mine sentiment, subjectivity, etc., efforts concerned with (un)certainty identification have focussed mostly on the provision of classification frameworks for uncertainty (Rubin, 2010) or its combination with polarity to determine event factuality (Sauri and Pustejovsky, 2007). However, there has been less emphasis on applications that focus on automatically recognising uncertainty, especially in relation to events. Moreover, early attempts at automated identification of uncertainty cues (weasels) in both the general and biomedical domains showed a difference of more than 0.30 in F-score between the two domains (0.50 for Wikipedia versus 0.87 for Bio (Tang et al., 2010)), illustrating the challenges of uncertainty identification in the general language domain.
Newswire text can prove more problematic in terms of uncertainty identification, since news stories tend to be reported in a subjective manner (Godbole et al., 2007;Vis, 2011) and allow for less strict use of language, while the truth value of reported events depends greatly on the time and context in which an article is written. As uncertainty identification is affected by various textual phenomena that are challenging to contextualise (metaphorical speech, colloquial expressions, etc.), methods that identify event uncertainty from context are becoming increasingly crucial. The widespread use of the term "fake news" in recent years highlights the need to distinguish valuable and reliable facts, especially when it comes to automated information extraction. While detection of fake news is an involved process requiring more in-depth discourse and stance analysis (Thorne et al., 2017), identifying the certainty of extracted events is an important parameter in assessing the credibility of such events. The availability of an increasing number of resources annotated with news events and concepts related to uncertainty provides good opportunities to apply and adapt uncertainty identification techniques focussed on news articles.
In this work, we present our efforts on adapting uncertainty event extraction techniques developed for biomedical text, to allow them to be applied to newswire text. We use two corpora annotated with events and meta-knowledge (different types of interpretative information within a sentence that can affect an event (Thompson et al., 2011)) to analyse the differences between the two domains and we discuss the challenges that arise. We evaluate a hybrid machine learning approach to the identification of different uncertainty aspects (see Section 3.2.1) and propose ways of improving and customising uncertainty identification for newswire.

Related Work
In this section, we provide an overview of related work on uncertainty in both the scientific and newswire domains. We examine different classification frameworks of uncertainty and related concepts, the availability of annotations and existing classification systems used in each field.
The means of conveying uncertainty have long been studied by linguists, using a range of different terminology. Palmer (2001) introduced the term epistemic modality to refer to the degree of commitment to the truth of a proposition. The term continues to be used, especially for scientific text (De Waard and Maat, 2012;Vold, 2006) along with other related terms, such as factuality, which combines the notions of uncertainty and polarity (Saurí, 2017), veracity and evidentiality (Cornillie, 2009;Davis et al., 2007). The use of hedge words and their impact on the certainty of statements has also been studied extensively both in the scientific (Morante et al., 2010) and generic domain (Ganter and Strube, 2009). As computational technologies have evolved, there has been an increasing interest in the implications of textual uncertainty and the way it is expressed, resulting in a wide range of classification frameworks and annotation efforts.
In the scientific domain, Light (2004) studied uncertainty in biomedical papers, classifying expressions as denoting high or low certainty. Medlock and Briscoe (2007) further expanded the categorisation to incorporate the cases of admission of lack of knowledge, relays of hypotheses from others, speculative questions and hypotheses (investigation). More recently, Chen (2018) proposed a wider definition of uncertainty that covers phenomena of citation distortion, contradictions and claim inconsistencies, and also presented a method based on word embeddings for expanding a small seed list of cues to generate rich resources for uncertainty identification.
The aforementioned concepts have also been annotated in corpora at different levels of granularity. The BioScope corpus (Vincze et al., 2008), as well as the biomedical part of the CoNLL 2010 task (Farkas et al., 2010), contain annotations of speculation and negation cues and their scope within the sentence. The BioNLP Shared Task corpora (Kim et al., 2009, 2011; Nédellec et al., 2013) also contain speculation and negation annotations, marked up as attributes of events. The GENIA-MK corpus (Thompson et al., 2011) also contains event-level attribute annotations, but covers more meta-knowledge aspects, including certainty level, polarity and knowledge type (see Section 3.2.1). Various models for the automated identification of the types of information annotated in these corpora have been developed, with the best-performing methods using a combination of rules and machine learning approaches. Overall, performance is highest for sentence-based annotations, with recent work reaching an F-score of 0.97 on BioScope (Kilicoglu et al., 2017), while on the event-level annotations of GENIA-MK, the best reported F-score surpasses 0.80 for the 3-level certainty classification problem (Miwa et al., 2012) and 0.88 for the binary problem (Zerva et al., 2017).
Bridging definitions of uncertainty across different domains, Szarvas (2012) proposes a hierarchical categorisation which distinguishes between two main classes: hypothetical and epistemic uncertainty. Vincze (2013) attempts a different categorisation, looking at discourse-level uncertainty and related phenomena as they appear in text in the generic domain (Wikipedia). They identify three different types of uncertainty: weasels (relevant but insufficiently specified arguments), hedges and peacocks (exaggerated, subjective statements).
In work dealing with newspaper articles, subjectivity has been identified as a further phenomenon (along with hedging and speculation) that is inextricably related to the expression of uncertainty (Rubin, 2007;Morante and Daelemans, 2009). Moreover, Rubin (2010) proposes a four-dimensional classification of certainty, also pointing out the aspects of timeliness and focus (abstract versus factual information). Their proposed annotation schema was applied to a small corpus of 82 documents. In terms of further resources, FactBank (Saurí and Pustejovsky, 2009) is a small corpus of newswire texts annotated with events, accompanied by their factuality value (a combination of certainty level and polarity) judged from the viewpoint of their sources. The MPQA corpus (Cardie et al., 2003) elaborates on the issue of subjectivity and combines it with polarity markers to classify different opinions. The ACE 2005 corpus (Walker et al., 2006) contains events from news texts that are annotated with meta-knowledge attributes, including modality and genericity. Subsequently, the meta-knowledge annotations were extended to include, among others, the aspect of subjectivity (see Section 3.2.1). More recently, there has been significant work on assessing the factuality and credibility of news articles, as part of the fake-news challenge (FNC-I), which focusses on detection of stance.
In comparison to the scientific domain, there have been relatively fewer attempts to automatically identify uncertainty in news text, apart from the classification of particular aspects that embody uncertainty, such as subjectivity (Wilson, 2008). The most significant work is the Wikipedia-related task of CoNLL 2010, which concerned weasel cue detection. The best-performing systems at the time compared poorly to results in the biomedical field, but more recently Jean (2016) proposed a probabilistic model that achieved an F-score of 55.7, showing a promising degree of improvement. Even more encouragingly, there have recently been important efforts on the classification of factuality values based on FactBank and related factuality corpora (UW, MEANTIME), showing great improvements in their predictions (Stanovsky et al., 2017;Lee et al., 2015) compared to earlier attempts (Prabhakaran et al., 2010). Such efforts motivate our interest in studying the detection of uncertainty in the newswire domain.

Methods
In this section, we provide a definition of the problem we aim to tackle, as well as definitions of terms that we use subsequently. We also describe the datasets and resources that we have used, and we present the methods and technical details used for the experiments and analysis in Section 4.

Event Definition
In both the GENIA-MK and the ACE-MK corpora, the definition of events shares some core properties. An event necessarily consists of one trigger entity and usually one or more participant NEs (arguments) that are linked to the trigger. The trigger entity determines the type of the event, and is usually a single word (a verb, noun or adjective) that describes the event. Similarly, the relation between the trigger and each argument determines the argument's role. Examples of events from the two domains are presented in Figure 1.
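This shared structure can be sketched as a simple data model (the class and field names below are our own, for illustration only; they are not the corpora's actual schemas):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    role: str   # determined by the trigger-argument relation, e.g. "Person"
    text: str   # surface form of the participant NE

@dataclass
class Event:
    trigger: str              # the (usually single) word describing the event
    event_type: str           # determined by the trigger entity
    arguments: List[Argument] = field(default_factory=list)

# A newswire-style example in the spirit of Figure 1:
event = Event(trigger="nominate", event_type="Personnel:Nominate",
              arguments=[Argument(role="Person", text="Bush")])
```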

Uncertainty Identification Task
As described in the previous section, uncertainty can be interpreted in different ways. In this work, we cast uncertainty identification as the task of identifying textual information (cues) that renders the truth of a specific event uncertain. Hence, uncertainty is treated as an attribute of an event, rather than an attribute of a sentence or clause. This is because it has been shown that a given unit of text may contain more than one event, each with a potentially different level of uncertainty (Saurí and Pustejovsky, 2009; Thompson et al., 2017). We limit the discovery of uncertainty cues to those occurring in the same sentence as the event in question, following the annotations of the two corpora.
We cast uncertainty identification as a binary classification task, where an event can either be certain or uncertain. Our decision was motivated by the findings of Rubin (2007), who showed that a finer-grained classification of uncertainty (5 levels) resulted in unacceptably low levels of inter-annotator agreement.
We treat the uncertainty of an event as an attribute that can be affected by various factors (modality, hypothesis, subjectivity, etc.) that are already annotated in existing corpora. Hence, we want to take advantage of existing corpus annotations and examine how they relate to uncertainty, either individually or in combination. We examine the performance and robustness of the automated uncertainty identification method developed by Zerva et al. (2017) on different combinations of meta-knowledge dimensions to draw our conclusions, acknowledging that (as discussed in Section 2) different dimensions may affect uncertainty in different domains. In the following section, we describe the datasets, as well as the meta-knowledge annotations that we consider to be related to uncertainty identification in the biomedical and newswire domains.

Datasets and Uncertainty
We focus our analysis for the newswire domain on the recent annotations of the ACE 2005 corpus (Walker et al., 2006) (English version).
The corpus was originally annotated with named entities (NEs), events and some meta-knowledge information, and has subsequently been enriched with additional meta-knowledge annotations (Thompson et al., 2017). We refer to the meta-knowledge annotated version of the corpus as ACE-MK 1 . The corpus comprises 600 news articles originating from various sources, and contains annotations for 5349 events. The ACE-MK meta-knowledge annotation scheme includes six meta-knowledge attributes, of which four were present in the original 2005 annotated corpus, while the remaining two were introduced in the 2017 annotation enrichment effort (the latter are marked with an asterisk in the enumeration that follows). The respective cues for each type were annotated whenever present within a sentence.
1. Subjectivity (*) towards the event by the source. Can be Positive, Negative, Neutral or Multi-valued (two or more sources expressing opposite sentiments for the same event).
2. Source (*), that can be Author, Involved (attributed to a specified source, somehow involved with the event) or Third-Party.
3. Modality, that can have four possible values: Asserted, Speculated, Presupposed (*) and Other.
4. Polarity, that can be either Positive or Negative.
5. Tense, that can be Past, Present, Future or Unspecified.
6. Genericity, that can either be Specific (event referring to a specific occurrence) or Generic.
As discussed in Section 2, various concepts, such as modality, subjectivity, genericity and timeliness, have been linked to uncertainty in the newswire domain (Saurí and Pustejovsky, 2009). In fact, most of the aforementioned event attributes annotated in ACE-MK could affect event certainty. In this work, we focus on the dimensions of Modality, Genericity and Subjectivity. Considering these three attributes, as well as their combination, as uncertainty indicators, we generate four different test sets, each corresponding to a different definition of uncertainty:
1. M: uncertainty corresponds only to Modality, and only Asserted events are equivalent to Certain. Based on descriptions in (Baker et al., 2014;Szarvas et al., 2012).
2. G: uncertainty corresponds only to Genericity, and only Specific events are equivalent to Certain. We thus assume that generic, vaguer events lack certainty, inspired by the distinction between abstract and specific statements in (Rubin, 2010).
3. S: uncertainty corresponds only to Subjectivity, and only Neutral events are equivalent to Certain. This is based on (Wiebe and Riloff, 2005), which showed that positive or negative bias can affect the certainty of an event. Multi-valued instances are treated as Uncertain, since contradictory assertions have also been linked to uncertainty (Alamri, 2016).
4. MGS: uncertainty corresponds to the union of the above; only an event that is Asserted, Neutral and Specific is considered Certain.
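The four definitions above can be summarised as a simple mapping from meta-knowledge values to binary labels. The sketch below illustrates the logic; the dictionary keys and value strings are our own simplification of the ACE-MK annotation scheme:

```python
def derive_label(event, definition):
    """Map ACE-MK meta-knowledge values to a binary Certain/Uncertain label.

    `event` is a dict with "modality", "genericity" and "subjectivity" values;
    `definition` is one of "M", "G", "S" or "MGS".
    """
    checks = {
        "M": event["modality"] == "Asserted",      # only Asserted is Certain
        "G": event["genericity"] == "Specific",    # only Specific is Certain
        "S": event["subjectivity"] == "Neutral",   # Multi-valued -> Uncertain
    }
    if definition == "MGS":                        # union of the three
        return "Certain" if all(checks.values()) else "Uncertain"
    return "Certain" if checks[definition] else "Uncertain"
```

Note that under "S", Multi-valued instances fail the Neutral check and are therefore labelled Uncertain, matching the treatment described above.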
In both corpora, the annotations of all meta-knowledge dimensions are at the event level (the values for each event are annotated separately). The evidence, when it can be attributed to one or more words in the same sentence as the event, is annotated as a cue for the corresponding dimension and linked to the event(s) that it affects. In Figure 2 (a-b) we demonstrate one example from each corpus where the cue affects only one of the events in a sentence. While in both corpora, for most of the dimensions investigated, the cues are word sequences distinct from the trigger of the event, for Subjectivity there are cases where the trigger also acts as a Subjectivity cue. This is because, according to the definition of Subjectivity in ACE-MK, any biased attitude expressed in the text denotes subjectivity (including expressions of intention, command, fear, hope, condemnation, etc.). Example (c) in Figure 2 demonstrates such a case. We train and test separate classifiers for each case and discuss their performance and the implications for the predictability of uncertainty.
We should note that Polarity has been identified as a dimension that is orthogonal to uncertainty (Saurí and Pustejovsky, 2009), and thus we choose not to include it in our investigation, although both corpora contain such annotations. In future work, we would like to investigate the combination of certainty and polarity further, and perhaps extend our analysis to the FactBank corpus. It would also be interesting, as future work, to expand our experiments and investigate whether Tense could be used to account for the timeliness aspect, or whether Source could help to identify weaselling phenomena, thus expanding the coverage of uncertainty. To account for these two dimensions efficiently, we would like to include additional resources, such as timeliness or citation analysis components.
Apart from comparing performance among the different uncertainty-related definitions described above, we compare our results for ACE-MK with those obtained for a biomedical corpus, GENIA-MK (Kim et al., 2003;Thompson et al., 2011), for binary uncertainty identification using the same hybrid method, as reported in (Zerva et al., 2017).
The GENIA-MK corpus consists of 1000 abstracts extracted from PubMed and annotated with 36,858 events 2 . It has also been annotated with meta-knowledge attributes for each event, along with the respective cues. The meta-knowledge attributes for each event include Certainty Level (L1, L2, L3), Polarity (Positive, Negative), Manner (High, Low and Neutral), Source (Current, Other) and Knowledge Type (Investigation, Observation, Analysis, Method, Fact, Other). Of these, Certainty Levels L1 and L2, as well as the Knowledge Type Investigation, were treated as uncertainty indicators (denoting an event as Uncertain).
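The corresponding binary mapping for GENIA-MK is then very compact. A sketch (the value strings are assumptions based on the attribute values listed above):

```python
def genia_uncertain(certainty_level, knowledge_type):
    """An event is treated as Uncertain if its certainty level is L1 or L2,
    or if its knowledge type is Investigation; otherwise it is Certain."""
    return certainty_level in {"L1", "L2"} or knowledge_type == "Investigation"
```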

Machine Learning Approach
For the experiments described in Section 4.1 we use a hybrid machine learning approach to classify ACE-MK events as Certain or Uncertain. We use a Random Forest (RF) classifier (Liaw et al., 2002) and a range of semantic, lexical, syntactic and dependency features. The majority of the lexical features are related to the cue and its surface and grammatical properties, while syntactic and dependency features are related to the syntactic dependencies between the cue and the event.
Features also include dependency-based rules that capture one and two-hop paths between the cue and an event trigger. Finally, there is an additional set of features related to the semantics of the event itself (event type, arguments). A more detailed description and examples of the features can be found in Appendix A.
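One way the one- and two-hop dependency rules might be collected is sketched below. The edge representation (head, label, dependent) and the direction markers are our own, not the exact feature encoding used in the paper:

```python
def path_rules(edges, cue, trigger):
    """Collect labelled one- and two-hop dependency paths from cue to trigger.

    edges: list of (head, label, dependent) tuples from a dependency parse.
    Hops are tagged with their direction: "label>" for head-to-dependent,
    "<label" for dependent-to-head.
    """
    adj = {}
    for head, label, dep in edges:
        adj.setdefault(head, []).append((label + ">", dep))
        adj.setdefault(dep, []).append(("<" + label, head))
    rules = []
    for label1, mid in adj.get(cue, []):
        if mid == trigger:                 # one-hop path
            rules.append((label1,))
            continue
        for label2, end in adj.get(mid, []):
            if end == trigger:             # two-hop path
                rules.append((label1, label2))
    return rules
```

For a toy parse of "may indicate a possible rise", the Modality cue "may" is one hop from the trigger "indicate", while "possible" reaches it in two hops via "rise".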
The full processing of the ACE-MK corpus, including other NLP tasks such as sentence splitting, tokenisation, etc., was performed using the Argo platform, a web-based, graphical workbench that facilitates the construction and execution of modular text mining workflows. The RF classifier was implemented as dedicated components using the WEKA API (Frank et al., 2004). We used 10-fold cross-validation to evaluate and compare the performance of the different generated models. Since some of the features are sentence- and/or document-based, we avoided the built-in 10-fold cross-validation of the WEKA API, and instead modified the random fold generation so that no document would be split over several folds, thus ensuring the models were not biased or overfitted to specific documents.
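The document-aware fold assignment can be sketched as follows. This is a minimal stand-in for the modified WEKA fold generation; the function and variable names are our own:

```python
import random

def document_folds(doc_ids, k=10, seed=0):
    """Assign whole documents to folds so that no document is split across folds.

    doc_ids: one document identifier per instance.
    Returns a fold index (0..k-1) for each instance.
    """
    docs = sorted(set(doc_ids))
    rng = random.Random(seed)
    rng.shuffle(docs)                              # randomise document order
    doc_to_fold = {d: i % k for i, d in enumerate(docs)}
    return [doc_to_fold[d] for d in doc_ids]
```

All instances from the same document receive the same fold index, so a model is never trained and tested on material from the same article.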

WordNet-based Analysis
In order to interpret the differences in the performance of our models between GENIA-MK and ACE-MK, we compared the lexical and semantic properties of the cues in each corpus. For this purpose, we used WordNet (Miller, 1995) version 3.0 to examine the synsets and relations between uncertainty cues, the generated word graphs and the distributions of cues per synset. To process cues against information contained within WordNet, the JWI API (Finlayson, 2014) was used.
In order to study the links between cues, we consider WordNet as a multi-graph where each word is a node, and all potential relations between two words constitute an edge. The types of relations are used as edge attributes. To generate the graph from each corpus, we start with the lemmatised cues and iteratively expand the graph using a set of available relations between words as well as synsets until there are no other nodes to visit. We use all relations available in WordNet between synsets and words, but we exclude expansion for some senses that are semantically irrelevant to all potential cues, as described in Appendix B.
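The iterative expansion can be sketched as a breadth-first traversal over a word/relation multi-graph. The toy relation dictionary below stands in for actual WordNet lookups; the words and relation names are illustrative only:

```python
from collections import deque

def expand_graph(seeds, relations):
    """Iteratively expand a set of seed cues over a relation multi-graph
    until there are no new nodes to visit (breadth-first traversal).

    relations: word -> list of (relation_type, neighbouring word).
    Returns the node set and the edge set, with relation types as edge labels.
    """
    nodes, edges = set(seeds), set()
    frontier = deque(seeds)
    while frontier:
        word = frontier.popleft()
        for rel, neighbour in relations.get(word, []):
            edges.add((word, rel, neighbour))   # relation type as edge attribute
            if neighbour not in nodes:
                nodes.add(neighbour)
                frontier.append(neighbour)
    return nodes, edges
```

In our setting, the `relations` lookup would be backed by WordNet (via the JWI API), with semantically irrelevant senses excluded as described in Appendix B.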
The analysis and visualisation of the graphs was performed using Gephi (Bastian et al., 2009).

Automated Classification of Uncertainty
As a first step, we used the set of cues extracted from GENIA-MK to generate all features in the cue- and dependency-related feature sets. We then trained models and evaluated their performance on each of the test sets of the ACE-MK corpus, as shown in the top three rows of Table 1. The results show that the classifier trained with GENIA-MK cues does not achieve particularly high performance for any of the three cases of uncertainty, or for their combination. We subsequently replaced the GENIA-MK cues with those extracted from the ACE-MK corpus, and repeated the experiments, as shown in the bottom three rows of Table 1.
When using ACE-MK cues, F-score increases significantly (p < 0.01) for all test sets. This is mostly due to a consistent improvement in recall across all test sets (in terms of precision, it is only in the case of Modality that the ACE-MK cues outperform the GENIA-MK cues). This result confirms the domain dependence of uncertainty expressions and stresses the need for domain-specific approaches to achieve higher performance. More interestingly, however, we notice that even when using ACE-MK cues, the performance we obtain is significantly lower than the performance obtained when the same method is applied to the GENIA-MK corpus. Indeed, we see in Table 2 that on GENIA-MK, even when using cues extracted from ACE-MK, performance is significantly higher for all metrics (Zerva et al., 2017).
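The paper does not specify which significance test underlies the reported p-value; one standard choice for comparing the F-scores of two classifiers on the same test set is an approximate randomization test, sketched below (all function names are our own):

```python
import random

def f_score(gold, pred, positive="Uncertain"):
    """Binary F-score with respect to the positive class."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def randomization_test(gold, pred_a, pred_b, trials=10000, seed=0):
    """Approximate randomization: swap the two systems' outputs per instance
    and count how often the shuffled F-score gap matches the observed one."""
    rng = random.Random(seed)
    observed = abs(f_score(gold, pred_a) - f_score(gold, pred_b))
    hits = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:   # swap this instance's predictions
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(f_score(gold, shuffled_a) - f_score(gold, shuffled_b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # p-value estimate
```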
Genericity seems to be the hardest attribute to distinguish, especially in terms of precision. This can be explained through an examination of the training data, which reveals that very few Generic event instances are linked to a Genericity cue. Thus, while there is a sufficient number of training instances for Generic events (1132 Generic versus 4217 Specific), strong feature vectors can only be produced for a few of them. The classifier also seems to have difficulty predicting Subjectivity, but for different reasons. Looking more closely at the results for Subjectivity, we discovered that one issue relates to Multi-valued test cases, which are particularly complex, since they often involve more than one Subjectivity cue linked to the event, and at the same time are significantly under-sampled (18 instances). Moreover, Subjectivity cues seem to involve more nouns and longer, often colloquial expressions than the other dimensions.
Further enhancement of the machine learning approach and feature engineering could address such issues, in order to better identify the Subjectivity and Genericity dimensions. A possible future direction would be to enhance the current feature vectors with methods that can account for the positive or negative bias of nouns, or with other methods borrowed from work on subjectivity. Coupled with a training corpus containing more positive instances, such methods could help draw further conclusions.
In the last column of Table 1 we present the performance of the models trained on the combined dimensions. By combining the meta-knowledge dimensions into one uncertainty identification task, we obtain improved performance compared to the individual tasks. This provides an indication that relationships exist between these different dimensions in the context of detecting uncertainty. Still, as mentioned earlier, we notice that for all possible combinations, performance is lower than the results reported for biomedical corpora using the same machine learning approach, even when we use cues extracted from the same corpus. This difference in score, even in the case of Modality, much like that observed by Tang et al. (2010) for the CoNLL datasets, provides motivation to look more closely into the differences between the means of expressing uncertainty in the two domains. In the next section, we attempt to interpret this difference in performance, and explore why the cue- and dependency-based features used might be less effective for the newswire domain, and what could be done to remedy this.

Comparison of the Properties of Uncertainty Cues Between Corpora

Dependency-based Comparison
As mentioned in Section 3.2.1, the machine learning classifiers used in this work are heavily dependent on features related to the dependencies between potential uncertainty cues and the triggers of events. To extract dependency paths, we first apply a dependency parser to obtain the dependency relations for each sentence of the corpus. The Enju dependency parser (Miyao et al., 2008) was used for both corpora, with models trained on biomedical and newswire data for GENIA-MK and ACE-MK, respectively. We then treat the dependencies as a directed graph and examine the shortest paths between annotated cues and event triggers, as shown in the example of Figure 4. In the case of multi-word cues or multi-word triggers, we consider the shortest possible path between any word of the cue and any word of the trigger. The comparison of dependency path lengths for the two corpora can be seen in Figure 3.
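The shortest-path computation over the directed dependency graph, with the minimum taken over all word pairs for multi-word cues and triggers, can be sketched as a multi-source BFS (a simplification that omits edge labels; the names are our own):

```python
from collections import deque

def shortest_path_len(edges, cue_words, trigger_words):
    """Length of the shortest directed path from any cue word to any trigger
    word, or None if no path exists (the 'non-existent path' bucket)."""
    adj = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
    targets = set(trigger_words)
    seen = set(cue_words)
    frontier = deque((w, 0) for w in cue_words)   # multi-source BFS
    while frontier:
        node, dist = frontier.popleft()
        if node in targets:
            return dist
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return None
```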
Figure 3: Histogram of the length distribution of shortest dependency paths between uncertainty cues and triggers for ACE-MK and GENIA-MK.

It is clear from the distribution that the dependency path lengths for the GENIA-MK corpus (gray-striped bars) follow a long-tail pattern, with more than 50% of the cues being directly linked to the trigger and more than 85% being at a distance of three or fewer dependency links. In contrast, for the ACE-MK corpus we observe a more evenly spread distribution: to cover 85% of the cases, we need to include dependency paths of up to length 7. Looking at the last bar of the histogram, which accounts for paths longer than ten (10) hops or non-existent paths, we note that the percentage of such cases for ACE-MK is double that for GENIA-MK.
This difference in the dependency path distribution could explain why features based on dependency paths and dependency rules are not as effective for newswire documents. Indeed, an analysis of feature informativeness (using Mutual Information measures (Battiti, 1994)) for the two corpora further supports these observations. Among the 30 top-scoring features for GENIA-MK, 19 are dependency features (14 of them dependency rules), versus only 5 dependency features for ACE-MK (and only 1 dependency rule). These observations reveal a potentially higher complexity in sentence syntax and language structure in newswire texts, as opposed to scientific texts. For example, in ACE-MK we observe more occurrences of event triggers that are nouns located far from the main verb (and surrounding modals), and of cues indicating uncertainty (especially Subjectivity) found in a different sub-phrase than the event (see Figure 4). There are also some badly structured sentences where the dependency paths are distorted due to problematic syntax.

Figure 4: Dependency paths between cue (red-bold) and trigger (green-underlined) for ACE-MK. Arrows denote the edges of the dependency graph that participate in the shortest path between cue and trigger. In (a), could is a Modality cue influencing a Personnel nominates event. In (b), a phrase is annotated as a Subjectivity cue and the event is Personnel end position.

This difference may occur as a result of the greater freedom of expression in news articles as opposed to scientific texts, where language and syntax follow stricter rules, and formal expressions are preferred to colloquial ones. Although it has been shown that even in scientific text many statements are far from factual assertions, we can expect phenomena of vagueness, weaselling, hedging and speculating to be much more prevalent in news articles than in scientific ones. It should be noted, though, that this difference might be further aggravated by the fact that GENIA-MK consists of abstracts, where requirements for precise language are even stricter.
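The feature informativeness scores discussed above rest on mutual information between a feature column and the class label, which for discrete features can be computed as below (a minimal sketch of the measure itself, not the exact selection procedure of Battiti (1994)):

```python
import math
from collections import Counter

def mutual_information(feature, label):
    """Mutual information (natural log) between a discrete feature column
    and the class label, estimated from empirical counts."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    fx, fy = Counter(feature), Counter(label)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy * log( p_xy / (p_x * p_y) ), with probabilities as counts / n
        mi += p_xy * math.log(p_xy * n * n / (fx[x] * fy[y]))
    return mi
```

A feature independent of the label scores 0, while a perfectly predictive binary feature scores log 2 (the label entropy), which is the behaviour exploited when ranking features.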

Lexical Comparison
The two corpora, and their respective domains, seem to differ not only in syntax. Focussing on the lexical and semantic properties of the cue lists, we also found a set of differences at this level. A simple initial observation concerns the difference in cue length, in terms of the number of words, between the two domains. We can see in Figure 5 that in GENIA-MK, with the exception of some very lengthy outliers, most cues are one- or two-word expressions. In contrast, ACE-MK contains lengthier uncertainty expressions, including various colloquial expressions, weasels, etc. We also examined the semantic properties of the two cue lists and generated a WordNet graph for each corpus, as described in Section 3.4. Apart from the sense limitation mentioned before, there was no further attempt to disambiguate cues that belonged to more than one synset. Instead, all possible synsets for each word were added to the graph, resulting in a total of 781 synsets covered by the cues for GENIA-MK, compared to 1444 synsets for ACE-MK. Thus, the cues in ACE-MK seem to have a far broader semantic coverage, which implies much greater lexical variability and harder-to-predict cues. To generate the graphs, we use the words in the cue list as seed nodes and then expand them to include all 1-hop neighbours and corresponding edges for each cue. We end up with a graph of 4293 nodes for GENIA-MK and 6123 nodes for ACE-MK.
Looking at the connectivity properties of the two graphs and the number of connected components (sub-graphs), we notice that the GENIA-MK graph has only two connected sub-graphs, versus fifteen for ACE-MK. This difference in the number of sub-graphs is a further indication of the difference in semantic range between the two corpora, although it should be noted that, for both corpora, 85% of the nodes are contained in the largest sub-graph.
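The connectivity profile reported above (number of components and the share of nodes in the largest one) can be computed directly with networkx; the small example graph below is purely illustrative and not drawn from either corpus.

```python
import networkx as nx

def component_profile(g):
    """Return the number of connected components and the fraction of
    nodes contained in the largest one (the paper reports 2 components
    for GENIA-MK vs 15 for ACE-MK, with ~85% of nodes in the largest)."""
    comps = sorted(nx.connected_components(g), key=len, reverse=True)
    share = len(comps[0]) / g.number_of_nodes()
    return len(comps), share

# Toy graph: one 4-node chain plus one isolated edge -> 2 components.
g = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])
n_comps, share = component_profile(g)
print(n_comps, round(share, 2))  # → 2 0.67
```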
We then carried out modularity-based community detection on the two graphs (Newman, 2006) in order to identify and visualise patterns in the senses of each graph. We focussed on the ten largest communities (by node count) and their central nodes.
To identify central nodes, we ranked nodes using three different centrality measures: betweenness, closeness (Brandes, 2001) and eccentricity (Hage and Harary, 1995), and then used the intersection of the top-ranked nodes for each measure. We provide the visualisation of the graphs in Appendix C. As expected, in both graphs the communities are semantically related, and it is easy to see that in some communities the central nodes are related to uncertainty (likelihood, probability, etc.). Some of the communities revolve around similar concepts, such as ability, probability, communication and investigation, although the concepts are expressed using different terms.
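A minimal sketch of this community-plus-centrality analysis is given below. It uses networkx's greedy modularity maximisation as a stand-in for the exact modularity-based method of Newman (2006) used in the paper, and a synthetic barbell graph instead of the real sense graphs; the numeric node labels are illustrative only.

```python
import networkx as nx
from networkx.algorithms import community

def central_nodes(g, k=3):
    """Intersect the top-k nodes under betweenness, closeness and
    eccentricity.  Eccentricity is undefined on disconnected graphs,
    so we restrict to the largest connected component; lower
    eccentricity means more central, hence the ascending sort."""
    sub = g.subgraph(max(nx.connected_components(g), key=len))
    top_btw = sorted(sub, key=nx.betweenness_centrality(sub).get, reverse=True)[:k]
    top_cls = sorted(sub, key=nx.closeness_centrality(sub).get, reverse=True)[:k]
    top_ecc = sorted(sub, key=nx.eccentricity(sub).get)[:k]
    return set(top_btw) & set(top_cls) & set(top_ecc)

# Toy graph: two 4-cliques bridged through node 4, so the bridge
# region should come out as central under all three measures.
g = nx.barbell_graph(4, 1)
comms = community.greedy_modularity_communities(g)  # modularity-based
print(len(comms), sorted(central_nodes(g)))
```

On this toy graph the intersection picks out the bridge node and its two clique attachment points, mirroring how the intersection of the three measures singles out a small, agreed-upon set of central senses in each community.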
It is important to note that, using only a 1-hop expansion of the original cues gathered from the two corpora, we were able to generate graphs with semantically meaningful communities. Hence, it would be interesting to further explore the use of WordNet and other semantic graphs as an unsupervised way to expand cue lists and apply them to previously unseen data. This could prove particularly useful for domains lacking annotated resources.
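The unsupervised expansion idea suggested above can be sketched as a simple frontier search over a semantic graph. The `neighbors_fn` callback and the toy synonym dictionary below are hypothetical placeholders; in practice they would be backed by WordNet lemma lookups or a word-embedding nearest-neighbour index.

```python
def expand_cues(cues, neighbors_fn, hops=1):
    """Unsupervised cue-list expansion: add every word reachable within
    `hops` steps in a semantic graph such as WordNet.  `neighbors_fn`
    maps a word to its directly related words."""
    expanded = set(cues)
    frontier = set(cues)
    for _ in range(hops):
        # Collect all 1-hop neighbours of the current frontier that
        # are not already in the cue list.
        frontier = {nb for w in frontier for nb in neighbors_fn(w)} - expanded
        expanded |= frontier
    return expanded

# Toy semantic neighbourhood standing in for WordNet relations.
toy = {"allege": ["claim", "assert"], "claim": ["maintain"],
       "maybe": ["perhaps", "possibly"]}
new_cues = expand_cues(["allege", "maybe"], lambda w: toy.get(w, []))
print(sorted(new_cues))
```

The expanded list could then be matched against unseen text as a high-recall first pass, with the classifier deciding which matches genuinely mark uncertainty.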

Conclusion
In this paper we have analysed uncertainty identification in the newswire domain and compared it with the scientific (biomedical) domain, both in terms of the definition of uncertainty and the performance of methods. We have explored the different metaknowledge aspects available in newswire corpora, in terms of their relation to uncertainty and the feasibility of their automated identification in text.
We have shown that it is possible to transfer methods similar to the ones employed in the biomedical domain to the automated identification of uncertain events in news text. However, we found that, regardless of whether uncertainty detection is restricted to individual dimensions or treated as a combined task, performance is significantly lower than that obtained by applying the same methods to biomedical articles. To understand the reasons for this difference, we have analysed the syntactic and lexical properties of textual uncertainty in the newswire domain, and have discovered a number of factors that render the task of uncertainty identification more difficult to tackle in newswire documents. Our analysis has highlighted the role of longer dependencies between cues and events as one of the main issues complicating the task in newswire articles, along with lengthy cues exhibiting increased semantic variability.
We consider this work a promising first step towards a more detailed and fine-grained approach to uncertainty identification in the newswire domain. As future work, we aim to take advantage of our findings regarding the syntactic and lexical properties highlighted above in order to build more robust classifiers. Moreover, we would like to expand our analysis of uncertainty in the newswire domain using word embeddings, and potentially expand the definition of uncertainty in a similar fashion to Chen et al. (2018). To support this goal, we also intend to experiment with further corpora in the newswire domain.
Efficient uncertainty identification will provide a useful tool for more meaningful and semantically interpretable information extraction.

In the case of 2-hop rules, the lemma of the word between the cue and the event trigger in the path is also captured as part of the rule (as shown in the Modality rule of Figure 6).

B Appendix B: WordNet Senses
When using WordNet for the graph generation, we excluded some of the lexicographic sense groups available in WordNet, since they were judged to be too distant from uncertainty expressions (e.g., those referring to specific objects). The choice was guided by the description of each sense, in order to avoid senses that do not relate to any of the dimensions of uncertainty described in the main document. By excluding senses related to concepts such as food, countries and activities, we reduce the complexity, size and processing time of the resulting graphs. Nevertheless, the inclusion of such senses could be interesting to consider in future experiments, to see if they can better account for metaphors and colloquial expressions. Alternatively, graphs generated by word-embedding approaches could be studied and compared against the WordNet ones. We list the inclusion/exclusion decision for each of the senses in the
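The filtering step described above can be sketched as a check against WordNet's lexicographer file names (e.g. `noun.food`, `verb.cognition`). The exclusion list below is hypothetical, chosen only to illustrate the mechanism; the actual decisions are those listed in this appendix, and the synset names in the example are stand-ins for real lookups.

```python
# Hypothetical exclusion list of WordNet lexicographer sense groups
# judged too distant from uncertainty expressions.
EXCLUDED_LEXNAMES = {"noun.food", "noun.location", "noun.act"}

def keep_synset(lexname):
    """Keep a synset only if its lexicographer file is not excluded."""
    return lexname not in EXCLUDED_LEXNAMES

# (synset name, lexicographer file) pairs standing in for WordNet data.
senses = [("doubt.n.01", "noun.cognition"),
          ("taco.n.01", "noun.food"),
          ("suspect.v.01", "verb.cognition")]
kept = [name for name, lexname in senses if keep_synset(lexname)]
print(kept)  # → ['doubt.n.01', 'suspect.v.01']
```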

C Appendix C: WordNet Graphs
We present below the ACE-MK and GENIA-MK graphs described in Section 4.2.2 of the main part of the article. Different colours signify different communities, as identified by community detection based on the modularity index of nodes. We visualise only the ten largest communities (in terms of participating nodes). We also visualise the top-scoring words (regarded as representatives of each community) according to the combination of the closeness, betweenness and eccentricity metrics.
Figure 7 shows the graph for the ACE-MK corpus, while Figure 8 shows the corresponding graph for GENIA-MK.