Speculation and Negation Scope Detection via Convolutional Neural Networks

Speculation and negation are important cues for identifying the factuality of text. In this paper, we propose a Convolutional Neural Network (CNN)-based model with probabilistic weighted average pooling to address speculation and negation scope detection. In particular, our CNN-based model extracts meaningful features from various syntactic paths between the cues and the candidate tokens in both constituency and dependency parse trees. Evaluation on BioScope shows that our CNN-based model significantly outperforms the state-of-the-art systems on Abstracts, a sub-corpus in BioScope, and achieves comparable performance on Clinical Records, another sub-corpus in BioScope.


Introduction
Factual information is critical to understanding a sentence or a document in most typical NLP applications. Speculation and negation extraction has been drawing more and more attention in recent years due to its importance in distinguishing counterfactual or uncertain information from facts. Generally speaking, speculation is a type of uncertain expression between certainty and negation, while negation is a grammatical category which reverses the truth value of a proposition.
Commonly, speculation and negation extraction involves two typical subtasks: cue identification and scope detection. Here, a cue is a word or phrase that has speculative or negative meaning (e.g., suspect, guess, deny, not), while a scope is a text fragment governed by the corresponding cue in a sentence. Consider the following two sentences as examples:
(S1) The doctors warn that smoking [may harm our lungs].
(S2) He does [not like playing football] but likes swimming.
In sentence S1, the speculative cue "may" governs the scope "may harm our lungs", while the negative cue "not" governs the scope "not like playing football" in sentence S2.
Previous work has achieved considerable success on cue identification (e.g., an F1-score of 86.79 for speculative cue detection in Tang et al. (2010)). In comparison, speculation and negation scope detection remains a challenge due to its inherent difficulties and upstream errors. In this paper, we focus on scope detection. Previous work on scope detection can be classified into heuristic rule-based methods (e.g., Özgür et al., 2009; Øvrelid et al., 2010), machine learning based methods (e.g., Tang et al., 2010; Zou et al., 2013), and hybrid approaches which integrate empirical models with manual rules.
Different from those previous studies, this paper presents a Convolutional Neural Network (CNN)-based approach for scope detection. CNN models, first introduced to capture more abstract features for computer vision (LeCun et al., 1989), have achieved considerable success on various NLP tasks in recent years, such as semantic role labeling (Collobert et al., 2011), machine translation (Meng et al., 2015; Hu et al., 2015), and event extraction (Nguyen et al., 2015). These studies have proved the ability of CNN models to learn meaningful features.
In particular, our CNN-based model extracts various kinds of meaningful features from the syntactic paths between the cue and the candidate token in both constituency and dependency parse trees. The importance of syntactic information in scope detection has been justified in previous work (e.g., Lapponi et al., 2012; Zou et al., 2013). Our model can also benefit from the ability of neural networks to extract useful information from syntactic paths (Xu et al., 2015a; Xu et al., 2015b) or more complex syntactic trees (Ma et al., 2015; Tai et al., 2015). Moreover, instead of traditional average pooling, our CNN-based model utilizes probabilistic weighted average pooling to alleviate the overfitting problem (Zeiler et al., 2013). Experimental results on BioScope prove the effectiveness of our CNN-based model.
The remainder of this paper is organized as follows: Section 2 gives an overview of the related work. Section 3 describes our CNN-based model with probabilistic weighted average pooling for scope detection. Section 4 illustrates the experimental settings, and reports the experimental results and analysis. Finally, Section 5 draws the conclusion.

Related Work
In this section, we give an overview of previous work on both scope detection and utilization of CNNs in NLP applications.

Scope Detection
Earlier studies on speculation and negation scope detection focused on manually developing various heuristic rules to detect scopes. Chapman et al. (2001) developed various regular expressions for negation scope detection. Subsequently, various kinds of heuristic rules began to emerge. Özgür et al. (2009) resorted to the part-of-speech tags of the speculative cues and the syntactic structures of the current sentences to identify scopes, and developed heuristic rules according to the syntactic trees. Øvrelid et al. (2010) constructed a set of heuristic rules on dependency structures and obtained an accuracy of 66.73% on the CoNLL evaluation data. The approaches based on heuristic rules were effective because the sentence structures in BioScope satisfy some grammatical rules to a certain extent.
With the release of the BioScope corpus (Szarvas et al., 2008), machine learning based methods began to dominate the research of speculation and negation scope detection. Morante et al. (2008) regarded negation scope detection as a chunk classification task utilizing lexical and syntactic features. Morante et al. (2009a) further implemented a scope detection system combining three classifiers, i.e., TiMBL, SVM and CRF, based on shallow syntactic features, and achieved performances of 77.13% and 73.36% in Percentage of Correct Scopes (PCS) on speculation and negation scope detection, respectively, on Abstracts, a sub-corpus of BioScope. A later study explored a hybrid method, adopting manually crafted rules over dependency parse trees and a discriminative ranking function over nodes in constituency parse trees. Zou et al. (2013) proposed a tree kernel based approach on syntactic parse trees to detect speculation and negation scopes.
Alternative studies treated scope detection as a sequential labeling task. Tang et al. (2010) proposed a CRF model with POS tags, chunks, named entities, and dependency relations as features. Similarly, Lapponi et al. (2012) employed a CRF model with lexical and dependency features for negation scope and event resolution on the Conan Doyle corpus. These machine learning methods manifest the effectiveness of syntactic features.

CNN based NLP Applications
Currently, CNNs have achieved considerable success on various NLP tasks, e.g., part-of-speech tagging, chunking, and named entity recognition (Collobert et al., 2011). Specifically, CNNs have been proven effective in extracting sentence-level features. For instance, Zeng et al. (2014) utilized a CNN-based model to extract sentence-level features for relation classification. Zhang et al. (2015) proposed a shallow CNN-based model for implicit discourse relation recognition, and another study presented a CNN-based model with dynamic multi-pooling for event extraction.
More recently, researchers have tended to learn features from complex syntactic trees. Ma et al. (2015) used a CNN-based model for sentence embedding, utilizing dependency tree-based n-grams. Xu et al. (2015a) exploited a CNN-based model to learn features from the shortest dependency path between the subject and the object for semantic relation classification.

CNN-based Modeling with Probabilistic Weighted Average Pooling
This section describes our CNN-based model for speculation and negation scope detection, which is recast as a classification task to determine whether each token in a sentence belongs to the scope of the corresponding cue or not. Principally, our CNN-based model first extracts path features from syntactic trees with a convolutional layer and concatenates them with their relative positions into one feature vector, which is then fed into a softmax layer to compute the confidence scores of its location labels, described in subsection 3.1.

Token Labeling
We employ the following labeling scheme for each candidate token:
- A token is labeled as O if it is NOT an element of a speculation or negation scope;
- A token is labeled as B if it is inside a scope and occurs before the cue, i.e., P_token < P_cue, where P_token and P_cue are the positions of the token and the cue in a sentence, respectively;
- A token is labeled as A if it is inside a scope and occurs after the cue (inclusive), i.e., P_token ≥ P_cue.
Under this scheme, each token in a sentence is classified into B, A or O. For example, the labels of all the tokens in sentence S3 are shown in sentence S4.
(S3) They think that [those bacteria may be killed by white blood cells], but other researchers do not think so.
(S4) They/O think/O that/O [those/B bacteria/B may/A be/A killed/A by/A white/A blood/A cells/A] ,/O but/O other/O researchers/O do/O not/O think/O so/O ./O
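The labeling scheme above can be sketched in a few lines of Python (the function and argument names are illustrative, not from the paper):

```python
def label_tokens(tokens, cue_index, scope):
    """Assign B/A/O labels under the scheme described above.

    tokens:    list of tokens in the sentence
    cue_index: position of the cue token
    scope:     (start, end) span of the scope, end-exclusive
    """
    start, end = scope
    labels = []
    for i, _ in enumerate(tokens):
        if not (start <= i < end):
            labels.append("O")   # outside any scope
        elif i < cue_index:
            labels.append("B")   # inside the scope, before the cue
        else:
            labels.append("A")   # inside the scope, at or after the cue
    return labels
```

For sentence S2, with the cue "not" at position 2 and the scope "not like playing football", this yields O O A A A A O O O O.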
The advantage of our scheme is that it can describe the location relationship among the tokens, cues and scopes more precisely than some previous studies, which regarded scope detection as a binary classification task (Øvrelid et al., 2010; Zou et al., 2013). Compared to other schemes with more than two labels (Morante et al., 2009a; Tang et al., 2010; Lapponi et al., 2012), our scheme can much alleviate the imbalance of labels, because the tokens occurring at the first or last positions of the scopes are much fewer than other tokens.

Figure 1 shows the framework of our neural network based model. We concentrate on two features, the Path Feature and the Position Feature. They are concatenated into one feature vector, which is finally fed into the softmax layer to obtain the output vector.

Figure 1: The framework of CNN for scope detection.

Relative position has been proven useful in previous studies (e.g., Zeng et al., 2014). In this paper, relative position is defined as the relative distance of the cue to the candidate token. For instance, in sentence S1, the relative distances of the cue "may" to the candidate tokens "warn" and "our" are 3 and -2, respectively. The values of position features are mapped into a vector P of dimension d_p, with P initialized randomly.
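Under this definition, the relative distance is simply the cue position minus the token position, which reproduces the values quoted for S1 (a minimal sketch; the function name is illustrative):

```python
def relative_positions(tokens, cue_index):
    """Relative distance of the cue to each candidate token (P_cue - P_token)."""
    return [cue_index - i for i in range(len(tokens))]
```

For S1, the distance to "warn" is 3 and to "our" is -2, matching the example above.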

Input Representation
Instead of the word sequence (e.g., Zeng et al., 2014; Zhang et al., 2015), we argue that the shortest syntactic path from the cue to the candidate token can offer effective features to determine whether a token belongs to the scope. It is remarkable that the lowest common ancestor node of the cue and the token is the highest tree node in the path. Figure 2 illustrates the architecture of our CNN-based model to extract path features. Here, convolutional features are first extracted from the matrix of embeddings of the path, and then fed into the hidden layer to produce more complicated features.
In this paper, the syntactic paths between the cues and the candidate tokens in constituency and dependency parse trees are both considered. Figure 3 presents the constituency parse tree of sentence S1 and the constituency path from the cue "may" to the candidate token "our". It shows that the tokens are at both the beginning and the end of the path with the arrows indicating the directions. Meanwhile, Figure 4 displays the dependency parse tree of sentence S2 and the dependency path from the cue "not" to the token "playing".
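The path extraction through the lowest common ancestor can be sketched generically over parent pointers (a hedged sketch, not the paper's exact data structure; arrow items are omitted for brevity):

```python
def shortest_syntactic_path(parent, cue, token):
    """Path from the cue to the token through their lowest common ancestor.

    parent: dict mapping each node to its parent (root maps to None);
    nodes may be sentence tokens or syntactic-category labels.
    """
    def ancestors(n):
        chain = [n]
        while parent[n] is not None:
            n = parent[n]
            chain.append(n)
        return chain

    up = ancestors(cue)                     # cue -> root
    seen = set(ancestors(token))
    # climb from the cue until we reach an ancestor of the token: the LCA
    lca_i = next(i for i, n in enumerate(up) if n in seen)
    down = ancestors(token)                 # token -> root
    upward = up[:lca_i + 1]                 # cue up to (and including) the LCA
    downward = list(reversed(down[:down.index(up[lca_i])]))  # LCA down to token
    return upward + downward
```

On a toy constituency fragment for S1 (may and our under VP and NP), this returns the path may → VP → NP → our, with the LCA as the highest node, as described above.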
As the input of our CNN-based model, both the constituency path and the dependency path between the cue and the token can be regarded as special "sentences" S = (t_1, t_2, …, t_n), whose "words" can be tokens of sentences, syntactic categories, dependency relations, and arrows.
Similar to other CNN-based models, we also consider a fixed size window of tokens around the current token to capture its local features in the path. Here, the window size is set as an odd number w, indicating that there are (w-1)/2 tokens before and after the current token, respectively. Each item in the path is mapped to an embedding via a lookup table T_0, where d_0 is the dimension of the embeddings and |T_0| is the size of the table. In this way, path S is transformed into a matrix X_0 of window-concatenated embeddings.
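The windowing step can be sketched as follows (a hedged sketch: the function name, the dict-based lookup table, and the all-zero padding vector are assumptions, not details from the paper):

```python
import numpy as np

def path_to_matrix(path, table, w):
    """Map a syntactic path to the input matrix X_0.

    path:  list of path items (tokens, categories, relations, arrows)
    table: dict item -> embedding vector of dimension d_0 (lookup table T_0)
    w:     odd window size; (w-1)/2 items of zero padding on each side
    Returns an array of shape (len(path), w * d_0): one row per position,
    holding the concatenated embeddings of its window.
    """
    d0 = len(next(iter(table.values())))
    pad = (w - 1) // 2
    zero = np.zeros(d0)
    embs = [zero] * pad + [np.asarray(table[t]) for t in path] + [zero] * pad
    rows = [np.concatenate(embs[i:i + w]) for i in range(len(path))]
    return np.stack(rows)
```

With a two-item path and w = 3, each row concatenates three d_0-dimensional vectors, giving a matrix of shape (2, 3·d_0).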

Convolutional Neural Network
After being fed into the convolutional layer, the matrix of the syntactic path X_0 is processed with a linear operation:

Y_1 = W_1 X_0 + b_1    (1)

where W_1 is the convolutional weight matrix and b_1 is the bias term. Moreover, we extract a convolutional feature C_pavg, whose elements are the probabilistic weighted average values of the rows in Y_1. Formally, C_pavg can be written as:

C_pavg(r) = Σ_i p_i · Y_1(r, i)    (2)

p_i = Y_1(r, i) / Σ_k Y_1(r, k)    (3)

In Equation (3), p_i is the probability of the element Y_1(r, i), obtained by normalizing the activations in the r-th row of Y_1.

Then, C = C_pavg is fed into the hidden layer to learn more complex and meaningful features. Here, we process C with a linear operation just like in the convolutional layer, and choose the hyperbolic tangent tanh as the activation function:

Z = tanh(W_2 C + b_2)    (4)

where b_2 is the bias term. To produce the output of the hidden layer, a normalization operation is applied to eliminate the manifold differences among various features:

H = Z / ||Z||    (5)

In this way we obtain the path feature H ∈ R^{n_2} for each candidate token and then concatenate it with the position feature P into one vector:

F_0 = [H, P]    (6)

where F_0 is the feature vector of a candidate token, with dimension equaling the sum of n_2 and the dimension of P. Besides, we also apply dropout for regularization to prevent the co-adaptation of hidden units on the penultimate layer:

F_1 = F_0 ⊙ M    (7)

where ⊙ is an element-wise multiplication and M is a mask vector whose elements follow the Bernoulli distribution with probability p of being 1. We determine whether the candidate token is in the scope of the current cue according to its F_1.
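The probabilistic weighted average pooling can be sketched as below. This is a hedged sketch: the per-row probabilities are assumed to be the normalized activations, following the weighted-average variant of stochastic pooling (Zeiler et al., 2013), and non-negative activations (e.g., after a ReLU) are assumed so that the weights form a valid distribution:

```python
import numpy as np

def prob_weighted_avg_pool(Y):
    """Probabilistic weighted average pooling over each row of Y.

    Each element is weighted by its probability within its row,
    p_ij = y_ij / sum_k y_ik, and the pooled value is the expected
    activation of the row.
    """
    Y = np.asarray(Y, dtype=float)
    p = Y / Y.sum(axis=1, keepdims=True)   # per-row probability weights
    return (p * Y).sum(axis=1)             # expected activation per row
```

For a row (1, 2, 3), the weights are (1/6, 2/6, 3/6) and the pooled value is (1 + 4 + 9)/6 ≈ 2.33, which lies between the plain average (2) and the max (3).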

Output
Finally, F_1 is fed into the softmax layer:

O = softmax(W_3 F_1 + b_3)    (8)

To learn the parameters of the network, we supervise the predicted label distribution O with the gold labels in the training set, and minimize the following training objective function:

J(θ) = −(1/m) Σ_{i=1}^{m} log p(y_i | x_i, θ) + λ ||θ||²    (9)

where p(y_i | x_i, θ) is the confidence score of the gold label y_i (B, A, or O) of the training instance x_i, m is the number of training instances, λ is the regularization coefficient and θ = {W_0, W_1, b_1, W_2, b_2, W_3, b_3} is the set of parameters. To train the CNN-based model, the Stochastic Gradient Descent algorithm is applied to fine-tune θ.
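A numerical sketch of the output layer and objective (the function names and the list-of-arrays parameter representation are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    """Softmax with the usual max-shift for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def objective(scores, gold, params, lam):
    """Negative log-likelihood of the gold labels plus L2 regularization.

    scores: unnormalized score vector per training instance
    gold:   gold label index per instance
    params: list of parameter arrays (theta)
    lam:    regularization coefficient lambda
    """
    m = len(scores)
    nll = -sum(np.log(softmax(s)[y]) for s, y in zip(scores, gold)) / m
    l2 = lam * sum((np.asarray(p) ** 2).sum() for p in params)
    return nll + l2
```

With uniform scores over the three labels and no regularization, the loss is log 3 per instance, the entropy of a three-way guess.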

Experimentation
In this section, we first introduce the evaluation data, and then describe the experimental settings. Finally, we report the experimental results and analysis.

Corpus
We evaluate our CNN-based model on BioScope (Szarvas et al., 2008; Vincze et al., 2008), a widely used and freely available resource consisting of sentences annotated with speculative and negative cues and their scopes in the biomedical domain. BioScope includes 3 different sub-corpora: Abstracts of biological papers from the GENIA corpus (Collier et al., 1999), Full scientific Papers from FlyBase and the BMC Bioinformatics website, and a Clinical radiology Records corpus. The texts in the three sub-corpora ensure that BioScope can capture the heterogeneity of language use in the biomedical domain. While Abstracts and Full Papers share the same genre, Clinical Records consists of shorter sentences. Previous studies regarded Abstracts as the main resource for text mining applications due to its public accessibility (e.g., through PubMed).

Table 1 shows the statistics of the BioScope corpus. ("Ave. Len" denotes average length; "Abs", "Papers" and "Cli" denote Abstracts, Full Papers and Clinical Records, respectively; "Spe" and "Neg" denote speculation and negation, respectively.) In both Abstracts and Full Papers, the average lengths of speculation and negation sentences are comparable (Abstracts: 29.77 vs 29.28; Full Papers: 30.76 vs 30.55). However, the average lengths of the negation scopes are shorter than those of the speculation ones (Abstracts: 7.60 vs 15.10; Full Papers: 7.35 vs 13.38). Moreover, both the average lengths of sentences and scopes in Clinical Records are shorter than those of the other two sub-corpora (average lengths: 11.96 for speculation sentences, 8.53 for negation sentences, 4.92 for speculation scopes and 3.87 for negation scopes).

Experimental Settings
Following the previous work (e.g., Özgür et al., 2009; Morante et al., 2009a, 2009b; Zou et al., 2013), we divide the Abstracts sub-corpus into 10 folds to perform 10-fold cross-validation. Moreover, to examine the robustness of our CNN-based model towards different text types within the biomedical domain, all the models are trained on the Abstracts sub-corpus. Therefore, the results on Abstracts can be regarded as in-domain evaluation, while the results on Clinical Records and Full Papers can be regarded as cross-domain evaluation.
For measurement, traditional Precision, Recall, and F1-score are used to report the token-based performance in scope detection, while the Percentage of Correct Scopes (PCS) is adopted to report the scope-based performance, which considers a scope correct only if all the tokens in the sentence have been assigned the correct scope classes for a specific cue. Obviously, PCS better describes the overall performance in scope detection. Besides, the Percentage of Correct Left Boundaries (PCLB) and the Percentage of Correct Right Boundaries (PCRB) are reported as partial measurements.
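Under this definition, PCS can be computed from full per-token label sequences (a minimal sketch; representing each scope as its gold and predicted label sequence is an assumption):

```python
def pcs(gold_scopes, pred_scopes):
    """Percentage of Correct Scopes.

    A scope counts as correct only if every token label in the sentence
    matches the gold labels for that cue; each scope is given as a full
    per-token label sequence.
    """
    correct = sum(g == p for g, p in zip(gold_scopes, pred_scopes))
    return 100.0 * correct / len(gold_scopes)
```

A single mislabeled token thus invalidates the whole scope, which is why PCS is stricter than token-level F1.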
In all our experiments, both the constituency and dependency parse trees are produced by the Stanford Parser. Specifically, we train the parser on the GENIA Treebank 1.0 (Tateisi et al., 2005), which contains Penn Treebank-style syntactic (phrase structure) annotation for the GENIA corpus. The parser achieves a performance of 87.12% in F1-score in terms of 10-fold cross-validation on GENIA Treebank 1.0.
For the baseline, we utilize the classifier-based baseline developed by Zou et al. (2013). Besides those typical features, constituency and dependency syntactic features are also included. Furthermore, Mallet (http://mallet.cs.umass.edu/) is selected as the classifier.
In addition, since our CNN-based model may produce discontinuous blocks, we utilize a post-processing algorithm (Morante et al., 2008) to ensure the continuity of scopes. Meanwhile, the cue must be in its scope, as defined in BioScope.

Table 2 summarizes the performances of scope detection on Abstracts. In Table 2, CNN_C and CNN_D refer to the CNN-based model with constituency paths and dependency paths, respectively (the same below). It shows that our CNN-based models (both CNN_C and CNN_D) achieve better performances than the baseline in most measurements. This indicates that our CNN-based models can better extract and model effective features. Besides, compared to the baseline, our CNN-based models consider fewer features and need less human intervention. Table 2 also manifests that our CNN-based models improve significantly more on negation scope detection than on speculation scope detection. Much of this is due to the better ability of our CNN-based models to identify the right boundaries of scopes than the left ones on negation scope detection, with huge gains of 29.44% and 25.25% on PCRB using CNN_C and CNN_D, respectively.

Table 2 illustrates that the performance of speculation scope detection is higher than that of negation (best PCS: 85.75% vs 77.14%). This is mainly attributed to the shorter scopes of negation cues. Given that the average length of negation sentences is almost as long as that of speculation ones (29.28 vs 29.77), shorter negation scopes mean that more tokens do not belong to the scopes, indicating more negative instances. The imbalance between positive and negative instances has negative effects on both the baseline and the CNN-based models for negation scope detection. Table 2 also shows that CNN_D outperforms CNN_C in negation scope detection (PCS: 77.14% vs 70.86%), while CNN_C performs better than CNN_D in speculation scope detection (PCS: 85.75% vs 74.43%).
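The continuity post-processing mentioned above can be approximated as follows. This is an assumed simplification, not the exact algorithm of Morante et al. (2008): it keeps only the contiguous run of in-scope labels containing the cue and relabels the rest as O:

```python
def enforce_continuity(labels, cue_index):
    """Keep the contiguous block of in-scope tokens around the cue.

    Token-level classification may predict discontinuous scopes; we
    retain the run of non-O labels containing the cue (which must lie
    in its scope) and set all other labels to O.
    """
    labels = list(labels)
    if labels[cue_index] == "O":        # force the cue into its own scope
        labels[cue_index] = "A"
    left = cue_index
    while left > 0 and labels[left - 1] != "O":
        left -= 1
    right = cue_index
    while right < len(labels) - 1 and labels[right + 1] != "O":
        right += 1
    out = ["O"] * len(labels)
    out[left:right + 1] = labels[left:right + 1]
    return out
```

Stray in-scope predictions far from the cue are thus discarded, leaving a single contiguous scope.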
To explore the results of our CNN-based models in detail, we present an analysis of the top 10 speculative and negative cues below on CNN_C and CNN_D, respectively.

Figure 5 illustrates the PCSs of the 10 most frequent speculative cues using CNN_C on the Abstracts sub-corpus. The cues on the horizontal axis are ordered from lowest to highest frequency. Among those cues, "suggest", "may", "indicate", and "appear" are commonly used to express opinions of certain individuals. The scopes of these cues are integrated semantic fragments (probably clauses) governed by the corresponding cues in a grammatical sense, and the tokens in the scope tend to share the same chunk with the cue in the constituency parse tree. Hence, constituency paths are more useful for speculation scope detection. Figure 5 also shows that the PCSs of all the top 10 speculative cues are higher than 70% except "or" (PCS: 60.44%), mainly due to the flexible usage of "or", which can connect two words, two professional terms, or even two clauses.

Figure 6 illustrates the performances of the 10 most frequent negative cues using CNN_D. Among those negative cues, "not" is in the absolute majority, and "not" and "no" together cover over 70%. We have noticed that most negative cues (e.g., "not", "no", "without", "fail") are often applied to negate phrases, and the tokens in a negation scope tend to have tight dependency relationships with them. Therefore, our model can achieve better results using dependency paths for negation scopes.

Experimental Results on Abstracts
In Figure 6, most negative cues have good PCSs (higher than 70%). However, "unable" has a poor PCS of 16.67%. This is due to the fact that "unable" usually occurs in the phrase structure "be unable to", which often follows a subject. It is notable that a cue is always in its scope and most cues in BioScope are much closer to the left boundaries than to the right ones. Hence, the tokens labeled as B (i.e., inside the scope and before the cue) are much fewer than the ones labeled as A or O. Such imbalance makes it hard to judge whether the tokens before "unable" are in its scope or not.

Experimental Results on Clinical Records and Full Papers
The performances of our CNN-based models on the other two sub-corpora, i.e., Clinical Records and Full Papers, are presented in Table 3. Although Abstracts and Clinical Records belong to different genres, our CNN-based models obtain satisfactory results on Clinical Records using both constituency paths and dependency paths, proving the portability of our models. Table 3 also shows that the results on negation scopes are better than those on speculation scopes on Clinical Records (PCS: 89.66% vs 73.92%). We argue that the reason is that both the lengths of negation sentences and scopes (8.53 and 3.87, respectively) in Clinical Records are much shorter, indicating that the structures of negation sentences are simpler than those of speculation ones. After an error analysis of speculation scopes, we find that 54.83% of our erroneous scopes contain the annotated scopes, as in sentence S5:
(S5) This does not [appear to represent a stone] and is not mobile.
The annotated scope of the cue "appear" is "appear to represent a stone". However, our CNN-based model identifies the whole sentence as the scope. These errors indicate that some words may be wrongly identified as components of scopes because the scopes in Clinical Records are short and their structures are simple.
Compared with Abstracts and Clinical Records, the results on Full Papers are much lower. This is mainly due to the poor PCRBs, indicating that a considerable number of right boundaries of scopes cannot be identified correctly. We should note that the average lengths of both speculation and negation sentences (30.76 and 30.55, respectively) in Full Papers are longer than those in Abstracts and Clinical Records. Normally, longer sentences mean more complicated syntactic structures.
Besides the results trained on Abstracts, we also perform 10-fold cross-validation on Clinical Records and Full Papers. The PCSs of speculation and negation scope detection are 74.73% (CNN_C) and 91.03% (CNN_C) on Clinical Records, which are both higher than the ones trained on Abstracts. Remember that Abstracts and Clinical Records come from different genres. However, we get lower PCSs on Full Papers (49.54% for speculation scope detection using CNN_C, and 44.67% for negation scope detection using CNN_C). In addition to the complex structures of long sentences, another reason is the smaller size of the Full Papers sub-corpus compared to the other two sub-corpora. Fewer sentences and scopes (only 672 speculation scopes in 519 sentences and 376 negation scopes in 339 sentences) mean that we cannot train an excellent model.

Table 4 compares our CNN-based models with the state-of-the-art systems. It shows that our CNN-based models achieve a higher PCS (+1.54%) than the state-of-the-art systems for speculation scope detection and the second highest PCS for negation scope detection on Abstracts, and get comparable PCSs on Clinical Records (73.92% vs 78.69% for speculation scopes, 89.66% vs 90.74% for negation scopes). It is worth noting that Abstracts and Clinical Records come from different genres.

Comparison with the State-of-the-Art
It also shows that our CNN-based models perform worse than the state-of-the-art on Full Papers due to the complex syntactic structures of the sentences and the cross-domain nature of our evaluation. Although our evaluation on Clinical Records is also cross-domain, the sentences in Clinical Records are much simpler and the results on Clinical Records are satisfactory. Recall that our CNN-based models are all trained on Abstracts. Another reason is that the state-of-the-art systems on Full Papers (e.g., Li et al., 2010) are tree-based, instead of token-based. Li et al. (2010) proposed a semantic parsing framework and focused on determining whether a constituent, rather than a word, is in the scope of a negative cue. Another system presented a hybrid framework, combining a rule-based approach using dependency structures and a data-driven approach for selecting appropriate subtrees in constituency structures. Normally, tree-based models can better capture long-distance syntactic dependencies than token-based ones. Compared to those tree-based models, however, our CNN-based model needs less manual intervention. To improve the performance of the scope detection task, we will explore this alternative in our future work.

Conclusion
This paper proposes a CNN-based model for speculation and negation scope detection. Compared with the various lexical and syntactic features adopted in previous studies (e.g., Lapponi et al., 2012; Zou et al., 2013), our CNN-based model only considers the position feature and the syntactic path feature. Experimental results on the BioScope corpus show that our CNN-based model achieves the best performance for speculation scopes and the second highest performance for negation scopes on Abstracts in in-domain evaluation. In cross-domain evaluations, we achieve comparable results on Clinical Records, but our CNN-based model performs worse on Full Papers. This suggests our future direction: extending the model from the token level to the parse tree level to better capture long-distance syntactic dependencies and to address the cross-domain adaptation issue.