Multitask Learning of Negation and Speculation using Transformers

Detecting negation and speculation in language has been a task of considerable interest to the biomedical community, as it is a key component of Information Extraction systems for biomedical documents. Prior work has addressed Negation Detection and Speculation Detection individually, and both have been addressed in the same way, using a 2-stage pipelined approach: Cue Detection followed by Scope Resolution. In this paper, we propose Multitask Learning approaches over 2 sets of tasks: Negation Cue Detection & Speculation Cue Detection, and Negation Scope Resolution & Speculation Scope Resolution. We utilise transformer-based architectures like BERT, XLNet and RoBERTa as our core model architecture, and finetune these using the Multitask Learning approaches. We show that this Multitask Learning approach outperforms the single-task learning approach, and report new state-of-the-art results on Negation and Speculation Scope Resolution on the BioScope Corpus and the SFU Review Corpus.


Introduction
Detection of linguistic phenomena like Negation and Speculation is a key component of Biomedical Information Retrieval systems, as these phenomena significantly alter the meaning of a sentence. While detecting them is also useful in Sentiment Analysis systems and in systems used to determine the veracity of information, their primary use is in biomedical systems. Thus, these tasks have attracted significant interest from researchers over the years. Due to the similarity between the tasks, similar approaches have been used to address them, and parallel corpora containing annotations for both Negation and Speculation have also been created, including:
• BioScope Corpus (Szarvas et al., 2008)
• SFU Review Corpus (Konstantinova et al.)
Prior research has converged on a 2-stage approach for both Negation Detection and Speculation Detection, namely Cue Detection followed by Scope Resolution, and has solved each task independently. These subtasks and their relevance to the Biomedical domain can be better understood using the following example:
(1) We found that T cells were [not] present, [perhaps] indicating an immuno deficiency.
Cue Detection involves finding the word(s) that express the linguistic phenomenon being detected. In the example given above, not is the negation cue, as it expresses the negation in the sentence. Similarly, perhaps is the speculation cue.
Scope Resolution involves finding the word(s) that were affected by the cue word of the linguistic phenomena being considered. In the example above, the underlined words outline the scope for each cue. Specifically, for the negation cue not, the word present was negatively affected by it. Similarly, for the speculation cue perhaps, the words indicating an immuno deficiency were affected by it, indicating that these words have an associated uncertainty.
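To make the two subtasks concrete, example (1) can be written out as per-token labels. The following is a minimal sketch only: whitespace tokenisation, binary 0/1 labels and the helper names (binary_labels, in_scope) are assumptions made for illustration, and the corpora themselves use richer annotation formats.

```python
# Example (1) from the text, split into tokens by hand (a simplification).
tokens = ["We", "found", "that", "T", "cells", "were", "not", "present", ",",
          "perhaps", "indicating", "an", "immuno", "deficiency", "."]


def binary_labels(n_tokens, positive_indices):
    """Return one 0/1 label per token, with 1 at the given indices."""
    positives = set(positive_indices)
    return [1 if i in positives else 0 for i in range(n_tokens)]


def in_scope(tokens, labels):
    """Return the tokens whose label is 1."""
    return [tok for tok, lab in zip(tokens, labels) if lab == 1]


# Cue Detection targets: which tokens express the phenomenon.
negation_cue = binary_labels(len(tokens), [tokens.index("not")])
speculation_cue = binary_labels(len(tokens), [tokens.index("perhaps")])

# Scope Resolution targets: which tokens are affected by each cue,
# following the scopes described in the text for example (1).
negation_scope = binary_labels(len(tokens), [tokens.index("present")])
start = tokens.index("indicating")
speculation_scope = binary_labels(len(tokens), range(start, start + 4))

print(in_scope(tokens, negation_scope))
print(in_scope(tokens, speculation_scope))
```

Each cue has its own scope labelling, which is why Scope Resolution needs the cue to be marked in the input.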
The approaches addressing these tasks have varied significantly over the years. Recent work has focused on using transformer-based architectures to perform transfer learning, which has given the best results to date on Scope Resolution. On Cue Detection, these architectures yield the best performance among neural models, but due to the small dataset sizes, the best performance is still given by rule-based heuristic approaches.
Despite the similarity among the subtasks, prior systems have looked at these tasks independently. We believe that a system can improve performance on both tasks by learning from both simultaneously, due to the similarity, which is what Multitask Learning is about.
Multitask Learning involves jointly training the same architecture to perform multiple tasks. It relies on knowledge shared between the tasks: the model is forced to learn this shared knowledge in order to perform well at all tasks, which eventually leads to better performance on each of them. For neural models, this is especially useful, as the lower layers can share the same input representation, thus getting more data from which to learn better lower-level features of the input, while a task-specific final layer can learn the task-specific features.
In this paper, inspired by the success of transformers, we propose a method to perform Multitask Learning of negation and speculation using transformer-based architectures. We explore a few design choices, and analyse their impact. We show that our approach provides significant benefits over the independently trained versions of the same models. We also make all our code publicly available at adityak6798.github.io. This paper is structured as follows: Section 2 contains a brief Literature Review, Section 3 describes the Methodology in detail, Section 4 talks about the Experimentation Details, Section 5 contains the Results and their Analysis, and Section 6 contains the Conclusion and Future Scope.

Literature Review
Over the years, methods addressing these subtasks have ranged from simple whitelists based on frequency (rule-based), to traditional Machine Learning algorithms like SVMs, neural models like BiLSTMs and transformer-based models. Khandelwal and Sawant (2020) provide an extensive literature review of the methods for Negation Cue Detection and Scope Resolution. For Speculation Cue Detection, most methods used were similar to the methods used for Negation Cue Detection. Below, we summarise a few papers that addressed Speculation Cue Detection and Scope Resolution.

Traditional Machine Learning Methods
Özgür and Radev (2009) used a Support Vector Machine (SVM) with a linear kernel to detect speculation cues. Once the speculation cues were identified, the parts-of-speech tags and the syntactic structure were used for scope resolution. Morante and Daelemans (2009) used the IGTREE classifier with the help of gain ratio (TiMBL implementation) to classify the cues. For scope resolution, three classifiers were used to classify if a token was the first token in the scope sequence (F-scope), the last (L-scope), or neither. These three classifiers were Memory-based learning (as implemented in TiMBL), SVM and Conditional Random Field (CRF). A fourth classifier, the metalearner, used the output of the three classifiers to predict the scope classes. Velldal et al. (2010) used a Maximum Entropy Classification approach to find the speculation cue. For scope resolution, they used a rule-based approach. The rules operated on the dependency structure of the parser (MaltParser and XLE). Kilicoglu and Bergler (2008) used linguistic knowledge to detect speculation cues. This was achieved by using a semi-automatic lexical acquisition strategy as well as by using a dictionary of weighted speculation cues. In a follow-up paper (Kilicoglu and Bergler, 2010), they improved on their previous work with the help of vagueness quantifiers and syntactic dependency relations. Cues were detected with rules that operate on lexical information and syntactic information obtained from the Stanford Lexical Parser. The scopes of the cues were detected with the help of the Stanford Lexical Parser and dependency-based heuristics.
Velldal (2011) compiled a list of words that were observed to be cues in the training data, under the assumption that speculation cues can be treated as a closed class. He then checked occurrences of these words in the test data via a large-margin SVM classifier to determine whether they were a cue or not.
A CRF-based approach was used by Tang et al. (2010) to identify the hedge cues and their scopes in sentences. A CRF and a large margin-based model were trained simultaneously. Their outputs were provided to another CRF model to get the cues of sentences. These cues were passed to another CRF model, followed by post-processing, to detect the scopes of the given cues. Morante et al. (2010) described a memory-based approach (IGTree as implemented in TiMBL) for cue detection. For scope resolution, a memory-based approach was used with the help of the syntactic dependencies of a sentence. Read et al. (2011) described a methodology to resolve the scope of a sentence using an SVM-based constituent ranker. The scope of a cue was assumed to be a constituent. Three broad classes of rules were used to extract features from the parse trees. The parse trees containing the cue were fed to the SVM-based ranker to output a ranked order of parse trees, which was then declared to be the scope of the cue. Velldal et al. (2012) used an SVM classifier based on manually defined rules for speculation cue detection. For scope resolution, they experimented with three architectures: a rule-based system that used Data Driven Dependency Parsing to generate dependency structures, an SVM Ranker for selecting subtrees in the constituent structures obtained via a Grammar-Driven Phrase Structure Parser, and a hybrid of both the above systems.
The approach of Moncecchi et al. (2012) was based on CRFs and the use of domain knowledge. The task of cue detection was solved by using a CRF as a sequential classifier. The scope of these cues was resolved by using the CRF at the initial stage with a window size of two. Later, domain knowledge was used to incorporate rules in the system, which improved its performance. Cruz et al. (2016) used a classifier-based approach for speculation cue detection and scope resolution. The features for the classifier were manually defined. For cue detection, an SVM-based classifier was used to predict the BIO tags. The scope was also identified using an SVM-based classifier to predict the in-scope and out-of-scope tags when the cues and tokens were provided as input to the classifier. A Radial Basis Function (RBF) Kernel was used with Cost Sensitive Learning to handle imbalanced classes.

Deep Learning Methods
Qian et al. (2016) used a CNN-based approach to resolve the scope of a speculation cue. The CNN framework took as input position and path features.
Fei et al. (2020) used a Recursive Neural Network (RecurNN) followed by a CRF to detect the scope in a sentence, which they named the Recur-CRF model. The dependency tree based RecurNN learnt a high-level representation of words in the given content. The output of the RecurNN was given to the CRF to fully understand the contextual information required to predict the scope of a given cue.
Recently, Britto and Khandelwal (2020) extended the approach by Khandelwal and Sawant (2020), who used BERT (Devlin et al., 2018) to address Negation Cue Detection and Scope Resolution. They experimented with using various transformer-based architectures (BERT, XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019)), and jointly training on multiple datasets to address speculation cue detection and scope resolution. This approach gave the best results to date on Negation and Speculation Scope Resolution.

Multitask Learning using Negation Scope Resolution
We also review a couple of papers which have used Multitask Learning with Negation Scope Resolution as one of the many tasks to jointly train the model. It is important to note that these paradigms were explored to improve performance on the other tasks the model was trained on, which were almost always harder than Negation Scope Resolution. Bhatia et al. (2018) perform joint entity extraction and negation detection for biomedical articles. Initially, they used a hierarchical encoder-decoder model designed for Named Entity Recognition (NER), and adapted it for the Multitask setting by sharing the encoder, but using separate decoders for the two tasks. To overcome overparameterization in low-resource settings, they propose a conditional softmax shared decoder, where instead of using 2 different decoder architectures, they share the decoder as well, and only have separate classification heads. They also feed the output of the NER head as an additional input to the negation head, which helps improve the performance. They use BiLSTMs for both the encoder and decoder. Barnes et al. (2019) explore another joint task that has been explored often, namely Sentiment Analysis systems that are jointly trained with Negation Detection systems. They mention that since Sentiment Analysis is a harder task than negation detection, and negation data is used as a task in the pipeline for sentiment analysis, they perform selective sharing of LSTM layers, and use negation as an auxiliary task on which the sentiment analysis system is trained. Specifically, they use a separate CRF tagger for negation detection on the outputs of an intermediate layer of the sentiment analysis system, whose final layer is used for sentiment classification. They use a BiLSTM-based network.

Methodology
Similar to Khandelwal and Sawant (2020) and Britto and Khandelwal (2020), we use the transformer model (BERT/XLNet/RoBERTa) with a classification head as our base model. To jointly train the model, we propose the following additions to the model.

Cue Detection
For Cue Detection, we use 2 separate classification heads, one for Negation Cue Detection and one for Speculation Cue Detection. The architecture can be visualized as in Figure 1. We feed an input sentence to the model, and use the output corresponding to the task we are looking to perform. Compared with using two separate models, this architecture approximately halves the number of parameters and the inference time if we want to perform negation and speculation detection simultaneously.
To train this model, we only train on those sentences that have both negation and speculation cue labels. Since we train on the BioScope Corpus, and the SFU Review Corpus, all training samples have labels for both negation and speculation. A single input sentence is fed, and the model is trained on the losses computed for both heads, negation and speculation.
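The joint training signal described above can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the text says only that the model is trained on the losses of both heads, so the unweighted sum used here, the token-level cross-entropy, and the function names (cross_entropy, joint_cue_loss) are all assumptions.

```python
import math


def cross_entropy(logits, gold):
    """Mean token-level cross-entropy.

    logits: one list of class scores per token (the output of one head).
    gold:   one gold class index per token.
    """
    total = 0.0
    for scores, label in zip(logits, gold):
        top = max(scores)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[label]
    return total / len(gold)


def joint_cue_loss(neg_logits, neg_gold, spec_logits, spec_gold):
    """Loss for one sentence: both heads read the same shared encoding,
    and their losses are added (assumed: equal weights) before backprop."""
    return cross_entropy(neg_logits, neg_gold) + cross_entropy(spec_logits, spec_gold)


# Toy sentence of 3 tokens with 2 classes per head (not-cue / cue).
neg_logits = [[2.0, -1.0], [0.0, 0.0], [-1.0, 3.0]]
spec_logits = [[1.5, -0.5], [-2.0, 2.0], [0.5, 0.0]]
loss = joint_cue_loss(neg_logits, [0, 0, 1], spec_logits, [0, 1, 0])
print(round(loss, 4))
```

Because both terms are computed from the same forward pass of the shared encoder, every training sentence updates the shared layers with gradient from both tasks.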

Scope Resolution
For Scope Resolution, we use the same classification head for both Negation and Speculation Scope Resolution, and use preprocessing techniques to implicitly tell the model which task to perform. For Scope Resolution, we need to represent the cue words in the input sentence for which we want to find the scope. This could be done via the Augment Preprocessing method used by Britto and Khandelwal (2020). This involves inserting a special token before the cue word in the input sentence, which represents the type of cue word. The types of cue words considered are:
• Single Word Cue: tok[0]
• Part of a Multiword Cue: tok[1]
• Affix (Suffix / Prefix): tok[2]
Consider the following example:
Input Sentence: It seems that the treatment is not successful.
Negation Cues: not
Preprocessed Sentence: It seems that the treatment is tok[0] not successful.
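The Augment preprocessing step can be sketched in a few lines. This is an illustrative version only: the function name augment, the cue-type keys and whitespace tokenisation are assumptions, while the special tokens tok[0], tok[1] and tok[2] are those listed above.

```python
# Special tokens for the three cue types listed in the text.
CUE_TYPE_TOKENS = {
    "single": "tok[0]",     # single word cue
    "multiword": "tok[1]",  # part of a multiword cue
    "affix": "tok[2]",      # affix (suffix / prefix)
}


def augment(tokens, cue_types):
    """Insert the cue-type token before every cue word.

    tokens:    list of words in the sentence.
    cue_types: dict mapping a token index to its cue type key.
    """
    out = []
    for i, tok in enumerate(tokens):
        if i in cue_types:
            out.append(CUE_TYPE_TOKENS[cue_types[i]])
        out.append(tok)
    return out


sentence = "It seems that the treatment is not successful.".split()
cue_index = sentence.index("not")
print(" ".join(augment(sentence, {cue_index: "single"})))
```

On the example sentence, this reproduces the preprocessed sentence given in the text.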
To jointly train the same model to make predictions, we also have to tell the model which task we expect it to perform. To do this, we propose the following methods, which are slight modifications of the Augment preprocessing method:
• Global: We represent the task by appending the name of the task to be performed at the end of the input sentence followed by a [SEP] token. The cue words for both negation and speculation are represented by the same set of special tokens. Specifically,
Input Sentence: It seems that the treatment is not successful.
Thus, the type of cue for both negation and speculation is the same (single word cue), hence we use the same token (tok[0]) to augment the input sentence. The task is represented by appending the task name to the end of the sentence.
• Local: Here, the tokens used to represent the different types of negation cues are different from the tokens used to represent the different types of speculation cues, thus implicitly telling the model which scope it has to find.
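The two task-marking schemes can be contrasted with a short sketch. Several details here are assumptions and not taken from the text: the speculation tokens tok[3] to tok[5] in the Local table are hypothetical placeholders (the actual Local tokens are not listed here), the task name is placed after a [SEP] token as one reading of the Global description, and "seems" is assumed to be the speculation cue of the example sentence.

```python
SHARED_TOKENS = {"single": "tok[0]", "multiword": "tok[1]", "affix": "tok[2]"}

# Hypothetical Local token table: only the fact that the two tasks use
# different tokens comes from the text; the tok[3..5] names are placeholders.
LOCAL_TOKENS = {
    "negation": {"single": "tok[0]", "multiword": "tok[1]", "affix": "tok[2]"},
    "speculation": {"single": "tok[3]", "multiword": "tok[4]", "affix": "tok[5]"},
}


def mark_cues(tokens, cue_types, table):
    """Insert the special token for each cue before the cue word."""
    out = []
    for i, tok in enumerate(tokens):
        if i in cue_types:
            out.append(table[cue_types[i]])
        out.append(tok)
    return out


def global_preprocess(tokens, cue_types, task):
    """Shared cue tokens; the task name is appended to the sentence.
    Assumed layout: sentence, then [SEP], then the task name."""
    return mark_cues(tokens, cue_types, SHARED_TOKENS) + ["[SEP]", task]


def local_preprocess(tokens, cue_types, task):
    """Task-specific cue tokens; nothing is appended to the sentence."""
    return mark_cues(tokens, cue_types, LOCAL_TOKENS[task])


sentence = "It seems that the treatment is not successful.".split()
neg_cue = {sentence.index("not"): "single"}
spec_cue = {sentence.index("seems"): "single"}  # assumed speculation cue

print(" ".join(global_preprocess(sentence, neg_cue, "negation")))
print(" ".join(global_preprocess(sentence, spec_cue, "speculation")))
print(" ".join(local_preprocess(sentence, neg_cue, "negation")))
print(" ".join(local_preprocess(sentence, spec_cue, "speculation")))
```

In the Global variant the two tasks differ only in the appended task name, whereas in the Local variant they differ only in the token placed next to the cue.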

• SFU Review Corpus (SFU)
We believe that training on multiple datasets can reduce overfitting, as the datasets are fairly small in size (200-2000 samples), despite the different domains of the datasets (the BioScope Corpus is from the Biomedical Domain, and the SFU Review Corpus contains general online review text). Hence, we also experiment with training the models on multiple datasets, and testing on the individual datasets.
We use a 70-15-15 train-dev-test split. The results are reported as an average of 5 runs for training on a single dataset and an average of 3 runs for training on a combination of multiple datasets. We report the Macro F1 Average (Token-level) score for both Cue Detection and Scope Resolution.
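For reference, a token-level macro-averaged F1 score can be computed as sketched below. This uses the common definition (the unweighted mean of per-class F1 over flattened token labels) and is an assumption about the exact evaluation; the function names are illustrative.

```python
def f1_for_class(gold, pred, cls):
    """F1 of one class over flat lists of token labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # covers the cases where precision or recall is undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy example: 4 tokens, labels 0 (out of scope) and 1 (in scope).
gold = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
print(round(macro_f1(gold, pred), 4))
```

Macro averaging gives the rare positive class the same weight as the frequent negative class, which matters because most tokens are neither cues nor in scope.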
We perform early stopping (with a patience of 6) on the validation F1 Score. Since we jointly train 2 tasks, we experiment with the following 2 ways to perform early stopping:
• Separate: Here, we use 2 early stopping counters: one for Negation and one for Speculation. Specifically, we have separate validation sets for Negation and Speculation, and for each validation set, we run a different Early Stopping Counter. Thus, the final models for Negation and Speculation differ, although they are trained jointly.
• Combined: Here, only one Early Stopping Counter is used. Training is stopped when the average of the validation F1 scores on the Negation Validation set and the Speculation Validation set does not improve for 6 epochs. Here, we have the same final model for both Negation and Speculation.
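The difference between the two early stopping schemes can be sketched on per-epoch validation scores. This is a simplified illustration: the class and function names are hypothetical, each function returns the epoch of the checkpoint that would be kept, and the Separate variant assumes that joint training continues until both counters have run out.

```python
class EarlyStopper:
    """Tracks the best validation score and counts epochs without improvement."""

    def __init__(self, patience=6):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, score, epoch):
        """Record a score; return True once patience is exhausted."""
        if score > self.best:
            self.best, self.best_epoch, self.bad_epochs = score, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def combined_checkpoint(neg_f1, spec_f1, patience=6):
    """One counter on the average validation F1: one shared final model."""
    stopper = EarlyStopper(patience)
    for epoch, (n, s) in enumerate(zip(neg_f1, spec_f1)):
        if stopper.update((n + s) / 2, epoch):
            break
    return stopper.best_epoch


def separate_checkpoints(neg_f1, spec_f1, patience=6):
    """One counter per task: the kept checkpoints may differ between tasks."""
    neg, spec = EarlyStopper(patience), EarlyStopper(patience)
    neg_done = spec_done = False
    for epoch, (n, s) in enumerate(zip(neg_f1, spec_f1)):
        if not neg_done:
            neg_done = neg.update(n, epoch)
        if not spec_done:
            spec_done = spec.update(s, epoch)
        if neg_done and spec_done:
            break
    return neg.best_epoch, spec.best_epoch


# Made-up validation curves, with a short patience to keep the example small.
neg_f1 = [0.80, 0.85, 0.84, 0.83, 0.90]
spec_f1 = [0.70, 0.72, 0.78, 0.77, 0.76]
print(combined_checkpoint(neg_f1, spec_f1, patience=2))
print(separate_checkpoints(neg_f1, spec_f1, patience=2))
```

On these made-up curves, the Separate scheme freezes the Negation checkpoint early and so misses its later improvement, while the Combined scheme keeps a single checkpoint chosen on the averaged score.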
We train the models using GPUs available via Google Colaboratory. The code is publicly available.

Results and Analysis
To perform a better comparison with independently trained models on multiple datasets, we train BERT, XLNet and RoBERTa on BF+BA, BF+SFU, BA+SFU and BF+BA+SFU for Negation Cue Detection and Negation Scope Resolution. We train the models as per the paper by Britto and Khandelwal (2020), and average the results of 3 runs. The results are shown in Tables 1 and 2. An analysis of the results shown below is done in Section 5.4.

Cue Detection
The results for Negation and Speculation Cue Detection are shown in Table 3 (trained using the Combined Early Stopping method). A comparison of the best models trained jointly on Negation and Speculation with the independently trained model variants and the state-of-the-art results is shown in Table 5.

Negation Scope Resolution
The results for Negation Scope Resolution are shown in Table 6 (trained using the Combined Early Stopping method) and Table 7 (trained using the Separate Early Stopping method). We compare the Combined and Separate Early Stopping methods in Section 5.4.
A comparison of the best models trained jointly on Negation and Speculation with the state-of-the-art results for Negation Scope Resolution is shown in Table 8. Our joint training approach outperforms the existing state-of-the-art models (independently trained transformer-based architectures) on all datasets that we experiment with.

Speculation Scope Resolution
The results for Speculation Scope Resolution are shown in Table 9 (trained using the Combined Early Stopping method) and Table 10 (trained using the Separate Early Stopping method). We compare the Combined and Separate Early Stopping methods in Section 5.4.
A comparison of the best models trained jointly on Negation and Speculation with the state-of-the-art results for Speculation Scope Resolution is shown in Table 11. Our joint training approach outperforms the existing state-of-the-art results on BioScope Abstracts and the SFU Review Corpus.

Analysis
Our proposed joint training scheme clearly yields substantial improvements over the independent task-specific training approach, as we outperform the independently trained models consistently, and report new state-of-the-art results, as illustrated in Tables 5, 8 and 11.
• For Scope Resolution, the Global preprocessing method tends to outperform the Local preprocessing method. This trend is visible in Table 12, which contains the difference between results using the Local preprocessing method and the Global preprocessing method (i.e. Local - Global), averaged across all train-test dataset combinations, shown for each model-task combination. The majority of the differences (8 out of 12, or about 67%) are negative, showing that the Global preprocessing method outperforms the Local preprocessing method.
• The Combined Early Stopping training method outperforms the Separate Early Stopping training method. This trend is visible in Table 13, which contains the difference between the combined early stopping method

Conclusion
In this paper, we explored the realm of Multitask training to jointly train the same model to perform both negation and speculation detection. We experimented with transformer-based architectures (BERT, XLNet and RoBERTa), and proposed schemes to jointly train the cue detection model for both negation and speculation, and the scope resolution model for both negation and speculation. Our approach yielded improvements over the independently trained versions of the same architectures, and we reported new state-of-the-art results for both negation and speculation scope resolution on the BioScope Corpus and the SFU Review Corpus. We also evaluated the different design choices that were involved, and observed that the Combined Early Stopping variant gave the best overall performance. The future scope of this work would be to look at using this scheme to jointly train a model for more such tasks, like NER and Sentiment Analysis, along with Negation and Speculation Detection.