End-to-End Negation Resolution as Graph Parsing

We present a neural end-to-end architecture for negation resolution based on a formulation of the task as a graph parsing problem. Our approach allows for the straightforward inclusion of many types of graph-structured features without the need for representation-specific heuristics. In our experiments, we specifically gauge the usefulness of syntactic information for negation resolution. Despite the conceptual simplicity of our architecture, we achieve state-of-the-art results on the Conan Doyle benchmark dataset, including a new top result for our best model.


Introduction
Negation resolution (NR), the task of detecting negation and determining its scope, is relevant for a large number of applications in natural language processing, and has been the subject of several contrastive research efforts (Morante and Blanco, 2012; Fares et al., 2018). In this paper we cast NR as a graph parsing problem. More specifically, we represent negation cues and corresponding scopes as a bi-lexical graph and learn to predict this graph from the tokens. Under this representation, we may apply any dependency graph parser to the task of negation resolution. The specific parsing architecture that we use in this paper extends that of Dozat and Manning (2018).
Contributions This work (a) rationally reconstructs the previous state of the art in negation resolution; (b) develops a novel approach to the problem based on general graph parsing techniques; (c) proposes and evaluates different ways of integrating 'external' grammatical information; (d) gauges the utility of morpho-syntactic preprocessing at different levels of accuracy; (e) shifts experimental focus (back) to a complete, end-to-end perspective on the task; and (f) reflects on uncertainty in judging experimental findings, including thorough significance testing.
Paper Structure In the following Section 2, we review selected related work on negation resolution. Section 3 describes the specific NR task that we address in this paper. In Section 4 we present our new encoding of negations and our parsing model, followed by the description of our experiments and results in Section 5. We discuss these results in Section 6 and summarize our findings in Section 7.

Related Work
While there exist a variety of datasets that annotate negation (Jiménez-Zafra et al., 2020), the BioScope (Szarvas et al., 2008) and Conan Doyle datasets (ConanDoyle-neg; Morante and Daelemans, 2012) are most commonly used for evaluation. The latter was created for the shared task at *SEM 2012 (Morante and Blanco, 2012), where competing systems needed to predict both negation cues (linguistic expressions of negation) and their corresponding scopes, i.e. the part of the utterance being negated. Cues can be simple negation markers (such as not or without), but may also consist of multiple words (e.g. neither . . . nor), or be mere affixes (e.g. the prefix in infrequent or the suffix in clueless). In contrast to other datasets, ConanDoyle-neg also annotates negated events that are part of the scopes.
The analysis of negation is divided into two related sub-tasks, cue detection and scope resolution. While cue detection is mostly dependent on lexical or morphological features, relating cues to scopes is a structured prediction problem and will likely benefit from an analysis of morpho-syntactic or surface-semantic properties. The UiO2 system SHERLOCK (Lapponi et al., 2012), the winner of the open track of the *SEM 2012 shared task, uses morpho-syntactic parts of speech and syntactic dependencies to classify tokens as either in-scope or out-of-scope using a conditional random field (CRF). Another CRF further classifies scope tokens as events, and a heuristic is applied to distribute scope tokens to their respective cues. The SHERLOCK system was subsequently used by Elming et al. (2013) to evaluate various dependency conversions, and similarly served as one of three reference 'downstream' applications in the 2017 Extrinsic Parser Evaluation initiative (EPE). The best results from this evaluation define the state of the art in NR.
Figure 1: An example of how overlapping ConanDoyle-neg annotations are converted to flat sequences of labels in SHERLOCK, shown for the sentence "we have never gone out without keeping a sharp watch , and no one could have escaped our notice ." with three marked cues. In this example, an in-scope token is labelled with N, a cue with CUE, a negated event with E, a negation stop with S, and an out-of-scope token with O. Illustration taken from .
Deviating from the original *SEM 2012 setup, Packard et al. (2014) simplified the task to only evaluate the performance on finding scope tokens, assuming gold-standard information about negation cues. Fancellu et al. (2016) continued this trend, additionally treating each negation instance separately, and successfully used BiLSTMs (bidirectional Long Short-Term Memory recurrent neural networks; Hochreiter and Schmidhuber, 1997). Recently, Sergeeva et al. (2019) used pre-trained transformers (Vaswani et al., 2017), namely BERT (Devlin et al., 2019), to further improve performance, albeit on a derivative of the original dataset (Liu et al., 2018). Using BERT in a two-stage sequence-labelling approach on the original ConanDoyle-neg corpus and other relevant negation corpora, Khandelwal and Sawant (2020) improved on previous results by a considerable margin. The 2018 follow-up to the EPE shared task (Fares et al., 2018) again used SHERLOCK to evaluate parsing performance, this time restricting itself to participating systems in the co-located 2018 CoNLL Shared Task on Universal Dependency Parsing (Zeman et al., 2018).

Task and Data
We target the original *SEM 2012 shared task and aim to predict both negation cues and their scopes. We compare our approach with the baseline SHERLOCK system and the state-of-the-art systems identified through the EPE shared tasks.

Data
The negation data of *SEM 2012 consists of selected Sherlock Holmes stories from the works of Arthur Conan Doyle, and contains 3,644 sentences in the training set, 787 sentences in the development set, and 1,089 sentences in the evaluation set. The corpus annotates a total of 1,420 instances of negation. Several sentences contain two or more instances of negation, while 4,294 sentences do not contain any at all.
Negation instances are annotated as tri-partite structures: Negation cues can be full tokens, multiword expressions, or affixal sub-tokens. For each cue, its scope is defined as the possibly discontinuous sequence of (sub-)tokens affected by the negation. Additionally, a subset of in-scope tokens can be marked as negated events (or states), provided that the sentence is factual and the events in question did not take place. For sentences containing multiple negation instances, their respective scope and event spans may nest or overlap.
The systems submitted to the EPE 2017 and 2018 tasks work on 'raw', unsegmented text, and apply different segmentation strategies. To evaluate these systems in the context of negation resolution, the gold-standard negation annotations have to be retrofitted to each system's output. Each system is then tested against its own 'personalized' gold standard. For more information on this projection procedure, we refer to .

Baseline System
As in the EPE shared tasks, our baseline is the SHERLOCK system of Lapponi et al. (2012), which approaches NR as a token-based sequence labelling problem and uses a Conditional Random Field (CRF) classifier (Lavergne et al., 2010). The token-wise negation annotations contain multiple layers of information. Tokens may or may not be negation cues; they can be in or out of scope for a specific cue; in-scope tokens may or may not be negated events. Moreover, as already stated, multiple negation instances may be (partially or fully) overlapping. Before presenting the CRF with the annotations, SHERLOCK 'flattens' all negation instances in a sentence, assigning a six-valued extended begin-inside-outside labelling scheme, as indicated in Figure 1. After classification, hierarchical (overlapping) negation structures are reconstructed using a set of post-processing heuristics. The features of the classifier include different combinations of token-level observations, such as surface forms, part-of-speech tags, lemmas, and dependency labels. In addition, SHERLOCK employs features encoding both token and dependency distance to the nearest cue, together with the full shortest dependency paths. In the EPE context, gold-standard negation cues were provided as input to SHERLOCK.
Figure 2 (example sentence): ". . . an unmitigated scoundrel for whom there was neither pity nor excuse ."

Approach
In this section we define our graph-based encoding of negation structures, and present our parsing system and training procedure.

Negation Graphs
Instead of labelling each token sequentially with cue, scope or event markers, we reformulate NR as a parsing task, creating dependency-style negation graphs with lexicalized nodes and bilexical arcs $i \to j$ between a head $i$ and a dependent $j$ as target structures. This formulation allows us to more naturally encode the relationship between tokens and their cue(s), while being able to easily differentiate between regular scopes and events.
An example of a negation graph is shown in Figure 2. We adopt a convention from dependency parsing and visualize negation graphs with their nodes laid out as the words of the respective sentence, and their arcs drawn above the nodes. When transforming negation annotations into graphs, we mark negation cues $i$ by special arcs $r_0 \to i$ emanating from an artificial root node $r_0$. Scope and event tokens are marked by appropriately labelled arcs from their respective cue(s). For multi-word cues, only the first cue token is assigned as a root, while the remaining tokens are connected to the first with arcs labelled M. Since we do not split tokens into subtokens, we mark the full token containing an affixal cue as root. The negated part of the token is (by convention) annotated as an event, and thus marked by an appropriately labelled loop.
The resulting graphs thus contain unconnected nodes, multiple structural roots (dependents of the artificial root node $r_0$), loops, and nodes with multiple incoming arcs. Sentences that do not contain any negations are represented by empty graphs.
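The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' conversion code: the label names "cue", "scope" and "event" are assumptions (only the label M is named in the text), as is the choice to give event tokens an event arc only rather than both a scope and an event arc.

```python
from typing import Dict, List, Set, Tuple

ROOT = 0  # index of the artificial root node r0; tokens are numbered from 1

Arc = Tuple[int, int, str]  # (head, dependent, label)


def negation_to_arcs(cue: List[int], scope: List[int], event: List[int]) -> Set[Arc]:
    """Encode one negation instance as a set of labelled arcs.

    All label names except "M" are illustrative assumptions.
    """
    first = min(cue)                                        # first cue token acts as structural root
    arcs: Set[Arc] = {(ROOT, first, "cue")}                 # root arc marks the cue
    arcs |= {(first, t, "M") for t in cue if t != first}    # remaining tokens of a multi-word cue
    arcs |= {(first, t, "event") for t in event}            # a loop if the cue is affixal
    arcs |= {(first, t, "scope") for t in scope if t not in event}
    return arcs


def sentence_graph(instances: List[Dict[str, List[int]]]) -> Set[Arc]:
    """Union of the arcs of all negation instances; empty if there is no negation."""
    graph: Set[Arc] = set()
    for instance in instances:
        graph |= negation_to_arcs(**instance)
    return graph


# Hypothetical toy annotation (not the gold annotation of any corpus sentence):
# tokens 1..6 = there was neither pity nor excuse, with the cue "neither ... nor".
toy = sentence_graph([{"cue": [3, 5], "scope": [1, 2, 4, 6], "event": []}])
```

Because the result is a plain set of arcs, nodes with several incoming arcs (tokens in the scope of more than one cue) and unconnected nodes need no special handling.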

Neural Model
With the translation of the negation annotation into graphs, we can use parsers that learn how to jointly predict cues and their respective scopes, avoiding a cascade of classifiers and heuristics as in the SHERLOCK system. Specifically, we use a reimplementation of the neural parser by Dozat and Manning (2018), which in turn is based on the architecture of Kiperwasser and Goldberg (2016). The parser learns to weigh all possible arcs, and predicts the output graph simply as the collection of all arcs with positive weights. At the heart of this parser is a bidirectional recurrent neural network with Long Short-Term Memory cells (BiLSTM; Hochreiter and Schmidhuber, 1997). Given an input sequence $x = x_1, \dots, x_n$ and corresponding word embeddings $w_i$, the network outputs a sequence of context-dependent embeddings $c_i$. We augment the input word embeddings $w_i$ with additional part-of-speech tag and lemma embeddings, embeddings created by a character-based LSTM, and 100-dimensional GloVe (Pennington et al., 2014) embeddings. Based on the context-dependent embeddings, two feedforward neural networks (FNN) create specialized representations of each word as a potential head and as a potential dependent. These new representations are then scored via a bilinear model with weight tensor $U$. The inner dimension of the tensor $U$ corresponds to the number of negation graph labels plus a special NONE label indicating the absence of an arc, and thus predicts arcs and labels jointly.
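The scoring step can be illustrated with a small pure-Python sketch. It assumes the usual bilinear form, in which the score of label l for an arc from head i to dependent j is h_i^T U_l d_j, and it assumes that decoding keeps the highest-scoring label for each pair unless that label is NONE. The label inventory, the function names and the decoding rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]

# "NONE" stands for "no arc"; the other label names are illustrative.
LABELS = ["NONE", "cue", "scope", "event", "M"]


def bilinear(h: Vector, U_l: Matrix, d: Vector) -> float:
    """Compute h^T U_l d for one head vector, one label slice of U, one dependent vector."""
    return sum(h[a] * U_l[a][b] * d[b] for a in range(len(h)) for b in range(len(d)))


def decode(heads: List[Vector], deps: List[Vector], U: List[Matrix]) -> Dict[Tuple[int, int], str]:
    """Score every (head, dependent) pair under every label; keep the best non-NONE label."""
    arcs: Dict[Tuple[int, int], str] = {}
    for i, h in enumerate(heads):
        for j, d in enumerate(deps):
            scores = [bilinear(h, U_l, d) for U_l in U]
            best = max(range(len(LABELS)), key=lambda l: scores[l])
            if LABELS[best] != "NONE":
                arcs[(i, j)] = LABELS[best]
    return arcs
```

Since every pair is scored independently, nothing prevents the decoder from producing loops, several roots, or nodes with more than one incoming arc, which is what the negation graphs require.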

Adding External Graph Features
Similarly to SHERLOCK, our neural model is able to process external morpho-syntactic or surface-semantic analyses of the input sentence in the form of dependency graphs. Inspired by Kurtz et al. (2019), we extend the contextualized embeddings that are computed by our parser with information derived from the external graph. For this we use three approaches: (i) attaching the sum of heads; (ii) scaled attention on the heads; and (iii) Graph Convolutional Networks (Kipf and Welling, 2017). In the following, we view the external graph in terms of its $n \times n$ adjacency matrix $A$ and the contextualized embeddings as an $n \times d$ matrix $C$.

Sum of Heads
The first method generalizes that of Kurtz et al. (2019), who concatenate to each contextualized embedding the contextualized embedding of its head. This only works when the graphs are trees, that is, when every node has one incoming arc. When there is more than one incoming arc, we instead sum up all respective contextual embeddings. We express this as a matrix product: $\mathrm{sumoh}(A, C) = A^{\top} C$.
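As a small worked example, the sum over heads can be computed with plain matrix operations. The sketch assumes that A[i][j] = 1 encodes an arc from head i to dependent j, so that summing the head embeddings of every node amounts to multiplying the transposed adjacency matrix with C. The helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def transpose(M: Matrix) -> Matrix:
    return [list(row) for row in zip(*M)]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def sum_of_heads(A: Matrix, C: Matrix) -> Matrix:
    """Row j of the result is the sum of the embeddings of all heads of node j."""
    return matmul(transpose(A), C)


# Node 2 has two heads (nodes 0 and 1); nodes 0 and 1 have no head.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
C = [[1.0, 0.0],
     [0.0, 1.0],
     [5.0, 5.0]]
summed = sum_of_heads(A, C)  # rows for nodes 0 and 1 are zero; row 2 is C[0] + C[1]
```

Each row of the result would then be concatenated to the corresponding row of C, as described above for the tree case.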

Scaled Attention
The second approach is inspired by Vaswani et al. (2017), who compute the (scaled) dot product attention $QK^{\top}$ between a matrix of queries $Q$ and a matrix of keys $K$, and normalize it by a row-wise softmax function, which yields probabilistic weights on potential values. Noting the similarity between this normalized attention matrix and a probabilistic adjacency matrix, we replace $QK^{\top}$ with the matrix $A$; in the resulting expression, $d$ is the size of the contextualized embeddings. In our case, where we merely want to extract features from a given graph, the matrix $A$ is known and sparse; but the same scaled attention model could also be used in a multi-task setup to jointly learn to parse syntactico-semantic graphs and negations, in which case $A$ would be learned and dense.
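The following sketch shows one plausible reading of this description: the transposed adjacency matrix is divided by the square root of d, normalized with a row-wise softmax, and multiplied with C as the value matrix. The use of the transpose (so that each node attends to its heads, with A[i][j] = 1 for an arc from head i to dependent j) and the choice of C as values are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def transpose(M: Matrix) -> Matrix:
    return [list(row) for row in zip(*M)]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)                              # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(A: Matrix, C: Matrix) -> Matrix:
    """softmax(A^T / sqrt(d)) C, with d the width of C (assumed formulation)."""
    scale = math.sqrt(len(C[0]))
    weights = [softmax([a / scale for a in row]) for row in transpose(A)]
    return matmul(weights, C)


A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
C = [[1.0, 0.0],
     [0.0, 1.0],
     [5.0, 5.0]]
attended = scaled_attention(A, C)
```

Under this reading a node without heads receives uniform weights, and therefore the mean of all embeddings, while a node with heads is pulled towards the embeddings of those heads.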

Graph Convolutional Networks
Graph Convolutional Networks (GCNs; Kipf and Welling, 2017) generalize convolutional networks to graph-structured data. While they were developed with graphs much larger than our negation graphs in mind, Marcheggiani and Titov (2017) showed their usefulness for semantic role labelling. With $X_0 = C$ at the first level, we compute, for each level $l > 0$, a combined representation of heads (H), dependents (D), and the nodes themselves (S), weighted by layer-specific weight matrices $W_l$. When applying the next layer $l$, each node is updated with respect to its representation $X_{l-1}$ from the previous layer, thus indirectly taking into account grandparents and grandchildren. As this method is the only one that not only uses a node's head but also its dependents, we expect it to benefit the most from external graph features.
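A single layer of this kind can be sketched as follows. The sketch assumes the update X_l = ReLU(A^T X_{l-1} W_H + A X_{l-1} W_D + X_{l-1} W_S), with A[i][j] = 1 for an arc from head i to dependent j. The ReLU activation, the absence of bias terms and normalization, and the weight names are assumptions made for illustration, not details confirmed by the text.

```python
from typing import List

Matrix = List[List[float]]


def transpose(M: Matrix) -> Matrix:
    return [list(row) for row in zip(*M)]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def gcn_layer(A: Matrix, X: Matrix, W_H: Matrix, W_D: Matrix, W_S: Matrix) -> Matrix:
    """One layer combining messages from heads, from dependents, and from the node itself."""
    from_heads = matmul(matmul(transpose(A), X), W_H)   # (A^T X)[j] sums the heads of j
    from_deps = matmul(matmul(A, X), W_D)               # (A X)[i] sums the dependents of i
    from_self = matmul(X, W_S)
    return [
        [max(0.0, h + d + s) for h, d, s in zip(h_row, d_row, s_row)]
        for h_row, d_row, s_row in zip(from_heads, from_deps, from_self)
    ]


# Chain 0 -> 1 -> 2 with one-dimensional embeddings and identity weights.
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
X0 = [[1.0], [10.0], [100.0]]
W = [[1.0]]
X1 = gcn_layer(A, X0, W, W, W)
X2 = gcn_layer(A, X1, W, W, W)
```

After the second layer, the representation of node 0 contains a contribution from node 2, its grandchild, which is the indirect effect described above.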

Experiments
In this section we describe our experiments and review our baselines, methodology, and reported results.

Training
Our parser is trained with a softmax cross-entropy loss using the Adam optimizer (Kingma and Ba, 2015) and mini-batching. The training objective for our negation parsing system does not directly match the official evaluation measures, but is instead based on labelled per-arc F1 scores (i.e. the harmonic mean of precision and recall), which measure the number of correctly and incorrectly predicted arcs and labels. For model selection, we train for 200 epochs and choose the model instance that performs best on the development set. Our network sizes, dropout rates, and training parameters are shown in Table 1. Despite having fewer than half as many trainable parameters as the model by Dozat and Manning (2018), our model is still prone to overfitting, partly due to the rather small size of the training data. Hence we use only slightly smaller dropout rates than Dozat and Manning (2018). Following Gal and Ghahramani (2016), we apply variational dropout, sharing the same dropout mask between all time steps in a sequence, and DropConnect (Wan et al., 2013; Merity et al., 2017) on the hidden states of the BiLSTM.
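For concreteness, a labelled per-arc F1 score can be computed as in the sketch below. It assumes that arcs are represented as (head, dependent, label) triples and that counts are pooled over all sentences before precision and recall are computed; both choices are illustrative assumptions.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)


def labelled_arc_f1(gold: List[Set[Arc]], pred: List[Set[Arc]]) -> float:
    """Corpus-level F1 over labelled arcs; an arc counts only if its label matches too."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data: the second sentence contains no negation (empty graphs).
gold = [{(0, 3, "cue"), (3, 1, "scope"), (3, 4, "event")}, set()]
pred = [{(0, 3, "cue"), (3, 1, "scope"), (3, 4, "scope")}, set()]
score = labelled_arc_f1(gold, pred)  # two of three arcs match, including their labels
```

In the toy data the mislabelled arc counts both as a false positive and as a false negative, so precision and recall are each two thirds.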

Evaluation Measures
Standard evaluation measures for the original *SEM 2012 task include scope tokens (ST), scope match (SM), event tokens (ET), and full negation (FN) F1 scores. ST and ET are token-level scores for in-scope and negated event tokens, respectively, where a true positive is a correctly retrieved token of the relevant class (Morante and Blanco, 2012). FN is the strictest of these measures (and the primary evaluation metric for the NR part of the EPE shared task), counting as true positives only perfectly retrieved full scopes, including an exact match on negated events.

Baselines
In order to have a fair comparison with the previous results of the 2017 and 2018 EPE shared tasks, we evaluate on the ConanDoyle-neg data as processed by the best-performing systems from the two editions. The best-performing system on the negation task of the 2017 edition of EPE, STANFORD-PARIS-06 (Schuster et al., 2017), uses enhanced Universal Dependencies (v1) and data from the Penn Treebank (Marcus et al., 1993), the Brown Corpus (Francis and Kučera, 1985) and the GENIA treebank (Tateisi et al., 2005). In contrast to this, the best-performing system for the 2018 edition, TURKUNLP (Kanerva et al., 2018), only uses the English training data provided by the co-located UD parsing shared task. Both systems use the parser and hyperparameters of Dozat et al. (2017), the winning submission of the CoNLL 2017 Shared Task on parsing Universal Dependencies.
In the overview paper for the 2018 EPE shared task (Fares et al., 2018), the organizers report that the version of the SHERLOCK negation system that was used for EPE 2017 had a deficiency that could leak gold-standard scope and event annotations into system predictions, leading to potentially inflated scores. The EPE 2018 version of SHERLOCK corrected this problem and added automated hyperparameter tuning, which Fares et al. (2018) suggest largely offset the negative effect on overall scores from the bug fix, at least when averaging over all submissions. They did not, however, rerun the EPE 2017 evaluation with the corrected and enhanced version of SHERLOCK, leaving substantive uncertainty about current state-of-the-art results. We address this problem by applying the improved (i.e. 2018) version of the baseline system, including the exact same tuning procedure described by Fares et al. (2018), to the originally best-performing STANFORD-PARIS dependency graphs. In this replication study, we observe a large (5 points FN F1) drop in performance compared to the originally reported results. While STANFORD-PARIS still outperforms TURKUNLP, the margin between the two systems is narrowed down to less than 2 points FN F1.

Experiments
We report two sets of experiments. For all experiments, we run each of our neural network models 10 times with different random seeds and choose the best-performing model with respect to performance on the development set in terms of FN F1.
Gold-Standard Cues Even though our approach can predict negation cues on its own, for our first set of experiments, we follow the setup of the EPE tasks and predict only scopes and events, adding gold-standard cues as external graph features. Overlapping the gold-cue inputs with the additional graph inputs is not optimal but avoids adding more complexity to the model. Similar to SHERLOCK, we handle affixal cues in post-processing, splitting and classifying five known prefixes and one suffix as cues, and the remainder as the negated event.
Table 3: Results of our NR parser on the STANFORD-PARIS and TURKUNLP versions of the ConanDoyle-neg development and evaluation sets when cues are predicted. The numerically best results are shown in bold. We test for significant differences between our gcn with syntax models for STANFORD-PARIS and TURKUNLP and respective models using no additional inputs. Only the *-marked measures are significantly different from their •-marked counterparts.
Figure 4: Boxplots visualizing the variance of performance on the evaluation set for the systems additionally predicting cues (panels: Stanford-Paris, TurkuNLP). We compare the four models using no syntax and using syntax with each of the three methods.
The results for these experiments are reported in Table 2. On the STANFORD-PARIS version of the evaluation data, our model with external syntactic features via GCNs (gcn with syntax) outperforms the SHERLOCK baseline by 2.85 FN F1 points; on the TURKUNLP version, our best model uses syntactic features via scaled attention (scatt with syntax), beating the baseline by 1.84 FN F1 points.
Predicted Cues For the second set of experiments, we also predict negation cues, and additionally report the F1 for cues (CUE). In order to put these results into perspective, we contrast them with the winning system of the *SEM 2012 shared task by , and also with the MRS Crawler of Packard et al. (2014). The results for these experiments are reported in Table 3. Our best models for both versions of the evaluation data are the ones that do not use external syntactic features at all (no syntax), with FN F1 scores of 59.40 (STANFORD-PARIS) and 55.18 (TURKUNLP), respectively. The former result is 1.77 points higher than the result reported by .
Significance Testing Given the rather small size of the dataset, we follow the advice of Dror et al. (2018) and test for significance using the bootstrap method (Berg-Kirkpatrick et al., 2012). We compare our best-performing system for both STANFORD-PARIS and TURKUNLP with the respective SHERLOCK systems, resampling the test sets $10^6$ times and setting our threshold to 5%, following standard methodology. For the second set of experiments, where we additionally predict negation cues, we compare our best system to our second-best system. We furthermore visualize the variance of performance across all 10 systems on the evaluation sets in Figures 3 and 4.
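The resampling procedure can be sketched as a paired bootstrap in the style of Berg-Kirkpatrick et al. (2012). The sketch assumes that each system is summarized by per-sentence counts of true positives, predicted items and gold items, that F1 is computed from pooled counts, and that a much smaller number of resamples than $10^6$ suffices for illustration. The function names and the guard for non-positive observed differences are assumptions.

```python
import random
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # per-sentence (true positives, predicted, gold)


def f1(counts: Sequence[Counts]) -> float:
    true_pos = sum(c[0] for c in counts)
    n_pred = sum(c[1] for c in counts)
    n_gold = sum(c[2] for c in counts)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(sys_a: Sequence[Counts], sys_b: Sequence[Counts],
                     resamples: int = 10_000, seed: int = 0) -> float:
    """Estimate a p-value for the claim that system A is better than system B."""
    observed = f1(sys_a) - f1(sys_b)
    if observed <= 0:
        return 1.0                       # A is not better on the test set itself
    rng = random.Random(seed)
    n = len(sys_a)
    exceed = 0
    for _ in range(resamples):
        # Resample sentences with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = f1([sys_a[i] for i in idx]) - f1([sys_b[i] for i in idx])
        if delta > 2 * observed:
            exceed += 1
    return exceed / resamples
```

A difference would be called significant at the 5% threshold if the returned value is below 0.05.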

Discussion
In this section we discuss the results of our experiments and place them in the broader context of the research literature on negation resolution.

Gold Cues
We first discuss our results when using gold-standard cues, as in the EPE tasks.
Effect of Pre-processing Similar to the SHERLOCK baseline system, our system also performs better with STANFORD-PARIS rather than with TURKUNLP processed data (64.27 vs. 61.58 FN F1), even when no syntactic inputs are used (62.15 vs. 60.48 FN F1). The tokenization, part-of-speech tagging and lemmatization done by STANFORD-PARIS thus seem to better fit the NR task, and have likely also benefitted from the larger and more diverse data used during training.
Handling Additional Inputs The most effective method to handle gold cues at input time, it turns out, is our simplest method, concatenating each contextual token with the sum of its heads. A likely explanation for this is that this method is able to directly read off the gold-cue information. This method, however, is clearly not able to handle additional syntactic inputs (losing 4.47 FN F1 points for STANFORD-PARIS), motivating the use of either of the more advanced techniques. Combining STANFORD-PARIS syntactic trees with the GCN clearly performs best here, but does not point towards a general trend; the plots in Figure 3 rather show that most of the systems perform similarly, with the exception of the sum-of-heads method when using additional syntactic inputs.

Predicted Cues
When we task our system to also predict cues, as in *SEM 2012, our best system outperforms  and Packard et al. (2014) on most measures. Our neural graph parsing approach is clearly better at identifying the relevant scope tokens (ST), due to its pairwise classification approach, respectively gaining 5.32 and 2.29 points in ST F1. This generally also results in better performance for matching complete scopes (SM). The system does however struggle with telling events and regular scopes apart, and is clearly outperformed by  on that measure (6.36 points ET F1 for STANFORD-PARIS no syntax). Our system differentiates between scopes and events using arc labels only, and might not have seen enough data to sufficiently train the labelling part of the network. One slightly surprising result is that, even though our best systems for both the STANFORD-PARIS and the TURKUNLP version of the evaluation data use syntactic inputs when gold-standard cues are provided, our best systems for also predicting cues do not rely on syntactic inputs at all.

Significant Learning
While the boxplots in Figures 3 and 4 show the same general trends as our particular systems in Tables 2 and 3, they also illustrate the considerable variance of performance between runs. Choosing the final system with regard to performance on the development sets may lead to state-of-the-art performance on the evaluation sets; this is the case for our best-performing system using gold cues and additional syntax processed by a GCN, which performs more than two points of FN F1 better than the average system of its kind. However, we also see examples of the opposite. Our system using gold cues with scaled attention on STANFORD-PARIS, for example, performs more than two points FN F1 worse than the average on the evaluation set, even though it performs three points better on the development set. Thus, at least in this study, good performance on the development sets is not necessarily an indication of good performance on the final evaluation set. This notion is further reinforced by the lack of significant difference in performance of our best systems, compared to SHERLOCK. For the STANFORD-PARIS version of the data, even nearly three points of FN F1 (64.27 vs. 61.42) do not constitute a significant difference. The difficulty of confidently analysing the results is also illustrated by the somewhat erratic performance differences across different settings and runs.
The NLP community has recently realized the importance of proper testing over simple comparisons of benchmark scores (Gorman and Bedrick, 2019). This becomes even more pronounced when working with deep learning architectures, where model selection is more complicated (Moss et al., 2019) due to sensitivity to different random seeds. When working with smaller datasets such as ConanDoyle-neg, it is particularly important to thoroughly analyse the results before claiming that a new system improves the state of the art (Dror et al., 2017).

Conclusion
We have introduced a novel approach to negation resolution that remodels negation annotations into dependency-style graph structures. These negation graphs directly encode the pairwise cue-scope relationships, and thus enable our neural network to more easily learn them. We extended an already powerful neural graph-parsing approach further to additionally use arbitrary dependency graph structures as inputs. To validate our method, we revisited the EPE 2017 and 2018 shared tasks and the full *SEM 2012 shared task on negation resolution, clearly outperforming each previously best system, although none of our results is statistically significant.
We believe that our approach can be used to restructure other tasks as dependency-style graphs in similar fashion, and thus reuse existing systems as general-purpose tools. Recasting the negation resolution task as a graph-parsing problem allows us to straightforwardly use a variety of such tools. With most of these now using neural networks, we can extend them to employ massive pre-trained models such as BERT (Devlin et al., 2019) or ELMo (Peters et al., 2018). This would allow us to leverage their general power for more specific tasks that have only limited data available.