Neural models of factuality

We present two neural models for event factuality prediction, which yield significant performance gains over previous models on three event factuality datasets: FactBank, UW, and MEANTIME. We also present a substantial expansion of the It Happened portion of the Universal Decompositional Semantics dataset, yielding the largest event factuality dataset to date. We report model results on this extended factuality dataset as well.


Introduction
A central function of natural language is to convey information about the properties of events. Perhaps the most fundamental of these properties is factuality: whether an event happened or not.
A natural language understanding system's ability to accurately predict event factuality is important for supporting downstream inferences that are based on those events. For instance, if we aim to construct a knowledge base of events and their participants, it is crucial that we know which events to include and which to exclude. The event factuality prediction task (EFP) involves labeling event-denoting phrases (or their heads) with the (non)factuality of the events denoted by those phrases (Saurí and Pustejovsky, 2009, 2012; de Marneffe et al., 2012). Figure 1 exemplifies such an annotation for the phrase headed by leave in (1), which denotes a factual event (⊕ = factual, ⊖ = nonfactual).
(1) Jo failed to leave no trace. ⊕

In this paper, we present two neural models of event factuality (and several variants thereof). We show that these models significantly outperform previous systems on four existing event factuality datasets: FactBank (Saurí and Pustejovsky, 2009), the UW dataset (Lee et al., 2015), MEANTIME (Minard et al., 2016), and Universal Decompositional Semantics It Happened v1 (UDS-IH1; White et al., 2016). We also demonstrate the efficacy of multi-task training and ensembling in this setting. In addition, we collect and release an extension of the UDS-IH1 dataset, which we refer to as UDS-IH2, to cover the entirety of the English Universal Dependencies v1.2 (EUD1.2) treebank (Nivre et al., 2015), thereby yielding the largest event factuality dataset to date. 1

We begin with theoretical motivation for the models we propose, as well as discussion of prior EFP datasets and systems (§2). We then describe our own extension of the UDS-IH1 dataset (§3), followed by our neural models (§4). Using the data we collect, along with the existing datasets, we evaluate our models (§6) in five experimental settings (§5) and analyze the results (§7).
(2) a. Jo didn't leave.
    b. Jo might leave.
    c. Jo left no trace.
    d. Jo never left.
    e. Jo failed to leave.
    f. Jo's leaving was fake.
    g. Jo's leaving was a hallucination.
Further, such words can interact to yield nontrivial effects on factuality inferences: (3a) conveys that the leaving didn't happen, while the superficially similar (3b) does not.
(3) a. Jo didn't remember to leave. ⊖
    b. Jo didn't remember leaving. ⊕

A main goal of many theoretical treatments of factuality is to explain why these sorts of interactions occur and how to predict them. It is not possible to cover all the relevant literature in depth, so we focus instead on the broader kinds of interactions our models need to capture in order to correctly predict the factuality of an event denoted by a particular predicate: interactions between that predicate's outside and inside context, exemplified in Figure 1.
(4) a. Jo forgot that Bo left. ⊕
    b. Jo forgot to leave. ⊖
(5) a. Jo didn't forget that Bo left. ⊕
    b. Jo didn't forget to leave. ⊕

When a predicate directly embedded by forget is tensed, as in (4a) and (5a), we infer that the embedded predicate denotes a factual event, regardless of whether forget is negated. In contrast, when a predicate directly embedded by forget is untensed, as in (4b) and (5b), our inference depends on whether forget is negated. Thus, any model that correctly predicts factuality must not only represent the effect of individual words in the outside context on factuality inferences; it must also represent their interactions.
Inside context Knowledge of the inside context is important for integrating factuality information coming from a predicate's arguments, e.g. from determiners like some and no.
(6) a. Some girl ate some dessert. ⊕
    b. Some girl ate no dessert. ⊖
    c. No girl ate no dessert. ⊕

In simple monoclausal sentences like those in (6), the number of arguments containing a negative quantifier, like no, determines the factuality of the event denoted by the verb. An even number (or zero) yields a factuality inference (⊕), and an odd number yields a nonfactuality inference (⊖). Thus, as for the outside context, any model that correctly predicts factuality must integrate interactions between words in the inside context.
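As a toy illustration, the parity rule in (6) can be written out directly (a hypothetical helper for exposition only; a real model must learn this behavior compositionally from data):

```python
def factuality_by_parity(n_negative_args: int) -> str:
    """Parity rule from (6): an even number (or zero) of negatively
    quantified arguments yields a factual reading; an odd number
    yields a nonfactual one."""
    return "factual" if n_negative_args % 2 == 0 else "nonfactual"

# (6a) "Some girl ate some dessert."  -> 0 negative quantifiers
# (6b) "Some girl ate no dessert."    -> 1
# (6c) "No girl ate no dessert."      -> 2
print([factuality_by_parity(n) for n in (0, 1, 2)])
# -> ['factual', 'nonfactual', 'factual']
```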

The (non)necessity of syntactic information
One question that arises in the context of inside and outside information is whether syntactic information is strictly necessary for capturing the relevant interactions between the two. To what extent is linear precedence sufficient for accurately computing factuality?
We address these questions using two bidirectional LSTMs: one with a linear chain topology and another with a dependency tree topology. Both networks capture context on either side of an event-denoting word, but each does so in a different way, depending on its topology. We show below that, while both networks outperform previous models that rely on deterministic rules and/or hand-engineered features, the linear chain-structured network reliably outperforms the tree-structured network.

Event factuality datasets

Saurí and Pustejovsky (2009) present the FactBank corpus of event factuality annotations, built on top of the TimeBank corpus (Pustejovsky et al., 2006). These annotations (performed by trained annotators) are discrete, consisting of an epistemic modal {certain, probable, possible} and a polarity {+, −}. In FactBank, factuality judgments are relativized to a source; following recent work, we consider only judgments with respect to a single source: the author. The smaller MEANTIME corpus (Minard et al., 2016) provides similar discrete annotations. Lee et al. (2015) construct an event factuality dataset, henceforth UW, on the TempEval-3 data (UzZaman et al., 2013) using crowdsourced annotations on a [−3, 3] scale (certainly did not happen to certainly did), with over 13,000 predicates. Adopting the [−3, 3] scale of Lee et al. (2015), Stanovsky et al. (2017) assemble a Unified Factuality dataset, mapping the discrete annotations of both FactBank and MEANTIME onto the UW scale. Each scalar annotation corresponds to a token representing the event, and each sentence may have more than one annotated token.

The UDS-IH1 dataset (White et al., 2016) consists of factuality annotations over 6,920 event tokens, obtained with another crowdsourcing protocol. We adopt this protocol, described in §3, to collect roughly triple this number of annotations. We train and evaluate our factuality prediction models on this new dataset, UDS-IH2, as well as the unified versions of UW, FactBank, and MEANTIME. Table 1 shows the number of annotated predicates in each split of each factuality dataset used in this paper.

Annotations relevant to event factuality and polarity appear in a number of other resources, including the Penn Discourse Treebank (Prasad et al., 2008), the MPQA Opinion Corpus (Wiebe and Riloff, 2005), the LU corpus of author belief commitments (Diab et al., 2009), and the ACE and ERE formalisms. Soni et al. (2014) annotate Twitter data for factuality.

Event factuality systems

Nairn et al. (2006) propose a deterministic algorithm based on hand-engineered lexical features for determining event factuality. They associate certain clause-embedding verbs with implication signatures (Table 2), which are used in a recursive polarity propagation algorithm. TruthTeller is also a recursive rule-based system for factuality ("predicate truth") prediction that uses implication signatures, as well as other lexical- and dependency-tree-based features (Lotan et al., 2013).

Several systems use supervised models trained over rule-based features. Diab et al. (2009) and Prabhakaran et al. (2010) use SVMs and CRFs over lexical and dependency features for predicting author belief commitments, which they treat as a sequence tagging problem. Lee et al. (2015) train an SVM on lexical and dependency path features for their factuality dataset. Saurí and Pustejovsky (2012) and Stanovsky et al. (2017) train support vector models over the outputs of rule-based systems, the latter using TruthTeller.

Data collection
Even the largest currently existing event factuality datasets are extremely small from the perspective of related tasks, like natural language inference (NLI). To begin to remedy this situation, we collect an extension of the UDS-IH1 dataset. The resulting UDS-IH2 dataset covers all predicates in EUD1.2. Beyond substantially expanding the amount of publicly available event factuality annotations, another major benefit is that EUD1.2 consists entirely of gold parses and has a variety of other annotations built on top of it, making future multi-task modeling possible.
We use the protocol described by White et al. (2016) to construct UDS-IH2. This protocol involves four kinds of questions for a particular predicate candidate:

1. UNDERSTANDABLE: whether the sentence is understandable
2. PREDICATE: whether or not a particular word refers to an eventuality (event or state)
3. HAPPENED: whether or not, according to the author, the event has already happened or is currently happening
4. CONFIDENCE: how confident the annotator is about their answer to HAPPENED, from 0 to 4

If an annotator answers no to either UNDERSTANDABLE or PREDICATE, HAPPENED and CONFIDENCE do not appear.
The main differences between this protocol and the others discussed above are: (i) instead of asking about annotator confidence, the other protocols ask the annotator to judge either source confidence or likelihood; and (ii) factuality and confidence are separated into two questions. We retain White et al.'s protocol to maintain consistency with the portions of EUD1.2 that were already annotated in UDS-IH1.
Annotators We recruited 32 unique annotators through Amazon's Mechanical Turk to annotate 20,580 total predicates in groups of 10. Each predicate was annotated by two distinct annotators. Including UDS-IH1, this brings the total number of annotated predicates to 27,289. Raw inter-annotator agreement for the HAPPENED question was 0.84 (Cohen's κ = 0.66) among the predicates annotated only for UDS-IH2. This compares to the raw agreement score of 0.82 reported by White et al. (2016) for UDS-IH1.
To improve the overall quality of the annotations, we filter annotations from annotators who display particularly low agreement with other annotators on HAPPENED and CONFIDENCE. (See the Supplementary Materials for details.)

Pre-processing To compare model results on UDS-IH2 to those on the unified datasets of Stanovsky et al. (2017), we map the HAPPENED and CONFIDENCE ratings to a single FACTUALITY value in [−3, 3]: we first take the mean confidence rating for each predicate, then set FACTUALITY to (3/4) · CONFIDENCE if HAPPENED, and −(3/4) · CONFIDENCE otherwise.
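This mapping can be sketched as follows (a minimal sketch; the function and variable names are ours, not from the released data format):

```python
def to_factuality(happened: bool, confidences: list) -> float:
    """Map HAPPENED and 0-4 CONFIDENCE ratings to a [-3, 3] FACTUALITY
    score: mean confidence scaled by 3/4, with sign set by HAPPENED."""
    mean_conf = sum(confidences) / len(confidences)
    scaled = 0.75 * mean_conf  # 4 * 3/4 = 3, so the range is [-3, 3]
    return scaled if happened else -scaled

print(to_factuality(True, [4, 4]))   # maximally confident "happened" -> 3.0
print(to_factuality(False, [2, 4]))  # mean confidence 3.0 -> -2.25
```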
Response distribution Figure 2 plots the distribution of factuality ratings in the train and dev splits of UDS-IH2, alongside those of FactBank, UW, and MEANTIME. One striking feature is that UDS-IH2 displays a much more entropic distribution than the other datasets. This may be because, unlike the newswire-heavy corpora that the other datasets annotate, EUD1.2 contains text from genres (weblogs, newsgroups, email, reviews, and question-answers) that tend to involve less reporting of raw facts. One consequence of this more entropic distribution is that, unlike for the datasets discussed above, it is much harder for systems that always guess 3 (i.e. factual with high confidence/likelihood) to perform well.

Models
We consider two neural models of factuality: a stacked bidirectional linear chain LSTM (§4.1) and a stacked bidirectional child-sum dependency tree LSTM (§4.2). To predict the factuality v_t for the event referred to by a word w_t, we use the hidden state at t from the final layer of the stack as the input to a two-layer regression model (§4.3).

Stacked bidirectional linear LSTM

We use a stacked bidirectional linear chain LSTM (stacked L-biLSTM). At each layer l and in each direction d ∈ {→, ←}, hidden states are computed with the standard LSTM equations, where prev→(t) = t − 1 in the forward direction and prev←(t) = t + 1 in the backward direction. The input x_t^(l,d) is the embedding of word w_t if l = 1, and the concatenation [h_t^(l−1,→); h_t^(l−1,←)] of the previous layer's hidden states otherwise. We set g to the pointwise nonlinearity tanh.

Stacked bidirectional tree LSTM
We use a stacked bidirectional extension of the child-sum dependency tree LSTM (T-LSTM; Tai et al., 2015), which is itself an extension of a standard unidirectional linear chain LSTM (L-LSTM). One way to view the difference between the L-LSTM and the T-LSTM is that the T-LSTM redefines prev→(t) to return the set of indices that correspond to the children of w_t in some dependency tree. Because the cardinality of these sets varies with t, it is necessary to specify how multiple children are combined. The basic idea, which we make explicit in the equations for our extension, is to define f_tk for each child index k ∈ prev→(t) in a way analogous to the equations in §4.1 (i.e. as though each child were the only child) and then to sum across k within the equations for i_t, o_t, ĉ_t, c_t, and h_t.
Our stacked bidirectional extension (stacked T-biLSTM) is a minimal extension of the T-LSTM in the sense that we merely define the downward computation in terms of a prev←(t) that returns the set of indices corresponding to the parents of w_t in some dependency tree (cf. Miwa and Bansal, 2016, who propose a similar, but less minimal, model for relation extraction). The same method for combining children in the upward computation can then be used for combining parents in the downward computation. This yields a minimal change to the stacked L-biLSTM equations.
We use a ReLU pointwise nonlinearity for g. These minimal changes allow us to represent the inside and the outside contexts of word w_t (at layer l) each as a single vector: the hidden state of the upward computation summarizes the inside context, and the hidden state of the downward computation summarizes the outside context. An important thing to note here is that, in contrast to other dependency tree-structured T-LSTMs (e.g. Iyyer et al., 2014), this T-biLSTM definition does not use the dependency labels in any way. Such labels could be straightforwardly incorporated to determine which parameters are used in a particular cell, but for current purposes, we retain the simpler structure (i) to more directly compare the L- and T-biLSTMs and (ii) because a model that uses dependency labels substantially increases the number of trainable parameters, relative to the size of our datasets.
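A single child-sum cell (upward direction only) can be sketched as follows, following Tai et al. (2015); the parameter shapes and names are our own illustrative choices, not the authors' released code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_cell(x, child_h, child_c, W, U, b, hidden=8):
    """One child-sum tree-LSTM step for word input x, given the children's
    hidden and cell states. A separate forget gate f_k is computed per
    child k, while i, o, and the candidate cell use the *sum* of the
    children's hidden states. Shapes: W (4*hidden, input_dim),
    U (4*hidden, hidden), b (4*hidden,)."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(hidden)
    zi, zf, zo, zc = np.split(W @ x + b, 4)
    ui, uf, uo, uc = np.split(U, 4)
    i = sigmoid(zi + ui @ h_sum)
    o = sigmoid(zo + uo @ h_sum)
    c_hat = np.tanh(zc + uc @ h_sum)
    c = i * c_hat
    # per-child forget gates, summed into the new cell state
    for h_k, c_k in zip(child_h, child_c):
        f_k = sigmoid(zf + uf @ h_k)
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
hid, dim = 8, 5
W = rng.normal(size=(4 * hid, dim))
U = rng.normal(size=(4 * hid, hid))
b = np.zeros(4 * hid)
# two leaves (no children) feeding one parent node
h1, c1 = child_sum_cell(rng.normal(size=dim), [], [], W, U, b)
h2, c2 = child_sum_cell(rng.normal(size=dim), [], [], W, U, b)
h, c = child_sum_cell(rng.normal(size=dim), [h1, h2], [c1, c2], W, U, b)
print(h.shape)  # (8,)
```

Because the children's states are summed, the same cell handles any number of children, which is exactly what a dependency tree requires.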

Regression model
To predict the factuality v_t for the event referred to by a word w_t, we use the hidden states from the final layer of the stacked L- or T-biLSTM as the input to a two-layer regression model.
The model is trained to minimize a regression loss, in this case smooth L1, i.e. Huber loss with δ = 1. This loss function is effectively a smooth variant of the hinge loss used by Lee et al. (2015) and Stanovsky et al. (2017).
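The smooth L1 loss can be written out in a few lines (a standalone sketch of the standard piecewise definition, for illustration):

```python
def smooth_l1(pred: float, gold: float) -> float:
    """Huber loss with delta = 1: quadratic near zero error, linear for
    |error| > 1, so large annotation disagreements are penalized less
    harshly than under squared error."""
    err = abs(pred - gold)
    return 0.5 * err ** 2 if err < 1.0 else err - 0.5

print(smooth_l1(2.5, 3.0))   # small error: quadratic region -> 0.125
print(smooth_l1(-3.0, 3.0))  # large error: linear region -> 5.5
```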
We also consider a simple ensemble method, wherein the hidden states from the final layers of both the stacked L-biLSTM and the stacked T-biLSTM are concatenated and passed through the same two-layer regression model. We refer to this as the H(ybrid)-biLSTM. The regression input is correspondingly larger for the stacked H-biLSTM (600-dimensional) than for the stacked L- and T-biLSTMs.
Bidirectional layers We consider stacked L-, T-, and H-biLSTMs with either one or two layers. In preliminary experiments, we found that networks with three layers badly overfit the training data.
Dependency parses For the T- and H-biLSTMs, we use the gold dependency parses provided in EUD1.2 when training and testing on UDS-IH2. On FactBank, MEANTIME, and UW, we follow Stanovsky et al. (2017) in using the automatic dependency parses generated by the parser in spaCy (Honnibal and Johnson, 2015). 3

Lexical features Recent work on neural models in the closely related domain of genericity/habituality prediction suggests that inclusion of hand-annotated lexical features can improve classification performance (Becker et al., 2017). To assess whether similar performance gains can be obtained here, we experiment with lexical features for simple factive and implicative verbs (Kiparsky and Kiparsky, 1970; Karttunen, 1971a). When in use, these features are concatenated to the network's input word embeddings so that, in principle, they may interact with one another and inform other hidden states in the biLSTM, akin to how verbal implicatives and factives are observed to influence the factuality of their complements. The hidden state size is increased to match the input embedding size. We consider two types of features.

Signature features We compute binary features based on a curated list of 92 simple implicative and 95 factive verbs, including their type-level "implication signatures," as compiled by Nairn et al. (2006). 4 These signatures characterize the implicative or factive behavior of a verb with respect to its complement clause, how this behavior changes (or does not change) under negation, and how it composes with other such verbs under nested recursion. We create one indicator feature for each signature type.
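Indicator features of this kind might be concatenated to word embeddings as follows (the signature inventory and verb-to-signature table below are hypothetical placeholders; the real lists follow Nairn et al., 2006):

```python
import numpy as np

# hypothetical signature inventory and lexicon, for illustration only
SIGNATURES = ["+/+", "+/-", "-/+", "-/-", "+/o", "-/o", "o/+", "o/-"]
VERB_SIGNATURE = {"manage": "+/-", "fail": "-/+", "forget": "-/+"}

def featurize(word: str, embedding: np.ndarray) -> np.ndarray:
    """Append a one-hot implication-signature indicator to the word's
    embedding; words without a known signature get an all-zero block."""
    feats = np.zeros(len(SIGNATURES))
    sig = VERB_SIGNATURE.get(word)
    if sig is not None:
        feats[SIGNATURES.index(sig)] = 1.0
    return np.concatenate([embedding, feats])

emb = np.zeros(300)  # a 300-d word embedding, for illustration
print(featurize("fail", emb).shape)   # (308,)
print(featurize("table", emb).sum())  # 0.0: no signature fires
```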
Mined features Using a simplified set of pattern-matching rules over Common Crawl data (Buck et al., 2014), we follow the insights of Pavlick and Callison-Burch (2016), henceforth PC, and use corpus mining to automatically score verbs for implicativeness. The insight of PC lies in Karttunen's (1971a) observation that "the main sentence containing an implicative predicate and the complement sentence necessarily agree in tense." Accordingly, PC devise a tense agreement score (effectively, the ratio of times an embedding predicate's tense matches the tense of the predicate it embeds) to predict implicativeness in English verbs. Their scoring method involves the use of fine-grained POS tags, the Stanford Temporal Tagger (Chang and Manning, 2012), and a number of heuristic rules. It confirmed that tense agreement statistics are predictive of implicativeness, illustrated in part by a near-perfect separation of a list of implicative and non-implicative verbs from Karttunen (1971a).

Table 3: Implicative (bold) and non-implicative (not bold) verbs from Karttunen (1971a) are nearly separable by our tense agreement scores, replicating the results of PC.
We replicate this finding by employing a simplified pattern-matching method over 3B sentences of raw Common Crawl text. We efficiently search for instances of any pattern of the form I $VERB to * $TIME, where $VERB and $TIME are pre-instantiated variables whose corresponding tenses are known, and '*' matches any one to three whitespace-separated tokens at runtime (not pre-instantiated). 5

Table 4: All 2-layer systems, and 1-layer systems if best in column. State-of-the-art in bold; † is best in column (with row shaded in purple). Key: L=linear, T=tree, H=hybrid, (1,2)=# layers, S=single-task specific, G=single-task general, +lexfeats=with all lexical features, MultiSimp=multi-task simple, MultiBal=multi-task balanced, MultiFoc=multi-task focused, w/UDS-IH2=trained on all data incl. UDS-IH2. All-3.0 is the constant baseline.

Our results in Table 3 constitute a replication of PC's findings. Prior work such as PC's is motivated in part by the potential for corpus-linguistic findings to be used as fodder in downstream predictive tasks: we include these agreement scores as potential input features to our networks to test whether contemporary models do in fact benefit from this information.
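A toy version of the tense-agreement computation is sketched below; the pattern, the tiny corpus, and the past-tense heuristic are illustrative only (the real pipeline runs over roughly 3B Common Crawl sentences with pre-instantiated verb forms):

```python
import re

PAST = {"yesterday", "last week"}
FUTURE = {"tomorrow", "next week"}
TIME_ALT = "|".join(sorted(PAST | FUTURE))

# "I $VERB to * $TIME": a past-tense matrix verb (crude -ed heuristic)
# followed by a 1-3 token wildcard and a known time expression
PATTERN = re.compile(
    rf"\bI (\w+ed) to (?:\S+ ){{1,3}}({TIME_ALT})\b", re.IGNORECASE)

def agreement_score(sentences):
    """Per-verb ratio of matches in which the matrix verb's tense (past)
    agrees with the tense of the time adverbial."""
    hits, agree = {}, {}
    for s in sentences:
        for verb, time in PATTERN.findall(s):
            verb = verb.lower()
            hits[verb] = hits.get(verb, 0) + 1
            agree[verb] = agree.get(verb, 0) + (time.lower() in PAST)
    return {v: agree[v] / hits[v] for v in hits}

corpus = [
    "I managed to file the report yesterday.",  # implicative: tenses agree
    "I managed to finish it last week.",
    "I hoped to see the show tomorrow.",        # non-implicative: disagree
    "I hoped to visit them yesterday.",
]
print(agreement_score(corpus))  # {'managed': 1.0, 'hoped': 0.5}
```

High agreement scores flag implicative behavior (managed), while verbs whose complements float free of the matrix tense (hoped) score lower.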
Training For all experiments, we use stochastic gradient descent to train the LSTM parameters and regression parameters end-to-end with the Adam optimizer (Kingma and Ba, 2015), using the default learning rate in pytorch (1e-3). We consider five training regimes: 6

1. SINGLE-TASK SPECIFIC (-S) Train a separate instance of the network for each dataset, training only on that dataset.
2. SINGLE-TASK GENERAL (-G) Train one instance of the network on the simple concatenation of all unified factuality datasets, {FactBank, UW, MEANTIME}.
3.-5. MULTI-TASK SIMPLE (-MULTISIMP), MULTI-TASK BALANCED (-MULTIBAL), and MULTI-TASK FOCUSED (-MULTIFOC) Train one instance of the network jointly across the factuality datasets, with the three regimes differing in how training examples are drawn from each dataset. See the Supplementary Materials for further details.

5 $TIME is instantiated with each of five past tense phrases ("yesterday," "last week," etc.) and five corresponding future tense phrases ("tomorrow," "next week," etc.); see the Supplement for further details.

6 "Multi-task" can have subtly different meanings in the NLP community; following terminology from Mou et al. (2016), our use is best described as "semantically equivalent transfer" with simultaneous (MULT) network training.

Calibration Post-training, network predictions are monotonically re-adjusted to a specific dataset using isotonic regression (fit on the train split only).
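Monotone calibration of this kind can be sketched with a small pool-adjacent-violators (PAV) fit; this is a generic isotonic-regression sketch, not the authors' implementation (which one could also obtain from scikit-learn's IsotonicRegression):

```python
def pav_fit(preds, golds):
    """Pool-adjacent-violators: fit non-decreasing values to golds after
    sorting (pred, gold) pairs by pred. New predictions can then be
    calibrated by interpolating between the fitted points."""
    pairs = sorted(zip(preds, golds))
    merged = []  # each block: [sum_of_golds, count]
    for _, g in pairs:
        merged.append([g, 1])
        # merge backwards while running means violate monotonicity
        while (len(merged) > 1 and
               merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]):
            s, n = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
    fitted = []
    for s, n in merged:
        fitted.extend([s / n] * n)
    return [p for p, _ in pairs], fitted

# the middle two gold values violate monotonicity and get pooled to 1.5
xs, ys = pav_fit([0.1, 0.5, 0.3, 0.9], [-2.0, 1.0, 2.0, 3.0])
print(ys)  # [-2.0, 1.5, 1.5, 3.0]
```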
Evaluation Following Lee et al. (2015) and Stanovsky et al. (2017), we report two evaluation measures: mean absolute error (MAE) and Pearson correlation (r). We believe, however, that correlation is a better indicator of performance, for two reasons: (i) for datasets with a high degree of label imbalance (Figure 2), a baseline that always guesses the mean or mode label can be difficult to beat in terms of MAE but not correlation; and (ii) MAE is harder to meaningfully compare across datasets with different label means and variances.

Table 5: Mean gold labels, counts, and MAE for L-biLSTM(2)-S and T-biLSTM(2)-S model predictions on UDS-IH2-dev, grouped by modals and negation.
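The contrast between the two metrics shows up even on a tiny imbalanced toy sample (the numbers below are illustrative only):

```python
from math import sqrt

def mae(pred, gold):
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def pearson(pred, gold):
    n = len(gold)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    sg = sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

# imbalanced gold labels: mostly confidently factual (3.0)
gold = [3.0, 3.0, 3.0, 3.0, -2.0]
always3 = [3.0] * 5                  # constant baseline
model = [1.0, 1.5, 1.0, 1.5, -1.0]   # tracks the labels, but shrunken

print(mae(always3, gold))    # 1.0: hard to beat on MAE
print(mae(model, gold))      # 1.6: worse than the constant baseline
print(pearson(model, gold))  # ~0.97, while r is undefined for always3
```

The constant baseline wins on MAE yet has zero variance, so its Pearson correlation is undefined; the model that actually tracks the labels is rewarded only by r.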
Development Under all regimes, we train the model for 20 epochs, by which time all models appear to converge. We save the parameter values after the completion of each epoch and then score each set of saved parameter values on the development set for each dataset. The set of parameter values that performed best on dev in terms of Pearson correlation for a particular dataset was then used to score the test set for that dataset. Table 4 reports the results for all of the 2-layer L-, T-, and H-biLSTM systems. 7 The best-performing system for each dataset and metric is highlighted in purple, and when the best-performing system for a particular dataset was a 1-layer model, that system is also included in Table 4.

Results
New state of the art For each dataset and metric, with the exception of MAE on UW, we achieve state of the art results with multiple systems. The highest-performing system for each is reported in Table 4. Our results on UDS-IH2 are the first reported numbers for this new factuality resource.
Linear v. tree topology On its own, the biLSTM with linear topology (L-biLSTM) performs consistently better than the biLSTM with tree topology (T-biLSTM). 7 However, the hybrid topology (H-biLSTM), consisting of both an L- and a T-biLSTM, is the top-performing system on UW for correlation (Table 4). This suggests that the T-biLSTM may be contributing something complementary to the L-biLSTM. Evidence of this complementarity can be seen in Table 6, which contains a breakdown of system performance by governing dependency relation, for both linear and tree models, on UDS-IH2-dev. In most cases, the L-biLSTM's mean prediction is closer to the true mean. This appears to arise in part because the T-biLSTM is less confident in its predictions, i.e. its mean prediction tends to be closer to 0. This results in the L-biLSTM being overconfident in certain cases, e.g. for the xcomp governing relation, where the T-biLSTM's mean prediction is closer to the true mean.

7 Full results are reported in the Supplementary Materials. Note that the 2-layer networks do not strictly dominate the 1-layer networks in terms of MAE and correlation.
Lexical features have minimal impact Adding all lexical features (both SIGNATURE and MINED) yields mixed results. We see slight improvements on UW, while performance on the other datasets mostly declines (compare with SINGLE-TASK SPECIFIC). Factuality prediction is precisely the kind of NLP task one would expect these types of features to assist with, so it is notable that, in our experiments, they do not.
Multi-task helps Though our methods achieve state of the art in the single-task setting, the best-performing systems are mostly multi-task (Table 4 and Supplementary Materials). This is an ideal setting for multi-task training: each dataset is relatively small, and their labels capture closely related (if not identical) linguistic phenomena. UDS-IH2, the largest by a factor of two, reaps the smallest gains from multi-task.

Table 7: Notable attributes of 50 instances from UDS-IH2-dev with highest absolute prediction error (using H-biLSTM(2)-MultiSim w/UDS-IH2).

Attribute                                  #
Grammatical error present, incl. run-ons   16
Is an auxiliary or light verb              14
Annotation is incorrect                    13
Future event                               12
Is a question                              5
Is an imperative                           3
Is not an event or state                   2
One or more of the above                   43

Analysis
As discussed in §2, many discrete linguistic phenomena interact with event factuality. Here we provide a brief analysis of some of those interactions, both as they manifest in the UDS-IH2 dataset and in the behavior of our models. This analysis employs the gold dependency parses present in EUD1.2.

Table 5 illustrates the influence of modals and negation on the factuality of the events they have direct scope over. The context with the highest factuality on average is no direct modal and no negation (first row); all other modal contexts have varying degrees of negative mean factuality scores, with will as the most negative. This is likely a result of the UDS-IH2 annotation instructions to mark future events as not having happened.

Table 7 shows results from a manual error analysis of the 50 events from UDS-IH2-dev with highest absolute prediction error (using H-biLSTM(2)-MultiSim w/UDS-IH2). Grammatical errors (such as run-on sentences) in the underlying text of UDS-IH2 appear to pose a particular challenge for these models; the informal language and grammatical errors in UDS-IH2 are a substantial distinction from the other factuality datasets used here.

Table 9: MAE of L-biLSTM(2)-S and L-biLSTM(2)-S+lexfeats, for predictions on events in UDS-IH2-dev that are xcomp-governed by an infinitival-taking verb.

Table 8 shows that we can achieve similar separation between implicatives and non-implicatives as the feature mining strategy presented in §5. That is, those features may be redundant with information already learnable from factuality datasets (UDS-IH2). Despite the underperformance of these features overall, Table 9 shows that they may still improve performance on the subset of instances where they appear.

Conclusion
We have proposed two neural models of event factuality prediction, a bidirectional linear-chain LSTM (L-biLSTM) and a bidirectional child-sum dependency tree LSTM (T-biLSTM), which yield substantial gains over previous models based on deterministic rules and hand-engineered features. Both models yield such gains, though the L-biLSTM outperforms the T-biLSTM; for some datasets, an ensemble of the two (H-biLSTM) improves over either alone.
We have also extended the UDS-IH1 dataset, yielding the largest publicly-available factuality dataset to date: UDS-IH2. In experiments, we see substantial gains from multi-task training over the three factuality datasets unified by Stanovsky et al. (2017), as well as UDS-IH2. Future work will further probe the behavior of these models, or extend them to learn other aspects of event semantics.

A.1 Dataset filtering
We filter our dataset to remove annotators with very low agreement in two ways: (i) based on their agreement with other annotators on the HAPPENED question; and (ii) based on their agreement with other annotators on the CONFIDENCE question.
For the HAPPENED question, we computed, for each pair of annotators and each item that both of those annotators annotated, whether the two responses were equal. We then fit a random effects logistic regression to response equality with random intercepts for annotator. The Best Linear Unbiased Predictors (BLUPs) for each annotator were then extracted and z-scored. Annotators were removed if their z-scored BLUP was less than -2.
For the CONFIDENCE question, we first ridit-scored the ratings by annotator; then, for each pair of annotators and each item that both annotated, we computed the difference between the two ridit-scored confidences. We then fit a random effects linear regression to the resulting difference after logit transformation, with random intercepts for annotator. The same BLUP-based exclusion procedure was then used.
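Ridit scoring itself is simple to sketch (this is the standard formulation; applying it per annotator normalizes for each annotator's idiosyncratic use of the scale):

```python
from collections import Counter

def ridit_scores(ratings):
    """Map each ordinal rating to the proportion of ratings strictly
    below it plus half the proportion tied with it, yielding scores
    in (0, 1) that reflect one annotator's own usage of the scale."""
    n = len(ratings)
    counts = Counter(ratings)
    below, score = 0, {}
    for value in sorted(counts):
        score[value] = (below + counts[value] / 2) / n
        below += counts[value]
    return [score[r] for r in ratings]

# one annotator who mostly answers 4 on the 0-4 CONFIDENCE scale
print(ridit_scores([4, 4, 4, 2, 0]))  # [0.7, 0.7, 0.7, 0.3, 0.1]
```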
This filtering results in the exclusion of one annotator, who is excluded for low agreement on HAPPENED. 4,179 annotations are removed in the filtering, but because we remove only a single annotator, there remains at least one annotation for every predicate.

A.2 Mining Implicatives
All options for instantiating the $TIME pattern variable, described in §5, are listed here.