Lexicosyntactic Inference in Neural Models

We investigate neural models' ability to capture lexicosyntactic inferences: inferences triggered by the interaction of lexical and syntactic information. We take the task of event factuality prediction as a case study and build a factuality judgment dataset for all English clause-embedding verbs in various syntactic contexts. We use this dataset, which we make publicly available, to probe the behavior of current state-of-the-art neural systems, showing that these systems make certain systematic errors that are clearly visible through the lens of factuality prediction.

b. Jo doesn't know that Bo left.
(2) a. Jo believes that Bo didn't leave.
b. Bo left.
c. Bo didn't leave.
A major finding of this literature is that lexically triggered inferences are conditioned by surprising aspects of the syntactic context that a word occurs in. For example, while (3a), (3b), and (4a) trigger the inference (2b), (4b) triggers the inference (2c).
b. Jo didn't remember that Bo left.
(4) a. Bo remembered to leave.
b. Bo didn't remember to leave.
Accurately capturing such interactions (e.g. between clause-embedding verbs, negation, and embedded clause type) is important for any system that aims to do general natural language inference (MacCartney et al. 2008 et seq; cf. Dagan et al. 2006) or event extraction (see Grishman and Sundheim 1996 et seq), and it seems unlikely to be a trivial phenomenon to capture, given the complexity and variability of the inferences involved (see, e.g., Karttunen, 2012, 2013; Karttunen et al., 2014; van Leusen, 2012; White, 2014; Baglini and Francez, 2016; Nadathur, 2016, on implicatives).
In this paper, we investigate how well current state-of-the-art neural systems for a subtask of general event extraction, namely event factuality prediction (EFP; Nairn et al., 2006; Saurí and Pustejovsky, 2009, 2012; de Marneffe et al., 2012; Lee et al., 2015; Stanovsky et al., 2017; Rudinger et al., 2018), capture inferential interactions between lexical items and syntactic context (lexicosyntactic inferences) when trained on current event factuality datasets. Probing these particular systems is useful for understanding neural systems' behavior more generally because (i) the best performing neural models for EFP (Rudinger et al., 2018) are simple instances of common baseline models; and (ii) the task itself is relatively constrained.
To do this, we substantially extend the MegaVeridicality1 dataset (White and Rawlins, 2018) to cover all English clause-embedding verbs in a variety of the syntactic contexts covered by recent psycholinguistic work (White and Rawlins, 2016), and we use the resulting dataset, MegaVeridicality2, to probe these models' behavior. We focus on clause-embedding verbs because they show effectively every possible patterning of lexicosyntactic inference (Karttunen, 2012).
We discuss three findings: (i) Tree biLSTMs (T-biLSTMs) are better able to correctly predict lexicosyntactic inferences than linear-chain biLSTMs (L-biLSTMs); (ii) L-biLSTMs and T-biLSTMs capture different lexicosyntactic inferences, and thus ensembling their predictions can reliably improve performance; and (iii) even when ensembled, these models show systematic errors: for example, they perform well when the polarity of the matrix clause matches the polarity of the true inference, but poorly when these polarities mismatch.
We furthermore release MegaVeridicality2 at MegaAttitude.io as a benchmark for probing the ability of neural systems, whether for factuality prediction or for general natural language inference, to capture lexicosyntactic inference.

Data collection
We substantially extend the MegaVeridicality1 dataset (White and Rawlins, 2018), which contains factuality judgments for all English clause-embedding verbs that take tensed subordinate clauses. In White and Rawlins's annotation protocol, all verbs that are grammatical with such subordinate clauses, based on the MegaAttitude dataset (White and Rawlins, 2016), are slotted into contexts like either (5a) or (5b), depending on whether they take a direct object or not.
(5) a. Someone {knew, didn't know} that a particular thing happened. b. Someone {was, wasn't} told that a particular thing happened.
For each sentence generated in this way, 10 different annotators are asked to answer the question did that thing happen?: yes, maybe or maybe not, no.
There are two important aspects of these contexts to note. First, all lexical items besides the embedding verbs are semantically bleached to ensure that the measured lexicosyntactic inferences are only due to interactions between the embedding predicate (e.g. know or tell) and the syntactic context. Second, the matrix polarity (i.e. the presence or absence of not as a direct dependent of the embedding verb) is manipulated to create two sentences for each verb-context pair.
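As an illustration of how the two frames in (5) and the polarity manipulation combine, the following is a minimal sketch, not the authors' generation code. The `Verb` record, its field names, and `build_items` are hypothetical; the inflected forms are supplied by hand rather than derived automatically.

```python
from dataclasses import dataclass

@dataclass
class Verb:
    base: str           # e.g. "know"
    past: str           # e.g. "knew"
    participle: str     # e.g. "known"
    takes_object: bool  # True -> passive frame (5b); False -> active frame (5a)

# Semantically bleached complement shared by both frames.
COMPLEMENT = "that a particular thing happened"

def build_items(verb):
    """Return (matrix polarity, sentence) pairs for one verb: one positive, one negative."""
    if verb.takes_object:
        positive = f"Someone was {verb.participle} {COMPLEMENT}."
        negative = f"Someone wasn't {verb.participle} {COMPLEMENT}."
    else:
        positive = f"Someone {verb.past} {COMPLEMENT}."
        negative = f"Someone didn't {verb.base} {COMPLEMENT}."
    return [("positive", positive), ("negative", negative)]

know_items = build_items(Verb("know", "knew", "known", takes_object=False))
tell_items = build_items(Verb("tell", "told", "told", takes_object=True))
```

Each verb thus contributes exactly two sentences per context, which is what allows matrix polarity to be treated as a controlled factor in the later analysis.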
Our extension, MegaVeridicality2, includes judgments for a variety of infinitival subordinate clause types, exemplified in (6).¹ We investigate infinitival clauses because they can give rise to different [...]
[...] age} to have a particular thing.
For each sentence, we also collect judgments from 10 different annotators, using the same question as White and Rawlins for context (6a) and modified questions for contexts (6b)-(6g): did that person do that thing? for (6b), (6d), and (6f); and did that person have that thing? for (6c), (6e), and (6g). Table 1 shows the number of verb types for each syntactic context. With the polarity manipulation, this yields a total of 3,938 sentences.
¹ We also explicitly manipulate two aspects of the subordinate clause in our extension of the MegaVeridicality dataset: (i) how NP embedded subjects are introduced; and (ii) whether the embedded clause contains an eventive predicate (do, happen) or a stative predicate (have). See Appendix A for details on the reasoning behind these manipulations.
To build a factuality prediction test set from these sentences, we combine MegaVeridicality1 with our dataset and replace each instance of a particular person or a particular thing with someone or something (respectively). Then, following White and Rawlins, we normalize the 10 responses for each sentence to a single real value using an ordinal mixed model-based procedure. We refer to the resulting dataset as MegaVeridicality2.
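The replacement step described above amounts to a simple string substitution. The sketch below is illustrative only: the function name `to_indefinites` is hypothetical, and it assumes the bleached phrases always appear in lowercase, exactly as written in the frames.

```python
# Bleached noun phrases and the indefinite pronouns that replace them.
REPLACEMENTS = {
    "a particular person": "someone",
    "a particular thing": "something",
}

def to_indefinites(sentence):
    """Replace each bleached noun phrase with its indefinite pronoun."""
    for phrase, pronoun in REPLACEMENTS.items():
        sentence = sentence.replace(phrase, pronoun)
    return sentence

example = to_indefinites("Someone knew that a particular thing happened.")
```

The subsequent normalization of the 10 responses with an ordinal mixed model is not sketched here, since its details are given by White and Rawlins rather than in this section.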

Model and evaluation
We use MegaVeridicality2 to evaluate the performance of three state-of-the-art neural models of event factuality (Rudinger et al., 2018): a linear-chain biLSTM (L-biLSTM), a dependency tree biLSTM (T-biLSTM), and a hybrid biLSTM (H-biLSTM) that ensembles the two. To predict the factuality of the event referred to by a particular predicate, these models pass the output state of the biLSTM at that predicate through a two-layer regression. In the case of the H-biLSTM, the output states of the L- and T-biLSTMs are simply concatenated and passed through the regression.
Following the multi-task training regime described by Rudinger et al. (2018), we train these models on four standard factuality datasets, FactBank (Saurí and Pustejovsky, 2009, 2012), UW (Lee et al., 2015), MEANTIME (Minard et al., 2016), and UDS (Rudinger et al., 2018), with tied biLSTM weights but regression parameters specific to each dataset. We then use these trained models to predict the factuality of the embedded predicate in our dataset.
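To make the prediction step concrete, here is a minimal NumPy sketch of a two-layer regression applied to the concatenated hidden states, as in the H-biLSTM. It is not the authors' implementation: the dimensions, the random stand-in hidden states, and the ReLU nonlinearity in the hidden layer are all assumptions made for illustration.

```python
import numpy as np

def two_layer_regression(h, W1, b1, W2, b2):
    """Map a predicate's hidden state h to a scalar factuality score."""
    hidden = np.maximum(0.0, W1 @ h + b1)  # hidden layer (ReLU is an assumption)
    return float(W2 @ hidden + b2)          # linear output layer

rng = np.random.default_rng(0)
d_linear, d_tree, d_hidden = 6, 6, 4        # hypothetical sizes

# Random stand-ins for the biLSTM output states at the target predicate.
h_linear = rng.normal(size=d_linear)
h_tree = rng.normal(size=d_tree)

# Hybrid model: concatenate both states before the regression.
h_hybrid = np.concatenate([h_linear, h_tree])

W1 = rng.normal(size=(d_hidden, d_linear + d_tree))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=d_hidden)
b2 = 0.0
score = two_layer_regression(h_hybrid, W1, b1, W2, b2)
```

Under the multi-task regime, one would keep a separate `(W1, b1, W2, b2)` per training dataset while sharing the biLSTM that produces `h_linear` and `h_tree`, which is why each model yields four predictions per sentence.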
To understand how much of these models' performance on our dataset is really due to a correct computation of lexicosyntactic inferences, we also generate predictions for the sentences in our dataset with the embedding verbs UNKed. In this case, the model can rely only on the syntactic context surrounding the predicate to make its inferences. We refer to the models with lexical information as the LEX models and the ones without lexical information as the UNK models.
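The UNK manipulation can be pictured as replacing the embedding verb's token before the sentence is fed to the model. In this sketch the placeholder string `<UNK>`, the tokenization, and the function name are hypothetical; the actual unknown-word handling depends on the model's vocabulary.

```python
UNK = "<UNK>"  # hypothetical placeholder for the out-of-vocabulary token

def unk_embedding_verb(tokens, verb_index):
    """Return a copy of the token list with the embedding verb replaced by UNK."""
    masked = list(tokens)
    masked[verb_index] = UNK
    return masked

tokens = ["Someone", "did", "n't", "remember", "to", "do", "something", "."]
masked = unk_embedding_verb(tokens, verb_index=3)
```

The rest of the sentence, including negation and the embedded clause type, is left intact, so any remaining predictive signal must come from the syntactic context.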
Each model produces four predictions, corresponding to the four different datasets it was trained on. We consider three different ways of ensembling these predictions using a cross-validated ridge regression: (i) ensembling the four predictions for each specific model (LEX or UNK); (ii) ensembling the predictions for the LEX version of a particular model with the UNK version of that same model (LEX+UNK); and (iii) ensembling the predictions across all models (LEX, UNK, or LEX+UNK). Each ensemble is evaluated in a 10-fold/10-fold nested cross-validation (see Cawley and Talbot, 2010). In each iteration of the outer cross-validation, a 10% test set is split off, and a 10-fold cross-validation to tune the regularization is conducted on the remaining 90%.

Results
[...] fold test sets of the nested cross-validation described in §3. We note three aspects of this plot. First, among the LEX models, the T-biLSTM performs best, followed by the L-biLSTM, then the H-biLSTM. This is somewhat surprising, since Rudinger et al. find the opposite pattern of performance: the L- and H-biLSTMs vie for dominance, both outperforming the T-biLSTM. This indicates that T-biLSTMs are better able to represent the lexicosyntactic inferences relevant to this dataset, even though they underperform on more general datasets. This possibility is bolstered by the fact that, in contrast to the L- and H-biLSTMs, the LEX version of the T-biLSTM performs significantly better than the UNK version, suggesting that the T-biLSTM is potentially more reliant on the lexical information than the other two.
Second, when the LEX and UNK versions of each model are ensembled (LEX+UNK), we find comparable performance for all three biLSTMs, each outperforming the LEX version of the T-biLSTM. This indicates that each model captures similar amounts of information about lexicosyntactic inference, but this information is captured in the models' parameterizations in different ways.
Finally, when all three models are ensembled, we find that both the LEX and UNK versions perform significantly better than any specific LEX+UNK model. This may indicate two things: (i) the models that only have access to syntax can perform just as well as ones that have access to both lexical information and syntax; but (ii) these models appear to capture different aspects of inference, since an ensemble of all models (All-LEX+UNK) performs significantly better than either the All-LEX or All-UNK ensembles alone. Interestingly, however, even this ensemble performs more than 10 points worse than each model alone on FactBank, UW, and UDS. This raises the question of which lexicosyntactic inferences these models are missing, which we investigate below.
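For readers who want to reproduce the evaluation scheme behind these comparisons, the nested cross-validation of a ridge ensemble (an outer 10-fold split with an inner 10-fold search over the penalty) might be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the closed-form ridge solver, the grid of penalties, the use of Pearson correlation as the score, and the synthetic data are all illustrative choices.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def folds(n, k, rng):
    """Split the shuffled indices 0..n-1 into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def cv_error(X, y, alpha, splits):
    """Mean squared error of ridge(alpha), averaged over held-out folds."""
    errors = []
    for val_idx in splits:
        fit_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        w, b = fit_ridge(X[fit_idx], y[fit_idx], alpha)
        errors.append(np.mean((X[val_idx] @ w + b - y[val_idx]) ** 2))
    return float(np.mean(errors))

def nested_cv(X, y, alphas, k_outer=10, k_inner=10, seed=0):
    """Mean test-fold correlation; alpha is tuned only on each training portion."""
    rng = np.random.default_rng(seed)
    scores = []
    for test_idx in folds(len(y), k_outer, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        inner = folds(len(y_tr), k_inner, rng)
        best = min(alphas, key=lambda a: cv_error(X_tr, y_tr, a, inner))
        w, b = fit_ridge(X_tr, y_tr, best)       # refit on the full training portion
        pred = X[test_idx] @ w + b
        scores.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(scores))

# Synthetic stand-in: four dataset-specific predictions per sentence.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.1 * rng.normal(size=200)
mean_r = nested_cv(X, y, alphas=[0.01, 0.1, 1.0, 10.0])
```

Because the penalty is chosen without ever seeing the outer test fold, the resulting score is not biased by the tuning step, which is the point of the nested design discussed by Cawley and Talbot (2010).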

Analysis
We investigate two questions: (i) which inferences do all models do poorly on?; and (ii) what drives the differing strengths of each model?
Where do all models fail? Table 2 shows the 20 sentences with the highest prediction errors under the All-LEX+UNK ensemble. There are two interesting things to note about these sentences. First, most of them involve negative lexicosyntactic inferences that the model predicts to be either positive or near zero. Second, when the true inference is not positive, the matrix polarity of the original sentence is negative. This suggests that the models are not able to capture inferences whose polarity mismatches the matrix clause polarity.
One question that arises here is whether this inability affects all contexts equally. To answer this, we regress the absolute error of the predictions from this same ensemble (logged and standardized) against true factuality, matrix polarity, and context (as well as all of their two- and three-way interactions).³ We find that the three-way interactions in this regression are reliable (χ²(8) = 27.97, p < 0.001), suggesting that there are nontrivial differences in these state-of-the-art factuality systems' ability to capture inferential interactions across verbs and syntactic contexts. The differences can be verified visually in Figure 2, which plots the factuality predicted by this ensemble against the true factuality from MegaVeridicality2.
³ See Appendix C for further details, including a summary of the regression on which the above discussion is based.
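The dependent variable of this regression can be computed along the following lines. This is only a sketch of one plausible reading of "logged and standardized": it assumes a natural logarithm followed by z-scoring with the sample standard deviation, and the function name is hypothetical. It also assumes no prediction matches the true value exactly, since the logarithm of a zero error is undefined.

```python
import math
import statistics

def logged_standardized_abs_error(predicted, true):
    """Log the absolute prediction errors, then z-score them."""
    log_errors = [math.log(abs(p - t)) for p, t in zip(predicted, true)]
    mean = statistics.mean(log_errors)
    sd = statistics.stdev(log_errors)  # sample standard deviation (an assumption)
    return [(e - mean) / sd for e in log_errors]

z = logged_standardized_abs_error([0.5, 1.0, -2.0], [0.0, 0.0, 0.0])
```

Logging compresses the long right tail of absolute errors, and standardizing puts the coefficients of the regression on a common scale.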
To elaborate, the ensemble does best overall on contexts like (7a) and (7b), and worst overall on contexts like (7c). The contrast between (7b) and (7c) is particularly interesting because (i) (7c) is just the passivized form of (7b); and (ii) we do not observe similar behavior for contexts (7d) and (7e), which are analogous to (7b) and (7c), but replace the stative have with the eventive do.
An additional nuance is that the ensemble does reliably better on the negative matrix polarity version of (7b) than on the positive, with the opposite true for (7e). This suggests these models do not capture an important inferential interaction between passivization and eventivity. This suggestion is further bolstered by the fact that the ensemble's ability to predict cases where the matrix polarity mismatches the true factuality is reliably poorer in context (7c) but not in its minimal pairs (7e) and (7b), where the ensemble performs reliably worse when the two match. Indeed, it is contexts (7c) and (7f) that drive the polarity mismatch effect evident in Table 2.
What drives differences between models? In §4, we noted two ways that the biLSTMs we investigate differ: (i) the T-biLSTM appears to be more reliant on lexical information than the L- and H-biLSTMs; and (ii) each model appears to encode information about lexicosyntactic inference in its parameterizations in different ways. We hypothesize that these two differences are related: specifically, that the T-biLSTM's heavier reliance on lexical information comes about as a consequence of stronger entanglement between lexical and syntactic information in its hidden states.
To probe this, we ask to what extent the embedding verb's embedding can be recovered from the embedded verb's hidden state using linear functions. If the lexical information is more strongly entangled with the syntactic information, it should be more difficult to construct a homomorphic (linear) function to decode the embedding verb's embedding from the embedded verb's hidden state. To measure this, we conduct a Canonical Correlation Analysis (CCA; Hotelling, 1936) between these two vector space representations for every sentence in our dataset. Given two matrices $X$ (the embedding verb embeddings column stacked) and $Y$ (the embedded verb hidden states column stacked), CCA constructs matrices $A$ and $B$, such that $a_i, b_i = \arg\max_{a', b'} \mathrm{corr}(a'X, b'Y)$ and $\mathrm{corr}(a_i X, a_j X) = \mathrm{corr}(b_i Y, b_j Y) = 0, \forall i < j$. This guarantees that the canonical correlation at component $i$, $\mathrm{corr}(a_i X, b_i Y)$, is nonincreasing in $i$, and thus the linearly decodable information about $Y$ in $X$ can be assessed using this function.
Figure 3 plots the canonical correlations for the first 50 components for each of the biLSTMs we investigated. We find that the canonical correlations associated with the T-biLSTM are substantially lower than those associated with the L- and H-biLSTMs across these first 50 components. This suggests that the T-biLSTM more strongly entangles lexical and syntactic information, perhaps explaining its apparently heavier reliance on lexical information, observed in §4.
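Canonical correlations of the kind plotted in Figure 3 can be computed with a standard QR-plus-SVD recipe. The sketch below is not the authors' analysis code. It assumes rows are sentences (the transpose of the column-stacked layout described above), that there are more sentences than dimensions, and that both matrices have full column rank; the synthetic inputs are stand-ins for the verb embeddings and hidden states.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between row-aligned X (n x p) and Y (n x q), largest first."""
    Xc = X - X.mean(axis=0)          # center each dimension
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)         # orthonormal basis for the column space of Xc
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two subspaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)      # guard against tiny floating-point overshoot

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                  # stand-in for embedding verb embeddings
Y_linear = X @ rng.normal(size=(5, 5))         # fully linearly decodable from X
Y_noise = rng.normal(size=(300, 5))            # unrelated to X

cc_linear = canonical_correlations(X, Y_linear)
cc_noise = canonical_correlations(X, Y_noise)
```

When one representation is an invertible linear map of the other, every canonical correlation is close to 1; the lower the curve, the less of one representation is recoverable from the other by linear functions, which is the sense of entanglement used in the argument above.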
Of note here is that the pattern seen in Figure 3 is probably at least partly a consequence of the different nonlinearities used for the L-biLSTM (tanh) and T-biLSTM (ReLU), and not the architectures themselves. But whether or not this pattern is due to the architectures, nonlinearities, or both, the entanglement hypothesis may still help explain the pattern of results discussed in §4.

Related work
This work is inspired by recent work in recasting various semantic annotations into natural language inference (NLI) datasets (White et al., 2017; Poliak et al., 2018a,b; Wang et al., 2018) to gain a better understanding of which phenomena standard neural NLI models (Bowman et al., 2015; Conneau et al., 2017) can capture, a line of work with deep roots (Cooper et al., 1996). The experimental setup, specifically the idea of UNKing the embedding verb, was inspired by recent work that uses hypothesis-only baselines for a similar purpose (Gururangan et al., 2018; Poliak et al., 2018c; Tsuchiya, 2018). This work is also related to the broader investigation of sentence representations, particularly tasks aimed at probing these representations' content (Pavlick and Callison-Burch, 2016; Adi et al., 2016; Conneau and Kiela, 2018; Dasgupta et al., 2018).

Conclusion
We investigated neural models' ability to capture lexicosyntactic inference, taking the task of event factuality prediction (EFP) as a case study. We built a factuality judgment dataset for all English clause-embedding verbs in various syntactic contexts and used this dataset to probe current state-of-the-art EFP systems. We showed that these systems make certain systematic errors that are clearly visible through the lens of factuality.