A Logic-Driven Framework for Consistency of Neural Models

While neural models show remarkable accuracy on individual predictions, their internal beliefs can be inconsistent across examples. In this paper, we formalize such inconsistency as a generalization of prediction error. We propose a learning framework for constraining models using logic rules to regularize them away from inconsistency. Our framework can leverage both labeled and unlabeled examples and is directly compatible with off-the-shelf learning schemes without model redesign. We instantiate our framework on natural language inference, where experiments show that enforcing invariants stated in logic can help make the predictions of neural models both accurate and consistent.


Introduction
Recent NLP advances have been powered by improved representations (e.g., ELMo and BERT; Peters et al., 2018), novel neural architectures (e.g., Cheng et al., 2016; Seo et al., 2017; Parikh et al., 2016; Vaswani et al., 2017), and large labeled corpora (e.g., Bowman et al., 2015; Rajpurkar et al., 2016; Williams et al., 2018). Consequently, we have seen progressively improving performance on benchmarks such as GLUE (Wang et al., 2018). But are models really becoming better? We take the position that, while tracking performance on a leaderboard is necessary to characterize model quality, it is not sufficient. Reasoning about language requires that a system not only draw correct inferences about textual inputs, but also be consistent in its beliefs across various inputs.
To illustrate this notion of consistency, let us consider the task of natural language inference (NLI), which seeks to identify whether a premise entails, contradicts, or is unrelated to a hypothesis (Dagan et al., 2013). Suppose we have three sentences P, H and Z, where P entails H and H contradicts Z. Using these two facts, we can infer that P contradicts Z. In other words, these three decisions are not independent of each other. Any model for textual inference should not violate this invariant defined over any three sentences, even if they are not labeled.
Today's models are neither trained to be consistent in this fashion, nor evaluated for consistency. The decomposable attention model of Parikh et al. (2016), updated with ELMo, violates the above constraint for the following sentences:
P: John is on a train to Berlin.
H: John is traveling to Berlin.
Z: John is having lunch in Berlin.
Highly accurate models can be inconsistent in their beliefs over groups of examples. For example, using a BERT-based NLI model that achieves about 90% F-score on the SNLI test set (Bowman et al., 2015), we found that in about 46% of unlabeled sentence triples where P entails H and H contradicts Z, the model fails to predict that the first sentence contradicts the third. Observations of a similar spirit were also made by Minervini and Riedel (2018), Glockner et al. (2018) and Nie et al. (2018).
To characterize and eliminate such errors, first, we define a method to measure the inconsistency of models with respect to invariants stated as first-order logic formulas over model predictions. We show that our definition of inconsistency strictly generalizes the standard definition of model error.
Second, we develop a systematic framework for mitigating inconsistency in models by compiling the invariants into differentiable loss functions using t-norms (Klement et al., 2013; Gupta and Qi, 1991) to soften logic. This allows us to take advantage of unlabeled examples and enforce consistency of model predictions over them. We show that the commonly used cross-entropy loss emerges as a specific instance of our framework. Our framework can be easily instantiated with modern neural network architectures.
To show the effectiveness of our approach, we instantiate it on the NLI task. We show that even state-of-the-art models can be highly inconsistent in their predictions, but our approach significantly reduces inconsistency.
In summary, our contributions are: 1. We define a mechanism to measure model inconsistency with respect to declaratively specified invariants. 2. We present a framework that compiles knowledge stated in first-order logic into loss functions that mitigate inconsistency. 3. We show that our learning framework can reduce prediction inconsistencies even with a small amount of annotated examples, without sacrificing predictive accuracy.

A Framework for (In)consistency
In this section, we will present a systematic approach for measuring and mitigating inconsistent predictions. A prediction is incorrect if it disagrees with what is known to be true. Similarly, predictions are inconsistent if they do not follow a known rule. Therefore, a model's errors can be characterized by how well its predictions conform to declarative knowledge. We will formalize this intuition by first developing a uniform representation for both labeled examples and consistency constraints (§2.1). Then, we will present a general definition of errors in the context of such a representation (§2.2). Finally, we will show a logic-driven approach for designing training losses (§2.3). As a running example, we will use the NLI task, whose goal is to predict one of three labels: Entailment (E), Contradiction (C), or Neutral (N).

Representing Knowledge
Suppose x is a collection of examples (perhaps labeled). We write constraints about them as a conjunction of statements in logic:

∀x, ⋀_i ( L_i(x) → R_i(x) )    (1)

Here, the L_i and R_i are Boolean formulas, i.e., antecedents and consequents, constructed from model predictions on the examples in x.
One example of such an invariant is the constraint from §1, which can be written as E(P, H) ∧ C(H, Z) → C(P, Z), where, e.g., the predicate E(P, H) denotes that the model predicted the label E for the pair (P, H). We can also represent labeled examples as constraints: "If an example x is annotated with label Y, then the model should predict so." In logic, we write ⊤ → Y(x), an implication whose antecedent is trivially true. Seen this way, expression (1) can represent labeled data, unlabeled groups of examples with constraints between them, or a combination of both.

Generalizing Errors as Inconsistencies
Using the representation defined above, we can define how to evaluate predictors. We seek two properties of an evaluation metric: it should (1) quantify the inconsistency of predictions, and (2) generalize classification error. To this end, we define two types of errors: the global and the conditional violation. Both are defined for a dataset D consisting of example collections x as described above.
Global Violation (ρ) The global violation is the fraction of example collections in a dataset D where any constraint is violated. For a single constraint L → R, we have:

ρ = |{ x ∈ D : L(x) holds and R(x) does not }| / |D|    (2)

Conditional Violation (τ) The global violation can understate inconsistency when the antecedent of a constraint rarely holds. The conditional violation therefore counts violations only among the example collections where the antecedent is satisfied:

τ = |{ x ∈ D : L(x) holds and R(x) does not }| / |{ x ∈ D : L(x) holds }|    (3)

Note that for labeled examples written as ⊤ → Y(x), the antecedent always holds, so ρ and τ coincide and both equal the standard classification error.
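As an illustration, both violation rates for a single constraint L → R can be computed directly from discrete predictions. The label triples below are invented for this sketch:

```python
# Sketch: global (rho) and conditional (tau) violation rates for one
# invariant L(x) -> R(x) over a dataset of prediction triples.
# Labels follow the NLI conventions in the text: E, C, N.

def violations(preds, antecedent, consequent):
    """preds: list of (label_PH, label_HZ, label_PZ) prediction triples."""
    total = len(preds)
    fired = [p for p in preds if antecedent(p)]          # antecedent holds
    violated = [p for p in fired if not consequent(p)]   # consequent fails
    rho = len(violated) / total                          # global violation
    tau = len(violated) / len(fired) if fired else 0.0   # conditional violation
    return rho, tau

# Rule from Section 1: E(P,H) and C(H,Z) imply C(P,Z)
ante = lambda p: p[0] == "E" and p[1] == "C"
cons = lambda p: p[2] == "C"

preds = [("E", "C", "C"), ("E", "C", "N"), ("N", "E", "N"), ("E", "C", "C")]
rho, tau = violations(preds, ante, cons)
print(rho, tau)  # 0.25 and 1/3: one of four triples violates; one of three fired
```

Note that τ ≥ ρ whenever the antecedent fires at all, since the denominator can only shrink.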

Learning by Minimizing Inconsistencies
With the notion of errors, we can now focus on how to train models to minimize them. A key technical challenge involves the unification of discrete declarative constraints with the standard loss-driven learning paradigm.
To address this, we will use relaxations of logic in the form of t-norms to deterministically compile rules into differentiable loss functions. We treat predicted label probabilities as soft surrogates for Boolean decisions. In the rest of the paper, we will use lower case for model probabilities (e.g., e(P, H)) and upper case (e.g., E(P, H)) for Boolean predicates.
Different t-norms map the standard Boolean operations into different continuous functions. Table 1 summarizes this mapping for three t-norms: product, Gödel, and Łukasiewicz. Complex Boolean expressions can be constructed from these four operations. Thus, with t-norms to relax logic, we can systematically convert rules as in (1) into differentiable functions, which in turn serve as learning objectives to minimize constraint violations. We can use any off-the-shelf optimizer (e.g., Adam; Kingma and Ba, 2015). We will see concrete examples in the NLI case study in §3.
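As a concrete sketch, the standard definitions of the three t-norms can be written as small Python dictionaries. This is a minimal illustration using the textbook definitions; Table 1 itself is not reproduced in this text, and the dictionary names are ours:

```python
# Relaxations of the Boolean operators under three t-norms (standard
# textbook definitions). a, b are probabilities in [0, 1] acting as soft
# surrogates for Boolean truth values.

product = {
    "not":   lambda a: 1 - a,
    "and":   lambda a, b: a * b,
    "or":    lambda a, b: a + b - a * b,
    "imply": lambda a, b: min(1.0, b / a) if a > 0 else 1.0,  # residuum
}

godel = {
    "not":   lambda a: 1.0 if a == 0 else 0.0,
    "and":   lambda a, b: min(a, b),
    "or":    lambda a, b: max(a, b),
    "imply": lambda a, b: 1.0 if a <= b else b,
}

lukasiewicz = {
    "not":   lambda a: 1 - a,
    "and":   lambda a, b: max(0.0, a + b - 1),
    "or":    lambda a, b: min(1.0, a + b),
    "imply": lambda a, b: min(1.0, 1 - a + b),
}

# Relaxing A -> B with A = 0.9, B = 0.6 under two of the t-norms:
print(product["imply"](0.9, 0.6))      # 0.666...
print(lukasiewicz["imply"](0.9, 0.6))  # ~0.7
```

All three agree with Boolean logic on {0, 1} inputs; they differ only in how they interpolate in between, which is why the choice of t-norm changes the induced loss.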
Picking a t-norm is both a design choice and an algorithmic one. Different t-norms have different numerical characteristics, and their comparison is a question for future research. Here, we will focus on the product t-norm to allow comparisons to previous work: as we will see in the next section, the product t-norm strictly generalizes the widely used cross-entropy loss.
A Case Study: NLI

We study our framework using the NLI task as a case study. First, in §3.1, we will show how to represent a training set as in (1). We will also introduce two classes of domain constraints that apply to groups of premise-hypothesis pairs. Next, we will show how to compile these declaratively stated learning objectives into loss functions (§3.2). Finally, we will end this case study with a discussion of practical issues (§3.3).

Learning Objectives in Logic
Our goal is to build models that minimize inconsistency with domain knowledge stated in logic. Let us look at three such consistency requirements.

Annotation Consistency
For labeled examples, we expect that the model should predict what an annotator specifies. That is, we require

∀(P, H, Y) ∈ D, ⊤ → Y(P, H)    (4)

where Y represents the ground-truth label for the example (P, H). As mentioned at the end of §2.2, for annotation consistency the global and conditional violation rates are the same, and minimizing them amounts to maximizing accuracy. In our experiments, we will report accuracy instead of violation rate for annotation consistency, to align with the literature.
Symmetry Consistency Given any premise-hypothesis pair, the grounds for a model to predict Contradiction is that the events in the premise and the hypothesis cannot coexist simultaneously. That is, a (P, H) pair is a contradiction if, and only if, the (H, P) pair is also a contradiction:

∀(P, H), C(P, H) ↔ C(H, P)

Transitivity Consistency This constraint applies to any three related sentences P, H and Z. If we group the sentences into three pairs, namely (P, H), (H, Z) and (P, Z), the label definitions mandate that not all of the 3³ = 27 label assignments to these three pairs are allowed. The example in §1 is an allowed label assignment. We can enumerate all such valid assignments as a conjunction of implications:

∀(P, H, Z),  ( E(P, H) ∧ E(H, Z) → E(P, Z) )
           ∧ ( E(P, H) ∧ C(H, Z) → C(P, Z) )
           ∧ ( N(P, H) ∧ E(H, Z) → ¬C(P, Z) )
           ∧ ( N(P, H) ∧ C(H, Z) → ¬E(P, Z) )
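The claim that not all 27 assignments are allowed can be checked by brute force. The four implications below are one plausible formulation of the transitivity rules, consistent with the example in §1, rather than a verbatim copy of the paper's equations:

```python
from itertools import product as cartesian

# Brute-force check of how many of the 3^3 = 27 label assignments to
# ((P,H), (H,Z), (P,Z)) satisfy a set of transitivity rules.
LABELS = ("E", "C", "N")

def allowed(ph, hz, pz):
    rules = [
        (ph == "E" and hz == "E", pz == "E"),  # E and E imply E
        (ph == "E" and hz == "C", pz == "C"),  # E and C imply C
        (ph == "N" and hz == "E", pz != "C"),  # N and E imply not-C
        (ph == "N" and hz == "C", pz != "E"),  # N and C imply not-E
    ]
    # An assignment is allowed if every rule whose antecedent fires
    # also has its consequent satisfied.
    return all(consequent for antecedent, consequent in rules if antecedent)

valid = [t for t in cartesian(LABELS, repeat=3) if allowed(*t)]
print(len(valid))                # 21 assignments survive under these rules
print(("E", "C", "C") in valid)  # the Section 1 example is allowed: True
```

So under this formulation, 6 of the 27 assignments are ruled out; a consistent model must never land on one of them.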

Inconsistency Losses
Using the consistency constraints stated in §3.1, we can now derive the inconsistency losses to minimize. For brevity, we will focus on the annotation and symmetry consistencies.

First, let us examine annotation consistency. We can write the universal quantifier in (4) as a conjunction over the labeled dataset:

⋀_{(P,H,Y) ∈ D} Y(P, H)

Using the product t-norm from Table 1, we get the learning objective of maximizing the probability of the true labels:

∏_{(P,H,Y) ∈ D} y(P, H)

Or equivalently, by transforming to the negative log space, we get the annotation loss:

L_ann = − Σ_{(P,H,Y) ∈ D} log y(P, H)

We see that we recover the familiar cross-entropy loss function from the definition of inconsistency with the product t-norm!

Next, let us look at symmetry consistency. Writing the biconditional as two implications and applying the product t-norm relaxes C(P, H) ↔ C(H, P) to min(c(P, H), c(H, P)) / max(c(P, H), c(H, P)). Transforming to the negative log space as before, we get the symmetry loss:

L_sym = Σ_{(P,H)} | log c(P, H) − log c(H, P) |

The loss for transitivity, L_tran, can be derived similarly. We refer the reader to the appendix for details.
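To make the product-t-norm derivation concrete, here is a small numeric sketch, with invented probabilities, showing that relaxing the symmetry biconditional and taking the negative log yields the absolute difference of the two log contradiction probabilities:

```python
import math

# Sketch of the symmetry loss under the product t-norm. The biconditional
# C(P,H) <-> C(H,P) is two implications; the product-t-norm implication
# a -> b relaxes to min(1, b/a), and their conjunction multiplies.
# Probabilities below are made up for illustration.

def symmetry_loss(c_ph, c_hp):
    forward = min(1.0, c_hp / c_ph)   # relaxation of C(P,H) -> C(H,P)
    backward = min(1.0, c_ph / c_hp)  # relaxation of C(H,P) -> C(P,H)
    return -math.log(forward * backward)

a, b = 0.8, 0.2
print(symmetry_loss(a, b))             # 1.3862... (= log 4)
print(abs(math.log(a) - math.log(b)))  # same value: |log a - log b|
```

The loss is zero exactly when the two contradiction probabilities agree, and grows as they diverge, which is what pushes the model towards symmetric predictions.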
The important point is that we can systematically convert logical statements into loss functions, and cross-entropy is only one such loss. To enforce some or all of these constraints, we add their corresponding losses. In our case study, with all constraints, the goal of learning is to minimize

L_ann + λ_sym L_sym + λ_tran L_tran

where the λ's are hyperparameters that control the influence of each loss term.

Training Constrained Models
The derived loss functions are directly compatible with off-the-shelf optimizers. The symmetry/transitivity consistencies admit unlabeled examples, while annotation consistency requires labeled examples. Thus, in §4, we will use both labeled and unlabeled data for training. Ideally, we want the unlabeled dataset to be maximally informative, meaning a model can learn from every example. Unfortunately, obtaining such a dataset remains an open question, since new examples are required to be both linguistically meaningful and difficult enough for the model. Minervini and Riedel (2018) used a language model to generate unlabeled adversarial examples. Another way is via pivoting through a different language, which has a long history in machine translation (e.g., Kay, 1997; Mallinson et al., 2017).
Since our focus is to study inconsistency, we propose a simple alternative method to create unlabeled examples: we randomly sample sentences from the same topic. In §4, we will show that even random sentences can be surprisingly informative, because the derived losses operate on real-valued probabilities instead of discrete decisions.

Table 2: Inconsistencies (%) of models on our 100k evaluation dataset. Each number represents the average of three random runs. Models are trained using 5% and 100% of the train sets. SNLI+MultiNLI²: finetuned twice. ρ_S and τ_S: symmetry consistency violations. ρ_T and τ_T: transitivity consistency violations.

Experiments
In this section, we evaluate our framework using (near) state-of-the-art approaches for NLI, primarily based on BERT, and also compare to an LSTM model. We use the SNLI and MultiNLI (Williams et al., 2018) datasets to define annotation consistency. Our LSTM model is based on the decomposable attention model with a BiLSTM encoder and GloVe embeddings (Pennington et al., 2014). Our BERT model is based on the pretrained BERT-base, finetuned on SNLI/MultiNLI. The constrained models are initialized with the finetuned BERT-base and finetuned again with the inconsistency losses; this initialization is critical when label supervision is limited. For fair comparison, we also show results of BERT-base models finetuned twice.
Our constrained models are trained on both labeled and unlabeled examples. We expect that the different inconsistencies do not conflict with each other. Hence, we select hyperparameters (e.g., the λ's) using development accuracy only (i.e., annotation consistency). We refer the reader to the appendix for details of our experimental setup.

Datasets
To be comprehensive, we use both SNLI and MultiNLI to train our models, but we also report results on each dataset individually.
We study the impact of the amount of label supervision by randomly sampling different percentages of labeled examples. For each case, we also sample the same percentages from the corresponding development sets for model selection. For the MultiNLI dataset, we use the matched dev for validation and mismatched dev for evaluation.
Mirrored Instances (M) Given a labeled example, we construct its mirrored version by swapping the premise and the hypothesis. This results in the same number of unlabeled sentence pairs as in the annotated dataset. When sampling by percentage, we only use the sampled examples to construct mirrored examples. We use this dataset for the symmetry consistency.
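A minimal sketch of this construction, using invented sentences (the function name `mirror` is ours, not the paper's):

```python
# Constructing mirrored instances (M): swap the premise and hypothesis of
# each labeled pair, keeping the result unlabeled. One mirrored pair is
# produced per labeled example, so |M| equals the annotated dataset size.

def mirror(pairs):
    """pairs: list of (premise, hypothesis); returns unlabeled swapped pairs."""
    return [(h, p) for p, h in pairs]

labeled = [
    ("John is on a train to Berlin.", "John is traveling to Berlin."),
    ("A dog runs in the park.", "An animal is outside."),
]
mirrored = mirror(labeled)
print(len(mirrored) == len(labeled))  # True: same number of unlabeled pairs
print(mirrored[0])
```

The gold labels are deliberately not carried over: the symmetry loss only compares the model's contradiction probabilities on the two orderings, so no annotation is needed.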

Evaluation Dataset
We sample a different set of 100k example triples for measuring transitivity consistency. For symmetry consistency, we follow the above procedure for the dataset U to construct evaluation instance pairs. Recall that the definition of inconsistency allows measuring model quality with unlabeled data.

Inconsistency of Neural Models
In Table 2, we report the impact of the amount of annotated data on the symmetry/transitivity consistencies by using different percentages of labeled examples. We see that both the LSTM and BERT models have symmetry consistency violations, while the transitivity consistency has lower violation rates. Surprisingly, the LSTM model performed on par with BERT in terms of symmetry/transitivity consistency; stronger representations do not necessarily mean more consistent models.
The table shows that, given an example and its mirrored version, if the BERT baseline predicts a Contradiction on one, it has about a 60% chance (τ_S) of making an inconsistent judgement on the other. Further, we see that the inconsistencies are not affected much by the choice of dataset. Models trained on SNLI are as inconsistent as ones trained on MultiNLI, and combining them gives only slight improvements. Also, finetuning twice does not improve much over models finetuned once.
Finally, with more annotation, a model has fewer symmetry consistency violations. However, the same observation does not apply to the transitivity consistency. In the following sections, we will show that we can nearly eliminate these inconsistencies using the losses from §3.2.

Reducing Inconsistencies
We will study the effect of the symmetry and transitivity consistency losses in turn using the BERT models. To the baseline models, we incrementally add the M, U, and T datasets. We expect that the constrained models should have accuracies at least on par with the baseline (though one of the key points of this paper is that accuracy by itself is not a comprehensive metric).
In Fig. 1, we present both the global and conditional violation rates of the baselines and the constrained models. We see that mirrored examples (i.e., the w/ M curve) greatly reduced the symmetry inconsistency. Further, with 100k unlabeled example pairs (the w/ M,U curve), we can reduce the error rate even more. The same observation also applies when combining the symmetry constraint with the transitivity constraint. Fig. 2 shows the results for transitivity inconsistency; the transitivity violations are, again, greatly reduced for both the global and conditional measures. We refer the reader to the appendix for exact numbers.
We see that with our augmented losses, even a model using 1% label supervision can be much more consistent than the baselines trained on the full training set! This suggests that label supervision does not explicitly encode the notion of consistency, and consequently models do not get this information from the training data.
With the simultaneous decline in the global and conditional violation rates, the constrained models learn to agree with the consistency requirements specified declaratively. As we will see in the next section, doing so does not sacrifice model accuracy.

Interaction of Losses
In Table 3, we show the impact of symmetry and transitivity consistency on test accuracy; the interaction between the two consistencies is covered in Figs. 1 and 2. Our goal is to minimize all inconsistencies without sacrificing one for another. In Table 3, we see that lower symmetry/transitivity inconsistency generally does not reduce test accuracy, but we do not observe substantial improvements either. In conjunction with the observations from above, this suggests that test sets do not explicitly measure symmetry/transitivity consistency.
From Figs. 1 and 2, we see that models constrained by both the symmetry and transitivity losses are generally more consistent than models using the symmetry loss alone. Further, we see in Fig. 2 that using the mirrored dataset alone can even mitigate the transitivity errors. With dataset P, the transitivity inconsistency is strongly reduced by the symmetry inconsistency loss. These observations suggest that the compositionality of constraints does not pose internal conflicts for the model; the constraints are in fact beneficial to each other.
Interestingly, in Fig. 2, the models trained with the mirrored dataset (w/ M) become more inconsistent on the transitivity measurement when using more training data. We believe two factors cause this. First, there is a vocabulary gap between the SNLI/MultiNLI data and our unlabeled datasets (U and T). Second, the w/ M models are trained with the symmetry consistency but evaluated with the transitivity consistency. The slightly rising inconsistency implies that, without vocabulary coverage, training with one consistency might not always benefit another, even when using more training data.
When label supervision is limited (i.e., 1%), the models can easily overfit via the transitivity loss. As a result, models trained on the combined losses (i.e., w/ M,U,T) have slightly larger transitivity inconsistency than models trained with the mirrored data (i.e., w/ M) alone. In fact, if we use no label supervision at all, the symmetry and transitivity losses can push every prediction towards the label Neutral. But such predictions sacrifice annotation consistency. Therefore, we believe that some amount of label supervision is necessary.

Data            M     U     T
5% w/ M,U,T     99.8  99.4  12.0
100% w/ M,U,T   98.7  97.6  6.8

Table 4: Coverage (%) of the three unlabeled datasets during the first training epoch.

Analysis
In this section, we present an analysis of how the different losses affect model predictions and how informative they are during training. Table 4 shows the coverage of the three unlabeled datasets during the first training epoch. Specifically, we count the percentage of unlabeled examples where the symmetry/transitivity loss is positive. The coverage decreases in subsequent epochs as the model learns to minimize constraint violations. We see that both datasets M and U have high coverage. This is because, as mentioned earlier, our loss functions operate on real-valued probabilities instead of discrete decisions. The coverage of the dataset T is much lower because the compositional antecedent in the transitivity statements holds less often than the unary antecedent for symmetry, which naturally leads to smaller coverage.
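The coverage statistic described above can be sketched as follows. The probabilities are invented, and the loss form used here is the product-t-norm symmetry loss (an assumption in this sketch):

```python
import math

# Coverage: the fraction of unlabeled examples whose consistency loss is
# positive, i.e., examples the model can still learn from.

def symmetry_loss(c_ph, c_hp):
    # Product-t-norm symmetry loss: |log c(P,H) - log c(H,P)|
    return abs(math.log(c_ph) - math.log(c_hp))

def coverage(prob_pairs, eps=1e-8):
    losses = [symmetry_loss(a, b) for a, b in prob_pairs]
    return sum(l > eps for l in losses) / len(losses)

# Contradiction probabilities for four (pair, mirrored pair) examples:
pairs = [(0.9, 0.9), (0.7, 0.2), (0.5, 0.1), (0.33, 0.33)]
print(coverage(pairs))  # 0.5: two of the four pairs incur a positive loss
```

Because the loss is positive whenever the two probabilities differ at all, even slightly asymmetric predictions count as covered, which matches the observation that M and U have high coverage.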

Distribution of Predictions
In Table 5, we present the distribution of model predictions on the 100k evaluation example pairs for symmetry consistency. Clearly, the number of constraint-violating (off-diagonal) predictions dropped significantly. Also note that the number of Neutral predictions nearly doubled in our constrained model. This meets our expectation, because the example pairs are constructed from randomly sampled sentences under the same topic. We also present the distribution of predictions on example triples for the transitivity consistency in Table 6. As expected, with our transitivity consistency, the proportion of the label Neutral gets significantly higher as well. Further, in Table 7, we show the violation rates of each individual transitivity statement. Clearly, our framework mitigated the violation rates on all four statements.
While the logic-derived regularization pushes model predictions on unlabeled datasets towards Neutral, the accuracies on labeled test sets are not compromised. We believe this relates to the design of current NLI datasets, where the three labels are balanced. But in the real world, neutrality represents a potentially infinite negative space, while entailments and contradictions are rarer. The total number of neutral examples across both the SNLI and MultiNLI test sets is about 7k. Can we use these 7k examples to evaluate the nearly infinite negative space? We believe not.

Related Work and Discussion
Logic, Knowledge and Statistical Models Using soft relaxations of Boolean formulas as loss functions has a rich history in AI. The Łukasiewicz t-norm drives knowledge-driven learning and inference in probabilistic soft logic (Kimmig et al., 2012). Li and Srikumar (2019) show how to augment existing neural network architectures with domain knowledge using the Łukasiewicz t-norm. Xu et al. (2018) proposed a general framework for designing a semantically informed loss, without t-norms, for constraining a complex output space. In the same vein, Fischer et al. (2019) also proposed a framework for designing losses with logic, but using a bespoke mapping of the Boolean operators.
Our work is also conceptually related to posterior regularization (Ganchev et al., 2010) and constrained conditional models (Chang et al., 2012), which integrate knowledge with statistical models. Using posterior regularization with imitation learning, Hu et al. (2016) transferred knowledge from rules into neural parameters. Rocktäschel et al. (2015) embedded logic into distributed representations for entity relation extraction. Alberti et al. (2019) imposed answer consistency over generated questions for machine comprehension. Ad-hoc regularizers have been proposed for process comprehension (Du et al., 2019), semantic role labeling (Mehta et al., 2018), and summarization (Hsu et al., 2018).
Natural Language Inference In the literature, it has been shown that even highly accurate models show a decline in performance on perturbed examples. This lack of robustness of NLI models has been shown by comparing model performance on pre-defined propositional rules for swapped datasets (Wang et al., 2019) and by large-scale stress tests measuring the stability of models to semantic, lexical and random perturbations (Naik et al., 2018). Moreover, adversarial training examples produced by paraphrasing training data (Iyyer et al., 2018) or by inserting additional seemingly important, yet unrelated, information into training instances (Jia and Liang, 2017) have been used to show model inconsistency. Finally, adversarially labeled examples have been shown to improve prediction accuracy (Kang et al., 2018). Also related in this vein is the idea of dataset inoculation (Liu et al., 2019), where models are finetuned by exposing them to a challenging dataset.
The closest related work to this paper is probably that of Minervini and Riedel (2018), which uses the Gödel t-norm to discover adversarial examples that violate constraints. There are three major differences compared to this paper: 1) our definition of inconsistency is a strict generalization of errors of model predictions, giving us a unified framework that includes cross-entropy as a special case; 2) our framework does not rely on the construction of adversarial datasets; and 3) we studied the interaction of annotated examples and unlabeled examples via constraints, showing that our constraints can yield strongly consistent models even with a small amount of label supervision.

Conclusion
In this paper, we proposed a general framework to measure and mitigate model inconsistencies. Our framework systematically derives loss functions from domain knowledge stated in logic rules to constrain model training. As a case study, we instantiated the framework on a state-of-the-art model for the NLI task, showing that models can be highly accurate and consistent at the same time. Our framework is easily extensible to other domains with rich output structure, e.g., entity relation extraction and multilabel classification.

Appendix

Data   1%      5%      20%     100%
M      1       1       1       1
U      10⁻⁵    10⁻⁴    10⁻³    10⁻¹
T      10⁻⁶    10⁻⁵    10⁻⁴    10⁻³

Table 9: Choice of λ's for different consistencies and corresponding unlabeled datasets. For different sizes of annotation and different types of data, we adopt different λ's.
Having larger λ's leads to significantly worse accuracy on the development set, especially that of SNLI. Therefore, we did not select such models for evaluation. We hypothesize that this is because SNLI and MultiNLI are crowdsourced from different domains, while MS COCO shares the same domain as SNLI. A larger scaling factor could push unlabeled examples towards Neutral, thus sacrificing the annotation consistency on SNLI examples.

A.3.2 Results
We present the full experimental results on the natural language inference task in Table 8. Note that the accuracies of baselines finetuned twice are slightly better than those of models finetuned only once, while their symmetry/transitivity consistencies are roughly on par. We found this observation to be consistent across different finetuning hyperparameters (e.g., warmup, epochs, learning rate).