Evaluating Factuality in Generation with Dependency-level Entailment

Despite significant progress in text generation models, a serious limitation is their tendency to produce text that is factually inconsistent with information in the input. Recent work has studied whether textual entailment systems can be used to identify factual errors; however, these sentence-level entailment models are trained to solve a different problem than generation filtering and they do not localize which part of a generation is non-factual. In this paper, we propose a new formulation of entailment that decomposes it at the level of dependency arcs. Rather than focusing on aggregate decisions, we instead ask whether the semantic relationship manifested by individual dependency arcs in the generated output is supported by the input. Human judgments on this task are difficult to obtain; we therefore propose a method to automatically create data based on existing entailment or paraphrase corpora. Experiments show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods or those based on question generation, while additionally localizing the erroneous parts of the generation.


Introduction
The rise of pre-trained language models (Devlin et al., 2019; Radford et al., 2019) has led to strong text generation models for applications including summarization (Dong et al., 2019), paraphrasing (Goyal and Durrett, 2020; Shen et al., 2020), story generation (Mao et al., 2019), and data augmentation (Zhang and Bansal, 2019). However, while these models generate fluent and grammatical text, they are prone to making factual errors that contradict the input text (Cao et al., 2018).¹ Automatic metrics used to evaluate text generation, such as ROUGE and BERTScore (Zhang et al., 2020), do not correlate with the factual consistency or faithfulness of the generated text (Falke et al., 2019; Kryściński et al., 2019). To address this, recent work has studied the use of textual entailment models to rank and filter non-factual generations (Falke et al., 2019; Maynez et al., 2020). However, these models suffer from issues such as dataset biases (Gururangan et al., 2018; Zhou and Bansal, 2020) and a mismatch between the training data (entailment) and the test data (model generations).

¹Data and code available at https://github.com/tagoyal/dae-factuality
In this paper, we propose to decompose entailment decisions in a sentence to evaluate the faithfulness of generated text in a more fine-grained way. Rather than making a sentence-level entailment decision, we instead evaluate the entailment of dependency arcs of the generated sentence, as illustrated in Figure 1. This approach views dependency arcs as semantic units that can be interpreted in isolation. Each arc is therefore judged independently based on whether the relation it implies is entailed by the source sentence. This is helpful in localizing generation errors and consequently providing more interpretable model decisions.
Decomposing the factuality evaluation over components of structured representations can also be extended to other formalisms like AMR (Banarescu et al., 2013), UDS (White et al., 2016), and more. The chief advantage of dependency parsing over these is that pre-existing tools for dependency parsing report very high performance. Another line of work focuses on question answering-based semantic representations (FitzGerald et al., 2018) or on generating freeform questions to capture factuality (Wang et al., 2020; Durmus et al., 2020). However, these systems require a separate question generation step at inference time, so generation is baked into their formalisms in a heavyweight way, whereas we only require dependencies. A final advantage of our approach is that dependency arcs can be produced in an online fashion during inference, and hence factuality can be enforced incrementally.
We evaluate our proposed dependency arc entailment approach in both summarization and paraphrase settings. In both settings, we show that we can automatically derive labels from actual generation data rather than rely on human annotation of dependency arc entailment, which is challenging to collect at scale. Nevertheless, our results show that our system's performance on factuality classification surpasses both sentence-level entailment and question generation and answering models. Our derived labels from actual generation data provide much better task-specific supervision compared to general entailment datasets. Finally, we demonstrate that predicted entailment scores for individual dependency arcs are meaningful and can be leveraged to understand and localize errors in system generations.

Dependency Arc Entailment (DAE)
Defining arc entailment Our notion of entailment starts by assuming a rough correspondence between predicates and arguments in two sentences. In natural language inference (NLI) annotation efforts, this has taken the form of anchoring judgments in an underlying imagined scene (Bowman et al., 2015). We make a similar assumption: events and actors in the source and target sentences are in correspondence unless there is direct evidence to the contrary. For instance, in Figure 1, the military coup in the target sentence and its corresponding amod(coup→military) arc should be evaluated with respect to the military takeover in the source, giving coreference of the two the benefit of the doubt.
With this assumption, we say that a dependency arc in the target sentence is entailed by the source if the semantic relationship it implies between its head and child is entailed by the source sentence. There is precedent for such a syntax-semantics correspondence: certain formalisms like meaning-text theory (Mel'čuk, 1988) have historically made this mapping more or less explicit. Consider the first hypothesis in Figure 1. Many of the arcs here either contain information analogous to that in semantic roles, or they specify nominal modifiers capturing important entity properties. In our implementation, we exclude certain arc types that are not strongly tied to semantics, such as arcs involving punctuation; see the Appendix for details. Note that our method does not support commenting on arcs of the input that do not exist in the output; we discuss this later in Section 7.2.
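Concretely, the set of arcs d(h) judged for entailment can be thought of as (head, child, label) triples with semantically weak arc types filtered out. The sketch below is purely illustrative: the parse is hand-written (the actual implementation uses a dependency parser), and only punctuation is excluded here as an example.

```python
# Semantically weak arc types excluded from judgment; punctuation is the
# example the paper gives, the full list is in its Appendix.
EXCLUDED_LABELS = {"punct"}

def d(parse):
    """Return the set of arcs judged for DAE, dropping excluded arc types."""
    return {(head, child, lbl) for (head, child, lbl) in parse
            if lbl not in EXCLUDED_LABELS}

# Hand-constructed toy parse of a hypothesis sentence.
hypothesis_parse = [
    ("coup", "military", "amod"),
    ("condemned", "coup", "obj"),
    ("condemned", "leaders", "nsubj"),
    ("condemned", ".", "punct"),
]
arcs = d(hypothesis_parse)
# The punct arc is dropped; the three semantic arcs remain to be judged.
```

Each remaining triple is then classified independently against the source sentence.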
Our view of entailment is broadly consistent with that of NLI settings (Bowman et al., 2015; Williams et al., 2018): if a hypothesis is entailed under the NLI definition, then all dependency arcs of the hypothesis must be entailed under our DAE definition. However, in our formulation, arc entailment is a two-class classification task with labels ∈ {entailed, non-entailed}. This means that arcs that would be neutral or contradiction in the generic entailment formulation are considered non-entailed in our scenario.
Annotating arc entailment To model this formulation, we require entailment annotations at the dependency arc level. However, there are several challenges associated with human annotation of arc entailment data. (1) Entailment is not truly a binary decision and is inherently subjective (Pavlick and Kwiatkowski, 2019). (2) Entailment of an arc may be fundamentally unknowable or undefined if, for example, too much of the context has changed for such a judgment to make sense. (3) Annotators would need to understand the meaning of dependency labels and be able to isolate the semantics of individual arcs in sentences.

Therefore, in this work, we take another approach: we automatically label data from existing sources and outputs of generation models, which lets us collect large-scale data in a variety of domains. Specifically, we use paraphrase data to construct our dataset. Note, however, that there is a fundamental difference between paraphrase pairs, which should be entailed in both directions, and past NLI data, which is forward-entailed by definition.
For instance, the premise and hypothesis in Table 1 would classically be judged as entailed because political representative is a hypernym of prime minister, but the hypothesis is not a paraphrase of (even part of) the premise.
As a result, our automatically-derived dataset captures a more restricted notion of entailment, primarily consisting of entailment relations that are symmetric in nature: arcs in the target sentence entailed by a source sentence also entail some part of the source. However, this is actually closer to what is acceptable for generation models to produce in tasks such as summarization, and the dataset collected in such a manner is useful for downstream tasks, as we show in Section 6. Moreover, because our training and evaluation data will typically come from closely related sentences, we can sidestep the cases where judgments in our formalism become most difficult to define.

Model
Let x be the input context, h be a hypothesis produced by a generation model G, and d(h) be the set of arcs in the dependency parse of h. We want to predict the entailment decision for each arc a ∈ d(h) with respect to the input x, i.e., F_a(a, x).
The overall model architecture of this dependency arc entailment model F_a is outlined in Figure 2. First, we concatenate the input and the hypothesis. We use a pre-trained encoder model E to obtain contextual representations for each token in the concatenated sequence. From these token-level representations, we derive a representation for each dependency arc a ∈ d(h) by concatenation:

r_a = [e_{a_h} ; e_{a_c} ; e_{a_d}]

as shown in the inset in the figure. Here, a_h and a_c are the token indices corresponding to the head word and the child word of dependency arc a, with encoder representations e_{a_h} and e_{a_c}, and a_d is the corresponding dependency label, which is also embedded with E (non-contextually) to give e_{a_d}.
Next, these arc representations are passed through a linear layer, followed by a softmax, to obtain entailment label probabilities for each arc: p(y | a, x) = softmax(W r_a).
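The classification head can be sketched in a few lines of pure Python. This is purely illustrative: the real model uses contextual encoder representations (BERT/ELECTRA), whereas all vectors and weights below are made-up two-dimensional toys.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def arc_representation(tok_reprs, label_emb, a_head, a_child, a_label):
    """r_a = [e_head ; e_child ; e_label]: concatenation of the head-token,
    child-token, and dependency-label vectors."""
    return tok_reprs[a_head] + tok_reprs[a_child] + label_emb[a_label]

def arc_probs(r_a, W):
    """p(y | a, x) = softmax(W r_a); the two rows of W correspond to
    {entailed, non-entailed}."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, r_a)) for row in W]
    return softmax(logits)

# Toy inputs (invented for illustration).
tok_reprs = {3: [0.2, -0.1], 5: [0.4, 0.3]}   # token index -> encoder vector
label_emb = {"nsubj": [0.1, 0.0]}             # dependency label -> embedding
W = [[0.5, -0.2, 0.1, 0.0, 0.3, 0.1],         # 2 x 6 classifier weights
     [-0.5, 0.2, -0.1, 0.0, -0.3, -0.1]]

r_a = arc_representation(tok_reprs, label_emb, a_head=3, a_child=5, a_label="nsubj")
p_entailed, p_non_entailed = arc_probs(r_a, W)
```

In the actual model, the token vectors come from the pre-trained encoder over the concatenated (input, hypothesis) sequence, and W is learned.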
This DAE network is trained using standard binary cross-entropy loss and requires supervision on the arc entailment labels y* ∈ {entailed, non-entailed} for the dependency arcs. Examples do not need to be fully labeled; training can use partial sets of annotations of arcs in d(h). However, when using the DAE model in downstream tasks such as hypothesis reranking, entailment decisions are required for all arcs in the candidate hypothesis.
Sentence-level factuality from dependency-level judgments We want to evaluate the factual consistency of each hypothesis h with respect to input x, i.e., F(h, x). This is computed by pooling arc-level entailment scores over the dependency arc set d(h) of the generated hypothesis:

F(h, x) = (1 / |d(h)|) Σ_{a ∈ d(h)} p(y = entailed | a, x)

We use the sentence-level score F(h, x) to rerank the generated hypotheses in Section 6.³
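The pooling and reranking steps can be sketched as follows. The per-arc probabilities are invented; mean-pooling follows the paper's note that min-pooling (which the strict DAE definition would suggest) was unstable in practice.

```python
def sentence_score(arc_entailed_probs):
    """F(h, x): mean over arcs a in d(h) of p(entailed | a, x).
    Mean-pooling is used because min-pooling proved unstable."""
    return sum(arc_entailed_probs) / len(arc_entailed_probs)

# Reranking two candidate hypotheses by sentence-level score
# (per-arc entailment probabilities here are made up).
candidates = {
    "h1": [0.9, 0.8, 0.95],
    "h2": [0.9, 0.2, 0.95],   # one clearly non-entailed arc drags h2 down
}
best = max(candidates, key=lambda h: sentence_score(candidates[h]))
# → "h1"
```

A single bad arc lowers the mean, so hypotheses with localized factual errors are ranked below clean ones.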

Automatic Dataset Creation
We now describe our method for automatically collecting dependency-level DAE annotations from paraphrase or entailment corpora, avoiding manual annotation. In this creation process, we want data featuring a range of paraphrasing phenomena, such as passivization, clausal reordering, and synonym replacement. Furthermore, we want a natural distribution of the errors produced by generation models, such as a wrong subject or object for a verb, or hallucination of new content.
We represent a single example in our dataset as a tuple ⟨x, h, {(a_i, y*_i) : a_i ∈ d(h)}⟩. Here, x is the input, h is the hypothesis, a_i denotes a single dependency arc in the hypothesis, and y*_i is the gold entailment label for that arc. To construct data of this form, we assume access to a paraphrase dataset D containing pairs (x, h*) of input sentences and their corresponding gold paraphrases.⁴ Additionally, we employ a paraphrase generation model G_p, which can output k candidate paraphrases {h_1, h_2, ..., h_k} given an input x. These noisy paraphrases will be used to derive realistic examples of generation errors to contrast with gold paraphrases, using the following techniques.
Positive labels from gold paraphrases Given a ground-truth paraphrase, we assume that every arc in the target side of the paraphrase h* is entailed by the source side x. This is in line with our definition of arc entailment in Section 2 and allows us to propagate sentence-level paraphrase judgements to arc-level entailment judgements. Because paraphrase datasets feature diverse linguistic phenomena, this approach yields a range of positive examples. However, as described in Section 2, it is less likely to include arcs which are forward-entailed only (e.g., Table 1).

³According to the DAE definition, an output is non-factual if any of its arcs is non-entailed. However, min-pooling was very unstable, so we instead use mean-pooling in our experiments.

⁴The paraphrase corpora we use in this work may come from automatic methods like backtranslation; however, we still assume that these are reliable gold-standard paraphrases.
Auto-derived labels from model generations To find negative examples for entailment, we leverage the output generations {h_1, h_2, ..., h_k} of an automatic paraphrase model G_p. These generations will include unseen arcs, which may be either entailed or non-entailed.
Our key assumption here is that the outputs at the top of the beam are more likely to be factually correct, whereas outputs at the bottom of the beam are of lower quality and more prone to having factual inconsistencies. We assume that new arcs introduced in bad model generations (i.e., bottom-most generations of the beam) are not entailed by the input.
We can then noisily label the generated paraphrases with a mix of positive (entailed) and negative (non-entailed) labels. We first construct a set of entailed dependency arcs: all dependency arcs of the input and the gold paraphrase, i.e., d(x) ∪ d(h*). Next, we annotate each dependency arc a of the bottom-most generations of the beam, say {h_k, h_{k−1}, h_{k−2}}, in the following way:

y*(a) = entailed,      if a ∈ d(x) ∪ d(h*)
        unannotated,   if a ∈ d(h_1) \ (d(x) ∪ d(h*))
        non-entailed,  otherwise

The middle case leaves arcs that are in h_1 but not in x or h* unannotated. Such arcs are possibly factual under the model, coming from a high-quality generated output, but we do not have enough confidence to assign them a label. During model training, such unannotated arcs are ignored. Finally, we also include the positive arcs from the 1-best hypothesis in our DAE data: ⟨x, h_1, {(a, entailed) : a ∈ d(h_1) ∩ (d(x) ∪ d(h*))}⟩. This provides another source of hypothesis sentences with a slightly different distribution during model training.
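The labeling rule for arcs of a bottom-of-beam generation can be written as a small function; the arc triples below are invented for illustration.

```python
def label_arc(a, d_x, d_hstar, d_h1):
    """Noisy label for an arc of a bottom-of-beam generation.
    d_x: arcs of the input; d_hstar: arcs of the gold paraphrase;
    d_h1: arcs of the 1-best generation."""
    if a in d_x or a in d_hstar:
        return "entailed"
    if a in d_h1:
        return None  # unannotated: plausible but uncertain; ignored in training
    return "non-entailed"

# Toy arc sets (made up).
d_x = {("said", "he", "nsubj")}
d_hstar = {("stated", "he", "nsubj")}
d_h1 = {("said", "he", "nsubj"), ("said", "today", "obl")}

lab1 = label_arc(("stated", "he", "nsubj"), d_x, d_hstar, d_h1)  # in gold -> entailed
lab2 = label_arc(("said", "today", "obl"), d_x, d_hstar, d_h1)   # only in 1-best -> None
lab3 = label_arc(("said", "she", "nsubj"), d_x, d_hstar, d_h1)   # novel -> non-entailed
```

Arcs seen in the input or gold paraphrase are trusted as positives, arcs appearing only in the 1-best output are left unlabeled, and all remaining novel arcs are treated as negatives.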

Intrinsic Evaluation of DAE
Our experimental evaluation focuses on the following questions: (1) Does the automatic data collection result in a high-quality training dataset? (2) Is the DAE model we train a good classifier? (3) Does DAE allow us to filter model generations and improve the reliability of a generation system?
We construct our DAE training dataset using the methodology defined in Section 4. For this, we leverage the paraphrase pair dataset PARANMT-50M (Wieting and Gimpel, 2018) as the base dataset D. We use the transformer-based encoder-decoder model for paraphrase generation from Goyal and Durrett (2020) as G_p. We use the paraphrase model to generate 10 outputs for 20k sentence pairs from D. We use the Stanford CoreNLP library to extract enhanced dependencies from the output sentences. Then, using the strategy outlined in Section 4, we generate 100k training samples (sentence pairs), 3k dev samples, and 3k test samples. From this dataset, we derive 520k training, 14k dev, and 22k test dependency-level annotations, which we evaluate on in Section 5.2. The ratio of entailed to non-entailed arcs is roughly 70:30 in this dataset.

Dataset Quality
Before evaluating our modeling approach, we first evaluate whether the arc annotations in the training data follow the theoretical definition of DAE outlined in Section 2. Figure 3 showcases examples from the dev set corresponding to the same input example. We show positive entailed arcs (in green), negative non-entailed arcs (in red), and one unlabeled arc (in gray). Here, we can see that the gold paraphrase is important: it provides examples of valid synonym replacements, as well as other rephrasings of the input sentence. For negative examples, the generations from the bottom of the beam do indeed contain bad output and non-entailed arcs.

Agreement with human labels Next, we want to evaluate the quality of the auto-derived dataset by measuring its agreement with human annotations. For this, we manually annotated the dependency arc entailment labels for 100 sentence pairs from the dev set (20 gold paraphrases and 80 generated paraphrases), according to our theoretical definition. We compared these manual (gold) annotations with the auto-derived annotations for this set and observed that the two agreed 82% of the time. This indicates that the automatic annotation strategy from Section 4 results in a high-quality dataset. Further investigation into the disagreements between the manual and automatic labels revealed that false negatives included paraphrasing phenomena like synonym replacement and anaphora resolution during reordering; we describe how to produce additional data to handle some of these cases later. False positives, on the other hand, mainly consisted of exact arc matches in incorrect contexts.

Next, we intrinsically evaluate the performance of the dependency arc entailment model, outlined in Section 3, on held-out data from our automatic labeling method.
We test the performance of two pre-trained models: BERT (bert-base-uncased, 110M parameters) (Devlin et al., 2019) and ELECTRA (electra-base-discriminator, 110M parameters) (Clark et al., 2020). We compare these models against a majority-label baseline (entailed) and a lexical-match baseline that predicts y = entailed if the arc (head, child, and label) in the output also constitutes a dependency arc in the input, and non-entailed otherwise.
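The lexical-match baseline reduces to an exact set-membership test over arc triples; the toy arcs below are invented.

```python
def lexical_match_baseline(arc, input_arcs):
    """Predict entailed iff the exact (head, child, label) triple
    also appears among the input's dependency arcs."""
    return "entailed" if arc in input_arcs else "non-entailed"

# Made-up input arcs.
input_arcs = {("won", "team", "nsubj"), ("won", "game", "obj")}

pred_match = lexical_match_baseline(("won", "team", "nsubj"), input_arcs)
# → "entailed"
pred_new = lexical_match_baseline(("lost", "team", "nsubj"), input_arcs)
# → "non-entailed"
```

This baseline cannot credit valid rewritings (synonyms, reorderings that change heads), which is exactly where the learned DAE models gain over it.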

Intrinsic Evaluation: DAE Classification
The performance of the different models is outlined in Table 2. Our pre-trained transformer models perform substantially better than the baselines, with BERT achieving 86.9% accuracy and ELECTRA 88.4%. These models also outperform the lexical-match baseline, showing that the DAE models learn to do more than simple dependency arc matching. Henceforth, we use the best-performing ELECTRA model in all our experiments.

Other Data Generation Methods
Besides the data generation procedure we proposed, there are other ways to synthetically generate noisy annotations for premise-hypothesis pairs (Zhang et al.). We investigate these from two perspectives: first, does our data generation process cover these phenomena well, and second, can these additional sources of data prove useful?
First, we explore a word-swapping technique similar to Zhang et al. Given a premise x, we form a hypothesis x′ by randomly swapping tokens that share a common part-of-speech tag to introduce errors. The arcs in the intersection of d(x) and the modified sentence's d(x′) are annotated as positive (y = entailed), whereas the newly created or changed arcs are annotated as negative (y = non-entailed).
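A minimal sketch of this corruption step, assuming tokens and POS tags are already available (the tags below are hand-assigned; a real pipeline would use a tagger):

```python
import random

def word_swap(tokens, pos_tags, rng):
    """Corrupt a sentence by swapping two tokens that share a POS tag."""
    by_pos = {}
    for i, tag in enumerate(pos_tags):
        by_pos.setdefault(tag, []).append(i)
    # Only tags with at least two occurrences can be swapped.
    swappable = [idxs for idxs in by_pos.values() if len(idxs) >= 2]
    i, j = rng.sample(rng.choice(swappable), 2)
    corrupted = list(tokens)
    corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted

rng = random.Random(0)
tokens = ["the", "dog", "chased", "a", "cat"]
pos = ["DT", "NN", "VBD", "DT", "NN"]
corrupted = word_swap(tokens, pos, rng)
# Arcs shared between d(x) and d(x') would then be labeled entailed;
# newly created or changed arcs would be labeled non-entailed.
```

Because the swap preserves the bag of words, the resulting negatives specifically stress relational errors (wrong subject/object) rather than lexical hallucination.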
Our synonym data is noisily constructed in the same manner as the gold paraphrase data, but targets synonym replacements specifically. We extract pairs (x, h*) from D that generally maintain similar sentence structure between the two sentences,⁵ but with small lexical changes like synonym replacement. We assign a positive entailment label to all arcs: ⟨x, h*, {(a, entailed) : a ∈ d(h*)}⟩.
To construct data with hallucinations, we take an input sentence x as the hypothesis and remove a randomly sampled span of contiguous tokens from it to derive a premise sentence x′. Then, the following DAE annotations are derived: ⟨x′, x, {(a, non-entailed) : a ∈ d(x) \ d(x′)}⟩. Additionally, for each input sentence x, we extract another sentence x′′ with the highest 1-gram overlap in the dataset and derive annotations analogously.

Table 3 shows a comparison of word-swapping with our method (AD), and variants of our method augmented with synonym and hallucination data. Note that the model trained on word-swapped data performs well on a similarly constructed held-out set, but not on the test data for synonym data and auto-derived data. This indicates that artificially constructed data with rule-based error introduction does not cover the space of generation possibilities. On the other hand, the model trained on our auto-derived dataset performs well across both artificial and actual generation data, thereby covering a larger range of entailment possibilities. Additional augmentation of synonym- and hallucination-specific data improves performance further on the respective test sets while retaining performance on generic entailment data. Henceforth, we use the (AD + S) model for our experiments.
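The hallucination construction above can be sketched as follows; the sentence and arc sets are invented, and arcs would come from a parser in practice.

```python
import random

def truncate_premise(tokens, rng, span_len=2):
    """Delete a random contiguous span to form a shorter premise x'."""
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + tokens[start + span_len:]

def hallucination_labels(d_hyp, d_premise):
    """Hypothesis arcs absent from the truncated premise are labeled
    non-entailed: relative to x', the hypothesis x 'hallucinates' them."""
    return {a: "non-entailed" for a in d_hyp - d_premise}

rng = random.Random(0)
tokens = ["he", "goes", "jogging", "in", "the", "morning"]
premise = truncate_premise(tokens, rng)

# Toy arc sets for the full sentence (hypothesis) and truncated premise.
d_hyp = {("goes", "he", "nsubj"), ("goes", "jogging", "xcomp")}
d_prem = {("goes", "he", "nsubj")}
labels = hallucination_labels(d_hyp, d_prem)
```

Deleting content from the premise rather than adding content to the hypothesis yields hallucination-style negatives without needing a generation model.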

Extrinsic Evaluation: Filtering Bad Generations
Moving beyond the dependency-level inference task, we now want to evaluate the sentence-level performance of our model formulation. Namely, can it usefully reject erroneous generations produced by models for summarization (Section 6.1) and paraphrasing (Section 6.2)?

Summary Ranking
We perform our evaluation on an abstractive summarization test dataset introduced in Falke et al.
(2019) and used in other previous work. It contains 373 test samples, each consisting of an input source sentence from CNN/DM and two summary sentences covering the same content, generated using the model from Chen and Bansal (2018). One of the summary sentences is factually correct and the other is factually incorrect. The evaluation protocol measures how often the correct summary is scored higher than the incorrect summary under each candidate scoring technique.

Table 4: Performance of the different models at the summary reranking task. The human baseline is reported in Falke et al. (2019). The proposed DAE model performs on par with or better than prior work and comes close to human performance.

We compare against the following baselines:

1. Sentence-level NLI: entailment models applied at the sentence level to score each summary, following Falke et al. (2019).
2. Question Generation and Answering: Wang et al. (2020) propose an automatic evaluation metric QAGS that scores each summary by first generating questions pertaining to the summary content, and then comparing the answers to those questions in both the source sentence and the generated summary.
3. Rule-based: We score each summary sentence as the fraction of its dependency arcs that are common with the input sentence. In case both the correct and the incorrect sentence get the same score, we break ties randomly.

Table 4 outlines the performance of the different models. The results show that the dependency arc entailment model outperforms the sentence-level NLI models as well as the question generation and answering formulation (QAGS). Furthermore, the performance of our DAE model is close to the human performance reported in Falke et al. (2019). Interestingly, the rule-based dependency baseline also outperforms certain NLI models and QAGS, indicating that these more complex models may fail to capture straightforward lexical relations.
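The rule-based baseline and the pairwise reranking protocol can be sketched as follows (toy arc sets; ties are broken randomly, as in the protocol):

```python
import random

def rule_based_score(output_arcs, input_arcs):
    """Fraction of the output's dependency arcs also present in the input."""
    if not output_arcs:
        return 0.0
    return len(output_arcs & input_arcs) / len(output_arcs)

def reranking_accuracy(pairs, score, rng):
    """pairs: (correct_arcs, incorrect_arcs, input_arcs) triples.
    Counts how often the correct summary outscores the incorrect one;
    ties are broken randomly."""
    wins = 0
    for good, bad, src in pairs:
        s_good, s_bad = score(good, src), score(bad, src)
        if s_good > s_bad or (s_good == s_bad and rng.random() < 0.5):
            wins += 1
    return wins / len(pairs)

# Invented example: the incorrect summary flips subject and object.
src = {("beat", "team", "nsubj"), ("beat", "rivals", "obj"), ("beat", "Sunday", "obl")}
good = {("beat", "team", "nsubj"), ("beat", "rivals", "obj")}
bad = {("beat", "rivals", "nsubj"), ("beat", "team", "obj")}

acc = reranking_accuracy([(good, bad, src)], rule_based_score, random.Random(0))
# good scores 1.0, bad scores 0.0, so the correct summary wins.
```

The learned DAE model plugs into the same protocol by replacing rule_based_score with the mean of per-arc entailment probabilities.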
During our experimentation, we observed large variance in the performance of the NLI models on the reranking task relative to their performance on the intrinsic entailment task. To illustrate this, in Figure 4 we plot the summarization reranking performance of the two models against the intrinsic task performance at different stages of training. For DAE, the intrinsic task performance is the dependency-level entailment classification accuracy; for the MNLI model, we report classification accuracy on the sentence-level MNLI entailment task.⁶

⁶We use the fine-tuned RoBERTa model released by AllenNLP (https://allennlp.org/), improving on the results from Falke et al. (2019); details in the Appendix.

Figure 4: Performance of the ELECTRA-based MNLI model and the DAE model. The figure shows much higher variance in reranking accuracy for the MNLI model, suggesting that intrinsic task performance is not correlated with reranking performance.

The graph shows high variance in summary reranking performance despite a steady increase in MNLI intrinsic task performance across training steps.⁷ This indicates that the general entailment task solves a fundamentally different problem than factuality. By contrast, the DAE model's performance on the summarization reranking task is more stable.

Paraphrase Ranking
Next, we evaluate the DAE model in the paraphrasing setting. To do this, we first create a test set similar to the summarization test set from Falke et al. (2019). We use the transformer-based seq2seq model (Goyal and Durrett, 2020) to obtain 10 candidate paraphrases for 100 input sentences from the PARANMT-50M dataset. We manually assign a label y ∈ {factual, non-factual} to each (input, candidate) pair. Then, for each input sentence, we randomly select one correct and one incorrect paraphrase. These sentence triplets are used for our reranking experiments.

Model             Reranking Acc
MNLI (ELECTRA)    79.0
DAE (ELECTRA)     79.0

Table 5: Performance on the paraphrase reranking task. The DAE model performs on par with the NLI-based model.

Table 5 compares the performance of the MNLI-based model and the DAE model. Both are ELECTRA-based models, which were shown to be the best-performing models in the previous sections. The results show that in this setting, the MNLI model and the DAE model perform similarly. Closer inspection of this data revealed that our model is biased towards predicting entailed for arcs that are common with the input and non-entailed for arcs that are not present in the input, possibly because we are evaluating the same generation model that was used to produce our arc entailment training data. This poses a somewhat adversarial setting for our DAE model.

⁷Note that the best performance of the MNLI model on summary reranking is better than the best performance of the DAE model; however, it did not coincide with the best intrinsic-task performance for our particular hyperparameter choice.

Analysis

Dependency-level vs. Sentence-level Modeling

Although our DAE model has shown strong performance, we have not yet performed a direct apples-to-apples comparison of DAE versus a sentence-level model when trained on the same sentences.
For MNLI We construct DAE data from the sentence-level entailment data as follows. First, we extract 10k examples from the MNLI data which have the label entailment; this is our source data D′. We use a paraphrase model (the transformer seq2seq model of Goyal and Durrett (2020)) and the technique outlined in Section 4 to extract auto-derived labels from D′. This gives us 42k training examples for the DAE model. We compare against an MNLI model trained on the original sentence-level entailment task with the same number of examples (42k).
For PARANMT For this dataset, we do not have negative (y = contradiction) annotations at the sentence level. We derive these from our DAE training data as follows: we consider all pairs (x, h*) in the original dataset as positive (y = 1), in addition to any pair of the form (x, x). We treat the three generated sentences at the bottom of the beam as negative, meaning that the model is trained to distinguish gold paraphrases from model generations.

Table 6 outlines these results. For the paraphrase dataset, we see that the artificially constructed sentence-level dataset does not yield a good sentence-level discriminator, whereas our dependency-level annotations form an effective training set. The results on MNLI show that our dependency-level formulation performs better than the sentence-level one when trained on the same amount of data, and is therefore more closely related to the factuality task than past entailment formulations.

Qualitative Evaluation
Error Localization Since the DAE formulation computes individual entailment scores for all arcs in the dependency tree structure, it is possible to localize errors in the generated summary or paraphrase.
We present examples of input sentences, generated text, and arc entailment scores for a few examples in Figure 5. For each input and output pair, we show the individual scores for the dependency arcs in the output sentence. Additionally, we report the MNLI score for the same example.
The illustrative examples show that the DAE model is capable of localizing errors where erroneous subject-object pairs were constructed, even when these are the only errors. These errors are tougher for the MNLI model to catch, which evaluates the whole sentence and is prone to lexical overlap biases (Zhou and Bansal, 2020). In the third example, from our paraphrase setting, we see that the model is able to recognize synonym replacement as a valid re-writing. However, for the last example, the model cannot perform this same judgement for the variations → changes replacement. Although, note that the model scores it higher than a erroneous replacement of the same word (variations → latter). This shows that the DAE model is able to rank sentences that incorporate the similar type of re-writing/editing. However, we observed that the model has different error rates for different types of re-writing changes. For example, it is better at identifying text hallucination, or cases where the subject object relation between words change, but has comparatively lesser accuracy over changes such as synonym replacements. Therefore, it may not be ideal for settings where different re-writing types need to be compared.
Limitations We comment on a few limitations of our approach. First, arcs in our dependency-based formalism are not marked with negation or quantification; these must be handled via the contextualization of the hypothesis sentence rather than in the semantic representation. Second, our method cannot detect when the generation omits content from the input. For instance, consider the premise In the morning, he goes jogging and the hypothesis In the morning. Here, the hypothesis omits critical information from the source sentence; however, since all of its arcs are entailed by the source, our model would assign it a high score. Furthermore, our model is trained on the PARANMT-50M dataset, which is itself constructed through a noisy backtranslation process, so we rely on noisy gold data when constructing our model. We believe that higher-quality paraphrase pairs would yield a better model.
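The omission limitation can be made concrete with a toy check. The arc tuples below are made up for illustration (not real CoreNLP output), and arc-level entailment is approximated by exact arc matching; the point is only that a hypothesis whose arcs are a subset of the premise's arcs passes an arc-by-arc check regardless of what it leaves out.

```python
def all_arcs_supported(hyp_arcs, premise_arcs):
    """Toy stand-in for the DAE model: an arc is 'entailed' iff it
    literally appears among the premise's arcs."""
    return all(arc in premise_arcs for arc in hyp_arcs)

# Premise: "In the morning, he goes jogging" (hypothetical simplified parse)
premise_arcs = {("goes", "nsubj", "he"),
                ("goes", "xcomp", "jogging"),
                ("goes", "obl", "morning")}

# Hypothesis: "In the morning" -- drops the main clause entirely,
# yet every arc it does contain is supported.
hyp_arcs = {("goes", "obl", "morning")}
```
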

Related Work
Recent work on the faithfulness of text generation can be broadly divided into three groups: structured-information-based methods, multi-task formulations, and post-processing methods. The first group leverages structured knowledge, such as Open IE triples (Cao et al., 2018; Goodrich et al., 2019), dependency trees (Song et al., 2018), or generated semantic roles (Fan et al., 2018), as additional input for generation. However, incorporating these as additional embeddings in model architectures does not explain how they influence model generations. The second group trains the generation model jointly with other factuality-related tasks, such as NLI entailment and question generation (Guo et al., 2018). Other work additionally incorporates a reward for generating summaries entailed by the input. Our approach can be used to rank or filter outputs from any generation model in a black-box way, without additional augmentation or retraining.
Among post-processing approaches, recent work has explored NLI-based post-generation filtering or ranking of output summaries (Falke et al., 2019; Maynez et al., 2020). Our dependency-level models perform on par with these approaches, while additionally localizing the errors in the generations. Other work (Durmus et al., 2020; Wang et al., 2020) uses question generation and answering to reveal factual inconsistencies in generated text. However, more work is needed to make these approaches reliable and broad-coverage, as they primarily focus on specific factors such as noun phrases and named entities.

Conclusion
In this work, we propose the dependency arc entailment formulation to identify factual errors in generated text in a more fine-grained manner. We show that the proposed formulation outperforms past approaches, while additionally providing an interpretable error analysis.

A Dependency Label Set
As outlined in Section 2, we restrict our analysis to a subset of dependency arcs that are more strongly connected to semantics. For each hypothesis h corresponding to an input x, we extract Enhanced Dependencies using the Stanford CoreNLP tool and assign entailment labels to the dependency arc set d(h) using the strategy outlined in Section 4. We exclude the following arcs from our analysis: punct, det, case, aux, auxpass, dep, cop, mark. This same subset of arcs is ignored when computing sentence-level factuality.
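A minimal sketch of this filtering step, assuming arcs are represented as (head, label, dependent) triples; the subtype handling (stripping anything after ":" so that enhanced labels such as nmod:in keep their base relation) is our assumption about the representation, not a detail stated in the paper.

```python
# Arc labels excluded from the DAE analysis, per the appendix.
EXCLUDED_LABELS = {"punct", "det", "case", "aux",
                   "auxpass", "dep", "cop", "mark"}

def semantic_arcs(arcs):
    """Keep only arcs whose base relation is not in the excluded set.

    arcs: iterable of (head, label, dependent) triples, where label may
    carry an enhanced-dependency subtype like 'nmod:in'.
    """
    return [(h, lbl, d) for h, lbl, d in arcs
            if lbl.split(":")[0] not in EXCLUDED_LABELS]
```
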

B Examples from Synonym Test Set
As outlined in Section 5.2, we additionally augment our auto-derived training data (AD) with synonym data (S) and show that this improves the model performance on the held out synonym only test set. Figure 6 provides some examples showing the predicted entailment probability for each arc using this augmented training data. The predictions show that our model learns some bias to recognize synonym replacements and small phrasal substitutions as arcs that are entailed by the input.
Input: you'd be a great inspiration to your fellow warriors. Output: you'd be a great inspiration to the other soldiers.
Input: Turkey is a complicated country, with multiple dilemmas. Output: Turkey is a complex country with many dilemmas.
Figure 6: Examples from the held-out synonym dataset, with arc entailment probabilities assigned by the (AD + S) model.

C Implementation Details
To train our DAE model, we fine-tune the pre-trained encoders BERT (bert-base-uncased, 110M parameters) and ELECTRA (electra-base-discriminator, 110M parameters), as outlined in Section 5. We perform 5 hyperparameter trials, varying only the learning rate via manual tuning, and use the models with the best dev set accuracy. Additionally, we fine-tune BERT (bert-base-uncased, 110M parameters) and ELECTRA (electra-base-discriminator, 110M parameters) models on the MNLI dataset, performing 3 hyperparameter trials that again vary only the learning rate via manual tuning; the models with the best dev set accuracy are used. The final hyperparameters are shown in Table 8. We obtain dev accuracies of 84.5% and 89.0% for the BERT and ELECTRA models, respectively.